Say It Like You Mean It: Microsoft Research Teaching Computers Subtleties of Speech

REDMOND, Wash., Nov. 27, 2002 — Can you imagine having an accurate recording of your conversations, and being able to easily search the recording? You’d never forget a name, a conversation or a phone number again. You wouldn’t have to mumble excuses about a senior moment, or argue over who said what when.

Eric Chang, the manager of the Speech Group at Microsoft Research Asia, imagines that some day we’ll be able to record and store everything we hear onto small devices. Then, we’ll be able to search this audio record and retrieve information by typing a few simple words.

“Our vision is to enhance not only human-computer communication; it’s to enhance human-to-human communication,” says Chang.

Tracking Emotions

One of the missing components in interactions between people and computers is emotion. Most people think of computers as just a collection of ones and zeros mechanically shifting back and forth. Chang’s group wants to warm up our interactions.

“One thing we have done is try to extract emotion from speech,” says Chang. In one of their first experiments, the group identified the emotions conveyed by a voice from a movie soundtrack and coordinated them with the changing expressions on a cartoon face. He shows a demonstration on his laptop. A smiling animated face with the voice of Shirley Temple begins to talk, and soon does the Shirley pout, wrinkles her nose and flashes her dimples as she laughs.

“What’s the benefit of doing this, you might ask,” Chang says. “One thing is that it’s fun to do. But there are practical applications as well. Let’s say your child calls you from school and is hurt and sad, you can tell from his voice. In the future, the program could detect the caller’s emotion and push the message to the top of your e-mail stack.”

Another application Chang sees is a business using emotion recognition when gathering customer feedback. “Imagine you have a call center with hundreds of call center operators,” says Chang. “How do you monitor the quality of their interaction with customers? Right now managers can just listen in and spot check. But with this technology, you can actually monitor the emotion on both sides. And if the customer is getting angrier and angrier, then the manager could spot check that call first, or the manager could intervene and take over.”

To develop the emotional cartoon face, the group had to find out what’s consistent in emotions across different people. “One way is to look for variations in energy — is it monotonous, or is it sharper or higher?” explains Chang. “Another is pitch. We extract all the features from each sentence and then send them to a classifier for analysis.” Microsoft Research has found that sometimes emotions can be mixed, so the group built individual classifiers for anger, happiness, and sadness. As a result, the system can identify a mixture of emotions.
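Chang’s description maps onto a familiar pattern: compute prosodic features for each sentence, then run one detector per emotion so that mixtures are possible. The sketch below is only a minimal illustration of that pattern, with crude energy and pitch estimates and scikit-learn logistic-regression detectors standing in for whatever features and classifiers the group actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EMOTIONS = ["anger", "happiness", "sadness"]

def sentence_features(samples, rate=16000, frame=400, hop=160):
    """Crude per-sentence prosody: statistics of frame energy and pitch.

    Energy is root-mean-square per frame; pitch is a rough autocorrelation
    estimate. These stand in for the energy and pitch variations Chang
    mentions, not the group's actual feature set.
    """
    samples = np.asarray(samples, dtype=float)
    energies, pitches = [], []
    for i in range(0, len(samples) - frame, hop):
        f = samples[i:i + frame]
        energies.append(np.sqrt(np.mean(f ** 2)))
        ac = np.correlate(f, f, mode="full")[len(f):]   # lags 1..frame-1
        lo, hi = rate // 400, rate // 60                # search a 60-400 Hz band
        pitches.append(rate / (lo + np.argmax(ac[lo:hi])))
    e, p = np.array(energies), np.array(pitches)
    # Level and variation of energy and pitch, summarized per sentence.
    return np.array([e.mean(), e.std(), p.mean(), p.std()])

def train_detectors(X, label_sets):
    """One independent detector per emotion, so a sentence can score high
    for several emotions at once (a mixture).
    label_sets: one set of emotion names per training sentence."""
    return {emo: LogisticRegression().fit(
                X, [1 if emo in s else 0 for s in label_sets])
            for emo in EMOTIONS}

def classify(detectors, feats):
    """Return a per-emotion score between 0 and 1 for one sentence."""
    feats = np.asarray(feats).reshape(1, -1)
    return {emo: float(clf.predict_proba(feats)[0, 1])
            for emo, clf in detectors.items()}
```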

Fast Forward

The speech group is also working on the ability to search video files by indexing the words spoken in them. Using a program developed to analyze the audio portion of a video file and convert the speech track to text, they have indexed over 600 talks on the internal Microsoft Technical Education site. Then, if you want to review only the portions of a long talk that interest you, you can type in a keyword or phrase and the program will search the video transcript for every place that phrase appears. It creates a list of the passages, highlighting the phrase and the surrounding text, so you can scan the text and play only the sections you’re interested in.
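The plumbing behind such a search is straightforward once the recognizer has produced time-stamped text. The sketch below assumes a transcript stored as a list of timed segments (an invented layout, not the format Microsoft Research used) and shows how keyword hits can become highlighted snippets paired with playback positions.

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds into the video
    end: float
    text: str      # recognized words for this stretch of audio

def search_transcript(segments, phrase, context=1):
    """Find every segment whose recognized text contains the phrase.

    Returns (start_time, snippet) pairs: the snippet is the matching
    segment plus `context` neighboring segments on each side, with the
    phrase set off by ** markers, so a reader can scan the text and jump
    the player to `start_time` for the hits that look interesting.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for i, seg in enumerate(segments):
        if not pattern.search(seg.text):
            continue
        lo, hi = max(0, i - context), min(len(segments), i + context + 1)
        snippet = " ".join(s.text for s in segments[lo:hi])
        snippet = pattern.sub(lambda m: f"**{m.group(0)}**", snippet)
        hits.append((seg.start, snippet))
    return hits

# Example: play back only the passages that mention "indexing".
talk = [Segment(0.0, 5.2, "welcome to this talk on speech recognition"),
        Segment(5.2, 11.0, "today we cover indexing of recorded lectures"),
        Segment(11.0, 16.4, "first a quick overview of the audio pipeline")]
for start, snippet in search_transcript(talk, "indexing"):
    print(f"{start:7.1f}s  {snippet}")
```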

“Right now the transcription isn’t 100 percent correct,” says Chang. “But the idea is that before we did this, the only way you could search for a topic was by looking at a summary of the video.”

In the future, Chang says he can envision students using the audio feature to very easily go back and replay what a professor was saying at any point during a lecture. “Or imagine learning a foreign language and you want to review the right pronunciation for a word,” adds Chang.

In the past, storage limitations made recording and storing a week’s worth of conversations and lectures impossible, but storage hardware has developed rapidly, and capacity is much less of a challenge for users today.

True Caller ID

Caller identification has had many advocates from its inception as a way to screen out telemarketers and other unwanted phone calls. But it has its limitations, too. What if a trusted friend or family member were calling you from a pay phone and really needed to reach you? You might ignore the call because you didn’t recognize the phone number. Chang’s group has developed an application that identifies the caller by voice.
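The article doesn’t say how the identification works, but one common approach, sketched here purely as an illustration, is to keep a stored “voiceprint” for each known caller and match an incoming call against them by similarity.

```python
import numpy as np

class VoiceCallerID:
    """Toy speaker identification by nearest voiceprint.

    Each known caller is enrolled as the mean of feature vectors taken
    from past calls; an incoming call is attributed to the closest
    enrolled voiceprint by cosine similarity, or to "unknown" if nothing
    is close enough. A generic textbook scheme, not Microsoft Research's
    actual algorithm.
    """
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.voiceprints = {}   # caller name -> mean feature vector

    def enroll(self, name, feature_vectors):
        self.voiceprints[name] = np.mean(np.asarray(feature_vectors, float), axis=0)

    def identify(self, features):
        features = np.asarray(features, dtype=float)
        best_name, best_score = "unknown", self.threshold
        for name, print_vec in self.voiceprints.items():
            score = np.dot(features, print_vec) / (
                np.linalg.norm(features) * np.linalg.norm(print_vec) + 1e-9)
            if score > best_score:
                best_name, best_score = name, score
        return best_name
```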

In addition, the user interface includes a searchable waveform. A user can click on sections of the waveform and play back individual segments. In the future, the text transcription will be coordinated with the waveform so that you could play back exactly the section that you want.
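Coordinating the transcript with the waveform amounts to remembering, for each stretch of recognized text, the sample range it came from. A minimal sketch, again with an invented data layout: clicking a point on the waveform finds the segment under the cursor, and picking a segment in the transcript gives back exactly the samples to play.

```python
# Hypothetical alignment between transcript segments and the waveform.
# Each entry is (start_sample, end_sample, text); a real system would get
# these boundaries from the speech recognizer.
ALIGNMENT = [(0, 48000, "hi this is Eric"),
             (48000, 112000, "calling about the meeting tomorrow"),
             (112000, 180000, "give me a call back when you can")]

def segment_at(alignment, clicked_sample):
    """Waveform click -> the transcript segment under the cursor."""
    for start, end, text in alignment:
        if start <= clicked_sample < end:
            return text
    return None

def samples_for(alignment, index, waveform):
    """Transcript segment -> exactly the audio to play back for it."""
    start, end, _ = alignment[index]
    return waveform[start:end]

print(segment_at(ALIGNMENT, 60000))   # -> "calling about the meeting tomorrow"
```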

Text to Speech

Though text-to-speech (TTS) is a more mature technology than voice recognition, it still has its shortcomings. Most TTS systems sound very stilted and unemotional, a little like Will Robinson’s robot pal in Lost in Space.

Chang’s group has developed a text-to-speech engine that sounds much more natural. It’s difficult to tell their synthesized voice from the voice of a real person.

“This will open up a lot of applications,” says Chang. “It could help a blind person browse the Web and get a much better experience. You can imagine having speech as a new modality for your desktop experience as well. You could use it to read your e-mail or a paper or a page. And it’s all customized. You can choose from a selected set of voices.”

The group has also combined text-to-speech with the image of a person. The face is real, but the mouth movements are synthesized to match the voice.

“You can use this for a virtual agent to read your e-mail or selected news. This makes speech a richer interface,” says Chang.
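Driving a face from synthesized speech typically means mapping each phoneme the TTS engine produces to a mouth shape (a “viseme”) and timing the shapes to the audio. The table and loop below are a toy illustration of that idea, not a description of the group’s system.

```python
# Toy phoneme-to-viseme mapping for lip sync; real systems use far richer
# shape sets and smooth the transitions between them.
PHONEME_TO_VISEME = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip-teeth", "V": "lip-teeth",
    "AA": "open-wide", "IY": "spread", "UW": "rounded",
    "R": "rounded", "S": "spread", "T": "neutral", "sil": "rest",
}

def viseme_track(timed_phonemes):
    """timed_phonemes: list of (phoneme, duration_seconds) from the TTS
    engine. Returns (viseme, duration) pairs for the face renderer."""
    return [(PHONEME_TO_VISEME.get(ph, "neutral"), dur)
            for ph, dur in timed_phonemes]

# "meet" spoken by the synthesizer, roughly: M IY T
print(viseme_track([("M", 0.08), ("IY", 0.15), ("T", 0.06), ("sil", 0.2)]))
```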

Chang says his group wrestled with many challenges in developing this technology. “We recorded up to 20 hours of a professional announcer’s speech, and from that we extracted the natural intonation that people use when they speak a sentence,” says Chang. “Then we could superimpose that onto our synthesized speech and make it sound better.”
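One simple way to picture “superimposing” a recorded intonation pattern onto synthetic speech: take the pitch contour measured from the announcer’s reading of a similar sentence, stretch it to the length of the synthesized utterance, and recenter it on the synthetic voice’s pitch. The function below sketches only that contour transfer, under invented representations; the real pipeline is far more involved.

```python
import numpy as np

def transfer_intonation(natural_f0, synth_frames, synth_mean_f0):
    """Superimpose a natural pitch contour onto a synthesized utterance.

    natural_f0: per-frame pitch (Hz) measured from an announcer reading a
    sentence of the same type; synth_frames is the number of frames in the
    synthesized utterance. The contour is time-stretched to fit and shifted
    so its average matches the synthetic voice's register. A simplistic
    stand-in for the prosody modeling Chang describes.
    """
    natural_f0 = np.asarray(natural_f0, dtype=float)
    # Stretch the announcer's contour to the synthesized utterance length.
    x_old = np.linspace(0.0, 1.0, len(natural_f0))
    x_new = np.linspace(0.0, 1.0, synth_frames)
    contour = np.interp(x_new, x_old, natural_f0)
    # Keep the shape of the intonation but recenter it on the synthetic
    # voice's average pitch so the result stays in a plausible range.
    return contour - contour.mean() + synth_mean_f0

# Example: a falling declarative contour from the announcer, applied to a
# 50-frame synthesized sentence spoken around 110 Hz.
announcer = [220, 215, 212, 208, 200, 190, 180, 165]
targets = transfer_intonation(announcer, synth_frames=50, synth_mean_f0=110)
```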

There are several factors that determine how things sound, Chang says. “You need to know how to pronounce words with multiple pronunciations, for instance read (‘reed’) vs. read (‘red’). You have to know which is which.”
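Disambiguating such homographs is a classic TTS front-end task, usually handled with part-of-speech tagging and statistical models. The toy rules below, with invented cue lists, only illustrate the shape of the problem.

```python
# Toy homograph resolver: pick the pronunciation of "read" from the words
# around it. Purely illustrative; not how a production TTS front end works.
HOMOGRAPHS = {"read": {"present": "R IY D", "past": "R EH D"}}
PAST_CUES = {"have", "has", "had", "was", "were", "already", "yesterday"}
PRESENT_CUES = {"to", "will", "can", "could", "should", "must", "i", "you", "we", "they"}

def pronounce(words, i):
    """Return a phoneme string for words[i], resolving known homographs."""
    w = words[i].lower()
    if w not in HOMOGRAPHS:
        return w  # assume a pronunciation dictionary handles the rest
    prev = words[i - 1].lower() if i > 0 else ""
    if prev in PAST_CUES:
        return HOMOGRAPHS[w]["past"]
    if prev in PRESENT_CUES:
        return HOMOGRAPHS[w]["present"]
    return HOMOGRAPHS[w]["present"]  # default when the cues are ambiguous

sentence = "I have read the paper you asked me to read".split()
print([pronounce(sentence, i) for i, _ in enumerate(sentence)])
```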

“Also, when a broadcaster reads a script, they’re not just reading the text,” continues Chang. “They actually go through an analysis step first.” His vision for the future is a TTS system that changes tone of voice by analyzing the context, something that people do naturally. If the text is about sports, the broadcaster might use an excited or disappointed tone of voice, depending on the score. But if it’s news of a serious crime, a broadcaster would need to take a more serious tone.
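The idea can be caricatured as a tiny text-analysis step that picks a delivery style before synthesis. The keyword-based sketch below, with invented cue lists and style parameters, is only a stand-in for the kind of context analysis Chang describes.

```python
# Toy context analysis: pick a delivery style for the synthesizer from cue
# words in the text. Purely illustrative.
STYLES = {
    "excited": {"rate": 1.15, "pitch_range": 1.4},
    "somber":  {"rate": 0.85, "pitch_range": 0.7},
    "neutral": {"rate": 1.0,  "pitch_range": 1.0},
}
SPORTS_CUES = {"score", "goal", "win", "championship", "overtime"}
SERIOUS_CUES = {"crime", "victim", "charged", "arrested", "tragedy"}

def choose_style(text):
    words = set(text.lower().split())
    if words & SERIOUS_CUES:
        return "somber"
    if words & SPORTS_CUES:
        return "excited"
    return "neutral"

print(choose_style("The home team pulled off a dramatic overtime win"))   # excited
print(choose_style("Police say a suspect was arrested after the crime"))  # somber
```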

“It will be a long time before a machine can do this, but we are getting there,” says Chang.
