DENVER, Sept. 16, 2002 — At the seventh International Conference on Spoken Language Processing (ICSLP 2002) here this week, researchers from all over the world will be discussing sequential map noise estimation, mulling over spectral density-based channel equalization, and musing about parametric speech distortion models.
Huh?
If it sounds like science fiction, it’s not. Those are just a few of the topics on tap at this week-long conference, a forum for sharing the latest speech-related technologies with the broader research community. The conference convenes this Monday with a series of tutorials from some of the brightest minds in the world of speech recognition.
Microsoft Research (MSR) will be there too, presenting 12 papers from its speech research groups in Redmond, Washington and Beijing, China. Areas of focus include modeling acoustic patterns and enhancing the ability of speech software to recognize language in noisy environments.
But while it’s not science fiction, what lies down the road for speech recognition technology is truly the stuff of motion pictures. Like other groups within MSR, the Speech Research Group engages in research that aims to make the computing experience more natural.
Imagine asking a car for directions to the airport, or phoning a computer to have it read e-mail messages aloud to you. These kinds of user-interface applications are not too far off, but according to Microsoft senior researcher Alex Acero, there is still much work to do.
Baby Steps
Speech is such a large and complex field, Acero explains, that even major advances generally represent baby steps toward the end goal of truly voice-enabled computing.
“People have been working on this technology for 40 years, and many expected it to advance more quickly than it has,” he says. “When you ask a complex question to another human, you expect them to understand and respond. And we are far from that.”
Instead, speech researchers focus on overcoming basic hurdles to simulating the process of spoken communication, such as “noise robustness,” or simply allowing the computer to “hear” what is said in a noisy environment like a moving car or shopping mall.
“The ability of this software to recognize and interpret commands degrades significantly when there is background noise,” he says. “One of our new algorithms, called ‘Splice,’ is designed to help with this problem. It’s not like this algorithm is going to change everything, but it will make it better.”
Another area in which the team has made advances lies in creating the “grammars” through which the software understands and executes voice command-and-control functions, such as those associated with the Windows XP Media Player.
The problem, says Acero, is that people who use computers speak in different ways, and may have several ways of saying the same thing. With Windows Media Player, a user might say “play Muddy Waters,” or “I’d like to listen to Muddy Waters,” or “Let’s hear some Muddy Waters.” The only way the software could recognize all of those commands was for the programmer to anticipate them all and painstakingly code each one by hand.
“That is a very labor-intensive process,” says Acero. “For complex interactions, it’s very difficult to anticipate everything and code it all by hand, and even if you could, you’d have to be an expert.”
To help overcome this problem, Microsoft Research has come up with a new statistical modeling algorithm that essentially automates much of the work by allowing the software to assign probabilities and formulate possible word combinations into “arcs.”
The algorithm is able to “learn” automatically the grammar for commands and prompts from example sentences, and then generalize those word combinations to recognize sentences that have not been provided.
“This is done through statistical methods,” says Acero. “All we need to do is provide example sentences of how to phrase a given command. This new algorithm puts new capabilities into the hands of general developers, so these kinds of speech applications should be much more accessible for them.”
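As a rough illustration only, the idea of learning word-to-word “arcs” with probabilities from example sentences can be sketched with a simple bigram model in Python. This is not the Microsoft Research algorithm, whose details are not described here; the function names and the extra “jazz” example phrase are assumptions made purely for the illustration.

```python
from collections import defaultdict

# Hypothetical example phrasings for a "play music" command.
EXAMPLES = [
    "play muddy waters",
    "i'd like to listen to muddy waters",
    "let's hear some muddy waters",
    "play some jazz",
]

def train_arcs(sentences):
    """Count word-to-word transitions ("arcs") seen in the examples."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def arc_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency (no smoothing)."""
    followers = counts.get(prev, {})
    total = sum(followers.values())
    return followers.get(nxt, 0) / total if total else 0.0

def sentence_probability(counts, sentence):
    """Multiply the arc probabilities along the sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= arc_probability(counts, prev, nxt)
    return prob

model = train_arcs(EXAMPLES)
unseen = sentence_probability(model, "let's hear some jazz")
scrambled = sentence_probability(model, "waters muddy play")
print(unseen > 0, scrambled == 0)
```

In this sketch, “let's hear some jazz” never appears among the examples, yet it receives a nonzero score because each of its word transitions was seen somewhere in the training phrases. A scrambled phrase such as “waters muddy play” scores zero. A practical system would need smoothing and far more data.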
Giving Back
Microsoft Research makes a habit of working with academic organizations to further the advancement of technology across all its areas of focus, and the speech group is no exception. Late last May, Microsoft and industry partner AT&T donated the Entropic Speech Processing System (ESPS) libraries and a software application for visually analyzing speech signals, known as “waves+,” to KTH, Sweden’s Royal Institute of Technology. KTH will facilitate free access to the popular application through its Wavesurfer speech development tool, available free of charge at the KTH Department of Speech, Music and Hearing Web site at http://www.speech.kth.se/wavesurfer.
Originally developed in the mid-1980s by Entropic Inc., ESPS was designed to provide a powerful analysis and display package for speech signals. The software was acquired by Microsoft in November 1999 as part of the company’s acquisition of Entropic.
Waves+ is a speech-analysis tool developed by AT&T and licensed to Entropic in 1998. Although popular with speech developers for years, ESPS lay dormant from the time the technology was acquired until Microsoft and AT&T decided to give these tools back to the speech research community.
“ESPS is widely used in speech-processing research and education, and Microsoft and AT&T wanted to ensure that ESPS and waves+ were still accessible to speech researchers,” says Acero. “A lot of people had been asking about it after the acquisition. Our goal in making the donation last spring was simply for ESPS to continue to evolve and be a fertile development platform for speech research for years to come.”
According to Acero, the donation enables the hundreds of sites worldwide that already use ESPS to continue to build on its capabilities, and increases access to ESPS in schools, universities and research laboratories.
“AT&T and Microsoft not only made this tool available, they made it free for everyone,” says Acero. “So it’s a very positive thing for the speech development community. Hopefully, it will help other researchers continue to improve speech and audio science and technology.”
And that, says Acero, is in everyone’s best interests. In a field that’s been marching on one baby step at a time for nearly 30 years, the next major advance might come from a global corporation with vast resources, or it might come from a graduate student working on a doctoral thesis. Events like the ICSLP and readily available tools like ESPS give everyone involved in speech-technology development a chance to contribute.
“For the last 25 years or so, speech technology has been based on the same models,” he says. “Today we are trying to change the status quo in a way that is like a revolution.”
The community is not there yet, says Acero, but this week in Denver, they are talking about — and listening to — some very promising results.