The quest to teach machines to understand human conversation has taken another big step forward, with researchers reaching a new milestone in speech recognition accuracy.
Xuedong Huang, who leads Microsoft’s Speech and Language Group, announced the new milestone in a blog post. Last year his team made headlines when it reached human parity on the “Switchboard conversational speech recognition task”, meaning it had created technology that could recognize words in spoken conversation as well as professional human transcribers.
Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems. The task involves the technology transcribing conversations between strangers who discuss various topics, such as sports and politics.
Huang said his team had achieved a 5.1 percent word error rate, which “significantly surpassed” last year’s result of 5.9 percent.
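For readers unfamiliar with how such a figure is calculated: benchmark error rates of this kind are conventionally reported as word error rate, the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not Microsoft's scoring code, and the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name), not the scoring tool used
    by the researchers. Words are split on whitespace with no text
    normalization, which real scoring pipelines usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one wrong word out of six reference words.
    rate = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{rate:.1%}")
```

Under this definition, a 5.1 percent rate corresponds to roughly one error in every 20 words of the reference transcript.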
“We reduced our error rate by about 12 percent compared to last year’s accuracy level, using a series of improvements to our neural net-based acoustic and language models,” he said.
The team also enabled its speech recognition technology to use entire conversations. This let it adapt its transcriptions to context and predict which words or phrases were likely to come next, just as humans often do when conversing.
“Reaching human parity with an accuracy on par with humans has been a research goal for the last 25 years … Microsoft’s willingness to invest in long-term research is now paying dividends for our customers in products and services such as Cortana, Presentation Translator, and Microsoft Cognitive Services. It’s deeply gratifying to our research teams to see our work used by millions of people each day.”
Despite the latest achievement, Huang said many challenges remain. These include achieving human levels of recognition in noisy environments with distant microphones. Accented speech also poses problems, as do speaking styles and languages for which only limited training data is available.
“Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent. Moving from recognizing to understanding speech is the next major frontier for speech technology.”