REDMOND, Wash. — Aug. 1, 2011 — For years leading up to the launch of Kinect for Xbox 360, Microsoft was blending technologies for the connected living room, working toward its vision of a natural, powerful center for home entertainment.
At the same time, millions of people around the world had invited the newest generation of video game consoles into their homes — the Xbox 360 video game and entertainment system, capable of handling games, movies, TV, music and photos — opening a world of Internet-connected possibilities.
“Bill Gates spoke about Microsoft’s strategy for the living room, with an intelligent entertainment center to enable amazing experiences,” says Thomas Soemo, principal program manager lead for the Xbox platform at Microsoft. “We knew that the Xbox 360 system was going to be a prime component of this vision.”
The challenge was that no one had ever really found an interface that worked well in the living room. Other industry attempts featured a keyboard to input commands on screen, which never resonated with consumers. The Xbox 360 Controller was great for games but limited for searching media — and unfamiliar territory for nongamers. There had to be a better way to interact.
“How do we solve this problem?” Soemo says. “How do we enable a very natural form of interaction with this device that also fits the social atmosphere of the living room? How do we achieve what feels like Star Trek? That’s the challenge we took on.”
With that challenge in front of them, the Xbox team set out to create the next-generation human-machine interface, capable of understanding requests and commands the way humans do — through speech and gesture. The resulting product, Kinect, has brought speech service beyond the telephone voice prompt and into millions of homes worldwide.
“We are witnessing the beginning of a revolution today,” Soemo says. “Speech recognition is entering the mainstream and redefining how people find, consume and interact with their media content on the Xbox 360.”
The Living Room Challenge
In creating the Kinect, one of the biggest engineering challenges was the living room itself. Living rooms tend to be large spaces, which led to an unprecedented design requirement for the Xbox team — the Kinect’s microphone array would need to work seamlessly from up to four meters away, roughly the distance from couch to screen, much farther than other speech-recognition systems in the industry could comfortably handle.
Another complication was the fact that living rooms are social gathering places and are often filled with ambient noise, such as conversations, movie soundtracks and music.
“Imagine if everything you said could be interpreted by the Xbox 360 as a command,” says Keith Herold, a senior program manager lead with Microsoft Tellme, the company’s speech-recognition service that also powers Windows Phone 7 devices and appears in an array of other products. “That’s the big problem in the living room — how do we get the device to ignore everything but actual commands?”
To solve this, the Xbox team reached out to Ivan Tashev, a Microsoft Research principal software architect with more than a dozen patents related to helping machines capture and interpret sound.
Tashev had been prototyping technologies for speech enhancement, audio processing, microphone arrays and echo cancellation. For the Xbox 360 system, he went to work purifying the audio signal so the Kinect could understand what it was being told. He used his expertise in echo cancellation to subdue everything coming out of the console — soundtracks, movie dialogue, game audio — as well as room noise the microphone would pick up. This was an immensely challenging problem based on advanced mathematics, but Tashev relished the task.
“Basically, in Kinect I have technologies that are a summary of the research I did for seven years,” he says. “We know what’s coming out of the console — it’s a constantly shifting, dynamic signal. The trick was to remove that outbound signal from the incoming signal. And to do it in real time.”
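The idea Tashev describes — subtracting a known, constantly changing outbound signal from what the microphone hears — is the core of acoustic echo cancellation. Microsoft has not published its implementation; the sketch below shows the general technique with a standard normalized least-mean-squares (NLMS) adaptive filter, where the function name and parameters are illustrative:

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, filt_len=64, mu=0.5, eps=1e-8):
    """Illustrative echo canceller (not Microsoft's): adaptively model the
    echo path from the known console output (far_end) to the microphone,
    then subtract the predicted echo, leaving near-end speech and noise."""
    w = np.zeros(filt_len)            # adaptive FIR taps: echo-path estimate
    out = np.zeros(len(mic))          # echo-reduced output signal
    for n in range(filt_len, len(mic)):
        x = far_end[n - filt_len:n][::-1]  # most recent far-end samples
        echo_est = w @ x                   # predicted echo at the mic
        e = mic[n] - echo_est              # residual after echo removal
        w += mu * e * x / (x @ x + eps)    # NLMS tap update (normalized step)
        out[n] = e
    return out
```

Because the filter updates on every sample, it tracks a shifting signal in real time, which is the property Tashev emphasizes; production cancellers add frequency-domain processing and double-talk detection on top of this basic loop.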
Another challenge was to help the Kinect determine who is talking, focus on that source and ignore everything else. To solve this, Tashev used “beamforming” technology, which spotlights the person giving commands to the system.
“If there are four people in the room and one is talking, the spotlight goes to him or her, and if that person says ‘Xbox,’ then we start listening,” Tashev says.
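The "spotlight" Tashev mentions can be illustrated with the simplest form of beamforming, delay-and-sum: each microphone channel is delayed so the talker's wavefront lines up across the array, then the channels are averaged, reinforcing sound from the chosen direction. This is a textbook sketch under simplifying assumptions (plane wave, integer-sample delays), not the Kinect's actual beamformer:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, room temperature

def delay_and_sum(mics, positions, direction, fs):
    """Illustrative delay-and-sum beamformer.

    mics:      (n_mics, n_samples) recorded channels
    positions: (n_mics, 3) microphone coordinates in meters
    direction: unit vector from the array toward the talker
    fs:        sample rate in Hz
    """
    tau = -(positions @ direction) / SPEED_OF_SOUND      # arrival time per mic
    shifts = np.round((tau - tau.min()) * fs).astype(int)  # align to earliest
    n = mics.shape[1] - shifts.max()
    aligned = np.stack([ch[s:s + n] for ch, s in zip(mics, shifts)])
    # Channels add in phase for the target direction; other directions,
    # and diffuse room noise, partially cancel in the average.
    return aligned.mean(axis=0)
```

A real system estimates the talker's direction first (sound-source localization) and uses fractional delays and per-frequency weights, but the steering principle is the same.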
In the end, the Kinect’s audio enhancement chain consists of six major stages that consecutively improve the quality of the speech signal, removing clutter, noise and reverberation from the room to help the speech recognizer do its job.
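The article does not enumerate the six stages, but "consecutively improve the quality" describes a classic processing chain where each stage consumes the previous stage's output. A minimal sketch of that structure, with purely hypothetical stage names standing in for the real ones:

```python
def make_pipeline(*stages):
    """Chain enhancement stages so each one consumes the previous output."""
    def run(signal):
        for stage in stages:
            signal = stage(signal)   # e.g. echo cancel, beamform, denoise...
        return signal
    return run

# Hypothetical stage names for illustration only; the actual six stages
# in Kinect's chain are not listed in this article.
# enhance = make_pipeline(echo_cancel, beamform, dereverberate,
#                         noise_suppress, gain_control, spectral_cleanup)
```

The advantage of this structure is that each stage can be developed and tuned in isolation while the recognizer only ever sees the fully cleaned signal at the end of the chain.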
Making the Natural Interface Natural
With the audio pipeline in place, the next step was to integrate that signal with the Microsoft Tellme speech service. For this phase of the project, the Xbox team turned to Herold’s team to bring Microsoft Tellme to the Xbox 360 platform.
“Our job was to take the remaining audio, now at this point just a player’s commands, and do something rational with it,” says Herold. “This project required us to step up and push our boundaries well past telephony voice response and desktop speech, into a much more human environment. We needed to put ourselves in the mindset of the living room environment and all of the interactions that are possible there. We wanted to change the way people thought of speech technology.”
Adding to the challenge was the Xbox team’s allowable error rate, which seemed impossibly low for a system with so many variables.
“We never want a command to trigger random actions on the console,” Herold says. “The idea of ‘never’ is not achievable of course, but we picked a suitably small number for never.”
The solution to this problem was the software equivalent of a concept first developed for backpack-sized walkie-talkies in the 1940s — the transmit, or “push-to-talk,” button. This was embodied as the keyword “Xbox.”
“When you say ‘Xbox,’ the system knows you’re talking to it and what’s coming next is a command. If you don’t say it first, you haven’t pushed the virtual ‘push-to-talk’ button, and the system won’t listen,” Herold says.
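The virtual push-to-talk behavior Herold describes amounts to a small gating state machine: the recognizer discards everything until the keyword arms it, then treats the next utterance as a command. A simplified sketch (treating each recognized phrase as a discrete event, which glosses over keyword spotting in continuous audio):

```python
from enum import Enum, auto

class ListenState(Enum):
    IDLE = auto()    # ignore all speech
    ARMED = auto()   # keyword heard; the next phrase is a command

def gate_commands(phrases, keyword="xbox"):
    """Illustrative keyword gate: return only the phrases spoken
    immediately after the keyword, mimicking a push-to-talk button."""
    state = ListenState.IDLE
    commands = []
    for phrase in phrases:
        if state is ListenState.ARMED:
            commands.append(phrase)        # accepted as a command
            state = ListenState.IDLE       # one command per keyword
        elif phrase.strip().lower() == keyword:
            state = ListenState.ARMED      # button "pressed"
    return commands
```

Gating this way keeps the "never trigger random actions" error budget manageable: ordinary conversation can only cause a false action if it both mimics the keyword and is followed by something that parses as a command.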
Since the Kinect supports both speech and gestures, the combined Xbox and Microsoft Tellme team spent considerable time determining how to enable both forms of interaction in a way that was complementary and intuitive. Their guiding principle was the concept of the Natural User Interface (NUI), in which people communicate with machines in the most human way possible.
For example, speech might be the best modality to search through thousands of songs, since gesturing to scroll through such a vast list could be tedious. Telling the machine, “Xbox: Bing, The Beatles” allows the user to get what they want in the most natural way possible from the vast collection of content available through Xbox LIVE.
Once the list is narrowed, using gesture to select a specific song may be the most natural interaction. Graphics, text and sounds on screen help cue users to make the interface more intuitive and easy to use.
According to Herold, this is the strength of “multimodal” interfaces, which combine speech with touch, gesture or other forms of input: Each modality is used where it is stronger, and the combination becomes much more powerful.
Advancing the Platform
For the first iteration of the device, the Xbox team prioritized the commands that would resonate most with people in their living rooms. They decided that simple navigation functions and media playback controls — “Xbox: play. Xbox: pause.” — gave people something valuable, while also demonstrating the system’s potential.
“When you’re building a new product on new technology, you can try and do everything and it may work most of the time, or you can stay laser focused on the key scenarios and make them amazing,” Soemo says. “The first release of Kinect was about shipping a product that handled those key speech experiences extremely well.”
From the start, however, the team was thinking long term. When the Xbox team announced the next round of Kinect functionality at the recent E3 conference in June 2011, it was the next step in a vision that began years ago.
“For the launch of Kinect, we leapt over some major technology hurdles on our way to ‘Xbox: play.’ and ‘Xbox: pause.’,” Soemo says. “Nobody had ever done highly accurate speech recognition from up to four meters away, without a physical ‘push-to-talk’ button, in an environment filled with ambient noise, all while playing in 5.1 surround sound. Because of the collaboration among the Xbox, Microsoft Research and Microsoft Tellme teams, we were able to take science fiction and make it science fact.”
Soemo says the functionality announced at E3 is just the second iteration in the journey toward the Xbox 360 system becoming the entertainment hub for the home — redefining how people discover and use the range of media content available on Xbox LIVE and making the remote a thing of the past.
“We are laying a foundation that will transform how people interact with devices,” Soemo says. “We are at that cusp. With Kinect, we’ve put speech into the living room. Now, Microsoft will continue to push the boundaries of NUIs to enable seamless experiences that span devices and platforms.”
With that foundation in place, the Kinect’s latest functionality goes well beyond simple navigation and allows people to use voice commands to traverse very large media catalogs with ease, and the team doesn’t plan to stop there.
“What are the most amazing experiences with speech we can imagine?” Herold says. “Can we create technology that is as natural as talking to a friend? This is where we want to go, and it’s happening in front of our eyes.”
No keyboard necessary.