Remarks by Craig Mundie, chief research and strategy officer for Microsoft
Microsoft College Tour
Massachusetts Institute of Technology
October 7, 2010
MODERATOR: Good afternoon, ladies and gentlemen. We’re delighted to have Craig Mundie with us today, giving the CSAIL Dertouzos distinguished lecture. As Microsoft’s chief research and strategy officer, Craig oversees one of the world’s largest computer science research organizations, and he’s responsible for Microsoft’s long-term technology strategy. Craig also directs the company’s fast-growing health care solution business, along with a number of technology incubations.
Craig has spent much of his career building startups in various fields, including supercomputing, consumer electronics, health care, education and robotics. He joined Microsoft in 1992 and ran the consumer platforms division, which developed non-PC platforms such as the Windows CE operating system, software for the handheld PC, Pocket PC and Auto PC, and early console gaming products.
Before his current role, Craig served as Microsoft’s chief technical officer for advanced strategies and policy, working with Microsoft Chairman Bill Gates to develop the company’s global strategies around technical business and policy issues. Another longstanding focus for Craig is privacy, security and cyber security. He initiated Microsoft’s trustworthy computing initiative, which has leveraged new software development practices to significantly improve the security of the company’s products.
For more than a decade, Craig has also served as Microsoft’s principal technical policy liaison to the U.S. and foreign governments, with an emphasis on China, India and Russia. He served on the U.S. National Security Telecommunication Advisory Committee and the task force for national security in the information age. And in April 2009, Craig was appointed by President Barack Obama to the President’s Council of Advisors on Science and Technology, or PCAST.
Ladies and gentlemen, please welcome Craig Mundie.
CRAIG MUNDIE: Thank you, Victor. Thank you. That was quite a biographical sketch.
It’s great to be here at MIT. I get here relatively frequently. I lived in the area for 15 years and did a startup out in Littleton before I went to Microsoft in 1992.
For me, it’s been a really interesting 18 years there where we’ve watched our own involvement in computing go from the personal computer, as everybody has come to know and understand it, to a very, very diverse use of computing and application of it in many devices.
My own view is that the computer industry tends to evolve in long sort of cyclical waves, and there are really two things that drive that. One of the waves is the steady progression of technology, starting with the underlying hardware capabilities, and then progressing into other areas. And the second thing that drives the change is essentially innovation in what people decide they can do at any given moment with these computers.
But it’s an uneven process by which we aggregate these technological advances and bring them forward. In the personal computer beginnings, people talked a lot about the killer apps, word processors and spreadsheets. It was those applications that created a lot of popular use of the platform, and with its diffusion it became an environment where many, many people would be able to develop new capabilities. And that process of platform creation has been extended by us and by other companies now as we’ve diversified the use of computing around the world.
I think that these cycles are typically about 15 years long, and, after that accretion of technological capability, you start to look for the next big thing.
So, today, we look and find ourselves observing quite a range of technological changes, which seem to create the conditions for a fairly significant revolution. One of those, of course, is the steady progress of the microprocessor, and soon the emergence of systems on a chip. I think that that will actually have a fairly profound effect on our industry, perhaps more than many people currently predict.
But we have also seen steady advances in storage, interconnect and networking facilities. One of the things that's more obvious to people today is a discontinuity in display technology, both in the size of display you can get for a given expenditure and, more recently, in the emergence of 3-D displays. So far that has created an entertainment experience exploiting the three-dimensional display capability, but it's our belief that it also creates an opportunity where large-scale displays and three-dimensional stereoscopic displays form at least a component of a fundamental change in the way people interact with computers.
Another fundamental advance is in sensors, which have become very cheap in many different forms. New sensing capability, very high-performance computing and storage, and these new displays are coming together to create a potential revolution, and we call that the transition from GUI to NUI.
GUI, the graphical user interface, is what virtually all of us have grown up with as the predominant way we interact with our computers. But while it's taken us a long way, there is a class of applications, and certainly a large number of people on the planet, who would prefer to bypass that particular model of interaction and begin to deal with computing at a higher level of abstraction: to get more out of a computer without treating it as a tool, and to interact with it more like you would interact with another person.
So, NUI, or natural user interface, or natural user interaction, we think is going to be one of the next big things.
And so today, to sort of stimulate conversation hopefully with you in the second half of this hour, I had my prototyping people take a number of technologies that we’ve been exploring at Microsoft and put them together in a demonstration that I’ll give today, that combines 3-D stereo display and these new sensing capabilities, in particular, a depth camera.
For many new technologies, the first applications to become popular have to be put in the hands of individual purchasers, and in fact consumers. This is something that Microsoft had at its origin, something we are working really hard now to go back and repeat, and something we haven't paid as much attention to in the intervening years.
One place where we have had that kind of focus for a long time in the company is in our gaming business, in the Xbox environment, and so that was really chosen as the place where we would first try to move to offer people this natural user interaction model.
So, as you’ll see in a minute, what I brought is one of the Kinect sensors, which is going to be available in general for 150 bucks in about a month on the Xbox, and while there will be many new genres of games and applications developed in that environment, what I wanted to do in these demonstrations is show that we think that it just is the beginning, sort of an opening of the space where people will be able to think broadly about this more natural model of man-machine interaction, and there will be a whole new class of applications that will emerge from that.
So, let me begin by showing you a short two-minute video, which gives you an overview of the number of technologies that we’re researching that we think either display or sensing or man-machine interaction are going to be behind some of these future applications, and then I’ll come back and talk very specifically about the application of some of these in this kind of demonstration. So, let’s go ahead and run the video.
CRAIG MUNDIE: So, let me talk for a minute about some of these technologies as a way to help you understand the kind of research that’s going on. All the things that we showed are actual projects that have already reached a stage where we’re doing publication of the research results, and, of course, there’s a lot more that we’re doing that hasn’t reached that status yet.
So, go to the next slide, please.
So, the first one I’ll highlight is a technology we call immersive interaction, and here we see a world where there’s going to be a lot more telepresence, where you’re going to want to interact at a distance with people, but not just to have a videoconference but to collaborate in a more sort of visceral way.
So, here’s an example where we’ve taken this technology and the person at the other end, in this case all you can see is their hands and arms, but you’re playing a board game like this. The game is actually a projection in this virtual environment, and, as you play, you can actually move your pieces and see the hands of the person on the other side. And so both people are having the same experience at the same time; each is seeing a virtual representation of both the game environment and the actions of the other party.
So, here it’s just the hands, but I think this is a precursor, as the display capabilities and sensing capabilities evolve to be able to do a much more complete telepresent application, where there’s a lot of collaboration.
Go to the next one.
This one is gestural interaction. One of the applications that we see for this type of sensing and machine vision capability is to allow very precise control using just gestures. This whole Kinect technology combines gesture recognition, but on a complete skeletal level, with voice recognition.
But here we see an application like in the operating room theater, where they have an increasing dependence on imaging as part of the operating procedure, but there’s always the question of how does a doctor who’s actually doing that manipulate in a sterile way the imagery. And so here we’ve taken machine vision and gestures and applied it experimentally to give this kind of gesture-based control in the operating theater.
The third one I’ll talk about is what we call situated interaction. Here we start to see the combination of robotics technology, not in terms of perhaps building a machine, controlling a machine in a factory or even a robot that would move around in this environment, but recognizing that essentially robotic avatars may be an appropriate way to think about how computers can present themselves economically. And yet, it’s a very complex, distributed, concurrent computing problem in order to build these things.
So, we’ve started to use robotic techniques to build these type of situated interaction environments, and one of the experiments we did first was to create a virtual receptionist, where in the Microsoft lobby you could go and address this image of a person, the same way you historically talked to the receptionist, to order a shuttle to go across the campus.
So, what we were taking advantage of was we knew what most people wanted to do in that lobby environment, and so we had a constrained domain of discourse, but the interaction was completely natural, and it supported multiparty communication, and it recognized not only gesture but dress and other things. So, we tried to really have it emulate the social interaction experience that you’d have if you were really working with another person.
After this was done, I got quite enthused about it and asked them to build a robotic triage doctor, because one of my longstanding goals is the idea that these kinds of tools may be part of the answer to scaling access to basic health care, for example in rural poor environments where there just aren't any highly trained medical people.
So, here you might imagine that, with a paramedic to do some of the physical activities, and with much of the world's medical knowledge represented this way for triage or diagnostic work, you could serve people who are computer illiterate much as an actual human doctor would.
So, we’re quite interested in this as another way to think about how natural interaction happens between people and machines.
Go on to the next one.
The last one I want to talk about is really a segue into the main demo. This is actually a snip from the video of the Kinect camera from the team when it was under development. And most cameras, of course, whether video or still, in the past have always just flattened the world in front of them into a 2-D image, and even trying to get stereo perception from two cameras is quite difficult. Humans do it not simply because they have stereo vision, but they’re actually combining a lot of other activity in order to develop that sense of depth.
So, we actually decided we would use a different technology, one that literally did see in 3-D, in order to be able to combine that in a composite way with more traditional imaging, in order to create the basis of doing this people recognition and gesture recognition, and to be able to develop a complete model of a scene in front of us, and to do it very economically.
So, what I’ll show you after the demo is how this actually works, and how it’s becoming a platform for the development of new applications.
So, let’s go ahead and put your glasses on if you haven’t already because this part will actually be in 3-D.
So, as I approach the sensor, it recognizes I’m here and brings up, if you will, quote, “my desktop.” And in this environment, I’m going to use gestures to control that, and so the first thing I’ll do is essentially raise up the user interface.
In this environment, I can actually use my hand to scroll back and forth, to select in different basic areas, and the first thing I want to do is stop in this information category.
So, when I stop here, it says here’s a reminder that my aunt has a birthday, and I wanted to buy her a gift. So, I’ll say, “Computer, show me Aunt Sophie’s wish list.” In this case, Aunt Sophie’s some eccentric lady, and the way she has expressed her wish list is she places objects into this virtual room that is of her choosing, and people that she’s familiar with can come here and see what it is she might like to have.
But this isn’t particularly useful for me. There’s certain things in here I might go buy and others that I probably wouldn’t do. So, to make it easier to shop I’ll ask, “Computer, organize this in my preferred way for shopping.”
So, it takes all the items, clusters them together, recognizes certain categories, like shoes, for which it would be more appropriate to give a gift card, categorizes them and sort of places them on these virtual shelves for me to look at.
So, maybe I could reach out and pick up or look at some of the items here like the lunchboxes, say, well, that’s not exactly what I had in mind.
So, I’ll go up to the next shelf and pick up this pasta maker, and bring it forward and examine it. Here it gives me the metadata for it. I can try to determine, “Is this thing too complicated for my aunt?” So I’ll basically use a gesture to essentially explode it into its pieces, and say, that looks pretty hard for her to maintain, but I could essentially turn it around and look at it in different ways.
I’ll say, “OK, that thing is too hard,” so I’m going to push that one back onto the shelf, and I’ll go over here and pick up the green one, it looks a little simpler, and bring it in. That looks a little easier. Let’s expand that thing into piece parts; looks a lot simpler. I can essentially turn it around, look at different angles, push it back and look at it straight on, and I’ll say “Yeah, OK, that’s the one I want.” So, I’ll gesture this and put it into the box and buy it. So, “Computer, I’m done shopping.”
So, in this environment, I’ve got a more natural way of interacting with things. I’ll come back now, and we’ve been doing some experimentation around what it might be like to develop not just traditional videogames but where gaming and essentially large-scale multiplayer gaming may intersect in the future with a genre of television. So, here, let me select the entertainment category and show you a video that’s been put together, for what it might be like to have one of these player-participating TV series.
So, “Computer, run the 2080 trailer.”
CRAIG MUNDIE: So, we think people will start to produce environments like this, and we’ll show you what maybe one of the missions might be like here that you might play in. So, here’s actually a 3-D model, a virtual space where the players in the game can actually come together. So, while they’re not physically present, they can be telepresent with each other. They can converse and exchange items and ideas.
So, I’ll start walking around in this environment a little bit. So, there’s different people here.
Today, despite all the progress we’ve made in computing capability and graphics, it turns out trying to do this demo in real time, we are actually computer limited. We need a lot more computer power than we currently can reasonably put together.
So, here I’m going to go approach my friends that I’ve been playing with and talk to them a little bit about the game.
DEMO WOMAN: Hey, finally, you made it!
CRAIG MUNDIE: Yeah, it’s good to see you.
DEMO WOMAN: Hey, check it out. We found this piece, and it’s just like this other one I found a few episodes ago.
DEMO MAN: Wait, wait, which one was that?
DEMO WOMAN: Well, it was the one where she drew that marker on the garden wall. Ring a bell? OK, well here, let me just show you. OK, so anyway, I went to that wall, and I found one just like the one we found today.
DEMO MAN: Yeah, so we’re thinking that they’re probably part of a set. I don’t know, maybe there’s more around here if we looked.
CRAIG MUNDIE: Throw me that one that you’ve got there.
So, let me turn this around, and maybe this is a clue that we can essentially decode. So, let me spin it around and try different orientations for this environment and see if it starts to come into some alignment. So, we find out, well, actually, this does appear to be a picture if you look at it right, and in fact, it’s a video.
DEMO MAN: Very cool. Hey, wait a second, is that a secret scene?
DEMO WOMAN: Hey, they promised it would be in one of the mini-games. Maybe that’s where they show how Grid 47 —
CRAIG MUNDIE: I can knock it off axis a little bit. You realize it’s actually a lot of composite little pieces of video that only appear as a clue if you look at them all in the right direction.
Here, let me throw it back to you guys, and you go ahead and watch it. OK, I got to go back to work here.
DEMO WOMAN: Oh, OK, right, well, go ahead. Talk to you later.
CRAIG MUNDIE: Sorry I’m bothering you.
DEMO MAN: See you later.
CRAIG MUNDIE: OK, sorry to bother you.
So, in this environment, we see a lot of opportunity for people to get creative, and whether the games and TV series will actually emerge this way, it’s hard to tell, but at least the technological basis of doing that is clearly going to be possible.
So, let me explain to you a little bit about how this sensing environment works, and how we’ve been able to get this to be relatively economical.
So, you can bring up the output from the sensor over here.
So, what you actually see here is sort of a debugging image. The sensor’s sort of behind me; if I face it and make any kind of movement, you can see that what it’s doing is building a skeletal model of me. In this case, it builds a model of the 42 major joints in the human skeleton. It does it at 30 hertz, and it’ll do it for four people at a time. And it does that seeing essentially in the dark, because this part of the imaging is done using infrared for illumination and timing, and so even in a dimly lit environment, or a room where you’re watching television, it can see these things.
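As a rough sketch of the data rates those figures imply (a full skeletal model per player, 30 times a second, for up to four players), here is a hypothetical illustration in Python; the joint structure and names are assumptions for illustration, not the actual Kinect API:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical representation of one skeletal update; the real Kinect SDK
# uses its own types, so treat this purely as an illustration.
@dataclass
class SkeletonFrame:
    player_id: int
    # joint name -> (x, y, z) position in metres, camera-relative
    joints: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)

FRAME_RATE_HZ = 30        # skeletal updates per second, per player
MAX_TRACKED_PLAYERS = 4   # simultaneous fully tracked players

def skeletal_updates_per_second(players: int) -> int:
    """Total skeletal frames produced per second for a given player count."""
    return min(players, MAX_TRACKED_PLAYERS) * FRAME_RATE_HZ
```

At four players this works out to 120 full-skeleton updates per second that the application has to consume.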
Many people thought, well, how hard can this be; you just have a bunch of cameras and look at people. But it turns out, to make this really work is much, much more difficult than one would imagine, and there were a lot of problems.
So, for example, different parts of the body are occluded as a function of whether I’m facing the camera or not facing the camera, and if there were multiple people in here, as they pass in front of one another, there’s a momentary occlusion of the other player. You really can’t allow that to happen in a logical sense if you don’t want to disrupt the game play, particularly in a multiplayer environment. So, a lot of work was done to basically deal with this as a statistical problem and a modeling constraint problem, and not simply an image recognition or machine vision problem.
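One simple way to picture treating occlusion as a modeling problem rather than a pure vision problem is to predict a momentarily occluded joint from its recent trajectory instead of losing it. This is a toy constant-velocity sketch, not the actual statistical machinery used in Kinect:

```python
from typing import Optional, Tuple

Point = Tuple[float, float, float]

def joint_estimate(prev: Point, prev2: Point, measured: Optional[Point]) -> Point:
    """Return the measured joint position if the sensor saw it this frame;
    otherwise extrapolate from the last two good frames (constant velocity)."""
    if measured is not None:
        return measured
    # predicted = prev + (prev - prev2), i.e. keep moving at the last velocity
    return tuple(2 * a - b for a, b in zip(prev, prev2))
```

A real system would also weigh skeletal constraints (bone lengths, joint limits) against the prediction, which is the "modeling constraint" part of the problem.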
And so, as you move around in this environment, as you get closer to the camera, you can see that sort of the heat map changes, which is showing what the distance is, and each of these things can then be interpreted by the application.
The image on the left is the RGB regular video camera, and you can see there’s a big spotlight back there and part of the audience is visible in front of me. But taken together, each of these things can then be mapped into a set of commands that relate to whatever the application is that the person wants to do.
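The depth "heat map" mentioned above can be pictured as a per-pixel normalization of the raw range readings into a display intensity. The near and far clip values here are illustrative assumptions, not the sensor's actual specification:

```python
NEAR_MM = 800.0    # assumed nearest reliable depth reading, in millimetres
FAR_MM = 4000.0    # assumed farthest reliable depth reading, in millimetres

def depth_to_intensity(depth_mm: float) -> float:
    """Map a raw depth sample to a 0.0 (near) .. 1.0 (far) heat-map intensity,
    clamping readings outside the reliable range."""
    clamped = max(NEAR_MM, min(FAR_MM, depth_mm))
    return (clamped - NEAR_MM) / (FAR_MM - NEAR_MM)
```

Applying this to every pixel of the depth image yields the distance visualization the application can then interpret, alongside the ordinary RGB stream.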
So, when I showed this to somebody a few days ago, they said, “Does this mean that I have to learn which gestures map to the Xbox controller buttons?” And that’s completely the wrong way to think about it. Here, what we’re trying to do is create a genre of games where there is no notion that you’re having to think about mapping what you would naturally do into a set of discrete actions, but rather you just operate quite naturally on it.
So, when we have now built the new games with that model, you find that people who never played a videogame before, perhaps even never had any interest in thinking they wanted to play a videogame, are able to immediately, literally in seconds get into these things and start to operate them because, in fact, it’s no different than operating in the natural world that they already understand.
So, to give you an idea of sort of an alternative mapping, what I’ve done is built a room and put a model in it, so when I step closer to this thing, it’s going to actually take that model and, as I get close enough, it’s just going to use that as a cue to pick it up.
Now, the green dots here are essentially mapping my hands, and it’s measuring not only where their position is sort of in this plane, but actually in space. So, when I actually put them in front of me, they turn red, and at that point they’re active in terms of manipulating the model.
So, I pull them back, my hands are not active. So, if I put one in, I can essentially spin this model around and do different things to it. If I actually put both hands in at this level, sort of by my head, the thing basically will essentially expand it into its piece parts. Then I can essentially manipulate that, turn it around and look at it.
Sort of at waist high, it basically becomes a zoom control, and you can see that I’m making fairly small physical gestures, fairly smooth, and it’s essentially tracking the modeling in real time.
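The hand-activation rule described in the last few paragraphs, where a hand extended in front of the body becomes active and a hand held at waist height acts as a zoom control, can be sketched as a simple classifier. The thresholds and function names here are hypothetical, chosen only to illustrate the idea:

```python
ACTIVATION_DEPTH_M = 0.35   # hand this far in front of the torso becomes active
WAIST_BAND_M = 0.15         # vertical tolerance around waist height, in metres

def classify_hand(hand_z: float, torso_z: float,
                  hand_y: float, waist_y: float) -> str:
    """Classify one tracked hand as 'inactive', 'zoom', or 'manipulate'.

    z is distance from the camera, so a smaller hand_z than torso_z means
    the hand is extended toward the sensor; y is height above the floor.
    """
    extended = (torso_z - hand_z) >= ACTIVATION_DEPTH_M
    if not extended:
        return "inactive"          # hand pulled back: green dot, no effect
    if abs(hand_y - waist_y) <= WAIST_BAND_M:
        return "zoom"              # active hand held at waist height
    return "manipulate"            # active hand elsewhere: spin/explode model
```

The point of the sketch is that the application never maps gestures onto controller buttons; it interprets where the hands actually are in space.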
Now, many of you probably went to the movies and saw something like “Avatar” in 3-D, and the process by which those things are created has a huge amount of post processing involved. At the end of the day, they stream a linear set of bits through these projectors that produce the polarized output, and the glasses allow each eye to see a different image.
Part of the magic in this particular presentation, which has not been done certainly very often anywhere in a theatrical kind of presentation, is to do this with real-time computed geometry. To do this, we’ve got the Xbox doing the sensing, feeding the output stream of my skeletal positions as input to a PC program, which is actually rendering the model.
So, we’ve got dual four-core Xeon processors, essentially the biggest PC you can build, with the highest-end video graphics card you can use, and in 3-D real time this is limited to about the 300 parts we’ve got in there, in this essentially very simple environment: a single light source, no real walls, not a lot of texturing.
So, while many people today talk about how far we’ve come with computing, and ask whether we need any more, when you want to create really high-resolution models and operate them in real time, and certainly when you start to think about HD-quality movies or games or multiparty games and telepresence, we’d be very happy to have another couple of decimal orders of magnitude in computing and graphics performance, in order to do this at the frame rates that would be optimal, and with the resolution and amount of scene detail we would really like to have.
So, we’re pretty optimistic. So, when I actually step far enough away, the thing will put the bike back together and put it back on the floor.
So, this is just an example of a small team who were able to take the basic technology of the gesturing system, and then map it into an application of their choice.
So, we get pretty enthused — you can take your glasses off, that’s sort of the end of the 3-D part.
So, we’re very enthused about this as the beginning of the emergence of this natural user interaction model, not just as academic experimentation in large visualization rooms. Certainly MIT and other universities, as well as other companies, have experimented with some of these things for a long time, but to reduce this to practice in something that will be available in the mass market for $150 takes a huge amount of work, and clearly we’re not done in any sense of the word.
So, we’re very interested in collaboration with people in this space. We think that by making this technology available to a development community that will get larger and larger, we’re going to see a lot of innovation, and we’re quite enthused to see what kind of results emerge from that.
MODERATOR: Thank you very much for a fascinating talk.
CRAIG MUNDIE: Thank you, Victor, thank you. (Applause.)
So, anybody that’s interested, you’re welcome to come down. I’ll stay around for 15 or 20 minutes and chat if you have questions, and we’ll fire up the motorcycle in case any of you want to actually play with the sensor.