Remarks by Craig Mundie, chief research and strategy officer for Microsoft
Microsoft College Tour
Durham, North Carolina
October 5, 2010
HOST: Good afternoon, ladies and gentlemen. Welcome to the Distinguished Speaker Series. We are delighted to have with us today, Mr. Craig Mundie, chief research and strategy officer of Microsoft.
Mr. Mundie oversees one of the world’s largest computer science research organizations and is responsible for Microsoft’s long-term technology strategy. He joined Microsoft in 1992 and since then has spent much of his career building startups in areas including supercomputing, consumer electronics, healthcare, education and robotics.
For more than a decade, he also served as Microsoft’s principal technology policy liaison to the U.S. and foreign governments with an emphasis on China, India and Russia. In April 2009, he was appointed by President Barack Obama to the President’s Council of Advisors on Science and Technology.
He holds a bachelor’s degree in electrical engineering and a master’s degree in information theory and computer science from Georgia Tech. He enjoys traveling and spending time on his boat, Serendipity. Here to talk about how new trends in technology are transforming how we interact with computers, please help me welcome Mr. Craig Mundie. (Applause.)
CRAIG MUNDIE: Thank you very much. Good afternoon, everyone. It’s a pleasure to be here at the Fuqua School, and at Duke. I haven’t been here in quite a long time, although I came to this area in 1977 with Data General and built their facility at the time in the Research Triangle. I lived here for five years and visited here in Chapel Hill, and my daughter was actually born in Raleigh, so it’s kind of like coming home a little bit.
This afternoon, we have an hour together, and I’d really like to use it in two parts: One, to give you some demonstrations and a little background on some things that we think are going to produce fundamental changes in the computer industry and, in particular, the way in which people use computers, and then use that hopefully as a stimulus to get you to engage in a discussion with me.
I come and speak at universities like this, and have for about 10 years. Bill Gates, before he retired from Microsoft to work at the foundation, he and I have always enjoyed going around to universities giving some demos about what we think is going to happen in the future, and engaging with faculty and students so that we can remain current on what’s happening in this environment and hopefully share with you some of the things we think are going to happen in the computing world ahead.
There’s always a lot of value that comes out of these conversations, and today I met with the administration of the university here, had a faculty roundtable, two student roundtables, and then this session today.
So, let me first begin and talk a bit about some of the changes that we think are about to happen in the world of computing. Every 10 or 15 years, there’s an accretion of technological advances that creates the right conditions for a substantive change in the way people use computers or the way computing is done. And this has been true in the industry as long as I’ve been involved, which is about 40 years, and frankly, predates that. And Microsoft and, you can say, the birth of Microsoft around the personal computer came about really because of the arrival of the microprocessor. You know, it was Bill Gates’ and Paul Allen’s insight that these small computers would find their way into lots of interesting things and that they would displace, in some sense, the traditional way in which computers were constructed. And they went out and created Microsoft based on that belief, and the rest is history, as you might say, in that field.
The evolution of technology now occurs in many ways, not just the microprocessor and the storage systems themselves, but importantly, in the display technologies, which are evolving quite rapidly both in terms of the way in which display is done and the size in which the displays can be done. And we also see a rapid introduction and decline in cost of sensors. And so sensing by computers is going to become very, very ubiquitous and will essentially allow us to emulate many of the human sensory interactions.
And so with all these things coming together, we find ourselves at an interesting moment in time where as computing has become ubiquitous, we don’t just think of the computing as the thing we address as a computer — it’s in your cars and cell phones and game consoles and many other devices — the question at hand is how do we really get it so literally everyone can use it and how can they get some higher-order benefit out of doing that?
And so, at Microsoft, we think that one of the biggest transitions here will be the transition from what we call GUI to NUI. GUI stands for the graphical user interface and is essentially what most of you that are students here probably have known all your life as far as how you interact with a computer, and it continues to get morphed to where we can do it with our fingers and touch instead of typing and pointing with a mouse, but nonetheless, the primary model of human interaction with machines has been through that graphical interface.
What we think is going to happen in the next few years is the move to the natural user interface, or NUI, and it won’t supplant the GUI completely, but it will create an environment where, for many types of applications and for many people today that are still put off by the computer literacy that’s required to get anything done, it may in fact represent a real breakthrough in terms of how people interact with these systems.
So, a few years ago, you know, we started to really focus on our research and how to bring it together to show what it might be like as we assemble these technologies to do different things. And so, to kick things off, I want to run a two-minute video, which just basically is a set of snapshots of a number of the research projects that are underway at Microsoft, and from which we have adapted some of these technologies for one of the principal things I’ll show you today. Would you go ahead and run the video?
(Break for video presentation.)
CRAIG MUNDIE: So these are just a collection, and, in fact, these are mostly all the ones where we’ve already done some publication of the work that we’re doing. And I just thought I wanted to highlight a few of these because as I go through the rest of this demo, and I’ll show you things that are really progressing into a tangible form, it’s all the kind of technologies that are represented here that show the range of research that is still going to be required in order to really make these things come to life and find many new forms of application.
So, let me just talk through a couple of these. One of the things you saw here is a technology we call immersive interaction. And what you’re actually looking at here is a person looking at a projection, and what you see coming in from the other side is at least the hands and arms of somebody else who’s completely in a different place. So here, they’re playing a virtual game, and, as they move the pieces around in this virtual environment, you kind of have at least the beginnings of a sense, well, you know, you can actually see and sense and understand what the other person is doing.
This will be the beginning of what we think of as more and more robust forms of telepresence where there will be a lot of very powerful forms of interaction at a distance, and this brings together both machine vision capabilities and a lot of novel display technologies. And so, we think that this is going to be an area that’s going to be particularly ripe for improvement.
The second one I’ll just highlight here is what we call gestural interaction. And, in the demo, we actually showed an experiment we’ve been doing with some people in the medical field. One of the big challenges in using computers, particularly images, within the operating theater is the question of how do you maintain a sterile environment and still allow the surgeon who’s actually operating to control these things that are increasingly important in their ability to actually perform the surgery.
And so, we took a prototype of the machine vision capability that has ultimately been productized, and I’ll show you in a minute in terms of this Kinect sensor, and used it to allow the doctor to basically make gestures. So, there’s actually no physical contact with anybody, but the surgeons themselves can control, in many ways, a rich interaction with the graphical imaging and display environment in the operating theater.
This one we call situated interaction. What you see here is the use of our robotics technology, but in this case instead of trying to create an anthropomorphized machine or control some physical device, what we’ve realized is that it may be as important to create robotic avatars to perform tasks or to essentially control interaction with people, as it is to control factory floor machines or other types of actual mechanical robots.
So, here is one of our first experiments. We built a virtual receptionist that we placed in some of the lobbies at Microsoft, which would arrange shuttle transportation across the campus. And the goal was to see whether the average person could walk up and address what appeared to be the picture of a person on a screen with all the same, you know, facility that they would have historically walked up and talked to a human who would then pick up a phone and arrange those shuttles.
And, in fact, this was quite successful. So, in this environment where you understand the particular situation, and yet you recognize you’re building a composite of a huge array of very complex, real-time interactions, the ability to build that as a large distributed system and to use robotics as the methodology for assembling them turned out to work quite well.
After seeing this one, I went on and told the people who did this at the research group that the next thing I wanted to build was a robotic doctor. And, in fact, we built a triage doctor using this same kind of technology and gave it a machine learning system and a medical database and a Bayesian inference system with which it’s able to actually do diagnoses. The goal in this was to experiment with whether we could ultimately put a triage nurse, if you will, in a rural village at the cost of literally a personal computer who would have access to much of the world’s medical knowledge, and perhaps with just a paramedic-level person, would be able to elevate the level of clinical care or at least medical care that was available from nothing to something that would be quite sophisticated.
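As a rough illustration of what Bayesian inference over a medical database means here, the following is a minimal sketch, assuming a tiny naive Bayes model. The conditions, symptoms, probabilities, and the `posterior` function are all hypothetical and invented for illustration; the talk does not describe how the actual triage system was built.

```python
# Hypothetical illustration only: the conditions, symptoms, and numbers
# below are invented. They are not medical data and not the system
# described in the talk.

# Assumed prior probability of each condition before any symptom is seen.
PRIORS = {"flu": 0.10, "cold": 0.30, "healthy": 0.60}

# Assumed P(symptom present | condition). Symptoms are treated as
# conditionally independent given the condition (the naive Bayes assumption).
LIKELIHOODS = {
    "flu": {"fever": 0.90, "cough": 0.80},
    "cold": {"fever": 0.20, "cough": 0.70},
    "healthy": {"fever": 0.01, "cough": 0.05},
}


def posterior(observed):
    """Return P(condition | observed symptoms) using Bayes' rule.

    `observed` maps a symptom name to True (present) or False (absent).
    """
    unnormalized = {}
    for condition, prior in PRIORS.items():
        p = prior
        for symptom, present in observed.items():
            p_symptom = LIKELIHOODS[condition][symptom]
            p *= p_symptom if present else (1.0 - p_symptom)
        unnormalized[condition] = p
    total = sum(unnormalized.values())
    # Normalize so the probabilities over all conditions sum to 1.
    return {c: p / total for c, p in unnormalized.items()}


if __name__ == "__main__":
    result = posterior({"fever": True, "cough": True})
    for condition, prob in sorted(result.items(), key=lambda kv: -kv[1]):
        print(f"{condition}: {prob:.3f}")
```

With these made-up numbers, a patient reporting both fever and cough is ranked most likely to have the flu. A real system would need far richer models and clinically validated data.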
And I think that it’s these kinds of applications that are going to be critically important as we try to get beyond about 2 billion people who have access to computing and communication today and address the societal needs of so many people, another four-and-a-half billion today, and probably another two-and-a-half billion beyond that by the time the planet’s population goes asymptotic at about 9 billion people, and that’ll happen over the next few decades.
So today, there’s no way to get healthcare and education like we know it in the rich world scaled to another few billion people. We can barely find a way in the richest country to provide healthcare for all Americans, let alone an extra 6 billion. And so, I think it’s going to be this very high-volume technology delivered largely as consumer electronics and almost at consumer electronics price points that are going to become the infrastructure on which the next era of application development will be done.
But if we’re going to go into those rural villages, you know, and provide — whether it’s education or healthcare or training for improved farm productivity — we’re going to have to find a lot better way than pointing and clicking for those people to get these benefits, and the tasks will have to be more complete in what the technology provides.
Up to this point, computing has largely been a tool, and if you were trained in the use of the tool, you could get some amazing results, and the tools have become very, very powerful, but they’re still very tool-oriented. Our dream is that, increasingly, the computer systems themselves will be able to assist in more robust task completion and that in order for that to happen, we’re going to have to have a higher-order way for you to express what you want and a higher-order way for the system to deliver it back to you.
And so, in a sense, the way that we think this happens is that the computers become more like us. And the arrival of these new display technologies, new sensing capabilities, and extremely powerful computational facilities at low price and the global connection of the network and the cloud, all these things are essentially creating a new canvas on which we can paint a picture of computing and its uses that’s quite a bit different than that which we’ve known today.
So, let me just move on to one more slide by way of introduction. At the very end of that video, you saw a guy jumping around in front of a camera, and, in fact, that camera was a version of what you see on the stage here, which is — this is called the Kinect sensor. So, this is actually a production version of the sensor that we built, and our first application of this technology is going to be as a controllerless gaming environment for the Xbox console. And it will actually be available in November commercially.
The development of this sensor is an interesting story in that a few years ago, Microsoft of course had become fairly prominent in the gaming world, along with Sony and Nintendo, and Nintendo came out with the Wii, which was very immediately popular, and the people in our game business had to think and say, “Well, should we do something like the Wii or is there something that would be qualitatively different and better?” And of course the answer was, “Well, if you thought the controller movement was good, then being able to do all that stuff in general with no controller, well, that would be a lot better.”
And so, they sat around and did some thinking about that and, with a normal analysis of what seemed to be possible, they kind of concluded that right now that may not be possible. But they went over and talked to the research people at Microsoft, and today we have about 900 PhDs in computer science who do pure research, and we’ve been doing that for 18 years. And, lo and behold, they found, as we frequently do, that if you looked in a multidisciplinary way at all the different components that were done in research, we were able to find a set of technologies that, when we integrated them together, gave us the prospect of actually solving this problem and doing it on a fairly immediate basis.
So, in essence, this sensor going into production went from impossible to production in less than three years. And you know, I think that this is the kind of innovation that’s really important, and, after I do the demo, using it, I’ll come back and explain a little bit more to you how it actually works.
But, you know, in essence, what we decided we had to do was find an economical way (this thing sells for around $150) of being able to not only see in the traditional sense that a computer sees with a camera, but it had to be able to see reliably in the dark and in three dimensions. And we also wanted to introduce a very much improved audio input technology so that we could combine speech along with gestures for a natural way of interacting with the computer system.
So, that’s what was actually done. And so, what I’m going to do now is suggest that this demo (and I’ll just make one more comment) is going to be done in 3-D. And so many of you say, “Hey, you know, so what’s so cool about 3-D? You know, I went down and saw ‘Avatar’.” And so, I’ll tell you what’s interesting is that everything you’ve ever seen in 3-D was done in a preproduced environment where, using either huge server farms or a lot of complexity and a lot of post-processing, they actually create the movie, and then that movie is staged as a serial stream of bits that goes into projection systems that know exactly how to take that and observe it.
And you can see that on home screens now with shuttered glasses, and you can go to movie theaters and see it using these polarized glasses. What nobody’s really done before is to take real-time 3-D, compute it in a personal computer, and essentially drive it out through one of these theater-grade cinematic displays. So, as far as we know right now, there are exactly five of these projectors in the world, and we have two of them here. And so, this is the first of these speeches I’ve given this year, so you’re about to see something which I hope works. (Laughter.) And truly, very, very few people have ever seen anything like this. So if you put the glasses on, we’ll start up the demo and show you what this is actually like.
So, as I approach the camera, it senses that I’m here and so what you’re actually looking at is just sort of my screensaver or desktop image, and, in the future, that thing will actually have depth, as you can perceive it here. But I’m going to use a gesture, in this case, to raise up the menu system. There it is. And then I basically can steer around in this menu environment by just using my hand.
The first thing I want to do is stop on this information section and, by hovering there, it’ll start that. So it pops up a birthday reminder. Says my aunt has a birthday in three days and do I want to buy her a gift? And so I’m going to in this case use a verbal command and say, “Computer, show me Aunt Sophie’s wish list.” So in the future world, Aunt Sophie basically collects things and puts them in an environment that is meaningful to her. So she likes a traditional environment, has this sort of 3-D space, and she places things around as a way to remember the things that she’d like to have people consider giving to her.
But, of course, this isn’t particularly helpful to me as I’m trying to select a gift. Some of these things if she was buying for herself, you know, like shoes, she could go and buy it — it’s very difficult for me to do. So, I’m going to ask the computer to essentially take all these items and cluster them and reorganize them. So, “Computer, reorganize the wish list into a form that’s more for my analysis.”
So, all the items get extracted. All the shoes get taken away and converted into a gift card, you know, the different items get put together by different categories where we might shop for these things, and I can now use gestures to examine each of these things. So, if I use my hand, I can essentially make different selections as I move around in this environment, but I think maybe what I’m interested in is these pasta machines back there. So, I’m going to pick up the orange pasta machine. So, I’m going to bring it forward with a gesture. It basically brings it out; it hovers it here in free space.
As you can see, the thing actually — at least to most people — looks like it’s out in front of the screen. So, in this environment, I can now use commands — I can read about it, I can see the metadata about it, but I can use gestures, for example, to essentially expand this thing into a 3-D model so that I can look at the device. Well, this one looks pretty complicated for Aunt Sophie. You know, I can turn it around so that I can look at it in different ways. And say, no, that one, I think, is not good.
So, let me close that one up and put it away. And I’ll go over here, and this other one looks a little simpler. So, let me pick up this green one and look at that. I will essentially expand this one and that looks a lot simpler, fewer moving parts. You know, I can turn this thing around and look at it, that all looks pretty good. I can turn it back around the other way and say, “OK, that’s the one I like,” so I’ll basically swipe it down and put it into the box, and away it goes.
So, I bought that for Aunt Sophie, and the transaction will be completed. “Computer, I’m done shopping.” So, it takes me back to my sort of home page and, in this case, the next thing I want to show you is how this might be applied in a futuristic environment related to entertainment. So I’m going to reach up and essentially select the entertainment category.
And there are a variety of things I could do — play games, look at things — but one of the things we’ve actually had some people thinking about is what might it be like some 10 or more years in the future when we have the ability to have sort of interactive media and to use multiparty gaming, but in a rich, virtual, video environment.
So here I’m going to select the trailer and say let’s — “Computer, play the trailer for ‘The Spy from 2080’.”
(Video segment, movie trailer.)
CRAIG MUNDIE: So, while we think the traditional forms of entertainment won’t go away, these kinds of technologies, the combination of large-scale social networking, the ability for people to interact and have more and more of these sophisticated display environments, really do create an environment where there at least may be a genre in the future where these kinds of missions are really defined by the involvement of the crowd itself.
So, I’m going to pick the missions and move into this environment to play. So here, I’m going to essentially enter a 3-D environment. The people that you see in the park are other people playing the game. The two in the middle I sort of recognize as people that are part of this, and so I’m going to basically make a gesture and start to walk around in this environment.
So here, we’ll increasingly see very sophisticated 3-D environments. You’ll be able to move around and navigate in them. Here, given the computational capability we have, we’re still using fairly weak forms of avatars to represent us, but through them, we can actually all be in a common virtual space even though we’re actually in a different space.
VOICE FROM VIDEO: Hey, finally, you made it.
CRAIG MUNDIE: Yeah, hi, glad I could come and visit you.
VOICE FROM VIDEO: Hey, check it out: We found this piece, and it’s just like this other one I found a few episodes ago.
CRAIG MUNDIE: Wait, wait, wait, which one was that?
VOICE FROM VIDEO: Oh, well, it was the one where she drew that marker on the garden wall, ring a bell? OK. Well, here, let me just show you.
(Video segment plays.)
VOICE FROM VIDEO: OK, so anyway, I went to that wall, and I found one just like the one we found today.
VOICE FROM VIDEO: Yeah, so we’re thinking that they’re probably part of a set. I don’t know, maybe there’s more around here if we looked.
CRAIG MUNDIE: Yeah, why don’t you throw me that one. Can you throw me that one? OK. So now I can look at this and essentially manipulate it. It looks like there’s some structure to this thing. And as I rotate it around in different configurations, you know, I may find that it actually converts it into a clue that we could all use. And so, when you get it in exactly the right orientation, it plays a video.
(Video segment plays.)
CRAIG MUNDIE: These things are each individual video pieces that you can only see when you get them in the right orientation. Here, I’ll throw this back to you guys and you can essentially watch it. OK, I’ve got to get back to work here.
VOICE FROM VIDEO: OK, right, well, go ahead, we’ll talk to you later.
CRAIG MUNDIE: See you later, yeah, sorry, I didn’t want to bother you. (Laughter.)
So, in this environment, you know, we expect to see a lot more interesting things happen, but we think we’ll find quite a bit of other interesting applications for this.
So, what I’d like to do next is light up this screen over here and show you how this sensor is actually working. It has a multi-array microphone in it, it has a traditional video camera in it, and it has an infrared time-of-flight sensor. They’re all in this package. It actually has a pan-tilt control so that under the control of the software, the sensor can essentially find the people in the room and look at them.
Now, in this case, what you’re seeing on this screen on the left is the color image. Now, it turns out if the lights were on in the audience, you’d see yourself. And what’s being placed in front of you is essentially the skeletal model of me. Here, on this side, you see a depth map. This thing is actually being done in the infrared domain, and it’s not just done as a static, flat video image capture, but in fact, it’s actually seeing in three dimensions.
So, as I move closer to this, you can see my color changes and obviously all my hand movements, you know, you can see are being tracked in real time.
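The color change in a depth map can be pictured as a simple mapping from distance to color. The sketch below is a minimal, hypothetical illustration: the `depth_to_rgb` and `colorize` helpers, the red-to-blue ramp, and the 800 to 4000 millimeter range are assumptions, not details of the Kinect sensor or its software.

```python
def depth_to_rgb(depth_mm, near_mm=800, far_mm=4000):
    """Map one depth reading (millimeters) to an (r, g, b) tuple.

    Assumed color ramp: readings at or nearer than `near_mm` are pure red,
    readings at or beyond `far_mm` are pure blue, with a linear blend
    in between. The range limits are hypothetical.
    """
    clamped = max(near_mm, min(far_mm, depth_mm))
    t = (clamped - near_mm) / (far_mm - near_mm)  # 0.0 at near, 1.0 at far
    return (round(255 * (1.0 - t)), 0, round(255 * t))


def colorize(depth_map):
    """Convert a 2-D list of depth readings into a 2-D list of RGB tuples."""
    return [[depth_to_rgb(d) for d in row] for row in depth_map]


if __name__ == "__main__":
    # A made-up 2x3 depth frame: left column close, right column far away.
    frame = [[900, 2000, 3900],
             [850, 2100, 4000]]
    for row in colorize(frame):
        print(row)
```

In this sketch, a person stepping toward the sensor would shift from blue toward red, which is one simple way to get the kind of color change visible on the screen.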
So, our ability to create these kind of models allows us to then take gestures or have any particular program create a set of gestures that work in this environment. Now, what I did is actually I stepped into this thing. Here, I put up a 3-D model of a motorcycle, and you can see that the dots on the screen actually represent my hands. And what I’ve done is we’ve arranged this so that when I stand at this distance, my hands don’t break a plane that’s just in front of me, so I move these things around and they don’t do as much.
But if I actually move into the plane, then the commands actually control pan and zoom, if I do this near my belt buckle, and, if I actually raise my hands up, I can do like I did earlier with the pasta maker, hopefully, and get it to essentially explode the bike into multiple pieces. And then I can essentially pan and zoom into that, too.
So, this is a way of — here it comes — of giving me a lot of ways of dealing in very natural ways with very complex environments. So, if I want to essentially make it zoom out and, you know, close up, then I can do that, too.
So, this is just to give you an idea of how we’re going to apply these technologies. If I want to step back, the bike should essentially deposit itself back on the floor. So, all that’s done by essentially the ability for the software to determine not only my gestures, but where I’m doing them in space relative to the camera, and all of those things, my position, my orientation, the actual movement of my limbs, all become inputs to, in this case, a program to navigate and manipulate that kind of 3-D model.
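The idea of a plane in front of the body that the hands must break before a gesture counts can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `Hand` type, the `classify` function, the 0.35 meter plane offset, and the rule that a hand at or above shoulder height means "explode" are all hypothetical, not the demo's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Hand:
    """One tracked hand position from a (hypothetical) skeletal model."""
    y_m: float  # height above the floor, in meters
    z_m: float  # distance from the sensor, in meters


def classify(hand, torso_z_m, shoulder_y_m, plane_offset_m=0.35):
    """Return 'idle', 'pan_zoom', or 'explode' for one tracked hand.

    The interaction plane is assumed to sit `plane_offset_m` in front of
    the torso, on the sensor side. A hand only issues commands once it
    has reached through that plane.
    """
    plane_z = torso_z_m - plane_offset_m
    if hand.z_m > plane_z:
        return "idle"  # hand is still behind the plane, so ignore it
    if hand.y_m >= shoulder_y_m:
        return "explode"  # raised hand: break the model into its parts
    return "pan_zoom"  # lower hand, e.g. near the waist: pan and zoom


if __name__ == "__main__":
    torso_z, shoulder_y = 2.0, 1.4  # made-up body measurements in meters
    print(classify(Hand(y_m=1.0, z_m=1.9), torso_z, shoulder_y))
    print(classify(Hand(y_m=1.0, z_m=1.5), torso_z, shoulder_y))
    print(classify(Hand(y_m=1.6, z_m=1.5), torso_z, shoulder_y))
```

Measuring the plane relative to the torso rather than the sensor means the same gestures keep working as the person steps forward or back, which fits the point that position and orientation are inputs alongside limb movement.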
So, all of these things, I think, are quite interesting, and they’re just the tip of the iceberg in terms of the kind of stuff that we think we’re going to be able to do. So, you can take off the glasses now.
You know, just in summary, I want to talk a little bit about the importance of these transitions. I think that getting computers so that the average person can use them, whether for entertainment purposes, for educational purposes, or for these important applications, the societal challenges in things like healthcare or education, I think that these transitions are important in enabling those to move forward.
That said, as the earlier video showed, the range of — I’ll say — academic and engineering pursuits that are going to be required in order to make this stuff commonplace and economical is really just an incredible array of opportunities. You know, I showed a few of them today in terms of the research that we’ve already published. You know, we obviously have a lot more that’s going on there.
In part, I show this to you, you know, recognizing that in this audience and at a school like Duke, you have people that are both in the sciences and engineering disciplines and people that are in the liberal arts and business environment. And I think no matter which of those domains you happen to be focused on, these things have an impact in every area. If you’re thinking about business, you know, this is going to change the way many people will interact.
If you’re thinking about creating new businesses, these technologies, taken in concert with some of the microprocessor and silicon systems-on-a-chip advances that will all be commonplace in the next few years, may have an industry impact that’s as big as when Intel created the 4004, you know, 35 years ago. That was sort of the seminal event that created Microsoft and Intel and the semiconductor industry we now know, and I think that we are looking at a situation where we could have an impact as big as that again.
So, I think that there are a huge array of opportunities and yet a huge set of challenges that I certainly would welcome your help and thoughts about. You know, Duke I know has made investments, particularly in large-scale visualization capabilities. You know, this screen, that’s also right now one of a kind. As far as I know, it’s the only portable 3-D theatrical projection screen in the world.
When we went to get them made, we said, “We want one of these things, and we want to be able to roll it up and move it around.” And they said, “You want to do what?” You know, you don’t understand, you make one of these things, you stick it up in the theater, you never touch it again.
But it just shows that there are people and there are ingenious ways of building these kinds of solutions, whether it’s the raw technology or the applications of them, and certainly we would welcome your interest and participation in that.
So, let’s go on to the slide. I have one closing thing. So, this cool device, which people are going to be able to buy, we’re going to give two of them away in here today. If somebody wants to bring out the bowl full of numbers. All of you were entered into this as a door prize. OK. Here, I’ll hold it, you pick. They’re all your friends.
HOST: Which is why I’m going to get into trouble.
CRAIG MUNDIE: OK, I’ll pick, then. All right, we’re going to pick these things up and stir one around. So, the first lucky winner is 524-170. Close but no cigar, 170? You have to be present to win. (Applause.) OK. You don’t have to come down here now, we’ll find you.
Let’s see, the next one here, 524-173, 173? Anybody got it? Right there, OK. (Applause.)