Craig Mundie: University of Toronto

Remarks by Craig Mundie, Chief Research and Strategy Officer
University of Toronto
Toronto, Ontario, Canada
October 6, 2011

SVEN DICKINSON: Good morning. My name is Sven Dickinson. I’m the chair of the Department of Computer Science, and it’s a real pleasure and an honor to introduce our distinguished visitor, Craig Mundie.

Craig is chief research and strategy officer at Microsoft Corporation. In this role he oversees one of the world’s largest computer science research organizations, and is responsible for the company’s long-term technology strategy.

Craig has spent much of his career building startups in various fields, including supercomputing, consumer electronics, health care, education, and robotics, and he remains active in incubating new businesses.

For more than a decade he’s also been Microsoft’s principal technology policy liaison to the U.S. and foreign governments, with an emphasis on China, India, and Russia.

Another longstanding focus for Craig is privacy, security, and cybersecurity. Based on this work, he’s served on the U.S. National Security Telecommunications Advisory Committee, and the Markle Foundation Task Force on National Security in the Information Age.

In April 2009, Craig was appointed by President Barack Obama to the President’s Council of Advisors on Science and Technology.

It’s a big day for the Department of Computer Science and the University of Toronto, for we have the tremendous honor of hosting Craig on his first visit to a Canadian university. Please join me in giving a warm welcome to Craig Mundie. (Applause.)

CRAIG MUNDIE: Thank you, appreciate it.

Good morning, everyone, and thank you. It’s great to be in Toronto, and to be able to talk to you today.

In the next hour or so I’d like to do several things. One is to explain a little bit about the vision that we have for some of the major trends that are happening in the field of computing and some of the implications of those things. Also at the end of that I’ll stay around and we’ll have a Q&A session. I’m happy to talk about anything that you all would be interested in discussing.

This idea of visiting universities and both sharing ideas and listening to what’s happening and what’s on the minds of people is something that both Bill Gates and I have had a very long interest in, and it’s been well more than 10 years that both of us have taken the initiative each year to go out and visit schools.

Each time I go, it’s not just to sort of give a lecture, I really want to learn and understand sort of what are the current issues and thoughts that exist on the campuses around the world, listen to the young people, if you will, also talk to the faculty and administration of the school.

So, I had already this morning a roundtable with a group of faculty members from a number of the schools, and as that discussion made evident, the topic of computing is no longer localized to a technical discussion. You know, the things that we’re doing really are creating a new societal infrastructure. And while the technological rate of evolution remains very high, the ancillary effects of the global adoption of these technologies produce all kinds of interesting new and challenging problems.

The job I have at Microsoft is really interesting to me, and I guess that’s why I keep doing it at my ripe old age now. But I get to sit on one hand at the bleeding edge of computer science, and that thing is at the heart of a lot of what’s going to affect our society going forward. On the other hand, I’ve had the luxury of serving as a liaison to many major governments, the last three U.S. presidents as an advisor, and I’ve immersed myself for more than 15 years into the question of technology policy, and this is becoming an interesting set of challenges as well. And when you start to blend these things together, it’s a fascinating cocktail.

I want to talk a little bit about the future of computing, but I think today sort of I’d be remiss in not stopping and reflecting for a moment on the passing of Steve Jobs yesterday. You know, I did know him personally, not super well, and, in fact, the times that I worked with him was actually in the interval when he had left Apple the first time and before he came back the second time. But many of us know him, and Bill Gates knew him for 30 years, and had tremendous respect for Steve and what he accomplished.

I think when you look at what he achieved, it’s a real reminder I think of two different things. One is the influence a single individual can have; it’s just tremendous. I’ve had the luxury personally of knowing people like Andy Grove, Bill Gates, Steve Jobs, and others, and each of these people came from an environment that you wouldn’t have predicted would land them in the roles that they had and the impact that they had, and I think it’s always a reminder to all of us that you should — as I think Jobs was quoted as saying in a speech sometime in the last year or so, you know, go live your life and make something happen, don’t live the life other people want to plan for you, and I think it’s important to remember that.

And it’s just incredibly important to realize that you can assemble technologies in interesting ways that produce revolutionary effects, and we, all of us who get to work in these technology fields, have that gift in front of us, and it’s moments like this where you have to stop and reflect on that a little bit.

So, I wanted to share that passing thought before we launch into a little bit of a look at the future.

So, I want to talk about three things this morning. The first is this topic that’s now euphemistically known as big data. The second topic that I’ll talk about after that is the intersection between the physical world that we all grow up and live in, and the emerging cyber-environment or the virtual world, and how these two things are now becoming more and more intertwined.

And the third thing I’ll talk about is the emergence of what we call the natural user interaction or natural user interface model and its importance in solving a lot of interesting problems, and really empowering people to get a lot more value out of the computing capabilities that we have.

So, let me begin first and talk about big data. You know, data has been around a long time. If you go back in the evolution of science and the evolution of the scientific method, you know, in the beginning people didn’t have instruments, they had theories. They were grounded in the mathematics of things, perhaps. And the scientific method evolved to say we should look for experimental proof of the theory. And a little later, as we began to get computers, we had a lot of focus on modeling, and we built bigger and bigger computers and fancier and fancier models.

But the Fourth Paradigm, as we’ve called it in a book, a monograph that the Microsoft Research people produced a couple of years ago, the Fourth Paradigm is one where you really are doing science on a data-driven basis. The preponderance of — or the emergence of huge capabilities in sensing and sensor technologies, the ability to now aggregate this and for the first time have computational resources sufficient to ingest these huge amounts of data and to do analytics and correlations on them, has been really an incredible new capability.

And the emergence of these hyper-scale facilities for storage and computation, which really were an outgrowth of trying to build computer systems big enough to provide Web services on a global scale, has now emerged as the cloud.

So, that also has a democratizing effect. It used to be that only the largest companies or, in fact, governments had very, very large computing facilities in any given era, and the idea that a university would have regular access or even an individual or a small company just really wasn’t in the cards.

But now these hyper-scale facilities that we’re building sort of as the backbone of the Internet are now being organized and rented out, you know, on your Visa card, if you will, and so an individual has access literally to the same kind of computational facilities, in fact larger computational facilities than even any government does today.

You know, I travel around a lot. I’m not aware of a government anywhere in the world that has yet assembled computer systems as big as the ones that Microsoft and Google and Amazon and others, who are building these things for the Web service environment, have now actually assembled and operate every day.

So, there’s a real inversion coming, and the ability to use that to analyze data I think is incredibly powerful.

So, let me start and show you a demo. So, this just looks like a normal Excel spreadsheet, but one of the things we’ve been doing to Excel in the last few years is to sort of take the bounds off it so that there’s no limit to the number of sheets or the size of a sheet.

And part of the reason we wanted to do that was anticipating the arrival of these very large datasets, things that weren’t constrained by what you might have assembled yourself or just collected out of the data that would have been available within your corporation or your experimental environment.

So, up here on the right we now have also started to see emerge from us and other places data markets where very, very large datasets can be made available, published by governments — the U.S. government has a huge initiative in this they call data.gov, where they aim to take every sort of non-classified piece of data the U.S. government has retained over its history and put it out there for other people to use. Businesses are doing the same thing.

So, if you now just click on this button, you can log into an account, and through these accounts you can basically subscribe to datasets. Much as in the past you might have subscribed to an RSS feed for news or something, you can now subscribe to datasets.

But the magnitude of these things is really amazing. So, for example, all the census data that the U.S. just did, it’s all available as a dataset. And if you have a big enough pipe, you can click on these things and just pull down huge amounts of information, and start to cross-correlate that. So, now public datasets, private large-scale datasets, and even your own personal data assets or corporate data assets now want to be analyzed together.
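To make the mechanics concrete, here is a minimal Python sketch of what “subscribing to” and cross-correlating a published dataset looks like from code rather than from Excel; the URL, file names, and column names are hypothetical stand-ins, not the actual data market feed used in the demo.

```python
# Minimal sketch: pull a published dataset over HTTP and join it with local data.
# The URL, file names, and column names are hypothetical placeholders.
import pandas as pd

DATASET_URL = "https://example.org/data/us_weather_observations.csv"  # hypothetical feed

# Limit the download, much like asking for only 500 results in the demo.
weather = pd.read_csv(DATASET_URL, nrows=500)

# Some data of our own, e.g. store visits per city and month (hypothetical file).
local = pd.read_csv("my_city_metrics.csv")

# Cross-correlate the public observations with our own data on shared keys.
merged = weather.merge(local, on=["city", "month"], how="inner")
print(merged.groupby("city")[["precipitation", "visits"]].corr())
```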

So, people who have the skills in sort of thinking about and manipulating a spreadsheet can now apply those same skills to these huge datasets.

So, for example, I’m going to for this demo say I want to import some historical weather data. And since this is short and I’ve already got some of it, I’ll just show you. So, I’ll just say I’ll limit the number of results. Let me just download say 500 of these and import this data.

So, what this is, is essentially weather data from the United States that is sort of all the observations that were taken over a long period of time. I think in this case it was like 30 years. And just at the same time I have taken a big slug of this data, and imported it previously, and I built a spreadsheet out of it.

And here you can filter it, you can analyze it, you can do all kinds of things you want. You can build Pivot Tables. And, of course, from that you can use standard charting techniques.

So, for example, this is actually a chart built out of this huge amount of rain data, and what I did is I extracted from it a graph over a long period of time, in this case I guess several years, of the precipitation data for the places I was going to go and speak at in this particular sequence of university visits. So, I was at Northwestern, I’m going to go to Georgia Tech later, Montreal tomorrow and Toronto today, and then good old Seattle is home.

So, by looking at this, you know, you can see who gets the most rain, who doesn’t, but you also can start to see anomalies. So, for example, in Illinois where I was a couple days ago, say, well what is this huge spike in their precipitation? So, you can take this and go out and do other things like Web searches, and, in fact, when we did this, we clicked and found, sure enough, there was this unusual Hurricane Ike that year, went inland in the United States, and drenched the middle part of the country, and they had this huge spike in precipitation data.

You know, you wouldn’t have known it, but if you think back and say, you know, even a few years ago, how likely would it have been that you could have figured this out, absent the ability to search generally on the network and be able to look at these patterns and use visualization techniques to find them?

You know, here’s another example in the blue line here, and you say, you know, what was it that happened here? It was in Atlanta. There was basically a weird storm that just dumped a huge amount of rain that they’d never seen before, and you can find these anomalies.

So, I’m just showing that, you know, the ability to correlate these things is just a relatively simple example of insights that can be gained by taking these huge datasets and being able to do your own analytics and discovery on them.
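The “spot the Hurricane Ike spike” step can also be automated rather than eyeballed from a chart. Here is a rough sketch, assuming a table with city, month, and precipitation columns (the column names and file name are assumptions, not the demo’s actual schema):

```python
# Sketch: flag precipitation spikes (like the Hurricane Ike month) by z-score.
# Assumes columns: city, month, precipitation. The file name is a placeholder.
import pandas as pd

df = pd.read_csv("precipitation_by_city_month.csv")

# Long-run average and spread of precipitation for each city.
stats = df.groupby("city")["precipitation"].agg(["mean", "std"]).reset_index()
df = df.merge(stats, on="city")

# Anything more than three standard deviations above a city's average is "unusual".
df["zscore"] = (df["precipitation"] - df["mean"]) / df["std"]
anomalies = df[df["zscore"] > 3].sort_values("zscore", ascending=False)
print(anomalies[["city", "month", "precipitation", "zscore"]])
```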

But while this kind of graphical analysis has been common, as the datasets get larger, and we’re really trying to learn things over much longer periods of time, we also have developed some other new capabilities that are starting to be quite fascinating.

So, let me show you this one. This is kind of a combination of a powerful visualization system, a geographical navigational system, and sort of a multimedia storytelling system.

So, if I was studying this data and I wanted to share some insight about this with you, it turns out what we mapped here was on the western half of the United States 30 years’ worth of that same rainfall data, and it sort of both color coded and height encoded so that you get a graphical sense of how much precipitation is there.

But the people who put this together built a fly-through of the data. So, they sent me that. So, I click on the play button, and it drives me through the model, flies me over the West Coast, allows me to look down and say, well, there’s sort of a valley here and that’s sort of interesting in its own right. And you can fly through the data and look at different patterns.

Now, at any point I can stop this data, I can jump to predetermined places, go way back out or look at a particular point, but I can also navigate in it, so I can manipulate it.

So, for example, if I roll up here a little ways and tip down, working my way up toward BC but notably home here in Seattle, one of the things people always think is Seattle is really wet. But it turns out it’s nowhere near as wet — as this shows, because this is us right here — as just to the East and just to the West, and what happens is there’s mountain ranges that create that spine in the United States, and they trap all the humidity that comes in, and it ends up as precipitation: rain, we have a rainforest on the left side, and snow on the Cascades.

So, whether you’re trying to teach people this or understand the effects of geography, these kind of powerful visualization tools are really kind of interesting, and the combination of them with this multimedia authoring and storytelling I think is also going to be an important part of how people will communicate.

But as we get these bigger datasets, we’re finding that there’s a lot of features that you can extract from them, but humans can’t do it, even with powerful visualization capabilities.

I was in the faculty discussion this morning telling them that we have used machine learning on large medical datasets to be able to discover things like why is it that people actually get readmitted to the hospital so frequently and what could you do about it on a preventative basis.

But all of these things I’ve shown are operating on traditional — I’ll call them numerical datasets, but more and more we have non-numerical things that are of importance to us.

So, I brought another tool that has been developed. Here is essentially a human torso that has been imaged using — I forget whether this is a CAT scan or an MRI, but basically it’s a slice-based model, and the way a radiologist or a doctor, of course, would deal with these things in the past is they would just have sort of these top two and this bottom left views, which are the three axes, and they can essentially move around in the slices. So, this kind of moves up and down, this one moves back and forth, and this one front to back.

But you really have to know what you’re doing to try to learn much about any one of these things, because you can’t see anything more at any given instance than one point, one slice. So, people go through a lot of training in radiology and interpretation skills to do this.

Realizing this was getting hard as this sort of three-dimensional imaging became more economical and popular, on the bottom right you have an example of — if I roll out a little bit, this is actually the whole torso built in a sort of three-dimensional model.

But even here you have to be pretty expert at knowing what’s what as you begin to fly through this three-dimensional model, and even just changing the scale can mess people’s minds up about where they are.

So, what we started to ask ourselves is could we apply the machine learning capability to non-numerical data and teach it how to recognize and discriminate these different kind of patterns.

So, what I’m going to show you here is this has an algorithmic mechanism, and in this case we’ve already done the machine learning part to build the discriminating functions so that this model basically takes any one volumetric scan like that, and it will actually analyze it and find all the individual organs.

And so it now gives me a list of the ones that it was able to find out of that voxel image. And if I actually pick one, and look at it, so like if I pick the pelvis here, boom, there’s the entire pelvis, completely extracted from all the slices, re-aggregated, and then color-highlighted against everything else.

So, all the individual organs, both the skeletal ones and the soft tissue ones, the computer is now able to find them in any one person’s scan.

I can look at, for example, the lungs, or your kidney, and so things that would be really hard to find, and for the uninitiated I’ll argue impossible, anybody can now find. So, this is a very powerful capability.
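The talk doesn’t spell out the algorithm, but the general shape of this kind of system is a per-voxel classifier trained on scans that experts have already labeled. Here is a deliberately simplified sketch of that idea on synthetic data; real systems use far richer image features and much more training data.

```python
# Sketch of per-voxel organ labeling: train a classifier on voxels from labeled
# scans, then predict an organ label for every voxel of a new scan.
# The features here (intensity plus coordinates) are a simplification, and the
# data is synthetic, not real CT or MRI data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def voxel_features(volume):
    """Turn a 3-D intensity volume into one feature row per voxel."""
    zz, yy, xx = np.indices(volume.shape)
    return np.column_stack([volume.ravel(), zz.ravel(), yy.ravel(), xx.ravel()])

# Synthetic "training scan" with known labels (0 = background, 1 = organ A, ...).
train_volume = np.random.rand(16, 16, 16)
train_labels = np.random.randint(0, 3, size=train_volume.size)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(voxel_features(train_volume), train_labels)

# A new scan: predict a label for every voxel, then pull out one "organ" as a mask.
new_volume = np.random.rand(16, 16, 16)
pred = clf.predict(voxel_features(new_volume)).reshape(new_volume.shape)
organ_mask = (pred == 1)  # analogous to "pick the pelvis" in the demo
print("voxels assigned to organ 1:", int(organ_mask.sum()))
```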

Let me just show you a video, because it’s easier than me trying to drive this a lot, of how this actually is getting applied where not only do we teach it to find these things, but here it said find the aorta and the kidneys, and you get to see them together. So, you can now have it find and then highlight particular structures within the body, and you can then manipulate them as if they were aggregated objects. Here’s a lesion in the right lung, importantly the liver and spleen.

The other thing we’re able to do now is to take an example of this, let’s say you were examining this and you found a lesion in the lung. You can now tell the system, go through all of the people we’ve ever seen and find anybody else who had a lesion like that one in that lung.

So, now you’re doing an image search by specifying a medical condition by example, and saying, find me everybody else who’s ever had that. And the ability to do that back and forth in time is just an incredibly powerful capability.
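One plausible way to implement “find anybody else who had a lesion like that one” is to summarize each prior finding as a feature vector and run a nearest-neighbor search over the archive. Here is a sketch, with random vectors standing in for whatever features a real system would extract:

```python
# Sketch: "query by example" over prior cases using nearest-neighbor search.
# Each case is summarized as a feature vector (size, location, texture, ...);
# the vectors here are random placeholders, not real clinical features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
archive = rng.normal(size=(10_000, 32))      # 10,000 prior cases, 32 features each
case_ids = [f"patient-{i}" for i in range(len(archive))]

index = NearestNeighbors(n_neighbors=5).fit(archive)

query = rng.normal(size=(1, 32))             # features of the lesion just found
distances, neighbors = index.kneighbors(query)
for dist, idx in zip(distances[0], neighbors[0]):
    print(case_ids[idx], f"distance={dist:.2f}")
```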

So, I just offer this as one example of what big data is going to enable when you couple it with this machine learning capability in order to sort of revolutionize any particular field, whether it’s scientific analysis or in this case medicine and medical imaging.

Let me move from that sort of world, which is sort of the physical virtual, to a more day-by-day version, and it’s been the case that three-dimensional virtual environments are not completely new. Arguably the most prevalent instance of this is gaming, game consoles: PlayStation, Xbox, Nintendo kind of things. So, people who were willing to devote enough energy to master the controller were able to map their thoughts about the game play into that environment.

But it’s really a tricky kind of thing. And now the Web brings us the environment where we’re no longer confined to those synthetic environments, we have the maps and volumetric models of pretty much the entire surface of the planet. Now we’re moving to allow people to create virtual models of all of the spaces in which they live.

And, of course, there’s more and more interplay, whether you’re talking about shopping or other things, between the physical world and the virtual world.

So, today, the state of the art has been that you either had to know how to operate within that environment yourself or, if you really wanted to facilitate crossing over, we needed to kind of have some hints or some cheating mechanisms to allow the computer to help you bridge between these two environments.

One of those hinting mechanisms Microsoft created was called the Microsoft Tag, and just to familiarize yourself with how this concept of tagging things is being used, there are other things in a similar domain, QR codes and a few others. This one is kind of notable in that the tag can be made into pictures that you can recognize as well, and still maintain the information content. And these tags are intermediated by a Web service. And what’s interesting about that is that the same tag, at the discretion of the person who placed it someplace, can dynamically change what it does.

So, for example, if you’re a packaged goods retailer, you might print a tag on a package of soap or bottled soft drinks or something, and today you might want to have it teach you about the product, tomorrow you might want to run a special in the store and have it give you a coupon or redeem a coupon, and the next day you might want to run a contest. And that was always a problem because all you had was a UPC code, which was locally scannable, but how would you change what it did ex post facto?

So, in this environment you can dial that in. For example, on my business cards I have a tag that I created, and I can make it do different things. One thing it automates is completing the task of adding my contact information to your cell phone. So, you just take and scan it with your cell phone, and it will automatically put my contact information into your contacts. No button pushes, no clicks, no nothing; you just take a picture, boom, you get the card.

I can then change the mode of the tag, you know, so it takes you to my website at Microsoft. I mean, there’s kind of a lot of things that you can do that way.
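The important property here is the indirection: the printed tag only encodes an identifier, and a Web service decides, at scan time, what that identifier currently means. A toy sketch of that mapping follows; the tag IDs, URLs, and actions are invented for illustration.

```python
# Toy sketch of a tag-resolution service: the printed tag encodes only an ID,
# and the owner can re-point that ID at a new action at any time.
# Everything here (IDs, URLs, actions) is invented for illustration.

tag_actions = {
    # tag_id -> what scanning it should do right now
    "craig-business-card": {"type": "add_contact",
                            "vcard_url": "https://example.org/craig.vcf"},
}

def scan(tag_id):
    """What a phone effectively asks the service after decoding the tag image."""
    return tag_actions.get(tag_id, {"type": "not_found"})

def repoint(tag_id, new_action):
    """What the tag's owner does to change behavior without reprinting the tag."""
    tag_actions[tag_id] = new_action

print(scan("craig-business-card"))                     # adds the contact today
repoint("craig-business-card",                         # same tag, new behavior tomorrow
        {"type": "open_url", "url": "https://www.microsoft.com"})
print(scan("craig-business-card"))
```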

So, let me just run a video where this is now being applied in like — this is a music concert environment that was done in Savannah, Georgia, but you can see how particularly young people are starting to use the navigation capability of their phone to bridge between these environments.

So, go run the video, please.

(Video segment.)

CRAIG MUNDIE: So, in fact, there’s some tags on that poster board over there for some of the things I’m going to talk about today. So, you can get these readers, just download them for any of the popular smartphones, and do that now.

But one of the challenges that we recognized is that that only works if somebody is willing to go to the effort to go out and tag things. What we really wanted to do was essentially make it a lot more natural for people to be able to get that kind of help without having to depend on somebody having already put the hints out there.

So, in the new phones, if you put this one up there, we’ve arranged this phone so that it can be projected. So, this is one of the latest Windows Phones. In the search facility that we have on this phone now, you can see on the bottom of the screen there’s now several buttons, and with the rightmost three we’ve basically given the phone sort of eyes and ears, not in the sense of just running specific applications but integrating them more directly into how you can search and get things done with the phone.

So, this one will listen to any piece of music, for example, and tell you the band and take you to a place where you can buy it.

It has the ears, which is sort of the microphone capability.

It knows and combines the context of where you are. So, I’m going to try, even with the microphone here, to do some audio stuff. So, I’ll just say: “Movies.” So, it converted my speech into movies, put that into Bing. Bing basically converted it into a Web search. And it knew that I was at the University of Toronto based on the geolocation stuff that’s in the phone and says, okay, here’s all of the different movies.

I can say I want to learn more about them, and say I wanted to go see “The Lion King”; you can click on that. It goes out and gets information about it. You can get the show times. You could just click on buy tickets.

And the applications that relate to movie information like the database, in this case they’ve all been brought forward into this one interface.

One of the things that we think is super important going forward is that you really want to help people get the task done. They don’t just search because they want to search, they’re trying to do something.

So, the question is, how can you take the contextual information and combine it in ways with the explicit input from the user, and facilitate getting the entire task done?

So, in this case if I was going to go to the movies or learn about that, you don’t want to have to navigate over and over again back to the search engine or run a different set of applications; you want that sort of thread of activity to be predicted and more automated. So, there’s a huge effort in Microsoft to be able to do that in all these different products.
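Here is a rough sketch of that idea, fusing the explicit request (“movies”) with implicit context (location, time) and carrying it through to the transaction; every function below is an illustrative stub, not the actual Bing pipeline.

```python
# Sketch: fuse the user's explicit request with implicit context to move from
# "search" to "task completed". All functions and data are illustrative stubs.
from datetime import datetime

def recognize_speech(audio):
    return "movies"                      # stand-in for a speech-to-text service

def current_context():
    return {"location": "University of Toronto", "time": datetime.now()}

def find_showtimes(query, context):
    # A real system would call a movie/showtime service scoped to the location.
    return [{"title": "The Lion King", "theater": "nearby cinema", "time": "7:30 PM"}]

def buy_ticket(showing):
    return f"Ticket purchased: {showing['title']} at {showing['time']}"

# One utterance, plus context, carried all the way through to the transaction.
query = recognize_speech(audio=None)
context = current_context()
showings = find_showtimes(query, context)
print(buy_ticket(showings[0]))
```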

Another thing that we’ve done is said, okay, well, now that we’ve all got cameras, so how do we do something more than just take pictures? And so we’ve got this Bing vision kind of capability.

So, I’ve got a book here on the table, and I’m just going to look at it through this thing. Sure enough, actually there’s no tag on the book, there’s no UPC code visible; it just recognizes the book cover, matches it, and gives me a whole bunch of Web links that it finds related to that.

So, if I click on any one of those links, here’s about the book, the price of it. I can look and see book reviews on it that are immediately brought forward. I can see if I want to go buy it in a physical bookstore; there it is. And if I want to go buy it online, I can immediately click on one of these apps and buy it. All the information necessary to do this is sort of propagated along automatically.

So, again it was sort of take one picture, didn’t push any buttons at all, follow a link, and then pick what you want to do. So, you go from a word or take one picture to getting the entire transaction done in a very, very short amount of time.

So, okay, let’s just try one more thing that’s sort of vision related. So, what I’ve got here is a menu from a restaurant. In this case it’s a French menu, and I don’t actually speak French. So, I’m going to tell it go look for the text. It finds it. And then I’m going to tell it to translate it.

So, there it actually recognized that it was French, I didn’t tell it that; it converted French to English, and it overlaid it sort of word for word or place for place on top of the image of the menu that it took it from.

So, this is what many people call augmented reality in some specific form, but you’re starting to connect together a lot of different elements. First, you know, it had to find it, it had to figure out, well, what was it, what language was it, then what should I do with it when I get it.
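Sketched as a pipeline, the menu demo chains together text detection, language identification, translation, and overlay. The stubs below only illustrate the flow of data; they stand in for real OCR and translation services.

```python
# Sketch of the menu-translation chain: find text regions, detect the language,
# translate each region, and draw the translation back at the same position.
# The three service calls are stubs standing in for real OCR/translation APIs.

def ocr(image):
    # Each region: the recognized text plus where it sits in the image.
    return [{"text": "soupe du jour", "box": (40, 120, 200, 150)}]

def detect_language(text):
    return "fr"                              # stub: the system infers French itself

def translate(text, source, target="en"):
    return {"soupe du jour": "soup of the day"}.get(text, text)

def overlay(image, regions):
    for r in regions:
        print(f"draw '{r['translated']}' at {r['box']}")  # stand-in for rendering

regions = ocr(image=None)
for r in regions:
    lang = detect_language(r["text"])
    r["translated"] = translate(r["text"], source=lang)
overlay(None, regions)
```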

So, these kind of increasingly sophisticated capabilities where the computer is being endowed with more and more human-sensing-like capability and the ability to stitch them together, combine it with contextual information and help you get stuff done, is what we think is going to predominate in the way that people will use computers in the future.

Now, here I’m using a cell phone, some of these things are fairly automatic, but if we want to dream about a day where a lot more people are going to be able to get a lot more benefit from computing, then we really have to think about changing the modality of man-machine interaction. So, the last part of this talk is about this transition from the graphical user interface to the natural user interface.

We’ve been pursuing this in our research for probably, well, I could say almost the entire 20-year history of Microsoft Research. And certainly for more than the last 10 years the individual elements of machine vision, speech recognition, speech synthesis, scene analysis, all these things have been pursued as research things one at a time.

And a lot of people in the computer industry, including us, used these things more or less one at a time over the last 10 years, but almost always what we tried to do was use it as an alternative way to operate the GUI.

So, you know, now we’ve got touch as opposed to just using your mouse or typing or using arrows for navigation, but still we’re sort of trapped in the graphical interface that somebody created.

It’s a bit like this tag thing where if somebody took the pains to organize this stuff and put the tags there, it really helped a lot, but if you really were trying to operate in the general environment, you know, we weren’t very good at it.

And so what we wanted to do was say, you know, we don’t want to be confined to this graphical model as the only way in which you have to funnel everybody’s interaction with the machine, because in a sense they’ve already predetermined the context of your dialogue. And if you want to elevate this to a higher level, like translate this menu for me, you know, you really have to get up to a higher semantic level.

The first step, if you will, the key to doing this, was to change the way you could interact with computers, and the thing that really brought this home for us, and really is I think the first mass-market commercial example of a true natural user interaction system, was when we launched Kinect for the Xbox about a year ago.

Three years before that, the product group had realized that we had these incredibly complex controllers that are more like learning a musical instrument, and only a handful of people really would become virtuosos at it. You’d seen some simplified things like the Wii that gave another demographic group some access to the gaming environment, but if you plotted that line, it seemed like the obvious end point was you didn’t want a controller at all.

But at the time, it actually seemed kind of impossible, but they came over and they sat down with the research people, and we started to bring people in who had been doing research in each of these sort of component parts. No one of them thought that they were doing something that was about creating a controller-less gaming experience; you know, they were doing machine vision or any variety of other things, and ultimately about 10 different groups from four labs on three continents got hauled in, and we found that if you actually took all of those things together, it was possible at that point in time to embark on the construction of a system that would see in three dimensions, recognize humans in that image in real time, multiple of them, build a skeletal map, and have ears that could hear across a space without close mic’ing. So, we built Kinect.
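Microsoft Research’s published work on Kinect body tracking describes classifying each depth pixel into a body part with a trained classifier and then turning those labeled pixels into joint positions. The sketch below shows only that second, simpler step, on synthetic data, with a plain average in place of the real estimator.

```python
# Sketch of one step of a Kinect-style pipeline: given a per-pixel body-part
# labeling of a depth image, estimate a joint position for each part as the
# mean image position and depth of its pixels. The labeling itself would come
# from a trained classifier; here it is random synthetic data.
import numpy as np

H, W = 240, 320
rng = np.random.default_rng(1)
depth = rng.uniform(1.0, 4.0, size=(H, W))          # meters, synthetic
part_labels = rng.integers(0, 20, size=(H, W))      # 20 body parts, synthetic

def joint_estimates(depth, part_labels, num_parts=20):
    joints = {}
    ys, xs = np.indices(depth.shape)
    for part in range(num_parts):
        mask = part_labels == part
        if not mask.any():
            continue
        # Average image position and depth of the pixels assigned to this part.
        joints[part] = (xs[mask].mean(), ys[mask].mean(), depth[mask].mean())
    return joints

skeleton = joint_estimates(depth, part_labels)
print("estimated joints:", len(skeleton))
```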

Most of you probably know, but I’ll just show you a very short clip of kind of the games that were built in the first generation of Kinect, and how people just started using it.

This thing became the fastest-selling consumer electronics product of any type in history, zero to 8 million units in less than 60 days, and we’re working toward tens of millions of these things in less than a year now.

So, the thing that people loved about this was that they didn’t have to know anything before they could do something; they could just take what they understood about physical movement in the real world, and they could map that with no training onto the character in the game.

In this case, interestingly, the most popular game last year of all the things that were produced was this one, “Dance Central.” And if you say, you know, hey, who were the demographic group of Xbox gamers prior to the launch of Kinect, it was males between the age of 12 and 30, and trust me, they weren’t doing dancing most of the time. (Laughter.)

And so a year later, you look up and say, well, there’s this dramatic expansion of the demographic group, it’s become gender neutral, to some extent, the genre of games is still quite divided, but it’s just amazing how many things people are doing.

What was equally remarkable, and we anticipated it but probably didn’t anticipate the rate at which people would get engaged, was that now there was a 3-D depth camera and an array microphone, but mostly the depth camera, for $149 retail; the number of people who aspired to do things where they could give machine vision of some type to some computer that they wanted to play with is just stunning. Within a week, people had started to write their own drivers, because we made it with a USB plug, so even though it plugged into an Xbox and was sold that way, you could just take it out of there and plug it into your PC. And we knew people would do that, and, in fact, we had embarked on building a kit that would support that.

But people weren’t even willing to wait for the kit that would make it easy; they just said, well, we’re going to write drivers and try to figure out how it works, and there was this huge wave of activity. There was a whole area on YouTube called “Kinect hacks” where people would just start to post videos of what they had figured out and the kinds of things they were doing.


By June of this year, we had actually come out and released the kit where many of the things that we had done to support the game developers, particularly the skeletal modeling and tracking and also the array microphonics, we just made available as high-level APIs, and this immediately elevated the kinds of things that people were doing.

But I brought a little demo reel that’s again some of the YouTube kinds of things to show you the incredible creativity that people have, and it gets unleashed when you give them this new technology.

If you wanted to do just the camera part of this before, if you were a researcher in computer science or industrial design or architecture, the price range of a camera that would do what this thing does historically was between $30,000 and $100,000 a copy. So, the number of people who could afford it was very small. And so when it suddenly came out for $149, all those people who dreamed they wanted to do something say, well, now I can.

So, let’s run the little demo reel.

(Begin video segment.)

CRAIG MUNDIE: So, this is a guy who’s putting shaders on this thing, and he’s using his gestures to control the shading effect.

Here’s people playing chess where they walk around and move the chess pieces.

This guy built his own car-racing game, and he uses his hands to steer and his feet to control the pedals.

This guy has sort of a holographic projection system, and he’s flying his helicopter using all gesture-based controls.

This guy’s playing different kinds of media on his multi-screen environment.

And this is a Microsoft guy. The WorldWide Telescope is the underlying technology behind that rain navigation thing I showed you. Here he’s using all of the space imagery, and he navigates that.

This guy built his own football game where he gives voice — he calls the huddle using a voice command, and he controls the game by his movements.

This is just mapping a full human skeleton onto an anthropomorphic robot.

This guy built a Barcalounger that he can drive around with gestures. (Laughter.)

And this is a thing called Eddie. We worked with a company, and we actually built two Software Development Kits. One is just there for people who want to work on the software on personal computers, but we also for about four years have had a Robotics Studio, which is a development kit for people who want to do robots.

And obviously one of the things that comes together here is the idea — this guy did all kinds of visual effects.

(End video segment.)

CRAIG MUNDIE: One of the things people wanted to do was they wanted to build robots and experiment with them, and this suddenly made it really economical for people to have high-quality vision to supplement sort of low-level sensors and motor capabilities. So, you can now buy this kit called Eddie, which is essentially the motor sensor part, as a package, and then a thing you can just plug your laptop into and stick your Kinect on the top, and you’ve essentially got a small mobile robot development platform, and these are becoming very popular as well.

So, as we thought about this, it became clear that we really could not only enable but think broadly about how people would use this capability, and one of the things that we also recognized is that again you wanted to cross this sort of boundary between the physical world and the virtual world, and the first thing that we set out to do was say, well, what if we really get fancy in terms of the software that drives these sensors, and we took a Kinect, hooked it into a PC, and built a capability that allows us to scan real-world images.

Today, if you want to have something in 3-D, it’s really hard to build those models. And we said, well, now that we can see in 3-D, can’t we just automatically model in 3-D?

In the past to do this you would have had to put clues on it. So, just like I was sort of hinting with tags, people who have done 3-D reconstruction in software take a lot of photographs of objects, and they usually have to mark the object with recognizable points so that you can correlate those things.


We started to get past that at the Web scale a few years ago with the work that we did in mapping a thing called Photosynth where you could take uncorrelated images and the system would build a point cloud from all those photographs. But we wanted to go one step further, not just have a point cloud that you could see but actually build 3-D models.
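The kind of real-time scanning described here (what Microsoft Research published as KinectFusion) works, roughly, by fusing every incoming depth frame into a voxel grid of truncated signed distances, so that noisy individual frames average into a clean surface. Below is a heavily simplified sketch of just that accumulation step, with camera tracking omitted and synthetic frames.

```python
# Sketch of the core of depth-frame fusion: accumulate each new depth frame
# into a voxel grid of truncated signed distances, so the average over many
# noisy frames converges to a clean surface. Camera tracking is omitted and
# the "frames" are synthetic; this is the idea, not the shipping system.
import numpy as np

GRID = 64                       # voxels per side
VOXEL_SIZE = 0.01               # 1 cm voxels
TRUNC = 0.03                    # truncation distance in meters

tsdf = np.zeros((GRID, GRID, GRID))
weights = np.zeros_like(tsdf)

def fake_depth_frame():
    # Depth of the observed surface for each (x, y) voxel column: a noisy flat plane.
    return 0.32 + 0.005 * np.random.randn(GRID, GRID)

def integrate(depth_frame):
    global tsdf, weights
    # z position of every voxel layer (camera looking straight down the z axis).
    z = (np.arange(GRID) * VOXEL_SIZE)[None, None, :]
    # Signed distance from each voxel to the observed surface, truncated.
    sdf = np.clip(depth_frame[:, :, None] - z, -TRUNC, TRUNC)
    # Running weighted average over frames.
    tsdf = (tsdf * weights + sdf) / (weights + 1)
    weights += 1

for _ in range(30):             # fuse 30 noisy frames
    integrate(fake_depth_frame())

# The reconstructed surface is where the averaged signed distance crosses zero.
print("near-surface voxels:", int((np.abs(tsdf) < VOXEL_SIZE).sum()))
```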

So, this is a Kinect hooked up to a PC.

(Break for direction.)

Well, don’t know what’s wrong with that.

Had this done what I expected, I’d be able to actually just walk around and scan this pot. And as I scan it in real time, it will actually build a model of the pot. But I built a model of this, and I’ll go on to the next part of this without scanning it.

And if I come over here, I’ve actually got a Kinect hooked up to a screen.

Oh, actually I want to show you one more thing before that. No, that’s all right, I’ll do this.

All right, so I’ve got a Kinect, I hook it up to this screen. We’ll log in. So, it sees me, takes the pot. Here we’ve got a virtual potter’s wheel. And the Kinect knows where I am, it knows what my skeletal thing is, and so what we’ve done is we’ve created an environment where — come on — the — well, this part isn’t working either.

What normally happens is it takes and builds a model of the pot, and I can actually adjust the dimensions of the pot by just using my hand movements to control the shape of the pot.

So, here it would be as if I had a potter’s wheel and I was trying to throw this pot. I know how to do it; I want to do it in a virtual environment instead of a physical environment. And because it’s actually built on a 3-D model, when I’m done, I can take that model and go back and do something different to it.
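A hypothetical sketch of how hand tracking could drive that kind of pot: treat the pot as a surface of revolution defined by a radius-per-height profile, and let the tracked hand position pull the profile in or out near the hand’s height. The numbers and gestures below are invented.

```python
# Sketch of the virtual potter's wheel idea: the pot is a surface of revolution
# defined by a radius-per-height profile, and a tracked hand position simply
# pulls the radius toward it near the hand's height. Hand positions are simulated.
import numpy as np

HEIGHTS = np.linspace(0.0, 0.3, 60)          # pot is 30 cm tall, 60 profile samples
radius = np.full_like(HEIGHTS, 0.08)         # start as an 8 cm cylinder

def apply_hand(hand_height, hand_distance, influence=0.02):
    """Pull the profile toward the hand's distance from the axis, near its height."""
    global radius
    falloff = np.exp(-((HEIGHTS - hand_height) ** 2) / (2 * influence ** 2))
    radius = radius * (1 - falloff) + hand_distance * falloff

# Simulated gesture: pinch the neck in near the top, flare the lip at the rim.
apply_hand(hand_height=0.22, hand_distance=0.05)
apply_hand(hand_height=0.29, hand_distance=0.10)
print("radius profile (cm):", np.round(radius * 100, 1))
```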

So, one of the things that we wanted to do was to build a capability where people would be able to use the Kinect-like features, the skeletal modeling capability, and map themselves not just into games but into environments where they could do other things.

So, one of the things we built and actually released in June or July this year is a thing called Avatar Kinect where people already had avatars that they were able to use in the games, but we wanted to allow them to sort of animate their avatar and send it out to meet other avatars.

So, we built Avatar Kinect as a telepresence system, and the first telepresence applications we wanted to put were in a social environment.

And so let me run a little video clip, and you’ll be able to see what this is like, and this is available on the Web today.

(Video segment.)

CRAIG MUNDIE: So, to do this one we had a couple of interesting challenges: one, how to create an environment where the avatars could go meet somebody. So, we created these different stages, from ones for little kids, tailgate parties, performance stages, interview stages, so that people could do it.

But the other thing that was interesting is the original avatars, those that are in your games, all we were animating was your major skeletal features. But we knew if you wanted to have telepresent type of interaction with people, particularly in a social environment, you needed to capture facial expression.

And it’s fascinating how good humans are at taking small visual cues about your face or a caricature of your face, and getting emotional information conveyed through it.

So, having studied that, we found the smallest kind of facial animation that we had to get accurate in real time in order for you, coupled with gesture and voice, to convey emotion as if you were really there.

So, here your avatar, even though it’s a caricature of you, does a pretty good job at doing that. But we had to take the camera, which is actually a fairly low-resolution depth sensor, and composite it together with the RGB in a way to be able to develop a face model, so that it not only mapped the major joints for multiple people in real time, but it actually would do the faces, too.
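One common way to drive a caricature face from a handful of tracked points is blendshapes: the avatar’s face is a weighted mix of a neutral shape and a few expression shapes, so each frame only a few weights need to be estimated. This is a generic sketch of that technique on synthetic data, not the shipped Avatar Kinect code.

```python
# Sketch of blendshape-style facial animation: the avatar's face is a weighted
# blend of a neutral mesh and a few expression meshes (smile, jaw open, brows),
# and each frame we solve for the small set of weights that best explains the
# tracked facial points. Meshes and tracked points below are synthetic.
import numpy as np

rng = np.random.default_rng(2)
NUM_POINTS = 30                                    # tracked facial landmarks
neutral = rng.normal(size=(NUM_POINTS, 3))         # neutral landmark positions
expressions = rng.normal(size=(4, NUM_POINTS, 3))  # smile, jaw, brow-left, brow-right deltas

def solve_weights(observed):
    """Least-squares fit: observed ~ neutral + sum_i w_i * expressions[i]."""
    A = expressions.reshape(len(expressions), -1).T   # (3*NUM_POINTS, 4)
    b = (observed - neutral).ravel()
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.clip(w, 0.0, 1.0)                       # keep weights in a sane range

# A "frame" from the tracker: mostly a half smile plus noise.
observed = neutral + 0.5 * expressions[0] + 0.02 * rng.normal(size=(NUM_POINTS, 3))
print("estimated expression weights:", np.round(solve_weights(observed), 2))
```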

So, we did that and that was released as this product, Avatar Kinect. So, today, people can go and have meetings, they can record them, they can attend parties, they can do all kinds of stuff using an Xbox in their home.

In January I was at the World Economic Forum, and I was talking to a friend of mine, Maria Bartiromo. She works for CNBC and is a well-known financial commentator. We were talking about the Xbox and this natural user interaction, and I said, well, we’re going to come out later in the year with this Avatar Kinect, and we were talking about that, and kind of said, it would be really cool, why don’t we do a TV interview that way, because we had an interview set. So, in July I went and did that.

So, I’ll show you, because it was sort of educational, but Maria and I had a conversation about this technology, and we taped part of it live and part of it as avatars. So, let me just run that video so you get an idea of what that kind of telepresent interaction looks like today.

(Video segment.)

CRAIG MUNDIE: So, that went on for about 13 minutes on national television, and I was trying to get people to understand what it’s going to be like when you can send your avatar out to do your bidding or meet friends.

Of course, the challenge there is here you can only do it in these fixed spaces and with avatars that are not so realistic; they’re really just caricatures. But as the technology continues to improve, we can at least envision a day where we’re going to be able to make the avatars much more lifelike. In fact, there’s this phenomenon called the Uncanny Valley where when they’re caricatures you accept them quite naturally, there’s no cognitive dissonance. If they are perfect replicas of you in some sense, you would similarly accept that — that’s called television, right, and everybody is sort of socialized to accept that. You put a TV up there, and people really treat the people on it as if they’re really there.

And so we know that humans are able to do that if the realism is good enough. What’s bad is in the middle: if the avatars are sort of funky, not really very accurate, then there’s a cognitive dissonance problem, and it troubles you. We’ve seen that some of the early 3-D movies had these attributes, like “The Polar Express,” if any of you saw that in 3-D; it had that same kind of weirdness to it that bothered a lot of people.

And so we wanted to do two things going forward, and I don’t know whether we’ve got this thing working or not. Maybe it’s just dead.

Anyway, what I was going to show you is that what we want to be able to do is to scan models, not just of pot-like objects. You know, here I brought a small model, an architectural model of — actually this is the first-floor cutout of a new building we just built in Beijing for Microsoft.

And if I scan this thing, you know, you could look inside it, and you could kind of — oh, there it is. Oh, actually this is a video we recorded of it the other day.

So, this is me scanning this model, and on the left is the raw data, on the top right is the depth data, this is the map, and then we blend it with the RGB image. What you actually see, built in real time through that $149 camera, is a real 3-D model with the color imaging composited on top of it of this particular building.

And why is that interesting? Well, if I wanted in the future to have a model of this room, you know, you’d probably have a slightly different scanner, but we’ve proven now that we have the technology to build those models in real time.

So, you wouldn’t be confined to the stage that Microsoft prebuilt in the product, you’d be able to build your own conference room, living room, auditorium or whatever it might be.

The other thing we think is really interesting is that so much of the design work today is being done using CAD/CAM. So, actually when we were building this building in Beijing, which we opened a couple months ago, of course, the architects built this incredibly detailed, lit, 3-D model of the thing as part of the design process. And so you can take that model and import it, too.

So, whether you scan it yourself, get it from your architect, buy it on the Web, wherever it might be, there’s going to be a lot of ability to bring these things together.

I don’t know whether this is going to work or not. If not, we’ll show you a video of this.

Okay, this is a little better. So, here we’ve actually downloaded the architect’s model of our Beijing office that we just opened, and this is actually the lobby from which that little cardboard model was also made.

But here what I want to do is I want to do this Avatar Kinect-like thing, but I had the MSR people build me a more photo-real avatar.

So, like I did with Maria, I’m going to fly into this model now. So, let’s go in there. Boom, there I am.

So, now in real time I’ve got my skeleton mapped onto a photo-real avatar. In this case we haven’t done facial animation. And you can even see all my midsection and everything else is really there. (Laughter.) But in real time I’m standing here moving around.

So, I can basically point over here and say, okay, here’s the lobby up there, there’s this really cool display model, back over here down this hallway is the cafeteria, over here is where you go to eat, and, in fact, your ability to give a guided tour or, in fact, have other people pop in there with you is not that far away.

Also, last week we demonstrated a talking head of me at this MSR celebration that shows the state of the art of building photo-real heads of people, where you can take these animation and text-to-speech technologies, and we gave a live demonstration where they typed part of my speech in and the avatar read the speech in Mandarin. I don’t speak any Mandarin, but my talking head spoke perfectly in Mandarin.

So, you can start to see what I think is going to be very powerful in the business environment is the ability to start to have meetings that are really quite real in a telepresence sense at relatively low cost. You know, you’re not building quarter-million-dollar rooms in order to be able to have two people look like they can interact with each other.

And the other thing we think is going to be powerful is when we can, in fact, take this speech recognition and real-time translation capability, and be able to sit in front of this, speak in one language, have a meeting with somebody who’s telepresent with you and hears it in real time in their own language, and all of those things I think are going to be very important in this highly interconnected, globalized environment that we all especially have to work in these days.

All of those things I think you can see are not very far away. Just as four years ago it was thought to be almost impossible to just build that camera and microphone, certainly for $149, and be able to allow this to happen, we’re not that far away from evolving it and the computational facilities and the models to be able to endow your avatar with a lot more lifelike qualities, both at the skeletal and facial animation level, and even the ability to speak in other languages in real time.

And so it’s a very exciting time to be working in the field, and I think these are sort of three big things that are really going to change the way people think about computers.

So, whether you want to solve problems like health care and education for the poor, that’s one of the things I’m personally kind of passionate about, I think that if I take this kind of avatar and I say there’s not necessarily a person behind it, there’s just machine learning and Bayesian inference and other capabilities behind it, the computer now can be presenting itself as an avatar.

So, in places where we don’t have doctors and nurses or we don’t have teachers, I think that we will be able to mass produce them of sorts with subject matter expertise, and where the computer, at the cost of buying a PC and a camera setup like this one, will now actually be able to do that.

We’ve built prototypes of that, for example, of a doctor that can diagnose the 16 most prevalent diseases of children in the rural poor environments of the world, and you just walk up to the thing and it looks like a talking head, and it listens and you talk, it asks you questions, and behind the scenes it’s figuring out what disease you may have and what to do about it.
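The “figuring out what disease you may have” step can be sketched with simple Bayesian inference over reported symptoms: combine a prior over diseases with per-symptom likelihoods and renormalize. Every number below is invented for illustration and is not medical data.

```python
# Sketch of Bayesian diagnosis from reported symptoms: combine a prior over
# diseases with per-symptom likelihoods under a naive independence assumption,
# then renormalize. All probabilities here are invented for illustration.

priors = {"malaria": 0.2, "measles": 0.1, "diarrheal disease": 0.7}

# P(symptom present | disease) -- invented values.
likelihoods = {
    "malaria":           {"fever": 0.9, "rash": 0.1,  "dehydration": 0.3},
    "measles":           {"fever": 0.8, "rash": 0.9,  "dehydration": 0.2},
    "diarrheal disease": {"fever": 0.4, "rash": 0.05, "dehydration": 0.9},
}

def diagnose(symptoms):
    """Return P(disease | observed symptoms)."""
    scores = {}
    for disease, prior in priors.items():
        p = prior
        for s in symptoms:
            p *= likelihoods[disease].get(s, 0.01)
        scores[disease] = p
    total = sum(scores.values())
    return {d: p / total for d, p in scores.items()}

print(diagnose(["fever", "rash"]))   # with these made-up numbers, measles ranks first
```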

While you may ultimately be referred to a real doctor, many of these things will be able to be diagnosed and even have the prescribing done by a computer. And I think that’s the way we’re going to cut the cost of health care and maybe improve the quality of education for a planet that’s going to have 10 billion people in the next 50 to 100 years.

END