Remarks by Craig Mundie, Chief Research and Strategy Officer
Georgia Tech
Atlanta, Georgia
October 27, 2011
CRAIG MUNDIE: Thank you very much. Good evening, and thanks for coming and spending a little time with us. We’ve got about 90 minutes now, and I’ll use the first part of it to give a presentation, and share with you some of the things that I think are happening in computing and will affect all of us, whether you’re in the field or just ultimately informed by the use of computers.
You know, the context for this presentation goes back about 15 years. Back in the mid-'90s I was working to start many new things at Microsoft in the consumer computing space, but I also worked closely with Bill Gates. And Bill even at that time had a strong belief in the importance of basic research, and we made a big investment in that area, but he also believed that we needed to stay grounded in what was really happening in the real world, and a big part of that in his mind and my mind was to go out and engage with people in the university environment.
So, pretty much every year since that time, Bill and I independently went around and would visit universities and give some talks like this, and meet with faculty and students and members of the administration.
So, given that I've been here all day, I did all that already: I met with panels from the graduate students, the undergraduates, and the faculty, toured a bunch of projects in the GVU, met with President Peterson, and now I'm here with you.
The goal of these meetings is really to just encourage a dialogue where I want to know what’s happening on campuses and what’s on people’s minds, whether it’s faculty or students, and I want to share with you some of the things that we think are going to be important that you might not necessarily be aware of if all you did was sort of look at the state of the art in computing as you might acquire it at Best Buy, for example, and use it at home or at work. And so I’m going to talk about some of those things today.
So, let me find the clicker. Okay, here’s one.
The first thing I want to talk about is a major trend that's happening in the area of what we call big data. The evolution of computing has been one where the magnitude of the computing capability that we've been building is getting bigger and bigger. Governments and big businesses for many years would build supercomputing capabilities. In fact, Bud Peterson was saying, you know, he's going to build a new high-performance computing center for Tech.
But to some extent the companies like Microsoft, Google, Yahoo!, Amazon, eBay, some of these companies over the last decade have had to build computing facilities that are staggeringly large in comparison to anything that any business or government has ever built in the past, and we had to do that in order to be able to produce these Web-scale services on a global basis.
And now that we’ve done that, and seen a long progression in the decline of the cost of computing and the cost of storage, we are starting to see emerge a completely different paradigm in the way people think about data.
A few years ago, Microsoft published a book out of MSR in memory of Jim Gray, a pioneer of the database field, and what we talked about was how science itself was moving to what we called the Fourth Paradigm. Over the centuries we went from an environment where people had theories and would think deeply about science but didn't have a way to confirm it; that led to experimentation, which lasted quite a long time. More recently, as computers became powerful, we got into what we call the modeling phase, and now we're entering this data-driven phase. And in science I think it's pretty clear that the ability to get high-rate, high-accuracy sensor data at much, much lower cost and to store it is creating a novel way in which people are doing science.
But we also see this big data environment being applied to many domains related to business as well, and coupled with some other capabilities that are emerging it’s really altering the way that we operate our own businesses and many of the capabilities that we provide to people who are given access to these hyper-scale facilities.
It’s really quite remarkable and I think important when you realize that any one of you could take your sort of Visa card and get online and go rent access to a computing facility larger than anything any government in the world used to have. And this ability for people to store and compute on these massive amounts of data with these very, very high-scale computing facilities really should be thought of as a very important and novel way to do a lot of interesting science, as well as a lot of interesting business.
So, one of the questions that we had — and I'm going to show you a demo here — is what you actually do with all of that. Okay, so this in a sense just looks like a good old Excel spreadsheet, and the thing that was important about Excel when it was introduced decades ago, and spreadsheets like it, was that it was really the first time that people who weren't necessarily programmers could get the computer to do something useful for them without writing a program. So, for a long time these things have gotten fancier and fancier, with the capability to do larger and larger analysis.
If you look at the contemporary spreadsheet, we’ve now sort of removed all the constraints on how many sheets you can have or what the dimensions in terms of rows and columns are, and so it’s really now just a virtual tableau in which you can place a lot of data.
The question is now that we have these big data sources that are in the cloud, not just the ones that you would have in your company or sort of on your own desktop, what do you do about it?
So, one of the things that we have now in the cloud is these things called data marts, and Microsoft has one. So, I just logged into it. So, there’s just a big button here on your spreadsheet, and here we just put in a selection of publicly available or commercially available datasets that are resident in the cloud. And so things that would have been really hard for you to get in the past are now literally one or two clicks away.
So, for example, it’s not included here, but we have all of the latest census data. So, if you’re a marketing person, and you want to go out and just do demographic analysis in the local zip code, you can do this with a couple of clicks.
But here just to keep things simple I imported a bunch of weather data, and because we don’t have a lot of time I’ll limit the number of these things to just say 300 rows, because this is not critical, and I’ll say import the data. So, this goes out to the cloud, sucks it in, puts it all in the spreadsheet, it’s all annotated with some XML metadata, and you can operate on this.
So, in this environment I actually put a much bigger chunk of this data in here before. So, I’m just going to flip to that now.
You can use the same things that you always knew about, Excel charting and those type of analytical tools, and now you can apply it to these unlimited size datasets. And, in fact, the analytics can either be performed locally or they can be performed in these cloud services.
So, here what I downloaded was 30 years' worth of precipitation data for the Western half of the United States, and what you're seeing is Seattle and then the other four places I have visited in the last month and given a talk like this, the last of them being Atlanta.
So, you can actually see over this time horizon what rainfall or what the total precipitation looked like. And you can see in this case that in Evanston there was an anomalous thing where, gee, this jumped up so high, and it's very obvious when you look at the graph, but you could have just stared at all that data for a long time and never noticed it.
And so here we’ve annotated it, and if I click on it, it goes out and actually we find out that what happened then was there was this hurricane that went inland in North America, and it sort of drenched the middle of the country, and that’s how we ended up with that giant anomaly.
Interestingly, here in the Atlanta line a couple of years ago there was a similar anomaly, and if you look back at that time period, you can correlate it and say, well, in September of '09 there was a big weird storm, and it flooded things in Atlanta.
So, suddenly people who would have not had access to the data or the tools to see these patterns emerge are able to do these kinds of things, and do them quite easily with a few clicks.
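To make the spirit of that concrete, here is a minimal sketch of how you might flag those kinds of spikes once the data is sitting in a local table. The file name, column names, and the three-standard-deviation threshold are assumptions for illustration, not the actual data mart schema; the demo itself, of course, did all of this inside Excel.

    # anomaly_sketch.py -- illustrative only; the CSV file and its columns are assumed.
    import pandas as pd

    df = pd.read_csv("precipitation.csv")          # assumed columns: city, month, precip_mm

    # z-score each month's total against that city's long-run mean and spread
    grouped = df.groupby("city")["precip_mm"]
    z = (df["precip_mm"] - grouped.transform("mean")) / grouped.transform("std")
    df["anomaly"] = z.abs() > 3.0                  # months far outside the city's normal range

    print(df[df["anomaly"]][["city", "month", "precip_mm"]])

The threshold is an arbitrary choice; the point is only that spikes like the Evanston and Atlanta ones fall out of a query like this just as they leap out of the chart.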
But this is sort of traditional numerical data, and yet it’s hard to see this stuff evolve over time and over such a large geography.
So, let me show you a different tool that we’ve got, and this is sort of a visualization system. I’ll start it going here. It’s essentially a multimedia presentation that anybody can author by taking control of this, saying record, and they could in this case annotate it or speak over it and record it.
So, this is a flyover of the Western United States and the visualization data, where the dots that live above each part of the geography indicate the kind of precipitation and what the total was.
I was personally very interested to see this, because if you get to this point right about here, you say, I can now look at Seattle, which everybody thinks is really wet, and you can actually find that it’s not as wet as you thought it was, because it’s right here. But it turns out it’s really wet on either side of us, because there’s a mountain range that collects all the water at the coast, and another one that collects it all just east of us, and while we’re a little damp, we’re not nearly as bad as the people next door.
But these kind of very, very powerful visualization tools also allow people to see things that they wouldn’t have noticed in the past, and we can integrate across huge amounts of data in doing this.
Another example that I didn’t bring today, but we’ve taken a few hundred years’ worth of all of the observed seismic data, and plotted it in a similar capability. When you look at it, you can actually see, by integrating visually across these things, where are all the fault lines, because over centuries all the little quakes appear on the fault lines. So, things that are hard to find in other ways suddenly start to leap out at you when you have these powerful capabilities.
But more and more the stuff that we’re doing is no longer confined to be just a bunch of numbers, and so another thing that we’ve been working on to facilitate this is our machine learning technologies where we can sort of put the computer to work at being able to do a class of analytics that may, in fact, be very difficult for people to do themselves.
And to give you an idea of how this is going to evolve now, what I have here is essentially sliced CAT scan data of a human torso. If you were a radiologist over the last, you know, 10 or 15 years, the way you would have originally had to analyze this is you would have looked at these things along the three axes. So, you could basically move one of these things and go up and down, you could move this one and go front to back, and you could move this one and go left to right.
But you really had to have a great mental image of what the anatomy really was, and your own ability to kind of set these dials in exactly the right slice and exactly the right cross-section to be able to learn something.
So, to make it a little easier people did things like this where they started to build voxel-based models, and where you could essentially have a 3D reconstruction of these slices, and then be able to look around in that. But the problem is that it still doesn’t give you a very good way for people who are not really informed about these things to manipulate them.
So, one of the things that we did is we have taken these machine learning capabilities, and we taught it how to discriminate all of the skeletal structure and human organs in one of these voxel reconstructions from these slices.
So, what I'm going to do here — it actually runs locally — is I'm going to say, go find all the organs in this particular model. What pops up on the left is essentially all the different organs and a sort of little bar that indicates how confident it was that it found the right thing: for this humeral shaft not too confident, the knee not so good because there's not much of it there, but other things it's really quite good at.
So, for example, I could go down here and say, you know, show me the pelvis. So, now what happens is it not only puts these things in the right place, but you can see on the bottom right corner here it’s actually taken the pelvis, highlighted all the elements of that skeletal structure, and put them there so that you can now essentially manipulate them, zoom in, and look at it, and similarly for the soft organs look at the liver.
This is incredibly powerful when you realize that there’s nothing annotated in these diagrams. So, we can take this, feed it another person’s scans, and it will find all their organs. And even though they’re not exactly in the same place in every person, because scales are all different, the thing is now smart enough to be able to do what a human otherwise historically would have done, which is to look at one of these things and say, oh, I know what that particular part of the anatomy is.
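Underneath, this is at heart a per-voxel (or per-region) classification problem with a confidence attached to each prediction. The toy sketch below conveys that general idea with a random forest trained on made-up features; it is not the actual system, and the feature and organ names are invented for illustration.

    # toy_organ_classifier.py -- illustrative sketch only; features and labels are synthetic.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Pretend training set: one row per voxel with (intensity, x, y, z) features
    # and a label saying which organ that voxel belongs to.
    X_train = rng.normal(size=(5000, 4))
    y_train = rng.integers(0, 3, size=5000)        # 0=pelvis, 1=liver, 2=background (made up)

    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

    # For a new scan, classify every voxel and report a per-organ confidence,
    # analogous to the little confidence bars in the demo.
    X_new = rng.normal(size=(1000, 4))
    proba = clf.predict_proba(X_new)               # shape: (n_voxels, n_classes)
    for organ_id, name in enumerate(["pelvis", "liver", "background"]):
        mask = proba.argmax(axis=1) == organ_id
        confidence = proba[mask, organ_id].mean() if mask.any() else 0.0
        print(f"{name}: {mask.sum()} voxels, mean confidence {confidence:.2f}")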
So, I brought a video so that I don’t have to drive so much, and I’ll play it for you, which is just sort of a quicker tour through some of the things that could be done.
So, here the aorta and the kidneys are essentially highlighted as a unit and extracted against the background image. You can change the transfer function by just sliding sliders around that will show you more or less internal data. All of this can be stopped and controlled manually. Here it finds a lesion in the right lung. Here’s the liver and spleen. So, more and more of the internal structure is being presented.
If you, for example, think about the lesion in the lung, the other thing we’re now being able to do is to use the things that you’ve selected or highlighted here as an input to a query that says, for example, go find me all the other patients who have a lesion nominally like that in the same lung. So, now it will go back the other way and find all the people who have that. So, as you begin to do research or you’re trying to think about therapies, you suddenly have an incredibly new and powerful tool with which to go and do this.
In terms of getting people who are not trained in operating this kind of traditional diagnostic equipment to use it, your own physician could be given this: if he's trying to sit at the end of your bed and explain what's going on, he can drive this thing around with a mouse or his finger on a tablet. So, I think it's going to be a very powerful capability in the medical field.
But in essence what we're trying to say is that this machine learning is really an important part of what this big data environment is going to be about. It's not just going to be about the classical forms of numerical analytics; it's going to be about finding patterns in the data that, in fact, people would have a tough time finding reliably.
One last example in the medical space: we've actually taken 10 years' worth of medical data from some partner hospitals we work with, and our researchers took these same kinds of capabilities, applied them not to image data but to the clinical records, and started to ask the system to find answers to questions that are sort of medically and economically important.
So, for example, why do people so frequently get readmitted to the hospital shortly after they were discharged? We know that happens, we know it’s a big driver of costs, and, of course, it’s not good for the patient, but despite years of knowing that we haven’t been able to do an effective job at preventing those reoccurrences from happening.
And when we went and asked the computer to go through this thing, it found all of the classical things people knew about, but then it started to spit out more and more fine-grained correlations of things that seemed to be the causal reason that these reoccurrences happen, and from that we were able to build a model that is predictive.
So, every morning in some of these new products we’ve got the staff at the hospital can run this system across the existing patient population, and it will put out a rank ordered list of the likelihood of each patient being readmitted and why.
So, now you suddenly have the ability to intervene in advance of something bad happening just because you’re able to see the patterns in how that person has been treated in comparison to all the previous patients over a decade. And I think these are really powerful things and are going to change the way that people think about doing business.
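A heavily simplified sketch of that kind of morning run might look like the following. The file names, columns, and the logistic-regression model are stand-ins for illustration; the real system is built on a decade of much richer data and far more careful modeling.

    # readmission_rank_sketch.py -- illustrative only; files, columns, and model are assumptions.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    features = ["age", "num_prior_admits", "length_of_stay", "num_medications"]

    history = pd.read_csv("past_admissions.csv")          # assumed historical admissions data
    model = LogisticRegression(max_iter=1000).fit(
        history[features], history["readmitted_within_30_days"])

    today = pd.read_csv("current_patients.csv")           # assumed current patient data
    today["readmit_risk"] = model.predict_proba(today[features])[:, 1]

    # Rank-ordered list for the morning staff meeting, highest risk first.
    print(today.sort_values("readmit_risk", ascending=False)[["patient_id", "readmit_risk"]])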
So, the next thing I want to do is to talk a bit about where the physical world meets the virtual world. For the last few years, we've seen things moving in this direction. We've had things like 3D games, now you can watch 3D movies on television or in the theatres, and it's clear we can make a synthetic world, put objects in there, and let you manipulate them or have different forms of entertainment related to that. It's also clear that we've been doing more and more sophisticated things with computers as they relate to what people do in the physical world.
But now as computers are part of our daily life and everything we do, we carry them in our pockets, we’ve got them in our cars, in our televisions, more and more there’s a desire at least to see a lot more of a direct coupling between the physical world and the virtual world.
So, today we’re at what I’ll call a point in time where we’re not as robust as we want to be in terms of having the computer be able to operate in the physical world directly, but we’re getting there.
So, next I’ll show you first a video, and then some other advances that have been made recently about how this physical-virtual interaction is going to happen.
The first thing I’ll show you is a thing that’s about Microsoft Tag. This is a family of digital image-based coding systems that you can put on packages or business cards or pretty much anywhere you can print anything, and there’s a variety of these that have been around. You could say the progenitor of these things was the barcodes that have been on your groceries and other items for decades.
But the barcode actually had a limited number space, and all it was was a serial number. And unless you could figure out how to correlate that serial number with something you wanted to do, it wasn't that useful for anything else.
So, the question arises, you know, can we make a code that is both more interesting and informative, and also one that we can intermediate with a Web service in order to change its behavior dynamically?
So, there’s other things like QR codes that are sort of in this space, but most of those don’t have this sort of intermediation function of a Web Service. So, Microsoft Tag was created in order to give people that option.
So, today, if you, for example, look at USA Today, you know, on the cover of every section of USA Today and in many of the articles, or in many magazines, you see these little funny barcodes or colored triangle codes, and if you click on them with your cell phone, it automatically will take you and perform some action, like show you a related video or give you information about the product.
So, these are getting used more and more. So, I’ll show you a video of how this has been employed by people who have just recently put on a music festival — I think it was in Savannah — and so let’s run that video.
(Video segment.)
CRAIG MUNDIE: So, when we introduced this, anybody can go to the website and make their own tag. So, you can do it as an individual, you can do it as a corporation, and you can start to see just a fascinating array of applications.
So, for example, some of the car companies now, everywhere there used to be a little sticker in your car that was like a safety warning or something, they just put the little tag there. The same for people building documentation for consumer goods: they're putting it on the packaging.
One of the things that’s interesting about this is if you put a code like this on a package, today you can click on it with your cell phone and it might take you and give you information about that product. If they want to do a promotion, have a coupon or something attached, they can just change the Web action in one place, and the next day you click on the thing it will take you and do something different. When the sale is over, they can put it back to some other thing.
These are the kinds of things that people have never been able to do before, but it comes from this sort of compositing of this kind of visually recognizable tag in local devices like your cell phone, and the ability to have a Web Service that will help you distinguish what kind of function you want to perform.
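The mechanism behind that is just a level of indirection: the printed tag encodes a stable identifier, and a Web service decides at scan time where that identifier should go. Here is a minimal sketch using only Python's standard library; the tag IDs and target URLs are made up, and a production service would obviously add authentication, logging, and analytics.

    # tag_redirect_sketch.py -- minimal illustration of tag-to-URL indirection; IDs and URLs are invented.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Editing this table changes what every printed tag does, without reprinting anything.
    TAG_ACTIONS = {
        "/tag/12345": "https://example.com/product-info",
        "/tag/67890": "https://example.com/spring-coupon",
    }

    class TagHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            target = TAG_ACTIONS.get(self.path)
            if target:
                self.send_response(302)               # temporary redirect: behavior can change tomorrow
                self.send_header("Location", target)
                self.end_headers()
            else:
                self.send_error(404, "Unknown tag")

    if __name__ == "__main__":
        HTTPServer(("", 8080), TagHandler).serve_forever()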
So, let me move on now and show you how some of this is evolving beyond that. It's fine to give the computer hints, which is the way I think about the tag, and there are certainly cases, like in the retail marketing environment, where the hint is not only helpful to recognize the object but gives you this way of automating some tasks. We think that a big part of what computing for the individual is going to be about in the future is task completion, or automating task completion.
Today, you know, people have grown up and they’ve used Web search, for example, for a long time, and in that environment you sort of end up having to ask yourself, well, why do they search? They don’t search because they just want to search, they search because they want to do something. And the question is, how do you figure out what it is they’re trying to do, and how far can you go toward automating that?
So, as we and other companies are moving forward in this space, a lot more of the focus is not just on can you index the Web, for example, but can you take all the other information that’s available to you, treat it as contextual information, and from that try to divine what the actual intention of the user is so that you can automate more of the tasks.
We also find that more and more people want to move into this environment we think of as natural user interaction where it’s great to have small devices or big devices that you can touch, and fancy graphical interfaces, but more and more people want alternatives, simple alternatives. So, one of the most obvious ones is speech.
What I have here is a video cable connected to this Windows Phone 7.5 device, and you can see that it's figured out where we are: if you look up there near the top, it says midtown. So, it knows where we are in Atlanta, and the things that I ask it to do will all be blended with this other contextual information.
At the bottom we’ve added these icons. The one on the far left gives you a lot of local information. The next one will listen to any piece of music as it plays and identify it for you. The next one is what we call Bing Vision where the camera actually will look at things, and then there’s the microphone.
So, let me first just try and give an example of how speech is going to work in this environment. So, I’ll do a really simple one. “Movies.”
So, it packages that up, sends it off to the search engine, which basically does a whole lot of searches, and it brings back all the different information about this environment, about what the movies are that are in this area. So, it got the ratings, it knows the times. If I click on this area, it lets me drill in on these things. I can look at these in any level that I want.
If I click on the Ides of March, it goes out and finds me information about it. I can go one click left and it shows me all of the information. I can push on any one of these buttons and actually buy the tickets. And another click, there’s all the Web apps that are on this phone that relate to buying tickets, watching movies, getting ratings. They’ve all been essentially distilled out and put there so that you can click on them and do anything that you want.
So, I said one word, I did one touch, and then scrolled a couple times, and all the things that are necessary to learn about movies, to pick a movie, to buy tickets for a movie, they all just show up.
And I think that this is an example of what we mean by trying to automate tasks that people want to get done. If you contrast that with even just a few months ago, or certainly a couple of years ago, you might have been able to find some of this information if you went out and did enough searches on the Web, but you would have had to navigate, and you would have had no way to scope the search easily to just midtown. So it's really, really important to be able to take this information and put it together in some automated way.
So, let me show you another thing that we've been doing a lot of work on, and that's the idea that speech is fine, but in this world where we have tags, can't we get a lot of this stuff done in an environment where we don't have the hinting going on, where you just want the computer to look at things in a more natural way?
So, I brought just a book, for example, and I’ll push this Vision thing, and just let it look at the book for a second.
And so just looking at it, I didn’t push any buttons, it went out, it figured out what the book was just by looking at the cover, and then it found a whole bunch of different articles or Web links related to that. So, I’ll click on Elegant Universe. And much as happened with the movies, I get a description of the book, I can look at reviews about the book, I can buy the book, see what the prices are, and I can look at all the apps that are on my phone that might let me do other things related to the book; so just another example.
But there isn't a tag on this book. It actually just recognized it. So, today we can do a really good job on book covers, DVDs, CDs, Blu-ray disks, all kinds of standard things you might encounter in a retail environment. And because it recognizes every tag, UPC barcodes, QR codes, Microsoft Tags, essentially anything that's hinted with a tag or is in the class of things that we currently can recognize, you just point your phone at it, push one button, boom, it recognizes it and you can act on it.
There’s other things that we’re also doing in this sort of vision area that I think are also powerful, and I just want to show you one more.
So, I brought here a page, which is actually a simple menu from in this case a French restaurant. So, I’m going to look at this menu, and I’m going to say scan the text. So, it actually picked out what was text and what wasn’t on that document, and now I’m going to say translate it. And so then it translated it. Now, I didn’t tell it it was French, it figured that out. It converted it to English, because it knows that’s what I speak, and it overlaid it in place on the menu image to replace the underlying French text.
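The "it figured out it was French" step is a language-identification problem. The production system is far more capable than this, but a naive stopword-counting heuristic conveys the basic idea; the tiny word lists below are only for illustration.

    # language_guess_sketch.py -- naive stopword heuristic, for illustration only.
    STOPWORDS = {
        "french":  {"le", "la", "les", "et", "de", "du", "avec", "au"},
        "english": {"the", "and", "of", "with", "to", "a", "in"},
        "spanish": {"el", "los", "y", "de", "con", "al", "una"},
    }

    def guess_language(text):
        words = set(text.lower().split())
        scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
        return max(scores, key=scores.get)

    print(guess_language("Filet de boeuf au poivre et gratin de pommes de terre"))  # -> french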
So, more and more we have this ability to have the computer do things that are otherwise really hard for people to do. I mean, it’s really quite challenging. But more and more this idea that the computer can just recognize things and take action on them with very simple commands I think will become better and better at a very rapid rate.
So, this is the beginning of an era of computing that we think will be substantially different, and one of the biggest changes in this environment is going to be the shift from the now well-worn graphical user interface, or GUI as it’s been called for many years, to what we call NUI or the natural user interface.
The whole idea here is to be able to escape the complexity associated with learning how to get the computer to do stuff for you. Go back to the examples I gave of the big data things: with the radiology tool you had to be really schooled in it, and even with spreadsheets you had to be pretty sophisticated if you wanted to do any of the really fancy stuff.
And yet despite all that capability, I mean, we've had incredible success, really changed the planet, and certainly business and people's lives, by giving them that kind of computing. But as many of you probably read in the paper, I think it was this week that the planet's population crossed 7 billion people, and despite the progress we've made with computers and smartphones, for the most part only about 2 billion people out of seven have really gotten any personal, direct benefit from all this computing and connectivity. So, there's 5 billion people out there that are sort of yet to go, and over the course of the next 20 or 30 years or so, maybe 50, many people predict we're going to get to either 9 billion or maybe even 10 billion people by the end of the century.
All these people who are going to come onto the planet are going to need a lot of capabilities that we frankly don't have a way to scale to provide to them, whether it's healthcare services, educational services, or support for just improving their ability to make a living, and I think computing is going to play a critical part in addressing those challenges. It's, I think, the only economically scalable way to approach these incredible challenges.
And yet one of the hurdles is that those people don't come with a Georgia Tech education; they're not very literate, and in many cases they literally are illiterate. And so the question is, how do you get them to benefit from computing if they are sort of forced to learn all the things that we learned?
So, when we talk about this natural user interaction, I described it as having two properties that are really going to be different. One is we make the computer more like us, and so the interaction between us and our computers, however they’re distributed in your environment, will be a lot more like interacting with another person, and the stuff I’ve just been showing you is sort of a step in that direction. But here it’s sort of blended together with the traditional graphical interface.
The other attribute is that we really want the computer to be less of a tool and more of a helper. So, again, these steps I've shown you, which are toward automated task completion, are all about moving the computer to be more of an assistant that just gets stuff done for you, as opposed to saying, hey, I'm a sophisticated tool and if you can figure out how to drive me I might do interesting things for you. So, that's the big pair of changes that we're trying to make in this natural user interface space.
The research to support this goes back in our case 20 years to things like natural language processing, machine vision work and others, but it's only been in the last couple of years that the computing elements themselves have become powerful enough and inexpensive enough that we can start to do a reasonably good job emulating a lot of the key human senses — vision, hearing, speech recognition, speech synthesis — and also powerful enough to do more than one at a time.
When we weren't very good at it, or we could only do one at a time, the thing that we tended to do was just use it as an alternative way to operate the GUI. So, to some extent, in the last few years with the emphasis on touch, all we were doing was substituting finger touch as an alternative to, you could say, the historical mouse or stylus or some other pointing device. We had to get a breakthrough in touch sensing technology in order to let that start to emerge, and it's been very powerful, but it's still just a different way to operate the graphical interface.
What we’ve been asking ourselves for a few years is how do we really step farther back and fundamentally change the way that people interact with computers, and not just do a better GUI.
The real breakthrough for us came just about a year ago with the introduction of Kinect for the Xbox. You know, four years ago, it became clear to the people who run our gaming business, the game console business, that not only was gaming just going to be a component of a much more integrated entertainment environment in the home, but that the demographic profile of people who were willing to master a thing like a game console controller in order to be able to drive their entertainment system was just too small a percentage of the population.
The demographic description of people who would have been Xbox users or any of the similar game consoles typically would have been males from the age of 12 to 30, and it didn’t have very many females and it didn’t have very many people outside that age range.
And yet we think that if we could make this easier, not only would you be able to get more people involved, but it would facilitate using this kind of computational capability to provide access to a lot more entertainment capability, and ultimately to integrate it with these kind of machine learned models and different ways of getting things done in order to change that experience.
Four years ago, it was sort of obvious, having watched the evolution from the Xbox and PlayStation with very sophisticated controllers to simplified things like the Wii, which did get a different demographic involved, that the end point you wanted to get to was no controller, where you could just get into this environment and not have to manipulate or learn anything at all.
And when we thought about it initially, it seemed really hard, and maybe it was just too early, but we ended up ultimately assembling about 10 different groups from four labs on three continents in the research area, and brought forward these things and found that we had a cover set of technologies that would, in fact, allow this type of controller-less gaming, and with it the ability to launch sort of the world's first really natural user interface product.
So, for those of you that haven’t seen it, I’m just going to show you a very short video clip of what turned out last year to be the most popular title on the game console, which was Dance Central. And I promise you the males age 12 to 30 were not the ones who were running out to buy this one. (Laughter.)
But what's happened in just a year is that the demographic profile of the people who are now buying games and playing them has become pretty much gender neutral, and the age range extends from, like, my two-year-old granddaughter, who plays it, up to people who are much, much older and are now enjoying doing things with their kids or grandkids that historically they never would have contemplated.
The controllers of these game systems in the past are best thought of as having the complexity of a musical instrument, and if you were willing to devote the time to become a virtuoso, you could do some stunning things in these synthetic 3D environments.
But the thing that's been fascinating is you could take, I contend, the world's best Halo player, and then say, okay, get your controller out, and here, let's put your mom next to you, she's going to compete with you, and you're going to play Dance Central. You get the controller, she just gets to dance. And what you'll find is that no matter how good they were at first-person shooters or anything else, they can't map the complex movements of dance onto that chording on the controller in any effective way. And yet people in 10 seconds can stand up in front of this thing and essentially dance.
So, let me just show you a short video of that.
(Video segment.)
CRAIG MUNDIE: So, here what you have in the game is you have professional dancers who you can set to dance at any particular level, and your job is to get up and match them.
The way this game works is that every major joint in the human skeleton, in this case 42 of them, is mapped by the system 30 times a second for two people. And so the game in this case is to see how close you can come to emulating the pros, and the greater your variance, the lower your score.
But it turns out that it also provides feedback, because if you get it wrong, it lights up: if your forearm should have been here and it happened here, it lights it up in red. So, as you sit there and do this, the thing is actually providing real-time feedback, and the rate at which people improve their ability to dance goes up very dramatically.
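The scoring and the red-highlight feedback boil down to comparing your tracked joint positions against the reference dancer's, frame by frame. Below is a tiny sketch of that comparison; the joint names, the tolerance, and the scoring formula are arbitrary choices for illustration, not the game's actual math.

    # dance_score_sketch.py -- illustrative joint-comparison logic; names, tolerance, and scoring are arbitrary.
    import numpy as np

    JOINTS = ["left_forearm", "right_forearm", "left_knee", "right_knee"]
    TOLERANCE_M = 0.15        # how far off a joint can be before it lights up red

    def score_frame(player, reference):
        """player/reference: dicts of joint name -> (x, y, z) in meters for one frame."""
        errors = {j: np.linalg.norm(np.subtract(player[j], reference[j])) for j in JOINTS}
        off_joints = [j for j, e in errors.items() if e > TOLERANCE_M]   # these get highlighted in red
        score = max(0.0, 1.0 - float(np.mean(list(errors.values()))))   # lower variance, higher score
        return score, off_joints

    player    = {j: (0.1, 0.2, 1.5) for j in JOINTS}
    reference = {j: (0.1, 0.4, 1.5) for j in JOINTS}
    print(score_frame(player, reference))   # about 0.8, with all four joints flagged (0.2 m off)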
Now we're starting to see these things applied in a lot of other interesting ways. When we introduced Kinect, which was in November last year, almost immediately, in fact within a week, people started to write their own drivers, because one of the things we did was put a standard USB plug on it. While its first application was on the console, we really wanted to see a transference of this technology onto the personal computer, anticipating there would be many, many other uses.
And so within a week people started to try to reverse-engineer the bit stream, and they did, and they got some rudimentary things working. By June of this year, we released a Software Development Kit which took some of the things that we had spent years developing and made them available as standard libraries for people to use.
So, the next thing I’m going to show you is a little demo reel of just wild and crazy things that the community around the world has been able to do literally in a few months once they were able to take this $149 camera and plug it into the PCs they already had.
To put this in perspective, if you wanted to have an optical system that had an RGB input — that's red-green-blue video — could see in three dimensions, which is the novel thing in this case, and had an array microphone that would allow you to do beam-formed listening in space, you probably would have spent somewhere between $30,000 and $100,000 each to buy one of those. So, while there's been a lot of interest in academia for years in these kinds of things in machine vision and robotics, in essence it was too expensive for most people to really get into it.
So, when this thing came out at $149, it was just revolutionary, because now anybody literally could sort of buy one, plug it into their PC, and start doing this stuff.
So, let’s run the demo reel, and I’ll just describe some of these things.
So, this guy is basically just using this thing to distort computationally the image that is being projected.
Here people are playing chess by walking around to move the chess pieces.
This guy is steering with his hands and accelerating and braking with his foot. He wrote his own auto racing game.
This is actually a sort of pseudo-holographic projection system, but the guy flies the helicopter just using gestures in front of it.
This is a big navigation system, and he wants to see things at a different perspective. So, this thing is tracking his head and changing the presentation.
This is the same tool I showed you for the precipitation data, but that’s the outer space version.
This guy made his own football game where he calls the plays verbally, that sets the thing, and then he uses the body motion to operate the play.
This is just mapping human actions onto a robot that has many of the same capabilities.
This guy wanted a controllable Barcalounger that he could drive around.
This is a thing that’s called Eddie. We did it with a company where you buy the base, which has a bunch of sensors, motor and battery, you stick your laptop or tablet in it, and plug your Kinect in at the top, and you essentially have a robotic development platform for people who want to do experiments.
This guy calls for different video effects, and then he uses his hands to apply them to the video that he’s in.
So, it’s just an unbelievable array of things going on.
Some other ones that I didn't bring, but I think are particularly profound, were some students in Germany who, in a matter of weeks, took a Kinect, screwed it onto the top of a hard hat, put a laptop in a backpack, and built sort of a braille output belt. They wrote an application that allows the thing to basically see for the blind.
So, the camera sees the space in front of the blind person, who normally would have had to move forward with some type of physical tapping cane in order to avoid obstructions. This thing would see all the obstructions, and it would see where the corridors are. It used audio output to say there's a corridor coming on the right, and it would build a little map and sort of push on your stomach to give you a way of sensing where the objects were in front of you that you had to navigate around. So, they built this in a matter of weeks, and if you think about what it would have taken historically to build such an application, it would have taken a lot more than that.
So, the next thing that we said is, you know, all right, well, if we have all this kind of capability, how are people going to use it to do more and more interesting things?
So, another thing we’ve been working on for a few years, and you can see it in some of the products on the Web, for example, like Photosynth, where people, not even cooperating people, can go out and take just still photos of almost anything, and the system will correlate them, put them into a three-dimensional space, build a three-dimensional model of it, and then texture map the original surface onto that, and it allows people to go navigate around three-dimensional objects on the Web on top of the maps that are out there.
And what we wanted to do was think about how we can make this something that more and more people can do. If you want people to operate in these 3D environments, to blend the physical and the virtual worlds together, how are we going to make that easy?
So, what we did is we took a Kinect camera, which I have here, attached to a personal computer, and the research people have been building algorithms that allow us to use this to build real 3D models of things in real time.
So, there's nothing on this table that is annotated in any special way. What you see in the top left is actually the raw depth map from the sensor. On the right it's been false-colored where we've been able to sort of recognize the geometry of what's in the scene, and where the vertical and horizontal surfaces are. On the bottom right we've actually started to extract an actual 3D model as a mesh, and on the left we have sort of the complete model. And, in fact, you see as I walk around here, it's going to fill in the other part.
So, in essence I’ve got a hand scanner that will build real 3D models. And here I’ll go inside so you can get the inside, too. And in a short amount of time just scanning anything, you can do it.
So, go ahead and now turn on the RGB. So, now I’ve got the RGB camera. We’ve taken that and made it into a texture, and it’s being texture mapped in real time onto the 3D model. So, what you see on the left is not video, it’s a computer-generated 3D model with all of the RGB stuff painted on top of it in real time.
You can do this for anything, and if I wanted to turn around and make a model of the lamp or the table or the water bottle, you just point it at it and it makes that.
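Under the hood, the first step of a scan like that is turning each depth pixel into a 3D point using the camera's intrinsics; the fused model is then built by aligning and merging successive point clouds. Here is a minimal back-projection sketch; the intrinsic values are representative numbers for a depth camera of that era, not an actual Kinect calibration.

    # depth_to_points_sketch.py -- back-project a depth image into a 3D point cloud.
    # The intrinsics below are representative values, not a real calibration.
    import numpy as np

    FX, FY = 580.0, 580.0          # focal lengths in pixels (assumed)
    CX, CY = 320.0, 240.0          # principal point for a 640x480 depth image (assumed)

    def depth_to_points(depth_m):
        """depth_m: (480, 640) array of depths in meters; returns (N, 3) points in camera space."""
        v, u = np.indices(depth_m.shape)
        z = depth_m
        x = (u - CX) * z / FX
        y = (v - CY) * z / FY
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]          # drop pixels with no depth reading

    depth = np.full((480, 640), 2.0)             # fake frame: everything two meters away
    print(depth_to_points(depth).shape)          # (307200, 3)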
Once you get to this point, we can start to do other really interesting things. So, I brought a Kinect over here, and it's connected to this display, which you're seeing up there. What I actually have is a 3D model that we made this way in the past, and the idea is to show people, well, what could you do if you had this kind of capability.
So, let’s say I wanted to throw a virtual pot. So, now the Kinect actually finds my hands, and if I apply them here I can essentially turn this pot and make it into a completely different shape if I want. So, things that I might struggle to do, say I want to make this a little bit different — okay, there’s my new pot. Now I actually have a 3D model of that pot, and if I wanted to go manufacture it or I wanted to output it or I wanted to put it into a 3D printer, all that could be done.
So, I started by scanning that one, I came over here and used things that I already know how to do like move my hands, and I was able to alter that geometry and get a 3D model out in real time.
So, this is the beginning of a world where we won’t be just doing books and pots and other things, we want to actually put people into these 3D spaces.
So, when we started the development of Kinect, I had a group of people who also started to develop a telepresence system. Given that I travel 140 nights a year, I'm really anxious to get a telepresence system — (laughter) — and the question is, how can we get people started down this path? Everybody has Skype video or instant messaging video, or even higher-end Cisco or Halo kind of fancy video teleconferencing systems, but while that's okay for two people, it really starts to break down when you get beyond two people. And there's really nothing very natural about the interaction that you have, particularly in a multiparty conferencing environment.
So, we said, all right, we've got all these people who have Xboxes. Because there are 50 million of those out there, there are probably something on the order of 70 or 80 million avatars that have already been created by the people who have them, because that was always part of using the system: you created an avatar to represent you in the interface.
With the arrival of Kinect the avatars became your representative in the game itself, and the question is, what could we do that wasn’t just gaming?
Because we felt that particularly the younger audience who would be using the Kinect and the Xbox had already acclimated to this idea of having an avatar to represent them, we decided that the first place we would target to try to build an inexpensive, multiparty telepresence system was to build an avatar-based one, and make it part of this Xbox environment.
And so we did that, and, in fact, in the summer of this year, 2011, we actually released this capability, and anybody who’s got a Kinect and an Xbox now can have meetings of up to you and seven friends in various three-dimensional stages that exist out there in cyberspace.
In this generation we designed the stages, and you can go and meet in that environment, but obviously the ultimate goal will be to expand that to a much broader array of places.
So, let me first show you a video, which is just the trailer for the release of this capability in the summer, but it will give you an idea of what it’s like to have a multiparty avatar-based way of projecting yourself into a three-dimensional space. In this environment, even though the avatars are caricatures of you and you can control sort of what they look like, the other breakthrough in this product is we added facial animation.
So, in the first wave of products it mapped all the 42 major skeletal joints, and that's what you needed to dance or operate your character in the game, but it didn't do anything with the faces. In this application we decided to go farther and use this capability to build a three-dimensional model of your face. We know the underlying human structure, the musculature, and even though the resolution of the camera is fairly limited at that distance, we found a way to map it such that we could take the major facial features, capture them in real time, and map them onto the elements of the avatar face.
So, when you look at this, you’ll see in the video that not only does it capture people’s movement and gestures, it captures their mouth positions and their eyebrows. It’s stunning how good humans are at decoding emotional information from even very limited facial cues, but it’s super important in terms of any sense of reality in the interaction.
So, this was a necessary capability, and even though it’s very coarse, almost all of the major human emotions are captured and transmitted through this mechanism.
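At this level of fidelity, the mapping can be as simple as measuring a few distances between tracked facial landmarks each frame and converting them into animation weights on the avatar's face. The toy sketch below shows that conversion; the landmark names and the neutral and maximum distances are invented for illustration.

    # face_mapping_sketch.py -- toy landmark-to-animation-weight mapping; all constants are invented.
    def clamp01(x):
        return max(0.0, min(1.0, x))

    def facial_weights(landmarks):
        """landmarks: dict of name -> (x, y) image coordinates for one video frame."""
        mouth_open = landmarks["lower_lip"][1] - landmarks["upper_lip"][1]
        brow_raise = landmarks["eye_center"][1] - landmarks["brow_center"][1]
        return {
            # Normalize against assumed neutral and maximum distances (in pixels).
            "jaw_open": clamp01((mouth_open - 2.0) / 10.0),
            "brow_up":  clamp01((brow_raise - 8.0) / 6.0),
        }

    frame = {"upper_lip": (100, 150), "lower_lip": (100, 158),
             "brow_center": (100, 100), "eye_center": (100, 112)}
    print(facial_weights(frame))   # e.g. jaw_open 0.6, brow_up about 0.67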
So, let me just show you the trailer to give you an idea of how this is actually working today.
(Video segment.)
CRAIG MUNDIE: So, this is actually shipping, and people are out there, literally hundreds of thousands of them, having these meetings, making their own recordings, posting them for people on Facebook or wherever you might post any other kind of video. I view this as the first step toward ultimately a much more sort of photorealistic and even business type of application of this kind of technology.
In January I was at the World Economic Forum in Switzerland with a friend of mine, Maria Bartiromo, who many of you have probably at least seen on television on CNBC. We were talking, and she was saying her kids had gotten this Kinect and were talking about how much they were enjoying it, and that she was actually playing too, even though, of course, she had never played any videogames before.
And I said, well, you know, I’m working on this thing called Avatar Kinect where we want to allow people to meet, and I said, you know, what we ought to do is we ought to do one of your TV shows as avatars. So, she said, oh, that’s a pretty cool idea, let’s do that.
So, in July of this year I took this same thing, and we set two of them up in the studio in New York, and we recorded a segment for her show, and it ran nationwide in July. I won’t show you all of it, but I’ll show you a clip from that actual TV show.
What they did is they recorded part of it as the real human part, and at the same time they recorded the avatar part of the interaction. And then they did a very good job of sort of cutting them together, side by side in some cases, so you could see what was the interaction really like.
So, during the interview, for the vast majority of it we were not in the same room, but you can see that there's quite a natural repartee between the two of us. We know each other, so it wasn't a question of learning, and it's really amazing how quickly your mind is willing to sort of suspend disbelief and just operate that way, because many of the visual and emotional cues that you look for in human interaction are captured with enough fidelity that you can start to do this.
So, let me show you the clip from the Maria show.
(Video segment.)
CRAIG MUNDIE: So, this was a lot of fun. It was actually 13 minutes by the time it aired, and, of course, it’s the first time anybody has ever done anything like this as avatars.
But one of the things that was interesting, you know, she had a lot of fun doing it, we didn't show all that, but she was able to design her avatar. The last thing she did was change her jewelry before she went on. (Laughter.) There's just a sort of psychology to interacting this way that, even with the caricatures, is better than what you get if you just say, hey, I'm sitting there staring at your headshot in a little video window someplace. I think this is really going to be important in evolving the way that we interact with each other, particularly over time.
So, the obvious next step is to think, you know, well, what's it going to be like if I'm not confined to these artificial sets? Ultimately the dream is to have something like this Kinect sensor, and it might be a somewhat different version or just an adaptation or evolution of this one, and what we want to be able to do — and I'll turn this back on again for a second. This is the same thing I was showing you before. But here, for example, it's actually a model.
Now, this model, as I scan it, turns out to be a cutaway of part of the first floor of the new Microsoft lab in Beijing. The architects, when they were doing some of the design studies, picked this particular model and built it. I can just scan this thing, and I'd have a 3D model.
Now, of course, in the 3D world I can make the scale anything I want. So, I can start with this model, and I could make it life sized, and then I could walk inside it.
Similarly, we think that since almost everything these days is getting built using CAD systems and 3D models, what I did next was I went to the architects and said, hey, I want the actual CAD model from the Beijing lab. So, we got that and we loaded it up.
So, let me start that up, and there it is. So, this is actually the CAD model the architects made of the lab we opened this year in Beijing, and in some ways it resembles parts of what you see here, but it is the whole model.
So, now what I can do is much as I did in the avatar thing I can say, okay, drop me into Beijing. Boom.
So, here the next thing we've been trying to do is make the avatars more photorealistic. You can see here it's even got my belly and everything — (laughter) — and the head.
Now, in this particular prototype we haven't got the facial animation, but I'll show you more about that in a moment. You can see that my movement is fairly realistic in this environment, and I've actually had meetings in it. I can say to people, hey, look, go down here and you'll find the cafeteria; over here up on this new wall — and actually on the new building they've put lights on this thing and it makes a cascading waterfall. And, of course, I can be here, but I can move other people here too. I've done another demo where I have another person from Microsoft in a different lab, and we dropped them in next to me. So, I'm talking to them, we're both essentially in this lab, and I'm pointing out things that when they go to Beijing they'll actually see, and it looks just like that. So, it's really kind of interesting how sophisticated this is going to be.
So, the next challenge is essentially going to be the face, and so let me drop out of here, and show you the next thing in a second.
While I was in Beijing a couple of weeks ago, the next thing we’ve been working on is how to make the head more photorealistic, including real-time animation.
So, in a moment I’m going to show you a short video clip, and this was actually shown live in the Beijing lab when I was there for our 20th anniversary celebration recently.
You’ll see it’s starting to resemble me more and more. What you’re going to see is actually not a photograph, it is a 3D model that was built, and it’s sort of a full model of my head, but it uses a very sophisticated technique that blends together video of me speaking, and then it is able to composite that in real time on the underlying model in order to get more or less natural animation of the mouth.
The hardest part in doing all of this is the mouth. It's sophisticated, it moves fast, and it turns out you not only have the external structure but also the lips, the teeth and the tongue, all of which people actually see. So, any attempt to build one of these talking heads and have it not look too freaky is going to have to be quite good at emulating these. In fact, when we do research in this space, if we do it at the level of the caricatured avatar, the cartoon kind of thing, you can get the emotional transfer and there's no cognitive dissonance in looking at it.
If you get all the way to really photo-real, and you can't quite distinguish it from a video, people are actually quite happy to accept that. In a sense it's the evolution of television. Everybody has been trained to look at those RGB dots on the wall and actually think people are there. No, it's just colored dots, but you've been trained to accept that as a representation of people.
And so the question is, what happens in the middle, and, in fact, researchers call the middle the Uncanny Valley, because if you start to get more photo-real but the behavior isn't natural, your brain freaks out. (Laughter.) You've seen this happen. Some of you, for example, saw Polar Express, one of the earlier 3D movies; it fell in the Uncanny Valley, and many people went to watch it and said, wow, the animation is kind of cool, but there was something that just wasn't right. And what wasn't right was that the characters started to look pretty real, but they didn't behave right in these sort of fine-featured ways.
So, it’s very important to sort of not fall in the Uncanny Valley. It’s why we started with Avatar Kinect with caricatured avatars and an audience that’s already acclimated to looking at them, and the question is how do you get across the valley and all the way to something that people will find natural enough.
So, I'm not going to claim that this is there yet, but I want to give you an idea that we're approaching it fairly rapidly. When you do that, and you couple it with this ability to scan things, then if you wanted to have a model of this room with a performance stage here, and you wanted to have people attend that way, in the future you'll just come up here, you'll basically scan the stage, and then you'll publish it, and you can put performers on it who aren't really here, they're just sitting someplace else.
The other thing that is, I'll say, kind of magical about this particular demo is that I'm going to speak in Mandarin, and I don't speak Mandarin. (Laughter.) And the other thing that's interesting is that this talking head is reading the text of my speech. So, there's nothing that I recorded, said or did. This thing is reading the text, it's trying to synthesize the correct emotion and sort of cadence of reading it, and, oh, by the way, it's converting it to Mandarin.
So, let’s go ahead and run that.
(Video segment.)
CRAIG MUNDIE: So, my Chinese friends say that was really good Mandarin. (Laughter, applause.) You can probably vouch for that.
So, the other reason this is really important, particularly when you start to think about this idea of meetings, is our goal is to be able to do that from the spoken word in real time. And what that means is that you can have a meeting where you send your avatar out to meet some other avatar, I speak in English and at the other end they hear it in their native language, potentially multiple different ones. And when they speak in their native language, I hear it in English.
So, this is sort of Star Wars-like in terms of the translation capability, but we are not very far away. I predict it will only be a handful of years before it will be possible to have 3D telepresent meetings with fairly photorealistic avatars, including the ability to do real-time multilingual translation. Of course, when you do the translation, if I'm sitting here speaking in English and it wants to come out in Mandarin, the entire face movement is completely different. So, you can't just take my face and map it onto the avatar; you actually have to synthesize what the right positions of the mouth, teeth, and tongue are while you're speaking the other language.
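The pipeline being described has a strict ordering, and the point about the face is that the mouth animation has to be driven by the synthesized target-language speech, not by the speaker's own mouth. The sketch below only shows the shape of that pipeline; every function in it is a hypothetical placeholder, not a real API.

    # s2s_pipeline_sketch.py -- shape of a speech-to-speech translation turn; every
    # function here is a hypothetical placeholder, not a real API.
    def recognize_speech(audio, source_lang):            # placeholder ASR stage
        return "hello everyone"

    def translate_text(text, source_lang, target_lang):  # placeholder translation stage
        return "<" + target_lang + " translation of: " + text + ">"

    def synthesize_speech(text, target_lang):            # placeholder TTS stage
        return b"audio-bytes-for:" + text.encode()

    def mouth_animation_from_audio(audio):               # placeholder viseme generation
        return ["viseme frames derived from the synthesized audio"]

    def telepresence_turn(audio_in, source_lang, target_lang):
        text = recognize_speech(audio_in, source_lang)
        translated = translate_text(text, source_lang, target_lang)
        audio_out = synthesize_speech(translated, target_lang)
        # The avatar's mouth is animated from the synthesized audio, because the
        # speaker's real mouth is moving for a different language.
        return audio_out, mouth_animation_from_audio(audio_out)

    print(telepresence_turn(b"raw microphone audio", "en", "zh"))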
So, we’re really making huge strides in all these areas, and it’s only through the integration of all these different research areas, and the ability to engineer these things, that I think we have such incredibly powerful capabilities in the future, and I really look forward to that.
So, that’s all I have for the presentation. So, let me stop there, and before we go to Q&A, which we’ll do for 20 minutes, we have some give-aways to reward those who happened to show up here. Thank you, ma’am. Thanks. (Applause.)
END