Craig Mundie: McGill University

Remarks by Craig Mundie, Chief Research and Strategy Officer
McGill University
Montreal, Quebec, Canada
October 7, 2011

GREGORY DUDEK: Well, good afternoon. Thank you for coming. My name is Gregory Dudek. I’m the director of McGill’s school of computer science. I just wanted to thank you for joining us here.

And before we get started, I wanted to say a few words, before we hear the talk, “Converging Worlds: A New Era in Computing” by Craig Mundie, who is the chief research and strategy officer for Microsoft.

I think McGill is probably an ideal choice for this kind of talk, because McGill has had a tremendously influential role in computer science in Canada. Arguably it has the highest impact professors per capita of any school in Canada, and I think it’s a great pleasure to have Craig here.

Currently, researchers at McGill are working on many different areas of computer science and software engineering, including software evolution, human-robot interaction, online learning environments, computer-generated models of the human heart to assist in surgery, sensor networks. And that work is not only happening in departments of computer science and electrical and computer engineering, but it’s happening really all over the university in all the different faculties, in the school of music where there’s an ultra-videoconferencing system for doing real-time musical collaboration across national boundaries, in the hospitals where brain imaging and the modeling of heart tissues is an ongoing practice, and in many other parts of the university.

Craig’s visit is a great opportunity for some of our best and our brightest professors and students to share some of the cutting-edge research that people are doing here, and see some of the game-changing ideas that he’s been sharing with us.

It’s a great pleasure to introduce him. As I said, he’s the chief research and strategy officer at Microsoft. In that role he oversees one of the largest computer science research organizations in the world. And as you know, Microsoft is not only a tremendously powerful and important company, but it’s also a company that’s participated in essentially transforming much of our society and much of our economy over the last couple of decades.

Craig spent much of his career building startups in several fields, including supercomputing, consumer electronics, health care, education, and robotics.

For more than a decade he’s been Microsoft’s principal technology policy liaison officer for the U.S. and for several foreign governments, including China and India.

Another longstanding focus of his is privacy, security, cybersecurity, which we all know is an emerging issue of enormous importance, and he’s served on the U.S. National Security Telecommunications Advisory Committee and the Markle Foundation Task Force on National Security in the Information Age. And in 2009 he was appointed to U.S. President Barack Obama’s President’s Council of Advisors on Science and Technology.

Please join me in welcoming Craig Mundie to McGill University. (Applause.)

CRAIG MUNDIE: Good afternoon. Thanks for coming and spending a little time with me this afternoon.

In the next hour what I want to do is a couple of things: share with you a little bit about the ideas we have about some of the major trends that are happening in the computing field, and a little extrapolation of what those trends might mean for the world that we’re going to live and work and play in over the next 10 years or so.

My visit to McGill is one of a series of them that I’ll do this year, but is actually part of a longstanding commitment to this kind of interaction that Bill Gates and I both had fancied for many years. It’s probably well more than a decade that I’ve been making these annual trips out to visit universities, and I do them both in North America and around the world as I travel around doing a lot of the policy work.

In the last few years, you know, I’ve spoken to many U.S. universities but also schools in Japan, Korea, Russia, China, India, and other places.

So, the reason to do that is for me to get grounded each year in what’s happening on campuses to understand the view of faculty and students about what’s on their mind, their own sense of technology and its evolution, and to make sure that I don’t stay in Microsoft as sort of an ivory tower kind of environment, informed by our own research and business activities, but perhaps lose sight of what’s happening in the broader environment.

So, these are intended at some point to be an interactive discussion. Each time I come, I try to also visit the administration of the school, and I met with the principal here today, and also have some roundtables with students and faculty, and we had three of those this morning as well. So, I’m beginning to get a little flavor of what McGill is about and what’s on people’s minds here.

One of the things that I note and want to reinforce is what I think is the importance of multidisciplinarity in the future in problem-solving, and it’s clear that starting with the principal and the faculty, and even the culture of the institution here, there is an orientation to that, although people also recognize there are some impediments to it. But I just want to begin by encouraging everybody to think that this idea of cross-fertilization between the technology people and the liberal arts-type people and the more embedding of the computer environment into many of these problem spaces is going to be important in the years ahead.

There’s going to be three parts of the talk that I give today, and they will be first about what we euphemistically call big data. Obviously, data has been around a long time. Science has evolved over the centuries from one that was more theoretically derived when there wasn’t a lot of instrumentation to a more experimental method, and then you could say we went through a period where we used the increasingly powerful computers with an orientation toward modeling, and trying to correlate the models with experimental results.

But the last few years have been an interesting time in that not only have computers become more powerful in the computational sense but the scale of these assets and their ability to store and manage huge amounts of data, and perhaps importantly to do a new form of analytics on these very high scale datasets really has created a completely new opportunity. And this is finding application not just in the fields of science and engineering, but, in fact, many aspects of even running businesses today are being informed by this big data kind of environment.

And certainly in places like Microsoft where we build these very high-scale services for Web operations and Internet scale advertising systems, for example, the ability to use this big data to inform what we do, to use machine learning in almost a real-time basis to adjust the operation of these systems are all a key part of what we’re about.

The second part of this is going to talk a little bit about the convergence, if you will, or the increasing interaction between the physical world and the virtual world.

There have been special cases, I’ll say, of the virtual worlds that have become popular. An example would be 3D gaming and game consoles for entertainment purposes. But many of these technologies are now finding other ways. And as people get novel display capabilities in their home and workplace, like 3D displays, there’s going to be more and more of an opportunity and more and more of a need to figure out how people can move with ease between the physical world and the cyberworld, and I’ll talk a little bit about that.

The last, and arguably for me maybe even the most important, will be to talk about the transition of the man-machine interaction from what we historically have called the graphical interface or GUI to a natural user interaction model or NUI. And while NUI won’t make GUI go away, it’s pretty clear that this is going to be a fundamental change in the way that people will interact with computers.

So, let me go through these things in that sequence and share with you some thoughts.

One of the things that has been available for a long time to people are classical tools at the desktop, and things like Excel spreadsheets and others are very powerful in that it allows people to have sort of direct expression of the problem they want to solve or the model they want to build, but there have been some practical limits both in the computational scale of a personal computer and even some of the tools that started out in life thinking that they were just an overgrown version of what people used to do in paper, and hence the spreadsheet paradigm itself, and now we find that it’s a very powerful way to operate on things.

And so the question is, how do we start to use these increasingly powerful capabilities and tools that people are familiar with in order to ingest and operate on or control the processing or analytics on these very, very high-scale datasets.

So, let me show you how that is evolving in our mind. So, if I put up here, this in a sense just looks like a regular old spreadsheet, but one of the things that we’ve now added is this ability, shown on the right up here, to import data from these data marts.

What’s happening is that more and more institutions, either government or business, are taking these datasets and making them available on the Internet. For example, in the U.S. the U.S. government made a decision a couple of years ago at the beginning of the Obama administration that they were going to take all the nonsensitive or nonclassified datasets that the U.S. government had, and they were going to publish them under an initiative they called, and they’ve made some significant progress on that even in the last couple of years.

So, data that used to reside or be sequestered in these servers in a government or a business but have great value if made more broadly available are starting to trickle out. The high-scale Web services provide a repository where we can both store these things, make them available, either for free or even under commercial terms, and the question is, how do you get a hold of them, how do you combine them perhaps with data assets of your own, in order to reach business decisions or make scientific advances.

So, in this case we now have the ability literally to just subscribe, much as you’d subscribe to an RSS feed or something else, you can now subscribe to commercial or noncommercial datasets that live in the cloud. And in these facilities now in this case there’s a bunch of them represented: census data, demographic data. For this demo I’m going to show you one using historical weather data.

So, I’m going to say I just want to import the data, and because we’ve already done it a long time, I’m just going to limit this to say 400 items — there are tens of thousands of them in there — and tell it to import the data.

And so what this has done is gone out on the Internet, sucked up 400 items out of this data in a random sample, and populated this as a database. Now, all the tools that I would have to deal with a database are available to me.
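As a rough illustration of the import step just described, the sketch below draws a random sample of 400 records from a much larger collection. The `catalog` list is a made-up stand-in for a hosted dataset, and `sample_items` is a hypothetical helper; neither reflects how the spreadsheet add-in shown in the demo actually works.

```python
import random


def sample_items(records, k, seed=None):
    """Return up to k records chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    records = list(records)
    if k >= len(records):
        # Fewer records than requested: return everything.
        return records
    return rng.sample(records, k)


# Hypothetical stand-in for a large hosted weather dataset (values are made up).
catalog = [{"station_id": i, "precip_mm": (i * 37) % 100} for i in range(20000)]

# Limit the import to 400 items, as in the demo.
subset = sample_items(catalog, 400, seed=7)
print(len(subset))
```

Passing a `seed` makes the sample repeatable, which is handy when the same subset has to be shown more than once.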

Now, I actually took a bigger chunk of this, because I can’t always be assured that the Internet connections are great, and I imported them from exactly the same dataset into this second sheet.

So, here’s the same data but a much larger sample of it, and with that I can do normal things that you think you would do at a desktop level. So, for example, I can build a pivot table and then make charts out of it.

So, here what I actually did with this data is I charted the five — my home, Seattle, and the four places where I was going to give this particular kind of lecture this week and next week, and they represent these different lines.
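To make the pivot-table idea concrete, here is a minimal sketch in plain Python that totals precipitation per city and year, which is the shape of data a per-city line chart would be drawn from. The rows, the field names, and the `pivot_sum` helper are all illustrative assumptions, not data or functions from the demo.

```python
from collections import defaultdict


def pivot_sum(rows, index, column, value):
    """Sum `value` for every (index, column) pair, like a basic pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[index]][row[column]] += row[value]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(cols) for key, cols in table.items()}


# Made-up observations; the real dataset is not reproduced here.
rows = [
    {"city": "Seattle", "year": 2009, "precip_in": 3.0},
    {"city": "Seattle", "year": 2009, "precip_in": 2.5},
    {"city": "Seattle", "year": 2010, "precip_in": 4.0},
    {"city": "Montreal", "year": 2009, "precip_in": 1.5},
    {"city": "Montreal", "year": 2010, "precip_in": 2.0},
]

pivot = pivot_sum(rows, index="city", column="year", value="precip_in")
for city, totals in pivot.items():
    print(city, totals)
```

Each city's dictionary of yearly totals corresponds to one line on the chart.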

Now, what’s interesting is I’ve never looked at — this happens to be precipitation data over I think the last three years or five years, and I’ve never looked at this before. But when you look at it, you can see in these highlighted areas here, the red one and the blue one, that there’s some interesting anomalies. You know, you would think weather would be relatively predictable, or total precipitation over long periods of time, but suddenly you see like this vertical spike here, and actually in a different city a vertical spike here.

And it kind of shows the power of what we can do now by using perhaps visualization to ask a question, and then the other assets of the Internet to answer the question of why did that happen.

Now, it turns out this was Evanston, Illinois. I spoke at Northwestern University. And it turns out if you do a little research, you find out that at that time that year there was this very unusual occurrence where a hurricane actually went inland in the U.S. right up through the middle of the country and dumped this huge amount of precipitation there, and that was a real anomaly.

It just shows the power that comes now from having all this data around, and the ability to correlate it or learn interesting things where you’re looking across these different datasets.

Similarly, it turned out there was another anomaly at a slightly different time period, and this was I think in Atlanta, and sure enough they had a weird storm and it flooded a big part of Atlanta.

So, you can start to see these correlations between anomalous events and your ability to find other things that will help explain them. And while this is a very simplistic one to show in a demonstration, we think that that’s going to be increasingly powerful.
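The spikes in the demo were spotted by eye on a chart. One simple way a program might flag the same kind of anomaly is sketched below: it marks any reading that sits far above the median of the series, measured in units of the median absolute deviation. The readings, the threshold of 5, and the `find_spikes` name are assumptions chosen for illustration; the talk does not say which method, if any, was used.

```python
from statistics import median


def find_spikes(values, threshold=5.0):
    """Return indexes of values far above the median of the series.

    Distance is measured in units of the median absolute deviation (MAD),
    which is less distorted by the spike itself than a mean and standard
    deviation would be.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # More than half the readings are identical; no scale to compare against.
        return []
    cutoff = center + threshold * mad
    return [i for i, v in enumerate(values) if v > cutoff]


# Made-up monthly precipitation totals with one storm-like spike.
monthly_precip = [2.1, 1.9, 2.4, 2.0, 9.8, 2.2, 1.8, 2.3]
print(find_spikes(monthly_precip))  # -> [4]
```

Once a spike is flagged, its date and city become the question to take to other sources, as with the inland hurricane and the Atlanta flood.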

But as the datasets get bigger and the time horizons over which you want to look at these things, or the geographic scale, for example, that you want to do these observations gets bigger, we said there needs to be a better way to do this, and frankly to allow people to get more involved.

So, a few years ago, in Microsoft Research we started to build an engine that originally surfaced as the WorldWide Telescope, where, working with the global astronomy community, we had basically taken every astronomy related image, multispectral and everything else, and essentially put it in one big repository, and then allowed you to navigate through space and look at any particular image or discover images that related to phenomena there.

But the system was quite generalized, and so here is an example where we’ve now taken this thing, applied it to the geo model of the earth, and can now put big datasets into this model where it can analyze them and render them for a more sophisticated form.

What’s nice about this is it also allows people who do the analysis to make sort of a recorded tour of their analysis and their observations. So, here I’ll start playing back one that somebody made where they say, okay, this is the precipitation in the western half of the United States, we’re going to go over here and look at it. Well, it’s kind of interesting that there’s two almost extraordinarily big peaks of precipitation and a valley in between.

But you can stop these things at any point. You could jump around to different things where people kind of have bookmarked different elements of the analysis, and commented on them.

And, of course, you can actually navigate yourself. So, for example, if I wanted to move over here closer to like my home in Seattle — everybody thinks it’s really wet there, but I was really happy when I looked at this to find out that while I think it’s damp, it’s nowhere near as wet as the mountains to the west and the mountains to the east, and these kind of things you can really quantify or learn by looking at these very powerful tools.

And so these kinds of things are now being released to allow people to look at very large datasets over very long periods of time. So, we, for example, have another one of these where we’ve taken all of the world’s seismographic events that have been recorded that were pooled together over I think a 50-year period, and you can essentially play them like a movie, and watch essentially every earthquake big and small happen across the entire planet over that time scale.

What’s fascinating about that is that by looking at these patterns over time, you can actually see where the fault lines are, in many cases where they didn’t know they were before. And the only reason you see them is that nature has given you a history that if you can look at it in the aggregate allows you to see they’re all on a line. But if you just happen to look at them one at a time, you might never notice that they had that particular pattern.

So, I think again these are just relatively straightforward examples, but I think it’s these kinds of tools that are going to have to be applied to the data.

But the last part that has become I’ll say even more important is the idea that while visualization is powerful, that certain patterns in these very high scale datasets can actually be detected by big machines in a way that humans can’t do, even with the best visualization tools. And so another thing we’ve been doing is thinking about how to apply these machine learning technologies to these very, very large datasets.

Similarly interesting is that many of the datasets that are coming from sensors now are not merely producing strings of numbers, and so the bigger question is, how in a more complex form of data, for example in this case human physiological data, represented in X-rays or tomography scans, how can you have the machine help you with these kinds of things?

What you see here is actually part of our Amalga imaging product, and in the past a trained radiologist or a physician who wanted to look at this sort of volumetric representation of an individual would be given tools like these top two and bottom left panels, which is each a different axis of looking at these scans.

So, you have these very thin slices, and as you scan around, here I can move vertically through the thing, here I can move front to back, and here I can move side to side.

So, indeed you could find any particular point in the body, but at any given instant all you’d get to see is that point at that slice and the other things around it. Hard to gain real insight, although people did some amazing things with a lot of training. In a sense it’s a bit like learning how to play a videogame. If you do it enough, you get good at manipulating or thinking in that space, and you can gain insight.

To facilitate that some years ago, people started saying, well, let’s just put this stuff into a sort of 3D voxel-based model. In essence you’re taking slices and reintegrating them.

And you start to get people who were building things like this on the lower right, which said, okay, that’s a full volumetric reconstruction of everything that was in there. And maybe you could take it and turn it around and look at different things, but it was hard to see things in isolation. And it gets particularly hard when every human is slightly different in size and orientation internally.
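The relationship between the slice views and the volumetric model can be sketched in a few lines: stack the 2D slices into a 3D grid of voxels, then cut that grid along each of the three axes. The sketch assumes the input slices are taken top to bottom and indexes the volume as `volume[z][y][x]`; the function names and the tiny example volume are hypothetical and have nothing to do with the imaging product's internals.

```python
def stack_slices(slices):
    """Stack equally sized 2D slices into a voxel grid indexed volume[z][y][x]."""
    rows, cols = len(slices[0]), len(slices[0][0])
    for s in slices:
        if len(s) != rows or any(len(r) != cols for r in s):
            raise ValueError("all slices must have the same dimensions")
    return [[list(r) for r in s] for s in slices]


def axial(volume, z):
    """Slice at a fixed height (moving vertically through the body)."""
    return volume[z]


def coronal(volume, y):
    """Slice at a fixed front-to-back position."""
    return [plane[y] for plane in volume]


def sagittal(volume, x):
    """Slice at a fixed side-to-side position."""
    return [[row[x] for row in plane] for plane in volume]


# Tiny made-up volume: 2 slices of 2 rows by 3 columns, value = 100*z + 10*y + x.
slices = [[[100 * z + 10 * y + x for x in range(3)] for y in range(2)]
          for z in range(2)]
volume = stack_slices(slices)

print(axial(volume, 1))     # -> [[100, 101, 102], [110, 111, 112]]
print(coronal(volume, 1))   # -> [[10, 11, 12], [110, 111, 112]]
print(sagittal(volume, 2))  # -> [[2, 12], [102, 112]]
```

The three functions correspond to the three panels described earlier; the volumetric view uses the whole grid at once.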

So, we decided to see whether we could take these same machine learning techniques and apply them to this type of nonnumeric data in order to teach it to do what trained radiologists are able to do who understand the anatomy, and that’s to teach the machine to detect all the organs.

And so what we actually have here is a machine that has learned how to find the organs in an arbitrary set of these slices, and I’ll actually run it on this particular body. What it does, you can see on the left it pops up a panel of all the things it was able to identify.

And so I can take any one of these things — let me go down here a little bit to, okay, how about the pelvis, let me see the pelvis. So, there I now have all the elements of the pelvis of this person. They’ve basically not only been isolated within the traditional view, but in the 3D model they’ve all actually been extracted out, highlighted, and this whole thing essentially now becomes a manipulable model, too. And I can change the way it renders, I can do all kinds of things. I can manipulate this around and look at it.

So, for people who now have very little training, other than maybe your physician who really wants to look at a bone fracture or something here, no longer has to struggle either to find it in your particular set of scans or to figure out how he can answer the question in his mind by somehow manipulating all of these different slices.

Let’s see if there’s another one that’s interesting. There’s like this, you can look at the lungs, you can look at the gallbladder.

Now, what’s amazing about this is that you can feed it now any particular individual’s voxel representation of all these scans, and it will find the organs in that particular body without anybody helping it.

And so all the problems that people historically thought about, oh, these things all have to be carefully registered or they have to be guided by a human who could see these things is really just not true anymore, and this kind of processing we think is going to be super important.

So, I brought you just a video where we’ve used this tool through a set of different cases. So, let me run the video, and I’ll just add a few more thoughts while you can look at the kind of things that are being done with this kind of technology. So, go ahead with the video.

So, here this is actually the aorta and the kidneys. If you want to look at them as a unit, they can be highlighted and extracted and the rest of the body faded out. You can change the transfer function to highlight things like the interior structure of those things. And here you’re just moving sliders and clicking on things that are interesting. Here’s a lesion in the right lung, which you can isolate or look at in that context; the liver and spleen. And you can see there’s a lot of fairly fine detail that is correctly identified as belonging to each of these macro structures within the body.

The other thing that this has enabled us to do is we can now specify searches by common visual elements. So, for example, if I go to the one which was that guy’s lung who happened to have a lesion in it, I can say, hey, find me all the other people in our dataset who had a lesion like this one in that position in the lung, and it will essentially go out and do a search, not because you could in some mathematical or verbal way describe it, but you can say, I want all the people who had that problem, show me what we did for them.

And so this type of visual and image searching I think is going to also be powerful not only in the medical field but in many others. We actually are doing some research now where in general Web-based images you can do essentially little sketches yourself of things, and it will find images on the Web that correlate in some high way with the things that you’re drawing.

If you move beyond the big data problem into this idea that we’ve got a physical world and a virtual world, you know, that’s sort of one medical instance of it, where people need to move back and forth between the two, to a more general world where people are out wandering around with lots of sensors in their pocket, computers on their person, and lots of activities that they want to engage in and can be supplemented by that.

In the past, much as is true in trying to create let’s say three-dimensional images of things in the past, you would have to go in and sort of give hints to the computer. You could say, hey, you know, the human says that there’s these points are all known to be part of your pelvis. So, now you go try to figure out what the rest of the pelvis was. But you had to help it, you had to hint it.

Today, we’re at a point where we’re still sort of hinting at how we want the computer to help us move back and forth between the physical world and the virtual world, and I think this hinting is important and in some ways can be very useful in task automation.

Another big thing that we’re focused on is the idea that with all these things, like Internet search, people don’t search because they want to search, they search because they’re trying to do something. And the question is, how can you ultimately learn more and more about the kinds of things that they want to do, and then just have the computer help them get it done more directly. So, in some cases this tagging mechanism is both helping to create a facile way to move back and forth, but is also a way of creating automation in some of these tasks.

A few years ago, we in Microsoft Research created a tag concept, like a UPC barcode on steroids, and there are other similar related kinds of things like QR codes, but these codes are kind of unique in several respects. One, they have a very, very large data space in terms of unique code numbers. Two, the code is actually interpreted by an intermediate Web service that actually determines on a dynamic basis, well, what should you do when you see this tag. And three, the tags can actually be printed on things that actually look like human recognizable images. So, in terms of the amount of surface area of something that has to be consumed by a tag that only the machine can see you get rid of some of that problem.

So, to highlight what people are doing with this, and you might see it today like the USA Today or many of the magazines now when you read the articles you see the article has a little tag at the bottom, might be this one or it might be somebody else’s kind of tag, but in general what they’re trying to do is automate the idea that you can go from the thing you’re looking at, like a magazine page, to a related article or a video or whatever it might be, literally with one button push.

So, let me show you how this was applied by a whole activity, which was a music festival down in Savannah, Georgia, and you’ll get the idea of how this annotation through tagging is becoming more popular.

Go ahead and run the video.

(Video segment.)

CRAIG MUNDIE: So, in this kind of environment, in fact, I brought a little poster that had some tags on it in case anybody wants to try them afterwards, some things I’m going to talk about in the talk have little tags over there.

Retailers are taking these things and putting them on packaged goods, and because they can change every day what the tag does, one day it might teach you about the package or the product, the next day it might enter you in a contest, and the next day scanning the same thing in your store might offer you a coupon.
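The key design point here is the indirection: the printed tag carries only an identifier, and a service decides at scan time what that identifier means. A minimal sketch of that idea follows. The `TagResolver` class, the tag identifier, the dates, and the action names are all hypothetical; this is not the actual tag service or its API.

```python
from datetime import date


class TagResolver:
    """Toy lookup service: maps a tag ID to whatever action is current today."""

    def __init__(self):
        self._rules = {}  # tag_id -> list of (start_date, action), sorted by date

    def schedule(self, tag_id, start, action):
        """Make `action` the tag's behavior from `start` onward."""
        rules = self._rules.setdefault(tag_id, [])
        rules.append((start, action))
        rules.sort(key=lambda rule: rule[0])

    def resolve(self, tag_id, today):
        """Return the most recently started action, or None if none applies."""
        current = None
        for start, action in self._rules.get(tag_id, []):
            if start <= today:
                current = action
        return current


resolver = TagResolver()
resolver.schedule("tag-12345", date(2011, 10, 1), "show_product_info")
resolver.schedule("tag-12345", date(2011, 10, 2), "enter_contest")
resolver.schedule("tag-12345", date(2011, 10, 3), "offer_coupon")

print(resolver.resolve("tag-12345", date(2011, 10, 2)))  # -> enter_contest
```

Because the mapping lives in the service and not in the printed code, the same package on the shelf can behave differently each day without being reprinted.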

And I think the same thing is being applied to business cards now. On my business cards and many other people’s, they print a little tag on the back that’s personalized, and you can automate things like taking them to your Web page or putting your contacts into your phone.

So, the phone is obviously the most personal computer these days, and is becoming quite powerful. If we take these techniques that we’ve been applying with more powerful PCs in the past and start to apply them to the phone, then we think we’re at a point where we can start to move beyond this need to have hinted specific things to enable it like this music festival or a retail shopping experience, and we can start to have the phone recognize things in a more unaided way.

So, I brought one of these new Windows Phones, and if you can put it up on the screen, you’re just seeing a projection directly out of this phone through a cable.

So, I’m going to go to the search page, and what we’ve done now is integrate into this — and the phone knows where it is, it knows some of my history, it actually figures out I’m in downtown Montreal there. So, all this context is being applied to searches and other activities.

But we’ve also now got sort of eyes and ears on the phone in the form of cameras and microphones, and instead of just taking pictures and posting them to your Facebook page, we want to do more and more sophisticated things with these cameras and microphones.

So, let me start and show an example of — and some of these things you could do with individual applications in the past, but now what we want to do is integrate it more into the fabric of the phone.

So, in this new version of these phones you see the little icons at the bottom, there’s a microphone, an eyeball, a music note, and those things actually are built-in functions to hear voice commands, to see and recognize things and take actions, and, in fact, to just recognize an arbitrary piece of music as it plays. But we also use these Web services now to really aid the interpretation of voice commands.

So, I’ll try, even in this highly reverberant environment with a microphone, to give it a simple voice command anyway, and see what happens.

So, let’s say I want to find movies here near the campus. “Movies.”

So, it takes movies, sticks it into the search engine. That takes you to essentially results that it now presents not as a bunch of Web links but essentially some interpretation of things that relate.

So, here’s the movies that are playing near us right now. If I want to get more information, I can click on them. It gives me the whole number of movies. I might want to pick I guess the Ides of March, and say what do I know about that? It goes out, comes back, gives me information about it. One swipe and I get all the show times. One click I can buy the tickets. One more swipe and it basically has identified all the apps on the phone that relate to the movies.

So, you don’t have to go navigate over and over again to do these things. What did you want to do? You wanted to find a movie, buy the tickets, perhaps look up information related to it, and you want to do that in a minimum number of commands and clicks or touches.

So, as we make the computer able to do more of these things, once it decided that movies was not just a random thing it should try to make a link for but that there was some context about movies and what do you do with movies, then you’re able to produce this kind of automated capability.

So, let me kind of move beyond the ears to the eyes, and say, you know, what kind of new things are we doing there.

So, here I have two things, a book and a menu, and I’m going to say I want to scan this. So, if I just hover over the book, it will essentially start to recognize it. And when it does, it will go out and do some kind of Web activities to see what it can find about this book.

All right, this thing died. We’ve actually had some trouble with the Internet connection here. So, I have a video that I’ll show you that I recorded earlier of this.

So, here if you actually scan the book, it actually produces Web pages that give you the links of the things you can do with the book. So, just like movies, you’ll be able to do the same kind of thing about the book. You can actually click and buy the book, you can essentially end up at and do all that kind of stuff.

So, I’ll show you one more thing that I will do, which is how it actually can handle translation.

One of the things that we think is really interesting is the ability to not only have this phone look at information, but if you walk up and take a picture of a menu, and here I was doing this at a different conference, you look at the menu, which is exactly the same one I have here, and you tell it to scan the text, it finds it. And then you tell it to translate the text, and what it does is translates it and overlays it directly on top of the menu. And in doing so it also figured out in that case that it was French. I didn’t tell it it was French, it figured out that’s French and I want to know what it is in English, and it will automatically do that kind of conversion.

So, this ability to move back and forth between the physical world and the virtual world and to have the computer help you create — or solve these more complex tasks is going to get better and better, and this is just the beginning of what we can do.

The next part of this that I want to talk about is this transition from graphical interfaces, in this case on the phone, I’m still largely driving the phone primarily as a touch-based graphical device. With the addition of these kinds of things you’re starting to move to a more natural interaction. I didn’t have to type in the movie I wanted into a little keyboard, you know, I could speak to it. All of these things are endowing the computer with more humanlike senses, and the ability to process that information. And in that environment we want to go way beyond that, not just the simple things we could do there.

So, this quest for the natural user interface really reached a new milestone for us about a year ago when we launched Kinect. Kinect is essentially a three-dimensional vision system and an array microphone for the Xbox. We decided to try it first in that environment, because we had a dream, a very direct dream about how to translate this machine vision and speech into the game environment, and it became the most successful consumer electronics device ever launched. It got the Guinness Book of World Records for the fastest zero to 8 million of anything that’s ever been bought.

And I think it showed that it really resonated with a lot of people, in fact people who weren’t the classical gamers. In the past, the Xbox community was dominated by males between the age of 12 and 30, but once Kinect was out there it turned out that the demographic expanded to almost be gender neutral in terms of the new things that people were buying and doing, and the age range became very broad.

So, many of you have probably seen it, but I’ll just show one short clip of what the new kind of Kinect games are in the classical deployment of this technology. So, run that video for me.

So, here this is Dance Central 2; it’s about to come out. It turns out Dance Central is one of the games where you would get up, one or more people, and the thing maps your motion directly onto your avatar on the screen, which is dancing alongside what turned out to be avatars that were professional dancers. You could do this from very basic to extremely sophisticated dance moves, and people would get up.

Interestingly, last year, this title was the fastest-selling title, and I guarantee you the old Xbox community wasn’t the one buying the dance title. It just shows that you really can get a lot of people engaged in activities that are very, very different.

But one of the things that we had anticipated, because we had a broad interest in this idea of machine vision and speech-based interaction, was that obviously other people would, too, and we had laid plans to produce a development kit that would allow people to take the sensor, which had a USB plug, and take it off their Xbox and put it into their personal computer.

We had a plan to do that, but within one week of this thing going on sale, around the world people went out and said, no, this is so exciting and it’s so cheap, $149, you know, we just want to play with this, and they started within a week to write their own drivers, to reverse-engineer the bitstream and other things so that they could start to use it. Rapidly there became a whole section of YouTube called “Kinect hacks,” and people started to post the kind of work that they were doing or the experimentation they were doing.

The rate at which this was going was really phenomenal, and yet we found that indeed many of the things that we’d worked hard to perfect, like very high-quality skeletal animation and detection, in the product that we gave to the game developers we had the sensor so that it would do four people simultaneously, 42 major joints, at 30 hertz, and it had facilities for recognizing one person from the next. So, if one person sits down in the game and another person stands up, you don’t really want to say stop and login. So, there’s a lot of things to recognize one person from the next.

And it turned out the community never really was able to get the array microphone working in any substantive way. It turns out that’s a very complicated problem to make it work and do stuff that’s interesting, like beamforming so that it only listens to your speech.
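(Editor's sketch, not part of the remarks: the simplest form of beamforming is delay-and-sum, in which each microphone channel is shifted so that sound from one chosen direction lines up, and the channels are then averaged. The Python below is a minimal illustration under stated assumptions: a linear array, a far-field source, whole-sample delays, and hypothetical microphone positions. It is not the Kinect's actual audio pipeline, which the talk does not describe.)

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed room-temperature value


def steering_delays(mic_positions_m, angle_rad, sample_rate_hz):
    """Whole-sample arrival delays for a far-field source.

    angle_rad is measured from broadside of a linear array; a positive
    angle means the source is toward the +x end, so microphones with
    larger x hear the wavefront earlier. Delays are shifted so the
    earliest microphone has delay 0.
    """
    raw = [-x * math.sin(angle_rad) / SPEED_OF_SOUND * sample_rate_hz
           for x in mic_positions_m]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]


def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay, then average.

    Sound from the steered direction adds coherently; sound from other
    directions is misaligned and partly cancels.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    count = len(channels)
    return [sum(ch[i + d] for ch, d in zip(channels, delays)) / count
            for i in range(length)]


# Hypothetical three-microphone array with 10 cm spacing.
mics = [0.0, 0.1, 0.2]
rate = 3430  # chosen so that one sample corresponds to 10 cm of travel
pulse = [0, 0, 1, 0, 0, 0, 0, 0]

# Simulate a click arriving from the -x end: each microphone hears it
# one sample later than its neighbour.
recorded = [pulse, [0] + pulse[:-1], [0, 0] + pulse[:-2]]

delays = steering_delays(mics, -math.pi / 2, rate)
steered = delay_and_sum(recorded, delays)
unsteered = delay_and_sum(recorded, [0, 0, 0])
```

(When the array is steered toward the click, the three copies line up and the pulse keeps its full height; without steering, the same pulse is smeared across three samples at one third of the height. A practical system would also need fractional delays, echo cancellation, and noise suppression, none of which is shown here.)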

So, we released a kit in June, and there was an immediate spike not only in interest but in the sophistication of some of the things that people could do.

So, I brought a little demo reel of things people have posted to give you an idea of the range of creativity that has been unleashed by saying that these devices, which historically if you were a researcher in this field, either in application domains like architecture or mechanical engineering or 3D reconstruction kinds of things or a computer scientist who wanted to play in computer vision or speech control, the cameras used to sell for between $30,000 and $100,000. So, the number of people who could have one was really very small. So, when this came out at $149, it just unleashed all this creativity.

So, let me run this video, and you’ll see. Here’s a guy who’s actually got graphic shaders, and he’s controlling the shader in real time by his gestures, and you see that through the lens.

These are people with a virtual chessboard where you move the pieces by walking them around.

This guy made his own 3D car racing game that he steers with his hands and uses the pedals, even though they’re just in the air.

This is sort of a person working on a holographic kind of projection, and he wanted to fly his helicopter, and he uses essentially the Kinect gestures to do it.

This is a multimedia display system, and he uses commands, voice commands and gestures to make selections.

This is the WorldWide Telescope I talked about, and I showed you the rain application or precipitation, but here it’s a 3D one, and he can move around in space that way.

This guy actually made his own football game where he makes the play calls verbally, and then uses the gestures of the game to actually animate the game.

This maps a human skeleton onto a robot in a fairly complete way.

This guy had a Barcalounger that he wanted to drive around with gestures. (Laughter.)

This is a thing called Eddie. We worked with a company who basically makes robotic kits, and they made the base, which is actually the motor system and the sensor array, but then there’s a rig on top where you can just plug your laptop in and put your Kinect on top, and build essentially a full function robot, and that’s become very popular for robotics studies.

This guy has different visual effects he likes to create.

So, I mean, they range from the fun and artistic to people who are doing serious work in robotics, and there’s many, many more.

So, we’ve been thinking, well, how do you start to take this capability and apply it to a broader array of things where again you don’t want people to have to master so much the tool orientation of these things to get value, you want them to be able to do higher level capabilities.

So, we did a little hacking ourselves in the research group, and so this is a Kinect camera, and what we set out to do is to hook this up to a personal computer and use algorithms that allow us to in real time construct 3D models of anything that it scans. And again so there’s nothing yet.

So, I brought this pot here, and if we can get this thing going, as I stand here and just scan the pot, on the top you see the raw sensor data, on the right it’s sort of color-coded for depth, on the bottom right it’s started to construct the actual 3D model of the pot. And as I move around it, you can see things that were missing just get filled in. And the longer I hover around any particular thing, the more detail that it will add to the model.
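(Editor's sketch, not part of the remarks: the first step in turning a depth image into 3D geometry is usually to back-project each depth pixel through a pinhole camera model into a 3D point. The Python below illustrates only that step. The function name and the intrinsic parameters fx, fy, cx, cy are hypothetical; the real-time fusion of many frames into a single model, which is what the demo shows, is not attempted here.)

```python
def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image into 3D points in metres.

    depth_mm is a list of rows of depth readings in millimetres, with 0
    meaning "no reading". fx and fy are focal lengths in pixels, and
    (cx, cy) is the principal point. All four are assumed calibration
    values, not figures from the talk.
    """
    points = []
    for v, row in enumerate(depth_mm):
        for u, d in enumerate(row):
            if d <= 0:
                continue  # skip pixels where the sensor saw nothing
            z = d / 1000.0
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth image with two valid readings.
cloud = depth_to_points([[0, 1000], [2000, 0]],
                        fx=500.0, fy=500.0, cx=0.5, cy=0.5)
```

(Each valid pixel becomes one point; moving the camera and merging the points from many such frames is what lets missing regions fill in over time.)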

In the last step — go ahead and turn on the RGB overlay. So, now we’ve actually taken the RGB camera from the sensor, and in real time texture mapped it onto the model. So, what you’re looking at there on the bottom left is not a photograph or an RGB image, it is actually a computer-generated 3D model, texture mapped, with the image that’s captured by the RGB camera all in real time.

It wasn’t very long ago people thought this was close to impossible, and we can do this with a $149 sensor and a regular PC.

So, this really begins to open the door for another completely novel range of applications. While this is simplistic, I’m now going to come over here, I’ve got a Kinect set up on a PC, and so here I have my virtual potter’s wheel.

With the Kinect camera this application is set up to just look for my arms on my skeleton, and my hands, and then do to the virtual pot what a real potter would do if they took this and were trying to throw the pot or change its shape. So, as I essentially raise my hands and bring them in, I can start to change the shape of the pot. I can say, okay, well, how about that, I like that pot, that’s a nice urn now.

But that’s a 3D model. So, if I wanted to take this thing and go out and have it fabricated or have a computer mill it or make a model or spray paint it with a digital painting system, all of those things can be done here.

In the past, it would have been really, really difficult for someone to say, okay, I have an arbitrary object that’s in my possession, I’d like to have a 3D model of it, please. People have tried to say, I’ll give you 3D sketching tools, I’ll give you CAD/CAM tools, but these kinds of things are just way too hard for the average person who just wants to throw the digital pot.

So, by being able to bring these things together we have an enabling technology for a whole new genre of applications, and where this is completely natural.

After my talk is over, if you want to come up here, we’ll have some of these demos, and you can sit here and play with them, too. And what you’ll find is, of course, like that one, it takes you zero training time. You can say, okay, take your hands, do what you would have done if it was a piece of clay, and it just does what you expect.

This is what I think of as the natural user interaction model where the things you know about in the real world are all it takes to be able to figure out how to do the same kinds of things in the virtual world.

So, we thought some more about this, and said, okay, well, what’s another thing that we really would like to do where people obviously know how to do it, but we don’t make it very easy in cyberspace, and that’s essentially to have multiparty meetings.

So, I’m very big, particularly given I travel 140 nights of the year, to find a way to do telepresence so that I can have more realistic meetings or even presentations like this one without having to fly everywhere.

So, when we started doing the Kinect development, I had the research people and some of the incubation people who work for me start down the path of saying, well, now that we’re going to have that sensor, couldn’t we build a telepresence system that would use this and allow people to get together in cyberspace.

Again with the Xbox environment we started with an advantage in that even from the old context there’s probably more than 80 million Xbox avatars that are already out there. People have made them for their historical interaction with the console.

So, here we had a big community of people, we knew we’d have sort of tens of millions of them that would be Kinect enabled, and we thought this is a great place to start to try to experiment with what it would be like to have these telepresent interactions.

So, we actually did that development, and in June of this year or July we released a new thing into Kinect Fun Labs on the Xbox that’s called Avatar Kinect. What it is, is a system that allows people, up to eight at a time, anywhere in the world to sit in their living room in front of their Xbox and project their avatar into a three-dimensional stage of one form or another, and go invite and have a discussion with other avatars.

This is very different, as you’ll see in a second, from videoconferencing. Certainly when you get beyond two people there’s nothing natural about saying I want to have an eight-party videoconference on Skype or anything else. You’re kind of looking at eight postage stamps of heads talking at you, and it’s very hard to share information, you lose a lot of the ability to gesture and understand body language.

But one of the challenges is when we did the games, like you saw Dance Central, or we do things like throw the pot, what we didn’t do was facial animation. Yet if you’re going to go and do these things where humans are communicating, the face is an incredibly important thing, and the emotion that is conveyed not just through your words and your gestures but in terms of your facial features is a critical component.

So, we studied it to try to figure out, it turns out humans are incredibly good at taking small cues from even highly caricatured faces and correctly interpreting those cues relative to emotion, at least the major emotions.

So, even though the camera has limited resolution, and we’re doing this at a distance of one to four meters, we actually figured out a way to combine the RGB input, the depth input, and a facial model to animate all the major features of your avatar and its face in real time as well.

So, we released this and it’s out there. Many of you may have tried it. There have been hundreds of thousands of people who are using it now since the summer. But let me just show you a short video clip, sort of the trailer from the launch of Avatar Kinect in July.

(Video segment.)

CRAIG MUNDIE: So, we ship this thing with about 20 different stages that range from a performance stage like this one to a round conference table to the theme park and tailgate party kinds of things, and based on the stage you can either do something individually, record it and send it to your friends, or you can, in fact, go have your friends join you there, up to eight people at a time.

And so this has been really fascinating to look at, but, of course, it’s just the beginning. While we’re doing this for entertainment and casual interaction, in January I was at the World Economic Forum, and a friend of mine is Maria Bartiromo, who is on CNBC in the U.S., and you may see her up here, too, and is a financial commentator.

And we were talking about Kinect where she had actually had one and she and her kids were playing it, and I was telling her about these Avatar Kinects, and that it was going to come in the summer. And we kind of only half jokingly said, well, why don’t we do a national TV interview as avatars. And being a good sport she said, that’s interesting, let’s try to do that.

So, in fact, in July I went to New York, and we taped an interview which ended up running nationally as a 13-minute segment on their show, and I brought just a short clip to show you.

But the idea was to get people to understand in a more businesslike environment how would this kind of thing work, and even though the caricatures are pretty rough today, you can see that it isn’t that hard to suspend disbelief and think you’re seeing people talking.

The other thing you might notice even in that thing, and you’ll certainly see in even the clip with Maria and I, is that this is not like staring at a head. In the set the virtual camera moves around automatically, under the control of cinematography rules. So, in essence it’s like being in this case on a set where there are really camera people on dollies and moving different camera positions, and sometimes you see through your own eyes of your avatar and sometimes you’re sort of like watching yourself on television as a group. But all of that builds on people’s experience of watching TV and going to the movies, and yet also being there and seeing this real time animation.

So, let me just show you a clip of Maria and I on national TV, and you’ll see us both in our flesh and blood form, and how we ended up being portrayed as avatars.

(Video segment.)

CRAIG MUNDIE: So, that went on for about 13 minutes, but it was a lot of fun. I was particularly interested in it, because I think it shows that we will be able to build more and more sophisticated ways for people to interact at a distance, and in some sense share a common experience.

In order to take that a step farther, I want to show you some additional work we’re doing. Obviously one of the limitations in some sense of this is that it’s just caricatures. Some people say, well, you know, wouldn’t it be better, particularly if it’s serious, if these things were more lifelike.

Then there’s this other problem that we have 20 stages that we made where we were willing to do the 3D modeling, but if you’re going to generalize this, then you really want to be able to make anything into a stage. You want to take your own conference room or your own living room or whatever it might be, and make it into a 3D model that you could then have meetings in.

So, we said, hey, this thing we did with the pot, you know, maybe we could do it with other things, like anything.

So, let me just start and I’ll just give you an example, walk back here, and if you put this up there, I’ll turn this on, and boom, I’m going to basically make a model of the back part of this lovely building. As I look at these things, the longer I look, the detail fills in.

This is not a video again, the thing on your bottom left is actually a 3D model.

Go ahead and put the RGB overlay on it.

So, now it’s painted, texture mapped with the same colors, and that’s happening in real time.

So, obviously this wouldn’t be sufficient as a way you would want to do this entire facility, but algorithmically there would be no real difference.

The other thing would be to say, you know, you can have models, and so here I’ve got another one. This is actually an architect’s cutaway model of the entrance and part of the lobby of Microsoft’s new research building in Beijing. I could take this model and scan it, and in a matter of a short amount of time have quite a bit of detail.

If I wanted to play around like with furniture placement or something else, even before construction, I could digitize things in and then slide them in there, or I could do that by combining virtual objects and this scan of the physical world. Of course, because it’s 3D I could scale it up and do other things.

But it’s also possible then to take and recognize that so many things now are actually created virtually before they’re manufactured physically, and there’s no reason to think that the real models that were developed in the construction, for example, or the design shouldn’t be made available for this kind of function, too.

So, what I did to show that is to go back over here to my Kinect, and I got the actual CAD model from the architects that designed our building in Beijing, and this is it. So, you can see it sort of looks like the paper model here, and this is basically what the building actually looked like. It opened in June of this year.

So, the question is, how do I take Avatar Kinect and have it meet the model? The answer is no problem, let’s drop me into Beijing. Boom, now I’m standing here.

Now, you’ll note that this doesn’t look anymore so much like the caricature of me. In fact, it has all my gut here and everything. What this doesn’t have yet is the facial animation that I just showed you in Avatar Kinect, but I can turn around, I can walk around in this space, it composites me directly into the 3D model. I can turn around and point back over here and say, hey, this is the new lobby and up here on the wall is actually where we had this really cool light display, and over here the cafeteria is down there and there’s other things back over here. And, in fact, I could put other people in this with me, all of whom are standing in some completely different place, too, and they would see this much as they see the Avatar Kinect.

So, I think this just shows the path to higher and higher quality meetings. I don’t know whether this is going to be actually done this way or you’re going to be compositing other video together, but it’s pretty clear to me that we’re not very far away from being able to blend the physical and the virtual and the ability to move around in this environment.

So, let me just show you one last thing, which is something we demoed last week in Beijing as part of our anniversary celebration, 20 years of Microsoft Research, which is actually I had in the spring the researchers start to build a photorealistic head of me so that I could talk from the screen. But we’ve actually combined a number of technologies: text to speech, sort of a computer-based compositing of video and a 3D model together.

One of the things that’s super hard to do is to model the mouth, and yet it’s so important in terms of some sense of accuracy in the way people communicate verbally, because it isn’t just the external structure that matters, people actually see the movement of your teeth and your tongue, as well as your lips, and the face that’s around it. So, how to model all that and get it right is difficult.

Here you’ll see in a second a 3D face of me, not particularly friendly looking, but that’s a 3D computer model, it’s not a photograph. As you’ll see in a minute, when it animates, it’s actually using sort of compositing of the 3D model and around the mouth area it’s actually video of me talking, just reading a bunch of stuff, and they take the facial and word structure and map it to select in real time that part of the video, and seamlessly composite it in where the mouth should be.

The last thing that’s magic about this, as the title kind of implies, to do this they’ve also built a complete model of my vocal tract. This is not going to be a recording of me talking, it’s a typed text of part of my introductory remark in Beijing. The system does text-to-speech conversion into Chinese, in Mandarin, and then from a native Mandarin speaker morphs it onto my vocal tract.

So, you’ve stood here for almost an hour, sat here and listened to me talk. You’re now going to hear me talk in essentially perfect Mandarin, even though I speak only two words of Mandarin.

Go ahead and run that for me.

(Video segment.)

CRAIG MUNDIE: So, yeah, we’re getting very close now to what I think in a globalized environment of the future is very much the ultimate dream, which is you can go places, have very lifelike meetings with people you know or maybe have only met a little bit, who may not even speak your language.

We are also close to being able to do what I just described, not by going text to speech and that into Mandarin, but to be able, just as we’re doing this, to have the microphone hear me in English and have my avatar on the other end speak in another language.

So, in the globalized environment we all are going to live and work in, the ability to essentially have multiparty, multilingual conversations where everybody sits in one place, have a very natural interaction, and, in fact, each person speaks in their own language and hears the other person in their own language, I think is a stunning accomplishment. And yet by the magic of all these combined technologies, machine learning, machine vision, all of these things were basic science research over the last 10 or 15 years.

The people at Microsoft Research who worked on creating this camera, it took 10 different groups from four labs on three continents. Each had to make what turned out to be a critical contribution. And none of them started their research with any idea that what they would be producing was that device or that, in fact, it would ultimately be part of even doing what I’ve shown today as part of a telepresence system.

So, I offer you that as a way of reinforcing what I think is the importance of basic scientific research, curiosity-driven activities that are done to complement the great work that people have to do to make these products and ship them every day. We do that at Microsoft, we do it in conjunction often with people in universities, but I want to encourage all of you to think about how important it is to be able to support that kind of curiosity-driven basic science, and also to think about how difficult it is to transfer that knowledge at times into things that you can ship into the mass market. But we’re starting to do that more and more, I think it’s incredibly exciting, and I’ll stop there.

All right, everyone, thank you very much for your time and attention. It’s been great to be at McGill. (Applause.)
