Remarks by Craig Mundie, Chief Research and Strategy Officer
Northwestern University
Evanston, Illinois
October 4, 2011
CHAD A. MIRKIN: Good afternoon. My name is Chad Mirkin, and I’m a professor in chemistry here, and director of the International Institute for Nanotechnology. And it’s a pleasure to welcome everyone to this year’s Harris Lecture featuring Craig Mundie. We’re really excited to have a fantastic audience, including students and faculty from throughout the Northwestern Community and the surrounding areas, very fortunate to have people from the Evanston community, and also from many of the local high schools, including Lakeview, Amundsen, Evanston, and Glenbrook North. So, we especially welcome the high school students.
I also want to recognize our inspirational president, Morton Schapiro, who has taken the time to come here and help really set up this event. And his fantastic staff, led for this event by Colleen Burroughs (ph), did just a wonderful job of putting together a lot of complex logistics to make things run very, very smoothly.
The Harris Lectures were endowed in 1906 by Norman Wait Harris, a trustee of the university. In his letter of gift, Harris expressed the desire that “the fund should be used to stimulate scientific research of the highest type, and to bring the results of such research before the students and friends of Northwestern University, and through them before the world.” Over the years, lecturers have included scientists, writers, artists, theologians, critics and historians. This year is no exception in terms of high-impact speakers. We are very fortunate to have Craig Mundie. He is the Chief Research and Strategy Officer at Microsoft. Craig oversees one of the world’s largest computer science research organizations, and is responsible for Microsoft’s long-term technology strategy. He holds a bachelor’s degree in electrical engineering, and a master’s degree in information theory and computer science from Georgia Tech. He’s been an entrepreneur, having co-founded and directed Alliant Computer Systems, and a number of technology incubations. He’s truly a visionary thinker who works with government and business leaders around the world on technology policy, regulation and standards. He’s one of the world’s experts on cybersecurity, privacy, and security issues involving new technologies, and I’ve personally worked with Craig for the past three years on PCAST, the President’s Council of Advisors on Science and Technology. There I’ve gotten to know him quite well, and learned about his incredible intellect and passion for science and engineering. It’s really a treat to have him here today to lecture on the future of computer science.
Please join me, along with President Schapiro, in welcoming Craig Mundie to Northwestern University.
(Applause.)
CRAIG MUNDIE: Thank you, Chad.
President Schapiro, and students, and friends, it’s great to be here at Northwestern. It’s actually my first visit to this particular campus, and it’s been a great visit. It started early this morning in Chad’s lab, and I got my certified capability to build gold nanospheres. So, that was pretty good.
I’ve been doing these kinds of talks at universities, visiting with students and faculty, for more than 10 years. My predecessor, you could say, at Microsoft was Bill Gates, and when he retired to become chairman of the company, I essentially inherited many of his responsibilities. And I’d worked with him for 10 years before that.
Bill was always a believer in the importance of basic science and, of course, engineering in a company like ours. And I was privileged to work with him and try to foster that kind of spirit within Microsoft, particularly the quest for the unknown that we do through our research activities. But we also, both of us, felt it very important to be grounded in what’s happening with young people, and in the university environment. And so, throughout all the years, both Bill and I would make these sojourns out to universities around the world. And a little bit to share our views of what was happening in our technological environment, but also to be informed by what we would hear in dealing with the students and the faculty. I’ve enjoyed those conversations with the students today in a roundtable, and faculty in a roundtable, and look forward to some more interchange with you today.
So, in the next 45 minutes or so, I’m going to talk and give you some thoughts that we have about important trends in technology, information technology in particular today, and its application broadly. Then I’m happy to answer questions until we run out of time, another 45 minutes or more, and I’m happy to take those questions in any area of interest, and I’ll try to share my thoughts.
There are three things that I want to talk about in the course of this presentation. The first is a phenomenon that today we call Big Data. Science started out with people who didn’t have a lot of access to technology years and years ago, and so they would think deeply about things, or they would think about it in mathematical terms. Science evolved, engineering evolved, and we developed the scientific methods that we’ve all used, which were really laboratory-driven in many cases. But we’re at a time now where the exponential increases in computational capability, coupled with just enormous increases in storage capability, have produced a phenomenon where we can get a huge amount of data, not just in the physical-world sense, but through observations. And, as a result, I don’t think there’s any field of science, engineering, or frankly any other field of study that is going to make substantial progress today without the use of information technology. And in particular, we’re going to now start to find that a lot of the things that we can learn, we’re going to learn through the use of Big Data.
The ability to build these facilities is no longer just the province of super-computer centers at well-endowed universities, or a National Science Foundation site, or maybe a few of the world’s largest companies; there’s a democratization happening in terms of access to these computing facilities that the world has never seen before. Everybody calls it the cloud, but what it really represents is a place where computer systems that are bigger than any that even the biggest governments have built in the past are now accessible to anybody with a credit card.
And, as a result, the ability to use this kind of information to get insight from very large amounts of data is available to every student, to every small business. And as a result, we think that this democratization of access to sort of hyperscale computing facilities is a very, very profound thing. And it’s really moving along.
So, what I want to do is show you a couple of demos about how this is going to be used more and more by people like all of you in the audience, whether you’re in high school or university, a graduate student, or a researcher. This is just a standard Excel spreadsheet. But what’s novel about it is, in our case, we’ve sort of removed all the limits to the size of a spreadsheet, and so they become virtualized as well. And now we’re finding that everybody wants to put their data in these giant repositories.
So, in fact, we’ve got a capability, and other people are building similar ones, that we call data markets. Just like in the past you could do a search and you might get a website, or you might have a colleague and you could ask them to send you some data, more and more these huge datasets are being assembled and put into these cloud facilities.
So, for example, if I click on this, it logs me in, and in preparation for this visit, I went out and subscribed to a few datasets. I could have gotten demographics. Here I chose the WeatherBug, which is a lot of historical weather observations and geodata. You have the U.S. Census Data. All these things are now just a click away for somebody sitting at a personal computer with an Internet connection.
So, for this thing, I’m actually going to click on this WeatherBug, and import a bunch of data. I’m going to import this, and it’s getting lots of data. We’ll let that keep going for a while. I should have limited how many of these it pulls down, since I don’t have all day, so I’ll stop that. I also downloaded some of this in advance.
And so, here is the same dataset, and I applied some filters to this, and I have a few hundred data points about this information. And the ability to just use tools that people know, to click, and sort, and organize, and filter, now can be applied to datasets that are actually huge. You could, for example, decide you want to know about all the people who lived in the ZIP code of Northwestern University, and then you could just download that data, and do that kind of analysis on it.
Other things that you can do, now that you have these things, are, for example, to build pivot tables and reorganize this data. So, here what we did is we plotted all this precipitation data for the cities I happen to be visiting on this tour, plus our hometown of Seattle, and each colored line represents the total rainfall over time, over the last few years.
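As a rough illustration of the pivot-and-plot step described here, the sketch below does the same kind of reshaping with pandas; the file name and column names (city, date, precip_in) are hypothetical stand-ins for whatever the data market actually delivers.

```python
# Minimal sketch of the pivot-and-plot step described above.
# Assumes a hypothetical CSV of daily weather observations with
# columns "city", "date", and "precip_in"; names are illustrative only.
import pandas as pd
import matplotlib.pyplot as plt

obs = pd.read_csv("weather_observations.csv", parse_dates=["date"])

# Pivot so each city becomes a column, then total the rainfall by month.
monthly = (
    obs.pivot_table(index="date", columns="city",
                    values="precip_in", aggfunc="sum")
       .resample("M").sum()
)

monthly.plot(title="Monthly rainfall by city")
plt.ylabel("inches")
plt.show()
```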
And, let’s see, Seattle is purple. Everybody thinks it rains a lot in Seattle, but you can see in measured rainfall it’s not as bad as all these other places. But you also start to see things in this data that you might not have known, and that might be a topic of interest. So, here we put a highlight around this vertical spike in the Evanston data, and say, what happened at that time back in 2008?
So, if you actually go back and do a little research, sure enough, Hurricane Ike came inland that year and dumped a whole lot of rainfall right here in Evanston, and up the middle of the country. And so you begin to see anomalies, or patterns, that you just never would be able to see if you didn’t have these kinds of tools. Another one that we did here was to ask, what was this big, blue spike in Toronto? And sure enough, you go look, and that was actually in Atlanta, I’m sorry, a weird storm and flooding in Atlanta. People died that day.
So, when you get this huge amount of data and you can look at it over a long span of time, you start to be able to get insights that you just can’t get if you’re not looking at that amount of data, or you don’t have the analytic capability at this magnitude. So, let me show you one other thing that we’ve done.
We’ve built a new tool, and what this tool does is it ingests these kinds of very large data sets. This morning, you know, in some of the tours that I got, people were showing me some of the fantastic visualization capabilities that are being built here in the physical sciences and in the biological area. And this is essentially a 3D tool where you can make your own movies of your analysis of very large data sets.
So, in this case we took all the same rainfall data and we made a guided tour where we essentially are going to fly through this data, and this is just on the West Coast of the United States. So, the height and the color indicate the intensity of the rainfall. And this is over a 30-year period. And so I can get to this point in this animation. I can essentially zoom in on it. I can navigate and translate. I can basically go in.
For this particular part, let’s go up here to hometown Seattle. Everybody thinks it’s really wet. Seattle is right in here, and actually what you find if you look down in this valley of the dots is that, in fact, Seattle does get rainfall, but it doesn’t get as much as people think, because it all gets trapped by the mountains on the coast, or the Cascade Range in the interior. And, in fact, if you look at the whole West Coast of the United States, it all exhibits this pattern. And if you were studying this you might want to decide, should I go to school out there; you know, let me pick a college that’s in the valley there. In any case, the ability to have these incredibly powerful visualization tools is, I think, obviously important.
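A minimal sketch of the same encoding idea, height and color driven by rainfall intensity, using matplotlib rather than the guided-tour tool shown in the talk; the station coordinates and rainfall values below are randomly generated stand-ins.

```python
# Rough sketch: bars placed at station locations, with height and color
# both encoding rainfall. All data here is synthetic stand-in data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lon = rng.uniform(-125, -115, 200)   # stand-in West Coast longitudes
lat = rng.uniform(32, 49, 200)       # stand-in latitudes
rain = rng.gamma(2.0, 15.0, 200)     # stand-in annual rainfall, inches

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.bar3d(lon, lat, np.zeros_like(rain), 0.1, 0.1, rain,
         color=plt.cm.viridis(rain / rain.max()))
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_zlabel("rainfall (in)")
plt.show()
```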
One other thing, though, that we’ve been doing is saying, well, now that we have this big data, it’s not always just numbers. What I showed you was essentially numerical information, a lot of typical readings of instruments. But we’ve also been looking at how we can build facilities that allow us to get a lot more information, and help a lot more people be able to do things with non-text, non-numerical information.
So, here what I’ve actually got is something from our medical research activities, and this is essentially 3D scans of a human torso. And the pictures on the top, these three here, the top left, top right, and bottom left, show the kind of navigation that an expert would typically have to do in order to look around in these images. So, here you can move up and down in the body. And over here you can essentially move front and back in the body. And here, of course, you can essentially go left and right. But you really have to be an expert.
Now, it got a little better, and the bottom right corner down here is an example where people have taken 3D volumetric models, built them, and tried to color-code them. And so here, if I actually zoom out, you can see this is actually a 3D model of the torso. This made it a lot better for people. You didn’t have to be quite an expert radiologist to figure out, where is your spleen, or I want to look at your heart. But we wanted the computer to be able to do this itself.
So, we developed a new machine learning capability, and it’s built into this application, which actually has been taught how to discriminate from a huge number of these images all the different organs of the body. So, I’m going to click this and it actually goes through this particular data set and analyzes it and it found all of these particular things that it recognized. And so now if I wanted to look at the gallbladder I can just click on that and it zooms in to the gallbladder. I could go down here and look at my left knee.
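For a sense of how a classifier can be "taught" to label anatomy, here is an illustrative sketch, not Microsoft's actual pipeline: a forest classifier trained on per-voxel features from annotated scans, with toy data standing in for real CT volumes and expert labels.

```python
# Illustrative sketch only: train a classifier on per-voxel features from
# annotated scans, then label every voxel of a new scan. Toy data throughout.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_voxels = 10_000
X = np.column_stack([
    rng.normal(0, 1, n_voxels),    # standardized scan intensity
    rng.uniform(0, 1, n_voxels),   # normalized x position
    rng.uniform(0, 1, n_voxels),   # normalized y position
    rng.uniform(0, 1, n_voxels),   # normalized z position
])
organs = np.array(["liver", "spleen", "gallbladder", "background"])
y = rng.choice(organs, n_voxels)   # stand-in expert labels

clf = RandomForestClassifier(n_estimators=50).fit(X, y)

# At run time, every voxel of a new scan gets an organ label; the bounding
# box of, say, all "gallbladder" voxels is what the UI zooms to.
new_voxels = rng.uniform(0, 1, (5, 4))
print(clf.predict(new_voxels))
```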
So, things that would be incredibly difficult for people who weren’t experts and trained in how to look at these images and manipulate them become something that’s literally trivial for people to do. Let me show you what we think is going to happen more and more with this. I have a short video, which takes the same basic idea, where we’ve taught it to look at all these different organs, and where somebody has used it to look at an entire 3D model of the human body (these are real scans of real people), and to look inside and highlight a particular thing. So, go ahead and run the video.
So, here you can see they changed the transfer function. This is actually the lungs and the aorta. Here’s a lesion in a lung, in the right lung, and all of these things can be found by the computer, not by a radiologist who had to outline them or highlight them. But, we can actually teach the computer to discriminate all these incredibly fine patterns, which allows us to find all of the macro structures inside this sort of volumetric x-ray and then to highlight them or manipulate them. So, you can either pull them out and look at them in isolation, or you can actually see them in context.
In addition we’ve now got the machine learning tuned up to the point where like you saw the lesion in the lung, you can say, okay, go find me any other people that are in my database who have had a lesion like this in the right lung. And so it’s an image-based search where you’re specifying what you want to see by giving an example of a person that currently has that particular problem.
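One plausible way to sketch that kind of image-based search is nearest-neighbor lookup over feature vectors describing each stored finding; the descriptors and patient IDs below are invented placeholders.

```python
# Sketch of "find other patients with a lesion like this one": represent
# each archived finding as a feature vector and do nearest-neighbor search.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
archive = rng.normal(size=(1000, 64))          # 64-d descriptors of past findings
patient_ids = [f"patient-{i:04d}" for i in range(1000)]

index = NearestNeighbors(n_neighbors=5).fit(archive)

query = rng.normal(size=(1, 64))               # descriptor of the current lesion
_, neighbors = index.kneighbors(query)
print([patient_ids[i] for i in neighbors[0]])  # most similar prior cases
```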
So, these kinds of tools, I think, are going to be super-important as we want to get more and more people involved in medical care and understanding images; whether it’s medical images or any other kind of information, this is going to be something that I think will be very valuable.
The next thing I want to talk about is this convergence between the physical world and the virtual world. Those of us who have been playing for a while with things like 3D computer modeling, or even video games, have encountered an environment where we can sort of leave behind the physical world, go over into a completely synthetic environment, and move around or interact in that environment. And one of the real challenges as we built this huge computing infrastructure is to find simple ways to allow people to take the devices that are normally in their lives, perhaps like their cell phone, and begin to make it a lot easier for them to move from one environment to the next, from the physical world to the virtual world.
So, we’ve been developing capabilities to do that. A lot of them are based, again, on machine vision and machine learning kinds of capabilities. One of the things that we actually built a few years ago is a technology called Microsoft Tag. There are other things that are somewhat similar, like QR codes, but you can think of them as UPC or bar codes on steroids.
In this case the tags are actually active. They’re intermediated by a Web service, and so the person who creates a tag can actually change what the effect is of scanning that tag. So, for example, I have one on my business card. And if I set it to deliver my contact information to whoever scans it with their phone, that can be a completely automated task.
So, you can just take your phone and look at the thing and my contacts will end up in your phone. Packaged goods companies are now starting to make a lot of use of this, where they want to print a tag on a physical inventory item, but they don’t want it to do the same thing every day. So, you scan it one day and it might offer you a coupon. You scan it the next day and it will tell you information about the packaging. You scan it a third day and it might enter you in a contest.
And you can’t go back and reprint the packaging. So, the ability to now have a Web service that actually allows you to alter these things dynamically and to keep track of what people are doing, even though it’s anonymous, in terms of who is actually doing the scanning, it turns out to be kind of interesting. So, I’m going to show you a little video of how people are starting to use this in interesting ways that facilitate the bridging together of the physical world and the virtual world.
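A minimal sketch of that "active tag" idea, assuming the printed code is just an ID and a service decides what scanning it should do today; the tag ID, the destinations, and the day-by-day rotation policy are all made up for illustration.

```python
# Minimal sketch: the printed tag carries only an ID, and the service
# looks up what that ID should do today. All values are illustrative.
import datetime

TAG_ACTIONS = {
    "cereal-box-001": [
        "https://example.com/coupon",
        "https://example.com/nutrition-info",
        "https://example.com/contest-entry",
    ],
}

def resolve_tag(tag_id, today=None):
    """Return the action for this tag, rotating by day of year."""
    today = today or datetime.date.today()
    actions = TAG_ACTIONS[tag_id]
    return actions[today.timetuple().tm_yday % len(actions)]

print(resolve_tag("cereal-box-001"))
```

Because the mapping lives behind the service rather than in the printed code, the packaging never has to be reprinted to change what a scan does.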
So, there was a music festival held recently down in the South and they decided to kind of go all in for this kind of capability and I’ll show you this little video. Go ahead.
(Video segment.)
CRAIG MUNDIE: So, these tags are kind of cool, but I think of them as sort of a little cheat or a hint, where the computers haven’t quite gotten good enough to recognize the same things that we do as people. But if we’re going to make a fundamental change in the way that people and computers interact, we have to move beyond this idea that they have to have these kinds of assists in order to help us. I think they’ll be valuable and important for a while, but the real question is, as the computers get more powerful, and in essence through microphones and cameras they start to hear and see, is there more that we can do?
So, I brought here a Microsoft phone, one of the latest ones, and I’ve got it hooked up to the projector here. So, what you see is essentially what I see on this phone. And one of the things that we can do is go to a search capability, which is just a Web search.
But, now what we’ve built into the phone as an intrinsic capability is the ability for the phone to listen, for the phone to see, and to do a lot of recognition using the power of these cloud services as a way to do that.
One of the things it can do now, and you’ve been able to do this with custom apps, but not as an integral feature of the phone in the past, is it can listen to music and tell you what song is playing any time you hear it. In a sense that’s a hard problem, but not a super-hard problem, because of the way you can digitize the music. It’s actually a lot more difficult to start to give it commands about just spoken words, where you have no particular context that’s predetermined.
But, here the phone knows where it is in this case. So, it says it’s at Northwestern up there and it determines that from GPS locations, or cell towers that are nearby. And all this context is now being fed into the queries on a dynamic basis. So, it combines what it thinks it hears you say with the other context in order to provide information.
So, with the amplification here, I’m going to try it. Let’s say I want to find out something about movies here, and go to a movie. So: movies. It converted my spoken word into text, put it into the search box, added the local context, and came up with movies near Evanston, Illinois.
So, I can look at that, I can click in on these things, pick a movie, click on Moneyball. And this is all actually going out over the net and doing it in real time. I can look at show times. I could go buy tickets. So, in a sense with one word, and one or two finger presses you can essentially complete tasks that historically would have taken quite a bit of pointing, clicking, or typing in order to get them done.
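In outline, the query the phone sends might be assembled something like the sketch below, where the speech recognizer's output is combined with location context before the search goes out; the reverse_geocode helper is a hypothetical stand-in for the GPS and cell-tower lookup.

```python
# Sketch of combining a recognized word with device context before the
# query goes out; the geocoding step is a stand-in, not a real phone API.
def reverse_geocode(lat, lon):
    # Stand-in: a real phone would resolve GPS / cell-tower data here.
    return "Evanston, Illinois"

def build_query(recognized_text, latitude, longitude):
    """Attach local context to whatever the user said."""
    place = reverse_geocode(latitude, longitude)
    return f"{recognized_text} near {place}"

print(build_query("movies", 42.05, -87.68))  # -> "movies near Evanston, Illinois"
```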
The next thing I’ll show you is how we’re endowing these things with vision. So, just like there’s a microphone down at the bottom, there’s also a little eyeball. And what I’ve got here is a book. This vision capability, now integrated with these Web searches, will recognize all kinds of objects, book covers, CD covers, and it will read any kind of tag. So here I brought a book, and I’m just going to hit the eyeball and have it look at this book. It finds the book and goes out and finds a bunch of things about it. I’ll click on this link. It gets me in one click to a book review, just from looking at the cover. I can read reviews of the book. I can see where I can buy the book. And the apps that happen to be on this phone that relate to shopping for books, they’re all there, too.
So, in a sense a lot of the navigation and the searching is all being done by a combination of the machine vision, the Web searches, and the local context. There’s one more thing that I particularly think is pretty cool. One of the things we’re really trying to do through the machine learning is to teach these things not only to see books, but also to see text and recognize it, and to learn how to do this in multiple languages.
So, I’m going to click on the old eyeball again, and I’m going to look at this document. And this happens to be a menu of a restaurant, and so I’m going to tell it to scan the text. Sorry, I turned it upside down. And now I’m going to say, translate it. So, there it did. It took what was a French menu, it actually decided it was French, I didn’t tell it that. And then it essentially did a real-time overlay of English on top of the text that was originally French.
And so if you happen to go to a French restaurant, or travel out of the country, this kind of thing could be really useful. But, the ability for these machines to combine their local processing capability, the Internet connection, and the amount of information that we have in the Web, allows us to build these capabilities more and more. And from that we get really excited about the prospect of having computers be genuinely more helpful.
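A toy sketch of that scan-detect-translate flow is below; a real system would use OCR, statistical language identification, and machine translation rather than a hand-written word list, so treat this purely as structure.

```python
# Toy sketch of the scan -> detect language -> translate -> overlay flow.
# The word list and detection rule are placeholders for real OCR/MT systems.
FRENCH_TO_ENGLISH = {"poulet": "chicken", "fromage": "cheese", "vin": "wine"}

def detect_language(words):
    # Crude stand-in for statistical language identification.
    hits = sum(w.lower() in FRENCH_TO_ENGLISH for w in words)
    return "fr" if hits > len(words) / 2 else "unknown"

def translate(words):
    return [FRENCH_TO_ENGLISH.get(w.lower(), w) for w in words]

menu_words = ["Poulet", "fromage", "vin"]     # pretend OCR output from the camera
if detect_language(menu_words) == "fr":
    print(translate(menu_words))              # these strings get overlaid on screen
```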
I think this is one of the really key shifts that’s happening now, that for decades computers have largely been tools. They’ve been increasingly sophisticated, and in the hands of experts, people who have studied them, or done their apprenticeship, they are incredibly powerful. But, we still have literally billions of people on the planet, and we’re going to get a few billion more in the next 50 years, who really won’t have that kind of training, they won’t have the ability to master specific tools to get things done.
Yet, if we want to solve problems, whether it’s in healthcare or education, or other things, we need to really make computers change dramatically in two ways. First, the traditional interfaces, where we’re pointing, clicking, touching, et cetera, have to give way to a much more natural model of human interaction with the computer. In Microsoft we call this the transition from the graphical user interface to the natural user interface, or GUI to NUI.
The graphical stuff will still be around, I mean, it’s incredibly valuable for very fine detailed work, and the kind of things that we’ve employed computers for in the past. But the natural user interface basically takes all of these things that relate to human senses, and the computer’s ability to emulate them, and begins to apply those to these tasks.
Having done that, you start to give the computer the ability to be less of a tool and more of a helper; this demo of the phone and its ability to see and translate things is just the tip of the iceberg. The semantic interaction between you and the computer goes up a few orders of magnitude, and interacting with it is just a lot more like interacting with another person.
So, we’ve been doing research in things related to human senses, and speech recognition, and speech synthesis for a long time at Microsoft. And in fact, at lunch with some of the faculty we were talking about how do you convert research into things that have real commercial value.
The tricky part is if you’re really doing basic research, with a very long time horizon, you aren’t really guided specifically toward a product when you begin. You don’t know what its ultimate application might be. And while we’ve had many, many transfers in Microsoft, for us one of the real breakthroughs, and one of the best examples of this kind of aggregation of our long-term research results culminated in this product called Kinect, which was a machine vision and hearing system as a peripheral for the Xbox gaming console.
So, this was launched last fall. And I’ll just run a short clip. Many of you probably have seen it, or know about it, but I’ll just show a quick clip for people who don’t know what the product was in its initial incarnation, you can see that. So, run that video, please.
So, here you just stand in front of this camera that you put under your television. And it sees you and it essentially maps your own body movements into the movement of an avatar on the screen. And in this context, of course, the goal was to support game playing, where there was no controller, you were the controller, and this ability to map multiple people in real time, both their skeletal movement and their spoken words, was achieved, and it became an incredibly popular hit product. In fact, we sold more of them in the first few months than any other consumer electronics widget that was ever manufactured.
So, it kind of struck a nerve, because people who were historically disenfranchised from the video gaming environment really found this to be a popular thing. In fact, of all the titles, this one I just showed you, Dance Central, was the best-selling title on the Xbox in the last Christmas season. And I think the reason for that was there were so many more people who could get involved with dancing as a game than could learn the 14 buttons on the remote controller, and it really boosted people’s engagement with this.
When we built this product, we knew that this was really just the first instance of a mass-market product that brought machine vision, not just in the traditional 2D camera sense; the breakthrough in this was that this camera actually sees in 3D. The ability to discriminate things with depth, just the way humans can see with some depth perception, and then to discriminate objects in real time, was something that had been done before with stereo cameras in the lab, or very special laser illumination kinds of systems. But the cheapest of those used to cost about $30,000, and it would not be unusual for people in a lab environment who wanted to do this kind of work to have to spend $100,000 just on a camera to support that.
So, when this thing came out at $149 at Best Buy, it was a revolution, because you not only got the camera, you got the microphone, and other things. Within a week, people around the world started buying these things, unplugging them from the Xbox (we made it with a USB plug), and plugging them into their PCs. And they started writing their own drivers, and other things. There was just a huge pent-up demand in the world community for things that would allow them to start to give the computer vision and listening capability.
So, we had actually anticipated this, although I think we were frankly even surprised by the rate at which people wanted to get going. And by June of this year, we had actually produced the developer’s kit for this product. In fact, in the last couple of months, we’ve released two. One kit was done specifically to support people who just wanted to do the vision and speech, and another one we actually integrated into our robotics studio, because the other thing people were really interested in was doing robots, and using this to give the robots a more economical hearing and vision system.
And so, I just want to run a demo reel of just things that we’ve taken off the Internet with the permission of the people who did them to show you the kind of creativity that’s been unleashed by giving this very inexpensive sight and sound capability to the average computer. So, run that demo real quick.
So, here’s a guy that’s basically doing an image adjustment and shaders computationally based on his interaction.
Here’s people who want to play virtual chess with each other, and they use their bodies to move the chess pieces around. Many of these things were done in a matter of a couple of weeks once people were given these kits.
This guy has his own racecar, and he uses his foot and his hands to drive it, and accelerate and brake.
This guy has a gyrocopter, or a helicopter, that he flies essentially by using his body as the controller.
This one is like a junior version of one of the displays I saw earlier today.
This is actually the tool that I showed here about the rainfall. This is the same thing applied to the sky, all the sky data.
This guy has built his own football game, and he calls commands verbally, and then uses gestures to actually control the football game.
This is actually mapping human motion onto a fully articulated small robot.
This guy built his own Barcalounger that he wanted to be able to drive around onstage, and he’s controlling it with the camera.
And this actually came out with our robotic kit, it’s a robot called Eddie, and you just snap your laptop into it, you put the Kinect on the top, and you buy the whole robot system, including all of its sensor arrays, and motor controls, and everything else as a kit now.
This guy calls for different video effects, and uses his hand to apply them in the frame.
So, there’s just an incredible array. The list goes on and on. One of the things I was taken with, in a prior set of these that I showed people, was some students in Germany who were interested in helping produce aids for the blind. And so, they took a Kinect camera, they screwed it onto a hardhat, put a backpack with a laptop on their back, had a little set of earphones or speakers, and they created a sort of Braille belt. And they, in a few weeks, created a system where a blind person could put this on and walk down the hall, and the thing basically tells them by audio commands, you know, turn right, turn left, or hallway upcoming. And objects that actually appear within the field in front of them are mapped as a pattern on the belt that they wear around their middle. And so, they can feel the things that they’re about to run into, and they can hear instructions.
In the past, they would have just taken a white cane and sort of tapped their way along. And here, they put this thing on, and essentially can walk in places they’ve never been, be guided by walking navigation instructions that could be read to them, that can be from their phone, and can feel the things that are in front of them that they can’t see.
If you think about what it would have taken to do that just a year ago or so, it just wasn’t within the realm of possibility. But the computational capabilities of even a laptop, or even now a phone these days, are getting to the point where coupled to this kind of machine vision system, and the right algorithms, are really allowing stunning things to be done in a relatively short amount of time. So, I think these things really start to create an environment where we can imagine a lot of things that we used to do in the physical world, doing them with some cyberassist, if you will.
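As one illustration of how few moving parts such an aid needs, the sketch below maps a depth frame onto a coarse set of belt-zone vibration intensities, with nearer obstacles driving stronger vibration; the frame size, zone count, and range are invented.

```python
# Sketch of turning a depth frame into a coarse haptic "belt" pattern:
# nearer obstacles drive stronger vibration in the corresponding zone.
# Frame size, belt layout, and range are invented for illustration.
import numpy as np

def depth_to_belt(depth_m, zones=8, max_range=4.0):
    """Map a (H, W) depth image in meters to `zones` vibration intensities 0..1."""
    h, w = depth_m.shape
    columns = np.array_split(np.arange(w), zones)
    intensities = []
    for cols in columns:
        nearest = depth_m[:, cols].min()
        intensities.append(max(0.0, 1.0 - nearest / max_range))
    return intensities

frame = np.random.uniform(0.5, 4.0, (240, 320))   # stand-in depth frame
print([round(v, 2) for v in depth_to_belt(frame)])
```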
So, one of the things that we also have been doing a lot of research in, which combines the depth camera capabilities with novel algorithms, is to use it to be able to scan things. Today, if I wanted to build a 3D model of something, it’s actually quite a challenging task. Again, if you really master the tools, you can do some stunning things. But you have to be really good at it. And even really simplified tools are difficult for most people to master when you’re trying to build 3D models in a 2D environment. So, we said, well, if we’re really going to get more people into this sort of physical-virtual interaction environment, we’re going to have to find a way to make it a lot simpler for them to get the 3D physical environment into their virtual environment. And whether it’s toys, or things you want to have in your game, or things you want to work on from a creativity point of view, we should find a way to do that.
So, some people in our lab in Cambridge, England, and our labs in Silicon Valley and China, took a Kinect, which I have here hooked up to a PC, and we built some new algorithms for it. So what you see here is essentially real-time construction of real 3D models from a single sensor. Up in the top left, you actually have some of the depth data coming into the camera. On the top right, it’s sort of been color-coded based on depth, and the false color kind of tells you where it’s found different structures.
The bottom right is where it’s actually starting to make a model of whatever it sees. And if I walk around here to the other side, you’ll see it will fill it in. So, in real time, it keeps tracking all the things in the scene. There’s nothing marked here. There’s no special markers. There’s nothing that’s coaching the computer to figure out what’s on the table or where it is, but everything it sees it tries to, in real time, reorient and build a 3D model of it.
On the left is actually the 3D model, which has been fitted to that, and has a typical sort of trapezoidal mesh underneath it. And if I go ahead and turn on the RGB camera overlay, the RGB camera takes the color image of what it sees, registers it, and overlays it on the 3D model. And so, what I really have is a 3D model of things like this vase. And you can see inside it. In fact, I could stop this, go cut that thing out, and I’d actually have a 3D model of the vase.
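The first step behind this kind of scanning can be sketched as back-projecting each depth pixel into a 3D point using the camera intrinsics, as below; the fusion and meshing stages are far more involved, and the intrinsic values here are only nominal.

```python
# Sketch of the first step: back-project a depth image into a 3D point
# cloud using nominal camera intrinsics. Fusion and meshing are omitted.
import numpy as np

def depth_to_points(depth_m, fx=285.0, fy=285.0, cx=160.0, cy=120.0):
    """Convert a (H, W) depth image in meters to an (N, 3) point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    pts = np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
    return pts[depth_m.reshape(-1) > 0]        # drop invalid (zero-depth) pixels

frame = np.random.uniform(0.8, 3.5, (240, 320))   # stand-in depth frame
print(depth_to_points(frame).shape)
```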
So, we took one of these things, and we took a model of this vase and actually put it in a different environment, and said, what can I do? So, here I brought a Kinect, and I’m standing in front of this virtual potter’s wheel. And I have that original vase. And so, if I want to basically make a new vase, I stand here, it takes my hands and wherever I apply them, I’m basically turning this pot. And when I’m done, I actually have another 3D model of that new pot.
So, whether you want to do artwork, or you want to do other things like this, when this is all done, we’ll let people come up and play with these things. What’s magic about this is that people who would have said, I have no idea how to use a 3D game, I have no idea how to manipulate a 3D object, actually have to know absolutely nothing that they don’t already know from their natural life to be able to come up here and do this. It takes two seconds to say, see, do it like that. And their ability to run whatever the app is, if you will, is merely a function of their ability to translate their real-world experiences into things where the computer understands how to interpret them.
I think that this is super important, and is all part of this natural user interaction model that we want to advocate for, and get people to start to think about novel ways that you can do this.
So, once we started to build this, one of the things, given that I travel 140 nights a year, that I really want to get rid of is traveling. And so one thing that’s a long-standing interest of mine is telepresence. And the question is, how can I create more and more lifelike interaction at a distance? Everybody can sit around and have video calls, or Skype calls, or other things, where you have some video. That sort of works for two people. When you decide to have more than two people, it sort of gets a little complicated. There’s nothing natural about looking at all the little postage-stamp pictures of the people you’re interacting with.
So, the question arose, now that we’ve got this ability for the computer to see in three dimensions, and the ability for it to map my body movements, couldn’t we do some more? And, in fact, couldn’t we start to use this to get avatars of us, and endow them with our real-time features and emotional expressions, as well as body animation, and send them out to meet other avatars.
And so, about the time we started developing the Kinect technology, I started a project to try to develop the first avatar-based telepresence system. And, in fact, we released this recently as part of the Fun Labs toolkit in Xbox. So, if you have an Xbox and a Kinect, you can go home and do this today, and you can do this with anybody in the world.
And so, one thing we had to add to make this work, and then I’m going to show you a little video of what it’s like, was facial animation. When we put the Kinect out in the beginning, all we cared about was the major skeletal animation. So, it maps the 42 major skeletal joints, 30 times a second, for up to four people at a time. And that was then mapped into gameplay, or whatever. But we also had to be able to recognize people. So, if you were in a game, for example, where you had two people playing at a time, and one person would sit down and the other one jump up, you can’t tell them, okay, you have to log in again. And if you had a whole bunch of people playing, it had to be able to recognize them.
So, more and more effort went into trying to use the things that you can glean from the depth camera, and the RGB information, and other cues, in order to be able to get reliable human identification in real time. And we made great progress on that.
The next step was to then try to come up with a way to take what is actually a fairly low resolution sensor, where you’re looking at these things at a fairly long distance, usually three to four meters, and then find a way to create a facial animation. All of you have grown up looking at cartoons, for example. And even though they’re incredible caricatures of something, including people, it’s amazing how good humans are at taking small visual cues about emotion and connecting with it.
And so, we did a lot of study to find out what is the smallest number of facial animation features that we’d have to get right in order to convey the major human emotions in real time, even though the avatars were mostly caricatures that people had made of themselves. And so we did that.
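A sketch of that "small number of features" idea: map a handful of tracked facial measurements onto avatar expression controls. The feature names, ranges, and mapping below are illustrative, not the shipped model.

```python
# Illustrative mapping from a few tracked facial measurements (0..1) to
# avatar expression controls. Names and ranges are invented for the sketch.
def avatar_expression(mouth_open, brow_raise, smile, eye_open):
    """Clamp tracked features to 0..1 and map them to animation controls."""
    clamp = lambda v: min(1.0, max(0.0, v))
    return {
        "jaw_drop":   clamp(mouth_open),
        "brow_up":    clamp(brow_raise),
        "lip_corner": clamp(smile),
        "eyelids":    1.0 - clamp(eye_open),   # 0 = wide open, 1 = closed
    }

# One frame of (made-up) tracked values:
print(avatar_expression(mouth_open=0.4, brow_raise=0.7, smile=0.9, eye_open=1.0))
```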
And so we launched this thing called Avatar Kinect. And in its first instance, we created about 20 different 3D stages that range from a performance stage like this, to an interview stage, to a tailgate party, to a roundtable conference kind of environment, and given the technology limits, you and up to seven friends can essentially sit in independent places and send your avatars out to meet.
So, let me show you a video of what this Avatar Kinect thing was like, and is like, it’s out there in operation today.
(Video segment.)
CRAIG MUNDIE: So, this has been fascinating to watch. There are tens of thousands of people every day now who are having these little virtual meetings. They’re just sending their avatars out to meet with other avatars. But not only do you get the facial cues that give you a sense of emotion, and the gestures, which are a big part of verbal communication, but the sense that you’re in a common space is a completely different experience than just sitting there looking at postage-stamp video of somebody at the other end of the wire.
And what’s interesting is, when these things are set up in these three-dimensional environments, the viewpoint that each person is given uses things like the rules of cinematography to move the virtual camera around. So, sometimes you’re seeing from sort of your own eye position, and sometimes you’re actually seeing from a virtual camera position that shows you the context that you’re in.
So, even though the people can’t get up and move around in this generation of the product, there is a lot more of a sense of being there. And when combined with your experience of watching years and years of television, it’s very quick for your sort of mind to adjust to the idea that it’s like being there.
So, in January this year I was at the World Economic Forum, and I was talking to a woman who is a friend of mine, Maria Bartiromo, many of you have probably seen her on television on CNBC. She’s a commentator about the financial markets. And we were talking about this thing. She had gone there and she had some kids, and she had a Kinect system at home, and we were talking about that. I told her about Avatar Kinect, and I said, you know, what we ought to do when this thing comes out is, we ought to do the world’s first live TV interview on national television as avatars.
And she said, that’s a good idea, let’s do that. So, in July this year, I went to New York. We wanted to do this a bit as an educational thing, and we decided that, around the launch of this product, we would do an avatar-based interview. And so we took the interview stage, and we did it. CNBC recorded that, and they’ve given me the video, and I’m going to show you a little clip from it. It was about a 13-minute-long interview. And in the process it helped people understand, because you’ll see as you watch the video, sometimes you’re looking at the two of us in the normal TV-studio kind of camera image, and then the rest of the time you see us actually sitting at a virtual TV news set completing the interview. And it helped people get an understanding of what this is like.
This was a lot of fun. It aired nationally. And I think, again, why is this important? Well, we’re moving from applications that are just for entertainment purposes, like Avatar Kinect, where you get together and socialize with your friends, to things like this, which we did as a demonstration to get people thinking: what’s it going to be like when the avatars get more and more real, and the ability to create the stages goes beyond just a few fixed ones that we’ve predetermined ourselves?
So, let me show you the interview with Maria to show you what it’s like when a business application of this kind of technology might start to emerge.
(Video segment.)
CRAIG MUNDIE: So, that was a lot of fun, and we kept thinking, well, how are we going to continue to evolve this. And there were really two parts that I consider to be the next natural step in this movement toward telepresence. One, you want to get beyond the caricature-type avatar to more photo-realistic ones. And, two, you want to move beyond a fixed set of stages that were pre-created for you to ones where you can do your own.
So, let me show you how we think that that might happen, and so I’m going to take this scanner again, and let’s say that I didn’t like the performance stage we had there. In the future I’ll say, I want to make a stage here at Northwestern, so I can essentially go over here, and scan the old podium, get the depth of the letters on the front of the podium, and so now I could cut that out and put it there.
Or, in a sense, as the cameras get better, if I turn the lights on in here, we can see a way, ultimately, to be able to just sort of scan this entire space. Today, the camera has a depth of field that’s only about four meters. And so the sweet spot is in this range of sort of one to four meters, and it will make a model of just about anything it sees in that zone.
So, we can see where people are going to start to build their own facilities where they want to have their virtual meetings. So, if you have a conference room today, just think you’ll take one of these, and you’ll walk around like you were painting the room, and in a matter of minutes, you’ll have a virtual conference room that’s exactly what your real one is. You can’t really distinguish them from a visual point of view. But, because they’re real 3D models, all the things that we did with those avatars we’ll be able to put into a space that’s yours.
So, for example, here I brought a small cardboard model that was part of a cutaway architectural model that was built when we were designing our new research facility in Beijing. And the doors are here, and this is sort of part of the thing. And I could essentially scan a model like this, but because 3D models can be changed in size, I could take a model like this and blow it up into something that was more like life size by just cranking that up a little bit. And then I’d have a model of that whole facility.
But, more and more all the architecture work is being done with 3D models, anyway. So, to show you what I think it might be like to have a more life-like, telepresent meeting, or the ability to go places, back in front of the Kinect, I’m going to have it load up the actual architectural model that was developed for that building. So, this is the 3D CAD model that the architects made when they were essentially designing this building for us in Beijing that opened this year. And so here, the customer says, well, that’s great, what can you do?
So, I’ll say, put me into the model. So, there I am. So, this is incrementally more photo-real. You can see it has all my bulges, and the face is actually becoming fairly recognizable as me. And certainly if I was trying to meet with people who knew me, I could have quite an interesting meeting. So, for example, I did a demo of this recently with a woman who works for me. She was going to Beijing, she hadn’t been there, and we were not in the same place, but I could essentially tell her, when you come here, you come in the front door, it’s over there.
Back on the wall back here there’s this really cool light display. You can’t quite see it in the model, but it’s really cool when you go there. You can go over here and essentially go down the hallway, and the cafeteria is over there and other virtual demo facilities are back here. And so there’s a fairly natural interaction. You can move, you can gesture, you can speak, and of course, there’s no reason that we couldn’t fly another bunch of people in here, too.
And so whether you build the models yourself, have them built for you, or actually take CAD models that are increasingly built for everything before you make the real one, the ability to build this virtual world and have it and the physical world come together, I think, is going to get really, really good.
Now, on this particular one we’re not doing any of the facial animation. But that’s making tremendous progress, too. This week was the 20th anniversary of Microsoft Research, and I was in Beijing in that very building, and I kicked off this worldwide celebration. And one of the things that we demoed there that day was a talking head of me. It was essentially a sort of big version of that model. It’s very photo-realistic, and it uses a lot of the newest technology, combining video-like rendering capabilities along with speech models and models of the vocal tract, in order to give this face not only my actual features, and the ability to animate them in real time, but to actually synthetically get the lip movements, and the teeth and the tongue, which are the hardest things to do, and have them look realistic, as well.
One more degree of difficulty: they decided to have me speak Mandarin. And so by taking speech samples of my voice, they actually have built a computational model of my entire vocal tract. They took a Chinese speaker, who had perfect Mandarin intonation, and they took her speech, and now we can type any text, the computer converts the text into Mandarin, converts that into speech, and has my head speak it. And so even though I can’t utter any more than ni hao in Mandarin, the head speaks perfect Chinese.
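Structurally, the pipeline amounts to translate-then-synthesize with a personalized voice model, roughly as sketched below; every function here is an explicit stub, since no real translation or synthesis engine is being invoked.

```python
# Structural sketch only: translate the typed text, then render it with a
# voice model trained on one speaker's samples. All functions are stubs.
def translate_text(text, target_lang):
    return f"[{target_lang} translation of: {text}]"      # stub translator

def synthesize(text, voice_model):
    return f"audio({voice_model}: {text})".encode()       # stub waveform

def speak_in_mandarin(english_text):
    mandarin = translate_text(english_text, "zh-CN")
    return synthesize(mandarin, voice_model="personal_vocal_tract_model")

print(speak_in_mandarin("Welcome to our Beijing research lab."))
```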
So, people began to really appreciate what happens when you start to stitch these things together: the 3D modeling, the facial animation, and even real-time translation. That’s one of the things we’re really focused on now, the ability to have one person speak in one language, and have it come out in a different language in real time. And we’re getting surprisingly close to being able to do that.
So, if you think about what that’s like in a world that’s as globalized as the one we’re all living and working in now, the ability to go places in a virtual environment, meet a photo-realistic model of an avatar of someone that you want to work with, or collaborate with, and even if they don’t speak your native language, you just talk, they hear it in their language, they speak, you hear it in your language. That stuff is not science fiction any longer.
I think that all the things that I’ve shown, including that, will certainly happen within five years. And so I think these are just tremendously exciting times. The kinds of things that we’re able to do with these computational facilities, the ability to integrate all this data, and importantly, the ability to bring together people, whether it’s for entertainment, productivity, collaboration, training, teaching, medicine, whatever it is, I think it’s going to be dramatically improved through the use of all this.
So, while all of us can approach our sort of day jobs of science, or engineering, or education, or whatever it might be, and computers as we’ve known them have already been a great asset, I think we’re just at the surface of what’s going to happen when computers really do become more like us. Through them, our ability to communicate and present ourselves around the world in completely natural ways, and to have the computer help us get stuff done, will provide as much benefit to mankind going forward, at the margin, as everything computing has done for mankind in the last 50 years. And that’s a lot.
So, it’s been fantastic to have the time to show you this and I have a couple of things to do before we do a Q&A.
END