Craig Mundie: University of Texas

BRUCE PORTER: Welcome everybody. My name is Bruce Porter. I’m the chair of the department of computer science. Thank you so much for coming.

I’m delighted to introduce today Craig Mundie, the Chief Research and Strategy Officer at Microsoft. As just one of his responsibilities, Craig oversees Microsoft Research, the world’s largest computer science research organization. He’s also responsible for the company’s long-term technology strategy. Over his career Craig has built startups in numerous fields including supercomputing, consumer electronics, robotics, healthcare, education, and he’s carried that entrepreneurial spirit to the research labs of Microsoft, a lesson to us all.

Craig has played senior roles in setting policy on telecommunications and national security under three presidents. In 2009, President Obama appointed Craig to the President’s Council of Advisors on Science and Technology, where he serves now with our own Professor Bill Press. So, please join me in welcoming Craig Mundie to the University of Texas at Austin.

(Applause.)

CRAIG MUNDIE: Thank you very much.

It’s great to be back in Austin again, and at UT. I’ve traveled here many times actually, although not to the university typically, to meet friends at AMD and Dell over the years. But it’s great to be here to talk with you. These sessions are something that Bill Gates and I started doing about 15 years ago, and throughout his tenure at Microsoft, and now mine in inheriting his responsibilities as well, we have always valued the ability to go and both share our ideas with people in the university about how technology is changing and might affect our lives in the future, but also to listen to things and learn from what’s going on in the university, both with the young people, the students, and with the faculty and the administration. And so, I’ll have the opportunity to do that here today as well.

So, in the course of the next hour, I want to share with you first some ideas about the technologies that we’re really focused on at Microsoft, and how we think they’re going to play out in the next few years, and then we’ll do a Q&A session. I know some of you may have to go to class at the end of the talk, although I’ll stay around here until probably at least 12:30, and we’ll have a Q&A session. And at the end, any of you that are interested to come up and play with the toys up here, we’ll be happy to allow you to do that, too. We’ll have a drawing and give away some door prizes that are probably interesting to many of you, and we’ll do that right after I finish talking.

So the last week or 10 days has been an important time at Microsoft. We on Monday of this week introduced a new line of these phones, and then of course built for the first time and announced our own tablet device. And at Microsoft we’ve been involved, I’ve been involved almost 20 years now in trying to prepare for a day when, in fact, we would find ourselves with microelectronics and software in virtually all the devices in our lives.

One of my first entrepreneurial jobs at Microsoft was, in fact, to work on interactive television, the first efforts we had in game consoles, the first operating system we made for handheld devices. And throughout that time the company has been very focused on trying to anticipate that day. And while we’ve at times had our own challenges, I think today we’re in a very interesting point of evolution.

One of the things that we have felt would be important is the ability to integrate this experience across many devices. Today, each of these categories of devices has sort of gotten smart and connected, but largely one ecosystem at a time. Phones used to be part of the phone world; televisions are still part of the TV world; cars are getting smart and part of the automotive industry; but increasingly the Internet is binding all these things together and the range of devices that we have just gets bigger and bigger.

And in the spirit of bigger and bigger, you know, people talk a lot about these phones and tablets, and they measure screen sizes and other things. And so, I wanted to bring with me today and show you the newest thing Microsoft has; it’s the world’s largest tablet. It’s this one; it’s an 82-inch tablet. (Laughter.) And it doesn’t quite fit in your pocket, but it portends something that we think is really important. And, in fact, it runs exactly the same version of Windows that we have on the Surface tablets, the Surface Pro ones that will come in the next couple of months, and for all the new Windows 8 devices.

So, one of our goals, and in fact we’ve now achieved it, was to have a common technological basis in terms of the operating system. It would go in everything from embedded devices to smartphones to tablets, to laptops, to the evolution of the desktop, and I’ll come back to that in a second, and then on up into the full range of devices, and even the cloud computers. And so, for the first time in our history, we’ve managed to wedge one basic operating system into all those classes of devices now that the microprocessors have become powerful enough to really allow that level of uniformity.

So, I wanted to show you some of the things and why we think that even having these super-scale devices is so interesting. This is the new interface model for Windows 8. And we have these Active Tiles. You can see them just changing and animating as a function of what’s happening in the underlying applications. And the reason the big screen is so interesting now is that we ultimately think, I think that the replacement for the desktop ultimately is the room.

When you go into a room in the future you won’t think there’s a computer sitting on the desk there. You’ll come in, you’ll be carrying your phone or a tablet. You’ll have other devices in the room that are intelligent. You’ll have sensors there. And increasingly we think that a lot of the surfaces in the room, the walls and even the horizontal desktop surfaces, will start to resemble these things more and more. And while today this one is sort of thick and was originally invented, where you’ve seen these before is on television, the company that we acquired earlier this year, Perceptive Pixel, made these things by hand with some special technology to allow the scaling of the ability for touch and high resolution pen. And they’ve used them for election reports, and weather things over the last four or five years.

We think it does portend a time when these very high-sized surface devices will become an integral part of the computing environment. And as this talk goes on, you’ll see we’ve been very focused in trying to create what we think of as a multidevice experience where people no longer just think about one device that they can experience sort of one at a time as they go from place to place, but that increasingly the devices have to federate to create the right experience. And we’ll show you a little bit more about how that works.

Just to give you a little idea of how some of this works, if I hit this one, it’s sort of the news tile. Everything in these new interfaces is designed to be sort of scrollable. And architecturally they always just go from right to left, every one of these things is an Active Tile, and if you touched on it, you could basically go down and read the next article below it. So more and more we’ve got many of the newspapers and others are writing applications to plug into this environment, and it all works pretty nicely.

Another thing that we’ve got is the ability to run sort of multiple applications at a time. One of the things that we, of course, had in the conventional Windows environment was the ability to have many windows of multiple sizes, even occluded windows. But when we did the work even years ago on the small devices, the phones, having that kind of complexity in the windowing system doesn’t work very well in a physically small screen.

And yet, we knew that as we were moving into tablets and then ultimately even these big size environments, we wanted to be able to give people sort of this multiapplication environment, but in, if you will, a more controlled way. And so one of the things that we’ve got in the tablets and desktop, as well as this now, is the ability to do multiple things at the same time.

So, for example, if I looked at the finance page, I’d get a set of articles about that. I can have watch-lists and other things, perhaps the stocks that I care about, and in this environment we also have this ability where everything just sort of flicks in. So, in these systems applications can continue to run and you basically can just go from one to the next by flicking them in from the side of the screen.

And if I want to read this, and for example, be able to keep track of my stock portfolio at the same time, we’ve designed it so that the screen can be sort of a quarter of it is allocated to an application. You can put any of them in here. And the rest of the screen goes to the conventional one. And you can essentially slide these things back and forth to have them shift back and forth. So, you can see as I shifted the news basically went into this other configuration.

So, we’ve created an environment where the architecture of applications is such that they’re authored with the intent to be able to present themselves in either sort of the full-screen environment, or the partial-screen environment. And as people write these things you get more and more of this mix and match capability. One other thing that I wanted to show you is how we’re really starting to build novel applications, now that we have very high-performance computing and the ability to put them into these kind of touch-enabled environments.

So, I have a new application called Fresh Paint, which happens to be a favorite of mine. I sponsored this work over about six years of development. And the goal was to actually replace the old Paint program. Most everybody at some point played with the Paint program. But, the paint program in the past was pick a color and all you were doing was flipping, toggling pixels on and off in that color, and really very, very primitive. We said, what would a new paint program look like, Fresh Paint? And what this is is actually a full physics simulation of painting, or drawing. So, we start with the paper, or the medium, and we have a physics model of that. We start with the ink, or the chalk, or the oil in this case. And we have a physics model of that, and all these things. And then we have a physics model of the instruments, the brushes and the crayons and things. And so this is a computational physics simulation of art.

So, just to show you what really capable people are doing, when we did the launch of Windows 8 recently, we had someone with one of these big tablets in the store and they actually painted this oil painting digitally. And they’re really pretty impressive for people that have never actually done this kind of thing before. It mimics enough of what they understand about real oil painting to be able to do it. One of the things you can see down here, it’s grayed out, because I’m not painting right now, is a fan. Why is there a fan there? Because in this system the oil paint also dries at the rate that oil paint dries. And if you’re in a hurry you can turn the fan on and it will dry faster.

I’ll just give you an idea of what it’s like. So, if I say I want a new painting, I can basically pick the type of paper or canvas that I want, let’s say I like this kind of canvas. And I maybe want it in that kind of color. And then I can go back to the tools. And pick colors. So, you guys have to help me, but something that it might be like a UT color maybe. So, the brushes become that color. I can pick this kind of brush and put it there and say, okay, I’ll do UT, and what’s interesting is if I actually get another brush, and I’ll pick a different color, just for grins I’ll pick some nice yellow color here. And what you’ll see is if I actually take this and paint on it through here, you’ll see that it smears through the ink. The oil is blended with the yellow and the orange.

So, that is actually all being done computationally. And if we waited long enough, if we waited a little while and came back and tried to do the same thing, it wouldn’t blend, because the paint would have dried.
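The talk does not describe Fresh Paint's actual physics model, so the following is only a toy sketch of the behavior demonstrated above: wet paint blends with a new stroke, dry paint does not, and a fan speeds up drying. The linear drying curve, the `DRY_SECONDS` and `FAN_SPEEDUP` constants, and the simple RGB averaging are all assumptions made for illustration.

```python
from dataclasses import dataclass

DRY_SECONDS = 60.0   # hypothetical drying time, chosen only for illustration
FAN_SPEEDUP = 2.0    # hypothetical: the fan doubles the drying rate


@dataclass
class Stroke:
    color: tuple       # (r, g, b), each 0-255
    applied_at: float  # time the stroke was laid down, in seconds


def wetness(stroke, now, fan_on=False):
    """Return 1.0 for fresh paint, falling linearly to 0.0 once dry."""
    rate = FAN_SPEEDUP if fan_on else 1.0
    elapsed = (now - stroke.applied_at) * rate
    return max(0.0, 1.0 - elapsed / DRY_SECONDS)


def paint_over(existing, new_color, now, fan_on=False):
    """Color that results when new_color is brushed across an existing stroke.

    Fully wet paint contributes half of the result; dry paint contributes nothing.
    """
    share = 0.5 * wetness(existing, now, fan_on)
    return tuple(
        round((1 - share) * n + share * e)
        for n, e in zip(new_color, existing.color)
    )
```

In this sketch, brushing yellow `(255, 221, 0)` across a fresh orange stroke `(191, 87, 0)` gives the blended `(223, 154, 0)`, while the same stroke made after the orange has dried simply stays yellow.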

So, more and more we want to create these very realistic experiences, whether they’re for entertainment or art purposes, or for other types of applications. Let me show you a little bit about how we’re thinking about these multidevice experiences and I’ll do that by using this productivity application. One thing I didn’t show you before but this tablet, as giant as it is, is also unique in that it supports a very high-resolution pen. So, what I showed you was the capacitive touch of my finger. We can also use, I didn’t demonstrate it, but capacitive brushes. So, for the artists we actually have brushes that actually have metal bristles, and therefore they actually behave on the screen like a brush would. And you have these high-resolution pens.

And so here is basically a OneNote file, which in this case is just sort of a chronology of these different types of Surface and touch capabilities. But, if I wanted to, for example, say, well, you know, I remember when I did this PixelSense thing, I can essentially draw on this thing, I can pick a highlighter, for example, and say it’s really interesting, we bought Perceptive Pixel here in 2012, and I’m, let’s just say, really excited about the launch of this Surface thing, that’s really good.

Now, if I go over here, I’ve got these other capabilities, and so if I go and start OneNote on this device now and you can see there’s just a camera looking down on this, everything that I did over there has been mirrored over here. So, there’s the circle. There’s the highlight. There’s my annotation. So, more and more the focus we have is to try to facilitate creating this kind of transparent affordances as you move from device to device. And the same is true on the smartphone. You can have OneNote on the phone. So, now you could record a note on the phone. It would show up on your tablet. And, in fact, had I pushed the record button when I started over there it would have recorded everything that I said. It would have time synched it to the drawing actions and they also would have been synched over here.

So, we think that this may turn out to be important in educational environments, for example, where people can give a lecture, or record something like this, everybody in the class doesn’t have to make their own copy of it. It just shows up in their own digital version of it. And they’ll be able to use these things in different and interesting ways to build all kinds of applications. So, we think it’s important for certainly work environments but ultimately education and other things, as well.

Another thing I wanted to show you is how we’re bridging these things between the devices and the entertainment environment, as well. Microsoft has been invested, our strategy for television for the last decade has primarily been around the game console, but increasingly the Xbox has become a hub for more than that, where we also are doing this type of Smart Glass capability. So, let me give you an example of how this stuff is working. Let me just start first on the phone. I’ll turn this one on. And as you can see, this is one of the Windows Phones, the new Phone 8, and here I’ve actually got a tile, which goes to The New York Times, because I like to read it. And if I actually touch on that it will open up this thing, go out on the Internet, and get me a copy of this New York Times Web page.

Now, if I happen to be at home and I have an Xbox hooked to my television, increasingly we find that people, particularly young people like you, want to be able to, for example, watch television, or play games, and yet they have a tablet, or they have a phone in their hand, or their lap, and they really want to use these things in some integrated way. Now, this is a nice little screen, but reading the whole New York Times is pretty tough on the little screen. So, I could say, all right, I want to share this page and I want to share it to my Xbox, because we’ve created a common identity system for the phone, the tablet, the Xbox, and even the big displays, what will happen now is that tablet is logged in with the same identity as these devices. And it should find that screen. It will hook up to it and it will basically take this page and move it onto the Xbox.

So, now what you’ve got is the ability for the big screen to be used to look at the things that you’re controlling from your hand. One of the things that’s really been challenging about trying to get the Internet on your television was to find a way to navigate in that environment. So, if you look here at what’s happened is that this Smart Glass application, as we call it on the phone, now has affordances that are directly coupled to the television. So, for example, if I scroll on the little scroll bar on this phone, the scrolling happens on the TV.

And similarly if I use my finger in the middle I can basically move the cursor around so I have full touch on my phone for anything I want. And you’ve got essentially the pan and zoom capability; by just pinching on the phone you can zoom in and read things. So, suddenly you’ve got the ability to use the phone, as not only a self-contained computing device, but it becomes the remote control for an Internet-enabled television experience. And we think that that’s going to be pretty interesting to people.

One other example of this, if I go back to the tablet, is essentially the ability to watch videos. Increasingly we expect the television will evolve in a way where almost everything will be delivered through the Web and sort of on an on-demand basis. So, whether it’s music videos, or full-length movies, here, for example, I bought a copy of Snow White and the Huntsman. So, if I click on that I started watching it. I can essentially say I want to resume watching it. And so it starts playing right here on my tablet.

But much like what was true with the Internet part, on the tablet I can actually also have this capability to, say, play it on the Xbox. So, what it does is it pauses it here. It will basically, this tablet now went over and found the screen. And it basically transfers the movie to that and brings a form of both additional metadata onto the tablet, as well as the controls for the movie.

So, in a second this thing should resume the movie exactly where it was on my tablet. Now, if you look back at the tablet what’s changed is in the scene that is in the movie, dynamically it shows you the characters and gives you metadata about them. And so if you want to learn about the people or the characters as you watch, it appears there. If I say show me the scenes, this thing has essentially scanned the movie and broken it down into the scene changes. And if I click on a different scene it will automatically change and go to that scene. And the metadata will change, as well, in that environment. And then all the transport controls are also available. So, I don’t want to compete with the movie; I’ll stop it.

So, this is an example of what we think of as this sort of multidevice experience, and we think this will be built more and more. That sort of leads naturally to the question of if you want these experiences to be not just multidevice for one person, what’s it going to be like to create an environment where people can communicate and collaborate over distance? So, there’s been a big effort in the company on this question of telepresent collaboration. So, the next thing I’ll show you isn’t a product at least yet, but it’s something that’s a prototype that we built several generations of in our research group. But, it helps to illustrate what we think happens when we give people more natural user interaction.

This is a really big issue for us at Microsoft, is this concept of moving beyond point, click, and touch in a conventional graphical interface, to allowing more natural interaction. What that means is a couple of things. We want the computer to emulate people better. That means we want to emulate your senses of sight and hearing and the ability to speak. But, in addition we want to be able to make it easier for people to operate in a world, which is increasingly blending the physical world and the virtual world.

The Xbox 360 a couple of years ago when we launched Kinect, the sensor for Kinect, was the first real mass-market introduction of a NUI experience, NUI means natural user interaction, where we recognized that gaming was actually quite challenging. Mastering the controller for a game console was a lot like learning how to play a musical instrument. It requires chording; the ability to almost automatically convert your thought into action, but the action has to be communicated through complex manipulation through your hands of that controller. And in a sense that’s a real limiter to how many people really become proficient at gaming, and a real limiter in terms of the ability to apply the Xbox kind of technology to entertainment-type applications.

So, I showed you that video application, again, still using a graphical interface, but with the newest version of the stuff that’s in the Xbox as of a week ago. If you have a Kinect you can talk to the Xbox. You can ask it to search for things. You can have it control all your media watching, including conventional types of television. And we wanted to have that kind of ease of use in order to be able to expand the audience. So, it was gender neutral and largely age neutral. And in a couple of short years we’ve actually seen that happen.

Today there’s about maybe 80 million Xboxes in the world. And for those I guess there’s maybe 15 million of them now that have Kinect attached. And in that audience, in fact, we see that the usage profile has changed to be mostly the whole family and that the range of gaming activity is much broader. And the reason is you didn’t have to have a controller at all. Things that you knew how to do in the real world just move around, touch things, reach out, they became the input to the game console. So, there was no learning. If you were going to play the dance game you just got up and danced. And to the extent you knew how to physically move that was all you needed to know in order to be able to make that work.

So, the next demo I want to show you is this IllumiShare thing, it’s a short distance only across a stage, but it would work arbitrarily long. Someone ask Matt, who works with me, to come up, and he’ll sit at that desk and I’ll sit at this one. What you actually have here is just a desk. There’s just a piece of paper sitting here on the desk. And in front of me is a tablet. The tablet is running Skype. So, I have essentially a voice and video call going from this table to that table. Everybody is getting pretty familiar with that kind of thing.

But, if you wanted to do more than just talk, if you wanted to collaborate, you wanted to work together on a design problem, or help people with a homework kind of issue, it turns out to be quite challenging to do that. So, the question is, could we create a system that blended the physical and virtual in a way where it was just completely natural to be doing a task with somebody else, even though you weren’t physically in the same place.

So, Matt just put his pen on his table, and of course you see it appear here. And if I take my pen and put it on my table, it appears on his.

And so now what happens is this lamp, the thing looks like a desk lamp, actually is sort of a sophisticated blend of an illumination system, a projection system, and a camera system. And they’re interleaved in time at a high rate of speed so that you can project and read out at the same time. And so what it’s doing is essentially creating a composite, an apparent virtual image, on each end that is the union of the physical objects and the virtual objects. And so if I take this thing and say, hey, I’ve been really struggling with this trigonometry thing, so they gave me this triangle and I’m really trying to understand it. So, I’m going to draw you the picture of the triangle here. And can you explain how the angles work, Matt?

So, he takes out his protractor, and says, okay, hey, I can measure these things and explain this problem to you. So, he says, okay, that one looks like 60 degrees. That one looks pretty obviously like 90 degrees. So, let me guess, what did you measure for this one? I’d say, no, I don’t think so, because I know there’s this rule that they’re supposed to add up to 180. So, this is wrong. So, this one must actually be 30 degrees.

So, what we’ve done is created an environment where people only knowing what they already know are able to manipulate things and see things in this environment, and interact at a distance they can talk and hear each other. And we’ve given young children, we just sit them down at this, you give them no lessons. You can give them a pile of magazines, little toys to play with, a deck of cards, drawing implements, and literally in a matter of seconds they start taking these things, they put them down, they create their own games, they play card games together, they draw tic-tac-toe puzzles together, and they just do it.

The computer basically maintains a continuous record of the evolved conversation, and at any moment in time in the cloud there is a composite image. And so you can stop a game, or a drawing, or anything else, you can go on to something else, and later you can essentially come back and ask it to reinstate it. At that point you both start with a projected image and then you build on that independently.

So, we think of this as just another example of natural interaction, and importantly in this case telepresent type of interaction. And we think that as we get these large surfaces in the workplace, maybe in the educational environment, the ability to bring all these things together to facilitate working and learning will be really valuable.

The next thing I want to talk about a little bit is big data, and ultimately machine-learning. You hear a lot about this term. I think it’s appropriately important now. And I don’t think that there’s really any field of certainly science and engineering, but increasingly business that is not going to be driven on an accelerated basis by the use of very, very high scale data. And that’s going to bring a lot of interesting challenges with it infrastructurally. But I think that it holds a lot of promise.

When I think of big data, there are really two separate ways to think about getting insight from a lot of data. One model basically builds on the classical ideas of human-guided investigation where you use visualization techniques, and other advanced ways to couple an expert or someone with domain knowledge to datasets, and coupled with these things use human intuition to gain insights. I think that will remain important. And as the datasets have grown in complexity, so too have the tools that we’ve been developing to facilitate that type of interaction.

But, beyond that, and we’ll talk more about that in a minute, there’s a concept called machine learning, which in the end may turn out to be even more broadly used, and potentially more important than even the best of the results that can be achieved by a human even with great tools.

Let me show you how the tools are really evolving for people in the scientific community. So, as you see on the screen are just three shots of some instruments that are part of the Monterey Bay Research environment in Monterey, California, and they spend a lot of time trying to research oceans, and in particular that bay to understand the ecology there.

What you have on the left is actually a floating instrument that just they put it out in the bay and it kind of drifts along with the currents. And it basically is able to take a lot of measurements. On the bottom right, one of the things they can measure with these laser-based instruments is, they can detect different types of RNA, and they can also detect chlorophyll. And what’s on the top right, looks like a torpedo, is actually an autonomous underwater vehicle. And they use these things in combination to collect a lot of data. So, the buoy drifts along and the torpedo-like guy is told to basically swim around it, and to basically swim up and down from the surface to the bottom. So, they essentially get one class of data from the big floating buoy, and they supplement it with this other data.

But, of course, the tracks are quite different, and the question is how do you integrate these things and bring them together.

So, to show sort of how this works now, many people are familiar with Excel. And increasingly Excel is a platform into which we can plug these additional tools. And we started a few years ago in Microsoft Research and built a thing called the World Wide Telescope. At the time, the astronomy community had these sky surveys and other things, and we had the NASA images, and the Hubble images. There was just a huge amount of sensor data both in the visible and nonvisible spectra from outer space. And yet nobody had a good way to look at it all at once. And so we built that tool to facilitate looking at all of these things, and to correlate them, and even allow people to make guided tours for instructional purposes, or collaborative purposes.

So, many people asked us, well, why don’t we have tools that are that good to explore things closer to home on the Earth. And so we’ve adapted this World Wide Telescope technology to allow that to happen, where we take the Earth’s geography, and the ability to composite datasets and build tools that allow people to explore, examine these things in time series, and to create movies, in essence, or automations to facilitate that.

So, here in the latest version of Excel, you can just put like 150 rows of data here that were actually output, lat, long, start time, chlorophyll level, and altimeter readings from that guy floating around in Monterey Bay. And you could take any data. You could suck it down from the Web. You could get it out of a cloud store. You can develop your own computational things. But once you put it into Excel and you plug this thing called Layerscape in, which is a new plug-in we’ve developed for Excel, you can say I want to view this stuff in the World Wide Telescope.

And so what it does is it takes the lat/long data, it basically goes out and gets imagery from that, and it plots this trace, which in this case is the trace of that thing floating around in Monterey Bay. And I’ve got all of these little toggle-able parts of the dataset, so that they call that thing the Tethys, and if I wanted to see its data, that’s the Tethys sort of floating around in that environment. If I wanted to see the measurements from the RNA measurements, they have two different microbes they are trying to track the presence of, and you can see the different volumes there. And it turns out this relates to the surface salinity and temperature. So, you can turn on datasets and look at them in various ways.

But I might have been working with a colleague or myself. I could say run the movie part of this that I’ve already made, and so you can see the time sequence of the trace that gets created, what the actual pattern was that this thing operated on. And each of these animations are things, these are the other two different datasets that are now being animated and shown, and in each of these cases you have the ability to pan and zoom, and use sort of the tablet type interface to manipulate it and look at the information.
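As a rough illustration of the kind of processing described above, the sketch below loads rows with latitude, longitude, start time, and chlorophyll columns, orders them in time, and yields the growing track one frame at a time, much like the animated trace. It uses only the Python standard library and is not the Layerscape or World Wide Telescope API; the column names and the three sample rows are invented purely for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; these values are made up, not real Monterey Bay data.
SAMPLE = """lat,lon,start_time,chlorophyll
36.80,-121.90,2012-10-01T00:10:00,1.2
36.79,-121.88,2012-10-01T00:00:00,0.9
36.81,-121.93,2012-10-01T00:20:00,1.7
"""


def load_trace(text):
    """Parse CSV text into a list of samples sorted by start time."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "lat": float(rec["lat"]),
            "lon": float(rec["lon"]),
            "time": datetime.fromisoformat(rec["start_time"]),
            "chlorophyll": float(rec["chlorophyll"]),
        })
    rows.sort(key=lambda r: r["time"])
    return rows


def frames(trace):
    """Yield the track as a growing list of (lat, lon) points, one frame per sample."""
    for i in range(1, len(trace) + 1):
        yield [(r["lat"], r["lon"]) for r in trace[:i]]


if __name__ == "__main__":
    for frame in frames(load_trace(SAMPLE)):
        print(frame)
```

Each frame could then be handed to whatever plotting or mapping tool is available; a second dataset, such as the underwater vehicle's track, would simply be another trace drawn over the same time axis.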

And so we think that these are very powerful ways to give scientists, whether you’re a data scientist in the business environment trying to understand your sales stuff on a global basis, or you’re a physical scientist who wants to really take these much, much larger datasets and manipulate them and understand them, the tools are getting fancier and fancier.

But the tools that I think we’re the most interested in now are really in the machine-learning area. And this is where we’re taking these super-scale facilities that we built largely as the backbone of these Web-scale services and we’ve started to turn them to operating on these very, very high scale datasets. The computer systems that have been built by the handful of companies that operate these global Web-scale services are wildly larger than the biggest computer systems that have ever been built anywhere. There is no government, university, consortium, anything else that has computing facilities that even come close to the capability of these super-scale Web facilities. And with them, we have the ability to ingest and operate on things terabytes at a time, and many of the datasets actually grow to petabyte scale.

This poses some interesting challenges down the road. I was talking to a student group here this morning and pointed out that we’ve all been socialized to this idea of distributed on-demand computing, and that all kind of presumed that the data was small enough to move around, or it was computed wherever you were doing the computing. This ML problem really is going to break that model. There is not enough bandwidth, even if you had a direct fiber connection, to take a petabyte here and move it over there so that you can compute on it, and then move it back.

And the idea that you could do it remotely is completely implausible because the cross-sectional bandwidth that’s required in order to have the super-scale computational facility and storage operate on these very, very high-scale datasets really just can’t be achieved in a distributed network at all.
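To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. The link speeds are hypothetical values chosen for illustration, not figures from the talk, and the calculation assumes the link runs at its full rate the whole time, which real networks rarely manage.

```python
def transfer_days(num_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to push num_bytes through a link at full, sustained rate."""
    seconds = (num_bytes * 8) / link_bits_per_second
    return seconds / 86_400  # seconds in a day


PETABYTE = 10**15  # bytes, decimal definition

# Hypothetical link speeds, for illustration only.
for label, bps in [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10), ("100 Gbit/s", 1e11)]:
    print(f"{label}: {transfer_days(PETABYTE, bps):.1f} days one way")
```

Under these assumptions a single petabyte takes roughly 93 days one way at 1 Gbit/s, and the round trip described above doubles that.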

But, that said, Microsoft has been building these tools and using them ourselves for the last five or six years, and increasingly we’ll try to make them available in some form to people through the Azure Cloud Service, but I wanted to give you an idea of what we do with this ML stuff already, and then give you some ideas about what we will do with it in the future.

First, all the things that we do in the NUI world related to speech, and ultimately even vision, are increasingly all being done using machine learning. The ability for the computer to learn a new language, and to be able to recognize and understand it is increasingly done by this machine-learning capability. Today, if you use one of our tablets or phones, you go to Bing, you push the microphone button, and you talk to it, ask it a question, the natural language understanding and everything else is all basically evolving continuously. So, the more activity we get, the better the thing becomes. And so the learning is a continuous process, and it’s done at very high scale.

A few years ago, early in the life actually of the Xbox system, pretty early, when people really started to play a lot of these online games, there was an interesting problem that to the naive user you wouldn’t really think that much about. But when you have literally tens of millions of people who at any given time go on the network and say, find me somebody to play with, one of the challenges is finding the appropriate skill level of the people to hook them up to play with.

It turns out, if you don’t pay attention to this, if you’re a beginner and you get paired to play with an expert, it’s just no fun. The expert doesn’t have any challenge, and the beginner never wins. And gradually they both get frustrated. And yet it’s a real challenge to describe skill. It’s a bit like the old judge a few years back in the United States who was handling a trial about pornography, and they asked the judge how do you define pornography, and he said: I don’t know how to describe it, but I know it when I see it.

And increasingly we have these kinds of problems over and over and over again, that the world of computing that we’re moving toward is no longer one, the old Don Knuth thing, data plus algorithms equal programs, and where you kind of knew with some predictability what you expected the program to do. Increasingly the people who work at Microsoft, some big part of them are literally writing code in some classical sense, but more often than not now they’re writing code that’s invoking subsystems whose output is statistically defined, not predictable. And that the systems, as they get more and more complex, are in fact an amalgam of these predictions.

So, in this case, given the range of games and the range of players, what we ended up doing was observing different behaviors of people as they played games, and from that synthesizing an abstraction of what skill was in each game. And the machine-learning system did that, and then it allowed us to match people up who were evenly paired. And that produces a much more interesting gaming experience.
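The talk does not say how the skill abstraction was computed. As a rough illustration of the general idea only, here is an Elo-style rating sketch, a far simpler technique than a learned skill model: ratings move after each game, and players with adjacent ratings are paired. The function names, the starting ratings, and the constant k=32 are all assumptions made for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both players' new ratings after one game."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def pair_players(ratings: dict) -> list:
    """Pair players with neighbouring ratings so matches are roughly even."""
    ordered = sorted(ratings, key=ratings.get)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


# Hypothetical players and ratings.
lobby = {"ann": 1000, "bob": 2000, "cy": 1050, "dee": 1900}
print(pair_players(lobby))        # beginners together, experts together
print(update(1500, 1500, True))   # an even match moves each rating by k/2
```

With an odd number of players the highest-rated one is left unpaired in this sketch; a real matchmaking service would also have to handle waiting time and team games.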

The whole Bing search system and all the ad network behind it is all driven by machine learning. Even years ago we started, and you’ll see in a minute a video that shows how this has evolved, but the whole concept of spam filtering, whether for e-mail or instant messaging, trying to separate the wheat from the chaff in that giant stream of information is also done with machine learning.

Interestingly, Kinect, when we created it, we did all the work to build the sensor that would be able to allow the algorithms to detect and make these skeletal models. That is sort of the fundamental first thing that we did in the gaming system was the sensor and the software of it builds skeletal models for up to four people at 30 hertz for the 42 major skeletal joints within a degree or two of accuracy of the position of every joint, and that’s what we give to the programmer and say, okay, now make a game. It turned out it’s pretty hard to describe a gesture. And so, indeed, in the first wave of games, people did some pretty cool things, but it took real experts, typically a couple hundred hours per gesture, to figure out how to describe the transformation of the 42 skeletal joints over a certain set of ranges in a time sequence that would constitute a gesture, any particular gesture.

And we wanted more people to be able to do this. So, recently we developed a capability where we use machine learning to figure out the gestures. So, for example, if you wanted to have a gesture which was you were doing a game like swat the ball. So, you say, okay, that’s a swat, and this is a swat, but this isn’t a swat, that’s a kick. And so what we did is we said, okay, just stand in front of the sensor, hit the record button, and make a few swats, and then do some things that aren’t swats. And then stop the recording and go back like you’re editing your home movies, and just put brackets around the things that you consider swats. Feed it into the machine-learning system, and in about one minute it actually figures out how to describe a swat from the few examples that you gave it. And we’re doing this more and more.

So, now literally anybody can make a gesture of almost arbitrary complexity as long as they can perform it and record it, the system can now figure out how to describe it. Even though in many cases a programmer, no matter how smart they are, would have a very difficult time systematically writing down the code that would have described that gesture.
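The learning method behind the gesture tool is not described in the talk. The sketch below shows only the record-and-label workflow, using a one-nearest-neighbour comparison as a stand-in technique: each clip is a list of (x, y) positions for a single tracked joint, and a new clip takes the label of the closest recorded example. The single-joint simplification, the clip data, and every name here are assumptions for illustration.

```python
import math


def resample(track, n=8):
    """Pick n evenly spaced frames so clips of different length are comparable."""
    last = len(track) - 1
    return [track[round(i * last / (n - 1))] for i in range(n)]


def distance(clip_a, clip_b):
    """Sum of point-to-point distances between two resampled clips."""
    return sum(math.dist(p, q) for p, q in zip(resample(clip_a), resample(clip_b)))


def classify(clip, labelled_examples):
    """Return the label of the closest recorded example (1-nearest-neighbour)."""
    return min(labelled_examples, key=lambda ex: distance(clip, ex[0]))[1]


# Hypothetical recordings: a swat moves the hand sideways, a raise moves it up.
examples = [
    ([(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 0.0)], "swat"),
    ([(0.0, 0.0), (0.1, 0.05), (0.3, 0.05), (0.5, 0.0), (0.7, 0.0)], "swat"),
    ([(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.0, 0.6)], "not swat"),
]
new_clip = [(0.0, 0.0), (0.15, 0.0), (0.35, 0.02), (0.55, 0.0)]
print(classify(new_clip, examples))
```

A production system would work on all tracked joints at once and would need to tolerate differences in speed and body size, which this sketch ignores.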

Another one is Clear Flow. In our system for the last few years, if you go to our mapping system and say get me a route to drive across Austin, you can say, show me the traffic, too. But when you ask it to make a route in the Microsoft system, it actually predicts what the traffic will be at each point along the path based on a lot of external events that are happening. So, if I asked you to drive across town today here, it’d say well, there’s no football game. But if you came next weekend and there was a football game and you asked it for the same route at the same time it would look and say, oh, there’s a football game, and it would route you a different way to try to avoid what it knows by experience is the congestion that occurs around the football game. And that’s also done by machine learning.

One of the neatest things in the newest version of Excel, like the one I was just showing you on the tablet, the giant tablet here, we’ve introduced a new feature which is called Flash Fill, and this also was developed using a machine-learning capability. And what it does is, if you put in a set of data for example you might have gotten from somebody else and it was all strings mashed together, it’s really painful in a spreadsheet to try to figure out how you apply some common transformation to a whole bunch of data. But that happens a lot, particularly as people get into this big data kind of environment.

And so, what Flash Fill does is you basically can manually put into the next column some extraction, or computation that you want relative to that data, and as soon as you’ve given it two examples, the system actually synthesizes a program that does that, and it applies it to all the data, and if it’s ambiguous it shows you that this part is ambiguous, give me another example.
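The idea of synthesizing a program from examples can be illustrated with a toy version. This is not how Flash Fill itself is implemented; it is a brute-force search over a tiny made-up language of "split on a delimiter, take one field, optionally change case", and all names and sample rows are assumptions for the sketch. Rows where the surviving programs disagree are flagged as ambiguous, mirroring the request for another example.

```python
from itertools import product

DELIMITERS = [" ", ",", ";", "-", "_", "@", "."]
CASE_OPS = {"same": str, "upper": str.upper, "lower": str.lower}


def run(program, text):
    """Apply one (delimiter, field index, case) program to a string."""
    delim, index, case = program
    parts = text.split(delim)
    if index >= len(parts) or index < -len(parts):
        return None  # the requested field does not exist in this row
    return CASE_OPS[case](parts[index])


def synthesize(examples):
    """Return every candidate program consistent with all (input, output) pairs."""
    candidates = product(DELIMITERS, range(-3, 3), CASE_OPS)
    return [p for p in candidates if all(run(p, src) == out for src, out in examples)]


def fill(examples, rows):
    """Fill each row; None marks a row where the learned programs disagree."""
    programs = synthesize(examples)
    if not programs:
        raise ValueError("no program in this toy language fits the examples")
    filled = []
    for row in rows:
        outputs = {run(p, row) for p in programs}
        filled.append(outputs.pop() if len(outputs) == 1 else None)
    return filled


examples = [("Porter, Bruce", "Bruce"), ("Mundie, Craig", "Craig")]
print(fill(examples, ["Gates, Bill", "Rashid, Rick A"]))
```

Here two programs fit the examples ("second field" and "last field"), so the first row fills cleanly while the second row, which has three fields, comes back as None until another example settles which program was meant.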

And by literally nothing more than a few examples, you can program the spreadsheet to do fairly complex transformations of data, and the system is actually writing code in real time behind your back that will perform these arbitrary transforms. And so, these are examples of how ML is being used already at Microsoft, and I’ll show you a video now that includes a guy named David Heckerman. David is a longtime Microsoft researcher, and he did a lot of work in the spam area in the beginning; but he’s also a physician, and got involved originally at Bill Gates’ request in the search for an HIV/AIDS vaccine, and so I’ll show you this short video to explain how that’s playing out now.

(Video segment.)

So, this is an example where the basic science oftentimes that we pursue in support of our basic business interests and needs, also has broad application in other areas. And increasingly through our research activity and then through the products we try to take these tools and make them available to people. So, here while this was done originally in research, these techniques have now been ported to the Azure cloud platform, which actually ultimately will give it even higher scale. It’s all obviously based on parallel algorithms and today we’re doing these things on clusters of about 10,000 nodes at a time. But, that obviously will get larger.

It really takes us into the realm of having to deal with very large data sets, also the issues of managing privacy, and constraints on use in this environment, because you have both patient data, as well as the genomic data of the virus itself. And we continually try different machine-learning algorithms. There’s no one specific, patented way of doing machine learning that works for everything. And so that’s also an evolving discipline. But, these are just examples of the way in which we’re using these computers to solve interesting problems and ultimately to help them be more like us.

That is what this natural user interface thing is all about, is really letting the computer be much more human-like in the way that you interact with it. From that you derive two real benefits, I think, in the years ahead. One is that you can change the level, the semantic level, of the interaction between the computer and the person such that it generally becomes more helpful.

For 50 or 60 years we’ve largely built computer systems and given people tools. And the tools have become increasingly powerful, but they’re still tools. And you had to master the tool in order to get the benefit. And today the planet has about 7 billion people. Most people think that will get to be 9 billion people maybe by the next 30 or 40 years. And unfortunately today only about 2 billion people have really gotten any benefit from computing directly. And in many cases they either don’t have it, because they can’t afford it, but even the cell phones have shown that the rate of adoption, even in the poorest places, is going to be pretty high.

As core technology is declining I think we have a challenge ahead of us globally to improve the networking capability and to make it expand and have lower costs. So, with this change to the natural interface we’ll be able to have all those other, ultimately 7 billion people, find a way to get benefit out of their computer, without having to master some complex set of tools. When people ask me, what’s your dream for it? What does this mean? I say, okay, think of yourself as a rural farmer in Africa, or India, or Indonesia, or wherever it might be, some years not too far from now.

You want to pick up your cell phone and say what day should I fertilize? And the thing comes back and says, Thursday. And if you had a farm bureau person in that country who worked with that guy regularly, knew the country, knew the crops, knew the soil, knew the weather, knew everything, that’s the discussion he’d have. And the question is if you want to get the scale benefit you need to be able to handle that kind of question as an expert system in almost any domain with that same level of simplicity.

When you think of the amount of data that has to be brought together and analyzed, almost continuously in real time, as a computer system, as a Web-based service, it’s pretty stunning, to answer that question. I mean, one, you’d have to have the history of the farmer. You can ask his cell phone where he’s standing, that might tell you the field he’s in. You might know what kind of seeds he bought, because he bought them through the cell phone at some point.

You have to look at the weather and be able to understand that. You’d have to look at well, what fertilizers not only work for him, but might be available that he could actually get his hands on. And in the end you want to provide him the simplest possible direction about how to optimize his crop production. And I think that all these things are possible, but they show what big data really means. Somebody asked me the other day, isn’t this big data just something for big companies or governments, and I said no, no, it’s actually the way you’re going to solve the most simple questions for people, whether they’re in the rich world or the poor world. It’s bringing these things together that really provides that.

If you look at the kind of things that we’re moving to do now, for example in the next version of the stuff we’re doing for the living room, as we’ve moved beyond classical forms of gaming to these other forms, and where we’re increasingly doing media control and trying to improve the experience, we’ve already now got, for example, the ability to search verbally through voice commands on the television, which is trickier than people think, because in a space the size of your living room, the signal to noise ratio is really pretty bad, with a microphone on one side of the room and you sitting on the other side of the room. So, how do we actually solve that problem?

Well, in the latest version, what the Kinect sensor has is not a microphone. It has an array microphone. And so we can beam form with the microphone. And then the question is, well, where should the beam be? The answer is, you ask the skeletal tracker to find the heads, and then you basically have the array microphones basically form beams that basically point just at the heads of the people in the room. And then all of the other noise is ignored.

So, you get the effect of close mic’ing, by combining machine vision with essentially computational array microphonics. And so it’s something that otherwise would require you to have a headset on or hold your phone close to your mouth. You could just sit there in the middle of watching your movie and just talk to the thing. It knows what sounds it’s making, so it subtracts all those. The beams basically filter out all the other ambient noises and you get a signal to noise ratio sufficient to actually do speaker-independent voice recognition.
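The simplest form of microphone beam forming is delay-and-sum: shift each microphone’s signal so that sound from the target position lines up across channels, then average, which reinforces the target and partly cancels sound from elsewhere. The sketch below assumes a straight-line array along the x axis and a head position handed over by some tracker; the geometry, sample rate, and function names are illustrative assumptions, not details of the Kinect sensor.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, in air at room temperature


def steering_delays(mic_x, source, sample_rate):
    """Extra arrival delay at each mic, in whole samples, for a source at (x, y).

    Microphones sit on the line y = 0 at the x positions in mic_x (metres).
    """
    dists = [math.hypot(source[0] - x, source[1]) for x in mic_x]
    nearest = min(dists)
    return [round((d - nearest) / SPEED_OF_SOUND * sample_rate) for d in dists]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay so the target aligns, then average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]


# Hypothetical three-microphone array and a head two metres straight ahead.
mics = [-0.1, 0.0, 0.1]
head = (0.0, 2.0)
print(steering_delays(mics, head, sample_rate=16_000))
```

In use, the delays computed for the tracked head position would be passed to delay_and_sum along with the recorded channels. Rounding to whole samples limits how finely the beam can be aimed; practical systems use fractional delays or frequency-domain methods.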

So, it’s a lot more complicated than you’d think in order to be able to do things that even appear to be simple. The game mechanics and other things are all changing with the gesture-based input, audio spatialization. Another thing that we’re working to is ultimately you want to be able to have multiple people in the room and whisper to one of them, and that’s again, something that’s possible. It’s the inverse of the beam-forming microphone, it’s the beam forming speaking array. And it’s been demonstrated that if you have the right array, and know how to do the computations and understand the physics of the room and the furniture and stuff, you could actually have the system whisper in one person’s ear, the other people can’t hear it.

So, things that people do with each other, like in the middle of a movie, if the phone rings you say, interrupt me, but only tell me. So, there’s all kinds of interesting things that happen when you start to bring these things together. Another one obviously is, and I think the sensors may have to get somewhat better in resolution to do this, but we already do face recognition, because when you come in you don’t want to have to log into your television every time. So, once you’ve basically given your facial recognition, go through a little exercise and the thing knows who you are, then when you walk in and sit down it says, hi, Craig, what do you want to do, and that’s a lot easier. It turns out you even have to do that in real time in these multiparty games, because if you have four people playing a game and any two have to be active at any time, people want to jump up and down. They sit on the sofa, and then it’s their turn, they jump up and they jump into the game. In real time you actually have to say, the player changed, and you have to know which player it is.

This is especially challenging in the TV environment, because, especially in entertainment, it’s in a reduced light environment. When you watch movies and things you turn the lights down low or off, and yet all this vision stuff is supposed to happen in the dark. So, it turns out we do that with infrared illumination, and we actually see in the dark. So, it just shows how many things from so many disciplines have to be brought together to make what appears to be just a simple piece of entertainment electronics for the future.

One of the really profound things that is part of this natural user interface thing is the ability to hear people, understand them, and act on it. And a few years ago Rick Rashid, who runs Microsoft Research, works for me, he started our research group 21 years ago and still heads it up. He gave the research group across the labs a mission; we have a number of these. We call them impossible things initiatives where we have something that we would obviously like to be able to do, but appears at least for now to be beyond the state of the art. But, it’s often a way to help galvanize the activity across these different disciplines. And so a couple of years ago Rick sort of gave them one of these charges which was I want to be able to get up in Beijing and give a speech where, when I speak in English, it comes out in Chinese in my own voice.

And so this is one of the sort of Holy Grail things that computer science has been working on and I’m pleased to show you that last week in Beijing we actually did this. So here’s a video from the floor of the 21st Century computing conference in China last week.

(Video segment.)

So, this was completely unscripted. So, there was no predetermined set of words you use or constraints on what he would say. The speech is English. It converts it into English text, which also allows some validation if you’re the speaker that didn’t really get what I meant. It converts the text into Chinese. And then it synthesizes his voice using a computational vocoder model of his larynx to produce speech in Chinese that reads that out.

So we’re extremely proud of this and I think it shows the kind of things that come from our long-term investment and willingness to keep investing in basic science. Today Microsoft, I think, certainly has the largest software computer science research operation in the world. We have 850 Ph.D.s full time, who do nothing except basic science. And if you contrast that, I was talking to the dean this morning, there’s about 43 faculty here, going to 60, and they have other responsibilities. So, it really shows the commitment that we made, when we’re much larger than any university, even the biggest of them. And I think that it shows in that Microsoft has remained a force within the industry. I think we will continue to be and I attribute a lot of that not only to the creativity of the people in the product groups who make all these things, but the ability to enter new businesses, to be able to defend existing businesses, and ultimately be able to do breakthroughs like Kinect, or the speech capability.

When the business group came to us about five years ago and said, we want this controller-less gaming environment, but we’ve thought about it and it doesn’t seem possible. We sat down and we started flying people in from the different labs, when we started parsing the problem. In the end we found that we had seven groups in four labs, on three continents, each of whom ended up having a critical piece to make Kinect possible. And all of them had worked in their field of research on that related thing for or more years already, and none of them ever had any idea that the application of that technology would be to make a sensor for a game console.

So, it shows the importance in my mind of maintaining both a commitment to basic research, even in the business environment, not just in academia, and also the importance of not having too much coupling to only immediate problems in guiding what the research is. This is one, as a policy person, this is one of the things I worry the most about in our academic environment these days, is as the government has sort of reneged on funding basic science in the universities, the universities turn to businesses and say, hey, you guys fund us and they usually do, but they come with strings attached, they’re usually much more short-term in nature, not basic science anymore. And I think that we’re ultimately weakening the society’s ability to have these inventions that make fundamental change happen.

So, we at least are trying to do our part to pull on the oar of long-term curiosity-driven science in this important field. And it’s expanding, I mean, our newest lab is a combination of groups in New York and Boston. And one of the new groups that we’ve built there are people who do computational economics, because increasingly you can’t build these super-scale systems and operate their business models unless you understand the economics associated with them and they’re very different. So, let me stop there and move onto the all-important giving out of the prizes.

So, I guess we have raffle tickets here in the bag, and I know there’s some people in the other room and so we’ll wait to get a report in case any of them are... So, we’re going to give out I guess eight things. The first two are going to be brand new Windows 8 Phones. So, these two people, and all of you who win, just find Erica after and she’ll be back in the back, and she’ll arrange to give you these things. The phones, because you have to pick your phone type and your carrier, you’ll get a coupon where we’ll get these things when you make those choices. The others we actually have them here.

So, the first phone winner, the last three digits are 700, 700. You have to be present to win. How about the other room? Okay. It’s in the other room, all right. Okay. The second phone ends in 603. It’s interesting if they’re all in the other room, 603, the lady back here. Okay. All right. So, the next thing we’re going to give away is three handy-dandy new Xbox 360s with Kinect and for 1000 points and a gold membership, the whole nine yards.

So, this one ends in 653. Close, but no cigar, 653, okay, there you go. You get the first one. Okay. The second one goes to 783, this man right here. Okay. And the last Xbox goes to 707. All right, this man right there. Okay. And the last thing we’re giving away, hot off the press, is three brand new Surface RTs. Okay, 799, the other room. The other room, okay. Okay, one down, two to go, 609, 609, okay. Right there. That row is doing pretty well. And the last one 610, 610, okay, right there. All right. Thank you. I hope you enjoy them.

END
