Remarks by Rick Rashid, Senior Vice President, Microsoft Research and Roy Levin, Director, Silicon Valley Lab
Microsoft Research Road Show in Silicon Valley, Mountain View, Calif.
May 22, 2008
DAN’L LEWIN: Well, people are going to be filtering in for a little while. We’re expecting more than a full house today. By way of introduction, my name is Dan’l Lewin, I’m a Vice President here at Microsoft, and I get to work and play here, have for 30-plus years. This is one of those play days, where a lot of hard work has gone into what you’re going to see, and you’re going to get to explore a bit, and I’m pretty excited about it. It’s one of the more special things we get to do on this campus.
But before I spend a moment to introduce Rick and Roy, I would like to ask for a show of hands. How many people have been on this campus before, and how many people have not? (Show of hands.) It’s almost 50-50. Good. Welcome to those of you who have never been here before. And I will give a brief background, because there’s a large percentage of the audience that hasn’t. This is a primary campus for Microsoft in Silicon Valley, but not the only place that we’re located. There are now well over 2,000, rapidly growing to about 2,500 full-time people doing development and research for Microsoft in Silicon Valley. Most people don’t know that. These five buildings are the core, but you drove by a sixth building that we just opened up a little while ago on the way in as you came around the corner by the Starbucks, by the Computer History Museum.
There is also the TellMe Campus, which is just up the road off of Shoreline, a company we recently acquired, doing some really interesting things with voice applications that will span many, many areas of our business as well as general use of computers over time. And also Danger, a company that people may or may not know, but they have a little flip-top device, and they’re in Palo Alto. And we actually have a small office in Sunnyvale, and it was rumored that we were looking for a big one, too, a little while ago. That’s just a joke. All right. Enough of that.
So this is the Fourth Annual, or not annual, I should say; we do this about every other year. This is the fourth time we’ve done our Microsoft Research Roadshow in Silicon Valley, and it’s a really good opportunity for us to show off a lot of what we do in research. And I’m going to leave it to Rick and Roy to talk more about the way we think about research, but I’m proud to be at Microsoft, and to be able to make the comment that one of their overarching goals is to contribute to the state of the art, and do it in an open and collegial way, and so it’s a terrific opportunity for us to, again, share hands-on, and publicly, a lot of what’s been going on.
It’s a terrific day for us. This morning we hosted 140 local high school students as well. One of our efforts in furthering computer science and research in general is to drive awareness and enthusiasm in the student community, and to do our part in filling what is often perceived to be one of the biggest shortages we have, beyond oil and maybe water over time: in society and in our industry in general, we’re really lacking talent in the technical disciplines. So things that we can do in that area to help stimulate it, and things that you can do, are certainly appreciated.
General background, there are about 800 people in Microsoft Research. Rick may mention some of these things, but we’ve got centers, obviously here in Silicon Valley and Redmond, but also dedicated facilities in China, in India, the UK, and an operation in Boston that’s recently opened up as well.
All of this started some long time ago, in 1991, when Rick Rashid came and helped build out the research operation within the company, and he’s been the long-standing leader of that organization since that time. Rick has a tremendous background both in the academic community and in doing research in industry. He has recently been elected to the National Academy of Engineering, is soon to be inducted into the American Academy of Arts and Sciences, and it was recently announced that he is going to be receiving the IEEE Emanuel R. Piore Award as well. His background is in computer science at Carnegie Mellon and the University of Rochester, and he has done a tremendous amount of work with the National Science Foundation, and in general in furthering the cause and importance of computer science as a discipline on a global scale. And so we’re awfully proud to have him as part of the company and as part of the industry.
Following Rick, Roy Levin, who runs the Silicon Valley lab working directly for Rick, has an operation here with 50-plus researchers doing some sophisticated work that he’ll talk about as well. Roy came in August of 2001, a little bit after I joined the company, so I have a bit more tenure on the campus, and he’s been a great colleague in that regard. He’s got a long history in Silicon Valley. Roy goes back to Xerox PARC long ago, and then also to Digital and Compaq through the acquisitions, and then joined Microsoft about seven-plus years ago. Roy has done terrific work in his field. He’s been a primary contributor to the Topaz programming environment, and also the co-leader and developer of Cedar, an experimental programming environment for high-performance workstations. So he is both terrific hands-on talent and great leadership for our organization, and for the industry in computer science.
So, without further ado, I’m going to hand things off to Rick and let you get on with the day. Thanks so much for coming. (Applause.)
RICK RASHID: I’m just going to give you kind of a quick overview of my organization, Microsoft Research, tell you a little bit about what we do, why we exist, and what we’re about, and then move on from there. Dan’l mentioned, we started Microsoft Research back in 1991, and interestingly enough, it just didn’t happen by accident, there was actually a memo to the Microsoft Board of Directors in 1990 from a gentleman named Nathan Myhrvold, who eventually was the one that hired me into Microsoft. And Nathan in that memo argued that it was important for Microsoft to engage in long-term, basic research.
Now, what was interesting about that was that Microsoft in 1990 was an awfully small company. It had a very small number of products, I think I’ve almost got the full list on the screen here, and just a few thousand employees. It was certainly not the kind of company, or the scale of company that one might think about making a decision to invest in long-term research. But, the board of directors, and I think this is a statement about Microsoft, decided that, yes, it did want to do that. So I was brought in, in 1990, really to focus on creating a basic research organization.
When you look at Microsoft today, more than 16 years later, it’s a company that has been dramatically changed by that decision to create a basic research organization. Now, I’ll talk about this a little bit more as I go on, but a lot of what you think about as Microsoft today really came out of that decision to start a basic research group. And there’s a culture that’s developed within Microsoft over the years that really has research as a critical component of it.
When you sit through product reviews over the years, with Bill, and now with Ray and Craig, Craig Mundie and Ray Ozzie, it’s interesting to see how many times the discussion turns to, well, what are you doing with research, what is the relationship, people saying these are technologies we’re bringing in from our research organization. It’s a company that’s really been changed by that, and large parts of what you think of as Microsoft historically came out of the research group.
This is my history slide, for people who don’t know what I’ve done. If you ever hear of somebody using the word NUMA, I made that up for a paper I wrote back in the early 1980s, because I couldn’t find anybody that had a name for those types of architectures. These are the non-uniform memory architectures, and I said, well, okay, I’ll make one up, and I made one up. I actually made two other names up at the time, too, for other architectures, but this is the only one that really stuck.
People, when they say micro-kernel, often refer back to some of the work that I did on the Mach operating system, which then of course has been used by many other people. So if you’ve used a version of Digital UNIX, now Compaq or HP UNIX, you’ve used code that I wrote back in the 1980s. If you use a Macintosh, you use code I wrote back in the 1980s. If you use an iPhone, you use code I wrote in the 1980s. If you use a Windows machine, you probably use code I’ve written. So one way or the other I’ve got you. I know the backdoor sequences for all those systems, and so you’re in trouble.
I’ve also done computer games. I did one of the very first networked computer games, Alto Trek. I did the first online-only game that Microsoft produced, called Allegiance. Interestingly enough, they are actually sort of the same game: for Allegiance I started with the code from Alto Trek and rewrote it, and then eventually we built it up. You wouldn’t know that they were related, but if you know the history they are, and there’s code in one from the other.
Now, I’ve had the pleasure, really, and the privilege of being able to start Microsoft Research and run it for its entire history. What that means is that I’m the Microsoft executive who has been in the same job the longest. I’ve been able to do that for that long, and what it’s done is it’s allowed me to have the same philosophy, the same approach, the same way of doing business for that long. I think that’s valuable in a research environment. We’ve always had the same mission statement. This is a slide from 1992. It’s been put in this presentation, but the words are from 1992.
Our first and foremost goal is to move forward the state of the art in computer science. And by that I don’t mean do something for Microsoft, I mean move the state of the art forward, because my belief is that unless the research organization has that as its goal, it’s not going to be successful. You can’t, first and foremost, focus on your company, or your product, you have to focus on the art, the science that you’re doing. If we do that, then we can be valuable, if we don’t do that, Microsoft can buy technology. We buy technology every day. The value of having a basic research group is having people at the state of the art.
When we are successful, when we have great ideas, then our goal is to move those ideas into products as quickly as we can. But, we don’t put the cart before the horse. We don’t buy at the front end of the innovation pipeline, what we do is try to harvest the other side of it. So really that’s an important relationship and one that I always stress within the organization. Ultimately the goal of Microsoft Research is to make sure there’s still a Microsoft in 10 years. We’ve been able to do that for 17 years, hopefully we’ll be able to do it for another 17 or more into the future.
We’re organized very much like a university would be. So, if you look at Microsoft research today, the way we function, it’s not that different from the way I saw Carnegie Mellon University functioning when I was a professor there back in the mid-1980s. Arguably there are things that we do in Microsoft Research that are closer to CMU in the mid-1980s than CMU is today. And part of that has to do with changes in the government-funding regime. We’ve had the advantage of being able to not have to worry about that.
We’re a very open environment, we publish the work that we do, and arguably we’re one of the top publishers of research in computer science today. I’ll speak to that a little bit more in a second. And we have tons of people coming through the organization, lots of visitors. We run the largest Ph.D. internship program in technology. So each year we’ll have over 1,000 Ph.D. interns in some part of Microsoft Research around the world. In Redmond alone last summer we had more than 300, and we had more than 30 here in Mountain View just this last summer. And this year all those numbers will be greater. So we run a huge operation in that sense.
We’ve grown a lot, from one back in 1991, that’s hard to see on the chart, to today we have over 800 Ph.D. researchers. And, again, I want to stress that the Ph.D. researchers is the number I’m using, because if you said, how big is the organization that works for me, it’s a lot larger than that. I have about 1,300 people that work for me directly as full-time employees. And then on top of that, add visiting professors, interns, and so forth. So at the peak, in the middle of the summer when we’ve got most of the interns working in our labs, we’ll have more than 2,000 researchers doing research in computer science.
Dan’l mentioned the different labs that we have. The most recent one we’ve announced is Cambridge, Massachusetts, and we’ll be officially opening that some time this summer, and that’s right on the campus, right next to the campus at MIT.
We publish a lot. This is a slide from actually two years ago, so these numbers are now well past this. So whatever it says here, it’s way past that now. Most of the top conferences you’ll see today, somewhere between 10 and 20 percent of papers will have a Microsoft research author, and there are some conferences that have had as many as 30 percent of the papers. So we really view publishing, and peer review literature as a critical part of what we do. And the peer review process is an important part of our process.
We work extensively with the academic community. In fact, one of the first things we did when we started Microsoft Research was to create a technical advisory board of academics. We did that before we had very many employees, and we’ve continued to do that, to maintain that advisory board over the years. And then each of the other labs we’ve created around the world has created its own advisory board of researchers from their communities and their geographies.
We have extensive, it’s really hard to even run through the list, because it’s way too big a list, extensive relationships back to universities around the world. Within North America we support a number of different centers, like the Institute for Personal Robots in Education and the Center for Computational Thinking. We just recently announced the Universal Parallel Computing Research Centers at Berkeley and Illinois, a ton of things going on. We have the New Faculty Fellowship awards that we do, which have become very prestigious, graduate student fellowships, RFP-based research grant programs like Digital Inclusion or our Sustainable Computing program, a ton of stuff going on.
In Europe we have something called the European Science Initiative. And, again, more things here than I really have time to speak to, but one of the things we’ve done there, as well, is to reach out to the institutions in Europe, and to work with them in various ways. We’ve created a joint research laboratory with INRIA, located just outside of Paris. We have a center that we support with the University of Trento, and the Trentino government, and the EU, in computational biology. We have a center we’ve just recently announced in Barcelona, in conjunction with the supercomputing center there. So there’s a lot of things going on.
Other parts of the world, Asia, we’re an educational institution in Asia, we have the right from the Chinese government to grant degrees in computer science. I think we’re the only outside institution that has that right. We have joint Ph.D. programs with organizations like Jiao Tong University in Shanghai. We support any number of joint laboratories in Japan, Korea.
We have a Latin American and Caribbean Collaborative Research Federation that we’ve established. I was just in Panama last week at our faculty summit event in Latin America, we had over 300 faculty from all over Latin America. One of the reasons that’s important there is because we’re hoping to create a community of Latin American computer scientists. Historically, in Latin America many of the researchers had their strongest relationships with Europe, or with the United States, and not with each other. And we’re trying to change that.
We’ve got great people. I’m not going to belabor this. You know many of these people, and the roles that they play. We have more members of the National Academy of Engineering than the entire University of Washington, to make a point. But the thing that’s interesting to me is that we started now almost 17 years ago, and you look at so many of the young people we hired back in those really early days, and they’re now winning career awards in their fields, people like Hugues Hoppe, or people like Harry Shum, who has now taken a prominent role running our search product activity, but who helped to make our lab in Asia as strong as it is. We’ve really had a huge amount of success in promoting young people, and moving them forward, and I think that’s critical.
Driving our technology to products, we’re doing that at a significant rate. There’s a lot of unsung stories, a lot of things that we’ve done over the years that no one draws much attention to, but that were critical to the success of the company. There’s a technology we developed in the area of program optimization, for example, that allowed us to ship Windows 95 and Office 95 together, and gave us a dramatic competitive advantage over Lotus and WordPerfect at a critical juncture in time.
It’s an underlying technology, but it was important to the company. And it’s things like that that really make a difference. A lot of products have come out of research, a lot of technologies in products, but also whole groups. I mean, I started the first e-commerce group in the company; the research group doing streaming media and interactive TV back in 1992-93 developed into what became the digital media division, and then eventually became a critical part of what is now our entertainment and devices division.
The technologies that were underlying DirectX, and eventually the Xbox and Xbox 360, again, all came out of research we did in the 1994, ’95, ’96 timeframe. I was one of the first people to run with the DirectX group back in those days, and often I’ve had the role of running product groups in their transition from pure research over into the product arena.
Now, the last point I’m going to make here really is, why do we do this? Right. We do all this outreach to universities, we do all the research internally, what is the value of basic research to Microsoft? When people talk about this, I know a lot of companies that do research, and I think they don’t get it right. They’ll talk about basic research, and they’ll talk about the things that are on this slide. They’ll say, we have a basic research group, because it’s a great source of IP, and new product technology. And that’s true, basic research groups are good sources of IP and product technology, but I would argue that’s not why you do it, it’s the result of having done it.
So they’ll say, well, we’ve got a research group, and they’re great at solving our problems, or solving our customers’ problems, and that’s why we do it. Well, that is a consequence of having a basic research group, but it’s not why you do it. Right. They’ll say, research groups are great early warning systems. They’ll say things like, hey, that search technology thing, that might be big some day. Okay. Well, we do that, right, like I did that. But that’s not why you do it. Yes, we’re a source of information for the company. We get to tell people about new ideas and new technologies, we have our connections out into the industry, but that’s not why we do it.
I think the reason we do basic research is for survival. It’s so that you’re still going to be here someday. Right. It’s to give you the agility to change when change is critical. If there’s a new technology, if there’s a new business model, if there’s a change in the competitive climate, you need to be able to respond, and having a basic research group gives you the capacity to do that. And I think this is true not just for a company like Microsoft, this is true for a society, this is true for humanity more broadly.
The reason you invest in basic research is so that if something really bad happens, a war, a famine, Google (laughter) you can respond. And I think that’s critical. And when you look at people say, where’s the value of Microsoft Research, and I say, well, Microsoft is still here. Right. A lot of the companies that go back to 1990, when we started Microsoft Research, they’re not here any more. And I think that’s in part because we have been able to evolve and change. We’ve created I don’t know how many different billion dollar businesses in the last ten years, six, seven, eight of them. It’s a big deal to be able to have that capacity to change, the capacity to grow.
So I’m just going to finish up here. I’ve already said some of these things. Some of the things you’ll see as you walk around, these are technologies that have come out of research just recently that are now either being deployed in academia, being deployed in our products, being used by small businesses, Response Point is an example of that, all incubated within Research.
We also do things that have nothing to do with Microsoft, at least nothing that I can think of. We have work going on in areas like computational biology. We’re doing work in ecology, and you’ll see some of that work here. We’re looking at vaccine candidate analysis for AIDS. We’re looking at malaria. We’re really thinking about how computer science as a discipline can broadly impact the sciences, and help move science forward faster. And honestly, when I started Microsoft Research I would never have imagined we would be publishing in these venues. This is not your traditional computer science set of venues.
With that, I’m going to pass it on to Roy, who can talk to you a little bit more about the last year.
Roy. (Applause.)
ROY LEVIN: That’s interesting. Anyone know the password? Here comes the man with the password.
Great. So while we’re doing that, anybody know any good jokes?
It’s a pleasure to be here today. Thank you all for coming. I’m the director of the Microsoft Research lab here in Silicon Valley. And what I’d like to do is to follow-on from what Rick told you about the organization overall, and dig down a little bit into what we do here in this lab in particular. And if my slides would only come up, that would be a whole lot easier.
Ed, can you come back down here, please? There we go. Thank you.
So just to quickly summarize about the lab here, Rick gave you a sense of the organization overall. We are a smaller lab than some of the larger organizations in MSR, with about 50 or 60 researchers here in the Valley. Our focus is really on distributed computing. As Rick said, research overall covers just about every aspect of computer science you can think of, but we take a more particular focus down here. Distributed computing is still a big field, and in the next 20 minutes or so I want to give you a sense of the breadth of that field by talking about some of the work that we do.
The expertise in the lab actually is quite interesting, in that it spans a quite considerable spectrum, from theory to practice. We have people who write lots of systems and debug them, we have people who prove theorems, and kind of everything in between. And we don’t divide them into any particular formalized sub-groups. We rely on the lack of structure in the organization to give us cross-discipline, or cross-experience, collaboration. And that works out extremely well, producing results that wouldn’t happen if you focused your research within individual groups.
Rick has already talked about our visibility in the professional community. I won’t belabor that. Just suffice it to say that all the things that he mentioned about the way MSR works overall apply to us specifically here in the Valley, as well, with academia, with our role in the field as professional contributors in research, and our contributions to the company.
So what I’d like to do is to move right into some of the work that we do. You’re here probably mostly not to hear me, but to come and see the demos that we have next door, which highlight the work in the lab here. What I want to do is to give you a little sense of the breadth by talking about some projects that you will not see over in the demo area, and that work really falls into a variety of areas. The six things that you see listed here are the rough topic areas into which we group our research. As it turns out, making these categorizations is a little bit arbitrary, and the work we do will often fit into multiple categories, but at least it’s a way of thinking about what goes on.
So I’m going to quickly go through roughly half-a-dozen projects that are going on in the lab, that span these areas in various ways, to give you a sense of the breadth of the work that we’re doing. And you’ll see more, of course, on the demo floor next door.
Let me start with some work in our theory area, in particular a classical problem in computer science, the problem of finding shortest paths in a graph. And here the problem is focusing on very large graphs, a problem most of us are familiar with, I’m in a car, I want to drive from point A to point B, what’s the shortest route, typically in time, but it could be in distance if I cared about that. And that’s a problem that has a classical solution. The classical solution is not particularly attractive when the graphs get as large as a roadmap of all the United States, or all of Europe, for example.
So in order to try to speed up that approach people have tried various techniques, and employed heuristics. And heuristics are good, they make it faster, but they don’t always give you the right answer. So what you’re getting is approximately shortest paths. And approximately might be okay in many cases, but often you want what’s provably optimal.
So the work that we’ve done here is really maintaining that prime consideration that you get an optimal route, but be able to do it very, very fast, much faster than the traditional algorithm, and you see here a factor of 10,000 in speedup, compared to the classical approach.
So just to give you a sense of what this is, here are two maps of the United States. On the left you see two blobs of color, one representing, well, coming from source to destination, green to blue, and what the colors represent is all the places that the classical algorithm looked to try to find what might be the shortest route. And as you can see, there’s kind of a ball of possibilities that’s grown around each of them before it’s done, and it finds the path that goes between them that is the optimal route.
On the right-hand side, you see our work, which of course finds the same route, but you can barely see the places where it’s looked. In fact, they’re much smaller balls, and a few scattered points elsewhere around. So you get the sense that that can be done much, much faster. The techniques that are used here involve the little red blobs around the edge, which we call landmarks: we do a lot of pre-computing to figure out distances between the source, the destination, and these landmarks, and then exploit that pre-computation to make the individual searches very fast. It’s been done in a way that can be implemented on a server, which is what those numbers I showed you were for, or on a desktop, or on a portable device, so this becomes a real-time re-computation, which might be important to you as you’re driving down the freeway and you discover that the traffic is bad and you want to find an alternate route right now.
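To make the landmark idea a bit more concrete, here is a minimal Python sketch, under simplifying assumptions (a tiny undirected graph, a single landmark, and invented names such as alt_shortest_path), of how precomputed landmark distances give an admissible lower bound for an A*-style search while keeping the answer provably optimal. It is not the production algorithm, just the core triangle-inequality trick described above.

```python
import heapq

def dijkstra(graph, source):
    """Exact distances from `source` to every node (the classical algorithm)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def alt_shortest_path(graph, source, target, landmark_dists):
    """A* search whose heuristic is a lower bound derived from landmarks.

    landmark_dists maps each landmark to its dijkstra() table. The triangle
    inequality gives dist(v, target) >= |d(L, target) - d(L, v)| for every
    landmark L, so the heuristic never overestimates and the answer stays
    provably optimal, while the search visits far fewer nodes.
    """
    def h(v):
        return max((abs(d[target] - d[v]) for d in landmark_dists.values()), default=0)

    dist = {source: 0}
    heap = [(h(source), source)]
    while heap:
        _, u = heapq.heappop(heap)
        if u == target:
            return dist[u]
        for v, w in graph[u]:
            nd = dist[u] + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd + h(v), v))
    return float("inf")

# Tiny undirected example: node -> [(neighbor, edge_weight), ...]
graph = {
    "A": [("B", 2), ("C", 5)],
    "B": [("A", 2), ("C", 1), ("D", 4)],
    "C": [("A", 5), ("B", 1), ("D", 1)],
    "D": [("B", 4), ("C", 1)],
}
landmarks = {"D": dijkstra(graph, "D")}                 # pre-computation, done once
print(alt_shortest_path(graph, "A", "D", landmarks))    # 4, via A-B-C-D
```

The pre-computation (one Dijkstra pass per landmark) is done once; each query then explores far fewer nodes than plain Dijkstra, which is the effect the two maps illustrate.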
That’s an instance of a piece of theory work that we’ve been working on. And, in fact, some of this technology has found its way into Microsoft’s MapPoint products already, with more to come.
Switching gears to a completely different area, but again suggested by what Rick talked about as our reasons for agility, and the need to be able to deal with new competitors and new business models: the whole area of electronic market design is one that we’ve moved into relatively recently, where I think there’s a great deal of work to be done. This is basically the problem of designing computing systems that deal with participants, who might be humans or computers, that are competing in some sense for some advantage or other. There are many examples; the ad auctions that you see for sponsored search are one, but there are many others in the online setting. And we’re designing these algorithms in a way that, in economic terms, maximizes social welfare. It is actually a quite difficult problem, there are extremely subtle issues that arise here, and doing this in a way that is provably either optimal or within some bound of optimal is, in some sense, a kind of modern rocket science in my view.
Some of the projects that we are doing are on the screen there; I don’t have time to go through them all, but just to suggest what one of them was about, take the sponsored search auction. You think about the ads that are placed, the keyword-related ads that show up down the side of a search page: there’s an ordering to those things, and there’s the so-called conversion rate that happens when somebody clicks through one of the ads. So you have a click-through rate, and then you have a conversion rate, which is that they actually buy whatever it is that the ad was selling. And you want to order these things to maximize revenue, or profit, or fairness in some way. That turns out to be actually a quite unintuitive problem to solve, and you get somewhat anomalous things happening if you do it in the obvious way. This is where having people who can deal with this stuff in a mathematically precise sense becomes extremely important, and that’s one of the areas that we’ve come to focus on.
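To illustrate the kind of anomaly being described, here is a toy sketch with hypothetical numbers and a deliberately simplified pay-per-click model (ignoring position effects and the incentive issues a real auction mechanism must handle): ranking ads by raw bid and ranking by expected value per impression produce different orders, and only the latter maximizes expected revenue in this simple setting. It is not the actual auction work from the lab, just a flavor of why the obvious ordering can go wrong.

```python
# Hypothetical ads: (name, bid_per_click, click_through_rate, conversion_rate)
ads = [
    ("A", 2.00, 0.10, 0.02),
    ("B", 0.60, 0.40, 0.10),
    ("C", 1.00, 0.05, 0.50),
]

def expected_revenue_per_impression(ad):
    """With pay-per-click pricing, expected revenue for one impression is bid * CTR.
    (With pay-per-conversion pricing you would multiply by the conversion rate too,
    which is exactly where the orderings start to diverge in non-obvious ways.)"""
    _, bid, ctr, _ = ad
    return bid * ctr

by_bid = sorted(ads, key=lambda a: a[1], reverse=True)            # the "obvious" order
by_ev = sorted(ads, key=expected_revenue_per_impression, reverse=True)

print([a[0] for a in by_bid])   # ['A', 'C', 'B']: highest bids first
print([a[0] for a in by_ev])    # ['B', 'A', 'C']: highest expected revenue first
```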
Moving on to yet another area, but again related to the Web, I want to talk a little bit about how you get relevance in Web queries. This is obviously a very important thing. You go to a search engine, you type in a bunch of words, you get back a bunch of query results, and you’d very much like to have those ordered by relevance, so that the ones at the top are the ones of most importance or interest to you. The way that this is solved in search engines involves a variety of algorithms for ranking results. The classical way to do this, the PageRank algorithm, uses the static structure of the Web graph. That is to say, it computes the relative interest of pages without paying attention at all to what your query was. What happens is that the query selects some subset of the pages, and then they’re ranked based on this static notion.
People have thought that it might be better to use information from the query to improve that result, and so there have been various algorithms designed where you take the query results and rank them based on some relationship among the pages that came back from the search engine, using their relationships in the Web graph. But actually implementing these things at scale, and understanding whether they do better, is fairly hard.
So a project that we did was to implement some very specialized infrastructure to make it possible to analyze a variety of those algorithms at scale, which means having a representation of the Web graph, very large number of pages, that is extremely efficient to probe and to ask questions about, like where are all the pages that are two away from this one, either in the forward or the backward direction of the hyperlinks. Once you have a system like that, you can then go and analyze these graph algorithms, and discover what works best.
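As a rough illustration of the kind of operation that infrastructure has to make cheap, here is a toy Python sketch: a tiny in-memory graph with forward and backward link access, a neighborhood query, and a deliberately simple query-dependent score (in-link counts within the query neighborhood). The class name and the scoring rule are invented stand-ins for exposition, not the actual system or the specific algorithms that were compared.

```python
from collections import defaultdict

class ToyWebGraph:
    """In-memory web graph with forward and backward link access.
    (A real representation compresses billions of links; this only shows the shape
    of the queries the infrastructure has to answer quickly.)"""

    def __init__(self, edges):
        self.fwd = defaultdict(set)    # page -> pages it links to
        self.back = defaultdict(set)   # page -> pages that link to it
        for src, dst in edges:
            self.fwd[src].add(dst)
            self.back[dst].add(src)

    def neighborhood(self, seeds, hops=1):
        """All pages within `hops` links, in either direction, of the seed results."""
        seen, frontier = set(seeds), set(seeds)
        for _ in range(hops):
            nxt = set()
            for page in frontier:
                nxt |= self.fwd[page] | self.back[page]
            frontier = nxt - seen
            seen |= nxt
        return seen

def query_dependent_rank(graph, result_pages):
    """Score each query result by how many pages in the query's own link
    neighborhood point to it, rather than by a global, query-independent score."""
    hood = graph.neighborhood(result_pages, hops=1)
    score = {p: len(graph.back[p] & hood) for p in result_pages}
    return sorted(result_pages, key=score.get, reverse=True)

g = ToyWebGraph([("a", "b"), ("c", "b"), ("b", "d"), ("d", "a")])
print(query_dependent_rank(g, ["b", "d"]))   # ['b', 'd']: 'b' has two in-links in the neighborhood
```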
Let me just give you a sense of how this goes. In this chart, the Y axis is a standard measure of relevance that’s accepted in the industry, and if you start on the right-hand side, you see the very low bar, which is of course what happens when you present people with the results in random order. That’s not very good. The next one over is the static PageRank, and it has a certain value. Then you see everything to the left of that, the various algorithms for trying to incorporate query-specific information into the results. And you can see that some of them do a little bit better than PageRank, but by the time you get over to the leftmost column, it’s actually going up by a factor of two. This is a huge improvement. It may not seem like a lot, but people kill for a few percent in this area. So something that actually doubles the relevance is extremely attractive. And what that tells us is that this query-dependent ranking really does work, and now you can afford to go and invest in building infrastructure that will actually support that on a production scale.
When you visit the show next door, you’ll find a project entitled PINQ, P-I-N-Q, Privacy Integrated Queries, and that’s a very interesting and I think very significant piece of work in the area of ensuring privacy of individual information in this modern world where our data is spread around online. And what I want to do is just tee up a little bit of the background for that, and I encourage you to go and visit the booth to learn more about it. This issue of privacy is one that’s gotten a lot of attention, but it hasn’t gotten a lot of what I would call solid technical work. And we’re making some significant efforts to try to put a technical foundation under the notion of privacy. The problem with most systems that exist today is that the guarantees they offer for the privacy of individual data are too weak. They are, for example, tied to attacks that have already been seen. So it’s sort of, well, we closed the door now that that horse is out of the barn, but we don’t really have a good way of saying how we might deal with future attacks; or they only work for certain classes of data. That’s not so good if you’re a data provider, for example someone who hosts medical data, and you need a bullet-proof guarantee that your customers are going to be able to get the information you’re entitled to provide them, and not get information that you’re not entitled to provide them under, for example, the HIPAA regulations.
So the idea here in the work that we’re doing is to start over with a rigorous definition, rigorous in the mathematical sense, from first principles, that’s general enough to encompass not just what we’ve seen in the privacy world today, but what we can plausibly anticipate will happen in the future. I’m going to give you the intuitive idea behind this rather than the mathematics. Imagine that there’s a database containing a pile of personal records, and you walk up to this database and you’re trying to consider whether to make your data available to it, but you’re concerned about privacy issues.
So the definition that we’d like the database to offer in the sense of preserving privacy is the following: Imagine that a query is thrown at the database and some result comes back, if the particular result that came back is equally likely whether your data has been added to the database or not, then in a sense that result doesn’t really reveal anything about you. And if the database could make that assurance to you, we think that would go a long way towards satisfying the desire for privacy. Notice that this definition doesn’t really say anything about the nature of the query, what kinds of queries are permitted, or how the data that you’re providing might be organized. So it’s a very general definition.
Nevertheless, it’s a definition that we’ve been able to work through and produce some very practical, specific examples. The PINQ demo will show you one of them, I won’t elaborate on that here, but in effect there’s been something of a mini-industry now in the creation of privacy preserving variants of standard queries that you put at a database. You ask for the average salary of a bunch of people, for example, there is a privacy-preserving version of it. Those kinds of things can be developed pretty straightforwardly now. So we think that this is a first step in what will become a very important area for the modern world.
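For a flavor of what such a privacy-preserving variant looks like, here is a minimal sketch of the standard differential-privacy recipe that this line of work builds on: add Laplace noise calibrated to how much any one person’s record can move the answer, so the probability of any particular result changes by at most a factor of e^epsilon whether or not your data is included. This is a generic illustration, not PINQ’s actual API, and the salary numbers are hypothetical.

```python
import random

def laplace(scale):
    """Sample Laplace noise with the given scale (difference of two exponentials)."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(records, predicate, epsilon):
    """Differentially private count. Adding or removing one person's record changes
    the true count by at most 1, so Laplace noise of scale 1/epsilon ensures any
    particular answer is almost equally likely with or without that record."""
    return sum(1 for r in records if predicate(r)) + laplace(1 / epsilon)

def noisy_average(values, lower, upper, epsilon):
    """Differentially private average, assuming the number of records is public.
    Clamping each value to [lower, upper] bounds how much one record can move the
    average, and the noise scale is that bound divided by epsilon."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    return sum(clamped) / n + laplace((upper - lower) / (n * epsilon))

salaries = [52000, 61000, 48000, 75000, 59000]
print(noisy_count(salaries, lambda s: s > 50000, epsilon=0.5))   # about 4, plus noise
print(noisy_average(salaries, 0, 200000, epsilon=0.5))           # about 59000, plus noise
```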
Now I’m going to shift gears entirely again. Sorry for the lightning transitions here. In the area that we label as system architecture, we’re doing some work related to solving the Boolean satisfiability problem. This is another classic problem in computer science. It shows up in many, many places, and I’ve indicated a few of them here, such as circuit verification. And the problem, which I’m not going to explain here, is one that is known to be NP-hard, meaning that there are no efficient general solutions to it, but in special cases, and it turns out special cases arise a lot in practice, you can do a lot better. So for several decades people have been working on algorithms to solve this problem efficiently.
The work that we’ve done most recently looks at an old technique, which is to build some special-purpose hardware and couple it with general-purpose software as a way of getting a performance enhancement. And the particular way that we’re doing this is with a custom FPGA arrangement hooked onto a standard CPU. The architecture of this solution is a little unusual. People normally do this by saying, okay, I have a particular problem, I want to have hardware assist, so I’m going to run that problem through a special kind of compiler that’s going to create the code that goes into my special FPGA, and then I’m going to be able to solve the problem fast. That process is a slow one. Instead, what we’ve done is to build an architecture where you can solve SAT problems and do this compilation process that I talked about once per application. In other words, if you’re doing circuit verification, you do the compilation once, and now you have an engine that will deal with any instance of a circuit verification problem.
The result of that is that you can load problems into it very quickly, and solve them very quickly. And that’s important, maybe not so much for circuit verification but, for example, in financial markets, which is another place where this gets used a great deal. So you have a very rapid instant set-up, the solution enables you to solve problems that are an order of magnitude bigger than the ones that standard software solvers can do, and do it five to 20 times faster, a very significant acceleration.
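For readers who haven’t met the problem itself, here is what a tiny satisfiability instance looks like, with a brute-force check over all assignments. Real solvers, and the hardware-assisted one described above, are vastly more sophisticated, but the problem statement really is this simple; the instance below is invented for illustration.

```python
from itertools import product

# A tiny instance in conjunctive normal form: each clause is a list of literals,
# where 1 means "x1 is true" and -1 means "x1 is false". This one encodes
# (x1 or x2) and (not x1 or x3) and (not x2 or not x3).
clauses = [[1, 2], [-1, 3], [-2, -3]]
num_vars = 3

def satisfies(assignment, clauses):
    """assignment[i] is the truth value of variable i+1; every clause needs one true literal."""
    return all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses)

# Brute force over all 2^n assignments: fine for three variables, hopeless in general,
# which is why decades of solver engineering (and hardware assists) are worthwhile.
solutions = [a for a in product([False, True], repeat=num_vars) if satisfies(a, clauses)]
print(solutions[0])   # (False, True, False) satisfies all three clauses
```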
Moving into an area that would be perhaps more traditionally associated with distributed systems, but in this case a tool for dealing with distributed systems: if you’ve ever had the occasion to try to work with or build a distributed system, you know that lots of unexpected things happen. Those of us who live with them every day know that lots of unexpected things happen. And how do you do the quality control, as a purveyor of such a system, how do you do the quality control to be able to anticipate the various bad things that can happen, and make sure that your system deals with them? This is a tool that helps you do that. And the idea is this: you have your system implemented on a variety of computers that communicate with each other in some way or other, via various protocols, and this tool goes in and intercepts, on each of those computers, the place where they call through on a core interface, the Windows API in this case. Then there’s a controlling machine that basically intercepts all of those requests and drives unexpected events back. So it causes packets to be dropped, or it causes wrong answers to happen, or simply simulates a crash, or whatever, and then it observes what happens to the system in the process of doing this, and it explores those possibilities in a systematic way. There’s obviously a very large state space here, and we use model checking as the technique for exploring that state space, along with some guidance from the programmer about what the high-level properties of the system are supposed to be, and then we check those results to see whether the right thing is done.
Now, this doesn’t give you a proof of correctness, model checking can never do that, but what it can do is drive the system into states that you might have a very difficult time generating by yourself with a normal testing regime. And the result of that is that you get a much more robust system, or at least a sense of a much more robust system, as a result of having done that state space exploration. And I’ve noted a couple of the results here. This, I think, has been deployed and found bugs in systems that have been in production use for a couple of years with lots of machines. So it is, in fact, capable of finding latent bugs in situations that don’t happen very often at all, but we want to make sure that we get those things out so that they don’t happen, by Murphy’s Law, at the worst possible time.
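A heavily simplified sketch of the underlying idea: intercept the points where a system touches the outside world and systematically inject every combination of failures, checking a programmer-supplied property each time, instead of waiting for the rare schedule to occur on its own. The toy client and property below are invented for illustration; the real tool intercepts the Windows API and uses model checking to manage a far larger state space of crashes, drops, and reorderings.

```python
from itertools import product

def run_client(drops):
    """Toy client: it sends a request and retries once on timeout. `drops` says,
    for each of the two sends, whether the 'network' drops that message."""
    for dropped in drops:
        if not dropped:
            return "got reply"
    return "gave up"

def property_holds(outcome):
    # Programmer-supplied high-level property: the client always ends up with a reply.
    # (Deliberately too strong, which is exactly the kind of gap the exploration exposes.)
    return outcome == "got reply"

# Systematically inject every combination of message drops instead of waiting for the
# unlucky schedule to occur in production.
for drops in product([False, True], repeat=2):
    outcome = run_client(drops)
    if not property_holds(outcome):
        print(f"violation found under fault schedule {drops}: {outcome}")
# prints: violation found under fault schedule (True, True): gave up
```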
And I want to wrap up this section by talking about one more topic in the systems area, and it’s a problem that probably most of you have experienced in one way or another. You sit down at your computer, you try to use some service or other, let’s say it’s e-mail, and whatever the operation is that you tried fails. And the question now is, why? What’s broken? In the modern world, as Leslie Lamport famously said, “A distributed system is one in which a computer whose name you don’t even know can prevent you from getting your work done.” And that’s exactly what’s happened here. You don’t know which computer has failed, you don’t know what service has failed, but something is preventing you from getting your work done.
So the idea of the Constellation project is to try to automate the solution to that. The manual solution is pretty unpleasant. It involves typically getting down into the detailed packet logs, which is something that’s really only for experts, and even then it’s pretty puzzling sometimes to sort out what’s going on.
So the idea of Constellation is to say let’s see if we can automate that, and the idea is to use machine learning as a way to observe what’s going on in the network, and to build up a set of dependency inferences about what depends on what from a service point of view. That’s something that happens over time. You collect this information, and then when something actually goes wrong, you’re in a position to go back through that information, again through the Constellation system, and say what’s likely broken now.
So if I can give you a sense of this, and you don’t have to read this: on the left here are all the packet logs, and Constellation is processing those and building up a graph, which you see on the right, that represents the dependencies among various computers and the services that they operate and offer. And then when you’re sitting at your computer, and your e-mail fails, you say to Constellation, why did this fail? It can go out and, using that graph of dependencies, do probes to particular computers, bypassing all the ones that it believes are not relevant based on this dependency information, analyze what’s working and what’s not, and give you a nice little report that says, well, your e-mail failed because the e-mail server tried to talk to this folder server, and the folder server tried to talk to a domain controller and it’s broken; something that would have taken you forever to figure out by hand if you’re an amateur, and a long time if you’re an expert. So our idea is that this system can really be a very helpful diagnosis tool for anyone running a large network, which these days means just about every company.
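Here is a rough sketch of just the diagnosis step, under strong simplifying assumptions: the dependency graph is taken as given (in the real system it is learned from observed traffic), probing is a stubbed-out health check, and the service names are hypothetical. The point is only to show how a dependency graph lets you probe a handful of relevant machines instead of everything on the network.

```python
# Hypothetical dependency graph; in the real system this is inferred over time
# from observed network traffic rather than written down by hand.
deps = {
    "email-client": ["email-server"],
    "email-server": ["folder-server", "dns"],
    "folder-server": ["domain-controller"],
    "dns": [],
    "domain-controller": [],
}

def probe(service, down=frozenset({"domain-controller"})):
    """Stand-in for an active health probe; here only the domain controller is broken."""
    return service not in down

def diagnose(failing_service, deps):
    """Follow the dependency graph from the failing service, probing only the services
    it could plausibly depend on, and report unhealthy services with no unhealthy
    dependencies of their own: the likely root causes."""
    root_causes, stack, seen = [], [failing_service], set()
    while stack:
        svc = stack.pop()
        if svc in seen:
            continue
        seen.add(svc)
        unhealthy_deps = [d for d in deps.get(svc, []) if not probe(d)]
        if not probe(svc) and not unhealthy_deps:
            root_causes.append(svc)
        stack.extend(deps.get(svc, []))
    return root_causes

print(diagnose("email-client", deps))   # ['domain-controller']
```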
In this kind of whirlwind tour of the research, I hope I’ve given you a little sense of the work that we’re doing. I want to close with just one little preview of something that you’re going to see on the floor of the show. The WorldWide Telescope was a project that grew out of the work that Jim Gray and his team did in collecting data from astronomers and diverse sources, and making it available in a distributed fashion to a community of astronomers and the interested public around the world. And this is a project that was carried out in our Redmond lab, and with other folks as well, and it has now gone live, last week; perhaps many of you saw the announcement. This is a system that you can acquire yourself, it’s a free download. You’ll also get to see a demo there, next door. But I’m going to invite Dinoj up to tell you a little bit more about it, and to give you a little preview of that.
DINOJ SURENDRAN: Okay. I’m just going to get something I’m running at the moment. The WorldWide Telescope is a project which happened at the Next Media Lab at Microsoft Research in Redmond, and I work with Curtis Wong and Jonathan Fay there, and let’s switch this thing here, all right, brilliant. We’ve started that there. Let me bring up a quick tour. The WorldWide Telescope really has two parts. On the one side, it’s a portal, a viewpoint, a telescope, if you will, pointing to the vast amounts of astronomical data and imagery that hundreds and thousands of astronomers have worked very hard over decades to collect. And the data is already available, but it’s often very, very hard to get. So we’re providing a very intuitive interface to actually get to all that. So that’s the first part. The second aspect of the WorldWide Telescope is the fact that it’s a storytelling environment. So you can create stories and slide-based tours about your favorite objects, and then share these with your friends. I’ll show you a couple of examples.
But let’s get back to the first point. So what you’re looking at right now is the Digitized Sky Survey, which is a huge image of the sky. And when I say huge, I mean huge, we’re talking about a million-pixel by a million-pixel image. So it’s a couple of terabytes, and you can view this on your machine quite easily. And the point is that you never actually have to download terabytes of information. You only download a very small fraction of it. So we’ve talked about information being available in multiple wavelengths. Let me give you an example. Let’s go to search and search for some object, let’s say the Crab Nebula. Okay, here we already are. So it goes zooming off into space looking for the Crab Nebula, and there we are. So there we have a picture of the Crab Nebula, and we can cross-fade with the background imagery. And you see in the bottom here we have lots of other images. So what appears in the bottom is a list of objects which appear in that field of view, and it appears with thumbnails so you can see things very easily. Many objects have different names. If we go over here, for example, that’s what we’re looking at, we have other images from various places, and if we look over here at the properties, this is actually an image from the Chandra Web site, so it goes and lets you see, and now you can see we get information on this. So that’s one aspect of the question of viewing an object in multiple wavelengths.
The second aspect is something that we call surveys. A survey is, well, a survey really; it’s a map of the entire sky, or a great deal of it, and we have several surveys. So here you see about 50, and this is what we’re starting off with. And these are in several wavelengths. So this is the default, what we start with, this is the radio, microwave, infrared, and one of my favorites is X-ray. This is the Crab Nebula, and we can cross-fade with the original image, the visible and X-ray, we can cross-fade like that. There seems to be some object here which looks rather curious, and so if we right-click on it and move this little finder scope, it tells us that the object is called IC 443, which means Index Catalogue Entry Number 443, which is not exactly intuitive. So we go over to Research, and let’s see whether Wikipedia has an entry on it. Yes, it does. So now this is useful; there is a ton of information on the Internet, and with modern technology we can actually go and link to all this stuff. So let’s close that.
We should point out that there are other ways of getting information. SIMBAD and SEDS are well-known astronomical catalogues that astronomers use, and they have more technical data. So that is one aspect of it. So that should cover the first point, about this being a portal into the sky at different wavelengths.
Okay, now let’s go back to the second point about WorldWide Telescope. So we have a bunch of tours. These are all available, made by various people, and the point is, you can make your own. So here’s a tour that I made yesterday, and here we have this. So it’s going to run. What we’re looking at here is data from Mars taken by the Mars Global Surveyor. So here we’re lucky enough to actually have elevation data, so we can zoom around and see stuff. We’re looking at Olympus Mons, which is the tallest mountain in the solar system, and we can keep zooming off into some other part of the planet.
Now we are going to switch datasets. So instead of looking at the planet, this is a panorama; it displays panoramas. The NASA JPL rovers put out several, and this is just one of them, and we’re busy working on getting many more in here. Now, let’s hold this for a moment. All of these things are slides, and for each slide, this is how you make a tour, you specify the starting point, the ending point, and what you’re looking at, and then it does the interpolation by itself, and you can add music and speech to make a tour which you can then send to other people.
So we anticipate many people, thousands of people, building tours and sharing them among themselves, and building up communities of things to share. I didn’t get any, but we have quite a few, actually, starting up. So let me stop here, and say that we actually have a lot more; I’ve just touched the surface here. There’s a lot of other imagery, like the Mandelbrot set, or looking at eclipses. You can see where the moon was in Nairobi yesterday, and so on. And all of this is available at the booth, so do come along and have a look. And when you go home, go to www.worldwidetelescope.org, and download it and try it for yourself. Notice the word “.org”: this is free software. It’s dedicated to Jim Gray, who tragically disappeared last year, and I think he would have really liked to see this. All right. I shall stop here. Thank you very much. (Applause.)