Tony Hey: OSCON 2009

Remarks by Tony Hey, Corporate Vice President for External Research for Microsoft, at OSCON 2009
San Jose, Calif.
July 23, 2009

TONY HEY: Thanks. It’s great to be here. Well, I’m a scientist and I feel slightly out of context in this audience, but I’ll tell you what we’re doing in the collaborations with the universities, and what I think the open source community could do to help the scientists.

So, this is just by way of introduction: I used to be a serious academic before I joined Microsoft. I’d just like to pick out a couple of things on here. One is the thing at the bottom with MPI. With Jack Dongarra and a couple of others, I wrote the first draft of MPI. I came for the session on supercomputing, and one of the key things that made the Message Passing Interface a success was the fact that Rusty Lusk and Bill Gropp wrote an open source version of MPI as the standard was being developed, and that was critical to the adoption of MPI. And I think that Microsoft continues to find that the open source version of MPI that we developed is actually included in Microsoft’s cluster. So, that was a nice surprise.

The other thing I would like to say is, I was the head of the computer science department, and I am of course on various funding agencies, and we fund a lot of projects in the UK, open source projects. One of the things that concerned me when I was running the UK e-Science program, which we would translate into American as cyberinfrastructure, was making sure that the software survived the end of the project. When the project closed, people dispersed, and the software that controlled stuff just stopped working.

So, I set up a thing called the Open Middleware Infrastructure Institute, which was to do all the sort of things that academics don’t like doing, like specifications, documentation and testing. And so I did have a lot of experiences trying to make open source work, and I did in the end come to the conclusion that we needed a mix of software.

So, why is science particularly interesting at the moment? Well, we’re in the midst of a revolution. We know all about experimental science, the Greeks and the Egyptians looking at the stars, and then we have theoretical science with Newton’s laws. And for some decades it’s been obvious that computational science is a separate methodology within science. We need to know about power algorithms … and so on.

But today, the sciences are being flooded by all sorts of data from satellite surveys, sensor networks, supercomputers and so on, huge amounts of data. They’re literally drowning in it. They need tools to manage it. So, I believe that there’s a great opportunity for IT companies, like Microsoft, IBM and others, and the open-source community, and the computer science community to actually help the scientists solve some of the problems that we’re facing.

So, science has to move from the data to information to knowledge. This is my organization. We have a number of themes that I won’t go into here, but we are working with scientists trying to help scientists solve problems in energy and the environment, and also in health and well-being, so we have projects. I will show you one in HIV.

So, what I’m going to tell you about is what my small team in the advanced research tools and services group does, and they don’t do this alone, they do it in collaboration with a community of scientists. We find out what tools they want. And it may surprise you that many scientists, biologists, chemists, environmental scientists, and so on, actually like using Microsoft tools like Word and Excel. So, how can we make them more useful for scientists?

So, the goal is quite simply to support the scientific process better than we had done before. What we’re trying to do is to give people choice, so they can use a Microsoft tool if they want, they can use an open-source tool, they can interoperate with Google, or IBM, Oracle, whoever, and they have the choice. And they’ll only use the tools that are the best at their job, and so what we’re trying to do is provide them with tools that help them do their science better.

In my time running the research council program in the UK, I saw many generations of scientists saddened as their grad students became the system support person for the research group. That’s a fine job, but that wasn’t what they started off doing. They wanted to do the science, but they ended up being system support.

So, the challenge we have then is to build tools and technologies for the scientific community. So, I realize it’s slightly different than the community here, but let me tell you what we’re trying to do. So, we’re trying to open source extensions to various platforms that we have in Microsoft that are used by the scientific community, to make them more useful. And I’ll go around the circle and talk briefly about these tools. And we’re releasing them as open source.

So, the first one I would like to mention is called Project Trident. It’s a workflow system, and what it’s intended to do is let the scientists take the data streaming in from sensor networks and do all the steps that we have to do to put it into a form that the scientists can use. And these workflows enable people to do in a few hours what used to take weeks. So, a particular project we’re working with is putting a sensor network on the ocean bed off the Pacific Northwest, which is a highly active earthquake zone, and you’ll have sensor networks looking at the activity on the ocean floor. The data is streaming in, and you’re using this workflow system named Project Trident. And you can see more details of that at our stand.

And I thought you’d like this one. You’ll notice on my slide what we’ve got at the bottom, which says this is Creative Commons: you can use this slide, and you can have it as long as you give attribution. So, we’ve created a plug-in for PowerPoint, and for all of Office, Word and Excel, so that you can actually share these slides with a Creative Commons license. So, that’s one of the plug-ins; you’ll see other plug-ins we have for things like medical equations and so on, which you can see in the booth.

So, the Creative Commons plug-in is I think welcomed by the academic community. This is a project based on SQL Server, and here what we’re trying to do is provide you with a mechanism for capturing the context and the semantic information that (inaudible) gave in his talk in the conference organized by O’Reilly. We have extra semantic information besides the presentation. So, you can store semantic data in this. And what we’re trying to do is implement emerging standards that are coming from the various communities. You may not have heard of things like OAI, ORE, and standards for repositories for academic papers. So we’re trying to make it easier for academics to use these standards and to help the academic standards become accepted in the community.

This is a visualization tool for Excel, just to show that we have plug-ins for most of the Microsoft platforms. And this is actually a way of doing network analysis of all sorts. And this is one of the visualizations that you can get. And here on the right you’ll see a picture of a library, what libraries used to look like; they will of course be very different in the future. This is working with the British Library, and the community in Europe and the U.S., to try and enable collaboration. So, combining, if you like, a document repository with Web 2.0 technology, and putting it in a framework that’s relevant to the scientists, and populated with services and projects. It’s being led by the British Library, and we’re working with them on that. And, again, you can see a demo of that outside.

OK. So, those are actually enhancing our products, taking the requirements that the researchers tell us they want and trying to make the products more useful. The code will be open source and you can do what you like with it. This is a research tool. This is from our machine learning researcher, David Heckerman, and he’s applying his machine learning technology to HIV-AIDS. He’s working with HIV-AIDS researchers around the world, and particularly with Mass General in Massachusetts, looking at the genetic pools, and studying the correlations of how the HIV virus mutates within the genetic pools. This is deep science which is actually very useful, and leads to the hope that we will be able to produce a vaccine at some point for HIV-AIDS.

So it’s small communities, but this is the sort of tool that they need. The jobs they do at the moment vary in scale between 10 to 20 hours. Some of the jobs require thousands, and so on. It uses lots and lots of compute power for doing this. And at the moment what we have asked the community to do is send their data sets to David Heckerman, and he will run them. So what we’ve been trying to do now is remove David Heckerman and port that to the cloud. So we now have that as an Azure service on the Microsoft Azure platform, and what this does is allow scientists to upload their data, do the analysis, and get the results back without intervention from David.

Then this was announced on Tuesday. It’s actually a plug-in for Moodle, and it enables you to get to your Outlook, to your Exchange and Messenger services, within the Moodle framework. Moodle is an open-source learning environment that most of you are familiar with, and this is a way of passing the value of these Microsoft technologies to the people who use them when they’re actually in the Moodle portal itself. So, this is actually for this community, a community release under a GPL v2 license. So, those are just some of the projects that we’re dealing with. You’ll see more of them. But the goal is to make it so that scientists and educators can use whatever tools they want. They have the choice, and if you want to add to and extend the open source plug-in you can do that; you have the code.

So where we are moving towards, for the researchers I talk to, is a data-intensive world where we’re going to be flooded with data that we’re going to have to actually analyze and mine with the help of computers. So, we need some (inaudible) information. So, you’ve heard of social networks with people; this is, if you like, a data net, and it’s what we need to help turn data to information to knowledge. And so you’ll store, along with the data, you’ll have to take that (inaudible), and this will enable you to actually go and find the information more efficiently than we can at the moment, and to search and visualize all this data.

Where would you store that data? Some of it will be stored up in the cloud, no doubt, various types of clouds. And some of it will be stored on your desktop. Scientists are actually very protective of their data. They don’t like to give their data. So I think initially people will keep their own data on their own servers and desktops, but sometimes they’ll keep a network in the cloud, and eventually they’ll keep it in the cloud where we’ll have the security and the guarantees that they need.

So, at the moment we see that the scientists will be doing their research and trying to solve the problems using some of these cloud services for computation, for storage, for identity, maybe for visualization. But you’ll also have some software on your desktop, some sort of analysis programs, the usual client programs that you want on your desktop. So, this vision of the future, of software on the client and services in the cloud, is I think going to affect not only the business community, but also the scientific community.

So, thank you very much. You can find out more about our tools on these sites, and you can download Project Trident and you can also go and see them in the booth.

Thanks very much.

(Applause.)