Joseph Sirosh: Build 2015

Remarks by Joseph Sirosh, corporate vice president, Information Management and Machine Learning, on April 30, 2015.

JOSEPH SIROSH:  Software is eating the world.  Marc Andreessen said that about four years ago.  But today the cloud is eating software.  Software is being delivered as services in the cloud, and most pieces of software are connected to the cloud.  Now the cloud is not only eating software, but it is eating data.  And when the cloud digests software and digests data, it can produce intelligence.

Let me tell you what I mean.  For most of human history, the world was analog.  Once upon a time, in the 1980s, analog data dominated.  Digital data was just being born.  And come the ’90s, you had the Internet, client-server computing, digital music, DVDs.  And digital data exploded.

But with the Internet, something very interesting happened: data started to have an IP address.  It became connected data.  It lived in datacenters, in the cloud, on PCs and devices, but it was starting to have an IP address.

And fast-forward to today, digital data dominates.  Analog data is going down.  And about half of all digital data is connected.  It lives in the cloud.  It’s flowing in from connected mobile devices, and the Internet of Things is just starting to emerge.  And there are about 10 zettabytes of it, that is 10 billion terabytes.

What about the future?  Look at 2020.  In 2020, we’re going to have 50 zettabytes, 50 billion terabytes of data.  The vast majority of that is in the cloud and flowing in from connected devices.  Analog has gone down almost to zero.  The cloud is where the data lives.  That’s where the planet’s data, intelligence, and information reside.

And there are profound implications for developers.  When I talk to developers about this, we tend to have about four conversations.  Developers want to look at historical data and look at it retrospectively.  It’s big data.  You want to analyze it.  You want to report on it.  But then there is real-time data, data that is flowing in over the wire.  And developers also want to analyze that as it is flowing through the wires and get insights from it.  Developers also want to look at historical data, learn patterns from it, and predict the future.  And they want to bring all of this together into intelligent SaaS apps.

To bring this to life, let me tell you a story.  My story is about the connected cow.  Yes, there is such a thing as connected cows.  It has all the buzzwords of today, the Internet of Things, analytics, the cloud.  It’s perhaps the sexiest story in the cloud.  But it’s also about human ingenuity and how this new technology is revolutionizing even some of the world’s oldest industries, such as dairy farming.

So these are the connected cows that I’m talking about.  Look at those pedometers on their legs.  They’re connected to Wi-Fi.  I have one of those in my hand, the cow pedometer.  It counts the steps of every cow on a dairy farm and sends that data to a service on Microsoft Azure.  It’s Fujitsu’s Gyuho service in the cloud; “gyuho” means “cow step.”

So do cows need to take 10,000 steps a day, too?

But before I give you the answer, let me tell you about the modern dairy farm.  These days every company is a data company, even the ones you least expect.  A dairy farm has all the constraints of a modern business.  It has a fixed herd.  It has pasture and labor, which are often very expensive.  Its output is milk and beef, and it has to optimize that output under all of these constraints.

So what can a farmer do?  It turns out he has two controls.  He can detect health issues in cattle early and prevent loss.  He can also improve cattle production by accurate detection of estrus.  Now if you remember high school biology, estrus is when an animal is ready to mate and goes into heat, when the time is right, when the magic is ready to happen, so to speak.

But how could you detect estrus in hundreds of cows on a dairy farm?

Well, before I tell you, let me tell you about the importance of doing this.  These days, with artificial insemination, the probability of conception is pretty good; it’s about 70 percent.  But detection of estrus has been done by age-old methods.  It’s just really close observation.  And in the best of cases you get about 55 percent accuracy, and then the pregnancy rate is about 39 percent.  But if you could move that detection rate up to 95 percent, look what happens: you get a pregnancy rate of about 67 percent, which is a big improvement.  It’s very material for the dairy farmer.

But this is hard.  Estrus lasts only 12 to 18 hours every 21 days, and it’s variable.  And, like many of these things, it occurs mostly between 10 p.m. and 8 a.m., when the farmer is taking his hard-earned rest.  So how can farmers tell when the time is right for hundreds of cows?  Could technology help?  Could you develop a heat map of the dairy, so to speak?  (Laughter, applause.)  Well, that’s a question that a farmer in Japan asked Fujitsu, our partner.  Fujitsu engineers consulted some researchers and came up with a very ingenious system, something that was indeed 95 percent accurate for the detection of estrus.  It’s literally the hottest system for detecting heat.

Now this is that service.  Data from the pedometers that I showed you is sent over the wire to a service in the cloud, on Microsoft Azure.  The service analyzes all of that data, detects from the footsteps of the cows when an animal goes into heat, and sends an alert to the farmer’s phone.  So the farmer knows which cow went into heat and exactly when.

Well, it turns out there’s a very simple secret to detecting when an animal goes into estrus.  Let me explain that to you with a graph.  The X axis is time of night in this graph.  The Y axis is the number of steps that cow is taking.  So this is a cow on a normal night, sleeping some of the time.  Let’s see what happens when the animal goes into heat.  Yes, the number of steps she takes goes up.  When an animal goes into heat she starts walking around furiously.  And that turns out to be about 95 percent accurate for the detection of estrus.  And the optimum time for artificial insemination, for maximum conception rate, is 16 hours from that.  And that’s when AI meets AI, artificial insemination that is.  (Laughter, applause.)
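
(For readers following along, here is a toy sketch in Python of the idea just described: flag a cow whose night-time step count jumps well above her own recent baseline, and suggest an insemination time roughly 16 hours after detection.  This is purely an illustration, not Fujitsu’s actual algorithm; the threshold and the numbers are invented.)

```python
# Toy illustration only: not Fujitsu's Gyuho algorithm.
# A cow in estrus walks far more than usual at night, so compare tonight's
# step count against the cow's own recent baseline.
from datetime import datetime, timedelta
from statistics import mean

def check_for_estrus(recent_nightly_steps, tonight_steps, detected_at, factor=2.0):
    """Return a suggested insemination time if tonight looks like estrus, else None."""
    baseline = mean(recent_nightly_steps)          # this cow's normal night-time activity
    if tonight_steps > factor * baseline:          # invented threshold
        return detected_at + timedelta(hours=16)   # optimum AI time per the talk
    return None

# Example: a cow that normally takes ~1,200 steps a night suddenly takes 4,800.
when = check_for_estrus([1150, 1230, 1180], 4800, datetime(2015, 4, 29, 23, 0))
print(when)  # 2015-04-30 15:00:00
```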

Now Fujitsu researchers found something else that was very amazing.  There’s a window around that optimum.  If you performed AI in the first four hours of that window, you would get a female; in the next four hours, you would get a male.  So the farmer now had an amazing control at his disposal.  He could choose the gender of the calf based on the needs of his dairy farm.  Not only that, Fujitsu found that by analyzing the step patterns of cows they could detect about eight different diseases.  And remember that 95 percent accuracy, with 5 percent false positives?  Even some of that turns out to be very important, because some of those false positives are when a cow jumps over the fence and escapes the farm.

Well, it’s an amazing application.  But it’s an application that today every one of you could build in a few days with the platform services that are available on Azure.  And you can scale that out globally, distribute it, and this would never have been possible before.  So that’s the power of intelligent apps that live in the cloud.

Now let’s look at the power of real-time analytics.  Data is always moving these days.  It’s constantly in motion.  Connected data coming from sensors, from websites, from apps in the cloud, from social apps, it’s streaming over the wire.  Developers want to easily process this data and get insights in real time.  So now let’s take a look at an example of a solution that does that on Azure.  With me is Corom, who is going to help me through a demo.  We’re going to create an app that’s going to be a fun, engaging demo, and you are going to actually help us create data in motion.

When we started this, we wanted to build a fun and intelligent application, something that would capture the attention of people worldwide.  So we went to the machine learning gallery, and today we have an exciting collection of intelligent APIs on that machine learning gallery.  It’s Gallery.AzureML.net.  All of you can go there.  And yesterday afternoon Harry Shum, the head of MSR, announced Project Oxford to bring some of the amazing intelligent APIs that have been created by Microsoft Research and Bing to developers like yourself.

There are face APIs.  There are image analysis APIs.  There are speech APIs.  So we took the face API.  It’s a really fun API: you submit a picture, it detects the faces in it, and it gives you an estimate of each person’s age and gender.  The URL for this demo is Howold.net.  Now I’d like everyone here to actually take a minute and go to this URL, especially everyone watching online.  Please navigate to this site and take a look.
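
(For readers following along, here is a minimal Python sketch of how such a face API call might look.  The endpoint, query parameters, and key below are assumptions modeled on the Project Oxford-era face API, not the exact code behind the demo, and the service’s endpoints have changed since 2015.)

```python
# Minimal sketch of calling a face-detection API like the one behind Howold.net.
# The URL and parameters are assumptions (Project Oxford era); replace them with
# the current Cognitive Services Face endpoint and your own key.
import requests

FACE_API_URL = "https://api.projectoxford.ai/face/v1.0/detect"   # assumed endpoint
SUBSCRIPTION_KEY = "<your-face-api-key>"                         # placeholder

def detect_faces(image_path):
    """Submit an image; get back detected faces with estimated age and gender."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = requests.post(
        FACE_API_URL,
        params={"returnFaceAttributes": "age,gender"},
        headers={
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    response.raise_for_status()
    return response.json()   # a list of faces, each with a rectangle and attributes

# for face in detect_faces("family.jpg"):
#     print(face["faceAttributes"]["age"], face["faceAttributes"]["gender"])
```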

So let’s take a couple of pictures.  Let’s look at that happy family.  Let’s submit that.  And you will see it identified the faces in the image, the grandparents, the parents, the children.  It looks like it got them pretty much right.  So let’s try another picture.  Let’s take, say, da Vinci’s Mona Lisa, a painting.  We’re using the Bing search API to search Bing, get images from online, and hopefully we can bring Mona Lisa’s image up.  So let’s submit that photo, and she’s 23.  Surprisingly, when that model posed for Leonardo she was actually about 23.

So now let’s all try your pictures.  Take a picture from your mobile phone, from a PC, and especially if you’re watching online, feel free to take a selfie and send it as well.  Let’s see how it performs.  So Corom, maybe you can take one of our pictures from the desktop.  So that’s my family picture just uploaded, and it’s analyzing that.  We can actually see how it’s doing on our dashboard as well.

Yes, the upload is taking a while, Corom.  So let’s go look at the dashboard and see what people are doing online.  It’s amazing; this is the Power BI dashboard.  John actually showed some of it earlier.  People from all over the world now seem to be coming to Howold.net, and that map shows the distribution of people.  What’s different about this dashboard, by the way, is that you’re seeing real-time flowing data being analyzed, which is pretty amazing.  Can you see that graph that just spiked up?  That’s all of you bringing data in motion.

Now let’s peek behind the curtain a little bit and see how we built this streaming analytics and the dashboard.  So the data from the Howold.net website is captured as a JSON file.  It includes the face data, the age data, the browser, the location the IP address is coming from, a collection of things.  The JSON is sent to Azure Event Hubs.  Azure Event Hubs is a fully managed service in the cloud to ingest data at millions of events per second.  And then we use Azure Stream Analytics.  Azure Stream Analytics is a fully managed service for complex event processing.  It lets you analyze streaming data, put complex queries on it, and get aggregates from that.  And then we take those aggregates and display them in a dashboard.
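
(For readers following along, here is a sketch in Python of the ingestion step just described: package each result as JSON and send it to Azure Event Hubs.  It uses today’s azure-eventhub Python SDK, which postdates this demo; the field names, hub name, and connection string are assumptions.)

```python
# Sketch of sending one Howold.net-style event to Azure Event Hubs as JSON.
# Uses the current azure-eventhub SDK; the names and fields are assumptions.
import json
from azure.eventhub import EventHubProducerClient, EventData

CONNECTION_STR = "<event-hubs-connection-string>"   # placeholder
EVENT_HUB_NAME = "howold-events"                    # assumed hub name

def send_face_event(age, gender, browser, client_ip):
    event = {
        "age": age,
        "gender": gender,
        "browser": browser,
        "clientIp": client_ip,   # resolved to a location downstream
    }
    producer = EventHubProducerClient.from_connection_string(
        CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        batch.add(EventData(json.dumps(event)))
        producer.send_batch(batch)

# send_face_event(23, "Female", "Edge", "203.0.113.7")
```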

So now let’s take an example: that trending graph on the dashboard.  It shows two lines.  The blue line shows the count of females in the faces submitted in the pictures, and the magenta line shows the number of males.  And all the developers here: if you have tried to construct a trending graph like this over streaming data, you know how hard it is.  It takes hundreds of lines of code, if not more, to set something like this up.

Let’s now go to the Azure Stream Analytics query and show you how simple it is.  So Azure Stream Analytics makes it really simple to configure such analytics using just about eight lines of SQL-like code.  It’s just a select statement.  It’s selecting the system’s time stamp as out time, the face.gender as gender, and the count of gender from the streaming input, and then aggregating it over a 10-second window.  And that’s all there is to it.  And that’s the data that is now being displayed in the Power BI dashboard.
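
(For readers following along, here is a reconstruction of the kind of query being described, held in a small Python snippet so all the examples in this piece stay in one language.  The query text itself is in Stream Analytics’ SQL-like dialect and would be pasted into the query editor in the Azure portal; the input and output aliases, the tumbling window, and the exact column names are assumptions based on the description above.)

```python
# A hedged reconstruction of the Stream Analytics query described in the talk.
# Paste the query text into the Stream Analytics query editor in the portal;
# the input/output aliases and the 10-second tumbling window are assumptions.
STREAM_ANALYTICS_QUERY = """
SELECT
    System.Timestamp AS OutTime,
    face.gender      AS Gender,
    COUNT(*)         AS GenderCount
INTO [powerbi-output]
FROM [howold-input]
GROUP BY
    face.gender,
    TumblingWindow(second, 10)
"""

print(STREAM_ANALYTICS_QUERY)
```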

So this is the magic of fully managed services like Azure Stream Analytics in the cloud and Power BI: they let you analyze real-time flowing data and build intelligence around it into your applications.

So now let’s go from understanding the present to predicting the future.  So if we switch the slides: in February of this year, we launched a powerful service called Azure Machine Learning.  It allows you to learn patterns from historical data and predict the future, and to create APIs in the cloud that can be easily hooked into any application.

For example, Ziosk is using Azure Machine Learning to produce recommendations on Ziosk tablets at Chili’s restaurants.  We have eSmart Systems of Norway using Azure Machine Learning to predict future energy usage and control energy demand.  The Microsoft Band uses Azure Machine Learning for recommendations.  We have more traditional companies, such as Gaffey Healthcare, using Azure Machine Learning to estimate when an insurance claim will be paid, so they know how to act on that particular claim.  It’s a very powerful service.

So the best way for me to bring this to life is to tell you another story.  For those of you who are watching this from outside the U.S., there is an annual sports event here called March Madness.  It’s the NCAA Men’s Division I basketball tournament, played each spring with 68 collegiate teams to determine the national championship.  And we Americans find it a time to gamble.  We try to predict who will make it to the Final Four, and ultimately who will win it all.

It seems that everyone has a prediction for the winners: President Obama, our CEO Satya Nadella.  And Satya’s predictions got a lot of press, because they seemed to be extremely accurate.  Now, to be fair, Satya had help.  Satya had data scientists, and people who knew basketball, to help him make his predictions.

But I want to tell you the story of Adam Garland.  Adam is a developer in our Microsoft Office group.  Now, unlike Satya, Bing and Google, Adam didn’t have any experience using machine learning.  He’s just a developer like you and me.  And we had a hackathon going to predict March Madness using Azure Machine Learning.  Adam thought it would be a great way for him to learn machine learning.  So he came to the tool, he played around with the data, and within a few hours he was able to create a model to predict who would win a basketball match.

And then, after a few days, he got to be really good.  And not only Adam; many of the people who participated in that competition ended up predicting better than the automated algorithms that Google and Bing used, and were pretty competitive.

So these Azure ML tools are very easy to use for developers.  So I thought it might actually be fun to replicate what Adam did: to show you how Adam built that app, right in front of you, to publish an API live, and to show you a demo calling it with a basketball matchup.

So let’s see how easy it is to create an experiment.  To predict the outcome of a match, whether Team 1 or Team 2 will win, we typically need something called a binary classifier.  There are sample experiments in the Azure ML gallery.  So let’s take this particular one.

This is an experiment you’ll find in gallery.azureml.net.  So we just brought it into the studio, and this experiment is a workflow.  It’s actually really simple.  It reads and cleans the data.  You split the data into training and test data.  You create a model with that training data, and then you score the test data with that model and see how well it did.
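
(For readers following along, the same read, split, train, score, and evaluate workflow can be sketched in Python with scikit-learn.  This is an illustration of the steps, not the Azure ML Studio experiment itself; the file name, column names, and choice of classifier are assumptions.)

```python
# The read / clean / split / train / score / evaluate workflow, sketched with
# scikit-learn as a stand-in for the Azure ML Studio modules.  File and column
# names are assumptions; the classifier here is not the one in the sample.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("march_madness.csv").dropna()               # read and clean the data
y = data["Team1Wins"]                                          # label: did Team 1 win?
X = data.drop(columns=["Team1Wins"]).select_dtypes("number")   # numeric features, e.g. the seeds

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # split

model = RandomForestClassifier().fit(X_train, y_train)    # train on the training split
predictions = model.predict(X_test)                       # score the held-out test data
print("accuracy:", accuracy_score(y_test, predictions))   # evaluate the model
```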

So let’s actually run this and see how it performs.  We’re going to now connect the March Madness data set.  To get to Adam’s experiment, you take that sample experiment, swap in the March Madness data set, and now we actually go and run it.  So that experiment is running.

While that’s happening, let’s look at the data.  So you’ll see, this is data about 12 years of matches.  We have the seeds for each team, and the final column there, Team 1 wins, true or false, is what we’re trying to predict.

So let’s go back and see how the model ran.  So let’s look at the performance.  That graph actually shows what’s called an ROC curve, but the most interesting number is the accuracy.  You see that simple model performed at an accuracy of 66 percent, not bad for a minute or so of work.  You just put new data into that existing sample experiment and ran it.

So now let’s see what Adam did.  Adam did a couple of things.  Adam used an algorithm called Two-Class Decision Jungle.  That’s one of the algorithms in Azure Machine Learning.  He used something called Sweep Parameters to sweep over the parameters of that machine learning model and find the best ones.  He also found that bringing in team performance data would be very, very helpful.  So he took the historical data and married it with team stats.  Let’s look at that data.  When you look at it, you will find that he brought in data such as field goals made and rebounds.  He even found, for example, that team performance in away games was actually a really important predictor of match outcomes.  So let’s see how that model ran.
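
(For readers following along, here is a rough Python analogue of those two steps: join season team statistics onto each match as extra features, then sweep over model hyperparameters.  scikit-learn’s GridSearchCV stands in for the Sweep Parameters module and a random forest stands in for Two-Class Decision Jungle; the file and column names are assumptions.)

```python
# Rough analogue of Adam's two improvements: enrich matches with team stats,
# then sweep hyperparameters.  GridSearchCV stands in for Sweep Parameters and
# RandomForestClassifier for Two-Class Decision Jungle; names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

matches = pd.read_csv("march_madness.csv")
team_stats = pd.read_csv("team_stats.csv")   # field goals made, rebounds, away-game record, ...

enriched = (matches
            .merge(team_stats.add_prefix("Team1_"), left_on="Team1", right_on="Team1_Team")
            .merge(team_stats.add_prefix("Team2_"), left_on="Team2", right_on="Team2_Team"))

y = enriched["Team1Wins"]
X = enriched.drop(columns=["Team1Wins"]).select_dtypes("number")

sweep = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [4, 8, None]},
    scoring="accuracy",
    cv=5,
)
sweep.fit(X, y)
print("best accuracy:", sweep.best_score_, "with", sweep.best_params_)
```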

So you’ll see the performance went up dramatically.  It’s about 74 percent accurate.  And now that we have a model in Azure Machine Learning, we can create an API with it, a cloud-hosted API.  Let’s do that now.  To do that we create a scoring experiment.  This is a workflow that will run inside of a Web service call.  And here’s the magic: it automatically identifies what the predictive model is, identifies the Web service input and Web service output, and we are now ready to publish this as an API.  We can run it, and when you publish it you will get an API that is live in the cloud.

So let’s see that API now.  Well there, this is the page you get when you run and publish it.  You see that API key at the top, that’s the API key you will use in your REST calls.  We can test this.  We can test this API live in the cloud.  So now let’s take that historic match between Duke and Wisconsin that just happened and see what the API predicts.  So there you go.  The API predicts Duke will win over Wisconsin with a probability of 63.2 percent.  And that’s exactly what happened.

Now you can predict the future and build this intelligence into your cloud apps.  It’s very simple.  There is C# code and Python code that you can use to just call that API.
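
(For readers following along, here is a minimal Python sketch of calling such a published API with the requests library.  The URL and payload layout follow the classic Azure ML request-response convention of the time; the workspace and service IDs, the column names, and the key are placeholders and assumptions.)

```python
# Minimal sketch of calling the published prediction API.  The endpoint format
# and input schema are assumptions modeled on classic Azure ML web services.
import requests

API_URL = ("https://ussouthcentral.services.azureml.net/workspaces/<workspace-id>"
           "/services/<service-id>/execute?api-version=2.0")
API_KEY = "<your-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Team1", "Team2", "Team1Seed", "Team2Seed"],  # assumed schema
            "Values": [["Duke", "Wisconsin", "1", "1"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer " + API_KEY,
             "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()
print(response.json())   # contains the predicted winner and the win probability
```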

So now let me move from the demonstration to something else that’s pretty exciting; it’s about the language of data.  Now machine learning is a field of advanced statistics.  And over the last 20 years, a revolution has happened here with the development of open source R.  Statisticians all over the world have started contributing their latest innovations and research as packages in R.  If there is one language that is truly about data, something that puts the power of data at the fingertips of every developer, it is R.

In IEEE Spectrum’s ranking of programming language popularity in 2014, R placed ninth, just after Ruby.  That’s an amazing feat for a domain-specific language.  So if there’s a single language that you choose to learn today to tap into the power of data in the cloud, that would be R.  And that’s why Microsoft acquired Revolution Analytics, the company that made R scale to big data and brought open source R to the enterprise, making it enterprise grade.

For example, American Century Investments, one of the largest mutual fund companies in the world, uses it to build their quantitative investment platform.  DataSong, a local company, a big data marketing company in San Francisco, uses it along with Hadoop for their marketing performance management and attribution platform.

To bring this story of R and big data to life, let me tell you another story.  You know, we are all big data.  There are billions of letters of genomic data in every one of our cells.  It turns out that each human genome is about 2 gigabytes of information and with the power of R we can analyze all of this data.  And when we plot this set of information against the general population we can actually see some incredible things about ourselves.

So the first human genome was sequenced in the year 2000.  It cost about $200 million and took over 10 years of work.  Since then the cost of mapping the genome has come down much faster than Moore’s Law.  It’s about $1,000 to get your entire genome sequenced, and you can get the most important parts for about $100.  And this is driving an untold revolution.  In the future, when you go to a doctor, the first thing they will do is map your genome to understand your disease risk.  This will let them treat you based upon your individual genome: personalized medicine.

You will even know the risks you should anticipate in the future and change your lifestyle because of it.  So let’s show that to you with a demo.  Joining me is Mario, who worked at Revolution Analytics.  So genomes are analyzed using the language of R.  It has one of the richest libraries for analyzing genomic data.  A thousand people worldwide have contributed their genomes to scientific research.  It is public data called the 1,000 Genomes dataset.  It’s about 2 gigabytes of data per person.  We’re going to analyze all of that data and see their disease risks.  And Mario is going to actually walk through the demo.

Now it’s going to take only a little bit of code, but it’s a tremendous amount of computation.  So it really needs big Hadoop clusters; let’s show the Hadoop clusters we’re actually using on the slides.  We’re going to use about eight HDInsight clusters in four Azure datacenters, two in the U.S., one in Europe, and one in Southeast Asia, for a total of about 1,600 cores in the calculation.

By the way, they are Linux clusters.  We’re actually running Revolution R on them, and Revolution R is going to coordinate the computation across those 1,600 cores in all of these regions across the world.  They’re actually going to take computation to where the data resides when we do that global computation.  So now let’s see the code.

So, only a few lines of R code: this is the code that’s actually used to analyze the genomic data.  It leverages some extremely powerful functions in an R package called Bioconductor.  And let’s also see the code that coordinates the computation.  This is a piece of the code that coordinates computation across the eight HDInsight clusters in the four datacenters.  Let’s kick that run off, Mario.  And let’s see what it produces.

It produces a heat map.  This is a heat map of disease risks.  Each row is an individual.  We have fictitious names on the right.  And each column is a disease.  So let’s take, for example, the second row from the top, Clair Hickman.  Clair Hickman has a high risk of ulcerative colitis.  And let’s take, say, the third from the bottom, Sophie Albrecht.  It turns out Sophie has a high risk of multiple sclerosis.  By the way, this heat map is an amazing visualization produced with an R package called Shiny, and it lets you zoom in and explore these things.  It’s very, very powerful as well.

So now you’ve seen the disease risk of a population, the population we analyzed on the Hadoop clusters, and it all happens in a few minutes.  And it’s pretty amazing, churning through all of that data.  But what if we could take that R code and publish it as a Web service API?  You saw me publish a Web service API with Azure Machine Learning earlier.  What if I could do the same and create an API that lets you submit data and see your disease risks?

So let’s see the scoring experiment.  This is a scoring experiment in Azure ML.  See that Execute R Script module?  That’s where we can actually copy the R code and publish it as an API.  So let’s do that now.  Mario is going to copy some of the R code for scoring a person’s genomic data, put that into the Execute R Script module, and run that experiment.  By the way, as it’s running, Execute R Script shows you how easy it is to operationalize R.  Any R code can be published as an API in the cloud.  And Azure Machine Learning lets you do the same thing, even with Python.  It is the simplest, easiest way to operationalize your analytics code as APIs in the cloud.

So now let’s see, let’s publish the Web service and let’s get that API in the cloud.  Here, that’s the API page.  You got the key.  There’s even sample code to call that API.  But, to test this we actually created — yes, there’s the sample code.  If you click on it you’ll see C#, Python and R code are automatically generated.

So now to call this API and show you the results we created a mobile app.  I’ll tell you about that in a minute.  Now early in April I sent a sample of my saliva to 23andMe.com to sequence my genome.  Last Friday I got the results from that.  And so for the first time I was able to analyze my DNA and see my future.  So let’s do that now.

So let’s take the mobile app, can you all see the mobile app?  So I’m going to log into 23andMe, get the data from that website, it’s uploading the genomic profile.  So you will see a graph that shows the risks.  So the blue bar is the population risk and red is mine.  So you’ll see that I have low risk for gallstones, low risk for asthma, certainly low risk for breast cancer.  But let’s look here at the top.  Look at that very top line.  That red bar shows .36 — I have about 2-1/2 times the population risk for prostate cancer.  And I just discovered that.  That’s my future predicted from my genomic data, R, and big data analysis.  That’s the power of big data and advanced analytics.

So we switch back to the slide.  Big data and advanced analytics can change lives.

(Applause.)

My next story is about the connected grid.  For a century the management of power has been done by analog systems, just like the media we were consuming was analog.  But now users are discovering the power of big data, analytics, and the cloud.  And the hero of this story is a small startup from Norway, born in a small town of 30,000 people, and they are reinventing the future of energy management.  And they’re going global with it.  They’re reinventing the grid with the cloud, creating great efficiencies through data and advanced analytics.

To tell you that story, let me welcome Erik Asberg, the head of development of eSmart Systems to the stage.

Erik, welcome.

(Applause.)

ERIK ASBERG:  Thank you, Joseph.

So let me tell you about eSmart Systems.  We accelerate energy systems, we optimize energy investments, and we minimize carbon footprints through better, faster energy decisions.  Like many of you, we are in the business of disruption.  We’re taking an industry that has been operating in the same way for over 100 years, and we use technology to radically rethink the way we approach it.  Our disruption is directed at the business of energy, and we started with a pilot project in Norway three years ago.

So when we started, our goals were to reduce power grid investments by increasing the utilization of existing grid capacity.  We needed detailed predictions of load and power generation.  And we needed to take advantage of IoT and big data, because smart meters and sensors are a prerequisite for our predictive analytics.

So the big question was, how were we going to go off and do this?  Were we going to spend millions of dollars and put thousands of personnel in datacenters?  For us, this was an incredibly easy decision.  We leveraged Azure from the very beginning.  And what makes this so amazing is that we made this journey of changing the business of energy without acquiring a single server.

So there’s a look at the eSmart Connected Grid application.  All the houses that you see here are equipped with smart meters and sensors.  And we use machine learning to identify potential congestions in the power grid.  We control consumer flexibility to avoid substation overloads.  And we use both short-term and long-term predictions to control heated floors and water heating through home automation.  eSmart and Azure Machine Learning use real data to learn for each specific area.  And we take this information and we make it available on your phone with consumer apps managed by Azure.

Every part of the power grid is unique.  There is no system today that is able to adapt to this uniqueness.  eSmart Connected Grid and Azure ML totally changed this reality.  This is truly revolutionary.

So we started two-and-a-half years ago with no employees and no revenue, and by the end of this year we’ll have over 40 employees and more than 50 million Norwegian kroner in revenue.  I couldn’t be happier to be partnering with Microsoft and Azure.  And I want to say a big thank you to Joseph for having us here today.

JOSEPH SIROSH:  Thank you, Erik.

(Applause.)

It’s an amazing story.

The cloud amplifies startups such as eSmart Systems.  It gives them the power to go global.  It gives them the power to build applications that revolutionize entire industries in a very short period of time.  In the future, all of the planet’s power generation, power grids, and power consumption will be optimized with real-time data and data analytics.

And so with the power of the cloud, even really small companies like eSmart Systems are able to make that dream come true at a speed that no one else can match.  That agility, by the way, that agility is a hallmark of the cloud, and that’s its magic.

You know, the cloud turns hardware into software; software into services, fully managed services; and data into intelligence.  It makes you smart and it sets you free.

Thank you.

(Applause.)

END