Speech Transcript – Jim Allchin, Windows 2000 Press Briefing

Remarks by Jim Allchin
Windows 2000 Press Briefing
November 15, 1999
Las Vegas, NV

MR. ALLCHIN: You’re looking at the complaints department for Windows. I get letters, whether it’s Windows 95, Windows 98, NT, and I spend a lot of time going through those letters and feeling bad, because while I do get a lot of nice letters, most of the ones that hit my desk are the ones where someone has had a bad experience. Two years ago we set out on a path to figure out what was real about our reliability and our compatibility, because at the same time that I was getting these letters, we knew that Dell.com was running it, NASDAQ was running it, the Chicago Board of Trade was running it, and so there were all these customers that were incredibly happy with the reliability that we were offering.

So we set out to figure out what we could do on reliability and what the real truth was. First thing we did is we asked a bunch of customers. And these are quotes from customers that we received. And the real question when you look at it is, what are we going to do about it, and what did we do about it? So we started a reliability initiative over two years ago. We put about 500 person-years into it. We spent about $160 million, both in terms of tools, because that’s a core thing that we learned, that we could do a lot through automation, as well as the people of the 500 that are listed here. We also did measurements. We couldn’t figure this out without going out to customers and auditing what was going on. So we grabbed all the logs off of their machines, about 5,000 servers we did this on, had terabytes of information, and we could measure exactly when the machine was up, when it was down, and then we could start to figure out why it went down. So we did targeted analysis of these 5,000 to figure out what was going on.

Then we started on a plan to address the issues. And I’m going to spend quite a bit of time in this little talk drilling into and sharing the facts that we found, and also sharing the things that we did about it in Windows 2000. In addition to the 5,000 servers that we went out and visited, we also went to our OEMs that handle support calls, primarily on the server, but also on the client, Compaq and HP and IBM, and we got their support data for what calls they were having. And then we made this plan and targeted improvements in Windows 2000. But in the short term we also picked up many of those bug fixes and started moving them back; the improvements in SP4 and SP5 are examples of us moving many of the fixes that we found, although a far cry from all of them, back into the currently supported NT. We also made an ongoing commitment that over the years this is something that we are going to be much more hard-core about than we’ve been in the past.

So the first line I’m showing here is an analysis over about nine months from one of these sites about the reboot causes. And as you can see, 65 percent are reboots that were caused in what we classify as a planned reboot, and 35 percent were unplanned. I’m going to walk through these and give you an idea of what these particular segments mean. Starting down at the bottom here, right here, preventative reboot. Well, people didn’t know, you know, how long you can run it before you should reboot. So they would pick a time in the middle of the night, and then they would reboot it, and there were 20 percent of the reboots that were caused because they thought it was just good hygiene for some reason. And then there was another part that was hardware install and configuration, they were changing something on the hardware. This part here, they were adding a service pack, they were adding the option pack, changing some other characteristic of the operating system, updating the operating system in some way. These up here, the 7 percent, OS configuration, those are areas that caused a reboot because they changed something in the TCP/IP stack, or they changed something in multimedia or in that area. Application install and configuration, these are when the little apps come up and say you must restart your computer right now. Well, it was particularly bad when those were apps on the server telling you to restart, but that’s what they were saying.

Over on the other side, the 35 percent of unplanned, I’m going to drill into this in the next slide of system failures, but that’s the core NT part. But this apps area was also very interesting. In our own shop, in what we call our MIS group, ITG, they were rebooting their SAP machines because there would be an SAP failure of one of the particular processes. And SAP spawns many different processes. And the simplest thing to do was to reboot the box, because they didn’t have an easy way to kill off all these processes that existed. So that’s an example. And we found this at many of the sites that we visited. In terms of drilling into that 14 percent that I just talked about, these are actual blue screens; that’s when we talk about the term blue screen, UNIX calls them panics. Well, this is our blue screen data for NT 4 across, again, a nine-month period. This was done as a sample of PSS calls. And as you can see, there are 43 percent here with core NT, and you can see this wide area here, a huge part, was caused by device drivers. Some we wrote, some third parties wrote, and anti-virus was such a large part of it we decided to separate it and make it very visible in this chart. And then, of course, there was 13 percent of actual hardware failure: the memory failed, the disk failed, the processor failed.

So what did we do? Well, the first thing we did is we purchased, for tens of millions of dollars, a company that has one of the most advanced tools that we’ve ever seen for being able to analyze source code and look for problems in the source code before you actually go into the testing process. And this tool is able to find uninitialized variables, and find memory leaks, and a whole series of other things. And we’ve run this over the NT sources and fixed literally thousands of problems that it has been able to discover. So with all our good code coverage analysis, here was a way to find error cases that we couldn’t drive the system through.
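The class of check described here, catching a bug by reading the source rather than executing it, can be sketched in miniature. The following is a toy illustration in Python over straight-line pseudo-code of the form "name = expression"; it is not the tool Microsoft bought, and the function name and simplifying assumptions are illustrative only.

```python
import re

# Toy illustration of one class of static check: flagging variables that
# are read before they are ever assigned, in straight-line code of the
# form "name = expression".
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*(.*)$")

def uninitialized_uses(lines):
    """Return (line_number, name) pairs for reads before any assignment."""
    assigned, problems = set(), []
    for number, line in enumerate(lines, 1):
        match = ASSIGNMENT.match(line)
        rhs = match.group(2) if match else line
        for name in re.findall(r"[A-Za-z_]\w*", rhs):
            if name not in assigned:
                problems.append((number, name))
                assigned.add(name)  # report each unknown name only once
        if match:
            assigned.add(match.group(1))
    return problems
```

A real analyzer has to handle control flow, pointers, and inter-procedural paths, but the core idea is the same: the defect is found without ever running the program, which is exactly why it catches error cases testing never reaches.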

Another thing we learned from that set of statistics that I just showed you was that device drivers were a core problem. And we decided that was our problem, too, because we didn’t give enough tools to people for them to be able to test their drivers so that we could get higher quality. So we created something called the driver verifier testing tool.

We also have worked on attacks on the system from a security perspective. We have a special team inside Microsoft, and we’ve hired outside security analysts to attack the system, both by giving them code to review as well as having them do it in a black-box way. We put a full-time code penetration team inside of Microsoft. These are a full-time set of people, some of the best we have, that did nothing but code reviews every day of the week. And then we used what they found in terms of best practices to do education and drive that through the whole process with other engineers.

We also did better system-wide testing. Every day we build a new version of Windows 2000, and every night we run a stress test on it. And the stress test is the equivalent of about three months of run-time that we’re able to accomplish every night. And it’s done on anywhere up to 1,500 machines each night.

The other thing we did is, we created for the first time long-haul stress environments, in which we took categorizations that we heard from our customers were important in how they were going to use servers, like Web servers, and file servers, and print servers, and DHCP/DNS type environments, and we put stress on them, and we left the machines up for a long period of time under massive stress. So, this is the place where, for example, we do — I don’t know, I think our qualification is 65 million entries inside Active Directory, 2.3 billion look-ups inside DNS, that’s the sort of numbers that we expect to see from these tests. These tests we weren’t doing to this level before.

So, what did we actually do in Windows 2000 to address this? This is the same pie chart that I showed you before, and it shows around the pie chart the specific things that we did in Windows 2000. I think it’s worthwhile just taking a look at some of them. For the hardware install and configuration piece, we added Plug and Play, a very important thing that’s been missing in NT. We’re super-excited about all the capabilities that just that one feature is adding.

Service pack slipstreaming is something where you don’t necessarily have to bring the system down to do the integration of fixes to the system. The other thing is, we made many of the components of NT able to be added to the system without doing reboots; they’re just a native part of the system. We eliminated dozens and dozens of configuration reboots, even if you’re not adding a piece of software to the operating system. If all you’re doing is changing the IP address, or changing something about your WINS configuration, you do not have to reboot.

The other thing we did, Steve mentioned it, up in this area, is that we tried to do something about “DLL Hell,” and I think we’ve made a huge step forward in Windows 2000. Before, applications were able to replace operating system files, and that was a cause of crashes. We’ve now made that basically impossible. So the integrity of the system is always assured. We’ve also added the Windows Installer, which is the thing that you just saw. Even in the case where Steve, on that machine (I guess it was a Compaq, I don’t know which one it was), didn’t have Office, it was sitting there, going to go install that from behind the scenes.

In terms of app failures, in the SAP case, we added this kill process tree. So if you want to kill a whole series of processes that are related together, like being a part of a particular app, we added that as a native part of the system. We also did a tremendous amount of work in the Web server area. That was an area where, technically, pieces of application code could be run inside of the Web server, and we found those mistakes were taking down the Web server. Although the operating system didn’t come down, the Web server went down, and people felt the same thing, that the system was down. So we added the ability to separate the Web server core technology from any of the apps that are running, and the apps can no longer take down the Web server.
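The kill-process-tree idea is simple to state: terminate a process and every descendant it spawned, so an SAP-style failure doesn’t force a reboot of the box. A minimal sketch of the traversal, assuming a pid-to-children map has already been collected from the OS; the map and names here are illustrative, not the Windows 2000 API:

```python
def kill_order(root, children):
    """Return the root pid plus all descendant pids, deepest-first,
    so each child can be terminated before its parent."""
    order = []
    def walk(pid):
        for child in children.get(pid, []):
            walk(child)
        order.append(pid)
    walk(root)
    return order

# An SAP-like tree: process 100 spawned 101 and 102, and 101 spawned 103.
tree = {100: [101, 102], 101: [103]}
```

Here `kill_order(100, tree)` yields `[103, 101, 102, 100]`; a real implementation would then invoke the OS terminate primitive on each pid in that order instead of rebooting the whole machine.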

I could go on in terms of preventative reboots. We published best practices. One of the things I didn’t mention is that, surprisingly, when we went and visited the customers, we found a 5X difference in their uptime. That goes back to the dichotomy I mentioned before that was confusing to us, because we had all these customers that were having such a great experience, and we talked to some others who were saying, oh, well, we’re having to reboot. And when we got into it, operational practices made a big difference. If you treated it like a mission-critical environment, you had a better experience with the system.

In terms of system failures, that’s one that was core to me, and we spent a great deal of time on that. This is the list of some of the things that we’ve done in that area. Kernel verifier: we’ve added a bunch of technology to force-inject faults into the system while the system is running. We just tell the system on some calls, you’re out of memory. And then we proceed to track what happens. We fixed a number of issues in that area. Driver verifier I’ve mentioned. I’ve mentioned many of these. A huge, huge difference in terms of quality. It turns out that virtually all of the virus detectors, the anti-virus software, had problems. And through labs that we had with them, and through some technology that we provided to them, they were able to fix some of those tricky problems. These show up when you get in a heavy stress environment on a multiprocessing machine; sometimes they would hiccup. We had file system dev labs to work through things like the NetWare redirector, and everyone else who was adding file system components, and then in terms of device drivers, we went through all the ones that we’re shipping.
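Fault injection of the kind described here, deliberately telling code that an allocation failed, has a direct user-space analogy. A hedged Python sketch; the function names and failure rate are invented for illustration, while the real verifier operates on kernel allocations:

```python
import random

def faulty_alloc(size, fail_rate, rng):
    """Stand-in allocator that fails a chosen fraction of calls, the way
    the verifier tells code "you're out of memory" on purpose."""
    if rng.random() < fail_rate:
        raise MemoryError("injected fault")
    return bytearray(size)

def code_under_test(rng):
    """Must survive an allocation failure gracefully to pass."""
    try:
        faulty_alloc(4096, fail_rate=0.5, rng=rng)
    except MemoryError:
        return "handled"    # the error path we are trying to exercise
    return "allocated"

rng = random.Random(0)  # seeded so the run is repeatable
outcomes = {code_under_test(rng) for _ in range(100)}
```

Over 100 runs both outcomes occur, which is the whole point: error paths that normal testing never reaches get driven deliberately, and then you track what happens.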

The other thing we did is, we went to the hardware qualification lab, and we said, you are not to pass a single driver that doesn’t go through this driver verifier. And so the quality that we’re going to get out of these drivers now is significantly better than anything that we’ve had in the past.

We do have a hardware compatibility list. We really believe in it. Those are the systems and devices that Steve mentioned; it combines about 9,000 different components that will work great on Windows 2000, and as time goes on we obviously expect that number to grow.

Of course, it’s not just reliability that customers have asked for. They ask for availability. I mean, obviously, you know, stopping a failure is a good thing, but if you do fail, how long is it before you can get up and run again? And the other thing is, as I just said, it wasn’t just a technology issue; there really were operational practices that made a difference. So we worked on those areas as well.

In this particular slide, I broke it into three basic scenarios: the dot-coms, the line-of-business systems, and the branch offices. And we focused on improving the availability in each of those. In terms of dot-coms, we tried to do something in the data storage layer by distributing the risk, so that you can have up to four-node clusters and be able to do rolling upgrades, or, if one part of the system fails, just keep on running without a problem for persistent state.

We also added network load balancing. If you happened to see Bill’s keynote last night, we were using both of these techniques in the demonstration where he pulled out one of the computation engines and the system kept running.

In the line-of-business case, if you had planned downtime, we wanted to make it much easier. I mentioned rolling upgrade. If you take a cluster, you can take down one of the systems. Your service never stops, but you can take down one of the systems, put a new version of the operating system on, bring it up, then bring the other ones down, the service keeps continuing, and you update those.

In terms of unplanned downtime, we’ve done a bunch of work. An example would be that even Windows 2000 Professional, when we ship it, will have the ability, if it crashes, to automatically capture a mini-dump that will be small enough for us to transmit around. There are actually three levels of dumps now, so we have these very sophisticated tools to be able to diagnose the problem and point to that device driver, or that particular problem there, and hopefully quickly turn around a fix. Or, if a user has done something, then we have something called safe mode boot that can help walk them through the steps of what they might have done.

Branch office: you can’t go out there, perhaps it’s in Thailand, and you need to be able to have that system up and running. Well, if one of our services crashes (a service meaning like DNS, Active Directory, the Web server), those things can be automatically restarted without operator intervention. And, of course, we have remote administration across the net for the entire system.
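The automatic service restart mentioned here is, at its core, a supervision loop: run the service, and if it crashes, bring it back up, up to some restart limit, with no operator in the loop. A minimal sketch, not the Windows 2000 Service Control Manager; the flaky stand-in service and the limits are invented for illustration:

```python
import time

def supervise(service, max_restarts, delay=0.0):
    """Run a service callable; restart it on crash, up to max_restarts,
    with no operator intervention needed."""
    restarts = 0
    while True:
        try:
            return service()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise               # give up and surface the failure
            time.sleep(delay)       # brief back-off before restarting

# A flaky stand-in for something like a DNS service: crashes twice, then serves.
attempts = {"count": 0}
def flaky_dns():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("service crashed")
    return "serving"
```

Here `supervise(flaky_dns, max_restarts=5)` rides through the two crashes and returns on the third attempt, which is the branch-office scenario: the service recovers without anyone flying to Thailand.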

From the non-technology perspective, we have trained more people inside of Microsoft and more people externally than for any other product we’ve ever done. Just in terms of numbers, we have 5,500 trained staff members that can go onsite and help customers, and about 4,000 of what we call product support services engineers. Again, they’ve had something like 30 days of hands-on and classroom training on Windows 2000. Not to mention that they’re working with our early adopters, directly supporting them today, so they have real hands-on experience.

Microsoft Official Curriculum, we finished it earlier than at any other time in any of our releases — massive content. I was going to have a whole stack of content here today, but we didn’t quite get it down here in time. But you can imagine the amount of Microsoft Press books, the amount of information that we can provide, or that we have already gotten down for TechNet.

And, of course, it’s not just our software; it’s the particular hardware that you’ve chosen, the apps that you’ve chosen, and the policies that you’re using to run the system. And there are a set of vendors, such as HP and Compaq and the like, that can provide a high-uptime availability guarantee for you, and today those are about 99.9 percent. That’s what they already provide with NT 4, but as time goes on, I would expect that number to go up with Windows 2000.

We have had a lot of early adopters moving to Windows 2000. There’s a quote from Barnes and Noble: “When we migrated our warehouse fulfillment application to Windows 2000, we saw increased scalability and improved reliability. BarnesandNoble.com system performance is much improved over NT 4.” I have a video of this in just a minute. This is their fulfillment application, very critical to their business.

InfoSpace, another quote: “Over 2,100 Web sites, including partners such as AOL, Netscape, Lycos, and the Wall Street Journal, depend on us to deliver reliable services and cutting-edge solutions. We rely on Windows 2000.”

I do have a video, and I would like to show that.

(Video shown.)

MR. ALLCHIN: So, we picked two, but there are many among the partnerships that we’ve got, both dot-coms as well as outsourced enterprises, that are using Windows 2000 today and are having a very good experience with it from the reliability perspective.

The last thing I wanted to cover is, when is it going to ship? And you’ve heard about Release Candidate 3. I mean, everyone on the planet has probably heard about Release Candidate 3 and RTM. But we’re very close. We’ll be releasing it this week. It will be our last release candidate. Now, with these candidates we’re getting closer; we’re trying to cut down on the amount of churn, the amount of changes that are going into the system. We feel incredibly good. Our target is February 17th. You can count on it, you can be there. We’ve got enough buffer time, so we’re going to continue to work on the quality of the system. We’re going to keep going as long as we can, and just continue to improve it.

Now, oftentimes I’ve said, we’re not going to ship it until it’s ready. And then people say, well, when is it ready, how do you know it’s ready? We have a very quantitative way to approach this. And I wanted to share some of the statistics with you that make it so that we feel like we’re ready. The first is extensive deployment in production. It doesn’t count if somebody is just casually using it. They’ve got to be depending on it, and it has to have been able to solve their particular problem.

So, we’ve done it in a variety of different dimensions. One dimension is just within Microsoft. We needed to have 50,000 clients on it, and we wanted basically all our infrastructure servers on it, about 800. And so, we’re at that level today. In fact, we’re higher than that. That was the level that we were targeting to make sure that we felt good about it.

Then there’s deployment in terms of joint development customers, or early adopters, and there we expect that number to be around 22,000 clients, give or take, and 1,500 or so servers, in production. We expect that number to be higher, but that was the target that we had set for that.

Another core part of our sign-off process is partners, whether it’s the early adopters, the OEMs, or the ISVs. They’ve got to tell us that we’re doing the right job here. And in terms of deployment partners, you’ve heard of our joint development program, where we need those partners to sign off for us before we move ahead. And that’s a core part of our sign-off criteria. OEM partners, the top 10, we expect them to sign off. ISV partners, we have a set that we’ve worked very closely with, and we expect them to sign off on the system.

In terms of application, device, and system compatibility, we set a target for 450 client-side applications and 75 server applications, categorized through usage share, that our marketing team told us were the ones to hit. And those all had to work; those had to be compatible.

We also set targets, and Steve just brought them up, which is 5,000 different devices and 4,000 different systems. And a core part, the one that we measure every day (I see the numbers every day about where we are in stress), these are both the stress for the long-haul numbers and the stress for the daily runs. And those systems must run, 1,500 machines, with no reliability problems overnight, for three months. We have to be able to do that on a consecutive basis, as well as the long haul, and be able to see that the long-haul numbers match all of the requirements that we’ve got laid out.

Obviously, customer requirements or issues that they bring up to us, dealing with anything, Y2K, data corruption, security issues, we immediately address, and obviously we wouldn’t ship if we knew about any of those.

So, we’re very close to shipment. We feel very good about the quality. I hope all of you have had an opportunity to play with it. If not, please, you know, get some experience with it; it’s very close. Why not? I think you’ll have a better experience than with whatever you’re running today, I’m sure.
