Remarks by Ron Markezich, Microsoft Chief Information Officer
Microsoft Management Summit
Las Vegas, Nevada
April 21, 2005
RON MARKEZICH: Good morning. Good morning. It’s day three of MMS. Always good to see people still in the audience on day three of an event like this.
I’ve got to tell you, I’m honored to be speaking here today. The reason I’m honored is because of the jobs that all of you do for your companies. I’ve got to tell you, you all have hard jobs, and they’re jobs that if you don’t do them and you don’t do them well, your company does not run. I can tell that from firsthand experience. My first job out of college was actually being an ops analyst. I was on a tier two shift, carried a pager, got paged quite a bit in the middle of the night, ruined a couple of my relationships at the time with my girlfriends and was a job that was extremely difficult; in fact, it was a job so difficult I decided to go into management because I thought that might be a little easier.
So I don’t have to tell you, you have hard jobs. They’re jobs I know are not always acknowledged. And one of the things I want to do today is talk a little bit about how we do IT at Microsoft, how we manage IT and how we look to make the jobs that you all do easier, more effective and also increasing the contribution that these jobs can make to a company.
And so a lot of my talk today is going to focus on a broad segment of managing IT at Microsoft because I firmly believe the ops center, what I call the ops center in Microsoft, or ops centers at any company are going to move up the stack in terms of the services that they provide their organization so that they can provide more value, that they can take and centralize services that may not have been centralized before, they can drive up service levels and they can drive down costs. So I’m going to hit a variety of things.
I’ve also got a couple of my guys up here today to do a couple of demos. Now, I’ve got to tell you, we’re IT guys, we’re not professional marketing people, we’re not demo people, we’re IT guys, we have jobs just like all of you. And so we’re going to show you some of the stuff that we use to manage IT at Microsoft.
Let me start by giving you a little view of Microsoft’s IT environment. I’ll give you some data. For instance, we have 58,000 employees, about 89,000 total users. The delta there would be contractors or vendors that are essentially users of our system, temporary work staff.
The most amazing thing about our environment is with 58,000 employees we actually have 300,000 machines on our network. That’s one network worldwide. The reason we have so many machines is average number of PCs per employee is about three and we have a lot of labs, product group labs, development labs, product support labs that hang off that same network. And so, when you see some of the guys up here, if they look a little tired it’s because they have to worry about 300,000 machines on their network day in and day out.
We have about 100,000 e-mail accounts. One of the big advantages in my job is I can assume that every user, every employee, every vendor that’s on site at Microsoft has e-mail, has a computer, has network access. So when we start talking about our strategy, our strategy has a lot to do with automation. We look to automate as much as possible and centralize as much as possible. One of the reasons I can do that is everyone is connected with their own machine, in fact multiple machines.
We also look to consolidate: consolidate applications, consolidate functions, consolidate systems. One example of that is we use SAP for our ERP system. It’s on one database server, one instance, resides in Redmond. Regardless of where you are in the world, you’re accessing that SAP instance in Redmond. It’s on 1.9 terabytes right now.
In fact, one of the things I’ll talk about as my most important job is running our products before they’re released to customers. So SAP is an example that runs on SQL Server 2005 since last August — so we pay our employees, we close our books, we recognize revenue on a beta database server for up to a year before that server is released publicly.
We have about 400 buildings worldwide. About half of those are in the Redmond area — Puget Sound area we like to call it — and the other half are all around the world. We do business in about 83 countries.
We have a very remote workforce, 9.5 million remote connections per month. Even the people in Redmond doing product development you can think of as remote workers, because often they are going home, using their desktops, using remote desktop at home and doing development work from their home office. And then folks outside of Redmond typically travel to customer sites, partners, conferences, things like that.
And in terms of mission-critical applications at Microsoft, and this might be the same for your companies, e-mail is like oxygen to Microsoft employees. If they don’t have e-mail, they can’t breathe. E-mail is how everyone communicates. We do about 3 million messages per day internally. From the outside coming in we get about 10 million mails a day. Nine million of those are deleted as spam right off the bat, so we get about 90 percent illegitimate mail.
Our policy is if we delete a mail and it’s legitimate we’ve failed. Our service level is 100 percent of the mail that we delete as spam is truly spam. I’d hate to delete that mail from a customer with a questionable name that’s saying I’m ready to sign the contracts. So if we’re not 100 percent sure we don’t delete that. That means some mail gets in and we have client-side filtering that allows the users to have more strict control.
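As an illustration of the policy Ron describes here, the following is a minimal Python sketch, not Microsoft's actual filter. The confidence score, the `route_inbound` function and the `user_junk_threshold` parameter are all hypothetical; the only thing taken from the talk is the rule that mail is deleted at the gateway only when it is certain to be spam, with everything else delivered and left to client-side filtering.

```python
def route_inbound(spam_confidence: float, user_junk_threshold: float = 0.8) -> str:
    """Return "delete", "junk" or "inbox" for one inbound message.

    spam_confidence is a hypothetical score in [0, 1] from some upstream
    classifier; user_junk_threshold is a hypothetical per-user setting
    standing in for the stricter client-side control.
    """
    if not 0.0 <= spam_confidence <= 1.0:
        raise ValueError("spam_confidence must be between 0 and 1")
    # Gateway rule from the talk: delete only when 100 percent sure.
    if spam_confidence >= 1.0:
        return "delete"
    # Anything less certain is delivered; the user's own filter decides.
    if spam_confidence >= user_junk_threshold:
        return "junk"
    return "inbox"
```

With these assumed values, a message scored at 0.95 is still delivered and only lands in the junk folder because the user's own threshold says so.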
We do have about four 9s availability on mail.
The other thing in terms of being a CIO, of being in IT, is it’s amazing to me as you add new services and new capabilities to an organization what happens to them. For example, e-mail, we do about 3 million messages per day internally and that consistently grows every month. We rolled out Live Communications Server and then we just rolled out Office Communicator for IM, and that’s used extensively. I thought, you know what, once we roll that out and that gets used my e-mail volume will drop. Sure enough, e-mail chugs along at the same growth rate. The same with remote access: you roll out remote access, people use that extensively, you roll out Outlook Web Access, figure people are going to stop using remote access. That’s not the case; they just use Outlook Web Access and remote access.
So what happens with any IT service, I think, is that you add new services and they don’t take away from other services; it just gives employees another means to do work, which is a great thing. It makes for productive employees, but it makes the manageability of that environment more difficult because now you have more mission-critical applications that you need to support.
The four dots on there are datacenters, essentially we have four core datacenters worldwide: Redmond, Singapore, Dublin and then we have a small one in Tokyo.
A couple things on our strategy, IT objectives. We have four key objectives for my organization. My first, and sometimes I say it’s my first, second and third priority, is to be Microsoft’s first and best customer. What that means is before we ship any enterprise product, my organization needs to sign off on that product and say that we run our business on that product, whether that’s MOM 2005, SMS 2003 or SQL Server 2005. Go across the board: we are running our business on that product before that product ships.
Now, that makes my business customers a little nervous at times. We just started our annual budgeting process and we’re running our entire budgeting process on SQL 2005 with Reporting Services, with the new ETL capabilities, and I’ve got to tell you my corporate controller is a little nervous. Things have been going well so far, but it helps emphasize the philosophy we have, which is I’m willing to put the business at risk in order to make those products the best products that they can be before they ship to customers.
Second is to run a world class IT organization, an IT organization that we’re continually increasing SLAs, we’re continually decreasing the cost per unit. That means the cost per mailbox, the cost per server, we need to continually decrease the cost per employee and show that we are continually a world class organization.
Third is protecting Microsoft digital assets. I’ll tell you, Microsoft loves our employees, Microsoft employees are wonderful but everything else about Microsoft that we love is on a computer, other than the employees, and I think sometimes if we figured out a way to put employees on a computer we would do that, haven’t figured that out yet. But if you think of our intellectual property, our confidential data, our private data, it’s all sitting on computers somewhere and it makes it susceptible to cyber attacks or hackers. And so we spend a lot of time making sure we secure that environment and addressing threats as they evolve through the environment.
And the fourth, certainly not least, is driving productivity across our user base. We don’t do any manufacturing in our company, we outsource our manufacturing. So the CDs, which we’re seeing fewer and fewer of, they’re outsourced. Our manufacturing is R&D, it’s knowledge worker type work, it’s decision support, it’s customer connections, customer support.
So it’s very important for us to have a productive workforce, because they’re all knowledge workers, information workers, and our business model does not allow us to significantly increase the number of employees. Our business model is all about leveraging our partner base, leveraging our ecosystem and keeping the number of employees relatively small for our company and getting those employees focused on R&D. We love investing in R&D. In fact, we’ve taken $100 million out of the cost of running our infrastructure over the last three years. I feel great about that because that $100 million gets poured into R&D so we can make better products for our customers.
Now, there are three strategy pillars I want my organization focused around. And this goes back actually a year ago when I took the CIO job. A friend of mine, another CIO said, ‘Hey, congratulations on taking the CIO job. Did you know you’re in the oldest profession in the world now?’ And I’m thinking, gosh, I’d never heard that; I’ve heard of other professions being the oldest profession in the world but I’ve never heard of the CIO being the oldest profession. He said, ‘Well, before there were even the heavens and the earth there was just chaos in the entire universe and who do you think could have created that chaos but a CIO so you must be in the oldest profession ever in the world.’
And so I thought about that and I’m thinking, gosh, that is so true of many IT environments, ours included. And so what we look to do is we look to automate, centralize and consolidate all around a theme of simplification. We cannot simplify too much and I’ll talk a little bit about some of the things we’ve done later but we really strive to simplify that environment through automation, centralization and consolidation.
Now, the ultimate goal for us and what we march to is to be a dynamic IT organization with a dynamic infrastructure. And that might be an overused word. You hear Dynamic Systems Initiative, what does that actually mean? I think it actually means taking functions across operations, applications and information workers, making them adaptive, driving up the service levels, driving down the cost, simplifying and ensuring that they’re adapting with the speed of the business.
Microsoft moves very fast as a culture. Even though we’re getting to be a large company we move fast, people are incented to be entrepreneurial, to think of new ideas, to be creative and that drives a quickness in the business. I need to make sure my infrastructure, my services, my applications can move with the speed of the business, meaning rather than provisioning a server in a month because we’ve got to wait for an order to come in from our vendor, we’ve got to get it in the datacenter, we’ve got to get space, power it up, do the build, which may take a while for that entire process to happen. We need to make sure we can provision computing power within an hour, within minutes because there may be a need that we didn’t anticipate or the business didn’t anticipate that we need to meet. That’s all about being a dynamic IT organization across the operations, the applications and the information worker areas.
We do that via DSI, the Dynamic Systems Initiative; via common engineering criteria with Windows Server System, a set of criteria that I can assume products with the WSS stamp of approval adhere to; via .NET, especially for our line of business apps; and via Trustworthy Computing.
Let me go through those three areas and I’m going to start on operations, especially around simplifying our environment. So I told you CIOs are in the oldest profession in the world. Let’s get that environment simple and make sure we don’t have chaos.
Take some of the things that we’ve been able to do around infrastructure server consolidation. Consolidation I think is a bad term. It’s really elimination. My goal is not to consolidate servers, because that makes it sound like I’m taking two servers and putting them in one place; I want to eliminate them. I want to eliminate my servers that reside outside of my datacenter, and I want to consolidate as much as possible. We’ve been able to eliminate 30 percent of our global infrastructure servers, largely because of a little capability in Office and Exchange 2003, which is cached mode and the WAN optimization.
We had a problem in Microsoft before Exchange 2003. How many of you guys run Exchange? Any hands out here? Good, a few Exchange customers in here. How many of you are still on Exchange 2000 or earlier or Exchange 5.5? Oh good, not too many, the majority on 2003. So a lot of you may be able to relate to this.
At Microsoft, when we had Exchange 2000, we ended up with Exchange Servers in 74 sites worldwide, the reason being if you had a site with about 100 people or more you’d want an Exchange Server there for performance reasons. If there was a network circuit down, you didn’t want the e-mail in that office to be down. And you’ve got to remember, e-mail is like oxygen so if they don’t get e-mail they’re not breathing.
So we ended up with Exchange Server in 74 sites. Now, being a technology company our people that run those sites love to put servers in there because what they can do is they can take customers like you through that site and show them the blinking lights, beat their chest and say look at these great servers I have in this environment.
So when we gave them the excuse by putting an Exchange Server there they put a SharePoint Server in, a file server, Siebel servers, other application servers; we had a lot of what we call mini data closets, not datacenters but data closets in sites around the world.
What we’re able to do with 2003 and Exchange and Windows is pull those Exchange Servers out to the point where we have four sites with Exchange Servers today in our core datacenters. We took the excuse away from those offices and we said you know what, we’re taking the Exchange Server out so we’re taking domain controllers out and we’re taking all these other servers that grew up here over the years out of that site and we’re getting all servers out. We’re also taking that site and we’re connecting it via the Internet.
So for us, that did a tremendous thing around simplifying our environment and getting our infrastructure focused in our datacenters where we have critical environments and high availability.
We also took services like print and consolidated those as much as possible, where today in Redmond we have two print servers that serve literally half of the company.
When we got rid of those data closets we could also look at what we call datacenters. Now, if you ask my datacenter guy, John Coster, he would kind of start shivering when you start saying they’re datacenters because they’re very low tier datacenters. We could start closing those and focus on four datacenters worldwide where we could actually drive high availability in tier 3, tier 4 type datacenters, and then also getting more than 90 percent of our servers in those datacenters.
The other thing consolidation has enabled is the security component: we are very quick to deploy patches, because to be in a datacenter we require you to have SMS on your machine. So we can do patch deployment extremely quickly in the datacenters, literally within 48 hours. The only reason it’s 48 hours and not 24 hours is that we give our apps team 24 hours to test their applications before we deploy patches inside the datacenter.
And then we can also with a consolidated infrastructure make sure we have very clear and enforced security standards and policies.
Now, when it comes to monitoring and managing our environment we really focus on centralizing our monitoring function. I’ve got two monitoring centers. I follow the sun. I have a center in Redmond that does a day shift and I have a center in India that does a day shift there. So we do a follow the sun model and we don’t have a night shift in our NOC, in our monitoring center.
Now, what we do for monitoring is we use a MOM console. We had a big challenge before MOM 2005 because we get event streams from a lot of different sources. For network events we use SMARTS, a product that EMC now owns, for all our network monitoring. We get MOM events from our servers, we get MOM events from our messaging, we get MOM events or other events from our applications and we want to take that all into one console.
We used to have multiple consoles in our NOC. With MOM 2005 we’re able to leverage one console and bring events even from SMARTS and from our telephony system, which is Nortel, into that common console, so it interfaces with the other third-party applications. It’s very effective for us in both of our NOCs in the world to have one console and have the team focused on that.
What I’d like to do now is a demo. I’ve got to tell you, I saw one of my guys that runs our monitoring services, Tom McCleery, yesterday after Steve Ballmer’s keynote and I said, “Tom, I’d love to do a demo of MOM to this group.” Tom was not planning to do a demo, so this is something I talked to Tom about less than 24 hours ago. I would really like you to welcome Tom out on stage, and he’s going to give us a demo with less than 24 hours’ preparation. Tom? (Applause.)
TOM MCCLEERY: Hey, Ron, how are you doing?
RON MARKEZICH: Tom, welcome. Glad you could do this; you’re a brave man.
TOM MCCLEERY: Good morning, everybody.
What Ron and I would like to spend a couple minutes doing is showing you how Microsoft IT uses the MOM Connector Framework as well as some of the tasks in our production MOM environment.
RON MARKEZICH: So let me just clarify, this is our production environment right now?
TOM MCCLEERY: Yes, absolutely. What I did is I took my laptop on short notice here and I created a VPN connection back into our corporate network and I’ve basically just brought up the MOM operator’s console onto my laptop and this is what you probably can see here.
I have to tell you I called the NOC this morning or our operations center and I said, well, it was about 7:00 this morning, I said, well, you know, starting at about 8:30 don’t touch anything in the console. (Laughter.)
RON MARKEZICH: They loved that one, huh?
TOM MCCLEERY: Yeah. They said, “Well, let me get this straight, Tom. You’re in Vegas and you don’t want us to touch anything in the console so that you and Ron can work alerts?” I said, yeah, yeah. They were like, “Have you lost any money?” (Laughter.)
So anyway, by production this is what I mean. If I click on Time and State here, this is a view that shows you all of the alerts in our production environment. As you could see here, we’ve got alerts in here that are about 10 seconds old, got about 26,000 alerts in there.
Since this view shows all alerts, not all of them are going to be relevant and actionable to the ops analyst sitting in the ops center. So what we did is we created a new view over here, CS, which stands for Computer Systems, and if you click on New Alerts here this is what our analysts would see in the NOC. It doesn’t look like it’s been too busy here but it appears that we have a backup server that is sitting over in I think SPA, that would be a Spain backup server, and it looks like we’ve had a MOM heartbeat failure and that server is actually down and it’s about 15 minutes past our SLA. (Laughter.)
RON MARKEZICH: So this could be good service. I can see it when we go through our scorecard this month it’s going to be my fault that we didn’t hit our SLAs because of this.
TOM MCCLEERY: All right, I’ll let them know. (Laughter.)
But basically the server is down and we could do a number of things here. We have tasks over here, I could try to ping this server. In fact, I may just try to ping it and see if it’s come back to us since — boy, this is not looking good, is it? (Laughter.)
RON MARKEZICH: What’s the mission critical application on this server, Tom?
TOM MCCLEERY: It’s just backing up data, you know. (Laughter.)
RON MARKEZICH: It’s just data.
TOM MCCLEERY: What’s a datacenter without data, huh?
So what I’m going to do in this particular scenario is through the MOM Connector Framework we have our ticketing system integrated. So what I’m going to do is I’m going to create a ticket.
Now, an analyst has three different options when they have this alert. They can mark it as resolved, which basically just resets it: it goes away, and the alert will come back later, hopefully. Or they can append it. Append means that there is an existing ticket on it and they’re going to just attach that alert to the existing ticket; let’s say somebody took the server down to patch it. The other thing they can do here is create a ticket. I’ll go ahead and hit Create, and what it’s doing in the back-end is the ticketing system connector is polling our MOM database to find all the alerts that are marked in the resolution state of Create Ticket. As it finds these Create Ticket alerts it forwards them over to the Web Services interface of our ticketing system, where a ticket is then generated. And we use Siebel as our ticketing system.
So after Siebel creates the ticket, what will happen is it will change the resolution state of that alert to created successfully and forward that alert back into the MOM database. That allows us to go to where you see ticketed alerts, and there’s our South America backup server here. And if I look at the custom properties it gives me a ticket number there.
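The polling flow Tom describes can be sketched roughly as follows. This is a minimal, self-contained Python illustration under stated assumptions, not the MOM Connector Framework API: the `Alert` class, the state names, `poll_once` and the fake ticketing service are all hypothetical stand-ins for the MOM database and the Siebel web service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical resolution-state labels; the real MOM values are not given in the talk.
CREATE_TICKET = "Create Ticket"
TICKET_CREATED = "Ticket Created Successfully"


@dataclass
class Alert:
    alert_id: int
    server: str
    text: str
    resolution_state: str = "New"
    custom_properties: Dict[str, str] = field(default_factory=dict)


def poll_once(alerts: List[Alert], create_ticket: Callable[[Alert], str]) -> List[str]:
    """One polling pass: open a ticket for every alert marked Create Ticket."""
    created = []
    for alert in alerts:
        if alert.resolution_state != CREATE_TICKET:
            continue
        # Stands in for the call to the ticketing system's web service.
        ticket_number = create_ticket(alert)
        # Write the ticket number back so the operator can see it on the alert.
        alert.custom_properties["ticket_number"] = ticket_number
        alert.resolution_state = TICKET_CREATED
        created.append(ticket_number)
    return created


def fake_ticketing_service() -> Callable[[Alert], str]:
    """Hypothetical ticketing back-end that hands out sequential numbers."""
    counter = {"next": 1}

    def create(alert: Alert) -> str:
        number = f"TKT-{counter['next']:04d}"
        counter["next"] += 1
        return number

    return create
```

Because the pass flips the resolution state after ticketing, running it again does not open a second ticket for the same alert.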
So I’m going to take that ticket number — oh, there’s a message from somebody — and I’m going to plug it into my ticketing system here. We’ll search and see what’s going on with my ticket here.
So as you can see, it’s had a ticket created, it has the event text here in the description, a basic summary, the status is marked open and unassigned and I think I’m just going to go ahead and assign this one to you.
RON MARKEZICH: Good. It may take a while to get resolved if it’s assigned to me.
TOM MCCLEERY: You can hop a jet to South America and bring that thing back to us.
Another thing that Ron spoke about earlier is integrating non-Microsoft environments into the console. So what we have done is, as he mentioned, we have SMARTS integrated in; we have a SMARTS-to-MOM bidirectional connector. What that means is the connector is actually polling the SMARTS system to find alerts that have already been correlated in the environment and need to be worked on. It will then take those alerts and forward them over into the MOM database. At the same time that it’s forwarding these alerts over, there’s another query going on against the back-end of the MOM database, looking for alerts that have been forwarded over to see if any kind of state has changed, so that it can communicate that back to the SMARTS environment. That way both monitoring systems are up to date at all times.
And as you see here, we have, let’s see, let’s take this guy right here and see exactly what’s going on. It looks like a router went from Full to Down, neighbor down, dead timer expired. It looks like it’s probably an OSPF adjacency error, meaning that one of the routers has lost its neighbor.
Some of the interesting things we’ve done around the network stuff is created some tasks that allow us to drill into information. So if you want to put lower-level resources on the console, for example, you can automate some of the triage before it actually passes over to the systems engineers. In this particular case, if I wanted to look at this device and see, oh, well, what are some of the particulars on this device, where is this device located, we have a task that will query our config management database and return information to us. It should be coming up any second here.
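A bidirectional connector of the kind described here can be pictured as two independent passes over two stores. The Python below is only a sketch under assumptions: the `Record` class, the state strings and both function names are hypothetical, and plain dictionaries stand in for the network monitoring system and the MOM database.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Record:
    key: str
    state: str               # hypothetical states such as "open" or "closed"
    forwarded: bool = False  # True once the alert has been copied across


def forward_correlated(source: Dict[str, Record], target: Dict[str, Record]) -> int:
    """Copy alerts not yet forwarded from the network monitor into the console store."""
    count = 0
    for key, rec in source.items():
        if not rec.forwarded:
            target[key] = Record(key, rec.state, forwarded=True)
            rec.forwarded = True
            count += 1
    return count


def push_state_changes(target: Dict[str, Record], source: Dict[str, Record]) -> int:
    """Send state changes made in the console back to the network monitor."""
    count = 0
    for key, rec in target.items():
        if rec.forwarded and key in source and source[key].state != rec.state:
            source[key].state = rec.state
            count += 1
    return count
```

Running both passes on a schedule keeps the two stores in agreement, which is the property the talk calls out.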
RON MARKEZICH: Hopefully the router down isn’t the Vegas-Redmond link.
TOM MCCLEERY: No, it appears — well, I don’t think so. (Laughter.)
RON MARKEZICH: See, the network guys are going to get back at Tom to ask him not to do the alerts.
TOM MCCLEERY: We’ll let this thing come up.
RON MARKEZICH: So talk a little bit about what you’ve seen for benefits.
TOM MCCLEERY: Uh-oh, there we go. So benefits, this is going to drive efficiencies. This is actually — let me try it one more time. It’s the VPN connection here.
So the benefits are going to be that this standardizes the workflow that we go through in the NOC so if there’s set given things that we want an analyst to do on a given alert, we can put that into a document for the analyst to have and we can have them specifically run these tasks as opposed to getting into a command shell or creating their own little local bin of custom scripts.
So in a nutshell if we look back through, what the MOM Connector Framework has done for us, it’s allowed us to automate some of the tasks, the basic tasks that we do in the environment, it’s allowed us to drive efficiencies in the way we work in creating auto tickets and it really cuts down on the administrative point and click, as well as integrate some of our other third party software that we need to run in the NOC.
RON MARKEZICH: Great. I think you’ve got MSC back up there now.
TOM MCCLEERY: Oh, did I get it? Yeah, MSC is back but the netdocs — this would be an example of what was going to be pulled up here. It basically gives us the routing information, we have other utilities. In this particular event that I did earlier this particular network device is located in Germany, so you can open up contacts and some of the other building. This is in Unterschleissheim, Germany.
RON MARKEZICH: Great. Well, Tom, I think that was awesome for less than a day’s preparation. Well done, thank you very much.
TOM MCCLEERY: Thank you. (Applause.)
RON MARKEZICH: So I want to talk a little bit about how we know MOM actually provides us benefits with MOM 2005 and what Tom showed you. Some of the things that we measure for success metrics are alert to ticket ratio, ensuring that we drive the number of alerts to ticket down. We went from 35:1 to 2:1 today.
And one of the things I wonder, and I’m going to ask the guys in the NOC when I get back, is whether the reason they didn’t want us to answer any alerts for a 45-minute window is that then Ron is going to realize how few alerts we get during that period. We’ll find out on that one. So the alert to ticket ratio is about two to one. The alerts coming through are far fewer, so it’s a much more efficient operation.
The other thing is proactive service requests. I would love an environment where every IT issue in that environment is found by an IT person or an ops center before it’s found by a user. I think we’re pretty good at that around the areas that we monitor, network computer systems, telephony. We’re about 87 percent so 87 percent of our service requests are found via our proactive monitoring and not via a reactive user.
It’s much worse on applications. In fact, the majority of our application issues are actually found by a user and so we’ll keep that note in mind when we start talking about where we need to head as a dynamic IT organization.
And then the last one is the length of time to detection. About 98 percent of our alerts come within two minutes of detection, which is a key metric for us, especially for making sure we can get those things fixed before a user gets impacted by an outage.
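The three success metrics just listed are simple ratios. As a minimal sketch, assuming raw counts and per-alert delays are available, they could be computed as below; the function names and the sample inputs are hypothetical, and only the 35:1, 87 percent and two-minute figures come from the talk.

```python
from typing import Sequence


def alert_to_ticket_ratio(alerts: int, tickets: int) -> float:
    """Alerts shown to analysts per ticket opened (35:1 before, about 2:1 now)."""
    if tickets <= 0:
        raise ValueError("tickets must be positive")
    return alerts / tickets


def proactive_share(found_by_monitoring: int, total_requests: int) -> float:
    """Percent of service requests found by monitoring rather than reported by a user."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * found_by_monitoring / total_requests


def share_within(delays_seconds: Sequence[float], limit_seconds: float = 120.0) -> float:
    """Percent of alerts whose delay is at or under the limit (two minutes by default)."""
    if not delays_seconds:
        raise ValueError("need at least one delay")
    within = sum(1 for d in delays_seconds if d <= limit_seconds)
    return 100.0 * within / len(delays_seconds)
```

For example, 87 proactively found requests out of 100 gives the 87 percent figure quoted for network, computer systems and telephony.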
I’ve got to talk a little bit about security. The security strategy is founded around four pillars. With “Longhorn” those four pillars will go to three pillars and I’ll talk about why.
First we want to secure the perimeter, we want to keep the bad guys out of the environment. We do that via things like Smart Cards. We require two-factor authentication to get in. Talk about a challenge on manageability, now you’re managing a very robust PKI infrastructure for access.
When Steve Ballmer was down in Vegas and he wanted to RAS in, Steve needed to use his Smart Card. We do not make exceptions for execs or for anyone else in the company.
We also secure the perimeter through means like secure wireless. We use 802.1x certificate based authentication, need a PKI infrastructure for that as well, as well as IPSec. One of the themes you’ll see through all of these is IPSec is a core component of securing our environment.
Second is securing our interior. We have one network with, as I said, 300,000 machines. We logically segment that network via IPSec so we can have a secure net and a not-secure net area that are segmented and don’t cross over.
In terms of Smart Card, any of our IT people that have access to servers in our datacenter need a Smart Card to do admin functions on that server. So we do two-factor authentication for admin functions, a great control point for Sarbanes-Oxley. Our auditors love that control point.
Third is we secure key assets, meaning things like source code, our intellectual property we secure via IPSec. So we have IPSec policy around our source code. To check in or check out you need to have the appropriate certificates on your machine. We also secure other things like Bill Gates’s e-mail via IPSec, so you need to have certificates on your machine to access that e-mail, so key assets we will do that way.
And then last but certainly not least is compliance and audit. I learned a long time ago if you don’t enforce a policy there’s no way that that policy is going to be followed. And so we’re very strict about enforcing policy that is important to us through our compliance and audit capabilities.
Let me talk for a minute about patch management. When it comes to patch management and how we patch those 300,000 machines, most of those are client machines or lab machines, so they’re not going to be in our managed space, which would be our datacenter. Our datacenter space is roughly about 10,000 total machines, so a very small percentage. The rest of those machines, users are admins on those machines.
And so to patch them we rely on Auto Updates quite extensively. Most of the client machines are running Windows XP SP 2 with Auto Updates on. If Auto Updates was turned off we also do an e-mail out to all of our employees.
Now, in the next month or two, as I look at Jack Schlafer in the front row, the person that runs my manageability services, we’re going to be turning off that e-mail and we’ll rely only on Auto Update or intranet notification if employees choose to get notified when there’s a patch necessary to be loaded. About 70 percent of our population is covered through that means.
We have SMS deployed very extensively across the environment, so then we will start doing SMS software distribution. First we’ll make the patch voluntary for a period, depending on the patch criticality, and then we’ll force patch those machines via SMS.
Now, we have very smart, technically astute users at Microsoft and so some of those users may have actually disabled SMS, turned off Auto Update. Those people, if they’re not paying attention to their mail or they’re not getting the intranet alert about a new patch and they don’t patch their machine, we will not hesitate to shut their port off. So we will shut them down, they will lose connectivity to the network — that will get their attention — and then they’ll need to call the help desk, help desk will re-enable their port and ask them to load the patch. If they’re stubborn and they don’t load the patch after you get them re-enabled, they get shut off again. We use MBSA across the environment to detect who’s not patched and then we have written a custom tool, we’re all Cisco on the network, to disconnect the port based on the MAC address. That drives us to our patch compliance across those 300,000 machines. But SMS is key to that, Windows XP SP 2 is also key to that process.
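As a rough illustration of the escalation described above (Auto Update or notification first, then voluntary SMS distribution, then forced SMS patching, then port shutoff for machines that stay unpatched), here is a minimal Python sketch. It is not Microsoft IT's actual tooling. The names `Machine`, `Action`, `next_action` and the `voluntary_days` threshold are hypothetical; the talk says only that the voluntary period depends on patch criticality.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no action; rely on Auto Update or notification"
    OFFER_VIA_SMS = "offer patch via SMS (voluntary)"
    FORCE_VIA_SMS = "force patch via SMS"
    DISABLE_PORT = "disable network port by MAC address"


@dataclass
class Machine:
    mac_address: str
    patched: bool      # e.g. the result of an MBSA scan
    sms_enabled: bool  # False if the user disabled the SMS client


def next_action(machine: Machine, days_since_release: int, voluntary_days: int) -> Action:
    """Pick the next enforcement step for one machine.

    voluntary_days is an assumed, per-patch threshold (shorter for
    more critical patches); it is not a figure from the talk.
    """
    if machine.patched:
        return Action.NONE
    if days_since_release < voluntary_days:
        # Still in the voluntary window: offer via SMS where possible.
        return Action.OFFER_VIA_SMS if machine.sms_enabled else Action.NONE
    # Voluntary window is over: force via SMS, or cut the port if SMS is off.
    return Action.FORCE_VIA_SMS if machine.sms_enabled else Action.DISABLE_PORT
```

In this sketch, a machine whose port was re-enabled by the help desk but is still unpatched simply evaluates to `DISABLE_PORT` again on the next scan, which matches the "they get shut off again" behavior described.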
Let me talk a little bit about IPSec. We love IPSec. We think IPSec is a great tool and a great means for us to take our one physical network and segment that network logically so that we can provide a level of security while leveraging those economies of having one physical network.
What we do with IPSec is we’ve built an environment called SecureNet. It’s about 220,000 machines in SecureNet. To be in SecureNet you need to meet certain criteria. You have to be running Windows XP SP 2 on the client, Server 2003 SP 1 on the server.
And so in that SecureNet area you have access to all the great corporate resources that we provide, things like e-mail, SAP, other services that you may not necessarily get if you’re not in SecureNet. So there’s a motivation, an incentive for owners of applications and computers to move those systems into our SecureNet environment and get on the RTM XP SP 2 build.
Then, outside of SecureNet, to get in they’ll need to do things like dial up. We’ll have labs there, which we segment so that if someone is doing something nutty in a lab, the impact that lab might have on the SecureNet machines is minimal.
Let me talk a little bit about our dog food program. I mentioned that my first and most important job is to be Microsoft’s first and best customer. We call this dog fooding. Dog fooding means eating your own dog food. We actually tried to change it to drinking our own champagne but people seemed to like eating their own dog food better, because one of the reasons is people that own the groups always have to take this handful of dog food and eat it to show their commitment to dog fooding. It doesn’t work as well if you make them drink champagne to show their commitment because that doesn’t show much of a commitment and eating dog food certainly does.
So we have a dog food program. As part of that dog food program we sit down with the product teams and we sign up for shared goals. We say before this product ships these are the goals that we need to see inside Microsoft to sign off. Those goals would be number of clients, number of servers. We need to show these types of capabilities, this type of value from this product.
Throughout that process we get product feedback and we fix bugs. We actually give the IT people access to the bug database that the product groups use, so we enter bugs directly into the product group database.
At the end of the process we sign off on that product. But I’ll tell you, if it does not meet our shared goals I do not hesitate and the VPs that run the product groups do not hesitate to delay shipping of that product and that’s a big deal for Microsoft when the mentality is driven around ship dates. We will not ship that product, guaranteed, unless IT signs off on that product.
Now, what I want to do is bring out one of my operations managers, Calvin Keaton. Calvin is going to demo a product that we’ve been dog fooding that’s now in beta called Data Protection Manager. We’re super excited about the value that we’re going to be getting in the IT organization because of this product. So let’s hear it for Calvin Keaton, our operations manager in Microsoft IT. (Applause.)
CALVIN KEATON: Thanks, Ron.
RON MARKEZICH: Thanks for joining us, Calvin.
CALVIN KEATON: Thank you.
So my name is Calvin Keaton and I’m the data protection service manager for Microsoft IT. And what this means is that I’m responsible for backup and recovery of data across the corporate IT environment.
Now, one of my most challenging business problems is branch office backup, and I’d like to talk to you a little bit about how I’m going to use Data Protection Manager to solve this problem for me.
Now, Microsoft has 130 branch office locations spread across the globe. These 130 locations comprise approximately 11 percent of the data that we back up on a daily basis. This 11 percent of data that I back up costs me 28 percent of my backup budget. This isn’t a particularly efficient use of my IT dollars and I’m looking to Data Protection Manager to fix that for me.
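To make the inefficiency concrete, the two percentages above can be turned into a relative cost per unit of data. This is a back-of-the-envelope sketch; it assumes the remaining 72 percent of the budget covers the remaining 89 percent of the data, which the talk implies but does not state. The function name is hypothetical.

```python
def relative_unit_cost(data_share: float, budget_share: float) -> float:
    """How many times more a unit of branch data costs to back up
    than a unit of non-branch data, given the branch shares."""
    branch_cost_per_unit = budget_share / data_share
    other_cost_per_unit = (1 - budget_share) / (1 - data_share)
    return branch_cost_per_unit / other_cost_per_unit


# Figures from the talk: 11 percent of the data, 28 percent of the budget.
ratio = relative_unit_cost(data_share=0.11, budget_share=0.28)
print(f"Branch office data costs about {ratio:.1f}x as much per unit to back up")
```

Under that assumption, branch office data comes out at roughly three times the per-unit backup cost of everything else.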
We’ve had Data Protection Manager deployed in our production environment since this past summer. At present, we use one Data Protection Manager server, located in the Redmond datacenter, to protect four branch office locations spread across the world.
On the screen here to the side you’re going to see a virtual server image of a Redmond DPM server and a Portland branch office location.
Now, if I’m a user in the Portland site, I’m going to come into work and I’m going to go ahead and make some changes to my files. In this case, let’s say that I decide that I want to make some changes to a PowerPoint that I’m going to present. Now I make the changes and I go ahead and rename that PowerPoint.
Now, as currently architected, our Data Protection Manager server will look on an hourly basis at each one of these sites and replicate any changed data over to the data protection server. Now, in this case, since we don’t have an hour to wait, I’m going to navigate over to the DPM server and I’m going to initiate a synchronization.
Now, what’s interesting about this synchronization is that it happens at the block level. This means that the entire file isn’t replicated over, only the portions of the file that actually change. This has very low impact. In an earlier slide, Ron spoke about the fact that we have roughly a hundred Internet connected offices around the globe. This block level replication is very important in terms of my ability to protect an Internet connected office with a remote backup solution.
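The idea of block-level replication can be sketched in a few lines of Python. This is only an illustration of the general technique (send the blocks that changed rather than the whole file); it compares per-block hashes and is not a description of how Data Protection Manager detects changes internally. The block size and all function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Return {block_index: block_bytes} for blocks of `new` that differ from `old`."""
    old_hashes = block_hashes(old, block_size)
    changes = {}
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        is_new_block = index >= len(old_hashes)
        if is_new_block or hashlib.sha256(block).digest() != old_hashes[index]:
            changes[index] = block
    return changes


def apply_changes(old: bytes, changes: dict, new_length: int,
                  block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file on the replica from the old copy plus changed blocks."""
    # Truncate or zero-pad the old copy to the new length, then patch blocks in.
    buffer = bytearray(old[:new_length].ljust(new_length, b"\0"))
    for index, block in changes.items():
        start = index * block_size
        buffer[start:start + len(block)] = block
    return bytes(buffer)
```

For a file where only one block changed, `changed_blocks` returns a single entry, so only that block would need to cross the WAN link.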
So this synchronization that’s occurring right now is something that in my Portland USP server of roughly 300 gigabytes takes me ten minutes in production. So I can synchronize the data between my DPM server in Redmond and my Portland site in ten minutes on a nightly basis to ensure data protection.
There’s a second step to protecting data with Data Protection Manager, however, and that’s creating the actual shadow copy images that are used to do recoveries.
Now, as currently architected we have nightly shadow copies being done or nightly snaps being done on the part of the DPM server. These nightly snaps are the images that we use to do recovery. Since we can’t wait till tonight I’m going to go ahead and initiate a snap now.
So what I’ve just done is I’ve created an image of the data that I just synchronized. This image of the data can be used for later recovery.
RON MARKEZICH: Now, who would do this? Would you be doing this?
CALVIN KEATON: So the data protection server does this automatically based on a schedule that you determine. At present in production, we synch on an hourly basis and we snap on a nightly basis. But if you have a lot of change at a particular location or a very small amount of change, you can go ahead and modify the schedule to suit your needs. It’s extremely flexible in terms of the ability to suit the replication to the bandwidth available or the change at the site that you’re trying to protect.
So going back to being a user in Portland, I’m going to go ahead and simulate a little user error. Surprisingly enough, the majority of our restore requests that we get, and we work roughly 150 restore requests per month, are a result of clients accidentally deleting a file. So in my present tape backup world it takes roughly 24 hours for me to restore a piece of data in this scenario, the reason being that I have to locate the tape, have the tape loaded in the tape library, initiate the restore process and at the back-end of that make sure that the tape goes to where it needs to be.
In the Data Protection world it looks a little bit different. I’m going to go over to my recovery console. I’m going to highlight the E drive in Portland that I’m protecting. I’m going to highlight the image that I just created. I’m going to go to the PowerPoint that I just changed and I’m going to select recover, tell it to recover now and this will initiate the restore process. So if I go back over the Portland server we’ll find that the file is back where it’s supposed to be.
A process that generally takes me 24 hours with tape is something that I can do almost instantaneously with DPM.
Now, if I go over to the Data Protection Manager server monitoring screen and I look at the jobs we’re going to see that the actual recovery job and the shadow copy job took a very small amount of time, three seconds and five seconds respectively.
Now, as I said earlier, I can protect my Portland USP site with a ten minute nightly replication. This ten minute nightly replication replaces an eight hour backup window that was previously in place to protect the site.
Moving forward, we plan on deploying 14 Data Protection Manager servers to three of our datacenter locations. With these 14 servers, we will protect all 130 branch office locations. What this means is that I can replace 130 tape libraries, media servers and associated support infrastructure with 14 DPM servers located in the datacenter site.
I estimate that I’m going to save $2.8 million over the next two years as a result of this deployment, but that’s not the most exciting thing about this. The most exciting thing is that after this deployment is complete this summer, I’m going to be effectively lights out at every one of these branch office locations. I’ll remotely monitor, manage and protect the data at every one of these 130 branch office locations. I don’t need IT staff, all I need is a person sitting in Redmond at a Data Protection Manager console.
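The figures Calvin quotes can be combined into a few simple ratios. The sketch below uses only numbers from the talk; the even split of the savings across the two years is an assumption made for illustration, and the variable names are hypothetical.

```python
# Figures quoted in the talk.
branch_sites = 130
dpm_servers = 14
old_backup_window_minutes = 8 * 60   # eight hour tape backup window
new_replication_minutes = 10         # ten minute nightly replication
estimated_savings_usd = 2_800_000    # over two years

# Derived ratios.
sites_per_server = branch_sites / dpm_servers
window_reduction = old_backup_window_minutes / new_replication_minutes

# Assumption: savings are spread evenly over the two years.
savings_per_year_usd = estimated_savings_usd / 2

print(f"About {sites_per_server:.1f} branch sites per DPM server")
print(f"Backup window shrinks by a factor of {window_reduction:.0f}")
print(f"Roughly ${savings_per_year_usd:,.0f} saved per year (assumed even split)")
```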
RON MARKEZICH: Great. Now, this is a v.1 product?
CALVIN KEATON: It’s a v.1 product.
RON MARKEZICH: And what phase is it in right now?
CALVIN KEATON: We just released beta. RTM is going to be this summer.
RON MARKEZICH: And how much are you using it today?
CALVIN KEATON: We’re protecting four sites right now with a single DPM server. We’ve actually got a fairly extensive test environment that we’re using to prepare for deployment. Between now and June we’re going to deploy — I actually just saw the schedule recently, we’re going to be deploying the 14 servers over the course of the next three months.
RON MARKEZICH: Great. So we’re coming up to budgeting right now, so I just took notes on that and we’ll make sure we reflect that in your budget next year, Calvin.
CALVIN KEATON: Make sure you take my bonus out of that. (Laughter.)
RON MARKEZICH: Great job.
CALVIN KEATON: Thanks a lot.
RON MARKEZICH: Thanks for joining us. (Applause.)
All right, so that’s an example of something that we’re beta testing. One of the things I love about my job is even though we’re beta testing this product, we’re getting value out of the product immediately. By the time we go ship it, because we’re already running our business on it, because I’m committing to you that we run our business on these products before we ship them, I’m getting value out of them right away. I can start reducing people like Calvin’s budget and allocating those funds elsewhere because he’s getting immediate value out of the dog-fooding of products like Data Protection Manager.
I’m on the last slide on the operations section of this discussion. I cannot leave this discussion without talking about process. When I think of what’s important to me as a CIO I think of three things: People are number one. What makes IT run is not technology but it’s people, dedicated people like Calvin and Tom, people in this audience to make sure that they are knowledgeable, they’re experts in their field, they’re committed and they make these organizations run.
Second is process. It’s very important for me to have standardized processes that are adhered to and understood. And third is technology. I think in general we spend a lot of time talking about technology and we need to spend more time talking about process.
So I want to just emphasize one thing here on Microsoft Operations Framework. MOF, our Microsoft Operations Framework, is based on ITIL, the IT Infrastructure Library, and is something that we drive throughout our organization extremely hard. Once a year we do a MOF assessment to score ourselves on how we’re doing at standardizing and optimizing our core processes across our operations. We look to drive up that score every year and we look to take the feedback from that assessment and drive what we call SIPs, Service Improvement Plans.
So if anyone here has not done an ITIL assessment or a MOF assessment, I’d highly recommend doing one. A lot of companies offer assessments, Microsoft as well as others, but for us as an IT organization, this has been extremely valuable. Typical IT people don’t always think of process as a solution to their problem, but even some of our most anti-process people have really adopted MOF after seeing what the assessments look like.
Applications. The reason I’m talking about applications at MMS, Microsoft Management Summit, is because I truly think the role of the ops center and the role of IT and the jobs that people in this room do are going to expand and move up the stack outside of the core OS hardware critical environment and into both applications and information worker function.
On the application side, a couple things are driving this around the technology. Virtual Server 2005 is changing the way in which our op function does IT. We have created a compute service, which is a big deal for us because as a culture we’re very entrepreneurial with a lot of autonomous decision-making in IT across our apps team. But what we will start doing is centralizing procurement and provisioning of computing power and leveraging Virtual Server as a compute service both in dev and production environment, so in our labs and in our datacenters.
What that’s going to allow us to do is increase the provisioning, accelerate provisioning, accelerate time to market, accelerate SLAs and decrease my costs and it changes the role of that ops function, because they have to ensure appropriate utilization, appropriate availability and appropriate turnaround time of that computing power based on Virtual Server.
It’s going to mimic a service that we created about a year ago with a storage service, where all of our storage now at Microsoft that’s purchased as new storage goes onto a central storage service. We don’t do any decentralized or one-off storage purchases. We have all of our applications with heavy storage needs running off the same service. Compute service is that next line.
The other piece is dynamic app infrastructure. We have two big releases this year in apps: SQL Server 2005 and Visual Studio 2005. For us they’re going to drive significant database server consolidation, which is a great thing for me. They’re also going to have an integrated development platform, meaning we’re not going to have a set of SQL Server developers and a set of Visual Studio developers; we can take those skill sets and use them in both areas. What it also allows me to do is take those people doing development support and maintenance and have them focus on that, and get my IT ops function focused on the components to support those activities.
The other two components on SQL Server 2005 and Visual Studio 2005 have to do with compliance, which is a big deal for us especially around Sarbanes-Oxley and other HIPAA-type regulatory compliance measures, as well as business intelligence. You saw a demo yesterday in Steve’s talk about BI in the ops center. The other thing we’ll be doing is centralizing business intelligence across the entire company so that there’s one business intelligence function that whether you’re reporting out of HR or you’re reporting out of finance or customers, you’re leveraging that same BI service across the entire company; a great model to take activities that happen today in silos that are very much the same and move them into a common service and optimizing that service.
For us, Reporting Services is what’s enabling this. One of the challenges we have had is all of the standard reports that exist across the company. We are pivot table junkies, we love pivot tables. You’ll see 40 meg, 50 meg reports with pivot tables. What happens with that is that managing those reports sitting on file servers around the company, and drilling down and across those reports, is extremely difficult. Reporting Services will allow us to take that into a common infrastructure, with a common management method and common people managing that entire standard reporting environment. It changes the role of that ops function in that ops center.
The next thing I want to hit on is information workers. You can also think of these as knowledge workers. When it comes to the ops center and when it comes to manageability, we need to ensure a dynamic IW infrastructure. For us the foundation of that IW infrastructure is Active Directory. We have Active Directory as the identity management system across the entire environment. All of our applications rely on AD for authentication, all of our network access relies on AD for authentication.
We also will have Network Access Protection to not only ensure access to the network but leverage that AD environment to ensure appropriate access to applications. This is a service that is critical for a central IT ops function to provide to the company. Otherwise you’re going to have different authentication methods in every different group that does applications or services across that organization, whether you’re a central IT group or a very decentralized IT group, which in today’s world with Sarbanes-Oxley and regulatory requirements is a big problem, because one of the big points of Sarbanes-Oxley is ensuring user access is appropriate, that the right people have the right access to the right system and those that don’t have access cannot get at the applications or the data within those systems. It’s very important to have that common IW infrastructure.
We will leverage access control with that both on the network and on the applications across our managed and unmanaged machines as well as all of our devices.
In addition to that, what’s very important for us is to ensure content that gets distributed across the company, whether an e-mail or an Office document, is protected. We use Rights Management Services to do that. This is a technology that has taken my central ops function and helped it raise up the stack, because what that group can do now is provide a service out to everyone in the company so that they can protect the e-mails and documents they create, ensuring they don’t get forwarded or printed to people that should not get them.
An extreme type of user interface drives FIPS compliance and integration with our Smart Cards. In fact, we have 1,000 unique users of this, we’ve got 23,000 users per week so it’s extensively used essentially by every one of our users across the company, whether you’re a vendor or an employee. And if you haven’t seen this capability, you go into an e-mail and it’s as simple as pushing a button to protect that e-mail or pushing a button to protect that document. But what this means to you is that it’s an opportunity for us as a central ops function to provide a service out to the business that provides immediate business value because of the way the technologies have advanced.
Another great example of that is collaboration. We have a very distributed workforce. People all over the world having to collaborate quite a bit. We have been able to solve a core problem, which is users needing to set up collaboration sites, document collaboration and meeting collaboration in silos and move that into a central shared service. We do that with collaboration services using a product called SharePoint, which allows us to set up a service across the company to collaborate on documents, meetings, teams and personal collaboration.
I mentioned earlier Live Communication Services. We now are dog-fooding Office Communicator, which also allows you to collaborate via IM and give you phone integration; a great capability. I was in Boston two weeks ago doing some work in my hotel, forgot my Smart Card — it was back in Redmond — so I’m on via RPC over HTTP, just a standard Internet connection, doing e-mail and all of a sudden an IM window pops up. I’m getting a phone call in my office in Redmond. I can, in Boston, then direct that phone call to my cell phone, which is sitting right next to me, and answer my phone call that was going to Redmond. And so you have the phone integration with your client machine and your PSTN PBX back in your home office that you can now manage off your client wherever you are. A great capability, but also another example of where IT raises up the stack and provides a capability that benefits the company immediately and provides new operations opportunities to drive the same type of discipline, the same type of value that we’ve driven into the infrastructure into collaboration services.
Now, to end I want to talk a little bit about driving packaged automation. One of the things I love hearing about at this session, and I attend the session as a customer like a lot of you, and I love hearing Kirill talk, I love hearing the product people talk about their future for the product, because one of the things that we need to do as a company to benefit all of you is drive packaged automation. We need to ensure that we’ve got software that can automate processes and capabilities that you have so you can stop writing Perl scripts, you can stop having a bunch of one-off applications or manual processes to do things that should be automated.
I’m excited about DSI, Dynamic Systems Initiative, because of the foundation Microsoft System Center will provide DSI. For me that will be the place I look to as a Microsoft customer for packaged automation to run my IT ops function.
You see a lot of products on here, but if I look at that from my lens as a customer I see a lot of elimination of custom apps, of custom scripts and of manual processes because I’m leveraging that packaged automation to drive up my SLAs and drive up my benefits to my business.
So with that, I’m going to leave you and let you enjoy the last day, the second to last day of the event, let you enjoy the party tonight. I look forward to talking to some of you as I’m around for the next day, love talking to those of you that I was able to talk to in the last day.
And the last thing I’ll tell you is the opportunity for the jobs that you’re in, the functions that you do to drive business value and to raise the level of functions and services that you provide your companies has never been greater. Microsoft I know is in there with you to provide technology to enable you to do that and I look forward to seeing you next year.
Thank you very much. (Applause.)