Growing Azure’s capacity to help customers, Microsoft during the COVID-19 pandemic

When Washington state started seeing the first wave of patients fall ill with COVID-19, public health officials and hospitals didn’t have a centralized way to share all the information they needed.

While an existing system could track available hospital beds, state officials couldn’t necessarily tell whether a particular facility had enough ventilators, staffing or equipment like masks or gloves to protect doctors or nurses. Hospitals, focused on identifying and treating infected patients in the early days of the outbreak, struggled to track resources, using a mix of everything from databases and spreadsheets to emails and Post-It notes.

In late March, the Washington State Department of Health approached Microsoft about expanding a cloud-based Hospital Emergency Response Solution powered by Azure. Developed at the request of healthcare providers seeing the first surge of patients, it allowed those hospitals to better track COVID-19 cases, the clinicians available to treat those patients, and the critical equipment and supplies that were becoming hard to find around the globe.

“We all do surge planning and we cache equipment but this exceeded everyone’s planning as we saw everyone wanting the same things on a global scale,” said Erika Henry, emergency operations manager for the Washington Department of Health.

At the same time, Microsoft also began experiencing phenomenal demand for its cloud services as much of the world began sheltering at home due to COVID-19: Entire school districts moved to online learning models, Teams meetings moved from a convenience to an essential way of doing business, kids embraced video games as an outlet for interacting with friends and millions of employees across the globe moved to remote work practically overnight.

At the end of March, for instance, Microsoft Teams set a new daily record of 2.7 billion meeting minutes in one day, up from 900 million minutes just two weeks earlier. In April, that number climbed to 4.1 billion meeting minutes in a single day. Suddenly, Microsoft 365’s teamwork hub that allows people to meet, chat, call and collaborate online was seeing unprecedented usage.

Meanwhile, employees worked across the company to ensure Azure services could continue to scale and to help customers and organizations on the front lines of the pandemic response address their most urgent needs.

Within 12 days, the state was able to launch the Washington Healthcare Emergency and Logistics Tracking Hub (WA HEALTH) built on Microsoft Power Platform, which runs on Microsoft Azure.

“This really helped us understand how do we get patients to the right care at the right place in the right amount of time and also to understand the needs for personal protective equipment across the state — who had what and where were the gaps,” Henry said.

Cloud providers like Microsoft Azure are by their nature designed to expand and scale quickly and meet elastic demand. With more than 60 datacenter regions around the globe (including three new regions announced this past May in Italy, New Zealand and Poland), Microsoft can shift traffic if a natural disaster or power outage affects capacity in one part of the world. Computing hardware to protect against demand spikes is stockpiled in warehouses around the globe, ready to be deployed to where it’s needed most.

By March the epicenter of that demand was in Europe, as countries such as Italy and Spain imposed nationwide lockdowns to slow the spread of the coronavirus. Around the same time, manufacturing issues in China and Southeast Asia due to the global health pandemic created a temporary disruption in the supply chain for certain datacenter hardware as dramatic spikes in usage began challenging computing capacity in some regions. In a Microsoft quarterly earnings call on April 29, the company said those hardware supply chain issues largely began resolving themselves late in the quarter that ended in March.

“The scope and the scale of the response to COVID-19 was completely unprecedented, in terms of how much of the world went digital inside a month,” said Mark Simms, a partner software architect who helped manage the COVID-19 response across Azure. “So the work that we had to do to get through the initial surge in demand and free up capacity for our customers to run critical health and safety workloads was also unprecedented.”

“We made some pretty profound changes in order to do the right thing, and we did them under a very short time frame,” Simms said.

Datacenter employees began working in round-the-clock shifts to install new servers while staying at least six feet apart. Microsoft product teams worked to find any further efficiencies to free up Azure resources for other customers. The company doubled capacity on one of its own undersea cables carrying data across the Atlantic and negotiated with owners of another to open up additional capacity. Network engineers installed new hardware and tripled the deployed capacity on the America Europe Connect cable in just two weeks.

Microsoft’s plan emphasized continuity of service for all its customers, but especially for those on the front lines of the COVID-19 response: health care providers, police and emergency responders, financial institutions, manufacturers of critical supplies, grocery stores and health agencies providing critical information to the public about how quickly the virus was spreading.

The Azure Global and Customer Experience engineering teams mobilized and monitored Azure services around the clock to ensure critical customers could continue to operate smoothly and meet new challenges posed by COVID-19. Employees from across the company volunteered to switch gears to help deploy urgent projects like WA HEALTH.

Microsoft’s Regional Government Emergency Response and Monitoring Solution, on which WA HEALTH was built, uses Microsoft Power Apps Portal and Microsoft Power BI, which run on Azure, to allow healthcare workers to quickly update counts of COVID-19 cases and critical resources at the end of a shift or whenever is convenient through a web portal or by mobile phone.

As the pandemic was unfolding and cases were surging, it allowed state and local health officials to use near-real-time dashboards — based on data from more than 100 acute care hospitals around the state — to coordinate responses. It continues to allow the state to monitor trends, as well as inform decisions such as how quickly counties can move toward safely reopening their economies, Henry said.

“This was a real-time moving event, and people needed to make decisions hour by hour and minute by minute,” said Gary Bird, a principal program manager for Microsoft Power Platform who worked to deploy WA HEALTH. “You really saw the whole company lean in across all fronts to make solutions happen.”

The Washington Healthcare Emergency and Logistics Tracking Hub (WA HEALTH), which runs on Microsoft Azure, allows public health officials and hospitals across the state to track and share key information related to the COVID-19 pandemic.

Balancing supply and demand

Once the pandemic hit, Microsoft’s plans to deal with unexpected spikes in usage and expand Azure’s computing resources kicked in. The company began adding new servers to the hardest hit regions and installing new hardware racks 24 hours a day. To protect the health of critical datacenter employees, Microsoft also quickly established social distancing requirements, provided protective equipment and implemented strict disinfectant policies.

Microsoft prioritized capacity for existing customers while also reserving capacity for first responders who needed to quickly scale life and safety services. It also expanded cloud support for non-profits working to protect people’s health during the pandemic.

Microsoft’s Azure and product teams worked around the clock to ensure that services like Teams, Office and Xbox could meet rapidly exploding demand from customers. Next, they looked for efficiencies across all of Microsoft services running on Azure to free up more capacity for external customers.

Microsoft Teams was the first and most obvious service to experience massive growth. But other Azure services that enable remote work, such as Windows Virtual Desktop, which saw its usage triple in one month, and Azure Active Directory’s Application Proxy, also had to scale dramatically as financial institutions, schools, call centers and other companies moved thousands of employees onto those platforms practically overnight.

Not only were new customers signing up, but suddenly existing users were relying on the tools to power every single meeting or interaction in their workday, said Mark Longton, a principal group program manager for Microsoft Teams.

“Teams went from a service that was cool and convenient and something that people realized was the future of communication to very quickly becoming a mission critical, can’t-live-without-it sort of thing,” Longton said. “Really what this did was accelerate us into the future.”

Microsoft Teams began using early data from China and Italy to plan for expected growth as the pandemic spread. As more countries went into lockdown, dozens of Redmond-based Microsoft employees gathered remotely each Sunday night to watch telemetry, look for bottlenecks and troubleshoot as unprecedented numbers of remote European workers began logging in first thing Monday morning.

“Normally, you find and fix issues organically as you grow. When you take software and put it under explosive growth, with services getting used an order of magnitude more in one day, you tend to find all of those in a really short period of time,” said John Sheehan, Microsoft distinguished engineer for Azure quality.

Once critical services were successfully scaled and stabilized, the company shifted to looking for efficiencies. Microsoft opened up performance data for all services running on Azure to engineers across the company, asking them to submit ideas that would allow Microsoft to provide more computing capacity to the wider pool of Azure customers.

In Teams, small tweaks that few customers would notice — lengthening the time it takes for the three dots to appear when someone else is typing in a Teams chat or disabling the feature that suggests a contact every time you type a new letter in the “to” field — made the system run much more leanly. Engineers rewrote the code to make video stream processing 10 times more efficient in a marathon push over a weekend.
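To see why such small changes can matter at scale, consider how often a client asks the service for work. The sketch below is not Teams code. It illustrates one common general technique, debouncing, in which a lookup fires only after the user pauses typing, and it compares that with a lookup on every keystroke. The function names, the sample timestamps and the 500-millisecond quiet period are all assumptions made for illustration.

```python
from typing import List


def lookups_on_every_keystroke(keystroke_ms: List[int]) -> int:
    """One suggestion lookup is sent for each keystroke."""
    return len(keystroke_ms)


def lookups_when_debounced(keystroke_ms: List[int], quiet_ms: int) -> int:
    """Count lookups when one fires only after the user pauses for quiet_ms.

    keystroke_ms holds keystroke timestamps in milliseconds, ascending.
    """
    lookups = 0
    for i, t in enumerate(keystroke_ms):
        is_last = i == len(keystroke_ms) - 1
        # A lookup fires after this keystroke only if no other keystroke
        # arrives before the quiet period ends.
        if is_last or keystroke_ms[i + 1] - t >= quiet_ms:
            lookups += 1
    return lookups


# Hypothetical typing burst: four quick keystrokes, a pause, then two more.
burst = [0, 100, 200, 300, 1500, 1600]
print(lookups_on_every_keystroke(burst))   # 6
print(lookups_when_debounced(burst, 500))  # 2
```

In this made-up burst, waiting for a pause cuts six lookups to two. Multiplied across tens of millions of users, savings of that kind are what let a service run more leanly without most people noticing.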

Within a week, Teams was also able to spread its reserved capacity across additional datacenter regions, essentially sharing that load within geographies. That’s a process that could have easily taken months, said Aarthi Natarajan, Partner Director of Engineering for Teams.

“We would have probably said, ‘Let’s hit this region first and maybe then this one’ and there would have been a lot of debate,” she said. “In this case, we said, ‘No one is debating anything. We are doing this and we’re doing it globally right now.’”

To handle the Teams growth that now exceeds 75 million daily active users, Microsoft’s Azure Wide Area Network team added 110 terabits of capacity in just two months to the fiber optic network that carries Microsoft’s own data around the globe. It added 12 new edge sites that connect the network to infrastructure owned by local internet providers. Like off-ramps that help funnel vehicles off freeways and onto side roads, edge sites help reduce network congestion and allow data to flow more freely.

Microsoft also moved internal Azure workloads to avoid demand peaks in different parts of the world and to divert traffic from datacenter regions experiencing high demand.
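The idea of moving internal workloads away from busy regions can be sketched in a few lines. This is a simplified illustration, not a description of how Azure actually schedules workloads: the region names, the capacity units, the 80 percent threshold and the `shift_internal_load` function are all hypothetical. It moves only internal load, leaving customer load where it is, and sends it to whichever other region is least utilized.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Region:
    capacity: int       # total capacity, in arbitrary units
    customer_load: int  # external customer load; never moved here
    internal_load: int  # the provider's own workloads; movable

    @property
    def total_load(self) -> int:
        return self.customer_load + self.internal_load


def shift_internal_load(
    regions: Dict[str, Region], threshold_pct: int
) -> List[Tuple[str, str, int]]:
    """Move internal load out of regions above threshold_pct utilization.

    Returns a list of (source, destination, amount) moves and updates
    the regions in place.
    """
    moves: List[Tuple[str, str, int]] = []
    for src_name, src in regions.items():
        while (src.total_load * 100 > threshold_pct * src.capacity
               and src.internal_load > 0):
            others = [n for n in regions if n != src_name]
            if not others:
                break
            # Pick the least-utilized other region as the destination.
            dst_name = min(
                others,
                key=lambda n: regions[n].total_load / regions[n].capacity,
            )
            dst = regions[dst_name]
            # Room left at the destination before it reaches the threshold.
            headroom = dst.capacity * threshold_pct // 100 - dst.total_load
            if headroom <= 0:
                break  # nowhere to put the load
            excess = src.total_load - src.capacity * threshold_pct // 100
            amount = min(excess, src.internal_load, headroom)
            src.internal_load -= amount
            dst.internal_load += amount
            moves.append((src_name, dst_name, amount))
    return moves
```

With three hypothetical regions of 100 units each, where "europe-west" carries 70 units of customer load and 25 of internal load while "us-central" and "asia-east" carry 50 and 60 units in total, calling `shift_internal_load(regions, 80)` moves 15 units of internal load from "europe-west" to "us-central". That brings the busy region back to the threshold without touching customer workloads.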

Xbox, which runs on Azure, is accustomed to ramping up for large surges in usage that typically occur around the holidays or new video game releases. But the COVID-19 stay-at-home orders drove unprecedented growth from people connecting and exploring with their friends and family through gaming, including a 50% increase in multiplayer gameplay and a 30% increase in peak concurrent usage.

As this growth was occurring, Xbox and Azure worked quickly to move gaming workloads out of high demand datacenters in the United Kingdom and Asia — freeing up capacity for other Azure customers without negatively impacting the experiences of its own users.

“There’s no question that in those regions the people who were on the front lines of the COVID-19 efforts really needed that capacity more than us. And our telemetry gave us confidence that we could make these tradeoffs while protecting our customer experience,” said Casey Jacobs, who manages reliability for Xbox operations.

Microsoft also didn’t want to exacerbate network congestion that local internet service providers were experiencing as families began to work, attend college classes, stream movies and play games at home, often all at the same time. Xbox worked closely with regulators and other partners to help manage those risks to internet health, including changing release tactics and developing new features to decrease bandwidth usage during peak times of day.

“We simply knew that we can’t do harm to internet bandwidth that’s needed for first responders, business operations or commerce, period,” said Jacobs.

Microsoft’s commitment to life and safety customers

Early on, Azure Partner Engineering Manager Jeremy Hollett started compiling a global list of customers that provide critical life and safety services — first responders, government agencies, health care providers, energy companies powering hospitals, retailers offering essential supplies, banks processing small business loans — as well as precisely what Azure services they use.

The list included RapidDeploy, which runs on the Microsoft Azure Government Cloud and can quickly deliver unified emergency response solutions for public safety agencies and 9-1-1 telecommunicators.

RapidDeploy has been able to quickly adapt its unified critical response solutions hosted in Microsoft Azure to better assist emergency responders and 9-1-1 telecommunicators during the COVID-19 pandemic. Image courtesy of RapidDeploy.

The city of Baltimore, for instance, requested a RadiusPlus tactical mapping deployment that went live in a matter of days to improve situational awareness including hospital bed count data, mapping data and other signals related to COVID-19 that could help 9-1-1 telecommunicators better route response teams.

After California issued its stay-at-home order to slow the spread of the coronavirus, officials expected an uptick in domestic violence incidents. The state asked RapidDeploy for an upgrade to RadiusPlus that would allow telecommunicators to initiate contact with victims via text so that they can more easily communicate, even when the perpetrator is nearby.

To help ensure that customers like RapidDeploy could continue to deliver lifesaving services without interruption, Microsoft began automatically approving requests for additional Azure capacity from life and safety customers providing essential services during the pandemic.

Hollett’s team also built a dashboard that allowed them to monitor the Azure services for each of those customers and quickly troubleshoot any issues.

“We essentially built a virtual war-room view of these customers. We wanted to know that they were healthy, and if something goes wrong in the night we’d know about it and jump on a call no matter what time it is,” Hollett said. “And if one of those critical life and safety customers created a support case, it elevated that flag straight to us.”

Steve Raucher, co-founder and CEO of RapidDeploy, said the company experienced zero problems spinning up solutions in Azure to address new challenges faced by agencies on the front lines of the COVID-19 response.

“The biggest indicator that everything is working is that we haven’t had any technical problems,” Raucher said. “The technology was working seamlessly before COVID-19, and there’s been no interruption since the pandemic began.”

Top image: Microsoft followed plans to quickly expand Azure’s computing resources during the COVID-19 pandemic, including adding new servers to regions experiencing high demand and installing new hardware racks around the clock. Image by Microsoft. 


Jennifer Langston writes about Microsoft research and innovation.

Chris Stetkiewicz contributed to this story.