Microsoft’s Cloud Outage Postmortem: What Went Wrong in Texas

By: Kurt Mackie

Microsoft this week released its report on what went wrong with its cloud on Sept. 4, when a lightning strike at its U.S. South Central datacenter region caused a widespread service outage.

The outage affected regional operations (“South Central U.S.”) served from Microsoft’s datacenters in San Antonio, Texas. However, it also affected Azure Active Directory and other services more broadly, in some cases worldwide.

The lightning strike damaged Microsoft’s storage servers. To restore service, Microsoft had to provision new servers and then transfer and validate customer data, according to its Sept. 4 Azure Status History page account.

The effects weren’t confined to Azure services. Users of Office 365 services such as Exchange Online, SharePoint Online, Microsoft Dynamics and Microsoft Teams also were affected. In addition, developers were blocked from using Visual Studio Team Services (since renamed “Azure DevOps”), Microsoft’s collaboration service for developers.

Buck Hodges, director of engineering for Azure DevOps, provided some analysis, indicating that services were out for more than a day. Services went down on Sept. 4 at 9:45 UTC (Sept. 4 at 2:45 a.m. PDT). Incidents ended on Sept. 6 at 0:05 UTC (Sept. 5 at 5:05 p.m. PDT), according to Hodges.

Bad Failover Options
In a nutshell, Microsoft currently lacks good failover options in its South Central U.S. datacenter region. The services there depend on Azure Storage, which offers two recovery options in an outage, but neither would have kept services fully running without risking the loss of customer data.

Here’s how Hodges explained that dilemma:

Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of not being usable since code could neither be checked in nor the output of builds be saved (and thus not deployed). Additionally, failing over to the backed up DBs, once the backups were restored, would have resulted in data loss due to the latency of the backups.
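The trade-off Hodges describes can be sketched as a read/write fallback pattern: with read-access geo-redundant storage, reads can fail over to a secondary region, but writes must wait for the primary to recover. The following is a minimal, hypothetical illustration, not Microsoft's actual code; the account name and endpoint shapes are assumptions for the sketch.

```python
# Hypothetical sketch of the recovery dilemma: reads can fall back to a
# read-only secondary copy, but writes (check-ins, build output) cannot.
# The "vstsdata" account name and URL shapes are illustrative assumptions.

PRIMARY = "https://vstsdata.blob.core.windows.net"              # South Central US
SECONDARY = "https://vstsdata-secondary.blob.core.windows.net"  # paired region

def read_blob(fetch, path):
    """Try the primary endpoint; fall back to the read-only secondary."""
    try:
        return fetch(f"{PRIMARY}/{path}")
    except ConnectionError:
        # The secondary is read-only and may lag the primary, which is why
        # serving Git/TFVC and Build this way would degrade those services.
        return fetch(f"{SECONDARY}/{path}")

def write_blob(fetch, path, data, primary_up):
    """Writes have no fallback: they must wait for the primary to recover."""
    if not primary_up:
        raise RuntimeError("primary storage unavailable; writes must wait")
    return fetch(f"{PRIMARY}/{path}", data)
```

During the outage, the write path above is exactly what was blocked: code could not be checked in and build output could not be saved.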

Microsoft has a better recovery solution that could address incidents such as the lightning strike, Hodges noted. That solution, called “Availability Zones,” became generally available back in March, but it is currently offered in just five Azure regions, and South Central U.S. is not among them.

Cross-Service Dependency Issues
One might assume that Microsoft designs its datacenters to avoid any single point of failure. However, the South Central outage “also impacted customers globally due to cross-service dependencies,” Hodges noted.

For instance, the Marketplace service hosted in the South Central datacenter is the main instance serving the United States, so the outage blocked the ability to get Visual Studio extensions across the country.

Similarly, service dashboards showed errors for users in other regions because of “a non-critical call to the Marketplace service,” Hodges explained. Moreover, Microsoft was “unable to update service status due to a dependency on Azure Storage in South Central US.”
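The dashboard failure illustrates a general pattern: a non-critical dependency should degrade gracefully rather than break the page. The sketch below is a generic illustration of that isolation, assuming hypothetical `get_status` and `get_marketplace_banner` callables; it is not Microsoft's dashboard code.

```python
# Illustrative sketch (not Microsoft's code) of isolating a non-critical
# dependency: if the cross-region Marketplace call fails, render the
# dashboard without it instead of showing errors to users everywhere.

def render_dashboard(get_status, get_marketplace_banner):
    status = get_status()  # critical data: let failures here propagate
    try:
        banner = get_marketplace_banner()  # non-critical cross-region call
    except Exception:
        banner = None  # degrade gracefully: omit the banner, keep the page
    return {"status": status, "banner": banner}
```

Hodges' fix, noted later in the article, amounts to exactly this kind of guard around the failed Marketplace call.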

Visual Studio Team Services users in the United States also couldn’t use the Release Management and Package Management services during the outage because those services are hosted in the South Central region.

Hodges didn’t explain why Azure Active Directory services were affected by the South Central outage. The peculiarity of Azure services being affected in this way was noted by Microsoft Most Valuable Professional Andrew Connell in an hour-long Microsoft Cloud Show podcast episode on the outage. Connell also singled out the Microsoft Graph team for not communicating about the effects of the incident.

Azure Active Directory services apparently were affected globally because of the rerouting of traffic, according to the Azure Status History page’s account:

The design for AAD includes globally distributed sites for high availability so, when infrastructure began to shutdown in the impacted datacenter, authentication traffic began automatically routing to other sites. Shortly thereafter, there was a significantly increased rate in authentication requests. Our automatic throttling mechanisms engaged, so some customers continued to experience high latencies and timeouts.

The Azure Application Insights service was affected across multiple regions by the South Central outage because of a “dependency on Azure Active Directory and platform services that provide data routing,” according to the Azure Status History page. New Azure subscriptions were also affected by the outage.

Microsoft’s Next Steps
Hodges noted that Microsoft has already fixed the dashboard reporting problem where failed calls were made to the Marketplace. He also described some future objectives.

Microsoft is planning to move South Central U.S. services into Azure Availability Zones, as they become available, for better resiliency. Microsoft also is considering permitting asynchronous replication across regions. It wants to add redundancy for its tooling across more than one region, and it plans to regularly conduct failover testing of Visual Studio Team Services, among other plans in response to the outage.