Postmortem: VSTS Outage – 4 September 2018

Azure DevOps SRE

On Tuesday, 4 September 2018, VSTS (now called Azure DevOps) suffered an extended outage affecting customers with organizations hosted in the South Central US region (one of the 10 regions globally hosting VSTS customers). The outage also impacted customers globally due to cross-service dependencies. It took more than 21 hours to recover all VSTS services in South Central US because their recovery depended on Azure restoring the datacenter. After VSTS services were recovered, we had an additional two-hour incident impacting Release Management in South Central US, caused by a database that went offline. We also had intermittent failures for some Git and Package Management customers while the Azure storage accounts were being restored.

First, I want to apologize for the very long VSTS outage for our customers hosted in the affected region and for the impact it had on customers globally. This incident was unprecedented for us. It was the longest outage for VSTS customers in our seven-year history. I’ve talked to customers on Twitter, by email, and by phone whose teams lost a day or more of productivity. We let our customers down. It was a painful experience, and for that I apologize.

What happened?

The incident started with a high energy storm, including lightning strikes, that hit southern Texas near the South Central US datacenters. This resulted in voltage sags and swells across the utility fields that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process. VSTS services in South Central US became unavailable as a result. You can read more about the impact on the datacenter on Azure Status.

VSTS has multiple services with scale units (SUs) in South Central US. Beginning at 09:45 UTC 4 September 2018, all VSTS SUs in South Central US were down due to the datacenter power down process.

In addition to VSTS organizations hosted in the South Central US region, some global VSTS services hosted there, such as the Marketplace, were also affected. That led to global impact, including the inability to acquire extensions (including for VS and VS Code), general slowdowns, errors in Dashboard functionality, and the inability to access user profiles stored in South Central US.

Users with VSTS organizations hosted in the US were also unable to use the Release Management and Package Management services, and build and release pipelines using the Hosted macOS queue failed. Additionally, the VSTS status page was out of date because it used data in South Central US, and the internal tools we use to post updates for customers are also hosted there.

VSTS services began to recover as Azure recovered the datacenter. Almost all services self-recovered; a couple had to be restarted manually.

After services had recovered, additional issues occurred for some customers in Git, Release Management, and Package Management as Azure’s recovery efforts continued. The incident for VSTS ended at 00:05 UTC 6 September 2018.

Why didn’t VSTS services fail over to another region?

We never want to lose any customer data. A key part of our data protection strategy is to store data in two regions using Azure SQL DB Point-in-time Restore (PITR) backups and Azure Geo-redundant Storage (GRS). This enables us to replicate data within the same geography while respecting data sovereignty. Only Azure Storage can decide to fail over GRS storage accounts. Even if Azure Storage had failed over during this outage, we would still have waited on recovery rather than accept the data loss that fail over could have caused.

Azure Storage provides two options for recovery in the event of an outage: wait for recovery or access data from a read-only secondary copy. Using read-only storage would degrade critical services like Git/TFVC and Build to the point of being unusable, since code could not be checked in and the output of builds could not be saved (and thus not deployed). Additionally, failing over to the backed-up DBs, once the backups were restored, would have resulted in data loss due to the latency of the backups.
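For context on the read-only option, here is a minimal sketch (not how VSTS is actually wired up) of reading from an RA-GRS secondary using the current azure-storage-blob Python SDK; the account name and credential are hypothetical. The secondary endpoint only serves reads, which is why services that must write, such as Git pushes or saving build output, cannot run against it.

```python
from azure.storage.blob import BlobServiceClient

ACCOUNT = "examplevstsstore"      # hypothetical storage account name
CREDENTIAL = "<account key>"      # placeholder credential

primary = BlobServiceClient(
    f"https://{ACCOUNT}.blob.core.windows.net", credential=CREDENTIAL)
secondary = BlobServiceClient(
    f"https://{ACCOUNT}-secondary.blob.core.windows.net", credential=CREDENTIAL)

def read_blob(container: str, name: str) -> bytes:
    """Read from the primary region, falling back to the geo-secondary copy."""
    try:
        return primary.get_blob_client(container, name).download_blob().readall()
    except Exception:
        # During a primary-region outage, reads can be served from the secondary...
        return secondary.get_blob_client(container, name).download_blob().readall()

# ...but writes cannot: the secondary endpoint is read-only, so an upload such as
# secondary.get_blob_client(container, name).upload_blob(data) fails until the
# primary region recovers.
```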

The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.

Availability Zones within a region

There are many failure modes affecting one or more datacenters in a region that can be addressed by using Azure’s recently introduced Availability Zones. The first Availability Zone regions became generally available in March 2018, and currently there are five regions providing them. Here’s the description from the documentation.

Availability Zones are a high-availability offering that protects your applications and data from datacenter failures. Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more datacenters equipped with independent power, cooling, and networking. To ensure resiliency, there’s a minimum of three separate zones in all enabled regions. The physical separation of Availability Zones within a region protects applications and data from datacenter failures. Zone-redundant services replicate your applications and data across Availability Zones to protect from single-points-of-failure.

Availability Zones are designed to provide protection against incidents such as the lightning strike that affected South Central US. They provide low-latency, high-bandwidth connections that enable synchronous replication within a region due to the proximity of the datacenters. For example, Azure SQL is able to spread its nodes across the zones. Availability Zones would enable VSTS services in a region to remain available so long as the entire region does not become unavailable.

Synchronous replication across regions

Synchronous replication across regions involves committing every modification to persistent data in both regions before returning a response to the caller writing that data. At first blush, it seems like the ideal solution: guaranteeing that both regions have committed the data means that no fail over would lose data.

However, the reality of cross-region synchronous replication is messy. For example, the region paired with South Central US is North Central US. Even at the speed of light, it takes time for the data to reach the other datacenter and for the original datacenter to receive the response. The round-trip latency is added to every write, which adds approximately 70ms for each round trip between South Central US and North Central US. For some of our key services, that’s too long. Machines slow down and networks have problems for any number of reasons. Since every write only succeeds when two different sets of services in two different regions can successfully commit the data and respond, there is twice the opportunity for slowdowns and failures. As a result, either availability suffers (writes halt while waiting for the secondary commit) or the system must fall back to asynchronous replication.
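To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. The ~70ms round trip comes from the paragraph above; the local commit latency and per-region availability figures are assumptions for illustration only, not measurements from our services.

```python
# Illustrative only: the 70 ms round trip is quoted above; the other numbers are assumptions.
local_commit_ms = 5            # assumed commit latency within a single region
cross_region_rtt_ms = 70       # approx. round trip South Central US <-> North Central US

# A synchronous write waits for the local commit, the cross-region round trip,
# and the remote commit before responding to the caller.
sync_write_ms = local_commit_ms + cross_region_rtt_ms + local_commit_ms

# If each region's write path succeeds 99.9% of the time, requiring both regions
# to commit roughly doubles the chance of a slowdown or failure on any given write.
per_region_success = 0.999
sync_success = per_region_success ** 2

print(f"write latency: ~{local_commit_ms} ms locally vs ~{sync_write_ms} ms with synchronous replication")
print(f"write success rate: {per_region_success:.3%} locally vs {sync_success:.3%} with synchronous replication")
```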

Asynchronous replication across regions

Relaxing the requirement that all writes always be synchronously committed across regions or fail means that data is written asynchronously to the secondary region, which is the fall-back mode for synchronous writes. If the asynchronous copy is fast, then under normal conditions the effect is essentially the same as synchronous replication, with the data written to the secondary at about the same time as the primary.

Putting it all together

Moving forward, we plan to address failures within a region by using Availability Zones, and we have begun the work that will allow VSTS to use them. They are currently supported in five regions, with more coming online in the future. South Central US and some of the other regions hosting VSTS services do not have Availability Zones yet, so we will need to move our services to regions that do, which will take time. Since not all geographies currently have Availability Zones, VSTS services in those geographies will not be moved, in order to continue to honor data sovereignty. This also means that VSTS services in those geographies would not be available during an incident like this one in South Central US.

Achieving perfect synchronous replication across regions, guaranteeing zero data loss on every fail over at any point in time, is not possible for every service that also needs to be fast. Our goal is to provide the best balance among the competing objectives of high performance and high availability.

Addressing the failure of a region is a hard problem. We are investigating the feasibility of asynchronously replicating data across regions. For Azure SQL, we would use active geo-replication. For Azure Storage, we would asynchronously write data to both the primary and secondary regions on every write, which would give us a second copy in the secondary region that is ready to use in the event of a fail over. This would take advantage of the standard Azure Storage SLA, which is not available with GRS. We would need to provision compute and other resources in the paired datacenter, either at the time of fail over or kept always available as a warm standby.
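As a minimal sketch of the dual-write idea (purely illustrative Python with in-memory stand-ins for the regional stores, not our implementation or the Azure SDK), the caller waits only on the primary-region commit while a background worker applies the same write to the paired region:

```python
import queue
import threading

# In-memory stand-ins for blob stores in the primary and paired secondary regions.
primary_store: dict = {}
secondary_store: dict = {}

replication_queue: queue.Queue = queue.Queue()

def replicate_worker() -> None:
    # Drain queued writes and apply them to the secondary region's copy.
    while True:
        key, data = replication_queue.get()
        secondary_store[key] = data          # the secondary may briefly lag the primary
        replication_queue.task_done()

threading.Thread(target=replicate_worker, daemon=True).start()

def save(key: str, data: bytes) -> None:
    """Commit in the primary region and return; copy to the paired region asynchronously."""
    primary_store[key] = data                # the caller only waits on the local commit
    replication_queue.put((key, data))       # builds the ready-to-use copy for fail over

save("build-output/123", b"...")
replication_queue.join()                     # on fail over, either drain the queue or accept the lag
```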

Given all of the above, I still need to address the question of when I would fail over to another region if there is data loss. I fundamentally do not want to decide for customers whether or not to accept data loss. I’ve had customers tell me they would take data loss to get a large team productive again quickly, and other customers have told me they do not want any data loss and would wait on recovery for however long that took. The other challenge in this situation is that the decision has to be made in the face of incomplete information about how long the outage will continue. ETAs are notoriously difficult to get right and are almost always too optimistic.

The only way to satisfy both is to give customers the ability to choose to fail over their organizations in the event of a region being unavailable. We’ve started to explore how we might give customers that choice, including an indication of whether the secondary is up to date and possibly a manual reconciliation step once the primary datacenter recovers. This is really the key to whether or not we should implement asynchronous cross-region fail over. Since it’s something we’ve only begun to look into, it’s too early to know if it will be feasible.

Other Problems

I will briefly cover some of the other issues we encountered during the incident.

Global impact: Some of our services in South Central US provide capabilities for VSTS services in other regions. The Release Management (RM) service in South Central US serves as the main host for RM users throughout the US. The Marketplace service provides extensions for users of VS, VSTS, and VS Code and is a single-instance service. As I mentioned before, we will be working on moving these services into regions with Availability Zones.

Degraded performance: The User service is responsible for serving user profiles. Each user is in exactly one of the multiple User service instances globally. During the incident, services calling the User service in South Central US failed slowly. Normally, the circuit breaker wrapping these calls opens to fail fast and provide a gracefully degraded experience. Unfortunately, due to a difference in URL patterns, the breaker wraps calls to all User service instances rather than scoping them to each instance. Because of that, the failure rate was too low for the breaker to open (only calls to South Central US failed, so most calls succeeded), resulting in customers in other regions experiencing a very slow web UI if their profile was hosted in South Central US. To eliminate the performance impact during the incident, we opened the circuit breakers manually to turn off the profile experience for all users. We are changing the circuit breakers for User service calls to scope them to specific instances. We also found excessive retries to retrieve user preferences that slowed the user experience during the incident.
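For illustration, here is a hedged sketch of the scoping difference (hypothetical names and thresholds; this is not our production breaker). With one breaker per target instance, the instance in the failing region trips on its own, while a single breaker shared across all instances sees a diluted failure rate and never opens:

```python
import time

class CircuitBreaker:
    """Minimal count-based breaker: opens when the recent failure rate is high."""

    def __init__(self, failure_threshold=0.5, window=100, open_seconds=30):
        self.failure_threshold = failure_threshold
        self.window = window
        self.open_seconds = open_seconds
        self.results = []            # rolling window of True (success) / False (failure)
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at and time.time() - self.opened_at < self.open_seconds:
            return False             # open: fail fast and degrade gracefully
        return True

    def record(self, ok: bool) -> None:
        self.results = (self.results + [ok])[-self.window:]
        if self.results.count(False) / len(self.results) >= self.failure_threshold:
            self.opened_at = time.time()

# One breaker per User service instance, keyed by its base URL. A single shared
# breaker would mix failures from the unhealthy region with successes from the
# healthy ones, keeping the failure rate below the threshold so it never opens.
breakers: dict = {}

def call_user_service(instance_url: str, request_fn):
    breaker = breakers.setdefault(instance_url, CircuitBreaker())
    if not breaker.allow():
        return None                  # degraded profile experience instead of a slow failure
    try:
        response = request_fn()
        breaker.record(True)
        return response
    except Exception:
        breaker.record(False)
        return None
```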

Dashboard errors: Users in other regions saw errors on their Dashboards because of a non-critical call to the Marketplace service to get the URL for an extension. This area had not been tested for graceful degradation, and we have scheduled fault injection testing for Dashboards to begin immediately.
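The fix, roughly, is to treat the Marketplace lookup as optional. A hedged sketch (the client and method names are hypothetical, not the actual Dashboards code):

```python
def extension_url(marketplace_client, extension_id: str):
    """Non-critical lookup: a Marketplace failure should not break the Dashboard."""
    try:
        return marketplace_client.get_extension_url(extension_id)  # hypothetical call
    except Exception:
        return None  # render the widget without the link instead of failing the Dashboard
```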

Incorrect service status: During the initial few hours of the incident, we were unable to update service status due to a dependency on Azure Storage in South Central US. We are already building a new service status portal that will not only be resilient to region-specific outages but also improve how we communicate during outages by making it much easier for customers to see the status of their own organizations. We also had trouble accessing the credentials used with our communication tools because of Azure AD issues occurring early in the incident.

Service startup issue: During the recovery, a subset of the virtual machines (VMs) for a service were running, but there were not enough of them to serve the incoming load. We have a process called VssHealthAgent that monitors the health of the VMs. The VMs appeared to be unhealthy, so it took each one out of the load balancer in sequence, gathered diagnostic information, restarted it, and reinserted it into the load balancer. While it is written not to take out more than one VM at a time and not to collect memory dumps more than once per hour, this incident exposed a timing problem: because all of the application tier (AT) machines were overloaded for a long period of time, the dump collection process reran very soon after a machine came back into the load balancer. We fixed the bug to limit the number of times an instance is removed from the load balancer in a sustained degraded state like this. While the impact of this issue was minor, I’ve included it as an example of the challenges inherent in automatic mitigations.
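A hedged sketch of the kind of limit we added (hypothetical names and thresholds; the real VssHealthAgent logic is more involved):

```python
import time

MAX_REMOVALS = 3                 # assumed cap on removals per instance per window
WINDOW_SECONDS = 6 * 60 * 60     # assumed window representing a sustained degraded state
removal_history: dict = {}       # instance name -> timestamps of recent removals

def may_remove_from_load_balancer(instance: str) -> bool:
    """Allow removal for diagnostics only if this instance hasn't been cycled too often."""
    now = time.time()
    recent = [t for t in removal_history.get(instance, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REMOVALS:
        removal_history[instance] = recent
        return False             # sustained degradation: stop cycling instances for dumps
    recent.append(now)
    removal_history[instance] = recent
    return True
```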

Next Steps

Here is a summary of the changes we are making based on what we learned from this incident.

  1. In supported geographies, move services into regions with Azure Availability Zones to be resilient to datacenter failures within a region.
  2. Explore possible solutions for asynchronous replication across regions.
  3. Regularly exercise fail over across regions for VSTS services using our own organization.
  4. Add redundancy for our internal tooling so that it is available in more than one region.
  5. Fix the regression in Dashboards where failed calls to Marketplace made Dashboards unavailable.
  6. Review circuit breakers for service-to-service calls to ensure correct scoping (an issue surfaced by the calls to the User service).
  7. Review gaps in our current fault injection testing exposed by this incident.

I apologize again for the very long disruption from this incident.

Sincerely,

Buck Hodges (@tfsbuck)

Director of Engineering, Azure DevOps
