Postmortem – VS Marketplace outage – 4 September 2018

Azure DevOps SRE

On Tuesday, 4 September 2018, Visual Studio Marketplace suffered an extended outage affecting most of its customers. Marketplace hosts and serves extensions for the Visual Studio IDE, Visual Studio Code, and Azure DevOps. This was the first instance of the Marketplace service going down completely, and we sincerely apologize for the outage.

What happened and resultant customer impact

The Azure resources that Marketplace depends on (largely Compute, Storage, and SQL) were down in Azure South Central US during the incident, which took the single-instance Marketplace service down completely from 2018-09-04 09:45 UTC to 2018-09-05 04:30 UTC. The following scenarios were impacted:

  • For extension consumers:
    • Browsing or searching extensions within VS IDE, VS Code or directly at Marketplace
    • Extension acquisition
    • Extension update of already acquired extensions
  • For extension publishers:
    • Extension publish: both new and update to existing extensions

Note: Already acquired extensions were not impacted and continued to operate as usual during the outage. Even already acquired Azure DevOps extensions, which take a dependency on the Marketplace at runtime, continued to operate normally because the Marketplace service makes extensive use of a CDN.

Once the Azure region recovered, the Marketplace service came back up shortly afterward. However, we observed slowness in Marketplace commands from 05:00 UTC to 10:00 UTC. The incoming load pushed CPU utilization above 95% on our backend DB, so we throttled extension search requests during that period. As a result, some customers intermittently experienced failed or delayed commands. We then scaled out the DB to handle the load and stopped throttling requests.
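As an illustration only, the throttling described above can be sketched as a simple load-shedding gate. This is not the Marketplace implementation: the class name SearchThrottle, the admit_every parameter, and the request kinds are hypothetical, and only the 95% CPU figure and the fact that search requests were the ones throttled come from this postmortem.

```python
class SearchThrottle:
    """Hypothetical gate that sheds search requests while the DB is hot."""

    def __init__(self, cpu_threshold=95.0, admit_every=2):
        # cpu_threshold: percent; taken from the >95% figure in the text.
        # admit_every: assumed policy, admit 1 of every N searches while hot.
        self.cpu_threshold = cpu_threshold
        self.admit_every = admit_every
        self._seen_while_hot = 0

    def admit(self, kind, db_cpu_percent):
        """Return True if the request should be served, False if throttled."""
        # Only search requests are throttled; other commands pass through.
        if kind != "search":
            return True
        # Below the threshold, nothing is throttled.
        if db_cpu_percent <= self.cpu_threshold:
            return True
        # While hot, admit every Nth search request and reject the rest.
        self._seen_while_hot += 1
        return self._seen_while_hot % self.admit_every == 0
```

A gate like this trades failed or delayed searches for a backend that stays responsive; once the DB is scaled out and CPU drops below the threshold, every request is admitted again.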

Why didn’t we fail over to another Azure region?

The Marketplace service is built on the same infrastructure as the rest of Azure DevOps. The Azure DevOps service outage analysis explains why the service was not failed over to another Azure region.

During this incident we also noticed other problems:

  1. The error message shown to users visiting Marketplace could have been more helpful.
  2. VS IDE and VS Code didn’t provide helpful error messages when searching, acquiring or recommending extensions.

Next Steps

The following are the changes we are making based on what we learned from this incident:

  1. Make the Marketplace highly available: We will invest in higher availability by creating a hot standby of the Marketplace service in a different geographical region. The Marketplace service can also benefit from using availability zones within an Azure region, which we will leverage as part of Azure DevOps’s move to use availability zones.
  2. Improve error experience and messages: We are working to make error messages more user friendly, in the highly unlikely event that this incident recurs. A major part of this work has already been completed.

Sincerely,

Sanjay Malpani,

Engineering Manager,

Azure DevOps
