Postmortem – Availability issues with Visual Studio Team Services on 16 November 2017

Azure DevOps SRE

On 16 November 2017 we had a global incident with Visual Studio Team Services (VSTS) that had a serious impact on the availability of our service (https://blogs.msdn.microsoft.com/vsoservice/?p=15526). We apologize for the disruption. Below we describe the cause and the actions we are taking to address the issues.

Customer Impact

This was a global incident that caused performance issues and errors across all instances of VSTS, impacting many different scenarios. The incident occurred within Shared Platform Services (SPS), which contains identity, account, and commerce information for VSTS.

The incident started on 16 November at 12:15 UTC and ended at 12:40 UTC. The same issue occurred again the same day from 16:30 until 16:35 UTC.

The graph below shows the number of impacted users during the incident.

What Happened

The Commerce service in SPS is responsible for billing events. It’s the service in VSTS that interfaces with Azure Commerce to support purchasing extensions, VS subscriptions, pipelines, etc. The Commerce service is one of several services, including identity and account, which run as one service in SPS.

A change to a stored procedure that fetches subscription information for an account resulted in high TempDB contention. The problem was a join condition using the OR operator, which yielded an inefficient query plan. You can see the TempDB usage in the diagram below (Table Spool in the diagram). To fix it, we removed the OR and added hinting to force a good query plan.
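As a rough illustration of this kind of rewrite (not the actual stored procedure), the sketch below replaces an OR join condition with two single-condition joins combined by UNION ALL. Everything in it is an assumption made for the example: the table names `account` and `subscription`, the columns `owner_account_id` and `billing_account_id`, and the sample rows are hypothetical. It uses SQLite through Python's standard library so that it runs anywhere. SQLite has no TempDB or Table Spool operator and does not accept SQL Server query hints, so this only shows that the two query shapes return the same rows, not the performance difference or the hinting.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE subscription (
            id INTEGER PRIMARY KEY,
            owner_account_id INTEGER NOT NULL,
            billing_account_id INTEGER NOT NULL
        );
        CREATE INDEX ix_sub_owner ON subscription (owner_account_id);
        CREATE INDEX ix_sub_billing ON subscription (billing_account_id);
        INSERT INTO account VALUES (1, 'contoso'), (2, 'fabrikam');
        INSERT INTO subscription VALUES
            (10, 1, 1), (11, 1, 2), (12, 2, 2), (13, 2, 1);
        """
    )
    return conn


# Original shape: one join whose condition uses OR across two columns.
OR_JOIN_SQL = """
SELECT s.id AS subscription_id
FROM account AS a
JOIN subscription AS s
  ON s.owner_account_id = a.id OR s.billing_account_id = a.id
WHERE a.id = ?
"""

# Rewritten shape: one branch per condition, combined with UNION ALL.
# The second branch excludes rows already returned by the first, so
# UNION ALL yields no duplicates (both columns are NOT NULL here).
UNION_ALL_SQL = """
SELECT s.id AS subscription_id
FROM account AS a
JOIN subscription AS s ON s.owner_account_id = a.id
WHERE a.id = ?
UNION ALL
SELECT s.id AS subscription_id
FROM account AS a
JOIN subscription AS s ON s.billing_account_id = a.id
WHERE a.id = ?
  AND s.owner_account_id <> a.id
"""


def subscriptions_or_join(conn, account_id):
    """Return sorted subscription ids using the OR join condition."""
    rows = conn.execute(OR_JOIN_SQL, (account_id,)).fetchall()
    return sorted(row[0] for row in rows)


def subscriptions_union_all(conn, account_id):
    """Return sorted subscription ids using the UNION ALL rewrite."""
    rows = conn.execute(UNION_ALL_SQL, (account_id, account_id)).fetchall()
    return sorted(row[0] for row in rows)
```

Each branch of the UNION ALL version filters on a single column, so an optimizer can satisfy it with a seek on that column's index. Plain UNION would also remove the overlap between the branches, at the cost of a duplicate-elimination step.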

 

Next Steps

Beyond the immediate fix to the SQL stored procedure, we are taking the following steps to prevent the issue going forward.

  1. We missed the OR in code review, so we are making sure engineers understand our SQL guidelines, including using UNION or UNION ALL rather than OR and hinting queries.
  2. While investigating this incident, we found and fixed three other stored procedures that were suboptimal.
  3. We had already begun work to pull the Commerce service out of the SPS service and separate it from identity and account. That work is well under way and will start going into production in January. This will ensure that an incident like this will be contained within the Commerce service and not affect critical operations like authentication.
  4. We are working on partitioning SPS. We currently have a dogfood instance in production, but the access pattern that triggers the issue was not present there (it has too few subscriptions). We have engineers dedicated to implementing a partitioned SPS service, which will allow for an incremental, ring-based deployment that limits the impact of issues. That is scheduled to begin deployment to production in early summer.

We again apologize for the disruption this caused you. We are fully committed to improving our service so that we can limit the damage from an issue like this and mitigate it more quickly if it occurs.

 

Sincerely,

Buck Hodges
Director of Engineering, VSTS
