{"id":11351,"date":"2016-02-05T21:59:21","date_gmt":"2016-02-05T21:59:21","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/bharry\/?p=11351"},"modified":"2019-02-16T22:46:27","modified_gmt":"2019-02-16T22:46:27","slug":"vs-team-services-incidents-on-feb-3-4","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/bharry\/vs-team-services-incidents-on-feb-3-4\/","title":{"rendered":"VS Team Services incidents on Feb 3 &#038; 4"},"content":{"rendered":"<div class=\"post\">\n<div class=\"body\">\n<div id=\"6745a34b-7b9b-4d63-98ce-ed62a336b5a5\" class=\"postBody\" style=\"margin: 4px 0px 0px;border-width: 0px;padding: 0px\">\nWe had two serious availability incidents in the past 2 days.\u00a0 I&#8217;m incredibly sorry for the problems that these have caused.\u00a0 Increasingly we are bringing on more and more customers who rely on this service day in and day out, and any significant interruption is entirely unacceptable.\u00a0 We know it.\u00a0 We live it.\u00a0 We breathe it and, in the last 2 days, we have failed twice.\u00a0 We take this responsibility very seriously and continue to work to ensure such things do not happen.\nBoth of these incidents were service-wide incidents that affected large numbers of customers and critical scenarios.<\/p>\n<h3>Worldwide outage causing login problems and extraordinarily slow response times<\/h3>\n<ul>\n<li>Start time: 09:11 GMT on Feb 4<\/li>\n<li>Duration: 5 hours, 20 minutes<\/li>\n<\/ul>\n<p>Let me start by characterizing the incident a bit.\u00a0 The root problem was in a service called SPS that handles accounts, identities, authorizations and the like.\u00a0 It\u2019s really the last remaining \u201cglobal\u201d service in VSTS.\nBelow you see a chart of failed requests (red) and succeeded requests (blue), bucketed into 5-minute intervals, during the period of the incident.\u00a0 I\u2019ve marked 4 important points on the graph.<\/p>\n<ol>\n<li>The incident started.\u00a0 You can see a dramatic rise in 
failed requests starting at 09:00 GMT.\u00a0 In the 4 hours that followed, a fairly common (for incidents) sequence happened \u2013 we started getting alerts, our 24&#215;7 team investigated, confirmed the problem, called engineers and got them out of bed (which involved engineers from my team and SQL Azure) and worked on diagnosing it.<\/li>\n<li>At this point we understood what was happening and we modified a stored procedure to fail immediately so that, rather than the whole system being down, only a subset of functionality was broken; you can see the volume of failed requests fall (not back to normal, but a lot) and succeeded requests rise.<\/li>\n<li>Now we really understood exactly what was happening and believed we had the \u201creal\u201d fix for the problem, one that should completely restore the system to normal.\u00a0 We applied a fix and removed the short circuit in the stored procedure.\u00a0 The fix was not quite right and the error count shot back up.<\/li>\n<li>Within about 15 minutes, we had adjusted the fix, applied it and everything returned to normal.<\/li>\n<\/ol>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/LSI-Blog-Screenshot1.png\"><img decoding=\"async\" class=\"alignnone size-large wp-image-11352\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/LSI-Blog-Screenshot1-1024x392.png\" alt=\"LSI Blog Screenshot1\" width=\"879\" height=\"336\" \/><\/a>\nSo what happened?\u00a0 The root cause is that we (the Team Services team) changed the SQL Azure query processor compat level from 100 (SQL Server 2008) to 120 (SQL Server 2014) on one of the SPS databases.\u00a0 Over time, we had some drift in database compat levels in the service and we were trying to clean them up.\nIn the SQL Server 2014 query optimizer, significant changes were made to cardinality estimation.\u00a0 I\u2019m sure they were improvements overall, but not for this query.\u00a0 The cardinality estimation was used 
to estimate a memory grant for the query (SQL preallocates memory for queries to avoid spills to disk, which are big performance problems, and additional memory allocations, which create the possibility of deadlocks.\u00a0 The cardinality estimate is an important input into the memory request).\nIn this query, the memory grant estimation shot up from something pretty small to 3.5GB.\u00a0 Given that the server only has 48GB, that meant it could run very few of these queries before it ran out of memory, causing every query in the system to back up and, essentially, serialize.\u00a0 That caused a traffic jam and resulted in so many of our customer requests timing out\/failing.\nThe ultimate resolution, for now, is that we added a hint to the query that tells the query optimizer the maximum memory grant to use for the query.\u00a0 It\u2019s expressed as a % of memory and, for simplicity\u2019s sake, we set it to 1% of the memory available for this purpose (on the order of 160MB).\u00a0 That was enough to unclog the system and allow everything to flow freely.\nLonger term, littering our code with memory grant limits is an unsustainable practice.\u00a0 We use query plan hints a lot, but memory grant caps are too fragile over time.\u00a0 So we are working with the SQL team to see about changes in the cardinality estimator that will do better for this query and others like it.\u00a0 There are also changes we can make to the way this query is written that will reduce the likelihood of mis-estimations.\u00a0 It turns out that in most cases, this query operates on only 1 row, but from a very big table.\u00a0 The cardinality estimator guessed <strong>a lot<\/strong> more than 1 row.\nHere\u2019s a greatly simplified view of the response timeline for the incident:<\/p>\n<ul>\n<li>1 minute \u2013 Incident team engaged<\/li>\n<li>10 minutes \u2013 first engineer engaged<\/li>\n<li>1 hr and 42 minutes \u2013 We now understood the issue well enough to know we have a SQL problem 
with a specific query and the first SQL Azure engineer joined the call.<\/li>\n<li>2 hrs and 24 minutes \u2013 A top SQL Azure live site engineer joined the call.\u00a0 I\u2019m trying to be sensitive about how I say this because I don\u2019t mean to disparage anyone in the process.\u00a0 However, from years of doing this, I\u2019ve learned there are engineers and then there are gods of live site debugging.\u00a0 Every good service has a few, but they are very hard to come by and you have to be careful because you can\u2019t pull them into every incident or they will never have a life and will quit.\u00a0 This is the milestone where we got one of those from SQL Azure.\u00a0 Along the way we had many excellent engineers join and help.\u00a0 While I was on the call, I\u2019d say I saw probably 30 engineers engaged on it.<\/li>\n<li>2 hours and 57 minutes \u2013 A top live site engineer from VSTS joined.<\/li>\n<li>3 hours and 38 minutes \u2013 The first mitigation was put in place, restoring the service to partial usability.<\/li>\n<li>5 hours and 13 minutes \u2013 Issue fully resolved and system back to full health.<\/li>\n<\/ul>\n<p>In any incident, there are many things that we look at.\u00a0 And given that this happened just a little over 24 hours ago, we aren\u2019t nearly done.\u00a0 But here\u2019s a summary of some of the things that matter.<\/p>\n<ul>\n<li><strong>4.5 hours to the first mitigation \u2013<\/strong>\u00a0 The incident log shows solid engagement, with an engineer out of bed and on the problem within 10 minutes.\u00a0 I need to get someone to go through the transcript and diagnostic actions with a fine-toothed comb to see what we can learn from it.\u00a0 I\u2019m sure there are things we could have done better.\u00a0 The one thing that jumps out at me is, given it was a worldwide, whole-service event, did we get our top live site engineers on the call fast enough?\u00a0 I kind of feel like we didn\u2019t acknowledge and respond to the severity of the incident 
quite the way I would like.<\/li>\n<li><strong>Improvements in cardinality estimation<\/strong> \u2013 Clearly the SQL Server team has some work to do to further tune the new cardinality estimator, and they are all over that.<\/li>\n<li><strong>Changes to our query<\/strong> \u2013 We can make some changes to this query to make it less fragile.<\/li>\n<li><strong>Circuit breakers<\/strong> \u2013 We spent a lot of effort in the past year building circuit breakers that help the overall service be resilient in the face of incidents affecting fractions of the service \u2013 and we have lots of evidence that they have made a huge difference in the severity of incidents that occur.\u00a0 But we can see from this incident, and even more so from the next one I\u2019m going to talk about, that we have some tuning to do.\u00a0 The automatic circuit breakers didn\u2019t mitigate this; ultimately, it was a manually applied circuit breaker (the stored proc change) that provided the first mitigation.<\/li>\n<li><strong>Anomaly detection<\/strong> \u2013 Ironically, one of the things I\u2019ve been talking with the team about recently is improving our ability to use our copious amounts of telemetry to detect changes in important KPIs (in this case, memory grant requests) before and after every single change we make to the service.\u00a0 The change to the compat level was made days before this incident, and if we had had anomaly detection on that KPI, it would have alerted us to an important behavioral change and could have prevented this incident.\u00a0 Luckily we just released an anomaly detection feature in Application Insights, and we are going to see how we can use that technology to identify issues like this in VSTS.<\/li>\n<li><strong>SQL Compat levels<\/strong> \u2013 We should not have let our compat levels drift so much, and we need to put processes in place that keep us more up to date.<\/li>\n<li><strong>SPS partitioning<\/strong> \u2013 We need to eliminate SPS as a single global 
service and partition it into multiple instances so that, even when there is a severe incident, it\u2019s not worldwide.\u00a0 That work was already high on the backlog.\u00a0 In fact, some of the prerequisites to enabling it are already in flight.<\/li>\n<\/ul>\n<p>On to incident #2\u2026<\/p>\n<h3>Worldwide incident causing slow response times and missing UI elements<\/h3>\n<ul>\n<li>Start time: 3:30 GMT on Feb 3<\/li>\n<li>Duration: 3 hours<\/li>\n<\/ul>\n<div class=\"post\">\n<div class=\"body\">\n<div id=\"6745a34b-7b9b-4d63-98ce-ed62a336b5a5\" class=\"postBody\" style=\"margin: 4px 0px 0px;border-width: 0px;padding: 0px\">\nThis incident also\u00a0occurred on SPS, and it too had a worldwide impact.\n<\/div>\n<\/div>\n<\/div>\n<p>The graph below shows percentiles of response time (in milliseconds) for all requests during the incident, as measured from the browser. You can see from the chart that fewer than 10% of requests were impacted.\n<a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/LSI-Blog-Screenshot2.png\"><img decoding=\"async\" class=\"alignnone size-large wp-image-11362\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/LSI-Blog-Screenshot2-1024x367.png\" alt=\"LSI Blog Screenshot2\" width=\"879\" height=\"315\" \/><\/a><\/p>\n<div class=\"post\">\n<div class=\"body\">\n<div id=\"6745a34b-7b9b-4d63-98ce-ed62a336b5a5\" class=\"postBody\" style=\"margin: 4px 0px 0px;border-width: 0px;padding: 0px\">\n<ol>\n<li>The incident started. At around 3:30 UTC we saw a large spike in response times for fewer than 10% of the requests coming from the browser. 
At this point alerts fired and we engaged our on-call DRIs.<\/li>\n<li>At this point we understood the issue to be missing Azure DNS entries for our services, and we mitigated it by disabling the job that was deleting the DNS entries and re-creating them by hand.<\/li>\n<li>A second incident was triggered, which was actually a precursor to the Feb 4 incident described above.<\/li>\n<li>The incident was mitigated by failing over to the secondary SQL server.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<p>Why did we have a job deleting DNS entries? For each customer&#8217;s account for each service in VSTS, we use the Azure DNS service to create a DNS entry that directs that particular account to a particular service&#8217;s scale unit.\u00a0 This way, DNS resolves to the correct scale unit to direct the HTTP requests for that account.\nEvery Wednesday, we run an automated cleanup job that goes through and deletes DNS entries for deleted, renamed, abandoned, etc. accounts.\u00a0 This job has been in place and functioning without incident for a very long time.\nRecently, as part of our work to enable multiple Team Project Collections per account, some changes were made to this job that introduced a bug.\u00a0 That bug caused not only orphaned account DNS entries but also our top-level service DNS entries to get deleted \u2013 meaning that calls from the client directly to a top-level service would time out and fail with host-not-found errors. It is interesting to note that this did not impact our server response times, only response times measured from the browser, because our servers had the DNS entries cached for server-to-server calls.\nThis was clearly a big miss in our testing. How did we miss it? The cleanup job runs weekly, and we had another bug in the cleanup job that was masking this bug. 
The bug fix rolled out as a hotfix two weeks into the deployment cycle, which triggered the second bug.\nThe second blip in response time had the same root cause as the Feb 4 incident described above, so it was really a separate incident. It was mitigated by failing over to the secondary SQL server.\u00a0We didn&#8217;t dig into the root cause of this blip with enough urgency, which could have prevented the second incident from happening.\nHere&#8217;s a rough timeline for this incident:<\/p>\n<ul>\n<li>1 minute \u2013 Incident team engaged<\/li>\n<li>7 minutes \u2013 first engineer engaged<\/li>\n<li>1 hour &#8211; We determined the issue was due to missing Azure DNS entries, and engaged the Azure DNS team.<\/li>\n<li>2 hours and 20 minutes &#8211;\u00a0The Azure DNS team determined from their logs that we were deleting the records.<\/li>\n<li>2 hours and 40 minutes &#8211; We discovered the Azure DNS cleanup job was running and disabled it to stop the damage.<\/li>\n<li>2 hours and 45 minutes &#8211; We started recreating Azure DNS records manually.<\/li>\n<li>3 hours &#8211; Incident mitigated<\/li>\n<\/ul>\n<p>We learned a lot from this incident. Here are some of the things we are doing to improve the service as a result:<\/p>\n<ol>\n<li><strong>Get smarter about the DNS entries we delete.\u00a0<\/strong>We are hardening\u00a0the Azure DNS cleanup job to only delete actual account DNS entries, and to never delete top-level service entries.<\/li>\n<li><strong>Reduce time to root cause the issue<\/strong> by improving our tracing in this job, and by educating more of the team on how we handle Azure DNS entries through better documentation and training.<\/li>\n<li><strong>Improve our resiliency<\/strong> when we\u00a0are experiencing service-to-service or browser-to-microservice outages so the site remains operational for all users. 
We will do this by tuning our circuit breakers and improving our service-to-service exception handling to fall back to a degraded experience when this class of incident happens.<\/li>\n<li><strong>Root cause every incident after mitigation<\/strong>. If we had root caused the second blip in the response time graph, we would have discovered the SQL issue and could have prevented the incident that occurred the following day.<\/li>\n<\/ol>\n<p>I want to say again that I am very sorry these incidents happened. We know how much this impacts you because we use the service every day and our teams are impacted too, so we know how painful it is. We are taking every measure to learn from these incidents and make the service better.\nBrian\n<\/p><\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>We had two serious availability incidents in the past 2 days.\u00a0 I&#8217;m incredibly sorry for the problems that these have caused.\u00a0 Increasingly we are bringing on more and more customers who rely on this service day in and day out and any significant interruption is entirely unacceptable.\u00a0 We know it.\u00a0 We live it.\u00a0 We breathe [&hellip;]<\/p>\n","protected":false},"author":244,"featured_media":14617,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-11351","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"acf":[],"blog_post_summary":"<p>We had two serious availability incidents in the past 2 days.\u00a0 I&#8217;m incredibly sorry for the problems that these have caused.\u00a0 Increasingly we are bringing on more and more customers who rely on this service day in and day out and any significant interruption is entirely unacceptable.\u00a0 We know it.\u00a0 We live it.\u00a0 We breathe 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/11351","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/users\/244"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/comments?post=11351"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/11351\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media\/14617"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media?parent=11351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/categories?post=11351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/tags?post=11351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}