{"id":1711,"date":"2013-11-25T15:06:24","date_gmt":"2013-11-25T15:06:24","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/bharry\/2013\/11\/25\/a-rough-patch\/"},"modified":"2024-03-27T15:01:48","modified_gmt":"2024-03-27T22:01:48","slug":"a-rough-patch","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/bharry\/a-rough-patch\/","title":{"rendered":"A Rough Patch"},"content":{"rendered":"<p>Either I\u2019m going to get increasingly good at apologizing to fewer and fewer people or we\u2019re going to get better at this.\u00a0 I vote for the later.<\/p>\n<p>We\u2019ve had some issues with the service over the past week and a half.\u00a0 I feel terrible about it and I can\u2019t apologize enough.\u00a0 It\u2019s the biggest incident we\u2019ve had since the instability created by our service refactoring in the March\/April timeframe.\u00a0 I know it\u2019s not much consolation but I can assure you that we have taken the issue very seriously and there are a fair number of people on my team who haven\u2019t gotten much sleep recently.<\/p>\n<p>The incident started the morning of the Visual Studio 2013 launch when we introduced some significant performance issues with the changes we made. You may not have noticed it by my presentation but for the couple of hours before I was frantically working with the team to restore the service.<\/p>\n<p>At launch, we introduced the commercial terms for the service and enabled people to start paying for usage over the free level.\u00a0 To follow that with a couple of rough weeks is leaving a bad taste in my mouth (and yours too, I\u2019m sure).\u00a0 Although the service is still officially in preview, I think it\u2019s reasonable to expect us to do better.\u00a0 So, rather than start off on such a sour note, we are going to extend the \u201cearly adopter\u201d program for 1 month giving all existing early adopters an extra month at no charge.\u00a0 We will also add all new paying customers to the early adopter program for the month of December \u2013 giving them a full month of use at no charge.\u00a0 Meanwhile we\u2019ll be working hard to ensure things run more smoothly.<\/p>\n<p>Hopefully that, at least, demonstrates that we\u2019re committed to offering a very reliable service.\u00a0 For the rest of this post, I\u2019m going to walk through all the things that happened and what we learned from them.\u00a0 It\u2019s a long read and it\u2019s up to you how much of it you want to know.<\/p>\n<p>Here\u2019s a picture of our availability graph to save 1,000 words:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/0576.image_thumb_7E74326C.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-15819\" src=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/0576.image_thumb_7E74326C.png\" alt=\"Image 0576 image thumb 7E74326C\" width=\"804\" height=\"297\" srcset=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/0576.image_thumb_7E74326C.png 804w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/0576.image_thumb_7E74326C-300x111.png 300w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/0576.image_thumb_7E74326C-768x284.png 768w\" sizes=\"(max-width: 804px) 100vw, 804px\" \/><\/a><\/p>\n<h2>First Incident: Nov 13, 14, 18, 19<\/h2>\n<p>Let\u2019s start with the symptoms.\u00a0 What users saw was that the service became very slow to the 
Let's start with the symptoms. What users saw was that the service became very slow, to the point of being unresponsive, during peak traffic periods. Peak hours start mid-morning on the east coast of the US – right when we were doing the launch keynote 🙁

The beginning coincided with the changes we made to enable the new branding and billing right before the launch.

We've gotten to a fairly mature stage of dealing with live-site incidents. We detect them quickly, triage them, bring in the right developers, collaborate and fan out to all the various people who need to know/contribute, including pulling in partner teams like Azure fabric, SQL Azure, ACS, etc. Despite that, this one took us a long time to diagnose, for a number of reasons – for one, it was the first incident of this nature that we have ever hit (more on that in a minute).

One of the things we did fairly early on was to roll back all of the feature switches we had enabled that morning, thinking that just undoing the changes would likely restore the service. Unfortunately, it didn't, and no one is completely sure why. There were some aspects of the issue that had been there for a while and were triggered by the spike in load that we saw, but the bulk of the issue really was caused by enabling the feature flags. It's possible that some error was made in rolling them back but, at this point, they've been changed enough times that it's hard to tell for certain.

Let's talk about what we do know.

Within a fairly short period of time, we discovered that the biggest underlying symptom of the problem was SNAT port exhaustion. SNAT is a form of network address translation for outgoing calls from a service on Azure (for instance, calls from TFS to SQL Azure, Azure storage, our accounts service, ACS, etc.). For any given destination there are only 64K ports, and once you exhaust them you can no longer talk to that service – the networking infrastructure stops you. This is a design constraint you really need to think about with multi-tier, scale-out services, because the multiplexing across tiers with VIPs (virtual IP addresses) can explode the use of these SNAT ports.

This was the first incident we've had where the core issue was a networking mechanics issue, and we didn't have a great deal of experience with it. After struggling for a few hours to identify what was happening – and finally figuring out the port exhaustion issue – we pulled in the Azure networking team to provide the expertise to further diagnose it. The core thing we struggled with next was figuring out which of the services we talk to was using up all the SNAT ports. This ultimately uncovered two "issues" that will get fixed:

1. The Windows Azure networking infrastructure logging records which source IP address exhausted SNAT ports but not the target IP, so the logs didn't help us figure out which service was killing us. Azure has now decided to add that to their logging info to make future incidents easier to diagnose.
2. We, at the Visual Studio Online level, have pretty involved trace information. However, we eventually discovered that we were missing the trace point for the one target service that was actually the choke point here. It was our connection to ACS. We have gone back and made sure that we have trace points for every single service we talk to (there's a sketch of that pattern right after this list).
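To make those two fixes concrete, here's a minimal sketch of the pattern – not the team's actual code. Every outbound dependency call goes through one shared, pooled HTTP client per target service, so concurrent requests reuse connections instead of each consuming another SNAT port toward that destination, and every call emits a trace point that names the target. It's written in Python with the `requests` library purely for illustration; the function and variable names are invented.

```python
import logging
import threading
import time

import requests  # stand-in for whatever HTTP stack the service actually uses

logger = logging.getLogger("outbound")

# One shared Session per target service. A Session keeps a connection pool, so
# concurrent requests reuse existing TCP connections instead of each opening a
# new one and consuming another ephemeral/SNAT port toward that destination.
_sessions = {}
_sessions_lock = threading.Lock()


def _session_for(service_name):
    with _sessions_lock:
        session = _sessions.get(service_name)
        if session is None:
            session = requests.Session()
            _sessions[service_name] = session
        return session


def call_service(service_name, url, **kwargs):
    """Make an outbound call with a trace point for every target service."""
    start = time.monotonic()
    try:
        return _session_for(service_name).get(url, timeout=10, **kwargs)
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        # The trace point names the target service, so a storm of calls to any
        # one dependency (ACS in this incident) shows up in the traces at once.
        logger.info("outbound call: service=%s url=%s elapsed_ms=%.1f",
                    service_name, url, elapsed_ms)
```

With per-target trace points like this, a flood of calls to any single dependency is visible immediately, and the shared sessions keep the port footprint roughly proportional to concurrency rather than to total request volume.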
Once we understood the right code path that was causing it, we were able to focus our efforts. As with most of these things, it was not one bug but rather the interaction of several.

### Send some people home

A brief aside… I was at the launch with a bunch of people when this happened. One of them was Scott Guthrie. Throughout the day we were chatting about the problems we were having and how the diagnostics were going. He gave me a piece of advice: send some people home. He related a story about a particularly bad incident they had 18 months or so ago. He said the whole dev team was up for 36 hours working on it. They had a fix, deployed it and, unfortunately, not only did it not resolve the problem, it made it worse. After 36 hours of straight work, no one on the dev team was coherent enough to work effectively on the new problem. They now have a policy that after so many hours of working on a problem, they send half the team home to sleep in case they need to start rotating in fresh people. That made a ton of sense to me, so we did the same thing at about 8:00 PM that evening.

### The core problem

On launch day, we enabled a feature flag that required customers to accept some "terms of service" statements. Those statements had been on the service for a while and most people had already accepted them, but on launch day we started enforcing acceptance in order to keep using the service.

However, there were 55 clients out there in the world that had automated tools set up, regularly pinging the service using accounts that no human had logged into to accept the terms of service. Those 55 clients were the trigger for the bug.

Those requests went to TFS. TFS then made a request "on behalf of the client" to our account service to validate the account's access.

The first bug was that the account service returned a 401 – Unauthenticated error rather than a 403 – Unauthorized error when it determined that the account had not accepted the terms of service.

The second bug was that TFS, upon receiving that error, rather than just failing the request, interpreted it as an indication that the ACS token the TFS service itself used was invalid, and decided to contact ACS to refresh the token. It nulled out the token member variable and initiated an ACS request to refill it.

The third bug was that the logic that was supposed to make subsequent requests block and wait for ACS to return was broken and, instead, every other request would call ACS to refresh the token.

The net result was a storm of requests to ACS that exhausted SNAT ports and prevented authentications.

Once we did the minimal fix to address this problem (by fixing the logic that allowed all requests to fall through to ACS), the situation got MUCH better, which is why Monday was much better than Wednesday or Thursday of the previous week.
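The post doesn't include the code, but the behavior that fix restores – one caller refreshes the shared service token while everyone else blocks and reuses the result – is a single-flight refresh. Here's a minimal sketch of that pattern in Python; the class and method names are mine, not from the actual service, which manages ACS tokens inside TFS.

```python
import threading
import time


class ServiceTokenCache:
    """Caches a service-to-service token and makes refresh single-flight:
    only one caller actually talks to the token service; concurrent callers
    block on the same lock and then reuse the freshly fetched token."""

    def __init__(self, fetch_token, ttl_seconds=3600):
        self._fetch_token = fetch_token   # callable that really calls the token service
        self._ttl = ttl_seconds
        self._token = None
        self._expires_at = 0.0
        self._lock = threading.Lock()

    def get_token(self):
        if self._token is not None and time.monotonic() < self._expires_at:
            return self._token
        with self._lock:
            # Re-check under the lock: another thread may have refreshed the
            # token while we were waiting. Without this re-check, every waiter
            # issues its own refresh call - exactly the kind of request storm
            # described above.
            if self._token is None or time.monotonic() >= self._expires_at:
                self._token = self._fetch_token()
                self._expires_at = time.monotonic() + self._ttl
            return self._token
```

The re-check under the lock is the part that was broken here; and, per the first two bugs above, a 401 about the end user's account shouldn't be taken as evidence that the shared service token is bad in the first place.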
However, it turns out there was another problem hiding behind that one. Once we eliminated the ACS SNAT port exhaustion, we still had a small(er) incident on Monday.

### The second problem

In order to avoid calling through to the account service, we "cache" validations and check them in TFS, saving the round trip. The "main" way we do this is by wrapping the validation in the token/cookie returned to the client. So when a client comes back with the cookie, TFS checks the signature, the expiration date and the claim, and accepts the claim. It only falls through to the account service when this check fails.

When this logic was implemented, it was decided not to change the token format for older clients – for fear of breaking them and not wanting to have to update them. So instead, older clients use a more traditional "hash table" cache in TFS. It turns out that the hash table cache was broken and always caused a fall-through to the account service.

Unfortunately, on Monday, someone with an old client was running hundreds of thousands of requests against TFS (about 8 per second). Combined with our daily peak load, this caused a storm against the account service and again slowed things down unacceptably.

Part 2 of Monday's incident was a bug where TFS was holding a global lock while it made a call to our account service. This code path is not executed on every request, but it is executed often enough that the lock serialized a large number of requests that should have been executed in parallel. That contributed to the overall slowdown.

As often happens in a prolonged case like this, along the way we found many other things that turned out not to be "the issue" but are nonetheless worth fixing. My favorite was a regular expression that was sucking down 25% of the CPU on an AT (application tier) just to rewrite the account creation URL to include the proper locale. Yeah, that's crazy.

All in all, it was a bad incident, but quite a number of improvements will come out of it.

### Retrospective

With the incident now understood and mitigated, the next thing we'll turn our attention to is the retrospective. I think we'll end up with a lot of learnings. Among them are:

- We really need multi-instance support. The current service is one instance. It's scaled out, but you either update it or you don't. Aside from our ability to control with feature flags (which is very nice), all customers are affected or none. The service is now at a scale where there's just too much risk in that. We need to enable multi-instance support so that we can roll out changes to smaller groups of people and observe the behavior in a production environment. We already have a team working on this, but the result of this incident is that we've accelerated that work.
- Most sophisticated multi-threaded apps have a lock manager. A lock manager manages lock acquisition and ordering, causing illegal ordering (that might cause a deadlock) to fail immediately. This turns a potential race condition into something that's detected every time the code path is executed. One of the things I learned here is that we need to extend that mechanism to also detect cases where we are holding onto important exhaustible resources (like connections, locks, etc.) across potentially slow and unpredictable operations (like cross-service calls). You have to assume any service can fail, and you cannot allow that failure (through resource exhaustion or any other means) to cascade into a broader failure across the service. Contain the failure to only the capabilities directly affected by it. (There's a small sketch of this kind of check right after this list.)
- We need to revisit our pre-production load testing. Some of these issues should have been caught before they went to production. This is a delicate balance because "there's no place like production" but "trust thy neighbor but lock your door."
- I think there's another turn of the crank we can take in evolving some of our live-site incident management/debugging process. This includes improvements in telemetry. We should have been able to get to the bottom of these issues sooner.
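That lock-manager idea is easy to sketch. This isn't the TFS implementation – the names and structure are invented – but it shows both checks: out-of-order lock acquisition fails every time the code path runs, and a cross-service call refuses to start while any managed lock is held.

```python
import threading
from contextlib import contextmanager


class LockManager:
    """Locks have a fixed rank; acquiring them out of order fails immediately,
    and slow operations (like cross-service calls) refuse to run while any
    managed lock is held."""

    def __init__(self, lock_order):
        # lock_order: lock names from "acquire first" to "acquire last"
        self._rank = {name: i for i, name in enumerate(lock_order)}
        self._locks = {name: threading.Lock() for name in lock_order}
        self._held = threading.local()

    def _held_stack(self):
        if not hasattr(self._held, "stack"):
            self._held.stack = []
        return self._held.stack

    @contextmanager
    def acquire(self, name):
        stack = self._held_stack()
        if stack and self._rank[name] <= self._rank[stack[-1]]:
            # Illegal ordering: fail every time this path runs, not just when
            # two threads happen to race into a deadlock.
            raise RuntimeError(f"lock '{name}' acquired after '{stack[-1]}'")
        self._locks[name].acquire()
        stack.append(name)
        try:
            yield
        finally:
            stack.pop()
            self._locks[name].release()

    @contextmanager
    def slow_operation(self, description):
        # Holding any managed lock across an unpredictable remote call can
        # serialize the whole server if that remote service slows down.
        stack = self._held_stack()
        if stack:
            raise RuntimeError(
                f"'{description}' attempted while holding locks: {stack}")
        yield
```

With a guard like `slow_operation` in place, the "global lock held across a call to the account service" bug from Monday fails loudly in functional testing instead of surfacing as a service-wide slowdown at peak load.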
There's more, and I'm looking forward to the retrospective to get everyone's perspective.

## Second Incident: Nov 20th

Unfortunately, bad luck comes in threes, and we got hit with a pair of incidents on Wednesday. The first was an intermittent failure in Windows Azure Storage. It appears (but I don't have all the facts yet) that Azure was doing an upgrade on the cluster we use for VS Online and something went wrong with the upgrade. The result was intermittent outages starting at 9:00 PM PST on Wednesday night and going through lunchtime on Thursday – unfortunately, right through another peak usage time for us. Then, to add insult to injury, someone made an error in a global Microsoft networking configuration, breaking a DNS setting, which impacted our service and many others across Microsoft. That added about another hour to our misery in the mid-afternoon.

Now, I don't want to point fingers here, because this was another example where a failure in one of the services we consume cascaded into a much larger impact on our service than it should have had. The changes I talked about above (managing exhaustible resources and unreliable operations) would have significantly mitigated this incident.

All in all, it was an inauspicious week and a half. We are going to have incidents. Every service does. You can plan and test and be robust, and there will still be issues. However, what we've experienced recently isn't reflective of what we are committed to providing. It's been a painful but productive learning experience, and I believe the result will be a more robust service. I appreciate your patience, and we will continue to work hard to provide the best service there is. Anything less is unacceptable.

Brian