{"id":15765,"date":"2017-12-27T22:31:45","date_gmt":"2017-12-27T22:31:45","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=15765"},"modified":"2019-02-18T13:42:58","modified_gmt":"2019-02-18T21:42:58","slug":"postmortem-intermittent-failures-for-visual-studio-team-services-on-14-dec-2017","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/devopsservice\/postmortem-intermittent-failures-for-visual-studio-team-services-on-14-dec-2017\/","title":{"rendered":"Postmortem &#8211; Intermittent Failures for Visual Studio Team Services on 14 Dec 2017"},"content":{"rendered":"<p><span>Beginning on 14 December 2017, we had a series of incidents with Visual Studio Team Services (VSTS) over several days that seriously affected the availability of our service for many customers (incident blogs <\/span><a href=\"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=15675\"><span>#1<\/span><\/a> <a href=\"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=15715\"><span>#2<\/span><\/a> <a href=\"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=15736\"><span>#3<\/span><\/a><span>). We apologize for the disruption these incidents caused you and your team. Below we describe the cause and the actions we are taking to address the issues behind these incidents.<\/span><\/p>\n<p><strong>Customer Impact<\/strong><\/p>\n<p><span>The issues caused intermittent failures across multiple instances of the VSTS service within certain US and Brazilian data centers<\/span><span>. 
During this time, failures within our application caused IIS restarts, resulting in customer impact across various VSTS scenarios.<\/span><\/p>\n<p><span>The incidents started on 14 December. The graph below shows periods of customer impact for the Central US (CUS) and South Brazil (SBR) scale units.<\/span><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/PostMortem12272017Impact.png\"><img decoding=\"async\" width=\"1024\" height=\"424\" class=\"alignnone size-large wp-image-15785\" alt=\"\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/PostMortem12272017Impact-1024x424.png\" \/><\/a><\/p>\n<p><strong>What Happened<\/strong><\/p>\n<p><span>For context, VSTS uses DNS to route traffic to the correct scale unit. When an account is created, VSTS queues a job to create a DNS record for {account}.visualstudio.com that points to the right scale unit. Because there is a delay between the DNS entry being added and being used, we use <\/span><a href=\"https:\/\/www.iis.net\/downloads\/microsoft\/application-request-routing\"><span>Application Request Routing<\/span><\/a><span> (ARR) to re-route requests to the right scale unit until the DNS update is complete. Additionally, VSTS uses web sockets, via SignalR, to provide real-time updates in the browser for pull requests and builds.<\/span><\/p>\n<p><span>The application pools (w3wp processes) for the South Brazil and Central US VSTS scale units began crashing intermittently on 14 December<\/span><span>. 
IIS <\/span><span>restarted the application pools on failure, but all existing connections were terminated. Analysis of the crash dumps revealed that a specific call pattern triggered the crash.<\/span><\/p>\n<p><span>Every crashing request had the following two characteristics in common.<\/span><\/p>\n<ol>\n<li><span>The request was a web socket request.<\/span><\/li>\n<li><span>The request was proxied using ARR.<\/span><\/li>\n<\/ol>\n<p><span>The issue took a while to track down because we initially suspected recent changes to the code that uses SignalR. However, the root cause was a fix for an unrelated issue that we released on 14 December, which added code that uses the ASP.NET <\/span><a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.web.httpapplication.presendrequestheaders?view=netframework-4.7.1\"><span>PreSendRequestHeaders<\/span><\/a><span> event. Using this event in combination with web sockets and ARR caused an AccessViolationException, which terminated the process. We spoke with the ASP.NET team, and they informed us that the PreSendRequestHeaders event is unreliable and that we should use <\/span><a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/system.web.httpresponse.addonsendingheaders(v=vs.110).aspx\"><span>HttpResponse.AddOnSendingHeaders<\/span><\/a><span> instead. 
We have released a fix with that change.<\/span><\/p>\n<p><span>While we debugged the issue, we mitigated customer impact by redirecting ARR traffic once we identified ARR-proxied requests as the trigger.<\/span><\/p>\n<p><span>Here is a timeline of the incident.<\/span><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/PostMortem12272017TimeLine.png\"><img decoding=\"async\" width=\"1024\" height=\"794\" class=\"alignnone size-large wp-image-15775\" alt=\"\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/PostMortem12272017TimeLine-1024x794.png\" \/><\/a><\/p>\n<ol>\n<li><span>IIS errors start in SBR (South Brazil)<\/span><\/li>\n<li><span>IIS errors start in CUS1 (Central US)<\/span><\/li>\n<li><span>Workaround \u2013 Stopped ARR traffic from going to SBR.<\/span><\/li>\n<li><span>Workaround \u2013 Redirected the *.visualstudio.com wildcard from CUS1 to pre-flight (an internal instance).<\/span><\/li>\n<\/ol>\n<p><strong>Next Steps<\/strong><\/p>\n<p>To prevent this issue in the future, we are taking the following actions.<\/p>\n<ol>\n<li>We have added monitoring and alerting specifically for w3wp crashes.<\/li>\n<li>We are working with the ASP.NET team to document or deprecate the PreSendRequestHeaders event. 
This <a href=\"https:\/\/docs.microsoft.com\/en-us\/aspnet\/aspnet\/overview\/web-development-best-practices\/what-not-to-do-in-aspnet-and-what-to-do-instead\">page<\/a> has been updated, and we are working to get the others updated.<\/li>\n<li>We are adding more detailed markers to our telemetry to make it easier to identify which build a given scale unit is on at any point in time to help correlate errors with the builds that introduced them.<\/li>\n<\/ol>\n<p>Sincerely,\nBuck Hodges\nDirector of Engineering, VSTS<\/p>\n","protected":false},"excerpt":{"rendered":"<p>On 14 December 2017 we began to have a series of incidents with Visual Studio Team Services (VSTS) for several days that had a serious impact on the availability of our service for many customers (incident blogs #1 #2 #3). We apologize for the disruption these incidents had on you and your team. Below we [&hellip;]<\/p>\n","protected":false},"author":1090,"featured_media":18098,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,1],"tags":[3],"class_list":["post-15765","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-postmortem","category-uncategorized","tag-postmortem"],"acf":[],"blog_post_summary":"<p>On 14 December 2017 we began to have a series of incidents with Visual Studio Team Services (VSTS) for several days that had a serious impact on the availability of our service for many customers (incident blogs #1 #2 #3). We apologize for the disruption these incidents had on you and your team. 
Below we [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/posts\/15765","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/users\/1090"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/comments?post=15765"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/posts\/15765\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/media\/18098"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/media?parent=15765"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/categories?post=15765"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/tags?post=15765"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}