{"id":16295,"date":"2018-02-26T19:08:54","date_gmt":"2018-02-26T19:08:54","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=16295"},"modified":"2019-02-27T11:12:12","modified_gmt":"2019-02-27T19:12:12","slug":"updated-and-completed-postmortem-performance-issues-and-failures-in-vsts-west-europe-7-february-2018","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/devopsservice\/updated-and-completed-postmortem-performance-issues-and-failures-in-vsts-west-europe-7-february-2018\/","title":{"rendered":"Updated and Completed Postmortem: Performance Issues and failures in VSTS West Europe \u2013 7 February 2018"},"content":{"rendered":"<p>A week ago we posted an <a href=\"https:\/\/devblogs.microsoft.com\/devopsservice\/?p=15946\">incomplete postmortem<\/a> and are now following up with the completed version. If you want the full story of how we progressed through this incident, start by reading that.\u00a0 This postmortem will cover the full root cause analysis but it won\u2019t rehash the first part of the investigation.<\/p>\n<h3>Customer Impact<\/h3>\n<p>On 7 February 2018 we had an incident which impacted users in our Western European scale unit. During this time, users experienced slow performance and 503 errors (service unavailable) when interacting with VSTS services. Close to 5,000 users were impacted at the peak of the incident. The incident lasted for two and a half hours from 10:10 &#8211; 12:40 UTC.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ImpactedUsers1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ImpactedUsers1-1024x475.png\" alt=\"Impacted Users \u2013 Western Europe Scale Unit\" width=\"662\" height=\"307\" class=\"alignnone wp-image-16425\" \/><\/a><\/p>\n<p><em>Quick aside on the graphs shown in this post: All times are in UTC and all graphs were generated using the internal version of Azure Log Analytics, lovingly known to us as \u201cKusto\u201d.\u00a0 We\u2019re on an old version of the client which is why our graphs look different than those produced by the public version of Azure Log Analytics.\u00a0 We cannot praise the Log Analytics team enough for the product they produced.\u00a0 It has fundamentally changed how we do DevOps and we couldn\u2019t imagine trying to do this type of analysis without it.<\/em><\/p>\n<p><span>In the <\/span><a href=\"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=15946\">earlier postmortem<\/a><span>, we posed these two questions:<\/span><em>\n<\/em><\/p>\n<ol>\n<li>What caused the initial failure?<\/li>\n<li>Why didn\u2019t the system automatically recover given the reactive measures we already have in place?<\/li>\n<\/ol>\n<p>When doing a postmortem for a long running incident like this, it is usually the case that what caused the start of the incident is the same thing that made the incident long running. That is, you can identify an event (e.g., a deployment carrying a bug, an increase in load, a bad SQL query plan, etc.) that causes the system to go into a bad state, and the system will stay in that state until you undo the effects of that event.\u00a0 It\u2019s typically related to the root cause.\u00a0 We found in this situation that the event that caused the bad state and the reason for staying in that state were unrelated. That caused us to follow the wrong path repeatedly.<\/p>\n<h3>What caused the initial failure?<\/h3>\n<p>Early on, we noticed a few things.\u00a0 First, we noticed the spike in SignalR reconnects.\u00a0 Here\u2019s the graph of reconnects that we looked at last time.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_SignalR_Reconnects1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_SignalR_Reconnects1-1024x317.png\" alt=\"SignalR Reconnects (blue) vs. Impacted Users (orange)\" width=\"721\" height=\"223\" class=\"alignnone wp-image-16475\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Second, we noticed that one of our databases got overwhelmed and concurrency circuit breakers were opening:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_Database_Concurrency_Circuit_Breakers1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_Database_Concurrency_Circuit_Breakers1-1024x315.png\" alt=\"Database Concurrency Circuit Breakers \/ 5 seconds\" width=\"718\" height=\"221\" class=\"alignnone wp-image-16415\" title=\"Database Concurrency Circuit Breakers \/ 5 seconds\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>However, neither of these led us to the initial failure.\u00a0 The reconnects subsided rather quickly, and circuit breakers are a symptom but not a cause.\u00a0 This line of thinking caused us to waste time running down theories that proved incorrect (some I covered in the initial postmortem and some aren\u2019t interesting for this write up).<\/p>\n<p>In the end, we found that the SignalR reconnects triggered the incident and were the cause of the SQL circuit breakers tripping.\u00a0 However, those aren\u2019t what caused it to keep going.\u00a0 We deal with blips like this all the time and most of the time the system recovers just fine.\u00a0 So, onto the real question\u2026<\/p>\n<h3>Why didn\u2019t the system automatically recover given the reactive measures we already have in place?<\/h3>\n<p>Let\u2019s take a step back and see what happened when the event started.\u00a0 We see our first impact at 10:15 UTC.\u00a0 One key thing stands out.\u00a0 At 10:14, CPU usage is normal across the 7 application tiers (ATs).\u00a0 At 10:15, CPU usage spikes on several ATs.\u00a0 By 10:17, CPU usage has fallen significantly on all ATs.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsageByAT1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsageByAT1-1024x373.png\" alt=\"CPU Usage by AT\" width=\"755\" height=\"275\" class=\"alignnone wp-image-16375\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Any time we see all AT CPUs move together like this, the general assumption is that there is a shared resource that is affecting all of them.\u00a0 So, let\u2019s check the usual suspects.\u00a0 SQL?<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsageonDBs1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsageonDBs1-1024x331.png\" alt=\"CPU Usage on the databases shared by these ATs\" width=\"780\" height=\"252\" class=\"alignnone wp-image-16395\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>No, database CPU drops.\u00a0 We check log writes \/ TempDB pressure just to make sure something else isn\u2019t wrong there, and everything looks okay (though the DB CPU drop is suspicious). Network?<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_TCPLookupSuccesses1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_TCPLookupSuccesses1-1024x411.png\" alt=\"TCP Lookup Successes (1 is good)\" width=\"777\" height=\"312\" class=\"alignnone wp-image-16305\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Network is fine outside of that one blip.\u00a0 Of course, we noticed that blip during the incident and chased it for a while. Shared Platform Services (SPS)? Looks fine.\u00a0 Storage? Looks fine.\u00a0Redis? The exceptions for Redis accesses are spiking. That looks promising.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_RedisExceptions1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_RedisExceptions1-1024x331.png\" alt=\"Redis Exceptions \/ Minute\" width=\"777\" height=\"251\" class=\"alignnone wp-image-16455\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>There is always one big question with Redis though: Is it the cause or the victim?\u00a0 That is, since Redis runs on all the ATs, is it the reason that the ATs are performing poorly, or, is it being impacted by High CPU on each of the ATs? It\u2019s virtually always the victim and hardly ever a cause in our experience.<\/p>\n<p>Fortunately, we\u2019ve gotten pretty good at determining whether Redis is a victim.\u00a0 Since each scale unit has ATs and job agents (JA), we can look to see if the JAs are experiencing issues with Redis:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_RedisExceptionsOnJAs1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_RedisExceptionsOnJAs1-1024x314.png\" alt=\"Redis Exceptions \/ Minute on Job Agents\" width=\"773\" height=\"237\" class=\"alignnone wp-image-16465\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Redis is working great on the job agents.\u00a0 Conclusion: As usual, Redis is the victim, not the cause.<\/p>\n<p>At this point, no shared resource is standing out as the problem.\u00a0 Let\u2019s go back and look at the ATs again.\u00a0 Are they all moving together throughout the incident?<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsagePerAT1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsagePerAT1-1024x373.png\" alt=\"CPU Usage Per AT Throughout Incident\" width=\"771\" height=\"281\" class=\"alignnone wp-image-16405\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Okay, that\u2019s suspicious.\u00a0 It looks like each AT is spiking at different times.\u00a0 Furthermore, it appears that a given AT is either doing nothing or completely slammed.\u00a0 Here\u2019s a graph of CPU usage on AT_4 to show that better:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsageonAT_41.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_CPUUsageonAT_41-1024x378.png\" alt=\"CPU Usage on AT_4\" width=\"769\" height=\"284\" class=\"alignnone wp-image-16385\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Let\u2019s discuss background for a moment.\u00a0 Because of issues we\u2019ve had in the past, we\u2019ve created something called the VssHealthAgent which attempts to auto mitigate issues.\u00a0 One of the things the VssHealthAgent does is look for ATs that are under significant load, pull them out of the load balancer\u2019s rotation, take a process dump, recycle, and then put them back in rotation.\u00a0 It is supposed to have the following protections in place to prevent degenerate behaviors:<\/p>\n<ol>\n<li>Only pull one AT out of the load balancer at a time.<\/li>\n<li>Only pull a given AT out of the load balancer once per hour.<\/li>\n<\/ol>\n<p>So, does the VssHealthAgent have a bug causing it to be overzealous and continually pulling ATs out of rotation?\u00a0 To make a long story a little less long, while this first looked like a promising theory, ultimately the VssHealthAgent worked as expected throughout this whole period.\u00a0 Due to the time it takes to run the analysis on an unhealthy AT, it continually had one AT out of rotation for the whole incident, but it never had more than one in that state.<\/p>\n<h3>What are the ATs doing when their CPU is spiked?<\/h3>\n<p>For this we need <a href=\"https:\/\/github.com\/Microsoft\/perfview\">PerfView<\/a>. PerfView is a tool we use frequently to diagnose performance issues. PerfView is a command line tool.\u00a0 How are we going to run it?\u00a0 We could have someone log in and run it on the machine that\u2019s having issues, but that\u2019s a bad idea since it\u2019s both manual and for intermittent perf issues you\u2019ll likely miss capturing data at the right time.\u00a0 To solve this, we use an internal tool called Service Profiler that\u2019s running as a service on the VM and automatically captures PerfView traces (you can get similar functionality for your own service using <a href=\"https:\/\/azure.microsoft.com\/en-us\/blog\/application-insights-profiler\/\">Application Insights Profiler for Azure App Service<\/a>).\u00a0 Here is what we saw in multiple PerfView captures:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_PerfView1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_PerfView1.png\" alt=\"\" width=\"411\" height=\"208\" class=\"alignnone wp-image-16445\" \/><\/a><\/p>\n<p>The column headers are missing from the screen shot, but that\u2019s a kernel context swap taking 374 milliseconds.\u00a0 That\u2019s an <em>incredibly<\/em> long time for a SwapContext call.\u00a0 A common cause for that is thread starvation.\u00a0 We see signs of this in the HTTP calls we\u2019re making from the server as well \u2013 they are stuck waiting on threads to spin up.\u00a0 Let\u2019s look at active thread counts on a given AT:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveThreadsInAT_0.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveThreadsInAT_0-1024x356.png\" alt=\"Active Threads in AT_0\" width=\"757\" height=\"263\" class=\"alignnone wp-image-16225\" \/><\/a><\/p>\n<p>That jagged line indicates a problem.\u00a0 The way the CLR ThreadPool works is that it will expand very quickly to the number of threads specified in the MinWorker setting.\u00a0 Once that number is exceeded, it will throttle the addition of threads to <a href=\"https:\/\/gist.github.com\/JonCole\/e65411214030f0d823cb\">once every 500ms<\/a>.\u00a0 Our MinWorker setting is currently at 192.\u00a0 (Note, we have two processes on each AT which is why things grow quickly until it reaches close to 400).\u00a0 Once we exhaust that many active threads, thread creation is throttled and work starts to build up.\u00a0 The graph above and the PerfView analysis both indicate that the starved resource preventing work from proceeding was the number of threads available to process incoming requests.<\/p>\n<h3>Why are the ATs creating a huge number of threads?<\/h3>\n<p>This is the key question that took us a while to answer.\u00a0 We have 7 ATs that were handling the load fine before the event.\u00a0 Since the event wasn\u2019t persistent, why can\u2019t they handle the load after the event went away?\u00a0 Let\u2019s ask a basic question: Do we really have 7 ATs?\u00a0 We know that the VssHealthAgent has one pulled out of rotation for the length of the incident.\u00a0 And while the ATs were running relatively hot before the incident, we should be able to run on 6 ATs here.\u00a0 The key question actually isn\u2019t how many ATs are running, but it\u2019s how many the load balancer has in rotation accepting requests.\u00a0 The load balancer will only send traffic to ATs that are responding correctly to its probes.\u00a0 Let\u2019s graph the number of ATs that the load balancer is actually using:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_LoadBalancer1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_LoadBalancer1-1024x411.png\" alt=\"Number of ATs used by Load Balancer\" width=\"757\" height=\"304\" class=\"alignnone wp-image-16435\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>And there\u2019s the smoking gun.\u00a0 Throughout the incident, the load balancer was only successfully probing between 1 and 3 ATs.\u00a0 \u00a0Here\u2019s a graph of the active ASP.NET requests on a single AT for the length of the incident:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveASP.NetRequests_AT_51.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveASP.NetRequests_AT_51-1024x421.png\" alt=\"Active ASP.NET Requests for AT_5\" width=\"756\" height=\"311\" class=\"alignnone wp-image-16315\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>This shows that after spinning up and telling the load balancer it was ready for requests, an AT would get overwhelmed with requests because the load balancer didn\u2019t have enough ATs for fair distribution.\u00a0 The load is far too large for only having up to 3 ATs.<\/p>\n<p>That <em>should<\/em> be fine though.\u00a0 ASP.NET will queue up requests so that a process doesn\u2019t get overwhelmed and can continue to churn through requests.\u00a0 Remember the thread starvation issue from earlier?\u00a0 Here\u2019s a graph of the active thread count on AT_5.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveThreads_AT_51.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveThreads_AT_51-1024x384.png\" alt=\"Active Threads for AT_5\" width=\"755\" height=\"283\" class=\"alignnone wp-image-16335\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>As soon as an AT spins up, it exhausts the 192 threads available to it and begins thrashing as thread creation is throttled.\u00a0 Furthermore, it won\u2019t crash and recover.\u00a0 It\u2019ll just sit there thrashing.\u00a0 The probes from the load balancer are HTTP requests that go into the same ASP.NET queue.\u00a0 The time to respond will be too long, and the load balancer will stop sending traffic to AT_5.\u00a0 Once it stops taking traffic and catches up, the load balancer probes are once again successful. The cycle for the AT repeats. Each of the ATs repeatedly cycles through this state, leaving this scale unit badly under provisioned.<\/p>\n<p>In conclusion, for those two hours, what we have is a vicious cycle of ATs spinning up and getting instantly overwhelmed and thrashing.<\/p>\n<h3>What caused this vicious cycle?<\/h3>\n<p>It turns out there was something recently that put much more stress on the system.\u00a0 Here\u2019s a graph of active requests per AT for the past month:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveASP.NetRequestsperAT1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ActiveASP.NetRequestsperAT1-1024x392.png\" alt=\"Active ASP.NET Requests Per AT\" width=\"747\" height=\"286\" class=\"alignnone wp-image-16325\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>You\u2019ll notice that all of the ATs received load evenly up until 2 February and again after 10 February.\u00a0 What happened between those two dates?\u00a0 We introduced a bug that caused the AT process to intermittently crash.\u00a0 When that happens, the other ATs must pick up the load.\u00a0 With enough headroom in the system, that would have been okay.\u00a0 However, multiple ATs were going down at the same time.\u00a0 That puts the system under more pressure and when the SignalR reconnect spike event occurred, it sent the whole system into a tailspin.\u00a0 Here\u2019s a graph of process crashes leading up to and during the event (across all ATs, bucketed per 5 minutes):<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ATCrashes.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_ATCrashes-1024x274.png\" alt=\"AT Crashes \/ 5 minutes\" width=\"881\" height=\"236\" class=\"alignnone wp-image-16245\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>One embarrassing thing here is that we were partially blind to these crashes.\u00a0 We have telemetry set up to detect and alert on process crashes.\u00a0 Unfortunately, that telemetry had a bug in it.\u00a0 It only reported 10 crashes during this period, and all of those were during the incident.\u00a0 If this would have been working properly we could have likely avoided the incident altogether.\u00a0 This issue also ended up hurting this postmortem process significantly.\u00a0 The fact that we weren\u2019t seeing crashes in our telemetry invalidated the theory about what was happening here a few different times.\u00a0 It wasn\u2019t until we double-checked that telemetry, and found that it was wrong, that we were able to tie the whole story together.<\/p>\n<p>In conclusion, a broad generalization of what went wrong is that the system was running too close to the point of exhaustion due to the intermittent crashes, and an event that should not have been a problem caused a high-severity incident.<\/p>\n<h3>Mitigation<\/h3>\n<p>The mitigation that ended up working for this incident started at 12:13. We saw request rates return to normal between 12:18 \u2013 12:21 depending on the AT.\u00a0 During that period, we decided to do a manual process recycle on all of the ATs.\u00a0 Here\u2019s a chart of when we did the reboots:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_AT_RebootTimes1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_AT_RebootTimes1.png\" alt=\"Mitigation Reboot Times\" width=\"303\" height=\"193\" class=\"alignnone wp-image-16355\" \/><\/a><\/p>\n<p>It turns out that we got lucky in the order that we rebooted the instances.\u00a0 AT_IN_6, AT_IN_0 and AT_IN_4 were in a zombie state at the time we started mitigating.\u00a0 By having those three come up in close succession, along with having some other functional ATs in the mix already, the system stabilized.\u00a0 Here\u2019s the AT load balancer graph again where you can see the ATs recover around 12:20.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_LoadBalancer1.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/37\/2019\/02\/59356830_LoadBalancer1-1024x411.png\" alt=\"Number of ATs used by Load Balancer\" width=\"787\" height=\"316\" class=\"alignnone wp-image-16435\" \/><\/a><\/p>\n<p>Oddly, we unknowingly tempted our own fate by continuing to recycle instances after that point.\u00a0 You can see dips in the number of the ATs in the load balancer after things have recovered as we continued to progress through the full mitigation.\u00a0 Had those been combined with a poorly timed crash or two, we could have restarted the tailspin.\u00a0 This further shows the need to be able to have insight into what ATs the load balancer is routing requests to at real time.<\/p>\n<h3>Action Items<\/h3>\n<p>Here are the repair items from the last attempt at this root cause analysis:<\/p>\n<ol>\n<li>We\u2019ve fixed the process crash issue that was one of the causes of ATs being unavailable during this outage.<\/li>\n<li>We\u2019ve added code to our clients to have them stagger reconnect calls so that an incident that causes connections to be severed isn\u2019t made worse by SignalR sending thousands of reconnect calls simultaneously.<\/li>\n<li>We\u2019ve decreased the size of the SignalR message cache so that reconnect processing will be less CPU intensive on the ATs.<\/li>\n<\/ol>\n<p>In addition to those, here are further improvements we plan to make now that we fully understand what happened here:<\/p>\n<ol>\n<li>We need to figure out what the right setting is for MinWorker threads that we set in our w3wp process.\u00a0 An AT shouldn\u2019t be able to be overwhelmed with requests and it shouldn\u2019t thrash due to thread starvation.\u00a0 We need to optimize the thread count for best performance.<\/li>\n<li>We need to do the same analysis for MinIO threads.<\/li>\n<li>We have perf counters for Worker threads.\u00a0 We need to add one for IO threads to be able to detect if we ever have a similar problem there.<\/li>\n<li>We need to set MaxRequest threads to an amount lower than our MinWorker threads can handle.<\/li>\n<li>During an incident we need to have a view in a dashboard of the number of ATs the load balancer thinks it can access.\u00a0 If we would have had insight into this early on but the mitigation and postmortem would have happened much faster.<\/li>\n<li>Along with #5 we should consider if it\u2019s feasible to auto-mitigate the situation where the load balancer doesn\u2019t think it has enough ATs by automatically adding more.<\/li>\n<li>We need to fix our telemetry pipeline to accurately capture all AT process crashes and test it.<\/li>\n<li>We need a monitor to measure when a process is thrashing in cases similar to this.<\/li>\n<li>If we can reliably achieve #8 we should implement an auto-mitigation as part of the VssHealthAgent for recycling the process since it is unlikely to recover on its own.\u00a0 This would need to come with adequate safe guards to prevent this from getting in too tight of a cycle where it could make a bad situation worse.<\/li>\n<\/ol>\n<p>We apologize for the impact this had on our users. This incident has highlighted areas of improvements needed in our telemetry, detection and auto-mitigation capabilities. We are committed to delivering the improvements needed to avoid related issues in the future.<\/p>\n<p>&nbsp;<\/p>\n<p>Sincerely,\nTaylor Lafrinere\nGroup Engineering Manager,\u00a0VSTS Continuous Integration<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A week ago we posted an incomplete postmortem and are now following up with the completed version. If you want the full story of how we progressed through this incident, start by reading that.\u00a0 This postmortem will cover the full root cause analysis but it won\u2019t rehash the first part of the investigation. Customer Impact [&hellip;]<\/p>\n","protected":false},"author":1090,"featured_media":18098,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,1],"tags":[3],"class_list":["post-16295","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-postmortem","category-uncategorized","tag-postmortem"],"acf":[],"blog_post_summary":"<p>A week ago we posted an incomplete postmortem and are now following up with the completed version. If you want the full story of how we progressed through this incident, start by reading that.\u00a0 This postmortem will cover the full root cause analysis but it won\u2019t rehash the first part of the investigation. Customer Impact [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/posts\/16295","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/users\/1090"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/comments?post=16295"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/posts\/16295\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/media\/18098"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/media?parent=16295"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/categories?post=16295"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devopsservice\/wp-json\/wp\/v2\/tags?post=16295"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}