{"id":13905,"date":"2018-03-02T07:58:06","date_gmt":"2018-03-02T12:58:06","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/bharry\/?p=13905"},"modified":"2019-02-27T03:42:09","modified_gmt":"2019-02-27T03:42:09","slug":"a-good-incident-postmortem","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/bharry\/a-good-incident-postmortem\/","title":{"rendered":"A good incident postmortem"},"content":{"rendered":"<p>I wanted to call your attention to a good <a href=\"https:\/\/blogs.msdn.microsoft.com\/vsoservice\/?p=16295\">incident postmortem<\/a> done by Taylor Lafrinere this week.\u00a0 Taylor sits in my team room and, for a week, I saw him bent over his keyboard, often with two or three people staring over his shoulders trying to figure out what had caused this incident and what we needed to do to prevent it in the future.\u00a0 This is the kind of tenacity you need to run a highly available service in the long term.\u00a0 Only if you really understand the root cause and build mitigations and resiliency will you get there.<\/p>\n<p>It&#8217;s a bit long and detailed but it&#8217;s a good read.<\/p>\n<p>This is also a good opportunity for me to comment on our reliability of late.\u00a0 In many important ways, we are in much better shape than we have ever been.\u00a0 We have more reliability and isolation infrastructure in place than ever before.\u00a0 We very rarely have incidents that affect a large percentage of customers any more.\u00a0 Our practices help us isolate the effects of incidents so most people are completely unaware we are having issues.<\/p>\n<p>However, in the past few months, we&#8217;ve had too many of those &#8220;smaller&#8221; incidents and, unfortunately, a disproportionate number of them have been on our European instances, so the availability of our European instances has looked much worse than the overall service availability.\u00a0 There&#8217;s no one reason Europe has been hit hardest &#8211; 
it&#8217;s many reasons and we are taking steps to address them.<\/p>\n<p>Many of the issues have been self-inflicted &#8211; by that I mean code defects that got checked in, deployed and not caught until they caused issues for customers.\u00a0 In part that&#8217;s because we are making some pretty large systemic\/structural changes to the service right now (you&#8217;ll hear more about the resulting new capabilities in the next few months) and the level of rigor we typically apply just hasn&#8217;t been up to the magnitude of the churn that&#8217;s happening.\u00a0 We are working to improve that level of rigor while we simultaneously continue to improve isolation and resiliency.\u00a0 The RCA above is a great example of the ongoing effort and learnings that go into every incident we experience.<\/p>\n<p>At the same time, I recognize all that matters is that the service is good and healthy and doing what you need it to do.\u00a0 It hasn&#8217;t been as healthy as it should have been lately.\u00a0 For that I want to apologize.\u00a0 We are working hard to address the underlying issues and this dip in health will get fixed and we&#8217;ll come out of it stronger and more resilient than ever.<\/p>\n<p>Thank you,\nBrian<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I wanted to call your attention to a good incident postmortem done by Taylor Lafrinere this week.\u00a0 Taylor sits in my team room and, for a week, I saw him bent over his keyboard, often with two or three people staring over his shoulders trying to figure out what had caused this incident and what 
[&hellip;]<\/p>\n","protected":false},"author":244,"featured_media":14617,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[9,21],"class_list":["post-13905","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-vs-team-services","tag-vsts"],"acf":[],"blog_post_summary":"<p>I wanted to call your attention to a good incident postmortem done by Taylor Lafrinere this week.\u00a0 Taylor sits in my team room and, for a week, I saw him bent over his keyboard, often with two or three people staring over his shoulders trying to figure out what had caused this incident and what [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/13905","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/users\/244"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/comments?post=13905"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/13905\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media\/14617"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media?parent=13905"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/categories?post=13905"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/tags?post=13905"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","
templated":true}]}}