{"id":1781,"date":"2013-11-12T13:04:23","date_gmt":"2013-11-12T13:04:23","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/bharry\/2013\/11\/12\/listen-to-your-service-it-is-talking-to-you\/"},"modified":"2024-03-28T14:00:56","modified_gmt":"2024-03-28T21:00:56","slug":"listen-to-your-service-it-is-talking-to-you","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/bharry\/listen-to-your-service-it-is-talking-to-you\/","title":{"rendered":"Listen to your service \u2013 it is talking to you"},"content":{"rendered":"<p>We had a nasty enough service incident on Friday that I can\u2019t let it pass without commenting on it.\u00a0 If you follow my blog or the service, then you know we were doing a deployment on Friday.\u00a0 The natural assumption, both inside and outside Microsoft, was that it was caused by the deployment.\u00a0 It wasn\u2019t, but more on that in a minute.<\/p>\n<p>We received notable Twitter traffic, blog comments and emails both about this incident and the pattern of incidents.\u00a0 I\u2019m going to tackle both topics (at least lightly) here.\u00a0 First, people are justified to be upset about Friday\u2019s incident.\u00a0 The service was mostly down for about 3 hours.\u00a0 That\u2019s clearly not acceptable.\u00a0 There was also a pattern of \u201cthe service has incidents too often\u201d.\u00a0 I agree.\u00a0 We know it and we are constantly working on making it better.\u00a0 And, actually, it is getting consistently better.\u00a0 It\u2019s not good enough but here is a trend of significant incidents over the past several months.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/2260.image_thumb_20A9E10F.png\"><img decoding=\"async\" class=\"alignnone size-full wp-image-15834\" src=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/2260.image_thumb_20A9E10F.png\" alt=\"Image 2260 image thumb 20A9E10F\" width=\"371\" height=\"165\" srcset=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/2260.image_thumb_20A9E10F.png 371w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/2260.image_thumb_20A9E10F-300x133.png 300w\" sizes=\"(max-width: 371px) 100vw, 371px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>We\u2019ve got a lot of work to do but we\u2019re making progress.\u00a0 Not only are incidents generally getting less frequent, they are getting less severe (though Friday\u2019s was pretty severe).\u00a0 Incidents as bad as Friday\u2019s are actually reasonably rare.<\/p>\n<p>A few of the comments I got about the service availability referenced our <a href=\"http:\/\/blogs.msdn.com\/b\/tfservice\/\">service status blog<\/a> as evidence that incidents happen almost every day.\u00a0 Well, kind of.\u00a0 It\u2019s true that some incident happens almost every day.\u00a0 However, the majority of incidents turn out to be very minor \u2013 with little or no customer impact.\u00a0 For now we have chosen to err on the side of notifying of potential issues before clarifying the impact.\u00a0 My guidance to the team is \u201cIf a customer is seeing a problem, they shouldn\u2019t go to the service status page and have it tell them everything is fine \u2013 that\u2019s a very frustrating experience\u201d.\u00a0 However, it causes us to be very chatty.\u00a0 Our plan it to improve our service health status reporting so customers can better understand whether incidents likely affect them \u2013 but that\u2019s months away.<\/p>\n<p>None of that takes away from the fact that we want and need to have fewer impactful events.<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/b\/bharry\/archive\/2013\/10\/14\/how-do-you-measure-quality-of-a-service.aspx\">Availability and service quality<\/a> has been on my mind a lot lately as the service continues to mature and that\u2019s why I wrote a post on the topic a few weeks ago.\u00a0 As we analyzed various availability models, we took 6 months or so of historical data and tested the models against it.\u00a0 What we found is that dips in availability correlate very strongly with our deployment weeks \u2013 in other words, when we change things, stuff breaks.\u00a0 It proved the old operations maxim behind the rule \u201cthe best way to run a highly available service is to never change it\u201d.\u00a0 True but misses the point.<\/p>\n<p>So, it\u2019s not surprising that people would assume the outage we had on Friday was the result of our deployment. To be honest, I did too.\u00a0 However, Monday we did a deep dive on the root cause and it was enlightening.<\/p>\n<p>There were actually two distinct (and as best we can tell, unrelated) issues that happened on Friday.\u00a0 The first was a cache overflow and the second was a synchronization task that bogged down the system.<\/p>\n<p><strong>Cache overflow<\/strong> \u2013 we keep a cache of group membership resolution so that we don\u2019t have to go back to the database to check people\u2019s group membership every time we need it.\u00a0 The cache had 1,024 entries in it and a bad eviction policy.\u00a0 The load got high enough (we hit a new usage record on Friday) that the cache overflowed and began to thrash.\u00a0 The next problem was that the cache fault algorithm held a lock on the cache while it fetched the data from the database \u2013 never, ever, ever do that.\u00a0 That caused the thrashing cache to queue faults and the system backed up to the point that no one could log in any longer.\u00a0 We took away several action items: 1) increase the cache size 2) fetch the data outside the lock 3) switch to an LRU eviction policy and 4) add a new feature to our lock manager and data access layer so that when the right configuration setting is set, we throw an exception anytime a SQL call is made while a lock is held.<\/p>\n<p><strong>Sync job<\/strong> \u2013 We have a synchronization job that updates work item tracking with relevant list of users (for instance candidates to assign a work item to).\u00a0 We have to sync any relevant changes to a user (like display name changes).\u00a0 The trigger for the problem was two fold \u2013 an increase in the number of changes to identities and an increase in the total number of identities in the system.\u00a0 The root cause was a couple of bad design decisions.\u00a0 The first was the fact that any identity change triggers the identity to be a candidate for synchronization \u2013 even if work item tracking doesn\u2019t use the property that changed.\u00a0 So we need a filter to only deal with synchronization when a relevant property is changed.\u00a0 The more serious issue though was the \u201cdirectionality\u201d of the algorithm.\u00a0 What it did was query all changed identities in the system and \u201cjoin\u201d that to the list of relevant identities in an account.\u00a0 The problem is, that particularly on the public service, the number of change identities dwarfs, by orders of magnitude, the number of relevant identities in any given account.\u00a0 The result is hundreds of times more processing by, effectively, using the wrong index.\u00a0 The quick mitigation was to reverse the join and sync job times dropped from, sometimes, minutes to milliseconds.<\/p>\n<p>Neither of these things changed in the deployment.\u00a0 It was just a coincidence that the thresholds that triggered the issues happened on Friday.\u00a0 So my question was, if these things were a result of increased load\/data volume, then that didn\u2019t change overnight.\u00a0 What were the symptoms that we should have been able to detect ahead of time?\u00a0 The problem with these kind of things is that failures like this happen gradually and then all at once.\u00a0 It gets a little worse and a little worse and then the *** hits the fan.<\/p>\n<p>We generally have a \u201chygiene\u201d process that involves tracking exceptions, event log entries, response times, available memory, etc.\u00a0 The goal is to detect the early signs of a problem and address it before it passes the knee curve.\u00a0 The problem is that some of these measures are particularly noisy and too easy to convince yourself that you are just seeing noise.\u00a0 Clearly we have some work to do to change our assessment practice here.<\/p>\n<p>To help visualize this, here is a graph of the average times for the group membership resolution times.\u00a0 As you can see, it was regularly spiking at close to 10 seconds (the units of the y-axis are microseconds).\u00a0 In retrospect everyone knows that\u2019s not expected.\u00a0 Typical times should be in milliseconds.<\/p>\n<p>On Friday, the averages spiked to about 30 seconds and was enough to start triggering ti\nmeouts, etc.\u00a0 Not good.\u00a0 Should have seen this before Friday and never let it happen.<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/6278.clip_image002_thumb_709F8F8D.jpg\"><img decoding=\"async\" class=\"alignnone size-full wp-image-15835\" src=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/6278.clip_image002_thumb_709F8F8D.jpg\" alt=\"Image 6278 clip image002 thumb 709F8F8D\" width=\"804\" height=\"260\" srcset=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/6278.clip_image002_thumb_709F8F8D.jpg 804w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/6278.clip_image002_thumb_709F8F8D-300x97.jpg 300w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/11\/6278.clip_image002_thumb_709F8F8D-768x248.jpg 768w\" sizes=\"(max-width: 804px) 100vw, 804px\" \/><\/a><\/p>\n<p>As a brief aside, I\u2019m breaking one of my golden rules here.\u00a0 Never use average data.\u00a0 Use percentiles.\u00a0 Averages hide \u201cthe truth\u201d.\u00a0 Lots of \u201cgood\u201d values can hide fewer VERY bad values.\u00a0 Look at 50th percentile or 75th ot 90th, etc.\u00a0 Don\u2019t look at averages.\u00a0 It turns out averages are easy to compute so, in a rush, it sufficed to demonstrate the problem here but it\u2019s never what I would use in a formal analysis.<\/p>\n<p>Anyway, I\u2019m sorry for the incident on Friday.\u00a0 There were plenty of learnings.\u00a0 The biggest one I want to share is that you have to listen to your service.\u00a0 Very often it will \u201csilently\u201d get gradually degrade and then it will catastrophically fail.\u00a0 You have to catch it before the cliff.<\/p>\n<p>Brian<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We had a nasty enough service incident on Friday that I can\u2019t let it pass without commenting on it.\u00a0 If you follow my blog or the service, then you know we were doing a deployment on Friday.\u00a0 The natural assumption, both inside and outside Microsoft, was that it was caused by the deployment.\u00a0 It wasn\u2019t, [&hellip;]<\/p>\n","protected":false},"author":244,"featured_media":14617,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[13],"class_list":["post-1781","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-tfservice"],"acf":[],"blog_post_summary":"<p>We had a nasty enough service incident on Friday that I can\u2019t let it pass without commenting on it.\u00a0 If you follow my blog or the service, then you know we were doing a deployment on Friday.\u00a0 The natural assumption, both inside and outside Microsoft, was that it was caused by the deployment.\u00a0 It wasn\u2019t, [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/1781","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/users\/244"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/comments?post=1781"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/1781\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media\/14617"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media?parent=1781"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/categories?post=1781"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/tags?post=1781"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}