{"id":1871,"date":"2013-10-14T06:43:20","date_gmt":"2013-10-14T06:43:20","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/bharry\/2013\/10\/14\/how-do-you-measure-quality-of-a-service\/"},"modified":"2019-08-01T21:07:02","modified_gmt":"2019-08-01T21:07:02","slug":"how-do-you-measure-quality-of-a-service","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/bharry\/how-do-you-measure-quality-of-a-service\/","title":{"rendered":"How do you measure quality of a service?"},"content":{"rendered":"<p>Since we started down the path of building an online service a couple of years ago, I have learned a lot.\u00a0 One of the things I\u2019ve learned a lot about is measuring the health of a service.\u00a0 I don\u2019t pretend to have the only solution to the problem so I\u2019m happy to have anyone with a differing opinion chime in.<\/p>\n<p>For the purpose of this post, I\u2019m defining the \u201cquality of a service\u201d as the degree to which it is available and responsive.<\/p>\n<h3>The problem<\/h3>\n<p>The \u201ctraditional way\u201d of tackling this problem is what\u2019s called \u201csynthetic transactions\u201d.\u00a0 In this approach, you create a \u201ctest agent\u201d that is going to make some request to your service over and over again every N minutes.\u00a0 A failed response indicates a problem and that time window is marked as \u201cfailing\u201d.\u00a0 You then take the number of failed intervals and divide by the total number of intervals in a trailing window, let\u2019s say 30 days, for instance, and that becomes your availability metric.<\/p>\n<p>So what\u2019s wrong with this?\u00a0 Let me start with a story\u2026<\/p>\n<p>When we first launched Team Foundation Service, we had a lot of problems with SQL Azure.\u00a0 We were one of the first high scale, interactive services to go live on SQL Azure and, in the process, discovered quite a lot of issues (it\u2019s much better now, in case you are wondering).\u00a0 But, 3 or 4 
months after we launched the service, I was in Redmond and was paying a visit to a couple of the leaders of the SQL Azure team to talk about how the SQL Azure issues were killing us and I needed to understand their plan for addressing the issues quickly.<\/p>\n<p>As I walked through the central hallway on their floor, I noticed they had a service dashboard rotating through a set of screens displaying data about the live service.\u00a0 As an aside, this is a pretty common practice (we do it too).\u00a0 It\u2019s a good way to emphasize to the team that in a service business, \u201clive-site\u201d is the most important thing.\u00a0 I stopped for a few minutes to just watch the screens scroll by and see what they said about their service.\u00a0 Everything was <span style=\"color: #008000;\">green<\/span>.\u00a0 In fact, looking at the dashboard, you\u2019d have no clue there were any problems \u2013 availability was good, performance was good, etc, etc.\u00a0 As a user of the service, I can assure you, there was nothing <span style=\"color: #008000;\">green<\/span> about it.\u00a0 I was pretty upset and it made for a colorful beginning to the meeting I was headed for.<\/p>\n<p>Again, before everyone goes and says \u201cBrian said SQL Azure sucks\u201d: what I said is that 2 years ago it had some significant reliability issues for us \ud83d\ude42 .\u00a0 While it\u2019s not perfect now, it works well and I can honestly say that I\u2019m not sure we could run our service easily without it.\u00a0 The high scale elastic database pool it provides is truly fantastic.<\/p>\n<p>So how does this happen?\u00a0 How is it that the people who run the service can have a very different view on the health of the service than the people who use the service?\u00a0 Well, there are many answers but some of them have to do with how you measure and evaluate the health of a service.<\/p>\n<p>Too often measurements of the health of a service don\u2019t reflect the experience customers 
actually have.\u00a0 The \u201ctraditional\u201d model that I described above can lead to this.\u00a0 When you run synthetic transactions, you generally have to run them against some subset of the service endpoints, against some subset of the data.\u00a0 Further, while it\u2019s easy to exercise the \u201cread\u201d paths, the \u201cwrite\u201d paths are trickier because you often don\u2019t actually want to change the data.\u00a0 So to bring this home, in the early days of TFService, we set up something similar and had a few synthetic transactions that would log in, ping a couple of web pages, read some work items, etc.\u00a0 That all happened in a test account that our service delivery team created (because we couldn\u2019t be messing with customer accounts, of course).\u00a0 Every customer of our system could, theoretically, be down and our synthetic transactions could still be working fine.<\/p>\n<p>That\u2019s the fundamental problem with this approach in my humble opinion.\u00a0 Your synthetic transactions only exercise a small subset of the data (especially in an isolated multi-tenant system) and a small subset of the end-points, leaving lots of ways to miss the experience your customers are actually having.<\/p>\n<p>Another mistake I\u2019ve seen is evaluating the service in too much of an aggregate view.\u00a0 You might say 99% of my requests are successful and you might feel OK about that.\u00a0 But if all those failures are clustered on a small number of customers, those customers will abandon you.\u00a0 And then the next set and so forth.\u00a0 So you can\u2019t blur your eyes too much.\u00a0 You need to understand what is happening to individual customers.<\/p>\n<h3>A solution<\/h3>\n<p>OK, enough about the problem, let\u2019s talk about our journey to a solution.<\/p>\n<p>One of the big lessons I learned from the very beginning was that I wanted our primary <strong>measure of availability to be based on real customer experience rather than on synthetic 
transactions<\/strong> (we still use synthetic transactions, but more on that later).\u00a0 Fortunately, for years, TFS has had a capability that we call \u201cActivity logging\u201d.\u00a0 It records every request to the system, who made it, when it arrived, how long it took, whether or not it succeeded, etc.\u00a0 This has been incredibly valuable in understanding and diagnosing issues in TFS.<\/p>\n<p>Another of the lessons I learned is that <strong>any measure of \u201cavailability\u201d, if you want it to be a meaningful measure of customer experience, needs to represent both reliability and performance<\/strong>.\u00a0 Just counting failed requests leaves a major gap.\u00a0 If your users have to wait too long, the system can be just as unusable as if it\u2019s not responding at all.<\/p>\n<p>Lastly, any measure of <strong>availability should reflect the overall system health and not just the health of a given component<\/strong>.\u00a0 You may feel good that a component is running well but if a user needs to interact with 3 components to get anything done, only one of them has to have a problem to cause the user to fail.<\/p>\n<p>Our first cut at an availability metric was to count requests in the activity log.\u00a0 The formula was availability = (total requests \u2013 failed requests \u2013 slow requests) \/ total requests.\u00a0 For a long time, this served us pretty well.\u00a0 It did a good job of reflecting the kinds of instability we were experiencing.\u00a0 It was based on real user experience and included both reliability and performance.\u00a0 We also did outside-in monitoring with synthetic transactions, BTW, but that wasn\u2019t our primary availability metric.<\/p>\n<p>Over the past 6 months or so, we\u2019ve found this measure increasingly diverging from what we believe the actual service experience to be.\u00a0 It\u2019s been painting a rosier picture than reality.\u00a0 Why?\u00a0 There are a number of reasons.\u00a0 I believe the primary 
phenomenon is what I\u2019ll call \u201cmodified behavior\u201d.\u00a0 If you hit a failed request, for a number of reasons, you may not make any more requests.\u00a0 For instance, if you try to kick off a build and it fails, all the requests that the build would have caused never happen and never get the opportunity to fail.\u00a0 As a result, you undercount the total number of requests that would have failed if the user had actually been able to make progress.\u00a0 And, of course, if the system isn\u2019t working, your users don\u2019t just sit and beat their heads against the wall, they go get lunch.\u00a0 In this model, if no one is using the system, the availability is 100% (well, OK, actually it\u2019s undefined since the denominator is also 0, but you get the point) \ud83d\ude42 .<\/p>\n<p>We\u2019ve been spending the last several months working on a new availability model.\u00a0 We\u2019ve tried dozens and modeled them over all our data to see what we think appropriately reflects the \u201creal user experience\u201d.\u00a0 In the end, nothing else matters.<\/p>\n<p>The data is still measuring the success and failure of real user requests as represented in the activity log.\u00a0 But the computation is very different.\u00a0 One additional constraint we tried to solve for was that we wanted a measure that could be applied equally to either an individual customer to measure their experience or to the aggregate of all of our customers.\u00a0 This will ultimately be valuable when we do get into the business of needing to actually provide refunds for SLA violations.<\/p>\n<p>First, like traditional monitoring, we\u2019ve introduced a \u201ctime penalty\u201d for every failure.\u00a0 That is to say, if we get a failure, then we mark an entire time interval as failed.\u00a0 This is intended to address the \u201cmodified behavior\u201d phenomenon I described above.\u00a0 It changes the numerator from a request count to a time period.\u00a0 We need to change the 
denominator to a time period as well to make the math work.\u00a0 We could have just used # of customers or users multiplied by # of intervals in a month but that really dampens the availability curve.\u00a0 Instead we wanted the denominator to reflect the number of people actually trying to use the service and the duration in which they tried.\u00a0 To do that, we defined an aggregation period.\u00a0 Any customer who uses the service in the aggregation period gets counted as part of the denominator.\u00a0 So, let\u2019s look at the formula.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-15418\" src=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/7180.image_4EEFAB95.png\" alt=\"\" width=\"851\" height=\"261\" srcset=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/7180.image_4EEFAB95.png 851w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/7180.image_4EEFAB95-300x92.png 300w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/7180.image_4EEFAB95-768x236.png 768w\" sizes=\"(max-width: 851px) 100vw, 851px\" \/><\/p>\n<p>In English, the process works like this:<\/p>\n<p>For each customer who used the service in a 5 minute aggregation period, count the number of minutes in which they experienced a failure (failed request or slow request).\u00a0 Sum up all those 1 minute failing intervals across all customers that used the service.\u00a0 Subtract that from the number of customers who used the service in the 5 minute aggregation period multiplied by 5 minutes.\u00a0 That gives you the number of \u201csuccessful customer minutes\u201d in that 5 minute aggregation period.\u00a0 Divide that by the total customer minutes (number of customers who used the service in the 5 minute aggregation period multiplied by 5 minutes) and that gives you a % of customer success.\u00a0 Average that over all of the 5 minute aggregation 
periods (288 in 24 hours) in the window to get a % availability.<\/p>\n<p>We\u2019re still tweaking the values for 1 min intervals, 5 min aggregation period, 10 sec perf threshold.<\/p>\n<p>Of all the models we\u2019ve tried, this model provides a result that is reasonably intuitive, reasonably reactive to real customer problems (without being hyperactive) and more closely matches the experience we believe our customers are actually seeing.\u00a0 It\u2019s based on real customer experience, not synthetic transactions, and captures every single issue that any customer experiences in the system.\u00a0 To visualize the difference, look at the graph below.\u00a0 The orange line is the old availability model.\u00a0 The blue line shows the results of the new one.\u00a0 What you are seeing is a graph of the 24 hour availability numbers.\u00a0 It will dampen a bit more when we turn it into a 30 day rolling average for SLA computation.<\/p>\n<p><img decoding=\"async\" class=\"alignnone size-full wp-image-15419\" src=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/0027.image_2D9005F9.png\" alt=\"\" width=\"837\" height=\"304\" srcset=\"https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/0027.image_2D9005F9.png 837w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/0027.image_2D9005F9-300x109.png 300w, https:\/\/devblogs.microsoft.com\/bharry\/wp-content\/uploads\/sites\/8\/2013\/10\/0027.image_2D9005F9-768x279.png 768w\" sizes=\"(max-width: 837px) 100vw, 837px\" \/><\/p>\n<p>There\u2019s a saying \u201cThere are lies, damn lies and statistics\u201d.\u00a0 I can craft an availability model that will say anything I want.\u00a0 I can make it look really good or really bad.\u00a0 Neither of those is, of course, the goal.\u00a0 What you want is an availability number that tells you what your customers experience.\u00a0 You want it to be bad when your customers are unhappy and good when your 
customers are satisfied.<\/p>\n<h3>Is that all you need?<\/h3>\n<p>Overall, I find this model works very well but there\u2019s still something missing.\u00a0 The problem is that no matter where you put your measurement, there can always be a failure in front of it.\u00a0 In our case, the activity log is collected when the request arrives at our service.\u00a0 It could fail in the IIS pipeline, in the Azure network, in the Azure load balancer, in the ISP, etc, etc.\u00a0 This is a place where we will use synthetic transactions because you are primarily just testing that a request can get through to your system.\u00a0 We use our Global Service Monitor service to place end points around the world and execute synthetic transactions every few minutes.\u00a0 We have some ideas for how we will integrate this numerically into our availability model but probably won\u2019t do so for a few months (this is not one of our real problems at the moment).<\/p>\n<p>When I first started into this space, the head of Azure operations said to me \u2013 outside-in monitoring (what GSM, Keynote, Gomez, etc do) just measures the availability of the internet and \u201ctest in production\u201d \u2013 running tests inside your own data center \u2013 measures the health of your app.\u00a0 I thought it was insightful.\u00a0 I think you still need to do it but it\u2019s important to think about the role it plays in your overall health assessment strategy.<\/p>\n<h3>A word about SLAs<\/h3>\n<p>I can\u2019t leave, even this ridiculously long post, without a word about SLAs (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Service-level_agreement\">Service Level Agreements<\/a>).\u00a0 An SLA generally defines the minimum level of service that a customer can expect from you.\u00a0 The phenomenon I\u2019ve seen happen in team after team is, once the SLA is defined, it becomes the goal.\u00a0 If we promise 99.9% availability in the SLA then the goal is 99.9% availability.\u00a0 My team and others have heard me 
rant about this far too many times, I suspect.\u00a0 <strong>The SLA is not the goal!\u00a0 The SLA is the worst you can possibly do before you have to give the customer their money back.<\/strong>\u00a0 The goal is 100% availability (or something close to that).<\/p>\n<p>Of course all of these things are trade offs.\u00a0 How much work does it take to get the last 0.0001% availability and how many great new features could you be providing instead.\u00a0 So, I\u2019ll never make my team do everything that is necessary to never have a single failure.\u00a0 But we\u2019ll investigate every failure we learn of and understand what we could do about it to prevent it and evaluate the cost benefit, knowing the issue and the solution.\u00a0 Right now, I\u2019m pushing for us to work towards 99.99% availability on a regular basis (that\u2019s 4.32 minutes of unexpected downtime a month).<\/p>\n<p>Sorry for the length.\u00a0 Hopefully it\u2019s at least somewhat useful to someone out there \ud83d\ude42\u00a0 As always, comments are welcome.<\/p>\n<p>Brian<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since we started down the path of building an online service a couple of years ago, I have learned a lot.\u00a0 One of the things I\u2019ve learned a lot about is measuring the health of a service.\u00a0 I don\u2019t pretend to have the only solution to the problem so I\u2019m happy to have anyone with [&hellip;]<\/p>\n","protected":false},"author":244,"featured_media":14617,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[13],"class_list":["post-1871","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-tfservice"],"acf":[],"blog_post_summary":"<p>Since we started down the path of building an online service a couple of years ago, I have learned a lot.\u00a0 One of the things I\u2019ve learned a 
lot about is measuring the health of a service.\u00a0 I don\u2019t pretend to have the only solution to the problem so I\u2019m happy to have anyone with [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/1871","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/users\/244"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/comments?post=1871"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/1871\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media\/14617"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media?parent=1871"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/categories?post=1871"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/tags?post=1871"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}