{"id":10335,"date":"2017-05-20T13:11:00","date_gmt":"2017-05-20T13:11:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/premier_developer\/?p=10335"},"modified":"2019-02-14T20:23:44","modified_gmt":"2019-02-15T03:23:44","slug":"monitoring-and-why-it-matters-to-you","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/premier-developer\/monitoring-and-why-it-matters-to-you\/","title":{"rendered":"Monitoring, and Why It Should Come First In Your DevOps Strategy!"},"content":{"rendered":"<p>Senior Application Development Manager, <a href=\"https:\/\/www.linkedin.com\/in\/rogueagile\/\">Dave Harrison<\/a>, spotlights the importance of instrumentation and metrics as the backbone of your DevOps movement.&nbsp; <\/p>\n<hr>\n<p>Monitoring has long been the secret sauce of DevOps. How else do we get feedback on our priorities, and actual metrics \u2013 not guesses \u2013 on which features are in use? What\u2019s often overlooked, though, is that it can also help you fight back against the wrong kind of change management \u2013 the kind that increases your bureaucratic workload and actually makes your build riskier and harder to fix. How is that possible?  <\/p>\n<h2>The Blame Game<\/h2>\n<p>Let\u2019s start with some basic negative cycles we\u2019ve all seen when there are very visible production outages. When bad things happen in production, the oddest thing starts to happen \u2013 the SDLC process dissolves into a negative cycle of blame and recriminations. As you can see below, a kind of vicious cycle breaks out: <\/p>\n<ol>\n<li>Highly visible outages in production lead to blame and mistrust between siloed groups. <\/li>\n<li>Management comes in with a mandate to reduce failures to near-zero, through change management and an increased focus on testing. <\/li>\n<li>Testing and QA go into a death march cycle where manual testing (no time for automation!) 
lengthens out release times from minutes to days or weeks. <\/li>\n<li>Change Approval Boards (CABs) and other manual authorization from stakeholders far removed from the coding also drag out the approval process to weeks. <\/li>\n<li>Release bits become larger and far more risky as more new code and functionality are released in huge batches. <\/li>\n<li>Feedback drops to near zero as user input on new functionality is delayed or lost and war room releases become more problematic. <\/li>\n<\/ol>\n<p>The end result? That critical \u201chub\u201d of the wheel \u2013 open collaboration \u2013 breaks down completely. <\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/image304.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/image_thumb279.png\" width=\"1028\" height=\"580\"><\/a> <\/p>\n<p>In the wake of a disaster like this, John Allspaw noted that two counterfactual narratives spring up: <\/p>\n<ol>\n<li><b>Blame change control.<\/b> \u201cHey, better CM practices could have prevented this!\u201d <\/li>\n<li><b>Blame testing<\/b> \u2013 \u201cIf we had better QA, we at least could have taken steps to detect it faster and recover!\u201d<\/li>\n<\/ol>\n<p>It\u2019s hard to argue with either of these. And it\u2019s true, the RIGHT kind of change controls do need to be implemented. 
But by clamping down like this, as Gene Kim has noted in <a href=\"https:\/\/www.amazon.com\/DevOps-Handbook-World-Class-Reliability-Organizations\/dp\/1942788002\/ref=sr_1_1?ie=UTF8&amp;qid=1490387727&amp;sr=8-1&amp;keywords=the+devops+handbook\">The DevOps Handbook<\/a>, \u201cin environments with low-trust, command and control cultures, the outcomes of their change control and testing countermeasures end up hurting more than they help. Builds become bigger, less frequent and more risky.\u201d Why is this?  <\/p>\n<p>This is because the dev\/QA team begins implementing increasingly clunky testing suites that take longer to execute, or writing unit tests that frequently don\u2019t catch errors in the user experience. In a pinch, the QA team begins adding a significant amount of manual smoke-testing versus automated tests. Management begins imposing long, mandatory change control board meetings every week to approve releases and go over the defects introduced in the previous week(s) \u2013 I\u2019ve seen these groups grow into the hundreds, most of whom are very far removed from the application. More controls, remote gatekeepers and a manual approval process lead to increased batch sizes and deployment lead times \u2013 which reduces our chances of a successful deployment for both dev and Ops. Our feedback loop stretches out, reducing its value. A key finding of several studies is that high-performing orgs relied more on peer review and less on external approval of changes. The more orgs rely on change approval, the worse their IT performance in both stability (MTTR and change fail rate) and throughput (deployment lead times and frequency). <\/p>\n<p>The main issue with the overreactive organization above is that it is trying to <b>focus on reliability or MTBF (Mean Time Between Failures)<\/b> &#8211; trying to prevent errors and bugs from happening. Sometimes, they even call their recap (punitive!) 
meetings \u201cZero Defect Meetings\u201d \u2013 as if that kind of operational perfection were attainable! In contrast, DevOps-savvy companies don\u2019t try to focus on MTBF \u2013 reducing their failure count. They know outages are going to happen. Instead, they try to treat each failure as an opportunity: what test was missing that could have caught this, what gap in our processes can address this next time? Above all, they focus on improving their REACTION time \u2013 their <b>time to recovery, MTTR<\/b> (Mean Time to Recover). Testing and automated instrumentation \u2013 think of that famous line about wanting \u201ccattle not pets\u201d, i.e. blowing away and recreating environments at whim \u2013 form the heart of their adaptive, flexible response strategy. <\/p>\n<h2>Telemetry To The Rescue<\/h2>\n<p>Puppet Labs \u2013 in their excellent 2014 \u201cState of DevOps\u201d report \u2013 found that organizations wanting to improve their reaction time (MTTR) benefit the most \u2013 and it\u2019s not even close, by an order of magnitude \u2013 from two technical tools\/approaches: <\/p>\n<ol>\n<li><b>Use of version control for <i>all <\/i>production artifacts &#8211; <\/b>When an error is identified in production, you can quickly either redeploy the last good state or fix the problem and roll forward, reducing the time to recover. <\/li>\n<li><b>Monitoring system and application health &#8211; <\/b>Logging and monitoring systems make it easy to detect failures and identify the events that contributed to them. Proactive monitoring of system health based on threshold and rate-of-change warnings enables us to preemptively detect and mitigate problems. <\/li>\n<\/ol>\n<p>We\u2019re going to talk about the second item \u2013 monitoring \u2013 and how it can help us avoid that vicious cycle. How can monitoring help turn the tide so we don\u2019t overreact to a production outage? 
<\/p>\n<p>There are a few fixes that can transform that reactive, vicious cycle into a responsive but measured virtuous cycle that addresses the core problems you\u2019re seeing in PROD. Some are nontechnical or more process-related than anything else \u2013 and note that fixing the issue starts with the quality of the code, as early in the process as possible: <\/p>\n<ol>\n<li><b>Adding or strengthening production telemetry (we can confirm if a fix works \u2013 and autodetect the problem next time)<\/b><\/li>\n<li>Devs begin pushing code to prod (I can quickly see what\u2019s broken and decide whether to roll back or patch). (Note on this: a rollback \u2013 going to a previous version \u2013 is almost always easier and less risky, but sometimes fixing forward and rolling out a change using your deployment process is the best way forward.)<\/li>\n<li>Peer reviews. This includes not just code deployments but ops\/IT changes to environments! (Remember The Phoenix Project: 80% of our issues were caused by unauthorized changes, often by IT to environments, and 80% of our time was spent figuring out what in this soup of changes caused the issue \u2013 before we even lifted a finger to resolve anything! I\u2019ll write more about how to do a productive peer review \u2013 especially pair programming, which is really a continuous code review \u2013 later.)<\/li>\n<li>Better <b><u>automated<\/u><\/b> testing (again, more on this later \u2013 look at Jez Humble\u2019s excellent <a href=\"https:\/\/www.amazon.com\/Continuous-Delivery-Deployment-Automation-Addison-Wesley\/dp\/0321601912\">Continuous Delivery<\/a> or <a href=\"https:\/\/www.amazon.com\/Agile-Testing-Practical-Guide-Testers\/dp\/0321534468\/ref=sr_1_1?s=books&amp;ie=UTF8&amp;qid=1490389537&amp;sr=1-1&amp;keywords=agile+testing\">Agile Testing<\/a> for more).<\/li>\n<li>Batch sizes get smaller. The secret to smooth and continuous flow is making small, frequent changes. 
<\/li>\n<\/ol>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/image305.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/31\/2019\/04\/image_thumb280.png\" width=\"1028\" height=\"594\"><\/a><\/p>\n<p>A key driver here, though, is information radiators \u2013 a term that actually comes from Toyota\u2019s Lean principles. These create a feedback loop, broadcasting issues back as quickly as possible and radiating information out on how things are going.  <\/p>\n<p>Etsy \u2013 just to take one company as an example \u2013 takes monitoring so seriously that some of their architects have been quoted as saying their monitoring systems need to be <b>more available and scalable <\/b>than the systems they\u2019re monitoring. One of their engineers put it this way: \u201cIf Engineering at Etsy has a religion, it\u2019s the Church of Graphs. If it moves, we track it. Sometimes we\u2019ll draw a graph of something that isn\u2019t moving yet, just in case it decides to make a run for it. Tracking everything is the key to moving fast, but the only way to do it is to make tracking anything easy. We enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.\u201d <\/p>\n<p>Another great thinker in the DevOps space, <a href=\"https:\/\/theagileadmin.com\/about\/ernest-mueller\/\">Ernest Mueller<\/a>, has said, \u201cOne of the first actions I take when starting in an organization is to use information radiators to communicate issues and detail the changes we are making. This is usually extremely well received by our business units, who were often left in the dark before. 
And for Deployment and Operations groups who must work together to deliver a service to others, we need that constant communication, information and feedback.\u201d  <\/p>\n<p>I have found that to be true in my career. I discovered it fairly early on in my adoption of Agile with some sportswear companies here in the Oregon region. I worked for some very personality-driven orgs with highly charged, negative dynamics between teams. As I adopted Agile, which meant broadcasting honest retrospectives \u2013 including my screw-ups and failures to meet sprint goals \u2013 I expected a Donkey Kong type response and falling hammers. The most shocking thing happened, though \u2013 the more brutally honest and upfront I was about what had gone wrong, the better my relationship became with the business and my IT partners. And mistakes we made on the team were owned up to \u2013 and they typically didn\u2019t repeat, with the group holding the culprit (including me) responsible. That kind of \u201cgovernment in the sunshine\u201d transparency and candor was the single biggest turning point of our Agile transformation.  <\/p>\n<h2>In Closing<\/h2>\n<p>It\u2019s been said, rightly, that <b>every lie we tell ourselves comes with a payoff and a price<\/b>. As developers or IT, we\u2019ve been very used to thinking we are AWESOME and WONDERFUL and the other guys are evil\/obstructive\/etc. Maybe that story \u2013 which has the short-term payoff of making us feel good about our performance \u2013 comes with a heavy price: limiting our success in rolling out applications that are easy to manage and maintain, and in delivering business value faster. By using instrumentation and telemetry, we demonstrate that we are not lying to ourselves or to our customers\/the business. 
And suddenly a lot of those highly charged, politically sensitive meetings you find yourself in lose much of their subjectivity and poison \u2013 the focus shifts to improving the numbers instead of the negative punish\/blame scenario. <\/p>\n<p>Like testing, instrumentation and monitoring seem to be a bolt-on or an afterthought in every project. That\u2019s a huge mistake. <b>Make instrumentation and metrics the backbone of your DevOps movement<\/b>, as they\u2019re the only things that will tell you whether you\u2019re making measurable progress \u2013 and they\u2019ll earn you credibility in the eyes of the business.  <\/p>\n<p><b>Don\u2019t let your developers tell you that it\u2019s too hard, and don\u2019t let it be an afterthought.<\/b> It takes just a few minutes to make your release and application availability metrics available to all. <\/p>\n<p>And if your telemetry system is difficult to implement or doesn\u2019t collect the metrics you need, think about switching to another tool. Remember the Etsy lesson \u2013 making it easy and quick is the way to go. If your tool isn\u2019t easy to use and customize, it needs to be junked. That\u2019s one reason among many why I really like Application Insights! 
<\/p>\n<hr align=\"center\" size=\"3\" width=\"100%\">\n<p><a href=\"https:\/\/blogs.msdn.com\/b\/premier_developer\/archive\/2014\/09\/15\/welcome.aspx\"><strong>Premier Support for Developers<\/strong><\/a> provides strategic technology guidance, critical support coverage, and a range of essential services to help teams optimize development lifecycles and improve software quality.&nbsp; Contact your Application Development Manager (ADM) or <a href=\"https:\/\/blogs.msdn.microsoft.com\/premier_developer\/contact-us\/\">email us<\/a><b><\/b> to learn more about what we can do for you.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Senior Application Development Manager, Dave Harrison, spotlights the importance of instrumentation and metrics as the backbone of your DevOps movement.&nbsp; Monitoring has long been the secret sauce of DevOps. How else do we get feedback on our priorities, and actual metrics \u2013 not guesses \u2013 on which features are in use? What\u2019s often overlooked though [&hellip;]<\/p>\n","protected":false},"author":582,"featured_media":37840,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[35],"tags":[34,273,3],"class_list":["post-10335","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-alm","tag-alm","tag-monitoring","tag-team"],"acf":[],"blog_post_summary":"<p>Senior Application Development Manager, Dave Harrison, spotlights the importance of instrumentation and metrics as the backbone of your DevOps movement.&nbsp; Monitoring has long been the secret sauce of DevOps. How else do we get feedback on our priorities, and actual metrics \u2013 not guesses \u2013 on which features are in use? 
What\u2019s often overlooked though [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/posts\/10335","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/users\/582"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/comments?post=10335"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/posts\/10335\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/media\/37840"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/media?parent=10335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/categories?post=10335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/premier-developer\/wp-json\/wp\/v2\/tags?post=10335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}