Monitoring, and Why It Should Come First In Your DevOps Strategy!
Senior Application Development Manager, Dave Harrison, spotlights the importance of instrumentation and metrics as the backbone of your DevOps movement.
Monitoring has long been the secret sauce of DevOps. How else do we get feedback on our priorities, and actual metrics – not guesses – on which features are in use? What’s often overlooked though is that it can actually help you fight back against the wrong kind of change management – one that increases your bureaucratic workload and actually makes your build riskier and harder to fix. How is that possible?
The Blame Game
Let’s start with some basic negative cycles we’ve all seen when there are very visible production outages. When bad things happen in production, we immediately start seeing the oddest thing happen – the SDLC process starts to dissolve into a negative cycle of blame and recriminations. As you can see below, a kind of vicious cycle breaks out:
- Highly visible outages in production lead to blame and mistrust between siloed groups
- Management comes in with a mandate on reducing failures to near-zero, through change management and an increased focus on testing.
- Testing and QA go into a death march cycle where manual testing (no time for automation!) lengthens out release times from minutes to days or weeks.
- Change Approval Boards (CAB) and other manual authorizations from stakeholders far removed from the coding also drag out the approval process to weeks.
- Release bits become larger and far more risky as more new code and functionality are released in huge batches
- Feedback drops to near zero: user feedback on new functionality is delayed or lost as war room releases become more problematic
The end result? That critical “hub” of the wheel – open collaboration – breaks down completely.
In the wake of a disaster like this, John Allspaw noted that there are two counterfactual narratives that spring up:
- Blame change control – “Hey, better CM practices could have prevented this!”
- Blame testing – “If we had better QA, we at least could have taken steps to detect it faster and recover!”
It’s hard to argue with either of these. And it’s true, the RIGHT kind of change controls do need to be implemented. But by clenching like this, as Gene Kim has noted in The DevOps Handbook, “in environments with low-trust, command and control cultures, the outcomes of their change control and testing countermeasures end up hurting more than they help. Builds become bigger, less frequent and more risky.” Why is this?
This is because the dev/QA team begins implementing increasingly clunky testing suites that take longer to execute, or writing unit tests that frequently don’t catch errors in the user experience. In a pinch, the QA team begins adding a significant amount of manual smoke-testing in place of automated tests. Management begins imposing long, mandatory change control boards every week to approve releases and go over introduced defects from the previous week(s) – I’ve seen these groups grow into the hundreds, most of whom are very far removed from the application. More controls, remote gatekeepers and a manual approval process lead to increased batch sizes and deployment lead times – which reduces our chances of a successful deployment for both Dev and Ops. Our feedback loop stretches out, reducing its value. A key finding of several studies is that high-performing orgs rely more on peer review and less on external approval of changes. The more orgs rely on change approval, the worse their IT performance in both stability (MTTR and change fail rate) and throughput (deployment lead times and frequency).
The main issue with the overreactive organization above is that it is trying to focus on reliability or MTBF (Mean Time Between Failures) – trying to prevent errors and bugs from happening. Sometimes, they even call their recap (punitive!) meetings “Zero Defect Meetings” – as if that kind of operational perfection were attainable! In contrast, DevOps-savvy companies don’t try to focus on MTBF – reducing their failure count. They know outages are going to happen. Instead, they try to treat each failure as an opportunity – what test was missing that could have caught this, what gap in our processes can we address for next time? Above all, they focus on improving their REACTION time – improving their time to recovery, MTTR (Mean Time to Recover). Testing and automated instrumentation – that famous passage about wanting “cattle not pets”, i.e. blowing away and recreating environments at whim – form the heart of their adaptive, flexible response strategy.
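To make the two measures concrete, here is a minimal sketch of how MTTR and MTBF could be computed from an incident log. The incident timestamps and the function names are invented for illustration; a real team would pull these from its own monitoring or ticketing data.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage start, service restored) pairs.
INCIDENTS = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 15)),
    (datetime(2024, 1, 24, 2, 0), datetime(2024, 1, 24, 3, 0)),
]


def mean_time_to_recover(incidents):
    """MTTR: average time from the start of an outage to its resolution."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents):
    """MTBF: average uptime between one recovery and the next outage."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)
```

With the sample data, mean_time_to_recover(INCIDENTS) comes to 40 minutes and mean_time_between_failures(INCIDENTS) to 10 days and 8 hours. The first number is the one a team can most directly drive down with better telemetry and faster rollbacks.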
Telemetry To The Rescue
Puppet Labs – in their excellent 2014 “State of DevOps” report – mentioned that organizations that want to improve on their reaction time (MTTR) benefit the most – and it’s not even close, by an order of magnitude – from two technical tools/approaches:
- Use of version control for all production artifacts – When an error is identified in production, you can quickly either redeploy the last good state or fix the problem and roll forward, reducing the time to recover.
- Monitoring system and application health – Logging and monitoring systems make it easy to detect failures and identify the events that contributed to them. Proactive monitoring of system health based on threshold and rate-of-change warnings enables us to preemptively detect and mitigate problems.
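As a rough illustration of threshold and rate-of-change warnings, the sketch below checks a short series of readings for a single metric. The function name, its parameters and the sample numbers are assumptions made for this example and do not come from the report or from any particular monitoring product.

```python
def check_metric(samples, threshold, max_rise_per_sample):
    """Return the kinds of warning raised by a series of metric readings.

    samples: readings in time order, oldest first (e.g. disk usage in %).
    threshold: warn when the latest reading is at or above this value.
    max_rise_per_sample: warn when the average increase between
        consecutive readings exceeds this value.
    """
    warnings = []
    if not samples:
        return warnings

    # Threshold warning: the metric is already in the danger zone.
    if samples[-1] >= threshold:
        warnings.append("threshold")

    # Rate-of-change warning: the metric is climbing too fast,
    # even if it has not crossed the threshold yet.
    if len(samples) >= 2:
        average_rise = (samples[-1] - samples[0]) / (len(samples) - 1)
        if average_rise > max_rise_per_sample:
            warnings.append("rate_of_change")

    return warnings
```

For example, check_metric([60, 66, 72, 78], threshold=90, max_rise_per_sample=5) raises only the rate-of-change warning: the disk is not full yet, but it is filling quickly enough that someone can step in before the outage rather than after it.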
We’re going to talk about the second item – monitoring – and how it can help us avoid that vicious cycle. How can monitoring help turn the tide for us so we don’t overreact because of a production outage?
There are a few fixes that can transform that reactive, vicious cycle into a responsive but measured virtuous cycle that addresses the core problems you’re seeing in PROD. Some are nontechnical or more process related than anything else – and note that fixing the issue starts with purity of code – as early in the process as possible:
- Adding or strengthening production telemetry (we can confirm if a fix works – and autodetect next time)
- Devs begin pushing code to prod (I can quickly see what’s broken and make decisions to roll back vs. patch). A note on this: a rollback – going to a previous version – is almost always easier and less risky. But sometimes fixing forward and rolling out a change using your deployment process is the best way forward.
- Peer reviews. This includes not just code deployments but ops/IT changes to environments! (Remember The Phoenix Project: 80% of our issues caused by unauthorized changes, often by IT to environments, and 80% of our time stuck figuring out what in this soup of changes caused the issue – before we even lift a finger to resolve anything! I’ll write more about how to do a productive peer review – especially pair programming, which is really a code review done while programming – later.)
- Better automated testing (again, more on this later. Look at Jez Humble’s excellent Continuous Delivery or Agile Testing for more on this.)
- Batch sizes get smaller. The secret to smooth and continuous flow is making small, frequent changes.
A key driver here though is information radiators – a term that actually comes from Toyota’s Lean principles. These create a feedback loop, which broadcasts issues back as quickly as possible, radiating information out on how things are going.
Etsy – just to take one company as an example – takes monitoring so seriously that some of their architects have been quoted as saying their monitoring systems need to be more available and scalable than the systems they’re monitoring. One of their engineers was quoted as saying, “If Engineering at Etsy has a religion, it’s the Church of Graphs. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it. Tracking everything is the key to moving fast, but the only way to do it is to make tracking anything easy. We enable engineers to track what they need to track, at the drop of a hat, without requiring time-sucking configuration changes or complicated processes.”
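To show what “making tracking anything easy” can look like in code, here is a small sketch of an in-process metrics helper. It is not Etsy’s tooling; the Metrics class, the metric names and the checkout function are all hypothetical, and a real setup would forward these values to a telemetry backend instead of holding them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory metrics store, for illustration only."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        # One line at the call site, no configuration change needed.
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        # Record how long the wrapped block took, in seconds.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def checkout(cart):
    """Example business function instrumented with two counters and a timer."""
    metrics.increment("checkout.attempts")
    with metrics.timer("checkout.duration"):
        total = sum(cart.values())
    metrics.increment("checkout.completed")
    return total
```

Calling checkout({"hat": 20, "scarf": 15}) returns 35 and leaves one attempt, one completion and one timing sample in metrics. The point of the design is that adding a new metric costs the engineer a single line.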
Another great thinker in the DevOps space, Ernest Mueller, has said – “One of the first actions I take when starting in an organization is to use information radiators to communicate issues and detail the changes we are making. This is usually extremely well received by our business units, who were often left in the dark before. And for Development and Operations groups who must work together to deliver a service to others, we need that constant communication, information and feedback.”
I know I found that to be true in my career. I discovered this fairly early on in my adoption of Agile with some sportswear companies here in the Oregon region. I worked for some very personality-driven orgs with highly charged, negative dynamics between teams. As I adopted Agile, which meant broadcasting honest retrospectives – including my screw-ups and failure to meet sprint goals – I expected a Donkey Kong type response and falling hammers. The most shocking thing happened though – the more brutally honest and upfront I was on what had gone wrong, the better my relationship became with the business and my IT partners. And mistakes we made on the team were owned up to – and they typically didn’t repeat, not without the group holding the culprit (including me) responsible. That kind of “government in the sunshine” type transparency and candor was the biggest single turning point of our Agile transformation.
It’s been said, rightly, that every lie we tell ourselves comes with a payoff and a price. For developers or IT, we’ve been very used to thinking we are AWESOME and WONDERFUL and the other guys are evil/obstructive/etc. Maybe that story – which has the short term payoff of making us feel good about our performance – comes with a heavy price: limiting our success in rolling out easy to manage and maintain applications and delivering business value faster. By using instrumentation and telemetry, we demonstrate that we are not lying to ourselves or to our customers/the business. And suddenly a lot of those highly charged, politically sensitive meetings you find yourself in lose a lot of their subjectivity and poison – the focus is on improving numbers versus the negative punish/blame scenario.
Like testing, instrumentation and monitoring seem to be a bolt-on or an afterthought in every project. That’s a huge mistake. Make instrumentation and metrics the backbone of your DevOps movement, as they are the only thing that will tell you if you’re making specific progress and earn you credibility in the eyes of the business.
Don’t let your developers tell you that it’s too hard or have it be an afterthought. It takes just a few minutes to make your release and application availability metrics available to all.
And if your telemetry system is difficult to implement or doesn’t collect the metrics you need, think about switching to another tool. Remember the Etsy lesson – making it easy and quick is the way to go. If your tool isn’t easy to use and customize, it needs to be junked. That’s one reason among many why I really like Application Insights!
Premier Support for Developers provides strategic technology guidance, critical support coverage, and a range of essential services to help teams optimize development lifecycles and improve software quality. Contact your Application Development Manager (ADM) or email us to learn more about what we can do for you.