A good incident postmortem
I wanted to call your attention to a good incident postmortem done by Taylor Lafrinere this week. Taylor sits in my team room and, for a week, I saw him bent over his keyboard, often with two or three people staring over his shoulder, trying to figure out what had caused this incident and what we needed to do to prevent it in the future. This is the kind of tenacity it takes to run a highly available service over the long term. Only if you really understand the root cause and build mitigations and resiliency will you get there.
It’s a bit long and detailed but it’s a good read.
This is also a good opportunity for me to comment on our reliability of late. In many important ways, we are in much better shape than we have ever been. We have more reliability and isolation infrastructure in place than ever before. We very rarely have incidents that affect a large percentage of customers anymore. Our practices help us isolate the effects of incidents so most people are completely unaware we are having issues.
However, in the past few months, we’ve had too many of those “smaller” incidents and, unfortunately, a disproportionate number of them have been on our European instances. As a result, the availability of our European instances has looked much worse than the overall service availability. There’s no single reason Europe has been hit hardest; there are many, and we are taking steps to address them.
Many of the issues have been self-inflicted: by that I mean code defects that got checked in, deployed and not caught until they caused issues for customers. In part that’s because we are making some pretty large systemic/structural changes to the service right now (you’ll hear more about the resulting new capabilities in the next few months) and the level of rigor we typically apply just hasn’t been up to the magnitude of the churn that’s happening. We are working to improve that level of rigor while we simultaneously continue to improve isolation and resiliency. The RCA above is a great example of the ongoing effort and learnings that go into every incident we experience.
At the same time, I recognize all that matters is that the service is good and healthy and doing what you need it to do. It hasn’t been as healthy as it should have been lately. For that I want to apologize. We are working hard to address the underlying issues and this dip in health will get fixed and we’ll come out of it stronger and more resilient than ever.