We had a pretty serious outage last Thursday all told it was a little over 5 hours. The symptoms were that performance was so bad that the service was basically unavailable for most people (though there was some intermittent access as various mitigation steps were taken). It started around 14:00 UTC and ended a little before 19:30 UTC. This duration and severity makes this one of the worst incidents we’ve ever had on VS Online. We feel terrible about it and continue to be committed to doing everything we can to prevent outages. I’m sorry for the problems it caused. The team worked tirelessly from Thursday through Sunday both to address the immediate health issues and to fix underlying bugs that might cause recurrences. As you might imagine, for the past week, we’ve been hard at work trying to understand what happened and what changes we have to make to prevent such things in the future. It is often very difficult to find proof of the exact trigger for outages but you can learn a ton by studying them closely. On an outage like this, there’s a set of questions I always ask, and they include:
What happened?
What happened was that one of the core SPS (Shared Platform Services) databases became overwhelmed with database updates and started queuing up so badly that it effectively blocked callers. Since SPS is part of the authentication and licensing process, we can’t just completely ignore it – though I would suggest that if it became very sluggish, it wouldn’t be the end of the world if we bypassed some licensing checks to keep the service responsive.
What was the trigger? What made it happen today vs yesterday or any other day?
Though we’ve worked hard on this question, we don’t have any definitive answer (we’re still pursuing it though). We know that before the incident, some configuration changes were made that caused a significant increase in traffic between our “TFS” service and our “SPS” (Shared Platform Service). That traffic involved additional license validation checks that had been improperly disabled. We also know that, at about the same time, we saw a spike in latencies and failed deliveries of Service Bus messages. We believe that one or both of these were key triggers but we are missing some logging on SPS database access to be able to be 100% certain. Hopefully, in the next few days, we’ll know more conclusively.
What was the “root cause”?
This is different than the trigger in the sense that the trigger is often a condition that caused some cascading effect. The root cause is more about understanding why the effect cascaded and why it took the system down. It turns out that, I believe, the root cause was that we had accumulated a series of bugs that were causing extra SPS database work to be done and that the system was inherently unstable – from a performance perspective. It just took some poke at the system – in the form of extra identity or licensing churn to cause a ripple effect on these bugs. Most, but not all, of them were introduced in the last few sprints. Here’s a list of the “core” causal bugs that we’ve found and fixed so far:
- Many calls from TFS -> SPS were inappropriately updating the “TFS service” identity’s properties. This created SQL write contention and invalidated the identity by sending a Service Bus message from SPS -> TFS. This message caused the app tiers to invalidate their cache and subsequent TFS requests to make a call to SPS causing further property updates and a vicious cycle.
- A bug in 401-handling code was making an update to the identity causing an invalidation of the identity’s cache – no vicious cycle but lots of extra cache flushes.
- A bug in the Azure Portal extension service was retrying 401s every 5sec.
- An old behavior that was causing the same invalidation ‘event’ to be resent from each SPS AT (user1 was invalidated on AT1, user2 was invalidated from AT2 -> user1 will be sent 2 invalidations). And we have about 4 ATs so this can have a pretty nasty multiplicative effect.
We’ve also found/fixed a few “aggravating” bugs that made the situation worse but wouldn’t have been bad enough to cause serious issues on their own:
- Many volatile properties were being stored in Identity’s extended properties causing repeated cache invalidations and broad “change notifications” to be sent to listeners who didn’t care about the property changes.
- A few places were updating properties with unchanged values causing an unnecessary invalidation and SQL round trips.
All of these, in some form, have to do with updates to identities in the system that then often cause propagating change notifications (which in some cases were over propagated) that caused extra processing/updates/cache invalidations. It was “unstable” because anything that caused an unexpected increased load in these identity updates would spiral out of control due to multiplicative effects and cycles.
What did we learn from the event?
I always want to look beyond the immediate and understand the underlying pattern. This is sometimes called “The 5 whys”. This is, in fact, the most important question in the list. Why did this happen and what can we do differently? Not what bugs did we hit. Why were those bugs there? What should we have done to ensure those bugs were caught in the design/development process before anything went into production? Let me start with a story. Way back in 2008, when we were beginning to rollout TFS across very large teams at Microsoft, we had a catastrophe. We significantly underestimated the load that many thousands of people and very large scale build labs would put on TFS. We lived in hell for close to 9 months with significant performance issues, painful daily slowdowns and lots of people sending me hate mail. My biggest learning from that was, when it comes to performance, you can’t trust abstractions. In that case, we were treating SQL Server as a relational database. What I learned is that it’s really not. It’s a software abstraction layer over disk I/O. If you don’t know what’s happening at the disk I/O layer, you don’t know anything. Your ignorance may be bliss – but when you get hit with 10x or 100x scale/performance requirement and you fall over dead. We became very deep in SQL disk layout, head seeks, data density, query plans, etc. We optimized the flows from the top to the very bottom and made sure we knew where all the CPU went, all the I/Os went, etc. When we were done, TFS scaled to crazy large teams and code bases. We then put in place regression tests that would measure changes, not just in time but also in terms of SQL round trips, etc. So back to last Thursday… We’ve gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for. In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, not carried it forward to new code that we’ve been writing. This, I believe, is the root cause – Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations. You must, of course, be able to do this in synthetic test environments – like unit tests, but also in production environments because you’ll never catch everything in your tests. So, we’ve got some bugs to fix and some more debt to pay off in terms of tuning the interaction between TFS and SPS but, most importantly, we need to put in place some infrastructure to better measure and flag changes in end to end cost – both in test and in production. The irony here (not funny irony but sad irony), is that there has been some renewed attention on this in the team recently. A few weeks ago, we had a “hack-a-thon” for small groups of people on the team to experiment with new ideas. One of the teams built a prototype of a solution for capturing important performance tracing information across the end-to-end thread of a request. I’ll try to do a blog post in the next couple of weeks to show some of these ideas. And just the week before this incident Buck (our Dev director) and I were having a conversation about needing to invest more in this very scenario. Unfortunately we had a major incident before we could address the gap.
What are we going to do?
OK, so we learned a lot, but what are we actually going to do about it. Clearly step 1 is mitigate the emergency and get the system back to sustainable health quickly. I think we are there now. But we haven’t addressed the underlying whys yet. So, some plans we are making now include:
- We will analyze call patterns within SPS and between SPS and SQL and build the right telemetry and alerts to catch situations early. Adding baselines into unit and functional tests will enforce that baselines don’t get violated when a dev checks in code.
- Partitioning and scaling of SPS Config DB will be a very high priority. With the work to enable tenant-level hosts, we can partition identity related information per tenant. This enables us to scale SPS data across databases, enabling a higher “ceiling” and more isolation in the event things ever go badly again.
- We are looking into building an ability for a service to throttle and recover itself from a slow or failed dependency. We should leverage the same techniques for TFS -> SPS communication and let TFS leverage cached state or fail gracefully. (This is actually also a remaining action item from the previous outage we had a months or so ago.)
- We should test our designs for lag in Service Bus delivery and ensure that our functionality continues to work or degrades gracefully.
- Moving to Service Bus batching APIs and partitioned topics will help us scale better to handle very ‘hot’ topics like Identity.
As always, hopefully you get some value from the details behind the mistakes we make. Thank you for bearing with us.
Brian
0 comments