Explanation of July 18th outage
Sorry it took me a week and a half to get to this. We had the most significant VS Online outage we’ve had in a while on Friday July 18th. The entire service was unavailable for about 90 minutes. Fortunately it happened during non-peak hours so the number of affected customers was fewer than it might have been but I know that’s small consolation to those who were affected. My main goal from any outage that we have is to learn from it. With that learning, I want to make our service better and also share it so, maybe, other people can avoid similar errors.
The root cause was that a single database in SQL Azure became very slow. I actually don’t know why, so I guess it’s not really the root cause but, for my purposes, it’s close enough. I trust the SQL Azure team chased that part of the root cause – certainly did loop them in on the incident. Databases will, from time to time, get slow and SQL Azure has been pretty good about that over the past year or so. The scenario was that Visual Studio (the IDE) was calling our “Shared Platform Services” (a common service instance managing things like identity, user profiles, licensing, etc.) to establish a connection to get notified about updates to roaming settings. The Shared Platform Services were calling Azure Service Bus and it was calling the ailing SQL Azure database. The slow Azure database caused calls to the Shard Platform Services (SPS) to pile up until all threads in the SPS thread pool were consumed, at which point, all calls to TFS eventually got blocked due to dependencies on SPS. The ultimate result was VS Online being down until we manually disabled our connection to Azure Service Bus an the log jam cleared itself up. There was a lot to learn from this. Some of it I already knew, some I hadn’t thought about but, regardless of which category it was in, it was a damn interesting/enlightening failure. **UPDATE** Within the first 10 minutes I’ve been pinged by a couple of people on my team pointing out that people may interpret this as saying the root cause was Azure DB. Actually, the point of my post is that it doesn’t matter what the root cause was. Transient failures will happen in a complex service. The interesting thing is that you react to them appropriately. So regardless of what the trigger was, the “root cause” of the outage was that we did not handle a transient failure in a secondary service properly and allowed it to cascade into a total service outage. I’m also told that I may be wrong about what happened in SB/Azure DB. I try to stay away from saying too much about what happens in other services because it’s a dangerous thing to do from afar. I’m not going to take the time to go double check and correct any error because, again, it’s not relevant to the discussion. The post isn’t about the trigger. The post is about how we reacted to the trigger and what we are going to do to handle such situations better in the future.
Don’t let a ‘nice to have’ feature take down your mission critical ones
I’d say the first and foremost lesson is “Don’t let a ‘nice to have’ feature take down your mission critical ones.” There’s a notion in services that all services should be loosely coupled and failure tolerant. One service going down should not cause a cascading failure, causing other services to fail but rather only the portion of functionality that absolutely depends on the failing component is unavailable. Services like Google and Bing are great at this. They are composed of dozens or hundreds of services and any single service might be down and you never even notice because most of the experience looks like it always does. The crime of this particular case is that, the feature that was experiencing the failure was Visual Studio settings roaming. If we had properly contained the failure, your roaming settings wouldn’t have synchronized for 90 minutes and everything else would have been fine. No big deal. Instead, the whole service went down. In our case, all of our services were written to handle failures in other services but, because the failure ultimately resulted in thread pool exhaustion in a critical service, and it reached the point that no service could make forward progress.
Smaller services are better
Part of the problem here was that a very critical service like our authentication service shared an exhaustible resource (the thread pool) with a very non-critical service (the roaming settings service). Another principle of services is that they should be factored into small atomic units of work if at all possible. Those units should be run with as few common failure points as possible and all interactions should honor “defensive programming” practices. If our authentication service goes down, then our service goes down. But the roaming settings service should never take the service down. We’ve been on a journey for the past 18 months or so of gradually refactoring VS Online into a set of loosely coupled services. In fact, only about a year ago, what is now SPS was factored out of TFS into a separate service. All told, we have about 15 or so independent services today. Clearly, we need more 🙂
How many times do you have to retry?
Another one of the long standing rules in services is that transient failures are “normal”. Every service consuming another service has to be tolerant of dropped packets, transient delays, flow control backpressure, etc. The primary technique is to retry when a service you are calling fails. That’s all well and good. The interesting thing we ran into here was a set of cascading retries. Our situation was Visual Studio –> SPS –> Service Bus –> Azure DB When Azure DB failed Service Bus retried 3 times. When Service Bus failed, SPS retried 2 times. When SPS failed, VS retried 3 times. 3 * 2 * 3 = 18 times. So, every single Visual Studio client launched in that time period caused a total of 18 attempts on the SQL Azure database. Since the problem was that the database was running slow (resulting in a timeout after like 30 seconds), that’s 18 tries * 30 seconds = 9 minutes each. Calls in this stack of services piled up and up and up until, eventually, the thread pool was full and no further requests could be processed. As it turns out SQL Azure is actually very good about communicating to it’s callers whether or not a retry is worth attempting. SB doesn’t pay attention to that and doesn’t communicate it to it’s callers. And neither does SPS. So a new rule I learned is that it’s important that any service carefully determine, based on the error, whether or not retries are called for *and* communicate back to their callers whether or not retries are advisable. If this had been done, each connection would have been only 30 seconds rather than 9 minutes and likely the situation would have been MUCH better.
A traffic cop goes a long way
Imagine that SPS kept count of how many concurrent calls were in progress to Service Bus. Knowing that this service was a “low priority” service and that calls were synchronous and the thread pool limited, it could have decided that, once that concurrent number of calls exceeded some threshold (let’s say 30, for arguments sake) that it would start rejecting all subsequent calls until the traffic jam drained a bit. Some callers would very quickly get rejected and their settings wouldn’t be roamed but we’d never have exhausted threads and the higher priority services would have continued to run just fine. Assuming the client is set to attempt a reconnect on some very infrequent interval, the system would eventually self-heal, assuming the underlying database issue was cleared up.
Threads, threads and more threads
I’m sure I won’t get out of this without someone pointing at that one of the root causes here is that the inter-service calls were synchronous. They should have been asynchronous, therefore not consuming a thread and never exhausting the thread pool. It’s a fair point but not my highest priority take away here. You are almost always consuming some resource, even on async calls – usually memory. That resource may be large but it too is not inexhaustible. The techniques I’ve listed above are valuable, regardless of sync or async and will also prevent other side effects, like pounding an already ailing database into the dirt with excessive retries. So, it’s a good point, but I don’t think it’s a silver bullet. So, onto our backlog go another series of “infrastructure” improvements and practices that will help us provide an ever more reliable service. All software will fail eventually, somehow. The key thing is to examine each and every failure, trace the failure all the way to the root cause, generalize the lessons and build defenses for the future. I’m sorry for the interruption we caused. I can’t promise it won’t happen again, *but* after a few more weeks (for us to implement some of these defenses), it won’t happen again for these reasons. Thanks as always for joining us on this journey and being astonishingly understanding as we learn, And, hopefully these lessons provide some value to you in your own development efforts.