April 23rd, 2013

Sprint 45 service issues

Brian Harry
Corporate Vice President

If you are a regular user of our Team Foundation Service, particularly during peak times, you’ve probably noticed that it hasn’t been running as smoothly as usual. Our recent issues started with our sprint 45 deployment that I wrote about here and foreshadowed here. In this post I want to give you some insight into what happened and what we’ve done about it. The team has been working long and hard since the deployment and I really appreciate the dedication and tenacity.

As our service has evolved, we’ve begun to expand the breadth of the services we are trying to offer. We are also working towards having support in data centers around the world so you can choose where your data is hosted. As part of this drive, we’ve been gradually working to “decouple” components of Team Foundation Service to enable more independent evolution and deployment. Sprint 45 was the culmination of one of the biggest pieces of work we’ve done to date in that effort. Though you can’t really see the effect (ok, other than some of the bugs we introduced), we refactored the core Team Foundation Service into two services – what we call SPS, or Shared Platform Services, and Team Foundation Service (the functionality you know and love).

SPS contains many things, but the simplest way to think about it is the data that is outside a given tenant account – subscriptions, location, configuration, identity, profile, etc. After many sprints of work, we finally deployed the new SPS service at the end of sprint 45 – in fact on March 22nd and 23rd. It involved a pretty massive data refactoring and the first upgrade with planned downtime in a long time (and, I believe, the last for a long time to come).

When you think about the change technologically, the biggest difference was that, for the first time, we introduced a service to service split with fairly heavy traffic.  Before this, all of the calls between the “logical” services were either between DLLs in the same app domain or to the database.  This split caused those calls to become service to service REST calls, crossing not only processes but even machines in the data center.
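
To make the shape of that change concrete, here’s a minimal sketch with hypothetical names and URLs (this is not our actual client code): what used to be a direct method call into a DLL loaded in the same app domain becomes an HTTP round trip to the SPS deployment, potentially on a different machine.

    using System.Net.Http;
    using System.Threading.Tasks;

    // Hypothetical sketch – the endpoint and method names are illustrative only.
    static async Task<string> GetLocationFromSpsAsync(HttpClient client, string serviceType)
    {
        // Before the split this lookup was an in-proc call; now it crosses process
        // and machine boundaries as a REST request.
        HttpResponseMessage response = await client.GetAsync(
            "https://sps.example.com/_apis/locations/" + serviceType);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }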

We did a ton of testing ahead of this deployment; however, it’s clear we didn’t catch everything we should have, and the result has been a lower quality of service than we (and you) expect. The issues have mostly manifested themselves as slowdowns during peak hours (though in the first couple of days it was a little worse than that). The build service has been the most (and longest) affected – mostly because it’s very intensive: it does a full sync of all of your source to build it, and that can put a lot of load on the system when lots of people are running builds at the same time.

I’m going to walk through a list of the issues that we hit. I’m going to try to group them a bit to make it easier to digest.

REST stack

After we went live and started to see production load, we saw a great deal of time spent in our REST stack. I still don’t fully understand why we didn’t catch these issues in our pre-production load testing; the team is doing deep retrospectives over the next couple of weeks, and this will be one of the key questions to answer.

ClientCertificateOption.Automatic – We have an object model for talking to TFS. With the new service to service architecture, we are now using portions of that object model on the service. A couple of years ago, we added multi-factor auth to our object model to support scenarios where stronger authentication is required. We discovered that this option was still turned on in our server context – where it was not needed – and the result was a scan of the certificate store on every service to service request. We disabled the certificate lookup in this scenario and improved performance.
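
For those curious what that kind of fix looks like in code, here’s a minimal sketch using the standard HttpClientHandler API (our server-side stack is different, so treat this as illustrative): selecting manual client certificates keeps the handler from searching the certificate store on each outgoing request.

    using System.Net.Http;

    // Sketch: opt out of automatic client certificate selection so no certificate
    // store scan happens per outgoing service to service request.
    HttpClientHandler handler = new HttpClientHandler();
    handler.ClientCertificateOptions = ClientCertificateOption.Manual;
    HttpClient client = new HttpClient(handler);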

Thread pool threads – The new service to service architecture introduced a lot more outgoing web calls. The way the web stack works under the covers, all of these calls are async and use the thread pool. We found that we were often running out of thread pool threads and waiting for the CLR to inject more into the pool. By default the CLR starts with 1 thread per core and then uses a “hill climbing” algorithm to adjust threads as needed. This algorithm tries to dampen changes in the thread pool size so it is not thrashing. However, we found that it could not handle our rapidly changing demands well. The result was long pauses of up to 20 seconds while requests queued up waiting for the thread pool to resize. Working with the CLR team, we decided to increase the minimum number of thread pool threads to 6 per core using ThreadPool.SetMinThreads.
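
Concretely, that mitigation is a couple of lines at service startup; here’s a sketch (the 6-per-core value is the one mentioned above; whether the completion-port minimum also needs raising depends on the workload):

    using System;
    using System.Threading;

    // Raise the thread pool floor so bursts of outgoing async calls don't stall
    // waiting for the CLR's hill climbing algorithm to inject threads.
    int minThreads = Environment.ProcessorCount * 6;
    ThreadPool.SetMinThreads(minThreads, minThreads);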

Wasteful CPU use – Many years ago, someone wrote some code to scan every request/response data stream for invalid characters.  This was consuming a fair amount of CPU (I forget how much but more than 10%).  No one, including the dev who did it, could remember why this code was put in there or think of any reason it was needed.  We removed it.

Extra thread switching – We had a coding mistake that caused extra thread switches on every REST call.  This is best explained in code.  At the beginning of every REST call there’s some lazy initialization logic:

    if (a == null)
    {
        lock (sync)
        {
            if (a == null)
            {
                a = fetch some data;
            }
        }
    }

The issue is that “fetch some data” was a cross service call and the developer wanted to make it async. So they tried making the method async and awaiting the fetch, only to discover that you can’t await inside a lock. (Let’s set aside that this would have completely broken the synchronization.) So they decided to do a Task.Run on the whole block of code. The problem is that a == null is true only the first time the code path is hit; after that, a is cached. The end result is that we did a thread switch just to run an if statement every time there was a service to service call.
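
Here’s a sketch of one way to fix it (not necessarily the exact change we shipped): use Lazy<Task<T>> so the cross service fetch runs exactly once and every later caller simply awaits an already-completed task, with no lock and no thread switch on the hot path.

    using System;
    using System.Threading.Tasks;

    internal static class ServiceDataCache
    {
        // "FetchSomeDataAsync" is a stand-in for the real cross service call.
        private static readonly Lazy<Task<string>> s_data =
            new Lazy<Task<string>>(() => FetchSomeDataAsync());

        public static Task<string> GetDataAsync()
        {
            // First caller triggers the fetch; everyone after that gets the cached task.
            return s_data.Value;
        }

        private static Task<string> FetchSomeDataAsync()
        {
            // Placeholder for the real async REST call.
            return Task.FromResult("some data");
        }
    }

One caveat with this pattern: if the first fetch faults, the faulted task gets cached, so a real implementation needs to decide how it wants to handle retrying after a failure.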

Too many cross service calls – As we started to investigate the cause of the slowdown, we quickly realized that we had way more “chattiness” between the services than we had expected. This chattiness was exacerbated by the various performance issues above. Before the split, most of these interactions were direct in-proc calls or database calls that were much lighter weight than the cross service calls. We were able to rework the code to reduce the number of calls across service boundaries substantially.
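
As an illustration of the kind of change that cuts chattiness (this is a generic sketch with hypothetical names, not our actual rework), caching the results of lookups that rarely change means a burst of requests doesn’t pay a REST round trip every time:

    using System;
    using System.Runtime.Caching;
    using System.Threading.Tasks;

    internal static class SpsLookupCache
    {
        private static readonly MemoryCache s_cache = MemoryCache.Default;

        public static async Task<string> GetServiceUrlAsync(string serviceType)
        {
            string cached = s_cache.Get(serviceType) as string;
            if (cached != null)
            {
                return cached;      // no cross service call on the hot path
            }

            string url = await FetchServiceUrlFromSpsAsync(serviceType);
            s_cache.Set(serviceType, url, DateTimeOffset.UtcNow.AddMinutes(5));
            return url;
        }

        private static Task<string> FetchServiceUrlFromSpsAsync(string serviceType)
        {
            // Placeholder for the actual REST call to SPS.
            return Task.FromResult("https://sps.example.com/" + serviceType);
        }
    }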

We fixed all of these REST stack problems within the first few days, and it really helped move the service from nearly constant poor performance to intermittent poor performance and build service issues.

Other misc bugs

In addition to the set of bugs/issues we had with the transition to separate services, we had another set of bugs that were uncovered/exacerbated as part of the transition. Among them were:

Work item tracking identity sync – We have a process by which we sync identities from the identity system (now in the SPS service) into work item tracking (in the TFS service) for the purposes of enabling work item tracking rule enforcement based on groups and identities.  This change uncovered/exacerbated two identity syncing bugs.  One caused older accounts (the first 40K accounts) to start doing full identity syncs (consuming significant resources) frequently.  The other was some thrashing in the identity service causing lots of activity but no progress in syncing.

Stored procedure optimization – We also found a few stored procedures with some significant query plan problems.

That pretty much summarizes our “bugs”.  There were clearly too many and they were too significant.  I don’t ever expect to get to the point where we can deploy a major change and not miss any bugs but it’s clear to me we can do better.  However, if it were just this, the overall impact would not have lasted nearly as long as it has.  In addition to our own mistakes we got hit by a number of environment/operational issues.  The cloud environment (Azure) that we operate in is constantly changing.  Almost every day something, somewhere is being updated and we rely on enough of it that we can get wagged by many things.  Here are some of the environmental things…

CLR/.NET

GC/Pinning – As I mentioned above in the section on thread pool threads, we introduced many more async patterns in our service with this change than we had ever had before. In fact, we now have about 10 async calls per second from each application tier machine. Async calls, in the CLR, ultimately involve some amount of “pinning” GC memory so that the OS can transfer data into buffers asynchronously without blocking the GC. The problem we have is that at that rate there’s almost always an outstanding async call, and therefore pinned buffers. The CLR has some designs to help reduce the impact of this but we found they weren’t working sufficiently well for us. We were using Large Azure roles (7GB of RAM) and finding that, due to pinning effects on the GC, we were seeing high CPU load from the GC and memory exhaustion that would ultimately result in machine recycling – causing intermittent availability. In the process of investigating this, we also uncovered the issue described below under “Windows Azure”. We engaged the CLR team quickly and started investigating. They said they had seen this issue before in some high scale services but no service was able to work with them long enough to isolate and fix it – basically, as soon as they found a mitigation, their interest in it was over. I view part of our role as driving feedback and requirements into our platform to improve the overall quality. As such, we’re continuing to work with the CLR team on this. We’re expecting a longer term improvement from them but, in the shorter term, we’ve found that we could increase to an XL role (14GB) and that mitigated the pinning effects, eliminating the critical issues.
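
To make the pinning issue a bit more concrete: each outstanding async network or stream operation pins its buffer for the duration of the I/O, and small, freshly allocated buffers tend to sit in the young generations, where a pin interferes with compaction. One common mitigation, shown as a rough sketch below (it is not what we ultimately did – we moved to XL roles while working with the CLR team), is to allocate I/O buffers once from a pool so the pinned memory is old and stable rather than scattered through gen0/gen1.

    using System.Collections.Concurrent;

    // Rough sketch of a pooled buffer approach: buffers are allocated once and are
    // big enough to land on the large object heap, so async I/O pins stable, old
    // memory instead of fragmenting the young generations.
    internal static class IoBufferPool
    {
        private const int BufferSize = 85 * 1024;   // >= 85,000 bytes => large object heap
        private static readonly ConcurrentBag<byte[]> s_buffers = new ConcurrentBag<byte[]>();

        public static byte[] Rent()
        {
            byte[] buffer;
            return s_buffers.TryTake(out buffer) ? buffer : new byte[BufferSize];
        }

        public static void Return(byte[] buffer)
        {
            s_buffers.Add(buffer);
        }
    }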

SQL Azure

Config DB provisioning – Each of our services has a single database that we call the Config DB. It’s the core database that the service needs to run. It holds critical configuration information like connection strings, manages cross app tier notifications, etc. Before now, we only had one Config DB for the entire service. That database was on an XL reserved SQL Azure instance to make sure we had a committed level of capacity behind this critical database. With the rollout of our new service and a second Config DB, along with a new “accounts” DB, we needed to readjust how our Config DBs were deployed so that we didn’t have to pay for 3 XL reservations but were able, instead, to share the reserved resources between our Config DBs. In this process, there were manual operational procedures which caused all three database primaries to land on the same physical machine. The load balancer realized this was a bad idea and began trying to move two of the primaries onto another of our reserved machines. However, we hit a bug in the load balancer where it had difficulty balancing the resource consumption of the foreground work with the background database movement. The result was that it kept trying to move the databases, getting part way through and then giving up. This caused a lot of load on our SQL Azure databases that, coupled with our issues in the first section, compounded our performance problem. Ultimately this was addressed by explicit operational intervention to place the databases.

SQL Azure storm – As it turns out, that first week, SQL Azure (on the cluster we use) had a bit of a storm, with some misbehaving tenants using an undue share of the resources and causing the performance of our databases to suffer. These storms happen from time to time in SQL Azure and they are working through throttling and quota mechanisms to contain them. They were MUCH worse a year or so ago – the SQL team has made a lot of progress but the problem is not yet completely gone.

So far, most of what I’ve talked about has been, at least, tangentially related to the “big” change we were actually rolling out.  As it so happens we also got hit by some stuff that had absolutely nothing to do with the big changes we were rolling out.  Over the first couple of weeks we had worked through all of the issues we could find with our change and the service was generally running reasonably well – with the exception of the build service.  The build service was running poorly – during peak times builds were slow and frequently timing out/failing.  Once we had ruled out everything that had anything to do with our changes we discovered some other things going on.

Windows Azure

Slow disk I/O – The slow and failed builds were continuing to plague us. Once we had stopped looking for causes related to the changes we had made, it became clear pretty quickly that the root problem had to do with the amount of time it was taking to download source out of version control. After much investigation we concluded it had something to do with changes in Windows Azure. We were seeing incredibly slow disk I/O on the local disk of our web role. We contacted the Azure team and learned they had just rolled out (at just about the same time as our changes) a change to both host and guest VMs that changed the way write-through caching works. We were seeing, under load, individual I/Os taking as much as 2 seconds – 2 orders of magnitude longer than they should. The source code we manage is in Azure blob store. However, to optimize delivery, we store a cache of source code on the application tier that we can use the OS TransmitFile API on. It is a VERY optimized API for transmitting files over the network that has very low overhead because it avoids copying buffers in and out of user mode, etc. The disk I/O slowdown was causing this cache to behave very badly – causing builds to take forever, downloads to fail, builds to be aborted, etc. Further, it significantly exacerbated our issues with the GC because we’d see page faults during GC collections taking a second or more. It was a mess. To mitigate it, we have turned off our application tier cache and are instead fetching source directly from Azure blob store. This is causing our file downloads to take about 1.6 times longer than they used to, but it’s still way better than where we were. We’re continuing to work with the Azure team on a more permanent fix to restore the local disk performance.
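
For context on why that cache was so fast when the disk was healthy, here is a minimal sketch of the managed wrapper over TransmitFile (Socket.SendFile); this is illustrative, not our actual code path. The kernel streams the file from the file cache straight to the socket with no user-mode buffer copies – which is also why it falls apart when local disk reads themselves get slow.

    using System.Net.Sockets;

    // Sketch: Socket.SendFile uses the Win32 TransmitFile API under the covers.
    // The file bytes are never copied into user-mode buffers.
    static void SendCachedFile(Socket connectedSocket, string path)
    {
        connectedSocket.SendFile(path);
    }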

Network failures – Unfortunately, with the slow disk I/O problem mitigated, we’re still seeing an increased level of build failures. Upon further investigation, we’ve discovered that we’re seeing a much higher level of network errors in Azure than we are used to. At this time, I still don’t know the cause of that – we believe it’s related to the last Azure fabric rollout but we don’t have a clear root cause yet. We have been seeing hundreds of builds a day failing due to network failures, starting about April 12th. Along the way of investigating many of these issues, we discovered that our build service didn’t have appropriate retry logic to handle failed downloads. It’s really not been an issue for the past year, but all of the issues we’ve had in the last month have uncovered this hole. We’ll soon be rolling out an update that will do retries (which, quite honestly, should have been there in the first place because all cross service cloud calls should handle intermittent failures well) and it should significantly improve the build success rate. We’ll still see some performance issues from the last two issues but at least builds should stop failing.
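
As a sketch of the kind of retry logic every cross service cloud call should have (the policy here – a few attempts with exponential backoff – is illustrative, not the exact behavior of the build update, and the helper names are hypothetical):

    using System;
    using System.Threading.Tasks;

    // Wrap a transient-failure-prone operation, e.g. a source download, in retries.
    static async Task<T> WithRetryAsync<T>(Func<Task<T>> operation, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception)   // in real code, catch only transient network errors
            {
                if (attempt >= maxAttempts)
                {
                    throw;
                }
                // Back off before the next try.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }

    // Usage (DownloadFileAsync is a hypothetical download helper):
    // byte[] content = await WithRetryAsync(() => DownloadFileAsync(uri), 4);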

Debugging issues

Memory dump problems – To add insult to injury, our ability to debug the service has been seriously compromised. I haven’t actually looked deeply at this one myself so my description may be a little sketchy. Debugging some of these problems is greatly facilitated by taking memory dumps from production that we can diagnose offline. Historically, we have used a feature called reflective dumps, using procdump /r. This method reflects the memory state into a parallel process and then dumps from the parallel process so that there is almost no interruption of the production service. However, apparently, for a couple of years the Windows OS reflective dump mechanism has had a bug that caused it to miss fragments of the process memory. Apparently we’ve seen this in the form of “bad” memory dumps, but it was infrequent enough that people just wrote it off to gremlins. Some change in .NET 4.5 (which we rolled out on the service a few months ago) has caused this reflective dump to fail every time – so we can’t use reflective dumps any longer. Now we have to take a straight process dump in production and, unfortunately, this takes a minute or two, literally. A pause that long causes cascades of timeouts and results in a significant service interruption. So every time we needed to take a dump to look at thread state, GC issues, or the myriad of other issues here, we had to basically reboot one of the app tiers, causing significant service interruptions, failed builds, etc. We’re working with the Windows team on getting them to produce a hotfix that we (and the rest of the world) can use to reliably get .NET 4.5 reflected memory dumps.

As you can see there were quite a lot of issues that conspired to make it a bad month for us.  We’ve been working incredibly hard and the service has gotten better day by day.  We’ve also been working closely with the platform teams to ensure there are durable solutions to the problems we’ve hit there that work not just for us, but for everyone.

This is a long post and if you’ve read this far, I’m impressed. We’re very sorry for the problems that we’ve caused. At this point we feel like we’ve worked through enough of our issues that we can get back on our regular deployment schedule. As such, we are deploying the delayed Sprint 46 build today and will deploy the Sprint 47 build a week from Monday. We’ll keep working hard until everything is completely back to normal and if you see anything that looks squirrelly, please let us know.

Thank you,

Brian

