This past weekend we upgraded the TFS server used by the Developer Division and other teams who deliver Visual Studio 2012. It is now running TFS 2012 RC!
Back in the beginning, the DevDiv server was our dogfood server. Then, with all of the folks shipping products in Visual Studio depending on it, there were too many critical deadlines to be able to put early, sometimes raw, builds on the server. So we dogfood TFS on a server called pioneer, as described here. Pioneer is used mostly by the teams in ALM (TFS, Test & Lab Management, and Ultimate), and we’ve been running TFS 2012 on it since February 2011, which was a full year before beta. Never before have we been able to use TFS so early in the product cycle, and our ability to get that much usage on early TFS 2012 really showed in the successful upgrade of the DevDiv server.
We also run TFS 2012 in the cloud at http://tfspreview.com, and that’s been running for a year now. While that’s not a dogfood effort, it’s helped us improve TFS 2012 significantly. The other dogfooding effort leading up to this upgrade was Microsoft IT. They upgraded a TFS server to TFS 2012 Beta, and we learned from that as well.
The scale of the DevDiv server is huge: 3,659 users have used it in the last 14 days. Nearly all of those users are working in a single team project for the delivery of Visual Studio 2012. Our branches and workspaces are huge (a full branch has about 5M files, and a typical dev workspace has 250K files). As a result, we always find product issues when we upgrade it. For the TFS 2010 product cycle, we did not upgrade the server until after RTM. Because we were able to do this upgrade with TFS 2012 RC, the issues we find will be fixed in the RTM release of TFS 2012!
Here’s the topology of the DevDiv TFS deployment, which I’ve copied from Grant Holliday’s blog post on the upgrade to TFS 2010 RTM two years ago. I’ll call out the major features.
- We use two application tiers behind an F5 load balancer. The ATs will each handle the DevDiv load by themselves, in case we have to take one offline (e.g., hardware issues).
- There are two SQL Server 2008 R2 servers in a failover configuration. We are running SP1 CU1. TFS 2012 requires an updated SQL 2008 for critical bug fixes.
- SharePoint and SQL Analysis Services are running on a separate computer in order to balance the load (cube processing is particularly intensive).
- We use version control caching proxy servers both in Redmond and for remote offices.
These statistics will give you a sense of the size of the server. There are two collections: one that is in use now and has been used since the beginning of the 2012 product cycle (collection A), and the original collection, which was used by everyone up through the 2010 product cycle (collection B). The 2010 collection had grown in uncontrolled ways, and there were more than a few hacks in it from the early days of scaling to meet demand. Since moving to a new collection, we’ve been able to pare back the old collection, and the result of those efforts has been a set of tools that we use on both collections (we’ll eventually release them). Both collections were upgraded. The third column is a server we call pioneer.
Grant posted the queries to get the stats on your own server (some need a little tweaking because of schema changes, and we need to add build). Also, the file size is now all of the files, including version control, work item attachments, and test attachments, as they are all stored in the same set of tables now.
Statistic | Collection A | Collection B | Pioneer |
Recent Users | 3,659 | 1,516 | 1,143 |
Build agents and controllers | 2,636 | 284 | 528 |
Files | 16,855,771 | 21,799,596 | 11,380,950 |
Uncompressed File Size (MB) | 14,972,584 | 10,461,147 | 6,105,303 |
Compressed File Size (MB) | 2,688,950 | 3,090,832 | 2,578,826 |
Checkins | 681,004 | 2,294,794 | 133,703 |
Shelvesets | 62,521 | 12,967 | 14,829 |
Merge History | 1,512,494,436 | 2,501,626,195 | 162,511,653 |
Workspaces | 22,392 | 6,595 | 5,562 |
Files in workspaces | 4,668,528,736 | 366,677,504 | 406,375,313 |
Work Items | 426,443 | 953,575 | 910,168 |
Areas & Iterations | 4,255 | 12,151 | 7,823 |
Work Item Versions | 4,325,740 | 9,107,659 | 9,466,640 |
Work Item Attachments | 144,022 | 486,363 | 331,932 |
Work Item Queries | 54,371 | 134,668 | 28,875 |
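To make a couple of the numbers above easier to compare, here is a minimal Python sketch that derives two ratios from the table: how large the compressed files are relative to the uncompressed files, and the average number of files per workspace. The figures are copied from the table; the dictionary keys and the `summarize` helper are names made up for this illustration, not anything that exists in TFS.

```python
# Figures copied from the statistics table above (file sizes are in MB).
# The key names are illustrative only; they are not TFS schema names.
stats = {
    "Collection A": {
        "uncompressed_mb": 14_972_584,
        "compressed_mb": 2_688_950,
        "workspaces": 22_392,
        "files_in_workspaces": 4_668_528_736,
    },
    "Collection B": {
        "uncompressed_mb": 10_461_147,
        "compressed_mb": 3_090_832,
        "workspaces": 6_595,
        "files_in_workspaces": 366_677_504,
    },
    "Pioneer": {
        "uncompressed_mb": 6_105_303,
        "compressed_mb": 2_578_826,
        "workspaces": 5_562,
        "files_in_workspaces": 406_375_313,
    },
}


def summarize(row):
    """Return (compressed/uncompressed ratio, average files per workspace)."""
    ratio = row["compressed_mb"] / row["uncompressed_mb"]
    avg_files = row["files_in_workspaces"] / row["workspaces"]
    return ratio, avg_files


summary = {name: summarize(row) for name, row in stats.items()}

for name, (ratio, avg_files) in summary.items():
    print(
        f"{name}: compressed size is {ratio:.1%} of uncompressed, "
        f"about {avg_files:,.0f} files per workspace"
    )
```

For collection A, the compressed size works out to roughly 18% of the uncompressed size, and the average workspace holds a little over 200,000 files, which is in the same range as the typical 250K-file dev workspace mentioned earlier.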
The biggest issue we faced after the upgrade was getting the builds going again. DevDiv (collection A) has 2,636 build agents and controllers, with about 1,600 being used at any given time. On pioneer, we didn’t have nearly that many running. The result was that we hit a connection limit, and the controllers and agents would randomly go online and offline. After working with the WCF team, we now understand the connection settings, and this will be fixed for RTM. We’ve also now got a test in place that finds the issue and verifies the fix.
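To give a feel for why a connection cap shows up as machines randomly going online and offline, here is a toy Python model. It is only a sketch under simplifying assumptions: every active machine tries to connect in random order and only the first 128 get a slot (128 is the limit noted in the issue list that follows). It does not reflect how WCF actually throttles connections, and `simulate_round` is a hypothetical helper written for this illustration.

```python
import random

CONNECTION_LIMIT = 128   # the cap noted in the issue list ("limited to 128")
ACTIVE_MACHINES = 1_600  # roughly how many agents/controllers are in use at once


def simulate_round(machines, limit, rng):
    """Toy model: machines race for connections in random order and only
    the first `limit` of them win. Returns the set of connected machines."""
    order = list(range(machines))
    rng.shuffle(order)
    return set(order[:limit])


rng = random.Random(2012)  # fixed seed so the run is repeatable
first = simulate_round(ACTIVE_MACHINES, CONNECTION_LIMIT, rng)
second = simulate_round(ACTIVE_MACHINES, CONNECTION_LIMIT, rng)

online_fraction = len(first) / ACTIVE_MACHINES
# Machines that were online in one round but not in the other.
flipped = len(first ^ second)

print(f"Online in any one round: {len(first)} of {ACTIVE_MACHINES} "
      f"({online_fraction:.0%})")
print(f"Machines whose status changed between rounds: {flipped}")
```

In this model only 8% of the active machines can hold a connection at once, and which ones they are changes from round to round, so any individual controller or agent appears to flap.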
Here’s a subset of the issues we’ve found, including all of the ones we found the first day. There are probably another dozen that we’ve found during the week that I haven’t listed.
- Build machine connections are limited to 128 for some reason.
- Build config wizard iterates through every build machine and resolves its DNS entry.
- Code to handle file attachments not yet upgraded fails. prc_RetrieveFIle was calling prc_MigrateFile incorrectly.
- Issue with leading key transfer when there are tables with duplicate index names in the database.
- Warehouse upgrade timed out on dropping primary keys.
- Creating a build label times out. Bad query plan on prc_iiUpdateBuildInformation.
- DevDiv Upgrade: File Migration Job is too slow
- Command-line only users will not have their location service cache updated on certain commands which leads to confusion on available services.
- Unable to access work items page. MaxJsonLength should be configurable (in web.config) for JSON action results.
- Updating build controllers/agents makes many calls to the server and suffers lock contention.
- QueryBuilds is slow on vstfdevdiv.
- Browser cache needs to be cleared for some users after upgrade.
- Work item template links created when the server was 2010 no longer work.
- SOAP notification breaks compat with previous releases.
- Get with a workspace version used by tfpt scorch is slow.
The upgrade to TFS 2012 RC was a huge success, and it was a very collaborative effort across TFS, central engineering, and IT. We quickly got everything to full scale and running smoothly. As a result of this experience and our experience on pioneer, TFS 2012 is not only a great release with an incredible set of features, but it’s also running at high scale on a mission critical server!
I hope you upgrade your server to TFS 2012. You are going to love it!
Follow me at twitter.com/tfsbuck