TFS on Windows Azure at the PDC
**UPDATED May 18th 2011** – See http://blogs.msdn.com/b/bharry/archive/2011/05/18/update-on-tfs-on-azure.aspx
Hosting ALM in the cloud as software as a service is gradually becoming more popular. The vision, of course, is ALM as a seamless service – making it really easy to get started, easy to scale, easy to operate, easy to access, … You’ve seen me write from time to time about our work with 3rd party hosting and consulting companies offering TFS services. We did a bunch of work in TFS 2010 on both the technical and licensing fronts to enable a new generation of cloud-based TFS services.
Several months ago, I wrote a post about our initial investigation into porting TFS to the Windows Azure platform. Since then, we’ve continued to pursue it and today, at this year’s PDC, I demoed Team Foundation Server running on the Windows Azure platform. We announced that we’ll deliver a CTP (basically an early preview) next year. We aren’t, by any means, done with the technical work, but, for now, it’s a great case study to see what is involved in porting a large scale multi-tier enterprise application to Azure.
The demo I did today represents an important step forward in getting TFS running on the Azure cloud platform. When I wrote my post a few months ago, I talked about a few of the challenges of porting the TFS data tier from SQL Server to SQL Azure. What I demoed today included not only that but also the remaining components of TFS running in the cloud – the ASP.NET application tier (running as a Web Role), the TFS Job Service (formerly a Windows service for periodic background tasks, now running as a Worker Role) and the TFS Build controller/agent (running in an Azure VM Role). I demoed connecting from a web browser, from Visual Studio and from Microsoft Test Manager.
One of the cool things (though it makes for a mundane demo :)) is that, for the end user, TFS in the cloud looks pretty much like TFS on-premises. Other than the fact that you log in with an internet identity rather than a Windows identity, you’ll find that the Visual Studio experience, for example, looks pretty much identical to a local TFS scenario. Here’s a screenshot of the “new experience” of logging in with an internet ID:
And here’s a screenshot of VS connected to TFS on Azure (as you can see, there’s not much difference):
The good news is that a lot of the work we did in TFS 2010 to improve the TFS architecture to scale from simple client installs (TFS Basic) to very complex, centrally hosted enterprise installs really helped prepare us for the port to Azure. The result is that it has been a pretty reasonable project. I’ll describe some aspects of the effort.
Porting TFS to SQL Azure
As I mentioned, step 1 was getting the SQL ported to SQL Azure. TFS is a very large SQL app and therefore, we feared this would not be a simple task. TFS has over 900 stored procedures and over 250 tables. Despite how involved the app is, the system was up and running with an investment of 2 people for about 1 month. The biggest issues we had to deal with were:
- OpenXML – SQL Azure does not support the OpenXML mechanism that SQL 2008 does for passing lots of data to a stored procedure. Our initial port was to move from OpenXML to the XML data type. However, we found that was not the best solution (it had some performance issues and XML escaping issues) and ultimately ported to Table Valued Parameters (TVPs).
- SELECT INTO – SQL Azure does not support “select into” because SQL Azure requires that all tables have a clustered index and select into gives you no way to define one. We ported all of our select into occurrences to use explicit temp tables.
- Sysmessages – SQL Azure doesn’t allow you to add messages to sysmessages. This means you can’t properly raise custom errors (something we make heavy use of). The truth is, even if you could, it wouldn’t be a great solution because sysmessages doesn’t really have a good localization store in a multi-tenant environment with different tenants with different languages. We have created a new mechanism for raising errors from SQL and ensuring that the end user ultimately gets a localized, intelligible message.
- Clustered indexes – As I mentioned above, SQL Azure requires clustered indexes on all tables. Some of ours didn’t have them. It was pretty straightforward to go back and add them though.
- No full text index – SQL Azure does not yet support full text indexing. For now we’ve disabled it in TFS (the only place we used it was in work item tracking) and are figuring out what our long term plan for that will be.
- Database directives – There are a lot of database directives that we use that SQL Azure doesn’t support – partitioning, replication, filegroups, etc. Removing them was not a particularly big issue.
While this is not an exhaustive list of the issues, it’s a pretty good list of the larger ones we had to deal with.
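To make the sysmessages point concrete, here is a minimal sketch of one way an application tier could turn a numeric error raised by the data tier into a localized message per tenant. This is not the TFS mechanism, which the post does not describe; the names `DataTierError`, `MESSAGES` and `localize`, the error code and the message texts are all invented for illustration.

```python
# Hypothetical message catalog kept outside the database, keyed by
# (error code, language). Codes and texts are invented for illustration.
MESSAGES = {
    (600001, "en"): "Work item {0} does not exist.",
    (600001, "de"): "Arbeitselement {0} ist nicht vorhanden.",
}


class DataTierError(Exception):
    """Stands in for an error surfaced by SQL: a numeric code plus parameters."""

    def __init__(self, code, *params):
        super().__init__(code, *params)
        self.code = code
        self.params = params


def localize(error, language, fallback="en"):
    """Format the error in the tenant's language, falling back to English."""
    template = MESSAGES.get((error.code, language))
    if template is None:
        template = MESSAGES.get((error.code, fallback))
    if template is None:
        # Unknown code: still return something intelligible.
        return f"Unexpected error {error.code}."
    return template.format(*error.params)


# Two tenants hitting the same error see it in their own language.
print(localize(DataTierError(600001, 42), "en"))
print(localize(DataTierError(600001, 42), "de"))
```

The design point is that the database only has to carry a code and its parameters; the language choice is made per tenant, after the error leaves SQL.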
Porting to the Web Role
While the TFS app tier is also pretty big, a straight port of it was surprisingly simple – about 1 person, 1 month. Azure’s implementation of ASP.NET is pretty faithful to the on-premises solution so most stuff ports pretty well. The harder parts of this (that have taken a few people several months) have been adapting to some of the cloud platform differences – with identity being the biggest one. Here’s some detail…
- Identity – Windows Azure doesn’t have Windows identities for use in the same way you would use them in an on-premises app. You need to use an internet identity system. Our first cut at this was to attempt direct use of LiveID – um, it turned out to be much more complicated than we originally expected. After getting our head out of the sand, we realized the right thing was to use the AppFabric Access Control Service (otherwise known as ACS). ACS gives us an authentication system that is OpenID compatible and supports many providers, such as LiveID, Yahoo, Google, Facebook, and Active Directory Federation. This enables people to connect to our Azure based TFS service with whatever identity provider they choose. TFS was pretty baked to understand Windows identities so it was a fair amount of work to port and had some pretty non-obvious ramifications. For instance, we can’t put up a username/password dialog any more. These services (LiveID, etc) require that you use their web page to authenticate. That means that everywhere we used to get Windows credentials, we now have to pop up a web page to enable the user to enter their username/password with their preferred provider. From a technical standpoint, identity has been the most disruptive change.
- Service identities – Not only do we have to deal with user identities (as described above) but TFS also has a number of cooperative services and they need to authenticate with each other. The end user oriented authentication mechanisms don’t really support this. Fortunately, ACS has introduced something called “Service identities” for this exact purpose. They enable headless services to authenticate/authorize each other using ACS as the broker.
- Calls from the cloud – The relationship between a TFS server and a TFS build controller (in TFS 2010 and before) is essentially peer to peer. Connections can be initiated from either side. This is potentially a big issue if you want your TFS server in the cloud and a build machine potentially on the other side of a Network Address Translation (NAT) device. One option is to use the Azure Service Bus to broker the bi-directional communication. We looked at our alternatives and ultimately decided to change the communication protocol so that the build controller acts like a pure client – it contacts the TFS server but the TFS server never contacts it. This is going to allow us, long term, to have simpler trust relationships between them and simplify management.
- Deployment – Another sizable chunk of work has been changing TFS to deploy properly to Azure. TFS used an MSI, wrote to the Registry, the GAC, etc. None of this is allowed on Azure – Azure apps are deployed as a zip file with a manifest that basically controls where files go.
- Tracing – We had work to do to hook our tracing (System.Diagnostics) and event logging into the Azure diagnostics store – neither of these is automatic.
- Admin console – The admin console had cases where it accessed the TFS database directly but that’s not very desirable in Azure. The SQL Azure authentication system is different. We didn’t want to complicate the world by exposing two authentication systems to administrators so we’ve eliminated the direct database access from the admin console and replaced it with web service calls.
Again, this isn’t a comprehensive list but captures many of the largest things we’ve had to do to get TFS on Azure.
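The “calls from the cloud” change, where the build controller becomes a pure client, can be sketched as a simple polling loop. This is an in-memory illustration only, assuming hypothetical `Server` and `BuildController` classes and method names; it says nothing about the actual TFS build protocol.

```python
from collections import deque


class Server:
    """Hypothetical server: it only answers requests and never dials out."""

    def __init__(self):
        self._pending = deque()
        self.results = {}

    def queue_build(self, build_id):
        self._pending.append(build_id)

    def next_build(self):
        """Called by the controller; returns queued work or None."""
        return self._pending.popleft() if self._pending else None

    def report(self, build_id, status):
        """Called by the controller to post a result back."""
        self.results[build_id] = status


class BuildController:
    """Hypothetical controller that may sit behind a NAT device.

    Every exchange is initiated from this side, so the server never
    needs a route back to the controller.
    """

    def __init__(self, server):
        self.server = server

    def poll_once(self):
        build_id = self.server.next_build()
        if build_id is None:
            return False  # nothing to do; a real loop would wait and retry
        self.server.report(build_id, "succeeded")  # pretend the build ran
        return True


server = Server()
server.queue_build("build-1")
controller = BuildController(server)
while controller.poll_once():
    pass
```

Because the server holds no reference to the controller at all, only outbound connectivity from the build machine is required.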
Tuning for Azure
There’s a class of work we are doing that I call “Tuning for Azure” and by this, I mean, it’s not strictly necessary to provide a TFS service on Azure but it makes the service more practical. My best examples off the top of my head are:
- Blob storage – In our on-premises solution, all TFS data is stored in SQL Server – work item attachments, source code files, IntelliTrace logs, … This turns out not to be a good choice on Azure – really for 2 reasons: 1) SQL Azure databases are capped at 50GB (might change in the future, but that’s the limit today) and 2) SQL Azure storage is more than 50X more expensive per GB than Windows Azure storage. As a result we’ve done work to move all of our blob data out of SQL and into a blob storage service that can be hosted on Windows Azure blobs. We believe this will both significantly reduce the pressure on the database size limit and reduce our operational cost.
- Latency and bandwidth – The fact that your TFS server will always be across the WAN means you need to think even harder about how you use it. As much as can be local needs to be local. Calls to the server really need to be async and cancelable. You have to really minimize round trips and payload size, etc.
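The blob storage move described above boils down to keeping only a small reference in the database row and putting the bytes somewhere cheaper. Here is a minimal in-memory sketch of that split. The `BlobStore` and `AttachmentTable` classes are hypothetical stand-ins, and keying blobs by a SHA-256 hash is an assumption made for the example, not something the post states.

```python
import hashlib


class BlobStore:
    """In-memory stand-in for a blob service such as Windows Azure blobs."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        # Assumption for this sketch: blobs are keyed by content hash.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key):
        return self._blobs[key]


class AttachmentTable:
    """Stand-in for a database table that now holds metadata only."""

    def __init__(self, blobs):
        self._blobs = blobs
        self.rows = {}

    def add(self, work_item_id, file_name, data):
        # The row stores a pointer and a size, never the payload itself.
        self.rows[(work_item_id, file_name)] = {
            "blob_key": self._blobs.put(data),
            "size": len(data),
        }

    def read(self, work_item_id, file_name):
        row = self.rows[(work_item_id, file_name)]
        return self._blobs.get(row["blob_key"])


table = AttachmentTable(BlobStore())
table.add(7, "trace.log", b"x" * 1000)
print(len(table.read(7, "trace.log")))
```

With this shape the database grows by a fixed-size row per attachment regardless of how large the attachment is, which is what relieves the pressure on the size cap.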
Running as a Service
The biggest category of work we have left is stuff I call “Running as a Service”. Almost all of this applies to running a traditional on-premises 3 tier app as a service but it becomes more important the larger the scale of the service is and a worldwide service like Azure makes it really important. An example of something in this category is upgrades. In TFS today, when you upgrade your TFS version, you take the server down, do the upgrade and then bring the server back up. For a very large on-premises installation, that might entail anywhere from a few hours of down time up to about a day (for crazy big databases). For a global service, you can’t take the service down for days (or even hours) while you upgrade it. It’s got to be 24×7 with minimal interruptions for small segments of the population. That means that all upgrades of the TFS service have to be “online” without ever taking a single customer down longer than it takes to update their own data. And that means running multiple versions of TFS simultaneously during the upgrade window. It’s by no means the only big “running as a service” investment but it’s a good example of how expectations of this kind of service are just different.
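As a rough illustration of what “online” upgrades imply, the sketch below upgrades one tenant at a time while every other tenant keeps being served by the version its data is on. It is a toy model under stated assumptions: `HostedService`, its methods and the tenant names are hypothetical, and a real service would need far more (draining, rollback, schema compatibility).

```python
class HostedService:
    """Toy multi-tenant service that tracks a data version per tenant."""

    def __init__(self, tenants, version):
        self._version = {tenant: version for tenant in tenants}
        self._upgrading = set()

    def handle_request(self, tenant):
        # Only the tenant whose data is being migrated is interrupted.
        if tenant in self._upgrading:
            return "unavailable"
        return f"served by v{self._version[tenant]}"

    def upgrade_tenant(self, tenant, new_version, migrate):
        """Migrate one tenant's data; everyone else stays online."""
        self._upgrading.add(tenant)
        try:
            migrate(tenant)  # caller-supplied data migration step
            self._version[tenant] = new_version
        finally:
            self._upgrading.discard(tenant)


service = HostedService(["contoso", "fabrikam"], version=1)
service.upgrade_tenant("contoso", 2, migrate=lambda tenant: None)
# At this point two versions are live side by side.
print(service.handle_request("contoso"))
print(service.handle_request("fabrikam"))
```

The per-tenant version map is the key idea: during the upgrade window the service must be able to answer requests for both the old and the new version at once.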
We also haven’t yet tackled what billing would look like for this kind of online service. That can also be way more complicated than you might, at first, think.
That’s a lot of detail but it’s a pretty cool milestone to be able to demo the great progress we’ve made and share our experiences porting a large existing on-premises app to a cloud architecture. As always, it’s pretty fun to be messing around with the latest and greatest technology. As I said at the beginning, we’re still primarily looking at this as a technological exercise and aren’t ready to talk about any product plans. For now, TFS provides a great on-premises solution and we have a growing set of partners providing hosted services for those that want to go that route. The move to the cloud is certainly gaining momentum and we’re making sure that your investment in TFS today has a clear path to get you there in the future.