Retrospective on TFS 2012 Update 1
In late November I made a tour of a few cities in Europe, speaking about VS/TFS 2012 and application lifecycle management. One of the talks I gave a couple of times was about how we are managing our new release cadence. We are shipping a service every 3 weeks and a new on premises server/client roughly every 3 months. I talked about how we plan releases, how we manage schedule, quality, etc. In one of the talks, someone asked me something like “How well is the new release process working?” I answered “Ask me again in about a year and I’ll tell you.” It was a bit of a humorous response but at the same time – the proof is in the pudding. A development/release process is only as good as the results – and you can only tell that after a few releases and a good track record.
When I talk about software development management practices, one of the key points I always stress is that the most important thing is to be constantly learning. Try things, measure the effects, learn and design solutions for the problems. With Update 1 out for about a month, we now have some early data on our process and are working on the learning phase.
What we are trying to do is very aggressive. Shipping a service every 3 weeks is challenging. But you have many advantages with a service – there’s no big configuration matrix, there’s only one configuration; patching it is easy, and most serious production issues are patched within 24 hours; diagnostics are way easier because we run the system and can analyze failures; etc.
On top of that, taking the same code base and then shipping it on premises in 3 months is super challenging.
While I have generally been very happy with the quality of our results with the online service, I can now say, after 30 days, that I’m not happy with the quality of our on premises TFS Updates. Thankfully the VS Update 1 seems to have gone well – no significant issues reported, but we have had a few issues with TFS Update 1.
I blogged about the first issue we found soon after we released Update 1. The cause was a race condition that would cause an internal component to get into an invalid state if any requests were sent to the server while the upgrade was still in progress (something that might happen if you have a build machine configured against the server). While not a serious issue, it was a significant inconvenience and within days, we posted a refreshed Update 1 download that fixed the issue.
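To make that class of failure concrete, here is a minimal sketch of one common way to keep requests from touching a server while an upgrade is running. It is not the fix we shipped and says nothing about how TFS is implemented; `UpgradeGate`, `ServerUnavailable` and `handle` are hypothetical names made up for illustration. The idea is simply to turn away new requests once an upgrade starts and to wait for in-flight requests to drain before the upgrade touches any state.

```python
import threading


class ServerUnavailable(Exception):
    """Hypothetical error returned to callers while an upgrade is running."""


class UpgradeGate:
    """Illustrative gate: rejects requests during an upgrade (assumed design)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._upgrading = False
        self._in_flight = 0

    def begin_upgrade(self):
        # Stop admitting new requests, then wait for running ones to finish.
        with self._cond:
            self._upgrading = True
            while self._in_flight > 0:
                self._cond.wait()

    def end_upgrade(self):
        # Re-open the gate once the upgrade has completed.
        with self._cond:
            self._upgrading = False

    def handle(self, request_fn):
        # Admit the request only if no upgrade is in progress.
        with self._cond:
            if self._upgrading:
                raise ServerUnavailable("upgrade in progress; retry later")
            self._in_flight += 1
        try:
            return request_fn()
        finally:
            # Signal a waiting upgrade that one more request has drained.
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()
```

With a gate like this, a build machine that polls the server mid-upgrade would get a clear "retry later" error instead of pushing an internal component into an invalid state.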
In the subsequent days and weeks, we’ve learned about additional issues that customers have hit. The most serious one (which is to say, the one with the worst effect) is documented in this post. Its root cause is shared by a few of the other issues that customers have reported. Between TFS 2012 shipping and Update 1, we did some significant refactoring of some of the core TFS services like identity management (Active Directory synchronization, etc). This was done because we are making the core services framework in TFS usable for other services that we are building in Developer Division – for example the “Napa” tools that Jason announced in July. We knew that this refactoring was significant but believed we had an adequate test plan in place. However, in retrospect, it’s clear we did not. We are working on rolling up a set of fixes for all of the issues that have been reported and that should be available within the next week or so.
We haven’t finished our retrospective/post mortem exercise yet but a couple of learnings are becoming clear:
1) We have to figure out a way to be able to do major refactorings without jeopardizing the on premises updates. We’re still thinking through what the strategy for this will be – branching? selective deployment? different scheduling model? something else?
2) For an on prem server, where the number of configurations is astronomical, we absolutely must get some amount of pre-release customer testing. This release we did not. None of our “CTPs” were “go-live” and none had an upgrade path to the final release, which guaranteed that we’d get very little real world pre-release feedback.
3) We certainly discovered some ways in which our test matrix could be improved to catch more of the issues.
4) We really only had about 3 weeks of “end game” – final bug fixing and validation in QU1. In retrospect, we will probably need to vary that more based on the nature of the changes going into the release.
Please don’t draw the conclusion from this post that Update 1 was riddled with problems – it was not. Comparatively few people have experienced any issues. We’ve been using it ourselves successfully for months. However, we can do better. I’m not happy with how we did and we will learn from it. My goal is not to have to do any broadly available “hotfixes” after an Update. This time we had to do 2.
I’m sharing this because I’m happy to share both the good and the bad. I’ll tell you the things I’m proud of and I’ll tell you the things I’m not. And if, along the way, any of it is valuable to you and helps you figure out how to build better software, I’m happy.