TFS and Reliability and Disaster Recovery
We continue to evolve and improve the TFS reliability and disaster recovery story. Fundamentally, reliability and disaster recovery are about preserving service (or minimizing outages) and eliminating data loss when components of your system fail. When we look at TFS component failure, we focus primarily on the application tier (or web tier), the data tier, and the disk subsystem.
The solutions for each of these failure points can be different. You can buy redundant hardware: machines with redundant power supplies, network connections, etc.; RAID disk systems with redundant drives, controllers, etc. You can build amazingly fault-tolerant hardware, but the more you want, the more expensive it gets, and none of it helps if you have an earthquake and your data center collapses. For both of these reasons, people look to alternative ways of dealing with reliability and disaster recovery.
Looking at the TFS components, let’s examine the reliability and disaster recovery story for each beyond highly reliable hardware.
Application Tier
In V1 of TFS, we tested and supported what we called application tier “warm standby”. This means you can configure a second (or third, etc.) application tier and have it ready to take over in the event of a failure in the “active” application tier machine. The secondary application tier does not take over automatically; it needs to be activated. This process is described here. Because it requires manual intervention, someone generally has to notice what has happened and then an administrator has to run the redirection process, which may include updating DNS information to point the old server name to the new server so that the clients don’t all have to be updated.
HP recognized an opportunity here and developed a solution using their HP Systems Insight Manager, described here. This enables application tier failover to happen automatically when Insight Manager discovers that the active application tier is no longer functioning properly.
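To make the idea of automated detection a bit more concrete, here is a minimal sketch of the kind of rule a monitor might apply before triggering a failover. This is not how Insight Manager works internally; the function name, the threshold of three, and the list of probe results are all assumptions made up for illustration.

```python
def should_fail_over(probe_results, threshold=3):
    """Return True once `threshold` consecutive health probes have failed.

    probe_results is an iterable of booleans (True = the active application
    tier answered the probe). Requiring several failures in a row avoids
    failing over because of a single dropped request.
    """
    consecutive_failures = 0
    for ok in probe_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False


# A short blip should not trigger a failover, a sustained outage should.
blip = should_fail_over([True, False, True, False, False])
outage = should_fail_over([True, False, False, False])
print(blip, outage)
```

The threshold is the tradeoff to tune: a low value reacts faster, a high value is less likely to fail over on a transient network problem.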
In future versions of TFS, we plan to support multiple (load-balanced) active application tiers for the same Team Foundation Server so that, if any one application tier fails, a standard load balancer can remove it from the rotation and the system can continue to operate with the remaining functioning application tier machines.
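As a rough illustration of what "removing a machine from the rotation" means, the sketch below models a round-robin rotation that skips machines marked as down. It is a toy model, not a description of any particular load balancer, and the class name and host names (`at1`, `at2`, `at3`) are hypothetical.

```python
class Rotation:
    """Round-robin rotation over application tier hosts (hypothetical names)."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.healthy = set(self.hosts)
        self._next = 0

    def mark_down(self, host):
        # Called when a health probe for this host fails.
        self.healthy.discard(host)

    def mark_up(self, host):
        # Called when a previously failed host passes its probe again.
        if host in self.hosts:
            self.healthy.add(host)

    def pick(self):
        """Return the next healthy host, or raise if none remain."""
        for _ in range(len(self.hosts)):
            host = self.hosts[self._next]
            self._next = (self._next + 1) % len(self.hosts)
            if host in self.healthy:
                return host
        raise RuntimeError("no healthy application tier available")


rotation = Rotation(["at1", "at2", "at3"])
rotation.mark_down("at2")  # at2 has failed and leaves the rotation
picks = [rotation.pick() for _ in range(4)]
print(picks)  # requests alternate between at1 and at3
```

The point is that clients keep talking to the same front-end address while the set of machines behind it shrinks and grows.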
Data Tier
The only TFS solution for data tier availability (without data duplication) is SQL clustering. Clustering is a hardware configuration in which multiple SQL Server machines share the same disk subsystem (usually a SAN). It provides automatic and transparent failover if the primary SQL Server machine fails, but it does not address a failure in the shared disk subsystem. Although it is a very robust solution, it is fairly expensive and requires careful selection and matching of hardware components. You can read about how to configure TFS data tiers for clustering here.
Data Tier + Disk subsystem
Before shipping TFS V1, we did not test or document any solutions for reliability and disaster recovery of the disk subsystem beyond backup and restore (which, for any large system, can be a time-consuming task). Since then we have tested both mirroring and log shipping. You can read more about them here, here and here. With either mirroring or log shipping you can configure a secondary, redundant system (data tier machine and disks) that can be either co-located or geographically distributed. Both can help protect against a total catastrophe (like a fire or earthquake), allowing you to get your system back up and running on new hardware (servers and disks) in a short period of time. The primary difference is that log shipping is a scheduled, periodic update of the secondary database, whereas mirroring updates the secondary either synchronously or asynchronously with a relatively short time lag. Mirroring also has additional features, like a witness server that can automatically fail over to the secondary if the primary becomes unavailable. Unfortunately, witness servers only work for single databases and TFS uses 7 different databases, so failover for TFS, even with mirroring, is a manual process.
In addition to the mirroring and log shipping described above, we support a variety of hardware-level disk solutions. For example, you can use either RAID 5 or RAID 10 disk configurations, multiple controller cards, host bus adapters, etc. to make your disk subsystem fault tolerant. As a general rule, for any high-traffic system, we recommend RAID 10 over RAID 5 because RAID 10 has substantially better write throughput than RAID 5 and TFS is a “write-heavy” application.
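Because the failover has to be repeated for every database, it helps to script it. The sketch below only generates the T-SQL statements, one per database, so an administrator can review them before running anything. The database names are an assumption based on a default installation and should be checked against your own data tier; the helper name is hypothetical. The two statement forms are standard SQL Server database mirroring syntax: `SET PARTNER FAILOVER` is run on the principal for a planned failover of a synchronized database, and `SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS` is run on the mirror when the principal is gone.

```python
# Assumed database names, for illustration only; confirm against your data tier.
TFS_DATABASES = [
    "TfsActivityLogging",
    "TfsBuild",
    "TfsIntegration",
    "TfsVersionControl",
    "TfsWarehouse",
    "TfsWorkItemTracking",
    "TfsWorkItemTrackingAttachments",
]


def failover_statements(databases, principal_available=True):
    """Build one mirroring failover statement per database.

    principal_available=True  -> planned failover, run on the principal.
    principal_available=False -> forced service, run on the mirror; any
                                 transactions not yet sent to it are lost.
    """
    clause = "FAILOVER" if principal_available else "FORCE_SERVICE_ALLOW_DATA_LOSS"
    return [f"ALTER DATABASE [{name}] SET PARTNER {clause};" for name in databases]


for statement in failover_statements(TFS_DATABASES):
    print(statement)
```

Generating the full list in one pass makes it less likely that one database is left behind on the old principal while the others have moved.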
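A back-of-the-envelope calculation shows where the write throughput difference comes from. The sketch below uses the usual textbook write penalties for small random writes: 2 physical I/Os per logical write on RAID 10 (one per mirror copy) and 4 on RAID 5 (read data, read parity, write data, write parity). The disk count and per-disk IOPS are made-up numbers, and the model ignores controller caching and full-stripe writes, so treat it as an estimate rather than a benchmark.

```python
# Physical I/Os needed per small random logical write (textbook values).
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4}


def effective_write_iops(level, disks, iops_per_disk):
    """Estimate random-write IOPS for an array of identical disks."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


# Hypothetical array: 8 disks at 150 IOPS each.
raid10 = effective_write_iops("RAID10", disks=8, iops_per_disk=150)
raid5 = effective_write_iops("RAID5", disks=8, iops_per_disk=150)
print(raid10, raid5)  # 600.0 300.0
```

With the same spindles, this simple model gives RAID 10 twice the random-write capacity of RAID 5, at the cost of less usable space.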
And, of course, as a last resort, we support a strong backup and restore story that includes online and incremental backup.
We are in the process of evolving our TFS admin and operations documentation on MSDN. Over time all of this information will migrate there and should be much easier to find. In the meantime, I hope this provides some context and pointers to resources you can use as you learn about reliability and disaster recovery for TFS.