Monitoring Team Foundation Server 2018
Monitoring on-premises Team Foundation Server deployments is an important part of keeping them running smoothly, especially for large enterprise deployments. Good monitoring can help administrators avoid issues before they impact end users, as well as react quickly when user impacts do occur.
TFS has shipped management packs for System Center Operations Manager since way back in 2008. See https://www.microsoft.com/download/details.aspx?id=14720 for that original download, and https://www.microsoft.com/download/details.aspx?id=54791 for the latest TFS 2017 version. These management packs have largely offered the same functionality for all these years.
More recently, we’ve learned a lot about monitoring critical services by continually improving our DevOps practices for Visual Studio Team Services. Some key lessons learned include:
- Outside-in monitoring is not enough. It is reactive, rather than proactive. And it’s not fine-grained enough to give a good picture of end-user experience.
- Signal-to-noise ratio is key – noisy or unreliable monitors end up getting disabled or ignored.
In VSTS we have internalized these lessons by supplementing outside-in monitoring with proactive monitoring of infrastructure health (application-tier CPU utilization, for example) and by fine-tuning our alerts to make sure they are actionable.
With all this in mind, we took a step back to think about what we should recommend (and what we should ship) for monitoring Team Foundation Server deployments.
TL;DR – we recommend using System Center with the SQL, IIS, and Windows management packs plus a few custom monitors/alerts configured using built-in System Center Operations Manager capabilities. We believe this configuration will be just as easy to set up and will provide better monitoring than the previous TFS management packs. As a result, we do not plan to ship a TFS management pack for TFS 2018.
If you are currently using one of the TFS management packs, please read the recommendations below for more details. If they do not ring true for you and you believe the TFS management packs provide significant value, please let me know in the comments below and/or reach out to me at aaronha at microsoft dot com.
SQL, IIS, and Windows Management Packs
These three management packs provide a wealth of information about the underlying software on which a TFS deployment relies. Each of them can be easily installed from the management pack catalog.
The SQL management packs (there are multiple, for the various versions of SQL Server) cover a lot of ground, from checking that the latest service packs are installed to monitoring CPU utilization and disk space availability. To learn more, download the SQL 2016 management pack guide at https://www.microsoft.com/download/details.aspx?id=53008.
The IIS management packs (again there are multiple, for the various versions of IIS) primarily monitor the availability of your web sites and their associated application pools. They can also be used for performance monitoring scenarios. To learn more, download the IIS 10 management pack guide at https://www.microsoft.com/download/details.aspx?id=54445.
The Windows management packs (again there are multiple, for the various versions of Windows) also cover a lot of ground. Monitoring and alerting extend to disks and disk partitions, processors and CPU utilization, network adapters and bandwidth usage, and memory utilization.
Between these three types of management packs, you can get quite extensive monitoring of your TFS deployments, from the ASP.NET web layer through the SQL backend and all the way down to the underlying OS. Much of the data they provide can be used to fix issues – resource constraints, for example – before they start impacting end users.
Web Application Transaction Monitoring
A simple availability monitor for your TFS deployments can be set up by following the general instructions at https://technet.microsoft.com/library/hh457553.aspx. The simplest approach is to start with a Single URL monitor. My recommendation would be to hit the ProjectCollections REST endpoint, which will interact with the configuration database and return the list of team project collections for the deployment. The full URL will look something like http://mytfs:8080/tfs/_apis/ProjectCollections.
Make sure to set up the properties of the web application with the appropriate User Account and Authentication Method. See https://technet.microsoft.com/library/hh457542.aspx for more information here. Typically, Authentication Method should be Negotiate, and the User Account should be a user who has read access to the monitored TFS deployment(s). If you want to get fancy, you can alert on slow performance, response content, and so forth.
Once you’ve set things up, you can configure additional requests to monitor your other TFS deployments, if you have more than one.
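To make the checks concrete, here is a minimal Python sketch of the kind of logic an availability monitor applies to each response. It is illustrative only and is not how System Center implements the monitor. The function names, the 5-second threshold, and the assumption that the endpoint returns a JSON object containing a `value` list are mine. The request itself (with Negotiate authentication) is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass
class CheckResult:
    healthy: bool
    reason: str


def collections_url(base_url: str) -> str:
    """Build the ProjectCollections REST URL for a TFS base URL."""
    return base_url.rstrip("/") + "/_apis/ProjectCollections"


def evaluate_response(status: int, elapsed_seconds: float, body: str,
                      slow_threshold_seconds: float = 5.0) -> CheckResult:
    """Classify one response the way an availability monitor might."""
    if status != 200:
        return CheckResult(False, f"unexpected HTTP status {status}")
    try:
        payload = json.loads(body)
    except ValueError:
        return CheckResult(False, "response body is not valid JSON")
    # Assumption: the endpoint returns an object with a "value" list
    # holding one entry per team project collection.
    if not isinstance(payload, dict) or not isinstance(payload.get("value"), list):
        return CheckResult(False, "response has no collection list")
    # Hypothetical threshold; tune it to what is normal for your deployment.
    if elapsed_seconds > slow_threshold_seconds:
        return CheckResult(False, f"slow response: {elapsed_seconds:.1f}s")
    return CheckResult(True, f"{len(payload['value'])} collection(s) listed")
```

With more than one deployment, the same evaluation would simply run once per base URL, for example against `collections_url("http://mytfs:8080/tfs")` and each of its siblings.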
Monitoring the TFS Job Agent and TFS Build Agents
Build resource monitoring was removed from the TFS 2017 management pack, and back then I wrote up a recommended process for monitoring these resources. See https://blogs.msdn.microsoft.com/devops/2017/03/28/monitoring-build-resources-with-the-tfs-2017-management-pack/.
The TFS Job Agent service, which is used to run long-running background tasks, can be monitored using the same approach documented for monitoring XAML build resources – the Windows Service management pack template.
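For a quick manual check outside of System Center, the following Python sketch shows roughly what a Windows Service monitor is checking: whether the service reports a RUNNING state. It assumes the service is registered under the name `TFSJobAgent` and that `sc query` prints a `STATE` line in its usual English layout. Verify both on your own servers before relying on it.

```python
import re
import subprocess
from typing import Optional

# Assumption: "TFSJobAgent" is the Windows service name of the TFS Job Agent.
SERVICE_NAME = "TFSJobAgent"


def query_service(name: str = SERVICE_NAME) -> str:
    """Run `sc query <name>` (Windows only) and return its text output."""
    completed = subprocess.run(["sc", "query", name],
                               capture_output=True, text=True)
    return completed.stdout


def parse_service_state(sc_output: str) -> Optional[str]:
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def service_is_running(sc_output: str) -> bool:
    """True only when the reported state is exactly RUNNING."""
    return parse_service_state(sc_output) == "RUNNING"
```

On an application tier, `service_is_running(query_service())` would give a yes/no answer. Anything other than RUNNING, including a missing service, counts as unhealthy.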
TFS Management Packs
If you follow the above recommendations, TFS management packs will not provide any significant additional value. The old TFS management packs all had the same basic capabilities:
- Outside-in monitoring through pinging of a variety of web service endpoints.
- Event log monitoring for a variety of errors.
- Windows Service availability monitoring for the TFS job agent process.
The outside-in monitoring in the TFS management packs does similar work to the web application transaction monitor recommended above, but in a noisier way – in virtually all failure scenarios, all seven of the web service endpoints it monitors will start failing at once.
The Event Log monitoring is not covered by the recommendations from above, but it is again quite noisy. Dozens of monitors are provided for individual events. Most of these are not actionable, meaning that no reasonable instructions are provided to fix the underlying issue. For example:
TFS Event 3076 occurred. This is raised by the catalog service when a catalog entry has a missing parent node. This could indicate database inconsistencies. Check the health of the SQL Server configuration database.
If for some reason you did want an additional layer of monitoring around TFS Event Log errors, it is easy enough to use NT Event Log alerts (a built-in System Center Operations Manager capability – see https://technet.microsoft.com/library/ff730470.aspx) to create either a blanket alert for all TFS errors in the event log, or specific alerts for subsets of errors. This is not recommended unless you want to set the priority and severity of these issues to something low and then review them periodically.
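The difference between a blanket alert and a subset alert comes down to the filter applied to each event. Here is a minimal Python sketch of that filter, not the System Center rule format. The `EventRecord` type is hypothetical, and the event source name `TFS Services` is an assumption, so check what your servers actually log before copying it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class EventRecord:
    source: str     # event source as shown in the event log
    level: str      # "Error", "Warning", "Information", ...
    event_id: int


def select_tfs_errors(events: Iterable[EventRecord],
                      source: str = "TFS Services",  # assumed source name
                      event_ids: Optional[Set[int]] = None) -> List[EventRecord]:
    """Return error events from the given source.

    With event_ids=None this behaves like a blanket alert for all errors
    from the source; passing a set of IDs narrows it to a specific subset.
    """
    selected = []
    for event in events:
        if event.source != source or event.level != "Error":
            continue
        if event_ids is not None and event.event_id not in event_ids:
            continue
        selected.append(event)
    return selected
```

Calling `select_tfs_errors(events)` corresponds to the blanket alert, and `select_tfs_errors(events, event_ids={3076})` to an alert scoped to a single event such as the catalog error quoted above.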
Finally, the TFS Job Agent monitoring doesn’t provide any additional value beyond the Windows Service monitor recommended above.
Setup and discovery
Another part of the value of the TFS management packs was meant to be getting TFS monitoring out of the box without having to do all the manual setting up of monitors and alerts described above. Getting the TFS management packs up and running was a rather cumbersome process, however, which you can read all about in Appendix A of the management pack guide available for download at https://www.microsoft.com/download/details.aspx?id=54791. Discovery was not automatic, since each server on which TFS resources reside needed to be configured to allow it to act as a proxy. Given all of this, setting up the individual monitors discussed above should be comparatively straightforward.
I believe these new recommendations should provide monitoring of TFS deployments that is as good as or better than the old TFS management packs, and that is just as easy to configure. If you are a current user of one of the existing TFS management packs and this assessment or these new recommendations don’t ring true, please let me know.