{"id":1533,"date":"2014-04-03T10:47:00","date_gmt":"2014-04-03T10:47:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/visualstudioalm\/2014\/04\/03\/case-study-application-performance-monitoring-with-application-insights\/"},"modified":"2022-07-18T01:11:10","modified_gmt":"2022-07-18T09:11:10","slug":"case-study-application-performance-monitoring-with-application-insights","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/devops\/case-study-application-performance-monitoring-with-application-insights\/","title":{"rendered":"Case study: Application Performance Monitoring with Application Insights"},"content":{"rendered":"<p><span style=\"font-family: Calibri;font-size: medium\">James Beeson, Alan Wills &#8211;<\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: medium\">Our group runs about 20 web applications, serving a community of about 100k users spread around the world. Since we started using Application Insights, we\u2019ve found we have a much clearer view of our applications\u2019 performance, and as a result, our users are seeing better performing and more useful apps. This post tells you about our experiences.<\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: medium\">We&#8217;re pretty agile. We run a three-week sprint, and we adjust our plans for future sprints based on the feedback we get from the current release.<\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: medium\">The data we get from Application Insights broadly answers two questions about a web app:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: Calibri;font-size: medium\">Is it running OK? Is it available, and is it responding promptly and correctly? Does it respond well under load? And if not, what&#8217;s going wrong?<\/span><\/li>\n<li><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">How is it being used? Which features are our customers using most? 
Are they achieving their goals successfully? Are they coming back?<\/span><\/span><\/li>\n<\/ul>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">We\u2019ll focus on the first question in this post.<\/span><\/span><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><\/span><\/span><\/p>\n<h2><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">The team room dashboard<\/span><\/span><\/h2>\n<p><span style=\"font-family: Calibri;font-size: medium\">We keep a dashboard running in the team room. It reminds us that there are real users out there! It looks like this:<\/span><\/p>\n<p><span style=\"font-size: medium\">\u00a0<a href=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/7418.01.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/7418.01.png\" alt=\"\" border=\"0\" \/><\/a><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: medium\">Here\u2019s how we set up our team room dashboard:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: Calibri;font-size: medium\">Use the slide show feature to show all our apps in turn.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Use the same dashboard layout for every app. That way, we get used to the way it should look, and quickly notice anything unusual. <\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Set the date range to 3 days, so we can see how things went over the weekend.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-family: Calibri;font-size: medium\">The tiles you see in the screenshot are the ones we find consistently useful: <\/span><\/p>\n<ul>\n<li><span style=\"font-family: Calibri;font-size: medium\">Availability \u2013 reassures us the site is running. 
<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Performance index shows what proportion of requests are serviced in an acceptable time. If it dips, maybe we\u2019re overloaded (is CPU high?) or maybe a resource we\u2019re dependent on is having problems.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Reliability dips if the app has uncaught exceptions. Exceptions usually show to the user as error messages, or as something failing to happen. If they happen only when the request count is high, we\u2019ve probably got resource problems. But if we get a few all the time, it suggests that our users are finding a bug that our testers didn\u2019t.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">The request count gives us a feel for how widely the usage varies. When it\u2019s high, we quickly become aware of how perf and reliability are affected.<\/span><\/li>\n<li><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">CPU tells us if we\u2019re hitting the endstop and should scale up.<\/span><\/span><\/li>\n<\/ul>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">In our weekly report for stakeholders, we quote the availability and reliability figures, with screenshots of the dashboard.<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">We have found that in the months since we started displaying these dashboards, we\u2019ve become much more conscious of performance issues. Partly that\u2019s because we\u2019ve discovered and dealt with quite a few issues. But partly it\u2019s just because the measurements are there all the time. They come up in discussions more often, and we think of performance more when we\u2019re developing. 
As a result, our users are less likely to experience slowdowns or exceptions than they once were.<\/span><\/span><\/p>\n<h2><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">Setting up availability monitors<\/span><\/span><\/h2>\n<p><span style=\"font-family: Calibri;font-size: medium\">The first tile on the dashboard is the availability monitor. Here are our tips about availability monitors:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: Calibri;font-size: medium\">Set up at least a single-URL test for your home page.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Edit the test and set the optional extras:<\/span> \n<ul>\n<li><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Set an alert to send you an email if any two locations fail. <\/span><\/span><\/li>\n<li><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Test for a string that you\u2019d only find in a correctly-working page. But don\u2019t forget to update the test when you deploy a change to the page.<\/span><\/span><\/li>\n<\/ul>\n<\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Don\u2019t switch off availability tests just because you\u2019re planning a maintenance outage. There\u2019s no harm in seeing the outage on the graph, and, well, it\u2019s good to test your smoke alarm! When you take the site down, check to see that you receive an alert.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\">Set up an availability test that exercises the connection to the back end server \u2013 for example it could search the catalog. 
But see the discussion below.<\/span><\/li>\n<\/ul>\n<h2><span style=\"color: #3366ff;font-family: Calibri;font-size: medium\">Is your back end still running?<\/span><\/h2>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><strong>Checking the back end by E2E availability.<\/strong> Some teams like to set up an availability monitor based on a web test that runs a real end-to-end scenario using a dummy account. For example, it might order a widget, check out, and pay for it. The idea is to make sure that all the important functions are running. It\u2019s undoubtedly more thorough than pinging the home page, and it gives you confidence that your whole app is working. <\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">But be aware of the 2-minute timeout. Application Insights will log a failure if your whole test takes longer than that. And don\u2019t forget you\u2019ll have to update the test when there\u2019s any change in your user experience.<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">So, although verifying that your back end is running correctly is a useful function, in our team we don\u2019t usually use availability tests for that.<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><strong>Checking the back end by reliability.<\/strong> Instead, we set an alert to trigger if the reliability index (that\u2019s transactions without exceptions) dips below 90%. 
If the SQL server goes down, the web server\u2019s timeout exceptions will soon tell us about it.<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Still, we\u2019d agree that if you have an app that isn\u2019t used every minute of the day, it can be nice to do a proactive test periodically rather than waiting for some unfortunate user to discover the fault.<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><strong>Built-in self-tests. <\/strong>An interesting approach we\u2019re trying on one of our applications uses a built-in self-test. In the web service component of this app, we coded up a status page that runs a quick smoke test of all the components and external services that the app depends on. Then we set up an availability test to access that page and verify that all the \u201cOK\u201d results are there. The effect is at least as good as a complex web test, it\u2019s reliably quicker, and it doesn\u2019t need to be updated whenever there\u2019s some simple change in the UX.<\/span><\/span><\/p>\n<h2><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">Setting up monitoring for a live site<\/span><\/span><\/h2>\n<p><span style=\"font-family: Calibri;font-size: medium\">This is the checklist of things we set up to monitor a live application:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Availability<\/strong>. At least a single-URL \u2018ping\u2019 test, and preferably a back-end exerciser, too.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Performance<\/strong>. Follow the setup instructions on the performance page. To monitor a live site, we don\u2019t set any special configuration parameters in the config file. 
The one parameter we often set is the application display name \u2013 this sets the application name under which the data appear in Application Insights.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Alerts<\/strong>. Typically we set up alerts on availability, performance, and reliability, to let us know if anything goes weird. We also set an alert on the request count. If there\u2019s a surge of requests for any reason, we like to keep a close eye on CPU usage and dependencies.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Dashboard<\/strong>. The team room dashboard, as discussed. Developers also set up their own.<\/span><\/li>\n<li><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><strong>Usage<\/strong>. We usually install the basic SDK and web page monitors, which give us page and user counts. (But we\u2019ll talk about usage in a separate article.)<\/span><\/span><\/li>\n<\/ul>\n<p><span style=\"font-family: Calibri;font-size: small\">\u00a0<\/span><\/p>\n<h2><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">Reviewing and diagnosing a performance issue<\/span><\/span><\/h2>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">In this section, we\u2019ll show how Application Insights helps us resolve a typical issue with a production web application. We\u2019ll put you in the driver\u2019s seat to make it more exciting!<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Let\u2019s suppose you see a dip in performance:<\/span><\/span><\/p>\n<p>\u00a0<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/0488.02.png\" alt=\"\" border=\"0\" \/><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Click on the performance tile and the server performance page appears. 
<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><span style=\"font-family: Calibri\"><a href=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/8484.03.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/8484.03.png\" alt=\"\" border=\"0\" \/><\/a><\/span><\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">There\u2019s an odd peak in average response time (purple line) that doesn\u2019t seem to correspond to a peak in requests (blue line). In fact, there are earlier request peaks that don\u2019t cause slower responses. <\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">But you notice that there is a preceding peak in the calls made from the web server to other resources (the colored bars). Looking at the resource color keys, these are WCF services. <\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Zoom in on the interesting part of the chart by dragging across the small key chart.<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Switch to the Diagnostics, Events page, and select Event Type=All. (Notice how the zoom is preserved as you move from one chart to another.)<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\">\u00a0<a href=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/1738.04.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/1738.04.png\" alt=\"\" border=\"0\" \/><\/a><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">This page conveniently shows the reliability index above the table of events. 
Notice how it dips just after a deployment marker \u2013 the number of exceptions increased after we deployed a new version of the code. <\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Looking at the table, there are quite a lot of Timer Elapsed exceptions, and there are some resource calls that are taking more than three minutes. (Recall that two kinds of events appear: events that flag exceptions that users will see as failures of some sort; and performance events that flag requests that take a long time to service.)<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Open the event that flags the long resource calls. If you want, you can expand it first to pick a particular instance. Take a look at the stack:<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\">\u00a0<a href=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/2018.05.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/2018.05.png\" alt=\"\" border=\"0\" \/><\/a><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Looking at several of these performance events, we find that when we drill into the call stack, there\u2019s typically a surprisingly long wait to open a SQL Azure connection. It\u2019s happening in one of our most frequently used MVC pages. A database connection should not take more than 100 milliseconds to open, but in this instance it\u2019s taking more than 18 seconds. <\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">Checking the Azure Management Portal, we notice that the SQL Azure Database and our Hosted Service (Web Roles) are running in two different locations. Every time a customer accesses this MVC page, we open a database connection across half a continent. 
<\/span><\/span><\/p>\n<p><span style=\"font-size: medium\"><a href=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/3056.06.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2014\/04\/3056.06.png\" alt=\"\" border=\"0\" \/><\/a><\/span><\/p>\n<p><span style=\"font-size: medium\"><span style=\"font-family: Calibri\">The cure is to host the different services in the same data cluster. To tell Windows Azure to do this, we define an affinity group in our subscription, and add to it the database and web role services. We did that, and the service calls immediately became satisfyingly swift.<\/span><\/span><\/p>\n<h2><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">Monitoring pre-production<\/span><\/span><\/h2>\n<p><span style=\"font-family: Calibri;font-size: medium\">Application Insights isn\u2019t just for live applications. We use it for applications under development and test, too. Availability tests aren\u2019t so useful pre-production, but the performance tests certainly are. Provided the test server can send data to the public internet, the results appear in Application Insights.<\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: medium\">There are a few things you\u2019ll want to configure differently for testing. To configure performance monitoring, edit the ApplicationInsights.config file in your web project. Here are some of the parameters that you\u2019ll want to change:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Set DisplayName<\/strong> to avoid your test results getting mixed in with your live results. It sets the application name under which your results appear in Application Insights.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Reduce PerformanceThreshold<\/strong> so that you see more events. 
The Monitoring Agent times all calls to your application from the web service host, and records a trace of all the internal calls, with timings. It sends that execution trace in a performance event, but only if the overall time exceeds the PerformanceThreshold, which defaults to 5 seconds. When you\u2019re tuning the performance of your app, reduce the PerformanceThreshold so that most requests exceed it. That way, you\u2019ll get a good sample of stack traces, so that you can see where the time is being spent. In practice, only a few events will be generated because of throttling limits imposed by the Monitoring Agent\u2014so you don\u2019t need to worry about a flood of events.<\/span><\/li>\n<li><span style=\"font-family: Calibri;font-size: medium\"><strong>Get more detailed execution traces by reducing Sensitivity.<\/strong><em> <\/em>In an execution trace, data is omitted for any call that completes in less than the Sensitivity, which defaults to 100ms. Set it lower to see more calls. Beware that this might create very large event traces.<\/span><\/li>\n<li><span style=\"font-size: medium\"><span style=\"font-family: Calibri\"><strong>See parameter values by setting Resources<\/strong>. If you want to know the actual parameter values in a call to an internal method, name it as a Resource. For example, if some calls seem to take an unusually long time, it might be useful to see what values are causing the problem.<\/span><\/span><\/li>\n<\/ul>\n<h2><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">Improving performance with Application Insights<\/span><\/span><\/h2>\n<p><span style=\"font-family: Calibri;font-size: large\">We hope this has given you some feeling for how we set up Application Insights on our team. It helps us notice performance issues before the customers complain, and it helps us diagnose them. It also helps us get performance traces that we can use to improve performance even where it\u2019s already mostly acceptable. 
In general, it\u2019s made us more conscious of performance, and helped us create a better set of applications for our users.<\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: medium\">\u00a0<\/span><\/p>\n<h4><span style=\"font-size: medium\"><em><span style=\"color: #2e74b5\"><span style=\"font-family: Calibri Light\">Links<\/span><\/span><\/em><\/span><\/h4>\n<p><span style=\"font-size: medium\"><a href=\"http:\/\/msdn.microsoft.com\/library\/dn495324.aspx\"><span style=\"color: #0563c1;font-family: Calibri\">Performance and exception monitoring with Application Insights for Visual Studio Online<\/span><\/a><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>James Beeson, Alan Wills &#8211; Our group runs about 20 web applications, serving a community of about 100k users spread around the world. Since we started using Application Insights, we\u2019ve found we have a much clearer view of our applications\u2019 performance, and as a result, our users are seeing better performing and more useful apps. [&hellip;]<\/p>\n","protected":false},"author":154,"featured_media":45953,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[224,226,1],"tags":[],"class_list":["post-1533","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-azure","category-ci","category-devops"],"acf":[],"blog_post_summary":"<p>James Beeson, Alan Wills &#8211; Our group runs about 20 web applications, serving a community of about 100k users spread around the world. Since we started using Application Insights, we\u2019ve found we have a much clearer view of our applications\u2019 performance, and as a result, our users are seeing better performing and more useful apps. 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts\/1533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/users\/154"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/comments?post=1533"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts\/1533\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/media\/45953"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/media?parent=1533"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/categories?post=1533"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/tags?post=1533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}