{"id":9791,"date":"2007-02-06T10:36:27","date_gmt":"2007-02-06T10:36:27","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/bharry\/2007\/02\/06\/managing-quality-part-3-performance-testing\/"},"modified":"2018-08-14T00:34:19","modified_gmt":"2018-08-14T00:34:19","slug":"managing-quality-part-3-performance-testing","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/bharry\/managing-quality-part-3-performance-testing\/","title":{"rendered":"Managing Quality (part 3) &#8211; Performance Testing"},"content":{"rendered":"<p>If you read my blog much, then you know performance and scale are near and dear to my heart.&nbsp; If you read back far enough in my blog, you&#8217;ll find a bunch of information on my philosphy of performance testing and what we&#8217;ve done for TFS.&nbsp; I&#8217;m going to repeat some of it here but if want more detail, I&#8217;ve included a map of previous posts at the bottom.<\/p>\n<h3>Disecting Performance Testing<\/h3>\n<p>There are a lot of aspects to performance testing designed to answer different questions.&nbsp; In my view, the main questions you want to answer with performance testing are:<\/p>\n<ul>\n<li>How long does my user have to wait?<\/li>\n<\/ul>\n<p>I call this &#8220;performance testing&#8221;.&nbsp; This is about measuring the end user&#8217;s perception of performance.&nbsp; You test this by simulating what the user actually does.&nbsp; In fact, &#8220;performance&#8221; tests actually automate the user interface (might be a Windows GUI, a web page, a command line, etc) and measure the time delay the user perceives.&nbsp; Performance tests can be run on a system under load or on an &#8220;idle&#8221; system.&nbsp; They can be run with lots of data or small amounts of data.&nbsp; All of these are interesting but the nature of the question you are asking is the same.<\/p>\n<ul>\n<li>How many users can my system support?&nbsp; Or a variation &#8211; how much hardware does it take to support my 
users?<\/li>\n<\/ul>\n<p>I call this &#8220;load testing&#8221;.&nbsp; The nature of load testing is to set up a &#8220;typical&#8221; configuration for the size of team you are trying to measure and run a statistical simulation of the load generated by the team.&nbsp; You measure response times.&nbsp; It is important that you develop a good profile of the load you expect a given user to put on the system &#8211; including both the proportion of different requests and the frequency of requests.&nbsp; Further, it is important to populate the system with a dataset that is representative of what the simulated user population will use.&nbsp; We use the limit testing tools described below to set up these data sets.<\/p>\n<p>A good practice is then to run a &#8220;ramp&#8221; load test where you start by simulating a small number of users and gradually increase the number of simulated users, watching the response times until they exceed what you deem to be an acceptable threshold.&nbsp; This defines the load the system and software under test can support.<\/p>\n<p>People frequently categorize &#8220;Scale&#8221; testing as different from &#8220;performance&#8221; testing.&nbsp; Strictly speaking performance testing is about testing how fast a given system is and scale testing is about testing how much more capacity a system has when you vary the resources.&nbsp; For example if I have a 1 processor system and I get X requests per second and I move to a 2 processor, how many requests per second can it handle?&nbsp; If the answer is less than X, that is called negative scaling (and is really bad but you&#8217;d be surprised how often it happens :)).&nbsp; My rule of thumb is (for a well designed system) 2 proc should give you about 1.9X performance, 4 proc &#8211; about 3.5X, 8 Proc &#8211; about 6.5X.&nbsp; Of course CPU is only one measure by which you can measure scale &#8211; you can measure scale as a function of memory, disk I\/O capacity, network bandwidth or any 
other limited resource in the system &#8211; but CPU is what people talk about most of the time.<\/p>\n<p>All that said, I roll scale testing into load testing.&nbsp; We measure load on a variety of different hardware configurations &#8211; varying CPUs, memory and disk subsystems.&nbsp; If we see unexpected performance ratios in load testing, we investigate.<\/p>\n<ul>\n<li>How much data can my system support and still provide acceptable performance?<\/li>\n<\/ul>\n<p>I refer to this as &#8220;limit testing&#8221;.&nbsp; With limit testing, we build tools to load data into the system (e.g. #&#8217;s of work items, #&#8217;s of checkins, #&#8217;s of files, etc.).&nbsp; We record the time it takes to add them as the count grows and we graph&nbsp;it.&nbsp; In an ideal world, the graph is flat &#8211; it takes the same amount of time to add the 1,000,000th work item as it did to add the 1st, but there is always some slowdown.&nbsp; You want a graph that has a very low slope, no discontinuities (you don&#8217;t want it to jump from 10ms per work item at 1000 work items to 30ms per work item at 1001) and low deviation (you don&#8217;t want the average time to be 10ms but one out of every thousand takes 10 seconds).&nbsp; Once we &#8220;fill up&#8221; the database, we test the overall system performance by running the &#8220;performance&#8221; tests described above to ensure that the system behaves reasonably well.<\/p>\n<h3>Setting Goals<\/h3>\n<p>A VERY important part of performance testing is setting goals.&nbsp; It&#8217;s also poorly understood and there are many opinions on how to do it &#8211; this being <em>my<\/em> blog, I&#8217;ll give you mine :).<\/p>\n<p><strong>Performance testing goals<\/strong> &#8211; The first thing is to identify your scenarios.&nbsp; What operations are people going to do the most?&nbsp; What are they going to be most sensitive about?&nbsp; What is most likely to be an issue?&nbsp; We generally prioritize our performance scenarios (pri 
1, 2, 3) and automate them in that order.&nbsp; To give you some context, for TFS we have about 80 performance scenarios and we add a few more with each release.&nbsp; You can see the list below in the performance report.<\/p>\n<p>Next, you need to define your goals for the scenarios.&nbsp; This is more an art than a science.<\/p>\n<p>I start by asking: what is the closest analogy to what the user does today?&nbsp; This will define a baseline for your user&#8217;s expectations.&nbsp; When I say this, I frequently hear people say &#8211; but we don&#8217;t have anything to compare to.&nbsp; That&#8217;s almost never true.&nbsp; You can compare to the competition, or you can compare to the user&#8217;s alternative &#8211; for really new scenarios.&nbsp; For example, imagine you were building&nbsp;the first&nbsp;workflow tool and you were developing performance goals for notifications.&nbsp; Perhaps your users&#8217; alternative is to send email.&nbsp; How long does that take?<\/p>\n<p>Secondly, I just use &#8220;common sense&#8221;.&nbsp; My rule is anything under about half a second is noise unless you do it a lot in sequence.&nbsp; Anything over 5-10 seconds is a &#8220;long-running operation&#8221; and the user is going to lose attention and start to think about something else.&nbsp; Anything over 15-20 seconds and the user is going to think the system is hung (unless you are providing visible progress indication).&nbsp; These are round numbers.&nbsp; Mostly I just use my judgement based on what I think I would expect for the operation.<\/p>\n<p>Lastly, be prepared to change it.&nbsp; When you start &#8220;beta&#8221; testing the system and users tell you something feels slow, go back and revisit your goals.<\/p>\n<p><strong>Load testing goals<\/strong> &#8211; There are two parts here:&nbsp;1) identifying the response time threshold that you consider to be the limit of acceptable performance.&nbsp; For this I use the same rules as for performance 
testing.&nbsp;&nbsp;2) determining the # of users you want a given hardware config to support.&nbsp; I look at typical team sizes, breakpoints in hardware costs (traditionally, big breaks between 2 proc, 4 proc and 8 proc) and make my best guess about hardware and operations costs customers will find acceptable for different user populations.&nbsp; I define a spectrum of canonical hardware configurations and assign the goals.&nbsp; If you watched what we did in TFS 2005, you&#8217;ll see I did a pretty bad job \ud83d\ude42&nbsp; Our initial goals were 500 users for a modest 4 proc system.&nbsp; By the time we got to the point we could really measure the system in its full glory, we found we could actually support about 2,000 users on that system &#8211; so we changed our goals \ud83d\ude42<\/p>\n<p><strong>Limit testing goals<\/strong> &#8211; A lot of years of experience with this have&nbsp;taught&nbsp;me to do your best to estimate how much data you can imagine your users will generate and multiply it by 10.&nbsp; If you are lucky, that will cover you, but even that may not.&nbsp; People will use your system in totally unexpected ways.&nbsp; Underestimating this is the most common mistake in setting goals for performance testing (and we made a few ourselves &#8211; like underestimating the #s of Team Projects people would create).<\/p>\n<h3>Running Tests<\/h3>\n<p><strong>Performance tests<\/strong> &#8211; We generally run &#8220;performance&#8221; tests on every single build.&nbsp; We compare results to the last build, the goal and to the best ever.&nbsp; Any significant delta from any of the above results in a bug and an investigation.&nbsp; Why we compare to the goal is obvious.&nbsp; We compare to the last build to watch for regressions.&nbsp; However, there&#8217;s always some variation (10% variation is totally normal) so we ignore small build-to-build variations or we get flooded with bugs.&nbsp; This is why we compare to best ever &#8211; this 
helps us identify long-term regressions.&nbsp; Occasionally we&#8217;ll reset &#8220;best ever&#8221; if we accept a new result as a new baseline.<\/p>\n<p>To help reduce variation, we run all performance tests 3 times in every run and average them.&nbsp; Even still, 10% variation is common.&nbsp; We also run all performance tests on&nbsp;&#8220;pristine&#8221; hardware on an isolated lab network to reduce as much variation as we can.<\/p>\n<p><strong>Load tests<\/strong> &#8211; Load tests take a pretty long time to set up, run and analyze.&nbsp; We tend to run these later in the cycle when the code starts to stabilize.&nbsp; Generally, I&#8217;d say about once a week or so is adequate coverage.<\/p>\n<p><strong>Limit tests<\/strong> &#8211; We generally only run these a couple of times per product cycle.&nbsp; Some of the individual tests can take a week or more to run due to the amount of data population.&nbsp; The goals help define architectural design guidelines and the testing helps verify compliance.<\/p>\n<h3>Reports<\/h3>\n<p>Here&#8217;s a performance trend report:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/image%7B0%7D%5B25%5D.png\"><img decoding=\"async\" style=\"border-right: 0px;border-top: 0px;border-left: 0px;border-bottom: 0px\" height=\"445\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/image%7B0%7D_thumb%5B13%5D.png\" width=\"753\" border=\"0\"><\/a> <\/p>\n<p>Here&#8217;s an example of a daily performance report.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/image%7B0%7D%5B17%5D.png\"><img decoding=\"async\" style=\"border-right: 0px;border-top: 0px;border-left: 0px;border-bottom: 0px\" height=\"1848\" src=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/8\/2019\/02\/image%7B0%7D_thumb%5B7%5D.png\" width=\"903\" border=\"0\"><\/a> <\/p>\n<p>The one thing I feel is missing from our performance 
reporting is trending on <em>improvement<\/em>.&nbsp; Orcas is dramatically faster than TFS 2005 but you wouldn&#8217;t know it because you&#8217;ll ultimately find that they both meet their goals.&nbsp; What you won&#8217;t see is that the goals changed and the results are way better (except by looking at the detailed reports that have way too much data).&nbsp; We&#8217;re working on a good way to visualize actual performance over time in addition to trend towards meeting goals.<\/p>\n<p>I will work on getting some load testing and limit testing reports posted in a future post.&nbsp; This one is getting pretty long and we haven&#8217;t kicked those fully into gear for Orcas yet.<\/p>\n<h3>Miscellaneous footnotes<\/h3>\n<p>We do &#8220;performance&#8221; testing on a variety of hardware configs &#8211; including low memory, low bandwidth, etc. &#8211; to see how end user performance changes based on client hardware.<\/p>\n<p>A lot of our perf testing (and many of the results I&#8217;ve posted on my blog in the past few months) comes from what I&#8217;d call dev &#8220;perf unit testing&#8221;.&nbsp; This means &#8211; not in the perf lab with official perf tests, but rather devs picking a set of scenarios they want to improve and creating and measuring specific tests for those scenarios.&nbsp; Sometimes those tests will get turned into &#8220;official&#8221; perf lab scenarios, but sometimes not.<\/p>\n<h3>In Closing&#8230;<\/h3>\n<p>As you can tell, this is a HUGE topic &#8211; worthy of an entire book.&nbsp; I&#8217;ve probably written far too much already but hopefully it&#8217;s been at least a little useful.&nbsp; Here are the links to my previous posts that I promised.<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2005\/10\/24\/how-many-users-will-your-team-foundation-server-support.aspx\">How many users will your Team Foundation Server support?<\/a>&nbsp;&#8211; Contains a good description of how to think about load testing.<\/p>\n<p><a 
href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2005\/11\/28\/497666.aspx\">Capacity testing for TFS<\/a>&nbsp;&#8211; Has some thoughts on &#8220;limit&#8221; or &#8220;capacity&#8221; testing.<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2005\/12\/03\/499791.aspx\">&#8220;Concurrent&#8221; users<\/a>&nbsp;&#8211; A diatribe on the notion of &#8220;concurrent users&#8221;.<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2006\/10\/06\/Orcas-Version-Control-Performance-Improvements.aspx\">Orcas Version Control Performance Improvements<\/a>&nbsp;&#8211; Dev &#8220;perf unit testing&#8221; on Orcas version control performance.<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2007\/02\/02\/our-first-orcas-perf-results-at-load.aspx\">Our First Orcas perf results at load<\/a>&nbsp;&#8211; Some Orcas approximate load test results derived from stress testing (the topic of a future post)<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2007\/01\/14\/orcas-work-item-tracking-performance-improvements.aspx\">Orcas Work Item Tracking Performance Improvements<\/a>&nbsp;&#8211; Load test results on Orcas work item improvements.<\/p>\n<p><a href=\"http:\/\/blogs.msdn.com\/bharry\/archive\/2006\/01\/04\/509314.aspx\">Team Foundation Server Capacity Planning<\/a>&nbsp;&#8211; TFS 2005 server sizing recommendations based on our load testing results.<\/p>\n<p>&nbsp;<\/p>\n<p>I appologize for the length &#8211; it kind of got out of hand.<\/p>\n<p>Brian<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you read my blog much, then you know performance and scale are near and dear to my heart.&nbsp; If you read back far enough in my blog, you&#8217;ll find a bunch of information on my philosphy of performance testing and what we&#8217;ve done for TFS.&nbsp; I&#8217;m going to repeat some of it here but 
[&hellip;]<\/p>\n","protected":false},"author":244,"featured_media":14617,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[5],"class_list":["post-9791","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-tfs"],"acf":[],"blog_post_summary":"<p>If you read my blog much, then you know performance and scale are near and dear to my heart.&nbsp; If you read back far enough in my blog, you&#8217;ll find a bunch of information on my philosophy of performance testing and what we&#8217;ve done for TFS.&nbsp; I&#8217;m going to repeat some of it here but [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/9791","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/users\/244"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/comments?post=9791"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/posts\/9791\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media\/14617"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/media?parent=9791"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/categories?post=9791"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/bharry\/wp-json\/wp\/v2\/tags?post=9791"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}