Managing Quality (part 3) – Performance Testing
If you read my blog much, then you know performance and scale are near and dear to my heart. If you read back far enough in my blog, you’ll find a bunch of information on my philosophy of performance testing and what we’ve done for TFS. I’m going to repeat some of it here but if you want more detail, I’ve included a map of previous posts at the bottom.
Dissecting Performance Testing
There are a lot of aspects to performance testing designed to answer different questions. In my view, the main questions you want to answer with performance testing are:
- How long does my user have to wait?
I call this “performance testing”. This is about measuring the end user’s perception of performance. You test this by simulating what the user actually does. In fact, “performance” tests actually automate the user interface (might be a Windows GUI, a web page, a command line, etc) and measure the time delay the user perceives. Performance tests can be run on a system under load or on an “idle” system. They can be run with lots of data or small amounts of data. All of these are interesting but the nature of the question you are asking is the same.
- How many users can my system support? Or a variation – how much hardware does it take to support my users?
I call this “load testing”. The nature of load testing is to set up a “typical” configuration for the size of team you are trying to measure and run a statistical simulation of the load generated by the team. You measure response times. It is important that you develop a good profile of the load you expect a given user to put on the system – including both the proportion of different requests and the frequency of requests. Further, it is important to populate the system with a dataset that is representative of what the simulated user population will use. We use the limit testing tools described below to set up these data sets.
A good practice is then to run a “ramp” load test where you start by simulating a small number of users and gradually increase the number of simulated users, watching the response times until they exceed what you deem to be an acceptable threshold. This defines the load the system and software under test can support.
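To make the ramp idea concrete, here is a minimal sketch of how you might read the supported load off a ramp run. The function name, the data and the 5 second threshold are all made up for illustration; they are not taken from our actual test harness.

```python
def max_supported_users(ramp_results, threshold_seconds):
    """Return the largest simulated user count reached before the
    response time first exceeds the acceptable threshold.

    ramp_results: list of (simulated_users, response_time_seconds),
    in the order the ramp was run (fewest users first).
    """
    supported = 0
    for users, response_time in ramp_results:
        if response_time > threshold_seconds:
            break  # past the acceptable limit; stop the ramp here
        supported = users
    return supported


# Hypothetical ramp data: (simulated users, measured response time in seconds)
ramp = [(100, 0.4), (250, 0.6), (500, 0.9), (1000, 1.8), (2000, 4.5), (3000, 12.0)]
print(max_supported_users(ramp, threshold_seconds=5.0))
```

With this made-up data the response time stays under the threshold through 2,000 simulated users and blows past it at 3,000, so 2,000 is the number you would report for that configuration.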
People frequently categorize “Scale” testing as different from “performance” testing. Strictly speaking performance testing is about testing how fast a given system is and scale testing is about testing how much more capacity a system has when you vary the resources. For example if I have a 1 processor system and I get X requests per second and I move to a 2 processor, how many requests per second can it handle? If the answer is less than X, that is called negative scaling (and is really bad but you’d be surprised how often it happens :)). My rule of thumb is (for a well designed system) 2 proc should give you about 1.9X performance, 4 proc – about 3.5X, 8 Proc – about 6.5X. Of course CPU is only one measure by which you can measure scale – you can measure scale as a function of memory, disk I/O capacity, network bandwidth or any other limited resource in the system – but CPU is what people talk about most of the time.
All that said, I roll scale testing into load testing. We measure load on a variety of different hardware configurations – varying CPUs, memory and disk subsystems. If we see unexpected performance ratios in load testing we investigate.
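As a rough sketch of what checking those ratios could look like, the snippet below compares measured throughput against the rule-of-thumb multipliers mentioned above (1.9X, 3.5X, 6.5X). The 10% tolerance, the function name and the throughput numbers are assumptions for illustration only.

```python
# Rule-of-thumb speedups for a well designed system, relative to 1 processor.
RULE_OF_THUMB = {1: 1.0, 2: 1.9, 4: 3.5, 8: 6.5}


def scaling_report(throughput_by_procs, tolerance=0.10):
    """Classify each configuration's speedup over the 1 processor baseline.

    throughput_by_procs: {processor_count: requests_per_second}; must
    contain an entry for 1 processor. The tolerance is an assumed
    allowance for normal measurement noise.
    """
    baseline = throughput_by_procs[1]
    report = {}
    for procs, rps in sorted(throughput_by_procs.items()):
        speedup = rps / baseline
        expected = RULE_OF_THUMB.get(procs)
        if procs > 1 and speedup < 1.0:
            verdict = "negative scaling"
        elif expected is not None and speedup < expected * (1 - tolerance):
            verdict = "below rule of thumb"
        else:
            verdict = "ok"
        report[procs] = (round(speedup, 2), verdict)
    return report


# Hypothetical measurements in requests per second.
measured = {1: 200.0, 2: 370.0, 4: 560.0, 8: 180.0}
for procs, (speedup, verdict) in scaling_report(measured).items():
    print(procs, "proc:", speedup, "x", verdict)
```

In this invented data set the 2 proc box is fine, the 4 proc box is short of what you would hope for, and the 8 proc box is slower than the 1 proc box, which is exactly the kind of result that should trigger an investigation.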
- How much data can my system support and still provide acceptable performance?
I refer to this as “limit testing”. With limit testing, we build tools to load data into the system (e.g. #’s of work items, #’s of checkins, #’s of files, etc). We record the time it takes to add them as the count grows and we graph it. In an ideal world, the graph is flat – it takes the same amount of time to add the 1,000,000th work item as it did to add the 1st – but there is always some slowdown. You want a graph that has a very low slope, no discontinuities (you don’t want it to jump from 10ms per work item at 1000 work items to 30ms per work item at 1001) and low deviation (you don’t want the average time to be 10ms but one out of every thousand takes 10 seconds). Once we “fill up” the database we test the overall system performance by running the “performance” tests described above to ensure that the system behaves reasonably well.
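Here is one way you might check a limit test run for those three properties without eyeballing a graph. This is only a sketch: the batch size, the jump and outlier factors and the function name are assumptions I picked for the example, and the data is synthetic.

```python
from statistics import median


def analyze_insert_times(times_ms, batch_size=500, jump_factor=2.0, outlier_factor=10.0):
    """Summarize per-item insert times from a limit test.

    Returns the overall slope (ms of slowdown per item added), the item
    indexes where a batch median jumps by more than jump_factor over the
    previous batch (discontinuities), and the indexes of individual
    items slower than outlier_factor times the overall median.
    """
    starts = list(range(0, len(times_ms), batch_size))
    medians = [median(times_ms[s:s + batch_size]) for s in starts]

    # Slope between the first and last batch medians.
    if len(medians) > 1:
        slope = (medians[-1] - medians[0]) / (starts[-1] - starts[0])
    else:
        slope = 0.0

    # A sustained step up between consecutive batches is a discontinuity.
    jumps = [starts[i] for i in range(1, len(medians))
             if medians[i] > jump_factor * medians[i - 1]]

    # Individual slow items count as deviation, not as a discontinuity.
    overall = median(times_ms)
    outliers = [i for i, t in enumerate(times_ms) if t > outlier_factor * overall]

    return {"slope_ms_per_item": slope, "jumps_at": jumps, "outliers_at": outliers}


# Synthetic run: 10ms per item, stepping to 30ms after item 1000,
# with a single 10 second insert at item 500.
times = [10.0] * 1000 + [30.0] * 2000
times[500] = 10000.0
print(analyze_insert_times(times))
```

Using batch medians keeps a single slow insert from looking like a step change, so the one 10 second item shows up as an outlier while the move from 10ms to 30ms shows up as a jump.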
A VERY important part of performance testing is setting goals. It’s also poorly understood and there are many opinions of how to do it – this being my blog, I’ll give you mine :).
Performance testing goals – The first thing is to identify your scenarios. What operations are people going to do the most? What are they going to be most sensitive about? What is most likely to be an issue? We generally prioritize our performance scenarios (pri 1, 2, 3) and automate them in that order. To give you some context, for TFS we have about 80 performance scenarios and we add a few more with each release. You can see the list below in the performance report.
Next, you need to define your goals for the scenarios. This is more an art than a science.
I start by asking what is the closest analogy to what the user does today? This will define a baseline for your user’s expectations. When I say this, I frequently hear people say – but we don’t have anything to compare to. That’s almost never true. You can compare to the competition. Or you can compare to the user’s alternative – for really new scenarios. For example, imagine you were building the first workflow tool and you were developing performance goals for notifications. Perhaps your users’ alternative is to send email. How long does that take?
Secondly I just use “common sense”. My rule is anything under about half a second is noise unless you do it a lot in sequence. Anything over 5-10 seconds is a “long running operation” and the user is going to lose attention and start to think about something else. Anything over 15-20 seconds and the user is going to think the system is hung (unless you are providing visible progress indication). These are round numbers. Mostly I just use my judgement based on what I think I would expect for the operation.
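If it helps to see those round numbers written down, here is a small sketch. I have taken the low end of each range (5 and 15 seconds) as the cutoff, and the label “noticeable” for the middle band is just my placeholder; treat all of it as an assumption rather than a rule.

```python
def classify_response(seconds, shows_progress=False):
    """Bucket a user-perceived response time using the round numbers above."""
    if seconds < 0.5:
        return "noise"          # only matters if done many times in sequence
    if seconds <= 5.0:
        return "noticeable"     # placeholder label for the middle band
    if seconds > 15.0 and not shows_progress:
        return "appears hung"   # user assumes the system has locked up
    return "long-running"       # user loses attention and context-switches


for t in (0.2, 2.0, 8.0, 20.0):
    print(t, classify_response(t))
print(20.0, classify_response(20.0, shows_progress=True))
```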
Lastly, be prepared to change it. When you start “beta” testing the system and users tell you something feels slow, go back and revisit your goals.
Load testing goals – There are two parts here: 1) identifying the response time threshold that you consider to be the limit of acceptable performance. For this I use the same rules as for performance testing. 2) determining the # of users you want a given hardware config to support. I look at typical team sizes, breakpoints in hardware costs (traditionally, big breaks between 2 proc, 4 proc and 8 proc) and make my best guess about hardware and operations costs customers will find acceptable for different user populations. I define a spectrum of canonical hardware configurations and assign the goals. If you watched what we did in TFS 2005, you’ll see I did a pretty bad job :). Our initial goals were 500 users for a modest 4 proc system. By the time we got to the point we could really measure the system in its full glory, we found we could actually support about 2,000 users on that system – so we changed our goals :).
Limit testing goals – Many years of experience with this have taught me to do your best to estimate how much data you can imagine your users will create and multiply it by 10. If you are lucky that will cover you, but even that may not. People will use your system in totally unexpected ways. Underestimating this is the most common mistake in setting goals for performance testing (and we made a few ourselves – like underestimating the #s of Team Projects people would create).
Performance tests – We generally run “performance” tests on every single build. We compare results to the last build, the goal and to the best ever. Any significant delta from any of the above results in a bug and an investigation. Why we compare to the goal is obvious. We compare to the last build to watch for regressions. However, there’s always some variation (10% variation is totally normal) so we ignore small build to build variations or we get flooded with bugs. This is why we compare to best ever – this helps us identify long term regressions. Occasionally we’ll reset “best ever” if we accept a new result as a new baseline.
To help reduce variation, we run all performance tests 3 times in every run and average them. Even so, 10% variation is common. We also run all performance tests on “pristine” hardware on an isolated lab network to reduce as much variation as we can.
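The comparison logic is simple enough to sketch. The snippet below is not our lab tooling; the function name and numbers are invented, it only flags slowdowns (not unexpected speedups), and applying the 10% tolerance to the last-build and best-ever comparisons but not to the goal is my assumption for the example.

```python
from statistics import mean


def check_scenario(runs, last_build, goal, best_ever, tolerance=0.10):
    """Average the runs for one scenario and compare against three baselines.

    All times are in seconds. Returns the averaged time and a list of
    findings that would each warrant a bug and an investigation.
    """
    current = mean(runs)
    findings = []
    if current > goal:
        findings.append("misses goal")
    if current > last_build * (1 + tolerance):
        findings.append("regressed vs last build")
    if current > best_ever * (1 + tolerance):
        findings.append("regressed vs best ever")
    return current, findings


# Hypothetical scenario: three runs of the same test on today's build.
current, findings = check_scenario(
    runs=[2.1, 2.2, 2.0], last_build=2.0, goal=2.5, best_ever=1.8
)
print(round(current, 2), findings)
```

In this example the build is within 10% of the last build and still meets its goal, so a build-to-build check alone would stay quiet. It is only the best-ever comparison that catches the slow drift from 1.8 to 2.1 seconds.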
Load tests – Load tests take a pretty long time to set up, run and analyze. We tend to run these later in the cycle when the code starts to stabilize. Generally, I’d say about once a week or so is adequate coverage.
Limit tests – We generally only run these a couple of times per product cycle. Some of the individual tests can take a week or more to run due to the amount of data population. The goals help define architectural design guidelines and the testing helps verify compliance.
Here’s a performance trend report:
Here’s an example of a daily performance report.
The one thing I feel is missing from our performance reporting is trending on improvement. Orcas is dramatically faster than TFS2005 but you wouldn’t know it because you’ll ultimately find that they both meet their goals. What you won’t see is that the goals changed and the results are way better (except by looking at the detailed reports that have way too much data). We’re working on a good way to visualize actual performance over time in addition to trend towards meeting goals.
I will work on getting some Load testing and Limit Testing reports posted in a future post. This one is getting pretty long and we haven’t kicked those fully into gear for Orcas yet.
We do “performance” testing on a variety of hardware configs – including low memory, low bandwidth, etc to see how end user performance changes based on client hardware.
A lot of our perf testing (and many of the results I’ve posted on my blog in the past few months) comes from what I’d call dev “perf unit testing”. This means – not in the perf lab with official perf tests, but rather devs picking a set of scenarios they want to improve and creating and measuring specific tests for those scenarios. Sometimes those tests will get turned into “official” perf lab scenarios, but sometimes not.
As you can tell this is a HUGE topic – worthy of an entire book. I’ve probably written far too much already but hopefully it’s been at least a little useful. Here are the links to my previous posts that I promised.
How many users will your Team Foundation Server support? – Contains a good description of how to think about load testing.
Capacity testing for TFS – Has some thoughts on “limit” or “capacity” testing.
“Concurrent” users – A diatribe on the notion of “concurrent users”.
Orcas Version Control Performance Improvements – Dev “perf unit testing” on Orcas version control performance.
Our First Orcas perf results at load – Some Orcas approximate load test results derived from stress testing (the topic of a future post)
Orcas Work Item Tracking Performance Improvements – Load test results on Orcas work item improvements.
Team Foundation Server Capacity Planning – TFS 2005 server sizing recommendations based on our load testing results.
I apologize for the length – it kind of got out of hand.