February 11th, 2007

Managing Quality (part 4) – Stress Testing

Brian Harry
Corporate Vice President

The goal of our stress testing is to run an application under load for an extended period of time and capture all “failures”.  The purpose is to uncover race conditions, long-term resource leaks, and bugs that only occur as the result of unexpected sequences or combinations of operations.  Mostly we focus on server stress testing, but some teams do some client stress testing (using automated GUI tests) to find similar problems in client logic.

People use the term “Stress Testing” in different ways.  The Windows team uses the term to mean running systems at resource exhaustion (memory, disk, etc.) and making sure the system handles it properly.  There’s a 90% overlap with what we do, but a slight variation in purpose.

Considerations for Stress Testing

Tests – Like Load testing (described in my last post in this series), we use the VSTS for Testers product to simulate many users executing randomly selected tests.  In fact, we use exactly the same set of tests in Stress testing as we do in Load testing.  But while in Load testing you choose the distribution of tests to simulate the “real world” mix as closely as you can, in Stress testing you artificially inflate the frequency of rare and disruptive tests.  As a result, our Stress testing mix is different from our Load testing mix.
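
To make the idea concrete, here’s a minimal sketch of weighted test selection.  The test names and weights are invented for illustration – they aren’t from our actual mix:

```python
import random

# Hypothetical stress mix: the weights deliberately inflate rare,
# disruptive operations relative to a realistic load-test mix.
STRESS_MIX = {
    "create_work_item": 30,
    "run_query":        30,
    "checkin":          20,
    "delete_project":   10,  # rare in the real world, inflated for stress
    "security_change":  10,  # rare in the real world, inflated for stress
}

def pick_next_test(mix):
    """Pick the next test at random, weighted by the mix."""
    tests = list(mix)
    weights = [mix[t] for t in tests]
    return random.choices(tests, weights=weights, k=1)[0]

# Each simulated user just loops, drawing the next test from the mix.
print(pick_next_test(STRESS_MIX))
```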

Measurements – The primary things you are looking for with Stress testing are systemic failures.  It’s not functional testing – you aren’t trying to verify the results of the tests.  You don’t even care if the data was correct.  I see people get very confused on this point.  You have to remember that every kind of testing you do has a purpose; no single kind of testing tests everything.  Let functional testing do what it does (determine the correctness of operations) and let stress testing do what it does (find rarely occurring “catastrophic” problems, leaks, etc.).

In stress testing, you monitor:

  • Responses to look for exceptions, deadlocks, or other indications of catastrophic failures.
  • Tests per second – both to convince yourself that the system is performing a reasonable workload (if it isn’t running very many tests, you probably won’t get much good data) and to watch the graph for unexpected shapes: a reduction over time indicates some kind of leak, while extreme spikes or drop-outs imply resource contention or an undesirable interaction between the tests (a trend-detection sketch follows this list).
  • Key performance counters – memory, CPU utilization, etc. – again looking for long-term trends.
  • The event log to look for indications that the system under test is experiencing problems you can’t observe from the responses.
  • Sometimes we run our stress tests with the debugger attached so that we can break immediately when an exception happens and have a better shot at determining the cause.
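
Here’s that trend-detection sketch – a crude way to flag a steady decline in tests per second.  The samples and the threshold are invented:

```python
def slope(samples):
    """Least-squares slope over equally spaced samples – a crude trend estimate."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0

# Hypothetical tests/sec samples, taken once a minute during a run.
tests_per_sec = [42.1, 41.8, 41.5, 40.9, 40.2, 39.6, 38.8, 38.1]

# A sustained negative slope over hours suggests a leak; a sudden
# drop-out suggests contention or a deadlock.  The threshold is arbitrary.
if slope(tests_per_sec) < -0.1:
    print("WARNING: tests/sec is trending down - possible resource leak")
```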

Failure resolution – One thing to look out for with stress testing is bug handling.  There is a strong tendency to resolve bugs as “no repro” because they don’t actually reproduce at will.  You don’t want to let go of an occurrence until you are certain you’ve exhausted every possibility to identify the cause.  If you see a pattern of stress bugs being resolved “no repro”, then you have a problem.  I’ve generally found that problem to be one of two things – developers not understanding the importance of pursuing the cause relentlessly, or not enough instrumentation to help identify it.  Attaching a debugger before the run starts can sometimes help.  Sometimes, when we find stress bugs we can’t isolate, we add specific instrumentation to the code aimed at identifying the cause, and run with that instrumentation in future runs.
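
As an example of the kind of targeted instrumentation I mean, here’s a sketch of a cheap in-memory trace that only gets written out when a failure actually occurs.  The helper names are mine, invented for illustration:

```python
import time
from collections import deque

# Keep only the last N events in memory - cheap enough to leave enabled
# for a very long run, and dumped only when something actually fails.
_trace = deque(maxlen=10000)

def trace(event, **details):
    _trace.append((time.time(), event, details))

def dump_trace(failure):
    """Write out the recent history that led up to a stress failure."""
    with open("stress_trace.log", "w") as f:
        f.write(f"failure: {failure!r}\n")
        for ts, event, details in _trace:
            f.write(f"{ts:.3f} {event} {details}\n")

# Sprinkled through the suspect code path:
trace("lock_acquired", name="workitem_cache")
try:
    pass  # ... the operation under suspicion ...
except Exception as e:
    dump_trace(e)
    raise
```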

Load profile – We pick a load (# of simulated users) that we know runs the server at about 70% utilization and run a constant load profile at that level.  The goal is not to drive the system under test to saturation, but rather just to keep it very busy the whole time.  Unlike in Load testing, we don’t ramp the load up over time, because we are not measuring increasing response times – in Stress testing, we don’t care about response times.
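
If you need a starting point for that user count, a back-of-the-envelope calibration works.  This sketch assumes utilization scales roughly linearly with users below saturation, and the numbers are invented:

```python
# Calibrate the constant load level from a short trial run.
trial_users = 100
measured_utilization = 0.35  # hypothetical: 100 users drove ~35% CPU
target_utilization = 0.70

users_for_target = round(trial_users * target_utilization / measured_utilization)
print(f"run the constant load profile at ~{users_for_target} users")
# -> ~200 users; verify with another trial, since scaling is rarely perfectly linear
```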

Frequency/Duration – We run “short haul” and “long haul” stress testing.  A short haul run is an 8 hour run that starts at night and completes by morning – ready for analysis.  We generally run short haul runs on every daily build.  A long haul run is 120 hours (5 days).  We generally run long haul runs on builds that have good short haul pass rates.  The reason for both is that you need the quick cycle time – fix bugs, pick up a new build – that you get with short haul runs, but certain kinds of problems (particularly resource leaks) don’t always show up in an 8 hour run, and 120 hours helps.  In the past we’ve done some math to equate how much simulated calendar time a long haul run (120 hours at high load) represents for a “typical” team.  With a reasonable set of assumptions, it generally comes out to be months, but I don’t obsess on that question.  We’ve just determined that 120 hours works out to be a good number and more than that doesn’t help much.  When I worked on the .NET Framework team, we experimented with “ultra long haul” testing – running for 5 weeks – but it didn’t yield much new information.
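
The calendar-time math goes roughly like this – every rate below is an illustrative assumption, not our real data:

```python
# Rough "simulated calendar time" estimate for a long haul run.
long_haul_hours = 120
stress_tests_per_hour = 20_000    # assumed harness throughput at ~70% load
team_operations_per_hour = 400    # assumed rate for a "typical" team

simulated_hours = long_haul_hours * stress_tests_per_hour / team_operations_per_hour
print(f"~{simulated_hours / (24 * 30):.1f} months of simulated usage")
# With these assumptions: 120 * 20000 / 400 = 6000 hours, or about 8.3 months
```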

Development cycle – We generally start short haul stress testing early in the cycle.  In my opinion, the earlier the better.  The kinds of problems found in stress testing are hard to isolate and debug, and the closer to their introduction you find them, the better off you are.  We generally don’t start long haul testing until around our final Beta.  There’s no point doing it until short haul is passing at a very high rate, because it’s just going to find the same problems.  And because of the cycle time, anything you can find using short haul is better – you get to try fixes much faster.

Execution infrastructure – You don’t need particularly high end hardware.  You just need something that can run your tests and achieve a solid load.  Whereas in Load testing we run the system under test “pristinely”, using separate client machines for the load agents, in Stress testing it’s different.  We generally run the load agent on the same physical machine as the code under test.  You don’t care if the load on the server is affected by the test infrastructure, and combining things allows you to conserve hardware and run more instances.  Remember – let each kind of testing do what it’s designed to do.

For short haul testing, we generally have 10-15 “pods” (collections of machines) that run stress tests.  For long haul testing, we generally use 3-5.  There is a mix of topologies among the pods – single server/dual server, domain/workgroup, etc.  I’ll be talking about test matrices in a future post in this series.  You want to make sure you have plenty of redundancy in your execution infrastructure.  There are a few reasons:

  • The bugs you are going after are rare and you want to clock lots of simulated hours to make sure you are finding as much as you can.
  • More common failures tend to hide less common ones – so it’s not unusual for several runs to end in the same failure.  By having many machines, you have a better chance of hitting some of the less common ones.
  • It’s not uncommon to have to take a pod out of rotation for a day or more for detailed investigation of a failure – you don’t want to stop all of your testing when this happens.

Misc – There is some debate about the value of running debug vs. release builds.  The advantage of debug builds is that you get the benefit of the asserts in your code.  A disadvantage is that the debugging code (asserts, etc.) can affect the timing and cause you to miss failures.  I am a fan of running both debug and release builds – but opinions vary on this topic.
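
Python offers a simplified analogue of this tradeoff: its built-in asserts are stripped when you run with `python -O`, much like release builds strip debug asserts.  The function below is invented for illustration:

```python
def release_lock(lock_count):
    # In a normal ("debug") run this assert fires right at the point of
    # the bug.  Under `python -O` it is stripped - like a release build -
    # and the corruption surfaces much later, if at all.
    assert lock_count > 0, "releasing a lock that was never acquired"
    return lock_count - 1

release_lock(0)  # AssertionError normally; silently returns -1 under -O
```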

Stress Testing Reports

At the highest level, we track the overall stress runs build by build, noting # of runs, # of new bugs found, # of passes and tests per second.  This gives the 10,000 foot view of how you are progressing…
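
As a toy illustration of that roll-up – the fields mirror the columns just described, and the data is made up:

```python
# Roll per-run results up into the build-by-build summary.
runs = [
    {"build": "20070210.1", "passed": True,  "new_bugs": 0, "tests_per_sec": 41.2},
    {"build": "20070210.1", "passed": False, "new_bugs": 2, "tests_per_sec": 38.7},
    {"build": "20070211.1", "passed": True,  "new_bugs": 1, "tests_per_sec": 42.0},
]

summary = {}
for r in runs:
    s = summary.setdefault(r["build"], {"runs": 0, "passes": 0, "new_bugs": 0, "tps": []})
    s["runs"] += 1
    s["passes"] += r["passed"]
    s["new_bugs"] += r["new_bugs"]
    s["tps"].append(r["tests_per_sec"])

for build, s in sorted(summary.items()):
    avg_tps = sum(s["tps"]) / len(s["tps"])
    print(f"{build}: {s['runs']} runs, {s['passes']} passes, "
          f"{s['new_bugs']} new bugs, {avg_tps:.1f} tests/sec")
```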

We have daily reports that include more information on the runs of that day…

And the status of bugs that have recently been found by stress testing…

Of course we also use the reports generated by VSTS for Testers that show Tests/sec, perf counters, etc.  I don’t have one handy at the moment (because we don’t send them around in email and I’m at home right now :)) but I’ve included a generic screen shot from MSDN…

Conclusion

Stress testing is an important part of building any critical application.  It enables you to identify and fix hard-to-reproduce bugs.  There’s a significant investment in infrastructure, process, and training to get going seriously with it, but it pays dividends.  VSTS for Testers is a good tool that can give you a great start (shameless product plug here :)).

I hope this was useful to you.  Until next time…

Brian
