A Problem of Scale

Hello, my name is Ben Anderson and I’m a Software Design Engineer in Test for the Visual C++ libraries team, which is currently responsible for testing ATL, MFC, the C Runtime, ATL Server, OpenMP and several other technologies. As I am currently assigned to QA on some of our new MFC Vista features, I’d like to talk to you today about one of the biggest continuing challenges we face in QA on a product like Visual C++: scale.

To ensure quality in our products we maintain increasingly large suites of automated testcases which verify that the functions and classes we provide behave as expected. For every new feature we produce, we add a significant number of new testcases, which means that over the years our test suites have grown to gargantuan proportions. To give you a concrete example of the scope we’re talking about, the C-Runtime and Standard C++ Libraries suite of tests we maintain for servicing Visual Studio 2005 (printf, cout, STL etc. as Ale describes in his posting below) produces over 9,000 individual results for a single configuration of the CRT. Multiply that by three target processor architectures (x86, x64 and Itanium), two cross-compilers (compilers which run on x86 machines but which produce x64 or Itanium code), six different link options (MT, MTd, MD, MDd, MD with statically linked C++ lib, MDd with static C++), three different runtime targets (native, CLR and CLR:pure), pre-jitted (ngened) and non-prejitted binaries, five generations of OSes (Windows 98, 2000, XP, 2003 Server and Vista) and a myriad of language targets (US English and Japanese being our main checkpoints), all of which must be cross-checked to make sure each option or configuration works with every other. Some of these configurations take in excess of 24 hours to run, and with moving targets in the form of sometimes daily builds of both Visual Studio and Windows Vista, just keeping our lab automation up and running and capable of pushing through test runs is a significant challenge in and of itself.
Further, once we have our results, if even a very small percentage of our tests fail, there is still a large amount of human effort involved in tracking down the causes of the failures, whether they be product issues which must be fixed, bugs in the testcases themselves, or transitory issues such as an incorrectly set up machine or antivirus software colliding with file access in such a way as to prevent a file from being deleted. Finally, developing new testcases which are capable of producing valid results on such a wide variety of platforms is a challenging task: simply running a testcase on all platforms can be time consuming even if there are no problems to fix. And that’s just for the CRT and SCL. ATL, MFC, ATL Server and OpenMP all have their own sets of separate concerns.
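To get a rough feel for the numbers, here is a small Python sketch that multiplies out the dimensions listed above. It is only a naive upper bound under stated assumptions: it treats every dimension as fully crossed with every other (in practice some combinations do not apply), it leaves the two cross-compilers out because how they combine with the other dimensions is not spelled out here, and it counts only the two main language checkpoints. The variable names are illustrative.

```python
from math import prod

# Dimension values as listed in the post. Cross-compilers are deliberately
# omitted: how they combine with the other dimensions is not specified.
dimensions = {
    "architecture": ["x86", "x64", "Itanium"],
    "link_option": ["MT", "MTd", "MD", "MDd", "MD + static C++", "MDd + static C++"],
    "runtime": ["native", "CLR", "CLR:pure"],
    "prejit": ["ngen", "no ngen"],
    "os": ["Windows 98", "Windows 2000", "Windows XP", "Windows 2003 Server", "Windows Vista"],
    "language": ["US English", "Japanese"],
}

# "Over 9,000" results per configuration; 9000 is used as a floor.
RESULTS_PER_CONFIGURATION = 9000

# Naive assumption: every value of every dimension is crossed with all others.
configurations = prod(len(values) for values in dimensions.values())
total_results = configurations * RESULTS_PER_CONFIGURATION

print(f"{configurations} configurations, {total_results} individual results")
```

Even with those simplifications the fully crossed matrix runs into the millions of individual results, which is why running everything everywhere is not a realistic plan.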

There are a number of ways we attempt to make testing on such a scale manageable, and we are always looking for new solutions. To reduce complexity, we try to get as much “cross-coverage” as we can. This means we might vary two or more variables at once in such a way that we can be reasonably confident that we have hit the interesting scenarios and are not leaving holes. For example, if we need to hit two OSes (say Windows XP and Windows 2003 Server) and three runtimes (native, CLR and CLR:Pure), to get a complete matrix we would have to run our testcases 2×3=6 times. However, if we are confident that hitting native, CLR and CLR:Pure once each is what is important and that the differences between XP and 2003 Server are not significant with respect to the CLR, we may choose to do only three runs (XP/native, XP/CLR and 2003/CLR:Pure), reducing the runtime and potentially duplicated analysis effort. Such tradeoffs of coverage against amount of runtime can be effective, but must be very carefully made to avoid missing problems and leaving holes in our coverage.
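The OS/runtime example can be sketched in a few lines of Python. This is an illustration rather than our actual tooling: `full_matrix`, `each_value_once` and `covers_every_value` are hypothetical helper names, and the reduction rule (cycle through the shorter dimensions so that every value appears at least once) is just one simple way to pick runs. It happens to pick a different trio than the one named above, and both satisfy the same "every value hit at least once" check.

```python
from itertools import product

def full_matrix(dimensions):
    """Every combination of every value: the complete test matrix."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]

def each_value_once(dimensions):
    """Reduced plan: as many runs as the largest dimension has values,
    cycling through the shorter dimensions so each value is used at least once."""
    run_count = max(len(values) for values in dimensions.values())
    return [
        {name: values[i % len(values)] for name, values in dimensions.items()}
        for i in range(run_count)
    ]

def covers_every_value(runs, dimensions):
    """True if every value of every dimension appears in at least one run."""
    return all(
        {run[name] for run in runs} == set(values)
        for name, values in dimensions.items()
    )

dimensions = {
    "os": ["XP", "2003 Server"],
    "runtime": ["native", "CLR", "CLR:Pure"],
}

# The three runs named in the text above.
post_selection = [
    {"os": "XP", "runtime": "native"},
    {"os": "XP", "runtime": "CLR"},
    {"os": "2003 Server", "runtime": "CLR:Pure"},
]

reduced = each_value_once(dimensions)
print(len(full_matrix(dimensions)), "runs in the full matrix")
print(len(reduced), "runs in the reduced plan")
print(covers_every_value(post_selection, dimensions))
```

Note that the check only confirms each value is hit once. Whether the skipped pairings (for example 2003 Server with native) are safe to skip is exactly the judgment call described above, and no script can make it for you.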

Another method we use is to employ a “gauntlet” system when checking in new tests, or changes to old ones. What this means is that every time someone wants to check in a change to our test sources, they submit their change to a server which runs the test both before and after the change on a number of machines covering the various supported configurations. If there are more failures after the change than before it, the change is rejected and the results are reported to the submitting engineer to fix. Such a system is very useful in guarding the quality of checked-in tests, but it is also an additional maintenance burden to keep all the machines up and running, to add new supported targets as they become available, and to add new suites of tests to cover new features. It is important to use such a system to keep our test suites robust, as even a small decrease in the quality of our test suites can lead to huge increases in runtime and analysis costs (consider a testcase which hangs or corrupts the state of the machine). There are also a number of interesting ways in which such a system can be expanded to become more useful (such as checking for best practices in testcases, or injecting failure conditions into executables to ensure the harness will report failure properly), but again, adding features adds maintenance burden, which gets you into the same trouble you were in in the first place: handling scale.
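The accept/reject rule at the heart of the gauntlet can be sketched as follows. This is a minimal illustration, not the real server: the function name `gauntlet_verdict`, the data shapes and the configuration labels are all hypothetical, and it assumes the before/after failure counts are compared per configuration rather than in total.

```python
def gauntlet_verdict(before, after):
    """Compare failing tests before and after a change.

    before / after: dict mapping a configuration label to the set of
    test names that failed under that configuration.
    Assumption: the change is rejected if ANY configuration has more
    failures after the change than it had before.
    """
    regressions = {}
    for config, failed_after in after.items():
        failed_before = before.get(config, set())
        if len(failed_after) > len(failed_before):
            # Report the newly failing tests back to the submitting engineer.
            regressions[config] = sorted(failed_after - failed_before)
    status = "rejected" if regressions else "accepted"
    return status, regressions

# Hypothetical results from two gauntlet machines.
before = {"x86/MT/native": {"test_a"}, "x64/MD/CLR": set()}
after = {"x86/MT/native": {"test_a"}, "x64/MD/CLR": {"test_b"}}

status, regressions = gauntlet_verdict(before, after)
print(status, regressions)
```

Comparing against a "before" run, instead of demanding zero failures, is what lets the gauntlet keep working when a configuration already has known failures unrelated to the change being submitted.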

One problem of scale that any long-term project faces is when to cut back, or drop support for features that are no longer in common use, in favor of spending effort on items that provide more bang for the buck. This is an issue that test does not have to face as soon as other disciplines; however, as time goes on, it becomes more important to do so. Up to a certain point, it’s easier (and certainly safer) to continue running every testcase ever developed for your product and, as run times lengthen, simply throw more machines at execution. This does nothing to aid run analysis though, and even though machines are cheaper than new employees, they still cost money, as do the additional people necessary to monitor and fix issues with the machines. At a certain point, it becomes necessary to cut back. While it’s painful to admit, not every lovingly handcrafted testcase we have provides unique, or even useful, coverage of our product. Some testcases even cause more trouble in the form of testcase bugs than they help in the form of catching product bugs, as it’s possible for the quality of a given testcase to be so poor that it requires constant maintenance yet never catches a single product issue! Clearly in such cases it is better to switch off such tests than to run them over and over, causing unnecessary work. The problem is identifying which of our tens of thousands of testcases are causing the problems. Ideally we would like to generate a list of testcases that will never catch a product issue for us in the future. Since we haven’t come up with a way of producing such a list (Microsoft has yet to develop time travel), we have come up with several other metrics to measure the effectiveness of our testcases.
Our very best testcases are those which a) provide unique code coverage (we instrument our code to identify what code in the product is hit during the execution of each individual testcase), b) have never had a bug in the testcase code which we have had to fix and c) have caught product issues in the past (preferably lots of them). We then rank the priority of our testcases based upon which of these categories they fulfill. Since we cannot know that a testcase which has never caught an issue will not find one in the future, we don’t want to completely abandon testcases, but it’s nice to be able to select certain testcases which we know to have been both effective and reliable in the past to do time-sensitive testing or to sniff out potential problems at the beginning of a testpass. Testcases which do not provide unique coverage, have had a lot of bugs in their code that required fixing and have never caught a bug are also candidates for removal upon review.
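One way to picture this ranking is the Python sketch below. It is an illustration under assumptions, not our actual scoring: the record type `CaseHistory`, the testcase names, the tie-break on product bugs caught and the `many_bugs` threshold standing in for "a lot of bugs" are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CaseHistory:
    name: str
    unique_coverage: bool     # a) hits product code no other testcase hits
    testcase_bugs_fixed: int  # b) bugs fixed in the testcase's own code
    product_bugs_caught: int  # c) product issues this testcase has found

def criteria_met(case):
    """Count how many of the three criteria (a, b, c) a testcase fulfills."""
    return sum([
        case.unique_coverage,
        case.testcase_bugs_fixed == 0,
        case.product_bugs_caught > 0,
    ])

def rank(cases):
    """Best testcases first; ties broken by product bugs caught (an assumption)."""
    return sorted(
        cases,
        key=lambda c: (criteria_met(c), c.product_bugs_caught),
        reverse=True,
    )

def removal_candidates(cases, many_bugs=3):
    """Testcases to REVIEW for removal; many_bugs is a hypothetical threshold."""
    return [
        c for c in cases
        if not c.unique_coverage
        and c.testcase_bugs_fixed >= many_bugs
        and c.product_bugs_caught == 0
    ]

# Hypothetical histories for three testcases.
history = [
    CaseHistory("flaky_case", unique_coverage=False, testcase_bugs_fixed=5, product_bugs_caught=0),
    CaseHistory("solid_case", unique_coverage=True, testcase_bugs_fixed=0, product_bugs_caught=4),
    CaseHistory("quiet_case", unique_coverage=True, testcase_bugs_fixed=0, product_bugs_caught=0),
]

print([c.name for c in rank(history)])
print([c.name for c in removal_candidates(history)])
```

The top of the ranked list is what you would pick for a quick, time-sensitive pass. The removal list is only a starting point for a human review, since a testcase that has never caught anything may still catch something tomorrow.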

Finally, in an effort to reduce the most painful bottleneck, analysis of failed results, we have developed tools to aid our test engineers in working through failed testcases and identifying their root causes quickly. One of the most useful features of these tools is the ability to associate product or testcase bugs with a failed result and record which testcases, under which configurations, these bugs are known to affect. This means that once a failure is analyzed, it doesn’t have to be reexamined on every new run of the testcase until the bugs associated with it are marked as resolved. At that point, if the test is still failing, either the bug was not fixed or the testcase is failing because of other issues which must be identified. Our tools also allow test engineers to mark which testcase failures they are investigating to avoid duplication of effort, and even do some basic auto-analysis by parsing failure logs for common messages. Such tools aid our efforts vastly, and we really feel the pain when they are not available for use.
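The bug-association idea can be sketched as a simple sort of each run's failures into three buckets. This is a hypothetical illustration, not our internal tool: the function name `triage`, the bucket names, the bug ids and the data shapes are all made up for the example.

```python
def triage(failures, known_bugs, resolved):
    """Sort a run's failures by what (if anything) is already known about them.

    failures:   iterable of (testcase, configuration) pairs that failed
    known_bugs: dict mapping (testcase, configuration) to a set of bug ids
    resolved:   set of bug ids that have been marked resolved
    """
    buckets = {"known": [], "reinvestigate": [], "new": []}
    for failure in failures:
        bugs = known_bugs.get(failure, set())
        if not bugs:
            # Nobody has analyzed this failure yet.
            buckets["new"].append(failure)
        elif bugs <= resolved:
            # Every associated bug is resolved but the test still fails:
            # either the fix did not work or something else is wrong.
            buckets["reinvestigate"].append(failure)
        else:
            # At least one associated bug is still open; no need to re-analyze.
            buckets["known"].append(failure)
    return buckets

# Hypothetical data for one run.
failures = [
    ("test_a", "x86/MT/native"),
    ("test_b", "x64/MD/CLR"),
    ("test_c", "x86/MDd/CLR:pure"),
]
known_bugs = {
    ("test_a", "x86/MT/native"): {"BUG-1"},
    ("test_b", "x64/MD/CLR"): {"BUG-2"},
}
resolved = {"BUG-2"}

print(triage(failures, known_bugs, resolved))
```

Only the "new" and "reinvestigate" buckets need a person's attention on a given run, which is where the time saving comes from.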

Well, I hope that gives you a good summary of our efforts around dealing with the scale of such a fundamental piece of a developer’s toolbox as the C++ libraries. We are always looking for new ways of dealing with these issues and hope to improve our techniques further in the future. If you have any questions or suggestions, feel free to contact me at benjaman@microsoft.com.

Ben Anderson
Visual C++ Libraries Team
