Managing Quality (part 5) – Dr. Watson
We receive a lot of quality feedback directly from customers, and we use that information to help direct our investments in bug fixing and testing. One of the tools we use to collect customer feedback is called “Dr. Watson”. All of our applications are instrumented to observe any unexpected condition (crash, exception, internal error, etc.), capture useful information (stack trace, environmental info, etc.), and upload that information (under your control) to a web site at Microsoft. We, the product teams, have access to that information to observe problems that users are encountering and to diagnose and fix them.
Dr. Watson events are put in “buckets” automatically by our web site infrastructure. The “bucketing” process is designed to identify problems that appear to be the same so that we can identify which problems are the most common and invest our effort in fixing the most important ones first.
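To make the bucketing idea concrete, here is a minimal sketch in Python. It is not how the Watson infrastructure actually does it; the event fields and the choice of signature (`WatsonEvent`, `bucket_key`, `bucket_events`) are hypothetical names made up for illustration. The only point is that events sharing a signature land in the same bucket, and buckets can then be ranked by hit count.

```python
from collections import Counter
from typing import NamedTuple


class WatsonEvent(NamedTuple):
    # Hypothetical fields; the real report schema is not described in this post.
    app: str
    app_version: str
    module: str
    exception_code: str
    offset: int


def bucket_key(event: WatsonEvent) -> tuple:
    """Assumed signature: events that agree on all fields look like the same problem."""
    return (event.app, event.app_version, event.module,
            event.exception_code, event.offset)


def bucket_events(events) -> Counter:
    """Group events by signature and count the hits per bucket."""
    return Counter(bucket_key(e) for e in events)


# Made-up sample events, not real Watson data.
events = [
    WatsonEvent("app.exe", "1.0", "core.dll", "c0000005", 0x1A2B),
    WatsonEvent("app.exe", "1.0", "core.dll", "c0000005", 0x1A2B),
    WatsonEvent("app.exe", "1.0", "ui.dll", "e0434f4d", 0x0042),
]

buckets = bucket_events(events)
for key, hits in buckets.most_common():  # most frequent bucket first
    print(hits, key)
```

Sorting with `most_common()` is what lets you work the list from the top down.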
Goals for fixing
Every release we (Developer Division) have a division-wide goal of fixing 50% of all reported buckets. Before you go and say, “Hey, why are you only trying to fix half of them?”, let me explain a few things.
First, that’s 50% of buckets, not 50% of events. If you graph # of occurrences by bucket, you get a graph that looks like this (X-axis is buckets sorted by frequency of occurrence):
As you can see, a small number of buckets represents the vast majority of occurrences. By fixing the most frequent 50% of the buckets, we probably fix better than 99% of the occurrences.
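The arithmetic behind that claim can be sketched in a few lines of Python. The hit counts below are synthetic, generated from an assumed long-tail shape (hits falling off with the square of the rank), not taken from real Watson data, and `coverage_of_top_buckets` is a hypothetical helper. The share you actually get depends entirely on how steep the real curve is.

```python
def coverage_of_top_buckets(hit_counts, fraction):
    """Share of all occurrences covered by fixing the top `fraction` of buckets."""
    ranked = sorted(hit_counts, reverse=True)   # most frequent buckets first
    n_fixed = int(len(ranked) * fraction)       # number of buckets we fix
    total = sum(ranked)
    return sum(ranked[:n_fixed]) / total if total else 0.0


# Synthetic long-tail data (an assumption): 1000 buckets, where the bucket
# at rank r gets about 1,000,000 / r^2 hits, with a floor of 1 hit.
synthetic_hits = [max(1, 1_000_000 // rank**2) for rank in range(1, 1001)]

share = coverage_of_top_buckets(synthetic_hits, 0.5)
print(f"Fixing the top 50% of buckets covers {share:.2%} of occurrences")
```

With a flatter tail the covered share drops, so the 50% goal only makes sense alongside a look at the actual distribution.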
The second reason it’s 50% is that these issues are notoriously hard to track down. We try to capture all of the information we need to track down issues in the Watson report, but anyone who has tried to do post-mortem debugging with only a small subset of the process state will understand that it’s a difficult prospect.
And third, some Watson reports turn out to be overactive reporting. In other words, we captured an “event” that turns out to be reasonably normal system behavior. In some cases we’ll change the logging to avoid capturing it, and in some cases it’s infrequent enough to not be worth the bother.
To help keep an eye on this, we monitor the outcome of all Watson bugs to make sure we are being successful in following through on them. Here’s a table of outcome analysis for Watson bugs:
In my experience, it takes devs a while to learn to really leverage the limited information available in a Watson report to figure out how to diagnose a problem and fix it. I particularly watch “no repro” for this effect. Too many of these means you either need to improve your instrumentation or focus on learning how to better leverage what you have. I also watch “Won’t fix” and “By Design” – this generally means your infrastructure is catching “false positive” events and too many means you are wasting your time investigating bogus issues. Again, you probably need to go back to your capture infrastructure and see how to filter events better. As you can see (as of this report) we were fixing 64% of Watson reported bugs – well above our 50% goal.
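If you want to watch the same ratios for your own bugs, the calculation is simple. Here is a rough Python sketch; the resolution list is made up for illustration (it is not the data from the table above), and `outcome_shares` is a hypothetical helper. The resolution names match the ones discussed above, and 50% is the fix goal stated earlier.

```python
from collections import Counter


def outcome_shares(resolutions):
    """Fraction of resolved bugs that ended in each outcome."""
    counts = Counter(resolutions)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()} if total else {}


# Made-up resolutions for ten bugs, purely illustrative.
sample = ["Fixed"] * 6 + ["No Repro"] * 2 + ["Won't Fix", "By Design"]

shares = outcome_shares(sample)
fix_rate = shares.get("Fixed", 0.0)
no_repro_rate = shares.get("No Repro", 0.0)
# "Won't Fix" plus "By Design" approximates the false-positive share.
false_positive_rate = shares.get("Won't Fix", 0.0) + shares.get("By Design", 0.0)
meets_goal = fix_rate >= 0.50  # the 50% fix goal

print(f"fixed: {fix_rate:.0%}, no repro: {no_repro_rate:.0%}, "
      f"false positives: {false_positive_rate:.0%}, meets goal: {meets_goal}")
```

What counts as “too many” no-repro or false-positive bugs is a judgment call; the sketch only computes the shares so you can track them over time.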
Tracking events in detail
Every week I get a Watson report on the activity for that week. It includes several useful ways of looking at what is going on. These include:
1) A list of preexisting, active Watson bugs and their hit counts. Note the list is small because we resolve them pretty quickly, usually within a week or two.
2) Buckets analyzed for the week.
3) Trend of new Watson bugs per month.
Customer reported bugs are a very valuable way to make sure you are fixing the “right” bugs. Dr. Watson is a great way to collect this information from customers with very little overhead for them. We have other ways as well. In a future posting I’ll talk about how we leverage Connect, forums and other mechanisms to help us manage quality.
If you want to learn more about Dr. Watson, this link contains more info than you could possibly want 🙂