Endgame (Matt Gertz)
Well, I’m back from vacation, pleasantly tired and yet relaxed at the same time. I’m busy trying to dig out of e-mail, things are going well, Beta2 has been in your hands for several weeks, and I’ve got lots of great ideas to try out on this site.
(That’s the theory, anyway. In reality, I’m writing this post while waiting to start my vacation, having some extra time now that I’ve got everything squared away at the office, and plan on posting it when I get back. Hopefully, everything that I said in the first paragraph above is true… :-) )
I haven’t blogged at all about what I actually do at Microsoft, mostly because I’ve felt that it would be a boring read, but thought this might be an opportune time to write about it a bit. Although I started out at Microsoft as a developer, I transitioned to management about seven years ago, and since then have gradually contributed less and less to the actual coding. (I do have a couple of bug fixes in the version we’re currently developing, but that’s about the extent of it.) Correspondingly, I’ve become more and more involved with how we get the product out the door. That job reaches its peak in the endgame, which is close to where we are now, and so I thought I’d go into that in this post. (There’s no code in this blog post — if the esoteric details of shipping a product bore you, you might want to skip this one.)
We’re at a point now where all of the feature work is complete (even the features we added based on your feedback from Beta1 and Beta2) and where the various product units (PUs) are hitting “ZBB” – that is, are at zero bugs older than 48 hours. At that point, the backlog of bugs is gone, and the developers are fixing the bugs as fast as the testers can find them. Everyone is anxious to get the final product into customer hands. So, when do you stop fixing bugs and ship the product?
That’s a funny question, isn’t it? All bugs should be fixed, right? Alas, it’s an unfortunate fact that no large-scale product developed by any company can ship totally bug-free; the applications are very complex, the range of platforms that they run on is very broad, and the state of the machines that they run on can vary wildly, making it impossible to test every single feature in every possible situation. The goal is therefore to use resources to ship a product with no defects that any customer would care about.
For example, imagine a memory leak in an application that wasted two bytes of memory for every document that you opened. (I hasten to add that we have tools that check for memory leaks when testing a code scenario in VS, so this example is slightly contrived.) If you kept the application running for a year without shutting it down, and if you opened two documents a day, then by the end of a year the application would have leaked slightly more than a kilobyte of memory. Obviously, this wouldn’t even be noticed by the customer, even if they could manage to keep from closing the app for a year.
As a developer, you’d still want to fix the bug when you discovered it, no matter how “trivial” it was – it’s the right thing to do, yes? But now let’s add a twist to the situation. Let’s say that you’re partnered with Company XYZ. The folks at XYZ have been building tools which leverage the new features that you’ve added to your application. If you don’t ship on time, then their product won’t ship on time (or, even if it does, it won’t matter, because no one can try it out without your product also being available), and they will be losing money. If you delay the product in order to fix the 2-byte leak, they will most decidedly not be happy with you.
But let’s say you fix it anyway, you test it thoroughly, and you check it in. The next day, when you get the build from the lab, it turns out that you didn’t test it thoroughly enough – your fix actually caused a regression in an area that you didn’t even realize would be impacted by that code. Now, you’re stuck. Do you back out the change that you made, or do you press on and try to fix that new problem? If you fix the new problem, how can you guarantee that you won’t break something else?
Situations like this are why products at Microsoft (and at other companies as well) go through a triage process called “Shiproom” or “War Room” as they get closer to shipping. The goal of Shiproom is to balance cost and risk against quality as we get closer to the target ship date. In our case, Shiproom is made up of representatives from each product unit (Visual Basic, Visual C#, Visual C++, Team Developer, etc.) and is guided by a central core of people (the “ROL,” or Release Orchestration Leadership) who have experience shipping a product. This is my third tour of Shiproom duty (VS2003, VS2005, VS2008) and I’m on the ROL team for this product cycle.
The process works like this: as we get close to the end, about two months from shipping, the PUs gradually start to “raise the bar” on bugs, discarding bugs which no customers will ever encounter in a realistic scenario and which cause no damage (e.g., the bug is a harmless one-second screen flicker on a second monitor which only repros when pressing Ctrl-Alt-Shift-Q on February the 29th of a leap-year). This is because fixing such a bug involves more time and more risk to the product than the actual bug deserves. The bar “goes up” (i.e., is made more restrictive) weekly, until we reach a period called “ASK mode.” At that point, the PUs need to ask the division’s permission to fix a given bug. The bar at that point is focused on making sure that all scenarios work and that there is no data loss occurring from any bug, that all legal obligations are met (legal documents, compliance, etc.), and that there are no security issues. We also make sure that there’s still time at this point to react to unforeseen problems that a bug fix might inadvertently cause. Test teams are madly beating up on the code at this point, looking for any bugs bad enough to keep us from shipping. The goal here is to make sure that every fix improves the product (as opposed to making it worse).
When bugs are found which might meet the bar (i.e., impact customers), the PU fixes them, tests them, and then brings them to Shiproom. The PU’s representative at Shiproom explains the customer scenario, describes the fix, and details the testing done on the fix to best ensure that it does not adversely affect the rest of the codebase. The ROL team (and any others in the room) will then question the representative to make sure that all contingencies are covered and to verify that this fix is indeed worth the risk to the product. If it is, then the PU is allowed to submit the fix into the build; otherwise, the PU has the option of re-working the fix and trying again.
Finally, the day arrives where we generate what we hope is the final build of the product and send it out to partners so that they can verify their products as well. We move into “Escrow,” where any fix we take would cause another rebuild of the product, thereby causing a slip to the ship date. Even here, though, it’s still about quality, not the ship date – we would absolutely slip the product’s ship date before releasing it in a state where it would cause damage to a customer. It’s worth noting that the first words out of the PU representative’s mouth when presenting any bug in Shiproom are “The customer scenario is…”
So, we’re getting close to entering that endgame now. Helping to prepare for that endgame takes up about half of my day, the second half being dedicated to managing the VB Dev team, thinking about the next version of the product, and learning new things, and the third half to writing blog posts :-).
I’m incredibly excited about this product. I know LINQ will just blow everyone away, and the IntelliSense changes are so cool that I can hardly stand to go back to VS2005 to type up applications for this blog. Our powerful new data designers make it easy for even a non-data-savvy guy like me to crank out great data apps, while leveraging the incredibly useful WCF tools to work with web services.
You are going to *love* this product.
While I was editing this post preparatory to posting it today, I was thinking about something that happened to me on the vacation I just returned from. My family and I were staying at a hotel outside of Baltimore, for which we had two nights of reservations. The first night was great — the staff went out of their way to make sure that our room was all set up with an extra rollaway bed for my kids, the pool was nice, etc. The next day, we returned from visiting the city and arrived in our bedroom to find out that all the sheets were gone from the beds. The cleaning service had removed the old sheets but forgotten to put on new ones. We called down to the desk to explain the situation, and they said they’d take care of it right away. So, we went down to the pool to stay out of their way.
An hour later, we went back up, it being well past my kids’ bed time — and the beds were still missing sheets. We called the front desk again. “Oh, we’ve been trying to call you. The staff went home early today and locked the room with the sheets in it, and we don’t have a key. We’re going to have to move you to another room.”
“Well, that doesn’t make much sense,” we replied. “We have 5 large suitcases and 5 small bags, all of which are mostly unpacked — it’d take us an hour to be able to move to the new room and it’s our kids’ bedtime now — we’ve got a big day tomorrow. It would make more sense to just retrieve the sheets from that other room and bring them to our room.”
“Oh, no, we couldn’t do that. That would leave us with a room that had no sheets, making it marked as a “dirty” room, and we might get a customer who needed that room.”
“But if we do it your way, the room we’d be vacating would have no sheets either. You’d still have the same number of rooms available either way.”
The situation devolved into madness after that point. Since the hotel refused to call in someone with a key to the laundry room at the oh-so-late hour of 8PM, the only alternative was to sleep without sheets and pillows, which wasn’t really feasible. After much yelling to no avail, phoning the lead manager, etc, we finally agreed to the move, only to learn that they couldn’t give us the keys to the new room until we’d vacated the old room — otherwise, their computer would charge us for two rooms. So, after more yelling, my wife ended up having to guard our luggage in the hallway next to the new room (which was on a different floor) while I went and did the paperwork to get us transferred — firmly vowing never, ever to use that hotel chain again. (I won’t print their name here, since it would be unfair to Microsoft for me to involve them legally by using the company blog as a vehicle for my venting, but I will state that it’s a well-known chain of hotels that should have known better.) And, when all was said and done, we never got so much as an “I’m sorry this happened” from anyone connected with the hotel — in fact, the morning staff very nearly charged me for two rooms for that final night anyway.
So why am I telling you about this? Well, because that’s exactly the way I *don’t* want our VB customer service to be. I don’t want my customers to have to drop everything they’re doing in order to work around some lack of process on our part. “We can’t help you with the bug you encountered — our developers are all on vacation — either work around it or upgrade to something else” would be a feeble excuse indeed. Sure, sometimes things go wrong, but you’ve got to be prepared to do the right thing by your customer to get them out of that trouble. The customer has always got to come first, and that’s why I get such a kick out of being part of the shiproom process — it’s all about doing the right thing. And if we’re *not* doing the right thing, then, please let us know!