Dogfood Server Upgrade – End of Week 1
The first week of the Orcas Dogfood server upgrade ends today. It’s been a fantastic (if hectic) week. After the initial spate of issues we hit on Monday, things quieted down pretty quickly. We got the significant issues fixed on Tuesday and have been making small performance patches all week. We’re down to the last half dozen or so issues to investigate, and we will finish those up at lower priority over the next few weeks.
Before & After
We now have enough data to start to do some semi-meaningful before and after comparisons. I picked a set of server requests to compare. I chose them based on a few criteria:
- Aggregate cost of the request (over the period of several days) is in the top 10
- The average execution time of the request is in the top 10
- The numbers looked suspicious to me in some way 🙂
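The first two criteria can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling actually used for this analysis: the log format (pairs of request name and duration in milliseconds), the function name `pick_requests`, and the sample request names are all assumptions made up for the example.

```python
from collections import defaultdict

def pick_requests(log, top_n=10):
    """Pick requests that rank in the top N by aggregate cost or by average time.

    `log` is assumed to be an iterable of (request_name, duration_ms) pairs.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for name, duration_ms in log:
        total[name] += duration_ms
        count[name] += 1

    average = {name: total[name] / count[name] for name in total}

    by_total = sorted(total, key=total.get, reverse=True)[:top_n]
    by_average = sorted(average, key=average.get, reverse=True)[:top_n]

    # Union of both lists, keeping first-seen order and dropping duplicates.
    return list(dict.fromkeys(by_total + by_average))

# Made-up sample data: one cheap request called often, one slow request called once.
sample_log = (
    [("CheapButFrequent", 100.0)] * 5
    + [("RareButSlow", 400.0)]
    + [("Minor", 50.0)]
)

picked = pick_requests(sample_log, top_n=1)
```

With `top_n=1`, the frequent request wins on aggregate cost and the slow one wins on average time, so both end up in the comparison set. The third criterion (numbers that look suspicious) stays a manual judgment call.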
Given that, here are some comparative results. These results show the average duration of each request during the month of February compared to the average since the upgrade (with some tinkering to account for patches that we’ve made).
As you can see, there are some very healthy improvements. The most concerning regressions are Upload and Download. We will be investigating those shortly. We have been inclined to disbelieve those results, as we really didn’t change that code in Orcas, but I think we have enough data now to show that something is afoot. We believe ReadIdentityFromSource is suffering from some ActiveDirectory latency issues, but we don’t know for sure yet. ReadIdentity is showing a huge regression as a multiple of its old time but a pretty small one in absolute terms. It’s going to require some poking around to understand.
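The difference between a regression "in multiples" and one in absolute value is easy to see with a little arithmetic. The sketch below uses made-up averages purely for illustration; they are not the measurements from this upgrade, and the helper name `compare` is hypothetical.

```python
def compare(before_ms, after_ms):
    """Return (multiple, absolute_delta_ms) for one request's average duration."""
    return after_ms / before_ms, after_ms - before_ms

# Made-up averages, for illustration only.
fast_request = compare(2.0, 10.0)      # a very fast request that got slower
slow_request = compare(400.0, 600.0)   # an already slow request that got slower

for label, (multiple, delta_ms) in [("fast", fast_request), ("slow", slow_request)]:
    print(f"{label}: {multiple:.1f}x, +{delta_ms:.0f} ms per call")
```

In this example the fast request looks far worse as a multiple (5x versus 1.5x), yet the slow request adds much more time per call (200 ms versus 8 ms). That is why it helps to look at both numbers before deciding which regression to chase first.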
SQL CPU Utilization
We expected to see a substantial reduction in CPU utilization based on the changes we’ve made, but we haven’t. The standard deviation has dropped considerably (with no more large spikes), but the average doesn’t seem to have gone down much. We need a bit more trend data and need to do some investigation. I expect we’ll learn more about this over the next couple of weeks.
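To make the "same average, smaller standard deviation" observation concrete, here is a minimal sketch with invented CPU samples (percent utilization). The numbers are not from the dogfood server; they are chosen only so that both series have the same mean.

```python
from statistics import mean, pstdev

# Invented samples: mostly low utilization with a couple of large spikes.
spiky = [20, 20, 20, 20, 95, 20, 20, 95, 20, 20]

# Invented samples: no spikes, but a consistently higher baseline.
steady = [35, 33, 37, 35, 34, 36, 35, 35, 34, 36]

for label, samples in [("spiky", spiky), ("steady", steady)]:
    print(f"{label}: mean={mean(samples):.1f}%  stddev={pstdev(samples):.1f}")
```

Both series average out to the same utilization, but the second has a much smaller standard deviation. A pattern like this would be consistent with spikes disappearing while the baseline load stays put or rises, though it takes more trend data to say whether that is what is happening here.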
We’re going to start our detailed I/O analysis in the next couple of days (now that most of the biggest perf issues have been investigated and addressed). I’ll share that with you next week. However, I’ve taken a preliminary look at the I/O perf counters on the data tier, and the results are interesting. I’m seeing a dramatic drop in reads on both the data drive and the TempDB drive (2X or more). However, I’m seeing increases in writes to both. The increases in writes to the data drive are small, and the increases for the TempDB drive are modest. I think we’ll know a lot more after the detailed analysis.
Overall, things are going really well and I’m psyched about it. It’s been a lot of fun the last few days hammering out all of the issues that are hard to find outside a high-scale production environment. I’m planning on producing my March dogfood statistics next Tuesday or Wednesday, so keep your eyes open for that.
Until next time,