Calculating server capacity and planning for future user growth
App Dev Manager Omer Amin explores server capacity and growth planning, along with key metrics and tools to consider in your roadmap.
Do I need to add more servers if my user load grows by 10% each month for the next 12 months? That is a hard question to answer. Unless you have an Application Performance Management (APM) solution, it is hard to correlate function calls to CPU time and response times. Are there other ways to answer that question using existing data? What other methods or metrics can we use to model how servers respond under load? One model is to look at CPU, disk, network, and memory consumption and see how they scale as you add load. A benefit of this approach is that it is application and language agnostic. These metrics are readily available through tools like Performance Monitor, which makes them a great starting point.
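To put the opening question in numbers, here is a minimal Python sketch of how 10% monthly growth compounds. The baseline of 1,000 users and the function name `projected_load` are assumptions for illustration only.

```python
def projected_load(current_users: float, monthly_growth: float, months: int) -> float:
    """Compound a fixed monthly growth rate over a number of months."""
    return current_users * (1 + monthly_growth) ** months


# Hypothetical baseline of 1,000 users growing 10% per month.
baseline_users = 1000
for month in (0, 3, 6, 12):
    print(f"month {month:2d}: {projected_load(baseline_users, 0.10, month):.0f} users")
```

Because growth compounds, 10% per month for 12 months is a little more than a threefold increase in load, not the 120% increase that simple addition would suggest.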
Now that we can measure resource consumption, what do we measure? Application load changes constantly throughout the day, week, and month. We can simplify the analysis by focusing on average load and peak load periods. Average load is the load on your system during day-to-day operations. Peak load is the load on your system during specific events that cause spikes in user traffic, and it depends on the type of application workload and the industry. Most business systems have a steady, predictable load, but peak load may occur on Monday morning when all users log in concurrently to start their week and run some workflow. For online retailers, it might be specific shopping days or seasons such as the Christmas holiday season. Systems and applications should be able to handle both average and peak load and respond in a timely manner. Users generally tolerate some performance degradation during peak load events, but applications should degrade gracefully rather than error out, to avoid a poor experience.
It doesn’t make sense to keep a lot of idle infrastructure around just for peak load events, so one strategy is to scale out using technologies like Azure VM Scale Sets, which automatically scale in and out based on pre-defined metrics.
Here are the relevant counters that you can use as a starting point.
- CPU – % Processor Time
- Memory – % Committed Bytes In Use
- Disk – Avg. Disk Queue Length
- Network – ((Bytes Total/sec * 8) / Current Bandwidth) * 100
As you adapt this methodology, you can add or remove counters that you feel correlate more directly with user load for your application.
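The network entry is the only counter in the list that is derived rather than read directly. A minimal Python sketch of that formula follows, assuming the throughput counter reports bytes per second and the bandwidth counter reports bits per second; the function name and sample values are illustrative.

```python
def network_utilization_pct(bytes_total_per_sec: float, current_bandwidth_bps: float) -> float:
    """Return network utilization as a percentage of link capacity.

    bytes_total_per_sec:   throughput in bytes per second
    current_bandwidth_bps: link speed in bits per second
    """
    if current_bandwidth_bps <= 0:
        raise ValueError("current_bandwidth_bps must be positive")
    # Multiply by 8 to convert bytes to bits before dividing by link speed.
    return (bytes_total_per_sec * 8) / current_bandwidth_bps * 100


# Hypothetical sample: 12.5 MB/s on a 1 Gbps interface is about 10% utilization.
print(f"{network_utilization_pct(12_500_000, 1_000_000_000):.1f}%")
```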
To get started, set up data collection for the above counters and collect data for a full day on both a normal day and a peak load day. This gives you an understanding of where your resource consumption stands today, which you can classify using the utilization buckets below.
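Once a day of samples is collected, the two summary statistics used in this article are the average and the 95th percentile. The sketch below uses the nearest-rank percentile method, which is one of several common definitions and is an assumption here; the sample values are hypothetical.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical % Processor Time samples collected over one day.
cpu_samples = [12, 15, 11, 18, 22, 35, 64, 71, 40, 19]
print("average:", mean(cpu_samples))
print("95th percentile:", percentile(cpu_samples, 95))
```

The 95th percentile is less sensitive to a single outlier than the maximum, while still reflecting sustained busy periods that an average would hide.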
|Server CPU Utilization Bucket|Threshold|
|---|---|
|Cool|95th percentile < 20%|
|Warm|95th percentile < 33%|
|Medium|95th percentile < 66%|
|Hot|95th percentile >= 66%|

|Server Memory Utilization Bucket|Threshold|
|---|---|
|Cold|95th percentile < 8%|
|Cool|95th percentile < 19%|
|Warm|95th percentile < 30%|
|Medium|95th percentile < 62%|
|Hot|95th percentile >= 62%|

|Server Network Utilization Bucket|Threshold|
|---|---|
|Cool|Avg > 1% and 95th percentile < 2%|
|Warm|95th percentile < 5%|
|Medium|95th percentile < 10%|
|Hot|95th percentile >= 10%|

|Server Disk Queue Length Utilization Bucket|Threshold|
|---|---|
|Cold|Avg <= 0.15|
|Cool|0.15 < Avg <= 1|
|Warm|1 < Avg <= 3.5|
|Medium|3.5 < Avg <= 9.5|
|Hot|9.5 < Avg|
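The bucket tables above can be applied mechanically once you have a 95th percentile or average value for a server. Here is a minimal Python sketch for the CPU, memory, and disk queue tables; the function and constant names are illustrative, and the network table is left out because its Cool bucket combines two conditions.

```python
# (upper bound, bucket name) pairs taken from the tables above, in ascending order.
CPU_BUCKETS = [(20, "Cool"), (33, "Warm"), (66, "Medium")]
MEMORY_BUCKETS = [(8, "Cold"), (19, "Cool"), (30, "Warm"), (62, "Medium")]
DISK_QUEUE_BUCKETS = [(0.15, "Cold"), (1, "Cool"), (3.5, "Warm"), (9.5, "Medium")]


def classify_p95(p95_pct, buckets):
    """Classify a 95th percentile utilization; upper bounds are exclusive."""
    for upper, name in buckets:
        if p95_pct < upper:
            return name
    return "Hot"


def classify_disk_queue(avg_queue_length):
    """Classify an average disk queue length; upper bounds are inclusive."""
    for upper, name in DISK_QUEUE_BUCKETS:
        if avg_queue_length <= upper:
            return name
    return "Hot"


# Hypothetical measurements for one server.
print(classify_p95(71, CPU_BUCKETS))
print(classify_p95(25, MEMORY_BUCKETS))
print(classify_disk_queue(0.8))
```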
To forecast future needs, plot user load against the resource consumption measurements above. There will most likely be a direct correlation between user load and resource consumption. You can use a linear or exponential extrapolation to estimate how an increase in user load will drive resource consumption. This technique is useful up to about 80% resource consumption; beyond that, behavior depends heavily on the particular application. If your servers are already running that hot, you should be planning to bring resource consumption down to a lower level so that you have capacity for spikes.
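As an illustration of the linear case, the sketch below fits a least-squares line to observed (user load, CPU 95th percentile) pairs, estimates the user count at which CPU would reach the 80% limit, and converts that into months at a given growth rate. All data points and names are hypothetical, and a straight-line fit is an assumption that you should check against your own measurements.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def users_at_threshold(slope, intercept, threshold_pct=80.0):
    """User load at which the fitted line reaches the utilization threshold."""
    if slope <= 0:
        raise ValueError("utilization does not grow with load in this fit")
    return (threshold_pct - intercept) / slope


def months_until(current_users, target_users, monthly_growth):
    """Whole months of compound growth needed to reach target_users."""
    if target_users <= current_users:
        return 0
    return math.ceil(math.log(target_users / current_users) / math.log(1 + monthly_growth))


# Hypothetical observations: concurrent users vs. CPU 95th percentile (%).
users = [1000, 1500, 2000, 2500]
cpu_p95 = [22, 31, 40, 49]

slope, intercept = fit_line(users, cpu_p95)
limit_users = users_at_threshold(slope, intercept)
print(f"CPU reaches 80% at roughly {limit_users:.0f} users")
print("months of headroom at 10% growth:", months_until(users[-1], limit_users, 0.10))
```

Running the same fit for memory, disk, and network shows which resource runs out first, and that resource is the one that determines when to add capacity.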
Hopefully this method provides some guidance to help you monitor and scale your workload using out-of-the-box performance counters. For more detailed monitoring, look at Application Insights and Log Analytics (aka Azure Monitor), which let you drill down further and gain more insight.