Performance regression tests at Microsoft Security

Developer Support

In this post, Maor Frankel shares insights about performance regression testing in Microsoft Security.

During the past six years, a significant amount of my work has gone into improving the performance of the web applications I work on.

This has been the case both in my previous job at Outbrain and in my current position in the front end (FE) core team at Microsoft Security.

The typical approach to testing performance in similar cases is to add logs from the client code that measure user flows you want to track (aka RUM, Real User Monitoring).

For instance, you might add a start page load marker every time a page starts to load and another marker when the page completes loading. The delta between these two markers can be measured and used as a metric for the page load performance.
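As a rough illustration of the marker arithmetic, here is a minimal sketch that pairs start and end markers per session and computes the delta between them. The record layout and the marker names `page_load_start` and `page_load_end` are hypothetical, chosen only for this example.

```python
from typing import Iterable, NamedTuple


class Marker(NamedTuple):
    session_id: str
    name: str            # hypothetical names: "page_load_start" / "page_load_end"
    timestamp_ms: float


def page_load_times(markers: Iterable[Marker]) -> dict[str, float]:
    """Return the load time in ms for each session that logged both markers."""
    starts: dict[str, float] = {}
    durations: dict[str, float] = {}
    # Process markers in time order so each end marker is matched to the
    # start marker that preceded it in the same session.
    for m in sorted(markers, key=lambda m: m.timestamp_ms):
        if m.name == "page_load_start":
            starts[m.session_id] = m.timestamp_ms
        elif m.name == "page_load_end" and m.session_id in starts:
            # If a session loads several pages, the last load wins.
            durations[m.session_id] = m.timestamp_ms - starts.pop(m.session_id)
    return durations
```

Sessions that never log an end marker (for example, because the user left early) are simply skipped, which is one source of the noise in real user data.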

Once you have this sort of visibility, you can identify your weak areas and improve them.

This is a necessary approach as there is no real way to know how your application is performing without measuring it from the user’s perspective.

However, this approach has one significant drawback: it only detects performance issues after they have been deployed and encountered by your users.

Making matters worse, logs from real user traffic are usually very noisy and inconsistent, meaning you need to look at relatively large time frames to get an accurate view of the performance. It can take a significant amount of time (hours, days) before you realize you have an issue or a regression.


While nothing can truly replace monitoring real user traffic, at MS Security we started to think of ways we could catch such regressions BEFORE they reach our users.

In theory, you can simply run a specific scenario with the current changes, measure the load time and then do the same with your production code and compare them.

This of course is practically useless for the following reasons:

A web application session involves many factors other than your actual code. This means that even if you open the same page with the exact same code running, you are highly unlikely to measure the same load time.

And the differences can be significant, meaning it’s impossible to identify a regression brought on by a specific code change.

These are some factors that can contribute to the variations:

  1. Machine processing
  2. Server latency variations
  3. Network bandwidth
  4. Caching


We needed to find an approach that either reduces the number of variables in the test or can take these anomalies into account.

Remove variables – Mock server

To reduce the impact of our back-end (BE) services, we decided to run all our regression tests against a mock server (an HTTP server that mimics the responses of a real server, but with mocked data). This was simple for us, as we use Testim as our test runner.

Testim has a built-in recording tool which can then replay the exact same latency for all the API calls.

BE service performance can be tested separately using simple API testing, so there is no real benefit to testing the whole flow end to end (E2E).

Using a mock server has a drawback, though: mock servers usually respond to all requests with almost zero latency.

One of the most common causes of performance regression derived from a change in the FE code is the addition of new API calls or making API calls in series rather than in parallel.

In such cases, a zero-latency mock server would hide the regression, making the test counterproductive.

To solve this, we added a built-in latency to all API calls so they simulate a real E2E flow.
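To see why the added latency matters, here is a minimal sketch that simulates a page making several mocked API calls, once in series and once in parallel. The function names are hypothetical and `asyncio.sleep` stands in for the mock server's artificial delay; a short latency is used so the sketch runs quickly.

```python
import asyncio
import time


async def mock_api_call(latency_s: float) -> str:
    """Stand-in for a mocked API call that responds after a fixed delay."""
    await asyncio.sleep(latency_s)
    return "mocked response"


async def load_in_series(n_calls: int, latency_s: float) -> float:
    """Each call waits for the previous one: total is about n_calls * latency."""
    start = time.perf_counter()
    for _ in range(n_calls):
        await mock_api_call(latency_s)
    return time.perf_counter() - start


async def load_in_parallel(n_calls: int, latency_s: float) -> float:
    """All calls are in flight together: total is about one latency."""
    start = time.perf_counter()
    await asyncio.gather(*(mock_api_call(latency_s) for _ in range(n_calls)))
    return time.perf_counter() - start


def compare(n_calls: int = 3, latency_s: float = 0.05) -> tuple[float, float]:
    """Return (series_seconds, parallel_seconds) for the simulated page load."""
    series = asyncio.run(load_in_series(n_calls, latency_s))
    parallel = asyncio.run(load_in_parallel(n_calls, latency_s))
    return series, parallel
```

With `latency_s=0`, both variants finish almost instantly and are indistinguishable, which is exactly how a zero-latency mock server hides a serial-call regression. With a non-zero latency, the serial version takes roughly `n_calls` times longer.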

We added a fixed latency of 2000 ms to all calls. So far, this seems to work well. We also considered measuring the 75th percentile of each API's duration in production and using that as the per-API latency, but this is costly and requires maintenance over time.
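For reference, the per-API alternative we considered could be computed along these lines. This is only a sketch under stated assumptions: the input shape (a mapping from API name to a list of production durations in milliseconds) is hypothetical, and it is not something we run today.

```python
import statistics


def p75_latency_by_api(durations_ms: dict[str, list[float]]) -> dict[str, float]:
    """Return the 75th percentile duration (ms) for each API.

    Assumes every API has at least two recorded samples.
    """
    result: dict[str, float] = {}
    for api, samples in durations_ms.items():
        # quantiles(n=4) returns the three quartile cut points;
        # index 2 is the 75th percentile.
        result[api] = statistics.quantiles(samples, n=4, method="inclusive")[2]
    return result
```

The maintenance cost comes from keeping this table current: every new or changed API would need fresh production samples before the mock latency is representative.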

Nullifying the Anomalies – Welch T-Test

Even with the BE removed from the equation, a regression would have to be very large before you could be certain it was caused by your changes.

The Welch T-test is a statistical test that takes two sets of samples and determines, within a chosen confidence threshold, whether the difference between them is a random anomaly or is most likely the result of a true variation between the sets.
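To make the test concrete, here is a minimal standard-library sketch of the Welch statistic and its degrees of freedom. It is an illustration, not our production implementation. The p-value helper uses a normal approximation to the t-distribution, which is an assumption that is only reasonable for large samples; a statistics library would give an exact value.

```python
import math
import statistics
from statistics import NormalDist


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test.

    Each sample needs at least two values, and the two samples must not
    both have zero variance (that would divide by zero).
    """
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error of each mean, from the sample variances.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


def approx_two_sided_p(t: float) -> float:
    """Approximate two-sided p-value using the normal distribution.

    Assumption: large samples, so the t-distribution is close to normal.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))
```

A load-time comparison would pass the candidate samples as `a` and the baseline samples as `b`, then flag a regression when `t` is positive and the p-value falls below the chosen threshold (for example 0.05).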

The T-test requires a relatively large sample pool, so to implement it we run the test 25 times for each version (master vs. production), where each run reloads the page 4 times and measures each reload (the initial load is excluded to avoid any initial cache penalty).

So overall, we measure each scenario 200 times: 100 for master and 100 for production.
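The sampling plan can be sketched as a simple loop. Here `measure_load_ms` is a hypothetical callable that loads (or reloads) the page and returns the load time in milliseconds; in our setup the test runner does this work, so the code below only illustrates the arithmetic.

```python
from typing import Callable


def collect_samples(
    measure_load_ms: Callable[[], float],
    iterations: int = 25,
    reloads: int = 4,
) -> list[float]:
    """Collect iterations * reloads load-time samples for one version."""
    samples: list[float] = []
    for _ in range(iterations):
        # Initial load: performed but discarded, so the cold-cache
        # penalty does not skew the samples.
        measure_load_ms()
        for _ in range(reloads):
            samples.append(measure_load_ms())
    return samples
```

With the defaults this yields 25 × 4 = 100 samples per version, so comparing two versions gives the 200 measurements per scenario.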

Using Testim also made this a simple implementation as Testim has a configuration setting to run a single test multiple times, and a single step multiple times.


Since we need to run many tests per cycle, we opted not to run them as part of our PR gates, but rather in our nightly jobs or on demand. This is, of course, a budget limitation; ideally, you would run this on every change. Running nightly means you must backtrack through all of the day’s changes to figure out which one introduced the regression.

So far, running with this approach we have already managed to detect three significant performance regressions which would have otherwise reached our users.

Two of the regressions were introduced by developers who added a new API call in series with other API calls, which blocked the page load. In a real user scenario, this could have increased load time by up to 50%.

The third regression was due to the introduction of a third-party library that affected the render time of the UI. While this regression did not have a significant impact on the total load time, it does demonstrate that this testing approach can detect even subtle changes in performance.


There is no single solution for stopping performance regressions. While approaches like RUM are invaluable, they have limitations as I mentioned.

To minimize performance regressions, a wide set of tools needs to be implemented. The approach I described, together with other approaches, can help you to detect performance issues as soon as possible.

While there can be many variations on the solution described here, such as using statistical tools other than the Welch T-Test, the principle remains the same: use statistics to separate the random anomalies from the true regressions.

You can shift-left your performance testing, and in doing so, prevent regressions from reaching your users.

