Testing the MSVC Compiler Backend

Troy Johnson

This post provides a behind-the-scenes look at how we test MSVC’s backend, which is responsible for optimization and code generation.

Many people worldwide use our compiler and expect it to provide a high-quality experience in terms of correctness (compiled code behaves as written), performance (speed of the compiled code), and throughput (speed of the compiler itself). By sharing how we test, we hope to give you a better understanding of the level of quality control that goes into producing the compiler that you use.

This topic may bring to mind a compiler bug that you have encountered in the past or are presently experiencing. You may wonder how it slipped through our testing. We are not writing this post to claim that our testing catches every problem or that we have fixed every detected problem. The compiler is a complex system supporting an evolving C++ language on old and new processors for applications ranging from office productivity to the latest Xbox games. We continually prioritize which features to implement and bugs to fix based on customer impact. Those actions generate more tests, but our set of tests remains finite whereas our customers’ creativity is infinite.

Correctness Testing

Our top priority is ensuring that your compiled code behaves as written, even when optimized. Therefore, our compiler engineers’ everyday role of authoring and reviewing pull requests (PRs) that modify the compiler’s code also includes writing and running tests. Tests are run using Microsoft Azure DevOps pipelines. A lot of testing occurs before PRs are merged to our source code repository so that we catch problems before they are introduced, and then additional testing occurs after merging. We aim for a balanced approach that catches most problems early but still allows our compiler engineers to move quickly.

Internal Tests

Our first line of defense is a compiler built with assertions enabled. Assertions are conditional aborts mixed into the compiler’s code to check for unexpected internal states. Our compiler engineers use this kind of compiler for their daily development work and our test pipelines use it too. The additional checks slow the compiler down, so we release a faster version that has most checks disabled. We introduce new assertions as we work on the compiler, while keeping one eye on throughput since we don’t want to make the checked compiler too slow for development and testing.
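
As a rough illustration of the idea, the sketch below guards a toy constant-folding helper with an internal-state check that a checked build enforces and a release build skips. It is written in Python rather than in the compiler's own code, and the names CHECKED_BUILD, check, and fold_add are hypothetical; none of them come from MSVC.

```python
# Hypothetical sketch: an internal consistency check that a "checked" build
# enforces and a "release" build skips. None of these names come from MSVC.

CHECKED_BUILD = True  # assumption: a build-time switch, modeled here as a flag


class InternalCompilerError(Exception):
    """Raised when the compiler finds itself in an unexpected internal state."""


def check(condition: bool, message: str) -> None:
    """Abort if an internal invariant is violated (checked builds only)."""
    if CHECKED_BUILD and not condition:
        raise InternalCompilerError(message)


def fold_add(lhs: dict, rhs: dict) -> dict:
    """Fold the addition of two constant operands in a toy IR."""
    # Invariant: callers must only pass constant operands to this helper.
    check(lhs["kind"] == "const" and rhs["kind"] == "const",
          "fold_add called with a non-constant operand")
    return {"kind": "const", "value": lhs["value"] + rhs["value"]}
```

With the flag on, a caller that passes a non-constant operand fails immediately at the point of misuse instead of producing wrong code later; with the flag off, the check costs almost nothing.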

Regression Tests

We run regression tests on every PR and gate merges on all tests passing. MSVC has endured for decades, so we have accumulated many tests based on previous bug reports and compiler engineers proactively writing tests. These tests must compile, execute, and produce their expected output. The tests are run for all supported targets with a variety of compiler options, which takes a few hours. We can’t have a test pipeline for every combination of compiler options, so we test the more common ones and allow individual tests to require specific additional options. Furthermore, not every option is appropriate for every test, so we limit some tests to certain configurations. Passing all of these tests is already a high bar, but we go further.
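
One way to picture the configuration handling described above is as a small test matrix. The sketch below is an assumption about how such a matrix could be modeled, not a description of our pipelines: the target list, the common option sets, and the per-test keys (targets, skip_options, extra_options) are all hypothetical.

```python
from itertools import product

# Hypothetical targets and common option sets; the real lists are not given here.
TARGETS = ["x86", "x64", "arm64"]
COMMON_OPTIONS = [("/Od",), ("/O2",)]


def configurations(test: dict) -> list:
    """Return the (target, options) pairs under which one test should run."""
    runs = []
    for target, options in product(TARGETS, COMMON_OPTIONS):
        # A test may be limited to certain targets or option sets.
        if target not in test.get("targets", TARGETS):
            continue
        if any(opt in test.get("skip_options", ()) for opt in options):
            continue
        # A test may also require extra options on top of the common ones.
        runs.append((target, options + tuple(test.get("extra_options", ()))))
    return runs
```

A test with no restrictions runs under every target and common option set, while a test that declares a single target, one skipped option, and one extra option collapses to a single run.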

Optimization Tests

The main problem with using only traditional regression tests for an optimizing compiler is that optimizations tend to be invisible from a testing perspective. After all, the goal when adding an optimization is that the program runs faster, but its output should not change. Conversely, if an engineer introduces a bug that stops an optimization from occurring, then all regression tests would continue passing and the bug would remain undetected. What to do? To combat this problem, we added two new testing systems to supplement existing regression tests.

Unit Testing

The goal of a unit test is to check a particular subsystem within the compiler itself, ranging from only one function to one optimization phase, in isolation with respect to the rest of the compiler. We added a unit test framework, run under a merge-gating pipeline, that allows us to write small chunks of compiler code to verify key behaviors. For example, there is an internal compiler facility for matching fragments of the compiler’s intermediate representation (IR), so we can write tests to validate that it matches what it should match and does not match what it should not match. When a unit test catches a bug, it is easy to fix because we know exactly what is wrong.
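
To make the IR-matching example concrete, here is a minimal sketch of what such a unit test could look like. The toy IR (nested tuples), the matches function, and the ADD_ZERO pattern are all hypothetical stand-ins written in Python; the real facility and framework are internal to the compiler and look different.

```python
import unittest

ANY = object()  # wildcard that matches any subtree


def matches(pattern, node) -> bool:
    """Return True if a toy IR node (nested tuples) fits the pattern."""
    if pattern is ANY:
        return True
    if isinstance(pattern, tuple) and isinstance(node, tuple):
        return len(pattern) == len(node) and all(
            matches(p, n) for p, n in zip(pattern, node))
    return pattern == node


# Pattern for "anything plus the constant zero".
ADD_ZERO = ("add", ANY, ("const", 0))


class MatcherTests(unittest.TestCase):
    def test_matches_what_it_should(self):
        self.assertTrue(matches(ADD_ZERO, ("add", ("var", "x"), ("const", 0))))

    def test_rejects_what_it_should_not_match(self):
        self.assertFalse(matches(ADD_ZERO, ("add", ("var", "x"), ("const", 1))))
        self.assertFalse(matches(ADD_ZERO, ("mul", ("var", "x"), ("const", 0))))
```

Both directions are tested on purpose: a matcher that accepts too much is as much a bug as one that accepts too little.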

Unit testing scales upward to test optimizations, but eventually becomes unwieldy. Internal compiler state must be created programmatically within the test and any related subsystems need to be mocked up. As we have written more of these tests, we have gotten better at making parts of the compiler more amenable to unit testing. Beyond a certain point though, we still need a different testing method.

FileCheck-Like Testing

Outside of Microsoft, tests written for the LLVM project, which is the basis for Clang and other compilers, use a program called FileCheck to scan textual compiler output, like messages or assembly code, for certain patterns that span multiple lines. For example, it can check that the compiler emits a mov instruction followed by an add instruction and require that both the mov and the add use the same register, no matter what that register happens to be. Thus, the tests can check that a specific message or code pattern was emitted while being immune to irrelevant code generation changes like arbitrary register selection. Think regular expressions but driven by a special purpose language tailored for checking multi-line compiler output.
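
The mov-then-add example can be approximated with an ordinary regular expression that captures the register and refers back to it. This sketch uses Python's re module rather than FileCheck's own directive language, and it assumes that unrelated instructions may appear between the two matches; the helper name has_mov_then_add is made up for illustration.

```python
import re

# A mov into some register, then on a later line an add that names the same register.
PATTERN = re.compile(
    r"^\s*mov\s+(?P<reg>\w+),.*$"   # capture whichever register mov writes
    r"(?:\n.*)*?"                    # allow unrelated lines in between
    r"\n\s*add\s+(?P=reg),",        # add must name the captured register
    re.MULTILINE,
)


def has_mov_then_add(asm: str) -> bool:
    """Return True if the listing contains the mov/add pair on one register."""
    return PATTERN.search(asm) is not None
```

The check passes for eax, r8d, or any other register, but fails when the add targets a different register than the mov, which is exactly the kind of immunity to register selection described above.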

For MSVC, we wanted to adopt a similar testing approach but faced a few challenges related to tests expecting output in a fixed order. First, MSVC is multithreaded, so debug messages emitted by the compiler can appear out of order. This problem is handled by including something in the message to make it unique, such as the function name, or by running the compiler in single-threaded mode for the test. Second, and independent of the multithreading issue, MSVC may emit functions to .asm files in a different order than they appeared in the C++ code. Furthermore, optimization settings may change that order, which makes it difficult to write tests for multiple functions at once. This behavior has years of inertia behind it and would be disruptive to change, so we needed a tool that would allow our engineers to write tests that tolerate functions being emitted in an arbitrary order. Tolerating arbitrary order is a radical algorithmic departure from how LLVM’s FileCheck behaves, and it is necessary only to cope with an output-sequencing quirk that is specific to MSVC, so we wrote our own tool.
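
The order-tolerance requirement can be sketched as follows. This is not our tool; it is a minimal Python illustration that assumes each function in the listing is delimited by "name PROC" and "name ENDP" lines, and the names split_functions and check_functions are hypothetical. Patterns are checked in order within a function, but functions may appear in any order.

```python
import re


def split_functions(asm: str) -> dict:
    """Map each function name to its body, regardless of emission order."""
    bodies = {}
    for m in re.finditer(r"^(\S+)\s+PROC\b.*?\n(.*?)^\1\s+ENDP\b",
                         asm, re.MULTILINE | re.DOTALL):
        bodies[m.group(1)] = m.group(2)
    return bodies


def check_functions(asm: str, expectations: dict) -> list:
    """Return names of functions whose patterns are missing or out of order."""
    bodies = split_functions(asm)
    failures = []
    for name, patterns in expectations.items():
        body, pos = bodies.get(name), 0
        if body is None:
            failures.append(name)
            continue
        for pattern in patterns:
            # Each pattern must match after the end of the previous match.
            m = re.compile(pattern).search(body, pos)
            if m is None:
                failures.append(name)
                break
            pos = m.end()
    return failures
```

Because expectations are keyed by function name, the same test passes whether foo is emitted before bar or after it.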

The final challenge is arguably a strength. MSVC cannot ingest a serialized form of the compiler’s IR, nor can it apply exactly one optimization while skipping others. Therefore, our test input is always C++ code which passes through multiple compiler phases before reaching the phase that is the desired test subject. Although this limitation initially appears to be a weakness, it has the advantage that our tests are realistic because the IR that ultimately gets tested is less sanitized. For example, if we could write a test that accepted IR as input and ran exactly one optimization, then the test would present that optimization with exactly the input that it expected, and it would continue passing unless the optimization itself was modified later. When an optimization is run in production, however, the IR reaching the optimization may have greater variation and cause the optimization to fail. For example, a change to an earlier phase may invalidate an assumption made by the optimization, causing it to no longer apply. Thus, although they are more fragile, tests accepting C++ code as input more closely model real behavior than IR tests.

Assembly Diffs

We can generate assembly language diffs of our regression tests to see the effect of a given PR. The diffs are generated automatically and include a summary about whether code size increased or decreased overall, but reviewing the substance of the diffs is a manual process. First, the automation gives compiler engineers the ability to check whether their PR changed generated code. For example, it would be a surprise if a refactoring PR changed generated code and would indicate that the engineer made a mistake. Likewise, when implementing an optimization, it would be surprising if no code changed and may indicate that the engineer made a mistake. Second, it shows how the PR changed the generated code so that the engineer can verify their change made sense. For example, if the engineer attempted to cause more loop unrolling, then they would expect code size to increase, whereas if they optimized away redundant instructions, then they would expect code size to decrease.
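
A minimal sketch of the two automated pieces, the diff itself and the code-size summary, might look like the following. It uses Python's standard difflib module; the file labels, the per-function size dictionaries, and the function names are assumptions for illustration and do not describe our actual automation.

```python
import difflib


def asm_diff(before: str, after: str) -> list:
    """Unified diff of two assembly listings, line by line."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="baseline.asm", tofile="pr.asm", lineterm=""))


def summarize(before_sizes: dict, after_sizes: dict) -> str:
    """Summarize the overall code-size change across functions, in bytes."""
    delta = sum(after_sizes.values()) - sum(before_sizes.values())
    if delta == 0:
        return "code size unchanged"
    direction = "increased" if delta > 0 else "decreased"
    return f"code size {direction} by {abs(delta)} bytes"
```

An empty diff is the expected result for a refactoring PR, while a PR that removes redundant instructions should show deleted lines and a decrease in the summary.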

Windows Testing

Every PR that might change the instructions emitted by the compiler is tested by using the modified compiler to build Windows for multiple architectures. Then, those builds of Windows perform their usual boot verification tests to ensure that they work correctly. Although we have this system automated, it takes a day or two, which means that compiler engineers must switch to a different task while waiting for their PR to pass Windows testing. Why do we bother to endure this delay and context switching?

First, as you might expect, building Windows has been a requirement for MSVC since day one. Although compiler PRs do not immediately affect Windows because it is built with a released version of the compiler, it is too late for us to discover a problem when the Windows team upgrades their compiler. Breaking the Windows build or especially boot verification via a compiler bug leads to extremely tedious debugging, so we are very proactive about not breaking it in the first place.

Second, Windows is a serious stress test for the compiler due to its size, complexity, and mix of legacy and newer code. It sometimes reveals problems that our other testing does not find and then we capture the failing behavior into a smaller regression test so that we can detect the same problem earlier in the future.

Dogfooding

Another Microsoft software product that is built as part of testing the compiler is the compiler itself! PRs to our compiler’s development branch are always built with the most recent preview release of the compiler. Another test pipeline uses the just-built compiler to build itself. This process ensures that we use our product while developing our product, also known as dogfooding. Any problems detected by these steps are fixed quickly to allow development to proceed unhindered.

Other Microsoft Software as Tests

Besides Windows and the compiler itself, MSVC is used to build other C++-based software that Microsoft ships, including Office, Teams, and SQL, so we have many customers in house. We don’t build this software as part of testing the compiler, but our internal customers give us feedback whenever they update to newer compilers. We work with them to ensure that MSVC meets their needs.

Real World Code Tests

We have additional Azure DevOps test pipelines that build what we call “real world code,” which is a selection of open-source software that can be built for Windows. The software universe is vast, so we can’t build everything, but we build a subset because it can reveal compiler bugs. We do not gate PRs on these tests, although developers can optionally run them. Instead, we run them regularly out of our production branch and receive weekly test reports.

Performance Testing

We have other Azure DevOps pipelines that run benchmark suites multiple times before and after a PR, flagging statistically significant improvements or regressions in performance and code size. SPEC CPU 2017 is one such suite. We use performance benchmarks to examine the impact of new optimizations. Not every performance improvement is easily measured; long-term forward progress can mean the accumulation of many small optimizations that each would register as noise. Periodically, we review the performance progress that MSVC has made over the past several months to ensure that we’re moving in the right direction. We compare performance with other compilers available for Windows. We also use these reviews to discuss opportunities for improvement and which applications or benchmarks we want to focus on next.
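
To show what "statistically significant" can mean for repeated benchmark runs, the sketch below applies an exact two-sided permutation test to the difference of mean run times. The choice of test, the 0.05 threshold, and the function names are assumptions made for illustration; this post does not specify which statistical method the pipelines use.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(before: list, after: list) -> float:
    """Two-sided exact permutation test on the difference of means."""
    pooled = before + after
    observed = abs(mean(after) - mean(before))
    total = sum(pooled)
    n, k = len(pooled), len(before)
    extreme = count = 0
    # Try every way of relabeling k of the n timings as "before".
    for idx in combinations(range(n), k):
        group_sum = sum(pooled[i] for i in idx)
        diff = abs((total - group_sum) / (n - k) - group_sum / k)
        extreme += diff >= observed - 1e-12  # tolerance for float rounding
        count += 1
    return extreme / count


def classify(before: list, after: list, alpha: float = 0.05) -> str:
    """Label a run-time change (lower is better) as regression, improvement, or noise."""
    if permutation_p_value(before, after) >= alpha:
        return "noise"
    return "regression" if mean(after) > mean(before) else "improvement"
```

Running each benchmark several times matters here: with only one run before and one after, no relabeling can be distinguished from the observed one, so every change would be classified as noise.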

Throughput Testing

We have a collection of link repros that allows us to distill building large applications into a simple link command that invokes link-time code generation (LTCG). We can build these link repros, time them, and avoid merging any PR that significantly degrades throughput, but the best time for us to spot throughput issues is during code reviews. We try to spot the introduction of algorithms with O(N²) or worse complexity into our repository because those may become throughput bottlenecks later. We monitor our throughput over time because a compiler naturally becomes slower as more optimizations are added to it. Occasionally, we do a short sprint to improve throughput and reverse this long-term creep.
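
The kind of quadratic pattern that a reviewer looks for is often subtle. The hypothetical Python example below finds duplicate names in two ways that return the same result: the first rescans a list on every iteration, while the second uses a hash set. Neither function comes from the compiler; they only illustrate the shape of the problem.

```python
def find_duplicates_quadratic(names: list) -> set:
    """O(N^2): each membership test rescans a growing list."""
    seen, dups = [], set()
    for name in names:
        if name in seen:        # linear scan inside a linear loop
            dups.add(name)
        else:
            seen.append(name)
    return dups


def find_duplicates_linear(names: list) -> set:
    """Expected O(N): hash-set membership is constant time on average."""
    seen, dups = set(), set()
    for name in names:
        if name in seen:
            dups.add(name)
        else:
            seen.add(name)
    return dups
```

On small inputs the two are indistinguishable in a timing test, which is why this class of problem is easier to catch by reading the code than by measuring it.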

Walkthrough of a Bug Fix

Let’s look at what happens when a compiler engineer fixes a bug that you reported:

  1. A fix is identified and a PR is posted for other engineers to review. The PR contains both the fix and at least one new regression test based on the bug report. The PR may introduce additional asserts to help us catch similar problems in the future. It also may introduce new unit or FileCheck-style tests. The goal is to catch the same or similar problem as early as possible, so the appropriate style of test varies for each bug.
  2. Required test pipelines are run in Azure DevOps. The engineer iterates on the PR until all tests pass. The engineer may elect to run some optional pipelines, depending on the nature of the bug.
  3. The engineer may look at assembly diffs, run performance pipelines, or run throughput tests. This depends on the nature of the PR. The engineer continues to iterate on the PR if there is anything unexpected.
  4. The engineer submits their PR for Windows build and boot testing, then waits for it to pass while they move on to their next task. If a Windows test fails, then they investigate and may need to back up a few steps.
  5. The PR is merged with the review and approval of at least one other compiler engineer.
  6. Additional continuous integration testing is done on recent compilers and may flag the PR as needing additional work. If a fix cannot be submitted the same day, then the PR is reverted.

Walkthrough of a New Feature or Optimization

New features are inspired by the language standard and conversations with our customers; new optimizations are inspired by profiling and suggestion tickets. Both result in one or more engineers developing a prototype and discussing their approach in a design review with leadership and principal engineers. At this point, there is at least one motivating test case for the work, but as the work progresses, there is greater emphasis on producing a larger number of tests than in the bug-fix workflow above. For a bug fix, the engineer incrementally extends our testing to catch one additional (and often unusual) case, but for a new optimization or compiler feature, there are no existing tests that are specifically intended to cover it. Therefore, a variety of new tests must be produced. We try to generate most of these tests early so that we can evaluate our progress toward completion. There also is greater emphasis on measuring performance and throughput impact. Otherwise, the mechanics of submitting a PR and testing it are the same as above.

Release Testing

Development of major components like the compiler’s frontend and backend proceeds independently in separate production branches off our main branch. At a regular cadence, we merge each production branch to our main branch after more extensive testing, which includes building Windows, running the boot verification tests, and also running every other test that the Windows team would run when integrating one of their own feature branches.

As the time approaches to release a new version of the compiler, we create a release branch from our main branch. We build preview releases of the compiler from the release branch. Preview releases allow others to try out the compiler and report problems early. For later previews, we focus on fixing bugs reported against the earlier previews, but otherwise are much more restrictive in terms of the changes that we allow into the release branch.

Each preview and final release undergoes additional testing where we use it to build a large collection of open source software, including the entire vcpkg catalog. Furthermore, because the compiler ships as part of Visual Studio, it must pass all of the testing processes in place for Visual Studio as well. Certain test workflows involve both the compiler and the IDE. Before merging the PR that inserts the released compiler into Visual Studio’s release branch, we build a Visual Studio release candidate from that PR. We install that version of Visual Studio on multiple virtual machines for different architectures, doing both clean installs and upgrade installs. We launch Visual Studio, create several flavors of C++ projects, build them, and run them to check various integrated features, like the debugger, static analysis, and the address sanitizer.

Conclusion

MSVC is an industrial-quality C++ compiler with many customers depending on it. We take testing very seriously and we hope that this post has helped you to understand our process better. Please share your thoughts and comments with us through Developer Community. You can also reach us on X (@VisualC), or via email at visualcpp@microsoft.com.

2 comments

  • Christopher Mire

    “Those actions generate more tests, but our set of tests remains finite whereas our customers’ creativity is infinite.”

    I think I’m going to use that now every time someone asks me how we didn’t find a bug before.
