January 4th, 2018

Visual Studio 2017 Throughput Improvements and Advice

Terry Mahaffey
Principal Software Engineer

As C++ programs get larger and the optimizer becomes more complex, the compiler's build time, or throughput, increasingly comes into focus. It's something that needs to be continually addressed as new patterns emerge and take hold (such as "unity" builds in gaming). It's something we're focusing on here in the Visual C++ team; it was a major point of emphasis during the recent 15.5 release and will continue to be going forward. I want to take a few minutes to update everyone on some of the specific changes we've made to help with your compile times, and to provide a few tips on how you can change your project, or use technologies baked into the toolset, to improve your build times.

Please note that not every change is aimed at providing small gains across all scenarios. Typically we're targeting long-pole corner cases, trying to bring compile times there down closer to the expected mean for a project of that size. We've recently started using AAA game titles as a benchmark. There is more work to be done.

There are three pieces of the toolset which need to be improved individually. First, there is the "front end" of the compiler, implemented in c1xx.dll. It's the code which takes a .cpp file and produces a language-independent intermediate language, or IL, which is then fed into the back end of the compiler. The compiler back end is implemented in c2.dll; it reads the IL from the front end and produces an obj file from it, which contains the actual machine code. Finally, the linker (link.exe) is the tool which takes the various obj files from the back end, as well as any lib files you give it, and mashes them together to produce the final binary.

Compiler Front End Throughput

In many projects, the front end of the compiler is the bottleneck for build throughput. Luckily it parallelizes well, either by using the /MP switch (which will spawn multiple cl.exe processes to handle multiple input files), externally via MSBuild or other build systems, or perhaps even distributed across machines with a tool like IncrediBuild. Effective distribution and parallelization of building your project should be the first step you take to improve your throughput.

The second step you should take is to make sure you are making effective use of PCH files. A PCH file is essentially a memory dump of cl.exe with a fully parsed .h file – saving the trouble of redoing that work each time the header is included. You'd be surprised how much this matters; header files (such as windows.h, or some DirectX headers) can be massive once they are fully preprocessed – and often make up the vast majority of a preprocessed source file. PCH files here can make a world of difference. The key is to only include headers which change infrequently, making sure the PCH stays a net win for you.
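
For example, a minimal PCH setup driven directly through cl.exe might look something like the following. The file names (pch.h, pch.cpp) and the explicit /Fp name are just placeholders; most projects will set the equivalent /Yc and /Yu options through their project settings instead.

rem Build the PCH once from a stub .cpp whose only content is #include "pch.h"
cl /c /Ycpch.h /Fppch.pch pch.cpp

rem Every other source file starts with #include "pch.h" and reuses the PCH
cl /c /Yupch.h /Fppch.pch a.cpp b.cpp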

The final piece of advice here is to limit what you #include. Outside of PCH files, #including a file is a rather expensive process involving searching every directory in your include path. It's a lot of file I/O, and it's a transitive process that needs to be repeated for each translation unit. That's why PCH files help so much. Inside Microsoft, people have reported a lot of success from doing an "include what you use" pass over their projects. Using the /showIncludes option can give you an idea of how expensive this is, and help guide you to only include what you use.
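
For example, the following will print every header pulled in, directly or transitively, for one translation unit (the file name is just a placeholder); the indentation of the output reflects the include nesting depth:

rem Dump the full include tree for a single source file
cl /c /showIncludes main.cpp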

Finally, I want you to be aware of the /Bt option to cl.exe. This will output the time spent in the front end (as well as back end and linker) for each source file you have. That will help you identify the bottlenecks and know what source files you want to spend time optimizing.
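
A typical invocation might look like this (note that /Bt is an undocumented switch, so its exact output can vary between toolset versions; the file name is a placeholder):

rem Report the time spent in c1xx.dll (front end) and c2.dll (back end) for this file
cl /c /Bt main.cpp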

Here are a few things we changed in the front end to help with performance.

Refreshed PGO counts

PGO, or “profile guided optimization”, is a back end compiler technology used extensively across Microsoft. The basic idea is you generate a special instrumented build of your product, run some test cases to generate profiles, and recompile/optimize based on that collected data.

We discovered that we were using older profile data when compiling and optimizing the front end binary (c1xx.dll). When we reinstrumented and recollected the PGO data we saw a 10% performance boost.

The lesson here is, if you’re using PGO in order to provide a performance boost to your product, make sure you periodically recollect your training data!

Remove usages of __assume

__assume(0) is a hint to the back end of the compiler that a certain code path (the default case of a switch, for example) is unreachable. Many products wrap this up in a macro, named something like UNREACHABLE, implemented so that debug builds will assert and ship builds will pass the hint to the compiler. The compiler might then do things such as removing branches or switch cases which target that statement.
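
A sketch of what such a macro might look like is below. This is illustrative only, not the exact macro any particular product uses; the enum and function are made up for the example.

#include <cassert>

// Debug builds assert if the "impossible" path is ever taken;
// ship builds hand the unreachability hint to the optimizer.
#ifdef NDEBUG
#define UNREACHABLE() __assume(0)
#else
#define UNREACHABLE() assert(!"unreachable code executed")
#endif

enum class Color { Red, Green, Blue };

const char* Name(Color c) {
    switch (c) {
    case Color::Red:   return "red";
    case Color::Green: return "green";
    case Color::Blue:  return "blue";
    }
    UNREACHABLE();  // every enumerator is handled above
    return "";      // keeps debug builds well-defined if the assert doesn't terminate
}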

It stands to reason, then, that if an __assume(0) statement actually is reachable at runtime, bad code generation can result. This causes problems in a lot of different ways (and some people argue it might cause security issues) – so we ran an experiment to see what the impact would be of simply removing all __assume(0) statements by redefining that macro. If the regression was small, perhaps the hint wasn't worth having in the product, given the other issues it causes.

Much to our surprise, the front end actually got 1-2% faster with __assume statements removed. That made the decision pretty easy. The root cause here appears to be that although __assume can be an effective hint to the optimizer in many cases, it can actually inhibit other optimizations (particularly newer ones). Improving __assume is an active work item for a future release as well.

Improve winmd file loading

Various changes were made to how winmd files were loaded, for a gain of around 10% of load time (which is perhaps 1% of total compile time). This only impacts UWP projects.

Compiler Back End

The compiler back end includes the optimizer. There are two classes of throughput issues here, “general” problems (where we do a bunch of work in hopes of a 1-2% across the board win), and “long poles” where a specific function causes some optimization to go down a pathological path and take 30 seconds or longer to compile – but the vast majority of people are not impacted. We care about and work on both.

If you use /Bt to cl.exe and see an outlier which takes an unusual amount of time in c2.dll (the back end), the next step is to compile just that file with /d2cgsummary. Cgsummary (or “code generation summary”) will tell you what functions are taking all of the time. If you’re lucky, the function isn’t on your critical performance path, and you can disable optimizations around the function like this:

#pragma optimize("", off)
void foo() {
...
}
#pragma optimize("", on)

Then the optimizer won’t run on that function. Then get in touch with us and we’ll see if we can fix the throughput issue.

Beyond just turning off the optimizer for functions with pathological compile times, I also need to warn against too liberal a use of __forceinline. Often customers need __forceinline to get the inliner to do what they want, and in those cases the advice is to be as targeted as possible. The back end of the compiler takes __forceinline very, very seriously: it's exempt from all inline budget checks (the cost of a __forceinline function doesn't even count against the inline budget) and is always honored. We've seen many cases over the years where liberally applying __forceinline for code quality (CQ) reasons caused a major throughput bottleneck. Basically, this is because, unlike other compilers, we always inline pre-optimized functions directly from the front end IL. This sometimes is an advantage – we can make different optimization decisions in different contexts, but one disadvantage is that we end up redoing a lot of work. If you have a deep __forceinline "tree", this can quickly get pathological. This is the root cause of long compile times in places like TensorFlow and libsodium. It's something we are looking to address in a future release.
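
To illustrate the pattern that hurts, here's a hypothetical deep __forceinline chain (the functions are made up). Every call site of the outermost function forces the whole chain beneath it to be re-read from the front end IL and re-inlined, so the work multiplies with the depth of the tree and the number of call sites:

__forceinline float Dot(float ax, float ay, float bx, float by) {
    return ax * bx + ay * by;
}
__forceinline float LengthSquared(float x, float y) {
    return Dot(x, y, x, y);
}
__forceinline bool IsLongerThan(float x, float y, float limit) {
    return LengthSquared(x, y) > limit * limit;
}

bool Outer(float x, float y) {
    return IsLongerThan(x, y, 100.0f);  // expands the entire __forceinline tree here
}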

Look into Incremental Link Time Code Generation (iLTCG) for your LTCG builds. Incremental LTCG is a relatively new technology that allows us to only do code generation on the functions (and dependencies, such as their inliners) which have changed in an LTCG build. Without it, we actually redo code generation on the entire binary, even for a minor edit. If you’ve abandoned the use of LTCG because of the hit it causes to the inner dev loop, please take another look at it with iLTCG.
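
If you drive the toolset from the command line, switching an LTCG configuration over to incremental LTCG might look roughly like this (file names are placeholders; in MSBuild projects the same thing is a project setting):

rem Compile with whole program optimization, then link with incremental LTCG
cl /c /GL /O2 a.cpp b.cpp
link /LTCG:INCREMENTAL /LTCGOUT:app.iobj a.obj b.obj /OUT:app.exe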

One final piece of advice, which applies mostly to LTCG builds (where a single link.exe process does code generation, rather than it being distributed across cl.exe processes): consider adjusting the default core scaling strategy via /cgthreads#. As you'll see below, we've made changes to scale better here, but the default is still to use 4 cores. In the future we'll look at increasing the default core count, or even making it dynamic based on the number of cores on the machine.
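
For example, to ask for eight code generation threads instead of the default four (assuming the machine has the cores to spare; file names are placeholders):

rem Non-LTCG builds: tell cl.exe to use 8 code generation threads
cl /c /O2 /cgthreads8 a.cpp b.cpp

rem LTCG builds: code generation happens inside link.exe, so pass the count there
cl /c /O2 /GL a.cpp b.cpp
link /LTCG /CGTHREADS:8 a.obj b.obj /OUT:app.exe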

Here are some recent changes made to the back end that will help you build faster for free:

Inline Reader Cache

In some other compilers, inlining is implemented by keeping all inline-candidate functions in memory after they've been optimized. Inlining is then just a matter of copying that memory into the appropriate spot in the current function.

In VC++, however, we implement inlining slightly differently. We actually re-read the unoptimized version of an inlinee from disk. This clearly can be a lot slower, but at the same time may use a lot less memory. This can become a bottleneck, especially in projects with a ton of __forceinline calls to work through.

To help mitigate this, we took a small step towards the "in memory" inlining approach of other compilers. The back end will now cache a function after it has been read for inlining a certain number of times. Some experimenting showed that a threshold of 100 reads was a good balance between throughput wins and memory usage. This can be configured by passing /d2FuncCache# to the compiler (or /d2:-FuncCache# to the linker for LTCG builds). Passing 0 disables the feature; passing 50 means a function is cached only after it has been inlined 50 times, and so on.
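
For example, to experiment with the threshold (keep in mind these are /d2 switches and may change between releases; file names are placeholders):

rem Cache an inlinee's IL after it has been read 50 times (0 disables the cache)
cl /c /O2 /d2FuncCache50 a.cpp

rem LTCG builds: pass the same option through the linker to the back end
cl /c /O2 /GL a.cpp b.cpp
link /LTCG /d2:-FuncCache50 a.obj b.obj /OUT:app.exe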

Type System Building Improvements

This applies to LTCG builds. At the start of an LTCG build, the compiler back end attempts to build a model of all of the types used in the program, for use in a variety of optimizations such as devirtualization. This is slow and takes a ton of memory. In the past, when people have hit issues involving the type system, we've advised them to simply turn it off by passing /d2:-notypeopt to the linker. Recently we made some improvements to the type system which we hope will mitigate these issues once and for all. The actual changes are pretty basic, and they involve how we implement bitsets.
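
For reference, the old workaround is just an extra switch on the link line of an LTCG build (again, a /d2 escape hatch that may change between releases; file names are placeholders):

rem Disable type system construction during LTCG code generation
link /LTCG /d2:-notypeopt a.obj b.obj /OUT:app.exe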

Better scaling to multiple cores

The compiler back end is multithreaded. But there are some restrictions: we compile in a “bottom up” order – meaning a function is only compiled once all of its callees are compiled. This is so a function can use information collected during its callees’ compilation to optimize better.

There has always been a limit on this: functions above a certain size are exempt, and simply begin compiling immediately without using the bottom-up information. This is done to prevent compilation from bottlenecking on a single thread as it churns through the last few remaining massive functions which couldn't start sooner because of a large tree of dependencies.

We have reevaluated the “large function” limit, and lowered it significantly. Previously, only a few functions existed in all of Microsoft which triggered this behavior. Now we expect a few functions per project might. We didn’t measure any significant CQ loss with this change, but the throughput wins can be large depending on how much a project was previously bottlenecking on its large functions.

Other inlining improvements

We’ve made changes to how symbol tables are constructed and merged during inlining, which provide an additional small benefit across the board.

Finer grained locking

Like most projects we continually profile and examine locking bottlenecks, and go after the big hitters. As a result we’ve improved the granularity of our locking in a few instances, in particular how IL files are mapped and accessed and how symbols are mapped to each other.

New data structures around symbol tables and symbol mappings

During LTCG, a lot of work is done to properly map symbols across modules. This part of the code was rewritten using new data structures to provide a boost. This helps especially in “unity” style builds, common in the gaming industry, where these symbol key mappings can get rather large.

Multithread Additional Parts of LTCG

Saying the compiler back end is multithreaded is only partially true: it's the "code generation" portion that is multithreaded – by far the largest chunk of the work to be done.

LTCG builds, however, are a lot more complicated. They have a few other parts to them. We recently did the work to multithread another one of these parts, giving up to a 10% speedup in LTCG builds. This work will continue into future releases.

Linker Improvements

If you’re using LTCG (and you should be), you’ll probably view the linker as the bottleneck in your build system. That’s a little unfair, as during LTCG the linker just invokes c2.dll to do code generation – so the above advice applies. But beyond code generation, the linker has its traditional job to do of resolving references and smashing objs together to produce a final binary.

The biggest thing you can do here is use "fastlink". Fastlink is actually a new PDB format, invoked by passing /debug:fastlink to the linker. This greatly reduces the work that needs to be done to generate a PDB file during linking.

On your debug builds, you should be using /INCREMENTAL. Incremental linking allows the linker to only update the objs which have been modified, rather than rebuild the entire binary. This can make a dramatic difference in the “inner dev loop” where you’re making some changes, recompiling/linking and testing, and repeating. Similar to fastlink, we’ve made tons of stability improvements here. If you’ve tried it before but found it to be unstable, please give it another chance.
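
Both of these are plain linker switches on a debug configuration; a hypothetical link line might look like this (file names are placeholders):

rem Debug builds: incremental linking plus the fastlink PDB format
link /DEBUG:FASTLINK /INCREMENTAL a.obj b.obj /OUT:app.exe /PDB:app.pdb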

Some recent linker throughput improvements you’ll get for free include:

New ICF heuristic

ICF – or identical COMDAT folding – is one of the biggest bottlenecks in the linker. This is the phase where identical functions are folded together to save space, and any references to those functions are redirected to the single remaining instance.

This release, ICF got a bit of a rewrite. The summary is we now rely on a strong hashing function for equality rather than doing a memcmp. This speeds up ICF significantly.

Fallback to 64 bit linker

The 32 bit linker has an address space problem for large projects. It often memory maps files as a way of accessing them, and if the file is large this isn’t always possible as memory mapping requires contiguous address space. As a backup, the linker falls back on a slower buffered I/O approach where it reads parts of the file as needed.

It’s known that the buffered I/O codepath is much, much slower compared to doing memory mapped I/O. So we’ve added new logic where the 32 bit linker attempts to restart itself as a 64 bit process before falling back to the buffered I/O.

Fastlink Improvements

/DEBUG:fastlink is a relatively new feature which significantly speeds up debug info generation – a major portion of overall link time. We suggest everyone read up on this feature and use it if at all possible. In this release we’ve made it faster and more stable, and we are continuing to invest in fastlink in future releases. If you initially used it but moved away because of a bad experience, please give it another shot! We have more improvements on the way here in 15.6 and beyond.

Incremental Linking fallback

One of the complaints we've heard about incremental linking is that it can sometimes be slower than full linking, depending on how many objs or libs have been modified. We're now more aggressive about detecting this situation and bailing out to a full link.

Conclusion

This list isn’t by any means exhaustive, but it’s a good summary of a few of the larger throughput focused changes over the past few months. If you’ve ever been frustrated with the compile time or link time of VC++ before, I’d encourage you to give it another shot with the 15.5 toolset. And if you do happen to have a project which is taking an unreasonably long time to compile compared to other projects of similar size or on other toolsets, we’d love to take a look!

And remember, you can use /d2cgsummary to cl.exe or /d2:-cgsummary to the linker to help diagnose code generation throughput issues. This includes info about the inliner reader cache discussed above. And for the toolset at large, pass /Bt to cl.exe and it’ll break down the time spent in each phase (front end, back end, and linker). The linker itself will output its time breakdown when you pass it /time+, including how much time is spent during ICF.
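
Concretely, those diagnostic switches look like this on a command line (file names are placeholders, and /Bt and /time+ are undocumented switches whose output format may change):

rem Per-phase timing plus a per-function code generation summary for one file
cl /c /Bt /d2cgsummary slowfile.cpp

rem The same code generation summary for LTCG builds, plus the linker's own timing (including ICF)
link /LTCG /d2:-cgsummary /time+ slowfile.obj /OUT:app.exe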

Author

Terry Mahaffey
Principal Software Engineer

RIT class of 04
