Faster C++ Iteration Builds

Russ Keldorph

We made improvements to C++ link time earlier in Visual Studio 2019, and we have more improvements to tell you about. As of version 16.7, we measure up to 5X improvement in some incremental linking and debugging scenarios and up to a 1.5X speedup in full linking. These represent some of the improvements The Coalition saw in their recent experiment. Keep reading to find out the details.

After the link time improvements in versions 16.0 and 16.2, we took a step back and re-evaluated the complete edit-build-debug (“inner loop”) experience of C++ developers. We were still looking at large projects like AAA games and Chrome because large projects are most likely to suffer from longer iteration times. We found a couple of opportunities that looked promising and went after them. The first, in Visual Studio 2019 version 16.6, is an improvement to some of the algorithms inside the Program Database (PDB) and Debug Interface Access (DIA) components, which are the libraries that enable writing and reading debug information respectively. The second, in Visual Studio 2019 version 16.7, is an optimization to speed up the worst case Incremental Linking time, which can be as bad as or worse than a full link.

Faster Debug Information

Program Database (PDB) creation is often the bottleneck when linking binaries, and for large, monolithic codebases, linking ends up being a very long pole at the end of the critical path. Furthermore, PDB reading is a significant contributor to delays when debugging large projects. It features prominently in profiles when hitting breakpoints and single-stepping—particularly when the developer has multiple debug windows like the Call Stack and Watch windows open in Visual Studio.

In our private benchmarks, these improvements showed some big gains in AAA Games and other large scenarios. The following chart has some examples of the improvements we saw.

Chart showing sample improvements for some common iteration build operations between version 16.5 and 16.6: link time (5 sec -> 2.5 sec), time to switch callstack frame (0.9 sec -> 0.2 sec), and initial PDB load time (53 sec -> 32 sec).

Note that the absolute time deltas in the chart are examples taken from different projects. However, all are indicative of the type of speedup we saw across multiple projects. That is, they are not cherry-picked outliers. To summarize, we often saw:

  • Up to 1.5X speedup for full linking
  • Up to 4X speedup in switching active function on call stack with many variables
  • 2X speedup of initial PDB load

Perhaps more compelling, though, is that since version 16.6 was released, the time to enter break state after a single step is faster by about 2X on average. The actual benefit depends on the size of your project and the number of debugger windows (watch, callstack, etc.) you have open, but the good news is that users who encountered stepping delays in the past are likely to notice improvements in version 16.6.

What We Did

For version 16.6, we profiled some common developer scenarios and found several opportunities to improve the code that both reads and writes debug information. Below are some examples of the types of algorithmic improvements we made.

  1. Avoid search by Relative Virtual Address (RVA) by caching the result of the previous request, which in 99% of cases uses the same RVA
  2. Compute older CRC-32 hash for type records on-demand (gives the most speedup in /Zi full link)
  3. Create fast-path for the VS debugger’s query pattern
  4. Improve memory-mapped file reading by using AVX-based memcpy tuned for multiples of the page size
  5. Use C++ std::sort instead of qsort
  6. Use integer division by a constant (e.g. page size) rather than division by a variable
  7. Reuse rather than rebuild hash tables
  8. Avoid virtual function calls and manually inline code for the two most common symbol lookups
  9. Prefetch PDB data in some cases
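Items 5 and 6 in the list above are easy to illustrate in isolation. The following is a minimal C++17 sketch, not the actual PDB/DIA code: `SymbolRecord`, the function names, and the 4096-byte `kPageSize` are assumptions made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical record type; the post does not describe the real PDB structures.
struct SymbolRecord {
    std::uint32_t rva;
    std::uint32_t id;
};

// qsort reaches the comparator through a function pointer on every comparison,
// which the optimizer usually cannot inline.
static int compare_by_rva(const void* a, const void* b) {
    const auto* lhs = static_cast<const SymbolRecord*>(a);
    const auto* rhs = static_cast<const SymbolRecord*>(b);
    return (lhs->rva > rhs->rva) - (lhs->rva < rhs->rva);
}

void sort_with_qsort(std::vector<SymbolRecord>& records) {
    if (records.empty()) return;
    std::qsort(records.data(), records.size(), sizeof(SymbolRecord), compare_by_rva);
}

// std::sort receives the comparator as a template argument, so the comparison
// can be inlined into the sorting loop.
void sort_with_std_sort(std::vector<SymbolRecord>& records) {
    std::sort(records.begin(), records.end(),
              [](const SymbolRecord& l, const SymbolRecord& r) { return l.rva < r.rva; });
}

// Assumed page size, for illustration only.
constexpr std::uint32_t kPageSize = 4096;

// Division by a compile-time constant: a power of two like this one can be
// lowered to a shift instead of a divide instruction.
std::uint32_t page_index_constant(std::uint32_t offset) {
    return offset / kPageSize;
}

// Division by a runtime value: the compiler generally has to emit a real divide.
std::uint32_t page_index_variable(std::uint32_t offset, std::uint32_t page_size) {
    assert(page_size != 0);
    return offset / page_size;
}
```

Both sort functions produce the same ordering and both division helpers produce the same quotient. The difference is only in how much work the compiler can remove, which is why changes like these show up in profiles of hot loops rather than in program behavior.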

Note that the first item, caching the previous request’s result, was responsible for the vast majority of the PDB reading wins.
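To make the idea behind that first item concrete, here is a small C++17 sketch of a lookup that remembers its previous request. It is an illustration under assumptions, not the DIA implementation: `FunctionRange`, `RvaLookup`, and the sorted-vector search are hypothetical stand-ins for whatever search the real code performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical description of a function's address range.
struct FunctionRange {
    std::uint32_t begin_rva;
    std::uint32_t size;
    std::uint32_t symbol_id;
};

class RvaLookup {
public:
    explicit RvaLookup(std::vector<FunctionRange> ranges) : ranges_(std::move(ranges)) {
        std::sort(ranges_.begin(), ranges_.end(),
                  [](const FunctionRange& l, const FunctionRange& r) {
                      return l.begin_rva < r.begin_rva;
                  });
    }

    // Returns the symbol id of the function containing `rva`, if any.
    std::optional<std::uint32_t> find(std::uint32_t rva) {
        // Fast path: the same RVA as the previous request skips the search.
        if (has_last_ && rva == last_rva_) {
            ++cache_hits_;
            return last_result_;
        }

        // Slow path: binary search for the last range starting at or before `rva`.
        std::optional<std::uint32_t> result;
        auto it = std::upper_bound(
            ranges_.begin(), ranges_.end(), rva,
            [](std::uint32_t value, const FunctionRange& r) { return value < r.begin_rva; });
        if (it != ranges_.begin()) {
            const FunctionRange& candidate = *(it - 1);
            if (rva - candidate.begin_rva < candidate.size) {
                result = candidate.symbol_id;
            }
        }

        // Remember this request (including a miss) for the next call.
        has_last_ = true;
        last_rva_ = rva;
        last_result_ = result;
        return result;
    }

    std::size_t cache_hits() const { return cache_hits_; }

private:
    std::vector<FunctionRange> ranges_;
    bool has_last_ = false;
    std::uint32_t last_rva_ = 0;
    std::optional<std::uint32_t> last_result_;
    std::size_t cache_hits_ = 0;
};
```

A single-entry cache like this is only worthwhile when consecutive requests repeat, which is exactly the pattern the list describes (the same RVA in 99% of cases). With a more scattered query pattern it would add a comparison per call and save nothing.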

Better Worst-case Incremental Linking

Incremental linking is one of the most time-saving features of our toolset. It allows developers to iterate quickly when making common source changes in large projects by reusing most of the results of earlier links and strategically applying the differences made in the last source edit. However, it can’t accommodate all source changes and will sometimes be forced to fall back on full linking. In that case the overall incremental link time can actually be worse than a full link, since incremental linking spends time figuring out that it can’t proceed before starting over from scratch. It makes sense that high-impact edits, like changing compiler or linker options or touching a widely-included header file, require a rebuild, but simply adding a new object (.obj) file will also trigger a full re-link. For many developers, this isn’t a big deal since they rarely add new object files and/or full linking isn’t terribly long anyway. However, if you work on large binaries or you use a coding style or project system (like some variants of a Unity build) that commonly results in object files being added or removed, the hit to incremental link time can be tens of seconds or more. Unfortunately, these limitations are fundamental to the design of incremental linking, and removing them would mean slowing down the most common case that incremental linking is optimized for: simple source edits to small numbers of existing translation units.

Type Merge Cache

In version 16.7, though we couldn’t reasonably make incremental linking work in more cases, we realized that we could improve how long it takes to link when we must fall back on full linking. The key insights were:

  1. Most of the time for a full link is spent generating debug information, and
  2. Generating correct debug information is much more forgiving than correctly linking an executable binary.

Conceptually similar to how incremental linking works, we added the ability to cache the results of earlier debug information generation (specifically, the result of type merging) and reuse that during subsequent links. This technique can mean drastic speedups (2X-5X) in link time when incremental linking falls back on full linking. The following chart has some examples of the impact on three AAA Game projects and Chrome.

Chart showing worst-case incremental link time difference between versions 16.6 and 16.7 for three AAA games and Chrome. Speedups range from 1.5X to 5.5X.
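The general shape of such a cache can be sketched in a few lines of C++17. This is a deliberately simplified model and not the linker's actual data structures or PDB layout, which this post does not describe: `ObjectFile`, its `signature` field, `TypeMerger`, and the use of strings as type records are all assumptions for illustration. Type merging here means deduplicating each object file's type records into one global table and recording where each local record landed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TypeRecord = std::string;   // stand-in for a serialized type record
using TypeIndex = std::uint32_t;  // index into the merged (global) type table

// Hypothetical view of one linker input.
struct ObjectFile {
    std::string name;
    std::uint64_t signature;        // assumed stamp that changes when the .obj is rebuilt
    std::vector<TypeRecord> types;  // type records in local index order
};

class TypeMerger {
public:
    // Returns the local-to-global index mapping for `obj`.
    std::vector<TypeIndex> merge(const ObjectFile& obj) {
        // Reuse the earlier result if this exact object was merged before.
        auto cached = cache_.find(obj.name);
        if (cached != cache_.end() && cached->second.signature == obj.signature) {
            ++cache_hits_;
            return cached->second.local_to_global;
        }

        // Otherwise do the expensive part: look up every record and deduplicate.
        std::vector<TypeIndex> mapping;
        mapping.reserve(obj.types.size());
        for (const TypeRecord& record : obj.types) {
            const auto next_index = static_cast<TypeIndex>(global_types_.size());
            auto [it, inserted] = global_index_.try_emplace(record, next_index);
            if (inserted) {
                global_types_.push_back(record);
            }
            mapping.push_back(it->second);
        }
        assert(global_types_.size() == global_index_.size());

        cache_[obj.name] = CachedMerge{obj.signature, mapping};
        return mapping;
    }

    std::size_t global_type_count() const { return global_types_.size(); }
    std::size_t cache_hits() const { return cache_hits_; }

private:
    struct CachedMerge {
        std::uint64_t signature = 0;
        std::vector<TypeIndex> local_to_global;
    };

    std::vector<TypeRecord> global_types_;
    std::unordered_map<TypeRecord, TypeIndex> global_index_;
    std::unordered_map<std::string, CachedMerge> cache_;
    std::size_t cache_hits_ = 0;
};
```

In this model, a second call to `merge` with an unchanged object returns immediately, while an object whose signature changed is merged again. The real feature persists its cache between linker runs, whereas this sketch keeps everything in one in-memory object for brevity.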

This caching does have some downsides, though:

  1. The cached data is stored in the PDB file, which is therefore larger, and
  2. The first (clean) link of an incremental build takes slightly longer since the cache must be built up.

The following table captures the benefits as well as downsides for the above projects.

Project   Initial link time   PDB size   Subsequent full link time
Game X    10%                 35.1%      -48.8%
Game Y    1.4%                31.8%      -81.1%
Game Z    3.4%                27.9%      -64.2%
Chrome    10.9%               10.1%      -29.4%

The “Subsequent full link time” column corresponds to a scenario where incremental linking is enabled (/INCREMENTAL) but had to fall back on full linking, such as when a new object file is introduced. As you can see, the impact of this new cache can be substantial when the full link time is measured in tens of seconds or minutes.
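The table reports percentage changes in time, while other parts of this post talk in terms of "X" speedup factors. Converting between the two is simple arithmetic, sketched below in C++ (the function name `speedup_factor` is just a label chosen for this example).

```cpp
#include <cassert>

// Converts a percentage change in elapsed time into a speedup factor.
// A change of -50 means the operation takes half as long, i.e. a 2X speedup;
// a positive change (a slowdown) yields a factor below 1.
double speedup_factor(double percent_change) {
    assert(percent_change > -100.0);  // -100% would mean zero elapsed time
    return 1.0 / (1.0 + percent_change / 100.0);
}

// Applied to the "Subsequent full link time" column:
//   Game X: -48.8%  ->  1 / 0.512  (about 1.95X)
//   Game Y: -81.1%  ->  1 / 0.189  (about 5.29X)
//   Game Z: -64.2%  ->  1 / 0.358  (about 2.79X)
//   Chrome: -29.4%  ->  1 / 0.706  (about 1.42X)
```

The same function applied to the "Initial link time" column gives factors slightly below 1, for example 1 / 1.10 (about 0.91X) for Game X, which is the cost of building the cache on the first link.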

It’s interesting to note that the cache could be used for any full linking scenarios, not just the case when incremental linking must fall back to a full link. However, because of the drawbacks, it’s only on-by-default when incremental linking is used. Release builds and builds where incremental linking is disabled (/INCREMENTAL:NO) won’t see an impact unless the new /PDBTMCACHE linker switch is specified. Similarly, the /PDBTMCACHE:NO switch can be used to disable the cache creation and return to version 16.6 behavior if desired. Note that the linker does not rely on the presence of the cache. If the cache is present and passes validation, the linker will use it to accelerate linking, but a missing cache or a cache that has been invalidated is silently ignored.
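The rules in the paragraph above amount to a small decision table, modeled here in C++17. The enum and function names are hypothetical and the code only restates the behavior described in this post; it is not how the linker is implemented.

```cpp
#include <cassert>

// How the switch appeared on the linker command line, if at all.
enum class CacheSwitch {
    Unspecified,  // neither form given
    On,           // /PDBTMCACHE
    Off           // /PDBTMCACHE:NO
};

// What the linker finds when it looks for a cache from an earlier link.
enum class CacheState { Missing, Invalidated, Valid };

// An explicit switch wins; otherwise the cache follows incremental linking.
constexpr bool cache_enabled(bool incremental_linking, CacheSwitch cache_switch) {
    switch (cache_switch) {
        case CacheSwitch::On:          return true;
        case CacheSwitch::Off:         return false;
        case CacheSwitch::Unspecified: return incremental_linking;
    }
    return incremental_linking;  // not reached; keeps compilers quiet
}

// The cache only accelerates a link when it is enabled and passes validation.
// A missing or invalidated cache is not an error: the link simply proceeds
// without it.
constexpr bool link_uses_cache(bool enabled, CacheState state) {
    return enabled && state == CacheState::Valid;
}

static_assert(cache_enabled(true, CacheSwitch::Unspecified), "on by default with /INCREMENTAL");
static_assert(!cache_enabled(false, CacheSwitch::Unspecified), "off by default with /INCREMENTAL:NO");
static_assert(cache_enabled(false, CacheSwitch::On), "/PDBTMCACHE opts in");
static_assert(!cache_enabled(true, CacheSwitch::Off), "/PDBTMCACHE:NO opts out");
static_assert(!link_uses_cache(true, CacheState::Missing), "missing cache is ignored");
```

The `static_assert` lines double as a compact summary of the four switch combinations.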

Future work

We know there are at least a few people for whom the PDB size impact of the Type Merge Cache will be a concern, so, in the future, we might consider placing the cache in a separate file. We didn’t put it in the incremental link file (.ilk) because the feature isn’t fundamentally tied to incremental linking—that’s why there is a switch to control it independently.

In a future blog post, we’ll share the details of further link time improvements in version 16.8!

Upgrade today and let us know about the difference you see

We profiled developer inner loops in several scenarios that we track, and we tamped down a couple of hot spots that stood out in PDB reading and writing and incremental link fall-back. Did you notice a difference when upgrading from version 16.5 or earlier to 16.6 and/or 16.7? If so, we’d love to hear about it in the comments below or via email (visualcpp@microsoft.com). If you’ve got a problem or would like to share other feedback, please use Help > Send Feedback > Report A Problem / Provide A Suggestion in Visual Studio or go to Developer Community. You can also find us on Twitter (@VisualC). And, of course, if you haven’t tried Visual Studio 2019 yet, we’d love for you to download it and give it a try.


10 comments


  • Juan Abadia

    Any opportunities to use multithreading during the link phase and use all the available cores?

    • Russ Keldorph (Microsoft employee)

      Yes! Some phases of linking are more amenable to parallelizing than others, but there are still opportunities to do better. I just briefly mentioned it above, but the version 16.8 linker can use more threads to generate debug info. An example of the results was discussed in an earlier post, and we plan to follow up with more technical details in the future.

    • Gratian Lup (Microsoft employee)

      Hi Juan,

      If you use 16.8 you will already see that 🙂 There will be another blog post in the following weeks with details about the multi-threading of the PDB generation that was released in 16.8.

      On the large AAA games we tested, the speedup was between 2.5x and 4.5x. If the machine has enough cores (like 8, or 16 with HyperThreading/SMT), you will get 6 PDB threads + the 2 initial threads, one doing linker work and another doing some PDB work + dispatching work to those 6. With fewer cores the number of PDB threads also scales back, to 4 or 2. As I replied below to Chris Kline, this is only about /debug:full linking, not fastlink, since the point of the work is to avoid the need for fastlink.

      Thanks,
      Gratian

  • Chris Kline

    We have been unable to take full advantage of these linker improvements for a few years now due to the issues with /Debug:FASTLINK causing debugger stepping to be incredibly slow: https://developercommunity.visualstudio.com/content/problem/698861/vs2019-1621-c-debugger-seems-to-be-slow-at-steppin.html

    This continues to be a problem for us even in VS 16.8.3 on the latest 14.28 C++ toolchain.

    Unfortunately, until it’s solved, we can’t take advantage of these incremental build cycle improvements because it makes debugging unusable.

    • Gratian Lup (Microsoft employee)

      Hi Chris,

      I worked on a large part of these improvements between 16.6-16.8: your problem is still using /debug:fastlink, which can have the debugging issues that you mention and the “solution” was to make fastlink be no longer needed by making the usual debug info generation be as fast (or close enough) to fastlink.

      In the projects we tested (several big AAA games and large projects like Chrome and LLVM), in 16.8 with the multi-threading work, there is either no time difference between debug full/fastlink, or it’s small enough (1-2sec out of 20-30) that teams are switching away from fastlink now. And of course debugging works as it should, you can copy PDBs to other machines without worries, and other scenarios that were not working right with fastlink.

      The numbers in this post are all for /debug:full, not fastlink. The previous post about the Gears of War linker speedup, is also about /debug:full, which is now within 1s compared to fastlink. The speedup between the 16.7 and 16.8 linker is from the multi-threading of the full PDB generation – some games actually saw 4x speedup there, 2x is on the lower side.

      https://devblogs.microsoft.com/cppblog/the-coalition-sees-27-9x-iteration-build-improvement-with-visual-studio-2019/

      Let me know if you have other questions.

      Thanks,
      Gratian

  • 旭 姚

    The PDB file size is limited to 4GB. In practice, on a big MMO game, we exceeded that limit when compiling with /debug:fastlink.

    • 旭 姚

      When compiling with /debug:fastlink, we decreased the PDB size by stripping ‘DEBUG_S_SYMBOL’, and the debugger still works with the resulting PDB!
      So what does ‘DEBUG_S_SYMBOL’ mean?

  • Rudy Pons

    I’ve had issues with the PDB size increasing with every build, especially when touching a header file that impacts many obj files. The PDB could grow by more than 100MB with every change -> build iteration, quickly causing a compilation error because the PDB is >4GB and forcing me to delete it manually.
    Another issue is that, depending on the amount of change, the incremental link is now longer than a full link. When impacting 100 obj files on our project, an incremental link takes ~2 minutes, while the full link takes 30s (with unity/jumbo builds, I even managed to get incremental links > 15 minutes). Is there any way to trigger a full link when fewer obj files have changed?

    • Michael

      Have you reported this on Developer Community yet? I’m also seeing the perpetually increasing PDB size issue.

  • Michael

    This at least makes me excited to know that something has changed and I’m not imagining things. So: is there a tool that can inspect PDB files to look for corruption? Starting with 16.8 I have seen many cases where VS and WinDbg both report corrupt debug records for some functions (but not all). It seems to be related to the fact that for most of our builds we use the Hostx86 cl.exe and link.exe (because the PCH files generated by the Hostx64 toolchain are too large), but the 64bit mspdbsrv.exe (because the 32bit version exhausts address space). For some builds we don’t mix-and-match, and that seems to reliably produce valid PDBs, but that may just be a sampling bias. It also seems likely that the mix-and-match is the problem because otherwise it would have been caught prior to release.

    Obviously I want to report this through the proper DevComm channels, but when most functions have valid debug info, it’s hard to know when I have enough of a reliably reproducing case.
