Faster C++ builds, simplified: a new metric for time

Avatar

Kevin

C++ Build Insights is a collection of tools and APIs that can help you gain a deeper understanding of your build times. Today, we’d like to teach you about a special metric offered in this toolkit: the wall clock time responsibility (WCTR). It can be used to estimate how much time is really spent by each activity in your build, even in the presence of parallelism. We first describe how to access this metric with vcperf. Then, we delve into a real-world case study of using the metric to guide the creation of a pre-compiled header (PCH) for the CopperSpice open source project. We hope you will find uses for WCTR in your own build analysis scenarios!

How to obtain and use vcperf

The examples in this article make use of vcperf: a tool that allows you to capture a trace of your build and to view it in the Windows Performance Analyzer (WPA). The latest version with WCTR support is available on GitHub.

1. Follow these steps to obtain and configure vcperf and WPA:

  1. Clone the vcperf repository from GitHub. Make sure you are synched to the latest version of the sources.
  2. Open the vcperf.sln solution file in Visual Studio 2019, and build the Release configuration for the architecture of your choice: x86 or x64. Choose the architecture that corresponds to your version of Windows.
  3. The vcperf executable you just built will be available in the out\Release\{architecture} directory.
  4. Obtain WPA by downloading and installing the latest Windows ADK.
  5. Overwrite the perf_msvcbuildinsights.dll file in your WPA installation directory with the one from your vcperf repository. This file is the C++ Build Insights WPA add-in, which must be available to WPA for correctly interpreting the build traces produced by vcperf. You must overwrite this file every time you upgrade your version of vcperf to ensure compatibility.
    1. The location of perf_msvcbuildinsights.dll in your vcperf repository is: packages\Microsoft.Cpp.BuildInsights.{Version}\wpa\{Architecture}. Use the same architecture as your version of Windows.
    2. WPA’s installation directory is typically: C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit.

2. Follow these steps to collect a trace of your build:

  1. Open an elevated command prompt and browse to the location of the vcperf you just built.
  2. Obtain a trace of your build:
    1. Run the following command: vcperf /start MySessionName.
    2. Build your C++ project from anywhere, even from within Visual Studio 2019 (vcperf collects events system-wide).
    3. Run the following command: vcperf /stop MySessionName outputFile.etl. This command will stop the trace, analyze all events, and save everything in the outputFile.etl trace file.
  3. Open the trace you just collected in WPA.

Accessing the WCTR metric in WPA

vcperf will give you a WCTR value for every activity in your build, from compiler and linker invocations down to template instantiations. For this reason, you can find a Wall Clock Time Responsibility column in all C++ Build Insights views: Build Explorer, Files, Template Instantiations, and Functions. If you are not yet familiar with these views, consider bringing yourself up to speed by reading the following articles:

Build ExplorerFinding build bottlenecks with C++ Build Insights
FilesFaster builds with PCH suggestions from C++ Build Insights
Template InstantiationsProfiling template metaprograms with C++ Build Insights
FunctionsImproving code generation time with C++ Build Insights

The combination of WCTR and the aggregation done by WPA is what makes this metric truly valuable. Consider the Files view shown below. It tells us that, in total, the aggregated duration of all the times Windows.h was parsed is 2,754 seconds. However, the duration of the entire build (not shown in the image) is around 2,100 seconds, which doesn’t make sense! What’s happening is that headers are parsed on many threads at the same time, adding seconds to our aggregated duration beyond what’s physically possible. As such, the 2,754 value cannot reliably tell us how much time would be saved if we put the Windows.h header in a PCH. The WCTR value, on the other hand, takes build parallelism into account to provide a more realistic estimate. In the example below, we can see that precompiling Windows.h would save us around 206 seconds of real, wall-clock time.

A screenshot of the Files view in WPA, showing a list of headers and their parsing times. Durations are shown using both the traditional Inclusive Duration metric, and the Wall Clock Time Responsibility metric.

Note: adding a header to a PCH has a time saving effect beyond its own inclusive parsing time. The header itself can include other headers which also become part of the PCH. If these headers are found in multiple inclusion trees, additional savings will occur. For this reason, the WCTR metric tends to underestimate the time that will be saved from adding a header in a PCH. Nonetheless, it is accurate enough to guide your decision-making.

Case study: guiding the PCH creation process for CopperSpice with WCTR

In this case study, we use the WCTR metric to decide which C++ projects within the CopperSpice code base should use a PCH. We proceed to make a simple PCH, improving total build time by 12%.

Use these steps if you would like to follow along:

  1. Make sure you have CMake version 3.16 or above installed and available on your PATH.
  2. Open an elevated x64 Native Tools Command Prompt for VS 2019.
  3. Clone the CopperSpice GitHub repository to a location of your liking.
  4. Change the directory to the root of the freshly cloned repository and run the following command: git checkout 9a97793e0c9f. This is the commit that was used for the case study below.
  5. Run the following command: mkdir build && cd build
  6. Generate the Visual Studio 2019 solution using CMake by running the following command: cmake -G “Visual Studio 16 2019” -A x64 ..
  7. Open the copperspice.sln in Visual Studio 2019. This file was generated in the build folder.
  8. Obtain a trace for a full rebuild of the solution:
    1. From your elevated command prompt, run the following command: vcperf /start CopperSpice
      1. Make sure you use a vcperf executable built from the vcperf GitHub repository. The vcperf that comes with Visual Studio 2019 will only support WCTR starting from the next version, 16.9.
    2. Rebuild the x64 Release configuration of the copperspice.sln solution file in Visual Studio 2019.
    3. Run the following command: vcperf /stop CopperSpice copperspice.etl. This will save a trace of the build in copperspice.etl.
  9. Open the trace in WPA.

We start by bringing up the Build Explorer view and setting it to the Timelines preset. We notice that total build time is around 4,050 seconds based on the timeline at the bottom of the view. We also notice that there are 2 large CL invocations: one in the middle and one on the right. Their numbers are 13 and 21, respectively.

A screenshot of the Build Explorer view in WPA. The mouse pointer is hovering over a large CL invocation at the middle of the timeline. The tooltip next to the mouse pointer indicates that this is CL invocation number 13. The time axis at the bottom of the view indicates a build duration of around 4,050 seconds.

A screenshot of the Build Explorer view in WPA. The mouse pointer is hovering over a large CL invocation on the right side of the timeline. The tooltip next to the mouse pointer indicates that this is CL invocation number 21. The time axis at the bottom of the view indicates a build duration of around 4,050 seconds.

We think that there may be opportunities to speed up these invocations with a PCH, so we bring up the Files view and set it to the Statistics preset. The data shows that, indeed, a lot of time in our build is spent parsing headers.

A screenshot of the Files view in WPA, showing a list of headers and their parsing times for the CopperSpice build. Durations are shown using both the traditional Inclusive Duration metric, and the Wall Clock Time Responsibility metric.

Currently, the Files view is telling us the header parsing time for our build as a whole. We’d like to get a breakdown of parsing time for each invocation individually. To do so, we use the WPA grouping feature to group our data by invocation ID.

An animation showing how to group the information from the Files view by CL invocation. The point clicks on the small gear icon at the top of the view. A popup dialogue opens showing the list of visible columns. The Invocation Id column is dragged using the mouse, and moved above the yellow line at the top of the list. The Apply button is clicked, followed by the OK button. The dialog box closes, and the Files view now shows its information grouped by the Invocation Id column.

The data is now showing a lot of header parsing activity in invocation 21. The top header is qdatastream.h, with an aggregated inclusive parsing duration of 1,987 seconds. We’d like to know whether pre-compiling this header would be worth the effort, so we look at the Wall Clock Time Responsibility value on the right. The field indicates 214 wall clock time seconds. We decide that this is significant enough to proceed.

A screenshot of the File view in WPA, showing the header parsing times for invocation 21 of the CopperSpice build. Durations are shown using both the traditional Inclusive Duration metric and the Wall Clock Time Responsibility metric.

Before jumping into creating our PCH, we also look at the data for invocation 13. The aggregated parsing duration of 807 seconds for qwidget.h could have fooled us into thinking that it’s an excellent PCH candidate. However, WCTR tells us that we’d only save around 47 seconds. Most other headers from this invocation also have a low WCTR. Given this information, we decide that creating a PCH for invocation 13 is not worth it for now.

A screenshot of the File view in WPA, showing the header parsing times for invocation 13 of the CopperSpice build. Durations are shown using both the traditional Inclusive Duration metric and the Wall Clock Time Responsibility metric.

We want to start creating a PCH for invocation 21, but we first need to know which project this invocation corresponds to. We switch back to the Build Explorer view, but this time set it to the Invocation Properties preset to get more information on CL invocation 21. From the Working Directory property, we learn that the project we need to modify is in the webkit directory.

A screenshot of the Build Explorer view in WPA, configured with the Invocation Properties preset. A table shows a list of properties for CL invocation 21. The WorkingDirectory property is highlighted.

We make our way through the CopperSpice repository until we find the CMake configuration file for the webkit project. We use CMake 3.16’s PCH feature to configure the project to add qdatastream.h into a PCH, as we found out earlier that this is the top header for invocation 21. Feel free to look at our GitHub commit to see the necessary CMake change.

Now that we’ve created our PCH for invocation 21, we collect a new trace for a full rebuild of the copperspice.sln solution in the x64 Release configuration. When complete, we load the trace in WPA and excitedly summon the Build Explorer view. As expected, we notice a marked decrease in build time for our CL invocation. Our new total build time has dropped to 3,550, an improvement of 500 seconds, or 12%.

A screenshot of the Build Explorer view in WPA, configured with the Timelines preset. Parallel build timelines for the CopperSpice project are shown. The time axis at the bottom of the view indicates a build duration of around 3,550 seconds.

We also look at the Files view to notice that header parsing time has gone down in CL invocation 22. We see that a lot more has changed in the view than just the qdatastream.h header. That is because adding qdatastream.h to a PCH affected all headers that include it or its children. Now that we have an updated Files view that shows these effects, it would be a good time to continue our PCH by selecting the second header. For the purpose of this case study, we will stop here.

Note: due to MSBuild’s nondeterministic build order, invocation 21 from our previous trace has become invocation 22 in our second trace.

A screenshot of the Files view in WPA, showing header parsing time for CL invocation 22 of the CopperSpice build. Durations are shown using the Wall Clock Time Responsibility metric.

If you’ve been paying attention, you may have noticed that the 500 seconds we saved is greater than the expected WCTR value of 214 seconds for qdatastream.h. As explained at the beginning of this article, WCTR values for the inclusive parsing time of a header file tend to underestimate PCH time savings. Adding a header into a PCH typically causes all other headers it includes to also be in the PCH, causing additional time savings. It’s possible to use the C++ Build Insights SDK to derive a more accurate estimate by taking the file inclusion graph into account. We leave this as an exercise to the reader… or as a topic for a future article!

Understanding wall clock time responsibility

Wall clock time responsibility is a quantity defined by the following integral:

A formula for wall clock time responsibility: WCTR sub A = integral of 1 over c of t from t-naught sub A to t-end sub A.

The symbols used in this formula are described below:

WCTR_AWall clock time responsibility of an activity A.
t0_AStart time of activity A.
te_AEnd time of activity A.
c(t)Function representing the number of concurrent activities at time t.

As you can see, WCTR differs from a simple duration in that it takes into account build parallelism. Consider the simple build timeline shown below.

A graph showing timelines for compiler and linker invocations. The x axis shows time in seconds and the y axis shows threads. 5 parallel compiler invocations are shown on the left, all lasting 5 seconds from second 1 to second 6. A single 3-second linker invocation is shown on the right, from second 7 to second 10.

Let’s say we want to know how much time we waited on the compiler for this build. Looking at the timeline, it’s easy to tell that we waited 5 seconds. But how would we calculate this value if we couldn’t look at the timeline? We can’t just add up the durations of all compilers because the result would be 25. This is longer than the build itself, so it wouldn’t make any sense!

Wall clock time responsibility solves this conundrum by assuming that each compiler invocation is responsible for only a fraction of whatever else is going on at this particular time in the build. In this simple case, each compiler invocation is responsible for one fifth of their duration because there is a total of 5 concurrent compilers working at the same time. They each have a wall clock time responsibility of one second. Adding up the wall clock time responsibility of each compiler now results in the correct value of 5. More formally, the wall clock time responsibility of each compiler for this example is calculated using the following formula:

A formula for discrete wall clock time responsibility of compiler A: WCTR sub A = summation from 1 to 5 of 1 over 5 = 1.

Real build timelines are never this simple! Let’s consider the slightly more complex example shown below.

A graph showing timelines for compiler and linker invocations. The x axis shows time in seconds and the y axis shows threads. 5 parallel compiler invocations with various durations are shown on the left, starting and ending at different points in time. A single 3-second linker invocation is shown on the right, from second 7 to second 10.

Again, it is trivial to tell at a glance that in this build we waited on the compiler for 6 seconds. But how could we derive this value without a picture of the timeline? By adding up their wall clock time responsibilities, of course!

In this case, we can’t just divide every compiler’s duration by 5 to obtain their WCTR because the number of concurrent compilers changes as time goes on. Instead, we do the computation second by second, and add the smaller results together as we go along. For example, the WCTR for the compiler on thread A is computed using the following formula:

A formula for discrete wall clock time responsibility of compiler A: WCTR sub A = summation from 1 to 5 of 1 over c of t = 1 over 1 + 1 over 3 + 1 over 4 + 1 over 5 + 1 over 4 = 2.03

Between seconds 1 and 3, there is only one compiler invocation, so we divide by 1. Between seconds 2 and 3, there are 3 concurrent compiler invocations, so we divide by 3. We continue this process, adding each result together until we reach second 6, at which point the compiler invocation on thread A is over. If you use this process to compute the WCTR of all 5 compiler invocations, you’ll see that adding their WCTRs together gives the correct result of 6.

Tell us what you think!

In this article, we shared how to collect and view the wall clock time responsibility metric by using vcperf and WPA. We hope you’ll find this information useful, and that it’ll give you a better understanding of how time is really spent in your builds.

Have you used WCTR to tweak and tune your builds? Have you written your own application that makes use of this metric with the C++ Build Insights SDK? Let us know in the comments below, on Twitter (@VisualC), or via email at visualcpp@microsoft.com.

 

18 comments

Leave a comment

    • Avatar
      Kevin CadieuxMicrosoft employee

      Hello Ben,

      Thanks for sharing that talk; it was very interesting!

      From what I understand, causal profiling measures hypothetical speedup of a task by artificially making everything else running in parallel slower, and by then observing the effect on overall run time. WCTR does not really compare to that. It’s a metric computed based on traditional start and end times for each task.

      While WCTR is accurate at determining the share of wall clock time allocated to a particular task, it doesn’t know anything about dependencies between tasks. As such, reducing the WCTR of a task might have no effect if another task is running in parallel and delaying the build. Instead, what will happen is that the WCTR of the other task will become larger. This becomes less important if the durations measured are very small (like individual header parsing times, for example). In that case, there isn’t really one big bottleneck, but rather a death-by-a-thousand-cuts type of effect where reducing the WCTR of many small tasks will correlate with an overall reduction in build time.

      I hope this helps,

      Kevin

  • Avatar
    steve heller

    Without making any changes to the generated .sln file other than than setting /MP on, building everything on my machine took about 1400 seconds.
    That’s probably because I have a two-CPU system with 16 physical cores that can run 32 copies of the compiler simultaneously.
    So I guess the moral is “Get a real computer”.

    • Avatar
      Kevin CadieuxMicrosoft employee

      Hi Steve,

      Thanks for your comment. The case study in this blog post also makes use of /MP as a baseline, albeit with a less powerful machine. Even with a 32-core machine, you should be able to use the examples in this blog post to inspect WCTR values and create a PCH for even faster build speeds.

      Cheers,

      Kevin

  • Avatar
    Johan Seland

    I am getting an error when starting vcperf from an elevated command prompt (see below). The error indicate that there already is an ongoing trace.
    The (helpful) error messages tells me to run tracelog -l, but the output from that command does not show any ongoing MSVC_BUILD_INSIGHTS_SESSION.

    Could this error message also trigger on a different error condition?

    c:\projects\vcperf\out\Release\x64>vcperf.exe /start NinjaVS2015
    Microsoft (R) Visual C++ (R) Performance Analyzer DEVELOPER VERSION
    Starting tracing session NinjaVS2015...
    Failed to start trace.
    A trace that is currently being collected on your system is preventing vcperf from starting a new one. This can occur if you forgot to stop a vcperf trace prior to running the start command, or if processes other than vcperf have started ETW traces of their own. Please try running the vcperf /stop or /stopnoanalyze commands on your previously started traces. If you do not remember the session name that was used for starting a previous vcperf trace, or if you don't recall starting one at all, you can use the 'tracelog -l' command to list all ongoing tracing sessions on your system. Your currently ongoing vcperf trace will show up as MSVC_BUILD_INSIGHTS_SESSION_. You can then issue a vcperf /stop or /stopnoanalyze command with the identified session name.
    • Avatar
      Kevin CadieuxMicrosoft employee

      Hello Johan,

      In addition to the regular MSVC events, vcperf also requires kernel ETW events when performing an analysis. For this reason, you can see this error message if a kernel trace is already running on your system when executing the vcperf /start command. You can tell if this is the case by running the tracelog -l command in an elevated (won’t work with regular) command prompt. If a session named NT Kernel Logger is found in the output, it means a kernel trace is currently being collected. You can stop a kernel trace by running the xperf -stop command in an elevated command prompt.

      I hope this helps,

      Kevin