Profiling template metaprograms with C++ Build Insights

Avatar

Kevin

The use of templates in C++ programs can sometimes lead to longer builds. C++ Build Insights puts tools at your disposal for analyzing template usage patterns and their associated impact on build time. In this article, we show you how to use the vcperf analysis tool and the C++ Build Insights SDK to understand and fix problematic template patterns. We demonstrate the practical use of these tools through a case study in which we reduced build time for the Sprout open-source metaprogramming library by 25%. We hope these methods will help you approach your template endeavors with confidence!

How to obtain and use vcperf

The examples in this article make use of vcperf, a tool that allows you to capture a trace of your build and to view it in the Windows Performance Analyzer (WPA). The latest version is available in Visual Studio 2019.

1. Follow these steps to obtain and configure vcperf and WPA:

  1. Download and install the latest Visual Studio 2019.
  2. Obtain WPA by downloading and installing the latest Windows ADK.
  3. Copy the perf_msvcbuildinsights.dll file from your Visual Studio 2019’s MSVC installation directory to your newly installed WPA directory. This file is the C++ Build Insights WPA add-in, which must be available to WPA for correctly displaying the C++ Build Insights events.
    1. MSVC’s installation directory is typically: C:\Program Files (x86)\Microsoft Visual Studio\2019\{Edition}\VC\Tools\MSVC\{Version}\bin\Hostx64\x64.
    2. WPA’s installation directory is typically: C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit.
  4. Open the perfcore.ini file in your WPA installation directory and add an entry for the perf_msvcbuildinsights.dll file. This tells WPA to load the C++ Build Insights add-in on startup.

You can also obtain the latest vcperf and WPA add-in by cloning and building the vcperf GitHub repository. Feel free to use your built copy in conjunction with Visual Studio 2019!

2. Follow these steps to collect a trace of your build:

  1. Open an elevated x64 Native Tools Command Prompt for VS 2019.
  2. Obtain a trace of your build:
    1. Run the following command: vcperf /start /level3 MySessionName. The /level3 option enables the collection of template events.
    2. Build your C++ project from anywhere, even from within Visual Studio (vcperf collects events system-wide).
    3. Run the following command: vcperf /stop /templates MySessionName outputFile.etl. This command will stop the trace, analyze all events, including template events, and save everything in the outputFile.etl trace file.
  3. Open the trace you just collected in WPA.

Viewing template information in WPA

The most time-consuming activity related to templates for the compiler is to instantiate them. C++ Build Insights provides a WPA view called Template Instantiations that allows you to see the instantiation time of the top templates in your program. After opening your trace in WPA, you can open this view by dragging it from the Graph Explorer pane to the Analysis window, as shown below.

Image that demonstrates how to open the Template Instantiations view in WPA.

A note on the presence of template events in a vcperf trace and WPA

If you do not see a Template Instantiations view in the Graph Explorer pane, make sure that you have completed the WPA configuration steps from the How to obtain and use vcperf section correctly, and that you have passed the right arguments to vcperf when starting and stopping your trace.

For scalability reasons, vcperf only writes in the output trace the templates that take the most time to instantiate in your build. If template instantiations were not a significant factor in your build times, vcperf will omit template information, and the Template Instantiations view will not appear.

Case study: speeding up the Sprout template metaprogramming library

In this case study, we use a real open source project from GitHub to show how you can use vcperf and WPA to diagnose and treat long build times in template metaprogramming code. Specifically, we demonstrate how we’ve used these tools to reduce the build time for the Sprout library by around 25%.

Use these steps if you would like to follow along:

  1. Clone the Sprout C++ Libraries repository from GitHub.
  2. Checkout the following commit: 6b5addba9face0a.
  3. Obtain a trace for a full rebuild of the Sprout tests:
    1. Open an elevated x64 Native Tools Command Prompt for VS 2019.
    2. Run the following command: vcperf /start /level3 Sprout
    3. Build the Sprout tests by running the following command: cl /std:c++latest /D_HAS_DEPRECATED_IS_LITERAL_TYPE=1 /D_SILENCE_CXX17_IS_LITERAL_TYPE_DEPRECATION_WARNING /EHsc /I. /constexpr:steps100000000 .\testspr\sprout.cpp.
    4. Run the following command: vcperf /stop /templates Sprout sprout.etl. This will save a trace of the build in sprout.etl. The trace will contain template information.
  4. Open the trace in WPA.

We open the Build Explorer and Template Instantiations views one on top of the other. The Build Explorer view indicates that the build lasted around 13.5 seconds. This can be seen by looking at the time axis at the bottom of the view (labeled A). The Template Instantiations view shows a locus of condensed template instantiation activity somewhere between time 8 and 10.5 (labeled B).

Image of WPA with the Build Explorer and Template Instantiation view open. The Build Explorer view shows a total duration of around 13.5 seconds. The Template Instantiations view shows an area that contains many time-consuming template instantiations.

By default, all template specializations are grouped by the primary template’s name. For example, the std::vector<int> and std::vector<double> specializations would both be grouped under the std::vector primary template name. In our case, we would like to know if there is one specialization in particular that causes an issue, so we reorganize the view’s columns such that the entries are grouped by Specialization Name instead. This operation is shown below.

Image that shows how to configure the Template Instantiations view to group entries by full specialization name instead of primary template name.

We notice that there is one large instantiation of the sprout::tpp::all_of template that is responsible for around 2.15 seconds. There are also 511 instantiations of sprout::tpp::detail::all_of_impl. We surmise that sprout::tpp::all_of is a root template that calls into a highly recursive sprout::tpp::detail::all_of_impl template. This discovery is shown below.

Image of the Template Instantiations view in WPA, which shows that the longest template instantiation is for the sprout::tpp::all_of struct. The duration of this template instantiation alone is 2.16 seconds.

Investigating the codebase

We find that the culprit sprout::tpp::all_of template instantiation is triggered by a call to operator() of the following type in sprout\random\shuffle_order.hpp:

typedef sprout::random::shuffle_order_engine<
    sprout::random::minstd_rand0, 256> knuth_b;

This type internally contains a compile-time array of 256 elements, which ends up being passed to the default_remake_container function in the sprout\container\container_construct_traits.hpp header. This function has the following three template definitions. Function bodies have been replaced with comments for simplicity.

template<typename Container, typename Cont, typename... Args>
inline SPROUT_CONSTEXPR typename std::enable_if<
    sprout::is_fixed_container<Container>::value,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container(Cont&&, 
    typename sprout::container_traits<Container>::difference_type, 
    Args&&... args) 
{
    // Default case
}

template<typename Container, typename Cont, typename... Args>
inline SPROUT_CONSTEXPR typename std::enable_if<
    !sprout::is_fixed_container<Container>::value
        && !(sizeof...(Args) == 2 && sprout::tpp::all_of<
                sprout::is_input_iterator<
                    typename std::remove_reference<Args>::type>...>::value)
        ,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container(Cont&&, 
    typename sprout::container_traits<Container>::difference_type, 
    Args&&... args) 
{
    // Default case
}

template<typename Container, typename Cont, typename InputIterator>
inline SPROUT_CONSTEXPR typename std::enable_if<
    !sprout::is_fixed_container<Container>::value,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container(Cont&& cont, 
    typename sprout::container_traits<Container>::difference_type, 
    InputIterator first, InputIterator last) 
{
    // Special case when we have a non-fixed container with 2
    // input iterator args.
}

These definitions are all enabled or disabled based on certain conditions using the std::enable_if standard type traits class. Can you spot the sprout::tpp::all_of call in the second definition’s std::enable_if condition? We replicate it below:

!(sizeof...(Args) == 2 && 
    sprout::tpp::all_of<sprout::is_input_iterator<
        typename std::remove_reference<Args>::type>...>::value)

Looking at the condition as a whole, we can see that sprout::tpp::all_of does not need to be evaluated if default_remake_container is called with a number of arguments different from 2. In our case we have 256 arguments and we know that the condition will be false regardless of what sprout::tpp::all_of returns. In the eyes of the compiler, this does not matter. It will still evaluate sprout::tpp::all_of on our 256 arguments when trying to resolve the call to default_remake_container, resulting in a large, time-consuming recursive template instantiation.

Finding a new solution

We fix this situation by adding a level of indirection between default_remake_container and the sprout::tpp::all_of call. We first branch on the number of arguments:

template<typename Container, typename Cont, typename... Args>
inline SPROUT_CONSTEXPR typename std::enable_if<
    sizeof...(Args) == 2,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container(Cont&& cont, 
    typename sprout::container_traits<Container>::difference_type, 
    Args&&... args) 
{
    return default_remake_container_two_args<Container>(
        SPROUT_FORWARD(Cont, cont), SPROUT_FORWARD(Args, args)...);
}

template<typename Container, typename Cont, typename... Args>
inline SPROUT_CONSTEXPR typename std::enable_if<
    sizeof...(Args) != 2,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container(Cont&& cont, 
    typename sprout::container_traits<Container>::difference_type, 
    Args&&... args) 
{
    // Default case
}

Only when an argument count of 2 is confirmed do we evaluate sprout::tpp::all_of through a new function called default_remake_container_two_args:

template<typename Container, typename Cont, typename... Args>
inline SPROUT_CONSTEXPR typename std::enable_if<
    sprout::is_fixed_container<Container>::value || 
        !sprout::tpp::all_of<sprout::is_input_iterator<
            typename std::remove_reference<Args>::type>...>::value
        ,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container_two_args(Cont&&, Args&&... args) {
    // Default case
}

template<typename Container, typename Cont, typename InputIterator>
inline SPROUT_CONSTEXPR typename std::enable_if<
    !sprout::is_fixed_container<Container>::value,
    typename sprout::container_construct_traits<Container>::copied_type
>::type
default_remake_container_two_args(Cont&& cont, 
    InputIterator first, InputIterator last) 
{
    // Special case when we have a non-fixed container with 2 
    // input iterator args.
}

You can see the full details of this code change in our fork on GitHub.

Evaluating the final result

After fixing the code, we collect a second trace of the compilation using the same steps that we covered at the beginning of the case study. We notice that the build time has decreased by around 25% to a total of approximately 9.7 seconds. The Template Instantiations view is also gone, meaning that template instantiations are no longer an issue in this build. Victory!

Image of the Build Explorer view in WPA, which shows that the Sprout build has been reduced to around 9.7 seconds.

Identifying problematic template instantiations using the C++ Build Insights SDK

Recursive, time-consuming template instantiations are not an uncommon problem in codebases that use template metaprogramming, so we want to identify these problems faster in the future without having to go through the hassle of launching WPA and inspecting a trace manually. Luckily, most analysis tasks performed manually with vcperf and WPA can also be performed programmatically using the C++ Build Insights SDK. To illustrate this point, we’ve prepared the RecursiveTemplateInspector SDK sample. It prints out the most time-consuming template instantiation hierarchies in a build, along with statistics about them such as recursion tree depth, total number of instantiations and the root specialization name that started it all.

Let’s repeat the Sprout case study from the previous section, but this time by using the RecursiveTemplateInspector to see what it finds. Use these steps if you want to follow along:

  1. Clone the C++ Build Insights SDK samples GitHub repository on your machine.
  2. Build the Samples.sln solution, targeting the desired architecture (x86 or x64), and using the desired configuration (debug or release). The sample’s executable will be placed in the out/{architecture}/{configuration}/RecursiveTemplateInspector folder, starting from the root of the repository.
  3. Follow the steps from the Case study: speeding up a template metaprogramming library section to collect a trace of the Sprout compilation. Use the vcperf /stopnoanalyze Sprout sprout-raw.etl command instead of the /stop command when stopping your trace.
  4. Pass the sprout-raw.etl trace as the first argument to the RecursiveTemplateInspector executable.

As shown below, RecursiveTemplateInspector correctly identifies the sprout::tpp::all_of template instantiation that has been causing us problems, which triggers other instantiations recursively for a total of 4043 instantiations.

Image that shows the output of the RecursiveTemplateInspector sample for the problematic Sprout codebase in a command-line prompt. The top template instantiation hierarchy identified by the sample is for the sprout::tpp::all_of struct. The instantiation lasts for 2168 milliseconds, has a maximum recursion tree depth of 20, and is comprised of 4043 instantiations.

Rerunning RecursiveTemplateInspector on our fixed codebase shows that all problematic template instantiations are gone. The ones that remain have a low duration and can be ignored.

Image that shows the output of the RecursiveTemplateInspector sample for the fixed Sprout codebase in a command-line prompt. The sprout::tpp::all_of template instantiation has disappeared from the list. The new top template instantiations have a low duration and can be ignored.

Understanding the sample code

We first filter all stop activity and simple events by asking the C++ Build Insights SDK to forward what we need to the OnTemplateRecursionTreeBranch, and OnSymbolName functions. The name of the functions has no effect on how the C++ Build Insights SDK will filter the events; only their parameters matter. Continue reading to understand what is actually being filtered.

AnalysisControl OnStopActivity(const EventStack& eventStack)
    override
{
    MatchEventStackInMemberFunction(eventStack, this,
        &RecursiveTemplateInspector::OnTemplateRecursionTreeBranch);

    return AnalysisControl::CONTINUE;
}

AnalysisControl OnSimpleEvent(const EventStack& eventStack)
    override
{
    MatchEventStackInMemberFunction(eventStack, this,
        &RecursiveTemplateInspector::OnSymbolName);

    return AnalysisControl::CONTINUE;
}

We use the OnTemplateRecursionTreeBranch function to capture each branch of the template instantiation recursion tree one by one. Because C++ Build Insights events are represented as stacks, capturing recursive relationships between events is something that can be done with little effort. The TemplateInstantiationGroup capture class automatically unwinds the event stack, and presents all template instantiations that occurred in the parsing thread as a vector-like C++ container ordered from root to leaf. Because we tied the OnTemplateRecursionTreeBranch function to the stop activity event, we will always receive the branches in a given recursion tree at the point where the parsing thread is coming back from the deepest level. We take advantage of this fact to compute the maximum depth of the recursion tree as we process all its branches. Once the root instantiation itself reaches its stop event, we wrap up the instantiation tree by storing the total instantiation time for the tree, along with the translation unit in which it occurred.

void OnTemplateRecursionTreeBranch(FrontEndPass fe, 
    TemplateInstantiationGroup recursionTreeBranch)
{
    const TemplateInstantiation& root = recursionTreeBranch[0];
    const TemplateInstantiation& current = recursionTreeBranch.Back();

    auto& info = rootSpecializations_[root.SpecializationSymbolKey()];

    auto& visitedSet = info.VisitedInstantiations;

    if (visitedSet.find(current.EventInstanceId()) == visitedSet.end())
    {
        // We have a new unvisited branch. Update the max depth of the
        // recursion tree.

        info.MaxDepth = std::max(info.MaxDepth, recursionTreeBranch.Size());

        for (size_t idx = recursionTreeBranch.Size(); idx-- > 0;)
        {
            const TemplateInstantiation& ti = recursionTreeBranch[idx];

            auto p = visitedSet.insert(ti.EventInstanceId());

            bool wasVisited = !p.second;

            if (wasVisited)
            {
                // Stop once we reach a visited template instantiation,
                // because its parents will also have been visited.
                break;
            }

            ++info.InstantiationCount;
        }
    }

    if (recursionTreeBranch.Size() != 1) {
        return;
    }

    // The end of a hierarchy's instantiation corresponds to the stop
    // event of the root specialization's instantiation. When we reach
    // that point, we update the total instantiation time of the hierarchy.

    info.TotalInstantiationTime = root.Duration();

    info.File = fe.InputSourcePath() ? fe.InputSourcePath() :
        fe.OutputObjectPath();

    visitedSet.clear();
}

Template instantiation events don’t contain the name of the symbol that was instantiated. Deriving the name is a costly operation, and doing so while measuring an instantiation would skew the time measurement. Instead, a numeric key is issued, which we can later use to match with the appropriate name by listening to the SymboName event. The OnSymbolName function does just that, and stores the name of every root template instantiation.

void OnSymbolName(SymbolName symbolName)
{
    auto it = rootSpecializations_.find(symbolName.Key());

    if (it == rootSpecializations_.end()) {
        return;
    }

    it->second.RootSpecializationName = symbolName.Name();
}

At the end of our analysis, we traverse all root template instantiations, sort them by longest duration, and dump the top ones.

AnalysisControl OnEndAnalysis() override
{
    using namespace std::chrono;

    auto topSpecializations = GetTopInstantiations();
    
    if (specializationCountToDump_ == 1) {
        std::cout << "Top template instantiation hierarchy:";
    }
    else {
        std::cout << "Top " << specializationCountToDump_ << 
            " template instantiation " << "hierarchies";
    }
        
    std::cout << std::endl << std::endl;

    for (auto& info : topSpecializations)
    {
        std::wcout << "File:           " << 
            info.File << std::endl;
        std::cout  << "Duration:       " << 
            duration_cast<milliseconds>(
                info.TotalInstantiationTime).count() << 
            " ms" << std::endl;
        std::cout  << "Max Depth:      " << 
            info.MaxDepth << std::endl;
        std::cout  << "Instantiations: " << 
            info.InstantiationCount << std::endl;
        std::cout  << "Root Name:      " << 
            info.RootSpecializationName << std::endl << std::endl;
    }

    return AnalysisControl::CONTINUE;
}

Tell us what you think!

We hope the information in this article has helped you understand how you can use the tools provided by C++ Build Insights to profile your usage of templates.

Give vcperf a try today by downloading the latest version of Visual Studio 2019, or by cloning the tool directly from the vcperf Github repository. Try out the RecursiveTemplateInspector sample from this article by cloning the C++ Build Insights samples repository from GitHub, or refer to the official C++ Build Insights SDK documentation to build your own analysis tools.

Have you found problematic template instantiations using vcperf or the C++ Build Insights SDK? Let us know in the comments below, on Twitter (@VisualC), or via email at visualcpp@microsoft.com.

 

This article contains code snippets from Sprout, Copyright (c) 2011-2019 Bolero MURAKAMI, distributed under the Boost Software License, Version 1.0.

2 comments

Comments are closed. Login to edit/delete your existing comments

  • Avatar
    Alexander Riccio

    yes, I love this! I will definitely have a use for this sooner or later. Maybe I should try building something large like LLVM and see if there’s anything really slow going on?

    The only negative thing I can say is that vcperf::Payload doesn’t have any SAL on that buffer 😉