{"id":27337,"date":"2021-01-19T15:00:35","date_gmt":"2021-01-19T15:00:35","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=27337"},"modified":"2021-01-20T12:18:01","modified_gmt":"2021-01-20T12:18:01","slug":"build-throughput-series-more-efficient-template-metaprogramming","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/build-throughput-series-more-efficient-template-metaprogramming\/","title":{"rendered":"Build Throughput Series: More Efficient Template Metaprogramming"},"content":{"rendered":"<p>In <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/build-throughput-series-template-metaprogramming-fundamentals\">the previous blog post<\/a> I shared how template specialization and template instantiation are processed in the MSVC compiler. We will now look at some examples from real-world code bases to show some ways to reduce the number of them.<\/p>\n<h5>Example 1<\/h5>\n<p>This example is extracted from our own MSVC compiler code base. The code tries to apply several stateless functors on an object. Because the functors are stateless, they are represented by a list of types. Here is the code:<\/p>\n<pre class=\"lang:c++ decode:true\">\/\/ A helper class which represents a list of types.\r\ntemplate&lt;typename...&gt; struct TypeList;\r\n\r\n\/\/ The definition of 'Object' is irrelevant and omitted.\r\nstruct Object;\r\n\/\/ The function which applies a stateless functor. Its definition is irrelevant and omitted.\r\ntemplate &lt;typename Functor&gt; void apply_functor(Object&amp; object);\r\n\r\n\/\/ We have two functors.\r\nstruct Functor1;\r\nstruct Functor2;\r\n\r\n\/\/ We want to apply the two functors above.\r\nvoid apply(Object&amp; object)\r\n{\r\n    using Functors = TypeList&lt;Functor1, Functor2&gt;;\r\n    apply_all_functors&lt;Functors&gt;(object); \/\/ 'apply_all_functors' is not implemented yet.\r\n}<\/pre>\n<p>Now let us see the initial implementation of <code>apply_all_functors<\/code>. 
We extract the functors from <code>TypeList<\/code> and apply them one by one:<\/p>\n<pre class=\"lang:c++ decode:true\">#include &lt;utility&gt;\r\n\r\ntemplate &lt;typename Functors&gt;\r\nstruct apply_all_functors_impl {\r\n    template &lt;size_t I&gt;\r\n    static void apply(Object&amp; object) {\r\n        using Functor = TypeListAt&lt;I, Functors&gt;; \/\/ 'TypeListAt' is not implemented yet.\r\n\r\n        apply_functor&lt;Functor&gt;(object);\r\n    }\r\n\r\n    template &lt;size_t... I&gt;\r\n    static void apply_all(Object&amp; object, std::index_sequence&lt;I...&gt;) {\r\n        (apply&lt;I&gt;(object), ...);\r\n    }\r\n\r\n    void operator()(Object&amp; object) const\r\n    {\r\n        apply_all(object, std::make_index_sequence&lt;TypeListSize&lt;Functors&gt;&gt;{}); \/\/ 'TypeListSize' is not implemented yet.\r\n    }\r\n};\r\n\r\ntemplate &lt;typename Functors&gt;\r\nconstexpr apply_all_functors_impl&lt;Functors&gt; apply_all_functors{};<\/pre>\n<p>To extract the functor from the list, we need a sequence of indices. This is obtained using <code>std::make_index_sequence<\/code>. We then use a fold expression to efficiently iterate through the sequence and call <code>apply<\/code> to extract and apply the functor one by one.<\/p>\n<p>The code above uses a class template so that the template arguments are shared across all its member functions. You can also use global function templates instead.<\/p>\n<p>There are several ways to implement <code>TypeListAt<\/code>\u00a0and <code>TypeListSize<\/code>. Here is one solution:<\/p>\n<pre class=\"lang:c++ decode:true\">\/\/ Implementation of TypeListSize.\r\ntemplate&lt;typename&gt; struct TypeListSizeImpl;\r\ntemplate&lt;typename... 
Types&gt; struct TypeListSizeImpl&lt;TypeList&lt;Types...&gt;&gt;\r\n{\r\n    static constexpr size_t value = sizeof...(Types);\r\n};\r\ntemplate&lt;typename Types&gt; constexpr size_t TypeListSize = TypeListSizeImpl&lt;Types&gt;::value;\r\n\r\n\/\/ Implementation of TypeListAt.\r\ntemplate&lt;size_t, typename&gt; struct TypeListAtImpl;\r\ntemplate&lt;size_t I, typename Type, typename... Types&gt; struct TypeListAtImpl&lt;I, TypeList&lt;Type, Types...&gt;&gt;\r\n{\r\n    using type = typename TypeListAtImpl&lt;I - 1, TypeList&lt;Types...&gt;&gt;::type;\r\n};\r\ntemplate&lt;typename Type, typename... Types&gt; struct TypeListAtImpl&lt;0, TypeList&lt;Type, Types...&gt;&gt;\r\n{\r\n    using type = Type;\r\n};\r\n\r\ntemplate&lt;size_t I, typename Types&gt; using TypeListAt = typename TypeListAtImpl&lt;I, Types&gt;::type;<\/pre>\n<p>Now let us examine the number of template instantiations in the initial implementation (assume we have <code>N<\/code> functors):<\/p>\n<ol>\n<li>We iterate through an integer sequence of <code>N<\/code> elements (with value <code>0, ..., N - 1<\/code>).<\/li>\n<li>Each iteration specializes one <code>TypeListAt<\/code> which instantiates <code>O(I)<\/code> <code>TypeListAtImpl<\/code> specializations (<code>I<\/code> is the element in the integer sequence).<\/li>\n<\/ol>\n<p>For example, when <code>TypeListAt&lt;2, TypeList&lt;T1, T2, T3&gt;&gt;<\/code>\u00a0(I = 2, N = 3) is used, it goes through the following:<\/p>\n<pre>TypeListAt&lt;2, TypeList&lt;T1, T2, T3&gt;&gt; =&gt;\r\nTypeListAtImpl&lt;2, TypeList&lt;T1, T2, T3&gt;&gt;::type =&gt;\r\nTypeListAtImpl&lt;1, TypeList&lt;T2, T3&gt;&gt;::type =&gt;\r\nTypeListAtImpl&lt;0, TypeList&lt;T3&gt;&gt;::type =&gt;\r\nT3<\/pre>\n<p>So, <code>apply_all_functors_impl&lt;TypeList&lt;T1, ..., TN&gt;&gt;::operator()<\/code> instantiates <code>O(N^2)<\/code> template specializations.<\/p>\n<p>How can we reduce the number? 
The core logic is to extract types from the helper class <code>TypeList<\/code>.<\/p>\n<p>To reduce the number of template instantiations, we can extract the types directly, without going through <code>std::index_sequence<\/code> and <code>TypeListAt<\/code>. This takes advantage of function template argument deduction, which can deduce the template arguments of a class template specialization used as the type of the function parameter.<\/p>\n<p>Here is the more efficient version:<\/p>\n<pre class=\"lang:c++ decode:true\">\/\/ Function template argument deduction can deduce the functors from the helper class.\r\ntemplate &lt;typename... Functors&gt;\r\nvoid apply_all_functors_impl (Object&amp; object, TypeList&lt;Functors...&gt;*)\r\n{\r\n    ((apply_functor&lt;Functors&gt;(object)), ...);\r\n}\r\n\r\ntemplate &lt;typename Functors&gt;\r\nvoid apply_all_functors (Object&amp; object)\r\n{\r\n    apply_all_functors_impl(object, static_cast&lt;Functors*&gt;(nullptr));\r\n}<\/pre>\n<p>Now it only instantiates <code>O(N)<\/code> template specializations.<\/p>\n<p>Note: I intentionally leave <code>TypeList<\/code> undefined. The definition is not even needed for the <code>static_cast<\/code>, as I mentioned in <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/build-throughput-series-template-metaprogramming-fundamentals\">the previous blog post<\/a>. This avoids all the overhead associated with defining a class (such as declaring many compiler-generated special member functions and generating debug information), which can otherwise be incurred accidentally (see the next example for more details).<\/p>\n<p>We applied this trick in the compiler code base, and it halved the memory needed to compile one expensive file. We also saw a noticeable compile-time improvement.<\/p>\n<h5>Example 2<\/h5>\n<p>This example is extracted from the code base of an internal game studio. 
To my surprise, game developers love template metaprogramming \ud83d\ude0a.<\/p>\n<p>The code tries to obtain a list of trait classes from a type map.<\/p>\n<pre class=\"lang:c++ decode:true\">#include &lt;tuple&gt;\r\n#include &lt;utility&gt;\r\n\r\n\/\/ This class contains some useful information about a type.\r\ntemplate &lt;typename&gt;\r\nclass trait {};\r\n\r\n\/\/ TypeMap is a helper template which maps an index to a type.\r\ntemplate &lt;template &lt;int&gt; class TypeMap, int N&gt;\r\nstruct get_type_traits;\r\n\r\ntemplate&lt;int&gt; struct type_map;\r\ntemplate&lt;&gt; struct type_map&lt;0&gt; { using type = int; };\r\ntemplate&lt;&gt; struct type_map&lt;1&gt; { using type = float; };\r\n\r\n\/\/ We want to get back 'std::tuple&lt;trait&lt;int&gt;, trait&lt;float&gt;&gt;'.\r\nusing type_traits = get_type_traits&lt;type_map, 2&gt;::type; \/\/ 'get_type_traits' is not implemented yet.<\/pre>\n<p>Here is the initial implementation:<\/p>\n<pre class=\"lang:c++ decode:true\">template &lt;template &lt;int&gt; class TypeMap, int N&gt;\r\nstruct get_type_traits\r\n{\r\nprivate:\r\n    template &lt;int... I&gt;\r\n    static auto impl(std::integer_sequence&lt;int, I...&gt;)\r\n    {\r\n        return std::make_tuple(trait&lt;typename TypeMap&lt;I&gt;::type&gt;{}...);\r\n    }\r\npublic:\r\n    using type = decltype(impl(std::make_integer_sequence&lt;int, N&gt;{}));\r\n};<\/pre>\n<p>It uses the same <code>make_integer_sequence<\/code> trick as Example 1.<\/p>\n<p><code>get_type_traits<\/code> itself doesn\u2019t suffer from the <code>O(N^2)<\/code> specialization problem. 
But unfortunately, the current <code>std::tuple<\/code> implementation in MSVC has <code>O(n^2)<\/code> behavior when instantiated, where <code>n<\/code> is the number of its template arguments.<\/p>\n<p>This overhead can be completely avoided because the class only needs to name the resulting type, which does not necessarily require instantiating it.<\/p>\n<p>However, the initial implementation forces the instantiation of <code>std::tuple<\/code>\u00a0due to the definition of <code>impl<\/code>. As mentioned in <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/build-throughput-series-template-metaprogramming-fundamentals\">the previous blog post<\/a>, having a template specialization as the return type does not require instantiation if there is no function definition.<\/p>\n<p>The solution is to specify the return type of <code>impl<\/code> explicitly and remove the definition. This trick is not always possible when the return type is complicated, but in this case we can specify it as:<\/p>\n<pre class=\"lang:c++ decode:true\">template &lt;int... I&gt;\r\nstatic std::tuple&lt;trait&lt;typename TypeMap&lt;I&gt;::type&gt;...&gt; impl(std::integer_sequence&lt;int, I...&gt;);<\/pre>\n<p>This change reduces the compile time by 0.9s in a case where an <code>std::tuple<\/code>\u00a0with 85 template arguments is used. 
We have seen such uses of <code>std::tuple<\/code> (with lots of template arguments) in quite a few code bases.<\/p>\n<h5>Summary<\/h5>\n<p>Here is a list of simple tips which can help reduce the number and overhead of template specializations and instantiations:<\/p>\n<ol>\n<li>Avoid instantiating a non-linear number of template specializations.\nBe aware of type traits which require a non-trivial number of specializations (e.g., those using recursion).<\/li>\n<li>Leave class templates undefined if possible (e.g., a helper class which carries all the information in its template arguments).<\/li>\n<li>Prefer variable templates to class templates for values (<code>variable_template&lt;T&gt;<\/code> is much cheaper than <code>class_template&lt;T&gt;::value<\/code>, and <code>class_template&lt;T&gt;()<\/code> is the worst \ud83d\ude0a).<\/li>\n<li>Be aware of expensive templates (like <code>std::tuple<\/code> with lots of template arguments) and switch to a simpler type if you use the template for a purpose other than the one it was designed for (e.g., using <code>std::tuple<\/code> as a type list).<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>In the previous blog post I shared how template specialization and template instantiation are processed in the MSVC compiler. We will now look at some examples from real-world code bases to show some ways to reduce the number of them. Example 1 This example is extracted from our own MSVC compiler code base. 
The code [&hellip;]<\/p>\n","protected":false},"author":6968,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-27337","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus"],"acf":[],"blog_post_summary":"<p>In the previous blog post I shared how template specialization and template instantiation are processed in the MSVC compiler. We will now look at some examples from real-world code bases to show some ways to reduce the number of them. Example 1 This example is extracted from our own MSVC compiler code base. The code [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/27337","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/6968"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=27337"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/27337\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=27337"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=27337"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=27337"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","temp
lated":true}]}}