{"id":31372,"date":"2022-12-13T17:00:11","date_gmt":"2022-12-13T17:00:11","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=31372"},"modified":"2022-12-13T18:30:28","modified_gmt":"2022-12-13T18:30:28","slug":"improving-the-state-of-debug-performance-in-c","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/improving-the-state-of-debug-performance-in-c\/","title":{"rendered":"Improving the State of Debug Performance in C++"},"content":{"rendered":"<p>In this blog we will explore one change the MSVC compiler has implemented in an effort to improve the codegen quality of applications in debug mode. We will highlight what the change does, and how it could be extended for the future. If debug performance is something you care about for your C++ projects, then Visual Studio 2022 version 17.5 is making that experience even better!<\/p>\n<p><em>Please note that this blog will contain some assembly but being an expert in assembly is not required.<\/em><\/p>\n<h2>Overview<\/h2>\n<ul>\n<li><a href=\"#motivation\">Motivation: why we care about debugging performance.<\/a><\/li>\n<li><a href=\"#show-me-code\">Show me some code!: A few simple examples of before and after.<\/a><\/li>\n<li><a href=\"#how-we-did-it\">How we did it: About our new intrinsic and how you could use it.<\/a><\/li>\n<li><a href=\"#looking-ahead\">Looking ahead: What else we&#8217;re doing to make the experience better.<\/a><\/li>\n<\/ul>\n<h2><span id=\"motivation\">Motivation<\/span><\/h2>\n<p>You might notice that the title of this blog is a play on words based on a recent popular blog post of a similar name,\n<a href=\"https:\/\/vittorioromeo.info\/index\/blog\/debug_performance_cpp.html\">&#8220;the sad state of debug performance in\nc++&#8221;<\/a>. 
In that post, Vittorio Romeo highlights some general C++ shortcomings when it comes to debugging performance.\nVittorio also filed this Developer Community ticket, &#8220;<a href=\"https:\/\/developercommunity.visualstudio.com\/t\/std::move-and-similar-functions-resu\/1681875\">`std::move` (and\nsimilar functions) result in poor debug performance and worse debugging experience<\/a>&#8221;; thanks to him and everyone who\nvoted! Much of the observed slowdown comes from the cost of abstraction. A notable example is\n<code>std::move<\/code>, where the following code:<\/p>\n<pre>int i = 0;\r\nstd::move(i);<\/pre>\n<p>would generate a function call, even though the code is conceptually:<\/p>\n<pre>int i = 0;\r\nstatic_cast&lt;int&amp;&amp;&gt;(i);<\/pre>\n<p>The function <code>std::move<\/code> is conceptually a named cast, much like <code>static_cast<\/code> but with a contextual meaning for code\naround it. The penalty for using this named cast is that you get a function call generated in the debug assembly.\nHere&#8217;s the assembly of the two examples above:<\/p>\n<style>\n.my_collapse {\n  display: table-row;\n  cursor: pointer;\n}\n.asm_entry {\n  vertical-align: top;\n}\n<\/style>\n<table border=\"1\">\n<tbody>\n<tr class=\"my_collapse\">\n<td><code>std::move<\/code> (click to expand)<\/td>\n<td><code>static_cast<\/code> (click to expand)<\/td>\n<\/tr>\n<tr>\n<td class=\"asm_entry\">\n<pre>main\tPROC\r\nsub\trsp, 56\t; 00000038H\r\nmov\tDWORD PTR i$[rsp], 0\r\nlea\trcx, QWORD PTR i$[rsp]\r\ncall\t??$move@AEAH@std@@YA$$QEAHAEAH@Z\r\nxor\teax, eax\r\nadd\trsp, 56\t; 00000038H\r\nret\t0\r\nmain\tENDP<\/pre>\n<\/td>\n<td class=\"asm_entry\">\n<pre>main\tPROC\r\nsub\trsp, 24\r\nmov\tDWORD PTR i$[rsp], 0\r\nxor\teax, eax\r\nadd\trsp, 24\r\nret\t0\r\nmain\tENDP<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><em>Note to readers: All code samples in this blog were compiled with &#8220;<code>\/Od \/std:c++latest<\/code>&#8221;.<\/em><\/p>\n<p>On the surface, the 
compiler only generated 2 extra instructions in the <code>std::move<\/code> case, but the &#8216;call&#8217; instruction, in\nparticular, is expensive, and it also executes the following code in addition to the code above:<\/p>\n<pre class=\"asm_entry\">??$move@AEAH@std@@YA$$QEAHAEAH@Z PROC\t\t\t; std::move&lt;int &amp;&gt;, COMDAT\r\nmov\tQWORD PTR [rsp+8], rcx\r\nmov\trax, QWORD PTR _Arg$[rsp]\r\nret\t0\r\n??$move@AEAH@std@@YA$$QEAHAEAH@Z ENDP\t\t\t; std::move&lt;int &amp;&gt;<\/pre>\n<p><em>Note: to generate the assembly above, the compiler can be provided with the <a href=\"https:\/\/learn.microsoft.com\/en-us\/cpp\/build\/reference\/fa-fa-listing-file?view=msvc-170\"><code>\/Fa<\/code><\/a> option. Furthermore, odd-looking names\nlike &#8220;<code>??$move@AEAH@std@@YA$$QEAHAEAH@Z<\/code>&#8221; are mangled names of function template specializations, in this case <code>std::move&lt;int &amp;&gt;<\/code>.<\/em><\/p>\n<p>So really your binary is now at a 5-instruction deficit relative to the <code>static_cast<\/code> code (the 2 extra instructions in <code>main<\/code> plus the 3 in the body of <code>std::move<\/code>), and this cost is multiplied by the\nnumber of times that <code>std::move<\/code> is used.<\/p>\n<p>Some compilers already recognize meta functions like <code>std::move<\/code> and <code>std::forward<\/code> as\ncompiler intrinsics (as noted in Vittorio&#8217;s blog), and this support is implemented entirely in the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Compiler#Front_end\">compiler front-end<\/a>. As of\n17.5, MSVC offers better debugging performance by recognizing these meta functions as well! More on how we do it\nlater in this blog, but first&#8230;<\/p>\n<h2><span id=\"show-me-code\">Show me some code!<\/span><\/h2>\n<p><em>Note to readers: to take advantage of the new codegen quality, you will need to provide the <a href=\"https:\/\/learn.microsoft.com\/en-us\/cpp\/build\/reference\/permissive-standards-conformance?view=msvc-170\"><code>\/permissive-<\/code><\/a> compiler\noption. 
It is also worth noting that <code>\/permissive-<\/code> is implied when <code>\/std:c++20<\/code> or <code>\/std:c++latest<\/code> is used.<\/em><\/p>\n<p>Let&#8217;s take the simple example above again and make it a full program:<\/p>\n<pre>#include &lt;utility&gt;\r\n\r\nint main() {\r\n    int i = 0;\r\n    std::move(i);\r\n    std::forward&lt;int&amp;&gt;(i);\r\n}<\/pre>\n<p>Here&#8217;s the difference in generated assembly between 17.4 and 17.5:<\/p>\n<table border=\"1\">\n<tbody>\n<tr class=\"my_collapse\">\n<td>17.4 (click to expand)<\/td>\n<td>17.5 (click to expand)<\/td>\n<\/tr>\n<tr>\n<td class=\"asm_entry\">\n<pre>_Arg$ = 8\r\n??$forward@AEAH@std@@YAAEAHAEAH@Z PROC\r\nmov\tQWORD PTR [rsp+8], rcx\r\nmov\trax, QWORD PTR _Arg$[rsp]\r\nret\t0\r\n??$forward@AEAH@std@@YAAEAHAEAH@Z ENDP\r\n_TEXT\tENDS\r\n_TEXT\tSEGMENT\r\n_Arg$ = 8\r\n??$move@AEAH@std@@YA$$QEAHAEAH@Z PROC\r\nmov\tQWORD PTR [rsp+8], rcx\r\nmov\trax, QWORD PTR _Arg$[rsp]\r\nret\t0\r\n??$move@AEAH@std@@YA$$QEAHAEAH@Z ENDP\r\n_TEXT\tENDS\r\n_TEXT\tSEGMENT\r\ni$ = 32\r\nmain\tPROC\r\nsub\trsp, 56\t\t; 00000038H\r\nmov\tDWORD PTR i$[rsp], 0\r\nlea\trcx, QWORD PTR i$[rsp]\r\ncall\t??$move@AEAH@std@@YA$$QEAHAEAH@Z\r\nlea\trcx, QWORD PTR i$[rsp]\r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z\r\nxor\teax, eax\r\nadd\trsp, 56\t\t; 00000038H\r\nret\t0\r\nmain\tENDP<\/pre>\n<\/td>\n<td class=\"asm_entry\">\n<pre>i$ = 0\r\nmain\tPROC\r\n$LN3:\r\nsub\trsp, 24\r\nmov\tDWORD PTR i$[rsp], 0\r\nxor\teax, eax\r\nadd\trsp, 24\r\nret\t0\r\nmain\tENDP<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Assembly reading tip: The <code>main PROC<\/code> above is our <code>main<\/code> function in the C++ code. The\ninstructions that follow <code>main PROC<\/code> are what your CPU will execute when your program is first invoked. In\nthe case above, it is clear that the code produced by 17.5 is much smaller, which can sometimes be an indication of a\nperformance win. 
Here, the win is both in the size of the code produced and in the\nreduction in indirections from eliding the &#8216;call&#8217; instructions to <code>std::move<\/code> and <code>std::forward<\/code>. For the\npurposes of this blog, we will treat the reduced complexity of the newly generated assembly as an indicator of possible\nperformance wins.<\/p>\n<p>Yes, you read that right: the generated code in 17.5 doesn&#8217;t even create assembly entries for <code>std::move<\/code> or\n<code>std::forward<\/code>, which makes sense because they&#8217;re never called.<\/p>\n<p>Let&#8217;s look at a slightly more complicated code example:<\/p>\n<pre>#include &lt;utility&gt;\r\n\r\ntemplate &lt;typename T&gt;\r\nvoid add_1_impl(T&amp;&amp; x) {\r\n    std::forward&lt;T&gt;(x) += std::move(1);\r\n}\r\n\r\ntemplate &lt;typename T, int N&gt;\r\nvoid add_1(T (&amp;arr)[N]) {\r\n    for (auto&amp;&amp; e : arr) {\r\n        add_1_impl(e);\r\n    }\r\n}\r\n\r\nint main() {\r\n    int arr[10]{};\r\n    add_1(arr);\r\n}<\/pre>\n<p>In this code all we want to do is add 1 to all elements of the array. 
Here&#8217;s the comparison (only showing the <code>add_1_impl<\/code>\nfunction, which uses <code>std::forward<\/code> and <code>std::move<\/code>):<\/p>\n<table border=\"1\">\n<tbody>\n<tr class=\"my_collapse\">\n<td>17.4 (click to expand)<\/td>\n<td>17.5 (click to expand)<\/td>\n<\/tr>\n<tr>\n<td class=\"asm_entry\">\n<pre>??$add_1_impl@AEAH@@YAXAEAH@Z PROC\r\n$LN3:\r\nmov\tQWORD PTR [rsp+8], rcx\r\nsub\trsp, 72\t; 00000048H\r\nmov\tDWORD PTR $T1[rsp], 1\r\nlea\trcx, QWORD PTR $T1[rsp]\r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z\r\nmov\teax, DWORD PTR [rax]\r\nmov\tDWORD PTR tv72[rsp], eax\r\nmov\trcx, QWORD PTR x$[rsp]\r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z\r\nmov\tQWORD PTR tv68[rsp], rax\r\nmov\trax, QWORD PTR tv68[rsp]\r\nmov\teax, DWORD PTR [rax]\r\nmov\tDWORD PTR tv70[rsp], eax\r\nmov\teax, DWORD PTR tv72[rsp]\r\nmov\tecx, DWORD PTR tv70[rsp]\r\nadd\tecx, eax\r\nmov\teax, ecx\r\nmov\trcx, QWORD PTR tv68[rsp]\r\nmov\tDWORD PTR [rcx], eax\r\nadd\trsp, 72\t; 00000048H\r\nret\t0\r\n??$add_1_impl@AEAH@@YAXAEAH@Z ENDP<\/pre>\n<\/td>\n<td class=\"asm_entry\">\n<pre>??$add_1_impl@AEAH@@YAXAEAH@Z PROC\r\n$LN3:\r\nmov\tQWORD PTR [rsp+8], rcx\r\nsub\trsp, 24\r\nmov\tDWORD PTR $T1[rsp], 1\r\nmov\trax, QWORD PTR x$[rsp]\r\nmov\teax, DWORD PTR [rax]\r\nadd\teax, DWORD PTR $T1[rsp]\r\nmov\trcx, QWORD PTR x$[rsp]\r\nmov\tDWORD PTR [rcx], eax\r\nadd\trsp, 24\r\nret\t0\r\n??$add_1_impl@AEAH@@YAXAEAH@Z ENDP<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>17.4 has 21 instructions while 17.5 has only 10. The gap is amplified by the fact that we\nare calling <code>add_1_impl<\/code> in a loop, so the instructions executed in 17.4 are likely to be significantly\nmore costly than in 17.5. It is worse than that, actually, because we&#8217;re not accounting for the instructions executed inside the\nfunctions <code>std::forward<\/code> and <code>std::move<\/code>.<\/p>\n<p>Let&#8217;s make the code sample even more interesting and extreme to illustrate 
the visible differences. It might be\nobserved that if we manually unroll the loop above we can get a performance win, so let&#8217;s do that using templates:<\/p>\n<pre>#include &lt;utility&gt;\r\n\r\ntemplate &lt;typename T, int N, std::size_t... Is&gt;\r\nvoid add_1_impl(std::index_sequence&lt;Is...&gt;, T (&amp;arr)[N]) {\r\n    ((std::forward&lt;T&amp;&gt;(arr[Is]) += std::move(1)), ...);\r\n}\r\n\r\ntemplate &lt;typename T, int N&gt;\r\nvoid add_1(T (&amp;arr)[N]) {\r\n    add_1_impl(std::make_index_sequence&lt;N&gt;{}, arr);\r\n}\r\n\r\nint main() {\r\n    int arr[10]{};\r\n    add_1(arr);\r\n}<\/pre>\n<p>The code above replaces the loop in the previous example with a single <a href=\"https:\/\/en.cppreference.com\/w\/cpp\/language\/fold\">fold expression<\/a>. Let&#8217;s peek at the codegen\n(again only snipping <code>add_1_impl<\/code> with <code>std::forward<\/code> and <code>std::move<\/code>, we also replace the mangled function name with\n<code>add_1_impl&lt;...&gt;<\/code>):<\/p>\n<table border=\"1\">\n<tbody>\n<tr class=\"my_collapse\">\n<td>17.4 (click to expand)<\/td>\n<td>17.5 (click to expand)<\/td>\n<\/tr>\n<tr>\n<td class=\"asm_entry\">\n<pre>add_1_impl&lt;...&gt; PROC \r\n$LN3: \r\nmov\tQWORD PTR [rsp+16], rdx \r\nmov\tBYTE PTR [rsp+8], cl \r\nsub\trsp, 248\t; 000000f8H \r\nmov\tDWORD PTR $T1[rsp], 1 \r\nlea\trcx, QWORD PTR $T1[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv74[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 0 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv70[rsp], rax \r\nmov\trax, QWORD PTR tv70[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv72[rsp], eax \r\nmov\teax, DWORD PTR tv74[rsp] \r\nmov\tecx, DWORD PTR tv72[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv70[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T2[rsp], 1 \r\nlea\trcx, 
QWORD PTR $T2[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv86[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 1 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv82[rsp], rax \r\nmov\trax, QWORD PTR tv82[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv84[rsp], eax \r\nmov\teax, DWORD PTR tv86[rsp] \r\nmov\tecx, DWORD PTR tv84[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv82[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T3[rsp], 1 \r\nlea\trcx, QWORD PTR $T3[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv130[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 2 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv94[rsp], rax \r\nmov\trax, QWORD PTR tv94[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv128[rsp], eax \r\nmov\teax, DWORD PTR tv130[rsp] \r\nmov\tecx, DWORD PTR tv128[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv94[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T4[rsp], 1 \r\nlea\trcx, QWORD PTR $T4[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv142[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 3 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv138[rsp], rax \r\nmov\trax, QWORD PTR tv138[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv140[rsp], eax \r\nmov\teax, DWORD PTR tv142[rsp] \r\nmov\tecx, DWORD PTR tv140[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv138[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T5[rsp], 1 \r\nlea\trcx, QWORD PTR $T5[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z 
\r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv154[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 4 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv150[rsp], rax \r\nmov\trax, QWORD PTR tv150[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv152[rsp], eax \r\nmov\teax, DWORD PTR tv154[rsp] \r\nmov\tecx, DWORD PTR tv152[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv150[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T6[rsp], 1 \r\nlea\trcx, QWORD PTR $T6[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv166[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 5 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv162[rsp], rax \r\nmov\trax, QWORD PTR tv162[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv164[rsp], eax \r\nmov\teax, DWORD PTR tv166[rsp] \r\nmov\tecx, DWORD PTR tv164[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv162[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T7[rsp], 1 \r\nlea\trcx, QWORD PTR $T7[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv178[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 6 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv174[rsp], rax \r\nmov\trax, QWORD PTR tv174[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv176[rsp], eax \r\nmov\teax, DWORD PTR tv178[rsp] \r\nmov\tecx, DWORD PTR tv176[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv174[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T8[rsp], 1 \r\nlea\trcx, QWORD PTR $T8[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR 
tv190[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 7 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv186[rsp], rax \r\nmov\trax, QWORD PTR tv186[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv188[rsp], eax \r\nmov\teax, DWORD PTR tv190[rsp] \r\nmov\tecx, DWORD PTR tv188[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv186[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T9[rsp], 1 \r\nlea\trcx, QWORD PTR $T9[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv202[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 8 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv198[rsp], rax \r\nmov\trax, QWORD PTR tv198[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv200[rsp], eax \r\nmov\teax, DWORD PTR tv202[rsp] \r\nmov\tecx, DWORD PTR tv200[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv198[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nmov\tDWORD PTR $T10[rsp], 1 \r\nlea\trcx, QWORD PTR $T10[rsp] \r\ncall\t??$move@H@std@@YA$$QEAH$$QEAH@Z \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv214[rsp], eax \r\nmov\teax, 4 \r\nimul\trax, rax, 9 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nadd\trcx, rax \r\nmov\trax, rcx \r\nmov\trcx, rax \r\ncall\t??$forward@AEAH@std@@YAAEAHAEAH@Z \r\nmov\tQWORD PTR tv210[rsp], rax \r\nmov\trax, QWORD PTR tv210[rsp] \r\nmov\teax, DWORD PTR [rax] \r\nmov\tDWORD PTR tv212[rsp], eax \r\nmov\teax, DWORD PTR tv214[rsp] \r\nmov\tecx, DWORD PTR tv212[rsp] \r\nadd\tecx, eax \r\nmov\teax, ecx \r\nmov\trcx, QWORD PTR tv210[rsp] \r\nmov\tDWORD PTR [rcx], eax \r\nadd\trsp, 248\t; 000000f8H \r\nret\t0 \r\nadd_1_impl&lt;...&gt; ENDP<\/pre>\n<\/td>\n<td class=\"asm_entry\">\n<pre>add_1_impl&lt;...&gt; PROC \r\n$LN3: \r\nmov\tQWORD PTR [rsp+16], rdx \r\nmov\tBYTE PTR 
[rsp+8], cl \r\nsub\trsp, 56\t; 00000038H \r\nmov\tDWORD PTR $T1[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 0 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T1[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 0 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T2[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 1 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T2[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 1 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T3[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 2 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T3[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 2 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T4[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 3 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T4[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 3 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T5[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 4 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T5[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 4 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T6[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 5 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T6[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 5 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T7[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 6 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T7[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 6 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR 
$T8[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 7 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T8[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 7 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T9[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 8 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T9[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 8 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nmov\tDWORD PTR $T10[rsp], 1 \r\nmov\teax, 4 \r\nimul\trax, rax, 9 \r\nmov\trcx, QWORD PTR arr$[rsp] \r\nmov\teax, DWORD PTR [rcx+rax] \r\nadd\teax, DWORD PTR $T10[rsp] \r\nmov\tecx, 4 \r\nimul\trcx, rcx, 9 \r\nmov\trdx, QWORD PTR arr$[rsp] \r\nmov\tDWORD PTR [rdx+rcx], eax \r\nadd\trsp, 56\t; 00000038H \r\nret\t0 \r\nadd_1_impl&lt;...&gt; ENDP<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Our 17.4 example clocks in at a whopping 226 instructions while our 17.5 example is only 106, and the\ninstructions in 17.4 appear to be far more costly because of the call frame setups and &#8216;call&#8217; instructions, which\nare not present on the 17.5 side.<\/p>\n<p>OK, perhaps the examples above are contrived and it might be far-fetched to think that code like the above would truly\nimpact performance, but let&#8217;s take some code that is all but guaranteed to have some kind of real world application:<\/p>\n<pre>#include &lt;vector&gt;\r\n\r\nint main() {\r\n    std::vector&lt;int&gt; v;\r\n    v.push_back(1); \r\n}<\/pre>\n<p>I will save you the massive assembly output on this one and simply call out the difference in instruction counts:<\/p>\n<ul>\n<li>17.4: 3136<\/li>\n<li>17.5: 3063<\/li>\n<\/ul>\n<p>Your assembly is 73 instructions shorter just by the compiler eliding these meta functions, and wherever\n<code>std::move<\/code> and <code>std::forward<\/code> are used, there is a good chance they are 
used in a loop (for example, resizing the\nvector and moving the elements to a new memory block). Furthermore, since these meta functions are never instantiated,\nthe corresponding .obj, .lib, and .pdb files will be slightly smaller after upgrading to 17.5.<\/p>\n<h2><span id=\"how-we-did-it\">How we did it<\/span><\/h2>\n<p>Rather than try to make the compiler aware of meta functions that act as named, no-op casts (i.e. the cast does not\nrequire a pointer adjustment), we took an alternative approach and implemented this new inlining ability using\na C++ attribute: <code>[[msvc::intrinsic]]<\/code>.<\/p>\n<p>The new attribute will semantically replace a function call with a cast to that function&#8217;s return type if the function\ndefinition is decorated with <code>[[msvc::intrinsic]]<\/code>. You can see how we applied this new attribute in the STL: <a href=\"https:\/\/github.com\/microsoft\/STL\/pull\/3182\">GH3182<\/a>. We\nchose the attribute route because we want to eventually extend the scenarios it can\ncover and offer a data-driven approach to selectively decorate code with the new functionality. The latter is important\nfor users of MSVC as well.<\/p>\n<p>You can read more about the attribute and its constraints and semantics in the <a href=\"https:\/\/learn.microsoft.com\/en-us\/cpp\/cpp\/attributes?view=msvc-170#microsoft-specific-attributes\">Microsoft-specific attributes<\/a> section of\nour documentation.<\/p>\n<h2><span id=\"looking-ahead\">Looking ahead&#8230;<\/span><\/h2>\n<p>The compiler front-end is not alone in this story of improving the performance of generated code for debugging purposes.\nThe compiler back-end team is also working very hard on some debug codegen scenarios that they will share in the coming\nmonths.<\/p>\n<p><b>Call to action:<\/b> what types of debugging optimizations matter to you? 
What optimizations for debug code would you like\nto see MSVC implement?<\/p>\n<p>Especially if you work for a game studio, please help us find out what your debugging workflow looks like by taking this\nsurvey: <a href=\"https:\/\/aka.ms\/MSVCDebugSurvey\">https:\/\/aka.ms\/MSVCDebugSurvey<\/a>. Data like this helps the team\nfocus on what workflows are important to you.<\/p>\n<p>Onward and upward!<\/p>\n<p><!-- It is important that this script is executed after all the tables are created -->\n<script>\nvar coll = document.getElementsByClassName(\"my_collapse\");\nvar i = 0;\nfor (; i < coll.length; i++) {\n  var elm = coll[i];\n  var content_row = elm.nextElementSibling;\n  \/\/ Hide them all by default.\n  content_row.style.display = \"none\";\n  elm.addEventListener(\"click\", function() {\n    var content = this.nextElementSibling;\n    if (content.style.display !== \"none\") {\n      content.style.display = \"none\";\n    }\n    else {\n      content.style.display = \"table-row\";\n    }\n  });\n}\n<\/script><\/p>\n<h4>Closing<\/h4>\n<p>As always, we welcome your feedback. Feel free to send any comments through e-mail at <a href=\"mailto:visualcpp@microsoft.com\">visualcpp@microsoft.com<\/a> or through <a href=\"https:\/\/twitter.com\/visualc\">Twitter @visualc<\/a>. Also, feel free to follow Cameron DaCamara on Twitter <a href=\"https:\/\/twitter.com\/starfreakclone\">@starfreakclone<\/a>.<\/p>\n<p>If you encounter other problems with MSVC in VS 2019\/2022 please let us know via the <a href=\"https:\/\/docs.microsoft.com\/en-us\/visualstudio\/ide\/how-to-report-a-problem-with-visual-studio?view=vs-2019\">Report\na Problem<\/a> option, either from the installer or the Visual Studio IDE itself. 
For suggestions or bug reports, let us\nknow through <a href=\"https:\/\/developercommunity.visualstudio.com\/\">DevComm.<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog we will explore one change the MSVC compiler has implemented in an effort to improve the codegen quality of applications in debug mode. We will highlight what the change does, and how it could be extended for the future. If debug performance is something you care about for your C++ projects, then [&hellip;]<\/p>\n","protected":false},"author":39620,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[270,1,512,230,218],"tags":[8,140,100,65,55,66],"class_list":["post-31372","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-announcement","category-cplusplus","category-general-cpp-series","category-new-feature","category-performance","tag-announcement","tag-c","tag-c-language","tag-compiler","tag-debugging","tag-performance"],"acf":[],"blog_post_summary":"<p>In this blog we will explore one change the MSVC compiler has implemented in an effort to improve the codegen quality of applications in debug mode. We will highlight what the change does, and how it could be extended for the future. 
If debug performance is something you care about for your C++ projects, then [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/31372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/39620"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=31372"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/31372\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=31372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=31372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=31372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}