In this blog we will explore one change the MSVC compiler has implemented in an effort to improve the codegen quality of applications in debug mode. We will highlight what the change does, and how it could be extended for the future. If debug performance is something you care about for your C++ projects, then Visual Studio 2022 version 17.5 is making that experience even better!
Please note that this blog contains some assembly, but you do not need to be an assembly expert to follow along.
Overview
- Motivation: why we care about debugging performance.
- Show me some code!: A few simple examples of before and after.
- How we did it: About our new intrinsic and how you could use it.
- Looking ahead: What else we’re doing to make the experience better.
Motivation
You might notice that the title of this blog is a play on words based on a recent popular blog post of a similar name, "the sad state of debug performance in c++", in which Vittorio Romeo highlights some general C++ shortcomings when it comes to debugging performance. Vittorio also filed the Developer Community ticket "std::move (and similar functions) result in poor debug performance and worse debugging experience"; thanks to him and everyone who voted! Much of the reason for the observed slowdown is the cost of abstraction, with std::move being the notable example, where the following code:
int i = 0;
std::move(i);
generates a function call even though it is conceptually just:
int i = 0;
static_cast<int&&>(i);
The function std::move is conceptually a named cast, much like static_cast, but with a contextual meaning for the code around it. The penalty for using this named cast is that you get a function call generated in the debug assembly.
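For context, std::move is an ordinary function template whose body performs the cast, so an unoptimized build emits a genuine call for every use. Here is a simplified sketch of its shape (not the actual STL source; the name move_sketch is hypothetical):

#include <type_traits>

// Simplified sketch of a std::move-like named cast (hypothetical name):
// it is a real function with a body, so a debug build generates a call to it.
template <typename T>
constexpr std::remove_reference_t<T>&& move_sketch(T&& arg) noexcept
{
    return static_cast<std::remove_reference_t<T>&&>(arg);
}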
Here’s the assembly of the two examples above:
std::move:

main PROC
        sub     rsp, 56                             ; 00000038H
        mov     DWORD PTR i$[rsp], 0
        lea     rcx, QWORD PTR i$[rsp]
        call    ??$move@AEAH@std@@YA$$QEAHAEAH@Z
        xor     eax, eax
        add     rsp, 56                             ; 00000038H
        ret     0
main ENDP

static_cast:

main PROC
        sub     rsp, 24
        mov     DWORD PTR i$[rsp], 0
        xor     eax, eax
        add     rsp, 24
        ret     0
main ENDP
Note to readers: All code samples in this blog were compiled with "/Od /std:c++latest".
On the surface, the compiler only generated 2 extra instructions in the std::move case, but the 'call' instruction in particular is expensive, and it executes the following code in addition to the code above:
??$move@AEAH@std@@YA$$QEAHAEAH@Z PROC               ; std::move<int &>, COMDAT
        mov     QWORD PTR [rsp+8], rcx
        mov     rax, QWORD PTR _Arg$[rsp]
        ret     0
??$move@AEAH@std@@YA$$QEAHAEAH@Z ENDP               ; std::move<int &>
Note: to generate the assembly above, the compiler can be provided with the /Fa option. Furthermore, odd-looking names like "??$move@AEAH@std@@YA$$QEAHAEAH@Z" are the mangled names of function template specializations, in this case std::move<int &>.
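For example, assuming a source file named example.cpp (a hypothetical name), an invocation along these lines writes the assembly listing to example.asm alongside the object file:

cl /Od /std:c++latest /Fa example.cpp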
So really, your binary is now at a five-instruction deficit compared to the static_cast code, and this cost is multiplied by the number of times that std::move is used.
Some compilers have already implemented a mechanism to acknowledge meta functions like std::move and std::forward as compiler intrinsics (as noted in Vittorio's blog), and this support is done completely in the compiler front-end. As of 17.5, MSVC offers better debugging performance by acknowledging these meta functions as well! More on how we do it later in this blog, but first…
Show me some code!
Note to readers: to take advantage of the new codegen quality, you will need to provide the /permissive- compiler option. It is also worth noting that /permissive- is implied when /std:c++20 or /std:c++latest is used.
Let’s take the simple example above again and make it a full program:
#include <utility>

int main()
{
    int i = 0;
    std::move(i);
    std::forward<int&>(i);
}
Here’s the generated assembly difference between 17.4 and 17.5:
17.4:

_Arg$ = 8
??$forward@AEAH@std@@YAAEAHAEAH@Z PROC
        mov     QWORD PTR [rsp+8], rcx
        mov     rax, QWORD PTR _Arg$[rsp]
        ret     0
??$forward@AEAH@std@@YAAEAHAEAH@Z ENDP
_TEXT   ENDS
_TEXT   SEGMENT
_Arg$ = 8
??$move@AEAH@std@@YA$$QEAHAEAH@Z PROC
        mov     QWORD PTR [rsp+8], rcx
        mov     rax, QWORD PTR _Arg$[rsp]
        ret     0
??$move@AEAH@std@@YA$$QEAHAEAH@Z ENDP
_TEXT   ENDS
_TEXT   SEGMENT
i$ = 32
main PROC
        sub     rsp, 56                             ; 00000038H
        mov     DWORD PTR i$[rsp], 0
        lea     rcx, QWORD PTR i$[rsp]
        call    ??$move@AEAH@std@@YA$$QEAHAEAH@Z
        lea     rcx, QWORD PTR i$[rsp]
        call    ??$forward@AEAH@std@@YAAEAHAEAH@Z
        xor     eax, eax
        add     rsp, 56                             ; 00000038H
        ret     0
main ENDP

17.5:

i$ = 0
main PROC
$LN3:
        sub     rsp, 24
        mov     DWORD PTR i$[rsp], 0
        xor     eax, eax
        add     rsp, 24
        ret     0
main ENDP
Assembly reading tip: the main PROC above is our main function in the C++ code. The instructions that follow main PROC are what your CPU will execute when your program is first invoked. In the case above, it is clear that the code produced by 17.5 is much smaller, which can sometimes be an indication of a performance win. Here, the win comes both from the smaller code size and from the reduction in indirection, because the 'call' instructions to std::move and std::forward are elided. For the purposes of this blog we will rely on the reduced complexity of the newly generated assembly as an indicator of possible performance wins.
Yes, you read that right: the generated code in 17.5 doesn't even create assembly entries for std::move or std::forward, which makes sense, because they're never called.
Let’s look at a slightly more complicated code example:
#include <utility>

template <typename T>
void add_1_impl(T&& x)
{
    std::forward<T>(x) += std::move(1);
}

template <typename T, int N>
void add_1(T (&arr)[N])
{
    for (auto&& e : arr)
    {
        add_1_impl(e);
    }
}

int main()
{
    int arr[10]{};
    add_1(arr);
}
In this code all we want to do is add 1 to every element of the array. Here's the comparison (only showing the add_1_impl function with std::forward and std::move):
17.4:

??$add_1_impl@AEAH@@YAXAEAH@Z PROC
$LN3:
        mov     QWORD PTR [rsp+8], rcx
        sub     rsp, 72                             ; 00000048H
        mov     DWORD PTR $T1[rsp], 1
        lea     rcx, QWORD PTR $T1[rsp]
        call    ??$move@H@std@@YA$$QEAH$$QEAH@Z
        mov     eax, DWORD PTR [rax]
        mov     DWORD PTR tv72[rsp], eax
        mov     rcx, QWORD PTR x$[rsp]
        call    ??$forward@AEAH@std@@YAAEAHAEAH@Z
        mov     QWORD PTR tv68[rsp], rax
        mov     rax, QWORD PTR tv68[rsp]
        mov     eax, DWORD PTR [rax]
        mov     DWORD PTR tv70[rsp], eax
        mov     eax, DWORD PTR tv72[rsp]
        mov     ecx, DWORD PTR tv70[rsp]
        add     ecx, eax
        mov     eax, ecx
        mov     rcx, QWORD PTR tv68[rsp]
        mov     DWORD PTR [rcx], eax
        add     rsp, 72                             ; 00000048H
        ret     0
??$add_1_impl@AEAH@@YAXAEAH@Z ENDP

17.5:

??$add_1_impl@AEAH@@YAXAEAH@Z PROC
$LN3:
        mov     QWORD PTR [rsp+8], rcx
        sub     rsp, 24
        mov     DWORD PTR $T1[rsp], 1
        mov     rax, QWORD PTR x$[rsp]
        mov     eax, DWORD PTR [rax]
        add     eax, DWORD PTR $T1[rsp]
        mov     rcx, QWORD PTR x$[rsp]
        mov     DWORD PTR [rcx], eax
        add     rsp, 24
        ret     0
??$add_1_impl@AEAH@@YAXAEAH@Z ENDP
17.4 emits 21 instructions while 17.5 emits only 10, and the comparison becomes even more lopsided when you consider that we are calling add_1_impl in a loop, so the extra work in 17.4 is paid on every iteration. It is actually worse than that, because we are not accounting for the instructions executed inside the functions std::forward and std::move themselves.
Let’s make the code sample even more interesting and extreme to illustrate the visible differences. It might be observed that if we manually unroll the loop above we can get a performance win, so let’s do that using templates:
#include <utility>

template <typename T, int N, std::size_t... Is>
void add_1_impl(std::index_sequence<Is...>, T (&arr)[N])
{
    ((std::forward<T&>(arr[Is]) += std::move(1)), ...);
}

template <typename T, int N>
void add_1(T (&arr)[N])
{
    add_1_impl(std::make_index_sequence<N>{}, arr);
}

int main()
{
    int arr[10]{};
    add_1(arr);
}
The code above replaces the loop in the previous example with a single fold expression. Let's peek at the codegen (again showing only add_1_impl with std::forward and std::move; we also replace the mangled function name with add_1_impl<...>):
17.4 | 17.5 |
add_1_impl<...> PROC $LN3: mov QWORD PTR [rsp+16], rdx mov BYTE PTR [rsp+8], cl sub rsp, 248 ; 000000f8H mov DWORD PTR $T1[rsp], 1 lea rcx, QWORD PTR $T1[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv74[rsp], eax mov eax, 4 imul rax, rax, 0 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv70[rsp], rax mov rax, QWORD PTR tv70[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv72[rsp], eax mov eax, DWORD PTR tv74[rsp] mov ecx, DWORD PTR tv72[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv70[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T2[rsp], 1 lea rcx, QWORD PTR $T2[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv86[rsp], eax mov eax, 4 imul rax, rax, 1 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv82[rsp], rax mov rax, QWORD PTR tv82[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv84[rsp], eax mov eax, DWORD PTR tv86[rsp] mov ecx, DWORD PTR tv84[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv82[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T3[rsp], 1 lea rcx, QWORD PTR $T3[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv130[rsp], eax mov eax, 4 imul rax, rax, 2 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv94[rsp], rax mov rax, QWORD PTR tv94[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv128[rsp], eax mov eax, DWORD PTR tv130[rsp] mov ecx, DWORD PTR tv128[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv94[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T4[rsp], 1 lea rcx, QWORD PTR $T4[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv142[rsp], eax mov eax, 4 imul rax, rax, 3 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv138[rsp], rax mov rax, QWORD PTR tv138[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv140[rsp], eax mov eax, DWORD PTR tv142[rsp] mov ecx, DWORD PTR tv140[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv138[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T5[rsp], 1 lea rcx, QWORD PTR $T5[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv154[rsp], eax mov eax, 4 imul rax, rax, 4 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv150[rsp], rax mov rax, QWORD PTR tv150[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv152[rsp], eax mov eax, DWORD PTR tv154[rsp] mov ecx, DWORD PTR tv152[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv150[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T6[rsp], 1 lea rcx, QWORD PTR $T6[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv166[rsp], eax mov eax, 4 imul rax, rax, 5 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv162[rsp], rax mov rax, QWORD PTR tv162[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv164[rsp], eax mov eax, DWORD PTR tv166[rsp] mov ecx, DWORD PTR tv164[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv162[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T7[rsp], 1 lea rcx, QWORD PTR $T7[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv178[rsp], eax mov eax, 4 imul rax, rax, 6 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax 
call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv174[rsp], rax mov rax, QWORD PTR tv174[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv176[rsp], eax mov eax, DWORD PTR tv178[rsp] mov ecx, DWORD PTR tv176[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv174[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T8[rsp], 1 lea rcx, QWORD PTR $T8[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv190[rsp], eax mov eax, 4 imul rax, rax, 7 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv186[rsp], rax mov rax, QWORD PTR tv186[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv188[rsp], eax mov eax, DWORD PTR tv190[rsp] mov ecx, DWORD PTR tv188[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv186[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T9[rsp], 1 lea rcx, QWORD PTR $T9[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv202[rsp], eax mov eax, 4 imul rax, rax, 8 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv198[rsp], rax mov rax, QWORD PTR tv198[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv200[rsp], eax mov eax, DWORD PTR tv202[rsp] mov ecx, DWORD PTR tv200[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv198[rsp] mov DWORD PTR [rcx], eax mov DWORD PTR $T10[rsp], 1 lea rcx, QWORD PTR $T10[rsp] call ??$move@H@std@@YA$$QEAH$$QEAH@Z mov eax, DWORD PTR [rax] mov DWORD PTR tv214[rsp], eax mov eax, 4 imul rax, rax, 9 mov rcx, QWORD PTR arr$[rsp] add rcx, rax mov rax, rcx mov rcx, rax call ??$forward@AEAH@std@@YAAEAHAEAH@Z mov QWORD PTR tv210[rsp], rax mov rax, QWORD PTR tv210[rsp] mov eax, DWORD PTR [rax] mov DWORD PTR tv212[rsp], eax mov eax, DWORD PTR tv214[rsp] mov ecx, DWORD PTR tv212[rsp] add ecx, eax mov eax, ecx mov rcx, QWORD PTR tv210[rsp] mov DWORD PTR [rcx], eax add rsp, 248 ; 000000f8H ret 0 add_1_impl<...> ENDP |
add_1_impl<...> PROC $LN3: mov QWORD PTR [rsp+16], rdx mov BYTE PTR [rsp+8], cl sub rsp, 56 ; 00000038H mov DWORD PTR $T1[rsp], 1 mov eax, 4 imul rax, rax, 0 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T1[rsp] mov ecx, 4 imul rcx, rcx, 0 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T2[rsp], 1 mov eax, 4 imul rax, rax, 1 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T2[rsp] mov ecx, 4 imul rcx, rcx, 1 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T3[rsp], 1 mov eax, 4 imul rax, rax, 2 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T3[rsp] mov ecx, 4 imul rcx, rcx, 2 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T4[rsp], 1 mov eax, 4 imul rax, rax, 3 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T4[rsp] mov ecx, 4 imul rcx, rcx, 3 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T5[rsp], 1 mov eax, 4 imul rax, rax, 4 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T5[rsp] mov ecx, 4 imul rcx, rcx, 4 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T6[rsp], 1 mov eax, 4 imul rax, rax, 5 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T6[rsp] mov ecx, 4 imul rcx, rcx, 5 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T7[rsp], 1 mov eax, 4 imul rax, rax, 6 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T7[rsp] mov ecx, 4 imul rcx, rcx, 6 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T8[rsp], 1 mov eax, 4 imul rax, rax, 7 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T8[rsp] mov ecx, 4 imul rcx, rcx, 7 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T9[rsp], 1 mov eax, 4 imul rax, rax, 8 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T9[rsp] mov ecx, 4 imul rcx, rcx, 8 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax mov DWORD PTR $T10[rsp], 1 mov eax, 4 imul rax, rax, 9 mov rcx, QWORD PTR arr$[rsp] mov eax, DWORD PTR [rcx+rax] add eax, DWORD PTR $T10[rsp] mov ecx, 4 imul rcx, rcx, 9 mov rdx, QWORD PTR arr$[rsp] mov DWORD PTR [rdx+rcx], eax add rsp, 56 ; 00000038H ret 0 add_1_impl<...> ENDP |
Our 17.4 example clocks in at a whopping 226 instructions, while our 17.5 example is only 106, and the instructions in 17.4 appear to be far more costly due to the number of call-frame setups and 'call' instructions, which are not present on the 17.5 side.
OK, perhaps the examples above are contrived, and it might be far-fetched to think that code like this would truly impact performance, but let's take some code that is all but guaranteed to have some kind of real-world application:
#include <vector>

int main()
{
    std::vector<int> v;
    v.push_back(1);
}
I will save you the massive assembly output on this one and simply call out the difference in instruction counts:
- 17.4: 3136 instructions
- 17.5: 3063 instructions
Your assembly is 73 instructions shorter just from the compiler eliding these meta functions, and you can all but guarantee that where std::move and std::forward are used, they may well be used in a loop (for example, when resizing the vector and moving the elements to a new memory block). Furthermore, since these meta functions are never instantiated, the corresponding .obj, .lib, and .pdb files will be slightly smaller after upgrading to 17.5.
How we did it
Rather than trying to make the compiler aware of specific meta functions that act as named, no-op casts (i.e. casts that do not require a pointer adjustment), we took an alternative approach and implemented this new inlining ability through a C++ attribute: [[msvc::intrinsic]].
The new attribute semantically replaces a call to a function with a cast to that function's return type, provided the function definition is decorated with [[msvc::intrinsic]]. You can see how we applied this new attribute in the STL: GH3182. The reason we went down the attribute route is that we want to eventually extend the scenarios it can cover and to offer a data-driven approach for selectively decorating code with the new functionality; the latter is important for users of MSVC as well.
You can read more about the attribute and its constraints and semantics in the Microsoft-specific attributes section of our documentation.
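To make the mechanism concrete, here is a minimal sketch of a user-defined, std::move-like named cast decorated with the attribute. The name my_move is hypothetical, and this assumes MSVC 17.5 or later with /permissive-; per the attribute's constraints, the function body must consist of nothing more than a cast of its single parameter.

// Minimal sketch (hypothetical my_move). With [[msvc::intrinsic]], a call to
// this function is semantically replaced by a cast to its return type, so no
// call instruction is emitted even in an unoptimized (/Od) build.
template <typename T>
[[msvc::intrinsic]] constexpr T&& my_move(T& x) noexcept
{
    return static_cast<T&&>(x);
}

int main()
{
    int i = 0;
    my_move(i); // compiled as if it were static_cast<int&&>(i)
}

Decorating small, cast-like wrappers in your own code this way is the same opt-in mechanism the STL applies to std::move and std::forward in GH3182.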
Looking ahead…
The compiler front-end is not alone in this story of improving the performance of generated code for debugging purposes, the compiler back-end is also working very hard on some debug codegen scenarios that they will share in the coming months.
Call to action: what types of debugging optimizations matter to you? What optimizations for debug code would you like to see MSVC implement?
Especially if you work for a game studio, please help us find out what your debugging workflow looks like by taking this survey: https://aka.ms/MSVCDebugSurvey. Data like this helps the team focus on what workflows are important to you.
Onward and upward!
Closing
As always, we welcome your feedback. Feel free to send any comments through e-mail at visualcpp@microsoft.com or through Twitter @visualc. Also, feel free to follow Cameron DaCamara on Twitter @starfreakclone.
If you encounter other problems with MSVC in VS 2019/2022 please let us know via the Report a Problem option, either from the installer or the Visual Studio IDE itself. For suggestions or bug reports, let us know through DevComm.