We hear that many of you would like to see more details of the improvements which the MSVC backend team have been working on recently. This blog post presents some of the optimizations the team has implemented for Visual Studio 2022. It was co-written by one of our backend team leads, Eric Brumer, and our developer advocate, Sy Brand. Keep an eye out for more posts in the future which will dig into other optimizations!
Byteswap Identification
Changing the endianness of an integer can be an important operation in contexts where data is being transmitted between processors with different byte orders, or over the network. This is often referred to as a “byteswap”. C++23 will add a std::byteswap
function to the standard library (already implemented in VS2022 17.1), but for codebases today it’s common to see custom implementations, which might look something like this:
int byteswap(int n) {
    return ((n & 0xff) << 24u) |
           ((n & 0xff00) << 8u) |
           ((n & 0xff0000) >> 8u) |
           ((n & 0xff000000) >> 24u);
}
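For comparison, here's roughly what this looks like using the C++23 standard library facility mentioned above (a minimal sketch; it assumes a standard library that ships std::byteswap, such as VS2022 17.1 or later, and the name byteswap23 is just for illustration):
#include <bit>      // std::byteswap (C++23)
#include <cstdint>
std::uint32_t byteswap23(std::uint32_t n) {
    // Same operation as the hand-written version above.
    return std::byteswap(n);
}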
In Visual Studio 2019 16.11, the hand-written byteswap generated the following code for x64, and a similarly long sequence of instructions for Arm64:
mov eax, ecx
mov edx, ecx
and eax, 65280
shl edx, 16
or eax, edx
mov edx, ecx
shl eax, 8
sar edx, 8
and edx, 65280
shr ecx, 24
or eax, edx
or eax, ecx
ret 0
x64 and Arm64 both have instructions that carry out a byteswap, and ideally these should be used instead. Visual Studio 2022 17.3 introduced automatic byteswap identification, and now outputs the following on x64 with /O2:
bswap ecx
mov eax, ecx
ret 0
On Arm64, the results are much the same:
rev w0,w0
ret
This optimization isn’t just doing a simple pattern match on your code; it’s a generalized bit tracker. For example, this bit order reverser gets optimized to a single rbit on Arm64:
unsigned __int64 bitswap(unsigned __int64 x) {
    x = _byteswap_uint64(x);
    x = (x & 0xaaaaaaaaaaaaaaaa) >> 1 | (x & 0x5555555555555555) << 1;
    x = (x & 0xcccccccccccccccc) >> 2 | (x & 0x3333333333333333) << 2;
    x = (x & 0xf0f0f0f0f0f0f0f0) >> 4 | (x & 0x0f0f0f0f0f0f0f0f) << 4;
    return x;
}
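As a quick illustration of what bitswap computes, here is a small test driver (this is not from the original post; on MSVC, _byteswap_uint64 is declared in <stdlib.h>):
#include <cstdio>
#include <cstdlib>  // _byteswap_uint64 on MSVC
unsigned __int64 bitswap(unsigned __int64 x);  // as defined above
int main() {
    // Bit 0 of the input becomes bit 63 of the output.
    std::printf("%llx\n", bitswap(1));  // prints 8000000000000000
}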
Loop Unswitching
For the sake of code simplicity, we may choose to write a branch inside a loop even though its condition never changes between iterations. For example, if we want to pet all our cats, and also feed them when it’s feeding time, we could write this:
void process_cats(std::span<cat> cats, bool is_feeding_time) {
    for (auto&& c : cats) {
        if (is_feeding_time) {
            feed(c);
        }
        pet(c);
    }
}
Since is_feeding_time
never changes inside the function and has no side-effects, it may be wasteful to carry out that check every iteration. Better hardware branch predictors will minimize this impact, but in many cases we can see performance wins by carrying out loop unswitching. This hoists the branch outside of the loop, generating code similar to this C++:
void process_cats(std::span<cat> cats, bool is_feeding_time) {
    if (is_feeding_time) {
        for (auto&& c : cats) {
            feed(c);
            pet(c);
        }
    }
    else {
        for (auto&& c : cats) {
            pet(c);
        }
    }
}
As of Visual Studio 2022 version 17.1, MSVC may carry out loop unswitching at /O2
. This is a heuristic-driven optimization which may or may not be carried out depending on analysis results. This is because loop unswitching duplicates the loop body, which may affect instruction cache performance. You may pass the /Qvec-report:1
flag to see a report on which loops were unswitched.
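For example, assuming the function above lives in a file called cats.cpp (the file name is just for illustration), a command line along these lines requests both the optimization and the report:
cl /O2 /Qvec-report:1 /c cats.cpp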
Here is the x64 generated by Visual Studio 2019 version 16.11 with /O2
:
$LN15:
mov QWORD PTR [rsp+8], rbx
mov QWORD PTR [rsp+16], rsi
push rdi
sub rsp, 32
mov rbx, QWORD PTR [rcx]
movzx esi, dl
mov rax, QWORD PTR [rcx+8]
lea rdi, QWORD PTR [rbx+rax*4]
cmp rbx, rdi
je SHORT $LN3@process_ca
$LL4@process_ca:
test sil, sil
je SHORT $LN5@process_ca
mov ecx, DWORD PTR [rbx]
call void feed(cat)
$LN5@process_ca:
mov ecx, DWORD PTR [rbx]
call void pet(cat)
add rbx, 4
cmp rbx, rdi
jne SHORT $LL4@process_ca
$LN3@process_ca:
mov rbx, QWORD PTR [rsp+48]
mov rsi, QWORD PTR [rsp+56]
add rsp, 32
pop rdi
ret 0
The key lines to note are those directly under the $LL4@process_ca label, which carry out the test on is_feeding_time and branch on the result, all inside the loop.
Here is the code generated now with /O2
:
$LN22:
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32
mov rbx, QWORD PTR [rcx]
mov rax, QWORD PTR [rcx+8]
lea rdi, QWORD PTR [rbx+rax*4]
cmp rbx, rdi
je SHORT $LN14@process_ca
test dl, dl
je SHORT $LL11@process_ca
npad 2
$LL4@process_ca:
mov ecx, DWORD PTR [rbx]
call void feed(cat)
mov ecx, DWORD PTR [rbx]
call void pet(cat)
add rbx, 4
cmp rbx, rdi
jne SHORT $LL4@process_ca
mov rbx, QWORD PTR [rsp+48]
add rsp, 32
pop rdi
ret 0
$LL11@process_ca:
mov ecx, DWORD PTR [rbx]
call void pet(cat)
add rbx, 4
cmp rbx, rdi
jne SHORT $LL11@process_ca
$LN14@process_ca:
mov rbx, QWORD PTR [rsp+48]
add rsp, 32
pop rdi
ret 0
The code is now longer because two versions of the loop body are generated, under the $LL4@process_ca and $LL11@process_ca labels. Note also that the branch now occurs in the entry block of the function and selects between the two loop body versions:
cmp rbx, rdi
je SHORT $LN14@process_ca
test dl, dl
je SHORT $LL11@process_ca
Min/Max Chains
We have improved optimization of chains of std::min
and std::max
as of Visual Studio 2022 version 17.0.
Say we have three blankets and we want to give one to our cat. The one we pick shouldn’t be too hard. It shouldn’t be too soft. It should be just right.
We could write a function to give us the Just Right blanket from three, by picking the middle one. It could look something like this:
using softness = float;
softness just_right_blanket(softness a, softness b, softness c) {
    return std::max(std::min(a,b), std::min(std::max(a,b),c));
}
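This is the classic median-of-three idiom. As a quick sketch of its behavior, using the function above with made-up values:
// min(a,b) = 0.2, max(a,b) = 0.9, min(0.9, 0.5) = 0.5, max(0.2, 0.5) = 0.5
softness middle = just_right_blanket(0.2f, 0.9f, 0.5f);  // 0.5f, the middle softness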
For this function, VS2019 generated the following x64 with /O2 /fp:fast:
comiss xmm1, xmm0
lea rcx, QWORD PTR a$[rsp]
lea rax, QWORD PTR b$[rsp]
lea rdx, QWORD PTR a$[rsp]
movss DWORD PTR [rsp+16], xmm1
movss DWORD PTR [rsp+8], xmm0
cmovbe rax, rcx
movss DWORD PTR [rsp+24], xmm2
lea rcx, QWORD PTR c$[rsp]
comiss xmm2, DWORD PTR [rax]
cmovae rcx, rax
lea rax, QWORD PTR b$[rsp]
comiss xmm0, xmm1
cmovbe rax, rdx
movss xmm1, DWORD PTR [rax]
comiss xmm1, DWORD PTR [rcx]
cmovb rax, rcx
movss xmm0, DWORD PTR [rax]
ret 0
Arm64 codegen is similarly inefficient. Both x64 and Arm64 have single instructions for scalar floating point min and max, which we now use in VS2022 at /O2
and above with /fp:fast
. Here is the x64 code now:
movaps xmm3, xmm0
maxss xmm0, xmm1
minss xmm3, xmm1
minss xmm0, xmm2
maxss xmm0, xmm3
ret 0
And for Arm64:
fmax s16,s0,s1
fmin s17,s0,s1
fmin s18,s16,s2
fmax s0,s18,s17
ret
Backwards Loop Vectorization
Say I run a cat shelter with 32 cats and want to count how many crunchies they leave behind in their bowls after mealtime. So I write a function which takes a pointer to the first bowl and sums the leftovers, like so (yes, I know I could use std::accumulate):
int count_leftovers(int* bowl_ptr) {
    int result = 0;
    for (int i = 0; i < 32; ++i, ++bowl_ptr) {
        result += *bowl_ptr;
    }
    return result;
}
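As an aside, the std::accumulate version alluded to above might look something like this (a sketch, not code from the original post):
#include <numeric>  // std::accumulate
int count_leftovers_accumulate(int* bowl_ptr) {
    // Sum the 32 bowls starting from the first one.
    return std::accumulate(bowl_ptr, bowl_ptr + 32, 0);
}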
The forward loop works and generates good code! But then I realize that my desk is actually at the far end of the room, so I need to walk all the way to the start of the line to begin counting. I decide instead to take a pointer to the last bowl and work backwards:
int count_leftovers(int* bowl_ptr) {
    int result = 0;
    // change ++bowl_ptr to --bowl_ptr
    for (int i = 0; i < 32; ++i, --bowl_ptr) {
        result += *bowl_ptr;
    }
    return result;
}
Unfortunately, if I were using VS2019, this loop would not be vectorized. Here is the code generated with /O2
:
xor eax, eax
mov edx, eax
mov r8d, eax
mov r9d, eax
add rcx, -8
lea r10d, QWORD PTR [rax+8]
$LL4@count_left:
add eax, DWORD PTR [rcx+8]
add r9d, DWORD PTR [rcx+4]
add r8d, DWORD PTR [rcx]
add edx, DWORD PTR [rcx-4]
lea rcx, QWORD PTR [rcx-16]
sub r10, 1
jne SHORT $LL4@count_left
lea ecx, DWORD PTR [rdx+r8]
add ecx, r9d
add eax, ecx
ret 0
The loop is unrolled but it is not vectorized.
We enabled vectorization for backwards-strided loops in VS2022 17.1. The code generated will depend a lot on the flags you use, particularly the /arch
flag for enabling use of AVX instructions instead of the default SSE2 ones.
Here is the code generated for /O2 /arch:AVX2
:
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
mov eax, 2
npad 3
$LL4@count_left:
vmovd xmm1, DWORD PTR [rcx]
vpinsrd xmm1, xmm1, DWORD PTR [rcx-4], 1
vpinsrd xmm1, xmm1, DWORD PTR [rcx-8], 2
vpinsrd xmm1, xmm1, DWORD PTR [rcx-12], 3
vmovd xmm0, DWORD PTR [rcx-16]
vpinsrd xmm0, xmm0, DWORD PTR [rcx-20], 1
vpinsrd xmm0, xmm0, DWORD PTR [rcx-24], 2
vpinsrd xmm0, xmm0, DWORD PTR [rcx-28], 3
lea rcx, QWORD PTR [rcx-64]
vinsertf128 ymm0, ymm1, xmm0, 1
vmovd xmm1, DWORD PTR [rcx+32]
vpinsrd xmm1, xmm1, DWORD PTR [rcx+28], 1
vpinsrd xmm1, xmm1, DWORD PTR [rcx+24], 2
vpinsrd xmm1, xmm1, DWORD PTR [rcx+20], 3
vpaddd ymm2, ymm0, ymm2
vmovd xmm0, DWORD PTR [rcx+16]
vpinsrd xmm0, xmm0, DWORD PTR [rcx+12], 1
vpinsrd xmm0, xmm0, DWORD PTR [rcx+8], 2
vpinsrd xmm0, xmm0, DWORD PTR [rcx+4], 3
vinsertf128 ymm0, ymm1, xmm0, 1
vpaddd ymm3, ymm0, ymm3
sub rax, 1
jne $LL4@count_left
vpaddd ymm0, ymm3, ymm2
vphaddd ymm1, ymm0, ymm0
vphaddd ymm2, ymm1, ymm1
vextracti128 xmm0, ymm2, 1
vpaddd xmm0, xmm2, xmm0
vmovd eax, xmm0
vzeroupper
ret 0
This both unrolls and vectorizes the loop. Fully explaining AVX2 vector instructions is a job for a different blog post (maybe check out this one), but the basic idea is that all those vpinsrd
instructions are loading the data from memory into vector registers, then the vpaddd
/vphaddd
instructions carry out addition on big chunks of data all at the same time.
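To make the idea concrete, here is a rough hand-written sketch of the same reduction using AVX2 intrinsics. This is an illustration of the concept rather than the compiler’s actual output: it uses simple forward unaligned loads instead of the backwards vpinsrd sequence above, and it assumes AVX2 is available (for example via /arch:AVX2):
#include <immintrin.h>
int count_leftovers_avx2_sketch(const int* bowl_ptr) {
    __m256i sum = _mm256_setzero_si256();
    for (int i = 0; i < 32; i += 8) {
        // Load eight ints at once and accumulate them lane by lane (vpaddd).
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bowl_ptr + i));
        sum = _mm256_add_epi32(sum, v);
    }
    // Fold the eight lanes down to a single int (vextracti128 / vpaddd / vphaddd).
    __m128i lo = _mm256_castsi256_si128(sum);
    __m128i hi = _mm256_extracti128_si256(sum, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_hadd_epi32(s, s);
    s = _mm_hadd_epi32(s, s);
    return _mm_cvtsi128_si32(s);
}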
Send us your feedback
We hope you found these details interesting! If you have ideas for similar posts you’d like to see, please let us know. We are also interested in your feedback to continue to improve our tools. The comments below are open. Feedback can also be shared through Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.