MSVC Machine-Independent Optimizations in Visual Studio 2022 17.7

Troy Johnson

This blog post presents a selection of machine-independent optimizations added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 P3 (released July 11, 2023). Each optimization below includes assembly listings for both X64 and ARM64 to illustrate its machine-independent nature.

Optimizing Memory Across Block Boundaries

When a small struct is loaded into a register, we can optimize field accesses to extract the correct bits from the register instead of accessing the struct through memory. Historically, MSVC has limited this optimization to memory accesses within the same basic block. We are now able to perform the same optimization across block boundaries in many cases.

In the example ASM listings below, a store to the stack and a subsequent load from the stack are eliminated, resulting in less memory traffic as well as lower stack memory usage.

Example C++ Source Code:

#include <string_view>

bool compare(const std::string_view& l, const std::string_view& r) {
   return l == r;
}

Required Compiler Flags: /O2
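For intuition, here is a hypothetical smaller case of the same pattern (the struct and function names are illustrative, not taken from the compiler's test suite): the two fields are consulted in different basic blocks, which is exactly the situation the cross-block optimization now handles without spilling the struct to the stack.

```cpp
#include <cassert>
#include <cstdint>

// A 16-byte struct that fits in a single SIMD register on X64 and ARM64.
struct Span {
    const char* data;
    uint64_t    size;
};

// 'size' is read in one basic block and 'data' in another; with the
// cross-block optimization both fields can be extracted from the register
// holding the loaded struct instead of round-tripping through the stack.
bool starts_with_a(const Span& s) {
    if (s.size == 0)
        return false;              // block 1: uses only s.size
    return s.data[0] == 'a';       // block 2: uses only s.data
}
```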

X64 ASM:

17.4:
sub        rsp, 56
movups     xmm0, XMMWORD PTR [rcx]
mov        r8, QWORD PTR [rcx+8]
movaps     XMMWORD PTR $T1[rsp], xmm0
cmp        r8, QWORD PTR [rdx+8]
jne        SHORT $LN9@compare
mov        rdx, QWORD PTR [rdx]
mov        rcx, QWORD PTR $T1[rsp]
call       memcmp
test       eax, eax
jne        SHORT $LN9@compare
mov        al, 1
add        rsp, 56
ret        0
$LN9@compare:
xor        al, al
add        rsp, 56
ret        0
17.7:

sub         rsp, 40
mov         r8, QWORD PTR [rcx+8]
movups      xmm1, XMMWORD PTR [rcx]
cmp         r8, QWORD PTR [rdx+8]
jne         SHORT $LN9@compare
mov         rdx, QWORD PTR [rdx]
movq        rcx, xmm1
call        memcmp
test        eax, eax
jne         SHORT $LN9@compare
mov         al, 1
add         rsp, 40
ret         0
$LN9@compare:
xor         al, al
add         rsp, 40
ret         0

 

ARM64 ASM:

17.4:
str         lr,[sp,#-0x10]!
sub         sp,sp,#0x20
ldr         q17,[x1]
ldr         q16,[x0]
umov        x8,v17.d[1]
umov        x2,v16.d[1]
stp         q17,q16,[sp]
cmp         x2,x8
bne         |$LN9@compare|
ldr         x1,[sp]
ldr         x0,[sp,#0x10]
bl          memcmp
cbnz        w0,|$LN9@compare|
mov         w0,#1
add         sp,sp,#0x20
ldr         lr,[sp],#0x10
ret
|$LN9@compare|
mov         w0,#0
add         sp,sp,#0x20
ldr         lr,[sp],#0x10
ret
17.7:

str         lr,[sp,#-0x10]!
ldr         q17,[x1]
ldr         q16,[x0]
umov        x8,v17.d[1]
umov        x2,v16.d[1]
cmp         x2,x8
bne         |$LN9@compare|
fmov        x1,d17
fmov        x0,d16
bl          memcmp
cbnz        w0,|$LN9@compare|
mov         w0,#1
ldr         lr,[sp],#0x10
ret
|$LN9@compare|
mov         w0,#0
ldr         lr,[sp],#0x10
ret

 

Vector Logical and Arithmetic Optimizations

We continue to add patterns for recognizing vector operations that are equivalent to intrinsics or short sequences of intrinsics. An example is recognizing common forms of vector absolute difference calculations. A long series of bitwise instructions can be replaced with specialized absolute value instructions, such as vpabsd on X64 and sabd on ARM64.

Example C++ Source Code:

#include <cstdint>

void s32_1(int * __restrict a, int * __restrict b, int * __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = (b[i] - c[i]) > 0 ? (b[i] - c[i]) : (c[i] - b[i]);
    }
}

Required Flags: /O2 /arch:AVX for X64, /O2 for ARM64
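The scalar identity being recognized can be written out directly. This is a sketch of the equivalence, not compiler output; note that it relies on the difference not overflowing.

```cpp
#include <cassert>

// The select pattern in the loop body,
//     (b - c) > 0 ? (b - c) : (c - b)
// is the absolute difference |b - c| whenever b - c does not overflow.
// Recognizing this lets the vectorizer replace the subtract/compare/
// blend sequence with an absolute-difference computation per vector
// (vpabsd on X64, sabd on ARM64).
int abs_diff(int b, int c) {
    int d = b - c;
    return d > 0 ? d : c - b;   // c - b == -(b - c)
}
```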

X64 ASM:

17.4:
$LL4@s32_1:
movdqu    xmm0, XMMWORD PTR [r11+rax]
add       ecx, 4
movdqu    xmm1, XMMWORD PTR [rax]
lea       rax, QWORD PTR [rax+16]
movdqa    xmm3, xmm0
psubd     xmm3, xmm1
psubd     xmm1, xmm0
movdqa    xmm2, xmm3
pcmpgtd   xmm2, xmm4
movdqa    xmm0, xmm2
andps     xmm2, xmm3
andnps    xmm0, xmm1
orps      xmm0, xmm2
movdqu    XMMWORD PTR [r10+rax-16], xmm0
cmp       ecx, edx
jl        SHORT $LL4@s32_1
17.7:

$LL4@s32_1:
vmovdqu    xmm1, XMMWORD PTR [r10+rax]
vpsubd     xmm1, xmm1, XMMWORD PTR [rax]
vpabsd     xmm2, xmm1    ; vector abs
add        edx, 4
vmovdqu    XMMWORD PTR [rbx+rax], xmm2
lea        rax, QWORD PTR [rax+16]
cmp        edx, ecx
jl         SHORT $LL4@s32_1

 

ARM64 ASM:

17.4:
|$LL24@s32_1|
sbfiz       x9,x8,#2,#0x20
ldr         q19,[x9,x2]
add         w8,w8,#4
ldr         q18,[x9,x1]
cmp         w8,w10
sub         v16.4s,v18.4s,v19.4s
cmgt        v17.4s,v16.4s,v21.4s
and         v20.16b,v17.16b,v16.16b
sub         v16.4s,v19.4s,v18.4s
bic         v16.16b,v16.16b,v17.16b
orr         v16.16b,v16.16b,v20.16b
str         q16,[x9,x0]
blt         |$LL24@s32_1|
cmp         w8,w3
bge         |$LN27@s32_1|
17.7:

|$LL24@s32_1|
sbfiz       x9,x8,#2,#0x20
ldr         q17,[x9,x1]
add         w8,w8,#4
ldr         q16,[x9,x2]
cmp         w8,w10
sabd        v16.4s,v17.4s,v16.4s     ; vector abs
str         q16,[x9,x0]
blt         |$LL24@s32_1|

 

Vector Remainder Loops

Each iteration of a vectorized loop performs the same work as multiple iterations of the original scalar loop. Depending on the number of original scalar iterations and the vector width, there may be work left over when the vector loop ends. Although this leftover work can be completed by a copy of the original scalar loop, it may be substantial enough to warrant further vectorization at a smaller width. This optimization is now employed on targets that have a single vector width by partially packing a vector register.

In the example below, three loops are emitted for the original loop. The first processes 64 elements per vector iteration by packing 16 8-bit values into a 128-bit vector register and then unrolling the vector loop by four. The second loop processes 8 elements per vector iteration by packing 8 8-bit values into half of a 128-bit vector register. The third loop processes one element per scalar iteration.

Example C++ source code:

#include <cstdint>

void test(int8_t * __restrict d, int8_t * __restrict a, int8_t * __restrict b, int n) {
    for (int i = 0; i < n; i++) {
        d[i] = a[i] & (~b[i]);
    }
}

Required compiler flags: /O2
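A scalar model of the three-loop structure (illustrative only, not the compiler's internal form) makes the split explicit:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of how 17.7 decomposes the iteration space for the byte
// loop above: a 64-element main vector loop (four unrolled 16-byte
// vectors), an 8-element remainder loop using half a vector register,
// and finally a scalar tail of at most 7 iterations.
void test_model(int8_t* d, const int8_t* a, const int8_t* b, int n) {
    int i = 0;
    for (; i + 64 <= n; i += 64)       // main loop: 64 elements/iter
        for (int j = 0; j < 64; ++j)
            d[i + j] = a[i + j] & ~b[i + j];
    for (; i + 8 <= n; i += 8)         // remainder loop: 8 elements/iter
        for (int j = 0; j < 8; ++j)
            d[i + j] = a[i + j] & ~b[i + j];
    for (; i < n; ++i)                 // scalar tail: 1 element/iter
        d[i] = a[i] & ~b[i];
}
```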

X64 ASM:

17.4:
$LL4@test: ; 1st vector loop (64 element iters)
movdqu  xmm0, XMMWORD PTR [rcx-16]
add     edi, 64         ; 00000040H
movdqu  xmm1, XMMWORD PTR [rdx+rcx-16]
movdqu  xmm2, XMMWORD PTR [rdx+rcx]
lea     rcx, QWORD PTR [rcx+64]
pandn   xmm1, xmm3
lea     rax, QWORD PTR [r14+rcx]
pand    xmm1, xmm0
pandn   xmm2, xmm3
movdqu  xmm0, XMMWORD PTR [rcx-64]
movdqu  XMMWORD PTR [r9+rcx-80], xmm1
pand    xmm2, xmm0
movdqu  xmm0, XMMWORD PTR [rcx-48]
movdqu  xmm1, XMMWORD PTR [rdx+rcx-48]
movdqu  XMMWORD PTR [r9+rcx-64], xmm2
pandn   xmm1, xmm3
pand    xmm1, xmm0
movdqu  xmm2, XMMWORD PTR [rdx+rcx-32]
movdqu  xmm0, XMMWORD PTR [rcx-32]
pandn   xmm2, xmm3
pand    xmm2, xmm0
movdqu  XMMWORD PTR [r9+rcx-48], xmm1
movdqu  XMMWORD PTR [r9+rcx-32], xmm2
cmp     rax, rsi
jl      SHORT $LL4@test
mov     r14, QWORD PTR [rsp]
mov     rsi, QWORD PTR [rsp+24]
$LN9@test:
movsxd  rcx, edi
mov     rdx, rbx
mov     rdi, QWORD PTR [rsp+32]
mov     rbx, QWORD PTR [rsp+16]
cmp     rcx, rdx
jge     SHORT $LN3@test
sub     r8, r11
lea     rax, QWORD PTR [rcx+r11]
sub     r10, r11
sub     rdx, rcx
$LL8@test: ; Scalar loop (1 element iters)
movzx   ecx, BYTE PTR [rax+r8]
lea     rax, QWORD PTR [rax+1]
not     cl
and     cl, BYTE PTR [rax-1]
mov     BYTE PTR [rax+r10-1], cl
sub     rdx, 1
jne     SHORT $LL8@test
$LN3@test:
add     rsp, 8
ret     0
17.7:

$LL4@test: ; 1st vector loop (64 element iters)
movdqu    xmm0, XMMWORD PTR [rcx-16]
add       edx, 64          ; 00000040H
movdqu    xmm1, XMMWORD PTR [r8+rcx-16]
movdqu    xmm2, XMMWORD PTR [r8+rcx]
lea       rcx, QWORD PTR [rcx+64]
andnps    xmm1, xmm3
lea       rax, QWORD PTR [rsi+rcx]
andps     xmm1, xmm0
andnps    xmm2, xmm3
movdqu    xmm0, XMMWORD PTR [rcx-64]
movdqu    XMMWORD PTR [r9+rcx-80], xmm1
andps     xmm2, xmm0
movdqu    xmm0, XMMWORD PTR [rcx-48]
movdqu    xmm1, XMMWORD PTR [r8+rcx-48]
movdqu    XMMWORD PTR [r9+rcx-64], xmm2
andnps    xmm1, xmm3
andps     xmm1, xmm0
movdqu    xmm2, XMMWORD PTR [r8+rcx-32]
movdqu    xmm0, XMMWORD PTR [rcx-32]
andnps    xmm2, xmm3
andps     xmm2, xmm0
movdqu    XMMWORD PTR [r9+rcx-48], xmm1
movdqu    XMMWORD PTR [r9+rcx-32], xmm2
cmp       rax, rbp
jl        SHORT $LL4@test
mov       rbp, QWORD PTR [rsp+16]
mov       eax, edi
and       al, 63    ; 0000003fH
cmp       al, 8
jb        SHORT $LN27@test
$LN11@test:
mov       ecx, edi
movsxd    rax, edx
and       ecx, -8
movsxd    rsi, ecx
lea       rcx, QWORD PTR [rax+r10]
$LL10@test:          ; 2nd vector loop (8 element iters)
movq      xmm1, QWORD PTR [r8+rcx]
add       edx, 8
movq      xmm0, QWORD PTR [rcx]
andnps    xmm1, xmm3
andps     xmm1, xmm0
movq      QWORD PTR [r9+rcx], xmm1
add       rcx, 8
mov       rax, rcx
sub       rax, r10
cmp       rax, rsi
jl        SHORT $LL10@test
$LN27@test:
mov       rsi, QWORD PTR [rsp+24]
$LN9@test:
movsxd    rcx, edx
mov       rdx, rdi
mov       rdi, QWORD PTR [rsp+32]
cmp       rcx, rdx
jge       SHORT $LN3@test
sub       r11, r10
lea       rax, QWORD PTR [rcx+r10]
sub       rbx, r10
sub       rdx, rcx
npad      6
$LL8@test:            ; Scalar loop (1 element iters)
movzx     ecx, BYTE PTR [rax+r11]
lea       rax, QWORD PTR [rax+1]
not       cl
and       cl, BYTE PTR [rax-1]
mov       BYTE PTR [rax+rbx-1], cl
sub       rdx, 1
jne       SHORT $LL8@test
$LN3@test:
pop       rbx
ret       0

ARM64 ASM:

17.4 (was not vectorized):
|$LL4@test|
ldrsb       w9,[x10,x1]
sub         w3,w3,#1
ldrsb       w8,[x1]
bic         w8,w8,w9
strb        w8,[x11,x1]
add         x1,x1,#1
cbnz        w3,|$LL4@test|
|$LN3@test|
ret
17.7:

|$LL27@test|  ; 1st vector loop (64 element iters)
ldr         q17,[x2,w8 sxtw #0]
sxtw        x11,w8
ldr         q16,[x1,w8 sxtw #0]
add         x10,x11,#0x10
bic         v16.16b,v16.16b,v17.16b
ldr         q17,[x10,x2]
str         q16,[x0,w8 sxtw #0]
ldr         q16,[x10,x1]
add         w8,w8,#0x40
cmp         w8,w9
bic         v16.16b,v16.16b,v17.16b
str         q16,[x10,x0]
add         x10,x11,#0x20
ldr         q17,[x10,x2]
ldr         q16,[x10,x1]
bic         v16.16b,v16.16b,v17.16b
str         q16,[x10,x0]
add         x10,x11,#0x30
ldr         q17,[x10,x2]
ldr         q16,[x10,x1]
bic         v16.16b,v16.16b,v17.16b
str         q16,[x10,x0]
blt         |$LL27@test|
and         w9,w3,#0x3F
cmp         w9,#8
blo         |$LN30@test|
|$LN11@test|
and         w9,w3,#0xFFFFFFF8
|$LL29@test|  ; 2nd vector loop (8 element iters)
ldr         d17,[x2,w8 sxtw #0]
ldr         d16,[x1,w8 sxtw #0]
bic         v16.8b,v16.8b,v17.8b
str         d16,[x0,w8 sxtw #0]
add         w8,w8,#8
cmp         w8,w9
blt         |$LL29@test|
|$LN30@test|
cmp         w8,w3
bge         |$LN32@test|
|$LN21@test|
add         x9,x1,w8,sxtw #0
sub         x13,x2,x1
sub         w8,w3,w8
sub         x10,x0,x1
|$LL31@test|  ; Scalar loop (1 element iters)
ldrsb       w12,[x13,x9]
sub         w8,w8,#1
ldrsb       w11,[x9]
bic         w11,w11,w12
strb        w11,[x10,x9]
add         x9,x9,#1
cbnz        w8,|$LL31@test|
|$LN32@test|
ret

 

Conditional Moves/Selects

MSVC now considers more opportunities to use conditional move instructions to eliminate branches. Once such an opportunity is recognized, whether to use a conditional move or leave the branch intact is subject to legality checks and tuning heuristics. Eliminating a branch is often beneficial because it removes the possibility of the branch being mispredicted and may enable other optimizations that previously would have stopped at the block boundary formed by the branch. Nevertheless, it does convert a control dependence that could be correctly predicted into a data dependence that cannot be removed.

This optimization applies only when the compiler can prove that it is safe to unconditionally perform a store because the instruction will always store one of two possible values instead of only performing a store when the condition is true. There are numerous reasons why an unconditional store may be unsafe. For example, if the memory location is shared, then another thread may observe a store when previously there was no store to observe. As another example, the condition may have protected a store through a potentially null pointer.
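The null-pointer case mentioned above can be made concrete. In this hypothetical function (a sketch; the name is ours), the branch must be preserved:

```cpp
#include <cassert>

// The branch here guards a store through a possibly-null pointer. An
// unconditional store of a selected value would dereference p even when
// it is null, introducing a fault the original program never had, so
// the conditional-move transformation must be rejected.
void clamp_to(int* p, int hi) {
    if (p != nullptr && *p > hi)
        *p = hi;
}
```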

The optimization is enabled for X64 as well as ARM64. (ARM64 conditional execution was discussed in a previous blog post.)

Example C++ source code:

int test(int a, int i) {
    int mem[4]{0};
    if (mem[i] < a) {
        mem[i] = a;
    }
    return mem[0];
}

Required compiler flags: /O2 (for the sake of simplifying the example output, /GS- was also passed)
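In source terms, the transformation performed on this example is roughly the following rewrite (a sketch of the effect, not the compiler's internal representation):

```cpp
#include <cassert>

// Source-level sketch of the branchless form the compiler produces for
// the example above: the store happens unconditionally, writing either
// the new value or the value already present. This is safe because
// mem[] is a function-local array that cannot be observed mid-update.
int test_branchless(int a, int i) {
    int mem[4]{0};
    int old = mem[i];
    mem[i] = old < a ? a : old;   // becomes cmov/csel, no branch
    return mem[0];
}
```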

X64 ASM:

17.4:
$LN6:
sub     rsp, 24
movsxd  rax, edx
xorps   xmm0, xmm0
movups  XMMWORD PTR mem$[rsp], xmm0
cmp     DWORD PTR mem$[rsp+rax*4], ecx
jge     SHORT $LN4@test
mov     DWORD PTR mem$[rsp+rax*4], ecx
$LN4@test:
mov     eax, DWORD PTR mem$[rsp]
add     rsp, 24
ret     0
17.7:

$LN5:
sub      rsp, 24
movsxd   rax, edx
xorps    xmm0, xmm0
movups   XMMWORD PTR mem$[rsp], xmm0
lea      rdx, QWORD PTR mem$[rsp]
cmp      DWORD PTR [rdx+rax*4], ecx
lea      rdx, QWORD PTR [rdx+rax*4]
cmovge   ecx, DWORD PTR [rdx] ; cond
mov      DWORD PTR [rdx], ecx
mov      eax, DWORD PTR mem$[rsp]
add      rsp, 24
ret      0

ARM64 ASM:

17.4:
|$LN5|
sub         sp,sp,#0x10
movi        v16.16b,#0
mov         x9,sp
str         q16,[sp]
ldr         w8,[x9,w1 sxtw #2]
cmp         w8,w0
bge         |$LN2@test|
str         w0,[x9,w1 sxtw #2]
|$LN2@test|
ldr         w0,[sp]
add         sp,sp,#0x10
ret
17.7:

|$LN9|
sub         sp,sp,#0x10
movi        v16.16b,#0
mov         x9,sp
str         q16,[sp]
ldr         w8,[x9,w1 sxtw #2]
cmp         w8,w0
cselge      w8,w8,w0
str         w8,[x9,w1 sxtw #2]
ldr         w0,[sp]
add         sp,sp,#0x10
ret

 

Loop Optimizations for Array Assignments and Copies

The optimizer now recognizes more cases where an assignment or copy to contiguous memory can be raised to memset/memcpy intrinsics and then lowered to more efficient instruction sequences. For example, consider the following array assignment in a count-down loop.

Example C++ Source Code:

    char c[1024];

    for (int n = 1023; n; n--)
        c[n] = 1;

Previously, the generated code was a loop of individual byte assignments. Now, more efficient wider assignments are performed via library calls such as memset or via vector instructions.
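In source terms, the count-down loop is recognized as equivalent to a single contiguous fill (a sketch of the equivalence; the exact lowering depends on the target, as the listings below show):

```cpp
#include <cassert>

// The count-down loop writes c[1023], c[1022], ..., c[1]; the optimizer
// proves the stores cover the contiguous range c+1 .. c+1023 and raises
// the loop to the equivalent of a single fill:
//     memset(c + 1, 1, 1023);
void fill_down(char (&c)[1024]) {
    for (int n = 1023; n; n--)
        c[n] = 1;
}
```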

X64 ASM:

17.4:
mov         QWORD PTR __$ArrayPad$[rsp], rax
mov         eax, 1023
npad        2
$LL4@copy_while:
mov         BYTE PTR c$[rsp+rax], 1
sub         rax, 1
jne         SHORT $LL4@copy_while
17.7:

vmovaps  zmm0, ZMMWORD PTR __zmm@01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101
vmovq    rdx, xmm0
lea      rax, QWORD PTR c$[rsp+1]
mov      ecx, 7
npad     14
$LL11@copy_while:
vmovups  ZMMWORD PTR [rax], zmm0
vmovups  YMMWORD PTR [rax+64], ymm0
vmovups  XMMWORD PTR [rax+96], xmm0
lea      rax, QWORD PTR [rax+128]
vmovups  XMMWORD PTR [rax-16], xmm0
sub      rcx, 1
jne      SHORT $LL11@copy_while
vmovups  ZMMWORD PTR [rax], zmm0
vmovups  YMMWORD PTR [rax+64], ymm0
vmovups  XMMWORD PTR [rax+96], xmm0
mov      QWORD PTR [rax+112], rdx
mov      DWORD PTR [rax+120], edx
mov      WORD PTR [rax+124], dx
mov      BYTE PTR [rax+126], dl

ARM64 ASM:

17.4:
mov         x8,#0x3FF
mov         x9,sp
mov         w10,#1
|$LL4@copy_while|
strb        w10,[x8,x9]
sub         x8,x8,#1
cbnz        x8,|$LL4@copy_while|
17.7:

add         x0,sp,#1
mov         x2,#0x3FF
mov         w1,#1
bl          memset

Summary

We would like to thank everyone for your feedback, and we look forward to hearing more from you. Please share your thoughts and comments with us through Developer Community. You can also reach us on Twitter (@VisualC) or via email at visualcpp@microsoft.com.

10 comments


  • David Lowndes

    In the first example, you have string_view passed as const reference, which I’ve understood to date as probably being less efficient than just passing by value.
    What are the results if you pass by value?

    • Jan Ringoš

      On Windows, with current miserable X64 ABI-mandated calling convention, it’s irrelevant.
      Anything larger than a register is copied to stack first and then passed by a pointer. Passing string_view by reference is likely faster, as it skips the copy step.
      It also means upgrading to many modern C++ features (optional, span, unique_ptr) is a pessimisation.

      • Chris Green

        Though with vectorcall it will pass small structs in registers

        • Jan Ringoš

          Yeah, __vectorcall is something at least, but a new modern calling convention is long overdue.
          Way too many things (from modern C++ features to function using 5+ arguments) incur regular performance penalties because of the current ABI.

          IDK if anyone at Microsoft has investigated what it would entail, but I’ve been collecting random thoughts on such calling convention here on my github in case anyone’s interested.

        • Chris Green

          Compilers are going to want to come up with a new calling convention for AMX too. With double the # of GPRs there are a lot of regs available for passing/returning structs

          • Jan Ringoš

            I wish the new one would be a complete overhaul, but knowing the industry and Microsoft, it’s going to be some bare minimal extension :’-(

  • Chris Green

    Will the last example of widening bytes in fill loops also work with functions like std::fill_n?

    • Troy Johnson (Microsoft employee)

      Yes, this applies to std::fill_n as well. You’ll see either a call to memset or inline code that uses wide stores.

  • Adam Folwarczny

    Vector remainder loop – have you considered just reusing the wide iteration for the remainder, by updating the pointer accordingly? Then the scalar version wouldn't be executed (and a few bytes would be written twice).

  • Avery Lee

    Interesting that the ARM64 optimization for vector abs relies on lack of signed integer overflow UB, which VC++ has historically been reluctant to take advantage of. I can’t get the X64 version to generate vpabsd, but perhaps the heuristics have changed. It seems like the SSE2 output could be improved (psubd / pcmpgtd / pxor / psubd).

    In the vector remainder example, it’s odd that the X64 code generator translates (x & ~b) to inversion via andnps and then andps instead of just one andnps instruction. Furthermore, the optimization seems to be incomplete because it only works if the count is variable. If the count is hardcoded to 8, vectorization fails, and if it’s hardcoded to 24, it produces a single 128-bit vector iteration followed by 8 scalar loop iterations.
