This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 P3 (released July 11, 2023). Each optimization below shows assembly code for both X64 and ARM64 to show the machine-independent nature of the optimization.
Optimizing Memory Across Block Boundaries
When a small struct is loaded into a register, we can optimize field accesses to extract the correct bits from the register instead of accessing it through memory. Historically in MSVC, this optimization has been limited to memory accesses within the same basic block. We are now able to perform this same optimization across block boundaries in many cases.
In the example ASM listings below, a load to the stack and a store from the stack are eliminated, resulting in less memory traffic as well as lower stack memory usage.
Example C++ Source Code:
#include <string_view>
bool compare(const std::string_view& l, const std::string_view& r) {
  return l == r;
}
Required Compiler Flags: /O2
X64 ASM:
17.4 | 17.7 |
sub        rsp, 56 movups xmm0, XMMWORD PTR [rcx] mov       r8, QWORD PTR [rcx+8] movaps XMMWORD PTR $T1[rsp], xmm0 cmp       r8, QWORD PTR [rdx+8] jne        SHORT $LN9@compare mov       rdx, QWORD PTR [rdx] mov       rcx, QWORD PTR $T1[rsp] call       memcmp test       eax, eax jne        SHORT $LN9@compare mov       al, 1 add        rsp, 56 ret        0 $LN9@compare: xor        al, al add        rsp, 56 ret        0 |
sub         rsp, 40 mov       r8, QWORD PTR [rcx+8] movups xmm1, XMMWORD PTR [rcx] cmp       r8, QWORD PTR [rdx+8] jne         SHORT $LN9@compare mov       rdx, QWORD PTR [rdx] movq     rcx, xmm1 call        memcmp test        eax, eax jne         SHORT $LN9@compare mov       al, 1 add        rsp, 40 ret         0 $LN9@compare: xor         al, al add        rsp, 40 ret         0 |
ARM64 ASM:
17.4 | 17.7 |
str        lr,[sp,#-0x10]! sub        sp,sp,#0x20 ldr        q17,[x1] ldr        q16,[x0] umov       x8,v17.d[1] umov       x2,v16.d[1] stp        q17,q16,[sp] cmp        x2,x8 bne        |$LN9@compare| ldr        x1,[sp] ldr        x0,[sp,#0x10] bl         memcmp cbnz       w0,|$LN9@compare| mov        w0,#1 add        sp,sp,#0x20 ldr        lr,[sp],#0x10 ret |$LN9@compare| mov        w0,#0 add        sp,sp,#0x20 ldr        lr,[sp],#0x10 ret |
str        lr,[sp,#-0x10]! ldr        q17,[x1] ldr        q16,[x0] umov       x8,v17.d[1] umov       x2,v16.d[1] cmp        x2,x8 bne        |$LN9@compare| fmov       x1,d17 fmov       x0,d16 bl         memcmp cbnz       w0,|$LN9@compare| mov        w0,#1 ldr        lr,[sp],#0x10 ret |$LN9@compare| mov        w0,#0 ldr        lr,[sp],#0x10 ret |
Vector Logical and Arithmetic Optimizations
We continue to add patterns for recognizing vector operations that are equivalent to intrinsics or short sequences of intrinsics. An example is recognizing common forms of vector absolute difference calculations. A long series of bitwise instructions can be replaced with specialized absolute value instructions, such as vpabsd on X64 and sabd on ARM64.
Example C++ Source Code:
#include <cstdint> void s32_1(int * __restrict a, int * __restrict b, int * __restrict c, int n) { for (int i = 0; i < n; i++) { a[i] = (b[i] - c[i]) > 0 ? (b[i] - c[i]) : (c[i] - b[i]); } }
Required Flags: /O2 /arch:AVX for X64, /O2 for ARM64
X64 ASM:
17.4 | 17.7 |
$LL4@s32_1: movdqu xmm0, XMMWORD PTR [r11+rax] add    ecx, 4 movdqu xmm1, XMMWORD PTR [rax] lea    rax, QWORD PTR [rax+16] movdqa xmm3, xmm0 psubd  xmm3, xmm1 psubd  xmm1, xmm0 movdqa xmm2, xmm3 pcmpgtd xmm2, xmm4 movdqa xmm0, xmm2 andps  xmm2, xmm3 andnps xmm0, xmm1 orps   xmm0, xmm2 movdqu XMMWORD PTR [r10+rax-16], xmm0 cmp    ecx, edx jl     SHORT $LL4@s32_1 |
$LL4@s32_1: vmovdqu xmm1, XMMWORD PTR [r10+rax] vpsubd     xmm1, xmm1, XMMWORD PTR [rax] vpabsd    xmm2, xmm1   ; vector abs add        edx, 4 vmovdqu    XMMWORD PTR [rbx+rax], xmm2 lea        rax, QWORD PTR [rax+16] cmp        edx, ecx jl         SHORT $LL4@s32_1 |
ARM64 ASM:
17.4 | 17.7 |
|$LL24@s32_1| sbfiz      x9,x8,#2,#0x20 ldr        q19,[x9,x2] add        w8,w8,#4 ldr        q18,[x9,x1] cmp        w8,w10 sub        v16.4s,v18.4s,v19.4s cmgt       v17.4s,v16.4s,v21.4s and        v20.16b,v17.16b,v16.16b sub        v16.4s,v19.4s,v18.4s bic        v16.16b,v16.16b,v17.16b orr        v16.16b,v16.16b,v20.16b str        q16,[x9,x0] blt        |$LL24@s32_1| cmp        w8,w3 bge        |$LN27@s32_1| |
|$LL24@s32_1| sbfiz      x9,x8,#2,#0x20 ldr        q17,[x9,x1] add        w8,w8,#4 ldr        q16,[x9,x2] cmp        w8,w10 sabd       v16.4s,v17.4s,v16.4s    ; vector abs str        q16,[x9,x0] blt        |$LL24@s32_1| |
Vector Remainder Loops
Each iteration of a vectorized loop computes the same work as multiple iterations of the original scalar loop. Depending on the number of original scalar iterations and the vector width, there may be work left over when the vector loop ends. Although this work can be completed by a copy of the original scalar loop, it may contain sufficient work to warrant further vectorization using a smaller vector width. This optimization is now employed for targets that have a single vector width by partially packing a vector.
In the example below, three loops are emitted for the original loop. The first processes 64 elements per vector iteration by packing 16 8-bit values into a 128-bit vector register and then unrolling the vector loop by four. The second loop processes 8 elements per vector iteration by packing 8 8-bit values into half of a 128-bit vector register. The third loop processes one element per scalar iteration.
Example C++ source code:
#include <cstdint> void test(int8_t * __restrict d, int8_t * __restrict a, int8_t * __restrict b, int n) { for (int i = 0; i < n; i++) { d[i] = a[i] & (~b[i]); } }
Required compiler flags: /O2
X64 ASM:
17.4 | 17.7 |
$LL4@test: ; 1st vector loop (64 element iters) movdqu xmm0, XMMWORD PTR [rcx-16] add    edi, 64        ; 00000040H movdqu xmm1, XMMWORD PTR [rdx+rcx-16] movdqu xmm2, XMMWORD PTR [rdx+rcx] lea    rcx, QWORD PTR [rcx+64] pandn  xmm1, xmm3 lea    rax, QWORD PTR [r14+rcx] pand   xmm1, xmm0 pandn  xmm2, xmm3 movdqu xmm0, XMMWORD PTR [rcx-64] movdqu XMMWORD PTR [r9+rcx-80], xmm1 pand   xmm2, xmm0 movdqu xmm0, XMMWORD PTR [rcx-48] movdqu xmm1, XMMWORD PTR [rdx+rcx-48] movdqu XMMWORD PTR [r9+rcx-64], xmm2 pandn  xmm1, xmm3 pand   xmm1, xmm0 movdqu xmm2, XMMWORD PTR [rdx+rcx-32] movdqu xmm0, XMMWORD PTR [rcx-32] pandn  xmm2, xmm3 pand   xmm2, xmm0 movdqu XMMWORD PTR [r9+rcx-48], xmm1 movdqu XMMWORD PTR [r9+rcx-32], xmm2 cmp    rax, rsi jl     SHORT $LL4@test mov    r14, QWORD PTR [rsp] mov    rsi, QWORD PTR [rsp+24] $LN9@test: movsxd rcx, edi mov    rdx, rbx mov    rdi, QWORD PTR [rsp+32] mov    rbx, QWORD PTR [rsp+16] cmp    rcx, rdx jge    SHORT $LN3@test sub    r8, r11 lea    rax, QWORD PTR [rcx+r11] sub    r10, r11 sub    rdx, rcx $LL8@test: ; Scalar loop (1 element iters) movzx  ecx, BYTE PTR [rax+r8] lea    rax, QWORD PTR [rax+1] not    cl and    cl, BYTE PTR [rax-1] mov    BYTE PTR [rax+r10-1], cl sub    rdx, 1 jne    SHORT $LL8@test $LN3@test: add    rsp, 8 ret    0 |
$LL4@test: ; 1st vector loop (64 element iters) movdqu   xmm0, XMMWORD PTR [rcx-16] add       edx, 64        ; 00000040H movdqu   xmm1, XMMWORD PTR [r8+rcx-16] movdqu   xmm2, XMMWORD PTR [r8+rcx] lea       rcx, QWORD PTR [rcx+64] andnps   xmm1, xmm3 lea       rax, QWORD PTR [rsi+rcx] andps     xmm1, xmm0 andnps   xmm2, xmm3 movdqu   xmm0, XMMWORD PTR [rcx-64] movdqu   XMMWORD PTR [r9+rcx-80], xmm1 andps     xmm2, xmm0 movdqu   xmm0, XMMWORD PTR [rcx-48] movdqu   xmm1, XMMWORD PTR [r8+rcx-48] movdqu   XMMWORD PTR [r9+rcx-64], xmm2 andnps   xmm1, xmm3 andps     xmm1, xmm0 movdqu   xmm2, XMMWORD PTR [r8+rcx-32] movdqu   xmm0, XMMWORD PTR [rcx-32] andnps   xmm2, xmm3 andps     xmm2, xmm0 movdqu   XMMWORD PTR [r9+rcx-48], xmm1 movdqu   XMMWORD PTR [r9+rcx-32], xmm2 cmp       rax, rbp jl        SHORT $LL4@test mov       rbp, QWORD PTR [rsp+16] mov       eax, edi and       al, 63   ; 0000003fH cmp       al, 8 jb        SHORT $LN27@test $LN11@test: mov       ecx, edi movsxd   rax, edx and       ecx, -8 movsxd   rsi, ecx lea       rcx, QWORD PTR [rax+r10] $LL10@test:         ; 2nd vector loop (8 element iters) movq      xmm1, QWORD PTR [r8+rcx] add       edx, 8 movq      xmm0, QWORD PTR [rcx] andnps   xmm1, xmm3 andps     xmm1, xmm0 movq      QWORD PTR [r9+rcx], xmm1 add       rcx, 8 mov       rax, rcx sub       rax, r10 cmp       rax, rsi jl        SHORT $LL10@test $LN27@test: mov       rsi, QWORD PTR [rsp+24] $LN9@test: movsxd   rcx, edx mov       rdx, rdi mov       rdi, QWORD PTR [rsp+32] cmp       rcx, rdx jge       SHORT $LN3@test sub       r11, r10 lea       rax, QWORD PTR [rcx+r10] sub       rbx, r10 sub       rdx, rcx npad      6 $LL8@test:           ; Scalar loop (1 element iters) movzx     ecx, BYTE PTR [rax+r11] lea       rax, QWORD PTR [rax+1] not       cl and       cl, BYTE PTR [rax-1] mov       BYTE PTR [rax+rbx-1], cl sub       rdx, 1 jne       SHORT $LL8@test $LN3@test: pop       rbx ret       0 |
ARM64 ASM:
17.4 (was not vectorized) | 17.7 |
|$LL4@test| ldrsb      w9,[x10,x1] sub        w3,w3,#1 ldrsb      w8,[x1] bic        w8,w8,w9 strb       w8,[x11,x1] add        x1,x1,#1 cbnz       w3,|$LL4@test| |$LN3@test| ret |
|$LL27@test| ; 1st vector loop (64 element iters) ldr        q17,[x2,w8 sxtw #0] sxtw       x11,w8 ldr        q16,[x1,w8 sxtw #0] add        x10,x11,#0x10 bic        v16.16b,v16.16b,v17.16b ldr        q17,[x10,x2] str        q16,[x0,w8 sxtw #0] ldr        q16,[x10,x1] add        w8,w8,#0x40 cmp        w8,w9 bic        v16.16b,v16.16b,v17.16b str        q16,[x10,x0] add        x10,x11,#0x20 ldr        q17,[x10,x2] ldr        q16,[x10,x1] bic        v16.16b,v16.16b,v17.16b str        q16,[x10,x0] add        x10,x11,#0x30 ldr        q17,[x10,x2] ldr        q16,[x10,x1] bic        v16.16b,v16.16b,v17.16b str        q16,[x10,x0] blt        |$LL27@test| and        w9,w3,#0x3F cmp        w9,#8 blo        |$LN30@test| |$LN11@test| and        w9,w3,#0xFFFFFFF8 |$LL29@test| ; 2nd vector loop (8 element iters) ldr        d17,[x2,w8 sxtw #0] ldr        d16,[x1,w8 sxtw #0] bic        v16.8b,v16.8b,v17.8b str        d16,[x0,w8 sxtw #0] add        w8,w8,#8 cmp        w8,w9 blt        |$LL29@test| |$LN30@test| cmp        w8,w3 bge        |$LN32@test| |$LN21@test| add        x9,x1,w8,sxtw #0 sub        x13,x2,x1 sub        w8,w3,w8 sub        x10,x0,x1 |$LL31@test| ; Scalar loop (1 element iters) ldrsb      w12,[x13,x9] sub        w8,w8,#1 ldrsb      w11,[x9] bic        w11,w11,w12 strb       w11,[x10,x9] add        x9,x9,#1 cbnz       w8,|$LL31@test| |$LN32@test| ret |
Conditional Moves/Selects
MSVC now considers more opportunities to use conditional move instructions to eliminate branches. Once such an opportunity is recognized, whether to use a conditional move or leave the branch intact is subject to legality checks and tuning heuristics. Eliminating a branch is often beneficial because it removes the possibility of the branch being mispredicted and may enable other optimizations that previously would have stopped at the block boundary formed by the branch. Nevertheless, it does convert a control dependence that could be correctly predicted into a data dependence that cannot be removed.
This optimization applies only when the compiler can prove that it is safe to unconditionally perform a store because the instruction will always store one of two possible values instead of only performing a store when the condition is true. There are numerous reasons why an unconditional store may be unsafe. For example, if the memory location is shared, then another thread may observe a store when previously there was no store to observe. As another example, the condition may have protected a store through a potentially null pointer.
The optimization is enabled for X64 as well as ARM64. (ARM64 conditional execution was discussed in a previous blog post.)
Example C++ source code:
int test(int a, int i) { int mem[4]{0}; if (mem[i] < a) { mem[i] = a; } return mem[0]; }
Required compiler flags: /O2 (for the sake of simplifying the example output, /GS- was also passed)
X64 ASM:
17.4 | 17.7 |
$LN6: sub    rsp, 24 movsxd rax, edx xorps  xmm0, xmm0 movups XMMWORD PTR mem$[rsp], xmm0 cmp    DWORD PTR mem$[rsp+rax*4], ecx jge    SHORT $LN4@test mov    DWORD PTR mem$[rsp+rax*4], ecx $LN4@test: mov    eax, DWORD PTR mem$[rsp] add    rsp, 24 ret    0 |
$LN5: sub      rsp, 24 movsxd  rax, edx xorps    xmm0, xmm0 movups  XMMWORD PTR mem$[rsp], xmm0 lea      rdx, QWORD PTR mem$[rsp] cmp      DWORD PTR [rdx+rax*4], ecx lea      rdx, QWORD PTR [rdx+rax*4] cmovge  ecx, DWORD PTR [rdx] ; cond mov      DWORD PTR [rdx], ecx mov      eax, DWORD PTR mem$[rsp] add      rsp, 24 ret      0 |
ARM64 ASM:
17.4 | 17.7 |
|$LN5| sub        sp,sp,#0x10 movi       v16.16b,#0 mov        x9,sp str        q16,[sp] ldr        w8,[x9,w1 sxtw #2] cmp        w8,w0 bge        |$LN2@test| str        w0,[x9,w1 sxtw #2] |$LN2@test| ldr        w0,[sp] add        sp,sp,#0x10 ret |
|$LN9| sub        sp,sp,#0x10 movi       v16.16b,#0 mov        x9,sp str        q16,[sp] ldr        w8,[x9,w1 sxtw #2] cmp        w8,w0 cselge     w8,w8,w0 str        w8,[x9,w1 sxtw #2] ldr        w0,[sp] add        sp,sp,#0x10 ret |
Loop Optimizations for Array Assignments and Copies
The optimizer now recognizes more cases where an assignment/copy to contiguous memory can be raised to memset/memcopy intrinsics and lowered to more efficient instruction sequences. For example, consider the following array-assignment in a count-down loop.
Example C++ Source Code:
char c[1024]; for (int n = 1023; n; n--) c[n] = 1;
Previously, the code generated was a loop with individual byte-assignments. Now, more efficient wider assignments are performed via block-copy libraries or vector instructions.
X64 ASM:
17.4 | 17.7 |
mov        QWORD PTR __$ArrayPad$[rsp], rax mov        eax, 1023 npad       2 $LL4@copy_while: mov        BYTE PTR c$[rsp+rax], 1 sub         rax, 1 jne         SHORT $LL4@copy_while |
vmovaps zmm0, ZMMWORD PTR __zmm@01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101 vmovq    rdx, xmm0 lea      rax, QWORD PTR c$[rsp+1] mov      ecx, 7 npad     14 $LL11@copy_while: vmovups ZMMWORD PTR [rax], zmm0 vmovups YMMWORD PTR [rax+64], ymm0 vmovups XMMWORD PTR [rax+96], xmm0 lea      rax, QWORD PTR [rax+128] vmovups XMMWORD PTR [rax-16], xmm0 sub      rcx, 1 jne      SHORT $LL11@copy_whilevmovups ZMMWORD PTR [rax], zmm0 vmovups YMMWORD PTR [rax+64], ymm0 vmovups XMMWORD PTR [rax+96], xmm0 mov      QWORD PTR [rax+112], rdx mov      DWORD PTR [rax+120], edx mov      WORD PTR [rax+124], dx mov      BYTE PTR [rax+126], dl |
ARM64 ASM:
17.4 | 17.7 |
mov        x8,#0x3FF mov        x9,sp mov        w10,#1|$LL4@copy_while| strb       w10,[x8,x9] sub        x8,x8,#1 cbnz       x8,|$LL4@copy_while| |
add        x0,sp,#1 mov        x2,#0x3FF mov        w1,#1 bl         memset |
Summary
We would like to thank everyone for giving us feedback and we are looking forward to hearing more from you. Please share your thoughts and comments with us through Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.
Interesting that the ARM64 optimization for vector abs relies on lack of signed integer overflow UB, which VC++ has historically been reluctant to take advantage of. I can’t get the X64 version to generate vpabsd, but perhaps the heuristics have changed. It seems like the SSE2 output could be improved (psubd / pcmpgtd / pxor / psubd).
In the vector remainder example, it’s odd that the X64 code generator translates (x & ~b) to inversion via andnps and then andps instead of just one andnps instruction. Furthermore, the optimization seems to be incomplete because it only works if the count is variable. If the count is hardcoded to 8, vectorization fails, and if it’s hardcoded to 24, it produces a single 128-bit vector iteration followed by 8 scalar loop iterations.
Vector reminder loop – have considered to just reuse the wide iteration for the reminder, just by updating pointer accordingly ? Then scalar version wouldn’t be executed (and few bytes would be written twice)
Will the last example of widening bytes in fill loops also work with functions like std::fill_n?
Yes, this applies to std::fill_n as well. You’ll see either a call to memset or inline code that uses wide stores.
In the first example, you have string_view passed as const reference, which I’ve understood to date as probably being less efficient than just passing by value.
What are the results if you pass by value?
On Windows, with current miserable X64 ABI-mandated calling convention, it’s irrelevant.
Anything larger than a register is copied to stack first and then passed by a pointer. Passing string_view by reference is likely faster, as it skips the copy step.
It also means upgrading to many modern C++ features (optional, span, unique_ptr) is a pessimisation.
Though with vectorcall it will pass small structs in registers
Compilers are going to want to come up with a new calling convention for AMX too. With double the # of GPRs there are a lot of regs available for passing/returning structs
I wish the new one would be a complete overhaul, but knowing the industry and Microsoft, it’s going to be some bare minimal extension :’-(
Yeah, __vectorcall is something at least, but new modern calling convention is a long overdue.
Way too many things (from modern C++ features to function using 5+ arguments) incur regular performance penalties because of the current ABI.
IDK if anyone at Microsoft has investigated what it would entail, but I’ve been collecting random thoughts on such calling convention here on my github in case anyone’s interested.