We hear that many of you would like to see more details of the improvements which the MSVC backend team have been working on recently. This blog post presents some of the optimizations the team has implemented for Visual Studio 2022. It was co-written by one of our backend team leads, Eric Brumer, and our developer advocate, Sy Brand. Keep an eye out for more posts in the future which will dig into other optimizations!
Byteswap Identification
Changing the endianness of an integer can be an important operation in contexts where data is being transmitted between processors with different byte orders, or over the network. This is often referred to as a “byteswap”. C++23 will add a std::byteswap
function to the standard library (already implemented in VS2022 17.1), but for codebases today it’s common to see custom implementations, which might look something like this:
int byteswap(int n) {
    return ((n & 0xff) << 24u) |
           ((n & 0xff00) << 8u) |
           ((n & 0xff0000) >> 8u) |
           ((n & 0xff000000) >> 24u);
}
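For comparison, here's roughly what this looks like using the C++23 standard library facility mentioned above (a minimal sketch; it assumes a standard library that ships std::byteswap, such as VS2022 17.1 or later, and the name byteswap23 is just for illustration):
#include <bit>      // std::byteswap (C++23)
#include <cstdint>
std::uint32_t byteswap23(std::uint32_t n) {
    // Same operation as the hand-written version above.
    return std::byteswap(n);
}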
In Visual Studio 2019 16.11, the hand-written byteswap generated the following code for x64, and a similarly long sequence of instructions for Arm64:
mov eax, ecx
mov edx, ecx
and eax, 65280
shl edx, 16
or eax, edx
mov edx, ecx
shl eax, 8
sar edx, 8
and edx, 65280
shr ecx, 24
or eax, edx
or eax, ecx
ret 0
x64 and Arm64 both have instructions that carry out a byteswap, and ideally these should be used instead. Visual Studio 2022 17.3 introduced automatic byteswap identification, and now outputs the following on x64 with /O2:
bswap ecx
mov eax, ecx
ret 0
On Arm64, the results are much the same:
rev w0,w0
ret
This optimization isn’t just doing a simple pattern match on your code; it’s a generalized bit tracker. For example, this bit order reverser gets optimized to a single rbit on Arm64:
unsigned __int64 bitswap(unsigned __int64 x) {
    x = _byteswap_uint64(x);
    x = (x & 0xaaaaaaaaaaaaaaaa) >> 1 | (x & 0x5555555555555555) << 1;
    x = (x & 0xcccccccccccccccc) >> 2 | (x & 0x3333333333333333) << 2;
    x = (x & 0xf0f0f0f0f0f0f0f0) >> 4 | (x & 0x0f0f0f0f0f0f0f0f) << 4;
    return x;
}
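As a quick illustration of what bitswap computes, here is a small test driver (this is not from the original post; on MSVC, _byteswap_uint64 is declared in <stdlib.h>):
#include <cstdio>
#include <cstdlib>  // _byteswap_uint64 on MSVC
unsigned __int64 bitswap(unsigned __int64 x);  // as defined above
int main() {
    // Bit 0 of the input becomes bit 63 of the output.
    std::printf("%llx\n", bitswap(1));  // prints 8000000000000000
}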
Loop Unswitching
For the sake of code simplicity, we may choose to write a branch inside a loop even though its condition never changes between iterations. For example, if we want to pet all our cats, and also feed them when it’s feeding time, we could write this:
void process_cats(std::span<cat> cats, bool is_feeding_time) {
    for (auto&& c : cats) {
        if (is_feeding_time) {
            feed(c);
        }
        pet(c);
    }
}
Since is_feeding_time
never changes inside the function and has no side-effects, it may be wasteful to carry out that check every iteration. Better hardware branch predictors will minimize this impact, but in many cases we can see performance wins by carrying out loop unswitching. This hoists the branch outside of the loop, generating code similar to this C++:
void process_cats(std::span<cat> cats, bool is_feeding_time) {
    if (is_feeding_time) {
        for (auto&& c : cats) {
            feed(c);
            pet(c);
        }
    }
    else {
        for (auto&& c : cats) {
            pet(c);
        }
    }
}
As of Visual Studio 2022 version 17.1, MSVC may carry out loop unswitching at /O2
. This is a heuristic-driven optimization which may or may not be carried out depending on analysis results. This is because loop unswitching duplicates the loop body, which may affect instruction cache performance. You may pass the /Qvec-report:1
flag to see a report on which loops were unswitched.
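For example, assuming the function above lives in a file called cats.cpp (the file name is just for illustration), a command line along these lines requests both the optimization and the report:
cl /O2 /Qvec-report:1 /c cats.cpp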
Here is the x64 generated by Visual Studio 2019 version 16.11 with /O2
:
$LN15:
mov QWORD PTR [rsp+8], rbx
mov QWORD PTR [rsp+16], rsi
push rdi
sub rsp, 32
mov rbx, QWORD PTR [rcx]
movzx esi, dl
mov rax, QWORD PTR [rcx+8]
lea rdi, QWORD PTR [rbx+rax*4]
cmp rbx, rdi
je SHORT $LN3@process_ca
$LL4@process_ca:
test sil, sil
je SHORT $LN5@process_ca
mov ecx, DWORD PTR [rbx]
call void feed(cat)
$LN5@process_ca:
mov ecx, DWORD PTR [rbx]
call void pet(cat)
add rbx, 4
cmp rbx, rdi
jne SHORT $LL4@process_ca
$LN3@process_ca:
mov rbx, QWORD PTR [rsp+48]
mov rsi, QWORD PTR [rsp+56]
add rsp, 32
pop rdi
ret 0
The key lines to note are those directly under the $LL4@process_ca label, which carry out the test on is_feeding_time and branch on the result, all inside the loop.
Here is the code generated now with /O2
:
$LN22:
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32
mov rbx, QWORD PTR [rcx]
mov rax, QWORD PTR [rcx+8]
lea rdi, QWORD PTR [rbx+rax*4]
cmp rbx, rdi
je SHORT $LN14@process_ca
test dl, dl
je SHORT $LL11@process_ca
npad 2
$LL4@process_ca:
mov ecx, DWORD PTR [rbx]
call void feed(cat)
mov ecx, DWORD PTR [rbx]
call void pet(cat)
add rbx, 4
cmp rbx, rdi
jne SHORT $LL4@process_ca
mov rbx, QWORD PTR [rsp+48]
add rsp, 32
pop rdi
ret 0
$LL11@process_ca:
mov ecx, DWORD PTR [rbx]
call void pet(cat)
add rbx, 4
cmp rbx, rdi
jne SHORT $LL11@process_ca
$LN14@process_ca:
mov rbx, QWORD PTR [rsp+48]
add rsp, 32
pop rdi
ret 0
The code is now longer because two versions of the loop body are generated, under the $LL4@process_ca and $LL11@process_ca labels. Note also that the branch now occurs in the entry block of the function and selects between the two loop body versions:
cmp rbx, rdi
je SHORT $LN14@process_ca
test dl, dl
je SHORT $LL11@process_ca
Min/Max Chains
We have improved optimization of chains of std::min
and std::max
as of Visual Studio 2022 version 17.0.
Say we have three blankets and we want to give one to our cat. The one we pick shouldn’t be too hard. It shouldn’t be too soft. It should be just right.
We could write a function to give us the Just Right blanket from three, by picking the middle one. It could look something like this:
using softness = float;
softness just_right_blanket(softness a, softness b, softness c) {
    return std::max(std::min(a,b), std::min(std::max(a,b),c));
}
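This is the classic median-of-three idiom. As a quick sketch of its behavior, using the function above with made-up values:
// min(a,b) = 0.2, max(a,b) = 0.9, min(0.9, 0.5) = 0.5, max(0.2, 0.5) = 0.5
softness middle = just_right_blanket(0.2f, 0.9f, 0.5f);  // 0.5f, the middle softness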
For this function, VS2019 generated the following x64 with /O2 /fp:fast:
comiss xmm1, xmm0
lea rcx, QWORD PTR a$[rsp]
lea rax, QWORD PTR b$[rsp]
lea rdx, QWORD PTR a$[rsp]
movss DWORD PTR [rsp+16], xmm1
movss DWORD PTR [rsp+8], xmm0
cmovbe rax, rcx
movss DWORD PTR [rsp+24], xmm2
lea rcx, QWORD PTR c$[rsp]
comiss xmm2, DWORD PTR [rax]
cmovae rcx, rax
lea rax, QWORD PTR b$[rsp]
comiss xmm0, xmm1
cmovbe rax, rdx
movss xmm1, DWORD PTR [rax]
comiss xmm1, DWORD PTR [rcx]
cmovb rax, rcx
movss xmm0, DWORD PTR [rax]
ret 0
Arm64 codegen is similarly inefficient. Both x64 and Arm64 have single instructions for scalar floating point min and max, which we now use in VS2022 at /O2
and above with /fp:fast
. Here is the x64 code now:
movaps xmm3, xmm0
maxss xmm0, xmm1
minss xmm3, xmm1
minss xmm0, xmm2
maxss xmm0, xmm3
ret 0
And for Arm64:
fmax s16,s0,s1
fmin s17,s0,s1
fmin s18,s16,s2
fmax s0,s18,s17
ret
Backwards Loop Vectorization
Say I run a cat shelter with 32 cats and want to count how many crunchies they leave behind in their bowls after mealtime. So I write a function which takes a pointer to the first bowl and sums the leftovers, like so (yes, I know I could use std::accumulate):
int count_leftovers(int* bowl_ptr) {
    int result = 0;
    for (int i = 0; i < 32; ++i, ++bowl_ptr) {
        result += *bowl_ptr;
    }
    return result;
}
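As an aside, the std::accumulate version alluded to above might look something like this (a sketch, not code from the original post):
#include <numeric>  // std::accumulate
int count_leftovers_accumulate(int* bowl_ptr) {
    // Sum the 32 bowls starting from the first one.
    return std::accumulate(bowl_ptr, bowl_ptr + 32, 0);
}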
The forward loop works and generates good code! But then I realize that my desk is actually at the far end of the room, so I need to walk all the way to the start of the line to begin counting. I decide instead to take a pointer to the last bowl and work backwards:
int count_leftovers(int* bowl_ptr) {
    int result = 0;
    // change ++bowl_ptr to --bowl_ptr
    for (int i = 0; i < 32; ++i, --bowl_ptr) {
        result += *bowl_ptr;
    }
    return result;
}
Unfortunately, if I were using VS2019, this loop would not be vectorized. Here is the code generated with /O2
:
xor eax, eax
mov edx, eax
mov r8d, eax
mov r9d, eax
add rcx, -8
lea r10d, QWORD PTR [rax+8]
$LL4@count_left:
add eax, DWORD PTR [rcx+8]
add r9d, DWORD PTR [rcx+4]
add r8d, DWORD PTR [rcx]
add edx, DWORD PTR [rcx-4]
lea rcx, QWORD PTR [rcx-16]
sub r10, 1
jne SHORT $LL4@count_left
lea ecx, DWORD PTR [rdx+r8]
add ecx, r9d
add eax, ecx
ret 0
The loop is unrolled but it is not vectorized.
We enabled vectorization for backwards-strided loops in VS2022 17.1. The code generated will depend a lot on the flags you use, particularly the /arch
flag for enabling use of AVX instructions instead of the default SSE2 ones.
Here is the code generated for /O2 /arch:AVX2
:
vpxor xmm2, xmm2, xmm2
vpxor xmm3, xmm3, xmm3
mov eax, 2
npad 3
$LL4@count_left:
vmovd xmm1, DWORD PTR [rcx]
vpinsrd xmm1, xmm1, DWORD PTR [rcx-4], 1
vpinsrd xmm1, xmm1, DWORD PTR [rcx-8], 2
vpinsrd xmm1, xmm1, DWORD PTR [rcx-12], 3
vmovd xmm0, DWORD PTR [rcx-16]
vpinsrd xmm0, xmm0, DWORD PTR [rcx-20], 1
vpinsrd xmm0, xmm0, DWORD PTR [rcx-24], 2
vpinsrd xmm0, xmm0, DWORD PTR [rcx-28], 3
lea rcx, QWORD PTR [rcx-64]
vinsertf128 ymm0, ymm1, xmm0, 1
vmovd xmm1, DWORD PTR [rcx+32]
vpinsrd xmm1, xmm1, DWORD PTR [rcx+28], 1
vpinsrd xmm1, xmm1, DWORD PTR [rcx+24], 2
vpinsrd xmm1, xmm1, DWORD PTR [rcx+20], 3
vpaddd ymm2, ymm0, ymm2
vmovd xmm0, DWORD PTR [rcx+16]
vpinsrd xmm0, xmm0, DWORD PTR [rcx+12], 1
vpinsrd xmm0, xmm0, DWORD PTR [rcx+8], 2
vpinsrd xmm0, xmm0, DWORD PTR [rcx+4], 3
vinsertf128 ymm0, ymm1, xmm0, 1
vpaddd ymm3, ymm0, ymm3
sub rax, 1
jne $LL4@count_left
vpaddd ymm0, ymm3, ymm2
vphaddd ymm1, ymm0, ymm0
vphaddd ymm2, ymm1, ymm1
vextracti128 xmm0, ymm2, 1
vpaddd xmm0, xmm2, xmm0
vmovd eax, xmm0
vzeroupper
ret 0
This both unrolls and vectorizes the loop. Fully explaining AVX2 vector instructions is a job for a different blog post (maybe check out this one), but the basic idea is that all those vpinsrd
instructions are loading the data from memory into vector registers, then the vpaddd
/vphaddd
instructions carry out addition on big chunks of data all at the same time.
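To make the idea concrete, here is a rough hand-written sketch of the same reduction using AVX2 intrinsics. This is an illustration of the concept rather than the compiler’s actual output: it uses simple forward unaligned loads instead of the backwards vpinsrd sequence above, and it assumes AVX2 is available (for example via /arch:AVX2):
#include <immintrin.h>
int count_leftovers_avx2_sketch(const int* bowl_ptr) {
    __m256i sum = _mm256_setzero_si256();
    for (int i = 0; i < 32; i += 8) {
        // Load eight ints at once and accumulate them lane by lane (vpaddd).
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(bowl_ptr + i));
        sum = _mm256_add_epi32(sum, v);
    }
    // Fold the eight lanes down to a single int (vextracti128 / vpaddd / vphaddd).
    __m128i lo = _mm256_castsi256_si128(sum);
    __m128i hi = _mm256_extracti128_si256(sum, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_hadd_epi32(s, s);
    s = _mm_hadd_epi32(s, s);
    return _mm_cvtsi128_si32(s);
}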
Send us your feedback
We hope you found these details interesting! If you have ideas for similar posts you’d like to see, please let us know. We are also interested in your feedback to continue to improve our tools. The comments below are open. Feedback can also be shared through Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.