MSVC ARM64 Optimizations in Visual Studio 2022 17.7

Hongyon Suauthai (ARM)

In Visual Studio 2022 version 17.6 we added a host of new ARM64 optimizations. In this second installment of our blog series, we will highlight more performance improvements to the MSVC ARM64 compiler backend, discussing key optimizations in Visual Studio 2022 version 17.7 for both the scalar ISA and the SIMD ISA (NEON). We started introducing these performance optimizations in the 17.6 release and have landed them in 17.7.

By-element operation

ARM64 supports by-element operation in several instructions such as fmul, fmla, and fmls. This feature allows one operand of a SIMD operation to be a single vector element, selected by index. In the example below, where we multiply an array by a scalar, MSVC duplicated the vector element v0.s[0] into all lanes of a SIMD register and then multiplied it with another SIMD operand, the value loaded from array b. This is not efficient, because the dup instruction adds two cycles of execution latency before the fmul instruction can execute.

To better understand this optimization, let’s take the following sample code:

void test(float * __restrict a, float * __restrict b, float c) {
    for (int i = 0; i < 4; i++)
        a[i] = b[i] * c;
}

Code generated by MSVC 17.6:

        dup         v17.4s,v0.s[0]
        ldr         q16,[x1]
        fmul        v16.4s,v17.4s,v16.4s
        str         q16,[x0]

In Visual Studio 2022 17.7, we eliminated the dup instruction. The compiler now multiplies by the SIMD element directly, and the generated code is:

        ldr         q16,[x1]
        fmul        v16.4s,v16.4s,v0.s[0]
        str         q16,[x0]
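
The same by-element multiply can also be written explicitly with NEON intrinsics. Here is a minimal sketch, assuming a toolchain that provides arm_neon.h (MSVC’s ARM64 header is arm64_neon.h) and using our own function name test_neon; the vmulq_n_f32 intrinsic can map directly to the by-element form of fmul:

#include <arm_neon.h>

void test_neon(float * __restrict a, float * __restrict b, float c) {
    float32x4_t vb = vld1q_f32(b);        // load 4 floats from b
    float32x4_t vr = vmulq_n_f32(vb, c);  // multiply by element, no dup needed
    vst1q_f32(a, vr);                     // store 4 results to a
}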

Neon support for shift right and accumulate by immediate

The shift-right-and-accumulate instruction right-shifts each element of a SIMD source register by an immediate and accumulates the result into the destination SIMD register. As mentioned earlier, we started working on this optimization in the 17.6 release and completed the feature in 17.7.

In the 17.5 release, MSVC turned right shifts into left shifts using a negative shift amount. To implement it that way, it moved the constant -2 into a general-purpose register, and then duplicated that register into all lanes of a SIMD register before the left shift.

In the 17.6 release we eliminated the dup instruction and used a right shift with an immediate value. That was an improvement, but not yet good enough.

We further improved the implementation in 17.7 by combining the right shift and the add into a single instruction, using usra for unsigned and ssra for signed operations. The final implementation is much more compact than the previous ones.

To better understand how this optimization works, let’s look at the sample code below:

void test(unsigned long long * __restrict a, unsigned long long * __restrict b) {
    for (int i = 0; i < 2; i++)
        a[i] += b[i] >> 2;
}

Code generated by MSVC 17.5:

        mov         x8,#-2
        ldr         q16,[x1]
        dup         v17.2d,x8
        ldr         q18,[x0]
        ushl        v17.2d,v16.2d,v17.2d
        add         v18.2d,v17.2d,v18.2d
        str         q18,[x0]

Code generated by MSVC 17.6:

        ldr         q16,[x1]
        ushr        v17.2d,v16.2d,#2
        ldr         q16,[x0]
        add         v16.2d,v17.2d,v16.2d
        str         q16,[x0]

Code generated by MSVC 17.7:

        ldr         q16,[x0]
        ldr         q17,[x1]
        usra        v16.2d,v17.2d,#2
        str         q16,[x0]
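
The shift-right-and-accumulate operation is also exposed directly as the vsraq_n_u64 intrinsic, which maps to usra. Here is a minimal sketch under the same assumptions as before (arm_neon.h and a function name of our choosing):

#include <arm_neon.h>

void test_neon(uint64_t * __restrict a, const uint64_t * __restrict b) {
    uint64x2_t va = vld1q_u64(a);  // load the accumulator
    uint64x2_t vb = vld1q_u64(b);  // load the source
    va = vsraq_n_u64(va, vb, 2);   // usra: va += vb >> 2, per lane
    vst1q_u64(a, va);              // store the result
}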

Neon right shift into cmlt

An arithmetic right shift of a signed integer by 31 replicates the most significant bit (the sign bit) across the whole element, so the result is either -1 or 0; that is exactly what cmlt, the compare-signed-less-than-zero instruction, computes. Similar to the previous optimization, we progressively made improvements and completed the feature in 17.7.

To better understand why we introduced this optimization, let’s take the following snippet of code:

void shf2cmlt(int * __restrict a, int * __restrict b) {
    for (int i = 0; i < 4; i++)
      b[i] = a[i] >> 31;
}

Code generated by MSVC 17.5:

        ldr         q16,[x0]
        mvni        v17.4s,#0x1E
        sshl        v17.4s,v16.4s,v17.4s
        str         q17,[x1]

Code generated by MSVC 17.6:

        ldr         q16,[x0]
        sshr        v16.4s,v16.4s,#0x1F
        str         q16,[x1]

sshr was certainly an improvement, but not the best solution, because its throughput is 1: it can only go through the V1 pipeline. cmlt has better throughput, as it can go through either the V0 or the V1 pipeline on Arm’s Neoverse N1 core, as documented in the Arm® Neoverse™ N1 Software Optimization Guide.

Code generated by MSVC 17.7:

        ldr         q16,[x0]
        cmlt        v16.4s,v16.4s,#0
        str         q16,[x1]
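
The compare form is also available as the vcltzq_s32 intrinsic, which produces cmlt against zero. A minimal sketch under the same assumptions (arm_neon.h, our own function name):

#include <arm_neon.h>

void shf2cmlt_neon(int * __restrict a, int * __restrict b) {
    int32x4_t va = vld1q_s32(a);               // load 4 signed ints
    uint32x4_t lt = vcltzq_s32(va);            // cmlt: all-ones if < 0, else 0
    vst1q_s32(b, vreinterpretq_s32_u32(lt));   // store -1 or 0 per lane
}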

Neon right shift and narrow into shrn

This is another case where we can combine two instructions into one using shrn. The shrn instruction right-shifts each element of the source SIMD register by an immediate value, narrows the result to half its width, and writes the resulting vector to the lower half of the destination SIMD register.

In this example, the generated code multiplies 16-bit integers and widens the products into unsigned 32-bit lanes. In 17.6 it then right-shifted each product by an immediate value and used a separate xtn to narrow the result and write it to the final destination; in 17.7 the shrn instruction performs both steps at once.

To better understand this optimization, consider the following sample code:

void test(unsigned short * __restrict a, short * __restrict b) {
    for( int i = 0; i < 4; i++ )
        b[i] = (a[i] * a[i]) >> 10;
}

Code generated by MSVC 17.6:

        ldr         d16,[x0]
        umull       v16.4s,v16.4h,v16.4h
        sshr        v16.4s,v16.4s,#0xA
        xtn         v16.4h,v16.4s
        str         d16,[x1]

Code generated by MSVC 17.7:

        ldr         d16,[x0]
        umull       v16.4s,v16.4h,v16.4h
        shrn        v16.4h,v16.4s,#0xA
        str         d16,[x1]
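
The widening multiply and the shift-right-narrow are likewise exposed as the vmull_u16 and vshrn_n_u32 intrinsics, the latter mapping to shrn. A minimal sketch under the same assumptions:

#include <arm_neon.h>

void test_neon(unsigned short * __restrict a, short * __restrict b) {
    uint16x4_t va = vld1_u16(a);             // load 4 unsigned shorts
    uint32x4_t sq = vmull_u16(va, va);       // umull: widen to 32-bit squares
    uint16x4_t nr = vshrn_n_u32(sq, 10);     // shrn: shift right and narrow
    vst1_s16(b, vreinterpret_s16_u16(nr));   // store 4 shorts
}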

Improved scalar code generation for negating a bool value

The negation of a Boolean value can generally be optimized into an exclusive OR with 1.

For example, 

0 ^ 1 = 1

1 ^ 1 = 0

In 17.6, MSVC generated unnecessary code, using cmp + cseteq to negate a Boolean function’s return value.

To better understand this optimization, let’s take the following sample code:

bool foo();
bool negate_bool() {
    return !foo();
}

Code generated by MSVC 17.6:

        bl          |?foo@@YA_NXZ|
        uxtb        w8,w0
        cmp         w8,#0
        cseteq      w0
        ldr         lr,[sp],#0x10

It zero-extends the return value of function foo() and compares it with 0, then uses cseteq to produce 1 if the value was equal to 0, and 0 otherwise.

In 17.7 we improved this by emitting a bitwise exclusive OR, which is equivalent to the negation.

Code generated by MSVC 17.7:

        bl          |?foo@@YA_NXZ|
        eor         w8,w0,#1
        uxtb        w0,w8
        ldr         lr,[sp],#0x10
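
This transformation is valid because a bool object holds only the values 0 and 1, so XOR with 1 agrees with logical negation on every possible value. A trivial standalone check of the identity:

#include <cassert>

int main() {
    for (int v = 0; v <= 1; ++v)
        assert((v ^ 1) == !v);  // !v and v ^ 1 coincide for v in {0, 1}
    return 0;
}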

 

Eliminating redundant ARM64 comparison instructions

There were certain cases where MSVC failed to recognize that consecutive comparison conditions were the same and that the condition flags were unchanged in between. In the following example, 17.6 emits three separate comparisons for if (cond), one for each cseleq that selects the value assigned to a, b, and c.

struct A {
        int a;
        int b;
        int c;
};   

void foo(struct A *sa, int cond) {
    int a = 1, b = 2, c = 3;
    if (cond) {
        a = 20;
        b = 33;
        c = 48;
    }

    sa->a = a;
    sa->b = b;
    sa->c = c;
}

Code generated by MSVC 17.6:

        cmp         w1,#0
        mov         w9,#1
        mov         w8,#0x14
        cseleq      w10,w9,w8
        cmp         w1,#0
        mov         w9,#2
        mov         w8,#0x21
        cseleq      w8,w9,w8
        stp         w10,w8,[x0]
        cmp         w1,#0
        mov         w8,#0x30
        mov         w9,#3
        cseleq      w8,w9,w8
; Line 19
        str         w8,[x0,#8]

We taught the backend to recognize this pattern: the intervening mov instructions do not modify the condition flags, so a single cmp can feed all three cseleq instructions. The code generation in the 17.7 release has been optimized accordingly.

Code generated by MSVC 17.7:

        cmp         w1,#0
        mov         w9,#1
        mov         w8,#0x14
        cseleq      w10,w9,w8
        mov         w9,#2
        mov         w8,#0x21
        cseleq      w8,w9,w8
        stp         w10,w8,[x0]
        mov         w9,#3
        mov         w8,#0x30
        cseleq      w8,w9,w8
; Line 19
        str         w8,[x0,#8]
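
In source terms, the three selects now share one condition. A hand-written equivalent of what the backend proves safe might look like this (foo_equiv is our name, for illustration only):

void foo_equiv(struct A *sa, int cond) {
    int z = (cond == 0);   // one comparison feeds all three selects
    sa->a = z ? 1 : 20;
    sa->b = z ? 2 : 33;
    sa->c = z ? 3 : 48;
}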

Summary

We would like to thank everyone for giving us feedback, and we are looking forward to hearing more from you. We will continue to improve MSVC for ARM developers in 17.8, so please stay tuned. Your feedback is very valuable to us: please share your thoughts and comments with us through the Visual C++ Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.

 
