MSVC ARM64 optimizations in Visual Studio 2022 17.6 

Jiong Wang (ARM Ltd)

In the last couple of months, the Microsoft C++ team has been working on improving MSVC ARM64 backend performance, and we are excited to have several optimizations available in Visual Studio 2022 version 17.6. These optimizations improve code-generation for both the scalar ISA and the SIMD ISA (NEON). Let’s review some of the more interesting ones in this blog. 

Before diving into technical details, we encourage you to file feedback on Developer Community if you have found performance issues. The feedback helps us prioritize work items in our backlog. The report “optimize neon right shift into cmp” is a good example: a clear subject title, a detailed description of the issue, and a simple repro simplify our analysis work and help us deliver a fix more quickly. 

Now, let’s see the optimizations. 

Auto-Vectorizer supports more NEON instructions with asymmetric operands

The ARM64 backend already supports some NEON instructions with asymmetrically typed operands, such as the Add/Subtract Long operations (SADDL/UADDL/SSUBL/USUBL). These instructions add or subtract each vector element in the lower or upper half of the first source SIMD register and the corresponding vector element of the second source SIMD register, and write the vector result to the destination SIMD register. The destination vector elements are twice as long as the source vector elements. Now, we have extended this support to Multiply-Add Long and Multiply-Subtract Long (SMLAL/UMLAL/SMLSL/UMLSL). 
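
For reference, the shape of loop that already mapped onto SADDL looks like the following. This is a sketch of our own (the function name is hypothetical), and the exact code-generation depends on compiler version and optimization flags such as /O2:

// Hypothetical example: both sources are 16-bit and the destination is
// 32-bit, so the widening add can be done by a single SADDL instead of
// sign-extending each source first.
void saddl_sketch(int * __restrict dst,
                  short * __restrict a, short * __restrict b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}

The new support targets the multiply-accumulate analogue of this pattern, shown next.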

For example: 

void smlal(int * __restrict dst, int * __restrict a, 
           short * __restrict b, short * __restrict c) 
{ 
    for (int i = 0; i < 4; i++) 
        dst[i] = a[i] + b[i] * c[i]; 
} 

In Visual Studio 2022 17.5, the code-generation was: 

sxtl        v19.4s,v16.4h 
sxtl        v18.4s,v17.4h 
mla         v20.4s,v18.4s,v19.4s  

Extra sign extensions are performed on both source operands to match the type of the destination. In the 17.6 release, this has been optimized into a single smlal v16.4s,v17.4h,v18.4h. 

The ARM64 ISA also supports another variant of these operations, called Add/Subtract Wide. Here, the asymmetry is between the two source operands rather than between source and destination. 

For example: 

void saddw(int *__restrict dst, int *__restrict a, short *__restrict b) 
{ 
    for (int i = 0; i < 4; i++) 
        dst[i] = a[i] + b[i]; 
}

In Visual Studio 2022 17.5, the code-generation was: 

sxtl        v17.4s,v16.4h 
add         v18.4s,v17.4s,v18.4s  

The narrow source gets an extra sign extension to match the wide source. In the 17.6 release, this has been optimized into a single saddw v16.4s,v16.4s,v17.4h. The same applies to UADDW/SSUBW/USUBW. 
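
For instance, an unsigned counterpart of the same shape (our own sketch; the function name is hypothetical, and code-generation may vary with compiler version and flags) can map onto UADDW:

// Hypothetical example: the first source is already 32-bit, the second is
// 16-bit unsigned, so the zero extension can be folded into a single UADDW.
void uaddw_sketch(unsigned int * __restrict dst,
                  unsigned int * __restrict a,
                  unsigned short * __restrict b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}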

Auto-vectorizer now supports small types on ABS/MIN/MAX 

ABS/MIN/MAX have slightly more complex semantics. Normally, the compiler middle-end or back-end has a pattern matcher that recognizes IR sequences with if-then-else semantics and checks whether they can be converted into ABS/MIN/MAX. There is an issue when the operands are of small types (int8 or int16), though. 
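
For context, these are the kinds of if-then-else shapes the pattern matcher looks for. This is a minimal sketch of our own, using int operands where promotion is not a concern:

// The ternary forms below carry the if-then-else semantics that a
// pattern matcher can map to MIN, MAX, and ABS respectively.
int min32(int a, int b) { return a < b ? a : b; }
int max32(int a, int b) { return a > b ? a : b; }
int abs32(int a)        { return a > 0 ? a : -a; }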

As specified by the C++ standard, small types are promoted to int, which is 32-bit on ARM64. This is fine for scalar operations because they can only operate at scalar register width anyway; on ARM64 the smallest width is 32 bits, using the sub-register. However, this is not true for the SIMD ISA, whose minimum operation width is the width of a vector lane (element). For example, ARM64 NEON supports operating on int8 and int16 elements for a number of operations, including ABS/MIN/MAX. So, to generate SIMD instructions operating on small element sizes and deliver higher computing throughput, the auto-vectorizer needs to analyze the code and narrow the type back to the original small type when it is safe to do so. 
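
To see the promotion concretely, here is a small sketch of our own (not from the original post) showing that negating a signed char already has type int:

#include <type_traits>

void promotion_check(signed char x)
{
    // Unary minus triggers integer promotion, so the expression has type
    // int even though its value always fits in 8 bits; the auto-vectorizer
    // must prove it is safe to narrow such expressions back to int8.
    static_assert(std::is_same_v<decltype(-x), int>,
                  "small types are promoted to int");
    (void)x; // x is only used in an unevaluated context above
}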

On top of this, ABS/MIN/MAX bring further challenges because their source operands are scattered across different basic blocks due to their built-in if-then-else semantics. We found that the MSVC auto-vectorizer could narrow the type of the operand in only one basic block, leaving it inconsistent with the operands in other basic blocks. Such inconsistency causes the auto-vectorizer to fail to match the pattern and cancel vectorization. 

For example: 

void test(signed char * __restrict a, signed char * __restrict b) 
{ 
    for (int i = 0; i < 16; i++) 
        a[i] = b[i] > 0 ? b[i] : -b[i]; 
}

In Visual Studio 2022 17.5, there was no vectorization, and the code-generation was: 

ldrsb       w8,[x1,#1] 
cmp         w8,#0 
bgt         |$LN16@test_abs_v| 
neg         w8,w8 
sxtb        w8,w8   

In the 17.6 release, the code-generation has been improved into a single abs v16.16b,v16.16b. 

Scalar code-generation improved based on value range analysis

When a register is compared with an immediate value, the compiler can deduce the value range of the register, and this information is useful for later optimizations, for example, evaluating comparison results statically. 
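
As a tiny illustration of this idea (a hypothetical example of our own):

// Hypothetical example: on the taken path the compiler knows a == 5,
// so the second comparison is statically true and can be folded to 1.
int fold(int a)
{
    if (a == 5)
        return a > 0;
    return 0;
}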

MSVC already has an infrastructure for value range deduction; the backend just needs to teach the middle-end about the semantics of its supported comparison instructions. 

The ARM64 backend previously missed this support for CBZ and CMP. For cbz reg, label, reg must be zero on the true path, and the same applies to CBNZ on the false path; for cmp reg, #imm followed by an equality branch, reg must equal imm on the true path. Knowing such equivalences, the compiler can simplify code-generation. Let’s see a simple example: 

int cal(int a) 
{ 
    if (!a) 
        return a; 
    else 
        return a * 2; 
} 

In Visual Studio 2022 17.5, MSVC was generating the following instruction sequence: 

|int cal(int)| PROC 
        cbnz        w0,|$LN2@cal| 
        mov         w0,#0         <- redundant mov 
        ret 
|$LN2@cal| 
        lsl         w0,w0,#1 
        ret 

The mov w0, #0 is redundant because when execution reaches it, w0 must be zero. After our recent optimization, the code-generation in the 17.6 release is improved: 

|int cal(int)| PROC 
        lsl         w8,w0,#1 
        cmp         w0,#0 
        cseleq      w0,wzr,w8 
        ret 

Scalar code-generation improved to catch more if-conversion opportunities

You may have noticed another difference in the code-generation for the above test case: CSELEQ is used instead of a branch. The ARM64 ISA provides the CSEL instruction for if-conversion, which saves both execution cycles and code size. Previously, MSVC couldn’t generate CSEL when the selected value came from a return statement. We fixed this in the 17.6 release, so for the above example CSELEQ is used instead of CBNZ. 

In summary, for the if-return-else-return pattern, the ARM64 backend has been taught to generate a CSEL instruction if the returned value is computed by any of the following operations (a sketch of one binary case follows the list): 

  • Unary: NOT and type conversion 
  • Binary: Add, Subtract, And/Or/Xor, Logical Shift Left/Right 
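
As a hypothetical illustration of one of the binary cases (our own example; the actual instruction selection may differ):

// A hypothetical if-return-else-return shape where the returned value is
// computed by a bitwise AND; this pattern is now eligible for CSEL-based
// if-conversion instead of a branch.
int mask_or_keep(int a)
{
    if (a < 0)
        return a & 0xFF;
    else
        return a;
}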

Let’s see another example: 

int test (int a) {  
    if (a > 0xFFFF)  
        return ~a;  
    else 
        return a;  
}  

In Visual Studio 2022 17.5, the code-generation was: 

|int test(int)| PROC 
        mov         w8,#0xFFFF 
        cmp         w0,w8 
        ble         |$LN3@test| 
        mvn         w0,w0 
|$LN3@test| 
        ret  

It employs a branch and contains multiple basic blocks. In the 17.6 release, the code-generation is improved by if-conversion and is lean and mean: 

|int test(int)| PROC 
        mov         w8,#0xFFFF 
        cmp         w0,w8 
        cinvgt      w0,w0 
        ret 

ARM64 instruction combiner now supports instructions with multiple definitions

Like any modern compiler, MSVC carries out various instruction combinations across compilation stages. While many combine rules are already implemented, we recently found that one interesting case was missing. Take the following case for example: 

int test(int a, int b) 
{ 
    if (a < b) 
        return b - a; 
    else 
        return 7; 
}  

In Visual Studio 2022 17.5, the code-generation was: 

|int test(int,int)| PROC 
        cmp         w0,w1 
        sub         w0,w1,w0 
        blt         |$LN3@test| 
        mov         w0,#7 
|$LN3@test| 
        ret 

The cmp w0, w1 and the following sub w0, w1, w0 can be combined into a single SUBS. We missed this because the later-stage combiner had glitches when combining into instructions with multiple definitions. Now that this is fixed, the code-generation in the 17.6 release has been improved into: 

|int test(int,int)| PROC 
        subs        w9,w1,w0 
        mov         w8,#7 
        cselgt      w0,w9,w8 
        ret  

The code-generation now utilizes SUBS to set the condition code and perform the subtraction at the same time. It also benefits from the above-mentioned if-conversion optimization on return statements, so CSEL is used instead of a branch. 

In closing

That is all for this blog, and we will keep you updated on our progress. Your feedback is very valuable to us. Please share your thoughts and comments with us through the Visual C++ Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com. 


7 comments


  • Mason Boswell

    Awesome work, keep it up!

  • Roger B

    Is there anything new to report for the x64 backend? It basically never wins against gcc/clang in any scenario I’ve tried and the number of issues that keep rolling in on developercommunity seems to agree with my experiences.

    In the intermediate 5-6 years before ARM on Windows PC really takes off (maybe), the x64 ecosystem really needs to be improved still.

  • Przemyslaw Wirkus

    Fantastic progress Jiong! 🙂

    Przemek

  • Sjoerd Meijer

    Hi Jiong,
    Great work, nice read.
    Just out of curiosity, I copied the examples to Compiler Explorer and looked at codegen for some open-source compilers.

    For function `cal` it looks like a single LSL suffices: `lsl w0, w0, #1`

    For:

    int test (int a) {  
        if (a > 0xFFFF)  
            return ~a;  
        else 
            return a;  
    }  

    The constant doesn’t need to be materialised:

            cmp     w0, #16, lsl #12                // =65536
            cinv    w0, w0, ge

    https://godbolt.org/z/KKndKjK31

    Cheers.

    • Jiong Wang (ARM Ltd), Microsoft employee

      Thanks for the feedback, Sjoerd. The constant issue is in our backlog; please stay tuned.

  • Andrew Pinski

    The first example in ‘Scalar code-generation improved based on value range analysis’ could be further optimized to just ‘lsl w0, w0, 1’, as shifting zero left by 1 is still 0.
    This is handled in GCC by the phiopt pass (I assume something similar exists in LLVM too).

    • Jiong Wang (ARM Ltd), Microsoft employee

      Good spot, Andrew. I have created a problem report to track it. Please stay tuned for the fix.
