MSVC ARM64 optimizations in Visual Studio 2022 17.8 

Jiong Wang (ARM Ltd)

Hongyon Suauthai (ARM)

Visual Studio 2022 17.8 has been released recently (download it here). While there is already a blog post, “Visual Studio 17.8 now available!”, covering the new features and improvements, in this post we would like to share more details about what is new in the MSVC ARM64 backend. Over the last couple of months, we have been improving code-generation in the auto-vectorizer so that it generates Neon instructions in more cases. We have also optimized instruction selection for several scalar code-generation scenarios, for example short-circuit evaluation, comparison against an immediate, and smarter immediate splitting for logical instructions.

Auto-Vectorizer supports conversions between floating-point and integer

The following conversions between floating-point and integer types are common in real-world code. Now, they are all enabled in the ARM64 backend and hooked up with the auto-vectorizer.

From         To           Instruction
double       float        fcvtn
double       int64_t      fcvtzs
double       uint64_t     fcvtzu
float        double       fcvtl
float        int32_t      fcvtzs
float        uint32_t     fcvtzu
int64_t      double       scvtf
uint64_t     double       ucvtf
int32_t      float        scvtf
uint32_t     float        ucvtf

For example:

void test (double * __restrict a, unsigned long long * __restrict b) 
{ 
    for (int i = 0; i < 2; i++)
    { 
        a[i] = (double)b[i]; 
    } 
} 

In Visual Studio 2022 17.7, the code-generation was the following; both the computing throughput and the load/store bandwidth utilization were suboptimal because scalar instructions were used.

ldp    x9, x8, [x1]
ucvtf  d17, x9
ucvtf  d16, x8
stp    d17, d16, [x0]

In Visual Studio 2022 17.8.2, the code-generation has been optimized into:

ldr    q16,[x1]
ucvtf  v16.2d,v16.2d
str    q16,[x0]

Now a single Q-register load and a single Q-register store, plus one SIMD conversion instruction, are used.
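
For reference, the vectorized sequence corresponds to the following hand-written Neon intrinsics (a minimal sketch using arm_neon.h; the helper name is illustrative, and the compiler reaches the same code from the plain C loop above):

#include <arm_neon.h>

// Load two uint64_t values, convert them to double, and store the two results.
void test_neon(double * __restrict a, const uint64_t * __restrict b)
{
    uint64x2_t  v = vld1q_u64(b);       // ldr   q16, [x1]
    float64x2_t d = vcvtq_f64_u64(v);   // ucvtf v16.2d, v16.2d
    vst1q_f64(a, d);                    // str   q16, [x0]
}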

The above example is a conversion between double and 64-bit integer, so both types have the same size. There was another issue in the ARM64 backend that prevented auto-vectorization of conversions between differently sized types; it has been fixed as well. MSVC now also auto-vectorizes the following example:

void test_df_to_sf (float * __restrict a, double * __restrict b, int * __restrict c)
{
    for (int i = 0; i < 4; i++)
    {
        a[i] = (float) b[i];
        c[i] = ((int)a[i]) << 5;
    }
}

The code-generation in Visual Studio 2022 17.7 was:

ldp     d17, d16, [x1]
fcvt    s17, d17
fcvt    s16, d16
fcvtzs  w8, s17
stp     s17, s16, [x0]
lsl     w9, w8, #5
fcvtzs  w8, s16
lsl     w8, w8, #5
stp     w9, w8, [x2]

Scalar instructions plus loop unrolling were employed. In Visual Studio 2022 17.8.2, the loop is vectorized:

ldr     q16, [x1]
fcvtn   v16.2s, v16.2d
str     d16, [x0]
fcvtzs  v16.2s, v16.2s
shl     v16.2s, v16.2s, #5
str     d16, [x2]
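
Expressed with Neon intrinsics, one vector step of this loop looks roughly like the following sketch (illustrative only; the helper name is made up and the auto-vectorizer handles the remaining iterations itself):

#include <arm_neon.h>

// Narrow two doubles to float, store them, then convert to int32 and shift left by 5.
void step(float * __restrict a, const double * __restrict b, int * __restrict c)
{
    float64x2_t d = vld1q_f64(b);       // ldr    q16, [x1]
    float32x2_t f = vcvt_f32_f64(d);    // fcvtn  v16.2s, v16.2d
    vst1_f32(a, f);                     // str    d16, [x0]
    int32x2_t   i = vcvt_s32_f32(f);    // fcvtzs v16.2s, v16.2s
    int32x2_t   s = vshl_n_s32(i, 5);   // shl    v16.2s, v16.2s, #5
    vst1_s32(c, s);                     // str    d16, [x2]
}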

Auto-vectorizer now supports extended left shifts

Extended left shifts are also common in real-world code, so the ARM64 ISA has native instructions to support them. Neon has SSHLL and USHLL for signed and unsigned extended left shifts: they extend the shift source first, then perform the shift on the extended value. Let’s have a look at the following example:

void test_signed (signed short * __restrict a, signed char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 7;
}

void test_unsigned (unsigned short * __restrict a, unsigned char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 7;
}

The code-generation in Visual Studio 2022 17.7 was:

|test_signed| PROC
    ldr    d16, [x1]
    sxtl   v16.8h, v16.8b
    shl    v16.8h, v16.8h, #7
    str    q16, [x0]
|test_unsigned| PROC
    ldr    d16, [x1]
    ushll  v16.8h, v16.8b, #0
    shl    v16.8h, v16.8h, #7
    str    q16, [x0]

The loops are vectorized, but an independent type promotion is done first, followed by a normal shift. These two steps can be combined into a single extended shift. We have taught the ARM64 backend about this, and Visual Studio 2022 17.8.2 now generates:

|test_signed| PROC
    ldr    d16, [x1]
    sshll  v16.8h, v16.8b, #7
    str    q16, [x0]
|test_unsigned| PROC
    ldr    d16, [x1]
    ushll  v16.8h, v16.8b, #7
    str    q16, [x0]

A single SSHLL or USHLL is used. However, SSHLL and USHLL only encode shift amounts within [0, bit_size_of_shift_source - 1]; for the testcases above, the shift amount can therefore only be in [0, 7]. As a result, neither instruction can be used if we want to move the element all the way into the upper half of the wider destination, because the shift amount would then be 8, which is out of the encoding range. For this case, signedness does not matter, and ARM64 Neon has SHLL (Shift Left Long – by element size) to support it.
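
The same distinction shows up when writing Neon intrinsics by hand. Below is a minimal sketch (helper names are illustrative) contrasting the unfused widen-then-shift form with the fused widening shift, staying within the [0, 7] range that SSHLL can encode:

#include <arm_neon.h>

// Unfused form: widen first (sxtl), then shift (shl).
int16x8_t widen_then_shift(int8x8_t v)
{
    int16x8_t wide = vmovl_s8(v);   // sxtl v16.8h, v16.8b
    return vshlq_n_s16(wide, 7);    // shl  v16.8h, v16.8h, #7
}

// Fused form: a single widening shift (sshll).
int16x8_t widening_shift(int8x8_t v)
{
    return vshll_n_s8(v, 7);        // sshll v16.8h, v16.8b, #7
}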

Let us increase the shift amount to the element size of the shift source, which is 8:

void test_signed(signed short * __restrict a, signed char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 8;
}

void test_unsigned(unsigned short * __restrict a, unsigned char * __restrict b)
{
    for (int i = 0; i < 8; i++)
        a[i] = b[i] << 8;
}

The code-generation in Visual Studio 2022 17.7 was:

|test_signed| PROC
    ldr    d16, [x1]
    sxtl   v16.8h, v16.8b
    shl    v16.8h, v16.8h, #8
    str    q16, [x0]
|test_unsigned| PROC
    ldr    d16, [x1]
    ushll  v16.8h, v16.8b, #0
    shl    v16.8h, v16.8h, #8
    str    q16, [x0]

And in Visual Studio 2022 17.8.2, it becomes:

|test_signed| PROC
    ldr   d16, [x1]
    shll  v16.8h, v16.8b, #8
    str   q16, [x0]
|test_unsigned| PROC
    ldr   d16, [x1]
    shll  v16.8h, v16.8b, #8
    str   q16, [x0]

Scalar code-generation improved on immediate materialization for CMP/CMN

In the blog introducing ARM64 optimizations in Visual Studio 2022 17.6, there was a piece of feedback about unnecessary immediate materialization for integer comparisons. In the 17.8 development cycle, we have improved a couple of relevant places inside the ARM64 backend.

For integer comparisons, the ARM64 backend is now smarter: it knows that an immediate which does not fit into the encoding can sometimes be adjusted to immediate + 1 or immediate - 1 so that it does fit, with the comparison condition adjusted accordingly. Here are some examples:

int test_ge2gt (int a)
{
    if (a >= 0x10001)
        return 1;
    else
        return 0;
}

int test_lt2le (int a)
{
    if (a < -0x1fff)
        return 1;
    else
        return 0;
}

0x10001 inside test_ge2gt does not fit into the immediate encoding for the ARM64 CMP instruction, either verbatim or shifted. However, if we subtract 1 from it and turn greater than or equal (ge) into greater than (gt) accordingly, then 0x10000 fits into the shifted encoding.

For test_lt2le, the negative immediate -0x1fff does not fit into the immediate encoding for the ARM64 CMN instruction, but if we subtract 1 from it and turn less than (lt) into less than or equal (le) accordingly, then -0x2000 fits into the shifted encoding.
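
Both rewrites rely on a basic integer identity: a >= K is equivalent to a > K - 1, and a < K is equivalent to a <= K - 1 (as long as K - 1 does not overflow, which is the case for these values). A small, purely illustrative sanity check:

#include <assert.h>

// a >= 0x10001  <=>  a >  0x10000   (ge -> gt, immediate - 1)
// a <  -0x1fff  <=>  a <= -0x2000   (lt -> le, immediate - 1)
void check_rewrites(int a)
{
    assert((a >= 0x10001) == (a > 0x10000));
    assert((a < -0x1fff) == (a <= -0x2000));
}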

The code-generation by Visual Studio 2022 17.7 was therefore the following:

|test_ge2gt| PROC
    mov     w8, #0x10001
    cmp     w0, w8
    csetge  w0
|test_lt2le| PROC
    mov     w8, #-0x1FFF
    cmp     w0, w8
    csetlt  w0

There is an extra MOV instruction to materialize the immediate into a register because it does not fit into the encoding verbatim. After the above-mentioned improvements, Visual Studio 2022 17.8.2 generates:

|test_ge2gt| PROC
    cmp     w0, #0x10, lsl #0xC
    csetgt  w0
|test_lt2le| PROC
    cmn     w0, #2, lsl #0xC
    csetle  w0

The sequence is more efficient.

Scalar code-generation improved on logical immediate loading

We have also taken further steps to improve immediate handling for other instructions. One improvement concerns logical immediates: ARM64 has a rotated bitmask encoding for them (please refer to the description of DecodeBitMasks in the Arm Architecture Reference Manual for details), and this encoding is used by AND/ORR. If an immediate does not fit into the rotated encoding verbatim, it may fit after a split.

For example, programmers frequently write code patterns like the following:

#define FLAG1_MASK 0x80000000
#define FLAG2_MASK 0x00004000

int cal (int a)
{
    return a & (FLAG1_MASK | FLAG2_MASK);
}

The compiler middle-end usually combines all the ANDed immediates, so the return expression becomes a & 0x80004000. That value does not fit into the rotated encoding, so a MOV/MOVK sequence is generated to load the immediate, for a total cost of three instructions. The code generation in Visual Studio 2022 17.7 was:

mov   w8, #0x4000
movk  w8, #0x8000, lsl #0x10
and   w0, w0, w8

If we split 0x80004000 into 0xFFFFC000 and 0x80007FFF, ANDing with them sequentially has the same effect as ANDing with 0x80004000, and both 0xFFFFC000 and 0x80007FFF fit into the rotated encoding, so we save one instruction. The code generation in Visual Studio 2022 17.8.2 is:

and  w8, w0, #0xFFFFC000
and  w0, w8, #0x80007FFF

The immediate is split so that both parts fit into two AND instructions. We only want to split the immediate when it has a single use site; otherwise, we would end up with duplicated sequences. Therefore, the split is guided by use count.
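
The split is correct because ANDing with the two halves clears exactly the same bits as ANDing with the original mask: only bit 31 and bit 14 are set in both halves. A small, purely illustrative check:

#include <assert.h>

// (a & 0xFFFFC000) & 0x80007FFF keeps only bits 31 and 14,
// exactly like a & 0x80004000.
void check_split(unsigned a)
{
    assert(((a & 0xFFFFC000u) & 0x80007FFFu) == (a & 0x80004000u));
}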

Scalar code-generation now catches more CCMP opportunities

The CCMP (conditional compare) instruction is useful for accelerating short circuit evaluation, for example:

int test (int a)
{
    return a == 17 || a == 31;
}

For this testcase, Visual Studio 2022 17.7 was smart enough to employ CCMP and generated:

cmp     w0, #0x11
ccmpne  w0, #0x1F, #4
cseteq  w0

However, CCMP only takes a 5-bit immediate, so if we change the testcase to:

int test (int a)
{
    return a == 17 || a == 32;
}

The immediate #32 does not fit into CCMP’s encoding, so the compiler stopped generating it; hence, the code generation in Visual Studio 2022 17.7 was:

    cmp  w0, #0x11
    beq  |$LN3@test|
    cmp  w0, #0x20
    mov  w0, #0
    bne  |$LN4@test|
|$LN3@test|
    mov  w0, #1
|$LN4@test|
    ret

It employs an if-then-else structure and is verbose. Here, the compiler should have a better cost model and recognize that it is still beneficial to move #32 into a register and employ CCMP’s register form. We have fixed this in Visual Studio 2022 17.8.2, and the code generation becomes:

cmp     w0, #0x11
mov     w8, #0x20
ccmpne  w0, w8, #4
cseteq  w0
ret
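
CCMP is not limited to chains of equality tests; other short-circuit conditions, such as range checks, can also fold into a CMP/CCMP/CSET sequence. A small illustrative source pattern (the exact code-generation depends on the compiler version and its cost model):

// A range check written with short-circuit evaluation. On ARM64 this kind of
// condition can typically be lowered to cmp + ccmp + cset instead of branches.
int in_range(int a)
{
    return a > 10 && a < 20;
}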

Using MOVI/MVNI for immediate moves in small loops

We missed an opportunity to use shifted MOVI/MVNI for combining immediate moves in small loops. For example:

void vect_movi_msl (int *__restrict a, int *__restrict b, int *__restrict c) {
    for (int i = 0; i < 8; i++)
        a[i] = 0x1200;

    for (int i = 0; i < 8; i++)
        c[i] = 0x12ffffff;
}

In the 17.7 release, MSVC generated:

|movi_msl| PROC
    mov x9, #0x1200
    movk x9, #0x1200, lsl #0x20
    ldr x8, |$LN29@movi_msl|
    stp x9, x9, [x0]
    stp x8, x8, [x2]
    stp x9, x9, [x0, #0x10]
    stp x8, x8, [x2, #0x10]

Scalar MOV/MOVK instructions are employed to materialize the constants, and the arrays are initialized through a series of 64-bit X-register stores. Both immediates can instead be loaded into vector registers using MOVI/MVNI, which increases the store bandwidth and reduces the number of stores needed.

Shifted MOVI shifts the immediate to the left and fills with 0s, so 0x1200 can be loaded as the following:

0x12 << 8 = 0x1200

Shifted MVNI shifts the immediate first, then inverts the result:

~((0xED) << 0x18) = 0x12ffffff

In 17.8, MSVC generates:

|movi_msl| PROC
    movi v17.4s, #0x12, lsl #8
    mvni v16.4s, #0xED, lsl #0x18
    stp q17, q17, [x0]
    stp q16, q16, [x2]

Benefiting from the width of the vector registers and the use of paired stores, each array is now initialized with a single store-pair instruction.
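
For comparison, hand-written Neon intrinsics can express the same idea: splat each constant into a vector register and store 16 bytes at a time. This is only a sketch (the helper name is made up); the compiler decides whether to materialize the constants with shifted MOVI/MVNI:

#include <arm_neon.h>

// Splat the two constants and store them with full-width vector stores.
void movi_msl_intrin(int * __restrict a, int * __restrict c)
{
    int32x4_t v1 = vdupq_n_s32(0x1200);      // movi v17.4s, #0x12, lsl #8
    int32x4_t v2 = vdupq_n_s32(0x12ffffff);  // mvni v16.4s, #0xED, lsl #0x18
    vst1q_s32(a, v1);
    vst1q_s32(a + 4, v1);
    vst1q_s32(c, v2);
    vst1q_s32(c + 4, v2);
}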

In closing

That is all for this blog; your feedback is valuable to us. Please share your thoughts and comments with us through Visual C++ Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.
