In Visual Studio 2022 version 17.6 we added a host of new ARM64 optimizations. In this second edition of our blog, we highlight some of the performance improvements to the MSVC ARM64 compiler backend and discuss key optimizations in Visual Studio 2022 version 17.7 for both the scalar ISA and the SIMD ISA (NEON). We started introducing these performance optimizations in the 17.6 release and completed them in the 17.7 release.
By element operation
ARM64 supports by-element operations in several instructions, such as fmul, fmla, and fmls. This feature allows a SIMD operand to be combined directly with a single SIMD element, selected by an index. In the example below, where we multiply an array by a scalar, MSVC 17.6 duplicated the vector element v0.s[0] into all lanes of a SIMD register, then multiplied that with another SIMD operand loaded from array b. This is not efficient because the dup instruction adds 2 more cycles of execution latency before the fmul instruction can execute.
To better understand this optimization let’s take the following sample code:
    void test(float * __restrict a, float * __restrict b, float c)
    {
        for (int i = 0; i < 4; i++)
            a[i] = b[i] * c;
    }
Code generated by MSVC 17.6:
    dup         v17.4s,v0.s[0]
    ldr         q16,[x1]
    fmul        v16.4s,v17.4s,v16.4s
    str         q16,[x0]
In Visual Studio 2022 17.7, we eliminated the dup instruction. The compiler can now multiply by a SIMD element directly, and the generated code is:
    ldr         q16,[x1]
    fmul        v16.4s,v16.4s,v0.s[0]
    str         q16,[x0]
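To see why the two sequences are interchangeable, here is a minimal scalar sketch in portable C++ that models the lanes of a SIMD register as a small array. The type Vec4f and the functions dup_lane, fmul_vector, and fmul_by_element are hypothetical names invented for this illustration. They are not MSVC or NEON APIs, and the sketch says nothing about how the compiler implements the transformation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a 4-lane single-precision SIMD register.
using Vec4f = std::array<float, 4>;

// Models `dup vD.4s, vN.s[lane]`: broadcast one lane into all four lanes.
Vec4f dup_lane(const Vec4f& v, std::size_t lane) {
    return {v[lane], v[lane], v[lane], v[lane]};
}

// Models the vector form `fmul vD.4s, vN.4s, vM.4s`: lane-wise multiply.
Vec4f fmul_vector(const Vec4f& x, const Vec4f& y) {
    Vec4f r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = x[i] * y[i];
    return r;
}

// Models the by-element form `fmul vD.4s, vN.4s, vM.s[lane]`:
// every lane of x is multiplied by the single selected lane of y.
Vec4f fmul_by_element(const Vec4f& x, const Vec4f& y, std::size_t lane) {
    Vec4f r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = x[i] * y[lane];
    return r;
}
```

In this model, fmul_vector(dup_lane(c, 0), b) and fmul_by_element(b, c, 0) return the same lanes for any b and c, which is the property that makes the separate dup unnecessary.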
Neon support for shift right and accumulate immediate
This instruction right-shifts a SIMD source by an immediate and accumulates the final result with the destination SIMD register. As mentioned earlier, we started working on this optimization in the 17.6 release and completed the feature in the 17.7 release.
In the 17.5 release, MSVC turned right shifts into left shifts with a negative shift amount. To implement it that way, it copied the constant -2 into a general register, and then duplicated that general register into a SIMD register before the left shift.
In the 17.6 release, we eliminated the dup instruction and used a right shift with an immediate value. It was an improvement, but not yet good enough.
We further improved the implementation in 17.7 by combining the right shift and the add into a single instruction: usra for unsigned and ssra for signed operations. The final implementation is much more compact than the previous ones.
To better understand how this optimization works let’s look at the sample code below:
    void test(unsigned long long * __restrict a, unsigned long long * __restrict b)
    {
        for (int i = 0; i < 2; i++)
            a[i] += b[i] >> 2;
    }
Code generated by MSVC 17.5:
    mov         x8,#-2
    ldr         q16,[x1]
    dup         v17.2d,x8
    ldr         q18,[x0]
    ushl        v17.2d,v16.2d,v17.2d
    add         v18.2d,v17.2d,v18.2d
    str         q18,[x0]
Code generated by MSVC 17.6:
    ldr         q16,[x1]
    ushr        v17.2d,v16.2d,#2
    ldr         q16,[x0]
    add         v16.2d,v17.2d,v16.2d
    str         q16,[x0]
Code generated by MSVC 17.7:
    ldr         q16,[x0]
    ldr         q17,[x1]
    usra        v16.2d,v17.2d,#2
    str         q16,[x0]
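The three listings compute the same values. As a rough sketch of why, the portable C++ below models the 17.5 sequence (ushl with a negative shift count, then add) and the 17.7 usra instruction on two 64-bit lanes. Vec2u64, ushl_then_add_model, and usra_model are hypothetical names used only for this illustration, and the models assume a shift amount smaller than 64.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a SIMD register holding two 64-bit lanes.
using Vec2u64 = std::array<std::uint64_t, 2>;

// Models `ushl` followed by `add`: a negative per-lane shift count means
// "shift right by that magnitude". Assumes |shift| < 64.
Vec2u64 ushl_then_add_model(Vec2u64 acc, const Vec2u64& src, int shift) {
    for (std::size_t i = 0; i < acc.size(); ++i) {
        const std::uint64_t shifted =
            (shift < 0) ? (src[i] >> -shift) : (src[i] << shift);
        acc[i] += shifted;
    }
    return acc;
}

// Models `usra vD.2d, vN.2d, #imm`: shift each source lane right by an
// immediate and accumulate into the destination lane. Assumes imm < 64.
Vec2u64 usra_model(Vec2u64 acc, const Vec2u64& src, unsigned imm) {
    for (std::size_t i = 0; i < acc.size(); ++i) acc[i] += src[i] >> imm;
    return acc;
}
```

For the sample above, ushl_then_add_model(a, b, -2) and usra_model(a, b, 2) agree lane for lane, so the single usra can replace the shift and the add.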
Neon right shift into cmp
An arithmetic right shift of a signed integer by its bit width minus one (31 for a 32-bit int) copies the most significant bit, the sign bit, into every bit position, so the result is either -1 or 0. That is equivalent to cmlt against zero, which sets each element to all ones if it is less than zero and to zero otherwise. Similar to the previous optimization, we progressively made improvements and completed the feature in 17.7.
To better understand why we introduced this optimization, let’s take the following snippet of code:
    void shf2cmlt(int * __restrict a, int * __restrict b)
    {
        for (int i = 0; i < 4; i++)
            b[i] = a[i] >> 31;
    }
Code generated by MSVC 17.5:
    ldr         q16,[x0]
    mvni        v17.4s,#0x1E
    sshl        v17.4s,v16.4s,v17.4s
    str         q17,[x1]
Code generated by MSVC 17.6:
    ldr         q16,[x0]
    sshr        v16.4s,v16.4s,#0x1F
    str         q16,[x1]
sshr was certainly an improvement, but it is not the best solution because its throughput is 1 and it can go through only the V1 pipeline. cmlt has better throughput because it can go through either the V1 or the V2 pipeline on Arm’s Neoverse N1 core, as documented in the Arm® Neoverse™ N1 Software Optimization Guide.
Code generated by MSVC 17.7:
    ldr         q16,[x0]
    cmlt        v16.4s,v16.4s,#0
    str         q16,[x1]
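The equivalence between the shift and the compare can be checked on a single 32-bit lane. The sketch below is a scalar model in portable C++, with the hypothetical function names sshr31_model and cmlt_zero_model. It assumes that >> on a negative signed value is an arithmetic shift, which C++20 guarantees and which mainstream compilers also implement for earlier standards.

```cpp
#include <cstdint>

// Models `sshr vD.4s, vN.4s, #31` on one lane.
// Assumption: >> on a negative value is an arithmetic (sign-filling) shift.
std::int32_t sshr31_model(std::int32_t x) {
    return x >> 31;
}

// Models `cmlt vD.4s, vN.4s, #0` on one lane:
// all ones (-1) when the lane is less than zero, otherwise 0.
std::int32_t cmlt_zero_model(std::int32_t x) {
    return x < 0 ? -1 : 0;
}
```

Both functions return -1 for every negative input and 0 for every non-negative input, including the extremes INT32_MIN and INT32_MAX.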
Neon right shift and narrow into shifted narrow
This is another case where we can combine 2 instructions into one, this time using shrn. The shrn instruction right-shifts each integer element of the source SIMD register by an immediate value, narrows each result to half its original width, and writes the vector to the lower half of the destination SIMD register.
In this example, the generated code multiplies 16-bit integers and puts the results into 32-bit elements. It right-shifts each result by an immediate value, then narrows it to 16 bits and writes it to the final destination.
To better understand this optimization, consider the following sample of code:
    void test(unsigned short * __restrict a, short * __restrict b)
    {
        for (int i = 0; i < 4; i++)
            b[i] = (a[i] * a[i]) >> 10;
    }
Code generated by MSVC 17.6:
    ldr         d16,[x0]
    umull       v16.4s,v16.4h,v16.4h
    sshr        v16.4s,v16.4s,#0xA
    xtn         v16.4h,v16.4s
    str         d16,[x1]
Code generated by MSVC 17.7:
    ldr         d16,[x0]
    umull       v16.4s,v16.4h,v16.4h
    shrn        v16.4h,v16.4s,#0xA
    str         d16,[x1]
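One detail worth spelling out is why a signed shift (sshr) followed by xtn can be replaced by shrn, which does not distinguish signed from unsigned. For shift amounts from 1 to 16, the low 16 bits of an arithmetic and a logical right shift are identical, because the sign-fill bits land only in the upper half that the narrowing step discards. The scalar sketch below illustrates this on one lane. The function names are hypothetical, and the model assumes >> on a negative value is an arithmetic shift and that converting an out-of-range unsigned value to int32_t wraps (guaranteed from C++20).

```cpp
#include <cstdint>

// Models `sshr #n` followed by `xtn` on one 32-bit lane, taking the lane
// as a raw bit pattern. Assumes 1 <= n <= 16 and an arithmetic >>.
std::uint16_t sshr_then_xtn_model(std::uint32_t bits, unsigned n) {
    const std::int32_t shifted = static_cast<std::int32_t>(bits) >> n;
    return static_cast<std::uint16_t>(static_cast<std::uint32_t>(shifted));
}

// Models `shrn #n` on one 32-bit lane: shift right, then keep the low
// 16 bits. Assumes 1 <= n <= 16.
std::uint16_t shrn_model(std::uint32_t bits, unsigned n) {
    return static_cast<std::uint16_t>(bits >> n);
}
```

For example, the bit pattern 0xFFFE0001 (the 32-bit product of 65535 and 65535) shifted by 10 yields 0xFF80 from both models.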
Scalar code generation: improved negation of bool values
A negation of a Boolean value can generally be optimized using an exclusive or with 1.
For example,
0 ^ 1 = 1
1 ^ 1 = 0
In 17.6, MSVC generated unnecessary code, using cmp + cseteq to negate the return value of a Boolean function.
To better understand this optimization let’s take the following sample code:
    bool foo();

    bool negate_bool()
    {
        return !foo();
    }
Code generated by MSVC 17.6:
    bl          |?foo@@YA_NXZ|
    uxtb        w8,w0
    cmp         w8,#0
    cseteq      w0
    ldr         lr,[sp],#0x10
It zero-extends the return value of foo() and compares it with 0, then uses cseteq to return 1 if the value is equal to 0 and 0 otherwise.
In 17.7 we improved this by using a bitwise exclusive or with 1, which is equivalent to negation.
Code generated by MSVC 17.7:
    bl          |?foo@@YA_NXZ|
    eor         w8,w0,#1
    uxtb        w0,w8
    ldr         lr,[sp],#0x10
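The same idea can be written at the source level. The sketch below contrasts the two forms in portable C++. The names negate_with_compare and negate_with_xor are hypothetical, and the exclusive-or form relies on a bool holding only the values 0 or 1.

```cpp
// Mirrors the 17.6 shape: compare with zero, then set the result
// from the outcome of the comparison.
bool negate_with_compare(bool b) {
    return b == false;
}

// Mirrors the 17.7 shape: flip the low bit with an exclusive or.
// Valid because a bool is always 0 or 1 when converted to an integer.
bool negate_with_xor(bool b) {
    return static_cast<bool>(static_cast<unsigned>(b) ^ 1u);
}
```

Both functions return true for false and false for true, matching the built-in ! operator.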
ARM64: eliminating redundant comparisons
There were certain cases where MSVC failed to recognize that the comparison conditions are the same and the condition flags are unchanged. In the following example, 17.6 emits 3 comparisons for if (cond), one for each assignment of a, b, and c using cseleq.
    struct A {
        int a;
        int b;
        int c;
    };

    void foo(struct A *sa, int cond)
    {
        int a = 1, b = 2, c = 3;
        if (cond) {
            a = 20;
            b = 33;
            c = 48;
        }
        sa->a = a;
        sa->b = b;
        sa->c = c;
    }
Code generated by MSVC 17.6:
    cmp         w1,#0
    mov         w9,#1
    mov         w8,#0x14
    cseleq      w10,w9,w8
    cmp         w1,#0
    mov         w9,#2
    mov         w8,#0x21
    cseleq      w8,w9,w8
    stp         w10,w8,[x0]
    cmp         w1,#0
    mov         w8,#0x30
    mov         w9,#3
    cseleq      w8,w9,w8
    ; Line 19
    str         w8,[x0,#8]
We taught the backend to recognize this pattern. In the 17.7 release, the generated code performs a single cmp and reuses the condition flags for all three cseleq instructions.
Code generated by MSVC 17.7:
    cmp         w1,#0
    mov         w9,#1
    mov         w8,#0x14
    cseleq      w10,w9,w8
    mov         w9,#2
    mov         w8,#0x21
    cseleq      w8,w9,w8
    stp         w10,w8,[x0]
    mov         w9,#3
    mov         w8,#0x30
    cseleq      w8,w9,w8
    ; Line 19
    str         w8,[x0,#8]
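The shape of the optimized code can be sketched at the source level: evaluate the condition once, then make three selections from that single result. The function foo_selects below is a hypothetical variant of foo that returns the struct by value to keep the illustration short. It is only a model of the select pattern, not a description of the backend's internals.

```cpp
struct A {
    int a;
    int b;
    int c;
};

// Hypothetical variant of foo(): one comparison feeds three selects,
// mirroring a single `cmp w1,#0` followed by three `cseleq` instructions.
A foo_selects(int cond) {
    const bool is_zero = (cond == 0);  // evaluated once, like the single cmp
    return A{
        is_zero ? 1 : 20,  // first cseleq
        is_zero ? 2 : 33,  // second cseleq
        is_zero ? 3 : 48   // third cseleq
    };
}
```

Calling foo_selects(0) yields the fields 1, 2, 3, and any nonzero argument yields 20, 33, 48, the same values that foo stores through its pointer argument.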
Summary
We would like to thank everyone for giving us feedback, which is very valuable to us, and we look forward to hearing more from you. We will continue to improve MSVC for ARM developers in 17.8, so please stay tuned. Please share your thoughts and comments with us through the Visual C++ Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.