In Visual Studio 2022 version 17.6 we added a host of new ARM64 optimizations. In this second edition of our blog series, we will highlight further performance improvements to the MSVC ARM64 compiler backend and discuss key optimizations in Visual Studio 2022 version 17.7 for both the scalar ISA and the SIMD ISA (NEON). We started introducing these performance optimizations in the 17.6 release and landed them in the 17.7 release.
By-element operation
ARM64 supports by-element operation in several instructions such as fmul, fmla, fmls, etc. This feature allows one operand of a SIMD operation to be a single element of a SIMD register, selected by an index. In the example below, where we multiply an array by a scalar, MSVC duplicated the vector element v0.s[0] into all lanes of a SIMD register, then multiplied that with the other SIMD operand loaded from array b. This is not efficient because the dup instruction adds two more cycles of latency before the fmul instruction can execute.
To better understand this optimization, let’s take the following sample code:
void test(float * __restrict a, float * __restrict b, float c) {
    for (int i = 0; i < 4; i++)
        a[i] = b[i] * c;
}
Code generated by MSVC 17.6:
    dup        v17.4s,v0.s[0]
    ldr        q16,[x1]
    fmul       v16.4s,v17.4s,v16.4s
    str        q16,[x0]
In Visual Studio 2022 17.7, we eliminated the dup instruction. The fmul can now multiply by a SIMD element directly, and the code generation is:
    ldr        q16,[x1]
    fmul       v16.4s,v16.4s,v0.s[0]
    str        q16,[x0]
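When writing NEON intrinsics by hand, the by-element form can be requested through vmulq_n_f32, which multiplies every lane by a scalar. Below is a minimal sketch of the same loop; test_neon is an illustrative name, and we assume <arm_neon.h> is available:

    #include <arm_neon.h>

    void test_neon(float *a, const float *b, float c) {
        float32x4_t vb = vld1q_f32(b);        // ldr  q16,[x1]
        float32x4_t vr = vmulq_n_f32(vb, c);  // fmul v16.4s,v16.4s,v0.s[0]
        vst1q_f32(a, vr);                     // str  q16,[x0]
    }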
Neon support for shift right and accumulate with immediate
This instruction right-shifts a SIMD source by an immediate value and accumulates the result into the destination SIMD register. As mentioned earlier, we started working on this optimization in the 17.6 release and completed the feature in the 17.7 release.
In the 17.5 release, MSVC turned right shifts into left shifts using a negative shift amount. To implement it that way, it copied the constant -2 into a general register, then duplicated that general register into a SIMD register before the left shift.
We eliminated the dup instruction and used a right shift with an immediate value in the 17.6 release. It was an improvement, but not yet good enough.
We further improved the implementation in 17.7 by combining the right shift and the add into a single instruction, using usra for the unsigned and ssra for the signed operation. The final implementation is much more compact than the previous ones.
To better understand how this optimization works, let’s look at the sample code below:
void test(unsigned long long * __restrict a, unsigned long long * __restrict b) {
    for (int i = 0; i < 2; i++)
        a[i] += b[i] >> 2;
}
Code generated by MSVC 17.5:
    mov        x8,#-2
    ldr        q16,[x1]
    dup        v17.2d,x8
    ldr        q18,[x0]
    ushl       v17.2d,v16.2d,v17.2d
    add        v18.2d,v17.2d,v18.2d
    str        q18,[x0]
Code generated by MSVC 17.6:
    ldr        q16,[x1]
    ushr       v17.2d,v16.2d,#2
    ldr        q16,[x0]
    add        v16.2d,v17.2d,v16.2d
    str        q16,[x0]
Code generated by MSVC 17.7:
    ldr        q16,[x0]
    ldr        q17,[x1]
    usra       v16.2d,v17.2d,#2
    str        q16,[x0]
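The shift-right-and-accumulate pattern is also available directly through the vsraq_n_u64 intrinsic, which maps to usra. A minimal sketch under that assumption (test_neon is an illustrative name, assuming <arm_neon.h>):

    #include <arm_neon.h>

    void test_neon(uint64_t *a, const uint64_t *b) {
        uint64x2_t va = vld1q_u64(a);
        uint64x2_t vb = vld1q_u64(b);
        // usra: shift each lane of vb right by 2, then accumulate into va.
        vst1q_u64(a, vsraq_n_u64(va, vb, 2));
    }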
Neon right shift into compare
An arithmetic right shift of a signed 32-bit integer by 31 replicates the most significant bit, so the result is either -1 or 0; this is exactly what cmlt (compare signed less than zero) computes. Similar to the previous optimization, we progressively made improvements and completed the feature in 17.7.
To better understand why we introduced this optimization, let’s take the following snippet of code:
void shf2cmlt(int * __restrict a, int * __restrict b) {
    for (int i = 0; i < 4; i++)
        b[i] = a[i] >> 31;
}
Code generated by MSVC 17.5:
    ldr        q16,[x0]
    mvni       v17.4s,#0x1E
    sshl       v17.4s,v16.4s,v17.4s
    str        q17,[x1]
Code generated by MSVC 17.6:
    ldr        q16,[x0]
    sshr       v16.4s,v16.4s,#0x1F
    str        q16,[x1]
sshr was certainly an improvement, but not the best solution, because its throughput is 1 and it can only go through the V1 pipeline. cmlt has better throughput, as it can go through either the V1 or the V2 pipeline on Arm’s Neoverse N1 core, per the Arm® Neoverse™ N1 Software Optimization Guide.
Code generated by MSVC 17.7:
    ldr        q16,[x0]
    cmlt       v16.4s,v16.4s,#0
    str        q16,[x1]
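For completeness, cmlt is exposed as the vcltzq_s32 intrinsic, so the compare-less-than-zero mask can also be written explicitly. A minimal sketch (shf2cmlt_neon is an illustrative name, assuming <arm_neon.h>):

    #include <arm_neon.h>

    void shf2cmlt_neon(const int32_t *a, int32_t *b) {
        int32x4_t va = vld1q_s32(a);
        // cmlt: a lane becomes all ones (-1) where a[i] < 0, and 0 otherwise.
        uint32x4_t mask = vcltzq_s32(va);
        vst1q_s32(b, vreinterpretq_s32_u32(mask));
    }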
Neon right shift and narrow into shifted narrow
This is another case where we can combine two instructions into one using shrn. The shrn instruction right-shifts each element of the source SIMD register by an immediate value, narrows the result, and writes the vector to the lower half of the destination SIMD register.
In this example, the generated code multiplies 16-bit integers and puts the results into unsigned 32-bit lanes. In 17.6 it then right-shifts each result by an immediate value, narrows it with a separate xtn, and writes the lower half to the final destination.
To better understand this optimization, consider the following sample of code:
void test(unsigned short * __restrict a, short * __restrict b) {
    for (int i = 0; i < 4; i++)
        b[i] = (a[i] * a[i]) >> 10;
}
Code generated by MSVC 17.6:
    ldr        d16,[x0]
    umull      v16.4s,v16.4h,v16.4h
    sshr       v16.4s,v16.4s,#0xA
    xtn        v16.4h,v16.4s
    str        d16,[x1]
Code generated by MSVC 17.7:
    ldr        d16,[x0]
    umull      v16.4s,v16.4h,v16.4h
    shrn       v16.4h,v16.4s,#0xA
    str        d16,[x1]
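The widening multiply plus shift-right-narrow sequence maps onto the vmull_u16 (umull) and vshrn_n_u32 (shrn) intrinsics. A minimal sketch (test_neon is an illustrative name, assuming <arm_neon.h>):

    #include <arm_neon.h>

    void test_neon(const uint16_t *a, int16_t *b) {
        uint16x4_t va   = vld1_u16(a);
        uint32x4_t prod = vmull_u16(va, va);      // umull: widen to 32-bit lanes
        uint16x4_t lo   = vshrn_n_u32(prod, 10);  // shrn: shift right and narrow
        vst1_s16(b, vreinterpret_s16_u16(lo));
    }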
Scalar code generation: improved negation of bool values
A negation of a Boolean value can generally be optimized into an exclusive or with 1. For example:
0 ^ 1 = 1
1 ^ 1 = 0
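At the source level, the rewrite amounts to the following sketch (negate_bool_manual is an illustrative name):

    #include <stdbool.h>

    // A bool holds only 0 or 1, so !v and v ^ 1 produce the same result.
    bool negate_bool_manual(bool v) {
        return (bool)(v ^ 1);
    }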
In 17.6, MSVC generated unnecessary code, using cmp + cseteq, to negate a Boolean function’s return value.
To better understand this optimization, let’s take the following sample code:
bool foo();

bool negate_bool() {
    return !foo();
}
Code generated by MSVC 17.6:
    bl         |?foo@@YA_NXZ|
    uxtb       w8,w0
    cmp        w8,#0
    cseteq     w0
    ldr        lr,[sp],#0x10
It zero-extends the return value of function foo() and compares it with 0, then uses cseteq to return 1 if the value equals 0 and 0 otherwise.
In 17.7, we improved this by emitting a bitwise exclusive or with 1, which is equivalent to the negation.
Code generated by MSVC 17.7:
    bl         |?foo@@YA_NXZ|
    eor        w8,w0,#1
    uxtb       w0,w8
    ldr        lr,[sp],#0x10
ARM64: eliminating redundant comparisons
There are certain cases where MSVC failed to recognize that the comparison conditions were the same and the condition flags were unchanged between uses. In the following example, it emits three separate comparisons for if (cond) when assigning a, b, and c with cseleq.
struct A {
    int a;
    int b;
    int c;
};

void foo(struct A *sa, int cond) {
    int a = 1, b = 2, c = 3;
    if (cond) {
        a = 20;
        b = 33;
        c = 48;
    }
    sa->a = a;
    sa->b = b;
    sa->c = c;
}
Code generated by MSVC 17.6:
    cmp        w1,#0
    mov        w9,#1
    mov        w8,#0x14
    cseleq     w10,w9,w8
    cmp        w1,#0
    mov        w9,#2
    mov        w8,#0x21
    cseleq     w8,w9,w8
    stp        w10,w8,[x0]
    cmp        w1,#0
    mov        w8,#0x30
    mov        w9,#3
    cseleq     w8,w9,w8 ; Line 19
    str        w8,[x0,#8]
We taught the backend to recognize this pattern: no instruction between the conditional selects modifies the condition flags, so the first cmp suffices. The code generation in the 17.7 release has been optimized accordingly.
Code generated by MSVC 17.7:
    cmp        w1,#0
    mov        w9,#1
    mov        w8,#0x14
    cseleq     w10,w9,w8
    mov        w9,#2
    mov        w8,#0x21
    cseleq     w8,w9,w8
    stp        w10,w8,[x0]
    mov        w9,#3
    mov        w8,#0x30
    cseleq     w8,w9,w8 ; Line 19
    str        w8,[x0,#8]
Summary
We would like to thank everyone for giving us feedback, and we look forward to hearing more from you. We will continue to improve MSVC for ARM developers in 17.8, so please stay tuned. Your feedback is very valuable to us. Please share your thoughts and comments with us through the Visual C++ Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.