In the last couple of months, the Microsoft C++ team has been working on improving MSVC ARM64 backend performance, and we are excited to have a couple of optimizations available in Visual Studio 2022 version 17.6. These optimizations improve code-generation for both the scalar ISA and the SIMD ISA (NEON). Let’s review some interesting optimizations in this blog post.
Before diving into technical details, we’d encourage you to file feedback on Developer Community if you have found performance issues. The feedback helps us prioritize work items in our backlog. This report, optimize neon right shift into cmp, is an example of good feedback: a tagged subject title, a detailed description of the issue, and a simple repro simplify our analysis and help us deliver a fix more quickly.
Now, let’s see the optimizations.
Auto-Vectorizer supports more NEON instructions with asymmetric operands
The ARM64 backend already supports some NEON instructions with asymmetric typed operands, like the Add/Subtract Long operations (SADDL/UADDL/SSUBL/USUBL). These instructions add each vector element in the lower or upper half of the first source SIMD register to the corresponding vector element of the second source SIMD register and write the vector result to the destination SIMD register. The destination vector elements are twice as long as the source vector elements. Now, we have extended such support to Multiply-Add Long and Multiply-Subtract Long (SMLAL/UMLAL/SMLSL/UMLSL).
For example:
void smlal(int * __restrict dst, int * __restrict a,
           short * __restrict b, short * __restrict c)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i] * c[i];
}
In Visual Studio 2022 17.5, the code-generation was:
sxtl v19.4s,v16.4h
sxtl v18.4s,v17.4h
mla v20.4s,v18.4s,v19.4s
Extra sign extensions are performed on both source operands to match the type of the destination. Now it has been optimized into a single smlal v16.4s,v17.4h,v18.4h.
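The unsigned forms follow the same pattern. As a hedged sketch (this listing is not from the original post), a loop with unsigned short sources and an unsigned int accumulator is the kind of source the auto-vectorizer can now map onto UMLAL instead of widening both multiplicands first:
// Hypothetical example: the unsigned counterpart of the smlal case above.
// The product of the two narrow sources is accumulated into the wide
// destination, matching the UMLAL long multiply-add form. The cast keeps
// the multiplication unsigned and well-defined for large values.
void umlal(unsigned int * __restrict dst, unsigned int * __restrict a,
           unsigned short * __restrict b, unsigned short * __restrict c)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + (unsigned int)b[i] * c[i];
}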
The ARM64 ISA further supports another variant for these operations, which is called Add/Subtract Wide. For them, the asymmetry happens between source operands, not between source and destination.
For example:
void saddw(int *__restrict dst, int *__restrict a, short *__restrict b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}
In Visual Studio 2022 17.5, the code-generation was:
sxtl v17.4s,v16.4h
add v18.4s,v17.4s,v18.4s
The narrow source gets an extra sign extension to match the other, wide source. In the 17.6 release, this has been optimized into a single saddw v16.4s,v16.4s,v17.4h. The same applies to UADDW/SSUBW/USUBW.
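The subtract-wide counterpart follows the same shape. As a hedged sketch (not taken from the post), turning the addition into a subtraction gives the kind of loop that can map onto SSUBW, so the narrow operand no longer needs a separate SXTL first:
// Hypothetical example: signed subtract wide.
// One wide (int) source and one narrow (short) source; the narrow
// source is widened implicitly by the SSUBW form of the subtraction.
void ssubw(int *__restrict dst, int *__restrict a, short *__restrict b)
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] - b[i];
}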
Auto-vectorizer now supports small types on ABS/MIN/MAX
ABS/MIN/MAX have slightly complex semantics. Normally, the compiler middle-end or back-end has a pattern matcher that recognizes IR sequences with if-then-else semantics and checks whether they can be converted into ABS/MIN/MAX. There is an issue when the operands are of small types (int8 or int16), though.
As specified by the C++ standard, small types are promoted to int, which is 32-bit on ARM64. This is fine for scalar operations because they can only operate at scalar register width anyway; on ARM64, the smallest width is 32-bit, using the sub-register. However, this is not true for the SIMD ISA, whose minimum operation width is the width of a vector lane (element). For example, ARM64 NEON supports int8 and int16 elements for a number of operations, including ABS/MIN/MAX. So, to generate SIMD instructions operating on small element sizes and deliver higher computing throughput, the auto-vectorizer needs to analyze and narrow the type back to the original small type when it is safe to do so.
On top of this, ABS/MIN/MAX bring further challenges because their source operands are scattered across different basic blocks due to their built-in if-then-else semantics. We found the MSVC auto-vectorizer might narrow the type of the operand in only one basic block, making it inconsistent with the operands in the other basic blocks. Such inconsistency causes the auto-vectorizer to fail to match the pattern and cancel vectorization.
For example:
void test(signed char * __restrict a, signed char * __restrict b)
{
    for (int i = 0; i < 16; i++)
        a[i] = b[i] > 0 ? b[i] : -b[i];
}
In Visual Studio 2022 17.5, there was no vectorization, and the code-generation was:
ldrsb w8,[x1,#1]
cmp w8,#0
bgt |$LN16@test_abs_v|
neg w8,w8
sxtb w8,w8
In the 17.6 release, the code-generation has been improved into a single abs v16.16b,v16.16b.
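MIN and MAX benefit from the same narrowing. As a hedged sketch (this listing is not from the original post), an element-wise minimum over signed char is the kind of loop that the narrowed types should now allow to vectorize to a byte-lane SMIN instead of staying scalar:
// Hypothetical example: element-wise minimum on a small type.
// Once the promoted int operands are narrowed back to signed char,
// the if-then-else pattern can map onto SMIN over 16 byte lanes.
void test_min(signed char * __restrict a, signed char * __restrict b)
{
    for (int i = 0; i < 16; i++)
        a[i] = a[i] < b[i] ? a[i] : b[i];
}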
Scalar code-generation improved based on value range analysis
When a register is compared with an immediate value, the compiler can deduce the value range of the register, and this information is useful for later optimizations, for example, evaluating comparison results statically.
MSVC already has infrastructure for value range deduction; the backend just needs to teach the middle-end about the semantics of its supported comparison instructions.
The ARM64 backend previously missed this support for CBZ and CMP. For cbz reg, label, reg must equal zero on the true path, and the same applies to CBNZ on the false path. Likewise, for cmp reg, #imm, reg must equal imm on the true path of the subsequent equality branch. Knowing such equivalences, the compiler can simplify code-generation. Let’s see a simple example:
int cal(int a)
{
    if (!a)
        return a;
    else
        return a * 2;
}
In Visual Studio 2022 17.5, MSVC was generating the following instruction sequence:
|int cal(int)| PROC
cbnz w0,|$LN2@cal|
mov w0,#0 <- redundant mov
ret
|$LN2@cal|
lsl w0,w0,#1
ret
The mov w0, #0 is redundant because when execution reaches it, w0 must be zero. After our recent optimization, the code-generation in release 17.6 no longer has the redundant move:
|int cal(int)| PROC
lsl w8,w0,#1
cmp w0,#0
cseleq w0,wzr,w8
ret
Scalar code-generation improved to catch more if-conversion opportunities
You may have noticed there is another difference in the code-generation for the above test case: CSELEQ is used instead of a branch. The ARM64 ISA provides the CSEL instruction for if-conversion, which saves both execution cycles and code size. Previously, MSVC couldn’t generate CSEL when the selected value came from a return statement. We fixed this in the 17.6 release, so for the above example CSELEQ is used instead of CBNZ.
In summary, for the if-return-else-return pattern, the ARM64 backend has been taught to generate a CSEL instruction when the returned value is produced by any of the following operations:
- Unary: NOT and type conversion
- Binary: Add, Subtract, And/Or/Xor, Logical Shift Left/Right
Let’s see another example:
int test(int a) {
    if (a > 0xFFFF)
        return ~a;
    else
        return a;
}
In Visual Studio 2022 17.5, the code-generation was:
|int test(int)| PROC
mov w8,#0xFFFF
cmp w0,w8
ble |$LN3@test|
mvn w0,w0
|$LN3@test|
ret
It employs a branch and contains multiple basic blocks. In the 17.6 release, the code-generation is improved by if-conversion and is lean and mean:
|int test(int)| PROC
mov w8,#0xFFFF
cmp w0,w8
cinvgt w0,w0
ret
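The same applies to the binary operations in the list above. As a hedged sketch (not from the post), an add in one of the return statements fits the if-return-else-return pattern and should be a candidate for CSEL-based if-conversion rather than a branch; the exact sequence MSVC emits may differ:
// Hypothetical example: binary Add in one arm of if-return-else-return.
// The pattern is a candidate for if-conversion via CSEL, avoiding a
// conditional branch over the add.
int test_add(int a, int b)
{
    if (a > b)
        return a + b;
    else
        return a;
}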
ARM64 instruction combiner now supports instructions with multiple definitions
As a modern compiler, MSVC carries out various instruction combining across compilation stages. While a number of combine rules have already been implemented, we recently found one interesting case was missing. Take the following case as an example:
int test(int a, int b)
{
    if (a < b)
        return b - a;
    else
        return 7;
}
In Visual Studio 2022 17.5, the code-generation was:
|int test(int,int)| PROC
cmp w0,w1
sub w0,w1,w0
blt |$LN3@test|
mov w0,#7
|$LN3@test|
ret
The cmp w0, w1 and the following sub w0, w1, w0 can be combined into a single SUBS. We missed this because the later-stage combiner had glitches when combining into instructions with multiple definitions. Now we have fixed this, so the code-generation in the 17.6 release has been improved into:
|int test(int,int)| PROC
subs w9,w1,w0
mov w8,#7
cselgt w0,w9,w8
ret
The code-generation uses SUBS to set the condition flags and perform the subtraction at the same time; since CMP is just an alias of SUBS that discards its result, the separate compare can be folded into the subtraction, with the condition adjusted for the swapped operands. It also benefits from the above-mentioned if-conversion optimization on return statements, so CSEL is used instead of a branch.
In closing
That is all for this blog post, and we will keep you updated on our progress. Your feedback is very valuable to us. Please share your thoughts and comments with us through the Visual C++ Developer Community. You can also reach us on Twitter (@VisualC), or via email at visualcpp@microsoft.com.
The first example in ‘Scalar code-generation improved based on value range analysis’ could be further optimized to just ‘lsl w0, w0, #1’, as shifting zero left by 1 is still 0.
This is handled in gcc by the phiopt pass (I assume something similar exists in llvm too).
Good spot, Andrew. I have created a problem report to track it. Please stay tuned for the fix.
Hi Jiong,
Great work, nice read.
Just out of curiosity, I copied the examples to Compiler Explorer and looked at the codegen of some open-source compilers.
For the function `cal`, it looks like one LSL suffices: `lsl w0, w0, #1`
For the second example (the `test` function), the constant doesn’t need to be materialised: https://godbolt.org/z/KKndKjK31
Cheers.
Thanks for the feedback, Sjoerd. The constant issue is in our backlog; please stay tuned.
Fantastic progress Jiong! 🙂
Przemek
Is there anything new to report for the x64 backend? It basically never wins against gcc/clang in any scenario I’ve tried, and the number of issues that keep rolling in on developercommunity seems to agree with my experience.
In the intervening 5-6 years before ARM on Windows PCs really takes off (maybe), the x64 ecosystem still really needs to be improved.
Awesome work, keep it up!