{"id":32267,"date":"2023-05-29T14:42:26","date_gmt":"2023-05-29T14:42:26","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=32267"},"modified":"2024-09-10T07:55:35","modified_gmt":"2024-09-10T07:55:35","slug":"msvc-arm64-optimizations-in-visual-studio-2022-17-6","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-arm64-optimizations-in-visual-studio-2022-17-6\/","title":{"rendered":"MSVC ARM64 optimizations in Visual Studio 2022 17.6\u00a0"},"content":{"rendered":"<p><span data-contrast=\"auto\">In the last couple of months, the Microsoft C++ team has been working on improving MSVC ARM64 backend performance and we are excited to have a couple of optimizations available in the <\/span><a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/visual-studio-2022-17-6-now-available\/\"><span data-contrast=\"none\">Visual Studio 2022 version 1<\/span><span data-contrast=\"none\">7.6<\/span><\/a><span data-contrast=\"auto\">. These optimizations improved code-generation for both scalar ISA and SIMD ISA (NEON). Let\u2019s review some interesting optimizations in this blog.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Before diving into technical details, we&#8217;d encourage you to create feedback here at <\/span><a href=\"https:\/\/developercommunity.visualstudio.com\/cpp\"><span data-contrast=\"none\">Developer Community<\/span><\/a><span data-contrast=\"auto\"> if you have found performance issues. The feedback helps us prioritize work items in our backlog. This, <\/span><a href=\"https:\/\/developercommunity.visualstudio.com\/t\/arm64-backend-neon-optimize-neon-rig\/10146259?q=backend+neon\"><span data-contrast=\"none\">optimize neon right shift into cmp<\/span><\/a><span data-contrast=\"auto\">, is an example of good feedback. 
Including a tagged subject title, detailed description of the issue, and a simple repro simplifies our analysis work and helps us deliver a fix more quickly.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Now, let&#8217;s see the optimizations.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<h2>Auto-Vectorizer supports more NEON instructions with asymmetric operands<\/h2>\n<p><span data-contrast=\"auto\">The ARM64 backend already supports some NEON instructions with asymmetric typed operands, like Add\/Subtract Long operations (SADDL\/UADDL\/SSUBL\/USUBL). These instructions add each vector element in the lower or upper half of the first source SIMD register to the corresponding vector element of the second source SIMD register and write the vector result to the destination SIMD register. The destination vector elements are twice as long as the source vector elements. 
Now, we have extended such support to Multiply-Add Long and Multiply-Subtract Long (<code>SMLAL<\/code>\/<code>UMLAL<\/code>\/<code>SMLSL<\/code>\/<code>UMLSL<\/code>).<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">For example:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">void smlal(int * __restrict dst, int * __restrict a,\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 short * __restrict b, short * __restrict c)\u00a0\r\n{\u00a0\r\n    for (int i = 0; i &lt; 4; i++)\u00a0\r\n        dst[i] = a[i] + b[i] * c[i];\u00a0\r\n} <\/code><\/pre>\n<p><span data-contrast=\"auto\">In Visual Studio 2022 17.5, the code-generation was:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">sxtl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v19.4s,v16.4h\u00a0\r\nsxtl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v18.4s,v17.4h\u00a0\r\nmla\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v20.4s,v18.4s,v19.4s\u00a0<\/code><span data-contrast=\"auto\">\u202f<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559685&quot;:180,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">Extra sign extensions are performed on both source operands to match the destination type. 
Now it has been optimized into a single <\/span><code>smlal v16.4s,v17.4h,v18.4h<\/code><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The ARM64 ISA further supports another variant for these operations, which is called Add\/Subtract Wide. For them, the <\/span><span data-contrast=\"none\">asymmetry happens between source operands, not between source and destination.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">For example:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">void saddw(int *__restrict dst, int *__restrict a, short *__restrict b)\u00a0\r\n{\u00a0\r\n    for (int i = 0; i &lt; 4; i++)\u00a0\r\n        dst[i] = a[i] + b[i];\u00a0\r\n}<\/code><\/pre>\n<p><span data-contrast=\"auto\">In Visual Studio 2022 17.5, the code-generation was:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">sxtl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v17.4s,v16.4h\u00a0\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v18.4s,v17.4s,v18.4s\u00a0\u00a0<\/code><\/pre>\n<p><span data-contrast=\"auto\">The narrow source gets extra signed extension to match the other wide source. In the 17.6 release, this has been optimized into a single <\/span><code>saddw v16.4s,v16.4s,v17.4h<\/code><span data-contrast=\"auto\">. 
The same applies to <code>UADDW<\/code>\/<code>SSUBW<\/code>\/<code>USUBW<\/code>.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n<h2>Auto-vectorizer now supports small types on <code>ABS<\/code>\/<code>MIN<\/code>\/<code>MAX<\/code><code><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/code><\/h2>\n<p><span data-contrast=\"auto\"><code>ABS<\/code>\/<code>MIN<\/code>\/<code>MAX<\/code> have slightly complex semantics. Normally, the compiler middle-end or back-end will have a pattern matcher to recognize IR sequences with if-then-else semantics and see if they could be converted into <code>ABS<\/code>\/<code>MIN<\/code>\/<code>MAX<\/code>. There is an issue when the operands are in small types (int8 or int16) though.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><span data-contrast=\"auto\">As specified by the C++ standard, small types are promoted to <code>int<\/code>, which is 32-bit on ARM64. This is perfect for scalar operations because they really can only operate on scalar register width. For ARM64, the smallest width is 32-bit utilizing the sub-register. However, this is not true for SIMD ISA whose minimum operation width is the width of vector lane (element). For example, ARM64 NEON supports operating on int8, int16 for a couple of operations including <code>ABS<\/code>\/<code>MIN<\/code>\/<code>MAX<\/code>. 
So, to generate SIMD instructions operating on small element sizes and deliver higher computing throughput, the auto-vectorizer needs to do analysis and narrow the type back to the original small type when it is safe to do so.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">On top of this, <code>ABS<\/code>\/<code>MIN<\/code>\/<code>MAX<\/code> bring further challenges because their source operands are scattered in different basic blocks due to their built-in if-then-else semantics. We found the MSVC auto-vectorizer may only narrow the type of the operand in one basic block making it inconsistent with operands in other basic blocks. Such inconsistency will cause the auto-vectorizer to fail matching the pattern and cancel vectorization.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><span data-contrast=\"auto\">\u00a0For example:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">void test(signed char * __restrict a, signed char * __restrict b)\u00a0\r\n{\u00a0\r\n    for (int i = 0; i &lt; 16; i++)\u00a0\r\n        a[i] = b[i] &gt; 0 ? 
b[i] : -b[i];\u00a0\r\n}<\/code><\/pre>\n<p><span data-contrast=\"auto\">In Visual Studio 2022 17.5, there was no vectorization, and the code-generation was:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">ldrsb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,[x1,#1]\u00a0\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,#0\u00a0\r\nbgt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN16@test_abs_v|\u00a0\r\nneg\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8\u00a0\r\nsxtb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8\u00a0<\/code><span data-contrast=\"auto\">\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">In the 17.6 release, the loop is vectorized on 8-bit lanes and the code-generation has been reduced to a single <\/span><code>abs v16.16b,v16.16b<\/code><i><span data-contrast=\"auto\">.<\/span><\/i><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h2>Scalar code-generation improved based on value range analysis<\/h2>\n<p><span data-contrast=\"auto\">When a register is compared with an immediate value, the compiler can deduce the value range of the register. This information is useful for later optimizations, for example evaluating comparison results statically.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">MSVC already has infrastructure for value range deduction; the backend just needs to teach the middle-end about the semantics of its supported comparison instructions.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span 
data-contrast=\"auto\">The ARM64 backend previously lacked this support for <code>CBZ<\/code> and <code>CMP<\/code>. For <\/span><code>cbz reg, label<\/code><i><span data-contrast=\"auto\">, <\/span><\/i><span data-contrast=\"auto\">the <\/span><code>reg<\/code> <span data-contrast=\"auto\">must be zero on the true (taken) path, and the same holds for <code>CBNZ<\/code> on the false (not-taken) path. Similarly, for <\/span><code>cmp reg, #imm<\/code><i><span data-contrast=\"auto\">, <\/span><\/i><span data-contrast=\"auto\">the <\/span><code>reg<\/code> <span data-contrast=\"auto\">must equal<\/span> <code>imm<\/code> <span data-contrast=\"auto\">on the path where the equality condition holds. Knowing these equalities, the compiler can simplify code-generation. Let\u2019s see a simple example:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">int cal(int a)\u00a0\r\n{\u00a0\r\n    if (!a)\u00a0\r\n\u00a0\u00a0\u00a0\u00a0    return a;\u00a0\r\n\u00a0\u00a0\u00a0\u00a0else\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return a * 2;\u00a0\r\n}<\/code><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">In Visual Studio 2022 17.5, MSVC was generating the following instruction sequence:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|int cal(int)| PROC\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cbnz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,|$LN2@cal|\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,#0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &lt;- redundant mov\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
ret\u00a0\r\n|$LN2@cal|\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lsl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w0,#1\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ret\u00a0<\/code><\/pre>\n<p><span data-contrast=\"auto\">The <\/span><code>mov w0, #0<\/code> <span data-contrast=\"auto\">is redundant because when the execution reaches there, <\/span><code>w0<\/code> <span data-contrast=\"auto\">must be zero. After our recent optimization, the code-generation is optimal in release 17.6:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|int cal(int)| PROC\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lsl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w0,#1\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,#0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cseleq\u00a0\u00a0\u00a0\u00a0\u00a0 w0,wzr,w8\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ret <\/code><\/pre>\n<h2>Scalar code-generation improved to catch more if-conversion opportunities<\/h2>\n<p><span data-contrast=\"auto\">You may have noticed there is another difference in the code-generation for the above test case. <code>CSELEQ<\/code> is used instead of branch. The ARM64 ISA supports the <code>CSEL<\/code> instruction to do if-conversion which saves both execution cycles and code size. Previously, MSVC couldn\u2019t generate <code>CSEL<\/code> when the selected value came from a return statement. We fixed this in the 17.6 release, so for the above example <code>CSELEQ<\/code> is used instead of <code>CBNZ<\/code>. 
<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In summary, for the if-return-else-return pattern, the ARM64 backend has been taught to generate a <code>CSEL<\/code> instruction if the return statement is in any of the following operations:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"\uf0a7\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[9642],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf0a7&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Unary: NOT and type conversion<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"\uf0a7\" data-font=\"Wingdings\" data-listid=\"1\" data-list-defn-props=\"{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Wingdings&quot;,&quot;469769242&quot;:[9642],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;\uf0a7&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}\" aria-setsize=\"-1\" data-aria-posinset=\"2\" data-aria-level=\"1\"><span data-contrast=\"auto\">Binary: Add, Subtract, And\/Or\/Xor, Logical Shift Left\/Right<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<p><span data-contrast=\"auto\">Let\u2019s see another example:<\/span><span 
data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">int test (int a) {\u00a0\u00a0\r\n\u00a0\u00a0\u00a0 if (a &gt; 0xFFFF)\u00a0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return ~a;\u00a0\u00a0\r\n\u00a0\u00a0\u00a0 else\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return a;\u00a0\u00a0\r\n}\u00a0<\/code><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">In Visual Studio 2022 17.5, the code-generation was:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|int test(int)| PROC\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,#0xFFFF\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w8\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ble\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN3@test|\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mvn\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w0\u00a0\r\n|$LN3@test|\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ret\u00a0<\/code><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">It employs a branch and contains multiple basic blocks. 
In the 17.6 release, if-conversion removes the branch, leaving a single straight-line sequence:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|int test(int)| PROC\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,#0xFFFF\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w8\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cinvgt\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ret\u00a0\r\n<\/code><\/pre>\n<h2>ARM64 instruction combiner now supports instructions with multiple definitions<\/h2>\n<p><span data-contrast=\"auto\">As a modern compiler, MSVC performs instruction combining at several compilation stages. While many combine rules have already been implemented, we recently found that one interesting case was missing. 
Take the following case for example:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">int test(int a, int b)\u00a0\r\n{\u00a0\r\n\u00a0\u00a0\u00a0\u00a0if (a &lt; b)\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return b - a;\u00a0\r\n\u00a0\u00a0\u00a0\u00a0else\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return 7;\u00a0\r\n} <\/code><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">In Visual Studio 2022 17.5, the code-generation was:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|int test(int,int)| PROC\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w1\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w1,w0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 blt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN3@test|\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,#7\u00a0\r\n|$LN3@test|\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ret\u00a0<\/code><\/pre>\n<p><span data-contrast=\"auto\">The <\/span><code>cmp w0, w1<\/code><span data-contrast=\"auto\"> and the following <\/span><code>sub w0, w1, w0<\/code> <span data-contrast=\"auto\">can be combined into a single <code>SUBS<\/code>, which writes both the result register and the condition flags. We missed this because the later-stage combiner did not correctly handle combining into instructions with multiple definitions. 
We have now fixed this, so the code-generation in the 17.6 release has been improved to:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|int test(int,int)| PROC\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 subs\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w9,w1,w0\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,#7\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cselgt\u00a0\u00a0\u00a0\u00a0\u00a0 w0,w9,w8\u00a0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ret\u00a0<\/code><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/pre>\n<p><span data-contrast=\"auto\">The new code uses <code>SUBS<\/code> to perform the subtraction and set the condition flags in a single instruction. It also benefits from the if-conversion optimization on return statements described above, so <code>CSEL<\/code> is used instead of a branch.<\/span><\/p>\n<h2>In closing<\/h2>\n<p><span data-contrast=\"auto\">That is all for this blog post, and we will keep you updated on our progress. Your feedback is very valuable to us. Please share your thoughts and comments with us through <\/span><span data-contrast=\"none\">Developer Community<\/span><span data-contrast=\"auto\">. 
You can also reach us on Twitter (<\/span><a href=\"https:\/\/twitter.com\/visualc\"><span data-contrast=\"none\">@VisualC<\/span><\/a><span data-contrast=\"auto\">), or via email at <\/span><a href=\"mailto:visualcpp@microsoft.com\"><span data-contrast=\"none\">visualcpp@microsoft.com<\/span><\/a><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{&quot;134233117&quot;:true,&quot;134233118&quot;:true,&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:240}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the last couple of months, the Microsoft C++ team has been working on improving MSVC ARM64 backend performance and we are excited to have a couple of optimizations available in the Visual Studio 2022 version 17.6. These optimizations improved code-generation for both scalar ISA and SIMD ISA (NEON). Let\u2019s review some interesting optimizations in [&hellip;]<\/p>\n","protected":false},"author":119260,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3946,1],"tags":[],"class_list":["post-32267","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backend","category-cplusplus"],"acf":[],"blog_post_summary":"<p>In the last couple of months, the Microsoft C++ team has been working on improving MSVC ARM64 backend performance and we are excited to have a couple of optimizations available in the Visual Studio 2022 version 17.6. These optimizations improved code-generation for both scalar ISA and SIMD ISA (NEON). 
Let\u2019s review some interesting optimizations in [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/32267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/119260"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=32267"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/32267\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=32267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=32267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=32267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}