{"id":33353,"date":"2024-01-09T17:00:51","date_gmt":"2024-01-09T17:00:51","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=33353"},"modified":"2024-09-10T07:55:26","modified_gmt":"2024-09-10T07:55:26","slug":"msvc-arm64-optimizations-in-visual-studio-2022-17-8","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-arm64-optimizations-in-visual-studio-2022-17-8\/","title":{"rendered":"MSVC ARM64 optimizations in Visual Studio 2022 17.8\u00a0"},"content":{"rendered":"<p>Visual Studio 2022 17.8 has been released recently (download it <a href=\"https:\/\/visualstudio.microsoft.com\/downloads\/\">here<\/a>). While there is already a blog \u201c<a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/visual-studio-17-8-now-available\/\">Visual Studio 17.8 now available!<\/a>\u201d covering new features and improvements, we would like to share more information with you about what is new for the MSVC ARM64 backend in this blog. In the last couple of months, we have been improving code-generation for the auto-vectorizer so that it can generate Neon instructions for more cases. Also, we have optimized instruction selection for a few scalar code-generation scenarios, for example short circuit evaluation, comparison against immediate, and smarter immediate split for logic instruction.<\/p>\n<h2><span style=\"font-size: 14pt;\">Auto-Vec<\/span><span style=\"font-size: 14pt;\">torizer<\/span><span style=\"font-size: 14pt;\"> supports conversions between floating-point and integer<\/span><\/h2>\n<p>The following conversions between floating-point and integer types are common in real-world code. 
Now, they are all enabled in the ARM64 backend and hooked up with the auto-vectorizer.<\/p>\n<table>\n<tbody>\n<tr>\n<td>From<\/td>\n<td>To<\/td>\n<td>Instruction<\/td>\n<\/tr>\n<tr>\n<td><em>double<\/em><\/td>\n<td><em>float<\/em><\/td>\n<td><em>fcvtn<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>double<\/em><\/td>\n<td><em>int64_t<\/em><\/td>\n<td><em>fcvtzs<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>double<\/em><\/td>\n<td><em>uint64_t<\/em><\/td>\n<td><em>fcvtzu<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>float<\/em><\/td>\n<td><em>double<\/em><\/td>\n<td><em>fcvtl<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>float<\/em><\/td>\n<td><em>int32_t<\/em><\/td>\n<td><em>fcvtzs<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>float<\/em><\/td>\n<td><em>uint32_t<\/em><\/td>\n<td><em>fcvtzu<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>int64_t<\/em><\/td>\n<td><em>double<\/em><\/td>\n<td><em>scvtf<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>uint64_t<\/em><\/td>\n<td><em>double<\/em><\/td>\n<td><em>ucvtf<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>int32_t<\/em><\/td>\n<td><em>float<\/em><\/td>\n<td><em>scvtf<\/em><\/td>\n<\/tr>\n<tr>\n<td><em>uint32_t<\/em><\/td>\n<td><em>float<\/em><\/td>\n<td><em>ucvtf<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For example:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">void test (double * __restrict a, unsigned long long * __restrict b) \r\n{ \r\n    for (int i = 0; i &lt; 2; i++)\r\n    { \r\n        a[i] = (double)b[i]; \r\n    } \r\n} <\/code><\/pre>\n<p>In Visual Studio 2022 17.7, the compiler generated the following code, in which both computing throughput and load\/store bandwidth utilization were suboptimal because scalar instructions were used.<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">ldp    x9, x8, [x1]\r\nucvtf  d17, x9\r\nucvtf  d16, x8\r\nstp    d17, d16, [x0]<\/code><\/pre>\n<p>In Visual Studio 2022 17.8.2, the code-generation has been optimized into:<\/p>\n<pre class=\"prettyprint language-default\"><code 
class=\"language-default\">ldr    q16,[x1]\r\nucvtf  v16.2d,v16.2d\r\nstr    q16,[x0]<\/code><\/pre>\n<p>A single Q-register load, one SIMD conversion, and a single Q-register store are now used.<\/p>\n<p>The above example is a conversion between double and 64-bit integer, so both types are the same size. Another issue in the ARM64 backend prevented auto-vectorization of conversions between differently sized types; it has been fixed as well. MSVC also auto-vectorizes the following example now:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">void test_df_to_sf (float * __restrict a, double * __restrict b, int * __restrict c)\r\n{\r\n    for (int i = 0; i &lt; 4; i++)\r\n    {\r\n        a[i] = (float) b[i];\r\n        c[i] = ((int)a[i]) &lt;&lt; 5;\r\n    }\r\n}<\/code><\/pre>\n<p>The code-generation in Visual Studio 2022 17.7 was:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">ldp     d17, d16, [x1]\r\nfcvt    s17, d17\r\nfcvt    s16, d16\r\nfcvtzs  w8, s17\r\nstp     s17, s16, [x0]\r\nlsl     w9, w8, #5\r\nfcvtzs  w8, s16\r\nlsl     w8, w8, #5\r\nstp     w9, w8, [x2]<\/code><\/pre>\n<p>Scalar instructions plus loop unrolling were employed. In Visual Studio 2022 17.8.2, the loop is vectorized:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">ldr     q16, [x1]\r\nfcvtn   v16.2s, v16.2d\r\nstr     d16, [x0]\r\nfcvtzs  v16.2s, v16.2s\r\nshl     v16.2s, v16.2s, #5\r\nstr     d16, [x2]<\/code><\/pre>\n<h2><span style=\"font-size: 14pt;\">Auto-vectorizer now supports extended left shifts<\/span><\/h2>\n<p>Extended left shifts are also common in real-world code, so the ARM64 ISA has native instructions to support them. Neon has <code>SSHLL <\/code>and <code>USHLL <\/code>to support signed and unsigned extended left shifts. They extend the shift source first, then perform the shift on the extended value. 
Let&#8217;s have a look at the following example:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">void test_signed (signed short * __restrict a, signed char * __restrict b)\r\n{\r\n    for (int i = 0; i &lt; 8; i++)\r\n        a[i] = b[i] &lt;&lt; 7;\r\n}\r\n\r\nvoid test_unsigned (unsigned short * __restrict a, unsigned char * __restrict b)\r\n{\r\n    for (int i = 0; i &lt; 8; i++)\r\n        a[i] = b[i] &lt;&lt; 7;\r\n}<\/code><\/pre>\n<p>The code-generation in Visual Studio 2022 17.7 was:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|test_signed| PROC\r\n    ldr    d16, [x1]\r\n    sxtl   v16.8h, v16.8b\r\n    shl    v16.8h, v16.8h, #7\r\n    str    q16, [x0]\r\n|test_unsigned| PROC\r\n    ldr    d16, [x1]\r\n    ushll  v16.8h, v16.8b, #0\r\n    shl    v16.8h, v16.8h, #7\r\n    str    q16, [x0]<\/code><\/pre>\n<p>The loops are vectorized, but an independent type promotion is done first, followed by a normal shift. The two can be combined into a single extended shift. We have taught the ARM64 backend about this, and Visual Studio 2022 17.8.2 now generates:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|test_signed| PROC\r\n    ldr    d16, [x1]\r\n    sshll  v16.8h, v16.8b, #7\r\n    str    q16, [x0]\r\n|test_unsigned| PROC\r\n    ldr    d16, [x1]\r\n    ushll  v16.8h, v16.8b, #7\r\n    str    q16, [x0]<\/code><\/pre>\n<p>A single <code>SSHLL<\/code> or <code>USHLL<\/code> is used. However, <code>SSHLL<\/code> and <code>USHLL<\/code> only encode shift amounts within <code>[0, bit_size_of_shift_source - 1]<\/code>. For example, the shift amount can only be in <code>[0, 7]<\/code> for the above testcases, whose shift source is 8 bits wide. Therefore, we cannot use either instruction if we want to move the element to the upper half of the wider destination, because the shift amount would then be 8, which is out of the encoding range. 
For this case, signedness does not matter and ARM64 Neon has <code>SHLL<\/code><em> (Shift Left Long &#8211; by element size) <\/em>to support it.<\/p>\n<p>Let us increase the shift amount to the element size of the shift source, which is 8:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">void test_signed(signed short * __restrict a, signed char * __restrict b)\r\n{\r\n    for (int i = 0; i &lt; 8; i++)\r\n        a[i] = b[i] &lt;&lt; 8;\r\n}\r\n\r\nvoid test_unsigned(unsigned short * __restrict a, unsigned char * __restrict b)\r\n{\r\n    for (int i = 0; i &lt; 8; i++)\r\n        a[i] = b[i] &lt;&lt; 8;\r\n}<\/code><\/pre>\n<p>The code-generation in Visual Studio 2022 17.7 was:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|test_signed| PROC\r\n    ldr    d16, [x1]\r\n    sxtl   v16.8h, v16.8b\r\n    shl    v16.8h, v16.8h, #8\r\n    str    q16, [x0]\r\n|test_unsigned| PROC\r\n    ldr    d16, [x1]\r\n    ushll  v16.8h, v16.8b, #0\r\n    shl    v16.8h, v16.8h, #8\r\n    str    q16, [x0]<\/code><\/pre>\n<p>And in Visual Studio 2022 17.8.2, it becomes:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|test_signed| PROC\r\n    ldr   d16, [x1]\r\n    shll  v16.8h, v16.8b, #8\r\n    str   q16, [x0]\r\n|test_unsigned| PROC\r\n    ldr   d16, [x1]\r\n    shll  v16.8h, v16.8b, #8\r\n    str   q16, [x0]<\/code><\/pre>\n<h2><span style=\"font-size: 14pt;\">Scalar code-generation improved on immediate materialization for CMP\/CMN<\/span><\/h2>\n<p>In the blog post introducing <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-arm64-optimizations-in-visual-studio-2022-17-6\/\">ARM64 optimizations for Visual Studio 2022 17.6<\/a>, there was one piece of feedback on unnecessary immediate materialization for integer comparison. 
In the 17.8 development cycle, we have improved a couple of relevant places inside the ARM64 backend.<\/p>\n<p>For integer comparison, the ARM64 backend now knows that an <em>immediate<\/em> that does not fit into the encoding can be adjusted to <em>immediate<\/em> <em>+ 1<\/em> or <em>immediate &#8211; 1<\/em>, with the comparison condition adjusted accordingly, so that it fits. Here are some examples:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">int test_ge2gt (int a)\r\n{\r\n    if (a &gt;= 0x10001)\r\n        return 1;\r\n    else\r\n        return 0;\r\n}\r\n\r\nint test_lt2le (int a)\r\n{\r\n    if (a &lt; -0x1fff)\r\n        return 1;\r\n    else\r\n        return 0;\r\n}<\/code><\/pre>\n<p><code>0x10001<\/code> inside <code>test_ge2gt<\/code> does not fit into the immediate encoding for the ARM64 <a id=\"post-33353-_Int_I4YP8m5X\"><\/a><code>CMP<\/code> instruction, either verbatim or shifted. However, if we subtract 1 from it and turn greater-or-equal (ge) into greater-than (gt) accordingly, then <code>0x10000<\/code> will fit into the shifted encoding.<\/p>\n<p>For <em>test_lt2le<\/em>, the negative immediate, <em>-0x1fff<\/em>, does not fit into the immediate encoding for the ARM64 <em>CMN<\/em> instruction, but if we subtract 1 from it and turn less-than (lt) into less-or-equal (le) accordingly, then <em>-0x2000<\/em> will fit into the shifted encoding.<\/p>\n<p>The code-generation in Visual Studio 2022 17.7 was:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|test_ge2gt| PROC\r\n    mov     w8, #0x10001\r\n    cmp     w0, w8\r\n    csetge  w0\r\n|test_lt2le| PROC\r\n    mov     w8, #-0x1FFF\r\n    cmp     w0, w8\r\n    csetlt  w0<\/code><\/pre>\n<p>There is an extra <code>MOV<\/code> instruction to materialize the immediate into a register because it does not fit into the encoding verbatim. 
After the above-mentioned improvements, Visual Studio 2022 17.8.2 generates:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|test_ge2gt| PROC\r\n    cmp     w0, #0x10, lsl #0xC\r\n    csetgt  w0\r\n|test_lt2le| PROC\r\n    cmn     w0, #2, lsl #0xC\r\n    csetle  w0<\/code><\/pre>\n<p>The extra <code>MOV<\/code> is gone, so each function needs one instruction fewer.<\/p>\n<h2><span style=\"font-size: 14pt;\">Scalar code-generation improved on logic immediate loading<\/span><\/h2>\n<p>We have also gone further to improve immediate handling for other instructions. One improvement concerns logic immediates: ARM64 has a rotated encoding for them (please refer to the description of <code>DecodeBitMasks<\/code> in the <em>Arm Architecture Reference Manual<\/em> for details), and this encoding is used by <code>AND<\/code>\/<code>ORR<\/code>. An immediate that does not fit into the rotated encoding verbatim may fit after being split into two.<\/p>\n<p>For example, programmers frequently write code patterns like the following:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">#define FLAG1_MASK 0x80000000\r\n#define FLAG2_MASK 0x00004000\r\n\r\nint cal (int a)\r\n{\r\n    return a &amp; (FLAG1_MASK | FLAG2_MASK);\r\n}<\/code><\/pre>\n<p>The compiler middle-end usually folds the masks into a single constant, so the function returns <code>a &amp; 0x80004000<\/code>. This constant does not fit into the rotated encoding, hence a <code>MOV<\/code>\/<code>MOVK<\/code> sequence is generated to load it, and the total cost is three instructions. 
The code generation in Visual Studio 2022 17.7 was:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">mov   w8, #0x4000\r\nmovk  w8, #0x8000, lsl #0x10\r\nand   w0, w0, w8<\/code><\/pre>\n<p>If we split <code>0x80004000<\/code> into <code>0xffffc000<\/code> and <code>0x80007fff<\/code>, <code>AND<\/code>ing them sequentially will have the same effect as <code>AND<\/code>ing <code>0x80004000<\/code>, but both <code>0xffffc000<\/code> and <code>0x80007fff<\/code> fit into the rotated encoding, so we save one instruction. The code generation in Visual Studio 2022 17.8.2 is:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">and  w8, w0, #0xFFFFC000\r\nand  w0, w8, #0x80007FFF<\/code><\/pre>\n<p>The immediate gets split in a way that the split parts fit into two <code>AND<\/code> instructions. We only want to split the immediate when it has a single use site; otherwise we would end up with duplicated sequences. Therefore, the split is guided by use count.<\/p>\n<h2><span style=\"font-size: 14pt;\">Scalar code-generation now catches more CCMP opportunities<\/span><\/h2>\n<p>The <code>CCMP<\/code> (conditional compare) instruction is useful for accelerating short circuit evaluation, for example:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">int test (int a)\r\n{\r\n    return a == 17 || a == 31;\r\n}<\/code><\/pre>\n<p>For this testcase, Visual Studio 2022 17.7 was smart enough to employ <code>CCMP<\/code> and generated:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">cmp     w0, #0x11\r\nccmpne  w0, #0x1F, #4\r\ncseteq  w0<\/code><\/pre>\n<p>However, <code>CCMP<\/code> only takes a 5-bit immediate, so if we change the testcase to:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">int test (int a)\r\n{\r\n    return a == 17 || a == 32;\r\n}<\/code><\/pre>\n<p>The immediate <code>#32<\/code> does not fit into 
<code>CCMP<\/code>\u2019s encoding, so the compiler stopped generating it, and the code generation in Visual Studio 2022 17.7 was:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">    cmp  w0, #0x11\r\n    beq  |$LN3@test|\r\n    cmp  w0, #0x20\r\n    mov  w0, #0\r\n    bne  |$LN4@test|\r\n|$LN3@test|\r\n    mov  w0, #1\r\n|$LN4@test|\r\n    ret<\/code><\/pre>\n<p>It employs an if-then-else structure and is verbose. Here, the compiler should have a better cost model and know that it is still beneficial to move <code>#32<\/code> into a register and employ <code>CCMP<\/code>\u2019s register form. We have fixed this in Visual Studio 2022 17.8.2, and the code generation becomes:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">cmp     w0, #0x11\r\nmov     w8, #0x20\r\nccmpne  w0, w8, #4\r\ncseteq  w0\r\nret<\/code><\/pre>\n<h2><span style=\"font-size: 14pt;\">Using MOVI\/MVNI for immediate move in smaller loops<\/span><\/h2>\n<p>We missed an opportunity to use shifted <code>MOVI<\/code>\/<code>MVNI<\/code> for combining immediate moves in small loops. For example:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">void vect_movi_msl (int *__restrict a, int *__restrict b, int *__restrict c) {\r\n    for (int i = 0; i &lt; 8; i++)\r\n        a[i] = 0x1200;\r\n\r\n    for (int i = 0; i &lt; 8; i++)\r\n        c[i] = 0x12ffffff;\r\n}<\/code><\/pre>\n<p>In the 17.7 release, MSVC generated:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|movi_msl| PROC\r\n    mov x9, #0x1200\r\n    movk x9, #0x1200, lsl #0x20\r\n    ldr x8, |$LN29@movi_msl|\r\n    stp x9, x9, [x0]\r\n    stp x8, x8, [x2]\r\n    stp x9, x9, [x0, #0x10]\r\n    stp x8, x8, [x2, #0x10]<\/code><\/pre>\n<p>Scalar <code>MOV<\/code>\/<code>MOVK<\/code> instructions are employed, and 8 iterations are needed to initialize each array. 
Both immediates can be loaded into vector registers using <code>MOVI<\/code>\/<code>MVNI<\/code>, which increases the store bandwidth and reduces the number of iterations.<\/p>\n<p>Shifted <code>MOVI<\/code> shifts the immediate to the left and fills with 0s, so <code>0x1200<\/code> can be loaded as the following:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">0x12 &lt;&lt; 8 = 0x1200<\/code><\/pre>\n<p>Shifted <code>MVNI<\/code> shifts the immediate first, then inverts the result:<\/p>\n<pre class=\"prettyprint language-c\"><code class=\"language-c\">~((0xED) &lt;&lt; 0x18) = 0x12ffffff<\/code><\/pre>\n<p>In 17.8, MSVC generates:<\/p>\n<pre class=\"prettyprint language-default\"><code class=\"language-default\">|movi_msl| PROC\r\n    movi v17.4s, #0x12, lsl #8\r\n    mvni v16.4s, #0xED, lsl #0x18\r\n    stp q17, q17, [x0]\r\n    stp q16, q16, [x2]<\/code><\/pre>\n<p>Thanks to the width of the vector registers and the use of paired stores, only one iteration is needed.<\/p>\n<p><strong><span style=\"font-size: 14pt;\"><em>In closing<\/em><\/span><\/strong><\/p>\n<p>That is all for this blog. Your feedback is valuable to us. Please share your thoughts and comments with us through <a href=\"https:\/\/developercommunity.visualstudio.com\/cpp\">Visual C++ Developer Community<\/a>. You can also reach us on Twitter (<a href=\"https:\/\/twitter.com\/visualc\">@VisualC<\/a>), or via email at <a href=\"mailto:visualcpp@microsoft.com\">visualcpp@microsoft.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Visual Studio 2022 17.8 has been released recently (download it here). While there is already a blog \u201cVisual Studio 17.8 now available!\u201d covering new features and improvements, we would like to share more information with you about what is new for the MSVC ARM64 backend in this blog. 
In the last couple of months, we [&hellip;]<\/p>\n","protected":false},"author":119260,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3946,1],"tags":[],"class_list":["post-33353","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backend","category-cplusplus"],"acf":[],"blog_post_summary":"<p>Visual Studio 2022 17.8 has been released recently (download it here). While there is already a blog \u201cVisual Studio 17.8 now available!\u201d covering new features and improvements, we would like to share more information with you about what is new for the MSVC ARM64 backend in this blog. In the last couple of months, we [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/33353","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/119260"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=33353"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/33353\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=33353"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=33353"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=33353"}],"c
uries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}