{"id":32881,"date":"2023-09-28T15:00:51","date_gmt":"2023-09-28T15:00:51","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=32881"},"modified":"2024-09-10T07:55:31","modified_gmt":"2024-09-10T07:55:31","slug":"msvc-arm64-optimizations-in-visual-studio-2022-17-7","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-arm64-optimizations-in-visual-studio-2022-17-7\/","title":{"rendered":"MSVC ARM64 Optimizations in Visual Studio 2022 17.7"},"content":{"rendered":"<p>In Visual Studio 2022 version 17.6 we added a <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-arm64-optimizations-in-visual-studio-2022-17-6\/\">host of new ARM64 optimizations<\/a>. In this 2<sup>nd<\/sup> edition of our blog, we will highlight some of the performance improvements to MSVC ARM64 compiler backend, we will discuss key optimizations in the <a href=\"https:\/\/devblogs.microsoft.com\/visualstudio\/visual-studio-2022-17-7-now-available\/\">Visual Studio 2022 version 17.7<\/a> for both scalar ISA and SIMD ISA (NEON). We started introducing these performance optimizations in 17.6 release and we have landed them in 17.7 release.<\/p>\n<h2>By element operation<\/h2>\n<p>ARM64 supports by-element operation in several instructions such as <code>fmul<\/code>, <code>fmla<\/code>, <code>fmls<\/code>, etc.\u00a0 This feature allows a SIMD operand to be computed directly by a SIMD element using an index to access it. In the example below where we multiply a scalar with an array, MSVC duplicated the vector element <code>v0.s[0]<\/code> into all lanes of a SIMD register, then multiplied that with another SIMD operand represented by array <code>b<\/code>.\u00a0 This is not efficient because the <code>dup<\/code> instruction will add 2 more execution latency before executing the <code>fmul<\/code> instruction.<\/p>\n<p>To better understand this optimization let\u2019s take the following sample code:<\/p>\n<pre>void test(float * __restrict a, float * __restrict b, float c) {\r\n  \u00a0\u00a0for (int i = 0; i &lt; 4; i++)\r\n \u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0a[i] = b[i] * c;\r\n}<\/pre>\n<p>Code generated by MSVC 17.6:<\/p>\n<pre> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dup\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v17.4s,v0.s[0]\r\n \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x1]\r\n \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 fmul\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.4s,v17.4s,v16.4s\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 str\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x0]<\/pre>\n<p>In Visual Studio 2022 17.7, we eliminated a duplicate instruction. It now can multiply by a SIMD element directly and the code generation is:<\/p>\n<pre> \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x1]\r\n \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 fmul\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.4s,v16.4s,v0.s[0]\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 str\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x0]<\/pre>\n<h2>Neon supports for shift right and accumulate immediate<\/h2>\n<p>This instruction right-shifts a SIMD source by an immediate and accumulates the final result with the destination SIMD register. As mentioned earlier, we started working on this optimization in the 17.6 release and completed the feature in the 17.7 release.<\/p>\n<p>In the 17.5 release MSVC turned right shifts into left shifts using a negative shift amount. 
<h2>NEON support for shift right and accumulate with an immediate</h2>

<p>The <code>usra</code>/<code>ssra</code> instruction right-shifts each element of a SIMD source register by an immediate and accumulates the result into the destination SIMD register. As mentioned earlier, we started working on this optimization in the 17.6 release and completed it in 17.7.</p>

<p>In the 17.5 release, MSVC turned right shifts into left shifts with a negative shift amount. To implement it that way, it moved the constant -2 into a general-purpose register and then duplicated that register into all lanes of a SIMD register before the left shift.</p>

<p>In the 17.6 release we eliminated the <code>dup</code> and used a right shift by an immediate (<code>ushr</code>) instead. That was an improvement, but not yet good enough.</p>

<p>We improved the implementation further in 17.7 by combining the right shift and the add into a single instruction: <code>usra</code> for unsigned and <code>ssra</code> for signed operations. The final code is much more compact than the previous versions.</p>

<p>To better understand how this optimization works, let's look at the sample code below:</p>

<pre>void test(unsigned long long * __restrict a, unsigned long long * __restrict b) {
    for (int i = 0; i &lt; 2; i++)
        a[i] += b[i] &gt;&gt; 2;
}</pre>

<p>Code generated by MSVC 17.5:</p>

<pre>        mov         x8,#-2
        ldr         q16,[x1]
        dup         v17.2d,x8
        ldr         q18,[x0]
        ushl        v17.2d,v16.2d,v17.2d
        add         v18.2d,v17.2d,v18.2d
        str         q18,[x0]</pre>

<p>Code generated by MSVC 17.6:</p>

<pre>        ldr         q16,[x1]
        ushr        v17.2d,v16.2d,#2
        ldr         q16,[x0]
        add         v16.2d,v17.2d,v16.2d
        str         q16,[x0]</pre>

<p>Code generated by MSVC 17.7:</p>

<pre>        ldr         q16,[x0]
        ldr         q17,[x1]
        usra        v16.2d,v17.2d,#2
        str         q16,[x0]</pre>
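<p>The shift-right-and-accumulate operation is also exposed directly as a NEON intrinsic. The following sketch, which assumes an AArch64 target with <code>arm_neon.h</code> available, is an intrinsic-level equivalent of the loop above; it is an illustration, not the compiler's output.</p>

<pre>#include &lt;arm_neon.h&gt;
#include &lt;stdint.h&gt;

// Illustrative sketch: a[i] += b[i] &gt;&gt; 2 for two 64-bit lanes.
void test_intrinsics(uint64_t * __restrict a, const uint64_t * __restrict b) {
    uint64x2_t va = vld1q_u64(a);      // ldr  q16,[x0]
    uint64x2_t vb = vld1q_u64(b);      // ldr  q17,[x1]
    va = vsraq_n_u64(va, vb, 2);       // usra v16.2d,v17.2d,#2
    vst1q_u64(a, va);                  // str  q16,[x0]
}</pre>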
<h2>NEON right shift into cmlt</h2>

<p>An arithmetic right shift of a signed integer by the element width minus one replicates the sign bit across the element, so the result is either -1 or 0; this is equivalent to <code>cmlt</code> (compare signed less than zero). As with the previous optimization, we made progressive improvements and completed this one in 17.7.</p>

<p>To better understand why we introduced this optimization, let's take the following snippet of code:</p>

<pre>void shf2cmlt(int * __restrict a, int * __restrict b) {
    for (int i = 0; i &lt; 4; i++)
        b[i] = a[i] &gt;&gt; 31;
}</pre>

<p>Code generated by MSVC 17.5:</p>

<pre>        ldr         q16,[x0]
        mvni        v17.4s,#0x1E
        sshl        v17.4s,v16.4s,v17.4s
        str         q17,[x1]</pre>

<p>Code generated by MSVC 17.6:</p>

<pre>        ldr         q16,[x0]
        sshr        v16.4s,v16.4s,#0x1F
        str         q16,[x1]</pre>

<p><code>sshr</code> was certainly an improvement, but not the best solution: its throughput is 1 because it can issue only on the V1 pipeline. <code>cmlt</code> has better throughput because it can issue on either the V1 or the V2 pipeline of Arm's Neoverse N1 core, as documented in the <a href="https://developer.arm.com/documentation/PJDOC-466751330-9707/r4p1/?lang=en">Arm® Neoverse™ N1 Software Optimization Guide</a>.</p>

<p>Code generated by MSVC 17.7:</p>

<pre>        ldr         q16,[x0]
        cmlt        v16.4s,v16.4s,#0
        str         q16,[x1]</pre>
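<p>The equivalence is easy to check at the source level: for a 32-bit <code>int</code>, <code>a &gt;&gt; 31</code> produces all ones when the sign bit is set and zero otherwise, which is exactly what a compare-less-than-zero returns per lane. A minimal intrinsic sketch of that equivalence (illustrative only; assumes <code>arm_neon.h</code> on an AArch64 target):</p>

<pre>#include &lt;arm_neon.h&gt;
#include &lt;stdint.h&gt;

// Illustrative sketch: a[i] &gt;&gt; 31 computed as a compare against zero.
// vcltzq_s32 sets each lane to all ones (-1) when the lane is negative
// and to zero otherwise -- the same result as an arithmetic shift by 31.
void shf2cmlt_intrinsics(const int32_t * __restrict a, int32_t * __restrict b) {
    int32x4_t  va   = vld1q_s32(a);               // ldr  q16,[x0]
    uint32x4_t mask = vcltzq_s32(va);             // cmlt v16.4s,v16.4s,#0
    vst1q_s32(b, vreinterpretq_s32_u32(mask));    // str  q16,[x1]
}</pre>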
<h2>NEON right shift and narrow into shifted narrow</h2>

<p>This is another case where we can combine two instructions into one, using <code>shrn</code>. The <code>shrn</code> instruction right-shifts each element of the source SIMD register by an immediate, narrows each result to half its original width, and writes the narrowed vector to the lower half of the destination SIMD register.</p>

<p>In this example, the generated code performs a widening multiply of 16-bit integers into 32-bit results, right-shifts each result by an immediate, narrows it back to 16 bits, and writes it to the final destination. In 17.6 the shift (<code>sshr</code>) and the narrowing (<code>xtn</code>) were separate instructions; in 17.7 they are combined into a single <code>shrn</code>.</p>

<p>To better understand this optimization, consider the following sample of code:</p>

<pre>void test(unsigned short * __restrict a, short * __restrict b) {
    for (int i = 0; i &lt; 4; i++)
        b[i] = (a[i] * a[i]) &gt;&gt; 10;
}</pre>

<p>Code generated by MSVC 17.6:</p>

<pre>        ldr         d16,[x0]
        umull       v16.4s,v16.4h,v16.4h
        sshr        v16.4s,v16.4s,#0xA
        xtn         v16.4h,v16.4s
        str         d16,[x1]</pre>

<p>Code generated by MSVC 17.7:</p>

<pre>        ldr         d16,[x0]
        umull       v16.4s,v16.4h,v16.4h
        shrn        v16.4h,v16.4s,#0xA
        str         d16,[x1]</pre>
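<p>The combined shift-and-narrow is likewise available as an intrinsic. The sketch below is illustrative only (names are ours) and assumes <code>arm_neon.h</code> on an AArch64 target; <code>vshrn_n_u32</code> performs the shift by 10 and the narrowing to 16 bits in one step.</p>

<pre>#include &lt;arm_neon.h&gt;
#include &lt;stdint.h&gt;

// Illustrative sketch: widening multiply, then shift-and-narrow in one step.
void test_intrinsics(const uint16_t * __restrict a, int16_t * __restrict b) {
    uint16x4_t va   = vld1_u16(a);             // ldr   d16,[x0]
    uint32x4_t prod = vmull_u16(va, va);       // umull v16.4s,v16.4h,v16.4h
    uint16x4_t res  = vshrn_n_u32(prod, 10);   // shrn  v16.4h,v16.4s,#0xA
    vst1_s16(b, vreinterpret_s16_u16(res));    // str   d16,[x1]
}</pre>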
<h2>Scalar code generation: improved negation of bool values</h2>

<p>Negating a Boolean value can generally be optimized into an <em>exclusive or</em> with 1. For example:</p>

<p>0 ^ 1 = 1</p>

<p>1 ^ 1 = 0</p>

<p>In 17.6, MSVC generated unnecessary code, using <code>cmp</code> + <code>cseteq</code>, to negate a Boolean function's return value.</p>

<p>To better understand this optimization, let's take the following sample code:</p>

<pre>bool foo();
bool negate_bool() {
    return !foo();
}</pre>

<p>Code generated by MSVC 17.6:</p>

<pre>        bl          |?foo@@YA_NXZ|
        uxtb        w8,w0
        cmp         w8,#0
        cseteq      w0
        ldr         lr,[sp],#0x10</pre>

<p>It zero-extends the return value of <code>foo()</code> and compares it with 0, then uses <code>cseteq</code> to produce 1 when the value equals 0 and 0 otherwise.</p>

<p>In 17.7 we improved this to a single bitwise <em>exclusive or</em>, which is equivalent to the negation.</p>

<p>Code generated by MSVC 17.7:</p>

<pre>        bl          |?foo@@YA_NXZ|
        eor         w8,w0,#1
        uxtb        w0,w8
        ldr         lr,[sp],#0x10</pre>
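<p>Note that the transformation is valid only because a <code>bool</code> is guaranteed to hold 0 or 1; for an arbitrary integer, <code>!x</code> and <code>x ^ 1</code> differ. A small self-check of the equivalence (a sketch in plain C, relying on nothing beyond that guarantee):</p>

<pre>#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;

// For a bool b (value 0 or 1), !b and (b ^ 1) always agree.
// For an arbitrary int x they do not: !2 == 0, but 2 ^ 1 == 3.
int main(void) {
    for (int v = 0; v &lt;= 1; ++v) {
        bool b = (bool)v;
        assert((b ^ 1) == !b);
    }
    return 0;
}</pre>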
<h2>ARM64: eliminating redundant comparisons</h2>

<p>There were cases where MSVC failed to recognize that several comparisons test the same condition and that the condition flags are unchanged between them. In the following example, 17.6 emits three separate comparisons for <code>if (cond)</code> when assigning <code>a</code>, <code>b</code>, and <code>c</code> with <code>cseleq</code>.</p>

<pre>struct A {
    int a;
    int b;
    int c;
};

void foo(struct A *sa, int cond) {
    int a = 1, b = 2, c = 3;
    if (cond) {
        a = 20;
        b = 33;
        c = 48;
    }

    sa-&gt;a = a;
    sa-&gt;b = b;
    sa-&gt;c = c;
}</pre>

<p>Code generated by MSVC 17.6:</p>

<pre>        cmp         w1,#0
        mov         w9,#1
        mov         w8,#0x14
        cseleq      w10,w9,w8
        cmp         w1,#0
        mov         w9,#2
        mov         w8,#0x21
        cseleq      w8,w9,w8
        stp         w10,w8,[x0]
        cmp         w1,#0
        mov         w8,#0x30
        mov         w9,#3
        cseleq      w8,w9,w8
; Line 19
        str         w8,[x0,#8]</pre>

<p>We taught the backend to recognize this pattern: the intervening instructions do not modify the condition flags, so the first <code>cmp</code> suffices, and the other two are removed in the 17.7 release.</p>

<p>Code generated by MSVC 17.7:</p>

<pre>        cmp         w1,#0
        mov         w9,#1
        mov         w8,#0x14
        cseleq      w10,w9,w8
        mov         w9,#2
        mov         w8,#0x21
        cseleq      w8,w9,w8
        stp         w10,w8,[x0]
        mov         w9,#3
        mov         w8,#0x30
        cseleq      w8,w9,w8
; Line 19
        str         w8,[x0,#8]</pre>
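<p>To see why one <code>cmp</code> is enough, it helps to model what each <code>cseleq</code> actually consumes: only the Z flag produced by the most recent flag-setting instruction. The sketch below is a hypothetical C model of the 17.7 code (the names are ours, not MSVC's): the flag is computed once and reused by all three selects, because <code>mov</code> of an immediate never touches NZCV.</p>

<pre>// Hypothetical model of the 17.7 code, for illustration only.
// cseleq Wd,Wn,Wm computes Wd = Z ? Wn : Wm, where Z is the zero flag
// set by the most recent flag-setting instruction (here, a single cmp).
struct A { int a; int b; int c; };

static int cseleq_model(int z, int n, int m) { return z ? n : m; }

void foo_model(struct A *sa, int cond) {
    int z = (cond == 0);               // cmp w1,#0 -- sets NZCV once
    sa-&gt;a = cseleq_model(z, 1, 20);    // mov does not touch NZCV...
    sa-&gt;b = cseleq_model(z, 2, 33);    // ...so the same Z flag is
    sa-&gt;c = cseleq_model(z, 3, 48);    // reused by all three selects
}</pre>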
<h2>Summary</h2>

<p>We would like to thank everyone for giving us feedback, and we look forward to hearing more from you. We will continue to improve MSVC for ARM developers in 17.8, so please stay tuned. Your feedback is very valuable to us: please share your thoughts and comments with us through <u>Visual C++ Developer Community</u>. You can also reach us on Twitter (<a href="https://twitter.com/visualc">@VisualC</a>) or via email at <a href="mailto:visualcpp@microsoft.com">visualcpp@microsoft.com</a>.</p>