{"id":32856,"date":"2023-09-25T16:16:05","date_gmt":"2023-09-25T16:16:05","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=32856"},"modified":"2023-09-25T13:33:22","modified_gmt":"2023-09-25T13:33:22","slug":"msvc-machine-independent-optimizations-in-visual-studio-2022-17-7","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-machine-independent-optimizations-in-visual-studio-2022-17-7\/","title":{"rendered":"MSVC Machine-Independent Optimizations in Visual Studio 2022 17.7"},"content":{"rendered":"<p>This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 P3 (released July 11, 2023). Each optimization below is accompanied by assembly code for both X64 and ARM64 to illustrate its machine-independent nature.<\/p>\n<h2>Optimizing Memory Across Block Boundaries<\/h2>\n<p>When a small struct is loaded into a register, we can optimize field accesses to extract the correct bits from the register instead of accessing it through memory. Historically in MSVC, this optimization has been limited to memory accesses within the same basic block. 
We are now able to perform this same optimization across block boundaries in many cases.<\/p>\n<p>In the example ASM listings below, a store to the stack and a load from the stack are eliminated, resulting in less memory traffic as well as lower stack memory usage.<\/p>\n<p><strong>Example C++ Source Code:<\/strong><\/p>\n<pre class=\"prettyprint language-cpp\"><code class=\"language-cpp\">#include &lt;string_view&gt;\r\n\r\nbool compare(const std::string_view&amp; l, const std::string_view&amp; r) {\r\n\u00a0\u00a0 return l == r;\r\n}<\/code><\/pre>\n<p><strong>Required Compiler Flags:<\/strong> \/O2<\/p>\n<p><strong>X64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 56px;\">\n<tbody>\n<tr style=\"height: 28px;\">\n<td style=\"width: 16.101%; height: 28px;\">17.4<\/td>\n<td style=\"width: 83.899%; height: 28px;\">17.7<\/td>\n<\/tr>\n<tr style=\"height: 28px;\">\n<td style=\"width: 16.101%; height: 28px; vertical-align: top;\">\n<pre>sub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsp, 56\r\nmovups\u00a0    xmm0, XMMWORD PTR [rcx]\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 r8, QWORD PTR [rcx+8]\r\nmovaps\u00a0    XMMWORD PTR $T1[rsp], xmm0\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 r8, QWORD PTR [rdx+8]\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LN9@compare\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rdx, QWORD PTR [rdx]\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rcx, QWORD PTR $T1[rsp]\r\ncall\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0memcmp\r\ntest\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eax, eax\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LN9@compare\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 al, 1\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsp, 56\r\nret\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a00\r\n$LN9@compare:\r\nxor\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0al, al\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsp, 
56\r\nret\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a00<\/pre>\n<\/td>\n<td style=\"width: 83.899%; height: 28px; vertical-align: top;\">\n<pre>sub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsp, 40\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0  r8, QWORD PTR [rcx+8]\r\nmovups\u00a0     xmm1, XMMWORD PTR [rcx]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0  r8, QWORD PTR [rdx+8]\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LN9@compare\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0  rdx, QWORD PTR [rdx]\r\nmovq\u00a0\u00a0\u00a0\u00a0\u00a0   rcx, xmm1\r\ncall\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0memcmp\r\ntest\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eax, eax\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LN9@compare\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0  al, 1\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rsp, 40\r\nret\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a00\r\n$LN9@compare:\r\nxor\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0al, al\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rsp, 40\r\nret\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a00<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><strong>ARM64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 56px;\">\n<tbody>\n<tr style=\"height: 28px;\">\n<td style=\"width: 13.9125%; height: 28px;\">17.4<\/td>\n<td style=\"width: 86.0875%; height: 28px;\">17.7<\/td>\n<\/tr>\n<tr style=\"height: 28px;\">\n<td style=\"width: 13.9125%; height: 28px; vertical-align: top;\">\n<pre>str\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lr,[sp,#-0x10]!\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x20\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x1]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x0]\r\numov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
x8,v17.d[1]\r\numov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x2,v16.d[1]\r\nstp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,q16,[sp]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x2,x8\r\nbne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN9@compare|\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x1,[sp]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x0,[sp,#0x10]\r\nbl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 memcmp\r\ncbnz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,|$LN9@compare|\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,#1\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x20\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lr,[sp],#0x10\r\nret\r\n|$LN9@compare|\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,#0\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x20\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lr,[sp],#0x10\r\nret<\/pre>\n<\/td>\n<td style=\"width: 86.0875%; height: 28px; vertical-align: top;\">\n<pre>str\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0lr,[sp,#-0x10]!\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x1]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0q16,[x0]\r\numov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x8,v17.d[1]\r\numov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x2,v16.d[1]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x2,x8\r\nbne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN9@compare|\r\nfmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x1,d17\r\nfmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x0,d16\r\nbl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 memcmp\r\ncbnz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,|$LN9@compare|\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,#1\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0lr,[sp],#0x10\r\nret\r\n|$LN9@compare|\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
w0,#0\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lr,[sp],#0x10\r\nret<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2>Vector Logical and Arithmetic Optimizations<\/h2>\n<p>We continue to add patterns for recognizing vector operations that are equivalent to intrinsics or short sequences of intrinsics. An example is recognizing common forms of vector absolute difference calculations. A long series of bitwise instructions can be replaced with specialized absolute value instructions, such as vpabsd on X64 and sabd on ARM64.<\/p>\n<p><strong>Example C++ Source Code:<\/strong><\/p>\n<pre>#include &lt;cstdint&gt;\r\n\r\nvoid s32_1(int * __restrict a, int * __restrict b, int * __restrict c, int n) {\r\n    for (int i = 0; i &lt; n; i++) {\r\n        a[i] = (b[i] - c[i]) &gt; 0 ? (b[i] - c[i]) : (c[i] - b[i]);\r\n    }\r\n}<\/pre>\n<p><strong>Required Flags: \/O2 \/arch:AVX for X64, \/O2 for ARM64<\/strong><\/p>\n<p><strong>X64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 21.7285%;\">17.4<\/td>\n<td style=\"width: 78.2715%;\">17.7<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 21.7285%; vertical-align: top;\">\n<pre>$LL4@s32_1:\r\nmovdqu\u00a0   xmm0, XMMWORD PTR [r11+rax]\r\nadd\u00a0\u00a0\u00a0\u00a0   ecx, 4\r\nmovdqu\u00a0   xmm1, XMMWORD PTR [rax]\r\nlea\u00a0\u00a0\u00a0\u00a0   rax, QWORD PTR [rax+16]\r\nmovdqa\u00a0   xmm3, xmm0\r\npsubd\u00a0\u00a0   xmm3, xmm1\r\npsubd\u00a0\u00a0   xmm1, xmm0\r\nmovdqa\u00a0   xmm2, xmm3\r\npcmpgtd   xmm2, xmm4\r\nmovdqa\u00a0   xmm0, xmm2\r\nandps\u00a0\u00a0   xmm2, xmm3\r\nandnps\u00a0   xmm0, xmm1\r\norps\u00a0\u00a0\u00a0   xmm0, xmm2\r\nmovdqu\u00a0   XMMWORD PTR [r10+rax-16], xmm0\r\ncmp\u00a0\u00a0\u00a0\u00a0   ecx, edx\r\njl\u00a0\u00a0\u00a0\u00a0\u00a0   SHORT $LL4@s32_1<\/pre>\n<\/td>\n<td style=\"width: 78.2715%; vertical-align: top;\">\n<pre>$LL4@s32_1:\r\nvmovdqu    xmm1, XMMWORD PTR 
[r10+rax]\r\nvpsubd\u00a0\u00a0\u00a0\u00a0\u00a0xmm1, xmm1, XMMWORD PTR [rax]\r\n<strong>vpabsd<\/strong> \u00a0\u00a0\u00a0 <strong>xmm2, xmm1\u00a0\u00a0\u00a0 ; vector abs<\/strong>\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0edx, 4\r\nvmovdqu\u00a0\u00a0\u00a0\u00a0XMMWORD PTR [rbx+rax], xmm2\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, QWORD PTR [rax+16]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0edx, ecx\r\njl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LL4@s32_1<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><strong>ARM64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 56px;\">\n<tbody>\n<tr style=\"height: 28px;\">\n<td style=\"width: 18.7807%; height: 28px;\">17.4<\/td>\n<td style=\"width: 81.2193%; height: 28px;\">17.7<\/td>\n<\/tr>\n<tr style=\"height: 28px;\">\n<td style=\"width: 18.7807%; height: 28px; vertical-align: top;\">\n<pre>|$LL24@s32_1|\r\nsbfiz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,x8,#2,#0x20\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q19,[x9,x2]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8,#4\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q18,[x9,x1]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w10\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.4s,v18.4s,v19.4s\r\ncmgt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v17.4s,v16.4s,v21.4s\r\nand\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v20.16b,v17.16b,v16.16b\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.4s,v19.4s,v18.4s\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,v16.16b,v17.16b\r\norr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,v16.16b,v20.16b\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x9,x0]\r\nblt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LL24@s32_1|\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
w8,w3\r\nbge\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN27@s32_1|<\/pre>\n<\/td>\n<td style=\"width: 81.2193%; height: 28px; vertical-align: top;\">\n<pre>|$LL24@s32_1|\r\nsbfiz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,x8,#2,#0x20\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x9,x1]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8,#4\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x9,x2]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w10\r\n<strong>sabd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.4s,v17.4s,v16.4s<\/strong>\u00a0\u00a0\u00a0\u00a0 <strong>; vector abs<\/strong>\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x9,x0]\r\nblt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LL24@s32_1|<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2>Vector Remainder Loops<\/h2>\n<p>Each iteration of a vectorized loop computes the same work as multiple iterations of the original scalar loop. Depending on the number of original scalar iterations and the vector width, there may be work left over when the vector loop ends. Although this work can be completed by a copy of the original scalar loop, it may contain sufficient work to warrant further vectorization using a smaller vector width. This optimization is now employed for targets that have a single vector width by partially packing a vector.<\/p>\n<p>In the example below, three loops are emitted for the original loop. The first processes 64 elements per vector iteration by packing 16 8-bit values into a 128-bit vector register and then unrolling the vector loop by four. The second loop processes 8 elements per vector iteration by packing 8 8-bit values into half of a 128-bit vector register. 
The third loop processes one element per scalar iteration.<\/p>\n<p><strong>Example C++ source code: <\/strong><\/p>\n<pre>#include &lt;cstdint&gt;\r\n\r\nvoid test(int8_t * __restrict d, int8_t * __restrict a, int8_t * __restrict b, int n) {\r\n    for (int i = 0; i &lt; n; i++) {\r\n        d[i] = a[i] &amp; (~b[i]);\r\n    }\r\n}<\/pre>\n<p><strong>Required compiler flags: <\/strong>\/O2<\/p>\n<p><strong>X64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 1729px;\">\n<tbody>\n<tr style=\"height: 28px;\">\n<td style=\"width: 28.8298%; height: 28px;\">17.4<\/td>\n<td style=\"width: 71.1702%; height: 28px;\">17.7<\/td>\n<\/tr>\n<tr style=\"height: 1701px;\">\n<td style=\"width: 28.8298%; vertical-align: top; height: 1701px;\">\n<pre>$LL4@test: ; 1<sup>st<\/sup> vector loop (64 element iters)\r\nmovdqu\u00a0 xmm0, XMMWORD PTR [rcx-16]\r\nadd\u00a0\u00a0\u00a0\u00a0 edi, 64\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ; 00000040H\r\nmovdqu\u00a0 xmm1, XMMWORD PTR [rdx+rcx-16]\r\nmovdqu\u00a0 xmm2, XMMWORD PTR [rdx+rcx]\r\nlea\u00a0\u00a0\u00a0\u00a0 rcx, QWORD PTR [rcx+64]\r\npandn\u00a0\u00a0 xmm1, xmm3\r\nlea\u00a0\u00a0\u00a0\u00a0 rax, QWORD PTR [r14+rcx]\r\npand\u00a0\u00a0\u00a0 xmm1, xmm0\r\npandn\u00a0\u00a0 xmm2, xmm3\r\nmovdqu\u00a0 xmm0, XMMWORD PTR [rcx-64]\r\nmovdqu\u00a0 XMMWORD PTR [r9+rcx-80], xmm1\r\npand\u00a0\u00a0\u00a0 xmm2, xmm0\r\nmovdqu\u00a0 xmm0, XMMWORD PTR [rcx-48]\r\nmovdqu\u00a0 xmm1, XMMWORD PTR [rdx+rcx-48]\r\nmovdqu\u00a0 XMMWORD PTR [r9+rcx-64], xmm2\r\npandn\u00a0\u00a0 xmm1, xmm3\r\npand\u00a0\u00a0\u00a0 xmm1, xmm0\r\nmovdqu\u00a0 xmm2, XMMWORD PTR [rdx+rcx-32]\r\nmovdqu\u00a0 xmm0, XMMWORD PTR [rcx-32]\r\npandn\u00a0\u00a0 xmm2, xmm3\r\npand\u00a0\u00a0\u00a0 xmm2, xmm0\r\nmovdqu\u00a0 XMMWORD PTR [r9+rcx-48], xmm1\r\nmovdqu\u00a0 XMMWORD PTR [r9+rcx-32], xmm2\r\ncmp\u00a0\u00a0\u00a0\u00a0 rax, rsi\r\njl\u00a0\u00a0\u00a0\u00a0\u00a0 SHORT $LL4@test\r\nmov\u00a0\u00a0\u00a0\u00a0 r14, QWORD 
PTR [rsp]\r\nmov\u00a0\u00a0\u00a0\u00a0 rsi, QWORD PTR [rsp+24]\r\n$LN9@test:\r\nmovsxd\u00a0 rcx, edi\r\nmov\u00a0\u00a0\u00a0\u00a0 rdx, rbx\r\nmov\u00a0\u00a0\u00a0\u00a0 rdi, QWORD PTR [rsp+32]\r\nmov\u00a0\u00a0\u00a0\u00a0 rbx, QWORD PTR [rsp+16]\r\ncmp\u00a0\u00a0\u00a0\u00a0 rcx, rdx\r\njge\u00a0\u00a0\u00a0\u00a0 SHORT $LN3@test\r\nsub\u00a0\u00a0\u00a0\u00a0 r8, r11\r\nlea\u00a0\u00a0\u00a0\u00a0 rax, QWORD PTR [rcx+r11]\r\nsub\u00a0\u00a0\u00a0\u00a0 r10, r11\r\nsub\u00a0\u00a0\u00a0\u00a0 rdx, rcx\r\n$LL8@test: ; Scalar loop (1 element iters)\r\nmovzx\u00a0\u00a0 ecx, BYTE PTR [rax+r8]\r\nlea\u00a0\u00a0\u00a0\u00a0 rax, QWORD PTR [rax+1]\r\nnot\u00a0\u00a0\u00a0\u00a0 cl\r\nand\u00a0\u00a0\u00a0\u00a0 cl, BYTE PTR [rax-1]\r\nmov\u00a0\u00a0\u00a0\u00a0 BYTE PTR [rax+r10-1], cl\r\nsub\u00a0\u00a0\u00a0\u00a0 rdx, 1\r\njne\u00a0\u00a0\u00a0\u00a0 SHORT $LL8@test\r\n$LN3@test:\r\nadd\u00a0\u00a0\u00a0\u00a0 rsp, 8\r\nret\u00a0\u00a0\u00a0\u00a0 0<\/pre>\n<\/td>\n<td style=\"width: 71.1702%; vertical-align: top; height: 1701px;\">\n<pre>$LL4@test: <strong>; 1<sup>st<\/sup> vector loop (64 element iters)<\/strong>\r\nmovdqu\u00a0\u00a0\u00a0 xmm0, XMMWORD PTR [rcx-16]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0edx, 64\u00a0 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ; 00000040H\r\nmovdqu\u00a0\u00a0\u00a0 xmm1, XMMWORD PTR [r8+rcx-16]\r\nmovdqu\u00a0\u00a0\u00a0 xmm2, XMMWORD PTR [r8+rcx]\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rcx, QWORD PTR [rcx+64]\r\nandnps\u00a0\u00a0\u00a0 xmm1, xmm3\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, QWORD PTR [rsi+rcx]\r\nandps\u00a0\u00a0\u00a0\u00a0\u00a0xmm1, xmm0\r\nandnps\u00a0\u00a0\u00a0 xmm2, xmm3\r\nmovdqu\u00a0\u00a0\u00a0 xmm0, XMMWORD PTR [rcx-64]\r\nmovdqu\u00a0\u00a0\u00a0 XMMWORD PTR [r9+rcx-80], xmm1\r\nandps\u00a0\u00a0\u00a0\u00a0\u00a0xmm2, xmm0\r\nmovdqu\u00a0\u00a0\u00a0 xmm0, XMMWORD PTR [rcx-48]\r\nmovdqu\u00a0\u00a0\u00a0 xmm1, XMMWORD PTR 
[r8+rcx-48]\r\nmovdqu\u00a0\u00a0\u00a0 XMMWORD PTR [r9+rcx-64], xmm2\r\nandnps\u00a0\u00a0\u00a0 xmm1, xmm3\r\nandps\u00a0\u00a0\u00a0\u00a0\u00a0xmm1, xmm0\r\nmovdqu\u00a0\u00a0\u00a0 xmm2, XMMWORD PTR [r8+rcx-32]\r\nmovdqu\u00a0\u00a0\u00a0 xmm0, XMMWORD PTR [rcx-32]\r\nandnps\u00a0\u00a0\u00a0 xmm2, xmm3\r\nandps\u00a0\u00a0\u00a0\u00a0\u00a0xmm2, xmm0\r\nmovdqu\u00a0\u00a0\u00a0 XMMWORD PTR [r9+rcx-48], xmm1\r\nmovdqu\u00a0\u00a0\u00a0 XMMWORD PTR [r9+rcx-32], xmm2\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, rbp\r\njl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LL4@test\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rbp, QWORD PTR [rsp+16]\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eax, edi\r\nand\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0al, 63\u00a0\u00a0\u00a0 ; 0000003fH\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0al, 8\r\njb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LN27@test\r\n$LN11@test:\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ecx, edi\r\nmovsxd\u00a0\u00a0\u00a0 rax, edx\r\nand\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ecx, -8\r\nmovsxd\u00a0\u00a0\u00a0 rsi, ecx\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rcx, QWORD PTR [rax+r10]\r\n$LL10@test:\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>; 2<sup>nd<\/sup> vector loop (8 element iters)<\/strong>\r\nmovq\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0xmm1, QWORD PTR [r8+rcx]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0edx, 8\r\nmovq\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0xmm0, QWORD PTR [rcx]\r\nandnps\u00a0\u00a0\u00a0 xmm1, xmm3\r\nandps\u00a0\u00a0\u00a0\u00a0\u00a0xmm1, xmm0\r\nmovq\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0QWORD PTR [r9+rcx], xmm1\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rcx, 8\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, rcx\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, r10\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, rsi\r\njl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT 
$LL10@test\r\n$LN27@test:\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsi, QWORD PTR [rsp+24]\r\n$LN9@test:\r\nmovsxd\u00a0\u00a0\u00a0 rcx, edx\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rdx, rdi\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rdi, QWORD PTR [rsp+32]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rcx, rdx\r\njge\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LN3@test\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0r11, r10\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, QWORD PTR [rcx+r10]\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rbx, r10\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rdx, rcx\r\nnpad\u00a0\u00a0\u00a0\u00a0\u00a0\u00a06\r\n$LL8@test:\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>; Scalar loop (1 element iters)<\/strong>\r\nmovzx\u00a0\u00a0\u00a0\u00a0\u00a0ecx, BYTE PTR [rax+r11]\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, QWORD PTR [rax+1]\r\nnot\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0cl\r\nand\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0cl, BYTE PTR [rax-1]\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BYTE PTR [rax+rbx-1], cl\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rdx, 1\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LL8@test\r\n$LN3@test:\r\npop\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rbx\r\nret\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a00<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>ARM64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 28.8298%;\">17.4 (was not vectorized)<\/td>\n<td style=\"width: 71.1702%;\">17.7<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 28.8298%; vertical-align: top;\">\n<pre>|$LL4@test|\r\nldrsb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w9,[x10,x1]\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w3,w3,#1\r\nldrsb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,[x1]\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
w8,w8,w9\r\nstrb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,[x11,x1]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x1,x1,#1\r\ncbnz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w3,|$LL4@test|\r\n|$LN3@test|\r\nret<\/pre>\n<\/td>\n<td style=\"width: 71.1702%; vertical-align: top;\">\n<pre>|$LL27@test|\u00a0 <strong>; 1<sup>st<\/sup> vector loop (64 element iters)<\/strong>\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x2,w8 sxtw #0]\r\nsxtw\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x11,w8\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x1,w8 sxtw #0]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x10,x11,#0x10\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,v16.16b,v17.16b\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x10,x2]\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x0,w8 sxtw #0]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x10,x1]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8,#0x40\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w9\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,v16.16b,v17.16b\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x10,x0]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x10,x11,#0x20\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x10,x2]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x10,x1]\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,v16.16b,v17.16b\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x10,x0]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x10,x11,#0x30\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q17,[x10,x2]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x10,x1]\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,v16.16b,v17.16b\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[x10,x0]\r\nblt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
|$LL27@test|\r\nand\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w9,w3,#0x3F\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w9,#8\r\nblo\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN30@test|\r\n|$LN11@test|\r\nand\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w9,w3,#0xFFFFFFF8\r\n|$LL29@test|\u00a0 <strong>; 2<sup>nd<\/sup> vector loop (8 element iters)<\/strong>\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 d17,[x2,w8 sxtw #0]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 d16,[x1,w8 sxtw #0]\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.8b,v16.8b,v17.8b\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 d16,[x0,w8 sxtw #0]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0w8,w8,#8\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w9\r\nblt\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LL29@test|\r\n|$LN30@test|\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w3\r\nbge\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN32@test|\r\n|$LN21@test|\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,x1,w8,sxtw #0\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x13,x2,x1\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w3,w8\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x10,x0,x1\r\n|$LL31@test|\u00a0 <strong>; Scalar loop (1 element iters)<\/strong>\r\nldrsb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w12,[x13,x9]\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8,#1\r\nldrsb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w11,[x9]\r\nbic\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w11,w11,w12\r\nstrb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w11,[x10,x9]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,x9,#1\r\ncbnz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,|$LL31@test|\r\n|$LN32@test|\r\nret<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2>Conditional Moves\/Selects<\/h2>\n<p>MSVC now considers more opportunities to use conditional 
move instructions to eliminate branches. Once such an opportunity is recognized, legality checks and tuning heuristics decide whether to use a conditional move or leave the branch intact. Eliminating a branch is often beneficial because it removes the possibility of a misprediction and may enable other optimizations that previously would have stopped at the block boundary formed by the branch. The trade-off is that it converts a control dependence, which could have been predicted correctly, into a data dependence that cannot be removed.<\/p>\n<p>This optimization applies only when the compiler can prove that it is safe to perform the store unconditionally: the transformed code always stores one of two possible values, whereas the original code stored only when the condition was true. There are numerous reasons why an unconditional store may be unsafe. For example, if the memory location is shared, then another thread may observe a store when previously there was no store to observe. As another example, the condition may have protected a store through a potentially null pointer.<\/p>\n<p>The optimization is enabled for X64 as well as ARM64. 
(<a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220817-00\/?p=106998\">ARM64 conditional execution<\/a> was discussed in a previous blog post.)<\/p>\n<p><strong>Example C++ source code: <\/strong><\/p>\n<pre>int test(int a, int i) {\r\n    int mem[4]{0};\r\n    if (mem[i] &lt; a) {\r\n        mem[i] = a;\r\n    }\r\n    return mem[0];\r\n}<\/pre>\n<p><strong>Required compiler flags: <\/strong>\/O2 (for the sake of simplifying the example output, \/GS- was also passed)<\/p>\n<p><strong>X64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 22.8897%;\">17.4<\/td>\n<td style=\"width: 77.1103%;\">17.7<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 22.8897%; vertical-align: top;\">\n<pre>$LN6:\r\nsub\u00a0\u00a0\u00a0\u00a0 rsp, 24\r\nmovsxd\u00a0 rax, edx\r\nxorps\u00a0\u00a0 xmm0, xmm0\r\nmovups\u00a0 XMMWORD PTR mem$[rsp], xmm0\r\ncmp\u00a0\u00a0\u00a0\u00a0 DWORD PTR mem$[rsp+rax*4], ecx\r\njge\u00a0\u00a0\u00a0\u00a0 SHORT $LN4@test\r\nmov\u00a0\u00a0\u00a0\u00a0 DWORD PTR mem$[rsp+rax*4], ecx\r\n$LN4@test:\r\nmov\u00a0\u00a0\u00a0\u00a0 eax, DWORD PTR mem$[rsp]\r\nadd\u00a0\u00a0\u00a0\u00a0 rsp, 24\r\nret\u00a0\u00a0\u00a0\u00a0 0<\/pre>\n<\/td>\n<td style=\"width: 77.1103%;\">\n<pre>$LN5:\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsp, 24\r\nmovsxd\u00a0\u00a0 rax, edx\r\nxorps\u00a0\u00a0\u00a0\u00a0xmm0, xmm0\r\nmovups\u00a0\u00a0 XMMWORD PTR mem$[rsp], xmm0\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rdx, QWORD PTR mem$[rsp]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0DWORD PTR [rdx+rax*4], ecx\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rdx, QWORD PTR [rdx+rax*4]\r\n<strong>cmovge<\/strong>\u00a0\u00a0 <strong>ecx, DWORD PTR [rdx] ; cond<\/strong>\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0DWORD PTR [rdx], ecx\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0eax, DWORD PTR mem$[rsp]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rsp, 
24\r\nret\u00a0\u00a0\u00a0\u00a0\u00a0\u00a00<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>ARM64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 23.0683%;\">17.4<\/td>\n<td style=\"width: 76.9317%;\">17.7<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 23.0683%; vertical-align: top;\">\n<pre>|$LN5|\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x10\r\nmovi\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,#0\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,sp\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[sp]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,[x9,w1 sxtw #2]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w0\r\nbge\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 |$LN2@test|\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,[x9,w1 sxtw #2]\r\n|$LN2@test|\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,[sp]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x10\r\nret<\/pre>\n<\/td>\n<td style=\"width: 76.9317%; vertical-align: top;\">\n<pre>|$LN9|\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x10\r\nmovi\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v16.16b,#0\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,sp\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 q16,[sp]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,[x9,w1 sxtw #2]\r\ncmp\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w0\r\n<strong>cselge\u00a0\u00a0\u00a0\u00a0\u00a0 w8,w8,w0<\/strong>\r\nstr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w8,[x9,w1 sxtw #2]\r\nldr\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w0,[sp]\r\nadd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sp,sp,#0x10\r\nret<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h2>Loop Optimizations for Array Assignments and Copies<\/h2>\n<p>The optimizer now recognizes more cases where an assignment\/copy to 
contiguous memory can be raised to memset\/memcpy intrinsics and lowered to more efficient instruction sequences. For example, consider the following array assignment in a count-down loop.<\/p>\n<p><strong>Example C++ Source Code:<\/strong><\/p>\n<pre>    char c[1024];\r\n\r\n    for (int n = 1023; n; n--)\r\n        c[n] = 1;<\/pre>\n<p>Previously, the generated code was a loop of individual byte assignments. Now, wider and more efficient stores are performed, either through a call to a library routine such as memset or through vector instructions.<\/p>\n<p><strong>X64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 30.3037%;\">17.4<\/td>\n<td style=\"width: 69.6963%;\">17.7<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 30.3037%; vertical-align: top;\">\n<pre>mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 QWORD PTR __$ArrayPad$[rsp], rax\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 eax, 1023\r\nnpad\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2\r\n$LL4@copy_while:\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 BYTE PTR c$[rsp+rax], 1\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, 1\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LL4@copy_while<\/pre>\n<\/td>\n<td style=\"width: 69.6963%;\">\n<pre>vmovaps\u00a0 zmm0, ZMMWORD PTR __zmm@01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101\r\nvmovq\u00a0\u00a0\u00a0\u00a0rdx, xmm0\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, QWORD PTR c$[rsp+1]\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ecx, 7\r\nnpad\u00a0\u00a0\u00a0\u00a0\u00a014\r\n$LL11@copy_while:\r\nvmovups\u00a0 ZMMWORD PTR [rax], zmm0\r\nvmovups\u00a0 YMMWORD PTR [rax+64], ymm0\r\nvmovups\u00a0 XMMWORD PTR [rax+96], xmm0\r\nlea\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rax, QWORD PTR [rax+128]\r\nvmovups\u00a0 XMMWORD PTR [rax-16], xmm0\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0rcx, 
1\r\njne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0SHORT $LL11@copy_while\r\nvmovups\u00a0 ZMMWORD PTR [rax], zmm0\r\nvmovups\u00a0 YMMWORD PTR [rax+64], ymm0\r\nvmovups\u00a0 XMMWORD PTR [rax+96], xmm0\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0QWORD PTR [rax+112], rdx\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0DWORD PTR [rax+120], edx\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0WORD PTR [rax+124], dx\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0BYTE PTR [rax+126], dl<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>ARM64 ASM:<\/strong><\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 30.5717%;\">17.4<\/td>\n<td style=\"width: 69.4283%;\">17.7<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 30.5717%; vertical-align: top;\">\n<pre>mov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x8,#0x3FF\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x9,sp\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w10,#1\r\n|$LL4@copy_while|\r\nstrb\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w10,[x8,x9]\r\nsub\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x8,x8,#1\r\ncbnz\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x8,|$LL4@copy_while|<\/pre>\n<\/td>\n<td style=\"width: 69.4283%; vertical-align: top;\">\n<pre>add\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x0,sp,#1\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x2,#0x3FF\r\nmov\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 w1,#1\r\nbl\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 memset<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Summary<\/h2>\n<p>We would like to thank everyone for giving us feedback and we are looking forward to hearing more from you.\u00a0 Please share your thoughts and comments with us through <a href=\"https:\/\/developercommunity.visualstudio.com\/home\">Developer Community<\/a>. 
You can also reach us on Twitter (<a href=\"https:\/\/twitter.com\/visualc\" target=\"_blank\" rel=\"noopener\">@VisualC<\/a>), or via email at\u00a0<a href=\"mailto:visualcpp@microsoft.com\">visualcpp@microsoft.com<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 P3 (released July 11, 2023). Each optimization below shows assembly code for both X64 and ARM64 to show the machine-independent nature of the optimization. Optimizing Memory Across Block Boundaries When a small [&hellip;]<\/p>\n","protected":false},"author":129662,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,218],"tags":[],"class_list":["post-32856","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","category-performance"],"acf":[],"blog_post_summary":"<p>This blog post presents a selection of machine-independent optimizations that were added between Visual Studio versions 17.4 (released November 8, 2022) and 17.7 P3 (released July 11, 2023). Each optimization below shows assembly code for both X64 and ARM64 to show the machine-independent nature of the optimization. 
Optimizing Memory Across Block Boundaries When a small [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/32856","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/129662"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=32856"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/32856\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=32856"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=32856"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=32856"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}