{"id":36345,"date":"2026-03-04T19:41:02","date_gmt":"2026-03-04T19:41:02","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=36345"},"modified":"2026-03-05T20:41:31","modified_gmt":"2026-03-05T20:41:31","slug":"c-performance-improvements-in-msvc-build-tools-v14-51","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/c-performance-improvements-in-msvc-build-tools-v14-51\/","title":{"rendered":"C++ Performance Improvements in MSVC Build Tools v14.51"},"content":{"rendered":"<p>The Microsoft C++ team made big changes to optimization quality that we are\nproud to share with the release of Microsoft C++ (MSVC) Build Tools v14.51. We will use two\nbenchmarks to illustrate the effect of these improvements compared to MSVC\nBuild Tools v14.50.<\/p>\n<p>Our first benchmark is <a href=\"https:\/\/www.spec.org\/cpu2017\/\">SPEC CPU<sup>\u00ae<\/sup> 2017<\/a>, which\ncovers a spectrum of software and is recognized throughout the computing\nindustry. It is often used to evaluate computer hardware, which we are not\ndoing here. We are interested in performance for both x64 and arm64, but we\nwill share only the relative performance between the two compiler versions. We\nevaluate the compiler&#8217;s performance in two configurations: the default build\noptions that Microsoft Visual Studio (VS) arranges (primarily this means \/O2 \/GL) and a second\none with Profile Guided Optimization (PGO) enabled. Overall, the two targets and two configurations mean tracking four results:\n{x64, arm64} x {VS Defaults, PGO}. 
The table below shows the improvement of MSVC Build Tools v14.51 over v14.50.\nAs the results include only the C and C++ benchmarks, but not the Fortran benchmarks, these results do not fully comply with\nSPEC CPU<sup>\u00ae<\/sup> 2017\u2019s Run and Reporting rules and should be considered <em>estimated<\/em>:<\/p>\n<table>\n<thead>\n<tr>\n<th>Suite<\/th>\n<th>MSVC Config<\/th>\n<th>x64<\/th>\n<th>Arm64<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SPECspeed<sup>\u00ae<\/sup> 2017_int_base (est.)<\/td>\n<td>Peak (PGO)<\/td>\n<td><strong>5.0% faster<\/strong><\/td>\n<td><strong>6.5% faster<\/strong><\/td>\n<\/tr>\n<tr>\n<td>SPECspeed<sup>\u00ae<\/sup> 2017_int_base (est.)<\/td>\n<td>VS Defaults<\/td>\n<td><strong>4.3% faster<\/strong><\/td>\n<td><strong>4.4% faster<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><div class=\"alert alert-warning\"><p class=\"alert-divider\"><i class=\"fabric-icon fabric-icon--Warning\"><\/i><strong>Note:<\/strong><\/p>SPEC<sup>\u00ae<\/sup>, SPEC CPU<sup>\u00ae<\/sup>, and SPECspeed<sup>\u00ae<\/sup> are trademarks of the <a href=\"https:\/\/www.spec.org\">Standard Performance Evaluation Corporation (SPEC)<\/a>.<\/div><\/p>\n<p>Our second benchmark is\n<a href=\"https:\/\/dev.epicgames.com\/documentation\/en-us\/unreal-engine\/city-sample-project-unreal-engine-demonstration\">CitySample<\/a>,\nwhich is an Unreal Engine game demo. CitySample records statistics (min, max,\naverage) about frame rate, game thread time, and render thread time. These\nvalues are measured in milliseconds and are noisier, so we present the\nmin and max values taken over a series of ten runs on an Xbox Series X. 
The\ncompiler and linker options were unchanged from CitySample&#8217;s defaults. Lower\nvalues are better.<\/p>\n<table>\n<thead>\n<tr>\n<th>Compiler<\/th>\n<th>FrameTime range (ms)<\/th>\n<th>RenderThreadTime range (ms)<\/th>\n<th>GameThreadTime range (ms)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>MSVC v14.44<\/td>\n<td>34.40-34.49<\/td>\n<td>8.17-8.53<\/td>\n<td>17.68-18.29<\/td>\n<\/tr>\n<tr>\n<td>MSVC v14.50<\/td>\n<td>34.37-34.49<\/td>\n<td>8.00-8.39<\/td>\n<td>17.44-18.18<\/td>\n<\/tr>\n<tr>\n<td>MSVC v14.51<\/td>\n<td>34.30-34.35<\/td>\n<td>7.73-7.95<\/td>\n<td>17.34-18.03<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You may have different benchmarks that matter to you, and we would love to hear\nabout them. If you have performance feedback about a real application, we\naccept suggestions and bug reports in our <a href=\"https:\/\/developercommunity.visualstudio.com\/cpp\">Developer Community<\/a>.<\/p>\n<p>With that context, let&#8217;s look at specific examples of the optimizations we\ndelivered. Some of these began appearing in the 14.50 compiler, but all are\npresent in 14.51.<\/p>\n<h2>New SSA Loop Optimizer<\/h2>\n<p>Broadly speaking, the compiler performs two classes of loop optimizations:\nthose which modify the loop&#8217;s control-flow structure, such as unrolling,\npeeling, and unswitching, and those which modify data operations within the\nloop, such as hoisting invariants, strength reduction, and scalar replacement.\nIn 14.50, we finished replacing the latter set of loop optimizations with new\nones centered around Static Single Assignment (SSA) form. 
SSA is a\nrepresentation in which each variable is assigned exactly once; many compilers\nuse it because it simplifies writing compiler passes.<\/p>\n<p>Replacing the loop optimizer was a very large project spanning several years,\nbut was necessary for the following reasons:<\/p>\n<ul>\n<li>Testability: The legacy loop optimizer was implemented decades ago as one monolithic\ntransformation, which made it difficult to understand, modify, debug, and\ntest. The code for its sub-passes was not organized to allow\nrunning them individually.<\/li>\n<li>Throughput: It did not use SSA like other new MSVC optimizations. Instead, it recognized\nexpressions textually based on their hashes. This older approach required\nrecomputing various data structures and bit vectors, which often represented\n3-5% of compilation time.<\/li>\n<li>Quality: Over the years, we had received a number of defect reports that were\ntraced to the legacy loop optimizer. These required substantial time\nto fix and sometimes they had to be fixed in suboptimal ways because there\nwas no better option. For example, sometimes a workaround was added to an earlier\ncompiler transformation so that the legacy loop optimizer would never see an\ninput that it found problematic.<\/li>\n<li>Completeness: The legacy loop optimizer was not able to handle certain loop forms. For\nexample, it could handle loops that counted up but not down.<\/li>\n<\/ul>\n<p>Given those limitations, the goals for the new loop optimizer were:<\/p>\n<ul>\n<li>Use SSA, like most modern compilers and other newer MSVC optimizations.<\/li>\n<li>Provide an extensible optimization framework such that new loop optimizations can\nbe added easily later.<\/li>\n<li>Provide the ability to run individual sub-passes and therefore the ability to test\nthem separately.<\/li>\n<\/ul>\n<p>The basic design is to process all loops, innermost to outermost, and perform\ntransformations until a fixed point or a maximum number of iterations is\nreached. 
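To make the data-operation transformations named above concrete, here is a source-level sketch (a hypothetical example, not taken from the compiler's test suite) of what hoisting a loop invariant and strength-reducing a multiplication do to a loop:

```cpp
// Hypothetical example: the multiply i * stride is a candidate for
// strength reduction, and x * y is loop-invariant, so it is a
// candidate for hoisting.
int sum(const int* a, int n, int stride, int x, int y) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += a[i * stride] + x * y;
    }
    return total;
}

// The shape of the result the optimizer conceptually produces:
int sum_optimized(const int* a, int n, int stride, int x, int y) {
    const int xy = x * y;  // hoisted loop invariant
    int offset = 0;        // induction variable replacing i * stride
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += a[offset] + xy;
        offset += stride;  // strength-reduced: add instead of multiply
    }
    return total;
}
```

The compiler performs these rewrites on its internal representation, not on source code; the second function only illustrates the shape of the result.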
The transformations included only those that were present in the\nlegacy loop optimizer; the replacement project was tricky enough without\nintroducing the feature creep of new capabilities. With a focus on\nextensibility, new capabilities could be added after the project&#8217;s completion.<\/p>\n<p>The main challenge was that the loop optimizer is one of the most complex and\nperformance-sensitive parts of the compiler. It impacts all loops and even some\ncode outside of loops because of an address mode building sub-pass that\noperates on the entire function. With such significant impact on code\nstructure, modifying the loop optimizer can uncover different and sometimes\nunexpected behavior downstream within the compiler. Couple that with the\nrigorous testing that MSVC undergoes, including thousands of regression tests,\nlarge real-world code projects, industry standard benchmarks, and Microsoft\nfirst-party software, and you have a situation that requires non-trivial\ndebugging. Imagine a failure that manifests only at runtime within a Windows\ndriver.<\/p>\n<p>An additional complication was that the replacement had to be done all at once.\nAlthough the sub-passes of the new loop optimizer were stood up and\ntested individually, the legacy optimizer could be disabled only as a whole.\nTherefore, even after the new one was completed, we still had the novel test\nscenario of the new one enabled with the old one disabled. The final effort to\npolish this scenario was significant, especially for performance tuning.<\/p>\n<p>As of 14.50, the new loop optimizer was enabled for all targets. Enabling it\nresolved 23 unique compiler bugs ranging from crashes to silent bad code\ngeneration. The 5,750 lines of new loop optimizer code, including 750 shared\nwith other parts of the compiler, replaced 15,500 lines of legacy loop\noptimizer. As intended, so as not to further complicate the project, there were\nneither performance improvements nor regressions. 
Compilation throughput\nimproved by 2.5%, which was expected.<\/p>\n<h2>New SLP Vectorizer<\/h2>\n<p>We also continued expanding our new SLP vectorization pass. SLP stands for\nsuperword-level parallelism and is sometimes called block-level vectorization.\nSLP packs similar independent scalar instructions together into a SIMD\ninstruction. SLP normally is contrasted with loop vectorization in which a\ncompiler vectorizes an entire loop body. With SLP, the vectorization can occur\nanywhere and does not need to be inside a loop. Here is an example on Arm64:<\/p>\n<pre><code class=\"language-cpp\">\/\/ Compile with \/O2 \/Qvec-report:2 and look for \"block vectorized\"\r\nvoid slp(int* a, const int* b, const int* c) {\r\n    \/\/ Order doesn't matter as long as there are no data dependencies\r\n    \/\/ between these statements. If all loads happen before any store,\r\n    \/\/ then there is no need to worry about pointer aliasing.\r\n\r\n    const int a0 = b[0] + c[0];\r\n    const int a2 = b[2] + c[2];\r\n    const int a1 = b[1] + c[1];\r\n    const int a3 = b[3] + c[3];\r\n\r\n    a[0] = a0;\r\n    a[2] = a2;\r\n    a[1] = a1;\r\n    a[3] = a3;\r\n}<\/code><\/pre>\n<p>When vectorized, that produces this code:<\/p>\n<pre><code class=\"language-assembly\">    ldr    q17,[x1]\r\n    ldr    q16,[x2]\r\n    add    v16.4s,v17.4s,v16.4s\r\n    str    q16,[x0]<\/code><\/pre>\n<p>MSVC has a legacy SLP vectorization pass that was implemented as part of its\nloop vectorizer. This was an expedient implementation choice at the time, allowing\nfor easy reuse of vectorization infrastructure, but it did not make sense\nlong-term. We implemented the new SLP vectorizer pass separately from the loop\nvectorizer by leveraging SSA utilities. Instead of setting an initial goal\nof replacing the old pass, we prioritized covering gaps in the old one\nbecause both passes can coexist. 
A long-term goal is to remove the old pass,\nbut for now it continues to handle a few cases that the new one does not yet cover.<\/p>\n<p>The new pass covers gaps related to vector sizes that are smaller or larger\nthan the target width. Let&#8217;s explain the larger case first.\nFor example, if the target architecture supported only 128-bit vectors, then when\ncompiling for that target, MSVC internally could not represent a vector larger\nthan 128 bits. SLP vectorization is more effective when acting on more values\nso it can be advantageous to permit temporarily creating oversized 256 or 512\nbit vectors for a 128-bit target, then converting to 128-bit vectors later. As\na specific example, MSVC initially could not represent an i32x8 vector on\nArm64, but with that capability it can handle this example:<\/p>\n<pre><code class=\"language-cpp\">\/\/ Compile with \/O2 \/Qvec-report:2 and look for \"block vectorized\"\r\n#include &lt;cstdint&gt;\r\n\r\nvoid oversized(uint16_t* __restrict a, uint16_t* __restrict b) {\r\n    a[0] = static_cast&lt;uint32_t&gt;(b[0]) * 0x7F123456 &gt;&gt; 2;\r\n    a[2] = static_cast&lt;uint32_t&gt;(b[2]) * 0x7F123456 &gt;&gt; 2;\r\n    a[1] = static_cast&lt;uint32_t&gt;(b[1]) * 0x7F123456 &gt;&gt; 2;\r\n    a[4] = static_cast&lt;uint32_t&gt;(b[4]) * 0x7F123456 &gt;&gt; 2;\r\n    a[3] = static_cast&lt;uint32_t&gt;(b[3]) * 0x7F123456 &gt;&gt; 2;\r\n    a[6] = static_cast&lt;uint32_t&gt;(b[6]) * 0x7F123456 &gt;&gt; 2;\r\n    a[7] = static_cast&lt;uint32_t&gt;(b[7]) * 0x7F123456 &gt;&gt; 2;\r\n    a[5] = static_cast&lt;uint32_t&gt;(b[5]) * 0x7F123456 &gt;&gt; 2;\r\n}<\/code><\/pre>\n<p>Which produces this Internal Representation (IR) within the compiler after SLP:<\/p>\n<pre><code class=\"language-cpp\">tv1.i16x8 = IV_VECT_DUP 0x7F123456\r\ntv2.i16x8 = IV_VECT_LOAD b\r\ntv3.i32x8 = IV_VECT_CONVERT tv2\r\ntv4.i32x8 = IV_VECT_MUL tv3, tv1\r\ntv5.i32x8 = IV_VECT_SHRIMM tv4, 0x2\r\ntv6.i16x8 = IV_VECT_CONVERT tv5\r\n            IV_VECT_STORE 
a, tv6<\/code><\/pre>\n<p>A new legalizer phase then turns this IR into real SIMD instructions supported by the target.\nOversized vectors are enabled for Arm64. Other targets are works-in-progress.<\/p>\n<p>Oversized vectors are needed to optimize select operations in SLP. Consider this example:<\/p>\n<pre><code class=\"language-cpp\">void oversized_select(int* __restrict a, const int* __restrict b) {\r\n    a[0] = b[0] + b[5];\r\n    a[1] = b[1] - b[4];\r\n    a[2] = b[2] + b[7];\r\n    a[3] = b[3] - b[6];\r\n    a[4] = b[4] + b[1];\r\n    a[5] = b[5] - b[0];\r\n    a[6] = b[6] + b[3];\r\n    a[7] = b[7] - b[2];\r\n}<\/code><\/pre>\n<p>Without select optimization, the IR after SLP would look something like:<\/p>\n<pre><code class=\"language-cpp\">tv1.i32x8 = IV_VECT_LOAD b\r\ntv2.i32x8 = IV_VECT_PERMUTE tv1, 5, 4, 7, 6, 1, 0, 3, 2\r\ntv3.i32x8 = IV_VECT_ADD tv1, tv2\r\ntv4.i32x8 = IV_VECT_SUB tv1, tv2\r\ntv5.i32x8 = IV_VECT_SELECT tv3, tv4, 0, 1, 0, 1, 0, 1, 0, 1\r\n            IV_VECT_STORE a, tv5<\/code><\/pre>\n<p>Notice that we compute an i32x8 addition and an i32x8 subtraction. As-is, the\ni32x8 addition would turn into two i32x4 additions, and the i32x8 subtraction\nwould turn into two i32x4 subtractions. The final assembly would look like this:<\/p>\n<pre><code class=\"language-assembly\">    ldp         q18,q20,[x1]\r\n    rev64       v17.4s,v20.4s\r\n    rev64       v19.4s,v18.4s\r\n    add         v16.4s,v18.4s,v17.4s\r\n    sub         v17.4s,v18.4s,v17.4s\r\n    ext8        v16.16b,v16.16b,v16.16b,#0xC\r\n    trn2        v18.4s,v16.4s,v17.4s\r\n    add         v16.4s,v20.4s,v19.4s\r\n    sub         v17.4s,v20.4s,v19.4s\r\n    ext8        v16.16b,v16.16b,v16.16b,#0xC\r\n    trn2        v16.4s,v16.4s,v17.4s\r\n    stp         q18,q16,[x0]\r\n    ret<\/code><\/pre>\n<p>The extra addition, subtraction, and shuffle instructions add overhead to the\nvectorized code that doesn&#8217;t exist in the scalar code. 
However, if we carefully\nrearrange the values such that IV_VECT_SELECT becomes a no-op, we can eliminate\none i32x4 addition and one i32x4 subtraction from the final binary.<\/p>\n<pre><code class=\"language-cpp\">tv1.i32x8 = IV_VECT_LOAD b\r\ntv2.i32x8 = IV_VECT_PERMUTE b, 0, 2, 4, 6, 1, 3, 5, 7\r\ntv3.i32x8 = IV_VECT_PERMUTE b, 5, 7, 1, 3, 4, 6, 0, 2\r\ntv4.i32x8 = IV_VECT_ADD tv2, tv3\r\ntv5.i32x8 = IV_VECT_SUB tv2, tv3\r\ntv6.i32x8 = IV_VECT_SELECT tv4, tv5, 0, 0, 0, 0, 1, 1, 1, 1\r\ntv7.i32x8 = IV_VECT_PERMUTE tv6, 0, 4, 1, 5, 2, 6, 3, 7\r\n            IV_VECT_STORE a, tv7<\/code><\/pre>\n<p>This looks like more IR since there are now IV_VECT_PERMUTEs, but the final\nbinary is actually smaller and faster:<\/p>\n<pre><code class=\"language-assembly\">    ldp         q20,q16,[x1]\r\n    uzp2        v17.4s,v16.4s,v20.4s\r\n    uzp1        v18.4s,v20.4s,v16.4s\r\n    uzp2        v19.4s,v20.4s,v16.4s\r\n    uzp1        v16.4s,v16.4s,v20.4s\r\n    add         v18.4s,v18.4s,v17.4s\r\n    sub         v16.4s,v19.4s,v16.4s\r\n    zip1        v17.4s,v18.4s,v16.4s\r\n    zip2        v16.4s,v18.4s,v16.4s\r\n    stp         q17,q16,[x0]\r\n    ret<\/code><\/pre>\n<p>Now back to the smaller case. Previously, SLP only considered vectorizing if\nthe last instructions in a sequence (usually a sequence of stores) started at\nfull vector width (it could then shrink as we find more instructions later).\nComplementing oversized vectors, SLP now also considers vectorizing at smaller\nsizes on x64. 
For example, this i16x4 load-shift-store is now vectorized:<\/p>\n<pre><code class=\"language-cpp\">#include &lt;cstdint&gt;\r\n\r\nvoid test_halfvec_1(int16_t *s) {\r\n    s[0] &lt;&lt;= 1;\r\n    s[1] &lt;&lt;= 1;\r\n    s[2] &lt;&lt;= 1;\r\n    s[3] &lt;&lt;= 1;\r\n}<\/code><\/pre>\n<pre><code class=\"language-assembly\">    movq    xmm0, QWORD PTR [rcx]\r\n    psllw   xmm0, 1\r\n    movq    QWORD PTR [rcx], xmm0<\/code><\/pre>\n<p>Additionally, SLP does a better job finding similar sequences of instructions on all targets. Consider this example:<\/p>\n<pre><code class=\"language-cpp\">#include &lt;cstdint&gt;\r\n\r\nvoid test_halfvec_2(int16_t *s) {\r\n    s[0] += 1; \/\/ this initial op prevented SLP vectorization\r\n\r\n    \/\/ this block should be SLP vectorized even though it doesn't fill an entire vector\r\n    s[1] &lt;&lt;= 1;\r\n    s[2] &lt;&lt;= 1;\r\n    s[3] &lt;&lt;= 1;\r\n    s[4] &lt;&lt;= 1;\r\n\r\n    s[5] += 1; \/\/ trailing op should not prevent SLP vectorization\r\n}<\/code><\/pre>\n<p>Previously, SLP would try to vectorize the sequence made up of s[0], s[1],\ns[2], and s[3], fail, and, importantly, no longer consider those elements for\nfuture iterations of SLP. Now, SLP will try again with s[1], s[2], s[3], and\ns[4].<\/p>\n<pre><code class=\"language-assembly\">    movq    xmm0, QWORD PTR [rcx+2]\r\n    inc WORD PTR [rcx]\r\n    inc WORD PTR [rcx+10]\r\n    psllw   xmm0, 1\r\n    movq    QWORD PTR [rcx+2], xmm0<\/code><\/pre>\n<h2>SROA Improvements<\/h2>\n<p>Scalar Replacement of Aggregates (SROA) is a classic compiler optimization\nthat replaces fields of non-address-taken structs and classes with\nscalar variables. These scalar variables then become candidates for register\nallocation and all optimizations that apply to scalars including constant and\ncopy propagation, dead code elimination, etc.<\/p>\n<p>We made significant improvements to our SROA. 
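In source terms, the effect of SROA can be sketched like this (a hypothetical example, separate from the compiler-generated cases below):

```cpp
// Hypothetical example: a local struct whose address never escapes.
struct Point { int x; int y; };

// Before SROA (conceptually): p lives in memory and the field
// accesses are loads and stores.
int length_squared() {
    Point p;
    p.x = 3;
    p.y = 4;
    return p.x * p.x + p.y * p.y;
}

// After SROA (conceptually): each field becomes an independent scalar,
// which constant propagation then folds down to a single constant.
int length_squared_sroa() {
    int p_x = 3;  // scalar replacing p.x
    int p_y = 4;  // scalar replacing p.y
    return p_x * p_x + p_y * p_y;  // constant-folds to 25
}
```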
One of the SROA steps is\ndeciding which struct assignments to replace with field-by-field assignments.\nWe call this step unpacking. Let&#8217;s look at many improvements to unpacking.<\/p>\n<h3>Assignments via Indirections<\/h3>\n<p>The most important improvement was to allow unpacking more struct assignments\nthat involved indirection. Before the change was made, indirect struct\nassignments were unpacked only if the structs contained two floats or two doubles.\nEssentially the unpacking targeted the structs used for complex numbers.\nWith this restriction removed, this example improves:<\/p>\n<pre><code class=\"language-cpp\">struct S {\r\n    int i;\r\n    int j;\r\n    float f;\r\n};\r\n\r\nint test1(S* inS) {\r\n   S localS = *inS;\r\n   return localS.i;\r\n}<\/code><\/pre>\n<p>Before recent changes we generated this code:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 24\r\n    movsd   xmm0, QWORD PTR [rcx]\r\n    movsd   QWORD PTR localS$[rsp], xmm0\r\n    mov     eax, DWORD PTR localS$[rsp]\r\n    add     rsp, 24\r\n    ret     0<\/code><\/pre>\n<p>Now the unpacking and subsequent optimizations reduce it to:<\/p>\n<pre><code class=\"language-assembly\">    mov     eax, DWORD PTR [rcx]\r\n    ret     0<\/code><\/pre>\n<h3>Larger Structs<\/h3>\n<p>We increased our unpacking limit from 32 bytes to 64 bytes. 
Here is a simple example where this helps:<\/p>\n<pre><code class=\"language-cpp\">bool flag;\r\n\r\nstruct S {\r\n    int i1;\r\n    int i2;\r\n    int i3;\r\n    int i4;\r\n    int i5;\r\n    int i6;\r\n    int i7;\r\n    int i8;\r\n    int i9;\r\n};\r\n\r\nint test2() {\r\n   S localS1;\r\n   S localS2;\r\n   localS1.i1 = 1;\r\n   localS2.i1 = 1;\r\n\r\n   S localS3 = localS1;\r\n   if (flag) localS3 = localS2;\r\n\r\n   return localS3.i1;\r\n}<\/code><\/pre>\n<p>When the limit on struct unpacking was 32 bytes, we generated this code for test2:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 136\r\n    cmp     BYTE PTR ?flag@@3_NA, 0\r\n    mov     DWORD PTR localS1$[rsp], 1\r\n    movups  xmm1, XMMWORD PTR localS1$[rsp]\r\n    mov     DWORD PTR localS2$[rsp], 1\r\n    mov     eax, DWORD PTR localS2$[rsp]\r\n    jne     SHORT $LN2@test2\r\n    movd    eax, xmm1\r\n$LN2@test2:\r\n    add     rsp, 136\r\n    ret     0<\/code><\/pre>\n<p>With the limit increased to 64 bytes, struct unpacking and subsequent optimizations reduce it to:<\/p>\n<pre><code class=\"language-assembly\">    mov     eax, 1\r\n    ret     0<\/code><\/pre>\n<h3>Unions<\/h3>\n<p>Struct unpacking now handles unions when only one of the overlapping fields is used. For example:<\/p>\n<pre><code class=\"language-cpp\">bool flag;\r\n\r\nstruct S {\r\n    int i1;\r\n    int i2;\r\n    int i3;\r\n    int i4;\r\n    int i5;\r\n};\r\n\r\nunion U {\r\n    float f;\r\n    S s;\r\n};\r\n\r\nint test3() {\r\n   U localU1;\r\n   U localU2;\r\n\r\n   float f1 = localU1.f;\r\n   float f2 = localU2.f;\r\n\r\n   localU1.s.i1 = 1;\r\n   localU2.s.i1 = 1;\r\n\r\n   U localU3 = localU1;\r\n   if (flag) localU3 = localU2;\r\n\r\n   return localU3.s.i1;\r\n}<\/code><\/pre>\n<p>The float field of the union is used in the source code, but the assignments to f1 and f2\nare dead and are eliminated prior to unpacking. 
Unpacking now identifies that because field\nf is unused, the rest of the union can be unpacked like a normal struct. Before, the emitted\ncode was:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 56                                 ; 00000038H\r\n    cmp     BYTE PTR ?flag@@3_NA, 0                 ; flag\r\n    mov     DWORD PTR localU1$[rsp], 1\r\n    movups  xmm0, XMMWORD PTR localU1$[rsp]\r\n    mov     DWORD PTR localU2$[rsp], 1\r\n    mov     eax, DWORD PTR localU2$[rsp]\r\n    jne     SHORT $LN2@test3\r\n    movd    eax, xmm0\r\n$LN2@test3:\r\n    add     rsp, 56                                 ; 00000038H\r\n    ret     0<\/code><\/pre>\n<p>Now it is simplified to:<\/p>\n<pre><code class=\"language-assembly\">    mov     eax, 1\r\n    ret     0<\/code><\/pre>\n<h3>Relaxing Address-Taken Restrictions<\/h3>\n<p>We were not unpacking struct assignments if either struct&#8217;s address was taken. Consider:<\/p>\n<pre><code class=\"language-cpp\">struct S {\r\n    int i1;\r\n    int i2;\r\n};\r\n\r\nint bar(S* s);\r\n\r\nint foo() {\r\n    S s1;\r\n    S s2;\r\n\r\n    s1.i1 = 5;\r\n    s1.i2 = 6;\r\n    s2 = s1;\r\n\r\n    int result = s2.i1;\r\n\r\n    bar(&amp;s2);\r\n\r\n    return result;\r\n}<\/code><\/pre>\n<p>Before recent changes we did not unpack the s2 = s1 struct assignment because s2\nwas address-taken. 
We emitted this code:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 40\r\n    mov     DWORD PTR s1$[rsp], 5\r\n    mov     DWORD PTR s1$[rsp+4], 6\r\n    mov     rcx, QWORD PTR s1$[rsp]\r\n    mov     QWORD PTR s2$[rsp], rcx\r\n    lea     rcx, QWORD PTR s2$[rsp]\r\n    call    ?bar@@YAHPEAUS@@@Z\r\n    mov     eax, 5\r\n    add     rsp, 40\r\n    ret     0<\/code><\/pre>\n<p>With improved unpacking we generate this code that avoids the struct copy:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 40\r\n    lea     rcx, QWORD PTR s2$[rsp]\r\n    mov     DWORD PTR s2$[rsp], 5\r\n    mov     DWORD PTR s2$[rsp+4], 6\r\n    call    ?bar@@YAHPEAUS@@@Z\r\n    mov     eax, 5\r\n    add     rsp, 40\r\n    ret     0<\/code><\/pre>\n<h3>Unpacking Struct Assignments with Casted Fields<\/h3>\n<p>We were not unpacking if a field was used as a different type via a cast. For example:<\/p>\n<pre><code class=\"language-cpp\">struct S {\r\n    long long l1;\r\n    long long l2;\r\n};\r\n\r\nvoid bar (int i);\r\n\r\nlong long foo(S *s1) {\r\n    S s2 = *s1;\r\n    bar((int)s2.l1);\r\n    return s2.l1 + s2.l2;\r\n}<\/code><\/pre>\n<p>Unpacking was not done because s2.l1 was used as both an int and as a long long.\nThis restriction has been removed. 
The code emitted before was:<\/p>\n<pre><code class=\"language-assembly\">    push    rbx\r\n    sub     rsp, 48\r\n    movaps  XMMWORD PTR [rsp+32], xmm6\r\n    movups  xmm6, XMMWORD PTR [rcx]\r\n    movq    rbx, xmm6\r\n    mov     ecx, ebx\r\n    call    ?bar@@YAXH@Z\r\n    psrldq  xmm6, 8\r\n    movq    rax, xmm6\r\n    movaps  xmm6, XMMWORD PTR [rsp+32]\r\n    add     rax, rbx\r\n    add     rsp, 48\r\n    pop     rbx\r\n    ret     0<\/code><\/pre>\n<p>Now we are able to get rid of the struct copy:<\/p>\n<pre><code class=\"language-assembly\">    mov     QWORD PTR [rsp+8], rbx\r\n    push    rdi\r\n    sub     rsp, 32\r\n    mov     rdi, QWORD PTR [rcx]\r\n    mov     rbx, QWORD PTR [rcx+8]\r\n    mov     ecx, edi\r\n    call    ?bar@@YAXH@Z\r\n    lea     rax, QWORD PTR [rbx+rdi]\r\n    mov     rbx, QWORD PTR [rsp+48]\r\n    add     rsp, 32\r\n    pop     rdi\r\n    ret     0<\/code><\/pre>\n<h3>Unpacking Struct Assignments with Source Struct at Non-Zero Offset<\/h3>\n<p>In this example, the source of the s2 = t.s struct assignment is a struct of type S that is\nenclosed at a non-zero offset in struct T:<\/p>\n<pre><code class=\"language-cpp\">struct S {\r\n    int i;\r\n    int j;\r\n    int k;\r\n};\r\n\r\nstruct T {\r\n    int l;\r\n    S s;\r\n};\r\n\r\nint foo(S* s1) {\r\n    T t;\r\n    t.s.i = 1;\r\n    t.s.j = 2;\r\n    t.s.k = 3;\r\n\r\n    S s2 = t.s;\r\n    *s1 = s2;\r\n\r\n    return s1-&gt;i + s1-&gt;j + s1-&gt;k;\r\n}<\/code><\/pre>\n<p>Unpacking for this case was previously not allowed and we generated this code:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 24\r\n    mov     DWORD PTR t$[rsp+4], 1\r\n    mov     DWORD PTR t$[rsp+8], 2\r\n    movsd   xmm0, QWORD PTR t$[rsp+4]\r\n    movsd   QWORD PTR [rcx], xmm0\r\n    mov     DWORD PTR [rcx+8], 3\r\n    mov     eax, DWORD PTR [rcx+8]\r\n    add     eax, DWORD PTR [rcx+4]\r\n    add     eax, DWORD PTR [rcx]\r\n    add     rsp, 24\r\n    ret     0<\/code><\/pre>\n<p>After 
allowing unpacking, we are able to eliminate the struct copy and do\nconstant propagation that computes the result:<\/p>\n<pre><code class=\"language-assembly\">    mov     DWORD PTR [rcx], 1\r\n    mov     eax, 6\r\n    mov     DWORD PTR [rcx+4], 2\r\n    mov     DWORD PTR [rcx+8], 3\r\n    ret     0<\/code><\/pre>\n<h3>Repacking Struct Assignments with Indirections<\/h3>\n<p>The dual of unpacking is packing. The above examples demonstrated unpacking, but\nsometimes unpacking does not result in code simplifications. To resolve that problem,\nwe have a packing phase that may remove field-by-field assignments created by unpacking\nor that were present in the original source code. We improved packing to work\nwhen either the sources or targets were accessed indirectly via pointers. For example:<\/p>\n<pre><code class=\"language-cpp\">struct S {\r\n    int i1;\r\n    int i2;\r\n    int i3;\r\n    int i4;\r\n};\r\n\r\nvoid bar(S* s);\r\n\r\nvoid foo(S* s1) {\r\n    S s2;\r\n\r\n    s2.i1 = s1-&gt;i1;\r\n    s2.i2 = s1-&gt;i2;\r\n    s2.i3 = s1-&gt;i3;\r\n    s2.i4 = s1-&gt;i4;\r\n\r\n    bar (&amp;s2);\r\n}<\/code><\/pre>\n<p>Before we generated this code:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 56\r\n    mov     eax, DWORD PTR [rcx]\r\n    mov     DWORD PTR s2$[rsp], eax\r\n    mov     eax, DWORD PTR [rcx+4]\r\n    mov     DWORD PTR s2$[rsp+4], eax\r\n    mov     eax, DWORD PTR [rcx+8]\r\n    mov     DWORD PTR s2$[rsp+8], eax\r\n    mov     eax, DWORD PTR [rcx+12]\r\n    lea     rcx, QWORD PTR s2$[rsp]\r\n    mov     DWORD PTR s2$[rsp+12], eax\r\n    call    ?bar@@YAXPEAUS@@@Z\r\n    add     rsp, 56\r\n    ret     0<\/code><\/pre>\n<p>Now we are able to repack (with \/GS-) and generate this much smaller code:<\/p>\n<pre><code class=\"language-assembly\">    sub     rsp, 56\r\n    movups  xmm0, XMMWORD PTR [rcx]\r\n    lea     rcx, QWORD PTR s2$[rsp]\r\n    movups  XMMWORD PTR s2$[rsp], xmm0\r\n    call    ?bar@@YAXPEAUS@@@Z\r\n    add     rsp, 
56\r\n    ret     0<\/code><\/pre>\n<p>We saw broad improvements from the above SROA improvements, including a 1.9%\nimprovement in CitySample render thread time, a 1.27% improvement in CitySample\ngame thread time, and better optimization of gsl::span.<\/p>\n<h2>Hoisting Vectorizer Pointer Overlap Checks<\/h2>\n<p>Vectorized loops sometimes contain pointer overlap checks that were inserted by\nthe compiler to ensure correctness when loading from potentially aliased memory\nregions. If the runtime check detects pointer overlap, then scalar code is\nused, but if there is no overlap, then vector code is used. If these checks\nare inside an inner loop, then they can be costly. We added the capability to\nhoist inner-loop pointer overlap checks to the parent loop when legal. This\nhoist reduces per iteration overhead, improving the performance of vectorized\nloops.<\/p>\n<p>Even for a single overlap check within an inner loop, the optimization must\naccount for that check&#8217;s multiple dynamic instances across loop iterations.\nThe hoisted check must cover all of these before the inner loop starts, either\nby computing them all or by conservatively testing a superset of the original\nranges.<\/p>\n<h2>Logical to Bitwise OR<\/h2>\n<p>Due to the C++ language&#8217;s short-circuit evaluation rules, the logical OR\nexpression A || B is in general translated as two conditional branches, with no\nactual OR instruction being emitted. An optimization is to avoid the branches\nby combining the truth values of A and B with an OR instruction. The catch is\nthat the original expression&#8217;s correctness cannot depend on the\nshort-circuiting behavior. 
For example, (a == 0 || (b\/a &gt; 5)) depends on\nshort-circuiting to avoid a fault.<\/p>\n<p>Consider:<\/p>\n<pre><code class=\"language-cpp\">return A || B;<\/code><\/pre>\n<p>Without optimization, the compiler emits essentially:<\/p>\n<pre><code class=\"language-cpp\">temp = false;\r\nif (A) {\r\n    temp = true;\r\n} else if (B) {  \/\/ not evaluated if A is true\r\n    temp = true;\r\n}\r\nreturn temp;<\/code><\/pre>\n<p>But with optimization, the compiler emits:<\/p>\n<pre><code class=\"language-cpp\">return (A | B) != 0;<\/code><\/pre>\n<h2>Shift-CMP folding<\/h2>\n<p>For the following code snippet:<\/p>\n<pre><code class=\"language-cpp\">void foo(int input) {\r\n    int a = input &gt;&gt; 3;\r\n\r\n    if (a &gt;= 1) {\r\n        foo2();\r\n    } else {\r\n        foo3();\r\n    }\r\n}<\/code><\/pre>\n<p>The value of a is not used outside of the comparison, so we can fold the shift\ninto the comparison by shifting the comparison constant left by 3 (1 &lt;&lt; 3 = 8):<\/p>\n<pre><code class=\"language-cpp\">void foo(int input) {\r\n    if (input &gt;= 8) {\r\n        foo2();\r\n    } else {\r\n        foo3();\r\n    }\r\n}<\/code><\/pre>\n<p>We cannot do this optimization for every comparison. 
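A small check (hypothetical helper functions) shows why: for a right shift, folding a '&gt;' comparison can change the result, while '&gt;=' is safe:

```cpp
// (input >> 3) >= 1 folds safely into input >= 8
// (shown here for non-negative inputs).
bool shifted_ge(int input) { return (input >> 3) >= 1; }
bool folded_ge(int input)  { return input >= 8; }

// Folding (input >> 3) > 0 into input > 0 would be wrong:
// for input = 7, (7 >> 3) > 0 is false but 7 > 0 is true.
bool shifted_gt(int input) { return (input >> 3) > 0; }
bool folded_gt(int input)  { return input > 0; }
```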
For right shifts, we can do it only for &#8216;&lt;&#8216; and &#8216;&gt;=&#8217;,\nand for left shifts only for &#8216;&gt;&#8217; and &#8216;&lt;=&#8217;.<\/p>\n<h2>Neon Codegen Improvement<\/h2>\n<p>Consider this snippet of C code:<\/p>\n<pre><code class=\"language-cpp\">uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) &lt;&lt; 16);\r\nuint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) &lt;&lt; 16);\r\nuint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) &lt;&lt; 16);\r\nuint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) &lt;&lt; 16);<\/code><\/pre>\n<p>On Arm64, we originally vectorized it with the following sequence of ARM NEON instructions:<\/p>\n<pre><code class=\"language-assembly\">    ldp         s16,s19,[x0]\r\n    ushll       v18.8h,v16.8b,#0\r\n    ldp         s17,s16,[x2]\r\n    usubl       v16.8h,v19.8b,v16.8b\r\n    ushll       v17.8h,v17.8b,#0\r\n    shll        v16.4s,v16.4h,#0x10\r\n    usubw       v16.4s,v16.4s,v17.4h\r\n    uaddw       v18.4s,v16.4s,v18.4h<\/code><\/pre>\n<p>ARM NEON has instructions to simultaneously widen and extract the low or high\nhalf of a source vector register, then perform arithmetic. In this particular\ncode snippet, we can combine the 8 scalar subtraction operations into a single\nUSUBL instruction and then use the SHLL2 instruction on the high half of the\nresult. 
The new NEON instruction sequence from this improvement is shorter and\nfaster:<\/p>\n<pre><code class=\"language-assembly\">    ldr         d16,[x2]\r\n    ldr         d17,[x0]\r\n    usubl       v17.8h,v17.8b,v16.8b\r\n    shll2       v16.4s,v17.8h,#0x10\r\n    saddw       v18.4s,v16.4s,v17.4h<\/code><\/pre>\n<h2>Unconditional Store Execution<\/h2>\n<p>Consider this example:<\/p>\n<pre><code class=\"language-cpp\">int test_1(int a, int i) {\r\n    int mem[4]{0};\r\n\r\n    if (mem[i] &lt; a) {\r\n        mem[i] = a;\r\n    }\r\n\r\n    return mem[0];\r\n}<\/code><\/pre>\n<p>Normally, this would result in a conditional branch around a store, but with\nunconditional store execution, the compiler will instead emit a CMOV and an\nunconditionally executed store:<\/p>\n<pre><code class=\"language-assembly\">    cmovge  ecx, DWORD PTR [rdx]\r\n    mov     DWORD PTR [rdx], ecx<\/code><\/pre>\n<p>In the original program, the store executes on only some of the paths through\nthis function, so the compiler must be careful not to introduce bugs by\nexecuting it unconditionally. The compiler considers applying the transformation\nonly to memory that 1) the compiler can prove is not shared, which avoids\nintroducing a data race, and 2) was previously accessed by a dominating\ninstruction, such as the load in the <code>if<\/code> condition, which avoids introducing\nan access violation on a path where there wasn&#8217;t already an access violation.<\/p>\n<p>Unconditional store execution has been in the compiler for a while. The recent\nchange was to keep the compiler&#8217;s dominance information up-to-date.<\/p>\n<h2>Improved AVX Optimization<\/h2>\n<p>We&#8217;ve recently made improvements to AVX optimization. 
Consider this example,\nwhich was reduced from layers of generic library code and emerged in this\npattern after extensive inlining.<\/p>\n<pre><code class=\"language-cpp\">\/\/ Compile with \/O2 \/arch:AVX\r\n\r\n#include &lt;immintrin.h&gt;\r\n\r\n__m256d test(double *a, double *b, double *c, double *d) {\r\n    __m256d temp;\r\n\r\n    temp.m256d_f64[0] = *a;\r\n    temp.m256d_f64[1] = *b;\r\n    temp.m256d_f64[2] = *c;\r\n    temp.m256d_f64[3] = *d;\r\n\r\n    return _mm256_movedup_pd(_mm256_permute2f128_pd(temp, temp, 0x20));\r\n}<\/code><\/pre>\n<p>Originally, this would result in the following generated code:<\/p>\n<pre><code class=\"language-assembly\">    vmovsd  xmm0, QWORD PTR [rcx]\r\n    vmovsd  xmm1, QWORD PTR [rdx]\r\n    vmovsd  QWORD PTR temp$[rbp], xmm0\r\n    vmovsd  xmm0, QWORD PTR [r8]\r\n    vmovsd  QWORD PTR temp$[rbp+8], xmm1\r\n    vmovsd  xmm1, QWORD PTR [r9]\r\n    vmovsd  QWORD PTR temp$[rbp+16], xmm0\r\n    vmovsd  QWORD PTR temp$[rbp+24], xmm1\r\n    vmovupd ymm0, YMMWORD PTR temp$[rbp]\r\n    vperm2f128 ymm2, ymm0, ymm0, 32\r\n    vmovddup ymm0, ymm2<\/code><\/pre>\n<p>There are two improvements here. First, the stack round trip (the store-load\nsequence using <code>temp$<\/code> as a buffer) is eliminated. Second, notice that the final\nreturn value is just a broadcast of <code>*a<\/code> to all four lanes of the vector. By\ntracking how vector values move across sequences of swizzle instructions, the\ncompiler is now able to recognize that fact. 
The final result for this example\nis a single instruction:<\/p>\n<pre><code class=\"language-assembly\">    vbroadcastsd ymm0, QWORD PTR [rcx]<\/code><\/pre>\n<p>This optimization improved CitySample&#8217;s render thread time by 0.23ms on Xbox Series X.<\/p>\n<h2>Single and Limited Call Site Inlining<\/h2>\n<p>MSVC has traditionally taken a conservative approach to inlining, wary of the\ncode size and build time increases that can occur with more aggressive\napproaches. There is a \/Ob3 option to enable more aggressive inlining, but it\nis not enabled by default with \/O2. We looked for strategic changes that would\nimprove performance by default while not exploding code size or reducing build\nthroughput.<\/p>\n<p>The strategy that had the most positive performance impact with the least\nnegative side effects was single call site inlining. The compiler uses whole\nprogram analysis (\/GL) to inline a function if it is called in exactly one\nplace. The code size difference is negligible because the original standalone\nfunction can be discarded, leaving a single instance of the function\u2019s body.\nWe implemented build throughput optimizations for cases where the increased\ninlining showed throughput regressions. At first, it may seem surprising that\nthere could be throughput regressions without the overall code size changing.\nThe inlining eliminates one function at the expense of making a different\nfunction larger, so it has the potential to exacerbate any compiler algorithms\nthat are non-linear with respect to function size.<\/p>\n<p>By enabling single call site inlining, we generally improved performance with\nno code size impact, and the build throughput penalty was less than 5%.<\/p>\n<p>Later, we extended this idea to limited call site inlining, which covers\nfunctions that are called only a few times across the entire application. 
This\napproach needed to be more cautious about code size increase, factoring in the\nsize of the candidate functions. There was on average a 2% code size increase,\nand the earlier build throughput optimizations were sufficient to absorb the\nextra work.<\/p>\n<h2>Branch Elimination<\/h2>\n<p>For many years we&#8217;ve had an optimization that transforms branches that execute\na single, cheap instruction into branchless code using instructions like\n&#8220;cmov&#8221;. This transformation can improve performance for unpredictable branches,\nand it&#8217;s important for algorithms like heapsort and binary search.<\/p>\n<p>We&#8217;ve enhanced MSVC to allow this optimization through more levels of nested\nconditions. In particular, we now optimize common &#8220;heapification&#8221; routines.\nThese usually have a branch that looks like the following:<\/p>\n<pre><code class=\"language-cpp\">  if (c &lt; end &amp;&amp; arr[c-1] &lt; arr[c]) {\r\n    c++;\r\n  }<\/code><\/pre>\n<p>Previously, we would generate assembly like the following for the second\ncondition above:<\/p>\n<pre><code class=\"language-assembly\">    mov   ecx, DWORD PTR [r9+r8*4]\r\n    cmp   DWORD PTR [r9+r8*4-4], ecx\r\n    jge   SHORT $LN4@downheap_p\r\n    inc   edx\r\n$LN4@downheap_p:<\/code><\/pre>\n<p>We now always perform the increment instruction but conditionally store the result:<\/p>\n<pre><code class=\"language-assembly\">    mov     r8d, DWORD PTR [r10+rcx*4-4]\r\n    mov     edx, DWORD PTR [r10+rcx*4]\r\n    cmp     r8d, edx\r\n    lea     ecx, DWORD PTR [r9+1]\r\n    cmovge  ecx, r9d<\/code><\/pre>\n<p>The compiler analyzes the profitability of converting branches and consults\nprofiling information when available.<\/p>\n<h2>Loop Unswitching<\/h2>\n<p>Unswitching hoists a condition out of a loop, which can enable further\noptimization. 
For example:<\/p>\n<pre><code class=\"language-cpp\">for (int *p = arr; p &lt; arr + N; ++p) {\r\n     if (doWork) {\r\n         rv += *p;\r\n     }\r\n}<\/code><\/pre>\n<p>Can be transformed into:<\/p>\n<pre><code class=\"language-cpp\">if (doWork) {\r\n    for (int *p = arr; p &lt; arr + N; ++p) {\r\n        rv += *p;\r\n    }\r\n} else {\r\n    for (int *p = arr; p &lt; arr + N; ++p) {}\r\n}<\/code><\/pre>\n<p>Which can be optimized into just:<\/p>\n<pre><code class=\"language-cpp\">if (doWork) {\r\n    for (int *p = arr; p &lt; arr + N; ++p) {\r\n        rv += *p;\r\n    }\r\n}<\/code><\/pre>\n<p>The new change is that iterator loops are now unswitched. Consider:<\/p>\n<pre><code class=\"language-cpp\">for (auto it = arr.begin(); it != arr.end(); ++it) {\r\n     if (doWork) {\r\n         rv += *it;\r\n     }\r\n}<\/code><\/pre>\n<p>Which can now be transformed into:<\/p>\n<pre><code class=\"language-cpp\">if (doWork) {\r\n    for (auto it = arr.begin(); it != arr.end(); ++it) {\r\n         rv += *it;\r\n    }\r\n}<\/code><\/pre>\n<h2>Memset and Memcpy improvements<\/h2>\n<p>We improved how we propagate memset values forward. Consider:<\/p>\n<pre><code class=\"language-cpp\">struct S {\r\n    int a;\r\n    int b;\r\n    char data[0x100];\r\n};\r\n\r\nS s;\r\n\r\nmemset(&amp;s, 0, sizeof(s));\r\n\r\ns.a = 1;\r\n\r\n\/\/ use s.b<\/code><\/pre>\n<p>The write to <code>s.a<\/code> overlaps some of the memory written by the memset, which blocked\nthe compiler from recognizing that the use of <code>s.b<\/code> could be replaced by the zero\nfrom the memset. With the recent changes, we are able to propagate memset values\nforward for fields that have not been changed, even if other fields have been.<\/p>\n<p>Additionally, we made two improvements to our inline expansions of memset and memcpy.\nThe first improvement is to use overlapping copies for the trailing bytes when the size of the copy\nis not a multiple of the available register size. 
For example, we previously used this inline code:<\/p>\n<pre><code class=\"language-assembly\">    movups      xmm0,xmmword ptr [rdx]\r\n    movups      xmmword ptr [rcx],xmm0\r\n    movsd       xmm1,mmword ptr [rdx+10h]\r\n    movsd       mmword ptr [rcx+10h],xmm1\r\n    mov         eax,dword ptr [rdx+18h]\r\n    mov         dword ptr [rcx+18h],eax\r\n    movzx       eax,word ptr [rdx+1Ch]\r\n    mov         word ptr [rcx+1Ch],ax\r\n    movzx       eax,byte ptr [rdx+1Eh]\r\n    mov         byte ptr [rcx+1Eh],al<\/code><\/pre>\n<p>After the change, we are able to use:<\/p>\n<pre><code class=\"language-assembly\">    movups      xmm0,xmmword ptr [rdx]\r\n    movups      xmmword ptr [rcx],xmm0\r\n    movups      xmm1,xmmword ptr [rdx+0Fh]\r\n    movups      xmmword ptr [rcx+0Fh],xmm1  ; overlaps previous store<\/code><\/pre>\n<p>The second improvement is to use YMM registers directly in the expansion under\n\/arch:AVX or higher. Previously, we would expand memset and memcpy using XMM\ncopies, and then a later optimization had to merge them into YMM copies. The\ndownside of the older approach was that if the expansion occurred within a\nloop, we&#8217;d end up with half as many bytes copied per iteration and twice the\nnumber of iterations. 
Expanding them directly as YMM copies permits fewer loop\niterations, and the loop can be removed entirely if only one iteration remains.<\/p>\n<p>Before, the incomplete merging looked like:<\/p>\n<pre><code class=\"language-assembly\">    mov         ecx,4\r\nlabel:\r\n    lea         rdx,[rdx+80h]\r\n    vmovups     ymm0,ymmword ptr [rax]\r\n    vmovups     xmm1,xmmword ptr [rax+70h]\r\n    lea         rax,[rax+80h]\r\n    vmovups     ymmword ptr [rdx-80h],ymm0\r\n    vmovups     ymm0,ymmword ptr [rax-60h]\r\n    vmovups     ymmword ptr [rdx-60h],ymm0\r\n    vmovups     ymm0,ymmword ptr [rax-40h]\r\n    vmovups     ymmword ptr [rdx-40h],ymm0\r\n    vmovups     xmm0,xmmword ptr [rax-20h]\r\n    vmovups     xmmword ptr [rdx-20h],xmm0\r\n    vmovups     xmmword ptr [rdx-10h],xmm1  ; incomplete YMM merging\r\n    sub         rcx,1\r\n    jne         label<\/code><\/pre>\n<p>After, direct expansion results in:<\/p>\n<pre><code class=\"language-assembly\">    mov         ecx,2  ; half as many iterations\r\nlabel:\r\n    lea         rdx,[rdx+100h]\r\n    vmovups     ymm0,ymmword ptr [rax]\r\n    vmovups     ymm1,ymmword ptr [rax+20h]\r\n    lea         rax,[rax+100h]\r\n    vmovups     ymmword ptr [rdx-100h],ymm0\r\n    vmovups     ymm0,ymmword ptr [rax-0C0h]\r\n    vmovups     ymmword ptr [rdx-0E0h],ymm1\r\n    vmovups     ymm1,ymmword ptr [rax-0A0h]\r\n    vmovups     ymmword ptr [rdx-0C0h],ymm0\r\n    vmovups     ymm0,ymmword ptr [rax-80h]\r\n    vmovups     ymmword ptr [rdx-0A0h],ymm1\r\n    vmovups     ymm1,ymmword ptr [rax-60h]\r\n    vmovups     ymmword ptr [rdx-80h],ymm0\r\n    vmovups     ymm0,ymmword ptr [rax-40h]\r\n    vmovups     ymmword ptr [rdx-60h],ymm1\r\n    vmovups     ymm1,ymmword ptr [rax-20h]\r\n    vmovups     ymmword ptr [rdx-40h],ymm0\r\n    vmovups     ymmword ptr [rdx-20h],ymm1\r\n    sub         rcx,1\r\n    jne         label<\/code><\/pre>\n<h2>Arm64 Bitwise Ops with Shifted Registers<\/h2>\n<p>The Arm64 instruction set has bitwise operations 
(AND, BIC, EON, EOR, ORN, ORR)\nthat can take a shifted register as a source. Previously, we did not always\ntake advantage of this option and emitted two instructions where one would\nsuffice. For example, instead of:<\/p>\n<pre><code class=\"language-assembly\">    ror     x8,x8,#5\r\n    eor     x0,x8,x0<\/code><\/pre>\n<p>We now emit:<\/p>\n<pre><code class=\"language-assembly\">    eor     x0, x0, x1, ror 5<\/code><\/pre>\n<h2>Ternary Operator Optimization<\/h2>\n<p>We enhanced optimization for the C++ ternary operator in the following two cases.\nThe first case had this form:<\/p>\n<pre><code class=\"language-cpp\">void foo(unsigned x, unsigned y) {\r\n    unsigned a = x &lt; 0x10000 ? 0 : 1;\r\n    unsigned b = y &lt; 0x10000 ? 0 : 1;\r\n\r\n    if (a | b) {\r\n        bar();\r\n    }\r\n}<\/code><\/pre>\n<p>The compiler used to emit:<\/p>\n<pre><code class=\"language-assembly\">    xor     r8d, r8d\r\n    cmp     ecx, 65536\r\n    mov     eax, r8d\r\n    setae   al\r\n    cmp     edx, 65536\r\n    setae   r8b\r\n    or      eax, r8d\r\n    jne     void bar(void)<\/code><\/pre>\n<p>By combining <code>x<\/code> and <code>y<\/code> first, the compiler now emits:<\/p>\n<pre><code class=\"language-assembly\">    or      edi, esi\r\n    cmp     edi, 65536\r\n    jae     void bar(void)<\/code><\/pre>\n<p>The second case had this form:<\/p>\n<pre><code class=\"language-cpp\">void foo(int size) {\r\n    if (!size) size = 1;\r\n    bar(size ? size : 1);\r\n}<\/code><\/pre>\n<p>Internally, the compiler transformed this into:<\/p>\n<pre><code class=\"language-cpp\">void foo(int size) {\r\n    size = size ? size : 1;\r\n    bar(size ? size : 1);\r\n}<\/code><\/pre>\n<p>but the compiler was not removing the redundant check of <code>size<\/code>. The problem was that the\ncompiler did not recognize that these expression patterns must be non-zero:<\/p>\n<pre><code class=\"language-cpp\">    x != 0 ? x : NonZeroExpression\r\n\r\n    x == 0 ? 
NonZeroExpression : x<\/code><\/pre>\n<p>We implemented those simplifications, and the redundant check is now removed.<\/p>\n<h2>Improved Copy Prop<\/h2>\n<p>Copy propagation eliminates unnecessary assignments. For example:<\/p>\n<pre><code class=\"language-cpp\">    a = ...;\r\n    b = a;\r\n    use(b);<\/code><\/pre>\n<p>Becomes:<\/p>\n<pre><code class=\"language-cpp\">    a = ...;\r\n    use(a);<\/code><\/pre>\n<p>The compiler frequently inserts copies during the process of optimization.\nMost of those copies are later optimized away, but they can be a hindrance to other optimizations in the meantime.<\/p>\n<p>We recently expanded the copy propagation that we run during the SSA Optimizer to remove copies that cross control flow boundaries:<\/p>\n<pre><code class=\"language-cpp\">a = ...;\r\nb = a;\r\n\r\nif (x) {\r\n    use(b);\r\n}<\/code><\/pre>\n<p>Becomes:<\/p>\n<pre><code class=\"language-cpp\">a = ...;\r\n\r\nif (x) {\r\n    use(a);\r\n}<\/code><\/pre>\n<p>Eliminating these additional copies earlier allows other optimizations to be\nmore effective.<\/p>\n<h2>Loop Unrolling<\/h2>\n<p>Loop unrolling reduces loop overhead. It can lead to larger but faster code.\nIn certain cases, loops can be unrolled completely, such that there is no\nlonger any loop. We removed some constraints on complete unrolling, such as\nthe requirement that the loop be an innermost loop with a single (natural)\nexit. In other words, MSVC can now completely unroll loops with multiple exits\n(also known as breakout loops or search loops) or outer loops.<\/p>\n<h2>Conclusion<\/h2>\n<p>The Microsoft C++ team released many new optimizations in MSVC Build Tools\nv14.51. We will continue focusing on performance while developing 14.52. 
Many\nthanks to the compiler engineers who implemented the optimizations and drafted\nearlier versions of sections of this blog post, including Alex Wong, Aman\nArora, Chris Pulido, Emily Bao, Eugene Rozenfeld, Matt Gardner, Sebastian\nPeryt, Swaroop Sridhar, and Terry Mahaffey.<\/p>\n<p>MSVC Build Tools v14.51 is currently in preview and is available in <a href=\"https:\/\/visualstudio.microsoft.com\/downloads\/\">Visual Studio 2026 Insiders<\/a>. Try it out today and share your feedback on <a href=\"https:\/\/developercommunity.visualstudio.com\/index.html\">Visual Studio Developer Community<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>MSVC Build Tools v14.51 improves performance through a wide range of new optimizations.<\/p>\n","protected":false},"author":129662,"featured_media":35800,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3946,1],"tags":[282],"class_list":["post-36345","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backend","category-cplusplus","tag-msvc"],"acf":[],"blog_post_summary":"<p>MSVC Build Tools v14.51 improves performance through a wide range of new 
optimizations.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/36345","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/129662"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=36345"}],"version-history":[{"count":4,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/36345\/revisions"}],"predecessor-version":[{"id":36366,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/36345\/revisions\/36366"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35800"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=36345"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=36345"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=36345"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}