{"id":33705,"date":"2024-02-21T16:30:09","date_gmt":"2024-02-21T16:30:09","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=33705"},"modified":"2024-09-10T07:55:18","modified_gmt":"2024-09-10T07:55:18","slug":"msvc-backend-updates-since-visual-studio-2022-version-17-3","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-backend-updates-since-visual-studio-2022-version-17-3\/","title":{"rendered":"MSVC Backend Updates since Visual Studio 2022 version 17.3"},"content":{"rendered":"<p>Since <a href=\"https:\/\/visualstudio.microsoft.com\/vs\/features\/cplusplus\/\">Visual Studio 2022<\/a> version 17.3 we have continued to improve the C++ backend with new features and new and improved optimizations. Here are some of our exciting improvements.<\/p>\n<ul>\n<li>17.9 improvements for x86 and x64, thanks to our friends at Intel.\n<ul>\n<li>Support for Scalar FP intrinsics with double\/float arguments<\/li>\n<li>Improve code generation by replacing <code>VINSERTPS<\/code> with <code>VBLENDPS<\/code> for x64 only<\/li>\n<li>Support for round scalar functions<\/li>\n<\/ul>\n<\/li>\n<li>17.8 improvements\n<ul>\n<li>The new <a href=\"https:\/\/learn.microsoft.com\/cpp\/build\/reference\/arm64-function-pad-min-x64\">\/ARM64XFUNCTIONPADMINX64:#<\/a> flag allows specifying the number of bytes of padding for x64 functions in arm64x images<\/li>\n<li>The new <a href=\"https:\/\/learn.microsoft.com\/cpp\/build\/reference\/no-function-pad-section\">\/NOFUNCTIONPADSECTION:sec<\/a> flag allows disabling function padding for functions in a particular section<\/li>\n<li>LTCG build takes better advantage of threads, improving throughput.<\/li>\n<li>Support for RAO-INT, thanks to our friends at Intel.<\/li>\n<li>Address sanitizer improvements:\n<ul>\n<li>The Address Sanitizer flag is now compatible with C++ modules.<\/li>\n<li>The compiler will now report an error when <code>\/fsanitize=address<\/code> is combined with an incompatible 
flag, instead of silently disabling ASAN checks.<\/li>\n<li>ASAN checks are now emitted for loads and stores in memchr, memcmp, and the various string functions.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements that will help every architecture:\n<ul>\n<li>Improve hoisting of loads and stores outside of loops.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements for arm64:\n<ul>\n<li>Improve memcmp performance on both arm64 and arm64ec.<\/li>\n<li>When calling memcpy, memset, memchr, or memcmp from emulated x64 code, remove the performance overhead of switching to arm64ec versions of these functions.<\/li>\n<li>Optimize scalar immediate loads (from our friends at ARM)<\/li>\n<li>Combine <code>CSET<\/code> and <code>ADD<\/code> instructions into a single <code>CINC<\/code> instruction (from our friends at ARM)<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements for x86 and x64, many thanks to our friends at Intel:\n<ul>\n<li>Improve code generation for _mm_fmadd_sd.<\/li>\n<li>Improve code generation for <code>UMWAIT<\/code> and <code>TPAUSE<\/code>, preserving implicit input registers.<\/li>\n<li>Improve code generation for vector shift intrinsics by improving auto-vectorizer.<\/li>\n<li>Tune internal vectorization thresholds to improve auto-vectorization.<\/li>\n<li>Implement optimization for FP classification beyond std::isnan.<\/li>\n<li>Performance improvements for x64:\n<ul>\n<li>Generate a single <code>PSHUFLW<\/code> instruction for _mm_set1_epi16 when only the lower 64 bits of the result are used.<\/li>\n<li>Improve code generation for abs(). 
(Thanks to our friends at AMD)<\/li>\n<li>No longer generate redundant loads and stores when <code>LDDQU<\/code> is combined with <code>VBROADCAST128<\/code>.<\/li>\n<li>Generate <code>PMADDWD<\/code> instead of <code>PMULLD<\/code> where possible.<\/li>\n<li>Combine two contiguous stores into a single unaligned store.<\/li>\n<li>Use 32 vector registers in functions that use AVX512 intrinsics even when not compiling with \/arch:AVX512.<\/li>\n<li>Don&#8217;t emit unnecessary register to register moves.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements for x86:\n<ul>\n<li>Improve code generation for expf().<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>17.7 improvements\n<ul>\n<li>New <a href=\"https:\/\/learn.microsoft.com\/cpp\/build\/reference\/jump-table-rdata\">\/jumptablerdata<\/a> flag places jump tables for switch statements in the .rdata section instead of the .text section.<\/li>\n<li>Link time with a cold file system cache is now faster.<\/li>\n<li>Improve compilation time of POGO-instrumented builds.<\/li>\n<li>Speed up LTCG compilation in a variety of ways.<\/li>\n<li>OpenMP improvements with \/openmp:llvm, thanks to our friends at Intel:\n<ul>\n<li><code>#pragma omp atomic update<\/code> and <code>#pragma omp atomic capture<\/code> no longer need to call into the runtime, improving performance.<\/li>\n<li>Better code generation for OpenMP floating point atomics.<\/li>\n<li>The clause <code>schedule(static)<\/code> is now respected for ordered loops.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements for all architectures:\n<ul>\n<li>Copy propagation optimizations are now more effective, thanks to our friends from AMD.<\/li>\n<li>Improve optimization for DeBruijn table.<\/li>\n<li>Fully unroll loops of fixed size even if they contain function calls.<\/li>\n<li>Improve bit optimizations.<\/li>\n<li>Deeply nested loops are now optimized.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements and additional functionality for x86 and x64, many 
thanks to our friends at Intel:\n<ul>\n<li>Support Intel Sierra Forest instruction set (AVX-IFMA, AVX-NE-CONVERT, AVX-VNNI-INT8, CMPCCXADD, Additional MSR support).<\/li>\n<li>Support Intel Granite Rapids instruction set (AMX-COMPLEX).<\/li>\n<li>Support <code>LOCK_SUB<\/code>.<\/li>\n<li>Add overflow detection functions for addition, subtraction, and multiplication.<\/li>\n<li>Implement intrinsic functions for isunordered, isnan, isnormal, isfinite, isinf, issubnormal, fmax, and fmin.<\/li>\n<li>Reduce code size of bitwise vector operations.<\/li>\n<li>Improve code generation for AVX2 instructions during tail call optimization.<\/li>\n<li>Improve code generation for floating point instructions without an SSE version.<\/li>\n<li>Remove unneeded PAND instructions.<\/li>\n<li>Improve assembler output for FP16 truncating conversions to use suppress-all-exceptions instead of embedded rounding.<\/li>\n<li>Eliminate unnecessary hoisting of conversions from FP to unsigned long long.<\/li>\n<li>Performance improvements for x64:\n<ul>\n<li>No longer emit unnecessary <code>MOVSX<\/code>\/<code>MOVZX<\/code> instructions.<\/li>\n<li>Do a better job of devirtualizing calls to class functions.<\/li>\n<li>Improve performance of memmove.<\/li>\n<li>Improve code generation for <code>XOR-EXTRACT<\/code> combination pattern.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements for arm64:\n<ul>\n<li>Improve register coloring for destinations of NEON <code>BIT<\/code>, <code>BIF<\/code>, and <code>BSL<\/code> instructions, thanks to our friends at ARM.<\/li>\n<li>Convert cross-binary indirect calls that use the import address table into direct calls.<\/li>\n<li>Add the <code>_CountTrailingZeros<\/code> and <code>_CountTrailingZeros64<\/code> <a href=\"https:\/\/learn.microsoft.com\/cpp\/intrinsics\/arm64-intrinsics\">intrinsics for counting trailing zeros in integers<\/a><\/li>\n<li>Generate BFI instructions in more 
places.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>17.6 improvements\n<ul>\n<li>The <code>\/openmp:llvm<\/code> flag now supports the <code>collapse<\/code> clause on <code>#pragma omp loop<\/code> (<a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/open-mp-improvements-in-visual-studio-cpp\/\">Full Details<\/a>.)<\/li>\n<li>The new <code>\/d2AsanInstrumentationPerFunctionThreshold:#<\/code> flag allows turning off ASAN instrumentation on functions that would add more than a certain number of extra ASAN calls.<\/li>\n<li>New <code>\/OTHERARCHEXPORTS<\/code> option for <code>dumpbin \/EXPORTS<\/code> will dump the x64 exports of an arm64x dll.<\/li>\n<li>Build time improvements:\n<ul>\n<li>Improved LTCG build throughput.<\/li>\n<li>Reduced LTCG build memory usage.<\/li>\n<li>Reduced link time during incremental linking.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements that will help every architecture:\n<ul>\n<li>Vectorize loops that use min, max, and absolute, thanks to our friends at ARM.<\/li>\n<li>Turn loops with <code>a[i] = ((a[i]&gt;&gt;15)&amp;0x10001)*0xffff<\/code> into vector compares.<\/li>\n<li>Hoist calculation of array bases of the form <code>(a + constant)[i]<\/code> out of the loop.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements on arm64:\n<ul>\n<li>Load floats directly into floating point registers instead of using integer load and FMOV instructions.<\/li>\n<li>Improve code generation for abs(), thanks to our friends at ARM.<\/li>\n<li>Improve code generation for vectors when NEON instructions are available.<\/li>\n<li>Generate CSINC instructions when the ? 
operator has the constant 1 as a possible result of the expression, thanks to our friends at ARM.<\/li>\n<li>Improve code generation for loops that sum an array by using vector add instructions.<\/li>\n<li>Combine vector extend and arithmetic instructions into a single instruction.<\/li>\n<li>Remove extraneous adds, subtractions, and ors with 0.<\/li>\n<li>Auxiliary delayload IAT: new import address table for calls into delayloaded DLLs in arm64x. At runtime, Windows will patch this table to speed up program execution.<\/li>\n<\/ul>\n<\/li>\n<li>Performance improvements and additional features on x86 and x64, many thanks to our friends at Intel:\n<ul>\n<li>Support for Intel Granite Rapids x64 instruction set, specifically <code>TDPFP16PS<\/code> (AMX-FP16) and <code>PREFETCHIT0<\/code>\/<code>PREFETCHIT1<\/code>.<\/li>\n<li>Support for ties-to-away rounding for <code>__round<\/code> and <code>__roundf<\/code> intrinsic functions.<\/li>\n<li>Reduce small loops to vectors.<\/li>\n<li>No longer generate redundant <code>MOVD<\/code>\/<code>MOVQ<\/code> instructions.<\/li>\n<li>Use <code>VBLEND<\/code> instructions instead of the slower <code>VINSERTF128<\/code> and <code>VBLENDPS<\/code> instructions on AVX512 where possible.<\/li>\n<li>Promote <code>PCLMULQDQ<\/code> instructions to <code>VPCLMULQDQ<\/code> where possible with \/arch:AVX or later.<\/li>\n<li>Replace <code>VEXTRACTI128<\/code> instructions that extract the lower half of a vector with <code>VMOVDQU<\/code> instructions, thanks to our friends at AMD.<\/li>\n<li>Support for missing AVX512-FP16 intrinsics.<\/li>\n<li>Better code generation with correct VEX\/EVEX encoding for VCMPXX pseudo-ops in MASM.<\/li>\n<li>Improve conversions from 64-bit integer to floating-point.<\/li>\n<li>Improve code generation on x64 with correct instruction scheduling for <code>STMXCSR<\/code>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>17.5 improvements\n<ul>\n<li>The new <a 
href=\"https:\/\/learn.microsoft.com\/cpp\/build\/reference\/zc-check-gwodr\">\/Zc:checkGwOdr flag<\/a> allows for enforcing C++ standards for ODR violations even when compiling with \/Gw.<\/li>\n<li>Combine a <code>MOV<\/code> and a <code>CSEL<\/code> instruction into a <code>CSINV<\/code> instruction on arm64.<\/li>\n<li>Performance and code quality improvements for x86 and x64, thanks to our friends at Intel:\n<ul>\n<li>Improve code generation for returns of structs consisting of 2 64-bit values on x64.<\/li>\n<li>Type conversions no longer generate unnecessary <code>FSTP<\/code>\/<code>FLD<\/code> instructions.<\/li>\n<li>Improve checking floating-point values for Not-a-Number.<\/li>\n<li>Emit smaller sequence in auto-vectorizer with bit masking and reduction.<\/li>\n<li>Correct expansion of round to use ROUND instruction only under \/fp:fast.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>17.4 improvements\n<ul>\n<li>Performance improvements that will help every architecture:\n<ul>\n<li>Improve bswap for signed integers.<\/li>\n<li>Improve stackpacking for functions with memset calls.<\/li>\n<\/ul>\n<\/li>\n<li>Improve the debugging support and performance for Arm64:\n<ul>\n<li>Edit and Continue is now possible for programs targeting Arm64.<\/li>\n<li>Added support for armv8 int8 matrix multiplication instructions.<\/li>\n<li>Use <code>BIC<\/code> instructions in place of an <code>MVN<\/code> and <code>AND<\/code>.<\/li>\n<li>Use <code>BIC_SHIFT<\/code> instruction where appropriate.<\/li>\n<\/ul>\n<\/li>\n<li>Performance and code quality improvements on x64 and x86, thanks to our friends at Intel:\n<ul>\n<li>std::memchr now meets the additional C++17 requirement of stopping as soon as a matching byte is found.<\/li>\n<li>Improve code generation for 16-bit interlocked add.<\/li>\n<li>Coalesce register initialization on AVX\/AVX2.<\/li>\n<li>Improve code generation for returns of structs consisting of 2 64-bit values.<\/li>\n<li>Improve codegen for 
_mm_ucomieq_ss.<\/li>\n<li>Use <code>VROUNDXX<\/code> instructions for ceil, floor, <code>__trunc<\/code>, and <code>__round<\/code>.<\/li>\n<li>Improve checking floating-point values for Not-a-Number.<\/li>\n<\/ul>\n<\/li>\n<li>Support for OpenMP Standard 3.1 under the experimental <code>-openmp:llvm<\/code> switch expanded to include the <code>min<\/code> and <code>max<\/code> operators on the <code>reduction<\/code> clause.<\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/improving-copy-and-move-elision\/\">Improve copy and move elision<\/a><\/li>\n<li>The new <a href=\"https:\/\/learn.microsoft.com\/cpp\/build\/reference\/qspectre-jmp\">\/Qspectre-jmp flag<\/a> adds an int3 after unconditional jump instructions.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Do you want to experience the new improvements in the C++ backend? Please <a href=\"https:\/\/visualstudio.microsoft.com\/vs\/features\/cplusplus\/\">download the latest Visual Studio 2022<\/a> and give it a try! Any feedback is welcome. We can be reached via the comments below, <a href=\"https:\/\/developercommunity.visualstudio.com\/cpp\">Developer Community<\/a>, Twitter (<a href=\"https:\/\/twitter.com\/visualc\">@VisualC<\/a>), or email at <a href=\"mailto:visualcpp@microsoft.com\">visualcpp@microsoft.com<\/a>.<\/p>\n<p>Stay tuned for more information on updates to the latest Visual Studio.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since Visual Studio 2022 version 17.3, we have continued to improve the C++ backend with new features, improved support for arm64 and OpenMP, and new and improved optimizations across all 
architectures.<\/p>\n","protected":false},"author":18811,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3946,1],"tags":[],"class_list":["post-33705","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-backend","category-cplusplus"],"acf":[],"blog_post_summary":"<p>Since Visual Studio 2022 version 17.3, we have continued to improve the C++ backend with new features, improved support for arm64 and OpenMP, and new and improved optimizations across all architectures.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/33705","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/18811"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=33705"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/33705\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=33705"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=33705"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=33705"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}