February 21st, 2024

MSVC Backend Updates since Visual Studio 2022 version 17.3

Bran Hagger
software engineer

Since Visual Studio 2022 version 17.3, we have continued to improve the C++ backend with new features as well as new and improved optimizations. Here are some of our exciting improvements.

  • 17.9 improvements for x86 and x64, thanks to our friends at Intel.
    • Support for scalar FP intrinsics with double/float arguments.
    • Improve code generation by replacing VINSERTPS with VBLENDPS (x64 only).
    • Support for round scalar functions (a short example follows this list).
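    One of the 17.9 items above mentions support for round scalar functions. As a hedged illustration only, the sketch below uses the standard SSE4.1 scalar rounding intrinsic _mm_round_sd; nothing here is new API surface from this update, it just shows the kind of scalar rounding code involved:

      // Scalar rounding of the low double lane with an explicit rounding mode.
      #include <immintrin.h>
      #include <cstdio>

      int main() {
          __m128d v = _mm_set_sd(2.5);                  // low lane = 2.5
          // Round the low lane to the nearest integer, suppressing FP exceptions.
          __m128d r = _mm_round_sd(v, v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
          std::printf("%f\n", _mm_cvtsd_f64(r));        // 2.000000 (ties round to even)
          return 0;
      }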
  • 17.8 improvements
    • The new /ARM64XFUNCTIONPADMINX64:# flag allows specifying the number of bytes of padding for x64 functions in arm64x images
    • The new /NOFUNCTIONPADSECTION:sec flag allows disabling function padding for functions in a particular section
    • LTCG build takes better advantage of threads, improving throughput.
    • Support for RAO-INT, thanks to our friends at Intel.
    • Address sanitizer improvements:
      • The Address Sanitizer flag is now compatible with C++ modules.
      • The compiler will now report an error when /fsanitize=address is combined with an incompatible flag, instead of silently disabling ASAN checks.
      • ASAN checks are now emitted for loads and stores in memchr, memcmp, and the various string functions (see the repro sketch at the end of the 17.8 list).
    • Performance improvements that will help every architecture:
      • Improve hoisting of loads and stores outside of loops.
    • Performance improvements for arm64:
      • Improve memcmp performance on both arm64 and arm64ec.
      • When calling memcpy, memset, memchr, or memcmp from emulated x64 code, remove the performance overhead of switching to arm64ec versions of these functions.
      • Optimize scalar immediate loads (from our friends at ARM).
      • Combine CSET and ADD instructions into a single CINC instruction (from our friends at ARM).
    • Performance improvements for x86 and x64, many thanks to our friends at Intel:
      • Improve code generation for _mm_fmadd_sd.
      • Improve code generation for UMWAIT and TPAUSE, preserving implicit input registers.
      • Improve code generation for vector shift intrinsics by improving the auto-vectorizer.
      • Tune internal vectorization thresholds to improve auto-vectorization.
      • Implement optimization for FP classification beyond std::isnan.
      • Performance improvements for x64:
        • Generate a single PSHUFLW instruction for _mm_set1_epi16 when only the lower 64 bits of the result are used.
        • Improve code generation for abs(). (Thanks to our friends at AMD)
        • No longer generate redundant loads and stores when LDDQU is combined with VBROADCAST128.
        • Generate PMADDWD instead of PMULLD where possible.
        • Combine two contiguous stores into a single unaligned store.
        • Use 32 vector registers in functions that use AVX512 intrinsics even when not compiling with /arch:AVX512.
        • Don’t emit unnecessary register to register moves.
      • Performance improvements for x86:
        • Improve code generation for expf().
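    To make the 17.8 Address Sanitizer item concrete, here is a minimal, deliberately buggy sketch (the buffer sizes are invented) of the kind of out-of-bounds read through memchr that /fsanitize=address can now report; build it with cl /fsanitize=address /Zi:

      #include <cstring>
      #include <cstdlib>

      int main() {
          char* buf = static_cast<char*>(std::malloc(8));
          std::memset(buf, 'x', 8);
          // Deliberate bug: search 16 bytes inside an 8-byte allocation.
          void* hit = std::memchr(buf, 'y', 16);   // expected ASAN report: heap-buffer-overflow
          std::free(buf);
          return hit != nullptr;
      }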
  • 17.7 improvements
    • New /jumptablerdata flag places jump tables for switch statements in the .rdata section instead of the .text section.
    • Link time with a cold file system cache is now faster.
    • Improve compilation time of POGO-instrumented builds.
    • Speed up LTCG compilation in a variety of ways.
    • OpenMP improvements with /openmp:llvm, thanks to our friends at Intel:
      • #pragma omp atomic update and #pragma omp atomic capture no longer need to call into the runtime, improving performance (a sketch of both forms follows the 17.7 list).
      • Better code generation for OpenMP floating point atomics.
      • The clause schedule(static) is now respected for ordered loops.
    • Performance improvements for all architectures:
      • Copy propagation optimizations are now more effective, thanks to our friends from AMD.
      • Improve optimization for de Bruijn tables.
      • Fully unroll loops of fixed size even if they contain function calls.
      • Improve bit optimizations.
      • Deeply nested loops are now optimized.
    • Performance improvements and additional functionality for x86 and x64, many thanks to our friends at Intel:
      • Support Intel Sierra Forest instruction set (AVX-IFMA, AVX-NE-CONVERT, AVX-VNNI-INT8, CMPCCXADD, Additional MSR support).
      • Support Intel Granite Rapids instruction set (AMX-COMPLEX).
      • Support LOCK_SUB.
      • Add overflow detection functions for addition, subtraction, and multiplication.
      • Implement intrinsic functions for isunordered, isnan, isnormal, isfinite, isinf, issubnormal, fmax, and fmin.
      • Reduce code size of bitwise vector operations.
      • Improve code generation for AVX2 instructions during tail call optimization.
      • Improve code generation for floating point instructions without an SSE version.
      • Remove unneeded PAND instructions.
      • Improve assembler output for FP16 truncating conversions to use suppress-all-exceptions instead of embedded rounding.
      • Eliminate unnecessary hoisting of conversions from FP to unsigned long long.
      • Performance improvements for x64:
        • No longer emit unnecessary MOVSX/MOVZX instructions.
        • Do a better job of devirtualizing calls to class member functions.
        • Improve performance of memmove.
        • Improve code generation for XOR-EXTRACT combination pattern.
    • Performance improvements for arm64:
      • Improve register coloring for destinations of NEON BIT, BIF, and BSL instructions, thanks to our friends at ARM.
      • Convert cross-binary indirect calls that use the import address table into direct calls.
      • Add the _CountTrailingZeros and _CountTrailingZeros64 intrinsics for counting trailing zeros in integers.
      • Generate BFI instructions in more places.
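    The 17.7 OpenMP atomics item is easiest to picture with a small sketch. The loop bound below is arbitrary; the point is only the shape of the update and capture forms that /openmp:llvm can now compile without a call into the OpenMP runtime (compile with /openmp:llvm):

      #include <cstdio>

      int main() {
          int counter = 0;
          int snapshot = 0;

          #pragma omp parallel for
          for (int i = 0; i < 1000; ++i) {
              #pragma omp atomic update
              counter += 1;                        // atomic read-modify-write
          }

          #pragma omp atomic capture
          { snapshot = counter; counter += 1; }    // capture the old value, then update

          std::printf("snapshot=%d counter=%d\n", snapshot, counter);  // snapshot=1000 counter=1001
          return 0;
      }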
  • 17.6 improvements
    • The /openmp:llvm flag now supports the collapse clause on #pragma omp loop.
    • The new /d2AsanInstrumentationPerFunctionThreshold:# flag allows turning off ASAN instrumentation on functions that would add more than a certain number of extra ASAN calls.
    • The new /OTHERARCHEXPORTS option for dumpbin /EXPORTS dumps the x64 exports of an arm64x DLL.
    • Build time improvements:
      • Improved LTCG build throughput.
      • Reduced LTCG build memory usage.
      • Reduced link time during incremental linking.
    • Performance improvements that will help every architecture:
      • Vectorize loops that use min, max, and absolute value, thanks to our friends at ARM.
      • Turn loops with a[i] = ((a[i]>>15)&0x10001)*0xffff into vector compares.
      • Hoist calculation of array bases of the form (a + constant)[i] out of the loop.
    • Performance improvements on arm64:
      • Load floats directly into floating point registers instead of using integer load and FMOV instructions.
      • Improve code generation for abs(), thanks to our friends at ARM.
      • Improve code generation for vectors when NEON instructions are available.
      • Generate CSINC instructions when the conditional (?:) operator has the constant 1 as a possible result of the expression, thanks to our friends at ARM.
      • Improve code generation for loops that sum an array by using vector add instructions.
      • Combine vector extend and arithmetic instructions into a single instruction.
      • Remove extraneous additions, subtractions, and ORs with 0.
      • Auxiliary delayload IAT: a new import address table for calls into delayloaded DLLs in arm64x images. At runtime, Windows will patch this table to speed up program execution.
    • Performance improvements and additional features on x86 and x64, many thanks to our friends at Intel:
      • Support for Intel Granite Rapids x64 instruction set, specifically TDPFP16PS (AMX-FP16) and PREFETCHIT0/PREFETCHIT1.
      • Support for ties-to-away rounding for the round and roundf intrinsic functions (a worked example follows the 17.6 list).
      • Reduce small loops to vectors.
      • No longer generate redundant MOVD/MOVQ instructions.
      • Use VBLEND instructions instead of the slower VINSERTF128 and VBLENDPS instructions on AVX512 where possible.
      • Promote PCLMULQDQ instructions to VPCLMULQDQ where possible with /arch:AVX or later.
      • Replace VEXTRACTI128 instructions that extract the lower half of a vector with VMOVDQU instructions, thanks to our friends at AMD.
      • Support for missing AVX512-FP16 intrinsics.
      • Better code generation with correct VEX/EVEX encoding for VCMPXX pseudo-ops in MASM.
      • Improve conversions from 64-bit integer to floating-point.
      • Improve code generation on x64 with correct instruction scheduling for STMXCSR.
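    The ties-to-away item in the 17.6 list refers to the required semantics of round and roundf: halfway cases round away from zero. The worked example below is plain standard C++ and only demonstrates those semantics:

      #include <cmath>
      #include <cstdio>

      int main() {
          // round() rounds halfway cases away from zero (ties-to-away)...
          std::printf("%.1f %.1f\n", std::round(2.5), std::round(-2.5));          // 3.0 -3.0
          // ...unlike nearbyint() under the default round-to-nearest-even mode.
          std::printf("%.1f %.1f\n", std::nearbyint(2.5), std::nearbyint(-2.5));  // 2.0 -2.0
          return 0;
      }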
  • 17.5 improvements
    • The new /Zc:checkGwOdr flag allows for enforcing C++ standards for ODR violations even when compiling with /Gw.
    • Combine a MOV and a CSEL instruction into a CSINV instruction on arm64.
    • Performance and code quality improvements for x86 and x64, thanks to our friends at Intel:
      • Improve code generation for returns of structs consisting of 2 64-bit values on x64 (a sketch follows the 17.5 list).
      • Type conversions no longer generate unnecessary FSTP/FLD instructions.
      • Improve checking floating-point values for Not-a-Number.
      • Emit smaller sequences in the auto-vectorizer for bit masking and reductions.
      • Correct expansion of round to use ROUND instruction only under /fp:fast.
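    The 17.5 struct-return item is sketched below with a hypothetical function that returns a struct made of two 64-bit values; __umulh is used here only to fill the fields and requires a 64-bit MSVC target:

      #include <intrin.h>
      #include <cstdint>
      #include <cstdio>

      struct Pair { std::uint64_t lo; std::uint64_t hi; };  // a 16-byte struct of two 64-bit values

      // Hypothetical example: a full 64x64 -> 128-bit multiply returned by value.
      Pair widen_mul(std::uint64_t a, std::uint64_t b) {
          Pair p;
          p.lo = a * b;           // low 64 bits of the product
          p.hi = __umulh(a, b);   // high 64 bits (MSVC intrinsic on x64 and ARM64)
          return p;
      }

      int main() {
          Pair p = widen_mul(0xFFFFFFFFFFFFFFFFull, 2);
          std::printf("%llx %llx\n",
                      static_cast<unsigned long long>(p.hi),
                      static_cast<unsigned long long>(p.lo));  // 1 fffffffffffffffe
          return 0;
      }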
  • 17.4 improvements
    • Performance improvements that will help every architecture:
      • Improve bswap for signed integers.
      • Improve stack packing for functions with memset calls.
    • Improve the debugging support and performance for Arm64:
      • Edit and Continue is now possible for programs targeting Arm64.
      • Add support for Armv8 int8 matrix multiplication instructions.
      • Use BIC instructions in place of an MVN and AND.
      • Use BIC_SHIFT instruction where appropriate.
    • Performance and code quality improvements on x64 and x86, thanks to our friends at Intel:
      • std::memchr now meets the additional C++17 requirement of stopping as soon as a matching byte is found.
      • Improve code generation for 16-bit interlocked add.
      • Coalesce register initialization on AVX/AVX2.
      • Improve code generation for returns of structs consisting of 2 64-bit values.
      • Improve code generation for _mm_ucomieq_ss.
      • Use VROUNDXX instructions for ceil, floor, trunc, and round.
      • Improve checking floating-point values for Not-a-Number.
    • Support for OpenMP Standard 3.1 under the experimental -openmp:llvm switch has been expanded to include the min and max operators on the reduction clause (a sketch follows the 17.4 list).
    • Improve copy and move elision.
    • The new /Qspectre-jmp flag adds an int3 after unconditional jump instructions.
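    The 17.4 OpenMP reduction item is sketched below; the data values are arbitrary, and the program also compiles and runs correctly as plain serial C++ when OpenMP is not enabled:

      #include <cstdio>

      int main() {
          const int data[8] = {3, 9, 1, 7, 5, 2, 8, 4};
          int lo = data[0];
          int hi = data[0];

          // min/max reduction operators, accepted under the experimental -openmp:llvm switch.
          #pragma omp parallel for reduction(min:lo) reduction(max:hi)
          for (int i = 0; i < 8; ++i) {
              if (data[i] < lo) lo = data[i];
              if (data[i] > hi) hi = data[i];
          }

          std::printf("min=%d max=%d\n", lo, hi);  // min=1 max=8
          return 0;
      }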

Do you want to experience the new improvements in the C++ backend? Please download the latest Visual Studio 2022 and give it a try! Any feedback is welcome. We can be reached via the comments below, Developer Community, Twitter (@VisualC), or email at visualcpp@microsoft.com.

Stay tuned for more information on updates to the latest Visual Studio.

