Hello, I’m Ten Tzen, a Compiler Architect on the Visual C++ Compiler Code Generation team. Today, I’m going to introduce some noteworthy improvements in Visual Studio 2010.
Faster LTCG Compilation: LTCG (Link Time Code Generation) allows the compiler to perform better optimizations using information from all modules in the program (for more details see here). Because it must merge information from all modules, LTCG compilation generally takes longer than non-LTCG compilation, particularly for large applications. In VS2010, we improved the information-merging process and sped up LTCG compilation significantly. An LTCG build of Microsoft SQL Server (an application with a .text size greater than 50MB) is sped up by ~30%.
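As a reminder, LTCG is enabled by compiling with /GL and linking with /LTCG; a minimal build sketch (file names are illustrative):

    cl /c /O2 /GL a.cpp b.cpp
    link /LTCG /OUT:app.exe a.obj b.obj

With /GL, the .obj files carry intermediate language rather than final code, and the cross-module optimization and code generation described above happen at link time.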
Faster Pogo Instrumentation run: Profile Guided Optimization (PGO) is an approach to optimization where the compiler uses profile information to make better optimization decisions for the program. See here or here for an introduction to PGO. One major drawback of PGO is that the instrumented run is usually several times slower than a regular optimized run. In VS2010, we support a no-lock version of the instrumented binaries; with it, the instrumented (PGI) scenario runs are about 1.7x faster.
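For reference, a typical PGO build has three steps (file names are illustrative):

    cl /c /O2 /GL app.cpp
    link /LTCG:PGINSTRUMENT /OUT:app.exe app.obj
    app.exe                              (run training scenarios; each run writes .pgc count files)
    link /LTCG:PGOPTIMIZE /OUT:app.exe app.obj

The faster no-lock instrumentation in VS2010 shortens the training-run step in the middle.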
Code size reduction for X64 target: Code size is a crucial factor for performance, especially for applications that are sensitive to instruction-cache behavior or working-set size. In VS2010, several effective optimizations were introduced or improved for the X64 architecture. Some of the improvements are listed below:
· More aggressively use RBP as the frame pointer to access local variables. An RBP-relative address mode is one byte shorter than the equivalent RSP-relative mode (see the encoding sketch after this list).
· Enable tail merge optimizations in the presence of C++ EH or Windows SEH (see here and here for EH and SEH).
· Combine successive constant stores into one store (also shown in the sketch after this list).
· Recognize more cases where we can emit a 32-bit instruction for 64-bit immediate constants.
· Recognize more cases where we can use a 32-bit move instead of a 64-bit move.
· Optimize the code sequence of C++ EH destructor funclets.
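To make the RBP and constant-store items above concrete, here is a rough encoding sketch (byte counts follow the standard X64 instruction encoding; the exact code the compiler emits will vary):

    mov eax, dword ptr [rsp+28h]   ; RSP-relative: requires a SIB byte (4 bytes)
    mov eax, dword ptr [rbp-8]     ; RBP-relative: no SIB byte needed (3 bytes)

    mov dword ptr [rbp-10h], 0     ; two successive 4-byte constant stores
    mov dword ptr [rbp-0Ch], 0     ; (7 bytes each, 14 bytes total) can be
    mov qword ptr [rbp-10h], 0     ; combined into one 8-byte store (8 bytes)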
Altogether, we have observed code size reductions in the range of 3% to 10% in various Microsoft products such as the Windows kernel components, SQL, Excel, etc.
Improvements for “Speed”: As usual, there are also many code quality tunings and improvements across different code generation areas for “speed”. In this release, we have focused more on the X64 target. The following are some of the important changes that have contributed to these improvements:
· Identify and use the CMOV instruction when beneficial in more situations (see the sketch after this list)
· More effectively combine induction variables to reduce register pressure
· Improve detection of region constants for strength reduction in a loop
· Improve scalar replacement optimization in a loop
· Improved avoidance of store-forwarding stalls
· Use XMM registers for memcpy intrinsic
· Improve Inliner heuristics to identify and make more beneficial inlining decisions
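As a sketch of the CMOV item above (hypothetical code; the actual heuristics and register choices are the compiler's):

    int imax(int a, int b)
    {
        return a > b ? a : b;
    }

    ; possible X64 code using CMOV instead of a conditional branch
    ; (a arrives in ECX, b in EDX per the X64 calling convention):
    mov   eax, ecx
    cmp   ecx, edx
    cmovl eax, edx
    ret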
Overall, we see an 8% improvement as measured by integer benchmarks and a few percentage points of improvement on the floating-point suites for X64.
Better SIMD code generation for X86 and X64 targets: The quality of SSE/SSE2 SIMD code is crucial to game, audio, video and graphics developers. Unlike inline asm, which inhibits compiler optimization of the surrounding code, intrinsics were designed to allow effective optimization while still giving developers access to low-level control of the machine. In VS2010, we have added several simple but effective optimizations that focus on SIMD intrinsic quality and performance. Some of the improvements are listed below:
· Break false dependency: The scalar convert instructions (CVTSI2SD, CVTSI2SS, CVTSS2SD, or CVTSD2SS) do not modify the upper bits of the destination register. This creates a false dependency, which can significantly affect performance. To break the false dependency for memory-to-register conversions, the VS2010 compiler inserts MOVD/MOVSS/MOVSD to zero out the upper bits and uses the corresponding packed conversion. For instance,
cvtsi2ss xmm0, mem-operand   →   movd     xmm0, mem-operand
                                 cvtdq2ps xmm0, xmm0
For register-to-register conversions, XORPS is inserted to break the false dependency.
cvtsd2ss xmm1, xmm0   →   xorps    xmm1, xmm1
                          cvtsd2ss xmm1, xmm0
Even though this optimization may increase code size, we have observed significant performance improvements on several real-world programs and benchmarks.
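For reference, these conversions come straight from ordinary mixed-precision code; a minimal sketch:

    float narrow(double d)
    {
        return (float)d;   /* emits CVTSD2SS, the register-to-register case above */
    }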
· Perform vectorization for constant vector initializations: In VS2008, a simple initialization statement such as __m128 x = { 1, 2, 3, 4 } required ~10 instructions. With VS2010, it's optimized down to a couple of instructions. This applies to array and multi-dimensional initializations as well. The instructions generated for initialization statements like __m128 x[] = {{1,2,3,4}, {5,6}} or __m128 t2[][2] = {{{1,2},{3,4,5}}, {{6},{7,8,9}}}; are greatly reduced in VS2010.
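A sketch of what the improvement looks like (the constant-pool label below is illustrative; actual labels are compiler-generated):

    __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };

    ; VS2008: roughly ten instructions assembling the value element by element
    ; VS2010: the constant is placed in the read-only data section and loaded whole:
    movaps xmm0, XMMWORD PTR __xmm_constant   ; illustrative label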
· Optimize the _mm_set_*(), _mm_setr_*() and _mm_set1_*() intrinsic families: In VS2008, a series of unpack instructions was used to combine the scalar values. When all arguments are constants, this can be achieved with a single vector instruction. For example, the single statement return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7) required ~20 instructions to implement in previous releases, while only one instruction is required in VS2010.
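A self-contained version of that example (a minimal sketch; the function name is ours):

    #include <emmintrin.h>

    __m128i make_consts()
    {
        /* every argument is a compile-time constant, so VS2010 can emit a
           single 16-byte load from the constant pool instead of ~20 unpacks */
        return _mm_set_epi16(0, 1, 2, 3, -4, -5, 6, 7);
    }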
· Better register allocation for XMM registers, thus removing many redundant loads, stores and moves.
· Enable Compare & JCC CSE (Common Sub-expression Elimination) for SSE compares. For example, the code sequence on the left below is optimized to the sequence on the right:
ECX, CC1 = PCMPISTRI        ECX, CC1 = PCMPISTRI
JCC(EQ) CC1                 JCC(EQ) CC1
ECX, CC2 = PCMPISTRI   →    JCC(ULT) CC2
JCC(ULT) CC2                JCC(P) CC3
ECX, CC3 = PCMPISTRI
JCC(P) CC3
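In source terms, this allows several flag-reading SSE4.2 string-compare intrinsics over the same operands to share a single PCMPISTRI. A hypothetical sketch (the function and mode constant are ours; the intrinsics come from nmmintrin.h):

    #include <nmmintrin.h>

    #define MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH)

    int classify(__m128i a, __m128i b)
    {
        if (_mm_cmpistrz(a, b, MODE))   /* reads ZF -- the JCC(EQ) above  */
            return 0;
        if (_mm_cmpistrc(a, b, MODE))   /* reads CF -- the JCC(ULT) above */
            return 1;
        if (_mm_cmpistrs(a, b, MODE))   /* reads SF */
            return 2;
        return 3;
    }

With the CSE enabled, the three compares can collapse into one PCMPISTRI followed by three conditional jumps.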
Support for AVX in Intel and AMD processors: Intel AVX (Intel Advanced Vector Extensions) is a 256-bit instruction set extension to SSE, designed for floating-point-intensive applications (see here and here for detailed information from Intel and AMD respectively). In the VS2010 release, all AVX features and instructions are fully supported via intrinsics and /arch:AVX. Many optimizations have been added to improve the quality of AVX code generation; these will be described in more detail in an upcoming blog post. In addition to AVX support in the compiler, the Microsoft Macro Assembler (MASM) in VS2010 also supports the Intel AVX instruction set for x86 and x64.
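As a quick illustration (a minimal sketch; the function is ours), AVX intrinsics are declared in immintrin.h, and code like the following should compile to a single 256-bit VADDPS when built with cl /O2 /arch:AVX:

    #include <immintrin.h>

    __m256 add8(__m256 a, __m256 b)
    {
        return _mm256_add_ps(a, b);   /* one 256-bit packed-float add */
    }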
More precise Floating Point computation with /fp:fast: To achieve maximum speed, the compiler is allowed to optimize floating-point computation aggressively under the /fp:fast option. The consequence is that floating-point errors can accumulate, and a result can be so inaccurate that it severely affects the outcome of the program. For example, we observed that more than half of the programs in the floating-point benchmark suite fail with /fp:fast in VS2008 on the X64 target. In order to make /fp:fast more useful, we “down-tuned” a couple of optimizations in VS2010. This change could slightly affect the performance of some programs previously built with /fp:fast, but it will improve their accuracy. And if your programs were failing with /fp:fast in earlier releases, you may see better results with VS2010.
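As a small illustration of the kind of divergence /fp:fast permits (a sketch; the exact outcome depends on the optimizations the compiler chooses):

    #include <stdio.h>

    int main()
    {
        float big = 1.0e8f, small = 1.0f;
        /* Evaluated strictly left to right, (big + small) rounds back to big,
           so the result is 0.0f.  Under /fp:fast the compiler may reassociate
           this to (big - big) + small and produce 1.0f instead. */
        float r = big + small - big;
        printf("%f\n", r);
        return 0;
    }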
Conclusion: The Visual C++ team cares about the performance of applications built with our compiler, and we continue to work with customers and CPU vendors to improve code generation. If you see issues or opportunities for improvement, please let us know through Connect or through our blog.