In Visual Studio 2019 version 16.3 we added AVX-512 support to the auto-vectorizer of the MSVC compiler. This post will show some examples and help you enable it in your projects.
What is the auto vectorizer?
The compiler’s auto vectorizer analyzes loops in the user’s source code and generates vectorized code for a vectorization target where feasible and beneficial.
#include <cstdint>   // uint32_t
#include <cstdlib>   // _countof (MSVC-specific macro)

static const int length = 1024 * 8;
static float a[length];

float scalarAverage() {
    float sum = 0.0;
    for (uint32_t j = 0; j < _countof(a); ++j) {
        sum += a[j];
    }
    return sum / _countof(a);
}
For example, if I build the code above with cl.exe /O2 /fp:fast /arch:AVX2, I get the following assembly. Lines 11-15 are the vectorized loop, which uses ymm registers. Lines 16-21 compute the scalar sum from the vector values coming out of the vector loop. Each ymm register holds eight floats, so the vector loop needs only 1/8 as many add instructions as the scalar loop (and, because the loop is also unrolled by two here, 1/16 as many iterations), which usually translates to improved performance.
?scalarAverage@@YAMXZ (float __cdecl scalarAverage(void)):
00000000: push ebp
00000001: mov ebp,esp
00000003: and esp,0FFFFFFF0h
00000006: sub esp,10h
00000009: xor eax,eax
0000000B: vxorps xmm1,xmm1,xmm1
0000000F: vxorps xmm2,xmm2,xmm2
00000013: nop dword ptr [eax]
00000017: nop word ptr [eax+eax]
00000020: vaddps ymm1,ymm1,ymmword ptr ?a@@3PAMA[eax]
00000028: vaddps ymm2,ymm2,ymmword ptr ?a@@3PAMA[eax+20h]
00000030: add eax,40h
00000033: cmp eax,8000h
00000038: jb 00000020
0000003A: vaddps ymm0,ymm2,ymm1
0000003E: vhaddps ymm0,ymm0,ymm0
00000042: vhaddps ymm1,ymm0,ymm0
00000046: vextractf128 xmm0,ymm1,1
0000004C: vaddps xmm0,xmm1,xmm0
00000050: vmovaps xmmword ptr [esp],xmm0
00000055: fld dword ptr [esp]
00000058: fmul dword ptr [__real@39000000]
0000005E: vzeroupper
00000061: mov esp,ebp
00000063: pop ebp
00000064: ret
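To make the shape of that loop easier to follow, here is a rough scalar model of it in plain C++. It is only a sketch: the names modelVectorSum, kLanes and kUnroll are made up for illustration, the two lane arrays stand in for ymm1 and ymm2, and the final reduction adds the lanes in a simpler order than the vhaddps/vextractf128 sequence does. It also shows why the build uses /fp:fast: the additions are reassociated across lanes, which can change floating-point rounding.

```cpp
#include <cstddef>

// Hypothetical names, chosen for this sketch only.
constexpr std::size_t kLanes = 8;   // floats per 256-bit ymm register
constexpr std::size_t kUnroll = 2;  // the listing keeps two accumulators (ymm1, ymm2)

// Scalar model of the vectorized sum.
// Assumes n is a multiple of kLanes * kUnroll, as it is for the 8192-element array.
float modelVectorSum(const float* data, std::size_t n) {
    float acc[kUnroll][kLanes] = {};  // models the two vxorps instructions zeroing the accumulators

    // Vector loop: each iteration consumes kLanes * kUnroll = 16 floats.
    for (std::size_t i = 0; i < n; i += kLanes * kUnroll) {
        for (std::size_t u = 0; u < kUnroll; ++u) {
            // One vaddps: eight independent lane-wise additions.
            for (std::size_t lane = 0; lane < kLanes; ++lane) {
                acc[u][lane] += data[i + u * kLanes + lane];
            }
        }
    }

    // Horizontal reduction: combine the two accumulators, then add up the lanes.
    float sum = 0.0f;
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        sum += acc[0][lane] + acc[1][lane];
    }
    return sum;
}
```

Dividing the result by the element count gives the same average as scalarAverage, up to the rounding differences that reassociation introduces.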
What is AVX-512?
AVX-512 is a family of processor extensions introduced by Intel which enhance vectorization by extending vectors to 512 bits, doubling the number of vector registers, and introducing element-wise operation masking. You can detect support for AVX-512 using the __isa_available variable, which will be 6 or greater if AVX-512 support is found. This indicates support for the F (Foundation) instructions, as well as instructions from the VL, BW, DQ, and CD extensions, which add integer vector operations, 128-bit and 256-bit operations with the additional AVX-512 registers and masking, and instructions to detect address conflicts with scatter stores. These are the same instructions that are enabled by /arch:AVX512 as described below. These extensions are available on all processors with AVX-512 that Windows officially supports. More information about AVX-512 can be found in the following blog posts that we published previously.
- Microsoft Visual Studio 2017 Supports Intel® AVX-512
- Accelerating Compute-Intensive Workloads with Intel® AVX-512
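For readers who want to see what such a detection looks like without relying on the CRT's __isa_available variable, the sketch below queries CPUID directly instead. This is a swapped-in technique, not what the CRT does internally: it reads the feature bits in EBX of CPUID leaf 7, sub-leaf 0, for F, DQ, CD, BW and VL. The function and constant names are hypothetical. It checks only the CPU feature bits; a production check would also confirm that the operating system saves zmm state (OSXSAVE plus XGETBV), which is omitted here for brevity.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // __cpuid, __cpuidex
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>    // __get_cpuid_count
#endif

// CPUID.(EAX=7,ECX=0):EBX feature bits for the five extensions named above.
constexpr std::uint32_t kAvx512F  = 1u << 16;
constexpr std::uint32_t kAvx512DQ = 1u << 17;
constexpr std::uint32_t kAvx512CD = 1u << 28;
constexpr std::uint32_t kAvx512BW = 1u << 30;
constexpr std::uint32_t kAvx512VL = 1u << 31;

// True when all five feature bits are set in the given EBX value.
constexpr bool hasArchAvx512Bits(std::uint32_t leaf7Ebx) {
    constexpr std::uint32_t required =
        kAvx512F | kAvx512DQ | kAvx512CD | kAvx512BW | kAvx512VL;
    return (leaf7Ebx & required) == required;
}

// Reads EBX of CPUID leaf 7, sub-leaf 0. Returns 0 when the leaf is not
// available or when the code is not built for an x86 target.
std::uint32_t readLeaf7Ebx() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 0);                 // regs[0] = highest supported leaf
    if (regs[0] < 7) return 0;
    __cpuidex(regs, 7, 0);
    return static_cast<std::uint32_t>(regs[1]);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return ebx;
#else
    return 0;
#endif
}

// CPU-side check only; OS support for zmm state is assumed, not verified.
bool cpuReportsArchAvx512() {
    return hasArchAvx512Bits(readLeaf7Ebx());
}
```

Keeping the bit test in a separate constexpr function makes it possible to check the decoding logic on any machine, whether or not it has AVX-512.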
How to enable AVX-512 vectorization?
/arch:AVX512 is the compiler switch to enable AVX-512 support including auto vectorization. With this switch, the auto vectorizer may vectorize a loop using instructions from the F, VL, BW, DQ, and CD extensions in AVX-512.
To build your application with AVX-512 vectorization enabled:
- In the Visual Studio IDE, you can either add the flag /arch:AVX512 to the project Property Pages > C/C++ > Command Line > Additional Options text box, or select Project Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set > Advanced Vector Extension 512 (/arch:AVX512). The second approach is available in Visual Studio 2019 version 16.4.
- If you compile from the command line using cl.exe, add the flag /arch:AVX512 before any /link options.
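One way to confirm that the switch actually reached a given source file is to test the predefined macros the compiler sets for the selected /arch level. The sketch below assumes the macro names documented for /arch:AVX512 (__AVX512F__, __AVX512CD__, __AVX512BW__, __AVX512DQ__, __AVX512VL__) and for /arch:AVX2 (__AVX2__); the function name compiledVectorTarget and the returned strings are made up for illustration.

```cpp
// Reports which instruction-set level this translation unit was compiled for.
// This reflects compile-time settings only; it says nothing about the CPU
// the program eventually runs on.
const char* compiledVectorTarget() {
#if defined(__AVX512F__) && defined(__AVX512CD__) && defined(__AVX512BW__) && \
    defined(__AVX512DQ__) && defined(__AVX512VL__)
    return "AVX-512 (F, CD, BW, DQ, VL)";
#elif defined(__AVX2__)
    return "AVX2";
#else
    return "baseline";
#endif
}
```

Printing the result at startup, or checking the same macros with static_assert in files that must be built with /arch:AVX512, can catch a project configuration where the flag was set for only some files.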
If I build the prior example again using cl.exe /O2 /fp:fast /arch:AVX512, I’ll get the following assembly targeting AVX-512. Similarly, lines 7-11 are the vectorized loop. Note that the loop is vectorized with zmm registers instead of ymm registers. With the expanded width of the zmm registers, the AVX-512 vector loop runs only half as many iterations as its AVX2 version.
?scalarAverage@@YAMXZ (float __cdecl scalarAverage(void)):
00000000: push ecx
00000001: vxorps xmm0,xmm0,xmm0
00000005: vxorps xmm1,xmm1,xmm1
00000009: xor eax,eax
0000000B: nop dword ptr [eax+eax]
00000010: vaddps zmm0,zmm0,zmmword ptr ?a@@3PAMA[eax]
0000001A: vaddps zmm1,zmm1,zmmword ptr ?a@@3PAMA[eax+40h]
00000024: sub eax,0FFFFFF80h
00000027: cmp eax,8000h
0000002C: jb 00000010
0000002E: vaddps zmm1,zmm0,zmm1
00000034: vextractf32x8 ymm0,zmm1,1
0000003B: vaddps ymm1,ymm0,ymm1
0000003F: vextractf32x4 xmm0,ymm1,1
00000046: vaddps xmm1,xmm0,xmm1
0000004A: vpsrldq xmm0,xmm1,8
0000004F: vaddps xmm1,xmm0,xmm1
00000053: vpsrldq xmm0,xmm1,4
00000058: vaddss xmm0,xmm0,xmm1
0000005C: vmovss dword ptr [esp],xmm0
00000061: fld dword ptr [esp]
00000064: fmul dword ptr [__real@39000000]
0000006A: vzeroupper
0000006D: pop ecx
0000006E: ret
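The "half the iterations" claim can be checked with a little arithmetic on the constants in the two listings. The sketch below is just that arithmetic written as C++; the constant and function names are hypothetical. It also checks one detail of both listings: the constant __real@39000000 is the float 1/8192, so the final division by _countof(a) has been turned into a multiplication by the reciprocal.

```cpp
#include <cstdint>
#include <cstring>

// Loop bound shared by both listings: cmp eax,8000h (8192 floats * 4 bytes).
constexpr std::uint32_t kArrayBytes = 0x8000;

// Bytes consumed per loop iteration.
constexpr std::uint32_t kAvx2Stride = 0x40;    // add eax,40h (two ymm loads)
constexpr std::uint32_t kAvx512Stride = 0x80;  // sub eax,0FFFFFF80h adds 80h (two zmm loads)

constexpr std::uint32_t kAvx2Iterations = kArrayBytes / kAvx2Stride;      // 512
constexpr std::uint32_t kAvx512Iterations = kArrayBytes / kAvx512Stride;  // 256

static_assert(kAvx512Iterations * 2 == kAvx2Iterations,
              "the AVX-512 loop runs half as many iterations as the AVX2 loop");

// Returns the IEEE-754 bit pattern of a float.
std::uint32_t floatBits(float value) {
    std::uint32_t bits = 0;
    static_assert(sizeof(bits) == sizeof(value), "float is expected to be 32 bits");
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}
```

On a platform with IEEE-754 single-precision floats, floatBits(1.0f / 8192.0f) yields 0x39000000, matching the constant name in the listings.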
Closing remarks
For this release, our goal was parity with /arch:AVX2 in terms of vectorization capability. There are still many things that we plan to improve in future releases. For example, our next AVX-512 improvement will take advantage of the new masked instructions. Subsequent updates will support embedded broadcast, scatter, and 128-bit and 256-bit AVX-512 instructions in the VL extension.
As always, we’d like to hear your feedback and encourage you to download Visual Studio 2019 to try it out. If you encounter any issue or have any suggestion for us, please let us know through Help > Send Feedback > Report A Problem / Suggest a Feature in the Visual Studio IDE, via Developer Community, or on Twitter @visualc.
Also, be aware that the auto-vectorization will likely make your code run significantly slower. The for-loops that will be auto-vectorized are likely inconsequential from a performance point of view. But the added AVX512 instructions will cause the CPU to significantly downclock (e.g. https://en.wikichip.org/wiki/intel/xeon_gold/6142#Frequencies).
Added bonus: a single AVX512 instruction in one application will cause all processes of all users to get slowed down on the computer due to this. Great for shared computer resources in calc farms, virtual machines, clusters and other cloud infrastructures. But it's not like anyone uses those, right?
The added bonus was true on Haswell but is not true on Broadwell. Cores executing AVX2 will be capped separately from the others.
This article reminds me of "Auto-vectorization is not a programming model" from Matt Pharr's ISPC article.
https://pharr.org/matt/blog/2018/04/18/ispc-origins.html
https://pharr.org/matt/blog/2018/04/30/ispc-all.html
I'd like to remain polite... but after using this auto-vectorization "gift" (or rather, being comedically entertained by it for two decades), one has to feel a bit sad for the people putting so much engineering effort into a technological dead end. Spoiler: it never works except on the simplest academic for-loop cases.
Matt Pharr puts it a bit more eloquently than I do.
After programming using ISPC and CUDA, it's quite clear that Intel never realised that the proper programming model for SIMD matters...
Hi Tanguy,
Thank you very much for your feedback! What you shared about ISPC is a lovely read as well.
We have heuristics in place to keep the auto-vectorizer from hurting performance. The heuristics are far from ideal, though. If you observe any performance regression from auto-vectorization, please let us know. We will be diligent in addressing it.
In our experience, customers get out-of-the-box benefits from auto-vectorization. Typically, the more we vectorize, the faster our benchmarks run. We sincerely hope the performance wins we observed translate to customer scenarios as well.
Also, thank you for pointing out...
I maintain an open source project that distributes a compiled executable with the project. What happens if the executable is compiled with the /arch:AVX512 option and the computer it is run on doesn’t support AVX-512? What happens if the computer it is compiled on doesn’t support AVX-512?
Hi Jim,
As Tanguy has pointed out, your code will crash if it runs an AVX-512 instruction on a processor without AVX-512 support. The switch /arch:AVX512 targets the F, CD, BW, DQ, and VL extensions in the AVX-512 family, so it needs a processor supporting these extensions to work. To target another processor, you need to recompile your code or use a dynamic dispatcher, as Tanguy and Anton have pointed out.
On Windows, your DLL will fail to load with a strange error that is not obvious to decode. We have had this issue once. Different code paths for various architectures need to be implemented with a dispatcher, using __cpuid or __cpuidex. It effectively doubles or triples the code size. It is much simpler to switch to MKL (or OpenBLAS) and forget about problems like this.
It will crash.
You need to compile source files separately (one with AVX512 and one without) and dynamically dispatch at runtime between them depending on the CPU being used. Loads of fun.
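For anyone wondering what that dispatch looks like in practice, here is a minimal sketch under some loud assumptions. All names (AverageFn, averageBaseline, averageAvx512, selectAverage) are hypothetical. In a real project, averageAvx512 would live in its own source file compiled with /arch:AVX512, and the cpuHasAvx512 flag would come from a runtime check such as __cpuidex or __isa_available; here both kernels are plain scalar stand-ins in one file so the example stays self-contained and runs on any machine.

```cpp
#include <cstddef>

// Common signature shared by every build of the kernel.
using AverageFn = float (*)(const float*, std::size_t);

// Hypothetical baseline kernel: compiled without /arch:AVX512.
float averageBaseline(const float* data, std::size_t n) {
    if (n == 0) return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum / static_cast<float>(n);
}

// Hypothetical stand-in for the AVX-512 kernel. In a real project this
// function would be defined in a separate file built with /arch:AVX512.
float averageAvx512(const float* data, std::size_t n) {
    if (n == 0) return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum / static_cast<float>(n);
}

// Picks the kernel once, based on a runtime CPU check done by the caller.
AverageFn selectAverage(bool cpuHasAvx512) {
    return cpuHasAvx512 ? &averageAvx512 : &averageBaseline;
}
```

The key point is that the file containing selectAverage and the CPU check must itself be built without /arch:AVX512, otherwise the compiler may emit AVX-512 instructions before the check ever runs.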