{"id":25584,"date":"2020-02-27T14:40:13","date_gmt":"2020-02-27T14:40:13","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=25584"},"modified":"2020-02-27T14:40:13","modified_gmt":"2020-02-27T14:40:13","slug":"avx-512-auto-vectorization-in-msvc","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/avx-512-auto-vectorization-in-msvc\/","title":{"rendered":"AVX-512 Auto-Vectorization in MSVC"},"content":{"rendered":"<p>In <a href=\"https:\/\/docs.microsoft.com\/en-us\/visualstudio\/releases\/2019\/release-notes-v16.3#16.3.0\">Visual Studio 2019 version 16.3<\/a> we added AVX-512 support to the auto-vectorizer of the MSVC compiler. This post will show some examples and help you enable it in your projects.<\/p>\n<h3>What is the auto vectorizer?<\/h3>\n<p>The compiler\u2019s <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/parallel\/auto-parallelization-and-auto-vectorization?view=vs-2019\">auto vectorizer<\/a> analyzes loops in the user\u2019s source code and generates vectorized code for a vectorization target where feasible and beneficial.<\/p>\n<pre class=\"nums:true lang:c++ decode:true \">#include &lt;stdint.h&gt; \/\/ uint32_t\r\n#include &lt;stdlib.h&gt; \/\/ _countof (MSVC-specific)\r\n\r\nstatic const int length = 1024 * 8;\r\nstatic float a[length];\r\nfloat scalarAverage() {\r\n    float sum = 0.0f;\r\n    for (uint32_t j = 0; j &lt; _countof(a); ++j) {\r\n        sum += a[j];\r\n    }\r\n\r\n    return sum \/ _countof(a);\r\n}<\/pre>\n<p><span style=\"font-size: 1rem;\">For example, if I build the code above using <\/span><strong style=\"font-size: 1rem;\">cl.exe \/O2 \/fp:fast \/arch:AVX2<\/strong><span style=\"font-size: 1rem;\"> targeting AVX2, I get the following assembly. Lines 11-15 are the vectorized loop using ymm registers. Lines 16-21 compute the scalar value <\/span><strong style=\"font-size: 1rem;\">sum<\/strong><span style=\"font-size: 1rem;\"> from the vector values coming out of the vector loop. 
Note that each ymm register holds eight floats and the loop is unrolled by two, so the vector loop runs only 1\/16 as many iterations as the scalar loop, which usually translates to improved performance.<\/span><\/p>\n<pre class=\"nums:true lang:asm mark:11-15 decode:true\">?scalarAverage@@YAMXZ (float __cdecl scalarAverage(void)):\r\n00000000: push ebp\r\n00000001: mov ebp,esp\r\n00000003: and esp,0FFFFFFF0h\r\n00000006: sub esp,10h\r\n00000009: xor eax,eax\r\n0000000B: vxorps xmm1,xmm1,xmm1\r\n0000000F: vxorps xmm2,xmm2,xmm2\r\n00000013: nop dword ptr [eax]\r\n00000017: nop word ptr [eax+eax]\r\n00000020: vaddps ymm1,ymm1,ymmword ptr ?a@@3PAMA[eax]\r\n00000028: vaddps ymm2,ymm2,ymmword ptr ?a@@3PAMA[eax+20h]\r\n00000030: add eax,40h\r\n00000033: cmp eax,8000h\r\n00000038: jb 00000020\r\n0000003A: vaddps ymm0,ymm2,ymm1\r\n0000003E: vhaddps ymm0,ymm0,ymm0\r\n00000042: vhaddps ymm1,ymm0,ymm0\r\n00000046: vextractf128 xmm0,ymm1,1\r\n0000004C: vaddps xmm0,xmm1,xmm0\r\n00000050: vmovaps xmmword ptr [esp],xmm0\r\n00000055: fld dword ptr [esp]\r\n00000058: fmul dword ptr [__real@39000000]\r\n0000005E: vzeroupper\r\n00000061: mov esp,ebp\r\n00000063: pop ebp\r\n00000064: ret<\/pre>\n<h3><span style=\"color: inherit; font-family: inherit; font-size: 18pt;\">What is AVX-512?<\/span><\/h3>\n<p>AVX-512 is a family of processor extensions introduced by Intel that enhance <a href=\"https:\/\/blogs.msdn.microsoft.com\/nativeconcurrency\/2012\/04\/12\/what-is-vectorization\/\">vectorization<\/a> by extending vectors to 512 bits, doubling the number of vector registers, and introducing element-wise operation masking. You can detect support for AVX-512 using the __isa_available variable, which will be 6 or greater if AVX-512 support is found. 
This indicates support for the F (Foundation) instructions, as well as instructions from the VL, BW, DQ, and CD extensions, which add integer vector operations, 128-bit and 256-bit operations with the additional AVX-512 registers and masking, and instructions to detect address conflicts with scatter stores. These are the same instructions that are enabled by <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/build\/reference\/arch-x64?view=vs-2019\">\/arch:AVX512<\/a> as described below. These extensions are available on all processors with AVX-512 that Windows officially supports. More information about AVX-512 can be found in the following blog posts that we published earlier.<\/p>\n<ul>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/microsoft-visual-studio-2017-supports-intel-avx-512\/\">Microsoft Visual Studio 2017 Supports Intel\u00ae AVX-512<\/a><\/li>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/accelerating-compute-intensive-workloads-with-intel-avx-512\/\">Accelerating Compute-Intensive Workloads with Intel\u00ae AVX-512<\/a><\/li>\n<\/ul>\n<h3>How to enable AVX-512 vectorization?<\/h3>\n<p>\/arch:AVX512 is the compiler switch that enables AVX-512 support, including auto vectorization. With this switch, the auto vectorizer may vectorize a loop using instructions from the F, VL, BW, DQ, and CD extensions in AVX-512.<\/p>\n<p>To build your application with AVX-512 vectorization enabled:<\/p>\n<ul>\n<li>In the Visual Studio IDE, you can either add the flag \/arch:AVX512 to the project Property Pages &gt; C\/C++ &gt; Command Line &gt; Additional Options text box, or select Project Properties &gt; Configuration Properties &gt; C\/C++ &gt; Code Generation &gt; Enable Enhanced Instruction Set &gt; Advanced Vector Extension 512 (\/arch:AVX512). 
The second approach is available in Visual Studio 2019 version 16.4.<\/li>\n<li>If you compile from the command line using cl.exe, add the flag \/arch:AVX512 before any \/link options.<\/li>\n<\/ul>\n<p>If I build the prior example again using <strong>cl.exe \/O2 \/fp:fast \/arch:AVX512<\/strong>, I\u2019ll get the following assembly targeting AVX-512. As before, lines 7-11 are the vectorized loop. Note that the loop is vectorized with zmm registers instead of ymm registers. Because zmm registers are twice as wide, the AVX-512 vector loop runs only half as many iterations as its AVX2 version.<\/p>\n<pre class=\"nums:true lang:asm mark:7-11 decode:true\">?scalarAverage@@YAMXZ (float __cdecl scalarAverage(void)):\r\n00000000: push ecx\r\n00000001: vxorps xmm0,xmm0,xmm0\r\n00000005: vxorps xmm1,xmm1,xmm1\r\n00000009: xor eax,eax\r\n0000000B: nop dword ptr [eax+eax]\r\n00000010: vaddps zmm0,zmm0,zmmword ptr ?a@@3PAMA[eax]\r\n0000001A: vaddps zmm1,zmm1,zmmword ptr ?a@@3PAMA[eax+40h]\r\n00000024: sub eax,0FFFFFF80h\r\n00000027: cmp eax,8000h\r\n0000002C: jb 00000010\r\n0000002E: vaddps zmm1,zmm0,zmm1\r\n00000034: vextractf32x8 ymm0,zmm1,1\r\n0000003B: vaddps ymm1,ymm0,ymm1\r\n0000003F: vextractf32x4 xmm0,ymm1,1\r\n00000046: vaddps xmm1,xmm0,xmm1\r\n0000004A: vpsrldq xmm0,xmm1,8\r\n0000004F: vaddps xmm1,xmm0,xmm1\r\n00000053: vpsrldq xmm0,xmm1,4\r\n00000058: vaddss xmm0,xmm0,xmm1\r\n0000005C: vmovss dword ptr [esp],xmm0\r\n00000061: fld dword ptr [esp]\r\n00000064: fmul dword ptr [__real@39000000]\r\n0000006A: vzeroupper\r\n0000006D: pop ecx\r\n0000006E: ret<\/pre>\n<h3><span style=\"color: inherit; font-family: inherit; font-size: 3rem;\"><span style=\"font-size: 24pt;\">Closing remarks<\/span><\/span><\/h3>\n<p>For this release, we aimed to achieve parity with \/arch:AVX2 in terms of vectorization capability. There are still many things that we plan to improve in future releases. 
For example, our next AVX-512 improvement will take advantage of the new masked instructions. Subsequent updates will support embedded broadcast, scatter, and 128-bit and 256-bit AVX-512 instructions in the VL extension.<\/p>\n<p>As always, we\u2019d like to hear your feedback and encourage you to download <a href=\"https:\/\/visualstudio.microsoft.com\/vs\/\">Visual Studio 2019<\/a> to try it out. If you encounter any issues or have any suggestions for us, please let us know through <strong>Help &gt; Send Feedback &gt; Report A Problem \/ Suggest a Feature<\/strong> in the Visual Studio IDE, via <a href=\"https:\/\/developercommunity.visualstudio.com\/\">Developer Community<\/a>, or on <a href=\"https:\/\/twitter.com\/visualc\">Twitter @visualc<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Visual Studio 2019 version 16.3 we added AVX-512 support to the auto-vectorizer of the MSVC compiler. This post will show some examples and help you enable it in your projects. What is the auto vectorizer? The compiler\u2019s auto vectorizer analyzes loops in the user\u2019s source code and generates vectorized code for a vectorization target [&hellip;]<\/p>\n","protected":false},"author":8300,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,230],"tags":[3362,3361,181,3363],"class_list":["post-25584","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","category-new-feature","tag-avx-512","tag-avx512","tag-vectorization","tag-vectorizer"],"acf":[],"blog_post_summary":"<p>In Visual Studio 2019 version 16.3 we added AVX-512 support to the auto-vectorizer of the MSVC compiler. This post will show some examples and help you enable it in your projects. What is the auto vectorizer? 
The compiler\u2019s auto vectorizer analyzes loops in the user\u2019s source code and generates vectorized code for a vectorization target [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/25584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/8300"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=25584"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/25584\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=25584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=25584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=25584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}