{"id":24801,"date":"2019-08-08T18:34:23","date_gmt":"2019-08-08T18:34:23","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=24801"},"modified":"2019-08-08T18:34:23","modified_gmt":"2019-08-08T18:34:23","slug":"game-performance-improvements-in-visual-studio-2019-version-16-2","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/game-performance-improvements-in-visual-studio-2019-version-16-2\/","title":{"rendered":"Game performance improvements in Visual Studio 2019 version 16.2"},"content":{"rendered":"<p>This spring Gratian Lup described in his blog post the improvements for <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/game-performance-and-compilation-time-improvements-in-visual-studio-2019\/\">C++ game development in Visual Studio 2019<\/a>. From Visual Studio 2019 version 16.0 to Visual Studio 2019 version 16.2 we\u2019ve made some more improvements. On the <a href=\"https:\/\/www.unrealengine.com\/marketplace\/en-US\/infiltrator-demo\">Infiltrator Demo<\/a> we\u2019ve got 2\u20133% performance wins for the most CPU-intensive parts of the game.<\/p>\n<h2>Throughput<\/h2>\n<p>A huge throughput improvement was done in the linker! Check our recent blogpost on <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/improved-linker-fundamentals-in-visual-studio-2019\/\">Improved Linker Fundamentals in Visual Studio 2019<\/a>.<\/p>\n<h2>New Optimizations<\/h2>\n<p>A comprehensive list of new and improved C++ compiler optimizations can be found in a recent blogpost on <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-backend-updates-in-visual-studio-2019-version-16-2\/\">MSVC Backend Updates in Visual Studio 2019 version 16.2<\/a>. 
I\u2019ll talk in a bit more detail about some of them.<\/p>\n<p>All samples below are compiled for x64 with these switches: \/arch:AVX2 \/O2 \/fp:fast \/c \/Fa.<\/p>\n<h2>Vectorizing tiny perfect reduction loops on AVX<\/h2>\n<p>This is a common pattern for making sure that two vectors didn\u2019t diverge too much:<\/p>\n<pre class=\"lang:default decode:true \">#include &lt;xmmintrin.h&gt;\r\n#include &lt;DirectXMath.h&gt;\r\nuint32_t TestVectorsEqual(float* Vec0, float* Vec1, float Tolerance = 1e7f)\r\n{\r\n    float sum = 0.f;\r\n    for (int32_t Component = 0; Component &lt; 4; Component++)\r\n    {\r\n        float Diff = Vec0[Component] - Vec1[Component];\r\n        sum += (Diff &gt;= 0.0f) ? Diff : -Diff;\r\n    }\r\n    return (sum &lt;= Tolerance) ? 1 : 0;\r\n}<\/pre>\n<p>For version 16.2 we tweaked the vectorization heuristics for the AVX architecture to better utilize the hardware capabilities. The disassembly is for x64, AVX2, old code on the left, new on the right:<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-24802 size-full\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image.png\" alt=\"Comparison of old code versus the much-improved new code\" width=\"1624\" height=\"648\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image.png 1624w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-300x120.png 300w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-768x306.png 768w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-1024x409.png 1024w\" sizes=\"(max-width: 1624px) 100vw, 1624px\" \/><\/p>\n<p>Visual Studio 2019 version 16.0 recognized the loop as a reduction loop, didn\u2019t vectorize it, but unrolled it completely. 
Version 16.2 also recognized the loop as a reduction loop, vectorized it (due to the heuristics change), and used horizontal add instructions to get the sum. As a result the code is much shorter and faster now.<\/p>\n<h2>Recognition of intrinsics working on a single vector element<\/h2>\n<p>The compiler now does a better job at optimizing vector intrinsics working on the lowest single element (those with ss\/sd suffix).<\/p>\n<p>A good example for the improved code is the inverse square root. This function is taken from the Unreal Engine math library (with comments removed for brevity). It\u2019s used all over the games based on Unreal Engine for rendering objects:<\/p>\n<pre class=\"lang:default decode:true\">#include &lt;xmmintrin.h&gt;\r\n#include &lt;DirectXMath.h&gt;\r\nfloat InvSqrt(float F)\r\n{\r\n    const __m128 fOneHalf = _mm_set_ss(0.5f);\r\n    __m128 Y0, X0, X1, X2, FOver2;\r\n    float temp;\r\n    Y0 = _mm_set_ss(F);\r\n    X0 = _mm_rsqrt_ss(Y0);\r\n    FOver2 = _mm_mul_ss(Y0, fOneHalf);\r\n    X1 = _mm_mul_ss(X0, X0);\r\n    X1 = _mm_sub_ss(fOneHalf, _mm_mul_ss(FOver2, X1));\r\n    X1 = _mm_add_ss(X0, _mm_mul_ss(X0, X1));\r\n    X2 = _mm_mul_ss(X1, X1);\r\n    X2 = _mm_sub_ss(fOneHalf, _mm_mul_ss(FOver2, X2));\r\n    X2 = _mm_add_ss(X1, _mm_mul_ss(X1, X2));\r\n    _mm_store_ss(&amp;temp, X2);\r\n    return temp;\r\n}<\/pre>\n<p>Again, x64, AVX2, old code on the left, new on the right:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-24803\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-1.png\" alt=\"Comparison of old code versus the much-improved new code\" width=\"1625\" height=\"527\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-1.png 1625w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-1-300x97.png 300w, 
https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-1-768x249.png 768w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-1-1024x332.png 1024w\" sizes=\"(max-width: 1625px) 100vw, 1625px\" \/><\/p>\n<p>Visual Studio 2019 version 16.0 generated code for all intrinsics one by one. Version 16.2 now understands the meaning of the intrinsics better and is able to combine multiply\/add intrinsics into FMA instructions. There are still improvements to be made in this area and some are targeted for version 16.3\/16.4.<\/p>\n<p>Even now, if given a const argument, this code will be completely constant-folded:<\/p>\n<pre class=\"lang:default decode:true\">float ReturnInvSqrt()\r\n{\r\n    return InvSqrt(4.0);\r\n}<\/pre>\n<p><img decoding=\"async\" class=\"alignnone wp-image-24804\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-2.png\" alt=\"Comparison of old code versus the much-improved new code\" width=\"1628\" height=\"483\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-2.png 1628w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-2-300x89.png 300w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-2-768x228.png 768w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-2-1024x304.png 1024w\" sizes=\"(max-width: 1628px) 100vw, 1628px\" \/><\/p>\n<p>Again, Visual Studio 2019 version 16.0 here generated code for all intrinsics, one by one. Version 16.2 was able to calculate the value at compile time. 
(This happens only with the \/fp:fast switch.)<\/p>\n<h2>More FMA patterns<\/h2>\n<p>The compiler now generates FMA in more cases:<\/p>\n<p>(fma a, b, (c * d)) + x -&gt; fma a, b, (fma c, d, x)\nx + (fma a, b, (c * d)) -&gt; fma a, b, (fma c, d, x)<\/p>\n<p>(a + 1) * b -&gt; fma a, b, b\n(a + (-1)) * b -&gt; fma a, b, (-b)\n(a - 1) * b -&gt; fma a, b, (-b)\n(a - (-1)) * b -&gt; fma a, b, b\n(1 - a) * b -&gt; fma (-a), b, b\n(-1 - a) * b -&gt; fma (-a), b, (-b)<\/p>\n<p>It also does more FMA simplifications:<\/p>\n<p>fma a, c1, (a * c2) -&gt; fmul a * (c1+c2)\nfma (a * c1), c2, b -&gt; fma a, c1*c2, b\nfma a, 1, b -&gt; a + b\nfma a, -1, b -&gt; (-a) + b -&gt; b - a\nfma -a, c, b -&gt; fma a, -c, b\nfma a, c, a -&gt; a * (c+1)\nfma a, c, (-a) -&gt; a * (c-1)<\/p>\n<p>Previously, FMA generation worked only with local vectors; it now works on globals too.<\/p>\n<p>Here is an example of the optimization at work:<\/p>\n<pre class=\"lang:default decode:true\">#include &lt;xmmintrin.h&gt;\r\n__m128 Sample(__m128 A, __m128 B)\r\n{\r\n    const __m128 fMinusOne = _mm_set_ps1(-1.0f);\r\n    __m128 X;\r\n    X = _mm_sub_ps(A, fMinusOne);\r\n    X = _mm_mul_ps(X, B);\r\n    return X;\r\n}<\/pre>\n<p>Old code on the left, new on the right:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-24805\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-3.png\" alt=\"Comparison of old code versus the much-improved new code\" width=\"1515\" height=\"149\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-3.png 1515w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-3-300x30.png 300w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-3-768x76.png 768w, 
https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-3-1024x101.png 1024w\" sizes=\"(max-width: 1515px) 100vw, 1515px\" \/><\/p>\n<p>FMA is shorter and faster, and the constant is completely gone and will not occupy space.<\/p>\n<p>Another sample:<\/p>\n<pre class=\"lang:default decode:true\">#include &lt;xmmintrin.h&gt;\r\n__m128 Sample2(__m128 A, __m128 B)\r\n{\r\n    __m128 C1 = _mm_set_ps(3.0, 3.0, 2.0, 1.0);\r\n    __m128 C2 = _mm_set_ps(4.0, 4.0, 3.0, 2.0);\r\n    __m128 X = _mm_mul_ps(A, C1);\r\n    X = _mm_fmadd_ps(X, C2, B);\r\n    return X;\r\n}<\/pre>\n<p>Old code on the left, new on the right:<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-24806\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-4.png\" alt=\"Comparison of old code versus the much-improved new code\" width=\"1560\" height=\"172\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-4.png 1560w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-4-300x33.png 300w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-4-768x85.png 768w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-4-1024x113.png 1024w\" sizes=\"(max-width: 1560px) 100vw, 1560px\" \/><\/p>\n<p>Version 16.2 is doing this simplification:<\/p>\n<p>fma (a * c1), c2, b -&gt; fma a, c1*c2, b<\/p>\n<p>Constants are now extracted and multiplied at compile time.<\/p>\n<h2>Memset and initialization<\/h2>\n<p>Memset code generation was improved by calling the faster CRT version where appropriate instead of expanding its definition inline. Loops that store a constant value that is formed of the same byte (e.g. 
0xABABABAB) now also use the CRT version of memset. Compared with na\u00efve code generation, calling memset is at least 2x faster on SSE2, and even faster on AVX2.<\/p>\n<h2>Inlining<\/h2>\n<p>We\u2019ve further tuned the inlining heuristics so that small functions containing control flow are inlined more aggressively.<\/p>\n<h2>Improvements in Unreal Engine \u2013 Infiltrator Demo<\/h2>\n<p>The new optimizations pay off.<\/p>\n<p>We ran the <a href=\"https:\/\/www.unrealengine.com\/marketplace\/en-US\/infiltrator-demo\">Infiltrator Demo<\/a> again (see the blogpost about <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/game-performance-and-compilation-time-improvements-in-visual-studio-2019\/\">C++ game development in Visual Studio 2019<\/a> for a description of the demo and testing methodology). As a short reminder: the Infiltrator Demo is based on Unreal Engine and is a nice approximation of a real game. Game performance is measured here by frame time: the smaller, the better (the inverse metric is frames per second). Testing followed the same procedure as the previous test run; the only difference is the hardware: this time we ran it on AMD\u2019s newest Zen 2 processor.<\/p>\n<p>Test PC configuration:<\/p>\n<ul>\n<li>AMD Ryzen 5 3600 6-Core Processor, 3.6 GHz, 6 cores, 12 logical processors<\/li>\n<li>Radeon RX 550 GPU<\/li>\n<li>16 GB RAM<\/li>\n<li>Windows 10 1903<\/li>\n<\/ul>\n<h2>Results<\/h2>\n<p>This time we measured only the \/arch:AVX2 configuration. 
As before, lower is better.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-24807\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-5.png\" alt=\"Graph showing the 2-3% improvements on performance spikes\" width=\"1492\" height=\"996\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-5.png 1492w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-5-300x200.png 300w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-5-768x513.png 768w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/08\/word-image-5-1024x684.png 1024w\" sizes=\"(max-width: 1492px) 100vw, 1492px\" \/><\/p>\n<p>The blue line is the demo compiled with Visual Studio 2019; the yellow line is the demo compiled with Visual Studio 2019 version 16.2. The X axis shows time and the Y axis shows frame time.<\/p>\n<p>Frame times are mostly the same between the two runs, but in the parts of the demo where frame times are highest (and thus the frame rate is lowest), Visual Studio 2019 version 16.2 gives an improvement of 2\u20133%.<\/p>\n<p>We\u2019d love for you to <a href=\"https:\/\/visualstudio.microsoft.com\/vs\/\">download Visual Studio 2019<\/a> and give it a try. As always, we welcome your feedback. We can be reached via the comments below or via email (<a href=\"mailto:visualcpp@microsoft.com\" target=\"_blank\" rel=\"noopener noreferrer\">visualcpp@microsoft.com<\/a>). If you encounter problems with Visual Studio or MSVC, or have a suggestion for us, please let us know through <strong>Help &gt; Send Feedback &gt; Report A Problem \/ Provide a Suggestion<\/strong> in the product, or via <a href=\"https:\/\/developercommunity.visualstudio.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Developer Community<\/a>. 
You can also find us on Twitter (<a href=\"https:\/\/twitter.com\/visualc\" target=\"_blank\" rel=\"noopener noreferrer\">@VisualC<\/a>).<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This spring Gratian Lup described in his blog post the improvements for C++ game development in Visual Studio 2019. From Visual Studio 2019 version 16.0 to Visual Studio 2019 version 16.2 we\u2019ve made some more improvements. On the Infiltrator Demo we\u2019ve got 2\u20133% performance wins for the most CPU-intensive parts of the game. Throughput A [&hellip;]<\/p>\n","protected":false},"author":6553,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[218],"tags":[],"class_list":["post-24801","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-performance"],"acf":[],"blog_post_summary":"<p>This spring Gratian Lup described in his blog post the improvements for C++ game development in Visual Studio 2019. From Visual Studio 2019 version 16.0 to Visual Studio 2019 version 16.2 we\u2019ve made some more improvements. On the Infiltrator Demo we\u2019ve got 2\u20133% performance wins for the most CPU-intensive parts of the game. 
Throughput A [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/24801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/6553"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=24801"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/24801\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=24801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=24801"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=24801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}