{"id":24544,"date":"2019-07-30T00:00:00","date_gmt":"2019-07-30T00:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/cppblog\/?p=24544"},"modified":"2019-07-30T18:16:05","modified_gmt":"2019-07-30T18:16:05","slug":"improving-the-performance-of-standard-library-functions","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/improving-the-performance-of-standard-library-functions\/","title":{"rendered":"Improving the Performance of Standard Library Functions"},"content":{"rendered":"<p>In Visual Studio 2019 version 16.2 we improved the codegen of several standard library functions. Guided by your feedback on Developer Community (<a href=\"https:\/\/developercommunity.visualstudio.com\/content\/idea\/447456\/inlining-lldiv.html\" target=\"_blank\" rel=\"noopener noreferrer\">Inlining std::lldiv<\/a> and <a href=\"https:\/\/developercommunity.visualstudio.com\/idea\/643495\/better-codegen-for-std-library-calls.html\">Improved codegen for std::fmin, std::fmax, std::round, std::trunc<\/a>) we focused on the variants of standard division (<code>std::div<\/code>, <code>std::ldiv<\/code>, <code>std::lldiv<\/code>) and <code>std::isnan<\/code>.<\/p>\n<p>Originally function calls to the standard library, rather than inline assembly instructions, were generated upon each invocation of variations of <code>std::div<\/code> and <code>std::isnan<\/code>, regardless of the compiler optimization flags passed. Since these standard library function definitions live inside of the runtime, their definitions are opaque to the compiler and therefore not candidates for inlining and optimization. Furthermore, the function overhead of calling both <code>std::div<\/code> and <code>std::isnan<\/code> is greater than the actual cost of these operations. 
On most platforms, <code>std::div<\/code> can be computed in a single instruction that returns both quotient and remainder, while <code>std::isnan<\/code> requires only a comparison and a condition flag check. Inlining these calls would both remove the function call overhead and allow further optimizations to kick in, since the compiler gains the additional context of the calling function.<\/p>\n<p>To support inline assembly code generation, we added a number of different functions as compiler intrinsics (also known as builtins) for <code>std::isnan<\/code>, <code>std::div<\/code>, and friends. Registering an intrinsic effectively &#8220;teaches&#8221; the meaning of that function to the compiler and results in greater control over the code generated. We went with a codegen solution rather than a library change to avoid altering library headers.<\/p>\n<h3>Optimizing std::div and Friends<\/h3>\n<p>The MSVC compiler has pre-existing support for optimizing bare division and remainder operations. Therefore, to feed calls to <code>std::div<\/code> into this existing compiler infrastructure, we recognize <code>std::div<\/code> as a compiler intrinsic and then transform the inputs of the recognized call into the canonical format the compiler expects for division and remainder operations.<\/p>\n<h3>Optimizing std::isnan<\/h3>\n<p>Replacing calls to <code>std::isnan<\/code> was the more complicated of the two categories of functions we targeted, due to the conflicting requirements of the C and C++ standards. According to the C standard, <code>isnan<\/code> and the function it wraps, <code>fpclassify<\/code>, are required to be implemented as macros, while C++ requires both operations to be implemented as function overloads. 
Below is a diagram of the call structure in C++ vs C, with functions that are required to be implemented as function overloads inside bolded blue boxes, functions that are required to be implemented as macros in dashed purple boxes, and those without a requirement in green.<\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-24545\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/07\/word-image.png\" alt=\"Diagram showing that the implementations of isnan in C++ and C must be different.\" width=\"1508\" height=\"415\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/07\/word-image.png 1508w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/07\/word-image-300x83.png 300w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/07\/word-image-768x211.png 768w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2019\/07\/word-image-1024x282.png 1024w\" sizes=\"(max-width: 1508px) 100vw, 1508px\" \/><\/p>\n<p>To get a unified solution for C and C++ code, we had to bypass both <code>std::isnan<\/code> and <code>std::fpclassify<\/code> to look at the functions <code>std::fpclassify<\/code> wraps. We chose the green functions to register as compiler intrinsics since they lack any implementation requirements. Additionally, since each overload in either standard accomplishes the same task, we can transform instances of one standard&#8217;s intrinsics (we chose the C++ standard) into instances of our added C intrinsics. 
The table below demonstrates the results of the unification process that reduces the intrinsics we operate on after the initial pass from six to three.<\/p>\n<table style=\"border-style: solid;\" border=\"border-style: solid;\" cellspacing=\"2px\" cellpadding=\"2px\">\n<tbody>\n<tr style=\"height: 27px;\">\n<td>Function<\/td>\n<td>Intrinsic function<\/td>\n<td style=\"height: 27px; width: 29.4601%;\">Intrinsic After Unification<\/td>\n<\/tr>\n<tr style=\"height: 27px;\">\n<td style=\"height: 27px; width: 18.1724%;\"><code>_fdtest<\/code><\/td>\n<td style=\"height: 27px; width: 33.0218%;\"><code>IV__FDTEST<\/code><\/td>\n<td style=\"height: 27px; width: 29.4601%;\"><code>IV__FDCLASS<\/code><\/td>\n<\/tr>\n<tr style=\"height: 27px;\">\n<td style=\"height: 27px; width: 18.1724%;\"><code>_dtest<\/code><\/td>\n<td style=\"border-style: solid;\"><code>IV__DTEST<\/code><\/td>\n<td style=\"height: 27px; width: 29.4601%;\"><code>IV__DCLASS<\/code><\/td>\n<\/tr>\n<tr style=\"height: 27px;\">\n<td style=\"height: 27px; width: 18.1724%;\"><code>_ldtest<\/code><\/td>\n<td style=\"height: 27px; width: 33.0218%;\"><code>IV__LDTEST<\/code><\/td>\n<td style=\"height: 27px; width: 29.4601%;\"><code>IV__LDCLASS<\/code><\/td>\n<\/tr>\n<tr style=\"height: 27px;\">\n<td style=\"height: 27px; width: 18.1724%;\"><code>_fdclass<\/code><\/td>\n<td style=\"height: 27px; width: 33.0218%;\"><code>IV__FDCLASS<\/code><\/td>\n<td style=\"height: 27px; width: 29.4601%;\"><code>IV__FDCLASS<\/code><\/td>\n<\/tr>\n<tr style=\"height: 27px;\">\n<td style=\"height: 27px; width: 18.1724%;\"><code>_dclass<\/code><\/td>\n<td style=\"height: 27px; width: 33.0218%;\"><code>IV__DCLASS<\/code><\/td>\n<td style=\"height: 27px; width: 29.4601%;\"><code>IV__DCLASS<\/code><\/td>\n<\/tr>\n<tr style=\"height: 27px;\">\n<td style=\"height: 27px; width: 18.1724%;\"><code>_ldclass<\/code><\/td>\n<td style=\"height: 27px; width: 33.0218%;\"><code>IV__LDCLASS<\/code><\/td>\n<td style=\"height: 27px; width: 
29.4601%;\"><code>IV__LDCLASS<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>Back to <code>std::isnan<\/code>. The reason for adding six new compiler intrinsics was to improve the code generation for <code>std::isnan<\/code>. However, referring back to the diagram, while we&#8217;ve recognized the functions that <code>std::fpclassify<\/code> calls into as intrinsics, that alone hasn&#8217;t changed any of the codegen for <code>std::isnan<\/code>. In the <code>_dclass<\/code> case, for example, we\u2019ve recognized all <code>_dclass<\/code> calls as intrinsics, but as we\u2019re not changing the code generated for <code>_dclass<\/code>, the code emitted is still the same call to <code>_dclass<\/code> that we started out with.<\/p>\n<p>The last step required to recognize <code>std::isnan<\/code> as an intrinsic, and therefore enable more efficient code generation, involves pattern matching. Checking whether a float\/double\/etc. is a NaN looks something like this:<\/p>\n<pre class=\"toolbar:0 nums-toggle:false plain:false plain-toggle:false lang:default decode:true\" style=\"padding-left: 80px;\">bool isnan(double x) {\r\n    return FP_NAN == IV__DCLASS(x);\r\n}<\/pre>\n<p>Here, <code>FP_NAN<\/code> is a constant defined in both the C and C++ standards. 
Now that it&#8217;s easy to identify calls to <code>_dclass<\/code> and friends, the optimizer was extended to recognize the above pattern (a call to one of the three unified intrinsics followed by a comparison to the <code>FP_NAN<\/code> constant) and transform it into the new <code>std::isnan<\/code> intrinsic, <code>IV_ISNAN<\/code>.<\/p>\n<h3>Results<\/h3>\n<p>To better illustrate the codegen differences, below are samples of the different x64 code generation for both <code>std::isnan<\/code> and <code>std::div<\/code>.<\/p>\n<h4>std::isnan(double)<\/h4>\n<table style=\"width: 540px;\" border=\"2\" width=\"540\" cellspacing=\"10\" cellpadding=\"5\">\n<tbody>\n<tr>\n<td width=\"291\">Reference<\/td>\n<td width=\"249\">With Intrinsics<\/td>\n<\/tr>\n<tr style=\"line-height: 0.15cm; vertical-align: text-top;\">\n<td><code>lea rcx, QWORD PTR _X$[rsp]<\/code><\/p>\n<p><code>movsd QWORD PTR _X$[rsp], xmm0<\/code><\/p>\n<p><code>call\u00a0\u00a0\u00a0 _dtest<\/code><\/p>\n<p><code>cmp\u00a0\u00a0\u00a0\u00a0\u00a0 ax, 2<\/code><\/p>\n<p><code>sete\u00a0\u00a0\u00a0 al<\/code><\/td>\n<td><code>ucomisd xmm0, xmm0<\/code><\/p>\n<p><code>setp al<\/code><\/p>\n<p><code>movzx eax, al<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h4>std::div(long, long)<\/h4>\n<table style=\"width: 308px;\" border=\"2\" width=\"308\" cellspacing=\"10\" cellpadding=\"5\">\n<tbody>\n<tr>\n<td width=\"154\">Reference<\/td>\n<td width=\"154\">With Intrinsics<\/td>\n<\/tr>\n<tr style=\"line-height: 0.15cm; vertical-align: text-top;\">\n<td><code>mov edx, ebx<\/code><\/p>\n<p><code>mov ecx, edi<\/code><\/p>\n<p><code>call ldiv<\/code><\/td>\n<td><code>mov eax, esi<\/code><\/p>\n<p><code>mov rcx, rbx<\/code><\/p>\n<p><code>cdq<\/code><\/p>\n<p><code>idiv edi<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>How much does effectively inlining these calls impact performance? 
When we benchmarked each operation by calling it on every element of a 1&#215;10<sup>9<\/sup>-element array with unknown inputs, we observed the following improvements:<\/p>\n<table border=\"2\" cellspacing=\"10\" cellpadding=\"5\">\n<tbody>\n<tr>\n<td>std::isnan(double)<\/td>\n<td>Reference<\/td>\n<td>With Intrinsics<\/td>\n<td>% Improvement<\/td>\n<\/tr>\n<tr>\n<td>Avg (s)<\/td>\n<td>4.03<\/td>\n<td>1.25<\/td>\n<td>69%<\/td>\n<\/tr>\n<tr>\n<td>Std Dev (s)<\/td>\n<td>0.05<\/td>\n<td>0.04<\/td>\n<td><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<table border=\"2\" cellspacing=\"10\" cellpadding=\"5\">\n<tbody>\n<tr>\n<td>std::div(long, long)<\/td>\n<td>Reference<\/td>\n<td>With Intrinsics<\/td>\n<td>% Improvement<\/td>\n<\/tr>\n<tr>\n<td>Avg (s)<\/td>\n<td>6.52<\/td>\n<td>6.06<\/td>\n<td>7%<\/td>\n<\/tr>\n<tr>\n<td>Std Dev (s)<\/td>\n<td>0.11<\/td>\n<td>0.04<\/td>\n<td><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>You can view the benchmark source <a href=\"https:\/\/godbolt.org\/z\/5qQalr\">here<\/a> on Compiler Explorer. Each benchmark was run six times on an Intel Xeon CPU v3 @3.50GHz, with the first (warm-up) run discarded.<\/p>\n<h3>Conclusion<\/h3>\n<p>The aforementioned optimizations will be enabled transparently with an upgrade to the 16.2 toolset in codebases compiled under \/O2. Otherwise, make sure you&#8217;re explicitly using \/Oi to enable intrinsic support.<\/p>\n<p>The C++ team is already looking ahead at improving the performance of standard library functions in future releases, with a similar optimization for <code>std::fma<\/code> targeting the 16.4 release. As always, we welcome your feedback and feature requests for further codegen improvements. 
If you see cases of inefficient code generation, please reach out via the comments below or by opening an issue on <a href=\"https:\/\/developercommunity.visualstudio.com\/spaces\/8\/index.html\">Developer Community<\/a>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Visual Studio 2019 version 16.2 we improved the codegen of several standard library functions. Guided by your feedback on Developer Community (Inlining std::lldiv and Improved codegen for std::fmin, std::fmax, std::round, std::trunc) we focused on the variants of standard division (std::div, std::ldiv, std::lldiv) and std::isnan. Originally function calls to the standard library, rather than inline [&hellip;]<\/p>\n","protected":false},"author":5984,"featured_media":24545,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[218],"tags":[],"class_list":["post-24544","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-performance"],"acf":[],"blog_post_summary":"<p>In Visual Studio 2019 version 16.2 we improved the codegen of several standard library functions. Guided by your feedback on Developer Community (Inlining std::lldiv and Improved codegen for std::fmin, std::fmax, std::round, std::trunc) we focused on the variants of standard division (std::div, std::ldiv, std::lldiv) and std::isnan. 
Originally function calls to the standard library, rather than inline [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/24544","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/5984"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=24544"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/24544\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/24545"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=24544"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=24544"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=24544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}