{"id":13265,"date":"2017-06-29T10:37:54","date_gmt":"2017-06-29T17:37:54","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/dotnet\/?p=13265"},"modified":"2021-09-30T10:08:16","modified_gmt":"2021-09-30T17:08:16","slug":"performance-improvements-in-ryujit-in-net-core-and-net-framework","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-ryujit-in-net-core-and-net-framework\/","title":{"rendered":"Performance Improvements in RyuJIT in .NET Core and .NET Framework"},"content":{"rendered":"<p>RyuJIT is the just-in-time compiler used by .NET Core on x64 <a href=\"https:\/\/github.com\/dotnet\/announcements\/issues\/10\">and now x86<\/a>\u00a0and by the .NET Framework on x64 to compile MSIL bytecode to native machine code when a managed assembly executes. I&#8217;d like to point out some of the past year&#8217;s improvements that have gone into RyuJIT, and how they make the generated code faster.<\/p>\n<p>What follows is by no means a comprehensive list of RyuJIT optimization improvements, but rather a few hand-picked examples that should make for a fun read and point to some of the issues and pull requests on GitHub that highlight the great community interactions and contributions that have helped shape this work. Be sure to also check out Stephen Toub&#8217;s <a href=\"https:\/\/blogs.msdn.microsoft.com\/dotnet\/2017\/06\/07\/performance-improvements-in-net-core\/\">recent post<\/a> about performance improvements in the runtime and base class libraries, if you haven&#8217;t already.<\/p>\n<p>This post will be comparing the performance of RyuJIT in .NET Framework 4.6.2 to its performance in .NET Core 2.0 and .NET Framework 4.7.1. Note that .NET Framework 4.7.1 has not yet shipped and I am using an early private build of the product. 
The same RyuJIT compiler sources are shared between .NET Core and .NET Framework, so the compiler changes discussed here are present in both .NET Core 2.0 and .NET Framework 4.7.1 builds.<\/p>\n<p><em>NOTE: Code examples included in this post use manual <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.diagnostics.stopwatch?view=netstandard-2.0\"><code>Stopwatch<\/code><\/a> invocations, with arbitrarily fixed iteration counts and no statistical analysis, as a zero-dependency way to corroborate known large performance deltas. The timings quoted below were collected on the same machine, with compared runs executed back-to-back, but even so it would be ill-advised to extrapolate quantitative results; they serve only to confirm that the optimizations improve the performance of the targeted code sequences rather than degrade it. Active performance work, of course, demands real benchmarking, which comes with a whole host of subtle issues that it is well worth taking a dependency to manage properly. <a href=\"https:\/\/twitter.com\/andrey_akinshin\">Andrey Akinshin<\/a> recently wrote a <a href=\"http:\/\/aakinshin.net\/blog\/post\/stephen-toub-benchmarks-part1\/\">great blog post<\/a> discussing this, using\u00a0the code snippets from Stephen&#8217;s post as examples. He will publish a follow-on post to this one with additional benchmarks soon. Thanks Andrey!<\/em><\/p>\n<h4><a href=\"#devirtualization\" id=\"user-content-devirtualization\" class=\"anchor\"><\/a>Devirtualization<\/h4>\n<p>The machine code sequence that the just-in-time compiler emits for a virtual call necessarily involves some degree of indirection, so that the correct target override method can be determined when the machine code executes. Compared to a direct call, this indirection imposes nontrivial overhead. RyuJIT can now identify that certain virtual call sites will always have one particular target override, and replace those virtual calls with direct ones. 
This avoids the overhead of the virtual indirection and, better still, allows inlining the callee method into the callsite, eliminating call overhead entirely and giving optimizations better insight into the effects of the code. This can happen when the target object has sealed type, or when its allocation site is immediately apparent and thus its exact type is known. This optimization was introduced to RyuJIT in <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/9230\">dotnet\/coreclr #9230<\/a>; was subsequently improved by <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10192\">dotnet\/coreclr #10192<\/a>, <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10432\">dotnet\/coreclr #10432<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10471\">dotnet\/coreclr #10471<\/a>; and has plenty more <a href=\"https:\/\/github.com\/dotnet\/coreclr\/issues\/9908\">room for improvement<\/a>.\nThe PRs for the changes include some statistics (e.g. <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10471#issuecomment-289170334\">7.3% of virtual calls in System.Private.CoreLib get devirtualized<\/a>) and real-world examples (e.g. <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10192#issuecomment-286827037\">this diff in <code>ConcurrentStack.GetEnumerator()<\/code><\/a> &#8212; to see the code diff at that link you may have to scroll past the quoted output from <code>jit-diff<\/code>, which is a <a href=\"https:\/\/github.com\/dotnet\/jitutils\/blob\/master\/doc\/diffs.md\">tool<\/a> we use for assessing compiler change impact. It reports any code size increase as a &#8220;regression&#8221;, though in this case the code size increases are likely from enabling inlines, which is actually an improvement). 
Here&#8217;s a minimal example to illustrate the optimization in action:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/57bdad096c6a0ee7d70d2ae527639169.js\"><\/script><\/p>\n<p>Method <code>Operation.OperateTwice<\/code> takes an instance parameter of abstract type <code>Operation<\/code>, and makes two virtual calls to its <code>Operate<\/code> method.\nWhen run with the version of the RyuJIT compiler included in .NET Framework 4.6.2, <code>OperateTwice<\/code> is inlined into <code>Test.PostDoubleIncrement<\/code>, leaving <code>PostDoubleIncrement<\/code> with two virtual calls:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/378b0554b4f59b0d01f19b7200c8a7e0.js\"><\/script><\/p>\n<p>When run with the version of RyuJIT included in .NET Core 2.0 and .NET Framework 4.7.1, <code>OperateTwice<\/code> is again inlined into <code>Test.PostDoubleIncrement<\/code>, but the JIT can now recognize that the instance argument to the two virtual calls pulled in by that inlining is <code>PostDoubleIncrement<\/code>&#8217;s parameter <code>inc<\/code>, which is of sealed type <code>Increment<\/code>. 
This allows it to rewrite the virtual calls as direct calls to the <code>Increment.Operate<\/code> override of <code>Operation.Operate<\/code>, and even inline those calls into <code>PostDoubleIncrement<\/code>, which in turn exposes the fact that this code sequence doesn&#8217;t modify instance field <code>input<\/code>, allowing the redundant load of it for the return value to be eliminated:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/0f3f0cd1c6dfb301899565ac15386b6a.js\"><\/script><\/p>\n<p>The optimized version does of course run faster; here&#8217;s what I see running locally on .NET Framework 4.6.2:<\/p>\n<pre><code>00:00:00.7389248\r\n00:00:00.7390185\r\n00:00:00.7343929\r\n00:00:00.7355264\r\n00:00:00.7350114\r\n<\/code><\/pre>\n<p>and here&#8217;s what I see running locally on .NET Core 2.0:<\/p>\n<pre><code>00:00:00.4671669\r\n00:00:00.4676545\r\n00:00:00.4683338\r\n00:00:00.4674685\r\n00:00:00.4673269\r\n<\/code><\/pre>\n<h4><a href=\"#enhanced-range-inference\" id=\"user-content-enhanced-range-inference\" class=\"anchor\"><\/a>Enhanced Range Inference<\/h4>\n<p>One key goal of JIT compiler optimizations is to reduce the cost of run-time safety checks by eliding the code for them when the compiler can prove they will succeed; this necessarily falls to the JIT since the checks are explicitly dictated by MSIL semantics. RyuJIT&#8217;s optimizer accordingly focuses some of its analysis on array index expressions, to see whether it can prove they are in-bounds. <a href=\"https:\/\/github.com\/mikedn\">@mikedn<\/a>&#8217;s change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/9773\">dotnet\/coreclr #9773<\/a> extended this analysis to recognize the common idiom of using a single unsigned comparison to test both the upper and lower bounds (<code>(uint)i &lt; (uint)a.len<\/code> implies both <code>i &gt;= 0<\/code> and <code>i &lt; a.len<\/code> for signed <code>i<\/code>). 
The PR <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/9773#issuecomment-283800896\">notes<\/a> how this trimmed the machine code generated for <code>List.Add<\/code> from 68 bytes to 48 bytes, and here&#8217;s a minimal illustrative example:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/28013a5892c842489df1075adc99232b.js\"><\/script><\/p>\n<p>Method <code>Set<\/code> validates its <code>index<\/code> argument, and then stores to the array. The IL generated by the C# compiler for this method looks like so:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/146c00437e6b3ba62ed579d5fc43b0a4.js\"><\/script><\/p>\n<p>The IL has a <code>blt.un<\/code> instruction for the argument validation, and a subsequent <code>stelem<\/code> instruction for the store that carries an implied bounds check of its own. When run with the version of the RyuJIT compiler included in .NET Framework 4.6.2, machine instructions are generated for each of these checks; here&#8217;s what the machine code for the inner loop from the <code>Main<\/code> method (into which <code>Set<\/code> gets inlined) looks like:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/10959abed2788b141ec012a731b6cd44.js\"><\/script><\/p>\n<p>When run with the version of RyuJIT included in .NET Core 2.0 and .NET Framework 4.7.1, on the other hand, the compiler recognizes that the explicit check for the argument validation ensures that the subsequent check for the <code>stelem<\/code> instruction will always succeed, and omits the redundant check, producing this machine code:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/ad4b381f49dd0763481683bfc833156b.js\"><\/script><\/p>\n<p>Importantly, this brings the machine code in line with what one might expect from looking at the source code &#8212; a check for argument validation, followed by a store to the backing array. 
Also, executing the code reports a speedup as expected &#8212; running on .NET Framework 4.6.2 gives me output like this:<\/p>\n<pre><code>00:00:00.4313988\r\n00:00:00.4313209\r\n00:00:00.4320729\r\n00:00:00.4319180\r\n00:00:00.4316375\r\n<\/code><\/pre>\n<p>and running on .NET Core 2.0 gives me output like this:<\/p>\n<pre><code>00:00:00.3235982\r\n00:00:00.3237021\r\n00:00:00.3250067\r\n00:00:00.3235947\r\n00:00:00.3236944\r\n<\/code><\/pre>\n<h4><a href=\"#finally-cloning\" id=\"user-content-finally-cloning\" class=\"anchor\"><\/a>Finally Cloning<\/h4>\n<p>When it comes to exception handling and performance, one key goal is to minimize the cost that exception handling constructs impose on the non-exception path &#8212; if no exceptions are actually raised when the program runs, then (as much as possible) it should run as fast as it would if it didn&#8217;t have exception handlers at all. This poses a challenge for <code>finally<\/code> clauses, which execute on both exception and non-exception paths. In order to correctly support the exception path, the code of the <code>finally<\/code> must be bracketed by some set-up\/tear-down code that facilitates being called from the runtime code that handles exception dispatch. Let&#8217;s look at an example:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/9f5c885cfac85dd860df769ba6d4d9a8.js\"><\/script><\/p>\n<p>Method <code>Update<\/code> has a <code>finally<\/code> clause that increments ref parameter <code>right<\/code>. The actual increment boils down to a single machine instruction (<code>add dword ptr [rax], 1<\/code>), but interaction with the runtime&#8217;s exception dispatch mechanism requires 5 extra instructions prior and 3 instructions after. 
The exception dispatch code invokes the <code>finally<\/code> handler by calling it, and with the version of RyuJIT included with .NET Framework 4.6.2, the non-exception path of method <code>Update<\/code> similarly uses a <code>call<\/code> instruction to transfer control to the <code>finally<\/code> code. Here&#8217;s what the machine code for method <code>Update<\/code> looks like with that version of the compiler:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/cc728b7db5b511487d5a75acd91e0e84.js\"><\/script><\/p>\n<p>Thanks to change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/8551\">dotnet\/coreclr #8551<\/a>, the version of RyuJIT included with .NET Core 2.0 makes a separate copy of the <code>finally<\/code> handler body, which executes on the non-exception path, and only on the non-exception path, and therefore doesn&#8217;t need any of the code that interacts with the exception dispatch code. The result (for this simple <code>finally<\/code>) is a simple <code>inc dword ptr [rax]<\/code> in lieu of the <code>call<\/code> to the <code>finally<\/code> handler code. Here&#8217;s what the machine code for method <code>Update<\/code> looks like on .NET Core 2.0:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/5efb192634acff6eae66061c50f8b858.js\"><\/script><\/p>\n<p>(Note: As mentioned above, .NET Core and .NET Framework share RyuJIT compiler sources. 
In the case of this particular optimization, however, since the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.threading.thread.abort?view=netframework-4.7\"><code>Thread.Abort<\/code><\/a> mechanism that exists in .NET Framework but not .NET Core requires the optimization to perform <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/8551#issuecomment-267692283\">extra work<\/a> that&#8217;s not yet implemented, the compiler includes a check that disables this optimization when running on .NET Framework.)<\/p>\n<p>It&#8217;s worth noting that, in terms of C# source code, this optimization applies not just to <code>finally<\/code> statements, but also to other constructs which are implemented using MSIL <code>finally<\/code> clauses, such as <code>using<\/code> statements and <code>foreach<\/code> statements involving enumerators that implement <code>IDisposable<\/code>.<\/p>\n<p>As usual, the PR reports some stats (e.g. <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/8551#issuecomment-270448482\">3,000 affected methods in framework libraries<\/a>). The example above gives me output like this running on .NET Framework 4.6.2:<\/p>\n<pre><code>00:00:00.8864647\r\n00:00:00.8871649\r\n00:00:00.8858654\r\n00:00:00.8844547\r\n00:00:00.8863496\r\n<\/code><\/pre>\n<p>and output like this running on .NET Core 2.0:<\/p>\n<pre><code>00:00:00.3945198\r\n00:00:00.3943679\r\n00:00:00.3954488\r\n00:00:00.3944719\r\n00:00:00.3948235\r\n00:00:00.3942550\r\n00:00:00.3943774\r\n<\/code><\/pre>\n<h4><a href=\"#shift-count-mask-removal\" id=\"user-content-shift-count-mask-removal\" class=\"anchor\"><\/a>Shift Count Mask Removal<\/h4>\n<p>Generating machine code for bit-shift operations is surprisingly nuanced. 
Many programming languages and hardware ISAs (and compiler intermediate representations) include bit-shift instructions, but when the nominal shift amount is greater than or equal to the number of bits in the shifted value&#8217;s type, they have differing conventions (as do different programmers&#8217; expectations): some interpret the shift amount modulo the number of bits, some produce the value zero (all bits &#8220;shifted out&#8221;), and some leave the result unspecified. MSIL&#8217;s <code>shl<\/code> and <code>shr.un<\/code> instructions&#8217; results are undefined in these cases (perhaps to allow the JIT compiler to simply lower these to corresponding target machine instructions regardless of the target&#8217;s convention). C#&#8217;s <code>&lt;&lt;<\/code> and <code>&gt;&gt;<\/code> operators, on the other hand, always have a defined result, interpreting the shift amount modulo the number of bits. To ensure the correct semantics, therefore, the C# compiler must emit explicit MSIL instructions to perform the modulus\/bit-mask operation on the shift amount before feeding it to <code>shl<\/code>\/<code>shr<\/code>. Since the x86, x64, and ARM64 ISAs&#8217; shift instructions likewise interpret the shift amount modulo the number of bits, on these targets a single hardware instruction can be used for the whole mask+shift sequence emitted by the C# compiler. <a href=\"https:\/\/github.com\/mikedn\">@mikedn<\/a>&#8217;s change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/11594\">dotnet\/coreclr #11594<\/a> taught RyuJIT to recognize these sequences and shrink them down appropriately. 
The PR reports <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/11594#issuecomment-301268383\">stats<\/a> showing this firing in 20 different framework assemblies, and, as always, a minimal illustrative example follows:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/952091ca03b43b20ebb20b61508680f1.js\"><\/script><\/p>\n<p>Method <code>LeftShift<\/code> uses only C#&#8217;s <code>&lt;&lt;<\/code> operator, and its IL includes the modulo operation (as <code>ldc 31<\/code> + <code>and<\/code>) to ensure those semantics:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/5a030998c62dc243702e04f78b48831b.js\"><\/script><\/p>\n<p>Running this with the version of the RyuJIT compiler included with .NET Framework 4.6.2, these masking operations are mechanically translated to hardware <code>and<\/code> instructions; here&#8217;s what the inner loop of method <code>Main<\/code> (into which <code>LeftShift<\/code> gets inlined) looks like:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/edf58d575a852e32e9b0c896a9b91a9f.js\"><\/script><\/p>\n<p>Running this with the RyuJIT compiler built from the current <code>master<\/code> branch (this particular change was merged after forking the .NET Core 2.0 release branch, and so will be included in versions after 2.0), on the other hand, the redundant masking is eliminated:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/d168e983dfb83cd503db2bb5f7b7b1ba.js\"><\/script><\/p>\n<p>Running this example code on .NET Framework 4.6.2, I see output like this:<\/p>\n<pre><code>00:00:00.8666592\r\n00:00:00.8644551\r\n00:00:00.8623416\r\n00:00:00.8625029\r\n00:00:00.8621675\r\n<\/code><\/pre>\n<p>Running it on .NET Core built from current <code>master<\/code> branch, I see output like this:<\/p>\n<pre><code>00:00:00.5767756\r\n00:00:00.5747216\r\n00:00:00.5753256\r\n00:00:00.5747212\r\n00:00:00.5751126\r\n<\/code><\/pre>\n<h4><a href=\"#conclusion\" 
id=\"user-content-conclusion\" class=\"anchor\"><\/a>Conclusion<\/h4>\n<p>There&#8217;s a lot of work going on in the JIT. I hope this small sampling has provided a fun read, and invite anyone interested to join the community pushing this work forward; there&#8217;s some <a href=\"https:\/\/github.com\/dotnet\/coreclr\/blob\/master\/Documentation\/botr\/ryujit-overview.md\">documentation on RyuJIT<\/a> available, and active work is typically labelled <a href=\"https:\/\/github.com\/dotnet\/coreclr\/labels\/area-CodeGen\">codegen<\/a> and\/or <a href=\"https:\/\/github.com\/dotnet\/coreclr\/labels\/optimization\">optimization<\/a>. Performance is a constant focus in JIT work, and we&#8217;ve got some exciting improvements in the pipeline (like <a href=\"https:\/\/github.com\/dotnet\/coreclr\/issues\/4331\">tiered jitting<\/a>), so stay tuned, and let us know what would really help light up your scenarios!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>RyuJIT is the just-in-time compiler used by .NET Core on x64 and now x86\u00a0and by the .NET Framework on x64 to compile MSIL bytecode to native machine code when a managed assembly executes. I&#8217;d like to point out some of the past year&#8217;s improvements that have gone into RyuJIT, and how they make the generated [&hellip;]<\/p>\n","protected":false},"author":363,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685],"tags":[9,11,108],"class_list":["post-13265","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","tag-net-core","tag-net-framework","tag-performance"],"acf":[],"blog_post_summary":"<p>RyuJIT is the just-in-time compiler used by .NET Core on x64 and now x86\u00a0and by the .NET Framework on x64 to compile MSIL bytecode to native machine code when a managed assembly executes. 
I&#8217;d like to point out some of the past year&#8217;s improvements that have gone into RyuJIT, and how they make the generated [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/13265","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/363"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=13265"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/13265\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=13265"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=13265"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=13265"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}