{"id":14885,"date":"2017-10-16T22:16:50","date_gmt":"2017-10-17T05:16:50","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/dotnet\/?p=14885"},"modified":"2021-09-29T16:38:24","modified_gmt":"2021-09-29T23:38:24","slug":"ryujit-just-in-time-compiler-optimization-enhancements","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/ryujit-just-in-time-compiler-optimization-enhancements\/","title":{"rendered":"RyuJIT Just-in-Time Compiler Optimization Enhancements"},"content":{"rendered":"<p>I&#8217;d like to tell you about some of the recent changes we&#8217;ve made as part of\u00a0our ongoing work to extend the optimization capabilities of RyuJIT, the\u00a0MSIL-to-native code generator used by .NET Core and .NET Framework. I hope it will make for an interesting read, and offer some insight into the sorts of\u00a0optimization opportunities we have our eyes on.<\/p>\n<p><em>Note: The changes described here landed after the release fork for\u00a0.NET Core 2.0 was created, so they are available in <a href=\"https:\/\/github.com\/dotnet\/core\/blob\/master\/daily-builds.md\">daily preview builds<\/a>\u00a0but not the <a href=\"https:\/\/github.com\/dotnet\/core\/blob\/master\/release-notes\/download-archive.md\">released 2.0 bits<\/a>.\u00a0Similarly, these changes landed after the fork for <a href=\"https:\/\/blogs.msdn.microsoft.com\/dotnet\/2017\/08\/07\/welcome-to-the-net-framework-4-7-1-early-access\/\">.NET Framework 4.7.1<\/a>\u00a0was created. 
The changes to struct argument passing and block layout, which are purely JIT changes, will automatically propagate to subsequent .NET Framework releases with the new JIT bits (the RyuJIT sources are shared between .NET Core and .NET Framework); the other changes depend on their runtime components to propagate to .NET Framework.<\/em><\/p>\n<h2><a id=\"user-content-improvements-for-span\" class=\"anchor\" href=\"#improvements-for-span\"><\/a>Improvements for Span<\/h2>\n<p>Some of our work was motivated by <a href=\"https:\/\/github.com\/dotnet\/corefxlab\/blob\/master\/docs\/specs\/span.md\">the introduction of <code>Span&lt;T&gt;<\/code><\/a>, so that it and similar types could better deliver on their performance promises.<\/p>\n<p>One such change was <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10910\">#10910<\/a>, which made the JIT recognize the <code>Item<\/code> property getters of <code>Span&lt;T&gt;<\/code> and <code>ReadOnlySpan&lt;T&gt;<\/code> as intrinsics &#8212; the JIT now recognizes calls to these getters and, rather than generate code for them the same way it would for other calls, it transforms them directly into code sequences in its intermediate representation that are similar to the sequences used for the <code>ldelem<\/code> MSIL opcode that fetches an element from an array. 
As noted in the PR&#8217;s <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10910#issuecomment-293511168\">performance assessment<\/a> (n.b., if you follow that link, see also the <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10910#issuecomment-294051549\">follow-up<\/a> where the initially-discovered regressions were fixed with subsequent improvements in <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10956\">#10956<\/a> and <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/20548\">dotnet\/roslyn#20548<\/a>), this improved several benchmarks in the <a href=\"https:\/\/github.com\/dotnet\/coreclr\/tree\/master\/tests\/src\/JIT\/Performance\/CodeQuality\/Span\">tests<\/a> we added to track <code>Span&lt;T&gt;<\/code> performance, by allowing the existing JIT code that optimized array bound checks that are redundant with prior checks, or that are against arrays with known constant length, to kick in for <code>Span&lt;T&gt;<\/code> as well. This is what some of those improved benchmark methods look like, and their improvements:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/e13e5385ffe7784612c77dad6ea6a79e.js\"><\/script><\/p>\n<p>Building on that, change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/11521\">#11521<\/a> updated the analysis machinery the JIT uses to eliminate bounds checks for other provably in-bounds array accesses, to similarly eliminate bounds checks for provably in-bounds <code>Span&lt;T&gt;<\/code> accesses (in particular, bounds checks in <code>for<\/code> loops bounded by <code>span.Length<\/code>). 
As <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/11521#issuecomment-300658768\">noted<\/a> in the PR (numbers <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/11521#issuecomment-301594380\">here<\/a>), this brought the codegen for four more microbenchmarks in the <a href=\"https:\/\/github.com\/dotnet\/coreclr\/tree\/master\/tests\/src\/JIT\/Performance\/CodeQuality\/Span\"><code>Span&lt;T&gt;<\/code> tests<\/a> up to par with the codegen for equivalent patterns with arrays; here are two of them:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/41548f0b3601199f75ab2dd32b679494.js\"><\/script><\/p>\n<p>One key fact that these bounds-check removal optimizations exploit is that array lengths are immutable; any two loads of <code>a.Length<\/code>, if <code>a<\/code> refers to the same array each time, will load the same length value. It&#8217;s common for the JIT to encounter different accesses to the same array, where the reference to the array is held in a local or parameter of type <code>T[]<\/code>, such that it can determine that intervening code hasn&#8217;t modified the local\/parameter in question, even if that intervening code has unknown side-effects. The same isn&#8217;t true for parameters of type <code>ref T[]<\/code>, since intervening code with unknown side-effects might change which array object is referenced. Consider:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/f31c6fc24ab04e2e9643adbcd8015dc0.js\"><\/script><\/p>\n<p>Since <code>Span&lt;T&gt;<\/code> is a struct, some platforms&#8217; ABIs specify that passing an argument of type <code>Span&lt;T&gt;<\/code> actually be done by creating a copy of the struct in the caller&#8217;s stack frame, and passing a pointer to that copy in to the callee via the argument registers\/stack. The JIT&#8217;s internal modeling of this convention is to rewrite <code>Span&lt;T&gt;<\/code> parameters as <code>ref Span&lt;T&gt;<\/code> parameters. 
That internal rewrite at first caused problems for applying bounds-check removal optimizations to spans passed as parameters. The problem was that methods written with by-value <code>Span&lt;T&gt;<\/code> parameters, which at source look analogous to by-value array parameter <code>a<\/code> in the example above, when rewritten looked to the JIT like by-reference parameters, analogous to by-reference array parameter <code>b<\/code> above. This caused the JIT to handle references to such parameters&#8217; <code>Length<\/code> fields with the same conservatism needed for <code>b<\/code> above. Change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10453\">#10453<\/a> taught the JIT to make local copies of such parameters before doing that rewrite (in beneficial cases), so that bounds-check removal optimizations can equally apply to spans passed by value. As <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/10453#issuecomment-297835320\">noted<\/a> in the PR, this change allowed these optimizations to fire in 9 more of the <a href=\"https:\/\/github.com\/dotnet\/coreclr\/tree\/master\/tests\/src\/JIT\/Performance\/CodeQuality\/Span\"><code>Span&lt;T&gt;<\/code> micro-benchmarks<\/a> in our test suite; here are three of them:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/10f80bd5a1cbd72e577d503148116d91.js\"><\/script><\/p>\n<p>This last change applies more generally to any structs passed as parameters (not just <code>Span&lt;T&gt;<\/code>); the JIT is now better able to analyze value propagation through their fields.<\/p>\n<h2><a id=\"user-content-enumhasflag-optimization\" class=\"anchor\" href=\"#enumhasflag-optimization\"><\/a>Enum.HasFlag Optimization<\/h2>\n<p>The <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.enum.hasflag?view=netstandard-2.0\"><code>Enum.HasFlag<\/code> method<\/a> offers nice readability (compare <code>targets.HasFlag(AttributeTargets.Class | AttributeTargets.Struct)<\/code> vs 
<code>(targets &amp; (AttributeTargets.Class | AttributeTargets.Struct)) == (AttributeTargets.Class | AttributeTargets.Struct)<\/code>), but, since it needs to handle reflection cases where the exact enum type isn&#8217;t known until run-time, it is <a href=\"https:\/\/stackoverflow.com\/questions\/7368652\/what-is-it-that-makes-enum-hasflag-so-slow\">notoriously<\/a> expensive. Change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/13748\">#13748<\/a> taught the JIT to recognize when the enum type is known (and known to equal the argument type) at JIT time, and generate the simple bit test rather than the expensive <code>Enum.HasFlag<\/code> call. Here&#8217;s a micro-benchmark to demonstrate, comparing .NET Core 2.0 (which doesn&#8217;t have this change) to a recent daily preview build (which does). Many thanks to <a href=\"https:\/\/github.com\/adamsitnik\">@adamsitnik<\/a> for <a href=\"https:\/\/github.com\/dotnet\/#custom-net-core-runtime\">making it easy<\/a> to use <a href=\"http:\/\/www.benchmarkdotnet.org\">BenchmarkDotNet<\/a> with <a href=\"https:\/\/github.com\/dotnet\/core\/blob\/master\/daily-builds.md\">daily preview builds<\/a> of .NET Core!<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/06f275ca02c1f78835d3c3d597a30c33.js\"><\/script><\/p>\n<p>Output:<\/p>\n<div class=\"highlight highlight-source-ini\">\n<pre class=\"lang:default decode:true\">BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063)\r\nProcessor=Intel Core i7-4790 CPU 3.60GHz (Haswell), ProcessorCount=8\r\nFrequency=3507517 Hz, Resolution=285.1020 ns, Timer=TSC\r\n.NET Core SDK=2.1.0-preview1-007228\r\n  [Host]     : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT\r\n  Job-WFNGKY : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT\r\n  Job-VIXUQP : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT<\/pre>\n<\/div>\n<table border=\"1\" 
cellpadding=\"5\">\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Toolchain<\/th>\n<th align=\"right\">Mean<\/th>\n<th align=\"right\">Error<\/th>\n<th align=\"right\">StdDev<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HasFlag<\/td>\n<td>.NET Core 2.0<\/td>\n<td align=\"right\">14,917.4 ns<\/td>\n<td align=\"right\">80.147 ns<\/td>\n<td align=\"right\">71.048 ns<\/td>\n<\/tr>\n<tr>\n<td>HasFlag<\/td>\n<td>.NET Core 2.1.0-preview1-25719-04<\/td>\n<td align=\"right\">449.3 ns<\/td>\n<td align=\"right\">1.239 ns<\/td>\n<td align=\"right\">1.034 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>With the cool new <a href=\"http:\/\/adamsitnik.com\/Disassembly-Diagnoser\/\">BenchmarkDotNet DisassemblyDiagnoser<\/a> (again thanks to <a href=\"https:\/\/github.com\/adamsitnik\">@adamsitnik<\/a>), we can see that the optimized code really is a simple bit test:<\/p>\n<table border=\"1\" cellpadding=\"5\">\n<thead>\n<tr>\n<th colspan=\"2\">Bench.HasFlag<\/th>\n<\/tr>\n<tr>\n<th>RyuJIT x64 .NET Core 2.0<\/th>\n<th>RyuJIT x64 .NET Core 2.1.0-preview1-25719-04<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<pre><code>HasFlagBench.Bench.HasFlag():\r\npush    rdi\r\npush    rsi\r\npush    rbx\r\nsub     rsp,20h\r\nmov     rsi,rcx\r\nxor     edi,edi\r\nL1:\r\nmov rcx, [[AttributeTargets type]]\r\ncall    [[box]]\r\nmov     rbx,rax\r\nmov rcx, [[AttributeTargets type]]\r\ncall    [[box]]\r\nmov     ecx,dword ptr [rsi+8]\r\nmov     dword ptr [rbx+8],ecx\r\nmov     rcx,rbx\r\nmov     dword ptr [rax+8],0Ch\r\nmov     rdx,rax\r\ncall    [[System.Enum.HasFlag]]\r\nmov     byte ptr [rsi+0Ch],al\r\ninc     edi\r\ncmp     edi,3E8h\r\njl      L1\r\nadd     rsp,20h\r\npop     rbx\r\npop     rsi\r\npop     rdi\r\nret<\/code><\/pre>\n<\/td>\n<td>\n<pre><code>HasFlagBench.Bench.HasFlag():\r\nxor     eax,eax\r\nmov     edx,dword ptr [rcx+8]\r\nL1:\r\nmov     r8d,edx\r\nand     r8d,0Ch\r\ncmp     r8d,0Ch\r\nsete    r8b\r\nmov     byte ptr [rcx+0Ch],r8b\r\ninc     eax\r\ncmp     eax,3E8h\r\njl      
L1\r\nret<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>What&#8217;s more, implementing this optimization involved <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/13815\">implementing<\/a> a new scheme for recognizing intrinsics in the JIT, which is <a href=\"https:\/\/github.com\/dotnet\/coreclr\/issues\/13813\">more flexible<\/a> than the previous scheme, and which <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/14020\">is being leveraged<\/a> in the implementation of <a href=\"https:\/\/github.com\/dotnet\/corefx\/issues\/22940\">Intel SIMD intrinsics for .NET Core<\/a>.<\/p>\n<h2><a id=\"user-content-block-layout-for-search-loops\" class=\"anchor\" href=\"#block-layout-for-search-loops\"><\/a>Block Layout for Search Loops<\/h2>\n<p>Outside of <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/framework\/tools\/mpgo-exe-managed-profile-guided-optimization-tool\">profile-guided optimization<\/a>, the JIT has traditionally been conservative about rearranging the basic blocks of methods it compiles, leaving them in MSIL order except to segregate code it identifies as &#8220;rarely-run&#8221; (e.g. blocks that throw or catch exceptions). Of course, MSIL order isn&#8217;t always the most performant one; notably, in the case of loops with conditional exits\/returns, it&#8217;s generally a good idea to keep the in-loop code together, and move everything on the exit path after the conditional branch out of the loop. For particularly hot loops, this can cause a significant enough difference that developers have <a href=\"https:\/\/github.com\/dotnet\/coreclr\/issues\/9692#issuecomment-307262693\">been<\/a> <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/2667#discussion-diff-49820503\">using<\/a> <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/9213#pullrequestreview-22413856\">gotos<\/a> to make the MSIL order reflect the desired machine code order. 
Change <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/13314\">#13314<\/a> updated the JIT&#8217;s loop detection to effect this layout automatically. As usual, the PR included a <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/13314#issuecomment-321576769\">performance assessment<\/a>,\nwhich noted speed-ups in 5 of the benchmarks in our <a href=\"https:\/\/github.com\/dotnet\/coreclr\/tree\/master\/tests\/src\/JIT\/Performance\/CodeQuality\">performance test suite<\/a>.<\/p>\n<p>Again comparing .NET Core 2.0 (which didn&#8217;t have this change) to a recent daily preview build (which does), let&#8217;s look at the effect on the repro case from the <a href=\"https:\/\/github.com\/dotnet\/coreclr\/issues\/9692\">GitHub issue<\/a> describing this opportunity:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/JosephTremoulet\/589f42fbdf7e71511b24aca259b56b81.js\"><\/script><\/p>\n<p>The results confirm that the new JIT brings the performance of the loop with the in-place <code>return<\/code> in line with the performance of the loop with the <code>goto<\/code>, and that doing so constituted a 15% speed-up:<\/p>\n<div class=\"highlight highlight-source-ini\">\n<pre class=\"lang:default decode:true \">BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063)\r\nProcessor=Intel Core i7-4790 CPU 3.60GHz (Haswell), ProcessorCount=8\r\nFrequency=3507517 Hz, Resolution=285.1020 ns, Timer=TSC\r\n.NET Core SDK=2.1.0-preview1-007228\r\n  [Host]     : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT\r\n  Job-NHAVNC : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT\r\n  Job-CTEHPT : .NET Core 2.1.0-preview1-25719-04 (Framework 4.6.25718.02), 64bit RyuJIT<\/pre>\n<\/div>\n<table border=\"1\" cellpadding=\"5\">\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Toolchain<\/th>\n<th align=\"right\">Mean<\/th>\n<th align=\"right\">Error<\/th>\n<th 
align=\"right\">StdDev<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>LoopReturn<\/td>\n<td>.NET Core 2.0<\/td>\n<td align=\"right\">61.97 ns<\/td>\n<td align=\"right\">0.1254 ns<\/td>\n<td align=\"right\">0.1111 ns<\/td>\n<\/tr>\n<tr>\n<td>LoopGoto<\/td>\n<td>.NET Core 2.0<\/td>\n<td align=\"right\">53.63 ns<\/td>\n<td align=\"right\">0.5171 ns<\/td>\n<td align=\"right\">0.4837 ns<\/td>\n<\/tr>\n<tr>\n<td>LoopReturn<\/td>\n<td>.NET Core 2.1.0-preview1-25719-04<\/td>\n<td align=\"right\">53.75 ns<\/td>\n<td align=\"right\">0.5089 ns<\/td>\n<td align=\"right\">0.4511 ns<\/td>\n<\/tr>\n<tr>\n<td>LoopGoto<\/td>\n<td>.NET Core 2.1.0-preview1-25719-04<\/td>\n<td align=\"right\">53.52 ns<\/td>\n<td align=\"right\">0.0999 ns<\/td>\n<td align=\"right\">0.0934 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Disassembly confirms that the difference is entirely block placement:<\/p>\n<table border=\"1\" cellpadding=\"5\">\n<thead>\n<tr>\n<th colspan=\"2\">LoopWithExit.LoopReturn<\/th>\n<\/tr>\n<tr>\n<th>RyuJIT x64 .NET Core 2.0<\/th>\n<th>RyuJIT x64 .NET Core 2.1.0-preview1-25719-04<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<pre><code>LoopLayoutBench.LoopWithExit.LoopReturn_\r\n(System.String, System.String):\r\nsub     rsp,18h\r\nxor     eax,eax\r\nmov     qword ptr [rsp+10h],rax\r\nmov     qword ptr [rsp+8],rax\r\nmov     ecx,dword ptr [rdx+8]\r\nmov     qword ptr [rsp+10h],rdx\r\nmov     rax,rdx\r\ntest    rax,rax\r\nje      L1\r\nadd     rax,0Ch\r\nL1:\r\nmov     qword ptr [rsp+8],r8\r\nmov     rdx,r8\r\ntest    rdx,rdx\r\nje      L2\r\nadd     rdx,0Ch\r\nL2:\r\ntest    ecx,ecx\r\nje      L5\r\nL3:\r\nmovzx   r8d,word ptr [rax]\r\nmovzx   r9d,word ptr [rdx]\r\ncmp     r8d,r9d\r\nje      L4\r\nxor     eax,eax\r\nadd     rsp,18h\r\nret\r\nL4:\r\nadd     rax,2\r\nadd     rdx,2\r\ndec     ecx\r\ntest    ecx,ecx\r\njne     L3\r\nL5:\r\nmov     eax,1\r\nadd     rsp,18h\r\nret<\/code><\/pre>\n<\/td>\n<td>\n<pre><code>LoopLayoutBench.LoopWithExit.LoopReturn_\r\n(System.String, 
System.String):\r\nsub     rsp,18h\r\nxor     eax,eax\r\nmov     qword ptr [rsp+10h],rax\r\nmov     qword ptr [rsp+8],rax\r\nmov     eax,dword ptr [rdx+8]\r\nmov     qword ptr [rsp+10h],rdx\r\ntest    rdx,rdx\r\nje      L1\r\nadd     rdx,0Ch\r\nL1:\r\nmov     qword ptr [rsp+8],r8\r\nmov     rcx,r8\r\ntest    rcx,rcx\r\nje      L2\r\nadd     rcx,0Ch\r\nL2:\r\ntest    eax,eax\r\nje      L4\r\nL3:\r\nmovzx   r8d,word ptr [rdx]\r\nmovzx   r9d,word ptr [rcx]\r\ncmp     r8d,r9d\r\njne     L5\r\nadd     rdx,2\r\nadd     rcx,2\r\ndec     eax\r\ntest    eax,eax\r\njne     L3\r\nL4:\r\nmov     eax,1\r\nadd     rsp,18h\r\nret\r\nL5:\r\nxor     eax,eax\r\nadd     rsp,18h\r\nret<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table border=\"1\" cellpadding=\"5\">\n<thead>\n<tr>\n<th colspan=\"2\">LoopWithExit.LoopGoto<\/th>\n<\/tr>\n<tr>\n<th>RyuJIT x64 .NET Core 2.0<\/th>\n<th>RyuJIT x64 .NET Core 2.1.0-preview1-25719-04<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<pre><code>LoopLayoutBench.LoopWithExit.LoopGoto_\r\n(System.String, System.String):\r\nsub     rsp,18h\r\nxor     eax,eax\r\nmov     qword ptr [rsp+10h],rax\r\nmov     qword ptr [rsp+8],rax\r\nmov     eax,dword ptr [rcx+8]\r\nmov     qword ptr [rsp+10h],rcx\r\ntest    rcx,rcx\r\nje      L1\r\nadd     rcx,0Ch\r\nL1:\r\nmov     qword ptr [rsp+8],rdx\r\ntest    rdx,rdx\r\nje      L2\r\nadd     rdx,0Ch\r\nL2:\r\ntest    eax,eax\r\nje      L4\r\nL3:\r\nmovzx   r8d,word ptr [rcx]\r\nmovzx   r9d,word ptr [rdx]\r\ncmp     r8d,r9d\r\njne     L5\r\nadd     rcx,2\r\nadd     rdx,2\r\ndec     eax\r\ntest    eax,eax\r\njne     L3\r\nL4:\r\nmov     eax,1\r\nadd     rsp,18h\r\nret\r\nL5:\r\nxor     eax,eax\r\nadd     rsp,18h\r\nret<\/code><\/pre>\n<\/td>\n<td>\n<pre><code>LoopLayoutBench.LoopWithExit.LoopGoto_\r\n(System.String, System.String):\r\nsub     rsp,18h\r\nxor     eax,eax\r\nmov     qword ptr [rsp+10h],rax\r\nmov     qword ptr [rsp+8],rax\r\nmov     eax,dword ptr [rcx+8]\r\nmov     qword ptr 
[rsp+10h],rcx\r\ntest    rcx,rcx\r\nje      L1\r\nadd     rcx,0Ch\r\nL1:\r\nmov     qword ptr [rsp+8],rdx\r\ntest    rdx,rdx\r\nje      L2\r\nadd     rdx,0Ch\r\nL2:\r\ntest    eax,eax\r\nje      L4\r\nL3:\r\nmovzx   r8d,word ptr [rcx]\r\nmovzx   r9d,word ptr [rdx]\r\ncmp     r8d,r9d\r\njne     L5\r\nadd     rcx,2\r\nadd     rdx,2\r\ndec     eax\r\ntest    eax,eax\r\njne     L3\r\nL4:\r\nmov     eax,1\r\nadd     rsp,18h\r\nret\r\nL5:\r\nxor     eax,eax\r\nadd     rsp,18h\r\nret<\/code><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2><a id=\"user-content-conclusion\" class=\"anchor\" href=\"#conclusion\"><\/a>Conclusion<\/h2>\n<p>We&#8217;re constantly pushing to improve our codegen, whether it&#8217;s to enable new scenarios\/features (like <code>Span&lt;T&gt;<\/code>), or to ensure good performance for natural\/readable code (like calls to <code>HasFlag<\/code> and returns from loops). As always, we invite anyone interested to join the community pushing this work forward. RyuJIT documentation available online includes an <a href=\"https:\/\/github.com\/dotnet\/coreclr\/blob\/master\/Documentation\/botr\/ryujit-overview.md\">overview<\/a> and a <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/13079\">recently<\/a> added <a href=\"https:\/\/github.com\/dotnet\/coreclr\/blob\/master\/Documentation\/botr\/ryujit-tutorial.md\">tutorial<\/a>, and our <a href=\"https:\/\/github.com\/dotnet\/coreclr\/labels\/area-CodeGen\">GitHub issues<\/a> are open for (and full of) active discussions!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;d like to tell you about some of the recent changes we&#8217;ve made as part of\u00a0our ongoing work to extend the optimization capabilities of RyuJIT, the\u00a0MSIL-to-native code generator used by .NET Core and .NET Framework. 
I hope it will make for an interesting read, and offer some insight into the sorts of\u00a0optimization opportunities we have [&hellip;]<\/p>\n","protected":false},"author":363,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685],"tags":[108],"class_list":["post-14885","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","tag-performance"],"acf":[],"blog_post_summary":"<p>I&#8217;d like to tell you about some of the recent changes we&#8217;ve made as part of\u00a0our ongoing work to extend the optimization capabilities of RyuJIT, the\u00a0MSIL-to-native code generator used by .NET Core and .NET Framework. I hope it will make for an interesting read, and offer some insight into the sorts of\u00a0optimization opportunities we have [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/14885","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/363"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=14885"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/14885\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=14885"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=14885"},{"taxonomy":"post_ta
g","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=14885"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}