{"id":29683,"date":"2020-09-02T09:50:12","date_gmt":"2020-09-02T16:50:12","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=29683"},"modified":"2021-09-29T12:15:46","modified_gmt":"2021-09-29T19:15:46","slug":"arm64-performance-in-net-5","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/arm64-performance-in-net-5\/","title":{"rendered":"ARM64 Performance in .NET 5"},"content":{"rendered":"<p>The .NET team has significantly improved performance with .NET 5, both generally and for ARM64. You can check out the general improvements in the excellent and detailed <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-5\/\" rel=\"nofollow\">Performance Improvements in .NET 5<\/a> blog by Stephen. In this post, I will describe the performance improvements we made specifically for ARM64 and show the positive impact on the benchmarks we use. I will also share some of the additional opportunities for performance improvements that we have identified and plan to address in a future release.<\/p>\n<p>While we have been working on ARM64 support in RyuJIT for over five years, most of that work went into ensuring that we generate functionally correct ARM64 code. We spent very little time evaluating the performance of the code RyuJIT produced for ARM64. As part of .NET 5, we focused on investigating this area and fixing any obvious issues in RyuJIT that affected ARM64 code quality (CQ). Since the Microsoft VC++ team already supports Windows ARM64, we consulted with them to understand the CQ issues they had encountered when doing a similar exercise.<\/p>\n<p>Although fixing CQ issues is crucial, their impact is sometimes not noticeable in an application. 
Hence, we also wanted to make observable improvements in the performance of the .NET libraries to benefit .NET applications that target ARM64.<\/p>\n<p>Here is the outline I will use to describe our work for improving ARM64 performance in .NET 5:<\/p>\n<ul>\n<li>ARM64-specific optimizations in the .NET libraries.<\/li>\n<li>Evaluation of code quality produced by RyuJIT and resulting outcome.<\/li>\n<\/ul>\n<h2><a id=\"user-content-arm64-hardware-intrinsics-in-net-libraries\" class=\"anchor\" href=\"#arm64-hardware-intrinsics-in-net-libraries\" aria-hidden=\"true\"><\/a>ARM64 hardware intrinsics in .NET libraries<\/h2>\n<p>In .NET Core 3.0, we introduced a new feature called <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/hardware-intrinsics-in-net-core\/\" rel=\"nofollow\">&#8220;hardware intrinsics&#8221;<\/a>, which gives access to various vectorized and non-vectorized instructions that modern hardware supports. .NET developers can access these instructions using a set of APIs under the namespaces <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotNet\/api\/system.runtime.intrinsics?view=net-5.0\" rel=\"nofollow\">System.Runtime.Intrinsics<\/a> and <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.runtime.intrinsics.x86?view=net-5.0\" rel=\"nofollow\">System.Runtime.Intrinsics.X86<\/a> for the x86\/x64 architecture. In .NET 5, we added around 384 APIs under <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.runtime.intrinsics.arm?view=net-5.0\" rel=\"nofollow\">System.Runtime.Intrinsics.Arm<\/a> for the ARM32\/ARM64 architectures. This involved <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues?q=is%3Aissue+label%3Aapi-approved+label%3Aarch-arm64+label%3Aarea-System.Runtime.Intrinsics+is%3Aclosed\">implementing those APIs<\/a> and making RyuJIT aware of them so it can emit the appropriate ARM32\/ARM64 instructions. 
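To give a feel for how these APIs are used, here is a minimal sketch of my own (an illustration, not code from the .NET libraries) that sums an integer span four lanes at a time with `AdvSimd` when the hardware supports it, and otherwise falls back to scalar code:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class VectorSum
{
    public static int Sum(ReadOnlySpan<int> values)
    {
        int i = 0;
        int sum = 0;

        if (AdvSimd.Arm64.IsSupported)
        {
            Vector128<int> acc = Vector128<int>.Zero;
            for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            {
                // For simplicity we build the vector element by element;
                // production code would load directly from memory instead.
                Vector128<int> v = Vector128.Create(
                    values[i], values[i + 1], values[i + 2], values[i + 3]);
                acc = AdvSimd.Add(acc, v);                 // lane-wise NEON add
            }
            sum = AdvSimd.Arm64.AddAcross(acc).ToScalar(); // horizontal add across lanes
        }

        for (; i < values.Length; i++)                     // scalar tail (and full fallback)
        {
            sum += values[i];
        }
        return sum;
    }
}
```

Because `IsSupported` is a JIT-time constant, RyuJIT drops the untaken branch entirely, so the scalar fallback costs nothing on hardware that has AdvSimd.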
We also optimized methods of <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.runtime.intrinsics.vector64?view=net-5.0\" rel=\"nofollow\">Vector64<\/a> and <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.runtime.intrinsics.vector128?view=net-5.0\" rel=\"nofollow\">Vector128<\/a> that provide ways to create and manipulate the <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.runtime.intrinsics.vector64-1?view=net-5.0\" rel=\"nofollow\">Vector64&lt;T&gt;<\/a> and <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.runtime.intrinsics.vector128-1?view=net-5.0\" rel=\"nofollow\">Vector128&lt;T&gt;<\/a> datatypes, on which the majority of the hardware intrinsic APIs operate. If you are interested, refer to the sample usage along with examples of <code>Vector64<\/code> and <code>Vector128<\/code> methods <a href=\"https:\/\/kunalspathak.github.io\/2020-08-01-Vectorization-APIs\/\" rel=\"nofollow\">here<\/a>. You can track the progress of our &#8220;hardware intrinsics&#8221; project <a href=\"https:\/\/github.com\/dotnet\/runtime\/projects\/21\">here<\/a>.<\/p>\n<hr \/>\n<h2><a id=\"user-content-optimized-net-library-code-using-arm64-hardware-intrinsics\" class=\"anchor\" href=\"#optimized-net-library-code-using-arm64-hardware-intrinsics\" aria-hidden=\"true\"><\/a>Optimized .NET library code using ARM64 hardware intrinsics<\/h2>\n<p>In .NET Core 3.1, we optimized many critical methods of the .NET libraries using x86\/x64 intrinsics. Doing that improved the performance of such methods when run on hardware supporting the x86\/x64 intrinsic instructions. For hardware that does not support x86\/x64 intrinsics, such as ARM machines, .NET would fall back to slower implementations of those methods. <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/33308\">dotnet\/runtime#33308<\/a> lists such .NET library methods. In .NET 5, we have optimized most of these methods using ARM64 hardware intrinsics as well. 
So, if your code uses any of those .NET library methods, it will now see a speed boost when running on ARM64. We focused our efforts on methods that were already optimized with x86\/x64 intrinsics, because those were chosen based on an earlier performance analysis (which we didn&#8217;t want to duplicate) and we wanted the product to have generally similar behavior across platforms. Moving forward, we expect to use both x86\/x64 and ARM64 hardware intrinsics as our default approach when we optimize .NET library methods. We still have to decide how this will affect our policy for the PRs that we accept.<\/p>\n<p>For each of the methods that we optimized in .NET 5, I&#8217;ll show you the improvements in terms of the low-level benchmark that we used for validating them. These benchmarks are far from real-world. You&#8217;ll see later in the post how all of these targeted improvements combine to greatly improve .NET on ARM64 in larger, more real-world scenarios.<\/p>\n<h3><a id=\"user-content-systemcollections\" class=\"anchor\" href=\"#systemcollections\" aria-hidden=\"true\"><\/a>System.Collections<\/h3>\n<p><code>System.Collections.BitArray<\/code> methods were optimized by <a href=\"https:\/\/github.com\/Gnbrkm41\">@Gnbrkm41<\/a> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/33749\">dotnet\/runtime#33749<\/a>. 
The following measurements are in <code>nanoseconds<\/code> for <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/c86ef708fc9eea6afad9fac833c2768135a47aa0\/src\/benchmarks\/micro\/libraries\/System.Collections\/Perf.BitArray.cs#L12\">Perf_BitArray<\/a> microbenchmark.<\/p>\n<table>\n<thead>\n<tr>\n<th>BitArray method<\/th>\n<th>Benchmark<\/th>\n<th align=\"center\">.NET Core 3.1<\/th>\n<th align=\"center\">.NET 5<\/th>\n<th align=\"center\">Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>ctor(bool[])<\/code><\/td>\n<td>BitArrayBoolArrayCtor(Size: 512)<\/td>\n<td align=\"center\">1704.68<\/td>\n<td align=\"center\">215.55<\/td>\n<td align=\"center\">-87%<\/td>\n<\/tr>\n<tr>\n<td><code>CopyTo(Array, int)<\/code><\/td>\n<td>BitArrayCopyToBoolArray(Size: 4)<\/td>\n<td align=\"center\">269.20<\/td>\n<td align=\"center\">60.42<\/td>\n<td align=\"center\">-78%<\/td>\n<\/tr>\n<tr>\n<td><code>CopyTo(Array, int)<\/code><\/td>\n<td>BitArrayCopyToIntArray(Size: 4)<\/td>\n<td align=\"center\">87.83<\/td>\n<td align=\"center\">22.24<\/td>\n<td align=\"center\">-75%<\/td>\n<\/tr>\n<tr>\n<td><code>And(BitArray)<\/code><\/td>\n<td>BitArrayAnd(Size: 512)<\/td>\n<td align=\"center\">212.33<\/td>\n<td align=\"center\">65.17<\/td>\n<td align=\"center\">-69%<\/td>\n<\/tr>\n<tr>\n<td><code>Or(BitArray)<\/code><\/td>\n<td>BitArrayOr(Size: 512)<\/td>\n<td align=\"center\">208.82<\/td>\n<td align=\"center\">64.24<\/td>\n<td align=\"center\">-69%<\/td>\n<\/tr>\n<tr>\n<td><code>Xor(BitArray)<\/code><\/td>\n<td>BitArrayXor(Size: 512)<\/td>\n<td align=\"center\">212.34<\/td>\n<td align=\"center\">67.33<\/td>\n<td align=\"center\">-68%<\/td>\n<\/tr>\n<tr>\n<td><code>Not()<\/code><\/td>\n<td>BitArrayNot(Size: 512)<\/td>\n<td align=\"center\">152.55<\/td>\n<td align=\"center\">54.47<\/td>\n<td align=\"center\">-64%<\/td>\n<\/tr>\n<tr>\n<td><code>SetAll(bool)<\/code><\/td>\n<td>BitArraySetAll(Size: 512)<\/td>\n<td align=\"center\">108.41<\/td>\n<td 
align=\"center\">59.71<\/td>\n<td align=\"center\">-45%<\/td>\n<\/tr>\n<tr>\n<td><code>ctor(BitArray)<\/code><\/td>\n<td>BitArrayBitArrayCtor(Size: 4)<\/td>\n<td align=\"center\">113.39<\/td>\n<td align=\"center\">74.63<\/td>\n<td align=\"center\">-34%<\/td>\n<\/tr>\n<tr>\n<td><code>ctor(byte[])<\/code><\/td>\n<td>BitArrayByteArrayCtor(Size: 512)<\/td>\n<td align=\"center\">395.87<\/td>\n<td align=\"center\">356.61<\/td>\n<td align=\"center\">-10%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3><a id=\"user-content-systemnumerics\" class=\"anchor\" href=\"#systemnumerics\" aria-hidden=\"true\"><\/a>System.Numerics<\/h3>\n<p><code>System.Numerics.BitOperations<\/code> methods were optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/34486\">dotnet\/runtime#34486<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/35636\">dotnet\/runtime#35636<\/a>. The following measurements are in <code>nanoseconds<\/code> for <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/454476401e17ed7f4d8b899ecf7661eb6cd63bad\/src\/benchmarks\/micro\/libraries\/System.Numerics.BitOperations\/Perf_BitOperations.cs#L14\">Perf_BitOperations<\/a> microbenchmark.<\/p>\n<table>\n<thead>\n<tr>\n<th>BitOperations method<\/th>\n<th>Benchmark<\/th>\n<th align=\"center\">.NET Core 3.1<\/th>\n<th align=\"center\">.NET 5<\/th>\n<th align=\"center\">Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>LeadingZeroCount(uint)<\/code><\/td>\n<td>LeadingZeroCount_uint<\/td>\n<td align=\"center\">10976.5<\/td>\n<td align=\"center\">1155.85<\/td>\n<td align=\"center\">-89%<\/td>\n<\/tr>\n<tr>\n<td><code>Log2(ulong)<\/code><\/td>\n<td>Log2_ulong<\/td>\n<td align=\"center\">11550.03<\/td>\n<td align=\"center\">1347.46<\/td>\n<td align=\"center\">-88%<\/td>\n<\/tr>\n<tr>\n<td><code>TrailingZeroCount(uint)<\/code><\/td>\n<td>TrailingZeroCount_uint<\/td>\n<td align=\"center\">7313.95<\/td>\n<td align=\"center\">1164.10<\/td>\n<td 
align=\"center\">-84%<\/td>\n<\/tr>\n<tr>\n<td><code>PopCount(ulong)<\/code><\/td>\n<td>PopCount_ulong<\/td>\n<td align=\"center\">4234.18<\/td>\n<td align=\"center\">1541.48<\/td>\n<td align=\"center\">-64%<\/td>\n<\/tr>\n<tr>\n<td><code>PopCount(uint)<\/code><\/td>\n<td>PopCount_uint<\/td>\n<td align=\"center\">4233.58<\/td>\n<td align=\"center\">1733.83<\/td>\n<td align=\"center\">-59%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><code>System.Numerics.Matrix4x4<\/code> methods were optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/40054\">dotnet\/runtime#40054<\/a>. The following measurements are in <code>nanoseconds<\/code> for <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Numerics.Vectors\/Perf_Matrix4x4.cs#L11\">Perf_Matrix4x4<\/a> microbenchmark.<\/p>\n<table>\n<thead>\n<tr>\n<th>Benchmarks<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 5<\/th>\n<th>Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CreateScaleFromVectorWithCenterBenchmark<\/td>\n<td>29.39<\/td>\n<td>24.84<\/td>\n<td>-15%<\/td>\n<\/tr>\n<tr>\n<td>CreateOrthographicBenchmark<\/td>\n<td>17.14<\/td>\n<td>11.19<\/td>\n<td>-35%<\/td>\n<\/tr>\n<tr>\n<td>CreateScaleFromScalarWithCenterBenchmark<\/td>\n<td>26.00<\/td>\n<td>17.14<\/td>\n<td>-34%<\/td>\n<\/tr>\n<tr>\n<td>MultiplyByScalarOperatorBenchmark<\/td>\n<td>28.45<\/td>\n<td>22.06<\/td>\n<td>-22%<\/td>\n<\/tr>\n<tr>\n<td>TranslationBenchmark<\/td>\n<td>15.15<\/td>\n<td>5.39<\/td>\n<td>-64%<\/td>\n<\/tr>\n<tr>\n<td>CreateRotationZBenchmark<\/td>\n<td>50.21<\/td>\n<td>40.24<\/td>\n<td>-20%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p>The SIMD accelerated types <code>System.Numerics.Vector2<\/code>, <code>System.Numerics.Vector3<\/code> and <code>System.Numerics.Vector4<\/code> were optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/35421\">dotnet\/runtime#35421<\/a>, <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36267\">dotnet\/runtime#36267<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36512\">dotnet\/runtime#36512<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36579\">dotnet\/runtime#36579<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37882\">dotnet\/runtime#37882<\/a> to use hardware intrinsics. The following measurements are in <code>nanoseconds<\/code> for <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/adccb815003451dd68586516d4f25f52f3f2ebe7\/src\/benchmarks\/micro\/libraries\/System.Numerics.Vectors\/Perf_Vector2.cs#L12\">Perf_Vector2<\/a>, <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/adccb815003451dd68586516d4f25f52f3f2ebe7\/src\/benchmarks\/micro\/libraries\/System.Numerics.Vectors\/Perf_Vector3.cs#L11\">Perf_Vector3<\/a> and <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/adccb815003451dd68586516d4f25f52f3f2ebe7\/src\/benchmarks\/micro\/libraries\/System.Numerics.Vectors\/Perf_Vector4.cs#L11\">Perf_Vector4<\/a> microbenchmarks.<\/p>\n<table>\n<thead>\n<tr>\n<th>Benchmark<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 
5<\/th>\n<th>Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Perf_Vector2.AddOperatorBenchmark<\/td>\n<td>6.59<\/td>\n<td>1.16<\/td>\n<td>-82%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector2.ClampBenchmark<\/td>\n<td>11.94<\/td>\n<td>1.10<\/td>\n<td>-91%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector2.DistanceBenchmark<\/td>\n<td>6.55<\/td>\n<td>0.70<\/td>\n<td>-89%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector2.MinBenchmark<\/td>\n<td>5.56<\/td>\n<td>1.15<\/td>\n<td>-79%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector2.SubtractFunctionBenchmark<\/td>\n<td>10.78<\/td>\n<td>0.38<\/td>\n<td>-96%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector3.MaxBenchmark<\/td>\n<td>3.46<\/td>\n<td>2.31<\/td>\n<td>-33%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector3.MinBenchmark<\/td>\n<td>3.97<\/td>\n<td>0.38<\/td>\n<td>-90%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector3.MultiplyFunctionBenchmark<\/td>\n<td>3.95<\/td>\n<td>1.16<\/td>\n<td>-71%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector3.MultiplyOperatorBenchmark<\/td>\n<td>4.30<\/td>\n<td>0.77<\/td>\n<td>-82%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector4.AddOperatorBenchmark<\/td>\n<td>4.04<\/td>\n<td>0.77<\/td>\n<td>-81%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector4.ClampBenchmark<\/td>\n<td>4.04<\/td>\n<td>0.69<\/td>\n<td>-83%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector4.DistanceBenchmark<\/td>\n<td>2.12<\/td>\n<td>0.38<\/td>\n<td>-82%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector4.MaxBenchmark<\/td>\n<td>6.74<\/td>\n<td>0.38<\/td>\n<td>-94%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector4.MultiplyFunctionBenchmark<\/td>\n<td>7.67<\/td>\n<td>0.39<\/td>\n<td>-95%<\/td>\n<\/tr>\n<tr>\n<td>Perf_Vector4.MultiplyOperatorBenchmark<\/td>\n<td>3.47<\/td>\n<td>0.34<\/td>\n<td>-90%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3><a id=\"user-content-systemspanhelpers\" class=\"anchor\" href=\"#systemspanhelpers\" aria-hidden=\"true\"><\/a>System.SpanHelpers<\/h3>\n<p><code>System.SpanHelpers<\/code> methods were optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37624\">dotnet\/runtime#37624<\/a> and <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37934\">dotnet\/runtime#37934<\/a>. The following measurements are in <code>nanoseconds<\/code> for the <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/8aed638c9ee65c034fe0cca4ea2bdc3a68d2a6b5\/src\/benchmarks\/micro\/libraries\/System.Memory\/Span.cs#L69\">Span&lt;T&gt;.IndexOfValue<\/a> and <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a6955b7be29e30ec3d8eba83cdf1f8d4de0ae4ff\/src\/benchmarks\/micro\/libraries\/System.Memory\/ReadOnlySpan.cs#L47\">ReadOnlySpan.IndexOfString<\/a> microbenchmarks.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method names<\/th>\n<th>Benchmark<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 5<\/th>\n<th>Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>IndexOf(char)<\/code><\/td>\n<td>Span.IndexOfValue(Size: 512)<\/td>\n<td>66.51<\/td>\n<td>46.88<\/td>\n<td>-30%<\/td>\n<\/tr>\n<tr>\n<td><code>IndexOf(byte)<\/code><\/td>\n<td>Span.IndexOfValue(Size: 512)<\/td>\n<td>34.11<\/td>\n<td>25.41<\/td>\n<td>-25%<\/td>\n<\/tr>\n<tr>\n<td><code>IndexOf(char)<\/code><\/td>\n<td>ReadOnlySpan.IndexOfString()<\/td>\n<td>172.68<\/td>\n<td>137.76<\/td>\n<td>-20%<\/td>\n<\/tr>\n<tr>\n<td><code>IndexOfAnyThreeValue(byte)<\/code><\/td>\n<td>Span.IndexOfAnyThreeValues(Size: 512)<\/td>\n<td>71.22<\/td>\n<td>55.92<\/td>\n<td>-21%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3><a id=\"user-content-systemtext\" class=\"anchor\" href=\"#systemtext\" aria-hidden=\"true\"><\/a>System.Text<\/h3>\n<p>We have also optimized methods in several classes under <code>System.Text<\/code>.<\/p>\n<ul>\n<li>Methods in <code>System.Text.ASCIIUtility<\/code> were optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/38597\">dotnet\/runtime#38597<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39506\">dotnet\/runtime#39506<\/a>.<\/li>\n<li>Methods in <code>System.Text.Unicode<\/code> were optimized in <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/38653\">dotnet\/runtime#38653<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39041\">dotnet\/runtime#39041<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39050\">dotnet\/runtime#39050<\/a>.<\/li>\n<li>Methods in <code>System.Text.Encodings.Web<\/code> were optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/38707\">dotnet\/runtime#38707<\/a>.<\/li>\n<\/ul>\n<p>In .NET 6, we are planning to optimize the remaining methods of <code>System.Text.ASCIIUtility<\/code> described in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/41292\">dotnet\/runtime#41292<\/a>, methods of <code>System.Buffers<\/code> to address <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35033\">dotnet\/runtime#35033<\/a> and merge the work to optimize <code>JsonReaderHelper.IndexOfLessThan<\/code> done by <a href=\"https:\/\/github.com\/benaadams\">Ben Adams<\/a> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/41097\">dotnet\/runtime#41097<\/a>.<\/p>\n<p>All the measurements that I have mentioned above came from our performance lab runs done on Ubuntu machines on <a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/08_06_2020\/report_Daily_ca=ARM64_cb=master_co=Ubuntu1804ARM_cr=dotnetcoresdk_cc=CompliationMode=tiered-RunKind=micro_Baseline_bb=release-3.1.2xx_2020-08-06.html\" rel=\"nofollow\">8\/6\/2020<\/a>, <a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/08_10_2020\/report_Daily_ca=ARM64_cb=master_co=Ubuntu1804ARM_cr=dotnetcoresdk_cc=CompliationMode=tiered-RunKind=micro_Baseline_bb=release-3.1.2xx_2020-08-10.html\" rel=\"nofollow\">8\/10\/2020<\/a> and <a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/08_28_2020\/report_Daily_ca=ARM64_cb=master_co=Ubuntu1804ARM_cr=dotnetcoresdk_cc=CompliationMode=tiered-RunKind=micro_Baseline_bb=release-3.1.2xx_2020-08-28.html\" rel=\"nofollow\">8\/28\/2020<\/a>.<\/p>\n<h3><a id=\"user-content-details\" 
class=\"anchor\" href=\"#details\" aria-hidden=\"true\"><\/a>Details<\/h3>\n<p>It is probably clear at this point how impactful and important hardware intrinsics are. I want to show you more by walking through an example. Imagine that <code>Test()<\/code> returns the leading zero count of its argument <code>value<\/code>.<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">private<\/span> <span class=\"pl-k\">int<\/span> <span class=\"pl-en\">Test<\/span>(<span class=\"pl-k\">uint<\/span> <span class=\"pl-smi\">value<\/span>)\r\n{\r\n    <span class=\"pl-k\">return<\/span> <span class=\"pl-smi\">BitOperations<\/span>.<span class=\"pl-en\">LeadingZeroCount<\/span>(<span class=\"pl-smi\">value<\/span>);\r\n}<\/pre>\n<\/div>\n<p>Before optimization for ARM64, the code would execute the <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/6072e4d3a7a2a1493f514cdf4be75a3d56580e84\/src\/libraries\/System.Private.CoreLib\/src\/System\/Numerics\/BitOperations.cs#L205\">software fallback<\/a> of <code>LeadingZeroCount()<\/code>. 
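That fallback smears the highest set bit into all lower bit positions and then uses a de Bruijn-style multiply-and-shift to index a lookup table. A simplified C# version of the approach (it mirrors the shape of the linked BitOperations source, but treat it as an illustration rather than the exact library code) looks like this:

```csharp
private static ReadOnlySpan<byte> Log2DeBruijn => new byte[32]
{
    0,  9,  1, 10, 13, 21,  2, 29,
    11, 14, 16, 18, 22, 25,  3, 30,
    8, 12, 20, 28, 15, 17, 24,  7,
    19, 27, 23,  6, 26,  5,  4, 31
};

private static int Log2SoftwareFallback(uint value)
{
    // Smear the highest set bit into every lower position...
    value |= value >> 1;
    value |= value >> 2;
    value |= value >> 4;
    value |= value >> 8;
    value |= value >> 16;

    // ...then a de Bruijn multiply puts log2(value) into the top 5 bits,
    // which index the lookup table.
    return Log2DeBruijn[(int)((value * 0x07C4ACDDu) >> 27)];
}

public static int LeadingZeroCount(uint value)
    => value == 0 ? 32 : 31 ^ Log2SoftwareFallback(value);
```

You can spot the same shape in the assembly: the `orr`/`lsr` pairs are the smearing, the `movz`/`movk` pair materializes the `0x07C4ACDD` constant, `ldrb` is the table lookup, and `eor w0, w0, #31` is the final `31 ^ log2` step.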
As you can see in the ARM64 assembly code generated below, not only is it large, but RyuJIT also had to JIT two methods &#8211; <code>Test(int)<\/code> and <code>Log2SoftwareFallback(int)<\/code>.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; Test(int):int<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n<span class=\"pl-en\">        cbnz    w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> M00_L00<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">32<\/span>\r\n<span class=\"pl-en\">        b       M00_L01<\/span>\r\n<span class=\"pl-en\">M00_L00:<\/span>\r\n        <span class=\"pl-v\">bl<\/span><span class=\"pl-en\">      System.Numerics.BitOperations:Log2SoftwareFallback(<\/span><span class=\"pl-k\">int<\/span><span class=\"pl-en\">):<\/span><span class=\"pl-k\">int<\/span>\r\n<span class=\"pl-en\">        eor     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">31<\/span>\r\n<span class=\"pl-en\">M00_L01:<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span 
class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 28, prolog size 8<\/span>\r\n<span class=\"pl-c\">; ============================================================<\/span>\r\n\r\n\r\n<span class=\"pl-c\">; System.Numerics.BitOperations:Log2SoftwareFallback(int):int<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n<span class=\"pl-en\">        lsr     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">1<\/span>\r\n<span class=\"pl-en\">        orr     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        lsr     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">2<\/span>\r\n<span class=\"pl-en\">        orr     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        lsr     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">4<\/span>\r\n<span class=\"pl-en\">        orr     w0<\/span><span 
class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        lsr     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">8<\/span>\r\n<span class=\"pl-en\">        orr     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        lsr     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span>\r\n<span class=\"pl-en\">        orr     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        movz    w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0xacdd<\/span>\r\n<span class=\"pl-en\">        movk    w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0x7c4<\/span> <span class=\"pl-k\">LSL<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">mul<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        lsr     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">27<\/span>\r\n<span class=\"pl-en\">        sxtw    x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span>\r\n<span class=\"pl-en\">        movz    x1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0xc249<\/span>\r\n<span 
class=\"pl-en\">        movk    x1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0x5405<\/span> <span class=\"pl-k\">LSL<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span>\r\n<span class=\"pl-en\">        movk    x1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0x7ffc<\/span> <span class=\"pl-k\">LSL<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">32<\/span>\r\n<span class=\"pl-en\">        ldrb    w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x1<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 92, prolog size 8<\/span><\/pre>\n<\/div>\n<p>After we optimized <code>LeadingZeroCount()<\/code> to use ARM64 intrinsics, the generated code for ARM64 is just a handful of instructions (including the crucial <code>clz<\/code>). In this case, RyuJIT did not even JIT the <code>Log2SoftwareFallback(int)<\/code> method because it was not called. 
Thus, by doing this work, we got improvement in code quality as well as JIT throughput.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; Test(int):int<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n<span class=\"pl-en\">        clz     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 24, prolog size 8<\/span><\/pre>\n<\/div>\n<hr \/>\n<h3><a id=\"user-content-aot-compilation-for-methods-having-arm64-intrinsics\" class=\"anchor\" href=\"#aot-compilation-for-methods-having-arm64-intrinsics\" aria-hidden=\"true\"><\/a>AOT compilation for methods having ARM64 intrinsics<\/h3>\n<p>In the typical case, applications are compiled to machine code at runtime using the JIT. The target machine code produced is very efficient but has the disadvantage of having to do the compilation during execution and this might add some delay during the application start-up. If the target platform is known in advance, you can create ready-to-run (R2R) native images for that target platform. 
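Opting into R2R images is a publish-time option. As a concrete sketch (the project-file properties below are the standard ones; `linux-arm64` is just one possible target runtime identifier):

```xml
<!-- In the application's .csproj: produce a ReadyToRun image for 64-bit Linux on ARM -->
<PropertyGroup>
  <PublishReadyToRun>true</PublishReadyToRun>
  <RuntimeIdentifier>linux-arm64</RuntimeIdentifier>
</PropertyGroup>
```

Equivalently, the options can be passed on the command line: `dotnet publish -c Release -r linux-arm64 -p:PublishReadyToRun=true`.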
This is known as ahead of time (AOT) compilation. It has the advantage of faster startup time because there is no need to produce machine code during execution. The target machine code is already present in the binary and can be run directly. AOT-compiled code might sometimes be suboptimal, but it eventually gets replaced by optimized code.<\/p>\n<p>Until .NET 5, if a method (a .NET library method or a user-defined method) had calls to ARM64 hardware intrinsic APIs (APIs under <code>System.Runtime.Intrinsics<\/code> and <code>System.Runtime.Intrinsics.Arm<\/code>), such a method was never compiled AOT and was always deferred to be compiled at runtime. This had an impact on the start-up time of some .NET apps that used one of these methods in their startup code. In .NET 5, we addressed this problem in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/38060\">dotnet\/runtime#38060<\/a> and are now able to compile such methods AOT.<\/p>\n<hr \/>\n<h2><a id=\"user-content-microbenchmark-analysis\" class=\"anchor\" href=\"#microbenchmark-analysis\" aria-hidden=\"true\"><\/a>Microbenchmark analysis<\/h2>\n<p>Optimizing the .NET libraries with intrinsics was a straightforward step (following in the path of what we&#8217;d already done for x86\/x64). An equally or even more significant project was improving the quality of the code that the JIT generates for ARM64. It&#8217;s important to make that exercise data-oriented. We picked benchmarks that we thought would highlight underlying ARM64 CQ issues. We started with the <a href=\"https:\/\/github.com\/dotnet\/performance\/tree\/master\/src\/benchmarks\/micro\">Microbenchmarks<\/a> that we maintain. There are around 1300 of these benchmarks.<\/p>\n<p>We compared ARM64 and x64 performance numbers for each of these benchmarks. Parity was not our goal; however, it is always useful to have a baseline to compare with, particularly to identify outliers. 
We then identified the benchmarks with the worst performance and determined why that was the case. We tried using profilers like <a href=\"https:\/\/docs.microsoft.com\/windows-hardware\/test\/wpt\/windows-performance-analyzer\" rel=\"nofollow\">WPA<\/a> and <a href=\"https:\/\/github.com\/microsoft\/perfview\">PerfView<\/a>, but they were not useful in this scenario. Those profilers would have pointed out the hottest method in a given benchmark. But since microbenchmarks are tiny, with at most 1 or 2 methods, the hottest method that the profiler pointed to was usually the benchmark method itself. Hence, to understand the ARM64 CQ issues, we decided to just inspect the assembly code produced for a given benchmark and compare it with the x64 assembly. That would help us identify basic issues in RyuJIT&#8217;s ARM64 code generator.<\/p>\n<p>Next, I will describe some of the issues that we found with this exercise.<\/p>\n<h3><a id=\"user-content-memory-barriers-in-arm64\" class=\"anchor\" href=\"#memory-barriers-in-arm64\" aria-hidden=\"true\"><\/a>Memory barriers in ARM64<\/h3>\n<p>Through some of the benchmarks, we noticed accesses of <code>volatile<\/code> variables in the hot loops of critical methods of the <code>System.Collections.Concurrent.ConcurrentDictionary<\/code> class. Accessing <code>volatile<\/code> variables on ARM64 is expensive because they introduce memory barrier instructions. I&#8217;ll describe why shortly. Caching the volatile variable in a local variable outside the loop (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/34225\">dotnet\/runtime#34225<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36976\">dotnet\/runtime#36976<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37081\">dotnet\/runtime#37081<\/a>) resulted in improved performance, as seen below. 
All the measurements are in <code>nanoseconds<\/code>.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method names<\/th>\n<th>Benchmarks<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 5<\/th>\n<th>Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>IsEmpty(string)<\/code><\/td>\n<td><a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Collections\/Concurrent\/IsEmpty.cs#L37\">IsEmpty&lt;String&gt;.Dictionary(Size: 512)<\/a><\/td>\n<td>30.11<\/td>\n<td>19.38<\/td>\n<td>-36%<\/td>\n<\/tr>\n<tr>\n<td><code>TryAdd()<\/code><\/td>\n<td><a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Collections\/Add\/TryAddDefaultSize.cs#L39\">TryAddDefaultSize&lt;Int32&gt;.ConcurrentDictionary(Count: 512)<\/a><\/td>\n<td>557564.35<\/td>\n<td>398071.1<\/td>\n<td>-29%<\/td>\n<\/tr>\n<tr>\n<td><code>IsEmpty(int)<\/code><\/td>\n<td><a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Collections\/Concurrent\/IsEmpty.cs#L37\">IsEmpty&lt;Int32&gt;.Dictionary(Size: 512)<\/a><\/td>\n<td>28.48<\/td>\n<td>20.87<\/td>\n<td>-27%<\/td>\n<\/tr>\n<tr>\n<td><code>ctor()<\/code><\/td>\n<td><a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Collections\/Create\/CtorFromCollection.cs#L60\">CtorFromCollection&lt;Int32&gt;.ConcurrentDictionary(Size: 512)<\/a><\/td>\n<td>497202.32<\/td>\n<td>376048.69<\/td>\n<td>-24%<\/td>\n<\/tr>\n<tr>\n<td><code>get_Count<\/code><\/td>\n<td><a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Collections\/Concurrent\/Count.cs#L37\">Count&lt;Int32&gt;.Dictionary(Size: 
512)<\/a><\/td>\n<td>234404.62<\/td>\n<td>185172.15<\/td>\n<td>-21%<\/td>\n<\/tr>\n<tr>\n<td><code>Add(), Clear()<\/code><\/td>\n<td><a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Collections\/CreateAddAndClear.cs#L168\">CreateAddAndClear&lt;Int32&gt;.ConcurrentDictionary(Size: 512)<\/a><\/td>\n<td>704458.54<\/td>\n<td>581923.04<\/td>\n<td>-17%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We made similar optimizations in the <code>System.Threading.ThreadPool<\/code> class as part of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36697\">dotnet\/runtime#36697<\/a> and in the <code>System.Diagnostics.Tracing.EventCount<\/code> class as part of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37309\">dotnet\/runtime#37309<\/a>.<\/p>\n<h4><a id=\"user-content-arm-memory-model\" class=\"anchor\" href=\"#arm-memory-model\" aria-hidden=\"true\"><\/a>ARM memory model<\/h4>\n<p>The ARM architecture has a weakly ordered memory model. The processor can re-order memory access instructions to improve performance, rearranging them to reduce the time it takes to access memory. Instructions are not guaranteed to execute in the order in which they are written; they may instead execute in an order that depends on the memory access cost of each instruction. This behavior does not impact a single-core machine but can negatively impact a multi-threaded program running on a multicore machine. In such situations, there are instructions that tell the processor not to re-arrange memory accesses past a given point. Instructions that restrict this re-arrangement are called &#8220;memory barriers&#8221;. The <code>dmb<\/code> instruction in ARM64 acts as a barrier, prohibiting the processor from moving an instruction across the fence. 
You can read more about it in the <a href=\"https:\/\/developer.arm.com\/documentation\/den0024\/a\/memory-ordering\" rel=\"nofollow\">ARM developer docs<\/a>.<\/p>\n<p>One way to specify a memory barrier in your code is by using a <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/csharp\/language-reference\/keywords\/volatile\" rel=\"nofollow\">volatile variable<\/a>. With <code>volatile<\/code>, it is guaranteed that the runtime, the JIT, and the processor will not rearrange reads and writes to memory locations for performance. To make this happen, RyuJIT will emit a <code>dmb<\/code> (data memory barrier) instruction for ARM64 every time there is an access (read\/write) to a <code>volatile<\/code> variable.<\/p>\n<p>For example, the following code is taken from the <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/a5296dda39031ac84f40eeb5a0a136c89cde599b\/src\/benchmarks\/micro\/libraries\/System.Threading\/Perf.Volatile.cs#L17\">Perf_Volatile<\/a> microbenchmark. 
It does a volatile read of local field <code>_location<\/code>.<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">public<\/span> <span class=\"pl-k\">class<\/span> <span class=\"pl-en\">Perf_Volatile<\/span>\r\n{\r\n    <span class=\"pl-k\">private<\/span> <span class=\"pl-k\">double<\/span> <span class=\"pl-smi\">_location<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>;\r\n    \r\n    [<span class=\"pl-en\">Benchmark<\/span>]\r\n    <span class=\"pl-k\">public<\/span> <span class=\"pl-k\">double<\/span> <span class=\"pl-en\">Read_double<\/span>() <span class=\"pl-k\">=&gt;<\/span> <span class=\"pl-smi\">Volatile<\/span>.<span class=\"pl-en\">Read<\/span>(<span class=\"pl-k\">ref<\/span> <span class=\"pl-smi\">_location<\/span>);\r\n}<\/pre>\n<\/div>\n<p>The generated relevant machine code of <code>Read_double<\/code> for ARM64 is:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; Read_double():double:this<\/span>\r\n\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">8<\/span>\r\n<span class=\"pl-en\">        ldr     d0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">        dmb     ishld<\/span><\/pre>\n<\/div>\n<p>The code first gets the address of <code>_location<\/code> field, loads the value in <code>d0<\/code> register and then execute <code>dmb ishld<\/code> that acts as a data memory barrier.<\/p>\n<p>Although this guarantees the memory ordering, there is a cost associated with it. The processor must now guarantee that all the data access done before the memory barrier is visible to all the cores after the barrier instruction which could be time consuming. 
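To see what this cost looks like in code, here is a minimal C sketch (a hypothetical illustration, not .NET library code; C11 acquire loads are used as a stand-in for C# `volatile` reads, which have similar one-way barrier semantics). On ARM64, each acquire load typically compiles to an `ldar`, or to an `ldr` followed by `dmb ishld`, so reading the shared variable once into a local pays the barrier cost once instead of on every iteration:

```c
#include <stdatomic.h>

/* Shared counter; assume another thread may update it. */
atomic_int shared_count;

/* Acquire-load inside the loop condition: one barrier per iteration. */
long sum_reload(const int *data) {
    long s = 0;
    for (int i = 0; i < atomic_load_explicit(&shared_count, memory_order_acquire); i++)
        s += data[i];
    return s;
}

/* Acquire-load hoisted into a local: a single barrier up front. */
long sum_hoisted(const int *data) {
    int n = atomic_load_explicit(&shared_count, memory_order_acquire);
    long s = 0;
    for (int i = 0; i < n; i++)
        s += data[i];
    return s;
}
```

This is the same shape as the library fixes mentioned earlier: read the volatile field into a local before the loop, then use only the local inside the loop.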
Hence, it is important to avoid or minimize the use of such data accesses inside hot methods and loops as much as possible.<\/p>\n<h3><a id=\"user-content-arm64-and-big-constants\" class=\"anchor\" href=\"#arm64-and-big-constants\" aria-hidden=\"true\"><\/a>ARM64 and big constants<\/h3>\n<p>In .NET 5, we made some improvements in the way we handle large constants present in user code. We started eliminating redundant loads of large constants in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39096\">dotnet\/runtime#39096<\/a>, which gave us around a <strong>1%<\/strong> (521K bytes, to be precise) improvement in the size of the ARM64 code we produced for all the .NET libraries.<\/p>\n<p>It is worth noting that sometimes JIT improvements are not reflected in the microbenchmark runs but are beneficial to overall code quality. In such cases, the RyuJIT team reports the improvements that were made in terms of .NET libraries code size. RyuJIT is run on the entire set of .NET library dlls before and after changes to understand how much impact the optimization has made, and which libraries were optimized more than others. As of preview 8, the emitted code size of the entire .NET libraries for the ARM64 target is 45 MB. A <strong>1%<\/strong> improvement means we emit 450 KB less code in .NET 5, which is substantial. You can see the individual numbers of methods that were improved <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39096#issuecomment-656859475\">here<\/a>.<\/p>\n<h4><a id=\"user-content-details-1\" class=\"anchor\" href=\"#details-1\" aria-hidden=\"true\"><\/a>Details<\/h4>\n<p>ARM64 has an instruction set architecture (ISA) with fixed-length encoding, where each instruction is exactly 32 bits long. Because of this, a move instruction <code>mov<\/code> has space to encode only a 16-bit unsigned constant. To move a bigger constant value, we need to move the value in multiple steps using chunks of 16 bits (<code>movz\/movk<\/code>). 
Due to this, multiple <code>mov<\/code> instructions are generated to construct a single bigger constant that needs to be saved in a register. In contrast, on x64 a single <code>mov<\/code> can load a bigger constant.<\/p>\n<p>Now imagine code containing a couple of constants (<code>2981231<\/code> and <code>2981235<\/code>).<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">public<\/span> <span class=\"pl-k\">static<\/span> <span class=\"pl-k\">uint<\/span> <span class=\"pl-en\">GetHashCode<\/span>(<span class=\"pl-k\">uint<\/span> <span class=\"pl-smi\">a<\/span>, <span class=\"pl-k\">uint<\/span> <span class=\"pl-smi\">b<\/span>)\r\n{\r\n  <span class=\"pl-k\">return<\/span>  ((<span class=\"pl-smi\">a<\/span> <span class=\"pl-k\">*<\/span> <span class=\"pl-c1\">2981231<\/span>) <span class=\"pl-k\">*<\/span> <span class=\"pl-smi\">b<\/span>) <span class=\"pl-k\">+<\/span> <span class=\"pl-c1\">2981235<\/span>;\r\n}<\/pre>\n<\/div>\n<p>Before we optimized this pattern, we would generate code to construct each constant. 
So, if they are present in a loop, they would get constructed for every iteration.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        movz    w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0x7d6f<\/span>\r\n<span class=\"pl-en\">        movk    w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">45<\/span> <span class=\"pl-k\">LSL<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-c\">  ; &lt;-- loads 2981231 in w2<\/span>\r\n        <span class=\"pl-k\">mul<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w2<\/span>\r\n        <span class=\"pl-k\">mul<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        movz    w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0x7d73<\/span>\r\n<span class=\"pl-en\">        movk    w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">45<\/span> <span class=\"pl-k\">LSL<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-c\">  ; &lt;-- loads 2981235 in w1<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><\/pre>\n<\/div>\n<p>In .NET 5, we are now loading such constants once in a register and whenever possible, reusing them in the code. 
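The arithmetic behind these encodings can be sketched in C (a hypothetical illustration, not RyuJIT code): `movz` loads a 16-bit immediate while zeroing the rest of the register, `movk` inserts another 16-bit immediate at a shifted position, and a nearby constant can then be formed with a single `add`.

```c
#include <stdint.h>

/* Simulates movz: load a 16-bit immediate into the low half, zeroing the rest. */
static uint32_t movz(uint16_t imm) { return imm; }

/* Simulates movk with LSL #16: insert a 16-bit immediate into bits [31:16]. */
static uint32_t movk_lsl16(uint32_t reg, uint16_t imm) {
    return (reg & 0x0000FFFFu) | ((uint32_t)imm << 16);
}

/* Constructs 2981231 the same way as the codegen above: movz + movk. */
static uint32_t build_constant(void) {
    uint32_t w2 = movz(0x7d6f);   /* movz w2, #0x7d6f      */
    w2 = movk_lsl16(w2, 45);      /* movk w2, #45, LSL #16 */
    return w2;                    /* 45 << 16 | 0x7d6f == 2981231 */
}
```

Since `2981235 - 2981231 == 4`, the second constant needs only `build_constant() + 4`, which is exactly the single `add` the optimized codegen emits.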
If there is more than one constant whose difference with the optimized constant is below a certain threshold, then we use the optimized constant that is already in a register to construct the other constant(s). Below, we used the value in register <code>w2<\/code> (<code>2981231<\/code> in this case) to calculate constant <code>2981235<\/code>.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        movz    w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">0x7d6f<\/span>\r\n<span class=\"pl-en\">        movk    w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">45<\/span> <span class=\"pl-k\">LSL<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-c\">  ; &lt;-- loads 2981231<\/span>\r\n        <span class=\"pl-k\">mul<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w2<\/span>\r\n        <span class=\"pl-k\">mul<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">4<\/span><span class=\"pl-c\">       ; &lt;-- loads 2981235<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><\/pre>\n<\/div>\n<p>This optimization was helpful not just for loading constants but also for loading method addresses because they are 64-bits long on ARM64.<\/p>\n<h3><a id=\"user-content-c-structs\" class=\"anchor\" 
href=\"#c-structs\" aria-hidden=\"true\"><\/a>C# structs<\/h3>\n<p>We made good progress in optimizing ARM64 scenarios that return a C# struct, and got a <strong>0.19%<\/strong> code size improvement in the .NET libraries. Before .NET 5, we always created a struct on the stack before doing any operation on it. Any updates to its fields were done on the stack. When returning, the fields had to be copied from the stack into the return registers. Likewise, when a <code>struct<\/code> was returned from a method, we would store it on the stack before operating on it. In .NET 5, we started enregistering structs that can be returned using multiple registers in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36862\">dotnet\/runtime#36862<\/a>, meaning that in certain cases the structs won&#8217;t be created on the stack but will be directly created and manipulated using registers. With that, we eliminated the expensive memory accesses in methods using structs. This was substantial work that improved scenarios that operate on the stack.<\/p>\n<p>The following measurements are in <code>nanoseconds<\/code> for the <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/bf83f43cc9fb262b45bd2200028f12c3e05d3155\/src\/benchmarks\/micro\/libraries\/System.Memory\/Constructors.cs#L15\">ReadOnlySpan&lt;T&gt; and Span&lt;T&gt; .ctor()<\/a> microbenchmarks that operate on <code>ReadOnlySpan&lt;T&gt;<\/code> and <code>Span&lt;T&gt;<\/code> structs.<\/p>\n<table>\n<thead>\n<tr>\n<th>Benchmark<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 
5<\/th>\n<th>Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Constructors&lt;Byte&gt;.MemoryMarshalCreateSpan<\/td>\n<td>7.58<\/td>\n<td>0.43<\/td>\n<td>-94%<\/td>\n<\/tr>\n<tr>\n<td>Constructors_ValueTypesOnly&lt;Int32&gt;.ReadOnlyFromPointerLength<\/td>\n<td>7.22<\/td>\n<td>0.43<\/td>\n<td>-94%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ReadOnlySpanFromArray<\/td>\n<td>6.47<\/td>\n<td>0.43<\/td>\n<td>-93%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.SpanImplicitCastFromArray<\/td>\n<td>4.26<\/td>\n<td>0.41<\/td>\n<td>-90%<\/td>\n<\/tr>\n<tr>\n<td>Constructors_ValueTypesOnly&lt;Byte&gt;.ReadOnlyFromPointerLength<\/td>\n<td>6.45<\/td>\n<td>0.64<\/td>\n<td>-90%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ArrayAsSpanStartLength<\/td>\n<td>4.02<\/td>\n<td>0.4<\/td>\n<td>-90%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ReadOnlySpanImplicitCastFromSpan<\/td>\n<td>34.03<\/td>\n<td>4.35<\/td>\n<td>-87%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ArrayAsSpan<\/td>\n<td>8.34<\/td>\n<td>1.48<\/td>\n<td>-82%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ReadOnlySpanImplicitCastFromArraySegment<\/td>\n<td>18.38<\/td>\n<td>3.4<\/td>\n<td>-81%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ReadOnlySpanImplicitCastFromArray<\/td>\n<td>17.87<\/td>\n<td>3.5<\/td>\n<td>-80%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.SpanImplicitCastFromArraySegment<\/td>\n<td>18.62<\/td>\n<td>3.88<\/td>\n<td>-79%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.SpanFromArrayStartLength<\/td>\n<td>50.9<\/td>\n<td>14.27<\/td>\n<td>-72%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.MemoryFromArrayStartLength<\/td>\n<td>54.31<\/td>\n<td>16.23<\/td>\n<td>-70%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ReadOnlySpanFromArrayStartLength<\/td>\n<td>17.34<\/td>\n<td>5.39<\/td>\n<td>-69%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.SpanFromMemory<\/td>\n<td>8.95<\/td>\n<td>3.09<\/td>\n<td>-65%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&
lt;String&gt;.ArrayAsMemory<\/td>\n<td>53.56<\/td>\n<td>18.54<\/td>\n<td>-65%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ReadOnlyMemoryFromArrayStartLength<\/td>\n<td>9.053<\/td>\n<td>3.27<\/td>\n<td>-64%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.MemoryFromArrayStartLength<\/td>\n<td>9.060<\/td>\n<td>3.3<\/td>\n<td>-64%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ArrayAsMemoryStartLength<\/td>\n<td>53.00<\/td>\n<td>19.31<\/td>\n<td>-64%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.SpanImplicitCastFromArraySegment<\/td>\n<td>63.62<\/td>\n<td>25.6<\/td>\n<td>-60%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ArrayAsMemoryStartLength<\/td>\n<td>9.07<\/td>\n<td>3.66<\/td>\n<td>-60%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ReadOnlyMemoryFromArray<\/td>\n<td>9.06<\/td>\n<td>3.7<\/td>\n<td>-59%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.SpanFromArray<\/td>\n<td>8.39<\/td>\n<td>3.44<\/td>\n<td>-59%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.MemoryMarshalCreateSpan<\/td>\n<td>14.43<\/td>\n<td>7.28<\/td>\n<td>-50%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.MemoryFromArray<\/td>\n<td>6.21<\/td>\n<td>3.22<\/td>\n<td>-48%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ReadOnlySpanFromMemory<\/td>\n<td>12.95<\/td>\n<td>7.35<\/td>\n<td>-43%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ReadOnlySpanImplicitCastFromArraySegment<\/td>\n<td>31.84<\/td>\n<td>18.08<\/td>\n<td>-43%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.ReadOnlyMemoryFromArrayStartLength<\/td>\n<td>9.06<\/td>\n<td>5.52<\/td>\n<td>-39%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ReadOnlyMemoryFromArray<\/td>\n<td>6.24<\/td>\n<td>4.13<\/td>\n<td>-34%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;String&gt;.SpanFromMemory<\/td>\n<td>20.87<\/td>\n<td>15.05<\/td>\n<td>-28%<\/td>\n<\/tr>\n<tr>\n<td>Constructors&lt;Byte&gt;.ReadOnlySpanImplicitCastFromArray<\/td>\n<td>4.47<\/td>\n<td>3.44<\/td>\n<td>-23%<\/td>\n<\/tr>\n<\/tbody>\n<\/table
>\n<h4><a id=\"user-content-details-2\" class=\"anchor\" href=\"#details-2\" aria-hidden=\"true\"><\/a>Details<\/h4>\n<p>In .NET Core 3.1, when a function created and returned a <code>struct<\/code> containing fields that can fit in a register like <code>float<\/code>, we were always creating and storing the <code>struct<\/code> on stack. Let us see an example:<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">public<\/span> <span class=\"pl-k\">struct<\/span> <span class=\"pl-en\">MyStruct<\/span>\r\n{\r\n  <span class=\"pl-k\">public<\/span> <span class=\"pl-k\">float<\/span> <span class=\"pl-smi\">a<\/span>;\r\n  <span class=\"pl-k\">public<\/span> <span class=\"pl-k\">float<\/span> <span class=\"pl-smi\">b<\/span>;\r\n}\r\n\r\n[<span class=\"pl-en\">MethodImpl<\/span>(<span class=\"pl-smi\">MethodImplOptions<\/span>.<span class=\"pl-smi\">NoInlining<\/span>)]\r\n<span class=\"pl-k\">public<\/span> <span class=\"pl-k\">static<\/span> <span class=\"pl-en\">MyStruct<\/span> <span class=\"pl-en\">GetMyStruct<\/span>(<span class=\"pl-k\">float<\/span> <span class=\"pl-smi\">i<\/span>, <span class=\"pl-k\">float<\/span> <span class=\"pl-smi\">j<\/span>)\r\n{\r\n  <span class=\"pl-en\">MyStruct<\/span> <span class=\"pl-smi\">mys<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-k\">new<\/span> <span class=\"pl-en\">MyStruct<\/span>();\r\n  <span class=\"pl-smi\">mys<\/span>.<span class=\"pl-smi\">a<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">+<\/span> <span class=\"pl-smi\">j<\/span>;\r\n  <span class=\"pl-smi\">mys<\/span>.<span class=\"pl-smi\">b<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">-<\/span> <span class=\"pl-smi\">j<\/span>;\r\n  <span class=\"pl-k\">return<\/span> <span class=\"pl-smi\">mys<\/span>;\r\n}\r\n\r\n<span class=\"pl-k\">public<\/span> <span class=\"pl-k\">static<\/span> <span class=\"pl-k\">float<\/span> <span 
class=\"pl-en\">GetTotal<\/span>(<span class=\"pl-k\">float<\/span> <span class=\"pl-smi\">i<\/span>, <span class=\"pl-k\">float<\/span> <span class=\"pl-smi\">j<\/span>)\r\n{\r\n  <span class=\"pl-en\">MyStruct<\/span> <span class=\"pl-smi\">mys<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-en\">GetMyStruct<\/span>(<span class=\"pl-smi\">i<\/span>, <span class=\"pl-smi\">j<\/span>);\r\n  <span class=\"pl-k\">return<\/span> <span class=\"pl-smi\">mys<\/span>.<span class=\"pl-smi\">a<\/span> <span class=\"pl-k\">+<\/span> <span class=\"pl-smi\">mys<\/span>.<span class=\"pl-smi\">b<\/span>;\r\n}\r\n\r\n<span class=\"pl-k\">public<\/span> <span class=\"pl-k\">static<\/span> <span class=\"pl-k\">void<\/span> <span class=\"pl-en\">Main<\/span>()\r\n{\r\n  <span class=\"pl-en\">GetTotal<\/span>(<span class=\"pl-c1\">1.5f<\/span>, <span class=\"pl-c1\">2.5f<\/span>);\r\n}<\/pre>\n<\/div>\n<p>Here is the code we generated in .NET Core 3.1. If you see below, we created the <code>struct<\/code> on stack at location <code>[fp+24]<\/code> and then stored the <code>i+j<\/code> and <code>i-j<\/code> result in fields <code>a<\/code> and <code>b<\/code> located at <code>[fp+24]<\/code> and <code>[fp+28]<\/code> respectively. We finally loaded those fields from stack into the registers <code>s0<\/code> and <code>s1<\/code> to return the result. 
The caller <code>GetTotal()<\/code> would also save the returned <code>struct<\/code> on stack before operating on it.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; GetMyStruct(float,float):struct<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">32<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     xzr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">24<\/span><span class=\"pl-s1\">]<\/span>\t\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">24   <\/span><span class=\"pl-c\">; &lt;-- struct created on stack at [fp+24]<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     xzr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">fadd<\/span><span class=\"pl-en\">    s16<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s1<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     s16<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span 
class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">24<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\"> ; &lt;-- mys.a = i + j<\/span>\r\n        <span class=\"pl-k\">fsub<\/span><span class=\"pl-en\">    s16<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s1<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     s16<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">28<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\"> ; &lt;-- mys.b = i - j<\/span>\r\n<span class=\"pl-en\">        ldr     s0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">24<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\">  ; returning the struct field 'a' in s0<\/span>\r\n<span class=\"pl-en\">        ldr     s1<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">28<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\">  ; returning the struct field 'b' in s1<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">32<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 52, prolog size 12<\/span>\r\n<span class=\"pl-c\">; 
============================================================<\/span>\r\n\r\n<span class=\"pl-c\">; GetTotal(float,float):float<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">32<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n        <span class=\"pl-k\">call<\/span>    <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">GetMyStruct(<\/span><span class=\"pl-c1\">float<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-c1\">float<\/span><span class=\"pl-en\">):MyStruct<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     s0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">24<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\">   ; store mys.a on stack<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     s1<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">28<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\">   ; store mys.b on stack<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">24<\/span>    \r\n<span class=\"pl-en\">        ldr     s0<\/span><span 
class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\">       ; load again in register<\/span>\r\n<span class=\"pl-en\">        ldr     s16<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">4<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">fadd<\/span><span class=\"pl-en\">    s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s16<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">32<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 44, prolog size 8<\/span>\r\n<\/pre>\n<\/div>\n<p>With the enregistration work, we do not create the <code>struct<\/code> on stack anymore in certain scenarios. With that, we do not have to load the field values from stack into the return registers. 
Here is the optimized code in .NET 5:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; GetMyStruct(float,float):MyStruct<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n        <span class=\"pl-k\">fadd<\/span><span class=\"pl-en\">    s16<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s1<\/span>\r\n        <span class=\"pl-k\">fsub<\/span><span class=\"pl-en\">    s1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s1<\/span><span class=\"pl-c\">   ; s1 contains value of 'b'<\/span>\r\n<span class=\"pl-en\">        fmov    s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s16<\/span><span class=\"pl-c\">      ; s0 contains value of 'a'<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n\r\n<span class=\"pl-c\">; Total bytes of code 28, prolog size 8<\/span>\r\n<span class=\"pl-c\">; ============================================================<\/span>\r\n\r\n<span class=\"pl-c\">; 
GetTotal(float,float):float<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n        <span class=\"pl-k\">call<\/span>    <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">GetMyStruct(<\/span><span class=\"pl-c1\">float<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-c1\">float<\/span><span class=\"pl-en\">):MyStruct<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">        fmov    s16<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s1<\/span>\r\n        <span class=\"pl-k\">fadd<\/span><span class=\"pl-en\">    s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> s16<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n\r\n<span class=\"pl-c\">; Total bytes of code 28, prolog size 8<\/span>\r\n<span class=\"pl-c\">; ============================================================<\/span>\r\n\r\n<span class=\"pl-c\">; 
The stack space needed for both methods has also been reduced from <code>32 bytes<\/code> to <code>16 bytes<\/code>.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39326\">dotnet\/runtime#39326<\/a> is a work in progress to similarly optimize fields of structs that are passed in registers, which we plan to ship in the next release. We also found issues like <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35071\">dotnet\/runtime#35071<\/a>, where we perform redundant stores and loads when handling struct arguments or <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/build\/arm64-windows-abi-conventions?view=vs-2019\" rel=\"nofollow\">HFA registers<\/a>, or always push arguments onto the stack before using them in a method, as seen in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35635\">dotnet\/runtime#35635<\/a>. We hope to address these issues in a future release.<\/p>\n<h3><a id=\"user-content-array-access-with-post-index-addressing-mode\" class=\"anchor\" href=\"#array-access-with-post-index-addressing-mode\" aria-hidden=\"true\"><\/a>Array access with post-index addressing mode<\/h3>\n<p>ARM64 has various addressing modes that can be used in load\/store instructions to compute the memory address an operation needs to access. &#8220;Post-index&#8221; addressing mode is one of them. It is usually used when consecutive accesses to memory locations (from a fixed base address) are needed. A typical example is array element access in a loop, where the base address of the array is fixed and the elements sit in consecutive memory at a fixed offset from one another. One of the issues we found was that we were not using post-index addressing mode in our generated ARM64 code, but instead generating many instructions to calculate the address of an array element. 
We will address <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/34810\">dotnet\/runtime#34810<\/a> in a future release.<\/p>\n<h4><a id=\"user-content-details-3\" class=\"anchor\" href=\"#details-3\" aria-hidden=\"true\"><\/a>Details<\/h4>\n<p>Consider a loop that stores a value in an array element.<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">public<\/span> <span class=\"pl-k\">int<\/span>[] <span class=\"pl-en\">Test<\/span>()\r\n{\r\n    <span class=\"pl-k\">int<\/span>[] <span class=\"pl-smi\">arr<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-k\">new<\/span> <span class=\"pl-k\">int<\/span>[<span class=\"pl-c1\">10<\/span>];\r\n    <span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>;\r\n    <span class=\"pl-k\">while<\/span> (<span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-c1\">9<\/span>)\r\n    {\r\n        <span class=\"pl-smi\">arr<\/span>[<span class=\"pl-smi\">i<\/span>] <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">1<\/span>;  <span class=\"pl-c\"><span class=\"pl-c\">\/\/<\/span> &lt;---- IG03<\/span>\r\n        <span class=\"pl-smi\">i<\/span><span class=\"pl-k\">++<\/span>;\r\n    }\r\n    <span class=\"pl-k\">return<\/span> <span class=\"pl-smi\">arr<\/span>;\r\n}<\/pre>\n<\/div>\n<p>To store <code>1<\/code> inside <code>arr[i]<\/code>, we need to generate instructions to calculate the address of <code>arr[i]<\/code> in every iteration. 
For example, on x64 this is as simple as:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">...<\/span>\r\n<span class=\"pl-en\">M00_L00:<\/span>\r\n        <span class=\"pl-k\">movsxd<\/span>   <span class=\"pl-v\">rcx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">edx<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">      dword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">4<\/span><span class=\"pl-s1\">*<\/span><span class=\"pl-v\">rcx<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-c1\">1<\/span>\r\n        <span class=\"pl-k\">inc<\/span>      <span class=\"pl-v\">edx<\/span>\r\n        <span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">edx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-c1\">9<\/span>\r\n        <span class=\"pl-k\">jl<\/span><span class=\"pl-en\">       SHORT M00_L00<\/span>\r\n<span class=\"pl-en\">...<\/span><\/pre>\n<\/div>\n<p><code>rax<\/code> stores the base address of the array <code>arr<\/code>. <code>rcx<\/code> holds the value of <code>i<\/code>, and since the array is of type <code>int<\/code>, we multiply it by <code>4<\/code>. <code>rax+4*rcx<\/code> forms the address of the array element at the <code>i<\/code>th index. <code>16<\/code> is the offset from the base address at which elements are stored. All of this executes in a loop.<\/p>\n<p>However, for ARM64, we generate longer code, as seen below. We generate 3 instructions to calculate the array element address and a 4th instruction to store the value. 
We do this calculation in every iteration of a loop.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">...<\/span>\r\n<span class=\"pl-en\">M00_L00:<\/span>\r\n<span class=\"pl-en\">        sxtw    x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><span class=\"pl-c\">        ; load 'i' from w1<\/span>\r\n        <span class=\"pl-k\">lsl<\/span><span class=\"pl-en\">     x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">2<\/span><span class=\"pl-c\">    ; x2 *= 4<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-c\">   ; x2 += 16<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     w3<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">1<\/span><span class=\"pl-c\">        ; w3 = 1<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     w3<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x2<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-c\">  ; store w3 in [x0 + x2]<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">1<\/span><span class=\"pl-c\">    ; w1++<\/span>\r\n        <span class=\"pl-k\">cmp<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">9<\/span><span class=\"pl-c\">        ; repeat while i &lt; 9<\/span>\r\n<span 
class=\"pl-en\">        blt     M00_L00<\/span>\r\n<span class=\"pl-en\">...<\/span><\/pre>\n<\/div>\n<p>With post-index addressing mode, much of this recalculation can be eliminated. With this addressing mode, we can auto-increment the address present in a register to get the next array element. The optimized code is shown below. After every iteration, the contents of <code>x1<\/code> are auto-incremented by 4, yielding the address of the next array element.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; x1 contains &lt;&lt;base address of arr&gt;&gt;+16<\/span>\r\n<span class=\"pl-c\">; w0 contains value \"1\"<\/span>\r\n<span class=\"pl-c\">; w1 contains value of \"i\"<\/span>\r\n\r\n<span class=\"pl-en\">M00_L00:<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x1<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-c1\">4<\/span><span class=\"pl-c\">  ; post-index addressing mode<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">1<\/span>\r\n        <span class=\"pl-k\">cmp<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">9<\/span>\r\n<span class=\"pl-en\">        blt     M00_L00<\/span><\/pre>\n<\/div>\n<p>Fixing this issue will result in both performance and code size improvements.<\/p>\n<h3><a id=\"user-content-mod-operations\" class=\"anchor\" href=\"#mod-operations\" aria-hidden=\"true\"><\/a>Mod operations<\/h3>\n<p>Modulo operations are crucial in many algorithms, and currently we do not generate good quality code for certain scenarios.\nIn <code>a % b<\/code>, if <code>a<\/code> is an <code>unsigned int<\/code> 
and <code>b<\/code> is a constant power of 2, the ARM64 code generated today is:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        lsr     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">2<\/span>\r\n        <span class=\"pl-k\">lsl<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">2<\/span>\r\n        <span class=\"pl-k\">sub<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><\/pre>\n<\/div>\n<p>But it can instead be optimized to:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre>        <span class=\"pl-k\">and<\/span><span class=\"pl-en\">     w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> &lt;&lt;b <\/span><span class=\"pl-s1\">-<\/span> <span class=\"pl-c1\">1<\/span><span class=\"pl-en\">&gt;&gt;<\/span><\/pre>\n<\/div>\n<p>Another scenario that we could optimize is when <code>b<\/code> is a variable. 
Today, we generate:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        udiv    w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><span class=\"pl-c\">   ; sdiv if 'a' is signed int<\/span>\r\n        <span class=\"pl-k\">mul<\/span><span class=\"pl-en\">     w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n        <span class=\"pl-k\">sub<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><\/pre>\n<\/div>\n<p>The last two instructions can be combined into a single instruction to generate:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        udiv    w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        msub    w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-c\">   ; w0 = w0 - (w2 * w1)<\/span><\/pre>\n<\/div>\n<p>We will address <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/34937\">dotnet\/runtime#34937<\/a> in a future release.<\/p>\n<hr \/>\n<h2><a id=\"user-content-code-size-analysis\" class=\"anchor\" href=\"#code-size-analysis\" aria-hidden=\"true\"><\/a>Code size analysis<\/h2>\n<p>Understanding the size of the ARM64 code we produced and reducing it was an important task for us in .NET 5. 
Not only does it improve the memory consumption of the .NET runtime, it also reduces the disk footprint of R2R binaries that are compiled ahead-of-time.<\/p>\n<p>We found some good areas where we could reduce the ARM64 code size, and the results were astonishing. In addition to some of the work I mentioned above, after we optimized the code generated for call indirects in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/35675\">dotnet\/runtime#35675<\/a> and virtual call stubs in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36817\">dotnet\/runtime#36817<\/a>, we saw a code size improvement of <strong>13%<\/strong> on .NET library R2R images. We also compared the ARM64 code produced in .NET Core 3.1 vs. .NET 5 for the <a href=\"https:\/\/www.nuget.org\/stats\" rel=\"nofollow\">top 25 NuGet packages<\/a>. On average, we improved the code size of R2R images by <strong>16.61%<\/strong>. Below are the NuGet package names and versions along with the percentage improvement. All measurements are in <code>bytes<\/code> (lower is better).<\/p>\n<table>\n<thead>\n<tr>\n<th>Nuget package<\/th>\n<th>Nuget version<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 5<\/th>\n<th>Code size 
improvement<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Microsoft.EntityFrameworkCore<\/td>\n<td>3.1.6<\/td>\n<td>2414572<\/td>\n<td>1944756<\/td>\n<td>-19.46%<\/td>\n<\/tr>\n<tr>\n<td>HtmlAgilityPack<\/td>\n<td>1.11.24<\/td>\n<td>255700<\/td>\n<td>205944<\/td>\n<td>-19.46%<\/td>\n<\/tr>\n<tr>\n<td>WebDriver<\/td>\n<td>3.141.0<\/td>\n<td>330236<\/td>\n<td>266116<\/td>\n<td>-19.42%<\/td>\n<\/tr>\n<tr>\n<td>System.Data.SqlClient<\/td>\n<td>4.8.1<\/td>\n<td>118588<\/td>\n<td>96636<\/td>\n<td>-18.51%<\/td>\n<\/tr>\n<tr>\n<td>System.Web.Razor<\/td>\n<td>3.2.7<\/td>\n<td>474180<\/td>\n<td>387296<\/td>\n<td>-18.32%<\/td>\n<\/tr>\n<tr>\n<td>Moq<\/td>\n<td>4.14.5<\/td>\n<td>307540<\/td>\n<td>251264<\/td>\n<td>-18.30%<\/td>\n<\/tr>\n<tr>\n<td>MongoDB.Bson<\/td>\n<td>2.11.0<\/td>\n<td>863688<\/td>\n<td>706152<\/td>\n<td>-18.24%<\/td>\n<\/tr>\n<tr>\n<td>AWSSDK.Core<\/td>\n<td>3.3.107.32<\/td>\n<td>889712<\/td>\n<td>728000<\/td>\n<td>-18.18%<\/td>\n<\/tr>\n<tr>\n<td>AutoMapper<\/td>\n<td>10.0.0<\/td>\n<td>411132<\/td>\n<td>338068<\/td>\n<td>-17.77%<\/td>\n<\/tr>\n<tr>\n<td>xunit.core<\/td>\n<td>2.4.1<\/td>\n<td>41488<\/td>\n<td>34192<\/td>\n<td>-17.59%<\/td>\n<\/tr>\n<tr>\n<td>Google.Protobuf<\/td>\n<td>3.12.4<\/td>\n<td>643172<\/td>\n<td>532372<\/td>\n<td>-17.23%<\/td>\n<\/tr>\n<tr>\n<td>xunit.execution.dotnet<\/td>\n<td>2.4.1<\/td>\n<td>313116<\/td>\n<td>259212<\/td>\n<td>-17.22%<\/td>\n<\/tr>\n<tr>\n<td>nunit.framework<\/td>\n<td>3.12.0<\/td>\n<td>722228<\/td>\n<td>598976<\/td>\n<td>-17.07%<\/td>\n<\/tr>\n<tr>\n<td>Xamarin.Forms.Core<\/td>\n<td>4.7.0.1239<\/td>\n<td>1740552<\/td>\n<td>1444740<\/td>\n<td>-17.00%<\/td>\n<\/tr>\n<tr>\n<td>Castle.Core<\/td>\n<td>4.4.1<\/td>\n<td>389552<\/td>\n<td>323892<\/td>\n<td>-16.86%<\/td>\n<\/tr>\n<tr>\n<td>Serilog<\/td>\n<td>2.9.0<\/td>\n<td>167020<\/td>\n<td>139308<\/td>\n<td>-16.59%<\/td>\n<\/tr>\n<tr>\n<td>MongoDB.Driver.Core<\/td>\n<td>2.11.0<\/td>\n<td>1281668<\/td>\n<td>1069768<\/td>\n<td>-16.53%<\/td>\n<\/tr>\n<tr>\n<td>Newtonsoft.
Json<\/td>\n<td>12.0.3<\/td>\n<td>1056372<\/td>\n<td>882724<\/td>\n<td>-16.44%<\/td>\n<\/tr>\n<tr>\n<td>polly<\/td>\n<td>7.2.1<\/td>\n<td>353456<\/td>\n<td>297120<\/td>\n<td>-15.94%<\/td>\n<\/tr>\n<tr>\n<td>StackExchange.Redis<\/td>\n<td>2.1.58<\/td>\n<td>1031668<\/td>\n<td>867804<\/td>\n<td>-15.88%<\/td>\n<\/tr>\n<tr>\n<td>RabbitMQ.Client<\/td>\n<td>6.1.0<\/td>\n<td>355372<\/td>\n<td>299152<\/td>\n<td>-15.82%<\/td>\n<\/tr>\n<tr>\n<td>Grpc.Core.Api<\/td>\n<td>2.30.0<\/td>\n<td>36488<\/td>\n<td>30912<\/td>\n<td>-15.28%<\/td>\n<\/tr>\n<tr>\n<td>Grpc.Core<\/td>\n<td>2.30.0<\/td>\n<td>190820<\/td>\n<td>161764<\/td>\n<td>-15.23%<\/td>\n<\/tr>\n<tr>\n<td>ICSharpCode.SharpZipLib<\/td>\n<td>1.2.0<\/td>\n<td>306236<\/td>\n<td>261244<\/td>\n<td>-14.69%<\/td>\n<\/tr>\n<tr>\n<td>Swashbuckle.AspNetCore.Swagger<\/td>\n<td>5.5.1<\/td>\n<td>5872<\/td>\n<td>5112<\/td>\n<td>-12.94%<\/td>\n<\/tr>\n<tr>\n<td>JetBrains.Annotations<\/td>\n<td>2020.1.0<\/td>\n<td>7736<\/td>\n<td>6824<\/td>\n<td>-11.79%<\/td>\n<\/tr>\n<tr>\n<td>Elasticsearch.Net<\/td>\n<td>7.8.2<\/td>\n<td>1904684<\/td>\n<td>1702216<\/td>\n<td>-10.63%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that most of the above packages might not include R2R images; we picked these packages for our code size measurement because they are among the most downloaded packages and are written for a wide variety of domains.<\/p>\n<h3><a id=\"user-content-inline-heuristics-tweaking\" class=\"anchor\" href=\"#inline-heuristics-tweaking\" aria-hidden=\"true\"><\/a>Inline heuristics tweaking<\/h3>\n<p>Currently, RyuJIT uses various heuristics to decide whether inlining a method will be beneficial or not. Among them is a check on the code size of the caller into which the callee gets inlined. 
The code size heuristic is based upon <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/e100d5ed292786284ef4f3ee678be5f7c43a0a53\/src\/coreclr\/src\/jit\/inline.cpp#L1027\">x64 code<\/a>, which has different characteristics than ARM64 code. We explored some ways to fine-tune it for ARM64 but did not see promising results. We will continue exploring these heuristics in the future.<\/p>\n<h3><a id=\"user-content-return-address-hijacking\" class=\"anchor\" href=\"#return-address-hijacking\" aria-hidden=\"true\"><\/a>Return address hijacking<\/h3>\n<p>While doing the code size analysis, we noticed that for small methods, ARM64 code includes a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Function_prologue#Prologue\" rel=\"nofollow\">prologue<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Function_prologue#Epilogue\" rel=\"nofollow\">epilogue<\/a> for every method, even when they are not needed. Small methods often get inlined into the caller, but there are scenarios where this does not happen. Consider a method <code>AdditionalCount()<\/code> that is marked <code>NoInlining<\/code>. This method will not get inlined into its caller. 
In this method, let us invoke the <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/8a2820e35c5ad841d4c3aae3af8b1ace37d22660\/src\/libraries\/System.Collections\/src\/System\/Collections\/Generic\/Stack.cs#L58\">Stack&lt;T&gt;.Count<\/a> getter.<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre>[<span class=\"pl-en\">MethodImpl<\/span>(<span class=\"pl-smi\">MethodImplOptions<\/span>.<span class=\"pl-smi\">NoInlining<\/span>)]\r\n<span class=\"pl-k\">public<\/span> <span class=\"pl-k\">static<\/span> <span class=\"pl-k\">int<\/span> <span class=\"pl-en\">AdditionalCount<\/span>(<span class=\"pl-en\">Stack<\/span>&lt;<span class=\"pl-k\">string<\/span>&gt; <span class=\"pl-smi\">a<\/span>, <span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">b<\/span>)\r\n{\r\n    <span class=\"pl-k\">return<\/span> <span class=\"pl-smi\">a<\/span>.<span class=\"pl-smi\">Count<\/span> <span class=\"pl-k\">+<\/span> <span class=\"pl-smi\">b<\/span>;\r\n}<\/pre>\n<\/div>\n<p>Since there are no local variables in <code>AdditionalCount()<\/code>, nothing is retrieved from the stack, and hence there is no need to prepare and restore the stack&#8217;s state using a prologue and epilogue. Below is the code generated for x64. 
Notice that the x64 code for this method is just 6 bytes, with a 0-byte prolog.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; AdditionalCount(System.Collections.Generic.Stack`1[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]],int):int<\/span>\r\n\r\n        <span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">eax<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">edx<\/span>\r\n        <span class=\"pl-k\">add<\/span>      <span class=\"pl-v\">eax<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> dword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rcx<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">ret<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 6, prolog size 0<\/span><\/pre>\n<\/div>\n<p>However, for ARM64, we generate a prologue and epilogue even though nothing is stored on or retrieved from the stack. 
As seen below, the code size is 24 bytes, with 8 bytes of prologue, four times the x64 code size.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; AdditionalCount(System.Collections.Generic.Stack`1[[System.String, System.Private.CoreLib, Version=5.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]],int):int<\/span>\r\n\r\n<span class=\"pl-en\">        stp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-s1\">-<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><span class=\"pl-en\">!<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     fp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">sp<\/span>\r\n<span class=\"pl-en\">        ldr     w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">add<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span>\r\n<span class=\"pl-en\">        ldp     fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> lr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-v\">sp<\/span><span class=\"pl-s1\">],<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span>\r\n        <span class=\"pl-k\">ret<\/span><span class=\"pl-en\">     lr<\/span>\r\n\r\n<span class=\"pl-c\">; Total bytes of code 24, prolog size 8<\/span><\/pre>\n<\/div>\n<p>Our investigation showed that approximately <strong>23%<\/strong> of methods in the .NET libraries skip generating 
a prologue\/epilogue for x64, while for ARM64, we generate an extra 16 bytes of code to store and retrieve the <code>fp<\/code> and <code>lr<\/code> registers. We need to do this to support <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/a21da8f6d945002bbb7cdb426c148867f60be528\/docs\/design\/coreclr\/jit\/arm64-jit-frame-layout.md\">return address hijacking<\/a>. If the .NET runtime needs to trigger garbage collection (GC), it must bring user code execution to a safe point before it can start the GC. For ARM64, this has been done by generating a prologue\/epilogue in the user&#8217;s code that stores the return address present in the <code>lr<\/code> register on the stack and retrieves it back before returning. If the runtime decides to trigger a GC while executing user code, it replaces the return address present on the stack with the address of a runtime helper function. When the method completes execution, it retrieves the modified return address from the stack into <code>lr<\/code> and thus returns to the runtime helper function, so the runtime can perform the GC. After the GC is complete, control jumps back to the original return address of the user code. None of this is needed for x64 code because the return address is already on the stack and can be retrieved by the runtime. It may be possible to <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d76ef042f8ead9d06a447ab2b1004ae626185ca2\/src\/coreclr\/src\/jit\/codegencommon.cpp#L4886\">optimize return address hijacking<\/a> for certain scenarios. 
In a future release, we will investigate <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35274\">dotnet\/runtime#35274<\/a> further to reduce the code size and improve the speed of small methods.<\/p>\n<h3><a id=\"user-content-arm64-code-characteristics\" class=\"anchor\" href=\"#arm64-code-characteristics\" aria-hidden=\"true\"><\/a>ARM64 code characteristics<\/h3>\n<p>Although there are various issues that we have identified and continue optimizing to improve the code size produced for ARM64, there are certain aspects of the ARM ISA that cannot be changed and are worth mentioning here.<\/p>\n<p>While x86 is a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Complex_instruction_set_computer\" rel=\"nofollow\">CISC<\/a> architecture and ARM is a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Reduced_instruction_set_computer\" rel=\"nofollow\">RISC<\/a> architecture, it is nearly impossible for x86 and ARM targets to have similar code size for the same method. ARM has a fixed-length encoding of 4 bytes, in contrast to x86, which has a variable-length encoding. A return instruction <code>ret<\/code> on x86 can be as short as 1 byte, but on ARM64 it is always 4 bytes long. Because of the fixed-length encoding in ARM, there is a limited range of constant values that can be encoded inside an instruction, as I mentioned in the <a href=\"#user-content-arm64-and-big-constants\">ARM64 and big constants section<\/a>. Any constant bigger than 12 bits (sometimes 16 bits) must be moved to a register and operated on through that register. Basic arithmetic instructions like <code>add<\/code> and <code>sub<\/code> cannot operate on constant values bigger than 12 bits. Data cannot be transferred directly from memory to memory. It must be loaded into a register before being transferred or operated on. If there are any constants that need to be stored in memory, those constants must first be moved into a register before being stored to memory. 
Even to access memory using the various addressing modes, the address has to be moved into a register before data can be loaded from or stored to it. Thus, in various places, prerequisite or setup instructions are needed to load data into registers before performing the actual operation. All of this can lead to a larger code size on ARM64 targets.<\/p>\n<hr \/>\n<h2><a id=\"user-content-peephole-analysis\" class=\"anchor\" href=\"#peephole-analysis\" aria-hidden=\"true\"><\/a>Peephole analysis<\/h2>\n<p>The last topic that I would like to mention is our data-driven engineering approach to discovering and prioritizing some other important ARM64 code quality enhancements. When inspecting ARM64 code produced for .NET libraries with several benchmarks, we realized that there were several instruction patterns that could be replaced with better and more performant instructions. In compiler literature, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Peephole_optimization\" rel=\"nofollow\">&#8220;peephole optimization&#8221;<\/a> is the phase that does such optimizations. RyuJIT currently does not have a peephole optimization phase. Adding a new compiler phase is a big task and can easily take a few months to get right without impacting other metrics like JIT throughput. Additionally, we were not sure how much code size or speed improvement such an optimization would give us. Hence, we gathered data in an interesting way to discover and prioritize various opportunities for peephole optimization. We wrote a utility tool, <a href=\"https:\/\/github.com\/dotnet\/jitutils\/tree\/master\/src\/AnalyzeAsm\">AnalyzeAsm<\/a>, that would scan through an approximately 1 GB file containing the ARM64 disassembly of .NET library methods and report back the frequency of the instruction patterns we were interested in, along with the methods in which they occur. 
With that information, it became easier for us to decide that a minimal implementation of peephole optimization phase was important. With <code>AnalyzeAsm<\/code>, we identified several peephole opportunities that would give us roughly <strong>0.75%<\/strong> improvement in the code size of the .NET libraries. In .NET 5, we optimized an instruction pattern by eliminating redundant opposite <code>mov<\/code> instructions in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/38179\">dotnet\/runtime#38179<\/a> which gave us <strong>0.28%<\/strong> code size improvement. Percentage-wise, the improvements are not large, but they are meaningful in the context of the whole product.<\/p>\n<h3><a id=\"user-content-details-4\" class=\"anchor\" href=\"#details-4\" aria-hidden=\"true\"><\/a>Details<\/h3>\n<p>I would like to highlight some of the peephole opportunities that we have found and hoping to address them in .NET 6.<\/p>\n<h4><a id=\"user-content-replace-pair-of-ldr-with-ldp\" class=\"anchor\" href=\"#replace-pair-of-ldr-with-ldp\" aria-hidden=\"true\"><\/a>Replace pair of &#8220;ldr&#8221; with &#8220;ldp&#8221;<\/h4>\n<p>If there are pair of consecutive load instructions <code>ldr<\/code> that loads data into a register from consecutive memory location, then the pair can be replaced by single load-pair instruction <code>ldp<\/code>.<\/p>\n<p>So below pattern:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        ldr     x23<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">        ldr     x24<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">24<\/span><span 
class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>can be replaced with:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        ldp     x23<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x24<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>As seen in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35130\">dotnet\/runtime#35130<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35132\">dotnet\/runtime#35132<\/a>, <code>AnalyzeAsm<\/code> pointed out that this pattern occurs approximately <strong>34,000<\/strong> times in <strong>16,000<\/strong> methods.<\/p>\n<h4><a id=\"user-content-replace-pair-of-str-with-stp\" class=\"anchor\" href=\"#replace-pair-of-str-with-stp\" aria-hidden=\"true\"><\/a>Replace pair of &#8220;str&#8221; with &#8220;stp&#8221;<\/h4>\n<p>This is similar pattern as above, except that if there are pair of consecutive store instructions <code>str<\/code> that stores data from a register into consecutive memory location, then the pair can be replaced by single store-pair instruction <code>stp<\/code>.<\/p>\n<p>So below pattern:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre>        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     x23<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     x24<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">24<\/span><span 
class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>can be replaced with:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        stp     x23<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> x24<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>As seen in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35133\">dotnet\/runtime#35133<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35134\">dotnet\/runtime#35134<\/a>, <code>AnalyzeAsm<\/code> pointed out that this pattern occurs approximately <strong>35,000<\/strong> times in <strong>16,400<\/strong> methods.<\/p>\n<h4><a id=\"user-content-replace-pair-of-str-wzr-with-str-xzr\" class=\"anchor\" href=\"#replace-pair-of-str-wzr-with-str-xzr\" aria-hidden=\"true\"><\/a>Replace pair of &#8220;str wzr&#8221; with &#8220;str xzr&#8221;<\/h4>\n<p><code>wzr<\/code> is the 4-byte zero register, while <code>xzr<\/code> is the 8-byte zero register in ARM64. 
If there is a pair of consecutive instructions that store <code>wzr<\/code> into consecutive memory locations, the pair can be replaced by a single store of the <code>xzr<\/code> value.<\/p>\n<p>So the pattern below:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre>        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     wzr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     wzr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">12<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>can be replaced with:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre>        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     xzr<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x2<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>As seen in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35136\">dotnet\/runtime#35136<\/a>, <code>AnalyzeAsm<\/code> pointed out that this pattern occurs approximately <strong>450<\/strong> times in <strong>353<\/strong> methods.<\/p>\n<h4><a id=\"user-content-remove-redundant-ldr-and-str\" class=\"anchor\" href=\"#remove-redundant-ldr-and-str\" aria-hidden=\"true\"><\/a>Remove redundant &#8220;ldr&#8221; and &#8220;str&#8221;<\/h4>\n<p>Another pattern that we were generating was loading a value from a memory location into a register and then storing that value back from the register into the same memory location. The second instruction was redundant and could be removed. 
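<\/p>
<p>The pair patterns above are easy to find with a textual scan over the JIT disassembly. The following is a hypothetical Python sketch of that kind of scan; it is not the actual <code>AnalyzeAsm<\/code> tool, and the regex and function names are invented for illustration:<\/p>

```python
import re

# Match "ldr <reg>, [<base>, #<offset>]" in a disassembly listing.
LDR = re.compile(r"ldr\s+(\w+),\s*\[(\w+),\s*#(\d+)\]")

def find_ldp_candidates(lines):
    """Flag consecutive ldr instructions that read adjacent 8-byte slots
    off the same base register; a peephole pass could fuse each such
    pair into a single ldp."""
    candidates = []
    for first, second in zip(lines, lines[1:]):
        a, b = LDR.match(first.strip()), LDR.match(second.strip())
        if (a and b and a.group(2) == b.group(2)
                and int(b.group(3)) - int(a.group(3)) == 8):
            candidates.append((first.strip(), second.strip()))
    return candidates

asm = ["ldr x23, [x19, #16]", "ldr x24, [x19, #24]"]
print(find_ldp_candidates(asm))
```

<p>The real peephole pass in RyuJIT operates on instruction descriptors rather than text, but the matching idea is the same.<\/p>
<p>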
Likewise, if there is a store followed by a load, it is safe to eliminate the second load instruction.<\/p>\n<p>So the pattern below:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        ldr     w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">64<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">str<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">64<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>can be optimized to:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        ldr     w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">x19<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> #<\/span><span class=\"pl-c1\">64<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>As seen in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35613\">dotnet\/runtime#35613<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35614\">dotnet\/runtime#35614<\/a>, <code>AnalyzeAsm<\/code> pointed out that this pattern occurs approximately <strong>2,570<\/strong> times in <strong>1,750<\/strong> methods. We are already in the process of addressing this optimization in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39222\">dotnet\/runtime#39222<\/a>.<\/p>\n<h4><a id=\"user-content-replace-ldr-with-mov\" class=\"anchor\" href=\"#replace-ldr-with-mov\" aria-hidden=\"true\"><\/a>Replace &#8220;ldr&#8221; with &#8220;mov&#8221;<\/h4>\n<p>RyuJIT rarely generates code that will load two registers from the same memory location, but we have seen that pattern in library methods. 
The second load instruction can be converted to a <code>mov<\/code> instruction, which is cheaper and does not need a memory access.<\/p>\n<p>So the pattern below:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        ldr     w1<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">28<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">        ldr     w0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">28<\/span><span class=\"pl-s1\">]<\/span><\/pre>\n<\/div>\n<p>can be optimized to:<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">        ldr     w1<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-s1\">[<\/span><span class=\"pl-en\">fp<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\">#<\/span><span class=\"pl-c1\">28<\/span><span class=\"pl-s1\">]<\/span>\r\n        <span class=\"pl-k\">mov<\/span><span class=\"pl-en\">     w0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> w1<\/span><\/pre>\n<\/div>\n<p>As seen in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35141\">dotnet\/runtime#35141<\/a>, <code>AnalyzeAsm<\/code> pointed out that this pattern occurs approximately <strong>540<\/strong> times in <strong>300<\/strong> methods.<\/p>\n<h4><a id=\"user-content-loading-large-constants-using-movzmovk\" class=\"anchor\" href=\"#loading-large-constants-using-movzmovk\" aria-hidden=\"true\"><\/a>Loading large constants using movz\/movk<\/h4>\n<p>Since large constants cannot be encoded in an ARM64 instruction, as I have <a href=\"#user-content-arm64-code-characteristics\">described above<\/a>, we also found a large number of occurrences of <code>movz\/movk<\/code> pairs (around 
<strong>191,028<\/strong> of them in <strong>4,578<\/strong> methods). In .NET 5, while some of these patterns were optimized by caching them, as done in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39096\">dotnet\/runtime#39096<\/a>, we hope to revisit the remaining patterns and come up with a way to reduce them.<\/p>\n<h4><a id=\"user-content-call-indirects-and-virtual-stubs\" class=\"anchor\" href=\"#call-indirects-and-virtual-stubs\" aria-hidden=\"true\"><\/a>Call indirects and virtual stubs<\/h4>\n<p>Lastly, as I <a href=\"#user-content-code-size-analysis\">mentioned above<\/a>, the <strong>14%<\/strong> code size improvement in the .NET libraries came from optimizing <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/35675\">call indirects<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/36817\">virtual call stubs<\/a> in R2R code. It was possible to prioritize this work from the data we obtained by using <code>AnalyzeAsm<\/code> on the JIT disassembly of the .NET libraries. It pointed out that the suboptimal pattern occurred approximately <strong>615,700<\/strong> times in <strong>126,800<\/strong> methods.<\/p>\n<hr \/>\n<h2><a id=\"user-content-techempower-benchmarks\" class=\"anchor\" href=\"#techempower-benchmarks\" aria-hidden=\"true\"><\/a>TechEmpower benchmarks<\/h2>\n<p>With all of the work that I described above and other work described in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-5\/\" rel=\"nofollow\">this blog<\/a>, we made significant improvements to ARM64 performance in the TechEmpower benchmarks. 
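<\/p>
<p>The improvement column in the table that follows is a plain relative delta between the two releases. As a quick illustration (ordinary arithmetic, not part of any benchmarking tool), the JSON and Single Query rows work out as:<\/p>

```python
def improvement(old_rps, new_rps):
    """Percentage improvement of new_rps over old_rps (higher is better)."""
    return (new_rps - old_rps) / old_rps * 100

# JSON RPS: .NET Core 3.1 -> .NET 5
print(f"{improvement(484256, 542463):.2f}%")  # 12.02%
# Single Query RPS: .NET Core 3.1 -> .NET 5
print(f"{improvement(49663, 53392):.2f}%")    # 7.51%
```

<p>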
The measurements below are for Requests \/ Second (higher is better).<\/p>\n<table>\n<thead>\n<tr>\n<th>TechEmpower Platform Benchmark<\/th>\n<th>.NET Core 3.1<\/th>\n<th>.NET 5<\/th>\n<th>Improvements<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>JSON RPS<\/td>\n<td>484,256<\/td>\n<td>542,463<\/td>\n<td>+12.02%<\/td>\n<\/tr>\n<tr>\n<td>Single Query RPS<\/td>\n<td>49,663<\/td>\n<td>53,392<\/td>\n<td>+7.51%<\/td>\n<\/tr>\n<tr>\n<td>20-Query RPS<\/td>\n<td>10,730<\/td>\n<td>11,114<\/td>\n<td>+3.58%<\/td>\n<\/tr>\n<tr>\n<td>Fortunes RPS<\/td>\n<td>61,164<\/td>\n<td>71,528<\/td>\n<td>+16.95%<\/td>\n<\/tr>\n<tr>\n<td>Updates RPS<\/td>\n<td>9,154<\/td>\n<td>10,217<\/td>\n<td>+11.61%<\/td>\n<\/tr>\n<tr>\n<td>Plaintext RPS<\/td>\n<td>6,763,328<\/td>\n<td>7,415,041<\/td>\n<td>+9.64%<\/td>\n<\/tr>\n<tr>\n<td>TechEmpower Performance Rating (TPR)<\/td>\n<td>484<\/td>\n<td>538<\/td>\n<td>+11.16%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr \/>\n<h2><a id=\"user-content-hardware\" class=\"anchor\" href=\"#hardware\" aria-hidden=\"true\"><\/a>Hardware<\/h2>\n<p>Here are the hardware details of the machines we used to run the benchmarks I have covered in this blog.<\/p>\n<h3><a id=\"user-content-microbenchmarks\" class=\"anchor\" href=\"#microbenchmarks\" aria-hidden=\"true\"><\/a>MicroBenchmarks<\/h3>\n<p>Our performance lab that runs microbenchmarks has the following hardware configuration.<\/p>\n<pre><code>ARM64v8\r\nMemory:              96510MB\r\nArchitecture:        aarch64\r\nByte Order:          Little Endian\r\nCPU(s):              46\r\nOn-line CPU(s) list: 0-45\r\nThread(s) per core:  1\r\nCore(s) per socket:  46\r\nSocket(s):           1\r\nNUMA node(s):        1\r\nVendor ID:           Qualcomm\r\nModel:               1\r\nModel name:          Falkor\r\nStepping:            0x0\r\nCPU max MHz:         2600.0000\r\nCPU min MHz:         600.0000\r\nBogoMIPS:            40.00\r\nL1d 
cache:           32K\r\nL1i cache:           64K\r\nL2 cache:            512K\r\nL3 cache:            58880K\r\nNUMA node0 CPU(s):   0-45\r\nFlags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid asimdrdm\r\n<\/code><\/pre>\n<h3><a id=\"user-content-techempower-benchmarks-1\" class=\"anchor\" href=\"#techempower-benchmarks-1\" aria-hidden=\"true\"><\/a>TechEmpower benchmarks<\/h3>\n<p>Our ASP.NET lab that runs the TechEmpower benchmarks has the following hardware configuration.<\/p>\n<pre><code>Rack-Mount, 1U\r\nThinkSystem HR330A\r\n1x 32-Core\/3.0GHz eMAG CPU\r\n64GB DDR4 (8x8GB)\r\n1x 960GB NVMe M.2 SSD\r\n1x Single-Port 50GbE NIC\r\n2x Serial Ports\r\n1x 1GbE Management Port\r\nUbuntu 18.04\r\nARMv8\r\n\r\nArchitecture:        aarch64\r\nByte Order:          Little Endian\r\nCPU(s):              32\r\nOn-line CPU(s) list: 0-31\r\nThread(s) per core:  1\r\nCore(s) per socket:  32\r\nSocket(s):           1\r\nNUMA node(s):        1\r\nVendor ID:           APM\r\nModel:               2\r\nModel name:          X-Gene\r\nStepping:            0x3\r\nCPU max MHz:         3300.0000\r\nCPU min MHz:         363.9700\r\nBogoMIPS:            80.00\r\nL1d cache:           32K\r\nL1i cache:           32K\r\nL2 cache:            256K\r\nNUMA node0 CPU(s):   0-31\r\n<\/code><\/pre>\n<h2><a id=\"user-content-conclusion\" class=\"anchor\" href=\"#conclusion\" aria-hidden=\"true\"><\/a>Conclusion<\/h2>\n<p>In .NET 5, we made great progress in improving the speed and code size for the ARM64 target. Not only did we expose ARM64 intrinsics in .NET APIs, but we also consumed them in our library code to optimize critical methods. With our data-driven engineering approach, we were able to prioritize high-impact work items in .NET 5. 
While doing performance investigations, we also discovered several opportunities, summarized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/35853\">dotnet\/runtime#35853<\/a>, that we plan to continue working on in .NET 6. We had a great partnership with <a href=\"https:\/\/github.com\/TamarChristinaArm\">@TamarChristinaArm<\/a> from Arm Holdings, who not only <a href=\"https:\/\/github.com\/dotnet\/runtime\/pulls?q=is%3Apr+author%3ATamarChristinaArm+is%3Aclosed\">implemented some of the ARM64 hardware intrinsics<\/a> but also <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues?q=commenter%3ATamarChristinaArm+\">gave valuable suggestions and feedback<\/a> to improve our code quality. We want to thank the many contributors who made it possible to ship .NET 5 running on the ARM64 target.<\/p>\n<p>I would encourage you to download the latest bits of <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/5.0\" rel=\"nofollow\">.NET 5<\/a> for ARM64 and let us know your feedback.<\/p>\n<p>Happy coding on ARM64!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>ARM64 performance work in .NET 5<\/p>\n","protected":false},"author":38211,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,196,3012,756,3009],"tags":[4,9,7173,7172,108,121],"class_list":["post-29683","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-dotnet-core","category-internals","category-csharp","category-performance","tag-net","tag-net-core","tag-arm","tag-arm64","tag-performance","tag-ryujit"],"acf":[],"blog_post_summary":"<p>ARM64 performance work in .NET 
5<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/29683","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/38211"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=29683"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/29683\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=29683"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=29683"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=29683"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}