{"id":42195,"date":"2022-09-12T09:13:45","date_gmt":"2022-09-12T16:13:45","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=42195"},"modified":"2024-12-13T15:11:14","modified_gmt":"2024-12-13T23:11:14","slug":"arm64-performance-improvements-in-dotnet-7","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/arm64-performance-improvements-in-dotnet-7\/","title":{"rendered":"Arm64 Performance Improvements in .NET 7"},"content":{"rendered":"<p>The .NET team has continued improving performance in .NET 7, both generally and for Arm64. You can check out the general improvements in the excellent and detailed <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/\">Performance Improvements in .NET 7<\/a> blog by Stephen Toub. Following along the lines of <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/arm64-performance-in-net-5\/\">ARM64 Performance in .NET 5<\/a>, in this post I will describe the performance improvements we made for Arm64 in .NET 7 and the positive impact they had on various benchmarks. Stephen touched upon some of this work in his blog post, but here I will go into more detail and, wherever possible, include the improvements we saw after optimizing a specific area.<\/p>\n<p>When we started .NET 7, we wanted to focus on benchmarks that would impact a wide range of customers. Along with the Microsoft hardware team, we did a lot of research into which benchmarks we should pick to improve the performance of both client and cloud scenarios. In this blog, I will start by describing the performance characteristics we considered important, the methodology we used, and the criteria we evaluated to select the benchmarks used during the .NET 7 work. 
After that, I will go through the incredible work that has gone into improving .NET&#8217;s performance on Arm64 devices.<\/p>\n<h2>Performance analysis methodology<\/h2>\n<p>The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Instruction_set_architecture\">instruction set architectures<\/a> (ISAs) of x64 and Arm64 are different, and this difference surfaces in the form of performance numbers. While this difference exists between the two platforms, we wanted to understand how performant .NET is when running on Arm64 platforms compared to x64, and what can be done to improve its efficiency. Our goal in .NET 7 was not only to achieve performance parity between x64 and Arm64, but to deliver clear guidance to our customers on what to expect when they move their .NET applications from x64 to Arm64. To do that, we came up with a well-defined process to conduct Arm64 performance investigations on benchmarks. Before deciding which benchmarks to investigate, we narrowed down the characteristics of an ideal benchmark.<\/p>\n<ul>\n<li>It should represent real-world code that any .NET developer would write for their software.<\/li>\n<li>It should be easy to execute and gather measurements from, with minimal prerequisite steps.<\/li>\n<li>Lastly, it should be executable on all the platforms on which we are interested in conducting performance measurements.<\/li>\n<\/ul>\n<p>Based upon these characteristics, the following were some of the benchmarks that we finalized for our investigations.<\/p>\n<h3>BenchmarkGames<\/h3>\n<p><a href=\"https:\/\/benchmarksgame-team.pages.debian.net\/benchmarksgame\/index.html\">The Computer Language Benchmarks Game<\/a> is a popular benchmark suite because its benchmarks are implemented in several languages, making it easy to measure and compare the performance of various languages. 
To name a few, <a href=\"https:\/\/benchmarksgame-team.pages.debian.net\/benchmarksgame\/program\/fannkuchredux-csharpcore-9.html\">fannkuch-9<\/a>, <a href=\"https:\/\/benchmarksgame-team.pages.debian.net\/benchmarksgame\/program\/knucleotide-csharpcore-6.html\">knucleotide<\/a> and <a href=\"https:\/\/benchmarksgame-team.pages.debian.net\/benchmarksgame\/program\/mandelbrot-csharpcore-5.html\">mandelbrot-5<\/a> were some of the benchmarks that we selected. The good part of these benchmarks is that there is no extra setup needed on the developer machine; they can simply be built and executed using available tooling. On the other hand, they represent a narrow slice of hand-tuned computation and may not be a good representation of user code.<\/p>\n<h3>TechEmpower<\/h3>\n<p><a href=\"https:\/\/www.techempower.com\/\">TechEmpower benchmarks<\/a> (from here on, I will refer to them as &#8220;TE benchmarks&#8221;) are industry-recognized server workload benchmarks that compare various frameworks across languages. They are representations of real-world client\/server workloads. Unlike BenchmarkGames, these benchmarks rely on the network stack and may not be CPU dominated. Another benefit of the TE benchmarks is that we could compare the relative performance of x64 and Arm64 across a number of different runtimes and languages, giving us a useful set of yardsticks to evaluate .NET&#8217;s performance.<\/p>\n<h3>Bing.com<\/h3>\n<p>While all the above benchmarks would give us some low-hanging fruit in the codegen space, it would be interesting to understand the characteristics of a real-world cloud application on Arm64. After <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/bing-com-runs-on-net-core-2-1\/\">Bing switched to .NET Core<\/a>, the .NET team worked closely with them to identify performance bottlenecks. 
By analyzing Bing&#8217;s performance on Arm64, we got a sense of what a big cloud customer could expect if they switched their servers to Arm64 machines.<\/p>\n<h3>ImageSharp<\/h3>\n<p>ImageSharp is a popular .NET library that uses intrinsics extensively in its code. Its <a href=\"https:\/\/github.com\/SixLabors\/ImageSharp\/tree\/master\/tests\/ImageSharp.Benchmarks\">benchmarks<\/a> are based on the <a href=\"https:\/\/github.com\/dotnet\/BenchmarkDotNet\">BenchmarkDotNet<\/a> framework, and we had a PR to <a href=\"https:\/\/github.com\/SixLabors\/ImageSharp\/pull\/1794\">enable the execution of these benchmarks on Arm64<\/a>. We also have an outstanding PR to include <a href=\"https:\/\/github.com\/dotnet\/performance\/pull\/2149\">ImageSharp benchmarks in <code>dotnet\/performance<\/code><\/a>.<\/p>\n<h3>Paint.NET<\/h3>\n<p><a href=\"https:\/\/www.getpaint.net\/\">Paint.NET<\/a> is image and photo editing software written in .NET. Not only is this a real-world application, but analyzing the performance of such applications gives us insights into how UI applications perform on different hardware and whether there are other areas we should explore to optimize the performance of the .NET runtime.<\/p>\n<h3>Micro benchmarks<\/h3>\n<p><a href=\"https:\/\/github.com\/dotnet\/performance\/tree\/main\/src\/benchmarks\/micro\">4600+ micro benchmarks<\/a> are run every night by the &#8220;Dotnet Performance team&#8221; and the results for various platforms can be seen at <a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/allTestHistory\/TestHistoryIndexIndex.html\">this index page<\/a>.<\/p>\n<p>We started our .NET 7 journey by gathering the known Arm64 performance work items in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/64820\">dotnet\/runtime#64820<\/a>. 
We measured the performance difference between x64 and Arm64 on various benchmarks (client as well as server workloads) and noticed, across multiple benchmarks, that Arm64 performance was severely slow on server workloads, and hence we started our investigation with the TE benchmarks.<\/p>\n<p>As I discuss the various optimizations that we made in .NET 7, I will interchangeably refer to one or more of the benchmarks that improved. In the end, we wanted to make sure that we impacted as many benchmarks and as much real-world code as possible.<\/p>\n<hr \/>\n<h2>Runtime improvements<\/h2>\n<p>.NET is built upon three main components. The .NET &#8220;libraries&#8221; contain managed code (mostly C#) for common functionality that is consumed within the .NET ecosystem as well as by .NET developers. The &#8220;RyuJIT code generator&#8221; takes the .NET Intermediate Language representation and converts it to machine code on the fly, a process commonly known as just-in-time compilation or JIT. Lastly, the &#8220;runtime&#8221; facilitates the execution of the program by managing the type system, scheduling code generation, performing garbage collection, providing native support for certain .NET libraries, etc. The runtime is a critical part of any managed language platform and we wanted to closely monitor whether there was any area in the runtime that was not optimal for Arm64. We found a few fundamental issues that I will highlight below.<\/p>\n<h3>L3 cache size<\/h3>\n<p>After starting our investigation, we found out that the TE benchmarks <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/60166\">were 4-8X slower on Arm64<\/a> compared to x64. One key problem was that <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/60166#issuecomment-938724532\">Gen0 GC was happening 4X<\/a> more frequently on Arm64. 
We soon realized that we were <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/60166#issuecomment-940421739\">not correctly reading the L3 cache size from the Arm64 machine<\/a> that was used to measure the performance. We eventually found out that the L3 cache size is not accessible from the OS on some machines, including <a href=\"https:\/\/d1o0i0v5q5lp8h.cloudfront.net\/ampere\/live\/assets\/documents\/Altra_Rev_A1_PB_v1.35_20220728.pdf\">Ampere Altra<\/a> (this was later fixed by a firmware update). In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71029\">dotnet\/runtime#71029<\/a>, we changed our heuristics so that if a machine cannot read the L3 cache size, the runtime uses an approximate size based on the number of cores present on the machine.<\/p>\n<table>\n<thead>\n<tr>\n<th>Core count<\/th>\n<th>L3 cache size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1 ~ 4<\/td>\n<td>4 MB<\/td>\n<\/tr>\n<tr>\n<td>5 ~ 16<\/td>\n<td>8 MB<\/td>\n<\/tr>\n<tr>\n<td>17 ~ 64<\/td>\n<td>16 MB<\/td>\n<\/tr>\n<tr>\n<td>65+<\/td>\n<td>32 MB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>With this change, we saw around <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6510\">670+ microbenchmark improvements<\/a> on Linux\/arm64 and <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6501\">170+ improvements<\/a> on Windows\/arm64. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/64576\">dotnet\/runtime#64576<\/a>, we started using modern hardware information on macOS 12+ to read the L3 cache size accurately.<\/p>\n<h3>Thread pool scaling<\/h3>\n<p>We saw a significant degradation in performance on higher core count (32+) machines. For instance, on Ampere machines, the performance dropped by almost 80%. In other words, we were seeing higher requests\/second (RPS) numbers on 28 cores than on 80 cores. The underlying problem was the way threads were using and polling the shared &#8220;global queue&#8221;. 
When a worker thread needed more work, it would query the global queue to see if there was more work to be done. On higher core count machines, with more threads involved, there was a lot of contention while accessing the global queue. Multiple threads were trying to acquire a lock on the global queue before accessing its contents. This led to stalling and hence a degradation in performance, as noted in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/67845\">dotnet\/runtime#67845<\/a>. The thread pool scaling problem is not limited to Arm64 machines but applies to any machine with a high core count. This problem was fixed in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/69386\">dotnet\/runtime#69386<\/a> and <a href=\"https:\/\/github.com\/dotnet\/aspnetcore\/pull\/42237\">dotnet\/aspnetcore#42237<\/a>, as seen in the graph below. Although this shows significant improvements, we can see that the performance doesn&#8217;t scale linearly with more machine cores. We have ideas to improve this in future releases.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/threadpool_28_vs_80.png\" style=\"width:75%\"><\/p>\n<h3>LSE atomics<\/h3>\n<p>In concurrent code, there is often a need for a single thread to access a memory region exclusively. Developers use atomic APIs to gain exclusive access to critical regions. On x86-x64 machines, a read-modify-write (RMW) operation on a memory location can be performed by a single instruction by adding a <code>lock<\/code> prefix, as seen in the example below. 
The method <code>Do_Transaction<\/code> calls one of the <code>Interlocked*<\/code> C++ APIs and, as seen in the code generated by the Microsoft Visual C++ (MSVC) compiler, the operation is performed using a single instruction on x86-x64. (Note that in <code>_InterlockedCompareExchange<\/code>, the second parameter is the value to store and the third is the value to compare against.)<\/p>\n<pre><code class=\"language-c++\">long Do_Transaction(\r\n    long volatile *Destination,\r\n    long Exchange,\r\n    long Comperand)\r\n{\r\n    long result =\r\n        _InterlockedCompareExchange(\r\n            Destination, \/* The pointer to a variable whose value is to be compared with. *\/\r\n            Comperand, \/* The value to be stored *\/\r\n            Exchange \/* The value to be compared *\/);\r\n    return result;\r\n}<\/code><\/pre>\n<pre><code class=\"language-assembly\">long Do_Transaction(long volatile *,long,long) PROC                 ; Do_Transaction, COMDAT\r\n        mov     eax, edx\r\n        lock cmpxchg DWORD PTR [rcx], r8d\r\n        ret     0\r\nlong Do_Transaction(long volatile *,long,long) ENDP                 ; Do_Transaction<\/code><\/pre>\n<p>However, until recently, on Arm machines, RMW operations could not be performed by a single instruction, and all operations were done through registers. Hence, concurrency scenarios required a pair of instructions: &#8220;Load Acquire&#8221; (<code>ldaxr<\/code>) would gain exclusive access to the memory region such that no other core can access it, and &#8220;Store Release&#8221; (<code>stlxr<\/code>) would release the access for other cores. Between this pair, the critical operations are performed, as seen in the generated code below. 
If the <code>stlxr<\/code> operation fails because some other CPU operated on the memory after we loaded its contents using <code>ldaxr<\/code>, there is code to retry the operation (<code>cbnz<\/code> jumps back to retry).<\/p>\n<pre><code class=\"language-assembly\">|long Do_Transaction(long volatile *,long,long)| PROC           ; Do_Transaction\r\n        mov         x8,x0\r\n|$LN4@Do_Transac|\r\n        ldaxr       w0,[x8]\r\n        cmp         w0,w1\r\n        bne         |$LN5@Do_Transac|\r\n        stlxr       wip1,w2,[x8]\r\n        cbnz        wip1,|$LN4@Do_Transac|\r\n|$LN5@Do_Transac|\r\n        dmb         ish\r\n        ret\r\n\r\n        ENDP  ; |long Do_Transaction(long volatile *,long,long)|, Do_Transaction<\/code><\/pre>\n<p>Arm introduced LSE atomics instructions in v8.1. With these instructions, such operations can be done with less code and run faster than the traditional version. The full memory barrier <code>dmb ish<\/code> at the end of such an operation can also be eliminated with the use of atomic instructions.<\/p>\n<pre><code class=\"language-assembly\">|long Do_Transaction(long volatile *,long,long)| PROC           ; Do_Transaction\r\n        casal       w1,w2,[x0]\r\n        mov         w0,w1\r\n        ret\r\n\r\n        ENDP  ; |long Do_Transaction(long volatile *,long,long)|, Do_Transaction<\/code><\/pre>\n<p>You can read more about what Arm atomics are and why they matter in <a href=\"https:\/\/cpufun.substack.com\/p\/atomics-in-aarch64\">this<\/a> and <a href=\"https:\/\/mysqlonarm.github.io\/ARM-LSE-and-MySQL\/\">this<\/a> blog post.<\/p>\n<p>RyuJIT has had support for LSE atomics <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/18130\">since its early days<\/a>. For the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.threading.interlocked?view=net-6.0\">threading interlocked APIs<\/a>, the JIT will generate the faster atomic instructions if it detects that the machine running the application supports the LSE capability. 
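<\/p>
<p>For illustration, here is a small managed counterpart of the C++ example above (the <code>AtomicCounter<\/code> type below is mine, not from the .NET sources). Calls like <code>Interlocked.Increment<\/code> and <code>Interlocked.CompareExchange<\/code> are exactly the operations that the JIT can compile down to single LSE instructions such as <code>casal<\/code> on capable hardware:<\/p>

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch: AtomicCounter is a hypothetical type, not part of .NET.
class AtomicCounter
{
    private int _value;

    // A single RMW operation; on LSE-capable Arm64 hardware the JIT can
    // emit one atomic instruction instead of an ldaxr/stlxr retry loop.
    public int Increment() => Interlocked.Increment(ref _value);

    // Compare-and-swap: stores newValue only if the current value equals
    // comparand; returns the value that was observed before the attempt.
    public int CompareExchange(int newValue, int comparand)
        => Interlocked.CompareExchange(ref _value, newValue, comparand);

    public int Value => Volatile.Read(ref _value);
}

class AtomicDemo
{
    static void Main()
    {
        var counter = new AtomicCounter();
        Parallel.For(0, 100_000, _ => counter.Increment());
        Console.WriteLine(counter.Value); // prints 100000: no update is lost
    }
}
```

<p>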
However, this was only enabled for Linux. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70600\">dotnet\/runtime#70600<\/a>, we extended the support to Windows and saw a performance <a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/allTestHistory\/refs\/heads\/main_arm64_Windows%2010.0.25094\/System.Threading.Tests.Perf_SpinLock.TryEnterExit.html\">win of around 45%<\/a> in scenarios involving locks.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/win_atomics.png\" style=\"width:75%\"><\/p>\n<p>In .NET 7, we not only wanted to use atomic instructions in managed code, but also to take advantage of them in the native runtime code. We use the <code>Interlocked<\/code> APIs heavily in various scenarios like the thread pool, helpers for the .NET <code>lock<\/code> statement, garbage collection, etc. You can imagine how slow all these scenarios would become without the atomic instructions, and that surfaced to some extent as part of the performance gap compared to x64. To support the wide range of hardware that runs .NET, we can only generate the lowest common denominator instructions that can execute on all machines. This is called the &#8220;architecture baseline&#8221;, and for Arm64 it is <code>ARM v8.0<\/code>. In other words, for ahead-of-time compilation scenarios (as opposed to just-in-time compilation), we must be conservative and generate only instructions that are <code>ARM v8.0<\/code> compliant. This holds true for the .NET runtime&#8217;s native code as well. When we compile the .NET runtime using MSVC\/clang\/gcc, we explicitly instruct the compiler not to generate modern instructions that were introduced after Arm v8.0. This ensures that .NET can run on older hardware, but at the same time we were not taking advantage of modern instructions like atomics on newer machines. 
Changing the baseline to <code>ARM v8.1<\/code> was not an option, because that would stop .NET from working on machines that do not support atomic instructions. Newer versions of clang added a flag, <a href=\"https:\/\/reviews.llvm.org\/rG4d7df43ffdb460dddb2877a886f75f45c3fee188\"><code>-moutline-atomics<\/code><\/a>, which generates both versions of the code, the slower version that uses <code>ldaxr\/stlxr<\/code> and the faster version that uses <code>casal<\/code>, and adds a check at runtime to execute the appropriate version depending on the machine&#8217;s capability. Unfortunately, .NET still uses <code>clang9<\/code> for its compilation and <code>-moutline-atomics<\/code> is not present in it. Similarly, MSVC also lacks this flag, and the only way <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/msvc-backend-updates-in-visual-studio-2022-version-17-2\/\">to generate atomics was using the compiler switch <code>\/arch:armv8.1<\/code><\/a>.<\/p>\n<p>Instead of relying on the C++ compilers, in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70921\">dotnet\/runtime#70921<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71512\">dotnet\/runtime#71512<\/a>, we added two separate versions of the code plus a check in the .NET runtime that executes the optimal version on a given machine. 
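<\/p>
<p>The fix itself is in the runtime&#8217;s C++ code, but the pattern is the familiar &#8220;detect once, dispatch&#8221; idiom. Here is a minimal C# sketch of the same idea (the <code>Dispatch<\/code> class and the two path names are illustrative inventions; <code>ArmBase.IsSupported<\/code> is a real .NET capability check):<\/p>

```csharp
using System;
using System.Runtime.Intrinsics.Arm;

// Sketch of the "detect capability once, branch to the optimal path"
// pattern; FastPath/PortablePath are illustrative stand-ins.
static class Dispatch
{
    // Evaluated once; because the flag is static readonly, the JIT can
    // fold the branch away and call the chosen path directly.
    private static readonly bool s_useFastPath = ArmBase.IsSupported;

    public static string Run()
        => s_useFastPath ? FastPath() : PortablePath();

    private static string FastPath() => "fast (Arm64) path";
    private static string PortablePath() => "portable baseline path";
}

class DispatchDemo
{
    static void Main() => Console.WriteLine(Dispatch.Run());
}
```

<p>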
As seen in <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6347\">dotnet\/perf#6347<\/a> and <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6770\">dotnet\/perf#6770<\/a>, we saw wins of <code>10% ~ 20%<\/code> in various benchmarks.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/win_atomics_2.png\" style=\"width:75%\"><\/p>\n<h3>Environment.ProcessorCount<\/h3>\n<p>Per the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.environment.processorcount?view=net-6.0\">documentation<\/a>, <code>Environment.ProcessorCount<\/code> returns the number of logical processors on the machine or, if the process is running with CPU affinity, the number of processors that the process is affinitized to. <a href=\"https:\/\/docs.microsoft.com\/windows\/win32\/procthread\/processor-groups#behavior-starting-with-windows-11-and-windows-server-2022\">Starting with Windows 11 and Windows Server 2022<\/a>, processes are no longer constrained to a single processor group by default. The 80-core Ampere machine that we tested had two processor groups: one with 16 cores and the other with 64 cores. On such machines, the <code>Environment.ProcessorCount<\/code> value would sometimes return <code>16<\/code> and at other times <code>64<\/code>. Since <a href=\"https:\/\/grep.app\/search?q=Environment.ProcessorCount&amp;filter[lang][0]=C%23\">a lot of user code<\/a> depends heavily on <code>Environment.ProcessorCount<\/code>, applications could observe a drastic performance difference on these machines depending on which value was obtained at process startup time. 
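<\/p>
<p>A typical example of such user code (illustrative, not taken from any real application) sizes its concurrency off <code>Environment.ProcessorCount<\/code>, so an inconsistent value of <code>16<\/code> versus <code>64<\/code> directly changes throughput:<\/p>

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class PoolSizingDemo
{
    // Common idiom: derive a parallelism limit from the core count.
    // If ProcessorCount flips between 16 and 64 across launches, this
    // limit (and the application's throughput) flips with it.
    public static int ComputeDegreeOfParallelism()
        => Math.Max(1, Environment.ProcessorCount);

    static void Main()
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = ComputeDegreeOfParallelism()
        };
        long sum = 0;
        Parallel.For(0, 1_000, options, i => Interlocked.Add(ref sum, i));
        Console.WriteLine($"dop={options.MaxDegreeOfParallelism}, sum={sum}"); // sum is 499500
    }
}
```

<p>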
We fixed this issue in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/68639\">dotnet\/runtime#68639<\/a> by taking into account the number of cores present in all processor groups.<\/p>\n<h2>Libraries improvements<\/h2>\n<p>Continuing <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/33308\">our tradition<\/a> of optimizing library code for Arm64 using intrinsics, we have made several improvements to some of the hot methods. In .NET 7, we did a lot of work in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53450\">dotnet\/runtime#53450<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/60094\">dotnet\/runtime#60094<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61649\">dotnet\/runtime#61649<\/a> to add cross-platform hardware intrinsics helpers for <code>Vector64<\/code>, <code>Vector128<\/code> and <code>Vector256<\/code>, as described in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/49397\">dotnet\/runtime#49397<\/a>. This work helped us unify the logic of multiple library code paths by removing hardware-specific intrinsics and instead using hardware-agnostic intrinsics. With the new APIs, a developer does not need to know the various hardware intrinsic APIs offered by each underlying hardware architecture, but can instead focus on the functionality they want to achieve using simpler hardware-agnostic APIs. Not only did this simplify our code, but we also got good performance improvements on Arm64 just by taking advantage of the hardware-agnostic intrinsic APIs. 
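<\/p>
<p>To give a flavor of the hardware-agnostic style (a sketch I wrote for illustration, not code from the libraries), the following sums integers using only the cross-platform <code>Vector128<\/code> APIs; the same source compiles down to NEON instructions on Arm64 and SSE instructions on x64:<\/p>

```csharp
using System;
using System.Runtime.Intrinsics;

class VectorSumDemo
{
    // Sums a span with the cross-platform Vector128 helpers; no
    // Sse2.* or AdvSimd.* hardware-specific intrinsics are needed.
    public static int Sum(ReadOnlySpan<int> values)
    {
        int i = 0;
        var acc = Vector128<int>.Zero;
        if (Vector128.IsHardwareAccelerated)
        {
            for (; i <= values.Length - Vector128<int>.Count; i += Vector128<int>.Count)
                acc += Vector128.Create(values.Slice(i, Vector128<int>.Count));
        }
        int total = Vector128.Sum(acc);
        for (; i < values.Length; i++) // scalar tail for the remainder
            total += values[i];
        return total;
    }

    static void Main()
    {
        int[] data = new int[100];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1;
        Console.WriteLine(Sum(data)); // prints 5050
    }
}
```

<p>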
In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/64864\">dotnet\/runtime#64864<\/a>, we also added the <code>LoadPairVector64<\/code> and <code>LoadPairVector128<\/code> APIs, which load a pair of <code>Vector64<\/code> or <code>Vector128<\/code> values and return them as a tuple.<\/p>\n<h3>Text processing improvements<\/h3>\n<p><a href=\"https:\/\/github.com\/a74nh\">@a74nh<\/a> rewrote the <code>System.Buffers.Text.Base64<\/code> APIs <code>EncodeToUtf8<\/code> and <code>DecodeFromUtf8<\/code> from an SSE3-based implementation to the <code>Vector*<\/code>-based APIs in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70654\">dotnet\/runtime#70654<\/a> and got <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70654#issuecomment-1164484184\">up to a 60% improvement<\/a> in some of our benchmarks. A similar change, rewriting <code>HexConverter::EncodeToUtf16<\/code> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67192\">dotnet\/runtime#67192<\/a>, gave us up to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67192#issuecomment-1106756616\">50% wins on some benchmarks<\/a>.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/base64encode.png\" style=\"width:75%\"><\/p>\n<p><a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> likewise converted <code>NarrowUtf16ToAscii()<\/code> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70080\">dotnet\/runtime#70080<\/a> and <code>GetIndexOfFirstNonAsciiChar()<\/code> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71637\">dotnet\/runtime#71637<\/a> for a performance win of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71637#issuecomment-1184662254\">up to 35%<\/a>.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/get_byte_count.png\" 
style=\"width:75%\"><\/p>\n<p>Further, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71795\">dotnet\/runtime#71795<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73320\">dotnet\/runtime#73320<\/a> vectorized the <code>ToBase64String()<\/code>, <code>ToBase64CharArray()<\/code> and <code>TryToBase64Chars()<\/code> methods and improved the overall performance (see <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/7250\">win1<\/a>, <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/7244\">win2<\/a>, <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6960\">win3<\/a>, <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6983\">win4<\/a> and <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/6977\">win5<\/a>) of <code>Convert.ToBase64String()<\/code>, not just for x64 but for Arm64 as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67811\">dotnet\/runtime#67811<\/a> improved <code>IndexOf<\/code> for scenarios where there is no match, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67192\">dotnet\/runtime#67192<\/a> accelerated <code>HexConverter::EncodeToUtf16<\/code> to use the newly written cross-platform <code>Vector128<\/code> intrinsic APIs with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67192#issuecomment-1106756616\">nice wins<\/a>.<\/p>\n<h3>Reverse improvements<\/h3>\n<p><a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> also rewrote <code>Reverse()<\/code> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/72780\">dotnet\/runtime#72780<\/a> to optimize it for Arm64 and got us <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/7405\">up to a 70% win<\/a>.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/array_reverse.png\" style=\"width:75%\"><\/p>\n<h2>Code 
generation improvements<\/h2>\n<p>In this section, I will go through the various work we did to improve the code quality for Arm64 devices. Not only did it improve the performance of .NET, but it also reduced the amount of code we produced.<\/p>\n<h3>Addressing mode improvements<\/h3>\n<p>An addressing mode is the mechanism by which an instruction computes the memory address it wants to access. On modern machines, memory addresses are 64 bits long. On x86-x64 hardware, where instructions are of varying width, most addresses can be directly embedded in the instruction itself. Arm64, on the contrary, uses a fixed-size encoding with a 32-bit instruction width, so most often a 64-bit address cannot be specified as &#8220;an immediate value&#8221; inside the instruction. Arm64 provides various ways to manifest the memory address. More on this can be read in <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220728-00\/?p=106912\">this excellent article<\/a> about Arm64 addressing modes. The code generated by RyuJIT was not taking full advantage of many addressing modes, resulting in inferior code quality and slower execution. In .NET 7, we worked on improving the code to benefit from these addressing modes. <\/p>\n<p>Prior to .NET 7, when accessing an array element, we calculated the address of the element in two steps. In the first step, depending on the index, we would calculate the offset of the corresponding element from the array&#8217;s base address, and in the second step, add that offset to the base address to get the address of the array element we are interested in. In the example below, to access <code>data[j]<\/code>, we first load the value of <code>j<\/code> in <code>x0<\/code> and, since it is of type <code>int<\/code>, multiply it by <code>4<\/code> using <code>lsl x0, x0, #2<\/code>. We access the element by adding the calculated offset value to the base address <code>x1<\/code> using <code>ldr w0, [x1, x0]<\/code>. 
With those two instructions, we have calculated the address by doing the operation <code>*(Base + (Index &lt;&lt; 2))<\/code>. Similar steps are taken to calculate <code>data[i]<\/code>.<\/p>\n<pre><code class=\"language-c#\">void GetSet(int* data, int i, int j) \r\n    =&gt; data[i] = data[j];<\/code><\/pre>\n<pre><code class=\"language-assembly\">; Method Prog:GetSet(long,int,int):this\r\nG_M13801_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M13801_IG02:\r\n            sxtw    x0, w3\r\n            lsl     x0, x0, #2\r\n            ldr     w0, [x1, x0]\r\n            sxtw    x2, w2\r\n            lsl     x2, x2, #2\r\n            str     w0, [x1, x2]\r\nG_M13801_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n; Total bytes of code: 40<\/code><\/pre>\n<p>Arm64 has a &#8220;register indirect with index&#8221; addressing mode, often referred to as the &#8220;scaled addressing mode&#8221;. This can be used in the scenario we just saw, where an offset is present in an unsigned register and needs to be left shifted by a constant value (in our case, the size of the array element type). In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/60808\">dotnet\/runtime#60808<\/a>, we started using the &#8220;scaled addressing mode&#8221; for such scenarios, resulting in the code shown below. Here, we were able to load the value of array element <code>data[j]<\/code> in a single instruction, <code>ldr w0, [x1, x0, LSL #2]<\/code>. 
Not only did the code quality improve, but the code size of the method also reduced from 40 bytes to 32 bytes.<\/p>\n<pre><code class=\"language-assembly\">; Method Prog:GetSet(long,int,int):this\r\nG_M13801_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M13801_IG02:\r\n            sxtw    x0, w3\r\n            ldr     w0, [x1, x0, LSL #2]\r\n            sxtw    x2, w2\r\n            str     w0, [x1, x2, LSL #2]\r\nG_M13801_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n; Total bytes of code: 32<\/code><\/pre>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/60808.png\" style=\"width:75%\"><\/p>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/66902\">dotnet\/runtime#66902<\/a>, we did similar changes to improve the performance of code that operates on <code>byte<\/code> arrays.<\/p>\n<pre><code class=\"language-c#\">byte Test(byte* a, int i) =&gt; a[i];<\/code><\/pre>\n<pre><code class=\"language-diff\">; Method Program:Test(long,int):ubyte:this\r\nG_M39407_IG01:\r\n        A9BF7BFD          stp     fp, lr, [sp,#-16]!\r\n        910003FD          mov     fp, sp\r\nG_M39407_IG02:\r\n-       8B22C020          add     x0, x1, w2, SXTW\r\n-       39400000          ldrb    w0, [x0]\r\n+       3862D820          ldrb    w0, [x1, w2, SXTW #2]\r\nG_M39407_IG03:\r\n        A8C17BFD          ldp     fp, lr, [sp],#16\r\n        D65F03C0          ret     lr\r\n-; Total bytes of code: 24\r\n+; Total bytes of code: 20<\/code><\/pre>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/65468\">dotnet\/runtime#65468<\/a>, we optimized codes that operates on <code>float<\/code> arrays.<\/p>\n<pre><code class=\"language-c#\">float Test(float[] arr, int i) =&gt; arr[i];<\/code><\/pre>\n<pre><code class=\"language-diff\">G_M60861_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, 
sp\r\nG_M60861_IG02:\r\n            ldr     w0, [x1,#8]\r\n            cmp     w2, w0\r\n            bhs     G_M60861_IG04\r\n-           ubfiz   x0, x2, #2, #32\r\n-           add     x0, x0, #16\r\n-           ldr     s0, [x1, x0]\r\n+           add     x0, x1, #16\r\n+           ldr     s0, [x0, w2, UXTW #2]\r\nG_M60861_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\nG_M60861_IG04:\r\n            bl      CORINFO_HELP_RNGCHKFAIL\r\n            brk_windows #0\r\n-; Total bytes of code: 48\r\n+; Total bytes of code: 44<\/code><\/pre>\n<p>This gave us <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/3694\">around 10% win<\/a> in some benchmarks as seen in below graph.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/65468.png\" style=\"width:75%\"><\/p>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70749\">dotnet\/runtime#70749<\/a>, we optimized code that operates on <code>object<\/code> arrays giving us <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70749#issuecomment-1164655841\">a performance win of more than 10%<\/a>.<\/p>\n<pre><code class=\"language-c#\">object Test(object[] args, int i) =&gt; args[i];<\/code><\/pre>\n<pre><code class=\"language-diff\">; Method Program:Test(System.Object[],int):System.Object:this\r\nG_M59644_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M59644_IG02:\r\n            ldr     w0, [x1,#8]\r\n            cmp     w2, w0\r\n            bhs     G_M59644_IG04\r\n-           ubfiz   x0, x2, #3, #32\r\n-           add     x0, x0, #16\r\n-           ldr     x0, [x1, x0]\r\n+           add     x0, x1, #16\r\n+           ldr     x0, [x0, w2, UXTW #3]\r\nG_M59644_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\nG_M59644_IG04:\r\n            bl      CORINFO_HELP_RNGCHKFAIL\r\n            brk_windows 
#0<\/code><\/pre>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/70749.png\" style=\"width:75%\"><\/p>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67490\">dotnet\/runtime#67490<\/a>, we improved the addressing modes of SIMD vectors that are loaded with unscaled indices. As seen below, such operations now take just 1 instruction instead of 2.<\/p>\n<pre><code class=\"language-c#\">Vector128&lt;byte&gt; Add(ref byte b1, ref byte b2, nuint offset) =&gt;\r\n    Vector128.LoadUnsafe(ref b1, offset) + \r\n    Vector128.LoadUnsafe(ref b2, offset);<\/code><\/pre>\n<pre><code class=\"language-diff\">-        add     x0, x1, x3\r\n-        ld1     {v16.16b}, [x0]\r\n+        ldr     q16, [x1, x3]\r\n-        add     x0, x2, x3\r\n-        ld1     {v17.16b}, [x0]\r\n+        ldr     q17, [x2, x3]\r\n         add     v16.16b, v16.16b, v17.16b\r\n         mov     v0.16b, v16.16b<\/code><\/pre>\n<h3>Memory barrier improvements<\/h3>\n<p>Arm64 has a relatively weaker memory model than x64. The processor can re-order memory access instructions to improve the performance of a program, without the developer knowing about it, executing them in whatever order minimizes the memory access cost. This reordering can affect the behavior of multi-threaded programs, and a &#8220;memory barrier&#8221; is the way for a developer to tell the processor not to do this rearrangement for certain code paths. A memory barrier ensures that all writes preceding it are completed before any subsequent memory operations.<\/p>\n<p>In .NET, developers can convey that information to the compiler by declaring a variable as <code>volatile<\/code>. 
We noticed that we were generating a one-way barrier using store-release semantics for variable accesses made through the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.threading.volatile?view=net-6.0\"><code>Volatile<\/code> class<\/a>, but were not doing the same for variables declared with the <code>volatile<\/code> keyword, as seen below, making <code>volatile<\/code> 2X slower on Arm64.<\/p>\n<pre><code class=\"language-c#\">private volatile int A;\r\nprivate volatile int B;\r\n\r\npublic void Test1() {\r\n    for (int i = 0; i &lt; 1000; i++) {\r\n        A = i;\r\n        B = i;\r\n    }\r\n}\r\n\r\nprivate int C;\r\nprivate int D;\r\n\r\npublic void Test2() {\r\n    for (int i = 0; i &lt; 1000; i++) {\r\n        Volatile.Write(ref C, i);\r\n        Volatile.Write(ref D, i);\r\n    }\r\n}<\/code><\/pre>\n<pre><code class=\"language-assembly\">;; --------------------------------\r\n;; Test1\r\n;; --------------------------------\r\nG_M12051_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\n\r\nG_M12051_IG02:\r\n            mov     w1, wzr\r\n\r\nG_M12051_IG03:\r\n            dmb     ish             ; &lt;-- two-way memory barrier\r\n            str     w1, [x0,#8]\r\n            dmb     ish             ; &lt;-- two-way memory barrier\r\n            str     w1, [x0,#12]\r\n            add     w1, w1, #1\r\n            cmp     w1, #0xd1ffab1e\r\n            blt     G_M12051_IG03\r\n\r\nG_M12051_IG04:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n;; --------------------------------\r\n;; Test2\r\n;; --------------------------------\r\nG_M27440_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\n\r\nG_M27440_IG02:\r\n            mov     w1, wzr\r\n            add     x2, x0, #16\r\n            add     x0, x0, #20\r\n\r\nG_M27440_IG03:\r\n            mov     x3, x2\r\n            stlr    w1, [x3]        ; &lt;-- store-release, one-way barrier\r\n            mov     
x3, x0\r\n            stlr    w1, [x3]        ; &lt;-- store-release, one-way barrier\r\n            add     w1, w1, #1\r\n            cmp     w1, #0xd1ffab1e\r\n            blt     G_M27440_IG03\r\n\r\nG_M27440_IG04:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/62895\">dotnet\/runtime#62895<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/64354\">dotnet\/runtime#64354<\/a> fixed these problems, leading to a <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/62895#issuecomment-1017667472\">massive performance win<\/a>.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/62895.png\" style=\"width:75%\"><\/p>\n<p>Data memory barrier instructions <code>dmb ish*<\/code> are expensive: they guarantee that memory accesses that appear in program order before the <code>dmb<\/code> are honored before any memory accesses that appear after the <code>dmb<\/code> instruction in program order. Often, we were generating two such instructions back-to-back. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/60219\">dotnet\/runtime#60219<\/a> fixed that problem by removing the redundant memory barriers <code>dmb<\/code> present.<\/p>\n<pre><code class=\"language-c#\">class Prog\r\n{\r\n    volatile int a;\r\n    void Test() =&gt; a++;\r\n}<\/code><\/pre>\n<pre><code class=\"language-diff\">; Method Prog:Test():this\r\nG_M48563_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M48563_IG02:\r\n            ldr     w1, [x0,#8]\r\n-           dmb     ishld\r\n+           dmb     ish\r\n            add     w1, w1, #1\r\n-           dmb     ish\r\n            str     w1, [x0,#8]\r\nG_M48563_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n-; Total bytes of code: 36\r\n+; Total bytes of code: 32<\/code><\/pre>\n<p>Instructions are added as part of ARMv8.3 to support the weaker RCpc (Release Consistent processor consistent) model where a Store-Release followed by Load-Acquire to a different address can be reordered. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67384\">dotnet\/runtime#67384<\/a> added support for these instructions in RyuJIT so the machines (like Apple M1 that supports it) can take advantage of them.<\/p>\n<h3>Hoisting expressions<\/h3>\n<p>Talking about array element access, in .NET 7, we restructured the way we were calculating the base address of an array. The way array element at <code>someArray[i]<\/code> is accessed is by finding the address of memory where the first array element is stored and then adding the appropriate index to it. Imagine <code>someArray<\/code> was stored at address <code>A<\/code> and the actual array elements start after <code>B<\/code> bytes from <code>A<\/code>. The address of first array element would be <code>A + B<\/code> and to get to the i<sup>th<\/sup> element, we would add <code>i * sizeof(array element type)<\/code>. 
So for an <code>int<\/code> array, the complete operation would be <code>(A + B) + (i * 4)<\/code>. If we are accessing an array inside a loop using the loop index variable <code>i<\/code>, the term <code>A + B<\/code> is an invariant and we do not have to repeatedly calculate it. Instead, we can calculate it once outside the loop, cache it and use it inside the loop. However, internally, RyuJIT was representing this address as <code>(A + (i * 4)) + B<\/code> instead, which prohibited us from moving the expression <code>A + B<\/code> outside the loop. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61293\">dotnet\/runtime#61293<\/a> fixed this problem, as seen in the example below, and gave us code size as well as performance gains (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61293#issuecomment-966430531\">here<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61293#issuecomment-973574023\">here<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61293#issuecomment-977208252\">here<\/a>).<\/p>\n<pre><code class=\"language-C#\">int Sum(int[] array)\r\n{\r\n    int sum = 0;\r\n    foreach (int item in array)\r\n        sum += item;\r\n    return sum;\r\n}<\/code><\/pre>\n<pre><code class=\"language-diff\">; Method Tests:Sum(System.Int32[]):int:this\r\nG_M56165_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M56165_IG02:\r\n            mov     w0, wzr\r\n            mov     w2, wzr\r\n            ldr     w3, [x1,#8]\r\n            cmp     w3, #0\r\n            ble     G_M56165_IG04\r\n+           add     x1, x1, #16             ;; 'A + B' is moved out of the loop\r\nG_M56165_IG03:\r\n-           ubfiz   x4, x2, #2, #32\r\n-           add     x4, x4, #16\r\n-           ldr     w4, [x1, x4]            \r\n+           ldr     w4, [x1, w2, UXTW #2]   ;; Better addressing mode\r\n            add     w0, w0, w4\r\n            add     w2, w2, #1\r\n            cmp     w3, w2\r\n            bgt   
  G_M56165_IG03\r\nG_M56165_IG04:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n-; Total bytes of code: 64\r\n+; Total bytes of code: 60<\/code><\/pre>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/61293.png\" style=\"width:75%\"><\/p>\n<p>While doing our Arm64 investigation on real world apps, we noted a <a href=\"https:\/\/github.com\/SixLabors\/ImageSharp\/blob\/255226b03c8350f88e641bdb58c9450b68729ef7\/src\/ImageSharp\/Processing\/Processors\/Quantization\/WuQuantizer%7BTPixel%7D.cs#L418-L445\">4-level nested for loop in ImageSharp<\/a> code base that was doing the same calculation repeatedly inside the nested loops. Here is the simplified version of the code.<\/p>\n<pre><code class=\"language-c#\">\r\nprivate const int IndexBits = 5;\r\nprivate const int IndexAlphaBits = 5;\r\nprivate const int IndexCount = (1 &lt;&lt; IndexBits) + 1;\r\nprivate const int IndexAlphaCount = (1 &lt;&lt; IndexAlphaBits) + 1;\r\n\r\npublic int M() {\r\n    int sum = 0;\r\n    for (int r = 0; r &lt; IndexCount; r++)\r\n    {\r\n        for (int g = 0; g &lt; IndexCount; g++)\r\n        {\r\n            for (int b = 0; b &lt; IndexCount; b++)\r\n            {\r\n                for (int a = 0; a &lt; IndexAlphaCount; a++)\r\n                {\r\n                    int ind1 = (r &lt;&lt; ((IndexBits * 2) + IndexAlphaBits))\r\n                                + (r &lt;&lt; (IndexBits + IndexAlphaBits + 1))\r\n                                + (g &lt;&lt; (IndexBits + IndexAlphaBits))\r\n                                + (r &lt;&lt; (IndexBits * 2))\r\n                                + (r &lt;&lt; (IndexBits + 1))\r\n                                + (g &lt;&lt; IndexBits)\r\n                                + ((r + g + b) &lt;&lt; IndexAlphaBits)\r\n                                + r + g + b + a;\r\n                    sum += ind1;\r\n                }\r\n            }\r\n   
     }\r\n    }\r\n    return sum;\r\n}\r\n<\/code><\/pre>\n<p>As seen, a lot of the calculation around <code>IndexBits<\/code> and <code>IndexAlphaBits<\/code> is repetitive and can be done just once outside the relevant loop. While the compiler should take care of such optimizations, unfortunately, until .NET 6, those invariants were not moved out of the loop and we had to <a href=\"https:\/\/github.com\/SixLabors\/ImageSharp\/pull\/1818\">manually move them out<\/a> in the C# code. In .NET 7, we addressed that problem in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/68061\">dotnet\/runtime#68061<\/a> by enabling the hoisting of expressions out of multiply-nested loops. That improved the code quality of such loops, as seen in the screenshot below. A lot of code has been moved out of the <code>IG05<\/code> (b-loop) loop to the outer loops <code>IG04<\/code> (g-loop) and <code>IG03<\/code> (r-loop).<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/imagesharp_asm.png\" style=\"width:75%\"><\/p>\n<p>While this optimization is very generic and applicable to all architectures, the reason I mention it in this blog post is its significance and impact on Arm64 code. Remember, Arm64 uses a fixed-length 32-bit instruction encoding, so it takes 3-4 instructions to manifest a 64-bit address. This can be seen in many places where the code is trying to load a method address or a global variable. 
For example, below code accesses a <code>static<\/code> variable inside a nested loop, a very common scenario to have.<\/p>\n<pre><code class=\"language-c#\">private static int SOME_VARIABLE = 4;\r\n\r\nint Test(int n, int m) {\r\n    int result = 0;\r\n    for (int i = 0; i &lt; n; i++) {\r\n        for (int j = 0; j &lt; m; j++) {\r\n            result += SOME_VARIABLE;\r\n        }\r\n    }\r\n    return result;\r\n}<\/code><\/pre>\n<p>To access the <code>static<\/code> variable, we manifest the address of the variable and then read its value. Although &#8220;manifesting the address of the variable&#8221; portion is invariant and can be moved out of the i-loop, until .NET 6, we were only moving it out of j-loop as seen in below assembly.<\/p>\n<pre><code class=\"language-assembly\">G_M3833_IG03:                               ; i-loop\r\n            mov     w23, wzr\r\n            cmp     w19, #0\r\n            ble     G_M3833_IG05\r\n            movz    x24, #0x7e18            ; Address 0x7ff9e6ff7e18\r\n            movk    x24, #0xe6ff LSL #16\r\n            movk    x24, #0x7ff9 LSL #32\r\n            mov     x0, x24\r\n            mov     w1, #4\r\n            bl      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE\r\n            ldr     w0, [x24,#56]\r\n\r\nG_M3833_IG04:                               ; j-loop\r\n            add     w21, w21, w0\r\n            add     w23, w23, #1\r\n            cmp     w23, w19\r\n            blt     G_M3833_IG04\r\n\r\nG_M3833_IG05:\r\n            add     w22, w22, #1\r\n            cmp     w22, w20\r\n            blt     G_M3833_IG03<\/code><\/pre>\n<p>In .NET 7, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/68061\">dotnet\/runtime#68061<\/a>, we changed the order in which we evaluate the loops and that made it possible to move the address formation out of i-loop.<\/p>\n<pre><code class=\"language-assembly\">            ...\r\n            movz    x23, #0x7e18            ; Address 0x7ff9e6ff7e18\r\n            movk    
x23, #0xe6ff LSL #16\r\n            movk    x23, #0x7ff9 LSL #32\r\n\r\nG_M3833_IG03:                               ; i-loop\r\n            mov     w24, wzr\r\n            cmp     w19, #0\r\n            ble     G_M3833_IG05\r\n            mov     x0, x23\r\n            mov     w1, #4\r\n            bl      CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE\r\n            ldr     w0, [x23,#0x38]\r\n\r\nG_M3833_IG04:                               ; j-loop\r\n            add     w21, w21, w0\r\n            add     w24, w24, #1\r\n            cmp     w24, w19\r\n            blt     G_M3833_IG04\r\n\r\nG_M3833_IG05:\r\n            add     w22, w22, #1\r\n            cmp     w22, w20\r\n            blt     G_M3833_IG03<\/code><\/pre>\n<h3>Improved code alignment<\/h3>\n<p>In .NET 6, we added support for <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/loop-alignment-in-net-6\/\">loop alignment<\/a> on x86-x64 platforms. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/60135\">dotnet\/runtime#60135<\/a>, we extended the support to Arm64 platforms as well. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/59828\">dotnet\/runtime#59828<\/a>, we started aligning methods at a 32-byte address boundary. Both these changes were made to improve both the performance and the stability of .NET applications running on Arm64 machines. Lastly, we wanted to make sure that the alignment does not cause any adverse effect on performance, and hence in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/60787\">dotnet\/runtime#60787<\/a>, we improved the code by hiding the alignment instructions, whenever possible, behind an unconditional jump or in code blocks. As seen in the screenshot below, previously, we would align the loop <code>IG06<\/code> by adding the padding just before the loop start (at the end of <code>IG05<\/code> in the example below). 
Now, we align it by adding the padding in <code>IG03<\/code> after the unconditional jump <code>b G_M5507_IG05<\/code>.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/hide_align.png\" style=\"width:75%\"><\/p>\n<h3>Instruction selection improvements<\/h3>\n<p>There was a lot of code inefficiency due to poor instruction selection, and we fixed most of the problems during .NET 7. Some of the optimization opportunities in this section were found by analyzing BenchmarkGames benchmarks.<\/p>\n<p>We improved <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/34937\">long-pending<\/a> performance problems around the modulo operation. There is no Arm64 instruction to calculate modulo, so compilers have to translate the <code>a % b<\/code> operation into <code>a - (a \/ b) * b<\/code>. However, if the divisor <code>b<\/code> is a power of 2, we can translate the operation into <code>a &amp; (b - 1)<\/code> instead. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/65535\">dotnet\/runtime#65535<\/a> optimized this for <code>unsigned a<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/66407\">dotnet\/runtime#66407<\/a> optimized it for <code>signed a<\/code> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70599\">dotnet\/runtime#70599<\/a> optimized it for <code>a % 2<\/code>. If the divisor is not a power of 2, we end up with three instructions to perform the modulo operation.<\/p>\n<pre><code class=\"language-c#\">int CalculateMod(int a, int b) =&gt; a % b;<\/code><\/pre>\n<pre><code class=\"language-assembly\">tmp0 = (a \/ b)\r\ntmp1 = (tmp0 * b)\r\ntmp2 = (a - tmp1)\r\nreturn tmp2<\/code><\/pre>\n<p>Arm64 has an instruction that can combine multiplication and subtraction into a single instruction <code>msub<\/code>. Likewise, multiplication followed by an addition can be combined into a single instruction <code>madd<\/code>. 
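<\/p>\n<p>To illustrate the shape of code this applies to, here is a hypothetical C# sketch (the method names are illustrative): the multiply-subtract pattern produced by lowering <code>a % b<\/code> maps onto <code>msub<\/code>, and a multiply followed by an add maps onto <code>madd<\/code>:<\/p>\n<pre><code class=\"language-c#\">\/\/ a % b lowered as a - (a \/ b) * b: the multiply and the subtract\r\n\/\/ can be combined into a single msub instruction.\r\nstatic int Mod(int a, int b) =&gt; a - (a \/ b) * b;\r\n\r\n\/\/ A multiply followed by an add can be combined into a single madd instruction.\r\nstatic int MulAdd(int a, int b, int c) =&gt; a * b + c;<\/code><\/pre>\n<p>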
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61037\">dotnet\/runtime#61037<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/66621\">dotnet\/runtime#66621<\/a> addressed these problems leading to better code quality than what it was in .NET 6 and more performant (<a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/4624\">here<\/a>, <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/4733\">here<\/a> and <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/2143\">here<\/a>).<\/p>\n<p>Lastly, in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/62399\">dotnet\/runtime#62399<\/a>, we transformed the operation <code>x % 2 == 0<\/code> into <code>x &amp; 1 == 0<\/code> earlier in the compilation cycle which gave us better code quality. <\/p>\n<p>To summarize, here are few methods that use mod operation.<\/p>\n<pre><code class=\"language-c#\">    int Test0(int n) {\r\n        return n % 2;\r\n    }\r\n\r\n    int Test1(int n) {\r\n        return n % 4;\r\n    }\r\n\r\n    int Test2(int n) {\r\n        return n % 18;\r\n    }\r\n\r\n    int Test3(int n, int m) {\r\n        return n % m;\r\n    }\r\n\r\n    bool Test4(int n) {\r\n        return (n % 2) == 0;\r\n    }<\/code><\/pre>\n<pre><code class=\"language-diff\">;; Test0\r\n\r\n G_M29897_IG02:\r\n-            lsr     w0, w1, #31\r\n-            add     w0, w0, w1\r\n-            asr     w0, w0, #1\r\n-            lsl     w0, w0, #1\r\n-            sub     w0, w1, w0\r\n+            and     w0, w1, #1\r\n+            cmp     w1, #0\r\n+            cneg    w0, w0, lt\r\n\r\n;; Test1\r\n\r\n G_M264_IG02:\r\n-            asr     w0, w1, #31\r\n-            and     w0, w0, #3\r\n-            add     w0, w0, w1\r\n-            asr     w0, w0, #2\r\n-            lsl     w0, w0, #2\r\n-            sub     w0, w1, w0\r\n+            and     w0, w1, #3\r\n+            negs    w1, w1\r\n+            and     w1, w1, #3\r\n+            csneg   
w0, w0, w1, mi\r\n\r\n;; Test2\r\n\r\n G_M2891_IG02:\r\n-            movz    w0, #0xd1ffab1e\r\n-            movk    w0, #0xd1ffab1e LSL #16\r\n+            movz    w0, #0xD1FFAB1E\r\n+            movk    w0, #0xD1FFAB1E LSL #16\r\n             smull   x0, w0, w1\r\n             asr     x0, x0, #32\r\n             lsr     w2, w0, #31\r\n             asr     w0, w0, #2\r\n             add     w0, w0, w2\r\n-            mov     x2, #9\r\n-            mul     w0, w0, w2\r\n-            lsl     w0, w0, #1\r\n-            sub     w0, w1, w0\r\n\r\n+            mov     w2, #18\r\n+            msub    w0, w0, w2, w1\r\n\r\n;; Test3\r\n\r\n G_M38794_IG03:\r\n             sdiv    w0, w1, w2\r\n-            mul     w0, w0, w2\r\n-            sub     w0, w1, w0\r\n+            msub    w0, w0, w2, w1\r\n\r\n;; Test 4\r\n\r\n G_M55983_IG01:\r\n-            stp     fp, lr, [sp,#-16]!\r\n+            stp     fp, lr, [sp, #-0x10]!\r\n             mov     fp, sp\r\n\r\n G_M55983_IG02:\r\n-            lsr     w0, w1, #31\r\n-            add     w0, w0, w1\r\n-            asr     w0, w0, #1\r\n-            lsl     w0, w0, #1\r\n-            subs    w0, w1, w0\r\n+            tst     w1, #1\r\n             cset    x0, eq\r\n\r\n G_M55983_IG03:\r\n-            ldp     fp, lr, [sp],#16\r\n+            ldp     fp, lr, [sp], #0x10\r\n             ret     lr\r\n\r\n-G_M55983_IG04:\r\n-            bl      CORINFO_HELP_OVERFLOW\r\n\r\n-G_M55983_IG05:\r\n-            bl      CORINFO_HELP_THROWDIVZERO\r\n-            bkpt\r\n<\/code><\/pre>\n<p>Notice that in <code>Test4<\/code>, we also eliminated some of the <code>OVERFLOW<\/code> and <code>THROWDIVZERO<\/code> checks with these optimizations. Here is the graph that shows the impact.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/msub.png\" style=\"width:75%\"><\/p>\n<p>While we were investigating, we found a similar problem with MSVC compiler. 
It too sometimes did not generate the optimal <code>madd<\/code> instruction, and that <a href=\"https:\/\/developercommunity.visualstudio.com\/t\/Missed-Arm64-madd-opportunity-is-multip\/10054905\">is on track to be fixed in VS 17.4<\/a>.<\/p>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/64016\">dotnet\/runtime#64016<\/a>, we optimized the code generated for <code>Math.Round(MidpointRounding.AwayFromZero)<\/code> to avoid calling into a helper and use the <code>frinta<\/code> instruction instead.<\/p>\n<pre><code class=\"language-c#\">double Test(double x) =&gt; Math.Round(x, MidpointRounding.AwayFromZero);<\/code><\/pre>\n<pre><code class=\"language-diff\">             stp     fp, lr, [sp,#-16]!\r\n             mov     fp, sp\r\n\r\n-            mov     w0, #0\r\n-            mov     w1, #1\r\n+            frinta  d0, d0\r\n+\r\n             ldp     fp, lr, [sp],#16\r\n-            b       System.Math:Round(double,int,int):double ;; call, much slower\r\n-; Total bytes of code: 24\r\n+            ret     lr\r\n+; Total bytes of code: 20<\/code><\/pre>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/65584\">dotnet\/runtime#65584<\/a>, we started using the <code>fmin<\/code> and <code>fmax<\/code> instructions for the <code>float<\/code> and <code>double<\/code> variants of <code>Math.Min()<\/code> and <code>Math.Max()<\/code> respectively. 
With this, we were producing minimal code needed to conduct such operations.<\/p>\n<pre><code class=\"language-c#\">double Foo(double a, double b) =&gt; Math.Max(a, b);   <\/code><\/pre>\n<p>.NET 6 code:<\/p>\n<pre><code class=\"language-assembly\">G_M50879_IG01:\r\n            stp     fp, lr, [sp,#-32]!\r\n            mov     fp, sp\r\n\r\nG_M50879_IG02:\r\n            fcmp    d0, d1\r\n            beq     G_M50879_IG07\r\n\r\nG_M50879_IG03:\r\n            fcmp    d0, d0\r\n            bne     G_M50879_IG09\r\n            fcmp    d1, d0\r\n            blo     G_M50879_IG06\r\n\r\nG_M50879_IG04:\r\n            mov     v0.8b, v1.8b\r\n\r\nG_M50879_IG05:\r\n            ldp     fp, lr, [sp],#32\r\n            ret     lr\r\n\r\nG_M50879_IG06:\r\n            fmov    d1, d0\r\n            b       G_M50879_IG04\r\n\r\nG_M50879_IG07:\r\n            str     d1, [fp,#24]\r\n            ldr     x0, [fp,#24]\r\n            cmp     x0, #0\r\n            blt     G_M50879_IG08\r\n            b       G_M50879_IG04\r\n\r\nG_M50879_IG08:\r\n            fmov    d1, d0\r\n            b       G_M50879_IG04\r\n\r\nG_M50879_IG09:\r\n            fmov    d1, d0\r\n            b       G_M50879_IG04<\/code><\/pre>\n<p>.NET 7 code:<\/p>\n<pre><code class=\"language-assembly\">G_M50879_IG01:\r\n            stp     fp, lr, [sp, #-0x10]!\r\n            mov     fp, sp\r\n\r\nG_M50879_IG02:\r\n            fmax    d0, d0, d1\r\n\r\nG_M50879_IG03:\r\n            ldp     fp, lr, [sp], #0x10\r\n            ret     lr<\/code><\/pre>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61617\">dotnet\/runtime#61617<\/a>, for <code>float<\/code>\/<code>double<\/code> comparison scenarios, we eliminated an extra instruction to move <code>0<\/code> into a register and instead embedded <code>0<\/code> directly in the <code>fcmp<\/code> instruction. 
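<\/p>\n<p>For example, a comparison against a floating-point zero constant like the following (a hypothetical sketch; <code>IsPositive<\/code> is just an illustrative name) no longer needs a separate instruction to materialize the <code>0<\/code>:<\/p>\n<pre><code class=\"language-c#\">\/\/ The 0 can be embedded directly in the fcmp instruction as an immediate,\r\n\/\/ instead of first being moved into a floating-point register.\r\nstatic bool IsPositive(double d) =&gt; d &gt; 0;<\/code><\/pre>\n<p>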
And in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/62933\">dotnet\/runtime#62933<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/63821\">dotnet\/runtime#63821<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/64783\">dotnet\/runtime#64783<\/a>, we started eliminating the extra <code>0<\/code> for vector comparisons, as seen in the differences below.<\/p>\n<pre><code class=\"language-diff\"> G_M44024_IG02:        \r\n-            movi    v16.4s, #0x00\r\n-            cmeq    v16.4s, v16.4s, v0.4s\r\n+            cmeq    v16.4s, v0.4s, #0\r\n             mov     v0.16b, v16.16b\r\n             ...\r\n-            movi    v16.2s, #0x00\r\n-            fcmgt   d16, d0, d16\r\n+            fcmgt   d16, d0, #0<\/code><\/pre>\n<p>Likewise, in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61035\">dotnet\/runtime#61035<\/a>, we fixed the way we were accounting for an immediate value <code>1<\/code> in addition and subtraction, and that gave us <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61035#issue-1040204551\">massive code size improvements<\/a>. 
As seen below, the <code>1<\/code> can be encoded in the <code>add<\/code> instruction itself, which saved us 1 instruction (4 bytes) for such use cases.<\/p>\n<pre><code class=\"language-c#\">static int Test(int x) =&gt; x + 1;<\/code><\/pre>\n<pre><code class=\"language-diff\">-        mov     w1, #1\r\n-        add     w0, w0, w1\r\n+        add     w0, w0, #1<\/code><\/pre>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61045\">dotnet\/runtime#61045<\/a>, we improved the instruction selection for left-shift operations by known constant amounts.<\/p>\n<pre><code class=\"language-c#\">static ulong Test1(uint x) =&gt; ((ulong)x) &lt;&lt; 2;<\/code><\/pre>\n<pre><code class=\"language-diff\">; Method Prog:Test1(int):long\r\nG_M16463_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M16463_IG02:\r\n-           mov     w0, w0\r\n-           lsl     x0, x0, #2\r\n+           ubfiz   x0, x0, #2, #32\r\nG_M16463_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n-; Total bytes of code: 24\r\n+; Total bytes of code: 20<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61549\">dotnet\/runtime#61549<\/a> improved some instruction sequences to use <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220727-00\/?p=106907\">extended register operations<\/a>, which are shorter and more performant. In the example below, earlier we would sign-extend the value in register <code>w20<\/code> into <code>x0<\/code> and then add it to a different value, say <code>x19<\/code>. 
Now, we use the extension <code>SXTW<\/code> as part of the <code>add<\/code> instruction directly.<\/p>\n<pre><code class=\"language-diff\">-       sxtw    x0, w20 \r\n-       add     x0, x19, x0\r\n+       add     x0, x19, w20, SXTW\r\n<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/62630\">dotnet\/runtime#62630<\/a> performed a peephole optimization to eliminate a zero\/sign extension performed right after an <code>ldr<\/code> load, since <code>ldr<\/code> itself does the required zero\/sign extension.<\/p>\n<pre><code class=\"language-c#\">static ulong Foo(ref uint p) =&gt; p;<\/code><\/pre>\n<pre><code class=\"language-diff\">; Method Prog:Foo(byref):long\r\nG_M41239_IG01:\r\n            stp     fp, lr, [sp,#-16]!\r\n            mov     fp, sp\r\nG_M41239_IG02:\r\n            ldr     w0, [x0]\r\n-           mov     w0, w0\r\nG_M41239_IG03:\r\n            ldp     fp, lr, [sp],#16\r\n            ret     lr\r\n-; Total bytes of code: 24\r\n+; Total bytes of code: 20<\/code><\/pre>\n<p>We also improved a specific scenario that involves comparing a vector with the <code>Zero<\/code> vector. Consider the following example. <\/p>\n<pre><code class=\"language-c#\">static bool IsZero(Vector128&lt;int&gt; vec) =&gt; vec == Vector128&lt;int&gt;.Zero;<\/code><\/pre>\n<p>Previously, we would perform this operation using 3 steps:<\/p>\n<ol>\n<li>Compare the contents of <code>vec<\/code> and <code>Vector128&lt;int&gt;.Zero<\/code> using the <code>cmeq<\/code> instruction. If the contents are equal, the instruction sets every bit of the vector element to <code>1<\/code>, otherwise it sets them to <code>0<\/code>.<\/li>\n<li>Next, we find the minimum value across all the vector elements using <code>uminv<\/code> and extract that out. 
In our case, if <code>vec == 0<\/code>, it would be <code>1<\/code> and if <code>vec != 0<\/code>, it would be <code>0<\/code>.<\/li>\n<li>And at last, we compare if the outcome of <code>uminv<\/code> was <code>0<\/code> or <code>1<\/code> to determine if <code>vec == 0<\/code> or <code>vec != 0<\/code>.<\/li>\n<\/ol>\n<p>In .NET 7, we improved this algorithm to do the following steps instead:<\/p>\n<ol>\n<li>Scan through the <code>vec<\/code> elements and find the maximum element using <code>umaxv<\/code> instruction.<\/li>\n<li>If the maximum element found in step 1 is greater than <code>0<\/code>, then we know the contents of <code>vec<\/code> are not zero. <\/li>\n<\/ol>\n<p>Here is the code difference between .NET 6 and .NET 7.<\/p>\n<pre><code class=\"language-diff\">; Assembly listing for method IsZero(System.Runtime.Intrinsics.Vector128`1[Int32]):bool\r\n    stp     fp, lr, [sp,#-16]!\r\n    mov     fp, sp\r\n-   cmeq    v16.4s, v0.4s, #0\r\n-   uminv   b16, v16.16b\r\n-   umov    w0, v16.b[0]\r\n+   umaxv   b16, v0.16b\r\n+   umov    w0, v16.s[0]\r\n    cmp     w0, #0\r\n-   cset    x0, ne\r\n+   cset    x0, eq\r\n    ldp     fp, lr, [sp],#16\r\n    ret     lr\r\n-; Total bytes of code 36\r\n+; Total bytes of code 32<\/code><\/pre>\n<p>With this, we got <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/65632#issuecomment-1058297582\">around 25% win<\/a> in various benchmarks.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/65632.png\" style=\"width:75%\"><\/p>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/69333\">dotnet\/runtime#69333<\/a>, we started using <a href=\"https:\/\/docs.microsoft.com\/cpp\/intrinsics\/arm64-intrinsics?view=msvc-170#I\">bit scanning intrinsics<\/a> and it gave us good throughput wins for Arm64 compilation.<\/p>\n<h3>Memory initialization improvements<\/h3>\n<p>At the beginning of the method, most often developers 
initialize their variables to a default value. Natively, at the assembly level, this initialization happens by writing a zero value to stack memory (since local variables live on the stack). A typical memory-zeroing instruction sequence consists of storing the zero register to a stack location, like <code>str xzr, [fp, #80]<\/code>. Depending on the number of variables we are initializing, the number of instructions needed to zero the memory can grow. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61030\">dotnet\/runtime#61030<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/63422\">dotnet\/runtime#63422<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/68085\">dotnet\/runtime#68085<\/a>, we switched to using SIMD registers for initializing the memory. SIMD registers are 64 or 128 bits wide and can dramatically reduce the number of instructions needed to zero out the memory. A similar concept applies to block copies. If we need to copy a large block of memory, previously we used a pair of 8-byte registers, which would load and store values 16 bytes at a time from source to destination. We switched to using SIMD register pairs, which instead copy 32 bytes in a single instruction. 
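The chunked copy strategy just described, together with its fall-back for smaller remainders, can be sketched in C. This is an illustrative model only, not the JIT's actual code, and `copy_block` is a hypothetical helper name; the comments note which Arm64 instructions each chunk size corresponds to.

```c
#include <stdint.h>
#include <string.h>

/* Sketch (not the JIT's code): copy a block in 32-byte chunks, the way
 * a paired-SIMD-register ldp/stp of two q registers does on Arm64,
 * then fall back to 16-byte and 8-byte copies for the remainder. */
static void copy_block(uint8_t *dst, const uint8_t *src, size_t len) {
    while (len >= 32) {            /* one ldp/stp of two q registers */
        memcpy(dst, src, 32);
        dst += 32; src += 32; len -= 32;
    }
    if (len >= 16) {               /* one ldr/str of a single q register */
        memcpy(dst, src, 16);
        dst += 16; src += 16; len -= 16;
    }
    while (len >= 8) {             /* one ldr/str of an x register */
        memcpy(dst, src, 8);
        dst += 8; src += 8; len -= 8;
    }
    while (len--) *dst++ = *src++; /* remaining bytes */
}
```

The point of the 32-byte chunk is simply fewer loop iterations and fewer issued instructions for the same number of bytes moved.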
For addresses that are not aligned on a 32-byte boundary, the algorithm falls back to 16-byte or 8-byte operations, as seen below.<\/p>\n<pre><code class=\"language-diff\">-            ldp     x1, x2, [fp,#80]   \/\/ [V17 tmp13]\r\n-            stp     x1, x2, [fp,#24]   \/\/ [V32 tmp28]\r\n-            ldp     x1, x2, [fp,#96]   \/\/ [V17 tmp13+0x10]\r\n-            stp     x1, x2, [fp,#40]   \/\/ [V32 tmp28+0x10]\r\n-            ldp     x1, x2, [fp,#112]  \/\/ [V17 tmp13+0x20]\r\n-            stp     x1, x2, [fp,#56]   \/\/ [V32 tmp28+0x20]\r\n-            ldr     x1, [fp,#128]      \/\/ [V17 tmp13+0x30]\r\n-            str     x1, [fp,#72]       \/\/ [V32 tmp28+0x30]\r\n+            add     xip1, fp, #56\r\n+            ldr     x1, [xip1,#24]\r\n+            str     x1, [fp,#24]\r\n+            ldp     q16, q17, [xip1,#32]   ; 32-byte load\r\n+            stp     q16, q17, [fp,#32]     ; 32-byte store\r\n+            ldr     q16, [xip1,#64]        ; 16-byte load\r\n+            str     q16, [fp,#64]          ; 16-byte store<\/code><\/pre>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/64481\">dotnet\/runtime#64481<\/a>, we made several optimizations: eliminating memory zeroing when it is not needed, and using better instructions and addressing modes. Before .NET 7, if the memory to be copied\/initialized was big enough, we would inject a loop to do the operation, as seen below. Here, we are zeroing out the memory at <code>[sp, #-16]<\/code>, decreasing the <code>sp<\/code> value by <code>16<\/code> (pre-index addressing mode with writeback) and then looping until the <code>x7<\/code> value becomes 0.<\/p>\n<pre><code class=\"language-assembly\">            mov     w7, #96\r\nG_M49279_IG13:\r\n            stp     xzr, xzr, [sp,#-16]!\r\n            subs    x7, x7, #16\r\n            bne     G_M49279_IG13<\/code><\/pre>\n<p>In .NET 7, we started unrolling some of this code if the memory size is within 128 bytes. 
In the code below, we start by zeroing <code>[sp-16]<\/code> to hint to the CPU that sequential zeroing is taking place and prompt it to switch to <a href=\"https:\/\/developer.arm.com\/documentation\/100798\/0300\/functional-description\/level-1-memory-system\/cache-behavior\/write-streaming-mode\">write streaming mode<\/a>.<\/p>\n<pre><code class=\"language-assembly\">            stp     xzr, xzr, [sp,#-16]!\r\n            stp     xzr, xzr, [sp,#-80]!\r\n            stp     xzr, xzr, [sp,#16]\r\n            stp     xzr, xzr, [sp,#32]\r\n            stp     xzr, xzr, [sp,#48]\r\n            stp     xzr, xzr, [sp,#64]<\/code><\/pre>\n<p>Speaking of &#8220;write streaming mode&#8221;, we also recognized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/67244\">dotnet\/runtime#67244<\/a> the need to use the <a href=\"https:\/\/developer.arm.com\/documentation\/ddi0595\/2021-06\/AArch64-Instructions\/DC-ZVA--Data-Cache-Zero-by-VA\">DC ZVA instruction<\/a> and <a href=\"https:\/\/developercommunity.visualstudio.com\/t\/Arm64:-Use-optimized-DC-ZVA-instructio\/10054880?space=62&amp;q=Optimize+Arm64+memset&amp;entry=problem\">suggested it to the MSVC team<\/a>.<\/p>\n<p>We noted that in certain situations, for operations like <code>memset<\/code> and <code>memmove<\/code>, on x86-x64 we were forwarding the execution to the CRT implementation, but on Arm64, we had <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/2453f16807b85b279efc26d17d6f20de87801c09\/src\/coreclr\/vm\/arm64\/crthelpers.asm\">hand-written assembly<\/a> to perform such operations.<\/p>\n<p>Here is the x64 code generated for the <a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/9e4217922650b8cf8e54e835c395fb8117b9ee57\/src\/benchmarks\/micro\/runtime\/StoreBlock\/StoreBlock.cs#L156-L162\">CopyBlock128<\/a> benchmark.<\/p>\n<pre><code class=\"language-assembly\">G_M19447_IG03:\r\n       lea      rcx, bword ptr [rsp+08H]\r\n       lea      rdx, bword ptr [rsp+88H]\r\n       mov      r8d, 128\r\n       call    
 CORINFO_HELP_MEMCPY  ;; this calls into \"jmp memmove\"\r\n       inc      edi\r\n       cmp      edi, 100\r\n       jl       SHORT G_M19447_IG03<\/code><\/pre>\n<p>However, until .NET 6, the Arm64 assembly code looked like this:<\/p>\n<pre><code class=\"language-assembly\">G_M19447_IG03:\r\n            ldr     x1, [fp,#152]\r\n            str     x1, [fp,#24]\r\n            ldp     q16, q17, [fp,#160]\r\n            stp     q16, q17, [fp,#32]\r\n            ldp     q16, q17, [fp,#192]\r\n            stp     q16, q17, [fp,#64]\r\n            ldp     q16, q17, [fp,#224]\r\n            stp     q16, q17, [fp,#96]\r\n            ldr     q16, [fp,#0xd1ffab1e]\r\n            str     q16, [fp,#128]\r\n            ldr     x1, [fp,#0xd1ffab1e]\r\n            str     x1, [fp,#144]\r\n            add     w0, w0, #1\r\n            cmp     w0, #100\r\n            blt     G_M19447_IG03<\/code><\/pre>\n<p>We switched to the CRT implementations of <code>memmove<\/code> and <code>memset<\/code> long ago for <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/25750\">windows\/linux x64<\/a> as well as for <a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/17536\">linux arm64<\/a>. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/67788\">dotnet\/runtime#67788<\/a>, we switched to the CRT implementation for windows\/arm64 as well and saw <a href=\"https:\/\/github.com\/dotnet\/perf-autofiling-issues\/issues\/4623\">up to 25% improvements<\/a> in such benchmarks.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/67788.png\" style=\"width:75%\"><\/p>\n<p>The most important optimization we added during .NET 7 is <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220817-00\/?p=106998\">conditional execution instructions<\/a>. 
With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71616\">dotnet\/runtime#71616<\/a>, @a74nh from Arm contributed support for generating conditional comparison instructions, and with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73472\">dotnet\/runtime#73472<\/a>, we will soon have &#8220;conditional selection&#8221; instructions as well.<\/p>\n<p>With &#8220;conditional comparison&#8221; <code>ccmp<\/code>, we can perform a comparison based upon the result of a previous comparison. Let us try to decipher what is going on in the example below. Previously, we would check <code>!x<\/code> using <code>cmp w2, #0<\/code> and, if true, set <code>1<\/code> in <code>x0<\/code> using the <code>cset<\/code> instruction. Likewise, the condition <code>y == 9<\/code> is checked using <code>cmp w1, #9<\/code> and <code>1<\/code> is set in <code>x3<\/code> if they are equal. Lastly, the code compares the contents of <code>w0<\/code> and <code>w3<\/code> to see if both conditions held and jumps accordingly. In .NET 7, we perform the same initial comparison of <code>!x<\/code>. But then we perform <code>ccmp w1, #9, c, eq<\/code>: if the previous comparison was equal (the <code>eq<\/code> condition at the end), it compares <code>w1<\/code> with <code>9<\/code> and sets the flags from that comparison, setting the zero flag if they are equal. If the previous comparison did not hold, it instead sets the flags to the immediate value <code>c<\/code> (carry set, zero clear). The next instruction then simply checks whether the zero flag was set and jumps accordingly. 
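To make the flag behavior concrete, the `cmp`/`ccmp` sequence can be modeled in C. This is a hypothetical illustration only (`test_ccmp_model` is not real JIT or runtime code), tracking just the zero flag for the expression `(!x & y == 9)`:

```c
#include <stdbool.h>

/* Model of the Arm64 sequence
 *     cmp  w2, #0
 *     ccmp w1, #9, c, eq
 * followed by a test of the zero (Z) flag, for (!x & y == 9). */
static bool test_ccmp_model(int x, int y) {
    bool prev_eq = (x == 0);  /* cmp w2, #0 sets Z exactly when x == 0   */
    bool z;
    if (prev_eq)
        z = (y == 9);         /* eq held: ccmp really compares w1 and 9  */
    else
        z = false;            /* eq failed: flags := #c (Z is clear)     */
    return z;                 /* the final cset/cbz just examine Z       */
}
```

The model returns the same result as evaluating `(x == 0) && (y == 9)` directly, which is the point: one fused flag-setting instruction replaces a second `cmp` plus the materialize-and-combine steps.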
As the difference shows, this eliminates a comparison instruction and gives us a performance improvement.<\/p>\n<pre><code class=\"language-C#\">bool Test() =&gt; (!x &amp; y == 9);<\/code><\/pre>\n<pre><code class=\"language-diff\">             cmp     w2, #0\r\n-            cset    x0, eq\r\n-            cmp     w1, #9\r\n-            cset    x3, ls\r\n-            tst     w0, w3\r\n-            beq     G_M40619_IG09\r\n+            ccmp    w1, #9, c, eq\r\n+            cset    x0, ls\r\n+            cbz     w0, G_M40619_IG09<\/code><\/pre>\n<h2>Tooling improvements<\/h2>\n<p>As many are aware, a developer interested in seeing the disassembly for their code during development can paste a code snippet into https:\/\/sharplab.io\/. However, it only shows the disassembly for x64, and there was no similar online tool to display Arm64 disassembly. <a href=\"https:\/\/github.com\/hez2010\">@hez2010<\/a> <a href=\"https:\/\/github.com\/compiler-explorer\/compiler-explorer\/pull\/3168\">added support for .NET (C#\/F#\/VB)<\/a> to godbolt. Now, developers can paste their .NET code and <a href=\"https:\/\/godbolt.org\/z\/dab19aWs7\">inspect the disassembly<\/a> for all the platforms we support, including Arm64. There is also a Visual Studio extension, <a href=\"https:\/\/marketplace.visualstudio.com\/items?itemName=EgorBogatov.Disasmo\">Disasmo<\/a>, that can be installed to inspect the disassembly, but in order to use it, you need to have the dotnet\/runtime repository cloned and built on your local machine.<\/p>\n<h2>Impact<\/h2>\n<p>As seen from the various graphs above, with our work in .NET 7, many Micro benchmarks improved by 10~60%. I also want to share a graph of TE benchmarks run on Linux in the asp.net performance lab. As seen below, when we started .NET 7, the Requests\/Second (RPS) was lower for Arm64, but as we made progress, the line climbs up towards x64 over the various .NET 7 previews. 
Likewise, the latency (measured in milliseconds) drops from .NET 6 to .NET 7.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/TE_RPS.png\" style=\"width:75%\"><\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/TE_Latency.png\" style=\"width:75%\"><\/p>\n<h2>Benchmark environment<\/h2>\n<p>How can we discuss performance without mentioning the environment in which we conducted our measurements? For Arm64, we run two sets of benchmarks regularly and monitor their performance build over build. <a href=\"https:\/\/github.com\/aspnet\/Benchmarks\">The TE benchmarks<\/a> are run in a performance lab owned by the ASP.NET team using the <a href=\"https:\/\/github.com\/dotnet\/crank\">crank<\/a> infrastructure. The results are posted on https:\/\/aka.ms\/aspnet\/benchmarks. We run the benchmarks on an <code>Intel Xeon<\/code> x64 machine and an <code>Ampere Altra<\/code> Arm64 machine, as seen in the <a href=\"https:\/\/msit.powerbi.com\/view?r=eyJrIjoiYTZjMTk3YjEtMzQ3Yi00NTI5LTg5ZDItNmUyMGRlOTkwMGRlIiwidCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsImMiOjV9\">list of environments<\/a> on our results site. We run on both the Linux and Windows operating systems.<\/p>\n<p>The other set of benchmarks run regularly is the <a href=\"https:\/\/github.com\/dotnet\/performance\/tree\/main\/src\/benchmarks\/micro\">Micro benchmarks<\/a>. They are run in a performance lab owned by the .NET team and the results are posted <a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/allTestHistory\/TestHistoryIndexIndex.html\">on this site<\/a>. 
As with the TE benchmark environment, we run these benchmarks on a wide variety of devices, such as <code>Surface Pro X<\/code>, <code>Intel<\/code>, <code>AMD<\/code> and <code>Ampere<\/code> machines.<\/p>\n<p>In .NET 7, we added <a href=\"https:\/\/github.com\/rickbrew\">Rick Brewster<\/a>&#8216;s <code>Paint.NET<\/code> tool to our benchmarks. It tracks various aspects of a UI application, such as startup, ready state and rendering, as seen in the graph below, and we monitor these metrics for both x64 and Arm64.<\/p>\n<p align=\"center\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2022\/09\/PDN_graph.png\" style=\"width:75%\"><\/p>\n<h3>Hardware with Linux OS<\/h3>\n<p><a href=\"https:\/\/www.anandtech.com\/show\/16315\/the-ampere-altra-review\/3\">Ampere Altra Arm64<\/a><\/p>\n<pre><code class=\"language-console\">Architecture:           aarch64\r\n  CPU op-mode(s):       32-bit, 64-bit\r\n  Byte Order:           Little Endian\r\nCPU(s):                 80\r\n  On-line CPU(s) list:  0-79\r\nVendor ID:              ARM\r\n  Model name:           Neoverse-N1\r\n    Model:              1\r\n    Thread(s) per core: 1\r\n    Core(s) per socket: 80\r\n    Socket(s):          1\r\n    Stepping:           r3p1\r\n    Frequency boost:    disabled\r\n    CPU max MHz:        3000.0000\r\n    CPU min MHz:        1000.0000\r\n    BogoMIPS:           50.00\r\n    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs\r\nCaches (sum of all):  \r\n  L1d:                  5 MiB (80 instances)\r\n  L1i:                  5 MiB (80 instances)\r\n  L2:                   80 MiB (80 instances)\r\n  L3:                   32 MiB (1 instance)\r\nNUMA:                   \r\n  NUMA node(s):         1\r\n  NUMA node0 CPU(s):    0-79\r\nVulnerabilities:        \r\n  Itlb multihit:        Not affected\r\n  L1tf:                 Not affected\r\n  Mds:                  Not affected\r\n  
Meltdown:             Not affected\r\n  Mmio stale data:      Not affected\r\n  Retbleed:             Not affected\r\n  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl\r\n  Spectre v1:           Mitigation; __user pointer sanitization\r\n  Spectre v2:           Mitigation; CSV2, BHB\r\n  Srbds:                Not affected\r\n  Tsx async abort:      Not affected<\/code><\/pre>\n<p><a href=\"https:\/\/www.intel.com\/content\/www\/us\/en\/products\/platforms\/details\/cascade-lake.html\">Intel Cascade lake x64<\/a><\/p>\n<pre><code class=\"language-console\">  Architecture:            x86_64\r\n  CPU op-mode(s):        32-bit, 64-bit\r\n  Address sizes:         46 bits physical, 48 bits virtual\r\n  Byte Order:            Little Endian\r\nCPU(s):                  52\r\n  On-line CPU(s) list:   0-51\r\nVendor ID:               GenuineIntel\r\n  Model name:            Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz\r\n    CPU family:          6\r\n    Model:               85\r\n    Thread(s) per core:  1\r\n    Core(s) per socket:  26\r\n    Socket(s):           2\r\n    Stepping:            7\r\n    CPU max MHz:         2600.0000\r\n    CPU min MHz:         1000.0000\r\n    BogoMIPS:            5200.00\r\n    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fx\r\n                         sr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts re\r\n                         p_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx\r\n                         est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_t\r\n                         imer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_\r\n                         single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid\r\n        
                 ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdse\r\n                         ed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves\r\n                         cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_ep\r\n                         p hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities\r\nVirtualization features:\r\n  Virtualization:        VT-x\r\nCaches (sum of all):\r\n  L1d:                   1.6 MiB (52 instances)\r\n  L1i:                   1.6 MiB (52 instances)\r\n  L2:                    52 MiB (52 instances)\r\n  L3:                    71.5 MiB (2 instances)\r\nNUMA:\r\n  NUMA node(s):          2\r\n  NUMA node0 CPU(s):     0-25\r\n  NUMA node1 CPU(s):     26-51\r\nVulnerabilities:\r\n  Itlb multihit:         KVM: Mitigation: VMX disabled\r\n  L1tf:                  Not affected\r\n  Mds:                   Not affected\r\n  Meltdown:              Not affected\r\n  Mmio stale data:       Mitigation; Clear CPU buffers; SMT disabled\r\n  Retbleed:              Mitigation; Enhanced IBRS\r\n  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp\r\n  Spectre v1:            Mitigation; usercopy\/swapgs barriers and __user pointer sanitization\r\n  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling\r\n  Srbds:                 Not affected\r\n  Tsx async abort:       Mitigation; TSX disabled<\/code><\/pre>\n<p><a href=\"https:\/\/ark.intel.com\/content\/www\/us\/en\/ark\/products\/120474\/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html\">Intel Skylake x64<\/a><\/p>\n<pre><code class=\"language-console\">Architecture:                    x86_64\r\nCPU op-mode(s):                  32-bit, 64-bit\r\nByte Order:                      Little Endian\r\nAddress sizes:                   46 bits physical, 48 bits 
virtual\r\nCPU(s):                          28\r\nOn-line CPU(s) list:             0-27\r\nThread(s) per core:              2\r\nCore(s) per socket:              14\r\nSocket(s):                       1\r\nNUMA node(s):                    1\r\nVendor ID:                       GenuineIntel\r\nCPU family:                      6\r\nModel:                           85\r\nModel name:                      Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz\r\nStepping:                        4\r\nCPU MHz:                         1000.131\r\nBogoMIPS:                        4400.00\r\nVirtualization:                  VT-x\r\nL1d cache:                       448 KiB\r\nL1i cache:                       448 KiB\r\nL2 cache:                        14 MiB\r\nL3 cache:                        19.3 MiB\r\nNUMA node0 CPU(s):               0-27\r\nVulnerability Itlb multihit:     KVM: Mitigation: Split huge pages\r\nVulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable\r\nVulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable\r\nVulnerability Meltdown:          Mitigation; PTI\r\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\r\nVulnerability Spectre v1:        Mitigation; usercopy\/swapgs barriers and __user pointer sanitization\r\nVulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling\r\nVulnerability Srbds:             Not affected\r\nVulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable\r\nFlags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe sysca\r\n                                 ll nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmu\r\n                                 lqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadl\r\n                                 ine_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssb\r\n                                 d mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid r\r\n                                 tm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1\r\n                                  xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d<\/code><\/pre>\n<h3>Hardware with Windows OS<\/h3>\n<p>We have a wide range of machines running Windows 10, Windows 11 and Windows Server 2022. Some of them are client devices, like <a href=\"https:\/\/www.microsoft.com\/surface\/business\/surface-pro-x\/processor\">Surface Pro X Arm64<\/a>, while others are heavy server devices, like <a href=\"https:\/\/www.intel.com\/content\/www\/us\/en\/products\/platforms\/details\/cascade-lake.html\">Intel Cascade lake x64<\/a>, <a href=\"https:\/\/www.anandtech.com\/show\/16315\/the-ampere-altra-review\/3\">Ampere Altra Arm64<\/a> and <a href=\"https:\/\/ark.intel.com\/content\/www\/us\/en\/ark\/products\/120474\/intel-xeon-gold-5120-processor-19-25m-cache-2-20-ghz.html\">Intel Skylake x64<\/a>.<\/p>\n<h2>Conclusion<\/h2>\n<p>To conclude, we had a great .NET 7 release, with lots of improvements made in various areas, from the libraries to the runtime to code generation. We closed the performance gap between x64 and Arm64 on specific hardware. We discovered many critical problems, like poor thread pool scaling and incorrect L3 cache size determination, and we addressed them in .NET 7. 
We improved generated code quality by taking advantage of Arm64 addressing modes, optimizing the <code>%<\/code> operation, and improving general array accesses. We had a great partnership with Arm engineers <a href=\"https:\/\/github.com\/a74nh\">@a74nh<\/a>, <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> and <a href=\"https:\/\/github.com\/TamarChristinaArm\">@TamarChristinaArm<\/a>, who made great contributions by converting some hot .NET library code to use intrinsics. We want to thank the many contributors who made it possible for us to ship a faster .NET 7 on Arm64 devices.<\/p>\n<p>Thank you for taking the time to read, and do let us know your feedback on using .NET on Arm64. Happy coding on Arm64!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>.NET 7 introduces a plethora of performance improvements for developers, including those targeting Arm64 devices. In this blog I break down everything you need to know about the improvements in .NET 7.<\/p>\n","protected":false},"author":38211,"featured_media":42192,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,3009],"tags":[7611,7172,108],"class_list":["post-42195","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-performance","tag-dotnet-7","tag-arm64","tag-performance"],"acf":[],"blog_post_summary":"<p>.NET 7 introduces a plethora of performance improvements for developers, including those targeting Arm64 devices. 
In this blog I break down everything you need to know about the improvements in .NET 7.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/42195","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/38211"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=42195"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/42195\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/42192"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=42195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=42195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=42195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}