{"id":57952,"date":"2025-09-10T06:00:00","date_gmt":"2025-09-10T13:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=57952"},"modified":"2025-11-10T15:19:27","modified_gmt":"2025-11-10T23:19:27","slug":"performance-improvements-in-net-10","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-10\/","title":{"rendered":"Performance Improvements in .NET 10"},"content":{"rendered":"<p>My kids <em>love<\/em> &#8220;Frozen&#8221;. They can sing every word, re-enact every scene, and provide detailed notes on the proper sparkle of Elsa&#8217;s ice dress. I&#8217;ve seen the movie more times than I can recount, to the point where, if you&#8217;ve seen me do any live coding, you&#8217;ve probably seen my subconscious incorporate an Arendelle reference or two. After so many viewings, I began paying closer attention to the details, like how at the very beginning of the film the ice harvesters are singing a song that subtly foreshadows the story&#8217;s central conflicts, the characters&#8217; journeys, and even the key to resolving the climax. I&#8217;m slightly ashamed to admit I didn&#8217;t comprehend this connection until viewing number ten or so, at which point I also realized I had no idea if this ice harvesting was actually &#8220;a thing&#8221; or if it was just a clever vehicle for Disney to spin a yarn. Turns out, as I subsequently researched, it&#8217;s quite real.<\/p>\n<p>In the 19th century, before refrigeration, ice was an incredibly valuable commodity. Winters in the northern United States turned ponds and lakes into seasonal gold mines. The most successful operations ran with precision: workers cleared snow from the surface so the ice would grow thicker and stronger, and they scored the surface into perfect rectangles using horse-drawn plows, turning the lake into a frozen checkerboard. 
Once the grid was cut, teams with long saws worked to free uniform blocks weighing several hundred pounds each. These blocks were floated along channels of open water toward the shore, at which point men with poles levered the blocks up ramps and hauled them into storage. Basically, what the movie shows.<\/p>\n<p>The storage itself was an art. Massive wooden ice houses, sometimes holding tens of thousands of tons, were lined with insulation, typically straw. Done well, this insulation could keep the ice solid for months, even through summer heat. Done poorly, you would open the doors to slush. And for those moving ice over long distances, typically by ship, every degree, every crack in the insulation, every extra day in transit meant more melting and more loss.<\/p>\n<p>Enter Frederic Tudor, the &#8220;Ice King&#8221; of Boston. He was obsessed with systemic efficiency. Where competitors saw unavoidable loss, Tudor saw a solvable problem. After experimenting with different insulators, he leaned on cheap sawdust, a lumber mill byproduct that outperformed straw, packing it densely around the ice to cut melt losses significantly. For harvesting efficiency, his operations adopted Nathaniel Jarvis Wyeth&#8217;s grid-scoring system, which produced uniform blocks that could be packed tightly, minimizing air gaps that would otherwise increase exposure in a ship&#8217;s hold. And to shorten the critical time between shore and ship, Tudor built out port infrastructure and depots near docks, allowing ships to load and unload much faster. Each change, from tools to ice house design to logistics, amplified the last, turning a risky local harvest into a reliable global trade. With Tudor&#8217;s enhancements, he had solid ice arriving in places like Havana, Rio de Janeiro, and even Calcutta (a voyage of four months in the 1830s). 
His performance gains allowed the product to survive journeys that were previously unthinkable.<\/p>\n<p>What made Tudor&#8217;s ice last halfway around the world wasn&#8217;t one big idea. It was a plethora of small improvements, each multiplying the effect of the last. In software development, the same principle holds: big leaps forward in performance rarely come from a single sweeping change, rather from hundreds or thousands of targeted optimizations that compound into something transformative. .NET 10&#8217;s performance story isn&#8217;t about one Disney-esque magical idea; it&#8217;s about carefully shaving off nanoseconds here and tens of bytes there, streamlining operations that are executed trillions of times.<\/p>\n<p>In the rest of this post, just as we did in Performance Improvements in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-9\/\">.NET 9<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">.NET 8<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/\">.NET 7<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-6\">.NET 6<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-5\">.NET 5<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-3-0\">.NET Core 3.0<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-2-1\">.NET Core 2.1<\/a>, and <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core\">.NET Core 2.0<\/a>, we&#8217;ll dig into hundreds of the small but meaningful and compounding performance improvements since .NET 9 that make up .NET 10&#8217;s story (if you instead stay on LTS releases and thus are upgrading from .NET 8 instead of from .NET 9, you&#8217;ll see even more improvements based on the aggregation from all the <a 
href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-9\/\">improvements in .NET 9<\/a> as well). So, without further ado, go grab a cup of your favorite hot beverage (or, given my intro, maybe something a bit more frosty), sit back, relax, and &#8220;Let It Go&#8221;!<\/p>\n<p>Or, hmm, maybe, let&#8217;s push performance &#8220;Into the Unknown&#8221;?<\/p>\n<p>Let .NET 10 performance &#8220;Show Yourself&#8221;?<\/p>\n<p>&#8220;Do You Want To Build a <del>Snowman<\/del> Fast Service?&#8221;<\/p>\n<p>I&#8217;ll see myself out.<\/p>\n<h2>Benchmarking Setup<\/h2>\n<p>As in previous posts, this tour is chock full of micro-benchmarks intended to showcase various performance improvements. Most of these benchmarks are implemented using <a href=\"https:\/\/www.nuget.org\/packages\/BenchmarkDotNet\/0.15.2\">BenchmarkDotNet 0.15.2<\/a>, with a simple setup for each.<\/p>\n<p>To follow along, make sure you have <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/9.0\">.NET 9<\/a> and <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/10.0\">.NET 10<\/a> installed, as most of the benchmarks compare the same test running on each. Then, create a new C# project in a new <code>benchmarks<\/code> directory:<\/p>\n<pre><code class=\"language-txt\">dotnet new console -o benchmarks\r\ncd benchmarks<\/code><\/pre>\n<p>That will produce two files in the <code>benchmarks<\/code> directory: <code>benchmarks.csproj<\/code>, which is the project file with information about how the application should be compiled, and <code>Program.cs<\/code>, which contains the code for the application. 
Finally, replace everything in <code>benchmarks.csproj<\/code> with this:<\/p>\n<pre><code class=\"language-xml\">&lt;Project Sdk=\"Microsoft.NET.Sdk\"&gt;\r\n\r\n  &lt;PropertyGroup&gt;\r\n    &lt;OutputType&gt;Exe&lt;\/OutputType&gt;\r\n    &lt;TargetFrameworks&gt;net10.0;net9.0&lt;\/TargetFrameworks&gt;\r\n    &lt;LangVersion&gt;Preview&lt;\/LangVersion&gt;\r\n    &lt;ImplicitUsings&gt;enable&lt;\/ImplicitUsings&gt;\r\n    &lt;Nullable&gt;enable&lt;\/Nullable&gt;\r\n    &lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt;\r\n  &lt;\/PropertyGroup&gt;\r\n\r\n  &lt;ItemGroup&gt;\r\n    &lt;PackageReference Include=\"BenchmarkDotNet\" Version=\"0.15.2\" \/&gt;\r\n  &lt;\/ItemGroup&gt;\r\n\r\n&lt;\/Project&gt;<\/code><\/pre>\n<p>With that, we&#8217;re good to go. Unless otherwise noted, I&#8217;ve tried to make each benchmark standalone; just copy\/paste its whole contents into the Program.cs file, overwriting everything that&#8217;s there, and then run the benchmarks. Each test includes at its top a comment for the <code>dotnet<\/code> command to use to run the benchmark. It&#8217;s typically something like this:<\/p>\n<pre><code class=\"language-txt\">dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0<\/code><\/pre>\n<p>which will run the benchmark in release on both .NET 9 and .NET 10 and show the compared results. The other common variation, used when the benchmark should only be run on .NET 10 (typically because it&#8217;s comparing two approaches rather than comparing one thing on two versions), is the following:<\/p>\n<pre><code class=\"language-txt\">dotnet run -c Release -f net10.0 --filter \"*\"<\/code><\/pre>\n<p>Throughout the post, I&#8217;ve shown many benchmarks and the results I received from running them. Unless otherwise stated (e.g. 
because I&#8217;m demonstrating an OS-specific improvement), the results shown are from running them on Linux (Ubuntu 24.04.1) on an x64 processor.<\/p>\n<pre><code class=\"language-txt\">BenchmarkDotNet v0.15.2, Linux Ubuntu 24.04.1 LTS (Noble Numbat)\r\n11th Gen Intel Core i9-11950H 2.60GHz, 1 CPU, 16 logical and 8 physical cores\r\n.NET SDK 10.0.100-rc.1.25451.107\r\n  [Host]     : .NET 9.0.9 (9.0.925.41916), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI<\/code><\/pre>\n<p>As always, a quick disclaimer: these are micro-benchmarks, timing operations so short you&#8217;d miss them by blinking (but when such operations run millions of times, the savings really add up). The exact numbers you get will depend on your hardware, your operating system, what else your machine is juggling at the moment, how much coffee you&#8217;ve had since breakfast, and perhaps whether Mercury is in retrograde. In other words, don&#8217;t expect your results to match mine exactly, but I&#8217;ve picked tests that should still be reasonably reproducible in the real world.<\/p>\n<p>Now, let&#8217;s start at the bottom of the stack. Code generation.<\/p>\n<h2>JIT<\/h2>\n<p>Among all areas of .NET, the Just-In-Time (JIT) compiler stands out as one of the most impactful. Every .NET application, whether a small console tool or a large-scale enterprise service, ultimately relies on the JIT to turn intermediate language (IL) code into optimized machine code. Any enhancement to the JIT&#8217;s generated code quality has a ripple effect, improving performance across the entire ecosystem without requiring developers to change any of their own code or even recompile their C#. And with .NET 10, there\u2019s no shortage of these improvements.<\/p>\n<h3>Deabstraction<\/h3>\n<p>As with many languages, .NET historically has had an &#8220;abstraction penalty,&#8221; those extra allocations and indirections that can occur when using high-level language features like interfaces, iterators, and delegates. 
Each year, the JIT gets better and better at optimizing away layers of abstraction, so that developers get to write simple code and still get great performance. .NET 10 continues this tradition. The result is that idiomatic C# (using interfaces, <code>foreach<\/code> loops, lambdas, etc.) runs even closer to the raw speed of meticulously crafted and hand-tuned code.<\/p>\n<h4>Object Stack Allocation<\/h4>\n<p>One of the most exciting areas of deabstraction progress in .NET 10 is the expanded use of escape analysis to enable stack allocation of objects. Escape analysis is a compiler technique for determining whether an object allocated in a method escapes that method, that is, whether the object is reachable after the method returns (for example, by being stored in a field or returned to the caller) or is used in some way that the runtime can&#8217;t track within the method (like being passed to an unknown callee). If the compiler can prove an object doesn&#8217;t escape, then that object&#8217;s lifetime is bounded by the method, and it can be allocated on the stack instead of on the heap. Stack allocation is much cheaper (just pointer bumping for allocation and automatic freeing when the method exits) and reduces GC pressure because, well, the object doesn&#8217;t need to be tracked by the GC. .NET 9 had already introduced some limited escape analysis and stack allocation support; .NET 10 takes this significantly further.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115172\">dotnet\/runtime#115172<\/a> teaches the JIT how to perform escape analysis related to delegates, and in particular that a delegate&#8217;s <code>Invoke<\/code> method (which is implemented by the runtime) does not stash away the <code>this<\/code> reference. Then if escape analysis can prove that the delegate&#8217;s object reference is something that otherwise hasn&#8217;t escaped, the delegate can effectively evaporate. 
Consider this benchmark:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"y\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public int Sum(int y)\r\n    {\r\n        Func&lt;int, int&gt; addY = x =&gt; x + y;\r\n        return DoubleResult(addY, y);\r\n    }\r\n\r\n    private int DoubleResult(Func&lt;int, int&gt; func, int arg)\r\n    {\r\n        int result = func(arg);\r\n        return result + result;\r\n    }\r\n}<\/code><\/pre>\n<p>If we just run this benchmark and compare .NET 9 and .NET 10, we can immediately tell something interesting is happening.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">19.530 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">118 B<\/td>\n<td style=\"text-align: right;\">88 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">6.685 ns<\/td>\n<td style=\"text-align: right;\">0.34<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<td style=\"text-align: right;\">0.27<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The C# code for <code>Sum<\/code> belies complicated code generation by the C# compiler. 
It needs to create a <code>Func&lt;int, int&gt;<\/code>, which is &#8220;closing over&#8221; the <code>y<\/code> &#8220;local&#8221;. That means the compiler needs to &#8220;lift&#8221; <code>y<\/code> to no longer be an actual local, and instead live as a field on an object; the delegate can then point to a method on that object, giving it access to <code>y<\/code>. This is approximately what the IL generated by the C# compiler looks like when decompiled to C#:<\/p>\n<pre><code class=\"language-csharp\">public int Sum(int y)\r\n{\r\n    &lt;&gt;c__DisplayClass0_0 c = new();\r\n    c.y = y;\r\n\r\n    Func&lt;int, int&gt; func = new(c.&lt;Sum&gt;b__0);\r\n\r\n    return DoubleResult(func, c.y);\r\n}\r\n\r\nprivate sealed class &lt;&gt;c__DisplayClass0_0\r\n{\r\n    public int y;\r\n\r\n    internal int &lt;Sum&gt;b__0(int x) =&gt; x + y;\r\n}<\/code><\/pre>\n<p>From that, we can see the closure is resulting in two allocations: an allocation for the &#8220;display class&#8221; (what the C# compiler calls these closure types) and an allocation for the delegate that points to the <code>&lt;Sum&gt;b__0<\/code> method on that display class instance. That&#8217;s what&#8217;s accounting for the <code>88<\/code> bytes of allocation in the .NET 9 results: the display class is 24 bytes, and the delegate is 64 bytes. In the .NET 10 version, though, we only see a 24 byte allocation; that&#8217;s because the JIT has successfully elided the delegate allocation. 
Here is the resulting assembly code:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Sum(Int32)\r\n       push      rbp\r\n       push      r15\r\n       push      rbx\r\n       lea       rbp,[rsp+10]\r\n       mov       ebx,esi\r\n       mov       rdi,offset MT_Tests+&lt;&gt;c__DisplayClass0_0\r\n       call      CORINFO_HELP_NEWSFAST\r\n       mov       r15,rax\r\n       mov       [r15+8],ebx\r\n       mov       rdi,offset MT_System.Func&lt;System.Int32, System.Int32&gt;\r\n       call      CORINFO_HELP_NEWSFAST\r\n       mov       rbx,rax\r\n       lea       rdi,[rbx+8]\r\n       mov       rsi,r15\r\n       call      CORINFO_HELP_ASSIGN_REF\r\n       mov       rax,offset Tests+&lt;&gt;c__DisplayClass0_0.&lt;Sum&gt;b__0(Int32)\r\n       mov       [rbx+18],rax\r\n       mov       esi,[r15+8]\r\n       cmp       [rbx+18],rax\r\n       jne       short M00_L01\r\n       mov       rax,[rbx+8]\r\n       add       esi,[rax+8]\r\n       mov       eax,esi\r\nM00_L00:\r\n       add       eax,eax\r\n       pop       rbx\r\n       pop       r15\r\n       pop       rbp\r\n       ret\r\nM00_L01:\r\n       mov       rdi,[rbx+8]\r\n       call      qword ptr [rbx+18]\r\n       jmp       short M00_L00\r\n; Total bytes of code 112\r\n\r\n; .NET 10\r\n; Tests.Sum(Int32)\r\n       push      rbx\r\n       mov       ebx,esi\r\n       mov       rdi,offset MT_Tests+&lt;&gt;c__DisplayClass0_0\r\n       call      CORINFO_HELP_NEWSFAST\r\n       mov       [rax+8],ebx\r\n       mov       eax,[rax+8]\r\n       mov       ecx,eax\r\n       add       eax,ecx\r\n       add       eax,eax\r\n       pop       rbx\r\n       ret\r\n; Total bytes of code 32<\/code><\/pre>\n<p>In both .NET 9 and .NET 10, the JIT successfully inlined <code>DoubleResult<\/code>, such that the delegate doesn&#8217;t escape, but then in .NET 10, it&#8217;s able to stack allocate it. Woo hoo! 
There&#8217;s obviously still future opportunity here, as the JIT doesn&#8217;t elide the allocation of the closure object, but that should be addressable with some more effort, hopefully in the near future.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104906\">dotnet\/runtime#104906<\/a> from <a href=\"https:\/\/github.com\/hez2010\">@hez2010<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112250\">dotnet\/runtime#112250<\/a> extend this kind of analysis and stack allocation to arrays. How many times have you written code like this?<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public void Test()\r\n    {\r\n        Process(new string[] { \"a\", \"b\", \"c\" });\r\n\r\n        static void Process(string[] inputs)\r\n        {\r\n            foreach (string input in inputs)\r\n            {\r\n                Use(input);\r\n            }\r\n\r\n            [MethodImpl(MethodImplOptions.NoInlining)]\r\n            static void Use(string input) { }\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>Some method I want to call accepts an array of inputs and does something for each input. I need to allocate an array to pass my inputs in, either explicitly, or maybe implicitly due to using <code>params<\/code> or a collection expression. Ideally moving forward there would be an overload of such a <code>Process<\/code> method that accepted a <code>ReadOnlySpan&lt;string&gt;<\/code> instead of a <code>string[]<\/code>, and I could then avoid the allocation by construction. 
But for all of these cases where I&#8217;m forced to create an array, .NET 10 comes to the rescue.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">11.580 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">3.960 ns<\/td>\n<td style=\"text-align: right;\">0.34<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The JIT was able to inline <code>Process<\/code>, see that the array never leaves the frame, and stack allocate it.<\/p>\n<p>Of course, now that we&#8217;re able to stack allocate arrays, we also want to be able to deal with a common way those arrays are used: via spans. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113977\">dotnet\/runtime#113977<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116124\">dotnet\/runtime#116124<\/a> teach escape analysis to be able to reason about the fields in structs, which includes <code>Span&lt;T&gt;<\/code>, as it&#8217;s &#8220;just&#8221; a struct that stores a <code>ref T<\/code> field and an <code>int<\/code> length field.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _buffer = new byte[3];\r\n\r\n    [Benchmark]\r\n    public void Test() =&gt; Copy3Bytes(0x12345678, _buffer);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static void Copy3Bytes(int value, Span&lt;byte&gt; dest) =&gt;\r\n        BitConverter.GetBytes(value).AsSpan(0, 3).CopyTo(dest);\r\n}<\/code><\/pre>\n<p>Here, we&#8217;re using <code>BitConverter.GetBytes<\/code>, which allocates a <code>byte[]<\/code> containing the bytes from the input (in this case, it&#8217;ll be a four-byte array for the <code>int<\/code>), then we slice off three of the four bytes, and we copy them to the destination span.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">9.7717 ns<\/td>\n<td style=\"text-align: right;\">1.04<\/td>\n<td style=\"text-align: 
right;\">32 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">0.8718 ns<\/td>\n<td style=\"text-align: right;\">0.09<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In .NET 9, we get the 32-byte allocation we&#8217;d expect for the <code>byte[]<\/code> in <code>GetBytes<\/code> (every object on 64-bit is at least 24 bytes, which will include the four bytes for the array&#8217;s length, and then the four bytes for the data will be in slots 24-27, and the size will be padded up to the next word boundary, for 32). In .NET 10, with <code>GetBytes<\/code> and <code>AsSpan<\/code> inlined, the JIT can see that the array doesn&#8217;t escape, and a stack allocated version of it can be used to seed the span, just as if it were created from any other stack allocation (like <code>stackalloc<\/code>). (This case also needed a little help from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113093\">dotnet\/runtime#113093<\/a>, which taught the JIT that certain span operations, like the <code>Memmove<\/code> used internally by <code>CopyTo<\/code>, are non-escaping.)<\/p>\n<h4>Devirtualization<\/h4>\n<p>Interfaces and virtual methods are a critical aspect of .NET and the abstractions it enables. Being able to unwind these abstractions and &#8220;devirtualize&#8221; is then an important job for the JIT, which has taken notable leaps in capabilities here in .NET 10.<\/p>\n<p>While arrays are one of the most central features provided by C# and .NET, and while the JIT exerts a lot of energy and does a great job optimizing many aspects of arrays, one area in particular has caused it pain: an array&#8217;s interface implementations. 
The runtime manufactures a bunch of interface implementations for <code>T[]<\/code>, and because they&#8217;re implemented differently from literally every other interface implementation in .NET, the JIT hasn&#8217;t been able to apply the same devirtualization capabilities it&#8217;s applied elsewhere. And, for anyone who&#8217;s dived deep into micro-benchmarks, this can lead to some odd observations. Here&#8217;s a performance comparison between iterating over a <code>ReadOnlyCollection&lt;T&gt;<\/code> using a <code>foreach<\/code> loop (going through its enumerator) and using a <code>for<\/code> loop (indexing on each element).<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\"\r\n\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.ObjectModel;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private ReadOnlyCollection&lt;int&gt; _list = new(Enumerable.Range(1, 1000).ToArray());\r\n\r\n    [Benchmark]\r\n    public int SumEnumerable()\r\n    {\r\n        int sum = 0;\r\n        foreach (var item in _list)\r\n        {\r\n            sum += item;\r\n        }\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int SumForLoop()\r\n    {\r\n        ReadOnlyCollection&lt;int&gt; list = _list;\r\n        int sum = 0;\r\n        int count = list.Count;\r\n        for (int i = 0; i &lt; count; i++)\r\n        {\r\n            sum += _list[i];\r\n        }\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<p>When asked &#8220;which of these will be faster&#8221;, the obvious answer is &#8220;<code>SumForLoop<\/code>&#8221;. 
After all, <code>SumEnumerable<\/code> is going to allocate an enumerator and has to make twice the number of interface calls (<code>MoveNext<\/code>+<code>Current<\/code> per iteration vs <code>this[int]<\/code> per iteration). As it turns out, the obvious answer is also wrong. Here are the timings on my machine for .NET 9:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SumEnumerable<\/td>\n<td style=\"text-align: right;\">949.5 ns<\/td>\n<\/tr>\n<tr>\n<td>SumForLoop<\/td>\n<td style=\"text-align: right;\">1,932.7 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>What the what?? If I change the <code>ToArray<\/code> to instead be <code>ToList<\/code>, however, the numbers are much more in line with our expectations.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SumEnumerable<\/td>\n<td style=\"text-align: right;\">1,542.0 ns<\/td>\n<\/tr>\n<tr>\n<td>SumForLoop<\/td>\n<td style=\"text-align: right;\">894.1 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So what&#8217;s going on here? It&#8217;s super subtle. First, it&#8217;s necessary to know that <code>ReadOnlyCollection&lt;T&gt;<\/code> just wraps an arbitrary <code>IList&lt;T&gt;<\/code>, the <code>ReadOnlyCollection&lt;T&gt;<\/code>&#8217;s <code>GetEnumerator()<\/code> returns <code>_list.GetEnumerator()<\/code> (I&#8217;m ignoring for this discussion the special case where the list is empty), and <code>ReadOnlyCollection&lt;T&gt;<\/code>&#8217;s indexer just calls the <code>IList&lt;T&gt;<\/code>&#8217;s indexer. So far presumably this all sounds like what you&#8217;d expect. But where things get interesting is around what the JIT is able to devirtualize. 
In .NET 9, it struggles to devirtualize calls to the interface implementations specifically on <code>T[]<\/code>, so it won&#8217;t devirtualize either the <code>_list.GetEnumerator()<\/code> call or the <code>_list[index]<\/code> call. However, the enumerator that&#8217;s returned is just a normal type that implements <code>IEnumerator&lt;T&gt;<\/code>, and the JIT has no problem devirtualizing its <code>MoveNext<\/code> and <code>Current<\/code> members. Which means that we&#8217;re actually paying a lot more going through the indexer, because for <code>N<\/code> elements, we&#8217;re having to make <code>N<\/code> interface calls, whereas with the enumerator, we only need the one <code>GetEnumerator<\/code> interface call and then no more after that.<\/p>\n<p>Thankfully, this is now addressed in .NET 10. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108153\">dotnet\/runtime#108153<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109209\">dotnet\/runtime#109209<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109237\">dotnet\/runtime#109237<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116771\">dotnet\/runtime#116771<\/a> all make it possible for the JIT to devirtualize an array&#8217;s interface method implementations. 
Now when we run the same benchmark (reverted to using <code>ToArray<\/code>), we get results much more in line with our expectations, with both benchmarks improving from .NET 9 to .NET 10, and with <code>SumForLoop<\/code> on .NET 10 being the fastest.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SumEnumerable<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">968.5 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SumEnumerable<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">775.5 ns<\/td>\n<td style=\"text-align: right;\">0.80<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>SumForLoop<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,960.5 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SumForLoop<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">624.6 ns<\/td>\n<td style=\"text-align: right;\">0.32<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One of the really interesting things about this is how many libraries are implemented on the premise that it&#8217;s faster to use an <code>IList&lt;T&gt;<\/code>&#8217;s indexer for iteration than it is to use its <code>IEnumerable&lt;T&gt;<\/code> implementation, and that includes <code>System.Linq<\/code>. 
All these years, LINQ has had specialized code paths for working with <code>IList&lt;T&gt;<\/code> when possible. In many cases that&#8217;s been a welcome optimization, but in <em>some<\/em> cases (such as when the concrete type is a <code>ReadOnlyCollection&lt;T&gt;<\/code> wrapping an array), it&#8217;s actually been a deoptimization.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.ObjectModel;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private ReadOnlyCollection&lt;int&gt; _list = new(Enumerable.Range(1, 1000).ToArray());\r\n\r\n    [Benchmark]\r\n    public int SkipTakeSum() =&gt; _list.Skip(100).Take(800).Sum();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SkipTakeSum<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">3.525 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SkipTakeSum<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.773 us<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Fixing devirtualization for arrays&#8217; interface implementations then also has a transitive effect on LINQ.<\/p>\n<p>Guarded Devirtualization (GDV) is also improved in .NET 10, such as by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116453\">dotnet\/runtime#116453<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109256\">dotnet\/runtime#109256<\/a>. 
With <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/#tiering-and-dynamic-pgo\">dynamic PGO<\/a>, the JIT is able to instrument a method&#8217;s compilation and then use the resulting profiling data as part of emitting an optimized version of the method. One of the things it can profile is which types are used in a virtual dispatch. If one type dominates, it can special-case that type in the code gen and emit a customized implementation specific to that type. That then enables devirtualization in that dedicated path, which is &#8220;guarded&#8221; by the relevant type check, hence &#8220;GDV&#8221;. In some cases, however, such as if a virtual call was being made in a shared generic context, GDV would not kick in. Now it will.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public bool Test() =&gt; GenericEquals(\"abc\", \"abc\");\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static bool GenericEquals&lt;T&gt;(T a, T b) =&gt; EqualityComparer&lt;T&gt;.Default.Equals(a, b);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">2.816 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.511 ns<\/td>\n<td style=\"text-align: right;\">0.54<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110827\">dotnet\/runtime#110827<\/a> from <a href=\"https:\/\/github.com\/hez2010\">@hez2010<\/a> also helps more methods to be inlined by doing another pass looking for opportunities after later phases of devirtualization. The JIT&#8217;s optimizations are split up into multiple phases; each phase can make improvements, and those improvements can expose additional opportunities. If those opportunities would only be capitalized on by a phase that already ran, they can be missed. But for phases that are relatively cheap to perform, such as doing a pass looking for additional inlining opportunities, those phases can be repeated once enough other optimization has happened that it&#8217;s likely productive to do so again.<\/p>\n<h3>Bounds Checking<\/h3>\n<p>C# is a memory-safe language, an important aspect of modern programming languages. A key component of this is the inability to walk off the beginning or end of an array, string, or span. The runtime ensures that any such invalid attempt produces an exception, rather than being allowed to perform the invalid memory access. We can see what this looks like with a small benchmark:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _array = new int[3];\r\n\r\n    [Benchmark]\r\n    public int Read() =&gt; _array[2];\r\n}<\/code><\/pre>\n<p>This is a valid access: the <code>_array<\/code> contains three elements, and the <code>Read<\/code> method is reading its last element. 
However, the JIT can&#8217;t be 100% certain that this access is in bounds (something could have changed what&#8217;s in the <code>_array<\/code> field to be a shorter array), and thus it needs to emit a check to ensure we&#8217;re not walking off the end of the array. Here&#8217;s what the generated assembly code for <code>Read<\/code> looks like:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Read()\r\n       push      rax\r\n       mov       rax,[rdi+8]\r\n       cmp       dword ptr [rax+8],2\r\n       jbe       short M00_L00\r\n       mov       eax,[rax+18]\r\n       add       rsp,8\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 25<\/code><\/pre>\n<p>The <code>this<\/code> reference is passed into the <code>Read<\/code> instance method in the <code>rdi<\/code> register, and the <code>_array<\/code> field is at offset 8, so the <code>mov rax,[rdi+8]<\/code> instruction is loading the address of the array into the <code>rax<\/code> register. Then the <code>cmp<\/code> is loading the value at offset 8 from that address; it so happens that&#8217;s where the length of the array is stored in the array object. So, this <code>cmp<\/code> instruction is the bounds check; it&#8217;s comparing <code>2<\/code> against that length to ensure it&#8217;s in bounds. If the array were too short for this access, the next <code>jbe<\/code> instruction would branch to the <code>M00_L00<\/code> label, which calls the <code>CORINFO_HELP_RNGCHKFAIL<\/code> helper function that throws an <code>IndexOutOfRangeException<\/code>. 
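<\/p>\n<p>In C# terms, the <code>cmp<\/code>\/<code>jbe<\/code> pair behaves roughly like the following sketch. It&#8217;s illustrative only, not the runtime&#8217;s actual implementation, and <code>ReadChecked<\/code> is a hypothetical helper name introduced just for this example:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nConsole.WriteLine(ReadChecked(new int[3], 2)); \/\/ index 2 is valid for a three-element array\r\n\r\n\/\/ Hypothetical helper mirroring the check the JIT emits before the load.\r\nstatic int ReadChecked(int[] array, int index)\r\n{\r\n    \/\/ A single unsigned comparison covers both failure modes: a negative index\r\n    \/\/ reinterpreted as uint becomes a value larger than any possible Length.\r\n    if ((uint)index &gt;= (uint)array.Length)\r\n    {\r\n        throw new IndexOutOfRangeException();\r\n    }\r\n\r\n    return array[index];\r\n}<\/code><\/pre>\n<p>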
Any time you see this pair of <code>call CORINFO_HELP_RNGCHKFAIL<\/code>\/<code>int 3<\/code> at the end of a method, there was at least one bounds check emitted by the JIT in that method.<\/p>\n<p>Of course, we not only want safety, we also want great performance, and it&#8217;d be terrible for performance if every single read from an array (or string or span) incurred such an additional check. As such, the JIT strives to avoid emitting these checks when they&#8217;d be redundant, when it can prove by construction that the accesses are safe. For example, let me tweak my benchmark slightly, moving the array from an instance field into a <code>static readonly<\/code> field:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly int[] s_array = new int[3];\r\n\r\n    [Benchmark]\r\n    public int Read() =&gt; s_array[2];\r\n}<\/code><\/pre>\n<p>We now get this assembly:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Read()\r\n       mov       rax,705D5419FA20\r\n       mov       eax,[rax+18]\r\n       ret\r\n; Total bytes of code 14<\/code><\/pre>\n<p>The <code>static readonly<\/code> field is immutable, arrays can&#8217;t be resized, and the JIT can guarantee that the field is initialized prior to generating the code for <code>Read<\/code>. Therefore, when generating the code for <code>Read<\/code>, it can know with certainty that the array is of length three, and we&#8217;re accessing the element at index two. Therefore, the specified array index is guaranteed to be within bounds, and there&#8217;s no need for a bounds check. 
We simply get two <code>mov<\/code>s, the first <code>mov<\/code> to load the address of the array (which, thanks to improvements in previous releases, is allocated on a heap that doesn&#8217;t need to be compacted such that the array lives at a fixed address), and the second <code>mov<\/code> to read the <code>int<\/code> value at the location of index two (these are <code>int<\/code>s, so index two lives <code>2 * sizeof(int) = 8<\/code> bytes from the start of the array&#8217;s data, which itself on 64-bit is offset 16 bytes from the start of the array reference, for a total offset of 24 bytes, or in hex 0x18, hence the <code>rax+18<\/code> in the disassembly).<\/p>\n<p>With every release of .NET, more and more opportunities are found and implemented to eschew bounds checks that were previously being generated. .NET 10 continues this trend.<\/p>\n<p>Our first example comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109900\">dotnet\/runtime#109900<\/a>, which was inspired by the implementation of <code>BitOperations.Log2<\/code>. The operation has intrinsic hardware support on many architectures, and generally <code>BitOperations.Log2<\/code> will use one of the hardware intrinsics available to it for a very efficient implementation (e.g. <code>Lzcnt.LeadingZeroCount<\/code>, <code>ArmBase.LeadingZeroCount<\/code>, or <code>X86Base.BitScanReverse<\/code>); as a fallback implementation, however, it uses a lookup table. The lookup table has 32 elements, and the operation involves computing a <code>uint<\/code> value and then shifting it down by 27 in order to get the top 5 bits. 
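<\/p>\n<p>To make the range argument concrete, here&#8217;s a small sketch showing that shifting any 32-bit value right by 27 leaves only 5 bits. The sample inputs are arbitrary and chosen purely for illustration:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\n\/\/ Shifting a uint right by 27 keeps only its top 5 bits,\r\n\/\/ so the result always falls in the range [0, 31].\r\nforeach (uint value in new uint[] { 0u, 42u, 0x07C4ACDDu, uint.MaxValue })\r\n{\r\n    uint index = value &gt;&gt; 27;\r\n    Console.WriteLine($\"{value,10} &gt;&gt; 27 = {index}\");\r\n}<\/code><\/pre>\n<p>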
Any possible result is guaranteed to be a non-negative number less than 32, but indexing into the span with that result still produced a bounds check, and, as this is a critical path, &#8220;unsafe&#8221; code (meaning code that eschews the guardrails the runtime supplies by default) was then used to avoid the bounds check.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"value\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public int Log2SoftwareFallback2(uint value)\r\n    {\r\n        ReadOnlySpan&lt;byte&gt; Log2DeBruijn =\r\n        [\r\n            00, 09, 01, 10, 13, 21, 02, 29,\r\n            11, 14, 16, 18, 22, 25, 03, 30,\r\n            08, 12, 20, 28, 15, 17, 24, 07,\r\n            19, 27, 23, 06, 26, 05, 04, 31\r\n        ];\r\n\r\n        value |= value &gt;&gt; 01;\r\n        value |= value &gt;&gt; 02;\r\n        value |= value &gt;&gt; 04;\r\n        value |= value &gt;&gt; 08;\r\n        value |= value &gt;&gt; 16;\r\n\r\n        return Log2DeBruijn[(int)((value * 0x07C4ACDDu) &gt;&gt; 27)];\r\n    }\r\n}<\/code><\/pre>\n<p>Now in .NET 10, the bounds check is gone (note the presence of the <code>call CORINFO_HELP_RNGCHKFAIL<\/code> in the .NET 9 assembly and the lack of it in the .NET 10 assembly).<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Log2SoftwareFallback2(UInt32)\r\n       push      rax\r\n       mov       eax,esi\r\n       shr       eax,1\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,2\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,4\r\n       or        esi,eax\r\n       mov       eax,esi\r\n     
  shr       eax,8\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,10\r\n       or        eax,esi\r\n       imul      eax,7C4ACDD\r\n       shr       eax,1B\r\n       cmp       eax,20\r\n       jae       short M00_L00\r\n       mov       rcx,7913CA812E10\r\n       movzx     eax,byte ptr [rax+rcx]\r\n       add       rsp,8\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 74\r\n\r\n; .NET 10\r\n; Tests.Log2SoftwareFallback2(UInt32)\r\n       mov       eax,esi\r\n       shr       eax,1\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,2\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,4\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,8\r\n       or        esi,eax\r\n       mov       eax,esi\r\n       shr       eax,10\r\n       or        eax,esi\r\n       imul      eax,7C4ACDD\r\n       shr       eax,1B\r\n       mov       rcx,7CA298325E10\r\n       movzx     eax,byte ptr [rcx+rax]\r\n       ret\r\n; Total bytes of code 58<\/code><\/pre>\n<p>This improvement then enabled <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118560\">dotnet\/runtime#118560<\/a> to simplify the code in the real <code>Log2SoftwareFallback<\/code>, avoiding manual use of unsafe constructs.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113790\">dotnet\/runtime#113790<\/a> implements a similar case, where the result of a mathematical operation is guaranteed to be in bounds. In this case, it&#8217;s the result of <code>Log2<\/code>. 
The change teaches the JIT to understand the maximum possible value that <code>Log2<\/code> could produce, and if that maximum is in bounds, then any result is guaranteed to be in bounds as well.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"value\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(12345)]\r\n    public nint CountDigits(ulong value)\r\n    {\r\n        ReadOnlySpan&lt;byte&gt; log2ToPow10 =\r\n        [\r\n            1,  1,  1,  2,  2,  2,  3,  3,  3,  4,  4,  4,  4,  5,  5,  5,\r\n            6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  9,  9,  9,  10, 10, 10,\r\n            10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 15, 15,\r\n            15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 19, 20\r\n        ];\r\n\r\n        return log2ToPow10[(int)ulong.Log2(value)];\r\n    }\r\n}<\/code><\/pre>\n<p>We can see the bounds check present in the .NET 9 output and absent in the .NET 10 output:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.CountDigits(UInt64)\r\n       push      rax\r\n       or        rsi,1\r\n       xor       eax,eax\r\n       lzcnt     rax,rsi\r\n       xor       eax,3F\r\n       cmp       eax,40\r\n       jae       short M00_L00\r\n       mov       rcx,7C2D0A213DF8\r\n       movzx     eax,byte ptr [rax+rcx]\r\n       add       rsp,8\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 45\r\n\r\n; .NET 10\r\n; Tests.CountDigits(UInt64)\r\n       or        rsi,1\r\n       xor       eax,eax\r\n       lzcnt     rax,rsi\r\n       xor       eax,3F\r\n       mov       rcx,71EFA9400DF8\r\n       movzx 
    eax,byte ptr [rcx+rax]\r\n       ret\r\n; Total bytes of code 29<\/code><\/pre>\n<p>My choice of benchmark in this case was not coincidental. This pattern shows up in the <code>FormattingHelpers.CountDigits<\/code> internal method that&#8217;s used by the core primitive types in their <code>ToString<\/code> and <code>TryFormat<\/code> implementations, in order to determine how much space will be needed to store rendered digits for a number. As with the previous example, this routine is considered core enough that it was using unsafe code to avoid the bounds check. With this fix, the code was able to be changed back to using a simple span access, and even with the simpler code, it&#8217;s now also faster.<\/p>\n<p>Now, consider this code:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"ids\")]\r\npublic partial class Tests\r\n{\r\n    public IEnumerable&lt;int[]&gt; Ids { get; } = [[1, 2, 3, 4, 5, 1]];\r\n\r\n    [Benchmark]\r\n    [ArgumentsSource(nameof(Ids))]\r\n    public bool StartAndEndAreSame(int[] ids) =&gt; ids[0] == ids[^1];\r\n}<\/code><\/pre>\n<p>I have a method that&#8217;s accepting an <code>int[]<\/code> and checking to see whether it starts and ends with the same value. The JIT has no way of knowing whether the <code>int[]<\/code> is empty or not, so it <em>does<\/em> need a bounds check; otherwise, accessing <code>ids[0]<\/code> could walk off the end of the array. 
However, this is what we see on .NET 9:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.StartAndEndAreSame(Int32[])\r\n       push      rax\r\n       mov       eax,[rsi+8]\r\n       test      eax,eax\r\n       je        short M00_L00\r\n       mov       ecx,[rsi+10]\r\n       lea       edx,[rax-1]\r\n       cmp       edx,eax\r\n       jae       short M00_L00\r\n       mov       eax,edx\r\n       cmp       ecx,[rsi+rax*4+10]\r\n       sete      al\r\n       movzx     eax,al\r\n       add       rsp,8\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 41<\/code><\/pre>\n<p>Note there are two jumps to the <code>M00_L00<\/code> label that handles failed bounds checks&#8230; that&#8217;s because there are two bounds checks here, one for the start access and one for the end access. But that shouldn&#8217;t be necessary. <code>ids[^1]<\/code> is the same as <code>ids[ids.Length - 1]<\/code>. If the code has successfully accessed <code>ids[0]<\/code>, that means the array is at least one element in length, and if it&#8217;s at least one element in length, <code>ids[ids.Length - 1]<\/code> will always be in bounds. Thus, the second bounds check shouldn&#8217;t be needed. 
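<\/p>\n<p>Spelled out in C#, the reasoning looks something like this. It&#8217;s only a sketch of the logic, not what the JIT literally generates, and <code>StartAndEndAreSameExplicit<\/code> is a hypothetical name used for illustration:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nConsole.WriteLine(StartAndEndAreSameExplicit([1, 2, 3, 4, 5, 1]));\r\n\r\n\/\/ Hypothetical hand-written equivalent of ids[0] == ids[^1].\r\nstatic bool StartAndEndAreSameExplicit(int[] ids)\r\n{\r\n    \/\/ The only check that's logically required: the array must be non-empty.\r\n    if (ids.Length == 0)\r\n    {\r\n        throw new IndexOutOfRangeException();\r\n    }\r\n\r\n    int first = ids[0];\r\n\r\n    \/\/ Length is at least 1 here, so Length - 1 is always a valid index.\r\n    int last = ids[ids.Length - 1];\r\n\r\n    return first == last;\r\n}<\/code><\/pre>\n<p>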
Indeed, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116105\">dotnet\/runtime#116105<\/a>, this is what we now get on .NET 10 (one branch to <code>M00_L00<\/code> instead of two):<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.StartAndEndAreSame(Int32[])\r\n       push      rax\r\n       mov       eax,[rsi+8]\r\n       test      eax,eax\r\n       je        short M00_L00\r\n       mov       ecx,[rsi+10]\r\n       dec       eax\r\n       cmp       ecx,[rsi+rax*4+10]\r\n       sete      al\r\n       movzx     eax,al\r\n       add       rsp,8\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 34<\/code><\/pre>\n<p>What&#8217;s really interesting to me here is the knock-on effect of having removed the bounds check. It didn&#8217;t just eliminate the <code>cmp\/jae<\/code> pair of instructions that&#8217;s typical of a bounds check. The .NET 9 version of the code had this:<\/p>\n<pre><code class=\"language-x86asm\">lea edx,[rax-1]\r\ncmp edx,eax\r\njae short M00_L00\r\nmov eax,edx<\/code><\/pre>\n<p>At this point in the assembly, the <code>rax<\/code> register is storing the length of the array. It&#8217;s calculating <code>ids.Length - 1<\/code> and storing the result into <code>edx<\/code>, and then checking to see whether <code>ids.Length-1<\/code> is in bounds of <code>ids.Length<\/code> (the only way it wouldn&#8217;t be is if the array were empty such that <code>ids.Length-1<\/code> wrapped around to <code>uint.MaxValue<\/code>); if it&#8217;s not, it jumps to the fail handler, and if it is, it stores the already computed <code>ids.Length - 1<\/code> into <code>eax<\/code>. 
By removing the bounds check, we get rid of those two intervening instructions, leaving these:<\/p>\n<pre><code class=\"language-x86asm\">lea edx,[rax-1]\r\nmov eax,edx<\/code><\/pre>\n<p>which is a little silly, as this sequence is just computing a decrement, and as long as it&#8217;s ok that flags get modified, it could instead just be:<\/p>\n<pre><code class=\"language-x86asm\">dec eax<\/code><\/pre>\n<p>which, as you can see in the .NET 10 output, is exactly what .NET 10 now does.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115980\">dotnet\/runtime#115980<\/a> addresses another case. Let&#8217;s say I have this method:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"start\", \"text\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(\"abc\", \"abc.\")]\r\n    public bool IsFollowedByPeriod(string start, string text) =&gt;\r\n        start.Length &lt; text.Length &amp;&amp; text[start.Length] == '.';\r\n}<\/code><\/pre>\n<p>We&#8217;re validating that one input&#8217;s length is less than the other, and then checking to see what comes immediately after it in the other. 
We know that <code>string.Length<\/code> is immutable, so a bounds check here is redundant, but until .NET 10, the JIT couldn&#8217;t see that.<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.IsFollowedByPeriod(System.String, System.String)\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       eax,[rsi+8]\r\n       mov       ecx,[rdx+8]\r\n       cmp       eax,ecx\r\n       jge       short M00_L00\r\n       cmp       eax,ecx\r\n       jae       short M00_L01\r\n       cmp       word ptr [rdx+rax*2+0C],2E\r\n       sete      al\r\n       movzx     eax,al\r\n       pop       rbp\r\n       ret\r\nM00_L00:\r\n       xor       eax,eax\r\n       pop       rbp\r\n       ret\r\nM00_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 42\r\n\r\n; .NET 10\r\n; Tests.IsFollowedByPeriod(System.String, System.String)\r\n       mov       eax,[rsi+8]\r\n       mov       ecx,[rdx+8]\r\n       cmp       eax,ecx\r\n       jge       short M00_L00\r\n       cmp       word ptr [rdx+rax*2+0C],2E\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\nM00_L00:\r\n       xor       eax,eax\r\n       ret\r\n; Total bytes of code 26<\/code><\/pre>\n<p>The removal of the bounds check almost halves the size of the function. If we don&#8217;t need to do a bounds check, we get to elide the <code>cmp\/jae<\/code>. Without that branch, nothing is targeting <code>M00_L01<\/code>, and we can remove the <code>call\/int<\/code> pair that were only necessary to support a bounds check. 
Then without the <code>call<\/code> in <code>M00_L01<\/code>, which was the only <code>call<\/code> in the whole method, the prologue and epilogue can be elided, meaning we also don&#8217;t need the opening and closing <code>push<\/code> and <code>pop<\/code> instructions.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113233\">dotnet\/runtime#113233<\/a> improved the handling of &#8220;assertions&#8221; (facts the JIT has established and on which it bases optimizations) to be less order dependent. In .NET 9, this code:<\/p>\n<pre><code class=\"language-csharp\">static bool Test(ReadOnlySpan&lt;char&gt; span, int pos) =&gt;\r\n    pos &gt; 0 &amp;&amp;\r\n    pos &lt;= span.Length - 42 &amp;&amp;\r\n    span[pos - 1] != '\\n';<\/code><\/pre>\n<p>was successfully removing the bounds check on the span access, but the following variant, which just switches the order of the first two conditions, was still incurring the bounds check.<\/p>\n<pre><code class=\"language-csharp\">static bool Test(ReadOnlySpan&lt;char&gt; span, int pos) =&gt;\r\n    pos &lt;= span.Length - 42 &amp;&amp;\r\n    pos &gt; 0 &amp;&amp;\r\n    span[pos - 1] != '\\n';<\/code><\/pre>\n<p>Note that each condition contributes an assertion (fact), and the two need to be merged in order to know the bounds check can be avoided. 
Now in .NET 10, the bounds check is elided, regardless of the order.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _s = new string('s', 100);\r\n    private int _pos = 10;\r\n\r\n    [Benchmark]\r\n    public bool Test()\r\n    {\r\n        string s = _s;\r\n        int pos = _pos;\r\n        return\r\n            pos &lt;= s.Length - 42 &amp;&amp;\r\n            pos &gt; 0 &amp;&amp;\r\n            s[pos - 1] != '\\n';\r\n    }\r\n}<\/code><\/pre>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       mov       ecx,[rdi+10]\r\n       mov       edx,[rax+8]\r\n       lea       edi,[rdx-2A]\r\n       cmp       edi,ecx\r\n       jl        short M00_L00\r\n       test      ecx,ecx\r\n       jle       short M00_L00\r\n       dec       ecx\r\n       cmp       ecx,edx\r\n       jae       short M00_L01\r\n       cmp       word ptr [rax+rcx*2+0C],0A\r\n       setne     al\r\n       movzx     eax,al\r\n       pop       rbp\r\n       ret\r\nM00_L00:\r\n       xor       eax,eax\r\n       pop       rbp\r\n       ret\r\nM00_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 55\r\n\r\n; .NET 10\r\n; Tests.Test()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       mov       ecx,[rdi+10]\r\n       mov       edx,[rax+8]\r\n       add       edx,0FFFFFFD6\r\n       cmp       edx,ecx\r\n       jl        short M00_L00\r\n       test      ecx,ecx\r\n       jle       short M00_L00\r\n       dec       ecx\r\n       cmp       
word ptr [rax+rcx*2+0C],0A\r\n       setne     al\r\n       movzx     eax,al\r\n       pop       rbp\r\n       ret\r\nM00_L00:\r\n       xor       eax,eax\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 45<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113862\">dotnet\/runtime#113862<\/a> addresses a similar case where assertions weren&#8217;t being handled as precisely as they could have been. Consider this code:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _arr = Enumerable.Range(0, 10).ToArray();\r\n\r\n    [Benchmark]\r\n    public int Sum()\r\n    {\r\n        int[] arr = _arr;\r\n        int sum = 0;\r\n\r\n        int i;\r\n        for (i = 0; i &lt; arr.Length - 3; i += 4)\r\n        {\r\n            sum += arr[i + 0];\r\n            sum += arr[i + 1];\r\n            sum += arr[i + 2];\r\n            sum += arr[i + 3];\r\n        }\r\n\r\n        for (; i &lt; arr.Length; i++)\r\n        {\r\n            sum += arr[i];\r\n        }\r\n\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<p>The <code>Sum<\/code> method is trying to do manual loop unrolling. Rather than incurring a branch on each element, it&#8217;s handling four elements per iteration. Then, for the case where the length of the input isn&#8217;t evenly divisible by four, it&#8217;s handling the remaining elements in a separate loop. 
In .NET 9, the JIT successfully elides the bounds checks in the main unrolled loop:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Sum()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       xor       ecx,ecx\r\n       xor       edx,edx\r\n       mov       edi,[rax+8]\r\n       lea       esi,[rdi-3]\r\n       test      esi,esi\r\n       jle       short M00_L02\r\nM00_L00:\r\n       mov       r8d,edx\r\n       add       ecx,[rax+r8*4+10]\r\n       lea       r8d,[rdx+1]\r\n       add       ecx,[rax+r8*4+10]\r\n       lea       r8d,[rdx+2]\r\n       add       ecx,[rax+r8*4+10]\r\n       lea       r8d,[rdx+3]\r\n       add       ecx,[rax+r8*4+10]\r\n       add       edx,4\r\n       cmp       esi,edx\r\n       jg        short M00_L00\r\n       jmp       short M00_L02\r\nM00_L01:\r\n       cmp       edx,edi\r\n       jae       short M00_L03\r\n       mov       esi,edx\r\n       add       ecx,[rax+rsi*4+10]\r\n       inc       edx\r\nM00_L02:\r\n       cmp       edi,edx\r\n       jg        short M00_L01\r\n       mov       eax,ecx\r\n       pop       rbp\r\n       ret\r\nM00_L03:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 92<\/code><\/pre>\n<p>You can see this in the <code>M00_L00<\/code> section, which has the five <code>add<\/code> instructions (four for the summed elements, and one for the index). However, we still see the <code>CORINFO_HELP_RNGCHKFAIL<\/code> at the end, indicating this method has a bounds check. That&#8217;s coming from the final loop, due to the JIT losing track of the fact that <code>i<\/code> is guaranteed to be non-negative. 
Now in .NET 10, that bounds check is removed as well (again, just look for the lack of the <code>CORINFO_HELP_RNGCHKFAIL<\/code> call).<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Sum()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       xor       ecx,ecx\r\n       xor       edx,edx\r\n       mov       edi,[rax+8]\r\n       lea       esi,[rdi-3]\r\n       test      esi,esi\r\n       jle       short M00_L01\r\nM00_L00:\r\n       mov       r8d,edx\r\n       add       ecx,[rax+r8*4+10]\r\n       lea       r8d,[rdx+1]\r\n       add       ecx,[rax+r8*4+10]\r\n       lea       r8d,[rdx+2]\r\n       add       ecx,[rax+r8*4+10]\r\n       lea       r8d,[rdx+3]\r\n       add       ecx,[rax+r8*4+10]\r\n       add       edx,4\r\n       cmp       esi,edx\r\n       jg        short M00_L00\r\nM00_L01:\r\n       cmp       edi,edx\r\n       jle       short M00_L03\r\n       test      edx,edx\r\n       jl        short M00_L04\r\nM00_L02:\r\n       mov       esi,edx\r\n       add       ecx,[rax+rsi*4+10]\r\n       inc       edx\r\n       cmp       edi,edx\r\n       jg        short M00_L02\r\nM00_L03:\r\n       mov       eax,ecx\r\n       pop       rbp\r\n       ret\r\nM00_L04:\r\n       mov       esi,edx\r\n       add       ecx,[rax+rsi*4+10]\r\n       inc       edx\r\n       cmp       edi,edx\r\n       jg        short M00_L04\r\n       jmp       short M00_L03\r\n; Total bytes of code 102<\/code><\/pre>\n<p>Another nice improvement comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112824\">dotnet\/runtime#112824<\/a>, which teaches the JIT to turn facts it already learned from earlier checks into concrete numeric ranges, and then use those ranges to fold away later relational tests and bounds checks. 
Consider this example:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _array = new int[10];\r\n\r\n    [Benchmark]\r\n    public void Test() =&gt; SetAndSlice(_array);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static Span&lt;int&gt; SetAndSlice(Span&lt;int&gt; src)\r\n    {\r\n        src[5] = 42;\r\n        return src.Slice(4);\r\n    }\r\n}<\/code><\/pre>\n<p>We have to incur a bounds check for the <code>src[5]<\/code>, as the JIT has no evidence that <code>src<\/code> is at least six elements long. However, by the time we get to the <code>Slice<\/code> call, we know the span has a length of at least six, or else writing into <code>src[5]<\/code> would have failed. 
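<\/p>\n<p>To see why that matters, it helps to write out the kind of argument validation <code>Slice<\/code> performs. The following is a simplified sketch under my own assumptions rather than the actual <code>Span&lt;T&gt;<\/code> source, and <code>SetAndSliceExplicit<\/code> is a hypothetical name:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nint[] data = new int[10];\r\nSpan&lt;int&gt; tail = SetAndSliceExplicit(data);\r\nConsole.WriteLine(tail.Length);\r\n\r\nstatic Span&lt;int&gt; SetAndSliceExplicit(Span&lt;int&gt; src)\r\n{\r\n    src[5] = 42; \/\/ throws unless src.Length is at least 6\r\n\r\n    \/\/ Hypothetical inline copy of the validation a Slice(start) call implies.\r\n    \/\/ With src.Length already known to be at least 6, this condition can never be true.\r\n    const int start = 4;\r\n    if ((uint)start &gt; (uint)src.Length)\r\n    {\r\n        throw new ArgumentOutOfRangeException(nameof(start));\r\n    }\r\n\r\n    return src.Slice(start);\r\n}<\/code><\/pre>\n<p>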
We can use that knowledge to remove the length check from within the <code>Slice<\/code> call (note the removal of the <code>call qword ptr [7F8DDB3A7810]<\/code>\/<code>int 3<\/code> sequence, which is the manual length check and call to a throw helper method in <code>Slice<\/code>).<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.SetAndSlice(System.Span`1&lt;Int32&gt;)\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       cmp       esi,5\r\n       jbe       short M01_L01\r\n       mov       dword ptr [rdi+14],2A\r\n       cmp       esi,4\r\n       jb        short M01_L00\r\n       add       rdi,10\r\n       mov       rax,rdi\r\n       add       esi,0FFFFFFFC\r\n       mov       edx,esi\r\n       pop       rbp\r\n       ret\r\nM01_L00:\r\n       call      qword ptr [7F8DDB3A7810]\r\n       int       3\r\nM01_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 48\r\n\r\n; .NET 10\r\n; Tests.SetAndSlice(System.Span`1&lt;Int32&gt;)\r\n       push      rax\r\n       cmp       esi,5\r\n       jbe       short M01_L00\r\n       mov       dword ptr [rdi+14],2A\r\n       lea       rax,[rdi+10]\r\n       lea       edx,[rsi-4]\r\n       add       rsp,8\r\n       ret\r\nM01_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 31<\/code><\/pre>\n<p>Let&#8217;s look at one more, which has a very nice impact on bounds checking, even though technically the optimization is broader than just that. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113998\">dotnet\/runtime#113998<\/a> creates assertions from <code>switch<\/code> targets. This means that the body of a <code>switch<\/code> case statement inherits facts about what was switched over based on what the <code>case<\/code> was, e.g. in a <code>case 3<\/code> for <code>switch (x)<\/code>, the body of that case will now &#8220;know&#8221; that <code>x<\/code> is three. 
This is great for very popular patterns with arrays, strings, and spans, where developers switch over the length and then index into available indices in the appropriate branches. Consider this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _array = [1, 2];\r\n\r\n    [Benchmark]\r\n    public int SumArray() =&gt; Sum(_array);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    public int Sum(ReadOnlySpan&lt;int&gt; span)\r\n    {\r\n        switch (span.Length)\r\n        {\r\n            case 0: return 0;\r\n            case 1: return span[0];\r\n            case 2: return span[0] + span[1];\r\n            case 3: return span[0] + span[1] + span[2];\r\n            default: return -1;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>On .NET 9, each of those six <code>span<\/code> dereferences ends up with a bounds check:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Sum(System.ReadOnlySpan`1&lt;Int32&gt;)\r\n       push      rbp\r\n       mov       rbp,rsp\r\nM01_L00:\r\n       cmp       edx,2\r\n       jne       short M01_L02\r\n       test      edx,edx\r\n       je        short M01_L04\r\n       mov       eax,[rsi]\r\n       cmp       edx,1\r\n       jbe       short M01_L04\r\n       add       eax,[rsi+4]\r\nM01_L01:\r\n       pop       rbp\r\n       ret\r\nM01_L02:\r\n       cmp       edx,3\r\n       ja        short M01_L03\r\n       mov       eax,edx\r\n       lea       rcx,[783DA42091B8]\r\n       mov       ecx,[rcx+rax*4]\r\n       lea       rdi,[M01_L00]\r\n       add       rcx,rdi\r\n       jmp       
rcx\r\nM01_L03:\r\n       mov       eax,0FFFFFFFF\r\n       pop       rbp\r\n       ret\r\n       test      edx,edx\r\n       je        short M01_L04\r\n       mov       eax,[rsi]\r\n       cmp       edx,1\r\n       jbe       short M01_L04\r\n       add       eax,[rsi+4]\r\n       cmp       edx,2\r\n       jbe       short M01_L04\r\n       add       eax,[rsi+8]\r\n       jmp       short M01_L01\r\n       test      edx,edx\r\n       je        short M01_L04\r\n       mov       eax,[rsi]\r\n       jmp       short M01_L01\r\n       xor       eax,eax\r\n       pop       rbp\r\n       ret\r\nM01_L04:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 103<\/code><\/pre>\n<p>You can see the tell-tale bounds check sign (<code>CORINFO_HELP_RNGCHKFAIL<\/code>) under <code>M01_L04<\/code>, and no fewer than six jumps targeting that label, one for each <code>span[...]<\/code> access. But on .NET 10, we get this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Sum(System.ReadOnlySpan`1&lt;Int32&gt;)\r\n       push      rbp\r\n       mov       rbp,rsp\r\nM01_L00:\r\n       cmp       edx,2\r\n       jne       short M01_L02\r\n       mov       eax,[rsi]\r\n       add       eax,[rsi+4]\r\nM01_L01:\r\n       pop       rbp\r\n       ret\r\nM01_L02:\r\n       cmp       edx,3\r\n       ja        short M01_L03\r\n       mov       eax,edx\r\n       lea       rcx,[72C15C0F8FD8]\r\n       mov       ecx,[rcx+rax*4]\r\n       lea       rdx,[M01_L00]\r\n       add       rcx,rdx\r\n       jmp       rcx\r\nM01_L03:\r\n       mov       eax,0FFFFFFFF\r\n       pop       rbp\r\n       ret\r\n       xor       eax,eax\r\n       pop       rbp\r\n       ret\r\n       mov       eax,[rsi]\r\n       jmp       short M01_L01\r\n       mov       eax,[rsi]\r\n       add       eax,[rsi+4]\r\n       add       eax,[rsi+8]\r\n       jmp       short M01_L01\r\n; Total bytes of code 70<\/code><\/pre>\n<p>The <code>CORINFO_HELP_RNGCHKFAIL<\/code> and all the 
jumps to it have evaporated.<\/p>\n<h3>Cloning<\/h3>\n<p>There are other ways the JIT can remove bounds checking even when it can&#8217;t prove statically that every individual access is safe. Consider this method:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _arr = new int[16];\r\n\r\n    [Benchmark]\r\n    public void Test()\r\n    {\r\n        int[] arr = _arr;\r\n        arr[0] = 2;\r\n        arr[1] = 3;\r\n        arr[2] = 5;\r\n        arr[3] = 8;\r\n        arr[4] = 13;\r\n        arr[5] = 21;\r\n        arr[6] = 34;\r\n        arr[7] = 55;\r\n    }\r\n}<\/code><\/pre>\n<p>Here&#8217;s the assembly code generated on .NET 9:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test()\r\n       push      rax\r\n       mov       rax,[rdi+8]\r\n       mov       ecx,[rax+8]\r\n       test      ecx,ecx\r\n       je        short M00_L00\r\n       mov       dword ptr [rax+10],2\r\n       cmp       ecx,1\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+14],3\r\n       cmp       ecx,2\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+18],5\r\n       cmp       ecx,3\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+1C],8\r\n       cmp       ecx,4\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+20],0D\r\n       cmp       ecx,5\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+24],15\r\n       cmp       ecx,6\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+28],22\r\n       cmp       ecx,7\r\n       jbe       short M00_L00\r\n       mov       dword ptr [rax+2C],37\r\n 
      add       rsp,8\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 114<\/code><\/pre>\n<p>Even if you&#8217;re not proficient at reading assembly, the pattern should still be obvious. In the C# code, we have eight writes into the array, and in the assembly code, we have eight repetitions of the same pattern: <code>cmp ecx,LENGTH<\/code> to compare the length of the array against the required <code>LENGTH<\/code>, <code>jbe short M00_L00<\/code> to jump to the <code>CORINFO_HELP_RNGCHKFAIL<\/code> helper if the bounds check fails, and <code>mov dword ptr [rax+OFFSET],VALUE<\/code> to store <code>VALUE<\/code> into the array at byte offset <code>OFFSET<\/code>. Inside the <code>Test<\/code> method, the JIT can&#8217;t know how long <code>_arr<\/code> is, so it must include bounds checks. Moreover, it must include all of the bounds checks, rather than coalescing them, because it is forbidden from introducing behavioral changes as part of optimizations. Imagine instead if it chose to coalesce all of the bounds checks into a single check, and emitted this method as if it were the equivalent of the following:<\/p>\n<pre><code class=\"language-csharp\">if (arr.Length &gt;= 8)\r\n{\r\n    arr[0] = 2;\r\n    arr[1] = 3;\r\n    arr[2] = 5;\r\n    arr[3] = 8;\r\n    arr[4] = 13;\r\n    arr[5] = 21;\r\n    arr[6] = 34;\r\n    arr[7] = 55;\r\n}\r\nelse\r\n{\r\n    throw new IndexOutOfRangeException();\r\n}<\/code><\/pre>\n<p>Now, let&#8217;s say the array was actually of length four. The original program would have filled the array with values <code>[2, 3, 5, 8]<\/code> before throwing an exception, but this transformed code wouldn&#8217;t (there wouldn&#8217;t be any writes to the array). That&#8217;s an observable behavioral change. 
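<\/p>\n<p>To make that difference concrete, here&#8217;s a minimal sketch of the length-four scenario. The <code>Fill<\/code> helper is a hypothetical stand-in for the body of <code>Test<\/code> (it isn&#8217;t part of the benchmark), and the output shown in the comment follows from the per-store bounds checks described above:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nint[] arr = new int[4];\r\ntry\r\n{\r\n    Fill(arr);\r\n}\r\ncatch (IndexOutOfRangeException)\r\n{\r\n    \/\/ Thrown by the fifth store, arr[4]; the first four stores have already happened.\r\n}\r\n\r\nConsole.WriteLine(string.Join(\", \", arr)); \/\/ 2, 3, 5, 8\r\n\r\n\/\/ Hypothetical helper: the same eight stores as Test in the benchmark.\r\nstatic void Fill(int[] arr)\r\n{\r\n    arr[0] = 2;\r\n    arr[1] = 3;\r\n    arr[2] = 5;\r\n    arr[3] = 8;\r\n    arr[4] = 13;\r\n    arr[5] = 21;\r\n    arr[6] = 34;\r\n    arr[7] = 55;\r\n}<\/code><\/pre>\n<p>With the coalesced-check version, the same call would throw before any store ran and leave the array as all zeros, which is exactly the kind of difference the JIT isn&#8217;t allowed to introduce.<\/p>\n<p>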
An enterprising developer could of course <em>choose<\/em> to rewrite their code to avoid some of these checks, e.g.<\/p>\n<pre><code class=\"language-csharp\">arr[7] = 55;\r\narr[0] = 2;\r\narr[1] = 3;\r\narr[2] = 5;\r\narr[3] = 8;\r\narr[4] = 13;\r\narr[5] = 21;\r\narr[6] = 34;<\/code><\/pre>\n<p>By moving the last store to the beginning, the developer has given the JIT extra knowledge. The JIT can now see that <em>if<\/em> the first store succeeds, the rest are guaranteed to succeed as well, and the JIT will emit a single bounds check. But, again, that&#8217;s the developer choosing to change their program in a way the JIT must not. However, there are other things the JIT <em>can<\/em> do. Imagine the JIT chose to rewrite the method like this instead:<\/p>\n<pre><code class=\"language-csharp\">if (arr.Length &gt;= 8)\r\n{\r\n    arr[0] = 2;\r\n    arr[1] = 3;\r\n    arr[2] = 5;\r\n    arr[3] = 8;\r\n    arr[4] = 13;\r\n    arr[5] = 21;\r\n    arr[6] = 34;\r\n    arr[7] = 55;\r\n}\r\nelse\r\n{\r\n    arr[0] = 2;\r\n    arr[1] = 3;\r\n    arr[2] = 5;\r\n    arr[3] = 8;\r\n    arr[4] = 13;\r\n    arr[5] = 21;\r\n    arr[6] = 34;\r\n    arr[7] = 55;\r\n}<\/code><\/pre>\n<p>To our C# sensibilities, that looks unnecessarily complicated; the <code>if<\/code> and the <code>else<\/code> block contain <em>exactly<\/em> the same C# code. But, knowing what we now know about how the JIT can use known length information to elide bounds checks, it starts to make a bit more sense. 
Here&#8217;s what the JIT emits for this variant on .NET 9:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       mov       ecx,[rax+8]\r\n       cmp       ecx,8\r\n       jl        short M00_L00\r\n       mov       rcx,300000002\r\n       mov       [rax+10],rcx\r\n       mov       rcx,800000005\r\n       mov       [rax+18],rcx\r\n       mov       rcx,150000000D\r\n       mov       [rax+20],rcx\r\n       mov       rcx,3700000022\r\n       mov       [rax+28],rcx\r\n       pop       rbp\r\n       ret\r\nM00_L00:\r\n       test      ecx,ecx\r\n       je        short M00_L01\r\n       mov       dword ptr [rax+10],2\r\n       cmp       ecx,1\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+14],3\r\n       cmp       ecx,2\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+18],5\r\n       cmp       ecx,3\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+1C],8\r\n       cmp       ecx,4\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+20],0D\r\n       cmp       ecx,5\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+24],15\r\n       cmp       ecx,6\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+28],22\r\n       cmp       ecx,7\r\n       jbe       short M00_L01\r\n       mov       dword ptr [rax+2C],37\r\n       pop       rbp\r\n       ret\r\nM00_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 177<\/code><\/pre>\n<p>The <code>else<\/code> block is compiled to the <code>M00_L00<\/code> label, which contains those same eight repeated blocks we saw earlier. But the <code>if<\/code> block (above the <code>M00_L00<\/code> label) is interesting. 
The only branch there is the initial <code>arr.Length &gt;= 8<\/code> check I wrote in the C# code, emitted as the <code>cmp ecx,8<\/code>\/<code>jl short M00_L00<\/code> pair of instructions. The rest of the block is just <code>mov<\/code> instructions (and you can see there are only four writes into the array rather than eight&#8230; the JIT has optimized the eight four-byte writes into four eight-byte writes). In our rewrite, we&#8217;ve manually cloned the code, so that in what we expect to be the vast, vast, vast majority case (presumably we wouldn&#8217;t have written the array writes in the first place if we thought they&#8217;d fail), we only incur the single length check, and then we have our &#8220;hopefully this is never needed&#8221; fallback case for the rare situation where it is. Of course, you shouldn&#8217;t (and shouldn&#8217;t need to) do such manual cloning. But, the JIT can do such cloning for you, and does.<\/p>\n<p>&#8220;Cloning&#8221; is an optimization long employed by the JIT, where it will do this kind of code duplication, typically of loops, when it believes that in doing so, it can heavily optimize a common case. Now in .NET 10, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112595\">dotnet\/runtime#112595<\/a>, it can employ this same technique for these kinds of sequences of writes. 
Going back to our original benchmark, here&#8217;s what we now get on .NET 10:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       mov       ecx,[rax+8]\r\n       mov       edx,ecx\r\n       cmp       edx,7\r\n       jle       short M00_L01\r\n       mov       rdx,300000002\r\n       mov       [rax+10],rdx\r\n       mov       rcx,800000005\r\n       mov       [rax+18],rcx\r\n       mov       rcx,150000000D\r\n       mov       [rax+20],rcx\r\n       mov       rcx,3700000022\r\n       mov       [rax+28],rcx\r\nM00_L00:\r\n       pop       rbp\r\n       ret\r\nM00_L01:\r\n       test      edx,edx\r\n       je        short M00_L02\r\n       mov       dword ptr [rax+10],2\r\n       cmp       ecx,1\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+14],3\r\n       cmp       ecx,2\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+18],5\r\n       cmp       ecx,3\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+1C],8\r\n       cmp       ecx,4\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+20],0D\r\n       cmp       ecx,5\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+24],15\r\n       cmp       ecx,6\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+28],22\r\n       cmp       ecx,7\r\n       jbe       short M00_L02\r\n       mov       dword ptr [rax+2C],37\r\n       jmp       short M00_L00\r\nM00_L02:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 179<\/code><\/pre>\n<p>This structure looks almost identical to what we got when we manually cloned: the JIT has emitted the same code twice, except in one case, there are no bounds checks, and in the other case, there are all the bounds checks, and a single length check determines which path to follow. 
Pretty neat.<\/p>\n<p>As noted, the JIT has been doing cloning for years, in particular for loops over arrays. However, more and more code is being written against spans instead of arrays, and unfortunately this valuable optimization didn&#8217;t apply to spans. Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113575\">dotnet\/runtime#113575<\/a>, it does! We can see this with a basic looping example:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _arr = new int[16];\r\n    private int _count = 8;\r\n\r\n    [Benchmark]\r\n    public void WithSpan()\r\n    {\r\n        Span&lt;int&gt; span = _arr;\r\n        int count = _count;\r\n\r\n        for (int i = 0; i &lt; count; i++)\r\n        {\r\n            span[i] = i;\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void WithArray()\r\n    {\r\n        int[] arr = _arr;\r\n        int count = _count;\r\n\r\n        for (int i = 0; i &lt; count; i++)\r\n        {\r\n            arr[i] = i;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>In both <code>WithArray<\/code> and <code>WithSpan<\/code>, we have the same loop, iterating from 0 to a <code>_count<\/code> with an unknown relationship to the length of <code>_arr<\/code>, so there has to be some kind of bounds checking emitted. 
Here&#8217;s what we get on .NET 9 for <code>WithSpan<\/code>:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.WithSpan()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       test      rax,rax\r\n       je        short M00_L03\r\n       lea       rcx,[rax+10]\r\n       mov       eax,[rax+8]\r\nM00_L00:\r\n       mov       edi,[rdi+10]\r\n       xor       edx,edx\r\n       test      edi,edi\r\n       jle       short M00_L02\r\n       nop       dword ptr [rax]\r\nM00_L01:\r\n       cmp       edx,eax\r\n       jae       short M00_L04\r\n       mov       [rcx+rdx*4],edx\r\n       inc       edx\r\n       cmp       edx,edi\r\n       jl        short M00_L01\r\nM00_L02:\r\n       pop       rbp\r\n       ret\r\nM00_L03:\r\n       xor       ecx,ecx\r\n       xor       eax,eax\r\n       jmp       short M00_L00\r\nM00_L04:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 59<\/code><\/pre>\n<p>There&#8217;s some upfront assembly here associated with loading <code>_arr<\/code> into a span, loading <code>_count<\/code>, and checking to see whether the count is less than or equal to 0 (in which case the whole loop can be skipped). Then the core of the loop is at <code>M00_L01<\/code>, which is repeatedly checking <code>edx<\/code> (which contains <code>i<\/code>) against the length of the span (in <code>eax<\/code>), jumping to <code>CORINFO_HELP_RNGCHKFAIL<\/code> if it&#8217;s an out-of-bounds access, writing <code>edx<\/code> (<code>i<\/code>) into the span at the next position, bumping up <code>i<\/code>, and then jumping back to <code>M00_L01<\/code> to keep iterating if <code>i<\/code> is still less than <code>count<\/code> (stored in <code>edi<\/code>). In other words, we have two checks per iteration: is <code>i<\/code> still within the bounds of the span, and is <code>i<\/code> still less than <code>count<\/code>. 
Now here&#8217;s what we get on .NET 9 for <code>WithArray<\/code>:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.WithArray()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       mov       ecx,[rdi+10]\r\n       xor       edx,edx\r\n       test      ecx,ecx\r\n       jle       short M00_L01\r\n       test      rax,rax\r\n       je        short M00_L02\r\n       cmp       [rax+8],ecx\r\n       jl        short M00_L02\r\n       nop       dword ptr [rax+rax]\r\nM00_L00:\r\n       mov       edi,edx\r\n       mov       [rax+rdi*4+10],edx\r\n       inc       edx\r\n       cmp       edx,ecx\r\n       jl        short M00_L00\r\nM00_L01:\r\n       pop       rbp\r\n       ret\r\nM00_L02:\r\n       cmp       edx,[rax+8]\r\n       jae       short M00_L03\r\n       mov       edi,edx\r\n       mov       [rax+rdi*4+10],edx\r\n       inc       edx\r\n       cmp       edx,ecx\r\n       jl        short M00_L02\r\n       jmp       short M00_L01\r\nM00_L03:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 71<\/code><\/pre>\n<p>Here, label <code>M00_L02<\/code> looks very similar to the loop we just saw in <code>WithSpan<\/code>, incurring both the check against <code>count<\/code> and the bounds check on every iteration. But note section <code>M00_L00<\/code>: it&#8217;s a clone of the same loop, still with the <code>cmp edx,ecx<\/code> that checks <code>i<\/code> against <code>count<\/code> on each iteration, but no additional bounds checking in sight. The JIT has cloned the loop, specializing one to not have bounds checks, and then in the upfront section, it determines which path to follow based on a single check against the array&#8217;s length (<code>cmp [rax+8],ecx<\/code>\/<code>jl short M00_L02<\/code>). 
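<\/p>\n<p>In C# terms, the cloned loop corresponds roughly to the following. This is only a sketch: <code>FillCloned<\/code> is a made-up name, and the JIT performs this transformation on its internal representation rather than on source code, so treat the shape as an approximation of the assembly above and not as what the JIT literally produces:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nint[] arr = new int[16];\r\nFillCloned(arr, 8);\r\nConsole.WriteLine(string.Join(\", \", arr[..8])); \/\/ 0, 1, 2, 3, 4, 5, 6, 7\r\n\r\n\/\/ Hypothetical source-level rendering of the cloned loop.\r\nstatic void FillCloned(int[] arr, int count)\r\n{\r\n    if (count &lt;= 0)\r\n    {\r\n        return; \/\/ mirrors test ecx,ecx \/ jle: nothing to do\r\n    }\r\n\r\n    if (arr is not null &amp;&amp; arr.Length &gt;= count)\r\n    {\r\n        \/\/ Fast clone (M00_L00): the single up-front check covers every i in [0, count).\r\n        for (int i = 0; i &lt; count; i++)\r\n        {\r\n            arr[i] = i;\r\n        }\r\n    }\r\n    else\r\n    {\r\n        \/\/ Slow clone (M00_L02): the same loop with its per-iteration checks left in place,\r\n        \/\/ so any exception is thrown at the same point as in the original code.\r\n        for (int i = 0; i &lt; count; i++)\r\n        {\r\n            arr[i] = i;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>Because both loops have the same body, the observable behavior matches the original; only the cost of getting there differs.<\/p>\n<p>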
Now in .NET 10, here&#8217;s what we get for <code>WithSpan<\/code>:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.WithSpan()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       test      rax,rax\r\n       je        short M00_L04\r\n       lea       rcx,[rax+10]\r\n       mov       eax,[rax+8]\r\nM00_L00:\r\n       mov       edx,[rdi+10]\r\n       xor       edi,edi\r\n       test      edx,edx\r\n       jle       short M00_L02\r\n       cmp       edx,eax\r\n       jg        short M00_L03\r\nM00_L01:\r\n       mov       eax,edi\r\n       mov       [rcx+rax*4],edi\r\n       inc       edi\r\n       cmp       edi,edx\r\n       jl        short M00_L01\r\nM00_L02:\r\n       pop       rbp\r\n       ret\r\nM00_L03:\r\n       cmp       edi,eax\r\n       jae       short M00_L05\r\n       mov       esi,edi\r\n       mov       [rcx+rsi*4],edi\r\n       inc       edi\r\n       cmp       edi,edx\r\n       jl        short M00_L03\r\n       jmp       short M00_L02\r\nM00_L04:\r\n       xor       ecx,ecx\r\n       xor       eax,eax\r\n       jmp       short M00_L00\r\nM00_L05:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 75<\/code><\/pre>\n<p>As with <code>WithArray<\/code> in .NET 9, <code>WithSpan<\/code> for .NET 10 has the loop cloned, with the <code>M00_L03<\/code> block containing the bounds check on each iteration, and the <code>M00_L01<\/code> block eliding the bounds check on each iteration.<\/p>\n<p>The JIT gains more cloning abilities in .NET 10, as well. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110020\">dotnet\/runtime#110020<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108604\">dotnet\/runtime#108604<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110483\">dotnet\/runtime#110483<\/a> make it possible for the JIT to clone <code>try\/finally<\/code> blocks, whereas previously it would immediately bail out of cloning any regions containing such constructs. This might seem niche, but it&#8217;s actually quite valuable when you consider that <code>foreach<\/code>&#8217;ing over an enumerable typically involves a hidden <code>try<\/code>\/<code>finally<\/code>, with the <code>finally<\/code> calling the enumerator&#8217;s <code>Dispose<\/code>.<\/p>\n<p>Many of these different optimizations interact with each other. Dynamic PGO triggers a form of cloning, as part of the guarded devirtualization (GDV) mentioned earlier: if the instrumentation data reveals that a particular virtual call is generally performed on an instance of a specific type, the JIT can clone the resulting code into one path specific to that type and another path that handles any type. That then enables the specific-type code path to devirtualize the call and possibly inline it. 
And if it inlines it, that then provides more opportunities for the JIT to see that an object doesn&#8217;t escape, and potentially stack allocate it.\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111473\">dotnet\/runtime#111473<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116978\">dotnet\/runtime#116978<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116992\">dotnet\/runtime#116992<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117222\">dotnet\/runtime#117222<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117295\">dotnet\/runtime#117295<\/a> enable that, enhancing escape analysis to determine if an object only escapes when such a generated type test fails (when the target object isn&#8217;t of the expected common type).<\/p>\n<p>I want to pause for a moment, because my words thus far aren&#8217;t nearly enthusiastic enough to highlight the magnitude of what this enables. The <code>dotnet\/runtime<\/code> repo uses an automated performance analysis system which flags when benchmarks significantly improve or regress and ties those changes back to the responsible PR. 
This is what it looked like for this PR:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2025\/09\/ConditionalEscapeAnalysisImprovements.png\" alt=\"Conditional Escape Analysis Triggering Many Benchmark Improvements\" \/>\nWe can see why this is so good from a simple example:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _values = Enumerable.Range(1, 100).ToArray();\r\n\r\n    [Benchmark]\r\n    public int Sum() =&gt; Sum(_values);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static int Sum(IEnumerable&lt;int&gt; values)\r\n    {\r\n        int sum = 0;\r\n        foreach (int value in values)\r\n        {\r\n            sum += value;\r\n        }\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<p>With dynamic PGO, the instrumented code for <code>Sum<\/code> will see that <code>values<\/code> is generally an <code>int[]<\/code>, and it&#8217;ll be able to emit a specialized code path in the optimized <code>Sum<\/code> implementation for when it is. 
And then with this ability to do conditional escape analysis, for the common path the JIT can see that the resulting <code>GetEnumerator<\/code> produces an <code>IEnumerator&lt;int&gt;<\/code> that never escapes, such that along with all of the relevant methods being devirtualized and inlined, the enumerator can be stack allocated.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">109.86 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">35.45 ns<\/td>\n<td style=\"text-align: right;\">0.32<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Just think about how many places in your apps and services you enumerate collections like this, and you can see why it&#8217;s such an exciting improvement. Note that these cases don&#8217;t always even require PGO. 
Consider a case like this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly IEnumerable&lt;int&gt; s_values = new int[] { 1, 2, 3, 4, 5 };\r\n\r\n    [Benchmark]\r\n    public int Sum()\r\n    {\r\n        int sum = 0;\r\n        foreach (int value in s_values)\r\n        {\r\n            sum += value;\r\n        }\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<p>Here, the JIT can see that even though the <code>s_values<\/code> is typed as <code>IEnumerable&lt;int&gt;<\/code>, it&#8217;s always actually an <code>int[]<\/code>. In that case, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111948\">dotnet\/runtime#111948<\/a> enables the return type to be retyped in the JIT as <code>int[]<\/code> and the enumerator can be stack allocated.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">16.341 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">2.059 ns<\/td>\n<td style=\"text-align: right;\">0.13<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, too much cloning can be a bad 
thing, in particular as it increases code size. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108771\">dotnet\/runtime#108771<\/a> employs a heuristic to determine whether loops that <em>can<\/em> be cloned <em>should<\/em> be cloned; the larger the loop, the less likely it&#8217;ll be to be cloned.<\/p>\n<h3>Inlining<\/h3>\n<p>&#8220;Inlining&#8221;, which replaces a call to a function with a copy of that function&#8217;s implementation, has always been a critically important optimization. It&#8217;s easy to think about the benefits of inlining as just being about avoiding the overhead of a call, and while that can be meaningful (especially when considering security mechanisms like Intel&#8217;s Control-Flow Enforcement Technology, which slightly increases the cost of calls), generally the most benefit from inlining comes from knock-on benefits. Just as a simple example, if you have code like:<\/p>\n<pre><code class=\"language-csharp\">int i = Divide(10, 5);\r\n\r\nstatic int Divide(int n, int d) =&gt; n \/ d;<\/code><\/pre>\n<p>if <code>Divide<\/code> doesn&#8217;t get inlined, then when <code>Divide<\/code> is called, it&#8217;ll need to perform the actual <code>idiv<\/code>, which is a relatively expensive operation. In contrast, if <code>Divide<\/code> is inlined, then the call site becomes:<\/p>\n<pre><code class=\"language-csharp\">int i = 10 \/ 5;<\/code><\/pre>\n<p>which can be evaluated at compile time and becomes just:<\/p>\n<pre><code class=\"language-csharp\">int i = 2;<\/code><\/pre>\n<p>More compelling examples were already seen throughout the discussion of escape analysis and stack allocation, which depend heavily on the ability to inline methods. Given the increased importance of inlining, it&#8217;s gotten even more focus in .NET 10.<\/p>\n<p>Some of the .NET work related to inlining is about enabling more kinds of things to be inlined. 
Historically, a variety of constructs present in a method would prevent that method from even being considered for inlining. Arguably the most well known of these is exception handling: methods with exception handling clauses, e.g. <code>try\/catch<\/code> or <code>try\/finally<\/code>, would not be inlined. Even a simple method like <code>M<\/code> in this example:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private readonly object _o = new();\r\n\r\n    [Benchmark]\r\n    public int Test()\r\n    {\r\n        M(_o);\r\n        return 42;\r\n    }\r\n\r\n    private static void M(object o)\r\n    {\r\n        Monitor.Enter(o);\r\n        try\r\n        {\r\n        }\r\n        finally\r\n        {\r\n            Monitor.Exit(o);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>does not get inlined on .NET 9:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test()\r\n       push      rax\r\n       mov       rdi,[rdi+8]\r\n       call      qword ptr [78F199864EE8]; Tests.M(System.Object)\r\n       mov       eax,2A\r\n       add       rsp,8\r\n       ret\r\n; Total bytes of code 21<\/code><\/pre>\n<p>But with a plethora of PRs, in particular <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112968\">dotnet\/runtime#112968<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113023\">dotnet\/runtime#113023<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113497\">dotnet\/runtime#113497<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112998\">dotnet\/runtime#112998<\/a>, methods containing 
<code>try\/finally<\/code> are no longer blocked from inlining (<code>try\/catch<\/code> regions are still a challenge). For the same benchmark on .NET 10, we now get this assembly:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test()\r\n       push      rbp\r\n       push      rbx\r\n       push      rax\r\n       lea       rbp,[rsp+10]\r\n       mov       rbx,[rdi+8]\r\n       test      rbx,rbx\r\n       je        short M00_L03\r\n       mov       rdi,rbx\r\n       call      00007920A0EE65E0\r\n       test      eax,eax\r\n       je        short M00_L02\r\nM00_L00:\r\n       mov       rdi,rbx\r\n       call      00007920A0EE6D50\r\n       test      eax,eax\r\n       jne       short M00_L04\r\nM00_L01:\r\n       mov       eax,2A\r\n       add       rsp,8\r\n       pop       rbx\r\n       pop       rbp\r\n       ret\r\nM00_L02:\r\n       mov       rdi,rbx\r\n       call      qword ptr [79202393C1F8]\r\n       jmp       short M00_L00\r\nM00_L03:\r\n       xor       edi,edi\r\n       call      qword ptr [79202393C1C8]\r\n       int       3\r\nM00_L04:\r\n       mov       edi,eax\r\n       mov       rsi,rbx\r\n       call      qword ptr [79202393C1E0]\r\n       jmp       short M00_L01\r\n; Total bytes of code 86<\/code><\/pre>\n<p>The details of the assembly don&#8217;t matter, other than it&#8217;s a whole lot more than was there before, because we&#8217;re now looking in large part at the implementation of <code>M<\/code>. In addition to methods with <code>try\/finally<\/code> now being inlineable, other improvements have also been made around exception handling. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110273\">dotnet\/runtime#110273<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110464\">dotnet\/runtime#110464<\/a> enable the removal of <code>try\/catch<\/code> and <code>try\/fault<\/code> blocks if it can prove the <code>try<\/code> block can&#8217;t possibly throw. 
Consider this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"i\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public int Test(int i)\r\n    {\r\n        try\r\n        {\r\n            i++;\r\n        }\r\n        catch\r\n        {\r\n            Console.WriteLine(\"Exception caught\");\r\n        }\r\n\r\n        return i;\r\n    }\r\n}<\/code><\/pre>\n<p>There&#8217;s nothing the <code>try<\/code> block here can do that will result in an exception being thrown (assuming the developer hasn&#8217;t enabled checked arithmetic, in which case it could possibly throw an <code>OverflowException<\/code>), yet on .NET 9 we get this assembly:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(Int32)\r\n       push      rbp\r\n       sub       rsp,10\r\n       lea       rbp,[rsp+10]\r\n       mov       [rbp-10],rsp\r\n       mov       [rbp-4],esi\r\n       mov       eax,[rbp-4]\r\n       inc       eax\r\n       mov       [rbp-4],eax\r\nM00_L00:\r\n       mov       eax,[rbp-4]\r\n       add       rsp,10\r\n       pop       rbp\r\n       ret\r\n       push      rbp\r\n       sub       rsp,10\r\n       mov       rbp,[rdi]\r\n       mov       [rsp],rbp\r\n       lea       rbp,[rbp+10]\r\n       mov       rdi,784B08950018\r\n       call      qword ptr [784B0DE44EE8]\r\n       lea       rax,[M00_L00]\r\n       add       rsp,10\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 79<\/code><\/pre>\n<p>Now on .NET 10, the JIT is able to elide the <code>catch<\/code> and remove all ceremony related to the <code>try<\/code> because it can see that ceremony is pointless 
overhead.<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test(Int32)\r\n       lea       eax,[rsi+1]\r\n       ret\r\n; Total bytes of code 4<\/code><\/pre>\n<p>That&#8217;s true even when the contents of the <code>try<\/code> calls into other methods that are then inlined, exposing their contents to the JIT&#8217;s analysis.<\/p>\n<p>(As an aside, the JIT was already able to remove <code>try\/finally<\/code> when the <code>finally<\/code> was empty, but <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108003\">dotnet\/runtime#108003<\/a> catches even more cases of checking for empty <code>finally<\/code>s again after most other optimizations have been run, in case they revealed additional empty blocks.)<\/p>\n<p>Another example is &#8220;GVM&#8221;. Previously, any method that called a GVM, or generic virtual method (a virtual method with a generic type parameter), would be blocked from being inlined.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Base _base = new();\r\n\r\n    [Benchmark]\r\n    public int Test()\r\n    {\r\n        M();\r\n        return 42;\r\n    }\r\n\r\n    private void M() =&gt; _base.M&lt;object&gt;();\r\n}\r\n\r\nclass Base\r\n{\r\n    public virtual void M&lt;T&gt;() { }\r\n}<\/code><\/pre>\n<p>On .NET 9, the above results in this assembly:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test()\r\n       push      rax\r\n       call      qword ptr [728ED5664FD8]; Tests.M()\r\n       mov       eax,2A\r\n       add       rsp,8\r\n       ret\r\n; Total bytes of code 
17<\/code><\/pre>\n<p>Now on .NET 10, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116773\">dotnet\/runtime#116773<\/a>, <code>M<\/code> can now be inlined.<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test()\r\n       push      rbp\r\n       push      rbx\r\n       push      rax\r\n       lea       rbp,[rsp+10]\r\n       mov       rbx,[rdi+8]\r\n       mov       rdi,rbx\r\n       mov       rsi,offset MT_Base\r\n       mov       rdx,78034C95D2A0\r\n       call      System.Runtime.CompilerServices.VirtualDispatchHelpers.VirtualFunctionPointer(System.Object, IntPtr, IntPtr)\r\n       mov       rdi,rbx\r\n       call      rax\r\n       mov       eax,2A\r\n       add       rsp,8\r\n       pop       rbx\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 57<\/code><\/pre>\n<p>Another area of investment with inlining is to do with the heuristics around when methods should be inlined. Just inlining everything would be bad; inlining copies code, which results in more code, which can have significant negative repercussions. For example, inlining&#8217;s increased code size puts more pressure on caches. Processors have an instruction cache, a small amount of super fast memory in a CPU that stores recently used instructions, making them really fast to access again the next time they&#8217;re needed (such as the next iteration through a loop, or the next time that same function is called). Consider a method <code>M<\/code>, and 100 call sites to <code>M<\/code> that are all being accessed. If all of those share the same instructions for <code>M<\/code>, because the 100 call sites are all actually calling <code>M<\/code>, the instruction cache will only need to load <code>M<\/code>&#8216;s instructions once. If all of those 100 call sites each have their own copy of <code>M<\/code>&#8216;s instructions, then all 100 copies will separately be loaded through the cache, fighting with each other and other instructions for residence. 
The less likely it is that instructions are in the cache, the more likely it is that the CPU will stall waiting for the instructions to be loaded from memory.<\/p>\n<p>For this reason, the JIT needs to be careful about what it inlines. It tries hard to avoid inlining anything that won&#8217;t benefit (e.g. a larger method whose instructions won&#8217;t be materially influenced by the caller&#8217;s context) while also trying hard to inline anything that will materially benefit (e.g. small functions where the code required to call the function is similar in size to the contents of the function, functions with instructions that could be materially impacted by information from the call site, etc.). As part of these heuristics, the JIT has the notion of &#8220;boosts,&#8221; where observations it makes about things a method does boost the chances of that method being inlined. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114806\">dotnet\/runtime#114806<\/a> gives a boost to methods that appear to be returning new arrays of a small, fixed length; if those arrays can instead be allocated in the caller&#8217;s frame, the JIT might then be able to discover they don&#8217;t escape and enable them to be stack allocated. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110596\">dotnet\/runtime#110596<\/a> similarly looks for boxing, as the caller could possibly instead avoid the box entirely.<\/p>\n<p>For the same purpose (and also just to minimize time spent performing compilation), the JIT also maintains a budget for how much it allows to be inlined into a method compilation&#8230; once it hits that budget, it might stop inlining anything. The budgeting scheme overall works <em>ok<\/em>; however, in certain circumstances it can run out of budget at very inopportune times, for example doing a lot of inlining at top-level call sites but then running out of budget by the time it gets to small methods that are critically important to inline for good performance. 
To help mitigate these scenarios, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114191\">dotnet\/runtime#114191<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118641\">dotnet\/runtime#118641<\/a> more than double the JIT&#8217;s default inlining budget.<\/p>\n<p>The JIT also pays a lot of attention to the number of local variables (e.g. parameters\/locals explicitly in the IL, JIT-created temporary locals, promoted struct fields, etc.) it tracks. To avoid creating too many, the JIT would stop inlining once it was already tracking 512. But as other changes have made inlining more aggressive, this (strangely hardcoded) limit gets hit more often, leaving very valuable inlinees out in the cold. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118515\">dotnet\/runtime#118515<\/a> removed this fixed limit and instead ties it to a large percentage of the number of locals the JIT is allowed to track (by default, this ends up almost doubling the limit used by the inliner).<\/p>\n<h3>Constant Folding<\/h3>\n<p>Constant folding is a compiler&#8217;s ability to perform operations, typically math, at compile-time rather than at run-time: given multiple constants and an expressed relationship between them, the compiler can &#8220;fold&#8221; those constants together into a new constant. So, if you have the C# code <code>int M(int i) =&gt; i + 2 * 3;<\/code>, the C# compiler does constant folding and emits that into your compilation as if you&#8217;d written <code>int M(int i) =&gt; i + 6;<\/code>. The JIT can and does also do constant folding, which is valuable especially when it&#8217;s based on information not available to the C# compiler. For example, the JIT can treat <code>static readonly<\/code> fields or <code>IntPtr.Size<\/code> or <code>Vector128&lt;T&gt;.Count<\/code> as constants. And the JIT can do folding across inlines. 
For example, if you have:<\/p>\n<pre><code class=\"language-csharp\">int M1(int i) =&gt; i + M2(2 * 3);\r\nint M2(int j) =&gt; j * Environment.ProcessorCount;<\/code><\/pre>\n<p>the C# compiler will only be able to fold the <code>2 * 3<\/code>, and will emit the equivalent of:<\/p>\n<pre><code class=\"language-csharp\">int M1(int i) =&gt; i + M2(6);\r\nint M2(int j) =&gt; j * Environment.ProcessorCount;<\/code><\/pre>\n<p>but when compiling <code>M1<\/code>, the JIT can inline <code>M2<\/code> and treat <code>ProcessorCount<\/code> as a constant (on my machine it&#8217;s 16), and produce the following assembly code for <code>M1<\/code>:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"i\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public int M1(int i) =&gt; i + M2(6);\r\n\r\n    private int M2(int j) =&gt; j * Environment.ProcessorCount;\r\n}<\/code><\/pre>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.M1(Int32)\r\n       lea       eax,[rsi+60]\r\n       ret\r\n; Total bytes of code 4<\/code><\/pre>\n<p>That&#8217;s as if the code for <code>M1<\/code> had been <code>public int M1(int i) =&gt; i + 96;<\/code> (the displayed assembly renders hexadecimal, so the <code>60<\/code> is hexadecimal <code>0x60<\/code> and thus decimal <code>96<\/code>).<\/p>\n<p>Or consider:<\/p>\n<pre><code class=\"language-csharp\">string M() =&gt; GetString() ?? 
throw new Exception();\r\n\r\nstatic string GetString() =&gt; \"test\";<\/code><\/pre>\n<p>The JIT will be able to inline <code>GetString<\/code>, at which point it can see that the result is non-<code>null<\/code> and can fold away the check for the <code>null<\/code> constant, at which point it can also dead-code eliminate the <code>throw<\/code>. Constant folding is useful on its own in avoiding unnecessary work, but it also often unlocks other optimizations, like dead-code elimination and bounds-check elimination. The JIT is already quite good at finding constant folding opportunities, and gets better in .NET 10. Consider this benchmark:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"s\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(\"test\")]\r\n    public ReadOnlySpan&lt;char&gt; Test(string s)\r\n    {\r\n        s ??= \"\";\r\n        return s.AsSpan();\r\n    }\r\n}<\/code><\/pre>\n<p>Here&#8217;s the assembly that gets produced for .NET 9:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(System.String)\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,75B5D6200008\r\n       test      rsi,rsi\r\n       cmove     rsi,rax\r\n       test      rsi,rsi\r\n       jne       short M00_L01\r\n       xor       eax,eax\r\n       xor       edx,edx\r\nM00_L00:\r\n       pop       rbp\r\n       ret\r\nM00_L01:\r\n       lea       rax,[rsi+0C]\r\n       mov       edx,[rsi+8]\r\n       jmp       short M00_L00\r\n; Total bytes of code 41<\/code><\/pre>\n<p>Of particular note are those two <code>test rsi,rsi<\/code> instructions, which are <code>null<\/code> checks. 
The assembly starts by loading a value into <code>rax<\/code>; that value is the address of the <code>\"\"<\/code> string literal. It then uses <code>test rsi,rsi<\/code> to check whether the <code>s<\/code> parameter, which was passed into this instance method in the <code>rsi<\/code> register, is <code>null<\/code>. If it is <code>null<\/code>, the <code>cmove rsi,rax<\/code> instruction sets it to the address of the <code>\"\"<\/code> literal. And then&#8230; it does <code>test rsi,rsi<\/code> again? That second test is the <code>null<\/code> check at the beginning of <code>AsSpan<\/code>, which looks like this:<\/p>\n<pre><code class=\"language-csharp\">public static ReadOnlySpan&lt;char&gt; AsSpan(this string? text)\r\n{\r\n    if (text is null) return default;\r\n    return new ReadOnlySpan&lt;char&gt;(ref text.GetRawStringData(), text.Length);\r\n}<\/code><\/pre>\n<p>Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111985\">dotnet\/runtime#111985<\/a>, that second <code>null<\/code> check, along with others, can be folded, resulting in this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test(System.String)\r\n       mov       rax,7C01C4600008\r\n       test      rsi,rsi\r\n       cmove     rsi,rax\r\n       lea       rax,[rsi+0C]\r\n       mov       edx,[rsi+8]\r\n       ret\r\n; Total bytes of code 25<\/code><\/pre>\n<p>Similar impact comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108420\">dotnet\/runtime#108420<\/a>, which is also able to fold a different class of <code>null<\/code> checks.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"condition\")]\r\npublic partial class 
Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(true)]\r\n    public bool Test(bool condition)\r\n    {\r\n        string tmp = condition ? GetString1() : GetString2();\r\n        return tmp is not null;\r\n    }\r\n\r\n    private static string GetString1() =&gt; \"Hello\";\r\n    private static string GetString2() =&gt; \"World\";\r\n}<\/code><\/pre>\n<p>In this benchmark, <em>we<\/em> can see that neither <code>GetString1<\/code> nor <code>GetString2<\/code> returns <code>null<\/code>, and thus the <code>is not null<\/code> check shouldn&#8217;t be necessary. The JIT in .NET 9 couldn&#8217;t see that, but its improved .NET 10 self can.<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(Boolean)\r\n       mov       rax,7407F000A018\r\n       mov       rcx,7407F000A050\r\n       test      sil,sil\r\n       cmove     rax,rcx\r\n       test      rax,rax\r\n       setne     al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 37\r\n\r\n; .NET 10\r\n; Tests.Test(Boolean)\r\n       mov       eax,1\r\n       ret\r\n; Total bytes of code 6<\/code><\/pre>\n<p>Constant folding also applies to SIMD (Single Instruction Multiple Data), instructions that enable processing multiple pieces of data at once rather than only one element at a time. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117099\">dotnet\/runtime#117099<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117572\">dotnet\/runtime#117572<\/a> both enable more SIMD comparison operations to participate in folding.<\/p>\n<h3>Code Layout<\/h3>\n<p>When the JIT compiler generates assembly from the IL emitted by the C# compiler, it organizes that code into &#8220;basic blocks,&#8221; each a sequence of instructions with one entry point and one exit point, no jumps inside, no branches out except at the end. 
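<\/p>\n<p>As a rough illustration, here is how a small method might conceptually divide into basic blocks. This is a hand-annotated sketch, not JIT output: <code>CountPositives<\/code> is a hypothetical method, and the block boundaries in the comments are approximate (the real JIT would also account for things like the array bounds check).<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nConsole.WriteLine(CountPositives(new[] { 3, -1, 4, 0 })); \/\/ prints 2\r\n\r\n\/\/ Hypothetical method; the comments mark where basic blocks would roughly begin.\r\nstatic int CountPositives(int[] values)\r\n{\r\n    \/\/ Block A (entry): initialize locals, then fall through to the loop test.\r\n    int count = 0;\r\n    int i = 0;\r\n\r\n    \/\/ Block B (loop test): compare and conditionally branch out of the loop.\r\n    while (i &lt; values.Length)\r\n    {\r\n        \/\/ Block C: load the element, compare, and conditionally branch past D.\r\n        if (values[i] &gt; 0)\r\n        {\r\n            \/\/ Block D: runs only for positive elements.\r\n            count++;\r\n        }\r\n\r\n        \/\/ Block E: advance the index, then jump back to block B.\r\n        i++;\r\n    }\r\n\r\n    \/\/ Block F (exit): return the result.\r\n    return count;\r\n}<\/code><\/pre>\n<p>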
These blocks can then be moved around as a unit, and the order in which these blocks are placed in memory is referred to as &#8220;code layout&#8221; or &#8220;basic block layout.&#8221; This ordering can have a significant performance impact because modern CPUs rely heavily on an instruction cache and on branch prediction to keep things moving fast. If frequently executed (&#8220;hot&#8221;) blocks are close together and follow a common execution path, the CPU can execute them with fewer cache misses and fewer mispredicted jumps. If the layout is poor, where the hot code is split into pieces far apart from each other, or where rarely executed (&#8220;cold&#8221;) code sits in between, the CPU can spend more time jumping around and refilling caches than doing actual work. Consider a tight loop executed millions of times. A good layout keeps the loop entry, body, and backward edge (the jump back to the beginning of the body to do the next iteration) right next to each other, letting the CPU fetch them straight from the cache. In a bad layout, that loop might be interwoven with unrelated cold blocks (say, a <code>catch<\/code> block for a <code>try<\/code> in the loop), forcing the CPU to load instructions from different places and disrupting the flow. Similarly, for an <code>if<\/code> block, the likely path should generally be the next block so no jump is required, with the unlikely branch a short jump away, as that better aligns with the sensibilities of branch predictors. Code layout heuristics control how that happens, and as a result, how efficiently the resulting code is able to execute.<\/p>\n<p>When determining the starting layout of the blocks (before additional optimizations are done for the layout), <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108903\">dotnet\/runtime#108903<\/a> employs a &#8220;loop-aware reverse post-order&#8221; traversal. 
A reverse post-order traversal is an algorithm for visiting the nodes in a control flow graph such that each block appears after its predecessors. The &#8220;loop aware&#8221; part means the traversal recognizes loops as units, effectively creating a block around the whole loop, and tries to keep the whole loop together as the layout algorithm moves things around. The intent here is to start the larger layout optimizations from a more sensible place, reducing the amount of later reshuffling and situations where loop bodies get broken up.<\/p>\n<p>In the extreme, layout is essentially the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Travelling_salesman_problem\">traveling salesman problem<\/a>. The JIT must decide the order of basic blocks so that control transfers follow short, predictable paths and make efficient use of instruction cache and branch prediction. Just like the salesman visiting cities with minimal total travel distance, the compiler is trying to arrange blocks so that the &#8220;distance&#8221; between blocks, which might be measured in bytes or instruction fetch cost or something similar, is minimized. For any meaningfully-sized set of blocks, this is prohibitively expensive to compute optimally, as the number of possible orderings grows factorially with the number of blocks. Thus, the JIT has to rely on approximations rather than attempting an exact solution. One such approximation it employs now as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103450\">dotnet\/runtime#103450<\/a> (and then tweaked further in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109741\">dotnet\/runtime#109741<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109835\">dotnet\/runtime#109835<\/a>) is a &#8220;3-opt,&#8221; which really just means that rather than considering all blocks together, it looks at only three and tries to produce an optimal ordering amongst those (there are only eight possible orderings to be checked). 
The JIT can choose to iterate through sets of three blocks until it either doesn&#8217;t see any more improvements or hits a self-imposed limit. Specifically when handling backward jumps, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110277\">dotnet\/runtime#110277<\/a>, it expands this &#8220;3-opt&#8221; to &#8220;4-opt&#8221; (four blocks).<\/p>\n<p>.NET 10 also does a better job of factoring PGO data into layout. With dynamic PGO, the JIT is able to gather instrumentation data from an initial compilation and then use the results of that profiling to impact an optimized re-compilation. That data can lead to conclusions about what blocks are hot or cold, and which direction branches take, all information that&#8217;s valuable for layout optimization. However, data can sometimes be missing from these profiles, so the JIT has a &#8220;profile synthesis&#8221; algorithm that makes realistic guesses for these gaps in order to fill them in (if you&#8217;ve read or seen &#8220;Jurassic Park,&#8221; this is the JIT equivalent of filling in gaps in the dinosaur DNA sequences with DNA from present-day frogs). With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111915\">dotnet\/runtime#111915<\/a>, that repairing of the profile data is now performed just before layout, so that layout has a more complete picture.<\/p>\n<p>Let&#8217;s take a concrete example of all this. 
Here I&#8217;ve extracted the core function from <code>MemoryExtensions.BinarySearch<\/code>:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _values = Enumerable.Range(0, 512).ToArray();\r\n\r\n    [Benchmark]\r\n    public int BinarySearch()\r\n    {\r\n        int[] values = _values;\r\n        return BinarySearch(ref values[0], values.Length, 256);\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static int BinarySearch&lt;T, TComparable&gt;(\r\n        ref T spanStart, int length, TComparable comparable)\r\n        where TComparable : IComparable&lt;T&gt;, allows ref struct\r\n    {\r\n        int lo = 0;\r\n        int hi = length - 1;\r\n        while (lo &lt;= hi)\r\n        {\r\n            int i = (int)(((uint)hi + (uint)lo) &gt;&gt; 1);\r\n\r\n            int c = comparable.CompareTo(Unsafe.Add(ref spanStart, i));\r\n            if (c == 0)\r\n            {\r\n                return i;\r\n            }\r\n            else if (c &gt; 0)\r\n            {\r\n                lo = i + 1;\r\n            }\r\n            else\r\n            {\r\n                hi = i - 1;\r\n            }\r\n        }\r\n\r\n        return ~lo;\r\n    }\r\n}<\/code><\/pre>\n<p>And here&#8217;s the assembly we get for .NET 9 and .NET 10, diff&#8217;d from the former to the latter:<\/p>\n<pre><code class=\"language-diff\">; Tests.BinarySearch[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32, Int32)\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       xor       ecx,ecx\r\n       
dec       esi\r\n       js        short M01_L07\r\n+      jmp       short M01_L03\r\nM01_L00:\r\n-      lea       eax,[rsi+rcx]\r\n-      shr       eax,1\r\n-      movsxd    r8,eax\r\n-      mov       r8d,[rdi+r8*4]\r\n-      cmp       edx,r8d\r\n-      jge       short M01_L03\r\n       mov       r9d,0FFFFFFFF\r\nM01_L01:\r\n       test      r9d,r9d\r\n       je        short M01_L06\r\n       test      r9d,r9d\r\n       jg        short M01_L05\r\n       lea       esi,[rax-1]\r\nM01_L02:\r\n       cmp       ecx,esi\r\n-      jle       short M01_L00\r\n-      jmp       short M01_L07\r\n+      jg        short M01_L07\r\nM01_L03:\r\n+      lea       eax,[rsi+rcx]\r\n+      shr       eax,1\r\n+      movsxd    r8,eax\r\n+      mov       r8d,[rdi+r8*4]\r\n       cmp       edx,r8d\r\n-      jg        short M01_L04\r\n-      xor       r9d,r9d\r\n+      jl        short M01_L00\r\n+      cmp       edx,r8d\r\n+      jle       short M01_L04\r\n+      mov       r9d,1\r\n       jmp       short M01_L01\r\nM01_L04:\r\n-      mov       r9d,1\r\n+      xor       r9d,r9d\r\n       jmp       short M01_L01\r\nM01_L05:\r\n       lea       ecx,[rax+1]\r\n       jmp       short M01_L02\r\nM01_L06:\r\n       pop       rbp\r\n       ret\r\nM01_L07:\r\n       mov       eax,ecx\r\n       not       eax\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 83<\/code><\/pre>\n<p>We can see that the main change here is a block that&#8217;s moved (the bulk of <code>M01_L00<\/code> moving down to <code>M01_L03<\/code>). In .NET 9, the <code>lo &lt;= hi<\/code> &#8220;stay in the loop check&#8221; (<code>cmp ecx,esi<\/code>) is a backward conditional branch (<code>jle short M01_L00<\/code>), where every iteration of the loop except for the last jumps back to the top (<code>M01_L00<\/code>). 
In .NET 10, it instead does a forward branch to exit the loop only in the rarer case, otherwise falling through to the body of the loop in the common case, and then unconditionally branching back.<\/p>\n<h3>GC Write Barriers<\/h3>\n<p>The .NET garbage collector (GC) works on a generational model, organizing the managed heap according to how long objects have been alive. The newest allocations land in &#8220;generation 0&#8221; (gen0), objects that have survived at least one collection are promoted to &#8220;generation 1&#8221; (gen1), and those that have been around for longer end up in &#8220;generation 2&#8221; (gen2). This is based on the premise that most objects are temporary, and that once an object has been around for a while, it&#8217;s likely to stick around for a while longer. Splitting up the heap into generations enables quickly collecting gen0 objects by scanning only the gen0 heap for remaining references to those objects. The expectation is that all, or at least the vast majority, of references to a gen0 object are also in gen0. Of course, if a reference to a gen0 object snuck into gen1 or gen2, not scanning gen1 or gen2 during a gen0 collection could be, well, bad. To avoid that case, the JIT collaborates with the GC to track references from older to younger generations. Whenever there&#8217;s a reference write that could cross a generation, the JIT emits a call to a helper that tracks the information in a &#8220;card table,&#8221; and when the GC runs, it consults this table to see if it needs to scan a portion of the higher generations. That helper is referred to as a &#8220;GC write barrier.&#8221; Since a write barrier is potentially employed on every reference write, it must be super fast, and in fact the runtime has several different variations of write barriers so that the JIT can pick one optimized for the given situation. 
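<\/p>\n<p>To make the card table idea concrete, here is a deliberately simplified model. It is a sketch under invented assumptions, not the runtime&#8217;s implementation: <code>ToyHeap<\/code>, <code>WriteReference<\/code>, and <code>CardSize<\/code> are hypothetical names, addresses are plain integers, and the real barriers are native helpers whose exact filtering differs.<\/p>\n<pre><code class=\"language-csharp\">using System;\r\nusing System.Collections.Generic;\r\n\r\n\/\/ Toy model only: addresses are plain integers and every name here is hypothetical.\r\nvar heap = new ToyHeap(heapSize: 1024, gen0Start: 768);\r\nheap.WriteReference(fieldAddress: 100, targetAddress: 800); \/\/ old -&gt; young: card gets marked\r\nheap.WriteReference(fieldAddress: 900, targetAddress: 800); \/\/ young -&gt; young: nothing to record\r\nheap.WriteReference(fieldAddress: 300, targetAddress: 200); \/\/ old -&gt; old: nothing to record\r\nConsole.WriteLine(string.Join(\", \", heap.DirtyCards())); \/\/ prints 1\r\n\r\nclass ToyHeap\r\n{\r\n    private const int CardSize = 64; \/\/ bytes of heap covered by one card (made-up value)\r\n    private readonly bool[] _cards;\r\n    private readonly int _gen0Start;  \/\/ addresses at or above this are \"gen0\" in this model\r\n\r\n    public ToyHeap(int heapSize, int gen0Start)\r\n    {\r\n        _cards = new bool[heapSize \/ CardSize];\r\n        _gen0Start = gen0Start;\r\n    }\r\n\r\n    \/\/ Stand-in for \"store the reference, then run the write barrier\".\r\n    public void WriteReference(int fieldAddress, int targetAddress)\r\n    {\r\n        \/\/ (The store of the reference itself would happen here.)\r\n        bool fieldIsOld = fieldAddress &lt; _gen0Start;\r\n        bool targetIsYoung = targetAddress &gt;= _gen0Start;\r\n        if (fieldIsOld &amp;&amp; targetIsYoung)\r\n        {\r\n            \/\/ Record that this region of the older heap may point into gen0.\r\n            _cards[fieldAddress \/ CardSize] = true;\r\n        }\r\n    }\r\n\r\n    \/\/ What a gen0 collection in this model would consult to find extra regions to scan.\r\n    public IEnumerable&lt;int&gt; DirtyCards()\r\n    {\r\n        for (int i = 0; i &lt; _cards.Length; i++)\r\n        {\r\n            if (_cards[i]) yield return i;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>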
Of course, the fastest write barrier is one that doesn&#8217;t need to exist at all, so as with bounds checks, the JIT also exerts energy to try to prove when write barriers aren&#8217;t needed, eliding them when it can. And it can even more in .NET 10.<\/p>\n<p><code>ref structs<\/code>, referred to in runtime vernacular as &#8220;byref-like types,&#8221; can never live on the heap, which means any reference fields in them will similarly never live on the heap. As such, if the JIT can prove that a reference write is targeting a field of a <code>ref struct<\/code>, it can elide the write barrier. Consider this example:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private object _object = new();\r\n\r\n    [Benchmark]\r\n    public MyRefStruct Test() =&gt; new MyRefStruct() { Obj1 = _object, Obj2 = _object, Obj3 = _object };\r\n\r\n    public ref struct MyRefStruct\r\n    {\r\n        public object Obj1;\r\n        public object Obj2;\r\n        public object Obj3;\r\n    }\r\n}<\/code><\/pre>\n<p>In the .NET 9 assembly, we can see three write barriers (<code>CORINFO_HELP_CHECKED_ASSIGN_REF<\/code>) corresponding to the three fields in <code>MyRefStruct<\/code> in the benchmark:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test()\r\n       push      r15\r\n       push      r14\r\n       push      rbx\r\n       mov       rbx,rsi\r\n       mov       r15,[rdi+8]\r\n       mov       rsi,r15\r\n       mov       r14,r15\r\n       mov       rdi,rbx\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       lea       rdi,[rbx+8]\r\n       mov       rsi,r14\r\n       call      
CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       lea       rdi,[rbx+10]\r\n       mov       rsi,r15\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       mov       rax,rbx\r\n       pop       rbx\r\n       pop       r14\r\n       pop       r15\r\n       ret\r\n; Total bytes of code 59<\/code><\/pre>\n<p>With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111576\">dotnet\/runtime#111576<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111733\">dotnet\/runtime#111733<\/a> in .NET 10, all of those write barriers are elided:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test()\r\n       mov       rax,[rdi+8]\r\n       mov       rcx,rax\r\n       mov       rdx,rax\r\n       mov       [rsi],rcx\r\n       mov       [rsi+8],rdx\r\n       mov       [rsi+10],rax\r\n       mov       rax,rsi\r\n       ret\r\n; Total bytes of code 25<\/code><\/pre>\n<p>Much more impactful, however, are <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112060\">dotnet\/runtime#112060<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112227\">dotnet\/runtime#112227<\/a>, which have to do with &#8220;return buffers.&#8221; When a .NET method is typed to return a value, the runtime has to decide how that value gets from the callee back to the caller. For small types, like integers, floating-point numbers, pointers, or object references, the answer is simple: the value can be passed back via one or more CPU registers reserved for return values, making the operation essentially free. But not all values fit neatly into registers. Larger value types, such as structs with multiple fields, require a different strategy. In these cases, the caller allocates a &#8220;return buffer,&#8221; a block of memory, typically in the caller&#8217;s stack frame, and the caller passes a pointer to that buffer as a hidden argument to the method. The method then writes the return value directly into that buffer in order to provide the caller with the data. 
When it comes to write barriers, the challenge here is that there historically hasn&#8217;t been a requirement that the return buffer be on the stack; it&#8217;s technically feasible it could have been allocated on the heap, even if it rarely or never is. And since the callee doesn&#8217;t know where the buffer lives, any object reference writes needed to be tracked with GC write barriers. We can see that with a simple benchmark:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _firstName = \"Jane\", _lastName = \"Smith\", _address = \"123 Main St\", _city = \"Anytown\";\r\n\r\n    [Benchmark]\r\n    public Person GetPerson() =&gt; new(_firstName, _lastName, _address, _city);\r\n\r\n    public record struct Person(string FirstName, string LastName, string Address, string City);\r\n}<\/code><\/pre>\n<p>On .NET 9, each field of the returned value type is incurring a <code>CORINFO_HELP_CHECKED_ASSIGN_REF<\/code> write barrier:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.GetPerson()\r\n       push      r15\r\n       push      r14\r\n       push      r13\r\n       push      rbx\r\n       mov       rbx,rsi\r\n       mov       rsi,[rdi+8]\r\n       mov       r15,[rdi+10]\r\n       mov       r14,[rdi+18]\r\n       mov       r13,[rdi+20]\r\n       mov       rdi,rbx\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       lea       rdi,[rbx+8]\r\n       mov       rsi,r15\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       lea       rdi,[rbx+10]\r\n       mov       rsi,r14\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       lea       rdi,[rbx+18]\r\n       
mov       rsi,r13\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       mov       rax,rbx\r\n       pop       rbx\r\n       pop       r13\r\n       pop       r14\r\n       pop       r15\r\n       ret\r\n; Total bytes of code 81<\/code><\/pre>\n<p>Now in .NET 10, the calling convention has been updated to require that the return buffer live on the stack (if the caller wants the data somewhere else, it&#8217;s responsible for subsequently doing that copy). And because the return buffer is now guaranteed to be on the stack, the JIT can elide all GC write barriers as part of returning values.<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.GetPerson()\r\n       mov       rax,[rdi+8]\r\n       mov       rcx,[rdi+10]\r\n       mov       rdx,[rdi+18]\r\n       mov       rdi,[rdi+20]\r\n       mov       [rsi],rax\r\n       mov       [rsi+8],rcx\r\n       mov       [rsi+10],rdx\r\n       mov       [rsi+18],rdi\r\n       mov       rax,rsi\r\n       ret\r\n; Total bytes of code 35<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111636\">dotnet\/runtime#111636<\/a> from <a href=\"https:\/\/github.com\/a74nh\">@a74nh<\/a> is also interesting from a performance perspective because, as is common in optimization, it trades off one thing for another. Prior to this change, Arm64 had one universal write barrier helper for all GC modes. This change brings Arm64 in line with x64 by routing through a <code>WriteBarrierManager<\/code> that selects among multiple <code>JIT_WriteBarrier<\/code> variants based on runtime configuration. In doing so, it makes each Arm64 write barrier a bit more expensive, by adding region checks and moving to a region-aware card marking scheme, but in exchange it lets the GC do less work: fewer cards in the card table get marked, and the GC can scan more precisely. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106191\">dotnet\/runtime#106191<\/a> also helps reduce the cost of write barriers on Arm64 by tightening the hot-path comparisons and eliminating some avoidable saves and restores.<\/p>\n<h3>Instruction Sets<\/h3>\n<p>.NET continues to see meaningful optimizations and improvements across all supported architectures, along with various architecture-specific improvements. Here are a handful of examples.<\/p>\n<h4>Arm SVE<\/h4>\n<p>APIs for Arm SVE were introduced in .NET 9. As noted in the <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-9\/#arm-sve\">Arm SVE<\/a> section of last year&#8217;s post, enabling SVE is a multi-year effort, and in .NET 10, support is still considered experimental. However, the support has continued to be improved and extended, with PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115775\">dotnet\/runtime#115775<\/a> from <a href=\"https:\/\/github.com\/snickolls-arm\">@snickolls-arm<\/a> adding <code>BitwiseSelect<\/code> methods, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117711\">dotnet\/runtime#117711<\/a> from <a href=\"https:\/\/github.com\/jacob-crawley\">@jacob-crawley<\/a> adding <code>MaxPairwise<\/code> and <code>MinPairwise<\/code> methods, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117051\">dotnet\/runtime#117051<\/a> from <a href=\"https:\/\/github.com\/jonathandavies-arm\">@jonathandavies-arm<\/a> adding <code>VectorTableLookup<\/code> methods.<\/p>\n<h4>Arm64<\/h4>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111893\">dotnet\/runtime#111893<\/a> from <a href=\"https:\/\/github.com\/jonathandavies-arm\">@jonathandavies-arm<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111904\">dotnet\/runtime#111904<\/a> from <a href=\"https:\/\/github.com\/jonathandavies-arm\">@jonathandavies-arm<\/a>, <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111452\">dotnet\/runtime#111452<\/a> from <a href=\"https:\/\/github.com\/jonathandavies-arm\">@jonathandavies-arm<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112235\">dotnet\/runtime#112235<\/a> from <a href=\"https:\/\/github.com\/jonathandavies-arm\">@jonathandavies-arm<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111797\">dotnet\/runtime#111797<\/a> from <a href=\"https:\/\/github.com\/snickolls-arm\">@snickolls-arm<\/a> all improved .NET&#8217;s support for utilizing Arm64&#8217;s multi-operation compound instructions. For example, when implementing a compare and branch, rather than emitting a <code>cmp<\/code> against 0 followed by a <code>beq<\/code> instruction, the JIT may now emit a single <code>cbz<\/code> (&#8220;Compare and Branch on Zero&#8221;) instruction.<\/p>\n<h4>APX<\/h4>\n<p>Intel&#8217;s Advanced Performance Extensions (APX) was announced in 2023 as an extension of the x86\/x64 instruction set. It expands the number of general-purpose registers from 16 to 32 and adds new instructions such as conditional operations designed to reduce memory traffic, improve performance, and lower power consumption. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106557\">dotnet\/runtime#106557<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108796\">dotnet\/runtime#108796<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113237\">dotnet\/runtime#113237<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a> essentially teach the JIT how to speak the new dialect of assembly code (the REX2 and expanded EVEX encodings), while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108799\">dotnet\/runtime#108799<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> updates the JIT to be able to use the expanded set of registers, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116035\">dotnet\/runtime#116035<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> enables new push and pop instructions for working with such registers. 
The most impactful new instructions in APX are around conditional compares (<code>ccmp<\/code>), a concept the JIT already supports from targeting other instruction sets, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111072\">dotnet\/runtime#111072<\/a> from <a href=\"https:\/\/github.com\/anthonycanino\">@anthonycanino<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112153\">dotnet\/runtime#112153<\/a> from <a href=\"https:\/\/github.com\/anthonycanino\">@anthonycanino<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116445\">dotnet\/runtime#116445<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a> all teach the JIT how to make good use of these new instructions with APX.<\/p>\n<h4>AVX512<\/h4>\n<p>.NET 8 added broad support for AVX512, and .NET 9 significantly improved its handling and adoption throughout the core libraries. .NET 10 includes a plethora of additional related optimizations:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109258\">dotnet\/runtime#109258<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109267\">dotnet\/runtime#109267<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> expand the number of places the JIT is able to use EVEX embedded broadcasts, a feature that lets vector instructions read a single scalar element from memory and implicitly replicate it to all the lanes of the vector, without needing a separate broadcast or shuffle operation.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108824\">dotnet\/runtime#108824<\/a> removes a redundant sign extension from broadcasts.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116117\">dotnet\/runtime#116117<\/a> from <a href=\"https:\/\/github.com\/alexcovington\">@alexcovington<\/a> improves the code generated for <code>Vector.Max<\/code> and 
<code>Vector.Min<\/code> when AVX512 is supported.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109474\">dotnet\/runtime#109474<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> improves &#8220;containment&#8221; (where an instruction can be eliminated by having its behaviors fully encapsulated by another instruction) for AVX512 widening intrinsics (similar containment-related improvements were made in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110736\">dotnet\/runtime#110736<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111778\">dotnet\/runtime#111778<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a>).<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111853\">dotnet\/runtime#111853<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> improves <code>Vector128\/256\/512.Dot<\/code> to be better accelerated with AVX512.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110195\">dotnet\/runtime#110195<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110307\">dotnet\/runtime#110307<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117118\">dotnet\/runtime#117118<\/a> all improve how vector masks are handled. In AVX512, masks are special registers that can be included as part of various instructions to control which subset of vector elements should be utilized (each bit in a mask corresponds to one element in the vector). This enables operating on only part of a vector without needing extra branching or shuffling.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115981\">dotnet\/runtime#115981<\/a> improves zeroing (where the JIT emits instructions to zero out memory, often as part of initializing a stack frame) on AVX512. 
After zeroing as much as it can with 64-byte instructions, it was falling back to using 16-byte instructions, when it could have used 32-byte instructions.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110662\">dotnet\/runtime#110662<\/a> improves the code generated for <code>ExtractMostSignificantBits<\/code> (which is used by many of the searching algorithms in the core libraries) when working with <code>short<\/code> and <code>ushort<\/code> (and <code>char<\/code>, as most of those core library implementations reinterpret cast <code>char<\/code> as one of the others) by using EVEX mask support.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113864\">dotnet\/runtime#113864<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> improves the code generated for <code>ConditionalSelect<\/code> when not used with mask registers.<\/li>\n<\/ul>\n<h4>AVX10.2<\/h4>\n<p>.NET 9 added support and intrinsics for the AVX10.1 instruction set. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111209\">dotnet\/runtime#111209<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a>, .NET 10 adds support and intrinsics for the AVX10.2 instruction set. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112535\">dotnet\/runtime#112535<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a> optimizes floating-point min\/max operations with AVX10.2 instructions, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111775\">dotnet\/runtime#111775<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a> enables floating-point conversions to utilize AVX10.2.<\/p>\n<h4>GFNI<\/h4>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109537\">dotnet\/runtime#109537<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> adds intrinsics for the GFNI (Galois Field New Instructions) instruction set, which can be used for accelerating operations over Galois fields GF(2^8). These are common in cryptography, error correction, and data encoding.<\/p>\n<h4>VPCLMULQDQ<\/h4>\n<p><code>VPCLMULQDQ<\/code> is an x86 instruction set extension that adds vector support to the older <code>PCLMULQDQ<\/code> instruction, which performs carry-less multiplication over 64-bit integers. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109137\">dotnet\/runtime#109137<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> adds new intrinsic APIs for <code>VPCLMULQDQ<\/code>.<\/p>\n<h3>Miscellaneous<\/h3>\n<p>Many more PRs than the ones I&#8217;ve already called out have gone into the JIT this release. Here are a few more:<\/p>\n<ul>\n<li><strong>Eliminating some covariance checks<\/strong>. Writing into arrays of reference types can require &#8220;covariance checks.&#8221; Imagine you have a class <code>Base<\/code> and two derived types <code>Derived1 : Base<\/code> and <code>Derived2 : Base<\/code>. Since arrays in .NET are covariant, I can have a <code>Derived1[]<\/code> and cast it successfully to a <code>Base[]<\/code>, but under the covers that&#8217;s still a <code>Derived1[]<\/code>. 
That means, for example, that any attempt to store a <code>Derived2<\/code> into that array should fail at runtime, even if it compiles. To achieve that, the JIT needs to insert such covariance checks when writing into arrays, but just like with bounds checking and write barriers, the JIT can elide those checks when it can prove statically that they&#8217;re not necessary. One such case involves sealed types. If the JIT sees an array of type <code>T[]<\/code> and <code>T<\/code> is known to be sealed, the array must be exactly a <code>T[]<\/code> and not some <code>DerivedT[]<\/code>, because there can&#8217;t be a <code>DerivedT<\/code>. So with a benchmark like this:\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private List&lt;string&gt; _list = new() { \"hello\" };\r\n\r\n    [Benchmark]\r\n    public void Set() =&gt; _list[0] = \"world\";\r\n}<\/code><\/pre>\n<p>as long as the JIT can see that the array underlying the <code>List&lt;string&gt;<\/code> is a <code>string[]<\/code> (<code>string<\/code> is sealed), it shouldn&#8217;t need a covariance check. 
In .NET 9, we get this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Set()\r\n       push      rbx\r\n       mov       rbx,[rdi+8]\r\n       cmp       dword ptr [rbx+10],0\r\n       je        short M00_L00\r\n       mov       rdi,[rbx+8]\r\n       xor       esi,esi\r\n       mov       rdx,78914920A038\r\n       call      System.Runtime.CompilerServices.CastHelpers.StelemRef(System.Object[], IntPtr, System.Object)\r\n       inc       dword ptr [rbx+14]\r\n       pop       rbx\r\n       ret\r\nM00_L00:\r\n       call      qword ptr [78D1F80558A8]\r\n       int       3\r\n; Total bytes of code 44<\/code><\/pre>\n<p>Note that <code>CastHelpers.StelemRef<\/code> call&#8230; that&#8217;s the helper that performs the write with the covariance check. But now in .NET 10, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107116\">dotnet\/runtime#107116<\/a> (which teaches the JIT how to resolve the exact type for the field of the closed generic), we get this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Set()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       cmp       dword ptr [rax+10],0\r\n       je        short M00_L00\r\n       mov       rcx,[rax+8]\r\n       mov       edx,[rcx+8]\r\n       test      rdx,rdx\r\n       je        short M00_L01\r\n       mov       rdx,75E2B9009A40\r\n       mov       [rcx+10],rdx\r\n       inc       dword ptr [rax+14]\r\n       pop       rbp\r\n       ret\r\nM00_L00:\r\n       call      qword ptr [762368116760]\r\n       int       3\r\nM00_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 58<\/code><\/pre>\n<p>No covariance check, thank you very much.<\/li>\n<li><strong>More strength reduction<\/strong>. &#8220;Strength reduction&#8221; is a classic compiler optimization that replaces more expensive operations, like multiplications, with cheaper ones, like additions. 
In .NET 9, this was used to transform indexed loops that used multiplied offsets (e.g. <code>index * elementSize<\/code>) into loops that simply incremented a pointer-like offset (e.g. <code>offset += elementSize<\/code>), cutting down on arithmetic overhead and improving performance. In .NET 10, strength reduction has been extended, in particular with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110222\">dotnet\/runtime#110222<\/a>. This enables the JIT to detect multiple loop induction variables with different step sizes and strength-reduce them by leveraging their greatest common divisor (GCD). Essentially, it creates a single primary induction variable based on the GCD of the varying step sizes, and then recovers each original induction variable by appropriately scaling. Consider this example:\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"numbers\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(\"128514801826028643102849196099776734920914944609068831724328541639470403818631040\")]\r\n    public int[] Parse(string numbers)\r\n    {\r\n        int[] results = new int[numbers.Length];\r\n        for (int i = 0; i &lt; numbers.Length; i++)\r\n        {\r\n            results[i] = numbers[i] - '0';\r\n        }\r\n\r\n        return results;\r\n    }\r\n}<\/code><\/pre>\n<p>In this benchmark, we&#8217;re iterating through an input <code>string<\/code>, which is a collection of 2-byte <code>char<\/code> elements, and we&#8217;re storing the results into an array of 4-byte <code>int<\/code> elements. 
The core loop in the .NET 9 assembly looks like this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\nM00_L00:\r\n       mov       edx,ecx\r\n       movzx     edi,word ptr [rbx+rdx*2+0C]\r\n       add       edi,0FFFFFFD0\r\n       mov       [rax+rdx*4+10],edi\r\n       inc       ecx\r\n       cmp       r15d,ecx\r\n       jg        short M00_L00<\/code><\/pre>\n<p>The <code>movzx edi,word ptr [rbx+rdx*2+0C]<\/code> is the read of <code>numbers[i]<\/code>, and the <code>mov [rax+rdx*4+10],edi<\/code> is the assignment to <code>results[i]<\/code>. <code>rdx<\/code> here is <code>i<\/code>, so each iteration effectively has to compute <code>i*2<\/code> to get the byte offset of the <code>char<\/code> at index <code>i<\/code>, and similarly <code>i*4<\/code> to get the byte offset of the <code>int<\/code> at index <code>i<\/code>. Now here&#8217;s the .NET 10 assembly:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\nM00_L00:\r\n       movzx     edx,word ptr [rbx+rcx+0C]\r\n       add       edx,0FFFFFFD0\r\n       mov       [rax+rcx*2+10],edx\r\n       add       rcx,2\r\n       dec       r15d\r\n       jne       short M00_L00<\/code><\/pre>\n<p>The multiplication in the <code>numbers[i]<\/code> read is gone. Instead, it can just increment <code>rcx<\/code> by 2 on each iteration, treating that as the offset to the <code>i<\/code>th <code>char<\/code>, and then instead of multiplying by 4 to compute the <code>int<\/code> offset, it just multiplies by 2.<\/li>\n<li><strong>CSE integration with SSA<\/strong>. As with most compilers, the JIT employs common subexpression elimination (CSE) to find identical computations and avoid doing them repeatedly. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106637\">dotnet\/runtime#106637<\/a> teaches the JIT how to do so in a more consistent manner by more fully integrating CSE with its Static Single Assignment (SSA) representation. This in turn allows for more optimizations to kick in, e.g. 
some of the strength reduction done around loop induction variables in .NET 9 wasn&#8217;t applying as much as it should have, and now it will.<\/li>\n<li><strong><code>return someCondition ? true : false<\/code><\/strong>. There are often multiple ways to represent the same thing, but compilers frequently recognize certain patterns during optimization while missing other equivalent ones, so it can behoove the compiler to first normalize the representations to the better-recognized one. There&#8217;s a really common and interesting case of this with <code>return someCondition<\/code>, where, for reasons relating to the JIT&#8217;s internal representation, the JIT is better able to optimize with the equivalent <code>return someCondition ? true : false<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107499\">dotnet\/runtime#107499<\/a> normalizes to the latter. As an example of this, consider this benchmark:\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"i\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public bool Test1(int i)\r\n    {\r\n        if (i &gt; 10 &amp;&amp; i &lt; 20) return true;\r\n        return false;\r\n    }\r\n\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public bool Test2(int i) =&gt; i &gt; 10 &amp;&amp; i &lt; 20;\r\n}<\/code><\/pre>\n<p>On .NET 9, that results in this assembly code for <code>Test1<\/code>:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test1(Int32)\r\n       sub       esi,0B\r\n       cmp       esi,8\r\n       setbe     al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 13<\/code><\/pre>\n<p>The 
JIT has successfully recognized that it can replace the two comparisons with a subtraction and a single comparison, as if the <code class=\"language-csharp\">i &gt; 10 &amp;&amp; i &lt; 20<\/code> were instead written as <code class=\"language-csharp\">(uint)(i - 11) &lt;= 8<\/code>. But for <code>Test2<\/code>, .NET 9 produces this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test2(Int32)\r\n       xor       eax,eax\r\n       cmp       esi,14\r\n       setl      cl\r\n       movzx     ecx,cl\r\n       cmp       esi,0A\r\n       cmovg     eax,ecx\r\n       ret\r\n; Total bytes of code 18<\/code><\/pre>\n<p>Because of how the return condition is being represented internally by the JIT, it&#8217;s missing this particular optimization, and the assembly code more directly reflects what was written in the C#. But in .NET 10, because of this normalization, we get this for <code>Test2<\/code>, exactly what we got for <code>Test1<\/code>:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test2(Int32)\r\n       sub       esi,0B\r\n       cmp       esi,8\r\n       setbe     al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 13<\/code><\/pre>\n<\/li>\n<li><strong>Bit tests<\/strong>. The C# compiler has a lot of flexibility in how it emits <code>switch<\/code> and <code>is<\/code> expressions. Consider a case like this: <code>c is ' ' or '\\t' or '\\r' or '\\n'<\/code>. It could emit that as the equivalent of a series of cascading <code>if<\/code>\/<code>else<\/code> branches, as an IL <code>switch<\/code> instruction, as a bit test, or as combinations of those. The C# compiler, though, doesn&#8217;t have all of the information the JIT has, such as whether the process is 32-bit or 64-bit, or knowledge of what instructions cost on given hardware. 
With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107831\">dotnet\/runtime#107831<\/a>, the JIT will now recognize more such expressions that can be implemented as a bit test and generate the code accordingly.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"c\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments('s')]\r\n    public void Test(char c)\r\n    {\r\n        if (c is ' ' or '\\t' or '\\r' or '\\n' or '.')\r\n        {\r\n            Handle(c);\r\n        }\r\n\r\n        [MethodImpl(MethodImplOptions.NoInlining)]\r\n        static void Handle(char c) { }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">0.4537 ns<\/td>\n<td style=\"text-align: right;\">1.02<\/td>\n<td style=\"text-align: right;\">58 B<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">0.1304 ns<\/td>\n<td style=\"text-align: right;\">0.29<\/td>\n<td style=\"text-align: right;\">44 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>It&#8217;s also common to see bit tests implemented in C# against shifted values; a constant mask is created with bits set at various indices, and then an incoming value to check is tested by shifting a bit to the corresponding index and seeing whether it aligns with one in the mask. 
For example, here is how <code>Regex<\/code> tests to see whether a provided <code>UnicodeCategory<\/code> is one of those that compose the &#8220;word&#8221; class (<code>\\w<\/code>):<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Globalization;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"uc\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(UnicodeCategory.DashPunctuation)]\r\n    public bool Test(UnicodeCategory uc) =&gt; (WordCategoriesMask &amp; (1 &lt;&lt; (int)uc)) != 0;\r\n\r\n    private const int WordCategoriesMask =\r\n        1 &lt;&lt; (int)UnicodeCategory.UppercaseLetter |\r\n        1 &lt;&lt; (int)UnicodeCategory.LowercaseLetter |\r\n        1 &lt;&lt; (int)UnicodeCategory.TitlecaseLetter |\r\n        1 &lt;&lt; (int)UnicodeCategory.ModifierLetter |\r\n        1 &lt;&lt; (int)UnicodeCategory.OtherLetter |\r\n        1 &lt;&lt; (int)UnicodeCategory.NonSpacingMark |\r\n        1 &lt;&lt; (int)UnicodeCategory.DecimalDigitNumber |\r\n        1 &lt;&lt; (int)UnicodeCategory.ConnectorPunctuation;\r\n}<\/code><\/pre>\n<p>Previously, the JIT would end up emitting that similar to how it&#8217;s written: a shift followed by a test. 
Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111979\">dotnet\/runtime#111979<\/a> from <a href=\"https:\/\/github.com\/varelen\">@varelen<\/a>, it can emit it as a bit test.<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(System.Globalization.UnicodeCategory)\r\n       mov       eax,1\r\n       shlx      eax,eax,esi\r\n       test      eax,4013F\r\n       setne     al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 22\r\n\r\n; .NET 10\r\n; Tests.Test(System.Globalization.UnicodeCategory)\r\n       mov       eax,4013F\r\n       bt        eax,esi\r\n       setb      al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 15<\/code><\/pre>\n<\/li>\n<li><strong>Redundant sign extensions<\/strong>. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111305\">dotnet\/runtime#111305<\/a>, the JIT can now remove more redundant sign extensions (when you take a smaller size type, e.g. <code>int<\/code>, and convert it to a larger size type, e.g. <code>long<\/code>, while preserving the value&#8217;s sign). For example, with a test like this <code>public ulong Test(int x) =&gt; (uint)x &lt; 10 ? (ulong)x &lt;&lt; 60 : 0<\/code>, the JIT can now emit a <code>mov<\/code> (just copy the bits) instead of <code>movsxd<\/code> (move with sign extension), since it knows from the first comparison that the shift will only ever be performed with a non-negative <code>x<\/code>.<\/li>\n<li><strong>Better division with BMI2<\/strong>. 
If the BMI2 instruction set is available, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116198\">dotnet\/runtime#116198<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a> the JIT can now use the <code>mulx<\/code> instruction (&#8220;Unsigned Multiply Without Affecting Flags&#8221;) to implement integer division, e.g.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"value\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(12345)]\r\n    public ulong Div10(ulong value) =&gt; value \/ 10;\r\n}<\/code><\/pre>\n<p>results in:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Div10(UInt64)\r\n       mov       rdx,0CCCCCCCCCCCCCCCD\r\n       mov       rax,rsi\r\n       mul       rdx\r\n       mov       rax,rdx\r\n       shr       rax,3\r\n       ret\r\n; Total bytes of code 24\r\n\r\n; .NET 10\r\n; Tests.Div10(UInt64)\r\n       mov       rdx,0CCCCCCCCCCCCCCCD\r\n       mulx      rax,rax,rsi\r\n       shr       rax,3\r\n       ret\r\n; Total bytes of code 20<\/code><\/pre>\n<\/li>\n<li><strong>Better range comparison<\/strong>. 
When comparing a <code>ulong<\/code> expression against <code>uint.MaxValue<\/code>, rather than being emitted as a comparison, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113037\">dotnet\/runtime#113037<\/a> from <a href=\"https:\/\/github.com\/shunkino\">@shunkino<\/a> it can be handled more efficiently as a shift.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"x\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(12345)]\r\n    public bool Test(ulong x) =&gt; x &lt;= uint.MaxValue;\r\n}<\/code><\/pre>\n<p>resulting in:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(UInt64)\r\n       mov       eax,0FFFFFFFF\r\n       cmp       rsi,rax\r\n       setbe     al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 15\r\n\r\n; .NET 10\r\n; Tests.Test(UInt64)\r\n       shr       rsi,20\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<\/li>\n<li><strong>Better dead branch elimination<\/strong>. The JIT&#8217;s branch optimizer is already able to use implications from comparisons to statically determine the outcome of other branches. 
For example, if I have this:\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"x\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public void Test(int x)\r\n    {\r\n        if (x &gt; 100)\r\n        {\r\n            if (x &gt; 10)\r\n            {\r\n                Console.WriteLine();\r\n            }\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>the JIT generates this on .NET 9:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(Int32)\r\n       cmp       esi,64\r\n       jg        short M00_L00\r\n       ret\r\nM00_L00:\r\n       jmp       qword ptr [7766D3E64FA8]\r\n; Total bytes of code 12<\/code><\/pre>\n<p>Note there&#8217;s only a single comparison against 100 (0x64), with the comparison against 10 elided (as it&#8217;s implied by the previous comparison). However, there are many variations to this, and not all of them were handled equally well. 
Consider this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"x\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public void Test(int x)\r\n    {\r\n        if (x &lt; 16)\r\n            return;\r\n\r\n        if (x &lt; 8)\r\n            Console.WriteLine();\r\n    }\r\n}<\/code><\/pre>\n<p>Here, the <code>Console.WriteLine<\/code> ideally wouldn&#8217;t appear in the emitted assembly at all, as it&#8217;s never reachable. Alas, on .NET 9, we get this (the <code>jmp<\/code> instruction here is a tail call to <code>WriteLine<\/code>):<\/p>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Test(Int32)\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       cmp       esi,10\r\n       jl        short M00_L00\r\n       cmp       esi,8\r\n       jge       short M00_L00\r\n       pop       rbp\r\n       jmp       qword ptr [731ED8054FA8]\r\nM00_L00:\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 23<\/code><\/pre>\n<p>With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111766\">dotnet\/runtime#111766<\/a> on .NET 10, it successfully recognizes that by the time it gets to the <code>x &lt; 8<\/code>, that condition will always be <code>false<\/code>, and it can be eliminated. And once it&#8217;s eliminated, the initial branch is also unnecessary. So the whole method reduces to this:<\/p>\n<pre><code class=\"language-x86asm\">; .NET 10\r\n; Tests.Test(Int32)\r\n       ret\r\n; Total bytes of code 1<\/code><\/pre>\n<\/li>\n<li><strong>Better floating-point conversion<\/strong>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114410\">dotnet\/runtime#114410<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114597\">dotnet\/runtime#114597<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111595\">dotnet\/runtime#111595<\/a> from <a href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> all speed up conversions between floating-point and integers, such as by using <code>vcvtusi2ss<\/code> when AVX512 is available, or when it isn&#8217;t, avoiding the intermediate <code>double<\/code> conversion.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"i\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public float Compute(uint i) =&gt; i;\r\n}<\/code><\/pre>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Compute(UInt32)\r\n       mov       eax,esi\r\n       vxorps    xmm0,xmm0,xmm0\r\n       vcvtsi2sd xmm0,xmm0,rax\r\n       vcvtsd2ss xmm0,xmm0,xmm0\r\n       ret\r\n; Total bytes of code 16\r\n\r\n; .NET 10\r\n; Tests.Compute(UInt32)\r\n       vxorps    xmm0,xmm0,xmm0\r\n       vcvtusi2ss xmm0,xmm0,esi\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<\/li>\n<li><strong>Unrolling<\/strong>. When using <code>CopyTo<\/code> (or other &#8220;memmove&#8221;-based operations) with a constant source, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108576\">dotnet\/runtime#108576<\/a> reduces costs by avoiding a redundant memory load. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109036\">dotnet\/runtime#109036<\/a> unblocks more unrolling on Arm64 for <code>Equals<\/code>\/<code>StartsWith<\/code>\/<code>EndsWith<\/code>. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110893\">dotnet\/runtime#110893<\/a> enables unrolling non-zero fills (unrolling already happened for zero fills).\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private char[] _chars = new char[100];\r\n\r\n    [Benchmark]\r\n    public void Fill() =&gt; _chars.AsSpan(0, 16).Fill('x');\r\n}<\/code><\/pre>\n<pre><code class=\"language-x86asm\">; .NET 9\r\n; Tests.Fill()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rdi,[rdi+8]\r\n       test      rdi,rdi\r\n       je        short M00_L00\r\n       cmp       dword ptr [rdi+8],10\r\n       jb        short M00_L00\r\n       add       rdi,10\r\n       mov       esi,10\r\n       mov       edx,78\r\n       call      qword ptr [7F3093FBF1F8]; System.SpanHelpers.Fill[[System.Char, System.Private.CoreLib]](Char ByRef, UIntPtr, Char)\r\n       nop\r\n       pop       rbp\r\n       ret\r\nM00_L00:\r\n       call      qword ptr [7F3093787810]\r\n       int       3\r\n; Total bytes of code 49\r\n\r\n; .NET 10\r\n; Tests.Fill()\r\n       push      rbp\r\n       mov       rbp,rsp\r\n       mov       rax,[rdi+8]\r\n       test      rax,rax\r\n       je        short M00_L00\r\n       cmp       dword ptr [rax+8],10\r\n       jl        short M00_L00\r\n       add       rax,10\r\n       vbroadcastss ymm0,dword ptr [78EFC70C9340]\r\n       vmovups   [rax],ymm0\r\n       vzeroupper\r\n       pop      
 rbp\r\n       ret\r\nM00_L00:\r\n       call      qword ptr [78EFC7447B88]\r\n       int       3\r\n; Total bytes of code 48<\/code><\/pre>\n<p>Note the call to <code>SpanHelpers.Fill<\/code> in the .NET 9 assembly and the absence of it in the .NET 10 assembly.<\/p><\/li>\n<\/ul>\n<h2>Native AOT<\/h2>\n<p>Native AOT is the ability for a .NET application to be compiled directly to assembly code at build-time. The JIT is still used for code generation, but only at build time; the JIT isn&#8217;t part of the shipping app at all, and no code generation is performed at run-time. As such, most of the optimizations to the JIT already discussed, as well as optimizations throughout the rest of this post, apply to Native AOT equally. Native AOT presents some unique opportunities and challenges, however.<\/p>\n<p>One superpower of the Native AOT tool chain is the ability to interpret (some) code at build-time and use the results of that execution rather than performing the operation at run-time. This is particularly relevant for static constructors, where the constructor&#8217;s code can be interpreted to initialize various <code>static readonly<\/code> fields, and then the contents of those fields can be persisted into the generated assembly; at run-time, the contents need only be rehydrated from the assembly rather than being recomputed. This also potentially helps to make more code redundant and removable, if for example the static constructor and anything it (and only it) referenced were no longer needed. Of course, it would be dangerous and problematic if any arbitrary code could be run during build, so instead there&#8217;s a very filtered allow list and specialized support for the most common and appropriate constructs. 
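<\/p>\n<p>As a rough illustration of the kind of code this targets, here is a minimal sketch of a type whose static constructor does only simple, deterministic work over an array and a span. The <code>Lookup<\/code> type is hypothetical, and whether the tool chain actually preinitializes this exact shape is an assumption rather than something verified here:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Reading from the table forces the static constructor to have run (or to have been preinitialized).\r\nConsole.WriteLine(Lookup.Square(7));\r\n\r\n\/\/ Hypothetical type: its static constructor only fills a small table.\r\nstatic class Lookup\r\n{\r\n    private static readonly int[] s_squares = new int[16];\r\n\r\n    static Lookup()\r\n    {\r\n        \/\/ Fill the table through a span over the array.\r\n        Span&lt;int&gt; squares = s_squares.AsSpan();\r\n        for (int i = 0; i &lt; squares.Length; i++)\r\n        {\r\n            squares[i] = i * i;\r\n        }\r\n    }\r\n\r\n    public static int Square(int i) =&gt; s_squares[i];\r\n}<\/code><\/pre>\n<p>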
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107575\">dotnet\/runtime#107575<\/a> augments this &#8220;preinitialization&#8221; capability to support spans sourced from arrays, such that using methods like <code>.AsSpan()<\/code> doesn&#8217;t cause preinitialization to bail out. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114374\">dotnet\/runtime#114374<\/a> also improved preinitialization, removing restrictions around accessing static fields of other types, calling methods on other types that have their own static constructors, and dereferencing pointers.<\/p>\n<p>Conversely, Native AOT has its own challenges, specifically that size really matters and is harder to control. With a JIT available at run-time, code generation for only exactly what&#8217;s needed can be deferred until run-time. With Native AOT, <em>all<\/em> assembly code generation needs to be done at build-time, which means the Native AOT tool chain needs to work hard to determine the least amount of code it needs to emit to support everything the app might need to do at run-time. Most of the effort on Native AOT in any given release ends up being about helping it to further decrease the size of generated code. 
For example:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117411\">dotnet\/runtime#117411<\/a> enables folding bodies of generic instantiations of the same method, essentially avoiding duplication by using the same code for copies of the same method where possible.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117080\">dotnet\/runtime#117080<\/a> similarly helps improve the existing method body deduplication logic.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117345\">dotnet\/runtime#117345<\/a> from <a href=\"https:\/\/github.com\/huoyaoyuan\">@huoyaoyuan<\/a> tweaks a bit of code in reflection that would previously artificially force the code to be preserved for all enumerators of all generic instantiations of every collection type.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112782\">dotnet\/runtime#112782<\/a> adds the same distinction that already existed for <code>MethodTable<\/code>s for non-generic methods (&#8220;is this method table visible to user code or not&#8221;) to generic methods, allowing more metadata for the non-user visible ones to be optimized away.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118718\">dotnet\/runtime#118718<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118832\">dotnet\/runtime#118832<\/a> enable size reductions related to boxed enums. The former tweaks a few methods in <code>Thread<\/code>, <code>GC<\/code>, and <code>CultureInfo<\/code> to avoid boxing some enums, which means the code for those needn&#8217;t be generated. The latter tweaks the implementation of <code>RuntimeHelpers.CreateSpan<\/code>, which is used by the C# compiler as part of creating spans with constructs like collection expressions. 
<code>CreateSpan<\/code> is a generic method, and the Native AOT toolchain&#8217;s whole-program analysis would end up treating the generic type parameter as being &#8220;reflected on,&#8221; meaning the compiler had to assume any type parameter would be used with reflection and thus had to preserve relevant metadata. When used with enums, it would need to ensure support for boxed enums was kept around, and <code>System.Console<\/code> has such a use with an enum. That in turn meant that a simple &#8220;hello, world&#8221; console app couldn&#8217;t trim away that boxed enum reflection support; now it can.<\/li>\n<\/ul>\n<h2>VM<\/h2>\n<p>The .NET runtime offers a wide range of services to managed applications, most obviously the garbage collector and the JIT compiler, but it also encompasses a host of other capabilities: assembly and type loading, exception handling, virtual method dispatch, interoperability support, stub generation, and so on. Collectively, all of these features are referred to as being a part of the .NET Virtual Machine (VM).<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108167\">dotnet\/runtime#108167<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109135\">dotnet\/runtime#109135<\/a> rewrote various runtime helpers from C in the runtime to C# in <code>System.Private.CoreLib<\/code>, including the &#8220;unboxing&#8221; helpers, which are used to unbox <code>object<\/code>s to value types in niche scenarios. This rewrite avoids overheads associated with transitioning between native and managed code and also gives the JIT an opportunity to optimize in the context of callers, such as with inlining. 
Note that these unboxing helpers are only used in obscure situations, so it requires a bit of a complicated benchmark to demonstrate the impact:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[DisassemblyDiagnoser(0)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private object[] _objects = [new GenericStruct&lt;MyStruct, object&gt;()];\r\n\r\n    [Benchmark]\r\n    public void Unbox() =&gt; Unbox&lt;GenericStruct&lt;MyStruct, object&gt;&gt;(_objects[0]);\r\n\r\n    private void Unbox&lt;T&gt;(object o) where T : struct, IStaticMethod&lt;T&gt;\r\n    {\r\n        T? local = (T?)o;\r\n        if (local.HasValue)\r\n        {\r\n            T localCopy = local.Value;\r\n            T.Method(ref localCopy);\r\n        }\r\n    }\r\n\r\n    public interface IStaticMethod&lt;T&gt;\r\n    {\r\n        public static abstract void Method(ref T param);\r\n    }\r\n\r\n    struct MyStruct : IStaticMethod&lt;MyStruct&gt;\r\n    {\r\n        public static void Method(ref MyStruct param) { }\r\n    }\r\n\r\n    struct GenericStruct&lt;T, V&gt; : IStaticMethod&lt;GenericStruct&lt;T, V&gt;&gt; where T : IStaticMethod&lt;T&gt;\r\n    {\r\n        public T Value;\r\n\r\n        [MethodImpl(MethodImplOptions.NoInlining)]\r\n        public static void Method(ref GenericStruct&lt;T, V&gt; value) =&gt; T.Method(ref value.Value);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Unbox<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: 
right;\">1.626 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">148 B<\/td>\n<\/tr>\n<tr>\n<td>Unbox<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.379 ns<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<td style=\"text-align: right;\">148 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>What it means to move the implementation from native to managed is most easily seen just by looking at the generated assembly. Other than uninteresting and non-impactful changes in which registers happen to get assigned, the only real difference between .NET 9 and .NET 10 is a single instruction:<\/p>\n<pre><code class=\"language-diff\">-      call      CORINFO_HELP_UNBOX_NULLABLE\r\n+      call      System.Runtime.CompilerServices.CastHelpers.Unbox_Nullable(Byte ByRef, System.Runtime.CompilerServices.MethodTable*, System.Object)<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115284\">dotnet\/runtime#115284<\/a> streamlines how the runtime sets up and tears down the little code blocks (&#8220;funclets&#8221;) the runtime uses to implement <code>catch<\/code>\/<code>finally<\/code> on x64. Historically, these funclets acted a lot like tiny functions, saving and restoring non-volatile CPU registers on entry and exit (a &#8220;non-volatile&#8221; register is effectively one where the caller can expect it to contain the same value after a function call as it did before the function call). This PR changes the contract so that funclets no longer need to preserve those registers themselves; instead, the runtime takes care of preserving them. 
That shrinks the prologs and epilogs the JIT emits for funclets, reduces instruction count and code size, and lowers the cost of entering and exiting exception handlers.<\/p>\n<p>With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114462\">dotnet\/runtime#114462<\/a>, the runtime now uses a single shared &#8220;template&#8221; for many of the small executable &#8220;stubs&#8221; it needs at runtime; stubs are tiny chunks of machine code that act as jump points, call counters, or patchable trampolines. Previously, each memory allocation for stubs would regenerate the same instructions over and over. The new approach builds one copy of the stub code in a read-only page and then maps that same physical page into every place it&#8217;s needed, while giving each allocation its own writable page for the per-stub data that changes at runtime. This lets hundreds of virtual stub pages all point to one physical code page, cutting memory use, reducing startup work, and improving instruction cache locality.<\/p>\n<p>Also interesting are <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117218\">dotnet\/runtime#117218<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116031\">dotnet\/runtime#116031<\/a>, which together help optimize the generation of stack traces in large, heavily multi-threaded applications when being profiled.<\/p>\n<h2>Threading<\/h2>\n<p>The <code>ThreadPool<\/code> underlies most work in most .NET apps and services. It&#8217;s a critical-path component that has to be able to deal with all manner of workloads efficiently.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109841\">dotnet\/runtime#109841<\/a> implemented an opt-in feature that <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112796\">dotnet\/runtime#112796<\/a> then enabled by default for .NET 10. The idea behind it is fairly straightforward, but to understand it, we first need to examine how the thread pool queues work items. 
The thread pool has multiple queues, typically one &#8220;global&#8221; queue and then one &#8220;local&#8221; queue per thread in the pool. When threads outside of the pool queue work, that work goes to the global queue, and when a thread pool thread queues work, especially a <code>Task<\/code> or work related to an <code>await<\/code>, that work item typically goes to that thread&#8217;s local queue. Then when a thread pool thread finishes whatever it was doing and goes in search of more work, it first checks its own local queue (treating its local queue as highest priority), then if that&#8217;s empty it checks the global queue, and then if that&#8217;s empty it goes and helps out the other threads in the pool by searching their queues for work to be done. This is all in an attempt to a) minimize contention on the global queue (if threads are mainly queueing and dequeuing from their own local queue, they&#8217;re not contending with each other), and b) prioritize work that&#8217;s logically part of already started work (the only way for work to get into a local queue is if that thread was processing a work item that created it). Generally, this works out well, but sometimes we get into degenerate scenarios, typically when an app does something that goes against best practices&#8230; like blocking.<\/p>\n<p>Blocking a thread pool thread means that thread can&#8217;t service other work coming into the pool. If the blocking is brief, it&#8217;s generally fine, and if it&#8217;s longer, the thread pool tries to accommodate it by injecting more threads and finding a steady state at which things hum along. But a certain kind of blocking can be really problematic: <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/should-i-expose-synchronous-wrappers-for-asynchronous-methods\/\">&#8220;sync over async&#8221;<\/a>. 
With &#8220;sync over async&#8221;, one thread blocks while waiting for an asynchronous operation to complete, and if <em>that<\/em> operation needs to do something on the thread pool in order to complete, you now have one thread pool thread blocked waiting for another thread pool thread to pick up a particular work item and process it. This can quickly lead to the whole pool getting into a jam&#8230; especially with the thread local queues. If a thread is blocked on an operation that depends on work items in that thread&#8217;s local queue getting processed, that work item being picked off now depends on the global queue being exhausted and another thread coming along and stealing the work item from this thread&#8217;s queue. If there&#8217;s a steady stream of incoming work into the global queue, though, that will never happen; essentially, the highest priority work item has become the lowest priority work item.<\/p>\n<p>So, back to these PRs. The idea is fairly simple: when the thread is about to block, and in particular when it&#8217;s about to block waiting on a <code>Task<\/code>, it first dumps its entire local queue into the global queue. That way, this work which was highest priority for the blocked thread has a fairer chance of being processed by other threads, rather than it being the lowest priority work for everyone. 
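<\/p>\n<p>For reference, the &#8220;sync over async&#8221; shape being discussed looks something like the following minimal sketch. The <code>ComputeAsync<\/code> and <code>ComputeSync<\/code> names are hypothetical; on an otherwise idle pool this completes immediately, and it only becomes a problem under the kind of load described above:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Hypothetical async operation; after the yield, its continuation needs a thread to run on.\r\nstatic async Task&lt;int&gt; ComputeAsync()\r\n{\r\n    await Task.Yield();\r\n    return 42;\r\n}\r\n\r\n\/\/ \"Sync over async\": block the calling thread until the asynchronous operation completes.\r\nstatic int ComputeSync() =&gt; ComputeAsync().GetAwaiter().GetResult();\r\n\r\n\/\/ Run the blocking wrapper on a thread pool thread, which then sits blocked\r\n\/\/ until some pool thread runs the continuation.\r\nint result = await Task.Run(() =&gt; ComputeSync());\r\nConsole.WriteLine(result);<\/code><\/pre>\n<p>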
We can try to see the impact of this with a specifically-crafted workload:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\"\r\n\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing System.Diagnostics;\r\n\r\nint numThreads = Environment.ProcessorCount;\r\nThreadPool.SetMaxThreads(numThreads, 1);\r\n\r\nManualResetEventSlim start = new();\r\nCountdownEvent allDone = new(numThreads);\r\nnew Thread(() =&gt;\r\n{\r\n    while (true)\r\n    {\r\n        for (int i = 0; i &lt; 10_000; i++)\r\n        {\r\n            ThreadPool.QueueUserWorkItem(_ =&gt; Thread.SpinWait(1));\r\n        }\r\n\r\n        Thread.Yield();\r\n    }\r\n}) { IsBackground = true }.Start();\r\n\r\nfor (int i = 0; i &lt; numThreads; i++)\r\n{\r\n    ThreadPool.QueueUserWorkItem(_ =&gt;\r\n    {\r\n        start.Wait();\r\n        TaskCompletionSource tcs = new();\r\n\r\n        const int LocalItemsPerThread = 4;\r\n        var remaining = LocalItemsPerThread;\r\n        for (int j = 0; j &lt; LocalItemsPerThread; j++)\r\n        {\r\n            Task.Run(() =&gt;\r\n            {\r\n                Thread.SpinWait(100);\r\n                if (Interlocked.Decrement(ref remaining) == 0)\r\n                {\r\n                    tcs.SetResult();\r\n                }\r\n            });\r\n        }\r\n\r\n        tcs.Task.Wait();\r\n        allDone.Signal();\r\n    });\r\n}\r\n\r\nvar sw = Stopwatch.StartNew();\r\nstart.Set();\r\nConsole.WriteLine(allDone.Wait(20_000) ?\r\n    $\"Completed: {sw.ElapsedMilliseconds}ms\" :\r\n    $\"Timed out after {sw.ElapsedMilliseconds}ms\");<\/code><\/pre>\n<p>This is:<\/p>\n<ul>\n<li>creating a noise thread that tries to keep the global queue inundated with new work<\/li>\n<li>queuing <code>Environment.ProcessorCount<\/code> work items, each of which queues four work items to the local queue that all do a little work and then blocks on a <code>Task<\/code> until they all complete<\/li>\n<li>waiting for 
those <code>Environment.ProcessorCount<\/code> work items to complete<\/li>\n<\/ul>\n<p>When I run this on .NET 9, it hangs, because there&#8217;s so much work in the global queue, no threads are able to process those sub-work items that are necessary to unblock the main work items:<\/p>\n<pre><code class=\"language-txt\">Timed out after 20002ms<\/code><\/pre>\n<p>On .NET 10, it generally completes almost instantly:<\/p>\n<pre><code class=\"language-txt\">Completed: 4ms<\/code><\/pre>\n<p>Some other tweaks were made to the pool as well:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115402\">dotnet\/runtime#115402<\/a> reduced the amount of spin-waiting done on Arm processors, bringing it more in line with x64.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112789\">dotnet\/runtime#112789<\/a> reduced the frequency at which the thread pool checked CPU utilization, as in some circumstances it was adding noticeable overhead, and made the frequency configurable.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108135\">dotnet\/runtime#108135<\/a> from <a href=\"https:\/\/github.com\/AlanLiu90\">@AlanLiu90<\/a> removed a bit of lock contention that could happen under load when starting new thread pool threads.<\/li>\n<\/ul>\n<p>On the subject of locking, and only for developers that find themselves with a strong need to do really low-level low-lock development, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107843\">dotnet\/runtime#107843<\/a> from <a href=\"https:\/\/github.com\/hamarb123\">@hamarb123<\/a> adds two new methods to the <code>Volatile<\/code> class: <code>ReadBarrier<\/code> and <code>WriteBarrier<\/code>. A read barrier has &#8220;load acquire&#8221; semantics, and is sometimes referred to as a &#8220;downwards fence&#8221;: it prevents instructions from being reordered in such a way that memory accesses below\/after the barrier move to above\/before it. 
In contrast, a write barrier has &#8220;store release&#8221; semantics, and is sometimes referred to as an &#8220;upwards fence&#8221;: it prevents instructions from being reordered in such a way that memory accesses above\/before the barrier move to below\/after it. I find it helps to think about this with regards to a <code>lock<\/code>:<\/p>\n<pre><code class=\"language-csharp\">A;\r\nlock (...)\r\n{\r\n    B;\r\n}\r\nC;<\/code><\/pre>\n<p>While in practice the implementation may provide stronger fences, by specification entering a <code>lock<\/code> has acquire semantics and exiting a <code>lock<\/code> has release semantics. Imagine if the instructions in the above code could be reordered like this:<\/p>\n<pre><code class=\"language-csharp\">A;\r\nB;\r\nlock (...)\r\n{\r\n}\r\nC;<\/code><\/pre>\n<p>or like this:<\/p>\n<pre><code class=\"language-csharp\">A;\r\nlock (...)\r\n{\r\n}\r\nB;\r\nC;<\/code><\/pre>\n<p>Both of those would be really bad. Thankfully, the barriers help us here. The acquire \/ read barrier semantics of entering the lock are a downwards fence: logically the brace that starts the lock puts downwards pressure on everything inside the lock to not move to before it, and the brace that ends the lock puts upwards pressure on everything inside the lock to not move to after it. Interestingly, nothing about the semantics of these barriers prevents this from happening:<\/p>\n<pre><code class=\"language-csharp\">lock (...)\r\n{\r\n    A;\r\n    B;\r\n    C;\r\n}<\/code><\/pre>\n<p>These barriers are referred to as &#8220;half fences&#8221;; the read barrier prevents later things from moving earlier, but not the other way around, and the write barrier prevents earlier things from moving later, but not the other way around. 
(As it happens, though, while not required by specification, today the implementation of <code>lock<\/code> does use a full barrier on both enter and exit, so nothing before or after a <code>lock<\/code> will move into it.)<\/p>\n<p>For <code>Task<\/code> in .NET 10, <code>Task.WhenAll<\/code> has a few changes to improve its performance. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110536\">dotnet\/runtime#110536<\/a> avoids a temporary collection allocation when needing to buffer up tasks from an <code>IEnumerable&lt;Task&gt;<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public Task WhenAllAlloc()\r\n    {\r\n        AsyncTaskMethodBuilder t = default;\r\n        Task whenAll = Task.WhenAll(from i in Enumerable.Range(0, 2) select t.Task);\r\n        t.SetResult();\r\n        return whenAll;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WhenAllAlloc<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">216.8 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">496 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAllAlloc<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">181.9 ns<\/td>\n<td style=\"text-align: right;\">0.84<\/td>\n<td style=\"text-align: 
right;\">408 B<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117715\">dotnet\/runtime#117715<\/a> from <a href=\"https:\/\/github.com\/CuteLeon\">@CuteLeon<\/a> avoids the overhead of the <code>Task.WhenAll<\/code> altogether when the input ends up just being a single task, in which case it simply returns that task instance.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public Task WhenAllAlloc()\r\n    {\r\n        AsyncTaskMethodBuilder t = default;\r\n        Task whenAll = Task.WhenAll([t.Task]);\r\n        t.SetResult();\r\n        return whenAll;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WhenAllAlloc<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">72.73 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">144 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAllAlloc<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">33.06 ns<\/td>\n<td style=\"text-align: right;\">0.45<\/td>\n<td style=\"text-align: right;\">72 B<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a 
href=\"https:\/\/devblogs.microsoft.com\/dotnet\/an-introduction-to-system-threading-channels\/\"><code>System.Threading.Channels<\/code><\/a> is one of the lesser-known but quite useful areas of threading in .NET (you can watch <a href=\"https:\/\/www.youtube.com\/watch?v=J3IQBI5HVOw\">Yet Another &#8220;Highly Technical Talk&#8221; with Hanselman and Toub<\/a> from Build 2025 to learn more about it). If you find yourself needing a queue to hand off some data between a producer and a consumer, you should likely look into <code>Channel&lt;T&gt;<\/code>. The library was introduced in .NET Core 3.0 as a small, robust, and fast producer\/consumer queueing mechanism; it&#8217;s evolved since, such as gaining a <code>ReadAllAsync<\/code> method for consuming the contents of a channel as an <code>IAsyncEnumerable&lt;T&gt;<\/code> and a <code>TryPeek<\/code> method for peeking at its contents without consuming. The original release supported <code>Channel.CreateUnbounded<\/code> and <code>Channel.CreateBounded<\/code> methods, and .NET 9 augmented those with a <code>Channel.CreateUnboundedPrioritized<\/code>. .NET 10 continues to expand on channels, both with functional improvements (such as with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116097\">dotnet\/runtime#116097<\/a>, which adds an unbuffered channel implementation), and with performance improvements.<\/p>\n<p>.NET 10 helps to reduce the overall memory consumption of an application using channels. One of the cross-cutting features channels support is cancellation: you can cancel pretty much any interaction with a channel, which sports asynchronous methods for both producing and consuming data. When a reader or writer needs to pend, it creates (or reuses a pooled instance of) an <code>AsyncOperation<\/code> object that gets added to a queue; a later writer or reader that&#8217;s then able to satisfy a pending reader or writer dequeues one and marks it as completed. 
These queues were implemented with arrays, which makes it challenging to remove an entry from the middle of the queue if the associated operation gets canceled. So, rather than trying, the implementation simply left the canceled object in the queue, and when that object was eventually dequeued, it was just thrown away and the dequeuer tried again. The theory was that, at steady state, you would quickly dequeue any canceled operations, and it&#8217;d be better to not exert a lot of effort to try to remove them more quickly. As it turns out, that assumption was problematic for some scenarios where the workload wasn&#8217;t balanced, e.g. lots of readers would pend and time out due to a lack of writers, and each of those timed-out readers would leave behind a canceled item in the queue. The next time a writer came along, yes, all those canceled readers would get cleared out, but in the meantime, they would manifest as a notable increase in working set.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116021\">dotnet\/runtime#116021<\/a> addresses that by switching from array-backed queues to linked-list-based queues. The waiter objects themselves double as the nodes in the linked lists, so the only additional memory overhead is a couple of fields for the previous and next nodes in the linked list. But even that modest increase is undesirable, so as part of the PR, it also tries to find compensating optimizations to balance things out. 
It&#8217;s able to remove a field from <code>Channel&lt;T&gt;<\/code>&#8216;s custom implementation of <code>IValueTaskSource&lt;T&gt;<\/code> by applying a similar optimization that was made to <code>ManualResetValueTaskSourceCore&lt;T&gt;<\/code> in a previous release: it&#8217;s incredibly rare for an awaiter to supply an <code>ExecutionContext<\/code> (via use of the awaiter&#8217;s <code>OnCompleted<\/code> rather than <code>UnsafeOnCompleted<\/code> method), and even more so for that to happen when there&#8217;s also a non-default <code>TaskScheduler<\/code> or <code>SynchronizationContext<\/code> that needs to be stored, so rather than using two fields for those concepts, they just get grouped into one (which means that in the super duper rare case where both are needed, it incurs an extra allocation). Another field is removed for storing a <code>CancellationToken<\/code> on the instance, which on .NET Core can be retrieved from other available state. These changes then actually result in the size of the <code>AsyncOperation<\/code> waiter instance decreasing rather than increasing. Win win. It&#8217;s hard to see the impact of this change on throughput; it&#8217;s easier to just see what the impact is on working set in the degenerate case where canceled operations are never removed. 
If I run this code:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\"\r\n\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing System.Threading.Channels;\r\n\r\nChannel&lt;int&gt; c = Channel.CreateUnbounded&lt;int&gt;();\r\n\r\nfor (int i = 0; ; i++)\r\n{\r\n    CancellationTokenSource cts = new();\r\n    var vt = c.Reader.ReadAsync(cts.Token);\r\n    cts.Cancel();\r\n    await ((Task)vt.AsTask()).ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing);\r\n\r\n    if (i % 100_000 == 0)\r\n    {\r\n        Console.WriteLine($\"Working set: {Environment.WorkingSet:N0}b\");\r\n    }\r\n}<\/code><\/pre>\n<p>in .NET 9 I get output like this, with an ever increasing working set:<\/p>\n<pre><code class=\"language-txt\">Working set: 31,588,352b\r\nWorking set: 164,884,480b\r\nWorking set: 210,698,240b\r\nWorking set: 293,711,872b\r\nWorking set: 385,495,040b\r\nWorking set: 478,158,848b\r\nWorking set: 553,385,984b\r\nWorking set: 608,206,848b\r\nWorking set: 699,695,104b\r\nWorking set: 793,034,752b\r\nWorking set: 885,309,440b\r\nWorking set: 986,103,808b\r\nWorking set: 1,094,234,112b\r\nWorking set: 1,156,239,360b\r\nWorking set: 1,255,198,720b\r\nWorking set: 1,347,604,480b\r\nWorking set: 1,439,879,168b\r\nWorking set: 1,532,284,928b<\/code><\/pre>\n<p>and in .NET 10, I get output like this, with a nice level steady state working set:<\/p>\n<pre><code class=\"language-txt\">Working set: 33,030,144b\r\nWorking set: 44,826,624b\r\nWorking set: 45,481,984b\r\nWorking set: 45,613,056b\r\nWorking set: 45,875,200b\r\nWorking set: 45,875,200b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b\r\nWorking set: 46,006,272b<\/code><\/pre>\n<h2>Reflection<\/h2>\n<p>.NET 
8 added the <code>[UnsafeAccessor]<\/code> attribute, which enables a developer to write an <code>extern<\/code> method that matches up with some non-visible member the developer wants to be able to use, and the runtime fixes up the accesses to be just as if the target member was being used directly. .NET 9 then extended it with generic support.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Reflection;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private List&lt;int&gt; _list = new List&lt;int&gt;(16);\r\n    private FieldInfo _itemsField = typeof(List&lt;int&gt;).GetField(\"_items\", BindingFlags.NonPublic | BindingFlags.Instance)!;\r\n\r\n    private static class Accessors&lt;T&gt;\r\n    {\r\n        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = \"_items\")]\r\n        public static extern ref T[] GetItems(List&lt;T&gt; list);\r\n    }\r\n\r\n    [Benchmark]\r\n    public int[] WithReflection() =&gt; (int[])_itemsField.GetValue(_list)!;\r\n\r\n    [Benchmark]\r\n    public int[] WithUnsafeAccessor() =&gt; Accessors&lt;int&gt;.GetItems(_list);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithReflection<\/td>\n<td style=\"text-align: right;\">2.6397 ns<\/td>\n<\/tr>\n<tr>\n<td>WithUnsafeAccessor<\/td>\n<td style=\"text-align: right;\">0.7300 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>But there are still gaps in that story. 
The signature of the <code>UnsafeAccessor<\/code> member needs to align with the signature of the target member, but what if that target member has parameters whose types aren&#8217;t visible to the code writing the <code>UnsafeAccessor<\/code>? Or, what if the target member is static? There&#8217;s no way for the developer to express in the <code>UnsafeAccessor<\/code> on which type the target member exists.<\/p>\n<p>For these scenarios, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114881\">dotnet\/runtime#114881<\/a> augments the story with the <code>[UnsafeAccessorType]<\/code> attribute. The <code>UnsafeAccessor<\/code> method can type the relevant parameters as <code>object<\/code> but then adorn them with an <code>[UnsafeAccessorType(\"...\")]<\/code> that provides a fully-qualified name of the target type. There are a bunch of examples of this being used in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115583\">dotnet\/runtime#115583<\/a>, which replaces most of the cross-library reflection done between libraries in .NET itself with use of <code>[UnsafeAccessor]<\/code>. An example of where this is handy is with a cyclic relationship between <code>System.Net.Http<\/code> and <code>System.Security.Cryptography<\/code>. <code>System.Net.Http<\/code> sits above <code>System.Security.Cryptography<\/code>, referencing it for critical features like <code>X509Certificate<\/code>. But <code>System.Security.Cryptography<\/code> needs to be able to make HTTP requests in order to download OCSP information, and with <code>System.Net.Http<\/code> referencing <code>System.Security.Cryptography<\/code>, <code>System.Security.Cryptography<\/code> can&#8217;t in turn explicitly reference <code>System.Net.Http<\/code>. It can, however, use reflection or <code>[UnsafeAccessor]<\/code> and <code>[UnsafeAccessorType]<\/code> to do so, and it does. 
It used to use reflection, now in .NET 10 it uses <code>[UnsafeAccessor]<\/code>.<\/p>\n<p>There are a few other nice improvements in and around reflection. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105814\">dotnet\/runtime#105814<\/a> from <a href=\"https:\/\/github.com\/huoyaoyuan\">@huoyaoyuan<\/a> updates <code>ActivatorUtilities.CreateFactory<\/code> to remove a layer of delegates. <code>CreateFactory<\/code> returns an <code>ObjectFactory<\/code> delegate, but under the covers the implementation was creating a <code>Func&lt;...&gt;<\/code> and then creating an <code>ObjectFactory<\/code> delegate for that <code>Func&lt;...&gt;<\/code>&#8216;s <code>Invoke<\/code> method. The PR changes it to just create the <code>ObjectFactory<\/code> initially, which means every invocation avoids one layer of delegate invocation.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.DependencyInjection;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(\"Microsoft.Extensions.DependencyInjection.Abstractions\", \"9.0.9\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core10_0).WithNuGet(\"Microsoft.Extensions.DependencyInjection.Abstractions\", \"10.0.0-rc.1.25451.107\"));\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic partial class Tests\r\n{\r\n    private IServiceProvider _sp = new ServiceCollection().BuildServiceProvider();\r\n    private ObjectFactory _factory = ActivatorUtilities.CreateFactory(typeof(object), Type.EmptyTypes);\r\n\r\n    [Benchmark]\r\n    public object CreateInstance() =&gt; 
_factory(_sp, null);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CreateInstance<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">8.136 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CreateInstance<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">6.676 ns<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112350\">dotnet\/runtime#112350<\/a> reduces some overheads and allocations as part of parsing and rendering <code>TypeName<\/code>s.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Reflection.Metadata;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"t\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(typeof(Dictionary&lt;List&lt;int[]&gt;[,], List&lt;int?[][][,]&gt;&gt;[]))]\r\n    public string ParseAndGetName(Type t) =&gt; TypeName.Parse(t.FullName).FullName; \r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ParseAndGetName<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">5.930 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">12.25 KB<\/td>\n<td style=\"text-align: 
right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ParseAndGetName<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">4.305 us<\/td>\n<td style=\"text-align: right;\">0.73<\/td>\n<td style=\"text-align: right;\">5.75 KB<\/td>\n<td style=\"text-align: right;\">0.47<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113803\">dotnet\/runtime#113803<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a> improves how <code>DebugDirectoryBuilder<\/code> in <code>System.Reflection.Metadata<\/code> uses <code>DeflateStream<\/code> to embed a PDB. The code was previously buffering the compressed output into an intermediate <code>MemoryStream<\/code>, and then that <code>MemoryStream<\/code> was being written to the <code>BlobBuilder<\/code>. With this change, the <code>DeflateStream<\/code> is wrapped directly around the <code>BlobBuilder<\/code>, enabling the compressed data to be propagated directly to <code>builder.WriteBytes<\/code>.<\/p>\n<h2>Primitives and Numerics<\/h2>\n<p>Every time I write one of these &#8220;Performance Improvements in .NET&#8221; posts, a part of me thinks &#8220;how could there possibly be more next time.&#8221; That&#8217;s especially true for core data types, which have received so much scrutiny over the years. Yet, here we are, with more to look at for .NET 10.<\/p>\n<p><code>DateTime<\/code> and <code>DateTimeOffset<\/code> get some love in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111112\">dotnet\/runtime#111112<\/a>, in particular with micro-optimizations around how instances are initialized. 
Similar tweaks show up in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111244\">dotnet\/runtime#111244<\/a> for <code>DateOnly<\/code>, <code>TimeOnly<\/code>, and <code>ISOWeek<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private DateTimeOffset _dto = new DateTimeOffset(2025, 9, 10, 0, 0, 0, TimeSpan.Zero);\r\n\r\n    [Benchmark]\r\n    public DateTimeOffset GetFutureTime() =&gt; _dto + TimeSpan.FromDays(1);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetFutureTime<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">6.012 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetFutureTime<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.029 ns<\/td>\n<td style=\"text-align: right;\">0.17<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Guid<\/code> gets several notable performance improvements in .NET 10. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105654\">dotnet\/runtime#105654<\/a> from <a href=\"https:\/\/github.com\/SirCxyrtyx\">@SirCxyrtyx<\/a> imbues <code>Guid<\/code> with an implementation of <code>IUtf8SpanParsable<\/code>. This not only allows <code>Guid<\/code> to be used in places where a generic parameter is constrained to <code>IUtf8SpanParsable<\/code>, it gives <code>Guid<\/code> overloads of <code>Parse<\/code> and <code>TryParse<\/code> that operate on UTF8 bytes. 
This means if you have UTF8 data, you don&#8217;t first need to transcode it to UTF16 in order to parse it, nor use <code>Utf8Parser.TryParse<\/code>, which isn&#8217;t as optimized as is <code>Guid.TryParse<\/code> (but which does enable parsing out a <code>Guid<\/code> from the beginning of a larger input).<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers.Text;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _utf8 = Encoding.UTF8.GetBytes(Guid.NewGuid().ToString(\"N\"));\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public Guid TranscodeParse()\r\n    {\r\n        Span&lt;char&gt; scratch = stackalloc char[64];\r\n        ReadOnlySpan&lt;char&gt; input = Encoding.UTF8.TryGetChars(_utf8, scratch, out int charsWritten) ?\r\n            scratch.Slice(0, charsWritten) :\r\n            Encoding.UTF8.GetString(_utf8);\r\n\r\n        return Guid.Parse(input);\r\n    }\r\n\r\n    [Benchmark]\r\n    public Guid Utf8ParserParse() =&gt; Utf8Parser.TryParse(_utf8, out Guid result, out _, 'N') ? 
result : Guid.Empty;\r\n\r\n    [Benchmark]\r\n    public Guid GuidParse() =&gt; Guid.Parse(_utf8);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TranscodeParse<\/td>\n<td style=\"text-align: right;\">24.72 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Utf8ParserParse<\/td>\n<td style=\"text-align: right;\">19.34 ns<\/td>\n<td style=\"text-align: right;\">0.78<\/td>\n<\/tr>\n<tr>\n<td>GuidParse<\/td>\n<td style=\"text-align: right;\">16.47 ns<\/td>\n<td style=\"text-align: right;\">0.67<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Char<\/code>, <code>Rune<\/code>, and <code>Version<\/code> also gained <code>IUtf8SpanParsable<\/code> implementations, in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105773\">dotnet\/runtime#105773<\/a> from <a href=\"https:\/\/github.com\/lilinus\">@lilinus<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109252\">dotnet\/runtime#109252<\/a> from <a href=\"https:\/\/github.com\/lilinus\">@lilinus<\/a>. There&#8217;s not much of a performance benefit here for <code>char<\/code> and <code>Rune<\/code>; implementing the interface mainly yields consistency and the ability to use these types with generic routines parameterized based on that interface. 
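<\/p>\n<p>To make the point about generic routines concrete, here&#8217;s a minimal sketch (not code from any of these PRs) of a helper constrained to <code>IUtf8SpanParsable&lt;T&gt;<\/code>. The <code>ParseUtf8<\/code> name and the sample inputs are made up for illustration, and it assumes a project targeting .NET 10 with implicit usings enabled, since that&#8217;s where <code>Guid<\/code> and <code>Version<\/code> pick up the interface:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0\r\n\r\n\/\/ Hypothetical inputs, already encoded as UTF8 via u8 literals.\r\nConsole.WriteLine(ParseUtf8&lt;Guid&gt;(\"d85b1407-351d-4694-9392-03acc5870eb1\"u8));\r\nConsole.WriteLine(ParseUtf8&lt;Version&gt;(\"10.0.1\"u8));\r\nConsole.WriteLine(ParseUtf8&lt;int&gt;(\"42\"u8));\r\n\r\n\/\/ Hypothetical helper: one generic routine for any type that can parse itself from UTF8 bytes.\r\nstatic T ParseUtf8&lt;T&gt;(ReadOnlySpan&lt;byte&gt; utf8) where T : IUtf8SpanParsable&lt;T&gt; =&gt;\r\n    T.Parse(utf8, null);<\/code><\/pre>\n<p>The same helper already worked for <code>int<\/code>, which has implemented the interface since .NET 8; with .NET 10 it can also be handed <code>Guid<\/code>, <code>Version<\/code>, <code>char<\/code>, or <code>Rune<\/code> without any type-specific code.<\/p>\n<p>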
But <code>Version<\/code> gains the same kinds of performance (and usability) benefits as did <code>Guid<\/code>: it now sports support for parsing directly from UTF8, rather than needing to transcode first to UTF16 and then parse that.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _utf8 = Encoding.UTF8.GetBytes(new Version(\"123.456.789.10\").ToString());\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public Version TranscodeParse()\r\n    {\r\n        Span&lt;char&gt; scratch = stackalloc char[64];\r\n        ReadOnlySpan&lt;char&gt; input = Encoding.UTF8.TryGetChars(_utf8, scratch, out int charsWritten) ?\r\n            scratch.Slice(0, charsWritten) :\r\n            Encoding.UTF8.GetString(_utf8);\r\n\r\n        return Version.Parse(input);\r\n    }\r\n\r\n    [Benchmark]\r\n    public Version VersionParse() =&gt; Version.Parse(_utf8);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TranscodeParse<\/td>\n<td style=\"text-align: right;\">46.48 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>VersionParse<\/td>\n<td style=\"text-align: right;\">35.75 ns<\/td>\n<td style=\"text-align: right;\">0.77<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Sometimes performance improvements come about as a side-effect of other work. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110923\">dotnet\/runtime#110923<\/a> was intending to remove some pointer use from <code>Guid<\/code>&#8216;s formatting implementation, but in doing so, it ended up also slightly improving throughput of the (admittedly rarely used) &#8220;X&#8221; format.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private char[] _dest = new char[64];\r\n    private Guid _g = Guid.NewGuid();\r\n\r\n    [Benchmark]\r\n    public void FormatX() =&gt; _g.TryFormat(_dest, out int charsWritten, \"X\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FormatX<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">3.0584 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FormatX<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">0.7873 ns<\/td>\n<td style=\"text-align: right;\">0.26<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Random<\/code> (and its cryptographically-secure counterpart <code>RandomNumberGenerator<\/code>) continues to improve in .NET 10, with new methods (such as <code>Random.GetString<\/code> and <code>Random.GetHexString<\/code> from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112162\">dotnet\/runtime#112162<\/a>) for usability, but also importantly with performance improvements to existing methods. 
Both <code>Random<\/code> and <code>RandomNumberGenerator<\/code> were given a handy <code>GetItems<\/code> method in .NET 8; this method allows a caller to supply a set of choices and the number of items desired, allowing <code>Random{NumberGenerator}<\/code> to perform &#8220;sampling with replacement&#8221;, selecting an item from the set that number of times. In .NET 9, these implementations were optimized to special-case a power-of-2 number of choices that&#8217;s less than or equal to 256. In such a case, we can avoid many trips to the underlying source of randomness by requesting bytes in bulk, rather than requesting an <code>int<\/code> per element. With the power-of-2 choice count, we can simply mask each byte to produce the index into the choices while not introducing bias. In .NET 10, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107988\">dotnet\/runtime#107988<\/a> extends this to apply to non-power-of-2 cases, as well. We can&#8217;t just mask off bits as in the power-of-2 case, but we can do &#8220;rejection sampling,&#8221; which is just a fancy way of saying &#8220;if you randomly get a value outside of the allowed range, try again&#8221;.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private const string Base58 = \"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz\";\r\n\r\n    [Params(30)]\r\n    public int Length { get; set; }\r\n\r\n    [Benchmark]\r\n    public char[] WithRandom() =&gt; Random.Shared.GetItems&lt;char&gt;(Base58, Length);\r\n\r\n    [Benchmark]\r\n    public char[] WithRandomNumberGenerator() =&gt; 
RandomNumberGenerator.GetItems&lt;char&gt;(Base58, Length);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Length<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithRandom<\/td>\n<td>.NET 9.0<\/td>\n<td>30<\/td>\n<td style=\"text-align: right;\">144.42 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WithRandom<\/td>\n<td>.NET 10.0<\/td>\n<td>30<\/td>\n<td style=\"text-align: right;\">73.68 ns<\/td>\n<td style=\"text-align: right;\">0.51<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>WithRandomNumberGenerator<\/td>\n<td>.NET 9.0<\/td>\n<td>30<\/td>\n<td style=\"text-align: right;\">23,179.73 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WithRandomNumberGenerator<\/td>\n<td>.NET 10.0<\/td>\n<td>30<\/td>\n<td style=\"text-align: right;\">853.47 ns<\/td>\n<td style=\"text-align: right;\">0.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>decimal<\/code> operations, specifically multiplication and division, get a performance bump, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99212\">dotnet\/runtime#99212<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private decimal _n = 9.87654321m;\r\n    private decimal _d = 1.23456789m;\r\n\r\n    [Benchmark]\r\n    public decimal Divide() =&gt; _n \/ 
_d;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Divide<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">27.09 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Divide<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">23.68 ns<\/td>\n<td style=\"text-align: right;\">0.87<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>UInt128<\/code> division similarly gets some assistance in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99747\">dotnet\/runtime#99747<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a>, utilizing the X86 <code>DivRem<\/code> hardware intrinsic when dividing a value that&#8217;s larger than a <code>ulong<\/code> by a value that could fit in a <code>ulong<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private UInt128 _n = new UInt128(123, 456);\r\n    private UInt128 _d = new UInt128(0, 789);\r\n\r\n    [Benchmark]\r\n    public UInt128 Divide() =&gt; _n \/ _d;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Divide<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">27.3112 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Divide<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">0.5522 ns<\/td>\n<td style=\"text-align: 
right;\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>BigInteger<\/code> gets a few improvements as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115445\">dotnet\/runtime#115445<\/a> from <a href=\"https:\/\/github.com\/Rob-Hague\">@Rob-Hague<\/a> augments its <code>TryWriteBytes<\/code> method to use a direct memory copy when viable, namely when the number is non-negative such that it doesn&#8217;t need twos-complement tweaks.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private BigInteger _value = BigInteger.Parse(string.Concat(Enumerable.Repeat(\"1234567890\", 20)));\r\n    private byte[] _bytes = new byte[256];\r\n\r\n    [Benchmark]\r\n    public bool TryWriteBytes() =&gt; _value.TryWriteBytes(_bytes, out _);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryWriteBytes<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">27.814 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryWriteBytes<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">5.743 ns<\/td>\n<td style=\"text-align: right;\">0.21<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Also rare but fun, if you tried using <code>BigInteger.Parse<\/code> exactly with the string representation of <code>int.MinValue<\/code>, you&#8217;d end up allocating unnecessarily. 
That&#8217;s addressed by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104666\">dotnet\/runtime#104666<\/a> from <a href=\"https:\/\/github.com\/kzrnm\">@kzrnm<\/a>, which tweaks the handling of this corner-case so that it&#8217;s appropriately recognized as a case that can be represented using a singleton for <code>int.MinValue<\/code> (the singleton already existed, it just wasn&#8217;t applied in this case).<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _int32min = int.MinValue.ToString();\r\n\r\n    [Benchmark]\r\n    public BigInteger ParseInt32Min() =&gt; BigInteger.Parse(_int32min);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ParseInt32Min<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">80.54 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ParseInt32Min<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">71.59 ns<\/td>\n<td style=\"text-align: right;\">0.89<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One area that got a lot of attention in .NET 10 is <code>System.Numerics.Tensors<\/code>. 
The <code>System.Numerics.Tensors<\/code> library was introduced in .NET 8, focusing on a <code>TensorPrimitives<\/code> class that provided various numerical routines on spans of <code>float<\/code>. .NET 9 then expanded <code>TensorPrimitives<\/code> with more operations and generic versions of them. Now in .NET 10, <code>TensorPrimitives<\/code> gains even more operations, with many of the existing ones also made faster for various scenarios.<\/p>\n<p>To start, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112933\">dotnet\/runtime#112933<\/a> adds over 70 new overloads to <code>TensorPrimitives<\/code>, including operations like <code>StdDev<\/code>, <code>Average<\/code>, <code>Clamp<\/code>, <code>DivRem<\/code>, <code>IsNaN<\/code>, <code>IsPow2<\/code>, <code>Remainder<\/code>, and many more. The majority of these operations are also vectorized, using shared implementations that are parameterized with generic operators. For example, the entirety of the <code>Decrement&lt;T&gt;<\/code> implementation is:<\/p>\n<pre><code class=\"language-csharp\">public static void Decrement&lt;T&gt;(ReadOnlySpan&lt;T&gt; x, Span&lt;T&gt; destination) where T : IDecrementOperators&lt;T&gt; =&gt;\r\n    InvokeSpanIntoSpan&lt;T, DecrementOperator&lt;T&gt;&gt;(x, destination);<\/code><\/pre>\n<p>where <code>InvokeSpanIntoSpan<\/code> is a shared routine used by almost 60 methods, each of which supplies its own operator that&#8217;s then used in the heavily-optimized routine. 
In this case, the <code>DecrementOperator&lt;T&gt;<\/code> is simply this:<\/p>\n<pre><code class=\"language-csharp\">private readonly struct DecrementOperator&lt;T&gt; : IUnaryOperator&lt;T, T&gt; where T : IDecrementOperators&lt;T&gt;\r\n{\r\n    public static bool Vectorizable =&gt; true;\r\n    public static T Invoke(T x) =&gt; --x;\r\n    public static Vector128&lt;T&gt; Invoke(Vector128&lt;T&gt; x) =&gt; x - Vector128&lt;T&gt;.One;\r\n    public static Vector256&lt;T&gt; Invoke(Vector256&lt;T&gt; x) =&gt; x - Vector256&lt;T&gt;.One;\r\n    public static Vector512&lt;T&gt; Invoke(Vector512&lt;T&gt; x) =&gt; x - Vector512&lt;T&gt;.One;\r\n}<\/code><\/pre>\n<p>With that minimal implementation, which provides a decrement implementation for vectorized widths of 128 bits, 256 bits, 512 bits, and scalar, the workhorse routine is able to provide a very efficient implementation.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics.Tensors;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private float[] _src = Enumerable.Range(0, 1000).Select(i =&gt; (float)i).ToArray();\r\n    private float[] _dest = new float[1000];\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void DecrementManual()\r\n    {\r\n        ReadOnlySpan&lt;float&gt; src = _src;\r\n        Span&lt;float&gt; dest = _dest;\r\n        for (int i = 0; i &lt; src.Length; i++)\r\n        {\r\n            dest[i] = src[i] - 1f;\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void DecrementTP() =&gt; TensorPrimitives.Decrement(_src, _dest);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: 
right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DecrementManual<\/td>\n<td style=\"text-align: right;\">288.80 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DecrementTP<\/td>\n<td style=\"text-align: right;\">22.46 ns<\/td>\n<td style=\"text-align: right;\">0.08<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Wherever possible, these methods also utilize APIs on the underlying <code>Vector128<\/code>, <code>Vector256<\/code>, and <code>Vector512<\/code> types, including new corresponding methods introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111179\">dotnet\/runtime#111179<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115525\">dotnet\/runtime#115525<\/a>, such as <code>IsNaN<\/code>.<\/p>\n<p>Existing methods are also improved. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111615\">dotnet\/runtime#111615<\/a> from <a href=\"https:\/\/github.com\/BarionLP\">@BarionLP<\/a> improves <code>TensorPrimitives.SoftMax<\/code> by avoiding unnecessary recomputation of <code>T.Exp<\/code>. The softmax function involves computing <code>exp<\/code> for every element and summing them all together. The output for an element with value <code>x<\/code> is then the <code>exp(x)<\/code> divided by that sum. The previous implementation was following that outline, resulting in computing <code>exp<\/code> twice for each element. We can instead compute <code>exp<\/code> just once for each element, caching them temporarily in the destination while creating the sum, and then reusing those for the subsequent division, overwriting each with the actual result. 
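<\/p>\n<p>The caching idea above can be sketched in scalar form. This is a minimal illustration under stated assumptions, not the library&#8217;s vectorized implementation; <code>SoftMaxCached<\/code> and the sample input are hypothetical names chosen for the example, and the destination is assumed to be at least as long as the source:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nfloat[] src = [1f, 2f, 3f];\r\nfloat[] dst = new float[src.Length];\r\nSoftMaxCached(src, dst);\r\nConsole.WriteLine(string.Join(\", \", dst)); \/\/ three probabilities that sum to approximately 1\r\n\r\n\/\/ Hypothetical scalar helper: calls exp once per element instead of twice.\r\nstatic void SoftMaxCached(ReadOnlySpan&lt;float&gt; x, Span&lt;float&gt; destination)\r\n{\r\n    \/\/ Pass 1: stash exp(x[i]) in the destination while accumulating the sum.\r\n    float sum = 0f;\r\n    for (int i = 0; i &lt; x.Length; i++)\r\n    {\r\n        float e = MathF.Exp(x[i]);\r\n        destination[i] = e;\r\n        sum += e;\r\n    }\r\n\r\n    \/\/ Pass 2: overwrite each cached exp with the final result.\r\n    for (int i = 0; i &lt; x.Length; i++)\r\n    {\r\n        destination[i] \/= sum;\r\n    }\r\n}<\/code><\/pre>\n<p>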
The net result is a substantial throughput improvement, roughly 1.6x in this benchmark:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net9.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics.Tensors;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(\"System.Numerics.Tensors\", \"9.0.9\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core10_0).WithNuGet(\"System.Numerics.Tensors\", \"10.0.0-rc.1.25451.107\"));\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic partial class Tests\r\n{\r\n    private float[] _src, _dst;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        Random r = new(42);\r\n        _src = Enumerable.Range(0, 1000).Select(_ =&gt; r.NextSingle()).ToArray();\r\n        _dst = new float[_src.Length];\r\n    }\r\n\r\n    [Benchmark]\r\n    public void SoftMax() =&gt; TensorPrimitives.SoftMax(_src, _dst);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SoftMax<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,047.9 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SoftMax<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">649.8 ns<\/td>\n<td style=\"text-align: right;\">0.62<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111505\">dotnet\/runtime#111505<\/a> from <a href=\"https:\/\/github.com\/alexcovington\">@alexcovington<\/a> 
enables <code>TensorPrimitives.Divide&lt;T&gt;<\/code> to be vectorized for <code>int<\/code>. The operation already supported vectorization for <code>float<\/code> and <code>double<\/code>, for which there&#8217;s SIMD hardware-accelerated support for division, but it didn&#8217;t support <code>int<\/code>, which lacks SIMD hardware-accelerated support. This PR teaches the JIT how to emulate SIMD integer division, by converting the <code>int<\/code>s to <code>double<\/code>s, doing <code>double<\/code> division, and then converting back.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net9.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics.Tensors;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(\"System.Numerics.Tensors\", \"9.0.9\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core10_0).WithNuGet(\"System.Numerics.Tensors\", \"10.0.0-rc.1.25451.107\"));\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic partial class Tests\r\n{\r\n    private int[] _n, _d, _dst;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        Random r = new(42);\r\n        _n = Enumerable.Range(0, 1000).Select(_ =&gt; r.Next(1000, int.MaxValue)).ToArray();\r\n        _d = Enumerable.Range(0, 1000).Select(_ =&gt; r.Next(1, 1000)).ToArray();\r\n        _dst = new int[_n.Length];\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Divide() =&gt; TensorPrimitives.Divide(_n, _d, _dst);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th 
style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Divide<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,293.9 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Divide<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">458.4 ns<\/td>\n<td style=\"text-align: right;\">0.35<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116945\">dotnet\/runtime#116945<\/a> further updates <code>TensorPrimitives.Divide<\/code> (as well as <code>TensorPrimitives.Sign<\/code> and <code>TensorPrimitives.ConvertToInteger<\/code>) to be vectorizable when used with <code>nint<\/code> or <code>nuint<\/code>. <code>nint<\/code> can be treated identically to <code>int<\/code> when in a 32-bit process and to <code>long<\/code> when in a 64-bit process; same for <code>nuint<\/code> with <code>uint<\/code> and <code>ulong<\/code>, respectively. So anywhere we&#8217;re successfully vectorizing for <code>int<\/code>\/<code>uint<\/code> on 32-bit or <code>long<\/code>\/<code>ulong<\/code> on 64-bit, we can also successfully vectorize for <code>nint<\/code>\/<code>nuint<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116895\">dotnet\/runtime#116895<\/a> also enables vectorizing <code>TensorPrimitives.ConvertTruncating<\/code> when used to convert <code>float<\/code> to <code>int<\/code> or <code>uint<\/code> and <code>double<\/code> to <code>long<\/code> or <code>ulong<\/code>. 
Vectorization hadn&#8217;t previously been enabled because the underlying operations used had some undefined behavior; that behavior was fixed late in the .NET 9 cycle, such that this vectorization can now be enabled.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net9.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics.Tensors;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(\"System.Numerics.Tensors\", \"9.0.9\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core10_0).WithNuGet(\"System.Numerics.Tensors\", \"10.0.0-rc.1.25451.107\"));\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic partial class Tests\r\n{\r\n    private float[] _src;\r\n    private int[] _dst;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        Random r = new(42);\r\n        _src = Enumerable.Range(0, 1000).Select(_ =&gt; r.NextSingle() * 1000).ToArray();\r\n        _dst = new int[_src.Length];\r\n    }\r\n\r\n    [Benchmark]\r\n    public void ConvertTruncating() =&gt; TensorPrimitives.ConvertTruncating(_src, _dst);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ConvertTruncating<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">933.86 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConvertTruncating<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">41.99 ns<\/td>\n<td 
style=\"text-align: right;\">0.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Not to be left out, <code>TensorPrimitives.LeadingZeroCount<\/code> is also improved in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110333\">dotnet\/runtime#110333<\/a> from <a href=\"https:\/\/github.com\/alexcovington\">@alexcovington<\/a>. When AVX512 is available, the change utilizes AVX512 instructions like <code>PermuteVar16x8x2<\/code> to vectorize <code>LeadingZeroCount<\/code> for all types supported by <code>Vector512&lt;T&gt;<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net9.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics.Tensors;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(\"System.Numerics.Tensors\", \"9.0.9\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core10_0).WithNuGet(\"System.Numerics.Tensors\", \"10.0.0-rc.1.25451.107\"));\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _src, _dst;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _src = new byte[1000];\r\n        _dst = new byte[_src.Length];\r\n        new Random(42).NextBytes(_src);\r\n    }\r\n\r\n    [Benchmark]\r\n    public void LeadingZeroCount() =&gt; TensorPrimitives.LeadingZeroCount(_src, _dst);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: 
right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>LeadingZeroCount<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">401.60 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>LeadingZeroCount<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">12.33 ns<\/td>\n<td style=\"text-align: right;\">0.03<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In terms of changes that affected the most operations, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116898\">dotnet\/runtime#116898<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116934\">dotnet\/runtime#116934<\/a> take the cake. Together, these PRs extend vectorization for almost 60 distinct operations to also accelerate for <code>Half<\/code>: <code>Abs<\/code>, <code>Add<\/code>, <code>AddMultiply<\/code>, <code>BitwiseAnd<\/code>, <code>BitwiseOr<\/code>, <code>Ceiling<\/code>, <code>Clamp<\/code>, <code>CopySign<\/code>, <code>Cos<\/code>, <code>CosPi<\/code>, <code>Cosh<\/code>, <code>CosineSimilarity<\/code>, <code>Decrement<\/code>, <code>DegreesToRadians<\/code>, <code>Divide<\/code>, <code>Exp<\/code>, <code>Exp10<\/code>, <code>Exp10M1<\/code>, <code>Exp2<\/code>, <code>Exp2M1<\/code>, <code>ExpM1<\/code>, <code>Floor<\/code>, <code>FusedAddMultiply<\/code>, <code>Hypot<\/code>, <code>Increment<\/code>, <code>Lerp<\/code>, <code>Log<\/code>, <code>Log10<\/code>, <code>Log10P1<\/code>, <code>Log2<\/code>, <code>Log2P1<\/code>, <code>LogP1<\/code>, <code>Max<\/code>, <code>MaxMagnitude<\/code>, <code>MaxMagnitudeNumber<\/code>, <code>MaxNumber<\/code>, <code>Min<\/code>, <code>MinMagnitude<\/code>, <code>MinMagnitudeNumber<\/code>, <code>MinNumber<\/code>, <code>Multiply<\/code>, <code>MultiplyAdd<\/code>, <code>MultiplyAddEstimate<\/code>, <code>Negate<\/code>, <code>OnesComplement<\/code>, <code>Reciprocal<\/code>, <code>Remainder<\/code>, <code>Round<\/code>, <code>Sigmoid<\/code>, <code>Sin<\/code>, <code>SinPi<\/code>, <code>Sinh<\/code>, 
<code>Sqrt<\/code>, <code>Subtract<\/code>, <code>Tan<\/code>, <code>TanPi<\/code>, <code>Tanh<\/code>, <code>Truncate<\/code>, and <code>Xor<\/code>. The challenge here is that <code>Half<\/code> doesn&#8217;t have accelerated hardware support, and today it isn&#8217;t even supported by the vector types. In fact, even for its scalar operations, <code>Half<\/code> is manipulated internally by converting it to a <code>float<\/code>, performing the relevant operation as <code>float<\/code>, and then casting back, e.g. here&#8217;s the implementation of the <code>Half<\/code> multiplication operator:<\/p>\n<pre><code class=\"language-csharp\">public static Half operator *(Half left, Half right) =&gt; (Half)((float)left * (float)right);<\/code><\/pre>\n<p>All of these <code>TensorPrimitives<\/code> operations previously treated <code>Half<\/code> like any other unaccelerated type, and would just run a scalar loop that performed the operation on each <code>Half<\/code>. That means for each element, we&#8217;re converting it to <code>float<\/code>, then performing the operation, and then converting it back. As luck would have it, though, <code>TensorPrimitives<\/code> already defines the <code>ConvertToSingle<\/code> and <code>ConvertToHalf<\/code> methods, which are accelerated. We can then reuse those methods to do the same thing that&#8217;s already done for scalar operations but do it vectorized: take a vector of <code>Half<\/code>s, convert them all to <code>float<\/code>s, process all the <code>float<\/code>s, and convert them all back to <code>Half<\/code>s. Of course, I already stated that the vector types don&#8217;t support <code>Half<\/code>, so how can we &#8220;take a vector of <code>Half<\/code>&#8221;? By reinterpret casting the <code>Span&lt;Half&gt;<\/code> to <code>Span&lt;short&gt;<\/code> (or <code>Span&lt;ushort&gt;<\/code>), which allows us to smuggle the <code>Half<\/code>s through. 
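<\/p>\n<p>That reinterpret-then-widen approach can be sketched with public APIs. This is a hedged illustration of the idea rather than the library&#8217;s internal code, which works a vector at a time instead of going through temporary arrays. It assumes a package reference to <code>System.Numerics.Tensors<\/code> version 9 or later (for the generic <code>Add&lt;T&gt;<\/code> overload), and the sample values are made up for the example:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\nusing System.Numerics.Tensors;\r\nusing System.Runtime.InteropServices;\r\n\r\nHalf[] x = [(Half)1.5f, (Half)2f, (Half)(-0.25f)];\r\nHalf[] y = [(Half)0.5f, (Half)3f, (Half)0.25f];\r\nHalf[] dest = new Half[x.Length];\r\n\r\n\/\/ Reinterpret the Half storage as 16-bit integers: same memory, no copy, no conversion.\r\nSpan&lt;short&gt; bits = MemoryMarshal.Cast&lt;Half, short&gt;(x.AsSpan());\r\nConsole.WriteLine(bits[0] == BitConverter.HalfToInt16Bits(x[0])); \/\/ True\r\n\r\n\/\/ Widen everything to float, do the math as float, then narrow back to Half.\r\nfloat[] xf = new float[x.Length], yf = new float[y.Length], sum = new float[x.Length];\r\nTensorPrimitives.ConvertToSingle(x, xf);\r\nTensorPrimitives.ConvertToSingle(y, yf);\r\nTensorPrimitives.Add&lt;float&gt;(xf, yf, sum);\r\nTensorPrimitives.ConvertToHalf(sum, dest);\r\n\r\nConsole.WriteLine(string.Join(\", \", dest));<\/code><\/pre>\n<p>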
And, as it turns out, even for scalar, the very first thing <code>Half<\/code>&#8216;s <code>float<\/code> cast operator does is convert it to a <code>short<\/code>.<\/p>\n<p>The net result is that a ton of operations can now be accelerated for <code>Half<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net9.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics.Tensors;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(\"System.Numerics.Tensors\", \"9.0.9\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core10_0).WithNuGet(\"System.Numerics.Tensors\", \"10.0.0-rc.1.25451.107\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic partial class Tests\r\n{\r\n    private Half[] _x, _y, _dest;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _x = new Half[1000];\r\n        _y = new Half[_x.Length];\r\n        _dest = new Half[_x.Length];\r\n        var random = new Random(42);\r\n        for (int i = 0; i &lt; _x.Length; i++)\r\n        {\r\n            _x[i] = (Half)random.NextSingle();\r\n            _y[i] = (Half)random.NextSingle();\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Add() =&gt; TensorPrimitives.Add(_x, _y, _dest);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Add<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">5,984.3 ns<\/td>\n<td 
style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Add<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">481.7 ns<\/td>\n<td style=\"text-align: right;\">0.08<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>System.Numerics.Tensors<\/code> library in .NET 10 now also includes stable APIs for tensor types (which use <code>TensorPrimitives<\/code> in their implementations). This includes a <code>Tensor&lt;T&gt;<\/code>, <code>ITensor&lt;,&gt;<\/code>, <code>TensorSpan&lt;T&gt;<\/code>, and <code>ReadOnlyTensorSpan&lt;T&gt;<\/code>. One of the really interesting things about these types is that they take advantage of the new C# 14 compound operators feature, and do so for a significant performance benefit. In previous versions of C#, you&#8217;re able to write custom operators, for example an addition operator:<\/p>\n<pre><code class=\"language-csharp\">public class C \r\n{\r\n    public int Value;\r\n\r\n    public static C operator +(C left, C right) =&gt; new() { Value = left.Value + right.Value };\r\n}<\/code><\/pre>\n<p>With that type, I can write code like:<\/p>\n<pre><code class=\"language-csharp\">C a = new() { Value = 42 };\r\nC b = new() { Value = 84 };\r\nC c = a + b;\r\n\r\nConsole.WriteLine(c.Value);<\/code><\/pre>\n<p>which will print out <code>126<\/code>. I can also change the code to use a compound operator, <code>+=<\/code>, like this:<\/p>\n<pre><code class=\"language-csharp\">C a = new() { Value = 42 };\r\nC b = new() { Value = 84 };\r\na += b;\r\n\r\nConsole.WriteLine(a.Value);<\/code><\/pre>\n<p>which will also print out <code>126<\/code>, because the <code>a += b<\/code> is always identical to <code>a = a + b<\/code>&#8230; or, at least it was. Now with C# 14, it&#8217;s possible for a type to not only define a <code>+<\/code> operator, it can also define a <code>+=<\/code> operator. 
If a type defines a <code>+=<\/code> operator, it will be used rather than expanding <code>a += b<\/code> as shorthand for <code>a = a + b<\/code>. And that has performance ramifications.<\/p>\n<p>A tensor is basically a multidimensional array, and as with arrays, these can be big&#8230; really big. If I have a sequence of operations:<\/p>\n<pre><code class=\"language-csharp\">Tensor&lt;int&gt; t1 = ...;\r\nTensor&lt;int&gt; t2 = ...;\r\nfor (int i = 0; i &lt; 3; i++)\r\n{\r\n    t1 += t2;\r\n}<\/code><\/pre>\n<p>and each of those <code>t1 += t2<\/code>s expands into <code>t1 = t1 + t2<\/code>, then for each I&#8217;m allocating a brand new tensor. If they&#8217;re big, that gets expensive right quick. But C# 14&#8217;s new user-defined compound operators, as initially added to the compiler in <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/78400\">dotnet\/roslyn#78400<\/a>, enable mutation of the target.<\/p>\n<pre><code class=\"language-csharp\">public class C \r\n{\r\n    public int Value;\r\n\r\n    public static C operator +(C left, C right) =&gt; new() { Value = left.Value + right.Value };\r\n    public void operator +=(C other) =&gt; Value += other.Value;\r\n}<\/code><\/pre>\n<p>And that means that such compound operators on the tensor types can just update the target tensor in place rather than allocating a whole new (possibly very large) data structure for each computation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117997\">dotnet\/runtime#117997<\/a> adds all of these compound operators for the tensor types. (Not only are these using C# 14 user-defined compound operators, they&#8217;re doing so as extension operators, using the new C# 14 extension members feature. 
Fun!)<\/p>\n<h2>Collections<\/h2>\n<p>Handling collections of data is the lifeblood of any application, and as such every .NET release tries to eke out even more performance from collections and collection processing.<\/p>\n<h3>Enumeration<\/h3>\n<p>Iterating through collections is one of the most common things developers do. To make this as efficient as possible, the most prominent collection types in .NET (e.g. <code>List&lt;T&gt;<\/code>) expose struct-based enumerators (e.g. <code>List&lt;T&gt;.Enumerator<\/code>) which their public <code>GetEnumerator()<\/code> methods then return in a strongly-typed manner:<\/p>\n<pre><code class=\"language-csharp\">public Enumerator GetEnumerator() =&gt; new Enumerator(this);<\/code><\/pre>\n<p>This is in addition to their <code>IEnumerable&lt;T&gt;.GetEnumerator()<\/code> implementation, which ends up being implemented via an &#8220;explicit&#8221; interface implementation (&#8220;explicit&#8221; means the relevant method provides the interface method implementation but does not show up as a public method on the type itself), e.g. <code>List&lt;T&gt;<\/code>&#8217;s implementation:<\/p>\n<pre><code class=\"language-csharp\">IEnumerator&lt;T&gt; IEnumerable&lt;T&gt;.GetEnumerator() =&gt;\r\n    Count == 0 ? SZGenericArrayEnumerator&lt;T&gt;.Empty :\r\n    GetEnumerator();<\/code><\/pre>\n<p>Directly <code>foreach<\/code>&#8217;ing over the collection allows the C# compiler to bind to the struct-based enumerator, which avoids the enumerator allocation and exposes the enumerator&#8217;s non-virtual methods directly, rather than working with an <code>IEnumerator&lt;T&gt;<\/code> and the interface dispatch required to invoke methods on it. 
That, however, falls apart once the collection is used polymorphically as an <code>IEnumerable&lt;T&gt;<\/code>; at that point, the <code>IEnumerable&lt;T&gt;.GetEnumerator()<\/code> is used, which is forced to allocate a new enumerator instance (except for special cases, such as how <code>List&lt;T&gt;<\/code>&#8217;s implementation shown above returns a singleton enumerator when the collection is empty).<\/p>\n<p>Thankfully, as noted earlier in the JIT section, the JIT has been gaining superpowers around dynamic PGO, escape analysis, and stack allocation. This means that in many situations, the JIT is now able to see that the most common concrete type for a given call site is a specific enumerator type and generate code specific to when it is that type, devirtualizing the calls, possibly inlining them, and then, if it&#8217;s able to do so sufficiently, stack allocating the enumerator. With the progress that&#8217;s been made in .NET 10, this now happens very frequently for arrays and <code>List&lt;T&gt;<\/code>. While the JIT is able to do this in general regardless of an object&#8217;s type, the ubiquity of enumeration makes it all that much more important for <code>IEnumerator&lt;T&gt;<\/code>, so <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116978\">dotnet\/runtime#116978<\/a> marks <code>IEnumerator&lt;T&gt;<\/code> as an <code>[Intrinsic]<\/code>, giving the JIT the ability to better reason about it.<\/p>\n<p>However, some enumerators still needed a bit of help. Besides <code>T[]<\/code>, <code>List&lt;T&gt;<\/code> is the most popular collection type in .NET, and with the JIT changes, many <code>foreach<\/code>s of an <code>IEnumerable&lt;T&gt;<\/code> that are actually <code>List&lt;T&gt;<\/code> will successfully have the enumerator stack allocated. Awesome. That awesomeness dwindled, however, when trying out differently sized lists. 
This is a benchmark that tests out enumerating a <code>List&lt;T&gt;<\/code> typed as <code>IEnumerable&lt;T&gt;<\/code>, with different lengths, along with benchmark results from early August 2025 (around .NET 10 Preview 7).<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; _enumerable;\r\n\r\n    [Params(500, 5000, 15000)]\r\n    public int Count { get; set; }\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; _enumerable = Enumerable.Range(0, Count).ToList();\r\n\r\n    [Benchmark]\r\n    public int Sum()\r\n    {\r\n        int sum = 0;\r\n        foreach (int item in _enumerable) sum += item;\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Count<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>500<\/td>\n<td style=\"text-align: right;\">214.1 ns<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>5000<\/td>\n<td style=\"text-align: right;\">4,767.1 ns<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>15000<\/td>\n<td style=\"text-align: right;\">13,824.4 ns<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that for the <code>500<\/code> element <code>List&lt;T&gt;<\/code>, the allocation column shows that nothing was allocated on the heap, as the enumerator was successfully stack allocated. Fabulous. But then just increasing the size of the list caused it to no longer be stack-allocated. Why? 
The reason for the allocation in the jump from <code>500<\/code> to <code>5000<\/code> has to do with dynamic PGO combined with how <code>List&lt;T&gt;<\/code>&#8216;s enumerator was written oh so many years ago.<\/p>\n<p><code>List&lt;T&gt;<\/code>&#8216;s enumerator&#8217;s <code>MoveNext<\/code> was structured like this:<\/p>\n<pre><code class=\"language-csharp\">public bool MoveNext()\r\n{\r\n    if (_version == _list._version &amp;&amp; ((uint)_index &lt; (uint)_list._size))\r\n    {\r\n        ... \/\/ handle successfully getting next element\r\n        return true;\r\n    }\r\n\r\n    return MoveNextRare();\r\n}\r\n\r\nprivate bool MoveNextRare()\r\n{\r\n    ... \/\/ handle version mismatch and\/or returning false for completed enumeration\r\n}<\/code><\/pre>\n<p>The <code>Rare<\/code> in the name gives a hint as to why it&#8217;s split like this. The <code>MoveNext<\/code> method was kept as thin as possible for the common case of invoking <code>MoveNext<\/code>, namely all successful calls that return <code>true<\/code>; the only time <code>MoveNextRare<\/code> is needed, other than when the enumerator is misused, is for the final call to it after all elements have been yielded. That streamlining of <code>MoveNext<\/code> itself was done to make <code>MoveNext<\/code> inlineable. However, a lot has changed since this code was written, making it less important, and the separating out of <code>MoveNextRare<\/code> has a really interesting interaction with dynamic PGO. One of the things dynamic PGO looks for is whether code is considered hot (used a lot) or cold (used rarely), and that data influences whether a method should be considered for inlining. For shorter lists, dynamic PGO will see <code>MoveNextRare<\/code> invoked a reasonable number of times, and will consider it for inlining. And if all of the calls to the enumerator are inlined, the enumerator instance can avoid escaping the call frame, and can then be stack allocated. 
But once the list length grows to a much larger amount, that <code>MoveNextRare<\/code> method will start to look really cold, will struggle to be inlined, and will then allow the enumerator instance to escape, preventing it from being stack allocated. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118425\">dotnet\/runtime#118425<\/a> recognizes that times have changed since this enumerator was written, with many changes to inlining heuristics and PGO and the like; it undoes the separating out of <code>MoveNextRare<\/code> and simplifies the enumerator. With how the system works today, the re-combined <code>MoveNext<\/code> is still inlineable, with or without PGO, and we&#8217;re able to stack allocate at the larger size.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Count<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>500<\/td>\n<td style=\"text-align: right;\">221.2 ns<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>5000<\/td>\n<td style=\"text-align: right;\">2,153.6 ns<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>15000<\/td>\n<td style=\"text-align: right;\">14,724.9 ns<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>With that fix, we still had an issue, though. We&#8217;re now avoiding the allocation at lengths 500 and 5000, but at 15,000 we still see the enumerator being allocated. Now why? This has to do with <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/#on-stack-replacement\">OSR (on-stack replacement)<\/a>, which was introduced in .NET 7 as a key enabler for allowing tiered compilation to be used with methods containing loops. 
OSR allows for a method to be recompiled with optimizations even while it&#8217;s executing, and for an invocation of the method to jump from the unoptimized code for the method to the corresponding location in the newly optimized method. While OSR is awesome, it unfortunately causes some complications here. Once the list gets long enough, an invocation of the tier 0 (unoptimized) method will transition to the OSR optimized method&#8230; but OSR methods don&#8217;t contain dynamic PGO instrumentation (they used to, but it was removed because it led to problems if the instrumented code never got recompiled again and thus suffered regressions due to forever-more running with the instrumentation probes in place). Without the instrumentation, and in particular without the instrumentation for the tail portion of the method (where the enumerator&#8217;s <code>Dispose<\/code> method is invoked), even though <code>List&lt;T&gt;.Dispose<\/code> is a nop, the JIT may not be able to do the guarded devirtualization that enables the <code>IEnumerator&lt;T&gt;.Dispose<\/code> to be devirtualized and inlined. Meaning, ironically, that the nop <code>Dispose<\/code> causes escape analysis to see the enumerator instance escape, such that it can&#8217;t be stack allocated. Whew.<\/p>\n<p>Thankfully, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118461\">dotnet\/runtime#118461<\/a> addresses that in the JIT. Specifically for enumerators, this PR enables dynamic PGO to infer the missing instrumentation based on the earlier probes used with the other enumerator methods, which then enables it to successfully devirtualize and inline <code>Dispose<\/code>. 
So, for .NET 10, and the same benchmark, we end up with this lovely sight:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Count<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>500<\/td>\n<td style=\"text-align: right;\">216.5 ns<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>5000<\/td>\n<td style=\"text-align: right;\">2,082.4 ns<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>15000<\/td>\n<td style=\"text-align: right;\">6,525.3 ns<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other types needed a bit of help as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118467\">dotnet\/runtime#118467<\/a> addresses <code>PriorityQueue&lt;TElement, TPriority&gt;<\/code>&#8217;s enumerator; its enumerator was a port of <code>List&lt;T&gt;<\/code>&#8217;s and so was changed similarly.<\/p>\n<p>Separately, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117328\">dotnet\/runtime#117328<\/a> streamlines <code>Stack&lt;T&gt;<\/code>&#8217;s enumerator type, removing around half the lines of code that previously composed it. 
The previous enumerator&#8217;s <code>MoveNext<\/code> incurred five branches on the way to grabbing most next elements:<\/p>\n<ul>\n<li>It first did a version check, comparing the stack&#8217;s version number against the enumerator&#8217;s captured version number, to ensure the stack hadn&#8217;t been mutated since the time the enumerator was grabbed.<\/li>\n<li>It then checked to see whether this was the first call to the enumerator, taking one path that lazily-initialized some state if it was and another path assuming already-initialized state if not.<\/li>\n<li>Assuming this wasn&#8217;t the first call, it then checked whether enumeration had previously ended.<\/li>\n<li>Assuming it hadn&#8217;t, it then checked whether there&#8217;s anything left to enumerate.<\/li>\n<li>And finally, it dereferenced the underlying array, incurring a bounds check.<\/li>\n<\/ul>\n<p>The new implementation cuts that in half. It relies on the enumerator&#8217;s constructor initializing the current index to the length of the stack, such that each <code>MoveNext<\/code> call just decrements this value. When the data is exhausted, the count will go negative. This means that we can combine a whole bunch of these checks into a single check:<\/p>\n<pre><code class=\"language-csharp\">if ((uint)index &lt; (uint)array.Length)<\/code><\/pre>\n<p>and we&#8217;re left with just two branches on the way to reading any element: the version check and whether the index is in bounds. 
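<\/p>\n<p>To make that concrete, here&#8217;s a minimal sketch of the shape being described. It&#8217;s a simplified, hypothetical stand-in (<code>SketchStack<\/code> and its members are invented for illustration and assume a small fixed capacity), not the actual <code>Stack&lt;T&gt;<\/code> source:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\nvar stack = new SketchStack();\r\nstack.Push(1);\r\nstack.Push(2);\r\nstack.Push(3);\r\n\r\nvar e = stack.GetEnumerator();\r\nwhile (e.MoveNext()) Console.WriteLine(e.Current);\r\n\r\n\/\/ Hypothetical, simplified stack of ints with a fixed capacity of 4.\r\nsealed class SketchStack\r\n{\r\n    internal readonly int[] _array = new int[4];\r\n    internal int _size;\r\n    internal int _version;\r\n\r\n    public void Push(int item)\r\n    {\r\n        _array[_size++] = item;\r\n        _version++;\r\n    }\r\n\r\n    public Enumerator GetEnumerator() =&gt; new Enumerator(this);\r\n\r\n    public struct Enumerator\r\n    {\r\n        private readonly SketchStack _stack;\r\n        private readonly int _version;\r\n        private int _index;\r\n        private int _current;\r\n\r\n        internal Enumerator(SketchStack stack)\r\n        {\r\n            _stack = stack;\r\n            _version = stack._version;\r\n            _index = stack._size; \/\/ start one past the top element\r\n            _current = 0;\r\n        }\r\n\r\n        public int Current =&gt; _current;\r\n\r\n        public bool MoveNext()\r\n        {\r\n            \/\/ Branch 1: the version check.\r\n            if (_version != _stack._version) throw new InvalidOperationException();\r\n\r\n            \/\/ Branch 2: a single unsigned comparison covers \"already finished\",\r\n            \/\/ \"nothing left\", and the array bounds check.\r\n            int index = _index - 1;\r\n            int[] array = _stack._array;\r\n            if ((uint)index &lt; (uint)array.Length)\r\n            {\r\n                _index = index;\r\n                _current = array[index];\r\n                return true;\r\n            }\r\n\r\n            _index = -1; \/\/ stay finished on any later call\r\n            return false;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>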
That reduction not only means there&#8217;s less code to process and fewer branches that might be improperly predicted, it also shrinks the size of the members to the point where they&#8217;re much more likely to be inlined, which in turn makes it much more likely that the enumerator object can be stack allocated.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Stack&lt;int&gt; _direct = new Stack&lt;int&gt;(Enumerable.Range(0, 10));\r\n    private IEnumerable&lt;int&gt; _enumerable = new Stack&lt;int&gt;(Enumerable.Range(0, 10));\r\n\r\n    [Benchmark]\r\n    public int SumDirect()\r\n    {\r\n        int sum = 0;\r\n        foreach (int item in _direct) sum += item;\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int SumEnumerable()\r\n    {\r\n        int sum = 0;\r\n        foreach (int item in _enumerable) sum += item;\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SumDirect<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">23.317 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">331 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td>SumDirect<\/td>\n<td>.NET 
10.0<\/td>\n<td style=\"text-align: right;\">4.502 ns<\/td>\n<td style=\"text-align: right;\">0.19<\/td>\n<td style=\"text-align: right;\">55 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>SumEnumerable<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">30.893 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">642 B<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SumEnumerable<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">7.906 ns<\/td>\n<td style=\"text-align: right;\">0.26<\/td>\n<td style=\"text-align: right;\">381 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117341\">dotnet\/runtime#117341<\/a> does something similar but for <code>Queue&lt;T&gt;<\/code>. <code>Queue&lt;T&gt;<\/code> has an interesting complication when compared to <code>Stack&lt;T&gt;<\/code>, which is that it can wrap around the length of the underlying array. Whereas with <code>Stack&lt;T&gt;<\/code>, we can always start at a particular index and just count down to 0, using that index as the offset into the array, with <code>Queue&lt;T&gt;<\/code>, the starting index can be anywhere in the array, and when walking from that index to the last element, we might need to wrap around back to the beginning. Such wrapping can be accomplished using <code>% array.Length<\/code> (which is what <code>Queue&lt;T&gt;<\/code> does on .NET Framework), but such a division operation can be relatively costly. 
An alternative, since we know the count can never be more than the array&#8217;s length, is to check whether we&#8217;ve already walked past the end of the array, and if we have, then subtract the array&#8217;s length to get to the corresponding location from the start of the array. The existing implementation in .NET 9 did just that:<\/p>\n<pre><code class=\"language-csharp\">if (index &gt;= array.Length)\r\n{\r\n    index -= array.Length; \/\/ wrap around if needed\r\n}\r\n\r\n_currentElement = array[index];<\/code><\/pre>\n<p>That is two branches, one for the check against the array length, and one for the bounds check. The bounds check can&#8217;t be eliminated here because the JIT hasn&#8217;t seen proof that the index is actually in-bounds and thus needs to be defensive. Instead, we can write it like this:<\/p>\n<pre><code class=\"language-csharp\">if ((uint)index &lt; (uint)array.Length)\r\n{\r\n    _currentElement = array[index];\r\n}\r\nelse\r\n{\r\n    index -= array.Length;\r\n    _currentElement = array[index];\r\n}<\/code><\/pre>\n<p>An enumeration of a queue can logically be split into two parts: the elements from the head index to the end of the array, and the elements from the beginning of the array to the tail. All of the former now fall into the first block, which incurs only one branch because the JIT can use the knowledge gleaned from the comparison to eliminate the bounds check. It only incurs a bounds check when in the second portion of the enumeration.<\/p>\n<p>We can more easily visualize the branch savings by using BenchmarkDotNet&#8217;s <code>HardwareCounters<\/code> diagnoser, asking it to track <code>HardwareCounter.BranchInstructions<\/code> (this diagnoser only works on Windows). 
Note here, as well, that the changes not only improve throughput, they also enable the boxed enumerator to be stack allocated.<\/p>\n<pre><code class=\"language-csharp\">\/\/ This benchmark was run on Windows for the HardwareCounters diagnoser to work.\r\n\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing BenchmarkDotNet.Diagnosers;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HardwareCounters(HardwareCounter.BranchInstructions)]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Queue&lt;int&gt; _direct;\r\n    private IEnumerable&lt;int&gt; _enumerable;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _direct = new Queue&lt;int&gt;(Enumerable.Range(0, 10));\r\n        for (int i = 0; i &lt; 5; i++)\r\n        {\r\n            _direct.Enqueue(_direct.Dequeue());\r\n        }\r\n\r\n        _enumerable = _direct;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int SumDirect()\r\n    {\r\n        int sum = 0;\r\n        foreach (int item in _direct) sum += item;\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int SumEnumerable()\r\n    {\r\n        int sum = 0;\r\n        foreach (int item in _enumerable) sum += item;\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">BranchInstructions\/Op<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SumDirect<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: 
24.340 ns">
right;\">24.340 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">79<\/td>\n<td style=\"text-align: right;\">251 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td>SumDirect<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">7.192 ns<\/td>\n<td style=\"text-align: right;\">0.30<\/td>\n<td style=\"text-align: right;\">37<\/td>\n<td style=\"text-align: right;\">96 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>SumEnumerable<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">30.695 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">103<\/td>\n<td style=\"text-align: right;\">531 B<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SumEnumerable<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">8.672 ns<\/td>\n<td style=\"text-align: right;\">0.28<\/td>\n<td style=\"text-align: right;\">50<\/td>\n<td style=\"text-align: right;\">324 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> also gets in on the fun. The dictionary is implemented as a collection of &#8220;buckets&#8221;, each of which is a linked list of entries. 
It had a fairly complicated enumerator for processing these structures, relying on jumping between cases of a switch statement, e.g.<\/p>\n<pre><code class=\"language-csharp\">switch (_state)\r\n{\r\n    case StateUninitialized:\r\n        ... \/\/ Initialize on first MoveNext.\r\n        goto case StateOuterloop;\r\n\r\n    case StateOuterloop:\r\n        \/\/ Check if there are more buckets in the dictionary to enumerate.\r\n        if ((uint)i &lt; (uint)buckets.Length)\r\n        {\r\n            \/\/ Move to the next bucket.\r\n            ...\r\n            goto case StateInnerLoop;\r\n        }\r\n        goto default;\r\n\r\n    case StateInnerLoop:\r\n        ... \/\/ Yield elements from the current bucket.\r\n        goto case StateOuterloop;\r\n\r\n    default:\r\n        \/\/ Done iterating.\r\n        ...\r\n}<\/code><\/pre>\n<p>If you squint, there are nested loops here, where we&#8217;re enumerating each bucket and for each bucket enumerating its contents. With how this is structured, however, from the JIT&#8217;s perspective, we could enter those loops from any of those <code>case<\/code>s, depending on the current value of <code>_state<\/code>. That produces something referred to as an &#8220;irreducible loop,&#8221; which is a loop that has multiple possible entry points. Imagine you have:<\/p>\n<pre><code class=\"language-csharp\">A:\r\nif (someCondition) goto B;\r\n...\r\n\r\nB:\r\nif (someOtherCondition) goto A;<\/code><\/pre>\n<p>Labels <code>A<\/code> and <code>B<\/code> form a loop, but that loop can be entered by jumping to either <code>A<\/code> or to <code>B<\/code>. If the compiler could prove that this loop were only ever enterable from <code>A<\/code> or only ever enterable from <code>B<\/code>, then the loop would be &#8220;reducible.&#8221; Irreducible loops are much more complex than reducible loops for a compiler to deal with, as they have more complex control and data flow and in general are harder to analyze. 
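<\/p>\n<p>For contrast, here&#8217;s a minimal sketch of the same kind of bucket walk written as a single loop that&#8217;s only ever entered from the top. The <code>Node<\/code> and <code>BucketEnumerator<\/code> types are hypothetical stand-ins invented for illustration, not the actual <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> code:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\n\r\n\/\/ Hypothetical data: three buckets, the middle one empty.\r\nvar buckets = new Node?[]\r\n{\r\n    new Node { Value = 1, Next = new Node { Value = 2 } },\r\n    null,\r\n    new Node { Value = 3 },\r\n};\r\n\r\nvar e = new BucketEnumerator(buckets);\r\nint sum = 0;\r\nwhile (e.MoveNext()) sum += e.Current;\r\nConsole.WriteLine(sum);\r\n\r\n\/\/ Hypothetical singly linked bucket entry.\r\nsealed class Node\r\n{\r\n    public int Value;\r\n    public Node? Next;\r\n}\r\n\r\nstruct BucketEnumerator\r\n{\r\n    private readonly Node?[] _buckets;\r\n    private int _i;      \/\/ index of the next bucket to visit\r\n    private Node? _node; \/\/ next entry to yield from the current bucket\r\n    private int _current;\r\n\r\n    public BucketEnumerator(Node?[] buckets)\r\n    {\r\n        _buckets = buckets;\r\n        _i = 0;\r\n        _node = null;\r\n        _current = 0;\r\n    }\r\n\r\n    public int Current =&gt; _current;\r\n\r\n    public bool MoveNext()\r\n    {\r\n        Node? node = _node;\r\n        while (true) \/\/ the only way into this loop is from the top\r\n        {\r\n            if (node is not null)\r\n            {\r\n                \/\/ Yield the next entry from the current bucket.\r\n                _current = node.Value;\r\n                _node = node.Next;\r\n                return true;\r\n            }\r\n\r\n            if ((uint)_i &gt;= (uint)_buckets.Length)\r\n            {\r\n                return false; \/\/ no more buckets: done iterating\r\n            }\r\n\r\n            node = _buckets[_i++]; \/\/ move to the next bucket\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>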
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116949\">dotnet\/runtime#116949<\/a> rewrites the <code>MoveNext<\/code> method to be a more typical <code>while<\/code> loop, which is not only easier to read and maintain, it&#8217;s also reducible and more efficient, and because it&#8217;s more streamlined, it&#8217;s also inlineable and enables possible stack allocation.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Concurrent;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private ConcurrentDictionary&lt;int, int&gt; _ints = new(Enumerable.Range(0, 1000).ToDictionary(i =&gt; i, i =&gt; i));\r\n\r\n    [Benchmark]\r\n    public int EnumerateInts()\r\n    {\r\n        int sum = 0;\r\n        foreach (var kvp in _ints) sum += kvp.Value;\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>EnumerateInts<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">4,232.8 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>EnumerateInts<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">664.2 ns<\/td>\n<td style=\"text-align: right;\">0.16<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>LINQ<\/h3>\n<p>All of 
these examples show enumerating collections using a <code>foreach<\/code> loop, and while that&#8217;s obviously incredibly common, so too is using LINQ (Language Integrated Query) to enumerate and process collections. For in-memory collections, LINQ provides literally hundreds of extension methods for performing maps, filters, sorts, and a plethora of other operations over enumerables. It is incredibly handy, is thus used <em>everywhere<\/em>, and is thus important to optimize. Every release of .NET has seen improvements to LINQ, and that continues in .NET 10.<\/p>\n<p>Most prominent from a performance perspective in this release are the changes to <code>Contains<\/code>. As discussed in depth in <a href=\"https:\/\/www.youtube.com\/watch?v=xKr96nIyCFM\">Deep .NET: Deep Dive on LINQ with Stephen Toub and Scott Hanselman<\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=W4-NVVNwCWs\">Deep .NET: An even DEEPER Dive into LINQ with Stephen Toub and Scott Hanselman<\/a>, the LINQ methods are able to pass information between them by using specialized internal <code>IEnumerable&lt;T&gt;<\/code> implementations. When you call <code>Select<\/code>, that might return an <code>ArraySelectIterator&lt;TSource, TResult&gt;<\/code> or an <code>IListSelectIterator&lt;TSource, TResult&gt;<\/code> or an <code>IListSkipTakeSelectIterator&lt;TSource, TResult&gt;<\/code> or one of any number of other types. Each of these types has fields that carry information about the source (e.g. the <code>IListSkipTakeSelectIterator&lt;TSource, TResult&gt;<\/code> has fields not only for the <code>IList&lt;TSource&gt;<\/code> source and the <code>Func&lt;TSource, TResult&gt;<\/code> selector, but also for the tracked min and max bounds based on previous <code>Skip<\/code> and <code>Take<\/code> calls), and they have overrides of virtual methods that allow for various operations to be specialized. This means sequences of LINQ methods can be optimized. 
For example, <code>source.Where(...).Select(...)<\/code> is optimized a) to combine both the filter and the map delegates into a single <code>IEnumerable&lt;T&gt;<\/code>, thus removing the overhead of an extra layer of interface dispatch, and b) to perform operations specific to the original source data type (e.g. if <code>source<\/code> was an array, the processing can be done directly on that array rather than via <code>IEnumerator&lt;T&gt;<\/code>).<\/p>\n<p>Many of these optimizations make the most sense when a method returns an <code>IEnumerable&lt;T&gt;<\/code> that happens to be the result of a LINQ query. The producer of that method doesn&#8217;t know how the consumer will be consuming it, and the consumer doesn&#8217;t know the details of how the producer produced it. But since the LINQ methods flow context via the concrete implementations of <code>IEnumerable&lt;T&gt;<\/code>, significant optimizations are possible for interesting combinations of consumer and producer methods. 
For example, let&#8217;s say a producer of an <code>IEnumerable&lt;T&gt;<\/code> decides they want to always return data in ascending order, so they do:<\/p>\n<pre><code class=\"language-csharp\">public static IEnumerable&lt;T&gt; GetData()\r\n{\r\n    ...\r\n    return data.OrderBy(s =&gt; s.CreatedAt);\r\n}<\/code><\/pre>\n<p>But as it turns out, the consumer won&#8217;t be looking at all of the elements, and instead just wants the first:<\/p>\n<pre><code class=\"language-csharp\">T value = GetData().First();<\/code><\/pre>\n<p>LINQ optimizes this by having the enumerable returned from <code>OrderBy<\/code> provide a specialized implementation of <code>First<\/code>\/<code>FirstOrDefault<\/code>: it doesn&#8217;t need to perform an <code>O(N log N)<\/code> sort (or allocate a lot of memory to hold all of the keys), it can instead just do an <code>O(N)<\/code> search for the smallest element in the source, because the smallest element would be the first to be yielded from <code>OrderBy<\/code>.<\/p>\n<p><code>Contains<\/code> is ripe for these kinds of optimizations as well, e.g. <code>OrderBy<\/code>, <code>Distinct<\/code>, and <code>Reverse<\/code> all entail non-trivial processing and\/or allocation, but if followed by a <code>Contains<\/code>, all that work can be skipped, as the <code>Contains<\/code> can just search the source directly. 
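<\/p>\n<p>As a minimal sketch of why that&#8217;s safe (the sample data and the values searched for here are made up purely for illustration), each of these operators changes the order or multiplicity of the elements but not which values are present, so asking the source directly gives the same answer:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\nusing System.Collections.Generic;\r\nusing System.Linq;\r\n\r\n\/\/ Hypothetical sample data, just for illustration.\r\nIEnumerable&lt;int&gt; source = new[] { 5, 3, 9, 3, 1 };\r\n\r\nforeach (int value in new[] { 9, 42 })\r\n{\r\n    bool direct = source.Contains(value);\r\n\r\n    \/\/ Reordering or de-duplicating never changes membership.\r\n    bool viaOrderBy = source.OrderBy(x =&gt; x).Contains(value);\r\n    bool viaDistinct = source.Distinct().Contains(value);\r\n    bool viaReverse = source.Reverse().Contains(value);\r\n\r\n    Console.WriteLine($\"{value}: {direct} {viaOrderBy} {viaDistinct} {viaReverse}\");\r\n}<\/code><\/pre>\n<p>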
With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112684\">dotnet\/runtime#112684<\/a>, this set of optimizations is extended to <code>Contains<\/code>, with almost 30 specialized implementations of <code>Contains<\/code> across the various iterator specializations.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; _source = Enumerable.Range(0, 1000).ToArray();\r\n\r\n    [Benchmark]\r\n    public bool AppendContains() =&gt; _source.Append(100).Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool ConcatContains() =&gt; _source.Concat(_source).Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool DefaultIfEmptyContains() =&gt; _source.DefaultIfEmpty(42).Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool DistinctContains() =&gt; _source.Distinct().Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool OrderByContains() =&gt; _source.OrderBy(x =&gt; x).Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool ReverseContains() =&gt; _source.Reverse().Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool UnionContains() =&gt; _source.Union(_source).Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool SelectManyContains() =&gt; _source.SelectMany(x =&gt; _source).Contains(999);\r\n\r\n    [Benchmark]\r\n    public bool WhereSelectContains() =&gt; _source.Where(x =&gt; true).Select(x =&gt; x).Contains(999);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th 
style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AppendContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">2,931.97 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">88 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AppendContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">52.06 ns<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<td style=\"text-align: right;\">0.64<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ConcatContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">3,065.17 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">88 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConcatContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">54.58 ns<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<td style=\"text-align: right;\">0.64<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>DefaultIfEmptyContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">39.21 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td>DefaultIfEmptyContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">32.89 ns<\/td>\n<td style=\"text-align: right;\">0.84<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: 
right;\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>DistinctContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">16,967.31 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">58656 B<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<\/tr>\n<tr>\n<td>DistinctContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">46.72 ns<\/td>\n<td style=\"text-align: right;\">0.003<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<td style=\"text-align: right;\">0.001<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>OrderByContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">12,884.28 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">12280 B<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<\/tr>\n<tr>\n<td>OrderByContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">50.14 ns<\/td>\n<td style=\"text-align: right;\">0.004<\/td>\n<td style=\"text-align: right;\">88 B<\/td>\n<td style=\"text-align: right;\">0.007<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ReverseContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">479.59 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">4072 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ReverseContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">51.80 ns<\/td>\n<td 
style=\"text-align: right;\">0.11<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<td style=\"text-align: right;\">0.01<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>UnionContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">16,910.57 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">58664 B<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<\/tr>\n<tr>\n<td>UnionContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">55.56 ns<\/td>\n<td style=\"text-align: right;\">0.003<\/td>\n<td style=\"text-align: right;\">72 B<\/td>\n<td style=\"text-align: right;\">0.001<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>SelectManyContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">2,950.64 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">192 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SelectManyContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">60.42 ns<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<td style=\"text-align: right;\">128 B<\/td>\n<td style=\"text-align: right;\">0.67<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>WhereSelectContains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,782.05 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">104 B<\/td>\n<td style=\"text-align: 
right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhereSelectContains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">260.25 ns<\/td>\n<td style=\"text-align: right;\">0.15<\/td>\n<td style=\"text-align: right;\">104 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>LINQ in .NET 10 also gains some new methods, including <code>Sequence<\/code> and <code>Shuffle<\/code>. While the primary purpose of these new methods is not performance, they can have a meaningful impact on performance, due to how they&#8217;ve been implemented and how they integrate with the rest of the optimizations in LINQ. Take <code>Sequence<\/code>, for example. <code>Sequence<\/code> is similar to <code>Range<\/code>, in that it&#8217;s a source for numbers:<\/p>\n<pre><code class=\"language-csharp\">public static IEnumerable&lt;T&gt; Sequence&lt;T&gt;(T start, T endInclusive, T step) where T : INumber&lt;T&gt;<\/code><\/pre>\n<p>Whereas <code>Range<\/code> only works with <code>int<\/code> and produces a contiguous series of non-overflowing numbers starting at the initial value, <code>Sequence<\/code> works with any <code>INumber&lt;&gt;<\/code>, supports <code>step<\/code> values other than <code>1<\/code> (including negative values), and allows for wrapping around <code>T<\/code>&#8217;s maximum or minimum. However, when appropriate (e.g. <code>step<\/code> is <code>1<\/code>), <code>Sequence<\/code> will try to utilize <code>Range<\/code>&#8217;s implementation, which has internally been updated to work with any <code>T : INumber&lt;T&gt;<\/code>, even though its public API is still tied to <code>int<\/code>.
That means that all of the optimizations afforded to <code>Range&lt;T&gt;<\/code> propagate to <code>Sequence&lt;T&gt;<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private List&lt;short&gt; _values = new();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void Fill1()\r\n    {\r\n        _values.Clear();\r\n        for (short i = 42; i &lt;= 1042; i++)\r\n        {\r\n            _values.Add(i);\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Fill2()\r\n    {\r\n        _values.Clear();\r\n        _values.AddRange(Enumerable.Sequence&lt;short&gt;(42, 1042, 1));\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Fill1<\/td>\n<td style=\"text-align: right;\">1,479.99 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Fill2<\/td>\n<td style=\"text-align: right;\">37.42 ns<\/td>\n<td style=\"text-align: right;\">0.03<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>My favorite new LINQ method, though, is <code>Shuffle<\/code> (introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112173\">dotnet\/runtime#112173<\/a>), in part because it&#8217;s very handy, but in part because of its implementation and performance focus. 
The purpose of <code>Shuffle<\/code> is to randomize the source input, and logically, it&#8217;s akin to a very simple implementation:<\/p>\n<pre><code class=\"language-csharp\">public static IEnumerable&lt;T&gt; Shuffle&lt;T&gt;(IEnumerable&lt;T&gt; source)\r\n{\r\n    T[] arr = source.ToArray();\r\n    Random.Shared.Shuffle(arr);\r\n    foreach (T item in arr) yield return item;\r\n}<\/code><\/pre>\n<p>Worst case, this implementation is effectively what&#8217;s in LINQ. Just as in the worst case <code>OrderBy<\/code> needs to buffer up the whole input because it&#8217;s possible any item might be the smallest and thus need to be yielded first, <code>Shuffle<\/code> similarly needs to support the possibility that the last element should probabilistically be yielded first. However, there are a variety of special cases in the implementation that allow it to perform significantly better than a hand-rolled <code>Shuffle<\/code> implementation you might be using today.<\/p>\n<p>First, <code>Shuffle<\/code> has some of the same characteristics as <code>OrderBy<\/code>, in that they&#8217;re both creating permutations of the input. That means that many of the ways we can specialize subsequent operations on the result of an <code>OrderBy<\/code> also apply to <code>Shuffle<\/code>. For example, <code>Shuffle.First<\/code> on an <code>IList&lt;T&gt;<\/code> can just select an element from the list at random. <code>Shuffle.Count<\/code> can just count the underlying source, since the order of the elements is irrelevant to the result. <code>Shuffle.Contains<\/code> can just perform the contains on the underlying source. Etc.
But my two favorite sequences are <code>Shuffle.Take<\/code> and <code>Shuffle.Take.Contains<\/code>.<\/p>\n<p><code>Shuffle.Take<\/code> provides an interesting optimization opportunity: whereas with <code>Shuffle<\/code> by itself we need to build the whole shuffled sequence, with a <code>Shuffle<\/code> followed immediately by a <code>Take(N)<\/code>, we only need to sample <code>N<\/code> items from the source. We still need those <code>N<\/code> items to be a uniformly random distribution, akin to what we&#8217;d get if we performed the buffering shuffle and then selected the first <code>N<\/code> items in the resulting array, but we can do so using an algorithm that allows us to avoid buffering everything. We need an algorithm that will let us iterate through the source data once, picking out elements as we go, and only ever buffering <code>N<\/code> items at a time. Enter &#8220;reservoir sampling.&#8221; I previously discussed reservoir sampling in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">Performance Improvements in .NET 8<\/a>, as it&#8217;s employed by the JIT as part of its dynamic PGO implementation, and we can use the algorithm here in <code>Shuffle<\/code> as well. Reservoir sampling provides exactly the single-pass, low-memory path we want: initialize a &#8220;reservoir&#8221; (an array) with the first <code>N<\/code> items, then as we scan the rest of the sequence, probabilistically overwrite one of the elements in our reservoir with the current item. 
The algorithm ensures that every element ends up in the reservoir with equal probability, yielding the same distribution as fully shuffling and taking <code>N<\/code>, but using only <code>O(N)<\/code> space and only making a single pass over an otherwise unknown-length source.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; _source = Enumerable.Range(1, 1000).ToList();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public List&lt;int&gt; ShuffleTakeManual() =&gt; ShuffleManual(_source).Take(10).ToList();\r\n\r\n    [Benchmark]\r\n    public List&lt;int&gt; ShuffleTakeLinq() =&gt; _source.Shuffle().Take(10).ToList();\r\n\r\n    private static IEnumerable&lt;int&gt; ShuffleManual(IEnumerable&lt;int&gt; source)\r\n    {\r\n        int[] arr = source.ToArray();\r\n        Random.Shared.Shuffle(arr);\r\n        foreach (var item in arr)\r\n        {\r\n            yield return item;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ShuffleTakeManual<\/td>\n<td style=\"text-align: right;\">4.150 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">4232 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ShuffleTakeLinq<\/td>\n<td style=\"text-align: right;\">3.801 us<\/td>\n<td style=\"text-align: right;\">0.92<\/td>\n<td style=\"text-align: right;\">192 B<\/td>\n<td 
style=\"text-align: right;\">0.05<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Shuffle.Take.Contains<\/code> is even more fun. We now have a probability problem that reads like a brain teaser or an SAT question. &#8220;I have <code>totalCount<\/code> items of which <code>equalCount<\/code> match my target value, and we&#8217;re going to pick <code>takeCount<\/code> items at random. What is the probability that at least one of those <code>takeCount<\/code> items is one of the <code>equalCount<\/code> items?&#8221; This is called a hypergeometric distribution, and we can use an implementation of it for <code>Shuffle.Take.Contains<\/code>.<\/p>\n<p>To make this easier to reason about, let&#8217;s talk candy. Imagine you have a jar of 100 jelly beans, of which 20 are your favorite flavor, Watermelon, and you&#8217;re going to pick 5 of the 100 beans at random; what are the chances you get at least one Watermelon? To solve this, we could reason through all the different ways we might get 1, 2, 3, 4, or 5 Watermelons, but instead, let&#8217;s do the opposite and think through how likely it is that we don&#8217;t get any (sad panda):<\/p>\n<ul>\n<li>The chance that our first pick isn&#8217;t a Watermelon is the number of non-Watermelons divided by the total number of beans, so <code>(100-20)\/100<\/code>.<\/li>\n<li>Once we&#8217;ve picked a bean out of the jar, we&#8217;re not putting it back, so the chance that our second pick isn&#8217;t a Watermelon is now <code>(99-20)\/99<\/code> (we have one fewer bean, but our first pick wasn&#8217;t a Watermelon, so there&#8217;s the same number of Watermelons as there was before).<\/li>\n<li>For a third pick, it&#8217;s now <code>(98-20)\/98<\/code>.<\/li>\n<li>And so on.<\/li>\n<\/ul>\n<p>After five rounds, we end up with <code>(80\/100) * (79\/99) * (78\/98) * (77\/97) * (76\/96)<\/code>, which is ~32%. If the chances I don&#8217;t get a Watermelon are ~32%, then the chances I do get a Watermelon are ~68%. 
Jelly beans aside, that&#8217;s our algorithm:<\/p>\n<pre><code class=\"language-csharp\">double probOfDrawingZeroMatches = 1;\r\nfor (long i = 0; i &lt; _takeCount; i++)\r\n{\r\n    probOfDrawingZeroMatches *= (double)(totalCount - i - equalCount) \/ (totalCount - i);\r\n}\r\n\r\nreturn Random.Shared.NextDouble() &gt; probOfDrawingZeroMatches;<\/code><\/pre>\n<p>The net effect is we can compute the answer much more efficiently than with a naive implementation that shuffles and then separately takes and separately contains.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; _source = Enumerable.Range(1, 1000).ToList();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool ShuffleTakeContainsManual() =&gt; ShuffleManual(_source).Take(10).Contains(2000);\r\n\r\n    [Benchmark]\r\n    public bool ShuffleTakeContainsLinq() =&gt; _source.Shuffle().Take(10).Contains(2000);\r\n\r\n    private static IEnumerable&lt;int&gt; ShuffleManual(IEnumerable&lt;int&gt; source)\r\n    {\r\n        int[] arr = source.ToArray();\r\n        Random.Shared.Shuffle(arr);\r\n        foreach (var item in arr)\r\n        {\r\n            yield return item;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ShuffleTakeContainsManual<\/td>\n<td style=\"text-align: right;\">3,900.99 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td 
style=\"text-align: right;\">4136 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ShuffleTakeContainsLinq<\/td>\n<td style=\"text-align: right;\">79.12 ns<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<td style=\"text-align: right;\">96 B<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>LINQ<\/code> in .NET 10 also sports some new methods that <em>are<\/em> about performance (at least in part), in particular <code>LeftJoin<\/code> and <code>RightJoin<\/code>, from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110872\">dotnet\/runtime#110872<\/a>. I say these are about performance because it&#8217;s already possible to achieve the left and right join semantics using existing LINQ surface area, and the new methods do it more efficiently.<\/p>\n<p><code>Enumerable.Join<\/code> implements an &#8220;inner join,&#8221; meaning only matching pairs from the two supplied collections appear in the output. For example, this code, which is joining based on the first letter in each string:<\/p>\n<pre><code class=\"language-csharp\">IEnumerable&lt;string&gt; left = [\"apple\", \"banana\", \"cherry\", \"date\", \"grape\", \"honeydew\"];\r\nIEnumerable&lt;string&gt; right = [\"aardvark\", \"dog\", \"elephant\", \"goat\", \"gorilla\", \"hippopotamus\"];\r\nforeach (string result in left.Join(right, s =&gt; s[0], s =&gt; s[0], (s1, s2) =&gt; $\"{s1} {s2}\"))\r\n{\r\n    Console.WriteLine(result);\r\n}<\/code><\/pre>\n<p>outputs:<\/p>\n<pre><code class=\"language-txt\">apple aardvark\r\ndate dog\r\ngrape goat\r\ngrape gorilla\r\nhoneydew hippopotamus<\/code><\/pre>\n<p>In contrast, a &#8220;left join&#8221; (also known as a &#8220;left outer join&#8221;) would yield the following:<\/p>\n<pre><code class=\"language-txt\">apple aardvark\r\nbanana\r\ncherry\r\ndate dog\r\ngrape goat\r\ngrape gorilla\r\nhoneydew hippopotamus<\/code><\/pre>\n<p>Note that it has all of the same output as with the &#8220;inner 
join,&#8221; except it has at least one row for every <code>left<\/code> element, even if there&#8217;s no matching element in the <code>right<\/code> row. And then a &#8220;right join&#8221; (also known as a &#8220;right outer join&#8221;) would yield the following:<\/p>\n<pre><code class=\"language-txt\">apple aardvark\r\ndate dog\r\n elephant\r\ngrape goat\r\ngrape gorilla\r\nhoneydew hippopotamus<\/code><\/pre>\n<p>Again, all the same output as with the &#8220;inner join,&#8221; except it has at least one row for every <code>right<\/code> element, even if there&#8217;s no matching element in the <code>left<\/code> row.<\/p>\n<p>Prior to .NET 10, there was no <code>LeftJoin<\/code> or <code>RightJoin<\/code>, but their semantics could be achieved using a combination of <code>GroupJoin<\/code>, <code>SelectMany<\/code>, and <code>DefaultIfEmpty<\/code>:<\/p>\n<pre><code class=\"language-csharp\">public static IEnumerable&lt;TResult&gt; LeftJoin&lt;TOuter, TInner, TKey, TResult&gt;(\r\n    this IEnumerable&lt;TOuter&gt; outer, IEnumerable&lt;TInner&gt; inner,\r\n    Func&lt;TOuter, TKey&gt; outerKeySelector, Func&lt;TInner, TKey&gt; innerKeySelector,\r\n    Func&lt;TOuter, TInner?, TResult&gt; resultSelector) =&gt;\r\n    outer\r\n    .GroupJoin(inner, outerKeySelector, innerKeySelector, (o, inners) =&gt; (o, inners))\r\n    .SelectMany(x =&gt; x.inners.DefaultIfEmpty(), (x, i) =&gt; resultSelector(x.o, i));<\/code><\/pre>\n<p><code>GroupJoin<\/code> creates a group for each <code>outer<\/code> (&#8220;left&#8221;) element, where the group contains all matching items from <code>inner<\/code> (&#8220;right&#8221;). We can flatten those results by using <code>SelectMany<\/code>, such that we end up with an output for each pairing, using <code>DefaultIfEmpty<\/code> to ensure that there&#8217;s always at least a default inner element to pair. 
We can do the exact same thing for a <code>RightJoin<\/code>: in fact, we can implement the right join just by delegating to the left join and flipping all the arguments:<\/p>\n<pre><code class=\"language-csharp\">public static IEnumerable&lt;TResult&gt; RightJoin&lt;TOuter, TInner, TKey, TResult&gt;(\r\n    this IEnumerable&lt;TOuter&gt; outer, IEnumerable&lt;TInner&gt; inner,\r\n    Func&lt;TOuter, TKey&gt; outerKeySelector, Func&lt;TInner, TKey&gt; innerKeySelector,\r\n    Func&lt;TOuter, TInner?, TResult&gt; resultSelector) =&gt;\r\n    inner.LeftJoin(outer, innerKeySelector, outerKeySelector, (i, o) =&gt; resultSelector(o, i));<\/code><\/pre>\n<p>Thankfully, you no longer need to do that yourself, and this isn&#8217;t how the new <code>LeftJoin<\/code> and <code>RightJoin<\/code> methods are implemented in .NET 10. We can see the difference with a benchmark:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Linq;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; Outer { get; } = Enumerable.Sequence(0, 1000, 2);\r\n    private IEnumerable&lt;int&gt; Inner { get; } = Enumerable.Sequence(0, 1000, 3);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void LeftJoin_Manual() =&gt;\r\n        ManualLeftJoin(Outer, Inner, o =&gt; o, i =&gt; i, (o, i) =&gt; o + i).Count();\r\n\r\n    [Benchmark]\r\n    public int LeftJoin_Linq() =&gt;\r\n        Outer.LeftJoin(Inner, o =&gt; o, i =&gt; i, (o, i) =&gt; o + i).Count();\r\n\r\n    private static IEnumerable&lt;TResult&gt; ManualLeftJoin&lt;TOuter, TInner, TKey, TResult&gt;(\r\n        IEnumerable&lt;TOuter&gt; outer, IEnumerable&lt;TInner&gt; inner,\r\n        
Func&lt;TOuter, TKey&gt; outerKeySelector, Func&lt;TInner, TKey&gt; innerKeySelector,\r\n        Func&lt;TOuter, TInner?, TResult&gt; resultSelector) =&gt;\r\n        outer\r\n        .GroupJoin(inner, outerKeySelector, innerKeySelector, (o, inners) =&gt; (o, inners))\r\n        .SelectMany(x =&gt; x.inners.DefaultIfEmpty(), (x, i) =&gt; resultSelector(x.o, i));\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>LeftJoin_Manual<\/td>\n<td style=\"text-align: right;\">29.02 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">65.84 KB<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>LeftJoin_Linq<\/td>\n<td style=\"text-align: right;\">15.23 us<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">36.95 KB<\/td>\n<td style=\"text-align: right;\">0.56<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Moving on from new methods, existing methods were also improved in other ways. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112401\">dotnet\/runtime#112401<\/a> from <a href=\"https:\/\/github.com\/miyaji255\">@miyaji255<\/a> improved the performance of <code>ToArray<\/code> and <code>ToList<\/code> following <code>Skip<\/code> and\/or <code>Take<\/code> calls. In the specialized iterator implementation used for <code>Take<\/code> and <code>Skip<\/code>, this PR simply checks in the <code>ToList<\/code> and <code>ToArray<\/code> implementations whether the source is something from which we can easily get a <code>ReadOnlySpan&lt;T&gt;<\/code> (namely a <code>T[]<\/code> or <code>List&lt;T&gt;<\/code>). 
If it is, rather than copying elements one by one into the destination, it can slice the retrieved span and use its <code>CopyTo<\/code>, which, depending on the <code>T<\/code>, may even be vectorized.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private readonly IEnumerable&lt;string&gt; _source = Enumerable.Range(0, 1000).Select(i =&gt; i.ToString()).ToArray();\r\n\r\n    [Benchmark]\r\n    public List&lt;string&gt; SkipTakeToList() =&gt; _source.Skip(200).Take(200).ToList();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SkipTakeToList<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,218.9 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SkipTakeToList<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">257.4 ns<\/td>\n<td style=\"text-align: right;\">0.21<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>LINQ in .NET 10 also sees a few notable enhancements for Native AOT. The code for LINQ has grown over time, as all of these various specializations have found their way into the codebase. These optimizations are generally implemented by deriving specialized iterators from a base <code>Iterator&lt;T&gt;<\/code>, which has a bunch of <code>abstract<\/code> or <code>virtual<\/code> methods for performing the subsequent operation (e.g. <code>Contains<\/code>). 
With Native AOT, any use of a method like <code>Enumerable.Contains<\/code> then prevents the corresponding implementations on <em>all<\/em> of those specializations from being trimmed away, leading to a non-trivial increase in assembly code size. As such, years ago multiple builds of <code>System.Linq.dll<\/code> were introduced into the <code>dotnet\/runtime<\/code> build system: one focused on speed, and one focused on size. When building <code>System.Linq.dll<\/code> to go with coreclr, you&#8217;d end up with the speed-optimized build that has all of these specializations. When building <code>System.Linq.dll<\/code> to go with other flavors, like Native AOT, you&#8217;d instead get the size-optimized build, which eschews many of the LINQ optimizations that have been added in the last decade. And as this was a build-time decision, developers using one of these platforms didn&#8217;t get a choice; as you learn in kindergarten, &#8220;you get what you get and you don&#8217;t get upset.&#8221; Now in .NET 10, if you do forget what you learned in kindergarten and you do get upset, you have recourse: thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111743\">dotnet\/runtime#111743<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109978\">dotnet\/runtime#109978<\/a>, this setting is now a feature switch rather than a build-time configuration. So, in particular, if you&#8217;re publishing for Native AOT and you&#8217;d prefer all the speed-focused optimizations, you can add <code>&lt;UseSizeOptimizedLinq&gt;false&lt;\/UseSizeOptimizedLinq&gt;<\/code> to your project file and be happy.<\/p>\n<p>However, the need for that switch is now also reduced significantly by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118156\">dotnet\/runtime#118156<\/a>.
When this size\/speed split was previously introduced into the <code>System.Linq.dll<\/code> build, all of these specializations were eschewed, without much analysis of the tradeoffs involved; as this was focused on optimizing for size, any specialized overrides were removed, no matter how much space they actually saved. Many of those savings turned out to be minimal, however, and in a variety of situations, the throughput cost was significant. This PR brings back some of the more impactful specializations where the throughput gains significantly outweigh the relatively minimal size cost.<\/p>\n<h3>Frozen Collections<\/h3>\n<p>The <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> and <code>FrozenSet&lt;T&gt;<\/code> collection types were introduced in .NET 8 as collections optimized for the common scenario of creating a long-lived collection that&#8217;s then read from <em>a lot<\/em>. They spend more time at construction in exchange for faster read operations. Under the covers, this is achieved in part by having specializations of the implementations that are optimized for different types of data or shapes of input. .NET 9 improved upon the implementations, and .NET 10 takes it even further.<\/p>\n<p><code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> expends a lot of energy on <code>TKey<\/code> as <code>string<\/code>, as that is such a common use case. It also has specializations for <code>TKey<\/code> as <code>Int32<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111886\">dotnet\/runtime#111886<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112298\">dotnet\/runtime#112298<\/a> extend that further by adding specializations for when <code>TKey<\/code> is any primitive integral type that&#8217;s the size of an <code>int<\/code> or smaller (e.g. <code>byte<\/code>, <code>char<\/code>, <code>ushort<\/code>, etc.) as well as enums backed by such primitives (which represent the vast, vast majority of enums used in practice).
In particular, they handle the common case where these values are densely packed, in which case they implement the dictionary as an array that it can index into based on the integer&#8217;s value. This makes for a very efficient lookup, while not consuming too much additional space: it&#8217;s only used when the values are dense and thus won&#8217;t be wasting many empty slots in the array.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Frozen;\r\nusing System.Net;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"status\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly FrozenDictionary&lt;HttpStatusCode, string&gt; s_statusDescriptions =\r\n        Enum.GetValues&lt;HttpStatusCode&gt;().Distinct()\r\n            .ToFrozenDictionary(status =&gt; status, status =&gt; status.ToString());\r\n\r\n    [Benchmark]\r\n    [Arguments(HttpStatusCode.OK)]\r\n    public string Get(HttpStatusCode status) =&gt; s_statusDescriptions[status];\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Get<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">2.0660 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Get<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">0.8735 ns<\/td>\n<td style=\"text-align: right;\">0.42<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Both <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> and <code>FrozenSet&lt;T&gt;<\/code> also improve with regards to the alternate lookup functionality introduced in .NET 9. 
Alternate lookups are a mechanism that enables getting a proxy for a dictionary or set that&#8217;s keyed with a key type different from <code>TKey<\/code>, most commonly a <code>ReadOnlySpan&lt;char&gt;<\/code> when <code>TKey<\/code> is <code>string<\/code>. As noted, both <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> and <code>FrozenSet&lt;T&gt;<\/code> achieve their goals by having different implementations based on the nature of the indexed data, and that specialization is achieved by virtual methods that derived specializations override. The JIT is typically able to minimize the costs of such virtuals, especially if the collections are stored in <code>static readonly<\/code> fields. However, the alternate lookup support complicated things, as it introduced a virtual method with a generic method parameter (the alternate key type), otherwise known as a generic virtual method, or GVM. &#8220;GVM&#8221; might as well be a four-letter word in performance circles, as GVMs are hard for the runtime to optimize. The purpose of these alternate lookups is primarily performance, but the use of a GVM significantly reduced those performance gains. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108732\">dotnet\/runtime#108732<\/a> from <a href=\"https:\/\/github.com\/andrewjsaid\">@andrewjsaid<\/a> addresses this by reducing how often a GVM needs to be invoked. Rather than the lookup operation itself being a generic virtual method, the PR introduces a separate generic virtual method that retrieves a delegate for performing the lookup; the retrieval of that delegate still incurs GVM penalties, but once the delegate is retrieved, it can be cached, and invoking it does not incur said overheads.
This results in measurable improvements in throughput.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Frozen;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly FrozenDictionary&lt;string, int&gt; s_d = new Dictionary&lt;string, int&gt; \r\n    {\r\n        [\"one\"] = 1, [\"two\"] = 2, [\"three\"] = 3, [\"four\"] = 4, [\"five\"] = 5, [\"six\"] = 6, \r\n        [\"seven\"] = 7, [\"eight\"] = 8, [\"nine\"] = 9, [\"ten\"] = 10, [\"eleven\"] = 11, [\"twelve\"] = 12,\r\n    }.ToFrozenDictionary();\r\n\r\n    [Benchmark]\r\n    public int Get()\r\n    {\r\n        var alternate = s_d.GetAlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt;();\r\n        return\r\n            alternate[\"one\"] + alternate[\"two\"] + alternate[\"three\"] + alternate[\"four\"] + alternate[\"five\"] +\r\n            alternate[\"six\"] + alternate[\"seven\"] + alternate[\"eight\"] + alternate[\"nine\"] + alternate[\"ten\"] + \r\n            alternate[\"eleven\"] + alternate[\"twelve\"];\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Get<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">133.46 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Get<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">81.39 ns<\/td>\n<td style=\"text-align: right;\">0.61<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>BitArray<\/h3>\n<p><code>BitArray<\/code> provides support for exactly what its name says, a bit array.
You create it with the desired number of values and can then read and write a <code>bool<\/code> for each index, setting the corresponding bit to <code>1<\/code> or <code>0<\/code> accordingly. It also provides a variety of helper operations for processing the whole bit array, such as Boolean logic operations like <code>And<\/code> and <code>Not<\/code>. Where possible, those operations are vectorized, taking advantage of SIMD to process many bits per instruction.<\/p>\n<p>However, for situations where you want to write custom manipulations of the bits, you only have two options: use the indexer (or corresponding <code>Get<\/code> and <code>Set<\/code> methods), which means multiple instructions are required to process each bit, or use <code>CopyTo<\/code> to extract all of the bits to a separate array, which means you need to allocate (or at least rent) such an array and pay for the memory copy before you can then manipulate the bits. There&#8217;s also not a great way to then copy those bits back if you wanted to manipulate the <code>BitArray<\/code> in place.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116308\">dotnet\/runtime#116308<\/a> adds a <code>CollectionsMarshal.AsBytes(BitArray)<\/code> method that returns a <code>Span&lt;byte&gt;<\/code> directly referencing the <code>BitArray<\/code>&#8217;s underlying storage. This provides a very efficient way to get access to all the bits, which then makes it possible to write (or reuse) vectorized algorithms. Say, for example, you wanted to use a <code>BitArray<\/code> to represent a binary embedding (an &#8220;embedding&#8221; is a vector representation of the semantic meaning of some data, basically an array of numbers, each one corresponding to some aspect of the data; a binary embedding uses a single bit for each number). To determine how semantically similar two inputs are, you get an embedding for each and then perform a distance or similarity calculation on the two. 
For binary embeddings, a common distance metric is &#8220;hamming distance,&#8221; which effectively lines up the bits and tells you the number of positions that have different values, e.g. <code>0b1100<\/code> and <code>0b1010<\/code> have a hamming distance of 2. Helpfully, <code>TensorPrimitives.HammingBitDistance<\/code> provides an implementation of this, accepting two <code>ReadOnlySpan&lt;T&gt;<\/code>s and computing the number of bits that differ between them. With <code>CollectionsMarshal.AsBytes<\/code>, we can now utilize that helper directly with the contents of <code>BitArray<\/code>s, both saving us the effort of having to write it manually and benefiting from any optimizations in <code>HammingBitDistance<\/code> itself.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Update benchmark.csproj with a package reference to System.Numerics.Tensors.\r\n\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections;\r\nusing System.Numerics.Tensors;\r\nusing System.Runtime.InteropServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private BitArray _bits1, _bits2;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        Random r = new(42);\r\n        byte[] bytes = new byte[128];\r\n\r\n        r.NextBytes(bytes);\r\n        _bits1 = new BitArray(bytes);\r\n\r\n        r.NextBytes(bytes);\r\n        _bits2 = new BitArray(bytes);\r\n    }\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public long HammingDistanceManual()\r\n    {\r\n        long distance = 0;\r\n        for (int i = 0; i &lt; _bits1.Length; i++)\r\n        {\r\n            if (_bits1[i] != _bits2[i])\r\n            {\r\n                distance++;\r\n            }\r\n        }\r\n\r\n        return distance;\r\n    }\r\n\r\n    
[Benchmark]\r\n    public long HammingDistanceTensorPrimitives() =&gt;\r\n        TensorPrimitives.HammingBitDistance(\r\n            CollectionsMarshal.AsBytes(_bits1),\r\n            CollectionsMarshal.AsBytes(_bits2));\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HammingDistanceManual<\/td>\n<td style=\"text-align: right;\">1,256.72 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>HammingDistanceTensorPrimitives<\/td>\n<td style=\"text-align: right;\">63.29 ns<\/td>\n<td style=\"text-align: right;\">0.05<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The main motivation for this PR was adding the <code>AsBytes<\/code> method, but doing so triggered a series of other modifications that themselves help with performance. For example, rather than backing the <code>BitArray<\/code> with an <code>int[]<\/code> as was previously done, it&#8217;s now backed by a <code>byte[]<\/code>, and rather than reading elements one by one in the <code>byte[]<\/code>-based constructor, vectorized copy operations are now being used (they were already being used and continue to be used in the <code>int[]<\/code>-based constructor).<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _byteData = Enumerable.Range(0, 512).Select(i =&gt; (byte)i).ToArray();\r\n\r\n    [Benchmark]\r\n    public BitArray ByteCtor() =&gt; new BitArray(_byteData);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: 
right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ByteCtor<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">160.10 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ByteCtor<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">83.07 ns<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Other Collections<\/h3>\n<p>There are a variety of other notable improvements in collections:<\/p>\n<ul>\n<li><strong><code>List&lt;T&gt;<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107683\">dotnet\/runtime#107683<\/a> from <a href=\"https:\/\/github.com\/karakasa\">@karakasa<\/a> builds on a change that was made in .NET 9 to improve the performance of using <code>InsertRange<\/code> on a <code>List&lt;T&gt;<\/code> to insert a <code>ReadOnlySpan&lt;T&gt;<\/code>. When a full <code>List&lt;T&gt;<\/code> is appended to, the typical process is a new larger array is allocated, all of the existing elements are copied over (one array copy), and then the new element is stored into the array in the next available slot. If that same growth routine is used when <em>inserting<\/em> rather than <em>appending<\/em> an element, you possibly end up copying some elements twice: you first copy over all of the elements into the new array, and then to handle the insert, you may again need to copy some of the elements you already copied as part of shifting them to make room for the insertion at the new location. In the extreme, if you&#8217;re inserting at index 0, you copy all of the elements into the new array, and then you copy all of the elements again to shift them by one slot. 
The same applies when inserting a range of elements, so with this PR, rather than first copying over all of the elements and then shifting a subset, <code>List&lt;T&gt;<\/code> now grows by copying the elements above and below the target range for the insertion to their correct location and then fills in the target range with the inserted elements.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private readonly int[] _data = [1, 2, 3, 4];\r\n\r\n    [Benchmark]\r\n    public List&lt;int&gt; Test()\r\n    {\r\n        List&lt;int&gt; list = new(4);\r\n        list.AddRange(_data);\r\n        list.InsertRange(0, _data);\r\n        return list;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">48.65 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">30.07 ns<\/td>\n<td style=\"text-align: right;\">0.62<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><strong><code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108065\">dotnet\/runtime#108065<\/a> from <a href=\"https:\/\/github.com\/koenigst\">@koenigst<\/a> changes how a <code>ConcurrentDictionary<\/code>&#8216;s backing array is sized when it&#8217;s cleared. 
<code>ConcurrentDictionary<\/code> is implemented with an array of linked lists, and when the collection is constructed, a constructor parameter allows for presizing that array. Due to the concurrent nature of the dictionary and its implementation, <code>Clear<\/code>&#8217;ing it necessitates creating a new array rather than just using part of the old one. When that new array was created, it was reset to the default size. This PR tweaks that to remember the initial capacity requested by the user and to use that initial size again when constructing the new array.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Concurrent;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private ConcurrentDictionary&lt;int, int&gt; _data = new(concurrencyLevel: 1, capacity: 1024);\r\n\r\n    [Benchmark]\r\n    public void ClearAndAdd()\r\n    {\r\n        _data.Clear();\r\n        for (int i = 0; i &lt; 1024; i++)\r\n        {\r\n            _data.TryAdd(i, i);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ClearAndAdd<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">51.95 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">134.36 KB<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ClearAndAdd<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">30.32 us<\/td>\n<td 
style=\"text-align: right;\">0.58<\/td>\n<td style=\"text-align: right;\">48.73 KB<\/td>\n<td style=\"text-align: right;\">0.36<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><strong><code>Dictionary&lt;TKey, TValue&gt;<\/code><\/strong>. <code>Dictionary<\/code> is one of the most popular collection types across .NET, and <code>TKey<\/code> == <code>string<\/code> is one of (if not <em>the<\/em>) most popular forms. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117427\">dotnet\/runtime#117427<\/a> makes dictionary lookups with constant <code>string<\/code>s much faster. You might expect it would be a complicated change, but it ends up being just a few strategic tweaks. A variety of methods for operating on <code>string<\/code>s are already known to the JIT and already have optimized implementations for when dealing with constants. All this PR needed to do was change which methods <code>Dictionary&lt;TKey, TValue&gt;<\/code> was using in its optimized <code>TryGetValue<\/code> lookup path, and because that path is often inlined, a constant argument to <code>TryGetValue<\/code> can be exposed as a constant to these helpers, e.g. 
<code>string.Equals<\/code>.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Dictionary&lt;string, int&gt; _data = new() { [\"a\"] = 1, [\"b\"] = 2, [\"c\"] = 3, [\"d\"] = 4, [\"e\"] = 5 };\r\n\r\n    [Benchmark]\r\n    public int Get() =&gt; _data[\"a\"] + _data[\"b\"] + _data[\"c\"] + _data[\"d\"] + _data[\"e\"];\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Get<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">33.81 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Get<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">14.02 ns<\/td>\n<td style=\"text-align: right;\">0.41<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><strong><code>OrderedDictionary&lt;TKey, TValue&gt;<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109324\">dotnet\/runtime#109324<\/a> adds new overloads of <code>TryAdd<\/code> and <code>TryGetValue<\/code> that provide the index of the added or retrieved element in the collection. This index can then be used in subsequent operations on the dictionary to access the same slot. 
For example, if you want to implement an <code>AddOrUpdate<\/code> operation on top of <code>OrderedDictionary<\/code>, you need to perform one or two operations, first trying to add the item, and then if found to already exist, updating it, and that update can benefit from targeting the exact index that contains the element rather than it needing to do another keyed lookup.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private OrderedDictionary&lt;string, int&gt; _dictionary = new();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void Old() =&gt; AddOrUpdate_Old(_dictionary, \"key\", k =&gt; 1, (k, v) =&gt; v + 1);\r\n\r\n    [Benchmark]\r\n    public void New() =&gt; AddOrUpdate_New(_dictionary, \"key\", k =&gt; 1, (k, v) =&gt; v + 1);\r\n\r\n    private static void AddOrUpdate_Old(OrderedDictionary&lt;string, int&gt; d, string key, Func&lt;string, int&gt; addFunc, Func&lt;string, int, int&gt; updateFunc)\r\n    {\r\n        if (d.TryGetValue(key, out int existing))\r\n        {\r\n            d[key] = updateFunc(key, existing);\r\n        }\r\n        else\r\n        {\r\n            d.Add(key, addFunc(key));\r\n        }\r\n    }\r\n\r\n    private static void AddOrUpdate_New(OrderedDictionary&lt;string, int&gt; d, string key, Func&lt;string, int&gt; addFunc, Func&lt;string, int, int&gt; updateFunc)\r\n    {\r\n        if (d.TryGetValue(key, out int existing, out int index))\r\n        {\r\n            d.SetAt(index, updateFunc(key, existing));\r\n        }\r\n        else\r\n        {\r\n            d.Add(key, addFunc(key));\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: 
right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">6.961 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">4.201 ns<\/td>\n<td style=\"text-align: right;\">0.60<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><strong><code>ImmutableArray&lt;T&gt;<\/code><\/strong>. The <code>ImmutableCollectionsMarshal<\/code> class already exposes an <code>AsArray<\/code> method that enables retrieving the backing <code>T[]<\/code> from an <code>ImmutableArray&lt;T&gt;<\/code>. However, if you had an <code>ImmutableArray&lt;T&gt;.Builder<\/code>, there was previously no way to access the backing store it was using. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112177\">dotnet\/runtime#112177<\/a> enables doing so, with an <code>AsMemory<\/code> method that retrieves the underlying storage as a <code>Memory&lt;T&gt;<\/code>.<\/li>\n<li><strong><code>InlineArray<\/code><\/strong>. .NET 8 introduced <code>InlineArrayAttribute<\/code>, which can be used to attribute a struct containing a single field; the attribute accepts a count, and the runtime replicates the struct&#8217;s field that number of times, as if you&#8217;d logically copy\/pasted the field repeatedly. The runtime also ensures that the storage is contiguous and appropriately aligned, such that if you had an indexable collection that pointed to the beginning of the struct, you could use it as an array. And it so happens such a collection exists: <code>Span&lt;T&gt;<\/code>. C# 12 then makes it easy to treat any such attributed struct as a span, e.g.\n<pre><code class=\"language-csharp\">[InlineArray(8)]\r\ninternal struct EightStrings\r\n{\r\n    private string _field;\r\n}\r\n...\r\nEightStrings strings = default;\r\nSpan&lt;string&gt; span = strings;<\/code><\/pre>\n<p>The C# compiler will itself emit code that uses this capability. 
For example, if you use collection expressions to initialize a span, you&#8217;re likely triggering the compiler to emit an <code>InlineArray<\/code>. When I write this:<\/p>\n<pre><code class=\"language-csharp\">public void M(int a, int b, int c, int d) \r\n{\r\n    Span&lt;int&gt; span = [a, b, c, d];\r\n}<\/code><\/pre>\n<p>the compiler emits something like the following equivalent:<\/p>\n<pre><code class=\"language-csharp\">public void M(int a, int b, int c, int d)\r\n{\r\n    &lt;&gt;y__InlineArray4&lt;int&gt; buffer = default(&lt;&gt;y__InlineArray4&lt;int&gt;);\r\n    &lt;PrivateImplementationDetails&gt;.InlineArrayElementRef&lt;&lt;&gt;y__InlineArray4&lt;int&gt;, int&gt;(ref buffer, 0) = a;\r\n    &lt;PrivateImplementationDetails&gt;.InlineArrayElementRef&lt;&lt;&gt;y__InlineArray4&lt;int&gt;, int&gt;(ref buffer, 1) = b;\r\n    &lt;PrivateImplementationDetails&gt;.InlineArrayElementRef&lt;&lt;&gt;y__InlineArray4&lt;int&gt;, int&gt;(ref buffer, 2) = c;\r\n    &lt;PrivateImplementationDetails&gt;.InlineArrayElementRef&lt;&lt;&gt;y__InlineArray4&lt;int&gt;, int&gt;(ref buffer, 3) = d;\r\n    &lt;PrivateImplementationDetails&gt;.InlineArrayAsSpan&lt;&lt;&gt;y__InlineArray4&lt;int&gt;, int&gt;(ref buffer, 4);\r\n}<\/code><\/pre>\n<p>where it has defined that <code>&lt;&gt;y__InlineArray4<\/code> like this:<\/p>\n<pre><code class=\"language-csharp\">[StructLayout(LayoutKind.Auto)]\r\n[InlineArray(4)]\r\ninternal struct &lt;&gt;y__InlineArray4&lt;T&gt;\r\n{\r\n    [CompilerGenerated]\r\n    private T _element0;\r\n}<\/code><\/pre>\n<p>This shows up elsewhere, too. 
For example, C# 13 introduced support for using <code>params<\/code> with collections other than arrays, including spans, so now I can write this:<\/p>\n<pre><code class=\"language-csharp\">public void Caller(int a, int b, int c, int d) =&gt; M(a, b, c, d);\r\n\r\npublic void M(params ReadOnlySpan&lt;int&gt; span) { }<\/code><\/pre>\n<p>and for <code>Caller<\/code> we&#8217;ll see code emitted that&#8217;s very similar to what I previously showed, with the compiler manufacturing such an <code>InlineArray<\/code> type. As you might imagine, the popularity of the features that cause the compiler to produce these types has led to a lot of them being emitted. Each type is specific to a particular length, so while the compiler will reuse them, a) it can end up needing to emit a lot to cover different lengths, and b) it emits them as internal to each assembly that needs them, so there can end up being a lot of duplication. Looking just at the shared framework for .NET 9 (the core libraries like <code>System.Private.CoreLib<\/code> that ship as part of the runtime), there are ~140 of these types&#8230; all of which are for sizes no larger than 8. For .NET 10, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113403\">dotnet\/runtime#113403<\/a> adds a set of public <code>InlineArray2&lt;T&gt;<\/code>, <code>InlineArray3&lt;T&gt;<\/code>, etc., that should cover the vast majority of sizes for which the compiler would otherwise need to emit types. In the near future, the C# compiler will be updated to use those new types when available instead of emitting its own, thereby yielding non-trivial size savings.<\/li>\n<\/ul>\n<h2>I\/O<\/h2>\n<p>In previous .NET releases, there have been concerted efforts to improve specific areas of I\/O performance, such as completely rewriting <code>FileStream<\/code> in .NET 6. 
Nothing as comprehensive as that was done for I\/O in .NET 10, but there are some nice one-off improvements that can still have a measurable impact on certain scenarios.<\/p>\n<p>On Unix, when a <code>MemoryMappedFile<\/code> is created and it&#8217;s not associated with a particular <code>FileStream<\/code>, it needs to create some kind of backing memory for the MMF&#8217;s data. On Linux, it&#8217;d try to use <code>shm_open<\/code>, which creates a shared memory object with appropriate semantics. However, in the years since <code>MemoryMappedFile<\/code> was initially enabled on Linux, the Linux kernel has added support for anonymous files and the <code>memfd_create<\/code> function that creates them. These are ideal for <code>MemoryMappedFile<\/code> and much more efficient, so <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105178\">dotnet\/runtime#105178<\/a> from <a href=\"https:\/\/github.com\/am11\">@am11<\/a> switches over to using <code>memfd_create<\/code> when it&#8217;s available.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.MemoryMappedFiles;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public void MMF()\r\n    {\r\n        using MemoryMappedFile mff = MemoryMappedFile.CreateNew(null, 12345);\r\n        using MemoryMappedViewAccessor accessor = mff.CreateViewAccessor();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>MMF<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">9.916 us<\/td>\n<td style=\"text-align: 
right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>MMF<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">6.358 us<\/td>\n<td style=\"text-align: right;\">0.64<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>FileSystemWatcher<\/code> is improved in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116830\">dotnet\/runtime#116830<\/a>. The primary purpose of this PR was to fix a memory leak, where, on Windows, disposing of a <code>FileSystemWatcher<\/code> while it was in use could end up leaking some objects. However, it also addresses a performance issue specific to Windows. <code>FileSystemWatcher<\/code> needs to pass a buffer to the OS for the OS to populate with file-changed information. That meant that <code>FileSystemWatcher<\/code> was allocating a managed array and then immediately pinning that buffer so it could pass a pointer to it into native code. For certain uses of <code>FileSystemWatcher<\/code>, especially in scenarios where lots of <code>FileSystemWatcher<\/code> instances are created, that pinning could contribute to non-trivial heap fragmentation. Interestingly, though, this array is effectively never consumed as an array: all of the writes into it are performed in native code via the pointer that was passed to the OS, and all consumption of it in managed code to read out the events is done via a span. 
That means the array nature of it doesn&#8217;t really matter, and we&#8217;re better off just allocating a native rather than managed buffer that then requires pinning.<\/p>\n<pre><code class=\"language-csharp\">\/\/ Run on Windows.\r\n\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public void FSW()\r\n    {\r\n        using FileSystemWatcher fsw = new(Environment.CurrentDirectory);\r\n        fsw.EnableRaisingEvents = true;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FSW<\/td>\n<td>.NET 9<\/td>\n<td style=\"text-align: right;\">61.46 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">8944 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FSW<\/td>\n<td>.NET 10<\/td>\n<td style=\"text-align: right;\">61.21 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">744 B<\/td>\n<td style=\"text-align: right;\">0.08<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>BufferedStream<\/code> gets a boost from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104822\">dotnet\/runtime#104822<\/a> from <a href=\"https:\/\/github.com\/ANahr\">@ANahr<\/a>. There is a curious and problematic inconsistency in <code>BufferedStream<\/code> that&#8217;s been there since, well, forever as far as I can tell. 
It&#8217;s obviously been revisited in the past, and due to the super duper strong backwards compatibility concerns for .NET Framework (where a key feature is that the framework doesn&#8217;t change), the issue was never fixed. There&#8217;s even a <a href=\"https:\/\/github.com\/microsoft\/referencesource\/blob\/f7df9e2399ecd273e90908ac11caf1433e142448\/mscorlib\/system\/io\/bufferedstream.cs#L1263\">comment in the code<\/a> to this point:<\/p>\n<pre><code class=\"language-csharp\">\/\/ We should not be flushing here, but only writing to the underlying stream, but previous version flushed, so we keep this.<\/code><\/pre>\n<p>A <code>BufferedStream<\/code> does what its name says. It wraps an underlying <code>Stream<\/code> and buffers access to it. So, for example, if it were configured with a buffer size of 1000, and you wrote 100 bytes to the <code>BufferedStream<\/code> at a time, your first 10 writes would just go to the buffer and the underlying <code>Stream<\/code> wouldn&#8217;t be touched at all. Only on the 11th write would the buffer be full and need to be flushed (meaning written) to the underlying <code>Stream<\/code>. So far, so good. Moreover, there&#8217;s a difference between flushing to the underlying stream and flushing the underlying stream. Those sound almost identical, but they&#8217;re not: in the former case, we&#8217;re effectively calling <code>_stream.Write(buffer)<\/code> to write the buffer to that stream, and in the latter case, we&#8217;re effectively calling <code>_stream.Flush()<\/code> to force any buffering <em>that<\/em> stream was doing to propagate it to <em>its<\/em> underlying destination. <code>BufferedStream<\/code> really shouldn&#8217;t be in the business of doing the latter when <code>Write<\/code>&#8216;ing to the <code>BufferedStream<\/code>, and in general it wasn&#8217;t&#8230; except in one case. 
Whereas most of the writing-related methods would not call <code>_stream.Flush()<\/code>, for some reason <code>WriteByte<\/code> did. In particular, for cases where the <code>BufferedStream<\/code> is configured with a small buffer, and where the underlying stream&#8217;s flush is relatively expensive (e.g. <code>DeflateStream.Flush<\/code> forces any buffered bytes to be compressed and emitted), that can be problematic for performance, never mind the inconsistency. This change simply fixes the inconsistency, such that <code>WriteByte<\/code> no longer forces a flush on the underlying stream.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Compression;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _bytes;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _bytes = new byte[1024 * 1024];\r\n        new Random(42).NextBytes(_bytes);\r\n    }\r\n\r\n    [Benchmark]\r\n    public void WriteByte()\r\n    {\r\n        using Stream s = new BufferedStream(new DeflateStream(Stream.Null, CompressionLevel.SmallestSize), 256);\r\n        foreach (byte b in _bytes)\r\n        {\r\n            s.WriteByte(b);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WriteByte<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">73.87 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WriteByte<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">17.77 ms<\/td>\n<td style=\"text-align: 
right;\">0.24<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>While on the subject of compression, it&#8217;s worth calling out several improvements in <code>System.IO.Compression<\/code> in .NET 10, too. As noted in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-9\/\">Performance Improvements in .NET 9<\/a>, <code>DeflateStream<\/code>\/<code>GZipStream<\/code>\/<code>ZLibStream<\/code> are managed wrappers around an underlying native <code>zlib<\/code> library. For a long time, that was the original <code>zlib<\/code> (<a href=\"https:\/\/github.com\/madler\/zlib\">madler\/zlib<\/a>). Then it was Intel&#8217;s <code>zlib-intel<\/code> fork (<a href=\"https:\/\/github.com\/intel\/zlib\">intel\/zlib<\/a>), which is now archived and no longer maintained. In .NET 9, the library switched to using <code>zlib-ng<\/code> (<a href=\"https:\/\/github.com\/zlib-ng\/zlib-ng\">zlib-ng\/zlib-ng<\/a>), which is a modernized fork that&#8217;s well-maintained and optimized for a large number of hardware architectures. .NET 9 is based on <code>zlib-ng<\/code> 2.2.1. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118457\">dotnet\/runtime#118457<\/a> updates it to use <code>zlib-ng<\/code> 2.2.5. Compared with the 2.2.1 release, there are a variety of performance improvements in <code>zlib-ng<\/code> itself, which .NET 10 then inherits, such as improved use of AVX2 and AVX512. Most importantly, though, the update includes a <a href=\"https:\/\/github.com\/zlib-ng\/zlib-ng\/pull\/1938\">revert<\/a> that undoes a cleanup change in the 2.2.0 release; the original change removed a workaround for a function that had been slow and was found to no longer be slow, but as it turns out, it&#8217;s still slow in some circumstances (long, <em>highly<\/em> compressible data), resulting in a throughput regression. 
The fix in 2.2.5 puts back the workaround to fix the regression.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Compression;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _data = new HttpClient().GetByteArrayAsync(@\"https:\/\/raw.githubusercontent.com\/dotnet\/runtime-assets\/8d362e624cde837ec896e7fff04f2167af68cba0\/src\/System.IO.Compression.TestData\/DeflateTestData\/xargs.1\").Result;\r\n\r\n    [Benchmark]\r\n    public void Compress()\r\n    {\r\n        using ZLibStream z = new(Stream.Null, CompressionMode.Compress);\r\n        for (int i = 0; i &lt; 100; i++)\r\n        {\r\n            z.Write(_data);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Compress<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">202.79 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Compress<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">70.45 us<\/td>\n<td style=\"text-align: right;\">0.35<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The managed wrapper for <code>zlib<\/code> also gains some improvements. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113587\">dotnet\/runtime#113587<\/a> from <a href=\"https:\/\/github.com\/edwardneal\">@edwardneal<\/a> improves the case where multiple gzip payloads are being read from the underlying <code>Stream<\/code>. 
Due to the nature of the gzip format, multiple complete gzip payloads can be written one after the other, and a single <code>GZipStream<\/code> can be used to decompress all of them as if they were one. Each time it hit a boundary between payloads, the managed wrapper was throwing away the old interop handles and creating new ones, but it can instead take advantage of reset capabilities in the underlying <code>zlib<\/code> library, shaving off some cycles associated with freeing and re-allocating the underlying data structures. This is a very biased micro-benchmark (a stream containing 1000 gzip payloads, each of which decompresses into a single byte), highlighting the worst case, but it exemplifies the issue:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Compression;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private MemoryStream _data;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _data = new MemoryStream();\r\n        for (int i = 0; i &lt; 1000; i++)\r\n        {\r\n            using GZipStream gzip = new(_data, CompressionMode.Compress, leaveOpen: true);\r\n            gzip.WriteByte(42);\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Decompress()\r\n    {\r\n        _data.Position = 0;\r\n        using GZipStream gzip = new(_data, CompressionMode.Decompress, leaveOpen: true);\r\n        gzip.CopyTo(Stream.Null);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Decompress<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">331.3 us<\/td>\n<td 
style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Decompress<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">104.3 us<\/td>\n<td style=\"text-align: right;\">0.31<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other components that sit above these streams, like <code>ZipArchive<\/code>, have also improved. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103153\">dotnet\/runtime#103153<\/a> from <a href=\"https:\/\/github.com\/edwardneal\">@edwardneal<\/a> updates <code>ZipArchive<\/code> to not rely on <code>BinaryReader<\/code> and <code>BinaryWriter<\/code>, avoiding their underlying buffer allocations and having more fine-grained control over how and when exactly data is encoded\/decoded and written\/read. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102704\">dotnet\/runtime#102704<\/a> from <a href=\"https:\/\/github.com\/edwardneal\">@edwardneal<\/a> reduces memory consumption and allocation when updating <code>ZipArchive<\/code>s. A <code>ZipArchive<\/code> update used to be &#8220;rewrite the world&#8221;: it loaded every entry&#8217;s data into memory and rewrote all the file headers, all entry data, and the &#8220;central directory&#8221; (what the zip format calls its catalog of all the entries in the archive). A large archive would incur proportionally large allocations. This PR introduces change tracking plus ordering of entries so that only the portion of the file from the first actually affected entry (or one whose variable\u2011length metadata\/data changed) is rewritten, rather than always rewriting the whole thing. 
The effects can be significant.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Compression;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Stream _zip = new MemoryStream();\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        using ZipArchive zip = new(_zip, ZipArchiveMode.Create, leaveOpen: true);\r\n\r\n        Random r = new(42);\r\n        for (int i = 0; i &lt; 1000; i++)\r\n        {\r\n            byte[] fileBytes = new byte[r.Next(512, 2048)];\r\n            r.NextBytes(fileBytes);\r\n            using Stream s = zip.CreateEntry($\"file{i}.txt\").Open();\r\n            s.Write(fileBytes);\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Update()\r\n    {\r\n        _zip.Position = 0;\r\n        using ZipArchive zip = new(_zip, ZipArchiveMode.Update, leaveOpen: true);\r\n        zip.GetEntry(\"file987.txt\")?.Delete();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Update<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">987.8 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">2173.9 KB<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Update<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">354.7 us<\/td>\n<td style=\"text-align: right;\">0.36<\/td>\n<td style=\"text-align: right;\">682.22 KB<\/td>\n<td 
style=\"text-align: right;\">0.31<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(<code>ZipArchive<\/code> and <code>ZipFile<\/code> also gain async APIs in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114421\">dotnet\/runtime#114421<\/a>, a long requested feature that allows using async I\/O while loading, manipulating, and saving zips.)<\/p>\n<p>Finally, somewhere between performance and reliability, <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/7390\">dotnet\/roslyn-analyzers#7390<\/a> from <a href=\"https:\/\/github.com\/mpidash\">@mpidash<\/a> adds a new analyzer for <code>StreamReader.EndOfStream<\/code>. <code>StreamReader.EndOfStream<\/code> seems like it should be harmless, but it&#8217;s quite the devious little property. The intent is to determine whether the reader is at the end of the underlying <code>Stream<\/code>. Seems easy enough. If the <code>StreamReader<\/code> still has previously read data buffered, obviously it&#8217;s not at the end. And if the reader has previously seen EOF, e.g. <code>Read<\/code> returned <code>0<\/code>, then it obviously is at the end. But in all other situations, there&#8217;s no way to know you&#8217;re at the end of the stream (at least in the general case) without performing a read, which means this property does something properties should never do: perform I\/O. Worse than just performing I\/O, that read can be a blocking operation, e.g. if the <code>Stream<\/code> represents a network stream for a <code>Socket<\/code>, and performing a read actually means blocking until data is received. Even worse, though, is when it&#8217;s used in an asynchronous method, e.g.<\/p>\n<pre><code class=\"language-csharp\">while (!reader.EndOfStream)\r\n{\r\n    string? 
line = await reader.ReadLineAsync();\r\n    ...\r\n}<\/code><\/pre>\n<p>Now not only might <code>EndOfStream<\/code> do I\/O and block, it&#8217;s doing that in a method that&#8217;s supposed to do all of its waiting asynchronously.<\/p>\n<p>What makes this even more frustrating is that <code>EndOfStream<\/code> isn&#8217;t even useful in a loop like that above. <code>ReadLineAsync<\/code> will return a <code>null<\/code> string if it&#8217;s at the end of the stream, so the loop would instead be better as:<\/p>\n<pre><code class=\"language-csharp\">while (await reader.ReadLineAsync() is string line)\r\n{\r\n    ...\r\n}<\/code><\/pre>\n<p>Simpler, cheaper, and no ticking time bombs of synchronous I\/O. Thanks to this new analyzer, any such use of <code>EndOfStream<\/code> in an async method will trigger <code>CA2024<\/code>:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2025\/09\/CA2024.png\" alt=\"CA2024 Analyzer\" \/><\/p>\n<h2>Networking<\/h2>\n<p>Networking-related operations show up in almost every modern workload. Past releases of .NET have seen a lot of energy exerted on whittling away at networking overheads, as these components are used over and over and over, often in critical paths, and the overheads can add up. .NET 10 continues the streamlining trend.<\/p>\n<p>As was seen with core primitives earlier, <code>IPAddress<\/code> and <code>IPNetwork<\/code> are both imbued with UTF8 parsing capabilities, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102144\">dotnet\/runtime#102144<\/a> from <a href=\"https:\/\/github.com\/edwardneal\">@edwardneal<\/a>. As is the case with most other such types in the core libraries, the UTF8-based implementation and the UTF16-based implementation are mostly the same implementation, sharing most of their code via generic methods parameterized on <code>byte<\/code> vs <code>char<\/code>. 
And as a result of the focus on enabling UTF8, not only can you parse UTF8 bytes directly rather than needing to transcode first, the existing code actually gets a bit faster.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"s\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(\"Fe08::1%13542\")]\r\n    public IPAddress Parse(string s) =&gt; IPAddress.Parse(s);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Parse<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">71.35 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Parse<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">54.60 ns<\/td>\n<td style=\"text-align: right;\">0.77<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>IPAddress<\/code> is also imbued with <code>IsValid<\/code> and <code>IsValidUtf8<\/code> methods, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111433\">dotnet\/runtime#111433<\/a>. 
It was previously possible to test the validity of an address via <code>TryParse<\/code>, but when successful, that would allocate the <code>IPAddress<\/code>; if you don&#8217;t need the resulting object but just need to know whether it&#8217;s valid, the extra allocation is wasteful.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _address = \"123.123.123.123\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool TryParse() =&gt; IPAddress.TryParse(_address, out _);\r\n\r\n    [Benchmark]\r\n    public bool IsValid() =&gt; IPAddress.IsValid(_address);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryParse<\/td>\n<td style=\"text-align: right;\">26.26 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IsValid<\/td>\n<td style=\"text-align: right;\">21.88 ns<\/td>\n<td style=\"text-align: right;\">0.83<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Uri<\/code>, used in the above benchmark, also gets some notable improvements. In fact, one of my favorite improvements in all of .NET 10 is in <code>Uri<\/code>. 
The feature itself isn&#8217;t a performance improvement, but there are some interesting performance-related ramifications for it. In particular, since forever, <code>Uri<\/code> has had a length limitation due to implementation details. <code>Uri<\/code> keeps track of various offsets in the input, such as where the host portion starts, where the path starts, where the query starts, and so on. The implementer chose to use <code>ushort<\/code> for each of these values rather than <code>int<\/code>. That means the maximum length of a <code>Uri<\/code> is then constrained to the lengths a <code>ushort<\/code> can describe, namely 65,535 characters. That sounds like a ridiculously long <code>Uri<\/code>, one no one would ever need to go beyond&#8230; until you consider data URIs. Data URIs embed a representation of arbitrary bytes, typically Base64 encoded, in the URI itself. This allows for files to be represented directly in links, and it&#8217;s become a common way for AI-related services to send and receive data payloads, like images. It doesn&#8217;t take a very large image to exceed 65K characters, however, especially with Base64 encoding increasing the payload size by ~33%. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117287\">dotnet\/runtime#117287<\/a> finally removes that limitation, so now <code>Uri<\/code> can be used to represent very large data URIs, if desired. This, however, has some performance ramifications (beyond the few percent increase in the size of <code>Uri<\/code>, to accommodate the extra bytes from widening <code>ushort<\/code> to <code>int<\/code>). 
In particular, <code>Uri<\/code> implements path compression, so for example this:<\/p>\n<pre><code class=\"language-csharp\">Console.WriteLine(new Uri(\"http:\/\/test\/hello\/..\/hello\/..\/hello\"));<\/code><\/pre>\n<p>prints out:<\/p>\n<pre><code class=\"language-txt\">http:\/\/test\/hello<\/code><\/pre>\n<p>As it turns out, the algorithm implementing that path compression is <code>O(N^2)<\/code>. Oops. With a limit of 65K characters, such a quadratic complexity isn&#8217;t a security concern (as <code>O(N^2)<\/code> operations can sometimes be: if <code>N<\/code> is unbounded, it creates an attack vector where an attacker can do <code>N<\/code> work and get the attackee to do disproportionately more). But once the limit is removed entirely, it could be. As such, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117820\">dotnet\/runtime#117820<\/a> compensates by making the path compression <code>O(N)<\/code>. And while in the general case, we don&#8217;t expect path compression to be a meaningfully impactful part of constructing <code>Uri<\/code>, in degenerate cases, even under the old limit, the change can still make a measurable improvement.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _input = $\"http:\/\/host\/{string.Concat(Enumerable.Repeat(\"a\/..\/\", 10_000))}{new string('a', 10_000)}\";\r\n\r\n    [Benchmark]\r\n    public Uri Ctor() =&gt; new Uri(_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 
9.0<\/td>\n<td style=\"text-align: right;\">18.989 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">2.228 us<\/td>\n<td style=\"text-align: right;\">0.12<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In the same vein, the longer the URI, the more effort is required to do whatever validation is needed in the constructor. <code>Uri<\/code>&#8217;s constructor needs to check whether the input has any Unicode characters that might need to be handled. Rather than checking all the characters one at a time, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107357\">dotnet\/runtime#107357<\/a>, <code>Uri<\/code> can now use <code>SearchValues<\/code> to more quickly rule out or find the first location of a character that needs to be looked at more deeply.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _uri;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        byte[] bytes = new byte[40_000];\r\n        new Random(42).NextBytes(bytes);\r\n        _uri = $\"data:application\/octet-stream;base64,{Convert.ToBase64String(bytes)}\";\r\n    }\r\n\r\n    [Benchmark]\r\n    public Uri Ctor() =&gt; new Uri(_uri);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">19.354 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: 
right;\">2.041 us<\/td>\n<td style=\"text-align: right;\">0.11<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other changes were made to <code>Uri<\/code> that further reduce construction costs in various other cases, too. For cases where the URI host is an IPv6 address, e.g. <code>http:\/\/[2603:1020:201:10::10f]<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117292\">dotnet\/runtime#117292<\/a> recognizes that scope IDs are relatively rare and makes the cases without a scope ID cheaper in exchange for making the cases with a scope ID a little more expensive.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public string CtorHost() =&gt; new Uri(\"http:\/\/[2603:1020:201:10::10f]\").Host;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CtorHost<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">304.9 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">208 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CtorHost<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">254.2 ns<\/td>\n<td style=\"text-align: right;\">0.83<\/td>\n<td style=\"text-align: right;\">216 B<\/td>\n<td style=\"text-align: right;\">1.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Note that the .NET 10 allocation is 8 bytes larger than the .NET 9 
allocation due to the extra space required in this case for dropping the length limitation, as discussed earlier.)<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117289\">dotnet\/runtime#117289<\/a> also improves construction for cases where the URI requires normalization, saving some allocations by using normalization routines over spans (which were added in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110465\">dotnet\/runtime#110465<\/a>) instead of needing to allocate <code>string<\/code>s for the inputs.<\/p>\n<pre><code class=\"language-csharp\">using BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public Uri Ctor() =&gt; new(\"http:\/\/some.host.with.\u00fcmlauts\/\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">377.6 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">440 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">322.0 ns<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<td style=\"text-align: right;\">376 B<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Various improvements have also found their way into the HTTP stack. For starters, the download helpers on <code>HttpClient<\/code> and <code>HttpContent<\/code> have improved. 
These types expose helper methods for some of the most common forms of grabbing data; while a developer can grab the response <code>Stream<\/code> and consume that efficiently, for simple and common cases like &#8220;just get the whole response as a <code>string<\/code>&#8221; or &#8220;just get the whole response as a <code>byte[]<\/code>&#8221;, the <code>GetStringAsync<\/code> and <code>GetByteArrayAsync<\/code> methods make that really easy to do. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109642\">dotnet\/runtime#109642<\/a> changes how these methods operate in order to better manage the temporary buffers that are required, especially in the case where the server hasn&#8217;t advertised a <code>Content-Length<\/code>, such that the client doesn&#8217;t know ahead of time how much data to expect and thus how much space to allocate.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\nusing System.Net.Sockets;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private HttpClient _client = new();\r\n    private Uri _uri;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        Socket listener = new(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n        listener.Listen(int.MaxValue);\r\n        _ = Task.Run(async () =&gt;\r\n        {\r\n            byte[] header = \"HTTP\/1.1 200 OK\\r\\nTransfer-Encoding: chunked\\r\\n\\r\\n\"u8.ToArray();\r\n            byte[] chunkData = Enumerable.Range(0, 100).SelectMany(_ =&gt; \"abcdefghijklmnopqrstuvwxyz\").Select(c =&gt; (byte)c).ToArray();\r\n   
         byte[] chunkHeader = Encoding.UTF8.GetBytes($\"{chunkData.Length:X}\\r\\n\");\r\n            byte[] chunkFooter = \"\\r\\n\"u8.ToArray();\r\n            byte[] footer = \"0\\r\\n\\r\\n\"u8.ToArray();\r\n            while (true)\r\n            {\r\n                var server = await listener.AcceptAsync();\r\n                server.NoDelay = true;\r\n                using StreamReader reader = new(new NetworkStream(server), Encoding.ASCII);\r\n                while (true)\r\n                {\r\n                    while (!string.IsNullOrEmpty(await reader.ReadLineAsync())) ;\r\n\r\n                    await server.SendAsync(header);\r\n                    for (int i = 0; i &lt; 100; i++)\r\n                    {\r\n                        await server.SendAsync(chunkHeader);\r\n                        await server.SendAsync(chunkData);\r\n                        await server.SendAsync(chunkFooter);\r\n                    }\r\n                    await server.SendAsync(footer);\r\n                }\r\n            }\r\n        });\r\n\r\n        var ep = (IPEndPoint)listener.LocalEndPoint!;\r\n        _uri = new Uri($\"http:\/\/{ep.Address}:{ep.Port}\/\");\r\n    }\r\n\r\n    [Benchmark]\r\n    public async Task&lt;byte[]&gt; ResponseContentRead_ReadAsByteArrayAsync()\r\n    {\r\n        using HttpResponseMessage resp = await _client.GetAsync(_uri);\r\n        return await resp.Content.ReadAsByteArrayAsync();\r\n    }\r\n\r\n    [Benchmark]\r\n    public async Task&lt;string&gt; ResponseHeadersRead_ReadAsStringAsync()\r\n    {\r\n        using HttpResponseMessage resp = await _client.GetAsync(_uri, HttpCompletionOption.ResponseHeadersRead);\r\n        return await resp.Content.ReadAsStringAsync();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: 
right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ResponseContentRead_ReadAsByteArrayAsync<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1.438 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">912.71 KB<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ResponseContentRead_ReadAsByteArrayAsync<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.166 ms<\/td>\n<td style=\"text-align: right;\">0.81<\/td>\n<td style=\"text-align: right;\">519.12 KB<\/td>\n<td style=\"text-align: right;\">0.57<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ResponseHeadersRead_ReadAsStringAsync<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1.528 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1166.77 KB<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ResponseHeadersRead_ReadAsStringAsync<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.306 ms<\/td>\n<td style=\"text-align: right;\">0.86<\/td>\n<td style=\"text-align: right;\">773.3 KB<\/td>\n<td style=\"text-align: right;\">0.66<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117071\">dotnet\/runtime#117071<\/a> reduces overheads associated with HTTP header validation. In the <code>System.Net.Http<\/code> implementation, some headers have dedicated parsers for them, while many (the majority of custom ones that services define) don&#8217;t. 
This PR recognizes that for these, the validation that needs to be performed amounts to only checking for forbidden newline characters, and the objects that were being created for all headers weren&#8217;t necessary for these.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net.Http.Headers;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private readonly HttpResponseHeaders _headers = new HttpResponseMessage().Headers;\r\n\r\n    [Benchmark]\r\n    public void Add()\r\n    {\r\n        _headers.Clear();\r\n        _headers.Add(\"X-Custom\", \"Value\");\r\n    }\r\n\r\n    [Benchmark]\r\n    public object GetValues()\r\n    {\r\n        _headers.Clear();\r\n        _headers.TryAddWithoutValidation(\"X-Custom\", \"Value\");\r\n        return _headers.GetValues(\"X-Custom\");\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Add<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">28.04 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Add<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">12.61 ns<\/td>\n<td style=\"text-align: right;\">0.45<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td 
style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>GetValues<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">82.57 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetValues<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">23.97 ns<\/td>\n<td style=\"text-align: right;\">0.29<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For folks using HTTP\/2, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112719\">dotnet\/runtime#112719<\/a> decreases per-connection memory consumption, by changing the <code>HPackDecoder<\/code> to lazily grow its buffers, starting from expected-case sizing rather than worst-case. (&#8220;HPACK&#8221; is the header compression algorithm used by HTTP\/2, utilizing a table shared between client and server for managing commonly transmitted headers.) 
It&#8217;s a little hard to measure in a micro-benchmark, since in a real app the connections get reused (and the benefits here aren&#8217;t about temporary allocation but rather connection density and overall working set), but we can get a glimpse of it by doing what you&#8217;re not supposed to do and create a new <code>HttpClient<\/code> for each request (you&#8217;re not supposed to do that, or more specifically not supposed to create a new handler for each request, because doing so tears down the connection pool and the connections it contains&#8230; which is bad for an app but exactly what we want for our micro-benchmark).<\/p>\n<pre><code class=\"language-csharp\">\/\/ For this benchmark, change the benchmark.csproj to start with:\r\n\/\/     &lt;Project Sdk=\"Microsoft.NET.Sdk.Web\"&gt;\r\n\/\/ instead of:\r\n\/\/     &lt;Project Sdk=\"Microsoft.NET.Sdk\"&gt;\r\n\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing System.Net;\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.AspNetCore.Server.Kestrel.Core;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private WebApplication _app;\r\n\r\n    [GlobalSetup]\r\n    public async Task Setup()\r\n    {\r\n        var builder = WebApplication.CreateBuilder();\r\n        builder.Logging.SetMinimumLevel(LogLevel.Warning);\r\n        builder.WebHost.ConfigureKestrel(o =&gt; o.ListenLocalhost(5000, listen =&gt; listen.Protocols = HttpProtocols.Http2));\r\n\r\n        _app = builder.Build();\r\n        _app.MapGet(\"\/hello\", () =&gt; Results.Text(\"hi from kestrel over h2c\\n\"));\r\n        var serverTask = _app.RunAsync();\r\n        await Task.Delay(300);\r\n    }\r\n\r\n    [GlobalCleanup]\r\n    public async Task Cleanup()\r\n    {\r\n        
await _app.StopAsync();\r\n        await _app.DisposeAsync();\r\n    }\r\n\r\n    [Benchmark]\r\n    public async Task Get()\r\n    {\r\n        using var client = new HttpClient()\r\n        {\r\n            DefaultRequestVersion = HttpVersion.Version20,\r\n            DefaultVersionPolicy = HttpVersionPolicy.RequestVersionExact\r\n        };\r\n\r\n        var response = await client.GetAsync(\"http:\/\/localhost:5000\/hello\");\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Get<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">485.9 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">83.19 KB<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Get<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">445.0 us<\/td>\n<td style=\"text-align: right;\">0.92<\/td>\n<td style=\"text-align: right;\">51.79 KB<\/td>\n<td style=\"text-align: right;\">0.62<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Also, on Linux and macOS, all HTTP use (and, more generally, all socket interactions) gets a tad cheaper from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109052\">dotnet\/runtime#109052<\/a>, which eliminates a <code>ConcurrentDictionary&lt;&gt;<\/code> lookup for each asynchronous operation that completes on a <code>Socket<\/code>.<\/p>\n<p>And for all you Native AOT fans, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117012\">dotnet\/runtime#117012<\/a> also adds a feature switch that enables trimming out the HTTP\/3 implementation from <code>HttpClient<\/code>, which can represent a very sizeable and &#8220;free&#8221; space savings if you&#8217;re not using HTTP\/3 at all.<\/p>\n<h2>Searching<\/h2>\n<p>Someone once 
told me that computer science was &#8220;all about sorting and searching.&#8221; That&#8217;s not far off. Searching in one way, shape, or form is an integral part of many applications and services.<\/p>\n<h3>Regex<\/h3>\n<p>Whether you love or hate the terse syntax, regular expressions (regex) continue to be an integral part of software development, with applications as part of both software and the software development process. As such, it&#8217;s had robust support in .NET since the early days of the platform, with the <code>System.Text.RegularExpressions<\/code> namespace providing a feature-rich set of regex capabilities. The performance of <code>Regex<\/code> was improved significantly in .NET 5 (<a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regex-performance-improvements-in-net-5\/\">Regex Performance Improvements in .NET 5<\/a>) and then again in .NET 7, which also saw a significant amount of new functionality added (<a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regular-expression-improvements-in-dotnet-7\/\">Regular Expression Improvements in .NET 7<\/a>). It&#8217;s continued to be improved in every release since, and .NET 10 is no exception.<\/p>\n<p>As I&#8217;ve discussed in previous blog posts about regex and performance, there are two high-level ways regex engines are implemented, either with backtracking or without. Non-backtracking engines typically work by creating some form of finite automata that represents the pattern, and then for each character consumed from the input, moves around the deterministic finite automata (DFA, meaning you can be in only a single state at a time) or non-deterministic finite automata (NFA, meaning you can be in multiple states at a time), transitioning from one state to another. A key benefit of a non-backtracking engine is that it can often make linear guarantees about processing time, where an input string of length <code>N<\/code> can be processed in worst-case <code>O(N)<\/code> time. 
A key downside of a non-backtracking engine is that it can&#8217;t support all of the features developers are familiar with in modern regex engines, like back references. Backtracking engines are named as such because they&#8217;re able to &#8220;backtrack,&#8221; trying one approach to see if there&#8217;s a match and then going back and trying another. If you have the regex pattern <code>\\w*\\d<\/code> (which matches any number of word characters followed by a single digit) and supply it with the string <code>\"12\"<\/code>, a backtracking engine is likely to first try treating both the <code>'1'<\/code> and the <code>'2'<\/code> as word characters, then find that it doesn&#8217;t have anything to fulfill the <code>\\d<\/code>, and thus backtrack, instead treating only the <code>'1'<\/code> as being consumed by the <code>\\w*<\/code>, and leaving the <code>'2'<\/code> to be consumed by the <code>\\d<\/code>. Backtracking is how engines support features like back references, variable-length lookarounds, conditional expressions, and more. They can also have excellent performance, especially on the average and best cases. A key downside, however, is their worst case, where on some patterns they can suffer from &#8220;catastrophic backtracking.&#8221; That happens when all of that backtracking leads to exploring the same input over and over and over again, possibly consuming much more than linear time.<\/p>\n<p>Since .NET 7, .NET has had an opt-in non-backtracking engine, which is what you get with <code>RegexOptions.NonBacktracking<\/code>. Otherwise, it uses a backtracking engine, whether using the default interpreter, or a regex compiled to IL (<code>RegexOptions.Compiled<\/code>), or a regex emitted as a custom C# implementation with the regex source generator (<code>[GeneratedRegex(...)]<\/code>). 
These backtracking engines can yield exceptional performance, but due to their backtracking nature, they are susceptible to bad worst-case performance, which is why specifying timeouts to a <code>Regex<\/code> is often encouraged, especially when using patterns of unknown provenance. Still, there are things backtracking engines can do to help mitigate some such backtracking, in particular avoiding the need for some of the backtracking in the first place.<\/p>\n<p>One of the main tools backtracking engines offer for reduced backtracking is an &#8220;atomic&#8221; construct. Some regex syntaxes surface this via &#8220;possessive quantifiers,&#8221; while others, including .NET, surface it via &#8220;atomic groups.&#8221; They&#8217;re fundamentally the same thing, just expressed in the syntax differently. An atomic group in .NET&#8217;s regex syntax is a group that is never backtracked into. If we take our previous <code>\\w*\\d<\/code> example, we could wrap the <code>\\w*<\/code> loop in an atomic group like this: <code>(?&gt;\\w*)\\d<\/code>. In doing so, whatever that <code>\\w*<\/code> consumes won&#8217;t change via backtracking after exiting the group and moving on to whatever comes after it in the pattern. So if I try to match <code>\"12\"<\/code> with such a pattern, it&#8217;ll fail, because the <code>\\w*<\/code> will consume both characters, the <code>\\d<\/code> will have nothing to match, and no backtracking will be applied, because the <code>\\w*<\/code> is wrapped in an atomic group and thus exposes no backtracking opportunities.<\/p>\n<p>In that example, wrapping the <code>\\w*<\/code> with an atomic group changes the meaning of the pattern, and thus it&#8217;s not something that a regex engine could choose to do automatically. However, there are many cases where wrapping otherwise backtracking constructs in an atomic group does not observably change behavior, because any backtracking that would otherwise happen would provably never be fruitful. 
Consider a pattern <code>a*b<\/code>. <code>a*b<\/code> is observably identical to <code>(?&gt;a*)b<\/code>, which says that the <code>a*<\/code> should not be backtracked into. That&#8217;s because there&#8217;s nothing the <code>a*<\/code> can &#8220;give back&#8221; (which can only be <code>a<\/code>s) that would satisfy what comes next in the pattern (which is only <code>b<\/code>). It&#8217;s thus valid for a backtracking engine to transform how it processes <code>a*b<\/code> to instead be the equivalent of how it processes <code>(?&gt;a*)b<\/code>. And the .NET regex engine has been capable of such transformations since .NET 5. This can result in massive improvements to throughput. With backtracking, waving my hands, we effectively need to execute everything after the backtracking construct for each possible position we could backtrack to. So, for example, with <code>\\w*SOMEPATTERN<\/code>, if the <code>\\w*<\/code> initially consumes 100 characters, we then possibly need to try to match <code>SOMEPATTERN<\/code> up to 100 different times, as we may need to backtrack up to 100 times and re-evaluate <code>SOMEPATTERN<\/code> each time we give back one of the things initially matched. If we instead make that <code>(?&gt;\\w*)<\/code>, we eliminate all but one of those! That means improvements to this ability to automatically transform backtracking constructs into non-backtracking ones can yield massive improvements in performance, and practically every release of .NET since .NET 5 has increased the set of patterns that are automatically transformed. .NET 10 included.<\/p>\n<p>Let&#8217;s start with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117869\">dotnet\/runtime#117869<\/a>, which teaches the regex optimizer about more &#8220;disjoint&#8221; sets. 
Consider the previous example of <code>a*b<\/code>, and how I said we can make that <code>a*<\/code> loop atomic because there&#8217;s nothing <code>a*<\/code> can &#8220;give back&#8221; that matches <code>b<\/code>. That is a general statement about auto-atomicity: a loop can be made atomic if it&#8217;s guaranteed to end with something that can&#8217;t possibly match the thing that comes after it. So, if I have <code>[abc]+[def]<\/code>, that loop can be made atomic, because there&#8217;s nothing <code>[abc]<\/code> can match that <code>[def]<\/code> can also match. In contrast, if the expression were instead <code>[abc]+[cef]<\/code>, that loop must not be made atomic automatically, as doing so could change behavior. The sets <em>do<\/em> overlap, as both can match <code>'c'<\/code>. So, for example, if the input were just <code>\"cc\"<\/code>, the original expression should match it (the <code>[abc]+<\/code> loop would match <code>'c'<\/code> with one iteration of the loop and then the second <code>'c'<\/code> would satisfy the <code>[cef]<\/code> set), but if the expression were instead <code>(?&gt;[abc]+)[cef]<\/code>, it would no longer match, as the <code>[abc]+<\/code> would consume both <code>'c'<\/code>s, and there&#8217;d be nothing left for the <code>[cef]<\/code> set to match. Two sets that don&#8217;t have any overlap are referred to as being &#8220;disjoint,&#8221; and so the optimizer needs to be able to prove the disjointedness of sets in order to perform these kinds of auto-atomicity optimizations. The optimizer was already able to do so for many sets, in particular ones that were composed purely of characters or character ranges, e.g. <code>[ace]<\/code> or <code>[a-zA-Z0-9]<\/code>. But many sets are instead composed of entire Unicode categories. 
For example, when you write <code>\\d<\/code>, unless you&#8217;ve specified <code>RegexOptions.ECMAScript<\/code>, that&#8217;s the same as <code>\\p{Nd}<\/code>, which says &#8220;match any character in the Unicode category of Number decimal digits&#8221;, aka all characters for which <code>char.GetUnicodeCategory<\/code> returns <code>UnicodeCategory.DecimalDigitNumber<\/code>. And the optimizer was unable to reason about overlap between such sets. So, for example, if you had the expression <code>\\w*\\p{Sm}<\/code>, that matches any number of word characters followed by a math symbol (<code>UnicodeCategory.MathSymbol<\/code>). <code>\\w<\/code> is actually just a set of eight specific Unicode categories, such that the previous expression behaves identically to if I&#8217;d written <code>[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{Mn}\\p{Nd}\\p{Pc}]*\\p{Sm}<\/code> (<code>\\w<\/code> is composed of <code>UnicodeCategory.UppercaseLetter<\/code>, <code>UnicodeCategory.LowercaseLetter<\/code>, <code>UnicodeCategory.TitlecaseLetter<\/code>, <code>UnicodeCategory.ModifierLetter<\/code>, <code>UnicodeCategory.OtherLetter<\/code>, <code>UnicodeCategory.NonSpacingMark<\/code>, <code>UnicodeCategory.DecimalDigitNumber<\/code>, and <code>UnicodeCategory.ConnectorPunctuation<\/code>). Note that none of those eight categories is the same as <code>\\p{Sm}<\/code>, which means they&#8217;re disjoint, which means we can safely change that loop to being atomic without impacting behavior; it just makes it faster. One of the easiest ways to see the effect of this is to look at the output from the regex source generator. 
Before the change, if I look at the XML comment generated for that expression, I get this:<\/p>\n<pre><code class=\"language-csharp\">\/\/\/ \u25cb Match a word character greedily any number of times.\r\n\/\/\/ \u25cb Match a character in the set [\\p{Sm}].<\/code><\/pre>\n<p>and after, I get this:<\/p>\n<pre><code class=\"language-csharp\">\/\/\/ \u25cb Match a word character atomically any number of times.\r\n\/\/\/ \u25cb Match a character in the set [\\p{Sm}].<\/code><\/pre>\n<p>That one word change in the first sentence makes a huge difference. Here&#8217;s the relevant portion of the C# code emitted by the source generator for the matching routine before the change:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match a word character greedily any number of times.\r\n\/\/{\r\n    charloop_starting_pos = pos;\r\n\r\n    int iteration = 0;\r\n    while ((uint)iteration &lt; (uint)slice.Length &amp;&amp; Utilities.IsWordChar(slice[iteration]))\r\n    {\r\n        iteration++;\r\n    }\r\n\r\n    slice = slice.Slice(iteration);\r\n    pos += iteration;\r\n\r\n    charloop_ending_pos = pos;\r\n    goto CharLoopEnd;\r\n\r\n    CharLoopBacktrack:\r\n\r\n    if (Utilities.s_hasTimeout)\r\n    {\r\n        base.CheckTimeout();\r\n    }\r\n\r\n    if (charloop_starting_pos &gt;= charloop_ending_pos)\r\n    {\r\n        return false; \/\/ The input didn't match.\r\n    }\r\n    pos = --charloop_ending_pos;\r\n    slice = inputSpan.Slice(pos);\r\n\r\n    CharLoopEnd:\r\n\/\/}<\/code><\/pre>\n<p>You can see how backtracking influences the emitted code. The core loop in there is iterating through as many word characters as it can match, but then before moving on, it remembers some position information about where it was. It also sets up a label for where subsequent code should jump to if it needs to backtrack; that code undoes one of the matched characters and then retries everything that came after it. 
If the code needs to backtrack again, it&#8217;ll again undo one of the characters and retry. And so on. Now, here&#8217;s what the code looks like after the change:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match a word character atomically any number of times.\r\n{\r\n    int iteration = 0;\r\n    while ((uint)iteration &lt; (uint)slice.Length &amp;&amp; Utilities.IsWordChar(slice[iteration]))\r\n    {\r\n        iteration++;\r\n    }\r\n\r\n    slice = slice.Slice(iteration);\r\n    pos += iteration;\r\n}<\/code><\/pre>\n<p>All of that backtracking gunk is gone; the loop matches as much as it can, and that&#8217;s that. You can see the effect this has in some cases with a micro-benchmark like this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new string(' ', 100);\r\n    private static readonly Regex s_regex = new Regex(@\"\\s+\\S+\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<p>This is a simple test where we&#8217;re trying to match any positive number of whitespace characters followed by any positive number of non-whitespace characters, giving it an input composed entirely of whitespace. Without atomicity, the engine is going to consume all of the whitespace as part of the <code>\\s+<\/code> but will then find that there isn&#8217;t any non-whitespace available to match the <code>\\S+<\/code>. What does it do then? It backtracks, gives back one of the hundred spaces consumed by <code>\\s+<\/code>, and tries again to match the <code>\\S+<\/code>. 
It won&#8217;t match, so it backtracks again. And again. And again. A hundred times, until it has nothing left to try and gives up. With atomicity, all that backtracking goes away, allowing it to fail faster.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">183.31 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">69.23 ns<\/td>\n<td style=\"text-align: right;\">0.38<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117892\">dotnet\/runtime#117892<\/a> is a related improvement. In regex, <code>\\b<\/code> is called a &#8220;word boundary&#8221;; it checks whether the wordness of the previous character (whether the previous character matches <code>\\w<\/code>) matches the wordness of the next character, calling it a boundary if they differ. 
You can see this in the engine&#8217;s <code>IsBoundary<\/code> helper&#8217;s implementation, which follows (note that according to <a href=\"http:\/\/www.unicode.org\/reports\/tr18\/\">TR18<\/a> whether a character is considered a boundary word char is <em>almost<\/em> exactly the same as <code>\\w<\/code>, except with two additional zero-width Unicode characters also included):<\/p>\n<pre><code class=\"language-csharp\">internal static bool IsBoundary(ReadOnlySpan&lt;char&gt; inputSpan, int index)\r\n{\r\n    int indexM1 = index - 1;\r\n    return ((uint)indexM1 &lt; (uint)inputSpan.Length &amp;&amp; RegexCharClass.IsBoundaryWordChar(inputSpan[indexM1])) !=\r\n           ((uint)index &lt; (uint)inputSpan.Length &amp;&amp; RegexCharClass.IsBoundaryWordChar(inputSpan[index]));\r\n}<\/code><\/pre>\n<p>The optimizer already had a special-case in its auto-atomicity logic that had knowledge of boundaries and their relationship to <code>\\w<\/code> and <code>\\d<\/code>, specifically. So, if you had <code>\\w+\\b<\/code>, the optimizer would recognize that in order for the <code>\\b<\/code> to match, what comes after what the <code>\\w+<\/code> matches must necessarily not match <code>\\w<\/code>, because then it wouldn&#8217;t be a boundary, and thus the <code>\\w+<\/code> could be made atomic. Similarly, with a pattern of <code>\\d+\\b<\/code>, it would recognize that what came after must not be in <code>\\d<\/code>, and could make the loop atomic. It didn&#8217;t generalize this, though. Now in .NET 10, it does. This PR teaches the optimizer how to recognize subsets of <code>\\w<\/code>, because, as with the special-case of <code>\\d<\/code>, any subset of <code>\\w<\/code> can similarly benefit: if what comes before the <code>\\b<\/code> is a word character, what comes after must not be. 
Thus, with this PR, an expression like <code>[a-zA-Z]+\\b<\/code> will now have the loop made atomic.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = \"Supercalifragilisticexpialidocious1\";\r\n    private static readonly Regex s_regex = new Regex(@\"^[A-Za-z]+\\b\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">116.57 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">21.74 ns<\/td>\n<td style=\"text-align: right;\">0.19<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Just doing a better job of set disjointedness analysis is helpful, but more so is actually recognizing whole new classes of things that can be made atomic. In prior releases, the auto-atomicity optimizations only kicked in for loops over single characters, e.g. <code>a*<\/code>, <code>[abc]*?<\/code>, <code>[^abc]*<\/code>. That is obviously only a subset of loops, as many loops are composed of more than just a single character; loops can surround any regex construct. Even a capture group thrown into the mix would knock the auto-atomicity behavior off the rails. 
Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117943\">dotnet\/runtime#117943<\/a>, a significant number of loops involving more complicated constructs can be made atomic. Loops larger than a single character are tricky, though, as there are more things that need to be taken into account when reasoning through atomicity. With a single character, we only need to prove disjointedness for that one character with what comes after it. But, consider an expression like <code>([a-z][0-9])+a1<\/code>. Can that loop be made atomic? What comes after the loop (<code>'a'<\/code>) is provably disjoint from what ends the loop (<code>[0-9]<\/code>), and yet making this loop atomic automatically would change behavior and be a no-no. Imagine if the input were <code>\"b2a1\"<\/code>. That matches; if this expression is processed normally, the loop would match a single iteration, consuming the <code>\"b2\"<\/code>, and then the <code>a1<\/code> after the loop would consume the corresponding <code>a1<\/code> in the input. But, if the loop were made atomic, e.g. <code>(?&gt;([a-z][0-9])+)a1<\/code>, the loop would end up performing two iterations and consuming both the <code>\"b2\"<\/code> and the <code>\"a1\"<\/code>, leaving nothing for the <code>a1<\/code> in the pattern. As it turns out, we not only need to ensure what ends the loop is disjoint from what comes after it, we also need to ensure that what starts the loop is disjoint from what comes after it. That&#8217;s not all, though. Now consider an expression <code>^(a|ab)+$<\/code>. This matches an entire input composed of <code>\"a\"<\/code>s and <code>\"ab\"<\/code>s. Given an input string like <code>\"aba\"<\/code>, this will match successfully, as it will consume the <code>\"ab\"<\/code> with the second branch of the alternation, and then consume the remaining <code>a<\/code> with the first branch of the alternation on the next iteration of the loop. 
But now consider what happens if we make the loop atomic: <code>^(?&gt;(a|ab)+)$<\/code>. Now on that same input, the initial <code>a<\/code> in the input will be consumed by the first branch of the alternation, and that will satisfy the loop&#8217;s minimum bound of 1 iteration, exiting the loop. It&#8217;ll then proceed to validate that it&#8217;s at the end of the string, and fail, but with the loop now atomic, there&#8217;s nothing to backtrack into, and the whole match fails. Oops. The problem here is that the loop&#8217;s ending must not only be disjoint with what comes next, and the loop&#8217;s beginning must not only be disjoint with what comes next, but because it&#8217;s a loop, what comes next can actually be itself, which means the loop&#8217;s beginning and ending must be disjoint from each other. Those criteria significantly limit to what patterns this can be applied, but even with that, it&#8217;s still surprisingly common: <code>dotnet\/runtime-assets<\/code> (which contains test assets for use with <code>dotnet\/runtime<\/code>) contains a <a href=\"https:\/\/github.com\/dotnet\/runtime-assets\/blob\/f9ac0b368d930728d6740686de29b5276958d15b\/src\/System.Text.RegularExpressions.TestData\/Regex_RealWorldPatterns.json\">database of regex patterns<\/a> sourced from appropriately-licensed nuget packages, yielding almost 20,000 unique patterns, and more than 7% of those were positively impacted by this.<\/p>\n<p>Here is an example that&#8217;s searching <a href=\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\">&#8220;The Entire Project Gutenberg Works of Mark Twain&#8221;<\/a> for sequences of all lowercase ASCII words, each followed by a space, and then all followed by an uppercase ASCII letter.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing 
System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly Regex s_regex = new Regex(@\"([a-z]+ )+[A-Z]\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<p>In previous releases, that inner loop would be made atomic, but the outer loop would remain greedy (backtracking). From the XML comment generated by the source generator, we get this:<\/p>\n<pre><code class=\"language-csharp\">\/\/\/ \u25cb Loop greedily at least once.\r\n\/\/\/     \u25cb 1st capture group.\r\n\/\/\/         \u25cb Match a character in the set [a-z] atomically at least once.\r\n\/\/\/         \u25cb Match ' '.\r\n\/\/\/ \u25cb Match a character in the set [A-Z].<\/code><\/pre>\n<p>Now in .NET 10, we get this:<\/p>\n<pre><code class=\"language-csharp\">\/\/\/ \u25cb Loop atomically at least once.\r\n\/\/\/     \u25cb 1st capture group.\r\n\/\/\/         \u25cb Match a character in the set [a-z] atomically at least once.\r\n\/\/\/         \u25cb Match ' '.\r\n\/\/\/ \u25cb Match a character in the set [A-Z].<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">573.4 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">504.6 ms<\/td>\n<td style=\"text-align: right;\">0.88<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As with any optimization, auto-atomicity should never change observable 
behavior; it should just make things faster. And as such, every case where atomicity is automatically applied requires it being reasoned through to ensure that the optimization is of sound logic. In some cases, the optimization was written to be conservative, as the relevant reasoning through the logic wasn&#8217;t previously done. An example of that is addressed by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118191\">dotnet\/runtime#118191<\/a>, which makes a few tweaks to how boundaries are handled in the auto-atomicity logic, removing some constraints that were put in place but which, as it turns out, are unnecessary. The core logic that implements the atomicity analysis is a method that looks like this:<\/p>\n<pre><code class=\"language-csharp\">private static bool CanBeMadeAtomic(RegexNode node, RegexNode subsequent, ...)<\/code><\/pre>\n<p><code>node<\/code> is the representation for the part of the regex that&#8217;s being considered for becoming atomic (e.g. a loop) and <code>subsequent<\/code> is what comes immediately after it in the pattern; the method then proceeds to validate <code>node<\/code> against <code>subsequent<\/code> to see whether it can prove there wouldn&#8217;t be any behavioral changes if <code>node<\/code> were made atomic. However, not all cases are sufficiently handled just by validating against <code>subsequent<\/code> itself. Consider a pattern like <code>a*b*\\w<\/code>, where <code>node<\/code> represents <code>a*<\/code> and <code>subsequent<\/code> represents <code>b*<\/code>. <code>a<\/code> and <code>b<\/code> are obviously disjoint, and so <code>node<\/code> can be made atomic with regards to <code>subsequent<\/code>, but&#8230; here <code>subsequent<\/code> is also &#8220;nullable,&#8221; meaning it might successfully match 0 characters (the loop has a lower bound of 0). 
And in such a case, what comes after the <code>a*<\/code> won&#8217;t necessarily be a <code>b<\/code> but could be what comes after the <code>b*<\/code>, which here is a <code>\\w<\/code>, which overlaps with <code>a<\/code>, and as such, it would be a behavioral change to make this into <code>(?&gt;a*)b*\\w<\/code>. Consider an input of just <code>\"a\"<\/code>. With the original pattern, after backtracking, <code>a*<\/code> would successfully match the empty string with 0 iterations, <code>b*<\/code> would successfully match the empty string with 0 iterations, and then <code>\\w<\/code> would successfully match the input <code>'a'<\/code>. But with the atomicized pattern, <code>(?&gt;a*)<\/code> would successfully match the input <code>'a'<\/code> with a single iteration, leaving nothing to match the <code>\\w<\/code>. As such, when <code>CanBeMadeAtomic<\/code> detects that <code>subsequent<\/code> may be nullable and successfully match the empty string, it needs to iterate to also validate against what comes after <code>subsequent<\/code> (and possibly again and again if what comes next itself keeps being nullable).<\/p>\n<p><code>CanBeMadeAtomic<\/code> already factored in boundaries (<code>\\b<\/code> and <code>\\B<\/code>), but it did so with the conservative logic that since a boundary is &#8220;zero-width&#8221; (meaning it doesn&#8217;t consume any input), it must always require checking what comes after it. But that&#8217;s not actually the case. Even though a boundary is zero-width, it still makes guarantees about what comes next: if the prior character is a word character, then for a <code>\\b<\/code> to match successfully, the next character is guaranteed not to be one. And as such, we can safely make this more liberal and not require checking what comes next.<\/p>\n<p>This last example also highlights an interesting aspect of this auto-atomicity optimization in general. There is nothing this optimization provides that the developer writing the regex in the first place couldn&#8217;t have done themselves. 
Instead of <code>a*b<\/code>, a developer can write <code>(?&gt;a*)b<\/code>. Instead of <code>[a-z]+(?= )<\/code>, a developer can write <code>(?&gt;[a-z]+)(?= )<\/code>. And so on. But when was the last time you explicitly added an atomic group to a regex you authored? Of the almost 20,000 regular expression patterns in the aforementioned database of real-world regexes sourced from nuget, care to guess how many include an explicitly written atomic group? The answer: ~100. It&#8217;s just not something developers in general think to do, so although the optimization transforms the user&#8217;s pattern into something they could have written themselves, it&#8217;s an incredibly valuable optimization, especially since now in .NET 10 over 70% of those patterns have at least one construct upgraded to be atomic.<\/p>\n<p>The auto-atomicity optimization is an example of the optimizer removing unnecessary work. A key example of that, but certainly not the only example. Several additional PRs in .NET 10 have also eliminated unnecessary work, in other ways.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118084\">dotnet\/runtime#118084<\/a> is a fun example of this, but to understand it, we first need to understand lookarounds. A &#8220;lookaround&#8221; is a regex construct that makes its contents zero-width. Whereas when a set like &#8220;[abc]&#8221; matches it consumes a single character from the input, or when a loop like &#8220;[abc]{3,5}&#8221; matches it&#8217;ll consume between 3-5 characters from the input, lookarounds (as with other zero-width constructs, like anchors) don&#8217;t consume anything. You wrap a lookaround around a regex expression, and it effectively makes the consumption temporary, e.g. 
if I wrap <code>[abc]{3,5}<\/code> in a positive lookahead as <code>(?=[abc]{3,5})<\/code>, that will end up performing the whole match for the 3-5 set characters, but those characters won&#8217;t remain consumed after exiting the lookaround; the lookaround is just performing a test to ensure the inner pattern matches but the position in the input is reset upon exiting the lookaround. This is again visualized easily by looking at the code emitted by the regex source generator for a pattern like <code>(?=[abc]{3,5})abc<\/code>:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Zero-width positive lookahead.\r\n{\r\n    int positivelookahead_starting_pos = pos;\r\n\r\n    \/\/ Match a character in the set [a-c] atomically at least 3 and at most 5 times.\r\n    {\r\n        int iteration = 0;\r\n        while (iteration &lt; 5 &amp;&amp; (uint)iteration &lt; (uint)slice.Length &amp;&amp; char.IsBetween(slice[iteration], 'a', 'c'))\r\n        {\r\n            iteration++;\r\n        }\r\n\r\n        if (iteration &lt; 3)\r\n        {\r\n            return false; \/\/ The input didn't match.\r\n        }\r\n\r\n        slice = slice.Slice(iteration);\r\n        pos += iteration;\r\n    }\r\n\r\n    pos = positivelookahead_starting_pos;\r\n    slice = inputSpan.Slice(pos);\r\n}\r\n\r\n\/\/ Match the string \"abc\".\r\nif (!slice.StartsWith(\"abc\"))\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}<\/code><\/pre>\n<p>We can see that the lookaround is caching the starting position, then proceeding to try to match the loop it contains, and if successful, resetting the matching position to what it was when the lookaround was entered, then continuing on to perform the match for what comes after the lookaround.<\/p>\n<p>These examples have been for a particular flavor of lookaround, called a positive lookahead. There are four variations of lookarounds composed of two choices: positive vs negative, and lookahead vs lookbehind. 
Lookaheads validate the pattern starting from the current position and proceeding forwards (as matching typically does), while lookbehinds validate the pattern starting from just before the current position and extending backwards. Positive indicates that the pattern should match, while negative indicates that the pattern should not match. So, for example, the negative lookbehind <code>(?&lt;!\\w)<\/code> will match if what comes before the current position is not a word character.<\/p>\n<p>Negative lookarounds are particularly interesting, because, unlike every other regex construct, they guarantee that the pattern they contain <em>doesn&#8217;t<\/em> match. That also makes them special in other regards, in particular around capture groups. For a positive lookaround, even though it&#8217;s zero-width, anything captured by capture groups inside of the lookaround still persists outside of the lookaround, e.g. <code>^(?=(abc))\\1$<\/code>, which entails a backreference successfully matching what&#8217;s captured by the capture group inside of the positive lookahead, will successfully match the input <code>\"abc\"<\/code>. But because <em>negative<\/em> lookarounds guarantee their content doesn&#8217;t match, it would be counter-intuitive if anything captured inside of a negative lookaround persisted past the lookaround&#8230; so it doesn&#8217;t. The capture groups inside of a negative lookaround are still possibly meaningful, in particular if there&#8217;s a backreference also <em>inside of<\/em> the same lookaround that refers back to the capture group, e.g. the pattern <code>^(?!(ab)\\1cd)ababc<\/code> is checking to see whether the input does not begin with <code>ababcd<\/code> but does begin with <code>ababc<\/code>. But if there&#8217;s no backreference, the capture group is useless, and we don&#8217;t need to do any work for it as part of processing the regex (work like remembering where the capture occurred). 
Such capture groups can be completely eliminated from the node tree as part of the optimization phase, and that&#8217;s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118084\">dotnet\/runtime#118084<\/a> does. Just as developers often use backtracking constructs without thinking to make them atomic, developers also often use capture groups purely as a grouping mechanism without thinking of the possibility of making them non-capturing groups. Since captures in general need to persist to be examined by the <code>Match<\/code> object returned from a <code>Regex<\/code>, we can&#8217;t just eliminate all capture groups that aren&#8217;t used internally in the pattern, but we can for these negative lookarounds. Consider a pattern like <code>(?&lt;!(access|auth)\\s)token<\/code>, which is looking for the word <code>\"token\"<\/code> when it&#8217;s <em>not<\/em> preceded by <code>\"access \"<\/code> or <code>\"auth \"<\/code>; the developer here (me, in this case) did what&#8217;s fairly natural, putting a group around the alternation so that the <code>\\s<\/code> that follows either word can be factored out (if it were instead <code>access|auth\\s<\/code>, the whitespace set would only be in the second branch of the alternation and wouldn&#8217;t apply to the first). But my &#8220;simple&#8221; grouping here is actually a capture group by default; to get it to be non-capturing, I&#8217;d either need to write it as a non-capturing group, i.e. <code>(?&lt;!(?:access|auth)\\s)token<\/code>, or I&#8217;d need to use <code>RegexOptions.ExplicitCapture<\/code>, which turns all non-named capture groups into non-capturing groups.<\/p>\n<p>We can similarly remove other work related to lookarounds. As noted, positive lookarounds exist to transform any pattern into a zero-width pattern, i.e. one that doesn&#8217;t consume anything. That&#8217;s all they do. 
If the pattern being wrapped by the positive lookaround is already zero-width, the lookaround contributes nothing to the behavior of the expression and can be removed. So, for example, if you have <code>(?=$)<\/code>, that can be transformed into just <code>$<\/code>. That&#8217;s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118091\">dotnet\/runtime#118091<\/a> does.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118079\">dotnet\/runtime#118079<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118111\">dotnet\/runtime#118111<\/a> handle other transformations relative to zero-width assertions, in particular with regards to loops. For whatever reason, you&#8217;ll see developers wrapping zero-width assertions inside of loops, either making such assertions optional (e.g. <code>\\b?<\/code>) or with some larger upper bound (e.g. <code>(?=abc)*<\/code>). But these zero-width assertions don&#8217;t consume anything; their sole purpose is to flag whether something is true or false at the current position. If you make such a zero-width assertion optional, then you&#8217;re saying &#8220;check whether it&#8217;s true or false, and then immediately ignore the answer, because both answers are valid&#8221;; as such, the whole expression can be removed as a nop. Similarly, if you wrap a loop with an upper bound greater than 1 around such an expression, you&#8217;re saying &#8220;check whether it&#8217;s true or false, now without changing anything check again, and check again, and check again.&#8221; There&#8217;s a common English expression that&#8217;s something along the lines of &#8220;insanity is doing the same thing over and over again and expecting different results.&#8221; That applies here. There may be behavioral benefits to invoking the zero-width assertion once, but repeating it beyond that is a pure waste: if it was going to fail, it would have failed the first time. Mostly. 
There&#8217;s one specific case where the difference is actually observable, and that has to do with an interesting feature of .NET regexes: capture groups track <em>all<\/em> matched captures, not just the last. Consider this program:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0\r\n\r\nusing System.Diagnostics;\r\nusing System.Text.RegularExpressions;\r\n\r\nMatch m = Regex.Match(\"abc\", \"^(?=(\\\\w+)){3}abc$\");\r\nDebug.Assert(m.Success);\r\n\r\nforeach (Group g in m.Groups)\r\n{\r\n    foreach (Capture c in g.Captures)\r\n    {\r\n        Console.WriteLine($\"Group: {g.Name}, Capture: {c.Value}\");\r\n    }\r\n}<\/code><\/pre>\n<p>If you run that, you may be surprised to see that capture group #1 (the explicit group I have inside of the lookahead) provides three capture values:<\/p>\n<pre><code class=\"language-txt\">Group: 0, Capture: abc\r\nGroup: 1, Capture: abc\r\nGroup: 1, Capture: abc\r\nGroup: 1, Capture: abc<\/code><\/pre>\n<p>That&#8217;s because the loop around the positive lookahead does three iterations, each iteration matches <code>\"abc\"<\/code>, and each successful capture is persisted for subsequent inspection via the <code>Regex<\/code> APIs. As such, we can&#8217;t optimize any loop around zero-width assertions by lowering the upper bound from greater than 1 to 1; we can only do so if it doesn&#8217;t contain any captures. And that&#8217;s what these PRs do. 
Given a loop that wraps a zero-width assertion that does not contain a capture, if the lower bound of the loop is 0, the whole loop and its contents can be eliminated, and if the upper bound of the loop is greater than 1, the loop itself can be removed, leaving only its contents in its stead.<\/p>\n<p>Any time work like this is eliminated, it&#8217;s easy to construct monstrous, misleading micro-benchmarks&#8230; but it&#8217;s also a lot of fun, so, I&#8217;ll allow myself it this time.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly Regex s_regex = new Regex(@\"(?=.*\\bTwain\\b.*\\bConnecticut\\b)*.*Mark\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">3,226.024 ms<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">6.605 ms<\/td>\n<td style=\"text-align: right;\">0.002<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118083\">dotnet\/runtime#118083<\/a> is similar. 
&#8220;Repeaters&#8221; are a name for a regex loop that has the same lower and upper bound, such that the contents of the loop &#8220;repeats&#8221; that fixed number of times. Typically you&#8217;ll see these written out using the <code>{N}<\/code> syntax, e.g. <code>[abc]{3}<\/code> is a repeater that requires three characters, any of which can be <code>'a'<\/code>, <code>'b'<\/code>, or <code>'c'<\/code>. But of course it could also be written out in long-form, just by manually repeating the contents, e.g. <code>[abc][abc][abc]<\/code>. Just as we saw how we can condense loops around zero-width assertions when specified in loop form, we can do the exact same thing when manually written out. So, for example, <code>\\b\\b<\/code> is the same as just <code>\\b{2}<\/code>, which is just <code>\\b<\/code>.<\/p>\n<p>Another nice example of removing unnecessary work is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118105\">dotnet\/runtime#118105<\/a>. Boundary assertions are used in many expressions, e.g. it&#8217;s quite common to see a simple pattern like <code>\\b\\w+\\b<\/code>, which is trying to match an entire word. When the regex engine encounters such an assertion, historically it&#8217;s delegated to the <code>IsBoundary<\/code> helper shown earlier. There is, however, some subtle unnecessary work here, which is more obvious when you see what the regex source generator outputs for an expression like <code>\\b\\w+\\b<\/code>. 
This is what the output looks like on .NET 9:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match if at a word boundary.\r\nif (!Utilities.IsBoundary(inputSpan, pos))\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}\r\n\r\n\/\/ Match a word character atomically at least once.\r\n{\r\n    int iteration = 0;\r\n    while ((uint)iteration &lt; (uint)slice.Length &amp;&amp; Utilities.IsWordChar(slice[iteration]))\r\n    {\r\n        iteration++;\r\n    }\r\n\r\n    if (iteration == 0)\r\n    {\r\n        return false; \/\/ The input didn't match.\r\n    }\r\n\r\n    slice = slice.Slice(iteration);\r\n    pos += iteration;\r\n}\r\n\r\n\/\/ Match if at a word boundary.\r\nif (!Utilities.IsBoundary(inputSpan, pos))\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}<\/code><\/pre>\n<p>Pretty straightforward: match the boundary, consume as many word characters as possible, then again match a boundary. Except if you look back at the definition of <code>IsBoundary<\/code>, you&#8217;ll notice that it&#8217;s doing two checks, one against the previous character and one against the next character.<\/p>\n<pre><code class=\"language-csharp\">internal static bool IsBoundary(ReadOnlySpan&lt;char&gt; inputSpan, int index)\r\n{\r\n    int indexM1 = index - 1;\r\n    return ((uint)indexM1 &lt; (uint)inputSpan.Length &amp;&amp; RegexCharClass.IsBoundaryWordChar(inputSpan[indexM1])) !=\r\n           ((uint)index &lt; (uint)inputSpan.Length &amp;&amp; RegexCharClass.IsBoundaryWordChar(inputSpan[index]));\r\n}<\/code><\/pre>\n<p>Now, look at that, and look back at the generated code, and look at this again, and back at the source generated code again. See anything unnecessary? When we perform the first boundary comparison, we are dutifully checking the previous character, which is necessary, but then we&#8217;re checking the current character, which is about to be checked against <code>\\w<\/code> by the subsequent <code>\\w+<\/code> loop. 
Similarly for the second boundary check, we just finished matching <code>\\w+<\/code>, which will have only successfully matched if there was at least one word character. While we still need to validate that the subsequent character is not a boundary character (there are two characters considered boundary characters that aren&#8217;t word characters), we don&#8217;t need to re-validate the previous character. So, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118105\">dotnet\/runtime#118105<\/a> overhauls boundary handling in the compiler and source generator to emit customized boundary checks based on surrounding knowledge. If it can prove that the subsequent construct will validate that a character is a word character, then it only needs to validate that the previous character is not a boundary character; similarly, if it can prove that the previous construct will have already validated that a character is a word character, then it only needs to validate that the next character isn&#8217;t. 
This leads to this tweaked source generated code now on .NET 10:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match if at a word boundary.\r\nif (!Utilities.IsPreWordCharBoundary(inputSpan, pos))\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}\r\n\r\n\/\/ Match a word character atomically at least once.\r\n{\r\n    int iteration = 0;\r\n    while ((uint)iteration &lt; (uint)slice.Length &amp;&amp; Utilities.IsWordChar(slice[iteration]))\r\n    {\r\n        iteration++;\r\n    }\r\n\r\n    if (iteration == 0)\r\n    {\r\n        return false; \/\/ The input didn't match.\r\n    }\r\n\r\n    slice = slice.Slice(iteration);\r\n    pos += iteration;\r\n}\r\n\r\n\/\/ Match if at a word boundary.\r\nif (!Utilities.IsPostWordCharBoundary(inputSpan, pos))\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}<\/code><\/pre>\n<p>Those <code>IsPreWordCharBoundary<\/code> and <code>IsPostWordCharBoundary<\/code> helpers are just half the checks in the main boundary helper. In cases where there are lots of boundary tests being performed, the reduced check count can add up.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly Regex s_regex = new Regex(@\"\\ba\\b\", RegexOptions.Compiled | RegexOptions.IgnoreCase);\r\n\r\n    [Benchmark]\r\n    public int CountStandaloneAs() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: 
right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountStandaloneAs<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">20.58 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CountStandaloneAs<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">19.25 ms<\/td>\n<td style=\"text-align: right;\">0.94<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>Regex<\/code> optimizer is all about pattern recognition: it looks for sequences and shapes it recognizes and performs transforms over those to put them into a more efficiently-processable form. One example of this is with alternations around coalescable branches. Let&#8217;s say you have an alternation <code>a|e|i|o|u<\/code>. You could process that as an alternation, but it&#8217;s also much more efficiently represented and processed as the equivalent set <code>[aeiou]<\/code>. There is an optimization that does such transformations as part of handling alternations. However, through .NET 9, it only handled single characters and sets, but not negated sets. For example, it would transform <code>a|e|i|o|u<\/code> into <code>[aeiou]<\/code>, and it would transform <code>[aei]|[ou]<\/code> into <code>[aeiou]<\/code>, but it would not merge negations like <code>[^\\n]<\/code>, otherwise known as <code>.<\/code> (when not in <code>RegexOptions.Singleline<\/code> mode). When developers want a set that represents all characters, there are various idioms they employ, such as <code>[\\s\\S]<\/code>, which says &#8220;this is a set of all whitespace and non-whitespace characters&#8221;, aka everything. Another common idiom is <code>\\n|.<\/code>, which is the same as <code>\\n|[^\\n]<\/code>, which says &#8220;this is an alternation that matches either a newline or anything other than a newline&#8221;, aka also everything. 
Unfortunately, while examples like <code>[\\d\\D]<\/code> have been handled well, <code>.|\\n<\/code> has not, because of the gap in the alternation optimization. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118109\">dotnet\/runtime#118109<\/a> improves that, so that such &#8220;not&#8221; cases are mergeable as part of the existing optimization. That takes a relatively expensive alternation and converts it into a super fast set check. And while, in general, set containment checks are very efficient, this one is as efficient as you can get, as it&#8217;s always true. We can see an example of this with a pattern intended to match C-style comment blocks.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private const string Input = \"\"\"\r\n        \/* This is a comment. 
*\/\r\n        \/* Another comment *\/\r\n        \/* Multi-line\r\n           comment *\/\r\n        \"\"\";\r\n    private static readonly Regex s_regex = new Regex(@\"\/\\*(?:.|\\n)*?\\*\/\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(Input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">344.80 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">93.59 ns<\/td>\n<td style=\"text-align: right;\">0.27<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that there&#8217;s another change that helps .NET 10 here, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118373\">dotnet\/runtime#118373<\/a>, though I hesitate to call it out as a performance improvement since it&#8217;s really more of a bug fix. As part of writing this post, these benchmark numbers were showing some oddities (it&#8217;s important in general to be skeptical of benchmark results and to investigate anything that doesn&#8217;t align with reason and expectations). The result of investigating was a one-word change that yielded significant speedups on this test, specifically when using <code>RegexOptions.Compiled<\/code> (the bug didn&#8217;t exist in the source generator). As part of handling lazy loops, there&#8217;s a special-case for when the lazy loop is around a set that matches any character, which, thanks to the previous PR, <code>(?:.|\\n)<\/code> now does. That special-case recognizes that if the lazy loop matches anything, we can efficiently find the end of the lazy loop by searching for whatever comes after the loop (e.g. in this test, the loop is followed by the literal <code>\"*\/\"<\/code>). 
Unfortunately, the helper that emits that <code>IndexOf<\/code> call was passed the wrong node from the pattern: it was being passed the object representing the <code>(?:.|\\n)<\/code> any-set rather than the <code>\"*\/\"<\/code> literal, which resulted in it emitting the equivalent of <code>IndexOfAnyInRange((char)0, '\\uFFFF')<\/code> rather than the equivalent of <code>IndexOf(\"*\/\")<\/code>. Oops. It was still functionally correct, in that the <code>IndexOfAnyInRange<\/code> call would successfully match the first character and the loop would re-evaluate from that location, but that means that rather than efficiently skipping using SIMD over a bunch of positions that couldn&#8217;t possibly match, we were doing non-trivial work for each and every position along the way.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118087\">dotnet\/runtime#118087<\/a> represents another interesting transformation related to alternations. It&#8217;s very common to come across alternations with empty branches, possibly because that&#8217;s what the developer wrote, but more commonly as an outcome of other transformations that have happened. For example, given the pattern <code>\\r\\n|\\r<\/code>, which is trying to match line endings that begin with <code>\\r<\/code>, there is an optimization that will factor out a common prefix of all of the branches, producing the equivalent of <code>\\r(?:\\n|)<\/code>; in other words, <code>\\r<\/code> followed by either a line feed or empty. Such an alternation is a perfectly valid representation for this concept, but there&#8217;s a more natural one: <code>?<\/code>. Behaviorally, this pattern is identical to <code>\\r\\n?<\/code>, and because the latter is more common and more canonical, the regex engine has more optimizations that recognize this loop-based form, for example coalescing with other loops, or auto-atomicity. 
As such, this PR finds all alternations of the form <code>X|<\/code> and transforms them into <code>X?<\/code>. Similarly, it finds all alternations of the form <code>|X<\/code> and transforms them into <code>X??<\/code>. The difference between <code>X|<\/code> and <code>|X<\/code> is whether <code>X<\/code> is tried first or empty is tried first; similarly, the difference between the greedy <code>X?<\/code> loop and the lazy <code>X??<\/code> loop is whether <code>X<\/code> is tried first or empty is tried first. The impact of this can be seen in the code generated for the previously cited example. Here is the source-generated code for the heart of the matching routine for <code>\\r\\n|\\r<\/code> on .NET 9:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match '\\r'.\r\nif (slice.IsEmpty || slice[0] != '\\r')\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}\r\n\r\n\/\/ Match with 2 alternative expressions, atomically.\r\n{\r\n    int alternation_starting_pos = pos;\r\n\r\n    \/\/ Branch 0\r\n    {\r\n        \/\/ Match '\\n'.\r\n        if ((uint)slice.Length &lt; 2 || slice[1] != '\\n')\r\n        {\r\n            goto AlternationBranch;\r\n        }\r\n\r\n        pos += 2;\r\n        slice = inputSpan.Slice(pos);\r\n        goto AlternationMatch;\r\n\r\n        AlternationBranch:\r\n        pos = alternation_starting_pos;\r\n        slice = inputSpan.Slice(pos);\r\n    }\r\n\r\n    \/\/ Branch 1\r\n    {      \r\n        pos++;\r\n        slice = inputSpan.Slice(pos);\r\n    }\r\n\r\n    AlternationMatch:;\r\n}<\/code><\/pre>\n<p>Now, here&#8217;s what&#8217;s produced on .NET 10:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match '\\r'.\r\nif (slice.IsEmpty || slice[0] != '\\r')\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}\r\n\r\n\/\/ Match '\\n' atomically, optionally.\r\nif ((uint)slice.Length &gt; (uint)1 &amp;&amp; slice[1] == '\\n')\r\n{\r\n    slice = slice.Slice(1);\r\n    pos++;\r\n}<\/code><\/pre>\n<p>The optimizer 
recognized that the <code>\\r\\n|\\r<\/code> was the same as <code>\\r(?:\\n|)<\/code>, which is the same as <code>\\r\\n?<\/code>, which is the same as <code>\\r(?&gt;\\n?)<\/code>, which it can produce much simplified code for, given that it no longer needs any backtracking.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly Regex s_regex = new Regex(@\"ab|a\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">23.35 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">18.73 ms<\/td>\n<td style=\"text-align: right;\">0.80<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>.NET 10 also features improvements to <code>Regex<\/code> that go beyond just this form of work elimination. <code>Regex<\/code>&#8216;s matching routines are logically factored into two pieces: finding as quickly as possible the next place that could possibly match (<code>TryFindNextPossibleStartingPosition<\/code>), and then performing the full matching routine at that location (<code>TryMatchAtCurrentPosition<\/code>). 
It&#8217;s desirable that <code>TryFindNextPossibleStartingPosition<\/code> both does its work as quickly as possible while also significantly limiting the number of locations a full match should be performed. <code>TryFindNextPossibleStartingPosition<\/code>, for example, could operate very quickly just by always saying that the next index in the input should be tested, which would result in the full matching logic being performed at every index in the input; that&#8217;s not great for performance. Instead, the optimizer analyzes the pattern looking for things that would allow it to quickly search for viable starting locations, e.g. fixed strings or sets at known offsets in the pattern. Anchors are some of the most valuable things the optimizer can find, as they significantly inhibit the possible places matching is valid; the ideal pattern begins with a beginning anchor (<code>^<\/code>), which then means the only possible place matching can be successful is at index 0.<\/p>\n<p>We previously discussed lookarounds, but as it turns out, until .NET 10, lookarounds weren&#8217;t factored into what <code>TryFindNextPossibleStartingPosition<\/code> should look for. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112107\">dotnet\/runtime#112107<\/a> changes that. It teaches the optimizer when and how to explore positive lookaheads at the beginning of a pattern for constructs that could help it more efficiently find starting locations. For example, in .NET 9, for the pattern <code>(?=^)hello<\/code>, here&#8217;s what the source generator emits for <code>TryFindNextPossibleStartingPosition<\/code>:<\/p>\n<pre><code class=\"language-csharp\">private bool TryFindNextPossibleStartingPosition(ReadOnlySpan&lt;char&gt; inputSpan)\r\n{\r\n    int pos = base.runtextpos;\r\n\r\n    \/\/ Any possible match is at least 5 characters.\r\n    if (pos &lt;= inputSpan.Length - 5)\r\n    {\r\n        \/\/ The pattern has the literal \"hello\" at the beginning of the pattern. 
Find the next occurrence.\r\n        \/\/ If it can't be found, there's no match.\r\n        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_hello_Ordinal);\r\n        if (i &gt;= 0)\r\n        {\r\n            base.runtextpos = pos + i;\r\n            return true;\r\n        }\r\n    }\r\n\r\n    \/\/ No match found.\r\n    base.runtextpos = inputSpan.Length;\r\n    return false;\r\n}<\/code><\/pre>\n<p>The optimizer found the <code>\"hello\"<\/code> string in the pattern and is thus searching for that as part of finding the next possible place to do the full match. That would be excellent, if it weren&#8217;t for the lookahead that also says any match must happen at the beginning of the input. Now in .NET 10, we get this:<\/p>\n<pre><code class=\"language-csharp\">private bool TryFindNextPossibleStartingPosition(ReadOnlySpan&lt;char&gt; inputSpan)\r\n{\r\n    int pos = base.runtextpos;\r\n\r\n    \/\/ Any possible match is at least 5 characters.\r\n    if (pos &lt;= inputSpan.Length - 5)\r\n    {\r\n        \/\/ The pattern leads with a beginning (\\A) anchor.\r\n        if (pos == 0)\r\n        {\r\n            return true;\r\n        }\r\n    }\r\n\r\n    \/\/ No match found.\r\n    base.runtextpos = inputSpan.Length;\r\n    return false;\r\n}<\/code><\/pre>\n<p>That <code>pos == 0<\/code> check is critical, because it means we will only ever attempt the full match in one location and we can avoid the search that would happen even if we never found a good location to perform the match. 
Again, any time you eliminate work like this, you can construct tantalizing micro-benchmarks&#8230;<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly Regex s_regex = new Regex(@\"(?=^)hello\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">2,383,784.95 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">17.43 ns<\/td>\n<td style=\"text-align: right;\">0.000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That same PR also improved optimizations over alternations. It&#8217;s already the case that the branches of alternations are analyzed looking for common prefixes that can be factored out. For example, given the pattern <code>abc|abd<\/code>, the optimizer will spot the shared <code>\"ab\"<\/code> prefix at the beginning of each branch and factor that out, resulting in <code>ab(?:c|d)<\/code>, and will then see that each branch of the remaining alternation are individual characters, which it can convert into a set, <code>ab[cd]<\/code>. 
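<\/p>\n<p>As a quick sanity check of that rewrite (a minimal sketch, not the optimizer&#8217;s own code, with sample inputs made up purely for illustration), the three forms of the pattern can be compared directly to confirm they agree on every input:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\nusing System.Text.RegularExpressions;\r\n\r\n\/\/ The original alternation, the prefix-factored form, and the final set-based form.\r\nstring[] patterns = [\"abc|abd\", \"ab(?:c|d)\", \"ab[cd]\"];\r\n\r\n\/\/ Hypothetical sample inputs, chosen only for illustration.\r\nstring[] inputs = [\"abc\", \"abd\", \"abe\", \"xxabdxx\", \"ab\"];\r\n\r\nforeach (string input in inputs)\r\n{\r\n    \/\/ Every form of the pattern should give the same answer for a given input.\r\n    bool expected = Regex.IsMatch(input, patterns[0]);\r\n    foreach (string pattern in patterns)\r\n    {\r\n        if (Regex.IsMatch(input, pattern) != expected)\r\n        {\r\n            throw new InvalidOperationException($\"'{pattern}' disagrees on '{input}'\");\r\n        }\r\n    }\r\n\r\n    Console.WriteLine($\"{input}: {expected}\");\r\n}<\/code><\/pre>\n<p>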
If, however, the branches began with anchors, these optimizations wouldn&#8217;t be applied. Given the pattern <code>^abc|^abd<\/code>, the code generators would end up emitting this exactly as it&#8217;s written, with an alternation with two branches, the first branch checking for the beginning and then matching <code>\"abc\"<\/code>, the second branch also checking for the beginning and then matching <code>\"abd\"<\/code>. Now in .NET 10, the anchor can be factored out, such that <code>^abc|^abd<\/code> ends up being rewritten as <code>^ab[cd]<\/code>.<\/p>\n<p>As a small tweak, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112065\">dotnet\/runtime#112065<\/a> also helps improve the source generated code for repeaters by using a more efficient searching routine. Let&#8217;s take the pattern <code>[0-9a-f]{32}<\/code> as an example. This is looking for sequences of 32 lowercase hex digits. In .NET 9, the implementation of that ends up looking like this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match a character in the set [0-9a-f] exactly 32 times.\r\n{\r\n    if ((uint)slice.Length &lt; 32)\r\n    {\r\n        return false; \/\/ The input didn't match.\r\n    }\r\n\r\n    if (slice.Slice(0, 32).IndexOfAnyExcept(Utilities.s_asciiHexDigitsLower) &gt;= 0)\r\n    {\r\n        return false; \/\/ The input didn't match.\r\n    }\r\n}<\/code><\/pre>\n<p>Simple, clean, fairly concise, and utilizing the vectorized <code>IndexOfAnyExcept<\/code> to very efficiently validate that the whole sequence of 32 characters are lowercase hex. We can do a tad bit better, though. The <code>IndexOfAnyExcept<\/code> method not only needs to find whether the span contains something other than one of the provided values, it needs to specify the index at which that found value occurs. 
That&#8217;s only a few instructions, but it&#8217;s a few unnecessary instructions, since here that exact index isn&#8217;t utilized&#8230; the implementation only cares whether it&#8217;s <code>&gt;= 0<\/code>, meaning whether anything was found or not. As such, we can instead use the <code>Contains<\/code> variant of this method, which doesn&#8217;t need to spend extra cycles determining the exact index. Now in .NET 10, this is generated:<\/p>\n<pre><code class=\"language-csharp\">\/\/ Match a character in the set [0-9a-f] exactly 32 times.\r\nif ((uint)slice.Length &lt; 32 || slice.Slice(0, 32).ContainsAnyExcept(Utilities.s_asciiHexDigitsLower))\r\n{\r\n    return false; \/\/ The input didn't match.\r\n}<\/code><\/pre>\n<p>Finally, the .NET 10 SDK includes a new analyzer related to <code>Regex<\/code>. It&#8217;s oddly common to see code that determines whether an input matches a <code>Regex<\/code> written like this: <code>Regex.Match(...).Success<\/code>. While functionally correct, that&#8217;s much more expensive than <code>Regex.IsMatch(...)<\/code>. For all of the engines, <code>Regex.Match(...)<\/code> requires allocating a new <code>Match<\/code> object and supporting data structures (except when there isn&#8217;t a match found, in which case it&#8217;s able to use an empty singleton); in contrast, <code>IsMatch<\/code> doesn&#8217;t need to allocate such an instance because it doesn&#8217;t need to return such an instance (as an implementation detail, it may still use a <code>Match<\/code> object, but it can reuse one rather than creating a new one each time). It can also avoid other inefficiencies. <code>RegexOptions.NonBacktracking<\/code> is &#8220;pay-for-play&#8221; with the information it needs to gather. Determining just <em>whether<\/em> there&#8217;s a match is cheaper than determining exactly where the match begins and ends, which is cheaper still than determining all of the captures that make up that match. 
<code>IsMatch<\/code> is thus the cheapest, only needing to determine that there is a match, not exactly where it is or what the exact captures are, whereas <code>Match<\/code> needs to determine all of that. <code>Regex.Matches(...).Count<\/code> is similar; it&#8217;s having to gather all of the relevant details and allocate a whole bunch of objects, whereas <code>Regex.Count(...)<\/code> can do so in a much more efficient manner. <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/7547\">dotnet\/roslyn-analyzers#7547<\/a> adds CA1874 and CA1875, which flag these cases and recommend use of <code>IsMatch<\/code> and <code>Count<\/code>, respectively.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2025\/09\/CA1874.png\" alt=\"Analyzer and fixer for CA1874\" \/><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2025\/09\/CA1875.png\" alt=\"Analyzer and fixer for CA1875\" \/><\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_input = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly Regex s_regex = new Regex(@\"\\b\\w+\\b\", RegexOptions.NonBacktracking);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public int MatchesCount() =&gt; s_regex.Matches(s_input).Count;\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_regex.Count(s_input);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th 
style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>MatchesCount<\/td>\n<td style=\"text-align: right;\">680.4 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">665530176 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td style=\"text-align: right;\">219.0 ms<\/td>\n<td style=\"text-align: right;\">0.32<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Regex<\/code> is one form of searching, but there are other primitives and helpers throughout .NET for various forms of searching, and they&#8217;ve seen meaningful improvements in .NET 10, as well.<\/p>\n<h3>SearchValues<\/h3>\n<p>When discussing performance improvements in .NET 8, I called out two changes that were my favorites. The first was dynamic PGO. The second was <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/#searchvalues\"><code>SearchValues<\/code><\/a>.<\/p>\n<p><code>SearchValues<\/code> provides a mechanism for precomputing optimal strategies for searching. .NET 8 introduced overloads of <code>SearchValues.Create<\/code> that produce <code>SearchValues&lt;byte&gt;<\/code> and <code>SearchValues&lt;char&gt;<\/code>, and corresponding overloads of <code>IndexOfAny<\/code> and friends that accept such instances. 
If there&#8217;s a set of values you&#8217;ll be searching for over and over and over, you can create one of these instances once, cache it, and then use it for all subsequent searches for those values, e.g.<\/p>\n<pre><code class=\"language-csharp\">private static readonly SearchValues&lt;char&gt; s_validBase64Chars = SearchValues.Create(\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+\/\");\r\n\r\n\/\/ Valid only if the input contains nothing outside the Base64 alphabet.\r\ninternal static bool IsValidBase64(ReadOnlySpan&lt;char&gt; input) =&gt;\r\n    !input.ContainsAnyExcept(s_validBase64Chars);<\/code><\/pre>\n<p>There are a plethora of different implementations used by <code>SearchValues&lt;T&gt;<\/code> behind the scenes, each of which is selected and configured based on the <code>T<\/code> and the exact nature of the target values for which we&#8217;re searching. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106900\">dotnet\/runtime#106900<\/a> adds another, which both helps to shave off several instructions in the core vectorized search loop, and helps to highlight just how nuanced these different algorithms can be. Previously, if four target <code>byte<\/code> values were provided, and they weren&#8217;t in a contiguous range, <code>SearchValues.Create<\/code> would choose an implementation that just uses four vectors, one per target byte, and does four comparisons (one against each target vector) for each input vector being tested. However, there&#8217;s already a specialization that&#8217;s used for more than five target bytes when all of the target bytes are ASCII. 
This PR allows that specialization to be used for either four or five targets when the lower nibble (the bottom four bits) of each of the targets is unique, and in doing so, it becomes several instructions cheaper: rather than doing four comparisons, it can do a single shuffle and equality check.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly byte[] s_haystack = new HttpClient().GetByteArrayAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n    private static readonly SearchValues&lt;byte&gt; s_needle = SearchValues.Create(\"\\0\\r&amp;&lt;\"u8);\r\n\r\n    [Benchmark]\r\n    public int Count()\r\n    {\r\n        int count = 0;\r\n\r\n        ReadOnlySpan&lt;byte&gt; haystack = s_haystack.AsSpan();\r\n        int pos;\r\n        while ((pos = haystack.IndexOfAny(s_needle)) &gt;= 0)\r\n        {\r\n            count++;\r\n            haystack = haystack.Slice(pos + 1);\r\n        }\r\n\r\n        return count;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">3.704 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">2.668 ms<\/td>\n<td style=\"text-align: right;\">0.72<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107798\">dotnet\/runtime#107798<\/a> improves another such algorithm, when 
AVX512 is available. One of the fallback strategies used by <code>SearchValues.Create&lt;char&gt;<\/code> is a vectorized &#8220;probabilistic map&#8221;, basically a Bloom filter. It has a bitmap that stores a bit for each <code>byte<\/code> of the <code>char<\/code>; when testing to see whether the <code>char<\/code> is in the target set, it checks to see whether the bit for each of the <code>char<\/code>&#8216;s <code>byte<\/code>s is set. If at least one isn&#8217;t set, the <code>char<\/code> definitely isn&#8217;t in the target set. If both are set, more validation will need to be done to determine the actual inclusion of that value in the set. This can make it very efficient to rule out large amounts of input that definitely are not in the set and then only spend more effort on input that might be. The implementation involves various shuffle, shift, and permute operations, and this change is able to use a better set of instructions that reduce the number needed.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly SearchValues&lt;char&gt; s_searchValues = SearchValues.Create(\"\u00df\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\");\r\n    private string _input = new string('\\n', 10_000);\r\n\r\n    [Benchmark]\r\n    public int IndexOfAny() =&gt; _input.AsSpan().IndexOfAny(s_searchValues);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IndexOfAny<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">437.7 ns<\/td>\n<td 
style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IndexOfAny<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">404.7 ns<\/td>\n<td style=\"text-align: right;\">0.92<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>While .NET 8 introduced support for <code>SearchValues&lt;byte&gt;<\/code> and <code>SearchValues&lt;char&gt;<\/code>, .NET 9 introduced support for <code>SearchValues&lt;string&gt;<\/code>. <code>SearchValues&lt;string&gt;<\/code> is used a bit differently from <code>SearchValues&lt;byte&gt;<\/code> and <code>SearchValues&lt;char&gt;<\/code>; whereas <code>SearchValues&lt;byte&gt;<\/code> is used to search for target <code>byte<\/code>s within a collection of <code>byte<\/code>s and <code>SearchValues&lt;char&gt;<\/code> is used to search for target <code>char<\/code>s within a collection of <code>char<\/code>s, <code>SearchValues&lt;string&gt;<\/code> is used to search for target <code>string<\/code>s within a single <code>string<\/code> (or span of <code>char<\/code>s). In other words, it&#8217;s a multi-substring search. Let&#8217;s say you have the regular expression <code>(?i)hello|world<\/code>; that is specifying that it should look for either &#8220;hello&#8221; or &#8220;world&#8221; in a case-insensitive manner; the <code>SearchValues<\/code> equivalent of that is <code>SearchValues.Create([\"hello\", \"world\"], StringComparison.OrdinalIgnoreCase)<\/code> (in fact, if you specify that pattern, the <code>Regex<\/code> compiler and source generator will use such a <code>SearchValues.Create<\/code> call under the covers in order to optimize the search).<\/p>\n<p><code>SearchValues&lt;string&gt;<\/code> also gets better in .NET 10. A key algorithm used by <code>SearchValues&lt;string&gt;<\/code> whenever possible and relevant is called &#8220;Teddy,&#8221; and enables performing a vectorized search for multiple substrings. 
In its core processing loop, when using AVX512, there are two instructions, a <code>PermuteVar8x64x2<\/code> and an <code>AlignRight<\/code>; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107819\">dotnet\/runtime#107819<\/a> recognizes that those can be replaced by a single <code>PermuteVar64x8x2<\/code>. Similarly, when on Arm64, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118110\">dotnet\/runtime#118110<\/a> plays the instructions game and replaces a use of <code>ExtractNarrowingSaturateUpper<\/code> with the slightly cheaper <code>UnzipEven<\/code>.<\/p>\n<p><code>SearchValues&lt;string&gt;<\/code> is also able to optimize searching for a single string, spending more time to come up with optimal search parameters than does a simpler <code>IndexOf(string, StringComparison)<\/code> call. Similar to the approach with the probabilistic maps employed earlier, the vectorized search can yield false positives that then need to be weeded out. In some cases by construction, however, we know that false positives aren&#8217;t possible; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108368\">dotnet\/runtime#108368<\/a> extends an existing optimization that was case-sensitive only to also apply in some case-insensitive uses, such that we can avoid doing the extra validation step in more cases. 
For the candidate verification that remains, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108365\">dotnet\/runtime#108365<\/a> also significantly reduces overhead in a variety of cases, including adding specialized handling for needles (the things being searched for) of up to 16 characters (previously it was only up to 8), and precomputing more information to make the verification faster.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(@\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\").Result;\r\n\r\n    private static readonly Regex s_the = new(\"the\", RegexOptions.IgnoreCase | RegexOptions.Compiled);\r\n    private static readonly Regex s_something = new(\"something\", RegexOptions.IgnoreCase | RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int CountThe() =&gt; s_the.Count(s_haystack);\r\n\r\n    [Benchmark]\r\n    public int CountSomething() =&gt; s_something.Count(s_haystack);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountThe<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">9.881 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CountThe<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">7.799 ms<\/td>\n<td style=\"text-align: right;\">0.79<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td 
style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>CountSomething<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">2.466 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CountSomething<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">2.027 ms<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118108\">dotnet\/runtime#118108<\/a> also adds a &#8220;packed&#8221; variant of the single-string implementation, meaning it&#8217;s able to handle common cases like ASCII more efficiently by ignoring a character&#8217;s upper zero byte in order to fit twice as much into a vector.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_haystack = string.Concat(Enumerable.Repeat(\"Sherlock Holm_s\", 8_000));\r\n    private static readonly SearchValues&lt;string&gt; s_needles = SearchValues.Create([\"Sherlock Holmes\"], StringComparison.OrdinalIgnoreCase);\r\n\r\n    [Benchmark] \r\n    public bool ContainsAny() =&gt; s_haystack.AsSpan().ContainsAny(s_needles);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ContainsAny<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">58.41 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ContainsAny<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">16.32 us<\/td>\n<td 
style=\"text-align: right;\">0.28<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>MemoryExtensions<\/h3>\n<p>The searching improvements continue beyond <code>SearchValues<\/code>, of course. Prior to .NET 10, the <code>MemoryExtensions<\/code> class already had a wealth of support for searching and manipulating spans, with extension methods like <code>IndexOf<\/code>, <code>IndexOfAnyExceptInRange<\/code>, <code>ContainsAny<\/code>, <code>Count<\/code>, <code>Replace<\/code>, <code>SequenceCompareTo<\/code>, and more (the set was further extended as well by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112951\">dotnet\/runtime#112951<\/a>, which added <code>CountAny<\/code> and <code>ReplaceAny<\/code>), but the vast majority of these were limited to working with <code>T<\/code> types constrained to be <code>IEquatable&lt;T&gt;<\/code>. And in practice, many of the types you want to search do in fact implement <code>IEquatable&lt;T&gt;<\/code>. However, you might be in a generic context with an unconstrained <code>T<\/code>, such that even if the <code>T<\/code> used to instantiate the generic type or method is equatable, it&#8217;s not evident in the type system and thus the <code>MemoryExtensions<\/code> method couldn&#8217;t be used. And of course there are scenarios where you want to be able to supply a different comparison routine. 
Both of these scenarios show up, for example, in the implementation of LINQ&#8217;s <code>Enumerable.Contains<\/code>; if the source <code>IEnumerable&lt;TSource&gt;<\/code> is actually something we could treat as a span, like <code>TSource[]<\/code> or <code>List&lt;TSource&gt;<\/code>, it&#8217;d be nice to be able to just delegate to the optimized <code>MemoryExtensions.Contains&lt;T&gt;<\/code>, but a) <code>Enumerable.Contains<\/code> doesn&#8217;t constrain its <code>TSource : IEquatable&lt;TSource&gt;<\/code>, and b) <code>Enumerable.Contains<\/code> accepts an optional comparer.<\/p>\n<p>To address this, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110197\">dotnet\/runtime#110197<\/a> adds ~30 new overloads to the <code>MemoryExtensions<\/code> class. These overloads all parallel existing methods, but remove the <code>IEquatable&lt;T&gt;<\/code> (or <code>IComparable&lt;T&gt;<\/code>) constraint on the generic method parameter and accept an optional <code>IEqualityComparer&lt;T&gt;?<\/code> (or <code>IComparer&lt;T&gt;<\/code>). 
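<\/p>\n<p>As a rough sketch of what that enables, assuming the overload shape described above (the same <code>Contains<\/code> name, with an optional <code>IEqualityComparer&lt;T&gt;?<\/code> as the trailing parameter), a generic helper with an unconstrained <code>T<\/code> can hand its search straight to <code>MemoryExtensions<\/code>. The <code>Finder<\/code> type and its <code>ContainsValue<\/code> method are hypothetical names used only for illustration:<\/p>\n<pre><code class=\"language-csharp\">using System;\r\nusing System.Collections.Generic;\r\n\r\nConsole.WriteLine(Finder.ContainsValue&lt;int&gt;([1, 2, 3], 3));\r\nConsole.WriteLine(Finder.ContainsValue&lt;string&gt;([\"a\", \"b\"], \"B\", StringComparer.OrdinalIgnoreCase));\r\n\r\nstatic class Finder\r\n{\r\n    \/\/ T is unconstrained (no IEquatable&lt;T&gt;), so this relies on the comparer-accepting overload\r\n    \/\/ that .NET 10 adds; a null comparer means the default equality comparison for T.\r\n    public static bool ContainsValue&lt;T&gt;(ReadOnlySpan&lt;T&gt; span, T value, IEqualityComparer&lt;T&gt;? comparer = null) =&gt;\r\n        span.Contains(value, comparer);\r\n}<\/code><\/pre>\n<p>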
When no comparer or a default comparer is supplied, they can fall back to using the same vectorized logic for relevant types, and otherwise can provide as optimal an implementation as they can muster, based on the nature of <code>T<\/code> and the supplied comparer.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; _data = Enumerable.Range(0, 1_000_000).ToArray();\r\n\r\n    [Benchmark]\r\n    public bool Contains() =&gt; _data.Contains(int.MaxValue, EqualityComparer&lt;int&gt;.Default);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Contains<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">213.94 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Contains<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">67.86 us<\/td>\n<td style=\"text-align: right;\">0.32<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(It&#8217;s also worth highlighting that with the &#8220;first-class&#8221; span support in C# 14, many of these extensions from <code>MemoryExtensions<\/code> now naturally show up directly on types like <code>string<\/code>.)<\/p>\n<p>This kind of searching often shows up as part of other APIs. For example, encoding APIs often need to first find something to be encoded, and that searching can be accelerated by using one of these efficiently implemented search APIs. 
There are dozens and dozens of existing examples of that throughout the core libraries, many of them using <code>SearchValues<\/code> or these various <code>MemoryExtensions<\/code> methods. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110574\">dotnet\/runtime#110574<\/a> adds another, speeding up <code>string.Normalize<\/code>&#8217;s argument validation. The previous implementation walked character by character looking for the first surrogate. The new implementation gives that a jump start by using <code>IndexOfAnyInRange<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _input = \"This is a test. This is only a test. Nothing to see here. \\u263A\\uFE0F\";\r\n\r\n    [Benchmark]\r\n    public string Normalize() =&gt; _input.Normalize();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Normalize<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">104.93 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Normalize<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">88.94 ns<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/110478\">dotnet\/runtime#110478<\/a> similarly updates <code>HttpUtility.UrlDecode<\/code> to use the vectorized <code>IndexOfAnyInRange<\/code>. 
It also avoids allocating the resulting <code>string<\/code> if nothing needs to be decoded.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Web;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public string UrlDecode() =&gt; HttpUtility.UrlDecode(\"aaaaabbbbb%e2%98%ba%ef%b8%8f\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>UrlDecode<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">59.42 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>UrlDecode<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">54.26 ns<\/td>\n<td style=\"text-align: right;\">0.91<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114494\">dotnet\/runtime#114494<\/a> employs <code>SearchValues<\/code> in <code>OptimizedInboxTextEncoder<\/code>, which is the core implementation that backs the various encoders like <code>JavaScriptEncoder<\/code> and <code>HtmlEncoder<\/code> in the <code>System.Text.Encodings.Web<\/code> library.<\/p>\n<h2>JSON<\/h2>\n<p>JSON is at the heart of many different domains, having become the lingua franca of data interchange on the web. With <code>System.Text.Json<\/code> as the recommended library for working with JSON in .NET, it is constantly evolving to meet additional performance requirements. 
.NET 10 sees it updated with both improvements to the performance of existing methods as well as new methods specifically geared towards helping with performance.<\/p>\n<p>The <code>JsonSerializer<\/code> type is layered on top of the lower-level <code>Utf8JsonReader<\/code> and <code>Utf8JsonWriter<\/code> types. When serializing, <code>JsonSerializer<\/code> needs an instance of <code>Utf8JsonWriter<\/code>, which is a <code>class<\/code>, and any associated objects, such as an <code>IBufferWriter<\/code> instance. For any temporary buffers it requires, it&#8217;ll use rented buffers from <code>ArrayPool&lt;byte&gt;<\/code>, but for these helper objects, it maintains its own cache, to avoid needing to recreate them at very high frequencies. That cache was being used for all asynchronous streaming serialization operations, but as it turns out, it wasn&#8217;t being used for synchronous streaming serialization operations. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112745\">dotnet\/runtime#112745<\/a> fixes that to make the use of the cache consistent, avoiding these intermediate allocations.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Data _data = new();\r\n    private MemoryStream _stream = new();\r\n\r\n    [Benchmark]\r\n    public void Serialize()\r\n    {\r\n        _stream.Position = 0;\r\n        JsonSerializer.Serialize(_stream, _data);\r\n    }\r\n\r\n    public class Data\r\n    {\r\n        public int Value1 { get; set; }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th 
style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Serialize<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">115.36 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">176 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Serialize<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">77.73 ns<\/td>\n<td style=\"text-align: right;\">0.67<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Earlier when discussing collections, it was noted that <code>OrderedDictionary&lt;TKey, TValue&gt;<\/code> now exposes overloads of methods like <code>TryAdd<\/code> that return the relevant item&#8217;s index, which then allows subsequent access to avoid the more costly key-based lookup. As it turns out, <code>JsonObject<\/code>&#8216;s indexer needs to do that, first indexing into the dictionary by key, doing some checks, and then indexing again. It&#8217;s now been updated to use these new overloads. 
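<\/p>\n<p>To make the pattern concrete, here&#8217;s a minimal sketch of &#8220;look up once by key, then reuse the index&#8221; written directly against <code>OrderedDictionary&lt;TKey, TValue&gt;<\/code>. This is not <code>JsonObject<\/code>&#8217;s actual implementation; the word-counting scenario and the variable names are invented purely for illustration:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0\r\n\r\nOrderedDictionary&lt;string, int&gt; counts = new();\r\n\r\nforeach (string word in new[] { \"apple\", \"banana\", \"apple\" })\r\n{\r\n    \/\/ A single key-based lookup: TryAdd reports the index of the added or existing entry.\r\n    if (!counts.TryAdd(word, 1, out int index))\r\n    {\r\n        \/\/ The key already existed, so update by index instead of hashing the key again.\r\n        counts.SetAt(index, counts.GetAt(index).Value + 1);\r\n    }\r\n}\r\n\r\nConsole.WriteLine(counts[\"apple\"]); \/\/ 2<\/code><\/pre>\n<p>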
As those lookups typically dominate the cost of using the setter, this can upwards of double throughput of <code>JsonObject<\/code>&#8216;s indexer:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json.Nodes;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private JsonObject _obj = new();\r\n\r\n    [Benchmark]\r\n    public void Set() =&gt; _obj[\"key\"] = \"value\";\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Set<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">40.56 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Set<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">16.96 ns<\/td>\n<td style=\"text-align: right;\">0.42<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Most of the improvements in <code>System.Text.Json<\/code>, however, are actually via new APIs. This same &#8220;avoid a double lookup&#8221; issue shows up in other places, for example wanting to add a property to a <code>JsonObject<\/code> but only if it doesn&#8217;t yet exist. 
With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111229\">dotnet\/runtime#111229<\/a> from <a href=\"https:\/\/github.com\/Flu\">@Flu<\/a>, that&#8217;s addressed with a new <code>TryAdd<\/code> method (as well as a <code>TryAdd<\/code> overload and an overload of the existing <code>TryGetPropertyValue<\/code> that, as with <code>OrderedDictionary&lt;&gt;<\/code>, returns the index of the property).<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json.Nodes;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private JsonObject _obj = new();\r\n    private JsonNode _value = JsonValue.Create(\"value\");\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void NonOverwritingSet_Manual()\r\n    {\r\n        _obj.Remove(\"key\");\r\n        if (!_obj.ContainsKey(\"key\"))\r\n        {\r\n            _obj.Add(\"key\", _value);\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void NonOverwritingSet_TryAdd()\r\n    {\r\n        _obj.Remove(\"key\");\r\n        _obj.TryAdd(\"key\", _value);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonOverwritingSet_Manual<\/td>\n<td style=\"text-align: right;\">16.59 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NonOverwritingSet_TryAdd<\/td>\n<td style=\"text-align: right;\">14.31 ns<\/td>\n<td style=\"text-align: right;\">0.86<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109472\">dotnet\/runtime#109472<\/a> from <a href=\"https:\/\/github.com\/karakasa\">@karakasa<\/a> also imbues 
<code>JsonArray<\/code> with new <code>RemoveAll<\/code> and <code>RemoveRange<\/code> methods. In addition to the usability benefits these can provide, they have the same performance benefits they have on <code>List&lt;T&gt;<\/code> (which is not a coincidence, given that <code>JsonArray<\/code> is, as an implementation detail, a wrapper for a <code>List&lt;JsonNode?&gt;<\/code>). Removing &#8220;incorrectly&#8221; from a <code>List&lt;T&gt;<\/code> can end up being an <code>O(N^2)<\/code> endeavor, e.g. when I run this:<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0\r\n\r\nusing System.Diagnostics;\r\n\r\nfor (int i = 100_000; i &lt; 700_000; i += 100_000)\r\n{\r\n    List&lt;int&gt; items = Enumerable.Range(0, i).ToList();\r\n\r\n    Stopwatch sw = Stopwatch.StartNew();\r\n    while (items.Count &gt; 0)\r\n    {\r\n        items.RemoveAt(0); \/\/ uh oh\r\n    }\r\n    Console.WriteLine($\"{i} =&gt; {sw.Elapsed}\");\r\n}<\/code><\/pre>\n<p>I get output like this:<\/p>\n<pre><code class=\"language-txt\">100000 =&gt; 00:00:00.2271798\r\n200000 =&gt; 00:00:00.8328727\r\n300000 =&gt; 00:00:01.9820088\r\n400000 =&gt; 00:00:03.9242008\r\n500000 =&gt; 00:00:06.9549009\r\n600000 =&gt; 00:00:11.1104903<\/code><\/pre>\n<p>Note how as the list length grows linearly, the elapsed time is growing non-linearly. That&#8217;s primarily because each <code>RemoveAt(0)<\/code> is requiring the entire remainder of the list to shift down, which is <code>O(N)<\/code> in the length of the list. That means we get <code>N + (N-1) + (N-2) + ... + 1<\/code> operations, which is <code>N(N+1)\/2<\/code>, which is <code>O(N^2)<\/code>. Both <code>RemoveRange<\/code> and <code>RemoveAll<\/code> are able to avoid those costs by doing the shifting only once per element. 
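<\/p>\n<p>As an illustration of what &#8220;shifting only once per element&#8221; means, here&#8217;s a minimal sketch of a single-pass removal over a plain array. It is not the actual <code>List&lt;T&gt;.RemoveAll<\/code> source, and the <code>RemoveAllMatching<\/code> helper is a hypothetical name made up for this example, but it shows the same idea: every surviving element is copied at most once, straight to its final position, so the whole operation is <code>O(N)<\/code>.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0\r\n\r\nint[] items = Enumerable.Range(0, 10).ToArray();\r\nint count = RemoveAllMatching(items, items.Length, static n =&gt; n % 2 == 0);\r\nConsole.WriteLine(string.Join(\", \", items.Take(count))); \/\/ 1, 3, 5, 7, 9\r\n\r\n\/\/ Hypothetical helper: compacts the survivors toward the front in one pass\r\n\/\/ and returns the new logical length.\r\nstatic int RemoveAllMatching(int[] array, int length, Predicate&lt;int&gt; match)\r\n{\r\n    int write = 0;\r\n    for (int read = 0; read &lt; length; read++)\r\n    {\r\n        if (!match(array[read]))\r\n        {\r\n            array[write++] = array[read]; \/\/ each survivor moves at most once\r\n        }\r\n    }\r\n    return write;\r\n}<\/code><\/pre>\n<p>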
Of course, even without such methods, I could have written my previous removal loop in a way that keeps it linear, namely by repeatedly removing the last element rather than the first (and, of course, if I <em>really<\/em> intended on removing everything, I could have just used <code>Clear<\/code>). Typical use, however, ends up removing a smattering of elements, and being able to just delegate and not worry about accidentally incurring a non-linear overhead is helpful.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json.Nodes;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private JsonArray _arr;\r\n\r\n    [IterationSetup]\r\n    public void Setup() =&gt;\r\n        _arr = new JsonArray(Enumerable.Range(0, 100_000).Select(i =&gt; (JsonNode)i).ToArray());\r\n\r\n    [Benchmark]\r\n    public void Manual()\r\n    {\r\n        int i = 0;\r\n        while (i &lt; _arr.Count)\r\n        {\r\n            if (_arr[i]!.GetValue&lt;int&gt;() % 2 == 0)\r\n            {\r\n                _arr.RemoveAt(i);\r\n            }\r\n            else\r\n            {\r\n                i++;\r\n            }\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void RemoveAll() =&gt; _arr.RemoveAll(static n =&gt; n!.GetValue&lt;int&gt;() % 2 == 0);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Manual<\/td>\n<td style=\"text-align: right;\">355.230 ms<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RemoveAll<\/td>\n<td style=\"text-align: right;\">2.022 ms<\/td>\n<td 
style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Note that while <code>RemoveAll<\/code> in this micro-benchmark is more than 150x faster, it does have that small allocation that the manual implementation doesn&#8217;t. That&#8217;s due to a closure in the implementation while delegating to <code>List&lt;T&gt;.RemoveAll<\/code>. This could be avoided in the future if necessary.)<\/p>\n<p>Another frequently-requested new method is from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116363\">dotnet\/runtime#116363<\/a>, which adds new <code>Parse<\/code> methods to <code>JsonElement<\/code>. If a developer wants a <code>JsonElement<\/code> and only needs it temporarily, the most efficient mechanism available today is still the right answer: <code>Parse<\/code> a <code>JsonDocument<\/code>, use its <code>RootElement<\/code>, and then <em>only<\/em> when done with the <code>JsonElement<\/code>, dispose of the <code>JsonDocument<\/code>, e.g.<\/p>\n<pre><code class=\"language-csharp\">using (JsonDocument doc = JsonDocument.Parse(json))\r\n{\r\n    DoSomething(doc.RootElement);\r\n}<\/code><\/pre>\n<p>That, however, is really only viable when the <code>JsonElement<\/code> is used in a scoped manner. If a developer needs to hand out the <code>JsonElement<\/code>, they&#8217;re left with three options:<\/p>\n<ol>\n<li><code>Parse<\/code> into a <code>JsonDocument<\/code>, clone its <code>RootElement<\/code>, dispose of the <code>JsonDocument<\/code>, hand out the clone. While using <code>JsonDocument<\/code> is good for the temporary case, making a clone like this entails a fair bit of overhead:\n<pre><code class=\"language-csharp\">JsonElement clone;\r\nusing (JsonDocument doc = JsonDocument.Parse(json))\r\n{\r\n    clone = doc.RootElement.Clone();\r\n}\r\nreturn clone;<\/code><\/pre>\n<\/li>\n<li><code>Parse<\/code> into a <code>JsonDocument<\/code> and just hand out its <code>RootElement<\/code>. Please <em>do not do this<\/em>! 
<code>JsonDocument.Parse<\/code> creates a <code>JsonDocument<\/code> that&#8217;s backed by an array from the <code>ArrayPool&lt;&gt;<\/code>. If you don&#8217;t <code>Dispose<\/code> of the <code>JsonDocument<\/code> in this case, an array will be rented and then never returned to the pool. That&#8217;s not the end of the world; if someone else requests an array from the pool and the pool doesn&#8217;t have one cached to give them, it&#8217;ll just manufacture one, so eventually the pool&#8217;s arrays will be replenished. But the arrays in the pool are generally &#8220;more valuable&#8221; than others, because they&#8217;ve generally been around longer, and are thus more likely to be in higher generations. By using an <code>ArrayPool<\/code> array rather than a new array for a shorter-lived <code>JsonDocument<\/code>, you&#8217;re more likely throwing away an array that&#8217;ll have net more impact on the overall system. The impact of that is not easily seen in a micro-benchmark.\n<pre><code class=\"language-csharp\">return JsonDocument.Parse(json).RootElement; \/\/ please don't do this<\/code><\/pre>\n<\/li>\n<li>Use <code>JsonSerializer<\/code> to deserialize a <code>JsonElement<\/code>. This is a simple and reasonable one-liner, but it does invoke the <code>JsonSerializer<\/code> machinery, which brings in more overhead.\n<pre><code class=\"language-csharp\">return JsonSerializer.Deserialize&lt;JsonElement&gt;(json);<\/code><\/pre>\n<\/li>\n<\/ol>\n<p>Now in .NET 10, there&#8217;s a fourth option:<\/p>\n<ul>\n<li>Use <code>JsonElement.Parse<\/code>. This is the right answer. 
Use this instead of (1), (2), or (3).\n<pre><code class=\"language-csharp\">\/\/  dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private const string JsonString = \"\"\"{ \"name\": \"John\", \"age\": 30, \"city\": \"New York\" }\"\"\";\r\n\r\n    [Benchmark]\r\n    public JsonElement WithClone()\r\n    {\r\n        using JsonDocument d = JsonDocument.Parse(JsonString);\r\n        return d.RootElement.Clone();\r\n    }\r\n\r\n    [Benchmark]\r\n    public JsonElement WithoutClone() =&gt;\r\n        JsonDocument.Parse(JsonString).RootElement; \/\/ please don't do this in production code\r\n\r\n    [Benchmark]\r\n    public JsonElement WithDeserialize() =&gt;\r\n        JsonSerializer.Deserialize&lt;JsonElement&gt;(JsonString);\r\n\r\n    [Benchmark]\r\n    public JsonElement WithParse() =&gt;\r\n        JsonElement.Parse(JsonString);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithClone<\/td>\n<td style=\"text-align: right;\">303.7 ns<\/td>\n<td style=\"text-align: right;\">344 B<\/td>\n<\/tr>\n<tr>\n<td>WithoutClone<\/td>\n<td style=\"text-align: right;\">249.6 ns<\/td>\n<td style=\"text-align: right;\">312 B<\/td>\n<\/tr>\n<tr>\n<td>WithDeserialize<\/td>\n<td style=\"text-align: right;\">397.3 ns<\/td>\n<td style=\"text-align: right;\">272 B<\/td>\n<\/tr>\n<tr>\n<td>WithParse<\/td>\n<td style=\"text-align: right;\">261.9 ns<\/td>\n<td style=\"text-align: right;\">272 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<\/ul>\n<p>With JSON being used as an encoding for many modern 
protocols, streaming large JSON payloads has become very common. And for most use cases, it&#8217;s already possible to stream JSON well with <code>System.Text.Json<\/code>. However, in previous releases there wasn&#8217;t a good way to stream partial string properties; string properties had to have their values written in one operation. If you&#8217;ve got small strings, that&#8217;s fine. If you&#8217;ve got really, really large strings, and those strings are lazily-produced in chunks, however, you ideally want the ability to write those chunks of the property as you have them, rather than needing to buffer up the value in its entirety. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101356\">dotnet\/runtime#101356<\/a> augmented <code>Utf8JsonWriter<\/code> with a <code>WriteStringValueSegment<\/code> method, which enables such partial writes. That addresses the majority case; however, there&#8217;s a very common case where additional encoding of the value is desirable, and an API that automatically handles that encoding helps make it both efficient and easy. These modern protocols often transmit large blobs of binary data within the JSON payloads. Typically, these blobs end up being Base64 strings as properties on some JSON object. Today, outputting such blobs requires Base64-encoding the whole input and then writing the resulting <code>byte<\/code>s or <code>char<\/code>s in their entirety into the <code>Utf8JsonWriter<\/code>. To address that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111041\">dotnet\/runtime#111041<\/a> adds a <code>WriteBase64StringSegment<\/code> method to <code>Utf8JsonWriter<\/code>. 
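<\/p>\n<p>Before getting to the Base64 case, here&#8217;s a minimal sketch of the plain-string <code>WriteStringValueSegment<\/code> pattern. The chunk contents and the <code>text<\/code> property name are made up for illustration, and the array of chunks stands in for data that would really be produced lazily; the shape to notice is that the property name is written once, each chunk is written with <code>isFinalSegment: false<\/code>, and a final (here empty) segment ends the value.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0\r\n\r\nusing System.Text;\r\nusing System.Text.Json;\r\n\r\nusing MemoryStream stream = new();\r\nusing (Utf8JsonWriter writer = new(stream))\r\n{\r\n    writer.WriteStartObject();\r\n    writer.WritePropertyName(\"text\");\r\n\r\n    \/\/ Stand-in for chunks that would be produced lazily in a real application.\r\n    string[] chunks = [\"Hello, \", \"streaming \", \"world\"];\r\n    foreach (string chunk in chunks)\r\n    {\r\n        writer.WriteStringValueSegment(chunk.AsSpan(), isFinalSegment: false);\r\n    }\r\n\r\n    \/\/ An empty final segment closes the string value.\r\n    writer.WriteStringValueSegment(ReadOnlySpan&lt;char&gt;.Empty, isFinalSegment: true);\r\n\r\n    writer.WriteEndObject();\r\n}\r\n\r\nstring json = Encoding.UTF8.GetString(stream.ToArray());\r\nConsole.WriteLine(json);<\/code><\/pre>\n<p>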
For those sufficiently motivated to reduce memory overheads, and to enable the streaming of such payloads, <code>WriteBase64StringSegment<\/code> enables passing in a span of bytes, which the implementation will Base64-encode and write to the JSON property; it can be called multiple times with <code>isFinalSegment=false<\/code>, such that the writer will continue appending the resulting Base64 data to the property, until it&#8217;s called with a final segment that ends the property. (<code>Utf8JsonWriter<\/code> has long had a <code>WriteBase64String<\/code> method, this new <code>WriteBase64StringSegment<\/code> simply enables it to be written in pieces.) The primary benefit of such a method is reduced latency and working set, as the entirety of the data payload needn&#8217;t be buffered before being written out, but we can still come up with a throughput benchmark that shows benefits:<\/p>\n<pre><code class=\"language-csharp\">\/\/  dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\nusing System.Text.Json;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private Utf8JsonWriter _writer = new(Stream.Null);\r\n    private Stream _source = new MemoryStream(Enumerable.Range(0, 10_000_000).Select(i =&gt; (byte)i).ToArray());\r\n\r\n    [Benchmark]\r\n    public async Task Buffered()\r\n    {\r\n        _source.Position = 0;\r\n        _writer.Reset();\r\n\r\n        byte[] buffer = ArrayPool&lt;byte&gt;.Shared.Rent(0x1000);\r\n\r\n        int totalBytes = 0;\r\n        int read;\r\n        while ((read = await _source.ReadAsync(buffer.AsMemory(totalBytes))) &gt; 0)\r\n        {\r\n            totalBytes += read;\r\n            if (totalBytes == buffer.Length)\r\n            {\r\n                byte[] newBuffer = 
ArrayPool&lt;byte&gt;.Shared.Rent(buffer.Length * 2);\r\n                Array.Copy(buffer, newBuffer, totalBytes);\r\n                ArrayPool&lt;byte&gt;.Shared.Return(buffer);\r\n                buffer = newBuffer;\r\n            }\r\n        }\r\n\r\n        _writer.WriteStartObject();\r\n        _writer.WriteBase64String(\"data\", buffer.AsSpan(0, totalBytes));\r\n        _writer.WriteEndObject();\r\n        await _writer.FlushAsync();\r\n\r\n        ArrayPool&lt;byte&gt;.Shared.Return(buffer);\r\n    }\r\n\r\n    [Benchmark]\r\n    public async Task Streaming()\r\n    {\r\n        _source.Position = 0;\r\n        _writer.Reset();\r\n\r\n        byte[] buffer = ArrayPool&lt;byte&gt;.Shared.Rent(0x1000);\r\n\r\n        _writer.WriteStartObject();\r\n        _writer.WritePropertyName(\"data\");\r\n        int read;\r\n        while ((read = await _source.ReadAsync(buffer)) &gt; 0)\r\n        {\r\n            _writer.WriteBase64StringSegment(buffer.AsSpan(0, read), isFinalSegment: false);\r\n        }\r\n        _writer.WriteBase64StringSegment(default, isFinalSegment: true);\r\n        _writer.WriteEndObject();\r\n        await _writer.FlushAsync();\r\n\r\n        ArrayPool&lt;byte&gt;.Shared.Return(buffer);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Buffered<\/td>\n<td style=\"text-align: right;\">3.925 ms<\/td>\n<\/tr>\n<tr>\n<td>Streaming<\/td>\n<td style=\"text-align: right;\">1.555 ms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>.NET 9 saw the introduction of the <code>JsonMarshal<\/code> class and the <code>GetRawUtf8Value<\/code> method, which provides raw access to the underlying bytes of property values fronted by a <code>JsonElement<\/code>. 
For situations where the name of the property is also needed, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107784\">dotnet\/runtime#107784<\/a> from <a href=\"https:\/\/github.com\/mwadams\">@mwadams<\/a> provides a corresponding <code>JsonMarshal.GetRawUtf8PropertyName<\/code> method.<\/p>\n<h2>Diagnostics<\/h2>\n<p>Over the years, I&#8217;ve seen a fair number of codebases introduce a <code>struct<\/code>-based <code>ValueStopwatch<\/code>; I think there are even a few still floating around the <code>Microsoft.Extensions<\/code> libraries. The premise behind these is that <code>System.Diagnostics.Stopwatch<\/code> is a <code>class<\/code>, but it simply wraps a <code>long<\/code> (a timestamp), so rather than writing code like the following that allocates:<\/p>\n<pre><code class=\"language-csharp\">Stopwatch sw = Stopwatch.StartNew();\r\n... \/\/ something being measured\r\nsw.Stop();\r\nTimeSpan elapsed = sw.Elapsed;<\/code><\/pre>\n<p>you could write:<\/p>\n<pre><code class=\"language-csharp\">ValueStopwatch sw = ValueStopwatch.StartNew();\r\n... \/\/ something being measured\r\nsw.Stop();\r\nTimeSpan elapsed = sw.Elapsed;<\/code><\/pre>\n<p>and avoid the allocation. <code>Stopwatch<\/code> subsequently gained helpers that make such a <code>ValueStopwatch<\/code> less appealing, since as of .NET 7, I can write it instead like this:<\/p>\n<pre><code class=\"language-csharp\">long start = Stopwatch.GetTimestamp();\r\n... \/\/ something being measured\r\nlong end = Stopwatch.GetTimestamp();\r\nTimeSpan elapsed = Stopwatch.GetElapsedTime(start, end);<\/code><\/pre>\n<p>However, that&#8217;s not quite as natural as the original example, that just uses <code>Stopwatch<\/code>. Wouldn&#8217;t it be nice if you could write the original example and have it executed as if it were the latter? With all the investments in .NET 9 and .NET 10 around escape analysis and stack allocation, you now can. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111834\">dotnet\/runtime#111834<\/a> streamlines the <code>Stopwatch<\/code> implementation so that <code>StartNew<\/code>, <code>Elapsed<\/code>, and <code>Stop<\/code> are fully inlineable. At that point, the JIT can see that the allocated <code>Stopwatch<\/code> instance never escapes the frame, and it can be stack allocated.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Diagnostics;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[DisassemblyDiagnoser]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public TimeSpan WithGetTimestamp()\r\n    {\r\n        long start = Stopwatch.GetTimestamp();\r\n        Nop();\r\n        long end = Stopwatch.GetTimestamp();\r\n\r\n        return Stopwatch.GetElapsedTime(start, end);\r\n    }\r\n\r\n    [Benchmark]\r\n    public TimeSpan WithStartNew()\r\n    {\r\n        Stopwatch sw = Stopwatch.StartNew();\r\n        Nop();\r\n        sw.Stop();\r\n\r\n        return sw.Elapsed;\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static void Nop() { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithGetTimestamp<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">28.95 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">148 
B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td>WithGetTimestamp<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">28.32 ns<\/td>\n<td style=\"text-align: right;\">0.98<\/td>\n<td style=\"text-align: right;\">130 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>WithStartNew<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">38.62 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">341 B<\/td>\n<td style=\"text-align: right;\">40 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WithStartNew<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">28.21 ns<\/td>\n<td style=\"text-align: right;\">0.73<\/td>\n<td style=\"text-align: right;\">130 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117031\">dotnet\/runtime#117031<\/a> is a nice improvement that helps reduce working set for anyone using an <code>EventSource<\/code> and that has events with really large IDs. For efficiency purposes, <code>EventSource<\/code> was using an array to map event ID to the data for that ID; lookup needs to be really fast, since the lookup is performed on every event write in order to look up the metadata for the event being written. In many <code>EventSource<\/code>s, the developer authors events with a small, contiguous range of IDs, and the array ends up being very dense. 
But if a developer authors any event with a really large ID (which we&#8217;ve seen happen in multiple real-world projects, due to splitting events into multiple partial class definitions shared between different projects and selecting IDs for each file unlikely to conflict with each other), an array is still created with a length to accommodate that large ID, which can result in a really big allocation that persists for the lifetime of the event source, and a lot of that allocation ends up just being wasted space. Thankfully, since <code>EventSource<\/code> was written years ago, the performance of <code>Dictionary&lt;TKey, TValue&gt;<\/code> has increased significantly, to the point where it&#8217;s able to efficiently handle the lookups without needing the event IDs to be dense. Note that there should really only ever be one instance of a given <code>EventSource<\/code>-derived type; the recommended pattern is to store one into a static readonly field and just use that one. So the overheads incurred as part of this are primarily about the impact that single large allocation has on working set for the duration of the process. To make it easier to demonstrate, though, I&#8217;m doing something you&#8217;d never, ever do, and creating a new instance per event. 
Don&#8217;t try this at home, or at least not in production.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Diagnostics;\r\nusing System.Diagnostics.Tracing;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private MyListener _listener = new();\r\n\r\n    [Benchmark]\r\n    public void Oops()\r\n    {\r\n        using OopsEventSource oops = new();\r\n        oops.Oops();\r\n    }\r\n\r\n    [EventSource(Name = \"MyTestEventSource\")]\r\n    public sealed class OopsEventSource : EventSource\r\n    {\r\n        [Event(12_345_678, Level = EventLevel.Error)]\r\n        public void Oops() =&gt; WriteEvent(12_345_678);\r\n    }\r\n\r\n    private sealed class MyListener : EventListener\r\n    {\r\n        protected override void OnEventSourceCreated(EventSource eventSource) =&gt; \r\n            EnableEvents(eventSource, EventLevel.Error);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Oops<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,876.21 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1157428.01 KB<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<\/tr>\n<tr>\n<td>Oops<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">22.06 us<\/td>\n<td style=\"text-align: right;\">0.01<\/td>\n<td style=\"text-align: right;\">19.21 KB<\/td>\n<td style=\"text-align: 
right;\">0.000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107333\">dotnet\/runtime#107333<\/a> from <a href=\"https:\/\/github.com\/AlgorithmsAreCool\">@AlgorithmsAreCool<\/a> reduces thread contention involved in starting and stopping an <code>Activity<\/code>. <code>ActivitySource<\/code> maintains a thread-safe list of listeners, which only changes on the rare occasion that a listener is registered or unregistered. Any time an <code>Activity<\/code> is created or destroyed (which can happen at very high frequency), each listener gets notified, which requires walking through the list of listeners. The previous code used a lock to protect that listeners list, and to avoid notifying the listener while holding the lock, the implementation would take the lock, determine the next listener, release the lock, notify the listener, and rinse and repeat until it had notified all listeners. This could result in significant contention, as multiple threads started and stopped <code>Activity<\/code>s. Now with this PR, the list switches to being an immutable array. Each time the list changes, a new array is created with the modified set of listeners. This makes the act of changing the listeners list much more expensive, but, as noted, that&#8217;s generally a rarity. And in exchange, notifying listeners becomes much cheaper.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117334\">dotnet\/runtime#117334<\/a> from <a href=\"https:\/\/github.com\/petrroll\">@petrroll<\/a> avoids the overheads of callers needing to interact with null loggers by excluding them in <code>LoggerFactory.CreateLoggers<\/code>, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117342\">dotnet\/runtime#117342<\/a> seals the <code>NullLogger<\/code> type so type checks against <code>NullLogger<\/code> (e.g. <code>if (logger is NullLogger)<\/code>) can be made more efficient by the JIT. 
And <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/7290\">dotnet\/roslyn-analyzers#<\/a> from <a href=\"https:\/\/github.com\/mpidash\">@mpidash<\/a> will help developers to realize that their logging operations aren&#8217;t as cheap as they thought they might be. Consider this code:<\/p>\n<pre><code class=\"language-csharp\">[LoggerMessage(Level = LogLevel.Information, Message = \"This happened: {Value}\")]\r\nprivate static partial void Oops(ILogger logger, string value);\r\n\r\npublic static void UnexpectedlyExpensive()\r\n{\r\n    Oops(NullLogger.Instance, $\"{Guid.NewGuid()} {DateTimeOffset.UtcNow}\");\r\n}<\/code><\/pre>\n<p>It&#8217;s using the logger source generator, which will emit an implementation dedicated to this log method, including a log level check so that it doesn&#8217;t pay the bulk of the costs associated with logging unless the associated level is enabled:<\/p>\n<pre><code class=\"language-csharp\">[global::System.CodeDom.Compiler.GeneratedCodeAttribute(\"Microsoft.Extensions.Logging.Generators\", \"6.0.5.2210\")]\r\nprivate static partial void Oops(global::Microsoft.Extensions.Logging.ILogger logger, global::System.String value)\r\n{\r\n    if (logger.IsEnabled(global::Microsoft.Extensions.Logging.LogLevel.Information))\r\n    {\r\n        __OopsCallback(logger, value, null);\r\n    }\r\n}<\/code><\/pre>\n<p>Except, the call site is doing non-trivial work, creating a new <code>Guid<\/code>, fetching the current time, and allocating a string via string interpolation, even though it might be wasted work if <code>LogLevel.Information<\/code> isn&#8217;t available. This CA1873 analyzer flags that:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2025\/09\/CA1873.png\" alt=\"Analyzer for expensive logging sites\" \/><\/p>\n<h2>Cryptography<\/h2>\n<p>A ton of effort went into cryptography in .NET 10, almost entirely focused on post\u2011quantum cryptography (PQC). 
PQC refers to a class of cryptographic algorithms designed to resist attacks from quantum computers, machines that could one day render classic cryptographic algorithms like Rivest\u2013Shamir\u2013Adleman (RSA) or Elliptic Curve Cryptography (ECC) insecure by efficiently solving problems such as integer factorization and discrete logarithms. With the looming threat of &#8220;harvest now, decrypt later&#8221; attacks (where a well-funded attacker idly captures encrypted internet traffic, expecting that they&#8217;ll be able to decrypt and read it later) and the multi-year process required to migrate critical infrastructure, the transition to quantum\u2011safe cryptographic standards has become an urgent priority. In this light, .NET 10 adds support for ML-DSA (a National Institute of Standards and Technology PQC digital signature algorithm), Composite ML-DSA (a draft Internet Engineering Task Force specification for creating signatures that combine ML-DSA with a classical crypto algorithm like RSA), SLH-DSA (another NIST PQC signature algorithm), and ML-KEM (a NIST PQC key encapsulation algorithm). This is an important step towards quantum-resistant security, enabling developers to begin experimenting with and planning for post-quantum identity and authenticity scenarios. While this PQC effort is not about performance, the design of the new types is very much focused on more modern sensibilities that have performance as a key motivator. While older types, like those that derive from <code>AsymmetricAlgorithm<\/code>, are designed around arrays, with support for spans tacked on later, the new types are designed with spans at the center, and with array-based APIs available only for convenience.<\/p>\n<p>There are, however, some cryptography-related changes in .NET 10 that are focused squarely on performance. One is around improving OpenSSL &#8220;digest&#8221; performance. 
.NET&#8217;s cryptography stack is built on top of the underlying platform&#8217;s native cryptographic libraries; on Linux, that means using OpenSSL, making it a hot path for common operations like hashing, signing, and TLS. &#8220;Digest algorithms&#8221; are the family of cryptographic hash functions (for example, SHA\u2011256, SHA\u2011512, SHA\u20113) that turn arbitrary input into fixed\u2011size fingerprints; they&#8217;re used all over the place, from verifying packages to TLS handshakes to content de-duplication. While .NET can use OpenSSL 1.x if that&#8217;s what&#8217;s offered by the OS, since .NET 6 it&#8217;s been focusing more and more on optimizing for and lighting-up with OpenSSL 3 (the previously-discussed PQC support requires OpenSSL 3.5 or later). With OpenSSL 1.x, OpenSSL exposed getter functions like <code>EVP_sha256()<\/code>, which were cheap functions that just returned a direct pointer to the <code>EVP_MD<\/code> for the relevant hash implementation. OpenSSL 3.x introduced a provider model, with a fetch function (<code>EVP_MD_fetch<\/code>) for retrieving the provider-backed implementation. To keep source compatibility, the 1.x-era getter functions were changed to return pointers to compatibility shims: when you pass one of these legacy <code>EVP_MD<\/code> pointers into operations like <code>EVP_DigestInit_ex<\/code>, OpenSSL performs an &#8220;implicit fetch&#8221; under the covers to resolve the actual implementation. That implicit fetch path adds extra work, on each use. Instead, OpenSSL recommends consumers do an explicit fetch and then cache the result for reuse. That&#8217;s what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118613\">dotnet\/runtime#118613<\/a> does. 
The result is leaner and faster cryptographic hash operations on OpenSSL\u2011based platforms.<\/p>\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _src = new byte[1024];\r\n    private byte[] _dst = new byte[SHA256.HashSizeInBytes];\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; new Random(42).NextBytes(_src);\r\n\r\n    [Benchmark]\r\n    public void Hash() =&gt; SHA256.HashData(_src, _dst);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Hash<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">1,206.8 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Hash<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">960.6 ns<\/td>\n<td style=\"text-align: right;\">0.80<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A few other performance niceties have also found their way in.<\/p>\n<ul>\n<li><strong><code>AsnWriter.Encode<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106728\">dotnet\/runtime#106728<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112638\">dotnet\/runtime#112638<\/a> add and then use throughout the crypto stack a callback-based mechanism to <code>AsnWriter<\/code> that enables encoding without forced allocation for the temporary encoded state.<\/li>\n<li><strong><code>SafeHandle<\/code> singleton<\/strong>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109391\">dotnet\/runtime#109391<\/a> employs a singleton <code>SafeHandle<\/code> in more places in <code>X509Certificate<\/code> to avoid temporary handle allocation.<\/li>\n<li><strong>Span-based <code>ProtectedData<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109529\">dotnet\/runtime#109529<\/a> from <a href=\"https:\/\/github.com\/ChadNedzlek\">@ChadNedzlek<\/a> adds <code>Span&lt;byte&gt;<\/code>-based overloads to the <code>ProtectedData<\/code> class that enable protecting data without requiring the source or destinations to be in allocated arrays.<\/li>\n<li><strong><code>PemEncoding<\/code> UTF-8<\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109438\">dotnet\/runtime#109438<\/a> adds UTF-8 support to <code>PemEncoding<\/code>. <code>PemEncoding<\/code>, a utility class for parsing and formatting PEM (Privacy-Enhanced Mail)-encoded data such as that used in certificates and keys, previously worked only with <code>char<\/code>s. As was then done in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109564\">dotnet\/runtime#109564<\/a>, this change makes it possible to parse UTF8 data directly without first needing to transcode to UTF16.<\/li>\n<li><strong><code>FindByThumbprint<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/109130\">dotnet\/runtime#109130<\/a> adds an <code>X509Certificate2Collection.FindByThumbprint<\/code> method. The implementation uses a stack-based buffer for the thumbprint value for each candidate certificate, eliminating the arrays that would otherwise be created in a naive manual implementation. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113606\">dotnet\/runtime#113606<\/a> then utilized this in <code>SslStream<\/code>.<\/li>\n<li><strong><code>SetKey<\/code><\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113146\">dotnet\/runtime#113146<\/a> adds a span-based <code>SymmetricAlgorithm.SetKey<\/code> method which can then be used to avoid creating unnecessary arrays.<\/li>\n<\/ul>\n<h2>Peanut Butter<\/h2>\n<p>As in every .NET release, there are a large number of PRs that help with performance in some fashion. The more of these that are addressed, the more the overall overhead for applications and services is lowered. Here are a smattering from this release:<\/p>\n<ul>\n<li><strong>GC<\/strong>. DATAS (Dynamic Adaptation To Application Sizes) was introduced in .NET 8 and enabled by default in .NET 9. Now in .NET 10, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105545\">dotnet\/runtime#105545<\/a> tuned DATAS to improve its overall behavior, cutting unnecessary work, smoothing out pauses (especially under high allocation rates), correcting fragmentation accounting that could cause extra short collections (gen1), and other such tweaks. The net result is fewer unnecessary collections, steadier throughput, and more predictable latency for allocation-heavy workloads. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118762\">dotnet\/runtime#118762<\/a> also adds several knobs for configuring how DATAS behaves, and in particular settings to fine-tune how Gen0 grows.<\/li>\n<li><strong>GCHandle<\/strong>. The GC supports various types of &#8220;handles&#8221; that allow for explicit management of resources in relation to GC operation. For example, you can create a &#8220;pinning handle,&#8221; which ensures that the GC will not move the object in question. 
Historically, these handles were surfaced to developers via the <code>GCHandle<\/code> type, but it has a variety of issues, including that it&#8217;s really easy to misuse due to lack of strong typing. To help address that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111307\">dotnet\/runtime#111307<\/a> introduces a few new strongly-typed flavors of handles, with <code>GCHandle&lt;T&gt;<\/code>, <code>PinnedGCHandle&lt;T&gt;<\/code>, and <code>WeakGCHandle&lt;T&gt;<\/code>. These should not only address some of the usability issues, they&#8217;re also able to shave off a bit of the overheads incurred by the old design.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net10.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.InteropServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private byte[] _array = new byte[16];\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void Old() =&gt; GCHandle.Alloc(_array, GCHandleType.Pinned).Free();\r\n\r\n    [Benchmark]\r\n    public void New() =&gt; new PinnedGCHandle&lt;byte[]&gt;(_array).Dispose();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">27.80 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">22.73 ns<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><strong>Mono interpreter<\/strong>. 
The mono interpreter gained optimized support for several opcodes, including switches (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107423\">dotnet\/runtime#107423<\/a>), new arrays (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107430\">dotnet\/runtime#107430<\/a>), and memory barriers (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107325\">dotnet\/runtime#107325<\/a>). But arguably more impactful was a series of more than a dozen PRs that enabled the interpreter to vectorize more operations with WebAssembly (Wasm). This included contributions like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114669\">dotnet\/runtime#114669<\/a>, which enabled vectorization of shift operations, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/113743\">dotnet\/runtime#113743<\/a>, which enabled vectorization of a plethora of operations like <code>Abs<\/code>, <code>Divide<\/code>, and <code>Truncate<\/code>. Other PRs used the Wasm-specific intrinsic APIs in more places, in order to accelerate on Wasm routines that were already accelerated on other architectures using architecture-specific intrinsics, e.g. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115062\">dotnet\/runtime#115062<\/a> used <code>PackedSimd<\/code> in the workhorse methods behind the hex conversion routines on <code>Convert<\/code>, like <code>Convert.FromHexString<\/code>.<\/li>\n<li><strong>FCALLs<\/strong>. There are many places in the lower-layers of <code>System.Private.CoreLib<\/code> where managed code needs to call into native code in the runtime. There are two primary ways this transition from managed to native has happened, historically. One method is through what&#8217;s called a &#8220;QCALL&#8221;, essentially just a DllImport (P\/Invoke) into native functions exposed by the runtime. 
The other, which historically was the dominant mechanism, is an &#8220;FCALL,&#8221; which is a more complex and specialized pathway that allows direct access to managed objects from native code. FCALLs were once the standard, but over time, more of them were converted to QCALLs. This shift improves reliability (since FCALLs are notoriously tricky to implement correctly) and can also boost performance, as FCALLs require helper method frames, which QCALLs can often avoid. A ton of PRs in .NET 10 went into removing FCALLs, like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107218\">dotnet\/runtime#107218<\/a> for helper method frames in <code>Exception<\/code>, <code>GC<\/code>, and <code>Thread<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106497\">dotnet\/runtime#106497<\/a> for helper method frames in <code>object<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107152\">dotnet\/runtime#107152<\/a> for those used in connecting to profilers, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108415\">dotnet\/runtime#108415<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/108535\">dotnet\/runtime#108535<\/a> for ones in reflection, and over a dozen others. In the end, all FCALLs that touched managed memory or threw exceptions were removed.<\/li>\n<li><strong>Converting hex.<\/strong> Recent .NET releases added methods to <code>Convert<\/code> like <code>FromHexString<\/code> and <code>TryToHexStringLower<\/code>, but such methods all used UTF16. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/117965\">dotnet\/runtime#117965<\/a> adds overloads of these that work with UTF8 bytes.<\/li>\n<li><strong>Formatting.<\/strong> String interpolation is backed by &#8220;interpolated string handlers.&#8221; When you interpolate with a string target type, by default you get the <code>DefaultInterpolatedStringHandler<\/code> that comes from <code>System.Runtime.CompilerServices<\/code>. 
That implementation is able to use stack-allocated memory and the <code>ArrayPool&lt;&gt;<\/code> for reduced allocation overheads as it&#8217;s buffering up text formatted to it. While very advanced, other code, including other interpolated string handlers, can use <code>DefaultInterpolatedStringHandler<\/code> as an implementation detail. However, when doing so, such code only could get access to the final output as a <code>string<\/code>; the underlying buffer wasn&#8217;t exposed. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112171\">dotnet\/runtime#112171<\/a> adds a <code>Text<\/code> property to <code>DefaultInterpolatedStringHandler<\/code> for code that wants access to the already formatted text in a <code>ReadOnlySpan&lt;char&gt;<\/code>.<\/li>\n<li><strong>Enumeration-related allocations.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/118288\">dotnet\/runtime#118288<\/a> removes a handful of allocations related to enumeration, for example removing a <code>string.Split<\/code> call in <code>EnumConverter<\/code> and replacing it with a <code>MemoryExtensions.Split<\/code> call that doesn&#8217;t need to allocate either the <code>string[]<\/code> or the individual <code>string<\/code> instances.<\/li>\n<li><strong>NRBF decoding.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/107797\">dotnet\/runtime#107797<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a> removes an array allocation used in a <code>decimal<\/code> constructor call, replacing it instead with a collection expression targeting a span, which will result in the state being stack allocated.<\/li>\n<li><strong>TypeConverter allocations.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/111349\">dotnet\/runtime#111349<\/a> from <a href=\"https:\/\/github.com\/AlexRadch\">@AlexRadch<\/a> reduces some parsing overheads in the <code>TypeConverter<\/code>s for <code>Size<\/code>, <code>SizeF<\/code>, 
<code>Point<\/code>, and <code>Rectangle<\/code> by using more modern APIs and constructs, such as the span-based <code>Split<\/code> method and string interpolation.<\/li>\n<li><strong>Generic math conversions.<\/strong> Most of the <code>TryConvertXx<\/code> methods using the various primitives&#8217; implementations of the generic math interfaces are marked as <code>MethodImplOptions.AggressiveInlining<\/code>, to help the JIT realize they should always be inlined, but a few stragglers were left out. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/112061\">dotnet\/runtime#112061<\/a> from <a href=\"https:\/\/github.com\/hez2010\">@hez2010<\/a> fixes that.<\/li>\n<li><strong>ThrowIfNull.<\/strong> C# 14 now supports the ability to write extension static methods. This is a huge boon for libraries that need to support downlevel targeting, as it means static methods can be polyfilled just as instance methods can be. There are many libraries in .NET that build not only for the latest runtimes but also for .NET Standard 2.0 and .NET Framework, and those libraries have been unable to use helper static methods like <code>ArgumentNullException.ThrowIfNull<\/code>, which can help to streamline call sites and make methods more inlineable (in addition, of course, to tidying up the code). Now that the dotnet\/runtime repo builds with a C# 14 compiler, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114644\">dotnet\/runtime#114644<\/a> replaced ~2500 call sites in such libraries with use of a <code>ThrowIfNull<\/code> polyfill.<\/li>\n<li><strong>FileProvider Change Tokens<\/strong>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/116175\">dotnet\/runtime#116175<\/a> reduces allocation in <code>PollingWildCardChangeToken<\/code> by using allocation-free mechanisms for computing hashes, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/115684\">dotnet\/runtime#115684<\/a> from <a href=\"https:\/\/github.com\/rameel\">@rameel<\/a> reduces allocation in <code>CompositeFileProvider<\/code> by avoiding taking up space for nop <code>NullChangeToken<\/code>s.<\/li>\n<li><strong>String interpolation.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/114497\">dotnet\/runtime#114497<\/a> removes a variety of null checks when dealing with nullable inputs, shaving off some overheads of the interpolation operation.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private string _value = \" \";\r\n\r\n    [Benchmark]\r\n    public string Interpolate() =&gt; $\"{_value} {_value} {_value} {_value}\";\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Interpolate<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">34.21 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Interpolate<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">29.47 ns<\/td>\n<td style=\"text-align: right;\">0.86<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<li><strong><code>AssemblyQualifiedName<\/code><\/strong>. <code>Type.AssemblyQualifiedName<\/code> previously recomputed the result on every access. 
As of <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/118389\">dotnet\/runtime#118389<\/a>, it&#8217;s now cached.\n<pre><code class=\"language-csharp\">\/\/ dotnet run -c Release -f net9.0 --filter \"*\" --runtimes net9.0 net10.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [Benchmark]\r\n    public string AQN() =&gt; typeof(Dictionary&lt;int, string&gt;).AssemblyQualifiedName!;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<th style=\"text-align: right;\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AQN<\/td>\n<td>.NET 9.0<\/td>\n<td style=\"text-align: right;\">132.345 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">712 B<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AQN<\/td>\n<td>.NET 10.0<\/td>\n<td style=\"text-align: right;\">1.218 ns<\/td>\n<td style=\"text-align: right;\">0.009<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/li>\n<\/ul>\n<h2>What&#8217;s Next?<\/h2>\n<p>Whew! After all of that, I hope you&#8217;re as excited as I am about .NET 10, and more generally, about the future of .NET.<\/p>\n<p>As you&#8217;ve seen in this tour (and in those for previous releases), the story of .NET performance is one of relentless iteration, systemic thinking, and the compounding effect of many targeted improvements. 
While I&#8217;ve highlighted micro-benchmarks to show specific gains, the real story isn&#8217;t about these benchmarks&#8230; it&#8217;s about making real-world applications more responsive, more scalable, more sustainable, more economical, and ultimately, more enjoyable to build and use. Whether you&#8217;re shipping high-throughput services, interactive desktop apps, or resource-constrained mobile experiences, .NET 10 offers tangible performance benefits to you and your users.<\/p>\n<p>The best way to appreciate these improvements is to try <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/10.0\">.NET 10 RC1<\/a> yourself. Download it, run your workloads, measure the impact, and share your experiences. See awesome gains? Find a regression that needs fixing? Spot an opportunity for further improvement? Shout it out, open an issue, even send a PR. Every bit of feedback helps make .NET better, and we look forward to continuing to build with you.<\/p>\n<p>Happy coding!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Take a tour through hundreds of performance improvements in .NET 10.<\/p>\n","protected":false},"author":360,"featured_media":57953,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,3009],"tags":[4,8082,108],"class_list":["post-57952","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-performance","tag-net","tag-dotnetperf","tag-performance"],"acf":[],"blog_post_summary":"<p>Take a tour through hundreds of performance improvements in .NET 
10.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/57952","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/360"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=57952"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/57952\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/57953"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=57952"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=57952"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=57952"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}