{"id":47452,"date":"2023-09-13T05:05:00","date_gmt":"2023-09-13T12:05:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=47452"},"modified":"2025-10-29T11:10:50","modified_gmt":"2025-10-29T18:10:50","slug":"performance-improvements-in-net-8","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/","title":{"rendered":"Performance Improvements in .NET 8"},"content":{"rendered":"<p>I look forward to summer every year. Sun, beach, warm nights, and putting the finishing touches on the next version of .NET. It&#8217;s also the time I get to continue a tradition I started for myself back in 2017 of writing about the performance improvements that have gone into the latest .NET incarnation. A year ago that was <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/\">Performance Improvements in .NET 7<\/a>, which followed similar posts for <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-6\">.NET 6<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-5\">.NET 5<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-3-0\">.NET Core 3.0<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-2-1\">.NET Core 2.1<\/a>, and <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core\">.NET Core 2.0<\/a>.<\/p>\n<p>Since <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/\">.NET 7<\/a> was released a year ago, you&#8217;ve likely been inundated with news about AI and ChatGPT. You may have even invested effort in using AI and\/or <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/demystifying-retrieval-augmented-generation-with-dotnet\/\">incorporated it into your applications<\/a>. And I got to wondering: could AI write this blog post for me? 
So I issued the following request to <a href=\"http:\/\/bing.com\/chat\">Bing Chat<\/a>:<\/p>\n<blockquote><p>\n&#8220;You are an expert .NET developer with a penchant for pedagogy and a love of details. Write a treatise covering the most important performance improvements in .NET 8. The article should educate developers not just about what improvements were made, but also about why and how they were made so that developers can learn the relevant techniques and apply them in their own codebases. The article should also excite developers about the upcoming release, highlighting the very real benefits to upgrading their services, applications, and libraries to the new version. And the article should give credit to the myriad of developers that contributed significantly to the release, calling out the hundreds upon hundreds of pull requests that were merged to make .NET 8 the fastest .NET to date. The article should follow the style of the post about .NET 7 performance at https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/. Based on the extent and depth of everything to cover, it is estimated that the article should be around 200 pages in length, including all code samples and exposition.&#8221;\n<\/p><\/blockquote>\n<p>And the response?<\/p>\n<blockquote>\n<p>&#8220;I&#8217;m sorry, but I can\u2019t write a 200-page treatise for you. That\u2019s beyond the scope of my capabilities and my chat settings. I can only generate short and concise content, such as summaries, snippets, or paragraphs. If you want to learn more about the performance improvements in .NET 8, you can check out some of the web search results I found for you. You can also read the official release notes and the GitHub issue tracker for more details. I hope you understand and appreciate my limitations. Thank you for your interest in .NET 8. 
\ud83d\ude0a&#8221;<\/p>\n<\/blockquote>\n<p>Fair enough (especially since, after writing this post and measuring the token count, it&#8217;s north of 300,000). And I&#8217;m happy with that outcome, because it means I still get to have the fun of writing this myself.<\/p>\n<p>Throughout the past year, as I was reviewing PRs in various .NET repos, I maintained a list of all the PRs that I might want to cover in this post, which is focused on the core runtime and libraries (<a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-aspnet-core-8\/\">Performance Improvements in ASP.NET Core 8<\/a> provides an in-depth focus on ASP.NET). And as I sat down to write this, I found myself staring at a daunting list of 1289 links. This post can&#8217;t cover all of them, but it does take a tour through more than 500 PRs, all of which have gone into making .NET 8 an irresistible release, one I hope you&#8217;ll all upgrade to as soon as humanly possible.<\/p>\n<p>.NET 7 was super fast. 
.NET 8 is faster.<\/p>\n<h2>Table of Contents<\/h2>\n<ul>\n<li><a href=\"#benchmarking-setup\">Benchmarking Setup<\/a><\/li>\n<li><a href=\"#jit\">JIT<\/a>\n<ul>\n<li><a href=\"#tiering-and-dynamic-pgo\">Tiering and Dynamic PGO<\/a><\/li>\n<li><a href=\"#vectorization\">Vectorization<\/a><\/li>\n<li><a href=\"#branching\">Branching<\/a><\/li>\n<li><a href=\"#bounds-checking\">Bounds Checking<\/a><\/li>\n<li><a href=\"#constant-folding\">Constant Folding<\/a><\/li>\n<li><a href=\"#non-gc-heap\">Non-GC Heap<\/a><\/li>\n<li><a href=\"#zeroing\">Zeroing<\/a><\/li>\n<li><a href=\"#value-types\">Value Types<\/a><\/li>\n<li><a href=\"#casting\">Casting<\/a><\/li>\n<li><a href=\"#peephole-optimizations\">Peephole Optimizations<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#native-aot\">Native AOT<\/a><\/li>\n<li><a href=\"#vm\">VM<\/a><\/li>\n<li><a href=\"#gc\">GC<\/a><\/li>\n<li><a href=\"#mono\">Mono<\/a><\/li>\n<li><a href=\"#threading\">Threading<\/a>\n<ul>\n<li><a href=\"#threadstatic\">[ThreadStatic]<\/a><\/li>\n<li><a href=\"#threadpool\">ThreadPool<\/a><\/li>\n<li><a href=\"#tasks\">Tasks<\/a><\/li>\n<li><a href=\"#parallel\">Parallel<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#reflection\">Reflection<\/a><\/li>\n<li><a href=\"#exceptions\">Exceptions<\/a><\/li>\n<li><a href=\"#primitives\">Primitives<\/a>\n<ul>\n<li><a href=\"#enums\">Enums<\/a><\/li>\n<li><a href=\"#numbers\">Numbers<\/a><\/li>\n<li><a href=\"#datetime\">DateTime<\/a><\/li>\n<li><a href=\"#guid\">Guid<\/a><\/li>\n<li><a href=\"#random\">Random<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#strings-arrays-and-spans\">Strings, Arrays, and Spans<\/a>\n<ul>\n<li><a href=\"#utf8\">UTF8<\/a><\/li>\n<li><a href=\"#ascii\">ASCII<\/a><\/li>\n<li><a href=\"#base64\">Base64<\/a><\/li>\n<li><a href=\"#hex\">Hex<\/a><\/li>\n<li><a href=\"#string-formatting\">String Formatting<\/a><\/li>\n<li><a href=\"#spans\">Spans<\/a><\/li>\n<li><a href=\"#searchvalues\">SearchValues<\/a><\/li>\n<li><a 
href=\"#regex\">Regex<\/a><\/li>\n<li><a href=\"#hashing\">Hashing<\/a><\/li>\n<li><a href=\"#initialization\">Initialization<\/a><\/li>\n<li><a href=\"#analyzers\">Analyzers<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#collections\">Collections<\/a>\n<ul>\n<li><a href=\"#general\">General<\/a><\/li>\n<li><a href=\"#list\">List<\/a><\/li>\n<li><a href=\"#linq\">LINQ<\/a><\/li>\n<li><a href=\"#dictionary\">Dictionary<\/a><\/li>\n<li><a href=\"#frozen-collections\">Frozen Collections<\/a><\/li>\n<li><a href=\"#immutable-collections\">Immutable Collections<\/a><\/li>\n<li><a href=\"#bitarray\">BitArray<\/a><\/li>\n<li><a href=\"#collection-expressions\">Collection Expressions<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#file-i-o\">File I\/O<\/a><\/li>\n<li><a href=\"#networking\">Networking<\/a>\n<ul>\n<li><a href=\"#networking-primitives\">Networking Primitives<\/a><\/li>\n<li><a href=\"#sockets\">Sockets<\/a><\/li>\n<li><a href=\"#tls\">TLS<\/a><\/li>\n<li><a href=\"#http\">HTTP<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"#json\">JSON<\/a><\/li>\n<li><a href=\"#cryptography\">Cryptography<\/a><\/li>\n<li><a href=\"#logging\">Logging<\/a><\/li>\n<li><a href=\"#configuration\">Configuration<\/a><\/li>\n<li><a href=\"#peanut-butter\">Peanut Butter<\/a><\/li>\n<li><a href=\"#whats-next\">What&#8217;s Next?<\/a><\/li>\n<\/ul>\n<h2>Benchmarking Setup<\/h2>\n<p>Throughout this post, I include microbenchmarks to highlight various aspects of the improvements being discussed. 
Most of those benchmarks are implemented using <a href=\"https:\/\/benchmarkdotnet.org\/\">BenchmarkDotNet<\/a> <a href=\"https:\/\/www.nuget.org\/packages\/BenchmarkDotNet\/0.13.8\">v0.13.8<\/a>, and, unless otherwise noted, there is a simple setup for each of these benchmarks.<\/p>\n<p>To follow along, first make sure you have <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/7.0?wt.mc_id=net8perf\">.NET 7<\/a> and <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/8.0?wt.mc_id=net8perf\">.NET 8<\/a> installed. For this post, I&#8217;ve used the .NET 8 Release Candidate (8.0.0-rc.1.23419.4).<\/p>\n<p>With those prerequisites taken care of, create a new C# project in a new <code>benchmarks<\/code> directory:<\/p>\n<pre><code class=\"language-sh\">dotnet new console -o benchmarks\r\ncd benchmarks<\/code><\/pre>\n<p>That directory will contain two files: <code>benchmarks.csproj<\/code> (the project file with information about how the application should be built) and <code>Program.cs<\/code> (the code for the application). 
Replace the entire contents of <code>benchmarks.csproj<\/code> with this:<\/p>\n<pre><code class=\"language-xml\">&lt;Project Sdk=\"Microsoft.NET.Sdk\"&gt;\r\n\r\n  &lt;PropertyGroup&gt;\r\n    &lt;OutputType&gt;Exe&lt;\/OutputType&gt;\r\n    &lt;TargetFrameworks&gt;net8.0;net7.0&lt;\/TargetFrameworks&gt;\r\n    &lt;LangVersion&gt;Preview&lt;\/LangVersion&gt;\r\n    &lt;ImplicitUsings&gt;enable&lt;\/ImplicitUsings&gt;\r\n    &lt;AllowUnsafeBlocks&gt;true&lt;\/AllowUnsafeBlocks&gt;\r\n    &lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt;\r\n  &lt;\/PropertyGroup&gt;\r\n\r\n  &lt;ItemGroup&gt;\r\n    &lt;PackageReference Include=\"BenchmarkDotNet\" Version=\"0.13.8\" \/&gt;\r\n  &lt;\/ItemGroup&gt;\r\n\r\n&lt;\/Project&gt;<\/code><\/pre>\n<p>The preceding project file tells the build system we want:<\/p>\n<ul>\n<li>to build a runnable application (as opposed to a library),<\/li>\n<li>to be able to run on both .NET 8 and .NET 7 (so that BenchmarkDotNet can run multiple processes, one with .NET 7 and one with .NET 8, in order to be able to compare the results),<\/li>\n<li>to be able to use all of the latest features from the C# language even though C# 12 hasn&#8217;t officially shipped yet,<\/li>\n<li>to automatically import common namespaces,<\/li>\n<li>to be able to use the <code>unsafe<\/code> keyword in the code,<\/li>\n<li>and to configure the garbage collector (GC) into its &#8220;server&#8221; configuration, which impacts the tradeoffs it makes between memory consumption and throughput (this isn&#8217;t strictly necessary, I&#8217;m just in the habit of using it, and it&#8217;s the default for ASP.NET apps.)<\/li>\n<\/ul>\n<p>The <code>&lt;PackageReference\/&gt;<\/code> at the end pulls in BenchmarkDotNet from <a href=\"https:\/\/www.nuget.org\/\">NuGet<\/a> so that we&#8217;re able to use the library in <code>Program.cs<\/code>. 
(A handful of benchmarks require additional packages be added; I&#8217;ve noted those where applicable.)<\/p>\n<p>For each benchmark, I&#8217;ve then included the full <code>Program.cs<\/code> source; just copy and paste that code into <code>Program.cs<\/code>, replacing its entire contents. In each test, you&#8217;ll notice several attributes may be applied to the <code>Tests<\/code> class. The <code>[MemoryDiagnoser]<\/code> attribute indicates I want it to track managed allocation, the <code>[DisassemblyDiagnoser]<\/code> attribute indicates I want it to report on the actual assembly code generated for the test (and by default one level deep of functions invoked by the test), and the <code>[HideColumns]<\/code> attribute simply suppresses some columns of data BenchmarkDotNet might otherwise emit by default but are unnecessary for our purposes here.<\/p>\n<p>Running the benchmarks is then straightforward. Each shown test also includes a comment at the beginning for the <code>dotnet<\/code> command to run the benchmark. Typically, it&#8217;s something like this:<\/p>\n<pre><code class=\"language-sh\">dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0<\/code><\/pre>\n<p>The preceding <code>dotnet run<\/code> command:<\/p>\n<ul>\n<li>builds the benchmarks in a Release build. This is important for performance testing, as most optimizations are disabled in Debug builds, in both the C# compiler and the JIT compiler.<\/li>\n<li>targets .NET 7 for the host project. In general with BenchmarkDotNet, you want to target the lowest-common denominator of all runtimes you&#8217;ll be executing against, so as to ensure that all of the APIs being used are available everywhere they&#8217;re needed.<\/li>\n<li>runs all of the benchmarks in the whole program. 
The <code>--filter<\/code> argument can be refined to scope down to just a subset of benchmarks desired, but <code>\"*\"<\/code> says &#8220;run &#8217;em all.&#8221;<\/li>\n<li>runs the tests on both .NET 7 and .NET 8.<\/li>\n<\/ul>\n<p>Throughout the post, I&#8217;ve shown many benchmarks and the results I received from running them. All of the code works well on all supported operating systems and architectures. Unless otherwise stated, the results shown for benchmarks are from running them on Linux (Ubuntu 22.04) on an x64 processor (the one bulk exception to this is when I&#8217;ve used <code>[DisassemblyDiagnoser]<\/code> to show assembly code, in which case I&#8217;ve run them on Windows 11 due to a sporadic issue on Unix with <code>[DisassemblyDiagnoser]<\/code> on .NET 7 not always producing the requested assembly). My standard caveat: these are <em>microbenchmarks<\/em>, often measuring operations that take very short periods of time, but where improvements to those times add up to be impactful when executed over and over and over. Different hardware, different operating systems, what else is running on your machine, your current mood, and what you ate for breakfast can all affect the numbers involved. In short, don&#8217;t expect the numbers you see to match exactly the numbers I report here, though I have chosen examples where the <em>magnitude<\/em> of differences cited is expected to be fully repeatable.<\/p>\n<p>With all that out of the way, let&#8217;s dive in&#8230;<\/p>\n<h2>JIT<\/h2>\n<p>Code generation permeates every single line of code we write, and it&#8217;s critical to the end-to-end performance of applications that the compiler doing that code generation achieves high code quality. In .NET, that&#8217;s the job of the Just-In-Time (JIT) compiler, which is used both &#8220;just in time&#8221; as an application executes as well as in Ahead-Of-Time (AOT) scenarios as the workhorse to perform the codegen at build-time. 
Every release of .NET has seen significant improvements in the JIT, and .NET 8 is no exception. In fact, I dare say the improvements in .NET 8 in the JIT are an incredible leap beyond what was achieved in the past, in large part due to dynamic PGO&#8230;<\/p>\n<h3>Tiering and Dynamic PGO<\/h3>\n<p>To understand dynamic PGO, we first need to understand &#8220;tiering.&#8221; For many years, a .NET method was only ever compiled once: on first invocation of the method, the JIT would kick in to generate code for that method, and then that invocation and every subsequent one would use that generated code. It was a simple time, but also one fraught with conflict&#8230; in particular, a conflict between how much the JIT should invest in code quality for the method and how much benefit would be gained from that enhanced code quality. Optimization is one of the most expensive things a compiler does; a compiler can spend an untold amount of time searching for additional ways to shave off an instruction here or improve the instruction sequence there. But none of us has an infinite amount of time to wait for the compiler to finish, especially in a &#8220;just in time&#8221; scenario where the compilation is happening as the application is running. As such, in a world where a method is compiled once for that process, the JIT has to either pessimize code quality or pessimize how long it takes to run, which means a tradeoff between steady-state throughput and startup time.<\/p>\n<p>As it turns out, however, the vast majority of methods invoked in an application are only ever invoked once or a small number of times. Spending a lot of time optimizing such methods would actually be a deoptimization, as likely it would take much more time to optimize them than those optimizations would gain. So, .NET Core 3.0 introduced a new feature of the JIT known as &#8220;tiered compilation.&#8221; With tiering, a method could end up being compiled multiple times. 
On first invocation, the method would be compiled in &#8220;tier 0,&#8221; in which the JIT prioritizes speed of compilation over code quality; in fact, the mode the JIT uses is often referred to as &#8220;min opts,&#8221; or minimal optimization, because it does as little optimization as it can muster (it still maintains a few optimizations, primarily the ones that result in less code to be compiled such that the JIT actually runs faster). In addition to minimizing optimizations, however, it also employs call counting &#8220;stubs&#8221;; when you invoke the method, the call goes through a little piece of code (the stub) that counts how many times the method was invoked, and once that count crosses a predetermined threshold (e.g. 30 calls), the method gets queued for re-compilation, this time at &#8220;tier 1,&#8221; in which the JIT throws every optimization it&#8217;s capable of at the method. Only a small subset of methods make it to tier 1, and those that do are the ones worthy of additional investment in code quality. Interestingly, there are things the JIT can learn about the method from tier 0 that can lead to even better tier 1 code quality than if the method had been compiled to tier 1 directly. For example, the JIT knows that a method &#8220;tiering up&#8221; from tier 0 to tier 1 has already been executed, and if it&#8217;s already been executed, then any <code>static readonly<\/code> fields it accesses are now already initialized, which means the JIT can look at the values of those fields and base the tier 1 code gen on what&#8217;s actually in the field (e.g. if it&#8217;s a <code>static readonly bool<\/code>, the JIT can now treat the value of that field as if it were <code>const bool<\/code>). If the method were instead compiled directly to tier 1, the JIT might not be able to make the same optimizations. Thus, with tiering, we can &#8220;have our cake and eat it, too.&#8221; We get both good startup and good throughput. 
Mostly&#8230;<\/p>\n<p>One wrinkle to this scheme, however, is the presence of longer-running methods. Methods might be important because they&#8217;re invoked many times, but they might also be important because they&#8217;re invoked only a few times but end up running forever, in particular due to looping. As such, tiering was disabled by default for methods containing backward branches, such that those methods would go straight to tier 1. To address that, .NET 7 introduced On-Stack Replacement (OSR). With OSR, the code generated for loops also included a counting mechanism, and after a loop iterated to a certain threshold, the JIT would compile a new optimized version of the method and jump from the minimally-optimized code to continue execution in the optimized variant. Pretty slick, and with that, in .NET 7 tiering was also enabled for methods with loops.<\/p>\n<p>But why is OSR important? If there are only a few such long-running methods, what&#8217;s the big deal if they just go straight to tier 1? Surely startup isn&#8217;t significantly negatively impacted? First, it can be: if you&#8217;re trying to trim milliseconds off startup time, every method counts. But second, as noted before, there are throughput benefits to going through tier 0, in that there are things the JIT can learn about a method from tier 0 which can then improve its tier 1 compilation. And the list of things the JIT can learn gets a whole lot bigger with dynamic PGO.<\/p>\n<p>Profile-Guided Optimization (PGO) has been around for decades, for many languages and environments, including in the .NET world. The typical flow is you build your application with some additional instrumentation, you then run your application on key scenarios, you gather up the results of that instrumentation, and then you rebuild your application, feeding that instrumentation data into the optimizer, allowing it to use the knowledge about how the code executed to impact how it&#8217;s optimized. 
This approach is often referred to as &#8220;static PGO.&#8221; &#8220;Dynamic PGO&#8221; is similar, except there&#8217;s no effort required around how the application is built, scenarios it&#8217;s run on, or any of that. With tiering, the JIT is already generating a tier 0 version of the code and then a tier 1 version of the code&#8230; why not sprinkle some instrumentation into the tier 0 code as well? Then the JIT can use the results of that instrumentation to better optimize tier 1. It&#8217;s the same basic &#8220;build, run and collect, re-build&#8221; flow as with static PGO, but now on a per-method basis, entirely within the execution of the application, and handled automatically for you by the JIT, with zero additional dev effort required and zero additional investment needed in build automation or infrastructure.<\/p>\n<p>Dynamic PGO first previewed in .NET 6, off by default. It was improved in .NET 7, but remained off by default. Now, in .NET 8, I&#8217;m thrilled to say it&#8217;s not only been significantly improved, it&#8217;s now on by default. This one-character PR to enable it might be the most valuable PR in all of .NET 8: <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86225\">dotnet\/runtime#86225<\/a>.<\/p>\n<p>There have been a multitude of PRs to make all of this work better in .NET 8, both on tiering in general and then on dynamic PGO in particular. One of the more interesting changes is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/70941\">dotnet\/runtime#70941<\/a>, which added more tiers, though we still refer to the unoptimized as &#8220;tier 0&#8221; and the optimized as &#8220;tier 1.&#8221; This was done primarily for two reasons. First, instrumentation isn&#8217;t free; if the goal of tier 0 is to make compilation as cheap as possible, then we want to avoid adding yet more code to be compiled. So, the PR adds a new tier to address that. 
Most code first gets compiled to an unoptimized and uninstrumented tier (though methods with loops currently skip this tier). Then after a certain number of invocations, it gets recompiled unoptimized but instrumented. And then after a certain number of invocations, it gets compiled as optimized using the resulting instrumentation data. Second, <code>crossgen<\/code>\/<code>ReadyToRun<\/code> (R2R) images were previously unable to participate in dynamic PGO. This was a <em>big<\/em> problem for taking full advantage of all that dynamic PGO offers, in particular because there&#8217;s a significant amount of code that every .NET application uses that&#8217;s already R2R&#8217;d: the core libraries. <code>ReadyToRun<\/code> is an AOT technology that enables most of the code generation work to be done at build-time, with just some minimal fix-ups applied when that precompiled code is prepared for execution. That code is optimized and not instrumented, or else the instrumentation would slow it down. So, this PR also adds a new tier for R2R. After an R2R method has been invoked some number of times, it&#8217;s recompiled, again with optimizations but this time also with instrumentation, and then when that&#8217;s been invoked sufficiently, it&#8217;s promoted again, this time to an optimized implementation utilizing the instrumentation data gathered in the previous tier.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/TierFlow.png\" alt=\"Code flow between JIT tiers\" \/><\/p>\n<p>There have also been multiple changes focused on doing more optimization in tier 0. As noted previously, the JIT wants to be able to compile tier 0 as quickly as possible, however some optimizations in code quality actually help it to do that. 
For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82412\">dotnet\/runtime#82412<\/a> teaches it to do some amount of constant folding (evaluating constant expressions at compile time rather than at execution time), as that can enable it to generate much less code. Much of the time the JIT spends compiling in tier 0 is for interactions with the Virtual Machine (VM) layer of the .NET runtime, such as resolving types, and so if it can significantly trim away branches that won&#8217;t ever be used, it can actually speed up tier 0 compilation while also getting better code quality. We can see this with a simple repro app like the following:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nMaybePrint(42.0);\r\n\r\nstatic void MaybePrint&lt;T&gt;(T value)\r\n{\r\n    if (value is int)\r\n        Console.WriteLine(value);\r\n}<\/code><\/pre>\n<p>I can set the <code>DOTNET_JitDisasm<\/code> environment variable to <code>*MaybePrint*<\/code>; that will result in the JIT printing out to the console the code it emits for this method. 
On .NET 7, when I run this (<code>dotnet run -c Release -f net7.0<\/code>), I get the following tier 0 code:<\/p>\n<pre><code class=\"language-assembly\">; Assembly listing for method Program:&lt;&lt;Main&gt;$&gt;g__MaybePrint|0_0[double](double)\r\n; Emitting BLENDED_CODE for X64 CPU with AVX - Windows\r\n; Tier-0 compilation\r\n; MinOpts code\r\n; rbp based frame\r\n; partially interruptible\r\n\r\nG_M000_IG01:                ;; offset=0000H\r\n       55                   push     rbp\r\n       4883EC30             sub      rsp, 48\r\n       C5F877               vzeroupper\r\n       488D6C2430           lea      rbp, [rsp+30H]\r\n       33C0                 xor      eax, eax\r\n       488945F8             mov      qword ptr [rbp-08H], rax\r\n       C5FB114510           vmovsd   qword ptr [rbp+10H], xmm0\r\n\r\nG_M000_IG02:                ;; offset=0018H\r\n       33C9                 xor      ecx, ecx\r\n       85C9                 test     ecx, ecx\r\n       742D                 je       SHORT G_M000_IG03\r\n       48B9B877CB99F97F0000 mov      rcx, 0x7FF999CB77B8\r\n       E813C9AE5F           call     CORINFO_HELP_NEWSFAST\r\n       488945F8             mov      gword ptr [rbp-08H], rax\r\n       488B4DF8             mov      rcx, gword ptr [rbp-08H]\r\n       C5FB104510           vmovsd   xmm0, qword ptr [rbp+10H]\r\n       C5FB114108           vmovsd   qword ptr [rcx+08H], xmm0\r\n       488B4DF8             mov      rcx, gword ptr [rbp-08H]\r\n       FF15BFF72000         call     [System.Console:WriteLine(System.Object)]\r\n\r\nG_M000_IG03:                ;; offset=0049H\r\n       90                   nop\r\n\r\nG_M000_IG04:                ;; offset=004AH\r\n       4883C430             add      rsp, 48\r\n       5D                   pop      rbp\r\n       C3                   ret\r\n\r\n; Total bytes of code 80<\/code><\/pre>\n<p>The important thing to note here is that all of the code associated with the <code>Console.WriteLine<\/code> had to be emitted, 
including the JIT needing to resolve the method tokens involved (which is how it knew to print &#8220;System.Console:WriteLine&#8221;), even though that branch will provably never be taken (it&#8217;s only taken when <code>value is int<\/code> and the JIT can see that <code>value<\/code> is a <code>double<\/code>). Now in .NET 8, it applies the previously-reserved-for-tier-1 constant folding optimizations that recognize the value is not an <code>int<\/code> and generates tier 0 code accordingly (<code>dotnet run -c Release -f net8.0<\/code>):<\/p>\n<pre><code class=\"language-assembly\">; Assembly listing for method Program:&lt;&lt;Main&gt;$&gt;g__MaybePrint|0_0[double](double) (Tier0)\r\n; Emitting BLENDED_CODE for X64 with AVX - Windows\r\n; Tier0 code\r\n; rbp based frame\r\n; partially interruptible\r\n\r\nG_M000_IG01:                ;; offset=0x0000\r\n       push     rbp\r\n       mov      rbp, rsp\r\n       vmovsd   qword ptr [rbp+0x10], xmm0\r\n\r\nG_M000_IG02:                ;; offset=0x0009\r\n\r\nG_M000_IG03:                ;; offset=0x0009\r\n       pop      rbp\r\n       ret\r\n\r\n; Total bytes of code 11<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77357\">dotnet\/runtime#77357<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83002\">dotnet\/runtime#83002<\/a> also enable some JIT intrinsics to be employed in tier 0 (a JIT intrinsic is a method the JIT has some special knowledge of, either knowing about its behavior so it can optimize around it accordingly, or in many cases actually supplying its own implementation to replace the one in the method&#8217;s body). This is in part for the same reason; many intrinsics can result in better dead code elimination (e.g. <code>if (typeof(T).IsValueType) { ... }<\/code>). But more so, without recognizing intrinsics as being special, we might end up generating code for an intrinsic method that we would never otherwise need to generate code for, even in tier 1. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88989\">dotnet\/runtime#88989<\/a> also eliminates some forms of boxing in tier 0.<\/p>\n<p>Collecting all of this instrumentation in tier 0 instrumented code brings with it some of its own challenges. The JIT is augmenting a bunch of methods to track a lot of additional data; where and how does it track it? And how does it do so safely and correctly when multiple threads are potentially accessing all of this at the same time? For example, one of the things the JIT tracks in an instrumented method is which branches are followed and how frequently; that requires it to count each time code traverses that branch. You can imagine that happens, well, a lot. How can it do the counting in a thread-safe yet efficient way?<\/p>\n<p>The answer previously was, it didn&#8217;t. It used racy, non-synchronized updates to a shared value, e.g. <code>_branches[branchNum]++<\/code>. This means that some updates might get lost in the presence of multithreaded access, but as the answer here only needs to be approximate, that was deemed ok. As it turns out, however, in some cases it was resulting in <em>a lot<\/em> of lost counts, which in turn caused the JIT to optimize for the wrong things. Another approach tried for comparison purposes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82775\">dotnet\/runtime#82775<\/a> was to use interlocked operations (e.g. if this were C#, <code>Interlocked.Increment<\/code>); that results in perfect accuracy, but that explicit synchronization represents a huge potential bottleneck when heavily contended. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84427\">dotnet\/runtime#84427<\/a> provides the approach that&#8217;s now enabled by default in .NET 8. It&#8217;s an implementation of a scalable approximate counter that employs some amount of pseudo-randomness to decide how often to synchronize and by how much to increment the shared count. 
There&#8217;s a <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/e641efb93d6fb6e82bc1aa01e3867ac06572ab93\/docs\/design\/features\/ScalableApproximateCounting.md\">great description<\/a> of all of this in the <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> repo; here is a C# implementation of the counting logic based on that discussion:<\/p>\n<pre><code class=\"language-C#\">static void Count(ref uint sharedCounter)\r\n{\r\n    uint currentCount = sharedCounter, delta = 1;\r\n    if (currentCount &gt; 0)\r\n    {\r\n        int logCount = 31 - (int)uint.LeadingZeroCount(currentCount);\r\n        if (logCount &gt;= 13)\r\n        {\r\n            delta = 1u &lt;&lt; (logCount - 12);\r\n            uint random = (uint)Random.Shared.NextInt64(0, uint.MaxValue + 1L);\r\n            if ((random &amp; (delta - 1)) != 0)\r\n            {\r\n                return;\r\n            }\r\n        }\r\n    }\r\n\r\n    Interlocked.Add(ref sharedCounter, delta);\r\n}<\/code><\/pre>\n<p>For current count values less than 8192, it ends up just doing the equivalent of an <code>Interlocked.Add(ref counter, 1)<\/code>. However, as the count increases to beyond that threshold, it starts only doing the add randomly half the time, and when it does, it adds 2. Then randomly a quarter of the time it adds 4. Then an eighth of the time it adds 8. And so on. 
In this way, as more and more increments are performed, it requires writing to the shared counter less and less frequently.<\/p>\n<p>We can test this out with a little app like the following (if you want to try running it, just copy the above <code>Count<\/code> into the program as well):<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Diagnostics;\r\n\r\nuint counter = 0;\r\nconst int ItersPerThread = 100_000_000;\r\n\r\nwhile (true)\r\n{\r\n    Run(\"Interlock\", _ =&gt; { for (int i = 0; i &lt; ItersPerThread; i++) Interlocked.Increment(ref counter); });\r\n    Run(\"Racy     \", _ =&gt; { for (int i = 0; i &lt; ItersPerThread; i++) counter++; });\r\n    Run(\"Scalable \", _ =&gt; { for (int i = 0; i &lt; ItersPerThread; i++) Count(ref counter); });\r\n    Console.WriteLine();\r\n}\r\n\r\nvoid Run(string name, Action&lt;int&gt; body)\r\n{\r\n    counter = 0;\r\n    long start = Stopwatch.GetTimestamp();\r\n    Parallel.For(0, Environment.ProcessorCount, body);\r\n    long end = Stopwatch.GetTimestamp();\r\n    Console.WriteLine($\"{name} =&gt; Expected: {Environment.ProcessorCount * ItersPerThread:N0}, Actual: {counter,13:N0}, Elapsed: {Stopwatch.GetElapsedTime(start, end).TotalMilliseconds}ms\");\r\n}<\/code><\/pre>\n<p>When I run that, I get results like this:<\/p>\n<pre><code class=\"language-text\">Interlock =&gt; Expected: 1,200,000,000, Actual: 1,200,000,000, Elapsed: 20185.548ms\r\nRacy      =&gt; Expected: 1,200,000,000, Actual:   138,526,798, Elapsed: 987.4997ms\r\nScalable  =&gt; Expected: 1,200,000,000, Actual: 1,193,541,836, Elapsed: 1082.8471ms<\/code><\/pre>\n<p>I find these results fascinating. The interlocked approach gets the exact right count, but it&#8217;s super slow, ~20x slower than the other approaches. The fastest is the racy additions one, but its count is also wildly inaccurate: it was off by a factor of 8x! 
The scalable counters solution was only a hair slower than the racy solution, but its count was only off the expected value by 0.5%. This scalable approach then enables the JIT to track what it needs with the efficiency and approximate accuracy it needs. Other PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82014\">dotnet\/runtime#82014<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81731\">dotnet\/runtime#81731<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81932\">dotnet\/runtime#81932<\/a> also went into improving the JIT&#8217;s efficiency around tracking this information.<\/p>\n<p>As it turns out, this isn&#8217;t the only use of randomness in dynamic PGO. Another is used as part of determining which types are the most common targets of virtual and interface method calls. At a given call site, the JIT wants to know which type is most commonly used and by what percentage; if there&#8217;s a clear winner, it can then generate a fast path specific to that type. As in the previous example, tracking a count for every possible type that might come through is expensive. Instead, it uses an algorithm known as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Reservoir_sampling\">&#8220;reservoir sampling&#8221;<\/a>. Let&#8217;s say I have a <code>char[1_000_000]<\/code> containing ~60% <code>'a'<\/code>s, ~30% <code>'b'<\/code>s, and ~10% <code>'c'<\/code>s, and I want to know which is the most common. 
With reservoir sampling, I might do so like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\n\/\/ Create random input for testing, with 60% a, 30% b, 10% c\r\nchar[] chars = new char[1_000_000];\r\nArray.Fill(chars, 'a', 0, 600_000);\r\nArray.Fill(chars, 'b', 600_000, 300_000);\r\nArray.Fill(chars, 'c', 900_000, 100_000);\r\nRandom.Shared.Shuffle(chars);\r\n\r\nfor (int trial = 0; trial &lt; 5; trial++)\r\n{\r\n    \/\/ Reservoir sampling\r\n    char[] reservoir = new char[32]; \/\/ same reservoir size as the JIT\r\n    int next = 0;\r\n    for (int i = 0; i &lt; reservoir.Length &amp;&amp; next &lt; chars.Length; i++, next++)\r\n    {\r\n        reservoir[i] = chars[i];\r\n    }\r\n    for (; next &lt; chars.Length; next++)\r\n    {\r\n        int r = Random.Shared.Next(next + 1);\r\n        if (r &lt; reservoir.Length)\r\n        {\r\n            reservoir[r] = chars[next];\r\n        }\r\n    }\r\n\r\n    \/\/ Print resulting percentages\r\n    Console.WriteLine($\"a: {reservoir.Count(c =&gt; c == 'a') * 100.0 \/ reservoir.Length}\");\r\n    Console.WriteLine($\"b: {reservoir.Count(c =&gt; c == 'b') * 100.0 \/ reservoir.Length}\");\r\n    Console.WriteLine($\"c: {reservoir.Count(c =&gt; c == 'c') * 100.0 \/ reservoir.Length}\");\r\n    Console.WriteLine();\r\n}<\/code><\/pre>\n<p>When I run this, I get results like the following:<\/p>\n<pre><code class=\"language-text\">a: 53.125\r\nb: 31.25\r\nc: 15.625\r\n\r\na: 65.625\r\nb: 28.125\r\nc: 6.25\r\n\r\na: 68.75\r\nb: 25\r\nc: 6.25\r\n\r\na: 40.625\r\nb: 31.25\r\nc: 28.125\r\n\r\na: 59.375\r\nb: 25\r\nc: 15.625<\/code><\/pre>\n<p>Note that in the above example, I actually had all the data in advance; in contrast, the JIT likely has multiple threads all running instrumented code and overwriting elements in the reservoir. 
I also happened to choose the same size reservoir the JIT is using as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87332\">dotnet\/runtime#87332<\/a>, which highlights how that value was chosen for its use case and why it needed to be tweaked.<\/p>\n<p>On all five runs above, it correctly found there to be more <code>'a'<\/code>s than <code>'b'<\/code>s and more <code>'b'<\/code>s than <code>'c'<\/code>s, and it was often reasonably close to the actual percentages. But, importantly, randomness is involved here, and every run produced slightly different results. I mention this because that means the JIT compiler now incorporates randomness, which means that the produced dynamic PGO instrumentation data is very likely to be slightly different from run to run. However, even without explicit use of randomness, there&#8217;s already non-determinism in such code, and in general there&#8217;s enough data produced that the overall behavior is quite stable and repeatable.<\/p>\n<p>Interestingly, the JIT&#8217;s PGO-based optimizations aren&#8217;t just based on the data gathered during instrumented tier 0 execution. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82926\">dotnet\/runtime#82926<\/a> (and a handful of follow-on PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83068\">dotnet\/runtime#83068<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83567\">dotnet\/runtime#83567<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84312\">dotnet\/runtime#84312<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84741\">dotnet\/runtime#84741<\/a>), the JIT will now create a synthetic profile based on statically analyzing the code and estimating a profile, such as with various approaches to static branch prediction. 
The JIT can then blend this data together with the instrumentation data, helping to fill in data where there are gaps (think &#8220;Jurassic Park&#8221; and using modern reptile DNA to plug the gaps in the recovered dinosaur DNA).<\/p>\n<p>Beyond the mechanisms used to enable tiering and dynamic PGO getting better (and, did I mention, being on by default?!) in .NET 8, the optimizations it performs also get better. One of the main optimizations dynamic PGO feeds is the ability to devirtualize virtual and interface calls per call site. As noted, the JIT tracks what concrete types are used, and then can generate a fast path for the most common type; this is known as guarded devirtualization (GDV). Consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    internal interface IValueProducer\r\n    {\r\n        int GetValue();\r\n    }\r\n\r\n    class Producer42 : IValueProducer\r\n    {\r\n        public int GetValue() =&gt; 42;\r\n    }\r\n\r\n    private IValueProducer _valueProducer;\r\n    private int _factor = 2;\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; _valueProducer = new Producer42();\r\n\r\n    [Benchmark]\r\n    public int GetValue() =&gt; _valueProducer.GetValue() * _factor;\r\n}<\/code><\/pre>\n<p>The <code>GetValue<\/code> method is doing:<\/p>\n<pre><code class=\"language-C#\">return _valueProducer.GetValue() * _factor;<\/code><\/pre>\n<p>Without PGO, that&#8217;s just a normal interface dispatch. 
With PGO, however, the JIT will see that the actual type of <code>_valueProducer<\/code> is most commonly <code>Producer42<\/code>, and it will generate tier 1 code closer to what it would produce if my benchmark were instead:<\/p>\n<pre><code class=\"language-C#\">int result = _valueProducer.GetType() == typeof(Producer42) ?\r\n    Unsafe.As&lt;Producer42&gt;(_valueProducer).GetValue() :\r\n    _valueProducer.GetValue();\r\nreturn result * _factor;<\/code><\/pre>\n<p>It can then in turn see that the <code>Producer42.GetValue()<\/code> method is really simple, and so not only is the <code>GetValue<\/code> call devirtualized, it&#8217;s also inlined, such that the code effectively becomes:<\/p>\n<pre><code class=\"language-C#\">int result = _valueProducer.GetType() == typeof(Producer42) ?\r\n    42 :\r\n    _valueProducer.GetValue();\r\nreturn result * _factor;<\/code><\/pre>\n<p>We can confirm this by running the above benchmark. The resulting numbers certainly show something going on:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetValue<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1.6430 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">35 B<\/td>\n<\/tr>\n<tr>\n<td>GetValue<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">0.0523 ns<\/td>\n<td style=\"text-align: right\">0.03<\/td>\n<td style=\"text-align: right\">57 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We see the .NET 8 version is both faster (which we expected) and larger in code size (which we also expected). Now for the assembly. 
On .NET 7, we get this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetValue()\r\n       push      rsi\r\n       sub       rsp,20\r\n       mov       rsi,rcx\r\n       mov       rcx,[rsi+8]\r\n       mov       r11,7FF999B30498\r\n       call      qword ptr [r11]\r\n       imul      eax,[rsi+10]\r\n       add       rsp,20\r\n       pop       rsi\r\n       ret\r\n; Total bytes of code 35<\/code><\/pre>\n<p>We can see it&#8217;s performing the interface call (the three <code>mov<\/code>s followed by the <code>call<\/code>) and then multiplying the result by <code>_factor<\/code> (<code>imul eax,[rsi+10]<\/code>). Now on .NET 8, we get this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetValue()\r\n       push      rbx\r\n       sub       rsp,20\r\n       mov       rbx,rcx\r\n       mov       rcx,[rbx+8]\r\n       mov       rax,offset MT_Tests+Producer42\r\n       cmp       [rcx],rax\r\n       jne       short M00_L01\r\n       mov       eax,2A\r\nM00_L00:\r\n       imul      eax,[rbx+10]\r\n       add       rsp,20\r\n       pop       rbx\r\n       ret\r\nM00_L01:\r\n       mov       r11,7FFA1FAB04D8\r\n       call      qword ptr [r11]\r\n       jmp       short M00_L00\r\n; Total bytes of code 57<\/code><\/pre>\n<p>We still see the <code>call<\/code>, but it&#8217;s buried in a cold section at the end. Instead, we see the type of the object being compared against <code>MT_Tests+Producer42<\/code>, and if it matches (the <code>cmp [rcx],rax<\/code> followed by the <code>jne<\/code>), we store <code>2A<\/code> into <code>eax<\/code>; <code>2A<\/code> is the hex representation of <code>42<\/code>, so this is the entirety of the inlined body of the devirtualized <code>Producer42.GetValue<\/code> call. 
.NET 8 is also capable of doing multiple GDVs, meaning it can generate fast paths for more than 1 type, thanks in large part to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86551\">dotnet\/runtime#86551<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86809\">dotnet\/runtime#86809<\/a>. However, this is off by default and for now needs to be opted-into with a configuration setting (setting the <code>DOTNET_JitGuardedDevirtualizationMaxTypeChecks<\/code> environment variable to the desired maximum number of types for which to test). We can see the impact of that with this benchmark (note that because I&#8217;ve explicitly specified the configs to use in the code itself, I&#8217;ve omitted the <code>--runtimes<\/code> argument in the <code>dotnet<\/code> command):<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\"ChecksOne\").WithRuntime(CoreRuntime.Core80))\r\n    .AddJob(Job.Default.WithId(\"ChecksThree\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_JitGuardedDevirtualizationMaxTypeChecks\", \"3\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private readonly A _a = new();\r\n    private readonly B _b = new();\r\n    private readonly C _c = new();\r\n\r\n    [Benchmark]\r\n    public void Multiple()\r\n    {\r\n        DoWork(_a);\r\n        DoWork(_b);\r\n        DoWork(_c);\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static int DoWork(IMyInterface i) =&gt; 
i.GetValue();\r\n\r\n    private interface IMyInterface { int GetValue(); }\r\n    private class A : IMyInterface { public int GetValue() =&gt; 123; }\r\n    private class B : IMyInterface { public int GetValue() =&gt; 456; }\r\n    private class C : IMyInterface { public int GetValue() =&gt; 789; }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Multiple<\/td>\n<td>ChecksOne<\/td>\n<td style=\"text-align: right\">7.463 ns<\/td>\n<td style=\"text-align: right\">90 B<\/td>\n<\/tr>\n<tr>\n<td>Multiple<\/td>\n<td>ChecksThree<\/td>\n<td style=\"text-align: right\">5.632 ns<\/td>\n<td style=\"text-align: right\">133 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And in the assembly code with the environment variable set, we can indeed see it doing multiple checks for three types before falling back to the general interface dispatch:<\/p>\n<pre><code class=\"language-assembly\">; Tests.DoWork(IMyInterface)\r\n       sub       rsp,28\r\n       mov       rax,offset MT_Tests+A\r\n       cmp       [rcx],rax\r\n       jne       short M01_L00\r\n       mov       eax,7B\r\n       jmp       short M01_L02\r\nM01_L00:\r\n       mov       rax,offset MT_Tests+B\r\n       cmp       [rcx],rax\r\n       jne       short M01_L01\r\n       mov       eax,1C8\r\n       jmp       short M01_L02\r\nM01_L01:\r\n       mov       rax,offset MT_Tests+C\r\n       cmp       [rcx],rax\r\n       jne       short M01_L03\r\n       mov       eax,315\r\nM01_L02:\r\n       add       rsp,28\r\n       ret\r\nM01_L03:\r\n       mov       r11,7FFA1FAC04D8\r\n       call      qword ptr [r11]\r\n       jmp       short M01_L02\r\n; Total bytes of code 88<\/code><\/pre>\n<p>(Interestingly, this optimization gets a bit better in Native AOT. 
There, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87055\">dotnet\/runtime#87055<\/a>, there may be no need for the fallback path at all. The compiler can see the entire program being optimized and can generate fast paths for all of the types that implement the target abstraction if it&#8217;s a small number.)<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75140\">dotnet\/runtime#75140<\/a> provides another really nice optimization, still related to GDV, but now for delegates and in relation to loop cloning. Take the following benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private readonly Func&lt;int, int&gt; _func = i =&gt; i + 1;\r\n\r\n    [Benchmark]\r\n    public int Sum() =&gt; Sum(_func);\r\n\r\n    private static int Sum(Func&lt;int, int&gt; func)\r\n    {\r\n        int sum = 0;\r\n        for (int i = 0; i &lt; 10_000; i++)\r\n        {\r\n            sum += func(i);\r\n        }\r\n\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<p>Dynamic PGO is capable of doing GDV with delegates just as it is with virtual and interface methods. The JIT&#8217;s profiling of this method will highlight that the function being invoked is always the same <code>i =&gt; i + 1<\/code> lambda, and, as we saw with the interface call, the method can then be transformed into something like the following pseudo-code:<\/p>\n<pre><code class=\"language-C#\">private static int Sum(Func&lt;int, int&gt; func)\r\n{\r\n    int sum = 0;\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        sum += func.Method == KnownLambda ? 
i + 1 : func(i);\r\n    }\r\n\r\n    return sum;\r\n}<\/code><\/pre>\n<p>It may not be obvious at first glance, but inside our loop we&#8217;re performing the same check on every single iteration, and we&#8217;re also branching based on it. One common compiler optimization is &#8220;hoisting,&#8221; where a computation that&#8217;s &#8220;loop invariant&#8221; (meaning it doesn&#8217;t change per iteration) can be pulled out of the loop to be above it, e.g.<\/p>\n<pre><code class=\"language-C#\">private static int Sum(Func&lt;int, int&gt; func)\r\n{\r\n    int sum = 0;\r\n    bool isAdd = func.Method == KnownLambda;\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        sum += isAdd ? i + 1 : func(i);\r\n    }\r\n\r\n    return sum;\r\n}<\/code><\/pre>\n<p>but even with that, we still have the branch on each iteration. Wouldn&#8217;t it be nice if we could hoist that as well? What if we could &#8220;clone&#8221; the loop, duplicating it once for when the method is the known target and once for when it&#8217;s not? That&#8217;s &#8220;loop cloning,&#8221; an optimization the JIT is already capable of for other reasons, and now in .NET 8 the JIT is capable of that with this exact scenario, too. 
The code it&#8217;ll produce ends up then being very similar to this:<\/p>\n<pre><code class=\"language-C#\">private static int Sum(Func&lt;int, int&gt; func)\r\n{\r\n    int sum = 0;\r\n    if (func.Method == KnownLambda)\r\n    {\r\n        for (int i = 0; i &lt; 10_000; i++)\r\n        {\r\n            sum += i + 1;\r\n        }\r\n    }\r\n    else\r\n    {\r\n        for (int i = 0; i &lt; 10_000; i++)\r\n        {\r\n            sum += func(i);\r\n        }\r\n    }\r\n    return sum;\r\n}<\/code><\/pre>\n<p>Looking at the generated assembly on .NET 8 confirms this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Sum(System.Func`2&lt;Int32,Int32&gt;)\r\n       push      rdi\r\n       push      rsi\r\n       push      rbx\r\n       sub       rsp,20\r\n       mov       rbx,rcx\r\n       xor       esi,esi\r\n       xor       edi,edi\r\n       test      rbx,rbx\r\n       je        short M01_L01\r\n       mov       rax,7FFA2D630F78\r\n       cmp       [rbx+18],rax\r\n       jne       short M01_L01\r\nM01_L00:\r\n       inc       edi\r\n       mov       eax,edi\r\n       add       esi,eax\r\n       cmp       edi,2710\r\n       jl        short M01_L00\r\n       jmp       short M01_L03\r\nM01_L01:\r\n       mov       rax,7FFA2D630F78\r\n       cmp       [rbx+18],rax\r\n       jne       short M01_L04\r\n       lea       eax,[rdi+1]\r\nM01_L02:\r\n       add       esi,eax\r\n       inc       edi\r\n       cmp       edi,2710\r\n       jl        short M01_L01\r\nM01_L03:\r\n       mov       eax,esi\r\n       add       rsp,20\r\n       pop       rbx\r\n       pop       rsi\r\n       pop       rdi\r\n       ret\r\nM01_L04:\r\n       mov       edx,edi\r\n       mov       rcx,[rbx+8]\r\n       call      qword ptr [rbx+18]\r\n       jmp       short M01_L02\r\n; Total bytes of code 103<\/code><\/pre>\n<p>Focus just on the <code>M01_L00<\/code> block: you can see it ends with a <code>jl short M01_L00<\/code> to loop back around to <code>M01_L00<\/code> if <code>edi<\/code> 
(which is storing <code>i<\/code>) is less than 0x2710, or 10,000 decimal, aka our loop&#8217;s upper bound. Note that there are just a few instructions in the middle, nothing at all resembling a <code>call<\/code>&#8230; this is the optimized cloned loop, where our lambda has been inlined. There&#8217;s another loop that alternates between <code>M01_L02<\/code>, <code>M01_L01<\/code>, and <code>M01_L04<\/code>, and that one does have a <code>call<\/code>&#8230; that&#8217;s the fallback loop. And if we run the benchmark, we see a huge resulting improvement:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">16.546 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">55 B<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">2.320 us<\/td>\n<td style=\"text-align: right\">0.14<\/td>\n<td style=\"text-align: right\">113 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As long as we&#8217;re discussing hoisting, it&#8217;s worth noting other improvements have also contributed. In particular, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81635\">dotnet\/runtime#81635<\/a> enables the JIT to hoist more code used in generic method dispatch. 
We can see that in action with a benchmark like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public void Test() =&gt; Test&lt;string&gt;();\r\n\r\n    static void Test&lt;T&gt;()\r\n    {\r\n        for (int i = 0; i &lt; 100; i++)\r\n        {\r\n            Callee&lt;T&gt;();\r\n        }\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    static void Callee&lt;T&gt;() { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">170.8 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">147.0 ns<\/td>\n<td style=\"text-align: right\">0.86<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Before moving on, one word of warning about dynamic PGO: it&#8217;s good at what it does, really good. Why is that a &#8220;warning?&#8221; Dynamic PGO is very good about seeing what your code is doing and optimizing for it, which is awesome when you&#8217;re talking about your production applications. But there&#8217;s a particular kind of coding where you might not want that to happen, or at least you need to be acutely aware of it happening, and you&#8217;re currently looking at it: benchmarks. Microbenchmarks are all about isolating a particular piece of functionality and running that over and over and over and over in order to get good measurements about its overhead. 
With dynamic PGO, however, the JIT will then optimize for the exact thing you&#8217;re testing. If the thing you&#8217;re testing is exactly how the code will execute in production, then awesome. But if your test isn&#8217;t fully representative, you can get a skewed understanding of the costs involved, which can lead to making less-than-ideal assumptions and decisions.<\/p>\n<p>For example, consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\"No PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\"PGO\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\")]\r\npublic class Tests\r\n{\r\n    private static readonly Random s_rand = new();\r\n    private readonly IEnumerable&lt;int&gt; _source = Enumerable.Repeat(0, 1024);\r\n\r\n    [Params(1.0, 0.5)]\r\n    public double Probability { get; set; }\r\n\r\n    [Benchmark]\r\n    public bool Any() =&gt; s_rand.NextDouble() &lt; Probability ?\r\n        _source.Any(i =&gt; i == 42) :\r\n        _source.Any(i =&gt; i == 43);\r\n}<\/code><\/pre>\n<p>This runs a benchmark with two different &#8220;Probability&#8221; values. Regardless of that value, the code that&#8217;s executed for the benchmark does exactly the same thing and should result in exactly the same assembly code (other than one path checking for the value <code>42<\/code> and the other for <code>43<\/code>). 
In a world without PGO, there should be close to zero difference in performance between the runs, and if we set the <code>DOTNET_TieredPGO<\/code> environment variable to <code>0<\/code> (to disable PGO), that&#8217;s exactly what we see, but with PGO, we observe a larger difference:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th>Probability<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Any<\/td>\n<td>No PGO<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">5.354 us<\/td>\n<\/tr>\n<tr>\n<td>Any<\/td>\n<td>No PGO<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">5.314 us<\/td>\n<\/tr>\n<tr>\n<td>Any<\/td>\n<td>PGO<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">1.969 us<\/td>\n<\/tr>\n<tr>\n<td>Any<\/td>\n<td>PGO<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">1.495 us<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>When all of the calls use <code>i == 42<\/code> (because we set the probability to 1, all of the random values are less than that, and we always take the first branch), we see throughput ends up being 25% faster than when half of the calls use <code>i == 42<\/code> and half use <code>i == 43<\/code>. If your benchmark was only trying to measure the overhead of using <code>Enumerable.Any<\/code>, you might not realize that the resulting code was being optimized for calling <code>Any<\/code> with the same delegate every time, in which case you get different results than if <code>Any<\/code> is called with multiple delegates and all with reasonably equal chances of being used. 
(As an aside, the nice overall improvement between dynamic PGO being disabled and enabled comes in part from the use of <code>Random<\/code>, which internally makes a virtual call that dynamic PGO can help elide.)<\/p>\n<p>Throughout the rest of this post, I&#8217;ve kept this in mind and tried hard to show benchmarks where the resulting wins are due primarily to the cited improvements in the relevant code; where dynamic PGO plays a larger role in the improvements, I&#8217;ve called that out, often showing the results with and without dynamic PGO. There are many more benchmarks I could have shown but chose not to, because they would make it look like a particular method had massive improvements when in reality the gains would all be due to dynamic PGO being its awesome self rather than some explicit change made to the method&#8217;s C# code.<\/p>\n<p>One final note about dynamic PGO: it&#8217;s awesome, but it doesn&#8217;t obviate the need for thoughtful coding. If you know and can use something&#8217;s concrete type rather than an abstraction, from a performance perspective it&#8217;s better to do so rather than hoping the JIT will be able to see through it and devirtualize. To help with this, a new analyzer, <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1859\">CA1859<\/a>, was added to the .NET SDK in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6370\">dotnet\/roslyn-analyzers#6370<\/a>. 
The analyzer looks for places where interfaces or base classes could be replaced by derived types in order to avoid interface and virtual dispatch.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1859.png\" alt=\"CA1859\" \/>\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80335\">dotnet\/runtime#80335<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80848\">dotnet\/runtime#80848<\/a> rolled this out across <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>. As you can see from the first PR in particular, the analyzer identified hundreds of places where an edit of just one character (e.g. replacing <code>IList&lt;T&gt;<\/code> with <code>List&lt;T&gt;<\/code>) could potentially reduce overhead, as this benchmark illustrates:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\"No PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\"PGO\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\")]\r\npublic class Tests\r\n{\r\n    private readonly IList&lt;int&gt; _ilist = new List&lt;int&gt;();\r\n    private readonly List&lt;int&gt; _list = new();\r\n\r\n    [Benchmark]\r\n    public void IList()\r\n    {\r\n        _ilist.Add(42);\r\n        _ilist.Clear();\r\n    }\r\n\r\n    [Benchmark]\r\n    public void List()\r\n    {\r\n        _list.Add(42);\r\n        _list.Clear();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th 
style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IList<\/td>\n<td>No PGO<\/td>\n<td style=\"text-align: right\">2.876 ns<\/td>\n<\/tr>\n<tr>\n<td>IList<\/td>\n<td>PGO<\/td>\n<td style=\"text-align: right\">1.777 ns<\/td>\n<\/tr>\n<tr>\n<td>List<\/td>\n<td>No PGO<\/td>\n<td style=\"text-align: right\">1.718 ns<\/td>\n<\/tr>\n<tr>\n<td>List<\/td>\n<td>PGO<\/td>\n<td style=\"text-align: right\">1.476 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Vectorization<\/h2>\n<p>Another huge area of investment in code generation in .NET 8 is around vectorization. This is a continuation of a theme that&#8217;s been going for multiple .NET releases. Almost a decade ago, .NET gained the <code>Vector&lt;T&gt;<\/code> type. .NET Core 3.0 and .NET 5 added thousands of intrinsic methods for directly targeting specific hardware instructions. .NET 7 provided hundreds of cross-platform operations for <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/8323e58d7b587deceb7ef39e61725bce7bfe86f8\/docs\/coding-guidelines\/vectorization-guidelines.md\"><code>Vector128&lt;T&gt;<\/code> and <code>Vector256&lt;T&gt;<\/code><\/a> to enable SIMD algorithms on fixed-width vectors. And now in .NET 8, .NET gains support for AVX512, both with new hardware intrinsics directly exposing AVX512 instructions and with the new <code>Vector512<\/code> and <code>Vector512&lt;T&gt;<\/code> types.<\/p>\n<p>There were a plethora of changes that went into improving existing SIMD support, such as <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76221\">dotnet\/runtime#76221<\/a> that improves the handling of <code>Vector256&lt;T&gt;<\/code> when it&#8217;s not hardware accelerated by lowering it as two <code>Vector128&lt;T&gt;<\/code> operations. 
Or like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87283\">dotnet\/runtime#87283<\/a>, which removed the generic constraint on the <code>T<\/code> in all of the vector types in order to make them easier to use in a larger set of contexts. But the bulk of the work in this area in this release is focused on AVX512.<\/p>\n<p>Wikipedia has a good overview of <a href=\"https:\/\/en.wikipedia.org\/wiki\/AVX-512\">AVX512<\/a>, which provides instructions for processing 512 bits at a time. In addition to providing wider versions of the 256-bit instructions seen in previous instruction sets, it also adds a variety of new operations, almost all of which are exposed via one of the new types in <code>System.Runtime.Intrinsics.X86<\/code>, like <code>Avx512BW<\/code>, <code>Avx512CD<\/code>, <code>Avx512DQ<\/code>, <code>Avx512F<\/code>, and <code>Avx512Vbmi<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83040\">dotnet\/runtime#83040<\/a> kicked things off by stubbing out the various files, followed by dozens of PRs that filled in the functionality, for example <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84909\">dotnet\/runtime#84909<\/a> that added the 512-bit variants of the SSE through SSE4.2 intrinsics that already exist; like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75934\">dotnet\/runtime#75934<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77419\">dotnet\/runtime#77419<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> that added support for the EVEX encoding used by AVX512 instructions; like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74113\">dotnet\/runtime#74113<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> that added the logic for detecting AVX512 support; like <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80960\">dotnet\/runtime#80960<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79544\">dotnet\/runtime#79544<\/a> from <a href=\"https:\/\/github.com\/anthonycanino\">@anthonycanino<\/a> that enlightened the register allocator and emitter about AVX512&#8217;s additional registers; and like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87946\">dotnet\/runtime#87946<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84937\">dotnet\/runtime#84937<\/a> from <a href=\"https:\/\/github.com\/jkrishnavs\">@jkrishnavs<\/a> that plumbed through knowledge of various intrinsics.<\/p>\n<p>Let&#8217;s take it for a spin. The machine on which I&#8217;m writing this doesn&#8217;t have AVX512 support, but my <a href=\"https:\/\/azure.microsoft.com\/products\/dev-box\/\">Dev Box<\/a> does, so I&#8217;m using that for AVX512 comparisons (using <a href=\"https:\/\/learn.microsoft.com\/windows\/wsl\/\">WSL<\/a> with Ubuntu). In last year&#8217;s <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance_improvements_in_net_7\/#vectorization\">Performance Improvements in .NET 7<\/a>, we wrote a <code>Contains<\/code> method that used <code>Vector256&lt;T&gt;<\/code> if there was sufficient data available and it was accelerated, or else <code>Vector128&lt;T&gt;<\/code> if there was sufficient data available and it was accelerated, or else a scalar fallback. Tweaking that to also &#8220;light up&#8221; with AVX512 took me literally less than 30 seconds: copy\/paste the code block for <code>Vector256<\/code> and then search and replace in that copy from &#8220;Vector256&#8221; to &#8220;Vector512&#8243;&#8230; boom, done. 
Here it is in a benchmark, using environment variables to disable the JIT&#8217;s ability to use the various instruction sets so that we can try out this method with each acceleration path:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\nusing System.Runtime.InteropServices;\r\nusing System.Runtime.Intrinsics;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\"Scalar\").WithEnvironmentVariable(\"DOTNET_EnableHWIntrinsic\", \"0\").AsBaseline())\r\n    .AddJob(Job.Default.WithId(\"Vector128\").WithEnvironmentVariable(\"DOTNET_EnableAVX2\", \"0\").WithEnvironmentVariable(\"DOTNET_EnableAVX512F\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\"Vector256\").WithEnvironmentVariable(\"DOTNET_EnableAVX512F\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\"Vector512\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\", \"value\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _data = Enumerable.Repeat((byte)123, 999).Append((byte)42).ToArray();\r\n\r\n    [Benchmark]\r\n    [Arguments((byte)42)]\r\n    public bool Find(byte value) =&gt; Contains(_data, value);\r\n\r\n    private static unsafe bool Contains(ReadOnlySpan&lt;byte&gt; haystack, byte needle)\r\n    {\r\n        if (Vector128.IsHardwareAccelerated &amp;&amp; haystack.Length &gt;= Vector128&lt;byte&gt;.Count)\r\n        {\r\n            ref byte current = ref MemoryMarshal.GetReference(haystack);\r\n\r\n            if (Vector512.IsHardwareAccelerated &amp;&amp; haystack.Length &gt;= Vector512&lt;byte&gt;.Count)\r\n            {\r\n                Vector512&lt;byte&gt; target = Vector512.Create(needle);\r\n                ref byte 
endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector512&lt;byte&gt;.Count);\r\n                do\r\n                {\r\n                    if (Vector512.EqualsAny(target, Vector512.LoadUnsafe(ref current)))\r\n                        return true;\r\n\r\n                    current = ref Unsafe.Add(ref current, Vector512&lt;byte&gt;.Count);\r\n                }\r\n                while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));\r\n\r\n                if (Vector512.EqualsAny(target, Vector512.LoadUnsafe(ref endMinusOneVector)))\r\n                    return true;\r\n            }\r\n            else if (Vector256.IsHardwareAccelerated &amp;&amp; haystack.Length &gt;= Vector256&lt;byte&gt;.Count)\r\n            {\r\n                Vector256&lt;byte&gt; target = Vector256.Create(needle);\r\n                ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector256&lt;byte&gt;.Count);\r\n                do\r\n                {\r\n                    if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref current)))\r\n                        return true;\r\n\r\n                    current = ref Unsafe.Add(ref current, Vector256&lt;byte&gt;.Count);\r\n                }\r\n                while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));\r\n\r\n                if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref endMinusOneVector)))\r\n                    return true;\r\n            }\r\n            else\r\n            {\r\n                Vector128&lt;byte&gt; target = Vector128.Create(needle);\r\n                ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector128&lt;byte&gt;.Count);\r\n                do\r\n                {\r\n                    if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref current)))\r\n                        return true;\r\n\r\n                    current = ref Unsafe.Add(ref current, 
Vector128&lt;byte&gt;.Count);\r\n                }\r\n                while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));\r\n\r\n                if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref endMinusOneVector)))\r\n                    return true;\r\n            }\r\n        }\r\n        else\r\n        {\r\n            for (int i = 0; i &lt; haystack.Length; i++)\r\n                if (haystack[i] == needle)\r\n                    return true;\r\n        }\r\n\r\n        return false;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Find<\/td>\n<td>Scalar<\/td>\n<td style=\"text-align: right\">461.49 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Find<\/td>\n<td>Vector128<\/td>\n<td style=\"text-align: right\">37.94 ns<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<\/tr>\n<tr>\n<td>Find<\/td>\n<td>Vector256<\/td>\n<td style=\"text-align: right\">22.98 ns<\/td>\n<td style=\"text-align: right\">0.05<\/td>\n<\/tr>\n<tr>\n<td>Find<\/td>\n<td>Vector512<\/td>\n<td style=\"text-align: right\">10.93 ns<\/td>\n<td style=\"text-align: right\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Numerous PRs elsewhere in the JIT then take advantage of AVX512 support when it&#8217;s available. For example, separate from AVX512, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83945\">dotnet\/runtime#83945<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84530\">dotnet\/runtime#84530<\/a> taught the JIT how to unroll <code>SequenceEqual<\/code> operations, such that the JIT can emit optimized, vectorized replacements when it can see a constant length for at least one of the inputs. 
&#8220;Unrolling&#8221; means that rather than emitting a loop for N iterations, each of which does the loop body once, a loop is emitted for N \/ M iterations, where every iteration does the loop body M times (and if N == M, there is no loop at all). So for a benchmark like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private byte[] _scheme = \"Transfer-Encoding\"u8.ToArray();\r\n\r\n    [Benchmark]\r\n    public bool SequenceEqual() =&gt; \"Transfer-Encoding\"u8.SequenceEqual(_scheme);\r\n}<\/code><\/pre>\n<p>we now get results like this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SequenceEqual<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">3.0558 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">65 B<\/td>\n<\/tr>\n<tr>\n<td>SequenceEqual<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">0.8055 ns<\/td>\n<td style=\"text-align: right\">0.26<\/td>\n<td style=\"text-align: right\">91 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For .NET 7, we see assembly code like this (note the <code>call<\/code> instruction to the underlying <code>SequenceEqual<\/code> helper):<\/p>\n<pre><code class=\"language-assembly\">; Tests.SequenceEqual()\r\n       sub       rsp,28\r\n       mov       r8,1D7BB272E48\r\n       mov       rcx,[rcx+8]\r\n       test      rcx,rcx\r\n       je        short M00_L03\r\n       lea       rdx,[rcx+10]\r\n       mov       
eax,[rcx+8]\r\nM00_L00:\r\n       mov       rcx,r8\r\n       cmp       eax,11\r\n       je        short M00_L02\r\n       xor       eax,eax\r\nM00_L01:\r\n       add       rsp,28\r\n       ret\r\nM00_L02:\r\n       mov       r8d,11\r\n       call      qword ptr [7FF9D33CF120]; System.SpanHelpers.SequenceEqual(Byte ByRef, Byte ByRef, UIntPtr)\r\n       jmp       short M00_L01\r\nM00_L03:\r\n       xor       edx,edx\r\n       xor       eax,eax\r\n       jmp       short M00_L00\r\n; Total bytes of code 65<\/code><\/pre>\n<p>And now for .NET 8, we get assembly code like this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.SequenceEqual()\r\n       vzeroupper\r\n       mov       rax,1EBDDA92D38\r\n       mov       rcx,[rcx+8]\r\n       test      rcx,rcx\r\n       je        short M00_L01\r\n       lea       rdx,[rcx+10]\r\n       mov       r8d,[rcx+8]\r\nM00_L00:\r\n       cmp       r8d,11\r\n       jne       short M00_L03\r\n       vmovups   xmm0,[rax]\r\n       vmovups   xmm1,[rdx]\r\n       vmovups   xmm2,[rax+1]\r\n       vmovups   xmm3,[rdx+1]\r\n       vpxor     xmm0,xmm0,xmm1\r\n       vpxor     xmm1,xmm2,xmm3\r\n       vpor      xmm0,xmm0,xmm1\r\n       vptest    xmm0,xmm0\r\n       sete      al\r\n       movzx     eax,al\r\n       jmp       short M00_L02\r\nM00_L01:\r\n       xor       edx,edx\r\n       xor       r8d,r8d\r\n       jmp       short M00_L00\r\nM00_L02:\r\n       ret\r\nM00_L03:\r\n       xor       eax,eax\r\n       jmp       short M00_L02\r\n; Total bytes of code 91<\/code><\/pre>\n<p>Now there&#8217;s no <code>call<\/code>, with the entire implementation provided by the JIT; we can see it making liberal use of the 128-bit <code>xmm<\/code> SIMD registers. However, those PRs only enabled the JIT to handle up to 64 bytes being compared (unrolling results in larger code, so at some length it no longer makes sense to unroll). 
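<\/p>\n<p>To make the shape of that .NET 8 assembly easier to follow, here is a minimal C# sketch of the same idea for the 17-byte case: two overlapping 16-byte loads per input, an XOR of each pair, an OR to merge the differences, and a single test against zero. <code>Equals17<\/code> is a hypothetical helper written purely for illustration; it is not the JIT&#8217;s implementation or a .NET API, and it assumes a runtime where <code>Vector128.LoadUnsafe<\/code> is available (.NET 7 or later).<\/p>\n<pre><code class=\"language-C#\">\/\/ Illustrative sketch only; Equals17 is a hypothetical helper, not a .NET API.\r\nusing System;\r\nusing System.Runtime.InteropServices;\r\nusing System.Runtime.Intrinsics;\r\n\r\nConsole.WriteLine(Equals17(\"Transfer-Encoding\"u8, \"Transfer-Encoding\"u8)); \/\/ True\r\nConsole.WriteLine(Equals17(\"Transfer-Encoding\"u8, \"Transfer-encoding\"u8)); \/\/ False\r\n\r\nstatic bool Equals17(ReadOnlySpan&lt;byte&gt; left, ReadOnlySpan&lt;byte&gt; right)\r\n{\r\n    if (left.Length != 17 || right.Length != 17)\r\n        return false;\r\n\r\n    ref byte l = ref MemoryMarshal.GetReference(left);\r\n    ref byte r = ref MemoryMarshal.GetReference(right);\r\n\r\n    \/\/ Two overlapping 16-byte loads cover bytes [0, 16) and [1, 17).\r\n    Vector128&lt;byte&gt; first = Vector128.LoadUnsafe(ref l) ^ Vector128.LoadUnsafe(ref r);\r\n    Vector128&lt;byte&gt; second = Vector128.LoadUnsafe(ref l, 1) ^ Vector128.LoadUnsafe(ref r, 1);\r\n\r\n    \/\/ XOR leaves zero bits wherever the inputs match; OR merges both results.\r\n    return (first | second) == Vector128&lt;byte&gt;.Zero;\r\n}<\/code><\/pre>\n<p>A hand-written helper like this covers only one fixed length, whereas the JIT&#8217;s unrolling, as described above, handles constant lengths up to 64 bytes. 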
With AVX512 support in the JIT, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84854\">dotnet\/runtime#84854<\/a> then extends that up to 128 bytes. This is easily visible in a benchmark like this, which is similar to the previous example, but with larger data:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private byte[] _data1, _data2;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _data1 = Enumerable.Repeat((byte)42, 200).ToArray();\r\n        _data2 = (byte[])_data1.Clone();\r\n    }\r\n\r\n    [Benchmark]\r\n    public bool SequenceEqual() =&gt; _data1.AsSpan(0, 128).SequenceEqual(_data2.AsSpan(128));\r\n}<\/code><\/pre>\n<p>On my Dev Box with AVX512 support, for .NET 8 we get:<\/p>\n<pre><code class=\"language-assembly\">; Tests.SequenceEqual()\r\n       sub       rsp,28\r\n       vzeroupper\r\n       mov       rax,[rcx+8]\r\n       test      rax,rax\r\n       je        short M00_L01\r\n       cmp       dword ptr [rax+8],80\r\n       jb        short M00_L01\r\n       add       rax,10\r\n       mov       rcx,[rcx+10]\r\n       test      rcx,rcx\r\n       je        short M00_L01\r\n       mov       edx,[rcx+8]\r\n       cmp       edx,80\r\n       jb        short M00_L01\r\n       add       rcx,10\r\n       add       rcx,80\r\n       add       edx,0FFFFFF80\r\n       cmp       edx,80\r\n       je        short M00_L02\r\n       xor       eax,eax\r\nM00_L00:\r\n       vzeroupper\r\n       add       rsp,28\r\n       ret\r\nM00_L01:\r\n       call      qword ptr [7FF820745F08]\r\n       int       3\r\nM00_L02:\r\n       vmovups   zmm0,[rax]\r\n       vmovups   zmm1,[rcx]\r\n       vmovups   
zmm2,[rax+40]\r\n       vmovups   zmm3,[rcx+40]\r\n       vpxorq    zmm0,zmm0,zmm1\r\n       vpxorq    zmm1,zmm2,zmm3\r\n       vporq     zmm0,zmm0,zmm1\r\n       vxorps    ymm1,ymm1,ymm1\r\n       vpcmpeqq  k1,zmm0,zmm1\r\n       kortestb  k1,k1\r\n       setb      al\r\n       movzx     eax,al\r\n       jmp       short M00_L00\r\n; Total bytes of code 154<\/code><\/pre>\n<p>Now instead of the 128-bit <code>xmm<\/code> registers, we see use of the 512-bit <code>zmm<\/code> registers from AVX512.<\/p>\n<p>The JIT in .NET 8 also now unrolls <code>memmove<\/code>s (<code>CopyTo<\/code>, <code>ToArray<\/code>, etc.) for small-enough constant lengths, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83638\">dotnet\/runtime#83638<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83740\">dotnet\/runtime#83740<\/a>. And then with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84348\">dotnet\/runtime#84348<\/a> that unrolling takes advantage of AVX512 if it&#8217;s available. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85501\">dotnet\/runtime#85501<\/a> extends this to <code>Span&lt;T&gt;.Fill<\/code>, too.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84885\">dotnet\/runtime#84885<\/a> extended the unrolling and vectorization done as part of <code>string<\/code>\/<code>ReadOnlySpan&lt;char&gt;<\/code> <code>Equals<\/code> and <code>StartsWith<\/code> to utilize AVX512 when available, as well.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private readonly string _str = \"Let me not to the marriage of true minds admit impediments\";\r\n\r\n    [Benchmark]\r\n    public bool Equals() =&gt; _str.AsSpan().Equals(\r\n        \"LET ME NOT TO THE MARRIAGE OF TRUE MINDS ADMIT IMPEDIMENTS\",\r\n        StringComparison.OrdinalIgnoreCase);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Equals<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">30.995 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">101 B<\/td>\n<\/tr>\n<tr>\n<td>Equals<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.658 ns<\/td>\n<td style=\"text-align: right\">0.05<\/td>\n<td style=\"text-align: right\">116 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>It&#8217;s so fast in .NET 8 because, whereas with .NET 7 it ends up calling through to the underlying helper:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Equals()\r\n       sub    
   rsp,48\r\n       xor       eax,eax\r\n       mov       [rsp+28],rax\r\n       vxorps    xmm4,xmm4,xmm4\r\n       vmovdqa   xmmword ptr [rsp+30],xmm4\r\n       mov       [rsp+40],rax\r\n       mov       rcx,[rcx+8]\r\n       test      rcx,rcx\r\n       je        short M00_L03\r\n       lea       rdx,[rcx+0C]\r\n       mov       ecx,[rcx+8]\r\nM00_L00:\r\n       mov       r8,21E57C058A0\r\n       mov       r8,[r8]\r\n       add       r8,0C\r\n       cmp       ecx,3A\r\n       jne       short M00_L02\r\n       mov       rcx,rdx\r\n       mov       rdx,r8\r\n       mov       r8d,3A\r\n       call      qword ptr [7FF8194B1A08]; System.Globalization.Ordinal.EqualsIgnoreCase(Char ByRef, Char ByRef, Int32)\r\nM00_L01:\r\n       nop\r\n       add       rsp,48\r\n       ret\r\nM00_L02:\r\n       xor       eax,eax\r\n       jmp       short M00_L01\r\nM00_L03:\r\n       xor       ecx,ecx\r\n       xor       edx,edx\r\n       xchg      rcx,rdx\r\n       jmp       short M00_L00\r\n; Total bytes of code 101<\/code><\/pre>\n<p>in .NET 8, the JIT generates code for the operation directly, taking advantage of AVX512&#8217;s greater width and thus able to process a larger input without significantly increasing code size:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Equals()\r\n       vzeroupper\r\n       mov       rax,[rcx+8]\r\n       test      rax,rax\r\n       jne       short M00_L00\r\n       xor       ecx,ecx\r\n       xor       edx,edx\r\n       jmp       short M00_L01\r\nM00_L00:\r\n       lea       rcx,[rax+0C]\r\n       mov       edx,[rax+8]\r\nM00_L01:\r\n       cmp       edx,3A\r\n       jne       short M00_L02\r\n       vmovups   zmm0,[rcx]\r\n       vmovups   zmm1,[7FF820495080]\r\n       vpternlogq zmm0,zmm1,[7FF8204950C0],56\r\n       vmovups   zmm1,[rcx+34]\r\n       vporq     zmm1,zmm1,[7FF820495100]\r\n       vpternlogq zmm0,zmm1,[7FF820495140],0F6\r\n       vxorps    ymm1,ymm1,ymm1\r\n       vpcmpeqq  k1,zmm0,zmm1\r\n       kortestb  k1,k1\r\n       setb 
     al\r\n       movzx     eax,al\r\n       jmp       short M00_L03\r\nM00_L02:\r\n       xor       eax,eax\r\nM00_L03:\r\n       vzeroupper\r\n       ret\r\n; Total bytes of code 116<\/code><\/pre>\n<p>Even super simple operations get in on the action. Here we just have a cast from a <code>ulong<\/code> to a <code>double<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"val\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(1234567891011121314ul)]\r\n    public double UIntToDouble(ulong val) =&gt; val;\r\n}<\/code><\/pre>\n<p>Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84384\">dotnet\/runtime#84384<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a>, the code for that shrinks from this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.UIntToDouble(UInt64)\r\n       vzeroupper\r\n       vxorps    xmm0,xmm0,xmm0\r\n       vcvtsi2sd xmm0,xmm0,rdx\r\n       test      rdx,rdx\r\n       jge       short M00_L00\r\n       vaddsd    xmm0,xmm0,qword ptr [7FF819E776C0]\r\nM00_L00:\r\n       ret\r\n; Total bytes of code 26<\/code><\/pre>\n<p>using the AVX <code>vcvtsi2sd<\/code> instruction, to this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.UIntToDouble(UInt64)\r\n       vzeroupper\r\n       vcvtusi2sd xmm0,xmm0,rdx\r\n       ret\r\n; Total bytes of code 10<\/code><\/pre>\n<p>using the AVX512 <code>vcvtusi2sd<\/code> instruction.<\/p>\n<p>As yet another example, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87641\">dotnet\/runtime#87641<\/a> we see the JIT using AVX512 to accelerate various <code>Math<\/code> APIs:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c 
Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"left\", \"right\")]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(123456.789f, 23456.7890f)]\r\n    public float Max(float left, float right) =&gt; MathF.Max(left, right);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Max<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1.1936 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Max<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">0.2865 ns<\/td>\n<td style=\"text-align: right\">0.24<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Branching<\/h2>\n<p>Branching is integral to all meaningful code; while some algorithms are written in a branch-free manner, branch-free algorithms typically are challenging to get right and complicated to read, and typically are isolated to only small regions of code. For everything else, branching is the name of the game. Loops, if\/else blocks, ternaries&#8230; it&#8217;s hard to imagine any real code without them. Yet they can also represent one of the more significant costs in an application. Modern hardware gets big speed boosts from pipelining, for example from being able to start reading and decoding the next instruction while the previous ones are still processing. That, of course, relies on the hardware knowing what the next instruction is. If there&#8217;s no branching, that&#8217;s easy, it&#8217;s whatever instruction comes next in the sequence. 
For when there is branching, CPUs have built-in support in the form of branch predictors, used to determine what the next instruction most likely will be, and they&#8217;re often right&#8230; but when they&#8217;re wrong, the cost incurred from that incorrect branch prediction can be huge. Compilers thus strive to minimize branching.<\/p>\n<p>One way the impact of branches is reduced is by removing them completely. Redundant branch optimizers look for places where the compiler can prove that all paths leading to that branch will lead to the same outcome, such that the compiler can remove the branch and everything in the path not taken. Consider the following example:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private static readonly Random s_rand = new();\r\n    private readonly string _text = \"hello world!\";\r\n\r\n    [Params(1.0, 0.5)]\r\n    public double Probability { get; set; }\r\n\r\n    [Benchmark]\r\n    public ReadOnlySpan&lt;char&gt; TrySlice() =&gt; SliceOrDefault(_text.AsSpan(), s_rand.NextDouble() &lt; Probability ? 3 : 20);\r\n\r\n    [MethodImpl(MethodImplOptions.AggressiveInlining)]\r\n    public ReadOnlySpan&lt;char&gt; SliceOrDefault(ReadOnlySpan&lt;char&gt; span, int i)\r\n    {\r\n        if ((uint)i &lt; (uint)span.Length)\r\n        {\r\n            return span.Slice(i);\r\n        }\r\n\r\n        return default;\r\n    }\r\n}<\/code><\/pre>\n<p>Running that on .NET 7, we can glimpse into the impact of failed branch prediction. 
When we always take the branch the same way, the throughput is 2.5x what it was when it was impossible for the branch predictor to determine where we were going next:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Probability<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TrySlice<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">8.845 ns<\/td>\n<td style=\"text-align: right\">136 B<\/td>\n<\/tr>\n<tr>\n<td>TrySlice<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">3.436 ns<\/td>\n<td style=\"text-align: right\">136 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We can also use this example for a .NET 8 improvement. That guarded <code>ReadOnlySpan&lt;char&gt;.Slice<\/code> call has its own branch to ensure that <code>i<\/code> is within the bounds of the span; we can see that very clearly by looking at the disassembly generated on .NET 7:<\/p>\n<pre><code class=\"language-assembly\">; Tests.TrySlice()\r\n       push      rdi\r\n       push      rsi\r\n       push      rbp\r\n       push      rbx\r\n       sub       rsp,28\r\n       vzeroupper\r\n       mov       rdi,rcx\r\n       mov       rsi,rdx\r\n       mov       rcx,[rdi+8]\r\n       test      rcx,rcx\r\n       je        short M00_L01\r\n       lea       rbx,[rcx+0C]\r\n       mov       ebp,[rcx+8]\r\nM00_L00:\r\n       mov       rcx,1EBBFC01FA0\r\n       mov       rcx,[rcx]\r\n       mov       rcx,[rcx+8]\r\n       mov       rax,[rcx]\r\n       mov       rax,[rax+48]\r\n       call      qword ptr [rax+20]\r\n       vmovsd    xmm1,qword ptr [rdi+10]\r\n       vucomisd  xmm1,xmm0\r\n       ja        short M00_L02\r\n       mov       eax,14\r\n       jmp       short M00_L03\r\nM00_L01:\r\n       xor       ebx,ebx\r\n       xor       ebp,ebp\r\n       jmp       short M00_L00\r\nM00_L02:\r\n       mov       eax,3\r\nM00_L03:\r\n       cmp       eax,ebp\r\n       jae       short M00_L04\r\n       cmp       eax,ebp\r\n   
    ja        short M00_L06\r\n       mov       edx,eax\r\n       lea       rdx,[rbx+rdx*2]\r\n       sub       ebp,eax\r\n       jmp       short M00_L05\r\nM00_L04:\r\n       xor       edx,edx\r\n       xor       ebp,ebp\r\nM00_L05:\r\n       mov       [rsi],rdx\r\n       mov       [rsi+8],ebp\r\n       mov       rax,rsi\r\n       add       rsp,28\r\n       pop       rbx\r\n       pop       rbp\r\n       pop       rsi\r\n       pop       rdi\r\n       ret\r\nM00_L06:\r\n       call      qword ptr [7FF999FEB498]\r\n       int       3\r\n; Total bytes of code 136<\/code><\/pre>\n<p>In particular, look at <code>M00_L03<\/code>:<\/p>\n<pre><code class=\"language-assembly\">M00_L03:\r\n       cmp       eax,ebp\r\n       jae       short M00_L04\r\n       cmp       eax,ebp\r\n       ja        short M00_L06\r\n       mov       edx,eax\r\n       lea       rdx,[rbx+rdx*2]<\/code><\/pre>\n<p>At this point, either <code>3<\/code> or <code>20<\/code> (0x14) has been loaded into <code>eax<\/code>, and it&#8217;s being compared against <code>ebp<\/code>, which was loaded from the span&#8217;s <code>Length<\/code> earlier (<code>mov ebp,[rcx+8]<\/code>). There&#8217;s a very obvious redundant branch here, as the code does <code>cmp eax,ebp<\/code>, and then if it doesn&#8217;t jump as part of the <code>jae<\/code>, it does the exact same comparison again; the first is the one we wrote in <code>TrySlice<\/code>, the second is the one from <code>Slice<\/code> itself, which got inlined.<\/p>\n<p>On .NET 8, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/72979\">dotnet\/runtime#72979<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75804\">dotnet\/runtime#75804<\/a>, that branch (and many others of a similar ilk) is optimized away. 
We can run the exact same benchmark, this time on .NET 8, and if we look at the assembly at the corresponding code block (which isn&#8217;t numbered exactly the same because of other changes):<\/p>\n<pre><code class=\"language-assembly\">M00_L04:\r\n       cmp       eax,ebp\r\n       jae       short M00_L07\r\n       mov       ecx,eax\r\n       lea       rdx,[rdi+rcx*2]<\/code><\/pre>\n<p>we can see that, indeed, the redundant branch has been eliminated.<\/p>\n<p>Another way the overhead associated with branches (and branch misprediction) is removed is by avoiding them altogether. Sometimes simple bit manipulation tricks can be employed to avoid branches. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/62689\">dotnet\/runtime#62689<\/a> from <a href=\"https:\/\/github.com\/pedrobsaila\">@pedrobsaila<\/a>, for example, finds expressions like <code>i &gt;= 0 &amp;&amp; j &gt;= 0<\/code> for signed integers <code>i<\/code> and <code>j<\/code>, and rewrites them to the equivalent of <code>(i | j) &gt;= 0<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"i\", \"j\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42, 84)]\r\n    public bool BothGreaterThanOrEqualZero(int i, int j) =&gt; i &gt;= 0 &amp;&amp; j &gt;= 0;\r\n}<\/code><\/pre>\n<p>Here instead of code like we&#8217;d get on .NET 7, which involves a branch for the <code>&amp;&amp;<\/code>:<\/p>\n<pre><code class=\"language-assembly\">; Tests.BothGreaterThanOrEqualZero(Int32, Int32)\r\n       test      edx,edx\r\n       jl        short M00_L00\r\n       mov       eax,r8d\r\n       not       eax\r\n       shr       eax,1F\r\n       ret\r\nM00_L00:\r\n       xor   
    eax,eax\r\n       ret\r\n; Total bytes of code 16<\/code><\/pre>\n<p>now on .NET 8, the result is branchless:<\/p>\n<pre><code class=\"language-assembly\">; Tests.BothGreaterThanOrEqualZero(Int32, Int32)\r\n       or        edx,r8d\r\n       mov       eax,edx\r\n       not       eax\r\n       shr       eax,1F\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<p>Such bit tricks, however, only get you so far. To go further, both x86\/64 and Arm provide conditional move instructions, like <code>cmov<\/code> on x86\/64 and <code>csel<\/code> on Arm, that encapsulate the condition into the single instruction. For example, <code>csel<\/code> &#8220;conditionally selects&#8221; the value from one of two register arguments based on whether the condition is true or false and writes that value into the destination register. The instruction pipeline stays filled then because the instruction after the <code>csel<\/code> is always the next instruction; there&#8217;s no control flow that would result in a different instruction coming next.<\/p>\n<p>The JIT in .NET 8 is now capable of emitting conditional instructions, on both x86\/64 and Arm. 
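<\/p>\n<p>Both ideas discussed above can be checked at the source level. The sketch below first confirms, over a handful of sample values, that <code>(i | j) &gt;= 0<\/code> agrees with <code>i &gt;= 0 &amp;&amp; j &gt;= 0<\/code>, and then shows a mask-based select that computes the same result as a ternary without any control flow deciding the outcome, which is roughly what a conditional select does in a single instruction. The helper names are hypothetical and exist only for this illustration; whether the JIT actually emits <code>cmov<\/code> or <code>csel<\/code> for a given method depends on its own heuristics, so this demonstrates semantics rather than generated code.<\/p>\n<pre><code class=\"language-C#\">\/\/ Illustrative sketch only; these helper names are hypothetical, not .NET APIs.\r\nusing System;\r\n\r\nint[] samples = { int.MinValue, -84, -1, 0, 1, 42, int.MaxValue };\r\n\r\nforeach (int i in samples)\r\n{\r\n    foreach (int j in samples)\r\n    {\r\n        \/\/ Sign-bit trick: the OR is negative exactly when either input is negative.\r\n        if ((i &gt;= 0 &amp;&amp; j &gt;= 0) != ((i | j) &gt;= 0))\r\n            throw new InvalidOperationException($\"Sign trick mismatch for {i}, {j}\");\r\n\r\n        \/\/ The mask-based select must agree with the ternary for both conditions.\r\n        if (SelectMasked(true, i, j) != SelectTernary(true, i, j) ||\r\n            SelectMasked(false, i, j) != SelectTernary(false, i, j))\r\n            throw new InvalidOperationException($\"Select mismatch for {i}, {j}\");\r\n    }\r\n}\r\n\r\nConsole.WriteLine(SelectMasked(true, 3, 20));  \/\/ 3\r\nConsole.WriteLine(SelectMasked(false, 3, 20)); \/\/ 20\r\n\r\nstatic int SelectTernary(bool condition, int whenTrue, int whenFalse) =&gt;\r\n    condition ? whenTrue : whenFalse;\r\n\r\nstatic int SelectMasked(bool condition, int whenTrue, int whenFalse)\r\n{\r\n    \/\/ -1 (all bits set) when condition is true, 0 otherwise.\r\n    int mask = -Convert.ToInt32(condition);\r\n    return (whenTrue &amp; mask) | (whenFalse &amp; ~mask);\r\n}<\/code><\/pre>\n<p>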
With PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73472\">dotnet\/runtime#73472<\/a> from <a href=\"https:\/\/github.com\/a74nh\">@a74nh<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77728\">dotnet\/runtime#77728<\/a> from <a href=\"https:\/\/github.com\/a74nh\">@a74nh<\/a>, the JIT gains an additional &#8220;if conversion&#8221; optimization phase, where various conditional patterns are recognized and morphed into conditional nodes in the JIT&#8217;s internal representation; these can then later be emitted as conditional instructions, as was done by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78879\">dotnet\/runtime#78879<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81267\">dotnet\/runtime#81267<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82235\">dotnet\/runtime#82235<\/a>,  <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82766\">dotnet\/runtime#82766<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83089\">dotnet\/runtime#83089<\/a>. 
Other PRs, like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84926\">dotnet\/runtime#84926<\/a> from <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82031\">dotnet\/runtime#82031<\/a> from <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> optimized which exact instructions would be employed, in these cases using the Arm <a href=\"https:\/\/developer.arm.com\/documentation\/ddi0596\/2021-12\/Base-Instructions\/CINV--Conditional-Invert--an-alias-of-CSINV-\"><code>cinv<\/code><\/a> and <a href=\"https:\/\/developer.arm.com\/documentation\/ddi0596\/2021-12\/Base-Instructions\/CINC--Conditional-Increment--an-alias-of-CSINC-\"><code>cinc<\/code><\/a> instructions.<\/p>\n<p>We can see all this in a simple benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private static readonly Random s_rand = new();\r\n\r\n    [Params(1.0, 0.5)]\r\n    public double Probability { get; set; }\r\n\r\n    [Benchmark]\r\n    public FileOptions GetOptions() =&gt; GetOptions(s_rand.NextDouble() &lt; Probability);\r\n\r\n    private static FileOptions GetOptions(bool useAsync) =&gt; useAsync ? 
FileOptions.Asynchronous : FileOptions.None;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Probability<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 7.0<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">7.952 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<\/tr>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 8.0<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">2.327 ns<\/td>\n<td style=\"text-align: right\">0.29<\/td>\n<td style=\"text-align: right\">86 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 7.0<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">2.587 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<\/tr>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 8.0<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">2.357 ns<\/td>\n<td style=\"text-align: right\">0.91<\/td>\n<td style=\"text-align: right\">86 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Two things to notice:<\/p>\n<ol>\n<li>In .NET 7, the cost with a probability of 0.5 is 3x that of when it had a probability of 1.0, due to the branch predictor not being able to successfully predict which way the actual branch would go.<\/li>\n<li>In .NET 8, it doesn&#8217;t matter whether the probability is 0.5 or 1: the cost is the same (and cheaper than on .NET 7).<\/li>\n<\/ol>\n<p>We can also look at the generated assembly to see the difference. 
In particular, on .NET 8, we see this for the generated assembly:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetOptions()\r\n       push      rbx\r\n       sub       rsp,20\r\n       vzeroupper\r\n       mov       rbx,rcx\r\n       mov       rcx,2C54EC01E40\r\n       mov       rcx,[rcx]\r\n       mov       rcx,[rcx+8]\r\n       mov       rax,offset MT_System.Random+XoshiroImpl\r\n       cmp       [rcx],rax\r\n       jne       short M00_L01\r\n       call      qword ptr [7FFA2D790C88]; System.Random+XoshiroImpl.NextDouble()\r\nM00_L00:\r\n       vmovsd    xmm1,qword ptr [rbx+8]\r\n       mov       eax,40000000\r\n       xor       ecx,ecx\r\n       vucomisd  xmm1,xmm0\r\n       cmovbe    eax,ecx\r\n       add       rsp,20\r\n       pop       rbx\r\n       ret\r\nM00_L01:\r\n       mov       rax,[rcx]\r\n       mov       rax,[rax+48]\r\n       call      qword ptr [rax+20]\r\n       jmp       short M00_L00\r\n; Total bytes of code 86<\/code><\/pre>\n<p>That <code>vucomisd; cmovbe<\/code> sequence in there is the comparison between the randomly-generated floating-point value and the probability threshold followed by the conditional move (&#8220;conditionally move if below or equal&#8221;).<\/p>\n<p>There are many methods that implicitly benefit from these transformations. 
Take even a simple method, like <code>Math.Max<\/code>, whose code I&#8217;ve copied here:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public int Max() =&gt; Max(1, 2);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    public static int Max(int val1, int val2)\r\n    {\r\n        return (val1 &gt;= val2) ? val1 : val2;\r\n    }\r\n}<\/code><\/pre>\n<p>That pattern should look familiar. Here&#8217;s the assembly we get on .NET 7:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Max(Int32, Int32)\r\n       cmp       ecx,edx\r\n       jge       short M01_L00\r\n       mov       eax,edx\r\n       ret\r\nM01_L00:\r\n       mov       eax,ecx\r\n       ret\r\n; Total bytes of code 10<\/code><\/pre>\n<p>The two arguments come in via the <code>ecx<\/code> and <code>edx<\/code> registers. They&#8217;re compared, and if the first argument is greater than or equal to the second, it jumps down to the bottom where the first argument is moved into <code>eax<\/code> as the return value; if it wasn&#8217;t, then the second value is moved into <code>eax<\/code>. And on .NET 8:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Max(Int32, Int32)\r\n       cmp       ecx,edx\r\n       mov       eax,edx\r\n       cmovge    eax,ecx\r\n       ret\r\n; Total bytes of code 8<\/code><\/pre>\n<p>Again the two arguments come in via the <code>ecx<\/code> and <code>edx<\/code> registers, and they&#8217;re compared. The second argument is then moved into <code>eax<\/code> as the return value. 
If the comparison showed that the first argument was greater than or equal to the second, it&#8217;s then moved into <code>eax<\/code> (overwriting the second argument that was just moved there). Fun.<\/p>\n<p>Note that if you ever find yourself wanting to do a deeper dive into this area, BenchmarkDotNet has some excellent additional tools at your disposal. On Windows, it enables you to collect hardware counters, which expose a wealth of information about how things actually executed on the hardware, whether it be number of instructions retired, cache misses, or branch mispredictions. To use it, add another package reference to your .csproj:<\/p>\n<pre><code class=\"language-xml\">&lt;PackageReference Include=\"BenchmarkDotNet.Diagnostics.Windows\" Version=\"0.13.8\" \/&gt;<\/code><\/pre>\n<p>and add an additional attribute to your tests class:<\/p>\n<pre><code class=\"language-C#\">[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]<\/code><\/pre>\n<p>Then make sure you&#8217;re running the benchmarks from an elevated \/ admin terminal. 
When I do that, now I see this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Probability<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">BranchMispredictions\/Op<\/th>\n<th style=\"text-align: right\">BranchInstructions\/Op<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 7.0<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">8.585 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1<\/td>\n<td style=\"text-align: right\">5<\/td>\n<\/tr>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 8.0<\/td>\n<td>0.5<\/td>\n<td style=\"text-align: right\">2.488 ns<\/td>\n<td style=\"text-align: right\">0.29<\/td>\n<td style=\"text-align: right\">0<\/td>\n<td style=\"text-align: right\">4<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 7.0<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">2.783 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">0<\/td>\n<td style=\"text-align: right\">4<\/td>\n<\/tr>\n<tr>\n<td>GetOptions<\/td>\n<td>.NET 8.0<\/td>\n<td>1<\/td>\n<td style=\"text-align: right\">2.531 ns<\/td>\n<td style=\"text-align: right\">0.91<\/td>\n<td style=\"text-align: right\">0<\/td>\n<td style=\"text-align: right\">4<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We can see it confirms what we already knew: on .NET 7 with a 0.5 probability, it ends up mispredicting a branch.<\/p>\n<p>The C# compiler (aka &#8220;Roslyn&#8221;) also gets in on the branch-elimination game in .NET 8, for a very specific kind of branch. 
In .NET, while we think of <code>System.Boolean<\/code> as only being a two-value type (<code>false<\/code> and <code>true<\/code>), <code>sizeof(bool)<\/code> is actually one byte. That means a <code>bool<\/code> can technically have 256 different values, where 0 is considered <code>false<\/code> and [1,255] are all considered <code>true<\/code>. Thankfully, unless a developer is poking around the edges of interop or otherwise using <code>unsafe<\/code> code to purposefully manipulate these other values, developers can remain blissfully unaware of the actual numeric value here, for two reasons. First, C# doesn&#8217;t consider <code>bool<\/code> to be a numerical type, and thus you can&#8217;t perform arithmetic on it or cast it to a type like <code>int<\/code>. Second, all of the <code>bool<\/code>s produced by the runtime and C# are normalized to actually be 0 or 1 in value, e.g. a <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/system.reflection.emit.opcodes.cgt\"><code>cgt<\/code><\/a> IL instruction is documented as &#8220;If value1 is greater than value2, 1 is pushed onto the stack; otherwise 0 is pushed onto the stack.&#8221; There is a class of algorithms, however, where being able to rely on such 0 and 1 values is handy, and we were just talking about them: branch-free algorithms.<\/p>\n<p>Let&#8217;s say we didn&#8217;t have the JIT&#8217;s new-found ability to use conditional moves and we wanted to write our own <code>ConditionalSelect<\/code> method for integers:<\/p>\n<pre><code class=\"language-C#\">static int ConditionalSelect(bool condition, int whenTrue, int whenFalse);<\/code><\/pre>\n<p><em>If<\/em> we could rely on <code>bool<\/code> always being 0 or 1 (we can&#8217;t), and <em>if<\/em> we could do arithmetic on a <code>bool<\/code> (we can&#8217;t), then we could use the behavior of multiplication to implement our <code>ConditionalSelect<\/code> function. 
Anything multiplied by 0 is 0, and anything multiplied by 1 is itself, so we could write our <code>ConditionalSelect<\/code> like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ pseudo-code; this won't compile!\r\nstatic int ConditionalSelect(bool condition, int whenTrue, int whenFalse) =&gt;\r\n    (whenTrue  *  condition) +\r\n    (whenFalse * !condition);<\/code><\/pre>\n<p>Then if <code>condition<\/code> is 1, <code>whenTrue * condition<\/code> would be <code>whenTrue<\/code> and <code>whenFalse * !condition<\/code> would be 0, such that the whole expression would evaluate to <code>whenTrue<\/code>. And, conversely, if <code>condition<\/code> is 0, <code>whenTrue * condition<\/code> would be 0 and <code>whenFalse * !condition<\/code> would be <code>whenFalse<\/code>, such that the whole expression would evaluate to <code>whenFalse<\/code>. As noted, though, we can&#8217;t write the above, but we could write this:<\/p>\n<pre><code class=\"language-C#\">static int ConditionalSelect(bool condition, int whenTrue, int whenFalse) =&gt;\r\n    (whenTrue  * (condition ? 1 : 0)) +\r\n    (whenFalse * (condition ? 0 : 1));<\/code><\/pre>\n<p>That provides the exact semantics we want&#8230; but we&#8217;ve introduced two branches into our supposedly branch-free algorithm. 
This is the IL produced for that <code>ConditionalSelect<\/code> in .NET 7:<\/p>\n<pre><code class=\"language-assembly\">.method private hidebysig static  int32 ConditionalSelect (bool condition, int32 whenTrue, int32 whenFalse) cil managed \r\n{\r\n    .maxstack 8\r\n\r\n    IL_0000: ldarg.1\r\n    IL_0001: ldarg.0\r\n    IL_0002: brtrue.s IL_0007\r\n\r\n    IL_0004: ldc.i4.0\r\n    IL_0005: br.s IL_0008\r\n\r\n    IL_0007: ldc.i4.1\r\n\r\n    IL_0008: mul\r\n    IL_0009: ldarg.2\r\n    IL_000a: ldarg.0\r\n    IL_000b: brtrue.s IL_0010\r\n\r\n    IL_000d: ldc.i4.1\r\n    IL_000e: br.s IL_0011\r\n\r\n    IL_0010: ldc.i4.0\r\n\r\n    IL_0011: mul\r\n    IL_0012: add\r\n    IL_0013: ret\r\n}<\/code><\/pre>\n<p>Note all those <code>brtrue.s<\/code> and <code>br.s<\/code> instructions in there. Are they necessary, though? Earlier I noted that the runtime will only produce <code>bool<\/code>s with a value of 0 or 1. And thanks to <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/67191\">dotnet\/roslyn#67191<\/a>, the C# compiler now recognizes that and optimizes the pattern <code>(b ? 1 : 0)<\/code> to be branchless. Our same <code>ConditionalSelect<\/code> function now in .NET 8 compiles to this:<\/p>\n<pre><code class=\"language-assembly\">.method private hidebysig static  int32 ConditionalSelect (bool condition, int32 whenTrue, int32 whenFalse) cil managed \r\n{\r\n    .maxstack 8\r\n\r\n    IL_0000: ldarg.1\r\n    IL_0001: ldarg.0\r\n    IL_0002: ldc.i4.0\r\n    IL_0003: cgt.un\r\n    IL_0005: mul\r\n    IL_0006: ldarg.2\r\n    IL_0007: ldarg.0\r\n    IL_0008: ldc.i4.0\r\n    IL_0009: ceq\r\n    IL_000b: mul\r\n    IL_000c: add\r\n    IL_000d: ret\r\n}<\/code><\/pre>\n<p>Zero branch instructions. Of course, you wouldn&#8217;t actually want to write this function like this anymore; just because it&#8217;s branch-free doesn&#8217;t mean it&#8217;s the most efficient. 
On .NET 8, here&#8217;s the assembly code produced by the JIT for the above:<\/p>\n<pre><code class=\"language-assembly\">       movzx    rax, cl\r\n       xor      ecx, ecx\r\n       test     eax, eax\r\n       setne    cl\r\n       imul     ecx, edx\r\n       test     eax, eax\r\n       sete     al\r\n       movzx    rax, al\r\n       imul     eax, r8d\r\n       add      eax, ecx\r\n       ret  <\/code><\/pre>\n<p>whereas if you just wrote it as:<\/p>\n<pre><code class=\"language-C#\">static int ConditionalSelect(bool condition, int whenTrue, int whenFalse) =&gt;\r\n    condition ? whenTrue : whenFalse;<\/code><\/pre>\n<p>here&#8217;s what you&#8217;d get:<\/p>\n<pre><code class=\"language-assembly\">       test     cl, cl\r\n       mov      eax, r8d\r\n       cmovne   eax, edx\r\n       ret    <\/code><\/pre>\n<p>Even so, this C# compiler optimization is useful for other branch-free algorithms. Let&#8217;s say I wanted to write a <code>Compare<\/code> method that would compare two <code>int<\/code>s, returning -1 if the first is less than the second, 0 if they&#8217;re equal, and 1 if the first is greater than the second. I could write that like this:<\/p>\n<pre><code class=\"language-C#\">static int Compare(int x, int y)\r\n{\r\n    if (x &lt; y) return -1;\r\n    if (x &gt; y) return 1;\r\n    return 0;\r\n}<\/code><\/pre>\n<p>Simple, but every invocation will incur at least one branch, if not two. With the <code>(b ? 1 : 0)<\/code> optimization, we can instead write it like this:<\/p>\n<pre><code class=\"language-C#\">static int Compare(int x, int y)\r\n{\r\n    int gt = (x &gt; y) ? 1 : 0;\r\n    int lt = (x &lt; y) ? 
1 : 0;\r\n    return gt - lt;\r\n}<\/code><\/pre>\n<p>This is now branch-free, with the C# compiler producing:<\/p>\n<pre><code class=\"language-assembly\">    IL_0000: ldarg.0\r\n    IL_0001: ldarg.1\r\n    IL_0002: cgt\r\n    IL_0004: ldarg.0\r\n    IL_0005: ldarg.1\r\n    IL_0006: clt\r\n    IL_0008: stloc.0\r\n    IL_0009: ldloc.0\r\n    IL_000a: sub\r\n    IL_000b: ret<\/code><\/pre>\n<p>and, from that, the JIT producing:<\/p>\n<pre><code class=\"language-assembly\">       xor      eax, eax\r\n       cmp      ecx, edx\r\n       setg     al\r\n       setl     cl\r\n       movzx    rcx, cl\r\n       sub      eax, ecx\r\n       ret      <\/code><\/pre>\n<p>Does that mean that everyone should now be running to rewrite their algorithms in a branch-free manner? Most definitely not. It&#8217;s another tool in your tool belt, and in some cases it&#8217;s quite beneficial, especially when it can provide more consistent throughput results due to doing the same work regardless of outcome. It&#8217;s not always a win, however, and in general it&#8217;s best not to try to outsmart the compiler. Take the example we just looked at. There&#8217;s a function with that exact implementation in the core libraries: <code>int.CompareTo<\/code>. And if you look at its implementation in .NET 8, you&#8217;ll find that it&#8217;s still using the branch-based implementation. Why? Because it often yields better results, in particular in the common case where the operation gets inlined and the JIT is able to combine the branches in the <code>CompareTo<\/code> method with ones based on processing the result of <code>CompareTo<\/code>. Most uses of <code>CompareTo<\/code> involve additional branching based on its result, such as in a quick sort partitioning step that&#8217;s deciding whether to move elements. 
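<\/p>\n<p>Before comparing how the two shapes compile, it can be reassuring to confirm that they really are interchangeable. The following is a minimal sketch rather than code from the core libraries; the <code>CompareDemo<\/code> class, its method names, and the sample values are hypothetical and chosen only for illustration. It checks that the branching and branch-free forms return the same result for every pair drawn from a handful of edge-case values:<\/p>\n<pre><code class=\"language-C#\">using System;\r\n\r\n\/\/ Hypothetical demo type; not part of the core libraries.\r\npublic static class CompareDemo\r\n{\r\n    \/\/ Branch-based form: returns -1, 0, or 1.\r\n    public static int CompareBranching(int x, int y)\r\n    {\r\n        if (x &lt; y) return -1;\r\n        if (x &gt; y) return 1;\r\n        return 0;\r\n    }\r\n\r\n    \/\/ Branch-free form relying on the (b ? 1 : 0) pattern.\r\n    public static int CompareBranchless(int x, int y)\r\n    {\r\n        int gt = (x &gt; y) ? 1 : 0;\r\n        int lt = (x &lt; y) ? 1 : 0;\r\n        return gt - lt;\r\n    }\r\n\r\n    public static void Main()\r\n    {\r\n        \/\/ Illustrative sample values, including both extremes of the int range.\r\n        int[] samples = { int.MinValue, -1, 0, 1, int.MaxValue };\r\n\r\n        foreach (int x in samples)\r\n        {\r\n            foreach (int y in samples)\r\n            {\r\n                if (CompareBranching(x, y) != CompareBranchless(x, y))\r\n                {\r\n                    throw new InvalidOperationException($\"Mismatch for ({x}, {y})\");\r\n                }\r\n            }\r\n        }\r\n\r\n        Console.WriteLine(\"Both implementations agree on all sample pairs.\");\r\n    }\r\n}<\/code><\/pre>\n<p>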
So let&#8217;s take an example where code makes a decision based on the result of such a comparison:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private int _x = 2, _y = 1;\r\n\r\n    [Benchmark]\r\n    public int GreaterThanOrEqualTo_Branching()\r\n    {\r\n        if (Compare_Branching(_x, _y) &gt;= 0)\r\n        {\r\n            return _x * 2;\r\n        }\r\n\r\n        return _y * 3;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int GreaterThanOrEqualTo_Branchless()\r\n    {\r\n        if (Compare_Branchless(_x, _y) &gt;= 0)\r\n        {\r\n            return _x * 2;\r\n        }\r\n\r\n        return _y * 3;\r\n    }\r\n\r\n    private static int Compare_Branching(int x, int y)\r\n    {\r\n        if (x &lt; y) return -1;\r\n        if (x &gt; y) return 1;\r\n        return 0;\r\n    }\r\n\r\n    private static int Compare_Branchless(int x, int y)\r\n    {\r\n        int gt = (x &gt; y) ? 1 : 0;\r\n        int lt = (x &lt; y) ? 1 : 0;\r\n        return gt - lt;\r\n    }\r\n}<\/code><\/pre>\n<p>And the resulting assembly:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/BranchingWins.png\" alt=\"Branching vs Branchless Assembly Difference\" \/><\/p>\n<p>Note that both implementations now have just one branch (a <code>jl<\/code> in the &#8220;branching&#8221; case and a <code>js<\/code> in the &#8220;branchless&#8221; case), <em>and<\/em> the &#8220;branching&#8221; implementation results in less assembly code.<\/p>\n<h2>Bounds Checking<\/h2>\n<p>Arrays, strings, and spans are all bounds checked by the runtime. 
That means that indexing into one of these data structures incurs validation to ensure that the index is within the bounds of the data structure. For example, the <code>Get(byte[],int)<\/code> method here:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private byte[] _array = new byte[8];\r\n    private int _index = 4;\r\n\r\n    [Benchmark]\r\n    public void Get() =&gt; Get(_array, _index);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static byte Get(byte[] array, int index) =&gt; array[index];\r\n}<\/code><\/pre>\n<p>results in this code being generated for the method:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Get(Byte[], Int32)\r\n       sub       rsp,28\r\n       cmp       edx,[rcx+8]\r\n       jae       short M01_L00\r\n       mov       eax,edx\r\n       movzx     eax,byte ptr [rcx+rax+10]\r\n       add       rsp,28\r\n       ret\r\nM01_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 27<\/code><\/pre>\n<p>Here, the <code>byte[]<\/code> is passed in <code>rcx<\/code>, the <code>int index<\/code> is in <code>edx<\/code>, and the code is comparing the value of the index against the value stored at an 8-byte offset from the beginning of the array: that&#8217;s where the array&#8217;s length is stored. 
The <code>jae<\/code> instruction (jump if above or equal) is an unsigned comparison, such that if <code>(uint)index &gt;= (uint)array.Length<\/code>, it&#8217;ll jump to <code>M01_L00<\/code>, where we see a call to a helper function <code>CORINFO_HELP_RNGCHKFAIL<\/code> that will throw an <code>IndexOutOfRangeException<\/code>. All of that is the &#8220;bounds check.&#8221; The actual access into the array is the two <code>mov<\/code> and <code>movzx<\/code> instructions, where the <code>index<\/code> is moved into <code>eax<\/code>, and then the value located at <code>rcx<\/code> (the address of the array) + <code>rax<\/code> (the index) + 0x10 (the offset of the start of the data in the array) is moved into the return <code>eax<\/code> register.<\/p>\n<p>It&#8217;s the runtime&#8217;s responsibility to ensure that all accesses are guaranteed in bounds. It can do so with a bounds check. But it can also do so by proving that the index is always in range, in which case it can elide adding a bounds check that would only add overhead and provide zero benefit. Every .NET release, the JIT improves its ability to recognize patterns that don&#8217;t need a bounds check added because there&#8217;s no way the access could be out of range. And .NET 8 is no exception, with it learning several new and valuable tricks.<\/p>\n<p>One such trick comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84231\">dotnet\/runtime#84231<\/a>, where it learns how to avoid bounds checks in a pattern that&#8217;s very prevalent in collections, in particular in hash tables. In a hash table, you generally compute a hash code for a key and then use that hash code to index into an array (often referred to as &#8220;buckets&#8221;). 
As the hash code might be any <code>int<\/code> and the buckets array is invariably going to be much smaller than the full range of a 32-bit integer, all of the hash codes need to be mapped down to an element in the array, and a good way to do that is by mod&#8217;ing the hash code by the array&#8217;s length, e.g.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private readonly int[] _array = new int[7];\r\n\r\n    [Benchmark]\r\n    public int GetBucket() =&gt; GetBucket(_array, 42);\r\n\r\n    private static int GetBucket(int[] buckets, int hashcode) =&gt;\r\n        buckets[(uint)hashcode % buckets.Length];\r\n}<\/code><\/pre>\n<p>In .NET 7, that produces:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetBucket()\r\n       sub       rsp,28\r\n       mov       rcx,[rcx+8]\r\n       mov       eax,2A\r\n       mov       edx,[rcx+8]\r\n       mov       r8d,edx\r\n       xor       edx,edx\r\n       idiv      r8\r\n       cmp       rdx,r8\r\n       jae       short M00_L00\r\n       mov       eax,[rcx+rdx*4+10]\r\n       add       rsp,28\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 44<\/code><\/pre>\n<p>Note the <code>CORINFO_HELP_RNGCHKFAIL<\/code>, the tell-tale sign of a bounds check. 
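<\/p>\n<p>The property that makes such a check unnecessary can be sketched in isolation. The following is a minimal illustration, not code from the runtime; <code>BucketIndexDemo<\/code> and <code>ToBucketIndex<\/code> are hypothetical names, and the sketch assumes a non-zero length. It shows that reinterpreting any <code>int<\/code> hash code as a <code>uint<\/code> before taking the remainder always yields an index in <code>[0, length)<\/code>, even when the hash code is negative:<\/p>\n<pre><code class=\"language-C#\">using System;\r\n\r\n\/\/ Hypothetical demo type; not part of the runtime.\r\npublic static class BucketIndexDemo\r\n{\r\n    \/\/ Maps any hash code onto [0, length). Both operands are cast to uint here to keep\r\n    \/\/ the arithmetic 32-bit; the benchmark above leaves the length as an int, which makes\r\n    \/\/ C# promote both operands to long but produces the same index for a positive length.\r\n    public static int ToBucketIndex(int hashCode, int length) =&gt;\r\n        (int)((uint)hashCode % (uint)length);\r\n\r\n    public static void Main()\r\n    {\r\n        int[] buckets = new int[7];\r\n\r\n        \/\/ Illustrative hash codes, including negative values and both extremes.\r\n        foreach (int hashCode in new[] { 42, 0, -1, int.MinValue, int.MaxValue })\r\n        {\r\n            int index = ToBucketIndex(hashCode, buckets.Length);\r\n\r\n            \/\/ The index is always usable; this access can never throw.\r\n            Console.WriteLine($\"{hashCode} -&gt; {index} (bucket value {buckets[index]})\");\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>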
Now in .NET 8, the JIT recognizes that it&#8217;s impossible for a <code>uint<\/code> value <code>%<\/code>&#8216;d by an array&#8217;s length to be out of bounds of that array; either the array&#8217;s <code>Length<\/code> is greater than 0, in which case the result of the <code>%<\/code> will always be <code>&gt;= 0<\/code> and <code>&lt; array.Length<\/code>, or the <code>Length<\/code> is 0, and <code>% 0<\/code> will throw an exception. As such, it can elide the bounds check:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetBucket()\r\n       mov       rcx,[rcx+8]\r\n       mov       eax,2A\r\n       mov       r8d,[rcx+8]\r\n       xor       edx,edx\r\n       div       r8\r\n       mov       eax,[rcx+rdx*4+10]\r\n       ret\r\n; Total bytes of code 23<\/code><\/pre>\n<p>Now consider this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private readonly string _s = \"\\\"Hello, World!\\\"\";\r\n\r\n    [Benchmark]\r\n    public bool IsQuoted() =&gt; IsQuoted(_s);\r\n\r\n    private static bool IsQuoted(string s) =&gt;\r\n        s.Length &gt;= 2 &amp;&amp; s[0] == '\"' &amp;&amp; s[^1] == '\"';\r\n}<\/code><\/pre>\n<p>Our function is checking to see whether the supplied string begins and ends with a quote. It needs to be at least two characters long, and the first and last characters need to be quotes (<code>s[^1]<\/code> is shorthand for and expanded by the C# compiler into the equivalent of <code>s[s.Length - 1]<\/code>). 
Here&#8217;s the .NET 7 assembly:<\/p>\n<pre><code class=\"language-assembly\">; Tests.IsQuoted(System.String)\r\n       sub       rsp,28\r\n       mov       eax,[rcx+8]\r\n       cmp       eax,2\r\n       jl        short M01_L00\r\n       cmp       word ptr [rcx+0C],22\r\n       jne       short M01_L00\r\n       lea       edx,[rax-1]\r\n       cmp       edx,eax\r\n       jae       short M01_L01\r\n       mov       eax,edx\r\n       cmp       word ptr [rcx+rax*2+0C],22\r\n       sete      al\r\n       movzx     eax,al\r\n       add       rsp,28\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       add       rsp,28\r\n       ret\r\nM01_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 58<\/code><\/pre>\n<p>Note that our function is indexing into the string twice, and the assembly does have a <code>call CORINFO_HELP_RNGCHKFAIL<\/code> at the end of the method, but there&#8217;s only one <code>jae<\/code> referring to the location of that <code>call<\/code>. That&#8217;s because the JIT already knows to avoid the bounds check on the <code>s[0]<\/code> access: it sees that it&#8217;s already been verified that the string&#8217;s <code>Length &gt;= 2<\/code>, so it&#8217;s safe to index without a bounds check into any index <code>&lt; 2<\/code>. But, we do still have the bounds check for the <code>s[s.Length - 1]<\/code>. 
Now in .NET 8, we get this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.IsQuoted(System.String)\r\n       mov       eax,[rcx+8]\r\n       cmp       eax,2\r\n       jl        short M01_L00\r\n       cmp       word ptr [rcx+0C],22\r\n       jne       short M01_L00\r\n       dec       eax\r\n       cmp       word ptr [rcx+rax*2+0C],22\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       ret\r\n; Total bytes of code 33<\/code><\/pre>\n<p>Note the distinct lack of the <code>call CORINFO_HELP_RNGCHKFAIL<\/code>; no more bounds checks. Not only did the JIT recognize that <code>s[0]<\/code> is safe because <code>s.Length &gt;= 2<\/code>, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84213\">dotnet\/runtime#84213<\/a> it also recognized that since <code>s.Length &gt;= 2<\/code>, <code>s.Length - 1<\/code> is <code>&gt;= 0<\/code> and <code>&lt; s.Length<\/code>, which means it&#8217;s in-bounds and thus no range check is needed.<\/p>\n<h2>Constant Folding<\/h2>\n<p>Another important operation employed by compilers is constant folding (and the closely related constant propagation). Constant folding is just a fancy name for a compiler evaluating expressions at compile-time, e.g. if you have <code>2 * 3<\/code>, rather than emitting a multiplication instruction, it can just do the multiplication at compile-time and substitute <code>6<\/code>. Constant propagation is then the act of taking that new constant and using it anywhere this expression&#8217;s result feeds, e.g. if you have:<\/p>\n<pre><code class=\"language-C#\">int a = 2 * 3;\r\nint b = a * 4;<\/code><\/pre>\n<p>a compiler can instead pretend it was:<\/p>\n<pre><code class=\"language-C#\">int a = 6;\r\nint b = 24;<\/code><\/pre>\n<p>I bring this up here, after we just talked about bounds-check elimination, because there are scenarios where constant folding and bounds check elimination go hand-in-hand. 
If we can determine a data structure&#8217;s length at compile-time, and we can determine an index at compile-time, then also at compile-time we can determine whether the index is in bounds and avoid the bounds check. We can also take it further: if we can determine not only the data structure&#8217;s length but also its contents, then we can do the indexing at compile-time and substitute the value from the data structure.<\/p>\n<p>Consider this example, which is similar in nature to the kind of code types often have in their <code>ToString<\/code> or <code>TryFormat<\/code> implementations:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public string Format(int value) =&gt; Format(value, \"B\");\r\n\r\n    [MethodImpl(MethodImplOptions.AggressiveInlining)]\r\n    static string Format(int value, ReadOnlySpan&lt;char&gt; format)\r\n    {\r\n        if (format.Length == 1)\r\n        {\r\n            switch (format[0] | 0x20)\r\n            {\r\n                case 'd': return DecimalFormat(value);\r\n                case 'x': return HexFormat(value);\r\n                case 'b': return BinaryFormat(value);\r\n            }\r\n        }\r\n\r\n        return FallbackFormat(value, format);\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)] private static string DecimalFormat(int value) =&gt; null;\r\n    [MethodImpl(MethodImplOptions.NoInlining)] private static string HexFormat(int value) =&gt; null;\r\n    [MethodImpl(MethodImplOptions.NoInlining)] private static string BinaryFormat(int value) =&gt; null;\r\n    
[MethodImpl(MethodImplOptions.NoInlining)] private static string FallbackFormat(int value, ReadOnlySpan&lt;char&gt; format) =&gt; null;\r\n}<\/code><\/pre>\n<p>We have a <code>Format(int value, ReadOnlySpan&lt;char&gt; format)<\/code> method for formatting the <code>int<\/code> value according to the specified <code>format<\/code>. The call site is explicit about the format to use, as many such call sites are, explicitly passing <code>\"B\"<\/code> here. The implementation is then special-casing formats that are one-character long and match in an ignore-case manner against one of three known formats (it&#8217;s using an ASCII trick based on the values of the lowercase letters being one bit different from their uppercase counterparts, such that <code>OR<\/code>&#8216;ing an uppercase ASCII letter with <code>0x20<\/code> lowercases it). If we look at the assembly generated for this method in .NET 7, we get this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Format(Int32)\r\n       sub       rsp,38\r\n       xor       eax,eax\r\n       mov       [rsp+28],rax\r\n       mov       ecx,edx\r\n       mov       rax,251C4801418\r\n       mov       rax,[rax]\r\n       add       rax,0C\r\n       movzx     edx,word ptr [rax]\r\n       or        edx,20\r\n       cmp       edx,62\r\n       je        short M00_L01\r\n       cmp       edx,64\r\n       je        short M00_L00;\r\n       cmp       edx,78\r\n       jne       short M00_L02\r\n       call      qword ptr [7FFF3DD47918]; Tests.HexFormat(Int32)\r\n       jmp       short M00_L03\r\nM00_L00:\r\n       call      qword ptr [7FFF3DD47900]; Tests.DecimalFormat(Int32)\r\n       jmp       short M00_L03\r\nM00_L01:\r\n       call      qword ptr [7FFF3DD47930]; Tests.BinaryFormat(Int32)\r\n       jmp       short M00_L03\r\nM00_L02:\r\n       mov       [rsp+28],rax\r\n       mov       dword ptr [rsp+30],1\r\n       lea       rdx,[rsp+28]\r\n       call      qword ptr [7FFF3DD47948]; Tests.FallbackFormat\r\nM00_L03:\r\n       
nop\r\n       add       rsp,38\r\n       ret\r\n; Total bytes of code 105<\/code><\/pre>\n<p>We can see the code here from <code>Format(Int32, ReadOnlySpan&lt;char&gt;)<\/code> but this is the code for <code>Format(Int32)<\/code>, so the callee was successfully inlined. We also don&#8217;t see any code for the <code>format.Length == 1<\/code> check (the first <code>cmp<\/code> is part of the <code>switch<\/code>), nor do we see any signs of a bounds check (there&#8217;s no <code>call CORINFO_HELP_RNGCHKFAIL<\/code>). We do, however, see it loading the first character from <code>format<\/code>:<\/p>\n<pre><code class=\"language-assembly\">mov       rax,251C4801418       ; loads the address of where the format const string reference is stored\r\nmov       rax,[rax]             ; loads the address of format\r\nadd       rax,0C                ; computes the address of format's first character\r\nmovzx     edx,word ptr [rax]    ; reads the first character of format<\/code><\/pre>\n<p>and then using the equivalent of a cascading <code>if<\/code>\/<code>else<\/code>. Now let&#8217;s look at .NET 8:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Format(Int32)\r\n       sub       rsp,28\r\n       mov       ecx,edx\r\n       call      qword ptr [7FFEE0BAF4C8]; Tests.BinaryFormat(Int32)\r\n       nop\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 18<\/code><\/pre>\n<p>Whoa. It not only saw that <code>format<\/code>&#8217;s <code>Length<\/code> was 1 and not only was able to avoid the bounds check, it actually read the first character, lowercased it, and matched it against all the <code>switch<\/code> branches, such that the entire operation was constant folded and propagated away, leaving just a call to <code>BinaryFormat<\/code>. 
That&#8217;s primarily thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78593\">dotnet\/runtime#78593<\/a>.<\/p>\n<p>There are a multitude of other such improvements, such as <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77593\">dotnet\/runtime#77593<\/a>, which enables the JIT to constant fold the length of a <code>string<\/code> or <code>T[]<\/code> stored in a <code>static readonly<\/code> field. Consider:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private static readonly string s_newline = Environment.NewLine;\r\n\r\n    [Benchmark]\r\n    public bool IsLineFeed() =&gt; s_newline.Length == 1 &amp;&amp; s_newline[0] == '\\n';\r\n}<\/code><\/pre>\n<p>On .NET 7, I get the following assembly:<\/p>\n<pre><code class=\"language-assembly\">; Tests.IsLineFeed()\r\n       mov       rax,18AFF401F78\r\n       mov       rax,[rax]\r\n       mov       edx,[rax+8]\r\n       cmp       edx,1\r\n       jne       short M00_L00\r\n       cmp       word ptr [rax+0C],0A\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\nM00_L00:\r\n       xor       eax,eax\r\n       ret\r\n; Total bytes of code 36<\/code><\/pre>\n<p>This is effectively a 1:1 translation of the C#, with not much interesting happening: it loads the string from <code>s_newline<\/code>, and compares its <code>Length<\/code> to 1; if it doesn&#8217;t match, it returns 0 (false), otherwise it compares the first character of the string against 0xA (line feed) and returns whether they match. 
Now, .NET 8:<\/p>\n<pre><code class=\"language-assembly\">; Tests.IsLineFeed()\r\n       xor       eax,eax\r\n       ret\r\n; Total bytes of code 3<\/code><\/pre>\n<p>That&#8217;s more interesting. I ran this code on Windows, where <code>Environment.NewLine<\/code> is <code>\"\\r\\n\"<\/code>. The JIT has constant folded the entire operation, seeing that the length is not 1, such that the whole operation boils down to just returning false.<\/p>\n<p>Or consider <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78783\">dotnet\/runtime#78783<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80661\">dotnet\/runtime#80661<\/a> which teach the JIT how to actually peer into the contents of an &#8220;RVA static.&#8221; These are &#8220;Relative Virtual Address&#8221; static fields, which is a fancy way of saying they live in the assembly&#8217;s data section. The C# compiler has optimizations that put constant data into such fields; for example, when you write:<\/p>\n<pre><code class=\"language-C#\">private static ReadOnlySpan&lt;byte&gt; Prefix =&gt; \"http:\/\/\"u8;<\/code><\/pre>\n<p>the C# compiler will actually emit IL like this:<\/p>\n<pre><code class=\"language-assembly\">.method private hidebysig specialname static \r\n    valuetype [System.Runtime]System.ReadOnlySpan`1&lt;uint8&gt; get_Prefix () cil managed \r\n{\r\n    .maxstack 8\r\n\r\n    IL_0000: ldsflda int64 '&lt;PrivateImplementationDetails&gt;'::'6709A82409D4C9E2EC04E1E71AB12303402A116B0F923DB8114F69CB05F1E926'\r\n    IL_0005: ldc.i4.7\r\n    IL_0006: newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1&lt;uint8&gt;::.ctor(void*, int32)\r\n    IL_000b: ret\r\n}\r\n...\r\n.class private auto ansi sealed '&lt;PrivateImplementationDetails&gt;'\r\n    extends [System.Runtime]System.Object\r\n{\r\n    .field assembly static initonly int64 '6709A82409D4C9E2EC04E1E71AB12303402A116B0F923DB8114F69CB05F1E926' at I_00002868\r\n    .data cil I_00002868 = bytearray ( 68 74 74 70 3a 
2f 2f 00 )\r\n}<\/code><\/pre>\n<p>With these PRs, when indexing into such RVA statics, the JIT is now able to actually read the data at the relevant location, constant folding the operation to the value at that location. So, take the following benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public bool IsWhiteSpace() =&gt; char.IsWhiteSpace('\\n');\r\n}<\/code><\/pre>\n<p>The <code>char.IsWhiteSpace<\/code> method is implemented via a lookup into such an RVA static, using the <code>char<\/code> passed in as an index into it. If the index ends up being a <code>const<\/code>, now on .NET 8 the whole operation can be constant folded away. .NET 7:<\/p>\n<pre><code class=\"language-assembly\">; Tests.IsWhiteSpace()\r\n       xor       eax,eax\r\n       test      byte ptr [7FFF9BCCD83A],80\r\n       setne     al\r\n       ret\r\n; Total bytes of code 13<\/code><\/pre>\n<p>and .NET 8:<\/p>\n<pre><code class=\"language-assembly\">; Tests.IsWhiteSpace()\r\n       mov       eax,1\r\n       ret\r\n; Total bytes of code 6<\/code><\/pre>\n<p>You get the idea. Of course, a developer hopefully wouldn&#8217;t explicitly write <code>char.IsWhiteSpace('\\n')<\/code>, but such code can result nonetheless, especially via inlining.<\/p>\n<p>There are a multitude of these kinds of improvements in .NET 8. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77102\">dotnet\/runtime#77102<\/a> made it so that a <code>static readonly<\/code> value type&#8217;s primitive fields can be constant folded as if they were themselves <code>static readonly<\/code> fields, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80431\">dotnet\/runtime#80431<\/a> extended that to strings. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85804\">dotnet\/runtime#85804<\/a> taught the JIT how to handle <code>RuntimeTypeHandle.ToIntPtr(typeof(T).TypeHandle)<\/code> (which is used in methods like <code>GC.AllocateUninitializedArray<\/code>), while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87101\">dotnet\/runtime#87101<\/a> taught it to handle <code>obj.GetType()<\/code> (such that if the JIT knows the exact type of an instance <code>obj<\/code>, it can replace the <code>GetType()<\/code> invocation with the known answer). However, one of my favorite examples, purely because of just how magical it seems, comes from a series of PRs, including <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80622\">dotnet\/runtime#80622<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78961\">dotnet\/runtime#78961<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80888\">dotnet\/runtime#80888<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81005\">dotnet\/runtime#81005<\/a>. 
Together, they enable this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public DateTime Get() =&gt; new DateTime(2023, 9, 1);\r\n}<\/code><\/pre>\n<p>to produce this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Get()\r\n       mov       rax,8DBAA7E629B4000\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<p>The JIT was able to successfully inline and constant fold the entire operation down to a single constant. That <code>8DBAA7E629B4000<\/code> in that <code>mov<\/code> instruction is the value for the <code>private readonly ulong _dateData<\/code> field that backs <code>DateTime<\/code>. Sure enough, if you run:<\/p>\n<pre><code class=\"language-C#\">new DateTime(0x8DBAA7E629B4000)<\/code><\/pre>\n<p>you&#8217;ll see it produces:<\/p>\n<pre><code class=\"language-C#\">[9\/1\/2023 12:00:00 AM]<\/code><\/pre>\n<p>Very cool.<\/p>\n<h2>Non-GC Heap<\/h2>\n<p>Earlier we saw an example of the codegen when loading a constant string. 
As a reminder, this code:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public string GetPrefix() =&gt; \"https:\/\/\";\r\n}<\/code><\/pre>\n<p>results in this assembly on .NET 7:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetPrefix()\r\n       mov       rax,126A7C01498\r\n       mov       rax,[rax]\r\n       ret\r\n; Total bytes of code 14<\/code><\/pre>\n<p>There are two <code>mov<\/code> instructions here. The first is loading the location where the address of the string object is stored, and the second is reading the address stored at that location (this requires two <code>mov<\/code>s because on x64 there&#8217;s no addressing mode that supports moving the value stored at an absolute address larger than 32 bits). Even though we&#8217;re dealing with a string literal here, such that the data for the string is constant, that constant data still ends up being copied into a heap-allocated <code>string<\/code> object. That object is interned, such that there&#8217;s only one of them in the process, but it&#8217;s still a heap object, and that means it&#8217;s still subject to being moved around by the GC. That means the JIT can&#8217;t just bake in the address of the <code>string<\/code> object, since the address can change, which is why it needs to read the address each time, in order to know where it currently is. Or, does it?<\/p>\n<p>What if we could ensure that the <code>string<\/code> object for this literal is created some place where it would never move, for example on the Pinned Object Heap (POH)? 
Then the JIT could avoid the indirection and instead just hardcode the address of the <code>string<\/code>, knowing that it would never move. Of course, the POH guarantees objects on it will never <em>move<\/em>, but it doesn&#8217;t guarantee addresses to them will always be valid; after all, it doesn&#8217;t root the objects, so objects on the POH are still collectible by the GC, and if they were collected, their addresses would be pointing at garbage or other data that ended up reusing the space.<\/p>\n<p>To address that, .NET 8 introduces a new mechanism used by the JIT for these kinds of situations: the Non-GC Heap (an evolution of the older &#8220;Frozen Segments&#8221; concept used by Native AOT). The JIT can ensure relevant objects are allocated on the Non-GC Heap, which is, as the name suggests, not managed by the GC and is intended to store objects where the JIT can prove the object has no references the GC needs to be aware of and will be rooted for the lifetime of the process, which in turn implies it can&#8217;t be part of an unloadable context.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/HeapsWhereNetObjectsLive.png\" alt=\"Heaps where .NET Objects Live\" \/><\/p>\n<p>The JIT can then avoid indirections in code generated to access that object, instead just hardcoding the object&#8217;s address. That&#8217;s exactly what it does now for string literals, as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49576\">dotnet\/runtime#49576<\/a>. 
Now in .NET 8, that same method above results in this assembly:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetPrefix()\r\n       mov       rax,227814EAEA8\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75573\">dotnet\/runtime#75573<\/a> makes a similar play, but with the <code>RuntimeType<\/code> objects produced by <code>typeof(T)<\/code> (subject to various constraints, like the <code>T<\/code> not coming from an unloadable assembly, in which case permanently rooting the object would prevent unloading). Again, we can see this with a simple benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public Type GetTestsType() =&gt; typeof(Tests);\r\n}<\/code><\/pre>\n<p>where we get the following difference between .NET 7 and .NET 8:<\/p>\n<pre><code class=\"language-assembly\">; .NET 7\r\n; Tests.GetTestsType()\r\n       sub       rsp,28\r\n       mov       rcx,offset MT_Tests\r\n       call      CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE\r\n       nop\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 25\r\n\r\n; .NET 8\r\n; Tests.GetTestsType()\r\n       mov       rax,1E0015E73F8\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<p>The same capability can be extended to other kinds of objects, as it is in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85559\">dotnet\/runtime#85559<\/a> (which is based on work from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76112\">dotnet\/runtime#76112<\/a>), making <code>Array.Empty&lt;T&gt;()<\/code> cheaper by allocating the empty arrays on 
the Non-GC Heap.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public string[] Test() =&gt; Array.Empty&lt;string&gt;();\r\n}<\/code><\/pre>\n<pre><code class=\"language-assembly\">; .NET 7\r\n; Tests.Test()\r\n       mov       rax,17E8D801FE8\r\n       mov       rax,[rax]\r\n       ret\r\n; Total bytes of code 14\r\n\r\n; .NET 8\r\n; Tests.Test()\r\n       mov       rax,1A0814EAEA8\r\n       ret\r\n; Total bytes of code 11<\/code><\/pre>\n<p>And as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77737\">dotnet\/runtime#77737<\/a>, it also applies to the heap object associated with <code>static<\/code> value type fields, at least those that don&#8217;t contain any GC references. Wait, heap object for value type fields? Surely, Stephen, you got that wrong, value types aren&#8217;t allocated on the heap when stored in fields. Well, actually they are when they&#8217;re stored in <code>static<\/code> fields; the runtime creates a heap-allocated box associated with that field to store the value (but the same box is reused for all writes to that field). 
And that means for a benchmark like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic partial class Tests\r\n{\r\n    private static readonly ConfigurationData s_config = ConfigurationData.ReadData();\r\n\r\n    [Benchmark]\r\n    public TimeSpan GetRefreshInterval() =&gt; s_config.RefreshInterval;\r\n\r\n    \/\/ Struct for storing fictional configuration data that might be read from a configuration file.\r\n    private struct ConfigurationData\r\n    {\r\n        public static ConfigurationData ReadData() =&gt; new ConfigurationData\r\n        {\r\n            Index = 0x12345,\r\n            Id = Guid.NewGuid(),\r\n            IsEnabled = true,\r\n            RefreshInterval = TimeSpan.FromSeconds(100)\r\n        };\r\n\r\n        public int Index;\r\n        public Guid Id;\r\n        public bool IsEnabled;\r\n        public TimeSpan RefreshInterval;\r\n    }\r\n}<\/code><\/pre>\n<p>we see the following assembly code for reading that <code>RefreshInterval<\/code> on .NET 7:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetRefreshInterval()\r\n       mov       rax,13D84001F78\r\n       mov       rax,[rax]\r\n       mov       rax,[rax+20]\r\n       ret\r\n; Total bytes of code 18<\/code><\/pre>\n<p>That code is loading the address of the field, reading from it the address of the box object, and then reading from that box object the <code>TimeSpan<\/code> value that&#8217;s stored inside of it. 
But, now on .NET 8 we get the assembly you&#8217;ve now come to expect:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetRefreshInterval()\r\n       mov       rax,20D9853AE48\r\n       mov       rax,[rax]\r\n       ret\r\n; Total bytes of code 14<\/code><\/pre>\n<p>The box gets allocated on the Non-GC heap, which means the JIT can bake in the address of the object, and we get to save a <code>mov<\/code>.<\/p>\n<p>Beyond fewer indirections to access these Non-GC Heap objects, there are other benefits. For example, a &#8220;generational GC&#8221; like the one used in .NET divides the heap into multiple &#8220;generations,&#8221; where generation 0 (&#8220;gen0&#8221;) is for recently created objects and generation 2 (&#8220;gen2&#8221;) is for objects that have been around for a while. When the GC performs a collection, it needs to determine which objects are still alive (still referenced) and which ones can be collected, and to do that it has to trace through all references to find out what objects are still reachable. However, the generational model is beneficial because it can enable the GC to scour through much less of the heap than it might otherwise need to. If it can tell, for example, that there aren&#8217;t any references from gen2 back to gen0, then when doing a gen0 collection, it can avoid enumerating gen2 objects entirely. But to be able to know about such references, the GC needs to know any time a reference is written to a shared location. 
We can see that in this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public void Write()\r\n    {\r\n        string dst = \"old\";\r\n        Write(ref dst, \"new\");\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static void Write(ref string dst, string s) =&gt; dst = s;\r\n}<\/code><\/pre>\n<p>where the code generated for that <code>Write(ref string, string)<\/code> method on both .NET 7 and .NET 8 is:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Write(System.String ByRef, System.String)\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       nop\r\n       ret\r\n; Total bytes of code 7<\/code><\/pre>\n<p>That <code>CORINFO_HELP_CHECKED_ASSIGN_REF<\/code> is a JIT helper function that contains what&#8217;s known as a &#8220;GC write barrier,&#8221; a little piece of code that runs to let the GC track that a reference is being written that it might need to know about, e.g. because the object being assigned might be gen0 and the destination might be gen2. 
We see the same thing on .NET 7 for a tweak to the benchmark like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public void Write()\r\n    {\r\n        string dst = \"old\";\r\n        Write(ref dst);\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static void Write(ref string dst) =&gt; dst = \"new\";\r\n}<\/code><\/pre>\n<p>Now we&#8217;re storing a string literal into the destination, and on .NET 7 we see assembly similarly calling <code>CORINFO_HELP_CHECKED_ASSIGN_REF<\/code>:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Write(System.String ByRef)\r\n       mov       rdx,1FF0E4014A0\r\n       mov       rdx,[rdx]\r\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF\r\n       nop\r\n       ret\r\n; Total bytes of code 20<\/code><\/pre>\n<p>But, now on .NET 8 we see this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Write(System.String ByRef)\r\n       mov       rax,1B3814EAEC8\r\n       mov       [rcx],rax\r\n       ret\r\n; Total bytes of code 14<\/code><\/pre>\n<p>No write barrier. That&#8217;s thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76135\">dotnet\/runtime#76135<\/a>, which recognizes that these Non-GC Heap objects don&#8217;t need to be tracked, since they&#8217;ll never be collected anyway. 
There are multiple other PRs that improve how constant folding works with these Non-GC Heap objects, too, like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85127\">dotnet\/runtime#85127<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85888\">dotnet\/runtime#85888<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86318\">dotnet\/runtime#86318<\/a>.<\/p>\n<h2>Zeroing<\/h2>\n<p>The JIT frequently needs to generate code that zeroes out memory. Unless you&#8217;ve used <code>[SkipLocalsInit]<\/code>, for example, any stack space allocated with <code>stackalloc<\/code> needs to be zeroed, and it&#8217;s the JIT&#8217;s responsibility to generate the code that does so. Consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{    \r\n    [Benchmark] public void Constant256() =&gt; Use(stackalloc byte[256]);\r\n\r\n    [Benchmark] public void Constant1024() =&gt; Use(stackalloc byte[1024]);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)] \/\/ prevent stackallocs from being optimized away\r\n    private static void Use(Span&lt;byte&gt; span) { }\r\n}<\/code><\/pre>\n<p>Here&#8217;s what the .NET 7 assembly looks like for both <code>Constant256<\/code> and <code>Constant1024<\/code>:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Constant256()\r\n       push      rbp\r\n       sub       rsp,40\r\n       lea       rbp,[rsp+20]\r\n       xor       eax,eax\r\n       mov       [rbp+10],rax\r\n       mov       [rbp+18],rax\r\n       mov       rax,0A77E4BDA96AD\r\n       mov       [rbp+8],rax\r\n       add       rsp,20\r\n       mov 
      ecx,10\r\nM00_L00:\r\n       push      0\r\n       push      0\r\n       dec       rcx\r\n       jne       short M00_L00\r\n       sub       rsp,20\r\n       lea       rcx,[rsp+20]\r\n       mov       [rbp+10],rcx\r\n       mov       dword ptr [rbp+18],100\r\n       lea       rcx,[rbp+10]\r\n       call      qword ptr [7FFF3DD37900]; Tests.Use(System.Span`1&lt;Byte&gt;)\r\n       mov       rcx,0A77E4BDA96AD\r\n       cmp       [rbp+8],rcx\r\n       je        short M00_L01\r\n       call      CORINFO_HELP_FAIL_FAST\r\nM00_L01:\r\n       nop\r\n       lea       rsp,[rbp+20]\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 110\r\n\r\n; Tests.Constant1024()\r\n       push      rbp\r\n       sub       rsp,40\r\n       lea       rbp,[rsp+20]\r\n       xor       eax,eax\r\n       mov       [rbp+10],rax\r\n       mov       [rbp+18],rax\r\n       mov       rax,606DD723A061\r\n       mov       [rbp+8],rax\r\n       add       rsp,20\r\n       mov       ecx,40\r\nM00_L00:\r\n       push      0\r\n       push      0\r\n       dec       rcx\r\n       jne       short M00_L00\r\n       sub       rsp,20\r\n       lea       rcx,[rsp+20]\r\n       mov       [rbp+10],rcx\r\n       mov       dword ptr [rbp+18],400\r\n       lea       rcx,[rbp+10]\r\n       call      qword ptr [7FFF3DD47900]; Tests.Use(System.Span`1&lt;Byte&gt;)\r\n       mov       rcx,606DD723A061\r\n       cmp       [rbp+8],rcx\r\n       je        short M00_L01\r\n       call      CORINFO_HELP_FAIL_FAST\r\nM00_L01:\r\n       nop\r\n       lea       rsp,[rbp+20]\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 110<\/code><\/pre>\n<p>We can see in the middle there that the JIT has written a zeroing loop, zeroing 16 bytes at a time by pushing two 8-byte <code>0<\/code>s onto the stack on each iteration:<\/p>\n<pre><code class=\"language-assembly\">M00_L00:\r\n       push      0\r\n       push      0\r\n       dec       rcx\r\n       jne       short M00_L00<\/code><\/pre>\n<p>Now in .NET 
8 with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83255\">dotnet\/runtime#83255<\/a>, the JIT unrolls and vectorizes that zeroing, and after a certain threshold (which as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83274\">dotnet\/runtime#83274<\/a> has also been updated and made consistent with what other native compilers do), it switches over to using an optimized <code>memset<\/code> routine rather than emitting a large amount of code to achieve the same thing. Here&#8217;s what we now get on .NET 8 for <code>Constant256<\/code> (on my machine&#8230; I call that out because the limits are based on what instruction sets are available):<\/p>\n<pre><code class=\"language-assembly\">; Tests.Constant256()\r\n       push      rbp\r\n       sub       rsp,40\r\n       vzeroupper\r\n       lea       rbp,[rsp+20]\r\n       xor       eax,eax\r\n       mov       [rbp+10],rax\r\n       mov       [rbp+18],rax\r\n       mov       rax,6281D64D33C3\r\n       mov       [rbp+8],rax\r\n       test      [rsp],esp\r\n       sub       rsp,100\r\n       lea       rcx,[rsp+20]\r\n       vxorps    ymm0,ymm0,ymm0\r\n       vmovdqu   ymmword ptr [rcx],ymm0\r\n       vmovdqu   ymmword ptr [rcx+20],ymm0\r\n       vmovdqu   ymmword ptr [rcx+40],ymm0\r\n       vmovdqu   ymmword ptr [rcx+60],ymm0\r\n       vmovdqu   ymmword ptr [rcx+80],ymm0\r\n       vmovdqu   ymmword ptr [rcx+0A0],ymm0\r\n       vmovdqu   ymmword ptr [rcx+0C0],ymm0\r\n       vmovdqu   ymmword ptr [rcx+0E0],ymm0\r\n       mov       [rbp+10],rcx\r\n       mov       dword ptr [rbp+18],100\r\n       lea       rcx,[rbp+10]\r\n       call      qword ptr [7FFEB7D3F498]; Tests.Use(System.Span`1&lt;Byte&gt;)\r\n       mov       rcx,6281D64D33C3\r\n       cmp       [rbp+8],rcx\r\n       je        short M00_L00\r\n       call      CORINFO_HELP_FAIL_FAST\r\nM00_L00:\r\n       nop\r\n       lea       rsp,[rbp+20]\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 156<\/code><\/pre>\n<p>Notice 
there&#8217;s no zeroing loop, and instead we see a bunch of 256-bit <code>vmovdqu<\/code> move instructions to copy the zeroed out <code>ymm0<\/code> register to the next portion of the stack. And then for <code>Constant1024<\/code> we see:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Constant1024()\r\n       push      rbp\r\n       sub       rsp,40\r\n       lea       rbp,[rsp+20]\r\n       xor       eax,eax\r\n       mov       [rbp+10],rax\r\n       mov       [rbp+18],rax\r\n       mov       rax,0CAF12189F783\r\n       mov       [rbp],rax\r\n       test      [rsp],esp\r\n       sub       rsp,400\r\n       lea       rcx,[rsp+20]\r\n       mov       [rbp+8],rcx\r\n       xor       edx,edx\r\n       mov       r8d,400\r\n       call      CORINFO_HELP_MEMSET\r\n       mov       rcx,[rbp+8]\r\n       mov       [rbp+10],rcx\r\n       mov       dword ptr [rbp+18],400\r\n       lea       rcx,[rbp+10]\r\n       call      qword ptr [7FFEB7D5F498]; Tests.Use(System.Span`1&lt;Byte&gt;)\r\n       mov       rcx,0CAF12189F783\r\n       cmp       [rbp],rcx\r\n       je        short M00_L00\r\n       call      CORINFO_HELP_FAIL_FAST\r\nM00_L00:\r\n       nop\r\n       lea       rsp,[rbp+20]\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 119<\/code><\/pre>\n<p>Again, no zeroing loop, and instead we see <code>call CORINFO_HELP_MEMSET<\/code>, relying on the optimized underlying <code>memset<\/code> to efficiently handle the zeroing. 
The effects of this are visible in throughput numbers as well:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Constant256<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">7.927 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Constant256<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">3.181 ns<\/td>\n<td style=\"text-align: right\">0.40<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Constant1024<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">30.523 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Constant1024<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.850 ns<\/td>\n<td style=\"text-align: right\">0.29<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83488\">dotnet\/runtime#83488<\/a> improved this further by using a standard trick frequently employed when vectorizing algorithms. Let&#8217;s say you want to zero out 120 bytes and you have at your disposal an instruction for zeroing out 32 bytes at a time. We can issue three such instructions to zero out 96 bytes, but we&#8217;re then left with 24 bytes that still need to be zeroed. What do we do? We can&#8217;t write another 32 bytes from where we left off, as we might then be overwriting 8 bytes we shouldn&#8217;t be touching. We could use scalar zeroing and issue three instructions each for 8 bytes, but could we do it in just a single instruction? Yes! Since the writes are idempotent, we can just zero out the last 32 bytes of the 120 bytes, even though that means we&#8217;ll be re-zeroing 8 bytes we already zeroed. 
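<\/p>\n<p>To make that concrete, here&#8217;s a minimal sketch of the same overlapping-tail idea written by hand in C#. It is not the code the JIT emits; the <code>Zero<\/code> helper is a hypothetical name used purely for illustration, and it assumes the buffer is at least one vector (32 bytes) long:<\/p>\n<pre><code class=\"language-C#\">\/\/ A hand-written sketch of the overlapping-tail trick; this is not the code the JIT emits.\r\nusing System;\r\nusing System.Runtime.Intrinsics;\r\n\r\nSpan&lt;byte&gt; buffer = stackalloc byte[120];\r\nbuffer.Fill(0xFF);\r\nZero(buffer);\r\nConsole.WriteLine(buffer.IndexOfAnyExcept((byte)0)); \/\/ -1: every byte was zeroed\r\n\r\n\/\/ Hypothetical helper; assumes buffer.Length &gt;= Vector256&lt;byte&gt;.Count (32).\r\nstatic void Zero(Span&lt;byte&gt; buffer)\r\n{\r\n    int width = Vector256&lt;byte&gt;.Count;\r\n    int lastStart = buffer.Length - width;\r\n\r\n    \/\/ Full-width stores while a whole vector fits before the final one (three for 120 bytes).\r\n    for (int i = 0; i &lt; lastStart; i += width)\r\n    {\r\n        Vector256&lt;byte&gt;.Zero.CopyTo(buffer.Slice(i));\r\n    }\r\n\r\n    \/\/ One last store over the final 32 bytes; for 120 bytes it re-zeroes bytes 88 through 95.\r\n    Vector256&lt;byte&gt;.Zero.CopyTo(buffer.Slice(lastStart));\r\n}<\/code><\/pre>\n<p>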
You can see this same approach utilized in many of the vectorized operations throughout the core libraries, and as of this PR, the JIT employs it when zeroing as well.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85389\">dotnet\/runtime#85389<\/a> takes this further and uses AVX512 to improve bulk operations like this zeroing. So, running the same benchmark on my Dev Box with AVX512, I see this assembly generated for <code>Constant256<\/code>:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Constant256()\r\n       push      rbp\r\n       sub       rsp,40\r\n       vzeroupper\r\n       lea       rbp,[rsp+20]\r\n       xor       eax,eax\r\n       mov       [rbp+10],rax\r\n       mov       [rbp+18],rax\r\n       mov       rax,992482B435F7\r\n       mov       [rbp+8],rax\r\n       test      [rsp],esp\r\n       sub       rsp,100\r\n       lea       rcx,[rsp+20]\r\n       vxorps    ymm0,ymm0,ymm0\r\n       vmovdqu32 [rcx],zmm0\r\n       vmovdqu32 [rcx+40],zmm0\r\n       vmovdqu32 [rcx+80],zmm0\r\n       vmovdqu32 [rcx+0C0],zmm0\r\n       mov       [rbp+10],rcx\r\n       mov       dword ptr [rbp+18],100\r\n       lea       rcx,[rbp+10]\r\n       call      qword ptr [7FFCE555F4B0]; Tests.Use(System.Span`1&lt;Byte&gt;)\r\n       mov       rcx,992482B435F7\r\n       cmp       [rbp+8],rcx\r\n       je        short M00_L00\r\n       call      CORINFO_HELP_FAIL_FAST\r\nM00_L00:\r\n       nop\r\n       lea       rsp,[rbp+20]\r\n       pop       rbp\r\n       ret\r\n; Total bytes of code 132<\/code><\/pre>\n<pre><code class=\"language-assembly\">; Tests.Use(System.Span`1&lt;Byte&gt;)\r\n       ret\r\n; Total bytes of code 1<\/code><\/pre>\n<p>Note that now, rather than eight <code>vmovdqu<\/code> instructions with <code>ymm0<\/code>, we see four <code>vmovdqu32<\/code> instructions with <code>zmm0<\/code>, as each move instruction is able to zero out twice as much, with each instruction handling 64 bytes at a time.<\/p>\n<h2>Value Types<\/h2>\n<p>Value 
types (structs) have been used increasingly as part of high-performance code. Yet while they have obvious advantages (they don&#8217;t require heap allocation and thus reduce pressure on the GC), they also have disadvantages (more data being copied around) and have historically not been as optimized as someone relying on them heavily for performance might like. It&#8217;s been a key focus area of improvement for the JIT in the last several releases of .NET, and that continues into .NET 8.<\/p>\n<p>One specific area of improvement here is around &#8220;promotion.&#8221; In this context, promotion is the idea of splitting a struct apart into its constituent fields, effectively treating each field as its own local. This can lead to a number of valuable optimizations, including being able to enregister portions of a struct. As of .NET 7, the JIT does support struct promotion, but with limitations, including only supporting structs with at most four fields and not supporting nested structs (other than for primitive types).<\/p>\n<p>A lot of work in .NET 8 went into removing those restrictions. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83388\">dotnet\/runtime#83388<\/a> improves upon the existing promotion support with an additional optimization pass the JIT refers to as &#8220;physical promotion.&#8221; It does away with both of those cited limitations, although as of this PR the feature was still disabled by default. Other PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85105\">dotnet\/runtime#85105<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86043\">dotnet\/runtime#86043<\/a> improved it further, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88090\">dotnet\/runtime#88090<\/a> enabled the optimizations by default. 
The net result is visible in a benchmark like the following:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private ParsedStat _stat;\r\n\r\n    [Benchmark]\r\n    public ulong GetTime()\r\n    {\r\n        ParsedStat stat = _stat;\r\n        return stat.utime + stat.stime;\r\n    }\r\n\r\n    internal struct ParsedStat\r\n    {\r\n        internal int pid;\r\n        internal string comm;\r\n        internal char state;\r\n        internal int ppid;\r\n        internal int session;\r\n        internal ulong utime;\r\n        internal ulong stime;\r\n        internal long nice;\r\n        internal ulong starttime;\r\n        internal ulong vsize;\r\n        internal long rss;\r\n        internal ulong rsslim;\r\n    }\r\n}<\/code><\/pre>\n<p>Here we have a struct modeling some data that might be extracted from a <code>procfs<\/code> <code>stat<\/code> file on Linux. The benchmark makes a local copy of the struct and returns a sum of the user and kernel times. 
In .NET 7, the assembly looks like this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetTime()\r\n       push      rdi\r\n       push      rsi\r\n       sub       rsp,58\r\n       lea       rsi,[rcx+8]\r\n       lea       rdi,[rsp+8]\r\n       mov       ecx,0A\r\n       rep movsq\r\n       mov       rax,[rsp+10]\r\n       add       rax,[rsp+18]\r\n       add       rsp,58\r\n       pop       rsi\r\n       pop       rdi\r\n       ret\r\n; Total bytes of code 40<\/code><\/pre>\n<p>The two really interesting instructions here are these:<\/p>\n<pre><code class=\"language-assembly\">mov ecx,0A\r\nrep movsq<\/code><\/pre>\n<p>The <code>ParsedStat<\/code> struct is 80 bytes in size, and this pair of instructions is repeatedly (<code>rep<\/code>) copying 8 bytes (<code>movsq<\/code>) 10 times (the count in <code>ecx<\/code>, which was populated with 0xA) from the source location in <code>rsi<\/code> (which was initialized with <code>[rcx+8]<\/code>, aka the location of the <code>_stat<\/code> field) to the destination location in <code>rdi<\/code> (a stack location at <code>[rsp+8]<\/code>). In other words, this is making a full copy of the whole struct, even though we only need two fields from it. Now in .NET 8, we get this:<\/p>\n<pre><code class=\"language-assembly\">; Tests.GetTime()\r\n       add       rcx,8\r\n       mov       rax,[rcx+8]\r\n       mov       rcx,[rcx+10]\r\n       add       rax,rcx\r\n       ret\r\n; Total bytes of code 16<\/code><\/pre>\n<p>Ahhh, so much nicer. 
Now it&#8217;s avoided the whole copy, and is simply moving the relevant <code>ulong<\/code> values into registers and adding them together.<\/p>\n<p>Here&#8217;s another example:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\".NET 7\").WithRuntime(CoreRuntime.Core70).AsBaseline())\r\n    .AddJob(Job.Default.WithId(\".NET 8 w\/o PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\".NET 8\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\", \"Runtime\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private readonly List&lt;int?&gt; _list = Enumerable.Range(0, 10000).Select(i =&gt; (int?)i).ToList();\r\n\r\n    [Benchmark]\r\n    public int CountList()\r\n    {\r\n        int count = 0;\r\n        foreach (int? i in _list)\r\n            if (i is not null)\r\n                count++;\r\n\r\n        return count;\r\n    }\r\n}<\/code><\/pre>\n<p><code>List&lt;T&gt;<\/code> has a struct <code>List&lt;T&gt;.Enumerator<\/code> that&#8217;s returned from <code>List&lt;T&gt;.GetEnumerator()<\/code>, such that when you <code>foreach<\/code> the list directly (rather than as an <code>IEnumerable&lt;T&gt;<\/code>), the C# compiler binds to this struct enumerator via the enumerator pattern. This example runs afoul of the previous limitations in two ways. 
That <code>Enumerator<\/code> has a field for the current <code>T<\/code>, so if <code>T<\/code> is a non-primitive value type, it violates the &#8220;no nested structs&#8221; limitation. And that <code>Enumerator<\/code> has four fields, so if that <code>T<\/code> has multiple fields, it pushes it beyond the four-field limit. Now in .NET 8, the JIT is able to see through the struct to its fields, and optimize the enumeration of the list to a much more efficient result.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountList<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">18.878 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">215 B<\/td>\n<\/tr>\n<tr>\n<td>CountList<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">11.726 us<\/td>\n<td style=\"text-align: right\">0.62<\/td>\n<td style=\"text-align: right\">70 B<\/td>\n<\/tr>\n<tr>\n<td>CountList<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">5.912 us<\/td>\n<td style=\"text-align: right\">0.31<\/td>\n<td style=\"text-align: right\">66 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note the significant improvement in both throughput and code size from .NET 7 to .NET 8 even without PGO. However, the gap here between .NET 8 without PGO and with PGO is also interesting, albeit for other reasons. We see an almost halving of execution time with PGO applied, but only four bytes of difference in assembly code size. 
Those four bytes stem from a single <code>mov<\/code> instruction that PGO was able to help remove, which we can see easily by pasting the two snippets into a diffing tool:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/ExtraMovInDiffChecker.png\" alt=\"An extra mov highlighted in a diff tool\" \/>\n~12us down to ~6us is a lot for a difference of a single <code>mov<\/code>&#8230; why such an outsized impact? This ends up being a really good example of what I mentioned at the beginning of this article: beware microbenchmarks, as they can differ from machine to machine. Or in this case, in particular from processor to processor. The machine on which I&#8217;m writing this and on which I&#8217;ve run the majority of the benchmarks in this post is a several-year-old desktop with an Intel Coffee Lake processor. When I run the same benchmark on my Dev Box, which has an Intel Xeon Platinum 8370C, I see this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountList<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">15.804 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">215 B<\/td>\n<\/tr>\n<tr>\n<td>CountList<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">7.138 us<\/td>\n<td style=\"text-align: right\">0.45<\/td>\n<td style=\"text-align: right\">70 B<\/td>\n<\/tr>\n<tr>\n<td>CountList<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">6.111 us<\/td>\n<td style=\"text-align: right\">0.39<\/td>\n<td style=\"text-align: right\">66 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Same code size, still a large improvement due to physical promotion, but now only a small ~15% rather than ~2x improvement from PGO. 
As it turns out, Coffee Lake is one of the processors affected by the Jump Conditional Code <a href=\"https:\/\/www.intel.com\/content\/dam\/support\/us\/en\/documents\/processors\/mitigations-jump-conditional-code-erratum.pdf\">(JCC) Erratum<\/a> issued in 2019 (&#8220;erratum&#8221; here is a fancy way of saying &#8220;bug&#8221;, or alternatively, &#8220;documentation about a bug&#8221;). The problem involved jump instructions on a 32-byte boundary, and the hardware caching information about those instructions. The issue was then subsequently fixed via a microcode update that disabled the relevant caching, but that then created a possible performance issue, as whether a jump is on a 32-byte boundary impacts whether it&#8217;s cached and therefore the resulting performance gains that cache was introduced to provide. If I set the <code>DOTNET_JitDisasm<\/code> environment variable to <code>*CountList*<\/code> (to get the JIT to output the disassembly directly, rather than relying on BenchmarkDotNet to fish it out), and set the <code>DOTNET_JitDisasmWithAlignmentBoundaries<\/code> environment variable to <code>1<\/code> (to get the JIT to include alignment boundary information in that output), I see this:<\/p>\n<pre><code class=\"language-assembly\">G_M000_IG04:                ;; offset=0018H\r\n       mov      r8d, dword ptr [rcx+10H]\r\n       cmp      edx, r8d\r\n       jae      SHORT G_M000_IG05\r\n; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 1 ; jcc erratum) 32B boundary ...............................\r\n       mov      r8, gword ptr [rcx+08H]<\/code><\/pre>\n<p>Sure enough, we see that this jump instruction is falling on a 32-byte boundary. 
When PGO kicks in and removes the earlier <code>mov<\/code>, that changes the alignment such that the jump is no longer on a 32-byte boundary:<\/p>\n<pre><code class=\"language-assembly\">G_M000_IG05:                ;; offset=0018H\r\n       cmp      edx, dword ptr [rcx+10H]\r\n       jae      SHORT G_M000_IG06\r\n       mov      r8, gword ptr [rcx+08H]\r\n; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 1) 32B boundary ...............................\r\n       cmp      edx, dword ptr [r8+08H]<\/code><\/pre>\n<p>This is all to say, again, there are many things that can impact microbenchmarks, and it&#8217;s valuable to understand the source of a difference rather than just taking it at face value.<\/p>\n<p>Ok, where were we? Oh yeah, structs. Another improvement related to structs comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79346\">dotnet\/runtime#79346<\/a>, which adds an additional &#8220;liveness&#8221; optimization pass earlier than the others it already has (liveness is just an indication of whether a variable might still be needed because its value might be used again in the future). This then allows the JIT to remove some struct copies it wasn&#8217;t previously able to, in particular in situations where the last time the struct is used is in passing it to another method. However, this additional liveness pass has other benefits as well, in particular with relation to &#8220;forward substitution.&#8221; Forward substitution is an optimization that can be thought of as the opposite of &#8220;common subexpression elimination&#8221; (CSE). 
With CSE, the compiler replaces an expression with something containing the result already computed for that expression, so for example if you had:<\/p>\n<pre><code class=\"language-C#\">int c = (a + b) + 3;\r\nint d = (a + b) * 4;<\/code><\/pre>\n<p>a compiler might use CSE to rewrite that as:<\/p>\n<pre><code class=\"language-C#\">int tmp = a + b;\r\nint c = tmp + 3;\r\nint d = tmp * 4;<\/code><\/pre>\n<p>Forward substitution could be used to undo that, distributing the expression feeding into <code>tmp<\/code> back to where <code>tmp<\/code> is used, such that we end up back with:<\/p>\n<pre><code class=\"language-C#\">int c = (a + b) + 3;\r\nint d = (a + b) * 4;<\/code><\/pre>\n<p>Why would a compiler want to do that? It can make certain subsequent optimizations easier for it to see. For example, consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(42)]\r\n    public int Merge(int a)\r\n    {\r\n        a *= 3;\r\n        a *= 3;\r\n        return a;\r\n    }\r\n}<\/code><\/pre>\n<p>On .NET 7, that results in this assembly:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Merge(Int32)\r\n       lea       edx,[rdx+rdx*2]\r\n       lea       edx,[rdx+rdx*2]\r\n       mov       eax,edx\r\n       ret\r\n; Total bytes of code 9<\/code><\/pre>\n<p>The generated code here is performing each multiplication individually. 
But when we view:<\/p>\n<pre><code class=\"language-C#\">a *= 3;\r\na *= 3;\r\nreturn a;<\/code><\/pre>\n<p>instead as:<\/p>\n<pre><code class=\"language-C#\">a = a * 3;\r\na = a * 3;\r\nreturn a;<\/code><\/pre>\n<p>and knowing that the initial result stored into <code>a<\/code> is temporary (thank you, liveness), forward substitution can turn that into:<\/p>\n<pre><code class=\"language-C#\">a = (a * 3) * 3;\r\nreturn a;<\/code><\/pre>\n<p>at which point constant folding can kick in. Now on .NET 8 we get:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Merge(Int32)\r\n       lea       eax,[rdx+rdx*8]\r\n       ret\r\n; Total bytes of code 4<\/code><\/pre>\n<p>Another change related to liveness is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77990\">dotnet\/runtime#77990<\/a> from <a href=\"https:\/\/github.com\/SingleAccretion\">@SingleAccretion<\/a>. This adds another pass over one of the JIT&#8217;s internal representations, eliminating writes it finds to be useless.<\/p>\n<h2>Casting<\/h2>\n<p>Various improvements have gone into improving the performance of casting in .NET 8.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75816\">dotnet\/runtime#75816<\/a> improved the performance of using <code>is T[]<\/code> when <code>T<\/code> is sealed. There&#8217;s a <code>CORINFO_HELP_ISINSTANCEOFARRAY<\/code> helper the JIT uses to determine whether an object is of a specified array type, but when the <code>T<\/code> is sealed, the JIT can instead emit it without the helper, generating code as if it were written like <code>obj is not null &amp;&amp; obj.GetType() == typeof(T[])<\/code>. 
This is another example where dynamic PGO has a measurable impact, so the benchmark highlights the improvements with and without it.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\".NET 7\").WithRuntime(CoreRuntime.Core70).AsBaseline())\r\n    .AddJob(Job.Default.WithId(\".NET 8 w\/o PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\".NET 8\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\", \"Runtime\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private readonly object _obj = new string[1];\r\n\r\n    [Benchmark]\r\n    public bool IsStringArray() =&gt; _obj is string[];\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IsStringArray<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">1.2290 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IsStringArray<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">0.2365 ns<\/td>\n<td style=\"text-align: right\">0.19<\/td>\n<\/tr>\n<tr>\n<td>IsStringArray<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">0.0825 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Moving on, consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing 
BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser(maxDepth: 0)]\r\npublic class Tests\r\n{\r\n    private readonly string[] _strings = new string[1];\r\n\r\n    [Benchmark]\r\n    public string Get1() =&gt; _strings[0];\r\n\r\n    [Benchmark]\r\n    public ref string Get2() =&gt; ref _strings[0];\r\n}<\/code><\/pre>\n<p><code>Get1<\/code> here is just reading and returning the 0th element from the array. <code>Get2<\/code> here is returning a <code>ref<\/code> to the 0th element from the array. Here&#8217;s the assembly we get in .NET 7:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Get1()\r\n       sub       rsp,28\r\n       mov       rax,[rcx+8]\r\n       cmp       dword ptr [rax+8],0\r\n       jbe       short M00_L00\r\n       mov       rax,[rax+10]\r\n       add       rsp,28\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 29\r\n\r\n; Tests.Get2()\r\n       sub       rsp,28\r\n       mov       rcx,[rcx+8]\r\n       xor       edx,edx\r\n       mov       r8,offset MT_System.String\r\n       call      CORINFO_HELP_LDELEMA_REF\r\n       nop\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 31<\/code><\/pre>\n<p>In <code>Get1<\/code>, we&#8217;re immediately using the array element, so the C# compiler can emit a <code>ldelem.ref<\/code> IL instruction, but in <code>Get2<\/code>, the reference to the array element is being returned, so the C# compiler emits a <code>ldelema<\/code> (load element address) instruction. 
In the general case, <code>ldelema<\/code> requires a type check, because of covariance; you could have a <code>Base[] array = new DerivedFromBase[1];<\/code>, in which case if you handed out a <code>ref Base<\/code> pointing into that array and someone wrote a <code>new AlsoDerivedFromBase()<\/code> via that <code>ref<\/code>, type safety would be violated (since you&#8217;d be storing an <code>AlsoDerivedFromBase<\/code> into a <code>DerivedFromBase[]<\/code> even though <code>DerivedFromBase<\/code> and <code>AlsoDerivedFromBase<\/code> aren&#8217;t related). As such, the .NET 7 assembly for this code includes a call to <code>CORINFO_HELP_LDELEMA_REF<\/code>, which is the helper function the JIT uses to perform that type check. But the array element type here is <code>string<\/code>, which is sealed, which means we can&#8217;t get into that problematic situation: there&#8217;s no type you can store into a <code>string<\/code> variable other than <code>string<\/code>. Thus, this helper call is superfluous, and with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85256\">dotnet\/runtime#85256<\/a>, the JIT can now avoid using it. On .NET 8, then, we get this for <code>Get2<\/code>:<\/p>\n<pre><code class=\"language-assembly\">; Tests.Get2()\r\n       sub       rsp,28\r\n       mov       rax,[rcx+8]\r\n       cmp       dword ptr [rax+8],0\r\n       jbe       short M00_L00\r\n       add       rax,10\r\n       add       rsp,28\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 29<\/code><\/pre>\n<p>No <code>CORINFO_HELP_LDELEMA_REF<\/code> in sight.<\/p>\n<p>And then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86728\">dotnet\/runtime#86728<\/a> reduces the costs associated with a generic cast. 
Previously the JIT would always use a <code>CastHelpers.ChkCastAny<\/code> method to perform the cast, but with this change, it inlines a fast success path.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly object _o = \"hello\";\r\n\r\n    [Benchmark]\r\n    public string GetString() =&gt; Cast&lt;string&gt;(_o);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    public T Cast&lt;T&gt;(object o) =&gt; (T)o;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetString<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2.247 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetString<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.300 ns<\/td>\n<td style=\"text-align: right\">0.58<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Peephole Optimizations<\/h2>\n<p>A &#8220;peephole optimization&#8221; is one in which a small sequence of instructions is replaced by a different sequence that is expected to perform better. This could include getting rid of instructions deemed unnecessary or replacing two instructions with one instruction that can accomplish the same task. Every release of .NET features a multitude of new peephole optimizations, often inspired by real-world examples where some overhead could be trimmed by slightly increasing code quality, and .NET 8 is no exception. 
Here are just some of these optimizations in .NET 8:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73120\">dotnet\/runtime#73120<\/a> from <a href=\"https:\/\/github.com\/dubiousconst282\">@dubiousconst282<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74806\">dotnet\/runtime#74806<\/a> from <a href=\"https:\/\/github.com\/En3Tho\">@En3Tho<\/a> improved the handling of the common bit-test patterns like <code>(x &amp; 1) != 0<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77874\">dotnet\/runtime#77874<\/a> gets rid of some unnecessary casts in a method like <code>short Add(short x, short y) =&gt; (short)(x + y)<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76981\">dotnet\/runtime#76981<\/a> improves the performance of multiplying by a number that&#8217;s one away from a power of two, by replacing an <code>imul<\/code> instruction with a three-instruction <code>mov<\/code>\/<code>shl<\/code>\/<code>add<\/code> sequence, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77137\">dotnet\/runtime#77137<\/a> improves other multiplications by a constant via replacing a <code>mov<\/code>\/<code>shl<\/code> sequence with a single <code>lea<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78786\">dotnet\/runtime#78786<\/a> from <a href=\"https:\/\/github.com\/pedrobsaila\">@pedrobsaila<\/a> fuses together separate conditions like <code>value &lt; 0 || value == 0<\/code> into the equivalent of <code>value &lt;= 0<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82750\">dotnet\/runtime#82750<\/a> eliminates some redundant <code>cmp<\/code> instructions.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79630\">dotnet\/runtime#79630<\/a> avoids an unnecessary <code>and<\/code> in a method like <code>static byte Mod(uint i) =&gt; (byte)(i % 256)<\/code>.<\/li>\n<li><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77540\">dotnet\/runtime#77540<\/a> from <a href=\"https:\/\/github.com\/AndyJGraham\">@AndyJGraham<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84399\">dotnet\/runtime#84399<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85032\">dotnet\/runtime#85032<\/a> optimize pairs of load and store instructions and replace them with a single <code>ldp<\/code> or <code>stp<\/code> instruction on Arm.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84350\">dotnet\/runtime#84350<\/a> similarly optimizes pairs of <code>str wzr<\/code> instructions to be <code>str xzr<\/code> instructions.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83458\">dotnet\/runtime#83458<\/a> from <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> optimizes some redundant memory loads on Arm by replacing some <code>ldr<\/code> instructions with <code>mov<\/code> instructions.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83176\">dotnet\/runtime#83176<\/a> optimizes an <code>x &lt; 0<\/code> expression from emitting a <code>cmp<\/code>\/<code>cset<\/code> sequence on Arm to instead emitting an <code>lsr<\/code> instruction.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82924\">dotnet\/runtime#82924<\/a> removes a redundant overflow check on Arm for some division operations.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84605\">dotnet\/runtime#84605<\/a> combines an <code>lsl<\/code>\/<code>cmp<\/code> sequence on Arm into a single <code>cmp<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84667\">dotnet\/runtime#84667<\/a> combines <code>neg<\/code> and <code>cmp<\/code> sequences into use of <code>cmn<\/code> on Arm.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79550\">dotnet\/runtime#79550<\/a> replaces <code>mul<\/code>\/<code>neg<\/code> sequences on Arm 
with <code>mneg<\/code>.<\/li>\n<\/ul>\n<p>(I&#8217;ve touched here on some of the improvements specific to Arm. For a more in-depth look, see <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/this-arm64-performance-in-dotnet-8\">Arm64 Performance Improvements in .NET 8<\/a>).<\/p>\n<h2>Native AOT<\/h2>\n<p>Native AOT shipped in .NET 7. It enables .NET programs to be compiled at build time into a self-contained executable or library composed entirely of native code: no JIT is required at execution time to compile anything, and in fact there&#8217;s no JIT included with the compiled program. The result is an application that can have a very small on-disk footprint, a small memory footprint, and very fast startup time. In .NET 7, the primary supported workloads were console applications. Now in .NET 8, a lot of work has gone into making ASP.NET applications shine when compiled with Native AOT, as well as driving down overall costs, regardless of app model.<\/p>\n<p>A significant focus in .NET 8 was on reducing the size of built applications, and the net effect of this is quite easy to see. Let&#8217;s start by creating a new Native AOT console app:<\/p>\n<pre><code class=\"language-sh\">dotnet new console -o nativeaotexample -f net7.0<\/code><\/pre>\n<p>That creates a new <code>nativeaotexample<\/code> directory and adds to it a new &#8220;Hello, world&#8221; app that targets .NET 7. Edit the generated nativeaotexample.csproj in two ways:<\/p>\n<ol>\n<li>Change the <code>&lt;TargetFramework&gt;net7.0&lt;\/TargetFramework&gt;<\/code> to instead be <code>&lt;TargetFrameworks&gt;net7.0;net8.0&lt;\/TargetFrameworks&gt;<\/code>, so that we can easily build for either .NET 7 or .NET 8.<\/li>\n<li>Add <code>&lt;PublishAot&gt;true&lt;\/PublishAot&gt;<\/code> to the <code>&lt;PropertyGroup&gt;...&lt;\/PropertyGroup&gt;<\/code>, so that when we <code>dotnet publish<\/code>, it uses Native AOT.<\/li>\n<\/ol>\n<p>Now, publish the app for .NET 7. 
I&#8217;m currently targeting Linux for x64, so I&#8217;m using <code>linux-x64<\/code>, but you can follow along on Windows with a Windows identifier, like <code>win-x64<\/code>:<\/p>\n<pre><code class=\"language-sh\">dotnet publish -f net7.0 -r linux-x64 -c Release<\/code><\/pre>\n<p>That should successfully build the app, producing a standalone executable, and we can <code>ls<\/code>\/<code>dir<\/code> the output directory to see the produced binary size (here I&#8217;ve used <code>ls -s --block-size=k<\/code>):<\/p>\n<pre><code class=\"language-text\">12820K \/home\/stoub\/nativeaotexample\/bin\/Release\/net7.0\/linux-x64\/publish\/nativeaotexample<\/code><\/pre>\n<p>So, on .NET 7 on Linux, this &#8220;Hello, world&#8221; application, including all necessary library support, the GC, everything, is ~13MB. Now, we can do the same for .NET 8:<\/p>\n<pre><code class=\"language-sh\">dotnet publish -f net8.0 -r linux-x64 -c Release<\/code><\/pre>\n<p>and again see the generated output size:<\/p>\n<pre><code class=\"language-text\">1536K \/home\/stoub\/nativeaotexample\/bin\/Release\/net8.0\/linux-x64\/publish\/nativeaotexample<\/code><\/pre>\n<p>Now on .NET 8, that ~13MB has dropped to ~1.5MB! We can get it smaller, too, using various supported configuration flags. First, we can set a size vs speed option introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85133\">dotnet\/runtime#85133<\/a>, adding <code>&lt;OptimizationPreference&gt;Size&lt;\/OptimizationPreference&gt;<\/code> to the .csproj. Then if I don&#8217;t need globalization-specific code and data and am ok utilizing an invariant mode, I can add <code>&lt;InvariantGlobalization&gt;true&lt;\/InvariantGlobalization&gt;<\/code>. Maybe I don&#8217;t care about having good stack traces if an exception occurs? <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88235\">dotnet\/runtime#88235<\/a> added the <code>&lt;StackTraceSupport&gt;false&lt;\/StackTraceSupport&gt;<\/code> option. 
Add all of those and republish:<\/p>\n<pre><code class=\"language-text\">1248K \/home\/stoub\/nativeaotexample\/bin\/Release\/net8.0\/linux-x64\/publish\/nativeaotexample<\/code><\/pre>\n<p>Sweet.<\/p>\n<p>A good chunk of those improvements came from a relentless effort that involved hacking away at the size, 10KB here, 20KB there. Some examples that drove down these sizes:<\/p>\n<ul>\n<li>There are a variety of data structures the Native AOT compiler needs to create that then need to be used by the runtime when the app executes. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77884\">dotnet\/runtime#77884<\/a> added support for these data structures, including ones containing pointers, to be stored into the application and then rehydrated at execution time. Even before being extended in a variety of ways by subsequent PRs, this shaved hundreds of kilobytes off the app size, on both Windows and Linux (but more so on Linux).<\/li>\n<li>Every type with a static field containing references has a data structure associated with it containing a few pointers. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78794\">dotnet\/runtime#78794<\/a> made those pointers relative, saving ~0.5% of the HelloWorld app size (at least on Linux, a bit less on Windows). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78801\">dotnet\/runtime#78801<\/a> did the same for another set of pointers, saving another ~1%.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79594\">dotnet\/runtime#79594<\/a> removed some over-aggressive tracking of types and methods that needed data stored about them for reflection. This saved another ~32KB on HelloWorld.<\/li>\n<li>In some cases, generic type dictionaries were being created even if they were never used and thus empty. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82591\">dotnet\/runtime#82591<\/a> got rid of these, saving another ~1.5% on a simple ASP.NET minimal APIs app. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83367\">dotnet\/runtime#83367<\/a> saved another ~20KB by ridding itself of other empty type dictionaries.<\/li>\n<li>Members declared on a generic type have their code copied and specialized for each value type that&#8217;s substituted for the generic type parameter. However, if with some tweaks those members can be made non-generic and moved out of the type, such as into a non-generic base type, that duplication can be avoided. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82923\">dotnet\/runtime#82923<\/a> did so for array enumerators, moving down the <code>IDisposable<\/code> and non-generic <code>IEnumerator<\/code> interface implementations.<\/li>\n<li><code>CoreLib<\/code> has an implementation of an empty array enumerator that can be used when enumerating a <code>T[]<\/code> that&#8217;s empty, and that singleton may be used in non-array enumerables, e.g. enumerating an empty <code>(IEnumerable&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;)Dictionary&lt;TKey, TValue&gt;<\/code> could produce that array enumerator singleton. That enumerator, however, has a reference to a <code>T[]<\/code>, and in the Native AOT world, using the enumerator then means code needs to be produced for the various members of <code>T[]<\/code>. If, however, the enumerator in question is for a <code>T[]<\/code> that&#8217;s unlikely to be used elsewhere (e.g. 
<code>KeyValuePair&lt;TKey, TValue&gt;[]<\/code>), <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82899\">dotnet\/runtime#82899<\/a> supplies a specialized enumerator singleton that doesn&#8217;t reference <code>T[]<\/code>, avoiding forcing that code to be created and kept (for example, code for a <code>Dictionary&lt;TKey, TValue&gt;<\/code>&#8216;s <code>IEnumerable&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;<\/code>).<\/li>\n<li>No one ever calls the <code>Equals<\/code>\/<code>GetHashCode<\/code> methods on the <code>AsyncStateMachine<\/code> structs produced by the C# compiler for async methods; they&#8217;re a hidden implementation detail, but even so, such virtual methods are in general kept rooted in a Native AOT app (and whereas CoreCLR can use reflection to provide the implementation of these methods for value types, Native AOT needs customized code emitted for each). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83369\">dotnet\/runtime#83369<\/a> special-cased these to avoid them being kept, shaving another ~1% off a minimal APIs app.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83937\">dotnet\/runtime#83937<\/a> reduced the size of static constructor contexts, data structures used to pass information about a type&#8217;s static <code>cctor<\/code> between portions of the system.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84463\">dotnet\/runtime#84463<\/a> made a few tweaks that ended up avoiding creating <code>MethodTable<\/code>s for <code>double<\/code>\/<code>float<\/code> and that reduced reliance on some array methods, shaving another ~3% off HelloWorld.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84156\">dotnet\/runtime#84156<\/a> manually split a method into two portions such that some lesser-used code isn&#8217;t always brought in when using the more commonly-used code; this saved another several hundred kilobytes.<\/li>\n<li><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84224\">dotnet\/runtime#84224<\/a> improved handling of the common pattern <code>typeof(T) == typeof(Something)<\/code> that&#8217;s often used to do generic specialization (e.g. in code like <code>MemoryExtensions<\/code>), doing it in a way that makes it easier to get rid of side effects from branches that are trimmed away.<\/li>\n<li>The GC includes a vectorized sort implementation called <code>vxsort<\/code>. When building with a configuration optimized for size, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85036\">dotnet\/runtime#85036<\/a> enabled removing that throughput optimization, saving several hundred kilobytes.<\/li>\n<li><code>ValueTuple&lt;...&gt;<\/code> is a very handy type, but it brings a lot of code with it, as it implements multiple interfaces which then end up rooting functionality on the generic type parameters. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87120\">dotnet\/runtime#87120<\/a> removed a use of <code>ValueTuple&lt;T1, T2&gt;<\/code> from <code>SynchronizationContext<\/code>, saving ~200KB.<\/li>\n<li>On Linux specifically, a large improvement came from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85139\">dotnet\/runtime#85139<\/a>. Debug symbols were previously being stored in the published executable; with this change, symbols are stripped from the executable and are instead stored in a separate <code>.dbg<\/code> file built next to it. Someone who wants to revert to keeping the symbols in the executable can add <code>&lt;StripSymbols&gt;false&lt;\/StripSymbols&gt;<\/code> to their project.<\/li>\n<\/ul>\n<p>You get the idea. The improvements go beyond nipping and tucking here and there within the Native AOT compiler, though. Individual libraries also contributed. 
For example:<\/p>\n<ul>\n<li>\n<p><code>HttpClient<\/code> supports automatic decompression of response streams, for both <code>deflate<\/code> and <code>brotli<\/code>, and that in turn means that any <code>HttpClient<\/code> use implicitly brings with it most of <code>System.IO.Compression<\/code>. However, by default that decompression isn&#8217;t enabled, and you need to opt-in to it by explicitly setting the <code>AutomaticDecompression<\/code> property on the <code>HttpClientHandler<\/code> or <code>SocketsHttpHandler<\/code> in use. So, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78198\">dotnet\/runtime#78198<\/a> employs a trick where rather than <code>SocketsHttpHandler<\/code>&#8216;s main code paths relying directly on the internal <code>DecompressionHandler<\/code> that does this work, it instead relies on a delegate. The field storing that delegate starts out as null, and then as part of the <code>AutomaticDecompression<\/code> setter, that field is set to a delegate that will do the decompression work. That means that if the trimmer doesn&#8217;t see any code accessing the <code>AutomaticDecompression<\/code> setter such that the setter can be trimmed away, then all of the <code>DecompressionHandler<\/code> and its reliance on <code>DeflateStream<\/code> and <code>BrotliStream<\/code> can also be trimmed away. Since it&#8217;s a little confusing to read, here&#8217;s a representation of it:<\/p>\n<pre><code class=\"language-C#\">private DecompressionMethods _automaticDecompression;\r\nprivate Func&lt;Stream, Stream&gt;? _getStream;\r\n\r\npublic DecompressionMethods AutomaticDecompression\r\n{\r\n    get =&gt; _automaticDecompression;\r\n    set\r\n    {\r\n        _automaticDecompression = value;\r\n        _getStream ??= CreateDecompressionStream;\r\n    }\r\n}\r\n\r\npublic Stream GetStreamAsync()\r\n{\r\n    Stream response = ...;\r\n    return _getStream is not null ? 
_getStream(response) : response;\r\n}\r\n\r\nprivate static Stream CreateDecompressionStream(Stream stream) =&gt;\r\n    UseGZip   ? new GZipStream(stream, CompressionMode.Decompress) :\r\n    UseZLib   ? new ZLibStream(stream, CompressionMode.Decompress) :\r\n    UseBrotli ? new BrotliStream(stream, CompressionMode.Decompress) :\r\n    stream;<\/code><\/pre>\n<p>The <code>CreateDecompressionStream<\/code> method here is the one that references all of the compression-related code, and the only code path that touches it is in the <code>AutomaticDecompression<\/code> setter. Therefore, if nothing in the app accesses the setter, the setter can be trimmed, which means the <code>CreateDecompressionStream<\/code> method can also be trimmed, which means if nothing else in the app is using these compression streams, they can also be trimmed.<\/p>\n<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80884\">dotnet\/runtime#80884<\/a> is another example, saving ~90KB of size when <code>Regex<\/code> is used by just being a bit more intentional about what types are being used in its implementation (e.g. using a <code>bool[30]<\/code> instead of a <code>HashSet&lt;UnicodeCategory&gt;<\/code> to store a bitmap).<\/li>\n<li>Or particularly interesting, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84169\">dotnet\/runtime#84169<\/a>, which adds a new feature switch to <code>System.Xml<\/code>. Various APIs in <code>System.Xml<\/code> use <code>Uri<\/code>, which can trigger use of <code>XmlUrlResolver<\/code>, which in turn references the networking stack; an app that&#8217;s using XML but not otherwise using networking can end up inadvertently bringing in upwards of 3MB of networking code, just by using an API like <code>XDocument.Load(\"filepath.xml\")<\/code>. 
Such an app can use the <code>&lt;XmlResolverIsNetworkingEnabledByDefault&gt;<\/code> MSBuild property added in <a href=\"https:\/\/github.com\/dotnet\/sdk\/pull\/34412\">dotnet\/sdk#34412<\/a> to enable all of those code paths in XML to be trimmed away.<\/li>\n<li><code>ActivatorUtilities.CreateFactory<\/code> in <code>Microsoft.Extensions.DependencyInjection.Abstractions<\/code> tries to optimize throughput by spending some time upfront to build a factory that&#8217;s then very efficient at creating things. Its main strategy for doing so involved using <code>System.Linq.Expressions<\/code> as a simpler API for using reflection emit, building up custom IL for the exact thing being constructed. When you have a JIT, that can work very well. But when dynamic code isn&#8217;t supported, <code>System.Linq.Expressions<\/code> can&#8217;t use reflection emit and instead falls back to using an interpreter. That makes such an &#8220;optimization&#8221; in <code>CreateFactory<\/code> actually a deoptimization, plus it brings with it the size impact of <code>System.Linq.Expressions.dll<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81262\">dotnet\/runtime#81262<\/a> adds a reflection-based alternative for when <code>!RuntimeFeature.IsDynamicCodeSupported<\/code>, resulting in faster code and allowing the <code>System.Linq.Expressions<\/code> usage to be trimmed away.<\/li>\n<\/ul>\n<p>Of course, while size was a large focus for .NET 8, there are a multitude of other ways in which performance with Native AOT has improved. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79709\">dotnet\/runtime#79709<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80969\">dotnet\/runtime#80969<\/a> avoid helper calls as part of reading static fields. 
BenchmarkDotNet works with Native AOT as well, so we can run the following benchmark to compare; instead of using <code>--runtimes net7.0 net8.0<\/code>, we just use <code>--runtimes nativeaot7.0 nativeaot8.0<\/code> (BenchmarkDotNet also currently doesn&#8217;t support the <code>[DisassemblyDiagnoser]<\/code> with Native AOT):<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes nativeaot7.0 nativeaot8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly int s_configValue = 42;\r\n\r\n    [Benchmark]\r\n    public int GetConfigValue() =&gt; s_configValue;\r\n}<\/code><\/pre>\n<p>For that, BenchmarkDotNet outputs:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetConfigValue<\/td>\n<td>NativeAOT 7.0<\/td>\n<td style=\"text-align: right\">1.1759 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>GetConfigValue<\/td>\n<td>NativeAOT 8.0<\/td>\n<td style=\"text-align: right\">0.0000 ns<\/td>\n<td style=\"text-align: right\">0.000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>including:<\/p>\n<pre><code class=\"language-text\">\/\/ * Warnings *\r\nZeroMeasurement\r\n  Tests.GetConfigValue: Runtime=NativeAOT 8.0, Toolchain=Latest ILCompiler -&gt; The method duration is indistinguishable from the empty method duration<\/code><\/pre>\n<p>(When looking at the output of optimizations, that warning always brings a smile to my face.)<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83054\">dotnet\/runtime#83054<\/a> is another good example. 
It improves upon <code>EqualityComparer&lt;T&gt;<\/code> support in Native AOT by ensuring that the comparer can be stored in a <code>static readonly<\/code> to enable better constant folding in consumers.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes nativeaot7.0 nativeaot8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly int[] _array = Enumerable.Range(0, 1000).ToArray();\r\n\r\n    [Benchmark]\r\n    public int FindIndex() =&gt; FindIndex(_array, 999);\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static int FindIndex&lt;T&gt;(T[] array, T value)\r\n    {\r\n        for (int i = 0; i &lt; array.Length; i++)\r\n            if (EqualityComparer&lt;T&gt;.Default.Equals(array[i], value))\r\n                return i;\r\n\r\n        return -1;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FindIndex<\/td>\n<td>NativeAOT 7.0<\/td>\n<td style=\"text-align: right\">876.2 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FindIndex<\/td>\n<td>NativeAOT 8.0<\/td>\n<td style=\"text-align: right\">367.8 ns<\/td>\n<td style=\"text-align: right\">0.42<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As another example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83911\">dotnet\/runtime#83911<\/a> avoids some overhead related to static class initialization. 
As we discussed in the JIT section, the JIT is able to rely on tiering to know that a static field accessed by a method must have already been initialized if the method is being promoted from tier 0 to tier 1, but tiering doesn&#8217;t exist in the Native AOT world, so this PR adds a fast-path check to help avoid most of the costs.<\/p>\n<p>Other fundamental support has also improved. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79519\">dotnet\/runtime#79519<\/a>, for example, changes how locks are implemented for Native AOT, employing a hybrid approach that starts with a lightweight spinlock and upgrades to using the <code>System.Threading.Lock<\/code> type (which is currently internal to Native AOT but likely to ship publicly in .NET 9).<\/p>\n<h2>VM<\/h2>\n<p>The VM is, loosely speaking, the part of the runtime that&#8217;s not the JIT or the GC. It&#8217;s what handles things like assembly and type loading. While there were a multitude of improvements throughout, I&#8217;ll call out three notable improvements.<\/p>\n<p>First, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79021\">dotnet\/runtime#79021<\/a> optimized the operation of mapping an instruction pointer to a <code>MethodDesc<\/code> (a data structure that represents a method, with various pieces of information about it, like its signature), which happens in particular any time stack walking is performed (e.g. exception handling, <code>Environment.StackTrace<\/code>, etc.) and as part of some delegate creations. 
The change not only makes this conversion faster but also mostly lock-free, which means on a benchmark like the following, there&#8217;s a significant improvement for sequential use but an even larger one for multi-threaded use:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public void InSerial()\r\n    {\r\n        for (int i = 0; i &lt; 10_000; i++)\r\n        {\r\n            CreateDelegate&lt;string&gt;();\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void InParallel()\r\n    {\r\n        Parallel.For(0, 10_000, i =&gt;\r\n        {\r\n            CreateDelegate&lt;string&gt;();\r\n        });\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.NoInlining)]\r\n    private static Action&lt;T&gt; CreateDelegate&lt;T&gt;() =&gt; new Action&lt;T&gt;(GenericMethod);\r\n\r\n    private static void GenericMethod&lt;T&gt;(T t) { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>InSerial<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,868.4 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>InSerial<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">706.5 us<\/td>\n<td style=\"text-align: right\">0.38<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>InParallel<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,247.3 us<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>InParallel<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">222.9 us<\/td>\n<td style=\"text-align: right\">0.18<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Second, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83632\">dotnet\/runtime#83632<\/a> improves the performance of the <code>ExecutableAllocator<\/code>. This allocator is responsible for allocation related to all executable memory in the runtime, e.g. the JIT uses it to get memory into which to write the generated code that will then need to be executed. When memory is mapped, it has permissions associated with it for what can be done with that memory, e.g. can it be read and written, can it be executed, etc. The allocator maintains a cache, and this PR improved the performance of the allocator by reducing the number of cache misses incurred and reducing the cost of those cache misses when they do occur.<\/p>\n<p>Third, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85743\">dotnet\/runtime#85743<\/a> makes a variety of changes focused on significantly reducing startup time. This includes reducing the amount of time spent on validation of types in R2R images, making lookups for generic parameters and nested types in R2R images much faster due to dedicated metadata in the R2R image, converting an <code>O(n^2)<\/code> lookup into an <code>O(1)<\/code> lookup by storing an additional index in a method description, and ensuring that vtable chunks are always shared.<\/p>\n<h2>GC<\/h2>\n<p>At the beginning of this post, I suggested that <code>&lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt;<\/code> be added to the csproj used for running the benchmarks in this post. That setting configures the GC to run in &#8220;server&#8221; mode, as opposed to &#8220;workstation&#8221; mode. 
The workstation mode was designed for use with client applications and is less resource intensive, preferring to use less memory but at the possible expense of throughput and scalability if the system is placed under heavier load. In contrast, the server mode was designed for larger-scale services. It is much more resource hungry, with a dedicated heap by default per logical core in the machine, and a dedicated thread per heap for servicing that heap, but it is also significantly more scalable. This tradeoff often leads to complication, as while applications might demand the scalability of the server GC, they may also want memory consumption closer to that of workstation, at least at times when demand is lower and the service needn&#8217;t have so many heaps.<\/p>\n<p>In .NET 8, the server GC now has support for a dynamic heap count, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86245\">dotnet\/runtime#86245<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87618\">dotnet\/runtime#87618<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87619\">dotnet\/runtime#87619<\/a>, which add a feature dubbed &#8220;Dynamic Adaptation To Application Sizes&#8221;, or DATAS. It&#8217;s off-by-default in .NET 8 in general (though on-by-default when publishing for Native AOT), but it can be enabled trivially, either by setting the <code>DOTNET_GCDynamicAdaptationMode<\/code> environment variable to <code>1<\/code>, or via the <code>&lt;GarbageCollectionAdaptationMode&gt;1&lt;\/GarbageCollectionAdaptationMode&gt;<\/code> MSBuild property. The employed algorithm is able to increase and decrease the heap count over time, trying to maximize its view of throughput, and maintaining a balance between that and overall memory footprint.<\/p>\n<p>Here&#8217;s a simple example. 
I create a console app with <code>&lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt;<\/code> in the .csproj and the following code in Program.cs, which just spawns a bunch of threads that continually allocate, and then repeatedly prints out the working set:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Diagnostics;\r\n\r\nfor (int i = 0; i &lt; 32; i++)\r\n{\r\n    new Thread(() =&gt;\r\n    {\r\n        while (true) Array.ForEach(new byte[1], b =&gt; { });\r\n    }).Start();\r\n}\r\n\r\nusing Process process = Process.GetCurrentProcess();\r\nwhile (true)\r\n{\r\n    process.Refresh();\r\n    Console.WriteLine($\"{process.WorkingSet64:N0}\");\r\n    Thread.Sleep(1000);\r\n}<\/code><\/pre>\n<p>When I run that, I consistently see output like:<\/p>\n<pre><code class=\"language-text\">154,226,688\r\n154,226,688\r\n154,275,840\r\n154,275,840\r\n154,816,512\r\n154,816,512\r\n154,816,512\r\n154,824,704\r\n154,824,704\r\n154,824,704<\/code><\/pre>\n<p>When I then add <code>&lt;GarbageCollectionAdaptationMode&gt;1&lt;\/GarbageCollectionAdaptationMode&gt;<\/code> to the .csproj, the working set drops significantly:<\/p>\n<pre><code class=\"language-text\">71,430,144\r\n72,187,904\r\n72,196,096\r\n72,196,096\r\n72,245,248\r\n72,245,248\r\n72,245,248\r\n72,245,248\r\n72,245,248\r\n72,253,440<\/code><\/pre>\n<p>For a more detailed examination of the feature and plans for it, see <a href=\"https:\/\/maoni0.medium.com\/dynamically-adapting-to-application-sizes-2d72fcb6f1ea\">Dynamically Adapting To Application Sizes<\/a>.<\/p>\n<h2>Mono<\/h2>\n<p>Thus far I&#8217;ve referred to &#8220;the runtime&#8221;, &#8220;the JIT&#8221;, &#8220;the GC&#8221;, and so on. That&#8217;s all in the context of the &#8220;CoreCLR&#8221; runtime, which is the primary runtime used for console applications, ASP.NET applications, services, desktop applications, and the like. 
For mobile and browser .NET applications, however, the primary runtime used is the &#8220;Mono&#8221; runtime. And it also has seen some huge improvements in .NET 8, improvements that accrue to scenarios like Blazor WebAssembly apps.<\/p>\n<p>Just as with CoreCLR there&#8217;s both the ability to JIT and AOT, there are multiple ways in which code can be shipped for Mono. Mono includes an AOT compiler; for WASM in particular, the AOT compiler enables all of the IL to be compiled to WASM, which is then shipped down to the browser. As with CoreCLR, however, AOT is opt-in. The default experience for WASM is to use an interpreter: the IL is shipped down to the browser, and the interpreter (which itself is compiled to WASM) then interprets the IL. Of course, interpretation has performance implications, and so .NET 7 augmented the interpreter with a tiering scheme similar in concept to the tiering employed by the CoreCLR JIT. The interpreter has its own representation of the code to be interpreted, and the first few times a method is invoked, it just interprets that byte code with little effort put into optimizing it. Then after enough invocations, the interpreter will take some time to optimize that internal representation so as to speed up subsequent interpretations. Even with that, however, it&#8217;s still interpreting: it&#8217;s still an interpreter implemented in WASM reading instructions for what to do and doing them. One of the most notable improvements to Mono in .NET 8 expands on this tiering by introducing a partial JIT into the interpreter. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76477\">dotnet\/runtime#76477<\/a> provided the initial code for this &#8220;jiterpreter,&#8221; as some folks refer to it. As part of the interpreter, this JIT is able to participate in the same data structures used by the interpreter and process the same byte code, and works by replacing sequences of that byte code with on-the-fly generated WASM. 
That could be a whole method, it could just be a hot loop within a method, or it could be just a few instructions. This provides significant flexibility, including a very progressive on-ramp where optimizations can be added incrementally, shifting more and more logic from interpretation to jitted WASM. Dozens of PRs went into making the jiterpreter a reality for .NET 8, such as <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82773\">dotnet\/runtime#82773<\/a> that added basic SIMD support, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82756\">dotnet\/runtime#82756<\/a> that added basic loop support, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83247\">dotnet\/runtime#83247<\/a> that added a control-flow optimization pass.<\/p>\n<p>Let&#8217;s see this in action. I created a new .NET 7 Blazor WebAssembly project, added a NuGet reference to the <code>System.IO.Hashing<\/code> package, and replaced the contents of <code>Counter.razor<\/code> with the following:<\/p>\n<pre><code class=\"language-C#\">@page \"\/counter\"\r\n@using System.Diagnostics;\r\n@using System.IO.Hashing;\r\n@using System.Text;\r\n@using System.Threading.Tasks;\r\n\r\n&lt;h1&gt;.NET 7&lt;\/h1&gt;\r\n\r\n&lt;p role=\"status\"&gt;Current time: @_time&lt;\/p&gt;\r\n\r\n&lt;button class=\"btn btn-primary\" @onclick=\"Hash\"&gt;Click me&lt;\/button&gt;\r\n\r\n@code {\r\n    private TimeSpan _time;\r\n\r\n    private void Hash()\r\n    {\r\n        var sw = Stopwatch.StartNew();\r\n        for (int i = 0; i &lt; 50_000; i++) XxHash64.HashToUInt64(_data);\r\n        _time = sw.Elapsed;\r\n    }\r\n\r\n    private byte[] _data =\r\n        @\"Shall I compare thee to a summer's day?\r\n          Thou art more lovely and more temperate:\r\n          Rough winds do shake the darling buds of May,\r\n          And summer's lease hath all too short a date;\r\n          Sometime too hot the eye of heaven shines,\r\n          And often is his gold complexion dimm'd;\r\n          And 
every fair from fair sometime declines,\r\n          By chance or nature's changing course untrimm'd;\r\n          But thy eternal summer shall not fade,\r\n          Nor lose possession of that fair thou ow'st;\r\n          Nor shall death brag thou wander'st in his shade,\r\n          When in eternal lines to time thou grow'st:\r\n          So long as men can breathe or eyes can see,\r\n          So long lives this, and this gives life to thee.\"u8.ToArray();\r\n}<\/code><\/pre>\n<p>Then I did the exact same thing, but for .NET 8, built both in Release, and ran them both. When the resulting page opened for each, I clicked the &#8220;Click me&#8221; button (a few times, but it didn&#8217;t change the results).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/Net7vs8Interpreter.png\" alt=\"Interpreted WASM on .NET 7 vs .NET 8\" \/><\/p>\n<p>The timing measurements for how long the operation took in .NET 7 compared to .NET 8 speak for themselves.<\/p>\n<p>Beyond the jiterpreter, the interpreter itself saw a multitude of improvements, for example:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79165\">dotnet\/runtime#79165<\/a> added special handling of the <code>stobj<\/code> IL instruction for when the value type doesn&#8217;t contain any references, and thus doesn&#8217;t need to interact with the GC.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80046\">dotnet\/runtime#80046<\/a> special-cased a compare followed by <code>brtrue<\/code>\/<code>brfalse<\/code>, creating a single interpreter opcode for the very common pattern.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79392\">dotnet\/runtime#79392<\/a> added an intrinsic to the interpreter for string creation.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78840\">dotnet\/runtime#78840<\/a> added a cache to the Mono runtime (including for but not limited to 
the interpreter) for various pieces of information about types, like <code>IsValueType<\/code>, <code>IsGenericTypeDefinition<\/code>, and <code>IsDelegate<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81782\">dotnet\/runtime#81782<\/a> added intrinsics for some of the most common operations on <code>Vector128<\/code>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86859\">dotnet\/runtime#86859<\/a> augmented this to use those same opcodes for <code>Vector&lt;T&gt;<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83498\">dotnet\/runtime#83498<\/a> special-cased division by powers of 2 to instead employ shifts.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83490\">dotnet\/runtime#83490<\/a> tweaked the inlining size limit to ensure that key methods could be inlined, like <code>List&lt;T&gt;<\/code>&#8217;s indexer.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85528\">dotnet\/runtime#85528<\/a> added devirtualization support in situations where enough type information is available to enable doing so.<\/li>\n<\/ul>\n<p>I&#8217;ve already alluded several times to vectorization in Mono, but in its own right this has been a big area of focus for Mono in .NET 8, across all backends. As of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86546\">dotnet\/runtime#86546<\/a>, which completed adding <code>Vector128&lt;T&gt;<\/code> support for Mono&#8217;s AMD64 JIT backend, <code>Vector128&lt;T&gt;<\/code> is now supported across all Mono backends. Mono&#8217;s WASM backends do more than support <code>Vector128&lt;T&gt;<\/code>: .NET 8 also includes the new <code>System.Runtime.Intrinsics.Wasm.PackedSimd<\/code> type, which is specific to WASM and exposes hundreds of overloads that map down to WASM SIMD operations. 
The basis for this type was introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73289\">dotnet\/runtime#73289<\/a>, where the initial SIMD support was added as internal. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76539\">dotnet\/runtime#76539<\/a> continued the effort by adding more functionality and also making the type public, as it now is in .NET 8. Over a dozen PRs continued to build it out, such as <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80145\">dotnet\/runtime#80145<\/a> that added <code>ConditionalSelect<\/code> intrinsics, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87052\">dotnet\/runtime#87052<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87828\">dotnet\/runtime#87828<\/a> that added load and store intrinsics, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85705\">dotnet\/runtime#85705<\/a> that added floating-point support, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88595\">dotnet\/runtime#88595<\/a>, which overhauled the surface area based on learnings since its initial design.<\/p>\n<p>Another effort in .NET 8, related to app size, has been around reducing reliance on ICU&#8217;s data files (ICU is the globalization library employed by .NET and many other systems). Instead, the goal is to rely on the target platform&#8217;s native APIs wherever possible (for WASM, APIs provided by the browser). This effort is referred to as &#8220;hybrid globalization,&#8221; because the dependence on ICU&#8217;s data files still remains, it&#8217;s just lessened, and it comes with behavioral changes, so it&#8217;s opt-in for situations where someone really wants the smaller size and is willing to deal with the behavioral accommodations. 
A multitude of PRs have also gone into making this a reality for .NET 8, such as <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81470\">dotnet\/runtime#81470<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84019\">dotnet\/runtime#84019<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84249\">dotnet\/runtime#84249<\/a>. To enable the feature, you can add <code>&lt;HybridGlobalization&gt;true&lt;\/HybridGlobalization&gt;<\/code> to your .csproj, and for more information, there&#8217;s a <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/main\/docs\/design\/features\/globalization-hybrid-mode.md\">good design document<\/a> that goes into much more depth.<\/p>\n<h2>Threading<\/h2>\n<p>Recent releases of .NET saw huge improvements to the area of threading, parallelism, concurrency, and asynchrony, such as a complete rewrite of the <code>ThreadPool<\/code> (in .NET 6 and .NET 7), a complete rewrite of the async method infrastructure (in .NET Core 2.1), a complete rewrite of <code>ConcurrentQueue&lt;T&gt;<\/code> (in .NET Core 2.0), and so on. This release doesn&#8217;t include such massive overhauls, but it does include some thoughtful and impactful improvements.<\/p>\n<h3>ThreadStatic<\/h3>\n<p>The .NET runtime makes it easy to associate data with a thread, often referred to as thread-local storage (TLS). The most common way to achieve this is by annotating a static field with the <code>[ThreadStatic]<\/code> attribute (another for more advanced uses is via the <code>ThreadLocal&lt;T&gt;<\/code> type), which causes the runtime to replicate the storage for that field to be per thread rather than global for the process.<\/p>\n<pre><code class=\"language-C#\">private static int s_onePerProcess;\r\n\r\n[ThreadStatic]\r\nprivate static int t_onePerThread;<\/code><\/pre>\n<p>Historically, accessing such a <code>[ThreadStatic]<\/code> field has required a non-inlined JIT helper call (e.g. 
<code>CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR<\/code>), but now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82973\">dotnet\/runtime#82973<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85619\">dotnet\/runtime#85619<\/a>, the common and fast path from that helper can be inlined into the caller. We can see this with a simple benchmark that just increments an <code>int<\/code> stored in a <code>[ThreadStatic]<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes nativeaot7.0 nativeaot8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    [ThreadStatic]\r\n    private static int t_value;\r\n\r\n    [Benchmark]\r\n    public int Increment() =&gt; ++t_value;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Increment<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">8.492 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Increment<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.453 ns<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>[ThreadStatic]<\/code> was similarly optimized for Native AOT, via both <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84566\">dotnet\/runtime#84566<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87148\">dotnet\/runtime#87148<\/a>:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: 
right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Increment<\/td>\n<td>NativeAOT 7.0<\/td>\n<td style=\"text-align: right\">2.305 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Increment<\/td>\n<td>NativeAOT 8.0<\/td>\n<td style=\"text-align: right\">1.325 ns<\/td>\n<td style=\"text-align: right\">0.57<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>ThreadPool<\/h3>\n<p>Let&#8217;s try an experiment. Create a new console app, and add <code>&lt;PublishAot&gt;true&lt;\/PublishAot&gt;<\/code> to the .csproj. Then make the entirety of the program this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nTask.Run(() =&gt; Console.WriteLine(Environment.StackTrace)).Wait();<\/code><\/pre>\n<p>The idea is to see the stack trace of a work item running on a <code>ThreadPool<\/code> thread. Now run it, and you should see something like this:<\/p>\n<pre><code class=\"language-text\">   at System.Environment.get_StackTrace()\r\n   at Program.&lt;&gt;c.&lt;&lt;Main&gt;$&gt;b__0_0()\r\n   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)\r\n   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task&amp; currentTaskSlot, Thread threadPoolThread)\r\n   at System.Threading.ThreadPoolWorkQueue.Dispatch()\r\n   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()<\/code><\/pre>\n<p>The important piece here is the bottom line: we see we&#8217;re being called from the <code>PortableThreadPool<\/code>, which is the managed thread pool implementation that&#8217;s been used across operating systems since .NET 6. 
Now, instead of running directly, let&#8217;s publish for Native AOT and run the resulting app (for the specific thing we&#8217;re looking for, this part should be done on Windows).<\/p>\n<pre><code class=\"language-sh\">dotnet publish -c Release -r win-x64\r\nD:\\examples\\tmpapp\\bin\\Release\\net8.0\\win-x64\\publish\\tmpapp.exe<\/code><\/pre>\n<p>Now, we see this:<\/p>\n<pre><code class=\"language-text\">   at System.Environment.get_StackTrace() + 0x21\r\n   at Program.&lt;&gt;c.&lt;&lt;Main&gt;$&gt;b__0_0() + 0x9\r\n   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread, ExecutionContext, ContextCallback, Object) + 0x3d\r\n   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task&amp;, Thread) + 0xcc\r\n   at System.Threading.ThreadPoolWorkQueue.Dispatch() + 0x289\r\n   at System.Threading.WindowsThreadPool.DispatchCallback(IntPtr, IntPtr, IntPtr) + 0x45<\/code><\/pre>\n<p>Again, note the last line: &#8220;WindowsThreadPool.&#8221; Applications published with Native AOT <em>on Windows<\/em> have historically used a <code>ThreadPool<\/code> implementation that wraps the <a href=\"https:\/\/learn.microsoft.com\/windows\/win32\/procthread\/using-the-thread-pool-functions\">Windows thread pool<\/a>. The work item queues and dispatching code are all the same as with the portable pool, but the thread management itself is delegated to the Windows pool. Now in .NET 8 with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85373\">dotnet\/runtime#85373<\/a>, projects <em>on Windows<\/em> have the option of using either pool; Native AOT apps can opt to instead use the portable pool, and other apps can opt to instead use the Windows pool. Opting in or out is easy: in a <code>&lt;PropertyGroup\/&gt;<\/code> in the .csproj, add <code>&lt;UseWindowsThreadPool&gt;false&lt;\/UseWindowsThreadPool&gt;<\/code> to opt out in a Native AOT app, and conversely use <code>true<\/code> in other apps to opt in. 
When using this MSBuild switch, in a Native AOT app, whichever pool isn&#8217;t being used can automatically be trimmed away. For experimentation, the <code>DOTNET_ThreadPool_UseWindowsThreadPool<\/code> environment variable can also be set to <code>0<\/code> or <code>1<\/code> to explicitly opt out or in, respectively.<\/p>\n<p>There&#8217;s currently no hard-and-fast rule about why one pool might be better; the option has been added to allow developers to experiment. We&#8217;ve seen with the Windows pool that I\/O doesn&#8217;t scale as well on larger machines as it does with the portable pool. However, if the Windows thread pool is already being used heavily elsewhere in the application, consolidating into the same pool can reduce oversubscription. Further, if thread pool threads get blocked very frequently, the Windows thread pool has more information about that blocking and can potentially handle those scenarios more efficiently. We can see this with a simple example. Compile this code:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Diagnostics;\r\n\r\nvar sw = Stopwatch.StartNew();\r\n\r\nvar barrier = new Barrier(Environment.ProcessorCount * 2 + 1);\r\nfor (int i = 0; i &lt; barrier.ParticipantCount; i++)\r\n{\r\n    ThreadPool.QueueUserWorkItem(id =&gt;\r\n    {\r\n        Console.WriteLine($\"{sw.Elapsed}: {id}\");\r\n        barrier.SignalAndWait();\r\n    }, i);\r\n}\r\n\r\nbarrier.SignalAndWait();\r\nConsole.WriteLine($\"Done: {sw.Elapsed}\");<\/code><\/pre>\n<p>This is a dastardly repro that creates a bunch of work items, all of which block until all of the work items have been processed: basically it takes every thread the thread pool gives it and never gives it back (until the program exits). 
When I run this on my machine where <code>Environment.ProcessorCount<\/code> is 12, I get output like this:<\/p>\n<pre><code class=\"language-text\">00:00:00.0038906: 0\r\n00:00:00.0038911: 1\r\n00:00:00.0042401: 4\r\n00:00:00.0054198: 9\r\n00:00:00.0047249: 6\r\n00:00:00.0040724: 3\r\n00:00:00.0044894: 5\r\n00:00:00.0052228: 8\r\n00:00:00.0049638: 7\r\n00:00:00.0056831: 10\r\n00:00:00.0039327: 2\r\n00:00:00.0057127: 11\r\n00:00:01.0265278: 12\r\n00:00:01.5325809: 13\r\n00:00:02.0471848: 14\r\n00:00:02.5628161: 15\r\n00:00:03.5805581: 16\r\n00:00:04.5960218: 17\r\n00:00:05.1087192: 18\r\n00:00:06.1142907: 19\r\n00:00:07.1331915: 20\r\n00:00:07.6467355: 21\r\n00:00:08.1614072: 22\r\n00:00:08.6749720: 23\r\n00:00:08.6763938: 24\r\nDone: 00:00:08.6768608<\/code><\/pre>\n<p>The portable pool quickly injects <code>Environment.ProcessorCount<\/code> threads, but after that it proceeds to only inject an additional thread once or twice a second. Now, set <code>DOTNET_ThreadPool_UseWindowsThreadPool<\/code> to <code>1<\/code> and try again:<\/p>\n<pre><code class=\"language-text\">00:00:00.0034909: 3\r\n00:00:00.0036281: 4\r\n00:00:00.0032404: 0\r\n00:00:00.0032727: 1\r\n00:00:00.0032703: 2\r\n00:00:00.0447256: 5\r\n00:00:00.0449398: 6\r\n00:00:00.0451899: 7\r\n00:00:00.0454245: 8\r\n00:00:00.0456907: 9\r\n00:00:00.0459155: 10\r\n00:00:00.0461399: 11\r\n00:00:00.0463612: 12\r\n00:00:00.0465538: 13\r\n00:00:00.0467497: 14\r\n00:00:00.0469477: 15\r\n00:00:00.0471055: 16\r\n00:00:00.0472961: 17\r\n00:00:00.0474888: 18\r\n00:00:00.0477131: 19\r\n00:00:00.0478795: 20\r\n00:00:00.0480844: 21\r\n00:00:00.0482900: 22\r\n00:00:00.0485110: 23\r\n00:00:00.0486981: 24\r\nDone: 00:00:00.0498603<\/code><\/pre>\n<p>Zoom. The Windows pool is <em>much<\/em> more aggressive about injecting threads here. Whether that&#8217;s good or bad can depend on your scenario. 
If you&#8217;ve found yourself setting a really high minimum thread pool thread count for your application, you might want to give this option a go.<\/p>\n<h3>Tasks<\/h3>\n<p>Even with all the improvements to async\/await in previous releases, this release sees async methods get cheaper still, both when they complete synchronously and when they complete asynchronously.<\/p>\n<p>When an async <code>Task<\/code>\/<code>Task&lt;TResult&gt;<\/code>-returning method completes synchronously, it tries to give back a cached task object rather than creating a new one and incurring the allocation. In the case of <code>Task<\/code>, that&#8217;s easy: it can simply use <code>Task.CompletedTask<\/code>. In the case of <code>Task&lt;TResult&gt;<\/code>, it uses a cache that stores cached tasks for some <code>TResult<\/code> values. When <code>TResult<\/code> is <code>Boolean<\/code>, for example, it can successfully cache a <code>Task&lt;bool&gt;<\/code> for both <code>true<\/code> and <code>false<\/code>, such that it&#8217;ll always successfully avoid the allocation. For <code>int<\/code>, it caches a few tasks for common values (e.g. <code>-1<\/code> through <code>8<\/code>). For reference types, it caches a task for <code>null<\/code>. And for the primitive integer types (<code>sbyte<\/code>, <code>byte<\/code>, <code>short<\/code>, <code>ushort<\/code>, <code>char<\/code>, <code>int<\/code>, <code>uint<\/code>, <code>long<\/code>, <code>ulong<\/code>, <code>nint<\/code>, and <code>nuint<\/code>), it caches a task for <code>0<\/code>. It used to be that all of this logic was dedicated to async methods, but in .NET 6 that logic moved into <code>Task.FromResult<\/code>, such that all uses of <code>Task.FromResult<\/code> now benefit from this caching. In .NET 8, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76349\">dotnet\/runtime#76349<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87541\">dotnet\/runtime#87541<\/a>, the caching is improved further. 
In particular, the optimization of caching a task for <code>0<\/code> for the primitive types is extended to be the caching of a task for <code>default(TResult)<\/code> for any value type <code>TResult<\/code> that is 1, 2, 4, 8, or 16 bytes. In such cases, we can do an unsafe cast to one of these primitives, and then use that primitive&#8217;s equality to compare against <code>default<\/code>. If that comparison is true, it means the value is entirely zeroed, which means we can use a cached task for <code>Task&lt;TResult&gt;<\/code> created from <code>default(TResult)<\/code>, as that is also entirely zeroed. What if that type has a custom equality comparer? That actually doesn&#8217;t matter, since the original value and the one stored in the cached task have identical bit patterns, which means they&#8217;re indistinguishable. The net effect of this is we can cache tasks for other commonly used types.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark] public async Task&lt;TimeSpan&gt; ZeroTimeSpan() =&gt; TimeSpan.Zero;\r\n    [Benchmark] public async Task&lt;DateTime&gt; MinDateTime() =&gt; DateTime.MinValue;\r\n    [Benchmark] public async Task&lt;Guid&gt; EmptyGuid() =&gt; Guid.Empty;\r\n    [Benchmark] public async Task&lt;DayOfWeek&gt; Sunday() =&gt; DayOfWeek.Sunday;\r\n    [Benchmark] public async Task&lt;decimal&gt; ZeroDecimal() =&gt; 0m;\r\n    [Benchmark] public async Task&lt;double&gt; ZeroDouble() =&gt; 0;\r\n    [Benchmark] public async Task&lt;float&gt; ZeroFloat() =&gt; 0;\r\n    [Benchmark] public async Task&lt;Half&gt; ZeroHalf() =&gt; (Half)0f;\r\n    [Benchmark] public async 
Task&lt;(int, int)&gt; ZeroZeroValueTuple() =&gt; (0, 0);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ZeroTimeSpan<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">31.327 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ZeroTimeSpan<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.851 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>MinDateTime<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">31.457 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>MinDateTime<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.277 ns<\/td>\n<td style=\"text-align: right\">0.26<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>EmptyGuid<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">32.233 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>EmptyGuid<\/td>\n<td>.NET 8.0<\/td>\n<td 
style=\"text-align: right\">9.013 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Sunday<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">30.907 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Sunday<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.235 ns<\/td>\n<td style=\"text-align: right\">0.27<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ZeroDecimal<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">33.109 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ZeroDecimal<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">13.110 ns<\/td>\n<td style=\"text-align: right\">0.40<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ZeroDouble<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">30.863 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ZeroDouble<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.568 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ZeroFloat<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">31.025 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ZeroFloat<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.531 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ZeroHalf<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">33.906 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ZeroHalf<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">9.008 ns<\/td>\n<td style=\"text-align: right\">0.27<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ZeroZeroValueTuple<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">33.339 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">72 
B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ZeroZeroValueTuple<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">11.274 ns<\/td>\n<td style=\"text-align: right\">0.34<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Those changes helped some async methods to become leaner when they complete synchronously. Other changes have helped practically <em>all<\/em> async methods to become leaner when they complete asynchronously. When an async method suspends for the first time, assuming it&#8217;s returning <code>Task<\/code>\/<code>Task&lt;TResult&gt;<\/code>\/<code>ValueTask<\/code>\/<code>ValueTask&lt;TResult&gt;<\/code> and the default async method builders are in use (i.e. they haven&#8217;t been overridden using <code>[AsyncMethodBuilder(...)]<\/code> on the method in question), a single allocation occurs: the task object to be returned. That task object is actually a type derived from <code>Task<\/code> (in the implementation today the internal type is called <code>AsyncStateMachineBox&lt;TStateMachine&gt;<\/code>) and that has on it a strongly-typed field for the state machine struct generated by the C# compiler. In fact, as of .NET 7, it has three additional fields beyond what&#8217;s on the base <code>Task&lt;TResult&gt;<\/code>:<\/p>\n<ol>\n<li>One to hold the <code>TStateMachine<\/code> state machine struct generated by the C# compiler.<\/li>\n<li>One to cache an <code>Action<\/code> delegate that points to <code>MoveNext<\/code>.<\/li>\n<li>One to store an <code>ExecutionContext<\/code> to flow to the next <code>MoveNext<\/code> invocation.<\/li>\n<\/ol>\n<p>If we can trim down the fields required, we can make every async method less expensive by allocating smaller instead of larger objects. 
That&#8217;s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83696\">dotnet\/runtime#83696<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83737\">dotnet\/runtime#83737<\/a> accomplish, together shaving 16 bytes (in a 64-bit process) off the size of <em>every<\/em> such async method task. How?<\/p>\n<p>The C# language allows anything to be awaitable as long as it follows the right pattern, exposing a <code>GetAwaiter()<\/code> method that returns a type with the right shape. That pattern includes a set of &#8220;OnCompleted&#8221; methods that take an <code>Action<\/code> delegate, enabling the async method builder to provide a continuation to the awaiter, such that when the awaited operation completes, it can invoke the <code>Action<\/code> to resume the method&#8217;s processing. As such, the <code>AsyncStateMachineBox<\/code> type has on it a field used to cache an <code>Action<\/code> delegate that&#8217;s lazily created to point to its <code>MoveNext<\/code> method; that <code>Action<\/code> is created during the first suspending await where it&#8217;s needed and can then be used for all subsequent awaits, such that the <code>Action<\/code> is allocated at most once for the lifetime of an async method, regardless of how many times the invocation suspends. (The delegate is only needed, however, if the state machine awaits something that&#8217;s not a known awaiter; the runtime has fast paths that avoid requiring that <code>Action<\/code> when awaiting all of the built-in awaiters). Interestingly, though, <code>Task<\/code> itself has a field for storing a delegate, and that field is only used when the <code>Task<\/code> is created to invoke a delegate (e.g. <code>Task.Run<\/code>, <code>ContinueWith<\/code>, etc.). Since most tasks allocated today come from async methods, that means that the majority of tasks have all had a wasted field. 
It turns out we can just use that base field on the <code>Task<\/code> for this cached <code>MoveNext<\/code> <code>Action<\/code> as well, making the field relevant to almost all tasks, and allowing us to remove the extra <code>Action<\/code> field on the state machine box.<\/p>\n<p>There&#8217;s another existing field on the base <code>Task<\/code> that also goes unused in async methods: the state object field. When you use a method like <code>StartNew<\/code> or <code>ContinueWith<\/code> to create a <code>Task<\/code>, you can provide an <code>object state<\/code> that&#8217;s then passed to the <code>Task<\/code>&#8217;s delegate. In an async method, though, the field just sits there, unused, lonely, forgotten, forlorn. Instead of having a separate field for the <code>ExecutionContext<\/code>, then, we can just store the <code>ExecutionContext<\/code> in this existing state field (being careful not to allow it to be exposed via the <code>Task<\/code>&#8217;s <code>AsyncState<\/code> property that normally exposes the object state).<\/p>\n<p>We can see the effect of getting rid of those two fields with a simple benchmark like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public async Task YieldOnce() =&gt; await Task.Yield();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>YieldOnce<\/td>\n<td>.NET 7.0<\/td>\n<td 
style=\"text-align: right\">918.6 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">112 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>YieldOnce<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">865.8 ns<\/td>\n<td style=\"text-align: right\">0.94<\/td>\n<td style=\"text-align: right\">96 B<\/td>\n<td style=\"text-align: right\">0.86<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note the 16-byte decrease just as we predicted.<\/p>\n<p>Async method overheads are reduced in other ways, too. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82181\">dotnet\/runtime#82181<\/a>, for example, shrinks the size of the <code>ManualResetValueTaskSourceCore&lt;TResult&gt;<\/code> type that&#8217;s used as the workhorse for custom <code>IValueTaskSource<\/code>\/<code>IValueTaskSource&lt;TResult&gt;<\/code> implementations; it takes advantage of the 99.9% case to use a single field for something that previously required two fields. But my favorite addition in this regard is <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/22144\">dotnet\/runtime#22144<\/a>, which adds new <code>ConfigureAwait<\/code> overloads. Yes, I know <code>ConfigureAwait<\/code> is a sore subject with some, but these new overloads a) address a really useful scenario that many folks end up writing their own custom awaiters for, b) do it in a way that&#8217;s cheaper than custom solutions can provide, and c) actually help with the <code>ConfigureAwait<\/code> naming, as it fulfills the original purpose of <code>ConfigureAwait<\/code> that led us to name it that in the first place. When <code>ConfigureAwait<\/code> was originally devised, we debated many names, and we settled on &#8220;ConfigureAwait&#8221; because that&#8217;s what it was doing: it was allowing you to provide arguments that configured how the await behaved. 
Of course, for the last decade, the only configuration you&#8217;ve been able to do is pass a single <code>Boolean<\/code> to indicate whether to capture the current context \/ scheduler or not, and that in part has led folks to bemoan the naming as overly verbose for something that&#8217;s a single <code>bool<\/code>. Now in .NET 8, there are new overloads of <code>ConfigureAwait<\/code> that take a <code>ConfigureAwaitOptions<\/code> enum:<\/p>\n<pre><code class=\"language-C#\">[Flags]\r\npublic enum ConfigureAwaitOptions\r\n{\r\n    None = 0,\r\n    ContinueOnCapturedContext = 1,\r\n    SuppressThrowing = 2,\r\n    ForceYielding = 4,\r\n}<\/code><\/pre>\n<p><code>ContinueOnCapturedContext<\/code> you know; that&#8217;s the same as <code>ConfigureAwait(true)<\/code> today. <code>ForceYielding<\/code> is something that comes up now and again in various capacities, but essentially you&#8217;re awaiting something and rather than continuing synchronously if the thing you&#8217;re awaiting has already completed by the time you await it, you effectively want the system to pretend it&#8217;s not completed even if it is. Then rather than continuing synchronously, the continuation will always end up running asynchronously from the caller. This can be helpful as an optimization in a variety of ways. Consider this code that was in <code>SocketsHttpHandler<\/code>&#8217;s HTTP\/2 implementation in .NET 7:<\/p>\n<pre><code class=\"language-C#\">private void DisableHttp2Connection(Http2Connection connection)\r\n{\r\n    _ = Task.Run(async () =&gt; \/\/ fire-and-forget\r\n    {\r\n        bool usable = await connection.WaitForAvailableStreamsAsync().ConfigureAwait(false);\r\n        ... 
\/\/ other stuff\r\n    });\r\n}<\/code><\/pre>\n<p>With <code>ForceYielding<\/code> in .NET 8, the code is now:<\/p>\n<pre><code class=\"language-C#\">private void DisableHttp2Connection(Http2Connection connection)\r\n{\r\n    _ = DisableHttp2ConnectionAsync(connection); \/\/ fire-and-forget\r\n\r\n    async Task DisableHttp2ConnectionAsync(Http2Connection connection)\r\n    {\r\n        bool usable = await connection.WaitForAvailableStreamsAsync().ConfigureAwait(ConfigureAwaitOptions.ForceYielding);\r\n        ... \/\/ other stuff\r\n    }\r\n}<\/code><\/pre>\n<p>Rather than have a separate <code>Task.Run<\/code>, we&#8217;ve just piggy-backed on the <code>await<\/code> for the task returned from <code>WaitForAvailableStreamsAsync<\/code> (which we know will quickly return the task to us), ensuring that the work that comes after it doesn&#8217;t run synchronously as part of the call to <code>DisableHttp2Connection<\/code>. Or imagine you had code that was doing:<\/p>\n<pre><code class=\"language-C#\">return Task.Run(WorkAsync);\r\n\r\nstatic async Task WorkAsync()\r\n{\r\n    while (...) await Something();\r\n}<\/code><\/pre>\n<p>This is using <code>Task.Run<\/code> to queue an async method&#8217;s invocation. That async method results in a <code>Task<\/code> being allocated, plus the <code>Task.Run<\/code> results in a <code>Task<\/code> being allocated, plus a work item needs to be queued to the <code>ThreadPool<\/code>, so at least three allocations. Now, this same functionality can be written as:<\/p>\n<pre><code class=\"language-C#\">return WorkAsync();\r\n\r\nstatic async Task WorkAsync()\r\n{\r\n    await Task.CompletedTask.ConfigureAwait(ConfigureAwaitOptions.ForceYielding);\r\n    while (...) await Something();\r\n}<\/code><\/pre>\n<p>and rather than three allocations, we end up with just one: for the async <code>Task<\/code>. 
That&#8217;s because with all the optimizations introduced in previous releases, the state machine box object is also what will be queued to the thread pool.<\/p>\n<p>Arguably the most valuable addition to this support, though, is <code>SuppressThrowing<\/code>. It does what it sounds like: when you <code>await<\/code> a task that completes in failure or cancellation, such that normally the <code>await<\/code> would propagate the exception, it won&#8217;t. So, for example, in <code>System.Text.Json<\/code> where we previously had this code:<\/p>\n<pre><code class=\"language-C#\">\/\/ Exceptions should only be propagated by the resuming converter\r\ntry\r\n{\r\n    await state.PendingTask.ConfigureAwait(false);\r\n}\r\ncatch { }<\/code><\/pre>\n<p>now we have this code:<\/p>\n<pre><code class=\"language-C#\">\/\/ Exceptions should only be propagated by the resuming converter\r\nawait state.PendingTask.ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing);<\/code><\/pre>\n<p>or in <code>SemaphoreSlim<\/code> where we had this code:<\/p>\n<pre><code class=\"language-C#\">await new ConfiguredNoThrowAwaiter&lt;bool&gt;(asyncWaiter.WaitAsync(TimeSpan.FromMilliseconds(millisecondsTimeout), cancellationToken));\r\nif (cancellationToken.IsCancellationRequested)\r\n{\r\n    \/\/ If we might be running as part of a cancellation callback, force the completion to be asynchronous.\r\n    await TaskScheduler.Default;\r\n}\r\n\r\nprivate readonly struct ConfiguredNoThrowAwaiter&lt;T&gt; : ICriticalNotifyCompletion, IStateMachineBoxAwareAwaiter\r\n{\r\n    private readonly Task&lt;T&gt; _task;\r\n    public ConfiguredNoThrowAwaiter(Task&lt;T&gt; task) =&gt; _task = task;\r\n    public ConfiguredNoThrowAwaiter&lt;T&gt; GetAwaiter() =&gt; this;\r\n    public bool IsCompleted =&gt; _task.IsCompleted;\r\n    public void GetResult() =&gt; _task.MarkExceptionsAsHandled();\r\n    public void OnCompleted(Action continuation) =&gt; TaskAwaiter.OnCompletedInternal(_task, continuation, 
continueOnCapturedContext: false, flowExecutionContext: true);\r\n    public void UnsafeOnCompleted(Action continuation) =&gt; TaskAwaiter.OnCompletedInternal(_task, continuation, continueOnCapturedContext: false, flowExecutionContext: false);\r\n    public void AwaitUnsafeOnCompleted(IAsyncStateMachineBox box) =&gt; TaskAwaiter.UnsafeOnCompletedInternal(_task, box, continueOnCapturedContext: false);\r\n}\r\n\r\ninternal readonly struct TaskSchedulerAwaiter : ICriticalNotifyCompletion\r\n{\r\n    private readonly TaskScheduler _scheduler;\r\n    public TaskSchedulerAwaiter(TaskScheduler scheduler) =&gt; _scheduler = scheduler;\r\n    public bool IsCompleted =&gt; false;\r\n    public void GetResult() { }\r\n    public void OnCompleted(Action continuation) =&gt; Task.Factory.StartNew(continuation, CancellationToken.None, TaskCreationOptions.DenyChildAttach, _scheduler);\r\n    public void UnsafeOnCompleted(Action continuation)\r\n    {\r\n        if (ReferenceEquals(_scheduler, Default))\r\n        {\r\n            ThreadPool.UnsafeQueueUserWorkItem(s =&gt; s(), continuation, preferLocal: true);\r\n        }\r\n        else\r\n        {\r\n            OnCompleted(continuation);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>now we just have this:<\/p>\n<pre><code class=\"language-C#\">await ((Task)asyncWaiter.WaitAsync(TimeSpan.FromMilliseconds(millisecondsTimeout), cancellationToken)).ConfigureAwait(ConfigureAwaitOptions.SuppressThrowing);\r\nif (cancellationToken.IsCancellationRequested)\r\n{\r\n    \/\/ If we might be running as part of a cancellation callback, force the completion to be asynchronous.\r\n    await Task.CompletedTask.ConfigureAwait(ConfigureAwaitOptions.ForceYielding);\r\n}<\/code><\/pre>\n<p>It is useful to note the <code>(Task)<\/code> cast that&#8217;s in there. 
<code>WaitAsync<\/code> returns a <code>Task&lt;bool&gt;<\/code>, but that <code>Task&lt;bool&gt;<\/code> is being cast to the base <code>Task<\/code> because <code>SuppressThrowing<\/code> is incompatible with <code>Task&lt;TResult&gt;<\/code>. That&#8217;s because, without an exception propagating, the await will complete successfully and return a <code>TResult<\/code>, which may be invalid if the task actually faulted. So if you have a <code>Task&lt;TResult&gt;<\/code> that you want to await with <code>SuppressThrowing<\/code>, cast to the base <code>Task<\/code> and await it, and then you can inspect the <code>Task&lt;TResult&gt;<\/code> immediately after the await completes. (If you do end up using <code>ConfigureAwaitOptions.SuppressThrowing<\/code> with a <code>Task&lt;TResult&gt;<\/code>, the <code>CA2261<\/code> analyzer introduced in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6669\">dotnet\/roslyn-analyzers#6669<\/a> will alert you to it.)<\/p>\n<p>The above example with <code>SemaphoreSlim<\/code> is using the new <code>ConfigureAwaitOptions<\/code> to replace a previous optimization added in .NET 8, as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83294\">dotnet\/runtime#83294<\/a> added to that <code>ConfiguredNoThrowAwaiter&lt;T&gt;<\/code> an implementation of the internal <code>IStateMachineBoxAwareAwaiter<\/code> interface, which is the special sauce that enables the async method builders to backchannel with a known awaiter to avoid the <code>Action<\/code> delegate allocation. Now that the behaviors this <code>ConfiguredNoThrowAwaiter<\/code> was providing are built-in, it&#8217;s no longer needed, and the built-in implementation enjoys the same privileges via <code>IStateMachineBoxAwareAwaiter<\/code>. The net result of these changes for <code>SemaphoreSlim<\/code> is that it now not only has simpler code, but faster code, too. 
Here&#8217;s a benchmark showing the decrease in execution time and allocation associated with <code>SemaphoreSlim.WaitAsync<\/code> calls that need to wait with a <code>CancellationToken<\/code> and\/or timeout:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly CancellationToken _token = new CancellationTokenSource().Token;\r\n    private readonly SemaphoreSlim _sem = new SemaphoreSlim(0);\r\n    private readonly Task[] _tasks = new Task[100];\r\n\r\n    [Benchmark]\r\n    public Task WaitAsync()\r\n    {\r\n        for (int i = 0; i &lt; _tasks.Length; i++)\r\n        {\r\n            _tasks[i] = _sem.WaitAsync(_token);\r\n        }\r\n        _sem.Release(_tasks.Length);\r\n        return Task.WhenAll(_tasks);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WaitAsync<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">85.48 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">44.64 KB<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WaitAsync<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">69.37 us<\/td>\n<td style=\"text-align: right\">0.82<\/td>\n<td style=\"text-align: right\">36.02 KB<\/td>\n<td style=\"text-align: right\">0.81<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There have been improvements to other operations on <code>Task<\/code> as 
well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81065\">dotnet\/runtime#81065<\/a> removes a defensive <code>Task[]<\/code> allocation from <code>Task.WhenAll<\/code>. It was previously doing a defensive copy such that it could then validate on the copy whether any of the elements were <code>null<\/code> (a copy because another thread could erroneously and concurrently null out elements); that&#8217;s a large cost to pay for argument validation in the face of multi-threaded misuse. Instead, the method will still validate whether <code>null<\/code> is in the input, and if a <code>null<\/code> slips through because the input collection was erroneously mutated concurrently with the synchronous call to <code>WhenAll<\/code>, it&#8217;ll just ignore the <code>null<\/code> at that point. In making these changes, the PR also special-cased a <code>List&lt;Task&gt;<\/code> input to avoid making a copy, as <code>List&lt;Task&gt;<\/code> is also one of the main types we see fed into <code>WhenAll<\/code> (e.g. 
someone builds up a list of tasks and then waits for all of them).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.ObjectModel;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public void WhenAll_Array()\r\n    {\r\n        AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();\r\n        AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();\r\n        Task whenAll = Task.WhenAll(atmb1.Task, atmb2.Task);\r\n        atmb1.SetResult();\r\n        atmb2.SetResult();\r\n        whenAll.Wait();\r\n    }\r\n\r\n    [Benchmark]\r\n    public void WhenAll_List()\r\n    {\r\n        AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();\r\n        AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();\r\n        Task whenAll = Task.WhenAll(new List&lt;Task&gt;(2) { atmb1.Task, atmb2.Task });\r\n        atmb1.SetResult();\r\n        atmb2.SetResult();\r\n        whenAll.Wait();\r\n    }\r\n\r\n    [Benchmark]\r\n    public void WhenAll_Collection()\r\n    {\r\n        AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();\r\n        AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();\r\n        Task whenAll = Task.WhenAll(new ReadOnlyCollection&lt;Task&gt;(new[] { atmb1.Task, atmb2.Task }));\r\n        atmb1.SetResult();\r\n        atmb2.SetResult();\r\n        whenAll.Wait();\r\n    }\r\n\r\n    [Benchmark]\r\n    public void WhenAll_Enumerable()\r\n    {\r\n        AsyncTaskMethodBuilder atmb1 = AsyncTaskMethodBuilder.Create();\r\n        AsyncTaskMethodBuilder atmb2 = AsyncTaskMethodBuilder.Create();\r\n        var 
q = new Queue&lt;Task&gt;(2);\r\n        q.Enqueue(atmb1.Task);\r\n        q.Enqueue(atmb2.Task);\r\n        Task whenAll = Task.WhenAll(q);\r\n        atmb1.SetResult();\r\n        atmb2.SetResult();\r\n        whenAll.Wait();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WhenAll_Array<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">210.8 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">304 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAll_Array<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">160.9 ns<\/td>\n<td style=\"text-align: right\">0.76<\/td>\n<td style=\"text-align: right\">264 B<\/td>\n<td style=\"text-align: right\">0.87<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>WhenAll_List<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">296.4 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">376 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAll_List<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">185.5 ns<\/td>\n<td style=\"text-align: right\">0.63<\/td>\n<td style=\"text-align: right\">296 B<\/td>\n<td style=\"text-align: right\">0.79<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>WhenAll_Collection<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">271.3 ns<\/td>\n<td 
style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">360 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAll_Collection<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">199.7 ns<\/td>\n<td style=\"text-align: right\">0.74<\/td>\n<td style=\"text-align: right\">328 B<\/td>\n<td style=\"text-align: right\">0.91<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>WhenAll_Enumerable<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">328.2 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">472 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAll_Enumerable<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">230.0 ns<\/td>\n<td style=\"text-align: right\">0.70<\/td>\n<td style=\"text-align: right\">432 B<\/td>\n<td style=\"text-align: right\">0.92<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The generic <code>WhenAny<\/code> was also improved as part of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88154\">dotnet\/runtime#88154<\/a>, which removes a <code>Task<\/code> allocation from an extra continuation that was an implementation detail. 
This is one of my favorite kinds of PRs: it not only improved performance, it also resulted in cleaner code, and less code.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/GitHubPlusMinusIndicatorForWhenAnyChange.png\" alt=\"GitHub plus\/minus line count indicator for Task.WhenAny\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public Task&lt;Task&lt;int&gt;&gt; WhenAnyGeneric_ListNotCompleted()\r\n    {\r\n        AsyncTaskMethodBuilder&lt;int&gt; atmb1 = default;\r\n        AsyncTaskMethodBuilder&lt;int&gt; atmb2 = default;\r\n        AsyncTaskMethodBuilder&lt;int&gt; atmb3 = default;\r\n\r\n        Task&lt;Task&lt;int&gt;&gt; wa = Task.WhenAny(new List&lt;Task&lt;int&gt;&gt;() { atmb1.Task, atmb2.Task, atmb3.Task });\r\n\r\n        atmb3.SetResult(42);\r\n\r\n        return wa;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WhenAnyGeneric_ListNotCompleted<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">555.0 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">704 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WhenAnyGeneric_ListNotCompleted<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">260.3 ns<\/td>\n<td 
style=\"text-align: right\">0.47<\/td>\n<td style=\"text-align: right\">504 B<\/td>\n<td style=\"text-align: right\">0.72<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One last example related to tasks, though this one is a bit different, as it&#8217;s specifically about improving test performance (and test reliability). Imagine you have a method like this:<\/p>\n<pre><code class=\"language-C#\">public static async Task LogAfterDelay(Action&lt;string, TimeSpan&gt; log)\r\n{\r\n    long startingTimestamp = Stopwatch.GetTimestamp();\r\n    await Task.Delay(TimeSpan.FromSeconds(30));\r\n    log(\"Completed\", Stopwatch.GetElapsedTime(startingTimestamp));\r\n}<\/code><\/pre>\n<p>The purpose of this method is to wait for 30 seconds and then log a completion message as well as how much time the method observed to pass. This is obviously a simplification of the kind of functionality you&#8217;d find in real applications, but you can extrapolate from it to code you&#8217;ve likely written. How do you test this? Maybe you&#8217;ve written a test like this:<\/p>\n<pre><code class=\"language-C#\">[Fact]\r\npublic async Task LogAfterDelay_Success_CompletesAfterThirtySeconds()\r\n{\r\n    TimeSpan ts = default;\r\n\r\n    Stopwatch sw = Stopwatch.StartNew();\r\n    await LogAfterDelay((message, time) =&gt; ts = time);\r\n    sw.Stop();\r\n\r\n    Assert.InRange(ts, TimeSpan.FromSeconds(30), TimeSpan.MaxValue);\r\n    Assert.InRange(sw.Elapsed, TimeSpan.FromSeconds(30), TimeSpan.MaxValue);\r\n}<\/code><\/pre>\n<p>This is validating both that the method included a value of at least 30 seconds in its log and also that at least 30 seconds passed. What&#8217;s the problem? From a performance perspective, the problem is this test had to wait 30 seconds! That&#8217;s a ton of overhead for something which would otherwise complete close to instantaneously. Now imagine the delay was longer, like 10 minutes, or that we had a bunch of tests that all needed to do the same thing. 
It becomes untenable to test well and thoroughly.<\/p>\n<p>To address these kinds of situations, many developers have introduced their own abstractions for the flow of time. Now in .NET 8, that&#8217;s no longer needed. As of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83604\">dotnet\/runtime#83604<\/a>, the core libraries include <code>System.TimeProvider<\/code>. This abstract base class abstracts over the flow of time, with members for getting the current UTC time, getting the current local time, getting the current time zone, getting a high-frequency timestamp, and creating a timer (which in turn returns the new <code>System.Threading.ITimer<\/code> that supports changing the timer&#8217;s tick interval). Then core library members like <code>Task.Delay<\/code> and <code>CancellationTokenSource<\/code>&#8217;s constructor have new overloads that accept a <code>TimeProvider<\/code>, and use it for time-related functionality rather than being hardcoded to <code>DateTime.UtcNow<\/code>, <code>Stopwatch<\/code>, or <code>System.Threading.Timer<\/code>. With that, we can rewrite our previous method:<\/p>\n<pre><code class=\"language-C#\">public static async Task LogAfterDelay(Action&lt;string, TimeSpan&gt; log, TimeProvider provider)\r\n{\r\n    long startingTimestamp = provider.GetTimestamp();\r\n    await Task.Delay(TimeSpan.FromSeconds(30), provider);\r\n    log(\"Completed\", provider.GetElapsedTime(startingTimestamp));\r\n}<\/code><\/pre>\n<p>It&#8217;s been augmented to accept a <code>TimeProvider<\/code> parameter, though in a system that uses a dependency injection (DI) mechanism, it would likely just fetch a <code>TimeProvider<\/code> singleton from DI. 
Then instead of using <code>Stopwatch.GetTimestamp<\/code> or <code>Stopwatch.GetElapsedTime<\/code>, it uses the corresponding members on the <code>provider<\/code>, and instead of using the <code>Task.Delay<\/code> overload that just takes a duration, it uses the overload that also takes a <code>TimeProvider<\/code>. When used in production, this can be passed <code>TimeProvider.System<\/code>, which is implemented based on the system clock (exactly what you would get without providing a <code>TimeProvider<\/code> at all), but in a test, it can be passed a custom instance, one that manually controls the observed flow of time. Exactly such a custom <code>TimeProvider<\/code> exists in the <a href=\"https:\/\/www.nuget.org\/packages\/Microsoft.Extensions.TimeProvider.Testing\">Microsoft.Extensions.TimeProvider.Testing<\/a> NuGet package: <code>FakeTimeProvider<\/code>. Here&#8217;s an example of using it with our <code>LogAfterDelay<\/code> method:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing Microsoft.Extensions.Time.Testing;\r\nusing System.Diagnostics;\r\n\r\nStopwatch sw = Stopwatch.StartNew();\r\n\r\nvar fake = new FakeTimeProvider();\r\n\r\nTask t = LogAfterDelay((message, time) =&gt; Console.WriteLine($\"{message}: {time}\"), fake);\r\n\r\nfake.Advance(TimeSpan.FromSeconds(29));\r\nConsole.WriteLine(t.IsCompleted);\r\n\r\nfake.Advance(TimeSpan.FromSeconds(1));\r\nConsole.WriteLine(t.IsCompleted);\r\n\r\nConsole.WriteLine($\"Actual execution time: {sw.Elapsed}\");\r\n\r\nstatic async Task LogAfterDelay(Action&lt;string, TimeSpan&gt; log, TimeProvider provider)\r\n{\r\n    long startingTimestamp = provider.GetTimestamp();\r\n    await Task.Delay(TimeSpan.FromSeconds(30), provider);\r\n    log(\"Completed\", provider.GetElapsedTime(startingTimestamp));\r\n}<\/code><\/pre>\n<p>When I run this, it outputs the following:<\/p>\n<pre><code class=\"language-text\">False\r\nCompleted: 
00:00:30\r\nTrue\r\nActual execution time: 00:00:00.0119943<\/code><\/pre>\n<p>In other words, after manually advancing time by 29 seconds, the operation still hadn&#8217;t completed. Then we manually advanced time by one more second, and the operation completed. It reported that 30 seconds passed, but in reality, the whole operation took only 0.01 seconds of actual wall clock time.<\/p>\n<p>With that, let&#8217;s move up the stack to <code>Parallel<\/code>&#8230;<\/p>\n<h2>Parallel<\/h2>\n<p>.NET 6 introduced new async methods onto <code>Parallel<\/code> in the form of <code>Parallel.ForEachAsync<\/code>. After its introduction, we started getting requests for an equivalent for <code>for<\/code> loops, so now in .NET 8, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84804\">dotnet\/runtime#84804<\/a>, the class gains a set of <code>Parallel.ForAsync<\/code> methods. These were previously achievable by passing in an <code>IEnumerable&lt;T&gt;<\/code> created from a method like <code>Enumerable.Range<\/code>, e.g.<\/p>\n<pre><code class=\"language-C#\">await Parallel.ForEachAsync(Enumerable.Range(0, 1_000), async i =&gt;\r\n{\r\n   ... \r\n});<\/code><\/pre>\n<p>but you can now achieve the same more simply and cheaply with:<\/p>\n<pre><code class=\"language-C#\">await Parallel.ForAsync(0, 1_000, async i =&gt;\r\n{\r\n   ... 
\r\n});<\/code><\/pre>\n<p>It ends up being cheaper because you don&#8217;t need to allocate the enumerable\/enumerator, and the synchronization involved in multiple workers trying to peel off the next iteration can be done in a much less expensive manner, a single <code>Interlocked<\/code> rather than using an asynchronous lock like <code>SemaphoreSlim<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark(Baseline = true)]\r\n    public Task ForEachAsync() =&gt; Parallel.ForEachAsync(Enumerable.Range(0, 1_000_000), (i, ct) =&gt; ValueTask.CompletedTask);\r\n\r\n    [Benchmark]\r\n    public Task ForAsync() =&gt; Parallel.ForAsync(0, 1_000_000, (i, ct) =&gt; ValueTask.CompletedTask);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ForEachAsync<\/td>\n<td style=\"text-align: right\">589.5 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">87925272 B<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>ForAsync<\/td>\n<td style=\"text-align: right\">147.5 ms<\/td>\n<td style=\"text-align: right\">0.25<\/td>\n<td style=\"text-align: right\">792 B<\/td>\n<td style=\"text-align: right\">0.000<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The allocation column here is particularly stark, and also a tad misleading. Why is <code>ForEachAsync<\/code> <em>so<\/em> much worse here allocation-wise? 
It&#8217;s because of the synchronization mechanism. There&#8217;s zero work being performed here by the delegate in the test, so all of the time is spent hammering on the source. In the case of <code>Parallel.ForAsync<\/code>, that&#8217;s a single <code>Interlocked<\/code> instruction to get the next value. In the case of <code>Parallel.ForEachAsync<\/code>, it&#8217;s a <code>WaitAsync<\/code>, and under a lot of contention, many of those <code>WaitAsync<\/code> calls are going to complete asynchronously, resulting in allocation. In a real workload, where the body delegate is doing real work, synchronously or asynchronously, the impact of that synchronization is much, much less dramatic. Here I&#8217;ve changed the calls to just be a simple <code>Task.Delay<\/code> for 1ms (and also significantly lowered the iteration count):<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark(Baseline = true)]\r\n    public Task ForEachAsync() =&gt; Parallel.ForEachAsync(Enumerable.Range(0, 100), async (i, ct) =&gt; await Task.Delay(1));\r\n\r\n    [Benchmark]\r\n    public Task ForAsync() =&gt; Parallel.ForAsync(0, 100, async (i, ct) =&gt; await Task.Delay(1));\r\n}<\/code><\/pre>\n<p>and the two methods are effectively the same:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ForEachAsync<\/td>\n<td style=\"text-align: right\">89.39 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: 
right\">27.96 KB<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ForAsync<\/td>\n<td style=\"text-align: right\">89.44 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">27.84 KB<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Interestingly, this <code>Parallel.ForAsync<\/code> method is also one of the first public methods in the core libraries to be based on the generic math interfaces introduced in .NET 7:<\/p>\n<pre><code class=\"language-C#\">public static Task ForAsync&lt;T&gt;(T fromInclusive, T toExclusive, Func&lt;T, CancellationToken, ValueTask&gt; body)\r\n    where T : notnull, IBinaryInteger&lt;T&gt;<\/code><\/pre>\n<p>When initially designing the method, we copied the synchronous <code>For<\/code> counterpart, which has overloads specific to <code>int<\/code> and overloads specific to <code>long<\/code>. Now that we have <code>IBinaryInteger&lt;T&gt;<\/code>, however, we realized we could not only reduce the number of overloads and not only reduce the number of implementations, by using <code>IBinaryInteger&lt;T&gt;<\/code> we could also open the same method up to other types folks want to use, such as <code>nint<\/code> or <code>UInt128<\/code> or <code>BigInteger<\/code>; they all &#8220;just work,&#8221; which is pretty cool. (The new <code>TotalOrderIeee754Comparer&lt;T&gt;<\/code>, added in .NET 8 in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75517\">dotnet\/runtime#75517<\/a> by <a href=\"https:\/\/github.com\/huoyaoyuan\">@huoyaoyuan<\/a>, is another new public type relying on these interfaces.) 
Once we did that, in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84853\">dotnet\/runtime#84853<\/a> we used a similar technique to deduplicate the <code>Parallel.For<\/code> implementations, such that both <code>int<\/code> and <code>long<\/code> share the same generic implementations internally.<\/p>\n<h2>Exceptions<\/h2>\n<p>In .NET 6, <code>ArgumentNullException<\/code> gained a <code>ThrowIfNull<\/code> method, as we dipped our toes into the waters of providing &#8220;throw helpers.&#8221; The intent of the method is to concisely express the constraint being verified, letting the system throw a consistent exception for failure to meet the constraint while also optimizing the success and 99.999% case where no exception need be thrown. The method is structured in such a way that the fast path performing the check gets inlined, with as little work as possible on that path, and then everything else is relegated to a method that performs the actual throwing (the JIT won&#8217;t inline that throwing method, as it&#8217;ll look at its implementation and see that the method always throws).<\/p>\n<pre><code class=\"language-C#\">public static void ThrowIfNull(\r\n    [NotNull] object? argument,\r\n    [CallerArgumentExpression(nameof(argument))] string? paramName = null)\r\n{\r\n    if (argument is null)\r\n        Throw(paramName);\r\n}\r\n\r\n[DoesNotReturn]\r\ninternal static void Throw(string? paramName) =&gt; throw new ArgumentNullException(paramName);<\/code><\/pre>\n<p>In .NET 7, <code>ArgumentNullException.ThrowIfNull<\/code> gained another overload, this time for pointers, and two new methods were introduced: <code>ArgumentException.ThrowIfNullOrEmpty<\/code> for <code>string<\/code>s and <code>ObjectDisposedException.ThrowIf<\/code>.<\/p>\n<p>Now in .NET 8, a slew of new such helpers have been added. 
Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86007\">dotnet\/runtime#86007<\/a>, <code>ArgumentException<\/code> gains <code>ThrowIfNullOrWhiteSpace<\/code> to complement <code>ThrowIfNullOrEmpty<\/code>:<\/p>\n<pre><code class=\"language-C#\">public static void ThrowIfNullOrWhiteSpace([NotNull] string? argument, [CallerArgumentExpression(nameof(argument))] string? paramName = null);<\/code><\/pre>\n<p>and thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78222\">dotnet\/runtime#78222<\/a> from <a href=\"https:\/\/github.com\/hrrrrustic\">@hrrrrustic<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83853\">dotnet\/runtime#83853<\/a>, <code>ArgumentOutOfRangeException<\/code> gains 9 new methods:<\/p>\n<pre><code class=\"language-C#\">public static void ThrowIfEqual&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : System.IEquatable&lt;T&gt;?;\r\npublic static void ThrowIfNotEqual&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : System.IEquatable&lt;T&gt;?;\r\n\r\npublic static void ThrowIfLessThan&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable&lt;T&gt;;\r\npublic static void ThrowIfLessThanOrEqual&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable&lt;T&gt;;\r\n\r\npublic static void ThrowIfGreaterThan&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable&lt;T&gt;;\r\npublic static void ThrowIfGreaterThanOrEqual&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable&lt;T&gt;;\r\n\r\npublic static void ThrowIfNegative&lt;T&gt;(T value, [CallerArgumentExpression(nameof(value))] string? 
paramName = null) where T : INumberBase&lt;T&gt;;\r\npublic static void ThrowIfZero&lt;T&gt;(T value, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : INumberBase&lt;T&gt;;\r\npublic static void ThrowIfNegativeOrZero&lt;T&gt;(T value, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : INumberBase&lt;T&gt;;<\/code><\/pre>\n<p>Those PRs used these new methods in a few places, but then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79460\">dotnet\/runtime#79460<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80355\">dotnet\/runtime#80355<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82357\">dotnet\/runtime#82357<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82533\">dotnet\/runtime#82533<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85858\">dotnet\/runtime#85858<\/a> rolled out their use more broadly throughout the core libraries. To get a sense for the usefulness of these methods, here is the number of times each of these methods is being called from within the <code>src<\/code> for the core libraries in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> as of the time I&#8217;m writing this 
paragraph:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Count<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ANE.ThrowIfNull(object)<\/td>\n<td>4795<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfNegative<\/td>\n<td>873<\/td>\n<\/tr>\n<tr>\n<td>AE.ThrowIfNullOrEmpty<\/td>\n<td>311<\/td>\n<\/tr>\n<tr>\n<td>ODE.ThrowIf<\/td>\n<td>237<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfGreaterThan<\/td>\n<td>223<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfNegativeOrZero<\/td>\n<td>100<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfLessThan<\/td>\n<td>89<\/td>\n<\/tr>\n<tr>\n<td>ANE.ThrowIfNull(void*)<\/td>\n<td>55<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfGreaterThanOrEqual<\/td>\n<td>39<\/td>\n<\/tr>\n<tr>\n<td>AE.ThrowIfNullOrWhiteSpace<\/td>\n<td>32<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfLessThanOrEqual<\/td>\n<td>20<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfNotEqual<\/td>\n<td>13<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfZero<\/td>\n<td>5<\/td>\n<\/tr>\n<tr>\n<td>AOORE.ThrowIfEqual<\/td>\n<td>3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These new methods also do more work in the throwing portion (e.g. formatting the exception message with the invalid arguments), which helps to better exemplify the benefits of moving all of that work out into a separate method. For example, here is the <code>ThrowIfGreaterThan<\/code> copied straight from <code>System.Private.CoreLib<\/code>:<\/p>\n<pre><code class=\"language-C#\">public static void ThrowIfGreaterThan&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? paramName = null) where T : IComparable&lt;T&gt;\r\n{\r\n    if (value.CompareTo(other) &gt; 0)\r\n        ThrowGreater(value, other, paramName);\r\n}\r\n\r\nprivate static void ThrowGreater&lt;T&gt;(T value, T other, string? 
paramName) =&gt;\r\n    throw new ArgumentOutOfRangeException(paramName, value, SR.Format(SR.ArgumentOutOfRange_Generic_MustBeLessOrEqual, paramName, value, other));<\/code><\/pre>\n<p>and here is a benchmark showing what consumption would look like if the <code>throw<\/code> expression were directly part of <code>ThrowIfGreaterThan<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"value1\", \"value2\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    [Benchmark(Baseline = true)]\r\n    [Arguments(1, 2)]\r\n    public void WithOutline(int value1, int value2)\r\n    {\r\n        ArgumentOutOfRangeException.ThrowIfGreaterThan(value1, 100);\r\n        ArgumentOutOfRangeException.ThrowIfGreaterThan(value2, 200);\r\n    }\r\n\r\n    [Benchmark]\r\n    [Arguments(1, 2)]\r\n    public void WithInline(int value1, int value2)\r\n    {\r\n        ThrowIfGreaterThan(value1, 100);\r\n        ThrowIfGreaterThan(value2, 200);\r\n    }\r\n\r\n    public static void ThrowIfGreaterThan&lt;T&gt;(T value, T other, [CallerArgumentExpression(nameof(value))] string? 
paramName = null) where T : IComparable&lt;T&gt;\r\n    {\r\n        if (value.CompareTo(other) &gt; 0)\r\n            throw new ArgumentOutOfRangeException(paramName, value, SR.Format(SR.ArgumentOutOfRange_Generic_MustBeLessOrEqual, paramName, value, other));\r\n    }\r\n\r\n    internal static class SR\r\n    {\r\n        public static string Format(string format, object arg0, object arg1, object arg2) =&gt; string.Format(format, arg0, arg1, arg2);\r\n        internal static string ArgumentOutOfRange_Generic_MustBeLessOrEqual =&gt; GetResourceString(\"ArgumentOutOfRange_Generic_MustBeLessOrEqual\");\r\n\r\n        [MethodImpl(MethodImplOptions.NoInlining)]\r\n        static string GetResourceString(string resourceKey) =&gt; \"{0} ('{1}') must be less than or equal to '{2}'.\";\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithOutline<\/td>\n<td style=\"text-align: right\">0.4839 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">118 B<\/td>\n<\/tr>\n<tr>\n<td>WithInline<\/td>\n<td style=\"text-align: right\">2.4976 ns<\/td>\n<td style=\"text-align: right\">5.16<\/td>\n<td style=\"text-align: right\">235 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The most relevant highlight from the generated assembly is from the <code>WithInline<\/code> case:<\/p>\n<pre><code class=\"language-assembly\">; Tests.WithInline(Int32, Int32)\r\n       push      rbx\r\n       sub       rsp,20\r\n       mov       ebx,r8d\r\n       mov       ecx,edx\r\n       mov       edx,64\r\n       mov       r8,1F5815EA8F8\r\n       call      qword ptr [7FF99C03DEA8]; Tests.ThrowIfGreaterThan[[System.Int32, System.Private.CoreLib]](Int32, Int32, System.String)\r\n       mov       ecx,ebx\r\n       mov       edx,0C8\r\n       mov       r8,1F5815EA920\r\n       add       
rsp,20\r\n       pop       rbx\r\n       jmp       qword ptr [7FF99C03DEA8]; Tests.ThrowIfGreaterThan[[System.Int32, System.Private.CoreLib]](Int32, Int32, System.String)\r\n; Total bytes of code 59<\/code><\/pre>\n<p>Because there&#8217;s more cruft inside the <code>ThrowIfGreaterThan<\/code> method, the system decides not to inline it, and so we end up with two method invocations that occur even when the value is within range (the first is a <code>call<\/code>, the second here is a <code>jmp<\/code>, since there was no follow-up work in this method that would require control flow returning).<\/p>\n<p>To make it easier to roll out usage of these helpers, <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6293\">dotnet\/roslyn-analyzers#6293<\/a> added new analyzers to look for argument validation that can be replaced by one of the throw helper methods on <code>ArgumentNullException<\/code>, <code>ArgumentException<\/code>, <code>ArgumentOutOfRangeException<\/code>, or <code>ObjectDisposedException<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80149\">dotnet\/runtime#80149<\/a> enables the analyzers for <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> and fixes up many call sites.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1510.png\" alt=\"CA1510, CA1511, CA1512, CA1513\" \/><\/p>\n<h2>Reflection<\/h2>\n<p>There have been a variety of improvements here and there in the reflection stack in .NET 8, mostly around reducing allocation or caching information so that subsequent access is faster. 
For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87902\">dotnet\/runtime#87902<\/a> tweaks some code in <code>GetCustomAttributes<\/code> to avoid allocating an <code>object[1]<\/code> array in order to set a property on an attribute.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public object[] GetCustomAttributes() =&gt; typeof(C).GetCustomAttributes(typeof(MyAttribute), inherit: true);\r\n\r\n    [My(Value1 = 1, Value2 = 2)]\r\n    class C { }\r\n\r\n    [AttributeUsage(AttributeTargets.All)]\r\n    public class MyAttribute : Attribute\r\n    {\r\n        public int Value1 { get; set; }\r\n        public int Value2 { get; set; }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetCustomAttributes<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,287.1 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">296 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributes<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">994.0 ns<\/td>\n<td style=\"text-align: right\">0.77<\/td>\n<td style=\"text-align: right\">232 B<\/td>\n<td style=\"text-align: right\">0.78<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other changes like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76574\">dotnet\/runtime#76574<\/a> from <a 
href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81059\">dotnet\/runtime#81059<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86657\">dotnet\/runtime#86657<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a> also removed allocations in the reflection stack, in particular by more liberal use of spans. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78288\">dotnet\/runtime#78288<\/a> from <a href=\"https:\/\/github.com\/lateapexearlyspeed\">@lateapexearlyspeed<\/a> improves the handling of generics information on a <code>Type<\/code>, leading to a boost for various generics-related members, in particular for <code>GetGenericTypeDefinition<\/code> for which the result is now cached on the <code>Type<\/code> object.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly Type _type = typeof(List&lt;int&gt;);\r\n\r\n    [Benchmark] public Type GetGenericTypeDefinition() =&gt; _type.GetGenericTypeDefinition();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetGenericTypeDefinition<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">47.426 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetGenericTypeDefinition<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">3.289 ns<\/td>\n<td style=\"text-align: 
right\">0.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>However, the largest impact on performance in reflection in .NET 8 comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88415\">dotnet\/runtime#88415<\/a>. This is a continuation of work done in .NET 7 to improve the performance of <code>MethodBase.Invoke<\/code>. When you know at compile-time the signature of the target method you want to invoke via reflection, you can achieve the best performance by using <code>CreateDelegate&lt;DelegateType&gt;<\/code> to get and cache a delegate for the method in question, and then performing all invocations via that delegate. However, if you don&#8217;t know the signature at compile-time, you need to rely on more dynamic means, like <code>MethodBase.Invoke<\/code>, which historically has been much more costly. Some enterprising developers turned to reflection emit to avoid that overhead by emitting custom invocation stubs at run-time, and that&#8217;s one of the optimization approaches taken under the covers in .NET 7 as well. 
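<\/p>\n<p>For the known-signature case, a minimal sketch of that delegate-caching approach might look like the following (the <code>Calculator<\/code> type and its <code>Add<\/code> method are hypothetical names used only for illustration):<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Reflection;\r\n\r\nMethodInfo method = typeof(Calculator).GetMethod(\"Add\", BindingFlags.NonPublic | BindingFlags.Static)!;\r\n\r\n\/\/ Pay the reflection cost once: bind the MethodInfo to a strongly-typed delegate and cache it...\r\nFunc&lt;int, int, int&gt; add = method.CreateDelegate&lt;Func&lt;int, int, int&gt;&gt;();\r\n\r\n\/\/ ...then every subsequent call is an ordinary delegate invocation.\r\nConsole.WriteLine(add(1, 2)); \/\/ 3\r\n\r\nstatic class Calculator\r\n{\r\n    private static int Add(int x, int y) =&gt; x + y;\r\n}<\/code><\/pre>\n<p>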
Now in .NET 8, the code generated for many of these cases has improved; previously the emitter was always generating code that could accommodate <code>ref<\/code>\/<code>out<\/code> arguments, but many methods don&#8217;t have such arguments, and the generated code can be more efficient when it needn&#8217;t factor those in.<\/p>\n<pre><code class=\"language-C#\">\/\/ If you have .NET 6 installed, you can update the csproj to include a net6.0 in the target frameworks, and then run:\r\n\/\/     dotnet run -c Release -f net6.0 --filter \"*\" --runtimes net6.0 net7.0 net8.0\r\n\/\/ Otherwise, you can run:\r\n\/\/     dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Reflection;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private MethodInfo _method0, _method1, _method2, _method3;\r\n    private readonly object[] _args1 = new object[] { 1 };\r\n    private readonly object[] _args2 = new object[] { 2, 3 };\r\n    private readonly object[] _args3 = new object[] { 4, 5, 6 };\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _method0 = typeof(Tests).GetMethod(\"MyMethod0\", BindingFlags.NonPublic | BindingFlags.Static);\r\n        _method1 = typeof(Tests).GetMethod(\"MyMethod1\", BindingFlags.NonPublic | BindingFlags.Static);\r\n        _method2 = typeof(Tests).GetMethod(\"MyMethod2\", BindingFlags.NonPublic | BindingFlags.Static);\r\n        _method3 = typeof(Tests).GetMethod(\"MyMethod3\", BindingFlags.NonPublic | BindingFlags.Static);\r\n    }\r\n\r\n    [Benchmark] public void Method0() =&gt; _method0.Invoke(null, null);\r\n    [Benchmark] public void Method1() =&gt; _method1.Invoke(null, _args1);\r\n    [Benchmark] public void Method2() =&gt; _method2.Invoke(null, _args2);\r\n    [Benchmark] public void 
Method3() =&gt; _method3.Invoke(null, _args3);\r\n\r\n    private static void MyMethod0() { }\r\n    private static void MyMethod1(int arg1) { }\r\n    private static void MyMethod2(int arg1, int arg2) { }\r\n    private static void MyMethod3(int arg1, int arg2, int arg3) { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Method0<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right\">91.457 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Method0<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">7.205 ns<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<\/tr>\n<tr>\n<td>Method0<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">5.719 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Method1<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right\">132.832 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Method1<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">26.151 ns<\/td>\n<td style=\"text-align: right\">0.20<\/td>\n<\/tr>\n<tr>\n<td>Method1<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">21.602 ns<\/td>\n<td style=\"text-align: right\">0.16<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Method2<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right\">172.224 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Method2<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">37.937 ns<\/td>\n<td style=\"text-align: right\">0.22<\/td>\n<\/tr>\n<tr>\n<td>Method2<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">26.951 ns<\/td>\n<td style=\"text-align: 
right\">0.16<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Method3<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right\">211.247 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Method3<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">42.988 ns<\/td>\n<td style=\"text-align: right\">0.20<\/td>\n<\/tr>\n<tr>\n<td>Method3<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">34.112 ns<\/td>\n<td style=\"text-align: right\">0.16<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>However, there&#8217;s still upfront work involved here, and it&#8217;s repeated on every call. If we could extract that upfront work, do it once, and cache it, we could achieve much better performance. That&#8217;s exactly what the new <code>MethodInvoker<\/code> and <code>ConstructorInvoker<\/code> types implemented in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88415\">dotnet\/runtime#88415<\/a> provide. 
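<\/p>\n<p>For the constructor side, a minimal sketch (with a hypothetical <code>Widget<\/code> type standing in for a real one) follows the same create-once, invoke-many shape:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Reflection;\r\n\r\nConstructorInfo ctor = typeof(Widget).GetConstructor(new[] { typeof(int) })!;\r\n\r\n\/\/ Do the upfront analysis once and cache the resulting invoker...\r\nConstructorInvoker invoker = ConstructorInvoker.Create(ctor);\r\n\r\n\/\/ ...then reuse it for each construction; the argument is passed as a boxed object.\r\nvar widget = (Widget)invoker.Invoke(42);\r\nConsole.WriteLine(widget.Size); \/\/ 42\r\n\r\npublic sealed class Widget\r\n{\r\n    public Widget(int size) =&gt; Size = size;\r\n    public int Size { get; }\r\n}<\/code><\/pre>\n<p>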
These don&#8217;t incorporate all of the obscure corner-cases that <code>MethodBase.Invoke<\/code> handles (like specially recognizing and handling <code>Type.Missing<\/code>), but for everything else, it provides a great solution for optimizing the repeated invocation of methods whose signatures are unknown at build time.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Reflection;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly object _arg0 = 4, _arg1 = 5, _arg2 = 6;\r\n    private readonly object[] _args3 = new object[] { 4, 5, 6 };\r\n    private MethodInfo _method3;\r\n    private MethodInvoker _method3Invoker;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _method3 = typeof(Tests).GetMethod(\"MyMethod3\", BindingFlags.NonPublic | BindingFlags.Static);\r\n        _method3Invoker = MethodInvoker.Create(_method3);\r\n    }\r\n\r\n    [Benchmark(Baseline = true)] \r\n    public void MethodBaseInvoke() =&gt; _method3.Invoke(null, _args3);\r\n\r\n    [Benchmark]\r\n    public void MethodInvokerInvoke() =&gt; _method3Invoker.Invoke(null, _arg0, _arg1, _arg2);\r\n\r\n    private static void MyMethod3(int arg1, int arg2, int arg3) { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>MethodBaseInvoke<\/td>\n<td style=\"text-align: right\">32.42 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>MethodInvokerInvoke<\/td>\n<td style=\"text-align: right\">11.47 ns<\/td>\n<td style=\"text-align: right\">0.35<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As of <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90119\">dotnet\/runtime#90119<\/a>, these types are then used by the <code>ActivatorUtilities.CreateFactory<\/code> method in <code>Microsoft.Extensions.DependencyInjection.Abstractions<\/code> to further improve DI service construction performance. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91881\">dotnet\/runtime#91881<\/a> improves it further by adding an additional caching layer that avoids reflection on each construction.<\/p>\n<h2>Primitives<\/h2>\n<p>It&#8217;s hard to believe that after two decades we&#8217;re still finding opportunity to improve the core primitive types in .NET, yet here we are. Some of this comes from new scenarios that drive optimization into different places; some of it comes from new opportunity based on new support that enables different approaches to the same problem; some of it comes from new research highlighting new ways to approach a problem; and some of it simply comes from many new eyes looking at a well-worn space (yay open source!). Regardless of the reason, there&#8217;s a lot to be excited about here in .NET 8.<\/p>\n<h3>Enums<\/h3>\n<p>Let&#8217;s start with <code>Enum<\/code>. <code>Enum<\/code> has obviously been around since the earliest days of .NET and is used heavily. Although <code>Enum<\/code>&#8216;s functionality and implementation have evolved, and although it&#8217;s received new APIs, at its core, how the data is stored has fundamentally remained the same for many years. In the .NET Framework implementation, there&#8217;s an internal <code>ValuesAndNames<\/code> class that stores a <code>ulong[]<\/code> and a <code>string[]<\/code>, and in .NET 7, there&#8217;s an <code>EnumInfo<\/code> that serves the same purpose. That <code>string[]<\/code> contains the names of all of the enum&#8217;s values, and the <code>ulong[]<\/code> stores their numeric counterparts. 
It&#8217;s a <code>ulong[]<\/code> to accommodate all possible underlying types an <code>Enum<\/code> can be, including those supported by C# (<code>sbyte<\/code>, <code>byte<\/code>, <code>short<\/code>, <code>ushort<\/code>, <code>int<\/code>, <code>uint<\/code>, <code>long<\/code>, <code>ulong<\/code>) and those additionally supported by the runtime (<code>nint<\/code>, <code>nuint<\/code>, <code>char<\/code>, <code>float<\/code>, <code>double<\/code>) even though effectively no one uses those (partial <code>bool<\/code> support used to be on this list as well, but was deleted in .NET 8 in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79962\">dotnet\/runtime#79962<\/a> by <a href=\"https:\/\/github.com\/pedrobsaila\">@pedrobsaila<\/a>).<\/p>\n<p>As an aside, as part of all of this work, we examined the breadth of appropriately-licensed NuGet packages, looking for what the most common underlying types were in their use of <code>enum<\/code>. Out of ~163 million <code>enum<\/code>s found, here&#8217;s the breakdown of their underlying types. The result is likely not surprising, given the default underlying type for <code>Enum<\/code>, but it&#8217;s still interesting:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/EnumUnderlyingTypesGraph.png\" alt=\"Graph of how common is each underlying Enum type\" \/><\/p>\n<p>There are several issues with the cited design for how <code>Enum<\/code> stores its data. Every operation translates between these <code>ulong[]<\/code> values and the actual type being used by the particular <code>Enum<\/code>, plus the array is often twice as large as it needs to be (<code>int<\/code> is the default underlying type for an enum and, as seen in the above graph, by <em>far<\/em> the most commonly used). 
The approach also leads to significant assembly code bloat when dealing with all the new generic methods that have been added to <code>Enum<\/code> in recent years. <code>enum<\/code>s are structs, and when a struct is used as a generic type argument, the JIT specializes the code for that value type (whereas for reference types it emits a single shared implementation used by all of them). That specialization is great for throughput, but it means that you get a copy of the code for every value type it&#8217;s used with; if you have a lot of code (e.g. <code>Enum<\/code> formatting) and a lot of possible types being substituted (e.g. every declared <code>enum<\/code> type), that&#8217;s a lot of possible increase in code size.<\/p>\n<p>To address all of this, to modernize the implementation, and to make various operations faster, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78580\">dotnet\/runtime#78580<\/a> rewrites <code>Enum<\/code>. Rather than having a non-generic <code>EnumInfo<\/code> that stores a <code>ulong[]<\/code> array of all values, it introduces a generic <code>EnumInfo&lt;TUnderlyingValue&gt;<\/code> that stores a <code>TUnderlyingValue[]<\/code>. Then based on the enum&#8217;s type, every generic and non-generic <code>Enum<\/code> method looks up the underlying <code>TUnderlyingValue<\/code> and invokes a generic method with that <code>TUnderlyingValue<\/code> but <em>not<\/em> with a generic type parameter for the <code>enum<\/code> type, e.g. <code>Enum.IsDefined&lt;TEnum&gt;(...)<\/code> and <code>Enum.IsDefined(typeof(TEnum), ...)<\/code> both look up the <code>TUnderlyingValue<\/code> for <code>TEnum<\/code> and invoke the internal <code>Enum.IsDefinedPrimitive&lt;TUnderlyingValue&gt;(typeof(TEnum))<\/code>. 
In this way, the implementation stores a strongly-typed <code>TUnderlyingValue[]<\/code> value rather than storing the worst case <code>ulong[]<\/code>, and all of the implementations across generic and non-generic entrypoints are shared while not having full generic specialization for every <code>TEnum<\/code>: worst case, we end up with one generic specialization per underlying type, of which only the previously cited 8 are expressible in C#. The generic entrypoints are able to do the mapping very efficiently, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71685\">dotnet\/runtime#71685<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a> which makes <code>typeof(TEnum).IsEnum<\/code> a JIT intrinsic (such that it effectively becomes a const), and the non-generic entrypoints use switches on <code>TypeCode<\/code>\/<code>CorElementType<\/code> as was already being done in a variety of methods.<\/p>\n<p>Other improvements were made to <code>Enum<\/code> as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76162\">dotnet\/runtime#76162<\/a> improves the performance of various methods like <code>ToString<\/code> and <code>IsDefined<\/code> in cases where all of the <code>enum<\/code>&#8216;s defined values are sequential starting from 0. 
In that common case, the internal function that looks up the value in the <code>EnumInfo&lt;TUnderlyingValue&gt;<\/code> can do so with a simple array access, rather than needing to search for the target.<\/p>\n<p>The net result of all of these changes is some very nice performance improvements:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly DayOfWeek _dow = DayOfWeek.Saturday;\r\n\r\n    [Benchmark] public bool IsDefined() =&gt; Enum.IsDefined(_dow);\r\n    [Benchmark] public string GetName() =&gt; Enum.GetName(_dow);\r\n    [Benchmark] public string[] GetNames() =&gt; Enum.GetNames&lt;DayOfWeek&gt;();\r\n    [Benchmark] public DayOfWeek[] GetValues() =&gt; Enum.GetValues&lt;DayOfWeek&gt;();\r\n    [Benchmark] public Array GetUnderlyingValues() =&gt; Enum.GetValuesAsUnderlyingType&lt;DayOfWeek&gt;();\r\n    [Benchmark] public string EnumToString() =&gt; _dow.ToString();\r\n    [Benchmark] public bool TryParse() =&gt; Enum.TryParse&lt;DayOfWeek&gt;(\"Saturday\", out _);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">20.021 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">2.502 
ns<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetName<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">24.563 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>GetName<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">3.648 ns<\/td>\n<td style=\"text-align: right\">0.15<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetNames<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">37.138 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetNames<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">22.688 ns<\/td>\n<td style=\"text-align: right\">0.61<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetValues<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">694.356 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">224 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetValues<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: 
right\">39.406 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">0.25<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetUnderlyingValues<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">41.012 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetUnderlyingValues<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">17.249 ns<\/td>\n<td style=\"text-align: right\">0.42<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>EnumToString<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">32.842 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">24 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>EnumToString<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">14.620 ns<\/td>\n<td style=\"text-align: right\">0.44<\/td>\n<td style=\"text-align: right\">24 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>TryParse<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">49.121 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>TryParse<\/td>\n<td>.NET 
8.0<\/td>\n<td style=\"text-align: right\">30.394 ns<\/td>\n<td style=\"text-align: right\">0.62<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These changes, however, also made <code>enum<\/code>s play much more nicely with string interpolation. First, <code>Enum<\/code> now sports a new static <code>TryFormat<\/code> method, which enables formatting an <code>enum<\/code>&#8217;s string representation directly into a <code>Span&lt;char&gt;<\/code>:<\/p>\n<pre><code class=\"language-C#\">public static bool TryFormat&lt;TEnum&gt;(TEnum value, Span&lt;char&gt; destination, out int charsWritten, [StringSyntax(StringSyntaxAttribute.EnumFormat)] ReadOnlySpan&lt;char&gt; format = default) where TEnum : struct, Enum<\/code><\/pre>\n<p>Second, <code>Enum<\/code> now implements <code>ISpanFormattable<\/code>, such that any code written to use a value&#8217;s <code>ISpanFormattable.TryFormat<\/code> method now lights up with <code>enum<\/code>s, too. However, even though enums are value types, they&#8217;re special and weird in that they derive from the reference type <code>Enum<\/code>, and that means calling instance methods like <code>ToString<\/code> or <code>ISpanFormattable.TryFormat<\/code> ends up boxing the enum value.<\/p>\n<p>So, third, the various interpolated string handlers in <code>System.Private.CoreLib<\/code> were updated to special-case <code>typeof(T).IsEnum<\/code>, which, as noted, is now effectively free thanks to JIT optimizations, using <code>Enum.TryFormat<\/code> directly in order to avoid the boxing. 
We can see the impact this has by running the following benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly char[] _dest = new char[100];\r\n    private readonly FileAttributes _attr = FileAttributes.Hidden | FileAttributes.ReadOnly;\r\n\r\n    [Benchmark]\r\n    public bool Interpolate() =&gt; _dest.AsSpan().TryWrite($\"Attrs: {_attr}\", out int charsWritten);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Interpolate<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">81.58 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Interpolate<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">34.41 ns<\/td>\n<td style=\"text-align: right\">0.42<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Numbers<\/h2>\n<p>Such formatting improvements weren&#8217;t just reserved for <code>enum<\/code>s. The performance of number formatting also sees a nice set of improvements in .NET 8. Daniel Lemire has a <a href=\"https:\/\/lemire.me\/blog\/2021\/06\/03\/computing-the-number-of-digits-of-an-integer-even-faster\/\">nice blog post from 2021<\/a> discussing various approaches to counting the number of digits in an integer. 
Digit counting is relevant to number formatting as we need to know how many characters the number will be, either to allocate a string of the right length to format into or to ensure that a destination buffer is of a sufficient length. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76519\">dotnet\/runtime#76519<\/a> implements this inside of .NET&#8217;s number formatting, providing a branch-free, table-based lookup solution for computing the number of digits in a formatted value.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76726\">dotnet\/runtime#76726<\/a> improves performance further by using a trick <a href=\"https:\/\/engineering.fb.com\/2013\/03\/15\/developer-tools\/three-optimization-tips-for-c\/\">other formatting libraries use<\/a>. One of the more expensive parts of formatting a decimal is in dividing by 10 to pull off each digit; if we can reduce the number of divisions, we can reduce the overall expense of the formatting operation. The trick here is, rather than dividing by 10 for each digit in the number, we instead divide by 100 for each pair of digits in the number, and then have a precomputed lookup table for the <code>char<\/code>-based representation of all values 0 to 99. This lets us cut the number of divisions in half.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79061\">dotnet\/runtime#79061<\/a> also expands on a previous optimization already present in .NET. The formatting code contained a table of precomputed strings for single digit numbers, so if you asked for the equivalent of <code>0.ToString()<\/code>, the implementation wouldn&#8217;t need to allocate a new string, it would just fetch <code>\"0\"<\/code> from the table and return it. This PR expands that cache from single digit numbers to being all numbers 0 through 299 (it also makes the cache lazy, such that we don&#8217;t need to pay for the strings for values that are never used). 
The choice of 299 is somewhat arbitrary and could be raised in the future if the need presents itself, but in examining data from various services, this addresses a significant chunk of the allocations that come from number formatting. Coincidentally or not, it also includes all success status codes from the HTTP protocol.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(12)]\r\n    [Arguments(123)]\r\n    [Arguments(1_234_567_890)]\r\n    public string Int32ToString(int i) =&gt; i.ToString();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>i<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Int32ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>12<\/td>\n<td style=\"text-align: right\">16.253 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">32 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Int32ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>12<\/td>\n<td style=\"text-align: right\">1.985 ns<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Int32ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>123<\/td>\n<td 
style=\"text-align: right\">18.056 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">32 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Int32ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>123<\/td>\n<td style=\"text-align: right\">1.971 ns<\/td>\n<td style=\"text-align: right\">0.11<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Int32ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>1234567890<\/td>\n<td style=\"text-align: right\">26.964 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">48 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Int32ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>1234567890<\/td>\n<td style=\"text-align: right\">17.082 ns<\/td>\n<td style=\"text-align: right\">0.63<\/td>\n<td style=\"text-align: right\">48 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Numbers in .NET 8 also gain the ability to format as binary (via <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84889\">dotnet\/runtime#84889<\/a>), and parse from binary (via <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84998\">dotnet\/runtime#84998<\/a>), via the new &#8220;b&#8221; specifier. 
For example, this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -f net8.0\r\n\r\nint i = 12345;\r\nConsole.WriteLine(i.ToString(\"x16\")); \/\/ 16 hex digits\r\nConsole.WriteLine(i.ToString(\"b16\")); \/\/ 16 binary digits<\/code><\/pre>\n<p>outputs:<\/p>\n<pre><code class=\"language-text\">0000000000003039\r\n0011000000111001<\/code><\/pre>\n<p>That implementation is then used to reimplement the existing <code>Convert.ToString(int value, int toBase)<\/code> method, such that it&#8217;s also now optimized:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly int _value = 12345;\r\n\r\n    [Benchmark]\r\n    public string ConvertBinary() =&gt; Convert.ToString(_value, 2);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ConvertBinary<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">104.73 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConvertBinary<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">23.76 ns<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In a significant addition to the primitive types (numerical and beyond), .NET 8 also sees the introduction of the new <code>IUtf8SpanFormattable<\/code> interface. 
<code>ISpanFormattable<\/code> was introduced in .NET 6, and with it <code>TryFormat<\/code> methods on many types that enable those types to directly format into a <code>Span&lt;char&gt;<\/code>:<\/p>\n<pre><code class=\"language-C#\">public interface ISpanFormattable : IFormattable\r\n{\r\n    bool TryFormat(Span&lt;char&gt; destination, out int charsWritten, ReadOnlySpan&lt;char&gt; format, IFormatProvider? provider);\r\n}<\/code><\/pre>\n<p>Now in .NET 8, we also have the <code>IUtf8SpanFormattable<\/code> interface:<\/p>\n<pre><code class=\"language-C#\">public interface IUtf8SpanFormattable\r\n{\r\n    bool TryFormat(Span&lt;byte&gt; utf8Destination, out int bytesWritten, ReadOnlySpan&lt;char&gt; format, IFormatProvider? provider);\r\n}<\/code><\/pre>\n<p>that enables types to directly format into a <code>Span&lt;byte&gt;<\/code>. These are by design almost identical, the key difference being whether the implementation of these interfaces writes out UTF16 <code>char<\/code>s or UTF8 <code>byte<\/code>s. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84587\">dotnet\/runtime#84587<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84841\">dotnet\/runtime#84841<\/a>, all of the numerical primitives in <code>System.Private.CoreLib<\/code> both implement the new interface and expose a public <code>TryFormat<\/code> method. So, for example, <code>ulong<\/code> exposes these:<\/p>\n<pre><code class=\"language-C#\">public bool TryFormat(Span&lt;char&gt; destination, out int charsWritten, [StringSyntax(StringSyntaxAttribute.NumericFormat)] ReadOnlySpan&lt;char&gt; format = default, IFormatProvider? provider = null);\r\npublic bool TryFormat(Span&lt;byte&gt; utf8Destination, out int bytesWritten, [StringSyntax(StringSyntaxAttribute.NumericFormat)] ReadOnlySpan&lt;char&gt; format = default, IFormatProvider? 
provider = null);<\/code><\/pre>\n<p>They have the exact same functionality, support the exact same format strings, have the same general performance characteristics, and so on, and differ only in whether they write out UTF16 or UTF8. How can I be so sure they&#8217;re so similar? Because, drumroll, they share the same implementation. Thanks to generics, the two methods above delegate to the exact same helper:<\/p>\n<pre><code class=\"language-C#\">public static bool TryFormatUInt64&lt;TChar&gt;(ulong value, ReadOnlySpan&lt;char&gt; format, IFormatProvider? provider, Span&lt;TChar&gt; destination, out int charsWritten)<\/code><\/pre>\n<p>with one using <code>TChar<\/code> as <code>char<\/code> and the other as <code>byte<\/code>. So, when we run a benchmark like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly ulong _value = 12345678901234567890;\r\n    private readonly char[] _chars = new char[20];\r\n    private readonly byte[] _bytes = new byte[20];\r\n\r\n    [Benchmark] public void FormatUTF16() =&gt; _value.TryFormat(_chars, out _);\r\n    [Benchmark] public void FormatUTF8() =&gt; _value.TryFormat(_bytes, out _);\r\n}<\/code><\/pre>\n<p>we get practically identical results:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FormatUTF16<\/td>\n<td style=\"text-align: right\">12.10 ns<\/td>\n<\/tr>\n<tr>\n<td>FormatUTF8<\/td>\n<td style=\"text-align: right\">12.96 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And now that the primitive types themselves are able to format with full fidelity as UTF8, the <code>Utf8Formatter<\/code> class largely becomes legacy. 
In fact, the previously mentioned PR also rips out <code>Utf8Formatter<\/code>&#8216;s implementation and just reparents it on top of the same formatting logic from the primitive types. All of the previously cited performance improvements to number formatting then not only accrue to <code>ToString<\/code> and <code>TryFormat<\/code> for UTF16, and not only to <code>TryFormat<\/code> for UTF8, but then also to <code>Utf8Formatter<\/code> (plus, removing duplicated code and reducing maintenance burden makes me giddy).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _bytes = new byte[10];\r\n\r\n    [Benchmark]\r\n    [Arguments(123)]\r\n    [Arguments(1234567890)]\r\n    public bool Utf8FormatterTryFormat(int i) =&gt; Utf8Formatter.TryFormat(i, _bytes, out int bytesWritten);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>i<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Utf8FormatterTryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>123<\/td>\n<td style=\"text-align: right\">8.849 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Utf8FormatterTryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>123<\/td>\n<td style=\"text-align: right\">4.645 ns<\/td>\n<td style=\"text-align: right\">0.53<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Utf8FormatterTryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>1234567890<\/td>\n<td style=\"text-align: right\">15.844 ns<\/td>\n<td 
style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Utf8FormatterTryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>1234567890<\/td>\n<td style=\"text-align: right\">7.174 ns<\/td>\n<td style=\"text-align: right\">0.45<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Not only is UTF8 formatting directly supported by all these types, so, too, is parsing. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86875\">dotnet\/runtime#86875<\/a> added the new <code>IUtf8SpanParsable&lt;TSelf&gt;<\/code> interface and implemented it on the primitive numeric types. Just as with its formatting counterpart, this provides identical behavior to <code>IParsable&lt;TSelf&gt;<\/code>, just for UTF8 instead of UTF16. And just as with its formatting counterpart, all of the parsing logic is shared in generic routines between the two modes. In fact, not only does this share logic between UTF16 and UTF8 parsing, it follows closely on the heels of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84582\">dotnet\/runtime#84582<\/a>, which uses the same generic tricks to deduplicate the parsing logic across all the primitive types, such that the same generic routines end up being used for all the types and both UTF8 and UTF16. That PR removed almost 2,000 lines of code from <code>System.Private.CoreLib<\/code>:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/GitHubPlusMinusIndicatorForParsingDeduping.png\" alt=\"GitHub plus\/minus line count indicator for parsing deduplication\" \/><\/p>\n<h2>DateTime<\/h2>\n<p>Parsing and formatting are improved on other types, as well. Take <code>DateTime<\/code> and <code>DateTimeOffset<\/code>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84963\">dotnet\/runtime#84963<\/a> improved a variety of aspects of <code>DateTime{Offset}<\/code> formatting:<\/p>\n<ul>\n<li>The formatting logic has a general-purpose path that&#8217;s used as a fallback and supports any custom format, but there are also dedicated routines for the most popular formats, allowing them to be optimized and tuned. Dedicated routines already existed for the very popular &#8220;r&#8221; (RFC1123 pattern) and &#8220;o&#8221; (round-trip date\/time pattern) formats; this PR adds dedicated routines for the default format (&#8220;G&#8221;) when used with the invariant culture, the &#8220;s&#8221; format (sortable date\/time pattern), and &#8220;u&#8221; format (universal sortable date\/time pattern), all of which are used frequently in a variety of domains.<\/li>\n<li>For the &#8220;U&#8221; format (universal full date\/time pattern), the implementation would end up always allocating new <code>DateTimeFormatInfo<\/code> and <code>GregorianCalendar<\/code> instances, resulting in a significant amount of allocation even though they were only needed in a rare fallback case. This PR fixes it to only allocate when truly required.<\/li>\n<li>When there&#8217;s no dedicated formatting routine, formatting is done into an internal <code>ref struct<\/code> called <code>ValueListBuilder&lt;T&gt;<\/code> that starts with a provided span buffer (typically seeded from a <code>stackalloc<\/code>) and then grows with <code>ArrayPool<\/code> memory as needed. After the formatting has completed, that builder is either copied into a destination span or a new string, depending on the method that triggered the formatting. However, we can avoid that copy for a destination span if we just seed the builder with the destination span. 
Then if the builder still contains the initial span when formatting has completed (having not grown out of it), we know all the data fit, and we can skip the copy, as all the data is already there.<\/li>\n<\/ul>\n<p>Here&#8217;s some of the example impact:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Globalization;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly DateTime _dt = new DateTime(2023, 9, 1, 12, 34, 56);\r\n    private readonly char[] _chars = new char[100];\r\n\r\n    [Params(null, \"s\", \"u\", \"U\", \"G\")]\r\n    public string Format { get; set; }\r\n\r\n    [Benchmark] public string DT_ToString() =&gt; _dt.ToString(Format);\r\n    [Benchmark] public string DT_ToStringInvariant() =&gt; _dt.ToString(Format, CultureInfo.InvariantCulture);\r\n    [Benchmark] public bool DT_TryFormat() =&gt; _dt.TryFormat(_chars, out _, Format);\r\n    [Benchmark] public bool DT_TryFormatInvariant() =&gt; _dt.TryFormat(_chars, out _, Format, CultureInfo.InvariantCulture);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Format<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">166.64 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: 
right\">102.45 ns<\/td>\n<td style=\"text-align: right\">0.62<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">161.94 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">28.74 ns<\/td>\n<td style=\"text-align: right\">0.18<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">151.52 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">78.57 ns<\/td>\n<td style=\"text-align: right\">0.52<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">140.35 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td 
style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>?<\/td>\n<td style=\"text-align: right\">18.26 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">162.86 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">109.49 ns<\/td>\n<td style=\"text-align: right\">0.68<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">162.20 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">102.71 ns<\/td>\n<td style=\"text-align: right\">0.63<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">148.32 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">83.60 ns<\/td>\n<td style=\"text-align: right\">0.57<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">145.05 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>G<\/td>\n<td style=\"text-align: right\">79.77 ns<\/td>\n<td style=\"text-align: right\">0.55<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">186.44 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">29.35 ns<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">182.15 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">27.67 ns<\/td>\n<td style=\"text-align: right\">0.16<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">165.08 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">15.53 ns<\/td>\n<td style=\"text-align: right\">0.09<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">155.24 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 
8.0<\/td>\n<td>s<\/td>\n<td style=\"text-align: right\">15.50 ns<\/td>\n<td style=\"text-align: right\">0.10<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">184.71 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">29.62 ns<\/td>\n<td style=\"text-align: right\">0.16<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">184.01 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">26.98 ns<\/td>\n<td style=\"text-align: right\">0.15<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">171.73 ns<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">16.08 ns<\/td>\n<td style=\"text-align: right\">0.09<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">158.42 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>u<\/td>\n<td style=\"text-align: right\">15.58 ns<\/td>\n<td style=\"text-align: right\">0.10<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">NA<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 7.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">1,622.28 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1240 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToString<\/td>\n<td>.NET 8.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">206.08 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">96 B<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td 
style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">1,567.92 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1240 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_ToStringInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">207.60 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">96 B<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">1,590.27 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1144 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">190.98 ns<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">1,560.00 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1144 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DT_TryFormatInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td>U<\/td>\n<td style=\"text-align: right\">184.11 ns<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: 
right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Parsing has also improved meaningfully. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82877\">dotnet\/runtime#82877<\/a> improves the handling of &#8220;ddd&#8221; (abbreviated name of the day of the week), &#8220;dddd&#8221; (full name of the day of the week), &#8220;MMM&#8221; (abbreviated name of the month), and &#8220;MMMM&#8221; (full name of the month) in a custom format string; these show up in a variety of commonly used format strings, such as in the expanded definition of the RFC1123 format: <code>ddd, dd MMM yyyy HH':'mm':'ss 'GMT'<\/code>. When the general parsing routine encounters these in a format string, it needs to consult the supplied <code>CultureInfo<\/code> \/ <code>DateTimeFormatInfo<\/code> for that culture&#8217;s associated month and day names, e.g. <code>DateTimeFormatInfo.GetAbbreviatedMonthName<\/code>, and then needs to do a linguistic ignore-case comparison for each name against the input text; that&#8217;s not particularly cheap. However, if we&#8217;re given an invariant culture, we can do the comparison much, much faster. Take &#8220;MMM&#8221; for abbreviated month name, for example. We can read the next three characters (<code>uint m0 = span[0], m1 = span[1], m2 = span[2]<\/code>), ensure they&#8217;re all ASCII (<code>(m0 | m1 | m2) &lt;= 0x7F<\/code>), and then combine them all into a single <code>uint<\/code>, employing the same ASCII casing trick discussed earlier (<code>(m0 &lt;&lt; 16) | (m1 &lt;&lt; 8) | m2 | 0x202020<\/code>). 
We can do the same thing, precomputed, for each month name, which for the invariant culture we know in advance, and the entire lookup becomes a single numerical <code>switch<\/code>:<\/p>\n<pre><code class=\"language-C#\">switch ((m0 &lt;&lt; 16) | (m1 &lt;&lt; 8) | m2 | 0x202020)\r\n{\r\n    case 0x6a616e: \/* 'jan' *\/ result = 1; break;\r\n    case 0x666562: \/* 'feb' *\/ result = 2; break;\r\n    case 0x6d6172: \/* 'mar' *\/ result = 3; break;\r\n    case 0x617072: \/* 'apr' *\/ result = 4; break;\r\n    case 0x6d6179: \/* 'may' *\/ result = 5; break;\r\n    case 0x6a756e: \/* 'jun' *\/ result = 6; break;\r\n    case 0x6a756c: \/* 'jul' *\/ result = 7; break;\r\n    case 0x617567: \/* 'aug' *\/ result = 8; break;\r\n    case 0x736570: \/* 'sep' *\/ result = 9; break;\r\n    case 0x6f6374: \/* 'oct' *\/ result = 10; break;\r\n    case 0x6e6f76: \/* 'nov' *\/ result = 11; break;\r\n    case 0x646563: \/* 'dec' *\/ result = 12; break;\r\n    default: maxMatchStrLen = 0; break; \/\/ undo match assumption\r\n}  <\/code><\/pre>\n<p>Nifty, and way faster.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Globalization;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private const string Format = \"ddd, dd MMM yyyy HH':'mm':'ss 'GMT'\";\r\n\r\n    private readonly string _s = new DateTime(1955, 11, 5, 6, 0, 0, DateTimeKind.Utc).ToString(Format, CultureInfo.InvariantCulture);\r\n\r\n    [Benchmark]\r\n    public void ParseExact() =&gt; DateTimeOffset.ParseExact(_s, Format, CultureInfo.InvariantCulture, DateTimeStyles.AllowInnerWhite | 
DateTimeStyles.AssumeUniversal);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ParseExact<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,139.3 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ParseExact<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">318.6 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A variety of other PRs contributed as well. The decreased allocation in the previous benchmark is thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82861\">dotnet\/runtime#82861<\/a>, which removed a string allocation that might occur when the format string contained quotes; the PR simply replaced the string allocation with use of spans. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82925\">dotnet\/runtime#82925<\/a> further reduced the cost of parsing with the &#8220;r&#8221; and &#8220;o&#8221; formats by removing some work that ended up being unnecessary, removing a virtual dispatch, and general streamlining of the code paths. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84964\">dotnet\/runtime#84964<\/a> removed some <code>string[]<\/code> allocations that occurred in <code>ParseExact<\/code> when parsing with some cultures, in particular those that employ genitive month names. 
If the parser needed to retrieve the <code>MonthGenitiveNames<\/code> or <code>AbbreviatedMonthGenitiveNames<\/code> arrays, it would do so via the public properties for these on <code>DateTimeFormatInfo<\/code>; however, out of concern that code could mutate those arrays, these public properties hand back copies. That means that the parser was allocating a copy every time it accessed one of these. The parser can instead access the underlying original array, and pinky swear not to change it.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Globalization;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly CultureInfo _ci = new CultureInfo(\"ru-RU\");\r\n\r\n    [Benchmark] public DateTime Parse() =&gt; DateTime.ParseExact(\"\u0432\u0442\u043e\u0440\u043d\u0438\u043a, 18 \u0430\u043f\u0440\u0435\u043b\u044f 2023 04:31:26\", \"dddd, dd MMMM yyyy HH:mm:ss\", _ci);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Parse<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2.654 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">128 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Parse<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">2.353 us<\/td>\n<td style=\"text-align: right\">0.90<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: 
right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>DateTime<\/code> and <code>DateTimeOffset<\/code> also implement <code>IUtf8SpanFormattable<\/code>, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84469\">dotnet\/runtime#84469<\/a>, and as with the numerical types, the implementations are all shared between UTF16 and UTF8; thus all of the optimizations previously mentioned accrue to both. And again, <code>Utf8Formatter<\/code>&#8217;s support for formatting <code>DateTimeOffset<\/code> is just reparented on top of this same shared logic.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly DateTime _dt = new DateTime(2023, 9, 1, 12, 34, 56);\r\n    private readonly byte[] _bytes = new byte[100];\r\n\r\n    [Benchmark] public bool TryFormatUtf8Formatter() =&gt; Utf8Formatter.TryFormat(_dt, _bytes, out _); \r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryFormatUtf8Formatter<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">19.35 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryFormatUtf8Formatter<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">16.24 ns<\/td>\n<td style=\"text-align: right\">0.83<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Since we&#8217;re talking about <code>DateTime<\/code>, a brief foray into <code>TimeZoneInfo<\/code>. <code>TimeZoneInfo.FindSystemTimeZoneById<\/code> gets a <code>TimeZoneInfo<\/code> object for the specified identifier. 
One of the <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/date-time-and-time-zone-enhancements-in-net-6\/\">improvements introduced in .NET 6<\/a> is that <code>FindSystemTimeZoneById<\/code> supports both the Windows time zone set as well as the IANA time zone set, regardless of whether running on Windows or Linux or macOS. However, the <code>TimeZoneInfo<\/code> was only being cached when its ID matched that for the current OS, and as such calls that resolved to the other set weren&#8217;t being fulfilled by the cache and were falling back to re-reading from the OS. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85615\">dotnet\/runtime#85615<\/a> ensures a cache can be used in both cases. It also allows returning the immutable <code>TimeZoneInfo<\/code> objects directly, rather than cloning them on every access. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88368\">dotnet\/runtime#88368<\/a> also improves <code>TimeZoneInfo<\/code>, in particular <code>GetSystemTimeZones<\/code> on Linux and macOS, by lazily loading several of the properties. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89985\">dotnet\/runtime#89985<\/a> then improves on that with a new overload of <code>GetSystemTimeZones<\/code> that allows the caller to skip the sort the implementation would otherwise perform on the result.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    [Arguments(\"America\/Los_Angeles\")]\r\n    [Arguments(\"Pacific Standard Time\")]\r\n    public TimeZoneInfo FindSystemTimeZoneById(string id) =&gt; TimeZoneInfo.FindSystemTimeZoneById(id);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>id<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FindSystemTimeZoneById<\/td>\n<td>.NET 7.0<\/td>\n<td>America\/Los_Angeles<\/td>\n<td style=\"text-align: right\">1,503.75 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FindSystemTimeZoneById<\/td>\n<td>.NET 8.0<\/td>\n<td>America\/Los_Angeles<\/td>\n<td style=\"text-align: right\">40.96 ns<\/td>\n<td style=\"text-align: right\">0.03<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<\/tr>\n<tr>\n<td>FindSystemTimeZoneById<\/td>\n<td>.NET 7.0<\/td>\n<td>Pacif(&#8230;) Time [21]<\/td>\n<td style=\"text-align: right\">3,951.60 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">568 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FindSystemTimeZoneById<\/td>\n<td>.NET 8.0<\/td>\n<td>Pacif(&#8230;) Time [21]<\/td>\n<td style=\"text-align: right\">57.00 ns<\/td>\n<td style=\"text-align: right\">0.01<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Back to formatting and parsing&#8230;<\/p>\n<h2>Guid<\/h2>\n<p>Formatting and parsing improvements go beyond the numerical and date types. <code>Guid<\/code> also gets in on the game. Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84553\">dotnet\/runtime#84553<\/a>, <code>Guid<\/code> implements <code>IUtf8SpanFormattable<\/code>, and as with all the other cases, it shares the exact same routines between UTF16 and UTF8 support. 
Then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81650\">dotnet\/runtime#81650<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81666\">dotnet\/runtime#81666<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87126\">dotnet\/runtime#87126<\/a> from <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a> vectorize that formatting support.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly Guid _guid = Guid.Parse(\"7BD626F6-4396-41E3-A491-4B1DC538DD92\");\r\n    private readonly char[] _dest = new char[100];\r\n\r\n    [Benchmark]\r\n    [Arguments(\"D\")]\r\n    [Arguments(\"N\")]\r\n    [Arguments(\"B\")]\r\n    [Arguments(\"P\")]\r\n    public bool TryFormat(string format) =&gt; _guid.TryFormat(_dest, out _, format);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>format<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>B<\/td>\n<td style=\"text-align: right\">23.622 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>B<\/td>\n<td style=\"text-align: right\">7.341 ns<\/td>\n<td style=\"text-align: right\">0.31<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>D<\/td>\n<td style=\"text-align: right\">22.134 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 
8.0<\/td>\n<td>D<\/td>\n<td style=\"text-align: right\">5.485 ns<\/td>\n<td style=\"text-align: right\">0.25<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>N<\/td>\n<td style=\"text-align: right\">20.891 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>N<\/td>\n<td style=\"text-align: right\">4.852 ns<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 7.0<\/td>\n<td>P<\/td>\n<td style=\"text-align: right\">24.139 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryFormat<\/td>\n<td>.NET 8.0<\/td>\n<td>P<\/td>\n<td style=\"text-align: right\">6.101 ns<\/td>\n<td style=\"text-align: right\">0.25<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Before moving on from primitives and numerics, let&#8217;s take a quick look at <code>System.Random<\/code>, which has methods for producing pseudo-random numerical values.<\/p>\n<h2>Random<\/h2>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79790\">dotnet\/runtime#79790<\/a> from <a href=\"https:\/\/github.com\/mla-alm\">@mla-alm<\/a> provides an implementation in <code>Random<\/code> based on <a href=\"https:\/\/github.com\/lemire\">@lemire<\/a>&#8217;s <a href=\"https:\/\/github.com\/lemire\/fastrange\">unbiased range functions<\/a>. When a method like <code>Next(int min, int max)<\/code> is invoked, it needs to provide a value in the range <code>[min, max)<\/code>. 
In order to provide an unbiased answer, the .NET 7 implementation generates a 32-bit value, narrows down the range to the smallest power of 2 that contains the max (by taking the log2 of the max and shifting to throw away bits), and then checks whether the result is less than the max: if it is, it returns the result as the answer. But if it&#8217;s not, it rejects the value (a process referred to as &#8220;rejection sampling&#8221;) and loops around to start the whole process over. While the cost to produce each sample in that approach isn&#8217;t terrible, the nature of the approach makes it reasonably likely the sample will need to be rejected, which means looping and retries. The new approach effectively implements modulo reduction (e.g. <code>Next() % max<\/code>), except replacing the expensive modulo operation with a cheaper multiplication and shift; a rejection sampling loop is still employed, but the bias it corrects for arises much more rarely, and thus the more expensive path is taken much more rarely. 
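<\/p>\n<p>To make the multiply-and-shift idea concrete, here&#8217;s a minimal sketch of that style of bounded generation for 32-bit values. It&#8217;s not the runtime&#8217;s actual source: <code>BoundedSketch<\/code>, <code>NextUInt32<\/code>, and <code>maxExclusive<\/code> are names made up for illustration, and a plain <code>Random<\/code> stands in for the internal generator the real implementation draws from.<\/p>\n<pre><code class=\"language-C#\">\/\/ Illustrative sketch only; the names are hypothetical and this is not the code from dotnet\/runtime.\r\nusing System;\r\n\r\nvar rng = new Random();\r\nConsole.WriteLine(BoundedSketch.Next(rng, 12345)); \/\/ some value in [0, 12345)\r\n\r\nstatic class BoundedSketch\r\n{\r\n    \/\/ Assumption: stand-in source of a uniformly distributed 32-bit value.\r\n    private static uint NextUInt32(Random rng) =&gt; (uint)rng.NextInt64(0, 1L &lt;&lt; 32);\r\n\r\n    public static uint Next(Random rng, uint maxExclusive)\r\n    {\r\n        \/\/ Multiply-shift: the high 32 bits of the 64-bit product land in [0, maxExclusive).\r\n        ulong product = (ulong)NextUInt32(rng) * maxExclusive;\r\n        uint low = (uint)product;\r\n\r\n        \/\/ A sample can only be biased when the low half is below maxExclusive,\r\n        \/\/ so for small ranges this branch is rarely entered.\r\n        if (low &lt; maxExclusive)\r\n        {\r\n            uint threshold = (uint)((1UL &lt;&lt; 32) % maxExclusive); \/\/ 2^32 mod maxExclusive\r\n            while (low &lt; threshold)\r\n            {\r\n                \/\/ Rejection sampling: draw again until the low half clears the threshold.\r\n                product = (ulong)NextUInt32(rng) * maxExclusive;\r\n                low = (uint)product;\r\n            }\r\n        }\r\n\r\n        return (uint)(product &gt;&gt; 32);\r\n    }\r\n}<\/code><\/pre>\n<p>Note that the one remaining modulo, used to compute the threshold, sits behind the <code>low &lt; maxExclusive<\/code> check, so for a small range like <code>12345<\/code> it&#8217;s almost never executed.<\/p>\n<p>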
The net result is a nice boost on average to the throughput of <code>Random<\/code>&#8217;s methods. (<code>Random<\/code> can also get a boost from dynamic PGO, as the internal abstraction <code>Random<\/code> uses can be devirtualized, so I&#8217;ve shown here the impact with and without PGO enabled.)<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\".NET 7\").WithRuntime(CoreRuntime.Core70).AsBaseline())\r\n    .AddJob(Job.Default.WithId(\".NET 8 w\/o PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\".NET 8\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\")]\r\npublic class Tests\r\n{\r\n    private static readonly Random s_rand = new();\r\n\r\n    [Benchmark]\r\n    public int NextMax() =&gt; s_rand.Next(12345);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NextMax<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">5.793 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextMax<\/td>\n<td>.NET 8.0 w\/o PGO<\/td>\n<td style=\"text-align: right\">1.840 ns<\/td>\n<td style=\"text-align: right\">0.32<\/td>\n<\/tr>\n<tr>\n<td>NextMax<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.598 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87219\">dotnet\/runtime#87219<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a> then further improves this for <code>long<\/code> values. The core part of the algorithm involves multiplying the random value by the max value and then taking the low part of the product:<\/p>\n<pre><code class=\"language-C#\">UInt128 randomProduct = (UInt128)maxValue * xoshiro.NextUInt64();\r\nulong lowPart = (ulong)randomProduct;<\/code><\/pre>\n<p>This can be made more efficient by not using <code>UInt128<\/code>&#8217;s multiplication implementation and instead using <code>Math.BigMul<\/code>,<\/p>\n<pre><code class=\"language-C#\">ulong randomProduct = Math.BigMul(maxValue, xoshiro.NextUInt64(), out ulong lowPart);<\/code><\/pre>\n<p>which is implemented to use the <code>Bmi2.X64.MultiplyNoFlags<\/code> or <code>ArmBase.Arm64.MultiplyHigh<\/code> intrinsics when one is available.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\")]\r\npublic class Tests\r\n{\r\n    private static readonly Random s_rand = new();\r\n\r\n    [Benchmark]\r\n    public long NextMinMax() =&gt; s_rand.NextInt64(123456789101112, 1314151617181920);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NextMinMax<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">9.839 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextMinMax<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.927 ns<\/td>\n<td style=\"text-align: 
right\">0.20<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Finally, I&#8217;ll mention <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81627\">dotnet\/runtime#81627<\/a>. <code>Random<\/code> is both a commonly-used type in its own right and also an abstraction; many of the APIs on <code>Random<\/code> are virtual, such that a derived type can be implemented to completely swap out the algorithm employed. So, for example, if you wanted to implement a <code>MersenneTwisterRandom<\/code> that derived from <code>Random<\/code> and completely replaced the base algorithm by overriding every virtual method, you could do so, pass your instance around as <code>Random<\/code>, and everyone&#8217;s happy&#8230; unless you&#8217;re creating your derived type frequently and care about allocation. <code>Random<\/code> actually includes multiple pseudo-random generators. .NET 6 imbued it with an implementation of the <code>xoshiro128**<\/code>\/<code>xoshiro256**<\/code> algorithms, which are used when you just do <code>new Random()<\/code>. However, if you instead instantiate a derived type, the implementation falls back to the same algorithm (a variant of Knuth&#8217;s subtractive random number generator algorithm) it&#8217;s used since the dawn of <code>Random<\/code>, as it doesn&#8217;t know what the derived type will be doing nor what dependencies it may have taken on the nature of the algorithm employed. That algorithm carries with it a 56-element <code>int[]<\/code>, which means that derived classes end up instantiating and initializing that array even if they never use it. With this PR, the creation of that array is made lazy, such that it&#8217;s only initialized if and when it&#8217;s used. 
With that, a derived implementation that wants to avoid that cost can.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark] public Random NewDerived() =&gt; new NotRandomRandom();\r\n\r\n    private sealed class NotRandomRandom : Random { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NewDerived<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,237.73 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">312 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NewDerived<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">20.49 ns<\/td>\n<td style=\"text-align: right\">0.02<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Strings, Arrays, and Spans<\/h2>\n<p>.NET 8 sees a tremendous amount of improvement in the realm of data processing, in particular in the efficient manipulation of strings, arrays, and spans. Since we&#8217;ve just been talking about UTF8 and <code>IUtf8SpanFormattable<\/code>, let&#8217;s start there.<\/p>\n<h3>UTF8<\/h3>\n<p>As noted, <code>IUtf8SpanFormattable<\/code> is now implemented on a bunch of types. 
I noted all the numerical primitives, <code>DateTime{Offset}<\/code>, and <code>Guid<\/code>, and with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84556\">dotnet\/runtime#84556<\/a> the <code>System.Version<\/code> type also implements it, as do <code>IPAddress<\/code> and the new <code>IPNetwork<\/code> types, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84487\">dotnet\/runtime#84487<\/a>. However, .NET 8 doesn&#8217;t just provide implementations of this interface on all of these types; it also consumes the interface in a key place.<\/p>\n<p>If you&#8217;ll recall, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/string-interpolation-in-c-10-and-net-6\/\">string interpolation in C# 10 and .NET 6<\/a> was completely overhauled. This included not only making string interpolation much more efficient, but also providing a pattern that a type could implement to allow for the string interpolation syntax to be used efficiently to do things other than create a new string. 
For example, a new <code>TryWrite<\/code> extension method for <code>Span&lt;char&gt;<\/code> was added that makes it possible to format an interpolated string directly into a destination <code>char<\/code> buffer:<\/p>\n<pre><code class=\"language-C#\">public bool Format(Span&lt;char&gt; span, DateTime dt, out int charsWritten) =&gt;\r\n    span.TryWrite($\"Date: {dt:R}\", out charsWritten);<\/code><\/pre>\n<p>The above gets translated (&#8220;lowered&#8221;) by the compiler into the equivalent of the following:<\/p>\n<pre><code class=\"language-C#\">public bool Format(Span&lt;char&gt; span, DateTime dt, out int charsWritten)\r\n{\r\n    var handler = new MemoryExtensions.TryWriteInterpolatedStringHandler(6, 1, span, out bool shouldAppend);\r\n    _ = shouldAppend &amp;&amp;\r\n        handler.AppendLiteral(\"Date: \") &amp;&amp;\r\n        handler.AppendFormatted&lt;DateTime&gt;(dt, \"R\");\r\n    return MemoryExtensions.TryWrite(span, ref handler, out charsWritten);\r\n}<\/code><\/pre>\n<p>The implementation of that generic <code>AppendFormatted&lt;T&gt;<\/code> call examines the <code>T<\/code> and tries to do the most optimal thing. In this case, it&#8217;ll see that <code>T<\/code> implements <code>ISpanFormattable<\/code>, and it&#8217;ll end up using its <code>TryFormat<\/code> to format directly into the destination span.<\/p>\n<p>That&#8217;s for UTF16. Now with <code>IUtf8SpanFormattable<\/code>, we have the opportunity to do the same thing but for UTF8. And that&#8217;s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83852\">dotnet\/runtime#83852<\/a> does. It introduces the new <code>Utf8.TryWrite<\/code> method, which behaves exactly like the aforementioned <code>TryWrite<\/code>, except writing as UTF8 into a destination <code>Span&lt;byte&gt;<\/code> instead of as UTF16 into a destination <code>Span&lt;char&gt;<\/code>. 
The implementation also special-cases <code>IUtf8SpanFormattable<\/code>, using its <code>TryFormat<\/code> to write directly into the destination buffer.<\/p>\n<p>With that, we can write the equivalent of the method we wrote earlier:<\/p>\n<pre><code class=\"language-C#\">public bool Format(Span&lt;byte&gt; span, DateTime dt, out int bytesWritten) =&gt;\r\n    Utf8.TryWrite(span, $\"Date: {dt:R}\", out bytesWritten);<\/code><\/pre>\n<p>and that gets lowered as you&#8217;d now expect:<\/p>\n<pre><code class=\"language-C#\">public bool Format(Span&lt;byte&gt; span, DateTime dt, out int bytesWritten)\r\n{\r\n    var handler = new Utf8.TryWriteInterpolatedStringHandler(6, 1, span, out bool shouldAppend);\r\n    _ = shouldAppend &amp;&amp;\r\n        handler.AppendLiteral(\"Date: \") &amp;&amp;\r\n        handler.AppendFormatted&lt;DateTime&gt;(dt, \"R\");\r\n    return Utf8.TryWrite(span, ref handler, out bytesWritten);\r\n}<\/code><\/pre>\n<p>So, identical, other than the parts you expect to change. But that&#8217;s also a problem in some ways. Take a look at that <code>AppendLiteral(\"Date: \")<\/code> call. In the UTF16 case where we&#8217;re dealing with a destination <code>Span&lt;char&gt;<\/code>, the implementation of <code>AppendLiteral<\/code> simply needs to copy that string into the destination; not only that, but the JIT will inline the call, see that a string literal is being copied, and will unroll the copy, making it super efficient. But in the UTF8 case, we can&#8217;t just copy the UTF16 string <code>char<\/code>s into the destination UTF8 <code>Span&lt;byte&gt;<\/code> buffer; we need to UTF8 encode the string. 
And while we can certainly do that (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84609\">dotnet\/runtime#84609<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85120\">dotnet\/runtime#85120<\/a> make that trivial with the addition of a new <code>Encoding.TryGetBytes<\/code> method), it&#8217;s frustratingly inefficient to need to spend cycles repeatedly at run-time doing work that could be done at compile time. After all, we&#8217;re dealing with a string literal known at JIT time; it&#8217;d be really, really nice if the JIT could do the UTF8 encoding and then do an unrolled copy just as it&#8217;s already doing in the UTF16 case. And with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85328\">dotnet\/runtime#85328<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89376\">dotnet\/runtime#89376<\/a>, that&#8217;s exactly what happens, such that performance is effectively the same between them.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Unicode;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly char[] _chars = new char[100];\r\n    private readonly byte[] _bytes = new byte[100];\r\n    private readonly int _major = 1, _minor = 2, _build = 3, _revision = 4;\r\n\r\n    [Benchmark] public bool FormatUTF16() =&gt; _chars.AsSpan().TryWrite($\"{_major}.{_minor}.{_build}.{_revision}\", out int charsWritten);\r\n    [Benchmark] public bool FormatUTF8() =&gt; Utf8.TryWrite(_bytes, $\"{_major}.{_minor}.{_build}.{_revision}\", out int bytesWritten);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FormatUTF16<\/td>\n<td style=\"text-align: right\">19.07 
ns<\/td>\n<\/tr>\n<tr>\n<td>FormatUTF8<\/td>\n<td style=\"text-align: right\">19.33 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>ASCII<\/h2>\n<p>UTF8 is the predominant encoding for text on the internet and for the movement of text between endpoints. However, much of this data is actually the ASCII subset, the 128 values in the range <code>[0, 127]<\/code>. When you know the data you&#8217;re working with is ASCII, you can achieve even better performance by using routines optimized for the subset. The new <code>Ascii<\/code> class in .NET 8, introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75012\">dotnet\/runtime#75012<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84886\">dotnet\/runtime#84886<\/a>, and then further optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85926\">dotnet\/runtime#85926<\/a> from <a href=\"https:\/\/github.com\/gfoidl\">@gfoidl<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85266\">dotnet\/runtime#85266<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84881\">dotnet\/runtime#84881<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87141\">dotnet\/runtime#87141<\/a>, provides this:<\/p>\n<pre><code class=\"language-C#\">namespace System.Text;\r\n\r\npublic static class Ascii\r\n{\r\n    public static bool Equals(ReadOnlySpan&lt;byte&gt; left, ReadOnlySpan&lt;byte&gt; right);\r\n    public static bool Equals(ReadOnlySpan&lt;byte&gt; left, ReadOnlySpan&lt;char&gt; right);\r\n    public static bool Equals(ReadOnlySpan&lt;char&gt; left, ReadOnlySpan&lt;byte&gt; right);\r\n    public static bool Equals(ReadOnlySpan&lt;char&gt; left, ReadOnlySpan&lt;char&gt; right);\r\n\r\n    public static bool EqualsIgnoreCase(ReadOnlySpan&lt;byte&gt; left, ReadOnlySpan&lt;byte&gt; right);\r\n    public static bool EqualsIgnoreCase(ReadOnlySpan&lt;byte&gt; left, ReadOnlySpan&lt;char&gt; 
right);\r\n    public static bool EqualsIgnoreCase(ReadOnlySpan&lt;char&gt; left, ReadOnlySpan&lt;byte&gt; right);\r\n    public static bool EqualsIgnoreCase(ReadOnlySpan&lt;char&gt; left, ReadOnlySpan&lt;char&gt; right);\r\n\r\n    public static bool IsValid(byte value);\r\n    public static bool IsValid(char value);\r\n    public static bool IsValid(ReadOnlySpan&lt;byte&gt; value);\r\n    public static bool IsValid(ReadOnlySpan&lt;char&gt; value);\r\n\r\n    public static OperationStatus ToLower(ReadOnlySpan&lt;byte&gt; source, Span&lt;byte&gt; destination, out int bytesWritten);\r\n    public static OperationStatus ToLower(ReadOnlySpan&lt;char&gt; source, Span&lt;char&gt; destination, out int charsWritten);\r\n    public static OperationStatus ToLower(ReadOnlySpan&lt;byte&gt; source, Span&lt;char&gt; destination, out int charsWritten);\r\n    public static OperationStatus ToLower(ReadOnlySpan&lt;char&gt; source, Span&lt;byte&gt; destination, out int bytesWritten);\r\n\r\n    public static OperationStatus ToUpper(ReadOnlySpan&lt;byte&gt; source, Span&lt;byte&gt; destination, out int bytesWritten);\r\n    public static OperationStatus ToUpper(ReadOnlySpan&lt;char&gt; source, Span&lt;char&gt; destination, out int charsWritten);\r\n    public static OperationStatus ToUpper(ReadOnlySpan&lt;byte&gt; source, Span&lt;char&gt; destination, out int charsWritten);\r\n    public static OperationStatus ToUpper(ReadOnlySpan&lt;char&gt; source, Span&lt;byte&gt; destination, out int bytesWritten);\r\n\r\n    public static OperationStatus ToLowerInPlace(Span&lt;byte&gt; value, out int bytesWritten);\r\n    public static OperationStatus ToLowerInPlace(Span&lt;char&gt; value, out int charsWritten);\r\n    public static OperationStatus ToUpperInPlace(Span&lt;byte&gt; value, out int bytesWritten);\r\n    public static OperationStatus ToUpperInPlace(Span&lt;char&gt; value, out int charsWritten);\r\n\r\n    public static OperationStatus FromUtf16(ReadOnlySpan&lt;char&gt; source, 
Span&lt;byte&gt; destination, out int bytesWritten);\r\n    public static OperationStatus ToUtf16(ReadOnlySpan&lt;byte&gt; source, Span&lt;char&gt; destination, out int charsWritten);\r\n\r\n    public static Range Trim(ReadOnlySpan&lt;byte&gt; value);\r\n    public static Range Trim(ReadOnlySpan&lt;char&gt; value);\r\n\r\n    public static Range TrimEnd(ReadOnlySpan&lt;byte&gt; value);\r\n    public static Range TrimEnd(ReadOnlySpan&lt;char&gt; value);\r\n\r\n    public static Range TrimStart(ReadOnlySpan&lt;byte&gt; value);\r\n    public static Range TrimStart(ReadOnlySpan&lt;char&gt; value);\r\n}<\/code><\/pre>\n<p>Note that it provides overloads that operate on UTF16 (<code>char<\/code>) and UTF8 (<code>byte<\/code>), and in many cases, intermixes them, such that you can, for example, compare a UTF8 <code>ReadOnlySpan&lt;byte&gt;<\/code> with a UTF16 <code>ReadOnlySpan&lt;char&gt;<\/code>, or transcode a UTF16 <code>ReadOnlySpan&lt;char&gt;<\/code> to a UTF8 <code>Span&lt;byte&gt;<\/code> (which, when working with ASCII, is purely a narrowing operation, getting rid of the leading 0 <code>byte<\/code> in each <code>char<\/code>). The PR that added these methods also used them in a variety of places (something I advocate for strongly, in order to ensure what has been designed is actually meeting the need, and to ensure that other core library code is benefiting from the new APIs, which in turn makes those APIs more valuable, as their benefits accrue to more indirect consumers), including in multiple places in <code>SocketsHttpHandler<\/code>. 
Previously, <code>SocketsHttpHandler<\/code> had its own helpers for this purpose, an example of which I&#8217;ve copied here into this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _bytes = \"Strict-Transport-Security\"u8.ToArray();\r\n    private readonly string _chars = \"Strict-Transport-Security\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool Equals_OpenCoded() =&gt; EqualsOrdinalAsciiIgnoreCase(_chars, _bytes);\r\n\r\n    [Benchmark]\r\n    public bool Equals_Ascii() =&gt; Ascii.EqualsIgnoreCase(_chars, _bytes);\r\n\r\n    internal static bool EqualsOrdinalAsciiIgnoreCase(string left, ReadOnlySpan&lt;byte&gt; right)\r\n    {\r\n        if (left.Length != right.Length)\r\n            return false;\r\n\r\n        for (int i = 0; i &lt; left.Length; i++)\r\n        {\r\n            uint charA = left[i], charB = right[i];\r\n\r\n            if ((charA - 'a') &lt;= ('z' - 'a')) charA -= ('a' - 'A');\r\n            if ((charB - 'a') &lt;= ('z' - 'a')) charB -= ('a' - 'A');\r\n\r\n            if (charA != charB)\r\n                return false;\r\n        }\r\n\r\n        return true;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Equals_OpenCoded<\/td>\n<td style=\"text-align: right\">31.159 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Equals_Ascii<\/td>\n<td style=\"text-align: right\">3.985 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Many of these new <code>Ascii<\/code> APIs also got 
the <code>Vector512<\/code> treatment, such that they light up when AVX512 is supported by the current machine, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88532\">dotnet\/runtime#88532<\/a> from <a href=\"https:\/\/github.com\/anthonycanino\">@anthonycanino<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88650\">dotnet\/runtime#88650<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a>.<\/p>\n<h2>Base64<\/h2>\n<p>An even further constrained subset of text is Base64-encoded data. This is used when arbitrary bytes need to be transferred as text, and results in text that uses only 64 characters (lowercase ASCII letters, uppercase ASCII letters, ASCII digits, &#8216;+&#8217;, and &#8216;\/&#8217;). .NET has long had methods on <code>System.Convert<\/code> for encoding and decoding Base64 with UTF16 (<code>char<\/code>), and it got an additional set of span-based methods in .NET Core 2.1 with the introduction of <code>Span&lt;T&gt;<\/code>. At that point, the <code>System.Buffers.Text.Base64<\/code> class was also introduced, with dedicated surface area for encoding and decoding <code>Base64<\/code> with UTF8 (<code>byte<\/code>). That&#8217;s now improved further in .NET 8.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85938\">dotnet\/runtime#85938<\/a> from <a href=\"https:\/\/github.com\/heathbm\">@heathbm<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86396\">dotnet\/runtime#86396<\/a> make two contributions here. First, they bring the behavior of the <code>Base64.Decode<\/code> methods for UTF8 in line with their counterparts on the <code>Convert<\/code> class, in particular around handling of whitespace. 
As it&#8217;s very common for there to be newlines in Base64-encoded data, the <code>Convert<\/code> class&#8217; methods for decoding <code>Base64<\/code> permitted whitespace; in contrast, the <code>Base64<\/code> class&#8217; methods for decoding would fail if whitespace was encountered. The <code>Base64<\/code> class&#8217; decoding methods now permit exactly the same whitespace that <code>Convert<\/code> does. And that&#8217;s important in part because of the second contribution from these PRs, which is a new set of <code>Base64.IsValid<\/code> static methods. As with <code>Ascii.IsValid<\/code> and <code>Utf8.IsValid<\/code>, these methods simply state whether the supplied UTF8 or UTF16 input represents a valid <code>Base64<\/code> input, such that the decoding methods on both <code>Convert<\/code> and <code>Base64<\/code> could successfully decode it. And as with all such processing we see introduced into .NET, we&#8217;ve strived to make the new functionality as efficient as possible so that it can be used to maximal benefit elsewhere. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86221\">dotnet\/runtime#86221<\/a> from <a href=\"https:\/\/github.com\/WeihanLi\">@WeihanLi<\/a> updated the new <code>Base64StringAttribute<\/code> to use it, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86002\">dotnet\/runtime#86002<\/a> updated <code>PemEncoding.TryCountBase64<\/code> to use it. 
Here we can see a benchmark comparing the old non-vectorized <code>TryCountBase64<\/code> with the new version using the vectorized <code>Base64.IsValid<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers.Text;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _exampleFromPemEncodingTests =\r\n        \"MHQCAQEEICBZ7\/8T1JL2amvNB\/QShghtgZPtnPD4W+sAcHxA+hJsoAcGBSuBBAAK\\n\" +\r\n        \"oUQDQgAE3yNC5as8JVN5MjF95ofNSgRBVXjf0CKtYESWfPnmvT3n+cMMJUB9lUJf\\n\" +\r\n        \"dkFNgaSB7JlB+krZVVV8T7HZQXVDRA==\\n\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool Count_Old() =&gt; TryCountBase64_Old(_exampleFromPemEncodingTests, out _, out _, out _);\r\n\r\n    [Benchmark] \r\n    public bool Count_New() =&gt; TryCountBase64_New(_exampleFromPemEncodingTests, out _, out _, out _);\r\n\r\n    private static bool TryCountBase64_New(ReadOnlySpan&lt;char&gt; str, out int base64Start, out int base64End, out int base64DecodedSize)\r\n    {\r\n        int start = 0, end = str.Length - 1;\r\n        for (; start &lt; str.Length &amp;&amp; IsWhiteSpaceCharacter(str[start]); start++) ;\r\n        for (; end &gt; start &amp;&amp; IsWhiteSpaceCharacter(str[end]); end--) ;\r\n\r\n        if (Base64.IsValid(str.Slice(start, end + 1 - start), out base64DecodedSize))\r\n        {\r\n            base64Start = start;\r\n            base64End = end + 1;\r\n            return true;\r\n        }\r\n\r\n        base64Start = 0;\r\n        base64End = 0;\r\n        return false;\r\n    }\r\n\r\n    private static bool TryCountBase64_Old(ReadOnlySpan&lt;char&gt; str, out int base64Start, out int base64End, out int base64DecodedSize)\r\n    {\r\n        
base64Start = 0;\r\n        base64End = str.Length;\r\n\r\n        if (str.IsEmpty)\r\n        {\r\n            base64DecodedSize = 0;\r\n            return true;\r\n        }\r\n\r\n        int significantCharacters = 0;\r\n        int paddingCharacters = 0;\r\n\r\n        for (int i = 0; i &lt; str.Length; i++)\r\n        {\r\n            char ch = str[i];\r\n\r\n            if (IsWhiteSpaceCharacter(ch))\r\n            {\r\n                if (significantCharacters == 0) base64Start++;\r\n                else base64End--;\r\n                continue;\r\n            }\r\n\r\n            base64End = str.Length;\r\n\r\n            if (ch == '=') paddingCharacters++;\r\n            else if (paddingCharacters == 0 &amp;&amp; IsBase64Character(ch)) significantCharacters++;\r\n            else\r\n            {\r\n                base64DecodedSize = 0;\r\n                return false;\r\n            }\r\n        }\r\n\r\n        int totalChars = paddingCharacters + significantCharacters;\r\n\r\n        if (paddingCharacters &gt; 2 || (totalChars &amp; 0b11) != 0)\r\n        {\r\n            base64DecodedSize = 0;\r\n            return false;\r\n        }\r\n\r\n        base64DecodedSize = (totalChars &gt;&gt; 2) * 3 - paddingCharacters;\r\n        return true;\r\n    }\r\n\r\n    [MethodImpl(MethodImplOptions.AggressiveInlining)]\r\n    private static bool IsBase64Character(char ch) =&gt; char.IsAsciiLetterOrDigit(ch) || ch is '+' or '\/';\r\n\r\n    [MethodImpl(MethodImplOptions.AggressiveInlining)]\r\n    private static bool IsWhiteSpaceCharacter(char ch) =&gt; ch is ' ' or '\\t' or '\\n' or '\\r';\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count_Old<\/td>\n<td style=\"text-align: right\">356.37 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count_New<\/td>\n<td style=\"text-align: right\">33.72 
ns<\/td>\n<td style=\"text-align: right\">0.09<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Hex<\/h2>\n<p>Another relevant subset of ASCII is hexadecimal, and improvements have been made in .NET 8 around conversions between bytes and their representation in hex. In particular, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82521\">dotnet\/runtime#82521<\/a> vectorized the <code>Convert.FromHexString<\/code> method using an algorithm <a href=\"http:\/\/0x80.pl\/notesen\/2022-01-17-validating-hex-parse.html#algorithm-3-by-geoff-langdale\">outlined by Langdale and Mula<\/a>. On even a moderate length input, this has a very measurable impact on throughput:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private string _hex;\r\n\r\n    [Params(4, 16, 128)]\r\n    public int Length { get; set; }\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; _hex = Convert.ToHexString(RandomNumberGenerator.GetBytes(Length));\r\n\r\n    [Benchmark]\r\n    public byte[] ConvertFromHex() =&gt; Convert.FromHexString(_hex);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Length<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ConvertFromHex<\/td>\n<td>.NET 7.0<\/td>\n<td>4<\/td>\n<td style=\"text-align: right\">24.94 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConvertFromHex<\/td>\n<td>.NET 8.0<\/td>\n<td>4<\/td>\n<td style=\"text-align: right\">20.71 ns<\/td>\n<td style=\"text-align: right\">0.83<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ConvertFromHex<\/td>\n<td>.NET 7.0<\/td>\n<td>16<\/td>\n<td style=\"text-align: right\">57.66 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConvertFromHex<\/td>\n<td>.NET 8.0<\/td>\n<td>16<\/td>\n<td style=\"text-align: right\">17.29 ns<\/td>\n<td style=\"text-align: right\">0.30<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>ConvertFromHex<\/td>\n<td>.NET 7.0<\/td>\n<td>128<\/td>\n<td style=\"text-align: right\">337.41 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConvertFromHex<\/td>\n<td>.NET 8.0<\/td>\n<td>128<\/td>\n<td style=\"text-align: right\">56.72 ns<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, the improvements in .NET 8 go well beyond just the manipulation of certain known sets of characters; there is a wealth of other improvements to explore. Let&#8217;s start with <code>System.Text.CompositeFormat<\/code>, which was introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80753\">dotnet\/runtime#80753<\/a>.<\/p>\n<h2>String Formatting<\/h2>\n<p>Since the beginning of .NET, <code>string<\/code> and friends have provided APIs for handling composite format strings, strings with text interspersed with format item placeholders, e.g. <code>\"The current time is {0:t}\"<\/code>. These strings can then be passed to various APIs, like <code>string.Format<\/code>, which are provided with both the composite format string and the arguments that should be substituted in for the placeholders, e.g. 
<code>string.Format(\"The current time is {0:t}\", DateTime.Now)<\/code> will return a string like <code>\"The current time is 3:44 PM\"<\/code> (the <code>0<\/code> in the placeholder indicates the 0-based number of the argument to substitute, and the <code>t<\/code> is the format that should be used, in this case the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/standard\/base-types\/standard-date-and-time-format-strings\">standard short time pattern<\/a>). Such a method invocation needs to parse the composite format string each time it&#8217;s called, even though for a given call site the composite format string typically doesn&#8217;t change from invocation to invocation. These APIs are also generally non-generic, which means if an argument is a value type (as is <code>DateTime<\/code> in my example), it&#8217;ll incur a boxing allocation. To simplify the syntax around these operations, C# 6 gained support for string interpolation, such that instead of writing <code>string.Format(null, \"The current time is {0:t}\", DateTime.Now)<\/code>, you could instead write <code>$\"The current time is {DateTime.Now:t}\"<\/code>, and it was then up to the compiler to achieve the same behavior as if <code>string.Format<\/code> had been used (which the compiler typically achieved simply by lowering the interpolation into a call to <code>string.Format<\/code>).<\/p>\n<p>In .NET 6 and C# 10, string interpolation was <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/string-interpolation-in-c-10-and-net-6\/\">significantly improved<\/a>, both in terms of the scenarios supported and in terms of its efficiency. One key aspect of the efficiency is it enabled the parsing to be performed once (at compile-time). It also enabled avoiding all of the allocation associated with providing arguments. These improvements contributed to all use of string interpolation and a significant portion of the use of <code>string.Format<\/code> in real-world applications and services. 
However, the compiler support works by being able to see the string at compile time. What if the format string isn&#8217;t known until run-time, such as if it&#8217;s pulled from a <code>.resx<\/code> resource file or some other source of configuration? At that point, <code>string.Format<\/code> remains the answer.<\/p>\n<p>Now in .NET 8, there&#8217;s a new answer available: <code>CompositeFormat<\/code>. Just as an interpolated string allows the compiler to do the heavy lifting once in order to optimize repeated use, <code>CompositeFormat<\/code> allows that reusable work to be done once in order to optimize repeated use. As it does the parsing at run-time, it&#8217;s able to tackle the remaining cases that string interpolation can&#8217;t reach. To create an instance, one simply calls its <code>Parse<\/code> method, which takes a composite format string, parses it, and returns a <code>CompositeFormat<\/code> instance:<\/p>\n<pre><code class=\"language-C#\">private static readonly CompositeFormat s_currentTimeFormat = CompositeFormat.Parse(SR.CurrentTime);<\/code><\/pre>\n<p>Then, existing methods like <code>string.Format<\/code> now have new overloads, exactly the same as the existing ones, but instead of taking a <code>string format<\/code>, they take a <code>CompositeFormat format<\/code>. 
The same formatting as was done earlier can then instead be done like this:<\/p>\n<pre><code class=\"language-C#\">string result = string.Format(null, s_currentTimeFormat, DateTime.Now);<\/code><\/pre>\n<p>This overload (and other new overloads of methods like <code>StringBuilder.AppendFormat<\/code> and <code>MemoryExtensions.TryWrite<\/code>) accepts generic arguments, avoiding the boxing.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private static readonly CompositeFormat s_format = CompositeFormat.Parse(SR.CurrentTime);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public string FormatString() =&gt; string.Format(null, SR.CurrentTime, DateTime.Now);\r\n\r\n    [Benchmark]\r\n    public string FormatComposite() =&gt; string.Format(null, s_format, DateTime.Now);\r\n}\r\n\r\ninternal static class SR\r\n{\r\n    public static string CurrentTime =&gt; \/*load from resource file*\/\"The current time is {0:t}\";\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FormatString<\/td>\n<td style=\"text-align: right\">163.6 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">96 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FormatComposite<\/td>\n<td style=\"text-align: right\">146.5 ns<\/td>\n<td style=\"text-align: right\">0.90<\/td>\n<td style=\"text-align: right\">72 B<\/td>\n<td style=\"text-align: 
right\">0.75<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>If you know the composite format string at compile time, interpolated strings are the answer. Otherwise, <code>CompositeFormat<\/code> can give you throughput in the same ballpark at the expense of some startup costs. Formatting with a <code>CompositeFormat<\/code> is actually implemented with the same interpolated string handlers that are used for string interpolation, e.g. <code>string.Format(..., compositeFormat, ...)<\/code> ends up calling into methods on <code>DefaultInterpolatedStringHandler<\/code> to do the actual formatting work.<\/p>\n<p>There&#8217;s also a new analyzer to help with this. CA1863 &#8220;Use &#8216;CompositeFormat'&#8221; was introduced in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6675\">dotnet\/roslyn-analyzers#6675<\/a> to identify <code>string.Format<\/code> and <code>StringBuilder.AppendFormat<\/code> calls that could possibly benefit from switching to use a <code>CompositeFormat<\/code> argument instead.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1863.png\" alt=\"CA1863\" \/><\/p>\n<h2>Spans<\/h2>\n<p>Moving on from formatting, let&#8217;s turn our attention to all the other kinds of operations one frequently wants to perform on sequences of data, whether that be arrays, strings, or the unifying force of spans. A home for many routines for manipulating all of these, via spans, is the <code>System.MemoryExtensions<\/code> type, which has received a multitude of new APIs in .NET 8.<\/p>\n<p>One very common operation is to count how many of something there are. For example, in support of multiline comments, <code>System.Text.Json<\/code> needs to count how many line feed characters there are in a given piece of JSON. This is, of course, trivial to write as a loop, whether character-by-character or using <code>IndexOf<\/code> and slicing. 
Now in .NET 8, you can also just call the <code>Count<\/code> extension method, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80662\">dotnet\/runtime#80662<\/a> from <a href=\"https:\/\/github.com\/bollhals\">@bollhals<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82687\">dotnet\/runtime#82687<\/a> from <a href=\"https:\/\/github.com\/gfoidl\">@gfoidl<\/a>. Here we&#8217;re counting the number of line feed characters in <a href=\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\">&#8220;The Adventures of Sherlock Holmes&#8221;<\/a> from Project Gutenberg:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly byte[] s_utf8 = new HttpClient().GetByteArrayAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public int Count_ForeachLoop()\r\n    {\r\n        int count = 0;\r\n        foreach (byte c in s_utf8)\r\n        {\r\n            if (c == '\\n') count++;\r\n        }\r\n        return count;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int Count_IndexOf()\r\n    {\r\n        ReadOnlySpan&lt;byte&gt; remaining = s_utf8;\r\n        int count = 0;\r\n\r\n        int pos;\r\n        while ((pos = remaining.IndexOf((byte)'\\n')) &gt;= 0)\r\n        {\r\n            count++;\r\n            remaining = remaining.Slice(pos + 1);\r\n        }\r\n\r\n        return count;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int Count_Count() =&gt; s_utf8.AsSpan().Count((byte)'\\n');\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: 
right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count_ForeachLoop<\/td>\n<td style=\"text-align: right\">314.23 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count_IndexOf<\/td>\n<td style=\"text-align: right\">95.39 us<\/td>\n<td style=\"text-align: right\">0.30<\/td>\n<\/tr>\n<tr>\n<td>Count_Count<\/td>\n<td style=\"text-align: right\">13.68 us<\/td>\n<td style=\"text-align: right\">0.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The core of the implementation here that enables <code>MemoryExtensions.Count<\/code> to be so fast, in particular when searching for a single value, is based on just two key primitives: <code>PopCount<\/code> and <code>ExtractMostSignificantBits<\/code>. Here&#8217;s the <code>Vector128<\/code> loop that forms the bulk of the <code>Count<\/code> implementation (the implementation has similar loops for <code>Vector256<\/code> and <code>Vector512<\/code> as well):<\/p>\n<pre><code class=\"language-C#\">Vector128&lt;T&gt; targetVector = Vector128.Create(value);\r\nref T oneVectorAwayFromEnd = ref Unsafe.Subtract(ref end, Vector128&lt;T&gt;.Count);\r\ndo\r\n{\r\n    count += BitOperations.PopCount(Vector128.Equals(Vector128.LoadUnsafe(ref current), targetVector).ExtractMostSignificantBits());\r\n    current = ref Unsafe.Add(ref current, Vector128&lt;T&gt;.Count);\r\n}\r\nwhile (!Unsafe.IsAddressGreaterThan(ref current, ref oneVectorAwayFromEnd));<\/code><\/pre>\n<p>This is creating a vector where every element of the vector is the target (in this case, <code>'\\n'<\/code>). Then, as long as there&#8217;s at least one vector&#8217;s worth of data remaining, it loads the next vector (<code>Vector128.LoadUnsafe<\/code>) and compares that with the target vector (<code>Vector128.Equals<\/code>). That produces a new <code>Vector128&lt;T&gt;<\/code> where each <code>T<\/code> element is all ones when the values are equal and all zeros when they&#8217;re not. 
We then extract out the most significant bit of each element (<code>ExtractMostSignificantBits<\/code>), which yields a bit with the value <code>1<\/code> where the values were equal, otherwise <code>0<\/code>. And then we use <code>BitOperations.PopCount<\/code> on the resulting <code>uint<\/code> to get the &#8220;population count,&#8221; i.e. the number of bits that are <code>1<\/code>, and we add that to our running tally. In this way, the inner loop of the count operation remains branch-free, and the implementation can churn through the data very quickly. You can find several examples of using <code>Count<\/code> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81325\">dotnet\/runtime#81325<\/a>, which used it in several places in the core libraries.<\/p>\n<p>A similar new <code>MemoryExtensions<\/code> method is <code>Replace<\/code>, which comes in .NET 8 in two shapes. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76337\">dotnet\/runtime#76337<\/a> from <a href=\"https:\/\/github.com\/gfoidl\">@gfoidl<\/a> added an in-place variant:<\/p>\n<pre><code class=\"language-C#\">public static unsafe void Replace&lt;T&gt;(this Span&lt;T&gt; span, T oldValue, T newValue) where T : IEquatable&lt;T&gt;?;<\/code><\/pre>\n<p>and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83120\">dotnet\/runtime#83120<\/a> added a copying variant:<\/p>\n<pre><code class=\"language-C#\">public static unsafe void Replace&lt;T&gt;(this ReadOnlySpan&lt;T&gt; source, Span&lt;T&gt; destination, T oldValue, T newValue) where T : IEquatable&lt;T&gt;?;<\/code><\/pre>\n<p>As an example of where this comes in handy, <code>Uri<\/code> has some code paths that need to normalize directory separators to be <code>'\/'<\/code>, such that any <code>'\\\\'<\/code> characters need to be replaced. This previously used an <code>IndexOf<\/code> loop as was shown in the previous <code>Count<\/code> benchmark, and now it can just use <code>Replace<\/code>. 
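<\/p>\n<p>As a minimal sketch of how the two shapes are called (the paths and variable names below are made up for illustration, not taken from <code>Uri<\/code>):<\/p>\n<pre><code class=\"language-C#\">\/\/ Hypothetical input buffer.\r\nchar[] path = @\"server\\share\\dir\\file.txt\".ToCharArray();\r\n\r\n\/\/ In-place variant: every backslash in the span becomes a forward slash.\r\npath.AsSpan().Replace('\\\\', '\/');\r\nConsole.WriteLine(new string(path)); \/\/ server\/share\/dir\/file.txt\r\n\r\n\/\/ Copying variant: the source is left untouched and the result is written to a\r\n\/\/ destination, which needs to be at least as long as the source.\r\nReadOnlySpan&lt;char&gt; source = @\"a\\b\\c\";\r\nSpan&lt;char&gt; destination = stackalloc char[source.Length];\r\nsource.Replace(destination, '\\\\', '\/');\r\nConsole.WriteLine(destination.ToString()); \/\/ a\/b\/c<\/code><\/pre>\n<p>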
Here&#8217;s a comparison (which, purely for benchmarking purposes, is normalizing back and forth so that each time the benchmark runs it finds things in the original state):<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly char[] _uri = \"server\/somekindofpathneeding\/normalizationofitsslashes\".ToCharArray();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void Replace_ForLoop()\r\n    {\r\n        Replace(_uri, '\/', '\\\\');\r\n        Replace(_uri, '\\\\', '\/');\r\n\r\n        static void Replace(char[] chars, char from, char to)\r\n        {\r\n            for (int i = 0; i &lt; chars.Length; i++)\r\n            {\r\n                if (chars[i] == from)\r\n                {\r\n                    chars[i] = to;\r\n                }\r\n            }\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Replace_IndexOf()\r\n    {\r\n        Replace(_uri, '\/', '\\\\');\r\n        Replace(_uri, '\\\\', '\/');\r\n\r\n        static void Replace(char[] chars, char from, char to)\r\n        {\r\n            Span&lt;char&gt; remaining = chars;\r\n            int pos;\r\n            while ((pos = remaining.IndexOf(from)) &gt;= 0)\r\n            {\r\n                remaining[pos] = to;\r\n                remaining = remaining.Slice(pos + 1);\r\n            }\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Replace_Replace()\r\n    {\r\n        _uri.AsSpan().Replace('\/', '\\\\');\r\n        _uri.AsSpan().Replace('\\\\', '\/');\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: 
right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Replace_ForLoop<\/td>\n<td style=\"text-align: right\">40.28 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Replace_IndexOf<\/td>\n<td style=\"text-align: right\">29.26 ns<\/td>\n<td style=\"text-align: right\">0.73<\/td>\n<\/tr>\n<tr>\n<td>Replace_Replace<\/td>\n<td style=\"text-align: right\">18.88 ns<\/td>\n<td style=\"text-align: right\">0.47<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The new <code>Replace<\/code> does better than both the manual loop and the <code>IndexOf<\/code> loop. As with <code>Count<\/code>, <code>Replace<\/code> has a fairly simple and tight inner loop; again, here&#8217;s the <code>Vector128<\/code> variant of that loop:<\/p>\n<pre><code class=\"language-C#\">do\r\n{\r\n    original = Vector128.LoadUnsafe(ref src, idx);\r\n    mask = Vector128.Equals(oldValues, original);\r\n    result = Vector128.ConditionalSelect(mask, newValues, original);\r\n    result.StoreUnsafe(ref dst, idx);\r\n\r\n    idx += (uint)Vector128&lt;T&gt;.Count;\r\n}\r\nwhile (idx &lt; lastVectorIndex);<\/code><\/pre>\n<p>This is loading the next vector&#8217;s worth of data (<code>Vector128.LoadUnsafe<\/code>) and comparing that with a vector filled with the <code>oldValue<\/code>, which produces a new <code>mask<\/code> vector with <code>1<\/code>s for equality and <code>0<\/code>s for inequality. It then calls the super handy <code>Vector128.ConditionalSelect<\/code>. This is a branchless SIMD conditional operation: it produces a new vector that has an element from one vector if the mask&#8217;s bits were <code>1<\/code>s and from another vector if the mask&#8217;s bits were <code>0<\/code>s (think a ternary operator). That resulting vector is then saved out as the result. 
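<\/p>\n<p>To make that concrete, here&#8217;s a small standalone sketch of the same compare-and-select steps applied to a single hand-built vector. The input values are made up for illustration, and this is not the library&#8217;s actual code:<\/p>\n<pre><code class=\"language-C#\">using System.Runtime.Intrinsics;\r\n\r\n\/\/ Eight UTF-16 code units, i.e. one Vector128&lt;ushort&gt;'s worth: \"a\/b\/c\/d\/\".\r\nVector128&lt;ushort&gt; original = Vector128.Create((ushort)'a', '\/', 'b', '\/', 'c', '\/', 'd', '\/');\r\nVector128&lt;ushort&gt; oldValues = Vector128.Create((ushort)'\/');\r\nVector128&lt;ushort&gt; newValues = Vector128.Create((ushort)'\\\\');\r\n\r\n\/\/ All-ones elements where the input equals '\/', all-zeros elements elsewhere.\r\nVector128&lt;ushort&gt; mask = Vector128.Equals(oldValues, original);\r\n\r\n\/\/ Take bits from newValues where the mask is set, and from original where it isn't.\r\nVector128&lt;ushort&gt; result = Vector128.ConditionalSelect(mask, newValues, original);\r\n\r\nchar[] chars = new char[Vector128&lt;ushort&gt;.Count];\r\nfor (int i = 0; i &lt; chars.Length; i++)\r\n{\r\n    chars[i] = (char)result.GetElement(i);\r\n}\r\nConsole.WriteLine(new string(chars)); \/\/ a\\b\\c\\d\\<\/code><\/pre>\n<p>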
In this manner, it&#8217;s overwriting the whole span, in some cases just writing back the value that was previously there, and in cases where the original value was the target <code>oldValue<\/code>, writing out the <code>newValue<\/code> instead. This loop body is branch-free and doesn&#8217;t change in cost based on how many elements need to be replaced. In an extreme case where there&#8217;s nothing to be replaced, an <code>IndexOf<\/code>-based loop could end up being a tad bit faster, since the body of <code>IndexOf<\/code>&#8216;s inner loop has even fewer instructions, but such an <code>IndexOf<\/code> loop pays a relatively high cost for every replacement that needs to be done.<\/p>\n<p><code>StringBuilder<\/code> also had such an <code>IndexOf<\/code>-based implementation for its <code>Replace(char oldChar, char newChar)<\/code> and <code>Replace(char oldChar, char newChar, int startIndex, int count)<\/code> methods, and they&#8217;re now based on <code>MemoryExtensions.Replace<\/code>, so the improvements accrue there as well.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly StringBuilder _sb = new StringBuilder(\"http:\/\/server\\\\this\\\\is\\\\a\\\\test\\\\of\\\\needing\\\\to\\\\normalize\\\\directory\\\\separators\\\\\");\r\n\r\n    [Benchmark]\r\n    public void Replace()\r\n    {\r\n        _sb.Replace('\\\\', '\/');\r\n        _sb.Replace('\/', '\\\\');\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Replace<\/td>\n<td>.NET 
7.0<\/td>\n<td style=\"text-align: right\">150.47 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Replace<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">24.79 ns<\/td>\n<td style=\"text-align: right\">0.16<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Interestingly, whereas <code>StringBuilder.Replace(char, char)<\/code> was using <code>IndexOf<\/code> and switched to use <code>Replace<\/code>, <code>StringBuilder.Replace(string, string)<\/code> wasn&#8217;t using <code>IndexOf<\/code> at all, a gap that&#8217;s been fixed in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81098\">dotnet\/runtime#81098<\/a>. <code>IndexOf<\/code> when dealing with strings is more complicated in <code>StringBuilder<\/code> because of its segmented nature. <code>StringBuilder<\/code> isn&#8217;t just backed by an array: it&#8217;s actually a linked list of segments, each of which stores an array. With the <code>char<\/code>-based <code>Replace<\/code>, it can simply operate on each segment individually, but for the <code>string<\/code>-based <code>Replace<\/code>, it needs to deal with the possibility that the value being searched for crosses a segment boundary. <code>StringBuilder.Replace(string, string)<\/code> was thus walking each segment character-by-character, doing an equality check at each position. 
Now with this PR, it&#8217;s using <code>IndexOf<\/code> and only falling back to a character-by-character check when close enough to a segment boundary that it might be crossed.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly StringBuilder _sb = new StringBuilder()\r\n        .Append(\"Shall I compare thee to a summer's day? \")\r\n        .Append(\"Thou art more lovely and more temperate: \")\r\n        .Append(\"Rough winds do shake the darling buds of May, \")\r\n        .Append(\"And summer's lease hath all too short a date; \")\r\n        .Append(\"Sometime too hot the eye of heaven shines, \")\r\n        .Append(\"And often is his gold complexion dimm'd; \")\r\n        .Append(\"And every fair from fair sometime declines, \")\r\n        .Append(\"By chance or nature's changing course untrimm'd; \")\r\n        .Append(\"But thy eternal summer shall not fade, \")\r\n        .Append(\"Nor lose possession of that fair thou ow'st; \")\r\n        .Append(\"Nor shall death brag thou wander'st in his shade, \")\r\n        .Append(\"When in eternal lines to time thou grow'st: \")\r\n        .Append(\"So long as men can breathe or eyes can see, \")\r\n        .Append(\"So long lives this, and this gives life to thee.\");\r\n\r\n    [Benchmark]\r\n    public void Replace()\r\n    {\r\n        _sb.Replace(\"summer\", \"winter\");\r\n        _sb.Replace(\"winter\", \"summer\");\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Replace<\/td>\n<td>.NET 
7.0<\/td>\n<td style=\"text-align: right\">5,158.0 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Replace<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">476.4 ns<\/td>\n<td style=\"text-align: right\">0.09<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As long as we&#8217;re on the subject of <code>StringBuilder<\/code>, it saw some other nice improvements in .NET 8. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85894\">dotnet\/runtime#85894<\/a> from <a href=\"https:\/\/github.com\/yesmey\">@yesmey<\/a> tweaked both <code>StringBuilder.Append(string value)<\/code> and the JIT to enable the JIT to unroll the memory copies that occur as part of appending a constant string.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly StringBuilder _sb = new();\r\n\r\n    [Benchmark]\r\n    public void Append()\r\n    {\r\n        _sb.Clear();\r\n        _sb.Append(\"This is a test of appending a string to StringBuilder\");\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Append<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">7.597 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Append<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">3.756 ns<\/td>\n<td style=\"text-align: right\">0.49<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86287\">dotnet\/runtime#86287<\/a> from <a href=\"https:\/\/github.com\/yesmey\">@yesmey<\/a> 
changed <code>StringBuilder.Append(char value, int repeatCount)<\/code> to use <code>Span&lt;T&gt;.Fill<\/code> instead of manually looping, taking advantage of the optimized <code>Fill<\/code> implementation, even for reasonably small counts.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly StringBuilder _sb = new();\r\n\r\n    [Benchmark]\r\n    public void Append()\r\n    {\r\n        _sb.Clear();\r\n        _sb.Append('x', 8);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Append<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">11.520 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Append<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">5.292 ns<\/td>\n<td style=\"text-align: right\">0.46<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Back to <code>MemoryExtensions<\/code>, another new helpful method is <code>MemoryExtensions.Split<\/code> (and <code>MemoryExtensions.SplitAny<\/code>). This is a span-based counterpart to <code>string.Split<\/code> for <em>some<\/em> uses of <code>string.Split<\/code>. I say &#8220;some&#8221; because there are effectively two main patterns for using <code>string.Split<\/code>: when you expect a certain number of parts, and when there are an unknown number of parts. For example, if you want to parse a version string as would be used by <code>System.Version<\/code>, there are at most four parts (&#8220;major.minor.build.revision&#8221;). 
But if you want to split, say, the contents of a file into all of the lines in the file (delimited by a <code>\\n<\/code>), that&#8217;s an unknown (and potentially quite large) number of parts. The new <code>MemoryExtensions.Split<\/code> method is focused on the situations where there&#8217;s a known (and reasonably small) maximum number of parts expected. In such a case, it can be significantly more efficient than <code>string.Split<\/code>, especially from an allocation perspective.<\/p>\n<p><code>string.Split<\/code> has overloads that accept an <code>int count<\/code>, and <code>MemoryExtensions.Split<\/code> behaves identically to these overloads; however, rather than giving it an <code>int count<\/code>, you give it a <code>Span&lt;Range&gt; destination<\/code> whose length is the same value you would have used for <code>count<\/code>. For example, let&#8217;s say you want to split a key\/value pair separated by an <code>'='<\/code>. If this were <code>string.Split<\/code>, you could write that as:<\/p>\n<pre><code class=\"language-C#\">string[] parts = keyValuePair.Split('=');<\/code><\/pre>\n<p>Of course, if the input was actually erroneous for what you were expecting and there were 100 equal signs, you&#8217;d end up creating an array of 101 strings. So instead, you might write that as:<\/p>\n<pre><code class=\"language-C#\">string[] parts = keyValuePair.Split('=', 3);<\/code><\/pre>\n<p>Wait, &#8220;3&#8221;? Aren&#8217;t there only two parts, and if so, why not pass &#8220;2&#8221;? Because of the behavior of what happens with the last part. 
The last part contains the remainder of the string after the separator before it, so for example the call:<\/p>\n<pre><code class=\"language-C#\">\"shall=i=compare=thee\".Split(new[] { '=' }, 2)<\/code><\/pre>\n<p>produces the array:<\/p>\n<pre><code class=\"language-C#\">string[2] { \"shall\", \"i=compare=thee\" }<\/code><\/pre>\n<p>If you want to know whether there were more than two parts, you need to request at least one more, and then if that last one was produced, you know the input was erroneous. For example, this:<\/p>\n<pre><code class=\"language-C#\">\"shall=i=compare=thee\".Split(new[] { '=' }, 3)<\/code><\/pre>\n<p>produces this:<\/p>\n<pre><code class=\"language-C#\">string[3] { \"shall\", \"i\", \"compare=thee\" }<\/code><\/pre>\n<p>and this:<\/p>\n<pre><code class=\"language-C#\">\"shall=i\".Split(new[] { '=' }, 3)<\/code><\/pre>\n<p>produces this:<\/p>\n<pre><code class=\"language-C#\">string[2] { \"shall\", \"i\" }<\/code><\/pre>\n<p>We can do the same thing with the new overload, except a) the caller provides the destination span to write the results into, and b) the results are stored as a <code>System.Range<\/code> rather than as a <code>string<\/code>. That means that the whole operation is allocation-free. 
And thanks to the indexer on <code>Span&lt;T&gt;<\/code> that lets you pass in a <code>Range<\/code> and slice the span, you can easily use the written ranges to access the relevant portions of the input.<\/p>\n<pre><code class=\"language-C#\">Span&lt;Range&gt; parts = stackalloc Range[3];\r\nint count = keyValuePairSpan.Split(parts, '=');\r\nif (count == 2)\r\n{\r\n    Console.WriteLine($\"Key={keyValuePairSpan[parts[0]]}, Value={keyValuePairSpan[parts[1]]}\");\r\n}<\/code><\/pre>\n<p>Here&#8217;s an example from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80211\">dotnet\/runtime#80211<\/a>, which used <code>SplitAny<\/code> to reduce the cost of <code>MimeBasePart.DecodeEncoding<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly string _input = \"=?utf-8?B?RmlsZU5hbWVf55CG0Y3Qq9C60I5jw4TRicKq0YIM0Y1hSsSeTNCy0Klh?=\";\r\n    private static readonly char[] s_decodeEncodingSplitChars = new char[] { '?', '\\r', '\\n' };\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public Encoding Old()\r\n    {\r\n        if (string.IsNullOrEmpty(_input))\r\n        {\r\n            return null;\r\n        }\r\n\r\n        string[] subStrings = _input.Split(s_decodeEncodingSplitChars);\r\n        if (subStrings.Length &lt; 5 || \r\n            subStrings[0] != \"=\" || \r\n            subStrings[4] != \"=\")\r\n        {\r\n            return null;\r\n        }\r\n\r\n        string charSet = subStrings[1];\r\n        return Encoding.GetEncoding(charSet);\r\n    }\r\n\r\n    [Benchmark]\r\n    public Encoding New()\r\n    {\r\n        if (string.IsNullOrEmpty(_input))\r\n        {\r\n       
     return null;\r\n        }\r\n\r\n        ReadOnlySpan&lt;char&gt; valueSpan = _input;\r\n        Span&lt;Range&gt; subStrings = stackalloc Range[6];\r\n        if (valueSpan.SplitAny(subStrings, \"?\\r\\n\") &lt; 5 ||\r\n            valueSpan[subStrings[0]] is not \"=\" ||\r\n            valueSpan[subStrings[4]] is not \"=\")\r\n        {\r\n            return null;\r\n        }\r\n\r\n        return Encoding.GetEncoding(_input[subStrings[1]]);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right\">143.80 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">304 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right\">94.52 ns<\/td>\n<td style=\"text-align: right\">0.66<\/td>\n<td style=\"text-align: right\">32 B<\/td>\n<td style=\"text-align: right\">0.11<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>More examples of <code>MemoryExtensions.Split<\/code> and <code>MemoryExtensions.SplitAny<\/code> being used are in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80471\">dotnet\/runtime#80471<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82007\">dotnet\/runtime#82007<\/a>. 
Both of those remove allocations from various <code>System.Net<\/code> types that were previously using <code>string.Split<\/code>.<\/p>\n<p><code>MemoryExtensions<\/code> also includes a new set of <code>IndexOf<\/code> methods for ranges, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76803\">dotnet\/runtime#76803<\/a>:<\/p>\n<pre><code class=\"language-C#\">public static int IndexOfAnyInRange&lt;T&gt;(this ReadOnlySpan&lt;T&gt; span, T lowInclusive, T highInclusive) where T : IComparable&lt;T&gt;;\r\npublic static int IndexOfAnyExceptInRange&lt;T&gt;(this ReadOnlySpan&lt;T&gt; span, T lowInclusive, T highInclusive) where T : IComparable&lt;T&gt;;\r\npublic static int LastIndexOfAnyInRange&lt;T&gt;(this ReadOnlySpan&lt;T&gt; span, T lowInclusive, T highInclusive) where T : IComparable&lt;T&gt;;\r\npublic static int LastIndexOfAnyExceptInRange&lt;T&gt;(this ReadOnlySpan&lt;T&gt; span, T lowInclusive, T highInclusive) where T : IComparable&lt;T&gt;;<\/code><\/pre>\n<p>Want to find the index of the next ASCII digit? No problem:<\/p>\n<pre><code class=\"language-C#\">int pos = text.IndexOfAnyInRange('0', '9');<\/code><\/pre>\n<p>Want to determine whether some input contains any non-ASCII or control characters? You got it:<\/p>\n<pre><code class=\"language-C#\">bool nonAsciiOrControlCharacters = text.IndexOfAnyExceptInRange((char)0x20, (char)0x7e) &gt;= 0;<\/code><\/pre>\n<p>For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78658\">dotnet\/runtime#78658<\/a> uses <code>IndexOfAnyInRange<\/code> to quickly determine whether portions of a <code>Uri<\/code> might contain a bidirectional control character, searching for anything in the range <code>[\\u200E, \\u202E]<\/code>, and then only examining further if anything in that range is found. 
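<\/p>\n<p>A hedged sketch of that &#8220;cheap range check first, detailed examination only on a hit&#8221; pattern (the input string and the follow-up handling are hypothetical; this is not the actual <code>Uri<\/code> code):<\/p>\n<pre><code class=\"language-C#\">\/\/ Hypothetical input: \"abc\", then U+202E (RIGHT-TO-LEFT OVERRIDE), then \"def\".\r\nReadOnlySpan&lt;char&gt; host = \"abc\\u202Edef\";\r\n\r\n\/\/ A single pass looking for anything in the inclusive range [U+200E, U+202E].\r\nint pos = host.IndexOfAnyInRange('\\u200E', '\\u202E');\r\nif (pos &gt;= 0)\r\n{\r\n    \/\/ Only now would the slower, character-by-character examination be needed.\r\n    Console.WriteLine($\"Candidate at index {pos}: U+{(int)host[pos]:X4}\"); \/\/ Candidate at index 3: U+202E\r\n}<\/code><\/pre>\n<p>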
And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79357\">dotnet\/runtime#79357<\/a> uses <code>IndexOfAnyExceptInRange<\/code> to determine whether to use <code>Encoding.UTF8<\/code> or <code>Encoding.ASCII<\/code>. It was previously implemented with a simple <code>foreach<\/code> loop, and it&#8217;s now implemented with an even simpler call to <code>IndexOfAnyExceptInRange<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _text =\r\n        \"Shall I compare thee to a summer's day? \" +\r\n        \"Thou art more lovely and more temperate: \" +\r\n        \"Rough winds do shake the darling buds of May, \" +\r\n        \"And summer's lease hath all too short a date; \" +\r\n        \"Sometime too hot the eye of heaven shines, \" +\r\n        \"And often is his gold complexion dimm'd; \" +\r\n        \"And every fair from fair sometime declines, \" +\r\n        \"By chance or nature's changing course untrimm'd; \" +\r\n        \"But thy eternal summer shall not fade, \" +\r\n        \"Nor lose possession of that fair thou ow'st; \" +\r\n        \"Nor shall death brag thou wander'st in his shade, \" +\r\n        \"When in eternal lines to time thou grow'st: \" +\r\n        \"So long as men can breathe or eyes can see, \" +\r\n        \"So long lives this, and this gives life to thee.\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public Encoding Old()\r\n    {\r\n        foreach (char c in _text)\r\n            if (c &gt; 126 || c &lt; 32)\r\n                return Encoding.UTF8;\r\n\r\n        return Encoding.ASCII;\r\n    }\r\n\r\n    [Benchmark]\r\n    public Encoding New() =&gt;\r\n        
_text.AsSpan().IndexOfAnyExceptInRange((char)32, (char)126) &gt;= 0 ?\r\n            Encoding.UTF8 :\r\n            Encoding.ASCII;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right\">297.56 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right\">20.69 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>More of a productivity thing than performance (at least today), but .NET 8 also includes new <code>ContainsAny<\/code> methods (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87621\">dotnet\/runtime#87621<\/a>) that let you write these kinds of <code>IndexOf<\/code> calls, whose result is then compared against 0, in a slightly cleaner fashion, e.g. the previous example could have been simplified slightly to:<\/p>\n<pre><code class=\"language-C#\">public Encoding New() =&gt;\r\n    _text.AsSpan().ContainsAnyExceptInRange((char)32, (char)126) ?\r\n        Encoding.UTF8 :\r\n        Encoding.ASCII;<\/code><\/pre>\n<p>One of the things I love about these kinds of helpers is that code can simplify down to use them, and then as the helpers improve, so too does the code that relies on them. And in .NET 8, there&#8217;s a lot of &#8220;the helpers improve.&#8221;<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86655\">dotnet\/runtime#86655<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> added support for <code>Vector512<\/code> to most of these span-based helpers in <code>MemoryExtensions<\/code>. That means that when running on hardware which supports AVX512, many of these operations simply get faster. 
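<\/p>\n<p>If you&#8217;re curious which vector widths the runtime treats as accelerated on your own machine, the <code>IsHardwareAccelerated<\/code> properties report it. A minimal sketch (the output depends on the hardware and on any <code>DOTNET_Enable*<\/code> environment variables that are set, so none is shown here):<\/p>\n<pre><code class=\"language-C#\">using System.Runtime.Intrinsics;\r\n\r\n\/\/ Each property is true only when the runtime will use hardware instructions for that width.\r\nConsole.WriteLine($\"Vector128: {Vector128.IsHardwareAccelerated}\");\r\nConsole.WriteLine($\"Vector256: {Vector256.IsHardwareAccelerated}\");\r\nConsole.WriteLine($\"Vector512: {Vector512.IsHardwareAccelerated}\");<\/code><\/pre>\n<p>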
This benchmark uses environment variables to explicitly disable support for the various instruction sets, such that we can compare performance of a given operation when nothing is vectorized, when <code>Vector128<\/code> is used and hardware accelerated, when <code>Vector256<\/code> is used and hardware accelerated, and when <code>Vector512<\/code> is used and hardware accelerated. I&#8217;ve run this on my Dev Box that does support AVX512:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing BenchmarkDotNet.Toolchains.CoreRun;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\"Scalar\").WithEnvironmentVariable(\"DOTNET_EnableHWIntrinsic\", \"0\").AsBaseline())\r\n    .AddJob(Job.Default.WithId(\"Vector128\").WithEnvironmentVariable(\"DOTNET_EnableAVX512F\", \"0\").WithEnvironmentVariable(\"DOTNET_EnableAVX2\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\"Vector256\").WithEnvironmentVariable(\"DOTNET_EnableAVX512F\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\"Vector512\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\")]\r\npublic class Tests\r\n{\r\n    private readonly char[] _sourceChars = Enumerable.Repeat('a', 1024).ToArray();\r\n\r\n    [Benchmark]\r\n    public bool Contains() =&gt; _sourceChars.AsSpan().IndexOfAny('b', 'c') &gt;= 0;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Contains<\/td>\n<td>Scalar<\/td>\n<td style=\"text-align: right\">491.50 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Contains<\/td>\n<td>Vector128<\/td>\n<td 
style=\"text-align: right\">53.77 ns<\/td>\n<td style=\"text-align: right\">0.11<\/td>\n<\/tr>\n<tr>\n<td>Contains<\/td>\n<td>Vector256<\/td>\n<td style=\"text-align: right\">34.75 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<tr>\n<td>Contains<\/td>\n<td>Vector512<\/td>\n<td style=\"text-align: right\">21.12 ns<\/td>\n<td style=\"text-align: right\">0.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So, not <em>quite<\/em> a halving going from 128-bit to 256-bit or another halving going from 256-bit to 512-bit, but pretty close.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77947\">dotnet\/runtime#77947<\/a> vectorized <code>Equals(..., StringComparison.OrdinalIgnoreCase)<\/code> for large enough inputs (the same underlying implementation is used for both <code>string<\/code> and <code>ReadOnlySpan&lt;char&gt;<\/code>). In a loop, it loads the next two vectors. It then checks to see whether anything in those vectors is non-ASCII; it can do so efficiently by OR&#8217;ing them together (<code>vec1 | vec2<\/code>) and then seeing whether any bit above the low seven is set in any of the elements&#8230; if none are, then all the elements in both of the input vectors are ASCII (<code>((vec1 | vec2) &amp; Vector128.Create(unchecked((ushort)~0x007F))) == Vector128&lt;ushort&gt;.Zero<\/code>). If it finds anything non-ASCII, it just continues on with the old mode of comparison. But as long as everything is ASCII, then it can proceed to do the comparison in a vectorized manner. 
For each vector, it uses some bit hackery to create a lowercased version of the vector, and then compares the lowercased versions for equality.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _a = \"shall i compare thee to a summer's day? thou art more lovely and more temperate\";\r\n    private readonly string _b = \"SHALL I COMPARE THEE TO A SUMMER'S DAY? THOU ART MORE LOVELY AND MORE TEMPERATE\";\r\n\r\n    [Benchmark]\r\n    public bool Equals() =&gt; _a.AsSpan().Equals(_b, StringComparison.OrdinalIgnoreCase);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Equals<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">47.97 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Equals<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">18.93 ns<\/td>\n<td style=\"text-align: right\">0.39<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78262\">dotnet\/runtime#78262<\/a> uses the same tricks to vectorize <code>ToLowerInvariant<\/code> and <code>ToUpperInvariant<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _a = \"shall i compare thee to a summer's 
day? thou art more lovely and more temperate\";\r\n    private readonly char[] _b = new char[100];\r\n\r\n    [Benchmark]\r\n    public int ToUpperInvariant() =&gt; _a.AsSpan().ToUpperInvariant(_b);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ToUpperInvariant<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">33.22 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ToUpperInvariant<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">16.16 ns<\/td>\n<td style=\"text-align: right\">0.49<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78650\">dotnet\/runtime#78650<\/a> from <a href=\"https:\/\/github.com\/yesmey\">@yesmey<\/a> also streamlined <code>MemoryExtensions.Reverse<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _bytes = Enumerable.Range(0, 32).Select(i =&gt; (byte)i).ToArray();\r\n\r\n    [Benchmark]\r\n    public void Reverse() =&gt; _bytes.AsSpan().Reverse();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Reverse<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">3.801 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Reverse<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">2.052 ns<\/td>\n<td style=\"text-align: right\">0.54<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75640\">dotnet\/runtime#75640<\/a> improves the internal <code>RuntimeHelpers.IsBitwiseEquatable<\/code> method that&#8217;s used by the vast majority of <code>MemoryExtensions<\/code>. If you look in the source for <code>MemoryExtensions<\/code>, you&#8217;ll find a fairly common pattern: special-case <code>byte<\/code>, <code>ushort<\/code>, <code>uint<\/code>, and <code>ulong<\/code> with a vectorized implementation, and then fall back to a general non-vectorized implementation for everything else. Except it&#8217;s not exactly &#8220;special-case <code>byte<\/code>, <code>ushort<\/code>, <code>uint<\/code>, and <code>ulong<\/code>&#8221;, but rather &#8220;special-case bitwise-equatable types that are the same size as <code>byte<\/code>, <code>ushort<\/code>, <code>uint<\/code>, or <code>ulong<\/code>.&#8221; If something is &#8220;bitwise equatable,&#8221; that means we don&#8217;t need to worry about any <code>IEquatable&lt;T&gt;<\/code> implementation it might provide or any <code>Equals<\/code> override it might have, and we can instead simply rely on the value&#8217;s bits being the same or different from another value to identify whether the values are the same or different. And if such bitwise equality semantics apply for a type, then the intrinsics that determine equality for <code>byte<\/code>, <code>ushort<\/code>, <code>uint<\/code>, and <code>ulong<\/code> can be used for any type that&#8217;s 1, 2, 4, or 8 bytes, respectively. In .NET 7, <code>RuntimeHelpers.IsBitwiseEquatable<\/code> would be true only for a finite and hardcoded list in the runtime: <code>bool<\/code>, <code>byte<\/code>, <code>sbyte<\/code>, <code>char<\/code>, <code>short<\/code>, <code>ushort<\/code>, <code>int<\/code>, <code>uint<\/code>, <code>long<\/code>, <code>ulong<\/code>, <code>nint<\/code>, <code>nuint<\/code>, <code>Rune<\/code>, and <code>enum<\/code>s. 
Now in .NET 8, that list is extended to a dynamically discoverable set where the runtime can easily see that the type itself doesn&#8217;t provide any equality implementation.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private MyColor[] _values1, _values2;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _values1 = Enumerable.Range(0, 1_000).Select(i =&gt; new MyColor { R = (byte)i, G = (byte)i, B = (byte)i, A = (byte)i }).ToArray();\r\n        _values2 = (MyColor[])_values1.Clone();\r\n    }\r\n\r\n    [Benchmark] public int IndexOf() =&gt; Array.IndexOf(_values1, new MyColor { R = 1, G = 2, B = 3, A = 4 });\r\n\r\n    [Benchmark] public bool SequenceEquals() =&gt; _values1.AsSpan().SequenceEqual(_values2);\r\n\r\n    struct MyColor { public byte R, G, B, A; }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IndexOf<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">24,912.42 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<td style=\"text-align: right\">48000 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IndexOf<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">70.44 ns<\/td>\n<td style=\"text-align: right\">0.003<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>SequenceEquals<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">25,041.00 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<td style=\"text-align: right\">48000 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SequenceEquals<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">68.40 ns<\/td>\n<td style=\"text-align: right\">0.003<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note this not only means the result gets vectorized, it also ends up avoiding excessive boxing (hence all that allocation), as it&#8217;s no longer calling <code>Equals(object)<\/code> on each value type instance.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85437\">dotnet\/runtime#85437<\/a> improved the vectorization of <code>IndexOf(string\/span, StringComparison.OrdinalIgnoreCase)<\/code>. Imagine we&#8217;re searching some text for the word &#8220;elementary.&#8221; In .NET 7, it would end up doing an <code>IndexOfAny('E', 'e')<\/code> in order to find the first possible place &#8220;elementary&#8221; could match, and would then do the equivalent of an <code>Equals(\"elementary\", textAtFoundPosition, StringComparison.OrdinalIgnoreCase)<\/code>. If the <code>Equals<\/code> fails, then it loops around to search for the next possible starting location. This is ok if the characters being searched for are rare, but in this example, <code>'e'<\/code> is the most common letter in the English alphabet, and so an <code>IndexOfAny('E', 'e')<\/code> is frequently stopping, breaking out of the vectorized inner loop, in order to do the full <code>Equals<\/code> comparison. 
In contrast to this, in .NET 7 <code>IndexOf(string\/span, StringComparison.Ordinal)<\/code> was improved using the algorithm <a href=\"http:\/\/0x80.pl\/articles\/simd-strfind.html#algorithm-1-generic-simd\">outlined by Mula<\/a>; the idea there is that rather than just searching for one character (e.g. the first), you have a vector for another character as well (e.g. the last), you offset them appropriately, and you AND their comparison results together as part of the inner loop. Even if <code>'e'<\/code> is very common, <code>'e'<\/code> and then a <code>'y'<\/code> nine characters later is much, much less common, and thus it can stay in its tight inner loop for longer. Now in .NET 8, we apply the same trick to <code>OrdinalIgnoreCase<\/code> when we can find two ASCII characters in the input, e.g. it&#8217;ll simultaneously search for <code>'E'<\/code> or <code>'e'<\/code> followed by a <code>'Y'<\/code> or <code>'y'<\/code> nine characters later.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    private readonly string _needle = \"elementary\";\r\n\r\n    [Benchmark]\r\n    public int Count()\r\n    {\r\n        ReadOnlySpan&lt;char&gt; haystack = s_haystack;\r\n        ReadOnlySpan&lt;char&gt; needle = _needle;\r\n        int count = 0;\r\n\r\n        int pos;\r\n        while ((pos = haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase)) &gt;= 0)\r\n        {\r\n            count++;\r\n            haystack = haystack.Slice(pos + needle.Length);\r\n        }\r\n\r\n        return 
count;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">676.91 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">62.04 us<\/td>\n<td style=\"text-align: right\">0.09<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Even just a simple <code>IndexOf(char)<\/code> is also significantly improved in .NET 8. Here I&#8217;m searching &#8220;The Adventures of Sherlock Holmes&#8221; for an <code>'@'<\/code>, which I happen to know doesn&#8217;t appear, such that the entire search will be spent in <code>IndexOf(char)<\/code>&#8217;s tight inner loop.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    [Benchmark]\r\n    public int IndexOfAt() =&gt; s_haystack.AsSpan().IndexOf('@');\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IndexOfAt<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">32.17 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IndexOfAt<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">20.84 us<\/td>\n<td style=\"text-align: right\">0.64<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That improvement is 
thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78861\">dotnet\/runtime#78861<\/a>. The goal of SIMD and vectorization is to do more with the same; rather than processing one thing at a time, process 2 or 4 or 8 or 16 or 32 or 64 things at a time. For <code>char<\/code>s, which are 16 bits in size, in a 128-bit vector you can process 8 of them at a time; double that for 256-bit, and double it again for 512-bit. But it&#8217;s not just about the size of the vector; you can also find creative ways to use a vector to process more than you otherwise could. For example, in a 128-bit vector, you can process 8 <code>char<\/code>s at a time&#8230; but you can process 16 <code>byte<\/code>s at a time. What if you could process the <code>char<\/code>s instead as <code>byte<\/code>s? You could of course reinterpret the 8 <code>char<\/code>s as 16 <code>byte<\/code>s, but for most algorithms you&#8217;d end up with the wrong answer (since each <code>byte<\/code> of the <code>char<\/code> would be treated independently). What if instead you could condense two vectors&#8217; worth of <code>char<\/code>s down to a single vector of <code>byte<\/code>, and then do the subsequent processing on that single vector of <code>byte<\/code>? Then as long as you were doing a few instructions-worth of processing on the <code>byte<\/code> vector and the cost of that condensing was cheap enough, you could approach doubling your algorithm&#8217;s performance. And that&#8217;s exactly what this PR does, at least for very common needles, and on hardware that supports SSE2. SSE2 has dedicated instructions for taking two vectors and narrowing them down to a single vector, e.g. take a <code>Vector128&lt;short&gt; a<\/code> and a <code>Vector128&lt;short&gt; b<\/code>, and combine them into a <code>Vector128&lt;byte&gt; c<\/code> by taking the low <code>byte<\/code> from each <code>short<\/code> in the input. 
However, these particular instructions don&#8217;t simply ignore the other <code>byte<\/code> in each <code>short<\/code> completely; instead, they &#8220;saturate.&#8221; That means if casting the <code>short<\/code> value to a <code>byte<\/code> would overflow, it produces 255, and if it would underflow, it produces 0. That means we can take two vectors of 16-bit values, pack them into a single vector of 8-bit values, and then as long as the thing we&#8217;re searching for is in the range [1, 254], we can be sure that equality checks against the vector will be accurate (comparisons against 0 or 255 might lead to false positives). Note that while Arm does have support for similar &#8220;narrowing with saturation,&#8221; the cost of those particular instructions was measured to be high enough that it wasn&#8217;t feasible to use them here (they are used elsewhere). This improvement applies to several other <code>char<\/code>-based methods as well, including <code>IndexOfAny(char, char)<\/code> and <code>IndexOfAny(char, char, char)<\/code>.<\/p>\n<p>One last <code>Span<\/code>-centric improvement to highlight. The <code>Memory&lt;T&gt;<\/code> and <code>ReadOnlyMemory&lt;T&gt;<\/code> types don&#8217;t implement <code>IEnumerable&lt;T&gt;<\/code>, but the <code>MemoryMarshal.ToEnumerable<\/code> method does exist to enable getting an enumerable from them. It&#8217;s buried away in <code>MemoryMarshal<\/code> primarily so as to guide developers not to iterate through the <code>Memory&lt;T&gt;<\/code> directly, but to instead iterate through its <code>Span<\/code>, e.g.<\/p>\n<pre><code class=\"language-C#\">foreach (T value in memory.Span) { ... 
}<\/code><\/pre>\n<p>The driving force behind this is that the <code>Memory&lt;T&gt;.Span<\/code> property has some overhead, as a <code>Memory&lt;T&gt;<\/code> can be backed by multiple different object types (namely a <code>T[]<\/code>, a <code>string<\/code> if it&#8217;s a <code>ReadOnlyMemory&lt;char&gt;<\/code>, or a <code>MemoryManager&lt;T&gt;<\/code>), and <code>Span<\/code> needs to fetch a <code>Span&lt;T&gt;<\/code> for the right one. Even so, from time to time you do actually need an <code>IEnumerable&lt;T&gt;<\/code> from a <code>{ReadOnly}Memory&lt;T&gt;<\/code>, and <code>ToEnumerable<\/code> provides that. In such situations, it&#8217;s actually beneficial from a performance perspective that one doesn&#8217;t just pass the <code>{ReadOnly}Memory&lt;T&gt;<\/code> as an <code>IEnumerable&lt;T&gt;<\/code>, since doing so would box the value, and then enumerating that enumerable would require a second allocation for the <code>IEnumerator&lt;T&gt;<\/code>. In contrast, <code>MemoryMarshal.ToEnumerable<\/code> can return an <code>IEnumerable&lt;T&gt;<\/code> instance that is both the <code>IEnumerable&lt;T&gt;<\/code> and the <code>IEnumerator&lt;T&gt;<\/code>. In fact, that&#8217;s what it&#8217;s done since it was added, with the entirety of the implementation being:<\/p>\n<pre><code class=\"language-C#\">public static IEnumerable&lt;T&gt; ToEnumerable&lt;T&gt;(ReadOnlyMemory&lt;T&gt; memory)\r\n{\r\n    for (int i = 0; i &lt; memory.Length; i++)\r\n        yield return memory.Span[i];\r\n}<\/code><\/pre>\n<p>The C# compiler generates an <code>IEnumerable&lt;T&gt;<\/code> for such an iterator that does in fact also implement <code>IEnumerator&lt;T&gt;<\/code> and return itself from <code>GetEnumerator<\/code> to avoid an extra allocation, so that&#8217;s good. As noted, though, <code>Memory&lt;T&gt;.Span<\/code> has some overhead, and this is accessing <code>.Span<\/code> once per element&#8230; not ideal. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89274\">dotnet\/runtime#89274<\/a> addresses this in multiple ways. First, <code>ToEnumerable<\/code> itself can check the type of the underlying object behind the <code>Memory&lt;T&gt;<\/code>, and for a <code>T[]<\/code> or a <code>string<\/code> can return a different iterator that just directly indexes into the array or string rather than going through <code>.Span<\/code> on every access. Moreover, <code>ToEnumerable<\/code> can check to see whether the bounds represented by the <code>Memory&lt;T&gt;<\/code> are for the full length of the array or string&#8230; if they are, then <code>ToEnumerable<\/code> can just return the original object, without any additional allocation. The net result is a much more efficient enumeration scheme for anything other than a <code>MemoryManager&lt;T&gt;<\/code>, which is much more rare (but also not negatively impacted by the improvements for the other types).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\nusing System.Runtime.InteropServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly Memory&lt;char&gt; _array = Enumerable.Repeat('a', 1000).ToArray();\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; Count(MemoryMarshal.ToEnumerable&lt;char&gt;(_array));\r\n\r\n    [Benchmark]\r\n    public int CountLINQ() =&gt; Enumerable.Count(MemoryMarshal.ToEnumerable&lt;char&gt;(_array));\r\n\r\n    private static int Count&lt;T&gt;(IEnumerable&lt;T&gt; source)\r\n    {\r\n        int count = 0;\r\n        foreach (T item in source) count++;\r\n        return count;\r\n    }\r\n\r\n    private sealed class WrapperMemoryManager&lt;T&gt;(Memory&lt;T&gt; memory) : 
MemoryManager&lt;T&gt;\r\n    {\r\n        public override Span&lt;T&gt; GetSpan() =&gt; memory.Span;\r\n        public override MemoryHandle Pin(int elementIndex = 0) =&gt; throw new NotSupportedException();\r\n        public override void Unpin() =&gt; throw new NotSupportedException();\r\n        protected override void Dispose(bool disposing) { }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">6,336.147 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1,323.376 ns<\/td>\n<td style=\"text-align: right\">0.21<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>CountLINQ<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">4,972.580 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>CountLINQ<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">9.200 ns<\/td>\n<td style=\"text-align: right\">0.002<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>SearchValues<\/h2>\n<p>As should be obvious from the length of this document, there are a sheer ton of performance-focused improvements in .NET 8. As I previously noted, I think the most valuable addition in .NET 8 is enabling dynamic PGO by default. After that, I think the next most exciting addition is the new <code>System.Buffers.SearchValues<\/code> type. It is simply awesome, in my humble opinion.<\/p>\n<p>Functionally, <code>SearchValues<\/code> doesn&#8217;t do anything you couldn&#8217;t already do. For example, let&#8217;s say you wanted to search for the next ASCII letter or digit in text. 
You can already do that via <code>IndexOfAny<\/code>:<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;char&gt; text = ...;\r\nint pos = text.IndexOfAny(\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789\");<\/code><\/pre>\n<p>And that works, but it hasn&#8217;t been particularly fast. In .NET 7, <code>IndexOfAny(ReadOnlySpan&lt;char&gt;)<\/code> is optimized for searching for up to 5 target characters, e.g. it could efficiently vectorize a search for English vowels (<code>IndexOfAny(\"aeiou\")<\/code>). But with a target set of 62 characters like in the previous example, it would no longer vectorize, and instead of trying to see how many characters it could process per instruction, switches to trying to see how few instructions it can employ per character (meaning we&#8217;re no longer talking about fractions of an instruction per character in the haystack and now talking about multiple instructions per character in the haystack). It does this via a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bloom_filter\">Bloom filter<\/a>, referred to in the implementation as a &#8220;probabilistic map.&#8221; The idea is to maintain a bitmap of 256 bits. For every needle character, it sets 2 bits in that bitmap. Then when searching the haystack, for each character it looks to see whether both bits are set in the bitmap; if at least one isn&#8217;t set, then this character can&#8217;t be in the needle and the search can continue, but if both bits are in the bitmap, then it&#8217;s likely but not confirmed that the haystack character is in the needle, and the needle is then searched for the character to see whether we&#8217;ve found a match.<\/p>\n<p>There are actually known algorithms for doing these searches more efficiently. 
For example, the <a href=\"http:\/\/0x80.pl\/articles\/simd-byte-lookup.html#universal-algorithm\">&#8220;Universal&#8221; algorithm described by Mula<\/a> is a great choice when searching for an arbitrary set of ASCII characters, enabling us to efficiently vectorize a search for a needle composed of any subset of ASCII. Doing so requires some amount of computation to analyze the needle and build up the relevant bitmaps and vectors that are required for performing the search, just as we have to do so for the Bloom filter (albeit generating different artifacts). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76740\">dotnet\/runtime#76740<\/a> implements these techniques in <code>{Last}IndexOfAny{Except}<\/code>. Rather than always building up a probabilistic map, it first examines the needle to see if all of the values are ASCII, and if they are, then it switches over to this optimized ASCII-based search; if they&#8217;re not, it falls back to the same probabilistic map approach used previously. 
The PR also recognizes that it&#8217;s only worth attempting either optimization under the right conditions; if the haystack is really short, for example, we&#8217;re better off just doing the naive <code>O(M*N)<\/code> search, where for every character in the haystack we search through the needle to see if the <code>char<\/code> is a target.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    [Benchmark]\r\n    public int CountEnglishVowelsAndSometimeVowels()\r\n    {\r\n        ReadOnlySpan&lt;char&gt; remaining = s_haystack;\r\n        int count = 0, pos;\r\n        while ((pos = remaining.IndexOfAny(\"aeiouyAEIOUY\")) &gt;= 0)\r\n        {\r\n            count++;\r\n            remaining = remaining.Slice(pos + 1);\r\n        }\r\n\r\n        return count;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountEnglishVowelsAndSometimeVowels<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">6.823 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CountEnglishVowelsAndSometimeVowels<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">3.735 ms<\/td>\n<td style=\"text-align: right\">0.55<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Even with those improvements, this work of building up these vectors is quite repetitive, and it&#8217;s not free. 
If you have such an <code>IndexOfAny<\/code> in a loop, you&#8217;re paying to build up those vectors over and over and over again. There&#8217;s also additional work we could do to further examine the data to choose an even more optimal approach, but every additional check performed comes at the cost of more overhead for the <code>IndexOfAny<\/code> call. This is where <code>SearchValues<\/code> comes in. The idea behind <code>SearchValues<\/code> is to perform all this work once and then cache it. Almost invariably, the pattern for using a <code>SearchValues<\/code> is to create one, store it in a <code>static readonly<\/code> field, and then use that <code>SearchValues<\/code> for all searching operations for that target set. And there are now overloads of methods like <code>IndexOfAny<\/code> that take a <code>SearchValues&lt;char&gt;<\/code> or <code>SearchValues&lt;byte&gt;<\/code>, for example, instead of a <code>ReadOnlySpan&lt;char&gt;<\/code> or <code>ReadOnlySpan&lt;byte&gt;<\/code>, respectively. Thus, my previous ASCII letter or digit example would instead look like this:<\/p>\n<pre><code class=\"language-C#\">private static readonly SearchValues&lt;char&gt; s_asciiLettersOrDigits = SearchValues.Create(\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789\");\r\n...\r\nint pos = text.IndexOfAny(s_asciiLettersOrDigits);<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78093\">dotnet\/runtime#78093<\/a> provided the initial implementation of <code>SearchValues<\/code> (it was originally named <code>IndexOfAnyValues<\/code>, but we renamed it subsequently to the more general <code>SearchValues<\/code> so that we can use it now and in the future with other methods, like <code>Count<\/code> or <code>Replace<\/code>). 
If you peruse the implementation, you&#8217;ll see that the <code>Create<\/code> factory methods don&#8217;t just return a concrete <code>SearchValues&lt;T&gt;<\/code> type; rather, <code>SearchValues&lt;T&gt;<\/code> provides an internal abstraction that&#8217;s then implemented by more than fifteen derived implementations, each specialized for a different scenario. You can see this fairly easily in code by running the following program:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -f net8.0\r\n\r\nusing System.Buffers;\r\n\r\nConsole.WriteLine(SearchValues.Create(\"\"));\r\nConsole.WriteLine(SearchValues.Create(\"a\"));\r\nConsole.WriteLine(SearchValues.Create(\"ac\"));\r\nConsole.WriteLine(SearchValues.Create(\"ace\"));\r\nConsole.WriteLine(SearchValues.Create(\"ab\\u05D0\\u05D1\"));\r\nConsole.WriteLine(SearchValues.Create(\"abc\\u05D0\\u05D1\"));\r\nConsole.WriteLine(SearchValues.Create(\"abcdefghijklmnopqrstuvwxyz\"));\r\nConsole.WriteLine(SearchValues.Create(\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789\"));\r\nConsole.WriteLine(SearchValues.Create(\"\\u00A3\\u00A5\\u00A7\\u00A9\\u00AB\\u00AD\"));\r\nConsole.WriteLine(SearchValues.Create(\"abc\\u05D0\\u05D1\\u05D2\"));<\/code><\/pre>\n<p>and you&#8217;ll see output like the following:<\/p>\n<pre><code 
class=\"language-text\">System.Buffers.EmptySearchValues`1[System.Char]\r\nSystem.Buffers.SingleCharSearchValues`1[System.Buffers.SearchValues+TrueConst]\r\nSystem.Buffers.Any2CharSearchValues`1[System.Buffers.SearchValues+TrueConst]\r\nSystem.Buffers.Any3CharSearchValues`1[System.Buffers.SearchValues+TrueConst]\r\nSystem.Buffers.Any4SearchValues`2[System.Char,System.Int16]\r\nSystem.Buffers.Any5SearchValues`2[System.Char,System.Int16]\r\nSystem.Buffers.RangeCharSearchValues`1[System.Buffers.SearchValues+TrueConst]\r\nSystem.Buffers.AsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]\r\nSystem.Buffers.ProbabilisticCharSearchValues\r\nSystem.Buffers.ProbabilisticWithAsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]<\/code><\/pre>\n<p>highlighting that each of these different inputs ends up getting mapped to a different <code>SearchValues&lt;T&gt;<\/code>-derived type.<\/p>\n<p>After that initial PR, <code>SearchValues<\/code> has been successively improved and refined. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78863\">dotnet\/runtime#78863<\/a>, for example, added AVX2 support, such that with 256-bit vectors being employed (when available) instead of 128-bit vectors, some benchmarks close to doubled in throughput, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83122\">dotnet\/runtime#83122<\/a> enabled WASM support. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78996\">dotnet\/runtime#78996<\/a> added a <code>Contains<\/code> method to be used when implementing scalar fallback paths. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86046\">dotnet\/runtime#86046<\/a> reduced the overhead of calling <code>IndexOfAny<\/code> with a <code>SearchValues<\/code> simply by tweaking how the relevant bitmaps and vectors are internally passed around. 
But two of my favorite tweaks are <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82866\">dotnet\/runtime#82866<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84184\">dotnet\/runtime#84184<\/a>, which reduce overhead when &#8216;\\0&#8217; (null) is one of the characters in the needle. Why would this matter? Surely searching for &#8216;\\0&#8217; can&#8217;t be so common? Interestingly, in a variety of scenarios it can be. Imagine you have an algorithm that&#8217;s really good at searching for any subset of ASCII, but you want to use it to search for either a specific subset of ASCII <em>or<\/em> something non-ASCII. If you just search for the subset, you won&#8217;t learn about non-ASCII hits. And if you search for everything other than the subset, you&#8217;ll learn about non-ASCII hits but also all the wrong ASCII characters. Instead what you want to do is invert the ASCII subset, e.g. if your target characters are &#8216;A&#8217; through &#8216;Z&#8217; and &#8216;a&#8217; through &#8216;z&#8217;, you instead create the subset including &#8216;\\u0000&#8217; through &#8216;\\u0040&#8217;, &#8216;\\u005B&#8217; through &#8216;\\u0060&#8217;, and &#8216;\\u007B&#8217; through &#8216;\\u007F&#8217;. Then, rather than doing an <code>IndexOfAny<\/code> with that inverted subset, you instead do <code>IndexOfAnyExcept<\/code> with that inverted subset; this is a true case of &#8220;two wrongs make a right,&#8221; as we&#8217;ll end up with our desired behavior of searching for the original subset of ASCII letters plus anything non-ASCII. 
And as you&#8217;ll note, &#8216;\\0&#8217; is in our inverted subset, making the performance when &#8216;\\0&#8217; is in there more important than it otherwise would be.<\/p>\n<p>Interestingly, the probabilistic map code path in .NET 8 actually also enjoys some amount of vectorization, even without <code>SearchValues<\/code>, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80963\">dotnet\/runtime#80963<\/a> (it was also further improved in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85189\">dotnet\/runtime#85189<\/a> that used better instructions on Arm, and in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85203\">dotnet\/runtime#85203<\/a> that avoided some wasted work). That means that whether or not <code>SearchValues<\/code> is used, searches involving probabilistic map get much faster than in .NET 7. For example, here&#8217;s a benchmark that again searches &#8220;The Adventures of Sherlock Holmes&#8221; and counts the number of line endings in it, using the same needle that <code>string.ReplaceLineEndings<\/code> uses:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    [Benchmark]\r\n    public int CountLineEndings()\r\n    {\r\n        int count = 0;\r\n        ReadOnlySpan&lt;char&gt; haystack = s_haystack;\r\n\r\n        int pos;\r\n        while ((pos = haystack.IndexOfAny(\"\\n\\r\\f\\u0085\\u2028\\u2029\")) &gt;= 0)\r\n        {\r\n            count++;\r\n            haystack = haystack.Slice(pos + 1);\r\n        }\r\n\r\n        return count;\r\n    
}\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountLineEndings<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2.155 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CountLineEndings<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.323 ms<\/td>\n<td style=\"text-align: right\">0.61<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>SearchValues<\/code> can then be used to improve upon that. It does so not only by caching the probabilistic map that each call to <code>IndexOfAny<\/code> above needs to recompute, but also by recognizing that when a needle contains ASCII, that&#8217;s a good indication (heuristically) that ASCII haystacks will be prominent. As such, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89155\">dotnet\/runtime#89155<\/a> adds a fast path that performs a search for either any of the ASCII needle values or any non-ASCII value, and if it finds a non-ASCII value, then it falls back to performing the vectorized probabilistic map search. 
<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n    private static readonly SearchValues&lt;char&gt; s_lineEndings = SearchValues.Create(\"\\n\\r\\f\\u0085\\u2028\\u2029\");\r\n\r\n    [Benchmark]\r\n    public int CountLineEndings_Chars()\r\n    {\r\n        int count = 0;\r\n        ReadOnlySpan&lt;char&gt; haystack = s_haystack;\r\n\r\n        int pos;\r\n        while ((pos = haystack.IndexOfAny(\"\\n\\r\\f\\u0085\\u2028\\u2029\")) &gt;= 0)\r\n        {\r\n            count++;\r\n            haystack = haystack.Slice(pos + 1);\r\n        }\r\n\r\n        return count;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int CountLineEndings_SearchValues()\r\n    {\r\n        int count = 0;\r\n        ReadOnlySpan&lt;char&gt; haystack = s_haystack;\r\n\r\n        int pos;\r\n        while ((pos = haystack.IndexOfAny(s_lineEndings)) &gt;= 0)\r\n        {\r\n            count++;\r\n            haystack = haystack.Slice(pos + 1);\r\n        }\r\n\r\n        return count;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountLineEndings_Chars<\/td>\n<td style=\"text-align: right\">1,300.3 us<\/td>\n<\/tr>\n<tr>\n<td>CountLineEndings_SearchValues<\/td>\n<td style=\"text-align: right\">430.9 us<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89224\">dotnet\/runtime#89224<\/a> further augments that heuristic by guarding that ASCII fast path behind a quick check to see if 
the very next character is non-ASCII, skipping the ASCII-based search if it is and thereby avoiding the overhead when dealing with an all non-ASCII input. For example, here&#8217;s the result of running the previous benchmark, with the exact same code, except changing the URL to be <code>https:\/\/www.gutenberg.org\/files\/39963\/39963-0.txt<\/code>, which is an almost entirely Greek document containing Aristotle&#8217;s &#8220;The Constitution of the Athenians&#8221;:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountLineEndings_Chars<\/td>\n<td style=\"text-align: right\">542.6 us<\/td>\n<\/tr>\n<tr>\n<td>CountLineEndings_SearchValues<\/td>\n<td style=\"text-align: right\">283.6 us<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>With all of that goodness imbued in <code>SearchValues<\/code>, it&#8217;s now being used extensively throughout <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>. For example, <code>System.Text.Json<\/code> previously had its own dedicated implementation of a function <code>IndexOfQuoteOrAnyControlOrBackSlash<\/code> that it used to search for any character with an ordinal value less than 32, or a quote, or a backslash. That implementation in .NET 7 was <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/release\/7.0\/src\/libraries\/System.Text.Json\/src\/System\/Text\/Json\/Reader\/JsonReaderHelper.cs#L62-L239\">~200 lines of complicated <code>Vector&lt;T&gt;<\/code>-based code<\/a>. 
Now in .NET 8 thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82789\">dotnet\/runtime#82789<\/a>, it&#8217;s simply this:<\/p>\n<pre><code class=\"language-C#\">[MethodImpl(MethodImplOptions.AggressiveInlining)]\r\npublic static int IndexOfQuoteOrAnyControlOrBackSlash(this ReadOnlySpan&lt;byte&gt; span) =&gt;\r\n    span.IndexOfAny(s_controlQuoteBackslash);\r\n\r\nprivate static readonly SearchValues&lt;byte&gt; s_controlQuoteBackslash = SearchValues.Create(\r\n    \"\\u0000\\u0001\\u0002\\u0003\\u0004\\u0005\\u0006\\u0007\\u0008\\u0009\\u000A\\u000B\\u000C\\u000D\\u000E\\u000F\\u0010\\u0011\\u0012\\u0013\\u0014\\u0015\\u0016\\u0017\\u0018\\u0019\\u001A\\u001B\\u001C\\u001D\\u001E\\u001F\"u8 + \/\/ Any Control, &lt; 32 (' ')\r\n    \"\\\"\"u8 + \/\/ Quote\r\n    \"\\\\\"u8); \/\/ Backslash<\/code><\/pre>\n<p>Such use was rolled out in a bunch of PRs, for example <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78664\">dotnet\/runtime#78664<\/a> that used <code>SearchValues<\/code> in <code>System.Private.Xml<\/code>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81976\">dotnet\/runtime#81976<\/a> in <code>JsonSerializer<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78676\">dotnet\/runtime#78676<\/a> in <code>X500NameEncoder<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78667\">dotnet\/runtime#78667<\/a> in <code>Regex.Escape<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79025\">dotnet\/runtime#79025<\/a> in <code>ZipFile<\/code> and <code>TarFile<\/code>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79974\">dotnet\/runtime#79974<\/a> in <code>WebSocket<\/code>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81486\">dotnet\/runtime#81486<\/a> in <code>System.Net.Mail<\/code>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78896\">dotnet\/runtime#78896<\/a> in <code>Cookie<\/code>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78666\">dotnet\/runtime#78666<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79024\">dotnet\/runtime#79024<\/a> in <code>Uri<\/code> are particularly nice, including optimizing the commonly-used <code>Uri.EscapeDataString<\/code> helper with <code>SearchValues<\/code>; this shows up as a sizable improvement, especially when there&#8217;s nothing to escape.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private string _value = Convert.ToBase64String(\"How did I escape? With difficulty. How did I plan this moment? With pleasure. \"u8);\r\n\r\n    [Benchmark]\r\n    public string EscapeDataString() =&gt; Uri.EscapeDataString(_value);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>EscapeDataString<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">85.468 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>EscapeDataString<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">8.443 ns<\/td>\n<td style=\"text-align: right\">0.10<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>All in all, just in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>, <code>SearchValues.Create<\/code> is now used in more than 40 places, and that&#8217;s not including all the uses that get generated as part of <code>Regex<\/code> (more on that in a bit). 
This is helped along by <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6898\">dotnet\/roslyn-analyzers#6898<\/a>, which adds a new analyzer that will flag opportunities for <code>SearchValues<\/code> and update the code to use it:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1870.png\" alt=\"CA1870\" \/><\/p>\n<p>Throughout this discussion, I&#8217;ve mentioned <code>ReplaceLineEndings<\/code> several times, using it as an example of the kind of thing that wants to efficiently search for multiple characters. After <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78678\">dotnet\/runtime#78678<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81630\">dotnet\/runtime#81630<\/a>, it now also uses <code>SearchValues<\/code> and has been enhanced with other optimizations. Given the discussion of <code>SearchValues<\/code>, it&#8217;ll be obvious how it&#8217;s employed here, at least the basics of it. Previously, <code>ReplaceLineEndings<\/code> relied on an internal helper <code>IndexOfNewlineChar<\/code> which did this:<\/p>\n<pre><code class=\"language-C#\">internal static int IndexOfNewlineChar(ReadOnlySpan&lt;char&gt; text, out int stride)\r\n{\r\n    const string Needles = \"\\r\\n\\f\\u0085\\u2028\\u2029\";\r\n    int idx = text.IndexOfAny(Needles);\r\n    ...\r\n}<\/code><\/pre>\n<p>Now, it does:<\/p>\n<pre><code class=\"language-C#\">int idx = text.IndexOfAny(SearchValuesStorage.NewLineChars);<\/code><\/pre>\n<p>where that <code>NewLineChars<\/code> is just:<\/p>\n<pre><code class=\"language-C#\">internal static class SearchValuesStorage\r\n{\r\n    public static readonly SearchValues&lt;char&gt; NewLineChars = SearchValues.Create(\"\\r\\n\\f\\u0085\\u2028\\u2029\");\r\n}<\/code><\/pre>\n<p>Straightforward. However, it takes things a bit further. Note that there are 6 characters in that list, some of which are ASCII, some of which aren&#8217;t. 
Knowing the algorithms <code>SearchValues<\/code> currently employs, we know that this will knock it off the path of just doing an ASCII search, and it&#8217;ll instead use the algorithm that does a search for one of the 3 ASCII characters plus anything non-ASCII, and if it finds anything non-ASCII, will then fall back to doing the probabilistic map search. If we could remove just one of those characters, we&#8217;d be back into the range of just being able to use the <code>IndexOfAny<\/code> implementation that can work with any 5 characters. On non-Windows systems, we&#8217;re in luck. <code>ReplaceLineEndings<\/code> by default replaces a line ending with <code>Environment.NewLine<\/code>; on Windows, that&#8217;s <code>\"\\r\\n\"<\/code>, but on Linux and macOS, that&#8217;s <code>\"\\n\"<\/code>. If the replacement text is <code>\"\\n\"<\/code> (which can also be opted into on Windows by using the <code>ReplaceLineEndings(string replacementText)<\/code> overload), then searching for <code>'\\n'<\/code> only to replace it with <code>'\\n'<\/code> is a nop, which means we can remove <code>'\\n'<\/code> from the search list when the replacement text is <code>\"\\n\"<\/code>, bringing us down to only 5 target characters, and giving us a little edge. And while that&#8217;s a nice little gain, the bigger gain is that we won&#8217;t end up breaking out of the vectorized loop as frequently, or at all if all of the line endings are the replacement text. Further, the .NET 7 implementation was always creating a new string to return, but we can avoid allocating it if we didn&#8217;t actually replace anything with anything new. 
All of this yields huge improvements to <code>ReplaceLineEndings<\/code>, some due to <code>SearchValues<\/code> and some beyond.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    \/\/ NOTE: This text uses \\r\\n as its line endings\r\n    private static readonly string s_text = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    [Benchmark]\r\n    [Arguments(\"\\r\\n\")]\r\n    [Arguments(\"\\n\")]\r\n    public string ReplaceLineEndings(string replacement) =&gt; s_text.ReplaceLineEndings(replacement);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>replacement<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ReplaceLineEndings<\/td>\n<td>.NET 7.0<\/td>\n<td>\\n<\/td>\n<td style=\"text-align: right\">2,746.3 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1163121 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ReplaceLineEndings<\/td>\n<td>.NET 8.0<\/td>\n<td>\\n<\/td>\n<td style=\"text-align: right\">995.9 us<\/td>\n<td style=\"text-align: right\">0.36<\/td>\n<td style=\"text-align: right\">1163121 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<\/tr>\n<tr>\n<td>ReplaceLineEndings<\/td>\n<td>.NET 7.0<\/td>\n<td>\\r\\n<\/td>\n<td style=\"text-align: right\">2,920.1 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1187729 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ReplaceLineEndings<\/td>\n<td>.NET 8.0<\/td>\n<td>\\r\\n<\/td>\n<td style=\"text-align: right\">356.5 us<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>SearchValues<\/code> changes also accrue to the span-based non-allocating <code>EnumerateLines<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_text = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    [Benchmark]\r\n    public int CountLines()\r\n    {\r\n        int count = 0;\r\n        foreach (ReadOnlySpan&lt;char&gt; _ in s_text.AsSpan().EnumerateLines()) count++;\r\n        return count;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CountLines<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2,029.9 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CountLines<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">353.2 us<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Regex<\/h2>\n<p>Having just examined 
<code>SearchValues<\/code>, it&#8217;s a good time to talk about <code>Regex<\/code>, as the former now plays an integral role in the latter. <code>Regex<\/code> was significantly improved in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regex-performance-improvements-in-net-5\/\">.NET 5<\/a>, and then again was overhauled for <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regular-expression-improvements-in-dotnet-7\/\">.NET 7<\/a>, which saw the introduction of the regex source generator. Now in .NET 8, <code>Regex<\/code> continues to receive significant investment, with this release in particular taking advantage of much of the work already discussed that was introduced lower in the stack to enable more efficient searching.<\/p>\n<p>As a reminder, there are effectively three different &#8220;engines&#8221; within <code>System.Text.RegularExpressions<\/code>, meaning effectively three different components for actually processing a regex. The simplest engine is the &#8220;interpreter&#8221;; the <code>Regex<\/code> constructor translates the regular expression into a series of <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d1fc57ea18ee90aee8690697caed2b9f162409eb\/src\/libraries\/System.Text.RegularExpressions\/src\/System\/Text\/RegularExpressions\/RegexOpcode.cs#L23\">regex opcodes<\/a> which the <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d1fc57ea18ee90aee8690697caed2b9f162409eb\/src\/libraries\/System.Text.RegularExpressions\/src\/System\/Text\/RegularExpressions\/RegexInterpreter.cs#L32\">RegexInterpreter<\/a> then evaluates against the incoming text. 
This is done in a &#8220;scan&#8221; loop, which (simplified) looks like this:<\/p>\n<pre><code class=\"language-C#\">while (TryFindNextStartingPosition(text))\r\n{\r\n    if (TryMatchAtCurrentPosition(text) || _currentPosition == text.Length) break;\r\n    _currentPosition++;\r\n}<\/code><\/pre>\n<p><code>TryFindNextStartingPosition<\/code> tries to move through as much of the input text as possible until it finds a position in the input that could feasibly start a match, and then <code>TryMatchAtCurrentPosition<\/code> evaluates the pattern at that position against the input. That evaluation in the interpreter involves a loop like this, processing the opcodes that were produced from the pattern:<\/p>\n<pre><code class=\"language-C#\">while (true)\r\n{\r\n    switch (_opcode)\r\n    {\r\n        case RegexOpcode.Stop:\r\n            return match.FoundMatch;\r\n        case RegexOpcode.Goto:\r\n            Goto(Operand(0));\r\n            continue;\r\n        ... \/\/ cases for ~50 other opcodes\r\n    }\r\n}<\/code><\/pre>\n<p>Then there&#8217;s the non-backtracking engine, which is what you get when you select the <code>RegexOptions.NonBacktracking<\/code> option introduced in .NET 7. This engine shares the same <code>TryFindNextStartingPosition<\/code> implementation as the interpreter, such that all of the optimizations involved in skipping through as much text as possible (ideally via vectorized <code>IndexOf<\/code> operations) accrue to both the interpreter and the non-backtracking engine. However, that&#8217;s where the similarities end. Rather than processing regex opcodes, the non-backtracking engine works by converting the regular expression pattern into a lazily-constructed deterministic finite automaton (DFA) or non-deterministic finite automaton (NFA), which it then uses to evaluate the input text. The key benefit of the non-backtracking engine is that it provides linear-time execution guarantees in the length of the input. 
For a lot more detail, please read <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regular-expression-improvements-in-dotnet-7\/\">Regular Expression Improvements in .NET 7<\/a>.<\/p>\n<p>The third engine actually comes in two forms: <code>RegexOptions.Compiled<\/code> and the regex source generator (introduced in .NET 7). Except for a few corner-cases, these are effectively the same as each other in terms of how they work. They both generate custom code specific to the input pattern provided, with the former generating IL at run-time and the latter generating C# (which is then compiled to IL by the C# compiler) at build-time. The structure of the resulting code, and 99% of the optimizations applied, are identical between them; in fact, in .NET 7, the <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d1fc57ea18ee90aee8690697caed2b9f162409eb\/src\/libraries\/System.Text.RegularExpressions\/src\/System\/Text\/RegularExpressions\/RegexCompiler.cs#L21\"><code>RegexCompiler<\/code><\/a> was completely rewritten to be a block-by-block translation of the C# code the regex source generator emits. For both, the actual emitted code is fully customized to the exact pattern supplied, with both trying to generate code that processes the regex as efficiently as possible, and with the source generator trying to do so by generating code that is as close as possible to what an expert .NET developer might write. 
That&#8217;s in large part because the source it generates is visible, even in Visual Studio live as you edit your pattern:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/GeneratedRegexInVisualStudio.png\" alt=\"GeneratedRegex in Visual Studio\" \/><\/p>\n<p>I mention all of this because there is ample opportunity throughout <code>Regex<\/code>, both in the <code>TryFindNextStartingPosition<\/code> used by the interpreter and non-backtracking engines and throughout the code generated by <code>RegexCompiler<\/code> and the regex source generator, to use APIs introduced to make searching faster. I&#8217;m looking at you, <code>IndexOf<\/code> and friends.<\/p>\n<p>As noted earlier, new <code>IndexOf<\/code> variants have been introduced in .NET 8 for searching for ranges, and as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76859\">dotnet\/runtime#76859<\/a>, <code>Regex<\/code> will now take full advantage of them in generated code. For example, consider <code>[GeneratedRegex(@\"[0-9]{5}\")]<\/code>, which might be used to search for a zip code in the United States. The regex source generator in .NET 7 would emit code for <code>TryFindNextStartingPosition<\/code> that contained this:<\/p>\n<pre><code class=\"language-C#\">\/\/ The pattern begins with '0' through '9'.\r\n\/\/ Find the next occurrence. If it can't be found, there's no match.\r\nReadOnlySpan&lt;char&gt; span = inputSpan.Slice(pos);\r\nfor (int i = 0; i &lt; span.Length - 4; i++)\r\n{\r\n    if (char.IsAsciiDigit(span[i]))\r\n    ...\r\n}<\/code><\/pre>\n<p>Now in .NET 8, that same attribute instead generates this:<\/p>\n<pre><code class=\"language-C#\">\/\/ The pattern begins with a character in the set [0-9].\r\n\/\/ Find the next occurrence. 
If it can't be found, there's no match.\r\nReadOnlySpan&lt;char&gt; span = inputSpan.Slice(pos);\r\nfor (int i = 0; i &lt; span.Length - 4; i++)\r\n{\r\n    int indexOfPos = span.Slice(i).IndexOfAnyInRange('0', '9');\r\n    ...\r\n}<\/code><\/pre>\n<p>That .NET 7 implementation is examining one character at a time, whereas the .NET 8 code is vectorizing the search via <code>IndexOfAnyInRange<\/code>, examining multiple characters at a time. This can lead to significant speedups.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    private readonly Regex _regex = new Regex(\"[0-9]{5}\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; _regex.Count(s_haystack);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">423.88 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">29.91 us<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The generated code can use these APIs in other places as well, even as part of validating the match itself. 
Let&#8217;s say your pattern was instead <code>[GeneratedRegex(@\"(\\w{3,})[0-9]\")]<\/code>, which is going to look for and capture a sequence of at least three word characters that is then followed by an ASCII digit. This is a standard greedy loop, so it&#8217;s going to consume as many word characters as it can (which includes ASCII digits), and will then backtrack, giving back some of the word characters consumed, until it can find a digit. Previously, that was implemented just by giving back a single character, seeing if it was a digit, giving back a single character, seeing if it was a digit, and so on. Now? The source generator emits code that includes this:<\/p>\n<pre><code class=\"language-C#\">charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOfAnyInRange('0', '9')<\/code><\/pre>\n<p>In other words, it&#8217;s using <code>LastIndexOfAnyInRange<\/code> to optimize that backwards search for the next viable backtracking location.<\/p>\n<p>Another significant improvement that builds on improvements lower in the stack is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85438\">dotnet\/runtime#85438<\/a>. As was previously covered, the vectorization of <code>span.IndexOf(\"...\", StringComparison.OrdinalIgnoreCase)<\/code> has been improved in .NET 8. Previously, <code>Regex<\/code> wasn&#8217;t utilizing this API, as it was often able to do better with its own custom-generated code. But now that the API has been optimized, this PR changes <code>Regex<\/code> to use it, making the generated code both simpler and faster. 
Here I&#8217;m searching case-insensitively for the whole word &#8220;year&#8221;:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    private readonly Regex _regex = new Regex(@\"\\byear\\b\", RegexOptions.Compiled | RegexOptions.IgnoreCase);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; _regex.Count(s_haystack);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">181.80 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">63.10 us<\/td>\n<td style=\"text-align: right\">0.35<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In addition to learning how to use the existing <code>IndexOf(..., StringComparison.OrdinalIgnoreCase)<\/code> and the new <code>IndexOfAnyInRange<\/code> and <code>IndexOfAnyExceptInRange<\/code>, <code>Regex<\/code> in .NET 8 also learns how to use the new <code>SearchValues&lt;char&gt;<\/code>. This is a big boost for <code>Regex<\/code>, as it now means that it can vectorize searches for many more sets than it previously could. For example, let&#8217;s say you wanted to search for all hex numbers. You might use a pattern like <code>[0123456789ABCDEFabcdef]+<\/code>. 
If you plug that into the regex source generator in .NET 7, you&#8217;ll get a <code>TryFindNextPossibleStartingPosition<\/code> emitted that contains code like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ The pattern begins with a character in the set [0-9A-Fa-f].\r\n\/\/ Find the next occurrence. If it can't be found, there's no match.\r\nReadOnlySpan&lt;char&gt; span = inputSpan.Slice(pos);\r\nfor (int i = 0; i &lt; span.Length; i++)\r\n{\r\n    if (char.IsAsciiHexDigit(span[i]))\r\n    {\r\n        base.runtextpos = pos + i;\r\n        return true;\r\n    }\r\n}<\/code><\/pre>\n<p>Now in .NET 8, thanks in large part to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78927\">dotnet\/runtime#78927<\/a>, you&#8217;ll instead get code like this:<\/p>\n<pre><code class=\"language-C#\">\/\/ The pattern begins with a character in the set [0-9A-Fa-f].\r\n\/\/ Find the next occurrence. If it can't be found, there's no match.\r\nint i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_asciiHexDigits);\r\nif (i &gt;= 0)\r\n{\r\n    base.runtextpos = pos + i;\r\n    return true;\r\n}<\/code><\/pre>\n<p>What is that <code>Utilities.s_asciiHexDigits<\/code>? It&#8217;s a <code>SearchValues&lt;char&gt;<\/code> emitted into the file&#8217;s <code>Utilities<\/code> class:<\/p>\n<pre><code class=\"language-C#\">\/\/\/ &lt;summary&gt;Supports searching for characters in or not in \"0123456789ABCDEFabcdef\".&lt;\/summary&gt;\r\ninternal static readonly SearchValues&lt;char&gt; s_asciiHexDigits = SearchValues.Create(\"0123456789ABCDEFabcdef\");<\/code><\/pre>\n<p>The source generator explicitly recognized this set and so created a nice name for it, but that&#8217;s purely about readability; it can still use <code>SearchValues&lt;char&gt;<\/code> even if it doesn&#8217;t recognize the set as something that&#8217;s well-known and easily nameable. 
For example, if I instead augment the set to be all valid hex digits and an underscore, I then get this:<\/p>\n<pre><code class=\"language-C#\">\/\/\/ &lt;summary&gt;Supports searching for characters in or not in \"0123456789ABCDEF_abcdef\".&lt;\/summary&gt;\r\ninternal static readonly SearchValues&lt;char&gt; s_ascii_FF037E0000807E000000 = SearchValues.Create(\"0123456789ABCDEF_abcdef\");<\/code><\/pre>\n<p>When initially added to <code>Regex<\/code>, <code>SearchValues&lt;char&gt;<\/code> was only used when the input set was all ASCII. But as <code>SearchValues&lt;char&gt;<\/code> improved over the development of .NET 8, so too did <code>Regex<\/code>&#8217;s use of it. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89205\">dotnet\/runtime#89205<\/a>, <code>Regex<\/code> now relies on <code>SearchValues<\/code>&#8217;s ability to efficiently search for both ASCII and non-ASCII, and will similarly emit a <code>SearchValues&lt;char&gt;<\/code> if it&#8217;s able to efficiently enumerate the contents of a set and that set contains a reasonably small number of characters (today, that means no more than 128). Interestingly, <code>SearchValues<\/code>&#8217;s optimization to first do a search for the ASCII subset of a target and then fall back to a vectorized probabilistic map search was first prototyped in <code>Regex<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89140\">dotnet\/runtime#89140<\/a>), after which we decided to push the optimization downwards into <code>SearchValues<\/code> so that <code>Regex<\/code> could generate simpler code and so that other non-<code>Regex<\/code> consumers would benefit.<\/p>\n<p>That still, however, leaves the cases where we can&#8217;t efficiently enumerate the set in order to determine every character it includes, nor would we want to pass a gigantic number of characters off to <code>SearchValues<\/code>. Consider the set <code>\\w<\/code>, i.e. 
&#8220;word characters.&#8221; Of the 65,536 <code>char<\/code> values, 50,409 match the set <code>\\w<\/code>. It would be inefficient to enumerate all of those characters in order to try to create a <code>SearchValues&lt;char&gt;<\/code> for them, and <code>Regex<\/code> doesn&#8217;t try. Instead, as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83992\">dotnet\/runtime#83992<\/a>, <code>Regex<\/code> employs a similar approach as noted above, but with a scalar fallback. For example, for the pattern <code>\\w+<\/code>, it emits the following helper into <code>Utilities<\/code>:<\/p>\n<pre><code class=\"language-C#\">internal static int IndexOfAnyWordChar(this ReadOnlySpan&lt;char&gt; span)\r\n{\r\n    int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptWordChars);\r\n    if ((uint)i &lt; (uint)span.Length)\r\n    {\r\n        if (char.IsAscii(span[i]))\r\n        {\r\n            return i;\r\n        }\r\n\r\n        do\r\n        {\r\n            if (Utilities.IsWordChar(span[i]))\r\n            {\r\n                return i;\r\n            }\r\n            i++;\r\n        }\r\n        while ((uint)i &lt; (uint)span.Length);\r\n    }\r\n\r\n    return -1;\r\n}\r\n\r\n\/\/\/ &lt;summary&gt;Supports searching for characters in or not in \"\\0\\u0001\\u0002\\u0003\\u0004\\u0005\\u0006\\a\\b\\t\\n\\v\\f\\r\\u000e\\u000f\\u0010\\u0011\\u0012\\u0013\\u0014\\u0015\\u0016\\u0017\\u0018\\u0019\\u001a\\u001b\\u001c\\u001d\\u001e\\u001f !\\\"#$%&amp;amp;'()*+,-.\/:;&amp;lt;=&amp;gt;?@[\\\\]^`{|}~\\u007f\".&lt;\/summary&gt;\r\ninternal static readonly SearchValues&lt;char&gt; s_asciiExceptWordChars = SearchValues.Create(\"\\0\\u0001\\u0002\\u0003\\u0004\\u0005\\u0006\\a\\b\\t\\n\\v\\f\\r\\u000e\\u000f\\u0010\\u0011\\u0012\\u0013\\u0014\\u0015\\u0016\\u0017\\u0018\\u0019\\u001a\\u001b\\u001c\\u001d\\u001e\\u001f !\\\"#$%&amp;'()*+,-.\/:;&lt;=&gt;?@[\\\\]^`{|}~\\u007f\");<\/code><\/pre>\n<p>The fact that it named the helper &#8220;IndexOfAnyWordChar&#8221; is, 
again, separate from the fact that it was able to generate this helper; it simply recognizes the set here as part of determining a name and was able to come up with a nicer one, but if it didn&#8217;t recognize it, the body of the method would be the same and the name would just be less readable, as it would come up with something fairly gibberish but unique.<\/p>\n<p>As an interesting aside, I noted that the source generator and <code>RegexCompiler<\/code> are effectively the same, just with one generating C# and one generating IL. That&#8217;s 99% correct. There is one interesting difference around their use of <code>SearchValues<\/code>, though, one which makes the source generator a bit more efficient in how it&#8217;s able to utilize the type. Any time the source generator needs a <code>SearchValues<\/code> instance for a new combination of characters, it can just emit another <code>static readonly<\/code> field for that instance, and because it&#8217;s <code>static readonly<\/code>, the JIT&#8217;s optimizations around devirtualization and inlining can kick in, with calls to use this seeing the actual type of the instance and optimizing based on that. Yay. <code>RegexCompiler<\/code> is a different story. <code>RegexCompiler<\/code> emits IL for a given <code>Regex<\/code>, and it does so using <code>DynamicMethod<\/code>; this provides the lightest-weight solution to reflection emit, also allowing the generated methods to be garbage collected when they&#8217;re no longer referenced. <code>DynamicMethod<\/code>s, however, are just that, methods. There&#8217;s no support for creating additional static fields on demand, without growing up into the much more expensive <code>TypeBuilder<\/code>-based solution. How then can <code>RegexCompiler<\/code> create and store an arbitrary number of <code>SearchValues<\/code> instances, and how can it do so in a way that similarly enables devirtualization? It employs a few tricks. 
First, a field was added to the internal <code>CompiledRegexRunner<\/code> type that stores the delegate to the generated method: <code>private readonly SearchValues&lt;char&gt;[]? _searchValues;<\/code> As an array, this enables any number of <code>SearchValues<\/code> to be stored; the emitted IL can access the field, grab the array, and index into it to grab the relevant <code>SearchValues&lt;char&gt;<\/code> instance. Just doing that, of course, would not allow for devirtualization, and even dynamic PGO doesn&#8217;t help here because currently <code>DynamicMethod<\/code>s don&#8217;t participate in tiering; compilation goes straight to tier 1, so there&#8217;d be no opportunity for instrumentation to see the actual <code>SearchValues&lt;char&gt;<\/code>-derived type employed. Thankfully, there are available solutions. The JIT can learn about the type of an instance from the type of a local in which it&#8217;s stored, so one solution is to create a local of the concrete and sealed <code>SearchValues&lt;char&gt;<\/code> derived type (we&#8217;re writing IL at this point, so we can do things like that without actually having access to the type in question), read the <code>SearchValues&lt;char&gt;<\/code> from the array, store it into the local, and then use the local for the subsequent access. And, in fact, we did that for a while during the .NET 8 development process. However, that does require a local, and requires an extra read\/write of that local. Instead, a tweak in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85954\">dotnet\/runtime#85954<\/a> allows the JIT to use the <code>T<\/code> in <code>Unsafe.As&lt;T&gt;(object o)<\/code> to learn about the actual type of <code>T<\/code>, and so <code>RegexCompiler<\/code> can just use <code>Unsafe.As<\/code> to inform the JIT as to the actual type of the instance such that it&#8217;s then devirtualized. 
The code <code>RegexCompiler<\/code> uses then to emit the IL to load a <code>SearchValues&lt;char&gt;<\/code> is this:<\/p>\n<pre><code class=\"language-C#\">\/\/ from RegexCompiler.cs, tweaked for readability in this post\r\nprivate void LoadSearchValues(ReadOnlySpan&lt;char&gt; chars)\r\n{\r\n    List&lt;SearchValues&lt;char&gt;&gt; list = _searchValues ??= new();\r\n    int index = list.Count;\r\n    list.Add(SearchValues.Create(chars));\r\n\r\n    \/\/ Unsafe.As&lt;DerivedSearchValues&gt;(Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(this._searchValues), index));\r\n    _ilg.Emit(OpCodes.Ldarg_0);\r\n    _ilg.Emit(OpCodes.Ldfld, s_searchValuesArrayField);\r\n    _ilg.Emit(OpCodes.Call, s_memoryMarshalGetArrayDataReferenceSearchValues);\r\n    _ilg.Emit(OpCodes.Ldc_I4, index * IntPtr.Size);\r\n    _ilg.Emit(OpCodes.Add);\r\n    _ilg.Emit(OpCodes.Ldind_Ref);\r\n    _ilg.Emit(OpCodes.Call, typeof(Unsafe).GetMethod(\"As\", new[] { typeof(object) })!.MakeGenericMethod(list[index].GetType()));\r\n}<\/code><\/pre>\n<p>We can see all of this in action with a benchmark like this:<\/p>\n<pre><code class=\"language-C#\">using BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    private static readonly Regex s_names = new Regex(\"Holmes|Watson|Lestrade|Hudson|Moriarty|Adler|Moran|Morstan|Gregson\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public int Count() =&gt; s_names.Count(s_haystack);\r\n}<\/code><\/pre>\n<p>Here we&#8217;re searching the same Sherlock Holmes text for the names of some of the most common characters in the detective stories. 
The regex pattern analyzer will try to find something for which it can vectorize a search, and it will look at all of the characters that can validly exist at each position in a match, e.g. all matches begin with &#8216;H&#8217;, &#8216;W&#8217;, &#8216;L&#8217;, &#8216;M&#8217;, &#8216;A&#8217;, or &#8216;G&#8217;. And since the shortest match is five letters (&#8220;Adler&#8221;), it&#8217;ll end up looking at the first five positions, coming up with these sets:<\/p>\n<pre><code class=\"language-text\">0: [AGHLMW]\r\n1: [adeoru]\r\n2: [delrst]\r\n3: [aegimst]\r\n4: [aenorst]<\/code><\/pre>\n<p>All of those sets have more than five characters in them, though, an important delineation as in .NET 7 that is the largest number of characters for which <code>IndexOfAny<\/code> will vectorize a search. Thus, in .NET 7, <code>Regex<\/code> ends up emitting code that walks the input checking character by character (though it does match the set using a fast branch-free bitmap mechanism):<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;char&gt; span = inputSpan.Slice(pos);\r\nfor (int i = 0; i &lt; span.Length - 4; i++)\r\n{\r\n    if (((long)((0x8318020000000000UL &lt;&lt; (int)(charMinusLow = (uint)span[i] - 'A')) &amp; (charMinusLow - 64)) &lt; 0) &amp;&amp;\r\n    ...<\/code><\/pre>\n<p>Now in .NET 8, with <code>SearchValues&lt;char&gt;<\/code> we <em>can<\/em> efficiently search for any of these sets, and the implementation ends up picking the one it thinks is statistically least likely to match:<\/p>\n<pre><code class=\"language-C#\">int indexOfPos = span.Slice(i).IndexOfAny(Utilities.s_ascii_8231800000000000);<\/code><\/pre>\n<p>where that <code>s_ascii_8231800000000000<\/code> is defined as:<\/p>\n<pre><code class=\"language-C#\">\/\/\/ &lt;summary&gt;Supports searching for characters in or not in \"AGHLMW\".&lt;\/summary&gt;\r\ninternal static readonly SearchValues&lt;char&gt; s_ascii_8231800000000000 = SearchValues.Create(\"AGHLMW\");<\/code><\/pre>\n<p>This 
leads the overall searching process to be much more efficient.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">630.5 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">142.3 us<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84370\">dotnet\/runtime#84370<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89099\">dotnet\/runtime#89099<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77925\">dotnet\/runtime#77925<\/a> have also contributed to how <code>IndexOf<\/code> and friends are used, tweaking the various heuristics involved. But there have been improvements to <code>Regex<\/code> as well outside of this realm.\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84003\">dotnet\/runtime#84003<\/a>, for example, streamlines the matching performance of <code>\\w<\/code> when matching against non-ASCII characters by using a bit-twiddling trick. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84843\">dotnet\/runtime#84843<\/a> changes the underlying type of an internal enum from <code>int<\/code> to <code>byte<\/code>, and in doing so ends up shrinking the size of the object containing a value of this enum by 8 bytes (in a 64-bit process). More impactful is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85564\">dotnet\/runtime#85564<\/a>, which makes a measurable improvement for <code>Regex.Replace<\/code>. 
<code>Replace<\/code> was maintaining a list of <code>ReadOnlyMemory&lt;char&gt;<\/code> segments to be composed back into the final string; some segments would come from the original <code>string<\/code>, while some would be the replacement <code>string<\/code>. As it turns out, though, the string reference contained in that <code>ReadOnlyMemory&lt;char&gt;<\/code> is unnecessary. We can instead just maintain a list of <code>ints<\/code>, where every time we add a segment we add to the list the <code>int offset<\/code> and the <code>int count<\/code>, and with the nature of replace, we can simply rely on the fact that we&#8217;ll need to insert the replacement text between every pair of values.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly string s_haystack = new HttpClient().GetStringAsync(\"https:\/\/www.gutenberg.org\/files\/1661\/1661-0.txt\").Result;\r\n\r\n    private static readonly Regex s_vowels = new Regex(\"[aeiou]\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public string RemoveVowels() =&gt; s_vowels.Replace(s_haystack, \"\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RemoveVowels<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">8.763 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>RemoveVowels<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">7.084 ms<\/td>\n<td style=\"text-align: right\">0.81<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One last improvement 
in <code>Regex<\/code> to highlight isn&#8217;t actually due to anything in <code>Regex<\/code> itself, but rather to a primitive <code>Regex<\/code> uses on every operation: <code>Interlocked.Exchange<\/code>. Consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.RegularExpressions;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly Regex s_r = new Regex(\"\", RegexOptions.Compiled);\r\n\r\n    [Benchmark]\r\n    public bool Overhead() =&gt; s_r.IsMatch(\"\");\r\n}<\/code><\/pre>\n<p>This is purely measuring the overhead of calling into a <code>Regex<\/code> instance; the matching routine will complete immediately as the pattern matches any input. Since we&#8217;re only talking about tens of nanoseconds, your numbers may vary here, but I routinely get results like this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Overhead<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">32.01 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Overhead<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">28.81 ns<\/td>\n<td style=\"text-align: right\">0.90<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That improvement of several nanoseconds is primarily due to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79181\">dotnet\/runtime#79181<\/a>, which made <code>Interlocked.CompareExchange<\/code> and <code>Interlocked.Exchange<\/code> for reference types into intrinsics, special-casing when the JIT can see that the new value to be written is <code>null<\/code>. 
These APIs need to employ a GC write barrier as part of writing the object reference into the shared location, for the reasons discussed earlier in this post, but when writing <code>null<\/code>, no such barrier is required. This benefits <code>Regex<\/code>, which uses <code>Interlocked.Exchange<\/code> as part of renting a <code>RegexRunner<\/code> to use to actually process the match. Each <code>Regex<\/code> instance caches a runner object, and every operation tries to rent and return it&#8230; that renting is done with <code>Interlocked.Exchange<\/code>:<\/p>\n<pre><code class=\"language-C#\">RegexRunner runner = Interlocked.Exchange(ref _runner, null) ?? CreateRunner();\r\ntry { ... }\r\nfinally { _runner = runner; }<\/code><\/pre>\n<p>Many object pool implementations employ a similar use of <code>Interlocked.Exchange<\/code> and will similarly benefit.<\/p>\n<h2>Hashing<\/h2>\n<p>The <code>System.IO.Hashing<\/code> library was introduced in .NET 6 to provide <em>non-cryptographic hash algorithm<\/em> implementations; initially, it shipped with four types: <code>Crc32<\/code>, <code>Crc64<\/code>, <code>XxHash32<\/code>, and <code>XxHash64<\/code>. In .NET 8, it gets significant investment, in adding new optimized algorithms, in improving the performance of existing implementations, and in adding new surface area across all of the algorithms.<\/p>\n<p>The xxHash family of hash algorithms has become quite popular of late due to its high performance on both large and small inputs and its overall level of quality (e.g. how few collisions are produced, how well inputs are dispersed, etc.). <code>System.IO.Hashing<\/code> previously included implementations of the older XXH32 and XXH64 algorithms (as <code>XxHash32<\/code> and <code>XxHash64<\/code>, respectively). 
Now in .NET 8, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76641\">dotnet\/runtime#76641<\/a>, it includes the XXH3 algorithm (as <code>XxHash3<\/code>), and thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77944\">dotnet\/runtime#77944<\/a> from <a href=\"https:\/\/github.com\/xoofx\">@xoofx<\/a>, it includes the XXH128 algorithm (as <code>XxHash128<\/code>). The <code>XxHash3<\/code> algorithm was also further optimized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77756\">dotnet\/runtime#77756<\/a> from <a href=\"https:\/\/github.com\/xoofx\">@xoofx<\/a> by amortizing the costs of some loads and stores, and in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77881\">dotnet\/runtime#77881<\/a> from <a href=\"https:\/\/github.com\/xoofx\">@xoofx<\/a>, which improved throughput on Arm by making better use of the <code>AdvSimd<\/code> hardware intrinsics.<\/p>\n<p>To see overall performance of these hash functions, here&#8217;s a microbenchmark comparing the throughput of the cryptographic SHA256 with each of these non-cryptographic hash functions. 
I&#8217;ve also included an implementation of FNV-1a, which is the hash algorithm that may be used by the C# compiler for <code>switch<\/code> statements (when it needs to <code>switch<\/code> over a string, for example, and it can&#8217;t come up with a better scheme, it hashes the input, and then does a binary search through the pregenerated hashes for each of the cases), as well as an implementation based on <code>System.HashCode<\/code> (noting that <code>HashCode<\/code> is different from the rest of these, in that it&#8217;s focused on enabling the hashing of arbitrary .NET types, and includes per-process randomization, whereas a goal of these other hash functions is to be 100% deterministic across process boundaries).<\/p>\n<pre><code class=\"language-C#\">\/\/ For this test, you'll also need to add:\r\n\/\/     &lt;PackageReference Include=\"System.IO.Hashing\" Version=\"8.0.0-rc.1.23419.4\" \/&gt;\r\n\/\/ to the benchmarks.csproj's &lt;ItemGroup&gt;.\r\n\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Buffers.Binary;\r\nusing System.IO.Hashing;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _result = new byte[100];\r\n    private byte[] _source;\r\n\r\n    [Params(3, 33_333)]\r\n    public int Length { get; set; }\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; _source = Enumerable.Range(0, Length).Select(i =&gt; (byte)i).ToArray();\r\n\r\n    \/\/ Cryptographic\r\n    [Benchmark(Baseline = true)] public void TestSHA256() =&gt; SHA256.HashData(_source, _result);\r\n\r\n    \/\/ Non-cryptographic\r\n    [Benchmark] public void TestCrc32() =&gt; Crc32.Hash(_source, _result);\r\n    [Benchmark] public void TestCrc64() =&gt; Crc64.Hash(_source, _result);\r\n    
[Benchmark] public void TestXxHash32() =&gt; XxHash32.Hash(_source, _result);\r\n    [Benchmark] public void TestXxHash64() =&gt; XxHash64.Hash(_source, _result);\r\n    [Benchmark] public void TestXxHash3() =&gt; XxHash3.Hash(_source, _result);\r\n    [Benchmark] public void TestXxHash128() =&gt; XxHash128.Hash(_source, _result);\r\n\r\n    \/\/ Algorithm used by the C# compiler for switch statements\r\n    [Benchmark]\r\n    public void TestFnv1a()\r\n    {\r\n        int hash = unchecked((int)2166136261);\r\n        foreach (byte b in _source) hash = (hash ^ b) * 16777619;\r\n        BinaryPrimitives.WriteInt32LittleEndian(_result, hash);\r\n    }\r\n\r\n    \/\/ Randomized with a custom seed per process\r\n    [Benchmark]\r\n    public void TestHashCode()\r\n    {\r\n        HashCode hc = default;\r\n        hc.AddBytes(_source);\r\n        BinaryPrimitives.WriteInt32LittleEndian(_result, hc.ToHashCode());\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Length<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TestSHA256<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">856.168 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>TestHashCode<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">9.933 ns<\/td>\n<td style=\"text-align: right\">0.012<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash64<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">7.724 ns<\/td>\n<td style=\"text-align: right\">0.009<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash128<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">5.522 ns<\/td>\n<td style=\"text-align: right\">0.006<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash32<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">5.457 ns<\/td>\n<td style=\"text-align: right\">0.006<\/td>\n<\/tr>\n<tr>\n<td>TestCrc32<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">3.954 ns<\/td>\n<td style=\"text-align: 
right\">0.005<\/td>\n<\/tr>\n<tr>\n<td>TestCrc64<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">3.405 ns<\/td>\n<td style=\"text-align: right\">0.004<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash3<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">3.343 ns<\/td>\n<td style=\"text-align: right\">0.004<\/td>\n<\/tr>\n<tr>\n<td>TestFnv1a<\/td>\n<td>3<\/td>\n<td style=\"text-align: right\">1.617 ns<\/td>\n<td style=\"text-align: right\">0.002<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>TestSHA256<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">60,407.625 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TestFnv1a<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">31,027.249 ns<\/td>\n<td style=\"text-align: right\">0.51<\/td>\n<\/tr>\n<tr>\n<td>TestHashCode<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">4,879.262 ns<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash32<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">4,444.116 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash64<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">3,636.989 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<\/tr>\n<tr>\n<td>TestCrc64<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">1,571.445 ns<\/td>\n<td style=\"text-align: right\">0.03<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash3<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">1,491.740 ns<\/td>\n<td style=\"text-align: right\">0.03<\/td>\n<\/tr>\n<tr>\n<td>TestXxHash128<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">1,474.551 ns<\/td>\n<td style=\"text-align: right\">0.02<\/td>\n<\/tr>\n<tr>\n<td>TestCrc32<\/td>\n<td>33333<\/td>\n<td style=\"text-align: right\">1,295.663 ns<\/td>\n<td style=\"text-align: right\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A key reason <code>XxHash3<\/code> and 
<code>XxHash128<\/code> do so much better than <code>XxHash32<\/code> and <code>XxHash64<\/code> is that their design is focused on being vectorizable. As such, the .NET implementations employ the support in <code>System.Runtime.Intrinsics<\/code> to take full advantage of the underlying hardware. This data also hints at why the C# compiler uses FNV-1a: it&#8217;s really simple and also really low overhead for small inputs, which are the most common form of input used in <code>switch<\/code> statements, but it would be a poor choice if you expected primarily longer inputs.<\/p>\n<p>You&#8217;ll note that in the previous example, <code>Crc32<\/code> and <code>Crc64<\/code> both end up in the same ballpark as <code>XxHash3<\/code> in terms of throughput (XXH3 generally ranks better than CRC32\/64 in terms of quality). CRC32 in that comparison benefits significantly from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83321\">dotnet\/runtime#83321<\/a> from <a href=\"https:\/\/github.com\/brantburnett\">@brantburnett<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86539\">dotnet\/runtime#86539<\/a> from <a href=\"https:\/\/github.com\/brantburnett\">@brantburnett<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85221\">dotnet\/runtime#85221<\/a> from <a href=\"https:\/\/github.com\/brantburnett\">@brantburnett<\/a>. These vectorize the <code>Crc32<\/code> and <code>Crc64<\/code> implementations, based on a decade-old paper from Intel titled &#8220;Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction.&#8221; The cited <code>PCLMULQDQ<\/code> instruction is an x86 carry-less multiplication instruction (a separate extension from SSE2 itself, exposed in .NET via the <code>Pclmulqdq<\/code> class, which derives from <code>Sse2<\/code>); however, the implementation is also able to vectorize on Arm by taking advantage of Arm&#8217;s <code>PMULL<\/code> instruction. 
The net result is huge gains over .NET 7, in particular for larger inputs being hashed.<\/p>\n<pre><code class=\"language-C#\">\/\/ For this test, you'll also need to add:\r\n\/\/     &lt;PackageReference Include=\"System.IO.Hashing\" Version=\"7.0.0\" \/&gt;\r\n\/\/ to the benchmarks.csproj's &lt;ItemGroup&gt;.\r\n\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Hashing;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet(\"System.IO.Hashing\", \"7.0.0\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet(\"System.IO.Hashing\", \"8.0.0-rc.1.23419.4\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _source = Enumerable.Range(0, 1024).Select(i =&gt; (byte)i).ToArray();\r\n    private readonly byte[] _destination = new byte[4];\r\n\r\n    [Benchmark]\r\n    public void Hash() =&gt; Crc32.Hash(_source, _destination);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Hash<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2,416.24 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Hash<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">39.01 ns<\/td>\n<td style=\"text-align: right\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Another change also further improves performance of some of these algorithms, but with a primary purpose of actually making them easier to use in a variety of scenarios. 
The original design of <code>NonCryptographicHashAlgorithm<\/code> was focused on creating non-cryptographic alternatives to the existing cryptographic algorithms folks were using, and thus the APIs are all focused on writing out the resulting digests, which are opaque bytes, e.g. CRC32 produces a 4-byte hash. However, especially for these non-cryptographic algorithms, many developers are more familiar with getting back a numerical result, e.g. CRC32 produces a <code>uint<\/code>. Same data, just a different representation. Interestingly, as well, some of these algorithms operate in terms of such integers, so getting back bytes actually requires a separate step, both ensuring some kind of storage location is available in which to write the resulting bytes and then extracting the result to that location. To address all of this, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78075\">dotnet\/runtime#78075<\/a> adds to all of the types in <code>System.IO.Hashing<\/code> new utility methods for producing such numbers. For example, <code>Crc32<\/code> has two new methods added to it:<\/p>\n<pre><code class=\"language-C#\">public static uint HashToUInt32(ReadOnlySpan&lt;byte&gt; source);\r\npublic uint GetCurrentHashAsUInt32();<\/code><\/pre>\n<p>If you just want the <code>uint<\/code>-based CRC32 hash for some input bytes, you can simply call this one-shot static method <code>HashToUInt32<\/code>. Or if you&#8217;re building up the hash incrementally, having created an instance of the <code>Crc32<\/code> type and having appended data to it, you can get the current <code>uint<\/code> hash via <code>GetCurrentHashAsUInt32<\/code>. 
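<\/p>\n<p>As a minimal usage sketch (the input bytes and variable names are illustrative assumptions, not taken from the PR), the one-shot and incremental forms can be used side by side, and they should produce the same value for the same bytes:<\/p>\n<pre><code class=\"language-C#\">\/\/ Assumes a reference to the System.IO.Hashing 8.0 package, as in the benchmarks in this section.\r\nusing System.IO.Hashing;\r\nusing System.Text;\r\n\r\nbyte[] data = Encoding.ASCII.GetBytes(\"123456789\"); \/\/ illustrative input\r\n\r\n\/\/ One-shot: hash a complete buffer straight to a uint.\r\nuint oneShot = Crc32.HashToUInt32(data);\r\n\r\n\/\/ Incremental: append the same bytes in two chunks, then read the current hash.\r\nvar crc = new Crc32();\r\ncrc.Append(data.AsSpan(0, 4));\r\ncrc.Append(data.AsSpan(4));\r\nuint incremental = crc.GetCurrentHashAsUInt32();\r\n\r\n\/\/ Print both values in hex so they can be compared by eye.\r\nConsole.WriteLine($\"{oneShot:X8} {incremental:X8}\");<\/code><\/pre>\n<p>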
This also shaves off a few instructions for an algorithm like <code>XxHash3<\/code> which actually needs to do more work to produce the result as bytes, only to then need to get those bytes back as a <code>ulong<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ For this test, you'll also need to add:\r\n\/\/     &lt;PackageReference Include=\"System.IO.Hashing\" Version=\"8.0.0-rc.1.23419.4\" \/&gt;\r\n\/\/ to the benchmarks.csproj's &lt;ItemGroup&gt;.\r\n\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Hashing;\r\nusing System.Runtime.InteropServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _source = new byte[] { 1, 2, 3 };\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public ulong HashToBytesThenGetUInt64()\r\n    {\r\n        ulong hash = 0;\r\n        XxHash3.Hash(_source, MemoryMarshal.AsBytes(new Span&lt;ulong&gt;(ref hash)));\r\n        return hash;\r\n    }\r\n\r\n    [Benchmark]\r\n    public ulong HashToUInt64() =&gt; XxHash3.HashToUInt64(_source);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HashToBytesThenGetUInt64<\/td>\n<td style=\"text-align: right\">3.686 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>HashToUInt64<\/td>\n<td style=\"text-align: right\">3.095 ns<\/td>\n<td style=\"text-align: right\">0.84<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Also on the hashing front, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/61558\">dotnet\/runtime#61558<\/a> from <a href=\"https:\/\/github.com\/deeprobin\">@deeprobin<\/a> adds new <code>BitOperations.Crc32C<\/code> methods that allow for iterative crc32c hash computation. 
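<\/p>\n<p>As a rough sketch of what &#8220;iterative&#8221; means here: each call folds more data into a running <code>uint<\/code> state, and there are overloads for <code>byte<\/code>, <code>ushort<\/code>, <code>uint<\/code>, and <code>ulong<\/code> inputs. The helper below is hypothetical (its name and chunking strategy are my assumptions, not code from the PR). It consumes eight bytes at a time where possible and applies the conventional CRC-32C initial value and final XOR of <code>0xFFFFFFFF<\/code>, which <code>BitOperations.Crc32C<\/code> leaves to the caller:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Buffers.Binary;\r\nusing System.Numerics;\r\nusing System.Text;\r\n\r\nbyte[] input = Encoding.ASCII.GetBytes(\"123456789\"); \/\/ illustrative input\r\nuint checksum = Crc32CHelper.Compute(input);\r\nConsole.WriteLine($\"{checksum:X8}\");\r\n\r\n\/\/ Hypothetical helper, for illustration only.\r\ninternal static class Crc32CHelper\r\n{\r\n    public static uint Compute(ReadOnlySpan&lt;byte&gt; data)\r\n    {\r\n        uint crc = 0xFFFFFFFF; \/\/ conventional CRC-32C initial value\r\n\r\n        \/\/ Fold eight bytes per call while enough input remains. The value is read\r\n        \/\/ as little-endian so the bytes are consumed in memory order.\r\n        while (data.Length &gt;= sizeof(ulong))\r\n        {\r\n            crc = BitOperations.Crc32C(crc, BinaryPrimitives.ReadUInt64LittleEndian(data));\r\n            data = data.Slice(sizeof(ulong));\r\n        }\r\n\r\n        \/\/ Fold any remaining bytes one at a time.\r\n        foreach (byte b in data)\r\n        {\r\n            crc = BitOperations.Crc32C(crc, b);\r\n        }\r\n\r\n        return ~crc; \/\/ conventional final XOR with 0xFFFFFFFF\r\n    }\r\n}<\/code><\/pre>\n<p>Feeding wider chunks means fewer calls for the same input, which is one reason to prefer the <code>ulong<\/code> overload over a byte-at-a-time loop when the data allows it.<\/p>\n<p>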
A nice aspect of crc32c is that multiple platforms provide instructions for this operation, including SSE 4.2 and Arm, and the .NET method will employ whatever hardware support is available, by delegating into the relevant hardware intrinsics in <code>System.Runtime.Intrinsics<\/code>, e.g.<\/p>\n<pre><code class=\"language-C#\">if (Sse42.X64.IsSupported) return (uint)Sse42.X64.Crc32(crc, data);\r\nif (Sse42.IsSupported) return Sse42.Crc32(Sse42.Crc32(crc, (uint)(data)), (uint)(data &gt;&gt; 32));\r\nif (Crc32.Arm64.IsSupported) return Crc32.Arm64.ComputeCrc32C(crc, data);<\/code><\/pre>\n<p>We can see the impact those intrinsics have by comparing a manual implementation of the crc32c algorithm against the now built-in implementation:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Numerics;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _data = RandomNumberGenerator.GetBytes(1024 * 1024);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public uint Crc32c_Manual()\r\n    {\r\n        uint c = 0;\r\n        foreach (byte b in _data) c = Tests.Crc32C(c, b);\r\n        return c;\r\n    }\r\n\r\n    [Benchmark]\r\n    public uint Crc32c_BitOperations()\r\n    {\r\n        uint c = 0;\r\n        foreach (byte b in _data) c = BitOperations.Crc32C(c, b);\r\n        return c;\r\n    }\r\n\r\n    private static readonly uint[] s_crcTable = Generate(0x82F63B78u);\r\n\r\n    internal static uint Crc32C(uint crc, byte data) =&gt;\r\n        s_crcTable[(byte)(crc ^ data)] ^ (crc &gt;&gt; 8);\r\n\r\n    internal static uint[] Generate(uint reflectedPolynomial)\r\n    {\r\n        var table = new uint[256];\r\n\r\n        for (int i = 0; i &lt; 256; i++)\r\n    
    {\r\n            uint val = (uint)i;\r\n            for (int j = 0; j &lt; 8; j++)\r\n            {\r\n                if ((val &amp; 0b0000_0001) == 0)\r\n                {\r\n                    val &gt;&gt;= 1;\r\n                }\r\n                else\r\n                {\r\n                    val = (val &gt;&gt; 1) ^ reflectedPolynomial;\r\n                }\r\n            }\r\n\r\n            table[i] = val;\r\n        }\r\n\r\n        return table;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Crc32c_Manual<\/td>\n<td style=\"text-align: right\">1,977.9 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Crc32c_BitOperations<\/td>\n<td style=\"text-align: right\">739.9 us<\/td>\n<td style=\"text-align: right\">0.37<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Initialization<\/h3>\n<p>Several releases ago, the C# compiler added a valuable optimization that&#8217;s now heavily employed throughout the core libraries, and that newer C# constructs (like <code>u8<\/code>) rely on heavily. It&#8217;s quite common to want to store and access sequences or tables of data in code. For example, let&#8217;s say I want to quickly look up how many days there are in a month in the Gregorian calendar, based on that month&#8217;s 0-based index. I can use a lookup table like this (ignoring leap years for explanatory purposes):<\/p>\n<pre><code class=\"language-C#\">byte[] daysInMonth = new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };<\/code><\/pre>\n<p>Of course, now I&#8217;m allocating a <code>byte[]<\/code>, so I should move that out to a <code>static readonly<\/code> field. Even then, though, that array has to be allocated, and the data loaded into it, incurring some startup overhead the first time it&#8217;s used. 
Instead, I can write it as:<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;byte&gt; daysInMonth = new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };<\/code><\/pre>\n<p>While this looks like it&#8217;s allocating, it&#8217;s actually not. The C# compiler recognizes that all of the data being used to initialize the <code>byte[]<\/code> is constant and that the array is being stored directly into a <code>ReadOnlySpan&lt;byte&gt;<\/code>, which doesn&#8217;t provide any means for extracting the array back out. As such, the compiler instead lowers this into code that effectively does this (we can&#8217;t exactly express in C# the IL that gets generated, so this is pseudo-code):<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;byte&gt; daysInMonth = new ReadOnlySpan&lt;byte&gt;(\r\n    &amp;&lt;PrivateImplementationDetails&gt;.9D61D7D7A1AA7E8ED5214C2F39E0C55230433C7BA728C92913CA4E1967FAF8EA,\r\n    12);<\/code><\/pre>\n<p>It blits the data for the array into the assembly, and then constructing the span isn&#8217;t via an array allocation, but rather just wrapping the span around a pointer directly into the assembly&#8217;s data. This not only avoids the startup overhead and the extra object on the heap, it also better enables various JIT optimizations, especially when the JIT is able to see what offset is being accessed. 
If I run this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private static readonly byte[] s_daysInMonthArray = new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };\r\n    private static ReadOnlySpan&lt;byte&gt; DaysInMonthSpan =&gt; new byte[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };\r\n\r\n    [Benchmark] public int ViaArray() =&gt; s_daysInMonthArray[0];\r\n\r\n    [Benchmark] public int ViaSpan() =&gt; DaysInMonthSpan[0];\r\n}<\/code><\/pre>\n<p>it produces this assembly:<\/p>\n<pre><code class=\"language-assembly\">; Tests.ViaArray()\r\n       mov       rax,1B860002028\r\n       mov       rax,[rax]\r\n       movzx     eax,byte ptr [rax+10]\r\n       ret\r\n; Total bytes of code 18\r\n\r\n; Tests.ViaSpan()\r\n       mov       eax,1F\r\n       ret\r\n; Total bytes of code 6<\/code><\/pre>\n<p>In other words, for the array, it&#8217;s reading the address of the array and is then reading the element at offset 0x10, or decimal 16, which is where the array&#8217;s data begins. For the span, it&#8217;s simply loading the value 0x1F, or decimal 31, as it&#8217;s directly reading the data from the assembly data. (This isn&#8217;t a case of a missing optimization in the JIT for the array example&#8230; arrays are mutable, so the JIT can&#8217;t constant fold based on the current value stored in the array, since technically it could change.)<\/p>\n<p>However, this compiler optimization only applied to <code>byte<\/code>, <code>sbyte<\/code>, and <code>bool<\/code>. Any other primitive, and the compiler would simply do exactly what you asked it to do: allocate the array. Far from ideal. 
The reason for the limitation was endianness. The compiler needs to generate binaries that work on both little-endian and big-endian systems; for single-byte types, there&#8217;s no endianness concern (since endianness is about the ordering of the bytes, and if there&#8217;s only one byte, there&#8217;s only one ordering), but for multi-byte types, the generated code could no longer just point directly into the data, as on some systems the data&#8217;s bytes would be reversed.<\/p>\n<p>.NET 7 added a new API to help with this, <code>RuntimeHelpers.CreateSpan&lt;T&gt;<\/code>. Rather than just emitting <code>new ReadOnlySpan&lt;T&gt;(ptrIntoData, dataLength)<\/code>, the idea was that the compiler would emit a call to <code>CreateSpan&lt;T&gt;<\/code>, passing in a reference to the field containing the data. The JIT and VM would then collude to ensure the data was loaded correctly and efficiently; on a little-endian system, the code would be emitted as if the call weren&#8217;t there (replaced by the equivalent of wrapping a span around the pointer and length), and on a big-endian system, the data would be loaded, reversed, and cached into an array, and the code gen would then be creating a span wrapping that array. Unfortunately, although the API shipped in .NET 7, the compiler support for it didn&#8217;t, and because no one was then actually using it, there were a variety of issues in the toolchain that went unnoticed.<\/p>\n<p>Thankfully, all of these issues are now addressed in .NET 8 and the C# compiler (and also backported to .NET 7). <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/61414\">dotnet\/roslyn#61414<\/a> added support to the C# compiler for also supporting <code>short<\/code>, <code>ushort<\/code>, <code>char<\/code>, <code>int<\/code>, <code>uint<\/code>, <code>long<\/code>, <code>ulong<\/code>, <code>double<\/code>, <code>float<\/code>, and <code>enum<\/code>s based on these. 
On target frameworks where <code>CreateSpan&lt;T&gt;<\/code> is available (.NET 7+), the compiler generates code that uses it. On frameworks where the function isn&#8217;t available, the compiler falls back to emitting a <code>static readonly<\/code> array to cache the data and wrapping a span around that. This was an important consideration for libraries that build for multiple target frameworks, so that when building &#8220;downlevel&#8221;, the implementation doesn&#8217;t fall off the proverbial performance cliff due to relying on this optimization (this optimization is a bit of an oddity, as you actually need to write your code in a way that, without the optimization, ends up performing worse than what you would have otherwise had). With the compiler implementation in place, and fixes to the Mono runtime in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82093\">dotnet\/runtime#82093<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81695\">dotnet\/runtime#81695<\/a>, and with fixes to the trimmer (which needs to preserve the alignment of the data that&#8217;s emitted by the compiler) in <a href=\"https:\/\/github.com\/dotnet\/cecil\/pull\/60\">dotnet\/cecil#60<\/a>, the rest of the runtime was then able to consume the feature, which it did in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79461\">dotnet\/runtime#79461<\/a>. 
So now, for example, <code>System.Text.Json<\/code> can use this to store not only how many days there are in a (non-leap) year, but also store how many days there are before a given month, something that wasn&#8217;t previously possible efficiently in this form due to there being values larger than can be stored in a <code>byte<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter **\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"i\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[DisassemblyDiagnoser]\r\npublic class Tests\r\n{\r\n    private static ReadOnlySpan&lt;int&gt; DaysToMonth365 =&gt; new int[] { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365 };\r\n\r\n    [Benchmark]\r\n    [Arguments(1)]\r\n    public int DaysToMonth(int i) =&gt; DaysToMonth365[i];\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Code Size<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DaysToMonth<\/td>\n<td style=\"text-align: right\">0.0469 ns<\/td>\n<td style=\"text-align: right\">35 B<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre><code class=\"language-assembly\">; Tests.DaysToMonth(Int32)\r\n       sub       rsp,28\r\n       cmp       edx,0D\r\n       jae       short M00_L00\r\n       mov       eax,edx\r\n       mov       rcx,12B39072DD0\r\n       mov       eax,[rcx+rax*4]\r\n       add       rsp,28\r\n       ret\r\nM00_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 35<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/69820\">dotnet\/roslyn#69820<\/a> (which hasn&#8217;t yet merged but should soon) then rounds 
things out by ensuring that the pattern of initializing a <code>ReadOnlySpan&lt;T&gt;<\/code> to a <code>new T[] { const of T, const of T, ... \/* all const values *\/ }<\/code> will always avoid the array allocation, regardless of the type of <code>T<\/code> being used. The <code>T<\/code> need only be expressible as a constant in C#. That means this optimization now also applies to <code>string<\/code>, <code>decimal<\/code>, <code>nint<\/code>, and <code>nuint<\/code>. For these, the compiler will fallback to using a cached array singleton. With that, this code:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet build -c Release -f net8.0\r\n\r\ninternal static class Program\r\n{\r\n    private static void Main() { }\r\n\r\n    private static ReadOnlySpan&lt;bool&gt; Booleans =&gt; new bool[] { false, true };\r\n    private static ReadOnlySpan&lt;sbyte&gt; SBytes =&gt; new sbyte[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;byte&gt; Bytes =&gt; new byte[] { 0, 1, 2 };\r\n\r\n    private static ReadOnlySpan&lt;short&gt; Shorts =&gt; new short[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;ushort&gt; UShorts =&gt; new ushort[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;char&gt; Chars =&gt; new char[] { '0', '1', '2' };\r\n    private static ReadOnlySpan&lt;int&gt; Ints =&gt; new int[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;uint&gt; UInts =&gt; new uint[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;long&gt; Longs =&gt; new long[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;ulong&gt; ULongs =&gt; new ulong[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;float&gt; Floats =&gt; new float[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;double&gt; Doubles =&gt; new double[] { 0, 1, 2 };\r\n\r\n    private static ReadOnlySpan&lt;nint&gt; NInts =&gt; new nint[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;nuint&gt; NUInts =&gt; new nuint[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;decimal&gt; Decimals 
=&gt; new decimal[] { 0, 1, 2 };\r\n    private static ReadOnlySpan&lt;string&gt; Strings =&gt; new string[] { \"0\", \"1\", \"2\" };\r\n}<\/code><\/pre>\n<p>now compiles down to something like this (again, this is pseudo-code, since we can&#8217;t exactly represent in C# what&#8217;s emitted in IL):<\/p>\n<pre><code class=\"language-C#\">internal static class Program\r\n{\r\n    private static void Main() { }\r\n\r\n    \/\/\r\n    \/\/ No endianness concerns. Create a span that points directly into the assembly data,\r\n    \/\/ using the `ReadOnlySpan&lt;T&gt;(void*, int)` constructor.\r\n    \/\/\r\n\r\n    private static ReadOnlySpan&lt;bool&gt; Booleans =&gt; new ReadOnlySpan&lt;bool&gt;(\r\n        &amp;&lt;PrivateImplementationDetails&gt;.B413F47D13EE2FE6C845B2EE141AF81DE858DF4EC549A58B7970BB96645BC8D2, 2);\r\n\r\n    private static ReadOnlySpan&lt;sbyte&gt; SBytes =&gt; new ReadOnlySpan&lt;sbyte&gt;(\r\n        &amp;&lt;PrivateImplementationDetails&gt;.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC, 3);\r\n\r\n    private static ReadOnlySpan&lt;byte&gt; Bytes =&gt; new ReadOnlySpan&lt;byte&gt;(\r\n        &amp;&lt;PrivateImplementationDetails&gt;.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC, 3);\r\n\r\n    \/\/\r\n    \/\/ Endianness concerns but with data that a span could point to directly if\r\n    \/\/ of the correct byte ordering. 
Go through the RuntimeHelpers.CreateSpan intrinsic.\r\n    \/\/\r\n\r\n    private static ReadOnlySpan&lt;short&gt; Shorts =&gt; RuntimeHelpers.CreateSpan&lt;short&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.90C2698921CA9FD02950BE353F721888760E33AB5095A21E50F1E4360B6DE1A02);\r\n\r\n    private static ReadOnlySpan&lt;ushort&gt; UShorts =&gt; RuntimeHelpers.CreateSpan&lt;ushort&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.90C2698921CA9FD02950BE353F721888760E33AB5095A21E50F1E4360B6DE1A02);\r\n\r\n    private static ReadOnlySpan&lt;char&gt; Chars =&gt; RuntimeHelpers.CreateSpan&lt;char&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.9B9A3CBF2B718A8F94CE348CB95246738A3A9871C6236F4DA0A7CC126F03A8B42);\r\n\r\n    private static ReadOnlySpan&lt;int&gt; Ints =&gt; RuntimeHelpers.CreateSpan&lt;int&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC4);\r\n\r\n    private static ReadOnlySpan&lt;uint&gt; UInts =&gt; RuntimeHelpers.CreateSpan&lt;uint&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC4);\r\n\r\n    private static ReadOnlySpan&lt;long&gt; Longs =&gt; RuntimeHelpers.CreateSpan&lt;long&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.AB25350E3E65EFEBE24584461683ECDA68725576E825E550038B90E7B14799468);\r\n\r\n    private static ReadOnlySpan&lt;ulong&gt; ULongs =&gt; RuntimeHelpers.CreateSpan&lt;ulong&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.AB25350E3E65EFEBE24584461683ECDA68725576E825E550038B90E7B14799468);\r\n\r\n    private static ReadOnlySpan&lt;float&gt; Floats =&gt; RuntimeHelpers.CreateSpan&lt;float&gt;((RuntimeFieldHandle)\r\n        
&amp;&lt;PrivateImplementationDetails&gt;.75664B4DA1C08DE9E8FAD52303CC458B3E420EDDE6591E58761E138CC5E3F1634);\r\n\r\n    private static ReadOnlySpan&lt;double&gt; Doubles =&gt; RuntimeHelpers.CreateSpan&lt;double&gt;((RuntimeFieldHandle)\r\n        &amp;&lt;PrivateImplementationDetails&gt;.B0C45303F7F11848CB5E6E5B2AF2FB2AECD0B72C28748B88B583AB6BB76DF1748);\r\n\r\n    \/\/\r\n    \/\/ Create a span around a cached array.\r\n    \/\/\r\n\r\n    private static ReadOnlySpan&lt;nuint&gt; NUInts =&gt; new ReadOnlySpan&lt;nuint&gt;(\r\n        &lt;PrivateImplementationDetails&gt;.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC_B16\r\n            ??= new nuint[] { 0, 1, 2 });\r\n\r\n    private static ReadOnlySpan&lt;nint&gt; NInts =&gt; new ReadOnlySpan&lt;nint&gt;(\r\n        &lt;PrivateImplementationDetails&gt;.AD5DC1478DE06A4C2728EA528BD9361A4B945E92A414BF4D180CEDAAEAA5F4CC_B8\r\n            ??= new nint[] { 0, 1, 2 });\r\n\r\n    private static ReadOnlySpan&lt;decimal&gt; Decimals =&gt; new ReadOnlySpan&lt;decimal&gt;(\r\n        &lt;PrivateImplementationDetails&gt;.93AF9093EDC211A9A941BDE5EF5640FD395604257F3D945F93C11BA9E918CC74_B18\r\n            ??= new decimal[] { 0, 1, 2 });\r\n\r\n    private static ReadOnlySpan&lt;string&gt; Strings =&gt; new ReadOnlySpan&lt;string&gt;(\r\n        &lt;PrivateImplementationDetails&gt;.9B9A3CBF2B718A8F94CE348CB95246738A3A9871C6236F4DA0A7CC126F03A8B4_B11\r\n            ??= new string[] { \"0\", \"1\", \"2\" });\r\n}<\/code><\/pre>\n<p>Another closely-related C# compiler improvement comes in <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/66251\">dotnet\/roslyn#66251<\/a> from <a href=\"https:\/\/github.com\/alrz\">@alrz<\/a>. The previously mentioned optimization around single-byte types also applies to <code>stackalloc<\/code> initialization. 
If I write:<\/p>\n<pre><code class=\"language-C#\">Span&lt;int&gt; span = stackalloc int[] { 1, 2, 3 };<\/code><\/pre>\n<p>the C# compiler emits code similar to if I&#8217;d written the following:<\/p>\n<pre><code class=\"language-C#\">byte* ptr = stackalloc byte[12];\r\n*(int*)ptr = 1;\r\n*(int*)(ptr + 4) = 2;\r\n*(int*)(ptr + (nint)2 * (nint)4) = 3;\r\nSpan&lt;int&gt; span = new Span&lt;int&gt;(ptr, 3);<\/code><\/pre>\n<p>If, however, I switch from the multi-byte <code>int<\/code> to the single-byte <code>byte<\/code>:<\/p>\n<pre><code class=\"language-C#\">Span&lt;byte&gt; span = stackalloc byte[] { 1, 2, 3 };<\/code><\/pre>\n<p>then I get something closer to this:<\/p>\n<pre><code class=\"language-C#\">byte* ptr = stackalloc byte[3];\r\nUnsafe.CopyBlock(ptr, ref &lt;PrivateImplementationDetails&gt;.039058C6F2C0CB492C533B0A4D14EF77CC0F78ABCCCED5287D84A1A2011CFB81, 3); \/\/ actually the cpblk instruction\r\nSpan&lt;byte&gt; span = new Span&lt;byte&gt;(ptr, 3);<\/code><\/pre>\n<p>Unlike the <code>new[]<\/code> case, however, which optimized not only for <code>byte<\/code>, <code>sbyte<\/code>, and <code>bool<\/code> but also for <code>enum<\/code>s with <code>byte<\/code> and <code>sbyte<\/code> as an underlying type, the <code>stackalloc<\/code> optimization didn&#8217;t. Thanks to this PR, it now does.<\/p>\n<p>There&#8217;s another semi-related new feature spanning C# 12 and .NET 8: <code>InlineArrayAttribute<\/code>. 
<code>stackalloc<\/code> has long provided a way to use stack space as a buffer, rather than needing to allocate memory on the heap; however, for most of .NET&#8217;s history, this was &#8220;unsafe,&#8221; in that it produced a pointer:<\/p>\n<pre><code class=\"language-C#\">byte* buffer = stackalloc byte[8];<\/code><\/pre>\n<p>C# 7.2 introduced the immensely useful improvement to stack allocate directly into a span, at which point it becomes &#8220;safe,&#8221; not requiring being in an <code>unsafe<\/code> context and with all access to the span bounds checked appropriately, as with any other span:<\/p>\n<pre><code class=\"language-C#\">Span&lt;byte&gt; buffer = stackalloc byte[8];<\/code><\/pre>\n<p>The C# compiler will lower that to something along the lines of:<\/p>\n<pre><code class=\"language-C#\">Span&lt;byte&gt; buffer;\r\nunsafe\r\n{\r\n    byte* tmp = stackalloc byte[8];\r\n    buffer = new Span&lt;byte&gt;(tmp, 8);\r\n}<\/code><\/pre>\n<p>However, this is still limited to the kinds of things that can be <code>stackalloc<\/code>&#8216;d, namely <code>unmanaged<\/code> types (types which don&#8217;t contain any managed references), and it&#8217;s limited in where it can be used. That&#8217;s not only because <code>stackalloc<\/code> can&#8217;t be used in places like <code>catch<\/code> and <code>finally<\/code> blocks, but also because there are places where you want to be able to have such buffers that aren&#8217;t limited to the stack: inside of other types. 
C# has long supported the notion of &#8220;fixed-size buffers,&#8221; e.g.<\/p>\n<pre><code class=\"language-C#\">struct C\r\n{\r\n    internal unsafe fixed char name[30];\r\n}<\/code><\/pre>\n<p>but these require being in an <code>unsafe<\/code> context since they present to a consumer as a pointer (in the above example, the type of <code>C.name<\/code> is a <code>char*<\/code>) and they&#8217;re not bounds-checked, and they&#8217;re limited in the element type supported (it can only be <code>bool<\/code>, <code>sbyte<\/code>, <code>byte<\/code>, <code>short<\/code>, <code>ushort<\/code>, <code>char<\/code>, <code>int<\/code>, <code>uint<\/code>, <code>long<\/code>, <code>ulong<\/code>, <code>double<\/code>, or <code>float<\/code>).<\/p>\n<p>.NET 8 and C# 12 provide an answer for this: <code>[InlineArray]<\/code>. This new attribute can be placed onto a <code>struct<\/code> containing a single field, like this:<\/p>\n<pre><code class=\"language-C#\">[InlineArray(8)]\r\ninternal struct EightStrings\r\n{\r\n    private string _field;\r\n}<\/code><\/pre>\n<p>The runtime then expands that struct to be logically the same as if you wrote:<\/p>\n<pre><code class=\"language-C#\">internal struct EightStrings\r\n{\r\n    private string _field0;\r\n    private string _field1;\r\n    private string _field2;\r\n    private string _field3;\r\n    private string _field4;\r\n    private string _field5;\r\n    private string _field6;\r\n    private string _field7;\r\n}<\/code><\/pre>\n<p>ensuring that all of the storage is appropriately contiguous and aligned. Why is that important? Because C# 12 then makes it easy to get a span from one of these instances, e.g.<\/p>\n<pre><code class=\"language-C#\">EightStrings strings = default;\r\nSpan&lt;string&gt; span = strings;<\/code><\/pre>\n<p>This is all &#8220;safe,&#8221; and the type of the field can be anything that&#8217;s valid as a generic type argument. 
That means pretty much anything other than <code>ref<\/code>s, <code>ref struct<\/code>s, and pointers. This is a constraint imposed by the C# language, since with such a field type <code>T<\/code> you wouldn&#8217;t be able to construct a <code>Span&lt;T&gt;<\/code>, but the warning can be suppressed, as the runtime itself does support anything as the field type. The compiler-generated code for getting a span is equivalent to if you wrote:<\/p>\n<pre><code class=\"language-C#\">EightStrings strings = default;\r\nSpan&lt;string&gt; span = MemoryMarshal.CreateSpan(ref Unsafe.As&lt;EightStrings, string&gt;(ref strings), 8);<\/code><\/pre>\n<p>which is obviously complicated and not something you&#8217;d want to be writing frequently. In fact, the compiler doesn&#8217;t want to emit that frequently, either, so it puts it into a helper in the assembly that it can reuse.<\/p>\n<pre><code class=\"language-C#\">[CompilerGenerated]\r\ninternal sealed class &lt;PrivateImplementationDetails&gt;\r\n{\r\n    internal static Span&lt;TElement&gt; InlineArrayAsSpan&lt;TBuffer, TElement&gt;(ref TBuffer buffer, int length) =&gt;\r\n        MemoryMarshal.CreateSpan(ref Unsafe.As&lt;TBuffer, TElement&gt;(ref buffer), length);\r\n    ...\r\n}<\/code><\/pre>\n<p>(<code>&lt;PrivateImplementationDetails&gt;<\/code> is a class the C# compiler emits to contain helpers and other compiler-generated artifacts used by code it emits elsewhere in the program. You saw it in the previous discussion as well, as it&#8217;s where it emits the data in support of array and span initialization from constants.)<\/p>\n<p>The <code>[InlineArray]<\/code>-attributed type is also a normal <code>struct<\/code> like any other, and can be used anywhere any other <code>struct<\/code> can be used; that it&#8217;s using <code>[InlineArray]<\/code> is effectively an implementation detail. 
So, for example, you can embed it into another type, and the following code will print out &#8220;0&#8221; through &#8220;7&#8221; as you&#8217;d expect:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Runtime.CompilerServices;\r\n\r\nMyData data = new();\r\nSpan&lt;string&gt; span = data.Strings;\r\n\r\nfor (int i = 0; i &lt; span.Length; i++) span[i] = i.ToString();\r\n\r\nforeach (string s in data.Strings) Console.WriteLine(s);\r\n\r\npublic class MyData\r\n{\r\n    private EightStrings _strings;\r\n\r\n    public Span&lt;string&gt; Strings =&gt; _strings;\r\n\r\n    [InlineArray(8)]\r\n    private struct EightStrings { private string _field; }\r\n}<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82744\">dotnet\/runtime#82744<\/a> provided the CoreCLR runtime support for <code>InlineArray<\/code>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83776\">dotnet\/runtime#83776<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84097\">dotnet\/runtime#84097<\/a> provided the Mono runtime support, and <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/68783\">dotnet\/roslyn#68783<\/a> merged the C# compiler support.<\/p>\n<p>This feature isn&#8217;t just about you using it directly, either. The compiler itself also uses <code>[InlineArray]<\/code> as an implementation detail behind other new and planned features&#8230; we&#8217;ll talk more about that when discussing collections.<\/p>\n<h3>Analyzers<\/h3>\n<p>Lastly, even though the runtime and core libraries have made great strides in improving the performance of existing functionality and adding new performance-focused support, sometimes the best fix is actually in the consuming code. That&#8217;s where analyzers come in. 
Several new analyzers have been added in .NET 8 to help find particular classes of string-related performance issues.<\/p>\n<p><a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/CA1858\">CA1858<\/a>, added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6295\">dotnet\/roslyn-analyzers#6295<\/a> from <a href=\"https:\/\/github.com\/Youssef1313\">@Youssef1313<\/a>, looks for calls to <code>IndexOf<\/code> where the result is then being checked for equality with 0. This is functionally the same as a call to <code>StartsWith<\/code>, but is much more expensive as it could end up examining the entire source string rather than just the starting position (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79896\">dotnet\/runtime#79896<\/a> fixes a few such uses in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>).\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1858.png\" alt=\"CA1858\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _haystack = \"\"\"\r\n        It was the best of times, it was the worst of times,\r\n        it was the age of wisdom, it was the age of foolishness,\r\n        it was the epoch of belief, it was the epoch of incredulity,\r\n        it was the season of light, it was the season of darkness,\r\n        it was the spring of hope, it was the winter of despair.\r\n        \"\"\";\r\n    private readonly string _needle = \"hello\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool StartsWith_IndexOf0() =&gt;\r\n        _haystack.IndexOf(_needle, 
StringComparison.OrdinalIgnoreCase) == 0;\r\n\r\n    [Benchmark]\r\n    public bool StartsWith_StartsWith() =&gt;\r\n        _haystack.StartsWith(_needle, StringComparison.OrdinalIgnoreCase);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>StartsWith_IndexOf0<\/td>\n<td style=\"text-align: right\">31.327 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>StartsWith_StartsWith<\/td>\n<td style=\"text-align: right\">4.501 ns<\/td>\n<td style=\"text-align: right\">0.14<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>CA1865, CA1866, and CA1867 are all related to each other. Added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6799\">dotnet\/roslyn-analyzers#6799<\/a> from <a href=\"https:\/\/github.com\/mrahhal\">@mrahhal<\/a>, these look for calls to <code>string<\/code> methods like <code>StartsWith<\/code>, searching for calls passing in a single-character <code>string<\/code> argument, e.g. <code>str.StartsWith(\"@\")<\/code>, and recommending the argument be converted into a <code>char<\/code>. Which diagnostic ID the analyzer raises depends on whether the transformation is 100% equivalent behavior or whether a change in behavior could potentially result, e.g. 
switching from a linguistic comparison to an ordinal comparison.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1865.png\" alt=\"CA1865\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _haystack = \"All we have to decide is what to do with the time that is given us.\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public int IndexOfString() =&gt; _haystack.IndexOf(\"v\");\r\n\r\n    [Benchmark]\r\n    public int IndexOfChar() =&gt; _haystack.IndexOf('v');\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IndexOfString<\/td>\n<td style=\"text-align: right\">37.634 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IndexOfChar<\/td>\n<td style=\"text-align: right\">1.979 ns<\/td>\n<td style=\"text-align: right\">0.05<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>CA1862, added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6662\">dotnet\/roslyn-analyzers#6662<\/a>, looks for places where code is performing a case-insensitive comparison (which is fine) but doing so by first lower\/uppercasing an input string and then comparing that (which is far from fine). It&#8217;s much more efficient to just use a <code>StringComparison<\/code>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89539\">dotnet\/runtime#89539<\/a> fixes a few such cases.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1862.png\" alt=\"CA1862\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly string _input = \"https:\/\/dot.net\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool IsHttps_ToUpper() =&gt; _input.ToUpperInvariant().StartsWith(\"HTTPS:\/\/\");\r\n\r\n    [Benchmark]\r\n    public bool IsHttps_StringComparison() =&gt; _input.StartsWith(\"HTTPS:\/\/\", StringComparison.OrdinalIgnoreCase);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IsHttps_ToUpper<\/td>\n<td style=\"text-align: right\">46.3702 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IsHttps_StringComparison<\/td>\n<td style=\"text-align: right\">0.4781 ns<\/td>\n<td style=\"text-align: right\">0.01<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/CA1861\">CA1861<\/a>, added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/5383\">dotnet\/roslyn-analyzers#5383<\/a> from <a 
href=\"https:\/\/github.com\/steveberdy\">@steveberdy<\/a>, looks for opportunities to lift and cache arrays being passed as arguments. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86229\">dotnet\/runtime#86229<\/a> addresses the issues found by the analyzer in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1861.png\" alt=\"CA1861\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private static readonly char[] s_separator = new[] { ',', ':' };\r\n    private readonly string _value = \"1,2,3:4,5,6\";\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public string[] Split_Original() =&gt; _value.Split(new[] { ',', ':' });\r\n\r\n    [Benchmark]\r\n    public string[] Split_Refactored() =&gt; _value.Split(s_separator);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Split_Original<\/td>\n<td style=\"text-align: right\">108.6 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">248 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Split_Refactored<\/td>\n<td style=\"text-align: right\">104.0 ns<\/td>\n<td style=\"text-align: right\">0.96<\/td>\n<td style=\"text-align: right\">216 B<\/td>\n<td style=\"text-align: 
right\">0.87<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Collections<\/h2>\n<p>Collections are the bread and butter of practically every application and service. Have more than one of something? You need a collection to manage them. And since they&#8217;re so commonly needed and used, every release of .NET invests meaningfully in improving their performance and driving down their overheads.<\/p>\n<h3>General<\/h3>\n<p>Some of the changes made in .NET 8 are largely collection-agnostic and affect a large number of collections. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82499\">dotnet\/runtime#82499<\/a> special-cases &#8220;empty&#8221; on a bunch of the built-in collection types to return an empty singleton enumerator, thus avoiding allocating a largely useless object. This is wide-reaching, affecting <code>List&lt;T&gt;<\/code>, <code>Queue&lt;T&gt;<\/code>, <code>Stack&lt;T&gt;<\/code>, <code>LinkedList&lt;T&gt;<\/code>, <code>PriorityQueue&lt;TElement, TPriority&gt;<\/code>, <code>SortedDictionary&lt;TKey, TValue&gt;<\/code>, <code>SortedList&lt;TKey, TValue&gt;<\/code>, <code>HashSet&lt;T&gt;<\/code>, <code>Dictionary&lt;TKey, TValue&gt;<\/code>, and <code>ArraySegment&lt;T&gt;<\/code>. Interestingly, <code>T[]<\/code> was already on this plan (as were a few other collections, like <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code>); if you called <code>IEnumerable&lt;T&gt;.GetEnumerator<\/code> on any <code>T[]<\/code> of length 0, you already got back a singleton enumerator hardcoded to return <code>false<\/code> from its <code>MoveNext<\/code>. 
That same enumerator singleton is what&#8217;s now returned from the <code>GetEnumerator<\/code> implementations of all of those cited collection types when they&#8217;re empty at the moment <code>GetEnumerator<\/code> is called.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\".NET 7\").WithRuntime(CoreRuntime.Core70).AsBaseline())\r\n    .AddJob(Job.Default.WithId(\".NET 8 w\/o PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\".NET 8\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\", \"Runtime\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly IEnumerable&lt;int&gt; _list = new List&lt;int&gt;();\r\n    private readonly IEnumerable&lt;int&gt; _queue = new Queue&lt;int&gt;();\r\n    private readonly IEnumerable&lt;int&gt; _stack = new Stack&lt;int&gt;();\r\n    private readonly IEnumerable&lt;int&gt; _linkedList = new LinkedList&lt;int&gt;();\r\n    private readonly IEnumerable&lt;int&gt; _hashSet = new HashSet&lt;int&gt;();\r\n    private readonly IEnumerable&lt;int&gt; _segment = new ArraySegment&lt;int&gt;(Array.Empty&lt;int&gt;());\r\n    private readonly IEnumerable&lt;KeyValuePair&lt;int, int&gt;&gt; _dictionary = new Dictionary&lt;int, int&gt;();\r\n    private readonly IEnumerable&lt;KeyValuePair&lt;int, int&gt;&gt; _sortedDictionary = new SortedDictionary&lt;int, int&gt;();\r\n    private readonly IEnumerable&lt;KeyValuePair&lt;int, int&gt;&gt; _sortedList = new 
SortedList&lt;int, int&gt;();\r\n    private readonly IEnumerable&lt;(int, int)&gt; _priorityQueue = new PriorityQueue&lt;int, int&gt;().UnorderedItems;\r\n\r\n    [Benchmark] public IEnumerator&lt;int&gt; GetList() =&gt; _list.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;int&gt; GetQueue() =&gt; _queue.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;int&gt; GetStack() =&gt; _stack.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;int&gt; GetLinkedList() =&gt; _linkedList.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;int&gt; GetHashSet() =&gt; _hashSet.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;int&gt; GetArraySegment() =&gt; _segment.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;KeyValuePair&lt;int, int&gt;&gt; GetDictionary() =&gt; _dictionary.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;KeyValuePair&lt;int, int&gt;&gt; GetSortedDictionary() =&gt; _sortedDictionary.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;KeyValuePair&lt;int, int&gt;&gt; GetSortedList() =&gt; _sortedList.GetEnumerator();\r\n    [Benchmark] public IEnumerator&lt;(int, int)&gt; GetPriorityQueue() =&gt; _priorityQueue.GetEnumerator();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetList<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">15.9046 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetList<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">2.1016 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: 
right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetList<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">0.8954 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetQueue<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">16.5115 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetQueue<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">1.8934 ns<\/td>\n<td style=\"text-align: right\">0.11<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetQueue<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">1.1068 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetStack<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">16.2183 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetStack<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">4.5345 ns<\/td>\n<td style=\"text-align: right\">0.28<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetStack<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">2.7712 ns<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<td 
style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetLinkedList<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">19.9335 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">48 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetLinkedList<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">4.6176 ns<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetLinkedList<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">2.5660 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetHashSet<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">15.8322 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetHashSet<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">1.8871 ns<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetHashSet<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">1.1129 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetArraySegment<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">17.0096 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetArraySegment<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">3.9111 ns<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetArraySegment<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">1.3438 ns<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetDictionary<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">18.3397 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">48 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetDictionary<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">2.3202 ns<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetDictionary<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">1.0185 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<\/tr>\n<tr>\n<td>GetSortedDictionary<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">49.5423 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">112 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetSortedDictionary<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">5.6333 ns<\/td>\n<td style=\"text-align: right\">0.11<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetSortedDictionary<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">2.9824 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetSortedList<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">18.9600 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">48 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetSortedList<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">4.4282 ns<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetSortedList<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">2.2451 ns<\/td>\n<td style=\"text-align: right\">0.12<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>GetPriorityQueue<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">17.4375 
ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetPriorityQueue<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">4.3855 ns<\/td>\n<td style=\"text-align: right\">0.25<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td>GetPriorityQueue<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">2.8931 ns<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Enumerator allocations are avoided in other contexts, as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78613\">dotnet\/runtime#78613<\/a> from <a href=\"https:\/\/github.com\/madelson\">@madelson<\/a> avoids an unnecessary enumerator allocation in <code>HashSet&lt;T&gt;.SetEquals<\/code> and <code>HashSet&lt;T&gt;.IsProperSupersetOf<\/code>, rearranging some code in order to use <code>HashSet&lt;T&gt;<\/code>&#8217;s struct-based enumerator rather than relying on it being boxed as an <code>IEnumerator&lt;T&gt;<\/code>. 
This both saves an allocation and avoids unnecessary interface dispatch.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly HashSet&lt;int&gt; _source1 = new HashSet&lt;int&gt; { 1, 2, 3, 4, 5 };\r\n    private readonly IEnumerable&lt;int&gt; _source2 = new HashSet&lt;int&gt; { 1, 2, 3, 4, 5 };\r\n\r\n    [Benchmark]\r\n    public bool SetEquals() =&gt; _source1.SetEquals(_source2);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SetEquals<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">75.02 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">40 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SetEquals<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">26.29 ns<\/td>\n<td style=\"text-align: right\">0.35<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There are other places where &#8220;empty&#8221; has been special-cased. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76097\">dotnet\/runtime#76097<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76764\">dotnet\/runtime#76764<\/a> added an <code>Empty<\/code> singleton to <code>ReadOnlyCollection&lt;T&gt;<\/code>, <code>ReadOnlyDictionary&lt;TKey, TValue&gt;<\/code>, and <code>ReadOnlyObservableCollection&lt;T&gt;<\/code>, and then used that singleton in a bunch of places, multiple of which accrue further to many other places that consume them. For example, <code>Array.AsReadOnly<\/code> now checks whether the array being wrapped is empty, and if it is, <code>AsReadOnly<\/code> returns <code>ReadOnlyCollection&lt;T&gt;.Empty<\/code> rather than allocating a new <code>ReadOnlyCollection&lt;T&gt;<\/code> to wrap the empty array (it also makes a similar update to <code>ReadOnlyCollection&lt;T&gt;.GetEnumerator<\/code> as was discussed with the previous PRs). <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code>&#8217;s <code>Keys<\/code> and <code>Values<\/code> will now return the same singleton if the count is known to be 0. And so on. 
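<\/p>\n<p>As a minimal sketch of the behavior just described, assuming a .NET 8 target: the variable names are purely illustrative, and the reference-equality check relies on the <code>Array.AsReadOnly<\/code> change described above rather than on any documented guarantee, so treat the second line of output as an expectation rather than a contract.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System;\r\nusing System.Collections.ObjectModel;\r\n\r\n\/\/ ReadOnlyCollection&lt;T&gt;.Empty is the singleton added in .NET 8.\r\nReadOnlyCollection&lt;int&gt; empty = ReadOnlyCollection&lt;int&gt;.Empty;\r\n\r\n\/\/ Wrap a zero-length array; per the change described above, this is expected\r\n\/\/ to hand back the shared singleton instead of allocating a new wrapper.\r\nReadOnlyCollection&lt;int&gt; wrapped = Array.AsReadOnly(Array.Empty&lt;int&gt;());\r\n\r\nConsole.WriteLine(empty.Count);\r\nConsole.WriteLine(ReferenceEquals(empty, wrapped));<\/code><\/pre>\n<p>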
These kinds of changes reduce the overall &#8220;peanut butter&#8221; layer of allocation across uses of collections.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.ObjectModel;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly int[] _array = new int[0];\r\n\r\n    [Benchmark]\r\n    public ReadOnlyCollection&lt;int&gt; AsReadOnly() =&gt; Array.AsReadOnly(_array);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AsReadOnly<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">13.380 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">24 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AsReadOnly<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.460 ns<\/td>\n<td style=\"text-align: right\">0.11<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, there are many much more targeted and impactful improvements for specific collection types, too.<\/p>\n<h3>List<\/h3>\n<p>The most widely used collection in .NET, other than <code>T[]<\/code>, is <code>List&lt;T&gt;<\/code>. 
While that claim feels accurate, I also like to be data-driven, so as one measure, looking at the same NuGet packages we looked at earlier for enums, here&#8217;s a graph showing the number of references to the various concrete collection types:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CollectionPopularity.png\" alt=\"References to collection types in NuGet packages\" \/><\/p>\n<p>Given its ubiquity, <code>List&lt;T&gt;<\/code> sees a variety of improvements in .NET 8. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76043\">dotnet\/runtime#76043<\/a> improves the performance of its <code>AddRange<\/code> method, in particular when dealing with non-<code>ICollection&lt;T&gt;<\/code> inputs. When adding an <code>ICollection&lt;T&gt;<\/code>, <code>AddRange<\/code> reads the collection&#8217;s <code>Count<\/code>, ensures the list&#8217;s array is large enough to store all the incoming data, and then copies it as efficiently as the source collection can muster by invoking the collection&#8217;s <code>CopyTo<\/code> method to propagate the data directly into the <code>List&lt;T&gt;<\/code>&#8217;s backing store. But if the input enumerable isn&#8217;t an <code>ICollection&lt;T&gt;<\/code>, <code>AddRange<\/code> has little choice but to enumerate the collection and add each item one at a time. Prior to this release, <code>AddRange(collection)<\/code> simply delegated to <code>InsertRange(Count, collection)<\/code>, which meant that when <code>InsertRange<\/code> discovered the source wasn&#8217;t an <code>ICollection&lt;T&gt;<\/code>, it would fall back to calling <code>Insert(i++, item)<\/code> with each item from the enumerable. That <code>Insert<\/code> method is too large to be inlined by default, plus involves additional checks that aren&#8217;t necessary for the <code>AddRange<\/code> usage (e.g. 
it needs to validate that the supplied position is within the range of the list, but for adding, we&#8217;re always just implicitly adding at the end, with a position implicitly known to be valid). This PR rewrote <code>AddRange<\/code> to not just delegate to <code>InsertRange<\/code>, at which point when it falls back to enumerating the non-<code>ICollection&lt;T&gt;<\/code> enumerable, it calls the optimized <code>Add<\/code>, which is inlineable, and which doesn&#8217;t have any extraneous checks.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithId(\".NET 7\").WithRuntime(CoreRuntime.Core70).AsBaseline())\r\n    .AddJob(Job.Default.WithId(\".NET 8 w\/o PGO\").WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(\"DOTNET_TieredPGO\", \"0\"))\r\n    .AddJob(Job.Default.WithId(\".NET 8\").WithRuntime(CoreRuntime.Core80));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"EnvironmentVariables\", \"Runtime\")]\r\npublic class Tests\r\n{\r\n    private readonly IEnumerable&lt;int&gt; _source = GetItems(1024);\r\n    private readonly List&lt;int&gt; _list = new();\r\n\r\n    [Benchmark]\r\n    public void AddRange()\r\n    {\r\n        _list.Clear();\r\n        _list.AddRange(_source);\r\n    }\r\n\r\n    private static IEnumerable&lt;int&gt; GetItems(int count)\r\n    {\r\n        for (int i = 0; i &lt; count; i++) yield return i;\r\n    }\r\n}<\/code><\/pre>\n<p>For this test, I&#8217;ve configured it to run with and without PGO on .NET 8, because this particular test benefits significantly from PGO, and I want to tease those improvements apart from those that come from the 
cited improvements to <code>AddRange<\/code>. Why does PGO help here? Because the <code>AddRange<\/code> method will see that the type of the enumerable is always the compiler-generated iterator for <code>GetItems<\/code> and will thus generate code specific to that type, enabling the calls that would otherwise involve interface dispatch to instead be devirtualized.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Job<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AddRange<\/td>\n<td>.NET 7<\/td>\n<td style=\"text-align: right\">6.365 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AddRange<\/td>\n<td>.NET 8 w\/o PGO<\/td>\n<td style=\"text-align: right\">4.396 us<\/td>\n<td style=\"text-align: right\">0.69<\/td>\n<\/tr>\n<tr>\n<td>AddRange<\/td>\n<td>.NET 8<\/td>\n<td style=\"text-align: right\">2.445 us<\/td>\n<td style=\"text-align: right\">0.38<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>AddRange<\/code> has improved in other ways, too. One of the long-requested features for <code>List&lt;T&gt;<\/code>, ever since spans were introduced in .NET Core 2.1, was better integration between <code>List&lt;T&gt;<\/code> and <code>{ReadOnly}Span&lt;T&gt;<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76274\">dotnet\/runtime#76274<\/a> provides that, adding support to both <code>AddRange<\/code> and <code>InsertRange<\/code> for data stored in a <code>ReadOnlySpan&lt;T&gt;<\/code>, and also support for copying all of the data in a <code>List&lt;T&gt;<\/code> to a <code>Span&lt;T&gt;<\/code> via a <code>CopyTo<\/code> method. 
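<\/p>\n<p>A minimal sketch of what those three span-based methods look like at a call site, assuming a .NET 8 target; the variable names and sample values are illustrative only and are not taken from the PR:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System;\r\nusing System.Collections.Generic;\r\n\r\nvar list = new List&lt;int&gt; { 1, 5 };\r\nReadOnlySpan&lt;int&gt; middle = new[] { 2, 3, 4 };\r\nReadOnlySpan&lt;int&gt; tail = new[] { 6, 7 };\r\n\r\nlist.InsertRange(1, middle); \/\/ span-based overload: insert 2, 3, 4 at index 1\r\nlist.AddRange(tail);         \/\/ span-based overload: append 6, 7\r\n\r\n\/\/ Copy the list&#39;s contents into a caller-provided span.\r\nSpan&lt;int&gt; destination = new int[list.Count];\r\nlist.CopyTo(destination);\r\n\r\nConsole.WriteLine(string.Join(\", \", destination.ToArray())); \/\/ prints \"1, 2, 3, 4, 5, 6, 7\"<\/code><\/pre>\n<p>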
It was of course previously possible to achieve this, but doing so required handling one element at a time, which when compared to vectorized copy implementations is significantly slower.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly int[] _source = new int[1024];\r\n    private readonly List&lt;int&gt; _list = new();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void OpenCoded()\r\n    {\r\n        _list.Clear();\r\n        foreach (int i in (ReadOnlySpan&lt;int&gt;)_source)\r\n        {\r\n            _list.Add(i);\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void AddRange()\r\n    {\r\n        _list.Clear();\r\n        _list.AddRange((ReadOnlySpan&lt;int&gt;)_source);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>OpenCoded<\/td>\n<td style=\"text-align: right\">1,261.66 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AddRange<\/td>\n<td style=\"text-align: right\">51.74 ns<\/td>\n<td style=\"text-align: right\">0.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>You may note that these new <code>AddRange<\/code>, <code>InsertRange<\/code>, and <code>CopyTo<\/code> methods were added as extension methods rather than as instance methods on <code>List&lt;T&gt;<\/code>. That was done for a few reasons, but the primary motivating factor was avoiding ambiguity. 
Consider this example:<\/p>\n<pre><code class=\"language-C#\">var c = new MyCollection&lt;int&gt;();\r\nc.AddRange(new int[] { 1, 2, 3 });\r\n\r\npublic class MyCollection&lt;T&gt;\r\n{\r\n    public void AddRange(IEnumerable&lt;T&gt; source) { }\r\n    public void AddRange(ReadOnlySpan&lt;T&gt; source) { }\r\n}<\/code><\/pre>\n<p>This will fail to compile with:<\/p>\n<blockquote>\n<p>error CS0121: The call is ambiguous between the following methods or properties: &#8216;MyCollection.AddRange(IEnumerable)&#8217; and &#8216;MyCollection.AddRange(ReadOnlySpan)&#8217;<\/p>\n<\/blockquote>\n<p>because an array <code>T[]<\/code> both implements <code>IEnumerable&lt;T&gt;<\/code> and has an implicit conversion to <code>ReadOnlySpan&lt;T&gt;<\/code>, and as such the compiler doesn&#8217;t know which to use. It&#8217;s likely this ambiguity will be resolved in a future version of the language, but for now we resolved it ourselves by making the span-based overload an extension method:<\/p>\n<pre><code class=\"language-C#\">namespace System.Collections.Generic\r\n{\r\n    public static class CollectionExtensions\r\n    {\r\n        public static void AddRange&lt;T&gt;(this List&lt;T&gt; list, ReadOnlySpan&lt;T&gt; source) { ... }\r\n    }\r\n}<\/code><\/pre>\n<p>The other significant addition for <code>List&lt;T&gt;<\/code> comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82146\">dotnet\/runtime#82146<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a>. In .NET 5, the <code>CollectionsMarshal.AsSpan(List&lt;T&gt;)<\/code> method was added; it returns a <code>Span&lt;T&gt;<\/code> for the in-use area of a <code>List&lt;T&gt;<\/code>&#8216;s backing store. 
For example, if you write:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;int&gt;(42) { 1, 2, 3 };\r\nSpan&lt;int&gt; span = CollectionsMarshal.AsSpan(list);<\/code><\/pre>\n<p>that will provide you with a <code>Span&lt;int&gt;<\/code> with length 3, since the list&#8217;s <code>Count<\/code> is 3. This is very useful for a variety of scenarios, in particular for consuming a <code>List&lt;T&gt;<\/code>&#8216;s data via span-based APIs. It doesn&#8217;t, however, enable scenarios that want to efficiently write to a <code>List&lt;T&gt;<\/code>, in particular where it would require increasing a <code>List&lt;T&gt;<\/code>&#8216;s count. Let&#8217;s say, for example, you wanted to create a new <code>List&lt;char&gt;<\/code> that contained 100 &#8216;a&#8217; values. You might think you could write:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;char&gt;(100);\r\nSpan&lt;char&gt; span = CollectionsMarshal.AsSpan(list); \/\/ oops\r\nspan.Fill('a');<\/code><\/pre>\n<p>but that won&#8217;t impact the contents of the created list at all, because the span&#8217;s <code>Length<\/code> will match the <code>Count<\/code> of the list: 0. What we need to be able to do is change the count of the list, effectively telling it &#8220;pretend like 100 values were just added to you, even though they weren&#8217;t.&#8221; This PR adds the new <code>SetCount<\/code> method, which does just that. 
We can now write the previous example like:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;char&gt;();\r\nCollectionsMarshal.SetCount(list, 100);\r\nSpan&lt;char&gt; span = CollectionsMarshal.AsSpan(list);\r\nspan.Fill('a'); \/\/ yay!<\/code><\/pre>\n<p>and we will successfully find ourselves with a list containing 100 &#8216;a&#8217; elements.<\/p>\n<h2>LINQ<\/h2>\n<p>That new <code>SetCount<\/code> method is not only exposed publicly, it&#8217;s also used as an implementation detail now in LINQ (Language-Integrated Query), thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85288\">dotnet\/runtime#85288<\/a>. <code>Enumerable<\/code>&#8216;s <code>ToList<\/code> method now benefits from this in a variety of places. For example, calling <code>Enumerable.Repeat('a', 100).ToList()<\/code> will behave very much like the previous example (albeit with an extra enumerable allocation for the <code>Repeat<\/code>), creating a new list, using <code>SetCount<\/code> to set its count to 100, getting the backing span, and calling <code>Fill<\/code> to populate it. 
The impact of directly writing to the span rather than going through <code>List&lt;T&gt;.Add<\/code> for each item is visible in the following examples:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly IEnumerable&lt;int&gt; _source = Enumerable.Range(0, 1024).ToArray();\r\n\r\n    [Benchmark]\r\n    public List&lt;int&gt; SelectToList() =&gt; _source.Select(i =&gt; i * 2).ToList();\r\n\r\n    [Benchmark]\r\n    public List&lt;byte&gt; RepeatToList() =&gt; Enumerable.Repeat((byte)'a', 1024).ToList();\r\n\r\n    [Benchmark]\r\n    public List&lt;int&gt; RangeSelectToList() =&gt; Enumerable.Range(0, 1024).Select(i =&gt; i * 2).ToList();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SelectToList<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2,627.8 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SelectToList<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1,096.6 ns<\/td>\n<td style=\"text-align: right\">0.42<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>RepeatToList<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,543.2 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>RepeatToList<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">106.1 ns<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<\/tr>\n<tr>\n<td>RangeSelectToList<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2,908.9 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>RangeSelectToList<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">865.2 ns<\/td>\n<td style=\"text-align: right\">0.29<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In the case of <code>SelectToList<\/code> and <code>RangeSelectToList<\/code>, the benefit is almost entirely due to writing directly into the span for each element vs the overhead of <code>Add<\/code>. In the case of <code>RepeatToList<\/code>, because the <code>ToList<\/code> call has direct access to the span, it&#8217;s able to use the vectorized <code>Fill<\/code> method (as it was previously doing just for <code>ToArray<\/code>), achieving an even larger speedup.<\/p>\n<p>You&#8217;ll note that I didn&#8217;t include a test for <code>Enumerable.Range(...).ToList()<\/code> above. That&#8217;s because it was improved in other ways, and I didn&#8217;t want to conflate them in the measurements. In particular, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87992\">dotnet\/runtime#87992<\/a> from <a href=\"https:\/\/github.com\/neon-sunset\">@neon-sunset<\/a> vectorized the internal <code>Fill<\/code> method that&#8217;s used by the specialization of both <code>ToArray<\/code> and <code>ToList<\/code> on the iterator returned from <code>Enumerable.Range<\/code>. That means that rather than writing one <code>int<\/code> at a time, on a system that supports 128-bit vectors (which is pretty much all hardware you might use today) it&#8217;ll instead write four <code>int<\/code>s at a time, and on a system that supports 256-bit vectors, it&#8217;ll write eight <code>int<\/code>s at a time. 
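<\/p>\n<p>To illustrate the idea only, here is a simplified sketch of filling a span with consecutive values one vector at a time. It uses <code>Vector&lt;T&gt;<\/code> for brevity and is not the actual internal <code>Fill<\/code> implementation; <code>FillRange<\/code> is a hypothetical helper name, and the length and starting value are arbitrary.<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Numerics;\r\n\r\nSpan&lt;int&gt; destination = new int[19];\r\nFillRange(destination, 10);\r\nConsole.WriteLine(string.Join(\", \", destination.ToArray()));\r\n\r\n\/\/ Hypothetical helper: writes start, start + 1, ... into destination.\r\nstatic void FillRange(Span&lt;int&gt; destination, int start)\r\n{\r\n    int i = 0;\r\n    int width = Vector&lt;int&gt;.Count; \/\/ 4 with 128-bit vectors, 8 with 256-bit vectors\r\n\r\n    if (Vector.IsHardwareAccelerated &amp;&amp; destination.Length &gt;= width)\r\n    {\r\n        \/\/ Seed a vector with [start, start + 1, ..., start + width - 1].\r\n        Span&lt;int&gt; seed = stackalloc int[width];\r\n        for (int j = 0; j &lt; width; j++) seed[j] = start + j;\r\n\r\n        var current = new Vector&lt;int&gt;(seed);\r\n        var step = new Vector&lt;int&gt;(width); \/\/ every lane holds width\r\n\r\n        \/\/ Store one full vector per iteration, then advance every lane by width.\r\n        for (; i &lt;= destination.Length - width; i += width)\r\n        {\r\n            current.CopyTo(destination.Slice(i));\r\n            current += step;\r\n        }\r\n    }\r\n\r\n    \/\/ Scalar loop for the remaining elements (or for everything, if not accelerated).\r\n    for (; i &lt; destination.Length; i++) destination[i] = start + i;\r\n}<\/code><\/pre>\n<p>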
Thus, <code>Enumerable.Range(...).ToList()<\/code> benefits both from writing directly into the span and from the now vectorized implementation, which means it ends up with similar speedups as <code>RepeatToList<\/code> above. We can also tease apart these improvements by changing what instruction sets are seen as available.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public List&lt;int&gt; RangeToList() =&gt; Enumerable.Range(0, 16_384).ToList();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RangeToList<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">25.374 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>RangeToList<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">6.872 us<\/td>\n<td style=\"text-align: right\">0.27<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These optimized span-based implementations now also accrue to other usage beyond <code>ToArray<\/code> and <code>ToList<\/code>. 
If you look at the <code>Enumerable.Repeat<\/code> and <code>Enumerable.Range<\/code> implementations in .NET Framework, you&#8217;ll see that they&#8217;re just normal C# iterators, e.g.<\/p>\n<pre><code class=\"language-C#\">static IEnumerable&lt;int&gt; RangeIterator(int start, int count)\r\n{\r\n    for (int i = 0; i &lt; count; i++)\r\n    {\r\n        yield return start + i;\r\n    }\r\n}<\/code><\/pre>\n<p>but years ago, these methods were changed in .NET Core to return a custom iterator (just a normal class implementing <code>IEnumerator&lt;T&gt;<\/code> where we provide the full implementation rather than the compiler doing so). Once we have a dedicated type, we can add additional interfaces to it, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88249\">dotnet\/runtime#88249<\/a> does exactly that, making these internal <code>RangeIterator<\/code>, <code>RepeatIterator<\/code>, and several other types implement <code>IList&lt;T&gt;<\/code>. That then means that any code which queries an <code>IEnumerable&lt;T&gt;<\/code> for whether it implements <code>IList&lt;T&gt;<\/code>, such as to use its <code>Count<\/code> and <code>CopyTo<\/code> methods, will light up when passed one of these instances as well. And the same <code>Fill<\/code> implementation that&#8217;s used internally to implement <code>ToArray<\/code> and <code>ToList<\/code> is then used as well with <code>CopyTo<\/code>. 
That means if you write code like:<\/p>\n<pre><code class=\"language-C#\">List&lt;T&gt; list = ...;\r\nIEnumerable&lt;T&gt; enumerable = ...;\r\nlist.AddRange(enumerable);<\/code><\/pre>\n<p>and that <code>enumerable<\/code> came from one of these enlightened types, it&#8217;ll now benefit from the exact same use of vectorization previously discussed, as the <code>List&lt;T&gt;<\/code> will ensure its array is appropriately sized to handle the incoming data and will then hand its array off to the iterator&#8217;s <code>ICollection&lt;T&gt;.CopyTo<\/code> method to write into directly.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly List&lt;byte&gt; _list = new();\r\n\r\n    [Benchmark]\r\n    public void AddRange()\r\n    {\r\n        _list.Clear();\r\n        _list.AddRange(Enumerable.Repeat((byte)'a', 1024));\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AddRange<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">6,826.89 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>AddRange<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">20.30 ns<\/td>\n<td style=\"text-align: right\">0.003<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Vectorization with LINQ was also improved in other ways. 
In .NET 7, <code>Enumerable.Min<\/code> and <code>Enumerable.Max<\/code> were taught how to vectorize the handling of some inputs (when the enumerable was actually an array or list of <code>int<\/code> or <code>long<\/code> values), and in .NET 8 <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76144\">dotnet\/runtime#76144<\/a> expanded that to cover <code>byte<\/code>, <code>sbyte<\/code>, <code>ushort<\/code>, <code>short<\/code>, <code>uint<\/code>, <code>ulong<\/code>, <code>nint<\/code>, and <code>nuint<\/code> as well (it also switched the implementation from using <code>Vector&lt;T&gt;<\/code> to using both <code>Vector128&lt;T&gt;<\/code> and <code>Vector256&lt;T&gt;<\/code>, so that shorter inputs could still benefit from some level of vectorization).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _values = Enumerable.Range(0, 4096).Select(_ =&gt; (byte)Random.Shared.Next(0, 256)).ToArray();\r\n\r\n    [Benchmark]\r\n    public byte Max() =&gt; _values.Max();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Max<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">16,496.96 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>Max<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">53.77 ns<\/td>\n<td style=\"text-align: right\">0.003<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Enumerable.Sum<\/code> has now also been vectorized, for <code>int<\/code> and <code>long<\/code>, thanks to <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84519\">dotnet\/runtime#84519<\/a> from <a href=\"https:\/\/github.com\/brantburnett\">@brantburnett<\/a>. <code>Sum<\/code> in LINQ performs <code>checked<\/code> arithmetic, and normal <code>Vector&lt;T&gt;<\/code> operations are <code>unchecked<\/code>, which makes the vectorization of this method a bit more challenging. To achieve it, it takes advantage of a neat little bit hack for determining whether an addition of two signed two&#8217;s-complement numbers underflows or overflows. The same logic applies for both <code>int<\/code> and <code>long<\/code> here, so we&#8217;ll focus just on <code>int<\/code>. It&#8217;s impossible for the addition of a negative <code>int<\/code> to overflow when added to a positive <code>int<\/code>, so the only way two summed values can underflow or overflow is if they have the same sign. Further, if any wrapping occurs, it can&#8217;t wrap back to the same sign; if you add two positive numbers together and it overflows, the result will be negative, and if you add two negative numbers together and it underflows, the result will be positive. Thus, a function like this can tell us whether the sum wrapped:<\/p>\n<pre><code class=\"language-C#\">static int Sum(int a, int b, out bool overflow)\r\n{\r\n    int sum = a + b;\r\n    overflow = (((sum ^ a) &amp; (sum ^ b)) &amp; int.MinValue) != 0;\r\n    return sum;\r\n}<\/code><\/pre>\n<p>We&#8217;re <code>xor<\/code>&#8216;ing the result with each of the inputs, and <code>and<\/code>&#8216;ing those together. That will produce a number whose top-most bit is 1 if there was overflow\/underflow, and otherwise 0, so we can then mask off all the other bits and compare to 0 to determine whether wrapping occurred. 
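<\/p>\n<p>As a quick sanity check of that sign-bit test, here is a small sketch that applies it to a few hand-picked inputs. <code>Wrapped<\/code> is a hypothetical helper name, and the explicit <code>unchecked<\/code> is only there to make the intended wrapping visible.<\/p>\n<pre><code class=\"language-C#\">using System;\r\n\r\nConsole.WriteLine(Wrapped(int.MaxValue, 1));  \/\/ True: positive + positive wrapped to negative\r\nConsole.WriteLine(Wrapped(int.MinValue, -1)); \/\/ True: negative + negative wrapped to positive\r\nConsole.WriteLine(Wrapped(int.MaxValue, -1)); \/\/ False: mixed signs can't wrap\r\nConsole.WriteLine(Wrapped(1, 2));             \/\/ False: no wrapping\r\n\r\n\/\/ Hypothetical helper: true if a + b wrapped around.\r\nstatic bool Wrapped(int a, int b)\r\n{\r\n    int sum = unchecked(a + b);\r\n    \/\/ The sign bit survives the mask only if sum's sign differs from both inputs' signs.\r\n    return (((sum ^ a) &amp; (sum ^ b)) &amp; int.MinValue) != 0;\r\n}<\/code><\/pre>\n<p>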
This is useful for vectorization, because we can easily do the same thing with vectors, summing the two vectors and reporting on whether any of the elemental sums overflowed:<\/p>\n<pre><code class=\"language-C#\">static Vector128&lt;int&gt; Sum(Vector128&lt;int&gt; a, Vector128&lt;int&gt; b, out bool overflow)\r\n{\r\n    Vector128&lt;int&gt; sum = a + b;\r\n    overflow = (((sum ^ a) &amp; (sum ^ b)) &amp; Vector128.Create(int.MinValue)) != Vector128&lt;int&gt;.Zero;\r\n    return sum;\r\n}<\/code><\/pre>\n<p>With that, <code>Enumerable.Sum<\/code> can be vectorized. For sure, it&#8217;s not as efficient as if we didn&#8217;t need to care about the <code>checked<\/code>; after all, for every addition operation, there&#8217;s at least an extra set of instructions for the two <code>xor<\/code>s and the <code>and<\/code>&#8216;ing of them (we can amortize the bit check across several operations by doing some loop unrolling). With 256-bit vectors, an ideal speedup for such a sum operation over <code>int<\/code> values would be 8x, since we can process eight 32-bit values at a time in a 256-bit vector. 
We&#8217;re then doing fairly well that we get a 4x speedup in that situation:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly IEnumerable&lt;int&gt; _values = Enumerable.Range(0, 1024).ToArray();\r\n\r\n    [Benchmark]\r\n    public int Sum() =&gt; _values.Sum();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">347.28 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">78.26 ns<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>LINQ has improved in .NET 8 beyond just vectorization; other operators have seen other kinds of optimization. Take <code>Order<\/code>\/<code>OrderDescending<\/code>, for example. These LINQ operators implement a &#8220;stable sort&#8221;; that means that while sorting the data, if two items compare equally, they&#8217;ll end up in the final result in the same order they were in the original (an &#8220;unstable sort&#8221; doesn&#8217;t care about the ordering of two values that compare equally). The core sorting routine shared by spans, arrays, and lists in .NET (e.g. <code>Array.Sort<\/code>) provides an unstable sort, so to use that implementation and provide stable ordering guarantees, LINQ has to layer the stability on top, which it does by factoring into the comparison operation between keys the original location of the key in the input (e.g. 
if two values otherwise compare equally, then it proceeds to compare their original locations). That, however, means it needs to remember their original locations, which means it needs to allocate a separate <code>int[]<\/code> for positions. Interestingly, though, sometimes you can&#8217;t tell the difference between whether a sort is stable or unstable. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76733\">dotnet\/runtime#76733<\/a> takes advantage of the fact that for primitive types like <code>int<\/code>, two values that compare equally with the default comparer are indistinguishable, in which case it&#8217;s fine to use an unstable sort because the only values that can compare equally have identical bits and thus trying to maintain an order between them doesn&#8217;t matter. It thus enables avoiding all of the overhead associated with maintaining a stable sort.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private IEnumerable&lt;int&gt; _source;\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; _source = Enumerable.Range(0, 1000).Reverse();\r\n\r\n    [Benchmark]\r\n    public int EnumerateOrdered()\r\n    {\r\n        int sum = 0;\r\n        foreach (int i in _source.Order()) \r\n        {\r\n            sum += i;\r\n        }\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc 
Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>EnumerateOrdered<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">73.728 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">8.09 KB<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>EnumerateOrdered<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">9.753 us<\/td>\n<td style=\"text-align: right\">0.13<\/td>\n<td style=\"text-align: right\">4.02 KB<\/td>\n<td style=\"text-align: right\">0.50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76418\">dotnet\/runtime#76418<\/a> also improves sorting in LINQ, this time for <code>OrderBy<\/code>\/<code>OrderByDescending<\/code>, and in particular when the type of the key used (the type returned by the <code>keySelector<\/code> delegate provided to <code>OrderBy<\/code>) is a value type and the default comparer is used. This change employs the same approach that some of the .NET collections like <code>Dictionary&lt;TKey, TValue&gt;<\/code> already do, which is to take advantage of the fact that value types when used as generics get a custom copy of the code dedicated to that type (&#8220;generic specialization&#8221;), and that <code>Comparer&lt;TValueType&gt;.Default.Compare<\/code> will get devirtualized and possibly inlined. 
As such, it adds a dedicated path for when the key is a value type, and that enables the comparison operation (which is invoked <code>O(n log n)<\/code> times) to be sped up.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly int[] _values = Enumerable.Range(0, 1_000_000).Reverse().ToArray();\r\n\r\n    [Benchmark]\r\n    public int OrderByToArray()\r\n    {\r\n        int sum = 0;\r\n        foreach (int i in _values.OrderBy(i =&gt; i * 2)) sum += i;\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>OrderByToArray<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">187.17 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>OrderByToArray<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">67.54 ms<\/td>\n<td style=\"text-align: right\">0.36<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, sometimes the most efficient use of LINQ is simply not using it. It&#8217;s an amazing productivity tool, and it goes to great lengths to be efficient, but sometimes there are better answers that are just as simple. <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1860\">CA1860<\/a>, added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6236\">dotnet\/roslyn-analyzers#6236<\/a> from <a href=\"https:\/\/github.com\/CollinAlpert\">@CollinAlpert<\/a>, flags one such case. 
It looks for use of <code>Enumerable.Any<\/code> on collections that directly expose a <code>Count<\/code>, <code>Length<\/code>, or <code>IsEmpty<\/code> property that could be used instead. While <code>Any<\/code> does use <code>Enumerable.TryGetNonEnumeratedCount<\/code> in an attempt to check the collection&#8217;s number of items without allocating or using an enumerator, even if it&#8217;s successful in doing so it incurs the overhead of the interface check and dispatch. It&#8217;s faster to just use the properties directly. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81583\">dotnet\/runtime#81583<\/a> fixed several cases of this.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1860.png\" alt=\"CA1860\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly string _str = \"hello\";\r\n    private readonly List&lt;int&gt; _list = new() { 1, 2, 3 };\r\n    private readonly int[] _array = new int[] { 4, 5, 6 };\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool AllNonEmpty_Any() =&gt;\r\n        _str.Any() &amp;&amp;\r\n        _list.Any() &amp;&amp;\r\n        _array.Any();\r\n\r\n    [Benchmark]\r\n    public bool AllNonEmpty_Property() =&gt;\r\n        _str.Length != 0 &amp;&amp;\r\n        _list.Count != 0 &amp;&amp;\r\n        _array.Length != 0;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AllNonEmpty_Any<\/td>\n<td style=\"text-align: right\">12.5302 ns<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AllNonEmpty_Property<\/td>\n<td style=\"text-align: right\">0.3701 ns<\/td>\n<td style=\"text-align: right\">0.03<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Dictionary<\/h2>\n<p>In addition to making existing methods faster, LINQ has also gained some new methods in .NET 8. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85811\">dotnet\/runtime#85811<\/a> from <a href=\"https:\/\/github.com\/lateapexearlyspeed\">@lateapexearlyspeed<\/a> added new overloads of <code>ToDictionary<\/code>. Unlike the existing overloads that are extensions on any arbitrary <code>IEnumerable&lt;TSource&gt;<\/code> and accept delegates for extracting from each <code>TSource<\/code> a <code>TKey<\/code> and\/or <code>TValue<\/code>, these new overloads are extensions on <code>IEnumerable&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;<\/code> and <code>IEnumerable&lt;(TKey, TValue)&gt;<\/code>. This is primarily an addition for convenience, as it means that such an enumerable that previously used code like:<\/p>\n<pre><code class=\"language-C#\">return collection.ToDictionary(kvp =&gt; kvp.Key, kvp =&gt; kvp.Value);<\/code><\/pre>\n<p>can instead be simplified to just be:<\/p>\n<pre><code class=\"language-C#\">return collection.ToDictionary();<\/code><\/pre>\n<p>Beyond being simpler, this has the nice benefit of also being cheaper, as it means the method doesn&#8217;t need to invoke two delegates per item. It also means that this new method is a simple passthrough to <code>Dictionary&lt;TKey, TValue&gt;<\/code>&#8216;s constructor, which has its own optimizations that take advantage of knowing about <code>Dictionary&lt;TKey, TValue&gt;<\/code> internals, e.g. 
it can more efficiently copy the source data if it&#8217;s a <code>Dictionary&lt;TKey, TValue&gt;<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly IEnumerable&lt;KeyValuePair&lt;string, int&gt;&gt; _source = Enumerable.Range(0, 1024).ToDictionary(i =&gt; i.ToString(), i =&gt; i);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public Dictionary&lt;string, int&gt; WithDelegates() =&gt; _source.ToDictionary(kvp =&gt; kvp.Key, kvp =&gt; kvp.Value);\r\n\r\n    [Benchmark]\r\n    public Dictionary&lt;string, int&gt; WithoutDelegates() =&gt; _source.ToDictionary();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithDelegates<\/td>\n<td style=\"text-align: right\">21.208 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WithoutDelegates<\/td>\n<td style=\"text-align: right\">8.652 us<\/td>\n<td style=\"text-align: right\">0.41<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>It also benefits from the <code>Dictionary&lt;TKey, TValue&gt;<\/code>&#8216;s constructor being optimized in additional ways. As noted, its constructor accepting an <code>IEnumerable&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;<\/code> already special-cased when the enumerable is actually a <code>Dictionary&lt;TKey, TValue&gt;<\/code>. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86254\">dotnet\/runtime#86254<\/a>, it now also special-cases when the enumerable is a <code>KeyValuePair&lt;TKey, TValue&gt;[]<\/code> or a <code>List&lt;KeyValuePair&lt;TKey, TValue&gt;&gt;<\/code>. 
When such a source is found, a span is extracted from it (a simple cast for an array, or via <code>CollectionsMarshal.AsSpan<\/code> for a <code>List&lt;&gt;<\/code>), and then that span (rather than the original <code>IEnumerable&lt;&gt;<\/code>) is what&#8217;s enumerated. That saves an enumerator allocation and several interface dispatches per item for these reasonably common cases.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly List&lt;KeyValuePair&lt;int, int&gt;&gt; _list = Enumerable.Range(0, 1000).Select(i =&gt; new KeyValuePair&lt;int, int&gt;(i, i)).ToList();\r\n\r\n    [Benchmark] public Dictionary&lt;int, int&gt; FromList() =&gt; new Dictionary&lt;int, int&gt;(_list);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FromList<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">12.250 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FromList<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">6.780 us<\/td>\n<td style=\"text-align: right\">0.55<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The most common operation performed on a dictionary is looking up a key, whether to see if it exists, to add a value, or to get the current value. Previous .NET releases have seen significant improvements in this lookup time, but even better than optimizing a lookup is not needing to do one at all. 
One common source of redundant lookups is a guard clause that isn&#8217;t actually needed, for example code that does:<\/p>\n<pre><code class=\"language-C#\">if (!dictionary.ContainsKey(key))\r\n{\r\n    dictionary.Add(key, value);\r\n}<\/code><\/pre>\n<p>This incurs two lookups: one as part of <code>ContainsKey<\/code> and then, if the key wasn&#8217;t in the dictionary, another as part of the <code>Add<\/code> call. Code can instead achieve the same operation with:<\/p>\n<pre><code class=\"language-C#\">dictionary.TryAdd(key, value);<\/code><\/pre>\n<p>which incurs only one lookup. <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/CA1864\">CA1864<\/a>, added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6199\">dotnet\/roslyn-analyzers#6199<\/a> from <a href=\"https:\/\/github.com\/CollinAlpert\">@CollinAlpert<\/a>, looks for such places where an <code>Add<\/code> call is guarded by a <code>ContainsKey<\/code> call. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88700\">dotnet\/runtime#88700<\/a> fixed a few occurrences of this in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1864.png\" alt=\"CA1864\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly Dictionary&lt;string, string&gt; _dict = new();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public void ContainsThenAdd()\r\n    {\r\n        _dict.Clear();\r\n        if (!_dict.ContainsKey(\"key\"))\r\n        {\r\n            _dict.Add(\"key\", \"value\");\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public void TryAdd()\r\n    {\r\n        _dict.Clear();\r\n        _dict.TryAdd(\"key\", \"value\");\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ContainsThenAdd<\/td>\n<td style=\"text-align: right\">25.93 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryAdd<\/td>\n<td style=\"text-align: right\">19.50 ns<\/td>\n<td style=\"text-align: right\">0.75<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Similarly, <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6767\">dotnet\/roslyn-analyzers#6767<\/a> from <a href=\"https:\/\/github.com\/mpidash\">@mpidash<\/a> added <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/CA1868\">CA1868<\/a>, which looks for <code>Add<\/code> or <code>Remove<\/code> calls on <code>ISet&lt;T&gt;<\/code>s where the call 
is guarded by a <code>Contains<\/code>, and recommends removing the <code>Contains<\/code> call. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89652\">dotnet\/runtime#89652<\/a> from <a href=\"https:\/\/github.com\/mpidash\">@mpidash<\/a> fixes occurrences of this in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>.\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1868.png\" alt=\"CA1868\" \/><\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly HashSet&lt;string&gt; _set = new();\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool ContainsThenAdd()\r\n    {\r\n        _set.Clear();\r\n        if (!_set.Contains(\"key\"))\r\n        {\r\n            _set.Add(\"key\");\r\n            return true;\r\n        }\r\n\r\n        return false;\r\n    }\r\n\r\n    [Benchmark]\r\n    public bool Add()\r\n    {\r\n        _set.Clear();\r\n        return _set.Add(\"key\");\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ContainsThenAdd<\/td>\n<td style=\"text-align: right\">22.98 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Add<\/td>\n<td style=\"text-align: right\">17.99 ns<\/td>\n<td style=\"text-align: right\">0.78<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other related analyzers previously released have also been improved. 
<a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6387\">dotnet\/roslyn-analyzers#6387<\/a> improved <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1854\">CA1854<\/a> to find more opportunities for using <code>IDictionary&lt;TKey, TValue&gt;.TryGetValue<\/code>, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85613\">dotnet\/runtime#85613<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80996\">dotnet\/runtime#80996<\/a> using the analyzer to find and fix more occurrences.<\/p>\n<p>Other dictionaries have also improved in .NET 8. <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> in particular got a nice boost from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81557\">dotnet\/runtime#81557<\/a>, for all key types but especially for the very common case where <code>TKey<\/code> is <code>string<\/code> and the equality comparer is either the default comparer (whether that be <code>null<\/code>, <code>EqualityComparer&lt;TKey&gt;.Default<\/code>, or <code>StringComparer.Ordinal<\/code>, all of which behave identically) or <code>StringComparer.OrdinalIgnoreCase<\/code>. In .NET Core, <code>string<\/code> hash codes are randomized, meaning there&#8217;s a random seed value unique to any given process that&#8217;s incorporated into string hash codes. 
So if, for example, I run the following program:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -f net8.0\r\n\r\nstring s = \"Hello, world!\";\r\nConsole.WriteLine(s.GetHashCode());\r\nConsole.WriteLine(s.GetHashCode());\r\nConsole.WriteLine(s.GetHashCode());<\/code><\/pre>\n<p>I get the following output, showing that the hash code for a given string is stable across multiple <code>GetHashCode<\/code> calls:<\/p>\n<pre><code class=\"language-text\">1442385232\r\n1442385232\r\n1442385232<\/code><\/pre>\n<p>but when I run the program again, I get a different stable value:<\/p>\n<pre><code class=\"language-text\">740992523\r\n740992523\r\n740992523<\/code><\/pre>\n<p>This randomization is done to help mitigate a class of denial-of-service (DoS) attacks involving dictionaries, where an attacker might be able to trigger the worst-case algorithmic complexity of a dictionary by forcing lots of collisions amongst the keys. However, the randomization also incurs some amount of overhead. It&#8217;s enough overhead so that <code>Dictionary&lt;TKey, TValue&gt;<\/code> actually special-cases <code>string<\/code> keys with a default or <code>OrdinalIgnoreCase<\/code> comparer to skip the randomization until a sufficient number of collisions has been detected. Now in .NET 8, <code>ConcurrentDictionary&lt;string, TValue&gt;<\/code> employs the same trick. When it starts life, a <code>ConcurrentDictionary&lt;string, TValue&gt;<\/code> instance using a default or <code>OrdinalIgnoreCase<\/code> comparer performs hashing using a non-randomized comparer. Then as it&#8217;s adding an item and traversing its internal data structure, it keeps track of how many keys it has to examine that had the same hash code. 
If that count surpasses a threshold, it then switches back to using a randomized comparer, rehashing the whole dictionary in order to mitigate possible attacks.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Concurrent;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private KeyValuePair&lt;string, string&gt;[] _pairs;\r\n    private ConcurrentDictionary&lt;string, string&gt; _cd;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _pairs =\r\n            \/\/ from https:\/\/github.com\/dotnet\/runtime\/blob\/a30de6d40f69ef612b514344a5ec83fffd10b957\/src\/libraries\/System.Formats.Asn1\/src\/System\/Formats\/Asn1\/WellKnownOids.cs#L317-L419\r\n            new[]\r\n            {\r\n                \"1.2.840.10040.4.1\", \"1.2.840.10040.4.3\", \"1.2.840.10045.2.1\", \"1.2.840.10045.1.1\", \"1.2.840.10045.1.2\", \"1.2.840.10045.3.1.7\", \"1.2.840.10045.4.1\", \"1.2.840.10045.4.3.2\", \"1.2.840.10045.4.3.3\", \"1.2.840.10045.4.3.4\",\r\n                \"1.2.840.113549.1.1.1\", \"1.2.840.113549.1.1.5\", \"1.2.840.113549.1.1.7\", \"1.2.840.113549.1.1.8\", \"1.2.840.113549.1.1.9\", \"1.2.840.113549.1.1.10\", \"1.2.840.113549.1.1.11\", \"1.2.840.113549.1.1.12\", \"1.2.840.113549.1.1.13\",\r\n                \"1.2.840.113549.1.5.3\", \"1.2.840.113549.1.5.10\", \"1.2.840.113549.1.5.11\", \"1.2.840.113549.1.5.12\", \"1.2.840.113549.1.5.13\", \"1.2.840.113549.1.7.1\", \"1.2.840.113549.1.7.2\", \"1.2.840.113549.1.7.3\", \"1.2.840.113549.1.7.6\",\r\n                \"1.2.840.113549.1.9.1\", \"1.2.840.113549.1.9.3\", \"1.2.840.113549.1.9.4\", \"1.2.840.113549.1.9.5\", \"1.2.840.113549.1.9.6\", \"1.2.840.113549.1.9.7\", \"1.2.840.113549.1.9.14\", 
\"1.2.840.113549.1.9.15\", \"1.2.840.113549.1.9.16.1.4\",\r\n                \"1.2.840.113549.1.9.16.2.12\", \"1.2.840.113549.1.9.16.2.14\", \"1.2.840.113549.1.9.16.2.47\", \"1.2.840.113549.1.9.20\", \"1.2.840.113549.1.9.21\", \"1.2.840.113549.1.9.22.1\", \"1.2.840.113549.1.12.1.3\", \"1.2.840.113549.1.12.1.5\",\r\n                \"1.2.840.113549.1.12.1.6\", \"1.2.840.113549.1.12.10.1.1\", \"1.2.840.113549.1.12.10.1.2\", \"1.2.840.113549.1.12.10.1.3\", \"1.2.840.113549.1.12.10.1.5\", \"1.2.840.113549.1.12.10.1.6\", \"1.2.840.113549.2.5\", \"1.2.840.113549.2.7\",\r\n                \"1.2.840.113549.2.9\", \"1.2.840.113549.2.10\", \"1.2.840.113549.2.11\", \"1.2.840.113549.3.2\", \"1.2.840.113549.3.7\", \"1.3.6.1.4.1.311.17.1\", \"1.3.6.1.4.1.311.17.3.20\", \"1.3.6.1.4.1.311.20.2.3\", \"1.3.6.1.4.1.311.88.2.1\",\r\n                \"1.3.6.1.4.1.311.88.2.2\", \"1.3.6.1.5.5.7.3.1\", \"1.3.6.1.5.5.7.3.2\", \"1.3.6.1.5.5.7.3.3\", \"1.3.6.1.5.5.7.3.4\", \"1.3.6.1.5.5.7.3.8\", \"1.3.6.1.5.5.7.3.9\", \"1.3.6.1.5.5.7.6.2\", \"1.3.6.1.5.5.7.48.1\", \"1.3.6.1.5.5.7.48.1.2\",\r\n                \"1.3.6.1.5.5.7.48.2\", \"1.3.14.3.2.26\", \"1.3.14.3.2.7\", \"1.3.132.0.34\", \"1.3.132.0.35\", \"2.5.4.3\", \"2.5.4.5\", \"2.5.4.6\", \"2.5.4.7\", \"2.5.4.8\", \"2.5.4.10\", \"2.5.4.11\", \"2.5.4.97\", \"2.5.29.14\", \"2.5.29.15\", \"2.5.29.17\", \"2.5.29.19\",\r\n                \"2.5.29.20\", \"2.5.29.35\", \"2.16.840.1.101.3.4.1.2\", \"2.16.840.1.101.3.4.1.22\", \"2.16.840.1.101.3.4.1.42\", \"2.16.840.1.101.3.4.2.1\", \"2.16.840.1.101.3.4.2.2\", \"2.16.840.1.101.3.4.2.3\", \"2.23.140.1.2.1\", \"2.23.140.1.2.2\",\r\n            }.Select(s =&gt; new KeyValuePair&lt;string, string&gt;(s, s)).ToArray();\r\n        _cd = new ConcurrentDictionary&lt;string, string&gt;(_pairs, StringComparer.OrdinalIgnoreCase);\r\n    }\r\n\r\n    [Benchmark]\r\n    public int TryGetValue()\r\n    {\r\n        int count = 0;\r\n        foreach (KeyValuePair&lt;string, string&gt; pair in _pairs)\r\n        
{\r\n            if (_cd.TryGetValue(pair.Key, out _))\r\n            {\r\n                count++;\r\n            }\r\n        }\r\n\r\n        return count;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryGetValue<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">2.917 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryGetValue<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.462 us<\/td>\n<td style=\"text-align: right\">0.50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The above benchmark also benefited from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77005\">dotnet\/runtime#77005<\/a>, which tweaked another long-standing optimization in the type. <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> maintains a <code>Node<\/code> object for every key\/value pair it stores. As multiple threads might be reading from the dictionary concurrent with updates happening, the dictionary needs to be really careful about how it mutates data stored in the collection. If an update is performed that needs to update a <code>TValue<\/code> in an existing node (e.g. <code>cd[existingKey] = newValue<\/code>), the dictionary needs to be very careful to avoid torn reads, such that one thread could be reading the value while another thread is writing the value, leading to the reader seeing part of the old value and part of the new value. It does this by only reusing that same <code>Node<\/code> for an update if it can write the <code>TValue<\/code> atomically. 
It can write it atomically if the <code>TValue<\/code> is a reference type, in which case it&#8217;s simply writing a pointer-sized reference, or if the <code>TValue<\/code> is a primitive value that&#8217;s defined by the platform to always be written atomically when written with appropriate alignment, e.g. <code>int<\/code>, or <code>long<\/code> when in a 64-bit process. To make this check efficient, <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> computes once whether a given <code>TValue<\/code> is writable atomically, storing it into a <code>static readonly<\/code> field, such that in tier 1 compilation, the JIT can treat the value as a <code>const<\/code>. However, this <code>const<\/code> trick doesn&#8217;t always work. The field was on <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> itself, and if one of those generic type parameters ended up being a reference type (e.g. <code>ConcurrentDictionary&lt;object, int&gt;<\/code>), accessing the <code>static readonly<\/code> field would require a generic lookup (the JIT isn&#8217;t currently able to see that the value stored in the field is only dependent on the <code>TValue<\/code> and not on the <code>TKey<\/code>). To fix this, the field was moved to a separate type where <code>TValue<\/code> is the only generic parameter, and a check for <code>typeof(TValue).IsValueType<\/code> (which is itself a JIT intrinsic that manifests as a <code>const<\/code>) is done separately.<\/p>\n<p><code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code>&#8216;s <code>TryRemove<\/code> was also improved this release, via <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82004\">dotnet\/runtime#82004<\/a>. Mutation of a <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> requires taking a lock. However, in the case of <code>TryRemove<\/code>, we only actually need the lock if it&#8217;s possible the item being removed is contained. 
If the number of items protected by the given lock is 0, we know <code>TryRemove<\/code> will be a nop. Thus, this PR added a fast path to <code>TryRemove<\/code> that read the count for that lock and immediately bailed if it was 0.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Concurrent;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly ConcurrentDictionary&lt;int, int&gt; _empty = new();\r\n\r\n    [Benchmark]\r\n    public bool TryRemoveEmpty() =&gt; _empty.TryRemove(default, out _);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryRemoveEmpty<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">26.963 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryRemoveEmpty<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">5.853 ns<\/td>\n<td style=\"text-align: right\">0.22<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Another dictionary that&#8217;s been improved in .NET 8 is <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code>. As background if you haven&#8217;t used this type before, <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code> is a very specialized dictionary based on <code>DependentHandle<\/code>; think of it as every key being a weak reference (so if the GC runs, the key in the dictionary won&#8217;t be counted as a strong root that would keep the object alive), and that if the key is collected, the whole entry is removed from the table. 
It&#8217;s particularly useful in situations where additional data needs to be associated with an object but where for whatever reason you&#8217;re unable to modify that object to have a reference to the additional data. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80059\">dotnet\/runtime#80059<\/a> improves the performance of lookups on a <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code>, in particular for objects that <em>aren&#8217;t<\/em> in the collection, and even more specifically for an object that&#8217;s never been in any dictionary. Since <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code> is about object references, unlike other dictionaries in .NET, it doesn&#8217;t use the default <code>EqualityComparer&lt;TKey&gt;.Default<\/code> to determine whether an object is in the collection; it just uses object reference equality. And that means to get a hash code for an object, it uses the same functionality that the base <code>object.GetHashCode<\/code> does. It can&#8217;t just call <code>GetHashCode<\/code>, as the method could have been overridden, so instead it directly calls to the same public <code>RuntimeHelpers.GetHashCode<\/code> that <code>object.GetHashCode<\/code> uses:<\/p>\n<pre><code class=\"language-C#\">public class Object\r\n{\r\n    public virtual int GetHashCode() =&gt; RuntimeHelpers.GetHashCode(this);\r\n    ...\r\n}<\/code><\/pre>\n<p>This PR tweaks what <code>ConditionalWeakTable&lt;,&gt;<\/code> does here. It introduces a new internal <code>RuntimeHelpers.TryGetHashCode<\/code> that will avoid creating and storing a hash code for the object if the object doesn&#8217;t already have one. It then uses that method from <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code> as part of <code>TryGetValue<\/code> (and <code>Remove<\/code>, and other related APIs). 
If <code>TryGetHashCode<\/code> returns a value indicating the object doesn&#8217;t yet have one, then the operation can early-exit, because for the object to have been stored into the collection, it must have had a hash code generated for it.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private ConditionalWeakTable&lt;SomeObject, Data&gt; _cwt;\r\n    private List&lt;object&gt; _rooted;\r\n    private readonly SomeObject _key = new();\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _cwt = new();\r\n        _rooted = new();\r\n        for (int i = 0; i &lt; 1000; i++)\r\n        {\r\n            SomeObject key = new();\r\n            _rooted.Add(key);\r\n            _cwt.Add(key, new());\r\n        }\r\n    }\r\n\r\n    [Benchmark]\r\n    public int GetValue() =&gt; _cwt.TryGetValue(_key, out Data d) ? 
d.Value : 0;\r\n\r\n    private sealed class SomeObject { }\r\n\r\n    private sealed class Data\r\n    {\r\n        public int Value;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetValue<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">4.533 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetValue<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">3.028 ns<\/td>\n<td style=\"text-align: right\">0.67<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So, improvements to <code>Dictionary&lt;TKey, TValue&gt;<\/code>, <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code>, <code>ConditionalWeakTable&lt;TKey, TValue&gt;<\/code>&#8230; are those the &#8220;end all be all&#8221; of the hash table world? Don&#8217;t be silly&#8230;<\/p>\n<h2>Frozen Collections<\/h2>\n<p>There are many specialized libraries available on NuGet, providing all manner of data structures with this or that optimization or targeted at this or that scenario. Our goal with the core .NET libraries has never been to provide all possible data structures (it&#8217;s actually been a goal not to), but rather to provide the most commonly needed data structures focused on the most commonly needed scenarios, and rely on the ecosystem to provide alternatives where something else is deemed valuable. As a result, we don&#8217;t add new collection types all that frequently; we continually optimize the ones that are there and we routinely augment them with additional functionality, but we rarely introduce brand new collection types. In fact, in the last several years, the only new general-purpose collection type introduced into the core libraries was the <code>PriorityQueue&lt;TElement, TPriority&gt;<\/code> class, which was added in .NET 6. 
However, enough of a need has presented itself that .NET 8 sees the introduction of not one but two new collection types: <code>System.Collections.Frozen.FrozenDictionary&lt;TKey, TValue&gt;<\/code> and <code>System.Collections.Frozen.FrozenSet&lt;T&gt;<\/code>.<\/p>\n<p>Beyond causing &#8220;Let It Go&#8221; to be stuck in your head for the rest of the day (&#8220;you&#8217;re welcome&#8221;), what benefit do these new types provide, especially when we already have <code>System.Collections.Immutable.ImmutableDictionary&lt;TKey, TValue&gt;<\/code> and <code>System.Collections.Immutable.ImmutableHashSet&lt;T&gt;<\/code>? There are enough similarities between the existing immutable collections and the new frozen collections that the latter are actually included in the <code>System.Collections.Immutable<\/code> library, which means they&#8217;re also available as part of the <code>System.Collections.Immutable<\/code> NuGet package. But there are also enough differences to warrant us adding them. In particular, this is an example of where scenario and intended use make a big impact on whether a particular data structure makes sense for your needs.<\/p>\n<p>Arguably, the existing <code>System.Collections.Immutable<\/code> collections were misnamed. Yes, they&#8217;re &#8220;immutable,&#8221; meaning that once you&#8217;ve constructed an instance of one of the collection types, you can&#8217;t change its contents. However, that could have easily been achieved simply by wrapping an immutable facade around one of the existing mutable ones, e.g. 
an immutable dictionary type that just copied the data into a mutable <code>Dictionary&lt;TKey, TValue&gt;<\/code> and exposed only reading operations:<\/p>\n<pre><code class=\"language-C#\">public sealed class MyImmutableDictionary&lt;TKey, TValue&gt; :\r\n    IReadOnlyDictionary&lt;TKey, TValue&gt;\r\n    where TKey : notnull\r\n{\r\n    private readonly Dictionary&lt;TKey, TValue&gt; _data;\r\n\r\n    public MyImmutableDictionary(IEnumerable&lt;KeyValuePair&lt;TKey, TValue&gt;&gt; source) =&gt; _data = source.ToDictionary();\r\n\r\n    public bool TryGetValue(TKey key, [MaybeNullWhen(false)] out TValue value) =&gt; _data.TryGetValue(key, out value);\r\n\r\n    ...\r\n}<\/code><\/pre>\n<p>Yet, if you look at the implementation of <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code>, you&#8217;ll see a ton of code involved in making the type tick. Why? Because it and its friends are optimized for something very different. In academic nomenclature, the immutable collections are actually &#8220;persistent&#8221; collections. A persistent data structure is one that provides mutating operations on the collection (e.g. Add, Remove, etc.) but where those operations don&#8217;t actually change the existing instance, instead resulting in a new instance being created that contains that modification. So, for example, <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> ironically exposes an <code>Add(TKey key, TValue value)<\/code> method, but this method doesn&#8217;t actually modify the collection instance on which it&#8217;s called; instead, it creates and returns a brand new <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> instance, containing all of the key\/value pairs from the original instance as well as the new key\/value pair being added. 
Now, you could imagine that being done simply by copying all of the data to a new <code>Dictionary&lt;TKey, TValue&gt;<\/code> and adding in the new value, e.g.<\/p>\n<pre><code class=\"language-C#\">public sealed class MyPersistentDictionary&lt;TKey, TValue&gt; where TKey : notnull\r\n{\r\n    private readonly Dictionary&lt;TKey, TValue&gt; _data;\r\n\r\n    \/\/ Wraps an already-copied dictionary; callers never see or mutate it directly.\r\n    private MyPersistentDictionary(Dictionary&lt;TKey, TValue&gt; data) =&gt; _data = data;\r\n\r\n    public MyPersistentDictionary&lt;TKey, TValue&gt; Add(TKey key, TValue value)\r\n    {\r\n        var newDictionary = new Dictionary&lt;TKey, TValue&gt;(_data);\r\n        newDictionary.Add(key, value);\r\n        return new MyPersistentDictionary&lt;TKey, TValue&gt;(newDictionary);\r\n    }\r\n\r\n    ...\r\n}<\/code><\/pre>\n<p>but while functional, that&#8217;s terribly inefficient from a memory consumption perspective, as every addition results in a brand new copy of all of the data being made, just to store that one additional pair in the new instance. It&#8217;s also terribly inefficient from an algorithmic complexity perspective, as adding N values would end up being an <code>O(n^2)<\/code> algorithm (each new item would result in copying all previous items). As such, <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> is optimized to share as much as possible between instances. Its implementation uses an <a href=\"https:\/\/en.wikipedia.org\/wiki\/AVL_tree\">AVL tree<\/a>, a self-balancing binary search tree. Adding into such a tree not only requires <code>O(log n)<\/code> time (whereas the full copy shown in <code>MyPersistentDictionary&lt;TKey, TValue&gt;<\/code> above is <code>O(n)<\/code>), it also enables reusing entire portions of a tree between instances of dictionaries. If adding a key\/value pair doesn&#8217;t require mutating a particular subtree, then both the new and old dictionary instances can point to that same subtree, thereby avoiding significant memory increase. 
You can see this from a benchmark like the following:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Immutable;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private const int Items = 10_000;\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public Dictionary&lt;int, int&gt; DictionaryAdds()\r\n    {\r\n        Dictionary&lt;int, int&gt; d = new();\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            var newD = new Dictionary&lt;int, int&gt;(d);\r\n            newD.Add(i, i);\r\n            d = newD;\r\n        }\r\n        return d;\r\n    }\r\n\r\n    [Benchmark]\r\n    public ImmutableDictionary&lt;int, int&gt; ImmutableDictionaryAdds()\r\n    {\r\n        ImmutableDictionary&lt;int, int&gt; d = ImmutableDictionary&lt;int, int&gt;.Empty;\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            d = d.Add(i, i);\r\n        }\r\n        return d;\r\n    }\r\n}<\/code><\/pre>\n<p>which when run on .NET 8 yields the following results for me:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DictionaryAdds<\/td>\n<td style=\"text-align: right\">478.961 ms<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>ImmutableDictionaryAdds<\/td>\n<td style=\"text-align: right\">4.067 ms<\/td>\n<td style=\"text-align: right\">0.009<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>That highlights that the tree-based nature of <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> makes it significantly more efficient (~120x better in both throughput and allocation in this run) for <em>this<\/em> example of performing lots of additions, when 
compared with using a <code>Dictionary&lt;TKey, TValue&gt;<\/code> that&#8217;s treated as immutable for the same purpose. And that&#8217;s why these immutable collections came into being in the first place. The C# compiler uses lots and lots of dictionaries and sets and the like, and it employs a lot of concurrency. It needs to enable one thread to &#8220;tear off&#8221; an immutable view of a collection even while other threads are updating the collection, and for such purposes it uses <code>System.Collections.Immutable<\/code>.<\/p>\n<p>However, just because the above numbers look amazing doesn&#8217;t mean <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> is always the right tool for the immutable job&#8230; it actually rarely is. Why? Because the exact thing that made it so fast and memory efficient for the above benchmark is also its downfall on one of the most common tasks needed for an &#8220;immutable&#8221; dictionary: reading. With its tree-based data structure, not only are adds <code>O(log n)<\/code>, but lookups are also <code>O(log n)<\/code>, which for a large dictionary can be extremely inefficient when compared to the <code>O(1)<\/code> access times of a type like <code>Dictionary&lt;TKey, TValue&gt;<\/code>. We can see this as well with a simple benchmark. 
Let&#8217;s say we&#8217;ve built up a 1,000,000-element dictionary, much as in the previous example, and now we want to query it:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Immutable;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private const int Items = 1_000_000;\r\n\r\n    private static readonly Dictionary&lt;int, int&gt; s_d = new Dictionary&lt;int, int&gt;(Enumerable.Range(0, Items).ToDictionary(x =&gt; x, x =&gt; x));\r\n    private static readonly ImmutableDictionary&lt;int, int&gt; s_id = ImmutableDictionary.CreateRange(Enumerable.Range(0, Items).ToDictionary(x =&gt; x, x =&gt; x));\r\n\r\n    [Benchmark]\r\n    public int EnumerateDictionary()\r\n    {\r\n        int sum = 0;\r\n        foreach (var pair in s_d) sum++;\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int EnumerateImmutableDictionary()\r\n    {\r\n        int sum = 0;\r\n        foreach (var pair in s_id) sum++;\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int IndexerDictionary()\r\n    {\r\n        int sum = 0;\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            sum += s_d[i];\r\n        }\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int IndexerImmutableDictionary()\r\n    {\r\n        int sum = 0;\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            sum += s_id[i];\r\n        }\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>EnumerateImmutableDictionary<\/td>\n<td style=\"text-align: right\">28.065 ms<\/td>\n<\/tr>\n<tr>\n<td>EnumerateDictionary<\/td>\n<td style=\"text-align: 
right\">1.404 ms<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>IndexerImmutableDictionary<\/td>\n<td style=\"text-align: right\">46.538 ms<\/td>\n<\/tr>\n<tr>\n<td>IndexerDictionary<\/td>\n<td style=\"text-align: right\">3.780 ms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Uh oh. Our <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> in this example is ~12x as expensive for lookups and ~20x as expensive for enumeration as <code>Dictionary&lt;TKey, TValue&gt;<\/code>. If your process will be spending most of its time performing reads on the dictionary rather than creating it and\/or performing mutation, that&#8217;s a lot of cycles being left on the table.<\/p>\n<p>And that&#8217;s where frozen collections come in. The collections in <code>System.Collections.Frozen<\/code> are immutable, just as are those in <code>System.Collections.Immutable<\/code>, but they&#8217;re optimized for a different scenario. Whereas the purpose of a type like <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> is to enable efficient mutation (into a new instance), the purpose of <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> is to represent data that never changes, and thus it doesn&#8217;t expose any operations that suggest mutation, only operations for reading. Maybe you&#8217;re loading some configuration data into a dictionary once when your process starts (and then re-loading it only rarely when the configuration changes) and then querying that data over and over and over again. Maybe you&#8217;re creating a mapping from HTTP status codes to delegates representing how those status codes should be handled. Maybe you&#8217;re caching schema information about a set of dynamically-discovered types and then using the resulting parsed information every time you encounter those types later on. 
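<\/p>\n<p>As a rough sketch of the status-code scenario (the handler table and its messages are hypothetical, invented here purely for illustration, and implicit usings are assumed as in the benchmarks), such a mapping might be built once at startup and only read from afterwards:<\/p>\n<pre><code class=\"language-C#\">using System.Collections.Frozen;\r\n\r\n\/\/ Hypothetical handlers, assumed to be fully known when the process starts.\r\nFrozenDictionary&lt;int, Func&lt;string&gt;&gt; handlers = new Dictionary&lt;int, Func&lt;string&gt;&gt;\r\n{\r\n    [200] = () =&gt; \"OK\",\r\n    [404] = () =&gt; \"Not found\",\r\n    [503] = () =&gt; \"Retry later\",\r\n}.ToFrozenDictionary();\r\n\r\n\/\/ From here on the collection is only read; it exposes no Add or Remove.\r\nif (handlers.TryGetValue(404, out var handler))\r\n{\r\n    Console.WriteLine(handler()); \/\/ prints \"Not found\"\r\n}<\/code><\/pre>\n<p>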
Whatever the scenario, you&#8217;re creating an immutable collection that you want to be optimized for reads, and you&#8217;re willing to spend some more cycles creating the collection (because you do it only once, or only once in a while) in order to make reads as fast as possible. That&#8217;s exactly what <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> and <code>FrozenSet&lt;T&gt;<\/code> provide.<\/p>\n<p>Let&#8217;s update our previous example to now also include <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Frozen;\r\nusing System.Collections.Immutable;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private const int Items = 10_000;\r\n\r\n    private static readonly Dictionary&lt;int, int&gt; s_d = new Dictionary&lt;int, int&gt;(Enumerable.Range(0, Items).ToDictionary(x =&gt; x, x =&gt; x));\r\n    private static readonly ImmutableDictionary&lt;int, int&gt; s_id = ImmutableDictionary.CreateRange(Enumerable.Range(0, Items).ToDictionary(x =&gt; x, x =&gt; x));\r\n    private static readonly FrozenDictionary&lt;int, int&gt; s_fd = FrozenDictionary.ToFrozenDictionary(Enumerable.Range(0, Items).ToDictionary(x =&gt; x, x =&gt; x));\r\n\r\n    [Benchmark]\r\n    public int DictionaryGets()\r\n    {\r\n        int sum = 0;\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            sum += s_d[i];\r\n        }\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark]\r\n    public int ImmutableDictionaryGets()\r\n    {\r\n        int sum = 0;\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            sum += s_id[i];\r\n        }\r\n        return sum;\r\n    }\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public 
int FrozenDictionaryGets()\r\n    {\r\n        int sum = 0;\r\n        for (int i = 0; i &lt; Items; i++)\r\n        {\r\n            sum += s_fd[i];\r\n        }\r\n        return sum;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ImmutableDictionaryGets<\/td>\n<td style=\"text-align: right\">360.55 us<\/td>\n<td style=\"text-align: right\">13.89<\/td>\n<\/tr>\n<tr>\n<td>DictionaryGets<\/td>\n<td style=\"text-align: right\">39.43 us<\/td>\n<td style=\"text-align: right\">1.52<\/td>\n<\/tr>\n<tr>\n<td>FrozenDictionaryGets<\/td>\n<td style=\"text-align: right\">25.95 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Now we&#8217;re talkin&#8217;. Whereas for this lookup test <code>Dictionary&lt;TKey, TValue&gt;<\/code> was ~9x faster than <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code>, <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> was 50% faster than even <code>Dictionary&lt;TKey, TValue&gt;<\/code>.<\/p>\n<p>How does that improvement happen? Just as <code>ImmutableDictionary&lt;TKey, TValue&gt;<\/code> doesn&#8217;t just wrap a <code>Dictionary&lt;TKey, TValue&gt;<\/code>, <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> doesn&#8217;t just wrap one, either. It has a customized implementation focused on making read operations as fast as possible, both for lookups and for enumerations. In fact, it doesn&#8217;t have just one implementation; it has many implementations.<\/p>\n<p>To start to see that, let&#8217;s change the example. In the United States, the Social Security Administration tracks the popularity of baby names. In 2022, the <a href=\"https:\/\/blog.ssa.gov\/social-securitys-most-popular-baby-names-for-2022\/\">most popular baby names<\/a> for girls were Olivia, Emma, Charlotte, Amelia, Sophia, Isabella, Ava, Mia, Evelyn, and Luna. 
Here&#8217;s a benchmark that checks to see whether a name is one of those:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Frozen;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly HashSet&lt;string&gt; s_s = new(StringComparer.OrdinalIgnoreCase)\r\n    {\r\n         \"Olivia\", \"Emma\", \"Charlotte\", \"Amelia\", \"Sophia\", \"Isabella\", \"Ava\", \"Mia\", \"Evelyn\", \"Luna\"\r\n    };\r\n    private static readonly FrozenSet&lt;string&gt; s_fs = s_s.ToFrozenSet(StringComparer.OrdinalIgnoreCase);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool HashSet_IsMostPopular() =&gt; s_s.Contains(\"Alexandria\");\r\n\r\n    [Benchmark]\r\n    public bool FrozenSet_IsMostPopular() =&gt; s_fs.Contains(\"Alexandria\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HashSet_IsMostPopular<\/td>\n<td style=\"text-align: right\">9.824 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FrozenSet_IsMostPopular<\/td>\n<td style=\"text-align: right\">1.518 ns<\/td>\n<td style=\"text-align: right\">0.15<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Significantly faster. Internally, <code>ToFrozenSet<\/code> can pick an implementation based on the data supplied, both the type of the data and the exact values being used. 
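<\/p>\n<p>The concrete type that was selected can be inspected by calling <code>GetType()<\/code> on the instance. A minimal sketch, using the same ten names as the benchmark&#8217;s <code>s_fs<\/code> and assuming implicit usings (exactly which type gets chosen is an implementation detail and may differ between versions):<\/p>\n<pre><code class=\"language-C#\">using System.Collections.Frozen;\r\n\r\nFrozenSet&lt;string&gt; names = new[]\r\n{\r\n    \"Olivia\", \"Emma\", \"Charlotte\", \"Amelia\", \"Sophia\", \"Isabella\", \"Ava\", \"Mia\", \"Evelyn\", \"Luna\"\r\n}.ToFrozenSet(StringComparer.OrdinalIgnoreCase);\r\n\r\n\/\/ Prints the name of the internal derived type that ToFrozenSet picked for this data.\r\nConsole.WriteLine(names.GetType());<\/code><\/pre>\n<p>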
In this case, if we print out the type of <code>s_fs<\/code>, we see:<\/p>\n<pre><code class=\"language-text\">System.Collections.Frozen.LengthBucketsFrozenSet<\/code><\/pre>\n<p>That&#8217;s an implementation detail, but what we&#8217;re seeing here is that the <code>s_fs<\/code>, even though it&#8217;s strongly-typed as <code>FrozenSet&lt;string&gt;<\/code>, is actually a derived type named <code>LengthBucketsFrozenSet<\/code>. <code>ToFrozenSet<\/code> has analyzed the data supplied to it and chosen a strategy that it thinks will yield the best overall throughput. Part of that is just seeing that the type of the data is <code>string<\/code>, in which case all the <code>string<\/code>-based strategies are able to quickly discard queries that can&#8217;t possibly match. In this example, the set will have tracked that the longest string in the collection is &#8220;Charlotte&#8221; at only nine characters long; as such, when it&#8217;s asked whether the set contains &#8220;Alexandria&#8221;, it can immediately answer &#8220;no,&#8221; because it does a quick length check and sees that &#8220;Alexandria&#8221; at 10 characters can&#8217;t possibly be contained.<\/p>\n<p>Let&#8217;s take another example. Internal to the C# compiler, it has the notion of &#8220;special types,&#8221; and it has a dictionary that maps from a string-based type name to an <code>enum<\/code> used to identify that special-type. 
As a simplified representation of this, I&#8217;ve just extracted those strings to create a set:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Frozen;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly HashSet&lt;string&gt; s_s = new()\r\n    {\r\n        \"System.Object\", \"System.Enum\", \"System.MulticastDelegate\", \"System.Delegate\", \"System.ValueType\", \"System.Void\",\r\n        \"System.Boolean\", \"System.Char\", \"System.SByte\", \"System.Byte\", \"System.Int16\", \"System.UInt16\", \"System.Int32\",\r\n        \"System.UInt32\", \"System.Int64\",\"System.UInt64\", \"System.Decimal\", \"System.Single\", \"System.Double\", \"System.String\",\r\n        \"System.IntPtr\", \"System.UIntPtr\", \"System.Array\", \"System.Collections.IEnumerable\", \"System.Collections.Generic.IEnumerable`1\",\r\n        \"System.Collections.Generic.IList`1\", \"System.Collections.Generic.ICollection`1\", \"System.Collections.IEnumerator\",\r\n        \"System.Collections.Generic.IEnumerator`1\", \"System.Collections.Generic.IReadOnlyList`1\", \"System.Collections.Generic.IReadOnlyCollection`1\",\r\n        \"System.Nullable`1\", \"System.DateTime\", \"System.Runtime.CompilerServices.IsVolatile\", \"System.IDisposable\", \"System.TypedReference\",\r\n        \"System.ArgIterator\", \"System.RuntimeArgumentHandle\", \"System.RuntimeFieldHandle\", \"System.RuntimeMethodHandle\", \"System.RuntimeTypeHandle\",\r\n        \"System.IAsyncResult\", \"System.AsyncCallback\", \"System.Runtime.CompilerServices.RuntimeFeature\", \"System.Runtime.CompilerServices.PreserveBaseOverridesAttribute\",\r\n    };\r\n    private static readonly FrozenSet&lt;string&gt; s_fs = s_s.ToFrozenSet();\r\n\r\n  
  [Benchmark(Baseline = true)]\r\n    public bool HashSet_IsSpecial() =&gt; s_s.Contains(\"System.Collections.Generic.IEnumerable`1\");\r\n\r\n    [Benchmark]\r\n    public bool FrozenSet_IsSpecial() =&gt; s_fs.Contains(\"System.Collections.Generic.IEnumerable`1\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HashSet_IsSpecial<\/td>\n<td style=\"text-align: right\">15.228 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FrozenSet_IsSpecial<\/td>\n<td style=\"text-align: right\">8.218 ns<\/td>\n<td style=\"text-align: right\">0.54<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here the item we&#8217;re searching for is in the collection, so it&#8217;s not getting its performance boost from a fast path to fail out of the search. The concrete type of <code>s_fs<\/code> in this case sheds some light on it:<\/p>\n<pre><code class=\"language-text\">System.Collections.Frozen.OrdinalStringFrozenSet_RightJustifiedSubstring<\/code><\/pre>\n<p>One of the biggest costs involved in looking up something in a hash table is often the cost of producing the hash in the first place. For a type like <code>int<\/code>, it&#8217;s trivial, as it&#8217;s literally just its value. But for a type like <code>string<\/code>, the hash is produced by looking at the string&#8217;s contents and factoring each character into the resulting value. The more characters need to be considered, the more it costs. In this case, the type has identified that in order to differentiate all of the items in the collection, only a subset of them needs to be hashed, such that it only needs to examine a subset of the incoming string to determine what a possible match might be in the collection.<\/p>\n<p>A bunch of PRs went into making <code>System.Collections.Frozen<\/code> happen in .NET 8. 
It started as an internal project used by several services at Microsoft, and was then cleaned up and added as part of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77799\">dotnet\/runtime#77799<\/a>. That provided the core types and initial strategy implementations, with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79794\">dotnet\/runtime#79794<\/a> following it to provide additional strategies (although we subsequently backed out a few due to lack of motivating scenarios for what their optimizations were targeting).<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81021\">dotnet\/runtime#81021<\/a> then removed some virtual dispatch from the string-based implementations. As noted in the previous example, one approach the strategies take is to try to hash less, so there&#8217;s a phase of analysis where the implementation looks at the various substrings in each of the items and determines whether there&#8217;s an offset and length for a substring that provides ideal differentiation across all of the items. For example, consider the strings &#8220;12a34&#8221;, &#8220;12b34&#8221;, &#8220;12c34&#8221;; the analyzer would determine that there&#8217;s no need to hash the whole string: it need only consider the character at index 2, as that&#8217;s enough to uniquely hash the relevant strings. This was initially achieved by using a custom comparer type, but that then meant that virtual dispatch was needed in order to invoke the hashing routine. Instead, this PR created more concrete derived types from <code>FrozenSet<\/code>\/<code>FrozenDictionary<\/code>, such that the choice of hashing logic was dictated by the choice of concrete collection type to instantiate, saving on the per-operation dispatch.<\/p>\n<p>In any good story, there&#8217;s a twist, and we encountered a twist with these frozen collection types as well. 
I&#8217;ve already described the scenarios that drove the creation of these types: create once, use <em>a lot<\/em>. And as such, a lot of attention was paid to overheads involved in reading from the collection, but initially very little effort went into optimizing construction time. In fact, improving construction time was initially a non-goal, with a willingness to spend as much time as was needed to eke out more throughput for reading. This makes sense if you&#8217;re focusing on long-lived services, where you&#8217;re happy to spend extra seconds once an hour or day or week to optimize something that will then be used many thousands of times per second. However, the equation changes a bit when types like this are exposed in the core libraries, such that the expected number of developers using them, the use cases they have for them, and the variations of data thrown at them grow by orders of magnitude. We started hearing from developers that they were excited to use <code>FrozenDictionary<\/code>\/<code>FrozenSet<\/code> not just because of performance but also because they were truly immutable, both in implementation and in surface area (e.g. no <code>Add<\/code> method to confuse things), and that they&#8217;d be interested in employing them in object models, UIs, and so on. At that point, you&#8217;re no longer in the world of &#8220;we can take as much time for construction as we want,&#8221; and instead need to be concerned about construction taking inordinate amounts of time and resources.<\/p>\n<p>As a stop-gap measure, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81194\">dotnet\/runtime#81194<\/a> changed the existing <code>ToFrozenDictionary<\/code>\/<code>ToFrozenSet<\/code> methods to not do any analysis of the incoming data, and instead have both construction time and read throughput in line with that of <code>Dictionary<\/code>\/<code>HashSet<\/code>. 
It then added new overloads with a <code>bool optimizeForReading<\/code> argument, to enable developers to opt in to those longer construction times in exchange for better read throughput. This wasn&#8217;t an ideal solution, as it meant that it took more discovery and more code for a developer to achieve the primary purpose of these types, but it also helped developers avoid the pit of failure of calling what looked like a harmless method but could result in significant increases in processing time (one degenerate example I created resulted in <code>ToFrozenDictionary<\/code> running literally for minutes).<\/p>\n<p>We then set about improving the overall performance of the collections, with a bunch of PRs geared towards driving down the costs:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81389\">dotnet\/runtime#81389<\/a> removed various allocations and a dependency from some of the optimizations on the generic math interfaces from .NET 7, such that the optimizations would apply downlevel as well, simplifying the code.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81603\">dotnet\/runtime#81603<\/a> moved some code around to reduce how much code was in a generic context. With Native AOT, with type parameters involving value types, every unique set of type parameters used with these collections results in a unique copy of the code being made, and with all of the various strategies around just in case they&#8217;re necessary to optimize a given set, there&#8217;s potentially a lot of code that gets duplicated. This change was able to shave ~10Kb off each generic instantiation.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86293\">dotnet\/runtime#86293<\/a> made a large number of tweaks, including limiting the maximum length substring that would be evaluated as part of determining the optimal hashing length to employ. 
This significantly reduced the worst-case running time when supplying problematic inputs.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84301\">dotnet\/runtime#84301<\/a> added similar early-exit optimizations as were seen earlier with string, but for a host of other types, including all the primitives, <code>TimeSpan<\/code>, <code>Guid<\/code>, and such. For these types, when no comparer is provided, we can sort the inputs, quickly check whether a supplied input is greater than anything known to be in the collection, and when dealing with a small number of elements such that we don&#8217;t hash at all and instead just do a linear search, we can stop searching once we&#8217;ve reached an item in the collection that&#8217;s larger than the one being tested (e.g. if the first item in the sorted list is larger than the one being tested, nothing will match). It&#8217;s interesting why we don&#8217;t just do this for an <code>IComparable&lt;T&gt;<\/code>; we did, initially, actually, but removed it because of several prominent <code>IComparable&lt;T&gt;<\/code> implementations that didn&#8217;t work for this purpose. <code>ValueTuple&lt;...&gt;<\/code>, for example, implements <code>IComparable&lt;ValueTuple&lt;...&gt;&gt;<\/code>, but the <code>T1<\/code>, <code>T2<\/code>, etc. types the <code>ValueTuple&lt;...&gt;<\/code> wraps may not, and the frozen collections didn&#8217;t have a good way to determine the viability of an <code>IComparable&lt;T&gt;<\/code> implementation. Instead, this PR added the optimization back with an allow list, such that all the relevant known good types that could be referenced were special-cased.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87510\">dotnet\/runtime#87510<\/a> was the first in a series of PRs to focus significantly on driving down the cost of construction. Its main contribution in this regard was in how collisions are handled. 
One of the main optimizations employed in the general case by <code>ToFrozenDictionary<\/code>\/<code>ToFrozenSet<\/code> is to try to drive down the number of collisions in the hash table, since the more collisions there are, the more work will need to be performed during lookups. It does this by populating the table and tracking the number of collisions, and then if there were too many, increasing the size of the table and trying again, repeatedly, until the table has grown large enough that collisions are no longer an issue. This process would hash everything, and then check to make sure it was as good as was desired. This PR changed that to instead bail the moment we knew there were enough collisions that we&#8217;d need to retry, rather than waiting until having processed everything.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87630\">dotnet\/runtime#87630<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87688\">dotnet\/runtime#87688<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88093\">dotnet\/runtime#88093<\/a> in particular improve collections keyed by <code>int<\/code>s, by avoiding unnecessary work. For example, as part of determining the ideal table size (to minimize collisions), the implementation generates a set of all unique hash codes, eliminating duplicate hash codes because they&#8217;d always collide regardless of the size of the table. But with <code>int<\/code>s, we can skip this step, because <code>int<\/code>s are their own hash codes, and so a set of unique <code>int<\/code>s is guaranteed to be a set of unique hash codes as well. 
This was then extended to also apply for <code>uint<\/code>, <code>short<\/code>\/<code>ushort<\/code>, <code>byte<\/code>\/<code>sbyte<\/code>, and <code>nint<\/code>\/<code>nuint<\/code> (in 32-bit processes), as they all similarly use their own value as the hash code.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87876\">dotnet\/runtime#87876<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87989\">dotnet\/runtime#87989<\/a> improve the &#8220;LengthBucket&#8221; strategy referenced in the earlier examples. This implementation buckets strings by their length and then does a lookup just within the strings of that length; if there are only a few strings per length, this can make searching very efficient. The initial implementation used an array of arrays, and this PR flattens that into a single array. This makes construction time much faster for this strategy, as there&#8217;s significantly less allocation involved.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87960\">dotnet\/runtime#87960<\/a> is based on an observation that we would invariably need to resize at least once in order to obtain the desired minimal collision rate, so it simply starts at a higher initial count than was previously being used.<\/li>\n<\/ul>\n<p>With all of those optimizations in place, construction time has now improved to the point where it&#8217;s no longer a threat, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87988\">dotnet\/runtime#87988<\/a> effectively reverted <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81194\">dotnet\/runtime#81194<\/a>, getting rid of the <code>optimizeForReading<\/code>-based overloads, such that everything is now optimized for reading.<\/p>\n<p>As an aside, it&#8217;s worth noting that for <code>string<\/code> keys in particular, the C# compiler has now also gotten in on the game of better optimizing based on the known characteristics of the data, such that if you know all of 
your <code>string<\/code> keys at compile-time, and you just need an ordinal, case-sensitive lookup, you might be best off simply writing a <code>switch<\/code> statement or expression. This is all thanks to <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/66081\">dotnet\/roslyn#66081<\/a>. Let&#8217;s take the name popularity example from earlier, and express it as a <code>switch<\/code> statement:<\/p>\n<pre><code class=\"language-C#\">static bool IsMostPopular(string name)\r\n{\r\n    switch (name)\r\n    {\r\n        case \"Olivia\":\r\n        case \"Emma\":\r\n        case \"Charlotte\":\r\n        case \"Amelia\":\r\n        case \"Sophia\":\r\n        case \"Isabella\":\r\n        case \"Ava\":\r\n        case \"Mia\":\r\n        case \"Evelyn\":\r\n        case \"Luna\":\r\n            return true;\r\n\r\n        default:\r\n            return false;\r\n    }\r\n}<\/code><\/pre>\n<p>Previously compiling this would result in the C# compiler providing a lowered equivalent to this:<\/p>\n<pre><code class=\"language-C#\">static bool IsMostPopular(string name)\r\n{\r\n    uint num = &lt;PrivateImplementationDetails&gt;.ComputeStringHash(name);\r\n    if (num &lt;= 1803517931)\r\n    {\r\n        if (num &lt;= 452280388)\r\n        {\r\n            if (num != 83419291)\r\n            {\r\n                if (num == 452280388 &amp;&amp; name == \"Isabella\")\r\n                    goto IL_012c;\r\n            }\r\n            else if (name == \"Olivia\")\r\n                goto IL_012c;\r\n        }\r\n        else if (num != 596915366)\r\n        {\r\n            if (num != 708112360)\r\n            {\r\n                if (num == 1803517931 &amp;&amp; name == \"Charlotte\")\r\n                    goto IL_012c;\r\n            }\r\n            else if (name == \"Evelyn\")\r\n                goto IL_012c;\r\n        }\r\n        else if (name == \"Mia\")\r\n            goto IL_012c;\r\n    }\r\n    else if (num &lt;= 2263917949u)\r\n    {\r\n        if (num 
!= 2234485159u)\r\n        {\r\n            if (num == 2263917949u &amp;&amp; name == \"Ava\")\r\n                goto IL_012c;\r\n        }\r\n        else if (name == \"Luna\")\r\n            goto IL_012c;\r\n    }\r\n    else if (num != 2346269629u)\r\n    {\r\n        if (num != 3517830433u)\r\n        {\r\n            if (num == 3552467688u &amp;&amp; name == \"Amelia\")\r\n                goto IL_012c;\r\n        }\r\n        else if (name == \"Sophia\")\r\n            goto IL_012c;\r\n    }\r\n    else if (name == \"Emma\")\r\n        goto IL_012c;\r\n    return false;\r\n\r\n    IL_012c:\r\n    return true;\r\n}<\/code><\/pre>\n<p>If you stare at that for a moment, you&#8217;ll see the compiler has implemented a binary search tree. It hashes the name, and then having hashed all of the cases at build time, it does a binary search on the hash codes to find the right case. Now with the recent improvements, it instead generates an equivalent of this:<\/p>\n<pre><code class=\"language-C#\">static bool IsMostPopular(string name)\r\n{\r\n    if (name != null)\r\n    {\r\n        switch (name.Length)\r\n        {\r\n            case 3:\r\n                switch (name[0])\r\n                {\r\n                    case 'A':\r\n                        if (name == \"Ava\")\r\n                            goto IL_012f;\r\n                        break;\r\n                    case 'M':\r\n                        if (name == \"Mia\")\r\n                            goto IL_012f;\r\n                        break;\r\n                }\r\n                break;\r\n            case 4:\r\n                switch (name[0])\r\n                {\r\n                    case 'E':\r\n                        if (name == \"Emma\")\r\n                            goto IL_012f;\r\n                        break;\r\n                    case 'L':\r\n                        if (name == \"Luna\")\r\n                            goto IL_012f;\r\n                        break;\r\n                }\r\n                break;\r\n            case 6:\r\n                switch (name[0])\r\n                {\r\n                    case 'A':\r\n                        if (name == \"Amelia\")\r\n                            goto IL_012f;\r\n                        break;\r\n                    case 'E':\r\n                        if (name == \"Evelyn\")\r\n                            goto IL_012f;\r\n                        break;\r\n                    case 'O':\r\n                        if (name == \"Olivia\")\r\n                            goto IL_012f;\r\n                        break;\r\n                    case 'S':\r\n                        if (name == \"Sophia\")\r\n                            goto IL_012f;\r\n                        break;\r\n                }\r\n                break;\r\n            case 8:\r\n                if (name == \"Isabella\")\r\n                    goto IL_012f;\r\n                break;\r\n            case 9:\r\n                if (name == \"Charlotte\")\r\n                    goto IL_012f;\r\n                break;\r\n        }\r\n    }\r\n    return false;\r\n\r\n    IL_012f:\r\n    return true;\r\n}<\/code><\/pre>\n<p>Now what&#8217;s it doing? First, it&#8217;s bucketed the strings by their length; any string that comes in that&#8217;s not 3, 4, 6, 8, or 9 characters long will be immediately rejected. For 8 and 9 characters, there&#8217;s only one possible answer it could be for each, so it simply checks against that string. For the others, it&#8217;s recognized that each name in that length begins with a different letter, and switches over that. In this particular example, the first character in each bucket is a perfect differentiator, but if it weren&#8217;t, the compiler would also consider other indices to see whether any of those might be better differentiators. This is implementing the same basic strategy as the <code>System.Collections.Frozen.LengthBucketsFrozenSet<\/code> we saw earlier.<\/p>\n<p>I was careful in my choice above to use a <code>switch<\/code>. 
If I&#8217;d instead written the possibly more natural <code>is<\/code> expression:<\/p>\n<pre><code class=\"language-C#\">static bool IsMostPopular(string name) =&gt;\r\n    name is \"Olivia\" or\r\n            \"Emma\" or\r\n            \"Charlotte\" or\r\n            \"Amelia\" or\r\n            \"Sophia\" or\r\n            \"Isabella\" or\r\n            \"Ava\" or\r\n            \"Mia\" or\r\n            \"Evelyn\" or\r\n            \"Luna\";<\/code><\/pre>\n<p>then up until recently the compiler wouldn&#8217;t even have output the binary search, and would have instead just generated a cascading <code>if<\/code>\/<code>else if<\/code> as if I&#8217;d written:<\/p>\n<pre><code class=\"language-C#\">static bool IsMostPopular(string name) =&gt;\r\n    name == \"Olivia\" ||\r\n    name == \"Emma\" ||\r\n    name == \"Charlotte\" ||\r\n    name == \"Amelia\" ||\r\n    name == \"Sophia\" ||\r\n    name == \"Isabella\" ||\r\n    name == \"Ava\" ||\r\n    name == \"Mia\" ||\r\n    name == \"Evelyn\" ||\r\n    name == \"Luna\";<\/code><\/pre>\n<p>With <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/65874\">dotnet\/roslyn#65874<\/a> from <a href=\"https:\/\/github.com\/alrz\">@alrz<\/a>, however, the <code>is<\/code>-based version is now lowered the same as the <code>switch<\/code>-based version.<\/p>\n<p>Back to frozen collections. As noted, <code>System.Collections.Frozen<\/code> types are in the <code>System.Collections.Immutable<\/code> library, and they&#8217;re not the only improvements to that library. A variety of new APIs have been added to help enable more productive and efficient use of the existing immutable collections&#8230;<\/p>\n<h2>Immutable Collections<\/h2>\n<p>For years, developers have found the need to bypass an <code>ImmutableArray&lt;T&gt;<\/code>&#8216;s immutability. 
For example, the previously-discussed <code>FrozenDictionary&lt;TKey, TValue&gt;<\/code> exposes an <code>ImmutableArray&lt;TKey&gt;<\/code> for its keys and an <code>ImmutableArray&lt;TValue&gt;<\/code> for its values. It does this by creating a <code>TKey[]<\/code>, which it uses for a variety of purposes while building up the collection, and then it wants to wrap that as an <code>ImmutableArray&lt;TKey&gt;<\/code> to be exposed for consumption. But with the public APIs available on <code>ImmutableArray<\/code>\/<code>ImmutableArray&lt;T&gt;<\/code>, there&#8217;s no way to transfer ownership like that; all the APIs that accept an input <code>T[]<\/code> or <code>IEnumerable&lt;T&gt;<\/code> allocate a new array and copy all of the data into it, so that the implementation can be sure no one else is still holding onto a reference to the array being wrapped (if someone was, they could use that mutable reference to mutate the contents of the immutable array, and guarding against that is one of the key differentiators between a read-only collection and an immutable collection). Enabling such wrapping of the original array is thus an &#8220;unsafe&#8221; operation, albeit one that&#8217;s valuable to enable for developers willing to accept the responsibility. Previously, developers could achieve this by employing a hack that works but only because of implementation detail: using <code>Unsafe.As<\/code> to cast between the types. When a value type&#8217;s first field is a reference type, a reference to the beginning of the struct is also a reference to the reference type, since they&#8217;re both at the exact same memory location. 
Thus, because <code>ImmutableArray&lt;T&gt;<\/code> contains just a single field (for the <code>T[]<\/code> it wraps), a method like the following will successfully wrap an <code>ImmutableArray&lt;T&gt;<\/code> around a <code>T[]<\/code>:<\/p>\n<pre><code class=\"language-C#\">static ImmutableArray&lt;T&gt; UnsafeWrap&lt;T&gt;(T[] array) =&gt; Unsafe.As&lt;T[], ImmutableArray&lt;T&gt;&gt;(ref array);<\/code><\/pre>\n<p>That, however, is both unintuitive and dependent on <code>ImmutableArray&lt;T&gt;<\/code> having the array at a 0-offset from the start of the struct, making it a brittle solution. To provide something robust, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85526\">dotnet\/runtime#85526<\/a> added the new <code>System.Runtime.InteropServices.ImmutableCollectionsMarshal<\/code> class, and on it two new methods: <code>AsImmutableArray<\/code> and <code>AsArray<\/code>. These methods support casting back and forth between a <code>T[]<\/code> and an <code>ImmutableArray&lt;T&gt;<\/code>, without allocation. They&#8217;re defined in <code>InteropServices<\/code> on a <code>Marshal<\/code> class, as that&#8217;s one of the ways we have to both hide more dangerous functionality and declare that something is inherently &#8220;unsafe&#8221; in some capacity.<\/p>\n<p>There are also new overloads exposed for constructing immutable collections with less allocation. All of the immutable collections have a corresponding static class that provides a <code>Create<\/code> method, e.g. <code>ImmutableList&lt;T&gt;<\/code> has the corresponding static class <code>ImmutableList<\/code> which provides a <code>static ImmutableList&lt;T&gt; Create&lt;T&gt;(params T[] items)<\/code> method. Now in .NET 8 as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87945\">dotnet\/runtime#87945<\/a>, these methods all have a new overload that takes a <code>ReadOnlySpan&lt;T&gt;<\/code>, e.g. 
<code>static ImmutableList&lt;T&gt; Create&lt;T&gt;(ReadOnlySpan&lt;T&gt; items)<\/code>. This means an immutable collection can be created without incurring the allocation required to either go through the associated builder (which is a reference type) or to allocate an array of the exact right size.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections.Immutable;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark(Baseline = true)]\r\n    public ImmutableList&lt;int&gt; CreateArray() =&gt; ImmutableList.Create&lt;int&gt;(1, 2, 3, 4, 5);\r\n\r\n    [Benchmark]\r\n    public ImmutableList&lt;int&gt; CreateBuilder()\r\n    {\r\n        var builder = ImmutableList.CreateBuilder&lt;int&gt;();\r\n        for (int i = 1; i &lt;= 5; i++) builder.Add(i);\r\n        return builder.ToImmutable();\r\n    }\r\n\r\n    [Benchmark]\r\n    public ImmutableList&lt;int&gt; CreateSpan() =&gt; ImmutableList.Create&lt;int&gt;(stackalloc int[] { 1, 2, 3, 4, 5 });\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CreateBuilder<\/td>\n<td style=\"text-align: right\">132.22 ns<\/td>\n<td style=\"text-align: right\">1.42<\/td>\n<td style=\"text-align: right\">312 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CreateArray<\/td>\n<td style=\"text-align: right\">92.98 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">312 B<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CreateSpan<\/td>\n<td style=\"text-align: right\">85.54 ns<\/td>\n<td style=\"text-align: right\">0.92<\/td>\n<td style=\"text-align: right\">264 B<\/td>\n<td style=\"text-align: right\">0.85<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>BitArray<\/h3>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81527\">dotnet\/runtime#81527<\/a> from <a href=\"https:\/\/github.com\/lateapexearlyspeed\">@lateapexearlyspeed<\/a> added two new methods to <code>BitArray<\/code>, <code>HasAllSet<\/code> and <code>HasAnySet<\/code>, which do exactly what their names suggest: <code>HasAllSet<\/code> returns whether all of the bits in the array are set, and <code>HasAnySet<\/code> returns whether any of the bits in the array are set. While useful, what I really like about these additions is that they make good use of the <code>ContainsAnyExcept<\/code> method introduced in .NET 8. <code>BitArray<\/code>&#8216;s storage is an <code>int[]<\/code>, where each element in the array represents 32 bits (for the purposes of this discussion, I&#8217;m ignoring the corner-case it needs to deal with of the last element&#8217;s bits not all being used because the count of the collection isn&#8217;t a multiple of 32). Determining whether any bits are set is then simply a matter of doing <code>_array.AsSpan().ContainsAnyExcept(0)<\/code>. Similarly, determining whether all bits are set is simply a matter of doing <code>!_array.AsSpan().ContainsAnyExcept(-1)<\/code>. The bit pattern for <code>-1<\/code> is all 1s, so <code>ContainsAnyExcept(-1)<\/code> will return true if and only if it finds any integer that doesn&#8217;t have all of its bits set; thus if the call doesn&#8217;t find any, all bits are set. The net result is <code>BitArray<\/code> gets to maintain simple code that&#8217;s also vectorized and optimized, thanks to delegating to these shared helpers. 
You can see examples of these methods being used in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82057\">dotnet\/runtime#82057<\/a>, which replaced bespoke implementations of the same functionality with the new built-in helpers.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Collections;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly BitArray _bitArray = new BitArray(1024);\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public bool HasAnySet_Manual()\r\n    {\r\n        for (int i = 0; i &lt; _bitArray.Length; i++)\r\n        {\r\n            if (_bitArray[i])\r\n            {\r\n                return true;\r\n            }\r\n        }\r\n\r\n        return false;\r\n    }\r\n\r\n    [Benchmark]\r\n    public bool HasAnySet_BuiltIn() =&gt; _bitArray.HasAnySet();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HasAnySet_Manual<\/td>\n<td style=\"text-align: right\">731.041 ns<\/td>\n<td style=\"text-align: right\">1.000<\/td>\n<\/tr>\n<tr>\n<td>HasAnySet_BuiltIn<\/td>\n<td style=\"text-align: right\">5.423 ns<\/td>\n<td style=\"text-align: right\">0.007<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Collection Expressions<\/h3>\n<p>With <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/68831\">dotnet\/roslyn#68831<\/a> and then a myriad of subsequent PRs, C# 12 introduces a new terse syntax for constructing collections: &#8220;collection expressions.&#8221; Let&#8217;s say I want to construct a <code>List&lt;int&gt;<\/code>, for example, with the elements 1, 2, and 3. 
I could do it like so:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;int&gt;();\r\nlist.Add(1);\r\nlist.Add(2);\r\nlist.Add(3);<\/code><\/pre>\n<p>or utilizing collection initializers that were added in C# 3:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;int&gt;() { 1, 2, 3 };<\/code><\/pre>\n<p>Now in C# 12, I can write that as:<\/p>\n<pre><code class=\"language-C#\">List&lt;int&gt; list = [1, 2, 3];<\/code><\/pre>\n<p>I can also use &#8220;spreads,&#8221; where enumerables can be used in the syntax and have all of their contents splat into the collection. For example, instead of:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;int&gt;() { 1, 2 };\r\nforeach (int i in GetData())\r\n{\r\n    list.Add(i);\r\n}\r\nlist.Add(3);<\/code><\/pre>\n<p>or:<\/p>\n<pre><code class=\"language-C#\">var list = new List&lt;int&gt;() { 1, 2 };\r\nlist.AddRange(GetData());\r\nlist.Add(3);<\/code><\/pre>\n<p>I can simply write:<\/p>\n<pre><code class=\"language-C#\">List&lt;int&gt; list = [1, 2, ..GetData(), 3];<\/code><\/pre>\n<p>If it were just a simpler syntax for collections, it wouldn&#8217;t be worth discussing in this particular post. What makes it relevant from a performance perspective, however, is that the C# compiler is free to optimize this however it sees fit, and it goes to great lengths to write the best code it can for the given circumstance; some optimizations are already in the compiler, more will be in place by the time .NET 8 and C# 12 are released, and even more will come later, with the language specified in such a way that gives the compiler the freedom to innovate here. 
Let&#8217;s take a few examples&#8230;<\/p>\n<p>If you write:<\/p>\n<pre><code class=\"language-C#\">IEnumerable&lt;int&gt; e = [];<\/code><\/pre>\n<p>the compiler won&#8217;t just translate that into:<\/p>\n<pre><code class=\"language-C#\">IEnumerable&lt;int&gt; e = new int[0];<\/code><\/pre>\n<p>After all, we have a perfectly good singleton for this in the way of <code>Array.Empty&lt;int&gt;()<\/code>, something the compiler already emits use of for things like <code>params T[]<\/code>, and it can emit the same thing here:<\/p>\n<pre><code class=\"language-C#\">IEnumerable&lt;int&gt; e = Array.Empty&lt;int&gt;();<\/code><\/pre>\n<p>Ok, what about the optimizations we previously saw around the compiler lowering the creation of an array involving only constants and storing that directly into a <code>ReadOnlySpan&lt;T&gt;<\/code>? Yup, that applies here, too. So, instead of writing:<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;int&gt; daysToMonth365 = new int[] { 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365 };<\/code><\/pre>\n<p>you can write:<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;int&gt; daysToMonth365 = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365];<\/code><\/pre>\n<p>and the exact same code results.<\/p>\n<p>What about <code>List&lt;T&gt;<\/code>? Earlier in the discussion of collections we saw that <code>List&lt;T&gt;<\/code> now sports an <code>AddRange(ReadOnlySpan&lt;T&gt;)<\/code>, and the compiler is free to use that. 
For example, if you write this:<\/p>\n<pre><code class=\"language-C#\">Span&lt;int&gt; source1 = ...;\r\nIList&lt;int&gt; source2 = ...;\r\nList&lt;int&gt; result = [1, 2, ..source1, ..source2];<\/code><\/pre>\n<p>the compiler could emit the equivalent of this:<\/p>\n<pre><code class=\"language-C#\">Span&lt;int&gt; source1 = ...;\r\nIList&lt;int&gt; source2 = ...;\r\nList&lt;int&gt; result = new List&lt;int&gt;(2 + source1.Length + source2.Count);\r\nresult.Add(1);\r\nresult.Add(2);\r\nresult.AddRange(source1);\r\nresult.AddRange(source2);<\/code><\/pre>\n<p>One of my favorite optimizations it achieves, though, is with spans and the use of the <code>[InlineArray]<\/code> attribute we already saw. If you write:<\/p>\n<pre><code class=\"language-C#\">int a = ..., b = ..., c = ..., d = ..., e = ..., f = ..., g = ..., h = ...;\r\nSpan&lt;int&gt; span = [a, b, c, d, e, f, g, h];<\/code><\/pre>\n<p>the compiler can lower that to code along the lines of this:<\/p>\n<pre><code class=\"language-C#\">int a = ..., b = ..., c = ..., d = ..., e = ..., f = ..., g = ..., h = ...;\r\n&lt;&gt;y__InlineArray8&lt;int&gt; buffer = default;\r\nSpan&lt;int&gt; span = buffer;\r\nspan[0] = a;\r\nspan[1] = b;\r\nspan[2] = c;\r\nspan[3] = d;\r\nspan[4] = e;\r\nspan[5] = f;\r\nspan[6] = g;\r\nspan[7] = h;\r\n...\r\n[InlineArray(8)]\r\ninternal struct &lt;&gt;y__InlineArray8&lt;T&gt;\r\n{\r\n    private T _element0;\r\n}<\/code><\/pre>\n<p>In short, this collection expression syntax becomes <em>the<\/em> way to utilize <code>[InlineArray]<\/code> in the vast majority of situations, allowing the compiler to create a shared definition for you.<\/p>\n<p>That optimization also feeds into another, which is both an optimization and a functional improvement over what&#8217;s in C# 11. 
Let&#8217;s say you have this code&#8230; what do you expect it to print?<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -f net8.0\r\n\r\nusing System.Collections.Immutable;\r\n\r\nImmutableArray&lt;int&gt; array = new ImmutableArray&lt;int&gt; { 1, 2, 3 };\r\nforeach (int i in array)\r\n{\r\n    Console.WriteLine(i);\r\n}<\/code><\/pre>\n<p>Unless you&#8217;re steeped in <code>System.Collections.Immutable<\/code> and how collection initializers work, you likely didn&#8217;t predict the (unfortunate) answer:<\/p>\n<pre><code class=\"language-text\">Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.\r\n   at System.Collections.Immutable.ImmutableArray`1.get_IsEmpty()\r\n   at System.Collections.Immutable.ImmutableArray`1.Add(T item)\r\n   at Program.&lt;Main&gt;$(String[] args)<\/code><\/pre>\n<p><code>ImmutableArray&lt;T&gt;<\/code> is a struct, so this will end up using its default initialization, which contains a <code>null<\/code> array. But even if that were made to work, the C# compiler will have lowered the code I wrote to the equivalent of this:<\/p>\n<pre><code class=\"language-C#\">ImmutableArray&lt;int&gt; immutableArray = default;\r\nimmutableArray.Add(1);\r\nimmutableArray.Add(2);\r\nimmutableArray.Add(3);\r\nforeach (int i in immutableArray)\r\n{\r\n    Console.WriteLine(i);\r\n}<\/code><\/pre>\n<p>which is &#8220;wrong&#8221; in multiple ways. <code>ImmutableArray&lt;int&gt;.Add<\/code> doesn&#8217;t actually mutate the original collection, but instead returns a new instance that contains the additional element, so when we enumerate <code>immutableArray<\/code>, we wouldn&#8217;t see any of the additions. Plus, we&#8217;re doing all this work and allocation to create the results of <code>Add<\/code>, only to drop those results on the floor.<\/p>\n<p>Collection expressions fix this. 
Now you can write this:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -f net8.0\r\n\r\nusing System.Collections.Immutable;\r\n\r\nImmutableArray&lt;int&gt; array = [1, 2, 3];\r\nforeach (int i in array)\r\n{\r\n    Console.WriteLine(i);\r\n}<\/code><\/pre>\n<p>and running it successfully produces:<\/p>\n<pre><code class=\"language-text\">1\r\n2\r\n3<\/code><\/pre>\n<p>Why? Because <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88470\">dotnet\/runtime#88470<\/a> added a new <code>[CollectionBuilder]<\/code> attribute that&#8217;s recognized by the C# compiler. That attribute is placed on a type and points to a factory method for creating that type, accepting a <code>ReadOnlySpan&lt;T&gt;<\/code> and returning the instance constructed from that data. That PR also tagged <code>ImmutableArray&lt;T&gt;<\/code> with this attribute:<\/p>\n<pre><code class=\"language-C#\">[CollectionBuilder(typeof(ImmutableArray), nameof(ImmutableArray.Create))]<\/code><\/pre>\n<p>such that when the compiler sees an <code>ImmutableArray&lt;T&gt;<\/code> being constructed from a collection expression, it runs to use <code>ImmutableArray.Create&lt;T&gt;(ReadOnlySpan&lt;T&gt;)<\/code>. Not only that, it&#8217;s able to use the <code>[InlineArray]<\/code>-based optimization we just talked about for creating that input. 
As such, the code the compiler generates for this example as of today is equivalent to this:<\/p>\n<pre><code class=\"language-C#\">&lt;&gt;y__InlineArray3&lt;int&gt; buffer = default;\r\nbuffer._element0 = 1;\r\nUnsafe.Add(ref buffer._element0, 1) = 2;\r\nUnsafe.Add(ref buffer._element0, 2) = 3;\r\nImmutableArray&lt;int&gt; array = ImmutableArray.Create(buffer);\r\nforeach (int i in array)\r\n{\r\n    Console.WriteLine(i);\r\n}<\/code><\/pre>\n<p><code>ImmutableList&lt;T&gt;<\/code>, <code>ImmutableStack&lt;T&gt;<\/code>, <code>ImmutableQueue&lt;T&gt;<\/code>, <code>ImmutableHashSet&lt;T&gt;<\/code>, and <code>ImmutableSortedSet&lt;T&gt;<\/code> are all similarly attributed such that they all work with collection expressions as well.<\/p>\n<p>Of course, the compiler could actually do a bit better for <code>ImmutableArray&lt;T&gt;<\/code>. As was previously noted, the compiler is free to optimize these how it sees fit, and we already mentioned the new <code>ImmutableCollectionsMarshal.AsImmutableArray<\/code> method. As I write this, the compiler doesn&#8217;t currently employ that method, but in the future the compiler can special-case <code>ImmutableArray&lt;T&gt;<\/code>, such that it could then generate code equivalent to the following:<\/p>\n<pre><code class=\"language-C#\">ImmutableArray&lt;int&gt; array = ImmutableCollectionsMarshal.AsImmutableArray(new[] { 1, 2, 3 });<\/code><\/pre>\n<p>saving on both stack space as well as an extra copy of the data. This is just one of the additional optimizations possible.<\/p>\n<p>In short, collection expressions are intended to be a great way to express the collection you want built, and the compiler will ensure it&#8217;s done efficiently.<\/p>\n<h2>File I\/O<\/h2>\n<p>.NET 6 overhauled how file I\/O is implemented in .NET, rewriting <code>FileStream<\/code>, introducing the <code>RandomAccess<\/code> class, and a multitude of other changes. 
.NET 8 continues to improve performance with file I\/O further.<\/p>\n<p>One of the more interesting ways performance of a system can be improved is cancellation. After all, the fastest work is work you don&#8217;t have to do at all, and cancellation is about stopping doing unneeded work. The original patterns for asynchrony in .NET were based on a non-cancelable model (see <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/how-async-await-really-works\/\">How Async\/Await Really Works in C#<\/a> for an in-depth history and discussion), and over time as all of that support has shifted to the <code>Task<\/code>-based model based on <code>CancellationToken<\/code>, more and more implementations have become fully cancelable as well. As of .NET 7, the vast majority of code paths that accepted a <code>CancellationToken<\/code> actually respected it, more than just doing an up-front check to see whether cancellation was already requested but then not paying attention to it during the operation. Most of the holdouts have been very corner-case, but there&#8217;s one notable exception: <code>FileStream<\/code>s created without <code>FileOptions.Asynchronous<\/code>.<\/p>\n<p><code>FileStream<\/code> inherited the bifurcated model of asynchrony from Windows, where at the time you open a file handle you need to specify whether it&#8217;s being opened for synchronous or asynchronous (&#8220;overlapped&#8221;) access. A file handle opened for overlapped access requires that all operations be asynchronous, and vice versa: a handle opened for non-overlapped access requires that all operations be synchronous. That causes some friction with <code>FileStream<\/code>, which exposes both synchronous (e.g. <code>Read<\/code>) and asynchronous (e.g. <code>ReadAsync<\/code>) methods, as it means that one set of those needs to emulate the behavior. 
If the <code>FileStream<\/code> is opened for asynchronous access, then <code>Read<\/code> needs to do the operation asynchronously and block waiting for it to complete (a pattern we less-than-affectionately refer to as <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/should-i-expose-synchronous-wrappers-for-asynchronous-methods\/\">&#8220;sync-over-async&#8221;<\/a>), and if the <code>FileStream<\/code> is opened for synchronous access, then <code>ReadAsync<\/code> needs to queue a work item that will do the operation synchronously (<a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/should-i-expose-asynchronous-wrappers-for-synchronous-methods\/\">&#8220;async-over-sync&#8221;<\/a>). Even though that <code>ReadAsync<\/code> method accepts a <code>CancellationToken<\/code>, the actual synchronous <code>Read<\/code> that ends up being invoked as part of a <code>ThreadPool<\/code> work item hasn&#8217;t been cancelable. Now in .NET 8, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87103\">dotnet\/runtime#87103<\/a>, it is, at least on Windows.<\/p>\n<p>In .NET 7, <code>PipeStream<\/code> was fixed for this same case, relying on an internal <code>AsyncOverSyncWithIoCancellation<\/code> helper that would use the Win32 <code>CancelSynchronousIo<\/code> to interrupt pending I\/O, while also using appropriate synchronization to ensure that only the intended associated work was interrupted and not work that happened to be running on the same worker thread before or after (Linux already fully supported <code>PipeStream<\/code> cancellation as of .NET 5). This PR adapted that same helper to then be usable as well inside of <code>FileStream<\/code> on Windows, in order to gain the same benefits. 
The same PR also further improved the implementation of that helper to reduce allocation and to further streamline the processing, such that the existing support in <code>PipeStream<\/code> gets leaner as well.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.IO.Pipes;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly CancellationTokenSource _cts = new();\r\n    private readonly byte[] _buffer = new byte[1];\r\n    private AnonymousPipeServerStream _server;\r\n    private AnonymousPipeClientStream _client;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _server = new AnonymousPipeServerStream(PipeDirection.Out);\r\n        _client = new AnonymousPipeClientStream(PipeDirection.In, _server.ClientSafePipeHandle);\r\n    }\r\n\r\n    [GlobalCleanup]\r\n    public void Cleanup()\r\n    {\r\n        _server.Dispose();\r\n        _client.Dispose();\r\n    }\r\n\r\n    [Benchmark(OperationsPerInvoke = 100_000)]\r\n    public async Task ReadWriteAsync()\r\n    {\r\n        for (int i = 0; i &lt; 100_000; i++)\r\n        {\r\n            ValueTask&lt;int&gt; read = _client.ReadAsync(_buffer, _cts.Token);\r\n            await _server.WriteAsync(_buffer, _cts.Token);\r\n            await read;\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ReadWriteAsync<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">3.863 us<\/td>\n<td 
style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">181 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ReadWriteAsync<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">2.941 us<\/td>\n<td style=\"text-align: right\">0.76<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Interacting with paths via <code>Path<\/code> and <code>File<\/code> has also improved in various ways. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74855\">dotnet\/runtime#74855<\/a> improved <code>Path.GetTempFileName()<\/code> on Windows both functionally and for performance; in many situations in the past, we&#8217;ve made the behavior of .NET on Unix match the behavior of .NET on Windows, but this PR interestingly goes in the other direction. On Unix, <code>Path.GetTempFileName()<\/code> uses the libc <code>mkstemp<\/code> function, which accepts a template that must end in &#8220;XXXXXX&#8221; (6 <code>X<\/code>s), and it populates those <code>X<\/code>s with random values, using the resulting name for a new file that gets created. On Windows, <code>GetTempFileName()<\/code> was using the Win32 <code>GetTempFileNameW<\/code> function, which uses a similar pattern but with only 4 <code>X<\/code>s. With the characters Windows will fill in, that enables only 65,536 possible names, and as the temp directory fills up, it becomes more and more likely there will be conflicts, leading to longer and longer times for creating a temp file (it also means that on Windows <code>Path.GetTempFileName()<\/code> has been limited to creating 65,536 simultaneously-existing files). This PR changes the format on Windows to match that used on Unix, and avoids the use of <code>GetTempFileNameW<\/code>, instead doing the random name assignment and retries-on-conflict itself. 
The net result is more consistency across OSes, a much larger number of temporary files possible (a billion instead of tens of thousands), as well as a better-performing method:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\/\/ NOTE: The results for this benchmark will vary wildly based on how full the temp directory is.\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly List&lt;string&gt; _files = new();\r\n\r\n    \/\/ NOTE: The performance of this benchmark is highly influenced by what's currently in your temp directory.\r\n    [Benchmark]\r\n    public void GetTempFileName()\r\n    {\r\n        for (int i = 0; i &lt; 1000; i++) _files.Add(Path.GetTempFileName());\r\n    }\r\n\r\n    [IterationCleanup]\r\n    public void Cleanup()\r\n    {\r\n        foreach (string path in _files) File.Delete(path);\r\n        _files.Clear();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetTempFileName<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,947.8 ms<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetTempFileName<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">276.5 ms<\/td>\n<td style=\"text-align: right\">0.34<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Path.GetFileName<\/code> is another on the list of methods that improves, thanks to making use of <code>IndexOf<\/code> methods. 
Here, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75318\">dotnet\/runtime#75318<\/a> uses <code>LastIndexOf<\/code> (on Unix, where the only directory separator is <code>'\/'<\/code>) or <code>LastIndexOfAny<\/code> (on Windows, where both <code>'\/'<\/code> and <code>'\\'<\/code> can be a directory separator) to search for the beginning of the file name.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private string _path = Path.Join(Path.GetTempPath(), \"SomeFileName.cs\");\r\n\r\n    [Benchmark]\r\n    public ReadOnlySpan&lt;char&gt; GetFileName() =&gt; Path.GetFileName(_path.AsSpan());\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetFileName<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">9.465 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetFileName<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">4.733 ns<\/td>\n<td style=\"text-align: right\">0.50<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Related to <code>File<\/code> and <code>Path<\/code>, various methods on <code>Environment<\/code> also return paths. <code>Microsoft.Extensions.Hosting.HostingHostBuilderExtensions<\/code> had been using <code>Environment.GetFolderPath(Environment.SpecialFolder.System)<\/code> to get the system path, but this was leading to noticeable overhead when starting up an ASP.NET application. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83564\">dotnet\/runtime#83564<\/a> changed this to use <code>Environment.SystemDirectory<\/code> directly, which on Windows takes advantage of a much more efficient path (and results in simpler code), but then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83593\">dotnet\/runtime#83593<\/a> also fixed <code>Environment.GetFolderPath(Environment.SpecialFolder.System)<\/code> on Windows to use <code>Environment.SystemDirectory<\/code>, such that its performance accrues to the higher-level uses as well.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public string GetFolderPath() =&gt; Environment.GetFolderPath(Environment.SpecialFolder.System);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetFolderPath<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,560.87 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">88 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetFolderPath<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">45.76 ns<\/td>\n<td style=\"text-align: right\">0.03<\/td>\n<td style=\"text-align: right\">64 B<\/td>\n<td style=\"text-align: right\">0.73<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73983\">dotnet\/runtime#73983<\/a> improves <code>DirectoryInfo<\/code> and <code>FileInfo<\/code>, making the <code>FileSystemInfo.Name<\/code> property lazy. Previously when constructing the info object if only the full name existed (and not the name of just the directory or file itself), the constructor would promptly create the <code>Name<\/code> string, even if the info object is never used (as is often the case when it&#8217;s returned from a method like <code>CreateDirectory<\/code>). Now, that <code>Name<\/code> string is lazily created on first use of the <code>Name<\/code> property.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly string _path = Environment.CurrentDirectory;\r\n\r\n    [Benchmark]\r\n    public DirectoryInfo Create() =&gt; new DirectoryInfo(_path);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Create<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">225.0 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">240 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Create<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">170.1 ns<\/td>\n<td style=\"text-align: right\">0.76<\/td>\n<td style=\"text-align: right\">200 B<\/td>\n<td style=\"text-align: 
right\">0.83<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>File.Copy<\/code> has gotten a whole lot faster on macOS, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79243\">dotnet\/runtime#79243<\/a> from <a href=\"https:\/\/github.com\/hamarb123\">@hamarb123<\/a>. <code>File.Copy<\/code> now employs the OS&#8217;s <code>clonefile<\/code> function (if available) to perform the copy, and if both the source and destination are on the same volume, <code>clonefile<\/code> creates a copy-on-write clone of the file in the destination; this makes the copy at the OS level much faster, with the majority of the cost of actually duplicating the data incurred only if one of the files is subsequently written to.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"Min\", \"Max\")]\r\npublic class Tests\r\n{\r\n    private string _source;\r\n    private string _dest;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _source = Path.GetTempFileName();\r\n        File.WriteAllBytes(_source, Enumerable.Repeat((byte)42, 1_000_000).ToArray());\r\n        _dest = Path.GetRandomFileName();\r\n    }\r\n\r\n    [Benchmark]\r\n    public void FileCopy() =&gt; File.Copy(_source, _dest, overwrite: true);\r\n\r\n    [GlobalCleanup]\r\n    public void Cleanup()\r\n    {\r\n        File.Delete(_source);\r\n        File.Delete(_dest);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>FileCopy<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,624.8 us<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>FileCopy<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">366.7 us<\/td>\n<td style=\"text-align: right\">0.23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Some more specialized changes have been incorporated as well. <code>TextWriter<\/code> is a core abstraction for writing text to an arbitrary destination, but sometimes you want that destination to be nowhere, a la <code>\/dev\/null<\/code> on Linux. For this, <code>TextWriter<\/code> provides the <code>TextWriter.Null<\/code> property, which returns a <code>TextWriter<\/code> instance that nops on all of its members. Or, at least that&#8217;s the visible behavior. In practice, only a subset of its members were actually overridden, which meant that although nothing would end up being output, some work might still be incurred and then the fruits of that labor thrown away. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83293\">dotnet\/runtime#83293<\/a> ensures that all of the writing methods are overridden in order to do away with all of that wasted work.<\/p>\n<p>Further, one of the places <code>TextWriter<\/code> ends up being used is in <code>Console<\/code>, where <code>Console.SetOut<\/code> allows you to replace <code>stdout<\/code> with your own writer, at which point all of the writing methods on <code>Console<\/code> output to that <code>TextWriter<\/code> instead. In order to provide thread-safety of writes, <code>Console<\/code> synchronizes access to the underlying writer, but if the writer is doing nops anyway, there&#8217;s no need for that synchronization. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83296\">dotnet\/runtime#83296<\/a> does away with it in that case, such that if you want to temporarily silence <code>Console<\/code>, you can simply set its output to go to <code>TextWriter.Null<\/code>, and the overhead of operations on <code>Console<\/code> will be minimized.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly string _value = \"42\";\r\n\r\n    [GlobalSetup]\r\n    public void Setup() =&gt; Console.SetOut(TextWriter.Null);\r\n\r\n    [Benchmark]\r\n    public void WriteLine() =&gt; Console.WriteLine(\"The value was {0}\", _value);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WriteLine<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">80.361 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WriteLine<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">1.743 ns<\/td>\n<td style=\"text-align: right\">0.02<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Networking<\/h2>\n<p>Networking is the heart and soul of most modern services and applications, which makes it all the more important that .NET&#8217;s networking stack shine.<\/p>\n<h3>Networking 
Primitives<\/h3>\n<p>Let&#8217;s start at the bottom of the networking stack, looking at some primitives. Most of these improvements are around formatting, parsing, and manipulation as bytes. Take <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75872\">dotnet\/runtime#75872<\/a>, for example, which improved the performance of various such operations on <code>IPAddress<\/code>. <code>IPAddress<\/code> stores a <code>uint<\/code> that&#8217;s used as the address when it&#8217;s representing an IPv4 address, and it stores a <code>ushort[8]<\/code> that&#8217;s used when it&#8217;s representing an IPv6 address. A <code>ushort<\/code> is two bytes, so a <code>ushort[8]<\/code> is 16 bytes, or 128 bits. &#8220;128 bits&#8221; is a very convenient number when performing certain operations, as such a value can be manipulated as a <code>Vector128&lt;&gt;<\/code> (accelerating computation on systems that accelerate it, which is most). This PR takes advantage of that to optimize common operations with an <code>IPAddress<\/code>. 
The <code>IPAddress<\/code> constructor, for example, is handed a <code>ReadOnlySpan&lt;byte&gt;<\/code> for an IPv6 address, which it needs to read into its <code>ushort[8]<\/code>; previously that was done with a loop over the input, but now it&#8217;s handled with a single vector: load the single vector, possibly reverse the endianness (which can be done in just three instructions: OR together the vector shifted left by one byte and shifted right by one byte), and store it.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly IPAddress _addr = IPAddress.Parse(\"2600:141b:13:781::356e\");\r\n    private readonly byte[] _ipv6Bytes = IPAddress.Parse(\"2600:141b:13:781::356e\").GetAddressBytes();\r\n\r\n    [Benchmark] public IPAddress NewIPv6() =&gt; new IPAddress(_ipv6Bytes, 0);\r\n    [Benchmark] public bool WriteBytes() =&gt; _addr.TryWriteBytes(_ipv6Bytes, out _);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NewIPv6<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">36.720 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NewIPv6<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">16.715 ns<\/td>\n<td style=\"text-align: right\">0.45<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>WriteBytes<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">14.443 ns<\/td>\n<td style=\"text-align: 
right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WriteBytes<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">2.036 ns<\/td>\n<td style=\"text-align: right\">0.14<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>IPAddress<\/code> now also implements <code>ISpanFormattable<\/code> and <code>IUtf8SpanFormattable<\/code>, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82913\">dotnet\/runtime#82913<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84487\">dotnet\/runtime#84487<\/a>. That means, for example, that using an <code>IPAddress<\/code> as part of string interpolation no longer needs to allocate an intermediate string. As part of this, some changes were made to <code>IPAddress<\/code> formatting to streamline it. It&#8217;s a bit harder to measure these changes, though, because <code>IPAddress<\/code> caches a string it creates, such that subsequent <code>ToString<\/code> calls just return the previous string created. To work around that, we can use private reflection to null out the field (<em>never<\/em> do this in real code; private reflection against the core libraries is very much unsupported).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\nusing System.Reflection;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private IPAddress _address;\r\n    private FieldInfo _toStringField;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _address = IPAddress.Parse(\"123.123.123.123\");\r\n        _toStringField = typeof(IPAddress).GetField(\"_toString\", BindingFlags.NonPublic | BindingFlags.Instance);\r\n    }\r\n\r\n    [Benchmark]\r\n    public string NonCachedToString()\r\n    {\r\n        
_toStringField.SetValue(_address, null);\r\n        return _address.ToString();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonCachedToString<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">92.63 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NonCachedToString<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">75.53 ns<\/td>\n<td style=\"text-align: right\">0.82<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Unfortunately, such use of reflection has a non-trivial amount of overhead associated with it, which then decreases the perceived benefit from the improvement. Instead, we can use reflection emit either directly or via <code>System.Linq.Expressions<\/code> to emit a custom helper that makes it less expensive to null out that private field.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Linq.Expressions;\r\nusing System.Net;\r\nusing System.Reflection;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private IPAddress _address;\r\n    private Action&lt;IPAddress, string&gt; _setter;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _address = IPAddress.Parse(\"123.123.123.123\");\r\n        _setter = BuildSetter&lt;IPAddress, string&gt;(typeof(IPAddress).GetField(\"_toString\", BindingFlags.NonPublic | BindingFlags.Instance));\r\n    }\r\n\r\n    [Benchmark]\r\n    public string NonCachedToString()\r\n    {\r\n        _setter(_address, null);\r\n        return _address.ToString();\r\n    }\r\n\r\n    private static Action&lt;TSource, TArg&gt; 
BuildSetter&lt;TSource, TArg&gt;(FieldInfo field)\r\n    {\r\n        ParameterExpression target = Expression.Parameter(typeof(TSource));\r\n        ParameterExpression value = Expression.Parameter(typeof(TArg));\r\n        return Expression.Lambda&lt;Action&lt;TSource, TArg&gt;&gt;(\r\n            Expression.Assign(Expression.Field(target, field), value),\r\n            target,\r\n            value).Compile();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonCachedToString<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">48.39 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NonCachedToString<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">36.30 ns<\/td>\n<td style=\"text-align: right\">0.75<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>But .NET 8 actually includes a feature that streamlines this; the feature&#8217;s primary purpose is in support of scenarios like source generators with Native AOT, but it&#8217;s useful for this kind of benchmarking, too. The new <code>UnsafeAccessor<\/code> attribute (introduced in and supported by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86932\">dotnet\/runtime#86932<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88626\">dotnet\/runtime#88626<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88925\">dotnet\/runtime#88925<\/a>) lets you define an <code>extern<\/code> method that bypasses visibility. 
In this case, I&#8217;ve used it to get a <code>ref<\/code> to the private field, at which point I can just assign <code>null<\/code> through the <code>ref<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly IPAddress _address = IPAddress.Parse(\"123.123.123.123\");\r\n\r\n    [Benchmark]\r\n    public string NonCachedToString()\r\n    {\r\n        _toString(_address) = null;\r\n        return _address.ToString();\r\n\r\n        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = \"_toString\")]\r\n        extern static ref string _toString(IPAddress c);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonCachedToString<\/td>\n<td style=\"text-align: right\">34.42 ns<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Uri<\/code> is another networking primitive that saw multiple improvements. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80469\">dotnet\/runtime#80469<\/a> removed a variety of allocations, primarily around substrings that were instead replaced by spans. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90087\">dotnet\/runtime#90087<\/a> replaced unsafe code as part of scheme parsing with safe span-based code, making it both safer and faster. But <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88012\">dotnet\/runtime#88012<\/a> is more interesting, as it made <code>Uri<\/code> implement <code>ISpanFormattable<\/code>. 
That means that when, for example, a <code>Uri<\/code> is used as an argument to an interpolated string, the <code>Uri<\/code> can now format itself directly to the underlying buffer rather than needing to allocate a temporary string that&#8217;s then added in. This can be particularly useful for reducing the costs of logging and other forms of telemetry. It&#8217;s a little difficult to isolate just the formatting aspect of a <code>Uri<\/code> for benchmarking purposes, as <code>Uri<\/code> caches information gathered in the process, but even with constructing a new one each time you can see gains:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public string Interpolate() =&gt; $\"Uri: {new Uri(\"http:\/\/dot.net\")}\";\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Interpolate<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">356.3 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">296 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Interpolate<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">278.4 ns<\/td>\n<td style=\"text-align: right\">0.78<\/td>\n<td style=\"text-align: right\">240 B<\/td>\n<td style=\"text-align: right\">0.81<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other networking primitives improved in other ways. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82095\">dotnet\/runtime#82095<\/a> reduced the overhead of the <code>GetHashCode<\/code> methods of several networking types, like <code>Cookie<\/code>. <code>Cookie.GetHashCode<\/code> was previously allocating and is now allocation-free. Same for <code>DnsEndPoint.GetHashCode<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly Cookie _cookie = new Cookie(\"Cookie\", \"Monster\");\r\n    private readonly DnsEndPoint _dns = new DnsEndPoint(\"localhost\", 80);\r\n\r\n    [Benchmark]\r\n    public int CookieHashCode() =&gt; _cookie.GetHashCode();\r\n\r\n    [Benchmark]\r\n    public int DnsHashCode() =&gt; _dns.GetHashCode();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CookieHashCode<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">105.30 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">160 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CookieHashCode<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">22.51 ns<\/td>\n<td style=\"text-align: right\">0.21<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td 
style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>DnsHashCode<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">136.78 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">192 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>DnsHashCode<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">12.92 ns<\/td>\n<td style=\"text-align: right\">0.09<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And <code>HttpUtility<\/code> improved in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78240\">dotnet\/runtime#78240<\/a>. This is a quintessential example of code doing its own manual looping looking for something (in this case, the four characters that require encoding) when it could have instead just used a well-placed <code>IndexOfAny<\/code>.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Web;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    [Benchmark]\r\n    public string HtmlAttributeEncode() =&gt;\r\n        HttpUtility.HtmlAttributeEncode(\"To encode, or not to encode: that is the question\");\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HtmlAttributeEncode<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">32.688 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>HtmlAttributeEncode<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">6.734 ns<\/td>\n<td style=\"text-align: 
right\">0.21<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Moving up the stack to <code>System.Net.Sockets<\/code>, there are some nice improvements in .NET 8 here as well.<\/p>\n<h3>Sockets<\/h3>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86524\">dotnet\/runtime#86524<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89808\">dotnet\/runtime#89808<\/a> are for Windows only because the problem they address doesn&#8217;t manifest on other operating systems, due to how asynchronous operations are implemented on the various platforms.<\/p>\n<p>On Unix operating systems, the typical approach to asynchrony is to put the socket into non-blocking mode. Issuing an operation like <code>recv<\/code> (<code>Socket.Receive{Async}<\/code>) when there&#8217;s nothing to receive then fails immediately with an <code>errno<\/code> value of <code>EWOULDBLOCK<\/code> or <code>EAGAIN<\/code>, informing the caller that no data was available to receive yet and it&#8217;s not going to wait for said data because it&#8217;s been told not to. At that point, the caller can choose how it wants to wait for data to become available. <code>Socket<\/code> does what many other systems do, which is to use <code>epoll<\/code> (on Linux) or <code>kqueues<\/code> (on macOS). These mechanisms allow for a single thread to wait efficiently for any number of registered file descriptors to signal that something has changed. As such, <code>Socket<\/code> has one or more dedicated threads that sit in a wait loop, waiting on the <code>epoll<\/code>\/<code>kqueue<\/code> to signal that there&#8217;s something to do, and when there is, queueing off the associated work, and then looping around to wait for the next notification. In the case of a <code>ReceiveAsync<\/code>, that queued work will end up reissuing the <code>recv<\/code>, which will now succeed as data will be available. 
The interesting thing here is that during that interim period while waiting for data to become available, there was no pending call from .NET to <code>recv<\/code> or anything else that would require a managed buffer (e.g. an array) be available. That&#8217;s not the case on Windows&#8230;<\/p>\n<p>On Windows, the OS provides dedicated asynchronous APIs (&#8220;overlapped I\/O&#8221;), with <code>ReceiveAsync<\/code> being a thin wrapper around the Win32 <code>WSARecv<\/code> function. <code>WSARecv<\/code> accepts a pointer to the buffer to write into and a pointer to a callback that will be invoked when the operation has completed. That means that while waiting for data to be available, <code>WSARecv<\/code> actually needs a pointer to the buffer it&#8217;ll write the data into (unless 0 bytes have been requested, which we&#8217;ll talk more about in a bit). In .NET world, buffers are typically on the managed heap, which means they can be moved around by the GC, and thus in order to pass a pointer to such a buffer down to <code>WSARecv<\/code>, that buffer needs to be &#8220;pinned,&#8221; telling the GC &#8220;do not move this.&#8221; For synchronous operations, such pinning is best accomplished with the C# <code>fixed<\/code> keyword; for asynchronous operations, <code>GCHandle<\/code> or something that wraps it (like <code>Memory.Pin<\/code> and <code>MemoryHandle<\/code>) are the answers. So, on Windows, <code>Socket<\/code> uses a <code>GCHandle<\/code> for any buffers it supplies to the OS to span an asynchronous operation&#8217;s lifetime.<\/p>\n<p>For the last 20 years, though, it&#8217;s been overaggressive in doing so. There&#8217;s a buffer passed to various Win32 methods, including <code>WSAConnect<\/code> (<code>Socket.ConnectAsync<\/code>), to represent the target IP address. 
Even though these are asynchronous operations, it turns out that data is only required as part of the synchronous part of the call to these APIs; only a <code>ReceiveFromAsync<\/code> operation (which is typically only used with connectionless protocols, and in particular UDP) that receives not only payload data but also the sender&#8217;s address actually needs the address buffer pinned over the lifetime of the operation. <code>Socket<\/code> was pinning the buffer using a <code>GCHandle<\/code>, and in fact doing so for the lifetime of the <code>Socket<\/code>, even though a <code>GCHandle<\/code> wasn&#8217;t actually needed at all for these calls, and a <code>fixed<\/code> would suffice around just the Win32 call itself. The first PR fixed that, the net effect of which is that a <code>GCHandle<\/code> that was previously pinning a buffer for the lifetime of every <code>Socket<\/code> on Windows now only does so for <code>Socket<\/code>s issuing <code>ReceiveFromAsync<\/code> calls. The second PR then fixed <code>ReceiveFromAsync<\/code>, using a native buffer instead of a managed one that would need to be permanently pinned. The primary benefit of these changes is that they help to avoid a lot of fragmentation that can result at scale in the managed heap. 
We can see this most easily by looking at the runtime&#8217;s tracing, which I consume in this example via an <code>EventListener<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0\r\n\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Net;\r\nusing System.Net.Sockets;\r\nusing System.Diagnostics.Tracing;\r\n\r\nusing var setCountListener = new GCHandleListener();\r\nThread.Sleep(1000);\r\n\r\nusing Socket listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\nlistener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\nlistener.Listen();\r\n\r\nfor (int i = 0; i &lt; 10_000; i++)\r\n{\r\n    using Socket client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n\r\n    await client.ConnectAsync(listener.LocalEndPoint!);\r\n    listener.Accept().Dispose();\r\n}\r\n\r\nThread.Sleep(1000);\r\nConsole.WriteLine($\"{Environment.Version} GCHandle count: {setCountListener.SetGCHandleCount}\");\r\n\r\nsealed class GCHandleListener : EventListener\r\n{\r\n    public int SetGCHandleCount = 0;\r\n\r\n    protected override void OnEventSourceCreated(EventSource eventSource)\r\n    {\r\n        if (eventSource.Name == \"Microsoft-Windows-DotNETRuntime\")\r\n            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x2);\r\n    }\r\n\r\n    protected override void OnEventWritten(EventWrittenEventArgs eventData)\r\n    {\r\n        \/\/ https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/diagnostics\/runtime-garbage-collection-events#setgchandle-event\r\n        if (eventData.EventId == 30 &amp;&amp; eventData.Payload![2] is (uint)3)\r\n            Interlocked.Increment(ref SetGCHandleCount);\r\n    }\r\n}<\/code><\/pre>\n<p>When I run this on .NET 7 on Windows, I get this:<\/p>\n<pre><code class=\"language-text\">7.0.9 GCHandle count: 10000<\/code><\/pre>\n<p>When I run this on .NET 8, I get this:<\/p>\n<pre><code class=\"language-text\">8.0.0 GCHandle 
count: 0<\/code><\/pre>\n<p>Nice.<\/p>\n<p>I mentioned UDP above, with <code>ReceiveFromAsync<\/code>. We&#8217;ve invested a lot over the last several years in making the networking stack in .NET very efficient&#8230; for TCP. While most of the improvements there accrue to UDP as well, UDP has additional costs that hadn&#8217;t been addressed and that made it suboptimal from a performance perspective. The primary issues there are now addressed in .NET 8, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88970\">dotnet\/runtime#88970<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90086\">dotnet\/runtime#90086<\/a>. The key problem here with the UDP-related APIs, namely <code>SendTo{Async}<\/code> and <code>ReceiveFrom{Async}<\/code>, is that the API is based on <code>EndPoint<\/code> but the core implementation is based on <code>SocketAddress<\/code>. Every call to <code>SendToAsync<\/code>, for example, would accept the provided <code>EndPoint<\/code> and then call <code>EndPoint.Serialize<\/code> to produce a <code>SocketAddress<\/code>, which internally has its own <code>byte[]<\/code>; that <code>byte[]<\/code> contains the address actually passed down to the underlying OS APIs. The inverse happens on the <code>ReceiveFromAsync<\/code> side: the received data includes an address that would be deserialized into an <code>EndPoint<\/code> which is then returned to the consumer. 
You can see these allocations show up by profiling a simple repro:<\/p>\n<pre><code class=\"language-C#\">using System.Net;\r\nusing System.Net.Sockets;\r\n\r\nvar client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\nvar server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n\r\nEndPoint endpoint = new IPEndPoint(IPAddress.Loopback, 12345);\r\nserver.Bind(endpoint);\r\n\r\nMemory&lt;byte&gt; buffer = new byte[1];\r\n\r\nfor (int i = 0; i &lt; 10_000; i++)\r\n{\r\n    ValueTask&lt;SocketReceiveFromResult&gt; result = server.ReceiveFromAsync(buffer, endpoint);\r\n    await client.SendToAsync(buffer, endpoint);\r\n    await result;\r\n}<\/code><\/pre>\n<p>The .NET allocation profiler in Visual Studio shows this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/SocketUdpAllocationsInNet7.png\" alt=\"Allocations in a UDP benchmark in .NET 7\" \/><\/p>\n<p>So for each send\/receive pair, we see three <code>SocketAddress<\/code>es which in turn lead to three <code>byte[]<\/code>s, and an <code>IPEndPoint<\/code> which in turn leads to an <code>IPAddress<\/code>. These costs are very difficult to address efficiently purely in implementation, as they&#8217;re directly related to what&#8217;s surfaced in the corresponding APIs. Even so, with the exact same code, it does improve a bit in .NET 8:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/SocketUdpAllocationsInNet8.png\" alt=\"Allocations in a UDP benchmark in .NET 8\" \/><\/p>\n<p>So with zero code changes, we&#8217;ve managed to eliminate one of the <code>SocketAddress<\/code> allocations and its associated <code>byte[]<\/code>, and to shrink the size of the remaining instances (in part due to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78860\">dotnet\/runtime#78860<\/a>). 
But, we can do much better&#8230;<\/p>\n<p>.NET 8 introduces a new set of overloads. In .NET 7, we had these:<\/p>\n<pre><code class=\"language-C#\">public int SendTo(byte[] buffer, int offset, int size, SocketFlags socketFlags, EndPoint remoteEP);\r\npublic int ReceiveFrom(byte[] buffer, int offset, int size, SocketFlags socketFlags, ref EndPoint remoteEP);\r\n\r\npublic ValueTask&lt;int&gt; SendToAsync(ReadOnlyMemory&lt;byte&gt; buffer, SocketFlags socketFlags, EndPoint remoteEP, CancellationToken cancellationToken = default);\r\npublic ValueTask&lt;SocketReceiveFromResult&gt; ReceiveFromAsync(Memory&lt;byte&gt; buffer, SocketFlags socketFlags, EndPoint remoteEndPoint, CancellationToken cancellationToken = default);<\/code><\/pre>\n<p>and now in .NET 8 we also have these:<\/p>\n<pre><code class=\"language-C#\">public int SendTo(ReadOnlySpan&lt;byte&gt; buffer, SocketFlags socketFlags, SocketAddress socketAddress);\r\npublic int ReceiveFrom(Span&lt;byte&gt; buffer, SocketFlags socketFlags, SocketAddress receivedAddress);\r\n\r\npublic ValueTask&lt;int&gt; SendToAsync(ReadOnlyMemory&lt;byte&gt; buffer, SocketFlags socketFlags, SocketAddress socketAddress, CancellationToken cancellationToken = default);\r\npublic ValueTask&lt;int&gt; ReceiveFromAsync(Memory&lt;byte&gt; buffer, SocketFlags socketFlags, SocketAddress receivedAddress, CancellationToken cancellationToken = default);<\/code><\/pre>\n<p>Key things to note:<\/p>\n<ul>\n<li>The new APIs no longer work in terms of <code>EndPoint<\/code>. They now operate on <code>SocketAddress<\/code> directly. That means the implementation no longer needs to call <code>EndPoint.Serialize<\/code> to produce a <code>SocketAddress<\/code> and can just use the provided one directly.<\/li>\n<li>There&#8217;s no more <code>ref EndPoint<\/code> argument in the synchronous <code>ReceiveFrom<\/code> and no more <code>SocketReceiveFromResult<\/code> in the asynchronous <code>ReceiveFromAsync<\/code>. 
Both of these existed in order to pass back an <code>IPEndPoint<\/code> that represented the address of the received data&#8217;s sender. <code>SocketAddress<\/code>, however, is just a strongly-typed wrapper around a <code>byte[]<\/code> buffer, which means these methods can just mutate that provided instance, avoiding needing to instantiate anything to represent the received address.<\/li>\n<\/ul>\n<p>Let&#8217;s change our code sample to use these new APIs:<\/p>\n<pre><code class=\"language-C#\">using System.Net;\r\nusing System.Net.Sockets;\r\n\r\nvar client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\nvar server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n\r\nEndPoint endpoint = new IPEndPoint(IPAddress.Loopback, 12345);\r\nserver.Bind(endpoint);\r\n\r\nMemory&lt;byte&gt; buffer = new byte[1];\r\nSocketAddress receiveAddress = endpoint.Serialize();\r\nSocketAddress sendAddress = endpoint.Serialize();\r\n\r\nfor (int i = 0; i &lt; 10_000; i++)\r\n{\r\n    ValueTask&lt;int&gt; result = server.ReceiveFromAsync(buffer, SocketFlags.None, receiveAddress);\r\n    await client.SendToAsync(buffer, SocketFlags.None, sendAddress);\r\n    await result;\r\n}<\/code><\/pre>\n<p>When I profile that, and again look for objects created at least once per iteration, I now see this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/SocketUdpAllocationsInNet8WithNewOverloads.png\" alt=\"Allocations in a UDP benchmark in .NET 8 with new overloads\" \/><\/p>\n<p>That&#8217;s not a mistake; I didn&#8217;t accidentally crop the screenshot incorrectly. It&#8217;s empty because there are no allocations per iteration; the whole program incurs only three <code>SocketAddress<\/code> allocations as part of the up-front setup. 
We can see that more clearly with a standard BenchmarkDotNet repro:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\nusing System.Net.Sockets;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly Memory&lt;byte&gt; _buffer = new byte[1];\r\n    SocketAddress _sendAddress, _receiveAddress;\r\n    IPEndPoint _ep;\r\n    Socket _client, _server;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n        _server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n\r\n        _ep = new IPEndPoint(IPAddress.Loopback, 12345);\r\n        _server.Bind(_ep);\r\n\r\n        _sendAddress = _ep.Serialize();\r\n        _receiveAddress = _ep.Serialize();\r\n    }\r\n\r\n    [Benchmark(OperationsPerInvoke = 1_000, Baseline = true)]\r\n    public async Task ReceiveFromSendToAsync_EndPoint()\r\n    {\r\n        for (int i = 0; i &lt; 1_000; i++)\r\n        {\r\n            var result = _server.ReceiveFromAsync(_buffer, SocketFlags.None, _ep);\r\n            await _client.SendToAsync(_buffer, SocketFlags.None, _ep);\r\n            await result;\r\n        }\r\n    }\r\n\r\n    [Benchmark(OperationsPerInvoke = 1_000)]\r\n    public async Task ReceiveFromSendToAsync_SocketAddress()\r\n    {\r\n        for (int i = 0; i &lt; 1_000; i++)\r\n        {\r\n            var result = _server.ReceiveFromAsync(_buffer, SocketFlags.None, _receiveAddress);\r\n            await _client.SendToAsync(_buffer, SocketFlags.None, _sendAddress);\r\n            await result;\r\n        }\r\n    
}\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ReceiveFromSendToAsync_EndPoint<\/td>\n<td style=\"text-align: right\">32.48 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">216 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ReceiveFromSendToAsync_SocketAddress<\/td>\n<td style=\"text-align: right\">31.78 us<\/td>\n<td style=\"text-align: right\">0.98<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>TLS<\/h2>\n<p>Moving up the stack further, <code>SslStream<\/code> has received some love in this release. While in previous releases work was done to reduce allocation, .NET 8 sees it reduced further:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74619\">dotnet\/runtime#74619<\/a> avoids some allocations related to ALPN. Application-Layer Protocol Negotiation is a mechanism that allows higher-level protocols to piggyback on the roundtrips already being performed as part of a TLS handshake. It&#8217;s used by an HTTP client and server to negotiate which HTTP version to use (e.g. HTTP\/2 or HTTP\/1.1). 
Previously, the implementation would end up allocating a <code>byte[]<\/code> for use with this HTTP version selection, but now with this PR, the implementation precomputes <code>byte[]<\/code>s for the most common protocol selections, avoiding the need to re-allocate those <code>byte[]<\/code>s on each new connection.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81096\">dotnet\/runtime#81096<\/a> removes a delegate allocation by moving some code around between the main <code>SslStream<\/code> implementation and the Platform Abstraction Layer (PAL) that&#8217;s used to handle OS-specific code (everything in the <code>SslStream<\/code> layer is compiled into <code>System.Net.Security.dll<\/code> regardless of OS, and then depending on the target OS, a different version of the <code>SslStreamPal<\/code> class is compiled in).<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84690\">dotnet\/runtime#84690<\/a> from <a href=\"https:\/\/github.com\/am11\">@am11<\/a> avoids a gigantic <code>Dictionary&lt;TlsCipherSuite, TlsCipherSuiteData&gt;<\/code> that was being created to enable querying for information about a particular cipher suite for use with TLS. Instead of a dictionary mapping a <code>TlsCipherSuite<\/code> enum to a <code>TlsCipherSuiteData<\/code> struct (which contained details like an <code>ExchangeAlgorithmType<\/code> enum value, a <code>CipherAlgorithmType<\/code> enum value, an <code>int<\/code> <code>CipherAlgorithmStrength<\/code>, etc.), a <code>switch<\/code> statement is used, mapping that same <code>TlsCipherSuite<\/code> enum to an <code>int<\/code> that&#8217;s packed with all the same information. This not only avoids the run-time costs associated with allocating that dictionary and populating it, it also shaves almost 20Kb off a published Native AOT binary, due to all of the code that was necessary to populate the dictionary. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84921\">dotnet\/runtime#84921<\/a> from <a href=\"https:\/\/github.com\/am11\">@am11<\/a> uses a similar <code>switch<\/code> for well-known OIDs.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86163\">dotnet\/runtime#86163<\/a> changed an internal <code>ProtocolToken<\/code> class into a struct, passing it around by <code>ref<\/code> instead.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74695\">dotnet\/runtime#74695<\/a> avoids some <code>SafeHandle<\/code> allocation in interop as part of certificate handling on Linux. <code>SafeHandle<\/code>s are a valuable reliability feature in .NET: they wrap a native handle \/ file descriptor, providing the finalizer that ensures the resource isn&#8217;t leaked, but also providing ref counting to ensure that the resource isn&#8217;t closed while it&#8217;s still being used, leading to use-after-free and handle recycling bugs. They&#8217;re particularly helpful when a handle or file descriptor needs to be passed around and shared between multiple components, often as part of some larger object model (e.g. a <code>FileStream<\/code> wraps a <code>SafeFileHandle<\/code>). However, in some cases they&#8217;re unnecessary overhead. 
If you have a pattern like:\n<pre><code class=\"language-C#\">SafeHandle handle = GetResource();\r\ntry { Use(handle); }\r\nfinally { handle.Dispose(); }<\/code><\/pre>\n<p>such that the resource is provably used and freed correctly, you can avoid the <code>SafeHandle<\/code> and instead just use the resource directly:<\/p>\n<pre><code class=\"language-C#\">IntPtr handle = GetResource();\r\ntry { Use(handle); }\r\nfinally { Free(handle); }<\/code><\/pre>\n<p>thereby saving on the allocation of a finalizable object (which is more expensive than a normal allocation as synchronization is required to add that object to a finalization queue in the GC) as well as on ref-counting overhead associated with using a <code>SafeHandle<\/code> in interop.<\/p>\n<\/li>\n<\/ul>\n<p>This benchmark repeatedly creates new <code>SslStream<\/code>s and performs handshakes:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\nusing System.Net.Security;\r\nusing System.Net.Sockets;\r\nusing System.Runtime.InteropServices;\r\nusing System.Security.Authentication;\r\nusing System.Security.Cryptography;\r\nusing System.Security.Cryptography.X509Certificates;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private NetworkStream _client, _server;\r\n    private readonly SslServerAuthenticationOptions _options = new SslServerAuthenticationOptions\r\n    {\r\n        ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),\r\n    };\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n        listener.Bind(new 
IPEndPoint(IPAddress.Loopback, 0));\r\n        listener.Listen(1);\r\n\r\n        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };\r\n        client.Connect(listener.LocalEndPoint);\r\n\r\n        Socket serverSocket = listener.Accept();\r\n        serverSocket.NoDelay = true;\r\n        _server = new NetworkStream(serverSocket, ownsSocket: true);\r\n        _client = new NetworkStream(client, ownsSocket: true);\r\n    }\r\n\r\n    [GlobalCleanup]\r\n    public void Cleanup()\r\n    {\r\n        _client.Dispose();\r\n        _server.Dispose();\r\n    }\r\n\r\n    [Benchmark]\r\n    public async Task Handshake()\r\n    {\r\n        using var client = new SslStream(_client, leaveInnerStreamOpen: true, delegate { return true; });\r\n        using var server = new SslStream(_server, leaveInnerStreamOpen: true, delegate { return true; });\r\n\r\n        await Task.WhenAll(\r\n            client.AuthenticateAsClientAsync(\"localhost\", null, SslProtocols.Tls12, checkCertificateRevocation: false),\r\n            server.AuthenticateAsServerAsync(_options));\r\n    }\r\n\r\n    private static X509Certificate2 GetCertificate()\r\n    {\r\n        X509Certificate2 cert;\r\n        using (RSA rsa = RSA.Create())\r\n        {\r\n            var certReq = new CertificateRequest(\"CN=localhost\", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);\r\n            certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));\r\n            certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid(\"1.3.6.1.5.5.7.3.1\") }, false));\r\n            certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));\r\n            cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));\r\n            if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))\r\n       
     {\r\n                cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));\r\n            }\r\n        }\r\n        return cert;\r\n    }\r\n}<\/code><\/pre>\n<p>It shows an ~13% reduction in overall allocation as part of the <code>SslStream<\/code> lifecycle:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Handshake<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">828.5 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">7.07 KB<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Handshake<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">769.0 us<\/td>\n<td style=\"text-align: right\">0.93<\/td>\n<td style=\"text-align: right\">6.14 KB<\/td>\n<td style=\"text-align: right\">0.87<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>My favorite <code>SslStream<\/code> improvement in .NET 8, though, is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87563\">dotnet\/runtime#87563<\/a>, which teaches <code>SslStream<\/code> to do &#8220;zero-byte reads&#8221; in order to minimize buffer use and pinning. This has been a long time coming, and is the result of multiple users of <code>SslStream<\/code> reporting significant heap fragmentation.<\/p>\n<p>When a read is issued to <code>SslStream<\/code>, it in turn needs to issue a read on the underlying <code>Stream<\/code>; the data it reads has a header, which gets peeled off, and then the remaining data is decrypted and stored into the user&#8217;s buffer. 
Since there&#8217;s manipulation of the data read from the underlying <code>Stream<\/code>, including not giving all of it to the user, <code>SslStream<\/code> doesn&#8217;t just pass the user&#8217;s buffer to the underlying <code>Stream<\/code>, but instead passes its own buffer down. That means it needs a buffer to pass. With performance improvements in recent .NET releases, <code>SslStream<\/code> rents said buffer on demand from the <code>ArrayPool<\/code> and returns it as soon as that temporary buffer has been drained of all the data read into it. There are two issues with this, though. On Windows, a buffer is being provided to <code>Socket<\/code>, which needs to pin the buffer in order to give a pointer to that buffer to the Win32 overlapped I\/O operation; that pinning means the GC can&#8217;t move the buffer on the heap, which can mean gaps end up being left on the heap that aren&#8217;t usable (aka &#8220;fragmentation&#8221;), and that in turn can lead to sporadic out-of-memory conditions. As noted earlier, the <code>Socket<\/code> implementation on Linux and macOS doesn&#8217;t need to do such pinning, however there&#8217;s still a problem here. Imagine you have a thousand open connections, or a million open connections, all of which are sitting in a read waiting for data; even if there&#8217;s no pinning, if each of those connections has an <code>SslStream<\/code> that&#8217;s rented a buffer of any meaningful size, that&#8217;s a whole lot of wasted memory just sitting there.<\/p>\n<p>An answer to this that .NET has been making more and more use of over the last few years is &#8220;zero-byte reads.&#8221; If you need to read 100 bytes, rather than handing down your 100-byte buffer, at which point it needs to be pinned, you instead issue a read for 0 bytes, handing down an empty buffer, at which point nothing needs to be pinned. 
When there&#8217;s data available, that zero-byte read completes (without consuming anything), and then you issue the actual read for the 100 bytes, which is much more likely to be synchronously satisfiable at that point. As of .NET 6, <code>SslStream<\/code> is already capable of passing along zero-byte reads, e.g. if you do <code>sslStream.ReadAsync(emptyBuffer)<\/code> and it doesn&#8217;t have any data buffered already, it&#8217;ll in turn issue a zero-byte read on the underlying <code>Stream<\/code>. However, today <code>SslStream<\/code> itself doesn&#8217;t <em>create<\/em> zero-byte reads, e.g. if you do <code>sslStream.ReadAsync(someNonEmptyBuffer)<\/code> and it doesn&#8217;t have enough data buffered, it in turn will issue a non-zero-byte read, and we&#8217;re back to pinning per operation at the <code>Socket<\/code> layer, plus needing a buffer to pass down, which means renting one.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87563\">dotnet\/runtime#87563<\/a> teaches <code>SslStream<\/code> how to create zero-byte reads. Now when you do <code>sslStream.ReadAsync(someNonEmptyBuffer)<\/code> and the <code>SslStream<\/code> doesn&#8217;t have enough data buffered, rather than immediately renting a buffer and passing that down, it instead issues a zero-byte read on the underlying <code>Stream<\/code>. Only once that operation completes does it then proceed to actually rent a buffer and issue another read, this time with the rented buffer. 
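<\/p>\n<p>The same wait-then-rent pattern can be applied by hand at the application level. The following is a minimal sketch under stated assumptions, not the <code>SslStream<\/code> implementation: <code>ReadWithZeroByteWaitAsync<\/code> is a hypothetical helper name, and it assumes the underlying <code>Stream<\/code> treats a zero-byte read as &#8220;complete when data is available&#8221; (a <code>MemoryStream<\/code> is used here only so the sample runs on its own):<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Buffers;\r\nusing System.IO;\r\nusing System.Threading;\r\nusing System.Threading.Tasks;\r\n\r\nusing var source = new MemoryStream(new byte[] { 1, 2, 3 });\r\nbyte[] destination = new byte[8];\r\nint bytesRead = await ReadWithZeroByteWaitAsync(source, destination);\r\nConsole.WriteLine(bytesRead); \/\/ 3\r\n\r\n\/\/ Hypothetical helper (not a .NET API): waits with a zero-byte read and\r\n\/\/ only rents a temporary buffer once data should be available.\r\nstatic async ValueTask&lt;int&gt; ReadWithZeroByteWaitAsync(\r\n    Stream stream, Memory&lt;byte&gt; destination, CancellationToken cancellationToken = default)\r\n{\r\n    \/\/ No buffer is handed down here, so nothing needs to be rented or pinned while waiting.\r\n    await stream.ReadAsync(Memory&lt;byte&gt;.Empty, cancellationToken);\r\n\r\n    byte[] rented = ArrayPool&lt;byte&gt;.Shared.Rent(destination.Length);\r\n    try\r\n    {\r\n        \/\/ The real read; it is now much more likely to complete synchronously.\r\n        int read = await stream.ReadAsync(rented.AsMemory(0, destination.Length), cancellationToken);\r\n        rented.AsMemory(0, read).CopyTo(destination);\r\n        return read;\r\n    }\r\n    finally\r\n    {\r\n        \/\/ Return the buffer as soon as it has been drained.\r\n        ArrayPool&lt;byte&gt;.Shared.Return(rented);\r\n    }\r\n}<\/code><\/pre>\n<p>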
The primary downside to this is a bit more overhead, in that it can lead to an extra syscall; however, our measurements show that overhead to largely be in the noise, with very meaningful upside in reduced fragmentation, working set reduction, and <code>ArrayPool<\/code> stability.<\/p>\n<p>The <code>GCHandle<\/code> reduction on Windows is visible with this app, a variation of one shown earlier:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0\r\n\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Net;\r\nusing System.Net.Security;\r\nusing System.Net.Sockets;\r\nusing System.Runtime.InteropServices;\r\nusing System.Security.Cryptography.X509Certificates;\r\nusing System.Security.Cryptography;\r\nusing System.Diagnostics.Tracing;\r\n\r\nvar listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\nvar client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\nlistener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\nlistener.Listen();\r\n\r\nclient.Connect(listener.LocalEndPoint!);\r\nSocket server = listener.Accept();\r\nlistener.Dispose();\r\n\r\nX509Certificate2 cert;\r\nusing (RSA rsa = RSA.Create())\r\n{\r\n    var certReq = new CertificateRequest(\"CN=localhost\", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);\r\n    certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));\r\n    certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid(\"1.3.6.1.5.5.7.3.1\") }, false));\r\n    certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));\r\n    cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));\r\n    if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))\r\n    {\r\n        cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));\r\n    }\r\n}\r\n\r\nvar clientStream = new 
SslStream(new NetworkStream(client, ownsSocket: true), false, delegate { return true; });\r\nvar serverStream = new SslStream(new NetworkStream(server, ownsSocket: true), false, delegate { return true; });\r\nawait Task.WhenAll(\r\n    clientStream.AuthenticateAsClientAsync(\"localhost\", null, false),\r\n    serverStream.AuthenticateAsServerAsync(cert, false, false));\r\n\r\nusing var setCountListener = new GCHandleListener();\r\n\r\nMemory&lt;byte&gt; buffer = new byte[1];\r\nfor (int i = 0; i &lt; 100_000; i++)\r\n{\r\n    ValueTask&lt;int&gt; read = clientStream.ReadAsync(buffer);\r\n    await serverStream.WriteAsync(buffer);\r\n    await read;\r\n}\r\n\r\nThread.Sleep(1000);\r\nConsole.WriteLine($\"{Environment.Version} GCHandle count: {setCountListener.SetGCHandleCount:N0}\");\r\n\r\nsealed class GCHandleListener : EventListener\r\n{\r\n    public int SetGCHandleCount = 0;\r\n\r\n    protected override void OnEventSourceCreated(EventSource eventSource)\r\n    {\r\n        if (eventSource.Name == \"Microsoft-Windows-DotNETRuntime\")\r\n            EnableEvents(eventSource, EventLevel.Informational, (EventKeywords)0x2);\r\n    }\r\n\r\n    protected override void OnEventWritten(EventWrittenEventArgs eventData)\r\n    {\r\n        \/\/ https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/diagnostics\/runtime-garbage-collection-events#setgchandle-event\r\n        if (eventData.EventId == 30 &amp;&amp; eventData.Payload[2] is (uint)3)\r\n            Interlocked.Increment(ref SetGCHandleCount);\r\n    }\r\n}<\/code><\/pre>\n<p>On .NET 7, this outputs:<\/p>\n<pre><code class=\"language-text\">7.0.9 GCHandle count: 100,000<\/code><\/pre>\n<p>whereas on .NET 8, I now get:<\/p>\n<pre><code class=\"language-text\">8.0.0 GCHandle count: 0<\/code><\/pre>\n<p>So pretty.<\/p>\n<h2>HTTP<\/h2>\n<p>The primary consumer of <code>SslStream<\/code> in .NET itself is the HTTP stack, so let&#8217;s move up the stack now to <code>HttpClient<\/code>, which has seen important gains of 
its own in .NET 8. As with <code>SslStream<\/code>, there were a bunch of improvements here that all joined to make for a measurable end-to-end improvement (many of the opportunities here were found as part of improving <a href=\"https:\/\/github.com\/microsoft\/reverse-proxy\">YARP<\/a>):<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74393\">dotnet\/runtime#74393<\/a> streamlined how HTTP\/1.1 response headers are parsed, including making better use of <code>IndexOfAny<\/code> to speed up searching for various delimiters demarcating portions of the response.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79525\">dotnet\/runtime#79525<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79524\">dotnet\/runtime#79524<\/a> restructured buffer management for reading and writing on HTTP\/1.1 connections.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81251\">dotnet\/runtime#81251<\/a> reduced the size of <code>HttpRequestMessage<\/code> by 8 bytes and <code>HttpRequestHeaders<\/code> by 16 bytes (on 64-bit). <code>HttpRequestMessage<\/code> had a <code>Boolean<\/code> field that was replaced by using a bit from an existing <code>int<\/code> field that wasn&#8217;t using all of its bits; as the rest of the message&#8217;s fields fit neatly into a multiple of 8 bytes, that extra <code>Boolean<\/code>, even though only a byte in size, required the object to grow by 8 bytes. For <code>HttpRequestHeaders<\/code>, it already had an optimization where some uncommonly used headers were pushed off into a contingently-allocated array; there were additional rarely used fields that made more sense to be contingent.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83640\">dotnet\/runtime#83640<\/a> shrank the size of various strongly typed <code>HeaderValue<\/code> types. 
For example, <code>ContentRangeHeaderValue<\/code> has three public properties <code>From<\/code>, <code>To<\/code>, and <code>Length<\/code>, all of which are <code>long?<\/code> aka <code>Nullable&lt;long&gt;<\/code>. Each of these properties was backed by a <code>Nullable&lt;long&gt;<\/code> field. Because of packing and alignment, <code>Nullable&lt;long&gt;<\/code> ends up consuming 16 bytes, 8 bytes for the <code>long<\/code> and then 8 bytes for the <code>bool<\/code> indicating whether the nullable has a value (<code>bool<\/code> is stored as a single byte, but because of alignment and packing, it&#8217;s rounded up to 8). Instead of storing these as <code>Nullable&lt;long&gt;<\/code>, they can just be <code>long<\/code>, using whether they contain a negative value to indicate whether they were initialized, reducing the size of the object from 72 bytes down to 48 bytes. Similar improvements were made to six other such <code>HeaderValue<\/code> types.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81253\">dotnet\/runtime#81253<\/a> tweaked how &#8220;Transfer-Encoding: chunked&#8221; is stored internally, special-casing it to avoid several allocations.<\/li>\n<li>When <code>Activity<\/code> is in use in order to enable the correlation of tracing information across end-to-end usage, every HTTP request ends up creating a new <code>Activity.Id<\/code>, which incurs not only the <code>string<\/code> for that ID, but also, in the making of it, a temporary <code>string<\/code> and a temporary <code>string[6]<\/code> array. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86685\">dotnet\/runtime#86685<\/a> removes both of those intermediate allocations by making better use of spans.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79484\">dotnet\/runtime#79484<\/a> is specific to HTTP\/2 and applies changes similar to those discussed for <code>SslStream<\/code>: it now rents buffers from the <code>ArrayPool<\/code> on demand, returning those buffers when idle, and it issues zero-byte reads to the underlying transport <code>Stream<\/code>. The net result of these changes is it can reduce the memory usage of an idle HTTP\/2 connection by up to 80Kb.<\/li>\n<\/ul>\n<p>We can use the following simple GET-request benchmark to see how some of these changes accrue to reduced overheads with <code>HttpClient<\/code>:<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Net;\r\nusing System.Net.Sockets;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private static readonly Socket s_listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n    private static readonly HttpMessageInvoker s_client = new(new SocketsHttpHandler());\r\n    private static Uri s_uri;\r\n\r\n    [Benchmark]\r\n    public async Task HttpGet()\r\n    {\r\n        var m = new HttpRequestMessage(HttpMethod.Get, s_uri);\r\n        using (HttpResponseMessage r = await s_client.SendAsync(m, default))\r\n        using (Stream s = r.Content.ReadAsStream())\r\n            await s.CopyToAsync(Stream.Null);\r\n    }\r\n\r\n    [GlobalSetup]\r\n    public void CreateSocketServer()\r\n    {\r\n        s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 
0));\r\n        s_listener.Listen(int.MaxValue);\r\n        var ep = (IPEndPoint)s_listener.LocalEndPoint;\r\n        s_uri = new Uri($\"http:\/\/{ep.Address}:{ep.Port}\/\");\r\n\r\n        Task.Run(async () =&gt;\r\n        {\r\n            while (true)\r\n            {\r\n                Socket s = await s_listener.AcceptAsync();\r\n                _ = Task.Run(() =&gt;\r\n                {\r\n                    using (var ns = new NetworkStream(s, true))\r\n                    {\r\n                        byte[] buffer = new byte[1024];\r\n                        int totalRead = 0;\r\n                        while (true)\r\n                        {\r\n                            int read = ns.Read(buffer, totalRead, buffer.Length - totalRead);\r\n                            if (read == 0) return;\r\n                            totalRead += read;\r\n                            if (buffer.AsSpan(0, totalRead).IndexOf(\"\\r\\n\\r\\n\"u8) == -1)\r\n                            {\r\n                                if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);\r\n                                continue;\r\n                            }\r\n\r\n                            ns.Write(\"HTTP\/1.1 200 OK\\r\\nDate: Sun, 05 Jul 2020 12:00:00 GMT \\r\\nServer: Example\\r\\nContent-Length: 5\\r\\n\\r\\nHello\"u8);\r\n\r\n                            totalRead = 0;\r\n                        }\r\n                    }\r\n                });\r\n            }\r\n        });\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>HttpGet<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">151.7 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1.52 
KB<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>HttpGet<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">136.0 us<\/td>\n<td style=\"text-align: right\">0.90<\/td>\n<td style=\"text-align: right\">1.41 KB<\/td>\n<td style=\"text-align: right\">0.93<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>WebSocket<\/code> also sees improvements in .NET 8. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87329\">dotnet\/runtime#87329<\/a>, <code>ManagedWebSocket<\/code> (the implementation that&#8217;s used by <code>ClientWebSocket<\/code> and that&#8217;s returned from <code>WebSocket.CreateFromStream<\/code>) gets in on the zero-byte reads game. In .NET 7, you could perform a zero-byte <code>ReceiveAsync<\/code> on <code>ManagedWebSocket<\/code>, but doing so would still issue a <code>ReadAsync<\/code> to the underlying stream with the receive header buffer. That in turn could cause the underlying <code>Stream<\/code> to rent and\/or pin a buffer. By special-casing zero-byte reads now in .NET 8, <code>ClientWebSocket<\/code> can take advantage of any special-casing in the base stream, and hopefully make it so that when the actual read is performed, the data necessary to satisfy it synchronously is already available.<\/p>\n<p>And with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75025\">dotnet\/runtime#75025<\/a>, allocation with <code>ClientWebSocket.ConnectAsync<\/code> is reduced. This is a nice example of really needing to pay attention to defaults. <code>ClientWebSocket<\/code> has an optimization where it maintains a shared singleton <code>HttpMessageInvoker<\/code> that it reuses between <code>ClientWebSocket<\/code> instances. However, it can only reuse them when the settings of the <code>ClientWebSocket<\/code> match the settings of that shared singleton; by default <code>ClientWebSocketOptions.Proxy<\/code> is set, and that&#8217;s enough to knock it off the path that lets it use the shared handler. 
This PR adds a second shared singleton for when <code>Proxy<\/code> is set, such that requests using the default proxy can now use a shared handler rather than creating a new one.<\/p>\n<h2>JSON<\/h2>\n<p>A significant focus for <code>System.Text.Json<\/code> in .NET 8 was on improving support for trimming and source-generated <code>JsonSerializer<\/code> implementations, as its usage ends up on critical code paths in a multitude of services and applications, including those that are a primary focus area for Native AOT. Thus, a lot of work went into adding features to the source generator whose absence might otherwise keep a developer from preferring to use it. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79828\">dotnet\/runtime#79828<\/a>, for example, added support for <code>required<\/code> and <code>init<\/code> properties in C#, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83631\">dotnet\/runtime#83631<\/a> added support for &#8220;unspeakable&#8221; types (such as the compiler-generated types used to implement iterator methods), and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84768\">dotnet\/runtime#84768<\/a> added better support for boxed values. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79397\">dotnet\/runtime#79397<\/a> also added support for weakly-typed but trimmer-safe <code>Serialize<\/code>\/<code>Deserialize<\/code> methods, taking <code>JsonTypeInfo<\/code>, that make it possible for ASP.NET and other such consumers to cache JSON contract metadata appropriately. All of these improvements are functionally valuable on their own, but also accrue to the overall goals of reducing deployed binary size, improving startup time, and generally being able to be successful with Native AOT and gaining the benefits it brings.<\/p>\n<p>Even with that focus, however, there were still some nice throughput-focused improvements that made their way into .NET 8. 
In particular, a key improvement in .NET 8 is that the <code>JsonSerializer<\/code> is now able to utilize generated &#8220;fast-path&#8221; methods even when streaming.<\/p>\n<p>One of the main things the JSON source generator does is generate at build-time all of the things <code>JsonSerializer<\/code> would otherwise need reflection to access at run-time, e.g. discovering the shape of a type, all of its members, their names, attributes that control their serialization, and so on. With just that, however, the serializer would still be using generic routines to perform operations like serialization, just doing so without needing to use reflection. Instead, the source generator can emit a customized serialization routine specific to the data in question, in order to optimize writing it out. For example, given the following types:<\/p>\n<pre><code class=\"language-C#\">public class Rectangle\r\n{\r\n    public int X, Y, Width, Height;\r\n    public Color Color;\r\n}\r\n\r\npublic struct Color\r\n{\r\n    public byte R, G, B, A;\r\n}\r\n\r\n[JsonSerializable(typeof(Rectangle))]\r\n[JsonSourceGenerationOptions(IncludeFields = true)]\r\nprivate partial class JsonContext : JsonSerializerContext { }<\/code><\/pre>\n<p>the source generator will include the following serialization routines in the generated code:<\/p>\n<pre><code class=\"language-C#\">private void RectangleSerializeHandler(global::System.Text.Json.Utf8JsonWriter writer, global::Tests.Rectangle? 
value)\r\n{\r\n    if (value == null)\r\n    {\r\n        writer.WriteNullValue();\r\n        return;\r\n    }\r\n\r\n    writer.WriteStartObject();\r\n\r\n    writer.WriteNumber(PropName_X, ((global::Tests.Rectangle)value).X);\r\n    writer.WriteNumber(PropName_Y, ((global::Tests.Rectangle)value).Y);\r\n    writer.WriteNumber(PropName_Width, ((global::Tests.Rectangle)value).Width);\r\n    writer.WriteNumber(PropName_Height, ((global::Tests.Rectangle)value).Height);\r\n    writer.WritePropertyName(PropName_Color);\r\n    ColorSerializeHandler(writer, ((global::Tests.Rectangle)value).Color);\r\n\r\n    writer.WriteEndObject();\r\n}\r\n\r\nprivate void ColorSerializeHandler(global::System.Text.Json.Utf8JsonWriter writer, global::Tests.Color value)\r\n{\r\n    writer.WriteStartObject();\r\n\r\n    writer.WriteNumber(PropName_R, value.R);\r\n    writer.WriteNumber(PropName_G, value.G);\r\n    writer.WriteNumber(PropName_B, value.B);\r\n    writer.WriteNumber(PropName_A, value.A);\r\n\r\n    writer.WriteEndObject();\r\n}<\/code><\/pre>\n<p>The serializer can then just invoke these routines to write the data directly to the <code>Utf8JsonWriter<\/code>.<\/p>\n<p>However, in the past these routines weren&#8217;t used when serializing with one of the streaming routines (e.g. all of the <code>SerializeAsync<\/code> methods), in part because of the complexity of refactoring the implementation to accommodate them, but in larger part out of concern that an individual instance being serialized might need to write more data than should be buffered; these fast paths are synchronous-only today, and so can&#8217;t perform asynchronous flushes efficiently. This is particularly unfortunate because these streaming overloads are the primary ones used by ASP.NET, which means ASP.NET wasn&#8217;t benefiting from these fast paths. Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78646\">dotnet\/runtime#78646<\/a>, in .NET 8 they now do benefit. 
The PR does the necessary refactoring internally and also puts in place various heuristics to minimize chances of over-buffering. The net result is these existing optimizations now kick in for a much broader array of use cases, including the primary ones higher in the stack, and the wins are significant.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json;\r\nusing System.Text.Json.Serialization;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic partial class Tests\r\n{\r\n    private readonly Rectangle _data = new()\r\n    {\r\n        X = 1, Y = 2,\r\n        Width = 3, Height = 4,\r\n        Color = new Color { R = 5, G = 6, B = 7, A = 8 }\r\n    };\r\n\r\n    [Benchmark]\r\n    public void Serialize() =&gt; JsonSerializer.Serialize(Stream.Null, _data, JsonContext.Default.Rectangle);\r\n\r\n    [Benchmark]\r\n    public Task SerializeAsync() =&gt; JsonSerializer.SerializeAsync(Stream.Null, _data, JsonContext.Default.Rectangle);\r\n\r\n    public class Rectangle\r\n    {\r\n        public int X, Y, Width, Height;\r\n        public Color Color;\r\n    }\r\n\r\n    public struct Color\r\n    {\r\n        public byte R, G, B, A;\r\n    }\r\n\r\n    [JsonSerializable(typeof(Rectangle))]\r\n    [JsonSourceGenerationOptions(IncludeFields = true)]\r\n    private partial class JsonContext : JsonSerializerContext { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Serialize<\/td>\n<td>.NET 7.0<\/td>\n<td 
style=\"text-align: right\">613.3 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">488 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Serialize<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">205.9 ns<\/td>\n<td style=\"text-align: right\">0.34<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>SerializeAsync<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">654.2 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">488 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SerializeAsync<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">259.6 ns<\/td>\n<td style=\"text-align: right\">0.40<\/td>\n<td style=\"text-align: right\">32 B<\/td>\n<td style=\"text-align: right\">0.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The fast-path routines are better leveraged in additional scenarios now, as well. Another case where they weren&#8217;t used, even when not streaming, was when combining multiple source-generated contexts: if you have your <code>JsonSerializerContext<\/code>-derived type for your own types to be serialized, and someone passes to you another <code>JsonSerializerContext<\/code>-derived type for a type they&#8217;re giving you to serialize, you need to combine those contexts together into something you can give to <code>Serialize<\/code>. In doing so, however, the fast paths could get lost. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80741\">dotnet\/runtime#80741<\/a> adds additional APIs and support to enable the fast paths to still be used.<\/p>\n<p>Beyond <code>JsonSerializer<\/code>, there have been several other performance improvements. 
In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88194\">dotnet\/runtime#88194<\/a>, for example, <code>JsonNode<\/code>&#8216;s implementation is streamlined, including avoiding allocating a delegate while setting values into the node, and in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85886\">dotnet\/runtime#85886<\/a>, <code>JsonNode.To<\/code> is improved via a one-line change that stops unnecessarily calling <code>Memory&lt;byte&gt;.ToArray()<\/code> in order to pass it to a method that accepts a <code>ReadOnlySpan&lt;byte&gt;<\/code>: <code>Memory&lt;byte&gt;.Span<\/code> can and should be used instead, saving on a potentially large array allocation and copy.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json.Nodes;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly JsonNode _node = JsonNode.Parse(\"\"\"{ \"Name\": \"Stephen\" }\"\"\"u8);\r\n\r\n    [Benchmark]\r\n    public string ToJsonString() =&gt; _node.ToString();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ToJsonString<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">244.5 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">272 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ToJsonString<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">189.6 ns<\/td>\n<td style=\"text-align: right\">0.78<\/td>\n<td 
style=\"text-align: right\">224 B<\/td>\n<td style=\"text-align: right\">0.82<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Lastly on the JSON front, there&#8217;s the new <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1869\">CA1869<\/a> analyzer added in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6850\">dotnet\/roslyn-analyzers#6850<\/a>.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/CA1869.png\" alt=\"CA1869\" \/><\/p>\n<p>The <code>JsonSerializerOptions<\/code> type looks like something that should be relatively cheap to allocate, just a small options type you could allocate on each call to <code>JsonSerializer.Serialize<\/code> or <code>JsonSerializer.Deserialize<\/code> with little ramification:<\/p>\n<pre><code class=\"language-C#\">T value = JsonSerializer.Deserialize&lt;T&gt;(source, new JsonSerializerOptions { AllowTrailingCommas = true });<\/code><\/pre>\n<p>That&#8217;s not the case, however. <code>JsonSerializer<\/code> may need to use reflection to analyze the type being serialized or deserialized in order to learn about its shape and then potentially even use reflection emit to generate custom processing code for using that type. The <code>JsonSerializerOptions<\/code> instance is then used not only as a simple bag for options information, but also as a place to store all of that state the serializer built up, enabling it to be shared from call to call. Prior to .NET 7, this meant that passing a new <code>JsonSerializerOptions<\/code> instance to each call resulted in a massive performance cliff. In .NET 7, the caching scheme was improved to combat the problems here, but even with those mitigations, there&#8217;s still significant overhead to using a new <code>JsonSerializerOptions<\/code> instance each time. 
Instead, a <code>JsonSerializerOptions<\/code> instance should be cached and reused.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Text.Json;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly string _json = \"\"\"{ \"Title\":\"Performance Improvements in .NET 8\", \"Author\":\"Stephen Toub\", }\"\"\";\r\n    private readonly JsonSerializerOptions _options = new JsonSerializerOptions { AllowTrailingCommas = true };\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public BlogData Deserialize_New() =&gt; JsonSerializer.Deserialize&lt;BlogData&gt;(_json, new JsonSerializerOptions { AllowTrailingCommas = true });\r\n\r\n    [Benchmark]\r\n    public BlogData Deserialize_Cached() =&gt; JsonSerializer.Deserialize&lt;BlogData&gt;(_json, _options);\r\n\r\n    public struct BlogData\r\n    {\r\n        public string Title { get; set; }\r\n        public string Author { get; set; }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Deserialize_New<\/td>\n<td style=\"text-align: right\">736.5 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">358 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Deserialize_Cached<\/td>\n<td style=\"text-align: right\">290.2 ns<\/td>\n<td style=\"text-align: right\">0.39<\/td>\n<td style=\"text-align: right\">176 B<\/td>\n<td style=\"text-align: 
right\">0.49<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Cryptography<\/h2>\n<p>Cryptography in .NET 8 sees a smattering of improvements, a few large ones and a bunch of smaller ones that contribute to removing some overhead across the system.<\/p>\n<p>One of the larger improvements, specific to Windows because it&#8217;s about switching what functionality is employed from the underlying OS, comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76277\">dotnet\/runtime#76277<\/a>. Windows CNG (&#8220;Next Generation&#8221;) provides two libraries: <code>bcrypt.dll<\/code> and <code>ncrypt.dll<\/code>. The former provides support for &#8220;ephemeral&#8221; operations, ones where the cryptographic key is in-memory only and generated on the fly as part of an operation. The latter supports both ephemeral and persisted-key operations, and as a result much of the .NET support has been based on <code>ncrypt.dll<\/code> since it&#8217;s more universal. This, however, can add unnecessary expense, as all of the operations are handled out-of-process by the <code>lsass.exe<\/code> service, and thus require remote procedure calls, which add overhead. 
This PR switches <code>RSA<\/code> ephemeral operations over to using <code>bcrypt<\/code> instead of <code>ncrypt<\/code>, and the results are noteworthy (in the future, we expect other algorithms to also switch).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Runtime.CompilerServices;\r\nusing System.Security.Cryptography;\r\nusing System.Security.Cryptography.X509Certificates;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\n[SkipLocalsInit]\r\npublic class Tests\r\n{\r\n    private static readonly RSA s_rsa = RSA.Create();\r\n    private static readonly byte[] s_signed = s_rsa.SignHash(new byte[256 \/ 8], HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);\r\n    private static readonly byte[] s_encrypted = s_rsa.Encrypt(new byte[3], RSAEncryptionPadding.OaepSHA256);\r\n    private static readonly X509Certificate2 s_cert = new X509Certificate2(Convert.FromBase64String(\"\"\"\r\n        MIIE7DCCA9SgAwIBAgITMwAAALARrwqL0Duf3QABAAAAsDANBgkqhkiG9w0BAQUFADB5MQswCQYDVQQGEwJVUzETMBEGA1UECBMKV2FzaGluZ3RvbjEQMA4GA1UEBxMH\r\n        UmVkbW9uZDEeMBwGA1UEChMVTWljcm9zb2Z0IENvcnBvcmF0aW9uMSMwIQYDVQQDExpNaWNyb3NvZnQgQ29kZSBTaWduaW5nIFBDQTAeFw0xMzAxMjQyMjMzMzlaFw0x\r\n        NDA0MjQyMjMzMzlaMIGDMQswCQYDVQQGEwJVUzETMBEGA1UECBMKV2FzaGluZ3RvbjEQMA4GA1UEBxMHUmVkbW9uZDEeMBwGA1UEChMVTWljcm9zb2Z0IENvcnBvcmF0\r\n        aW9uMQ0wCwYDVQQLEwRNT1BSMR4wHAYDVQQDExVNaWNyb3NvZnQgQ29ycG9yYXRpb24wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDor1yiIA34KHy8BXt\/\r\n        re7rdqwoUz8620B9s44z5lc\/pVEVNFSlz7SLqT+oN+EtUO01Fk7vTXrbE3aIsCzwWVyp6+HXKXXkG4Unm\/P4LZ5BNisLQPu+O7q5XHWTFlJLyjPFN7Dz636o9UEVXAhl\r\n        
HSE38Cy6IgsQsRCddyKFhHxPuRuQsPWj\/ov0DJpOoPXJCiHiquMBNkf9L4JqgQP1qTXclFed+0vUDoLbOI8S\/uPWenSIZOFixCUuKq6dGB8OHrbCryS0DlC83hyTXEmm\r\n        ebW22875cHsoAYS4KinPv6kFBeHgD3FN\/a1cI4Mp68fFSsjoJ4TTfsZDC5UABbFPZXHFAgMBAAGjggFgMIIBXDATBgNVHSUEDDAKBggrBgEFBQcDAzAdBgNVHQ4EFgQU\r\n        WXGmWjNN2pgHgP+EHr6H+XIyQfIwUQYDVR0RBEowSKRGMEQxDTALBgNVBAsTBE1PUFIxMzAxBgNVBAUTKjMxNTk1KzRmYWYwYjcxLWFkMzctNGFhMy1hNjcxLTc2YmMw\r\n        NTIzNDRhZDAfBgNVHSMEGDAWgBTLEejK0rQWWAHJNy4zFha5TJoKHzBWBgNVHR8ETzBNMEugSaBHhkVodHRwOi8vY3JsLm1pY3Jvc29mdC5jb20vcGtpL2NybC9wcm9k\r\n        dWN0cy9NaWNDb2RTaWdQQ0FfMDgtMzEtMjAxMC5jcmwwWgYIKwYBBQUHAQEETjBMMEoGCCsGAQUFBzAChj5odHRwOi8vd3d3Lm1pY3Jvc29mdC5jb20vcGtpL2NlcnRz\r\n        L01pY0NvZFNpZ1BDQV8wOC0zMS0yMDEwLmNydDANBgkqhkiG9w0BAQUFAAOCAQEAMdduKhJXM4HVncbr+TrURE0Inu5e32pbt3nPApy8dmiekKGcC8N\/oozxTbqVOfsN\r\n        4OGb9F0kDxuNiBU6fNutzrPJbLo5LEV9JBFUJjANDf9H6gMH5eRmXSx7nR2pEPocsHTyT2lrnqkkhNrtlqDfc6TvahqsS2Ke8XzAFH9IzU2yRPnwPJNtQtjofOYXoJto\r\n        aAko+QKX7xEDumdSrcHps3Om0mPNSuI+5PNO\/f+h4LsCEztdIN5VP6OukEAxOHUoXgSpRm3m9Xp5QL0fzehF1a7iXT71dcfmZmNgzNWahIeNJDD37zTQYx2xQmdKDku\/\r\n        Og7vtpU6pzjkJZIIpohmgg==\r\n        \"\"\"));\r\n\r\n    [Benchmark]\r\n    public void Encrypt()\r\n    {\r\n        Span&lt;byte&gt; src = stackalloc byte[3];\r\n        Span&lt;byte&gt; dest = stackalloc byte[s_rsa.KeySize &gt;&gt; 3];\r\n        s_rsa.TryEncrypt(src, dest, RSAEncryptionPadding.OaepSHA256, out _);\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Decrypt()\r\n    {\r\n        Span&lt;byte&gt; dest = stackalloc byte[s_rsa.KeySize &gt;&gt; 3];\r\n        s_rsa.TryDecrypt(s_encrypted, dest, RSAEncryptionPadding.OaepSHA256, out _);\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Verify()\r\n    {\r\n        Span&lt;byte&gt; hash = stackalloc byte[256 &gt;&gt; 3];\r\n        s_rsa.VerifyHash(hash, s_signed, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);\r\n    }\r\n\r\n    [Benchmark]\r\n    public void VerifyFromCert()\r\n    {\r\n    
    using RSA rsa = s_cert.GetRSAPublicKey();\r\n        Span&lt;byte&gt; sig = stackalloc byte[rsa.KeySize &gt;&gt; 3];\r\n        ReadOnlySpan&lt;byte&gt; hash = sig.Slice(0, 256 &gt;&gt; 3);\r\n        rsa.VerifyHash(hash, sig, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Encrypt<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">132.79 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Encrypt<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">19.72 us<\/td>\n<td style=\"text-align: right\">0.15<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Decrypt<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">653.77 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">57 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Decrypt<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">538.25 us<\/td>\n<td style=\"text-align: right\">0.82<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Verify<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: 
right\">94.92 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Verify<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">16.09 us<\/td>\n<td style=\"text-align: right\">0.17<\/td>\n<td style=\"text-align: right\">&#8211;<\/td>\n<td style=\"text-align: right\">0.00<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>VerifyFromCert<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">525.78 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">721 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>VerifyFromCert<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">31.60 us<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<td style=\"text-align: right\">696 B<\/td>\n<td style=\"text-align: right\">0.97<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For cases where implementations are still using <code>ncrypt<\/code>, there are, however, ways we can still avoid some of the remote procedure calls. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89599\">dotnet\/runtime#89599<\/a> does so by caching some information (in particular the key size) that doesn&#8217;t change but that still otherwise results in these remote procedure calls.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly byte[] _emptyDigest = new byte[256 \/ 8];\r\n    private byte[] _rsaSignedHash, _ecdsaSignedHash;\r\n    private RSACng _rsa;\r\n    private ECDsaCng _ecdsa;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        _rsa = new RSACng(2048);\r\n        _rsaSignedHash = _rsa.SignHash(_emptyDigest, HashAlgorithmName.SHA256, RSASignaturePadding.Pss);\r\n\r\n        _ecdsa = new ECDsaCng(256);\r\n        _ecdsaSignedHash = _ecdsa.SignHash(_emptyDigest);\r\n    }\r\n\r\n    [Benchmark]\r\n    public bool Rsa_VerifyHash() =&gt; _rsa.VerifyHash(_emptyDigest, _rsaSignedHash, HashAlgorithmName.SHA256, RSASignaturePadding.Pss);\r\n\r\n    [Benchmark]\r\n    public bool Ecdsa_VerifyHash() =&gt; _ecdsa.VerifyHash(_emptyDigest, _ecdsaSignedHash);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Toolchain<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Rsa_VerifyHash<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">130.27 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Rsa_VerifyHash<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">75.30 us<\/td>\n<td style=\"text-align: right\">0.58<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: 
right\"><\/td>\n<td style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Ecdsa_VerifyHash<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">400.23 us<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Ecdsa_VerifyHash<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">343.69 us<\/td>\n<td style=\"text-align: right\">0.86<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>System.Formats.Asn1<\/code> library provides the support used for encoding various data structures used in cryptographic protocols. For example, <code>AsnWriter<\/code> is used as part of <code>CertificateRequest<\/code> to create the <code>byte[]<\/code> that&#8217;s handed off to the <code>X509Certificate2<\/code>&#8216;s constructor. As part of this, it relies heavily on OIDs (object identifiers) used to uniquely identify things like specific cryptographic algorithms. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75485\">dotnet\/runtime#75485<\/a> imbues <code>AsnReader<\/code> and <code>AsnWriter<\/code> with knowledge of the most commonly used OIDs, making reading and writing with them significantly faster.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Formats.Asn1;\r\nusing System.Runtime.CompilerServices;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly AsnWriter _writer = new AsnWriter(AsnEncodingRules.DER);\r\n\r\n    [Benchmark]\r\n    public void Write()\r\n    {\r\n        _writer.Reset();\r\n        _writer.WriteObjectIdentifier(\"1.2.840.10045.4.3.3\"); \/\/ ECDsa with SHA384\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: 
right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Write<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">608.50 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Write<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">33.69 ns<\/td>\n<td style=\"text-align: right\">0.06<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Interestingly, this PR does most of its work in two large switch statements. The first is a nice example of using C# list patterns to <code>switch<\/code> over a span of bytes and efficiently match to a case. The second is a great example of the C# compiler optimization mentioned earlier around <code>switch<\/code>es and length bucketing. The internal <code>WellKnownOids.GetContents<\/code> function this adds to do the lookup is based on a giant switch with ~100 cases. The C# compiler ends up generating a <code>switch<\/code> over the length of the supplied OID string, and then in each length bucket, it either does a sequential scan through the small number of keys in that bucket, or it does a secondary switch over the character at a specific offset into the input, due to all of the keys having a discriminating character at that position.<\/p>\n<p>Another interesting change comes in <code>RandomNumberGenerator<\/code>, which is the cryptographically-secure RNG in <code>System.Security.Cryptography<\/code> (as opposed to the non-cryptographically secure <code>System.Random<\/code>). <code>RandomNumberGenerator<\/code> provides a <code>GetNonZeroBytes<\/code> method, which is the same as <code>GetBytes<\/code> but which promises not to yield any 0 values. It does so by using <code>GetBytes<\/code>, finding any produced 0s, removing them, and then calling <code>GetBytes<\/code> again to replace all of the 0 values (if that call happens to produce any 0s, then the process repeats). 
The previous implementation of <code>GetNonZeroBytes<\/code> was nicely using the vectorized <code>IndexOf((byte)0)<\/code> to search for a 0. Once it found one, however, it would shift the rest of the bytes down one at a time until the next zero. Since we expect 0s to be rare (on average, they should only occur once every 256 generated bytes), it&#8217;s much more efficient to search for the next 0 using a vectorized operation, and then shift everything down using a vectorized memory move operation. And that&#8217;s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81340\">dotnet\/runtime#81340<\/a> does.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing System.Security.Cryptography;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private static readonly RandomNumberGenerator s_rng = RandomNumberGenerator.Create();\r\n    private readonly byte[] _bytes = new byte[1024];\r\n\r\n    [Benchmark]\r\n    public void GetNonZeroBytes() =&gt; s_rng.GetNonZeroBytes(_bytes);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetNonZeroBytes<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,115.8 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>GetNonZeroBytes<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">650.8 ns<\/td>\n<td style=\"text-align: right\">0.58<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Finally, a variety of changes went in to reduce allocation:<\/p>\n<ul>\n<li><code>AsnWriter<\/code> now also has a constructor that lets a caller presize its internal buffer, thanks to <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73535\">dotnet\/runtime#73535<\/a>. That new constructor is then used in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81626\">dotnet\/runtime#81626<\/a> to improve throughput on other operations.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75138\">dotnet\/runtime#75138<\/a> removes a <code>string<\/code> allocation as part of reading certificates on Linux. Stack allocation and spans are used along with <code>Encoding.ASCII.GetString(ReadOnlySpan&lt;byte&gt;, Span&lt;char&gt;)<\/code> instead of <code>Encoding.ASCII.GetString(byte[])<\/code> that produces a <code>string<\/code>.<\/li>\n<li><code>ECDsa<\/code>&#8216;s <code>LegalKeySizes<\/code> don&#8217;t change. The property hands back a <code>KeySizes[]<\/code> array, and out of precaution it needs to return a new array on each access; however, the actual <code>KeySizes<\/code> instances are immutable. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76156\">dotnet\/runtime#76156<\/a> caches these <code>KeySizes<\/code> instances.<\/li>\n<\/ul>\n<h2>Logging<\/h2>\n<p>Logging, along with telemetry, is the lifeblood of any service. The more logging one incorporates, the more information is available to diagnose issues. But of course the more logging one incorporates, the more resources are possibly spent on logging, and thus it&#8217;s desirable for logging-related code to be as efficient as possible.<\/p>\n<p>One issue that&#8217;s plagued some applications is in <code>Microsoft.Extensions.Logging<\/code>&#8216;s <code>LoggerFactory.CreateLogger<\/code> method. Some libraries are passed an <code>ILoggerFactory<\/code>, call <code>CreateLogger<\/code> once, and then store and use that logger for all subsequent interactions; in such cases, the overhead of <code>CreateLogger<\/code> isn&#8217;t critical. 
However, other code paths, including some from ASP.NET, end up needing to &#8220;create&#8221; a logger on demand each time it needs to log. That puts significant stress on <code>CreateLogger<\/code>, incurring its overhead as part of every logging operation. To reduce these overheads, <code>LoggerFactory.CreateLogger<\/code> has long maintained a <code>Dictionary&lt;TKey, TValue&gt;<\/code> cache of all logger instances it&#8217;s created: pass in the same <code>categoryName<\/code>, get back the same <code>ILogger<\/code> instance (hence why I put &#8220;create&#8221; in quotes a few sentences back). However, that cache is also protected by a lock. That not only means every <code>CreateLogger<\/code> call is incurring the overhead of acquiring and releasing a lock, but if that lock is contended (meaning others are trying to access it at the same time), that contention can dramatically increase the costs associated with the cache. This is the perfect use case for a <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code>, which is optimized with lock-free support for reads, and that&#8217;s exactly how <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87904\">dotnet\/runtime#87904<\/a> improves performance here. We still want to perform some work atomically when there&#8217;s a cache miss, so the change uses &#8220;double-checked locking&#8221;: it performs a read on the dictionary, and only if the lookup fails does it then fall back to taking the lock, after which it checks the dictionary again, and only if that second read fails does it proceed to create the new logger and store it. The primary benefit of <code>ConcurrentDictionary&lt;TKey, TValue&gt;<\/code> here is it enables us to have that up-front read, which might execute concurrently with another thread mutating the dictionary; that&#8217;s not safe with <code>Dictionary&lt;,&gt;<\/code> but is with <code>ConcurrentDictionary&lt;,&gt;<\/code>. 
This measurably lowers the cost of even uncontended access, but dramatically reduces the overhead when there&#8217;s significant contention.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.Logging;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet(\"Microsoft.Extensions.Logging\", \"7.0.0\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet(\"Microsoft.Extensions.Logging\", \"8.0.0-rc.1.23419.4\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\npublic class Tests\r\n{\r\n    private readonly LoggerFactory _factory = new();\r\n\r\n    [Benchmark]\r\n    public void Serial() =&gt; _factory.CreateLogger(\"test\");\r\n\r\n    [Benchmark]\r\n    public void Concurrent()\r\n    {\r\n        Parallel.ForEach(Enumerable.Range(0, Environment.ProcessorCount), (i, ct) =&gt;\r\n        {\r\n            for (int j = 0; j &lt; 1_000_000; j++)\r\n            {\r\n                _factory.CreateLogger(\"test\");\r\n            }\r\n        });\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Serial<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">32.775 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Serial<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">7.734 ns<\/td>\n<td style=\"text-align: right\">0.24<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right\"><\/td>\n<td 
style=\"text-align: right\"><\/td>\n<\/tr>\n<tr>\n<td>Concurrent<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">509,271,719.571 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Concurrent<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">21,613,226.316 ns<\/td>\n<td style=\"text-align: right\">0.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(The same double-checked locking approach is also employed in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73893\">dotnet\/runtime#73893<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a>, in that case for the Data Contract Serialization library. Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82536\">dotnet\/runtime#82536<\/a> replaces a locked <code>Dictionary&lt;,&gt;<\/code> with a <code>ConcurrentDictionary&lt;,&gt;<\/code>, there in <code>System.ComponentModel.DataAnnotations<\/code>. In that case, it just uses <code>ConcurrentDictionary&lt;,&gt;<\/code>&#8217;s <code>GetOrAdd<\/code> method, which provides optimistic concurrency; the supplied delegate could be invoked multiple times in the case of contention to initialize a value for a given key, but only one such value will ever be published for all to consume.)<\/p>\n<p>Also related to <code>CreateLogger<\/code>, there&#8217;s a <code>CreateLogger(this ILoggerFactory factory, Type type)<\/code> extension method and a <code>CreateLogger&lt;T&gt;(this ILoggerFactory factory)<\/code> extension method, both of which infer the category to use from the specified type, using its pretty-printed name. Previously, that pretty-printing involved always allocating both a <code>StringBuilder<\/code> to build up the name and the resulting <code>string<\/code>. However, those are only necessary for more complex types, e.g. generic types, array types, and generic type parameters. 
For the common case, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79325\">dotnet\/runtime#79325<\/a> from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> avoids those overheads, which were incurred even when the request for the logger could be satisfied from the cache, because the name was necessary to even perform the cache lookup.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.Logging;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet(\"Microsoft.Extensions.Logging\", \"7.0.0\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet(\"Microsoft.Extensions.Logging\", \"8.0.0-rc.1.23419.4\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly LoggerFactory _factory = new();\r\n\r\n    [Benchmark]\r\n    public ILogger CreateLogger() =&gt; _factory.CreateLogger&lt;Tests&gt;();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CreateLogger<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">156.77 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">160 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>CreateLogger<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">70.82 
ns<\/td>\n<td style=\"text-align: right\">0.45<\/td>\n<td style=\"text-align: right\">24 B<\/td>\n<td style=\"text-align: right\">0.15<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There are also changes in .NET 8 to reduce overheads when logging actually does occur, and one such change makes use of a new .NET 8 feature we&#8217;ve already talked about: <code>CompositeFormat<\/code>. <code>CompositeFormat<\/code> isn&#8217;t currently used in many places throughout the core libraries, as most of the formatting they do is either with strings known at build time (in which case they use interpolated strings) or is on exceptional code paths (in which case we generally don&#8217;t want to regress working set or startup in order to optimize error conditions). However, there is one key place <code>CompositeFormat<\/code> is now used: in <code>LoggerMessage.Define<\/code>. This method is similar in concept to <code>CompositeFormat<\/code>: rather than having to redo work every time you want to log something, instead spend some more resources to frontload and cache that work, in order to optimize subsequent usage&#8230; that&#8217;s what <code>LoggerMessage.Define<\/code> does, just for logging. <code>Define<\/code> returns a strongly-typed delegate that can then be used any time logging should be performed. 
As of the same PR that introduced <code>CompositeFormat<\/code>, <code>LoggerMessage.Define<\/code> now also constructs a <code>CompositeFormat<\/code> under the covers, and uses that instance to perform any formatting work necessary based on the log message pattern provided (previously it would just call <code>string.Format<\/code> as part of every log operation that needed it).<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.Logging;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic class Tests\r\n{\r\n    private readonly Action&lt;ILogger, int, Exception&gt; _message = LoggerMessage.Define&lt;int&gt;(LogLevel.Critical, 1, \"The value is {0}.\");\r\n    private readonly ILogger _logger = new MyLogger();\r\n\r\n    [Benchmark]\r\n    public void Format() =&gt; _message(_logger, 42, null);\r\n\r\n    sealed class MyLogger : ILogger\r\n    {\r\n        public IDisposable BeginScope&lt;TState&gt;(TState state) =&gt; null;\r\n        public bool IsEnabled(LogLevel logLevel) =&gt; true;\r\n        public void Log&lt;TState&gt;(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func&lt;TState, Exception, string&gt; formatter) =&gt; formatter(state, exception);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Format<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">127.04 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Format<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">91.78 ns<\/td>\n<td style=\"text-align: 
right\">0.72<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>LoggerMessage.Define<\/code> is used as part of the logging source generator, so the benefits there implicitly accrue not only to direct usage of <code>LoggerMessage.Define<\/code> but also to any use of the generator. We can see that in this benchmark here:<\/p>\n<pre><code class=\"language-C#\">\/\/ For this test, you'll also need to add:\r\n\/\/     &lt;PackageReference Include=\"Microsoft.Extensions.Logging.Abstractions\" Version=\"7.0.0\" \/&gt;\r\n\/\/ to the benchmarks.csproj's &lt;ItemGroup&gt;.\r\n\/\/ dotnet run -c Release -f net7.0 --filter \"*\" --runtimes net7.0 net8.0\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.Logging;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private readonly ILogger _logger = new MyLogger();\r\n\r\n    [Benchmark]\r\n    public void Log() =&gt; LogValue(42);\r\n\r\n    [LoggerMessage(1, LogLevel.Critical, \"The value is {Value}.\")]\r\n    private partial void LogValue(int value);\r\n\r\n    sealed class MyLogger : ILogger\r\n    {\r\n        public IDisposable BeginScope&lt;TState&gt;(TState state) =&gt; null;\r\n        public bool IsEnabled(LogLevel logLevel) =&gt; true;\r\n        public void Log&lt;TState&gt;(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func&lt;TState, Exception, string&gt; formatter) =&gt; formatter(state, exception);\r\n    }\r\n}<\/code><\/pre>\n<p>Note the <code>LogValue<\/code> method, which is declared as a <code>partial<\/code> method with the <code>LoggerMessage<\/code> attribute applied to it. 
The generator will see that and inject into my application the following implementation (the only changes I&#8217;ve made to this copied code are removing the fully-qualified names, for readability), which, as is visible here, uses <code>LoggerMessage.Define<\/code>:<\/p>\n<pre><code class=\"language-C#\">partial class Tests\r\n{\r\n    [GeneratedCode(\"Microsoft.Extensions.Logging.Generators\", \"7.0.0\")]\r\n    private static readonly Action&lt;ILogger, Int32, Exception?&gt; __LogValueCallback =\r\n        LoggerMessage.Define&lt;Int32&gt;(LogLevel.Critical, new EventId(1, nameof(LogValue)), \"The value is {Value}.\", new LogDefineOptions() { SkipEnabledCheck = true });\r\n\r\n    [GeneratedCode(\"Microsoft.Extensions.Logging.Generators\", \"7.0.0\")]\r\n    private partial void LogValue(Int32 value)\r\n    {\r\n        if (_logger.IsEnabled(LogLevel.Critical))\r\n        {\r\n            __LogValueCallback(_logger, value, null);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>When running the benchmark, then, we can see the improvements that use <code>CompositeFormat<\/code> end up translating nicely:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Log<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">94.10 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Log<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">74.68 ns<\/td>\n<td style=\"text-align: right\">0.79<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other changes have also gone into reducing overheads in logging. 
Here&#8217;s the same <code>LoggerMessage.Define<\/code> benchmark as before, but I&#8217;ve tweaked two things:<\/p>\n<ol>\n<li>I&#8217;ve added <code>[MemoryDiagnoser]<\/code> so that allocation is more visible.<\/li>\n<li>I&#8217;ve explicitly controlled which NuGet package version is used for which run.<\/li>\n<\/ol>\n<p>The <code>Microsoft.Extensions.Logging.Abstractions<\/code> package carries with it multiple &#8220;assets&#8221;; the v7.0.0 package, even though it&#8217;s &#8220;7.0.0,&#8221; carries with it a build for net7.0, for net6.0, for netstandard2.0, etc. Similarly, the v8.0.0 package, even though it&#8217;s &#8220;8.0.0,&#8221; carries with it a build for net8.0, for net7.0, and so on. Each of those is created from compiling the source for that Target Framework Moniker (TFM). Changes that are specific to a particular TFM, such as the change to use <code>CompositeFormat<\/code>, are only compiled into that build, but other improvements that aren&#8217;t specific to a particular TFM end up in all of them. 
As such, to be able to see improvements that have gone into the general code in the last year, we need to actually compare the two different NuGet packages, and can&#8217;t just compare the net8.0 vs net7.0 assets in the same package version.<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.Logging;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet(\"Microsoft.Extensions.Logging\", \"7.0.0\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet(\"Microsoft.Extensions.Logging\", \"8.0.0-rc.1.23419.4\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Tests\r\n{\r\n    private readonly Action&lt;ILogger, int, Exception&gt; _message = LoggerMessage.Define&lt;int&gt;(LogLevel.Critical, 1, \"The value is {0}.\");\r\n    private readonly ILogger _logger = new MyLogger();\r\n\r\n    [Benchmark]\r\n    public void Format() =&gt; _message(_logger, 42, null);\r\n\r\n    sealed class MyLogger : ILogger\r\n    {\r\n        public IDisposable BeginScope&lt;TState&gt;(TState state) =&gt; null;\r\n        public bool IsEnabled(LogLevel logLevel) =&gt; true;\r\n        public void Log&lt;TState&gt;(LogLevel logLevel, EventId eventId, TState state, Exception exception, Func&lt;TState, Exception, string&gt; formatter) =&gt; formatter(state, exception);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: 
right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Format<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">96.44 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">80 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Format<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">46.75 ns<\/td>\n<td style=\"text-align: right\">0.48<\/td>\n<td style=\"text-align: right\">56 B<\/td>\n<td style=\"text-align: right\">0.70<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Notice that throughput has increased and allocation has dropped. That&#8217;s primarily due to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88560\">dotnet\/runtime#88560<\/a>, which avoids boxing value type arguments as they&#8217;re being passed through the formatting logic.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89160\">dotnet\/runtime#89160<\/a> is another interesting example, not because it&#8217;s a significant savings (it ends up saving an allocation per HTTP request made using an <code>HttpClient<\/code> created from an <code>HttpClientFactory<\/code>), but because of why the allocation is there in the first place. Consider this C# class:<\/p>\n<pre><code class=\"language-C#\">public class C\r\n{\r\n    public void M(int value)\r\n    {\r\n        Console.WriteLine(value);\r\n        LocalFunction();\r\n\r\n        void LocalFunction() =&gt; Console.WriteLine(value);\r\n    }\r\n}<\/code><\/pre>\n<p>We&#8217;ve got a method <code>M<\/code> that contains a local function <code>LocalFunction<\/code> that &#8220;closes over&#8221; <code>M<\/code>&#8216;s <code>int value<\/code> argument. How does <code>value<\/code> find its way into that <code>LocalFunction<\/code>? 
Let&#8217;s look at a decompiled version of the IL the compiler generates:<\/p>\n<pre><code class=\"language-C#\">public class C\r\n{\r\n    public void M(int value)\r\n    {\r\n        &lt;&gt;c__DisplayClass0_0 &lt;&gt;c__DisplayClass0_ = default(&lt;&gt;c__DisplayClass0_0);\r\n        &lt;&gt;c__DisplayClass0_.value = value;\r\n        Console.WriteLine(&lt;&gt;c__DisplayClass0_.value);\r\n        &lt;M&gt;g__LocalFunction|0_0(ref &lt;&gt;c__DisplayClass0_);\r\n    }\r\n\r\n    [StructLayout(LayoutKind.Auto)]\r\n    [CompilerGenerated]\r\n    private struct &lt;&gt;c__DisplayClass0_0\r\n    {\r\n        public int value;\r\n    }\r\n\r\n    [CompilerGenerated]\r\n    private static void &lt;M&gt;g__LocalFunction|0_0(ref &lt;&gt;c__DisplayClass0_0 P_0)\r\n    {\r\n        Console.WriteLine(P_0.value);\r\n    }\r\n}<\/code><\/pre>\n<p>So, the compiler has emitted the <code>LocalFunction<\/code> as a static method, and it&#8217;s passed the state it needs by reference, with all of the state in a separate type (which the compiler refers to as a &#8220;display class&#8221;). Note that a) the instance of this type is constructed in <code>M<\/code> in order to store the <code>value<\/code> argument, and that all references to <code>value<\/code>, whether in <code>M<\/code> or in <code>LocalFunction<\/code>, are to the shared <code>value<\/code> on the display class, and b) that &#8220;class&#8221; is actually declared as a <code>struct<\/code>. That means we&#8217;re not going to incur any allocation as part of that data sharing. 
But now, let&#8217;s add a single keyword to our repro: add <code>async<\/code> to <code>LocalFunction<\/code> (I&#8217;ve elided some irrelevant code here for clarity):<\/p>\n<pre><code class=\"language-C#\">public class C\r\n{\r\n    public void M(int value)\r\n    {\r\n        &lt;&gt;c__DisplayClass0_0 &lt;&gt;c__DisplayClass0_ = new &lt;&gt;c__DisplayClass0_0();\r\n        &lt;&gt;c__DisplayClass0_.value = value;\r\n        Console.WriteLine(&lt;&gt;c__DisplayClass0_.value);\r\n        &lt;&gt;c__DisplayClass0_.&lt;M&gt;g__LocalFunction|0();\r\n    }\r\n\r\n    [CompilerGenerated]\r\n    private sealed class &lt;&gt;c__DisplayClass0_0\r\n    {\r\n        [StructLayout(LayoutKind.Auto)]\r\n        private struct &lt;&lt;M&gt;g__LocalFunction|0&gt;d : IAsyncStateMachine { ... }\r\n\r\n        public int value;\r\n\r\n        [AsyncStateMachine(typeof(&lt;&lt;M&gt;g__LocalFunction|0&gt;d))]\r\n        internal void &lt;M&gt;g__LocalFunction|0()\r\n        {\r\n            &lt;&lt;M&gt;g__LocalFunction|0&gt;d stateMachine = default(&lt;&lt;M&gt;g__LocalFunction|0&gt;d);\r\n            stateMachine.&lt;&gt;t__builder = AsyncVoidMethodBuilder.Create();\r\n            stateMachine.&lt;&gt;4__this = this;\r\n            stateMachine.&lt;&gt;1__state = -1;\r\n            stateMachine.&lt;&gt;t__builder.Start(ref stateMachine);\r\n        }\r\n    }\r\n}<\/code><\/pre>\n<p>The code for <code>M<\/code> looks <em>almost<\/em> the same, but there&#8217;s a key difference: instead of <code>default(&lt;&gt;c__DisplayClass0_0)<\/code>, it has <code>new &lt;&gt;c__DisplayClass0_0()<\/code>. That&#8217;s because the display class now actually is a <code>class<\/code> rather than being a <code>struct<\/code>, and that&#8217;s because the state can no longer live on the stack; it&#8217;s being passed to an asynchronous method, which may need to continue to use it even after the stack has unwound. 
And that means it becomes more important to avoid these kinds of implicit closures when dealing with local functions that are asynchronous.<\/p>\n<p>In this particular case, <code>LoggingHttpMessageHandler<\/code> (and <code>LoggingScopeHttpMessageHandler<\/code>) had a <code>SendCoreAsync<\/code> method that looked like this:<\/p>\n<pre><code class=\"language-C#\">private Task&lt;HttpResponseMessage&gt; SendCoreAsync(HttpRequestMessage request, bool useAsync, CancellationToken cancellationToken)\r\n{\r\n    ThrowHelper.ThrowIfNull(request);\r\n    return Core(request, cancellationToken);\r\n\r\n    async Task&lt;HttpResponseMessage&gt; Core(HttpRequestMessage request, CancellationToken cancellationToken)\r\n    {\r\n        ...\r\n        HttpResponseMessage response = useAsync ? ... : ...;\r\n        ...\r\n    }\r\n}<\/code><\/pre>\n<p>Based on the previous discussion, you likely see the problem here: <code>useAsync<\/code> is being implicitly closed over by the local function, resulting in this allocating a display class to pass that state in. The cited PR changed the code to instead be:<\/p>\n<pre><code class=\"language-C#\">private Task&lt;HttpResponseMessage&gt; SendCoreAsync(HttpRequestMessage request, bool useAsync, CancellationToken cancellationToken)\r\n{\r\n    ThrowHelper.ThrowIfNull(request);\r\n    return Core(request, useAsync, cancellationToken);\r\n\r\n    async Task&lt;HttpResponseMessage&gt; Core(HttpRequestMessage request, bool useAsync, CancellationToken cancellationToken)\r\n    {\r\n        ...\r\n        HttpResponseMessage response = useAsync ? ... : ...;\r\n        ...\r\n    }\r\n}<\/code><\/pre>\n<p>and, voila, the allocation is gone.<\/p>\n<p><code>EventSource<\/code> is another logging mechanism in .NET that&#8217;s lower-level and which is used by the core libraries for their logging needs. 
The runtime itself publishes its events for things like the GC and the JIT via an <code>EventSource<\/code>, something I relied on earlier in this post when tracking how many <code>GCHandle<\/code>s were created (search above for <code>GCHandleListener<\/code>). When eventing is enabled for a particular source, that <code>EventSource<\/code> publishes a manifest describing the possible events and the shape of the data associated with each. While in the future, we aim to use a source generator to create that manifest at build time, today it&#8217;s all generated at run-time, using reflection to analyze the events defined on the <code>EventSource<\/code>-derived type and to dynamically build up the description. That unfortunately has some cost, which can measurably impact startup. Thankfully, one of the main contributors here is the manifest for that runtime source, <code>NativeRuntimeEventSource<\/code>, as it&#8217;s ever present, but it&#8217;s not actually necessary, since tools that consume this information already know about the well-documented schema. As such, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78213\">dotnet\/runtime#78213<\/a> stopped emitting the manifest for <code>NativeRuntimeEventSource<\/code> such that it doesn&#8217;t send a large amount of data across to the consumer that will subsequently ignore it. That prevented it from being sent, but it was still being created. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86850\">dotnet\/runtime#86850<\/a> from <a href=\"https:\/\/github.com\/n77y\">@n77y<\/a> addressed a large chunk of that by reducing the costs of that generation. 
The effect of this is obvious if we do a .NET allocation profile of a simple nop console application.<\/p>\n<pre><code class=\"language-C#\">class Program { static void Main() { } }<\/code><\/pre>\n<p>On .NET 7, we observe this:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/NativeRuntimeEventSourceAllocsNet7.png\" alt=\"Allocation from the NativeRuntimeEventSource on .NET 7\" \/>\nAnd on .NET 8, that reduces to this:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/NativeRuntimeEventSourceAllocsNet8.png\" alt=\"Allocation from the NativeRuntimeEventSource on .NET 8\" \/>\n(In the future, hopefully this whole thing will go away due to precomputing the manifest.)<\/p>\n<p><code>EventSource<\/code> also relies heavily on interop, and as part of that it&#8217;s historically used delegate marshaling as part of implementing callbacks from native code. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79970\">dotnet\/runtime#79970<\/a> switches it over to using function pointers, which is not only more efficient, it eliminates this as one of the last uses of delegate marshaling in the core libraries. That means for Native AOT, all of the code associated with supporting delegate marshaling can typically now be trimmed away, reducing application size further.<\/p>\n<h2>Configuration<\/h2>\n<p>Configuration support is critical for many services and applications, such that information necessary to the execution of the code can be extracted from the code, whether that be into a JSON file, environment variables, Azure Key Vault, wherever. This information then needs to be loaded into the application in a convenient manner, typically at startup but also potentially any time the configuration is seen to change. 
It&#8217;s thus not a typical candidate for throughput-focused optimization, but it is still valuable to drive associated costs down, especially to help with startup performance.<\/p>\n<p>With <code>Microsoft.Extensions.Configuration<\/code>, configuration is handled primarily with a <code>ConfigurationBuilder<\/code>, an <code>IConfiguration<\/code>, and a &#8220;binder.&#8221; Using a <code>ConfigurationBuilder<\/code>, you add in the various sources of your configuration information (e.g. <code>AddEnvironmentVariables<\/code>, <code>AddAzureKeyVault<\/code>, etc.), and then you publish that as an <code>IConfiguration<\/code> instance. In typical use, you then extract from that <code>IConfiguration<\/code> the data you want by &#8220;binding&#8221; it to an object, meaning a <code>Bind<\/code> method populates the provided object with data from the configuration based on the shape of the object. Let&#8217;s measure the cost of that <code>Bind<\/code> specifically:<\/p>\n<pre><code class=\"language-C#\">\/\/ For this test, you'll also need to add:\r\n\/\/     &lt;EnableConfigurationBindingGenerator&gt;true&lt;\/EnableConfigurationBindingGenerator&gt;\r\n\/\/     &lt;Features&gt;$(Features);InterceptorsPreview&lt;\/Features&gt;\r\n\/\/ to the PropertyGroup in the benchmarks.csproj file, and add:\r\n\/\/    &lt;PackageReference Include=\"Microsoft.Extensions.Configuration\" Version=\"7.0.0\" \/&gt;\r\n\/\/    &lt;PackageReference Include=\"Microsoft.Extensions.Configuration.EnvironmentVariables\" Version=\"7.0.0\" \/&gt;\r\n\/\/    &lt;PackageReference Include=\"Microsoft.Extensions.Configuration.Binder\" Version=\"8.0.0-rc.1.23419.4\" Condition=\"'$(TargetFramework)'=='net8.0'\" \/&gt;\r\n\/\/ to the ItemGroup.\r\n\/\/ dotnet run -c Release -f net7.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Environments;\r\nusing BenchmarkDotNet.Jobs;\r\nusing BenchmarkDotNet.Running;\r\nusing 
Microsoft.Extensions.Configuration;\r\n\r\nvar config = DefaultConfig.Instance\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithNuGet(\"Microsoft.Extensions.Configuration\", \"7.0.0\").AsBaseline())\r\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80)\r\n        .WithNuGet(\"Microsoft.Extensions.Configuration\", \"8.0.0-rc.1.23419.4\")\r\n        .WithNuGet(\"Microsoft.Extensions.Configuration.Binder\", \"8.0.0-rc.1.23419.4\"));\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\", \"NuGetReferences\")]\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic partial class Tests\r\n{\r\n    private readonly MyConfigSection _data = new();\r\n    private IConfiguration _config;\r\n\r\n    [GlobalSetup]\r\n    public void Setup()\r\n    {\r\n        Environment.SetEnvironmentVariable(\"MyConfigSection__Message\", \"Hello World!\");\r\n        _config = new ConfigurationBuilder()\r\n            .AddEnvironmentVariables()\r\n            .Build();\r\n    }\r\n\r\n    [Benchmark]\r\n    public void Load() =&gt; _config.Bind(\"MyConfigSection\", _data);\r\n\r\n    internal sealed class MyConfigSection\r\n    {\r\n        public string Message { get; set; }\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<th style=\"text-align: right\">Allocated<\/th>\n<th style=\"text-align: right\">Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Load<\/td>\n<td>.NET 7.0<\/td>\n<td style=\"text-align: right\">1,747.15 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<td style=\"text-align: right\">1328 B<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Load<\/td>\n<td>.NET 8.0<\/td>\n<td style=\"text-align: right\">73.45 ns<\/td>\n<td style=\"text-align: right\">0.04<\/td>\n<td style=\"text-align: right\">112 
B<\/td>\n<td style=\"text-align: right\">0.08<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Whoa.<\/p>\n<p>Much of that cost in .NET 7 comes from what I alluded to earlier when I said &#8220;based on the shape of the object.&#8221; That <code>Bind<\/code> call is using this extension method defined in the <code>Microsoft.Extensions.Configuration.ConfigurationBinder<\/code> type:<\/p>\n<pre><code class=\"language-C#\">public static void Bind(this IConfiguration configuration, string key, object? instance)<\/code><\/pre>\n<p>How does it know what data to extract from the configuration and where on the <code>object<\/code> to store it? Reflection, of course. That means that every <code>Bind<\/code> call is using reflection to walk the supplied <code>object<\/code>&#8216;s type information, and is using reflection to store the configuration data onto the instance. That&#8217;s not cheap.<\/p>\n<p>What changes then in .NET 8? The mention of &#8220;EnableConfigurationBindingGenerator&#8221; in the benchmark code above probably gives it away, but the answer is there&#8217;s a new source generator for configuration in .NET 8. This source generator was initially introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82179\">dotnet\/runtime#82179<\/a> and was then improved upon in a multitude of PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84154\">dotnet\/runtime#84154<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86076\">dotnet\/runtime#86076<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86285\">dotnet\/runtime#86285<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86365\">dotnet\/runtime#86365<\/a>. 
The crux of the idea behind the configuration source generator is to emit a <em>replacement<\/em> for that <code>Bind<\/code> method, one that knows exactly what type is being populated and can do all the examination of its shape at build-time rather than at run-time via reflection.<\/p>\n<p>&#8220;Replacement.&#8221; For anyone familiar with C# source generators, this might be setting off alarm bells in your head. Source generators plug into the compiler and are handed all the data the compiler has about the code being compiled; the source generator is then able to <em>augment<\/em> that data, generating additional code into separate files that the compiler then also compiles into the same assembly. Source generators are able to add code but they can&#8217;t rewrite the code. This is why you see source generators like the <code>Regex<\/code> source generator or the <code>LibraryImport<\/code> source generator or the <code>LoggerMessage<\/code> source generator relying on partial methods: the developer writes the partial method declaration for the method they then consume in their code, and then separately the generator emits a partial method definition to supply the implementation for that method. How then is this new configuration generator able to <em>replace<\/em> a call to an existing method? I&#8217;m glad you asked! 
It takes advantage of a new preview feature of the C# compiler, added primarily in <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/68564\">dotnet\/roslyn#68564<\/a>: interceptors.<\/p>\n<p>Consider this program, defined in a <code>\/home\/stoub\/benchmarks\/Program.cs<\/code> file (and where the associated .csproj contains <code>&lt;Features&gt;$(Features);InterceptorsPreview&lt;\/Features&gt;<\/code> to enable the preview feature):<\/p>\n<pre><code class=\"language-C#\">\/\/ dotnet run -c Release -f net8.0\r\n\r\nusing System.Runtime.CompilerServices;\r\n\r\nConsole.WriteLine(\"Hello World!\");\r\n\r\n\/\/ ----------------------------------\r\n\r\ninternal static class Helpers\r\n{\r\n    [InterceptsLocation(@\"\/home\/stoub\/benchmarks\/Program.cs\", 5, 9)]\r\n    internal static void NotTheRealWriteLine(string message) =&gt;\r\n        Console.WriteLine($\"The message was '{message}'.\");\r\n}\r\n\r\nnamespace System.Runtime.CompilerServices\r\n{\r\n    [AttributeUsage(AttributeTargets.Method, AllowMultiple = true)]\r\n    file sealed class InterceptsLocationAttribute : Attribute\r\n    {\r\n        public InterceptsLocationAttribute(string filePath, int line, int column) { }\r\n    }\r\n}<\/code><\/pre>\n<p>This is a &#8220;hello world&#8221; application, except not quite the one-liner you&#8217;re used to. There&#8217;s a call to <code>Console.WriteLine<\/code>, but there&#8217;s also a method decorated with <code>InterceptsLocation<\/code>. That method has the same signature as the <code>Console.WriteLine<\/code> being used, and the attribute is pointing to the <code>WriteLine<\/code> method call in <code>Program.cs<\/code>&#8216;s line 5 column 9. When the compiler sees this, it will change that call from <code>Console.WriteLine(\"Hello World!\")<\/code> to instead be <code>Helpers.NotTheRealWriteLine(\"Hello World!\")<\/code>, allowing this other method in the same compilation unit to intercept the original call. 
This interceptor needn&#8217;t be in the same file, so a source generator can analyze the code handed to it, find a call it wants to intercept, and augment the compilation unit with such an interceptor.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/InterceptorHelloWorld.png\" alt=\"Decompiled &quot;Hello World&quot; with Interceptors\" \/><\/p>\n<p>That&#8217;s exactly what the configuration source generator does. In this benchmark, for example, the core of what the source generator emits is here (I&#8217;ve elided stuff that&#8217;s not relevant to this discussion):<\/p>\n<pre><code class=\"language-C#\">[InterceptsLocationAttribute(@\"...\/LoggerFilterConfigureOptions.cs\", 21, 35)]\r\npublic static void Bind_TestsMyConfigSection(this IConfiguration configuration, string key, object? obj)\r\n{\r\n    ...\r\n    var typedObj = (Tests.MyConfigSection)obj;\r\n    BindCore(configuration.GetSection(key), ref typedObj, binderOptions: null);\r\n}\r\n\r\npublic static void BindCore(IConfiguration configuration, ref Tests.MyConfigSection obj, BinderOptions? binderOptions)\r\n{\r\n    ...\r\n    obj.Message = configuration[\"Message\"]!;\r\n}<\/code><\/pre>\n<p>We can see the generated <code>Bind_TestsMyConfigSection<\/code> method is strongly typed for my <code>MyConfigSection<\/code> type, and the generated <code>BindCore<\/code> method it invokes extracts the <code>\"Message\"<\/code> value from the <code>configuration<\/code> and stores it directly into the property. No reflection anywhere in sight.<\/p>\n<p>This is obviously great for throughput, but that actually wasn&#8217;t the primary goal for this particular source generator. Rather, it was in support of Native AOT and trimming. 
Without direct use of various portions of the object model for the bound object, the trimmer could see portions of it as being unused and trim them away (such as setters for properties that are only read by the application), at which point that data would not be available (because the deserializer would see the properties as being get-only). By having everything strongly typed in the generated source, that issue goes away. And as a bonus, if there isn&#8217;t other use of the reflection stack keeping it rooted, the trimmer can get rid of that, too.<\/p>\n<p><code>Bind<\/code> isn&#8217;t the only method that&#8217;s replaceable. <code>ConfigurationBinder<\/code> provides other methods consumers can use, like <code>GetValue<\/code>, which just retrieves the value associated with a specific key, and the configuration source generator can emit replacements for those as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87935\">dotnet\/runtime#87935<\/a> modified <code>Microsoft.Extensions.Logging.Configuration<\/code> to employ the config generator for this purpose, as it uses <code>GetValue<\/code> in its <code>LoadDefaultConfigValues<\/code> method:<\/p>\n<pre><code class=\"language-C#\">private void LoadDefaultConfigValues(LoggerFilterOptions options)\r\n{\r\n    if (_configuration == null)\r\n    {\r\n        return;\r\n    }\r\n    options.CaptureScopes = _configuration.GetValue(nameof(options.CaptureScopes), options.CaptureScopes);\r\n    ...\r\n}<\/code><\/pre>\n<p>And if we look at what&#8217;s in the compiled binary (via <a href=\"https:\/\/github.com\/icsharpcode\/ILSpy\">ILSpy<\/a>), we see this:\n<img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2023\/09\/ILSpyDecompiledLoadDefaultConfigValues.png\" alt=\"ILSpy decompilation of LoadDefaultConfigValues\" \/><\/p>\n<p>So, the code looks the same, but the actual target of the <code>GetValue<\/code> is the intercepting method emitted by the source 
generator. When that change merged, it knocked ~640Kb off the size of the ASP.NET app being used as an exemplar to track Native AOT app size!<\/p>\n<p>Once data has been loaded from the configuration system into some kind of model, often the next step is to validate that the supplied data meets requirements. Whether a data model is populated once from configuration or per request for user input, a typical approach for achieving such validation is via the <code>System.ComponentModel.DataAnnotations<\/code> namespace. This namespace supplies attributes that can be applied to members of a type to indicate constraints the data must satisfy, such as <code>[Required]<\/code> to indicate the data must be supplied or <code>[MinLength(...)]<\/code> to indicate a minimum length for a string, and .NET 8 adds additional attributes via <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82311\">dotnet\/runtime#82311<\/a>, for example <code>[Base64String]<\/code>. On top of this, <code>Microsoft.Extensions.Options.DataAnnotationValidateOptions<\/code> provides an implementation of the <code>IValidateOptions&lt;TOptions&gt;<\/code> interface (an implementation of which is typically retrieved via DI) for validating models based on data annotations, and as you can probably guess, it does so via reflection. As is a trend you&#8217;re probably picking up on, for many such areas involving reflection, .NET has been moving to add source generators that can do at build-time what would have otherwise been done at run-time; that&#8217;s the case here as well. 
As of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87587\">dotnet\/runtime#87587<\/a>, the <code>Microsoft.Extensions.Options<\/code> package in .NET 8 now includes a source generator that creates an implementation of <code>IValidateOptions&lt;TOptions&gt;<\/code> for a specific <code>TOptions<\/code> type.<\/p>\n<p>For example, consider this benchmark:<\/p>\n<pre><code class=\"language-C#\">\/\/ For this test, you'll also need to add these:\r\n\/\/  &lt;PackageReference Include=\"Microsoft.Extensions.Options\" Version=\"8.0.0-rc.1.23419.4\" \/&gt;\r\n\/\/  &lt;PackageReference Include=\"Microsoft.Extensions.Options.DataAnnotations\" Version=\"8.0.0-rc.1.23419.4\" \/&gt;\r\n\/\/ to the benchmarks.csproj's &lt;ItemGroup&gt;.\r\n\/\/ dotnet run -c Release -f net8.0 --filter \"*\"\r\n\r\nusing BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing Microsoft.Extensions.Options;\r\nusing System.ComponentModel.DataAnnotations;\r\n\r\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);\r\n\r\n[HideColumns(\"Error\", \"StdDev\", \"Median\", \"RatioSD\")]\r\npublic partial class Tests\r\n{\r\n    private readonly DataAnnotationValidateOptions&lt;MyOptions&gt; _davo = new DataAnnotationValidateOptions&lt;MyOptions&gt;(null);\r\n    private readonly MyOptionsValidator _ov = new();\r\n    private readonly MyOptions _options = new() { Path = \"1234567890\", Address = \"http:\/\/localhost\/path\", PhoneNumber = \"555-867-5309\" };\r\n\r\n    [Benchmark(Baseline = true)]\r\n    public ValidateOptionsResult WithReflection() =&gt; _davo.Validate(null, _options);\r\n\r\n    [Benchmark]\r\n    public ValidateOptionsResult WithSourceGen() =&gt; _ov.Validate(null, _options);\r\n\r\n    public sealed class MyOptions\r\n    {\r\n        [Length(1, 10)]\r\n        public string Path { get; set; }\r\n\r\n        [Url]\r\n        public string Address { get; set; }\r\n\r\n        [Phone]\r\n        public string PhoneNumber { get; set; }\r\n    
}\r\n\r\n    [OptionsValidator]\r\n    public partial class MyOptionsValidator : IValidateOptions&lt;MyOptions&gt; { }\r\n}<\/code><\/pre>\n<p>Note the <code>[OptionsValidator]<\/code> at the end. It&#8217;s applied to a <code>partial<\/code> class that implements <code>IValidateOptions&lt;MyOptions&gt;<\/code>, which tells the source generator to emit the implementation for this interface in order to validate <code>MyOptions<\/code>. It ends up emitting code like this (which I&#8217;ve simplified a tad, e.g. removing fully-qualified namespaces, for the purposes of this post):<\/p>\n<pre><code class=\"language-C#\">[GeneratedCode(\"Microsoft.Extensions.Options.SourceGeneration\", \"8.0.8.41903\")]\r\npublic ValidateOptionsResult Validate(string? name, MyOptions options)\r\n{\r\n    var context = new ValidationContext(options);\r\n    var validationResults = new List&lt;ValidationResult&gt;();\r\n    var validationAttributes = new List&lt;ValidationAttribute&gt;(2);\r\n    ValidateOptionsResultBuilder? builder = null;\r\n\r\n    context.MemberName = \"Path\";\r\n    context.DisplayName = string.IsNullOrEmpty(name) ? \"MyOptions.Path\" : $\"{name}.Path\";\r\n    validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A1);\r\n    validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A2);\r\n    if (!Validator.TryValidateValue(options.Path, context, validationResults, validationAttributes))\r\n        (builder ??= new()).AddResults(validationResults);\r\n\r\n    context.MemberName = \"Address\";\r\n    context.DisplayName = string.IsNullOrEmpty(name) ? 
\"MyOptions.Address\" : $\"{name}.Address\";\r\n    validationResults.Clear();\r\n    validationAttributes.Clear();\r\n    validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A3);\r\n    if (!Validator.TryValidateValue(options.Address, context, validationResults, validationAttributes))\r\n        (builder ??= new()).AddResults(validationResults);\r\n\r\n    context.MemberName = \"PhoneNumber\";\r\n    context.DisplayName = string.IsNullOrEmpty(name) ? \"MyOptions.PhoneNumber\" : $\"{name}.PhoneNumber\";\r\n    validationResults.Clear();\r\n    validationAttributes.Clear();\r\n    validationAttributes.Add(__OptionValidationStaticInstances.__Attributes.A4);\r\n    if (!Validator.TryValidateValue(options.PhoneNumber, context, validationResults, validationAttributes))\r\n        (builder ??= new()).AddResults(validationResults);\r\n\r\n    return builder is not null ? builder.Build() : ValidateOptionsResult.Success;\r\n}<\/code><\/pre>\n<p>eliminating the need to use reflection to discover the relevant properties and their attribution. The benchmark results highlight the benefits:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right\">Mean<\/th>\n<th style=\"text-align: right\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithReflection<\/td>\n<td style=\"text-align: right\">2,926.2 ns<\/td>\n<td style=\"text-align: right\">1.00<\/td>\n<\/tr>\n<tr>\n<td>WithSourceGen<\/td>\n<td style=\"text-align: right\">403.5 ns<\/td>\n<td style=\"text-align: right\">0.14<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Peanut Butter<\/h2>\n<p>In every .NET release, there are a multitude of welcome PRs that make small improvements. These changes on their own typically don&#8217;t &#8220;move the needle,&#8221; don&#8217;t on their own make very measurable end-to-end changes. However, an allocation removed here, an unnecessary bounds check removed there, it all adds up. 
Constantly working to remove this &#8220;peanut butter,&#8221; as we often refer to it (a thin smearing of overhead across everything), helps improve the performance of the platform in the aggregate.<\/p>\n<p>Here are some examples from the last year:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/77832\">dotnet\/runtime#77832<\/a>. The <code>MemoryStream<\/code> type provides a convenient <code>ToArray()<\/code> method that gives you all the stream&#8217;s data as a new <code>byte[]<\/code>. But while convenient, it&#8217;s a potentially large allocation and copy. The lesser known <code>GetBuffer<\/code> and <code>TryGetBuffer<\/code> methods give one access to the <code>MemoryStream<\/code>&#8216;s buffer directly, without incurring an allocation or copy. This PR replaced use of <code>ToArray<\/code> in <code>System.Private.Xml<\/code> and in <code>System.Reflection.Metadata<\/code> that were better served by <code>GetBuffer()<\/code>. Not only did it remove unnecessary allocation, as a bonus it also resulted in less code.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80523\">dotnet\/runtime#80523<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80389\">dotnet\/runtime#80389<\/a> removed string allocations from the <code>System.ComponentModel.Annotations<\/code> library. <code>CreditCardAttribute<\/code> was making two calls to <code>string.Replace<\/code> to remove <code>'-'<\/code> and <code>' '<\/code> characters, but it was then looping over every character in the input&#8230; rather than creating new strings without those characters, the loop can simply skip over them. 
Similarly, <code>PhoneAttribute<\/code> contained 6 <code>string.Substring<\/code> calls, all of which could be replaced with simple <code>ReadOnlySpan&lt;char&gt;<\/code> slices.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82041\">dotnet\/runtime#82041<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87479\">dotnet\/runtime#87479<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80386\">dotnet\/runtime#80386<\/a> changed several hundred lines across <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> to avoid various array and <code>string<\/code> allocation. In some cases it used <code>stackalloc<\/code>, in others <code>ArrayPool<\/code>, in others simply deleting arrays that were never used, in others using <code>ReadOnlySpan&lt;char&gt;<\/code> and slicing.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82411\">dotnet\/runtime#82411<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82456\">dotnet\/runtime#82456<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a> do a similar optimization to one discussed previously in the context of <code>SslStream<\/code>. 
Here, they&#8217;re removing <code>SafeHandle<\/code> allocations in places where a simple <code>try<\/code>\/<code>finally<\/code> with the raw <code>IntPtr<\/code> for the handle suffices.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82096\">dotnet\/runtime#82096<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/83138\">dotnet\/runtime#83138<\/a> decreased some costs by using newer constructs: string interpolation instead of concatenation so as to avoid some intermediary string allocations, and <code>u8<\/code> instead of <code>Encoding.UTF8.GetBytes<\/code> to avoid the transcoding overhead.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75850\">dotnet\/runtime#75850<\/a> removed some allocations as part of initializing a <code>Dictionary&lt;,&gt;<\/code>. The dictionary in <code>TypeConverter<\/code> gets populated with a fixed set of predetermined items, and as such it&#8217;s provided with a capacity so as to presize its internal arrays to avoid intermediate allocations as part of growing. However, the provided capacity was smaller than the number of items actually being added. This PR simply fixed the number, and voila, less allocation.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81036\">dotnet\/runtime#81036<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/81039\">dotnet\/runtime#81039<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a> helped eliminate some bounds checking in various components across the core libraries. Today the JIT compiler recognizes the pattern <code>for (int i = 0; i &lt; arr.Length; i++) Use(arr[i]);<\/code>, understanding that the <code>i<\/code> can&#8217;t ever be negative nor greater than the <code>arr<\/code>&#8216;s length, and thus eliminates the bounds check it would have otherwise emitted on <code>arr[i]<\/code>. 
However, the compiler doesn&#8217;t currently recognize the same thing for <code>for (int i = 0; i != arr.Length; i++) Use(arr[i]);<\/code>. These PRs primarily replaced <code>!=<\/code>s with <code>&lt;<\/code>s in order to help in some such cases (it also makes the code more idiomatic, and so was welcomed even in cases where it wasn&#8217;t actually helping with bounds checks).<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89030\">dotnet\/runtime#89030<\/a> fixed a case where a <code>Dictionary&lt;T, T&gt;<\/code> was being used as a set. Changing it to instead be <code>HashSet&lt;T&gt;<\/code> saves on the internal storage for the values that end up being identical to the keys.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78741\">dotnet\/runtime#78741<\/a> replaces a bunch of <code>Unsafe.SizeOf&lt;T&gt;()<\/code> with <code>sizeof(T)<\/code> and <code>Unsafe.As&lt;TFrom, TTo&gt;<\/code> with pointer manipulation. Most of these are with managed <code>T<\/code>s, such that it used to not be possible to do. However, as of C# 11, more of these operations are possible, with conditions that were previously always errors now being downgraded to warnings (which can then be suppressed) in an <code>unsafe<\/code> context. Such replacements generally won&#8217;t improve throughput, but they do make the binaries a bit smaller and require less work for the JIT, which can in turn help with startup time. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78914\">dotnet\/runtime#78914<\/a> takes advantage of this as well, this time to be able to pass a span as input to a <code>string.Create<\/code> call.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/78737\">dotnet\/runtime#78737<\/a> from <a href=\"https:\/\/github.com\/Poppyto\">@Poppyto<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/79345\">dotnet\/runtime#79345<\/a> from <a href=\"https:\/\/github.com\/Poppyto\">@Poppyto<\/a> remove some list and array allocations from <code>Microsoft.Win32.Registry<\/code> by replacing some code that was using <code>List&lt;string&gt;<\/code> to build up a result and then <code>ToArray<\/code> it at the end to get back a <code>string[]<\/code>. In the majority case, we know the exact required size ahead of time, and can avoid the extra allocations and copy by just using an array from the get-go.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82598\">dotnet\/runtime#82598<\/a> from <a href=\"https:\/\/github.com\/huoyaoyuan\">@huoyaoyuan<\/a> also tweaked <code>Registry<\/code>, taking advantage of a Win32 function that was added after the original code was written, in order to reduce the number of system calls required to delete a subtree.<\/li>\n<li>Multiple changes went into <code>System.Xml<\/code> and <code>System.Runtime.Serialization.Xml<\/code> to streamline away peanut butter related to strings and arrays. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75452\">dotnet\/runtime#75452<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> replaces multiple <code>string.Trim<\/code> calls with span trimming and slicing, taking advantage of the C# language&#8217;s recently added support for using <code>switch<\/code> over <code>ReadOnlySpan&lt;char&gt;<\/code>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75946\">dotnet\/runtime#75946<\/a> removes some use of <code>ToCharArray<\/code> (these days, there&#8217;s almost always a better alternative to <code>string.ToCharArray<\/code>), while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/82006\">dotnet\/runtime#82006<\/a> replaces some <code>new char[]<\/code> with spans and <code>stackalloc char[]<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85534\">dotnet\/runtime#85534<\/a> removed an unnecessary dictionary lookup, replacing a use of <code>ContainsKey<\/code> followed by the indexer with just <code>TryGetValue<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/84888\">dotnet\/runtime#84888<\/a> from <a href=\"https:\/\/github.com\/mla-alm\">@mla-alm<\/a> removed some synchronous I\/O from the asynchronous code paths in <code>XsdValidatingReader<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/74955\">dotnet\/runtime#74955<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> replaced the internal <code>XmlConvert.StrEqual<\/code> helper, which compared the two inputs character by character, with calls to <code>SequenceEqual<\/code> and <code>StartsWith<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/75812\">dotnet\/runtime#75812<\/a> from <a href=\"https:\/\/github.com\/jlennox\">@jlennox<\/a> replaced some manual UTF8 encoding with <code>\"...\"u8<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/76436\">dotnet\/runtime#76436<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> removed intermediate <code>string<\/code> allocation when writing primitive types as part of XML serialization. 
And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/73336\">dotnet\/runtime#73336<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/71478\">dotnet\/runtime#71478<\/a> from <a href=\"https:\/\/github.com\/Daniel-Svensson\">@Daniel-Svensson<\/a> improved <code>XmlDictionaryWriter<\/code> by using <code>Encoding.UTF8<\/code> for UTF8 encoding and by doing more efficient writing using spans.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87905\">dotnet\/runtime#87905<\/a> makes a tiny tweak to the <code>ArrayPool<\/code>, but one that can lead to very measurable gains. The <code>ArrayPool&lt;T&gt;<\/code> instance returned from <code>ArrayPool&lt;T&gt;.Shared<\/code> currently is a multi-layered cache. The first layer is in thread-local storage. If renting can&#8217;t be satisfied by that layer, it falls through to the next layer, where there&#8217;s a &#8220;partition&#8221; per array size per core (by default). Each partition is an array of arrays. By default, this <code>T[][]<\/code> could store 8 arrays. Now with this PR, it can store 32 arrays, decreasing the chances that code will need to spend additional cycles searching other partitions. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86109\">dotnet\/runtime#86109<\/a>, that 32 value can also be changed, by setting the <code>DOTNET_SYSTEM_BUFFERS_SHAREDARRAYPOOL_MAXARRAYSPERPARTITION<\/code> environment variable to the desired maximum capacity. The <code>DOTNET_SYSTEM_BUFFERS_SHAREDARRAYPOOL_MAXPARTITIONCOUNT<\/code> environment variable can also be used to control how many partitions are employed.<\/li>\n<\/ul>\n<h2>What&#8217;s Next?<\/h2>\n<p>Whew! That was&#8230; a lot! 
So, what&#8217;s next?<\/p>\n<p>The .NET 8 Release Candidate is now available, and I encourage you to <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/8.0?wt.mc_id=net8perf\">download it<\/a> and take it for a spin. As you can likely sense from my enthusiasm throughout this post, I&#8217;m thrilled about the potential .NET 8 has to improve your system&#8217;s performance just by upgrading, and I&#8217;m thrilled about new features .NET 8 offers to help you tweak your code to be even more efficient. We&#8217;re eager to hear from you about your experiences in doing so, and if you find something that can be improved even further, we&#8217;d love for you to make it better by contributing to the various .NET repos, whether it be issues with your thoughts or PRs with your coded improvements. Your efforts will benefit not only you but every other .NET developer around the world!<\/p>\n<p>Thanks for reading, and happy coding!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>.NET 7 was super fast, .NET 8 is faster. Take an in-depth tour through over 500 pull requests that make that a reality.<\/p>\n","protected":false},"author":360,"featured_media":47453,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,756,3009,646],"tags":[7701,8082],"class_list":["post-47452","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-csharp","category-performance","category-visual-studio","tag-dotnet-8","tag-dotnetperf"],"acf":[],"blog_post_summary":"<p>.NET 7 was super fast, .NET 8 is faster. 
Take an in-depth tour through over 500 pull requests that make that a reality.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/47452","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/360"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=47452"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/47452\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/47453"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=47452"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=47452"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=47452"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}