{"id":33921,"date":"2021-08-17T08:15:42","date_gmt":"2021-08-17T15:15:42","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=33921"},"modified":"2025-10-29T11:28:17","modified_gmt":"2025-10-29T18:28:17","slug":"performance-improvements-in-net-6","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-6\/","title":{"rendered":"Performance Improvements in .NET 6"},"content":{"rendered":"<p>Four years ago, around the time .NET Core 2.0 was being released, I wrote <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core\/\">Performance Improvements in .NET Core<\/a> to highlight the quantity and quality of performance improvements finding their way into .NET. With its very positive reception, I did so again a year later with <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-2-1\/\">Performance Improvements in .NET Core 2.1<\/a>, and an annual tradition was born. Then came <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-3-0\/\">Performance Improvements in .NET Core 3.0<\/a>, followed by <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-5\/\">Performance Improvements in .NET 5<\/a>. Which brings us to today.<\/p>\n<p>The <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> repository is the home of .NET&#8217;s runtimes, runtime hosts, and core libraries. Since its main branch forked a year or so ago to be for .NET 6, there have been over 6500 merged PRs (pull requests) into the branch for the release, and that&#8217;s excluding automated PRs from bots that do things like flow dependency version updates between repos (not to discount the bots&#8217; contributions; after all, they&#8217;ve actually received interview offers by email from recruiters who just possibly weren&#8217;t being particularly discerning with their candidate pool). 
I at least peruse if not review in depth the vast majority of all those PRs, and every time I see a PR that is likely to impact performance, I make a note of it in a running log, giving me a long list of improvements I can revisit when it&#8217;s blog time. That made this August a little daunting, as I sat down to write this post and was faced with the list I&#8217;d curated of almost 550 PRs. Don&#8217;t worry, I don&#8217;t cover all of them here, but grab a large mug of your favorite hot beverage, and settle in: this post takes a rip-roarin&#8217; tour through ~400 PRs that, all together, significantly improve .NET performance for .NET 6.<\/p>\n<p>Please enjoy!<\/p>\n<h3>Table Of Contents<\/h3>\n<ul>\n<li><a href=\"#benchmarking-setup\">Benchmarking Setup<\/a><\/li>\n<li><a href=\"#jit\">JIT<\/a><\/li>\n<li><a href=\"#gc\">GC<\/a><\/li>\n<li><a href=\"#threading\">Threading<\/a><\/li>\n<li><a href=\"#system-types\">System Types<\/a><\/li>\n<li><a href=\"#arrays-strings-spans\">Arrays, Strings, Spans<\/a><\/li>\n<li><a href=\"#buffering\">Buffering<\/a><\/li>\n<li><a href=\"#io\">IO<\/a><\/li>\n<li><a href=\"#networking\">Networking<\/a><\/li>\n<li><a href=\"#reflection\">Reflection<\/a><\/li>\n<li><a href=\"#collections-and-linq\">Collections and LINQ<\/a><\/li>\n<li><a href=\"#cryptography\">Cryptography<\/a><\/li>\n<li><a href=\"#peanut-butter\">&#8220;Peanut Butter&#8221;<\/a><\/li>\n<li><a href=\"#json\">JSON<\/a><\/li>\n<li><a href=\"#interop\">Interop<\/a><\/li>\n<li><a href=\"#startup\">Startup<\/a><\/li>\n<li><a href=\"#tracing\">Tracing<\/a><\/li>\n<li><a href=\"#size\">Size<\/a><\/li>\n<li><a href=\"#blazor-and-mono\">Blazor and mono<\/a><\/li>\n<li><a href=\"#is-that-all\">Is that all?<\/a><\/li>\n<\/ul>\n<h3>Benchmarking Setup<\/h3>\n<p>As in previous posts, I&#8217;m using <a href=\"https:\/\/github.com\/dotnet\/benchmarkdotnet\">BenchmarkDotNet<\/a> for the majority of the examples throughout. 
To get started, I created a new console application:<\/p>\n<pre><code class=\"language-console\">dotnet new console -o net6perf\r\ncd net6perf<\/code><\/pre>\n<p>and added a reference to the <a href=\"https:\/\/www.nuget.org\/packages\/BenchmarkDotNet\/\">BenchmarkDotNet nuget package<\/a>:<\/p>\n<pre><code class=\"language-console\">dotnet add package benchmarkdotnet<\/code><\/pre>\n<p>That yielded a net6perf.csproj, which I then overwrote with the following contents; most importantly, this includes multiple target frameworks so that I can use BenchmarkDotNet to easily compare performance on them:<\/p>\n<pre><code class=\"language-xml\">&lt;Project Sdk=\"Microsoft.NET.Sdk\"&gt;\r\n\r\n  &lt;PropertyGroup&gt;\r\n    &lt;OutputType&gt;Exe&lt;\/OutputType&gt;\r\n    &lt;TargetFrameworks&gt;net48;netcoreapp2.1;netcoreapp3.1;net5.0;net6.0&lt;\/TargetFrameworks&gt;\r\n    &lt;Nullable&gt;annotations&lt;\/Nullable&gt;\r\n    &lt;AllowUnsafeBlocks&gt;true&lt;\/AllowUnsafeBlocks&gt;\r\n    &lt;LangVersion&gt;10&lt;\/LangVersion&gt;\r\n    &lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt;\r\n  &lt;\/PropertyGroup&gt;\r\n\r\n  &lt;ItemGroup&gt;\r\n    &lt;PackageReference Include=\"benchmarkdotnet\" Version=\"0.13.1\" \/&gt;\r\n  &lt;\/ItemGroup&gt;\r\n\r\n  &lt;ItemGroup Condition=\" '$(TargetFramework)' == 'net48' \"&gt;\r\n    &lt;Reference Include=\"System.Net.Http\" \/&gt;\r\n  &lt;\/ItemGroup&gt;\r\n\r\n&lt;\/Project&gt;<\/code><\/pre>\n<p>I then updated the generated Program.cs to contain the following boilerplate code:<\/p>\n<pre><code class=\"language-C#\">using BenchmarkDotNet.Attributes;\r\nusing BenchmarkDotNet.Running;\r\nusing BenchmarkDotNet.Columns;\r\nusing BenchmarkDotNet.Configs;\r\nusing BenchmarkDotNet.Reports;\r\nusing BenchmarkDotNet.Order;\r\nusing Perfolizer.Horology;\r\nusing System;\r\nusing System.Buffers;\r\nusing System.Buffers.Binary;\r\nusing System.Buffers.Text;\r\nusing System.Collections;\r\nusing 
System.Collections.Concurrent;\r\nusing System.Collections.Generic;\r\nusing System.Collections.Immutable;\r\nusing System.Collections.ObjectModel;\r\nusing System.Diagnostics.CodeAnalysis;\r\nusing System.Diagnostics;\r\nusing System.Diagnostics.Tracing;\r\nusing System.Globalization;\r\nusing System.IO;\r\nusing System.Linq;\r\nusing System.Net;\r\nusing System.Net.Http;\r\nusing System.Net.Sockets;\r\nusing System.Net.WebSockets;\r\nusing System.Numerics;\r\nusing System.Reflection;\r\nusing System.Runtime.CompilerServices;\r\nusing System.Runtime.InteropServices;\r\nusing System.Security.Cryptography;\r\nusing System.Security.Cryptography.X509Certificates;\r\nusing System.Text;\r\nusing System.Threading;\r\nusing System.Threading.Tasks;\r\nusing System.IO.Compression;\r\n#if NETCOREAPP3_0_OR_GREATER\r\nusing System.Text.Encodings.Web;\r\nusing System.Text.Json;\r\nusing System.Text.Json.Serialization;\r\n#endif\r\n\r\n[DisassemblyDiagnoser(maxDepth: 1)] \/\/ change to 0 for just the [Benchmark] method\r\n[MemoryDiagnoser(displayGenColumns: false)]\r\npublic class Program\r\n{\r\n    public static void Main(string[] args) =&gt;\r\n        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, DefaultConfig.Instance\r\n            \/\/.WithSummaryStyle(new SummaryStyle(CultureInfo.InvariantCulture, printUnitsInHeader: false, SizeUnit.B, TimeUnit.Microsecond))\r\n            );\r\n\r\n    \/\/ BENCHMARKS GO HERE\r\n}<\/code><\/pre>\n<p>With minimal friction, you should be able to copy and paste a benchmark from this post to where it says <code>\/\/ BENCHMARKS GO HERE<\/code>, and run the app to execute the benchmarks. 
You can do so with a command like this:<\/p>\n<pre><code class=\"language-console\">dotnet run -c Release -f net48 --filter \"**\" --runtimes net48 net5.0 net6.0<\/code><\/pre>\n<p>This tells BenchmarkDotNet:<\/p>\n<ul>\n<li>Build everything in a release configuration,<\/li>\n<li>build it targeting the .NET Framework 4.8 surface area,<\/li>\n<li>don&#8217;t exclude any benchmarks,<\/li>\n<li>and run each benchmark on each of .NET Framework 4.8, .NET 5, and .NET 6.<\/li>\n<\/ul>\n<p>In some cases, I&#8217;ve added additional frameworks to the list (e.g. <code>netcoreapp3.1<\/code>) to highlight cases where there&#8217;s a continuous improvement release-over-release. In other cases, I&#8217;ve only targeted .NET 6.0, such as when highlighting the difference between an existing API and a new one in this release. Most of the results in the post were generated by running on Windows, primarily so that .NET Framework 4.8 could be included in the result set. However, unless otherwise called out, all of these benchmarks show comparable improvements when run on Linux or on macOS. Simply ensure that you have installed each runtime you want to measure. I&#8217;m using a <a href=\"https:\/\/github.com\/dotnet\/installer\/blob\/main\/README.md#installers-and-binaries\">nightly build of .NET 6 RC1<\/a>, along with the latest <a href=\"https:\/\/dotnet.microsoft.com\/download\">released downloads<\/a> of .NET 5 and .NET Core 3.1.<\/p>\n<p>Final note and standard disclaimer: microbenchmarking can be very subject to the machine on which a test is run, what else is going on with that machine at the same time, and sometimes seemingly the way the wind is blowing. Your results may vary.<\/p>\n<p>Let&#8217;s get started&#8230;<\/p>\n<h3>JIT<\/h3>\n<p>Code generation is the foundation on top of which everything else is built. As such, improvements to code generation have a multiplicative effect, with the power to improve the performance of all code that runs on the platform. 
.NET 6 sees an unbelievable number of performance improvements finding their way into the JIT (just-in-time compiler), which is used to translate IL (intermediate language) into assembly code at run-time, and which is also used for AOT (ahead-of-time compilation) as part of <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/conversation-about-crossgen2\/\">Crossgen2<\/a> and the <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/main\/docs\/design\/coreclr\/botr\/readytorun-overview.md\">R2R format (ReadyToRun)<\/a>.<\/p>\n<p>Since it&#8217;s so foundational to good performance in .NET code, let&#8217;s start by talking about inlining and devirtualization. &#8220;Inlining&#8221; is the process by which the compiler takes the code from a method callee and emits it directly into the caller. This avoids the overhead of the method call, but that&#8217;s typically only a minor benefit. The major benefit is it exposes the contents of the callee to the context of the caller, enabling subsequent (&#8220;knock-on&#8221;) optimizations that wouldn&#8217;t have been possible without the inlining. Consider a simple case:<\/p>\n<pre><code class=\"language-C#\">[MethodImpl(MethodImplOptions.NoInlining)]\r\npublic static int Compute() =&gt; ComputeValue(123) * 11;\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static int ComputeValue(int length) =&gt; length * 7;<\/code><\/pre>\n<p>Here we have a method, <code>ComputeValue<\/code>, which just takes an <code>int<\/code> and multiplies it by 7, returning the result. This method is simple enough to always be inlined, so for demonstration purposes I&#8217;ve used <code>MethodImplOptions.NoInlining<\/code> to tell the JIT to not inline it. 
If I then look at what assembly code the JIT produces for <code>Compute<\/code> and <code>ComputeValue<\/code>, we get something like this:<\/p>\n<pre><code class=\"language-assembly\">; Program.Compute()\r\n       sub       rsp,28\r\n       mov       ecx,7B\r\n       call      Program.ComputeValue(Int32)\r\n       imul      eax,0B\r\n       add       rsp,28\r\n       ret\r\n\r\n; Program.ComputeValue(Int32)\r\n       imul      eax,ecx,7\r\n       ret<\/code><\/pre>\n<p><code>Compute<\/code> loads the value <code>123<\/code> (<code>0x7b<\/code> in hex) into the <code>ecx<\/code> register, which holds the argument to <code>ComputeValue<\/code>, calls <code>ComputeValue<\/code>, then takes the result (from the <code>eax<\/code> register) and multiplies it by <code>11<\/code> (<code>0xb<\/code> in hex), returning that result. We can see <code>ComputeValue<\/code> in turn takes the input from <code>ecx<\/code> and multiplies it by <code>7<\/code>, storing the result into <code>eax<\/code> for <code>Compute<\/code> to consume. Now, what happens if we remove the <code>NoInlining<\/code>:<\/p>\n<pre><code class=\"language-assembly\">; Program.Compute()\r\n       mov       eax,24FF\r\n       ret<\/code><\/pre>\n<p>The multiplications and method calls have evaporated, and we&#8217;re left with <code>Compute<\/code> simply returning the value <code>0x24ff<\/code>, as the JIT has computed at compile-time the result of <code>(123 * 7) * 11<\/code>, which is <code>9471<\/code>, or <code>0x24ff<\/code> in hex. In other words, we didn&#8217;t just save the method call, we also transformed the entire operation into a constant. Inlining is a very powerful optimization.<\/p>\n<p>Of course, you also need to be careful with inlining. If you inline too much, you bloat the code in your methods, potentially very significantly. That can make microbenchmarks look very good in some circumstances, but it can also have some bad net effects. 
Let&#8217;s say all of the code associated with <code>Int32.Parse<\/code> is 1,000 bytes of assembly code (I&#8217;m making up that number for explanatory purposes), and let&#8217;s say we forced it to all always inline. Every call site to <code>Int32.Parse<\/code> will now end up carrying a (potentially optimized with knock-on effects) copy of the code; call it from 100 different locations, and you now have 100,000 bytes of assembly code rather than 1,000 that are reused. That means more memory consumption for the assembly code, and if it was AOT-compiled, more size on disk. But it also has other potentially deleterious effects. Computers use very fast and limited size instruction caches to store code to be run. If you have 1,000 bytes of code that you invoke from 100 different places, each of those places can potentially reuse the bytes previously loaded into the cache. But give each of those places their own (likely mutated) copy, and as far as the hardware is concerned, that&#8217;s different code, meaning the inlining can result in code actually running slower due to forcing more evictions and loads from and to that cache. 
There&#8217;s also the impact on the JIT compiler itself, as the JIT has limits on things like the size of a method before it&#8217;ll give up on optimizing further; inline too much code, and you can exceed said limits.<\/p>\n<p>Net net, inlining is hugely powerful, but also something to be employed carefully, and the JIT methodically (but necessarily quickly) weighs decisions it makes about what to inline and what not to with a variety of heuristics.<\/p>\n<p>In this light, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50675\">dotnet\/runtime#50675<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51124\">dotnet\/runtime#51124<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52708\">dotnet\/runtime#52708<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53670\">dotnet\/runtime#53670<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55478\">dotnet\/runtime#55478<\/a> improved the JIT by helping it to understand (and more efficiently understand) what methods were being invoked by the callee; by teaching the inliner about new things to look for, e.g. whether the callee could benefit from folding if handed constants; and by teaching the inliner how to inline various constructs it previously considered off-limits, e.g. switches. 
Let&#8217;s take just one example from a comment on one of those PRs:<\/p>\n<pre><code class=\"language-C#\">private int _value = 12345;\r\nprivate byte[] _buffer = new byte[100];\r\n\r\n[Benchmark]\r\npublic bool Format() =&gt; Utf8Formatter.TryFormat(_value, _buffer, out _, new StandardFormat('D', 2));<\/code><\/pre>\n<p>Running this for .NET 5 vs .NET 6, we can see a few things changed:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Format<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">13.21 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,649 B<\/td>\n<\/tr>\n<tr>\n<td>Format<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">10.37 ns<\/td>\n<td style=\"text-align: right;\">0.78<\/td>\n<td style=\"text-align: right;\">590 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>First, it got faster, yet there was little-to-no work done within <code>Utf8Formatter<\/code> itself in .NET 6 to improve the performance of this benchmark. Second, the code size (which is emitted thanks to using the <code>[DisassemblyDiagnoser]<\/code> attribute in our <code>Program.cs<\/code>) was cut to 35% of what it was in .NET 5. How is that possible? 
In both versions, the employed <code>TryFormat<\/code> call is a one-liner that delegates to a <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d019e70d2b7c2f7cd1137fac084dbcdc3d2e05f5\/src\/libraries\/System.Private.CoreLib\/src\/System\/Buffers\/Text\/Utf8Formatter\/Utf8Formatter.Integer.Signed.cs#L16-L17\">private <code>TryFormatInt64<\/code> method<\/a>, and the developer of that method decided to annotate it with <code>MethodImplOptions.AggressiveInlining<\/code>, which tells the JIT to override its heuristics and inline the method if it&#8217;s possible rather than if it&#8217;s possible and deemed useful. That method is a switch on the input <code>format.Symbol<\/code>, branching to call various other methods based on the format symbol employed (e.g. &#8216;D&#8217; vs &#8216;G&#8217; vs &#8216;N&#8217;). But we&#8217;ve actually already passed by the most interesting part, the <code>new StandardFormat('D', 2)<\/code> at the call site. In .NET 5, the JIT deems it not worthwhile to inline the <code>StandardFormat<\/code> constructor, and so we end up with a call to it:<\/p>\n<pre><code class=\"language-assembly\">       mov       edx,44\r\n       mov       r8d,2\r\n       call      System.Buffers.StandardFormat..ctor(Char, Byte)<\/code><\/pre>\n<p>As a result, even though <code>TryFormat<\/code> gets inlined, in .NET 5, the JIT is unable to connect the dots to see that the <code>'D'<\/code> passed into the <code>StandardFormat<\/code> constructor will influence which branch of that switch statement in <code>TryFormatInt64<\/code> gets taken. 
In .NET 6, the JIT does inline the <code>StandardFormat<\/code> constructor, the effect of which is that it can effectively shrink the contents of <code>TryFormatInt64<\/code> from:<\/p>\n<pre><code class=\"language-C#\">if (format.IsDefault)\r\n    return TryFormatInt64Default(value, destination, out bytesWritten);\r\n\r\nswitch (format.Symbol)\r\n{\r\n    case 'G':\r\n    case 'g':\r\n        if (format.HasPrecision)\r\n            throw new NotSupportedException(SR.Argument_GWithPrecisionNotSupported);\r\n        return TryFormatInt64D(value, format.Precision, destination, out bytesWritten);\r\n\r\n    case 'd':\r\n    case 'D':\r\n        return TryFormatInt64D(value, format.Precision, destination, out bytesWritten);\r\n\r\n    case 'n':\r\n    case 'N':\r\n        return TryFormatInt64N(value, format.Precision, destination, out bytesWritten);\r\n\r\n    case 'x':\r\n        return TryFormatUInt64X((ulong)value &amp; mask, format.Precision, true, destination, out bytesWritten);\r\n\r\n    case 'X':\r\n        return TryFormatUInt64X((ulong)value &amp; mask, format.Precision, false, destination, out bytesWritten);\r\n\r\n    default:\r\n        return FormattingHelpers.TryFormatThrowFormatException(out bytesWritten);\r\n}<\/code><\/pre>\n<p>to the equivalent of just:<\/p>\n<pre><code class=\"language-C#\">TryFormatInt64D(value, 2, destination, out bytesWritten);<\/code><\/pre>\n<p>avoiding the extra branches and not needing to inline the second copy of <code>TryFormatInt64D<\/code> (for the <code>'G'<\/code> case) or <code>TryFormatInt64N<\/code>, both of which are <code>AggressiveInlining<\/code>.<\/p>\n<p>Inlining also goes hand-in-hand with devirtualization, which is the act in which the JIT takes a virtual or interface method call, determines statically the actual end target of the invocation, and emits a direct call to that target, saving on the cost of the virtual dispatch. 
Once devirtualized, the target may also be inlined (subject to all of the same rules and heuristics), in which case it can avoid not only the virtual dispatch overhead, but also potentially benefit from the further optimizations inlining can enable. For example, consider a function like the following, which you might find in a collection implementation:<\/p>\n<pre><code class=\"language-C#\">private int[] _values = Enumerable.Range(0, 100_000).ToArray();\r\n\r\n[Benchmark]\r\npublic int Find() =&gt; Find(_values, 99_999);\r\n\r\nprivate static int Find&lt;T&gt;(T[] array, T item)\r\n{\r\n    for (int i = 0; i &lt; array.Length; i++)\r\n        if (EqualityComparer&lt;T&gt;.Default.Equals(array[i], item))\r\n            return i;\r\n\r\n    return -1;\r\n}<\/code><\/pre>\n<p>A previous release of .NET Core taught the JIT how to devirtualize <code>EqualityComparer&lt;T&gt;.Default<\/code> in such a use, resulting in an ~2x improvement over .NET Framework 4.8 in this example.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Find<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">115.4 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">127 B<\/td>\n<\/tr>\n<tr>\n<td>Find<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">69.7 us<\/td>\n<td style=\"text-align: right;\">0.60<\/td>\n<td style=\"text-align: right;\">71 B<\/td>\n<\/tr>\n<tr>\n<td>Find<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">69.8 us<\/td>\n<td style=\"text-align: right;\">0.60<\/td>\n<td style=\"text-align: right;\">63 B<\/td>\n<\/tr>\n<tr>\n<td>Find<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">53.4 us<\/td>\n<td style=\"text-align: right;\">0.46<\/td>\n<td style=\"text-align: right;\">57 
B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>However, while the JIT has been able to devirtualize <code>EqualityComparer&lt;T&gt;.Default.Equals<\/code> (for value types), not so for its sibling <code>Comparer&lt;T&gt;.Default.Compare<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48160\">dotnet\/runtime#48160<\/a> addresses that. This can be seen with a benchmark like the following, which compares <code>ValueTuple<\/code> instances (the <code>ValueTuple&lt;&gt;<\/code>.<code>CompareTo<\/code> method uses <code>Comparer&lt;T&gt;.Default<\/code> to compare each element of the tuple):<\/p>\n<pre><code class=\"language-C#\">private (int, long, int, long) _value1 = (5, 10, 15, 20);\r\nprivate (int, long, int, long) _value2 = (5, 10, 15, 20);\r\n\r\n[Benchmark]\r\npublic int Compare() =&gt; _value1.CompareTo(_value2);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Compare<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">17.467 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">240 B<\/td>\n<\/tr>\n<tr>\n<td>Compare<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">9.193 ns<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">209 B<\/td>\n<\/tr>\n<tr>\n<td>Compare<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.533 ns<\/td>\n<td style=\"text-align: right;\">0.15<\/td>\n<td style=\"text-align: right;\">186 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>But devirtualization improvements have gone well beyond such known intrinsic methods. 
Consider this microbenchmark:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic int GetLength() =&gt; ((ITuple)(5, 6, 7)).Length;<\/code><\/pre>\n<p>The fact that I&#8217;m using a <code>ValueTuple`3<\/code> and the <code>ITuple<\/code> interface here doesn&#8217;t matter: I just selected an arbitrary value type that implements an interface. A previous release of .NET Core enabled the JIT to avoid the boxing operation here (from casting a value type to an interface it implements) and emit this purely as a constrained method call, and then a subsequent release enabled it to be devirtualized and inlined:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetLength<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">6.3495 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">106 B<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<tr>\n<td>GetLength<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">4.0185 ns<\/td>\n<td style=\"text-align: right;\">0.628<\/td>\n<td style=\"text-align: right;\">66 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>GetLength<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">0.1223 ns<\/td>\n<td style=\"text-align: right;\">0.019<\/td>\n<td style=\"text-align: right;\">27 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>GetLength<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">0.0204 ns<\/td>\n<td style=\"text-align: right;\">0.003<\/td>\n<td style=\"text-align: right;\">27 B<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Great. 
But now let&#8217;s make a small tweak:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic int GetLength()\r\n{\r\n    ITuple t = (5, 6, 7);\r\n    Ignore(t);\r\n    return t.Length;\r\n}\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static void Ignore(object o) { }<\/code><\/pre>\n<p>Here I&#8217;ve forced the boxing by needing the object to exist in order to call the <code>Ignore<\/code> method, and previously that was enough to disable the ability to devirtualize the <code>t.Length<\/code> call. But .NET 6 now &#8220;gets it.&#8221; We can also see this by looking at the assembly. Here&#8217;s what we get for .NET 5:<\/p>\n<pre><code class=\"language-assembly\">; Program.GetLength()\r\n       push      rsi\r\n       sub       rsp,30\r\n       vzeroupper\r\n       vxorps    xmm0,xmm0,xmm0\r\n       vmovdqu   xmmword ptr [rsp+20],xmm0\r\n       mov       dword ptr [rsp+20],5\r\n       mov       dword ptr [rsp+24],6\r\n       mov       dword ptr [rsp+28],7\r\n       mov       rcx,offset MT_System.ValueTuple~3[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]]\r\n       call      CORINFO_HELP_NEWSFAST\r\n       mov       rsi,rax\r\n       vmovdqu   xmm0,xmmword ptr [rsp+20]\r\n       vmovdqu   xmmword ptr [rsi+8],xmm0\r\n       mov       rcx,rsi\r\n       call      Program.Ignore(System.Object)\r\n       mov       rcx,rsi\r\n       add       rsp,30\r\n       pop       rsi\r\n       jmp       near ptr System.ValueTuple~3[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].System.Runtime.CompilerServices.ITuple.get_Length()\r\n; Total bytes of code 92<\/code><\/pre>\n<p>and for .NET 6:<\/p>\n<pre><code class=\"language-assembly\">; Program.GetLength()\r\n       push      rsi\r\n       sub       rsp,30\r\n       vzeroupper\r\n       vxorps    xmm0,xmm0,xmm0\r\n       vmovupd   [rsp+20],xmm0\r\n       mov       
dword ptr [rsp+20],5\r\n       mov       dword ptr [rsp+24],6\r\n       mov       dword ptr [rsp+28],7\r\n       mov       rcx,offset MT_System.ValueTuple~3[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]]\r\n       call      CORINFO_HELP_NEWSFAST\r\n       mov       rcx,rax\r\n       lea       rsi,[rcx+8]\r\n       vmovupd   xmm0,[rsp+20]\r\n       vmovupd   [rsi],xmm0\r\n       call      Program.Ignore(System.Object)\r\n       cmp       [rsi],esi\r\n       mov       eax,3\r\n       add       rsp,30\r\n       pop       rsi\r\n       ret\r\n; Total bytes of code 92<\/code><\/pre>\n<p>Note in .NET 5 it&#8217;s tail calling to the interface implementation (jumping to the target method at the end rather than making a call that will need to return back to this method):<\/p>\n<pre><code class=\"language-assembly\">       jmp       near ptr System.ValueTuple~3[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].System.Runtime.CompilerServices.ITuple.get_Length()<\/code><\/pre>\n<p>whereas in .NET 6 it&#8217;s not only devirtualized but also inlined the <code>ITuple.Length<\/code> call, with the assembly now limited to moving the answer (<code>3<\/code>) into the return register:<\/p>\n<pre><code class=\"language-assembly\">       mov       eax,3<\/code><\/pre>\n<p>Nice.<\/p>\n<p>A multitude of other changes have impacted devirtualization as well. 
For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53567\">dotnet\/runtime#53567<\/a> improves devirtualization in AOT ReadyToRun images, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45526\">dotnet\/runtime#45526<\/a> improves devirtualization with generics such that information about the exact class obtained is then made available to improve inlining.<\/p>\n<p>Of course, there are many situations in which it&#8217;s impossible for the JIT to statically determine the exact target for a method call, thus preventing devirtualization and inlining&#8230; or does it?<\/p>\n<p>One of my favorite features of .NET 6 is PGO (profile-guided optimization). PGO as a concept isn&#8217;t new; it&#8217;s been implemented in a variety of development stacks, and has existed in .NET in multiple forms over the years. But the implementation in .NET 6 is something special when compared to previous releases; in particular, from my perspective, &#8220;dynamic PGO&#8221;. The general idea behind profile-guided optimization is that a developer can first compile their app, using special tooling that instruments the binary to track various pieces of interesting data. They can then run their instrumented application through typical use, and the resulting data from the instrumentation can then be fed back into the compiler the next time around to influence how the compiler compiles the code. The interesting statement there is &#8220;next time&#8221;. Traditionally, you&#8217;d build your app, run the data gathering process, and then rebuild the app feeding in the resulting data, and typically this would all be automated as part of a build pipeline; that process is referred to as &#8220;static PGO&#8221;. However, with tiered compilation, a whole new world is available.<\/p>\n<p>&#8220;Tiered compilation&#8221; has been enabled by default since .NET Core 3.0. 
For JIT&#8217;d code, it represents a compromise between getting going quickly and running with highly-optimized code. Code starts in &#8220;tier 0,&#8221; during which the JIT compiler applies very few optimizations, which also means the JIT compiles code very quickly (optimizations are often what end up taking the most time during compilation). The emitted code includes some tracking data to count how frequently methods are invoked, and once members pass a certain threshold, the JIT queues them to be recompiled in &#8220;tier 1,&#8221; this time with all the optimizations the JIT can muster, and learning from the previous compilation, e.g. an accessed <code>static readonly int<\/code> can become a constant, as its value will have already been computed by the time the tier 1 code is compiled (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45901\">dotnet\/runtime#45901<\/a> improves the aforementioned queueing, using a dedicated thread rather than using the thread pool). You can see where this is going. With &#8220;dynamic PGO,&#8221; the JIT can now do further instrumentation during tier 0, to track not just call counts but all of the interesting data it can use for profile-guided optimization, and then it can employ that during the compilation of tier 1.<\/p>\n<p>In .NET 6, dynamic PGO is off by default. To enable it, you need to set the <code>DOTNET_TieredPGO<\/code> environment variable:<\/p>\n<pre><code class=\"language-console\"># with bash\r\nexport DOTNET_TieredPGO=1\r\n\r\n# in cmd\r\nset DOTNET_TieredPGO=1\r\n\r\n# with PowerShell\r\n$env:DOTNET_TieredPGO=\"1\"<\/code><\/pre>\n<p>That enables gathering all of the interesting data during tier 0. On top of that, there are some other environment variables you&#8217;ll also want to consider setting. Note that the core libraries that make up .NET are installed with ReadyToRun images, which means they&#8217;ve essentially already been compiled into assembly code. 
ReadyToRun images can participate in tiering, but they don&#8217;t go through a tier 0, rather they go straight from the ReadyToRun code to tier 1; that means there&#8217;s no opportunity for dynamic PGO to instrument the binary for dynamically gathering insights. To enable instrumenting the core libraries as well, you can disable ReadyToRun:<\/p>\n<pre><code class=\"language-console\">$env:DOTNET_ReadyToRun=\"0\"<\/code><\/pre>\n<p>Then the core libraries will also participate. Finally, you can consider setting <code>DOTNET_TC_QuickJitForLoops<\/code>:<\/p>\n<pre><code class=\"language-console\">$env:DOTNET_TC_QuickJitForLoops=\"1\"<\/code><\/pre>\n<p>which enables tiering for methods that contain loops: otherwise, anything that has a backward jump goes straight to tier 1, meaning it gets optimized immediately as if tiered compilation didn&#8217;t exist, but in doing so loses out on the benefits of first going through tier 0. You may hear folks working on .NET referring to &#8220;full PGO&#8221;: that&#8217;s the case of all three of these environment variables being set, as then everything in the app is utilizing &#8220;dynamic PGO&#8221;. (Note that the ReadyToRun code for the framework assemblies does include implementations optimized based on PGO, just &#8220;static PGO&#8221;. The framework assemblies are compiled with PGO, used to execute a stable of representative apps and services, and then the resulting data is used to generate the final code that&#8217;s part of the shipped assemblies.)<\/p>\n<p>Enough setup&#8230; what does this do for us? Let&#8217;s take an example:<\/p>\n<pre><code class=\"language-C#\">private IEnumerator&lt;int&gt; _source = Enumerable.Range(0, int.MaxValue).GetEnumerator();\r\n\r\n[Benchmark]\r\npublic void MoveNext() =&gt; _source.MoveNext();<\/code><\/pre>\n<p>This is a pretty simple benchmark: we have an <code>IEnumerator&lt;int&gt;<\/code> stored in a field, and our benchmark is simply moving the iterator forward. 
When compiled on .NET 6 normally, we get this:<\/p>\n<pre><code class=\"language-assembly\">; Program.MoveNext()\r\n       sub       rsp,28\r\n       mov       rcx,[rcx+8]\r\n       mov       r11,7FFF8BB40378\r\n       call      qword ptr [7FFF8BEB0378]\r\n       nop\r\n       add       rsp,28\r\n       ret<\/code><\/pre>\n<p>That assembly code is the interface dispatch to whatever implementation backs that <code>IEnumerator&lt;int&gt;<\/code>. Now let&#8217;s set:<\/p>\n<pre><code class=\"language-console\">$env:DOTNET_TieredPGO=1<\/code><\/pre>\n<p>and try it again. This time, the code looks very different:<\/p>\n<pre><code class=\"language-assembly\">; Program.MoveNext()\r\n       sub       rsp,28\r\n       mov       rcx,[rcx+8]\r\n       mov       r11,offset MT_System.Linq.Enumerable+RangeIterator\r\n       cmp       [rcx],r11\r\n       jne       short M00_L03\r\n       mov       r11d,[rcx+0C]\r\n       cmp       r11d,1\r\n       je        short M00_L00\r\n       cmp       r11d,2\r\n       jne       short M00_L01\r\n       mov       r11d,[rcx+10]\r\n       inc       r11d\r\n       mov       [rcx+10],r11d\r\n       cmp       r11d,[rcx+18]\r\n       je        short M00_L01\r\n       jmp       short M00_L02\r\nM00_L00:\r\n       mov       r11d,[rcx+14]\r\n       mov       [rcx+10],r11d\r\n       mov       dword ptr [rcx+0C],2\r\n       jmp       short M00_L02\r\nM00_L01:\r\n       mov       dword ptr [rcx+0C],0FFFFFFFF\r\nM00_L02:\r\n       add       rsp,28\r\n       ret\r\nM00_L03:\r\n       mov       r11,7FFF8BB50378\r\n       call      qword ptr [7FFF8BEB0378]\r\n       jmp       short M00_L02<\/code><\/pre>\n<p>A few things to notice, beyond it being much longer. First, the <code>mov r11,7FFF8BB40378<\/code> followed by <code>call qword ptr [7FFF8BEB0378]<\/code> sequence for doing the interface dispatch still exists here, but it&#8217;s at the end of the method. 
One optimization common in PGO implementations is &#8220;hot\/cold splitting&#8221;, where sections of a method frequently executed (&#8220;hot&#8221;) are moved close together at the beginning of the method, and sections of a method infrequently executed (&#8220;cold&#8221;) are moved to the end of the method. That enables better use of instruction caches and minimizes loads necessary to bring in likely-unused code. So, this interface dispatch has moved to the end of the method, as based on PGO data the JIT expects it to be cold \/ rarely invoked. Yet this is the entirety of the original implementation; if that&#8217;s cold, what&#8217;s hot? Now at the beginning of the method, we see:<\/p>\n<pre><code class=\"language-assembly\">       mov       rcx,[rcx+8]\r\n       mov       r11,offset MT_System.Linq.Enumerable+RangeIterator\r\n       cmp       [rcx],r11\r\n       jne       short M00_L03<\/code><\/pre>\n<p>This is the magic. When the JIT instrumented the tier 0 code for this method, that included instrumenting this interface dispatch to track the concrete type of <code>_source<\/code> on each invocation. And the JIT found that every invocation was on a type called <code>Enumerable+RangeIterator<\/code>, which is a <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d019e70d2b7c2f7cd1137fac084dbcdc3d2e05f5\/src\/libraries\/System.Linq\/src\/System\/Linq\/Range.cs#L31\">private class<\/a> used to implement <code>Enumerable.Range<\/code> inside of the <code>Enumerable<\/code> implementation. As such, for tier 1 the JIT has emitted a check to see whether the type of <code>_source<\/code> is that <code>Enumerable+RangeIterator<\/code>: if it isn&#8217;t, then it jumps to the cold section we previously highlighted that&#8217;s performing the normal interface dispatch. 
But if it is, which, based on the profiling data, is expected to be the case the vast majority of the time, it can then proceed to directly invoke the <code>Enumerable+RangeIterator.MoveNext<\/code> method, devirtualized. Not only that, but it decided it was profitable to inline that <code>MoveNext<\/code> method. That <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d019e70d2b7c2f7cd1137fac084dbcdc3d2e05f5\/src\/libraries\/System.Linq\/src\/System\/Linq\/Range.cs#L47-L67\"><code>MoveNext<\/code><\/a> implementation is then the assembly code that immediately follows. The net effect of this is a bit larger code, but optimized for the exact scenario expected to be most common:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>PGO Disabled<\/td>\n<td style=\"text-align: right;\">1.905 ns<\/td>\n<td style=\"text-align: right;\">30 B<\/td>\n<\/tr>\n<tr>\n<td>PGO Enabled<\/td>\n<td style=\"text-align: right;\">0.7071 ns<\/td>\n<td style=\"text-align: right;\">105 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The JIT optimizes for PGO data in a variety of ways. Given the data it knows about how the code behaves, it can be more aggressive about inlining, as it has more data about what will and won&#8217;t be profitable. It can perform this &#8220;guarded devirtualization&#8221; for most interface and virtual dispatch, emitting one or more fast paths that are devirtualized and possibly inlined, along with a fallback that performs the standard dispatch should the actual type not match the expected type. It can actually reduce code size in various circumstances by choosing to not apply optimizations that might otherwise increase code size (e.g. inlining, loop cloning, etc.) in blocks discovered to be cold. 
It can optimize for type casts, emitting checks that do a direct type comparison against the actual object type rather than always relying on more complicated and expensive cast helpers (e.g. ones that need to search ancestor hierarchies or interface lists or that can handle generic co- and contra-variance). The list will continue to grow over time as the JIT learns more and more how to, well, learn.<\/p>\n<p>Lots of PRs contributed to PGO. Here are just a few:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44427\">dotnet\/runtime#44427<\/a> added support to the inliner that utilized call site frequency to boost the profitability metric (i.e. how valuable would it be to inline a method).<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45133\">dotnet\/runtime#45133<\/a> added the initial support for determining the distribution of concrete types used at virtual and interface dispatch call sites, in order to enable guarded devirtualization. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51157\">dotnet\/runtime#51157<\/a> further enhanced this with regard to small struct types, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51890\">dotnet\/runtime#51890<\/a> enabled improved code generation by chaining together guarded devirtualization call sites, grouping together the frequently-taken code paths where applicable.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52827\">dotnet\/runtime#52827<\/a> added support for special-casing <code>switch<\/code> cases when PGO data is available to support it. If there&#8217;s a dominant <code>switch<\/code> case, where the JIT sees that branch being taken at least 30% of the time, the JIT can emit a dedicated <code>if<\/code> check for that case up front, rather than having it go through the <code>switch<\/code> with the rest of the cases. 
(Note this applies to actual switches in the IL; not all C# <code>switch<\/code> statements will end up as <code>switch<\/code> instructions in IL, and in fact many won&#8217;t, as the C# compiler will often optimize smaller or more complicated switches into the equivalent of a cascading set of <code>if<\/code>\/<code>else if<\/code> checks.)<\/li>\n<\/ul>\n<p>That&#8217;s probably enough for now about inlining. There are other categories of optimization critical to high-performance C# and .NET code, as well. For example, bounds checking. One of the great things about C# and .NET is that, unless you go out of your way to circumvent the protections put in place (e.g. by using the <code>unsafe<\/code> keyword, the <code>Unsafe<\/code> class, the <code>Marshal<\/code> or <code>MemoryMarshal<\/code> classes, etc.), it&#8217;s near impossible to experience typical security vulnerabilities like buffer overruns. That&#8217;s because all accesses to arrays, strings, and spans are automatically &#8220;bounds checked&#8221; by the JIT, meaning it ensures before indexing into one of these data structures that the index is properly within bounds. 
You can see that with a simple example:<\/p>\n<pre><code class=\"language-C#\">public int M(int[] arr, int index) =&gt; arr[index];<\/code><\/pre>\n<p>for which the JIT will generate code similar to this:<\/p>\n<pre><code class=\"language-assembly\">; Program.M(Int32[], Int32)\r\n       sub       rsp,28\r\n       cmp       r8d,[rdx+8]\r\n       jae       short M01_L00\r\n       movsxd    rax,r8d\r\n       mov       eax,[rdx+rax*4+10]\r\n       add       rsp,28\r\n       ret\r\nM01_L00:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 28<\/code><\/pre>\n<p>The <code>rdx<\/code> register here stores the address of <code>arr<\/code>, and the length of <code>arr<\/code> is stored 8 bytes beyond that (in this 64-bit process), so <code>[rdx+8]<\/code> is <code>arr.Length<\/code>, and the <code>cmp r8d, [rdx+8]<\/code> instruction is comparing <code>arr.Length<\/code> against the <code>index<\/code> value stored in the <code>r8d<\/code> register. If the index is equal to or greater than the array length, it jumps to the end of the method, which calls a helper that throws an exception. That comparison is the &#8220;bounds check.&#8221;<\/p>\n<p>Of course, such bounds checks add overhead. For most code, the overhead is negligible, but if you&#8217;re reading this post, there&#8217;s a good chance you&#8217;ve written code where it&#8217;s not. And you certainly rely on code where it&#8217;s not: a lot of lower-level routines in the core .NET libraries do rely on avoiding this kind of overhead wherever possible. As such, the JIT goes to great lengths to avoid emitting bounds checking when it can prove going out of bounds isn&#8217;t possible. The prototypical example is a loop from <code>0<\/code> to an array&#8217;s <code>Length<\/code>. 
If you write:<\/p>\n<pre><code class=\"language-C#\">public int Sum(int[] arr)\r\n{\r\n    int sum = 0;\r\n    for (int i = 0; i &lt; arr.Length; i++) sum += arr[i];\r\n    return sum;\r\n}<\/code><\/pre>\n<p>the JIT will output code like this:<\/p>\n<pre><code class=\"language-assembly\">; Program.Sum(Int32[])\r\n       xor       eax,eax\r\n       xor       ecx,ecx\r\n       mov       r8d,[rdx+8]\r\n       test      r8d,r8d\r\n       jle       short M02_L01\r\nM02_L00:\r\n       movsxd    r9,ecx\r\n       add       eax,[rdx+r9*4+10]\r\n       inc       ecx\r\n       cmp       r8d,ecx\r\n       jg        short M02_L00\r\nM02_L01:\r\n       ret\r\n; Total bytes of code 29<\/code><\/pre>\n<p>Note there&#8217;s no tell-tale <code>call<\/code> followed by an <code>int3<\/code> instruction at the end of the method; that&#8217;s because no call to a throw helper is required here, as there&#8217;s no bounds checking needed. The JIT can see that, by construction, the loop can&#8217;t walk off either end of the array, and thus it needn&#8217;t emit a bounds check.<\/p>\n<p>Every release of .NET sees the JIT become wise to more and more patterns where it can safely eliminate bounds checking, and .NET 6 follows suit. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/40180\">dotnet\/runtime#40180<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43568\">dotnet\/runtime#43568<\/a> from <a href=\"https:\/\/github.com\/nathan-moore\">@nathan-moore<\/a> are great (and very helpful) examples. 
Consider the following benchmark:<\/p>\n<pre><code class=\"language-C#\">private char[] _buffer = new char[100];\r\n\r\n[Benchmark]\r\npublic bool TryFormatTrue() =&gt; TryFormatTrue(_buffer);\r\n\r\nprivate static bool TryFormatTrue(Span&lt;char&gt; destination)\r\n{\r\n    if (destination.Length &gt;= 4)\r\n    {\r\n        destination[0] = 't';\r\n        destination[1] = 'r';\r\n        destination[2] = 'u';\r\n        destination[3] = 'e';\r\n        return true;\r\n    }\r\n\r\n    return false;\r\n}<\/code><\/pre>\n<p>This represents relatively typical code you might see in some lower-level formatting, where the length of a span is checked and then data written into the span. In the past, the JIT has been a little finicky about which guard patterns here are recognized and which aren&#8217;t, and .NET 6 makes that a whole lot better, thanks to the aforementioned PRs. On .NET 5, this benchmark would result in assembly like the following:<\/p>\n<pre><code class=\"language-assembly\">; Program.TryFormatTrue(System.Span~1&lt;Char&gt;)\r\n       sub       rsp,28\r\n       mov       rax,[rcx]\r\n       mov       edx,[rcx+8]\r\n       cmp       edx,4\r\n       jl        short M01_L00\r\n       cmp       edx,0\r\n       jbe       short M01_L01\r\n       mov       word ptr [rax],74\r\n       cmp       edx,1\r\n       jbe       short M01_L01\r\n       mov       word ptr [rax+2],72\r\n       cmp       edx,2\r\n       jbe       short M01_L01\r\n       mov       word ptr [rax+4],75\r\n       cmp       edx,3\r\n       jbe       short M01_L01\r\n       mov       word ptr [rax+6],65\r\n       mov       eax,1\r\n       add       rsp,28\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       add       rsp,28\r\n       ret\r\nM01_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3<\/code><\/pre>\n<p>The beginning of the assembly loads the span&#8217;s reference into the <code>rax<\/code> register and the length of the span into the <code>edx<\/code> 
register:<\/p>\n<pre><code class=\"language-assembly\">       mov       rax,[rcx]\r\n       mov       edx,[rcx+8]<\/code><\/pre>\n<p>and then each assignment into the span ends up checking against this length, as in this sequence from above where we&#8217;re executing <code>destination[2] = 'u'<\/code>:<\/p>\n<pre><code class=\"language-assembly\">       cmp       edx,2\r\n       jbe       short M01_L01\r\n       mov       word ptr [rax+4],75<\/code><\/pre>\n<p>To save you from having to look at an ASCII table, lowercase &#8216;u&#8217; has an ASCII hex value of <code>0x75<\/code>, so this code is validating that <code>2<\/code> is less than the span&#8217;s length (and jumping to <code>call CORINFO_HELP_RNGCHKFAIL<\/code> if it&#8217;s not), then storing <code>'u'<\/code> into the element at index 2 of the span (<code>[rax+4]<\/code>, as each <code>char<\/code> is two bytes). That&#8217;s four bounds checks, one for each character in <code>\"true\"<\/code>, even though we know they&#8217;re all in-bounds. The JIT in .NET 6 knows that, too:<\/p>\n<pre><code class=\"language-assembly\">; Program.TryFormatTrue(System.Span~1&lt;Char&gt;)\r\n       mov       rax,[rcx]\r\n       mov       edx,[rcx+8]\r\n       cmp       edx,4\r\n       jl        short M01_L00\r\n       mov       word ptr [rax],74\r\n       mov       word ptr [rax+2],72\r\n       mov       word ptr [rax+4],75\r\n       mov       word ptr [rax+6],65\r\n       mov       eax,1\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       ret<\/code><\/pre>\n<p>Much better. Those changes then also allowed undoing some hacks (e.g. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49450\">dotnet\/runtime#49450<\/a> from <a href=\"https:\/\/github.com\/SingleAccretion\">@SingleAccretion<\/a>) in the core libraries that had previously been done to work around the lack of the bounds checking removal in such cases.<\/p>\n<p>Another bounds-checking improvement comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49271\">dotnet\/runtime#49271<\/a> from <a href=\"https:\/\/github.com\/SingleAccretion\">@SingleAccretion<\/a>. In previous releases, there was an issue in the JIT where an inlined method call could cause subsequent bounds checks that otherwise would have been removed to now no longer be removed. This PR fixes that, the effect of which is evident in this benchmark<\/p>\n<pre><code class=\"language-C#\">private long[] _buffer = new long[10];\r\nprivate DateTime _now = DateTime.UtcNow;\r\n\r\n[Benchmark]\r\npublic void Store() =&gt; Store(_buffer, _now);\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static void Store(Span&lt;long&gt; span, DateTime value)\r\n{\r\n    if (!span.IsEmpty)\r\n    {\r\n        span[0] = value.Ticks;\r\n    }\r\n}<\/code><\/pre>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.Store(System.Span~1&lt;Int64&gt;, System.DateTime)\r\n       sub       rsp,28\r\n       mov       rax,[rcx]\r\n       mov       ecx,[rcx+8]\r\n       test      ecx,ecx\r\n       jbe       short M01_L00\r\n       cmp       ecx,0\r\n       jbe       short M01_L01\r\n       mov       rcx,0FFFFFFFFFFFF\r\n       and       rdx,rcx\r\n       mov       [rax],rdx\r\nM01_L00:\r\n       add       rsp,28\r\n       ret\r\nM01_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 46\r\n\r\n; .NET 6.0.0\r\n; Program.Store(System.Span~1&lt;Int64&gt;, System.DateTime)\r\n       mov       rax,[rcx]\r\n       mov       ecx,[rcx+8]\r\n       test      ecx,ecx\r\n       jbe       short M01_L00\r\n       mov       
rcx,0FFFFFFFFFFFF\r\n       and       rdx,rcx\r\n       mov       [rax],rdx\r\nM01_L00:\r\n       ret\r\n; Total bytes of code 27<\/code><\/pre>\n<p>In other cases, it&#8217;s not about whether there&#8217;s a bounds check, but what code is emitted for a bounds check that isn&#8217;t elided. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42295\">dotnet\/runtime#42295<\/a> special-cases indexing into an array with a constant 0 index (which is actually fairly common) and emits a <code>test<\/code> instruction rather than a <code>cmp<\/code> instruction, which makes the code both slightly smaller and slightly faster.<\/p>\n<p>Another bounds-checking optimization that&#8217;s arguably a category of its own is &#8220;loop cloning.&#8221; The idea behind loop cloning is the JIT can duplicate a loop, creating one variant that&#8217;s the original and one variant that removes bounds checking, and then at run-time decide which to use based on an additional up-front check. For example, consider this code:<\/p>\n<pre><code class=\"language-C#\">public static int Sum(int[] array, int length)\r\n{\r\n    int sum = 0;\r\n    for (int i = 0; i &lt; length; i++)\r\n    {\r\n        sum += array[i];\r\n    }\r\n    return sum;\r\n}<\/code><\/pre>\n<p>The JIT still needs to bounds check the <code>array[i]<\/code> access, as while it knows that <code>i &gt;= 0<\/code> &amp;&amp; <code>i &lt; length<\/code>, it doesn&#8217;t know whether <code>length &lt;= array.Length<\/code> and thus doesn&#8217;t know whether <code>i &lt; array.Length<\/code>. However, doing such a bounds check on each iteration of the loop adds an extra comparison and branch on each iteration. 
Loop cloning enables the JIT to generate code that&#8217;s more like the equivalent of this:<\/p>\n<pre><code class=\"language-C#\">public static int Sum(int[] array, int length)\r\n{\r\n    int sum = 0;\r\n    if (array is not null &amp;&amp; length &lt;= array.Length)\r\n    {\r\n        for (int i = 0; i &lt; length; i++)\r\n        {\r\n            sum += array[i]; \/\/ bounds check removed\r\n        }\r\n    }\r\n    else\r\n    {\r\n        for (int i = 0; i &lt; length; i++)\r\n        {\r\n            sum += array[i]; \/\/ bounds check not removed\r\n        }\r\n    }\r\n    return sum;\r\n}<\/code><\/pre>\n<p>We end up paying for the extra up-front one-time checks, but as long as there are at least a couple of iterations, the elimination of the bounds check pays for that and more. Neat. However, as with other bounds checking removal optimizations, the JIT is looking for very specific patterns, and things that deviate and fall off the golden path lose out on the optimization. That can include something as simple as the type of the array itself: change the previous example to use <code>byte[]<\/code> instead of <code>int[]<\/code>, and that&#8217;s enough to throw the JIT off the scent&#8230; or, at least it was in .NET 5. 
Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48894\">dotnet\/runtime#48894<\/a>, in .NET 6 the loop is now cloned, as can be seen from this benchmark:<\/p>\n<pre><code class=\"language-C#\">private byte[] _buffer = Enumerable.Range(0, 1_000_000).Select(i =&gt; (byte)i).ToArray();\r\n\r\n[Benchmark]\r\npublic void Sum() =&gt; Sum(_buffer, 999_999);\r\n\r\npublic static int Sum(byte[] array, int length)\r\n{\r\n    int sum = 0;\r\n    for (int i = 0; i &lt; length; i++)\r\n    {\r\n        sum += array[i];\r\n    }\r\n    return sum;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Sum<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">471.3 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">54 B<\/td>\n<\/tr>\n<tr>\n<td>Sum<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">350.0 us<\/td>\n<td style=\"text-align: right;\">0.74<\/td>\n<td style=\"text-align: right;\">97 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.Sum()\r\n       sub       rsp,28\r\n       mov       rax,[rcx+8]\r\n       xor       edx,edx\r\n       xor       ecx,ecx\r\n       mov       r8d,[rax+8]\r\nM00_L00:\r\n       cmp       ecx,r8d\r\n       jae       short M00_L01\r\n       movsxd    r9,ecx\r\n       movzx     r9d,byte ptr [rax+r9+10]\r\n       add       edx,r9d\r\n       inc       ecx\r\n       cmp       ecx,0F423F\r\n       jl        short M00_L00\r\n       add       rsp,28\r\n       ret\r\nM00_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 54\r\n\r\n; .NET 6.0.0\r\n; Program.Sum()\r\n       sub       rsp,28\r\n       mov       rax,[rcx+8]\r\n       xor       edx,edx\r\n       xor       ecx,ecx\r\n       
test      rax,rax\r\n       je        short M00_L01\r\n       cmp       dword ptr [rax+8],0F423F\r\n       jl        short M00_L01\r\n       nop       word ptr [rax+rax]\r\nM00_L00:\r\n       movsxd    r8,ecx\r\n       movzx     r8d,byte ptr [rax+r8+10]\r\n       add       edx,r8d\r\n       inc       ecx\r\n       cmp       ecx,0F423F\r\n       jl        short M00_L00\r\n       jmp       short M00_L02\r\nM00_L01:\r\n       cmp       ecx,[rax+8]\r\n       jae       short M00_L03\r\n       movsxd    r8,ecx\r\n       movzx     r8d,byte ptr [rax+r8+10]\r\n       add       r8d,edx\r\n       mov       edx,r8d\r\n       inc       ecx\r\n       cmp       ecx,0F423F\r\n       jl        short M00_L01\r\nM00_L02:\r\n       add       rsp,28\r\n       ret\r\nM00_L03:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 97<\/code><\/pre>\n<p>Not just bytes, but the same issue manifests for arrays of non-primitive structs. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55612\">dotnet\/runtime#55612<\/a> addressed that. Additionally, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55299\">dotnet\/runtime#55299<\/a> improved loop cloning for various loops over multidimensional arrays.<\/p>\n<p>Since we&#8217;re on the topic of loop optimization, consider loop inversion. &#8220;Loop inversion&#8221; is a standard compiler transform that&#8217;s aimed at eliminating some branching from a loop. 
Consider a loop like:<\/p>\n<pre><code class=\"language-C#\">while (i &lt; 3)\r\n{\r\n    ...\r\n    i++;\r\n}<\/code><\/pre>\n<p>Loop inversion involves the compiler transforming this into:<\/p>\n<pre><code class=\"language-C#\">if (i &lt; 3)\r\n{\r\n    do\r\n    {\r\n        ...\r\n        i++;\r\n    }\r\n    while (i &lt; 3);\r\n}<\/code><\/pre>\n<p>In other words, change the <code>while<\/code> into a <code>do..while<\/code>, moving the condition check from the beginning of each iteration to the end of each iteration, and then add a one-time condition check at the beginning to compensate. Now imagine that <code>i == 2<\/code>. In the original structure, we enter the loop, <code>i<\/code> is incremented, and then we jump back to the beginning to do the condition test; it&#8217;ll fail (as <code>i<\/code> is now 3), and we&#8217;ll then jump again to just past the end of the loop. Now consider the same situation with the inverted loop. We pass the <code>if<\/code> condition, as <code>i == 2<\/code>. We then enter the <code>do..while<\/code>, <code>i<\/code> is incremented, and we check the condition. The condition fails, and we&#8217;re already at the end of the loop, so we don&#8217;t jump back to the beginning and instead just keep running past the loop. Summary: we saved two jumps. And in either case, if <code>i<\/code> was <code>&gt;= 3<\/code>, we have exactly the same number of jumps as we just jump to after the <code>while<\/code>\/<code>if<\/code>. The inverted structure also often affords additional optimizations; for example, the JIT&#8217;s pattern recognition used for loop cloning and the hoisting of invariants depend on the loop being in an inverted form. 
Both <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50982\">dotnet\/runtime#50982<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52347\">dotnet\/runtime#52347<\/a> improved the JIT&#8217;s support for loop inversion.<\/p>\n<p>Ok, we&#8217;ve talked about inlining optimizations, bounds checking optimizations, and loop optimizations. What about constants?<\/p>\n<p>&#8220;Constant folding&#8221; is simply a fancy term to mean a compiler computing values at compile-time rather than leaving it to run-time. Folding can happen at various levels of compilation. If you write this C#:<\/p>\n<pre><code class=\"language-C#\">public static int M() =&gt; 10 + 20 * 30 \/ 40 ^ 50 | 60 &amp; 70;<\/code><\/pre>\n<p>the C# compiler will fold this while compiling to IL, computing the constant value <code>47<\/code> from all of those operations:<\/p>\n<pre><code class=\"language-assembly\">IL_0000: ldc.i4.s 47\r\nIL_0002: ret<\/code><\/pre>\n<p>Folding can also happen in the JIT, which is particularly valuable in the face of inlining. If I have this C#:<\/p>\n<pre><code class=\"language-C#\">public static int M() =&gt; 10 + N();\r\npublic static int N() =&gt; 20;<\/code><\/pre>\n<p>the C# compiler doesn&#8217;t (and in many cases shouldn&#8217;t) do any kind of interprocedural analysis to determine that <code>N<\/code> always returns <code>20<\/code>, so you end up with this IL for <code>M<\/code>:<\/p>\n<pre><code class=\"language-assembly\">IL_0000: ldc.i4.s 10\r\nIL_0002: call int32 C::N()\r\nIL_0007: add\r\nIL_0008: ret<\/code><\/pre>\n<p>But with inlining, the JIT is able to generate this for <code>M<\/code>:<\/p>\n<pre><code class=\"language-assembly\">L0000: mov eax, 0x1e\r\nL0005: ret<\/code><\/pre>\n<p>having inlined the <code>20<\/code>, constant folded <code>10 + 20<\/code>, and gotten the constant value <code>30<\/code> (hex <code>0x1e<\/code>). 
Constant folding also goes hand-in-hand with &#8220;constant propagation,&#8221; which is the practice of the compiler substituting a constant value into an expression, at which point compilers will often be able to iterate, apply more constant folding, do more constant propagation, and so on. Let&#8217;s say I have this non-trivial set of helper methods:<\/p>\n<pre><code class=\"language-C#\">public bool ContainsSpace(string s) =&gt; Contains(s, ' ');\r\n\r\nprivate static bool Contains(string s, char c)\r\n{\r\n    if (s.Length == 1)\r\n    {\r\n        return s[0] == c;\r\n    }\r\n\r\n    for (int i = 0; i &lt; s.Length; i++)\r\n    {\r\n        if (s[i] == c)\r\n            return true;\r\n    }\r\n\r\n    return false;\r\n}<\/code><\/pre>\n<p>Based on whatever their needs were, the developer of <code>Contains(string, char)<\/code> decided that it would very frequently be called with string literals, and that single character literals were common. Now if I write:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic bool M() =&gt; ContainsSpace(\" \");<\/code><\/pre>\n<p>the entirety of the generated code produced by the JIT for <code>M<\/code> is:<\/p>\n<pre><code class=\"language-assembly\">L0000: mov eax, 1\r\nL0005: ret<\/code><\/pre>\n<p>How is that possible? The JIT inlines <code>Contains(string, char)<\/code> into <code>ContainsSpace(string)<\/code>, and inlines <code>ContainsSpace(string)<\/code> into <code>M()<\/code>. The implementation of <code>Contains(string, char)<\/code> is then exposed to the fact that <code>string s<\/code> is <code>\" \"<\/code> and <code>char c<\/code> is <code>' '<\/code>. It can then propagate the fact that <code>s.Length<\/code> is actually the constant <code>1<\/code>, which enables deleting as dead code everything after the <code>if<\/code> block. 
It can then see that <code>s[0]<\/code> is in-bounds, and remove any bounds checking, and can see that <code>s[0]<\/code> is the first character in the constant string <code>\" \"<\/code>, a <code>' '<\/code>, and can then see that <code>' ' == ' '<\/code>, making the entire operation return a constant <code>true<\/code>, hence the resulting <code>mov eax, 1<\/code>, which is used to return a Boolean value <code>true<\/code>. Neat, right? Of course, you may be asking yourself, &#8220;Does code really call such methods with literals?&#8221; And the answer is, absolutely, in lots of situations; the PR in .NET 5 that introduced the ability to treat <code>\"literalString\".Length<\/code> as a constant highlighted thousands of bytes of improvements in the generated assembly code across the core libraries. But a good example in .NET 6 that makes extra-special use of this is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/57217\">dotnet\/runtime#57217<\/a>. The methods being changed in this PR are expected to be called from C# compiler-generated code with literals, and being able to specialize based on the length of the string literal passed effectively enables multiple implementations of the method the JIT can choose from based on its knowledge of the literal used at the call site, resulting in faster and smaller code when such a literal is used.<\/p>\n<p>But, the JIT needs to be taught what kinds of things can be folded. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49930\">dotnet\/runtime#49930<\/a> teaches it how to fold null checks when used with constant strings, which as in the previous example, is most valuable with inlining. Consider the <code>Microsoft.Extensions.Logging.Console.ConsoleFormatter<\/code> abstract base class. It exposes a protected constructor that looks like this:<\/p>\n<pre><code class=\"language-C#\">protected ConsoleFormatter(string name)\r\n{\r\n    Name = name ?? 
throw new ArgumentNullException(nameof(name));\r\n}<\/code><\/pre>\n<p>which is a fairly typical construct: validating that an argument isn&#8217;t null, throwing an exception if it is, and storing it if it&#8217;s not. Now look at one of the built-in types derived from it, like <code>JsonConsoleFormatter<\/code>:<\/p>\n<pre><code class=\"language-C#\">public JsonConsoleFormatter(IOptionsMonitor&lt;JsonConsoleFormatterOptions&gt; options)\r\n    : base(ConsoleFormatterNames.Json)\r\n{\r\n    ReloadLoggerOptions(options.CurrentValue);\r\n    _optionsReloadToken = options.OnChange(ReloadLoggerOptions);\r\n}<\/code><\/pre>\n<p>Note that <code>base(ConsoleFormatterNames.Json)<\/code> call. <code>ConsoleFormatterNames.Json<\/code> is defined as:<\/p>\n<pre><code class=\"language-C#\">public const string Json = \"json\";<\/code><\/pre>\n<p>so this <code>base<\/code> call is really:<\/p>\n<pre><code class=\"language-C#\">base(\"json\")<\/code><\/pre>\n<p>When the JIT inlines the base constructor, it&#8217;ll now be able to see that the input is definitively not null, at which point it can eliminate as dead code the <code>?? throw new ArgumentNullException(nameof(name))<\/code>, and the entire inlined call will simply be the equivalent of <code>Name = \"json\"<\/code>.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50000\">dotnet\/runtime#50000<\/a> is similar. As mentioned earlier, thanks to tiered compilation, <code>static readonly<\/code>s initialized in tier 0 can become consts in tier 1. This was enabled in previous .NET releases. For example, you might find code that dynamically enables or disables a feature based on an environment variable and then stores the result of that into a <code>static readonly bool<\/code>. When code reading that static field is recompiled in tier 1, the Boolean value can be considered a constant, enabling branches based on that value to be trimmed away. 
For example, given this benchmark:<\/p>\n<pre><code class=\"language-C#\">private static readonly bool s_coolFeatureEnabled = GetCoolFeatureEnabled();\r\n\r\nprivate static bool GetCoolFeatureEnabled()\r\n{\r\n    string envVar = Environment.GetEnvironmentVariable(\"EnableCoolFeature\");\r\n    return envVar == \"1\" || \"true\".Equals(envVar, StringComparison.OrdinalIgnoreCase);\r\n}\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static void UsedWhenCoolEnabled() { }\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static void UsedWhenCoolNotEnabled() { }\r\n\r\n[Benchmark]\r\npublic void CallCorrectMethod()\r\n{\r\n    if (s_coolFeatureEnabled)\r\n    {\r\n        UsedWhenCoolEnabled();\r\n    }\r\n    else\r\n    {\r\n        UsedWhenCoolNotEnabled();\r\n    }\r\n}<\/code><\/pre>\n<p>since I&#8217;ve not set the environment variable, when I run this and examine the resulting tier 1 assembly for <code>CallCorrectMethod<\/code>, I see this:<\/p>\n<pre><code class=\"language-assembly\">; Program.CallCorrectMethod()\r\n       jmp       near ptr Program.UsedWhenCoolNotEnabled()\r\n; Total bytes of code 5<\/code><\/pre>\n<p>That is the entirety of the implementation; there&#8217;s no call to <code>UsedWhenCoolEnabled<\/code> anywhere in sight, because the JIT was able to prune away the <code>if<\/code> block as dead code based on <code>s_coolFeatureEnabled<\/code> being a constant <code>false<\/code>. The aforementioned PR builds on that capability by enabling null folding for such values. Consider a library that exposes a method like:<\/p>\n<pre><code class=\"language-C#\">public static bool Equals&lt;T&gt;(T i, T j, IEqualityComparer&lt;T&gt; comparer)\r\n{\r\n    comparer ??= EqualityComparer&lt;T&gt;.Default;\r\n    return comparer.Equals(i, j);\r\n}<\/code><\/pre>\n<p>comparing two values using the specified comparer, and if the specified comparer is null, using <code>EqualityComparer&lt;T&gt;.Default<\/code>. 
Now, with our benchmark we pass in <code>EqualityComparer&lt;int&gt;.Default<\/code>.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\n[Arguments(1, 2)]\r\npublic bool Equals(int i, int j) =&gt; Equals(i, j, EqualityComparer&lt;int&gt;.Default);\r\n\r\npublic static bool Equals&lt;T&gt;(T i, T j, IEqualityComparer&lt;T&gt; comparer)\r\n{\r\n    comparer ??= EqualityComparer&lt;T&gt;.Default;\r\n    return comparer.Equals(i, j);\r\n}<\/code><\/pre>\n<p>This is what the resulting assembly looks like with .NET 5 and .NET 6:<\/p>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.Equals(Int32, Int32)\r\n       mov       rcx,1503FF62D58\r\n       mov       rcx,[rcx]\r\n       test      rcx,rcx\r\n       jne       short M00_L00\r\n       mov       rcx,1503FF62D58\r\n       mov       rcx,[rcx]\r\nM00_L00:\r\n       mov       r11,7FFE420C03A0\r\n       mov       rax,[7FFE424403A0]\r\n       jmp       rax\r\n; Total bytes of code 51\r\n\r\n; .NET 6.0.0\r\n; Program.Equals(Int32, Int32)\r\n       mov       rcx,1B4CE6C2F78\r\n       mov       rcx,[rcx]\r\n       mov       r11,7FFE5AE60370\r\n       mov       rax,[7FFE5B1C0370]\r\n       jmp       rax\r\n; Total bytes of code 33<\/code><\/pre>\n<p>On .NET 5, those first two <code>mov<\/code> instructions are loading the <code>EqualityComparer&lt;int&gt;.Default<\/code>. Then with the call to <code>Equals&lt;T&gt;(T, T, IEqualityComparer&lt;T&gt;)<\/code> inlined, that <code>test rcx, rcx<\/code> is the null check for the <code>EqualityComparer&lt;int&gt;.Default<\/code> passed as an argument. If it&#8217;s not null (it won&#8217;t be null), it then jumps to <code>M00_L00<\/code>, where those two <code>mov<\/code>s and a <code>jmp<\/code> are a tail call to the interface <code>Equals<\/code> method. 
On .NET 6, you can see those first two instructions are still there, and the last three instructions are still there, but the middle four instructions (<code>test<\/code>, <code>jne<\/code>, <code>mov<\/code>, <code>mov<\/code>) have evaporated, because the compiler is now able to propagate the non-nullness of the <code>static readonly<\/code> and eliminate completely the <code>comparer ??= EqualityComparer&lt;T&gt;.Default;<\/code> from the inlined helper.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47321\">dotnet\/runtime#47321<\/a> also adds a lot of power with regards to folding. Most of the <code>Math<\/code> methods can now participate in constant folding, so if their inputs end up as constants for whatever reason, the results can become constants as well, and with constant propagation, this leads to the potential for serious reduction in run-time evaluation. Here&#8217;s a benchmark I created by copying some of the sample code from the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/api\/system.math\">System.Math docs<\/a>, editing it to create a method that computes the height of a trapezoid.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic double GetHeight() =&gt; GetHeight(20.0, 10.0, 8.0, 6.0);\r\n\r\n[MethodImpl(MethodImplOptions.AggressiveInlining)]\r\npublic static double GetHeight(double longbase, double shortbase, double leftLeg, double rightLeg)\r\n{\r\n    double x = (Math.Pow(rightLeg, 2.0) - Math.Pow(leftLeg, 2.0) + Math.Pow(longbase, 2.0) + Math.Pow(shortbase, 2.0) - 2 * shortbase * longbase) \/ (2 * (longbase - shortbase));\r\n    return Math.Sqrt(Math.Pow(rightLeg, 2.0) - Math.Pow(x, 2.0));\r\n}<\/code><\/pre>\n<p>These are what I get for benchmark results:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code 
Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetHeight<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">151.7852 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">179 B<\/td>\n<\/tr>\n<tr>\n<td>GetHeight<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">0.0000 ns<\/td>\n<td style=\"text-align: right;\">0.000<\/td>\n<td style=\"text-align: right;\">12 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note the time spent for .NET 6 has dropped to nothing, and the code size has dropped from 179 bytes to 12. How is that possible? Because the entire operation became a single constant. The .NET 5 assembly looked like this:<\/p>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.GetHeight()\r\n       sub       rsp,38\r\n       vzeroupper\r\n       vmovsd    xmm0,qword ptr [7FFE66C31CA0]\r\n       vmovsd    xmm1,qword ptr [7FFE66C31CB0]\r\n       call      System.Math.Pow(Double, Double)\r\n       vmovsd    qword ptr [rsp+28],xmm0\r\n       vmovsd    xmm0,qword ptr [7FFE66C31CC0]\r\n       vmovsd    xmm1,qword ptr [7FFE66C31CD0]\r\n       call      System.Math.Pow(Double, Double)\r\n       vmovsd    xmm2,qword ptr [rsp+28]\r\n       vsubsd    xmm3,xmm2,xmm0\r\n       vmovsd    qword ptr [rsp+30],xmm3\r\n       vmovsd    xmm0,qword ptr [7FFE66C31CE0]\r\n       vmovsd    xmm1,qword ptr [7FFE66C31CF0]\r\n       call      System.Math.Pow(Double, Double)\r\n       vaddsd    xmm2,xmm0,qword ptr [rsp+30]\r\n       vmovsd    qword ptr [rsp+30],xmm2\r\n       vmovsd    xmm0,qword ptr [7FFE66C31D00]\r\n       vmovsd    xmm1,qword ptr [7FFE66C31D10]\r\n       call      System.Math.Pow(Double, Double)\r\n       vaddsd    xmm1,xmm0,qword ptr [rsp+30]\r\n       vsubsd    xmm1,xmm1,qword ptr [7FFE66C31D20]\r\n       vdivsd    xmm0,xmm1,[7FFE66C31D30]\r\n       vmovsd    xmm1,qword ptr [7FFE66C31D40]\r\n       call      System.Math.Pow(Double, Double)\r\n       vmovsd    xmm2,qword ptr [rsp+28]\r\n       vsubsd   
 xmm0,xmm2,xmm0\r\n       vsqrtsd   xmm0,xmm0,xmm0\r\n       add       rsp,38\r\n       ret\r\n; Total bytes of code 179<\/code><\/pre>\n<p>with at least five calls to <code>Math.Pow<\/code> on top of a bunch of double addition, subtraction, and square root operations, whereas with .NET 6, we get:<\/p>\n<pre><code class=\"language-assembly\">; .NET 6.0.0\r\n; Program.GetHeight()\r\n       vzeroupper\r\n       vmovsd    xmm0,qword ptr [7FFE5B1BCE70]\r\n       ret\r\n; Total bytes of code 12<\/code><\/pre>\n<p>which is just returning a constant double value. It&#8217;s hard not to smile when seeing that.<\/p>\n<p>There were additional folding-related improvements. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48568\">dotnet\/runtime#48568<\/a> from <a href=\"https:\/\/github.com\/SingleAccretion\">@SingleAccretion<\/a> improved the handling of unsigned comparisons as part of constant folding and propagation; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47133\">dotnet\/runtime#47133<\/a> from <a href=\"https:\/\/github.com\/SingleAccretion\">@SingleAccretion<\/a> changed in what phase of the JIT certain folding is performed in order to improve its impact on inlining; and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43567\">dotnet\/runtime#43567<\/a> improved the folding of commutative operators. Further, for ReadyToRun, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42831\">dotnet\/runtime#42831<\/a> from <a href=\"https:\/\/github.com\/nathan-moore\">@nathan-moore<\/a> ensured that the <code>Length<\/code> of an array created from a constant could be propagated as a constant.<\/p>\n<p>Most of the improvements we&#8217;ve talked about thus far are cross-cutting. Sometimes, though, improvements are much more focused, with a change intended to improve the code generated for a very specific pattern. And there have been a lot of those in .NET 6. 
Here are a few examples:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37245\">dotnet\/runtime#37245<\/a>. When implicitly casting a <code>string<\/code> to a <code>ReadOnlySpan&lt;char&gt;<\/code>, the operator performs a <code>null<\/code> check on the input, such that it&#8217;ll return an empty span if the string is null. The operator is aggressively inlined, however, and so if the call site can prove that the string is not null, the null check can be eliminated.\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic ReadOnlySpan&lt;char&gt; Const() =&gt; \"hello world\";<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.Const()\r\n       mov       rax,12AE3A09B48\r\n       mov       rax,[rax]\r\n       test      rax,rax\r\n       jne       short M00_L00\r\n       xor       ecx,ecx\r\n       xor       r8d,r8d\r\n       jmp       short M00_L01\r\nM00_L00:\r\n       cmp       [rax],eax\r\n       cmp       [rax],eax\r\n       add       rax,0C\r\n       mov       rcx,rax\r\n       mov       r8d,0B\r\nM00_L01:\r\n       mov       [rdx],rcx\r\n       mov       [rdx+8],r8d\r\n       mov       rax,rdx\r\n       ret\r\n; Total bytes of code 53\r\n\r\n; .NET 6.0.0\r\n; Program.Const()\r\n       mov       rax,18030C4A038\r\n       mov       rax,[rax]\r\n       add       rax,0C\r\n       mov       [rdx],rax\r\n       mov       dword ptr [rdx+8],0B\r\n       mov       rax,rdx\r\n       ret\r\n; Total bytes of code 31<\/code><\/pre>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/37836\">dotnet\/runtime#37836<\/a>. <code>BitOperations.PopCount<\/code> was added in .NET Core 3.0, and returns the &#8220;popcount&#8221;, or &#8220;population count&#8221;, of the input number, meaning the number of bits set. 
It&#8217;s implemented as a hardware intrinsic if the underlying hardware supports it, or via a software fallback otherwise, but it&#8217;s also easily computed at compile time if the input is a constant (or if it becomes a constant from the JIT&#8217;s perspective, e.g. if the input is a <code>static readonly<\/code>). This PR turns <code>PopCount<\/code> into a JIT intrinsic, enabling the JIT to substitute a value for the whole method invocation if it deems that appropriate.\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic int PopCount() =&gt; BitOperations.PopCount(42);<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.PopCount()\r\n       mov       eax,2A\r\n       popcnt    eax,eax\r\n       ret\r\n; Total bytes of code 10\r\n\r\n; .NET 6.0.0\r\n; Program.PopCount()\r\n       mov       eax,3\r\n       ret\r\n; Total bytes of code 6<\/code><\/pre>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50997\">dotnet\/runtime#50997<\/a>. This is a great example of improvements being made to the JIT based on an evolving need from the kinds of things libraries end up doing. In particular, this came about because of improvements to string interpolation that we&#8217;ll discuss later in this post. Previously, if you wrote the interpolated string <code>$\"{_nullableValue}\"<\/code> where <code>_nullableValue<\/code> was, say, an <code>int?<\/code>, this would result in a <code>string.Format<\/code> call that passes <code>_nullableValue<\/code> as an <code>object<\/code> argument. Boxing that <code>int?<\/code> translates into either null if the nullable value is <code>null<\/code> or boxing its <code>int<\/code> value if it&#8217;s not null. 
With C# 10 and .NET 6, this will instead result in a call to a generic method, passing in the <code>_nullableValue<\/code> strongly-typed as <code>T<\/code>==<code>int?<\/code>, and that generic method then checks for various interfaces on the <code>T<\/code> and uses them if they exist. In performance testing of the feature, this exposed a measurable performance cliff due to the code generation employed for the nullable value types, both in allocation and in throughput. This PR helped to avoid that cliff by optimizing the boxing involved for this pattern of interface checking and usage.\n<pre><code class=\"language-C#\">private int? _nullableValue = 1;\r\n\r\n[Benchmark]\r\npublic string Format() =&gt; Format(_nullableValue);\r\n\r\nprivate string Format&lt;T&gt;(T value, IFormatProvider provider = null)\r\n{\r\n    if (value is IFormattable)\r\n    {\r\n        return ((IFormattable)value).ToString(null, provider);\r\n    }\r\n\r\n    return value.ToString();\r\n}<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Format<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">87.71 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">154 B<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>Format<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">51.88 ns<\/td>\n<td style=\"text-align: right;\">0.59<\/td>\n<td style=\"text-align: right;\">100 B<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50112\">dotnet\/runtime#50112<\/a>. 
For hot code paths, especially those concerned about size, there&#8217;s a common &#8220;throw helper&#8221; pattern employed where the code to perform a throw is moved out into a separate method, as the JIT won&#8217;t inline a method that is discovered to always throw. If there&#8217;s a common check being employed, that check is often put into its own helper as well. So, for example, if you wanted a helper method that checked to see if some reference type argument was null and then threw an exception if it was, that might look like this:\n<pre><code class=\"language-C#\">public static void ThrowIfNull(\r\n    [NotNull] object? argument, [CallerArgumentExpression(\"argument\")] string? paramName = null)\r\n{\r\n    if (argument is null)\r\n        Throw(paramName);\r\n}\r\n\r\n[DoesNotReturn]\r\nprivate static void Throw(string? paramName) =&gt; throw new ArgumentNullException(paramName);<\/code><\/pre>\n<p>And, in fact, that&#8217;s exactly what the new <code>ArgumentNullException.ThrowIfNull<\/code> helper introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55594\">dotnet\/runtime#55594<\/a> looks like. The trouble with this, however, is that in order to call the <code>ThrowIfNull<\/code> method with a string literal, we end up needing to materialize that string literal as a string object (e.g. for a <code>string input<\/code> argument, <code>nameof(input)<\/code>, aka <code>\"input\"<\/code>). If the check were being done inline, the JIT already has logic to deal with that, e.g. 
this:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\n[Arguments(\"hello\")]\r\npublic void ThrowIfNull(string input)\r\n{\r\n    \/\/ThrowIfNull(input, nameof(input));\r\n    if (input is null)\r\n        throw new ArgumentNullException(nameof(input));\r\n}<\/code><\/pre>\n<p>produces on .NET 5:<\/p>\n<pre><code class=\"language-assembly\">; Program.ThrowIfNull(System.String)\r\n       push      rsi\r\n       sub       rsp,20\r\n       test      rdx,rdx\r\n       je        short M00_L00\r\n       add       rsp,20\r\n       pop       rsi\r\n       ret\r\nM00_L00:\r\n       mov       rcx,offset MT_System.ArgumentNullException\r\n       call      CORINFO_HELP_NEWSFAST\r\n       mov       rsi,rax\r\n       mov       ecx,1\r\n       mov       rdx,7FFE715BB748\r\n       call      CORINFO_HELP_STRCNS\r\n       mov       rdx,rax\r\n       mov       rcx,rsi\r\n       call      System.ArgumentNullException..ctor(System.String)\r\n       mov       rcx,rsi\r\n       call      CORINFO_HELP_THROW\r\n       int       3\r\n; Total bytes of code 74<\/code><\/pre>\n<p>In particular, we&#8217;re talking about that <code>call CORINFO_HELP_STRCNS<\/code>. But with the check and throw moved into the helper, that lazy initialization of the string literal object doesn&#8217;t happen. We end up with the assembly for the check looking nice and slim, but from an overall memory perspective, it&#8217;s likely a regression to force all of those string literals to be materialized. This PR addressed that, by ensuring the lazy initialization still happens, only if we&#8217;re about to throw, even with the helper being used.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\n[Arguments(\"hello\")]\r\npublic void ThrowIfNull(string input)\r\n{\r\n    ThrowIfNull(input, nameof(input));\r\n}\r\n\r\nprivate static void ThrowIfNull(\r\n    [NotNull] object? argument, [CallerArgumentExpression(\"argument\")] string? 
paramName = null)\r\n{\r\n    if (argument is null)\r\n        Throw(paramName);\r\n}\r\n\r\n[DoesNotReturn]\r\nprivate static void Throw(string? paramName) =&gt; throw new ArgumentNullException(paramName);<\/code><\/pre>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.ThrowIfNull(System.String)\r\n       test      rdx,rdx\r\n       jne       short M00_L00\r\n       mov       rcx,1FC48939520\r\n       mov       rcx,[rcx]\r\n       jmp       near ptr Program.Throw(System.String)\r\nM00_L00:\r\n       ret\r\n; Total bytes of code 24\r\n\r\n; .NET 6.0.0\r\n; Program.ThrowIfNull(System.String)\r\n       sub       rsp,28\r\n       test      rdx,rdx\r\n       jne       short M00_L00\r\n       mov       ecx,1\r\n       mov       rdx,7FFEBF512BE8\r\n       call      CORINFO_HELP_STRCNS\r\n       mov       rcx,rax\r\n       add       rsp,28\r\n       jmp       near ptr Program.Throw(System.String)\r\nM00_L00:\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 46<\/code><\/pre>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43811\">dotnet\/runtime#43811<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46237\">dotnet\/runtime#46237<\/a>. It&#8217;s fairly common, in particular in the face of inlining, to end up with sequences that have redundant comparison operations. Consider a fairly typical expression when dealing with nullable value types: <code>if (i.HasValue) { Use(i.Value); }<\/code>. That <code>i.Value<\/code> access invokes the <code>Nullable&lt;T&gt;.Value<\/code> getter, which itself checks <code>HasValue<\/code>, leading to a redundant comparison with the developer-written <code>HasValue<\/code> check in the guard. 
This specific example has led some folks to adopt a pattern of using <code>GetValueOrDefault()<\/code> after a <code>HasValue<\/code> check, since somewhat ironically <code>GetValueOrDefault()<\/code> just returns the <code>value<\/code> field without any additional checks. But there shouldn&#8217;t be a penalty for writing the simpler code that makes logical sense. And thanks to this PR, there isn&#8217;t. The JIT will now walk the control flow graph to see if any <a href=\"https:\/\/en.wikipedia.org\/wiki\/Dominator_%28graph_theory%29\">dominating block<\/a> (basically, code we had to go through to get to this point) has a similar compare.\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic bool IsGreaterThan() =&gt; IsGreaterThan(42, 40);\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static bool IsGreaterThan(int? i, int j) =&gt; i.HasValue &amp;&amp; i.Value &gt; j;<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.IsGreaterThan(System.Nullable~1&lt;Int32&gt;, Int32)\r\n       sub       rsp,28\r\n       mov       [rsp+30],rcx\r\n       movzx     eax,byte ptr [rsp+30]\r\n       test      eax,eax\r\n       je        short M01_L00\r\n       test      eax,eax\r\n       je        short M01_L01\r\n       cmp       [rsp+34],edx\r\n       setg      al\r\n       movzx     eax,al\r\n       add       rsp,28\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       add       rsp,28\r\n       ret\r\nM01_L01:\r\n       call      System.ThrowHelper.ThrowInvalidOperationException_InvalidOperation_NoValue()\r\n       int       3\r\n; Total bytes of code 50\r\n\r\n; .NET 6.0.0\r\n; Program.IsGreaterThan(System.Nullable~1&lt;Int32&gt;, Int32)\r\n       mov       [rsp+8],rcx\r\n       cmp       byte ptr [rsp+8],0\r\n       je        short M01_L00\r\n       cmp       [rsp+0C],edx\r\n       setg      al\r\n       movzx     eax,al\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       
ret\r\n; Total bytes of code 26<\/code><\/pre>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49585\">dotnet\/runtime#49585<\/a>. Learning from others is very important. Division is typically a relatively slow operation on modern hardware, and thus compilers try to find ways to avoid it, especially when dividing by a constant. In such cases, the JIT will try to find an alternative, which typically involves some combination of shifting and multiplying by a &#8220;magic number&#8221; that&#8217;s derived from the particular constant. This PR implements the techniques from <a href=\"https:\/\/ridiculousfish.com\/files\/faster_unsigned_division_by_constants.pdf\">Faster Unsigned Division by Constants<\/a> to improve the magic number selected for a certain subset of constants, enabling better code generation when dividing by numbers like 7 or 365.\n<pre><code class=\"language-C#\">private uint _value = 12345;\r\n\r\n[Benchmark]\r\npublic uint Div7() =&gt; _value \/ 7;<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.Div()\r\n       mov       ecx,[rcx+8]\r\n       mov       edx,24924925\r\n       mov       eax,ecx\r\n       mul       edx\r\n       sub       ecx,edx\r\n       shr       ecx,1\r\n       lea       eax,[rcx+rdx]\r\n       shr       eax,2\r\n       ret\r\n; Total bytes of code 23\r\n\r\n; .NET 6.0.0\r\n; Program.Div()\r\n       mov       eax,[rcx+8]\r\n       mov       rdx,492492492493\r\n       mov       eax,eax\r\n       mul       rdx\r\n       mov       eax,edx\r\n       ret\r\n; Total bytes of code 21<\/code><\/pre>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45463\">dotnet\/runtime#45463<\/a>. It&#8217;s fairly common to see code check whether a value is even by using <code>i % 2 == 0<\/code>. 
The JIT can now transform that into code more like <code>(i &amp; 1) == 0<\/code> to arrive at the same answer but with less ceremony.\n<pre><code class=\"language-C#\">[Benchmark]\r\n[Arguments(42)]\r\npublic bool IsEven(int i) =&gt; i % 2 == 0;<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.IsEven(Int32)\r\n       mov       eax,edx\r\n       shr       eax,1F\r\n       add       eax,edx\r\n       and       eax,0FFFFFFFE\r\n       sub       edx,eax\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 19\r\n\r\n; .NET 6.0.0\r\n; Program.IsEven(Int32)\r\n       test      dl,1\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 10<\/code><\/pre>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44562\">dotnet\/runtime#44562<\/a>. It&#8217;s common in high-performance code that uses cached arrays to see the code first store the arrays into locals and then operate on the locals. This enables the JIT to prove to itself, if it sees nothing else assigning into the array reference, that the array is invariant, such that it can learn from previous use of the array to optimize subsequent use. For example, if you iterate <code>for (int i = 0; i &lt; arr.Length; i++) Use(arr[i]);<\/code>, it can eliminate the bounds check on the <code>arr[i]<\/code>, as it trusts <code>i &lt; arr.Length<\/code>. However, if this had instead been written as <code>for (int i = 0; i &lt; s_arr.Length; i++) Use(s_arr[i]);<\/code>, where <code>s_arr<\/code> is defined as <code>static readonly int[] s_arr = ...;<\/code>, the JIT would not eliminate the bounds check, as the JIT wasn&#8217;t satisfied that <code>s_arr<\/code> was definitely not going to change, despite the <code>readonly<\/code>. 
This PR fixed that, enabling the JIT to see this static readonly array as being invariant, which then enables subsequent optimizations like bounds check elimination and common subexpression elimination.\n<pre><code class=\"language-C#\">static readonly int[] s_array = { 1, 2, 3, 4 };\r\n\r\n[Benchmark]\r\npublic int Sum()\r\n{\r\n    if (s_array.Length &gt;= 4)\r\n    {\r\n        return s_array[0] + s_array[1] + s_array[2] + s_array[3];\r\n    }\r\n\r\n    return 0;\r\n}<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.Sum()\r\n       sub       rsp,28\r\n       mov       rax,15434127338\r\n       mov       rax,[rax]\r\n       cmp       dword ptr [rax+8],4\r\n       jl        short M00_L00\r\n       mov       rdx,rax\r\n       mov       ecx,[rdx+8]\r\n       cmp       ecx,0\r\n       jbe       short M00_L01\r\n       mov       edx,[rdx+10]\r\n       mov       r8,rax\r\n       cmp       ecx,1\r\n       jbe       short M00_L01\r\n       add       edx,[r8+14]\r\n       mov       r8,rax\r\n       cmp       ecx,2\r\n       jbe       short M00_L01\r\n       add       edx,[r8+18]\r\n       cmp       ecx,3\r\n       jbe       short M00_L01\r\n       add       edx,[rax+1C]\r\n       mov       eax,edx\r\n       add       rsp,28\r\n       ret\r\nM00_L00:\r\n       xor       eax,eax\r\n       add       rsp,28\r\n       ret\r\nM00_L01:\r\n       call      CORINFO_HELP_RNGCHKFAIL\r\n       int       3\r\n; Total bytes of code 89\r\n\r\n; .NET 6.0.0\r\n; Program.Sum()\r\n       mov       rax,28B98007338\r\n       mov       rax,[rax]\r\n       mov       edx,[rax+8]\r\n       cmp       edx,4\r\n       jl        short M00_L00\r\n       mov       rdx,rax\r\n       mov       edx,[rdx+10]\r\n       mov       rcx,rax\r\n       add       edx,[rcx+14]\r\n       mov       rcx,rax\r\n       add       edx,[rcx+18]\r\n       add       edx,[rax+1C]\r\n       mov       eax,edx\r\n       ret\r\nM00_L00:\r\n       xor       
eax,eax\r\n       ret\r\n; Total bytes of code 48<\/code><\/pre>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49548\">dotnet\/runtime#49548<\/a>. This PR optimized various patterns involving comparisons against 0. Given an expression like <code>a == 0 &amp;&amp; b == 0<\/code>, the JIT can now optimize that to be equivalent to <code>(a | b) == 0<\/code>, replacing a branch and second comparison with an <code>or<\/code>.\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic bool AreZero() =&gt; AreZero(1, 2);\r\n\r\n[MethodImpl(MethodImplOptions.NoInlining)]\r\nprivate static bool AreZero(int x, int y) =&gt; x == 0 &amp;&amp; y == 0;<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<pre><code class=\"language-assembly\">; .NET 5.0.9\r\n; Program.AreZero(Int32, Int32)\r\n       test      ecx,ecx\r\n       jne       short M01_L00\r\n       test      edx,edx\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\nM01_L00:\r\n       xor       eax,eax\r\n       ret\r\n; Total bytes of code 16\r\n\r\n; .NET 6.0.0\r\n; Program.AreZero(Int32, Int32)\r\n       or        edx,ecx\r\n       sete      al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 9<\/code><\/pre>\n<p>I can&#8217;t cover all of the pattern changes in as much detail, but there have been many more, e.g.<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46253\">dotnet\/runtime#46253<\/a> converted the <code>Interlocked.And<\/code> and <code>Interlocked.Or<\/code> methods introduced in .NET 5 into JIT intrinsics on ARM64.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46243\">dotnet\/runtime#46243<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45311\">dotnet\/runtime#45311<\/a> avoided cast helpers from being emitted for <code>(T)array.Clone()<\/code> and <code>object.MemberwiseClone()<\/code>.<\/li>\n<li><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43947\">dotnet\/runtime#43947<\/a> added support for unrolling single-iteration loops.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54864\">dotnet\/runtime#54864<\/a> enabled more methods to be tail-called by allowing implicit widening.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53214\">dotnet\/runtime#53214<\/a> eliminated redundant <code>test<\/code> instructions in some situations.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44419\">dotnet\/runtime#44419<\/a> enabled common subexpression elimination (CSE) for floating-point constants.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45604\">dotnet\/runtime#45604<\/a> from <a href=\"https:\/\/github.com\/alexcovington\">@alexcovington<\/a> optimized division like <code>-i \/ 7<\/code> to instead be emitted as the equivalent of <code>i \/ -7<\/code>, saving on a negation operation.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48589\">dotnet\/runtime#48589<\/a> extended support for throw helpers that are non-void returning.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52298\">dotnet\/runtime#52298<\/a> optimized how floating-point constants are assigned to ref parameters.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/32000\">dotnet\/runtime#32000<\/a> from <a href=\"https:\/\/github.com\/damageboy\">@damageboy<\/a> taught the JIT how to remove double-negation (e.g. 
<code>~(~x)<\/code>).<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49238\">dotnet\/runtime#49238<\/a> enabled the JIT to elide some additional null checks.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/35627\">dotnet\/runtime#35627<\/a> caused the JIT to emit better instructions for <code>i &lt; 0<\/code> checks.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42164\">dotnet\/runtime#42164<\/a> yielded better code generation for floating-point <code>-X<\/code> and <code>MathF.Abs(X)<\/code> operations.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/41772\">dotnet\/runtime#41772<\/a> enabled use of the BMI2 <code>rorx<\/code> instruction as part of rotate operations (<code>BitOperations.RotateRight<\/code>).<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55614\">dotnet\/runtime#55614<\/a> increased the number of loops in a given method that the JIT will optimize from 16 to 64.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51158\">dotnet\/runtime#51158<\/a> avoided some unnecessary spilling when storing into fields.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50813\">dotnet\/runtime#50813<\/a> updated the JIT&#8217;s knowledge of the execution characteristics of several operations (SQRT, RCP, RSQRT).<\/li>\n<\/ul>\n<p>At this point, I&#8217;ve spent a lot of blog real estate writing a love letter to the improvements made to the JIT in .NET 6. There&#8217;s still a lot more, but rather than share long sections about the rest, I&#8217;ll make a few final shout outs here:<\/p>\n<ul>\n<li>Value types have become more and more critical to optimize for, as developers focused on driving down allocations have turned to structs for salvation. However, historically the JIT hasn&#8217;t been able to optimize structs as well as one might have hoped, in particular around being able to keep structs in registers aggressively. 
A lot of work happened in .NET 6 to improve the situation, and while there&#8217;s still some more to be done in .NET 7, things have come a long way.\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43870\">dotnet\/runtime#43870<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39326\">dotnet\/runtime#39326<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44555\">dotnet\/runtime#44555<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48377\">dotnet\/runtime#48377<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55045\">dotnet\/runtime#55045<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55535\">dotnet\/runtime#55535<\/a>,\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55558\">dotnet\/runtime#55558<\/a>, and\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55727\">dotnet\/runtime#55727<\/a>, among others, all contributed here.<\/li>\n<li>Registers are really, really fast memory used to store data being used immediately by instructions. In any given code, there are typically many more variables in use than there are registers, and so something needs to determine which of those variables gets to live in which registers when. That process is referred to as &#8220;register allocation,&#8221; and getting it right contributes significantly to how well code performs. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48308\">dotnet\/runtime#48308<\/a> from <a href=\"https:\/\/github.com\/alexcovington\">@alexcovington<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54345\">dotnet\/runtime#54345<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47307\">dotnet\/runtime#47307<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45135\">dotnet\/runtime#45135<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52269\">dotnet\/runtime#52269<\/a> all contributed to improving the JIT&#8217;s register allocation heuristics in .NET 6. 
There&#8217;s also a <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/main\/docs\/design\/coreclr\/jit\/lsra-heuristic-tuning.md\">great write-up in dotnet\/runtime<\/a> about some of these tuning efforts.<\/li>\n<li>&#8220;Loop alignment&#8221; is a technique in which nop instructions are added before a loop to ensure that the beginning of the loop&#8217;s instructions fall at an address most likely to minimize the number of fetches required to load the instructions that make up that loop. Rather than trying to do justice to the topic, I recommend <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/loop-alignment-in-net-6\/\">Loop alignment in .NET 6<\/a>, which is very well written and provides excellent details on the topic, including highlighting the improvements that came from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44370\">dotnet\/runtime#44370<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42909\">dotnet\/runtime#42909<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55047\">dotnet\/runtime#55047<\/a>.<\/li>\n<li>Checking whether a type implements an interface (e.g. <code>if (something is ISomething)<\/code>) can be relatively expensive, and in the worst case involves a linear walk through all of a type&#8217;s implemented interfaces to see whether the specified one is in the list. The implementation here is relegated by the JIT to several helper functions, which, as of .NET 5, are now written in C# and live in the <code>System.Runtime.CompilerServices.CastHelpers<\/code> type as the <code>IsInstanceOfInterface<\/code> and <code>ChkCastInterface<\/code> interface methods. It&#8217;s not an understatement to say that the performance of these methods is critical to many applications running efficiently. 
So, lots of folks were excited to see <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49257\">dotnet\/runtime#49257<\/a> from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a>, which managed to improve the performance of these methods by ~15% to ~35%, depending on the usage.<\/li>\n<\/ul>\n<h3>GC<\/h3>\n<p>There&#8217;s been a lot of work happening in .NET 6 on the GC (garbage collector), the vast majority of which has been in the name of switching the GC implementation to be based on &#8220;regions&#8221; rather than on &#8220;segments&#8221;. The initial commit for regions is in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45172\">dotnet\/runtime#45172<\/a>, with over 30 PRs since expanding on it. <a href=\"https:\/\/github.com\/maoni0\">@maoni0<\/a> is shepherding this effort and has already written on the topic; I encourage reading her post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/put-a-dpad-on-that-gc\/\">Put a DPAD on that GC!<\/a> to learn more in depth. But here are a few key statements from her post to help shed some light on the terminology:<\/p>\n<blockquote><p>&#8220;So what are the key differences between segments and regions? Segments are large units or memory \u2013 on Server GC 64-bit if the segment sizes are 1GB, 2GB or 4GB each (for Workstation it&#8217;s much smaller \u2013 256MB) on SOH. Regions are much smaller units, they are by default 4MB each. So you might ask, &#8220;so they are smaller, why is that significant?&#8221;<\/p>\n<p>&#8220;[Imagine] a scenario where we have free spaces in one generation, say gen0 because there&#8217;s some async IO going on that caused us to demote a bunch of pins in gen0, that we don&#8217;t actually use (this could be due to not waiting for so long to do the next GC or we&#8217;d have accumulated too much survival which means the GC pause would be too long). Wouldn&#8217;t it be nice if we could use those free spaces for other generations if they need them! 
Same with free spaces in gen2 and LOH \u2013 you might have some free spaces in gen2, it would be nice to use them to allocate some large objects. We do decommit on a segment but only the end of the segment which is after the very last live object on that segment (denoted by the light gray space at the end of each segment). And if you have pinning that prevents the GC from retracting the end of the segment, then we can only form free spaces and free spaces are always committed memory. Of course you might ask, &#8220;why don&#8217;t you just decommit the middle of a segment that has large free spaces?&#8221;. But that requires bookkeeping to remember which parts in the middle of a segment are decommitted so we need to re-commit them when we want to use them to allocate objects. And now we are getting into the idea of regions anyway, which is to have much smaller amounts of memory being manipulated separately by the GC.&#8221;<\/p><\/blockquote>\n<p>Beyond regions, there have been other improvements to the GC in .NET 6:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45208\">dotnet\/runtime#45208<\/a> optimized the &#8220;plan phase&#8221; of foreground GCs (gen0 and gen1 GCs done while a background GC is in progress) by enabling it to use its list of marked objects, shaving a significant amount of time off the operation.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/41599\">dotnet\/runtime#41599<\/a> helps reduce pause times by ensuring that the mark lists are distributed evenly across all of the GC heaps \/ threads in server GC.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55174\">dotnet\/runtime#55174<\/a> added a time-based decay that enables gen0 and gen1 budgets to shrink over time with inactivity after they&#8217;d previously significantly expanded.<\/li>\n<\/ul>\n<h3>Threading<\/h3>\n<p>Moving up the stack a bit, let&#8217;s talk threading, starting with 
<code>ThreadPool<\/code>.<\/p>\n<p>Sometimes performance optimizations are about eliminating unnecessary work, or making tradeoffs that optimize for the common case while slightly pessimizing niche cases, or taking advantage of new lower-level capabilities to do something faster, or any number of other things. But sometimes, performance optimizations are about finding ways to help bad-but-common code be a little less bad.<\/p>\n<p>A thread pool&#8217;s job is simple: run work items. To do that, at its core a thread pool needs two things: a queue of work to be processed, and a set of threads to process them. We can write a functional, trivial thread pool, well, trivially:<\/p>\n<pre><code class=\"language-C#\">static class SimpleThreadPool\r\n{\r\n    private static BlockingCollection&lt;Action&gt; s_work = new();\r\n\r\n    public static void QueueUserWorkItem(Action action) =&gt; s_work.Add(action);\r\n\r\n    static SimpleThreadPool()\r\n    {\r\n        for (int i = 0; i &lt; Environment.ProcessorCount; i++)\r\n            new Thread(() =&gt;\r\n            {\r\n                while (true) s_work.Take()();\r\n            }) { IsBackground = true }.Start();\r\n    }\r\n}<\/code><\/pre>\n<p>Boom, functional thread pool. But&#8230; not a very good one. The hardest part of a good thread pool is in the management of the threads, and in particular determining at any given point how many threads should be servicing the queue of work. Too many threads, and you can grind a system to a halt, as all threads are fighting for the system&#8217;s resources, adding huge overheads with context switching, and getting in each other&#8217;s way with cache thrashing. Too few threads, and you can grind a system to a halt, as work items aren&#8217;t getting processed fast enough or, worse, running work items are blocked waiting for other work items to run but without enough additional threads to run them. 
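<\/p>\n<p>To make the &#8220;too few threads&#8221; failure mode concrete, here is a minimal sketch, not code from the runtime, that assumes a fixed-size pool shaped like <code>SimpleThreadPool<\/code> above. The <code>work<\/code>, <code>gate<\/code>, and <code>started<\/code> names are hypothetical, chosen only for illustration. Every worker picks up an item that blocks on <code>gate<\/code>, while the one item that would open the gate sits in the queue with no thread left to run it:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Collections.Concurrent;\r\nusing System.Threading;\r\n\r\n\/\/ Hypothetical fixed-size pool: one thread per core, and it never grows.\r\nvar work = new BlockingCollection&lt;Action&gt;();\r\nint workers = Environment.ProcessorCount;\r\nfor (int i = 0; i &lt; workers; i++)\r\n    new Thread(() =&gt; { while (true) work.Take()(); }) { IsBackground = true }.Start();\r\n\r\nvar gate = new ManualResetEventSlim();\r\nvar started = new CountdownEvent(workers);\r\n\r\n\/\/ Occupy every worker with an item that blocks until the gate opens...\r\nfor (int i = 0; i &lt; workers; i++)\r\n    work.Add(() =&gt; { started.Signal(); gate.Wait(); });\r\n\r\n\/\/ ...and queue the only item that would open it behind them.\r\nwork.Add(() =&gt; gate.Set());\r\n\r\nstarted.Wait(); \/\/ every worker is now blocked inside gate.Wait()\r\nbool opened = gate.Wait(TimeSpan.FromSeconds(2));\r\nConsole.WriteLine($\"Gate opened by the pool: {opened}\");\r\n\r\ngate.Set(); \/\/ release the workers manually so they can drain the queue<\/code><\/pre>\n<p>With a fixed thread count, the two-second wait is expected to time out and report <code>False<\/code>: nothing short of an additional thread (or the manual <code>Set<\/code> at the end) can make progress. That is the situation a real pool has to detect and recover from.<\/p>\n<p>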
The .NET <code>ThreadPool<\/code> has multiple mechanisms in place for determining how many threads should be in play at any point in time. First, it has a starvation detection mechanism. This mechanism is a fairly straightforward gate that kicks in once or twice a second and checks to see whether any progress has been made on removing items from the pool&#8217;s queues: if progress hasn&#8217;t been made, meaning nothing has been dequeued, the pool assumes the system is starved and injects an additional thread. Second, it has a hill climbing algorithm that is constantly seeking to maximize work item throughput by manipulating available thread count; after every N work item completions, it evaluates whether adding or removing a thread to\/from circulation helps or hurts work item throughput, thereby making it adaptive to the current needs of the system. However, the hill climbing mechanism has a weakness: in order to properly do its job, work items need to be completing&#8230; if work items aren&#8217;t completing because, say, all of the threads in the pool are blocked, hill climbing becomes temporarily useless, and the only mechanism for injecting additional threads is the starvation mechanism, which is (by design) fairly slow.<\/p>\n<p>Such a situation might emerge when a system is flooded with &#8220;sync over async&#8221; work, a term <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/should-i-expose-synchronous-wrappers-for-asynchronous-methods\/\">coined<\/a> to mean kicking off asynchronous work and then synchronously blocking waiting for it to complete; in the common case, such an anti-pattern ends up blocking one thread pool thread that depends on another thread pool thread doing work in order to unblock the first, and that can quickly result in all thread pool threads being blocked until enough have been injected to enable everyone to make forward progress. 
Such &#8220;sync-over-async&#8221; code, which often manifests as calling an async method and then blocking waiting on the returned task (e.g. <code>int i = GetValueAsync().Result<\/code>) is invariably considered a no-no in production code meant to be scalable, but sometimes it&#8217;s unavoidable, e.g. you&#8217;re forced to implement an interface that&#8217;s synchronous and the only means at your disposal to do so is with functionality exposed only as an async method.<\/p>\n<p>We can see the impact of this with a terrible repro:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Collections.Generic;\r\nusing System.Diagnostics;\r\nusing System.Threading.Tasks;\r\n\r\nvar tcs = new TaskCompletionSource();\r\nvar tasks = new List&lt;Task&gt;();\r\nfor (int i = 0; i &lt; Environment.ProcessorCount * 4; i++)\r\n{\r\n    int id = i;\r\n    tasks.Add(Task.Run(() =&gt;\r\n    {\r\n        Console.WriteLine($\"{DateTime.UtcNow:MM:ss.ff}: {id}\");\r\n        tcs.Task.Wait();\r\n    }));\r\n}\r\ntasks.Add(Task.Run(() =&gt; tcs.SetResult()));\r\n\r\nvar sw = Stopwatch.StartNew();\r\nTask.WaitAll(tasks.ToArray());\r\nConsole.WriteLine($\"Done: {sw.Elapsed}\");<\/code><\/pre>\n<p>This queues a bunch of work items to the thread pool, all of which block waiting for a task to complete, but that task won&#8217;t complete until the final queued work item completes it to unblock all the other workers. Thus, we end up blocking every thread in the pool, waiting for the thread pool to detect the starvation and inject another thread, which the repro then dutifully blocks, and on and on, until finally there are enough threads that every queued work item can be running concurrently. On .NET Framework 4.8 and .NET 5, the above repro on my 12-logical-core machine takes ~32 seconds to complete. 
You can see the output here; pay attention to the timestamps on each work item, where you can see that after ramping up very quickly to have a number of threads equal to the number of cores, it then very slowly introduces additional threads.<\/p>\n<pre><code class=\"language-console\">07:54.51: 4\r\n07:54.51: 8\r\n07:54.51: 1\r\n07:54.51: 5\r\n07:54.51: 9\r\n07:54.51: 0\r\n07:54.51: 10\r\n07:54.51: 2\r\n07:54.51: 11\r\n07:54.51: 3\r\n07:54.51: 6\r\n07:54.51: 7\r\n07:55.52: 12\r\n07:56.52: 13\r\n07:57.53: 14\r\n07:58.52: 15\r\n07:59.52: 16\r\n07:00.02: 17\r\n07:01.02: 18\r\n07:01.52: 19\r\n07:02.51: 20\r\n07:03.52: 21\r\n07:04.52: 22\r\n07:05.03: 23\r\n07:06.02: 24\r\n07:07.03: 25\r\n07:08.01: 26\r\n07:09.03: 27\r\n07:10.02: 28\r\n07:11.02: 29\r\n07:11.52: 30\r\n07:12.52: 31\r\n07:13.52: 32\r\n07:14.02: 33\r\n07:15.02: 34\r\n07:15.53: 35\r\n07:16.51: 36\r\n07:17.02: 37\r\n07:18.02: 38\r\n07:18.52: 39\r\n07:19.52: 40\r\n07:20.52: 41\r\n07:21.52: 42\r\n07:22.55: 43\r\n07:23.52: 44\r\n07:24.53: 45\r\n07:25.52: 46\r\n07:26.02: 47\r\nDone: 00:00:32.5128769<\/code><\/pre>\n<p>I&#8217;m happy to say the situation improves here for .NET 6. This is not license to start writing more sync-over-async code, but rather a recognition that sometimes it&#8217;s unavoidable, especially in existing applications that may not be able to move to an asynchronous model all at once, that might have some legacy components, etc. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53471\">dotnet\/runtime#53471<\/a> teaches the thread pool about the most common form of blocking we see in these situations, waiting on a <code>Task<\/code> that hasn&#8217;t yet completed. In response, the thread pool becomes much more aggressive about increasing its target thread count while the blocking persists, and then immediately lowers the target count again as soon as the blocking has ended. 
Running the same console app again on .NET 6, we can see that ~32 seconds drops to ~1.5 seconds, with the pool injecting threads much faster in response to the blocking.<\/p>\n<pre><code class=\"language-console\">07:53.39: 5\r\n07:53.39: 7\r\n07:53.39: 6\r\n07:53.39: 8\r\n07:53.39: 9\r\n07:53.39: 10\r\n07:53.39: 1\r\n07:53.39: 0\r\n07:53.39: 4\r\n07:53.39: 2\r\n07:53.39: 3\r\n07:53.47: 12\r\n07:53.47: 11\r\n07:53.47: 13\r\n07:53.47: 14\r\n07:53.47: 15\r\n07:53.47: 22\r\n07:53.47: 16\r\n07:53.47: 17\r\n07:53.47: 18\r\n07:53.47: 19\r\n07:53.47: 21\r\n07:53.47: 20\r\n07:53.50: 23\r\n07:53.53: 24\r\n07:53.56: 25\r\n07:53.59: 26\r\n07:53.63: 27\r\n07:53.66: 28\r\n07:53.69: 29\r\n07:53.72: 30\r\n07:53.75: 31\r\n07:53.78: 32\r\n07:53.81: 33\r\n07:53.84: 34\r\n07:53.91: 35\r\n07:53.97: 36\r\n07:54.03: 37\r\n07:54.10: 38\r\n07:54.16: 39\r\n07:54.22: 40\r\n07:54.28: 41\r\n07:54.35: 42\r\n07:54.41: 43\r\n07:54.47: 44\r\n07:54.54: 45\r\n07:54.60: 46\r\n07:54.68: 47\r\nDone: 00:00:01.3649530<\/code><\/pre>\n<p>Interestingly, this improvement was made easier by another large thread pool-related change in .NET 6: the implementation is now entirely in C#. In previous releases of .NET, the thread pool&#8217;s core dispatch routine was in managed code, but the logic around thread management was all still in native code in the runtime. All of that logic was ported to C# previously in support of CoreRT and mono, but it wasn&#8217;t used for coreclr. As of .NET 6 and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43841\">dotnet\/runtime#43841<\/a>, it now is used everywhere. This should make further improvements and optimizations easier and enable more advancements in the pool in future releases.<\/p>\n<p>Moving on from the thread pool, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55295\">dotnet\/runtime#55295<\/a> is an interesting improvement. 
One of the things you find a lot in multithreaded code, whether direct usage in low-lock algorithms or indirect usage in concurrency primitives like locks and semaphores, is spinning. Spinning is based on the idea that blocking in the operating system waiting for something to happen is very efficient for longer waits but incurs non-trivial overheads at the start and end of the waiting operation; if the thing you&#8217;re waiting for will likely happen very, very soon, you might be better off just looping around to try again immediately or after a very short pause. My use of the word &#8220;pause&#8221; there is not coincidental, as the x86 instruction set includes the &#8220;PAUSE&#8221; instruction, which tells the processor the code is doing a spin-wait and helps it to optimize accordingly. However, the delay incurred by the &#8220;PAUSE&#8221; instruction can vary greatly across processor architectures, e.g. it might take only 9 cycles on an Intel Core i5, but 65 cycles on an AMD Ryzen 7, or 140 cycles on an Intel Core i7. That makes it challenging to tune the behavior of higher-level code written using spin loops, which core code in the runtime and key concurrency-related types in the core libraries do. To address this discrepancy and provide a consistent view of pauses, previous releases of .NET have tried to measure at startup the duration of pauses, and then used those metrics to normalize how many pauses are used when one is needed. However, this approach has a few downsides. While the measurement wasn&#8217;t being done on the main thread of the startup path, it was still contributing milliseconds of CPU time to every process, a number that can add up over the millions or billions of .NET process invocations that happen every day. It also was only done once for a process, but for a variety of reasons that overhead could actually change during a process&#8217; lifetime, for example if a VM was suspended and moved from one physical machine to another. 
To address this, the aforementioned PR changes its scheme. Rather than measuring once at startup for a longer period of time, it periodically does a short measurement and uses that to refresh its perspective on how long pauses take. This should lead to an overall decrease in CPU usage as well as a more up-to-date understanding of what these pauses cost, leading to a more consistent behavior of the apps and services that rely on it.<\/p>\n<p>Let&#8217;s move on to <code>Task<\/code>, where there have been a multitude of improvements. One notable and long overdue change is enabling <code>Task.FromResult&lt;T&gt;<\/code> to return a cached instance. When async methods were added in .NET Framework 4.5, we added a cache that <code>async Task&lt;T&gt;<\/code> methods could use for synchronously-completing operations (synchronously completing async methods are counterintuitively extremely common; consider a method where the first invocation does I\/O to fill a buffer, but subsequent operations simply consume from that buffer). Rather than constructing a new <code>Task&lt;T&gt;<\/code> for every invocation of such a method, the cache would be consulted to see if a singleton <code>Task&lt;T&gt;<\/code> could be used instead. The cache obviously can&#8217;t store a singleton for every possible value of every <code>T<\/code>, but it can special-case some <code>T<\/code>s and cache a few values for each. For example, it caches two <code>Task&lt;bool&gt;<\/code> instances, one for <code>true<\/code> and one for <code>false<\/code>, and around 10 <code>Task&lt;int&gt;<\/code> instances, one for each of the values between <code>-1<\/code> and <code>8<\/code>, inclusive. But <code>Task.FromResult&lt;T&gt;<\/code> never used this cache, always returning a new instance even if there was a task for it in the cache. 
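<\/p>\n<p>One way to observe this is to compare references. The following is a small sketch rather than anything taken from the runtime&#8217;s own tests; the value <code>1000<\/code> is simply an arbitrary pick outside the cached <code>-1<\/code> to <code>8<\/code> range described above:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Threading.Tasks;\r\n\r\n\/\/ Each line prints True only if both calls handed back the same Task instance.\r\nConsole.WriteLine(object.ReferenceEquals(Task.FromResult(true), Task.FromResult(true)));\r\nConsole.WriteLine(object.ReferenceEquals(Task.FromResult(1), Task.FromResult(1)));\r\n\r\n\/\/ 1000 is outside the cached range, so no shared instance exists for it.\r\nConsole.WriteLine(object.ReferenceEquals(Task.FromResult(1000), Task.FromResult(1000)));<\/code><\/pre>\n<p>On a runtime where <code>Task.FromResult&lt;T&gt;<\/code> doesn&#8217;t consult the cache, every comparison is expected to print <code>False<\/code>, since each call allocates a fresh task.<\/p>\n<p>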
This has led to one of two commonly-seen occurrences: either a developer using <code>Task.FromResult<\/code> recognizes this deficiency and has to maintain their own cache for values like <code>true<\/code> and <code>false<\/code>, or a developer using <code>Task.FromResult<\/code> doesn&#8217;t recognize it and ends up paying for arguably unnecessary allocations. For .NET 6, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43894\">dotnet\/runtime#43894<\/a> changes <code>Task.FromResult&lt;T&gt;<\/code> to consult the cache, so creating tasks for a <code>bool<\/code> <code>true<\/code> or an <code>int<\/code> <code>1<\/code>, for example, no longer allocates. This adds a tiny bit of overhead (a branch or two) when <code>Task.FromResult&lt;T&gt;<\/code> is used with a type that can be cached but for which the specific value is not; however, on balance it&#8217;s worthwhile given the savings for extremely common values.<\/p>\n<p>Of course, tasks are very closely tied to async methods in C#, and it&#8217;s worth looking at a small but significant feature in C# 10 and .NET 6 that is likely to impact a lot of .NET code, directly or indirectly. This requires some backstory. When the C# compiler goes to implement an async method with the signature <code>async SomeTaskLikeType<\/code>, it consults the <code>SomeTaskLikeType<\/code> to see what &#8220;builder&#8221; should be used to help implement the method. For example, <code>ValueTask<\/code> is attributed with <code>[AsyncMethodBuilder(typeof(AsyncValueTaskMethodBuilder))]<\/code>, and so any <code>async ValueTask<\/code> method will cause the compiler to use <code>AsyncValueTaskMethodBuilder<\/code> as the builder for that method. 
We can see that if we compile a simple async method:<\/p>\n<pre><code class=\"language-C#\">public static async ValueTask ExampleAsync() { }<\/code><\/pre>\n<p>for which the compiler produces approximately the following as the implementation of <code>ExampleAsync<\/code>:<\/p>\n<pre><code class=\"language-C#\">public static ValueTask ExampleAsync()\r\n{\r\n    &lt;ExampleAsync&gt;d__0 stateMachine = default;\r\n    stateMachine.&lt;&gt;t__builder = AsyncValueTaskMethodBuilder.Create();\r\n    stateMachine.&lt;&gt;1__state = -1;\r\n    stateMachine.&lt;&gt;t__builder.Start(ref stateMachine);\r\n    return stateMachine.&lt;&gt;t__builder.Task;\r\n}<\/code><\/pre>\n<p>This builder type is used in the generated code to create the builder instance (via a static <code>Create<\/code> method), to access the built task (via a <code>Task<\/code> instance property), to complete that built task (via <code>SetResult<\/code> and <code>SetException<\/code> instance methods), and to handle the state management associated with that built task when an await yields (via <code>AwaitOnCompleted<\/code> and <code>UnsafeAwaitOnCompleted<\/code> instance methods). And as there are four types built into the core libraries that are intended to be used as the return type from async methods (<code>Task<\/code>, <code>Task&lt;T&gt;<\/code>, <code>ValueTask<\/code>, and <code>ValueTask&lt;T&gt;<\/code>), the core libraries also include four builders (<code>AsyncTaskMethodBuilder<\/code>, <code>AsyncTaskMethodBuilder&lt;T&gt;<\/code>, <code>AsyncValueTaskMethodBuilder<\/code>, and <code>AsyncValueTaskMethodBuilder&lt;T&gt;<\/code>), all in <code>System.Runtime.CompilerServices<\/code>. Most developers should never see these types in any code they read or write.<\/p>\n<p>One of the downsides to this model, however, is that which builder is selected is tied to the definition of the type being returned from the async method. 
So, if you want to define your async method to return <code>Task<\/code>, <code>Task&lt;T&gt;<\/code>, <code>ValueTask<\/code>, or <code>ValueTask&lt;T&gt;<\/code>, you have no way to control the builder that&#8217;s employed: it&#8217;s determined by that type and only by that type. Why would you want to change the builder? There are a variety of reasons someone might want to control the details of the lifecycle of the task, but one of the most prominent is pooling. When an <code>async Task<\/code>, <code>async ValueTask<\/code> or <code>async ValueTask&lt;T&gt;<\/code> method completes synchronously, nothing need be allocated: for <code>Task<\/code>, the implementation can just hand back <code>Task.CompletedTask<\/code>, for <code>ValueTask<\/code> it can just hand back <code>ValueTask.CompletedTask<\/code> (which is the same as <code>default(ValueTask)<\/code>), and for <code>ValueTask&lt;T&gt;<\/code> it can hand back <code>ValueTask.FromResult&lt;T&gt;<\/code>, which creates a struct that wraps the <code>T<\/code> value. However, when the method completes asynchronously, the implementations need to allocate some object (a <code>Task<\/code> or <code>Task&lt;T&gt;<\/code>) to uniquely identify this async operation and provide a conduit via which the completion information can be passed back to the caller awaiting the returned instance.<\/p>\n<p><code>ValueTask&lt;T&gt;<\/code> supports being backed not only by a <code>T<\/code> or a <code>Task&lt;T&gt;<\/code>, but also by an <code>IValueTaskSource&lt;T&gt;<\/code>, which allows enterprising developers to plug in a custom implementation, including one that could potentially be pooled. What if, instead of using the aforementioned builders, we could author a builder that used and pooled custom <code>IValueTaskSource&lt;T&gt;<\/code> instances? 
It could use those instead of <code>Task&lt;T&gt;<\/code> to back a <code>ValueTask&lt;T&gt;<\/code> returned from an asynchronously-completing <code>async ValueTask&lt;T&gt;<\/code> method. As outlined in the blog post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/async-valuetask-pooling-in-net-5\/\">Async ValueTask Pooling in .NET 5<\/a>, .NET 5 included that as an opt-in experiment, where <code>AsyncValueTaskMethodBuilder<\/code> and <code>AsyncValueTaskMethodBuilder&lt;T&gt;<\/code> had a custom <code>IValueTaskSource<\/code>\/<code>IValueTaskSource&lt;T&gt;<\/code> implementation they could instantiate and pool and use as the backing object behind a <code>ValueTask<\/code> or <code>ValueTask&lt;T&gt;<\/code>. The first time an async method needed to yield and move all its state from the stack to the heap, these builders would consult the pool and try to use an object already there, only allocating a new one if one wasn&#8217;t available in the pool. Then upon <code>GetResult()<\/code> being called via an <code>await<\/code> on the resulting <code>ValueTask<\/code>\/<code>ValueTask&lt;T&gt;<\/code>, the object would be returned to the pool. That experiment is complete and the environment variable removed for .NET 6. In its stead, this capability is supported in a new form in .NET 6 and C# 10.<\/p>\n<p>The <code>[AsyncMethodBuilder]<\/code> attribute we saw before can now be placed on methods in addition to on types, thanks to <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/54033\">dotnet\/roslyn#54033<\/a>; when an async method is attributed with <code>[AsyncMethodBuilder(typeof(SomeBuilderType))]<\/code>, the C# compiler will then prefer that builder over the default. 
And along with the C# 10 language\/compiler feature, .NET 6 includes two new builder types, <code>PoolingAsyncValueTaskMethodBuilder<\/code> and <code>PoolingAsyncValueTaskMethodBuilder&lt;T&gt;<\/code>, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50116\">dotnet\/runtime#50116<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55955\">dotnet\/runtime#55955<\/a>. If we change our previous example to be:<\/p>\n<pre><code class=\"language-C#\">[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]\r\npublic static async ValueTask ExampleAsync() { }<\/code><\/pre>\n<p>now the compiler generates:<\/p>\n<pre><code class=\"language-C#\">public static ValueTask ExampleAsync()\r\n{\r\n    &lt;ExampleAsync&gt;d__0 stateMachine = default;\r\n    stateMachine.&lt;&gt;t__builder = PoolingAsyncValueTaskMethodBuilder.Create();\r\n    stateMachine.&lt;&gt;1__state = -1;\r\n    stateMachine.&lt;&gt;t__builder.Start(ref stateMachine);\r\n    return stateMachine.&lt;&gt;t__builder.Task;\r\n}<\/code><\/pre>\n<p>which means <code>ExampleAsync<\/code> may now use pooled objects to back the returned <code>ValueTask<\/code> instances. 
We can see that with a simple benchmark:<\/p>\n<pre><code class=\"language-C#\">const int Iters = 100_000;\r\n\r\n[Benchmark(OperationsPerInvoke = Iters, Baseline = true)]\r\npublic async Task WithoutPooling()\r\n{\r\n    for (int i = 0; i &lt; Iters; i++)\r\n        await YieldAsync();\r\n\r\n    async ValueTask YieldAsync() =&gt; await Task.Yield();\r\n}\r\n\r\n[Benchmark(OperationsPerInvoke = Iters)]\r\npublic async Task WithPooling()\r\n{\r\n    for (int i = 0; i &lt; Iters; i++)\r\n        await YieldAsync();\r\n\r\n    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]\r\n    async ValueTask YieldAsync() =&gt; await Task.Yield();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithoutPooling<\/td>\n<td style=\"text-align: right;\">763.9 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">112 B<\/td>\n<\/tr>\n<tr>\n<td>WithPooling<\/td>\n<td style=\"text-align: right;\">781.9 ns<\/td>\n<td style=\"text-align: right;\">1.02<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note the allocation per call dropping from 112 bytes to 0. So, why not just make this the default behavior of <code>AsyncValueTaskMethodBuilder<\/code> and <code>AsyncValueTaskMethodBuilder&lt;T&gt;<\/code>? Two reasons. First, it does create a functional difference. <code>Task<\/code>s are more capable than <code>ValueTask<\/code>s, supporting concurrent usage, multiple awaiters, and synchronous blocking. 
If consuming code was, for example, doing:<\/p>\n<pre><code class=\"language-C#\">ValueTask vt = SomeMethodAsync();\r\nawait vt;\r\nawait vt;<\/code><\/pre>\n<p>that would have &#8220;just worked&#8221; when <code>ValueTask<\/code> was backed by a <code>Task<\/code>, but failed in one of multiple ways and varying levels of severity when pooling was enabled. Code analysis rule <a href=\"https:\/\/docs.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca2012\">CA2012<\/a> is meant to help avoid such code, but that alone is insufficient to prevent such breaks. Second, as you can see from the benchmark above, while the pooling avoided the allocation, it came with a bit more overhead. And not shown here is the additional overhead in memory and working set of having to maintain the pool at all, which is maintained per async method. There are also some potential overheads not shown here, things that are common pitfalls to any kind of pooling. For example, the GC is optimized to make gen0 collections really fast, and one of the ways it can do that is by not having to scan gen1 or gen2 as part of a gen0 GC. But if there are references to gen0 objects from gen1 or gen2, then it does need to scan portions of those generations (this is why storing references into fields involves &#8220;GC write barriers,&#8221; to see if a reference to a gen0 object is being stored into one from a higher generation). Since the entire purpose of pooling is to keep objects around for a long time, those objects will likely end up being in these higher generations, and any references they store could end up making GCs more expensive; that can easily be the case with these state machines, as every parameter and local used in the method could potentially need to be tracked as such. 
So, from a performance perspective, it&#8217;s best to use this capability only in places where it&#8217;s both likely to matter and where performance testing demonstrates it moves the needle in the right direction. We can see, of course, that there are scenarios where in addition to saving on allocation, it actually does improve throughput, which at the end of the day is typically what one is really focusing on improving when they&#8217;re measuring allocation reduction (i.e. reducing allocation to reduce time spent in garbage collection).<\/p>\n<pre><code class=\"language-C#\">private const int Concurrency = 256;\r\nprivate const int Iters = 100_000;\r\n\r\n[Benchmark(Baseline = true)]\r\npublic Task NonPooling()\r\n{\r\n    return Task.WhenAll(from i in Enumerable.Range(0, Concurrency)\r\n                        select Task.Run(async delegate\r\n                        {\r\n                            for (int i = 0; i &lt; Iters; i++)\r\n                                await A().ConfigureAwait(false);\r\n                        }));\r\n\r\n    static async ValueTask A() =&gt; await B().ConfigureAwait(false);\r\n\r\n    static async ValueTask B() =&gt; await C().ConfigureAwait(false);\r\n\r\n    static async ValueTask C() =&gt; await D().ConfigureAwait(false);\r\n\r\n    static async ValueTask D() =&gt; await Task.Yield();\r\n}\r\n\r\n[Benchmark]\r\npublic Task Pooling()\r\n{\r\n    return Task.WhenAll(from i in Enumerable.Range(0, Concurrency)\r\n                        select Task.Run(async delegate\r\n                        {\r\n                            for (int i = 0; i &lt; Iters; i++)\r\n                                await A().ConfigureAwait(false);\r\n                        }));\r\n\r\n    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]\r\n    static async ValueTask A() =&gt; await B().ConfigureAwait(false);\r\n\r\n    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]\r\n    static async ValueTask B() =&gt; await 
C().ConfigureAwait(false);\r\n\r\n    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]\r\n    static async ValueTask C() =&gt; await D().ConfigureAwait(false);\r\n\r\n    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]\r\n    static async ValueTask D() =&gt; await Task.Yield();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonPooling<\/td>\n<td style=\"text-align: right;\">3.271 s<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">11,800,058 KB<\/td>\n<\/tr>\n<tr>\n<td>Pooling<\/td>\n<td style=\"text-align: right;\">2.896 s<\/td>\n<td style=\"text-align: right;\">0.88<\/td>\n<td style=\"text-align: right;\">214 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Beyond these new builders, there have been other new APIs introduced in .NET 6 related to tasks. <code>Task.WaitAsync<\/code> was introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48842\">dotnet\/runtime#48842<\/a> and provides an optimized implementation for creating a new <code>Task<\/code> that will complete when either the previous one completes or when a specified timeout has elapsed or a specified <code>CancellationToken<\/code> has had cancellation requested. This is useful in replacing a fairly common pattern that shows up (and that, unfortunately, developers often get wrong) with developers wanting to wait for a task to complete but with either or both a timeout and cancellation. 
For example, this:<\/p>\n<pre><code class=\"language-C#\">Task t = ...;\r\nusing (var cts = new CancellationTokenSource())\r\n{\r\n    if (await Task.WhenAny(Task.Delay(timeout, cts.Token), t) != t)\r\n    {\r\n        throw new TimeoutException();\r\n    }\r\n\r\n    cts.Cancel();\r\n    await t;\r\n}<\/code><\/pre>\n<p>can now be replaced with just this:<\/p>\n<pre><code class=\"language-C#\">Task t = ...;\r\nawait t.WaitAsync(timeout);<\/code><\/pre>\n<p>and be faster with less overhead. A good example of that comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55262\">dotnet\/runtime#55262<\/a>, which used the new <code>Task.WaitAsync<\/code> to replace a similar implementation that existed inside of <code>SemaphoreSlim.WaitAsync<\/code>, such that the latter is now both simpler to maintain and faster with less allocation.<\/p>\n<pre><code class=\"language-C#\">private SemaphoreSlim _sem = new SemaphoreSlim(0, 1);\r\nprivate CancellationTokenSource _cts = new CancellationTokenSource();\r\n\r\n[Benchmark]\r\npublic Task WithCancellationToken()\r\n{\r\n    Task t = _sem.WaitAsync(_cts.Token);\r\n    _sem.Release();\r\n    return t;\r\n}\r\n\r\n[Benchmark]\r\npublic Task WithTimeout()\r\n{\r\n    Task t = _sem.WaitAsync(TimeSpan.FromMinutes(1));\r\n    _sem.Release();\r\n    return t;\r\n}\r\n\r\n[Benchmark]\r\npublic Task WithCancellationTokenAndTimeout()\r\n{\r\n    Task t = _sem.WaitAsync(TimeSpan.FromMinutes(1), _cts.Token);\r\n    _sem.Release();\r\n    return t;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WithCancellationToken<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">2.993 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,263 
B<\/td>\n<\/tr>\n<tr>\n<td>WithCancellationToken<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">1.327 us<\/td>\n<td style=\"text-align: right;\">0.44<\/td>\n<td style=\"text-align: right;\">536 B<\/td>\n<\/tr>\n<tr>\n<td>WithCancellationToken<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">1.337 us<\/td>\n<td style=\"text-align: right;\">0.45<\/td>\n<td style=\"text-align: right;\">496 B<\/td>\n<\/tr>\n<tr>\n<td>WithCancellationToken<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1.056 us<\/td>\n<td style=\"text-align: right;\">0.35<\/td>\n<td style=\"text-align: right;\">448 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>WithTimeout<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">3.267 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,304 B<\/td>\n<\/tr>\n<tr>\n<td>WithTimeout<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">1.768 us<\/td>\n<td style=\"text-align: right;\">0.54<\/td>\n<td style=\"text-align: right;\">1,064 B<\/td>\n<\/tr>\n<tr>\n<td>WithTimeout<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">1.769 us<\/td>\n<td style=\"text-align: right;\">0.54<\/td>\n<td style=\"text-align: right;\">1,056 B<\/td>\n<\/tr>\n<tr>\n<td>WithTimeout<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1.086 us<\/td>\n<td style=\"text-align: right;\">0.33<\/td>\n<td style=\"text-align: right;\">544 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>WithCancellationTokenAndTimeout<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">3.838 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,409 
B<\/td>\n<\/tr>\n<tr>\n<td>WithCancellationTokenAndTimeout<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">1.901 us<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<td style=\"text-align: right;\">1,080 B<\/td>\n<\/tr>\n<tr>\n<td>WithCancellationTokenAndTimeout<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">1.929 us<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<td style=\"text-align: right;\">1,072 B<\/td>\n<\/tr>\n<tr>\n<td>WithCancellationTokenAndTimeout<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1.186 us<\/td>\n<td style=\"text-align: right;\">0.31<\/td>\n<td style=\"text-align: right;\">544 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>.NET 6 also sees the long-requested addition of <code>Parallel.ForEachAsync<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46943\">dotnet\/runtime#46943<\/a>), which makes it easy to asynchronously enumerate an <code>IEnumerable&lt;T&gt;<\/code> or <code>IAsyncEnumerable&lt;T&gt;<\/code> and run a delegate for each yielded element, with those delegates executed in parallel, and with some modicum of control over how it happens, e.g. what <code>TaskScheduler<\/code> should be used, the maximum level of parallelism to enable, and what <code>CancellationToken<\/code> to use to cancel the work.<\/p>\n<p>On the subject of <code>CancellationToken<\/code>, the cancellation support in .NET 6 has also seen performance improvements, both for existing functionality and for new APIs that enable an app to do even better. One interesting improvement is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48251\">dotnet\/runtime#48251<\/a>, which is a good example of how one can design and implement and optimize for one scenario only to find that it&#8217;s making the wrong tradeoffs. 
When <code>CancellationToken<\/code> and <code>CancellationTokenSource<\/code> were introduced in .NET Framework 4.0, the expectation at the time was that the majority use case would be lots of threads registering and unregistering from the same <code>CancellationToken<\/code> in parallel. That led to a really neat (but complicated) lock-free implementation that involved quite a bit of allocation and overhead. If you were in fact registering and unregistering from the same token from lots of threads in parallel, the implementation was very efficient and resulted in good throughput. But if you weren&#8217;t, you were paying a lot of overhead for something that wasn&#8217;t providing reciprocal benefit. And, as luck would have it, that&#8217;s almost never the scenario these days. It&#8217;s much, much more common to have a <code>CancellationToken<\/code> that&#8217;s used serially, often with multiple registrations all in place at the same time, but with those registrations mostly having been added as part of the serial flow of execution rather than all concurrently. 
This PR recognizes this reality and reverts the implementation to a much simpler, lighterweight, and faster one that performs better for the vast majority use case (while taking a hit if it is actually hammered by multiple threads concurrently).<\/p>\n<pre><code class=\"language-C#\">private CancellationTokenSource _source = new CancellationTokenSource();\r\n\r\n[Benchmark]\r\npublic void CreateTokenDispose()\r\n{\r\n    using (var cts = new CancellationTokenSource())\r\n        _ = cts.Token;\r\n}\r\n\r\n[Benchmark]\r\npublic void CreateRegisterDispose()\r\n{\r\n    using (var cts = new CancellationTokenSource())\r\n        cts.Token.Register(s =&gt; { }, null).Dispose();\r\n}\r\n\r\n[Benchmark]\r\npublic void CreateLinkedTokenDispose()\r\n{\r\n    using (var cts = CancellationTokenSource.CreateLinkedTokenSource(_source.Token))\r\n        _ = cts.Token;\r\n}\r\n\r\n[Benchmark(OperationsPerInvoke = 1_000_000)]\r\npublic void CreateManyRegisterDispose()\r\n{\r\n    using (var cts = new CancellationTokenSource())\r\n    {\r\n        CancellationToken ct = cts.Token;\r\n        for (int i = 0; i &lt; 1_000_000; i++)\r\n            ct.Register(s =&gt; { }, null).Dispose();\r\n    }\r\n}\r\n\r\n[Benchmark(OperationsPerInvoke = 1_000_000)]\r\npublic void CreateManyRegisterMultipleDispose()\r\n{\r\n    using (var cts = new CancellationTokenSource())\r\n    {\r\n        CancellationToken ct = cts.Token;\r\n        for (int i = 0; i &lt; 1_000_000; i++)\r\n        {\r\n            var ctr1 = ct.Register(s =&gt; { }, null);\r\n            var ctr2 = ct.Register(s =&gt; { }, null);\r\n            var ctr3 = ct.Register(s =&gt; { }, null);\r\n            var ctr4 = ct.Register(s =&gt; { }, null);\r\n            var ctr5 = ct.Register(s =&gt; { }, null);\r\n            ctr5.Dispose();\r\n            ctr4.Dispose();\r\n            ctr3.Dispose();\r\n            ctr2.Dispose();\r\n            ctr1.Dispose();\r\n        }\r\n    
}\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>CreateTokenDispose<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">10.236 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">72 B<\/td>\n<\/tr>\n<tr>\n<td>CreateTokenDispose<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">6.934 ns<\/td>\n<td style=\"text-align: right;\">0.68<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<\/tr>\n<tr>\n<td>CreateTokenDispose<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">7.268 ns<\/td>\n<td style=\"text-align: right;\">0.71<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<\/tr>\n<tr>\n<td>CreateTokenDispose<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">6.200 ns<\/td>\n<td style=\"text-align: right;\">0.61<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>CreateRegisterDispose<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">144.218 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">385 B<\/td>\n<\/tr>\n<tr>\n<td>CreateRegisterDispose<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">79.392 ns<\/td>\n<td style=\"text-align: right;\">0.55<\/td>\n<td style=\"text-align: right;\">352 B<\/td>\n<\/tr>\n<tr>\n<td>CreateRegisterDispose<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">79.431 ns<\/td>\n<td style=\"text-align: right;\">0.55<\/td>\n<td style=\"text-align: right;\">352 B<\/td>\n<\/tr>\n<tr>\n<td>CreateRegisterDispose<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">56.715 ns<\/td>\n<td 
style=\"text-align: right;\">0.39<\/td>\n<td style=\"text-align: right;\">192 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>CreateLinkedTokenDispose<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">103.622 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">209 B<\/td>\n<\/tr>\n<tr>\n<td>CreateLinkedTokenDispose<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">61.944 ns<\/td>\n<td style=\"text-align: right;\">0.60<\/td>\n<td style=\"text-align: right;\">112 B<\/td>\n<\/tr>\n<tr>\n<td>CreateLinkedTokenDispose<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">53.526 ns<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<td style=\"text-align: right;\">80 B<\/td>\n<\/tr>\n<tr>\n<td>CreateLinkedTokenDispose<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">38.631 ns<\/td>\n<td style=\"text-align: right;\">0.37<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterDispose<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">87.713 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterDispose<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">43.491 ns<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterDispose<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">41.124 ns<\/td>\n<td style=\"text-align: right;\">0.47<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterDispose<\/td>\n<td>.NET 6.0<\/td>\n<td 
style=\"text-align: right;\">35.437 ns<\/td>\n<td style=\"text-align: right;\">0.40<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterMultipleDispose<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">439.874 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">281 B<\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterMultipleDispose<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">234.367 ns<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterMultipleDispose<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">229.483 ns<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>CreateManyRegisterMultipleDispose<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">192.213 ns<\/td>\n<td style=\"text-align: right;\">0.44<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>CancellationToken<\/code> also has new APIs that help with performance. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43114\">dotnet\/runtime#43114<\/a> added new overloads of <code>Register<\/code> and <code>UnsafeRegister<\/code> that, rather than taking an <code>Action&lt;object&gt;<\/code> delegate, accept an <code>Action&lt;object, CancellationToken&gt;<\/code> delegate. This gives the delegate access to the <code>CancellationToken<\/code> responsible for the callback being invoked, enabling code that was instantiating a new delegate and potentially a closure in order to get access to that information to instead be able to use a cached delegate instance (as the compiler generates for lambdas that don&#8217;t close over any state). 
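<\/p>\n<p>As a rough sketch of what that enables (the <code>Worker<\/code> type and its <code>Watch<\/code> and <code>OnCanceled<\/code> methods are hypothetical names used only for illustration), a non-capturing lambda can receive both the state object and the token:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Threading;\r\n\r\nclass Worker\r\n{\r\n    \/\/ Hypothetical helper: the lambda is static and captures nothing, so the\r\n    \/\/ compiler can cache a single delegate instance; \"this\" flows in as state.\r\n    public CancellationTokenRegistration Watch(CancellationToken token) =&gt;\r\n        token.Register(static (state, ct) =&gt; ((Worker)state).OnCanceled(ct), this);\r\n\r\n    \/\/ Receives the token that triggered the callback, without needing a closure.\r\n    private void OnCanceled(CancellationToken ct) =&gt;\r\n        Console.WriteLine(\"Cancellation requested: \" + ct.IsCancellationRequested);\r\n}<\/code><\/pre>\n<p>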
And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50346\">dotnet\/runtime#50346<\/a> makes it easier to reuse <code>CancellationTokenSource<\/code> instances for applications that want to pool them. In the past there have been multiple requests to be able to reuse any <code>CancellationTokenSource<\/code>, enabling its state to be reset from one that&#8217;s had cancellation requested to one that hasn&#8217;t. That&#8217;s <em>not<\/em> something we&#8217;ve done nor plan to do, as a <em>lot<\/em> of code depends on the idea that once a <code>CancellationToken<\/code>&#8216;s <code>IsCancellationRequested<\/code> is true it&#8217;ll always be true; if that&#8217;s not the case, it&#8217;s very difficult to reason about. However, the vast majority of <code>CancellationTokenSource<\/code>s are never canceled, and if they&#8217;re not canceled, there&#8217;s nothing that prevents them from continuing to be used, potentially stored into a pool for someone else to use in the future. This gets a bit tricky, however, if <code>CancelAfter<\/code> is used or if the constructor is used that takes a timeout, as both of those cause a timer to be created, and there are race conditions possible between the timer firing and someone checking to see whether <code>IsCancellationRequested<\/code> is true (to determine whether to reuse the instance). The new <code>TryReset<\/code> method avoids this race condition. If you do want to reuse such a <code>CancellationTokenSource<\/code>, call <code>TryReset<\/code>: if it returns true, it hasn&#8217;t had cancellation requested and any underlying timer has been reset as well such that it won&#8217;t fire without a new timeout being set. If it returns false, well, don&#8217;t try to reuse it, as no guarantees are made about its state. 
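<\/p>\n<p>A minimal sketch of that reuse pattern, assuming a hypothetical <code>CtsPool<\/code> helper (not a type that ships in .NET) backed by a <code>ConcurrentQueue&lt;T&gt;<\/code>:<\/p>\n<pre><code class=\"language-C#\">using System.Collections.Concurrent;\r\nusing System.Threading;\r\n\r\nstatic class CtsPool\r\n{\r\n    private static readonly ConcurrentQueue&lt;CancellationTokenSource&gt; s_pool = new();\r\n\r\n    \/\/ Hand out a pooled instance if one is available, otherwise allocate.\r\n    public static CancellationTokenSource Rent() =&gt;\r\n        s_pool.TryDequeue(out var cts) ? cts : new CancellationTokenSource();\r\n\r\n    public static void Return(CancellationTokenSource cts)\r\n    {\r\n        \/\/ Only pool instances that TryReset reports as safe to reuse;\r\n        \/\/ anything else is disposed rather than handed to another caller.\r\n        if (cts.TryReset())\r\n            s_pool.Enqueue(cts);\r\n        else\r\n            cts.Dispose();\r\n    }\r\n}<\/code><\/pre>\n<p>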
You can see how the Kestrel web server does this, via <a href=\"https:\/\/github.com\/dotnet\/aspnetcore\/pull\/31528\">dotnet\/aspnetcore#31528<\/a> and <a href=\"https:\/\/github.com\/dotnet\/aspnetcore\/pull\/34075\">dotnet\/aspnetcore#34075<\/a>.<\/p>\n<p>Those are some of the bigger performance-focused changes in threading. There are a myriad of smaller ones as well, for example the new <code>Thread.UnsafeStart<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47056\">dotnet\/runtime#47056<\/a>), <code>PreAllocatedOverlapped.UnsafeCreate<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53196\">dotnet\/runtime#53196<\/a>), and <code>ThreadPoolBoundHandle.UnsafeAllocateNativeOverlapped<\/code> APIs that make it easier and cheaper to avoid capturing <code>ExecutionContext<\/code>; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43891\">dotnet\/runtime#43891<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44199\">dotnet\/runtime#44199<\/a> that avoided several volatile accesses in threading types (this is mainly impactful on ARM); <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44853\">dotnet\/runtime#44853<\/a> from <a href=\"https:\/\/github.com\/LeaFrock\">@LeaFrock<\/a> that optimized the <code>ElapsedEventArgs<\/code> constructor to avoid some unnecessary roundtripping of a <code>DateTime<\/code> through a <code>FILETIME<\/code>; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/38896\">dotnet\/runtime#38896<\/a> from <a href=\"https:\/\/github.com\/Bond-009\">@Bond-009<\/a> that added a fast path to <code>Task.WhenAny(IEnumerable&lt;Task&gt;)<\/code> for the relatively common case of the input being an <code>ICollection&lt;Task&gt;<\/code>; and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47368\">dotnet\/runtime#47368<\/a>, which improved the code generation for <code>Interlocked.Exchange<\/code> and <code>Interlocked.CompareExchange<\/code> when used with <code>nint<\/code> 
(<code>IntPtr<\/code>) or <code>nuint<\/code> (<code>UIntPtr<\/code>) by enabling them to reuse the existing intrinsics for <code>int<\/code> and <code>long<\/code>:<\/p>\n<pre><code class=\"language-C#\">private nint _value;\r\n\r\n[Benchmark]\r\npublic nint CompareExchange() =&gt; Interlocked.CompareExchange(ref _value, (nint)1, (nint)0) + (nint)1;<\/code><\/pre>\n<pre><code class=\"language-assembly\">; .NET 5\r\n; Program.CompareExchange()\r\n       sub       rsp,28\r\n       cmp       [rcx],ecx\r\n       add       rcx,8\r\n       mov       edx,1\r\n       xor       r8d,r8d\r\n       call      00007FFEC051F8B0\r\n       inc       rax\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 31\r\n\r\n; .NET 6\r\n; Program.CompareExchange()\r\n       cmp       [rcx],ecx\r\n       add       rcx,8\r\n       mov       edx,1\r\n       xor       eax,eax\r\n       lock cmpxchg [rcx],rdx\r\n       inc       rax\r\n       ret\r\n; Total bytes of code 22<\/code><\/pre>\n<h3>System Types<\/h3>\n<p>Every .NET app uses types from the core <code>System<\/code> namespace, and so improvements to these types often have wide-reaching impact. There have been many performance enhancements to these types in .NET 6.<\/p>\n<p>Let&#8217;s start with <code>Guid<\/code>. <code>Guid<\/code> is used to provide unique identifiers for any number of things and operations. The ability to create them quickly is important, as is the ability to quickly format and parse them. Previous releases have seen significant improvements on all these fronts, but they get even better in .NET 6. 
Let&#8217;s take a simple benchmark for parsing:<\/p>\n<pre><code class=\"language-C#\">private string _guid = Guid.NewGuid().ToString();\r\n\r\n[Benchmark]\r\npublic Guid Parse() =&gt; Guid.Parse(_guid);<\/code><\/pre>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44918\">dotnet\/runtime#44918<\/a> helped avoid some overheads involved in unnecessarily accessing <code>CultureInfo.CurrentCulture<\/code> during parsing, as culture isn&#8217;t necessary or desired when parsing hexadecimal digits. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55792\">dotnet\/runtime#55792<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56210\">dotnet\/runtime#56210<\/a> rewrote parsing for the &#8216;D&#8217;, &#8216;B&#8217;, &#8216;P&#8217;, and &#8216;N&#8217; formats (all but the antiquated &#8216;X&#8217;) to be much more streamlined, with careful attention paid to avoidance of bounds checking, how data is moved around, number of instructions to be executed, and so on. The net result is a very nice increase in throughput:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Parse<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">251.88 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Parse<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">100.78 ns<\/td>\n<td style=\"text-align: right;\">0.40<\/td>\n<\/tr>\n<tr>\n<td>Parse<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">80.13 ns<\/td>\n<td style=\"text-align: right;\">0.32<\/td>\n<\/tr>\n<tr>\n<td>Parse<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">33.84 ns<\/td>\n<td style=\"text-align: right;\">0.13<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I love seeing tables like this. 
A 2.5x speedup going from .NET Framework 4.8 to .NET Core 3.1, another 1.3x on top of that going from .NET Core 3.1 to .NET 5, and then another 2.4x going from .NET 5 to .NET 6. Just one small example of how the platform gets better every release.<\/p>\n<p>One other <code>Guid<\/code>-related improvement won&#8217;t actually show up as a performance improvement (potentially even a tiny regression), but is worth mentioning in this context anyway. <code>Guid.NewGuid<\/code> has never guaranteed that the values generated would employ cryptographically-secure randomness; however, as an implementation detail, on Windows <code>NewGuid<\/code> was implemented with <code>CoCreateGuid<\/code>, which was in turn implemented with <code>CryptGenRandom<\/code>, and developers started using <code>Guid.NewGuid<\/code> as an easy source of randomness seeded by a cryptographically-secure generator. On Linux, <code>Guid.NewGuid<\/code> was then implemented using data read from <code>\/dev\/urandom<\/code>, which is also intended to provide cryptographic-level entropy, but on macOS, due to performance problems with <code>\/dev\/urandom<\/code> there, <code>Guid.NewGuid<\/code> was years ago switched to using <code>arc4random_buf<\/code>, which is for non-cryptographic purposes. It was decided in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42770\">dotnet\/runtime#42770<\/a> in the name of defense-in-depth security that <code>NewGuid<\/code> should revert to using <code>\/dev\/urandom<\/code> on macOS and accept the resulting regression. 
Thankfully, it doesn&#8217;t have to accept it; as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51526\">dotnet\/runtime#51526<\/a>, <code>Guid.NewGuid<\/code> on macOS is now able to use CommonCrypto&#8217;s <code>CCRandomGenerateBytes<\/code>, which not only returns cryptographically-strong random bits, but is also comparable in performance to <code>arc4random_buf<\/code>, such that there shouldn&#8217;t be a perceivable impact release-over-release:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NewGuid<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">94.94 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NewGuid<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">96.32 ns<\/td>\n<td style=\"text-align: right;\">1.01<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Moving on in <code>System<\/code>, <code>Version<\/code> is another such example of just getting better and better every release. <code>Version.ToString<\/code>\/<code>Version.TryFormat<\/code> had been using a cached <code>StringBuilder<\/code> for formatting. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48511\">dotnet\/runtime#48511<\/a> rewrote <code>TryFormat<\/code> to format directly into the caller-supplied span, rather than first formatting into a <code>StringBuilder<\/code> and then copying to the span. Then <code>ToString<\/code> was implemented as a wrapper for <code>TryFormat<\/code>, stack-allocating a span with enough space to hold any possible version, formatting into that, and then slicing and <code>ToString<\/code>&#8216;ing that span to produce the final string. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56051\">dotnet\/runtime#56051<\/a> then further improved upon this by being a little more thoughtful about how the code was structured. 
For example, it had been using <code>Int32.TryFormat<\/code> to format each of the <code>int<\/code> version components (<code>Major<\/code>, <code>Minor<\/code>, <code>Build<\/code>, <code>Revision<\/code>), but these components are guaranteed to never be negative, so we could actually format them as <code>uint<\/code> with no difference in behavior. Why is that helpful here? Because the <code>int<\/code> code path has one more non-inlined function call than the <code>uint<\/code> code path, as the former also needs to handle rendering negative values, and when you&#8217;re counting nanoseconds at this low level of the stack, such calls can make a measurable difference.<\/p>\n<pre><code class=\"language-C#\">private Version _version = new Version(6, 1, 21412, 16);\r\n\r\n[Benchmark]\r\npublic string VersionToString() =&gt; _version.ToString();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>VersionToString<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">184.50 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<\/tr>\n<tr>\n<td>VersionToString<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">107.35 ns<\/td>\n<td style=\"text-align: right;\">0.58<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>VersionToString<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">67.75 ns<\/td>\n<td style=\"text-align: right;\">0.37<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>VersionToString<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">44.83 ns<\/td>\n<td style=\"text-align: right;\">0.24<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One of my 
personal favorite sets of changes in .NET 6 is the overhauling of <code>System.Random<\/code>. There are many ways performance improvements can come about, and one of the most elusive but also impactful is completely changing the algorithm used to something much faster. Until .NET 6, <code>Random<\/code> employed the same algorithm it had been using for the last two decades, a variant of Knuth&#8217;s subtractive random number generator algorithm that dates back to the 1980s. That served .NET well, but it was time for an upgrade. In the intervening years, myriad pseudo-random algorithms have emerged, and for .NET 6 in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47085\">dotnet\/runtime#47085<\/a>, we picked the <code>xoshiro**<\/code> family, using <code>xoshiro128**<\/code> on 32-bit and <code>xoshiro256**<\/code> on 64-bit. These algorithms were introduced by <a href=\"https:\/\/prng.di.unimi.it\/\">Blackman and Vigna in 2018<\/a>, are very fast, and yield good enough pseudo-randomness for <code>Random<\/code>&#8216;s needs (for cryptographically-secure random number generation, <code>System.Security.Cryptography.RandomNumberGenerator<\/code> should be used instead). However, beyond the algorithm employed, the implementation is now smarter about overheads. For good or bad reasons, <code>Random<\/code> was introduced with almost all of its methods virtual. 
In addition to that leading to virtual dispatch overheads, it has additional impact on the evolution of the type: because someone could have overridden one of the methods, any new method we introduce has to be written in terms of the existing virtuals&#8230; so, for example, when we added the span-based <code>NextBytes<\/code> method, we had to implement that in terms of one of the existing <code>Next<\/code> methods, to ensure that any existing overrides would have their behavior respected (imagine if we didn&#8217;t, and someone had a <code>ThreadSafeRandom<\/code> derived type, which overrode all the methods, and locked around each one&#8230; except for the ones unavailable at the time the derived type was created). Now in .NET 6, we check at construction time whether we&#8217;re dealing with a derived type, and fall back to the old implementation if this is a derived type, otherwise preferring to use an implementation that needn&#8217;t be concerned about such compatibility issues. Similarly, over the years we&#8217;ve been hesitant to change <code>Random<\/code>&#8216;s implementation for fear of changing the numerical sequence yielded if someone provided a fixed seed to <code>Random<\/code>&#8216;s constructor (which is common, for example, in tests); now in .NET 6, just as for derived types, we fall back to the old implementation if a seed is supplied, otherwise preferring the new algorithm. This sets us up for the future where we can freely change and evolve the algorithm used by <code>new Random()<\/code> as better approaches present themselves. 
On top of that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47390\">dotnet\/runtime#47390<\/a> from <a href=\"https:\/\/github.com\/colgreen\">@colgreen<\/a> tweaked the <code>NextBytes<\/code> implementation further to avoid unnecessary moves between locals and fields, yielding another significant gain in throughput.<\/p>\n<pre><code class=\"language-C#\">private byte[] _buffer = new byte[10_000_000];\r\nprivate Random _random = new Random();\r\n\r\n[Benchmark]\r\npublic Random Ctor() =&gt; new Random();\r\n\r\n[Benchmark]\r\npublic int Next() =&gt; _random.Next();\r\n\r\n[Benchmark]\r\npublic int NextMax() =&gt; _random.Next(64);\r\n\r\n[Benchmark]\r\npublic int NextMinMax() =&gt; _random.Next(0, 64);\r\n\r\n[Benchmark]\r\npublic double NextDouble() =&gt; _random.NextDouble();\r\n\r\n[Benchmark]\r\npublic void NextBytes_Array() =&gt; _random.NextBytes(_buffer);\r\n\r\n[Benchmark]\r\npublic void NextBytes_Span() =&gt; _random.NextBytes((Span&lt;byte&gt;)_buffer);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">1,473.7 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Ctor<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">112.9 ns<\/td>\n<td style=\"text-align: right;\">0.08<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Next<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">7.653 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Next<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.033 ns<\/td>\n<td style=\"text-align: right;\">0.27<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: 
right;\"><\/td>\n<\/tr>\n<tr>\n<td>NextMax<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">10.146 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextMax<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">3.032 ns<\/td>\n<td style=\"text-align: right;\">0.30<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>NextMinMax<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">10.537 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextMinMax<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">3.110 ns<\/td>\n<td style=\"text-align: right;\">0.30<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>NextDouble<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">8.682 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextDouble<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.354 ns<\/td>\n<td style=\"text-align: right;\">0.27<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>NextBytes_Array<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">72,202,543.956 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextBytes_Array<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1,199,496.150 ns<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>NextBytes_Span<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">76,654,821.111 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NextBytes_Span<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1,199,474.872 
ns<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The <code>Random<\/code> changes also highlight tradeoffs made in optimizations. The approach of dynamically choosing the implementation to use when the instance is constructed means we incur an extra virtual dispatch on each operation. For the <code>new Random()<\/code> case that utilizes a new, faster algorithm, that overhead is well worth it and is much less than the significant savings realized. But for the <code>new Random(seed)<\/code> case, we don&#8217;t have those algorithmic wins to offset things. As the overhead is small (on my machine 1-2ns) and as the scenarios for providing a seed are a minority use case in situations where counting nanoseconds matters (passing a specific seed is often used in testing, for example, where repeatable results are required), we accepted the tradeoff. But even the smallest, planned regressions can nag at you, especially when discussing them very publicly in a blog post, so in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/57530\">dotnet\/runtime#57530<\/a> we mitigated most of them (basically everything other than the simplest seeded <code>Next()<\/code> overload, which on my machine is ~4% slower in .NET 6 than in .NET 5) and even managed to turn most into improvements. This was done primarily by splitting the compat strategy implementation further into one for <code>new Random(seed)<\/code> and one for <code>new DerivedRandom()<\/code>, which enables the former to avoid any virtual dispatch between members (and for the latter, derived types can override to provide their own completion implementation). As previously noted, a method like <code>Next(int, int)<\/code> delegates to another virtual method on the instance, but that virtual delegation can now be removed entirely for the seeded case as well.<\/p>\n<p>In addition to changes in the implementation, <code>Random<\/code> also gained new surface area in .NET 6. 
This includes new <code>NextInt64<\/code> and <code>NextSingle<\/code> methods, but also <code>Random.Shared<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50297\">dotnet\/runtime#50297<\/a>). The static <code>Random.Shared<\/code> property returns a thread-safe instance that can be used from any thread. This means code no longer needs to pay the overheads of creating a new <code>Random<\/code> instance when it might sporadically want to get a pseudo-random value, nor needs to manage its own scheme for caching and using <code>Random<\/code> instances in a thread-safe manner. Code can simply do <code>Random.Shared.Next()<\/code>.<\/p>\n<pre><code class=\"language-C#\">[Benchmark(Baseline = true)]\r\npublic int NewNext() =&gt; new Random().Next();\r\n\r\n[Benchmark]\r\npublic int SharedNext() =&gt; Random.Shared.Next();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NewNext<\/td>\n<td style=\"text-align: right;\">114.713 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">72 B<\/td>\n<\/tr>\n<tr>\n<td>SharedNext<\/td>\n<td style=\"text-align: right;\">5.377 ns<\/td>\n<td style=\"text-align: right;\">0.05<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Next, <code>Environment<\/code> provides access to key information about the current machine and process. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45057\">dotnet\/runtime#45057<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49484\">dotnet\/runtime#49484<\/a> updated <code>GetEnvironmentVariables<\/code> to use <code>IndexOf<\/code> to search for the separators between key\/value pairs, rather than using an open-coded loop. 
In addition to reducing the amount of code needed in the implementation, this takes advantage of the fact that <code>IndexOf<\/code> is heavily optimized using a vectorized implementation. The net result is much faster retrieval of all environment variables: on my machine, with the environment variables I have in my environment, I get results like these:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic IDictionary GetEnvironmentVariables() =&gt; Environment.GetEnvironmentVariables();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetEnvironmentVariables<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">35.04 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">33 KB<\/td>\n<\/tr>\n<tr>\n<td>GetEnvironmentVariables<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">13.43 us<\/td>\n<td style=\"text-align: right;\">0.38<\/td>\n<td style=\"text-align: right;\">33 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>.NET 6 also sees new APIs added to <code>Environment<\/code> to provide not only simpler access to commonly-accessed information, but also much faster access. It&#8217;s pretty common for apps, for example in logging code, to want to get the current process&#8217; ID. To achieve that prior to .NET 5, code would often do something like <code>Process.GetCurrentProcess().Id<\/code>, and .NET 5 added <code>Environment.ProcessId<\/code> to make that easier and faster. 
Similarly, code that wants access to the current process&#8217; executable&#8217;s path would typically use code along the lines of <code>Process.GetCurrentProcess().MainModule.FileName<\/code>; now in .NET 6 with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42768\">dotnet\/runtime#42768<\/a>, that code can just use <code>Environment.ProcessPath<\/code>:<\/p>\n<pre><code class=\"language-C#\">[Benchmark(Baseline = true)]\r\npublic string GetPathViaProcess() =&gt; Process.GetCurrentProcess().MainModule.FileName;\r\n\r\n[Benchmark]\r\npublic string GetPathViaEnvironment() =&gt; Environment.ProcessPath;<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetPathViaProcess<\/td>\n<td style=\"text-align: right;\">85,570.951 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">1,072 B<\/td>\n<\/tr>\n<tr>\n<td>GetPathViaEnvironment<\/td>\n<td style=\"text-align: right;\">1.174 ns<\/td>\n<td style=\"text-align: right;\">0.000<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The .NET 6 SDK also includes new analyzers, introduced in <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4909\">dotnet\/roslyn-analyzers#4909<\/a>, to help find places these new APIs might be valuable. 
There are other new analyzers as well:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4764\">dotnet\/roslyn-analyzers#4764<\/a> from <a href=\"https:\/\/github.com\/NewellClark\">@NewellClark<\/a> to help find places <code>String.Concat<\/code> can be used with spans.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4806\">dotnet\/roslyn-analyzers#4806<\/a> from <a href=\"https:\/\/github.com\/NewellClark\">@NewellClark<\/a> to help find places <code>AsSpan<\/code> can be used instead of <code>Substring<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/5116\">dotnet\/roslyn-analyzers#5116<\/a> from <a href=\"https:\/\/github.com\/NewellClark\">@NewellClark<\/a> to help find places <code>String.Equals<\/code> can be used instead of <code>String.Compare<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4908\">dotnet\/roslyn-analyzers#4908<\/a> from <a href=\"https:\/\/github.com\/MeikTranel\">@MeikTranel<\/a> to help find places <code>String.Contains<\/code> can be used with a <code>char<\/code> rather than a <code>string<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4687\">dotnet\/roslyn-analyzers#4687<\/a> from <a href=\"https:\/\/github.com\/NewellClark\">@NewellClark<\/a> to help find places <code>Dictionary&lt;,&gt;.Keys.Contains<\/code> is used but <code>Dictionary&lt;,&gt;.ContainsKey<\/code> would suffice.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4726\">dotnet\/roslyn-analyzers#4726<\/a> from <a href=\"https:\/\/github.com\/MeikTranel\">@MeikTranel<\/a> to help find <code>Stream<\/code>-derived types that would benefit from the <code>Memory<\/code>-based <code>ReadAsync<\/code>\/<code>WriteAsync<\/code> overloads being overridden.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/4841\">dotnet\/roslyn-analyzers#4841<\/a> from <a 
href=\"https:\/\/github.com\/ryzngard\">@ryzngard<\/a> to help find places <code>Task.WhenAll<\/code> and <code>Task.WaitAll<\/code> are used unnecessarily.<\/li>\n<\/ul>\n<p><code>Enum<\/code> has also seen both improvements to the performance of its existing methods (so that existing usage just gets faster) and new methods added to it (such that minor tweaks to how it&#8217;s being consumed in an app can yield further fruit). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44355\">dotnet\/runtime#44355<\/a> is a small PR with a sizeable impact, improving the performance of the generic <code>Enum.IsDefined<\/code>, <code>Enum.GetName<\/code>, and <code>Enum.GetNames<\/code>. There were several issues to be addressed here. First, originally there weren&#8217;t any generic APIs on <code>Enum<\/code> (since it was introduced before generics existed), and thus all input values for methods like <code>IsDefined<\/code> or <code>GetName<\/code> were typed as <code>object<\/code>. That then meant that internal helpers for doing things like getting the numerical value of an enum were also typed to accept <code>object<\/code>. When the generic overloads came along in .NET 5, they utilized the same internal helpers, and ended up boxing the strongly-typed input as an implementation detail. This PR fixes that by adding a strongly-typed internal helper, and it tweaks what existing methods these generic methods delegate to so as to use ones that can operate faster given the strongly-typed nature of the generic methods. 
The net result is some nice wins.<\/p>\n<pre><code class=\"language-C#\">private DayOfWeek _value = DayOfWeek.Friday;\r\n\r\n[Benchmark]\r\npublic bool IsDefined() =&gt; Enum.IsDefined(_value);\r\n\r\n[Benchmark]\r\npublic string GetName() =&gt; Enum.GetName(_value);\r\n\r\n[Benchmark]\r\npublic string[] GetNames() =&gt; Enum.GetNames&lt;DayOfWeek&gt;();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">31.46 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">19.30 ns<\/td>\n<td style=\"text-align: right;\">0.61<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>GetName<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">50.23 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<tr>\n<td>GetName<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">19.77 ns<\/td>\n<td style=\"text-align: right;\">0.39<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>GetNames<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">36.78 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">80 B<\/td>\n<\/tr>\n<tr>\n<td>GetNames<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">21.04 ns<\/td>\n<td 
style=\"text-align: right;\">0.57<\/td>\n<td style=\"text-align: right;\">80 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And via <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43255\">dotnet\/runtime#43255<\/a> from <a href=\"https:\/\/github.com\/hrrrrustic\">@hrrrrustic<\/a>, .NET 6 also sees additional generic <code>Parse<\/code> and <code>TryParse<\/code> overloads added that can parse <code>ReadOnlySpan&lt;char&gt;<\/code> in addition to the existing support for <code>string<\/code>. While not directly faster than their <code>string<\/code>-based counterparts (in fact, the <code>string<\/code>-based implementations eventually call into the same <code>ReadOnlySpan&lt;char&gt;<\/code>-based logic), they enable code parsing out enums from larger strings to do so with zero additional allocations and copies.<\/p>\n<p>Another very common operation in many apps is <code>DateTime.UtcNow<\/code> and <code>DateTimeOffset.UtcNow<\/code>, often used as part of tracing or logging code that&#8217;s designed to add as little overhead as possible. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45479\">dotnet\/runtime#45479<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45281\">dotnet\/runtime#45281<\/a> streamlined <code>DateTime.UtcNow<\/code> and <code>DateTimeOffset.UtcNow<\/code>, respectively, by avoiding some duplicative validation, ensuring fast paths are appropriately inlined (and slow paths aren&#8217;t), and other such tweaks. Those changes impacted all operating systems. But the biggest impact came from negating the regressions incurred when leap seconds support was added in .NET Core 3.0 (<a href=\"https:\/\/github.com\/dotnet\/coreclr\/pull\/21420\">dotnet\/coreclr#21420<\/a>). &#8220;Leap seconds&#8221; are rare, one-second adjustments made to UTC that stem from the fact that the Earth&#8217;s rotation speed can and does actually vary over time. 
When this support was added to .NET Core 3.0 (and to .NET Framework 4.8 at the same time), it (knowingly) regressed the performance of <code>UtcNow<\/code> by around 2.5x when the Windows feature is enabled. Thankfully, in .NET 6, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50263\">dotnet\/runtime#50263<\/a> provides a scheme for still maintaining the leap seconds support while avoiding the impactful overhead, getting back to the same throughput as without the feature.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic DateTime UtcNow() =&gt; DateTime.UtcNow;<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>UtcNow<\/td>\n<td>.NET Core 2.1<\/td>\n<td style=\"text-align: right;\">20.96 ns<\/td>\n<td style=\"text-align: right;\">0.40<\/td>\n<\/tr>\n<tr>\n<td>UtcNow<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">52.25 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>UtcNow<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">63.35 ns<\/td>\n<td style=\"text-align: right;\">1.21<\/td>\n<\/tr>\n<tr>\n<td>UtcNow<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">58.22 ns<\/td>\n<td style=\"text-align: right;\">1.11<\/td>\n<\/tr>\n<tr>\n<td>UtcNow<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">19.95 ns<\/td>\n<td style=\"text-align: right;\">0.38<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other small but valuable changes have gone into various primitives. 
For example, the newly public <code>ISpanFormattable<\/code> interface was previously internal and implemented on a handful of primitive types, but it&#8217;s now also implemented by <code>Char<\/code> and <code>Rune<\/code> as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50272\">dotnet\/runtime#50272<\/a>, and by <code>IntPtr<\/code> and <code>UIntPtr<\/code> as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44496\">dotnet\/runtime#44496<\/a>. <code>ISpanFormattable<\/code> is already recognized by various string formatting implementations, including that used by <code>string.Format<\/code>; you can see the impact of these interface implementations with a little benchmark, which gets better on .NET 6 as each instance&#8217;s <code>TryFormat<\/code> is used to format directly into the target buffer rather than first having to <code>ToString<\/code>.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic string Format() =&gt; string.Format(\"{0} {1} {2} {3}\", 'a', (Rune)'b', (nint)'c', (nuint)'d');<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Format<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">212.3 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">312 B<\/td>\n<\/tr>\n<tr>\n<td>Format<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">179.7 ns<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<td style=\"text-align: right;\">312 B<\/td>\n<\/tr>\n<tr>\n<td>Format<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">137.1 ns<\/td>\n<td style=\"text-align: right;\">0.65<\/td>\n<td style=\"text-align: right;\">200 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Arrays, Strings, Spans<\/h3>\n<p>For many apps and services, creating and manipulating arrays, 
strings, and spans represent a significant portion of their processing, and a lot of effort goes into finding ways to continually drive down the costs of these operations. .NET 6 is no exception.<\/p>\n<p>Let&#8217;s start with <code>Array.Clear<\/code>. The current <code>Array.Clear<\/code> signature accepts the <code>Array<\/code> to clear, the starting position, and the number of elements to clear. However, if you look at usage, the vast majority of call sites look like <code>Array.Clear(array, 0, array.Length)<\/code>&#8230; in other words, clearing the whole array. For a fundamental operation that&#8217;s used on hot paths, the extra validation that&#8217;s required in order to ensure the offset and count are in-bounds adds up. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51548\">dotnet\/runtime#51548<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53388\">dotnet\/runtime#53388<\/a> add a new <code>Array.Clear(Array)<\/code> method that avoids these overheads and changes many call sites across dotnet\/runtime to use the new overload.<\/p>\n<pre><code class=\"language-C#\">private int[] _array = new int[10];\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void Old() =&gt; Array.Clear(_array, 0, _array.Length);\r\n\r\n[Benchmark]\r\npublic void New() =&gt; Array.Clear(_array);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">5.563 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">3.775 ns<\/td>\n<td style=\"text-align: right;\">0.68<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In a similar vein is <code>Span&lt;T&gt;.Fill<\/code>, which doesn&#8217;t just zero but sets every element to a specific value. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51365\">dotnet\/runtime#51365<\/a> provides a significant improvement here: while for <code>byte[]<\/code> it&#8217;s already been able to directly invoke the <code>initblk<\/code> (<code>memset<\/code>) implementation, which is vectorized, for other <code>T[]<\/code> arrays where <code>T<\/code> is a primitive type (e.g. <code>char<\/code>), it can now also use a vectorized implementation, leading to quite nice speedups. Then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52590\">dotnet\/runtime#52590<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a> reuses <code>Span&lt;T&gt;.Fill<\/code> as the underlying implementation for <code>Array.Fill&lt;T&gt;<\/code> as well.<\/p>\n<pre><code class=\"language-C#\">private char[] _array = new char[128];\r\nprivate char _c = 'c';\r\n\r\n[Benchmark]\r\npublic void SpanFill() =&gt; _array.AsSpan().Fill(_c);\r\n\r\n[Benchmark]\r\npublic void ArrayFill() =&gt; Array.Fill(_array, _c);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SpanFill<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">32.103 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SpanFill<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">3.675 ns<\/td>\n<td style=\"text-align: right;\">0.11<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ArrayFill<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">55.994 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ArrayFill<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">3.810 ns<\/td>\n<td style=\"text-align: right;\">0.07<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Interestingly, 
<code>Array.Fill&lt;T&gt;<\/code> can&#8217;t simply delegate to <code>Span&lt;T&gt;.Fill<\/code>, for a reason that&#8217;s relevant to others looking to rebase array-based implementations on top of (mutable) spans. Arrays of reference types in .NET are covariant, meaning given a reference type <code>B<\/code> that derives from <code>A<\/code>, you can write code like:<\/p>\n<pre><code class=\"language-C#\">var arrB = new B[4];\r\nA[] arrA = arrB;<\/code><\/pre>\n<p>Now you&#8217;ve got an <code>A[]<\/code> where you can happily read out instances as <code>A<\/code>s but that can only store <code>B<\/code> instances, e.g. this is fine:<\/p>\n<pre><code class=\"language-C#\">arrA[0] = new B();<\/code><\/pre>\n<p>but this will throw an exception:<\/p>\n<pre><code class=\"language-C#\">arrA[0] = new A();<\/code><\/pre>\n<p>along the lines of <code>System.ArrayTypeMismatchException: Attempted to access an element as a type incompatible with the array.<\/code> This also incurs measurable overhead every time an element is stored into an array of (most) reference types. When spans were introduced, it was recognized that if you create a writeable span, you&#8217;re very likely going to write to it, and thus if the cost of a check needs to be paid somewhere, it&#8217;s better to pay that cost once when the span is created rather than on every write into the span. As such, <code>Span&lt;T&gt;<\/code> is invariant and its constructor includes this code:<\/p>\n<pre><code class=\"language-C#\">if (!typeof(T).IsValueType &amp;&amp; array.GetType() != typeof(T[]))\r\n    ThrowHelper.ThrowArrayTypeMismatchException();<\/code><\/pre>\n<p>The check, which is removed entirely by the JIT for value types and which is optimized heavily by the JIT for reference types, validates that the <code>T<\/code> specified matches the concrete type of the array. 
As an example, if you write this code:<\/p>\n<pre><code class=\"language-C#\">new Span&lt;A&gt;(new B[4]);<\/code><\/pre>\n<p>that will throw an exception. Why is this relevant to <code>Array.Fill&lt;T&gt;<\/code>? It can accept arbitrary <code>T[]<\/code> arrays, and there&#8217;s no guarantee that the <code>T<\/code> exactly matches the array type, e.g.<\/p>\n<pre><code class=\"language-C#\">var arr = new B[4];\r\nArray.Fill&lt;A&gt;(arr, null);<\/code><\/pre>\n<p>If <code>Array.Fill&lt;T&gt;<\/code> were implemented purely as <code>new Span&lt;T&gt;(array).Fill(value)<\/code>, the above code would throw an exception from <code>Span&lt;T&gt;<\/code>&#8217;s constructor. Instead, <code>Array.Fill&lt;T&gt;<\/code> itself performs the same check that <code>Span&lt;T&gt;<\/code>&#8217;s constructor does; if the check passes, it creates the <code>Span&lt;T&gt;<\/code> and calls <code>Fill<\/code>, but if the check doesn&#8217;t pass, it falls back to a typical loop, writing the value into each element of the array.<\/p>\n<p>As long as we&#8217;re on the topic of vectorization, other functionality in this release has also been vectorized. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44111\">dotnet\/runtime#44111<\/a> takes advantage of SSSE3 hardware intrinsics (e.g. 
<code>Ssse3.Shuffle<\/code>) to optimize the implementation of the internal <code>HexConverter.EncodeToUtf16<\/code> which is used in a few places, including the public <code>Convert.ToHexString<\/code>:<\/p>\n<pre><code class=\"language-C#\">private byte[] _data;\r\n\r\n[GlobalSetup]\r\npublic void Setup()\r\n{\r\n    _data = new byte[64];\r\n    RandomNumberGenerator.Fill(_data);\r\n}\r\n\r\n[Benchmark]\r\npublic string ToHexString() =&gt; Convert.ToHexString(_data);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ToHexString<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">130.89 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ToHexString<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">44.78 ns<\/td>\n<td style=\"text-align: right;\">0.34<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44088\">dotnet\/runtime#44088<\/a> also takes advantage of vectorization, though indirectly, by using the already vectorized <code>IndexOf<\/code> methods to improve the performance of <code>String.Replace(String, String)<\/code>. This PR is another good example of &#8220;optimizations&#8221; frequently being tradeoffs, making some scenarios faster at the expense of making others slower, and needing to make a decision based on the expected frequency of these scenarios occurring. In this case, the PR improves three specific cases significantly:<\/p>\n<ul>\n<li>If both inputs are just a single character (e.g. 
<code>str.Replace(\"\\n\", \" \")<\/code>), then it can delegate to the already-optimized <code>String.Replace(char, char)<\/code> overload.<\/li>\n<li>If the <code>oldValue<\/code> is a single character, the implementation can use <code>IndexOf(char)<\/code> to find it, rather than using a hand-rolled loop.<\/li>\n<li>If the <code>oldValue<\/code> is multiple characters, the implementation can use the equivalent of <code>IndexOf(string, StringComparison.Ordinal)<\/code> to find it.<\/li>\n<\/ul>\n<p>The second and third bullet points significantly speed up operation if the <code>oldValue<\/code> being searched for isn&#8217;t super frequent in the input, enabling the vectorization to pay for itself and more. If, however, it&#8217;s very frequent (like every or every other character in the input), this change can actually regress performance. Our bet, based on reviewing use cases in a variety of code bases, is this overall will be a very positive win.<\/p>\n<pre><code class=\"language-C#\">private string _str;\r\n\r\n[GlobalSetup]\r\npublic async Task Setup()\r\n{\r\n    using var hc = new HttpClient();\r\n    _str = await hc.GetStringAsync(\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\"); \/\/ The Entire Project Gutenberg Works of Mark Twain\r\n}\r\n\r\n[Benchmark]\r\npublic string Yell() =&gt; _str.Replace(\".\", \"!\");\r\n\r\n[Benchmark]\r\npublic string ConcatLines() =&gt; _str.Replace(\"\\n\", \"\");\r\n\r\n[Benchmark]\r\npublic string NormalizeEndings() =&gt; _str.Replace(\"\\r\\n\", \"\\n\");<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Yell<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">32.85 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Yell<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">16.99 ms<\/td>\n<td style=\"text-align: 
right;\">0.52<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ConcatLines<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">34.36 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>ConcatLines<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">22.93 ms<\/td>\n<td style=\"text-align: right;\">0.67<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>NormalizeEndings<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">33.09 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>NormalizeEndings<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">23.61 ms<\/td>\n<td style=\"text-align: right;\">0.71<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Also for vectorization, previous .NET releases saw vectorization added to various algorithms in <code>System.Text.Encodings.Web<\/code>, but specifically employing x86 hardware intrinsics, such that these optimizations didn&#8217;t end up applying on ARM. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49847\">dotnet\/runtime#49847<\/a> now augments that with support from the <code>AdvSimd<\/code> hardware intrinsics, enabling similar speedups on ARM64 devices. And as long as we&#8217;re looking at <code>System.Text.Encodings.Web<\/code>, it&#8217;s worth calling out <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49373\">dotnet\/runtime#49373<\/a>, which completely overhauls the implementation of the library, with a primary goal of significantly reducing the amount of <code>unsafe<\/code> code involved; in the process, however, as we&#8217;ve seen now time and again, using spans and other modern practices to replace <code>unsafe<\/code> pointer-based code often not only makes the code simpler and safer but also faster. 
Part of the change involved vectorizing the &#8220;skip over all ASCII chars which don&#8217;t require encoding&#8221; logic that all of the encoders utilize, helping to yield some significant speedups in common scenarios.<\/p>\n<pre><code class=\"language-C#\">private string _text;\r\n\r\n[Params(\"HTML\", \"URL\", \"JSON\")]\r\npublic string Encoder { get; set; }\r\n\r\nprivate TextEncoder _encoder;\r\n\r\n[GlobalSetup]\r\npublic async Task Setup()\r\n{\r\n    using (var hc = new HttpClient())\r\n        _text = await hc.GetStringAsync(\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\");\r\n\r\n    _encoder = Encoder switch\r\n    {\r\n        \"HTML\" =&gt; HtmlEncoder.Default,\r\n        \"URL\" =&gt; UrlEncoder.Default,\r\n        _ =&gt; JavaScriptEncoder.Default,\r\n    };\r\n}\r\n\r\n[Benchmark]\r\npublic string Encode() =&gt; _encoder.Encode(_text);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Encoder<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Encode<\/td>\n<td>.NET Core 3.1<\/td>\n<td>HTML<\/td>\n<td style=\"text-align: right;\">106.44 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">128 MB<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 5.0<\/td>\n<td>HTML<\/td>\n<td style=\"text-align: right;\">101.58 ms<\/td>\n<td style=\"text-align: right;\">0.96<\/td>\n<td style=\"text-align: right;\">128 MB<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 6.0<\/td>\n<td>HTML<\/td>\n<td style=\"text-align: right;\">43.97 ms<\/td>\n<td style=\"text-align: right;\">0.41<\/td>\n<td style=\"text-align: right;\">36 MB<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET Core 
3.1<\/td>\n<td>JSON<\/td>\n<td style=\"text-align: right;\">113.70 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">124 MB<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 5.0<\/td>\n<td>JSON<\/td>\n<td style=\"text-align: right;\">96.36 ms<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<td style=\"text-align: right;\">124 MB<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 6.0<\/td>\n<td>JSON<\/td>\n<td style=\"text-align: right;\">39.73 ms<\/td>\n<td style=\"text-align: right;\">0.35<\/td>\n<td style=\"text-align: right;\">33 MB<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET Core 3.1<\/td>\n<td>URL<\/td>\n<td style=\"text-align: right;\">165.60 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">136 MB<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 5.0<\/td>\n<td>URL<\/td>\n<td style=\"text-align: right;\">141.26 ms<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<td style=\"text-align: right;\">136 MB<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 6.0<\/td>\n<td>URL<\/td>\n<td style=\"text-align: right;\">70.63 ms<\/td>\n<td style=\"text-align: right;\">0.43<\/td>\n<td style=\"text-align: right;\">44 MB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Another <code>string<\/code> API that&#8217;s been enhanced for .NET 6 is <code>string.Join<\/code>. One of the <code>Join<\/code> overloads takes the strings to be joined as an <code>IEnumerable&lt;string?&gt;<\/code>, which it iterates, appending to a builder as it goes. But there&#8217;s already a separate array-based code path that does two passes over the strings, one to count the size required and then another to fill in the resulting string of the required length. 
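<\/p>\n<p>As a rough illustration of that two-pass shape, here is a minimal sketch. <code>TwoPassJoin<\/code> is a hypothetical helper, not the runtime&#8217;s code; unlike the real implementation, it goes through an intermediate <code>char[]<\/code> and doesn&#8217;t guard against overflow of the computed length:<\/p>\n<pre><code class=\"language-C#\">using System;\r\n\r\n\/\/ Hypothetical sketch, not the actual string.Join implementation.\r\nstatic string TwoPassJoin(string separator, string?[] values)\r\n{\r\n    if (values.Length == 0) return string.Empty;\r\n\r\n    \/\/ Pass 1: compute the exact length of the result.\r\n    int length = separator.Length * (values.Length - 1);\r\n    foreach (string? value in values) length += value?.Length ?? 0;\r\n\r\n    \/\/ Pass 2: copy each separator and value into place.\r\n    char[] buffer = new char[length];\r\n    int pos = 0;\r\n    for (int i = 0; i &lt; values.Length; i++)\r\n    {\r\n        if (i != 0)\r\n        {\r\n            separator.CopyTo(0, buffer, pos, separator.Length);\r\n            pos += separator.Length;\r\n        }\r\n\r\n        string? value = values[i];\r\n        if (value is not null)\r\n        {\r\n            value.CopyTo(0, buffer, pos, value.Length);\r\n            pos += value.Length;\r\n        }\r\n    }\r\n\r\n    return new string(buffer);\r\n}<\/code><\/pre>\n<p>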
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44032\">dotnet\/runtime#44032<\/a> converts that functionality to be based on a <code>ReadOnlySpan&lt;string?&gt;<\/code> rather than <code>string?[]<\/code>, and then special-cases enumerables that are actually <code>List&lt;string?&gt;<\/code> to go through the span-based path as well, utilizing the <code>CollectionsMarshal.AsSpan<\/code> method to get a span for the <code>List&lt;string?&gt;<\/code>&#8217;s backing array. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56857\">dotnet\/runtime#56857<\/a> then does the same for the <code>IEnumerable&lt;T&gt;<\/code>-based overload.<\/p>\n<pre><code class=\"language-C#\">private List&lt;string&gt; _strings = new List&lt;string&gt;() { \"Hi\", \"How\", \"are\", \"you\", \"today\" };\r\n\r\n[Benchmark]\r\npublic string Join() =&gt; string.Join(\", \", _strings);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Join<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">124.81 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">120 B<\/td>\n<\/tr>\n<tr>\n<td>Join<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">123.54 ns<\/td>\n<td style=\"text-align: right;\">0.99<\/td>\n<td style=\"text-align: right;\">112 B<\/td>\n<\/tr>\n<tr>\n<td>Join<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">51.08 ns<\/td>\n<td style=\"text-align: right;\">0.41<\/td>\n<td style=\"text-align: right;\">72 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One of the biggest string-related improvements, though, comes from the new interpolated string handler support in C# 10 and .NET 6, with new language support added in <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/54692\">dotnet\/roslyn#54692<\/a> and 
library support added in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51086\">dotnet\/runtime#51086<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51653\">dotnet\/runtime#51653<\/a>. If I write:<\/p>\n<pre><code class=\"language-C#\">static string Format(int major, int minor, int build, int revision) =&gt;\r\n    $\"{major}.{minor}.{build}.{revision}\";<\/code><\/pre>\n<p>C# 9 would compile that as:<\/p>\n<pre><code class=\"language-C#\">static string Format(int major, int minor, int build, int revision)\r\n{\r\n    var array = new object[4];\r\n    array[0] = major;\r\n    array[1] = minor;\r\n    array[2] = build;\r\n    array[3] = revision;\r\n    return string.Format(\"{0}.{1}.{2}.{3}\", array);\r\n}<\/code><\/pre>\n<p>which incurs a variety of overheads, such as having to parse the composite format string on every call at run-time, box each of the <code>int<\/code>s, and allocate an array to store them. With C# 10 and .NET 6, that&#8217;s instead compiled as:<\/p>\n<pre><code class=\"language-C#\">static string Format(int major, int minor, int build, int revision)\r\n{\r\n    var h = new DefaultInterpolatedStringHandler(3, 4);\r\n    h.AppendFormatted(major);\r\n    h.AppendLiteral(\".\");\r\n    h.AppendFormatted(minor);\r\n    h.AppendLiteral(\".\");\r\n    h.AppendFormatted(build);\r\n    h.AppendLiteral(\".\");\r\n    h.AppendFormatted(revision);\r\n    return h.ToStringAndClear();\r\n}<\/code><\/pre>\n<p>with all of the parsing handled at compile-time, no additional array allocation, and no additional boxing allocations. 
You can see the impact of the changes with the aforementioned examples turned into a benchmark:<\/p>\n<pre><code class=\"language-C#\">private int Major = 6, Minor = 0, Build = 100, Revision = 21380;\r\n\r\n[Benchmark(Baseline = true)]\r\npublic string Old()\r\n{\r\n    object[] array = new object[4];\r\n    array[0] = Major;\r\n    array[1] = Minor;\r\n    array[2] = Build;\r\n    array[3] = Revision;\r\n    return string.Format(\"{0}.{1}.{2}.{3}\", array);\r\n}\r\n\r\n[Benchmark]\r\npublic string New()\r\n{\r\n    var h = new DefaultInterpolatedStringHandler(3, 4);\r\n    h.AppendFormatted(Major);\r\n    h.AppendLiteral(\".\");\r\n    h.AppendFormatted(Minor);\r\n    h.AppendLiteral(\".\");\r\n    h.AppendFormatted(Build);\r\n    h.AppendLiteral(\".\");\r\n    h.AppendFormatted(Revision);\r\n    return h.ToStringAndClear();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">127.31 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">200 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">69.62 ns<\/td>\n<td style=\"text-align: right;\">0.55<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For an in-depth look, including discussion of various custom interpolated string handlers built into .NET 6 for improved support with <code>StringBuilder<\/code>, <code>Debug.Assert<\/code>, and <code>MemoryExtensions<\/code>, see <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/string-interpolation-in-c-10-and-net-6\">String Interpolation in C# 10 and .NET 6<\/a>.<\/p>\n<h3>Buffering<\/h3>\n<p>Performance improvements can manifest in many ways: increasing throughput, reducing working set, reducing latencies, increasing startup speeds, lowering size on 
disk, and so on. Anyone paying attention to the performance of .NET will also notice a focus on reducing allocation. This is typically a means to an end rather than a goal in and of itself, as managed allocations themselves are easily trackable \/ measurable and incur varying costs, in particular the secondary cost of causing GCs to happen more frequently and\/or take longer periods of time. Sometimes reducing allocations falls into the category of just stopping doing unnecessary work, or doing something instead that&#8217;s way cheaper; for example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42776\">dotnet\/runtime#42776<\/a> changed an eight-byte array allocation to an eight-byte stack-allocation, the latter of which is very close to free (in particular as this code is compiled with <code>[SkipLocalsInit]<\/code> and thus doesn&#8217;t need to pay to zero out that stack-allocated space). Beyond that, though, there are almost always real tradeoffs. One common technique is pooling, which can look great on microbenchmarks because it drives down that allocation number, but it doesn&#8217;t always translate into a measurable improvement in one of the other metrics that&#8217;s actually an end goal. In fact, it can make things worse, such as if the overhead of renting and returning from the pool is higher than expected (especially if it incurs synchronization costs), if it leads to cache problems as something returned on one NUMA node ends up being consumed from another, if it leads to GCs taking longer by increasing the number of references from Gen1 or Gen2 objects to Gen0 objects, and so on. However, one place where pooling has proven to be quite effective is with arrays, in particular larger arrays of value types (e.g. <code>byte[]<\/code>, <code>char[]<\/code>), which has led to <code>ArrayPool&lt;T&gt;.Shared<\/code> being used <em>everywhere<\/em>. 
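<\/p>\n<p>For anyone who hasn&#8217;t used it, the typical pattern looks something like the following minimal sketch. <code>CopyAll<\/code> and the 81,920-element request are illustrative choices for this example rather than anything mandated by the pool:<\/p>\n<pre><code class=\"language-C#\">using System.Buffers;\r\nusing System.IO;\r\n\r\n\/\/ Hypothetical helper that copies one stream to another through a pooled buffer.\r\nstatic void CopyAll(Stream source, Stream destination)\r\n{\r\n    \/\/ Rent may return an array larger than requested.\r\n    byte[] buffer = ArrayPool&lt;byte&gt;.Shared.Rent(81920);\r\n    try\r\n    {\r\n        int bytesRead;\r\n        while ((bytesRead = source.Read(buffer, 0, buffer.Length)) != 0)\r\n        {\r\n            destination.Write(buffer, 0, bytesRead);\r\n        }\r\n    }\r\n    finally\r\n    {\r\n        \/\/ Hand the array back so that later callers can reuse it.\r\n        ArrayPool&lt;byte&gt;.Shared.Return(buffer);\r\n    }\r\n}<\/code><\/pre>\n<p>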
This places a high premium on <code>ArrayPool&lt;T&gt;.Shared<\/code> being as efficient as possible, and this release sees several impactful improvements in this area.<\/p>\n<p>Probably the most visible change in this area in .NET 6 is the for-all-intents-and-purposes removal of the upper limit on the size of arrays <code>ArrayPool&lt;T&gt;.Shared<\/code> will cache. Previously, <code>ArrayPool&lt;T&gt;.Shared<\/code> would only cache up to approximately one million elements (<code>1024 * 1024<\/code>), a fact evident from this test run on .NET 5:<\/p>\n<pre><code class=\"language-C#\">[Benchmark(Baseline = true)]\r\npublic void RentReturn_1048576() =&gt; ArrayPool&lt;byte&gt;.Shared.Return(ArrayPool&lt;byte&gt;.Shared.Rent(1024 * 1024));\r\n\r\n[Benchmark]\r\npublic void RentReturn_1048577() =&gt; ArrayPool&lt;byte&gt;.Shared.Return(ArrayPool&lt;byte&gt;.Shared.Rent(1024 * 1024 + 1));<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RentReturn_1048576<\/td>\n<td style=\"text-align: right;\">21.90 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_1048577<\/td>\n<td style=\"text-align: right;\">18,210.30 ns<\/td>\n<td style=\"text-align: right;\">883.37<\/td>\n<td style=\"text-align: right;\">1,048,598 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Ouch. That is a large cliff to fall off of, and either the developer is aware of the cliff and is forced to adapt to it in their code, or they&#8217;re not aware of it and end up having unexpected performance problems. 
This somewhat arbitrary limit was originally put in place before the pool had &#8220;trimming,&#8221; a mechanism that enabled the pool to drop cached arrays in response to Gen2 GCs, with varying levels of aggressiveness based on perceived memory pressure. But then that trimming was added, and the limit was never revisited&#8230; until now. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55621\">dotnet\/runtime#55621<\/a> raises the limit as high as the current implementation&#8217;s scheme enables, which means it can now cache arrays up to approximately one billion elements (<code>1024 * 1024 * 1024<\/code>); that should hopefully be larger than almost anyone wants to pool.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RentReturn_1048576<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">21.01 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_1048576<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">16.36 ns<\/td>\n<td style=\"text-align: right;\">0.78<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>RentReturn_1048577<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">12,132.90 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">1,048,593 B<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_1048577<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">16.38 ns<\/td>\n<td style=\"text-align: right;\">0.002<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, pooling such arrays means it&#8217;s 
important that trimming works as expected, and while there&#8217;s an unending amount of tuning we could do to the trimming heuristics, the main gap that stood out had to do with how arrays in the pool are stored. With today&#8217;s implementation, the pool is divided into buckets with sizes equal to powers of two, so for example there&#8217;s a bucket for arrays with a length up to 16, then up to 32, then up to 64, and so on: requesting an array of size 100 will garner you an array of size 128. The pool is also split into two layers. The first layer is stored in thread-local storage, where each thread can store at most one array of each bucket size. The second layer is itself split into <code>Environment.ProcessorCount<\/code> stacks, each of which is logically associated with one core, and each of which is individually synchronized. Code renting an array first consults the thread-local storage slot, and if it&#8217;s unable to get an array from there, proceeds to examine each of the stacks, starting with the one associated with the core it&#8217;s currently running on (which can of course change at any moment, so the affinity is quite soft and accesses require synchronization). Upon returning an array, a similar path is followed, with the code first trying to return to the thread-local slot, and then proceeding to try to find space in one of the stacks. The trimming implementation in .NET 5 and earlier is able to remove arrays from the stacks, and is given the opportunity on every Gen2 GC, but it will only ever drop arrays from the thread-local storage if there&#8217;s very high memory pressure. This can lead to some rarely-used arrays sticking around for a very long time, negatively impacting working set. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56316\">dotnet\/runtime#56316<\/a> addresses this by tracking approximately how long arrays have been sitting in thread-local storage, and enabling them to be culled regardless of high memory pressure, instead using memory pressure to indicate what&#8217;s an acceptable duration for an array to remain.<\/p>\n<p>On top of these changes around what can be cached and for how long, more typical performance optimizations have also been done. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55710\">dotnet\/runtime#55710<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55959\">dotnet\/runtime#55959<\/a> reduced typical overheads for renting and returning arrays. This entailed paying attention to where and why bounds checks were happening and avoiding them where possible, ordering checks performed to prioritize common cases (e.g. a request for a pooled size) over rare cases (e.g. a request for a size of 0), and reducing code size to make better use of instruction caches.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic void RentReturn_Single() =&gt; ArrayPool&lt;char&gt;.Shared.Return(ArrayPool&lt;char&gt;.Shared.Rent(4096));\r\n\r\nprivate char[][] _arrays = new char[4][];\r\n\r\n[Benchmark]\r\npublic void RentReturn_Multi()\r\n{\r\n    char[][] arrays = _arrays;\r\n\r\n    for (int i = 0; i &lt; arrays.Length; i++)\r\n        arrays[i] = ArrayPool&lt;char&gt;.Shared.Rent(4096);\r\n\r\n    for (int i = 0; i &lt; arrays.Length; i++)\r\n        ArrayPool&lt;char&gt;.Shared.Return(arrays[i]);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RentReturn_Single<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">23.60 ns<\/td>\n<td style=\"text-align: 
right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_Single<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">18.48 ns<\/td>\n<td style=\"text-align: right;\">0.78<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_Single<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">16.27 ns<\/td>\n<td style=\"text-align: right;\">0.69<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>RentReturn_Multi<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">248.57 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_Multi<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">204.13 ns<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>RentReturn_Multi<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">197.21 ns<\/td>\n<td style=\"text-align: right;\">0.79<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>IO<\/h3>\n<p>A good deal of effort in .NET 6 has gone into fixing the performance of one of the oldest types in .NET: <code>FileStream<\/code>. Every app and service reads and writes files. Unfortunately, <code>FileStream<\/code> has also been plagued over the years by numerous performance-related issues, most of which are part of its asynchronous I\/O implementation on Windows. For example, a call to <code>ReadAsync<\/code> might have issued an overlapped I\/O read operation, but typically that read would end up then blocking in a sync-over-async manner, in order to avoid potential race conditions in the implementation that could otherwise result. 
Or when flushing its buffer, even when flushing asynchronously, those flushes would end up doing synchronous writes. Such issues often ended up defeating any scalability benefits of using asynchronous I\/O while still incurring the overheads associated with asynchronous I\/O (async I\/O often has higher overheads in exchange for being more scalable). All of this was complicated further by the <code>FileStream<\/code> code being a tangled web difficult to unravel, in large part because it was trying to integrate a bunch of different capabilities into the same code paths: using overlapped I\/O or not, buffering or not, targeting disk files or pipes, etc., with different logic for each, all intertwined. Combined, this has meant that, with a few exceptions, the <code>FileStream<\/code> code has remained largely untouched, until now.<\/p>\n<p>.NET 6 sees <code>FileStream<\/code> entirely rewritten, and in the process, all of these issues resolved. The result is a much more maintainable implementation that&#8217;s also dramatically faster, in particular for asynchronous operations. There have been a plethora of PRs as part of this effort, but I&#8217;ll call out a few. First, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47128\">dotnet\/runtime#47128<\/a> laid the groundwork for the new implementation, refactoring <code>FileStream<\/code> to be a wrapper around a &#8220;strategy&#8221; (as in the Strategy design pattern), which then enables the actual implementation to be substituted and composed at runtime (similar to the approach discussed with <code>Random<\/code>), with the existing implementation moved into one strategy that can be used in .NET 6 if maximum compatibility is required (it&#8217;s off by default but can be enabled with an environment variable or <code>AppContext<\/code> switch). 
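<\/p>\n<p>As a minimal sketch of the <code>AppContext<\/code> route: this assumes the switch is named <code>System.IO.UseNet5CompatFileStream<\/code>, which is a recollection rather than something stated above, so verify the name against the documentation for your runtime version. It would need to be set at startup, before the first <code>FileStream<\/code> is created:<\/p>\n<pre><code class=\"language-C#\">using System;\r\n\r\n\/\/ Assumed switch name; opts back into the .NET 5 FileStream implementation.\r\nAppContext.SetSwitch(\"System.IO.UseNet5CompatFileStream\", true);<\/code><\/pre>\n<p>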
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48813\">dotnet\/runtime#48813<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49750\">dotnet\/runtime#49750<\/a> then introduced the beginnings of the new implementation, splitting it apart into several strategies on Windows, one for if the file was opened for synchronous I\/O, one for if it was opened for asynchronous I\/O, and one that enabled buffering to be layered on top of any strategy. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55191\">dotnet\/runtime#55191<\/a> then introduced a Unix-optimized strategy for the new scheme. All the while, additional PRs were flowing in to optimize various conditions. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49975\">dotnet\/runtime#49975<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56465\">dotnet\/runtime#56465<\/a> avoided an expensive syscall made on every operation on Windows to track the file&#8217;s length, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44097\">dotnet\/runtime#44097<\/a> removed an unnecessary seek when setting file length on Unix. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50802\">dotnet\/runtime#50802<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51363\">dotnet\/runtime#51363<\/a> changed the overlapped I\/O implementation on Windows to use a custom, reusable <code>IValueTaskSource<\/code>-based implementation rather than one based on <code>TaskCompletionSource<\/code>, which enabled making (non-buffered) async reads and writes amortized-allocation-free when using async I\/O. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55206\">dotnet\/runtime#55206<\/a> from <a href=\"https:\/\/github.com\/tmds\">@tmds<\/a> used knowledge from an existing syscall being made on Unix to then avoid a subsequent unnecessary <code>stat<\/code> system call. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56095\">dotnet\/runtime#56095<\/a> took advantage of the new <code>PoolingAsyncValueTaskMethodBuilder<\/code> previously discussed to reduce allocations involved in async operations on <code>FileStream<\/code> when buffering is being used (which is the default). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56387\">dotnet\/runtime#56387<\/a> avoided a <code>ReadFile<\/code> call on Windows if we already had enough information to prove nothing would be available to read. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56682\">dotnet\/runtime#56682<\/a> took the same optimizations that had been done for <code>Read\/WriteAsync<\/code> on Unix and applied them to Windows as well when the <code>FileStream<\/code> was opened for synchronous I\/O. In the end, all of this adds up to huge maintainability benefits for <code>FileStream<\/code>, huge performance improvements for <code>FileStream<\/code> (in particular for but not limited to asynchronous operations), and much better scalability for <code>FileStream<\/code>. 
Here are just a few microbenchmarks to highlight some of the impact:<\/p>\n<pre><code class=\"language-C#\">private FileStream _fileStream;\r\nprivate byte[] _buffer = new byte[1024];\r\n\r\n[Params(false, true)]\r\npublic bool IsAsync { get; set; }\r\n\r\n[Params(1, 4096)]\r\npublic int BufferSize { get; set; }\r\n\r\n[GlobalSetup]\r\npublic void Setup()\r\n{\r\n    byte[] data = new byte[10_000_000];\r\n    new Random(42).NextBytes(data);\r\n\r\n    string path = Path.GetTempFileName();\r\n    File.WriteAllBytes(path, data);\r\n\r\n    _fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, BufferSize, IsAsync);\r\n}\r\n\r\n[GlobalCleanup]\r\npublic void Cleanup()\r\n{\r\n    _fileStream.Dispose();\r\n    File.Delete(_fileStream.Name);\r\n}\r\n\r\n[Benchmark]\r\npublic void Read()\r\n{\r\n    _fileStream.Position = 0;\r\n    while (_fileStream.Read(_buffer\r\n#if !NETCOREAPP2_1_OR_GREATER\r\n            , 0, _buffer.Length\r\n#endif\r\n            ) != 0) ;\r\n}\r\n\r\n[Benchmark]\r\npublic async Task ReadAsync()\r\n{\r\n    _fileStream.Position = 0;\r\n    while (await _fileStream.ReadAsync(_buffer\r\n#if !NETCOREAPP2_1_OR_GREATER\r\n            , 0, _buffer.Length\r\n#endif\r\n            ) != 0) ;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>IsAsync<\/th>\n<th>BufferSize<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Read<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">30.717 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Core 3.1<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">30.745 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: 
right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 5.0<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">31.156 ms<\/td>\n<td style=\"text-align: right;\">1.01<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 6.0<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">30.772 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">50.806 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">2,125,865 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Core 3.1<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">44.505 ms<\/td>\n<td style=\"text-align: right;\">0.88<\/td>\n<td style=\"text-align: right;\">1,953,592 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 5.0<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">39.212 ms<\/td>\n<td style=\"text-align: right;\">0.77<\/td>\n<td style=\"text-align: right;\">1,094,096 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 6.0<\/td>\n<td>False<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">36.018 ms<\/td>\n<td style=\"text-align: right;\">0.71<\/td>\n<td style=\"text-align: right;\">247 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">9.593 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: 
right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Core 3.1<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">9.761 ms<\/td>\n<td style=\"text-align: right;\">1.02<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 5.0<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">9.446 ms<\/td>\n<td style=\"text-align: right;\">0.99<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 6.0<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">9.569 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">30.920 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">2,141,481 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Core 3.1<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">23.758 ms<\/td>\n<td style=\"text-align: right;\">0.81<\/td>\n<td style=\"text-align: right;\">1,953,592 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 5.0<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">25.101 ms<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<td style=\"text-align: right;\">1,094,096 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 6.0<\/td>\n<td>False<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">13.108 ms<\/td>\n<td style=\"text-align: right;\">0.42<\/td>\n<td style=\"text-align: right;\">382 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td 
style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">413.228 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">2,121,728 B<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Core 3.1<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">217.891 ms<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">3,050,056 B<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 5.0<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">219.388 ms<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">3,062,741 B<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 6.0<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">83.070 ms<\/td>\n<td style=\"text-align: right;\">0.20<\/td>\n<td style=\"text-align: right;\">2,109,867 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">355.670 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">3,833,856 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Core 3.1<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">262.625 ms<\/td>\n<td style=\"text-align: right;\">0.74<\/td>\n<td style=\"text-align: right;\">3,048,120 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 5.0<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">259.284 ms<\/td>\n<td style=\"text-align: right;\">0.73<\/td>\n<td style=\"text-align: right;\">3,047,496 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 6.0<\/td>\n<td>True<\/td>\n<td>1<\/td>\n<td style=\"text-align: right;\">119.573 ms<\/td>\n<td 
style=\"text-align: right;\">0.34<\/td>\n<td style=\"text-align: right;\">403 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">106.696 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">530,842 B<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET Core 3.1<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">56.785 ms<\/td>\n<td style=\"text-align: right;\">0.54<\/td>\n<td style=\"text-align: right;\">353,151 B<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 5.0<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">54.359 ms<\/td>\n<td style=\"text-align: right;\">0.51<\/td>\n<td style=\"text-align: right;\">353,966 B<\/td>\n<\/tr>\n<tr>\n<td>Read<\/td>\n<td>.NET 6.0<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">22.971 ms<\/td>\n<td style=\"text-align: right;\">0.22<\/td>\n<td style=\"text-align: right;\">527,930 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Framework 4.8<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">143.082 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">3,026,980 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET Core 3.1<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">55.370 ms<\/td>\n<td style=\"text-align: right;\">0.38<\/td>\n<td style=\"text-align: right;\">355,001 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 5.0<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">54.436 
ms<\/td>\n<td style=\"text-align: right;\">0.38<\/td>\n<td style=\"text-align: right;\">354,036 B<\/td>\n<\/tr>\n<tr>\n<td>ReadAsync<\/td>\n<td>.NET 6.0<\/td>\n<td>True<\/td>\n<td>4096<\/td>\n<td style=\"text-align: right;\">32.478 ms<\/td>\n<td style=\"text-align: right;\">0.23<\/td>\n<td style=\"text-align: right;\">420 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Some of the improvements in <code>FileStream<\/code> also involved moving the read\/write aspects of its implementation out into a separate public class: <code>System.IO.RandomAccess<\/code>. Implemented in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53669\">dotnet\/runtime#53669<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54266\">dotnet\/runtime#54266<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55490\">dotnet\/runtime#55490<\/a> (with additional optimizations in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55123\">dotnet\/runtime#55123<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a>), <code>RandomAccess<\/code> provides overloads that enable sync and async reading and writing, for both a single and multiple buffers at a time, and specifying the exact offset into the file at which the read or write should occur. All of these static methods accept a <code>SafeFileHandle<\/code>, which can now be obtained from the new <code>File.OpenHandle<\/code> method. This all means code is now able to access files without going through <code>FileStream<\/code> if the <code>Stream<\/code>-based interface isn&#8217;t desirable, and it means code is able to issue concurrent reads or writes for the same <code>SafeFileHandle<\/code>, if parallel processing of a file is desired. 
Subsequent PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55150\">dotnet\/runtime#55150<\/a> took advantage of these new APIs to avoid the extra allocations and complexity involved in using <code>FileStream<\/code> when all that was really needed was the handle and the ability to perform a single read or write. (<a href=\"https:\/\/github.com\/adamsitnik\">@adamsitnik<\/a> is working on a dedicated blog post focused on these <code>FileStream<\/code> improvements; look for that on the <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/\">.NET Blog<\/a> soon.)<\/p>\n<p>Of course, there&#8217;s more to working with files than just <code>FileStream<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55210\">dotnet\/runtime#55210<\/a> from <a href=\"https:\/\/github.com\/tmds\">@tmds<\/a> eliminated a <code>stat<\/code> syscall from <code>Directory\/File.Exists<\/code> when the target doesn&#8217;t exist, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47118\">dotnet\/runtime#47118<\/a> from <a href=\"https:\/\/github.com\/gukoff\">@gukoff<\/a> avoided a <code>rename<\/code> syscall when moving a file across volumes on Unix, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55644\">dotnet\/runtime#55644<\/a> simplified <code>File.WriteAllTextAsync<\/code> and made it faster with less allocation (this benchmark of course also benefits from the <code>FileStream<\/code> improvements):<\/p>\n<pre><code class=\"language-C#\">private static string s_contents = string.Concat(Enumerable.Range(0, 100_000).Select(i =&gt; (char)('a' + (i % 26))));\r\nprivate static string s_path = Path.GetRandomFileName();\r\n\r\n[Benchmark]\r\npublic Task WriteAllTextAsync() =&gt; File.WriteAllTextAsync(s_path, s_contents);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: 
right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WriteAllTextAsync<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">1.609 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">23 KB<\/td>\n<\/tr>\n<tr>\n<td>WriteAllTextAsync<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">1.590 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">23 KB<\/td>\n<\/tr>\n<tr>\n<td>WriteAllTextAsync<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1.143 ms<\/td>\n<td style=\"text-align: right;\">0.72<\/td>\n<td style=\"text-align: right;\">15 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And, of course, there&#8217;s more to I\/O than just files. <code>NamedPipeServerStream<\/code> on Windows provides an overlapped I\/O-based implementation very similar to that of <code>FileStream<\/code>. With <code>FileStream<\/code>&#8217;s implementation being overhauled, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52695\">dotnet\/runtime#52695<\/a> from <a href=\"https:\/\/github.com\/manandre\">@manandre<\/a> also overhauled the pipes implementation to mimic the same updated structure as that used in <code>FileStream<\/code>, and thereby gain many of the same benefits, in particular around allocation reduction due to a reusable <code>IValueTaskSource<\/code>-based implementation rather than a <code>TaskCompletionSource<\/code>-based implementation.<\/p>\n<p>On the compression front, in addition to the introduction of the new <code>ZLibStream<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42717\">dotnet\/runtime#42717<\/a>), the underlying <code>Brotli<\/code> code that&#8217;s used behind <code>BrotliStream<\/code>, <code>BrotliEncoder<\/code>, and <code>BrotliDecoder<\/code> was upgraded from v1.0.7 in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44107\">dotnet\/runtime#44107<\/a> from <a 
href=\"https:\/\/github.com\/saucecontrol\">@saucecontrol<\/a> to v1.0.9. That upgrade brings with it various <a href=\"https:\/\/github.com\/google\/brotli\/releases\/tag\/v1.0.9\">performance improvements<\/a>, including code paths that make better use of intrinsics. Not all compression\/decompression measurably benefits, but some certainly does:<\/p>\n<pre><code class=\"language-C#\">private byte[] _toCompress;\r\nprivate MemoryStream _destination = new MemoryStream();\r\n\r\n[GlobalSetup]\r\npublic async Task Setup()\r\n{\r\n    using var hc = new HttpClient();\r\n    _toCompress = await hc.GetByteArrayAsync(@\"https:\/\/raw.githubusercontent.com\/dotnet\/performance\/5584a8b201b8c9c1a805fae4868b30a678107c32\/src\/benchmarks\/micro\/corefx\/System.IO.Compression\/TestData\/alice29.txt\");\r\n}\r\n\r\n[Benchmark]\r\npublic void Compress()\r\n{\r\n    _destination.Position = 0;\r\n    using var ds = new BrotliStream(_destination, CompressionLevel.Fastest, leaveOpen: true);\r\n    ds.Write(_toCompress);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Compress<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">1,050.2 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Compress<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">786.6 us<\/td>\n<td style=\"text-align: right;\">0.75<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47125\">dotnet\/runtime#47125<\/a> from <a href=\"https:\/\/github.com\/NewellClark\">@NewellClark<\/a> also added some missing overrides to various <code>Stream<\/code> types, including <code>DeflateStream<\/code>, which has an effect of reducing the overhead of <code>DeflateStream.WriteAsync<\/code>.<\/p>\n<p>There&#8217;s one more interesting, performance-related improvement in 
<code>DeflateStream<\/code> (and <code>GZipStream<\/code> and <code>BrotliStream<\/code>). The <code>Stream<\/code> contract for asynchronous read operations is that, assuming you request at least one byte, the operation won&#8217;t complete until at least one byte is read; however, the contract makes no guarantees whatsoever that the operation will return all that you requested. In fact, it&#8217;s rare to find a stream that does make such a guarantee, and it&#8217;s problematic in many cases when it does. Unfortunately, as an implementation detail, <code>DeflateStream<\/code> was in fact trying to return as much data as was requested, by issuing as many reads against the underlying stream as it needed to in order to make that happen, stopping only when it decoded a sufficient amount of data to satisfy the request or hit EOF (end of file) on the underlying stream. This is a problem for multiple reasons. First, it prevents overlapping the processing of any data that may have already been received with the waiting for more data to receive; if 100 bytes are already available, but I asked for 200, I&#8217;m then forced to wait to process the 100 until another 100 are received or the stream hits EOF. Second, and more impactful, it effectively prevents <code>DeflateStream<\/code> from being used in any bidirectional communication scenario. Imagine a <code>DeflateStream<\/code> wrapped around a <code>NetworkStream<\/code>, and the stream is being used to send and receive compressed messages to and from a remote party. Let&#8217;s say I pass <code>DeflateStream<\/code> a 1K buffer, the remote party sends me a 100-byte message, and I&#8217;m supposed to read that message and respond (a response the remote party will be waiting for before sending me anything further). <code>DeflateStream<\/code>&#8217;s behavior here will deadlock the whole system, as it will prevent the receipt of the 100-byte message while waiting for another 900 bytes or EOF that will never arrive. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53644\">dotnet\/runtime#53644<\/a> fixes that by enabling <code>DeflateStream<\/code> (and a few other streams) to return once it has data to hand back, even if not the sum total requested. This has been <a href=\"https:\/\/docs.microsoft.com\/dotnet\/core\/compatibility\/core-libraries\/6.0\/partial-byte-reads-in-streams\">documented as a breaking change<\/a>, not because the previous behavior was guaranteed (it wasn&#8217;t), but because we&#8217;ve seen enough code erroneously depend on the old behavior that it was important to call out.<\/p>\n<p>The PR also fixes one more performance-related thing. One issue scalable web servers need to be cognizant of is memory utilization. If you&#8217;ve got a thousand open connections, and you&#8217;re waiting for data to arrive on each connection, you could perform an asynchronous read on each using a buffer, but if that buffer is, say, 4K, that&#8217;s 4MB worth of buffers that are sitting there wasting working set. If you could instead issue a zero-byte read, where you perform an empty read simply to be notified when there is data that could be received, you can then avoid any working set impact from buffers, only allocating or renting a buffer to be used once you know there&#8217;s data to put in it. Lots of <code>Stream<\/code>s intended for bidirectional communication, like <code>NetworkStream<\/code> and <code>SslStream<\/code>, support such zero-byte reads, not returning from an empty read operation until there&#8217;s at least one byte that could be read. For .NET 6, <code>DeflateStream<\/code> can now also be used in this capacity, with the PR changing the implementation to ensure that <code>DeflateStream<\/code> will still issue a read to its underlying <code>Stream<\/code> when the <code>DeflateStream<\/code>&#8217;s output buffer is empty, even if the caller asked for zero bytes. 
Callers that don&#8217;t want this behavior can simply avoid making the 0-byte call.<\/p>\n<p>Moving on, for <code>System.IO.Pipelines<\/code>, a couple of PRs improved performance. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55086\">dotnet\/runtime#55086<\/a> added overrides of <code>ReadByte<\/code> and <code>WriteByte<\/code> that avoid the asynchronous code paths when a byte to read is already buffered or space in the buffer is available to write the byte, respectively. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52159\">dotnet\/runtime#52159<\/a> from <a href=\"https:\/\/github.com\/manandre\">@manandre<\/a> added a <code>CopyToAsync<\/code> override to the <code>PipeReader<\/code> used for reading from <code>Stream<\/code>s, optimizing it to first copy whatever data was already buffered and then delegate to the <code>Stream<\/code>&#8217;s <code>CopyToAsync<\/code>, taking advantage of whatever optimizations it may have.<\/p>\n<p>Beyond that, there were a variety of small improvements. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55373\">dotnet\/runtime#55373<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56568\">dotnet\/runtime#56568<\/a> from <a href=\"https:\/\/github.com\/steveberdy\">@steveberdy<\/a> removed unnecessary <code>Contains('\\0')<\/code> calls from <code>Path.GetFullPath(string, string)<\/code>; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54991\">dotnet\/runtime#54991<\/a> from <a href=\"https:\/\/github.com\/lateapexearlyspeed\">@lateapexearlyspeed<\/a> improved <code>BufferedStream.Position<\/code>&#8217;s setter to avoid discarding buffered read data if it would still be valuable for the new position; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55147\">dotnet\/runtime#55147<\/a> removed some casting overhead from the base <code>Stream<\/code> type; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53070\">dotnet\/runtime#53070<\/a> from <a href=\"https:\/\/github.com\/DavidKarlas\">@DavidKarlas<\/a> avoided unnecessarily roundtripping a file time through local time in <code>File.GetLastWriteTimeUtc<\/code> on Unix; and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43968\">dotnet\/runtime#43968<\/a> consolidated the argument validation logic for derived <code>Stream<\/code> types into public helpers (<code>Stream.ValidateBufferArguments<\/code> and <code>Stream.ValidateCopyToArguments<\/code>), which, in addition to eliminating duplicated code and helping to ensure consistency of behavior, helps to streamline the validation logic using a shared, efficient implementation of the relevant checks.<\/p>\n<h3>Networking<\/h3>\n<p>Let&#8217;s turn our attention to networking. It goes without saying that networking is at the heart of services and most significant apps today, and so improvements in networking performance are critical to the platform.<\/p>\n<p>At the bottom of the networking stack, we have <code>System.Net.Sockets<\/code>. 
One of my favorite sets of changes in this release is that we finally rid the System.Net.Sockets.dll assembly of all custom <code>IAsyncResult<\/code> implementations; all of the remaining places where <code>Begin\/EndXx<\/code> methods provided an implementation and a <code>Task<\/code>-based <code>XxAsync<\/code> method wrapped that <code>Begin\/EndXx<\/code> have now been flipped, with the <code>XxAsync<\/code> method providing the implementation, and the <code>Begin\/EndXx<\/code> methods just delegating to the <code>Task<\/code>-based methods. So, for example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43886\">dotnet\/runtime#43886<\/a> reimplemented <code>Socket.BeginSend\/EndSend<\/code> and <code>Socket.BeginReceive\/EndReceive<\/code> as wrappers for <code>Socket.SendAsync<\/code> and <code>Socket.ReceiveAsync<\/code>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43661\">dotnet\/runtime#43661<\/a> rewrote <code>Socket.ConnectAsync<\/code> using tasks and <code>async\/await<\/code>, and then <code>Begin\/EndConnect<\/code> were just implemented in terms of that. Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53340\">dotnet\/runtime#53340<\/a> added new <code>AcceptAsync<\/code> overloads that are not only task-based but also cancelable (a long requested feature), and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51212\">dotnet\/runtime#51212<\/a> then deleted a lot of code by having the <code>Begin\/EndAccept<\/code> methods just use the task-based implementation. These changes not only reduced the size of the assembly, reduced dependencies from System.Net.Sockets.dll (the custom <code>IAsyncResult<\/code> implementations were depending on libraries like System.Security.Principal.Windows.dll), and reduced the complexity of the code, they also reduced allocation. 
To see the impact, here&#8217;s a little microbenchmark that repeatedly establishes a new loopback connection:<\/p>\n<pre><code class=\"language-C#\">[Benchmark(OperationsPerInvoke = 1000)]\r\npublic async Task ConnectAcceptAsync()\r\n{\r\n    using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n    listener.Listen(1);\r\n\r\n    for (int i = 0; i &lt; 1000; i++)\r\n    {\r\n        using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n        await client.ConnectAsync(listener.LocalEndPoint);\r\n        using var server = await listener.AcceptAsync();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ConnectAcceptAsync<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">282.3 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">2,780 B<\/td>\n<\/tr>\n<tr>\n<td>ConnectAcceptAsync<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">278.3 us<\/td>\n<td style=\"text-align: right;\">0.99<\/td>\n<td style=\"text-align: right;\">1,698 B<\/td>\n<\/tr>\n<tr>\n<td>ConnectAcceptAsync<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">273.8 us<\/td>\n<td style=\"text-align: right;\">0.97<\/td>\n<td style=\"text-align: right;\">1,402 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Then for .NET 6, I can also add <code>CancellationToken.None<\/code> as an argument to <code>ConnectAsync<\/code> and <code>AcceptAsync<\/code>. 
Passing <code>CancellationToken.None<\/code> as the last argument changes the overload used; this overload doesn&#8217;t just enable cancellation (if you were to pass in a cancelable token), but those new overloads return <code>ValueTask&lt;T&gt;<\/code>s, further reducing allocation. With that, I get the following, for an additional reduction:<\/p>\n<pre><code class=\"language-C#\">[Params(false, true)]\r\npublic bool NewOverload { get; set; }\r\n\r\n[Benchmark(OperationsPerInvoke = 1000)]\r\npublic async Task ConnectAcceptAsync()\r\n{\r\n    using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n    listener.Listen(1);\r\n\r\n    for (int i = 0; i &lt; 1000; i++)\r\n    {\r\n        using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n        if (NewOverload)\r\n        {\r\n            await client.ConnectAsync(listener.LocalEndPoint, CancellationToken.None);\r\n        }\r\n        else\r\n        {\r\n            await client.ConnectAsync(listener.LocalEndPoint);\r\n        }\r\n        using var server = await listener.AcceptAsync();\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>NewOverload<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ConnectAcceptAsync<\/td>\n<td>False<\/td>\n<td style=\"text-align: right;\">270.5 us<\/td>\n<td style=\"text-align: right;\">1,403 B<\/td>\n<\/tr>\n<tr>\n<td>ConnectAcceptAsync<\/td>\n<td>True<\/td>\n<td style=\"text-align: right;\">262.5 us<\/td>\n<td style=\"text-align: right;\">1,324 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47781\">dotnet\/runtime#47781<\/a> is another example of flipping the <code>Begin\/End<\/code> and <code>Task<\/code>-based implementations. 
It adds new task-based overloads for the UDP-focused sending and receiving operations on <code>Socket<\/code> (<code>SendTo<\/code>, <code>ReceiveFrom<\/code>, <code>ReceiveMessageFrom<\/code>), and then reimplements the existing <code>Begin\/End<\/code> methods on top of the new task-based (actually <code>ValueTask<\/code>) methods. Here&#8217;s an example; note that technically these benchmarks are flawed given that UDP is lossy, but I&#8217;ve ignored that for the purposes of determining the costs of these methods.<\/p>\n<pre><code class=\"language-C#\">private Socket _client;\r\nprivate Socket _server;\r\nprivate byte[] _buffer = new byte[1];\r\n\r\n[GlobalSetup]\r\npublic void Setup()\r\n{\r\n    _client = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n    _client.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n\r\n    _server = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n    _server.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n}\r\n\r\n[Benchmark(OperationsPerInvoke = 10_000)]\r\npublic async Task ReceiveSendAsync()\r\n{\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        var receive = _client.ReceiveFromAsync(_buffer, SocketFlags.None, _server.LocalEndPoint);\r\n        await _server.SendToAsync(_buffer, SocketFlags.None, _client.LocalEndPoint);\r\n        await receive;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ReceiveSendAsync<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">36.24 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,888 B<\/td>\n<\/tr>\n<tr>\n<td>ReceiveSendAsync<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">36.22 us<\/td>\n<td style=\"text-align: 
right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,672 B<\/td>\n<\/tr>\n<tr>\n<td>ReceiveSendAsync<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">28.50 us<\/td>\n<td style=\"text-align: right;\">0.79<\/td>\n<td style=\"text-align: right;\">384 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Then, as in the previous example, I can try adding in the additional <code>CancellationToken<\/code> argument:<\/p>\n<pre><code class=\"language-C#\">[Benchmark(OperationsPerInvoke = 10_000, Baseline = true)]\r\npublic async Task Old()\r\n{\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        var receive = _client.ReceiveFromAsync(_buffer, SocketFlags.None, _server.LocalEndPoint);\r\n        await _server.SendToAsync(_buffer, SocketFlags.None, _client.LocalEndPoint);\r\n        await receive;\r\n    }\r\n}\r\n\r\n[Benchmark(OperationsPerInvoke = 10_000)]\r\npublic async Task New()\r\n{\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        var receive = _client.ReceiveFromAsync(_buffer, SocketFlags.None, _server.LocalEndPoint, CancellationToken.None);\r\n        await _server.SendToAsync(_buffer, SocketFlags.None, _client.LocalEndPoint, CancellationToken.None);\r\n        await receive;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">28.95 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">384 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">27.83 us<\/td>\n<td style=\"text-align: right;\">0.96<\/td>\n<td style=\"text-align: right;\">288 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other new overloads have also been added in .NET 6 (almost all of the operations on <code>Socket<\/code> now have overloads accepting <code>ReadOnlySpan&lt;T&gt;<\/code> or 
<code>{ReadOnly}Memory&lt;T&gt;<\/code>, complete with functioning support for <code>CancellationToken<\/code>). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47230\">dotnet\/runtime#47230<\/a> from <a href=\"https:\/\/github.com\/gfoidl\">@gfoidl<\/a> added a span-based overload of <code>Socket.SendFile<\/code>, enabling the pre- and post- buffers to be specified as <code>ReadOnlySpan&lt;byte&gt;<\/code> rather than <code>byte[]<\/code>, which makes it cheaper to send only a portion of some array (the alternative with the existing overloads would be to allocate a new array of the desired length and copy the relevant data into it), and then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52208\">dotnet\/runtime#52208<\/a> also from <a href=\"https:\/\/github.com\/gfoidl\">@gfoidl<\/a> added a <code>Memory<\/code>-based overload of <code>Socket.SendFileAsync<\/code>, returning a <code>ValueTask<\/code> (subsequently <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53062\">dotnet\/runtime#53062<\/a> provided the cancellation support that had been stubbed out in the previous PR). On top of that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55232\">dotnet\/runtime#55232<\/a>, and then <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56777\">dotnet\/runtime#56777<\/a> from <a href=\"https:\/\/github.com\/huoyaoyuan\">@huoyaoyuan<\/a>, reduced the overhead of these <code>SendFile{Async}<\/code> operations by utilizing the new <code>RandomAccess<\/code> class to create <code>SafeFileHandle<\/code> instances directly rather than going through <code>FileStream<\/code> to open the appropriate handles to the files to be sent. 
The net result is a nice reduction in overhead for these operations, beyond the improvements in usability.<\/p>\n<p>As long as we&#8217;re on the subject of <code>SendFileAsync<\/code>, it&#8217;s somewhat interesting to look at <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55263\">dotnet\/runtime#55263<\/a>. This is a tiny PR that reduced the size of some allocations in the networking stack, including one in <code>SendFileAsync<\/code> (or, rather, in the <code>SendPacketsAsync<\/code> that <code>SendFileAsync<\/code> wraps). The internal <code>SocketPal.SendPacketsAsync<\/code> on Unix is implemented as an <code>async Task<\/code> method, which means that all &#8220;locals&#8221; in the method that need to survive across <code>await<\/code>s are lifted by the compiler to live as fields on the generated state machine type for that async method, and that state machine will end up being allocated to live on the heap if the async method needs to complete asynchronously. The fewer and smaller fields we can have on these state machines, the smaller the resulting allocation will be for asynchronously completing async methods. But locals written by the developer aren&#8217;t the only reason for fields being added. 
Let&#8217;s take a look at an example:<\/p>\n<pre><code class=\"language-C#\">public class C\r\n{\r\n    public static async Task&lt;int&gt; Example1(int someParameter) =&gt;\r\n        Process(someParameter) + await Task.FromResult(42);\r\n\r\n    public static async Task&lt;int&gt; Example2(int someParameter) =&gt;\r\n        await Task.FromResult(42) + Process(someParameter);\r\n\r\n    private static int Process(int i) =&gt; i;\r\n}<\/code><\/pre>\n<p>Just focusing on fields, the C# compiler will produce for <code>Example1<\/code> a type like this:<\/p>\n<pre><code class=\"language-C#\">[StructLayout(LayoutKind.Auto)]\r\n[CompilerGenerated]\r\nprivate struct &lt;Example1&gt;d__0 : IAsyncStateMachine\r\n{\r\n    public AsyncTaskMethodBuilder&lt;int&gt; &lt;&gt;t__builder;\r\n    public int &lt;&gt;1__state;\r\n    public int someParameter;\r\n    private TaskAwaiter&lt;int&gt; &lt;&gt;u__1;\r\n    private int &lt;&gt;7__wrap1;\r\n    ....\r\n}<\/code><\/pre>\n<p>Let&#8217;s examine a few of these fields:<\/p>\n<ul>\n<li><code>&lt;&gt;t__builder<\/code> here is the &#8220;builder&#8221; we discussed earlier when talking about pooling in async methods.<\/li>\n<li><code>&lt;&gt;1__state<\/code> is the &#8220;state&#8221; of the state machine. The compiler rewrites an async method to have a jump table at the beginning, where the current state dictates to where in the method it jumps. <code>await<\/code>s are assigned a state based on their position in the source code, and the code for awaiting something that&#8217;s not yet completed involves updating <code>&lt;&gt;1__state<\/code> to refer to the await that should be jumped to when the continuation is invoked to re-enter the async method after the awaited task has completed.<\/li>\n<li><code>someParameter<\/code> is the argument to the method. 
It needs to be on the state machine to feed it into the <code>MoveNext<\/code> method generated by the compiler, but it would also need to be on the state machine if code after an <code>await<\/code> wanted to read its value.<\/li>\n<li><code>&lt;&gt;u__1<\/code> stores the awaiter for the <code>await<\/code> on the <code>Task&lt;int&gt;<\/code> returned by <code>Task.FromResult(42)<\/code>. The code generated for the await involves calling <code>GetAwaiter()<\/code> on the awaited thing, checking its <code>IsCompleted<\/code> property, and if that&#8217;s false, storing the awaiter into this field so that it can be read and its <code>GetResult()<\/code> method called upon completion of the task.<\/li>\n<\/ul>\n<p>But&#8230; what is this <code>&lt;&gt;7__wrap1<\/code> thing? The answer has to do with order of operations. Let&#8217;s look at the code generated for <code>Example2<\/code>:<\/p>\n<pre><code class=\"language-C#\">private struct &lt;Example2&gt;d__1 : IAsyncStateMachine\r\n{\r\n    public int &lt;&gt;1__state;\r\n    public AsyncTaskMethodBuilder&lt;int&gt; &lt;&gt;t__builder;\r\n    public int someParameter;\r\n    private TaskAwaiter&lt;int&gt; &lt;&gt;u__1;\r\n}<\/code><\/pre>\n<p>This code has the exact same fields as the state machine for <code>Example1<\/code>, except it&#8217;s missing the <code>&lt;&gt;7__wrap1<\/code> field. The reason is the compiler is required to respect the order of operations in an expression like <code>Process(someParameter) + await Task.FromResult(42)<\/code>. That means it must compute <code>Process(someParameter)<\/code> before it computes <code>await Task.FromResult(42)<\/code>. But <code>Process(someParameter)<\/code> returns an <code>int<\/code> value; where should that be stored while <code>await Task.FromResult(42)<\/code> is being processed? On the state machine. That &#8220;spilled&#8221; field is <code>&lt;&gt;7__wrap1<\/code>. 
This also explains why the field isn&#8217;t there in <code>Example2<\/code>: the order of operations was explicitly reversed by the developer to be <code>await Task.FromResult(42) + Process(someParameter)<\/code>, so we don&#8217;t have to stash the result of <code>Process(someParameter)<\/code> anywhere, as it&#8217;s no longer crossing an <code>await<\/code> boundary. So, back to the cited PR. The original line of code in question was <code>bytesTransferred += await socket.SendAsync(...)<\/code>, which is the same as <code>bytesTransferred = bytesTransferred + await socket.SendAsync(...)<\/code>. Look familiar? Technically the compiler needs to stash away the <code>bytesTransferred<\/code> value in order to preserve the order of operations with regard to the <code>SendAsync<\/code> operation, and so the PR just explicitly reverses this to be <code>bytesTransferred = await socket.SendAsync(...) + bytesTransferred<\/code> in order to make the state machine a little smaller. You can see more examples of this in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55190\">dotnet\/runtime#55190<\/a> for <code>BufferedStream<\/code>. In practice, the compiler should be able to special-case this constrained version of the issue, as it should be able to see that no other code would have access to <code>bytesTransferred<\/code> to modify it, and thus the defensive copy shouldn&#8217;t be necessary&#8230; maybe <a href=\"https:\/\/github.com\/dotnet\/roslyn\/issues\/54629\">some day<\/a>.<\/p>\n<p>Let&#8217;s move up the stack a little: DNS. <code>System.Net.Dns<\/code> is a relatively thin wrapper for OS functionality. It provides both synchronous and asynchronous APIs. On Windows, the asynchronous APIs are implemented on top of Winsock&#8217;s <code>GetAddrInfoExW<\/code> function (if available), which provides a scalable overlapped I\/O-based model for performing name resolution asynchronously. 
The story is more convoluted on Unix, where POSIX provides <code>getaddrinfo<\/code> but no asynchronous counterpart. Linux does have <code>getaddrinfo_a<\/code>, which does provide an asynchronous version, and in fact <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/34633\">dotnet\/runtime#34633<\/a> from <a href=\"https:\/\/github.com\/gfoidl\">@gfoidl<\/a> did temporarily change <code>Dns<\/code>&#8216;s async APIs to use it, but that PR was subsequently reverted in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48666\">dotnet\/runtime#48666<\/a> upon realizing that the implementation was just queueing these calls to be executed synchronously on a limited size thread pool internal to glibc, and we could similarly employ an &#8220;async-over-sync&#8221; solution in managed code and with more control. Here <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/should-i-expose-asynchronous-wrappers-for-synchronous-methods\/\">&#8220;async-over-sync&#8221;<\/a> is referring to the idea of implementing an asynchronous operation that&#8217;s just queueing a synchronous piece of work to be done on another thread, rather than having it employ truly asynchronous I\/O all the way down to the hardware. This ends up blocking that other thread for the duration of the operation, which inherently limits scalability. It can also be a real bottleneck for something like DNS. Typically an operating system will cache some amount of DNS data, but in cases where a request is made for unavailable data, the OS has to reach out across the network to a DNS server to obtain it. If lots of requests are made concurrently for the same non-cached address, that can starve the pool with all of the operations performing the exact same request. 
To address this, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49171\">dotnet\/runtime#49171<\/a> implements that async-over-sync in <code>Dns<\/code> in a way that asynchronously serializes all requests for the same destination; that way, if bursts do show up, we only end up blocking one thread for all of them rather than one thread for each. This adds a small amount of overhead for individual operations, but significantly reduces the overhead in the bursty, problematic scenarios. In the future, we will hopefully be able to do away with this once we&#8217;re able to implement a true async I\/O-based mechanism on Unix, potentially implemented directly on <code>Socket<\/code> in a managed DNS client, or potentially employing a library like <a href=\"https:\/\/c-ares.haxx.se\/\">c-ares<\/a>.<\/p>\n<p>Another nice improvement in <code>Dns<\/code> comes in the form of new overloads introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/33420\">dotnet\/runtime#33420<\/a> for specifying the desired <code>AddressFamily<\/code>. By default, operations on <code>Dns<\/code> can return both IPv4 and IPv6 addresses, but if you know you only care about one or the other, you can now be explicit about it. 
Doing so can save on both the amount of data transferred and the resulting allocations to hand back that data.<\/p>\n<pre><code class=\"language-C#\">private string _hostname = Dns.GetHostName();\r\n\r\n[Benchmark(OperationsPerInvoke = 1000, Baseline = true)]\r\npublic async Task GetHostAddresses()\r\n{\r\n    for (int i = 0; i &lt; 1_000; i++)\r\n        await Dns.GetHostAddressesAsync(_hostname);\r\n}\r\n\r\n[Benchmark(OperationsPerInvoke = 1000)]\r\npublic async Task GetHostAddresses_OneFamily()\r\n{\r\n    for (int i = 0; i &lt; 1_000; i++)\r\n        await Dns.GetHostAddressesAsync(_hostname, AddressFamily.InterNetwork);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetHostAddresses<\/td>\n<td style=\"text-align: right;\">210.1 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">808 B<\/td>\n<\/tr>\n<tr>\n<td>GetHostAddresses_OneFamily<\/td>\n<td style=\"text-align: right;\">195.7 us<\/td>\n<td style=\"text-align: right;\">0.93<\/td>\n<td style=\"text-align: right;\">370 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Moving up the stack, we start getting into specifying URLs, which typically uses <code>System.Uri<\/code>. <code>Uri<\/code> instances are created in many places, and being able to create them more quickly and with less GC impact is a boon for end-to-end performance of networking-related code. The internal <code>Uri.ReCreateParts<\/code> method is the workhorse behind a lot of the public <code>Uri<\/code> surface area, and is responsible for formatting into a <code>string<\/code> whatever parts of the <code>Uri<\/code> have been requested (e.g. <code>UriComponents.Path | UriComponents.Query | UriComponents.Fragment<\/code>) while also factoring in desired escaping (e.g. <code>UriFormat.Unescaped<\/code>). 
It also unfortunately had quite a knack for allocating <code>char[]<\/code> arrays. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/34864\">dotnet\/runtime#34864<\/a> fixed that, using stack-allocated space for most <code>Uri<\/code>s (e.g. those whose length is &lt;= 256 characters) and falling back to using <code>ArrayPool&lt;char&gt;.Shared<\/code> for longer lengths, while also cleaning up some code paths to make them a bit more streamlined. The impact of this is visible in these benchmarks:<\/p>\n<pre><code class=\"language-C#\">private Uri _uri = new Uri(\"http:\/\/dot.net\");\r\n\r\n[Benchmark]\r\npublic string GetComponents() =&gt; _uri.GetComponents(UriComponents.PathAndQuery | UriComponents.Fragment, UriFormat.UriEscaped);\r\n\r\n[Benchmark]\r\npublic Uri NewUri() =&gt; new Uri(\"http:\/\/dot.net\");\r\n\r\n[Benchmark]\r\npublic string PathAndQuery() =&gt; _uri.PathAndQuery;<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetComponents<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">49.4856 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">241 B<\/td>\n<\/tr>\n<tr>\n<td>GetComponents<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">47.8179 ns<\/td>\n<td style=\"text-align: right;\">0.96<\/td>\n<td style=\"text-align: right;\">232 B<\/td>\n<\/tr>\n<tr>\n<td>GetComponents<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">39.5046 ns<\/td>\n<td style=\"text-align: right;\">0.80<\/td>\n<td style=\"text-align: right;\">232 B<\/td>\n<\/tr>\n<tr>\n<td>GetComponents<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">31.0651 ns<\/td>\n<td style=\"text-align: right;\">0.63<\/td>\n<td style=\"text-align: right;\">24 
B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>NewUri<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">280.0722 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">168 B<\/td>\n<\/tr>\n<tr>\n<td>NewUri<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">144.3990 ns<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<td style=\"text-align: right;\">72 B<\/td>\n<\/tr>\n<tr>\n<td>NewUri<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">100.0479 ns<\/td>\n<td style=\"text-align: right;\">0.36<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<\/tr>\n<tr>\n<td>NewUri<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">92.1300 ns<\/td>\n<td style=\"text-align: right;\">0.33<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>PathAndQuery<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">50.3840 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">241 B<\/td>\n<\/tr>\n<tr>\n<td>PathAndQuery<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">48.7625 ns<\/td>\n<td style=\"text-align: right;\">0.97<\/td>\n<td style=\"text-align: right;\">232 B<\/td>\n<\/tr>\n<tr>\n<td>PathAndQuery<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">2.1615 ns<\/td>\n<td style=\"text-align: right;\">0.04<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>PathAndQuery<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">0.7380 ns<\/td>\n<td style=\"text-align: right;\">0.01<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, not all URLs 
contain pure ASCII. Such cases often involve escaping these characters using percent-encoding, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/32552\">dotnet\/runtime#32552<\/a> optimized those code paths by changing a multi-pass scheme that involved both a temporary <code>byte[]<\/code> buffer and a temporary <code>char[]<\/code> buffer into a single-pass scheme that used stack-allocation with a fallback to the <code>ArrayPool&lt;T&gt;.Shared<\/code>.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic string Unescape() =&gt; Uri.UnescapeDataString(\"%E4%BD%A0%E5%A5%BD\");<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Unescape<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">284.03 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">385 B<\/td>\n<\/tr>\n<tr>\n<td>Unescape<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">144.55 ns<\/td>\n<td style=\"text-align: right;\">0.51<\/td>\n<td style=\"text-align: right;\">208 B<\/td>\n<\/tr>\n<tr>\n<td>Unescape<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">125.98 ns<\/td>\n<td style=\"text-align: right;\">0.44<\/td>\n<td style=\"text-align: right;\">144 B<\/td>\n<\/tr>\n<tr>\n<td>Unescape<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">69.85 ns<\/td>\n<td style=\"text-align: right;\">0.25<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>UriBuilder<\/code> is also used in some applications to compose <code>Uri<\/code> instances. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51826\">dotnet\/runtime#51826<\/a> reduced the size of <code>UriBuilder<\/code> itself by getting rid of some fields that weren&#8217;t strictly necessary, avoided some string concatenations and substring allocations, and utilized stack-allocation and <code>ArrayPool&lt;T&gt;<\/code> as part of its <code>ToString<\/code> implementation. As a result, <code>UriBuilder<\/code> is now also lighter weight for most uses:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic string BuilderToString()\r\n{\r\n    var builder = new UriBuilder();\r\n    builder.Scheme = \"https\";\r\n    builder.Host = \"dotnet.microsoft.com\";\r\n    builder.Port = 443;\r\n    builder.Path = \"\/platform\/try-dotnet\";\r\n    return builder.ToString();\r\n}  <\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>BuilderToString<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">604.5 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">810 B<\/td>\n<\/tr>\n<tr>\n<td>BuilderToString<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">446.7 ns<\/td>\n<td style=\"text-align: right;\">0.74<\/td>\n<td style=\"text-align: right;\">432 B<\/td>\n<\/tr>\n<tr>\n<td>BuilderToString<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">225.7 ns<\/td>\n<td style=\"text-align: right;\">0.38<\/td>\n<td style=\"text-align: right;\">432 B<\/td>\n<\/tr>\n<tr>\n<td>BuilderToString<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">171.7 ns<\/td>\n<td style=\"text-align: right;\">0.28<\/td>\n<td style=\"text-align: right;\">216 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As noted previously, I love seeing this continual march of progress, with every release the exact same 
API getting faster and faster, as more and more opportunities are discovered, new capabilities of the underlying platform utilized, code generation improving, and on. Exciting.<\/p>\n<p>Now we get to <code>HttpClient<\/code>. There were a few areas in which <code>HttpClient<\/code>, and specifically <code>SocketsHttpHandler<\/code>, was improved from a performance perspective (and many more from a functionality perspective, including preview support for HTTP\/3, better standards adherence, distributed tracing integration, and more knobs for configuring how it should behave). One key area is around header management. Previous releases saw a lot of effort applied to driving down the overheads of the HTTP stack, but the public API for headers forced a particular set of work and allocations to be performed. Even within those constraints, we&#8217;ve driven down some costs, such as by no longer forcing headers added into the <code>HttpClient.DefaultRequestHeaders<\/code> collection to be parsed if the developer added them with <code>TryAddWithoutValidation<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49673\">dotnet\/runtime#49673<\/a>), removing a lock that&#8217;s no longer necessary (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54130\">dotnet\/runtime#54130<\/a>), and enabling a singleton empty enumerator to be returned when enumerating an <code>HttpHeaderValueCollection<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47010\">dotnet\/runtime#47010<\/a>).<\/p>\n<pre><code class=\"language-C#\">[Benchmark(Baseline = true)]\r\npublic async Task Enumerate()\r\n{\r\n    var request = new HttpRequestMessage(HttpMethod.Get, s_uri);\r\n    using var resp = await s_client.SendAsync(request, default);\r\n    foreach (var header in resp.Headers) { }\r\n    foreach (var contentHeader in resp.Content.Headers) { }\r\n    await resp.Content.CopyToAsync(Stream.Null);\r\n}\r\n\r\nprivate static readonly Socket s_listener = new 
Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\nprivate static readonly HttpMessageInvoker s_client = new HttpMessageInvoker(new HttpClientHandler { UseProxy = false, UseCookies = false, AllowAutoRedirect = false });\r\nprivate static Uri s_uri;\r\n\r\n[GlobalSetup]\r\npublic void CreateSocketServer()\r\n{\r\n    s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n    s_listener.Listen(int.MaxValue);\r\n    var ep = (IPEndPoint)s_listener.LocalEndPoint;\r\n    s_uri = new Uri($\"http:\/\/{ep.Address}:{ep.Port}\/\");\r\n    byte[] response = Encoding.UTF8.GetBytes(\"HTTP\/1.1 200 OK\\r\\nDate: Tue, 01 Jul 2021 12:00:00 GMT \\r\\nServer: Example\\r\\nAccess-Control-Allow-Credentials: true\\r\\nAccess-Control-Allow-Origin: *\\r\\nConnection: keep-alive\\r\\nContent-Type: text\/html; charset=utf-8\\r\\nContent-Length: 5\\r\\n\\r\\nHello\");\r\n    byte[] endSequence = new byte[] { (byte)'\\r', (byte)'\\n', (byte)'\\r', (byte)'\\n' };\r\n\r\n    Task.Run(async () =&gt;\r\n    {\r\n        while (true)\r\n        {\r\n            Socket s = await s_listener.AcceptAsync();\r\n            _ = Task.Run(() =&gt;\r\n            {\r\n                using (var ns = new NetworkStream(s, true))\r\n                {\r\n                    byte[] buffer = new byte[1024];\r\n                    int totalRead = 0;\r\n                    while (true)\r\n                    {\r\n                        int read = ns.Read(buffer, totalRead, buffer.Length - totalRead);\r\n                        if (read == 0) return;\r\n                        totalRead += read;\r\n                        if (buffer.AsSpan(0, totalRead).IndexOf(endSequence) == -1)\r\n                        {\r\n                            if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);\r\n                            continue;\r\n                        }\r\n\r\n                        ns.Write(response, 0, response.Length);\r\n\r\n                        
totalRead = 0;\r\n                    }\r\n                }\r\n            });\r\n        }\r\n    });\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Enumerate<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">145.97 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">18 KB<\/td>\n<\/tr>\n<tr>\n<td>Enumerate<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">85.51 us<\/td>\n<td style=\"text-align: right;\">0.56<\/td>\n<td style=\"text-align: right;\">3 KB<\/td>\n<\/tr>\n<tr>\n<td>Enumerate<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">82.45 us<\/td>\n<td style=\"text-align: right;\">0.54<\/td>\n<td style=\"text-align: right;\">3 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>But the biggest impact in this area comes from the addition of the new <code>HttpHeaders.NonValidated<\/code> property (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53555\">dotnet\/runtime#53555<\/a>), which returns a view over the headers collection that does not force parsing or validation when reading\/enumerating. This has both a functional and a performance benefit. Functionally, it means headers sent by a server can be inspected in their original form, for consumers that really want to see the data prior to it having been sanitized\/transformed by <code>HttpClient<\/code>. But from a performance perspective, it has a significant impact, as it means that a) the validation logic we&#8217;d normally run on headers can be omitted entirely, and b) any allocations that would result from that validation are also avoided. 
Now if we run <code>Enumerate<\/code> and <code>EnumerateNew<\/code> on .NET 6, we can see the improvement that results from using the new API:<\/p>\n<pre><code class=\"language-C#\">\/\/ Added to the previous benchmark\r\n[Benchmark]\r\npublic async Task EnumerateNew()\r\n{\r\n    var request = new HttpRequestMessage(HttpMethod.Get, s_uri);\r\n    using var resp = await s_client.SendAsync(request, default);\r\n    foreach (var header in resp.Headers.NonValidated) { }\r\n    foreach (var contentHeader in resp.Content.Headers.NonValidated) { }\r\n    await resp.Content.CopyToAsync(Stream.Null);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Enumerate<\/td>\n<td style=\"text-align: right;\">82.70 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">3 KB<\/td>\n<\/tr>\n<tr>\n<td>EnumerateNew<\/td>\n<td style=\"text-align: right;\">67.36 us<\/td>\n<td style=\"text-align: right;\">0.81<\/td>\n<td style=\"text-align: right;\">2 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>So, even with all the I\/O and HTTP protocol logic being performed, tweaking the API used for header enumeration here results in an ~20% boost in throughput.<\/p>\n<p>Another area that saw significant improvement was in <code>SocketsHttpHandler<\/code>&#8216;s connection pooling. One change here comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50545\">dotnet\/runtime#50545<\/a>, which simplifies the code and helps on all platforms, but in particular improves a long-standing potential performance issue on Windows (our Unix implementation generally didn&#8217;t suffer the same problem, because of differences in how asynchronous I\/O is implemented). 
<code>SocketsHttpHandler<\/code> maintains a pool of connections that remain open to the server and that it can use to service future requests. By default, it needs to scavenge this pool periodically, to close connections that have been around for too long or that, more relevant to this discussion, the server has chosen to close. To determine whether the server has closed a connection, we need to poll the underlying socket, but in some situations, we don&#8217;t actually have access to the underlying socket in order to perform the poll (and, with the advent of <code>ConnectCallback<\/code> in .NET 5 that enables an arbitrary <code>Stream<\/code> to be provided for use with a connection, there may not even be a <code>Socket<\/code> involved at all). In such situations, the only way we can be notified of a connection being closed is to perform a read on the connection. Thus, if we were unable to poll the socket directly, we would issue an asynchronous read (which would then be used as the first read as part of handling the next request on that connection), and the scavenging logic could check the task for that read to see whether it had completed with an error. Now comes the problem. On Windows, overlapped I\/O read operations often involve pinning a buffer for the duration of the operation (on Unix, we implement asynchronous reads via epoll, and no buffer need be pinned for the duration); that meant if we ended up with a lot of connections in the pool, and we had to issue asynchronous reads for each, we&#8217;d likely end up pinning a whole bunch of sizeable buffers, leading to memory fragmentation and potentially sizeable working set growth. The fix is to use zero-byte reads. Rather than issuing the actual read using the connection&#8217;s buffer, we instead issue a read using an empty buffer. 
All of the streams <code>SocketsHttpHandler<\/code> uses by default (namely <code>NetworkStream<\/code> and <code>SslStream<\/code>) support the notion of zero-byte reads, where rather than returning immediately, they instead wait to complete the asynchronous read until at least some data is available, even though they won&#8217;t be returning any of that data as part of the operation. Then, only once that operation has completed, the actual initial read is issued, which is necessary both to actually get the first batch of response data and to handle arbitrary <code>Stream<\/code>s that may return immediately from a zero-byte read without actually waiting. Interestingly, though, just supporting zero-byte reads can sometimes be the &#8220;min bar&#8221;. <code>SslStream<\/code> has long supported zero-byte reads, but it did so by in turn issuing a read on the stream it wraps using its internal read buffer. That means <code>SslStream<\/code> was potentially itself holding onto a valuable buffer, and (on Windows) pinning it, even though that was unnecessary. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49123\">dotnet\/runtime#49123<\/a> addresses that by special-casing zero-byte reads to not use a buffer and to not force an internal buffer into existence if one isn&#8217;t currently available (<code>SslStream<\/code> returns buffers to a pool when it&#8217;s not currently using them).<\/p>\n<p><code>SslStream<\/code> has seen multiple other performance-related PRs come through for .NET 6. Previously, <code>SslStream<\/code> would hand back to a caller of <code>Read{Async}<\/code> the data from at most one TLS frame. As of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50815\">dotnet\/runtime#50815<\/a>, it can now hand back data from multiple TLS frames should those frames be available and a large enough buffer be provided to <code>Read{Async}<\/code>. 
This can help reduce the chattiness of <code>ReadAsync<\/code> calls, making better use of buffer space to reduce frequency of calls. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51320\">dotnet\/runtime#51320<\/a> from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> helped avoid some unnecessary buffer growth after he noticed some constants related to TLS frame sizes that had been in the code for a long time were no longer sufficient for newer TLS protocols, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51324\">dotnet\/runtime#51324<\/a> also from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> helped avoid some casting overheads by being more explicit about the actual types being passed through the system.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53851\">dotnet\/runtime#53851<\/a> provides another very interesting improvement related to connection pooling. Let&#8217;s say all of the connections for a given server are currently busy handling requests, and another request comes along. Unless you&#8217;ve configured a maximum limit on the number of connections per server and hit that limit, <code>SocketsHttpHandler<\/code> will happily create a new connection to service your request (in the case of HTTP\/2, by default per the HTTP\/2 specification there&#8217;s only one connection and a limit set by the server to the number of requests\/streams multiplexed onto that connection, but <code>SocketsHttpHandler<\/code> allows you to opt-in to using more than one connection). The question then is, what happens to that request if, while waiting for the new connection to be established, one of the existing connections becomes available? Up until now, that request would just wait for and use the new connection. 
With the aforementioned PR, the request can now use whatever connection becomes available first, whether it be an existing one or a new one, and whatever connection isn&#8217;t used will simply find its way back to the pool. This should both improve latency and response time, and potentially reduce the number of connections needed in the pool, thus saving memory and networking resources.<\/p>\n<p>.NET Core 3.0 introduced support for HTTP\/2, and since then the use of the protocol has been growing. This has led us to discover where things worked well and where more work was needed. One area in particular that needed some love was around <code>SocketsHttpHandler<\/code>&#8216;s HTTP\/2 download performance. Investigations showed slowdowns here were due to <code>SocketsHttpHandler<\/code> using a fixed-size receive window (64KB), such that if the receive buffer wasn&#8217;t large enough to keep the network busy, the system could stall. To address that, the receive buffer needs to be large enough to handle the &#8220;bandwidth-delay product&#8221; (a network connection&#8217;s capacity multiplied by round-trip communication time). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54755\">dotnet\/runtime#54755<\/a> adds support for dynamically-sizing the receive window, as well as several knobs for tweaking the behavior. This should significantly help with performance in particular on networks with reasonably-high bandwidth along with some meaningful delay in communications (e.g. with geographically distributed data centers), while also not consuming too much memory.<\/p>\n<p>There&#8217;s also been a steady stream of small improvements to <code>HttpClient<\/code>, things that on their own don&#8217;t account for much but when added together help to move the needle. 
For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54209\">dotnet\/runtime#54209<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a> converted a small class to a struct, saving an allocation per connection; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50487\">dotnet\/runtime#50487<\/a> removed a closure allocation from the <code>SocketsHttpHandler<\/code> connection pool, simply by changing the scope in which a variable was declared so that it wasn&#8217;t in scope of a hotter path; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44750\">dotnet\/runtime#44750<\/a> removed a string allocation from <code>MediaTypeHeaderValue<\/code> in the common case where it has a media type but no additional parameters; and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45303\">dotnet\/runtime#45303<\/a> optimized the loading of the Huffman static encoding table used by HTTP\/2. The original code employed a single, long array of tuples, which required the C# compiler to generate a very large function for initializing each element of the array; the PR changed that to instead be two blittable <code>uint[]<\/code> arrays that are cheaply stored in the binary.<\/p>\n<p>Finally, let&#8217;s look at <code>WebSockets<\/code>. <code>WebSocket.CreateFromStream<\/code> was introduced in .NET Core 2.1 and layers a managed implementation of the <a href=\"https:\/\/datatracker.ietf.org\/doc\/html\/rfc6455\">websocket protocol<\/a> on top of an arbitrary bidirectional <code>Stream<\/code>; <code>ClientWebSocket<\/code> uses it with a <code>Stream<\/code> created by <code>SocketsHttpHandler<\/code> to enable client websockets, and Kestrel uses it to enable server websockets. Thus, any improvements we make to that managed implementation (the internal <code>ManagedWebSocket<\/code>) benefit both client and server. 
There have been a handful of small improvements in this area, such as with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49831\">dotnet\/runtime#49831<\/a> that saved a few hundred bytes in allocation as part of the websocket handshake by using span-based APIs to create the data for the headers used in the websocket protocol, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52022\">dotnet\/runtime#52022<\/a> from <a href=\"https:\/\/github.com\/zlatanov\">@zlatanov<\/a> that saved a few hundred bytes from each <code>ManagedWebSocket<\/code> by avoiding a <code>CancellationTokenSource<\/code> that was overkill for the target scenario. But there were two significant changes worth examining in more detail.<\/p>\n<p>The first is websocket compression. The implementation for this came in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49304\">dotnet\/runtime#49304<\/a> from <a href=\"https:\/\/github.com\/zlatanov\">@zlatanov<\/a>, providing a long-requested feature of <a href=\"https:\/\/datatracker.ietf.org\/doc\/html\/rfc7692\">per-message compression<\/a>. Adding compression increases the CPU cost of sending and receiving, but it decreases the amount of data sent and received, which can in turn decrease the overall cost of communication, especially as networking latency increases. 
As such, the benefit of this one is harder to measure with BenchmarkDotNet, and I&#8217;ll instead just use a console app:<\/p>\n<pre><code class=\"language-C#\">using System.Diagnostics;\r\nusing System.Net;\r\nusing System.Net.Sockets;\r\nusing System.Net.WebSockets;\r\n\r\nReadOnlyMemory&lt;byte&gt; dataToSend;\r\nusing (var hc = new HttpClient())\r\n{\r\n    dataToSend = await hc.GetByteArrayAsync(\"https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt\");\r\n}\r\nMemory&lt;byte&gt; receiveBuffer = new byte[dataToSend.Length];\r\n\r\nforeach (bool compressed in new[] { false, true })\r\n{\r\n    using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));\r\n    listener.Listen();\r\n\r\n    using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);\r\n    client.Connect(listener.LocalEndPoint);\r\n    using Socket server = listener.Accept();\r\n\r\n    using var clientStream = new PassthroughTrackingStream(new NetworkStream(client, ownsSocket: true));\r\n    using var clientWS = WebSocket.CreateFromStream(clientStream, new WebSocketCreationOptions { IsServer = false, DangerousDeflateOptions = compressed ? new WebSocketDeflateOptions() : null });\r\n    using var serverWS = WebSocket.CreateFromStream(new NetworkStream(server, ownsSocket: true), new WebSocketCreationOptions { IsServer = true, DangerousDeflateOptions = compressed ? 
new WebSocketDeflateOptions() : null });\r\n\r\n    var sw = new Stopwatch();\r\n    for (int trial = 0; trial &lt; 5; trial++)\r\n    {\r\n        long before = clientStream.BytesRead;\r\n        sw.Restart();\r\n\r\n        await serverWS.SendAsync(dataToSend, WebSocketMessageType.Binary, true, default);\r\n        while (!(await clientWS.ReceiveAsync(receiveBuffer, default)).EndOfMessage) ;\r\n\r\n        sw.Stop();\r\n        Console.WriteLine($\"Compressed: {compressed,5} Bytes: {clientStream.BytesRead - before,10:N0} Time: {sw.ElapsedMilliseconds:N0}ms\");\r\n    }\r\n}\r\n\r\nsealed class PassthroughTrackingStream : Stream\r\n{\r\n    private readonly Stream _stream;\r\n    public long BytesRead;\r\n\r\n    public PassthroughTrackingStream(Stream stream) =&gt; _stream = stream;\r\n\r\n    public override bool CanWrite =&gt; true;\r\n    public override bool CanRead =&gt; true;\r\n\r\n    public override async ValueTask&lt;int&gt; ReadAsync(Memory&lt;byte&gt; buffer, CancellationToken cancellationToken)\r\n    {\r\n        int n = await _stream.ReadAsync(buffer, cancellationToken);\r\n        BytesRead += n;\r\n        return n;\r\n    }\r\n\r\n    public override ValueTask WriteAsync(ReadOnlyMemory&lt;byte&gt; buffer, CancellationToken cancellationToken) =&gt;\r\n        _stream.WriteAsync(buffer, cancellationToken);\r\n\r\n    protected override void Dispose(bool disposing) =&gt; _stream.Dispose();\r\n    public override bool CanSeek =&gt; false;\r\n    public override long Length =&gt; throw new NotSupportedException();\r\n    public override long Position { get =&gt; throw new NotSupportedException(); set =&gt; throw new NotSupportedException(); }\r\n    public override void Flush() { }\r\n    public override int Read(byte[] buffer, int offset, int count) =&gt; throw new NotSupportedException();\r\n    public override long Seek(long offset, SeekOrigin origin) =&gt; throw new NotSupportedException();\r\n    public override void SetLength(long value) =&gt; 
throw new NotSupportedException();\r\n    public override void Write(byte[] buffer, int offset, int count) =&gt; throw new NotSupportedException();\r\n}<\/code><\/pre>\n<p>This app is creating a loopback socket connection and then layering on top of that a websocket connection created using <code>WebSocket.CreateFromStream<\/code>. But rather than just wrapping the <code>NetworkStream<\/code>s directly, the &#8220;client&#8221; end of the stream that&#8217;s receiving data sent by the &#8220;server&#8221; is wrapping the <code>NetworkStream<\/code> in an intermediary stream that&#8217;s tracking the number of bytes read, which it then exposes for the console app to print. That way, we can see how much data ends up actually being sent. The app is downloading the complete works of Mark Twain from Project Gutenberg, such that each sent message is ~16MB. When I run this, I get results like the following:<\/p>\n<pre><code class=\"language-console\">Compressed: False Bytes: 16,013,945 Time: 42ms\r\nCompressed: False Bytes: 16,013,945 Time: 13ms\r\nCompressed: False Bytes: 16,013,945 Time: 13ms\r\nCompressed: False Bytes: 16,013,945 Time: 12ms\r\nCompressed: False Bytes: 16,013,945 Time: 12ms\r\nCompressed:  True Bytes:  6,326,310 Time: 580ms\r\nCompressed:  True Bytes:  6,325,285 Time: 571ms\r\nCompressed:  True Bytes:  6,325,246 Time: 569ms\r\nCompressed:  True Bytes:  6,325,229 Time: 571ms\r\nCompressed:  True Bytes:  6,325,168 Time: 571ms<\/code><\/pre>\n<p>So, we can see that on this very fast loopback connection, the cost of the operation is dominated by the compression; however, we&#8217;re sending only about 40% as much data. That could be a good tradeoff if communicating over a real network with longer latencies, where the additional few hundred milliseconds to perform the compression and decompression is minimal compared to the cost of sending and receiving an additional 10MB.<\/p>\n<p>The second is amortized zero-allocation websocket receiving. 
In .NET Core 2.1, overloads were added to <code>WebSocket<\/code> for the <code>SendAsync<\/code> and <code>ReceiveAsync<\/code> methods. These overloads accepted <code>ReadOnlyMemory&lt;byte&gt;<\/code> and <code>Memory&lt;byte&gt;<\/code>, respectively, and returned <code>ValueTask<\/code> and <code>ValueTask&lt;ValueWebSocketReceiveResult&gt;<\/code>, respectively. That <code>ValueTask&lt;ValueWebSocketReceiveResult&gt;<\/code> in particular was important because it enabled <code>ReceiveAsync<\/code> to perform in an allocation-free manner when the operation completed synchronously, which would happen if the data being received was already available. When the operation completed asynchronously, however, it would still allocate a <code>Task&lt;ValueWebSocketReceiveResult&gt;<\/code> to back the <code>ValueTask&lt;ValueWebSocketReceiveResult&gt;<\/code>, and even with the advent of <code>IValueTaskSource&lt;T&gt;<\/code>, that allocation still remained, given the complexity of the <code>ReceiveAsync<\/code> method and how difficult it would be to implement the function by hand without the assistance of <code>async<\/code> and <code>await<\/code>. However, as previously discussed, C# 10 and .NET 6 now have opt-in support for pooling with async methods. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56282\">dotnet\/runtime#56282<\/a> included adding <code>[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder&lt;&gt;))]<\/code> to <code>ReceiveAsync<\/code>. 
On my 12-logical-core machine, this code:<\/p>\n<pre><code class=\"language-C#\">private Connection[] _connections = Enumerable.Range(0, 256).Select(_ =&gt; new Connection()).ToArray();\r\nprivate const int Iters = 1_000;\r\n\r\n[Benchmark]\r\npublic Task PingPong() =&gt;\r\n    Task.WhenAll(from c in _connections\r\n                    select Task.WhenAll(\r\n                    Task.Run(async () =&gt;\r\n                    {\r\n                        for (int i = 0; i &lt; Iters; i++)\r\n                        {\r\n                            await c.Server.ReceiveAsync(c.ServerBuffer, c.CancellationToken);\r\n                            await c.Server.SendAsync(c.ServerBuffer, WebSocketMessageType.Binary, endOfMessage: true, c.CancellationToken);\r\n                        }\r\n                    }),\r\n                    Task.Run(async () =&gt;\r\n                    {\r\n                        for (int i = 0; i &lt; Iters; i++)\r\n                        {\r\n                            await c.Client.SendAsync(c.ClientBuffer, WebSocketMessageType.Binary, endOfMessage: true, c.CancellationToken);\r\n                            await c.Client.ReceiveAsync(c.ClientBuffer, c.CancellationToken);\r\n                        }\r\n                    })));\r\n\r\nprivate class Connection\r\n{\r\n    public readonly WebSocket Client, Server;\r\n    public readonly Memory&lt;byte&gt; ClientBuffer = new byte[256];\r\n    public readonly Memory&lt;byte&gt; ServerBuffer = new byte[256];\r\n    public readonly CancellationToken CancellationToken = default;\r\n\r\n    public Connection()\r\n    {\r\n        (Stream Stream1, Stream Stream2) streams = ConnectedStreams.CreateBidirectional();\r\n        Client = WebSocket.CreateFromStream(streams.Stream1, isServer: false, subProtocol: null, Timeout.InfiniteTimeSpan);\r\n        Server = WebSocket.CreateFromStream(streams.Stream2, isServer: true, subProtocol: null, Timeout.InfiniteTimeSpan);\r\n    
}\r\n}<\/code><\/pre>\n<p>then yielded this improvement:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Gen 0<\/th>\n<th style=\"text-align: right;\">Gen 1<\/th>\n<th style=\"text-align: right;\">Gen 2<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>PingPong<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">148.7 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">29750.0000<\/td>\n<td style=\"text-align: right;\">3000.0000<\/td>\n<td style=\"text-align: right;\">250.0000<\/td>\n<td style=\"text-align: right;\">180,238 KB<\/td>\n<\/tr>\n<tr>\n<td>PingPong<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">108.9 ms<\/td>\n<td style=\"text-align: right;\">0.72<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<td style=\"text-align: right;\">249 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Reflection<\/h3>\n<p>Reflection provides a very powerful mechanism for inspecting metadata about .NET assemblies and invoking functionality in those assemblies. That mechanism can incur non-trivial expense, however. While functionality exists to avoid that overhead for repeated calls (e.g. using <code>MethodInfo.CreateDelegate<\/code> to get a strongly-typed delegate directly to the target method), that&#8217;s not always relevant or appropriate. As such, it&#8217;s valuable to reduce the overhead associated with reflection, which .NET 6 does in multiple ways.<\/p>\n<p>A variety of PRs targeted reducing the overhead involved in inspecting attributes on .NET types and members. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54402\">dotnet\/runtime#54402<\/a> significantly reduced the overhead of calling <code>Attribute.GetCustomAttributes<\/code> when specifying that inherited attributes should be included (even if there aren&#8217;t any to inherit); <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44694\">dotnet\/runtime#44694<\/a> from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> reduced the memory allocation associated with <code>Attribute.IsDefined<\/code> via a dedicated code path rather than relegating the core logic to an existing shared method (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45292\">dotnet\/runtime#45292<\/a>, from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> as well, also removed some low-level overhead from filtering attribute records); and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54405\">dotnet\/runtime#54405<\/a> eliminated the allocation from <code>MethodInfo.GetCustomAttributesData<\/code> when there aren&#8217;t any attributes (it&#8217;s common to call this API to check if there are, and thus it&#8217;s helpful to improve performance in the common case where there aren&#8217;t).<\/p>\n<pre><code class=\"language-C#\">private MethodInfo _noAttributes = typeof(C).GetMethod(\"NoAttributes\");\r\nprivate PropertyInfo _hasAttributes = typeof(C).GetProperty(\"HasAttributes\");\r\n\r\n[Benchmark]\r\npublic IList&lt;CustomAttributeData&gt; GetCustomAttributesData() =&gt; _noAttributes.GetCustomAttributesData();\r\n\r\n[Benchmark]\r\npublic bool IsDefined() =&gt; Attribute.IsDefined(_hasAttributes, typeof(ObsoleteAttribute));\r\n\r\n[Benchmark]\r\npublic Attribute[] GetCustomAttributes() =&gt; Attribute.GetCustomAttributes(_hasAttributes, inherit: true);\r\n\r\nclass A { }\r\n\r\nclass C : A\r\n{\r\n    public void NoAttributes() { }\r\n    [Obsolete]\r\n    public bool HasAttributes { get; set; 
}\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetCustomAttributesData<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">329.48 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">168 B<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributesData<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">85.27 ns<\/td>\n<td style=\"text-align: right;\">0.26<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributesData<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">73.58 ns<\/td>\n<td style=\"text-align: right;\">0.22<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributesData<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">69.59 ns<\/td>\n<td style=\"text-align: right;\">0.21<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">640.15 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">144 B<\/td>\n<\/tr>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">399.75 ns<\/td>\n<td style=\"text-align: right;\">0.62<\/td>\n<td style=\"text-align: right;\">136 B<\/td>\n<\/tr>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">292.01 ns<\/td>\n<td style=\"text-align: right;\">0.46<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>IsDefined<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">252.00 ns<\/td>\n<td style=\"text-align: 
right;\">0.39<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributes<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">5,155.93 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,380 B<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributes<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">2,702.26 ns<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<td style=\"text-align: right;\">1,120 B<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributes<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">2,406.51 ns<\/td>\n<td style=\"text-align: right;\">0.47<\/td>\n<td style=\"text-align: right;\">1,056 B<\/td>\n<\/tr>\n<tr>\n<td>GetCustomAttributes<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">446.29 ns<\/td>\n<td style=\"text-align: right;\">0.09<\/td>\n<td style=\"text-align: right;\">128 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Code often looks up information beyond attributes, and it can be helpful for performance to special-case common patterns. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44759\">dotnet\/runtime#44759<\/a> recognizes that reflection-based code will often look at method parameters, which many methods don&#8217;t have, yet <code>GetParameters<\/code> was always allocating a <code>ParameterInfo[]<\/code>, even for zero parameters. A given <code>MethodInfo<\/code> will cache the array, but this would still result in an extra array for every individual method inspected. This PR fixes that.<\/p>\n<p>Reflection is valuable not just for getting metadata but also for invoking members. 
If you ever do an allocation profile for code using reflection to invoke methods, you&#8217;ll likely see a bunch of <code>object[]<\/code> allocations showing up, typically coming from a method named <code>CheckArguments<\/code>. This is part of the runtime&#8217;s type safety validation. Reflection is going to pass the <code>object[]<\/code> of arguments you pass to <code>MethodInfo.Invoke<\/code> to the target method, which means it needs to validate that the arguments are of the right types the method expects&#8230; if they&#8217;re not, it could end up violating type safety by passing type A to a method that instead receives it as a completely unrelated type B, and now all use of that &#8220;B&#8221; is potentially invalid and corrupting. However, if a caller erroneously mutated the array concurrently with the reflection call, such mutation could happen after the type checks occurred, enabling type safety to be violated, anyway. So, the runtime is forced to make a defensive copy of the argument array and then validate the copy to which the caller doesn&#8217;t have access. That&#8217;s the <code>object[]<\/code> that shows up in these traces. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50814\">dotnet\/runtime#50814<\/a> addresses this by recognizing that most methods have only a few parameters, and special-cases methods with up to four parameters to instead use a stack-allocated <code>Span&lt;object&gt;<\/code> rather than a heap-allocated <code>object[]<\/code> for storing that defensive copy.<\/p>\n<pre><code class=\"language-C#\">private MethodInfo _method = typeof(Program).GetMethod(\"M\");\r\n\r\npublic void M(int arg1, string arg2) { }\r\n\r\n[Benchmark]\r\npublic void Invoke() =&gt; _method.Invoke(this, new object[] { 1, \"two\" });<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Invoke<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">195.5 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">104 B<\/td>\n<\/tr>\n<tr>\n<td>Invoke<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">156.0 ns<\/td>\n<td style=\"text-align: right;\">0.80<\/td>\n<td style=\"text-align: right;\">104 B<\/td>\n<\/tr>\n<tr>\n<td>Invoke<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">141.0 ns<\/td>\n<td style=\"text-align: right;\">0.72<\/td>\n<td style=\"text-align: right;\">104 B<\/td>\n<\/tr>\n<tr>\n<td>Invoke<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">123.1 ns<\/td>\n<td style=\"text-align: right;\">0.63<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Another very common form of dynamic invocation is creating new instances via <code>Activator.CreateInstance<\/code>, which is usable directly but is also employed by the C# compiler to implement the <code>new()<\/code> constraint on generic parameters. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/32520\">dotnet\/runtime#32520<\/a> overhauled the <code>Activator.CreateInstance<\/code> implementation in the runtime, employing a per-type cache of function pointers that can be used to quickly allocate an uninitialized object of the relevant type and invoke its constructor.<\/p>\n<pre><code class=\"language-C#\">private T Create&lt;T&gt;() where T : new() =&gt; new T();\r\n\r\n[Benchmark]\r\npublic Program Create() =&gt; Create&lt;Program&gt;();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Create<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">49.496 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<tr>\n<td>Create<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">28.296 ns<\/td>\n<td style=\"text-align: right;\">0.57<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<tr>\n<td>Create<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">26.350 ns<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<tr>\n<td>Create<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">9.439 ns<\/td>\n<td style=\"text-align: right;\">0.19<\/td>\n<td style=\"text-align: right;\">24 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Another common operation is creating closed generic types from open ones, e.g. given the type for <code>List&lt;T&gt;<\/code> creating a type for <code>List&lt;int&gt;<\/code>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45137\">dotnet\/runtime#45137<\/a> special-cased the most common case of having just one type parameter in order to optimize that path, while also avoiding an extra <code>GetGenericArguments<\/code> call internally for all arities.<\/p>\n<pre><code class=\"language-C#\">private Type[] _oneRef = new[] { typeof(string) };\r\nprivate Type[] _twoValue = new[] { typeof(int), typeof(int) };\r\n\r\n[Benchmark] public Type OneRefType() =&gt; typeof(List&lt;&gt;).MakeGenericType(_oneRef);\r\n[Benchmark] public Type TwoValueType() =&gt; typeof(Dictionary&lt;,&gt;).MakeGenericType(_twoValue);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>OneRefType<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">363.1 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">128 B<\/td>\n<\/tr>\n<tr>\n<td>OneRefType<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">266.8 ns<\/td>\n<td style=\"text-align: right;\">0.74<\/td>\n<td style=\"text-align: right;\">128 B<\/td>\n<\/tr>\n<tr>\n<td>OneRefType<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">248.7 ns<\/td>\n<td style=\"text-align: right;\">0.69<\/td>\n<td style=\"text-align: right;\">128 B<\/td>\n<\/tr>\n<tr>\n<td>OneRefType<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">171.6 ns<\/td>\n<td style=\"text-align: right;\">0.47<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>TwoValueType<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">418.9 ns<\/td>\n<td style=\"text-align: 
right;\">1.00<\/td>\n<td style=\"text-align: right;\">160 B<\/td>\n<\/tr>\n<tr>\n<td>TwoValueType<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">292.3 ns<\/td>\n<td style=\"text-align: right;\">0.70<\/td>\n<td style=\"text-align: right;\">160 B<\/td>\n<\/tr>\n<tr>\n<td>TwoValueType<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">290.5 ns<\/td>\n<td style=\"text-align: right;\">0.69<\/td>\n<td style=\"text-align: right;\">160 B<\/td>\n<\/tr>\n<tr>\n<td>TwoValueType<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">215.0 ns<\/td>\n<td style=\"text-align: right;\">0.51<\/td>\n<td style=\"text-align: right;\">120 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Finally, sometimes optimizations are all about deleting code and just calling something else that already exists. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42891\">dotnet\/runtime#42891<\/a> just changed the implementation of one helper in the runtime to call another existing helper, in order to make <code>Type.IsPrimitive<\/code> measurably faster:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\n[Arguments(typeof(int))]\r\npublic bool IsPrimitive(Type type) =&gt; type.IsPrimitive;<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>IsPrimitive<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">5.021 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>IsPrimitive<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">3.184 ns<\/td>\n<td style=\"text-align: right;\">0.63<\/td>\n<\/tr>\n<tr>\n<td>IsPrimitive<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">3.032 ns<\/td>\n<td style=\"text-align: right;\">0.60<\/td>\n<\/tr>\n<tr>\n<td>IsPrimitive<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.376 ns<\/td>\n<td style=\"text-align: 
right;\">0.47<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, reflection extends beyond just the core reflection APIs, and a number of PRs have gone into improving areas of reflection higher in the stack. <code>DispatchProxy<\/code>, for example. <code>DispatchProxy<\/code> provides an interface-based alternative to the older remoting-based <code>RealProxy<\/code> (<a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/migrating-realproxy-usage-to-dispatchproxy\/\">Migrating RealProxy Usage to DispatchProxy<\/a> provides a good description). It utilizes reflection emit to generate IL at run-time, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47134\">dotnet\/runtime#47134<\/a> optimizes both that process and the generated code in such a way that it saves several hundred bytes of allocation per method invocation on a <code>DispatchProxy<\/code>.<\/p>\n<h3>Collections and LINQ<\/h3>\n<p>Every .NET release has seen the core collection types and LINQ get faster and faster. Even as a lot of the low-hanging fruit was picked in previous releases, developers contributing to .NET 6 have still managed to find meaningful improvements, some in the form of optimizing existing APIs, and some in the form of new APIs developers can use to make their own code fly.<\/p>\n<p>Improvements to <code>Dictionary&lt;TKey, TValue&gt;<\/code> are always exciting, as it&#8217;s used <em>everywhere<\/em>, and performance improvements to it have a way of &#8220;moving the needle&#8221; on a variety of workloads. One improvement to <code>Dictionary&lt;TKey, TValue&gt;<\/code> in .NET 6 comes from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/41944\">dotnet\/runtime#41944<\/a>. 
The PR improves the performance of creating one dictionary from another, by enabling the common case of the source dictionary and the new dictionary sharing a key comparer to copy the underlying buckets without rehashing.<\/p>\n<pre><code class=\"language-C#\">private IEnumerable&lt;KeyValuePair&lt;string, int&gt;&gt; _dictionary = Enumerable.Range(0, 100).ToDictionary(i =&gt; i.ToString(), StringComparer.OrdinalIgnoreCase);\r\n\r\n[Benchmark]\r\npublic Dictionary&lt;string, int&gt; Clone() =&gt; new Dictionary&lt;string, int&gt;(_dictionary);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Clone<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">3.224 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Clone<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">2.880 us<\/td>\n<td style=\"text-align: right;\">0.89<\/td>\n<\/tr>\n<tr>\n<td>Clone<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1.685 us<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45659\">dotnet\/runtime#45659<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56634\">dotnet\/runtime#56634<\/a>, <code>SortedDictionary&lt;TKey, TValue&gt;<\/code> also gains a similar optimization:<\/p>\n<pre><code class=\"language-C#\">private IDictionary&lt;string, int&gt; _dictionary = new SortedDictionary&lt;string, int&gt;(Enumerable.Range(0, 100).ToDictionary(i =&gt; i.ToString(), StringComparer.OrdinalIgnoreCase));\r\n\r\n[Benchmark]\r\npublic SortedDictionary&lt;string, int&gt; Clone() =&gt; new SortedDictionary&lt;string, int&gt;(_dictionary);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: 
right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Clone<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">69.546 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Clone<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">54.560 us<\/td>\n<td style=\"text-align: right;\">0.78<\/td>\n<\/tr>\n<tr>\n<td>Clone<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">53.196 us<\/td>\n<td style=\"text-align: right;\">0.76<\/td>\n<\/tr>\n<tr>\n<td>Clone<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.330 us<\/td>\n<td style=\"text-align: right;\">0.03<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49388\">dotnet\/runtime#49388<\/a> from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54611\">dotnet\/runtime#54611<\/a> from <a href=\"https:\/\/github.com\/Sergio0694\">@Sergio0694<\/a> are examples of new APIs that developers can use with dictionaries when they want to eke out that last mile of performance. These APIs are defined on the <code>CollectionsMarshal<\/code> class as they provide low-level access to internals of the dictionary, returning a ref into the <code>Dictionary&lt;TKey, TValue&gt;<\/code>&#8217;s data structures; thus, you need to be careful when using them, but they can measurably improve performance in specific situations. <code>CollectionsMarshal.GetValueRefOrNullRef<\/code> returns a <code>ref TValue<\/code> that will either point to an existing entry in the dictionary or be a null reference (e.g. <code>Unsafe.NullRef&lt;T&gt;()<\/code>) if the key could not be found. And <code>CollectionsMarshal.GetValueRefOrAddDefault<\/code> returns a <code>ref TValue?<\/code>, returning a ref to the value if the key could be found, or adding an empty entry and returning a ref to it, otherwise. 
These can be used to avoid duplicate lookups as well as avoid potentially expensive struct value copies.<\/p>\n<pre><code class=\"language-C#\">private Dictionary&lt;int, int&gt; _counts = new Dictionary&lt;int, int&gt;();\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void AddOld()\r\n{\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        _counts[i] = _counts.TryGetValue(i, out int count) ? count + 1 : 1;\r\n    }\r\n}\r\n\r\n[Benchmark]\r\npublic void AddNew()\r\n{\r\n    for (int i = 0; i &lt; 10_000; i++)\r\n    {\r\n        CollectionsMarshal.GetValueRefOrAddDefault(_counts, i, out _)++;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AddOld<\/td>\n<td style=\"text-align: right;\">95.39 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AddNew<\/td>\n<td style=\"text-align: right;\">49.85 us<\/td>\n<td style=\"text-align: right;\">0.52<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>ImmutableSortedSet&lt;T&gt;<\/code> and <code>ImmutableList&lt;T&gt;<\/code> indexing also get faster, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53266\">dotnet\/runtime#53266<\/a> from <a href=\"https:\/\/github.com\/L2\">@L2<\/a>. Indexing into these collections performs a binary search through a tree of nodes, and each layer of the traversal was performing a range check on the index. 
For every check other than the one at the entry point, however, that range validation is redundant and can be removed, which is exactly what the PR does:<\/p>\n<pre><code class=\"language-C#\">private ImmutableList&lt;int&gt; _list = ImmutableList.CreateRange(Enumerable.Range(0, 100_000));\r\n\r\n[Benchmark]\r\npublic int Item() =&gt; _list[1];<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Item<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">17.468 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Item<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">16.296 ns<\/td>\n<td style=\"text-align: right;\">0.93<\/td>\n<\/tr>\n<tr>\n<td>Item<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">9.457 ns<\/td>\n<td style=\"text-align: right;\">0.54<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>ObservableCollection&lt;T&gt;<\/code> also improves in .NET 6, specifically due to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54899\">dotnet\/runtime#54899<\/a>, which reduces the allocations involved in creating <code>NotifyCollectionChangedEventArgs<\/code> (as such, this isn&#8217;t actually specific to <code>ObservableCollection&lt;T&gt;<\/code> and will help other systems that use the event arguments). 
The crux of the change is introducing an internal <code>SingleItemReadOnlyList<\/code> that&#8217;s used when an <code>IList<\/code> is needed to represent a single item; this replaces allocating an <code>object[]<\/code> that&#8217;s then wrapped in a <code>ReadOnlyList<\/code>.<\/p>\n<pre><code class=\"language-C#\">private ObservableCollection&lt;int&gt; _collection = new ObservableCollection&lt;int&gt;();\r\n\r\n[Benchmark]\r\npublic void ClearAdd()\r\n{\r\n    _collection.Clear();\r\n    for (int i = 0; i &lt; 100; i++)\r\n    {\r\n        _collection.Add(i);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>ClearAdd<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">4.014 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">17 KB<\/td>\n<\/tr>\n<tr>\n<td>ClearAdd<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">3.104 us<\/td>\n<td style=\"text-align: right;\">0.78<\/td>\n<td style=\"text-align: right;\">13 KB<\/td>\n<\/tr>\n<tr>\n<td>ClearAdd<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.471 us<\/td>\n<td style=\"text-align: right;\">0.62<\/td>\n<td style=\"text-align: right;\">9 KB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There have been a variety of other changes, such as <code>HashSet&lt;T&gt;<\/code> shrinking in size by a reference field, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49483\">dotnet\/runtime#49483<\/a>; <code>ConcurrentQueue&lt;T&gt;<\/code> and <code>ConcurrentBag&lt;T&gt;<\/code> avoiding some unnecessary writes when T doesn&#8217;t contain any references, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53438\">dotnet\/runtime#53438<\/a>; new <code>EnsureCapacity<\/code> APIs for 
<code>List&lt;T&gt;<\/code>, <code>Stack&lt;T&gt;<\/code>, and <code>Queue&lt;T&gt;<\/code>, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47149\">dotnet\/runtime#47149<\/a> from <a href=\"https:\/\/github.com\/lateapexearlyspeed\">@lateapexearlyspeed<\/a>; and a brand new <code>PriorityQueue&lt;TElement, TPriority&gt;<\/code>, which was initially added in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46009\">dotnet\/runtime#46009<\/a> by <a href=\"https:\/\/github.com\/pgolebiowski\">@pgolebiowski<\/a> and then subsequently optimized further in PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48315\">dotnet\/runtime#48315<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48324\">dotnet\/runtime#48324<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48346\">dotnet\/runtime#48346<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50065\">dotnet\/runtime#50065<\/a>.<\/p>\n<p>Having mentioned <code>HashSet&lt;T&gt;<\/code>, <code>HashSet&lt;T&gt;<\/code> gets a new customer in .NET 6: LINQ. 
Previous releases of LINQ brought with them their own internal <code>Set&lt;T&gt;<\/code> implementation, but in .NET 6 <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49591\">dotnet\/runtime#49591<\/a> ripped that out and replaced it with the built-in <code>HashSet&lt;T&gt;<\/code>, letting LINQ benefit from the myriad performance improvements that have gone into <code>HashSet&lt;T&gt;<\/code> in the last few years (but especially in .NET 5), while also reducing code duplication.<\/p>\n<pre><code class=\"language-C#\">private IEnumerable&lt;string&gt; _data = Enumerable.Range(0, 100_000).Select(i =&gt; i.ToString()).ToArray();\r\n\r\n[Benchmark]\r\npublic int DistinctCount() =&gt; _data.Distinct().Count();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DistinctCount<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">5.154 ms<\/td>\n<td style=\"text-align: right;\">1.04<\/td>\n<td style=\"text-align: right;\">5 MB<\/td>\n<\/tr>\n<tr>\n<td>DistinctCount<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">2.626 ms<\/td>\n<td style=\"text-align: right;\">0.53<\/td>\n<td style=\"text-align: right;\">2 MB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Enumerable.SequenceEqual<\/code> has also been accelerated when both enumerables are arrays, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48287\">dotnet\/runtime#48287<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48677\">dotnet\/runtime#48677<\/a>. 
The latter PR adds a <code>MemoryExtensions.SequenceEqual<\/code> overload that accepts an <code>IEqualityComparer&lt;T&gt;<\/code> (the existing overloads constrain <code>T<\/code> to being <code>IEquatable&lt;T&gt;<\/code>), which enables <code>Enumerable.SequenceEqual<\/code> to delegate to the span-based method and obtain vectorization of the comparison &#8220;for free&#8221; when the <code>T<\/code> used is amenable.<\/p>\n<pre><code class=\"language-C#\">private IEnumerable&lt;int&gt; _e1 = Enumerable.Range(0, 1_000_000).ToArray();\r\nprivate IEnumerable&lt;int&gt; _e2 = Enumerable.Range(0, 1_000_000).ToArray();\r\n\r\n[Benchmark]\r\npublic bool SequenceEqual() =&gt; _e1.SequenceEqual(_e2);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>SequenceEqual<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">10,822.6 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>SequenceEqual<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">5,421.1 us<\/td>\n<td style=\"text-align: right;\">0.50<\/td>\n<\/tr>\n<tr>\n<td>SequenceEqual<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">150.2 us<\/td>\n<td style=\"text-align: right;\">0.01<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><code>Enumerable.Min&lt;T&gt;<\/code> and <code>Enumerable.Max&lt;T&gt;<\/code> have also improved, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48273\">dotnet\/runtime#48273<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48289\">dotnet\/runtime#48289<\/a> (and the aforementioned JIT improvements that recognize <code>Comparer&lt;T&gt;.Default<\/code> as an intrinsic). 
By special-casing the comparer being <code>Comparer&lt;T&gt;.Default<\/code>, a dedicated loop could then be written explicitly using <code>Comparer&lt;T&gt;.Default<\/code> rather than going through the <code>comparer<\/code> parameter, which enables all of the calls through <code>Comparer&lt;T&gt;.Default.Compare<\/code> to devirtualize when <code>T<\/code> is a value type.<\/p>\n<pre><code class=\"language-C#\">private TimeSpan[] _values = Enumerable.Range(0, 1_000_000).Select(i =&gt; TimeSpan.FromMilliseconds(i)).ToArray();\r\n\r\n[Benchmark]\r\npublic TimeSpan Max() =&gt; _values.Max();\r\n\r\n[Benchmark]\r\npublic TimeSpan Min() =&gt; _values.Min();<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Max<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">5.984 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Max<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">4.926 ms<\/td>\n<td style=\"text-align: right;\">0.82<\/td>\n<\/tr>\n<tr>\n<td>Max<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">4.222 ms<\/td>\n<td style=\"text-align: right;\">0.71<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td style=\"text-align: right;\"><\/td>\n<td style=\"text-align: right;\"><\/td>\n<\/tr>\n<tr>\n<td>Min<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">5.917 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>Min<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">5.207 ms<\/td>\n<td style=\"text-align: right;\">0.88<\/td>\n<\/tr>\n<tr>\n<td>Min<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">4.291 ms<\/td>\n<td style=\"text-align: right;\">0.73<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In addition, there have been several new APIs added to LINQ in .NET 6. 
A new <code>Enumerable.Zip<\/code> overload accepting three rather than only two sources was added in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47147\">dotnet\/runtime#47147<\/a> from <a href=\"https:\/\/github.com\/huoyaoyuan\">@huoyaoyuan<\/a>, making it both easier and faster to combine three sources:<\/p>\n<pre><code class=\"language-C#\">private IEnumerable&lt;int&gt; _e1 = Enumerable.Range(0, 1_000);\r\nprivate IEnumerable&lt;int&gt; _e2 = Enumerable.Range(0, 1_000);\r\nprivate IEnumerable&lt;int&gt; _e3 = Enumerable.Range(0, 1_000);\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void Old()\r\n{\r\n    IEnumerable&lt;(int, int, int)&gt; zipped = _e1.Zip(_e2).Zip(_e3, (x, y) =&gt; (x.First, x.Second, y));\r\n    foreach ((int, int, int) values in zipped)\r\n    {\r\n    }\r\n}\r\n\r\n[Benchmark]\r\npublic void New()\r\n{\r\n    IEnumerable&lt;(int, int, int)&gt; zipped = _e1.Zip(_e2, _e3);\r\n    foreach ((int, int, int) values in zipped)\r\n    {\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">20.50 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">304 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">14.88 us<\/td>\n<td style=\"text-align: right;\">0.73<\/td>\n<td style=\"text-align: right;\">232 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48559\">dotnet\/runtime#48559<\/a> from <a href=\"https:\/\/github.com\/Dixin\">@Dixin<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48634\">dotnet\/runtime#48634<\/a> add a new overload of <code>Enumerable.Take<\/code> that accepts a <code>Range<\/code> (as well as an <code>ElementAt<\/code> that takes an <code>Index<\/code>). 
In addition to then enabling the C# 8 range syntax to be used with <code>Take<\/code>, it also reduces some overheads associated with needing to combine multiple existing combinators to achieve the same thing.<\/p>\n<pre><code class=\"language-C#\">private static IEnumerable&lt;int&gt; Range(int count)\r\n{\r\n    for (int i = 0; i &lt; count; i++) yield return i;\r\n}\r\n\r\nprivate IEnumerable&lt;int&gt; _e = Range(10_000);\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void Old()\r\n{\r\n    foreach (int i in _e.Skip(1000).Take(10)) { }\r\n}\r\n\r\n[Benchmark]\r\npublic void New()\r\n{\r\n    foreach (int i in _e.Take(1000..1010)) { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">2.954 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">152 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">2.935 us<\/td>\n<td style=\"text-align: right;\">0.99<\/td>\n<td style=\"text-align: right;\">96 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48239\">dotnet\/runtime#48239<\/a> introduced <code>Enumerable.TryGetNonEnumeratedCount<\/code>, which enables getting the count of the number of items in an enumerable if that count can be determined quickly. This can be useful to avoid the overhead of resizes when presizing a collection that will be used to store the contents of the enumerable.<\/p>\n<p>Lastly, it&#8217;s somewhat rare today to see code written against instances of <code>Array<\/code> rather than a strongly-typed array (e.g. <code>int[]<\/code> or <code>T[]<\/code>), but such code does exist. 
We don&#8217;t need to optimize heavily for such code, but sometimes the stars align and efforts to simplify such code actually make it significantly faster as well, as is the case with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51351\">dotnet\/runtime#51351<\/a>, which simplified the implementation of the non-generic <code>ArrayEnumerator<\/code>, and in doing so made code like this much faster:<\/p>\n<pre><code class=\"language-C#\">private Array _array = Enumerable.Range(0, 1000).Select(i =&gt; new object()).ToArray();\r\n\r\n[Benchmark]\r\npublic int Count()\r\n{\r\n    int count = 0;\r\n    foreach (object o in _array) count++;\r\n    return count;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Count<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">14.992 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">14.134 us<\/td>\n<td style=\"text-align: right;\">0.94<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">12.866 us<\/td>\n<td style=\"text-align: right;\">0.86<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<tr>\n<td>Count<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">5.778 us<\/td>\n<td style=\"text-align: right;\">0.39<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Cryptography<\/h3>\n<p>Let&#8217;s turn to cryptography. A lot of work has gone into crypto for the .NET 6 release, mostly functional. 
However, there have been a handful of impactful performance improvements in the space.<\/p>\n<p><code>CryptoStream<\/code> was improved over the course of multiple PRs. When async support was initially added to <code>CryptoStream<\/code>, it was decided that, because <code>CryptoStream<\/code> does compute-intensive work, it shouldn&#8217;t block the caller of the asynchronous method; as a result, <code>CryptoStream<\/code> was originally written to forcibly queue encryption and decryption operations to the thread pool. However, typical usage is actually very fast and doesn&#8217;t warrant a thread hop, and even if it wasn&#8217;t fast, guidance has evolved over the years such that now the recommendation wouldn&#8217;t be to queue, anyway. So, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45150\">dotnet\/runtime#45150<\/a> removed that queueing. On top of that, <code>CryptoStream<\/code> hadn&#8217;t really kept up with the times, and when new <code>Memory<\/code>\/<code>ValueTask<\/code>-based <code>ReadAsync<\/code> and <code>WriteAsync<\/code> overloads were introduced on <code>Stream<\/code> in .NET Core 2.1, <code>CryptoStream<\/code> didn&#8217;t provide overrides; for .NET 6, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47207\">dotnet\/runtime#47207<\/a> from <a href=\"https:\/\/github.com\/NewellClark\">@NewellClark<\/a> addresses that deficiency by adding the appropriate overrides. As in the earlier discussion of <code>DeflateStream<\/code>, <code>CryptoStream<\/code> now also can complete a read operation once at least one byte of output is available and can be used for zero-byte reads.<\/p>\n<p><code>CryptoStream<\/code> works with arbitrary implementations of <code>ICryptoTransform<\/code>, of which one is <code>ToBase64Transform<\/code>; not exactly cryptography, but it makes it easy to Base64-encode a stream of data. 
<code>ICryptoTransform<\/code> is an interesting interface, providing a <code>CanTransformMultipleBlocks<\/code> property that dictates whether an implementation&#8217;s <code>TransformBlock<\/code> and <code>TransformFinalBlock<\/code> can transform just one or multiple &#8220;blocks&#8221; of data at a time. The interface expects that input is processed in blocks of a particular fixed number of input bytes which then yield a fixed number of output bytes, e.g. <code>ToBase64Transform<\/code> encodes blocks of three input bytes into blocks of four output bytes. Historically, <code>ToBase64Transform<\/code> returned <code>false<\/code> from <code>CanTransformMultipleBlocks<\/code>, which then forced <code>CryptoStream<\/code> to take the slower path of processing only three input bytes at a time. <code>ToBase64Transform<\/code> uses <code>Base64.EncodeToUtf8<\/code>, which is vectorized for fast processing, but three input bytes per call is too small to take advantage of the vectorized code paths, which ended up making <code>ToBase64Transform<\/code> quite slow. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55055\">dotnet\/runtime#55055<\/a> fixed this by teaching <code>ToBase64Transform<\/code> how to process multiple blocks, which in turn has a big impact on its performance when used with <code>CryptoStream<\/code>.<\/p>\n<pre><code class=\"language-C#\">private byte[] _data = Enumerable.Range(0, 10_000_000).Select(i =&gt; (byte)i).ToArray();\r\nprivate MemoryStream _destination = new MemoryStream();\r\n\r\n[Benchmark]\r\npublic async Task Encode()\r\n{\r\n    _destination.Position = 0;\r\n    using (var toBase64 = new ToBase64Transform())\r\n    using (var stream = new CryptoStream(_destination, toBase64, CryptoStreamMode.Write, leaveOpen: true))\r\n    {\r\n        await stream.WriteAsync(_data, 0, _data.Length);\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Encode<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">329.871 ms<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">213,976,944 B<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET Core 3.1<\/td>\n<td style=\"text-align: right;\">251.986 ms<\/td>\n<td style=\"text-align: right;\">0.765<\/td>\n<td style=\"text-align: right;\">213,334,112 B<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">146.058 ms<\/td>\n<td style=\"text-align: right;\">0.443<\/td>\n<td style=\"text-align: right;\">974 B<\/td>\n<\/tr>\n<tr>\n<td>Encode<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">1.998 ms<\/td>\n<td style=\"text-align: right;\">0.006<\/td>\n<td style=\"text-align: right;\">300 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Even as <code>CryptoStream<\/code> improves in .NET 6, sometimes you don&#8217;t need the power of a <code>Stream<\/code> and 
instead just want something simple and fast to handle encrypting and decrypting data you already have in memory. For that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52510\">dotnet\/runtime#52510<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55184\">dotnet\/runtime#55184<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55480\">dotnet\/runtime#55480<\/a> introduced new &#8220;one shot&#8221; <code>EncryptCbc<\/code>, <code>EncryptCfb<\/code>, <code>EncryptEcb<\/code>, <code>DecryptCbc<\/code>, <code>DecryptCfb<\/code>, and <code>DecryptEcb<\/code> methods on <code>SymmetricAlgorithm<\/code> (along with some protected virtual methods these delegate to) that support encrypting and decrypting <code>byte[]<\/code>s and <code>ReadOnlySpan&lt;byte&gt;<\/code>s without having to go through a <code>Stream<\/code>. This not only leads to simpler code when you already have the data to process, it&#8217;s also faster.<\/p>\n<pre><code class=\"language-C#\">private byte[] _key, _iv, _ciphertext;\r\n\r\n[GlobalSetup]\r\npublic void Setup()\r\n{\r\n    using Aes aes = Aes.Create();\r\n    _key = aes.Key;\r\n    _iv = aes.IV;\r\n    _ciphertext = aes.EncryptCbc(Encoding.UTF8.GetBytes(\"This is a test.  
This is only a test.\"), _iv);\r\n}\r\n\r\n[Benchmark(Baseline = true)]\r\npublic byte[] Old()\r\n{\r\n    using Aes aes = Aes.Create();\r\n\r\n    aes.Key = _key;\r\n    aes.IV = _iv;\r\n    aes.Padding = PaddingMode.PKCS7;\r\n    aes.Mode = CipherMode.CBC;\r\n\r\n    using MemoryStream destination = new MemoryStream();\r\n    using ICryptoTransform transform = aes.CreateDecryptor();\r\n    using CryptoStream cryptoStream = new CryptoStream(destination, transform, CryptoStreamMode.Write);\r\n\r\n    cryptoStream.Write(_ciphertext);\r\n    cryptoStream.FlushFinalBlock();\r\n\r\n    return destination.ToArray();\r\n}\r\n\r\n[Benchmark]\r\npublic byte[] New()\r\n{\r\n    using Aes aes = Aes.Create();\r\n\r\n    aes.Key = _key;\r\n\r\n    return aes.DecryptCbc(_ciphertext, _iv);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">1.657 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">1,320 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">1.073 us<\/td>\n<td style=\"text-align: right;\">0.65<\/td>\n<td style=\"text-align: right;\">664 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>I previously mentioned improvements to <code>System.Random<\/code> in .NET 6. That&#8217;s for non-cryptographically-secure randomness. If you need cryptographically-secure randomness, <code>System.Security.Cryptography.RandomNumberGenerator<\/code> is your friend. This type has existed for years but it&#8217;s been receiving more love over the last several .NET releases. 
For example, <code>RandomNumberGenerator<\/code> is instantiable via the <code>Create<\/code> method, and instance methods do expose the full spread of the type&#8217;s functionality, but there&#8217;s no actual need for it to be its own instance, as the underlying OS objects used now on all platforms are thread-safe and implemented in a scalable manner. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43221\">dotnet\/runtime#43221<\/a> added a static <code>GetBytes<\/code> method that makes it simple and a bit faster to get a <code>byte[]<\/code> filled with cryptographically-strong random data:<\/p>\n<pre><code class=\"language-C#\">[Benchmark(Baseline = true)]\r\npublic byte[] Old()\r\n{\r\n    using (RandomNumberGenerator rng = RandomNumberGenerator.Create())\r\n    {\r\n        byte[] buffer = new byte[8];\r\n        rng.GetBytes(buffer);\r\n        return buffer;\r\n    }\r\n}\r\n\r\n[Benchmark]\r\npublic byte[] New()\r\n{\r\n    return RandomNumberGenerator.GetBytes(8);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">80.46 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">78.10 ns<\/td>\n<td style=\"text-align: right;\">0.97<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>However, the <code>Old<\/code> case here is already better on .NET 6 than on previous releases. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52495\">dotnet\/runtime#52495<\/a> recognizes that there&#8217;s little benefit to <code>Create<\/code> creating a new instance, and converts it into a singleton.<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic byte[] GetBytes()\r\n{\r\n    using (RandomNumberGenerator rng = RandomNumberGenerator.Create())\r\n    {\r\n        byte[] buffer = new byte[8];\r\n        rng.GetBytes(buffer);\r\n        return buffer;\r\n    }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetBytes<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">948.94 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">514 B<\/td>\n<\/tr>\n<tr>\n<td>GetBytes<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">85.35 ns<\/td>\n<td style=\"text-align: right;\">0.09<\/td>\n<td style=\"text-align: right;\">56 B<\/td>\n<\/tr>\n<tr>\n<td>GetBytes<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">80.12 ns<\/td>\n<td style=\"text-align: right;\">0.08<\/td>\n<td style=\"text-align: right;\">32 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The addition of the static <code>GetBytes<\/code> method continues a theme throughout crypto of exposing more &#8220;one-shot&#8221; APIs as static helpers. The <code>Rfc2898DeriveBytes<\/code> class enables code to derive bytes from passwords, and historically this has been done by instantiating an instance of this class and calling <code>GetBytes<\/code>. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48107\">dotnet\/runtime#48107<\/a> adds static <code>Pbkdf2<\/code> methods that use the PBKDF2 (Password-Based Key Derivation Function 2) key-derivation function to generate the requested bytes without explicitly creating an instance; this, in turn, enables the implementation to use any &#8220;one-shot&#8221; APIs provided by the underlying operating system, e.g. those from CommonCrypto on macOS.<\/p>\n<pre><code class=\"language-C#\">private byte[] _salt = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8 };\r\n\r\n[Benchmark(Baseline = true)]\r\npublic byte[] Old()\r\n{\r\n    using Rfc2898DeriveBytes db = new Rfc2898DeriveBytes(\"my super strong password\", _salt, 1000, HashAlgorithmName.SHA256);\r\n    return db.GetBytes(16);\r\n}\r\n\r\n[Benchmark]\r\npublic byte[] New()\r\n{\r\n    return Rfc2898DeriveBytes.Pbkdf2(\"my super strong password\", _salt, 1000, HashAlgorithmName.SHA256, 16);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">637.5 us<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">561 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">554.9 us<\/td>\n<td style=\"text-align: right;\">0.87<\/td>\n<td style=\"text-align: right;\">73 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Other improvements in crypto include avoiding unnecessary zero&#8217;ing for padding in symmetric encryption (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52465\">dotnet\/runtime#52465<\/a>); using the span-based support with <code>IncrementalHash.CreateHMAC<\/code> to avoid some <code>byte[]<\/code> allocations (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43541\">dotnet\/runtime#43541<\/a>); caching failed lookups in addition to 
successful lookups in <code>OidLookup.ToOid<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46819\">dotnet\/runtime#46819<\/a>); using stack allocation in signature generation to avoid unnecessary allocation (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46893\">dotnet\/runtime#46893<\/a>); using better OS APIs on macOS for RSA\/ECC keys (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52759\">dotnet\/runtime#52759<\/a> from <a href=\"https:\/\/github.com\/filipnavara\">@filipnavara<\/a>); and avoiding closures in the interop layers of <code>X509Certificate<\/code>s on both Unix (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50511\">dotnet\/runtime#50511<\/a>) and Windows (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50376\">dotnet\/runtime#50376<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50377\">dotnet\/runtime#50377<\/a>). One of my favorites, simply because it eliminates an annoyance I hit now and again, is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53129\">dotnet\/runtime#53129<\/a> from <a href=\"https:\/\/github.com\/hrrrrustic\">@hrrrrustic<\/a>, which adds an implementation of the generic <code>IEnumerable&lt;T&gt;<\/code> to each of several <code>X509Certificate<\/code>-related collections that previously only implemented the non-generic <code>IEnumerable<\/code>. 
This in turn removes the common need to use LINQ&#8217;s <code>OfType&lt;X509Certificate2&gt;<\/code> when enumerating <code>X509CertificateCollection<\/code>, both improving maintainability and reducing overhead.<\/p>\n<pre><code class=\"language-C#\">private X509Certificate2Collection _certs;\r\n\r\n[GlobalSetup]\r\npublic void Setup()\r\n{\r\n    using var store = new X509Store(StoreLocation.CurrentUser);\r\n    _certs = store.Certificates;\r\n}\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void Old()\r\n{\r\n    foreach (string s in _certs.OfType&lt;X509Certificate2&gt;().Select(c =&gt; c.Subject)) { }\r\n}\r\n\r\n[Benchmark]\r\npublic void New()\r\n{\r\n    foreach (string s in _certs.Select(c =&gt; c.Subject)) { }\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">63.45 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">160 B<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">53.94 ns<\/td>\n<td style=\"text-align: right;\">0.85<\/td>\n<td style=\"text-align: right;\">128 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>&#8220;Peanut Butter&#8221;<\/h3>\n<p>As has been shown in this post and in those that I&#8217;ve written for previous versions, there have been literally thousands of PRs into .NET over the last several years to improve its performance. Many of these changes on their own have a profound and very measurable impact to some scenario. However, a fair number of the changes are what we lovingly refer to as &#8220;peanut butter&#8221;, a thin layer of tiny performance-impacting changes that individually aren&#8217;t hugely meaningful but that over time add up to bigger impact. Sometimes these changes make one specific change in one place (e.g. 
removing one allocation), and it&#8217;s the aggregate of all such changes that helps .NET to get better and better. Sometimes it&#8217;s a pattern of change applied en masse across the stack. There are dozens of such changes in .NET 6, and I&#8217;ll walk through some of them here.<\/p>\n<p>One of my favorite sets of changes, and a pattern which will hopefully be codified in a future release by an analyzer, shows up in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49958\">dotnet\/runtime#49958<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50225\">dotnet\/runtime#50225<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49969\">dotnet\/runtime#49969<\/a>. These PRs changed over 2300 internal and private classes across dotnet\/runtime to be sealed. Why does that matter? For some of the types, it won&#8217;t, but there are multiple reasons why sealing types can measurably improve performance, and so we&#8217;ve adopted a general policy that all non-public types that can be sealed should be, so as to maximize the chances that use of these types will simply be better than it otherwise would be.<\/p>\n<p>One reason sealing helps is that virtual methods on a sealed type are more likely to be devirtualized by the runtime. If the runtime can see that a given instance on which a virtual call is being made is actually sealed, then it knows for certain what the actual target of the call will be, and it can invoke that target directly rather than doing a virtual dispatch operation. 
Better yet, once the call is devirtualized, it might be inlineable, and then if it&#8217;s inlined, all the previously discussed benefits around optimizing the caller+callee combined kick in.<\/p>\n<pre><code class=\"language-C#\">private SealedType _sealed = new();\r\nprivate NonSealedType _nonSealed = new();\r\n\r\n[Benchmark(Baseline = true)]\r\npublic int NonSealed() =&gt; _nonSealed.M() + 42;\r\n\r\n[Benchmark]\r\npublic int Sealed() =&gt; _sealed.M() + 42;\r\n\r\npublic class BaseType\r\n{\r\n    public virtual int M() =&gt; 1;\r\n}\r\n\r\npublic class NonSealedType : BaseType\r\n{\r\n    public override int M() =&gt; 2;\r\n}\r\n\r\npublic sealed class SealedType : BaseType\r\n{\r\n    public override int M() =&gt; 2;\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonSealed<\/td>\n<td style=\"text-align: right;\">0.9837 ns<\/td>\n<td style=\"text-align: right;\">1.000<\/td>\n<td style=\"text-align: right;\">26 B<\/td>\n<\/tr>\n<tr>\n<td>Sealed<\/td>\n<td style=\"text-align: right;\">0.0018 ns<\/td>\n<td style=\"text-align: right;\">0.002<\/td>\n<td style=\"text-align: right;\">12 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre><code class=\"language-assembly\">; Program.NonSealed()\r\n       sub       rsp,28\r\n       mov       rcx,[rcx+10]\r\n       mov       rax,[rcx]\r\n       mov       rax,[rax+40]\r\n       call      qword ptr [rax+20]\r\n       add       eax,2A\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 26\r\n\r\n; Program.Sealed()\r\n       mov       rax,[rcx+8]\r\n       cmp       [rax],eax\r\n       mov       eax,2C\r\n       ret\r\n; Total bytes of code 12<\/code><\/pre>\n<p>Note the code gen difference. 
<code>NonSealed()<\/code> is doing a virtual dispatch (that series of <code>mov<\/code> instructions to find the address of the actual method to invoke followed by a <code>call<\/code>), whereas <code>Sealed()<\/code> isn&#8217;t calling anything: in fact, it&#8217;s been reduced to a null check followed by returning a constant value, as <code>SealedType.M<\/code> was devirtualized and inlined, at which point the JIT could constant fold the <code>2 + 42<\/code> into just <code>44<\/code> (hex 0x2C). BenchmarkDotNet actually issues a warning (a good warning in this case) about the resulting metrics:<\/p>\n<pre><code class=\"language-console\">\/\/ * Warnings *\r\nZeroMeasurement\r\n  Program.Sealed: Runtime=.NET 6.0, Toolchain=net6.0 -&gt; The method duration is indistinguishable from the empty method duration<\/code><\/pre>\n<p>In order to measure the cost of a benchmark, BenchmarkDotNet not only times how long it takes to invoke the benchmark but also how long it takes to invoke an empty benchmark with a similar signature, with the results presented subtracting the latter from the former. BenchmarkDotNet is then highlighting that with the method just returning a constant, the benchmark and the empty method are now indistinguishable. Cool.<\/p>\n<p>Another benefit of sealing is that it can make type checks a lot faster. When you write code like <code>obj is SomeType<\/code>, there are multiple ways that could be emitted in assembly. If <code>SomeType<\/code> is sealed, then this check can be implemented along the lines of <code>obj is not null &amp;&amp; obj.GetType() == typeof(SomeType)<\/code>, where the latter clause can be implemented simply by comparing the type handle of <code>obj<\/code> against the known type handle of <code>SomeType<\/code>; after all, if it&#8217;s sealed, it&#8217;s not possible there could be any type derived from <code>SomeType<\/code>, so there&#8217;s no other type than <code>SomeType<\/code> that need be considered. 
But if <code>SomeType<\/code> isn&#8217;t sealed, this check becomes a lot more complicated, needing to determine whether <code>obj<\/code> is not only <code>SomeType<\/code> but potentially something derived from <code>SomeType<\/code>, which means it needs to examine all of the types in the parent hierarchy of <code>obj<\/code>&#8217;s type to see whether any of them are <code>SomeType<\/code>. There&#8217;s enough logic there that it&#8217;s actually factored out into a helper method the JIT can emit a call to, the internal <code>System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass<\/code>. We can see this in a benchmark:<\/p>\n<pre><code class=\"language-C#\">private object _o = \"hello\";\r\n\r\n[Benchmark(Baseline = true)]\r\npublic bool NonSealed() =&gt; _o is NonSealedType;\r\n\r\n[Benchmark]\r\npublic bool Sealed() =&gt; _o is SealedType;\r\n\r\npublic class NonSealedType { }\r\npublic sealed class SealedType { }<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonSealed<\/td>\n<td style=\"text-align: right;\">1.7694 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">37 B<\/td>\n<\/tr>\n<tr>\n<td>Sealed<\/td>\n<td style=\"text-align: right;\">0.0749 ns<\/td>\n<td style=\"text-align: right;\">0.04<\/td>\n<td style=\"text-align: right;\">36 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre><code class=\"language-assembly\">; Program.NonSealed()\r\n       sub       rsp,28\r\n       mov       rdx,[rcx+8]\r\n       mov       rcx,offset MT_Program+NonSealedType\r\n       call      CORINFO_HELP_ISINSTANCEOFCLASS\r\n       test      rax,rax\r\n       setne     al\r\n       movzx     eax,al\r\n       add       rsp,28\r\n       ret\r\n; Total bytes of code 37\r\n\r\n; Program.Sealed()\r\n       mov       rax,[rcx+8]\r\n       test     
 rax,rax\r\n       je        short M00_L00\r\n       mov       rdx,offset MT_Program+SealedType\r\n       cmp       [rax],rdx\r\n       je        short M00_L00\r\n       xor       eax,eax\r\nM00_L00:\r\n       test      rax,rax\r\n       setne     al\r\n       movzx     eax,al\r\n       ret\r\n; Total bytes of code 36<\/code><\/pre>\n<p>Note the <code>NonSealed()<\/code> benchmark is making a <code>call<\/code> to the <code>CORINFO_HELP_ISINSTANCEOFCLASS<\/code> helper, whereas <code>Sealed()<\/code> is just directly comparing the type handle of <code>_o<\/code> (<code>mov rax,[rcx+8]<\/code>) against the type handle of <code>SealedType<\/code> (<code>mov rdx,offset MT_Program+SealedType<\/code>, <code>cmp [rax],rdx<\/code>), and the resulting impact that has on the cost of running this code.<\/p>\n<p>Yet another benefit here comes when using arrays of types. As has been mentioned, arrays in .NET are covariant, which means if you have a type <code>B<\/code> that derives from type <code>A<\/code>, and you have an array of <code>B<\/code>s, you can store that <code>B[]<\/code> into a reference of type <code>A[]<\/code>. That, however, means the runtime needs to ensure that any <code>A<\/code> stored into an <code>A[]<\/code> is of an appropriate type for the actual array being referenced, e.g. in this case that every <code>A<\/code> is actually a <code>B<\/code> or something derived from <code>B<\/code>. Of course, if the runtime knows that for a given <code>T[]<\/code> the <code>T<\/code> being stored couldn&#8217;t possibly be anything other than <code>T<\/code> itself, it needn&#8217;t employ such a check. How could it know that? For one thing, if <code>T<\/code> is sealed. 
So given a benchmark like:<\/p>\n<pre><code class=\"language-C#\">private SealedType _sealedInstance = new();\r\nprivate SealedType[] _sealedArray = new SealedType[1_000_000];\r\n\r\nprivate NonSealedType _nonSealedInstance = new();\r\nprivate NonSealedType[] _nonSealedArray = new NonSealedType[1_000_000];\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void NonSealed()\r\n{\r\n    NonSealedType inst = _nonSealedInstance;\r\n    NonSealedType[] arr = _nonSealedArray;\r\n    for (int i = 0; i &lt; arr.Length; i++)\r\n    {\r\n        arr[i] = inst;\r\n    }\r\n}\r\n\r\n[Benchmark]\r\npublic void Sealed()\r\n{\r\n    SealedType inst = _sealedInstance;\r\n    SealedType[] arr = _sealedArray;\r\n    for (int i = 0; i &lt; arr.Length; i++)\r\n    {\r\n        arr[i] = inst;\r\n    }\r\n}\r\n\r\npublic class NonSealedType { }\r\npublic sealed class SealedType { }<\/code><\/pre>\n<p>we get results like this:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonSealed<\/td>\n<td style=\"text-align: right;\">2.580 ms<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">53 B<\/td>\n<\/tr>\n<tr>\n<td>Sealed<\/td>\n<td style=\"text-align: right;\">1.445 ms<\/td>\n<td style=\"text-align: right;\">0.56<\/td>\n<td style=\"text-align: right;\">59 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Beyond arrays, this is also relevant to spans. 
As previously mentioned, <code>Span&lt;T&gt;<\/code> is invariant, and its constructor that takes a <code>T[]<\/code> prevents you from storing an array of a derived type with a check that validates the <code>T<\/code> and the element type of the actual array passed in are the same:<\/p>\n<pre><code class=\"language-C#\">if (!typeof(T).IsValueType &amp;&amp; array.GetType() != typeof(T[]))\r\n    ThrowHelper.ThrowArrayTypeMismatchException();<\/code><\/pre>\n<p>You get where this is going. <code>Span&lt;T&gt;<\/code>&#8217;s constructor is aggressively inlined, so the code from the constructor is exposed to the caller, which frequently allows the JIT to know the actual type of <code>T<\/code>; if it then knows that <code>T<\/code> is sealed, there&#8217;s no way that <code>array.GetType() != typeof(T[])<\/code>, so it can remove the whole check entirely. A very micro benchmark:<\/p>\n<pre><code class=\"language-C#\">private SealedType[] _sealedArray = new SealedType[10];\r\nprivate NonSealedType[] _nonSealedArray = new NonSealedType[10];\r\n\r\n[Benchmark(Baseline = true)]\r\npublic Span&lt;NonSealedType&gt; NonSealed() =&gt; _nonSealedArray;\r\n\r\n[Benchmark]\r\npublic Span&lt;SealedType&gt; Sealed() =&gt; _sealedArray;\r\n\r\npublic class NonSealedType { }\r\npublic sealed class SealedType { }<\/code><\/pre>\n<p>highlights this difference:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Code Size<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>NonSealed<\/td>\n<td style=\"text-align: right;\">0.2435 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">64 B<\/td>\n<\/tr>\n<tr>\n<td>Sealed<\/td>\n<td style=\"text-align: right;\">0.0379 ns<\/td>\n<td style=\"text-align: right;\">0.16<\/td>\n<td style=\"text-align: right;\">35 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>but it&#8217;s most visible in the generated 
assembly:<\/p>\n<pre><code class=\"language-assembly\">; Program.NonSealed()\r\n       sub       rsp,28\r\n       mov       rax,[rcx+10]\r\n       test      rax,rax\r\n       je        short M00_L01\r\n       mov       rcx,offset MT_Program+NonSealedType[]\r\n       cmp       [rax],rcx\r\n       jne       short M00_L02\r\n       lea       rcx,[rax+10]\r\n       mov       r8d,[rax+8]\r\nM00_L00:\r\n       mov       [rdx],rcx\r\n       mov       [rdx+8],r8d\r\n       mov       rax,rdx\r\n       add       rsp,28\r\n       ret\r\nM00_L01:\r\n       xor       ecx,ecx\r\n       xor       r8d,r8d\r\n       jmp       short M00_L00\r\nM00_L02:\r\n       call      System.ThrowHelper.ThrowArrayTypeMismatchException()\r\n       int       3\r\n; Total bytes of code 64\r\n\r\n; Program.Sealed()\r\n       mov       rax,[rcx+8]\r\n       test      rax,rax\r\n       je        short M00_L01\r\n       lea       rcx,[rax+10]\r\n       mov       r8d,[rax+8]\r\nM00_L00:\r\n       mov       [rdx],rcx\r\n       mov       [rdx+8],r8d\r\n       mov       rax,rdx\r\n       ret\r\nM00_L01:\r\n       xor       ecx,ecx\r\n       xor       r8d,r8d\r\n       jmp       short M00_L00\r\n; Total bytes of code 35<\/code><\/pre>\n<p>where we can see the <code>call System.ThrowHelper.ThrowArrayTypeMismatchException()<\/code> doesn&#8217;t exist in the <code>Sealed()<\/code> version at all because the check that would lead to it was removed completely.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43474\">dotnet\/runtime#43474<\/a> is another example of performing some cleanup operation across a bunch of call sites. 
The <code>System.Buffers.Binary.BinaryPrimitives<\/code> class was introduced in .NET Core 2.1 and has been getting a lot of use with its operations like <code>ReverseEndianness(Int32)<\/code> or <code>ReadInt32BigEndian(ReadOnlySpan&lt;Byte&gt;)<\/code>, but there were a bunch of places in the dotnet\/runtime codebase still manually performing such operations when they could have been using these optimized helpers to do it for them. The PR addresses that, nicely changing complicated code like this in <code>TimeZoneInfo<\/code> on Unix:<\/p>\n<pre><code class=\"language-C#\">private static unsafe long TZif_ToInt64(byte[] value, int startIndex)\r\n{\r\n    fixed (byte* pbyte = &amp;value[startIndex])\r\n    {\r\n        int i1 = (*pbyte &lt;&lt; 24) | (*(pbyte + 1) &lt;&lt; 16) | (*(pbyte + 2) &lt;&lt; 8) | (*(pbyte + 3));\r\n        int i2 = (*(pbyte + 4) &lt;&lt; 24) | (*(pbyte + 5) &lt;&lt; 16) | (*(pbyte + 6) &lt;&lt; 8) | (*(pbyte + 7));\r\n        return (uint)i2 | ((long)i1 &lt;&lt; 32);\r\n    }\r\n}<\/code><\/pre>\n<p>to instead be this:<\/p>\n<pre><code class=\"language-C#\">private static long TZif_ToInt64(byte[] value, int startIndex) =&gt; \r\n    BinaryPrimitives.ReadInt64BigEndian(value.AsSpan(startIndex));<\/code><\/pre>\n<p>Ahhh, so much nicer. 
Not only is such code simpler, safer, and more readily understandable to a reader, it&#8217;s also faster:<\/p>\n<pre><code class=\"language-C#\">private byte[] _buffer = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8 };\r\n\r\n[Benchmark(Baseline = true)]\r\npublic long Old() =&gt; Old(_buffer, 0);\r\n\r\n[Benchmark]\r\npublic long New() =&gt; New(_buffer, 0);\r\n\r\nprivate static unsafe long Old(byte[] value, int startIndex)\r\n{\r\n    fixed (byte* pbyte = &amp;value[startIndex])\r\n    {\r\n        int i1 = (*pbyte &lt;&lt; 24) | (*(pbyte + 1) &lt;&lt; 16) | (*(pbyte + 2) &lt;&lt; 8) | (*(pbyte + 3));\r\n        int i2 = (*(pbyte + 4) &lt;&lt; 24) | (*(pbyte + 5) &lt;&lt; 16) | (*(pbyte + 6) &lt;&lt; 8) | (*(pbyte + 7));\r\n        return (uint)i2 | ((long)i1 &lt;&lt; 32);\r\n    }\r\n}\r\n\r\nprivate static long New(byte[] value, int startIndex) =&gt;\r\n    BinaryPrimitives.ReadInt64BigEndian(value.AsSpan(startIndex));<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">1.9856 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">0.3853 ns<\/td>\n<td style=\"text-align: right;\">0.19<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Another example of such a cleanup is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54004\">dotnet\/runtime#54004<\/a>, which changes several <code>{U}Int32\/64.TryParse<\/code> call sites to explicitly use <code>CultureInfo.InvariantCulture<\/code> instead of <code>null<\/code>. Passing in <code>null<\/code> will cause the implementation to access <code>CultureInfo.CurrentCulture<\/code>, which incurs a thread-local storage access, but all of the changed call sites use <code>NumberStyles.None<\/code> or <code>NumberStyles.Hex<\/code>. 
The only reason the culture is required for parsing is to be able to parse a positive or negative symbol, but with these styles set, the implementation won&#8217;t actually use those symbol values, and thus the actual culture utilized doesn&#8217;t matter. Passing in <code>InvariantCulture<\/code> then means we&#8217;re paying only for a static field access rather than a thread-static field access. Beyond this, <code>TryParse<\/code> also improved for hexadecimal inputs, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52470\">dotnet\/runtime#52470<\/a>, which changed an internal routine used to determine whether a character is valid hex, making it branchless (which makes its performance consistent regardless of inputs or branch prediction) and removing the dependency on a lookup table. Corresponding functionality on <code>Utf8Parser<\/code> also improved. Whereas a method like <code>Int32.TryParse<\/code> parses data from a sequence of <code>char<\/code>s (e.g. <code>ReadOnlySpan&lt;char&gt;<\/code>), <code>Utf8Parser.TryParse<\/code> parses data from a sequence of <code>byte<\/code>s (e.g. <code>ReadOnlySpan&lt;byte&gt;<\/code>) interpreted as UTF8 data. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52423\">dotnet\/runtime#52423<\/a> also improved the performance of <code>TryParse<\/code> for <code>long<\/code> and <code>ulong<\/code> values. 
This is another good example of an optimization tradeoff: the tweaks employed here benefit most values but end up slightly penalizing extreme values.<\/p>\n<pre><code class=\"language-C#\">private byte[] _buffer = new byte[10];\r\n\r\n[GlobalSetup]\r\npublic void Setup() =&gt; Utf8Formatter.TryFormat(12345L, _buffer, out _);\r\n\r\n[Benchmark]\r\npublic bool TryParseInt64() =&gt; Utf8Parser.TryParse(_buffer, out long _, out int _);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>TryParseInt64<\/td>\n<td>.NET Framework 4.8<\/td>\n<td style=\"text-align: right;\">26.490 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>TryParseInt64<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">7.724 ns<\/td>\n<td style=\"text-align: right;\">0.29<\/td>\n<\/tr>\n<tr>\n<td>TryParseInt64<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">6.552 ns<\/td>\n<td style=\"text-align: right;\">0.25<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Then there&#8217;s <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51190\">dotnet\/runtime#51190<\/a>, which recognizes that, at a very low level, when extending a 32-bit value in a 64-bit process to be native word size, it&#8217;s ever so slightly more efficient from a codegen perspective to zero-extend rather than sign-extend; if the code is happening on a path where those are identical (i.e. we know by construction we don&#8217;t have negative values), on a really hot path it can be beneficial to change.<\/p>\n<p>Along with the new and improved support for interpolated strings, a lot of cleanup across dotnet\/runtime was also done with regard to string formatting. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50267\">dotnet\/runtime#50267<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55738\">dotnet\/runtime#55738<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44765\">dotnet\/runtime#44765<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44746\">dotnet\/runtime#44746<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55831\">dotnet\/runtime#55831<\/a> all updated code to use better mechanisms. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51653\/commits\/91f39e2c545b853ced0bfae08653c89381b32c42\">dotnet\/runtime#commits\/91f39e<\/a> alone updated over 3000 lines of string-formatting related code. Some of these changes are to use string interpolation where it wasn&#8217;t used before due to knowledge of the performance implications; for example, there&#8217;s code to read the <code>status<\/code> file in <code>procfs<\/code> on Linux, and that needs to compose the path to the file to use. Previously that code was:<\/p>\n<pre><code class=\"language-C#\">internal static string GetStatusFilePathForProcess(int pid) =&gt;\r\n    RootPath + pid.ToString(CultureInfo.InvariantCulture) + StatusFileName;<\/code><\/pre>\n<p>which ends up first creating a string from the <code>int pid<\/code>, and then doing a <code>String.Concat<\/code> on the resulting strings. Now, it&#8217;s:<\/p>\n<pre><code class=\"language-C#\">internal static string GetStatusFilePathForProcess(int pid) =&gt;\r\n    string.Create(null, stackalloc char[256], $\"{RootPath}{(uint)pid}{StatusFileName}\");<\/code><\/pre>\n<p>which takes advantage of the new <code>string.Create<\/code> overload that works with interpolated strings and enables doing the interpolation using stack-allocated buffer space. 
Also note the lack of the <code>CultureInfo.InvariantCulture<\/code> call; that&#8217;s because when formatting an <code>int<\/code>, the culture is only needed if the number is negative and would require looking up the negative sign symbol for the relevant culture, but here we know that process IDs are never negative, making the culture irrelevant. As a bonus, the implementation casts the known non-negative value to <code>uint<\/code>, which is slightly faster to format than <code>int<\/code>, exactly because we needn&#8217;t check for a sign.<\/p>\n<p>Another pattern of cleanup in those PRs was avoiding creating strings in places spans would suffice. For example, this code from Microsoft.CSharp.dll:<\/p>\n<pre><code class=\"language-C#\">int arity = int.Parse(t.Name.Substring(\"VariantArray\".Length), CultureInfo.InvariantCulture);<\/code><\/pre>\n<p>was replaced by:<\/p>\n<pre><code class=\"language-C#\">int arity = int.Parse(t.Name.AsSpan(\"VariantArray\".Length), provider: CultureInfo.InvariantCulture);<\/code><\/pre>\n<p>avoiding the intermediate string allocation. Or this code from System.Private.Xml.dll:<\/p>\n<pre><code class=\"language-C#\">if (s.Substring(i) == \"INF\")<\/code><\/pre>\n<p>which was replaced by:<\/p>\n<pre><code class=\"language-C#\">if (s.AsSpan(i).SequenceEqual(\"INF\"))<\/code><\/pre>\n<p>Another pattern is using something other than <code>string.Format<\/code> when the power of <code>string.Format<\/code> is unwarranted. For example, this code existed in Microsoft.Extensions.FileSystemGlobbing:<\/p>\n<pre><code class=\"language-C#\">return string.Format(\"{0}\/{1}\", left, right);<\/code><\/pre>\n<p>where both <code>left<\/code> and <code>right<\/code> are strings. 
This is forcing the system to parse the composite format string and incur all the associated overhead, when at the end of the day this can be a simple concat operation, which the C# compiler will employ for an interpolated string when all the parts are strings and there are sufficiently few to enable using one of the non-params-array <code>string.Concat<\/code> overloads:<\/p>\n<pre><code class=\"language-C#\">return $\"{left}\/{right}\";<\/code><\/pre>\n<p>We can see that difference with a simple benchmark:<\/p>\n<pre><code class=\"language-C#\">private string _left = \"hello\";\r\nprivate string _right = \"world\";\r\n\r\n[Benchmark(Baseline = true)]\r\npublic string Format() =&gt; string.Format(\"{0}\/{1}\", _left, _right);\r\n\r\n[Benchmark]\r\npublic string Interpolated() =&gt; $\"{_left}\/{_right}\";<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Format<\/td>\n<td style=\"text-align: right;\">58.74 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<tr>\n<td>Interpolated<\/td>\n<td style=\"text-align: right;\">14.73 ns<\/td>\n<td style=\"text-align: right;\">0.25<\/td>\n<td style=\"text-align: right;\">48 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>.NET 6 also continues the trend of exposing more span-based APIs for things that would otherwise result in creating strings or arrays. For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/57357\">dotnet\/runtime#57357<\/a> adds a new <code>ValueSpan<\/code> property to the <code>Capture<\/code> class in <code>System.Text.RegularExpressions<\/code> (to go along with the string-returning <code>Value<\/code> property that&#8217;s already there). 
That means code can now extract a <code>ReadOnlySpan&lt;char&gt;<\/code> for a <code>Match<\/code>, <code>Group<\/code>, or <code>Capture<\/code> rather than having to allocate a string to determine what matched.<\/p>\n<p>Then there are the plethora of changes that remove an array or boxing allocation here, an unnecessary LINQ query there, and so on. For example:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56207\">dotnet\/runtime#56207<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a> removed ~50 <code>byte[]<\/code> allocations from <code>System.Reflection.MetadataLoadContext<\/code>, by changing some static readonly <code>byte[]<\/code> fields to instead be <code>ReadOnlySpan&lt;byte&gt;<\/code> properties, taking advantage of the C# compiler&#8217;s ability to store constant data used in this capacity in a very efficient manner.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55652\">dotnet\/runtime#55652<\/a> removed a <code>char[]<\/code> allocation from <code>System.Xml.UniqueId.ToString()<\/code>, replacing the use of a temporary <code>new char[length]<\/code> followed by a <code>new string(charArray)<\/code> to instead use a call to <code>string.Create<\/code> that was able to populate the string instance directly.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49485\">dotnet\/runtime#49485<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49488\">dotnet\/runtime#49488<\/a> removed <code>StringBuilder<\/code> allocations, where a <code>StringBuilder<\/code> was being allocated and then appended to multiple times, to instead use a single call to <code>string.Join<\/code> (which has a much more optimized implementation), making the code both simpler and faster. 
These also included a few changes where <code>StringBuilder<\/code>s were being allocated and then just a handful of appends were always being performed, when a simple <code>string.Concat<\/code> would suffice.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50483\">dotnet\/runtime#50483<\/a> avoided a closure and delegate allocation in <code>System.ComponentModel.Annotations<\/code> by minimizing the scope of the data being closed over.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50502\">dotnet\/runtime#50502<\/a> avoided a closure and delegate allocation in <code>ClientWebSocket.ConnectAsync<\/code> by open-coding a loop rather than using <code>List&lt;T&gt;.Find<\/code> with a lambda that captured surrounding state.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50512\">dotnet\/runtime#50512<\/a> avoided a closure and delegate allocation in <code>Regex<\/code> that slipped in due to using a captured local rather than the exact same state that was already being passed into the lambda. These kinds of issues are easy to miss, and they&#8217;re one of the reasons I love being able to add <code>static<\/code> to lambdas, to ensure they&#8217;re not closing over anything unexpectedly.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50496\">dotnet\/runtime#50496<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50387\">dotnet\/runtime#50387<\/a> avoided closure and delegate allocations in <code>System.Diagnostics.Process<\/code>, by being more deliberate about how state is passed around.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50357\">dotnet\/runtime#50357<\/a> avoided a closure and delegate allocation in the polling mechanism employed by <code>DiagnosticCounter<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54621\">dotnet\/runtime#54621<\/a> avoided cloning an immutable <code>Version<\/code> object. 
The original instance could be used just as easily; the only downside would be if someone was depending on object identity here for some reason, of which there&#8217;s very low risk.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51119\">dotnet\/runtime#51119<\/a> fixed <code>DispatchProxyGenerator<\/code>, which was almost humorously cloning an array from another array just to get the new array&#8217;s length&#8230; when it could have just used the original array&#8217;s length.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47473\">dotnet\/runtime#47473<\/a> is more complicated than some of these other PRs, but it removed the need for an <code>OrderedDictionary<\/code> (which itself creates an <code>ArrayList<\/code> and a <code>Hashtable<\/code>) in <code>TypeDescriptor.GetAttributes<\/code>, instead using a <code>List&lt;T&gt;<\/code> and a <code>HashSet&lt;T&gt;<\/code> directly.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44495\">dotnet\/runtime#44495<\/a> changed <code>StreamWriter<\/code>&#8217;s <code>byte[]<\/code> buffer to be lazily allocated. For scenarios where only small payloads are written synchronously, the <code>byte[]<\/code> may never be needed.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46455\">dotnet\/runtime#46455<\/a> is fun, and a holdover from where this code originated in the .NET Framework. The PR deletes a bunch of code, including the preallocation of a <code>ThreadAbortException<\/code> that could be used by the system should one ever be needed and the system is too low on memory to allocate one. That might have been useful, if thread aborts were still a thing. Which they&#8217;re not. Goodbye.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47453\">dotnet\/runtime#47453<\/a>. 
Enumerating a <code>Hashtable<\/code> using a standard <code>foreach<\/code>, even if all of the keys and values are reference types, still incurs an allocation per iteration, as the <code>DictionaryEntry<\/code> struct yielded for each iteration gets boxed. To avoid this, <code>Hashtable<\/code>&#8217;s enumerator implements <code>IDictionaryEnumerator<\/code>, which provides strongly-typed access to the <code>DictionaryEntry<\/code> and enables direct use of <code>MoveNext<\/code>\/<code>Entry<\/code> to avoid that allocation. This PR takes advantage of that to avoid a bunch of boxing allocations as part of <code>EnvironmentVariablesConfigurationProvider.Load<\/code>.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49883\">dotnet\/runtime#49883<\/a>. <code>Lazy&lt;T&gt;<\/code> is one of those types that&#8217;s valuable when used correctly, but that&#8217;s also easy to overuse. Creating a <code>Lazy&lt;T&gt;<\/code> involves at least one if not multiple allocations beyond the cost of whatever&#8217;s being lazily-created, plus a delegate invocation to create the thing. But sometimes the double-checked locking you get by default is overkill, and all you really need is an <code>Interlocked.CompareExchange<\/code> to provide simple and efficient optimistic concurrency. This PR avoids a <code>Lazy&lt;T&gt;<\/code> for just such a case in <code>UnnamedOptionsManager<\/code>.<\/li>\n<\/ul>\n<h3>JSON<\/h3>\n<p><code>System.Text.Json<\/code> was <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/try-the-new-system-text-json-apis\/\">introduced in .NET Core 3.0<\/a> with performance as a primary goal. 
.NET 5 delivered an enhanced version of the library, providing <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/whats-next-for-system-text-json\/\">new APIs and even better performance<\/a>, and .NET 6 continues that trend.<\/p>\n<p>There have been multiple PRs in .NET 6 to improve the performance of different aspects of <code>System.Text.Json<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46460\">dotnet\/runtime#46460<\/a> from <a href=\"https:\/\/github.com\/lezzi\">@lezzi<\/a> is a small but valuable change that avoids boxing every key in a dictionary with a value type <code>TKey<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51367\">dotnet\/runtime#51367<\/a> from <a href=\"https:\/\/github.com\/devsko\">@devsko<\/a> makes serializing <code>DateTime<\/code>s faster by reducing the cost of trimming off ending <code>0<\/code>s. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55350\">dotnet\/runtime#55350<\/a> from <a href=\"https:\/\/github.com\/CodeBlanch\">@CodeBlanch<\/a> cleans up a bunch of <code>stackalloc<\/code> usage in the library, including changing a bunch of call sites from using a variable to instead using a constant, the latter of which the JIT can better optimize.<\/p>\n<p>But arguably the biggest performance improvement in <code>System.Text.Json<\/code> in .NET 6 comes from source generation. <code>JsonSerializer<\/code> needs information about the types it&#8217;s serializing to know what to serialize and how to serialize it. It retrieves that data via reflection, examining for example what properties are exposed on a type and whether there are any customization attributes applied. But reflection is relatively expensive, and certainly not something you&#8217;d want to do every time you serialized an instance of a type, so <code>JsonSerializer<\/code> caches that information. 
That cached information may include, for example, delegates used to access the properties on an instance in order to retrieve the data that needs to be serialized. Depending on how the <code>JsonSerializer<\/code> is configured, that delegate might use reflection to invoke the property, or if the system permits it, it might point to specialized code emitted via reflection emit. Unfortunately, both of those techniques have potential downsides. Gathering all of this data, and potentially doing this reflection emit work, at run-time has a cost, and it can measurably impact both the startup performance and the working set of an application. It also leads to increased size, as all of the code necessary to enable this (including support for reflection emit itself) needs to be kept around just in case the serializer needs it. The new <code>System.Text.Json<\/code> source generator introduced in .NET 6 addresses this.<\/p>\n<p>Generating source during a build is nothing new; these techniques have been used in and out of the .NET ecosystem for decades. What is new, however, is the C# compiler making the capability a first-class feature, and core libraries in .NET taking advantage of it. Just as the compiler allows for <a href=\"https:\/\/docs.microsoft.com\/visualstudio\/extensibility\/getting-started-with-roslyn-analyzers\">analyzers<\/a> to be plugged into a build to add custom analysis as part of the compiler&#8217;s execution (with the compiler giving the analyzer access to all of the syntactical and semantic data it gathers and creates), the compiler now also enables a <a href=\"https:\/\/docs.microsoft.com\/dotnet\/csharp\/roslyn-sdk\/source-generators-overview\">source generator<\/a> to access the same information and then spit out additional C# code that&#8217;s incorporated into the same compilation unit. 
This makes it very attractive for doing certain operations at compile-time that code may have been doing previously via reflection and reflection emit at run-time&#8230; like analyzing types as part of a serializer in order to generate fast member accessors.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51149\">dotnet\/runtime#51149<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51300\">dotnet\/runtime#51300<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51528\">dotnet\/runtime#51528<\/a> introduce a new <code>System.Text.Json.SourceGeneration<\/code> component, included as part of the .NET 6 SDK. I create a new app, and I can see in Visual Studio the generator is automatically included:<\/p>\n<p><img decoding=\"async\" class=\"alignnone\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/08\/JsonSourceGeneratorSolutionExplorer.png\" alt=\"Visual Studio Solution Explorer showing JSON source generator\" width=\"682\" height=\"406\" \/><\/p>\n<p>Then I can add this to my program:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Text.Json;\r\nusing System.Text.Json.Serialization;\r\n\r\nnamespace JsonExample;\r\n\r\nclass Program\r\n{\r\n    public static void Main()\r\n    {\r\n        JsonSerializer.Serialize(Console.OpenStandardOutput(), new BlogPost { Title = \".NET 6 Performance\", Author = \"Me\", PublicationYear = 2021 }, MyJsonContext.Default.BlogPost);\r\n    }\r\n}\r\n\r\ninternal class BlogPost\r\n{\r\n    public string? Title { get; set; }\r\n    public string? 
Author { get; set; }\r\n    public int PublicationYear { get; set; }\r\n}\r\n\r\n[JsonSerializable(typeof(BlogPost))]\r\ninternal partial class MyJsonContext : JsonSerializerContext { }<\/code><\/pre>\n<p>Over what I might have written in the past, note the addition of the partial <code>MyJsonContext<\/code> class (the name here doesn&#8217;t matter) and the additional <code>MyJsonContext.Default.BlogPost<\/code> argument to JsonSerializer.Serialize. As you&#8217;d expect, when I run it, I get this output:<\/p>\n<pre><code class=\"language-C#\">{\"Title\":\".NET 6 Performance\",\"Author\":\"Me\",\"PublicationYear\":2021}<\/code><\/pre>\n<p>What&#8217;s interesting, however, is what happened behind the scenes. If you look again at Solution Explorer, you&#8217;ll see a bunch of code the JSON source generator output:<\/p>\n<p><img decoding=\"async\" class=\"alignnone\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/08\/JsonSourceGeneratorSolutionExplorer_Populated.png\" alt=\"Generated JSON files\" width=\"678\" height=\"571\" \/><\/p>\n<p>Those files essentially contain all of the glue code reflection and reflection emit would have generated, including lines like:<\/p>\n<pre><code class=\"language-C#\">getter: static (obj) =&gt; ((global::JsonExample.BlogPost)obj).Title,\r\nsetter: static (obj, value) =&gt; ((global::JsonExample.BlogPost)obj).Title = value,<\/code><\/pre>\n<p>highlighting the property accessor delegates being generated as part of source generation. The <code>JsonSerializer<\/code> is then able to use these delegates just as it&#8217;s able to use ones that use reflection or that were generated via reflection emit.<\/p>\n<p>As long as the source generator is spitting out all this code for doing at compile-time what was previously done at run-time, it can take things a step further. 
If I were writing my own serializer customized specifically for my <code>BlogPost<\/code> class, I wouldn&#8217;t use all this indirection&#8230; I&#8217;d just use a writer directly and write out each property, e.g.<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Text.Json;\r\nusing System.Text.Json.Serialization;\r\n\r\nnamespace JsonExample;\r\n\r\nclass Program\r\n{\r\n    public static void Main()\r\n    {\r\n        using var writer = new Utf8JsonWriter(Console.OpenStandardOutput());\r\n        BlogPostSerialize(writer, new BlogPost { Title = \".NET 6 Performance\", Author = \"Me\", PublicationYear = 2021 });\r\n        writer.Flush();\r\n    }\r\n\r\n    private static void BlogPostSerialize(Utf8JsonWriter writer, BlogPost value)\r\n    {\r\n        writer.WriteStartObject();\r\n        writer.WriteString(nameof(BlogPost.Title), value.Title);\r\n        writer.WriteString(nameof(BlogPost.Author), value.Author);\r\n        writer.WriteNumber(nameof(BlogPost.PublicationYear), value.PublicationYear);\r\n        writer.WriteEndObject();\r\n    }\r\n}\r\n\r\ninternal class BlogPost\r\n{\r\n    public string? Title { get; set; }\r\n    public string? Author { get; set; }\r\n    public int PublicationYear { get; set; }\r\n}<\/code><\/pre>\n<p>There&#8217;s no reason the source generator shouldn&#8217;t be able to output such a streamlined implementation. And as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53212\">dotnet\/runtime#53212<\/a>, it can. 
The generated code contains this method on the <code>MyJsonContext<\/code> class:<\/p>\n<pre><code class=\"language-C#\">private static void BlogPostSerialize(global::System.Text.Json.Utf8JsonWriter writer, global::Benchmarks.BlogPost value)\r\n{\r\n    if (value == null)\r\n    {\r\n        writer.WriteNullValue();\r\n        return;\r\n    }\r\n\r\n    writer.WriteStartObject();\r\n    writer.WriteString(TitlePropName, value.Title);\r\n    writer.WriteString(AuthorPropName, value.Author);\r\n    writer.WriteNumber(PublicationYearPropName, value.PublicationYear);\r\n\r\n    writer.WriteEndObject();\r\n}<\/code><\/pre>\n<p>Looks familiar. Note, too, that the design of this fast path code enables the <code>JsonSerializer<\/code> to use it as well: if the serializer is passed a <code>JsonSerializerContext<\/code> that has a fast-path delegate, it&#8217;ll use it, which means code only needs to explicitly call the fast-path if it really wants to eke out the last mile of performance.<\/p>\n<pre><code class=\"language-C#\">private Utf8JsonWriter _writer = new Utf8JsonWriter(Stream.Null);\r\nprivate BlogPost _blogPost = new BlogPost { Title = \".NET 6 Performance\", Author = \"Me\", PublicationYear = 2021 };\r\n\r\n[Benchmark(Baseline = true)]\r\npublic void JsonSerializerWithoutFastPath()\r\n{\r\n    _writer.Reset();\r\n    JsonSerializer.Serialize(_writer, _blogPost);\r\n    _writer.Flush();\r\n}\r\n\r\n[Benchmark]\r\npublic void JsonSerializerWithFastPath()\r\n{\r\n    _writer.Reset();\r\n    JsonSerializer.Serialize(_writer, _blogPost, MyJsonContext.Default.BlogPost);\r\n    _writer.Flush();\r\n}\r\n\r\n[Benchmark]\r\npublic void DirectFastPath()\r\n{\r\n    _writer.Reset();\r\n    MyJsonContext.Default.BlogPost.Serialize(_writer, _blogPost);\r\n    _writer.Flush();\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: 
right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>JsonSerializerWithoutFastPath<\/td>\n<td style=\"text-align: right;\">239.9 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>JsonSerializerWithFastPath<\/td>\n<td style=\"text-align: right;\">150.9 ns<\/td>\n<td style=\"text-align: right;\">0.63<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<tr>\n<td>DirectFastPath<\/td>\n<td style=\"text-align: right;\">134.9 ns<\/td>\n<td style=\"text-align: right;\">0.56<\/td>\n<td style=\"text-align: right;\">&#8211;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The impact of these improvements can be quite meaningful. <a href=\"https:\/\/github.com\/aspnet\/Benchmarks\/pull\/1683#issuecomment-864841394\">aspnet\/Benchmarks#1683<\/a> is a good example. It updates the ASP.NET implementation of the <a href=\"https:\/\/www.techempower.com\/benchmarks\/\">TechEmpower caching benchmark<\/a> to use the JSON source generator. Previously, a significant portion of the time in that benchmark was being spent doing JSON serialization using <code>JsonSerializer<\/code>, making it a prime candidate. With the changes to use the source generator and benefit from the fast path implicitly being used, the benchmark gets ~30% faster.<\/p>\n<p>The blog post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/try-the-new-system-text-json-source-generator\">Try the new System.Text.Json source generator<\/a> provides a lot more detail and background.<\/p>\n<h3>Interop<\/h3>\n<p>One of the really neat projects worked on during .NET 6 is another source generator, this time one related to interop. Since the beginning of .NET, C# code has been able to call out to native C functions via the P\/Invoke (Platform Invoke) mechanism, whereby a <code>static extern<\/code> method is annotated as <code>[DllImport]<\/code>. However, not all <code>[DllImport]<\/code>s are created equal. 
Certain <code>[DllImport]<\/code>s are referred to as being &#8220;blittable,&#8221; which really just means the runtime doesn&#8217;t need to do any special transformation or marshaling as part of the call (that includes the <a href=\"https:\/\/docs.microsoft.com\/dotnet\/framework\/interop\/blittable-and-non-blittable-types\">signature&#8217;s types being blittable<\/a>, but also the <code>[DllImport(...)]<\/code> attribute itself not declaring the need for any special processing, like <code>SetLastError = true<\/code>). For those that aren&#8217;t blittable, the runtime needs to generate a &#8220;stub&#8221; that does any marshaling or manipulation necessary. For example, if you write:<\/p>\n<pre><code class=\"language-C#\">[DllImport(SetLastError = true)]\r\nprivate static extern bool GetValue(SafeHandle handle);<\/code><\/pre>\n<p>for a native API defined in C as something like the following on Windows:<\/p>\n<pre><code class=\"language-C\">BOOL GetValue(HANDLE h);<\/code><\/pre>\n<p>or the following on Unix:<\/p>\n<pre><code class=\"language-C\">int32_t GetValue(void* fileDescriptor);<\/code><\/pre>\n<p>there are three special things the runtime needs to handle:<\/p>\n<ol>\n<li>The <code>SafeHandle<\/code> needs to be marshaled as an <code>IntPtr<\/code>, and the runtime needs to ensure the <code>SafeHandle<\/code> won&#8217;t be released during the native call.<\/li>\n<li>The <code>bool<\/code> return value needs to be marshaled from a 4-byte integer value.<\/li>\n<li>The <code>SetLastError = true<\/code> needs to properly ensure any error from the native call is consumable appropriately.<\/li>\n<\/ol>\n<p>To do so, the runtime effectively needs to translate that <code>[DllImport]<\/code> into something like:<\/p>\n<pre><code class=\"language-C#\">private static bool GetValue(SafeHandle handle)\r\n{\r\n    bool success = false;\r\n    try\r\n    {\r\n        handle.DangerousAddRef(ref success);\r\n        IntPtr ptr = handle.DangerousGetHandle();\r\n\r\n     
   Marshal.SetLastSystemError(0);\r\n        int result = __GetValue(ptr);\r\n        Marshal.SetLastPInvokeError(Marshal.GetLastSystemError());\r\n\r\n        return result != 0;\r\n    }\r\n    finally\r\n    {\r\n        if (success)\r\n        {\r\n            handle.DangerousRelease();\r\n        }\r\n    }\r\n}\r\n\r\n[DllImport]\r\nprivate static extern int __GetValue(IntPtr handle);<\/code><\/pre>\n<p>using dynamic code generation at run-time to generate a &#8220;stub&#8221; method that in turn calls an underlying <code>[DllImport]<\/code> that actually is blittable. Doing that at run-time has multiple downsides, including the startup impact on having to do this code generation on first use. So, for .NET 7 we plan to enable a source generator to do it, and the groundwork has been laid in .NET 6 by building out a prototype. While the <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/43060\">P\/Invoke source generator<\/a> won&#8217;t ship as part of .NET 6, as part of that prototype various investments were made that will ship in .NET 6, such as changing <code>[DllImport]<\/code>s that could easily be made blittable to be so. You can see an example of that in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54029\">dotnet\/runtime#54029<\/a>, which changed a handful of <code>[DllImport]<\/code>s in the <code>System.IO.Compression.Brotli<\/code> library to be blittable. For example, this method:<\/p>\n<pre><code class=\"language-C#\">[DllImport(Libraries.CompressionNative)]\r\ninternal static extern unsafe bool BrotliDecoderDecompress(nuint availableInput, byte* inBytes, ref nuint availableOutput, byte* outBytes);<\/code><\/pre>\n<p>required the runtime to generate a stub in order to 1) handle the return <code>bool<\/code> marshaling from a 4-byte integer, and 2) handle pinning the <code>availableOutput<\/code> parameter passed as a <code>ref<\/code>. 
Instead, it can be defined as:<\/p>\n<pre><code class=\"language-C#\">[DllImport(Libraries.CompressionNative)]\r\ninternal static extern unsafe int BrotliDecoderDecompress(nuint availableInput, byte* inBytes, nuint* availableOutput, byte* outBytes);<\/code><\/pre>\n<p>which is blittable, and then a call site like:<\/p>\n<pre><code class=\"language-C#\">nuint availableOutput = (nuint)destination.Length;\r\nbool success = Interop.Brotli.BrotliDecoderDecompress((nuint)source.Length, inBytes, ref availableOutput, outBytes);<\/code><\/pre>\n<p>can be tweaked to:<\/p>\n<pre><code class=\"language-C#\">nuint availableOutput = (nuint)destination.Length;\r\nbool success = Interop.Brotli.BrotliDecoderDecompress((nuint)source.Length, inBytes, &amp;availableOutput, outBytes) != 0;<\/code><\/pre>\n<p>Boom, a small tweak and we&#8217;ve saved an extra unlikely-to-be-inlined method call and avoided the need to even generate the stub in the first place. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53968\">dotnet\/runtime#53968<\/a> makes all of the <code>[DllImport]<\/code>s for interop with <code>zlib<\/code> (System.IO.Compression) blittable. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54370\">dotnet\/runtime#54370<\/a> fixes up more <code>[DllImport]<\/code>s across <code>System.Security.Cryptography<\/code>, <code>System.Diagnostics<\/code>, <code>System.IO.MemoryMappedFiles<\/code>, and elsewhere to be blittable, as well.<\/p>\n<p>Another area in which we&#8217;ve seen cross-cutting improvements in .NET 6 is the use of function pointers to simplify and streamline interop. C# 9 added support for function pointers, which, via the <code>delegate*<\/code> syntax, enable efficient access to the <code>ldftn<\/code> and <code>calli<\/code> IL instructions. 
Let&#8217;s say you&#8217;re the <code>PosixSignalRegistration<\/code> type, which was implemented in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54136\">dotnet\/runtime#54136<\/a> from <a href=\"https:\/\/github.com\/tmds\">@tmds<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55333\">dotnet\/runtime#55333<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55552\">dotnet\/runtime#55552<\/a> to enable code to register a callback to handle a POSIX signal. Both the Unix and Windows implementations of this type need to hand off to native code a callback to be invoked when a signal is received. On Unix, the native function that&#8217;s called to register the callback is declared as:<\/p>\n<pre><code class=\"language-C\">typedef int32_t (*PosixSignalHandler)(int32_t signalCode, PosixSignal signal);\r\nvoid SystemNative_SetPosixSignalHandler(PosixSignalHandler signalHandler);<\/code><\/pre>\n<p>expecting a function pointer it can invoke. Thankfully, on the managed side we want to hand off a static method, so we don&#8217;t need to get bogged down in the details of how we pass an instance method, keep the relevant state rooted, and so on. Instead, we can declare the <code>[DllImport]<\/code> as:<\/p>\n<pre><code class=\"language-C#\">[DllImport(Libraries.SystemNative, EntryPoint = \"SystemNative_SetPosixSignalHandler\")]\r\ninternal static extern unsafe void SetPosixSignalHandler(delegate* unmanaged&lt;int, PosixSignal, int&gt; handler);<\/code><\/pre>\n<p>Now, we define a method we want to be called that&#8217;s compatible with this function pointer type:<\/p>\n<pre><code class=\"language-C#\">[UnmanagedCallersOnly]\r\nprivate static int OnPosixSignal(int signo, PosixSignal signal) { ... 
}<\/code><\/pre>\n<p>and, finally, we can pass the address of this method to the native code:<\/p>\n<pre><code class=\"language-C#\">Interop.Sys.SetPosixSignalHandler(&amp;OnPosixSignal);<\/code><\/pre>\n<p>Nowhere did we need to allocate a delegate and store it into a static field to prevent it from being collected, just so we can hand off the address of this <code>OnPosixSignal<\/code> method; instead, we just pass down the method&#8217;s address. This ends up being simpler and more efficient, and multiple PRs in .NET 6 converted delegate-based interop to function pointer-based interop. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43793\">dotnet\/runtime#43793<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43514\">dotnet\/runtime#43514<\/a> converted a bunch of interop on both Windows and Unix to use function pointers. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54636\">dotnet\/runtime#54636<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54884\">dotnet\/runtime#54884<\/a> did the same for <code>System.Drawing<\/code> as part of a larger effort to migrate <code>System.Drawing<\/code> to use <code>System.Runtime.InteropServices.ComWrappers<\/code>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46690\">dotnet\/runtime#46690<\/a> moved <code>DateTime<\/code> to being a fully managed implementation rather than using <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657\/docs\/design\/coreclr\/botr\/corelib.md#calling-from-managed-to-native-code\">&#8220;FCalls&#8221;<\/a> into the runtime to get the current time, and in doing so used function pointers to be able to store a pointer to the desired native OS function for getting the current time. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52090\">dotnet\/runtime#52090<\/a> converted the macOS implementation of <code>FileSystemWatcher<\/code> to use function pointers. 
And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52192\">dotnet\/runtime#52192<\/a> did the same for <code>System.Net.NetworkInformation<\/code>.<\/p>\n<p>Beyond these cross-cutting changes, there was also more traditional optimization investment in interop. The <code>Marshal<\/code> class has long provided the <code>AllocHGlobal<\/code> and <code>FreeHGlobal<\/code> methods which .NET developers could use effectively as the equivalent of <code>malloc<\/code> and <code>free<\/code>, in situations where natively allocated memory was preferable to allocation controlled by the GC. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/41911\">dotnet\/runtime#41911<\/a> revised the implementation of these and other <code>Marshal<\/code> methods as part of moving all of the <code>Marshal<\/code> allocation-related implementations out of native code in the runtimes up into C#. In doing so, a fair amount of overhead was removed, in particular on Unix where a layer of wrappers was removed, as is evident from this benchmark run on Ubuntu:<\/p>\n<pre><code class=\"language-C#\">[Benchmark]\r\npublic void AllocFree() =&gt; Marshal.FreeHGlobal(Marshal.AllocHGlobal(100));<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AllocFree<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">58.50 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>AllocFree<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">28.21 ns<\/td>\n<td style=\"text-align: right;\">0.48<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In a similar area, the new <code>System.Runtime.InteropServices.NativeMemory<\/code> class (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54006\">dotnet\/runtime#54006<\/a>) provides fast APIs for allocating, reallocating, and freeing native memory, with options including requiring 
the memory to have a particular alignment or to be forcibly zeroed out (note the above numbers and the below numbers were taken on different machines, the above on Ubuntu and the below on Windows, and are not directly comparable).<\/p>\n<pre><code class=\"language-C#\">[Benchmark(Baseline = true)]\r\npublic void AllocHGlobal() =&gt; Marshal.FreeHGlobal(Marshal.AllocHGlobal(100));\r\n\r\n[Benchmark]\r\npublic void Alloc() =&gt; NativeMemory.Free(NativeMemory.Alloc(100));<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<th style=\"text-align: right;\">RatioSD<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AllocHGlobal<\/td>\n<td style=\"text-align: right;\">58.34 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">0.00<\/td>\n<\/tr>\n<tr>\n<td>Alloc<\/td>\n<td style=\"text-align: right;\">48.33 ns<\/td>\n<td style=\"text-align: right;\">0.83<\/td>\n<td style=\"text-align: right;\">0.02<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There&#8217;s also the new <code>MemoryMarshal.CreateReadOnlySpanFromNullTerminated<\/code> method (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47539\">dotnet\/runtime#47539<\/a>), which provides two overloads, one for <code>char*<\/code> and one for <code>byte*<\/code>, and which is intended to simplify the handling of null-terminated strings received while doing interop. As an example, <code>FileSystemWatcher<\/code>&#8217;s implementation on macOS would receive from the operating system a pointer to a null-terminated UTF8 string representing the path of the file that changed. 
With just the <code>byte*<\/code> pointer to the string, the implementation had code that looked like this:<\/p>\n<pre><code class=\"language-C#\">byte* temp = nativeEventPath;\r\nint byteCount = 0;\r\nwhile (*temp != 0)\r\n{\r\n    temp++;\r\n    byteCount++;\r\n}\r\nvar span = new ReadOnlySpan&lt;byte&gt;(nativeEventPath, byteCount);<\/code><\/pre>\n<p>in order to create a span representing the string beginning to end. Now, the implementation is simply:<\/p>\n<pre><code class=\"language-C#\">ReadOnlySpan&lt;byte&gt; eventPath = MemoryMarshal.CreateReadOnlySpanFromNullTerminated(nativeEventPath);<\/code><\/pre>\n<p>More maintainable, safer code, but there&#8217;s also a performance benefit. <code>CreateReadOnlySpanFromNullTerminated<\/code> employs a vectorized search for the null terminator, making it typically much faster than the open-coded manual loop.<\/p>\n<pre><code class=\"language-C#\">private IntPtr _ptr;\r\n\r\n[GlobalSetup]\r\npublic void Setup() =&gt;\r\n    _ptr = Marshal.StringToCoTaskMemUTF8(\"And yet, by heaven, I think my love as rare. 
As any she belies with false compare.\");\r\n\r\n[GlobalCleanup]\r\npublic void Cleanup() =&gt;\r\n    Marshal.FreeCoTaskMem(_ptr);\r\n\r\n[Benchmark(Baseline = true)]\r\npublic unsafe ReadOnlySpan&lt;byte&gt; Old()\r\n{\r\n    int byteCount = 0;\r\n    for (byte* p = (byte*)_ptr; *p != 0; p++) byteCount++;\r\n    return new ReadOnlySpan&lt;byte&gt;((byte*)_ptr, byteCount);\r\n}\r\n\r\n[Benchmark]\r\npublic unsafe ReadOnlySpan&lt;byte&gt; New() =&gt;\r\n    MemoryMarshal.CreateReadOnlySpanFromNullTerminated((byte*)_ptr);<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: right;\">Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Old<\/td>\n<td style=\"text-align: right;\">38.536 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<\/tr>\n<tr>\n<td>New<\/td>\n<td style=\"text-align: right;\">6.597 ns<\/td>\n<td style=\"text-align: right;\">0.17<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Tracing<\/h3>\n<p>.NET has multiple tracing implementations, with <code>EventSource<\/code> at the heart of those used in the most performance-sensitive systems. The runtime itself traces details and exposes counters for the JIT, GC, ThreadPool, and more through a <code>\"System.Runtime\"<\/code> event source, and many other components up and down the stack do the same with their own. 
Even just within the core libraries, among others you can find the <code>\"System.Diagnostics.Metrics\"<\/code> event source, which is intended to enable out-of-process tools to do ad-hoc monitoring of the new <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/opentelemetry-net-reaches-v1-0\/\">OpenTelemetry<\/a> <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/announcing-net-6-preview-5\/#libraries-support-for-opentelemetry-metrics\">Metrics APIs<\/a>; the <code>\"System.Net.Http\"<\/code> event source that exposes information such as when requests start and complete; the <code>\"System.Net.NameResolution\"<\/code> event source that exposes information such as the number of DNS lookups that have been performed; the <code>\"System.Net.Security\"<\/code> event source that exposes data about TLS handshakes; the <code>\"System.Net.Sockets\"<\/code> event source that enables monitoring of connections being made and data being transferred; and the <code>\"System.Buffers.ArrayPoolEventSource\"<\/code> event source that gives a window into arrays being rented and returned and dropped and trimmed. This level of usage demands that the system be as efficient as possible.<\/p>\n<p><code>EventSource<\/code>-derived types use overloads of <code>EventSource.WriteEvent<\/code> or <code>EventSource.WriteEventCore<\/code> to do the core of their logging. There are then multiple ways that data from an <code>EventSource<\/code> can be consumed. One way is via ETW (Event Tracing for Windows), through which another process can request an <code>EventSource<\/code> start tracing and the relevant data will be written by the operating system to a log for subsequent analysis with a tool like Visual Studio, PerfView, or Windows Performance Analyzer. 
The most general <code>WriteEvent<\/code> overload accepts an <code>object[]<\/code> of all the data to trace, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54925\">dotnet\/runtime#54925<\/a> reduced the overhead of using this API, specifically when the data is being consumed by ETW, which has dedicated code paths in the implementation; the PR reduced allocation by 3-4x for basic use cases by avoiding multiple temporary <code>List&lt;object&gt;<\/code> and <code>object[]<\/code> arrays, leading also to an ~8% improvement in throughput.<\/p>\n<p>Another increasingly common way <code>EventSource<\/code> data can be consumed is via <a href=\"https:\/\/docs.microsoft.com\/dotnet\/core\/diagnostics\/eventpipe\">EventPipe<\/a>, which provides a cross-platform mechanism for serializing <code>EventSource<\/code> data either to a <code>.nettrace<\/code> file or to an out-of-process consumer, such as a tool like <a href=\"https:\/\/docs.microsoft.com\/dotnet\/core\/diagnostics\/dotnet-counters\">dotnet-counters<\/a>. Given the high rate at which data can be generated, it&#8217;s important that this mechanism be as low-overhead as possible. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50797\">dotnet\/runtime#50797<\/a> changed how access to buffers in EventPipe was synchronized, leading to significant increases in event throughput on all operating systems. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46555\">dotnet\/runtime#46555<\/a> also helps here. If either ETW or EventPipe was being used to consume events, <code>EventSource<\/code> would P\/Invoke into native code for each, but if only one of them was being used, that would lead to an unnecessary P\/Invoke; the PR addressed this simply by checking whether the P\/Invoke is necessary based on the known state of the consumers.<\/p>\n<p>Another way <code>EventSource<\/code> data can be consumed is in-process via a custom <code>EventListener<\/code>. 
Code can derive from <code>EventListener<\/code> and override a few methods to say what <code>EventSource<\/code> should be subscribed to and what should be done with the data. For example, here&#8217;s a simple app that uses an <code>EventListener<\/code> to dump to the console the events generated for a single HTTP request by the <code>\"System.Net.Http\"<\/code> event source:<\/p>\n<pre><code class=\"language-C#\">using System;\r\nusing System.Diagnostics.Tracing;\r\nusing System.Linq;\r\nusing System.Net.Http;\r\n\r\nusing var listener = new HttpConsoleListener();\r\nusing var hc = new HttpClient();\r\nawait hc.GetStringAsync(\"https:\/\/dotnet.microsoft.com\/\");\r\n\r\nsealed class HttpConsoleListener : EventListener\r\n{\r\n    protected override void OnEventSourceCreated(EventSource eventSource)\r\n    {\r\n        if (eventSource.Name == \"System.Net.Http\")\r\n            EnableEvents(eventSource, EventLevel.LogAlways);\r\n    }\r\n\r\n    protected override void OnEventWritten(EventWrittenEventArgs eventData)\r\n    {\r\n        string? payload =\r\n            eventData.Payload is null ? null :\r\n            eventData.PayloadNames != null ? 
string.Join(\", \", eventData.PayloadNames.Zip(eventData.Payload, (n, v) =&gt; $\"{n}={v}\")) :\r\n            string.Join(\", \", eventData.Payload);\r\n        Console.WriteLine($\"[{eventData.TimeStamp:o}] {eventData.EventName}: {payload}\");\r\n    }\r\n}<\/code><\/pre>\n<p>and the output it produced when I ran it:<\/p>\n<pre><code class=\"language-console\">[2021-08-06T15:38:47.4758871Z] RequestStart: scheme=https, host=dotnet.microsoft.com, port=443, pathAndQuery=\/, versionMajor=1, versionMinor=1, versionPolicy=0\r\n[2021-08-06T15:38:47.5981990Z] ConnectionEstablished: versionMajor=1, versionMinor=1\r\n[2021-08-06T15:38:47.5995700Z] RequestLeftQueue: timeOnQueueMilliseconds=86.1312, versionMajor=1, versionMinor=1\r\n[2021-08-06T15:38:47.6011745Z] RequestHeadersStart:\r\n[2021-08-06T15:38:47.6019475Z] RequestHeadersStop:\r\n[2021-08-06T15:38:47.7591555Z] ResponseHeadersStart:\r\n[2021-08-06T15:38:47.7628194Z] ResponseHeadersStop:\r\n[2021-08-06T15:38:47.7648776Z] ResponseContentStart:\r\n[2021-08-06T15:38:47.7665603Z] ResponseContentStop:\r\n[2021-08-06T15:38:47.7667290Z] RequestStop:\r\n[2021-08-06T15:38:47.7684536Z] ConnectionClosed: versionMajor=1, versionMinor=1<\/code><\/pre>\n<p>The other mechanisms for consuming events are more efficient, but being able to write a custom <code>EventListener<\/code> like this is very flexible and allows for a myriad of interesting uses, so we still want to drive down the overhead associated with all of these callbacks.\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44026\">dotnet\/runtime#44026<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51822\">dotnet\/runtime#51822<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52092\">dotnet\/runtime#52092<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52455\">dotnet\/runtime#52455<\/a> all contributed here, doing things like wrapping a <code>ReadOnlyCollection&lt;object&gt;<\/code> directly around an <code>object[]<\/code> 
created with the exact right size rather than around an intermediate <code>List&lt;object&gt;<\/code> dynamically grown; using a singleton collection for empty payloads; avoiding unnecessary <code>[ThreadStatic]<\/code> accesses; avoiding recalculating information and instead calculating it once and passing it to everywhere that needs the value; using <code>Type.GetTypeCode<\/code> to quickly jump to the handling code for the relevant primitive rather than a large cascade of <code>if<\/code>s; reducing the size of <code>EventWrittenEventArgs<\/code> in the common case by pushing off lesser-used fields to a contingently-allocated class; and so on. This benchmark shows an example impact of those changes.<\/p>\n<pre><code class=\"language-C#\">private BenchmarkEventListener _listener;\r\n\r\n[GlobalSetup]\r\npublic void Setup() =&gt; _listener = new BenchmarkEventListener();\r\n[GlobalCleanup]\r\npublic void Cleanup() =&gt; _listener.Dispose();\r\n\r\n[Benchmark]\r\npublic void Log()\r\n{\r\n    BenchmarkEventSource.Log.NoArgs();\r\n    BenchmarkEventSource.Log.MultipleArgs(\"hello\", 6, 0);\r\n}\r\n\r\nprivate sealed class BenchmarkEventListener : EventListener\r\n{\r\n    protected override void OnEventSourceCreated(EventSource eventSource)\r\n    {\r\n        if (eventSource is BenchmarkEventSource)\r\n            EnableEvents(eventSource, EventLevel.LogAlways);\r\n    }\r\n\r\n    protected override void OnEventWritten(EventWrittenEventArgs eventData) { }\r\n}\r\n\r\nprivate sealed class BenchmarkEventSource : EventSource\r\n{\r\n    public static readonly BenchmarkEventSource Log = new BenchmarkEventSource();\r\n\r\n    [Event(1)]\r\n    public void NoArgs() =&gt; WriteEvent(1);\r\n\r\n    [Event(2)]\r\n    public void MultipleArgs(string arg1, int arg2, int arg3) =&gt; WriteEvent(2, arg1, arg2, arg3);\r\n}<\/code><\/pre>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th style=\"text-align: right;\">Mean<\/th>\n<th style=\"text-align: 
right;\">Ratio<\/th>\n<th style=\"text-align: right;\">Allocated<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Log<\/td>\n<td>.NET 5.0<\/td>\n<td style=\"text-align: right;\">394.4 ns<\/td>\n<td style=\"text-align: right;\">1.00<\/td>\n<td style=\"text-align: right;\">472 B<\/td>\n<\/tr>\n<tr>\n<td>Log<\/td>\n<td>.NET 6.0<\/td>\n<td style=\"text-align: right;\">126.9 ns<\/td>\n<td style=\"text-align: right;\">0.32<\/td>\n<td style=\"text-align: right;\">296 B<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Startup<\/h3>\n<p>There are many things that impact how long it takes an application to start up. Code generation plays a large role, which is why .NET has technology like tiered JIT compilation and ReadyToRun. Managed code prior to an application&#8217;s <code>Main<\/code> method being invoked also plays a role (yes, there&#8217;s managed code that executes before <code>Main<\/code>);\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44469\">dotnet\/runtime#44469<\/a>, for example, was the result of profiling for allocations that occurred on the startup path of a simple &#8220;hello, world&#8221; console application and addressing a variety of issues:<\/p>\n<ul>\n<li><code>EqualityComparer&lt;string&gt;.Default<\/code> is used by various components on the startup path, but the <code>CreateDefaultEqualityComparer<\/code> that&#8217;s used to initialize that singleton wasn&#8217;t special-casing <code>type == typeof(string)<\/code>, which then meant it ended up going down the more allocation-heavy <code>CreateInstanceForAnotherGenericParameter<\/code> code path. The PR fixed it by special-casing <code>string<\/code>.<\/li>\n<li>Any use of <code>Console<\/code> was forcing various <code>Encoding<\/code> objects to be instantiated, even if they wouldn&#8217;t otherwise be used, purely to access their <code>CodePage<\/code>. 
The PR fixed that by just hardcoding the relevant code page number in a constant.<\/li>\n<li>Every process was registering with <code>AppContext.ProcessExit<\/code> in order to clean up after the runtime&#8217;s <code>EventSource<\/code>s that were being created, and that in turn resulted in several allocations. We can instead sacrifice a small amount of layering purity and just do the cleanup as part of the <code>AppContext.OnProcessExit<\/code> routine that&#8217;s already doing other work, like calling <code>AssemblyLoadContext.OnProcessExit<\/code> and invoking the <code>ProcessExit<\/code> event itself.<\/li>\n<li><code>AppDomain<\/code> was taking a lock to protect the testing-and-setting of some state, and that operation was easily transformed into an <code>Interlocked.CompareExchange<\/code>. The benefit to that here wasn&#8217;t reduced locking (which is also nice), but rather no longer needing to allocate the object that was there purely to be locked on.<\/li>\n<li><code>EventSource<\/code> was always allocating an <code>object<\/code> to be used as a lock necessary for synchronization in the <code>WriteEventString<\/code> method, which is only used for logging error messages about <code>EventSource<\/code>s; not a common case. That <code>object<\/code> can instead be lazily allocated with an <code>Interlocked.CompareExchange<\/code> only when there&#8217;s first a failure to log. <code>EventSource<\/code> was also allocating a pinning <code>GCHandle<\/code> in order to pass the address of a pinned array to a P\/Invoke. That was just as easily (and more cheaply) done with a <code>fixed<\/code> statement.<\/li>\n<li>Similarly, <code>EncodingProvider<\/code> was always allocating an <code>object<\/code> to be used for pessimistic locking, when an optimistic <code>CompareExchange<\/code> loop-based scheme was cheaper in the common case.<\/li>\n<\/ul>\n<p>But beyond both of those, there&#8217;s the .NET host. 
Fundamentally, the .NET runtime is &#8220;just&#8221; a library that can be hosted inside of a larger application, the &#8220;host&#8221;; that host calls into various APIs that initialize the runtime and invoke static methods, like <code>Main<\/code>. While it&#8217;s possible for anyone to <a href=\"https:\/\/docs.microsoft.com\/dotnet\/core\/tutorials\/netcore-hosting\">build a custom host<\/a>, there are hosts built into .NET, that are used as part of the <code>dotnet<\/code> tool and as part of building and publishing an app (when you build a .NET console app, for example, the <code>.exe<\/code> that pops out is a .NET host). What that host does or does not do can have a significant impact on the startup performance of the app, and investments were made in .NET 6 to reduce these hosting overheads.<\/p>\n<p>One of the most expensive things a host can do is file I\/O, especially if there&#8217;s a lot of it. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50671\">dotnet\/runtime#50671<\/a> tries to reduce startup time by avoiding the file existence checks that were being performed for each file listed in <code>deps.json<\/code> (which describes a set of dependencies that come from packages). On top of that, <a href=\"https:\/\/github.com\/dotnet\/sdk\/pull\/17014\">dotnet\/sdk#17014<\/a> stopped generating the <code>&lt;app&gt;.runtimeconfig.dev.json<\/code> file as part of builds; this file contained additional probing paths that weren&#8217;t actually necessary and were causing the host to probe more than necessary and negating the wins from the previous PR. On top of that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53631\">dotnet\/runtime#53631<\/a> also helped reduce overheads by removing unnecessary string copies in the hosting layer, shaving milliseconds off execution time.<\/p>\n<p>All told, this adds up to sizeable reductions in app startup. 
For this example, I used:<\/p>\n<pre><code class=\"language-console\">D:\\examples&gt; dotnet new console -o app5 -f net5.0\r\nD:\\examples&gt; dotnet new console -o app6 -f net6.0<\/code><\/pre>\n<p>to create two &#8220;Hello, World&#8221; apps, one targeting .NET 5 and one targeting .NET 6. Then I built them both with <code>dotnet build -c Release<\/code> in each directory, and then used PowerShell&#8217;s <code>Measure-Command<\/code> to time their execution.<\/p>\n<pre><code class=\"language-console\">D:\\examples\\app5&gt; Measure-Command { D:\\examples\\app5\\bin\\Release\\net5.0\\app5.exe } | Select-Object -Property TotalMilliseconds\r\n\r\nTotalMilliseconds\r\n-----------------\r\n          63.9716\r\n\r\nD:\\examples\\app5&gt; cd ..\\app6\r\nD:\\examples\\app6&gt; Measure-Command { D:\\examples\\app6\\bin\\Release\\net6.0\\app6.exe } | Select-Object -Property TotalMilliseconds\r\n\r\nTotalMilliseconds\r\n-----------------\r\n          43.2652\r\n\r\nD:\\examples\\app6&gt;<\/code><\/pre>\n<p>highlighting an ~30% reduction in the cost of executing this &#8220;Hello, World&#8221; app.<\/p>\n<h3>Size<\/h3>\n<p>When I&#8217;ve written about improving .NET performance, throughput and memory have been the primary two metrics on which I&#8217;ve focused. Of late, however, another metric has been getting a lot of attention: size, and in particular size-on-disk for a self-contained, trimmed application. That&#8217;s primarily because of the <a href=\"https:\/\/docs.microsoft.com\/aspnet\/core\/blazor\">Blazor WebAssembly (WASM)<\/a> application model, where an entire .NET application, inclusive of the runtime, is downloaded to and executed in a browser. 
Some amount of work went into .NET 5 to reduce size, but <em>a lot<\/em> of work has gone into .NET 6, inclusive of changes in dotnet\/runtime as well as in mono\/linker, which provides the trimmer that analyzes and rewrites assemblies to remove (or &#8220;trim&#8221;, or &#8220;tree shake&#8221;) unused functionality. 
A large percentage of the work in .NET 6 actually went into trimming safety, making it possible for any of the core libraries to be used in a trimmed application such that either everything that&#8217;s needed will be correctly kept or the trimmer will produce warnings about what&#8217;s wrong and how the developer can fix it. However, there was a sizable effort (pun intended, I&#8217;m so funny) on the size reduction itself.<\/p>\n<p>To start, let&#8217;s take a look at what size looked like for .NET 5. I&#8217;ll create and run a new .NET 5 Blazor WASM application using <code>dotnet<\/code>:<\/p>\n<pre><code class=\"language-console\">dotnet new blazorwasm --framework net5.0 --output app5\r\ncd app5\r\ndotnet run<\/code><\/pre>\n<p><img decoding=\"async\" class=\"alignnone\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/08\/BlazorTemplate.png\" alt=\"Blazor App, Performance Improvements in .NET 6\" width=\"1404\" height=\"564\" \/><\/p>\n<p>It works, nice. Now, I can publish it, which will create and trim the whole application, and produce all the relevant assets ready for pushing to my server; that includes Brotli-compressing all the required components.<\/p>\n<pre><code class=\"language-console\">dotnet publish -c Release<\/code><\/pre>\n<p>I see output like the following, and it completes successfully:<\/p>\n<pre><code class=\"language-console\">D:\\examples\\app5&gt; dotnet publish -c Release\r\nMicrosoft (R) Build Engine version 17.0.0-preview-21411-06+b0bb46ab8 for .NET\r\nCopyright (C) Microsoft Corporation. All rights reserved.\r\n\r\n  Determining projects to restore...\r\n  All projects are up-to-date for restore.\r\n  You are using a preview version of .NET. 
See: https:\/\/aka.ms\/dotnet-core-preview\r\n  app5 -&gt; D:\\examples\\app5\\bin\\Release\\net5.0\\app5.dll\r\n  app5 (Blazor output) -&gt; D:\\examples\\app5\\bin\\Release\\net5.0\\wwwroot\r\n  Optimizing assemblies for size, which may change the behavior of the app. Be sure to test after publishing. See: https:\/\/aka.ms\/dotnet-illink\r\n  Compressing Blazor WebAssembly publish artifacts. This may take a while...\r\n  app5 -&gt; D:\\examples\\app5\\bin\\Release\\net5.0\\publish\r\nD:\\examples\\app5&gt;<\/code><\/pre>\n<p>The published compressed files end up for me in <code>D:\\examples\\app5\\bin\\Release\\net5.0\\publish\\wwwroot\\_framework<\/code>, and if I sum up all of the <code>.br<\/code> files (except for <code>icudt_CJK.dat.br<\/code>, <code>icudt_EFIGS.dat.br<\/code>, <code>icudt_no_CJK.dat.br<\/code>, which are subsets of <code>icudt.dat.br<\/code> that&#8217;s also there), I get a total size of <code>2.10 MB<\/code>. That&#8217;s the entirety of the application, including the runtime, all of the library functionality used by the app, and the app code itself. Cool.<\/p>\n<p>Now, let&#8217;s do the exact same thing, but with .NET 6:<\/p>\n<pre><code class=\"language-console\">dotnet new blazorwasm --framework net6.0 --output app6\r\ncd app6\r\ndotnet publish -c Release<\/code><\/pre>\n<p>which yields:<\/p>\n<pre><code class=\"language-console\">D:\\examples\\app6&gt; dotnet publish -c Release\r\nMicrosoft (R) Build Engine version 17.0.0-preview-21411-06+b0bb46ab8 for .NET\r\nCopyright (C) Microsoft Corporation. All rights reserved.\r\n\r\n  Determining projects to restore...\r\n  All projects are up-to-date for restore.\r\n  You are using a preview version of .NET. See: https:\/\/aka.ms\/dotnet-core-preview\r\n  app6 -&gt; D:\\examples\\app6\\bin\\Release\\net6.0\\app6.dll\r\n  app6 (Blazor output) -&gt; D:\\examples\\app6\\bin\\Release\\net6.0\\wwwroot\r\n  Optimizing assemblies for size, which may change the behavior of the app. 
Be sure to test after publishing. See: https:\/\/aka.ms\/dotnet-illink\r\n  Compressing Blazor WebAssembly publish artifacts. This may take a while...\r\n  app6 -&gt; D:\\examples\\app6\\bin\\Release\\net6.0\\publish\r\nD:\\examples\\app6&gt;<\/code><\/pre>\n<p>as before. Except now when I sum up the relevant <code>.br<\/code> files in <code>D:\\examples\\app6\\bin\\Release\\net6.0\\publish\\wwwroot\\_framework<\/code>, the total size is <code>1.88 MB<\/code>. Just by upgrading from .NET 5 to .NET 6, we&#8217;ve shaved ~220KB off the total size, even as .NET 6 gains lots of additional code. <em>A lot<\/em> of PRs contributed here, as most changes shave off a few bytes here and a few bytes there. Here are some example changes that were made in the name of size, as they can help to highlight the kinds of changes applications and libraries in general can make to help reduce their footprint:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44712\">dotnet\/runtime#44712<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44706\">dotnet\/runtime#44706<\/a>. When run over an application, the trimmer identifies unused types and removes them; use a type just once, and it needs to be kept. Over the years, .NET has amassed variations on a theme, with multiple types that can be used for the same purpose, and there&#8217;s then value in consolidating all usage to just one of them such that the other is more likely to be trimmed away. A good example of this is <code>Tuple&lt;&gt;<\/code> and <code>ValueTuple&lt;&gt;<\/code>. Throughout the codebase there are occurrences of both; these PRs replace a bunch of <code>Tuple&lt;&gt;<\/code> uses with <code>ValueTuple&lt;&gt;<\/code>, which not only helps to avoid allocations in some cases but also makes it much more likely that more <code>Tuple&lt;&gt;<\/code> types can be trimmed away.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43634\">dotnet\/runtime#43634<\/a>. 
The nature of <code>ValueTuple&lt;&gt;<\/code> is that there are lots of copies of <code>ValueTuple&lt;&gt;<\/code>, one for each arity (e.g. <code>ValueTuple&lt;T1, T2&gt;<\/code>, <code>ValueTuple&lt;T1, T2, T3&gt;<\/code>, etc.), and then because it&#8217;s a generic often used with value types, every generic instantiation of a given tuple type ends up duplicating all of the assembly for that type. Thus, it&#8217;s valuable to keep the types as slim as possible. This PR introduced a throw helper and then replaced ~20 throw sites from across the types, reducing the amount of assembly required for each.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45054\">dotnet\/runtime#45054<\/a>. This PR represents a typical course of action when trying to reduce the size of a trimmed application: cutting unnecessary references from commonly used code to less commonly used code. In this case, <code>RuntimeType<\/code> (the derivation of <code>Type<\/code> that&#8217;s most commonly used) has a dependency on the <code>Convert<\/code> type, which makes it so that the trimmer can&#8217;t trim away <code>Convert<\/code>&#8216;s static constructor. This PR rewrites the relevant, small portions of <code>RuntimeType<\/code> to not need <code>Convert<\/code> at all.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45127\">dotnet\/runtime#45127<\/a>. If a type&#8217;s static constructor initializes a field, the trimmer is unable to remove the field or the thing it&#8217;s being initialized to, so for rarely used fields, it can be beneficial to initialize them lazily. This PR makes the <code>Task&lt;T&gt;.Factory<\/code> property lazily-initialized (<code>Task.Factory<\/code> remains non-lazily-initialized, as it&#8217;s much more commonly used), which then also makes it more likely that <code>TaskFactory&lt;T&gt;<\/code> can be trimmed away.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45239\">dotnet\/runtime#45239<\/a>. 
It&#8217;s very common when multiple overloads of something exist for the simpler overload to delegate to the more complicated overload. However, typically the more complicated overload has more inherent dependencies than would the simpler one, and so from a trimming perspective, it can actually be beneficial to invert the dependency chain, and have the more complicated overload delegate to the simpler one to handle the subset of functionality required for the simple one. This PR does that for <code>UTF8Encoding<\/code>&#8216;s constructors and <code>TaskFactory<\/code>&#8216;s constructors.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52681\">dotnet\/runtime#52681<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52794\">dotnet\/runtime#52794<\/a>. Sometimes analyzing the code that remains after trimming makes you rethink whether functionality you have in your library or app is actually necessary. In doing so for System.Net.Http.dll, we realized we were keeping around a ton of mail address-related parsing code in the assembly, purely in the name of validating <code>From<\/code> headers in a way that wasn&#8217;t particularly useful, so we removed it. We also avoided including code into the WASM build of the assembly that wouldn&#8217;t actually be used in that build. These efforts shrank the size of the assembly by almost 15%.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45296\">dotnet\/runtime#45296<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45643\">dotnet\/runtime#45643<\/a>. To support various concepts from the globalization APIs when using the ICU globalization implementation, <code>System.Private.CoreLib<\/code> carries with it several sizeable data tables. 
These PRs significantly reduce the size of that data, by encoding it in a much more compact form and by accessing the blittable data as a <code>ReadOnlySpan&lt;byte&gt;<\/code> around the data directly from the assembly.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46061\">dotnet\/runtime#46061<\/a>. Similarly, to support ordinal case conversion, <code>System.Private.CoreLib<\/code> carries a large casing table. This table is stored in memory in a static <code>ushort[]?[]<\/code> array, and previously, it was initialized with collection-initialization syntax. That resulted in the generated static constructor for initializing the array being over 1KB of IL instructions. This PR changed it to actually store the data in the assembly encoded as bytes, and then in the constructor create the <code>ushort[]?[]<\/code> from a <code>ReadOnlySpan&lt;byte&gt;<\/code> over that data.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48906\">dotnet\/runtime#48906<\/a>. This is also in a similar vein to the previous ICU changes. <code>WebUtility<\/code> has a static <code>Dictionary&lt;ulong, char&gt;<\/code> lookup table. Previously, that dictionary was being initialized in a manner that led to <code>WebUtility<\/code>&#8216;s static constructor being over 17KB of IL. This PR reduces it to less than 300B.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46211\">dotnet\/runtime#46211<\/a>. The trimmer looks at IL to determine whether some code uses some other code, but there are dependencies it can&#8217;t always see. There are multiple ways a developer can inform the trimmer it should keep some code around even if it doesn&#8217;t know why. One is via a special XML file that can be fed to the trimmer to tell it which members should be considered rooted and not trimmed away. That mechanism, however, is a very large hammer. The preferred mechanism is a set of attributes that allow for the information to be much more granular. 
In particular, the <code>DynamicDependencyAttribute<\/code> lets the developer declare that some member <code>A<\/code> should be kept if some other member <code>B<\/code> is kept. This PR switches some rooting with the XML file to instead use the attributes.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47918\">dotnet\/runtime#47918<\/a>. Since its porting to .NET Core, LINQ has received a lot of attention, as it&#8217;s a ripe area for optimization. A set of the optimizations that went into LINQ involved adding a few new internal interfaces that could then be implemented on a bunch of types representing the various LINQ combinators in order to communicate additional data between operators. This resulted in massive speedups for certain operations, however it also added a significant amount of IL to System.Linq.dll, around 20K uncompressed (around 6K compressed). And it has the potential to result in an order of magnitude more assembly code, depending on how these types are instantiated. Because of the latter issue, a special-build flavor was previously added to the assembly, so that it could be built without those optimizations that were contributing significantly to its size. This PR cleaned that up and extended it so that the size-optimized build could be used for Blazor WASM and other mobile targets.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53317\">dotnet\/runtime#53317<\/a>. <code>System.Text.Json<\/code>&#8216;s <code>JsonSerializer<\/code> was using <code>[DynamicDependency]<\/code> to root all of the <code>System.Collections.Immutable<\/code> collections, just in case they were used. 
This PR undoes that dependency, saving ~28KB compressed in a default Blazor WASM application.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44696\">dotnet\/runtime#44696<\/a> from <a href=\"https:\/\/github.com\/benaadams\">@benaadams<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44734\">dotnet\/runtime#44734<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44825\">dotnet\/runtime#44825<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47496\">dotnet\/runtime#47496<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47873\">dotnet\/runtime#47873<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/53123\">dotnet\/runtime#53123<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/56937\">dotnet\/runtime#56937<\/a>. We have a love\/hate relationship with LINQ. On the one hand, it&#8217;s an invaluable tool for quickly expressing complicated operations with very little code, and in the common case, it&#8217;s perfectly fine to use and we encourage applications to use it. On the other hand, from a performance perspective, LINQ isn&#8217;t stellar, even as we&#8217;ve invested significantly in improving it. From a throughput and memory perspective, simple LINQ operations will invariably be more expensive than hand-rolled versions of the same thing, if for no other reason than because the expressibility it provides means functionality is passed around as delegates, enumerators are allocated, and so on. And from a size perspective, all that functionality comes with a lot of IL (and most of the time, any attempts we make to increase throughput also increase size). If in the libraries we can replace LINQ usage with only minimally larger open-coded replacements, we&#8217;ll typically do so.<\/li>\n<li><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/39549\">dotnet\/runtime#39549<\/a>. 
<code>dotnet.wasm<\/code> contains the compiled WebAssembly for the mono runtime used in Blazor WASM apps. The more features unnecessary for this scenario (e.g. various debugging features, dead code in this configuration, etc.) that can be removed in the build, the smaller the file will be.<\/li>\n<\/ul>\n<p>Now, let&#8217;s take the size reduction a step further. The runtime itself is contained in the <code>dotnet.wasm<\/code> file, but when we trim the app as part of publishing, we&#8217;re only trimming the managed assemblies, not the runtime, as the SDK itself doesn&#8217;t include the tools necessary to do so. We can rectify that by installing the <code>wasm-tools<\/code> workload via <code>dotnet<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43785\">dotnet\/runtime#43785<\/a>):<\/p>\n<pre><code class=\"language-console\">dotnet workload install wasm-tools<\/code><\/pre>\n<p>With that installed, we can publish again, exactly as before:<\/p>\n<pre><code class=\"language-console\">dotnet publish -c Release<\/code><\/pre>\n<p>but now we see some extra output (and it takes longer to publish):<\/p>\n<pre><code class=\"language-console\">D:\\examples\\app6&gt; dotnet publish -c Release\r\nMicrosoft (R) Build Engine version 17.0.0-preview-21411-06+b0bb46ab8 for .NET\r\nCopyright (C) Microsoft Corporation. All rights reserved.\r\n\r\n  Determining projects to restore...\r\n  Restored D:\\examples\\app6\\app6.csproj (in 245 ms).\r\n  You are using a preview version of .NET. See: https:\/\/aka.ms\/dotnet-core-preview\r\n  app6 -&gt; D:\\examples\\app6\\bin\\Release\\net6.0\\app6.dll\r\n  app6 (Blazor output) -&gt; D:\\examples\\app6\\bin\\Release\\net6.0\\wwwroot\r\n  Optimizing assemblies for size, which may change the behavior of the app. Be sure to test after publishing. See: https:\/\/aka.ms\/dotnet-illink\r\n  Compiling native assets with emcc. This may take a while ...\r\n  Linking with emcc. 
This may take a while ...\r\n  Compressing Blazor WebAssembly publish artifacts. This may take a while...\r\n  app6 -&gt; D:\\examples\\app6\\bin\\Release\\net6.0\\publish\r\nD:\\examples\\app6&gt;<\/code><\/pre>\n<p>See those two extra <code>emcc<\/code>-related lines. <code>emcc<\/code> is the Emscripten front-end compiler (Emscripten is a compiler toolchain for compiling to WebAssembly), and what we&#8217;re seeing here is <code>dotnet.wasm<\/code> being relinked so as to remove functionality from the binary that&#8217;s not used by the trimmed binaries in the application. If I now re-measure the size of the relevant <code>.br<\/code> files in <code>D:\\examples\\app6\\bin\\Release\\net6.0\\publish\\wwwroot\\_framework<\/code>, it&#8217;s now <code>1.82MB<\/code>, so we&#8217;ve removed an additional <code>60KB<\/code> from the published application size.<\/p>\n<p>We can go further, though. I&#8217;ll add two lines to my app6.csproj in the top <code>&lt;PropertyGroup&gt;...&lt;\/PropertyGroup&gt;<\/code> section:<\/p>\n<pre><code class=\"language-xml\">    &lt;InvariantGlobalization&gt;true&lt;\/InvariantGlobalization&gt;\r\n    &lt;BlazorEnableTimeZoneSupport&gt;false&lt;\/BlazorEnableTimeZoneSupport&gt;    <\/code><\/pre>\n<p>These are feature switches, and serve two purposes. First, they can be queried by code in the app (and, in particular, in the core libraries) to determine what functionality to employ. For example, if you search dotnet\/runtime for &#8220;GlobalizationMode.Invariant&#8221;, you&#8217;ll find code along the lines of:<\/p>\n<pre><code class=\"language-C#\">if (GlobalizationMode.Invariant)\r\n{\r\n    ... \/\/ only use invariant culture \/ functionality\r\n}\r\nelse\r\n{\r\n    ... \/\/ use ICU or NLS for processing based on the appropriate culture\r\n}<\/code><\/pre>\n<p>Second, the switch informs the trimmer that it can substitute in a fixed value for the property associated with the switch, e.g. 
setting <code>&lt;InvariantGlobalization&gt;true&lt;\/InvariantGlobalization&gt;<\/code> causes the trimmer to rewrite the <code>GlobalizationMode.Invariant<\/code> property to be hardcoded to return <code>true<\/code>, at which point it can then use that to prune away any visibly unreachable code paths. That means in an example like the code snippet above, the trimmer can elide the entire <code>else<\/code> block, and if that ends up meaning additional types and members become unused, they can be removed, as well. By setting the two aforementioned switches, we&#8217;re eliminating any need the app has for the ICU globalization library, which is a significant portion of the app&#8217;s size, both in terms of the logic linked into <code>dotnet.wasm<\/code> and the data necessary to drive it (<code>icudt.dat.br<\/code>). With those switches set, we can re-publish (after deleting the old <code>publish<\/code> directory). Two things I immediately notice. First, there aren&#8217;t any <code>icu*.br<\/code> files at all, as there&#8217;s no longer a need for anything ICU-related. And second, all of the <code>.br<\/code> files weigh in at only <code>1.07MB<\/code>, removing another 750KB from the app&#8217;s size, more than 40% of where we were before.<\/p>\n<h3>Blazor and mono<\/h3>\n<p>Ok, so we&#8217;ve got our Blazor WASM app, and we&#8217;re able to ship a small package down to the browser to execute it. Does it run efficiently?<\/p>\n<p>The <code>dotnet.wasm<\/code> file mentioned previously contains the .NET runtime used to execute these applications. The runtime is itself compiled to WASM, downloaded to the browser, and used to execute the application and library code on which the app depends. I say &#8220;the runtime&#8221; here, but in reality there are actually multiple incarnations of a runtime for .NET. 
In .NET 6, all of the .NET core libraries for all of the .NET app models, whether it be console apps or ASP.NET Core or Blazor WASM or mobile apps, come from the same source in dotnet\/runtime, but there are actually two runtime implementations in dotnet\/runtime: &#8220;coreclr&#8221; and &#8220;mono&#8221;. In this blog post, when I&#8217;ve talked about runtime improvements in components like &#8220;the&#8221; JIT and GC, I&#8217;ve actually been referring to coreclr, which is what&#8217;s used for console apps, ASP.NET Core, Windows Forms, and WPF. Blazor WebAssembly, however, relies on mono, which has been honed over the years to be small and agile for these kinds of scenarios, and has also received a lot of performance investment in .NET 6.<\/p>\n<p>There are three significant areas of investment here. The first is around improvements to the IL interpreter in mono. Mono not only has a JIT capable of on-demand assembly generation ala coreclr, it also supports interpreting IL, which is valuable on platforms that for security reasons prohibit executing machine assembly code generated on the fly. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/46037\">dotnet\/runtime#46037<\/a> overhauled the interpreter to move it from being stack-based (where IL instructions push and pop values from a stack) to being based on the concept of reading\/writing local variables, a switch that both simplified the code base and gave it a performance boost. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48513\">dotnet\/runtime#48513<\/a> improved the interpreter&#8217;s ability to inline, in particular for methods attributed with <code>[MethodImpl(MethodImplOptions.AggressiveInlining)]<\/code>, which is important with the libraries in dotnet\/runtime as some of the lower-level processing routines make strategic use of <code>AggressiveInlining<\/code> in places it&#8217;s been measured to yield impactful gains. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50361\">dotnet\/runtime#50361<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51273\">dotnet\/runtime#51273<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52130\">dotnet\/runtime#52130<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/52242\">dotnet\/runtime#52242<\/a> all served to optimize how various kinds of instructions were encoded and invoked, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51309\">dotnet\/runtime#51309<\/a> improved the efficiency of <code>finally<\/code> blocks by removing overhead associated with thread aborts, which no longer exist in .NET (.NET Framework 4.8 and earlier have the concept of a thread abort, where one thread can inject a special exception into another, and that exception could end up being thrown at practically any instruction; by default, however, they don&#8217;t interrupt <code>finally<\/code> blocks).<\/p>\n<p>The second area of investment was around <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/hardware-intrinsics-in-net-core\/\">hardware intrinsics<\/a>. .NET Core 3.0 and .NET 5 added literally thousands of new methods, each of which maps effectively 1:1 to some hardware-specific instruction, enabling C# code to directly target functionality from various ISAs (Instruction Set Architectures) like SSSE3 or AVX2. Of course, something needs to be able to translate the C# methods into the underlying instructions they represent, which means a lot of work to fully enable every code generator. 
Mono supports using LLVM for code generation, and a bunch of PRs improved the LLVM-enabled mono&#8217;s support for hardware intrinsics, whether it be <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49260\">dotnet\/runtime#49260<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49737\">dotnet\/runtime#49737<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48361\">dotnet\/runtime#48361<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47482\">dotnet\/runtime#47482<\/a> adding support for ARM64 AdvSimd APIs; <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48413\">dotnet\/runtime#48413<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47337\">dotnet\/runtime#47337<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/48525\">dotnet\/runtime#48525<\/a> rounding out the support for the Sha1, Sha256, and Aes intrinsics; or <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54924\">dotnet\/runtime#54924<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/47028\">dotnet\/runtime#47028<\/a> implementing foundational support with <code>Vector64&lt;T&gt;<\/code> and <code>Vector128&lt;T&gt;<\/code>. Many of the library performance improvements highlighted in previous blog posts rely on the throughput improvements from vectorization, which then accrue here as well, which includes when building Blazor WASM apps with AOT.<\/p>\n<p>And that brings us to the third, and arguably most impactful, area of investment: AOT for Blazor WASM. I highlighted earlier that Blazor WASM apps targeting .NET 5 were interpreted, meaning while the runtime itself was compiled to WASM, the runtime then turned around and interpreted the IL for the app and the libraries it depends on. Now with .NET 6, a Blazor WASM app can be compiled ahead of time entirely to WebAssembly, avoiding the need for JIT&#8217;ing or interpreting at run-time. 
All of these improvements together lead to huge, cross-cutting performance improvements for Blazor WASM apps when targeting .NET 6 instead of .NET 5.<\/p>\n<p>Let&#8217;s do one last benchmark. Continuing with the <code>app5<\/code> and <code>app6<\/code> examples from the previous section, we&#8217;ll do something that involves a bit of computation: SHA-256. The implementation of <code>SHA256<\/code> used for Blazor WASM on both .NET 5 and .NET 6 is exactly the same, and is implemented in C#, making it a reasonable test case. I&#8217;ve replaced the entire contents of the Counter.razor file in both of those projects with this, which in response to a button click is simply SHA-256 hashing a byte array of some UTF8 Shakespeare several thousand times.<\/p>\n<pre><code class=\"language-C#\">@page \"\/counter\"\r\n@using System.Security.Cryptography\r\n@using System.Diagnostics\r\n@using System.Text\r\n\r\n&lt;h1&gt;Hashing&lt;\/h1&gt;\r\n\r\n&lt;p&gt;Time: @_time&lt;\/p&gt;\r\n\r\n&lt;button class=\"btn btn-primary\" @onclick=\"Hash\"&gt;Click me&lt;\/button&gt;\r\n\r\n@code {\r\n    private const string Sonnet18 =\r\n@\"Shall I compare thee to a summer\u2019s day?\r\nThou art more lovely and more temperate:\r\nRough winds do shake the darling buds of May,\r\nAnd summer\u2019s lease hath all too short a date;\r\nSometime too hot the eye of heaven shines,\r\nAnd often is his gold complexion dimm'd;\r\nAnd every fair from fair sometime declines,\r\nBy chance or nature\u2019s changing course untrimm'd;\r\nBut thy eternal summer shall not fade,\r\nNor lose possession of that fair thou ow\u2019st;\r\nNor shall death brag thou wander\u2019st in his shade,\r\nWhen in eternal lines to time thou grow\u2019st:\r\nSo long as men can breathe or eyes can see,\r\nSo long lives this, and this gives life to thee.\";\r\n\r\n    private TimeSpan _time;\r\n\r\n    private void Hash()\r\n    {\r\n        byte[] bytes = Encoding.UTF8.GetBytes(Sonnet18);\r\n        var sw = 
Stopwatch.StartNew();\r\n        for (int i = 0; i &lt; 2000; i++)\r\n        {\r\n            _ = SHA256.HashData(bytes);\r\n        }\r\n        _time = sw.Elapsed;\r\n    }\r\n}<\/code><\/pre>\n<p>I&#8217;ll start by publishing the <code>app5<\/code> app:<\/p>\n<pre><code class=\"language-console\">D:\\examples\\app5&gt; dotnet publish -c Release<\/code><\/pre>\n<p>Then to run it, we need a web server to host the server side of the app, and to make that easy, I&#8217;ll use the <a href=\"https:\/\/www.nuget.org\/packages\/dotnet-serve\/\"><code>dotnet serve<\/code><\/a> global tool. To install it, run:<\/p>\n<pre><code class=\"language-console\">dotnet tool install --global dotnet-serve<\/code><\/pre>\n<p>at which point you can start a simple web server for the files in the published directory:<\/p>\n<pre><code class=\"language-console\">pushd D:\\examples\\app5\\bin\\Release\\net5.0\\publish\\wwwroot\r\ndotnet serve -o<\/code><\/pre>\n<p>click <code>Counter<\/code>, and then click the <code>Click me<\/code> button a few times. I get resulting numbers like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/08\/Hashing1.png\" alt=\"SHA256 benchmark on .NET 5\" \/><\/p>\n<p>so ~0.45 seconds on .NET 5. Now I can do the exact same thing on .NET 6 with the <code>app6<\/code> project:<\/p>\n<pre><code class=\"language-console\">popd\r\ncd ..\\app6\r\ndotnet publish -c Release\r\npushd D:\\examples\\app6\\bin\\Release\\net6.0\\publish\\wwwroot\r\ndotnet serve -o<\/code><\/pre>\n<p>and I get results like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/08\/Hashing2.png\" alt=\"SHA256 benchmark on .NET 6\" \/><\/p>\n<p>so ~0.28 seconds on .NET 6. 
That ~40% improvement is due to the interpreter optimizations, as we&#8217;re otherwise running the exact same code.<\/p>\n<p>Now, let&#8217;s try this out with AOT. 
I modify the <code>app6.csproj<\/code> to include this in the top <code>&lt;PropertyGroup&gt;...&lt;\/PropertyGroup&gt;<\/code> node:<\/p>\n<pre><code class=\"language-xml\">&lt;RunAOTCompilation&gt;true&lt;\/RunAOTCompilation&gt;<\/code><\/pre>\n<p>Then I republish (and get a cup of coffee&#8230; the AOT step adds some time to the build process). With that, I now get results like this:<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/08\/Hashing3.png\" alt=\"SHA256 benchmark on .NET 6 with AOT\" \/><\/p>\n<p>so ~0.018 seconds, making it ~16x faster than it was before. A nice note to end this post on.<\/p>\n<h3>Is that all?<\/h3>\n<p>Of course not! \ud83d\ude42 Whether it be for <code>System.Xml<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/49988\">dotnet\/runtime#49988<\/a> from <a href=\"https:\/\/github.com\/kronic\">@kronic<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54344\">dotnet\/runtime#54344<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54299\">dotnet\/runtime#54299<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54346\">dotnet\/runtime#54346<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54356\">dotnet\/runtime#54356<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/54836\">dotnet\/runtime#54836<\/a>), or caching (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/51761\">dotnet\/runtime#51761<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45410\">dotnet\/runtime#45410<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45563\">dotnet\/runtime#45563<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/45280\">dotnet\/runtime#45280<\/a>), or <code>System.Drawing<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50489\">dotnet\/runtime#50489<\/a> from <a href=\"https:\/\/github.com\/L2\">@L2<\/a>, <a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/50622\">dotnet\/runtime#50622<\/a> from <a href=\"https:\/\/github.com\/L2\">@L2<\/a>), or <code>System.Diagnostics.Process<\/code> (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44691\">dotnet\/runtime#44691<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/43365\">dotnet\/runtime#43365<\/a> from <a href=\"https:\/\/github.com\/am11\">@am11<\/a>), or any number of other areas, there have been an untold number of performance improvements in .NET 6 that I haven&#8217;t been able to do justice to in this post.<\/p>\n<p>There are also many outstanding PRs in dotnet\/runtime that haven&#8217;t yet been merged but may be for .NET 6.  For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/57079\">dotnet\/runtime#57079<\/a> enables support for TLS resumption on Linux, which has the potential to improve the time it takes to establish a secure connection by an order of magnitude.  Or <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/55745\">dotnet\/runtime#55745<\/a>, which enables the JIT to fold <code>TimeSpan.FromSeconds(constant)<\/code> (and other such <code>From<\/code> methods) into a single instruction.  Or <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/35565\">dotnet\/runtime#35565<\/a> from <a href=\"https:\/\/github.com\/sakno\">@sakno<\/a>, which uses spans more aggressively throughout the implementation of <code>BigInteger<\/code>.  So much goodness already merged and so much more on the way.<\/p>\n<p>Don&#8217;t just take my word for it, though. Please <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/6.0\">download .NET 6<\/a> and give it a spin. I&#8217;m quite hopeful you&#8217;ll like what you see. If you do, tell us. If you don&#8217;t, tell us. We want to hear from you, and even more than that, we&#8217;d love your involvement. 
Of the ~400 merged PRs linked to in this blog post, over 15% of them came from the .NET community outside of Microsoft, and we&#8217;d love to see that number grow even higher. If you&#8217;ve got ideas for improvements or the inclination to try to make them a reality, please join us for a fun and fulfilling time in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>.<\/p>\n<p>Happy coding!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>.NET 6 is chock-full of exciting performance improvements.<\/p>\n","protected":false},"author":360,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,3012,756,3009],"tags":[8082],"class_list":["post-33921","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-internals","category-csharp","category-performance","tag-dotnetperf"],"acf":[],"blog_post_summary":"<p>.NET 6 is chock-full of exciting performance 
improvements.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/33921","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/360"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=33921"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/33921\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=33921"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=33921"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=33921"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}