January 27th, 2022

Performance improvements in ASP.NET Core 6

Inspired by Stephen Toub's blog posts about performance in .NET, we are writing a similar post to highlight the performance improvements made to ASP.NET Core in 6.0.

Benchmarking Setup

We will be using BenchmarkDotNet for the majority of the examples throughout. A repo at https://github.com/BrennanConroy/BlogPost60Bench is provided that includes most of the benchmarks used in this post.

Most of the benchmark results in this post were generated with the following command line:

dotnet run -c Release -f net48 --runtimes net48 netcoreapp3.1 net5.0 net6.0

Then we select a specific benchmark to run from the list.

This tells BenchmarkDotNet:

  • Build everything in a release configuration.
  • Build it targeting the .NET Framework 4.8 surface area.
  • Run each benchmark on each of .NET Framework 4.8, .NET Core 3.1, .NET 5, and .NET 6.

Some benchmarks were run only on .NET 6 (e.g. when comparing two ways of coding something on the same version):

dotnet run -c Release -f net6.0 --runtimes net6.0

and for others only a subset of the versions were run, e.g.

dotnet run -c Release -f net5.0 --runtimes net5.0 net6.0

I’ll include the command used to run each of the benchmarks as they come up.

Most of the results in the post were generated by running the above benchmarks on Windows, primarily so that .NET Framework 4.8 could be included in the result set. However, unless otherwise called out, in general all of these benchmarks show comparable improvements when run on Linux or on macOS. Simply ensure that you have installed each runtime you want to measure. The benchmarks were run with a nightly build of .NET 6 RC1, along with the latest released downloads of .NET 5 and .NET Core 3.1.

Span<T>

With every release since the addition of Span&lt;T&gt; in .NET Core 2.1, we have converted more code to use spans, both internally and as part of the public API, to improve performance. This release is no exception.

PR dotnet/aspnetcore#28855 removed a temporary string allocation in PathString that came from string.Substring when adding two PathString instances, using a Span&lt;char&gt; for the temporary string instead. In the benchmark below we use a short string and a longer string to show the performance difference from avoiding the temporary string.

dotnet run -c Release -f net48 --runtimes net48 net5.0 net6.0 --filter *PathStringBenchmark*
private PathString _first = new PathString("/first/");
private PathString _second = new PathString("/second/");
private PathString _long = new PathString("/longerpathstringtoshowsubstring/");

[Benchmark]
public PathString AddShortString()
{
    return _first.Add(_second);
}

[Benchmark]
public PathString AddLongString()
{
    return _first.Add(_long);
}
| Method | Runtime | Toolchain | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- | --- |
| AddShortString | .NET Framework 4.8 | net48 | 23.51 ns | 1.00 | 96 B |
| AddShortString | .NET 5.0 | net5.0 | 22.73 ns | 0.97 | 96 B |
| AddShortString | .NET 6.0 | net6.0 | 14.92 ns | 0.64 | 56 B |
| AddLongString | .NET Framework 4.8 | net48 | 30.89 ns | 1.00 | 201 B |
| AddLongString | .NET 5.0 | net5.0 | 25.18 ns | 0.82 | 192 B |
| AddLongString | .NET 6.0 | net6.0 | 15.69 ns | 0.51 | 104 B |

dotnet/aspnetcore#34001 introduced a new Span&lt;T&gt;-based API for enumerating a query string that is allocation-free in the common case of no encoded characters, and allocates less when the query string contains encoded characters.

dotnet run -c Release -f net6.0 --runtimes net6.0 --filter *QueryEnumerableBenchmark*
#if NET6_0_OR_GREATER
    public enum QueryEnum
    {
        Simple = 1,
        Encoded,
    }

    [ParamsAllValues]
    public QueryEnum QueryParam { get; set; }

    private string SimpleQueryString = "?key1=value1&key2=value2";
    private string QueryStringWithEncoding = "?key1=valu%20&key2=value%20";

    [Benchmark(Baseline  = true)]
    public void QueryHelper()
    {
        var queryString = QueryParam == QueryEnum.Simple ? SimpleQueryString : QueryStringWithEncoding;
        foreach (var queryParam in QueryHelpers.ParseQuery(queryString))
        {
            _ = queryParam.Key;
            _ = queryParam.Value;
        }
    }

    [Benchmark]
    public void QueryEnumerable()
    {
        var queryString = QueryParam == QueryEnum.Simple ? SimpleQueryString : QueryStringWithEncoding;
        foreach (var queryParam in new QueryStringEnumerable(queryString))
        {
            _ = queryParam.DecodeName();
            _ = queryParam.DecodeValue();
        }
    }
#endif
| Method | QueryParam | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- |
| QueryHelper | Simple | 243.13 ns | 1.00 | 360 B |
| QueryEnumerable | Simple | 91.43 ns | 0.38 | – |
| QueryHelper | Encoded | 351.25 ns | 1.00 | 432 B |
| QueryEnumerable | Encoded | 197.59 ns | 0.56 | 152 B |

It’s important to note that there is no free lunch. With the new QueryStringEnumerable API, if you plan on enumerating the query string values multiple times, it can actually be more expensive than using QueryHelpers.ParseQuery and storing the dictionary of parsed query string values, because every enumeration re-decodes the values.

dotnet/aspnetcore#29448 from @paulomorgado uses the string.Create method, which lets you initialize a string's contents after it's created, provided you know the final size up front. This was used to remove some temporary string allocations in UriHelper.BuildAbsolute.

dotnet run -c Release -f netcoreapp3.1 --runtimes netcoreapp3.1 net6.0 --filter *UriHelperBenchmark*
#if NETCOREAPP
    [Benchmark]
    public void BuildAbsolute()
    {
        _ = UriHelper.BuildAbsolute("https", new HostString("localhost"));
    }
#endif
| Method | Runtime | Toolchain | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- | --- |
| BuildAbsolute | .NET Core 3.1 | netcoreapp3.1 | 92.87 ns | 1.00 | 176 B |
| BuildAbsolute | .NET 6.0 | net6.0 | 52.88 ns | 0.57 | 64 B |

PR dotnet/aspnetcore#31267 converted some parsing logic in ContentDispositionHeaderValue to use Span&lt;T&gt;-based APIs, avoiding temporary strings and a temporary byte[] in common cases.

dotnet run -c Release -f net48 --runtimes net48 netcoreapp3.1 net5.0 net6.0 --filter *ContentDispositionBenchmark*
[Benchmark]
public void ParseContentDispositionHeader()
{
    var contentDisposition = new ContentDispositionHeaderValue("inline");
    contentDisposition.FileName = "FileÃName.bat";
}
| Method | Runtime | Toolchain | Mean | Ratio | Allocated |
| --- | --- | --- | --- | --- | --- |
| ContentDispositionHeader | .NET Framework 4.8 | net48 | 654.9 ns | 1.00 | 570 B |
| ContentDispositionHeader | .NET Core 3.1 | netcoreapp3.1 | 581.5 ns | 0.89 | 536 B |
| ContentDispositionHeader | .NET 5.0 | net5.0 | 519.2 ns | 0.79 | 536 B |
| ContentDispositionHeader | .NET 6.0 | net6.0 | 295.4 ns | 0.45 | 312 B |

Idle Connections

One of the major components of ASP.NET Core is hosting a server, which brings with it a host of different problems to optimize for. We'll focus on improvements to idle connections in 6.0, where we made many changes to reduce the amount of memory used while a connection is waiting for data.

We made three distinct types of changes. The first was to reduce the size of the objects used by connections; this includes System.IO.Pipelines, SocketConnections, and SocketSenders. The second was to pool commonly accessed objects so we can reuse old instances and save on allocations. The third was to take advantage of something called “zero-byte reads”: we try to read from the connection with a zero-byte buffer, and when data becomes available the read completes without consuming any data, telling us data is ready so we can immediately provide a buffer to read it into. This avoids allocating a buffer up front for a read that may complete at some future time, so we avoid a large allocation until we know data is available.

dotnet/runtime#49270 reduced the size of System.IO.Pipelines from ~560 bytes to ~368 bytes, a 34% size reduction; since there are at least two pipes per connection, this was a great win. dotnet/aspnetcore#31308 refactored the socket layer of Kestrel to avoid a few async state machines and shrink the remaining ones, giving a ~33% allocation savings for each connection.

dotnet/aspnetcore#30769 removed a per-connection PipeOptions allocation and moved the allocation to the connection factory, so we allocate only one for the entire lifetime of the server and reuse the same options for every connection. dotnet/aspnetcore#31311 from @benaadams replaced well-known header values in WebSocket requests with interned strings, which allowed the strings allocated during header parsing to be garbage collected, reducing the memory usage of long-lived WebSocket connections. dotnet/aspnetcore#30771 refactored the sockets layer in Kestrel to avoid allocating both a SocketReceiver object and a SocketAwaitableEventArgs by combining them into a single object; that saved a few bytes and resulted in fewer unique objects allocated per connection. That PR also pooled the SocketSender class, so instead of creating one per connection you now have, on average, only as many as you have cores. In the benchmark below with 10,000 connections, only 16 were allocated on my machine instead of 10,000, a savings of ~46 MB!

Another similarly sized change is dotnet/runtime#49123, which adds support for zero-byte reads in SslStream, so our 10,000 idle connections go from ~46 MB to ~2.3 MB of SslStream allocations. dotnet/runtime#49117 added support for zero-byte reads on StreamPipeReader, which Kestrel then used in dotnet/aspnetcore#30863 to enable zero-byte reads in SslStream.

The culmination of all these changes is a massive reduction in memory usage for idle connections.

The following numbers are not from a BenchmarkDotNet app, as they measure idle connections, which was easier to set up with separate client and server applications.

Console and WebApplication code are pasted in the following gist: https://gist.github.com/BrennanConroy/02e8459d63305b4acaa0a021686f54c7

Below is the amount of memory 10,000 idle secure WebSocket connections (WSS) take on the server on different frameworks.

| Framework | Memory |
| --- | --- |
| net48 | 665.4 MB |
| net5.0 | 603.1 MB |
| net6.0 | 160.8 MB |

That’s an almost 4x memory reduction from net5.0 to net6.0!

Entity Framework Core

EF Core made some massive improvements in 6.0: it is 31% faster at executing queries, and the TechEmpower Fortunes benchmark improved by 70% with runtime updates, optimized benchmarks, and the EF improvements.

These improvements came from improving object pooling, intelligently checking if telemetry is enabled, and adding an option to opt out of thread safety checks when you know your app uses DbContext safely.

See the Announcing Entity Framework Core 6.0 Preview 4: Performance Edition blog post which highlights many of the improvements in detail.

Blazor

Native byte[] Interop

Blazor now has efficient support for byte arrays when performing JavaScript interop. Previously, byte arrays sent to and from JavaScript were Base64-encoded so they could be serialized as JSON, which increased the transfer size and the CPU load. That Base64 encoding has been optimized away in .NET 6, allowing users to work transparently with byte[] in .NET and Uint8Array in JavaScript. Documentation is available on using this feature from JavaScript to .NET and from .NET to JavaScript.

Let’s take a look at a quick benchmark to see the difference between byte[] interop in .NET 5 and .NET 6. The following Razor code creates a 22 kB byte[] and sends it to a JavaScript receiveAndReturnBytes function, which immediately returns the byte[]. This roundtrip is repeated 10,000 times, and the timing data is printed to the screen. This code is the same for .NET 5 and .NET 6.

@inject IJSRuntime JSRuntime

<button @onclick="@RoundtripData">Roundtrip Data</button>

<hr />

@Message

@code {
    public string Message { get; set; } = "Press button to benchmark";

    private async Task RoundtripData()
    {
        var bytes = new byte[1024*22];
        List<double> timeForInterop = new List<double>();
        var testTime = DateTime.Now;

        for (var i = 0; i < 10_000; i++)
        {
            var interopTime = DateTime.Now;

            var result = await JSRuntime.InvokeAsync<byte[]>("receiveAndReturnBytes", bytes);

            timeForInterop.Add(DateTime.Now.Subtract(interopTime).TotalMilliseconds);
        }

        Message = $"Round-tripped: {bytes.Length / 1024d} kB 10,000 times and it took on average {timeForInterop.Average():F3}ms, and in total {DateTime.Now.Subtract(testTime).TotalMilliseconds:F1}ms";
    }
}

Next, let's look at the receiveAndReturnBytes JavaScript function. In .NET 5, we must first decode the Base64-encoded byte array into a Uint8Array so it can be used in application code, then re-encode it into Base64 before returning the data to the server.

function receiveAndReturnBytes(bytesReceivedBase64Encoded) {
    const bytesReceived = base64ToArrayBuffer(bytesReceivedBase64Encoded);

    // Use Uint8Array data in application

    const bytesToSendBase64Encoded = base64EncodeByteArray(bytesReceived);

    if (bytesReceivedBase64Encoded != bytesToSendBase64Encoded) {
        throw new Error("Expected input/output to match.")
    }

    return bytesToSendBase64Encoded;
}

// https://stackoverflow.com/a/21797381
function base64ToArrayBuffer(base64) {
    const binaryString = atob(base64);
    const length = binaryString.length;
    const result = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        result[i] = binaryString.charCodeAt(i);
    }
    return result;
}

function base64EncodeByteArray(data) {
    const charBytes = new Array(data.length);
    for (var i = 0; i < data.length; i++) {
        charBytes[i] = String.fromCharCode(data[i]);
    }
    const dataBase64Encoded = btoa(charBytes.join(''));
    return dataBase64Encoded;
}

The encoding/decoding adds significant overhead on both the client and the server, and it requires extensive boilerplate code as well. So how would this be done in .NET 6? Well, it's quite a bit simpler:

function receiveAndReturnBytes(bytesReceived) {
    // bytesReceived comes as a Uint8Array ready for use
    // and can be used by the application or immediately returned.
    return bytesReceived;
}

So it’s definitely easier to write, but how does it perform? Running these snippets in a blazorserver template on .NET 5 and .NET 6 respectively, in the Release configuration, we see that .NET 6 offers a 78% performance improvement in byte[] interop!

| | .NET 6 (ms) | .NET 5 (ms) | Improvement |
| --- | --- | --- | --- |
| Total Time | 5273 | 24463 | 78% |

Additionally, this byte array interop support is leveraged within the framework to enable bidirectional streaming interop between JavaScript and .NET. Users are now able to transport arbitrary binary data. Documentation is available on streaming from .NET to JavaScript and from JavaScript to .NET.

Input File

Using the Blazor streaming interop mentioned above, we now support uploading large files via the InputFile component (previously, uploads were limited to ~2 GB). This component also features significant speed improvements thanks to native byte[] streaming, as opposed to going through Base64 encoding. For instance, a 100 MB file uploads 77% faster compared to .NET 5.

| .NET 6 (ms) | .NET 5 (ms) | Percentage |
| --- | --- | --- |
| 2591 | 10504 | 75% |
| 2607 | 11764 | 78% |
| 2632 | 11821 | 78% |
| | Average: | 77% |

Note that the streaming interop support also enables efficient downloads of (large) files; for more details, please see the documentation.

The InputFile component was upgraded to utilize streaming via dotnet/aspnetcore#33900.

Hodgepodge

dotnet/aspnetcore#30320 from @benaadams modernized our TypeScript libraries and optimized them so websites load faster. The signalr.min.js file went from 36.8 kB compressed and 132 kB uncompressed to 16.1 kB compressed and 42.2 kB uncompressed. And the blazor.server.js file went from 86.7 kB compressed and 276 kB uncompressed to 43.9 kB compressed and 130 kB uncompressed.

dotnet/aspnetcore#31322 from @benaadams removes some unnecessary casts when getting common features from a connection's feature collection. This gives a ~50% improvement when accessing common features from the collection. Unfortunately, showing the performance improvement in a standalone benchmark isn't really possible because it requires a bunch of internal types, so I'll include the numbers from the PR here; if you're interested in running them, the PR includes benchmarks that run against the internal code.

| Method | Mean | Op/s | Diff |
| --- | --- | --- | --- |
| Get<IHttpRequestFeature>* | 8.507 ns | 117,554,189.6 | +50.0% |
| Get<IHttpResponseFeature>* | 9.034 ns | 110,689,963.7 | |
| Get<IHttpResponseBodyFeature>* | 9.466 ns | 105,636,431.7 | +58.7% |
| Get<IRouteValuesFeature>* | 10.007 ns | 99,927,927.4 | +50.0% |
| Get<IEndpointFeature>* | 10.564 ns | 94,656,794.2 | +44.7% |

dotnet/aspnetcore#31519, also from @benaadams, adds default interface methods to the IHeaderDictionary type for accessing common headers via properties named after the header. No more mistyping common headers when accessing the header dictionary! More interestingly for this blog post, this change allows server implementations to return a custom header dictionary that implements these new interface methods more optimally. For example, instead of querying an internal dictionary for a header value, which requires hashing the key and looking up an entry, the server might store the header value directly in a field and return that field. This change resulted in up to ~480% improvements in some cases when getting or setting header values. Once again, properly benchmarking this change requires internal types for the setup, so I'll include the numbers from the PR; for those interested in trying it out, the PR contains benchmarks that run on the internal code. A short usage sketch follows the results table.

| Method | Branch | Type | Mean | Op/s | Delta |
| --- | --- | --- | --- | --- | --- |
| GetHeaders | before | Plaintext | 25.793 ns | 38,770,569.6 | |
| GetHeaders | after | Plaintext | 12.775 ns | 78,279,480.0 | +101.9% |
| GetHeaders | before | Common | 121.355 ns | 8,240,299.3 | |
| GetHeaders | after | Common | 37.598 ns | 26,597,474.6 | +222.8% |
| GetHeaders | before | Unknown | 366.456 ns | 2,728,840.7 | |
| GetHeaders | after | Unknown | 223.472 ns | 4,474,824.0 | +64.0% |
| SetHeaders | before | Plaintext | 49.324 ns | 20,273,931.8 | |
| SetHeaders | after | Plaintext | 34.996 ns | 28,574,778.8 | +40.9% |
| SetHeaders | before | Common | 635.060 ns | 1,574,654.3 | |
| SetHeaders | after | Common | 108.041 ns | 9,255,723.7 | +487.7% |
| SetHeaders | before | Unknown | 1,439.945 ns | 694,470.8 | |
| SetHeaders | after | Unknown | 517.067 ns | 1,933,985.7 | +178.4% |


dotnet/aspnetcore#31466 used the new CancellationTokenSource.TryReset() method introduced in .NET 6 to reuse CancellationTokenSource instances when a connection closes without being canceled. The numbers below were collected by running bombardier against Kestrel with 125 connections for ~100,000 requests.

| Branch | Type | Allocations | Bytes |
| --- | --- | --- | --- |
| Before | CancellationTokenSource | 98,314 | 4,719,072 |
| After | CancellationTokenSource | 125 | 6,000 |

dotnet/aspnetcore#31528 and dotnet/aspnetcore#34075 made similar changes to reuse CancellationTokenSource instances for HTTPS handshakes and HTTP/3 streams, respectively.

dotnet/aspnetcore#31660 improved the performance of server-to-client streaming in SignalR by reusing the allocated StreamItem object for the whole stream instead of allocating one per stream item. And dotnet/aspnetcore#31661 stores the HubCallerClients object on the SignalR connection instead of allocating it per hub method call.

dotnet/aspnetcore#31506 from @ShreyasJejurkar refactored the internals of the WebSocket handshake to avoid a temporary List&lt;T&gt; allocation. dotnet/aspnetcore#32829 from @gfoidl refactored QueryCollection to reduce allocations and vectorize some of the code. dotnet/aspnetcore#32234 from @benaadams removed an unused field in the HttpRequestHeaders enumeration, which improves performance by no longer assigning the field for every header enumerated.

dotnet/aspnetcore#31333 from @martincostello converted Http.Sys to use LoggerMessage.Define, the high-performance logging API. This avoids unnecessary boxing of value types, parsing of the logging format string, and, in some cases, allocations of strings or objects when the log level isn't enabled.

dotnet/aspnetcore#31784 adds a new IApplicationBuilder.Use overload for registering middleware that avoids some unnecessary per-request allocations when running the middleware. Old code looks like:

app.Use(async (context, next) =>
{
    await next();
});

New code looks like:

app.Use(async (context, next) =>
{
    await next(context);
});

The benchmark below simulates the middleware pipeline without setting up a server to showcase the improvement. An int is used instead of HttpContext for the request, and the middleware returns a completed task.

dotnet run -c Release -f net6.0 --runtimes net6.0 --filter *UseMiddlewareBenchmark*
private static Func<Func<int, Task>, Func<int, Task>> UseOld(Func<int, Func<Task>, Task> middleware)
{
    return next =>
    {
        return context =>
        {
            Func<Task> simpleNext = () => next(context);
            return middleware(context, simpleNext);
        };
    };
}

private static Func<Func<int, Task>, Func<int, Task>> UseNew(Func<int, Func<int, Task>, Task> middleware)
{
    return next => context => middleware(context, next);
}

Func<int, Task> Middleware = UseOld((c, n) => n())(i => Task.CompletedTask);
Func<int, Task> NewMiddleware = UseNew((c, n) => n(c))(i => Task.CompletedTask);

[Benchmark(Baseline = true)]
public Task Use()
{
    return Middleware(10);
}

[Benchmark]
public Task UseNew()
{
    return NewMiddleware(10);
}
| Method | Mean | Ratio | Allocated |
| --- | --- | --- | --- |
| Use | 15.832 ns | 1.00 | 96 B |
| UseNew | 2.592 ns | 0.16 | – |

Summary

I hope you enjoyed reading about some of the improvements made in ASP.NET Core 6.0! And I encourage you to take a look at the performance improvements in .NET 6 blog post that goes over performance in the Runtime.
