Performance Improvements in .NET Core
Update (2017/06/12): Added BenchmarkDotNet blog post link.
There are many exciting aspects to .NET Core (open source, cross platform, x-copy deployable, etc.) that have been covered in posts on this blog before. To me, though, one of the most exciting aspects of .NET Core is performance. There’s been a lot of discussion about the significant advancements that have been made in ASP.NET Core performance, its status as a top contender on various TechEmpower benchmarks, and the continual advancements being made in pushing it further. However, there’s been much less discussion about some equally exciting improvements throughout the runtime and the base class libraries.
There are way too many improvements to mention. After all, as an open source project that’s very accepting of contributions, Microsoft and community developers from around the world have found places where performance is important to them and submitted pull requests to improve things. I’d like to thank all the community developers for their .NET Core contributions, some of which are specifically called out in this post. We expect that many of these improvements will be brought to the .NET Framework over the next few releases, too. For this post, I’ll provide a tour through just a small smattering of the performance improvements you’ll find in .NET Core, and in particular in .NET Core 2.0, focusing on a few examples from a variety of the core libraries.
NOTE: This blog post contains lots of example code and timings. As with any such timings, take them with a grain of salt: these were taken on one machine in one configuration (all 64-bit processes), and so you may see different results on different systems. However, I ran each test on .NET Framework 4.7 and .NET Core 2.0 on the same machine in the same configuration at approximately the same time, providing a consistent environment for each comparison. Further, normally such testing is best done with a tool like BenchmarkDotNet; I’ve not done so for this post simply to make it easy for you to copy-and-paste the samples out into a console app and try them.
Editor: See the excellent follow-up by Andrey Akinshin where he Measures Performance Improvements in .NET Core with BenchmarkDotNet.
Collections are the bedrock of any application, and there are a multitude of collections available in the .NET libraries. Not every operation on every collection has been made faster, but many have. Some of these improvements are due to eliminating overheads, such as streamlining operations to enable better inlining, reducing instruction count, and so on. For example, consider this small example with a Queue<T>:
PR dotnet/corefx #2515 from OmariO removed from Enqueue and Dequeue a relatively expensive modulus operation that dominated the costs of these operations. On my machine, this code on .NET 4.7 produces output like this:
00:00:00.9392595 00:00:00.9390453 00:00:00.9455784 00:00:00.9508294 00:00:01.0107745
whereas with .NET Core 2.0 it produces output like this:
00:00:00.5514887 00:00:00.5662477 00:00:00.5627481 00:00:00.5685286 00:00:00.5262378
As this is “wall clock” time elapsed, smaller values are better, and this shows an ~2x increase in throughput!
In other cases, operations have been made faster by changing the algorithmic complexity of an operation. It’s often best when writing software to first write a simple implementation, one that’s easily maintained and easily proven correct. However, such implementations often don’t exhibit the best possible performance, and it’s not until a specific scenario comes along that drives a need to improve performance does that happen. For example, SortedSet<T>‘s ctor was originally written in a relatively simple way that didn’t scale well due to (I assume accidentally) employing an O(N^2) algorithm for handling duplicates. The algorithm was fixed in .NET Core in PR dotnet/corefx #1955. The following short program exemplifies the difference the fix made:
On my system, on .NET Framework this code takes ~7.7 seconds to execute. On .NET Core 2.0, that is reduced to ~0.013s, for an ~600x improvement (at least with 400K elements… as the fix changed the algorithmic complexity, the larger the set, the more the times will diverge).
Or consider this example on SortedSet<T>:
The implementation of Min and Max in .NET 4.7 walks the whole tree underlying the SortedSet<T>, but that’s unnecessary for finding just the min or the max, as the implementation can traverse down to just the relevant node. PR dotnet/corefx #11968 fixes the .NET Core implementation to do just that. On .NET 4.7, this example produces results like:
00:00:01.1427246 00:00:01.1295220 00:00:01.1350696 00:00:01.1502784 00:00:01.1677880
whereas on .NET Core 2.0, we get results like:
00:00:00.0861391 00:00:00.0861183 00:00:00.0866616 00:00:00.0848434 00:00:00.0860198
showing a sizeable decrease in time and increase in throughput.
Even a core workhorse like List<T> has found room for improvement. Due to JIT improvements and PRs like dotnet/coreclr #9539 from benaadams, core operations like List<T>.Add have gotten faster. Consider this small example:
On .NET 4.7, I get results like:
00:00:00.4434135 00:00:00.4394329 00:00:00.4496867 00:00:00.4496383 00:00:00.4515505
and with .NET Core 2.0, I see:
00:00:00.3213094 00:00:00.3211772 00:00:00.3179631 00:00:00.3198449 00:00:00.3164009
To be sure, the fact that we can do 100 million such adds and removes from a list like this in just 0.3 seconds highlights that the operation wasn’t slow to begin with. But over the execution of an app, lists are often added to a lot, and the savings add up.
These kinds of collections improvements expand beyond just the System.Collections.Generic namespace; System.Collections.Concurrent has had many improvements as well. In fact, both ConcurrentQueue<T> and ConcurrentBag<T> were essentially completely rewritten for .NET Core 2.0, in PRs dotnet/corefx #14254 and dotnet/corefx #14126, respectively. Let’s look at a basic example, using ConcurrentQueue<T> but without any concurrency, essentially the same example as earlier with Queue<T> but with ConcurrentQueue<T> instead:
On my machine on .NET 4.7, this yields output like the following:
00:00:02.6485174 00:00:02.6144919 00:00:02.6699958 00:00:02.6441047 00:00:02.6255135
Obviously the ConcurrentQueue<T> example on .NET 4.7 is slower than the Queue<T> version on .NET 4.7, as ConcurrentQueue<T> needs to employ synchronization to ensure it can be used safely concurrently. But the more interesting comparison is what happens when we run the same code on .NET Core 2.0:
00:00:01.7700190 00:00:01.8324078 00:00:01.7552966 00:00:01.7518632 00:00:01.7560811
This shows that the throughput using ConcurrentQueue<T> without any concurrency improves when switching to .NET Core 2.0 by ~30%. But there are even more interesting aspects. The changes in the implementation improved serialized throughput, but even more so reduced the synchronization between producers and consumers using the queue, which can have a more demonstrable impact on throughput. Consider the following code instead:
This example is spawing a consumer that sits in a tight loop dequeueing any elements it can find, until it consumes everything the producer adds. On .NET 4.7, this outputs results on my machine like the following:
00:00:06.1366044 00:00:05.7169339 00:00:06.3870274 00:00:05.5487718 00:00:06.6069291
whereas with .NET Core 2.0, I see results like the following:
00:00:01.2052460 00:00:01.5269184 00:00:01.4638793 00:00:01.4963922 00:00:01.4927520
That’s an ~3.5x throughput increase. But better CPU efficiency isn’t the only impact of the rewrite; memory allocation is also substantially decreased. Consider a small variation to the original test, this time looking at the number of GC collections instead of the wall-clock time:
On .NET 4.7, I get output like the following:
Gen0=162 Gen1=80 Gen2=0 Gen0=162 Gen1=81 Gen2=0 Gen0=162 Gen1=81 Gen2=0 Gen0=162 Gen1=81 Gen2=0 Gen0=162 Gen1=81 Gen2=0
whereas with .NET Core 2.0, I get output like the following:
Gen0=0 Gen1=0 Gen2=0 Gen0=0 Gen1=0 Gen2=0 Gen0=0 Gen1=0 Gen2=0 Gen0=0 Gen1=0 Gen2=0 Gen0=0 Gen1=0 Gen2=0
That’s not a typo: 0 collections. The implementation in .NET 4.7 employs a linked list of fixed-size arrays that are thrown away once the fixed number of elements are added to each; this helps to simplify the implementation, but results in lots of garbage being generated for the segments. In .NET Core 2.0, the new implementation still employs a linked list of segments, but these segments increase in size as new segments are added, and more importantly, utilize circular buffers, such that new segments only need be added if the previous segment is entirely full (though other operations on the collection, such as enumeration, can also cause the current segments to be frozen and force new segments to be created in the future). Such reductions in allocation can have a sizeable impact on the overall performance of an application.
Similar improvements surface with ConcurrentBag<T>. ConcurrentBag<T> maintains thread-local work-stealing queues, such that every thread that adds to the bag has its own queue. In .NET 4.7, these queues are implemented as linked lists of one node per element, which means that any addition to the bag incurs an allocation. In .NET Core 2.0, these queues are now arrays, which means that other than the amortized costs involved in growing the arrays, additions are allocation-free. This can be seen in the following repro:
On .NET 4.7, this yields the following output on my machine:
Elapsed=00:00:06.5672723 Gen0=953 Gen1=0 Gen2=0 Elapsed=00:00:06.4829793 Gen0=954 Gen1=1 Gen2=0 Elapsed=00:00:06.9008532 Gen0=954 Gen1=0 Gen2=0 Elapsed=00:00:06.6485667 Gen0=953 Gen1=1 Gen2=0 Elapsed=00:00:06.4671746 Gen0=954 Gen1=1 Gen2=0
whereas with .NET Core 2.0 I get:
Elapsed=00:00:04.3377355 Gen0=0 Gen1=0 Gen2=0 Elapsed=00:00:04.2892791 Gen0=0 Gen1=0 Gen2=0 Elapsed=00:00:04.3101593 Gen0=0 Gen1=0 Gen2=0 Elapsed=00:00:04.2652497 Gen0=0 Gen1=0 Gen2=0 Elapsed=00:00:04.2808077 Gen0=0 Gen1=0 Gen2=0
That’s an ~30% improvement in throughput, and a huge (complete) reduction in allocations and resulting garbage collections.
In application code, collections often go hand-in-hand with Language Integrated Query (LINQ), which has seen even more improvements. Many of the operators in LINQ have been entirely rewritten for .NET Core in order to reduce the number and size of allocations, reduce algorithmic complexity, and generally eliminate unnecessary work.
For example, the Enumerable.Concat method is used to create a single IEnumerable<T> that first yields all of the elements of one enumerable and then all the elements of a second. Its implementation in .NET 4.7 is simple and easy to understand, reflecting exactly this statement of behavior:
This is about as good as you can expect when the two sequences are simple enumerables like those produced by an iterator in C#. But what if application code instead had code like the following?
Every time we yield out of an iterator, we return out of the enumerator’s MoveNext method. That means if you yield an element from enumerating another iterator, you’re returning out of two MoveNext methods, and moving to the next element requires calling back into both of those MoveNext methods. The more enumerators you need to call into, the longer the operation takes, especially since every one of those operations involves multiple interface calls (MoveNext and Current). That means that concatenating multiple enumerables grows exponentially rather than linearly with the number of enumerables involved. PR dotnet/corefx #6131 fixed that, and the difference is obvious in the following example, which concatenates 10K enumerables of 10 elements each:
On my machine on .NET 4.7, this takes ~4.12 seconds. On my machine on .NET Core 2.0, this takes only ~0.14 seconds, for an ~30x improvement.
Other operators have been improved substantially by eliminating overheads involved when various operators are used together. For example, a multitude of PRs from JonHanna have gone into optimizing various such cases and into making it easier to add more cases in the future. Consider this example:
Here we create an enumerable of the numbers 10,000,000 down to 0, and then time how long it takes to sort them ascending, skip the first 4 elements of the sorted result, and grab the fifth one (which will be 4, as the sequence starts at 0). On my machine on .NET 4.7, I get output like:
00:00:01.3879042 00:00:01.3438509 00:00:01.4141820 00:00:01.4248908 00:00:01.3548279
whereas with .NET Core 2.0, I get output like:
00:00:00.1776617 00:00:00.1787467 00:00:00.1754809 00:00:00.1765863 00:00:00.1735489
That’s a sizeable improvement (~8x), in this case due primarily (though not exclusively) to PR dotnet/corefx #2401, which avoids most of the costs of the sort.
Similarly, PR dotnet/corefx #3429 from justinvp added optimizations around the common ToList method, providing optimized paths for when the source had a known length, and plumbing that through operators like Select. The impact of this is evident in a simple test like the following:
On .NET 4.7, this provides results like:
00:00:00.1308687 00:00:00.1228546 00:00:00.1268445 00:00:00.1247647 00:00:00.1503511
whereas on .NET Core 2.0, I get results like the following:
00:00:00.0386857 00:00:00.0337234 00:00:00.0346344 00:00:00.0345419 00:00:00.0355355
showing an ~4x increase in throughput.
In other cases, the performance wins have come from streamlining the implementation to avoid overheads, such as reducing allocations, avoiding delegate allocations, avoiding interface calls, minimizing field reads and writes, avoiding copies, and so on. For example, jamesqo contributed PR dotnet/corefx #11208, which substantially reduced overheads involved in Enumerable.ToArray, in particular by better managing how the internal buffer(s) used grow to accommodate the unknown amount of data being aggregated. To see this, consider this simple example:
On .NET 4.7, I get results like:
Elapsed=00:00:01.0548794 Gen0=2 Gen1=2 Gen2=2 Elapsed=00:00:01.1147146 Gen0=2 Gen1=2 Gen2=2 Elapsed=00:00:01.0709146 Gen0=2 Gen1=2 Gen2=2 Elapsed=00:00:01.0706030 Gen0=2 Gen1=2 Gen2=2 Elapsed=00:00:01.0620943 Gen0=2 Gen1=2 Gen2=2
and on .NET Core 2.0, results like:
Elapsed=00:00:00.1716550 Gen0=1 Gen1=1 Gen2=1 Elapsed=00:00:00.1720829 Gen0=1 Gen1=1 Gen2=1 Elapsed=00:00:00.1717145 Gen0=1 Gen1=1 Gen2=1 Elapsed=00:00:00.1713335 Gen0=1 Gen1=1 Gen2=1 Elapsed=00:00:00.1705285 Gen0=1 Gen1=1 Gen2=1
so for this example ~6x faster with half the garbage collections.
There are over a hundred operators in LINQ, and while I’ve only mentioned a few, many of them have been subject to these kinds of improvements.
The examples shown thus far, of collections and LINQ, have been about manipulating data in memory. There are of course many other forms of data manipulation, including transformations that are heavily CPU-bound in nature. Investments have also been made in improving such operations.
One key example is compression, such as with DeflateStream, and several impactful performance changes have gone in here. For example, in .NET 4.7, zlib (a native compression library) is used for compressing data, but a relatively unoptimized managed implementation is used for decompressing data; PR dotnet/corefx #2906 added .NET Core support for using zlib for decompression as well. And PR dotnet/corefx #5674 from bjjones enabled using a more optimized version of zlib produced by Intel. These combine to a fairly dramatic effect. Consider this example, which just creates a large array of (fairly compressible) data:
On .NET 4.7, for this one compression/decompression operation I get results like:
whereas with .NET Core 2.0, I get results like:
Another common source of compute in a .NET application is the use of cryptographic operations. Improvements can be seen here as well. For example, in .NET 4.7, SHA256.Create returns a SHA256 type implemented in managed code, and while managed code can be made to run very fast, for very compute-bound computations it’s still hard to compete with the raw throughput and compiler optimizations available to code written in C/C++. In contrast, for .NET Core 2.0, SHA256.Create returns an implementation based on the underlying operating system, e.g. using CNG on Windows or OpenSSL on Unix. The impact can be seen in this simple example that hashes a 100MB byte array:
On .NET 4.7, I get:
whereas with .NET Core 2.0, I get:
Another nice improvement for zero code changes.
Mathematical operations are also a large source of computation, especially when dealing with large numbers. Through PRs like dotnet/corefx #2182, axelheer made some substantial improvements to various operations on BigInteger. Consider the following example:
On my machine on .NET 4.7, this outputs results like:
The same code on .NET Core 2.0 instead outputs results like:
This is another great example of a developer caring a lot about a particular area of .NET and helping to make it better for their own needs and for everyone else that might be using it.
Even some math operations on core integral types have been improved. For example, consider:
and on .NET Core 2.0 I get results like:
for an ~2x improvement in throughput.
Binary serialization is another area of .NET that can be fairly CPU/data/memory intensive. BinaryFormatter is a component that was initially left out of .NET Core, but it reappears in .NET Core 2.0 in support of existing code that needs it (in general, other forms of serialization are recommended for new code). The component is almost an identical port of the code from .NET 4.7, with the exception of tactical fixes that have been made to it since, in particular around performance. For example, PR dotnet/corefx #17949 is a one-line fix that increases the maximum size that a particular array is allowed to grow to, but that one change can have a substantial impact on throughput, by allowing for an O(N) algorithm to operate for much longer than it previously would have before switching to an O(N^2) algorithm. This is evident in the following code example:
On .NET 4.7, this code outputs results like:
whereas on .NET Core 2.0, it outputs results like:
showing an ~12x throughput improvement for this case. In other words, it’s able to deal with much larger serialized inputs more efficiently.
Another very common form of computation in .NET applications is the processing of text, and a large number of improvements have gone in here, at various levels of the stack.
On my machine on .NET 4.7, I get results like:
Elapsed=00:00:05.4367262 Gen0=820 Gen1=0 Gen2=0
whereas with .NET Core 2.0, I get results like:
That’s an ~25% improvement in throughput and an ~70% reduction in allocation / garbage collections, due to a small change in PR dotnet/corefx #231 that made a fix to how some data is cached.
Another example of text processing is in various forms of encoding and decoding, such as URL decoding via WebUtility.UrlDecode. It’s often the case in decoding methods like this one that the input doesn’t actually need any decoding, but the input is still passed through the decoder in case it does. Thanks to PR dotnet/corefx #7671 from hughbe, this case has been optimized. So, for example, with this program:
on .NET 4.7, I see the following output:
whereas on .NET Core 2.0, I see this output:
Other forms of encoding and decoding have also been improved. For example, dotnet/coreclr #10124 optimized the loops involved in using some of the built-in Encoding-derived types. So, for example, this code that repeatedly encodes an ASCII input string as UTF8:
on .NET 4.7 produces output for me like:
00:00:02.4028829 00:00:02.3743152 00:00:02.3401392 00:00:02.4024785 00:00:02.3550876
and on .NET Core 2.0 produces output for me like:
00:00:01.6133550 00:00:01.5915718 00:00:01.5759625 00:00:01.6070851 00:00:01.6070767
These kinds of improvements extend as well to general Parse and ToString methods in .NET for converting between strings and other representations. For example, it’s fairly common to use enums to represent various kinds of state, and to use Enum.Parse to parse a string into a corresponding Enum. PR dotnet/coreclr #2933 helped to improve this. Consider the following code:
On .NET 4.7, I get results like:
Elapsed=00:00:00.9529354 Gen0=293 Elapsed=00:00:00.9422960 Gen0=294 Elapsed=00:00:00.9419024 Gen0=294 Elapsed=00:00:00.9417014 Gen0=294 Elapsed=00:00:00.9514724 Gen0=293
and on .NET Core 2.0, I get results like:
Elapsed=00:00:00.6448327 Gen0=11 Elapsed=00:00:00.6438907 Gen0=11 Elapsed=00:00:00.6285656 Gen0=12 Elapsed=00:00:00.6286561 Gen0=11 Elapsed=00:00:00.6294286 Gen0=12
so not only an ~33% improvement in throughput, but also an ~25x reduction in allocations and associated garbage collections.
Or consider PRs dotnet/coreclr #7836 and dotnet/coreclr #7891, which improved DateTime.ToString with formats “o” (the round-trip date/time pattern) and “r” (the RFC1123 pattern), respectively. The result is that given code like this:
on .NET 4.7 I see output like:
Elapsed=00:00:03.7552059 Gen0=949 Elapsed=00:00:03.6992357 Gen0=950 Elapsed=00:00:03.5459498 Gen0=950 Elapsed=00:00:03.5565029 Gen0=950 Elapsed=00:00:03.5388134 Gen0=950
and on .NET Core 2.0 output like:
Elapsed=00:00:01.3588804 Gen0=87 Elapsed=00:00:01.3932658 Gen0=88 Elapsed=00:00:01.3607030 Gen0=88 Elapsed=00:00:01.3675958 Gen0=87 Elapsed=00:00:01.3546522 Gen0=88
That’s an almost 3x increase in throughput and a whopping ~90% reduction in allocations / garbage collections.
Of course, there’s lots of custom text processing done in .NET applications, beyond using built in types like Regex/Encoding and built-in operations like Parse and ToString, often built directly on top of string, and lots of improvements have gone into operations on String itself.
For example, String.IndexOf is very commonly used to find characters in strings. IndexOf was improved in dotnet/coreclr #5327 by bbowyersmyth, who’s submitted a bunch of performance improvements for String. So this example:
on .NET 4.7 produces results for me like this:
00:00:05.9718129 00:00:05.9199793 00:00:06.0203108 00:00:05.9458049 00:00:05.9622262
whereas on .NET Core 2.0 it produces results for me like this:
00:00:03.1283763 00:00:03.0925150 00:00:02.9778923 00:00:03.0782851
for an ~2x improvement in throughput.
Or consider comparing strings. Here’s an example that uses String.StartsWith and ordinal comparisons:
Thanks to dotnet/coreclr #2825, on .NET 4.7 I get results like:
00:00:01.3097317 00:00:01.3072381 00:00:01.3045015 00:00:01.3068244 00:00:01.3210207
and on .NET Core 2.0 results like:
00:00:00.6239002 00:00:00.6150021 00:00:00.6147173 00:00:00.6129136 00:00:00.6099822
It’s quite fun looking through all of the changes that have gone into String, seeing their impact, and thinking about the additional possibilities for more improvements.
Thus far I’ve been focusing on various improvements around manipulating data in memory. But lots of the changes that have gone into .NET Core have been about I/O.
Let’s start with files. Here’s an example of asynchronously reading all of the data from one file and writing it to another (using FileStreams configured to use async I/O):
A bunch of PRs have gone into reducing the overheads involved in FileStream, such as dotnet/corefx #11569 which adds a specialized CopyToAsync implementation, and dotnet/corefx #2929 which improves how asynchronous writes are handled, and so when running this on .NET 4.7 I get results like:
Elapsed=00:00:09.4070345 Gen0=14 Gen1=7 Gen2=1
and on .NET Core 2.0, results like:
Elapsed=00:00:06.4286604 Gen0=4 Gen1=1 Gen2=1
Networking is a big area of focus now, and likely will be even more so moving forward. A good amount of effort is being applied to optimizing and tuning the lower-levels of the networking stack, so that higher-level components can be built efficiently.
One such change that has a big impact is PR dotnet/corefx #15141. SocketAsyncEventArgs is at the center of a bunch of asynchronous operations on Socket, and it supports a synchronous completion model whereby asynchronous operations that actually complete synchronously can avoid costs associated with asynchronous completions. However, the implementation in .NET 4.7 only ever synchronously completes operations that fail; the aforementioned PR fixed the implementation to allow for synchronous completions of all async operations on sockets. The impact of this is very obvious in code like the following:
This program creates two connected sockets, and then writes 1,000,000 times to one socket and receives on the other, in both cases using asynchronous methods but where the vast majority (if not all) of the operations will complete synchronously. On .NET 4.7 I see results like:
Elapsed=00:00:20.5272910 Gen0=42 Gen1=2 Gen2=0
whereas on .NET Core 2.0 with most of these operations able to complete synchronously, I see results instead like:
Elapsed=00:00:05.6197060 Gen0=0 Gen1=0 Gen2=0
Not only do such improvements accrue to components using sockets directly, but also to using sockets indirectly via higher-level components, and other PRs have resulted in additional performance increases in higher-level components, such as NetworkStream. For example, PR dotnet/corefx #16502 re-implemented Socket’s Task-based SendAsync and ReceiveAsync operations on top of SocketAsyncEventArgs and then allowed those to be used from NetworkStream.Read/WriteAsync, and PR dotnet/corefx #12664 added a specialized CopyToAsync override to support more efficiently reading the data from a NetworkStream and copying it out to some other stream. Those changes have a very measurable impact on NetworkStream throughput and allocations. Consider this example:
As with the previous Sockets one, we’re creating two connected sockets. We’re then wrapping those in NetworkStreams. On one of the streams we write 1K of data a million times, and on the other stream we read out all of its data via a CopyToAsync operation. On .NET 4.7, I get output like the following:
Elapsed=00:00:24.7827947 Gen0=220 Gen1=3 Gen2=0
whereas on .NET Core 2.0, the time is cut by 5x, and garbage collections are reduced effectively to zero:
Elapsed=00:00:04.9452962 Gen0=0 Gen1=0 Gen2=0
Further optimizations have gone into other networking-related components. For example, SslStream is often wrapped around a NetworkStream in order to add SSL to a connection. We can see the impact of these changes as well as others in an example like the following, which just adds usage of SslStream on top of the previous NetworkStream example:
On .NET 4.7, I get results like the following:
Elapsed=00:00:21.1171962 Gen0=470 Gen1=3 Gen2=1
.NET Core 2.0 includes changes from PRs like dotnet/corefx #12935 and dotnet/corefx #13274, both of which together significantly reduce the allocations involved in using SslStream. When running the same code on .NET Core 2.0, I get results like the following:
Elapsed=00:00:05.6456073 Gen0=74 Gen1=0 Gen2=0
That’s 85% of the garbage collections removed!
Not to be left out, lots of improvements have gone into infrastructure and primitives related to concurrency and parallelism.
One of the key focuses here has been the ThreadPool, which is at the heart of the execution of many .NET apps. For example, PR dotnet/coreclr #3157 reduced the sizes of some of the objects involved in QueueUserWorkItem, and PR dotnet/coreclr #9234 used the previously mentioned rewrite of ConcurrentQueue<T> to replace the global queue of the ThreadPool with one that involves less synchronization and less allocation. The net result in visible in an example like the following:
On .NET 4.7, I see results like the following:
Elapsed=00:00:03.6263995 Gen0=225 Gen1=51 Gen2=16 Elapsed=00:00:03.6304345 Gen0=231 Gen1=62 Gen2=17 Elapsed=00:00:03.6142323 Gen0=225 Gen1=53 Gen2=16 Elapsed=00:00:03.6565384 Gen0=232 Gen1=62 Gen2=16 Elapsed=00:00:03.5999892 Gen0=228 Gen1=62 Gen2=17
whereas on .NET Core 2.0, I see results like the following:
Elapsed=00:00:02.1797508 Gen0=153 Gen1=0 Gen2=0 Elapsed=00:00:02.1188833 Gen0=154 Gen1=0 Gen2=0 Elapsed=00:00:02.1000003 Gen0=153 Gen1=0 Gen2=0 Elapsed=00:00:02.1024852 Gen0=153 Gen1=0 Gen2=0 Elapsed=00:00:02.1044461 Gen0=154 Gen1=1 Gen2=0
That’s both a huge improvement in throughput and a huge reduction in garbage collections for such a core component.
Synchronization primitives have also gotten a boost in .NET Core. For example, SpinLock is often used by low-level concurrent code trying either to avoid allocating lock objects or minimize the time it takes to acquire a rarely contended lock, and its TryEnter method is often called with a value of 0 in order to only take the lock if it can be taken immediately, or else fail immediately if it can’t, without any spinning. PR dotnet/coreclr #6952 improved that fail fast path, as is evident in the following test:
On .NET 4.7, I get results like:
00:00:02.3276463 00:00:02.3174042 00:00:02.3022212 00:00:02.3015542 00:00:02.2974777
whereas on .NET Core 2.0, I get results like:
00:00:00.3915327 00:00:00.3953084 00:00:00.3875121 00:00:00.3980009 00:00:00.3886977
Such an ~6x difference in throughput can make a significant impact on hot paths that exercise such locks.
That’s just one example of many. Another is around Lazy<T>, which was rewritten in PR dotnet/coreclr #8963 by manofstick to improve the efficiency of accessing an already initialized Lazy<T> (while the performance of accessing a Lazy<T> for the first time matters, the expectation is that it’s accessed many times after that, and thus we want to minimize the cost of those subsequent accesses). The effect is visible in a small example like the following:
On .NET 4.7, I get results like:
00:00:02.6769712 00:00:02.6789140 00:00:02.6535493 00:00:02.6911146 00:00:02.7253927
whereas on .NET Core 2.0, I get results like:
00:00:00.5278348 00:00:00.5594950 00:00:00.5458245 00:00:00.5381743 00:00:00.5502970
for an ~5x increase in throughput.
As I noted earlier, these are just a few of the many performance-related improvements that have gone into .NET Core. Search for “perf” or “performance” in pull requests in the dotnet/corefx and dotnet/coreclr repos, and you’ll find close to a thousand merged PRs; some of them are big and impactful on their own, while others whittle away at costs across the libraries and runtime, changes that add up to applications running faster on .NET Core. Hopefully subsequent blog posts will highlight additional performance improvements, including those in the runtime, of which there have been many but which I haven’t covered here.
We’re far from done, though. Many of the performance-related changes up until this point have been mostly ad-hoc, opportunistic changes, or those driven by specific needs that resulted from profiling specific higher-level applications and scenarios. Many have also come from the community, with developers everywhere finding and fixing issues important to them. Moving forward, performance will be a bigger focus, both in terms of adding additional performance-focused APIs (you can see experimentation with such APIs in the dotnet/corefxlab repo) and in terms of improving the performance of the existing libraries.
To me, though, the most exciting part is this: you can help make all of this even better. Throughout this post I highlighted some of the many great contributions from the community, and I highly encourage everyone reading this to dig in to the .NET Core codebase, find bottlenecks impacting your own apps and libraries, and submit PRs to fix them. Rather than stumbling upon a performance issue and working around it in your app, fix it for you and everyone else to consume. We are all very excited to work with you on bringing such improvements into the code base, and we hope to see all of you involved in the various .NET Core repos.