Hardware Intrinsics in .NET Core

Tanner

Several years ago, we decided that it was time to support SIMD code in .NET. We introduced the System.Numerics namespace with Vector2, Vector3, Vector4, Vector<T>, and related types. These types expose a general-purpose API for creating, accessing, and operating on vectors using hardware vector instructions (when available). They also provide a software fallback for when the hardware does not provide the appropriate instructions. This enabled a number of common algorithms to be vectorized, often with only minor refactorings. However, the generality of this approach made it difficult for programs to take full advantage of all vector instructions available on modern hardware. Additionally, modern hardware often exposes a number of specialized non-vector instructions that can dramatically improve performance. In this blog post, I’m exploring how we’ve addressed this limitation in .NET Core 3.0.

What are hardware intrinsics?

In .NET Core 3.0, we added a new feature called hardware intrinsics. Hardware intrinsics provide access to many of these hardware-specific instructions that can’t easily be exposed in a more general-purpose mechanism. They differ from the existing SIMD intrinsics in that they are not general-purpose (the new hardware intrinsics are not cross-platform and do not provide a software fallback) and instead directly expose platform- and hardware-specific functionality to the .NET developer. The existing SIMD intrinsics, in comparison, are cross-platform, provide a software fallback, and are slightly abstracted from the underlying hardware. That abstraction can come at a cost and prevent certain functionality from being exposed (when said functionality does not exist or is not easily emulated on all target hardware).

The new intrinsics and supporting types are exposed under the System.Runtime.Intrinsics namespace. For .NET Core 3.0 there currently exists one namespace: System.Runtime.Intrinsics.X86. We are working on exposing hardware intrinsics for other platforms, such as System.Runtime.Intrinsics.Arm.

Under the platform specific namespaces, intrinsics are grouped into classes which represent logical hardware instruction groups (frequently referred to as Instruction Set Architectures or ISAs). Each class then exposes an IsSupported property that indicates whether the hardware you are currently executing on supports that instruction set. Each class then also exposes a set of methods that map to the underlying instructions exposed by that instruction set. There is sometimes additionally a subclass that is part of the same instruction set but that may be limited to specific hardware. For example, the Lzcnt class provides access to the leading zero count instructions. There is then a subclass named X64 which exposes the forms of the instruction that are only usable on 64-bit machines.

Some of the classes are also hierarchical in nature. For example, if Lzcnt.X64.IsSupported returns true, then Lzcnt.IsSupported must also return true since it is an explicit subclass. Likewise, if Sse2.IsSupported returns true, then Sse.IsSupported must also return true because Sse2 explicitly inherits from the Sse class. However, it is worth noting that just because classes have similar names does not mean they are definitely hierarchical. For example, Bmi2 does not inherit from Bmi1 and so the IsSupported checks for the two instruction sets are distinct from each other. The design philosophy of these types is to truthfully represent the ISA specification: SSE2 requires support for SSE1, so we exposed a subclass, and since BMI2 doesn’t require support for BMI1, we didn’t use inheritance.

An example of the API shape described above is the following:
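
(The listing below is an abbreviated, declaration-only sketch of this shape; only a couple of members from each class are shown, not the full API surface.)

```csharp
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Sse
    {
        public static bool IsSupported { get; }

        public static Vector128<float> Add(Vector128<float> left, Vector128<float> right);

        // Forms of the instructions that are only usable on 64-bit machines.
        public abstract class X64
        {
            public static bool IsSupported { get; }
        }
    }

    // SSE2 requires SSE1 support, so Sse2 inherits from Sse.
    public abstract class Sse2 : Sse
    {
        public static new bool IsSupported { get; }

        public static Vector128<double> Add(Vector128<double> left, Vector128<double> right);

        public new abstract class X64 : Sse.X64
        {
            public static new bool IsSupported { get; }
        }
    }

    public abstract class Lzcnt
    {
        public static bool IsSupported { get; }

        public static uint LeadingZeroCount(uint value);

        public abstract class X64
        {
            public static bool IsSupported { get; }

            public static ulong LeadingZeroCount(ulong value);
        }
    }
}
```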

You can also see a more complete list by browsing the source code on source.dot.net or dotnet/coreclr on GitHub.

The IsSupported checks are treated as runtime constants by the JIT (when optimizations are enabled) and so you do not need to cross-compile to support multiple different ISAs, platforms, or architectures. Instead, you just write your code using if-statements and the unused code paths (any code path which is not executed, due to the condition for the branch being false or an earlier branch being taken instead) are dropped from the generated code (the native assembly code generated by the JIT at runtime).

It is essential that you guard usage of hardware intrinsics with the appropriate IsSupported check. If your code is unguarded and runs on a machine or architecture/platform that doesn’t support the intrinsic, a PlatformNotSupportedException is thrown by the runtime.
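
As a minimal illustration (a hypothetical helper, not code from the original post), a leading-zero-count routine might be guarded like this:

```csharp
using System.Runtime.Intrinsics.X86;

public static uint CountLeadingZeros(uint value)
{
    if (Lzcnt.IsSupported)
    {
        // When LZCNT is supported, this compiles down to a single instruction
        // and the else branch below is dropped from the generated code.
        return Lzcnt.LeadingZeroCount(value);
    }
    else
    {
        // Software fallback for hardware that doesn't support LZCNT.
        uint count = 0;

        while (count < 32 && (value & 0x80000000) == 0)
        {
            value <<= 1;
            count += 1;
        }

        return count;
    }
}
```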

What benefits do these provide me?

Hardware Intrinsics definitely aren’t for everyone, but they can be used to boost perf in some computationally heavy workloads. Frameworks such as CoreFX or ML.NET take advantage of these methods to help accelerate things like copying memory, searching for the index of an item in an array/string, resizing images, or working with vectors, matrices, and tensors. Manually vectorizing some code that has been identified as a bottleneck can also be easier than it seems. Vectorizing your code is really all about performing multiple operations at once, generally using Single-Instruction Multiple Data (SIMD) instructions.

It is important to profile your code before vectorizing to ensure that the code you are optimizing is part of a hot spot (and therefore the optimization will be impactful). It is also important to profile while you are iterating on the vectorized code, as not all code will benefit from vectorization.

Vectorizing a simple algorithm

Take for example an algorithm which sums all elements in an array or span. This code is a perfect candidate for vectorization because it does the same unconditional operation every iteration of the loop and those operations are fairly trivial in nature.

An example of such an algorithm might look like the following:
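
(A sketch along these lines, assuming a ReadOnlySpan<uint> input; the exact listing in the original post may differ slightly.)

```csharp
public static uint Sum(ReadOnlySpan<uint> source)
{
    uint result = 0;

    // One trivial add per loop iteration.
    for (int i = 0; i < source.Length; i++)
    {
        result += source[i];
    }

    return result;
}
```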

The code is simple and understandable, but it is also not particularly fast for large inputs since you are only doing a single trivial operation per loop iteration.

| Method | Count | Mean | Error | StdDev |
|------- |------:|-----:|------:|-------:|
| Sum | 1 | 2.477 ns | 0.0192 ns | 0.0179 ns |
| Sum | 2 | 2.164 ns | 0.0265 ns | 0.0235 ns |
| Sum | 4 | 3.224 ns | 0.0302 ns | 0.0267 ns |
| Sum | 8 | 4.347 ns | 0.0665 ns | 0.0622 ns |
| Sum | 16 | 8.444 ns | 0.2042 ns | 0.3734 ns |
| Sum | 32 | 13.963 ns | 0.2182 ns | 0.2041 ns |
| Sum | 64 | 50.374 ns | 0.2955 ns | 0.2620 ns |
| Sum | 128 | 60.139 ns | 0.3890 ns | 0.3639 ns |
| Sum | 256 | 106.416 ns | 0.6404 ns | 0.5990 ns |
| Sum | 512 | 291.450 ns | 3.5148 ns | 3.2878 ns |
| Sum | 1024 | 574.243 ns | 9.5851 ns | 8.4970 ns |
| Sum | 2048 | 1,137.819 ns | 5.9363 ns | 5.5529 ns |
| Sum | 4096 | 2,228.341 ns | 22.8882 ns | 21.4097 ns |
| Sum | 8192 | 2,973.040 ns | 14.2863 ns | 12.6644 ns |
| Sum | 16384 | 5,883.504 ns | 15.9619 ns | 14.9308 ns |
| Sum | 32768 | 11,699.237 ns | 104.0970 ns | 97.3724 ns |

Improving the perf by unrolling the loop

Modern CPUs have many ways of increasing the throughput at which they execute your code. For single-threaded applications, one of the ways they can do this is by executing multiple primitive operations in a single cycle (a cycle is the basic unit of time in a CPU).

Most modern CPUs can execute about 4 add operations in a single cycle (under optimal conditions), so by laying out your code correctly and profiling it, you can sometimes optimize your code to have better performance, even when only executing on a single thread.

While the JIT can perform loop unrolling itself, it is conservative in deciding when to do so due to the larger codegen it produces. So, it can be beneficial to manually unroll the loop in your source code instead.

You might unroll your code like the following:
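
(Again, a sketch of the idea rather than the post's exact listing: each iteration handles 4 elements, accumulating into independent partial sums so the adds can execute in parallel.)

```csharp
public static uint SumUnrolled(ReadOnlySpan<uint> source)
{
    uint result = 0;

    // Four independent partial sums let the CPU execute the adds in parallel.
    uint partial1 = 0, partial2 = 0, partial3 = 0, partial4 = 0;

    int i = 0;
    int lastBlockIndex = source.Length - (source.Length % 4);

    while (i < lastBlockIndex)
    {
        partial1 += source[i + 0];
        partial2 += source[i + 1];
        partial3 += source[i + 2];
        partial4 += source[i + 3];
        i += 4;
    }

    result = partial1 + partial2 + partial3 + partial4;

    // Handle any elements left over after the blocks of 4.
    while (i < source.Length)
    {
        result += source[i];
        i += 1;
    }

    return result;
}
```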

The code is slightly more complicated but takes better advantage of your hardware.

For really small loops, the code ends up being slightly slower, but that normalizes itself for inputs that have 8 elements and then starts getting faster for inputs with even more elements (taking 26% less time at 32k elements). It’s also worth noting that this optimization doesn’t always improve performance. For example, when handling float, the unrolled version is practically the same speed as the original version, so it’s important to profile your code accordingly.

| Method | Count | Mean | Error | StdDev |
|------- |------:|-----:|------:|-------:|
| SumUnrolled | 1 | 2.922 ns | 0.0651 ns | 0.0609 ns |
| SumUnrolled | 2 | 3.576 ns | 0.0116 ns | 0.0109 ns |
| SumUnrolled | 4 | 3.708 ns | 0.0157 ns | 0.0139 ns |
| SumUnrolled | 8 | 4.832 ns | 0.0486 ns | 0.0454 ns |
| SumUnrolled | 16 | 7.490 ns | 0.1131 ns | 0.1058 ns |
| SumUnrolled | 32 | 11.277 ns | 0.0910 ns | 0.0851 ns |
| SumUnrolled | 64 | 19.761 ns | 0.2016 ns | 0.1885 ns |
| SumUnrolled | 128 | 36.639 ns | 0.3043 ns | 0.2847 ns |
| SumUnrolled | 256 | 77.969 ns | 0.8409 ns | 0.7866 ns |
| SumUnrolled | 512 | 146.357 ns | 1.3209 ns | 1.2356 ns |
| SumUnrolled | 1024 | 287.354 ns | 0.9223 ns | 0.8627 ns |
| SumUnrolled | 2048 | 566.405 ns | 4.0155 ns | 3.5596 ns |
| SumUnrolled | 4096 | 1,131.016 ns | 7.3601 ns | 6.5246 ns |
| SumUnrolled | 8192 | 2,259.836 ns | 8.6539 ns | 8.0949 ns |
| SumUnrolled | 16384 | 4,501.295 ns | 6.4186 ns | 6.0040 ns |
| SumUnrolled | 32768 | 8,979.690 ns | 19.5265 ns | 18.2651 ns |

Improving the perf by vectorizing the loop

However, we can still optimize the code a bit more. SIMD instructions are another way modern CPUs allow you to improve throughput. Using a single instruction they allow you to perform multiple operations in a single cycle. This can be better than the loop unrolling because it performs essentially the same operation, but with smaller generated code.

To elaborate a bit, each one of the add instructions from the unrolled loop is 4 bytes in size, so it takes 16 bytes of space to have all 4 adds in the unrolled form. However, the SIMD add instruction also performs 4 additions, but it only takes 4 bytes to encode. This means there are fewer instructions for the CPU to decode and execute each iteration of the loop. There are also other things the CPU can assume and optimize around for this single instruction, but those are out of scope for this blog post. What’s even better is that modern CPUs can also execute more than one SIMD instruction per cycle, so in certain cases you can then unroll your vectorized code to improve the performance further.

You should generally start by looking at whether the general-purpose Vector<T> class will suit your needs. It, like the newer hardware intrinsics, will emit SIMD instructions, but given that it is general-purpose you can reduce the amount of code you need to write/maintain.

The code might look like:
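
(A sketch of such a Vector<T>-based implementation; names and details are illustrative rather than the post's exact listing.)

```csharp
using System.Numerics;

public static uint SumVectorT(ReadOnlySpan<uint> source)
{
    uint result = 0;

    Vector<uint> partialSums = Vector<uint>.Zero;

    int i = 0;
    int lastBlockIndex = source.Length - (source.Length % Vector<uint>.Count);

    // Add Vector<uint>.Count elements per iteration into per-lane partial sums.
    while (i < lastBlockIndex)
    {
        partialSums += new Vector<uint>(source.Slice(i));
        i += Vector<uint>.Count;
    }

    // Fall back to scalar code to combine the per-lane partial sums.
    for (int n = 0; n < Vector<uint>.Count; n++)
    {
        result += partialSums[n];
    }

    // Handle any leftover elements.
    while (i < source.Length)
    {
        result += source[i];
        i += 1;
    }

    return result;
}
```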

The code is faster, but we have to fall back to accessing individual elements when computing the overall sum. Vector<T> also does not have a well-defined size and can vary based on the hardware you are running on. The hardware intrinsics provide some additional functionality that can make this code a bit nicer and faster still (at the cost of additional code complexity and maintenance requirements).

| Method | Count | Mean | Error | StdDev |
|------- |------:|-----:|------:|-------:|
| SumVectorT | 1 | 4.517 ns | 0.0752 ns | 0.0703 ns |
| SumVectorT | 2 | 4.853 ns | 0.0609 ns | 0.0570 ns |
| SumVectorT | 4 | 5.047 ns | 0.0909 ns | 0.0850 ns |
| SumVectorT | 8 | 5.671 ns | 0.0251 ns | 0.0223 ns |
| SumVectorT | 16 | 6.579 ns | 0.0330 ns | 0.0276 ns |
| SumVectorT | 32 | 10.460 ns | 0.0241 ns | 0.0226 ns |
| SumVectorT | 64 | 17.148 ns | 0.0407 ns | 0.0381 ns |
| SumVectorT | 128 | 23.239 ns | 0.0853 ns | 0.0756 ns |
| SumVectorT | 256 | 62.146 ns | 0.8319 ns | 0.7782 ns |
| SumVectorT | 512 | 114.863 ns | 0.4175 ns | 0.3906 ns |
| SumVectorT | 1024 | 172.129 ns | 1.8673 ns | 1.7467 ns |
| SumVectorT | 2048 | 429.722 ns | 1.0461 ns | 0.9786 ns |
| SumVectorT | 4096 | 654.209 ns | 3.6215 ns | 3.0241 ns |
| SumVectorT | 8192 | 1,675.046 ns | 14.5231 ns | 13.5849 ns |
| SumVectorT | 16384 | 2,514.778 ns | 5.3369 ns | 4.9921 ns |
| SumVectorT | 32768 | 6,689.829 ns | 13.9947 ns | 13.0906 ns |

NOTE: For the purposes of this blog post, I forced the size of Vector<T> to 16 bytes using an internal configuration knob (COMPlus_SIMD16ByteOnly=1). This normalized the results when comparing SumVectorT to SumVectorizedSse and kept the latter code simpler. Namely, it avoided the need to write an if (Avx2.IsSupported) { } code path. Such a code path is nearly identical to the Sse2 path, but deals with Vector256<T> (32 bytes) and processes even more elements per loop iteration.

So, you might take advantage of the new hardware intrinsics like so:
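
(The sketch below follows the structure described in the post: a wrapper that falls back to the unrolled loop when the intrinsics aren't supported, and an Sse2-based worker with an Ssse3 refinement for the horizontal add. The original article's listing may differ in small details, such as which IsSupported check guards the outer call.)

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static uint SumVectorized(ReadOnlySpan<uint> source)
{
    if (Sse2.IsSupported)
    {
        return SumVectorizedSse(source);
    }
    else
    {
        // Falls back to the unrolled loop on hardware without the intrinsics.
        return SumUnrolled(source);
    }
}

private static unsafe uint SumVectorizedSse(ReadOnlySpan<uint> source)
{
    uint result;

    fixed (uint* pSource = source)
    {
        Vector128<uint> vresult = Vector128<uint>.Zero;

        int i = 0;
        int lastBlockIndex = source.Length - (source.Length % 4);

        // Add 4 elements (128 bits / 32 bits) per iteration into per-lane partial sums.
        while (i < lastBlockIndex)
        {
            vresult = Sse2.Add(vresult, Sse2.LoadVector128(pSource + i));
            i += 4;
        }

        // Horizontally combine the 4 lanes into lane 0.
        if (Ssse3.IsSupported)
        {
            vresult = Ssse3.HorizontalAdd(vresult.AsInt32(), vresult.AsInt32()).AsUInt32();
            vresult = Ssse3.HorizontalAdd(vresult.AsInt32(), vresult.AsInt32()).AsUInt32();
        }
        else
        {
            vresult = Sse2.Add(vresult, Sse2.Shuffle(vresult, 0x4E));
            vresult = Sse2.Add(vresult, Sse2.Shuffle(vresult, 0xB1));
        }

        result = vresult.ToScalar();

        // Handle any leftover elements.
        while (i < source.Length)
        {
            result += pSource[i];
            i += 1;
        }
    }

    return result;
}
```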

The code is again slightly more complicated, but it’s significantly faster for all but the smallest workloads. At 32k elements, it’s taking 75% less time than the unrolled loop and 81% less than the original code.

You’ll notice that we have a few IsSupported checks. The first checks if the hardware intrinsics are supported for the current platform at all and falls back to the unrolled loop if they aren’t. This path will currently be hit for platforms like ARM/ARM64 which don’t have hardware intrinsics or if someone disables them for any reason. The second IsSupported check is in the SumVectorizedSse method and is used to produce slightly better codegen on newer hardware that additionally supports the Ssse3 instruction set.

Otherwise, most of the logic is essentially the same as what we had done for the unrolled version. Vector128<T> is a 128-bit type that contains Vector128<T>.Count elements. In the case of uint, which is itself 32-bits, you have 4 (128 / 32) elements, which is exactly how much we unrolled the loop by.

| Method | Count | Mean | Error | StdDev |
|------- |------:|-----:|------:|-------:|
| SumVectorized | 1 | 4.555 ns | 0.0192 ns | 0.0179 ns |
| SumVectorized | 2 | 4.848 ns | 0.0147 ns | 0.0137 ns |
| SumVectorized | 4 | 5.381 ns | 0.0210 ns | 0.0186 ns |
| SumVectorized | 8 | 4.838 ns | 0.0209 ns | 0.0186 ns |
| SumVectorized | 16 | 5.107 ns | 0.0175 ns | 0.0146 ns |
| SumVectorized | 32 | 5.646 ns | 0.0230 ns | 0.0204 ns |
| SumVectorized | 64 | 6.763 ns | 0.0338 ns | 0.0316 ns |
| SumVectorized | 128 | 9.308 ns | 0.1041 ns | 0.0870 ns |
| SumVectorized | 256 | 15.634 ns | 0.0927 ns | 0.0821 ns |
| SumVectorized | 512 | 34.706 ns | 0.2851 ns | 0.2381 ns |
| SumVectorized | 1024 | 68.110 ns | 0.4016 ns | 0.3756 ns |
| SumVectorized | 2048 | 136.533 ns | 1.3104 ns | 1.2257 ns |
| SumVectorized | 4096 | 277.930 ns | 0.5913 ns | 0.5531 ns |
| SumVectorized | 8192 | 554.720 ns | 3.5133 ns | 3.2864 ns |
| SumVectorized | 16384 | 1,110.730 ns | 3.3043 ns | 3.0909 ns |
| SumVectorized | 32768 | 2,200.996 ns | 21.0538 ns | 19.6938 ns |

Summary

The new hardware intrinsics allow you to take advantage of platform-specific functionality for the machine you’re running on. There are approximately 1,500 APIs for x86 and x64 spread across 15 instruction sets, far too many to cover in a single blog post. By profiling your code to identify hot spots, you can also potentially identify areas of your code that would benefit from vectorization and see some pretty good performance gains. There are multiple scenarios where vectorization can be applied and loop unrolling is just the beginning.

Anyone wanting to see more examples can search for uses of the intrinsics in the framework (see the dotnet and aspnet organizations) or in various other blog posts written by the community. And while the currently exposed intrinsics are extensive, there is still a lot of functionality that could be exposed. If you have functionality you would like exposed, feel free to log an API request against dotnet/corefx on GitHub. The API review process is detailed here and there is a good example for the API Request template listed under Step 1.

Special Thanks

A special thanks to our community members Fei Peng (@fiigii) and Jacek Blaszczynski (@4creators) who helped implement the hardware intrinsics. Also to all the community members who have provided valuable feedback to the design, implementation, and usability of the feature.

Tanner Gooding

Software Engineer, .NET Team


17 comments

  • LOST _

    How do I ensure proper alignment? When I tried to use Avx2 intrinsics, I noticed unaligned versions of vector loading were used.

    • Tanner Gooding

      The VEX encoding (encoding used by AVX and later ISAs) does not do alignment checking for most memory operands. This is in contrast to the legacy encoding used by SSE that does (for the most part).

      You can explicitly enforce alignment checking by using the `LoadAligned` intrinsic; but you may get less efficient codegen as the load will be a separate instruction rather than folded into the instruction that consumes the load.

      On modern CPUs (basically any CPU that is less than 10 years old), unaligned loads are generally as fast as aligned loads; provided that load doesn’t cross a cache-line or page boundary. So, it is generally sufficient to pin your memory, get to the first aligned address, and then use unaligned loads to operate on the data (which guarantees it won’t cross a cache-line or page boundary). This ensures you have both alignment and efficient codegen.

      You can then add something like a `Debug.Assert((address % expectedAlignment) == 0)` to help catch any bugs around alignment at runtime.
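
      (A hypothetical sketch of that pattern, not code from this thread: pin the buffer, consume elements one at a time until the pointer hits a 16-byte boundary, assert the alignment, and then use loads that are unaligned in encoding but aligned in practice.)

```csharp
using System.Diagnostics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Assumes the caller has already verified Sse2.IsSupported.
public static unsafe uint SumAligned(ReadOnlySpan<uint> source)
{
    uint result = 0;

    fixed (uint* pSource = source)
    {
        uint* p = pSource;
        uint* pEnd = pSource + source.Length;

        // Consume elements one at a time until p reaches a 16-byte boundary.
        while (((ulong)p % 16) != 0 && p < pEnd)
        {
            result += *p;
            p++;
        }

        Debug.Assert(((ulong)p % 16) == 0 || p == pEnd);

        Vector128<uint> vresult = Vector128<uint>.Zero;

        // These loads are encoded as unaligned, but every address is 16-byte
        // aligned, so they never straddle a cache line or page boundary.
        while (pEnd - p >= 4)
        {
            vresult = Sse2.Add(vresult, Sse2.LoadVector128(p));
            p += 4;
        }

        // Combine the four lanes, then add any trailing elements.
        result += vresult.GetElement(0) + vresult.GetElement(1)
                + vresult.GetElement(2) + vresult.GetElement(3);

        while (p < pEnd)
        {
            result += *p;
            p++;
        }
    }

    return result;
}
```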

  • Petr Onderka

    In the last version of the code, how come it’s checking Sse.IsSupported, but then it’s using Sse2 without further checking?

  • Юрий Соколов

    Main question: why is the third version (SumVectorT) not as fast as the version with explicit intrinsics (SumVectorizedSSE2)? Why didn’t the compiler use the same CPU instructions? What makes it slower?

    • Tanner Gooding

      It is a general purpose API and so it can’t always generate the same code. In this particular case, it is accessing data via a Span<T> and there are bounds checks that the JIT isn’t able to elide (the same would be true of accessing via an array). The HWIntrinsic code, on the other hand, is pinning the underlying buffer and accessing the data via a pointer and explicit load instructions. Not only is the JIT able to generate slightly better code for this (as what you want done and the instructions you want emitted are being explicitly specified), but accessing the data via the raw pointer allows the bounds checks to be elided.

      There are certain new language features (such as `unmanaged constructed types`: https://github.com/dotnet/csharplang/issues/1937) which will allow you to bypass the bounds checks here as well (by operating on a `Vector<int>*`), in which case the numbers are more comparable. But, there would likely still be slight codegen differences due to it being a general-purpose API.

  • ganchito55

    I wrote a post about how to optimize C# using SIMD instructions. Although it’s in Spanish, you can see that it’s easy to get a 10x speedup:
    | Method | Mean | Error | StdDev |
    |---------------- |-----------:|----------:|----------:|
    | FindWithMinSIMD | 3.696 ns | 0.0225 ns | 0.0200 ns |
    | FindWithLINQ | 182.543 ns | 3.5925 ns | 4.2767 ns |
    | FindWithLoop | 29.490 ns | 0.1920 ns | 0.1796 ns |

    I uploaded the code to Github

  • Maciej Pilichowski

    On one hand, any progress in number crunching is good news, but on the other hand, C# does not support basic math in generic form (when you want to add, multiply, etc. some data but you don’t know the specifics of whether they are shorts, ints, etc.). So I really wish MS would take to heart the very basics and then move on to improving the advanced stuff. For dead simple `a+b` I have to write pretty convoluted code, and each solution you can think of is uglier than the last. Having `ints`, `doubles`, etc. with some INumerics common interface could help (or support for C++-like templates for code like this). In general, anything that finally fixes this problem.

    • Tanner Gooding

      The blog post is about .NET Core and the RyuJIT compiler; it is not about Mono and does not go into detail about other languages (such as Rust, Go, or Swift) or frameworks (such as Mono) which provided similar functionality.

      Mono.Simd was also slightly different and not quite as extensive as this functionality (it falls slightly between the System.Numerics and the System.Runtime.Intrinsics work). It also did not provide the same raw level of control over what CPU instructions were emitted for any given API call or break it up into the specific ISAs.

  • László Csöndes

    Why do I need to bother with manually unrolling loops and doing basic vectorization in 2019? GCC and LLVM/Clang both happily produce unrolled SSE code just fine on the equivalent of Sum() here, and they don’t even have the benefit of being able to detect your CPU’s capabilities at runtime like a JIT can.
    With tiered compilation there’s really no excuse why RyuJIT couldn’t do most of the optimizations that are available to C++. Make it yet another LLVM frontend for all I care. If I wanted to meticulously micro-optimize my code at the level presented in this article I would be using C++ instead of C# to begin with.

    • Tanner Gooding

      The blog post touches on this briefly but does not go into detail.

      > While the JIT can perform loop unrolling itself, it is conservative in deciding when to do so due to the larger codegen it produces. So, it can be beneficial to manually unroll the loop in your source code instead.

      To expand. AOT compilers (which, for brevity here, will include things like the native Clang or MSVC compilers, as well as things like MonoAOT) are not time constrained and have basically unlimited time to do various analysis and optimizations. A JIT compiler, however, is live and has stricter time constraints since it impacts how fast a particular method is the first time it executes. This means that some more complex optimizations are not feasible to do, especially when a method could potentially be “cold” (only called a few times or less throughout the entire lifetime of the program). 

      Some new features, like Tiered Compilation (https://devblogs.microsoft.com/dotnet/tiered-compilation-preview-in-net-core-2-1/), will allow “hot methods” to be recompiled with more advanced optimizations in the future (so startup is still fast, but steady state performance improves as the program continues executing).

      Loop unrolling and auto-vectorization are two optimizations that generally fall into the category of “expensive” to do, so if the JIT wants to allow them to happen more extensively, it likely needs to be done after a method (or loop) has been detected to be hot.

      Additionally, relying on loop unrolling and/or auto-vectorization is not always desirable. For very perf critical scenarios the codegen can be worse than a manually written loop and since the compiler is detecting specific code patterns to optimize, it can be broken or changed by what appears to be a trivial refactoring (although compilers are getting better at optimizing more algorithms).

  • Karsten Thamm

    How about matrices?
    Matrix multiplication and inversion are really intensive (especially for 4x4 and higher).

    I once did this in C++ for 3DNow! (RIP), where the order of operations is essential for performance.

    SSE is ideal for matrices, vectors, and FFT, especially for multiply and invert operations.
    On the other hand, for larger operations the GPU becomes the better solution.

    Inversion can be performed via Cramer for 4x4 and below, but should be implemented via Gauss for higher dimensions. Cramer can benefit from intrinsics, but it’s hard for Gauss.

    I’m currently using a self-implemented Gauss for matrix inversion (self-implemented because I need numerical stability rather than performance, since it’s only performed once per session), but for matrix-inversion-intensive calculations, hardware acceleration (SSE or GPU) would be a helpful feature.
