{"id":32451,"date":"2021-03-25T12:58:16","date_gmt":"2021-03-25T19:58:16","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/dotnet\/?p=32451"},"modified":"2021-09-29T12:15:42","modified_gmt":"2021-09-29T19:15:42","slug":"loop-alignment-in-net-6","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/loop-alignment-in-net-6\/","title":{"rendered":"Loop alignment in .NET 6"},"content":{"rendered":"<p style=\"text-align: justify;\">When writing a software, developers try their best to maximize the performance they can get from the code they have baked into the product. Often, there are various tools available to the developers to find that last change they can squeeze into their code to make their software run faster. But sometimes, they might notice slowness in the product because of a totally unrelated change. Even worse, when measured the performance of a feature in a lab, it might show instable performance results that looks like the following\u00a0<strong>BubbleSort<\/strong>\u00a0graph<sup>1<\/sup>. What could possibly be introducing such flakiness in the performance?<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-32454 size-full\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/instable-bubble.png\" alt=\"Instable bubble sort, Loop alignment in .NET 6\" width=\"834\" height=\"308\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/instable-bubble.png 834w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/instable-bubble-300x111.png 300w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/instable-bubble-768x284.png 768w\" sizes=\"(max-width: 834px) 100vw, 834px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\">To understand this behavior, first we need to understand how the machine code generated by the compiler is executed by the CPU. 
The CPU\u00a0<em>fetches<\/em>\u00a0the machine code (also known as the instruction stream) that it needs to execute. The instruction stream is represented as a series of bytes known as opcodes. Modern CPUs\u00a0<em>fetch<\/em>\u00a0the opcodes of instructions in chunks of 16 bytes (16B), 32 bytes (32B) or 64 bytes (64B). CISC architectures have variable-length encoding, meaning the opcode representing each instruction in the instruction stream is of variable length. So, when the Fetcher fetches a single chunk, it does not yet know where each instruction starts and ends. From the instruction stream chunk, the CPU&#8217;s Pre-decoder identifies the boundaries and lengths of the instructions, while the Decoder decodes the meaning of the opcodes of those individual instructions and produces micro-operations (<code>\u03bcops<\/code>) for each instruction. These\u00a0<code>\u03bcops<\/code>\u00a0are fed to the Decoded Stream Buffer (DSB), which is a cache that indexes\u00a0<code>\u03bcops<\/code>\u00a0by the address from which the actual instruction was fetched. Before doing a\u00a0<em>fetch<\/em>, the CPU first checks if the DSB already contains the\u00a0<code>\u03bcops<\/code>\u00a0of the instruction it wants to fetch. If they are already present, there is no need to go through a cycle of instruction fetching, pre-decoding and decoding. 
Further, there is also a Loop Stream Detector (LSD) that detects whether a stream of\u00a0<code>\u03bcops<\/code>\u00a0represents a loop; if so, it skips the front-end fetch and decode cycles and continues executing the\u00a0<code>\u03bcops<\/code>\u00a0until a loop misprediction happens.<\/p>\n<h2><a id=\"user-content-code-alignment\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#code-alignment\" aria-hidden=\"true\"><\/a>Code alignment<\/h2>\n<p style=\"text-align: justify;\">Let us suppose we are executing an application on a CPU that fetches instructions in 32B chunks. The application has a method with a hot loop inside it. Every time the application runs, the loop&#8217;s machine code is placed at a different offset. Sometimes, it could get placed such that the loop body does not cross a 32B address boundary. In those cases, the instruction fetcher can fetch the machine code of the entire loop in one round. Conversely, if the loop&#8217;s machine code is placed such that the loop body crosses a 32B boundary, the fetcher has to fetch the loop body in multiple rounds. A developer cannot control this variation in fetching time because it depends on where the machine code of the loop happens to be placed. In such cases, you could see instability in the method&#8217;s performance. Sometimes the method runs faster because the loop was aligned at a fetcher-favorable address, while other times it shows slowness because the loop was misaligned and the fetcher spent extra time fetching the loop body. Even a tiny change unrelated to the method body (like introducing a new class-level variable) can affect the code layout and misalign the loop&#8217;s machine code. This is the pattern seen in the bubble sort benchmark above. The problem is mostly visible on CISC architectures because of the variable-length encoding of the instructions. 
RISC CPUs like Arm have fixed-length encoding and hence might not see such a large variance in performance.<\/p>\n<p style=\"text-align: justify;\">To solve this problem, compilers align the hot regions of code to make sure the code&#8217;s performance remains stable. Code alignment is a technique in which the compiler adds one or more\u00a0<code>NOP<\/code>\u00a0instructions in the generated machine code just before the hot region of the code so that the hot code is shifted to an address that is\u00a0<code>mod(16)<\/code>,\u00a0<code>mod(32)<\/code>\u00a0or\u00a0<code>mod(64)<\/code>. By doing that, the hot code can be fetched in fewer cycles. Studies show that such alignment can benefit the code immensely, and additionally, the performance of aligned code is stable because it is not affected by the placement of code at misaligned address locations. To understand the impact of code alignment in detail, I would highly encourage you to watch the\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=IX16gcX4vDQ&amp;ab_channel=LLVM\" rel=\"nofollow\">Causes of Performance Swings due to Code Placement in IA<\/a>\u00a0talk given by Intel&#8217;s engineer Zia Ansari at the 2016 LLVM Developer&#8217;s Meeting.<\/p>\n<p style=\"text-align: justify;\">In .NET 5, we started\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42909\">aligning methods at a 32B boundary<\/a>. In .NET 6, we have added a feature to\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/44370\">perform adaptive loop alignment<\/a>\u00a0that adds\u00a0<code>NOP<\/code>\u00a0padding instructions in a method with loops such that the loop code starts at a\u00a0<code>mod(16)<\/code>\u00a0or\u00a0<code>mod(32)<\/code>\u00a0memory address. 
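<\/p>\n<p style=\"text-align: justify;\">To make the fetch-chunk effect concrete, here is a small illustrative sketch (plain Python with a hypothetical helper, not JIT code) that counts how many 32B fetch chunks a loop body spans depending on where it is placed:<\/p>

```python
# Illustrative sketch (hypothetical helper, not actual JIT code):
# count how many fetch chunks a code region touches.
def chunks_spanned(start, size, chunk=32):
    """Number of `chunk`-byte fetch chunks touched by [start, start+size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1

# A 30-byte loop starting exactly on a 32B boundary fits in one fetch request...
assert chunks_spanned(0x20, 30) == 1
# ...but the same loop shifted by 8 bytes straddles two chunks, so the
# fetcher needs two requests on every iteration of the loop.
assert chunks_spanned(0x28, 30) == 2
```

<p style=\"text-align: justify;\">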
In this post, I will describe the design choices we made, the various heuristics we accounted for, and the analysis we did on 100+ benchmarks that led us to believe that our current loop alignment algorithm will be beneficial in stabilizing and improving the performance of .NET code.<\/p>\n<h2>Heuristics<\/h2>\n<p>When we started working on this feature, we wanted to accomplish the following things:<\/p>\n<ul>\n<li>Identify the hot innermost loop(s) that execute very frequently.<\/li>\n<li>Add\u00a0<code>NOP<\/code>\u00a0instructions before the loop code such that the first instruction within the loop falls on a 32B boundary.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">Below is an example of a loop\u00a0<code>IG04~IG05<\/code>\u00a0that is aligned by adding 6 bytes of\u00a0<code>align<\/code>\u00a0instructions. Although I will represent the padding as\u00a0<code>align [X bytes]<\/code>\u00a0in the disassembly in this post, we actually emit\u00a0<a href=\"https:\/\/www.felixcloutier.com\/x86\/nop\" rel=\"nofollow\">multi-byte NOPs<\/a>\u00a0for the actual padding.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">...<\/span>\r\n<span class=\"pl-en\">00007ff9a59ecff6        <\/span><span class=\"pl-k\">test<\/span>     <span class=\"pl-v\">edx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">edx<\/span>\r\n<span class=\"pl-en\">00007ff9a59ecff8        <\/span><span class=\"pl-k\">jle<\/span><span class=\"pl-en\">      SHORT G_M22313_IG06<\/span>\r\n<span class=\"pl-en\">00007ff9a59ecffa        align    <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-c1\">6<\/span><span class=\"pl-en\"> bytes<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-c\">; ............................... 
32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">G_M22313_IG04:<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed000        <\/span><span class=\"pl-k\">movsxd<\/span>   <span class=\"pl-v\">r8<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">eax<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed003        <\/span><span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">r8d<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> dword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rcx<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">4<\/span><span class=\"pl-s1\">*<\/span><span class=\"pl-v\">r8<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed008        <\/span><span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">r8d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">esi<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed00b        <\/span><span class=\"pl-k\">jge<\/span><span class=\"pl-en\">      SHORT G_M22313_IG14<\/span>\r\n\r\n<span class=\"pl-en\">G_M22313_IG05:<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed00d        <\/span><span class=\"pl-k\">inc<\/span>      <span class=\"pl-v\">eax<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed00f        <\/span><span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">edx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">eax<\/span>\r\n<span class=\"pl-en\">00007ff9a59ed011        <\/span><span class=\"pl-k\">jg<\/span><span class=\"pl-en\">       SHORT G_M22313_IG04<\/span><\/pre>\n<\/div>\n<p style=\"text-align: justify;\">A simple approach would be to add padding to all the hot loops. However, as I will describe in the\u00a0<a href=\"#memory-cost\">Memory cost<\/a>\u00a0section below, there is a cost associated with padding all the loops of a method. 
There are a lot of considerations we need to take into account to get a stable performance boost for the hot loops while ensuring that performance is not degraded for loops that don&#8217;t benefit from padding.<\/p>\n<h3>Alignment boundary<\/h3>\n<p style=\"text-align: justify;\">Depending on the design of a processor, the software running on it benefits more if the hot code is aligned at a\u00a0<code>16B<\/code>,\u00a0<code>32B<\/code>\u00a0or\u00a0<code>64B<\/code>\u00a0alignment boundary. The alignment should be in multiples of\u00a0<code>16<\/code>, and since the most recommended boundary for major hardware manufacturers like Intel, AMD and Arm is\u00a0<code>32 bytes<\/code>, we chose\u00a0<code>32<\/code>\u00a0as our default alignment boundary. With adaptive alignment (controlled using the\u00a0<code>COMPlus_JitAlignLoopAdaptive<\/code>\u00a0environment variable, which is set to\u00a0<code>1<\/code>\u00a0by default), we will try to align a loop at a\u00a0<code>32 byte<\/code>\u00a0boundary. But if we do not see that it is profitable to align a loop on a\u00a0<code>32 byte<\/code>\u00a0boundary (for reasons listed below), we will try to align that loop at a\u00a0<code>16 byte<\/code>\u00a0boundary. With non-adaptive alignment (<code>COMPlus_JitAlignLoopAdaptive=0<\/code>), we will always try to align a loop at a\u00a0<code>32 byte<\/code>\u00a0boundary by default. The alignment boundary can also be changed using the\u00a0<code>COMPlus_JitAlignLoopBoundary<\/code>\u00a0environment variable. 
Adaptive and non-adaptive alignment differ in the amount of padding bytes added, which I will discuss in the\u00a0<code>Padding amount<\/code>\u00a0section below.<\/p>\n<h3><a id=\"user-content-loop-selection\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#loop-selection\" aria-hidden=\"true\"><\/a>Loop selection<\/h3>\n<p style=\"text-align: justify;\">There is a cost associated with a padding instruction. Although a\u00a0<code>NOP<\/code>\u00a0instruction is cheap, it takes a few cycles to fetch and decode it. So, having too many\u00a0<code>NOP<\/code>\u00a0instructions in a hot code path can adversely affect the performance of the code. Hence, it would not be appropriate to align every possible loop in a method. That is the reason LLVM has\u00a0<code>-align-all-*<\/code>\u00a0flags and gcc has the\u00a0<code>-falign-loops<\/code>\u00a0flag: to give developers control over which loops should be aligned. Hence, the foremost thing we wanted to do was identify the loops in a method that would benefit most from alignment. To start with, we decided to align just the non-nested loops whose block weight meets a certain threshold (controlled by\u00a0<code>COMPlus_JitAlignLoopMinBlockWeight<\/code>). Block weight is a mechanism by which the compiler knows how frequently a particular block executes and, depending on that, performs various optimizations on that block. In the example below,\u00a0<code>j-loop<\/code>\u00a0and\u00a0<code>k-loop<\/code>\u00a0are marked as loop alignment candidates, provided they execute often enough to satisfy the block weight criteria. 
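<\/p>\n<p style=\"text-align: justify;\">In spirit, the early selection can be sketched as follows (illustrative Python with hypothetical names and a made-up threshold value; the real logic lives in the JIT&#8217;s C++ code):<\/p>

```python
# Illustrative sketch of the early loop-selection filter. The names and the
# threshold value are hypothetical stand-ins, not the JIT's actual code.
MIN_BLOCK_WEIGHT = 4  # stand-in for COMPlus_JitAlignLoopMinBlockWeight

def is_alignment_candidate(loop):
    """loop: dict with 'weight', 'is_nested' and 'has_call' (assumed shape)."""
    return (not loop["is_nested"]                    # align only non-nested loops
            and not loop["has_call"]                 # calls evict the caller's code
            and loop["weight"] >= MIN_BLOCK_WEIGHT)  # hot enough to matter

# A hot, call-free, non-nested loop qualifies (like j-loop or k-loop below).
assert is_alignment_candidate({"weight": 100, "is_nested": False, "has_call": False})
# A loop containing a call is filtered out (like l-loop below).
assert not is_alignment_candidate({"weight": 100, "is_nested": False, "has_call": True})
```

<p style=\"text-align: justify;\">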
This is done in the\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d39ad7a8df5a7dc22e042a392919d181df34392d\/src\/coreclr\/jit\/optimizer.cpp#L2587\">optIdentifyLoopsForAlignment<\/a>\u00a0method of the JIT.<\/p>\n<p style=\"text-align: justify;\">If a loop has a call, the instructions of the caller method will be flushed and those of the callee will be loaded. In such cases, there is no benefit in aligning the loop inside the caller. Therefore, we decided not to align loops that contain a method call. Below,\u00a0<code>l-loop<\/code>, although non-nested, has a call and hence we will not align it. We filter out such loops in\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/cb606ad46457dd2e9e11f1d9b954f239b2d9a98a\/src\/coreclr\/jit\/optimizer.cpp#L8153-L8159\">AddContainsCallAllContainingLoops<\/a>.<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">void<\/span> <span class=\"pl-en\">SomeMethod<\/span>(<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">N<\/span>, <span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">M<\/span>) {\r\n    <span class=\"pl-k\">for<\/span> (<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>; <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-smi\">N<\/span>; <span class=\"pl-smi\">i<\/span><span class=\"pl-k\">++<\/span>) {\r\n\r\n        <span class=\"pl-c\">\/\/ j-loop is alignment candidate<\/span>\r\n        <span class=\"pl-k\">for<\/span> (<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">j<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>; <span class=\"pl-smi\">j<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-smi\">M<\/span>; <span class=\"pl-smi\">j<\/span><span class=\"pl-k\">++<\/span>) {\r\n            <span class=\"pl-c\">\/\/ body<\/span>\r\n        }\r\n    }\r\n\r\n    <span class=\"pl-k\">if<\/span> (<span 
class=\"pl-smi\">condition<\/span>) {\r\n        <span class=\"pl-k\">return<\/span>;\r\n    }\r\n\r\n    <span class=\"pl-c\">\/\/ k-loop is alignment candidate<\/span>\r\n    <span class=\"pl-k\">for<\/span> (<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">k<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>; <span class=\"pl-smi\">k<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-smi\">M<\/span> <span class=\"pl-k\">+<\/span> <span class=\"pl-smi\">N<\/span>; <span class=\"pl-smi\">k<\/span><span class=\"pl-k\">++<\/span>) {\r\n        <span class=\"pl-c\">\/\/ body<\/span>\r\n    }\r\n\r\n    <span class=\"pl-k\">for<\/span> (<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">l<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>; <span class=\"pl-smi\">l<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-smi\">M<\/span>; <span class=\"pl-smi\">l<\/span><span class=\"pl-k\">++<\/span>) {\r\n        <span class=\"pl-c\">\/\/ body<\/span>\r\n        <span class=\"pl-en\">OtherMethod<\/span>();\r\n    }\r\n}<\/pre>\n<\/div>\n<p>Once loops are identified in early phase, we proceed forward with advanced checks to see if padding is beneficial and if yes, what should be the padding amount. All those calculations happen in\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/d39ad7a8df5a7dc22e042a392919d181df34392d\/src\/coreclr\/jit\/emit.cpp#L4919\">emitCalculatePaddingForLoopAlignment<\/a>.<\/p>\n<h3><a id=\"user-content-loop-size\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#loop-size\" aria-hidden=\"true\"><\/a>Loop size<\/h3>\n<p style=\"text-align: justify;\">Aligning a loop is beneficial if the loop is small. 
As the loop size grows, the effect of padding fades, because there is already so much instruction fetching, decoding and control flow happening that the address at which the first instruction of the loop sits matters much less. We have defaulted the loop size limit to\u00a0<code>96 bytes<\/code>, which is 3 x 32-byte chunks. In other words, any inner loop that is small enough to fit in 3 chunks of\u00a0<code>32B<\/code>\u00a0each will be considered for alignment. For experimentation, that limit can be changed using the\u00a0<code>COMPlus_JitAlignLoopMaxCodeSize<\/code>\u00a0environment variable.<\/p>\n<h3><a id=\"user-content-aligned-loop\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#aligned-loop\" aria-hidden=\"true\"><\/a>Aligned loop<\/h3>\n<p style=\"text-align: justify;\">Next, we check if the loop is already aligned at the desired alignment boundary (<code>32 byte<\/code>\u00a0or\u00a0<code>16 byte<\/code>\u00a0for adaptive alignment, and\u00a0<code>32 byte<\/code>\u00a0for non-adaptive alignment). In such cases, no extra padding is needed. 
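<\/p>\n<p style=\"text-align: justify;\">The check itself is simple modular arithmetic; a minimal sketch (illustrative Python, not the JIT&#8217;s actual code) might look like this:<\/p>

```python
# Illustrative sketch: bytes of NOP padding needed to bring a loop's first
# instruction to the next alignment boundary; 0 means it is already aligned.
def padding_needed(loop_start, boundary=32):
    return (boundary - loop_start % boundary) % boundary

# 0x00007ff9a91f5980 is 0 (mod 32): already aligned, so align [0 bytes].
assert padding_needed(0x00007ff9a91f5980) == 0
# One byte past a boundary would need 31 bytes to reach the next one.
assert padding_needed(0x00007ff9a91f5981) == 31
```

<p style=\"text-align: justify;\">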
Below, the loop at\u00a0<code>IG10<\/code>\u00a0starts at address\u00a0<code>0x00007ff9a91f5980 == 0 (mod 32)<\/code>, so it is already at the desired offset and no extra padding is needed to align it further.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">00007ff9a91f597a        <\/span><span class=\"pl-k\">cmp<\/span><span class=\"pl-en\">      dword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rbp<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-v\">r8d<\/span>\r\n<span class=\"pl-en\">00007ff9a91f597e        <\/span><span class=\"pl-k\">jl<\/span><span class=\"pl-en\">       SHORT G_M24050_IG12<\/span>\r\n<span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jl: 0) 32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">00007ff9a91f5980        align    <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-c1\">0<\/span><span class=\"pl-en\"> bytes<\/span><span class=\"pl-s1\">]<\/span>\r\n\r\n<span class=\"pl-en\">G_M24050_IG10:<\/span>\r\n<span class=\"pl-en\">00007ff9a91f5980        <\/span><span class=\"pl-k\">movsxd<\/span>   <span class=\"pl-v\">rdx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a91f5983        <\/span><span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">r9<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> qword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rbp<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">*<\/span><span class=\"pl-v\">rdx<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">00007ff9a91f5988        <\/span><span class=\"pl-k\">mov<\/span><span class=\"pl-en\">      qword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rsi<\/span><span 
class=\"pl-s1\">+<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">*<\/span><span class=\"pl-v\">rdx<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-v\">r9<\/span>\r\n<span class=\"pl-en\">00007ff9a91f598d        <\/span><span class=\"pl-k\">inc<\/span>      <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a91f598f        <\/span><span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">r8d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a91f5992        <\/span><span class=\"pl-k\">jg<\/span><span class=\"pl-en\">       SHORT G_M24050_IG10<\/span><\/pre>\n<\/div>\n<p style=\"text-align: justify;\">We have also added a &#8220;nearly aligned loop&#8221; guard. There can be loops that do not start exactly at\u00a0<code>32B<\/code>\u00a0boundary, but they are small enough to entirely fit in a single\u00a0<code>32B<\/code>\u00a0chunk. All the code of such loops can be fetched with a single instruction fetcher request. In below example, the instructions between the two\u00a0<code>32B<\/code>\u00a0boundary (marked with\u00a0<code>32B boundary<\/code>) fits in a single chunk of 32 bytes. The loop\u00a0<code>IG04<\/code>\u00a0is part of that chunk and its performance will not improve if we add extra padding to it to make the loop start at\u00a0<code>32B<\/code>\u00a0boundary. Even without padding, the entire loop will be fetched anyway in a single request. 
Hence, there is no point aligning such loops.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 3) 32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">00007ff9a921a903        <\/span><span class=\"pl-k\">call<\/span><span class=\"pl-en\">     CORINFO_HELP_NEWARR_1_VC<\/span>\r\n<span class=\"pl-en\">00007ff9a921a908        <\/span><span class=\"pl-k\">xor<\/span>      <span class=\"pl-v\">ecx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a921a90a        <\/span><span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">edx<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> dword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">00007ff9a921a90d        <\/span><span class=\"pl-k\">test<\/span>     <span class=\"pl-v\">edx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">edx<\/span>\r\n<span class=\"pl-en\">00007ff9a921a90f        <\/span><span class=\"pl-k\">jle<\/span><span class=\"pl-en\">      SHORT G_M24257_IG05<\/span>\r\n<span class=\"pl-en\">00007ff9a921a911        align    <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-c1\">0<\/span><span class=\"pl-en\"> bytes<\/span><span class=\"pl-s1\">]<\/span>\r\n\r\n<span class=\"pl-en\">G_M24257_IG04:<\/span>\r\n<span class=\"pl-en\">00007ff9a921a911        <\/span><span class=\"pl-k\">movsxd<\/span>   <span class=\"pl-v\">r8<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a921a914        <\/span><span class=\"pl-k\">mov<\/span><span class=\"pl-en\">      qword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">*<\/span><span 
class=\"pl-v\">r8<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-v\">rsi<\/span>\r\n<span class=\"pl-en\">00007ff9a921a919        <\/span><span class=\"pl-k\">inc<\/span>      <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a921a91b        <\/span><span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">edx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a921a91d        <\/span><span class=\"pl-k\">jg<\/span><span class=\"pl-en\">       SHORT G_M24257_IG04<\/span>\r\n\r\n<span class=\"pl-en\">G_M24257_IG05:<\/span>\r\n<span class=\"pl-en\">00007ff9a921a91f        <\/span><span class=\"pl-k\">add<\/span>      <span class=\"pl-v\">rsp<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-c1\">40<\/span>\r\n<span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (add: 3) 32B boundary ...............................<\/span><\/pre>\n<\/div>\n<p style=\"text-align: justify;\">This was an important guard that we added in our loop alignment logic. Without this, imagine a loop of size\u00a0<code>20 bytes<\/code>\u00a0that starts at offset\u00a0<code>mod(32) + 1<\/code>. To align this loop, it needed padding of\u00a0<code>31 bytes<\/code>\u00a0which might not be beneficial in certain scenarios where\u00a0<code>31 byte<\/code>\u00a0<code>NOP<\/code>\u00a0instructions are on hot code path. The &#8220;nearly aligned loop&#8221; protects us from such scenarios.<\/p>\n<p style=\"text-align: justify;\">The &#8220;nearly aligned loop&#8221; check is not restrictive to just small loop that fits in a single\u00a0<code>32B<\/code>\u00a0chunk. For any loop, we calculate the minimum number of chunks needed to fit the loop code. 
Now, if the loop is already aligned such that it occupies that minimum number of chunks, then we can safely skip padding the loop further, because padding will not make it any better.<\/p>\n<p style=\"text-align: justify;\">In the example below, the loop\u00a0<code>IG04<\/code>\u00a0is\u00a0<code>37 bytes<\/code>\u00a0long (<code>00007ff9a921c690 - 00007ff9a921c66b = 37<\/code>). It needs a minimum of 2 chunks of\u00a0<code>32B<\/code>\u00a0to fit. If the loop starts anywhere between\u00a0<code>mod(32)<\/code>\u00a0and\u00a0<code>mod(32) + (64 - 37)<\/code>, we can safely skip the padding because the loop is already placed such that its body will be fetched in 2 requests (<code>32 bytes<\/code>\u00a0in the 1st request and\u00a0<code>5 bytes<\/code>\u00a0in the next request).<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (xor: 2) 32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">00007ff9a921c662        <\/span><span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">r12d<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> dword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">r14<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">8<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">00007ff9a921c666        <\/span><span class=\"pl-k\">test<\/span>     <span class=\"pl-v\">r12d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">r12d<\/span>\r\n<span class=\"pl-en\">00007ff9a921c669        <\/span><span class=\"pl-k\">jle<\/span><span class=\"pl-en\">      SHORT G_M11250_IG07<\/span>\r\n<span class=\"pl-en\">00007ff9a921c66b        align    <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-c1\">0<\/span><span class=\"pl-en\"> bytes<\/span><span class=\"pl-s1\">]<\/span>\r\n\r\n<span class=\"pl-en\">G_M11250_IG04:<\/span>\r\n<span class=\"pl-en\">00007ff9a921c66b        <\/span><span class=\"pl-k\">cmp<\/span>      
<span class=\"pl-v\">r15d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ebx<\/span>\r\n<span class=\"pl-en\">00007ff9a921c66e        <\/span><span class=\"pl-k\">jae<\/span><span class=\"pl-en\">      G_M11250_IG19<\/span>\r\n<span class=\"pl-en\">00007ff9a921c674        <\/span><span class=\"pl-k\">movsxd<\/span>   <span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">r15d<\/span>\r\n<span class=\"pl-en\">00007ff9a921c677        <\/span><span class=\"pl-k\">shl<\/span>      <span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-c1\">5<\/span>\r\n<span class=\"pl-en\">00007ff9a921c67b        <\/span><span class=\"pl-k\">vmovupd<\/span>  <span class=\"pl-v\">ymm0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> ymmword ptr<\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rsi<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (movupd: 1) 32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">00007ff9a921c681        <\/span><span class=\"pl-k\">vmovupd<\/span><span class=\"pl-en\">  ymmword ptr<\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">r14<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">+<\/span><span class=\"pl-c1\">16<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-v\">ymm0<\/span>\r\n<span class=\"pl-en\">00007ff9a921c688        <\/span><span class=\"pl-k\">inc<\/span>      <span class=\"pl-v\">r15d<\/span>\r\n<span class=\"pl-en\">00007ff9a921c68b        <\/span><span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">r12d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">r15d<\/span>\r\n<span class=\"pl-en\">00007ff9a921c68e        <\/span><span class=\"pl-k\">jg<\/span><span class=\"pl-en\">       
SHORT G_M11250_IG04<\/span>\r\n\r\n<span class=\"pl-en\">G_M11250_IG05:<\/span>\r\n<span class=\"pl-en\">00007ff9a921c690        <\/span><span class=\"pl-k\">jmp<\/span><span class=\"pl-en\">      SHORT G_M11250_IG07<\/span>\r\n<span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (xor: 1) 32B boundary ...............................<\/span><\/pre>\n<\/div>\n<p style=\"text-align: justify;\">To recap: so far, we have identified the hot innermost loops in a method that need padding, filtered out the ones that have calls, filtered out the ones that are bigger than our threshold, and verified that the first instruction of each remaining loop is placed such that extra padding will align it at the desired alignment boundary.<\/p>\n<h3>Padding amount<\/h3>\n<p style=\"text-align: justify;\">To align a loop,\u00a0<code>NOP<\/code>\u00a0instructions need to be inserted before the loop starts so that the first instruction of the loop starts at an address that is\u00a0<code>mod(32)<\/code>\u00a0or\u00a0<code>mod(16)<\/code>. How much padding to add to align a loop is a design choice. E.g., to align a loop to a 32B boundary, we can choose to add a maximum padding of 31 bytes, or we can place a limit on the padding amount. Since padding\u00a0<code>NOP<\/code>\u00a0instructions are not free (they will get executed, either as part of the method flow or on every iteration if the aligned loop is nested inside another loop), we need to make a careful choice about how much padding should be added. With the non-adaptive approach, if alignment needs to happen at an\u00a0<code>N<\/code>-byte boundary, we will try to add at most\u00a0<code>N-1<\/code>\u00a0bytes to align the first instruction of the loop. 
So, with the\u00a0<code>32B<\/code>\u00a0or\u00a0<code>16B<\/code>\u00a0non-adaptive technique, we will try to align a loop to a 32-byte or 16-byte boundary by adding at most 31 bytes or 15 bytes of padding, respectively.<\/p>\n<p style=\"text-align: justify;\">However, as mentioned above, we realized that adding a lot of padding regresses the performance of the code. For example, if a loop that is 15 bytes long starts at offset\u00a0<code>mod(32) + 2<\/code>, the non-adaptive\u00a0<code>32B<\/code>\u00a0approach would add\u00a0<code>30 bytes<\/code>\u00a0of padding to align that loop to the next\u00a0<code>32B<\/code>\u00a0boundary address. Thus, to align a loop that is only 15 bytes long, we have added an extra 30 bytes. If the loop that we aligned is a nested loop, the processor will fetch and decode these 30 bytes of\u00a0<code>NOP<\/code>\u00a0instructions on every iteration of the outer loop. We have also increased the size of the method by 30 bytes. Lastly, since we would always try to align the loop at a\u00a0<code>32B<\/code>\u00a0boundary, we might add more padding than would have been needed to align the loop at a\u00a0<code>16B<\/code>\u00a0boundary. With all these shortcomings, we came up with an adaptive alignment algorithm.<\/p>\n<p style=\"text-align: justify;\">In adaptive alignment, we limit the amount of padding added depending on the size of the loop. In this technique, the biggest possible padding that will be added is 15 bytes, for a loop that fits in one 32B chunk. If the loop is bigger and fits in two 32B chunks, we reduce the padding amount to 7 bytes, and so forth. The reasoning is that the bigger the loop gets, the less it benefits from alignment. With this approach, we could align a loop that occupies four 32B chunks if the padding needed is 1 byte. 
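<p style=\"text-align: justify;\">The adaptive limit described above can be sketched as follows (illustrative Python, not the actual RyuJIT implementation); the allowed padding roughly halves for every extra 32B chunk the loop occupies:<\/p>

```python
def adaptive_max_padding(loop_size: int) -> int:
    """Maximum padding allowed under the adaptive policy: 15 bytes for a
    loop fitting in one 32B chunk, 7 for two chunks, 3 for three, 1 for
    four, and none beyond that."""
    blocks = -(-loop_size // 32)            # ceiling division: 32B chunks needed
    return (1 << max(0, 5 - blocks)) - 1    # 15, 7, 3, 1, then 0

assert adaptive_max_padding(20) == 15   # fits in one 32B chunk
assert adaptive_max_padding(60) == 7    # needs two chunks
assert adaptive_max_padding(120) == 1   # needs four chunks
```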
With the 32B non-adaptive approach, we would never align such loops (because of the\u00a0<code>COMPlus_JitAlignLoopMaxCodeSize<\/code>\u00a0limit).<\/p>\n<table>\n<thead>\n<tr>\n<th align=\"center\">Max Pad (bytes)<\/th>\n<th align=\"center\">Minimum 32B blocks needed to fit the loop<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td align=\"center\">15<\/td>\n<td align=\"center\">1<\/td>\n<\/tr>\n<tr>\n<td align=\"center\">7<\/td>\n<td align=\"center\">2<\/td>\n<\/tr>\n<tr>\n<td align=\"center\">3<\/td>\n<td align=\"center\">3<\/td>\n<\/tr>\n<tr>\n<td align=\"center\">1<\/td>\n<td align=\"center\">4<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Next, if because of the padding limit we cannot align the loop at a 32B boundary, the algorithm will try to align it at a\u00a0<code>16B<\/code>\u00a0boundary. In that case we reduce the max padding limit further, as seen in the table below.<\/p>\n<table>\n<thead>\n<tr>\n<th align=\"center\">Max Pad (bytes)<\/th>\n<th align=\"center\">Minimum 32B blocks needed to fit the loop<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td align=\"center\">7<\/td>\n<td align=\"center\">1<\/td>\n<\/tr>\n<tr>\n<td align=\"center\">3<\/td>\n<td align=\"center\">2<\/td>\n<\/tr>\n<tr>\n<td align=\"center\">1<\/td>\n<td align=\"center\">3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>With the adaptive alignment model, instead of giving up on padding a loop entirely (because of the\u00a0<code>32B<\/code>\u00a0padding limit), we still try to align the loop at the next best alignment boundary.<\/p>\n<h3><a id=\"user-content-padding-placement\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#padding-placement\" aria-hidden=\"true\"><\/a>Padding placement<\/h3>\n<p style=\"text-align: justify;\">Once we have decided that padding is needed and calculated the padding amount, the important design choice is where to place the padding instructions. 
In .NET 6, this is done na\u00efvely by placing the padding instructions just before the loop starts. But as described above, that can adversely affect performance because the padding instructions can fall on the execution path. A smarter way would be to detect blind spots in the code before the loop and place the padding there, such that the padding instructions do not get executed or are executed rarely. For example, if we have an unconditional jump somewhere in the method code, we could add the padding instructions after that unconditional jump. By doing this, we make sure that the padding is never executed while the loop still gets aligned at the right boundary. Another place where such padding can be added is a cold block, one that executes rarely (based on Profile-Guided Optimization data). The blind spot that we select should be lexically before the loop that we are trying to align.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">00007ff9a59feb6b        <\/span><span class=\"pl-k\">jmp<\/span><span class=\"pl-en\">      SHORT G_M17025_IG30<\/span>\r\n\r\n<span class=\"pl-en\">G_M17025_IG29:<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb6d        <\/span><span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">rax<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">rcx<\/span>\r\n\r\n<span class=\"pl-en\">G_M17025_IG30:<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb70        <\/span><span class=\"pl-k\">mov<\/span>      <span class=\"pl-v\">ecx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">eax<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb72        <\/span><span class=\"pl-k\">shr<\/span>      <span class=\"pl-v\">ecx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-c1\">3<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb75        <\/span><span class=\"pl-k\">xor<\/span>      <span class=\"pl-v\">r8d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">r8d<\/span>\r\n<span 
class=\"pl-en\">00007ff9a59feb78        <\/span><span class=\"pl-k\">test<\/span>     <span class=\"pl-v\">ecx<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb7a        <\/span><span class=\"pl-k\">jbe<\/span><span class=\"pl-en\">      SHORT G_M17025_IG32<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb7c        align    <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-c1\">4<\/span><span class=\"pl-en\"> bytes<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-c\">; ............................... 32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">G_M17025_IG31:<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb80        <\/span><span class=\"pl-k\">vmovupd<\/span>  <span class=\"pl-v\">xmm0<\/span><span class=\"pl-s1\">,<\/span><span class=\"pl-en\"> xmmword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rdi<\/span><span class=\"pl-s1\">]<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb84        <\/span><span class=\"pl-k\">vptest<\/span>   <span class=\"pl-v\">xmm0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">xmm6<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb89        <\/span><span class=\"pl-k\">jne<\/span><span class=\"pl-en\">      SHORT G_M17025_IG33<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb8b        <\/span><span class=\"pl-k\">vpackuswb<\/span> <span class=\"pl-v\">xmm0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">xmm0<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">xmm0<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb8f        <\/span><span class=\"pl-k\">vmovq<\/span><span class=\"pl-en\">    xmmword ptr <\/span><span class=\"pl-s1\">[<\/span><span class=\"pl-v\">rsi<\/span><span class=\"pl-s1\">],<\/span> <span class=\"pl-v\">xmm0<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb93        <\/span><span class=\"pl-k\">add<\/span>      <span class=\"pl-v\">rdi<\/span><span 
class=\"pl-s1\">,<\/span> <span class=\"pl-c1\">16<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb97        <\/span><span class=\"pl-k\">add<\/span>      <span class=\"pl-v\">rsi<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-c1\">8<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb9b        <\/span><span class=\"pl-k\">inc<\/span>      <span class=\"pl-v\">r8d<\/span>\r\n<span class=\"pl-en\">00007ff9a59feb9e        <\/span><span class=\"pl-k\">cmp<\/span>      <span class=\"pl-v\">r8d<\/span><span class=\"pl-s1\">,<\/span> <span class=\"pl-v\">ecx<\/span>\r\n<span class=\"pl-c\">; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1) 32B boundary ...............................<\/span>\r\n<span class=\"pl-en\">00007ff9a59feba1        <\/span><span class=\"pl-k\">jb<\/span><span class=\"pl-en\">       SHORT G_M17025_IG31<\/span>\r\n<\/pre>\n<\/div>\n<p style=\"text-align: justify;\">In above example, we aligned loop\u00a0<code>IG31<\/code>\u00a0with\u00a0<code>4 bytes<\/code>\u00a0padding, but we have inserted the padding right before the first instruction of the loop. Instead, we can add that padding after the\u00a0<code>jmp<\/code>\u00a0instruction present at\u00a0<code>00007ff9a59feb6b<\/code>. That way, the padding will never be executed, but\u00a0<code>IG31<\/code>\u00a0will still get aligned at desired boundary.<\/p>\n<h3>Memory cost<\/h3>\n<p style=\"text-align: justify;\">Lastly, there is a need to evaluate how much extra memory is allocated by the runtime for adding the extra padding before the loop. If the compiler aligns every hot loop, it can increase the code size of a method. There must be a right balance between the loop size, frequency of its execution, padding needed, padding placement to ensure only the loops that truly benefit with the alignment are padded. 
Another aspect: if the JIT, before allocating memory for the generated code, can evaluate how much padding is needed to align a loop, it can request the precise amount of memory to accommodate the extra padding instructions. However, in RyuJIT, we first generate the code (using our internal data structures), sum up the total instruction size, and only then determine the amount of memory needed to store the instructions. Next, the JIT allocates the memory from the runtime and, lastly, it emits and stores the actual machine instructions in the allocated memory buffer. During code generation (when we do the loop alignment calculation), we do not know the offset at which the loop will be placed in the memory buffer. In that case, we have to pessimistically assume the maximum possible padding needed. If there are many loops in a method that would benefit from alignment, assuming the maximum possible padding for all of them would increase the allocation size of that method, even though the code size would be much smaller (depending on the actual padding added).<\/p>\n<p style=\"text-align: justify;\">The graph below demonstrates the impact of loop alignment on code size and allocation size. Allocation size represents the amount of memory allocated to store the machine code of all the .NET libraries&#8217; methods, while code size represents the actual amount of memory needed to store the methods&#8217; machine code. The code size is lowest for the\u00a0<code>32BAdaptive<\/code>\u00a0technique. This is because we cut off the padding amount depending on the loop size, as discussed before. So from a memory perspective,\u00a0<code>32BAdaptive<\/code> wins. 
The numbers on the Y-axis represent code and allocation sizes in bytes.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-32455 size-full\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/size-compare1.png\" alt=\"Size comparison 1\" width=\"645\" height=\"330\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/size-compare1.png 645w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/size-compare1-300x153.png 300w\" sizes=\"(max-width: 645px) 100vw, 645px\" \/><\/p>\n<p style=\"text-align: justify;\">The allocation size in the above graph is higher than the code size for every implementation because we accounted for the maximum possible padding for every loop when calculating the allocation size. Ideally, we want the allocation size to be the same as the code size. Below is another view that demonstrates the difference between the allocation size and the code size. The difference is highest for the 32B non-adaptive implementation and lowest for 16B non-adaptive. 32B adaptive is marginally higher than 16B non-adaptive, but again, since its overall code size is minimal compared to 16B\/32B non-adaptive,\u00a0<code>32BAdaptive<\/code>\u00a0is the winner.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-32456 size-full\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/size-compare2.png\" alt=\"Size comparison 2\" width=\"633\" height=\"297\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/size-compare2.png 633w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/size-compare2-300x141.png 300w\" sizes=\"(max-width: 633px) 100vw, 633px\" \/><\/p>\n<p style=\"text-align: justify;\">However, to make sure that we know the precise amount of padding we are going to add before allocating the memory, we devised a workaround. 
During code generation, we know that\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/42909\">the method starts<\/a>\u00a0at offset\u00a0<code>0 (mod 32)<\/code>. We calculate the padding needed to align the loop and\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/2ee54f78dd4c35dd370287c12672d6b6750f98a4\/src\/coreclr\/jit\/emit.cpp#L4797\">update the\u00a0<code>align<\/code>\u00a0instruction with that amount<\/a>. Thus, we allocate the memory considering the real padding, and we do not allocate extra memory for loops that need no padding. This works as long as the size of every instruction estimated during code generation matches its actual size when the instruction is emitted. Sometimes, during emitting, we realize that a shorter encoding is optimal for an instruction, and that makes its actual size deviate from the estimate. We cannot afford this misprediction for an instruction that falls before a loop that we are about to align, because it would change the placement of the loop.<\/p>\n<p style=\"text-align: justify;\">In the example below, the loop starts at\u00a0<code>IG05<\/code>, and during code generation we know that by adding 1 byte of padding, we can align that loop at offset\u00a0<code>0080<\/code>. But while emitting, if we decide to encode\u00a0<code>instruction_1<\/code>\u00a0so that it takes just 2 bytes instead of the 3 bytes we estimated, the loop will start at memory address\u00a0<code>00007ff9a59f007E<\/code>. 
Adding 1 byte of padding would make it start at\u00a0<code>00007ff9a59f007F<\/code>\u00a0which is not what we wanted.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">007A instruction_1<\/span><span class=\"pl-c\">  ; size = 3 bytes<\/span>\r\n<span class=\"pl-en\">007D instruction_2<\/span><span class=\"pl-c\">  ; size = 2 bytes<\/span>\r\n\r\n<span class=\"pl-en\">IG05:<\/span>\r\n<span class=\"pl-en\">007F instruction_3<\/span><span class=\"pl-c\">  ; start of loop<\/span>\r\n<span class=\"pl-c1\">0083<\/span><span class=\"pl-en\"> instruction_4<\/span>\r\n<span class=\"pl-c1\">0087<\/span><span class=\"pl-en\"> instruction_5<\/span>\r\n<span class=\"pl-c1\">0089<\/span> <span class=\"pl-k\">jmp<\/span><span class=\"pl-en\"> IG05<\/span><\/pre>\n<\/div>\n<p style=\"text-align: justify;\">Hence, to account for this over-estimation of certain instructions, we\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/63a5d5beb462221d5ca4f737c497902dd334f0d5\/src\/coreclr\/jit\/emitxarch.cpp#L13865-L13897\">compensate by adding extra NOP<\/a>\u00a0instructions. 
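<p style=\"text-align: justify;\">The idea behind the compensation can be sketched like this (a simplified model in Python, not the actual emitter code): while emitting the instructions that precede the loop, any bytes saved by a shorter-than-estimated encoding are immediately given back as <code>NOP<\/code>s, so the loop head keeps its estimated offset:<\/p>

```python
def emit_with_compensation(instrs):
    """instrs: (estimated_size, actual_size) pairs for the instructions
    before the loop head. Emits each instruction and pads with 1-byte
    NOPs whenever the actual encoding came out shorter than estimated,
    so that every later offset still matches the estimate."""
    layout = []
    for estimated, actual in instrs:
        layout.append(("insn", actual))
        if actual < estimated:
            layout.extend([("nop", 1)] * (estimated - actual))
    return layout

# instruction_1 was estimated at 3 bytes but encoded in 2; the 1-byte NOP
# keeps the total at 3 + 2 = 5 bytes, preserving the loop head's offset.
layout = emit_with_compensation([(3, 2), (2, 2)])
assert sum(size for _, size in layout) == 5
```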
As seen below, with this\u00a0<code>NOP<\/code>, our loop will continue to start at\u00a0<code>00007ff9a59f007F<\/code>, and the 1 byte of padding will align it at address\u00a0<code>00007ff9a59f0080<\/code>.<\/p>\n<div class=\"highlight highlight-source-assembly\">\n<pre><span class=\"pl-en\">00007ff9a59f007A instruction_1<\/span><span class=\"pl-c\">  ; size = 2 bytes<\/span>\r\n<span class=\"pl-en\">00007ff9a59f007C <\/span><span class=\"pl-k\">NOP<\/span><span class=\"pl-c\">            ; size = 1 byte (compensation)<\/span>\r\n<span class=\"pl-en\">00007ff9a59f007D instruction_2<\/span><span class=\"pl-c\">  ; size = 2 bytes<\/span>\r\n\r\n<span class=\"pl-en\">IG05:<\/span>\r\n<span class=\"pl-en\">00007ff9a59f007F instruction_3<\/span><span class=\"pl-c\">  ; start of loop<\/span>\r\n<span class=\"pl-en\">00007ff9a59f0083 instruction_4<\/span>\r\n<span class=\"pl-en\">00007ff9a59f0087 instruction_5<\/span>\r\n<span class=\"pl-en\">00007ff9a59f0089 <\/span><span class=\"pl-k\">jmp<\/span><span class=\"pl-en\"> IG05<\/span><\/pre>\n<\/div>\n<p style=\"text-align: justify;\">With that, we can precisely allocate memory for generated code such that the difference between allocated and actual code size is zero. In the long term, we want to address the problem of over-estimation so that the instruction size is precisely known during code generation and matches the actual size when the instruction is emitted.<\/p>\n<h2><a id=\"user-content-impact\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#impact\" aria-hidden=\"true\"><\/a>Impact<\/h2>\n<p style=\"text-align: justify;\">Finally, let&#8217;s talk about the impact of this work. 
While I have done\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/43227\">lots<\/a>\u00a0and\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/44051\">lots<\/a>\u00a0of analysis to understand the impact of loop alignment on our various benchmarks, I would like to highlight two graphs that demonstrate both the increased stability and the improved performance due to loop alignment.<\/p>\n<p style=\"text-align: justify;\">In the performance graph of\u00a0<a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/master\/src\/benchmarks\/micro\/runtime\/Benchstones\/BenchI\/BubbleSort2.cs\">Bubble sort<\/a>\u00a0below, data point 1 represents the point where we started aligning methods at a\u00a0<code>32B<\/code>\u00a0boundary. Data point 2 represents the point where we started aligning the inner loops that I described above. As you can see, the instability has been reduced by a large margin, and we also gained performance.<\/p>\n<p>&nbsp;<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-32457 size-full\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/stable-bubble.png\" alt=\"Stable bubble sort\" width=\"856\" height=\"306\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/stable-bubble.png 856w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/stable-bubble-300x107.png 300w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/stable-bubble-768x275.png 768w\" sizes=\"(max-width: 856px) 100vw, 856px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: left;\">Below is another graph, of the &#8220;LoopReturn&#8221; benchmark<sup>2<\/sup>\u00a0run on an Ubuntu x64 box, where we see a similar trend.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-32458 size-full\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/ubuntu_loopreturn.png\" alt=\"Ubuntu loop 
return\" width=\"810\" height=\"292\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/ubuntu_loopreturn.png 810w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/ubuntu_loopreturn-300x108.png 300w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/ubuntu_loopreturn-768x277.png 768w\" sizes=\"(max-width: 810px) 100vw, 810px\" \/><\/p>\n<p style=\"text-align: justify;\">Below is the graph that shows the comparison of various algorithms that we tried to understand the loop alignment impact across benchmarks. In this graph, the X-axis represents all the <a href=\"https:\/\/github.com\/dotnet\/performance\/tree\/master\/src\/benchmarks\/micro\">microbenchmarks<\/a> sorted by the impact they have because of loop alignment. The Y-axis represents log10 scale of <code>before \/ after<\/code> ratio, before being without loop alignment and after being with the loop alignment. Since the benchmark measurements are in <code>nanoseconds<\/code>, higher the ratio, more performant the benchmarks became with the loop alignment. 
<code>32B<\/code>\u00a0and\u00a0<code>16B<\/code>\u00a0represent the non-adaptive techniques, while\u00a0<code>32BAdaptive<\/code>\u00a0represents the\u00a0<code>32B<\/code> adaptive technique.<\/p>\n<p><img decoding=\"async\" class=\"aligncenter wp-image-32452 size-full\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/bench-compare.png\" alt=\"Bench comparison, Loop alignment in .NET 6\" width=\"859\" height=\"492\" srcset=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/bench-compare.png 859w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/bench-compare-300x172.png 300w, https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2021\/03\/bench-compare-768x440.png 768w\" sizes=\"(max-width: 859px) 100vw, 859px\" \/><\/p>\n<p>32B adaptive starts improving sooner, after 171 benchmarks, compared to the next best approach, 32B non-adaptive, which gains performance only after 241 benchmarks. We get the maximum performance benefit sooner with the 32B adaptive approach.<\/p>\n<h2><a id=\"user-content-edge-cases\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#edge-cases\" aria-hidden=\"true\"><\/a>Edge cases<\/h2>\n<p style=\"text-align: justify;\">While implementing the loop alignment feature, I came across several edge cases that are worth mentioning. We identify that a loop needs alignment by setting a flag on the first basic block that is part of the loop. During later phases, if the loop gets unrolled, we need to make sure that we remove the alignment flag from that block, because it no longer represents a loop. 
Likewise, for other scenarios like loop cloning or eliminating bogus loops, we had to make sure that we updated the alignment flag appropriately.<\/p>\n<h2><a id=\"user-content-future-work\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#future-work\" aria-hidden=\"true\"><\/a>Future work<\/h2>\n<p style=\"text-align: justify;\">One piece of\u00a0<a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/43227\">our planned future work<\/a>\u00a0is to add the &#8220;Padding placement&#8221; in blind spots as I described above. Additionally, we should not restrict alignment to inner loops, but should also align outer loops whose relative weight is higher than that of the inner loop. In the example below, the\u00a0<code>i-loop<\/code>\u00a0executes 1000 times, while the\u00a0<code>j-loop<\/code>\u00a0executes just 2 times on every iteration of the outer loop. If we pad the\u00a0<code>j-loop<\/code>, we will end up making the padding instructions execute 1000 times, which can be expensive. 
A better approach would be to instead pad and align the\u00a0<code>i-loop<\/code>.<\/p>\n<div class=\"highlight highlight-source-cs\">\n<pre><span class=\"pl-k\">for<\/span> (<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>; <span class=\"pl-smi\">i<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-c1\">1000<\/span>; <span class=\"pl-smi\">i<\/span><span class=\"pl-k\">++<\/span>) {\r\n    <span class=\"pl-k\">for<\/span> (<span class=\"pl-k\">int<\/span> <span class=\"pl-smi\">j<\/span> <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">0<\/span>; <span class=\"pl-smi\">j<\/span> <span class=\"pl-k\">&lt;<\/span> <span class=\"pl-c1\">2<\/span>; <span class=\"pl-smi\">j<\/span><span class=\"pl-k\">++<\/span>) {\r\n        <span class=\"pl-c\">\/\/ body<\/span>\r\n    }\r\n}<\/pre>\n<\/div>\n<p>Lastly, loop alignment is currently enabled only for the\u00a0<code>x86<\/code>\u00a0and\u00a0<code>x64<\/code>\u00a0architectures, but we would like to take it forward and support the\u00a0<code>Arm32<\/code>\u00a0and\u00a0<code>Arm64<\/code>\u00a0architectures as well.<\/p>\n<h2><a id=\"user-content-loop-alignment-in-other-compilers\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#loop-alignment-in-other-compilers\" aria-hidden=\"true\"><\/a>Loop alignment in other compilers<\/h2>\n<p style=\"text-align: justify;\">For native or ahead-of-time compilers, it is hard to predict which loops will need alignment, because the target address where a loop will be placed is known only at runtime and not during ahead-of-time compilation. 
However, certain native compilers at least give the user an option to specify the alignment.<\/p>\n<h3><a id=\"user-content-gcc\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#gcc\" aria-hidden=\"true\"><\/a>GCC<\/h3>\n<p style=\"text-align: justify;\">GCC provides the\u00a0<code>-falign-functions<\/code>\u00a0attribute that the user can add on top of a function. More documentation can be seen\u00a0<a href=\"https:\/\/gcc.gnu.org\/onlinedocs\/gcc\/Common-Function-Attributes.html#Common-Function-Attributes\" rel=\"nofollow\">on the gcc documentation page<\/a>\u00a0under the &#8220;aligned&#8221; section. This will align the first instruction of every function at the specified boundary. It also provides\u00a0<a href=\"https:\/\/gcc.gnu.org\/onlinedocs\/gcc\/Optimize-Options.html\" rel=\"nofollow\">options<\/a>\u00a0<code>-falign-loops<\/code>,\u00a0<code>-falign-labels<\/code>\u00a0and\u00a0<code>-falign-jumps<\/code>\u00a0that will align all loops, labels, or jumps in the entire code being compiled. I did not inspect the GCC code, but looking at these options, it has several limitations. First, the padding amount is fixed and can be anywhere between 0 and (N \u2013 1) bytes. 
Second, the alignment will happen for the entire code base and cannot be restricted to a subset of files, methods, loops, or hot regions.<\/p>\n<h3><a id=\"user-content-llvm\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#llvm\" aria-hidden=\"true\"><\/a>LLVM<\/h3>\n<p style=\"text-align: justify;\">As with GCC, dynamic alignment at runtime is not possible, so LLVM too exposes alignment options to the user.\u00a0<a href=\"https:\/\/easyperf.net\/blog\/2018\/01\/25\/Code_alignment_options_in_llvm\" rel=\"nofollow\">This blog<\/a>\u00a0gives a good overview of the various options available. One of the options it gives is\u00a0<code>align-all-nofallthru-blocks<\/code>, which will not add padding instructions if the previous block can reach the current block by falling through, because that would mean adding NOPs in the execution path. Instead, it tries to add the padding at blocks that end with unconditional jumps. This is similar to what I described above under &#8220;Padding placement&#8221;.<\/p>\n<h2><a id=\"user-content-conclusion\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#conclusion\" aria-hidden=\"true\"><\/a>Conclusion<\/h2>\n<p style=\"text-align: justify;\">Code alignment is a complicated mechanism to implement in a compiler, and it is even harder to make sure that it optimizes the performance of user code. We started with a simple problem statement and expectation, but during the implementation, we had to conduct various experiments to ensure that we cover the maximum possible cases where alignment would benefit. We also had to ensure that alignment does not adversely affect performance, and we devised mechanisms to minimize such surface areas. 
I owe a big thanks to\u00a0<a href=\"https:\/\/github.com\/andyayersms\">Andy Ayers<\/a>\u00a0who provided me guidance and suggested some great ideas during the implementation of loop alignment.<\/p>\n<h2><a id=\"user-content-references\" class=\"anchor\" href=\"https:\/\/github.com\/microsoft\/dotnet-blog\/blob\/725edfb73c37b7da39498dd826fc71ce1bc1c430\/2021\/03-Mar\/loop-alignment\/loopalign.md#references\" aria-hidden=\"true\"><\/a>References<\/h2>\n<ol>\n<li>BubbleSort2 benchmark is part of .NET\u2019s micro-benchmarks suite and the source code is\u00a0<a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/master\/src\/benchmarks\/micro\/runtime\/Benchstones\/BenchI\/BubbleSort2.cs\">in dotnet\/performance repository<\/a>. Results taken in .NET perf lab can be seen on\u00a0<a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fBenchstone.BenchI.BubbleSort2.Test.html\" rel=\"nofollow\">BubbleSort2 result page<\/a>.<\/li>\n<li>LoopReturn benchmark is part of .NET\u2019s micro-benchmarks suite and the source code is\u00a0<a href=\"https:\/\/github.com\/dotnet\/performance\/blob\/4e14ecc54759224121b807c70e46fb1d9ee0e7d6\/src\/benchmarks\/micro\/runtime\/Layout\/SearchLoops.cs#L30\">in dotnet\/performance repository<\/a>. 
Results taken in .NET perf lab can be seen on\u00a0<a href=\"https:\/\/pvscmdupload.blob.core.windows.net\/reports\/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fLayout.SearchLoops.LoopReturn.html\" rel=\"nofollow\">LoopReturn result page<\/a>.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Loop alignment improvements in .NET 6<\/p>\n","protected":false},"author":38211,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[685,196,3012,756,3009],"tags":[4,9,108,121],"class_list":["post-32451","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-dotnet","category-dotnet-core","category-internals","category-csharp","category-performance","tag-net","tag-net-core","tag-performance","tag-ryujit"],"acf":[],"blog_post_summary":"<p>Loop alignment improvements in .NET 6<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/32451","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/38211"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=32451"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/32451\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=32451"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categori
es?post=32451"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=32451"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}