{"id":91151,"date":"2015-08-07T07:00:00","date_gmt":"2015-08-07T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20150807-00\/?p=91151\/"},"modified":"2019-03-13T12:18:14","modified_gmt":"2019-03-13T19:18:14","slug":"20150807-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20150807-00\/?p=91151","title":{"rendered":"The Itanium processor, part 10: Register rotation"},"content":{"rendered":"<p><!-- backref: The Itanium processor, part 9: Counted loops and loop pipelining -->Last time<\/a>, we looked at counted loops and then improved a simple loop by explicitly pipelining the loop iterations. This time, we&#8217;re going to take the pipelining to the next level. <!--more--><\/p>\n<p>Let&#8217;s reorder the columns of the chart we had last time so the instructions are grouped not by the register being operated upon but by the operation being performed. Since no instructions within an instruction group are dependent on each other in our example, I can reorder them without affecting the logic. 
<\/p>\n<table BORDER=\"0\" CELLPADDING=\"0\" STYLE=\"border-collapse: collapse\">\n<tr>\n<td ALIGN=\"right\">1<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\">&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"width: 8em\">&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td>&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td>&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">2<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r33 = [r29], 4<\/code><\/td>\n<td STYLE=\"border-top: 1px dotted black\">&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">3<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r34 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>adds r32 = r32, 1<\/code><\/td>\n<td>&nbsp;<\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">4<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r35 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r33 = r33, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>st4 [r28] = r32, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">5<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r33, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">6<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r33 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds 
r35 = r35, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r34, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">7<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top\"><code>ld4 r34 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r32 = r32, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">8<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r35 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r33 = r33, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r32, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">9<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r33, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">10<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r33 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r35 = r35, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r34, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">11<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\">&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r32 = r32, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">12<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r33 = r33, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r32, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<tr>\n<td 
ALIGN=\"right\">13<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-bottom: 1px dotted black\"><code>st4 [r28] = r33, 4<\/code><\/td>\n<td><code>;;<\/code><\/td>\n<\/tr>\n<\/table>\n<p>What an interesting pattern. Each column represents a functional unit, and at each cycle, the unit operates on a different register in a clear pattern: <var>r32<\/var>, <var>r33<\/var>, <var>r34<\/var>, <var>r35<\/var>, then back to <var>r32<\/var>. The units are staggered so that each operates on a register precisely when its result from the previous unit is ready. <\/p>\n<p>Suppose you have to make 2000 sandwiches and you have four employees. You could arrange your sandwich factory with three stations. At the first station, you have the bread and the toaster. At the second station, you have the protein. At the third station, you have the toppings. Each employee goes through the stations in order: First they take two pieces of bread and put them in the toaster. When the toast is finished, they add the protein, then they add the toppings, and then they put the finished sandwich in the box. Once that&#8217;s done, they go back to the first station. You stagger the starts of the four employees so that at any moment, one is preparing the bread, one is waiting for the toaster, one is adding protein, and one is adding the toppings. <\/p>\n<p>That is how the original code was arranged. Each register is an employee that is at one of the four stages of sandwich construction. <\/p>\n<p>But another way to organize your sandwich factory is as an assembly line. You put one employee in charge of the bread, one in charge of the toaster, one in charge of the protein, and one in charge of the toppings. When a sandwich completes a stage in the process, it gets handed from one employee to the next. 
<\/p>\n<p>(And since there isn&#8217;t really anything for the toaster-boy to do, you can eliminate that position and create the same number of sandwiches per second with only three employees. The original version had each employee sitting idle waiting for the toaster 25% of the time. Switching to the assembly line model allowed us to squeeze out that idle time.) <\/p>\n<p>Let&#8217;s apply the assembly line model to our code. Handing a sandwich from one person to the next is done by moving the value from one register to the next. Let&#8217;s imagine that there is a <var>slide<\/var> instruction that you can put at the end of an instruction group, which copies <var>r32<\/var> to <var>r33<\/var>, <var>r33<\/var> to <var>r34<\/var>, and so on. <\/p>\n<table BORDER=\"0\" CELLPADDING=\"0\" STYLE=\"border-collapse: collapse\">\n<tr>\n<td ALIGN=\"right\">1<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\">&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"width: 8em\">&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td>&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td>&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">2<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td STYLE=\"border-top: 1px dotted black\">&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">3<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>adds r34 = r34, 1<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 
1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">4<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">5<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">6<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">7<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">8<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">9<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 
1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">10<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">11<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\">&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">12<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">13<\/td>\n<td>&nbsp;<\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-bottom: 1px dotted black\"><code>st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<\/table>\n<p>During the execution of the first instruction group, the first value is loaded into <var>r32<\/var>, and the <var>slide<\/var> instruction slides it into <var>r33<\/var>. <\/p>\n<p>At the second instruction group, the second value is loaded into <var>r32<\/var>, and the first value sits unchanged in <var>r33<\/var>. (Technically, the value is waiting to be loaded into <var>r33<\/var>.) The <var>slide<\/var> instruction slides the second value into <var>r33<\/var> and the first value into <var>r34<\/var>. 
<\/p>\n<p>At the third instruction group, the third value is loaded into <var>r32<\/var>, and the first value (now in <var>r34<\/var>) is incremented. Then the <var>slide<\/var> instruction slides the third value into <var>r33<\/var>, the second value into <var>r34<\/var>, and the first value into <var>r35<\/var>. <\/p>\n<p>At the fourth instruction group, the fourth value is loaded into <var>r32<\/var>, the second value (now in <var>r34<\/var>) is incremented, and the first value (now in <var>r35<\/var>) is stored to memory. Then the <var>slide<\/var> instruction slides the fourth value into <var>r33<\/var>, the third value into <var>r34<\/var>, and the second value into <var>r35<\/var>. (The first value slides into <var>r36<\/var>, but we don&#8217;t really care.) <\/p>\n<p>And so on. At each instruction group, a fresh value is loaded into <var>r32<\/var>, a previously-loaded value is incremented in <var>r34<\/var>, and the incremented value is stored from <var>r35<\/var>. And then the <var>slide<\/var> instruction moves everything down one step for the next instruction group. <\/p>\n<p>When we reach the 11th instruction group, we drain out the last values and don&#8217;t bother starting up any new ones. <\/p>\n<p>Observe that the above code also falls into a <i>prologue<\/i>\/<i>kernel<\/i>\/<i>epilogue<\/i> pattern. In the prologue, the assembly line starts up and gradually fills the registers with work. In the kernel, the assembly line is completely busy. And in the epilogue, the work of the final registers drains out. <\/p>\n<p>You can already see how <var>br.cloop<\/var> would come in handy here: The kernel can be written as a loop consisting of a single instruction group! But wait, there&#8217;s more. <\/p>\n<p>Let&#8217;s add some predicate registers to the mix. Let&#8217;s suppose that the <code>slide<\/code> instruction slides not only integer registers but also predicate registers. 
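<\/p>\n<p>Before the predicates join in, the <var>slide<\/var>-only assembly line from the previous table can be sketched as a small Python model. This is only an illustration under stated assumptions: <var>slide<\/var> is the imaginary instruction from this article, and the function name <code>run_slide_pipeline<\/code> and the explicit per-group on\/off schedule are made up for the sketch. <\/p>\n
```python
# Toy model of the slide-only pipeline (no predicates yet).
# 'slide' is the imaginary instruction from the article; the function
# name and the explicit schedule below are assumptions of this sketch.

def run_slide_pipeline(memory):
    n = len(memory)
    regs = {32: None, 33: None, 34: None, 35: None, 36: None}
    stored = []                           # stands in for memory at [r28]
    for group in range(1, n + 4):         # n loads plus 3 groups to drain
        if group in range(4, n + 4):      # st4 [r28] = r35, 4
            stored.append(regs[35])
        if group in range(3, n + 3):      # adds r34 = r34, 1
            regs[34] = regs[34] + 1
        if group in range(1, n + 1):      # ld4 r32 = [r29], 4
            regs[32] = memory[group - 1]
        for r in (36, 35, 34, 33):        # slide: r35 to r36, ..., r32 to r33
            regs[r] = regs[r - 1]
    return stored
```
<p>For ten input values this model runs 13 instruction groups and stores each value plus one, in order. The hand-written on\/off schedule in the sketch is the part that the predicates in the next table take over. 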
<\/p>\n<table BORDER=\"0\" CELLPADDING=\"0\" STYLE=\"border-collapse: collapse\">\n<tr>\n<td ALIGN=\"right\">1<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\">&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"width: 8em\">&nbsp;<\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex;color: red\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex;color: red\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td ROWSPAN=\"13\" STYLE=\"border-right: 1px solid black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">2<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td STYLE=\"border-top: 1px dotted black\">&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code><font COLOR=\"red\">(p18) adds r34 = r34, 1<\/font><\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;color: red\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">3<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;color: red\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">4<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td 
STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">5<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">6<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">7<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">8<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">9<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">10<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 
0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">11<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black;color: red\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">12<\/td>\n<td STYLE=\"padding: 0ex 1ex;color: red\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black\"><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<tr>\n<td ALIGN=\"right\">13<\/td>\n<td STYLE=\"padding: 0ex 1ex;color: red\"><code>(p16) ld4 r32 = [r29], 4<\/code><\/td>\n<td>&nbsp;<\/td>\n<td STYLE=\"padding: 0ex 1ex;border-top: 1px dotted black;color: red\"><code>(p18) adds r34 = r34, 1<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex;border-bottom: 1px dotted black\"><code>(p19) st4 [r28] = r35, 4<\/code><\/td>\n<td STYLE=\"padding: 0ex 1ex\"><code>slide ;;<\/code><\/td>\n<\/tr>\n<\/table>\n<p>The instructions shown in red are the ones we want a false predicate to turn off. We can initially set <var>p16 = true<\/var>, <var>p17 = p18 = p19 = false<\/var>. That way, only the load executes in the first instruction group. And then the <var>slide<\/var> instruction slides both the integer registers and the predicate registers, which causes <var>p17<\/var> to become <var>true<\/var>. <\/p>\n<p>In the second instruction group, again, only the load executes. And then the <var>slide<\/var> instruction slides <var>p17<\/var> into <var>p18<\/var>, so now <var>p18 = true<\/var> also. 
<\/p>\n<p>Since <var>p18 = true<\/var>, the third instruction group both loads and increments. And then the <var>slide<\/var> instruction slides <var>p18<\/var> into <var>p19<\/var>, so now all of the predicates are true. <\/p>\n<p>With all the predicates true, every step in instruction groups four through 10 executes. <\/p>\n<p>Now, with instruction group 11, we want to slide the predicates, but also set <var>p16<\/var> to <var>false<\/var>. That turns off the <var>ld4<\/var> instruction. <\/p>\n<p>The <var>p16 = false<\/var> then slides into <var>p17<\/var> for instruction group 12, then into <var>p18<\/var> for instruction group 13, which turns off the increment instruction. <\/p>\n<p>If we can get the <var>slide<\/var> instruction to slide the predicates and set <var>p16<\/var> to <var>true<\/var> for the first 10 instruction groups, and set it to <var>false<\/var> for the last three, then we can simply execute the same instruction group 13 times! <\/p>\n<p>Okay, now I can reveal the true identity of the <var>slide<\/var> instruction: It&#8217;s called <var>br.ctop<\/var>. <\/p>\n<p>The <var>br.ctop<\/var> instruction works like this: <\/p>\n<pre>\n if (ar.lc != 0) { slide; p16 = true; ar.lc = ar.lc - 1; goto branch; }\n if (ar.ec != 0) { slide; p16 = false; ar.ec = ar.ec - 1;\n                  if (ar.ec != 0) goto branch; }\n else { \/* unimportant *\/ }\n<\/pre>\n<p>In words, the <var>br.ctop<\/var> instruction first checks the <var>ar.lc<\/var> register (<i>loop counter<\/i>). If it is nonzero, then the registers slide over, the <var>p16<\/var> register is set to <var>true<\/var>, the loop counter is decremented, and the jump is taken. <\/p>\n<p>If <var>ar.lc<\/var> is zero, then the <var>br.ctop<\/var> instruction checks the <var>ar.ec<\/var> register (<i>epilogue counter<\/i>). If it is nonzero, then the registers slide over, the <var>p16<\/var> register is set to <var>false<\/var>, and the epilogue counter is decremented. 
If the decremented value of the epilogue counter is nonzero, then the jump is taken; otherwise we fall through and the loop ends. <\/p>\n<p>(If both <var>ar.lc<\/var> and <var>ar.ec<\/var> are zero, then the loop is finished before it even started. Some weird edge-case handling happens here which is not important to the discussion.) <\/p>\n<p>Code that takes advantage of the <var>br.ctop<\/var> instruction goes like this: <\/p>\n<pre>\n      alloc r36 = ar.pfs, 0, 8, 0, 4 \/\/ four rotating registers!\n      mov r37 = ar.lc         \/\/ preserve lc\n      mov r38 = ar.ec         \/\/ preserve ec\n      mov r39 = preds         \/\/ preserve predicates\n\n      addl r31 = r0, 1999 ;;  \/\/ r31 = 1999\n      mov ar.lc = r31         \/\/ ar.lc = 1999\n      mov ar.ec = 4\n      mov pr.rot = 1 &lt;&lt; 16 \/\/ p16 = true, all others false\n      addl r29 = gp, -205584  \/\/ calculate start of array\n      addl r28 = r29, 0 ;;     \/\/ put it in both r28 and r29\n\nagain:\n(p16) ld4 r32 = [r29], 4      \/\/ execute an entire loop with\n(p18) adds r34 = r34, 1       \/\/ a single instruction group\n(p19) st4 [r28] = r35, 4      \/\/ using this one weird trick\n      br.ctop again ;;\n\n      mov ar.lc = r37         \/\/ restore registers we preserved\n      mov ar.ec = r38\n      mov preds = r39\n      mov ar.pfs = r36\n      br.ret.sptk.many rp     \/\/ return\n<\/pre>\n<p>We are now using the last parameter to the <var>alloc<\/var> instruction. The <var>4<\/var> says that we want four rotating registers. The <var>ar.lc<\/var> and <var>ar.ec<\/var> registers must be preserved across calls, so we save them here for restoration at the end. Predicate registers <var>p16<\/var> through <var>p63<\/var> must also be preserved, so we save all the predicate registers by using the <var>preds<\/var> pseudo-register, which grabs all 64 predicates into a single 64-bit value. 
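<\/p>\n<p>To sanity-check the counter setup in the code above, here is a rough Python model of the loop, following the <var>br.ctop<\/var> pseudocode given earlier. It is a sketch, not a description of the hardware: the name <code>run_ctop_loop<\/code> and its variables are invented for illustration, only the four registers and four predicates the loop touches are modelled, and the zero-iteration edge case is skipped. <\/p>\n
```python
# Rough model of the br.ctop loop. Names are assumptions of this sketch;
# only the four registers and predicates used by the loop are modelled.

def run_ctop_loop(memory):
    if not memory:
        return [], 0                      # zero-trip edge case not modelled
    lc = len(memory) - 1                  # ar.lc: trips after the first one
    ec = 4                                # ar.ec: trips needed to drain
    regs = [None] * 4                     # r32..r35 as indexes 0..3
    preds = [True, False, False, False]   # p16..p19 as indexes 0..3
    next_load = 0                         # plays the role of r29
    stored = []                           # plays the role of memory at [r28]
    trips = 0
    while True:
        trips += 1
        if preds[3]:                      # (p19) st4 [r28] = r35, 4
            stored.append(regs[3])
        if preds[2]:                      # (p18) adds r34 = r34, 1
            regs[2] = regs[2] + 1
        if preds[0]:                      # (p16) ld4 r32 = [r29], 4
            regs[0] = memory[next_load]
            next_load += 1
        # br.ctop, following the pseudocode above (ec is never 0 here)
        if lc != 0:
            lc -= 1
            new_p16, taken = True, True
        else:
            ec -= 1
            new_p16, taken = False, ec != 0
        regs = [regs[3]] + regs[:3]       # slide the integer registers
        preds = [new_p16] + preds[:3]     # slide the predicates, set p16
        if not taken:
            return stored, trips
```
<p>Under this model, ten elements take 13 trips through the loop body, and 2000 elements take 2003, which is the initial <var>ar.lc<\/var> plus <var>ar.ec<\/var>. 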
<\/p>\n<p>Next, we set up the loop by setting the loop counter to the number of additional times we want to execute the loop (not counting the one execution we get via fall-through), the epilogue counter to the number of steps we need in order to drain the final iterations, and set the predicates so that <var>p16 = true<\/var> and everything else is <var>false<\/var>. We also set up <var>r28<\/var> and <var>r29<\/var> to step through the array. <\/p>\n<p>Once that is done, we can execute the entire loop in a single instruction group. <\/p>\n<p>And then we clean up after the loop by restoring all the registers to how we found them, then return. <\/p>\n<p>And there you have register rotation. It lets you compress the prologue, kernel, and epilogue of a pipelined loop into a single instruction group. <\/p>\n<p>I pulled a fast one here: The Itanium requires that the number of rotating registers be a multiple of eight. So our code really should look like this: <\/p>\n<pre>\n      alloc r40 = ar.pfs, 0, <font COLOR=\"blue\">12<\/font>, 0, <font COLOR=\"blue\">8<\/font>\n      mov <font COLOR=\"blue\">r41<\/font> = ar.lc         \/\/ preserve lc\n      mov <font COLOR=\"blue\">r42<\/font> = ar.ec         \/\/ preserve ec\n      mov <font COLOR=\"blue\">r43<\/font> = preds         \/\/ preserve predicates\n\n      addl r31 = r0, 1999 ;;  \/\/ r31 = 1999\n      mov ar.lc = r31         \/\/ ar.lc = 1999\n      mov ar.ec = 4\n      mov pr.rot = 1 &lt;&lt; 16 \/\/ p16 = true, all others false\n      addl r29 = gp, -205584  \/\/ calculate start of array\n      addl r28 = r29, 0 ;;     \/\/ put it in both r28 and r29\n\nagain:\n(p16) ld4 r32 = [r29], 4      \/\/ execute an entire loop with\n(p18) adds r34 = r34, 1       \/\/ a single instruction group\n(p19) st4 [r28] = r35, 4      \/\/ using this one weird trick\n      br.ctop again ;;\n\n      mov ar.lc = <font COLOR=\"blue\">r41<\/font>         \/\/ restore registers we preserved\n      mov ar.ec = <font 
COLOR=\"blue\">r42<\/font>\n      mov preds = <font COLOR=\"blue\">r43<\/font>\n      mov ar.pfs = <font COLOR=\"blue\">r40<\/font>\n      br.ret.sptk.many rp     \/\/ return\n<\/pre>\n<p>Instead of four rotating registers, we use eight. The underlying analysis remains the same. We are just throwing more registers into the pot. <\/p>\n<p>Now, the loop we were studying happens to be very simple, with only one load and one store. For more complex loops, you may need to use things like <!-- backref: The Itanium processor, part 6: Calculating conditionals -->the unconditional comparison, or pipelining the iterations with a stagger of more than one cycle. <\/p>\n<p>There are other types of instructions for managing loops with register rotation. For example, <var>br.cexit<\/var> is like <var>br.ctop<\/var> except that it jumps when <var>br.ctop<\/var> falls through and vice versa. This is handy to put at the start of your pipelined loop to handle the case where the number of iterations is zero. There are also <var>br.wtop<\/var> and <var>br.wexit<\/var> instructions to handle <code>while<\/code> loops instead of counted loops. The basic idea is the same, so I won&#8217;t go into the details. You can read the Itanium manual to learn more. <\/p>\n<p>That&#8217;s the end of the whirlwind tour of the Itanium architecture. There are still parts left unexplored, but I tried to hit the most interesting sights. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Around and around.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-91151","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>Around and around.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/91151","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=91151"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/91151\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=91151"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=91151"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=91151"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}