{"id":91171,"date":"2015-08-05T07:00:00","date_gmt":"2015-08-05T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20150805-00\/?p=91171\/"},"modified":"2019-03-13T12:18:07","modified_gmt":"2019-03-13T19:18:07","slug":"20150805-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20150805-00\/?p=91171","title":{"rendered":"The Itanium processor, part 8: Advanced loads"},"content":{"rendered":"<p>Today we&#8217;ll look at advanced loads, which is when you load a value before you&#8217;re supposed to, in the hope that the value won&#8217;t change in the meantime. <!--more--><\/p>\n<p>Consider the following code: <\/p>\n<pre>\nint32_t SomeClass::tryGetValue(int32_t *value)\n{\n if (!m_errno) {\n  *value = m_value;\n  m_readCount++;\n }\n return m_errno;\n}\n<\/pre>\n<p>Let&#8217;s say that <code>Some&shy;Class<\/code> has <code>m_value<\/code> at offset zero, <code>m_readCount<\/code> at offset 4, and <code>m_errno<\/code> at offset 8. 
<\/p>\n<p>The na&iuml;ve way of compiling this function would go something like this: <\/p>\n<pre>\n        \/\/ we are a leaf function, so no need to use \"alloc\" or to save rp.\n        \/\/ on entry: r32 = this, r33 = value\n\n        addl    r30 = 08h, r32          \/\/ calculate &amp;m_errno\n        addl    r29 = 04h, r32 ;;       \/\/ calculate &amp;m_readCount\n\n        ld4     ret0 = [r30] ;;         \/\/ load m_errno\n\n        cmp4.eq p6, p7 = ret0, r0       \/\/ p6 = m_errno == 0, p7 = !p6\n\n(p7)    br.ret.sptk.many rp             \/\/ return m_errno if there was an error&sup1;\n\n        ld4     r31 = [r32] ;;          \/\/ load m_value (at offset 0)\n        st4     [r33] = r31 ;;          \/\/ store m_value to *value\n\n        ld4     r28 = [r29] ;;          \/\/ load m_readCount\n        addl    r28 = 01h, r28 ;;       \/\/ calculate m_readCount + 1\n        st4     [r29] = r28 ;;          \/\/ store updated m_readCount\n\n        ld4     ret0 = [r30]            \/\/ reload m_errno for return value\n\n        br.ret.sptk.many rp             \/\/ return\n<\/pre>\n<p>First, we calculate the addresses of our member variables. Then we load <code>m_errno<\/code>, and if there is an error, then we return it immediately. Otherwise, we copy the current value to <code>*value<\/code>, load <code>m_readCount<\/code>, increment it, and finally, we return <code>m_errno<\/code>. <\/p>\n<p>The problem here is that we have a deep dependency chain. 
<\/p>\n<table BORDER=\"0\" CELLPADDING=\"0\" CELLSPACING=\"0\" STYLE=\"text-align: center;margin-left: 1px\">\n<tr>\n<td><\/td>\n<td STYLE=\"width: 2em\"><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">addl r30 = 08h, r32<\/td>\n<td STYLE=\"width: 3em\"><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">ld4 ret0 = [r30]<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: em;height: 2em\">cmp4.eq p6, p7 = ret0, r0<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>&#x2199;&#xfe0e;<\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">(p7) br.ret.sptk.many rp<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">ld4 r31 = [r32]<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">st4 [r33] = r31<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">addl r29 = 04h, r32<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"2\" ALIGN=\"right\">non-obvious dependency<\/td>\n<td>&darr;<\/td>\n<td>&#x2199;&#xfe0e;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">ld4 r28 = [r29]<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">addl r28 = 01h, r28<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">st4 [r29] = r28<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"2\" ALIGN=\"right\">non-obvious 
dependency<\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">ld4 ret0 = [r30]<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">br.ret.sptk.many rp<\/td>\n<\/tr>\n<\/table>\n<p>Pretty much every instruction depends on the result of the previous instruction. Some of these dependencies are obvious. You have to calculate the address of a member variable before you can read it, and you have to get the result of a memory access before you can perform arithmetic on it. Some of the dependencies are not obvious. For example, we cannot access <code>m_value<\/code> or <code>m_readCount<\/code> until after we confirm that <code>m_errno<\/code> is zero, to avoid a potential access violation if the object straddles a page boundary with <code>m_errno<\/code> on one page and <code>m_value<\/code> on the other (invalid) page. (We saw last time how this can be solved with speculative loads, but let&#8217;s not add that to the mix yet.) <\/p>\n<p>Returning <code>m_errno<\/code> is a non-obvious dependency. We&#8217;ll see why later. For now, note that the return value came from a memory access, which means that if the caller of the function tries to use the return value, it may stall waiting for the result to arrive from the memory controller. <\/p>\n<p>When you issue a read on Itanium, the processor merely initiates the operation and proceeds to the next instruction before the read completes. If you try to use the result of the read too soon, the processor stalls until the value is received from the memory controller. Therefore, you want to put as much distance as possible between the load of a value from memory and the attempt to use the result. <\/p>\n<p>Let&#8217;s see what we can do to parallelize this function. 
We&#8217;ll perform the increment of <code>m_readCount<\/code> and the fetch of <code>m_value<\/code> simultaneously. <\/p>\n<pre>\n        \/\/ we are a leaf function, so no need to use \"alloc\" or to save rp.\n        \/\/ on entry: r32 = this, r33 = value\n\n        addl    r30 = 08h, r32          \/\/ calculate &amp;m_errno\n        addl    r29 = 04h, r32 ;;       \/\/ calculate &amp;m_readCount\n\n        ld4     ret0 = [r30] ;;         \/\/ load m_errno\n\n        cmp4.eq p6, p7 = ret0, r0       \/\/ p6 = m_errno == 0, p7 = !p6\n\n(p7)    br.ret.sptk.many rp             \/\/ return m_errno if there was an error\n\n        ld4     r31 = [r32]             \/\/ load m_value (at offset 0)\n        ld4     r28 = [r29] ;;          \/\/ preload m_readCount\n\n        addl    r28 = 01h, r28          \/\/ calculate m_readCount + 1\n        st4     [r33] = r31 ;;          \/\/ store m_value to *value\n\n        st4     [r29] = r28             \/\/ store updated m_readCount\n\n        br.ret.sptk.many rp             \/\/ return (answer already in ret0)\n<\/pre>\n<p>We&#8217;ve basically rewritten the function as <\/p>\n<pre>\nint32_t SomeClass::tryGetValue(int32_t *value)\n{\n int32_t local_errno = m_errno;\n if (!local_errno) {\n  int32_t local_readCount = m_readCount;\n  int32_t local_value = m_value;\n  local_readCount = local_readCount + 1;\n  *value = local_value;\n  m_readCount = local_readCount;\n }\n return local_errno;\n}\n<\/pre>\n<p>This time we loaded the return value from <code>m_errno<\/code> long before the function ends, so when the caller tries to use the return value, it will definitely be ready and not incur a memory stall. (If a stall were needed, it would have occurred at the <code>cmp4<\/code>.) And we&#8217;ve also shortened the dependency chain significantly in the second half of the function. 
<\/p>\n<table BORDER=\"0\" CELLPADDING=\"0\" CELLSPACING=\"0\" STYLE=\"text-align: center;margin-left: 1px\">\n<tr>\n<td STYLE=\"width: 11em\"><\/td>\n<td STYLE=\"width: 2em\"><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">addl r30 = 08h, r32<\/td>\n<td STYLE=\"width: 3em\"><\/td>\n<td STYLE=\"width: 11em\"><\/td>\n<td STYLE=\"width: 3em\"><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">ld4 ret0 = [r30]<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">cmp4.eq p6, p7 = ret0, r0<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">addl r29 = 04h, r32<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td>&#x2199;&#xfe0e;<\/td>\n<td>&darr;<\/td>\n<td>&#x2198;&#xfe0e;<\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td STYLE=\"border: solid black 1px;height: 2em\">(p7) br.ret.sptk.many rp<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">ld4 r31 = [r32]<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">ld4 r28 = [r29]<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">st4 [r33] = r31<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">addl r28 = 01h, r28<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">st4 [r29] = r28<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td COLSPAN=\"3\" STYLE=\"border: solid black 1px;height: 
2em\">br.ret.sptk.many rp<\/td>\n<\/tr>\n<\/table>\n<p>This works great until somebody does this: <\/p>\n<pre>\nint32_t SomeClass::Haha()\n{\n  return this-&gt;tryGetValue(&amp;m_readCount);\n}\n<\/pre>\n<p>or even this: <\/p>\n<pre>\nint32_t SomeClass::Hoho()\n{\n  return this-&gt;tryGetValue(&amp;m_errno);\n}\n<\/pre>\n<p>Oops. <\/p>\n<p>Let&#8217;s look at <code>Haha<\/code>. Suppose that our initial conditions are <code>m_errno = 0<\/code>, <code>m_value = 42<\/code>, and <code>m_readCount = 0<\/code>. <\/p>\n<table CLASS=\"cp3\" CELLPADDING=\"3\" CELLSPACING=\"0\">\n<tr>\n<th COLSPAN=\"2\">Original<\/th>\n<td ROWSPAN=\"20\" STYLE=\"width: 1px;padding: 0px;background-color: black\"><\/td>\n<th COLSPAN=\"2\">Optimized<\/th>\n<\/tr>\n<tr>\n<th COLSPAN=\"5\" STYLE=\"height: 1px;padding: 0px;background-color: black\"><\/th>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>local_errno = m_errno;<\/td>\n<td>\/\/ 0<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>if (!m_errno)<\/td>\n<td>\/\/ true<\/td>\n<td>if (!local_errno)<\/td>\n<td>\/\/ true<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><\/td>\n<td>local_readCount = m_readCount;<\/td>\n<td>\/\/ 0<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>*value = m_value;<\/td>\n<td>\/\/ m_readCount = 42<\/td>\n<td>*value = m_value;<\/td>\n<td>\/\/ m_readCount = 42<\/td>\n<\/tr>\n<tr>\n<td>m_readCount++;<\/td>\n<td>\/\/ m_readCount = 43<\/td>\n<td>m_readCount = local_readCount + 1;<\/td>\n<td>\/\/ <font COLOR=\"red\">m_readCount = 1<\/font><\/td>\n<\/tr>\n<tr>\n<td>return m_errno;<\/td>\n<td>\/\/ 0<\/td>\n<td>return local_errno;<\/td>\n<td>\/\/ 0<\/td>\n<\/tr>\n<\/table>\n<p>The original code copies the value to <code>*value<\/code> before incrementing the read count. This means that if the caller says that <code>m_readCount<\/code> is the output variable, the act of copying the value <i>modifies <code>m_readCount<\/code><\/i>. This modified value is then incremented. 
Our optimized version does not take this case into account and sets <code>m_readCount<\/code> to the old value incremented by 1. <\/p>\n<p>We were faked out by pointer aliasing! <\/p>\n<p>(A similar disaster occurs in <code>Hoho<\/code>.) <\/p>\n<p>Now, whether the behavior described above is intentional or desirable is not at issue here. The C++ language specification requires that the original code result in the specified behavior, so the compiler is required to honor it. Optimizations cannot alter the behavior of standard-conforming code, even if that behavior seems strange to a human being reading it. <\/p>\n<p>But we can still salvage this optimization by handling the aliasing case. The processor contains support for aliasing detection via the <code>ld.a<\/code> instruction. <\/p>\n<pre>\n        \/\/ we are a leaf function, so no need to use \"alloc\" or to save rp.\n        \/\/ on entry: r32 = this, r33 = value\n\n        addl    r30 = 08h, r32          \/\/ calculate &amp;m_errno\n        addl    r29 = 04h, r32 ;;       \/\/ calculate &amp;m_readCount\n\n        ld4     ret0 = [r30] ;;         \/\/ load m_errno\n\n        cmp4.eq p6, p7 = ret0, r0       \/\/ p6 = m_errno == 0, p7 = !p6\n\n(p7)    br.ret.sptk.many rp             \/\/ return m_errno if there was an error\n\n        ld4     r31 = [r32]             \/\/ load m_value (at offset 0)\n        <font COLOR=\"blue\">ld4.a   r28 = [r29] ;;          \/\/ preload m_readCount<\/font>\n\n        addl    r28 = 01h, r28          \/\/ calculate m_readCount + 1\n        st4     [r33] = r31             \/\/ store m_value to *value\n\n        chk.a.clr r28, recover ;;       \/\/ recover from pointer aliasing\nrecovered:\n        st4     [r29] = r28 ;;          \/\/ store updated m_readCount\n\n        br.ret.sptk.many rp             \/\/ return\n\nrecover:\n        ld4     r28 = [r29] ;;          \/\/ reload m_readCount\n        addl    r28 = 01h, r28          \/\/ recalculate m_readCount + 1\n        br      
recovered               \/\/ recovery complete, resume mainline code\n<\/pre>\n<p>The <code>ld.a<\/code> instruction is the same as an <code>ld<\/code> instruction, but it also tells the processor that this is an <var>advanced load<\/var>, and that the processor should stay on the lookout for any instructions that write to any bytes accessed by the load instruction. When the value is finally consumed, you perform a <code>chk.a.clr<\/code> to check whether the value you loaded is still valid. If no instructions have written to the memory in the meantime, then great. But if the address was written to, the processor will jump to the recovery code you provided. The recovery code re-executes the load and any other follow-up calculations, then returns to the original mainline code path. <\/p>\n<p>The <code>.clr<\/code> completer tells the processor to stop monitoring that address. It clears the entry from the Advanced Load Address Table, freeing it up for somebody else to use. <\/p>\n<p>There is also a <code>ld.c<\/code> instruction which is equivalent to a <code>chk.a<\/code> that jumps to a reload and then jumps back. In other words, <\/p>\n<pre>\n    ld.c.clr r1 = [r2]\n<\/pre>\n<p>is equivalent to <\/p>\n<pre>\n    chk.a.clr r1, recover\nrecovered:\n    ...\n\nrecover:\n    ld     r1 = [r2]\n    br     recovered\n<\/pre>\n<p>but is much more compact and doesn&#8217;t take branch penalties. This is used if there is no follow-up computation; you merely want to reload the value if it changed. <\/p>\n<p>As with recovery from speculative loads, we can inline some of the mainline code into the recovery code so that we don&#8217;t have to pad out the mainline code to get <code>recovered<\/code> to sit on a bundle boundary. I didn&#8217;t bother doing it here; you can do it as an exercise. <\/p>\n<p>The nice thing about processor support for pointer aliasing detection is that it can be done across functions, something that cannot easily be done statically. 
Consider this function: <\/p>\n<pre>\nvoid accumulateTenTimes(int32_t (*something)(int32_t), int32_t *victim)\n{\n int32_t total = 0;\n for (int32_t i = 0; i &lt; 10; i++) {\n  total += something(*victim);\n }\n *victim = total;\n}\n\nint32_t negate(int32_t a) { return -a; }\n\nint32_t value = 2;\naccumulateTenTimes(negate, &amp;value);\n\/\/ result: value = -2 + -2 + -2 + ... + -2 = -20\n\nint32_t value2 = 2;\nint32_t sneaky_negate(int32_t a) { value2 \/= 2; return -a; }\naccumulateTenTimes(sneaky_negate, &amp;value2);\n\/\/ result: value2 = -2 + -1 + -0 + -0 + ... + -0 = -3\n<\/pre>\n<p>When compiling the <code>accumulate&shy;Ten&shy;Times<\/code> function, the compiler has no way of knowing whether the <code>something<\/code> function will modify <code>*victim<\/code>, so it must be conservative and assume that it might, just in case we are in the <code>sneaky_negate<\/code> case. <\/p>\n<p>Let&#8217;s assume that the compiler has done flow analysis and determined that the function pointer passed to <code>accumulate&shy;Ten&shy;Times<\/code> is always within the same module, so it doesn&#8217;t need to deal with <code>gp<\/code>. Since function descriptors are immutable, it can also enregister the function address. 
<\/p>\n<pre>\n        \/\/ 2 input registers, 5 local registers, 1 output register\n        alloc   r34 = ar.pfs, 2, 5, 1, 0\n        mov     r35 = rp                \/\/ save return address\n        mov     r36 = ar.lc             \/\/ save loop counter\n        or      r37 = r0, r0            \/\/ total = 0\n        ld8     r38 = [r32]             \/\/ get the function address\n        or      r31 = 09h, r0 ;;        \/\/ r31 = 9\n        mov     ar.lc = r31             \/\/ loop nine more times (ten total)\nagain:\n        ld4     r39 = [r33]             \/\/ load *victim as the output parameter\n        mov     b6 = r38                \/\/ move to branch register\n        br.call.dptk.many rp = b6 ;;    \/\/ call function in b6\n        add     r37 = ret0, r37         \/\/ accumulate total\n        br.cloop.sptk.few again ;;      \/\/ loop 9 more times\n\n        st4     [r33] = r37             \/\/ save the total\n\n        mov     ar.lc = r36             \/\/ restore loop counter\n        mov     rp = r35                \/\/ restore return address\n        mov     ar.pfs = r34            \/\/ restore stack frame\n        br.ret.sptk.many rp             \/\/ return\n<\/pre>\n<p>Note that at each iteration, we read <code>*victim<\/code> from memory because we aren&#8217;t sure whether the <code>something<\/code> function modifies it. But with advanced loads, we can remove the memory access from the loop. 
<\/p>\n<pre>\n        \/\/ 2 input registers, 6 local registers, 1 output register\n        alloc   r34 = ar.pfs, 2, 6, 1, 0\n        mov     r35 = rp                \/\/ save return address\n        mov     r36 = ar.lc             \/\/ save loop counter\n        or      r37 = r0, r0            \/\/ total = 0\n        ld8     r38 = [r32]             \/\/ get the function address\n        or      r31 = 09h, r0 ;;        \/\/ r31 = 9\n        mov     ar.lc = r31             \/\/ loop nine more times (ten total)\n        <font COLOR=\"blue\">ld4.a   r39 = [r33]             \/\/ get the value of *victim<\/font>\nagain:\n        <font COLOR=\"blue\">ld4.c.nc r39 = [r33]            \/\/ reload *victim if necessary<\/font>\n        or      r40 = r39, r0           \/\/ set *victim as the output parameter\n        mov     b6 = r38                \/\/ move to branch register\n        br.call.dptk.many rp = b6 ;;    \/\/ call function in b6\n        add     r37 = ret0, r37         \/\/ accumulate total\n        br.cloop.sptk.few again ;;      \/\/ loop 9 more times\n\n        <font COLOR=\"blue\">invala.e r39                    \/\/ stop tracking r39<\/font>\n\n        st4     [r33] = r37             \/\/ save the total\n\n        mov     ar.lc = r36             \/\/ restore loop counter\n        mov     rp = r35                \/\/ restore return address\n        mov     ar.pfs = r34            \/\/ restore stack frame\n        br.ret.sptk.many rp             \/\/ return\n<\/pre>\n<p>We perform an advanced load of <code>*victim<\/code> in the hope that the callback function will not modify it. This is true if the callback function is <code>negate<\/code>, but it will trigger reloads if the callback function is <code>sneaky_negate<\/code>. <\/p>\n<p>Note here that we use the <code>.nc<\/code> completer on the <code>ld.c<\/code> instruction. This stands for <var>no clear<\/var> and tells the processor to keep tracking the address because we will be checking it again. 
When the loop is over, we use <code>invala.e<\/code> to tell the processor, &#8220;Okay, you can stop tracking it now.&#8221; This also shows how handy the <code>ld.c<\/code> instruction is. We can do the reload inline rather than having to write separate recovery code and jump out and back. <\/p>\n<p>(Processor trivia: We do not need a stop after the <code>ld4.c.nc<\/code>. You are allowed to consume the result of a check load in the same instruction group.) <\/p>\n<p>In the case where the callback function does not modify <code>*victim<\/code>, the only memory accesses performed by this function and the callback are loading the function address, loading the initial value from <code>*victim<\/code>, and storing the final value to <code>*victim<\/code>. The loop body itself runs without any memory access at all! <\/p>\n<p>Going back to our original function, I noted that we could also add speculation to the mix. So let&#8217;s do that. We&#8217;re going to speculate an advanced load! <\/p>\n<pre>\n        \/\/ we are a leaf function, so no need to use \"alloc\" or to save rp.\n        \/\/ on entry: r32 = this, r33 = value\n\n        <font COLOR=\"blue\">ld4.sa  r31 = [r32]             \/\/ speculatively preload m_value (at offset 0)<\/font>\n        addl    r30 = 08h, r32          \/\/ calculate &amp;m_errno\n        addl    r29 = 04h, r32 ;;       \/\/ calculate &amp;m_readCount\n\n        <font COLOR=\"blue\">ld4.sa  r28 = [r29]             \/\/ speculatively preload m_readCount<\/font>\n        ld4     ret0 = [r30] ;;         \/\/ load m_errno\n\n        cmp4.eq p6, p7 = ret0, r0       \/\/ p6 = m_errno == 0, p7 = !p6\n\n<font COLOR=\"blue\">(p7)    invala.e r31                    \/\/ abandon the advanced load\n(p7)    invala.e r28                    \/\/ abandon the advanced load<\/font>\n(p7)    br.ret.sptk.many rp             \/\/ return m_errno if there was an error\n\n        <font COLOR=\"blue\">ld4.c.clr r31 = [r32]           \/\/ validate speculation and 
advanced load of m_value<\/font>\n        st4     [r33] = r31             \/\/ store m_value to *value\n\n        <font COLOR=\"blue\">ld4.c.clr r28 = [r29]           \/\/ validate speculation and advanced load of m_readCount<\/font>\n        addl    r28 = 01h, r28 ;;       \/\/ calculate m_readCount + 1\n        st4     [r29] = r28             \/\/ store updated m_readCount\n\n        br.ret.sptk.many rp             \/\/ return\n<\/pre>\n<p>To validate a speculative advanced load, you just need to do a <code>ld.c<\/code>. If the speculation failed, then the advanced load also fails, so all we need to do is check the advanced load, and the reload will raise the exception. <\/p>\n<p>The dependency chain for this function is even shorter now that we were able to speculate the case where there is no error. (Since you are allowed to consume an <code>ld4.c<\/code> in the same instruction group, I combined the <code>ld4.c<\/code> and its consumption in a single box because they occur within the same cycle.) 
<\/p>\n<table BORDER=\"0\" CELLPADDING=\"0\" CELLSPACING=\"0\" STYLE=\"text-align: center;margin-left: 1px\">\n<tr>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">ld4.sa r31 = [r32]<\/td>\n<td STYLE=\"width: 3em\"><\/td>\n<td STYLE=\"border: solid black 1px;width: 13em;height: 2em\">addl r30 = 08h, r32<\/td>\n<td STYLE=\"width: 3em\"><\/td>\n<td STYLE=\"border: solid black 1px;width: 11em;height: 2em\">addl r29 = 04h, r32<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">ld4 ret0 = [r30]<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">ld4.sa r28 = [r29]<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">cmp4.eq p6, p7 = ret0, r0<\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td>&#x2199;&#xfe0e;<\/td>\n<td>&darr;<\/td>\n<td>&#x2198;&#xfe0e;<\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td STYLE=\"height: 2em\">\n<table STYLE=\"width: 100%;height: 2em;border-collapse: collapse;text-align: center\" CELLPADDING=\"3\" CELLSPACING=\"0\">\n<tr>\n<td STYLE=\"border: solid black 1px\">ld4.c<\/td>\n<td STYLE=\"border: solid black 1px\">st4 [r33] = r31<\/td>\n<\/tr>\n<\/table>\n<\/td>\n<td><\/td>\n<td STYLE=\"height: 2em\">\n<table STYLE=\"width: 100%;height: 2em;border-collapse: collapse;text-align: center\" CELLPADDING=\"3\" CELLSPACING=\"0\">\n<tr>\n<td STYLE=\"border: solid black 1px\">invala.e r31<\/td>\n<td STYLE=\"border: solid black 1px\">invala.e r28<\/td>\n<td STYLE=\"border: solid black 1px\">br.ret rp<\/td>\n<\/tr>\n<\/table>\n<\/td>\n<td><\/td>\n<td STYLE=\"height: 2em\">\n<table STYLE=\"width: 100%;height: 2em;border-collapse: collapse;text-align: center\" CELLPADDING=\"3\" CELLSPACING=\"0\">\n<tr>\n<td STYLE=\"border: solid 
black 1px\">ld4.c<\/td>\n<td STYLE=\"border: solid black 1px\">addl r28 = 01h, r28<\/td>\n<\/tr>\n<\/table>\n<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td STYLE=\"border: solid black 1px;height: 2em\">st4 [r29] = r28<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>&darr;<\/td>\n<td><\/td>\n<td><\/td>\n<td><\/td>\n<td>&darr;<\/td>\n<\/tr>\n<tr>\n<td COLSPAN=\"5\" STYLE=\"border: solid black 1px;height: 2em\">br.ret.sptk.many rp<\/td>\n<\/tr>\n<\/table>\n<p>Aw, look at that pretty diagram. Control speculation and data speculation allowed us to run three different operations in parallel even though they might have dependencies on each other. The idea here is that if profiling suggests that the dependencies are rarely realized (pointers are usually not aliased), you can use speculation to run the operations as if they had no dependencies, and then use the check instructions to convert the speculated results to real ones. <\/p>\n<p>&sup1; Note the absence of a stop between the <code>cmp4<\/code> and the <code>br.ret<\/code>. That&#8217;s because of a special Itanium rule that says that a conditional branch is permitted to use a predicate register calculated earlier within the same instruction group. (Normally, instructions within an instruction group are not allowed to have dependencies among each other.) This allows a test and jump to occur within the same cycle. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I hoped you were going to say that.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[26],"class_list":["post-91171","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-other"],"acf":[],"blog_post_summary":"<p>I hoped you were going to say that.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/91171","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=91171"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/91171\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=91171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=91171"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=91171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}