Today we’ll look at advanced loads, which is when you load a value before you’re supposed to, in the hope that the value won’t change in the meantime.
Consider the following code:
int32_t SomeClass::tryGetValue(int32_t *value)
{
    if (!m_errno) {
        *value = m_value;
        m_readCount++;
    }
    return m_errno;
}
Let’s say that the SomeClass has m_value at offset zero, m_errno at offset 4, and m_readCount at offset 8.
The naïve way of compiling this function would go something like this:
// we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = value
        addl    r30 = 08h, r32          // calculate &m_errno
        addl    r29 = 04h, r32 ;;       // calculate &m_readCount
        ld4     ret0 = [r30] ;;         // load m_errno
        cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
(p7)    br.ret.sptk.many rp             // return m_errno if there was an error¹
        ld4     r31 = [r32] ;;          // load m_value (at offset 0)
        st4     [r33] = r31 ;;          // store m_value to *value
        ld4     r28 = [r29] ;;          // load m_readCount
        addl    r28 = 01h, r28 ;;       // calculate m_readCount + 1
        st4     [r29] = r28 ;;          // store updated m_readCount
        ld4     ret0 = [r30]            // reload m_errno for return value
        br.ret.sptk.many rp             // return
First, we calculate the addresses of our member variables. Then we load m_errno, and if there is an error, we return it immediately. Otherwise, we copy the current value to *value, load m_readCount, increment it, and finally return m_errno.
The problem here is that we have a deep dependency chain.
addl r30 = 08h, r32
            ↓
ld4 ret0 = [r30]
            ↓
cmp4.eq p6, p7 = ret0, r0
      ↙             ↓
(p7) br.ret.sptk.many rp    ld4 r31 = [r32]
                                    ↓
                            st4 [r33] = r31            addl r29 = 04h, r32
                                    ↓ (non-obvious        ↙
                                    ↓  dependency)      ↙
                            ld4 r28 = [r29]
                                    ↓
                            addl r28 = 01h, r28
                                    ↓
                            st4 [r29] = r28
                                    ↓ (non-obvious dependency)
                            ld4 ret0 = [r30]
                                    ↓
                            br.ret.sptk.many rp
Pretty much every instruction depends on the result of the previous instruction. Some of these dependencies are obvious: you have to calculate the address of a member variable before you can read it, and you have to get the result of a memory access before you can perform arithmetic on it. Some of the dependencies are not obvious. For example, we cannot access m_value or m_readCount until after we confirm that m_errno is zero, to avoid a potential access violation if the object straddles a page boundary with m_errno on one page and m_value on the other (invalid) page. (We saw last time how this can be solved with speculative loads, but let’s not add that to the mix yet.)
Returning m_errno is a non-obvious dependency. We’ll see why later. For now, note that the return value came from a memory access, which means that if the caller of the function tries to use the return value, it may stall waiting for the result to arrive from the memory controller.
When you issue a read on Itanium, the processor merely initiates the operation and proceeds to the next instruction before the read completes. If you try to use the result of the read too soon, the processor stalls until the value is received from the memory controller. Therefore, you want to put as much distance as possible between the load of a value from memory and the attempt to use the result.
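If it helps to see the idea outside of assembly, here’s a tiny C++ sketch (the function names are mine and purely illustrative): issue the load early, overlap it with independent work, and consume the result as late as possible.

#include <cstdint>

// Purely illustrative: a stand-in for work that does not depend on the loaded value.
static int32_t independent_work(int32_t x) { return x * 3 + 7; }

int32_t distance_demo(const int32_t *p, int32_t x)
{
    int32_t loaded = *p;                  // issue the load as early as possible
    int32_t other = independent_work(x);  // independent work overlaps the load latency
    return loaded + other;                // by the time we consume it, the load has (hopefully) arrived
}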
Let’s see what we can do to parallelize this function. We’ll perform the increment of m_readCount and the fetch of m_value simultaneously.
// we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = value
        addl    r30 = 08h, r32          // calculate &m_errno
        addl    r29 = 04h, r32 ;;       // calculate &m_readCount
        ld4     ret0 = [r30] ;;         // load m_errno
        cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
(p7)    br.ret.sptk.many rp             // return m_errno if there was an error
        ld4     r31 = [r32]             // load m_value (at offset 0)
        ld4     r28 = [r29] ;;          // preload m_readCount
        addl    r28 = 01h, r28          // calculate m_readCount + 1
        st4     [r33] = r31 ;;          // store m_value to *value
        st4     [r29] = r28             // store updated m_readCount
        br.ret.sptk.many rp             // return (answer already in ret0)
We’ve basically rewritten the function as
int32_t SomeClass::tryGetValue(int32_t *value)
{
    int32_t local_errno = m_errno;
    if (!local_errno) {
        int32_t local_readCount = m_readCount;
        int32_t local_value = m_value;
        local_readCount = local_readCount + 1;
        *value = local_value;
        m_readCount = local_readCount;
    }
    return local_errno;
}
This time we loaded the return value from m_errno long before the function ends, so when the caller tries to use the return value, it will definitely be ready and not incur a memory stall. (If a stall were needed, it would have occurred at the cmp4.) And we’ve also shortened the dependency chain significantly in the second half of the function.
addl r30 = 08h, r32
            ↓
ld4 ret0 = [r30]
            ↓
cmp4.eq p6, p7 = ret0, r0                       addl r29 = 04h, r32
      ↙             ↓              ↘                     ↓
(p7) br.ret.sptk.many rp    ld4 r31 = [r32]      ld4 r28 = [r29]
                                    ↓                    ↓
                            st4 [r33] = r31      addl r28 = 01h, r28
                                    ↓                    ↓
                                    ↓            st4 [r29] = r28
                                    ↓                    ↓
                                 br.ret.sptk.many rp
This works great until somebody does this:
int32_t SomeClass::Haha() { return this->tryGetValue(&m_readCount); }
or even this:
int32_t SomeClass::Hoho() { return this->tryGetValue(&m_errno); }
Oops.
Let’s look at Haha. Suppose that our initial conditions are m_errno = 0, m_value = 42, and m_readCount = 0.
Original                                    Optimized

                                            local_errno = m_errno;              // true
if (!m_errno)                // true        if (!local_errno)                   // true
                                            local_readCount = m_readCount;      // 0
*value = m_value; // m_readCount = 42       *value = m_value;                   // m_readCount = 42
m_readCount++;    // m_readCount = 43       m_readCount = local_readCount + 1;  // m_readCount = 1
return m_errno;   // 0                      return local_errno;                 // 0
The original code copies the value before incrementing the read count. This means that if the caller says that m_readCount is the output variable, the act of copying the value modifies m_readCount, and it is this modified value that then gets incremented. Our optimized version does not take this case into account and sets m_readCount to the old value incremented by 1.
We were faked out by pointer aliasing!
(A similar disaster occurs in Hoho.)
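If you want to watch the divergence happen, here’s a minimal, self-contained C++ reproduction (hypothetical stand-alone code, not compiler output): it runs the faithful version and the hand-hoisted rewrite side by side with an aliased output pointer.

#include <cstdint>
#include <cstdio>

struct SomeClass {
    int32_t m_value = 42;
    int32_t m_errno = 0;
    int32_t m_readCount = 0;

    // Faithful version: reads m_readCount only *after* the store through "value".
    int32_t tryGetValue(int32_t *value) {
        if (!m_errno) { *value = m_value; m_readCount++; }
        return m_errno;
    }

    // Hand-hoisted version: loads m_readCount early, assuming no aliasing.
    int32_t tryGetValueHoisted(int32_t *value) {
        int32_t local_errno = m_errno;
        if (!local_errno) {
            int32_t local_readCount = m_readCount;  // loaded before the store
            *value = m_value;                       // may overwrite m_readCount!
            m_readCount = local_readCount + 1;
        }
        return local_errno;
    }
};

int main() {
    SomeClass a, b;
    a.tryGetValue(&a.m_readCount);          // aliased output pointer, as in Haha
    b.tryGetValueHoisted(&b.m_readCount);
    std::printf("faithful: %d, hoisted: %d\n", (int)a.m_readCount, (int)b.m_readCount);
}

On a typical implementation this prints faithful: 43, hoisted: 1, matching the table above.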
Now, whether the behavior described above is intentional or desirable is not at issue here. The C++ language specification requires that the original code result in the specified behavior, so the compiler is required to honor it. Optimizations cannot alter the behavior of standard-conforming code, even if that behavior seems strange to a human being reading it.
But we can still salvage this optimization by handling the aliasing case. The processor contains support for aliasing detection via the ld.a instruction.
// we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = value
        addl    r30 = 08h, r32          // calculate &m_errno
        addl    r29 = 04h, r32 ;;       // calculate &m_readCount
        ld4     ret0 = [r30] ;;         // load m_errno
        cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
(p7)    br.ret.sptk.many rp             // return m_errno if there was an error
        ld4     r31 = [r32]             // load m_value (at offset 0)
        ld4.a   r28 = [r29] ;;          // preload m_readCount
        addl    r28 = 01h, r28          // calculate m_readCount + 1
        st4     [r33] = r31             // store m_value to *value
        chk.a.clr r28, recover ;;       // recover from pointer aliasing
recovered:
        st4     [r29] = r28 ;;          // store updated m_readCount
        br.ret.sptk.many rp             // return

recover:
        ld4     r28 = [r29] ;;          // reload m_readCount
        addl    r28 = 01h, r28          // recalculate m_readCount + 1
        br      recovered               // recovery complete, resume mainline code
The ld.a instruction is the same as an ld instruction, but it also tells the processor that this is an advanced load, and that the processor should stay on the lookout for any instructions that write to any bytes accessed by the load instruction. When the value is finally consumed, you perform a chk.a.clr to check whether the value you loaded is still valid. If no instructions have written to the memory in the meantime, then great. But if the address was written to, the processor will jump to the recovery code you provided. The recovery code re-executes the load and any other follow-up calculations, then returns to the original mainline code path.
The .clr completer tells the processor to stop monitoring that address. It clears the entry from the Advanced Load Address Table, freeing it up for somebody else to use.
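As a rough mental model (and only a model; this is not how the ALAT is implemented), the advanced-load-plus-check pattern behaves like this C++ sketch, where the “did anyone write to the tracked bytes?” test is crudely simulated with a pointer comparison:

#include <cstdint>

// Rough conceptual model of ld.a / chk.a.clr, for illustration only.
// "advanced" is computed early from a load that a later store might invalidate;
// the check detects the invalidation and reruns the load plus its follow-up arithmetic.
int32_t advanced_load_model(int32_t *p, int32_t *store_target)
{
    int32_t advanced = *p + 1;     // ld.a plus follow-up arithmetic, hoisted above the store
    *store_target = 0;             // an intervening store that may or may not alias *p
    if (store_target == p) {       // chk.a: did anyone write to the tracked bytes?
        advanced = *p + 1;         // recovery code: redo the load and the arithmetic
    }
    return advanced;               // mainline resumes with a correct value either way
}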
There is also a ld.c instruction, which is equivalent to a chk.a that jumps to a reload and then jumps back. In other words,
ld.c.clr r1 = [r2]
is equivalent to
        chk.a.clr r1, recover
recovered:
        ...

recover:
        ld      r1 = [r2]
        br      recovered
but is much more compact and doesn’t take branch penalties. This is used if there is no follow-up computation; you merely want to reload the value if it changed.
As with recovery from speculative loads, we can inline some of the mainline code into the recovery code so that we don’t have to pad out the mainline code to get recovered to sit on a bundle boundary. I didn’t bother doing it here; you can do it as an exercise.
The nice thing about processor support for pointer aliasing detection is that it can be done across functions, something that cannot easily be done statically. Consider this function:
void accumulateTenTimes(int32_t (*something)(int32_t), int32_t *victim)
{
    int32_t total = 0;
    for (int32_t i = 0; i < 10; i++) {
        total += something(*victim);
    }
    *victim = total;
}

int32_t negate(int32_t a) { return -a; }

int32_t value = 2;
accumulateTenTimes(negate, &value);
// result: value = -2 + -2 + -2 + ... + -2 = -20

int32_t value2 = 2;

int32_t sneaky_negate(int32_t a)
{
    value2 /= 2;
    return -a;
}

accumulateTenTimes(sneaky_negate, &value2);
// result: value2 = -2 + -1 + -0 + -0 + ... + -0 = -3
When compiling the accumulateTenTimes function, the compiler has no way of knowing whether the something function will modify *victim, so it must be conservative and assume that it might, just in case we are in the sneaky_negate case.
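Just to spell out the transformation the compiler is not allowed to make on its own, here is the unsafe hand-hoisted version in C++ (the name is mine, illustration only):

#include <cstdint>

// UNSAFE illustration: caching *victim outside the loop assumes the callback never
// writes to *victim. With sneaky_negate and an initial value of 2, this produces
// -2 ten times (total -20) instead of the required -3.
void accumulateTenTimesUnsafe(int32_t (*something)(int32_t), int32_t *victim)
{
    int32_t total = 0;
    int32_t cached = *victim;        // hoisted load, never refreshed
    for (int32_t i = 0; i < 10; i++) {
        total += something(cached);
    }
    *victim = total;
}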
Let’s assume that the compiler has done flow analysis and determined that the function pointer passed to accumulateTenTimes is always within the same module, so it doesn’t need to deal with gp. Since function descriptors are immutable, it can also enregister the function address.
        // 2 input registers, 5 local registers, 1 output register
        alloc   r34 = ar.pfs, 2, 5, 1, 0
        mov     r35 = rp                // save return address
        mov     r36 = ar.lc             // save loop counter
        or      r37 = r0, r0            // total = 0
        ld8     r38 = [r32]             // get the function address
        or      r31 = 09h, r0 ;;        // r31 = 9
        mov     ar.lc = r31             // loop nine more times (ten total)
again:
        ld4     r39 = [r33]             // load *victim for output
        mov     b6 = r38                // move to branch register
        br.call.dptk.many rp = b6 ;;    // call function in b6
        addl    r37 = ret0, r37         // accumulate total
        br.cloop.sptk.few again ;;      // loop 9 more times
        st4     [r33] = r37             // save the total
        mov     ar.lc = r36             // restore loop counter
        mov     rp = r35                // restore return address
        mov     ar.pfs = r34            // restore stack frame
        br.ret.sptk.many rp             // return
Note that at each iteration, we read *victim from memory because we aren’t sure whether the something function modifies it. But with advanced loads, we can remove the memory access from the loop.
        // 2 input registers, 6 local registers, 1 output register
        alloc   r34 = ar.pfs, 2, 6, 1, 0
        mov     r35 = rp                // save return address
        mov     r36 = ar.lc             // save loop counter
        or      r37 = r0, r0            // total = 0
        ld8     r38 = [r32]             // get the function address
        or      r31 = 09h, r0 ;;        // r31 = 9
        mov     ar.lc = r31             // loop nine more times (ten total)
        ld4.a   r39 = [r33]             // get the value of *victim
again:
        ld4.c.nc r39 = [r33]            // reload *victim if necessary
        or      r40 = r39, r0           // set *victim as the output parameter
        mov     b6 = r38                // move to branch register
        br.call.dptk.many rp = b6 ;;    // call function in b6
        addl    r37 = ret0, r37         // accumulate total
        br.cloop.sptk.few again ;;      // loop 9 more times
        invala.e r39                    // stop tracking r39
        st4     [r33] = r37             // save the total
        mov     ar.lc = r36             // restore loop counter
        mov     rp = r35                // restore return address
        mov     ar.pfs = r34            // restore stack frame
        br.ret.sptk.many rp             // return
We perform an advanced load of *victim in the hope that the callback function will not modify it. This is true if the callback function is negate, but it will trigger reloads if the callback function is sneaky_negate.
Note here that we use the .nc completer on the ld.c instruction. This stands for “no clear” and tells the processor to keep tracking the address because we will be checking it again. When the loop is over, we use invala.e to tell the processor, “Okay, you can stop tracking it now.” This also shows how handy the ld.c instruction is: we can do the reload inline rather than having to write separate recovery code and jump out and back.
(Processor trivia: We do not need a stop after the ld4.c.nc. You are allowed to consume the result of a check load in the same instruction group.)
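At the source level, the observable behavior is the same as the conservative loop; the win is purely at the machine level, where the check load does no real work as long as nobody wrote to the tracked address. Here is a hedged C++ “as if” sketch (my naming) that mirrors the shape of the assembly:

#include <cstdint>

// "As if" view of the advanced-load loop. In C++ the recheck looks like an
// unconditional reload; on Itanium the ld4.c.nc only does real work when the
// ALAT entry was invalidated by an intervening store.
void accumulateTenTimesAsIf(int32_t (*something)(int32_t), int32_t *victim)
{
    int32_t total = 0;
    int32_t cached = *victim;            // ld4.a before the loop
    for (int32_t i = 0; i < 10; i++) {
        cached = *victim;                // ld4.c.nc: "reload only if someone wrote to it"
        total += something(cached);      // addl r37 = ret0, r37
    }
    *victim = total;                     // final store, outside the loop
}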
In the case where the callback function does not modify *victim, the only memory accesses performed by this function and the callback are loading the function address, loading the initial value from *victim, and storing the final value to *victim. The loop body itself runs without any memory access at all!
Going back to our original function, I noted that we could also add speculation to the mix. So let’s do that. We’re going to speculate an advanced load!
// we are a leaf function, so no need to use "alloc" or to save rp.
// on entry: r32 = this, r33 = value
        ld4.sa  r31 = [r32]             // speculatively preload m_value (at offset 0)
        addl    r30 = 08h, r32          // calculate &m_errno
        addl    r29 = 04h, r32 ;;       // calculate &m_readCount
        ld4.sa  r28 = [r29]             // speculatively preload m_readCount
        ld4     ret0 = [r30] ;;         // load m_errno
        cmp4.eq p6, p7 = ret0, r0       // p6 = m_errno == 0, p7 = !p6
(p7)    invala.e r31                    // abandon the advanced load
(p7)    invala.e r28                    // abandon the advanced load
(p7)    br.ret.sptk.many rp             // return m_errno if there was an error
        ld4.c.clr r31 = [r32]           // validate speculation and advanced load of m_value
        st4     [r33] = r31             // store m_value to *value
        ld4.c.clr r28 = [r29]           // validate speculation and advanced load of m_readCount
        addl    r28 = 01h, r28 ;;       // calculate m_readCount + 1
        st4     [r29] = r28             // store updated m_readCount
        br.ret.sptk.many rp             // return
To validate a speculative advanced load, you just need to do a ld.c. If the speculation failed, then the advanced load also fails, so all we need to do is check the advanced load, and the reload will raise the deferred exception.
The dependency chain for this function is even shorter now that we were able to speculate the case where there is no error. (Since you are allowed to consume an ld4.c in the same instruction group, I combined the ld4.c and its consumption into a single box, since they occur within the same cycle.)
ld4.sa r31 = [r32]      addl r30 = 08h, r32             addl r29 = 04h, r32
        ↓                       ↓                               ↓
        ↓               ld4 ret0 = [r30]                ld4.sa r28 = [r29]
        ↓                       ↓                               ↓
        ↓               cmp4.eq p6, p7 = ret0, r0               ↓
        ↓                  ↙            ↘                       ↓
        ↓       [ (p7) invala.e r31      ↘                      ↓
        ↓         (p7) invala.e r28       ↘                     ↓
        ↓         (p7) br.ret.sptk.many rp ]                    ↓
[ ld4.c.clr r31 = [r32]                         [ ld4.c.clr r28 = [r29]
  st4 [r33] = r31 ]                               addl r28 = 01h, r28 ]
        ↓                                                       ↓
        ↓                                               st4 [r29] = r28
        ↓                                                       ↓
                        br.ret.sptk.many rp
Aw, look at that pretty diagram. Control speculation and data speculation allowed us to run three different operations in parallel even though they might have dependencies on each other. The idea here is that if profiling suggests that the dependencies are rarely realized (pointers are usually not aliased), you can use speculation to run the operations as if they had no dependencies, and then use the check instructions to convert the speculated results to real ones.
¹ Note the absence of a stop between the cmp4 and the br.ret. That’s because of a special Itanium rule that says that a conditional branch is permitted to use a predicate register calculated earlier within the same instruction group. (Normally, instructions within an instruction group are not allowed to have dependencies among each other.) This allows a test and jump to occur within the same cycle.