How does InterlockedIncrement work internally?

Raymond Chen


The Interlocked family of functions performs atomic operations
on memory.
How do they do it?

It depends on the underlying CPU architecture.
For some CPUs, it’s easy:
The x86, for example, has direct support for many interlocked operations
by means of the LOCK prefix
(with the bonus feature that LOCK is implied for the
XCHG opcode).
The ia64 and x64 also have
direct support for atomic load-modify-store operations.

Most other architectures break the operation into two parts,
known as load-link/store-conditional.
The first part (load-link) reads a value from memory and instructs
the processor to monitor the memory address to see if any other
processors modify that same memory.
The second part (store-conditional) stores a value to memory
provided that no other processors have written to the memory
in the meantime.
An atomic load-modify-store operation is therefore performed by
reading the value via load-link,
performing the desired computation,
then
attempting a store-conditional.
If the store-conditional fails,
then start all over again.

LONG InterlockedIncrement(LONG volatile *value)
{
  LONG lOriginal, lNewValue;
  do {
    // Read the current value via load-link so we will know if
    // somebody has modified it while we weren't looking.
    lOriginal = load_link(value);
    // Calculate the new value
    lNewValue = lOriginal + 1;
    // Store the value conditionally. This will fail if somebody
    // has updated the value in the meantime.
  } while (!store_conditional(value, lNewValue));
  return lNewValue;
}

(If this looks familiar, it should.
You’ve seen this pattern before.)

Now, asking the CPU to monitor a memory address comes with its own
gotchas.
For one thing, the CPU can monitor only one memory address at a time,
and its memory is very short-term.
If your code gets pre-empted or if a hardware interrupt comes in
after your load_link, then your
store_conditional will fail because the CPU got distracted
by the shiny object known as a hardware interrupt and totally
forgot about the memory address it was supposed to be monitoring.
(Even if it managed to remember it, it won’t remember it for long,
because the hardware interrupt will almost certainly execute its own
load_link instruction, thereby replacing the monitored
address with its own.)

Furthermore, the CPU might be a little sloppy in its monitoring
and monitor not the address itself but the cache line.
If somebody modifies a different memory location which happens to
reside in the same cache line, the store_conditional
might fail even though you would expect it to succeed.
The ARM architecture allows a processor to be so sloppy that
any write in the same block of 2048 bytes can cause a
store_conditional to fail.

What this means for you, the assembly-language coder who is
implementing an interlocked operation, is that you need to minimize
the number of instructions between the
load_link and store_conditional.
For example,
InterlockedIncrement merely adds 1 to the value.
The more instructions you insert between the
load_link and store_conditional,
the greater the chance that your store_conditional will fail
and you will have to retry.
And if you put too much code in between, your
store_conditional will never succeed.
As an extreme example, if you put code that
takes five seconds to calculate the new value,
you will certainly receive several hardware interrupts during those
five seconds, and your
store_conditional will always fail.

Bonus reading:
Why did InterlockedIncrement/Decrement only return the sign of the result?
