The AArch64 processor (aka arm64), part 13: Atomic access

Raymond Chen

Atomic operations are performed by the traditional RISC-style load locked/store conditional pattern.

    ; load exclusive register byte
    ldxrb   Rd/zr, [Xn/sp]

    ; load exclusive register halfword
    ldxrh   Rd/zr, [Xn/sp]

    ; load exclusive register
    ldxr    Rd/zr, [Xn/sp]

    ; load exclusive register pair
    ldxp    Rd1/zr, Rd2/zr, [Xn/sp]

These instructions atomically load a byte, halfword, word, doubleword, or pair of registers from memory. The instruction also tells the processor to monitor the memory address to see if any other processor writes to that same address, or addresses in the same “exclusive reservation granule”. (Implementations are allowed to have granules as large as 2KB.)

Note that the atomicity guarantee is only partial if you use LDXP to load a pair of 64-bit registers.¹ The entire 128-bit value is not loaded atomically; instead, each 64-bit portion is loaded atomically separately. You can still get tearing between the two registers.

The only supported addressing mode is register indirect. No offsets or indexes allowed.

After an exclusive load, you can attempt to store a value back to the same address:

    ; store exclusive register byte
    stxrb   Rs/zr, Rt/zr, [Xn/sp]

    ; store exclusive register halfword
    stxrh   Rs/zr, Rt/zr, [Xn/sp]

    ; store exclusive register
    stxr    Rs/zr, Rt/zr, [Xn/sp]

    ; store exclusive register pair
    stxp    Rs/zr, Rt1/zr, Rt2/zr, [Xn/sp]

If the reservation obtained by the previous LDX instruction is still valid, then the value in Rt/zr is stored to memory, and Rs is set to 0. Otherwise, no store is performed, and Rs is set to 1.

Whether the store succeeds or fails, the STX instruction clears the reservation.
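In practice, the exclusive load and exclusive store bracket a retry loop: load the value, compute the new one, attempt the store, and go around again if the store reports failure. Here is a minimal C++ sketch of that shape. The helper name `atomic_multiply` is made up for illustration, and the mapping to instructions is an assumption about typical compilers: when targeting AArch64 without the atomics extension, a compiler will commonly turn this loop into an LDXR/STXR pair, but nothing obliges it to.

```cpp
#include <atomic>

// Hypothetical helper: atomically multiply `target` by `factor`
// and return the value it held beforehand.
int atomic_multiply(std::atomic<int>& target, int factor)
{
    int observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak is allowed to fail spuriously, which is
    // the C++ way of admitting that a store-exclusive can fail even
    // though nobody changed the value. On failure, `observed` is
    // refreshed with the current value, so we simply try again.
    while (!target.compare_exchange_weak(observed, observed * factor,
                                         std::memory_order_relaxed)) {
        // retry with the refreshed value
    }
    return observed;
}
```

The loop uses relaxed ordering throughout because the plain exclusive instructions shown so far do not impose any memory ordering of their own.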

For these exclusive load and store instructions, the address must be a multiple of the number of bytes being loaded. If not, then the behavior is undefined: There is no requirement that an exception be raised.

So don’t do that.

It is also required that the STX match the LDX both in address and operand sizes. You cannot perform an LDX for one address and follow up with a STX to a different address. You also cannot perform a LDXR and follow up with a STXRH to the same address. You aren’t even allowed to do a LDXP with two 32-bit registers and follow up with a STXR with a single 64-bit register. Again, the behavior is undefined if you break this rule.

The last instruction allows you to hit the reset button:

    ; clear exclusive
    clrex

The CLREX discards any active reservation, and forces any subsequent STX to fail. This typically happens as part of interrupt handling or context switching to ensure that undefined behavior doesn’t occur if the thread was interrupted while it was in the middle of a LDX/STX sequence.

These instructions are usually coupled with memory barriers, which we’ll look at soon, but the next entry will be a little diversion.

Bonus chatter: There is an optional instruction set extension (mandatory starting in version 8.1) which includes a large set of atomic read-modify-write operations.

    ; atomic read-modify-write operation
    ; Rt = previous value of [Xr]
    ; [Xr] = Rt op Rs
    ldadd   Rs/zr, Rt/zr, [Xr/sp]       ; add
    ldclr   Rs/zr, Rt/zr, [Xr/sp]       ; and not
    ldeor   Rs/zr, Rt/zr, [Xr/sp]       ; exclusive or
    ldset   Rs/zr, Rt/zr, [Xr/sp]       ; or
    ldsmax  Rs/zr, Rt/zr, [Xr/sp]       ; signed maximum
    ldsmin  Rs/zr, Rt/zr, [Xr/sp]       ; signed minimum
    ldumax  Rs/zr, Rt/zr, [Xr/sp]       ; unsigned maximum
    ldumin  Rs/zr, Rt/zr, [Xr/sp]       ; unsigned minimum

By default, there is no memory ordering. You can add the suffix a to load with acquire, the suffix l to store with release, or the suffix al to get both. Note, however, that the acquire suffix is ignored if the destination register Rt is zr.

Furthermore, you can suffix b for byte memory access or h for halfword memory access.

The overall syntax is therefore

    Prefix  Op     Acquire       Release       Size
    ld      add    (blank) or a  (blank) or l  (blank), b, or h
            clr
            eor
            set
            smax
            smin
            umax
            umin

For example, the instruction ldclrlh means

  • ld: Atomic load/modify/store
  • clr: Clear bits
  • (blank): No acquire on load
  • l: Release on store
  • h: Halfword size.
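For comparison, here is roughly how the same operation looks from C++. This is a sketch, not a statement of what any particular compiler emits: the helper name `clear_flags` is hypothetical, and whether the result is a single ldclrlh depends on the compiler and on whether it has been told the atomics extension is available. Note that C++ has no "and not" operation, so the mask is inverted by hand before calling `fetch_and`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically clear the bits in `mask` from a
// 16-bit flags word, with release ordering, and return the old value.
// Clearing bits is "and not", so the inverted mask goes to fetch_and.
std::uint16_t clear_flags(std::atomic<std::uint16_t>& flags,
                          std::uint16_t mask)
{
    return flags.fetch_and(static_cast<std::uint16_t>(~mask),
                           std::memory_order_release);
}
```

The pieces line up with the instruction name: `fetch_and` with an inverted mask is the clr, `memory_order_release` is the l, and the 16-bit operand type is the h.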

If you don’t care about the previous value, then you can use a pseudo-instruction that uses zr as the destination.

    ; atomic read-modify-write operation
    ; [Xr] = [Xr] op Rs
    stadd   Rs/zr, [Xr/sp]       ; add
    stclr   Rs/zr, [Xr/sp]       ; and not
    steor   Rs/zr, [Xr/sp]       ; exclusive or
    stset   Rs/zr, [Xr/sp]       ; or
    stsmax  Rs/zr, [Xr/sp]       ; signed maximum
    stsmin  Rs/zr, [Xr/sp]       ; signed minimum
    stumax  Rs/zr, [Xr/sp]       ; unsigned maximum
    stumin  Rs/zr, [Xr/sp]       ; unsigned minimum

You can add the l suffix for store with release, and you can add b and h suffixes to operate on smaller sizes. You cannot request acquire on load for these instructions because the acquire is ignored due to the destination being zr.
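The source-level equivalent is an atomic read-modify-write whose result you throw away. A small C++ sketch, with a hypothetical helper name `record_hit`; whether a compiler turns this into stadd rather than ldadd with a scratch register is up to the compiler, so treat the mapping as an assumption.

```cpp
#include <atomic>

// Hypothetical hit counter. The return value of fetch_add (the
// previous count) is deliberately ignored.
void record_hit(std::atomic<unsigned>& hits)
{
    hits.fetch_add(1, std::memory_order_relaxed);
}
```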

The optional instruction set extension also provides for atomic exchanges:

    ; swap
    ; write Rs and return previous value in Rt (atomic)
    swp     Rs/zr, Rt/zr, [Xn/sp]       ; word or doubleword
    swpb    Ws/zr, Wt/zr, [Xn/sp]       ; byte
    swph    Ws/zr, Wt/zr, [Xn/sp]       ; halfword

    ; compare and swap
    ; if value is Rs, then write Rt; Rs receives previous value
    ; (atomic)
    cas     Rs/zr, Rt/zr, [Xn/sp]       ; word or doubleword
    casb    Ws/zr, Wt/zr, [Xn/sp]       ; byte
    cash    Ws/zr, Wt/zr, [Xn/sp]       ; halfword
    casp    Rs/zr, Rt/zr, [Xn/sp]       ; register pair
                                        ; Rs,R(s+1) and Rt,R(t+1)

    ; also a, l, and al versions for acquire/release semantics

The memory order modifiers go between the swp/cas prefix and the size suffix, except that they go after the p. So you have casab (compare and swap with acquire, byte size) but caspa (compare and swap pair with acquire).

As with the ld instructions, requests to acquire on load are ignored if the destination register is zr.

The memory operand must be writable, even if the comparison fails. If no value is stored, then any requested release semantics are ignored.
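The swap and compare-and-swap instructions correspond closely to `exchange` and `compare_exchange_strong` in C++. The sketch below uses hypothetical helper names (`swap_in`, `compare_and_swap`) and is only meant to show the correspondence; the instructions a compiler actually picks are not guaranteed.

```cpp
#include <atomic>

// Hypothetical helper: store `value` and return what was there before.
int swap_in(std::atomic<int>& target, int value)
{
    return target.exchange(value, std::memory_order_acq_rel);
}

// Hypothetical helper: if `target` holds `expected`, replace it with
// `desired`. Either way, return the value that was observed, which
// mirrors how the compare register receives the previous value.
int compare_and_swap(std::atomic<int>& target, int expected, int desired)
{
    // On failure, compare_exchange_strong overwrites `expected` with
    // the value actually found; on success it is left unchanged.
    target.compare_exchange_strong(expected, desired,
                                   std::memory_order_acq_rel,
                                   std::memory_order_acquire);
    return expected;
}
```

The caller can tell whether the swap happened by comparing the returned value against the value it passed as `expected`.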

Bonus reading: Atomics in AArch64.

¹ The load is required to be fully atomic starting with version 8.4 of the AArch64. On older processors, Windows uses CASP instead of LDXP/STXP.



  • MGetz 1

    Not sure how to say this, but I find it hilarious when the RISC diehards come out of the woodwork to trash CISC architectures without actually understanding the dynamics of ISA design, pipelining, caches, and multi-core consistency. The fact of the matter is that all ISAs are generally heading towards FISC at this point (Fast Instruction set computing) where the goal is to do things the fastest with the least power. AArch64 is a really good example of an ISA that was designed FISC from the start instead of being retrofitted to be FISC later like x86-64.

    The split reservation based atomics are great on paper, and maybe also if I were writing assembly. But expressing that in code is quite painful using anything but intrinsic ops. Whereas the combined instructions fit how most atomics code is written in my experience: it’s either a load or store done in an acquire or a release. That also fits with how the standards for C++ and C are written, imagine that.

    I’m guessing if the designers of AArch64 could do it over clean they’d leave out quite a bit they brought over as ARMisms and just go with what’s showing up in 8.1 and later.

    Also for anyone hoping RISC-V will swoop in to be the savior from the AArch64 de-facto monopoly… unless the ISA gets significantly reworked it’s going to be mostly focused on embedded for the foreseeable future because it’s just not designed from the ground up to be highly parallel like AArch64 is.

    • Simon Farnsworth 1

      It’s all circling back round to the beginning – the original RISC versus CISC war was whether the fastest possible processor would have a small number of simple instructions (RISC) and execute them very fast, or whether it would have hugely complex instructions but execute fewer instructions per second (CISC).

      The real development, though, was that early RISC advocates did ISA design differently to the then-established practice. Established practice was to design a set of instructions that would make programming in assembly easy, and to then implement that; RISC advocates instead suggested that you should start with the smallest set of instructions to make it possible to program the system, and then only add new instructions if you could demonstrate an improvement to performance by doing so.

      • MGetz 1

        I think that’s why I’m so amused by the whole thing. When RISC was first described pipelines were largely in order, superscalar was just starting to be a thing, and nobody had really hit a hard limit on clock speed due to leakage in silicon yet. So to me most RISC fans have lost the plot of what RISC was intended to do as you mentioned. The Pentium 4 and the various PowerPC and POWER ISA CPUs put an end to the ‘clock speeds will save us’ idea, however, due to leakage and the fact that there is a limit to how fast you can drive the gate of a transistor. Traditional RISC trades complexity of instructions for complexity of dependency chains, however, and that adds complexity to silicon… the very thing RISC was designed to avoid in the original description. Itanium tried to solve this using VLIW which just spread the pain to the compiler; Arm32 did much the same in some ways with making every instruction largely conditional. Both features were largely ‘failures’. Arm32 because nobody and no compiler ever really used it and the superscalar was largely faster anyway. Itanium because well… lots of reasons that are documented everywhere.

        This is why I think RISC-V is a dead end, it sticks to traditional RISC and thus traps itself in questionable dependency chains. Whereas AArch64 was designed from the ground up largely to break those chains as quickly as possible to allow the CPU to use its superscalar capabilities to the max.

        From my perspective the restrictive x86 memory model is holding back x86-64 more than the variable length instructions. This is because it constrains the ability of the CPU to retire instructions until the previous reads and writes are visible to the CPU. If x86-64 were to introduce an opt-in ‘loose mode’ where that constraint is no longer enforced it would allow the already robust superscalar infrastructure to bottleneck on execution port availability in most cases.
