Barriers are important on ARM-family systems because it has a weak memory model compared to the x86 series that most people are familiar with.
We start with the explicit barrier instructions:
dmb ish ; data memory barrier
dsb ish ; data synchronization barrier
isb sy ; instruction synchronization barrier
The data memory barrier ensures that all preceding writes are issued before any subsequent memory operations (including speculative memory access). In acquire/release terms, it is a full barrier. The instruction does not stall execution; it just tells the memory controller to preserve externally-visible ordering. This is probably the only barrier you will ever seen in user-mode code.
The data synchronization barrier is a data memory barrier, but with the additional behavior of stalling until all outstanding writes have completed. This is typically used before changing memory mappings, such as during context switches, to ensure that any outstanding writes complete to the original memory before it gets unmapped.
The instruction synchronization barrier flushes instruction prefetch. This is typically used if you have generated new code, say by jitting it or paging it in from disk.
All of these barrier instructions take a parameter known as the synchronization domain. In practice, they will be the values I gave in the examples above.
There are some other niche barriers like the “consumption of speculative data barrier” (CSDB) and “physical speculative store bypass barrier” (PSSBB), which I won’t bother going into because you’re not going to see them.
By default, the memory access instructions do not impose any special ordering. But there are variations that let you request acquire or release semantics. We saw the general pattern in the bonus chatter last time:
A– perform the load with acquire semanticsL– perform the store with release semanticsAL– perform the load with acquire semantics and the store with release semantics
The AL version applies only to load-modify-store instructions, which are all optional. But the acquire load and release store are supported by all processors.
; load acquire
ldarb Wt/zr, [Xn/sp] ; byte
ldarh Wt/zr, [Xn/sp] ; halfword
ldar Rt/zr, [Xn/sp] ; word or doubleword
; no register-pair version
; load acquire exclusive
ldaxrb Wt/zr, [Xn/sp] ; byte
ldaxrh Wt/zr, [Xn/sp] ; halfword
ldaxr Wt/zr, [Xn/sp] ; word or doubleword
ldaxp Rt/zr, [Xn/sp] ; pair
; store release
stlrb Wt/zr, [Xn/sp] ; byte
stlrh Wt/zr, [Xn/sp] ; halfword
stlr Wt/zr, [Xn/sp] ; word or doubleword
; no register-pair version
; store release exclusive
stlxrb Ws/zr, Wt/zr, [Xn/sp] ; byte
stlxrh Ws/zr, Wt/zr, [Xn/sp] ; halfword
stlxr Ws/zr, Wt/zr, [Xn/sp] ; word or doubleword
stlxp Rs/zr, Wt/zr, [Xn/sp] ; pair
These special acquire and release versions are handy in the load-locked/store-conditional pattern because they reduce the need for issue explicit barriers.
Here’s how the gcc compiler generates the code:
; sequential consistency interlocked increment and
; acquire-release interlocked increment
@@: ldaxr w8, [x0] ; load acquire from x0
add w8, w8, 1 ; increment
stlxr w9, w8, [x0] ; store it back with release
cbnz @B ; if failed, try again
; acquire-only interlocked increment
@@: ldaxr w8, [x0] ; load acquire from x0
add w8, w8, 1 ; increment
stxr w9, w8, [x0] ; store it back (no release)
cbnz @B ; if failed, try again
; release-only interlocked increment
@@: ldxr w8, [x0] ; load (no acquire) from x0
add w8, w8, 1 ; increment
stlxr w9, w8, [x0] ; store it back with release
cbnz @B ; if failed, try again
; relaxed interlocked increment
@@: ldxr w8, [x0] ; load from x0
add w8, w8, 1 ; increment
stxr w9, w8, [x0] ; store it back
cbnz @B ; if failed, try again
On the other hand, the Microsoft compiler adds additional barriers:
; sequential consistency interlocked increment and
; acquire-release interlocked increment
@@: ldaxr w8, [x0] ; load acquire from x0
add w8, w8, 1 ; increment
stlxr w9, w8, [x0] ; store it back with release
cbnz @B ; if failed, try again
dmb ish ; memory barrier (?)
; acquire-only interlocked increment
@@: ldaxr w8, [x0] ; load acquire from x0
add w8, w8, 1 ; increment
stxr w9, w8, [x0] ; store it back
cbnz @B ; if failed, try again
dmb ish ; memory barrier (?)
; release-only interlocked increment
@@: ldaxr w8, [x0] ; load acquire from x0 (?)
add w8, w8, 1 ; increment
stlxr w9, w8, [x0] ; store it back with release
cbnz @B ; if failed, try again
; no-fence interlocked increment
@@: ldxr w8, [x0] ; load from x0
add w8, w8, 1 ; increment
stxr w9, w8, [x0] ; store it back
cbnz @B ; if failed, try again
Older versions of the Microsoft compiler used a spurious release on the stlxr when generating an acquire-only interlocked increment, but it appears to be fixed in 19.14. The spurious acquire on the release-only interlocked increment, and the mystery memory barrier instructions, are still there in 19.32.
Not sure what the extra barriers are for. Maybe there’s something special about the Windows ABI that requires them? Maybe there’s some subtlety in the architecture that I’m not aware of? I don’t know.
While I’m here, I may as well mention this other instruction that isn’t a barrier, but it’s closely related:
; prefetch memory
prfm kind, [...]
prfum kind, [...] ; force unscaled offset
The addressing mode can include pre- and post-increment.
The kind is a concatenation of a Type, Target, and Policy.
| Category | Value | Meaning |
|---|---|---|
| Type | PLD |
Prefetch for load |
PLI |
Prefetch instruction | |
PLS |
Prefetch for store | |
| Target | L1 |
L1 cache |
L2 |
L2 cache | |
L3 |
L3 cache | |
| Policy | KEEP |
Temporal (load into cache normally) |
STRM |
Streaming, non-temporal (data will be used only once) |
For example, PLDL3STRM means “Prefetch for load into L3 cache for one-time use.”
0 comments