{"id":105307,"date":"2021-06-14T07:00:00","date_gmt":"2021-06-14T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105307"},"modified":"2021-06-14T06:47:03","modified_gmt":"2021-06-14T13:47:03","slug":"20210614-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210614-00\/?p=105307","title":{"rendered":"The ARM processor (Thumb-2), part 11: Atomic access and barriers"},"content":{"rendered":"<p>On the ARM processor, atomic operations are implemented in terms of a load-locked\/store-conditional pair of instructions.<\/p>\n<pre>    LDREX   Rd, [Rn, #imm8]     ; load word from [Rn, #imm8] and acquire exclusively\r\n    STREX   Rd, Rm, [Rn, #imm8] ; store Rm to [Rn, #imm8] if exclusively held\r\n                                ; Rd = 0 on success or 1 on failure\r\n\r\n    ; also LDREXB, LDREXH, LDREXD\r\n    ;      STREXB, STREXH, STREXD\r\n\r\n    CLREX                       ; release exclusive lock\r\n<\/pre>\n<p>The <code>LDREX<\/code> instruction loads a word from the specified address and takes an exclusive lock on the memory. This exclusive lock is broken if any other processor writes to the same address, or if the lock is explicitly cleared. The granularity of the lock is permitted to be as coarse as 2KB.<\/p>\n<p>The <code>STREX<\/code> instruction writes the value <var>Rm<\/var> to <var>Rn<\/var> provided the exclusive lock has not been lost. The <var>Rd<\/var> register is set to 0 if the write succeeded, or 1 if the write failed. The <var>Rd<\/var> register may not be the same register as <var>Rm<\/var>.<\/p>\n<p>The <code>STREX<\/code> is permitted to early-out and return failure due to a lost lock before checking whether the memory at <var>Rn<\/var> is writable.<\/p>\n<p>The <code>LDREX<\/code> and <code>STREX<\/code> instructions support only offset addressing with an unsigned 8-bit offset. (An offset of zero is assumed if none is provided.) 
No pre-indexing or post-indexing allowed.<\/p>\n<p>There are also byte, halfword, and doubleword versions of this pair of instructions. For best results, use the <code>STREX<\/code> variant that matches the <code>LDREX<\/code> variant, and with the same address.<\/p>\n<p>You can explicitly abandon a lock obtained by one of the <code>LDREX<\/code> instructions by issuing a <code>CLREX<\/code> instruction. This is used primarily in kernel mode to ensure that interrupts and context switches cause the lock to be lost: If the user-mode code is interrupted between the <code>LDREX<\/code> and the subsequent <code>STREX<\/code>, you want to make sure the <code>STREX<\/code> fails, rather than accidentally succeeding because it&#8217;s writing to an address that coincidentally matches a previous <code>LDREX<\/code> from the outgoing thread or interrupt.<\/p>\n<p>The atomic memory access instructions require aligned memory. Relaxing alignment enforcement doesn&#8217;t help here. Not that you expect it to: How can the kernel emulate a misaligned lock?<\/p>\n<p>The atomic memory operations are frequently coupled with synchronization primitives. The ARM processor has a rather weak memory model, <a href=\"https:\/\/randomascii.wordpress.com\/2020\/11\/29\/arm-and-lock-free-programming\/\"> so memory barriers are essential in proper multithreaded code<\/a>.<\/p>\n<pre>    DMB     ish     ; data memory barrier\r\n    DSB     ish     ; data synchronization barrier\r\n    ISB     sy      ; instruction synchronization barrier\r\n<\/pre>\n<p>The data memory barrier ensures that all preceding writes are issued before any subsequent memory operations (including speculative memory access). In acquire\/release terms, it is a full barrier. The instruction does not stall execution; it just tells the memory controller to preserve externally-visible ordering. 
This is probably the only barrier you will ever see in user-mode code.<\/p>\n<p>The data synchronization barrier is a data memory barrier, but with the additional behavior of stalling until all outstanding writes have completed. This is typically used during context switches.<\/p>\n<p>The instruction synchronization barrier flushes instruction prefetch. This is typically used if you have generated new code, say by jitting it or paging it in from disk.<\/p>\n<p>All of the barrier instructions take a parameter known as the <i>synchronization domain<\/i>. In practice, they will be the values I gave in the examples above.<\/p>\n<p>A typical atomic sequence, complete with memory barriers, looks like this:<\/p>\n<pre>    dmb     ish             ; memory barrier\r\n\r\n@@: ldrex   r2, [r0]        ; load r2 from [r0] and lock\r\n\r\n    ; calculate new value - in this example, we increment\r\n    adds    r2, r2, #1      ; increment it\r\n\r\n    strex   r3, r2, [r0]    ; store if lock is still held\r\n    cmp     r3, #0          ; did it succeed?\r\n    bne     @B              ; N: try again\r\n\r\n    dmb     ish             ; memory barrier\r\n<\/pre>\n<p>Finally, we have some instructions that provide hints to the processor about future memory usage:\u00b2<\/p>\n<pre>    PLD     [Rn, #imm]      ; preload data\r\n    PLDW    [Rn, #imm]      ; preload data with intent to write\r\n    PLI     [Rn, #imm]      ; preload instructions\r\n<\/pre>\n<p>Processors are not required to honor these instructions and may treat them as nop. (Pre-index and post-index are not supported, so you don&#8217;t have to worry about accidentally nop&#8217;ing out the write-back.) If the address being prefetched is not valid, the request is ignored.<\/p>\n<p>Okay, enough about memory. 
Next time, we&#8217;ll look at control transfer instructions.<\/p>\n<p><b>Bonus chatter<\/b>: Classic ARM also contains two deprecated pseudo-atomic instructions:<\/p>\n<pre>    ; swap\r\n    swp     Rt, Rt2, [Rn]   ; temp = [Rn]\r\n                            ; [Rn] = Rt2\r\n                            ; Rt = temp\r\n\r\n    ; swap byte\r\n    swpb    Rt, Rt2, [Rn]   ; temp = byte at [Rn]\r\n                            ; byte at [Rn] = Rt2\r\n                            ; Rt = temp (zero-extended)\r\n<\/pre>\n<p>These are pseudo-atomic instructions because the processor promises that it will not split the load and store, but only if no TLB eviction occurs, and it makes no promises about what other processors or devices may see.<\/p>\n<p>These instructions are formally deprecated by ARM, and operating systems are permitted to disable them outright. Windows disables them, which is redundant because the instructions aren&#8217;t available in Thumb-2 mode anyway. I guess Windows wants to make extra sure you don&#8217;t use them.<\/p>\n<p>\u00b9 Even if alignment enforcement is relaxed, you will still get an alignment exception for misaligned doubleword access or any instruction that reads or writes multiple registers.<\/p>\n<p>\u00b2 Internally, these instructions reuse the encodings for loading partial values into <var>pc<\/var>, something you would never do in sane code. 
This is an example of how Thumb-2 disallows certain operations with <var>pc<\/var> and reuses the instruction encodings for other purposes.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Instruction<\/th>\n<th>Encoded as if<\/th>\n<\/tr>\n<tr>\n<td><code>PLD\u00a0 [...]<\/code><\/td>\n<td><code>LDRB\u00a0 pc, [...]<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>PLDW [...]<\/code><\/td>\n<td><code>LDRH\u00a0 pc, [...]<\/code><\/td>\n<\/tr>\n<tr>\n<td><code>PLI\u00a0 [...]<\/code><\/td>\n<td><code>LDRSB pc, [...]<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n","protected":false},"excerpt":{"rendered":"<p>Doing things one at a time.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-105307","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Doing things one at a 
time.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105307","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105307"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105307\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105307"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105307"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105307"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}