{"id":106968,"date":"2022-08-12T07:00:00","date_gmt":"2022-08-12T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=106968"},"modified":"2022-09-16T06:08:42","modified_gmt":"2022-09-16T13:08:42","slug":"20220812-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220812-00\/?p=106968","title":{"rendered":"The AArch64 processor (aka arm64), part 14: Barriers"},"content":{"rendered":"<p>Barriers are important on ARM-family systems because <a href=\"https:\/\/randomascii.wordpress.com\/2020\/11\/29\/arm-and-lock-free-programming\/\">they have a weak memory model<\/a> compared to the x86 series that most people are familiar with.<\/p>\n<p>We start with the explicit barrier instructions:<\/p>\n<pre>    dmb     ish     ; data memory barrier\r\n    dsb     ish     ; data synchronization barrier\r\n    isb     sy      ; instruction synchronization barrier\r\n<\/pre>\n<p>The data memory barrier ensures that all preceding writes are issued before any subsequent memory operations (including speculative memory access). In acquire\/release terms, it is a full barrier. The instruction does not stall execution; it just tells the memory controller to preserve externally-visible ordering. This is probably the only barrier you will ever see in user-mode code.<\/p>\n<p>The data synchronization barrier is a data memory barrier, but with the additional behavior of stalling until all outstanding writes have completed. This is typically used before changing memory mappings, such as during context switches, to ensure that any outstanding writes complete to the original memory before it gets unmapped.<\/p>\n<p>The instruction synchronization barrier flushes instruction prefetch. This is typically used if you have generated new code, say by jitting it or paging it in from disk.<\/p>\n<p>All of these barrier instructions take a parameter known as the <i>synchronization domain<\/i>. 
In practice, it will be one of the values I gave in the examples above.<\/p>\n<p>There are some other niche barriers like the &#8220;consumption of speculative data barrier&#8221; (<code>CSDB<\/code>) and &#8220;physical speculative store bypass barrier&#8221; (<code>PSSBB<\/code>), which I won&#8217;t bother going into because you&#8217;re not going to see them.<\/p>\n<p>By default, the memory access instructions do not impose any special ordering. But there are variations that let you request acquire or release semantics. We saw the general pattern in the bonus chatter last time:<\/p>\n<ul>\n<li><code>A<\/code> &#8211; perform the load with acquire semantics<\/li>\n<li><code>L<\/code> &#8211; perform the store with release semantics<\/li>\n<li><code>AL<\/code> &#8211; perform the load with acquire semantics and the store with release semantics<\/li>\n<\/ul>\n<p>The <code>AL<\/code> version applies only to load-modify-store instructions, which are all optional. But the acquire load and release store are supported by all processors.<\/p>\n<pre>    ; load acquire\r\n    ldarb   Wt\/zr, [Xn\/sp]          ; byte\r\n    ldarh   Wt\/zr, [Xn\/sp]          ; halfword\r\n    ldar    Rt\/zr, [Xn\/sp]          ; word or doubleword\r\n    ; no register-pair version\r\n\r\n    ; load acquire exclusive\r\n    ldaxrb  Wt\/zr, [Xn\/sp]          ; byte\r\n    ldaxrh  Wt\/zr, [Xn\/sp]          ; halfword\r\n    ldaxr   Rt\/zr, [Xn\/sp]          ; word or doubleword\r\n    ldaxp   Rt1\/zr, Rt2\/zr, [Xn\/sp] ; pair\r\n\r\n    ; store release\r\n    stlrb   Wt\/zr, [Xn\/sp]          ; byte\r\n    stlrh   Wt\/zr, [Xn\/sp]          ; halfword\r\n    stlr    Rt\/zr, [Xn\/sp]          ; word or doubleword\r\n    ; no register-pair version\r\n\r\n    ; store release exclusive\r\n    stlxrb  Ws\/zr, Wt\/zr, [Xn\/sp]   ; byte\r\n    stlxrh  Ws\/zr, Wt\/zr, [Xn\/sp]   ; halfword\r\n    stlxr   Ws\/zr, Rt\/zr, [Xn\/sp]   ; word or doubleword\r\n    stlxp   Ws\/zr, Rt1\/zr, Rt2\/zr, [Xn\/sp] ; 
pair\r\n<\/pre>\n<p>These special acquire and release versions are handy in the load-locked\/store-conditional pattern because they reduce the need to issue explicit barriers.<\/p>\n<p>Here&#8217;s how the gcc compiler generates the code:<\/p>\n<pre>    ; sequential consistency interlocked increment and\r\n    ; acquire-release interlocked increment\r\n@@: ldaxr   w8, [x0]                ; load acquire from x0\r\n    add     w8, w8, 1               ; increment\r\n    stlxr   w9, w8, [x0]            ; store it back with release\r\n    cbnz    w9, @B                  ; if failed, try again\r\n\r\n    ; acquire-only interlocked increment\r\n@@: ldaxr   w8, [x0]                ; load acquire from x0\r\n    add     w8, w8, 1               ; increment\r\n    stxr    w9, w8, [x0]            ; store it back (no release)\r\n    cbnz    w9, @B                  ; if failed, try again\r\n\r\n    ; release-only interlocked increment\r\n@@: ldxr    w8, [x0]                ; load (no acquire) from x0\r\n    add     w8, w8, 1               ; increment\r\n    stlxr   w9, w8, [x0]            ; store it back with release\r\n    cbnz    w9, @B                  ; if failed, try again\r\n\r\n    ; relaxed interlocked increment\r\n@@: ldxr    w8, [x0]                ; load from x0\r\n    add     w8, w8, 1               ; increment\r\n    stxr    w9, w8, [x0]            ; store it back\r\n    cbnz    w9, @B                  ; if failed, try again\r\n<\/pre>\n<p>On the other hand, the Microsoft compiler adds additional barriers:<\/p>\n<pre>    ; sequential consistency interlocked increment and\r\n    ; acquire-release interlocked increment\r\n@@: ldaxr   w8, [x0]                ; load acquire from x0\r\n    add     w8, w8, 1               ; increment\r\n    stlxr   w9, w8, [x0]            ; store it back with release\r\n    cbnz    w9, @B                  ; if failed, try again\r\n    dmb     ish                     ; memory barrier (?)\r\n\r\n    ; acquire-only interlocked 
increment\r\n@@: ldaxr   w8, [x0]                ; load acquire from x0\r\n    add     w8, w8, 1               ; increment\r\n    stxr    w9, w8, [x0]            ; store it back\r\n    cbnz    w9, @B                  ; if failed, try again\r\n    dmb     ish                     ; memory barrier (?)\r\n\r\n    ; release-only interlocked increment\r\n@@: ldaxr   w8, [x0]                ; load acquire from x0 (?)\r\n    add     w8, w8, 1               ; increment\r\n    stlxr   w9, w8, [x0]            ; store it back with release\r\n    cbnz    w9, @B                  ; if failed, try again\r\n\r\n    ; no-fence interlocked increment\r\n@@: ldxr    w8, [x0]                ; load from x0\r\n    add     w8, w8, 1               ; increment\r\n    stxr    w9, w8, [x0]            ; store it back\r\n    cbnz    w9, @B                  ; if failed, try again\r\n<\/pre>\n<p>Older versions of the Microsoft compiler used a spurious release on the <code>stlxr<\/code> when generating an acquire-only interlocked increment, but that appears to have been fixed in 19.14. The spurious acquire on the release-only interlocked increment, and the mystery memory barrier instructions, are still there in 19.32.<\/p>\n<p>I&#8217;m not sure what the extra barriers are for. Maybe there&#8217;s something special about the Windows ABI that requires them? Maybe there&#8217;s some subtlety in the architecture that I&#8217;m not aware of? I don&#8217;t know.<\/p>\n<p>While I&#8217;m here, I may as well mention another instruction that isn&#8217;t a barrier but is closely related:<\/p>\n<pre>    ; prefetch memory\r\n    prfm    kind, [...]\r\n    prfum   kind, [...]             
; force unscaled offset\r\n<\/pre>\n<p>The addressing mode cannot include pre- or post-increment.<\/p>\n<p>The <var>kind<\/var> is a concatenation of a Type, Target, and Policy.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Category<\/th>\n<th>Value<\/th>\n<th>Meaning<\/th>\n<\/tr>\n<tr>\n<td rowspan=\"3\">Type<\/td>\n<td><code>PLD<\/code><\/td>\n<td>Prefetch for load<\/td>\n<\/tr>\n<tr>\n<td><code>PLI<\/code><\/td>\n<td>Prefetch instruction<\/td>\n<\/tr>\n<tr>\n<td><code>PLS<\/code><\/td>\n<td>Prefetch for store<\/td>\n<\/tr>\n<tr>\n<td rowspan=\"3\">Target<\/td>\n<td><code>L1<\/code><\/td>\n<td>L1 cache<\/td>\n<\/tr>\n<tr>\n<td><code>L2<\/code><\/td>\n<td>L2 cache<\/td>\n<\/tr>\n<tr>\n<td><code>L3<\/code><\/td>\n<td>L3 cache<\/td>\n<\/tr>\n<tr>\n<td rowspan=\"2\">Policy<\/td>\n<td><code>KEEP<\/code><\/td>\n<td>Temporal (load into cache normally)<\/td>\n<\/tr>\n<tr>\n<td><code>STRM<\/code><\/td>\n<td>Streaming, non-temporal (data will be used only once)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For example, <code>PLDL3STRM<\/code> means &#8220;Prefetch for load into L3 cache for one-time use.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Keeping things in the right order.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-106968","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Keeping things in the right 
order.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106968","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=106968"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/106968\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=106968"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=106968"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=106968"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}