{"id":96835,"date":"2017-08-17T07:00:00","date_gmt":"2017-08-17T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=96835"},"modified":"2019-03-13T01:15:24","modified_gmt":"2019-03-13T08:15:24","slug":"20170817-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20170817-00\/?p=96835","title":{"rendered":"The Alpha AXP, part 9: The memory model and atomic memory operations"},"content":{"rendered":"<p>The Alpha AXP has a notoriously weak memory model. When a processor writes to memory, the result becomes visible to other processors eventually, but there are very few constraints beyond that. <\/p>\n<p>For example, writes can become visible out of order. One processor writes a value to a location, and then writes a value to another location, and another processor can observe the second write without the first. Similarly, reads can complete out of order. One processor reads a value from a location, then reads from another location, and the result could be that the second read happens before the first.&sup1; <\/p>\n<p>Assume that memory locations <var>x<\/var> and <var>y<\/var> are both initially zero. The following sequence of operations is valid. <\/p>\n<table BORDER=\"0\" CELLSPACING=\"0\" STYLE=\"border-collapse: collapse\">\n<tr>\n<th STYLE=\"border: solid 1px black;width: 50%;padding: 2px\">Processor 1<\/th>\n<th STYLE=\"border: solid 1px black;width: 50%;padding: 2px\">Processor 2<\/th>\n<\/tr>\n<tr>\n<td>write 1 to <var>x<\/var><\/td>\n<td>read <var>y<\/var> yields 1<\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td><code>MB<\/code> (memory barrier)<\/td>\n<\/tr>\n<tr>\n<td>write 1 to <var>y<\/var><\/td>\n<td>read <var>x<\/var> yields 0<\/td>\n<\/tr>\n<\/table>\n<p>The memory barrier instruction <code>MB<\/code> instructs the processor to make all previous loads and stores complete to memory before starting any new loads and stores. 
However, it doesn&#8217;t force other processors to do anything; other processors can still complete their memory operations out of order, and that&#8217;s what happened in the above example. <\/p>\n<p>Similarly, the following sequence is also legal: <\/p>\n<table BORDER=\"0\" CELLSPACING=\"0\" STYLE=\"border-collapse: collapse\">\n<tr>\n<th STYLE=\"border: solid 1px black;width: 50%;padding: 2px\">Processor 1<\/th>\n<th STYLE=\"border: solid 1px black;width: 50%;padding: 2px\">Processor 2<\/th>\n<\/tr>\n<tr>\n<td>write 1 to <var>x<\/var><\/td>\n<td>read <var>y<\/var> yields 1<\/td>\n<\/tr>\n<tr>\n<td><code>MB<\/code> (memory barrier)<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>write 1 to <var>y<\/var><\/td>\n<td>read <var>x<\/var> yields 0<\/td>\n<\/tr>\n<\/table>\n<p>This is also legal because the memory barrier on processor 1 ensures that the value of <var>x<\/var> gets updated before the value of <var>y<\/var>, but it doesn&#8217;t prevent processor 2 from performing the reads out of order. <\/p>\n<p>In order to prevent <var>x<\/var> and <var>y<\/var> from appearing to be updated out of order, <i>both<\/i> sides need to issue memory barriers. Processor 1 needs a memory barrier to ensure that the write to <var>x<\/var> happens before the write to <var>y<\/var>, and processor 2 needs a memory barrier to ensure that the read from <var>y<\/var> happens before the read from <var>x<\/var>. <\/p>\n<p>Okay, onward to atomic operations. <\/p>\n<p>Performing atomic operations on memory requires the help of two new pairs of instructions: <\/p>\n<pre>\n    LDL_L   Ra, disp16(Rb)  ; load locked\n    LDQ_L   Ra, disp16(Rb)\n\n    STL_C   Ra, disp16(Rb)  ; store conditional\n    STQ_C   Ra, disp16(Rb)\n<\/pre>\n<p>The <i>load locked<\/i> instruction performs a traditional read from memory, but also sets the <var>lock_<\/var><var>flag<\/var> and memorizes the physical address in <var>phys_locked<\/var>. 
The processor monitors for any changes to that physical address from any processor, and if a change is detected,&sup2; the <var>lock_<\/var><var>flag<\/var> is cleared. <\/p>\n<p>The <var>lock_<\/var><var>flag<\/var> is also cleared by a variety of other conditions, most notably when the processor returns from kernel mode back to user mode. This means that any hardware interrupt or trap (such as a page fault, or executing an emulated instruction) will clear the <var>lock_<\/var><var>flag<\/var>. It is recommended that operating systems allow at least 40 instructions to execute between timer interrupts. <\/p>\n<p>You can later do a <i>store conditional<\/i> operation which will store a value to a memory address, provided the <var>lock_<\/var><var>flag<\/var> is still set. If so, then the source register is set to 1. If not, then the source register is set to 0 and the memory is left unmodified. Regardless of the result, the <var>lock_<\/var><var>flag<\/var> is cleared. <\/p>\n<p>A typical atomic increment looks like this: <\/p>\n<pre>\nretry:\n    LDL_L   t1, (t0)        ; load locked\n    ADDL    t1, #1, t1      ; increment value\n    STL_C   t1, (t0)        ; store conditional\n                            ; t1 = 1 if store was successful\n    BEQ     t1, failed      ; jump if store failed\n    ... continue execution ...\n\nfailed:\n    BR      zero, retry     ; try again\n<\/pre>\n<p>In the case where the store failed, we jump forward, and then back. Recall that conditional jumps backward are predicted taken, and conditional jumps forward are predicted not taken. If we had simply jumped backward on failure, then the processor would have a branch prediction miss in the common case that there is no contention. <\/p>\n<p>Note that the above sequence does not impose any memory ordering. In practice, you will see a <code>MB<\/code> before and\/or after the atomic sequence in order to enforce acquire and\/or release semantics. 
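<\/p>\n<p>For illustration, here is one way the atomic increment above might be wrapped in barriers to give it both acquire and release semantics. This is only a sketch, not a required idiom; the register and label names are the same invented ones used above: <\/p>\n<pre>\n    MB                      ; release: prior loads and stores complete first\nretry:\n    LDL_L   t1, (t0)        ; load locked\n    ADDL    t1, #1, t1      ; increment value\n    STL_C   t1, (t0)        ; store conditional\n    BEQ     t1, failed      ; jump if store failed\n    MB                      ; acquire: later loads and stores wait for this\n    ... continue execution ...\n\nfailed:\n    BR      zero, retry     ; try again\n<\/pre>\n<p>Which barriers you actually need depends on the semantics you want; a fully relaxed increment needs neither.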
<\/p>\n<p>There are a number of practical rules regarding the <code>LD<u>x<\/u>_L<\/code> and <code>ST<u>x<\/u>_C<\/code> instructions. The most important ones are these: <\/p>\n<ul>\n<li>The <code>ST<u>x<\/u>_C<\/code> should be to the same address     as the most recently preceding <code>LD<u>x<\/u>_L<\/code>.     This isn&#8217;t a problem in practice because storing back     to the location of the previous load is the intended use of     the instructions.&sup3; <\/li>\n<li>The processor may lose track of your <code>LD<u>x<\/u>_L<\/code>     if you perform any memory access other than a     matching <code>ST<u>x<\/u>_C<\/code>,     or if you perform a branch instruction,     or if you trigger a trap     (such as executing an emulated instruction),     or if you execute more than 20 instructions after     the <code>LD<u>x<\/u>_L<\/code>. <\/li>\n<\/ul>\n<p> Although each <code>ST<u>x<\/u>_C<\/code> should be preceded by a matching <code>LD<u>x<\/u>_L<\/code>, it is legal to perform a <code>LD<u>x<\/u>_L<\/code> with no matching <code>ST<u>x<\/u>_C<\/code>. This can happen with conditional interlocked operations, where you discover after the <code>LD<u>x<\/u>_L<\/code> that the condition is not satisfied and you abandon the interlocked operation. <\/p>\n<p>The second rule says basically that the state created by the <code>LD<u>x<\/u>_L<\/code> instruction is ephemeral. After performing the <code>LD<u>x<\/u>_L<\/code> instruction, do as little work as possible to determine what value you want to store, and then store it right away. You are not allowed to take any branches, but <code>CMOV<u>cc<\/u><\/code> is okay. <\/p>\n<p>The requirement that you get around to the <code>ST<u>x<\/u>_C<\/code> within 20 instructions is a consequence of the requirement on operating systems that they allow 40 instructions to execute between timer interrupts. <\/p>\n<p>Next time, we&#8217;ll do a little exercise based on what we&#8217;ve learned so far. 
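<\/p>\n<p>As an aside, a conditional interlocked operation of the kind described above might look something like this sketch, which atomically increments a value only if it is nonzero (register and label names invented for this example). The branch out of the sequence is taken only when we abandon the operation, so clearing the <var>lock_<\/var><var>flag<\/var> at that point is harmless: <\/p>\n<pre>\nretry:\n    LDL_L   t1, (t0)        ; load locked\n    BEQ     t1, skip        ; value is zero: abandon, perform no store\n    ADDL    t1, #1, t1      ; increment value\n    STL_C   t1, (t0)        ; store conditional\n    BEQ     t1, failed      ; jump if store failed\nskip:\n    ... continue execution ...\n\nfailed:\n    BR      zero, retry     ; try again\n<\/pre>\n<p>In the common case the <code>BEQ<\/code> out of the sequence falls through (it is not taken), so it does not disturb the lock when the store is still wanted.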
<\/p>\n<p>&sup1; Mind you, out-of-order reads are pretty common on all architectures. <a HREF=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/\">Store-to-load forwarding<\/a> means that a speculated read operation to speculatively-written memory can complete before a read operation that occurred notionally earlier in the instruction stream. However, as <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20170815-00\/?p=96816#comment-1306565\">Fabian Giesen notes<\/a>, the x86 has extra logic to avoid getting caught doing so! <\/p>\n<p>&sup2; The architecture permits implementations to be a little sloppy with the change detection. In particular, any modification within 128 bytes of the locked address is permitted to clear the <var>lock_<\/var><var>flag<\/var>. This means that targets of atomic operations should be at least 128 bytes apart in order to minimize the likelihood of false positives. <\/p>\n<p>&sup3; There are complicated rules about what happens if you violate this guideline (including some parts which are left implementation-defined), but they are largely irrelevant because you should just follow the guideline already. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>When does it happen? I have no idea!<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-96835","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>When does it happen? 
I have no idea!<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/96835","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=96835"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/96835\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=96835"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=96835"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=96835"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}