{"id":96825,"date":"2017-08-16T07:00:00","date_gmt":"2017-08-16T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=96825"},"modified":"2019-03-13T01:15:21","modified_gmt":"2019-03-13T08:15:21","slug":"20170816-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20170816-00\/?p=96825","title":{"rendered":"The Alpha AXP, part 8: Memory access, storing bytes and words and unaligned data"},"content":{"rendered":"<p>Storing a byte and word requires a series of three operations: Read the original data, modify the original data to incorporate the byte or word, then write the modified data back to memory. <\/p>\n<p>To assist with the modification are two groups of instructions known as insertion and masking. <\/p>\n<pre>\n    INSBL   Ra, Rb\/#b, Rc  ; Rc =  (uint8_t)Ra &lt;&lt; (Rb\/#b * 8 % 64)\n    INSWL   Ra, Rb\/#b, Rc  ; Rc = (uint16_t)Ra &lt;&lt; (Rb\/#b * 8 % 64)\n    INSLL   Ra, Rb\/#b, Rc  ; Rc = (uint32_t)Ra &lt;&lt; (Rb\/#b * 8 % 64)\n    INSQL   Ra, Rb\/#b, Rc  ; Rc = (uint64_t)Ra &lt;&lt; (Rb\/#b * 8 % 64)\n\n    INSWH   Ra, Rb\/#b, Rc  ; Rc = (uint16_t)Ra &gt;&gt; ((64 - Rb\/#b * 8) % 64)\n    INSLH   Ra, Rb\/#b, Rc  ; Rc = (uint32_t)Ra &gt;&gt; ((64 - Rb\/#b * 8) % 64)\n    INSQH   Ra, Rb\/#b, Rc  ; Rc = (uint64_t)Ra &gt;&gt; ((64 - Rb\/#b * 8) % 64)\n<\/pre>\n<p>These are the inverse of the extraction instructions. Instead of extracting data from a 128-bit value, they move the data into position within a 128-bit value. For example, here&#8217;s a diagram of inserting the long <code>FGHI<\/code> into a 128-bit value: <\/p>\n<pre>\n    high part  low part\n    --------- ---------\n    0000 0FGH           -- INSLH\n              I000 0000 -- INSLL\n<\/pre>\n<p>The last piece of the puzzle is the masking instructions. 
<\/p>\n<pre>\n    MSKBL   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~( (uint8_t)~0 &lt;&lt; (Rb\/#b * 8 % 64))\n    MSKWL   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~((uint16_t)~0 &lt;&lt; (Rb\/#b * 8 % 64))\n    MSKLL   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~((uint32_t)~0 &lt;&lt; (Rb\/#b * 8 % 64))\n    MSKQL   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~((uint64_t)~0 &lt;&lt; (Rb\/#b * 8 % 64))\n\n    MSKWH   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~((uint16_t)~0 &gt;&gt; ((64 - Rb\/#b * 8) % 64))\n    MSKLH   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~((uint32_t)~0 &gt;&gt; ((64 - Rb\/#b * 8) % 64))\n    MSKQH   Ra, Rb\/#b, Rc  ; Rc = Ra &amp; ~((uint64_t)~0 &gt;&gt; ((64 - Rb\/#b * 8) % 64))\n<\/pre>\n<p>These instructions zero out the bytes of a 128-bit value that are about to be replaced by an insertion. <\/p>\n<p>For example, here&#8217;s how the masking of a long would work: <\/p>\n<pre>\n    high part  low part\n    --------- ---------\n    ABCD EFGH IJKL MNOP -- 16-byte value\n          ^^^ ^         -- 4 bytes to be inserted here\n    ABCD E000           -- MSKLH\n              0JKL MNOP -- MSKLL\n<\/pre>\n<p>Putting the pieces together, we see that in order to replace a long in the middle of a 128-bit value, you would use the insertion instructions to place the new value in the correct position, the masking instructions to zero out the bytes that used to be there, and then &#8220;or&#8221; the pieces together. 
<\/p>\n<pre>\n    ; store an unaligned long in t1 to (t0)\n    ; first read the 128-bit value currently in memory\n    LDQ_U   t2,3(t0)                    ; t2 = yyyy yyyD\n    LDQ_U   t5,(t0)                     ; t5 =           CBAx xxxx\n\n    ; build the values to insert\n    INSLH   t1,t0,t4                    ; t4 = 0000 000d\n    INSLL   t1,t0,t3                    ; t3 =           cba0 0000\n\n    ; mask out the values to be replaced\n    MSKLH   t2,t0,t2                    ; t2 = yyyy yyy0\n    MSKLL   t5,t0,t5                    ; t5 =           000x xxxx\n\n    ; \"or\" the new values into place\n    BIS     t2,t4,t2                    ; t2 = yyyy yyyd\n    BIS     t5,t3,t5                    ; t5 =           cbax xxxx\n\n    ; and write the results back out\n    STQ_U   t2,3(t0)                    ; must store high then low\n    STQ_U   t5,(t0)                     ; in case there was no straddling\n<\/pre>\n<p>Extending this pattern to quads and words is left as an exercise. <\/p>\n<p>Notice that in the case where <var>t0<\/var> does not straddle two quads, we perform two reads from the same location, and two writes to the same location. 
Let&#8217;s walk through what happens: <\/p>\n<pre>\n    ; first read the 128-bit value currently in memory\n    ; (which is really the same 64-bit value twice)\n    LDQ_U   t2,3(t0)                    ; t2 = yyDC BAxx\n    LDQ_U   t5,(t0)                     ; t5 = yyDC BAxx\n\n    ; build the values to insert\n    INSLH   t1,t0,t4                    ; t4 = 0000 0000\n    INSLL   t1,t0,t3                    ; t3 = 00dc ba00\n\n    ; mask out the values to be replaced\n    MSKLH   t2,t0,t2                    ; t2 = yyDC BAxx\n    MSKLL   t5,t0,t5                    ; t5 = yy00 00xx\n\n    ; \"or\" the new values into place\n    BIS     t2,t4,t2                    ; t2 = yyDC BAxx\n    BIS     t5,t3,t5                    ; t5 = yydc baxx\n\n    ; and write the results back out\n    STQ_U   t2,3(t0)                    ; write same value back\n    STQ_U   t5,(t0)                     ; write updated value\n<\/pre>\n<p>This highlights some of the weird memory effects of the Alpha AXP. If another thread snuck in and modified the memory at <var>t0 &amp; ~7<\/var> after our loads, those changes would be reverted by the first <code>STQ_U<\/code>, and then the second <code>STQ_U<\/code> writes the updated value. This means that the value changes from <code>yyDCBAxx<\/code> to <code>zzDCBAww<\/code>, and then back to <code>yyDCBAxx<\/code>, and then finally to <code>yydcbaxx<\/code>. The value changes, and then appears to change back to the old value, before finally being updated to a new (sort-of) value. <\/p>\n<p>We&#8217;ll learn more about the Alpha AXP memory model later. <\/p>\n<p>In the case where you are writing a word and you know that it is aligned, then you can avoid having to deal with the 128-bit value and operate within a 64-bit value (because an aligned word will never straddle two quads). 
<\/p>\n<pre>\n    ; store an aligned word in t1 to (t0)\n    ; first read the 64-bit value currently in memory\n    LDQ_U   t5,(t0)                     ; t5 = yyBA xxxx\n\n    ; build the value to insert\n    INSWL   t1,t0,t3                    ; t3 = 00ba 0000\n\n    ; mask out the values to be replaced\n    MSKWL   t5,t0,t5                    ; t5 = yy00 xxxx\n\n    ; \"or\" the new values into place\n    BIS     t5,t3,t5                    ; t5 = yyba xxxx\n\n    ; and write the results back out\n    STQ_U   t5,(t0)\n<\/pre>\n<p>Okay, but what about bytes? Well, bytes can never be misaligned, so we always go through the &#8220;known aligned&#8221; shortcut. <\/p>\n<pre>\n    ; store a byte in t1 to (t0)\n    ; first read the 64-bit value currently in memory\n    LDQ_U   t5,(t0)                     ; t5 = yyyA xxxx\n\n    ; build the value to insert\n    INSBL   t1,t0,t3                    ; t3 = 000a 0000\n\n    ; mask out the values to be replaced\n    MSKBL   t5,t0,t5                    ; t5 = yyy0 xxxx\n\n    ; \"or\" the new values into place\n    BIS     t5,t3,t5                    ; t5 = yyya xxxx\n\n    ; and write the results back out\n    STQ_U   t5,(t0)\n<\/pre>\n<p>Dealing with unaligned memory on the Alpha AXP is very annoying. Notice that updates to words and bytes, even aligned words, are not atomic. We read the entire quad from memory, perform some register calculations, and then write the entire quad back out. If somebody made a change to another byte within the quad in the meantime, we will wipe out that change when we complete our word or byte update. <\/p>\n<p>Next time, we&#8217;ll look at atomic memory operations. <\/p>\n<p><b>Bonus chatter<\/b>: There is one more pair of instructions which operate on the bytes within a register: <code>ZAP<\/code> and <code>ZAPNOT<\/code>. 
<\/p>\n<pre>\n    ZAP     Ra, Rb\/#b, Rc  ; Rc = Ra after zeroing the bytes selected by Rb\/#b\n    ZAPNOT  Ra, Rb\/#b, Rc  ; Rc = Ra after zeroing the bytes selected by ~Rb\/#b\n<\/pre>\n<p>The <code>ZAP<\/code> and <code>ZAPNOT<\/code> instructions treat the low-order 8 bits of the second parameter as references to the corresponding bytes of the <var>Ra<\/var> register: Bit <var>n<\/var> of <var>Rb<\/var>\/#b corresponds to bits <var>n<\/var> &times; 8 through <var>n<\/var> &times; 8 + 7 of <var>Ra<\/var>. The <code>ZAP<\/code> instruction sets the byte to zero if the corresponding bit is set; the <code>ZAPNOT<\/code> instruction sets the byte to zero if the corresponding bit is clear. The other 56 bits of the second parameter are ignored. <\/p>\n<p>For example, <code>ZAP v0, #128, v0<\/code> clears the top byte of <var>v0<\/var>, and <code>ZAPNOT v0, #128, v0<\/code> clears all but the top byte of <var>v0<\/var>. (For some reason, I had trouble remembering which way is which. My trick was to pretend that the <code>ZAPNOT<\/code> instruction is called <code>KEEP<\/code>.) <\/p>\n<p>As a special case, these instructions provide a handy way to zero-extend a register. <\/p>\n<pre>\n    ZAPNOT  Ra, #1, Rc  ; zero-extend byte from Ra to Rc\n    ZAPNOT  Ra, #3, Rc  ; zero-extend word from Ra to Rc\n    ZAPNOT  Ra, #15, Rc ; zero-extend long from Ra to Rc\n<\/pre>\n<p>Note that in the last case, zero-extending a negative long will result in a 32-bit value in non-canonical form. But you hopefully were expecting that; if you want to sign-extend the value (in order to ensure a value in canonical form), you would have done <code>ADDL Ra, #0, Rc<\/code>. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Those little pieces.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-96825","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Those little pieces.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/96825","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=96825"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/96825\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=96825"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=96825"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=96825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}