{"id":102798,"date":"2019-08-23T07:00:00","date_gmt":"2019-08-23T14:00:00","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/oldnewthing\/?p=102798"},"modified":"2019-08-07T00:12:46","modified_gmt":"2019-08-07T07:12:46","slug":"20190823-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190823-00\/?p=102798","title":{"rendered":"The SuperH-3, part 15: Code walkthough"},"content":{"rendered":"<p>Once again, we wrap up our processor retrospective series by walking through a simple function from the C runtime library.<\/p>\n<pre>extern FILE _iob[];\r\n\r\nint fclose(FILE *stream)\r\n{\r\n    int result = EOF;\r\n\r\n    if (stream-&gt;_flag &amp; _IOSTRG) {\r\n        stream-&gt;_flag = 0;\r\n    } else {\r\n        int index = stream - _iob;\r\n        _lock_str(index);\r\n        result = _fclose_lk(stream);\r\n        _unlock_str(index);\r\n    }\r\n\r\n    return result;\r\n}\r\n<\/pre>\n<p>Here&#8217;s the corresponding disassembly.<\/p>\n<pre>; int fclose(FILE *stream)\r\n; {\r\n        mov.l   r8,@-r15        ; push r8\r\n        mov.l   r9,@-r15        ; push r9\r\n        mov.l   r10,@-r15       ; push r10\r\n        sts.l   pr,@-r15        ; save return address\r\n\r\n        add     #-16,r15        ; allocate space for outbound calls\r\n<\/pre>\n<p>We start by saving the nonvolatile registers that we are going to be using as local variables in this function. Next, we allocate space on the stack to act as home space for our outbound calls. Most function start this way.<\/p>\n<pre>        mov     r4,r9           ; r9 = stream\r\n<\/pre>\n<p>This function enregisters the <code>stream<\/code> parameter, so save it from the volatile <var>r4<\/var> register into a non-volatile register <var>r9<\/var>. Other register variables are going to be <var>r10<\/var> for <code>result<\/code> and <var>r8<\/var> for <code>index<\/code>.<\/p>\n<pre>;   int result = EOF;\r\n;\r\n;   if (stream-&gt;_flag &amp; _IOSTRG) {\r\n\r\n        mov.l   @(12,r9),r3    ; r3 = stream-&gt;_flag\r\n        mov     #64,r2         ; r2 = _IOSTRG\r\n        and     r2,r3          ; r3 = stream-&gt;_flag &amp; _IOSTRG\r\n        tst     r3,r3          ; is it zero?\r\n        bt\/s    isfile         ; Y: so it's a file\r\n        mov     #-1,r10        ; Set r10 = EOF\r\n<\/pre>\n<p>To test the flag, we load the value into a register (<var>r3<\/var>), load the constant <code>0x40<\/code> into another register so we can <code>AND<\/code> them together and test the result. The <code>TST<\/code> instruction implicitly tests against zero, so a <i>branch if true<\/i> means <i>branch if zero<\/i>. If the result is indeed zero, then we branch to the string handling case, but not before setting <code>r10<\/code> to <code>-1<\/code>, which initializes the <code>result<\/code> variable.<\/p>\n<pre>;       stream-&gt;_flag = 0;\r\n;   }\r\n\r\n        mov     #0,r3          ; prepare to store zero\r\n        bra     done           ; and we're done\r\n        mov.l   r3,@(12,r9)    ; stream-&gt;_flag = 0\r\n                               ; (in the branch delay slot)\r\n<\/pre>\n<p>If we have a string, then we set <code>_flag<\/code> to 0 by loading the constant zero into a register and storing it. Then we jump to the common exit code.<\/p>\n<pre>;   } else {\r\n;       int index = stream - _iob;\r\n\r\nisfile:\r\n        mov.l   @(42,pc),r2 ; #0x10004080 ; load constant address of _iob\r\n        mov     r9,r8          ; r8 = stream\r\n        mov     #-5,r3         ; prepare to shift right 5 places\r\n        sub     r2,r8          ; r8 = stream - _iob (byte offset)\r\n        shad    r3,r8          ; index = stream - _iob (element offset)\r\n<\/pre>\n<p>The <code>FILE<\/code> structure is a convenient 32 bytes in size, so the byte offset can be converted to an element offset by a simple shift. There is no right-shift-by-5 instruction, so we have to do a variable shift. There is no right-shift-by-variable instruction, so we instead do a left shift by the negative, because the left-shift instruction <code>SHAD<\/code> can shift both left <i>or<\/i> right, depending on the sign of the shift amount.<\/p>\n<pre>;       _lock_str(index);\r\n`\r\n        mov.l   @(36,pc),r3 ; #0x10001040 ; address of _lock_str\r\n        jsr     @r3         ; call it\r\n        mov     r8,r4       ; copy parameter from r8 = index\r\n<\/pre>\n<p>To call the <code>_lock_str<\/code> function, we put the <code>index<\/code> parameter in <var>r4<\/var> (in the delay slot), load up the address of the function, and then call it.<\/p>\n<pre>;       result = _fclose_lk(stream);\r\n`\r\n        mov.l   @(36,pc),r3 ; #0x10002130 ; address of _fclose_lk\r\n        jsr     @r3         ; call it\r\n        mov     r9,r4       ; copy parameter from r9 = stream\r\n<\/pre>\n<p>And another function call. Note that the displacement for the <code>@(36,pc)<\/code> is the same offset as the previous one, yet it loads a different value. That&#8217;s because <var>pc<\/var> has changed!<\/p>\n<pre>;       _unlock_str(index);\r\n\r\n        mov.l   @(32,pc),r3 ; #0x100010c8 ; address of _unlock_str\r\n        mov     r8,r4      ; copy parameter from r8 = index\r\n        jsr     @r3        ; call it\r\n        mov     r0,r10     ; save return value of _fclose_lk into result\r\n<\/pre>\n<p>And then call <code>_unlock_str<\/code>. This time, we also have to save the return value from <code>_fclose_lk<\/code> so we can return it from the function.<\/p>\n<pre>;   }\r\n;   return result;\r\n; }\r\n\r\ndone:\r\n        add     #16,r15    ; clean the stack\r\n        mov     r10,r0     ; put return value into r0 register\r\n        lds.l   @r15+,pr   ; pop return address\r\n        mov.l   @r15+,r10  ; pop r10\r\n        mov.l   @r15+,r9   ; pop r9\r\n        rts                ; return to caller\r\n        mov.l   @r15+,r8   ; pop r8\r\n<\/pre>\n<p>And we reach the function exit. We put the return value in the <var>r0<\/var> register, because that&#8217;s what the calling convention dictates. And we undo the stack operations we performed in the function prologue: Clean the stack and pop off the registers.<\/p>\n<p>But wait, we&#8217;re not done yet. We have those constants in the code segment that we need to generate.<\/p>\n<pre>        .data.l     _iob\r\n        .data.l     _lock_str\r\n        .data.l     _fclose_lk\r\n        .data.l     _unlock_str\r\n<\/pre>\n<p>When you look at the disassembly, these data bytes are going to be disassembled as if they were code, because the disassembler doesn&#8217;t know that they&#8217;re actually data. You just have to understand that nonsense instructions after an unconditional branch are likely to be data.<\/p>\n<p><b>Bonus chatter<\/b>: Here&#8217;s my attempt to hand-optimize the assembly.<\/p>\n<p>First observation is that enregistering a variable that is used only once costs the same as spilling it. If you spill it, you write it to memory once and load it from memory once. If you enregister it, you write the original register to memory once, and restore it from memory once. Either way, you perform one read and one write. This means that the <code>stream<\/code> variable may as well be spilled.<\/p>\n<p>Second observation is that there is really only one interesting live variable across each of the calls. Either we are saving the index, or saving the result. So we can use the same register to hold both.<\/p>\n<p>And the third observation is that the compiler didn&#8217;t take advantage of the free home space.<\/p>\n<pre>        mov.l   r8,@(12,r15)    ; save r8 in parameter 4 home space\r\n        sts.l   pr,@(8,r15)     ; save pr in parameter 3 home space\r\n        mov.l   r4,@(4,r15)     ; save stream in parameter 2 home space\r\n<\/pre>\n<p>I have 16 bytes of free memory, so I use it instead of pushing values onto the stack. I used 12 bytes of my home space, so I need to allocate 12 bytes of stack to get myself back up to 16 bytes of home space for the outbound function calls. I&#8217;ll interleave that with the next sequence of instructions to try to avoid a load stall.<\/p>\n<pre>        mov.l   @(12,r4),r3     ; r3 = stream-&gt;_flag\r\n        add     #-12,r15        ; allocate space for outbound calls\r\n        mov     #64,r2          ; r2 = _IOSTRG\r\n        and     r2,r3           ; r3 = stream-&gt;_flag &amp; _IOSTRG\r\n        tst     r3,r3           ; is it zero?\r\n        mov     #-1,r0          ; return value is EOF (if it's a string)\r\n        bf      isstring        ; N: so it's a string\r\n<\/pre>\n<p>The code to test the flag hasn&#8217;t really changed, but I moved the stack pointer adjustment into this sequence to avoid the stall that occurs when we try to use <var>r3<\/var> too soon after loading it from memory. This delay of the stack pointer adjustment is legal because we are allowed to advance instructions into the prologue provided they are not jumps and do not modify nonvolatile registers.<\/p>\n<p>There is a stall between the <code>TST<\/code> and the <code>BF<\/code> because we are consuming flags immediately after generating them, so I slip a <code>MOV<\/code> instruction in there. The value is used only if the branch is taken, but it does no harm in the fallthrough case, and we may as well try it, since it&#8217;s a free instruction due to the stall.<\/p>\n<pre>;       int index = stream - _iob;\r\n;       _lock_str(index);\r\n\r\n        mov.l   #_iob,r2        ; r2 = address of _iob\r\n        mov     r4,r8           ; r8 = stream\r\n        mov.l   #_lock_str,r0   ; address of _lock_str\r\n        mov     #-5,r3          ; prepare to shift right 5 places\r\n        sub     r2,r8           ; r8 = stream - _iob (byte offset)\r\n        shad    r3,r8           ; index = stream - _iob (element offset)\r\n        jsr     @r0             ; call _lock_str\r\n        mov     r8,r4           ; copy parameter from r8 = index\r\n<\/pre>\n<p>The code to calculate the index hasn&#8217;t really changed, but I interleave it with the preparation to call <code>_lock_str<\/code> to avoid a load stall.<\/p>\n<pre>;       result = _fclose_lk(stream);\r\n`\r\n        mov.l   #_fclose_lk,r3  ; address of _fclose_lk\r\n        jsr     @r3             ; call it\r\n        mov     @(20,r15),r4    ; parameter 1 is the stream\r\n<\/pre>\n<p>This is the same as before, except we load the stream from memory because we didn&#8217;t dedicate a register to it. This does mean that if the <code>_fclose_lk<\/code> function tries to access its parameter within its first two instructions, it will suffer a load stall. (Normally, we&#8217;d have to count four instructions, but there is a one-cycle pipeline bubble on a taken branch, so that sucks up two of the instructions.) However, <code>_fclose_lk<\/code> is almost certainly going to have at least one register variable, so those first two instructions are going to be occupied by spilling <var>r8<\/var> and <var>pr<\/var>. The earliest it is likely to access <var>r4<\/var> is its third instruction, so we&#8217;re safe.<\/p>\n<pre>;       _unlock_str(index);\r\n\r\n        mov.l   #_unlock_str,r3 ; address of _unlock_str\r\n        mov     r8,r4           ; copy parameter from r8 = index\r\n        jsr     @r3             ; call it\r\n        mov     r0,r8           ; save return value of _fclose_lk into r8\r\n<\/pre>\n<p>The trick here is that the <code>result<\/code> variable becomes live at the same moment that <code>index<\/code> becomes dead, so we can use the same register <var>r8<\/var> for both of them. After the function returns, we put the saved value back into <var>r0<\/var> so we can return it.<\/p>\n<pre>        bra     done            ; to common exit code\r\n        mov     r8,r0           ; put result back into r0 so we can return it\r\n<\/pre>\n<p>After <code>_unlock_str<\/code> returns, we go to our common exit code, with the desired return value in <var>r0<\/var>.<\/p>\n<pre>;   int result = EOF;\r\n;   stream-&gt;_flag = 0;\r\n\r\nisstring:\r\n        mov      #0,r1          ; value to store into stream-&gt;_flag\r\n        mov      r1,@(12,r4)    ; stream-&gt;_flag = 0\r\n                                ; r0 is already -1\r\n<\/pre>\n<p>In the string case, we just zero out the <code>_flag<\/code> and return <code>-1<\/code>, which we preloaded into <var>r0<\/var> prior to the branch into this code path. Then we fall through to the common exit code.<\/p>\n<pre>done:\r\n        lds.l   @(20,r15),pr    ; recover return address\r\n        add     #12,r15         ; clean the stack\r\n        rts                     ; return to caller\r\n        mov.l   @(12,r15),r8    ; restore r8\r\n<\/pre>\n<p>And we&#8217;re done. Our epilogue code is rather brief because we already put the desired return value in the <var>r0<\/var> register, and because we didn&#8217;t have a lot of saved registers to restore. I put the <code>add<\/code> after the <code>lds.l<\/code> because I&#8217;m going to stall on the load delay, so I may as well get a free instruction out of it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Taking this thing for a spin.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-102798","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>Taking this thing for a spin.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102798","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=102798"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102798\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=102798"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=102798"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=102798"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}