{"id":98565,"date":"2018-04-20T07:00:00","date_gmt":"2018-04-20T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=98565"},"modified":"2019-03-13T00:46:14","modified_gmt":"2019-03-13T07:46:14","slug":"20180420-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180420-00\/?p=98565","title":{"rendered":"The MIPS R4000, part 15: Code walkthrough"},"content":{"rendered":"<p>Today we&#8217;re going to take a relatively small function and watch what the compiler did with it. The function is this guy from the C runtime library, although I&#8217;ve simplified it a bit to avoid some distractions. <\/p>\n<pre>\nextern FILE _iob[];\n\nint fclose(FILE *stream)\n{\n    int result = EOF;\n\n    if (stream-&gt;_flag &amp; _IOSTRG) {\n        stream-&gt;_flag = 0;\n    } else {\n        int index = stream - _iob;\n        _lock_str(index);\n        result = _fclose_lk(stream);\n        _unlock_str(index);\n    }\n\n    return result;\n}\n<\/pre>\n<p>Here&#8217;s the corresponding disassembly: <\/p>\n<pre>\n; int fclose(FILE *stream)\n; {\n    addiu   sp,sp,-0x28     ; reserve stack space\n    sw      ra,0x1C(sp)     ; save return address\n    sw      s0,0x18(sp)     ; save s0\n<\/pre>\n<p>On entry, the parameters to a function are passed in <var>a0<\/var> through <var>a3<\/var>. This function has only one parameter, so it goes in <var>a0<\/var>. <\/p>\n<p>We reserve some stack space. The first 16 bytes of that stack space are going to be used as home space for the functions we call, so our usable bytes start at offset <code>0x10<\/code>. We save the <var>s0<\/var> register (because we&#8217;re going to use it as a local variable) and the return address (because it will be modified when we call to other functions). <\/p>\n<pre>\n;   if (stream-&gt;_flag ...\n    lw      t6,0xc(a0)      ; t6 = stream-&gt;_flag\n    move    a1,a0           ; save stream in a1\n<\/pre>\n<p>We are going to test a bit in the <code>stream-&gt;_flag<\/code> member, so we need to load that up. Meanwhile, we save the <code>stream<\/code> parameter in the <var>a1<\/var> register. <\/p>\n<pre>\n;   int result = EOF;\n    li      v1,-1           ; result = -1\n<\/pre>\n<p>Interleaved with the evaluation of the condition we insert the initialization of the <code>result<\/code> local variable. <\/p>\n<pre>\n;   if (stream-&gt;_flag &amp; _IOSTRG) {\n    andi    t7,t6,0x40      ; is the _IOSTRG bit set?\n    bnezl   t7,done         ; yup, then bail\n\n;       stream-&gt;_flag = 0;\n    sw      zero,12(a0)     ; but set stream-&gt;_flag = 0 before we go\n<\/pre>\n<p>We mask off all but the <code>_IOSTRG<\/code> bit and see if it&#8217;s nonzero. If so, then we branch. This branch uses the <code>l<\/code> &#8220;likely&#8221; suffix, so the instruction in the branch delay slot executes only if the branch is taken. Since the <code>true<\/code> branch of the <code>if<\/code> is only one instruction long, the entire contents fit inside the delay slot. How convenient. We can put the <code>true<\/code> branch in the branch delay slot and jump right to the function exit code. If the branch is not taken, then the instruction in the branch delay slot is suppressed. (This suppression behavior is the case only for <code>l<\/code>-type branches.) <\/p>\n<pre>\n;   } else {\n;       int index = stream - _iob;\n    lui     t8,0x77cd       ; load address of _iob into t8\n    addiu   t8,t8,0x7b0     ; t8 = 0x77cd07b0\n    subu    s0,a1,t8        ; calculate raw pointer offset\n    sra     s0,s0,5         ; divide by 32 to get the index (saved in s0)\n<\/pre>\n<p>To calculate the pointer difference, we need to subtract the raw pointers, and in order to do that, we need to load the 32-bit address of the <code>_iob<\/code> array. That takes two instructions. And then we subtract the raw pointers to get the byte difference. And then we divide by <code>sizeof(FILE)<\/code> to get the index. We&#8217;re lucky that the size of a <code>FILE<\/code> is a power of 2, so a shift instruction can be used instead of a full division. <\/p>\n<pre>\n;       _lock_str(index);\n    move    a0,s0           ; Load argument for _lock_str\n    jal     _lock_str\n    sw      a1,0x28(sp)     ; save stream pointer on the stack for later\n<\/pre>\n<p>Now that we&#8217;ve calculated the index, set it up as the argument for the <code>_lock_str<\/code> function and call it. But just before we go, we save <var>a1<\/var> (which is the <code>stream<\/code> parameter) on the stack so we don&#8217;t lose it. The saving of <var>a1<\/var> goes into the branch delay slot, so it executes before the branch is taken, even though it comes after the branch in the instruction stream. <\/p>\n<p>(I don&#8217;t know why the compiler bothered with <var>a1<\/var>. It could have saved <var>a0<\/var> on the stack sooner and put the <code>move a0, s0<\/code> in the branch delay slot.) <\/p>\n<pre>\n;       result = _fclose_lk(stream);\n    jal     _fclose_lk\n    lw      a0,0x28(sp)     ; load argument for _fclose_lk\n<\/pre>\n<p>The next thing to do is to call <code>_fclose_lk<\/code>, and in this case, we load its argument in the branch delay slot. Seeing work happen in the branch delay slot takes getting used to. It always takes a period of adjustment whenever I switch to MIPS after working with some other processor without branch delay slots. <\/p>\n<pre>\n;       _unlock_str(index);\n    move    a0,s0           ; Load argument for _unlock_str\n    jal     _unlock_str\n    sw      v0,0x24(sp)     ; result = return value from _fclose_lk\n<\/pre>\n<p>After the <code>_fclose_lk<\/code>, we call <code>_unlock_str<\/code>, and this time we use the branch delay slot to save the return value from <code>_fclose_lk<\/code> onto the stack before we lose it. (Though the compiler could have done a little better and saved it in <var>s0<\/var>, since <code>index<\/code> is a dead variable at this point.) <\/p>\n<pre>\n;  }\n    lw      v1,0x24(sp)     ; recover result so we can return it\n<\/pre>\n<p>After <code>_unlock_str<\/code> returns, we put <code>result<\/code> into <var>v1<\/var> because that&#8217;s where our cleanup code expects it. <\/p>\n<p>Note that in the instruction stream, you see a store immediately followed by a load from the same location. This makes no sense at first, until you realize that there&#8217;s a function call in between them, because the store is in the branch delay slot. Even though the store and load immediately follow each other in the instruction stream, there&#8217;s an entire function call that happens in between! The store happens before the function call, ad the load happens after. <\/p>\n<pre>\n;  return result;\n; }\ndone:\n    move    v0,v1           ; set return value\n    lw      s0,0x18(sp)     ; restore s0\n    lw      ra,0x1C(sp)     ; recover return address\n    jr      ra              ; return\n    addiu   sp,sp,0x28      ; clean up stack\n<\/pre>\n<p>We set the return value to the <code>result<\/code>, and then we enter the epilogue. In the epilogue, we restore the <var>s0<\/var> register we had been using to hold <code>index<\/code>, and then we load up the return address and jump back to it. We destroy the stack frame in the branch delay slot. <\/p>\n<p>Overall, it&#8217;s pretty straightforward code. The only truly weird thing is the branch delay slot. <\/p>\n<p>But that&#8217;s a huge truly weird thing. <\/p>\n<p>This concludes our tour of the MIPS R4000 processor. I never had to do any significant work with it, so I probably won&#8217;t be able to answer interesting questions. The focus was on learning enough to be able to read valid compiler output, with a few extra notes on the architecture to call out what makes it different.&sup1; <\/p>\n<p><b>Bonus chatter<\/b>: Here&#8217;s my hand-optimized version of the function. <\/p>\n<pre>\n; int fclose(FILE *stream)\n; {\n    addiu   sp,sp,-0x10     ; reserve stack space\n    sw      ra,0x14(sp)     ; save return address\n    sw      s0,0x10(sp)     ; save s0\n    move    s0,a0           ; register variable s0 = stream\n<\/pre>\n<p>My first trick is to reuse the home space. The compiler-generated version didn&#8217;t use the home space for anything other than saving the <code>stream<\/code> parameter. Look, people, it&#8217;s free memory! We need three words of stack, one for the return address, one to save the preserved register <var>s0<\/var>, and one to save the index. We get four words of home space, so we can just use that. The actual stack frame needed by our function is just the home space for the outbound call. <\/p>\n<p>(I wonder whether it&#8217;s legal to overlap your inbound home space with your outbound home space. If our function had needed only two words of stack, would it have been okay for us to write <code>addiu sp, sp, -8<\/code>?) <\/p>\n<pre>\n;   if (stream-&gt;_flag ...\n    lw      t6,0xc(a0)      ; t6 = stream-&gt;_flag\n;   int result = EOF;\n    li      v0,-1           ; result = -1 (avoid load stall)\n<\/pre>\n<p>I&#8217;m precalculating the <code>result<\/code> in anticipation of the early-out. This instruction is basically free because it comes in the load delay slot. If we had tried to use the value in <var>t6<\/var> immediately, the processor would have stalled for a cycle, so we may as well use that cycle productively, even if only speculatively. <\/p>\n<pre>\n;   if (stream-&gt;_flag &amp; _IOSTRG) {\n    andi    t7,t6,0x40      ; is the _IOSTRG bit set?\n    bnezl   t7,done         ; yup, then bail\n;       stream-&gt;_flag = 0;\n    sw      zero,12(a0)     ; but set stream-&gt;_flag = 0 before we go\n<\/pre>\n<p>The test and one-line-body for the <code>_IOSTRG<\/code> test hasn&#8217;t changed, except that we exit with the return value directly in <var>v0<\/var> rather than in <var>v1<\/var>. <\/p>\n<pre>\n;   } else {\n;       int index = stream - _iob;\n    lui     t8,0x77cd       ; load address of _iob into t8\n    addiu   t8,t8,0x7b0     ; t8 = 0x77cd07b0\n    subu    a0,a0,t8        ; calculate raw pointer offset\n    sra     a0,a0,5         ; divide by 32 to get the index (in a0)\n<\/pre>\n<p>The calculation of the index is the same, except that I put it directly into <var>a0<\/var> so it is ready to be passed to <code>_lock_str<\/code>. <\/p>\n<pre>\n;       _lock_str(index);\n    jal     _lock_str\n    sw      a0,0x18(sp)     ; save index\n<\/pre>\n<p>I spent some time trying to decide which should be the register variable: <code>stream<\/code> or <code>index<\/code>. Turns out it doesn&#8217;t matter from a code size point of view: Both are saved and restored exactly once. <\/p>\n<pre>\n;       result = _fclose_lk(stream);\n    jal     _fclose_lk\n    move    a0,s0           ; load argument for _fclose_lk\n<\/pre>\n<p>Calling <code>_fclose_lk<\/code> is simpler because we can move the argument from a register rather than from memory. That way, if the first thing that <code>_fclose_lk<\/code> does is try to use the stream, it won&#8217;t suffer a load delay stall. The first instruction of the called function executes immediately after the branch delay slot. If you put a load instruction in the branch delay slot, then the first instruction of the called function is executing in a load delay slot, and it probably isn&#8217;t expecting that. <\/p>\n<p>So that thinking tipped the scales in favor of keeping <code>stream<\/code> as the register variable. (Of course, that thinking is also based on the older MIPS implementation, which was not dual-issue. The MIPS R4000 processes one instruction every half-cycle. This alters the micro-optimization considerations for both branch delays and load delays.) <\/p>\n<pre>\n;       _unlock_str(index);\n    lw      a0,0x18(sp)     ; Load argument for _unlock_str\n    jal     _unlock_str\n    move    s0,v0           ; result = return value from _fclose_lk\n<\/pre>\n<p>We could have swapped the <code>lw<\/code> and <code>move<\/code>, but I load early and move late in order to avoid loading memory in a branch delay slot, for reasons explained above. <\/p>\n<p>Since the <code>stream<\/code> variable is dead, we can reuse the <var>s0<\/var> register to hold <code>result<\/code>. <\/p>\n<pre>\n;  }\n    move    v0,s0           ; recover result so we can return it\n;  return result;\n; }\ndone:\n    lw      s0,0x10(sp)     ; restore s0\n    lw      ra,0x14(sp)     ; recover return address\n    jr      ra              ; return\n    addiu   sp,sp,0x10      ; clean up stack\n<\/pre>\n<p>And then we clean up and go home. Everybody reached <code>done<\/cODE> with the return value already in <code>v0<\/code>, so all that's left to do is restore our registers and stack. <\/p>\n<p>&sup1; I confess that the excursion into branch delay slots took me away from the focus on how to read valid compiler output. Sorry. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s look at some code.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-98565","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Let&#8217;s look at some code.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/98565","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=98565"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/98565\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=98565"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=98565"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=98565"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}