{"id":98495,"date":"2018-04-12T07:00:00","date_gmt":"2018-04-12T21:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=98495"},"modified":"2021-03-29T15:53:17","modified_gmt":"2021-03-29T22:53:17","slug":"20180412-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180412-00\/?p=98495","title":{"rendered":"The MIPS R4000, part 9: Branch delay slot parlor tricks"},"content":{"rendered":"<p>Last time, we learned about <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20180411-00\/?p=98485\"> the MIPS branch delay slot<\/a>. Today, we&#8217;ll look at some tricks you can play with the branch delay slot.<\/p>\n<p>First trick: It is legal to jump into a branch delay slot. Of course, it&#8217;s not a branch delay slot when you do that. This lets you write some wacky-looking code:<\/p>\n<pre>    B       somewhere           ; unconditional branch\r\nlabel:\r\n    OR      v0, zero, zero      ; v0 = 0\r\n...\r\n<\/pre>\n<p>When the unconditional branch is taken, the <var>v0<\/var> register is set to zero before execution continues at the branch destination.<\/p>\n<p>Meanwhile, if somebody jumps to <code>label<\/code>, then execution continues at <code>label<\/code>, which sets <var>v0<\/var> to zero, and then continues with other stuff.<\/p>\n<p>The instruction at <code>label<\/code> acts both as the branch delay slot for the unconditional branch that precedes it, but it&#8217;s also the first instruction in the basic block if somebody jumps directly into it.<\/p>\n<p>I&#8217;ve seen the opportunity arise for this sort of &#8220;squeeze out a single instruction&#8221; optimization, but the Microsoft compiler doesn&#8217;t take advantage of it. Which is probably a good thing. (For one thing, it makes it much harder for <a href=\"https:\/\/web.archive.org\/web\/20180719082446\/https:\/\/blogs.msdn.microsoft.com\/reiley\/2011\/08\/06\/microsoft-binary-technologies-and-debugging\/\"> binary transformation tools<\/a> to decompose a program into basic blocks and recombine them in different ways.)<\/p>\n<p>Another branch delay slot trick is editing the return address as part of the jump.<\/p>\n<pre>    BAL     somewhere\r\n    ADDIU   ra, ra, 4\r\n\r\n    NOP\r\n\r\n    NOP ; the routine returns here!\r\n<\/pre>\n<p>The <code>BAL<\/code> instruction sets the <var>ra<\/var> register to point to the instruction after the branch delay slot, which in our case is the first <code>NOP<\/code>. But in the branch delay slot, we modify the <var>ra<\/var> register, so that when execution reaches the start of the called procedure, it gets an artificial return address.<\/p>\n<p>I&#8217;m told this sort of trick is used by some compilers to combine a call and an unconditional jump into a call with fake return address. For example, in this code fragment<\/p>\n<pre>    if (...) {\r\n        ...\r\n        function1(...);\r\n    } else {\r\n        ...\r\n    }\r\n    \/\/ resume\r\n    x = 0;\r\n<\/pre>\n<p>the call to <code>function1<\/code> is probably followed by an unconditional jump to skip over the <code>else<\/code> branch.<\/p>\n<pre>    BAL     function1\r\n    NOP                     ; garbage in the branch delay slot\r\n    B       resume\r\n    OR      v0, zero, zero  ; set x = 0\r\n\r\n    ... else-branch code goes here ...\r\n\r\n    OR      v0, zero, zero  ; set x = 0\r\nresume:\r\n    ...\r\n<\/pre>\n<p>A sneaky compiler could <a href=\"http:\/\/www.pagetable.com\/?p=313\"> generate the following code<\/a>:<\/p>\n<pre>    BAL     function1\r\n    <span style=\"color: blue;\">ADDIU   ra, ra, resume - nominal_return<\/span> ; tweak return address\r\nnominal_return:\r\n\r\n    ... else-branch code goes here ...\r\n\r\nresume:\r\n    OR      v0, zero, zero  ; set x = 0\r\n    ...\r\n<\/pre>\n<p>In the branch delay slot, we edit the return address so that when <code>function1<\/code> returns, it resumes execution at <code>resume<\/code> rather than <code>nominal_return<\/code>, thereby avoiding having to execute another branch instruction. (We also were able to remove the duplicate <code>OR v0, zero, zero<\/code> instruction that had been hoisted into the branch delay slot of the unconditional branch.) Note that you get this savings only because you had a garbage <code>NOP<\/code> in the branch delay slot. If there were a useful instruction there, then the transformation would go like this:<\/p>\n<pre>    \/\/ original code\r\n    BAL     function1\r\n    MOVE    a0, r0      ; set parameter for function\r\n    B       resume\r\n    OR      v0, zero, zero  ; set x = 0\r\n\r\n    \/\/ sneaky code\r\n    MOVE    a0, r0      ; set parameter for function\r\n    BAL     function1\r\n    ADDU    ra, ra, ... ; tweak return address\r\n\r\nresume:\r\n    OR      v0, zero, zero  ; set x = 0\r\n<\/pre>\n<p>The instruction in the <code>BAL<\/code> instruction&#8217;s branch delay slot would have to go somewhere else, so you didn&#8217;t save any time (though you still saved one instruction of space by avoiding duplication of the <code>OR v0, zero, zero<\/code>).<\/p>\n<p>But as we saw earlier, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20041216-00\/?p=36973\"> this trick defeats the return address predictor<\/a>,\u00b9 so it&#8217;s probably a bad idea.<\/p>\n<p>Okay, next time, we&#8217;re going to look at the calling convention a bit more closely.<\/p>\n<p><b>Bonus chatter<\/b>: Another extra sneaky trick is reusing the return address. Suppose your interpreter loop goes like this:<\/p>\n<pre>void interpreter_loop(interpreter_state* state)\r\n{\r\n for (;;) {\r\n  uint32_t opcode = *state-&gt;pc;\r\n  state-&gt;pc++;\r\n  jump_table[opcode](state, opcode, state-&gt;pc);\r\n }\r\n}\r\n<\/pre>\n<p>The interpreter loop just dispatches to the next opcode forever. Presumably you would break out of this loop with a <code>longjmp<\/code> or some other nonlocal transfer.<\/p>\n<p>The handler function is given the current interpreter state (so it can update it), and as a courtesy, it also gets the current opcode and a pointer to the next unparsed byte as a convenience.<\/p>\n<pre>interpreter_loop:\r\n    ...\r\n    MOVE    s0, a0       ; s0 points to the interpreter state\r\n    LA      s1, jump_table\r\n    LA      ra, next_opcode ; Footnote \u00b2\r\nnext_opcode:\r\n    LW      v1, 80(s0)  ; get address of next opcode byte\r\n    ADDU    a2, v1, 1   ; move to next opcode byte (also argument for handler)\r\n    LBU     a1, 0(v1)   ; load current opcode byte (also argument for handler)\r\n    SW      a2, 80(s0)  ; save pointer to next opcode byte\r\n    SLL     t0, a1, 2   ; multiple by 4 to index jump table\r\n    ADDU    t0, t0, s1  ; calculate entry in jump table\r\n    LW      v0, 0(t0)   ; load the jump target\r\n    JR      v0          ; jump to handler - will return to next_opcode\r\n    MOVE    a0, s0      ; argument for handler\r\n<\/pre>\n<p>When we call the first handler, <var>ra<\/var> is set equal to <code>next_opcode<\/code>. That handler will do its work and then return to the caller by restoring the return address to the <var>ra<\/var> register and performing a <code>JR ra<\/code>.<\/p>\n<p>This means that when control returns to <code>next_opcode<\/code>, you know that <var>ra<\/var> is equal to <code>next_opcode<\/code>! Since that&#8217;s the value you wanted to be in that register anyway, you can just leave it there when you jump to the next handler, saving you the trouble of having to branch back up to <code>next_opcode<\/code> explicitly.<\/p>\n<p>This seems to be a really clever trick, but it is probably not that useful in practice because of that return address predictor thing.<\/p>\n<p>\u00b9 On the other hand, the MIPS R4000 does not have separate opcodes for &#8220;jump indirect to register&#8221; and &#8220;jump indirect to register for the purpose of returning&#8221;; it uses the <code>JR<\/code> instruction for both cases.<\/p>\n<p>The inability to distinguish whether a jump instruction was semantically a return instruction was a non-issue in the original implementation of the MIPS architecture. It had only a two-stage pipeline, so the single branch delay slot was sufficient to avoid ever needing to predict any branches at all.<\/p>\n<p>The MIPS R4000 had a four-stage pipeline, and a branch misprediction would consequently suffer a 2-cycle stall. The MIPS designers codified existing practice and retroactively declared that if the register operand in the <code>JR<\/code> instruction is <var>ra<\/var>, then it predicts as a subroutine return; otherwise it predicts as a computed jump.<\/p>\n<p>\u00b2 For extra sneakiness (and to save an instruction),\u00b3 the loop preparation code could have been written as<\/p>\n<pre>    LA      s1, jump_table\r\n    BAL     next_opcode\r\n    MOVE    s0, a0       ; s0 points to the interpreter state\r\nnext_opcode:\r\n<\/pre>\n<p>This version lets the processor calculate the address of <code>next_opcode<\/code> by performing a <code>BAL<\/code>. This sets the return address to the instruction after the branch delay slot, which is <code>next_opcode<\/code>, and then jumps to\u2026 <code>next_opcode<\/code>, which is where the instruction would have gone anyway.<\/p>\n<p>\u00b3 Mind you, this size savings costs you a pipeline stall. See footnote 1.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Technically legal, but strange.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-98495","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Technically legal, but strange.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/98495","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=98495"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/98495\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=98495"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=98495"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=98495"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}