{"id":102788,"date":"2019-08-16T07:00:00","date_gmt":"2019-08-16T14:00:00","guid":{"rendered":"http:\/\/devblogs.microsoft.com\/oldnewthing\/?p=102788"},"modified":"2019-09-13T21:44:54","modified_gmt":"2019-09-14T04:44:54","slug":"20190816-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190816-00\/?p=102788","title":{"rendered":"The SuperH-3, part 10: Control transfer"},"content":{"rendered":"<p>Yes, we have once again reached the point where we have to talk about branch delay slots. I will defer to the background information I provided <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/\"> when the issue arose in the discussion of the MIPS R4000<\/a>. Basically, the branch delay slot is an instruction that occurs in the instruction stream after a branch. That instruction executes even when the branch is taken. (Of course, if the branch is not taken, the instruction executes normally as well.)<\/p>\n<p>On the SH-3, the single-instruction branch delay slot is not sufficient to cover for the pipeline bubble created by a branch. Due to the pipeline structure, two instructions have already been fetched by the time the processor determines whether the branch is taken. The first such instruction goes into the branch delay slot, and the second one is converted to a <code>nop<\/code>. So even if you fill the branch delay slot, you still get a one-cycle stall for the discarded instruction. 
Therefore, you should prefer to structure branches so that they are normally not taken.<\/p>\n<p>Okay, here we go.<\/p>\n<pre>    BT      label   ; branch if T=1, reach is 256 bytes, squash the delay slot\r\n    BT\/S    label   ; branch if T=1, reach is 256 bytes\r\n\r\n    BF      label   ; branch if T=0, reach is 256 bytes, squash the delay slot\r\n    BF\/S    label   ; branch if T=0, reach is 256 bytes\r\n<\/pre>\n<p>The <i>branch if true<\/i> and <i>branch if false<\/i> instructions test the <var>T<\/var> flag and branch if it is set (true) or clear (false). This particular branch is interesting because you get to choose whether you want the instruction in the delay slot to execute. Note that you already paid for the delay slot, so choosing not to execute it doesn&#8217;t make things run any faster. The processor just converts the instruction to a <code>nop<\/code> and you waste a cycle.\u00b9<\/p>\n<pre>    BRA     label   ; branch always, reach is 4KB\r\n    BRAF    Rn      ; branch to PC + Rn + 4\r\n    JMP     @Rn     ; branch to Rn\r\n\r\n    BSR     label   ; branch always, reach is 4KB, PR = return address\r\n    BSRF    Rn      ; branch to PC + Rn + 4, PR = return address\r\n    JSR     @Rn     ; branch to Rn, PR = return address\r\n\r\n    RTS             ; branch to PR\r\n<\/pre>\n<p>These instructions perform unconditional branches, either to a specific address within 4KB (<i>branch always<\/i>), to an address relative to the current program counter (<i>branch always far<\/i>), or to an address provided by a register (<i>jump<\/i>). The <code>xSR<\/code> instructions branch to a subroutine and record the return address in the special <var>pr<\/var> register. And of course after you branch to a subroutine, you need a way to get back, hence <code>RTS<\/code>, <i>return from subroutine<\/i>.<\/p>\n<p>The extra +4 in the <code>BRAF<\/code> and <code>BSRF<\/code> instructions is due to pipelining. 
By the time the processor determines that the branch needs to be taken, the program counter has already moved ahead two instructions (four bytes).<\/p>\n<p>The Microsoft compiler doesn&#8217;t use the <code>BSR<\/code> instruction because the linker is very likely to put the branch target outside the 4KB reach of the <code>BSR<\/code> instruction.<\/p>\n<p>The Microsoft compiler uses the <code>BRAF<\/code> instruction in just one specific scenario (which we&#8217;ll look at later), and it doesn&#8217;t appear to use <code>BSRF<\/code> at all. The <code>BRAF<\/code> and <code>BSRF<\/code> instructions appear to be useful for writing position-independent code.<\/p>\n<p><b>Watch out<\/b>: Even though the <code>JMP<\/code> and <code>JSR<\/code> instructions use an <code>@<\/code>, there is no memory access going on. I don&#8217;t know why the mnemonic uses an <code>@<\/code>.<\/p>\n<p>Note that the <code>BT<\/code> and <code>BF<\/code> instructions have a very limited reach. If you need to branch further, you&#8217;ll have to use a trick like branching to a branch, or reversing the sense of the test to jump over a branch instruction with greater reach.<\/p>\n<pre>    ; what we want: BT toofar (but toofar is out of reach)\r\n\r\n    ; option 1: branch to a branch (trampoline)\r\n    BT      trampoline\r\n...\r\ntrampoline:\r\n    BRA     toofar+2\r\n    delay_slot_instruction ; move first instruction of toofar here\r\n\r\n    ; option 2: reverse the sense and jump over a branch\r\n\r\n    BF      skip\r\n    BRA     toofar+2\r\n    delay_slot_instruction ; move first instruction of toofar here\r\nskip:\r\n<\/pre>\n<p>The SH-3 deals with branch delay slots slightly differently from the MIPS R4000. 
The SH-3 temporarily disables interrupts between the branch instruction and its delay slot, so you cannot get interrupted in the branch delay slot.<\/p>\n<p>If an exception occurs on the instruction in the branch delay slot, the exception is raised, and assuming the kernel fixes the problem, execution resumes at the branch instruction. This is safe because the branch instructions are all restartable; the only register modification is to <var>pr<\/var>, but none of the <code>xSR<\/code> instructions consume <var>pr<\/var>, so it&#8217;s okay to re-execute them; you just set <var>pr<\/var> twice to the same value.<\/p>\n<p>Some instructions are disallowed in a branch delay slot.<\/p>\n<ul>\n<li>Another branch instruction. Because duh.<\/li>\n<li>A <code>TRAPA<\/code> instruction. Sorry, no system calls in a branch delay slot. If you want to make a system call and return, you&#8217;ll have to code the system call before the <code>RTS<\/code> and drop a <code>nop<\/code> into the branch delay slot.<\/li>\n<li>An instruction that uses PC-relative addressing. That&#8217;s because the program counter has already moved to the branch target, so your PC-relative address isn&#8217;t what you think it is.<\/li>\n<\/ul>\n<p>The last case is subtle. It means that the branch delay slot cannot contain a load of a value from a PC-relative address, nor can it contain a <code>MOVA<\/code> to load the address of a PC-relative value. If you need to pass a large constant as a parameter to a function, you&#8217;ll have to load it ahead of the <code>JSR<\/code> and find something else to put in the delay slot.<\/p>\n<p>If you put a disallowed instruction in a branch delay slot, the processor will raise an <i>illegal slot instruction<\/i> exception.<\/p>\n<p>When it comes time to return from a subroutine, you often have two choices. 
You can use the <code>RTS<\/code> instruction or an equivalent <code>JMP @<\/code>:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Allowed<\/th>\n<th>Not allowed<\/th>\n<\/tr>\n<tr>\n<td><tt>lds.l @r15+, pr<\/tt><br \/>\n<tt>rts<\/tt><\/td>\n<td><tt>mov.l @r15+, r1<\/tt><br \/>\n<tt>jmp @r1<\/tt><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Both sequences are equivalent: They transfer control to the address popped off the stack. They just use a different register to do it. However, Windows requires that you use the first sequence. This is necessary so that function unwinding can be performed by the kernel in the case of an exception.<\/p>\n<p>It&#8217;s probably in your best interest to use the first version anyway, because it will work well with the return address predictor, should the SuperH ever gain one.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190819-00\/?p=102790\"> Next time we&#8217;ll look at atomic operations<\/a>, more specifically the lack of them.<\/p>\n<p>\u00b9 Technically, you are wasting <i>another<\/i> cycle, because a taken branch already suffers a loss of one cycle for the discarded second prefetched instruction. 
You&#8217;re increasing the taken-branch cost from one cycle to two.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The return of the branch delay slot.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[2],"class_list":["post-102788","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-history"],"acf":[],"blog_post_summary":"<p>The return of the branch delay slot.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102788","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=102788"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/102788\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=102788"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=102788"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=102788"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}