The SuperH-3, part 10: Control transfer

Raymond Chen

Yes, we have once again reached the point where we have to talk about branch delay slots. I will defer to the background information I provided when the issue arose in the discussion of the MIPS R4000. Basically, the branch delay slot is an instruction that occurs in the instruction stream after a branch. That instruction executes even when the branch is taken. (Of course, if the branch is not taken, the instruction executes normally as well.)

On the SH-3, the single-instruction branch delay slot is not sufficient to cover for the pipeline bubble created by a branch. Due to the pipeline structure, two instructions have already been fetched by the time the processor determines whether the branch is taken. The first such instruction goes into the branch delay slot, and the second one is converted to a nop. So even if you fill the branch delay slot, you still get a one-cycle stall for the discarded instruction. Therefore, you should prefer to structure branches so that they are normally not taken.

Okay, here we go.

    BT      label   ; branch if T=1, reach is 256 bytes, squash the delay slot
    BT/S    label   ; branch if T=1, reach is 256 bytes

    BF      label   ; branch if T=0, reach is 256 bytes, squash the delay slot
    BF/S    label   ; branch if T=0, reach is 256 bytes

The branch if true and branch if false test the T flag and branch if it is set (true) or clear (false). This particular branch is interesting because you get to choose whether you want the instruction in the delay slot to execute. Note that you already paid for the delay slot, so choosing not to execute it doesn’t make things run any faster. The processor just converts the instruction to a nop and you waste a cycle.¹

    BRA     label   ; branch always, reach is 4KB
    BRAF    Rn      ; branch to PC + Rn + 4
    JMP     @Rn     ; branch to Rn

    BSR     label   ; branch always, reach is 4KB, PR = return address
    BSRF    Rn      ; branch to PC + Rn + 4, PR = return address
    JSR     @Rn     ; branch to Rn, PR = return address

    RTS             ; branch to PR

These instructions perform unconditional branches, either to a specific address within 4KB (branch always), to an address relative to the current program counter (branch always far), or to an address provided by a register (jump). The xSR instructions branch to a subroutine and record the return address in the special pr register. And of course after you branch to a subroutine, you need a way to get back, hence RTS return from subroutine.

The extra +4 in the BRAF and BSRF are due to pipelining. By the time the processor determines that the branch needs to be taken, the program counter has already moved ahead two instructions.

The Microsoft compiler doesn’t use the BSR instruction because the linker is very likely to put the branch target outside the 4KB reach of the BSR instruction.

The Microsoft compiler uses the BRAF instruction in just one specific scenario (which we’ll look at later), and it doesn’t appear to use BSRF at all. The BRAF and BSRF instructions appear to be useful for writing position-independent code.

Watch out: Even though the JMP and JSR instructions use an @, there is no memory access going on. I don’t know why the mnemonic uses an @.

Note that the BT and BF instructions have a very limited reach. If you need to branch further, you’ll have to use a trick like branching to a branch, or reversing the sense of the test to jump over a branch instruction with greater reach.

    ; BT toofar

    ; option 1: branch to a branch (trampoline)
    BT      trampoline
    BRA      toofar+2
    delay_slot_instruction ; move first instruction of toofar here

    ; option 2: reverse the sense and jump over a branch

    BF      skip
    BRA     toofar+2
    delay_slot_instruction ; move first instruction of toofar here

The SH-3 deals with branch delay slots slightly differently from the MIPS R4000. The SH-3 temporarily disables interrupts between the branch instruction and its delay slot, so you cannot get interrupted in the branch delay slot.

If an exception occurs on the instruction in the branch delay slot, the exception is raised, and assuming the kernel fixes the problem, execution resumes at the branch instruction. This is safe because the branch instructions are all restartable; the only register modification is to pr, but none of the xSR instructions consume pr, so it’s okay to re-execute them; you just set pr twice to the same value.

Some instructions are disallowed in a branch delay slot.

  • Another branch instruction. Because duh.
  • A TRAPA instruction. Sorry, no system calls in a branch delay slot. If you want to make a system call and return, you’ll have to code the system call before the RTS and drop a nop into the branch delay slot.
  • An instruction that uses PC-relative addressing. Because the program counter has already moved to the branch target, so your PC-relative address isn’t what you think it is.

The last case is subtle. It means that the branch delay slot cannot contain a load of a value from a PC-relative address, nor can you use MOVA to load the address of a PC-relative value. If you need to pass a large constant as a parameter to a function, you’ll have to do it ahead of the JSR and find something else to put in the delay slot.

If you put a disallowed instruction in a branch delay slot, the processor will raise an illegal slot instruction exception.

When it comes time to return from a subroutine, you often have two choices. You can use the RTS instruction or an equivalent JMP @:

Allowed Not allowed
lds.l @r15+, pr
mov.l @r15+, r1
jmp @r1

Both sequences are equivalent: They transfer control to the address popped off the stack. They just use a different register to do it. However, Windows requires that you use the first sequence. This is necessary so that function unwinding can be performed by the kernel in the case of an exception.

It’s probably in your best interest to use the first version anyway, because it will work well with the return address predictor, should the SuperH ever gain one.

Next time we’ll look at atomic operations, more specifically the lack of them.

¹ Technically, you are wasting another cycle, because a taken branch already suffers a loss of one cycle for the discarded second prefetched instruction. You’re increasing the taken-branch cost from one cycle to two.