The SuperH-3, part 10: Control transfer
Yes, we have once again reached the point where we have to talk about branch delay slots. I will defer to the background information I provided when the issue arose in the discussion of the MIPS R4000. Basically, the branch delay slot is an instruction that occurs in the instruction stream after a branch. That instruction executes even when the branch is taken. (Of course, if the branch is not taken, the instruction executes normally as well.)
On the SH-3, the single-instruction branch delay slot is not sufficient to cover for the pipeline bubble created by a branch. Due to the pipeline structure, two instructions have already been fetched by the time the processor determines whether the branch is taken. The first such instruction goes into the branch delay slot, and the second one is converted to a
nop. So even if you fill the branch delay slot, you still get a one-cycle stall for the discarded instruction. Therefore, you should prefer to structure branches so that they are normally not taken.
Okay, here we go.
BT label ; branch if T=1, reach is 256 bytes, squash the delay slot BT/S label ; branch if T=1, reach is 256 bytes BF label ; branch if T=0, reach is 256 bytes, squash the delay slot BF/S label ; branch if T=0, reach is 256 bytes
The branch if true and branch if false test the T flag and branch if it is set (true) or clear (false). This particular branch is interesting because you get to choose whether you want the instruction in the delay slot to execute. Note that you already paid for the delay slot, so choosing not to execute it doesn’t make things run any faster. The processor just converts the instruction to a
nop and you waste a cycle.¹
BRA label ; branch always, reach is 4KB BRAF Rn ; branch to PC + Rn + 4 JMP @Rn ; branch to Rn BSR label ; branch always, reach is 4KB, PR = return address BSRF Rn ; branch to PC + Rn + 4, PR = return address JSR @Rn ; branch to Rn, PR = return address RTS ; branch to PR
These instructions perform unconditional branches, either to a specific address within 4KB (branch always), to an address relative to the current program counter (branch always far), or to an address provided by a register (jump). The
xSR instructions branch to a subroutine and record the return address in the special pr register. And of course after you branch to a subroutine, you need a way to get back, hence
RTS return from subroutine.
The extra +4 in the
BSRF are due to pipelining. By the time the processor determines that the branch needs to be taken, the program counter has already moved ahead two instructions.
The Microsoft compiler doesn’t use the
BSR instruction because the linker is very likely to put the branch target outside the 4KB reach of the
The Microsoft compiler uses the
BRAF instruction in just one specific scenario (which we’ll look at later), and it doesn’t appear to use
BSRF at all. The
BSRF instructions appear to be useful for writing position-independent code.
Watch out: Even though the
JSR instructions use an
@, there is no memory access going on. I don’t know why the mnemonic uses an
Note that the
BF instructions have a very limited reach. If you need to branch further, you’ll have to use a trick like branching to a branch, or reversing the sense of the test to jump over a branch instruction with greater reach.
; BT toofar ; option 1: branch to a branch (trampoline) BT trampoline ... trampoline: BRA toofar+2 delay_slot_instruction ; move first instruction of toofar here ; option 2: reverse the sense and jump over a branch BF skip BRA toofar+2 delay_slot_instruction ; move first instruction of toofar here skip:
The SH-3 deals with branch delay slots slightly differently from the MIPS R4000. The SH-3 temporarily disables interrupts between the branch instruction and its delay slot, so you cannot get interrupted in the branch delay slot.
If an exception occurs on the instruction in the branch delay slot, the exception is raised, and assuming the kernel fixes the problem, execution resumes at the branch instruction. This is safe because the branch instructions are all restartable; the only register modification is to pr, but none of the
xSR instructions consume pr, so it’s okay to re-execute them; you just set pr twice to the same value.
Some instructions are disallowed in a branch delay slot.
- Another branch instruction. Because duh.
TRAPAinstruction. Sorry, no system calls in a branch delay slot. If you want to make a system call and return, you’ll have to code the system call before the
RTSand drop a
nopinto the branch delay slot.
- An instruction that uses PC-relative addressing. Because the program counter has already moved to the branch target, so your PC-relative address isn’t what you think it is.
The last case is subtle. It means that the branch delay slot cannot contain a load of a value from a PC-relative address, nor can you use
MOVA to load the address of a PC-relative value. If you need to pass a large constant as a parameter to a function, you’ll have to do it ahead of the
JSR and find something else to put in the delay slot.
If you put a disallowed instruction in a branch delay slot, the processor will raise an illegal slot instruction exception.
When it comes time to return from a subroutine, you often have two choices. You can use the
RTS instruction or an equivalent
|lds.l @r15+, pr|
|mov.l @r15+, r1|
Both sequences are equivalent: They transfer control to the address popped off the stack. They just use a different register to do it. However, Windows requires that you use the first sequence. This is necessary so that function unwinding can be performed by the kernel in the case of an exception.
It’s probably in your best interest to use the first version anyway, because it will work well with the return address predictor, should the SuperH ever gain one.
Next time we’ll look at atomic operations, more specifically the lack of them.
¹ Technically, you are wasting another cycle, because a taken branch already suffers a loss of one cycle for the discarded second prefetched instruction. You’re increasing the taken-branch cost from one cycle to two.