The SuperH-3, part 10: Control transfer

Raymond Chen

Yes, we have once again reached the point where we have to talk about branch delay slots. I will defer to the background information I provided when the issue arose in the discussion of the MIPS R4000. Basically, the branch delay slot is an instruction that occurs in the instruction stream after a branch. That instruction executes even when the branch is taken. (Of course, if the branch is not taken, the instruction executes normally as well.)

On the SH-3, the single-instruction branch delay slot is not sufficient to cover for the pipeline bubble created by a branch. Due to the pipeline structure, two instructions have already been fetched by the time the processor determines whether the branch is taken. The first such instruction goes into the branch delay slot, and the second one is converted to a nop. So even if you fill the branch delay slot, you still get a one-cycle stall for the discarded instruction. Therefore, you should prefer to structure branches so that they are normally not taken.

Okay, here we go.

    BT      label   ; branch if T=1, reach is 256 bytes, squash the delay slot
    BT/S    label   ; branch if T=1, reach is 256 bytes

    BF      label   ; branch if T=0, reach is 256 bytes, squash the delay slot
    BF/S    label   ; branch if T=0, reach is 256 bytes

The branch-if-true and branch-if-false instructions test the T flag and branch if it is set (true) or clear (false). These branches are interesting because you get to choose whether the instruction in the delay slot executes: the /S versions execute it, and the plain versions squash it. Note that you already paid for the delay slot, so choosing not to execute it doesn’t make things run any faster. The processor just converts the instruction to a nop and you waste a cycle.¹
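The cycle accounting for conditional branches can be summarized in a few lines of Python. This is only a sketch of the bookkeeping described in this article (one cycle lost on any taken branch for the discarded second prefetched instruction, plus one more if the delay slot is squashed), not measured timing. The function name `branch_penalty_cycles` is made up for illustration.

```python
def branch_penalty_cycles(taken: bool, delay_slot_executes: bool) -> int:
    """Wasted cycles for one conditional branch, per the article's accounting.

    taken               -- whether the branch is taken
    delay_slot_executes -- True for BT/S and BF/S, False for plain BT and BF
    """
    if not taken:
        # Fall-through: the instructions after the branch run normally.
        return 0
    # A taken branch always discards the second prefetched instruction.
    penalty = 1
    if not delay_slot_executes:
        # Plain BT/BF also convert the first prefetched instruction to a nop.
        penalty += 1
    return penalty


for name, slot_executes in (("BT/S", True), ("BT", False)):
    for taken in (False, True):
        cycles = branch_penalty_cycles(taken, slot_executes)
        print(f"{name:5} taken={taken!s:5} wasted cycles: {cycles}")
```

The table this prints makes the advice concrete: a branch that is normally not taken costs nothing extra, and a taken branch is cheaper when you can find something useful for the delay slot.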

    BRA     label   ; branch always, reach is 4KB
    BRAF    Rn      ; branch to PC + Rn + 4
    JMP     @Rn     ; branch to Rn

    BSR     label   ; branch always, reach is 4KB, PR = return address
    BSRF    Rn      ; branch to PC + Rn + 4, PR = return address
    JSR     @Rn     ; branch to Rn, PR = return address

    RTS             ; branch to PR

These instructions perform unconditional branches, either to a label within 4KB (branch always), to an address calculated by adding a register to the current program counter (branch always far), or to an address provided by a register (jump). The xSR instructions branch to a subroutine and record the return address in the special pr register. And of course after you branch to a subroutine, you need a way to get back, hence RTS, return from subroutine.

The extra +4 in the BRAF and BSRF instructions is due to pipelining. By the time the processor determines that the branch needs to be taken, the program counter has already moved ahead two instructions, which is four bytes.
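To make the +4 concrete, here is a small Python sketch that works out the value you would need in Rn for a BRAF at a given address to reach a given target. It takes "PC" in the formula to mean the address of the BRAF instruction itself and assumes 32-bit wraparound arithmetic. The helper names and the addresses are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers and addresses are 32 bits wide


def braf_offset(branch_addr: int, target_addr: int) -> int:
    """Value to put in Rn so that BRAF Rn at branch_addr lands on target_addr."""
    # The processor adds 4 on its own, so subtract it when computing Rn.
    return (target_addr - (branch_addr + 4)) & MASK32


def braf_target(branch_addr: int, rn: int) -> int:
    """Destination of BRAF Rn: PC + Rn + 4, truncated to 32 bits."""
    return (branch_addr + rn + 4) & MASK32


# Illustrative addresses only.
forward = braf_offset(0x8C001000, 0x8C001234)
backward = braf_offset(0x8C001000, 0x8C000F00)
print(hex(forward), hex(backward))
print(hex(braf_target(0x8C001000, forward)), hex(braf_target(0x8C001000, backward)))
```

Since the offset depends only on the distance between the branch and its target, the same value works no matter where the code is loaded.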

The Microsoft compiler doesn’t use the BSR instruction because the linker is very likely to put the branch target outside the 4KB reach of the BSR instruction.

The Microsoft compiler uses the BRAF instruction in just one specific scenario (which we’ll look at later), and it doesn’t appear to use BSRF at all. The BRAF and BSRF instructions appear to be useful for writing position-independent code.

Watch out: Even though the JMP and JSR instructions use an @, there is no memory access going on. I don’t know why the mnemonic uses an @.

Note that the BT and BF instructions have a very limited reach. If you need to branch further, you’ll have to use a trick like branching to a branch, or reversing the sense of the test to jump over a branch instruction with greater reach.

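    ; want: BT toofar, but toofar is beyond the reach of BT

    ; option 1: branch to a branch (trampoline)
    BT      trampoline      ; trampoline is within reach of BT
    ...
trampoline:
    BRA     toofar+2        ; skip the instruction that now lives in the delay slot
    delay_slot_instruction  ; move first instruction of toofar here

    ; option 2: reverse the sense and jump over a branch

    BF      skip            ; T=0: skip the long branch
    BRA     toofar+2        ; T=1: take the long branch
    delay_slot_instruction  ; move first instruction of toofar here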
skip:
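Whether you need one of these workarounds comes down to a range check. The Python sketch below assumes the displacement is a signed field (8 bits for BT/BF, 12 bits for BRA) counted in 2-byte units and applied to the branch address plus 4. That assumption is consistent with the 256-byte and 4KB figures above, but the article itself does not spell out the encoding, so treat the exact limits as an assumption. The function names are hypothetical.

```python
def displacement_fits(branch_addr: int, target_addr: int, disp_bits: int) -> bool:
    """True if target_addr is reachable with a signed disp_bits-wide displacement."""
    delta = target_addr - (branch_addr + 4)  # displacement is relative to PC + 4
    if delta % 2 != 0:
        return False  # instructions are 2 bytes, so targets are 2-byte aligned
    units = delta // 2  # assumption: displacement is counted in 2-byte units
    limit = 1 << (disp_bits - 1)
    return -limit <= units < limit


def pick_branch_form(branch_addr: int, target_addr: int) -> str:
    """Choose how to code 'branch to target if T=1' from branch_addr."""
    if displacement_fits(branch_addr, target_addr, 8):
        return "BT target"
    # Option 2 above: the BRA sits 2 bytes after the BF and aims at target+2,
    # so the distance it has to cover is the same as for a direct branch.
    if displacement_fits(branch_addr + 2, target_addr + 2, 12):
        return "BF skip / BRA target+2"
    return "out of BRA reach: load the address into a register and JMP"


for target in (0x1102, 0x1104, 0x2002, 0x2004):
    print(hex(target), "->", pick_branch_form(0x1000, target))
```

Under these assumptions the forward limits are 254 bytes past PC + 4 for BT/BF and 4094 bytes for BRA, which is why the last two targets in the loop fall on either side of the BRA boundary.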

The SH-3 deals with branch delay slots slightly differently from the MIPS R4000. The SH-3 temporarily disables interrupts between the branch instruction and its delay slot, so you cannot get interrupted in the branch delay slot.

If an exception occurs on the instruction in the branch delay slot, the exception is raised, and assuming the kernel fixes the problem, execution resumes at the branch instruction. This is safe because the branch instructions are all restartable; the only register modification is to pr, but none of the xSR instructions consume pr, so it’s okay to re-execute them; you just set pr twice to the same value.
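The restartability argument can be played out in a toy Python model. It assumes that JSR writes pr before the delay slot instruction runs and that the return address is the instruction after the delay slot (the JSR address plus 4). The function name and the addresses are invented for the illustration, and nothing here models a real kernel.

```python
def call_with_faulting_delay_slot(jsr_addr: int, target: int, faults: int):
    """Model a JSR whose delay slot instruction faults `faults` times.

    Returns the final program counter and every value written to pr.
    """
    pr_values = []
    remaining_faults = faults
    while True:
        # JSR records the return address: the instruction after the delay slot.
        pr = jsr_addr + 4
        pr_values.append(pr)
        if remaining_faults > 0:
            # The delay slot instruction faulted. Assume the kernel fixes the
            # problem and resumes at the JSR, not at the delay slot.
            remaining_faults -= 1
            continue
        # The delay slot instruction completed; control reaches the target.
        return target, pr_values


pc, pr_values = call_with_faulting_delay_slot(0x1000, 0x2000, faults=1)
print(hex(pc), [hex(value) for value in pr_values])
```

Every write to pr stores the same value, so re-executing the branch after the fault is harmless.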

Some instructions are disallowed in a branch delay slot.

Another branch instruction. Because duh.
A TRAPA instruction. Sorry, no system calls in a branch delay slot. If you want to make a system call and return, you’ll have to code the system call before the RTS and drop a nop into the branch delay slot.
An instruction that uses PC-relative addressing. Because the program counter has already moved to the branch target, your PC-relative address isn’t what you think it is.

The last case is subtle. It means that the branch delay slot cannot contain a load of a value from a PC-relative address, nor can you use MOVA to load the address of a PC-relative value. If you need to pass a large constant as a parameter to a function, you’ll have to do it ahead of the JSR and find something else to put in the delay slot.

If you put a disallowed instruction in a branch delay slot, the processor will raise an illegal slot instruction exception.
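If you are writing assembly by hand, a quick lint pass can catch these mistakes before the processor does. The Python sketch below checks one line of assembly text against the three rules above. It covers only the instructions mentioned in this article, and it assumes PC-relative operands are written in the form `@(disp, PC)`. The function name `delay_slot_problem` is hypothetical.

```python
import re

# Branch instructions covered in this article (not an exhaustive list).
BRANCHES = {"BT", "BF", "BT/S", "BF/S", "BRA", "BRAF", "JMP",
            "BSR", "BSRF", "JSR", "RTS"}

# Assumption: PC-relative operands look like @(disp, PC).
PC_RELATIVE = re.compile(r"@\(.*,\s*PC\s*\)")


def delay_slot_problem(instruction: str):
    """Return why the instruction is illegal in a delay slot, or None if okay."""
    text = instruction.strip().upper()
    if not text:
        return None
    mnemonic = text.split()[0]
    if mnemonic in BRANCHES:
        return "branch instruction"
    if mnemonic == "TRAPA":
        return "TRAPA"
    if mnemonic == "MOVA" or PC_RELATIVE.search(text):
        return "PC-relative addressing"
    return None


for line in ("nop", "bra label", "trapa #1",
             "mov.l @(8, PC), r4", "mov.l @r15+, r1"):
    print(f"{line:22} {delay_slot_problem(line) or 'okay'}")
```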

When it comes time to return from a subroutine, you often have two choices. You can use the RTS instruction or an equivalent JMP @Rn:

Allowed	Not allowed
`lds.l @r15+, pr`	`mov.l @r15+, r1`
`rts`	`jmp @r1`

Both sequences are equivalent: They transfer control to the address popped off the stack. They just use a different register to do it. However, Windows requires that you use the first sequence. This is necessary so that function unwinding can be performed by the kernel in the case of an exception.

It’s probably in your best interest to use the first version anyway, because it will work well with the return address predictor, should the SuperH ever gain one.

Next time we’ll look at atomic operations, more specifically the lack of them.

¹ Technically, you are wasting another cycle, because a taken branch already suffers a loss of one cycle for the discarded second prefetched instruction. You’re increasing the taken-branch cost from one cycle to two.

8 comments

David Walker August 22, 2019

“Because duh.” I don’t think it’s obvious that a branch can’t be followed by another branch. There would be cascading delays, and maybe it’s not a good idea for whatever reason, but why is it not allowed?
Your trampoline example has a “BF skip” followed by a “BRA toofar+2”. The BRA seems to be in the branch delay slot for the BF, right??????
I have seen sequential Branch instructions, one after the other, where the list is used as a target for a computed branch-into, like a function target list. This would appear to have a branch instruction in the branch delay slot of each previous branch instruction (these might be unconditional branches, but sometimes these are conditional branches with all conditions set to True).
If this architecture doesn’t allow a branch instruction to follow another one, then that can’t be done… I’m confused, but then again, microarchitectures are not my specialty. The series is still interesting.
And why doesn’t the adjustment in “branch to PC + Rn + 4” require a -4 instead of +4, to correct for the PC being 2 instructions (4 bytes) ahead of where we want to branch to?

- Raymond Chen Author August 23, 2019
  
  I should have written “Because inception” rather than “Because duh”. The BF instruction does not have a delay slot; the BRA is consequently not in a delay slot and is therefore legal. The +4 in the PC-relative instructions exists because by the time the processor gets around to doing the work, PC has moved four bytes ahead of where we’re branching from. You need to account for this when you calculate Rn.
Joshua Hudson August 17, 2019

I’d love to knock out the test for PC-relative addressing in the branch delay slot. It would lead to some really clever code by loading one of two values depending on whether the branch was taken or not.

Happy reverse engineering.
smf August 17, 2019

Watch out: Even though the JMP and JSR instructions use an @, there is no memory access going on. I don’t know why the mnemonic uses an @.
The 6502 does the same thing, memory accessing instructions have the form LDA $12 while loading an immediate value is LDA #$12.
JMP & JSR both use $12. I guess it’s because as soon as you load the program counter, the cpu will start reading data from that address. Even though the instruction itself is an immediate load into the program counter & should then include the #

- Julien Oster August 21, 2019
  
  Yeah, it’s bad. Another way to see it, is “the jump target is an address”, and the @ (or lack of # on 6502) signifies whether something is to be treated as an address, but that does not make it less confusing, or inconsistent.
Sean Wang August 16, 2019

I found it interesting that the second instruction in the branch “shadow” isn’t also used, that way in certain situations no cycles can be wasted. I’ve worked on an architecture where different instructions had different numbers of delay slots (up to 3), so that up to 3 unconditional instructions can be placed after a branch to reduce the number of cycles wasted upon taking the branch. The assembler also seemed decently capable of moving instructions into a delay slot for optimization. But this architecture was for a specialized co-processor, not a CPU. So it didn’t have to deal with trap instructions or things like that.

- smf August 17, 2019
  
  It’s probably a compromise based on how the SH-2 worked, not wasting memory on all those nop instructions & not wanting to complicate the compiler further. I like the MIPS R3000, it lets you put all the things in the branch delay slot that you aren’t supposed to. If you put a branch in a branch delay slot then the second branch delay slot is the target of the first. If you do this with interrupts enabled and one hits when it’s executing that instruction, then things go bad.
Dave Sanderman August 16, 2019

i think i enjoy these articles the way people enjoy true crime TV shows. morbid fascination.