The SuperH-3, part 6: Division

Raymond Chen

The SH-3 does not have a simple “divide two integers please” instruction. Rather, it has a collection of instructions that let you build a division operation yourself.

    DIV0U               ; prepare for unsigned division
    DIV0S   Rm, Rn      ; prepare for signed division Rn ÷ Rm
    DIV1    Rm, Rn      ; generate 1 bit of the quotient Rn ÷ Rm

To begin an integer division operation, you execute either a DIV0U or DIV0S instruction, depending on whether you want a signed or unsigned division.

You then perform a number of DIV1 instructions equal to the number of bits of quotient you need, mixed in with other instructions are outlined in the programmer’s manual. After running the desired number of iterations, the result is in either the Rn register or in the register you accumulated the results into, depending on the specific algorithm you used.

These instructions do not attempt to handle division by zero or division overflow. They will simply generate nonsense results. If preventing division by zero or overflow is important to you, you will have to check for them yourself explicitly.

I’m not going to go into the fine details of how these instructions operate. They use Rm and Rn to record the state of the division, with three additional bits of state recorded in the M, Q, and T flags.

It’s basically magic.

In practice, you won’t see the compiler generate these instructions anyway. Instead, the compiler is going to do one of the following:

If dividing by a constant power of 2, use a shift instruction.
If dividing by a small constant, multiply by 2³²÷n and extract the high 32 bits of the result.
Otherwise, call a helper function in the runtime library.

Phew, that was crazy.

Next time, we’ll return to the relative sanity of bitwise logical operations.

Bonus chatter: If you want to see how one compiler implements division on the SH-3 (and if you are okay with being exposed to GPL source code), you can take a look at how GCC implements division in its runtime library.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

4 comments

Discussion is closed. Login to edit/delete existing comments.

Simon Clarkstone August 12, 2019

I spent a while wondering why this seemed familiar, then I rememberd it is like the GreenArrays F18A's multiply-step instruction. It does a conditional add based on the bottom bit of a register and a shift of two registers joined together, so that you can multiply numbers by repeating the multiply-step instruction (though with NOPs in-between, because the multiply-step instruction has a preceding delay-slot, as does the addition instruction).
The F18A is a weird processor; it's a grid of tiny microprocessors that pass data asynchronously. It was invented by Chuck Moore, the same guy who invented the early Forths (and...
Read more
I spent a while wondering why this seemed familiar, then I rememberd it is like the GreenArrays F18A’s multiply-step instruction. It does a conditional add based on the bottom bit of a register and a shift of two registers joined together, so that you can multiply numbers by repeating the multiply-step instruction (though with NOPs in-between, because the multiply-step instruction has a preceding delay-slot, as does the addition instruction).
The F18A is a weird processor; it’s a grid of tiny microprocessors that pass data asynchronously. It was invented by Chuck Moore, the same guy who invented the early Forths (and IIRC distains ANSI Forth as too complex and over-generalised).

Read less
- Julien Oster August 20, 2019
  
  That F18A thing is interesting. Whatever happened to it?
Zak Larue-Buckley August 12, 2019

I’m just suprised there is a hardware divide at all. Thats quite a luxury!
How did it compare to a pure software integer divide in terms of cycles?
- Fabian Giesen August 12, 2019
  
  From the description of the instructions, there isn't much of one! Getting one bit of quotient at a time is a tell-tale sign of the type of algorithm used: namely, a basic digit recurrence algorithm (either restoring or non-restoring). The main step in either variant is a single integer subtraction and a 1-bit shift per iteration. The subtraction is just a regular ALU operation; the shift register to record the quotient bits and the "glue logic" instructions to initially set up the divide do need some extra hardware, but by putting the responsibility of sequencing the division on the SW...
  Read more
  From the description of the instructions, there isn’t much of one! Getting one bit of quotient at a time is a tell-tale sign of the type of algorithm used: namely, a basic digit recurrence algorithm (either restoring or non-restoring). The main step in either variant is a single integer subtraction and a 1-bit shift per iteration. The subtraction is just a regular ALU operation; the shift register to record the quotient bits and the “glue logic” instructions to initially set up the divide do need some extra hardware, but by putting the responsibility of sequencing the division on the SW side, you get away with quite little extra hardware to support the division indeed!
  Getting 1 cycle per quotient bit plus setup (for 32:16 divisions; 32:32 seem to work out to two cycles per bit plus setup) for this tiny number of extra HW is pretty solid! For comparison, the fastest pure SW divide routines based on digit-recurrence for the old ARMs (which didn’t have dividers) I know came out around 3 cycles per quotient bit (plus setup). Then a bit later (starting around ARM7TDMI or maybe it was ARM9E?) they had fairly fast multipliers but still no dividers and the best way to do large divisions used a multiply-based Newton-Raphson iteration. Still came out above 1 cycle/bit for most cases though. (And really leaned on the fast multiplier to do so.)
  
  Read less