Last time, we looked at how the 80386 performed stack probing on entry to a function with a large local frame. Today we’ll look at MIPS, which differs in a few ways.
; on entry, t8 = desired stack allocation
chkstk:
sw t7, 0(sp) ; preserve register
sw t8, 4(sp) ; save allocation size
sw t9, 8(sp) ; preserve register
li t9, PerProcessorData ; prepare to get thread bounds
bgez sp, usermode ; branch if running in user mode
subu t8, sp, t8 ; t8 = new stack pointer (in delay slot)
lw t9, KernelStackStart ; get start of kernel stack
b havelimit
subu t9, t9, KERNEL_STACK_SIZE ; t9 = end of stack (in delay slot)
usermode:
lw t9, Teb(t9) ; get pointer to current thread data
nop ; stall on memory load
lw t9, StackLimit(t9) ; t9 = end of stack
nop ; burn the delay slot
havelimit:
sltu t7, t8, t9 ; need to grow the stack?
beq zero, t7, done ; N: then nothing to do
li t7, -PAGE_SIZE ; prepare mask (in delay slot)
and t8, t8, t7 ; round down to nearest page
probe:
subu t9, t9, PAGE_SIZE ; move to next page
bne t8, t9, probe ; loop until done
sw zero, 0(t9) ; touch the memory (in delay slot)
done:
lw t7, 0(sp) ; restore
lw t8, 4(sp) ; restore
j ra ; return
lw t9, 8(sp) ; restore (in delay slot)
The MIPS code is trickier to read because of the pesky delay slot. Recall that the instruction in a delay slot executes even if the branch is taken.¹
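To make the delay slot rule concrete, here is a toy interpreter in C. It is only a sketch: the three-instruction machine, the Insn layout, and the names run_with_delay_slot and delay_slot_demo are all invented for illustration and do not correspond to real MIPS encodings.

```c
#include <stddef.h>

/* Hypothetical toy instruction set, just enough to show one delay slot. */
typedef enum { OP_ADDI, OP_BNE, OP_HALT } Op;

typedef struct {
    Op  op;
    int a;    /* ADDI: destination register; BNE: first register to compare */
    int b;    /* ADDI: source register;      BNE: second register to compare */
    int imm;  /* ADDI: value to add;         BNE: index of the branch target */
} Insn;

/* Execute prog until OP_HALT. A taken branch takes effect only after the
   instruction that follows it (the delay slot) has also executed. */
void run_with_delay_slot(const Insn *prog, int *regs)
{
    size_t pc = 0;
    int    pending = 0;   /* nonzero while in a taken branch's delay slot */
    size_t target = 0;

    for (;;) {
        const Insn *in = &prog[pc];
        size_t next = pc + 1;
        if (pending) {            /* this instruction is the delay slot */
            next = target;
            pending = 0;
        }
        switch (in->op) {
        case OP_ADDI:
            regs[in->a] = regs[in->b] + in->imm;
            break;
        case OP_BNE:
            if (regs[in->a] != regs[in->b]) {
                pending = 1;
                target = (size_t)in->imm;
            }
            break;
        case OP_HALT:
            return;
        }
        pc = next;
    }
}

/* Branch over an instruction. Returns how many times the delay slot ran and
   stores in *skipped_ran how many times the jumped-over instruction ran. */
int delay_slot_demo(int *skipped_ran)
{
    static const Insn prog[] = {
        { OP_BNE,  0, 1, 3 },  /* 0: r0 != r1, so branch to index 3 */
        { OP_ADDI, 2, 2, 1 },  /* 1: delay slot, runs anyway */
        { OP_ADDI, 3, 3, 1 },  /* 2: jumped over */
        { OP_HALT, 0, 0, 0 },  /* 3: stop */
    };
    int regs[4] = { 0, 1, 0, 0 };
    run_with_delay_slot(prog, regs);
    *skipped_ran = regs[3];
    return regs[2];
}
```

Calling delay_slot_demo shows that the instruction right after the taken branch still executes, while the one after it is skipped.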
One thing that is different here is that the code short-circuits if the stack has already expanded the necessary amount. The x86-32 version always touches the stack, even if not necessary, but the MIPS version does the work only if needed. It’s often the case that a program allocates a large buffer on the stack but ends up using only a small portion of it, and the short-circuiting avoids faulting in pages and cache lines unnecessarily. But to do this, we need to know how far the stack has already expanded, and that means checking a different place depending on whether it’s running on a user-mode stack or a kernel-mode stack.
Note that the probe loop faults the memory in by writing to it rather than reading from it.² This is okay because we already know that the write will expand the stack, rather than write into an already-expanded stack, and nobody can be expanding our stack at the same time because the stack belongs to this thread. (If we hadn’t short-circuited, then a write would not be correct, because the write might be writing to an already-present portion of the stack.)
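The short-circuit and the write probes can be modeled in C. This is a sketch of the listing's arithmetic, not the runtime's actual code: the name model_chkstk, the 4096-byte PAGE_SIZE, and the touched array are assumptions made for illustration, and the stack limit is assumed to be page-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size for this sketch */

/*
 * Model of the probe logic in the listing.
 *   sp    - current stack pointer
 *   alloc - requested allocation size (t8 on entry)
 *   limit - lowest address the stack has already expanded to (t9)
 * Records each probed address in touched[] (up to cap entries) and
 * returns the number of pages probed.
 */
size_t model_chkstk(uint32_t sp, uint32_t alloc, uint32_t limit,
                    uint32_t *touched, size_t cap)
{
    uint32_t target = sp - alloc;           /* subu t8, sp, t8 */
    size_t count = 0;

    assert(limit % PAGE_SIZE == 0);         /* assumption: aligned limit */

    if (!(target < limit)) {                /* sltu + beq: short-circuit */
        return 0;                           /* already expanded far enough */
    }
    target &= ~(PAGE_SIZE - 1);             /* and t8, t8, -PAGE_SIZE */
    do {
        limit -= PAGE_SIZE;                 /* subu t9, t9, PAGE_SIZE */
        if (count < cap) {
            touched[count] = limit;         /* sw zero, 0(t9) in delay slot */
        }
        count++;
    } while (limit != target);              /* bne t8, t9, probe */
    return count;
}
```

Because the store sits in the delay slot of the loop branch, it also runs on the final pass, which is why the model writes before it tests the loop condition. Every probed address is below the original limit, so no probe lands on an already-expanded page.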
On the MIPS processor, the address space is architecturally divided exactly in half, with user mode in the lower half and kernel mode in the upper half. The code relies on this by testing the top bit of the stack pointer (the bgez sp, usermode instruction) to detect whether it is running in user mode or kernel mode.³
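The mode test amounts to checking the sign bit of a 32-bit address. A minimal C sketch, assuming 32-bit addresses; the function name is_user_mode_sp is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the address lies in the lower half of the 32-bit address space,
   that is, when its top bit is clear. This is the condition that
   "bgez sp, usermode" branches on. */
bool is_user_mode_sp(uint32_t sp)
{
    return (sp & 0x80000000u) == 0;
}
```

Masking the top bit avoids converting an out-of-range unsigned value to a signed type, whose result is implementation-defined in older C standards.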
Another difference between the MIPS version and the 80386 version is that the MIPS version validates that the stack can expand, but it returns with the stack pointer unchanged. It leaves the caller to do the expansion.
I deduced that a function prologue for a function with a large stack frame might look like this:
sw ra, 12(sp) ; save return address in home space
li t8, 17320 ; large stack frame
jal chkstk ; expand stack if necessary
nop ; burn the delay slot
lw ra, 12(sp) ; recover original return address
sub sp, sp, t8 ; create the local frame
sw ra, nn(sp) ; save return address in its real location
⟦ rest of function as usual ⟧
The big problem is finding a place to save the return address. From looking at the implementation of the chkstk function, I see that it uses home space slots 0, 4, and 8, but not slot 12, so we can use slot 12 to save our return address before the jal overwrites it.
Later, I realized that the code can save the return address in the t9 register, since that is a scratch register according to the Windows calling convention, but the chkstk function nevertheless dutifully preserves it.⁴
move t9, ra ; save return address in t9
li t8, 17320 ; large stack frame
jal chkstk ; expand stack if necessary
nop ; burn the delay slot
sub sp, sp, t8 ; create the local frame
sw t9, nn(sp) ; save return address in its real location
⟦ rest of function as usual ⟧
However, I wouldn’t be surprised if the compiler used the first version, just in case somebody is using a nonstandard calling convention that passes something meaningful in t9.
Next time, we’ll look at PowerPC, which has its own quirk.
¹ Delay slots were a popular feature in early RISC days to avoid a pipeline bubble by saying, “Well, I already went to the effort of fetching and decoding this instruction; may as well finish executing it.” Unfortunately, this clever trick backfired when newer versions of the processor had deeper pipelines or multiple execution units. If you still wanted to avoid the pipeline bubble, you would have to add more delay slots, but three delay slots is getting kind of silly, and it would break compatibility with code written to the v1 processor. Therefore, processor developers just kept the one delay slot for compatibility and lived with the pipeline bubble for the other nonexistent delay slots. (To hide the bubble, they added branch prediction.)
² I don’t know why they chose to write instead of read. Maybe it’s to avoid an Address Sanitizer error about reading from memory that was never written?
³ This code is compiled into the runtime support library that can be used in both user mode and kernel mode, so it needs to detect what mode it’s in. An alternate design would be for the compiler to offer two versions of the function, one for user mode and one for kernel mode, and make you specify at link time which version you wanted.
⁴ The chkstk function preserves all registers so that it can be used even in functions with nonstandard calling conventions. Okay, it preserves almost all registers. It doesn’t preserve the assembler temporary at, which is used implicitly by the li instruction. But nobody expects the assembler temporary to be preserved. It also doesn’t preserve the “do not touch, reserved for kernel” registers k0 and k1, which is fine, because the caller shouldn’t be touching them either!