Windows stack limit checking retrospective: MIPS

Last time, we looked at how the 80386 performed stack probing on entry to a function with a large local frame. Today we’ll look at MIPS, which differs in a few ways.

; on entry, t8 = desired stack allocation

chkstk:
    sw      t7, 0(sp)           ; preserve register
    sw      t8, 4(sp)           ; save allocation size
    sw      t9, 8(sp)           ; preserve register

    li      t9, PerProcessorData ; prepare to get thread bounds
    bgez    sp, usermode        ; branch if running in user mode
    subu    t8, sp, t8          ; t8 = new stack pointer (in delay slot)

    lw      t9, KernelStackStart
    b       havelimit
    subu    t9, t9, KERNEL_STACK_SIZE ; t9 = end of stack (in delay slot)

usermode:
    lw      t9, Teb(t9)         ; get pointer to current thread data
    nop                         ; stall on memory load
    lw      t9, StackLimit(t9)  ; t9 = end of stack
    nop                         ; burn the delay slot

havelimit:
    sltu    t7, t8, t9          ; need to grow the stack?
    beq     zero, t7, done      ; N: then nothing to do
    li      t7, -PAGE_SIZE      ; prepare mask (in delay slot)
    and     t8, t8, t7          ; round down to nearest page

probe:
    subu    t9, t9, PAGE_SIZE   ; move to next page
    bne     t8, t9, probe       ; loop until done
    sw      zero, 0(t9)         ; touch the memory (in delay slot)

done:
    lw      t7, 0(sp)           ; restore
    lw      t8, 4(sp)           ; restore
    j       ra                  ; return
    lw      t9, 8(sp)           ; restore (in delay slot)

The MIPS code is trickier to read because of the pesky delay slot. Recall that the instruction in a delay slot executes even if the branch is taken.¹
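If the delay slot rule is hard to keep in your head, a toy model may help. This is only a sketch of the general "current instruction, next instruction" bookkeeping, not of any real MIPS pipeline. The three-opcode instruction set and the names `Insn` and `run_with_delay_slot` are invented for illustration.

```c
#include <stdint.h>

/* Hypothetical three-instruction machine, just enough to show a delay slot. */
typedef enum { OP_ADDI, OP_BEQZ, OP_HALT } Op;

typedef struct {
    Op op;
    int rd;       /* destination register (ADDI) */
    int rs;       /* source register (ADDI) or register tested (BEQZ) */
    int32_t imm;  /* addend (ADDI) or absolute target index (BEQZ) */
} Insn;

static void run_with_delay_slot(const Insn *prog, int32_t *regs)
{
    int pc = 0;       /* instruction being executed now */
    int next_pc = 1;  /* instruction that executes after it */

    for (;;) {
        Insn insn = prog[pc];
        int after_next = next_pc + 1;   /* default: keep falling through */

        if (insn.op == OP_HALT) {
            return;
        } else if (insn.op == OP_ADDI) {
            regs[insn.rd] = regs[insn.rs] + insn.imm;
        } else if (insn.op == OP_BEQZ && regs[insn.rs] == 0) {
            /* A taken branch redirects the instruction AFTER the next one,
               so the instruction at next_pc (the delay slot) still runs. */
            after_next = insn.imm;
        }

        pc = next_pc;
        next_pc = after_next;
    }
}
```

Feed it a four-instruction program: a taken `BEQZ` to index 3, an `ADDI` on register 1 in the delay slot, an `ADDI` on register 2 that the branch jumps over, and a `HALT` at index 3. Register 1 is incremented and register 2 is not.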

One thing that is different here is that the code short-circuits if the stack has already expanded the necessary amount. The x86-32 version always touches the stack, even if not necessary, but the MIPS version does the work only if needed. It’s often the case that a program allocates a large buffer on the stack but ends up using only a small portion of it, and the short-circuiting avoids faulting in pages and cache lines unnecessarily. But to do this, we need to know how far the stack has already expanded, and that means checking a different place depending on whether it’s running on a user-mode stack or a kernel-mode stack.
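Here is a rough C model of the short-circuit and the probe loop, following the assembly instruction by instruction. It is a sketch under assumptions, not the actual runtime code. I assume 4 KiB pages and a page-aligned stack limit, the name `chkstk_model` is made up, and instead of storing zeros into memory it records which page addresses would have been touched.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000u  /* assumption: 4 KiB pages */

/*
 * Returns how many pages the probe loop would touch. The first max_touched
 * page addresses are recorded in touched[]. stack_limit is the lowest
 * already-committed stack address and is assumed to be page-aligned.
 */
static size_t chkstk_model(uint32_t sp, uint32_t alloc, uint32_t stack_limit,
                           uint32_t *touched, size_t max_touched)
{
    uint32_t new_sp = sp - alloc;                    /* subu t8, sp, t8 */

    if (!(new_sp < stack_limit)) {                   /* sltu + beq zero, t7, done */
        return 0;                                    /* already expanded far enough */
    }

    uint32_t target = new_sp & ~(PAGE_SIZE - 1u);    /* and t8, t8, -PAGE_SIZE */
    uint32_t page = stack_limit;
    size_t count = 0;

    do {
        page -= PAGE_SIZE;                           /* subu t9, t9, PAGE_SIZE */
        if (count < max_touched) {
            touched[count] = page;                   /* stands in for sw zero, 0(t9) */
        }
        count++;
    } while (page != target);                        /* bne t8, t9, probe */

    return count;
}
```

With a stack pointer of 0x00100800 and the 17320-byte frame used later in this article, a limit of 0x00100000 makes the model touch four pages, from 0x000FF000 down to 0x000FC000. If the limit is already down at 0x000F0000, it touches none. The store sits in the delay slot of the loop branch, so it runs on every pass including the last one, which is why the model is a do-while loop.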

Note that the probe loop faults the memory in by writing to it rather than reading from it.² This is okay because we already know that the write will expand the stack, rather than write into an already-expanded stack, and nobody can be expanding our stack at the same time because the stack belongs to this thread. (If we hadn’t short-circuited, then a write would not be correct, because the write might be writing to an already-present portion of the stack.)

On the MIPS processor, the address space is architecturally divided exactly in half with user mode in the lower half and kernel mode in the upper half. The code relies on this by testing the upper bit of the stack pointer to detect whether it is running in user mode or kernel mode.³
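In C terms, the `bgez sp, usermode` test amounts to checking whether the top bit of the 32-bit stack pointer is clear. A minimal sketch follows. The function name `sp_is_user_mode` is made up, and I test the bit directly rather than casting to a signed type, to avoid relying on implementation-defined conversions.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the stack pointer lies in the lower (user-mode) half of the
   32-bit address space, i.e. when it is non-negative as a signed value. */
static bool sp_is_user_mode(uint32_t sp)
{
    return (sp & 0x80000000u) == 0;
}
```

An address such as 0x7FFFFFFC counts as user mode, while 0x80000000 and anything above it counts as kernel mode.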

Another difference between the MIPS version and the 80386 version is that the MIPS version validates that the stack can expand, but it returns with the stack pointer unchanged. It leaves the caller to do the expansion.

I deduced that a function prologue for a function with a large stack frame might look like this:

    sw  ra, 12(sp)      ; save return address in home space
    li  t8, 17320       ; large stack frame
    br  chkstk          ; expand stack if necessary
    lw  ra, 12(sp)      ; recover original return address
    sub sp, sp, t8      ; create the local frame
    sw  ra, nn(sp)      ; save return address in its real location

    ⟦ rest of function as usual ⟧

The big problem is finding a place to save the return address. From looking at the implementation of the chkstk function, I see that it is going to use home space slots 0, 4, and 8, but it doesn’t use slot 12, so we can use it to save our return address before it gets overwritten by the br.

Later, I realized that the code can save the return address in the t9 register, since that is a scratch register according to the Windows calling convention, but the chkstk function nevertheless dutifully preserves it.⁴

    move t9, ra         ; save return address in t9
    li  t8, 17320       ; large stack frame
    br  chkstk          ; expand stack if necessary
    sub sp, sp, t8      ; create the local frame
    sw  t9, nn(sp)      ; save return address in its real location

    ⟦ rest of function as usual ⟧

However, I wouldn’t be surprised if the compiler used the first version, just in case somebody is using a nonstandard calling convention that passes something meaningful in t9.

Next time, we’ll look at PowerPC, which has its own quirk.

¹ Delay slots were a popular feature in early RISC days to avoid a pipeline bubble by saying, “Well, I already went to the effort of fetching and decoding this instruction; may as well finish executing it.” Unfortunately, this clever trick backfired when newer versions of the processor had deeper pipelines or multiple execution units. If you still wanted to avoid the pipeline bubble, you would have to add more delay slots, but three delay slots is getting kind of silly, and it would break compatibility with code written to the v1 processor. Therefore, processor developers just kept the one delay slot for compatibility and lived with the pipeline bubble for the other nonexistent delay slots. (To hide the bubble, they added branch prediction.)

² I don’t know why they chose to write instead of read. Maybe it’s to avoid an Address Sanitizer error about reading from memory that was never written?

³ This code is compiled into the runtime support library that can be used in both user mode and kernel mode, so it needs to detect what mode it’s in. An alternate design would be for the compiler to offer two versions of the function, one for user mode and one for kernel mode, and make you specify at link time which version you wanted.

⁴ The chkstk function preserves all registers so that it can be used even in functions with nonstandard calling conventions. Okay, it preserves almost all registers. It doesn’t preserve the assembler temporary at, which is used implicitly by the li instruction. But nobody expects the assembler temporary to be preserved. It also doesn’t preserve the “do not touch, reserved for kernel” registers k0 and k1, which is fine, because the caller shouldn’t be touching them either!

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

4 comments

Neil Rashbrook March 14, 2026

Given that MIPS has that zero register which can be the target of a dummy read operation to fault the page in, you don’t even need a spare register, so that is somewhat strange.

Simon Farnsworth March 17, 2026 · Edited

I wonder if someone’s heard of the shared zeroed page optimization for systems under heavy paging pressure, and prematurely optimized the code in case Windows NT gets that optimization.

With the shared zeroed page optimization, a zeroed page is reserved for read-only use; when you read from a previously committed but unused page, the shared zero is mapped read-only immediately. You do not get a private page of RAM until later – possibly as late as when you first write to the page, possibly on the next kernel interaction you have.

The idea behind the optimization is that if you’re pretty much guaranteed to have to page something out to free up RAM so that you can assign a private page to a process, progress is made a little faster if a process that reads a newly allocated page can make a bit more progress through its code before it has to block waiting for page out to complete in parallel to the page-out I/O – especially since paging systems are often optimized to free up several physical pages in one I/O.

But, if you’ve misunderstood the intention here, writing to a page can seem like an optimization, because it guarantees that the kernel maps the private page in immediately, instead of mapping the shared zero page and going off to do page-out I/O in the background to get you your writable page later.


Marek Knápek March 13, 2026

About the third footnote: Are you sure the compiler always knows whether the code is user mode only or kernel mode only? Or the user operating the compiler command line always knows? Example: Can a DLL be used from both user mode and kernel mode? Can a static lib be used from both user mode and kernel mode?

Gufo della Notte March 13, 2026 · Edited

s/ready/read/ @ ‘The MIPS code is trickier to …’

OTOH it can sort of be made sense of as it is, and I’m probably just too hung up on precision whenever language is concerned (side effect of having been a programmer for forty-some years). 😉
