{"id":112138,"date":"2026-03-13T07:00:00","date_gmt":"2026-03-13T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=112138"},"modified":"2026-03-13T07:04:33","modified_gmt":"2026-03-13T14:04:33","slug":"20260313-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20260313-00\/?p=112138","title":{"rendered":"Windows stack limit checking retrospective: MIPS"},"content":{"rendered":"<p>Last time, we looked at <a title=\"Windows stack limit checking retrospective: 80386\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20260312-00\/?p=112136\"> how the 80386 performed stack probing on entry to a function with a large local frame<\/a>. Today we&#8217;ll look at MIPS, which differs in a few ways.<\/p>\n<pre>; on entry, t8 = desired stack allocation\r\n\r\nchkstk:\r\n    sw      t7, 0(sp)           ; preserve register\r\n    sw      t8, 4(sp)           ; save allocation size\r\n    sw      t9, 8(sp)           ; preserve register\r\n\r\n    li      t9, PerProcessorData ; prepare to get thread bounds\r\n    bgez    sp, usermode        ; branch if running in user mode\r\n    subu    t8, sp, t8          ; t8 = new stack pointer (in delay slot)\r\n\r\n    lw      t9, KernelStackStart\r\n    b       havelimit\r\n    subu    t9, t9, KERNEL_STACK_SIZE ; t9 = end of stack (in delay slot)\r\n\r\nusermode:\r\n    lw      t9, Teb(t9)         ; get pointer to current thread data\r\n    nop                         ; stall on memory load\r\n    lw      t9, StackLimit(t9)  ; t9 = end of stack\r\n    nop                         ; burn the delay slot\r\n\r\nhavelimit:\r\n    sltu    t7, t8, t9          ; need to grow the stack?\r\n    beq     zero, t7, done      ; N: then nothing to do\r\n    li      t7, -PAGE_SIZE      ; prepare mask (in delay slot)\r\n    and     t8, t8, t7          ; round down to nearest page\r\n\r\nprobe:\r\n    subu    t9, t9, PAGE_SIZE   ; move to next page\r\n    bne     t8, t9, probe       
; loop until done\r\n    sw      zero, 0(t9)         ; touch the memory (in delay slot)\r\n\r\ndone:\r\n    lw      t7, 0(sp)           ; restore\r\n    lw      t8, 4(sp)           ; restore\r\n    j       ra                  ; return\r\n    lw  t9, 8(sp)               ; restore (in delay slot)\r\n<\/pre>\n<p>The MIPS code is trickier to read because of the pesky delay slot. Recall that delay slots execute even if the branch is taken.\u00b9<\/p>\n<p>One thing that is different here is that the code short-circuits if the stack has already expanded the necessary amount. The x86-32 version always touches the stack, even if not necessary, but the MIPS version does the work only if needed. It&#8217;s often the case that a program allocates a large buffer on the stack but ends up using only a small portion of it, and the short-circuiting avoids <a title=\"Fun fact: the initially released version of MS Flight Sim 2000 included a huge performance bug any time shorelines were in the rendered view\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20260311-00\/?p=112134&amp;commentid=143917#comment-143917\"> faulting in pages and cache lines unnecessarily<\/a>. But to do this, we need to know how far the stack has already expanded, and that means checking a different place depending on whether it&#8217;s running on a user-mode stack or a kernel-mode stack.<\/p>\n<p>Note that the probe loop faults the memory in by <i>writing<\/i> to it rather than reading from it.\u00b2 This is okay because we already know that the write will expand the stack, rather than write into an already-expanded stack, and nobody can be expanding our stack at the same time because the stack belongs to this thread. 
(If we hadn&#8217;t short-circuited, then a write would not be correct, because the write might be writing to an already-present portion of the stack.)<\/p>\n<p>On the MIPS processor, the address space is architecturally divided exactly in half with user mode in the lower half and kernel mode in the upper half. The code relies on this by testing the upper bit of the stack pointer to detect whether it is running in user mode or kernel mode.\u00b3<\/p>\n<p>Another difference between the MIPS version and the 80386 version is that the MIPS version validates that the stack can expand, but it returns with the stack pointer unchanged. It leaves the caller to do the expansion.<\/p>\n<p>I deduced that a function prologue for a function with a large stack frame might look like this:<\/p>\n<pre>    sw  ra, 12(sp)      ; save return address in home space\r\n    li  t8, 17320       ; large stack frame\r\n    br  chkstk          ; expand stack if necessary\r\n    lw  ra, 12(sp)      ; recover original return address\r\n    sub sp, sp, t8      ; create the local frame\r\n    sw  ra, nn(sp)      ; save return address in its real location\r\n\r\n    \u27e6 rest of function as usual \u27e7\r\n<\/pre>\n<p>The big problem is finding a place to save the return address. 
From looking at the implementation of the <code>chkstk<\/code> function, I see that it is going to use home space slots 0, 4, and 8, but it doesn&#8217;t use slot 12, so we can use it to save our return address before it gets overwritten by the <code>br<\/code>.<\/p>\n<p>Later, I realized that the code can save the return address in the <code>t9<\/code> register, since that is a scratch register according to the Windows calling convention, but the <code>chkstk<\/code> function nevertheless dutifully preserves it.\u2074<\/p>\n<pre>    move t9, ra         ; save return address in t9\r\n    li  t8, 17320       ; large stack frame\r\n    br  chkstk          ; expand stack if necessary\r\n    sub sp, sp, t8      ; create the local frame\r\n    sw  t9, nn(sp)      ; save return address in its real location\r\n\r\n    \u27e6 rest of function as usual \u27e7\r\n<\/pre>\n<p>However, I wouldn&#8217;t be surprised if the compiler used the first version, just in case somebody is using a nonstandard calling convention that passes something meaningful in <code>t9<\/code>.<\/p>\n<p>Next time, we&#8217;ll look at PowerPC, which has its own quirk.<\/p>\n<p>\u00b9 Delay slots were a popular feature in early RISC days to avoid a pipeline bubble by saying, &#8220;Well, I already went to the effort of fetching and decoding this instruction; may as well finish executing it.&#8221; Unfortunately, this clever trick backfired when newer versions of the processor had deeper pipelines or multiple execution units. If you still wanted to avoid the pipeline bubble, you would have to add more delay slots, but three delay slots is getting kind of silly, and it would break compatibility with code written to the v1 processor. Therefore, processor developers just kept the one delay slot for compatibility and lived with the pipeline bubble for the other nonexistent delay slots. (To hide the bubble, they added branch prediction.)<\/p>\n<p>\u00b2 I don&#8217;t know why they chose to write instead of read. 
Maybe it&#8217;s to avoid an Address Sanitizer error about reading from memory that was never written?<\/p>\n<p>\u00b3 This code is compiled into the runtime support library that can be used in both user mode and kernel mode, so it needs to detect what mode it&#8217;s in. An alternate design would be for the compiler to offer two versions of the function, one for user mode and one for kernel mode, and make you specify at link time which version you wanted.<\/p>\n<p>\u2074 The <code>chkstk<\/code> function preserves all registers so that it can be used even in functions with nonstandard calling conventions. Okay, it preserves <i>almost<\/i> all registers. It doesn&#8217;t preserve the assembler temporary <code>at<\/code>, which is used implicitly by the <code>li<\/code> instruction. But nobody expects the assembler temporary to be preserved. It also doesn&#8217;t preserve the &#8220;do not touch, reserved for kernel&#8221; registers <code>k0<\/code> and <code>k1<\/code>, which is fine, because the caller shouldn&#8217;t be touching them either!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Optimizing out the unnecessary probes comes with its own complexity.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-112138","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Optimizing out the unnecessary probes comes with its own 
complexity.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/112138","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=112138"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/112138\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=112138"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=112138"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=112138"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}