March 19th, 2026

Windows stack limit checking retrospective: amd64, also known as x86-64

Raymond Chen

Our survey of stack limit checking reaches the modern day with amd64, also known as x86-64. This time, there are two versions of the function, one for user mode and one for kernel mode. We’ll look at the user mode version.

Actually, there are two user mode versions. One is in msvcrt, the legacy runtime.

; on entry, rax is the number of bytes to allocate
; on exit, stack has been validated (but not adjusted)

chkstk:
    sub     rsp, 16
    mov     [rsp], r10          ; save temporary register
    mov     [rsp][8], r11       ; save temporary register

    xor     r11, r11            ; r11 = 0
    lea     r10, [rsp][16][8]   ; r10 = caller's rsp
    sub     r10, rax            ; r10 = desired new stack pointer
    cmovb   r10, r11            ; clamp underflow to zero

    mov     r11, gs:[StackLimit]; user mode stack limit

    cmp     r10, r11            ; are we inside the limit?
    jae     done                ; Y: nothing to do

    and     r10w, #-PAGE_SIZE   ; round down to page start

probe:
    lea     r11, [r11][-PAGE_SIZE]  ; move to previous page
    test    [r11], r11b         ; probe it
    cmp     r10, r11            ; finished probing?
    jb      probe               ; N: keep going

done:
    mov     r10, [rsp]          ; restore temporary register
    mov     r11, [rsp][8]       ; restore temporary register
    add     rsp, 16             ; clean up stack
    ret
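
As a rough illustration of the arithmetic in the listing above, here is a small Python model of which addresses the probe loop would touch. It is a sketch, not the runtime's code: the function names are made up for this example, `PAGE_SIZE` is assumed to be 4 KiB, the stack limit is assumed to be page-aligned, and the addresses are arbitrary. It also does not model the operating system moving the stack limit as pages are touched.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def round_down_low16(value):
    """Model `and r10w, -PAGE_SIZE`: only the low 16 bits are masked."""
    mask16 = -PAGE_SIZE & 0xFFFF            # 0xF000
    return (value & ~0xFFFF) | (value & mask16)


def chkstk_probes(caller_rsp, size, stack_limit):
    """Return the addresses the probe loop would touch, highest first."""
    target = caller_rsp - size              # sub r10, rax
    if target < 0:                          # cmovb r10, r11 (clamp to zero)
        target = 0
    if target >= stack_limit:               # cmp r10, r11 / jae done
        return []
    target = round_down_low16(target)       # and r10w, -PAGE_SIZE
    probes = []
    page = stack_limit
    while True:
        page -= PAGE_SIZE                   # lea r11, [r11][-PAGE_SIZE]
        probes.append(page)                 # test [r11], r11b
        if target >= page:                  # cmp r10, r11 / jb probe
            break
    return probes


# Hypothetical addresses chosen only for illustration.
probes = chkstk_probes(caller_rsp=0x100400, size=17328, stack_limit=0x100000)
print([hex(p) for p in probes])
```

With these made-up numbers the model touches four pages, one per page between the old limit and the rounded-down target. Masking only the low 16 bits gives the same result as masking the full register, because the bits being cleared all live in the bottom 12.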

Bonus reading: Windows is not a Microsoft Visual C/C++ Run-Time delivery channel.

The other is in ucrtbase, the so-called universal runtime. That one is identical except that the probing is done by writing rather than reading.

    mov     byte ptr [r11], 0   ; probe it

In both cases, the function ensures that the stack has expanded by the necessary amount but leaves it to the caller to adjust the stack pointer after the call returns. Because the function returns with an ordinary ret to the address that the call pushed, this design preserves compliance with shadow stacks (part of what Intel calls Control-Flow Enforcement Technology, or CET).

A typical usage might go like this:

    mov     eax, #17328         ; desired stack frame size (zero-extended)
    call    chkstk              ; validate that there is enough stack
    sub     rsp, rax            ; allocate it
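
To make the numbers in that snippet concrete, the following Python sketch models the zero-extension performed by the 32-bit register write and works out how many pages the example frame spans. The helper name is invented for this illustration and `PAGE_SIZE` is assumed to be 4 KiB.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def mov_eax_imm32(old_rax, imm32):
    """Model `mov eax, imm32`: a 32-bit register write clears bits 63:32 of rax."""
    return imm32 & 0xFFFFFFFF   # old_rax is deliberately ignored


rax = mov_eax_imm32(0xFFFFFFFFFFFFFFFF, 17328)
full_pages, remainder = divmod(rax, PAGE_SIZE)
print(hex(rax), full_pages, remainder)   # prints: 0x43b0 4 944
```

The frame covers four full pages plus 944 bytes, so it is larger than a single page, which is the situation the validation call exists for.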

Next time, we’ll wrap up the series with a look at AArch64, also known as arm64.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

8 comments

Simon Felix March 20, 2026

It’s a curious design. Does anyone know why people use stack probing instead of this:
* Store the current stack size in a thread local variable
* Compare the desired stack size with the current stack size
* If adjustments are necessary, adjust using a kernel call

This would remove a bit of code, and a kernel call can be made as cheap as a (soft) page fault. Wouldn’t it be simpler?
- LB March 23, 2026
  
  The issue is that now you have two different ways of doing the same thing, and if you try to eliminate the probing approach, then every function has to do your proposed tests on entry to make sure it has enough stack space to actually run, which is quite expensive.
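
The exchange above can be sketched as a toy Python model. Everything in it is hypothetical: the class, its fields, and the "kernel call" are stand-ins invented for illustration, and growth is assumed to happen in 4 KiB pages. It only counts how often the comparison runs, which is the cost the reply points at.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


class ProposedStack:
    """Toy model of the commenter's proposal; every name here is hypothetical."""

    def __init__(self, committed_bytes):
        self.committed_bytes = committed_bytes  # the "thread local variable"
        self.used_bytes = 0
        self.checks = 0
        self.kernel_calls = 0

    def enter_function(self, frame_bytes):
        # Compare the desired stack size with the current stack size.
        self.checks += 1
        needed = self.used_bytes + frame_bytes
        if needed > self.committed_bytes:
            # Stand-in for the kernel call that grows the stack,
            # rounding the new size up to a whole number of pages.
            self.kernel_calls += 1
            self.committed_bytes = -(-needed // PAGE_SIZE) * PAGE_SIZE
        self.used_bytes = needed


stack = ProposedStack(committed_bytes=8192)
for frame in (64, 128, 17328, 32):
    stack.enter_function(frame)
print(stack.checks, stack.kernel_calls)   # prints: 4 1
```

In this model the comparison runs on every function entry, including the small frames that never need more stack, even though only one of the four entries triggers growth.
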
Kalle Niemitalo March 20, 2026 · Edited

If I understand correctly, the Intel CET shadow stack feature only checks that the return address itself matches, and does not check whether a near RET pops it from the same address where CALL originally pushed it. So if chkstk popped the return address, pushed it to a different location and returned to it, that would not immediately break compliance with shadow stacks.

On the other hand, it is conceivable that CALL in a future processor could also push RSP to the shadow stack and RET could compare that, if the operating system enables this kind of extended checking. Perhaps the designers of chkstk wanted to prepare for this possibility and choose a calling convention that won’t have to be changed if compliance with that kind of shadow stack is ever desired.

Or perhaps this is for the sake of unwinding in structured exception handling; that might require the epilog to restore the stack pointer.
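
The commenter's reading of the shadow stack check can be sketched as a toy Python model. This is only an illustration under that reading: the class and function names are invented, the "CPU" tracks nothing but a stack pointer and return addresses, and the relocating routine is the hypothetical alternative described in the comment, not the actual chkstk.

```python
class ShadowStackCpu:
    """Toy CPU: the shadow stack records return address values only."""

    def __init__(self):
        self.memory = {}    # ordinary stack memory: address -> value
        self.rsp = 0x1000
        self.shadow = []    # shadow stack of return addresses

    def call(self, return_address):
        self.rsp -= 8
        self.memory[self.rsp] = return_address
        self.shadow.append(return_address)

    def ret(self):
        target = self.memory[self.rsp]
        self.rsp += 8
        expected = self.shadow.pop()
        if target != expected:           # only the value is compared
            raise RuntimeError("control protection fault")
        return target


def relocating_chkstk(cpu, size):
    """Hypothetical chkstk that moves its own return address."""
    return_address = cpu.memory.pop(cpu.rsp)   # pop the return address
    cpu.rsp += 8
    cpu.rsp -= size                            # allocate the frame
    cpu.rsp -= 8
    cpu.memory[cpu.rsp] = return_address       # push it at the new location
    return cpu.ret()


cpu = ShadowStackCpu()
cpu.call(0x401000)
print(hex(relocating_chkstk(cpu, 0x100)), hex(cpu.rsp))
```

In this model the relocated return still passes, because the value matches even though it was popped from a different address. A check that also recorded the stack pointer, as the comment speculates, would reject it.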

Joshua Hudson March 19, 2026

Bonus: Wine amd64 has the simplest version of chkstk:

__chkstk:
    ret

where it’s actually written in inline assembly at block scope.

This works because there are no guard pages; all stack is already reserved at time of thread start. There seems to be some implicit assumption that no program actually causes a megabyte-sized stack overflow, which isn’t a completely unreasonable assumption given the program had to work to get shipped.

(It’s actually quite easy to write programs that work on Wine that don’t have a ghost of a chance of running on Windows; nobody seems to care.)
Csaba Varga March 19, 2026

I’m wondering why the probe loop does “lea r11, [r11][-PAGE_SIZE]” rather than “sub r11, PAGE_SIZE”. It can’t be for preserving the flags, the next TEST will wipe them anyway. Even in the ucrtbase version, the CMP after the probe will wipe the flags.

I doubt it matters performance-wise, but it’s still bugging me.
- James Ng March 20, 2026
  
  LEA is often faster and doesn’t block an ALU. This is because it takes 0 cycles as it’s calculated during the operand retrieval stage where addresses to operands are calculated and sent to the memory interface.
  
  On modern systems LEA is a very flexible instruction, so if you can use it, it’s preferred. Same reason everyone uses XOR to zero a register.
  
  LEA can be used to substitute for addition, subtraction and multiplication within limits. Check out Matt Godbolt’s Advent of Compiler Optimization for more details. He’s the guy that created Compiler Explorer, and it turns out the modern compiler is way more intelligent than you’d think. In one of the last videos, he shows how Clang recognized the algorithm used and substituted an equivalent, turning a loop from O(n) to O(1).
  
  - Jan Ringoš March 20, 2026
    
    One would assume modern CPU decoders can optimize and rewrite eligible ADDs and SUBs as if they were LEAs and relieve the ALU port.
  - Csaba Varga March 20, 2026
    
    Yes, I’m aware of what LEA can do, and I’ve also seen the Advent of Compiler Optimization videos. It just feels a bit roundabout when the only thing you need is subtraction, i.e. no multiplication, no need to store the result in a different register etc.
    
    (Also, I assumed that SUB can be encoded more efficiently than LEA, but I’ve tried it in an online assembler just now, and turns out both instructions take four bytes to encode. Interesting.)
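
The size comparison in that last comment can be reproduced with a small hand-written encoder. This is a sketch based on the standard amd64 instruction formats, limited to the two instructions under discussion with r11 as the only register. The function names are invented here, and the byte values are worth confirming with a real assembler.

```python
import struct


def encode_sub_r11_imm(imm):
    """Encode `sub r11, imm` (REX.W+B, opcode 83 /5 ib or 81 /5 id)."""
    rex = 0x49      # REX.W = 1, REX.B = 1 (r11 in the r/m field)
    modrm = 0xEB    # mod = 11, reg = /5, r/m = low three bits of r11
    if -128 <= imm <= 127:
        return bytes([rex, 0x83, modrm]) + struct.pack("<b", imm)
    return bytes([rex, 0x81, modrm]) + struct.pack("<i", imm)


def encode_lea_r11_disp(disp):
    """Encode `lea r11, [r11+disp]` (REX.W+R+B, opcode 8D /r)."""
    rex = 0x4D      # REX.W = 1, REX.R = 1, REX.B = 1
    if -128 <= disp <= 127:
        return bytes([rex, 0x8D, 0x5B]) + struct.pack("<b", disp)   # mod = 01
    return bytes([rex, 0x8D, 0x9B]) + struct.pack("<i", disp)       # mod = 10


for constant in (8, 0x1000):
    sub_bytes = encode_sub_r11_imm(constant)
    lea_bytes = encode_lea_r11_disp(-constant)
    print(constant, sub_bytes.hex(), lea_bytes.hex())
```

Under this encoding the two instructions are the same length either way: four bytes each when the constant fits in a signed byte, and seven bytes each for a 4 KiB page size, assuming that is the value of `PAGE_SIZE`.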
