Our survey of stack limit checking reaches the modern day with amd64, also known as x86-64. This time, there are two versions of the function, one for user mode and one for kernel mode. We’ll look at the user mode version.
Actually, there are two user mode versions. One is in msvcrt, the legacy runtime.
; on entry, rax is the number of bytes to allocate
; on exit, stack has been validated (but not adjusted)
chkstk:
sub rsp, 16
mov [rsp], r10 ; save temporary register
mov [rsp][8], r11 ; save temporary register
xor r11, r11 ; r11 = 0
lea r10, [rsp][16][8] ; r10 = caller's rsp
sub r10, rax ; r10 = desired new stack pointer
cmovb r10, r11 ; clamp underflow to zero
mov r11, gs:[StackLimit] ; user mode stack limit
cmp r10, r11 ; are we inside the limit?
jae done ; Y: nothing to do
and r10w, #-PAGE_SIZE ; round down to page start
probe:
lea r11, [r11][-PAGE_SIZE] ; move to previous page
test [r11], r11b ; probe it
cmp r10, r11 ; finished probing?
jb probe ; N: keep going
done:
mov r10, [rsp] ; restore temporary register
mov r11, [rsp][8] ; restore temporary register
add rsp, 16 ; clean up stack
ret
Bonus reading: Windows is not a Microsoft Visual C/C++ Run-Time delivery channel.
The other is in ucrtbase, the so-called universal runtime. That one is identical except that the probing is done by writing rather than reading.
mov byte ptr [r11], 0 ; probe it
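The address arithmetic shared by both versions can be easier to follow in C. What follows is only a sketch that models the listing on plain integers and touches no memory; the name `model_chkstk`, its parameters, and the 4 KiB `PAGE_SIZE` are assumptions made for illustration, not part of either runtime.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical model of the listing's arithmetic. Returns how many pages
   the probe loop would touch and stores the lowest probed page in
   *lowest_page (or stack_limit itself if no probe is needed). */
unsigned model_chkstk(uint64_t caller_rsp, uint64_t size,
                      uint64_t stack_limit, uint64_t *lowest_page)
{
    /* Assumption: the stack limit is page-aligned, as the loop requires. */
    assert(stack_limit % PAGE_SIZE == 0);

    /* sub r10, rax / cmovb r10, r11: clamp to zero if the subtraction borrows */
    uint64_t target = (size > caller_rsp) ? 0 : caller_rsp - size;

    *lowest_page = stack_limit;
    if (target >= stack_limit)            /* jae done */
        return 0;

    target &= ~(uint64_t)(PAGE_SIZE - 1); /* round down to page start */

    uint64_t page = stack_limit;
    unsigned probes = 0;
    do {
        page -= PAGE_SIZE;                /* move to previous page */
        probes++;                         /* the real code touches [page] here */
    } while (target < page);              /* jb probe */

    *lowest_page = page;
    return probes;
}
```

For example, with a stack limit of 0x10000, a caller rsp of 0x10800, and a request of 17328 bytes, the target is 0xC450, which rounds down to 0xC000, so the loop touches four pages: 0xF000, 0xE000, 0xD000, and 0xC000.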
In both cases, the function ensures that the stack has expanded by the necessary amount but leaves it as the caller’s responsibility to adjust the stack pointer after the call returns. This design preserves compliance with shadow stacks (part of what Intel calls Control-flow Enforcement Technology, or CET).
A typical usage might go like this:
mov eax, #17328 ; desired stack frame size (zero-extended)
call chkstk ; validate that there is enough stack
sub rsp, rax ; allocate it
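You normally don't write that sequence by hand; the compiler emits it in the prologue of a function whose locals are large enough. Here is a minimal C sketch of such a function, reusing the 17328-byte size from the usage example. The name `sum_large_frame` is hypothetical, and whether a probe call actually appears depends on the compiler and its options; as far as I know, Microsoft documents the x64 threshold as 8K bytes of locals.

```c
#include <assert.h>
#include <stddef.h>

enum { FRAME_BYTES = 17328 }; /* same size as the usage example */

/* Hypothetical function whose locals span several pages. A compiler that
   uses stack probes would validate the frame in the prologue before
   moving the stack pointer. */
unsigned long sum_large_frame(unsigned char fill)
{
    /* volatile keeps the buffer from being optimized away */
    volatile unsigned char buffer[FRAME_BYTES];
    unsigned long total = 0;

    assert(sizeof buffer == FRAME_BYTES);

    for (size_t i = 0; i < FRAME_BYTES; i++)
        buffer[i] = fill;
    for (size_t i = 0; i < FRAME_BYTES; i++)
        total += buffer[i];

    return total;
}
```

Compiling this for x64 and inspecting the generated assembly is one way to see whether the prologue contains the probe call followed by the `sub rsp, rax`.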
Next time, we’ll wrap up the series with a look at AArch64, also known as arm64.
It’s a curious design. Does anyone know why people use stack probing instead of this:
* Store the current stack size in a thread local variable
* Compare the desired stack size with the current stack size
* If adjustments are necessary, adjust using a kernel call
This would remove a bit of code, and a kernel call can be made as cheap as a (soft) page fault. Wouldn’t it be simpler?
The issue is that now you have two different ways of doing the same thing, and if you try to eliminate the probing approach, then every function has to do your proposed tests on entry to make sure it has enough stack space to actually run, which is quite expensive.
If I understand correctly, the Intel CET shadow stack feature only checks that the return address itself matches, and does not check whether a near RET pops it from the same address where CALL originally pushed it. So if chkstk popped the return address, pushed it to a different location and returned to it, that would not immediately break compliance with shadow stacks.
On the other hand, it is conceivable that CALL in a future processor could also push RSP to the shadow stack and RET could compare that, if the operating system enables this kind of extended checking. ...
Bonus: Wine on amd64 has the simplest version of chkstk:
__chkstk:
ret
where it’s actually written in inline assembly at block scope.
This works because there are no guard pages; all stack is already reserved at the time of thread start. There seems to be an implicit assumption that no program actually causes a megabyte-sized stack overflow, which isn’t a completely unreasonable assumption given that the program had to work to get shipped.
(It’s actually quite easy to write programs that work on Wine that don’t have a ghost of a chance of running on Windows; nobody seems to care.)