We start our survey of historical stack limit checking functions on Windows with the 80386 family of processors. This function has actually changed form over the years, so we’ll start with the “original flavor”.
Originally, the caller put the desired number of bytes in the eax register and then called the _chkstk function. The function touched each page of the requested stack region, adjusted the stack pointer, and then returned with the adjusted stack pointer still in effect. This is an unusual calling convention since it is neither caller clean, nor is it callee clean. It’s callee-dirty! The function returns with more stack allocated than when it was called.
_chkstk:
    push ecx                ; preserve register
    ; calculate the stack pointer of the caller
    mov  ecx, esp
    add  ecx, 8             ; 4 bytes were auto-pushed for the return address,
                            ; we pushed 4 bytes for the ecx
touch:
    cmp  eax, PAGE_SIZE     ; less than a page to go?
    jb   finalpage          ; do the last page and finish
    sub  ecx, PAGE_SIZE     ; allocate a page from our pretend stack pointer
    or   dword ptr [ecx], 0 ; touch the memory
    sub  eax, PAGE_SIZE     ; did a page
    jmp  touch              ; go back and do some more
finalpage:
    sub  ecx, eax           ; allocate the leftovers from our pretend stack pointer
    or   dword ptr [ecx], 0 ; touch the memory
    mov  eax, esp           ; remember original stack pointer
    mov  esp, ecx           ; move the real stack to match our pretend stack
    mov  ecx, [eax]         ; recover original ecx
    mov  eax, 4[eax]        ; recover return address
    jmp  eax                ; "return" to caller
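To make the page-by-page walk concrete, here is a minimal C sketch that simulates which addresses the routine above would touch and where the stack pointer ends up. It does not touch any real stack; it only reproduces the arithmetic. The names `simulate_chkstk`, `probe_trace`, and `MAX_PROBES` are made up for this illustration, and `PAGE_SIZE` is assumed to be 4096.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE  4096u  /* assumption: 4 KB pages, as on the 80386 */
#define MAX_PROBES 64     /* arbitrary cap for this illustration */

/* Hypothetical record of what the probe loop would do. */
typedef struct {
    uint32_t probes[MAX_PROBES]; /* addresses touched, in order */
    size_t   count;              /* number of addresses recorded */
    uint32_t new_sp;             /* stack pointer handed back to the caller */
} probe_trace;

/* caller_sp plays the role of ecx after "add ecx, 8"; bytes plays eax. */
static probe_trace simulate_chkstk(uint32_t caller_sp, uint32_t bytes)
{
    probe_trace t = { {0}, 0, 0 };
    uint32_t pretend_sp = caller_sp;

    /* touch: one probe per whole page still to allocate */
    while (bytes >= PAGE_SIZE) {
        pretend_sp -= PAGE_SIZE;
        if (t.count < MAX_PROBES) {
            t.probes[t.count++] = pretend_sp;
        }
        bytes -= PAGE_SIZE;
    }

    /* finalpage: allocate the leftovers and probe once more */
    pretend_sp -= bytes;
    if (t.count < MAX_PROBES) {
        t.probes[t.count++] = pretend_sp;
    }

    t.new_sp = pretend_sp;
    return t;
}
```

For a request of 17320 bytes, the simulation records four whole-page probes followed by one probe for the remaining 936 bytes, and the final stack pointer is exactly 17320 bytes below the caller's.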
A function with a large stack frame would go something like this:
function:
    push ebp                ; link into frame chain
    mov  ebp, esp
    push ebx                ; save non-volatile register
    push esi
    push edi
    mov  eax, 17320         ; large stack frame
    call _chkstk            ; allocate it from our stack safely
                            ; behaves like "sub esp, eax"
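At the source level, a prologue like this typically comes from a function whose locals exceed one page. The following C sketch shows such a function; the name `fill_large_frame` and the buffer are invented for illustration, and the size is chosen only to echo the 17320 in the listing. Whether a given compiler actually emits a call to a stack probe helper here depends on the compiler, target, and options, so treat this as an assumption rather than a guarantee.

```c
#include <stddef.h>
#include <string.h>

#define FRAME_BYTES 17320  /* echoes the frame size in the listing above */

/* Hypothetical function with a local buffer spanning several 4 KB pages.
   Summing the buffer gives the function an observable result. */
static unsigned long fill_large_frame(unsigned char seed)
{
    unsigned char buffer[FRAME_BYTES];  /* large local: the reason for a probe */
    unsigned long sum = 0;
    size_t i;

    memset(buffer, seed, sizeof buffer);
    for (i = 0; i < sizeof buffer; i++) {
        sum += buffer[i];
    }
    return sum;
}
```

Without a probe, the first write into the far end of `buffer` could skip past the guard page entirely, which is the situation the stack check is there to prevent.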
This goes into the competition for “wackiest x86-32 calling convention.”¹
Next time, we’ll look at how stack probing happens on MIPS, which has its own quirks, but nothing as crazy as this.
Bonus chatter: The strange calling convention dates back to the 16-bit 8086. And back then, there were two versions of the chkstk function, depending on whether you were calling it far or near.
; frame size in ax
chkstk:
#if NEAR
    pop  bx                 ; pop 16-bit return address
#else // FAR
    pop  bx                 ; pop 32-bit return address
    pop  dx
#endif
    inc  ax
    and  al, 0xFE           ; round up to even
    sub  ax, sp             ; check for stack overflow
    jae  overflow           ; Y: overflow
    neg  ax                 ; ax = new stack pointer
    cmp  ax, ss:[pStackTop]
    ja   overflow           ; stack mysteriously too high
    cmp  ax, ss:[pStackMin] ; new stack limit?
    jbe  nochange
    mov  ss:[pStackMin], ax ; update stack limit
nochange:
    mov  sp, ax             ; update the stack pointer
#if NEAR
    jmp  bx                 ; "return" to caller
#else // FAR
    push dx                 ; restore return address
    push bx
    retf                    ; return to caller
#endif
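The arithmetic at the start of the 16-bit version is compact enough that it may help to see it spelled out. The C sketch below models only the rounding (`inc ax` / `and al, 0xFE`) and the overflow test (`sub ax, sp` / `jae overflow` / `neg ax`) using 16-bit unsigned values. It deliberately leaves out the comparisons against pStackTop and pStackMin. The function names are made up for this illustration, and -1 is an arbitrary way to signal the overflow path.

```c
#include <stdint.h>

/* Mirrors "inc ax / and al, 0xFE": round a byte count up to an even number. */
static uint16_t round_up_even(uint16_t size)
{
    return (uint16_t)((uint16_t)(size + 1u) & 0xFFFEu);
}

/* Mirrors "sub ax, sp / jae overflow / neg ax".
   Returns the new stack pointer, or -1 where the assembly would
   jump to the overflow label. */
static int32_t new_sp_or_overflow(uint16_t size, uint16_t sp)
{
    uint16_t rounded = round_up_even(size);

    /* "sub ax, sp" sets the carry flag only when ax < sp, so "jae"
       (carry clear) is taken when the request is at least as large as sp. */
    if (rounded >= sp) {
        return -1;
    }

    /* ax now holds (rounded - sp) modulo 65536; negating it gives sp - rounded. */
    uint16_t diff = (uint16_t)(rounded - sp);
    return (int32_t)(uint16_t)(0u - diff);
}
```

For example, a request of 5 bytes is rounded up to 6, and with sp at 0x1000 the new stack pointer comes out as 0x0FFA.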
Reminds me of code I had to write for BSD Unix on the PDP-11. On startup, to set up the stack, I had to copy the return address into a register, go through a loop "touching" memory downwards until it was OK, and then tell the OS I was moving the stack. That is why I had to save the return address: once you told the OS you were moving the stack, the stack vanished! So I set up the new stack and returned via the register value. But you couldn't move the stack before telling the OS...
I was very confused by the Bonus Chatter for a while. What would a 16-bit chkstk be for? Stacks are fixed at compile time on 16-bit Windows, and you can’t grow the stack anyway because it’s sandwiched between the static data and the local heap (Petzold 3.1 p281).
But of course, you would like to AT LEAST detect and defend against a stack overflow even if the hardware won’t help and your only option is to bail out. I presume the win16 functions are called in every function’s prolog and will trigger an Application Fault if the allocated stack would...
Is it just me, or is there a mix-up between eax and ecx? The caller should set eax, not ecx, and the first “push ecx” does not preserve the allocation size; it just preserves whatever is in ecx.
Agreed
Wow, this must be old code indeed, if it doesn’t worry about returning with a JMP. The x86 branch predictor assumes every CALL to be paired with a RET, so it will mispredict a bunch of future RETs if you get back to your caller without using a RET. The predictor-friendly way to return would be replacing the final two instructions with:
    push 4[eax]             ; copy return address to the top of the stack
    ret                     ; return to caller
On the face of it, does chkstk behave similarly to alloca, or am I missing something?
_alloca is the stack allocator intrinsic. __alloca_probe is emitted to check on _alloca. IIRC, it’s equivalent to __chkstk.
I’m guilty of using _alloca, judiciously, when it makes sense. I just keep my allocations below PAGE_SIZE.
None of the 16-bit x86 microprocessors had a branch predictor at all; the Pentium Pro was the first to have a return address stack; the Pentium did have branch prediction but I don’t think it needed strict pairing so while it would be temporarily confused by the JMP it would get its act together again at the next RET.
Is it just me, or is there a footnote ¹ that doesn’t point to anything? Is that, like, a joke since it’s pointing to a page that hasn’t yet been allocated?