Some time ago we took a closer look at the stack guard page and how a rogue stack access from another thread into the guard page could result in the guard page being lost. (And we used this information to investigate a stack overflow failure.)
You might have noticed that the “one guard page at a time” policy assumes that the stack grows one page at a time. But what if a function has a lot of local variables (or just one large local variable) such that the size of the local frame is greater than a page, and the first variable that the function uses is the one at the lowest address? That would result in a memory access in the reserved region (red in the diagram on the linked page), rather than in the guard page (yellow in the diagram), and since it’s not in a guard page, that is simply an invalid memory access, and the process would crash.
Yet processes don’t crash when this happens. How does that work?
The answer is that when the stack pointer needs to move by more than the size of a page (typically 4KB), the compiler generates a call to a helper function called something like _chkstk. The job of this function is to touch all of the pages spanned by the desired stack allocation, in order, so that guard pages can be converted to committed memory. The system maintains only one guard page, namely the page that is just below the allocated portion of the stack. Once you touch that guard page, the system converts it to a committed page, updates the stack limit, and creates a new guard page one page further down. That’s why the access has to be sequential: You have to make sure that the first access outside the stack limit is to wherever the guard page is.
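To make the ordering requirement concrete, here is a minimal sketch in C that simulates the one-guard-page policy. It is a toy model, not the real _chkstk and not how the operating system tracks pages: the names sim_stack, sim_touch, and sim_probe_frame are hypothetical and exist only for this illustration, and pages are represented as plain indexes rather than real memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: pages are numbered by address, so a lower index
   means a lower address. Pages above guard_page are committed; pages
   below it are reserved but not yet accessible. */
typedef struct {
    size_t guard_page;  /* index of the single guard page */
    int    faulted;     /* nonzero once an access could not be satisfied */
} sim_stack;

/* Model one memory access to the given page. */
static void sim_touch(sim_stack *s, size_t page)
{
    assert(s != NULL);
    if (s->faulted || page > s->guard_page) {
        return;                     /* already crashed, or page is committed */
    }
    if (page == s->guard_page && page > 0) {
        s->guard_page = page - 1;   /* commit it; new guard one page down */
    } else {
        s->faulted = 1;             /* reserved region, or out of stack */
    }
}

/* What a _chkstk-style helper does in this model: touch every page of
   the new frame in order, from the highest address to the lowest.
   sp_page is the (committed) page that holds the stack pointer. */
static void sim_probe_frame(sim_stack *s, size_t sp_page, size_t frame_pages)
{
    for (size_t i = 1; i <= frame_pages && i <= sp_page; i++) {
        sim_touch(s, sp_page - i);
    }
}
```

In this model, if the guard page is page 10 and a function with a three-page frame touches page 8 first, sim_touch sets faulted, which corresponds to the crash described above. Calling sim_probe_frame with sp_page 11 and frame_pages 3 instead touches pages 10, 9, and 8 in that order, so each access lands on the current guard page and the guard ends up at page 7.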
The form of this stack-checking function has changed over the years, and we’ll be spending a few days doing a historical survey of how it worked. We’ll start next time with the 80386 family of processors, also known as x86-32 and i386.
Didn’t a previous blog entry discover that at some point a change was made to have multiple guard pages instead of just one, or am I misremembering?
How does _alloca(size_t) play with this? Does it do the necessary _chkstk() probing?
I’m sure using a TLS slot is probably always a better idea, but I’m curious whether this is an argument against the use of _alloca().
I remember trying to look into this and discovering that on i386, MSVC and GCC expect different calling conventions for _chkstk and getting confused trying to sort it out.
Why not have a page fault handler that detects the faulting address being the stack and page in the other pages?
My guess: you don’t want an invalid pointer dereference to allocate a huge chunk of stack, just because the pointer happens to be pointing where the stack might grow, eventually. You want an invalid pointer dereference to segfault most of the time.
99.9% of functions are happy with just having a single guard page, and it causes zero overhead for them. The remaining 0.1% can probably tolerate the tiny performance hit of calling _chkstk. (The actual allocation may be costly, as pete.d’s example shows, but you would need to do that anyway when a function needs so much space from the stack.)
Fun fact: the initially released version of MS Flight Sim 2000 included a huge performance bug any time shorelines were in the rendered view (i.e. flying around bodies of water). Some quick and dirty profiling revealed the _chkstk function to be the cause. Turned out, some huge temporary data structure was being allocated on the stack for the purpose of the shoreline rendering, and so every frame this very costly paging in of dozens of pages (or was it hundreds? I don't even remember at this point) was killing the frame rate.
Switching to a globally allocated buffer that could be...