February 3rd, 2022

A closer look at the stack guard page

In a discussion of why IsBadXxxPtr should really be called CrashProgramRandomly, I gave a brief description of the stack guard page:

The dynamic growth of the stack is performed via guard pages: Just past the last valid page on the stack is a guard page. When the stack grows into the guard page, a guard page exception is raised, which the default exception handler handles by committing a new stack page and setting the next page to be a guard page.

Let’s break this down a bit more.

Here’s a thread’s stack after the thread has been running for a little while. As is customary in memory diagrams, higher addresses are at the top, which means that the stack grows downward (toward lower addresses).

 
valid stack | committed
            | committed
            | committed  ← Stack pointer
            | committed
              guard page
              reserved
              reserved
              reserved

The regular committed pages encompass all of the stack memory that the program has used so far. It may not be using all of it right now: Any memory beyond the red zone is off limits to the application. When the stack pointer recedes from its high water mark, the pages left behind are not decommitted.

The page just past the stack pointer’s high water mark is a special type of committed page known as a guard page. A guard page is a page which raises a STATUS_GUARD_PAGE_VIOLATION exception the first time it is accessed.

Suppose that the stack pointer moves into the guard page, indicating that the thread has increased its stack requirements by one additional page.

valid stack | committed
            | committed
            | committed
            | committed
              guard page  ← Stack pointer
              reserved
              reserved
              reserved

The moment the thread accesses memory from the guard page, the system converts it to a regular committed page (removing the PAGE_GUARD flag) and raises a STATUS_GUARD_PAGE_VIOLATION exception. The default exception handler deals with the exception by looking to see if the address lies in the current stack’s guard page region. If so, then it upgrades the next reserved page to a guard page, and then resumes execution:

              Before                        During                       After
valid stack | committed                     committed                    committed   | valid stack
            | committed                     committed                    committed   |
            | committed                     committed                    committed   |
            | committed                     committed                    committed   |
              guard page ← Stack pointer →  committed ← Stack pointer →  committed   |
              reserved                      reserved                     guard page
              reserved                      reserved                     reserved
              reserved                      reserved                     reserved
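
The before/during/after sequence can be sketched as a toy model. This is only an illustration of the bookkeeping described in the text, not how Windows implements it. The names ToyStack, Page, and GuardPageViolation are invented for the sketch, pages are list entries rather than real memory, and the model deliberately ignores what happens when the reserved pages run out.

```python
from enum import Enum


class Page(Enum):
    COMMITTED = "committed"
    GUARD = "guard page"
    RESERVED = "reserved"


class GuardPageViolation(Exception):
    """Stands in for STATUS_GUARD_PAGE_VIOLATION in this toy model."""

    def __init__(self, index):
        super().__init__(f"guard page violation at page {index}")
        self.index = index


class ToyStack:
    """Hypothetical model: index 0 is the highest address, the stack grows toward higher indices."""

    def __init__(self, committed=4, reserved=3):
        self.pages = [Page.COMMITTED] * committed + [Page.GUARD] + [Page.RESERVED] * reserved

    def raw_access(self, index):
        """The system's part: a guard page becomes a regular committed page, then the exception is raised."""
        if self.pages[index] is Page.GUARD:
            self.pages[index] = Page.COMMITTED
            raise GuardPageViolation(index)
        if self.pages[index] is Page.RESERVED:
            raise MemoryError("page is only reserved")

    def access(self, index):
        """An access by the owning thread, with the default handler's response applied."""
        try:
            self.raw_access(index)
        except GuardPageViolation as e:
            nxt = e.index + 1
            # The handler upgrades the next reserved page to a guard page...
            if nxt < len(self.pages) and self.pages[nxt] is Page.RESERVED:
                self.pages[nxt] = Page.GUARD
            # ...and execution resumes.


stack = ToyStack()
stack.access(4)  # the owning thread touches its guard page
for page in stack.pages:
    print(page.value)
```

After the access, the model holds five committed pages followed by a guard page and two reserved pages, which matches the "After" column.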

Clearing the PAGE_GUARD flag on an access to a guard page means that once you access it, it stops being a guard page. This means that guard pages raise the guard page exception only on first access. If you fail to take action on a guard page exception, the system ignores it, and you have lost your one chance to do something.

This is why our code to detect stack overflows makes sure to call _resetstkoflw() if it decides to recover. Resetting the stack overflow state consists of turning the PAGE_GUARD flag back on for the guard page, restoring the page to its former glory as a guard page so it can do its job of detecting stack growth.
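
The one-shot nature of the flag, and what re-arming it buys you, can be seen in a few lines. This is a toy sketch, not the Windows API: ToyPage, GuardPageViolation, and rearm are hypothetical names, and rearm merely plays the role that turning PAGE_GUARD back on plays for a real page.

```python
class GuardPageViolation(Exception):
    """Stands in for STATUS_GUARD_PAGE_VIOLATION in this toy model."""


class ToyPage:
    """Hypothetical single page carrying a one-shot guard flag."""

    def __init__(self):
        self.guard = True
        self.data = 0

    def read(self):
        if self.guard:
            # The flag is cleared as part of raising the exception,
            # so a second access will not raise again.
            self.guard = False
            raise GuardPageViolation()
        return self.data

    def rearm(self):
        """Turn the guard flag back on (the toy analogue of resetting the guard page)."""
        self.guard = True


page = ToyPage()
faults = 0

for _ in range(3):
    try:
        page.read()
    except GuardPageViolation:
        faults += 1  # the handler does nothing useful: the one chance is gone

print(faults)  # 1: only the first of the three reads raised

page.rearm()
try:
    page.read()
except GuardPageViolation:
    faults += 1

print(faults)  # 2: the re-armed page raised once more
```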

This is how things go when everything is working right. But things don’t always work right.

If one thread accesses another thread’s guard page, perhaps due to a buffer overflow, or just an uninitialized pointer variable that happens to point there, that too will trigger the guard page exception. That exception is raised by the thread that did the accessing, which is not the thread that owns the stack. The default exception handler sees that the guard page exception is not for the current thread’s stack, so it ignores it.¹

Congratulations, your stack is now corrupted, because the guard page is gone:

valid stack | committed
            | committed
            | committed  ← Stack pointer
            | committed
              committed (oops)
              reserved
              reserved
              reserved

Things proceed normally for a while, until the thread’s stack needs to grow into what used to be the guard page.

valid stack | committed
            | committed
            | committed
            | committed
              committed  ← Stack pointer (oops)
              reserved
              reserved
              reserved

Normally, this would trigger a guard page exception, and the system would do the usual thing of upgrading the next reserved page to a new guard page. However, that page is no longer a guard page, so execution just continues normally with no action taken.

Things still proceed as if everything were perfectly normal, but the consequences of your misdeeds finally catch up to you when the stack pointer crosses into a second new page, the first reserved page.

valid stack | committed
            | committed
            | committed
            | committed
              committed (oops)
              reserved  ← Stack pointer (double oops)
              reserved
              reserved

This is also not a guard page, so no special stack expansion kicks in. You just get a stack overflow exception and die.
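
The whole failure sequence can be replayed in a toy model. As before, this is a sketch under simplifying assumptions rather than a description of the real implementation: touch, run_scenario, and StackOverflow are made-up names, pages are list entries, and the owns_stack parameter stands in for the default handler's check of whether the faulting address lies in the current thread's guard page region.

```python
COMMITTED, GUARD, RESERVED = "committed", "guard page", "reserved"


class StackOverflow(Exception):
    """Stands in for the fatal exception at the end of the story."""


def touch(pages, index, owns_stack):
    """Simulate one access to pages[index]; index 0 is the highest address."""
    state = pages[index]
    if state == GUARD:
        pages[index] = COMMITTED  # the system removes the guard status first
        if owns_stack and index + 1 < len(pages) and pages[index + 1] == RESERVED:
            pages[index + 1] = GUARD  # default handler sets up the next guard page
        # If the accessing thread does not own the stack, the handler ignores
        # the exception and no new guard page is created.
        return "guard page exception"
    if state == RESERVED:
        raise StackOverflow(f"page {index} is only reserved")
    return "ok"


def run_scenario():
    pages = [COMMITTED] * 4 + [GUARD] + [RESERVED] * 3
    events = []
    # Another thread's stray access hits the guard page.
    events.append(touch(pages, 4, owns_stack=False))
    # Later, the owning thread grows into the former guard page: nothing happens.
    events.append(touch(pages, 4, owns_stack=True))
    # Later still, it grows into the first reserved page.
    try:
        touch(pages, 5, owns_stack=True)
    except StackOverflow:
        events.append("stack overflow")
    return events, pages


events, pages = run_scenario()
print(events)
print(pages)
```

Note that at the end there is no guard page anywhere in the list, which is the telltale sign of this kind of corruption in the model.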

Such is the sad life of invalid memory access. You can corrupt your own process in a subtle way that doesn’t show up until much, much later.

Next time, we’ll investigate a stack overflow problem and learn how to detect whether this guard page corruption has occurred.

¹ In theory, the default exception handler could search through all the threads in the process and see if the address resides in a guard page of any thread, but it doesn’t. One reason is that this would require cross-thread coordination with the thread whose guard page you accidentally accessed, as well as any other thread that also might be accessing that guard page at the same time. But the bigger reason is probably that the entire situation is a bug in the program anyway, and there’s no point going out of your way to slow down the system in order to deal with things that programs shouldn’t be doing anyway.

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

10 comments

  • Henke37

    And this is another reason not to do funny stuff with the stack, such as trying to allocate and switch to one of your own. Use the fibers API instead; it’s in cahoots with the default exception handler and will work correctly.

  • word merchant · Edited

    “The default exception handler sees that the guard page exception is not for the current thread’s stack, so it ignores it.”

    At the risk of having missed the point, in this situation, why doesn’t the default exception handler terminate the process? A wild write to the guard page would indicate something is pretty wrong in the application and it probably isn’t going to end well.

    • Raymond Chen (Microsoft employee, Author)

      Because the app might be using guard pages for its own purposes.

      • Ismo Salonen

        Is there any way to opt in to termination or a special exception (which should be the default, but changing this needs the time machine)?
        Might that help in debugging this kind of bug? I suppose there are only a few applications using guard pages for their own purposes.

      • Evan Crawford

        I’m not following either, if an application is using guard pages for its own purposes, then wouldn’t it override the default handler?

      • Raymond Chen (Microsoft employee, Author)

        You’d think so, but guard pages are documented as “If nobody handles the guard page exception, it is ignored and execution resumes normally.” And there may be apps that rely on that behavior. Furthermore, not all accesses to guard pages raise an exception. VirtualLock of a guard page simply fails, and GetLastError tells you “Sorry. Guard page.” No exception.

  • Andreas Peitz

    I remember the old article well and still cannot wrap my head around the actual issue. The problem of triggering a guard page, for sure. But how is that any different to normal usage and the stack running out of reserved space? Is there a buffer zone after (“before”) the stack or will the stack continue randomly into other allocated memory? Isn’t an always growing stack a design flaw in the program in the first place? Isn’t triggering the stack guard page a “mistake” in general? (ignoring the thread “startup” call chains, but there is an initial huge 64K commit anyway, I think). Recursive calls, yes. But that’s a different topic, CrashProgramRecursively. Way too many programmers put way too much stuff on the stack that doesn’t belong there. For me it’s like, programs that shoot their stack guard pages will destroy their stack pointers long before that, due to uninitialized stack variables, stack buffer overflows and so on. Or shallow copies of structures everywhere with references (pointers) stored outside the function and used by other functions. The list goes on. A program that relies on that guard page is the bigger issue in my opinion. (again, not counting startup).

    How about debuggers? Or ReadProcessMemory? Does that trigger guard pages in the remote process? I would think they don’t as this is a virtual memory trigger by the CPU itself and different processes have different memory mappings.

  • MGetz

    I find it interesting that people rarely think about the consequence to next byte e.g. rsp-1 might be a lot further away than they think it is. Thus it makes good sense to be at least somewhat conscious of how much stack you’re using (don’t prior optimize, source of all evil etc. still applies) that way you’re less likely to hit things like the guard page and thus odd perf hits that don’t always make sense. That said once the page is hit as far as I know Windows never reclaims it so it’s a one time thing. I guess it probably doesn’t matter except for benchmarks and the like and if they were really concerned they’d just set the stack size default in the linker so this isn’t an issue. /randommusings

    • Joshua Hudson

      Or do what I do. The stack is pre-allocated. There’s only one guard page at the very top; if you ever fault in it you could recover the process but that work unit is being cancelled.

      I got tired of “impossible” stack overflows because somebody else ran the server out of RAM.

      • MGetz

        EXE pre-allocate via linker options? Or something else? Either way I pretty much assumed that would be a must for a benchmark because if that happens at the wrong time it’s extremely expensive for something that shouldn’t be losing the 2k+ cycles dealing with the fault. That said there could still be the possibility that an interrupt could trigger the fault? (Raymond would need to weigh in on if the kernel cares or not, I suspect it technically does but uses so little as to not matter much).
