November 1st, 2019

Why does my program crash if I terminate a thread that is waiting to enter a critical section? It never got the critical section, so who cares?

A customer had a program that used the Terminate­Thread function to terminate a thread while it was waiting for a critical section. They claim that this code worked in earlier versions of Windows, but starting in Windows 10, the program crashes when it tries to use that critical section further. Why is this happening? The documentation for Enter­Critical­Section says

If a thread terminates while it has ownership of a critical section, the state of the critical section is undefined.

But it doesn’t say that there’s anything wrong with terminating a thread that is waiting for a critical section.

Well, yeah, it doesn’t say that because the only way you can get into that state is if you call Terminate­Thread while the Enter­Critical­Section is still running, and terminating a thread is already a terrible idea. It’s such a terrible idea that doing so puts the entire process is an undefined state. There’s no point trying to go into the details of what could happen when your process is an undefined state. It’s undefined!

But just for curiosity’s sake, what changed in Windows 10 that made the undefined state lead to a crash?

Historically, critical sections were implemented in terms of an interlocked integer and a kernel event. The interlocked integer was used to keep track of the critical section’s state: Unowned, owned with no waiting threads, or owned with at least one waiting thread. If a thread needed to wait for the critical section, it waited on the kernel event handle. When the owner of the critical section exited it, it checked if there are any waiting threads, and if so, it signaled the event, thereby allowing one of the waiting threads to attempt to enter.

This design had some flaws. For example, a process that created lots of critical sections could potentially create a lot of kernel event handles, and kernel synchronization objects consume non-paged pool. To alleviate stress on non-paged pool, critical sections created the kernel event handle only on demand, but that introduces a new failure mode if the lazy creation of the kernel event handle fails. Some programs avoided this problem by pre-creating the kernel event handle, but that just put us back where we started with high non-paged-pool usage.

In Windows 10, critical sections were rewritten in terms of Wait­On­Address. This removes the need for kernel events entirely, and therefore avoids entire classes of potential problems.

The Wait­On­Address function works by linking the list of waiting threads through their stacks. This linked list of threads is manipulated both by waiting threads (when they join the list) and by the waking thread (so it can notify a waiting thread). The code to manipulate this linked list is quite complicated because it needs to do its work in a lock-free manner.¹

If you terminate a thread, the thread’s stack is cleaned up as part of cleaning up the biggest pieces of garbage on the sidewalk. That removes a memory leak, but it also leaves dangling pointers if that thread was calling Wait­On­Address or any other function that links memory on the stack into a data structure visible to other threads. And it’s those dangling pointers that cause any future use of the critical section to crash.

Bonus chatter: If you think about it, the original design that allocated a kernel handle per critical section was wasteful. Really, you need one kernel handle per thread, since each thread can wait on only one critical section. (There is no Wait­For­Multiple­Critical­Sections function.) On the other hand, managing those kernel handles is complicated because you need to do it in a lock-free way. Once you sign up to do the complicated stuff, you may as well go all the way and do it handle-free.

¹ The slim reader-writer locks also links the list of waiting threads through their stacks. The code for slim read-writer locks is even more complicated because it needs to reorder the nodes within the linked list, while still being lock-free.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

9 comments

Discussion is closed. Login to edit/delete existing comments.

Newest
Newest
Popular
Oldest
  • cheong00

    Just something remotely related. Yesterday I found my Android game crashed when I sort the card list a few time, with the systems saying it detected a memory scan and will terminate the App.

    Talking about Operating System changes that cause things previously worked failed in unexpected way.

  • Ian Boyd

    Is thing sort of programming bug also caught by AppVerifier?

    • Raymond ChenMicrosoft employee Author

      AppVerifier complains loudly if you call TerminateThread.

  • Neil Rashbrook

    Can you not have one event per process, and signal all threads waiting on critical sections if there’s a thread waiting on the critical section that you just left? Any thread waiting on another critical section will just assume that another thread waiting on that critical section got there first.

  • ‪ ‪

    Providing Terminate­Thread() is Microsoft’s biggest mistake.

    • MNGoldenEagle

      To be fair, most operating systems provided the same functionality, including POSIX. API design sadly does not come with any magic 8 balls to tell you if it’s going to end up being a costly mistake or not.

      • Simon Clarkstone

        According to the article Raymond linked to, the designers did know TerminateThread() was a bad idea and were reluctant to write it.

      • Kenny

        If I had a dollar for every time I was forced to implement something which I told management was a bad idea over the last 20 years I would have enough money to buy a very nice used car.

  • Ron ParkerMicrosoft employee

    This is the second time in a month that you’ve posted some random thing that just happened to solve a problem I was having. The first one was the fiber-local storage thing, and now this one.

    I was just investigating a bug where someone was keeping around a pointer to a stack variable that had gone out of scope. In the process I made a little tool to search all of the process’s heaps for any pointers into any thread’s stack, and especially into areas near or at a lower address than that thread’s RSP. I was surprised that even notepad.exe has a handful of stack pointers in the heap right at startup, but I also noted that they all pointed into “live” stack.

    This note about WaitOnAddress has solved the (minor) mystery of what some of those pointers likely are, and now I can try to mark them as “safe” based on the RIP of the thread who owns the stack. That is, if the thread is inside WaitOnAddress (or the relevant SRW functions) somewhere, the tool can probably ignore pointers to a certain to-be-determined part of its stack, even though they may be “near” RSP and thus look suspicious.

Feedback