A customer encountered found that sometimes, their application hung in its clean up code. Here’s a simplified version.
bool ShuttingDown = false; void MainThread() { DWORD id; auto hThread = CreateThread(nullptr, 0, WorkerThread, nullptr, 0, &id); // succeeds BlahBlahBlah(); // do useful work // Time to clean up. Post an APC to the worker thread // to tell it that it's time to go home. QueueUserAPC(WakeWorker, hThread, 0); // succeeds WaitForSingleObject(hThread, INFINITE); // hangs CloseHandle(hThread); } void CALLBACK WakeWorker(ULONG_PTR) { ShuttingDown = true; } DWORD CALLBACK WorkerThread(void*) { // Do work until shut down. do { // All work is posted via APCs. SleepEx(INFINITE, TRUE); } while (!ShuttingDown); return 0; }
The idea is that the program has a worker thread to, y’know, do some work. All of the work items are posted via QueueUserAPC
, and the worker thread simply calls SleepEx
alertably, over and over again. Each call to SleepEx
sleeps the thread until an APC is queued, at which point it returns (because the bAlertable
parameter is TRUE
).
One way of looking at this design is that it’s a sneaky way of making the operating system manage your work queue for you. Another way of looking at it is as a single-threaded I/O completion port.
Call it what you will.
Anyway, the problem is that the worker thread is stuck in SleepEx
, as if it never got the WakeWorker
APC that tells it to exit.
But is that really the problem?
The customer was able to get some full memory dumps of systems that got into this state, and the telling detail is that the ShuttingDown
variable was set to true
. So it wasn’t that the WakeWorker
APC never arrived. It totally did arrive and set ShuttingDown
to true
. Yet somehow that didn’t wake up the SleepEx
.
One of my colleagues offered the possibility that when the code entered the SleepEx
loop, the WakeWorker
APC had already run. If that’s the case, then the SleepEx
is going to wait for an APC that has already arrived.
I was able to confirm with a test program that this was indeed a possibility.
DWORD apc = 0; void CALLBACK WakeWorker(ULONG_PTR) { apc = GetCurrentThreadId(); } DWORD CALLBACK WorkerThread(void*) { if (apc == GetCurrentThreadId()) DebugBreak(); return 0; } int __cdecl main() { DWORD id; while (true) { apc = 0; auto h = CreateThread(nullptr, 0, WorkerThread, nullptr, 0, &id); QueueUserAPC(WakeWorker, h, 0); WaitForSingleObject(h, INFINITE); CloseHandle(h); Sleep(10); } // notreached }
This program creates a thread and immediately queues an APC to it. The thread procedure checks if the APC already ran on the same thread, and if so, it breaks into the debugger.
If you run this program, it breaks immediately. If you set a breakpoint on the WakeWorker
function, you’ll see that it does indeed run on the thread before the WorkerThread
function starts.
From the stack trace on that breakpoint, it appears that the kernel processes APCs during the early phases of thread creation, so if you had any queued APCs, they will get processed before the thread procedure even starts.
So that’s the bug: The ShuttingDown
flag was already set by the time the thread procedure started, but the code assumes that it can only be set by an APC that is processed by the SleepEx
.
The fix is simple: Change the loop from a do...while
to a while
, so that the test is at the top of the loop instead of at the bottom.
DWORD CALLBACK WorkerThread(void*) { // Do work until shut down. while (!ShuttingDown) { // All work is posted via APCs. SleepEx(INFINITE, TRUE); } return 0; }
The case where this happens is if the BlahBlahBlah()
function returns very quickly, allowing the QueueUserAPC
to win the race against the WorkerThread
.
How dow changing the loop to a “while().. ” help in general case ? There is still possible execution race ( no datarace as all runs in one thread) if there is alertable function between the test and SleepEx. That was my quess for the bug but the example did not give enough informarion to make judgement call.
It wouldn’t really make sense to have any other code in that loop, but if you did, you’d have to check for the exit flag after every step that performs an alertable wait.
“From the stack trace on that breakpoint, it appears that the kernel processes APCs during the early phases of thread creation, so if you had any queued APCs, they will get processed before the thread procedure even starts.”
Indeed, this is documented behavior:
“If an application queues an APC before the thread begins running, the thread begins by calling the APC function. After the thread calls an APC function, it calls the APC functions for all APCs in its APC queue.”
— https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-queueuserapc
My understanding is that this is precisely how the DllMain routines fire before main(), or before the thread start function: they are executed inside an APC that was queued before the process/thread starts.
I feel like I read somewhere in the hundreds of pages that is Windows Internals that APCs are used during thread startup for uhm, something? I’d pull a quote, but that’s too many pages.
Wait how does that fix work? My first hypothesis was the APC woke up in the function prolog for SleepEx.
Before it sleeps, it checks whether shutdown is already set. If it’s already set then it doesn’t go to sleep.
The bug is that it goes to sleep after shutdown is already set but without checking for that which means the APC has already run.
The APC was run by the kernel before entering the thread procedure, as it appears the kernel drains user APCs before running the thread procedure.
“The bug is that it goes to sleep after shutdown is already set but without checking for”
This doesn’t answer the question. There’s still a time between checking shutdown and making the system call.
This is a distinction between Win32 and Unix signal handlers; also, David Culter’s early OS VMS. Win32 does not believe in allowing user mode code to be interrupted by other user mode code running on the same thread. This is because it is almost impossible to reason about that sort of situation – are we holding the heap lock? Who knows? So, APCs only run when the thread announced it is ready for them. This way, it is possible to reason about re-entrance issues.
According to this MSDN page a user-mode APC only executes when the thread is in an “alertable state”, which is during SleepEx and WaitForMultipleObjectsEx & friends. It won’t fire between arbitrary instructions.
And this presumably why the customer thought their code was safe, because they hadn’t read the QueueUserAPC documentation well enough to notice that an APC can also execute during thread startup.
And there’s the answer. You live you learn.
ShuttingDown should also be declared volatile so that the compiler doesn’t “helpfully” optimize the code by hoisting it out of the loop, but that’s not the point of this article.
In this case no
volatile
is not only not necessary but may cause weird behavior, it’s important to remember thatvolatile
is not atomic and does not solve data race issues. Usingstd::atomic_bool
satisfies that use case but if you look at the code they are queuing the User APC onto the “owning” thread. Which means the code has no undefined behavior because it’s inherently data race free. Could this in theory be improved by using an atomic? Potentially an acquire operation just incase the thread got moved cross CPU and the cache is not coherent but it should be fine. I’d personally probably use the atomic with an acquire and release operations because I’m paranoid.Actually, wouldn’t the APC be vaguely equivalent to a signal handler? In signal handlers, the only things C (and C++) give you safe access to are lock-free atomic types and `volatile sig_atomic_t`. The only atomic that’s guaranteed to be lock-free is `atomic_flag`, although Win32 probably gives you more.
No. From the program’s point of view the APC is called in a perfectly normal way by the SleepEx function. There are no restrictions analogous to those in a signal handler.
No, User mode APCs don’t break execution like Unix signal handlers do, which is a major distinction. They can only be triggered on a thread that’s in an “Alertable state” which is what the second parameter of
SleepEx
does. It’s closer to how apples grand central dispatch works honestly. You don’t have to worry about clobbering an active stack like you do with signal handlers either. Inherently the thread you’re running on was ‘stopped’ for all practical purposes until you got run. Also you can queue them directly onto a specific thread which this code does. I’m pretty sure Raymond has covered cases where APC never run because the thread never went alertable. That said if you wanted to be very very sure you’d use atomic types as I mentioned. That means that when the well meaning programmer™️ comes in later and tries to useShuttingDown
cross thread where they shouldn’t be it won’t cause stupid bugs it will just be slower that it needs to be.There is now QueueUserAPC2 as a documented way to queue a special user-mode APC that does not require an alertable state.
Also volatile on non-x86/x64 systems may cause poor performance as the compiler may try to keep things sequentially consistent when acquire / release will do.