I noted some time ago that creating a threadpool wait allows the threadpool to combine multiple waits, so that each thread waits for nearly 64 objects. (It’s not quite 64 objects because one of the objects is a special sentinel object that means “Stop waiting.”)
In the time since I wrote that article, the situation has gotten even better. Starting in Windows 8, the registered waits are associated with a completion port, and a single thread handles all the wait requests by waiting for completions.
We can see the new behavior in action with this simple program:
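#include <windows.h>
#include <stdio.h>

int main()
{
    // Static storage duration, so the captureless lambda below can use it.
    static LONG count = 0;

    // "last" is the event at the far end of the chain.
    HANDLE last = CreateEvent(nullptr, true, false, nullptr);
    HANDLE event = last;
    for (int i = 0; i < 10000; i++)
    {
        // The callback counts itself and then signals the previously
        // created event, which is passed in as the context parameter.
        auto wait = CreateThreadpoolWait(
            [](auto, auto event, auto, auto)
            {
                InterlockedIncrement(&count);
                SetEvent(event);
            }, event, nullptr);

        // Each wait gets its own new event to wait on.
        event = CreateEvent(nullptr, true, false, nullptr);
        SetThreadpoolWait(wait, event, nullptr);
    }

    Sleep(10000); // break into the debugger here to inspect the threads

    SetEvent(event); // signal the most recent event to start the chain
    WaitForSingleObject(last, INFINITE);
    printf("%ld events signaled\n", count);
    return 0;
}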
This quick-and-dirty program creates 10,000 threadpool waits, each waiting on a different event and each with a callback that signals the next event, creating a chain of waits that eventually leads to setting the event named last. Under the old rules, creating 10,000 threadpool waits would result in around 10,000 ÷ 63 ≅ 159 threads to wait on all of those objects. But if you break into the debugger during the Sleep(), you’ll see that there are just a few. And if you set a breakpoint at the start of the main function, you’ll see that only one of those threads was created as a result of the threadpool waits; the others were pre-existing.
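The old-rules estimate is a ceiling division: each wait thread can service at most 63 caller-supplied objects, because one of its 64 slots is taken by the sentinel. Here is a minimal sketch of that arithmetic. The names old_style_wait_threads, kMaxWaitObjects, and kWaitsPerThread are hypothetical helpers for illustration, not part of any Windows API; 64 is the value of MAXIMUM_WAIT_OBJECTS.

```cpp
#include <cstddef>

// Hypothetical helper names for illustration; not a Windows API.
// 64 mirrors MAXIMUM_WAIT_OBJECTS, the per-call limit of WaitForMultipleObjects.
constexpr std::size_t kMaxWaitObjects = 64;

// One slot per thread is reserved for the "stop waiting" sentinel.
constexpr std::size_t kWaitsPerThread = kMaxWaitObjects - 1;

// Number of wait threads needed under the pre-Windows 8 model:
// the wait count divided by 63, rounded up.
constexpr std::size_t old_style_wait_threads(std::size_t waits)
{
    return (waits + kWaitsPerThread - 1) / kWaitsPerThread;
}

// 63 * 158 = 9954, so 10,000 waits spill into a 159th thread.
static_assert(old_style_wait_threads(10000) == 159, "10,000 waits need 159 threads");
static_assert(old_style_wait_threads(63) == 1, "a full thread holds 63 waits");
static_assert(old_style_wait_threads(64) == 2, "the 64th wait needs a second thread");
```

Under the completion-port model described above, this count no longer grows with the number of waits, which is why the debugger shows only a handful of threads.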
To prove that all of these waits really are waiting, we signal the most recent one, which sets off a chain of SetEvent calls, and wait for the last event to be set. We print the number of events that were signaled (should be 10,000) and call it a day.
This is just a proof of concept to show the thread behavior, so I didn’t bother cleaning up the waits or the handles.
> waits are associated with a completion port
How can my program do that?
Has CreateIoCompletionPort been quietly extended to support event (or other mutant) handles?
I wish I knew this 10 years ago, this is a pretty big deal. It would have influenced some architectural decisions I made. It belongs in Windows Internals 7 if it isn’t there already. I might have missed it.
Why does the wait thread use a different IOCP from the I/O completions? Doesn’t that defeat the purpose of IOCPs?
The CreateThreadpool* family of functions return the PTP_* family of types, which point to corresponding TP_* structures. I presume that the structures themselves are allocated on the heap as part of CreateThreadpool*.
My question: why not allow the user to specify an existing location to store the TP_* structures? It seems to me that you’d usually perform an allocation yourself to store the callback state anyway. You could allocate the TP_* inline.
(1) It would require a separate callback from the threadpool back to the program to say "Okay, I'm done with the memory now, you can free it," which means the program will probably need an atomic reference count. In practice, people probably won't actually use atomics. They'll use a non-atomic reference count, or even no reference count at all (and risk deadlocking at destruction or just free the memory when the container destructs, causing the thread pool to corrupt the heap). All this means that the thread pool team will have to spend a lot of their time investigating bugs...
Thanks for the analysis. What about inverting the pattern then, letting the thread pool API manage all allocations:
<code>
The thread pool can then choose to allocate in line with the if it wants to.
If the user wants the thread pool to use a specific allocator to allocate everything in, we can use callback environments for that.
I thought you’d gone off the rails today by not specifying a capture of that count variable used in the lambda, but since that variable is declared static, it’s fine. I didn’t know that; it could be useful.