All Windows threadpool waits can now be handled by a single thread

Raymond Chen

I noted some time ago that creating a threadpool wait allows the threadpool to combine multiple waits, so that each thread waits for nearly 64 objects. (It’s not quite 64 objects because one of the objects is a special sentinel object that means “Stop waiting.”)

In the time since I wrote that article, the situation has gotten even better. Starting in Windows 8, the registered waits are associated with a completion port, and a single thread handles all the wait requests by waiting for completions.

We can see the new behavior in action with this simple program:

#include <windows.h>
#include <stdio.h>

int main()
{
    static LONG count = 0;
    HANDLE last = CreateEvent(nullptr, true, false, nullptr);

    HANDLE event = last;
    for (int i = 0; i < 10000; i++)
    {
        // Each callback signals the previously created event,
        // forming a chain that ends at "last".
        auto wait = CreateThreadpoolWait(
            [](auto, auto event, auto, auto)
            {
                InterlockedIncrement(&count);
                SetEvent(event);
            }, event, nullptr);
        event = CreateEvent(nullptr, true, false, nullptr);
        SetThreadpoolWait(wait, event, nullptr);
    }

    Sleep(10000);                        // time to inspect the threads in the debugger
    SetEvent(event);                     // signal the most recent event to start the chain
    WaitForSingleObject(last, INFINITE); // wait for the chain to complete
    printf("%d events signaled\n", count);
    return 0;
}

This quick-and-dirty program creates 10,000 threadpool waits, each waiting on a different event, and whose callback signals the next event, creating a chain of waits that eventually leads to setting the event named last. Under the old rules, creating 10,000 threadpool waits would result in around 10,000 ÷ 63 ≅ 159 threads to wait on all of those objects. But if you break into the debugger during the Sleep(), you’ll see that there are just a few. And if you set a breakpoint at the start of the main function, you’ll see that only one of those threads was created as a result of the threadpool waits; the others were pre-existing.

To prove that all of these waits really are waiting, we signal the most recently created event, which sets off a chain of SetEvent calls, and then wait for the event named last to be set. We print the number of events that were signaled (it should be 10,000) and call it a day.

This is just a proof of concept to show the thread behavior, so I didn’t bother cleaning up the waits or the handles.
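For reference, a proper teardown would unregister each wait, drain any in-flight callbacks, and then close everything. This is only a sketch, not part of the original program: it assumes the loop above had also stashed each (wait, event) pair in a hypothetical container named waits.

```cpp
// Sketch only: assumes a std::vector<std::pair<PTP_WAIT, HANDLE>> named
// "waits" was populated inside the creation loop (not in the original code).
for (auto [wait, handle] : waits)
{
    // Unregister: a null handle tells the threadpool to stop watching.
    SetThreadpoolWait(wait, nullptr, nullptr);
    // Block until any callback in flight has finished; TRUE also
    // cancels callbacks that are queued but not yet running.
    WaitForThreadpoolWaitCallbacks(wait, TRUE);
    CloseThreadpoolWait(wait);
    CloseHandle(handle);
}
CloseHandle(last);
```

The unregister-then-drain order matters: closing a wait while its callback can still run risks the callback touching freed state.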

6 comments



  • Jan Ringoš

    > waits are associated with a completion port

    How can my program do that?
    Has CreateIoCompletionPort been quietly extended to support event (or other mutant) handles?

  • switchdesktopwithfade@hotmail.com

    I wish I had known this 10 years ago; this is a pretty big deal. It would have influenced some architectural decisions I made. It belongs in Windows Internals 7 if it isn’t there already. I might have missed it.

    Why does the wait thread use a different IOCP from the I/O completions? Doesn’t that defeat the purpose of IOCPs?

  • 紅樓鍮 · Edited

    The CreateThreadpool* family of functions returns the PTP_* family of types, which point to corresponding TP_* structures. I presume that the structures themselves are allocated on the heap as part of CreateThreadpool*.

    My question: why not allow the user to specify an existing location to store the TP_* structures? It seems to me that you’d usually perform an allocation yourself to store the callback state anyway. You could allocate the TP_* inline.

    • Raymond Chen · Microsoft employee · Author

      (1) It would require a separate callback from the threadpool back to the program to say “Okay, I’m done with the memory now, you can free it,” which means the program would probably need an atomic reference count. In practice, people probably won’t actually use atomics. They’ll use a non-atomic reference count, or even no reference count at all (and risk deadlocking at destruction, or just free the memory when the container destructs, causing the thread pool to corrupt the heap). All of this means that the thread pool team would have to spend a lot of their time investigating bugs that turn out not to be their fault.

      (2) It creates the CRITICAL_SECTION problem: the team later wanted to expand the size of a CRITICAL_SECTION to hold more information, but they couldn’t, because the CRITICAL_SECTION is caller-allocated, so they had to add a separate heap allocation anyway. Extra work. No benefit.

      • 紅樓鍮 · Edited

        Thanks for the analysis. What about inverting the pattern then, letting the thread pool API manage all allocations:

        PTP_WORK CreateThreadpoolWork(
          [in] PTP_WORK_CALLBACK pfnwk,
          [in] size_t dataSize,
          [in] size_t dataAlign,
          [in, optional] PTP_CALLBACK_ENVIRON pcbe,
          // on successful return, holds a pointer to a suitably aligned block of memory
          [out, optional] void **data
        );

        The thread pool can then choose to allocate the *data block inline with the TP_WORK if it wants to.

        If the user wants the thread pool to use a specific allocator to allocate everything in, we can use callback environments for that.

  • Henry Skoglund

    I thought you’d gone off the rails today by not specifying a capture for the count variable used in the lambda, but since that variable is declared static, it’s fine. Didn’t know that; could be useful 🙂
