April 6th, 2022

All Windows threadpool waits can now be handled by a single thread

I noted some time ago that creating a threadpool wait allows the threadpool to combine multiple waits, so that each thread waits for nearly 64 objects. (It’s not quite 64 objects because one of the objects is a special sentinel object that means “Stop waiting.”)
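To picture the old model: each wait thread parked in a WaitForMultipleObjects call on up to 64 handles, one of which was a private control event used to wake the thread when the wait list changed. Here is a conceptual sketch of that technique (this is not the actual threadpool code, just an illustration):

#include <windows.h>

// One waiter thread multiplexes up to 63 registered waits plus one sentinel.
DWORD CALLBACK WaiterThread(void* param)
{
    HANDLE* handles = static_cast<HANDLE*>(param);
    // handles[0] is the sentinel; handles[1..63] are the registered waits.
    for (;;) {
        DWORD index = WaitForMultipleObjects(
            MAXIMUM_WAIT_OBJECTS, handles, FALSE, INFINITE);
        if (index == WAIT_OBJECT_0) {
            break; // sentinel signaled: wait list changed (or shutting down)
        }
        // Otherwise handles[index - WAIT_OBJECT_0] is signaled;
        // dispatch the callback registered for that handle.
    }
    return 0;
}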

In the time since I wrote that article, the situation has gotten even better. Starting in Windows 8, the registered waits are associated with a completion port, and a single thread handles all the wait requests by waiting for completions.

We can see the new behavior in action with this simple program:

#include <windows.h>
#include <stdio.h>

int main()
{
    static LONG count = 0;

    // The tail of the chain: set only after all 10,000 callbacks have run.
    HANDLE last = CreateEvent(nullptr, true, false, nullptr);

    HANDLE event = last;
    for (int i = 0; i < 10000; i++)
    {
        // Each wait's callback signals the previous link's event.
        auto wait = CreateThreadpoolWait(
            [](auto, auto event, auto, auto)
            {
                InterlockedIncrement(&count);
                SetEvent(event);
            }, event, nullptr);

        // Create a fresh event and have this wait watch it.
        event = CreateEvent(nullptr, true, false, nullptr);
        SetThreadpoolWait(wait, event, nullptr);
    }

    Sleep(10000); // break in with the debugger here and look at the threads

    // Signal the most recently created event; each callback then
    // signals the one before it, all the way down to "last".
    SetEvent(event);
    WaitForSingleObject(last, INFINITE);
    printf("%d events signaled\n", count);
    return 0;
}

This quick-and-dirty program creates 10,000 threadpool waits, each waiting on a different event. Each wait's callback signals the next event in the chain, which eventually leads to setting the event named last. Under the old rules, creating 10,000 threadpool waits would result in around 10,000 ÷ 63 ≈ 159 threads to wait on all of those objects. But if you break into the debugger during the Sleep(), you'll see that there are just a few threads. And if you set a breakpoint at the start of the main function, you'll see that only one of those threads was created as a result of the threadpool waits; the others were pre-existing.
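
If you'd rather not attach a debugger, you can confirm this programmatically. Here's a small sketch using the tool-help snapshot API to count the threads in the current process; you could call it once before the loop and again during the Sleep() and compare the two numbers. (The helper name is my own invention.)

#include <windows.h>
#include <tlhelp32.h>

DWORD CountThreadsInThisProcess()
{
    DWORD pid = GetCurrentProcessId();
    DWORD count = 0;
    HANDLE snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snapshot == INVALID_HANDLE_VALUE) return 0;
    THREADENTRY32 entry = { sizeof(entry) };
    if (Thread32First(snapshot, &entry)) {
        do {
            // The snapshot covers all threads on the system;
            // keep only the ones belonging to this process.
            if (entry.th32OwnerProcessID == pid) count++;
        } while (Thread32Next(snapshot, &entry));
    }
    CloseHandle(snapshot);
    return count;
}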

To prove that all of these waits really are waiting, we signal the most recently created event, which sets off a chain of SetEvent calls, and then we wait for the last event to be set. We print the number of events that were signaled (it should be 10,000) and call it a day.

This is just a proof of concept to show the thread behavior, so I didn’t bother cleaning up the waits or the handles.
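
For code that does clean up after itself, the teardown of each wait would look something like this sketch (assuming the program had kept each PTP_WAIT and event HANDLE around in a container):

#include <windows.h>

// Orderly teardown of one threadpool wait and the event it watches.
void CleanUpWait(PTP_WAIT wait, HANDLE event)
{
    SetThreadpoolWait(wait, nullptr, nullptr);  // stop watching the handle
    WaitForThreadpoolWaitCallbacks(wait, TRUE); // cancel pending callbacks,
                                                // wait for running ones
    CloseThreadpoolWait(wait);                  // release the wait object
    CloseHandle(event);                         // and the event itself
}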

6 comments

  • Jan Ringoš

    > waits are associated with a completion port

    How can my program do that?
    Has CreateIoCompletionPort been quietly extended to support event (or other mutant) handles?

  • switchdesktopwithfade@hotmail.com

    I wish I'd known this 10 years ago; this is a pretty big deal. It would have influenced some architectural decisions I made. It belongs in Windows Internals 7 if it isn't there already. I might have missed it.

    Why does the wait thread use a different IOCP from the I/O completions? Doesn’t that defeat the purpose of IOCPs?

  • 紅樓鍮

    The CreateThreadpool* family of functions returns the PTP_* family of types, which point to corresponding TP_* structures. I presume that the structures themselves are allocated on the heap as part of CreateThreadpool*.

    My question: why not allow the user to specify an existing location to store the TP_* structures? It seems to me that you’d usually perform an allocation yourself to store the callback state anyway. You could allocate the TP_* inline.

    • Raymond Chen (Microsoft)

      (1) It would require a separate callback from the threadpool back to the program to say "Okay, I'm done with the memory now, you can free it," which means the program will probably need an atomic reference count. In practice, people probably won't actually use atomics. They'll use a non-atomic reference count, or even no reference count at all (and risk deadlocking at destruction or just free the memory when the container destructs, causing the thread pool to corrupt the heap). All this means that the thread pool team will have to spend a lot of their time investigating bugs...
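
      A hypothetical sketch of that contract (all names here are invented for illustration): the threadpool would have to call back when it was done with the caller-owned block, and the program would need an atomic count so that neither side frees the memory while the other is still using it.

      // Hypothetical: caller-owned block holding the TP_WAIT plus callback state.
      struct MyWaitState
      {
          LONG refs = 2; // one reference for the threadpool, one for the program

          void Release()
          {
              if (InterlockedDecrement(&refs) == 0) delete this;
          }
      };

      // Hypothetical "done with your memory" callback from the threadpool.
      void CALLBACK OnThreadpoolDoneWithMemory(void* context)
      {
          static_cast<MyWaitState*>(context)->Release();
      }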

      • 紅樓鍮

        Thanks for the analysis. What about inverting the pattern, then, and letting the thread pool API manage all allocations? The thread pool can then choose to allocate the callback state in line with the TP_* structure if it wants to.

        If the user wants the thread pool to use a specific allocator to allocate everything in, we can use callback environments for that.

  • Henry Skoglund

    I thought you’d gone off the rails today by not specifying a capture of that count variable used in the lambda, but since that variable is declared static, it’s fine. Didn’t know that; could be useful 🙂
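
    For reference, a minimal illustration of the rule the comment is describing: variables with static storage duration can be used inside a lambda without being captured; only automatic (stack) variables need a capture.

    int main()
    {
        static long count = 0;
        long local = 0;

        auto ok = [] { count++; };       // fine: count has static storage duration
        // auto bad = [] { local++; };   // error: local must be captured
        auto also_ok = [&] { local++; }; // capturing the local works

        ok();
        also_ok();
        return 0;
    }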