A customer had a test that created a lot of threads, and they wanted the test to wait for all of the threads to exit before proceeding to the next step. However, there were more threads than the maximum number of handles that WaitForMultipleObjects can wait on (MAXIMUM_WAIT_OBJECTS). So what is the best way to wait for all of them if they can't do it with a single WaitForMultipleObjects call?
The customer noted that the documentation offers a few suggestions. One is to divide the objects into groups of at most MAXIMUM_WAIT_OBJECTS and, for each group, create a thread that calls WaitForMultipleObjects. Another is to call RegisterWaitForSingleObject on each handle.
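To give a sense of the bookkeeping the RegisterWaitForSingleObject suggestion involves, here is a minimal sketch (not the documentation's code). The names WaitCountdown, WaitAllContext, OnHandleSignaled, and WaitForAllWithRegisteredWaits are hypothetical. Only the countdown is portable; the Windows part compiles only when _WIN32 is defined. Each registered wait fires a thread-pool callback once, the last callback signals an event, and UnregisterWaitEx with INVALID_HANDLE_VALUE is used to make sure no callback is still touching the shared context when the function returns.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Portable piece: counts down as each registered wait fires.
// Each wait callback calls Arrive() exactly once.
class WaitCountdown {
public:
    explicit WaitCountdown(std::size_t count) : remaining_(count) {}

    // Returns true only for the caller that brings the count to zero.
    bool Arrive() {
        std::size_t before = remaining_.fetch_sub(1, std::memory_order_acq_rel);
        assert(before > 0 && "Arrive() called more times than the count");
        return before == 1;
    }

private:
    std::atomic<std::size_t> remaining_;
};

#ifdef _WIN32
#include <windows.h>
#include <vector>

namespace {
struct WaitAllContext {
    explicit WaitAllContext(std::size_t count)
        : countdown(count), allDone(CreateEventW(nullptr, TRUE, FALSE, nullptr)) {}
    ~WaitAllContext() { if (allDone) CloseHandle(allDone); }
    WaitCountdown countdown;
    HANDLE allDone;  // manual-reset event, set by the last callback
};

VOID CALLBACK OnHandleSignaled(PVOID parameter, BOOLEAN /*timedOut*/) {
    auto* context = static_cast<WaitAllContext*>(parameter);
    if (context->countdown.Arrive()) {
        SetEvent(context->allDone);
    }
}
}  // namespace

// Waits for every handle by registering one thread-pool wait per handle.
bool WaitForAllWithRegisteredWaits(const std::vector<HANDLE>& handles) {
    if (handles.empty()) return true;
    WaitAllContext context(handles.size());
    if (!context.allDone) return false;

    std::vector<HANDLE> waits;
    bool ok = true;
    for (HANDLE h : handles) {
        HANDLE wait = nullptr;
        if (!RegisterWaitForSingleObject(&wait, h, OnHandleSignaled, &context,
                                         INFINITE, WT_EXECUTEONLYONCE)) {
            ok = false;
            break;
        }
        waits.push_back(wait);
    }
    if (ok) {
        ok = WaitForSingleObject(context.allDone, INFINITE) == WAIT_OBJECT_0;
    }
    // INVALID_HANDLE_VALUE makes UnregisterWaitEx block until any running
    // callback finishes, so `context` outlives every callback.
    for (HANDLE wait : waits) {
        UnregisterWaitEx(wait, INVALID_HANDLE_VALUE);
    }
    return ok;
}
#endif
```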
The customer thought these approaches were unnecessarily complicated. Why not just divide the objects into groups of at most MAXIMUM_WAIT_OBJECTS and call WaitForMultipleObjects on each group in a loop? "Is there some subtlety that we're missing?"
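The customer's proposal might look something like the following minimal sketch. GroupRanges and WaitForAllInGroups are hypothetical names; the grouping helper is portable, while the waiting loop compiles only when _WIN32 is defined. Each group is passed to WaitForMultipleObjects with bWaitAll set to TRUE, and a return value at or above WAIT_OBJECT_0 plus the group size is treated as failure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits `count` items into consecutive groups of at most `maxPerGroup`.
// Each pair is (offset of the first item, number of items in the group).
std::vector<std::pair<std::size_t, std::size_t>>
GroupRanges(std::size_t count, std::size_t maxPerGroup) {
    assert(maxPerGroup > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t offset = 0; offset < count; offset += maxPerGroup) {
        ranges.emplace_back(offset, std::min(maxPerGroup, count - offset));
    }
    return ranges;
}

#ifdef _WIN32
#include <windows.h>

// Waits for each group in turn. Only appropriate for handles whose waits
// have no side effects and which stay signaled (threads, processes).
bool WaitForAllInGroups(const std::vector<HANDLE>& handles) {
    for (auto [offset, size] : GroupRanges(handles.size(), MAXIMUM_WAIT_OBJECTS)) {
        DWORD n = static_cast<DWORD>(size);
        DWORD result = WaitForMultipleObjects(n, handles.data() + offset,
                                              TRUE, INFINITE);
        if (result >= WAIT_OBJECT_0 + n) {  // WAIT_FAILED or WAIT_ABANDONED_0 + i
            return false;
        }
    }
    return true;
}
#endif
```

For 130 handles and groups of 64, GroupRanges yields the ranges (0, 64), (64, 64), and (128, 2).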
Process handles and thread handles have the property that waits on them are idempotent. Waiting on a process or thread handle waits for the process or thread to exit, but it has no effect on the process or thread itself. Some other types of handles are different: waiting on semaphores, mutexes, and auto-reset events has side effects, namely consuming a semaphore token, taking ownership of the mutex, and resetting the auto-reset event.
Furthermore, process and thread handles also have the property that once they become signaled, they never become unsignaled. This means that once you have successfully waited on them to become signaled, you don’t have to worry about the possibility that in the future, they might not be signaled any more.
Therefore, if you are waiting for a group of process and thread handles to all be signaled, you are free to wait for them in any order. You don't need the special behavior of WaitForMultipleObjects, which applies no wait side effects until all the objects are signaled simultaneously.
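To make the difference concrete, here is a toy model in plain C++ (no Windows APIs are involved; ThreadHandleModel, AutoResetEventModel, TryWaitEachInTurn, and CountSignaled are hypothetical). It captures only what a successful wait does to the object being waited on, and it waits on the objects one at a time, so side effects happen as each individual wait succeeds rather than all at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Like a thread or process handle: waiting has no side effect, and once
// signaled it stays signaled.
struct ThreadHandleModel {
    bool signaled = false;
    bool IsSignaled() const { return signaled; }
    bool TryWait() const { return signaled; }
};

// Like an auto-reset event: a successful wait resets the event.
struct AutoResetEventModel {
    bool signaled = false;
    bool IsSignaled() const { return signaled; }
    bool TryWait() {
        if (!signaled) return false;
        signaled = false;  // the side effect of a successful wait
        return true;
    }
};

// Waits on the objects one at a time, stopping at the first one that is
// not yet signaled. Earlier successful waits are not undone.
template <typename Object>
bool TryWaitEachInTurn(std::vector<Object>& objects) {
    for (auto& object : objects) {
        if (!object.TryWait()) return false;
    }
    return true;
}

// Counts how many objects are currently signaled, without waiting.
template <typename Object>
std::size_t CountSignaled(const std::vector<Object>& objects) {
    std::size_t count = 0;
    for (const auto& object : objects) {
        if (object.IsSignaled()) ++count;
    }
    return count;
}
```

With the first two objects signaled and the third not, a sequential pass stops at the third either way. Afterward, the two thread models are still signaled, but the two event models have been reset, so a later retry can never see all three events signaled at once.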
So yes, you can wait for them in blocks of MAXIMUM_WAIT_OBJECTS. But really, even that is too much work. You can just wait for them one at a time:
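for (auto&& handle : m_threadHandles)
{
    // Thread handles stay signaled once the thread exits, so the
    // order of these waits doesn't matter.
    REQUIRE(WaitForSingleObject(handle, INFINITE)
            == WAIT_OBJECT_0);
}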
A performance question: doesn't this cost more system calls, one NtWaitForSingleObject per handle instead of a single NtWaitForMultipleObjects?
This post finally pushed me to investigate how the Vista+ Thread Pool manages (on Windows 8 and later) to wait for more than 64 events on a single thread. I'd been wondering whether I could reuse the underlying mechanism (the NT API) with a custom I/O completion port, outside the system Thread Pool. It turns out to be pretty simple.
Is there any particular reason this functionality hasn't been exposed in the Win32 API for general use?
People are still battling with the MAXIMUM_WAIT_OBJECTS limit.
I believe there's a good reason for the MAXIMUM_WAIT_OBJECTS limit. The system doesn't know which arguments have changed between two consecutive calls to WaitForMultipleObjects(Ex); you may have added (or removed) just one HANDLE, or you may have replaced ALL of them. As a result, the system has to scan every HANDLE passed in on each invocation. In addition, the system has to remove the waits on return and re-arm them on the next call (because some waits have side effects, e.g., mutexes). All of that is costly, and to lower the overhead you spread the handles across batches and make multiple parallel invocations.
With IOCP...
Why would you want to make two consecutive calls to WaitForMultipleObjects(Ex)? Aren't kernel handles reference-counted?
WaitForMultipleObjects operates like select and poll on Unix, and the limitations you mentioned are also limitations of select and poll. The problem is that epoll, Linux's rough equivalent to IOCP, supports most types of file descriptors that exist on Linux, including eventfd. IOCP, on the other hand, supports only a small set of operations, mostly reading, writing, and accepting connections, and that's despite Jan's discovery that the NT kernel apparently does contain all the facilities needed to use events and mutexes with IOCP after all.

The reasoning on the limit makes sense, which makes it even more curious that there's no API that would let applications be more efficient. And yes, associating handles with an IOCP solves it only partially since, for example, it can't handle acquiring mutexes.
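For readers unfamiliar with the Linux side, here is a minimal, Linux-only sketch of the eventfd-plus-epoll combination mentioned above. ReadyCountAfterSignal is a hypothetical helper: it registers an eventfd with an epoll instance, optionally signals it by writing to its counter, and polls with a zero timeout.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Returns how many descriptors epoll_wait reports as ready (0 or 1),
// or -1 if any setup call fails.
int ReadyCountAfterSignal(bool signal) {
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (efd < 0) return -1;
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        close(efd);
        return -1;
    }

    epoll_event ev{};
    ev.events = EPOLLIN;  // readable whenever the eventfd counter is nonzero
    ev.data.fd = efd;
    bool ok = epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0;
    if (ok && signal) {
        std::uint64_t one = 1;  // eventfd reads and writes are 8-byte counters
        ok = write(efd, &one, sizeof one) == static_cast<ssize_t>(sizeof one);
    }

    int ready = -1;
    if (ok) {
        epoll_event out[1];
        ready = epoll_wait(epfd, out, 1, 0);  // timeout 0: check without blocking
    }
    close(epfd);
    close(efd);
    return ready;
}
```

An unsignaled eventfd reports nothing ready; after a write, epoll reports it as readable alongside whatever sockets or other descriptors the same epoll instance is watching.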
When I first learned about threads in C++, it took me a bit of time to wrap my head around the fact that, given std::thread t(…), u(…), v(…);, you can simply call t.join(); u.join(); v.join(); to wait for all of them to finish, even though the threads themselves may finish in any order. Of course, with std::jthread, even explicitly calling join() has become unnecessary.
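The same "wait one at a time, in any order" reasoning can be seen in a small portable sketch. FinishOrderWhenJoiningInOrder is a hypothetical helper: it starts threads that tend to finish in reverse order of creation, joins them in creation order, and returns the order in which they actually finished. join() on a thread that has already exited returns immediately, so the fixed join order still waits for every thread.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Joins threads in creation order and reports the order they finished in.
std::vector<int> FinishOrderWhenJoiningInOrder(int threadCount) {
    std::mutex mutex;
    std::vector<int> finished;
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i) {
        threads.emplace_back([i, threadCount, &mutex, &finished] {
            // Later threads sleep less, so they tend to finish first.
            std::this_thread::sleep_for(
                std::chrono::milliseconds(10 * (threadCount - i)));
            std::lock_guard<std::mutex> lock(mutex);
            finished.push_back(i);
        });
    }
    // Joining in creation order still waits for all of them; joining a
    // thread that has already exited simply returns.
    for (auto& t : threads) {
        t.join();
    }
    return finished;
}
```

The finishing order depends on scheduling, but every thread has finished by the time the function returns.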