Some time ago, I wrote about waiting for more than MAXIMUM_WAIT_OBJECTS threads and explained that you can wait for them in blocks of MAXIMUM_WAIT_OBJECTS at a time, or just take the simpler solution of waiting for them one at a time.
There was some concern over the loss of efficiency in waiting for threads one at a time instead of in bulk.
This is a valid concern, but as with most issues in software engineering, you have to balance the cost against the benefit.
The cost is that you have to write a bunch of code to break down the list of threads into blocks of MAXIMUM_WAIT_OBJECTS so you can wait for each block, while dealing with boundary cases like zero, N × MAXIMUM_WAIT_OBJECTS, and N × MAXIMUM_WAIT_OBJECTS − 1. This code may not be that hard to write, but you still have to keep those boundary cases in mind. On the other hand, waiting for them one at a time is straightforward, can take advantage of language facilities you already have (like thread.join()), and handles boundary cases more easily.
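To make the comparison concrete, here is a minimal sketch of both approaches in Win32-flavored C++, assuming the thread handles have already been collected into a std::vector<HANDLE>. The function names are made up for illustration, and error handling is omitted for brevity.

#include <windows.h>
#include <vector>

// The block-based version: note the bookkeeping for a partial final
// block, and that an empty vector has to fall out naturally.
void WaitForAllInBlocks(std::vector<HANDLE> const& handles)
{
    size_t index = 0;
    while (index < handles.size()) {
        size_t remaining = handles.size() - index;
        DWORD count = remaining < MAXIMUM_WAIT_OBJECTS
                    ? static_cast<DWORD>(remaining)
                    : MAXIMUM_WAIT_OBJECTS;
        WaitForMultipleObjects(count, handles.data() + index,
                               TRUE /* wait for all */, INFINITE);
        index += count;
    }
}

// The one-at-a-time version: no blocking logic, no boundary cases.
void WaitForAllOneAtATime(std::vector<HANDLE> const& handles)
{
    for (HANDLE h : handles) {
        WaitForSingleObject(h, INFINITE);
    }
}

The second version is shorter, and there is nothing in it to get wrong.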
The benefit is that if you anticipate that these threads will not be signaled for a long time, you avoid 2n context switches: waiting one at a time costs two switches per thread, one to wake your thread when each handle is signaled and one to put it back to sleep for the next. You also get to unschedule your thread for an extended period of time, which could in turn reduce memory pressure on the system because the data associated with the thread (like its stack) can be paged out.
But let’s look at the original scenario:
A customer had a test that created a lot of threads, and they wanted their test to wait for all of the threads to exit before proceeding to the next step.
This was a test! Most of the time, the performance of a test is irrelevant. What you care about is that the test accurately tests the thing that it is trying to test. The simpler you write the test, the more likely the test will be correct. The performance of the thing you’re trying to test might be important, but the performance of the test itself is not usually important. You just have to make sure that the test accurately measures the performance of the thing being tested.
As with most questions of performance, the goal is to improve the code to the point that it meets its performance targets. Adding complexity to exceed those targets is not usually all that helpful.
From the problem description, it appears that the test is performing a series of steps, and the test wants to make sure it is finished with step N before moving on to step N+1. If it spawns n helper threads, the fancy WaitForMultipleObjects version will save roughly 2n context switches, but this happens only once per test run, so there is no additional multiplier. And compared to the other work being done by the test, 2n context switches is probably not going to show up as a performance issue.
In this case, if the extra context switches waiting for all the test threads become a bottleneck, you can just use an atomic counter that starts with the value n, accompanied by an unset event that the main thread waits for. Each test thread atomically decrements the counter when it finishes, and the thread that drops the counter to zero signals the event.
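Here is a minimal sketch of that pattern, assuming Win32 threads and a manual-reset event; the TestGate name and its methods are hypothetical, and error handling is again omitted.

#include <windows.h>

// Hypothetical helper: an atomic countdown paired with a
// manual-reset event that starts out unsignaled.
struct TestGate
{
    LONG volatile remaining; // test threads still running
    HANDLE done;             // manual-reset event, initially unset

    explicit TestGate(LONG n)
        : remaining(n)
        , done(CreateEventW(nullptr, TRUE /* manual reset */,
                            FALSE /* initially unset */, nullptr)) {}
    ~TestGate() { CloseHandle(done); }

    // Each test thread calls this as its last action.
    void ThreadFinished()
    {
        if (InterlockedDecrement(&remaining) == 0) {
            SetEvent(done); // last one out wakes the main thread
        }
    }

    // The main thread makes a single wait call, regardless of n.
    void WaitForAll()
    {
        WaitForSingleObject(done, INFINITE);
    }
};

With this in place, the main thread performs one wait no matter how many test threads there are, so the context-switch cost stops scaling with n.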
But I doubt you’ll ever need to write this code. The extra context switches will not be a problem in your test.
And you know what? I bet the extra context switches will not be a problem in production either.¹
¹ The only case I can think of where it could be a problem is if you do this multi-thread wait for a huge number of long-running threads. Most programs don’t do this, in part because a huge number of threads is probably going to create additional problems on its own. And also because the extra cost is amortized across the threads themselves: Two context switches is probably not measurable when compared to the cost of creating a thread in the first place.
The savings will probably not pay for themselves compared to the added cost of development, testing, and maintenance resulting from the code complexity.
I suspect that questions of this nature arise from a faulty generalization of otherwise sound principles. If you are working in a highly concurrent system, the usual rule of thumb is to avoid doing things one at a time unless the operation is inherently synchronous and can't be parallelized, because serial execution imposes a fairly significant bottleneck. That is, you're only making forward progress on one thing at a time, instead of N at a time.