Does Parallel.For use one Task per iteration?

In .NET 4, the new Parallel class provides For, ForEach, and Invoke methods for performing operations in parallel. One mental model that some folks use when thinking about Parallel.For is that it’s equivalent to running one System.Threading.Tasks.Task per iteration, e.g. that a loop like:

Parallel.For(0, N, i =>
{
DoWork(i);
});

is equivalent to:

var tasks = new List<Task>(N);
for(int i=0; i<N; i++)
{
tasks.Add(Task.Factory.StartNew(state => DoWork((int)state), i));
}
Task.WaitAll(tasks.ToArray());

From the perspective of every iteration potentially running in parallel with every other iteration, this is an ok mental model. However, it can also lead to some misunderstandings as to Parallel’s behavior. Parallel, in fact, does not necessarily use one Task per iteration, as that could add significantly more overhead than is necessary.

Under the covers, Parallel.For tries to use the minimum number of tasks necessary to complete the loop as fast as possible. A better mental model includes Parallel spinning up tasks as threads are available to process those tasks, and each of those tasks then participating in a range management scheme. A task asks for a chunk to be done (which may include multiple iterations), processes that work, and then goes back for more. The chunk sizes may vary based on a variety of factors, including the number of tasks participating, the load on the machine, the size of the iteration space, and so forth.

One of the most important takeaways from this revised mental model is that iterations are handed out in indivisible chunks, and only one thread is involved in the processing of a particular chunk. This has implications for interdependencies between iterations. If iteration i blocks waiting for iteration i+1 to be completed, and iterations i and i+1 are both allocated to the same chunk, the loop will likely deadlock. The thread processing those iterations will block processing iteration i, but as that thread is also responsible for processing iteration i+1, iteration i+1 will never get processed and iteration i will never unblock.

As such, if you find yourself in an obscure situation like this, you have several options. You can fall back to using a Task per iteration, as per the original model shown previously. You can also take advantage of custom partitioning support, which we’ll discuss more in a future post. Of course, Parallel provides significantly more functionality than is shown in that previous Tasks-based code snippet. Parallel tracks unhandled exceptions and prevents additional iterations from starting in the event of an exception. It supports cancellation, setting a maximum degree of parallelism, targeting a particular scheduler, breaking out of a loop early with ParallelLoopState.Stop/Break, and so forth. All of this functionality could be layered on top of Tasks with extra code and effort; after all, that’s how Parallel works (at least in the current bits; the implementation may of course change in the future).

Parallel.ForEach has similar implications as Parallel.For. A decent mental model for Parallel.ForEach when the source is indexible (like an array or a list) involves simply delegating to Parallel.For. For example, the following Parallel.ForEach loop:

T [] source = …;
Parallel.ForEach(source, item =>
{
DoWork(item);
});

could be thought of as being implemented like this:

T [] source = …;
Parallel.For(0, source.Length, i=>
{
var item = source[i];
DoWork(item);
});

Since our model for Parallel.For involves chunking, so too does our model for Parallel.ForEach. This is also true when Parallel.ForEach works with a non-indexable IEnumerable<T> as the data source. IEnumerable<T> isn’t thread-safe (even if the data source being enumerated is in some fashion), in that it requires two calls to retrieve the next item: one to MoveNext() to advance the enumerator, and one to Current to retrieve the item. Thus, when working with an enumerator, the threads involved in a Parallel.ForEach (as is also true with PLINQ) need to use a lock while accessing an enumerable data source. Taking that lock for each item in the data source proves typically to be prohibitively expensive, and thus Parallel.ForEach ammortizes the cost of the lock across multiple items by chunking the input source: acquire the lock, enumerate some number of elements from the data source into a private data structure, release the lock, and then process that privatized chunk (we’ll discuss this in more detail in a future post). As a result, this suffers from the same cross-iteration dependency issue as previously described. If the processing of element i in the enumerable blocks waiting for element i+1 to be processed, the processing may deadlock if those two elements happen to be in the same privatized chunk. As with Parallel.For, you can always fall back to using Tasks directly if you have these kind of unusual dependencies.

Parallel.Invoke is a more interesting creature. The signature of the primary Invoke overload is:

public static void Invoke(params Action[] actions);

This signature lends itself well to two different mental models. The first, as with Parallel.For/ForEach, is simply to use a Task to run each action:

var tasks = new List<Task>(actions.Length);
foreach(var action in actions)
{
tasks.Add(Task.Factory.StartNew(action));
}
Task.WaitAll(tasks.ToArray());

Another is based on observing that we’re really dealing with an array as an input data source, and we can use Parallel.ForEach:

Parallel.ForEach(actions, action => action());

In fact, the current (Beta 1) implementation of Parallel.Invoke uses approaches similar to both of these models, and it chooses which to use based on a variety of circumstances. This criteria includes the number of Actions provided (smaller numbers will favor the Task-per-Action approach, whereas larger will favor the ForEach approach) as well as whether ParallelOptions were provided to Invoke to control things like the degree of parallelism or cancellation.

At the end of the day, it’s important to remember that implementation details may change at any time. But having a good mental model for how something is implemented can also help to understand the constraints imposed by that functionality.