Parallel loop performance

We’ve received several questions on the MSDN Forums for Parallel Extensions about the performance of the Parallel class, and specifically of the loop constructs we provided in the CTP. We’re very much aware that the performance of Parallel.For/ForEach in the CTP is not optimal, and that for some situations, the overhead for these constructs will be too much for a variety of situations. That’s ok: we released the preview of the Task Parallel Library in the December CTP to solicit early feedback on the APIs we’re providing (e.g. Are they useful? What’s missing? etc.), and to do so and to get the preview out early, we did so at the expense of performance and reliability (it is a technology preview, after all). We’ll be spending a whole lot of time getting the performance of these loops to be stellar; for now, please let us know whether the APIs are right so that we can adjust if they’re not. And if you have interesting situations where the performance isn’t great, we’d love to know about that, too, so that we factor that in when determining the best implementations to ship.

That said, there will always be some cases that we won’t handle perfectly, due simply to the sheer number of variations and use cases that exist. And for those situations, we provide the lower-level Task construct with its replication capabilities to allow you to write custom loops that are best suited to your specific needs (as an example, we previously posted about how to write a parallel while loop).

Consider another situation: you want a for loop, but the body of the for loop contains so little work that even the cost of a delegate invocation (to execute the body) is too much. For example, consider this implementation of a very simple ParallelFor loop:

static void ParallelFor(
    int fromInclusive, int toExclusive, Action<int> body)
{
    int index = fromInclusive;
    Task.Create(delegate
    {
        int i;
        while ((i = Interlocked.Increment(
            ref index) – 1) < toExclusive)
        {
            body(i);
        }
    },
    TaskCreationOptions.SelfReplicating).Wait();
}

For loop bodies with a significant amount of work, this implementation isn’t all that bad. But for small bodies, there is a fairly large overhead here per iteration, in that each iteration incurs both an interlocked operation (to advance the index variable) and a delegate invocation (to run the body). Following this same approach, I could cut down on the number of interlocked operations by incrementing each time by an amount larger than 1:

static void ParallelFor(
    int fromInclusive, int toExclusive, Action<int> body)
{
    const int STRIDE = 8;
    int index = fromInclusive;
    Task.Create(delegate
    {
        int i;
        while ((i = Interlocked.Add(
            ref index, STRIDE) – STRIDE) < toExclusive)
        {
            int endExclusive =
                Math.Min(i + STRIDE, toExclusive);
            for(; i<endExclusive; i++)
            {
                body(i);
            }
        }
    },
    TaskCreationOptions.SelfReplicating).Wait();
}

In spirit, this is actually very similar to the implementation of Parallel.For in the CTP (which, based on my previous comments about CTP performance, should hint at the performance quality of this example ;). Each CPU grabs 8 elements using an interlocked addition. This decreases the number of interlocked operations per iteration from 1 to approximately 1/8, but it does so at the expense of load balancing. And still, for each iteration we have the cost of a delegate invocation in order to execute the body of the loop as provided by the caller. In fact, no matter what we do here, with this overall design of ParallelFor, we won’t be able to avoid that delegate invocation, which could be prohibitive for certain problems with very, very small bodies.

To fix that, we could go with a different design:

static void ParallelForRange(
int fromInclusive, int toExclusive,
Action<int, int> body);

The idea here is that rather than providing a loop body that accepts the current iteration, the loop body accepts two integers, one representing the start of a range and the other representing the end (just as with fromInclusive and toExclusive, but on a smaller scale):

static void ParallelForRange(
    int fromInclusive, int toExclusive,
    Action<int, int> body)
{
    const int STRIDE = 8;
    int index = fromInclusive;
    Task.Create(delegate
    {
        int i;
        while ((i = Interlocked.Add(
            ref index, STRIDE) – STRIDE) < toExclusive)
        {
            body(i, Math.Min(i + STRIDE, toExclusive));
       }
    },
    TaskCreationOptions.SelfReplicating).Wait();
}

To use this, rather than doing the following as you would with the ParallelFor method:

ParallelFor(0, N, i =>
{
// original loop body here
});

you could do the following:

ParallelForRange(0, N, (fromInclusive, toExclusive) =>
{
    for (int i = fromInclusive; i < toExclusive; i++)
    {
        // original loop body here
    }
});

Just as with the previous ParallelFor implementation, where we used a stride to decrease the number of interlocked operations, here in ParallelForRange we’ve used that same stride to decrease the number of delegate invocations, at the expense of forcing the user to implement their own loop inside of the parallel loop.

Now, as previously mentioned, this striding approach has a downside of decreasing the possibility for load balancing. The implementation of Parallel.For that we eventually ship will do its best to both decrease overhead per iteration and maximize load balancing. But there may still be cases where the cost of the delegate invocation per iteration will be too much. Forgetting the implementation details for a moment, would it be useful for you if in addition to Parallel.For, we also provided a Parallel.ForRange with a design similar to the method I just showed? We’d love your feedback on issues like this, as we want to make sure we’re providing the right solutions for your problem domain. And are there other common looping constructs that you’d like to see us provide? Now is a great time to download the CTP, try it out, and send us your feedback… we’re listening.