October 12th, 2009

Parallelized Map and Filter Operations

Stephen Toub - MSFT
Partner Software Engineer

Common operations like map and filter are available in parallelized form through PLINQ, though the names differ.  A map can be achieved with PLINQ’s Select operator, and a filter with PLINQ’s Where operator.

For example, I could implement a ParallelMap operation that takes in one array and returns another as follows:

public static TOutput [] ParallelMap<TInput,TOutput>(
    TInput [] input, Func<TInput,TOutput> map)
{
    return input.AsParallel().Select(map).ToArray();
}

and a ParallelFilter operation could be implemented in a similar fashion, based on PLINQ’s Where operator:

public static TInput [] ParallelFilter<TInput>(
    TInput [] input, Func<TInput,Boolean> filter)
{
    return input.AsParallel().Where(filter).ToArray();
}

In general, PLINQ is the recommended way to achieve such parallelization.  Not only is it useful for individual operations like this, but it’s also a great way to compose operations.  For example, if I wanted to write a ParallelFilterAndMap, I could achieve that as follows:

public static TOutput [] ParallelFilterAndMap<TInput,TOutput>(
    TInput [] input, Func<TInput,Boolean> filter, Func<TInput,TOutput> map)
{
    return input.AsParallel().Where(filter).Select(map).ToArray();
}

It is also possible to build such operations on top of Parallel.*.  For example, here’s ParallelMap re-implemented in terms of Parallel.For:

public static TOutput [] ParallelMap<TInput,TOutput>(
    TInput [] input, Func<TInput,TOutput> map)
{
    var output = new TOutput[input.Length];
    Parallel.For(0, input.Length, i => output[i] = map(input[i]));
    return output;
}

and ParallelFilterAndMap re-implemented in terms of Parallel.ForEach and ConcurrentBag<T>:

public static TInput [] ParallelFilterAndMap<TInput,TOutput>(
    TInput [] input, Func<TInput,Boolean> filter, Func<TInput,TOutput> map)
{
    var output = new ConcurrentBag<T>();
    Parallel.ForEach(input, item =>
    {
        if (filter(item)) output.Add(map(item));
    });
    return output.ToArray();
}

As mentioned, in most cases you’re better off using PLINQ for these kinds of operations.  However, there may be cases where a Parallel.*-based version is best.  For example, let’s say that you wanted to implement an in-place parallel map operation, e.g. a map operation that maps from TData to TData, and that stores the new output into the input array.  PLINQ doesn’t modify the input data source, so if you wanted that functionality, you could build it on top of Parallel.For:

public static void ParallelMap<TData>(
    TData [] data, Func<TData,TData> map)
{
    Parallel.For(0, data.Length, i => data[i] = map(data[i]));
}

There are a few other subtle differences that you could take advantage of by building on Parallel.For/ForEach instead of on PLINQ.  For example, PLINQ uses a static number of threads for the processing of a given query (which defaults to Environment.ProcessorCount), whereas the number of threads used by Parallel.* may vary over time as influenced by the ThreadPool’s thread injection and retirement logic.  Parallel.* can also be targeted to a specific and custom TaskScheduler, something not supported by PLINQ in .NET 4 (PLINQ always runs on TaskScheduler.Default, which is based on the .NET 4 ThreadPool).

Happy coding.

Author

Stephen Toub - MSFT
Partner Software Engineer

Stephen Toub is a developer on the .NET team at Microsoft.

0 comments

Discussion are closed.