November 16th, 2012

PLINQ and Int32.MaxValue

Stephen Toub - MSFT
Partner Software Engineer

In both .NET 4 and .NET 4.5, PLINQ supports enumerables with up to Int32.MaxValue elements; beyond that limit, PLINQ throws an overflow exception.  LINQ to Objects itself has this limitation with certain query operators (such as the indexed Select operator, which counts the elements it processes), but PLINQ has it with more of its operators.
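To see the LINQ to Objects case concretely, here's a minimal sketch of the indexed Select behavior (purely illustrative, and slow to run, since it genuinely enumerates more than two billion elements; the VeryLongSequence helper is made up for this demo):

using System;
using System.Collections.Generic;
using System.Linq;

static class OverflowDemo
{
    // A sequence with a few more than Int32.MaxValue elements.
    static IEnumerable<byte> VeryLongSequence()
    {
        for (long i = 0; i < (long)int.MaxValue + 2; i++)
            yield return 0;
    }

    static void Main()
    {
        // The indexed Select overload counts elements with an Int32 index;
        // once that counter can no longer be incremented, LINQ to Objects
        // throws an OverflowException.
        foreach (var _ in VeryLongSequence().Select((b, index) => index)) { }
    }
}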

This limitation impacts so few scenarios that it’s relatively benign and I rarely hear about it.  That said, it does come up now and again, and I was in fact asked about it earlier this week.  So, to help in case anyone else runs into this…

There is a relatively straightforward workaround you can apply if you do run up against this limit: ensure the enumerable passed to PLINQ has no more than Int32.MaxValue elements.  That answer might sound flippant and impractical, but in reality it’s often easily accomplished by batching a longer enumerable into multiple shorter enumerables, none of which exceeds the limit.

Consider a basic query like the following:

IEnumerable<Output> outputs =
    from input in inputs.AsParallel()
    where Filter(input)
    select Map(input);

If the inputs enumerable has more than Int32.MaxValue elements, this query will result in an overflow.  To work around it, imagine we had a Batch extension method for IEnumerable<T> that partitioned the IEnumerable<T> into multiple sequential IEnumerable<T> instances, each with no more than Int32.MaxValue elements.  With that, we could rewrite the query as follows:

IEnumerable<Output> outputs = inputs.Batch(Int32.MaxValue).SelectMany(batch =>
    from input in batch.AsParallel()
    where Filter(input)
    select Map(input)
);

The original query remains intact, except that instead of operating on the original inputs enumerable, it operates on each batch enumerable handed out by the Batch method.  We now parallelize the processing of each sequential batch, moving on to parallelize the next batch once we’ve finished the previous one.
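For reference, the same rewrite in method syntax makes the composition explicit (a sketch using the same placeholder Filter and Map methods from above):

IEnumerable<Output> outputs = inputs
    .Batch(Int32.MaxValue)                      // sequential batches, each within the limit
    .SelectMany(batch => batch.AsParallel()     // parallelize within each batch
                              .Where(input => Filter(input))
                              .Select(input => Map(input)));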

Such a Batch method isn’t actually built into .NET 4.5, but writing one is relatively straightforward.  Here’s one implementation, which you can customize as your needs demand:

using System.Collections.Generic;

static class MyLinqExtensions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(
        this IEnumerable<T> source, int batchSize)
    {
        using (var enumerator = source.GetEnumerator())
        {
            // Each MoveNext here pulls the first element of the next batch;
            // YieldBatchElements then yields it plus up to batchSize - 1 more
            // from the same shared enumerator.  Each batch must therefore be
            // fully enumerated before the next one is requested.
            while (enumerator.MoveNext())
                yield return YieldBatchElements(enumerator, batchSize - 1);
        }
    }

    private static IEnumerable<T> YieldBatchElements<T>(
        IEnumerator<T> source, int batchSize)
    {
        // First, the element the caller already positioned the enumerator on...
        yield return source.Current;
        // ...then up to batchSize more elements.
        for (int i = 0; i < batchSize && source.MoveNext(); i++)
            yield return source.Current;
    }
}
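Here’s a quick, hypothetical sanity check of the helper on a small sequence.  Note that because the batches share a single enumerator, each batch must be fully consumed before moving on to the next, which is exactly how the SelectMany-based query above consumes them.

using System;
using System.Linq;

static class BatchDemo
{
    static void Main()
    {
        // Partition 0..9 into batches of at most 4 elements.
        foreach (var batch in Enumerable.Range(0, 10).Batch(4))
            Console.WriteLine(string.Join(", ", batch));
        // Output:
        // 0, 1, 2, 3
        // 4, 5, 6, 7
        // 8, 9
    }
}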

