The post Recursion and Concurrency appeared first on .NET Parallel Programming.
class Tree<T>
{
public Tree<T> Left, Right; // children
public T Data; // data for this node
}
Let’s say we want to execute an Action<T> for each datum stored in the tree, and let’s assume we don’t care about order (introducing parallelism could be questionable if we did). That’s straightforward to do sequentially and recursively:
public static void Process<T>(Tree<T> tree, Action<T> action)
{
if (tree == null) return;

// Process the current node, then the left, then the right
action(tree.Data);
Process(tree.Left, action);
Process(tree.Right, action);
}
I could also do so without recursion by maintaining an explicit stack (or queue, or some other data structure with different ordering guarantees):
public static void Process<T>(Tree<T> tree, Action<T> action)
{
if (tree == null) return;
var toExplore = new Stack<Tree<T>>();
// Start with the root node
toExplore.Push(tree);
while (toExplore.Count > 0)
{
// Grab the next node, process it, and push its children
var current = toExplore.Pop();
action(current.Data);
if (current.Left != null)
toExplore.Push(current.Left);
if (current.Right != null)
toExplore.Push(current.Right);
}
}
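As a quick sanity check, the iterative version can be exercised on a small three-node tree (this assumes the Tree<T> class and the Process method shown above are in scope):

```csharp
// Build the tree:   2
//                  / \
//                 1   3
var root = new Tree<int>
{
    Data = 2,
    Left = new Tree<int> { Data = 1 },
    Right = new Tree<int> { Data = 3 }
};

var seen = new List<int>();
Process(root, x => seen.Add(x));

// The right child is pushed last and therefore popped first,
// so this traversal visits 2, then 3, then 1.
foreach (int value in seen) Console.Write(value + " "); // prints "2 3 1 "
```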
Now, let’s assume the action we’re performing on each node of the tree is independent and relatively expensive, and/or that the tree is relatively large, and as such we want to process the tree in parallel (we’re of course also assuming that the action delegate is thread-safe), meaning that we want multiple threads each running the action delegate on distinct tree nodes. How do we do this with what we have in .NET today?
There are multiple approaches, some more valid than others. The first thing someone might try is to follow the original recursive implementation but using the ThreadPool, which could look something like this:
public static void Process<T>(Tree<T> tree, Action<T> action)
{
if (tree == null) return;

// Use an event to prevent this method from
// returning until its children have completed
using (var mre = new ManualResetEvent(false))
{
// Process the left child asynchronously
ThreadPool.QueueUserWorkItem(delegate
{
Process(tree.Left, action);
mre.Set();
});

// Process the current node and the right child synchronously
action(tree.Data);
Process(tree.Right, action);

// Wait for the left child
mre.WaitOne();
}
}
The idea behind this implementation is to, given a node, spin up a work item to process that node’s left child in parallel with the current node, and then process the current node’s data as well as its right child. Of course, I could be losing out on some parallelism here as I delay processing of the right child until I’m done processing the current data. So we modify it slightly:
public static void Process<T>(Tree<T> tree, Action<T> action)
{
if (tree == null) return;

// Use an event to wait for the children
using (var mre = new ManualResetEvent(false))
{
int count = 2;

// Process the left child asynchronously
ThreadPool.QueueUserWorkItem(delegate
{
Process(tree.Left, action);
if (Interlocked.Decrement(ref count) == 0)
mre.Set();
});

// Process the right child asynchronously
ThreadPool.QueueUserWorkItem(delegate
{
Process(tree.Right, action);
if (Interlocked.Decrement(ref count) == 0)
mre.Set();
});

// Process the current node synchronously
action(tree.Data);

// Wait for the children
mre.WaitOne();
}
}
I’ve now fixed that issue, such that both the left and the right children can potentially be processed in parallel with the current node, but that was by far not the worst problem. For starters, I’m creating a ManualResetEvent for every node in the tree, and that’s expensive: ManualResetEvent is a thin wrapper around a Win32 kernel event primitive, so creating one of these things requires kernel transitions, as does setting and waiting on one.

Next, every time I process a node, I block waiting for its children to complete. And as the processing of every node but the root is happening on a thread from the ThreadPool, I’m blocking ThreadPool threads. If a ThreadPool thread gets blocked, the ThreadPool will need to inject additional threads in order to process the remaining work items, and thus this implementation will require approximately one thread from the pool per node in the tree. That’s a lot of threads! And that carries with it some serious problems. By default, a thread in .NET has a megabyte of stack space committed for it, so each thread burns a megabyte of (virtual) memory. The ThreadPool also throttles the creation of additional threads, such that introducing a new thread (once the number of pool threads equals the number of processors) will take 500 ms. For a tree of 250 nodes, that means its processing will take close to two minutes, purely for the overhead of creating threads, never mind the actual processing of the nodes.

And worse, there is a maximum number of threads in the pool: by default 25 per processor in .NET 1.x and 2.0, and 250 per processor in .NET 2.0 SP1. If the pool reaches the maximum, no new threads will be created, and thus this implementation could deadlock: parent nodes will be waiting for their child nodes to complete, but the child nodes can’t be processed until their parent nodes complete and relinquish threads from the pool on which the children can execute.
Obviously, we need a better implementation. Next up, we can walk the tree sequentially, queuing up a work item in the pool for each node in the tree. Here’s that approach, based on the recursive implementation of a tree walk:
public static void Process<T>(Tree<T> tree, Action<T> action)
{
if (tree == null) return;

// Use an event to wait for all of the nodes to complete
using (var mre = new ManualResetEvent(false))
{
int count = 1;
// Recursive delegate to walk the tree
Action<Tree<T>> processNode = null;
processNode = node =>
{
if (node == null) return;
// Asynchronously run the action on the current node
Interlocked.Increment(ref count);
ThreadPool.QueueUserWorkItem(delegate
{
action(node.Data);
if (Interlocked.Decrement(ref count) == 0)
mre.Set();
});
// Process the children
processNode(node.Left);
processNode(node.Right);
};
// Start off with the root node
processNode(tree);

// Release the initial count and wait for all of the actions to complete
if (Interlocked.Decrement(ref count) == 0)
mre.Set();
mre.WaitOne();
}
}
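As an aside, .NET 4 later introduced CountdownEvent, which encapsulates exactly this count-and-signal pattern (initialize a count, increment per work item, signal per completion, wait for zero). A sketch of the same tree walk using it, assuming the Tree<T> type from earlier, might look like:

```csharp
public static void Process<T>(Tree<T> tree, Action<T> action)
{
    if (tree == null) return;
    // Start the count at 1 to represent the walking thread itself
    using (var ce = new CountdownEvent(1))
    {
        Action<Tree<T>> processNode = null;
        processNode = node =>
        {
            if (node == null) return;
            ce.AddCount(); // one signal expected per queued work item
            ThreadPool.QueueUserWorkItem(delegate
            {
                action(node.Data);
                ce.Signal();
            });
            processNode(node.Left);
            processNode(node.Right);
        };
        processNode(tree);
        ce.Signal(); // remove the walking thread's initial count
        ce.Wait();   // block until every queued action has signaled
    }
}
```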
The post PLINQ at Seattle Code Camp appeared first on .NET Parallel Programming.
The post Parallel Aggregations in PLINQ appeared first on .NET Parallel Programming.
In order to explain the issues we encounter when parallelizing aggregations in PLINQ, let’s first take a quick look at how aggregations work in LINQ.
Aggregation is an operation that iterates over a sequence of input elements, maintaining an accumulator that contains the intermediate result. At each step, a reduction function takes the current element and accumulator value as inputs, and returns a value that will overwrite the accumulator. The final accumulator value is the result of the computation. A variety of interesting operations can be expressed as aggregations: sum, average, min, max, sum of squares, variance, concatenation, count, count of elements matching a predicate, and so on.
LINQ provides several overloads of Aggregate. A possible implementation (without error checking) of the most general of them is given below:
public static TResult Aggregate<TSource, TAccumulate, TResult>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> func,
Func<TAccumulate, TResult> resultSelector
)
{
TAccumulate accumulator = seed;
foreach (TSource elem in source)
{
accumulator = func(accumulator, elem);
}
return resultSelector(accumulator);
}
To compute a particular aggregation, the user provides the input sequence (as method parameter source), the initial accumulator value (seed), the reduction function (func), and a function to convert the final accumulator to the result (resultSelector). As a usage example, consider the method below that computes the sum of squares of integers:
public static int SumSquares(IEnumerable<int> source)
{
return source.Aggregate(0, (sum, x) => sum + x * x, (sum) => sum);
}
LINQ also exposes a number of predefined aggregations, such as Sum, Average, Max, Min, etc. Even though each one can be implemented using the Aggregate operator, a direct implementation is likely to be more efficient (for example, to avoid a delegate call for each input element).
Let’s say that we call SumSquares(Enumerable.Range(1,4)) on a dual-core machine. How can we split up the computation among two threads? We could distribute the elements of the input among the threads. For example, Thread 1 could compute the sum of squares of {1,4} and Thread 2 would compute the sum of squares of {3,2}*. Then, as a last step, we combine the results – add them in this case – and we get the final answer.
Sequential Answer = ((((0 + 1^2) + 2^2) + 3^2) + 4^2) = 30
Parallel Answer = (((0 + 1^2) + 4^2) + ((0 + 3^2) + 2^2)) = 30
*Note: Notice that elements within each partition do not necessarily appear in the order in which they appear in the input. The reason for this may not be apparent, but it has to do with the presence of other operators in the query.
In the parallel aggregation, we need to do something that we didn’t need to in the sequential aggregation: combine the intermediate results (i.e. accumulators). Notice that combining two accumulators may be a different operation than combining an accumulator with an input element. In the SumSquares example, to combine the accumulator with an input element, we square the element and add it to the accumulator. But, to combine two accumulators, we simply add them, without squaring the second one.
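The two operations for SumSquares can be written out explicitly to make the distinction concrete (the delegate names here are illustrative):

```csharp
// Combines an accumulator with an input element: square the element, then add
Func<int, int, int> intermediateReduce = (sum, x) => sum + x * x;

// Combines two accumulators from different partitions: just add, no squaring
Func<int, int, int> finalReduce = (sum1, sum2) => sum1 + sum2;

// The two partitions from the example above:
int acc1 = intermediateReduce(intermediateReduce(0, 1), 4); // 0 + 1^2 + 4^2 = 17
int acc2 = intermediateReduce(intermediateReduce(0, 3), 2); // 0 + 3^2 + 2^2 = 13
int total = finalReduce(acc1, acc2);                        // 17 + 13 = 30
```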
In the cases where the accumulator type is different from the element type, it is even more obvious that combining accumulators and combining an accumulator with an element are different operations: even their input argument types differ!
Therefore, the most general PLINQ Aggregate overload accepts an intermediate reduce function as well as a final reduce function, while the most general LINQ Aggregate only needs the intermediate reduce function. The signature of the most general PLINQ Aggregate overload is below (compare with the most general LINQ Aggregate overload shown above):
public static TResult Aggregate<TSource, TAccumulate, TResult>(
this IParallelEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> intermediateReduceFunc,
Func<TAccumulate, TAccumulate, TAccumulate> finalReduceFunc,
Func<TAccumulate, TResult> resultSelector
)
So, how to tell whether a particular aggregation can be parallelized with PLINQ? The simple approach is to imagine the above parallelization process. The input sequence will be reordered and split up into several partitions. Each partition will be accumulated separately on its own thread, with its accumulator initialized to the seed. Then, all accumulators will be combined using the final reduce function. Does this process produce the correct answer? If it does, then the aggregation can be parallelized using PLINQ.
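That mental model can be sketched directly in code. The following is purely illustrative and is not PLINQ’s actual implementation (real PLINQ handles partitioning, ordering, and exceptions very differently); the ParallelAggregate name and the strided partitioning scheme are inventions of this sketch. It assumes using directives for System, System.Collections.Generic, and System.Threading.

```csharp
// Illustrative sketch of the parallel aggregation process described above.
public static TResult ParallelAggregate<TSource, TAccumulate, TResult>(
    IList<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> intermediateReduceFunc,
    Func<TAccumulate, TAccumulate, TAccumulate> finalReduceFunc,
    Func<TAccumulate, TResult> resultSelector,
    int partitionCount)
{
    var accumulators = new TAccumulate[partitionCount];
    var threads = new Thread[partitionCount];
    for (int p = 0; p < partitionCount; p++)
    {
        int partition = p; // capture a per-iteration copy of the loop variable
        threads[p] = new Thread(() =>
        {
            // Each partition starts from its own copy of the seed
            TAccumulate acc = seed;
            // Strided partitioning: partition p takes elements p, p+N, p+2N, ...
            for (int i = partition; i < source.Count; i += partitionCount)
                acc = intermediateReduceFunc(acc, source[i]);
            accumulators[partition] = acc;
        });
        threads[p].Start();
    }
    foreach (var t in threads) t.Join();

    // Combine the per-partition accumulators with the final reduce function
    TAccumulate result = accumulators[0];
    for (int p = 1; p < partitionCount; p++)
        result = finalReduceFunc(result, accumulators[p]);
    return resultSelector(result);
}
```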
In the rest of this posting, I will describe in more depth the properties that an aggregation must have in order to parallelize correctly. In typical cases, imagining the parallelization process is the easiest way to find out whether an aggregation will produce the correct answer when run on PLINQ.
Just as in other types of PLINQ queries, delegates that form a part of the query must be pure, or at least observationally pure. So, if any shared state is accessed, appropriate synchronization must be used.
The parallel version of an aggregation does not necessarily apply the reduction functions in the same order as the sequential computation. In the SumSquares example, the sequential result is computed in a different order than the parallel result. Of course, the two results will be equal because of the special properties of the + operator: associativity and commutativity.
Operator F(x,y) is associative if F(F(x,y),z) = F(x,F(y,z)), and commutative if F(x,y) = F(y,x), for all valid inputs x, y, z. For example, operator Max is commutative because Max(x,y) = Max(y,x) and also associative because Max(Max(x,y),z) = Max(x,Max(y,z)). Operator - is not commutative because it is not true in general that x - y = y - x, and it is not associative because it is not true in general that x - (y - z) = (x - y) - z.
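A quick way to build intuition for these properties is to test them on sample values; a single counterexample disproves a property, though passing samples only suggest it holds:

```csharp
Func<int, int, int> subtract = (x, y) => x - y;
Func<int, int, int> max = Math.Max;

// Subtraction: neither commutative nor associative
Console.WriteLine(subtract(5, 3) == subtract(3, 5));                           // False: 2 != -2
Console.WriteLine(subtract(subtract(8, 4), 2) == subtract(8, subtract(4, 2))); // False: 2 != 6

// Max: both commutative and associative
Console.WriteLine(max(5, 3) == max(3, 5));                                     // True
Console.WriteLine(max(max(1, 7), 4) == max(1, max(7, 4)));                     // True
```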
The following table gives examples of operations that fall into different categories with respect to associativity and commutativity:

Neither associative nor commutative:
(a, b) => a / b
(a, b) => a - b
(a, b) => 2 * a + b

Associative but not commutative:
(string a, string b) => a.Concat(b)
(a, b) => a
(a, b) => b

Commutative but not associative:
(float a, float b) => a + b
(float a, float b) => a * b
(bool a, bool b) => !(a && b)
(int a, int b) => 2 + a * b
(int a, int b) => (a + b) / 2

Both associative and commutative:
(int a, int b) => a + b
(int a, int b) => a * b
(a, b) => Min(a, b)
(a, b) => Max(a, b)
An operation must be both associative and commutative in order for the PLINQ parallelization to work correctly. The good news is that many of the interesting aggregations turn out to be both associative and commutative.
Note: For simplicity, this section only considers aggregations where the type of the accumulator is the same as the type of the element (not only the .NET type, but also the “logical” type). After all, if the accumulator type is different from the element type, the intermediate reduction function cannot possibly be commutative because its two arguments are of different types! In the general case, the final reduction function must be associative and commutative, and the intermediate reduction function must be related to the final reduction function in a specific way. See section “Constraints on Reduce Function and Seed” for details.
LINQ allows the user to initialize the accumulator to an arbitrary seed value. In the following example, the user sets the seed to 5, and thus computes 5 + the sum of squares of integers in a sequence.
public static int SumSquaresPlus5(IEnumerable<int> source)
{
return source.Aggregate(5, (sum, x) => sum + x * x, (sum) => sum);
}
Unfortunately, if we parallelize this query, several threads will split up the input, and each will initialize its accumulator to 5. As a result, 5 will be added to the result as many times as there are threads, and the computed answer will be incorrect.
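When the extra constant can be factored out, one workaround is to aggregate with an identity seed that is safe to replicate across partitions (0 for addition) and apply the constant once in resultSelector instead. A sketch using the LINQ overload shown earlier:

```csharp
// Every partition can safely start from the additive identity, 0;
// the "+ 5" is applied exactly once, to the final accumulator.
public static int SumSquaresPlus5(IEnumerable<int> source)
{
    return source.Aggregate(0, (sum, x) => sum + x * x, (sum) => sum + 5);
}
```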
Can PLINQ do something to fix this problem?
Non-solution 1: Initialize one accumulator to the seed and the rest of them to the default value of T.
For example, if the input contains integers, why not initialize one thread’s accumulator to the user-provided seed, and the other accumulators to 0? The problem is that while 0 is a great initial accumulator value for some aggregations, such as sum, it does not work at all for other aggregations. One such operation is product: if we initialize the accumulator to 0, everything that partition computes will be multiplied into 0, and the final answer will incorrectly come out as 0.
The post Debugger display of PLINQ queries appeared first on .NET Parallel Programming.
Sometimes very simple additions to an API or implementation make me happy. One such nicety in the CTP of PLINQ is the implementation of ToString on the concrete types that represent query operators. These implementations provide a textual representation of the query structure, which can be very nice for debugging purposes.
Consider the following LINQ query, using the implementation of LINQ-to-Objects that shipped with Visual Studio 2008:
var q =
from x in list1
from y in list2
where x == y
select x.Length;
If I hover over the ‘q’ variable in the debugger, I see the following:
Based on the type displayed in the debugger tip, I’m only able to discern the type of the last operator in the query (the Select). If I change this to use PLINQ:
var q =
from x in list1.AsParallel()
from y in list2
where x == y
select x.Length;
I get a much better understanding of the query in the debugger tip:
Here’s the text from the tip (in case it’s difficult to make out in the image):
{SelectQueryOperator`2(WhereQueryOperator`1(SelectManyQueryOperator`3
(ScanQueryOperator`1(ParallelEnumerableWrapper`1))))}
From this, we can see that the query contains a Select that operates on a Where that operates on a SelectMany. Our ability to do this is a side benefit of PLINQ’s need to analyze the query structure in order to parallelize it; as part of that, we walk the query tree building up information about it, and we store that information in data structures that are then visible to the implementation and to the debugger. The base query operator types in PLINQ override ToString to provide this information, such that the debugger tip is able to display it to you.
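The same trick is easy to apply in your own composable APIs. Here is a hypothetical sketch (these types are not PLINQ’s actual internals) of an operator hierarchy whose ToString recursively includes the upstream operator:

```csharp
// Hypothetical operator types, for illustration only.
abstract class QueryOperator<T>
{
    protected readonly object Child; // the upstream operator or data source
    protected QueryOperator(object child) { Child = child; }

    // Recursively renders the chain, e.g. "WhereQueryOperator`1(...)";
    // string concatenation calls Child.ToString(), which recurses.
    public override string ToString()
    {
        return GetType().Name + "(" + Child + ")";
    }
}

sealed class WhereQueryOperator<T> : QueryOperator<T>
{
    public WhereQueryOperator(object child) : base(child) { }
}
```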
To see other related benefits, consider another LINQ example:
var q =
from x in Enumerable.Range(0, 100)
select x * x;
If at a later point I’d like to analyze this query in the debugger, there’s very little information I can glean from it:
With PLINQ, I can actually drill down and look at details such as the bounds on the range:
(You can actually get the same information with the original LINQ-to-Objects sample, but to do so you need to first call GetEnumerator on the IEnumerable<T> and then call MoveNext on the returned IEnumerator<T> in order to initialize the relevant fields in the resulting compiler-generated types.)
Pretty neat. Keep in mind that the debugger is able to provide insight into the internals of the system, but these internals will likely change from release to release in the future, so you shouldn’t rely in any way on them staying the same.