June 11th, 2008

PLINQ Ordering

There is a natural tension between ordering and performance in a parallel partitioning system such as PLINQ, which we addressed as guidance in the Dec07 CTP documentation:  “Although you can opt into ordering, this does come at a cost to performance because it constrains the options which PLINQ can use for executing a query, so it is better to avoid it in general if performance is your main concern. A better approach is to avoid assumptions on ordering in your programs where possible, or to explicitly sort elements in your query. This is consistent with the approach that most relational databases take.” 

In the June 2008 CTP, we did a major overall of the ordering model in PLINQ.  Here’s an overview of our thinking, our rationale, and the design decisions that we made along the way.  As always, your feedback is very important.  We are especially looking for input regarding the default choice that we have made – to be unordered unless otherwise specified.

Importance of PERFORMANCE

PLINQ’s raison d’etre is to provide improved performance to developers on multi-core systems.  As such, it is as the forefront of our mind when creating these systems.  Customers in many spaces that could make excellent use out of data parallelism do not have requirements on ordering, yet they would benefit from large performance gains. 

This is certainly the case in the finance industry.  Financial institutions have told us that they could easily absorb 1000x increase in performance, and performance alone is what would drive them to adopt PLINQ (they don’t even look to LINQ, because ease of programming is secondary in importance).  There are many scenarios that are pure number crunching and do not have expectations on ordering that need to be met. 

There are other precedents in LINQ for having an unordered system: LINQ-to-SQL and other data providers do not provide a default ordering contract.  They require the end user to sort if this is necessary.

Importance of ORDER

On the other side, there are also reasons to support ordering in the system.

1. LINQ-to-Objects sets the precedent – the original recipe, predecessor to PLINQ, does maintain ordering by default.  While some data providers no not require ordering, it still remains that PLINQ, for many scenarios, can be a drop-in replacement for LINQ-to-Objects.  I bet you didn’t miss that caveat “for many scenarios”, which draws attention to what we have dubbed the Parallelism Blockers.  Again from the CTP documentation, note #4: 

A very common question is: Why is an explicit programming model needed for concurrency instead of having the system automatically decide to provide parallelism?

There are several reasons why there needs to be some programmer involvement in the parallelism process:

1.    Mutable data structures and impurity
2.    Concurrent exceptions
3.    Thread affinity
4.    Ordering expectations
5.    Performance – Problems with < 1.0 speedup

2. Some code expects ordering – As an illustration, this can provide unexpected behavior in code such as:

Array.Sort(a);

var q = from x in a …;

The initial Array.Sort’s demand on order would not be kept by the system, and it would be shuffled in the query execution by default. PLINQ provides mechanisms to preserve ordering, but usually at a cost to performance. So, you can opt-in to order preservation in PLINQ explicitly:

Array.Sort(a);

from x in a.AsParallel().AsOrdered() …

3. Some operators require or insist on ordering – The OrderBy and ThenBy operators specify ordering.  If these were disregarded and ignored after being declared, many scenarios would be cut off.  For this reason, these turn on ordering that flows forward in the system.

Further, operators such as Reverse, SequenceEquals, TakeWhile, SkipWhile, and Zip lose practical purposes when there is not a defined ordering.  Since these operators are useful in many cases, it is important to keep them.

 

Our Goals

These goals are the list of requirements that must be achieved to successfully address this tension.  These are listed in no particular order and are not ranked.  They must all be achieved.

  1. Transparent, clear mental model for developers – The developer should be able to understand the system through a very clear and succinctly stated design statement.  There should not be exceptions to this.  It should be straight-forward and simple to write code against.
  2. Good default – There should be a default model that works for the important customer cases as determined by the group.  For Dev10 this is for Finance and Multimedia scenarios.  A good default also means that there is nothing additional that needs to be specified in the opt-in mechanism (i.e. AsParallel() without any parameters should be sufficient).
  3. Flexibility to address both sides of the performance/ordering tension – While the model does pick a default, scenarios that favor the other side of the tension should be able to toggle in all or parts of the query. 

Non-Goals

These are the things that we are not explicitly trying to accomplish nor avoid; they are not the main purpose of this design.  If they happen along the way, they happen.

  1. Map completely to LINQ-to-Objects or another provider – There is no specific ordering or full set of operators contract required for the LINQ-specification, and it is not a goal to strictly adhere to another provider.  It is more important to adhere to a customer scenario.

The New Mental Model

The mental model is the suggested way for the developer to think of the system.  How it is actually implemented under the covers may differ from this, as long as the end results align with user expectations, consistent with the mental model.

The mental model is succinctly:

Ordering operators re-establish order, shuffle points shuffle the order.

And more detailed:

a.    The query is shuffled at the start
b.    unless the default is overridden to keep the order at the start
c.    ordering operators establish a new ordering that flows afterwards in the query,
d.    a shuffle point can be inserted anywhere in the query, reshuffling the query,
e.    until there is another (optional) ordering operator…

Default: unordered

In favor of performance, PLINQ’s raison d’etre, the default for queries is to be unordered

Rationale for this is covered above.  This is a decision that we will validate through performance work and feedback from you.  This is not a clear-cut decision and the right data could persuade us.  If you have good reasons to keep or change this default, please let us know.  What do you think the default should be and why? 

Programming Model Implications

To provide a richer programming model for working with ordering, there are mechanisms to override the default and turn ordering on and off in the query. 

This specification announces the addition of AsOrdered and AsUnordered operators to the PLINQ syntax and the removal of the ParallelQueryOptions enum and all overloads of AsParallel that accept this enum. 

Overriding the default

Call AsOrdered on an IEnumerable or IList to opt into PLINQ preserving the input ordering.

AsOrdered can only be validly invoked on AsParallel, Range, and Repeat operators.  Any other operators will throw a runtime InvalidOperationException. 

New to the ParallelEnumerable static class is:

public static IParallelEnumerable<TSource> AsOrdered<TSource>(IParallelEnumerable<TSource> source);

Usage examples:

src.AsParallel().AsOrdered().Take(1000)

ParallelEnumerable.Range(1,1000000000).AsOrdered().Take(1000)

ParallelEnumerable.Repeat(2,1000000000).AsOrdered().Take(1000)


Shuffling midquery

If, mid-query, a point is reach where the developer knows that ordering is no longer important, the developer can add a shuffle point to tell the query engine that ordering can now be arbitrary.  A new operator is added to PLINQ for this purpose. 

New to the ParallelEnumerable static class is:

public static IParallelEnumerable<TSource> AsUnordered<TSource>(IParallelEnumerable<TSource> source);

It can be used as follows:

pages.AsParallel().AsOrdered().Take(1000000).AsUnordered().Join(…).Select(…);

In this case, the query no longer requires ordering for the following Join and Select, which are the heavy lifting of the query.

Turning on ordering midquery

Ordering can be turned on, then off, then on again midquery:

  • On: Using original LINQ operators OrderBy or OrderByDescending and then optionally ThenBy or ThenByDescending operators will establish ordering in the query that will continue to flow throughout.  Take for example:

pages.AsParallel().Where(…).Select(…).OrderBy(i => i.rank).Take(10);

In the above query, the Where and the Select operators can function in arbitrary order, then the pages are ordered by rank for the Take operator to return the top 10 pages by rank.

  • Then Off: This ordering requirement is lifted at the next shuffle point if there is one. 

pages.AsParallel().Where(…).Select(…).OrderBy(i => i.rank).Take(10000)

.AsUnordered().Join(…);

In the above example, the initial parts of the query remain the same as above, but the query no longer needs to maintain the ordering when performing the Join operator.

  • Then On again: This ordering requirement is reset at a subsequent ordering operator.

pages.AsParallel().Where(…).Select(…).OrderBy(i => i.rank).Take(10000)

.AsUnordered().Join(…)

.OrderBy(i => i.domain);

In the above example, ordering is turned on again at the end to reorder by domain for the output.

Author

0 comments

Discussion are closed.