For a relatively advanced feature, I’ve been surprised how often this question has come up recently.
When a task completes, its continuations become available for execution, and by default, a continuation will be scheduled for execution rather than executed immediately. This means that the continuation has to be queued to the scheduler and then later retrieved so that it may be run. Given that there’s some overhead involved there, why would we choose to make that the default behavior rather than avoiding that overhead by executing the continuation synchronously upon completion of the antecedent? There are a few reasons.
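Before walking through those reasons, here’s a minimal sketch of the two modes (Compute and Handle are placeholder methods of my own invention): the first continuation is queued to the scheduler by default, while the second opts into running synchronously on the completing thread via TaskContinuationOptions.ExecuteSynchronously.

using System.Threading.Tasks;

class ContinuationModes
{
    static void Demo()
    {
        var antecedent = Task.Factory.StartNew(() => Compute());

        // Default: when antecedent completes, this continuation is queued
        // to the TaskScheduler and runs once a thread picks it up.
        antecedent.ContinueWith(t => Handle(t.Result));

        // Opt-in: this continuation runs immediately on whatever thread
        // completed antecedent, skipping the queue/dequeue round trip.
        antecedent.ContinueWith(t => Handle(t.Result),
            TaskContinuationOptions.ExecuteSynchronously);
    }

    static int Compute() { return 42; }
    static void Handle(int result) { }
}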
First, it’s quite common for multiple continuations to be created off of the same task. If continuations were executed synchronously by default, we would lose out on a valuable opportunity for parallelism, as they would all be executed synchronously one after the other. By scheduling the continuations to run asynchronously rather than executing them synchronously, we expose those continuations to be picked up by other available threads, thereby allowing them to run in parallel.
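For example, here’s a small sketch of that fan-out (Produce and the Consume methods are placeholders): because each continuation is queued rather than run inline, idle threads can pick them up concurrently once the antecedent completes.

using System.Threading.Tasks;

class FanOut
{
    static void Demo()
    {
        var antecedent = Task.Factory.StartNew(() => Produce());

        // Each continuation is queued when antecedent completes, so
        // available threads can run them in parallel, rather than the
        // completing thread running them one after another.
        var c1 = antecedent.ContinueWith(t => ConsumeA(t.Result));
        var c2 = antecedent.ContinueWith(t => ConsumeB(t.Result));
        var c3 = antecedent.ContinueWith(t => ConsumeC(t.Result));

        Task.WaitAll(c1, c2, c3);
    }

    static int Produce() { return 0; }
    static void ConsumeA(int value) { }
    static void ConsumeB(int value) { }
    static void ConsumeC(int value) { }
}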
Second, it’s quite common for long chains of continuations to be formed, with one task continuing off of another, and another off of that, and another off of that, and so on. If these continuations were all executed synchronously, the completion logic from one task would invoke the next task, and its completion logic would invoke the next… each of these would lead to additional stack frames piling up on top of each other, and with a long enough chain, we could end up overflowing the stack.
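As an illustration of the hazard (a sketch only; whether this actually overflows depends on the runtime, and later releases mitigate it by checking remaining stack space and falling back to queueing), consider a chain in which every link opts into synchronous execution:

using System.Threading.Tasks;

class DeepChain
{
    static void Demo()
    {
        var source = new TaskCompletionSource<bool>();
        Task last = source.Task;

        // Every link requests synchronous execution, so completing the
        // source can run each continuation within the frame of its
        // antecedent's completion logic, deepening the stack link by link.
        for (int i = 0; i < 500000; i++)
        {
            last = last.ContinueWith(_ => { },
                TaskContinuationOptions.ExecuteSynchronously);
        }

        source.SetResult(true); // with a long enough chain, risks overflow
        last.Wait();
    }
}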
Third, a common solution to such overflow conditions as discussed above is to use a “trampoline,” where you store a reference to some work to be done, back out of your current stack frame(s), and have a higher-level frame (typically a looping construct) look for the stored reference and execute it. That way, after every invocation, rather than picking up the next piece of work immediately, you store the reference, back out, and then execute it. This, as it happens, is exactly the solution TPL employs to make asynchronous execution fast. Remember that as part of .NET 4, the ThreadPool’s internal implementation was augmented with work-stealing queues to which TPL has access. When work running on a ThreadPool thread schedules a Task for execution, that Task is put into a work-stealing queue local to that thread. The thread is able to push and pop work items from it very efficiently and with minimal synchronization. Now, when a task completes, it’s typically completing on a ThreadPool thread, and as such all of the continuations it queues get queued to the local work-stealing queue. The thread will then go in search of work to do, first checking its local queue, and immediately find one of the continuations it just queued. This is, in effect, the trampoline. The thread picks off the most recently queued continuation efficiently and begins processing it.
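To illustrate the general pattern (a minimal standalone sketch of a trampoline, not TPL’s actual internals; all names here are illustrative), each step stores the next piece of work instead of invoking it directly, and a top-level loop runs it after the current frame has unwound:

using System;

class TrampolineDemo
{
    // Holds a reference to the next piece of work, if any.
    static Action s_next;

    static void Run(Action first)
    {
        s_next = first;
        while (s_next != null)
        {
            Action current = s_next;
            s_next = null;   // the step may store a successor
            current();       // runs with a shallow stack each time
        }
    }

    static void Main()
    {
        int remaining = 1000000; // deep enough to overflow if done recursively
        Action step = null;
        step = () =>
        {
            if (--remaining > 0) s_next = step; // store, don't recurse
        };
        Run(step);
        Console.WriteLine("Done without growing the stack.");
    }
}

In TPL’s case, the local work-stealing queue plays the role of the stored reference, and the ThreadPool thread’s dispatch loop plays the role of the top-level loop.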
These are the primary reasons why we default to queueing continuations rather than just executing them synchronously: it provides more opportunities to leverage parallelism, it’s the safer choice, and the difference in performance is typically not important. Of course, microbenchmarks will highlight a non-negligible performance difference, so if you’re dealing with continuations that contain very few instructions, have little risk of blocking, etc., ExecuteSynchronously can be worthwhile. Consider the following simple test:
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class Program
{
    const int NUM_CONTINUATIONS = 100000;

    static long Test(bool executeSynchronously)
    {
        // Clean up from the previous iteration so GC activity
        // doesn't skew the timing.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // Build a chain of no-op continuations off of an unstarted task.
        var first = new Task(() => { });
        var last = first;
        for (int i = 0; i < NUM_CONTINUATIONS; i++)
        {
            last = last.ContinueWith(delegate { }, executeSynchronously ?
                TaskContinuationOptions.ExecuteSynchronously :
                TaskContinuationOptions.None);
        }

        // Time how long it takes the whole chain to complete.
        var sw = Stopwatch.StartNew();
        first.Start();
        last.Wait();
        return sw.ElapsedMilliseconds;
    }

    static void Main(string[] args)
    {
        while (true)
        {
            long withoutExecuteSynchronously = 0;
            long withExecuteSynchronously = 0;
            for (int i = 0; i < 5; i++)
            {
                withoutExecuteSynchronously += Test(false);
                withExecuteSynchronously += Test(true);
            }
            Console.WriteLine((withoutExecuteSynchronously /
                (double)withExecuteSynchronously).ToString("F2"));
        }
    }
}
This test creates a chain of NUM_CONTINUATIONS continuations, each of which does zero work, measures how long it takes the whole chain to execute, and compares the cases where the continuations are and are not created with the ExecuteSynchronously option. On the laptop on which I’m writing this blog post, I see the ExecuteSynchronously version running faster, with up to a 2x difference in throughput. This highlights why we made the ExecuteSynchronously option available even though it’s not the default.