In the previous “What’s New for Parallelism in Visual Studio 2012 RC” blog post, I mentioned briefly that for the .NET 4.5 Release Candidate, StreamReader.ReadLineAsync experienced a significant performance improvement over Beta. There’s an intriguing story behind that, one I thought I’d share here.
It has to do with some interesting interactions between certain optimizations in the BCL and how the C# and Visual Basic compilers compile async methods. Before I describe the changes made, a bit of experimentation will be useful to help set the stage.
Consider the following code. The code is simply timing how long it takes to run two different async methods (which don’t actually await anything, and so will run to completion synchronously). Each method increments an integer 100 million times. The first method does so by incrementing a field 100 million times, and the second method does so by copying the field into a local, incrementing that local 100 million times, and then storing the local back to the field.
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class Program
{
static void Main()
{
var sw = new Stopwatch();
while (true)
{
foreach (var m in new Func<Task>[] {
new MyObj().Foo1, new MyObj().Foo2 })
{
sw.Restart();
m();
sw.Stop();
Console.WriteLine("{0}: {1}", m.Method.Name, sw.Elapsed);
}
Console.WriteLine();
}
}
}

class MyObj
{
const int ITERS = 100000000;
private int m_data;

public async Task Foo1()
{
for (int i = 0; i < ITERS; i++) m_data++;
}

public async Task Foo2()
{
int localData = m_data;
for (int i = 0; i < ITERS; i++) localData++;
m_data = localData;
}
}
When I run this on the laptop on which I’m writing this blog post, I get output numbers like the following:
Foo1: 00:00:00.2289705
Foo2: 00:00:00.0755818
In other words, this particular microbenchmark implies that accessing a local (which in this case is actually compiled to be a field on the stack-allocated struct representing the async state machine) is approximately three times faster than accessing a class field. Now, let’s make a one-line change to our example. Instead of:
class MyObj
we’ll change that to be:
class MyObj : MarshalByRefObject
Now when I run the sample, I see much different output for Foo1:
Foo1: 00:00:05.4411074
Foo2: 00:00:00.0757737
Whoa! Notice that our Foo2 method took basically the same amount of time as it did previously, but the time it took to execute Foo1 skyrocketed. Whereas previously it was only 3x more expensive to use the field than the local, now it’s ~72x more expensive. What happened?
The issue here has to do with .NET Remoting. According to MSDN documentation, MarshalByRefObject “enables access to objects across application domain boundaries.” What this means is that you can construct a MarshalByRefObject in one AppDomain, where it will live, and then hand out to other AppDomains references to that object. In those other AppDomains, the developer’s code will still be accessing the object via its type, but that AppDomain doesn’t actually have a direct reference to the object; rather, the CLR has played some magic under the covers, such that the other AppDomain actually holds onto a proxy object. You can see this with a modification to our earlier sample by using the System.Runtime.Remoting.RemotingServices.IsTransparentProxy method:
var obj1 = new MyObj();
Console.WriteLine(RemotingServices.IsTransparentProxy(obj1));

var ad = AppDomain.CreateDomain("myDomain");
var obj2 = (MyObj)ad.CreateInstanceAndUnwrap(
typeof(MyObj).Assembly.FullName, typeof(MyObj).FullName);
Console.WriteLine(RemotingServices.IsTransparentProxy(obj2));
This will output ‘false’ for obj1 but ‘true’ for obj2.
OK, so the CLR supports proxy objects… how is that relevant? When you access a field on a MarshalByRefObject, the CLR needs to check whether the object being accessed lives in the current domain or whether it actually lives in a remote domain, in which case it needs to marshal the request across the AppDomain boundary. That check, which is inserted by the JIT compiler, can be relatively expensive (at least when compared to the normal cost of accessing a field). As such, the CLR has some optimizations in place to mitigate these costs. In particular, it special-cases access to fields on ‘this’: if you’re accessing ‘this’, you’re already running in a method on the target object, which means you must already be in the same AppDomain as the field. This optimization has served the CLR very well, as most fields are private and thus are only accessed by the object that owns them. It’s worked so well that very few people ever noticed these costs… that is, until async methods entered the picture.
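To see the ‘this’ special case in isolation, without async methods in the picture, one option is to compare an instance method against a static method that reaches the same field through a parameter. The following is only a minimal sketch: the Counter type and its method names are hypothetical, and it assumes the .NET Framework JIT behavior described above. I haven’t listed timings, since they will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;

// Hypothetical type for illustration; not part of the original repro.
class Counter : MarshalByRefObject
{
    private int m_data;

    public int Value { get { return m_data; } }

    // The field is accessed on 'this', so this loop should be eligible
    // for the special case described above.
    public void IncrementViaThis(int iters)
    {
        for (int i = 0; i < iters; i++) m_data++;
    }

    // The field is accessed through another reference, so the JIT cannot
    // assume the target object lives in the current AppDomain.
    public static void IncrementViaReference(Counter c, int iters)
    {
        for (int i = 0; i < iters; i++) c.m_data++;
    }
}

class Program
{
    const int Iters = 100000000;

    static void Main()
    {
        var c = new Counter();

        var sw = Stopwatch.StartNew();
        c.IncrementViaThis(Iters);
        sw.Stop();
        Console.WriteLine("via this:      {0}", sw.Elapsed);

        sw.Restart();
        Counter.IncrementViaReference(c, Iters);
        sw.Stop();
        Console.WriteLine("via reference: {0}", sw.Elapsed);
    }
}
```

If the second timing comes out much larger than the first on your machine, that is the cross-AppDomain check showing up; removing the MarshalByRefObject base class should bring the two back in line.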
Remember how async methods are compiled. If you have an async method BarAsync:
public class Foo
{
private int m_data;

public async Task BarAsync()
{
…
m_data = 42;
…
}
}
the compiler will generate a state machine type for BarAsync, and the contents of the BarAsync method will be compiled into a MoveNext method on that type, e.g.
public class Foo
{
private int m_data;

public Task BarAsync()
{
var sm = new BarAsyncStateMachine();
sm.<>builder = AsyncTaskMethodBuilder.Create();
sm.<>this = this;
…
sm.<>builder.Start(ref sm);
return sm.<>builder.Task;
}

private struct BarAsyncStateMachine : IAsyncStateMachine
{
public AsyncTaskMethodBuilder <>builder;
public Foo <>this;
…
public void MoveNext()
{
…
<>this.m_data = 42;
…
}
}
}
Note that the MoveNext method is no longer accessing m_data on ‘this’: inside MoveNext, ‘this’ is the state machine, and m_data lives on a separate Foo instance reached through the <>this field. As such, when the JIT compiles this method, it’s unable to apply the aforementioned optimization, and we end up getting the cross-AppDomain checks. This is what’s causing the significant difference in our earlier microbenchmark.
In general, this shouldn’t be a big deal, as very few types actually inherit from MarshalByRefObject, and of those few that do, very few will have async methods on them, and of those that do, very few of those async methods will make frequent enough accesses to fields for this to be a measurable problem. Unfortunately, there is an important set of types and methods that do meet all of these conditions: those in System.IO. The Stream, TextReader, and TextWriter classes all derive from MarshalByRefObject, which means that any async method implemented on these or a derived type is susceptible to this problem. Further, they deal with I/O, which means there are often potentially long-latency operations involved that warrant using asynchrony. And there’s often some amount of data buffering done in order to limit the number of long-latency calls made, which means potentially frequent access to fields storing the buffered data.
We analyzed all of the known cases in the BCL where this could potentially be a problem, and we fixed those. The most high-profile and most impacted example was the StreamReader.ReadLineAsync method. ReadLineAsync internally has a tight loop that was making multiple accesses to fields on the StreamReader, so this particular issue was very noticeable there; by fixing it, we saw a severalfold increase in ReadLineAsync’s performance between Beta and the Release Candidate.
Note that this issue wasn’t addressed by adding new optimizations to the JIT, but rather by running performance tests, finding methods that weren’t meeting performance goals, measuring to determine the cost, and in the offending cases, addressing the issue by minimizing the number of times the fields were accessed. If you have any custom Stream, StreamReader, or StreamWriter types on which you’ve implemented async methods and that access fields on the type very frequently, you might also want to measure and consider whether you’re impacted by this.
If you are, you have a few possible workarounds. One is to use a local, as I did in Foo2 in my original MyObj example. If that works for you, that’s the best solution. If that isn’t feasible given your code, a second workaround is to create a simple property on the MarshalByRefObject-derived type; that property’s getter should return the value of the field and its setter should set it, and then you can use that property from your async method as a stand-in for the field. The JIT has additional optimizations in place for dealing with properties on MarshalByRefObjects, because it’s much more likely that a property will be exposed publicly than it is for a field to be exposed publicly. If I go back to my original repro and add the following additional implementation:
private int Data { get { return m_data; } set { m_data = value; } }
public async Task Foo3()
{
for (int i = 0; i < ITERS; i++) Data++;
}
I then see this output:
Foo1: 00:00:05.1012831
Foo2: 00:00:00.0758750
Foo3: 00:00:00.8242660
Note that the approach which uses the property is still noticeably slower than the local-based approach, but it’s also 6x faster than the original flawed version. Again, though, only consider making such changes to your code if measurements actually show this issue to be impactful… the majority of the cases we looked at in the BCL were not impacted, as the other costs in the method dwarfed any overheads incurred from accessing fields.
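For the local-based workaround, here is a rough sketch of how it might look in an async method that actually awaits I/O. Everything in it is hypothetical (the BufferedCounter type, its fields, and CountNewlinesAsync are mine, not from the BCL), and it is not how ReadLineAsync was changed; it only illustrates copying fields into locals before a hot loop and writing the result back once at the end.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Hypothetical MarshalByRefObject-derived type; names are illustrative only.
class BufferedCounter : MarshalByRefObject
{
    private readonly Stream m_stream;
    private readonly byte[] m_buffer = new byte[4096];
    private long m_newlines;

    public BufferedCounter(Stream stream) { m_stream = stream; }

    public long Newlines { get { return m_newlines; } }

    public async Task CountNewlinesAsync()
    {
        // Read each field once, up front. These locals are hoisted onto the
        // state machine, so using them does not touch the MarshalByRefObject.
        Stream stream = m_stream;
        byte[] buffer = m_buffer;
        long newlines = m_newlines;

        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            // Hot loop: only locals are accessed here.
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] == (byte)'\n') newlines++;
            }
        }

        // Store the result back to the field once, after the loop.
        m_newlines = newlines;
    }
}

class Program
{
    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("a\nb\nc\n");
        var counter = new BufferedCounter(new MemoryStream(data));
        counter.CountNewlinesAsync().Wait();
        Console.WriteLine(counter.Newlines); // prints 3
    }
}
```

One thing to keep in mind with this pattern: the field is stale while the method is in flight, so if other code can observe it between awaits, you would need to write the local back before each await rather than only at the end.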