Internals of the POH

maoni

Maoni

As folks are aware we added a new kind of heap in .NET 5 called the POH (Pinned Object Heap). Since this is a user facing feature (and there aren't that many of those in GC) I've been meaning to write about it but didn't get around till now. In this blog entry I'll explain the internals of it, partly because if you understand them it'll make it easier to reason about scenarios that I don't already cover; partly just because I know people who read my blog tend to want the internals ๐Ÿ˜ƒ

Why POH?

First of all, why did we add this POH and why did we only add it in .NET 5? Pinning was (and still is) thought to be an outlier scenario as it clearly hinders GC's ability to compact the heap (from here on, I will use "pins" interchangeably with "pinned objects"). And you can pin any existing object with blittable fields as long as you can get a hold of it. This means you can pin an object in any generation, old or young.

The best scenario is when you pin something for a short enough amount of time, meaning it's too short for the GC to notice. If a GC isn't happening the objects are not moved around anyway so something is pinned or not simply has no effect. And of course sometimes it's very difficult to control this. Or you can pin objects that aren't going to move anyway. For example if you allocate some static objects when the process starts running, they will all survive till end of the process lifetime anyway. And since they are already clustered together, even when we do a compacting GC on them, they will not move. So pinning or not also has no effect.

The worst scenario is when pinned objects are scattered on the heap and they don't go away, especially when these pinned objects are in older generations. GC tries very hard to leave the pinned objects in younger generations because the free spaces inbetween pins can be used sooner. So if we see free spaces between pins in gen0, it means we can satisfy user allocations with these free spaces. But we can only use free spaces between pins in gen2 when we actually promote gen1 survivors into gen2, which means we can only make use of these free spaces during a gen1 GC. Usually when a GC of generation G happens, objects that were in G would be in (G+1), but we may choose to leave a pinned object that was in generation G still in G, instead of promoting it to (G+1). This is called demotion.

Over the years I've done quite a bit of perf work to combat pinning even more so GC can handle the stressful pinning scenarios better (if you are curious about this I talked about it in detail in the dotnetos talk last year). We also went from "users have to care about pinning" to "let's have our libraries care about pinning so users don't have to" to put the optimization burden on our libraries authors instead of our users. So libraries started pooling buffers they would pin and when they need to grow the number of buffers they don't get just one, they get a batch of them so they don't need to do that so often and each buffer in the new batch will likely be next to each other (not guaranteed but very likely). So instead of

|pinned|non pinned|pinned|non pinned|

you have

|pinned|pinned|non pinned|non pinned|

when the non pinned objects die, the 2nd case will be more compact since it wouldn't have free space between the 2 pins.

All this effort combined improved the perf for pinning scenarios by a lot, but one of our goals is of course we always want to achieve higher perf. I had wanted to add a separate heap for pinned objects so they didn't "pollute" the rest of the heap for quite some time and in .NET 5 this was finally put on the agenda.

A design choice

Since we allow to pin any existing object, it means the pins can be scattered all over the heap. So to group these pins together, we have to make a choice whether we still allow pinning an existing object. If we want to allow it, it means we'd need to move it to this separate heap when the user tells us to pin it. Moving an object currently requires the managed threads to be suspended. Pinning is not considered a common case but having to suspend managed threads just to pin an object still seems heavy-handed. Even if we decided to do that, we'd still need to consider what to do with the object when it's unpinned. Do we then suspend again to move it back? If we look at the pinned buffer pool scenario, the component that manages the pool is often the one that pins them. And since these buffers are usually allocated for the purposed to be pinned, they can indeed be pinned right when they are allocated. So I chose to provide an API to pin an object at the allocation time.

How to add a separate heap

It's a bit unfortunate that we overloaded the word "heap" here but it's been this way since V1.0. When I started it was already too late to make changing the name worthwhile. But fortunately AFAICS it didn't seem that confusing for our customers. Before POH, with Workstation GC we have one heap that has an area for small objects and a different area for large objects. And we call these areas Small Object Heap and Large Object Heap. When we talk about Server GC, we say it has multiple heaps meaning that we have multiple of these SOHs and LOHs. So in our current context, heap means the heap like in SOH or LOH.

When we talk about SOH vs LOH, there's both the physical and the logic aspect. The physical aspect is they exist in different areas. And we organize memory by segments, it means SOH and LOH occupy different segments. So adding another heap means this heap will also occupy its own segments. GC has a few data structures that store info for physical generations like generation_table so LOH is actually stored at generation_table[3] so physically this is generation 3. The logic aspect defines how these heaps are logically organized, ie, LOH is logically part of gen2 so it's only collected when we do a gen2 GC. So we needed to decide which generation the new POH would belong to. And the conclusion is, since it's used more for longer lived objects, it makes sense to make it part of gen2. Making something part of gen2 makes it simpler to handle because we don't need to handle the part we are not collecting – we are collecting the whole thing.

So we ended up with a pretty simple design. We added something that's basically like LOH except we obviously cannot ever move objects on this heap where LOH can be compacted (and is automatically compacted in a container with the memory limit set). We can sweep it just like we sweep LOH. When an object is requested to be allocated on the POH, it shares the same lock that we take for LOH. This is the more_space_lock_loh lock. When multiple user threads are allocating on the same LOH, they are synchronized via this lock. Of course in Server GC (now we are switching the meaning of heap), each heap's LOH has its own lock. I chose to not create a separate lock for POH because POH is not expected to be used very frequently and it's not worth creating a separate lock. Another thing to point out is this lock isn't held very long – even though GC needs to clear memory before it hands it out and a large object can be very large, we only hold this lock to clear the first few pointer sized words. We then release the lock while we clear the rest.

Most of the work was actually the refactoring. Because LOH and POH are so similar, we created a new term – UOH, stands for User Old Heap, that covers both LOH and POH because they are handled together often. The reason why it's "user" and "old" is because user code allocates directly into these 2 heaps, and they are both considered part of the old generation – gen2. So many places that cared about LOH was renamed to UOH, eg, more_space_lock_loh was renamed to more_space_lock_uoh. Since before 5.0 GC hard coded the max number of physical generations, most of the refactoring was to not make that hard coded anymore so in the future in case we need to add another separate heap we wouldn't need to change so many places. After the refactoring was done, the amount of changes of adding POH was pretty small.

What's happening in .NET 6 with POH

In .NET 6 we are doing some perf tuning on POH. So far we've kept pretty much the same tuning as LOH but we don't expect POH to be used as much as LOH. Because POH is mostly in libraries such as for network IO, it's only for the small amount of objects that actually are used for interoping. So POH in general should be quite small – instead of "stretching out" the heap, now these objects are all allocated in their own area which should be small. This is not to say you should convert all your pinned handle to allocate on POH – if you know that you only need to pin something that's very short lived, it's better to leave it in gen0 so it can be reclaimed very quickly. And of course you might have a scenario where you simply cannot use POH because you are not in control of allocating the object.

We are also updating PerfView to support showing info on POH. We didn't have time to do this in .NET 5 (thankfully some of the other profilers already did it before us ๐Ÿ˜„) but PerfView ships on its own schedule so we can afford to do this more leisurely.

One change worth mentioning has to do with another design choice we made in .NET 5. When you pin an object with a pinned handle you can only pin objects with blittable fields, IOW, you cannot pin an object that has fields pointing to other managed objects. This was a conscious decision because of the usage scenarios for pinning. But the runtime itself is not limited by this rule. And it does pin objects with references. One of the scenarios where this happens is it pins the object[] that points to static objects (we do this so JIT can generate more performant code to find static objects) and this object[] object lives on LOH. And we have been getting reports from our customers that this is fragmenting the LOH and it's something that's very hard for them to work around (if they have a lot of static objects they'll hit this). POH seemed like a perfect choice for this scenario so we started to allow allocating objects with references on POH, but only to be used by the runtime itself. Moving those objects to POH showed clear benefit for cutting down LOH fragmentation. This does mean we now need to scan POH during ephemeral GCs whereas before we didn't. But keep in mind that most objects on POH would be without references so scanning them is very quick and that small amount of perf sacrifice was worth the benefit.

13 comments

Leave a comment