Internals of the POH

Maoni

As folks are aware we added a new kind of heap in .NET 5 called the POH (Pinned Object Heap). Since this is a user facing feature (and there aren't that many of those in GC) I've been meaning to write about it but didn't get around till now. In this blog entry I'll explain the internals of it, partly because if you understand them it'll make it easier to reason about scenarios that I don't already cover; partly just because I know people who read my blog tend to want the internals 😃

Why POH?

First of all, why did we add this POH and why did we only add it in .NET 5? Pinning was (and still is) thought to be an outlier scenario as it clearly hinders GC's ability to compact the heap (from here on, I will use "pins" interchangeably with "pinned objects"). And you can pin any existing object with blittable fields as long as you can get a hold of it. This means you can pin an object in any generation, old or young.

The best scenario is when you pin something for a short enough amount of time, meaning it's too short for the GC to notice. If a GC isn't happening the objects are not moved around anyway so something is pinned or not simply has no effect. And of course sometimes it's very difficult to control this. Or you can pin objects that aren't going to move anyway. For example if you allocate some static objects when the process starts running, they will all survive till end of the process lifetime anyway. And since they are already clustered together, even when we do a compacting GC on them, they will not move. So pinning or not also has no effect.

The worst scenario is when pinned objects are scattered on the heap and they don't go away, especially when these pinned objects are in older generations. GC tries very hard to leave the pinned objects in younger generations because the free spaces inbetween pins can be used sooner. So if we see free spaces between pins in gen0, it means we can satisfy user allocations with these free spaces. But we can only use free spaces between pins in gen2 when we actually promote gen1 survivors into gen2, which means we can only make use of these free spaces during a gen1 GC. Usually when a GC of generation G happens, objects that were in G would be in (G+1), but we may choose to leave a pinned object that was in generation G still in G, instead of promoting it to (G+1). This is called demotion.

Over the years I've done quite a bit of perf work to combat pinning even more so GC can handle the stressful pinning scenarios better (if you are curious about this I talked about it in detail in the dotnetos talk last year). We also went from "users have to care about pinning" to "let's have our libraries care about pinning so users don't have to" to put the optimization burden on our libraries authors instead of our users. So libraries started pooling buffers they would pin and when they need to grow the number of buffers they don't get just one, they get a batch of them so they don't need to do that so often and each buffer in the new batch will likely be next to each other (not guaranteed but very likely). So instead of

|pinned|non pinned|pinned|non pinned|

you have

|pinned|pinned|non pinned|non pinned|

when the non pinned objects die, the 2nd case will be more compact since it wouldn't have free space between the 2 pins.

All this effort combined improved the perf for pinning scenarios by a lot, but one of our goals is of course we always want to achieve higher perf. I had wanted to add a separate heap for pinned objects so they didn't "pollute" the rest of the heap for quite some time and in .NET 5 this was finally put on the agenda.

A design choice

Since we allow to pin any existing object, it means the pins can be scattered all over the heap. So to group these pins together, we have to make a choice whether we still allow pinning an existing object. If we want to allow it, it means we'd need to move it to this separate heap when the user tells us to pin it. Moving an object currently requires the managed threads to be suspended. Pinning is not considered a common case but having to suspend managed threads just to pin an object still seems heavy-handed. Even if we decided to do that, we'd still need to consider what to do with the object when it's unpinned. Do we then suspend again to move it back? If we look at the pinned buffer pool scenario, the component that manages the pool is often the one that pins them. And since these buffers are usually allocated for the purposed to be pinned, they can indeed be pinned right when they are allocated. So I chose to provide an API to pin an object at the allocation time.

How to add a separate heap

It's a bit unfortunate that we overloaded the word "heap" here but it's been this way since V1.0. When I started it was already too late to make changing the name worthwhile. But fortunately AFAICS it didn't seem that confusing for our customers. Before POH, with Workstation GC we have one heap that has an area for small objects and a different area for large objects. And we call these areas Small Object Heap and Large Object Heap. When we talk about Server GC, we say it has multiple heaps meaning that we have multiple of these SOHs and LOHs. So in our current context, heap means the heap like in SOH or LOH.

When we talk about SOH vs LOH, there's both the physical and the logic aspect. The physical aspect is they exist in different areas. And we organize memory by segments, it means SOH and LOH occupy different segments. So adding another heap means this heap will also occupy its own segments. GC has a few data structures that store info for physical generations like generation_table so LOH is actually stored at generation_table[3] so physically this is generation 3. The logic aspect defines how these heaps are logically organized, ie, LOH is logically part of gen2 so it's only collected when we do a gen2 GC. So we needed to decide which generation the new POH would belong to. And the conclusion is, since it's used more for longer lived objects, it makes sense to make it part of gen2. Making something part of gen2 makes it simpler to handle because we don't need to handle the part we are not collecting – we are collecting the whole thing.

So we ended up with a pretty simple design. We added something that's basically like LOH except we obviously cannot ever move objects on this heap where LOH can be compacted (and is automatically compacted in a container with the memory limit set). We can sweep it just like we sweep LOH. When an object is requested to be allocated on the POH, it shares the same lock that we take for LOH. This is the more_space_lock_loh lock. When multiple user threads are allocating on the same LOH, they are synchronized via this lock. Of course in Server GC (now we are switching the meaning of heap), each heap's LOH has its own lock. I chose to not create a separate lock for POH because POH is not expected to be used very frequently and it's not worth creating a separate lock. Another thing to point out is this lock isn't held very long – even though GC needs to clear memory before it hands it out and a large object can be very large, we only hold this lock to clear the first few pointer sized words. We then release the lock while we clear the rest.

Most of the work was actually the refactoring. Because LOH and POH are so similar, we created a new term – UOH, stands for User Old Heap, that covers both LOH and POH because they are handled together often. The reason why it's "user" and "old" is because user code allocates directly into these 2 heaps, and they are both considered part of the old generation – gen2. So many places that cared about LOH was renamed to UOH, eg, more_space_lock_loh was renamed to more_space_lock_uoh. Since before 5.0 GC hard coded the max number of physical generations, most of the refactoring was to not make that hard coded anymore so in the future in case we need to add another separate heap we wouldn't need to change so many places. After the refactoring was done, the amount of changes of adding POH was pretty small.

What's happening in .NET 6 with POH

In .NET 6 we are doing some perf tuning on POH. So far we've kept pretty much the same tuning as LOH but we don't expect POH to be used as much as LOH. Because POH is mostly in libraries such as for network IO, it's only for the small amount of objects that actually are used for interoping. So POH in general should be quite small – instead of "stretching out" the heap, now these objects are all allocated in their own area which should be small. This is not to say you should convert all your pinned handle to allocate on POH – if you know that you only need to pin something that's very short lived, it's better to leave it in gen0 so it can be reclaimed very quickly. And of course you might have a scenario where you simply cannot use POH because you are not in control of allocating the object.

We are also updating PerfView to support showing info on POH. We didn't have time to do this in .NET 5 (thankfully some of the other profilers already did it before us 😄) but PerfView ships on its own schedule so we can afford to do this more leisurely.

One change worth mentioning has to do with another design choice we made in .NET 5. When you pin an object with a pinned handle you can only pin objects with blittable fields, IOW, you cannot pin an object that has fields pointing to other managed objects. This was a conscious decision because of the usage scenarios for pinning. But the runtime itself is not limited by this rule. And it does pin objects with references. One of the scenarios where this happens is it pins the object[] that points to static objects (we do this so JIT can generate more performant code to find static objects) and this object[] object lives on LOH. And we have been getting reports from our customers that this is fragmenting the LOH and it's something that's very hard for them to work around (if they have a lot of static objects they'll hit this). POH seemed like a perfect choice for this scenario so we started to allow allocating objects with references on POH, but only to be used by the runtime itself. Moving those objects to POH showed clear benefit for cutting down LOH fragmentation. This does mean we now need to scan POH during ephemeral GCs whereas before we didn't. But keep in mind that most objects on POH would be without references so scanning them is very quick and that small amount of perf sacrifice was worth the benefit.

13 comments

Comments are closed. Login to edit/delete your existing comments

  • Michael Silver

    I’m pretty new to garbage collection and pinning items in the heap, but I’m curious about using this feature with secure strings or perhaps strings that would be easier to clear from memory after their use. A pinned object can still be copied, correct? I guess in that sense, it wouldn’t help for controlling sensitive items in memory.

  • Andrey Dryazgov

    “it means we’d need to move it to this separate heap when the user tells us to pin it”

    It sounds like each time when I pin an array with “fixed (byte* p = byteArray) {}” GC moves the whole array to a separate heap. That would cause a huge performance penalty then. Perhaps I’m doing something terrible in my libraries. Where can I read more about it? Thanks.

      • gc y

        Hi Maoni,
        I think that you need not move an object to POH when the user pins it, instead, you can move it when GC happens. Is this a better choice?

        • Damien Knapman

          Since one of the main purposes of pinning is to stop objects moving during a GC, I don’t think this strategy would work out well.

  • Oleg Mikhailov

    Can I ask a question not entirely on the topic? Have you considered the 80 kilobyte problem? Time goes by, but this limit does not change; the amount of RAM installed in average PC increased and SSDs aren’t even working good with blocks less than 64 kilobytes. What prevents this limit from being increased?

    • Damien Knapman

      Counter-question – assuming this is about the LOH cutoff – what evidence suggests it should be changed? Also, it’s been a long time since the amount of installed RAM (relevant to the OS) has been relevant to programs (which work in terms of address space)

      • Oleg Mikhailov

        Ok, I’ll tell you a dramatic story. A long time ago, I was a junior developer and learned the language from books. It was written in the books: “There is a garbage collector, the GC class has some methods, but nevermind, you shouldn’t use them under any circumstances! It does everything itself, don’t meddle.” And while I was doing web development, it really worked. Eight years later, I needed to write a desktop application and used large arrays, which were created and freed quite intensively, expecting that the garbage collector will handle them. But no, instead of reusing the freed memory, it allocated more and more, so finally that application crashed with OutOfMemoryException.

        Obviously, there are good reasons for such behavior of the garbage collector, since it has been around for so long, and they even added ArrayPool to BCL in order to get around the limitations of GC. I hope that if that limit would be increased, it would allow to reuse memory in larger number of scenarious and have less issues with large objects.

    • Dave Black

      The LOH actually starts at 85kb not 80kb. Though the runtime can, and sometimes does, put certain objects smaller than 85kb on the LOH. I’ve seen these be mostly very small-sized arrays of value types.