Balancing work on GC threads
In Server GC, each GC thread works on its heap in parallel (that’s a simplistic view and is not necessarily true for all phases, but at a high level it’s exactly the idea of a parallel GC). So that alone means work is already split between GC threads. But GC work for some stages can only proceed after all threads are done with the previous stage. For example, no GC thread can start the plan phase until all GC threads are done with the mark phase, so we don’t miss objects that should be marked. So we want the amount of GC work on each thread to be as balanced as possible so the total pause can be shorter; otherwise, if one thread takes a long time to finish such a stage, the other threads wait around not doing anything. There are various things we do to make the work more balanced, and we will continue to do work like this to balance things out more.
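To make the cost of imbalance concrete, here is a tiny sketch of the arithmetic. The function name `phase_pause` and the work numbers are made up for illustration; the only assumption is that a phase ends when its slowest thread finishes.

```python
def phase_pause(work_per_thread):
    """Return (pause, idle) for one phase that ends at a barrier.

    pause is the time taken by the slowest thread; idle is the total
    time the other threads spend waiting for it.
    """
    pause = max(work_per_thread)
    idle = sum(pause - w for w in work_per_thread)
    return pause, idle


# The same 40 units of work on 4 GC threads, split two ways.
unbalanced = [25, 5, 5, 5]
balanced = [10, 10, 10, 10]

print(phase_pause(unbalanced))  # (25, 60)
print(phase_pause(balanced))    # (10, 0)
```

The total work is identical in both cases, but the unbalanced split makes the phase 2.5 times longer and leaves three threads idle for most of it.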
One way to balance the collection work is to balance the allocations. Of course, even if you have the exact same amount of allocations per heap, the amount of collection work can still be very different depending on the survival. But it certainly helps. So we equalize the allocation budget at the end of a GC so each heap gets the same allocation budget. This doesn’t mean each heap will naturally get the same amount of allocations, but it puts the same upper limit on the amount of allocations each heap can do before the next GC is triggered. The number of allocating threads and the amount of allocation each allocating thread does are of course up to user code. We try to make the allocations on the heap associated with the core that the allocating thread runs on, but since we have no control over what user threads do, we need to check whether other heaps are less full and balance to them when appropriate. The “when appropriate” part requires carefully tuned heuristics. Currently we take into consideration the core the thread is running on, the NUMA node it runs on, how much allocation budget its heap has left compared to other heaps, and how many allocating threads have been running on the same core. I do think this is a bit unnecessarily complicated, so we are doing more work to see if we can simplify this.
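Here is a rough sketch of the two ideas above: equalizing budgets, then deciding whether an allocation should stay on its home heap. This is not the GC’s actual code. The names `equalize_budgets` and `pick_heap`, the use of the mean as the equalized budget, and the `slack` threshold are all assumptions for illustration, and the sketch ignores the NUMA node and the thread count per core.

```python
def equalize_budgets(budgets):
    """Give every heap the same budget (assumed here: the mean)."""
    even = sum(budgets) // len(budgets)
    return [even] * len(budgets)


def pick_heap(home, remaining, slack=0.25):
    """Pick the heap an allocation should go to.

    home: index of the heap tied to the core the thread runs on.
    remaining: allocation budget left on each heap.
    slack: hypothetical threshold; leave the home heap only when some
           other heap has more than this fraction extra budget left.
    """
    best = max(range(len(remaining)), key=lambda i: remaining[i])
    if remaining[best] > remaining[home] * (1 + slack):
        return best
    return home


budgets = equalize_budgets([120, 40, 80, 160])
print(budgets)                    # [100, 100, 100, 100]

remaining = [10, 90, 95, 100]
print(pick_heap(0, remaining))    # 3: heap 0 is nearly out of budget
print(pick_heap(2, remaining))    # 2: heap 3 is not better by enough
```

The threshold is there because moving an allocating thread off its home heap has a cost, so a small difference in remaining budget should not trigger a move.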
If we use the GCHeapCount config to specify fewer heaps than cores, it means there will only be that many GC threads and by default they would only run on that many cores. Of course the user threads are free to run on the rest of the cores and the allocations they do will be balanced onto the GC heaps.
Balancing GC work
Most of the current balancing done in GC is focused on marking, simply because marking is usually the phase that takes the longest. If you are going to pick tasks to balance, it makes more sense to balance the longest part that is most prone to being unbalanced – balancing work does not come without a cost.
Marking uses a mark stack, which makes it a natural target for work stealing. When a GC thread is done with its own marking, it looks around to see if other threads’ mark stacks are still busy and, if so, steals an object to mark. This is complicated by the fact that we implement “partial mark”, which means if an object contains many references we only push a chunk of them onto the mark stack at a time so we don’t overflow the stack. This means the entries on the stack may not be straightforward object addresses. Stealing needs to recognize specific sequences to determine whether it should search for other entries or read the right entry in that sequence to steal. Note that this is only turned on during full blocking GCs, as the stealing does have a noticeable cost in certain situations.
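The basic shape of stealing from a mark stack can be sketched like this. It is a simplified, sequential simulation, not the GC’s implementation: the workers take turns instead of running as real threads, the entries are plain object ids (so there is no partial mark), and the name `mark_with_stealing` and the "steal from the longest stack" policy are assumptions for illustration.

```python
def mark_with_stealing(graph, roots_per_worker):
    """Mark every object reachable from the roots.

    graph: dict mapping an object id to the ids it references.
    roots_per_worker: one list of root ids per simulated GC thread.
    Returns (set of marked ids, number of steals).
    """
    stacks = [list(roots) for roots in roots_per_worker]
    marked = set()
    steals = 0
    while any(stacks):
        for stack in stacks:
            if not stack:
                # Out of local work: look for the busiest mark stack.
                victim = max(stacks, key=len)
                if not victim:
                    continue
                # Steal from the bottom, away from the end the owner pops.
                stack.append(victim.pop(0))
                steals += 1
            obj = stack.pop()
            if obj in marked:
                continue
            marked.add(obj)
            stack.extend(ref for ref in graph[obj] if ref not in marked)
    return marked, steals


# Worker 0 starts with a whole tree, worker 1 with a single leaf.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": [], "x": []}
marked, steals = mark_with_stealing(graph, [["a"], ["x"]])
print(sorted(marked))   # ['a', 'b', 'c', 'd', 'e', 'x']
print(steals >= 1)      # True: worker 1 took work from worker 0
```

With partial mark, a stack entry could stand for "the rest of this object’s references" rather than a single object, which is why a real stealer has to inspect the entries before it takes one.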
Performance work is largely driven by user scenarios. And as our framework is used more and more in high-performance scenarios, we are always doing work to shorten the pause time. Folks have asked about concurrent compacting GCs and yes, we do have that on our roadmap. But it does not mean we will stop improving our current GC. One of the things we noticed from looking at customer data is that when we are doing an ephemeral GC, marking young gen objects pointed to by objects in older generations usually takes the longest time. Recently we implemented work stealing for this in 5.0 by having each GC thread take a chunk of the older generation to process each time. It atomically increases the chunk index, so if another thread is also looking at the same generation it will take the next chunk that hasn’t been taken. The complication here is that we might have multiple segments, so we need to keep track of the current segment being processed (and its starting index). When a thread gets to a segment that has already been processed by other threads, it knows to advance past this segment. Each chunk is guaranteed to be processed by only one thread. Because of this guarantee, and the fact that relocating pointers to young gen objects shares the same code path, this relocation work is also balanced in the same fashion.
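The chunk-claiming scheme can be sketched as below. This is an illustration under stated assumptions, not the GC’s code: the class name `ChunkCursor` is made up, a lock stands in for the atomic increment, and a single global chunk index is mapped back to a segment using each segment’s starting index.

```python
import threading


class ChunkCursor:
    """Hands out (segment, chunk) pairs so each one is claimed once."""

    def __init__(self, chunks_per_segment):
        self.chunks_per_segment = chunks_per_segment
        # Starting global index of each segment: [0, 3, 8] for [3, 5, 2].
        self.starts = []
        total = 0
        for count in chunks_per_segment:
            self.starts.append(total)
            total += count
        self.total = total
        self.next_index = 0
        self.lock = threading.Lock()  # stands in for an atomic increment

    def claim(self):
        with self.lock:
            index = self.next_index
            self.next_index += 1
        if index >= self.total:
            return None  # every chunk has been taken
        # Advance past segments whose chunks were all claimed already.
        seg = 0
        while index >= self.starts[seg] + self.chunks_per_segment[seg]:
            seg += 1
        return seg, index - self.starts[seg]


def worker(cursor, processed, out_lock):
    while (chunk := cursor.claim()) is not None:
        with out_lock:
            processed.append(chunk)  # a real GC would scan the chunk here


cursor = ChunkCursor([3, 5, 2])
processed, out_lock = [], threading.Lock()
threads = [
    threading.Thread(target=worker, args=(cursor, processed, out_lock))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(processed), len(set(processed)))  # 10 10
```

Because a thread that finishes its chunk early simply claims the next one, a thread stuck on an expensive chunk does not hold up the chunks behind it.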
We also do balancing work at the end of the phase to even out any imbalance left over from earlier work in the same phase.
There are other kinds of balancing but those are the main ones. More kinds of work can be balanced for STW GCs. We chose to focus more on the mark phase because it’s the most needed. We have not balanced the concurrent work just because it’s more forgiving when you run concurrently. Clearly there’s merit in balancing that too so it’s a matter of getting to it.
As I mentioned, we are continuing the journey to make things more balanced. Aside from balancing the current tasks more, we are also changing how heaps are organized to make balancing more natural (so the threads aren’t so tightly coupled with heaps). That’s for another blog post.