What makes managed code, “managed”? Most people would point to the garbage collector. Automatic memory management makes a tremendous difference in programmer productivity. And when garbage collection improves, all .NET applications benefit. Abhishek Mondal, the program manager for GC on the Common Language Runtime, and Maoni Stephens, the developer for GC on the CLR, authored this article. — Brandon
In this post, we will look at how the CLR garbage collector (GC) has been changed in the .NET Framework 4.5 to meet the needs of large client and server apps. These improvements are in response to requests from developers who use the .NET Framework to build large-scale commercial apps. Some of these customers have already reported significant wins after deploying the .NET Framework 4.5 (currently available as an RC release) into production.
The needs of large-scale apps
Ever since the .NET Framework was introduced, developers have been using this technology to build client and server apps of increasing size and complexity. The larger an app gets, the more resources it will consume, and memory is one of the major resources. For example, some developers have built massive-scale websites and services that are used by millions of end-users. These sites typically need to deliver some combination of high throughput and low latency, and have to provide access to data in huge databases. Each year, the traffic to these sites grows and so does the amount of data they serve up. At the same time, these developers also strive to deliver increasingly better end-user experiences, which are sometimes defined by formal service level agreements (SLA). We have seen similar examples on the client.
Developers adopt new approaches and architectures in their apps to meet the increasing demands of customers. Newer .NET Framework features such as the async pattern can sometimes help. However, developers of large-scale apps have told us that they need changes in the GC to continue to grow the scale of apps effectively, particularly on the server. We have many partners within Microsoft, such as Exchange Server, SQL Server, Bing, Microsoft Dynamics CRM, and SharePoint, who build sites that serve millions of visitors and who have the engineering experience to help validate the changes that we made to the CLR GC. We used the combination of customer requests that we received and the partner experience within Microsoft to determine a set of important improvements in the GC for the .NET Framework 4.5.
We are happy to report that we’ve improved the GC to handle the latest workloads we are seeing: heap sizes in the tens of gigabytes, running on machines with ever-increasing memory and core counts, and using configurations such as non-uniform memory access (NUMA).
Key customer scenarios for the .NET Framework 4.5 GC
After we collected feedback from developers and our Microsoft partners, we determined a set of GC improvements that would satisfy a broad set of the requests and that would benefit both server and client apps. I’ve listed the requests below, described in terms of app requirements.
Server apps
- My app requires shorter pauses.
- My app requires higher throughput.
- My app should scale on modern hardware.
Client and server apps
- My app cannot tolerate pauses during a certain time window.
- The large object heap takes up too much space.
- My app works on large datasets (uses objects > 2 GB).
Beyond addressing these customer requests, we did a lot of work to improve the overall performance of the GC. These improvements should help all apps.
I will now describe how the garbage collector in the .NET Framework 4.5 addresses each of these important scenarios.
My app requires shorter pauses
Developers tell us that they need to deliver timely user experiences, often in terms of an SLA. The longest pauses are due to full blocking GCs. A full blocking GC may cause them to miss their SLA, so reducing this pause can be critical. Before version 4, the .NET Framework provided a concurrent GC mode that performed full GCs concurrently with user code (vs. blocking, which pauses all user threads), thus reducing pause time for full GCs. This mode was available only for workstation GC. In the .NET Framework 4, we delivered an improved version called the background workstation garbage collection, which reduced latency but only benefited client apps. In the .NET Framework 4.5, we have delivered background server garbage collection, which is typically used for server apps. As a result, all apps now have background GC available to them, regardless of which GC they use.
The new background server GC in the .NET Framework 4.5 offloads much of the GC work associated with a full blocking collection to dedicated background GC threads that can run concurrently with user code, resulting in much shorter (less noticeable) pauses. One customer reported a 70% decrease in GC pause times.
You won’t need to change anything to take advantage of background server GC. Like most GC features, it is turned on automatically for all apps.
My app requires higher throughput
If an app triggers many GCs and spends a lot of time in GC, throughput drops noticeably. There are a variety of reasons why an app can trigger many GCs. One of the common cases is imbalanced GC heaps.
For server GC, there’s one heap per logical processor. When one of the heaps runs out of space for allocations, it triggers a GC. Usually, server apps have a pool of worker threads that perform similar types of tasks, and roughly the same amount of memory is allocated per thread, which results in naturally balanced managed heaps. If this is not the case, and one thread allocates a lot more memory than other threads, the heap that corresponds to its CPU runs out of space to allocate quickly and triggers a GC. GC has a “heap balancing” mechanism to help these apps. In the .NET Framework 4 and earlier versions, the small object heap (SOH) was balanced, but the large object heap (LOH) was not, so imbalanced LOH allocations in server GC would trigger more full GCs (because we collect LOH only in full collections). In the .NET Framework 4.5, the server GC allocator balances the allocations across the heaps when it finds that GC heaps, including both SOH and LOH, are not balanced. In apps that begin to develop imbalanced heaps, this mechanism eliminates unnecessary GCs. This reduces the total time spent in GC and improves the app throughput.
My app should scale on modern hardware
Developers have asked us about non-uniform memory access (NUMA) architectures and machines with more than 64 logical processors. In the .NET Framework 4.5, the GC has been updated to take advantage of both of these hardware advances.
We already talked about how the GC balances heaps for all the logical processors available to an app. In the .NET Framework 4.5, for NUMA machines, the GC balances heaps within NUMA nodes before considering balancing across nodes. This improves performance. Within a NUMA node, each logical processor prefers the same subset of system memory, so it makes sense to perform heap balancing within a node before balancing across nodes. The whole point of NUMA is that memory access times are not uniform, with some memory access being faster than others. If an app mostly accesses the least costly memory, it will have a performance advantage. For readers who are familiar with NUMA, this means that the GC prefers “local memory” to “remote memory” as part of heap balancing.
Starting with Windows 7 and Windows Server 2008 R2, Windows supports more than 64 processors on a single computer. For machines that have very large numbers of processors (more than 64), the operating system splits the processors into multiple processor groups. In the .NET Framework 4.5, the GC can view CPUs across processor groups and takes all the cores into account when creating and balancing heaps.
To turn on processor group support, enable the <GCCpuGroup> element in your app’s configuration file. NUMA support needs no configuration: the GC detects NUMA hardware and enables support for it automatically. The two features can be used together.
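As a rough illustration, a configuration file that opts in to processor group support might look like the fragment below. The <gcServer> line is included on the assumption that the app should run with server GC, which the processor group setting is meant to accompany; check the configuration reference for your runtime version before relying on it.

```xml
<configuration>
  <runtime>
    <!-- Let the GC create and balance heaps across all processor groups. -->
    <GCCpuGroup enabled="true"/>
    <!-- Assumption: the app runs with server GC, which this setting accompanies. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```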
My app cannot tolerate pauses during a certain time window
A growing subset of .NET developers has built commercial apps and services that deliver results according to defined business requirements or SLAs. Stock markets are examples of services that must deliver very timely results while markets are open. Typically, these apps perform significant work during the time when they want to deliver low-latency results. Yet they cannot tolerate noticeable pauses due to a collection.
Our customers have told us that they would deploy more memory on their servers if doing so would remove long pause times (which are typically introduced by full blocking GCs). In the .NET Framework 4.5, we provided that option by introducing SustainedLowLatency mode, which avoids full blocking GCs. This mode is also available for the workstation GC in the .NET Framework 4 via Update 4.0.3.
While the SustainedLowLatency setting is in effect, generation 0, generation 1, and background generation 2 collections still occur and do not typically cause noticeable pause times. A blocking generation 2 collection happens only if the machine is low in memory or if the app induces a GC by calling GC.Collect(). It is critical that you deploy apps that use the SustainedLowLatency setting onto machines that have adequate memory, so they will satisfy the resulting growth in the heap while the setting is in effect.
In the .NET Framework 4.5, SustainedLowLatency mode is available for both workstation and server GC. To turn it on, set the GCSettings.LatencyMode property to GCLatencyMode.SustainedLowLatency. The .NET Framework 4 includes a LowLatency mode for workstation GC; however, this setting is only intended to be used for short periods of time, whereas SustainedLowLatency mode is intended to be used for much longer.
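A minimal sketch of how an app might scope this setting to a latency-sensitive window follows. The class and method names (LowLatencyWindow, Run) are hypothetical, invented for this example; only GCSettings.LatencyMode and GCLatencyMode.SustainedLowLatency come from the framework.

```csharp
using System;
using System.Runtime;

static class LowLatencyWindow
{
    // Runs 'work' while SustainedLowLatency is requested,
    // then restores whatever latency mode was in effect before.
    public static void Run(Action work)
    {
        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
            work();
        }
        finally
        {
            // Always restore the earlier mode, even if 'work' throws.
            GCSettings.LatencyMode = previous;
        }
    }
}
```

For example, a trading service could wrap its market-hours processing loop in LowLatencyWindow.Run(...) and let full blocking collections resume once the window closes.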
The large object heap (LOH) takes up too much space
Developers have given us feedback that large object heap (LOH) fragmentation contributes to relatively large heap sizes. In extreme cases, developers have reported that the managed heap uses twice the space it should. In the .NET Framework 4.5, we have taken significant steps to solve this issue.
The GC is responsible for allocating managed memory in addition to collecting it. In preparation for LOH allocation requests, the GC builds up a list of free (available) memory blocks after a collection. As part of an allocation, the GC consults these free memory blocks one by one to determine whether any block in the list will satisfy the allocation. If a given memory block is a candidate, it is used for the allocation. In earlier releases of the .NET Framework, once a memory block was rejected as a candidate for an allocation, it was removed as a candidate for subsequent allocations. In the .NET Framework 4.5, we now retain free memory blocks as candidates for allocation requests until they are used. The new algorithm uses LOH space much more efficiently, leading to decreased fragmentation and memory use.
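The difference between the two policies can be illustrated with a toy model. This is not the GC's actual code: the class LohFreeListSketch, the Simulate method, and the choice to keep the leftover part of a used block on the list are all assumptions made for the sake of the sketch. It only shows why dropping rejected blocks forces more requests to be served from fresh heap space.

```csharp
using System;
using System.Collections.Generic;

static class LohFreeListSketch
{
    // Returns how many bytes had to come from fresh heap space because
    // no block on the free list could satisfy a request.
    public static long Simulate(IEnumerable<int> freeBlocks,
                                IEnumerable<int> requests,
                                bool retainRejected)
    {
        var free = new List<int>(freeBlocks);
        long freshBytes = 0;

        foreach (int size in requests)
        {
            bool satisfied = false;
            int i = 0;
            while (i < free.Count)
            {
                if (free[i] >= size)
                {
                    // Use this block; keep any leftover as a smaller free block.
                    int leftover = free[i] - size;
                    if (leftover > 0) free[i] = leftover;
                    else free.RemoveAt(i);
                    satisfied = true;
                    break;
                }

                if (retainRejected)
                    i++;               // newer policy: block stays a candidate
                else
                    free.RemoveAt(i);  // older policy: rejected block is dropped
            }

            if (!satisfied) freshBytes += size;
        }
        return freshBytes;
    }
}
```

With free blocks of 100 and 200 units and requests for 300 and then 90 units, the older policy discards both blocks while failing the first request, so both requests need fresh space (390 units). The newer policy keeps the blocks and serves the second request from the 100-unit block, so only 300 units of fresh space are needed.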
We’ve observed substantial improvements in LOH allocation benchmarks due to less fragmentation. We expect that you will see similar results in apps that allocate a significant set of short-lived large objects.
My app works on large datasets (uses objects > 2 GB)
64-bit machines are now commonplace, and apps often take advantage of the larger physical memory that can be deployed on these machines, with heaps sometimes in the tens of gigabytes. However, developers have reminded us that we do not support creating individual objects (typically arrays) that are larger than 2 GB. For data-intensive apps, this limit can present a challenge. In the .NET Framework 4.5, this limit can be lifted on 64-bit platforms through an opt-in configuration setting. The limit remains in place on 32-bit machines, due to the 32-bit address space.
In earlier versions of the .NET Framework, the following lines of code throw OutOfMemoryException exceptions due to the 2 GB limit on object sizes on the GC heap. In the .NET Framework 4.5, with very large object support enabled, the code succeeds, assuming that there is enough physical memory available on the machine:
new Object[1000000000]; //(approximate size of this array is 8 GB)
new int[50000,50000]; //(approximate size of this array is 10 GB)
new Dictionary(1000000000); //(approximate underlying array size is 24 GB)
Note that the array index is still a 32-bit number. By default, the .NET Framework does not support arrays that are greater than 2 GB, but in the .NET Framework 4.5, you can use the <gcAllowVeryLargeObjects> element in your application configuration file to enable arrays that are greater than this size.
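For reference, the opt-in looks roughly like this in an application configuration file. The setting only has an effect in a 64-bit process.

```xml
<configuration>
  <runtime>
    <!-- Allow arrays larger than 2 GB in total size on 64-bit platforms. -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```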
Background GC behavior in detail
The CLR background GC enhancements in the .NET Framework 4.5 should provide valuable improvements for many developers. These enhancements don’t require any configuration or interaction, but I’m sure that you will appreciate having more information about how they work.
The background workstation GC was introduced in the .NET Framework 4. Both workstation and server background GC share the following characteristics when a background GC is in progress:
- They apply only to generation 2 GCs.
- They don’t compact the heap.
- Generation 0 and 1 GCs can still occur while a background GC is in progress, but they do not run concurrently with it: they suspend all user threads and background GC threads.
Server GC: Before and after
In the .NET Framework 4, server GC suspends user threads for the entire duration of the collection. In the .NET Framework 4.5, much of the work of collecting generation 2 and the LOH happens in the background, without suspending user threads. The following illustration compares these behaviors. (Note that this is just an example provided for comparison purposes. The number of threads, allocations, and suspensions will vary depending on the scenario. The lengths of the arrows are also illustrative and do not represent the actual relative proportions of work.)
As you can see in Figure 1 above, in the .NET Framework 4 and earlier versions, user threads (threads 1, 2 and 3) are suspended (marked in dark blue) while a GC (all generations) occurs on the GC thread (marked in light blue). However, in the .NET Framework 4.5, the GC is able to offload a significant subset of the generation 2 GC work by using dedicated background threads (BGC threads 1 and 2). As a result, the GC reduces the total pause time on the user threads significantly. In general, apps should experience a few short pauses (marked in dark blue) when a generation 2 GC is triggered, but won’t have the long single pauses that might have been seen in earlier versions.
As mentioned before, the background GC threads only perform generation 2 collections. If a generation 0 or generation 1 collection needs to happen, it happens on the GC threads (marked in yellow) and requires all other threads to be suspended.
Testing the .NET Framework 4.5 GC on large-scale workloads
During development, we used another Microsoft team’s real workload as one of our test scenarios. In one setup, the maximum pause dropped from 2 seconds to 600 milliseconds (a 70% reduction). In another, it dropped from 1.3 seconds to 400 milliseconds (a reduction of roughly 70%). Here are two datasets from the experiment showing the difference in pause times:
The two charts above plot GC pause times (ms) for the same workload in the .NET Framework 4 vs. 4.5. Although the throughput is comparable in both scenarios, the noticeable pause times have been significantly reduced in 4.5 (the chart on the right). This improvement leads to more responsive apps.
Enabling background server GC
Background GC is on by default for both workstation and server GC. If you’d like to turn off background collections and revert to the old GC behavior, you can do so by setting the enabled attribute of the <gcConcurrent> element in your application configuration file to false. We also added a new overload for the GC.Collect method to allow requests for background GCs:
void Collect(int generation, GCCollectionMode mode, bool blocking)
You can set the blocking parameter to false to request a background GC. The garbage collector will decide whether to do a background GC or a blocking GC.
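A short sketch of calling this overload follows. The wrapper class and method names are hypothetical; the GC.Collect overload, GC.MaxGeneration, and GCCollectionMode.Forced are framework members. As noted above, passing false for blocking is only a request, so the collector may still choose a blocking collection.

```csharp
using System;

static class BackgroundCollectRequest
{
    // Asks for a generation 2 collection and indicates that a
    // background (non-blocking) collection is acceptable.
    public static void RequestBackgroundGen2()
    {
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: false);
    }
}
```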
Conclusion
In this post, I’ve discussed the improvements to the GC in the .NET Framework 4.5. These improvements are intended to provide value for developers who build large-scale client and server apps. The GC already works great for apps of more typical sizes, but these changes will help many smaller apps, too.
For a real-world example of how these improvements are benefiting our customers, take a look at the Channel 9 interview with one of our early adopters, the Bing team.
You can also read about the Bing team’s experience on the Windows Server blog.
As always, your feedback is what largely feeds our product planning, so keep your questions and comments coming. Tell us what you think about the GC improvements in this release and your suggestions for future releases.
–Abhishek and Maoni
These are great improvements. Pause times are a serious issue that still drives use of C++ over C#. I’d really like to someday see a zero-pause GC option, like Azul Zing / C4.