How Tiered Compilation works in OpenJDK
The OpenJDK HotSpot runtime system is a complex piece of software that employs several techniques to optimize the execution of Java programs on the fly. The system is composed of two different compilers, one interpreter and several different Garbage Collectors, among several other components. These compilers are orchestrated in a tiered compilation mechanism which tries to use the most suitable compiler for each method.
HotSpot has the critical goal of generating efficient machine code while keeping the runtime cost at a minimum. To achieve that it employs various strategies, such as tiered compilation, dynamic profiling, speculation, deoptimization, and various compiler optimizations, both architecturally dependent and independent. In this post, we will delve into the motivation behind tiered compilation and understand how it works.
The Right Tool for The Job
The basic idea behind tiered compilation is that most of the execution time of a program is spent on a small part of the code (i.e., a few methods). Furthermore, the cost to compile a method (i.e., CPU cycles) is the same whether it executes only once or a million times. Based on these observations the VM is designed to use a “simple compiler” for methods that do not execute “many times”, i.e., cold methods, and use a more “optimizing” compiler for methods that run “a lot”, i.e., hot methods. The rationale is as follows:
- We do not want to spend too many CPU cycles compiling a method that is going to be executed only once. In other words, we do not want to spend more cycles compiling a method than executing it!
- If a method executes a sufficiently high number of times the cost of compiling it plus the cost of running it, in an optimized form, will be lower than the cost of running it in an unoptimized form.
Let us use some hypothetical values to exemplify the points above. Consider a VM that has three different compilers: Comp1, Comp2, Comp3. Each of these compilers is slower than the previous one to produce code, but the actual generated code is better than the one produced by the previous compiler. Comp1 is a fast and simple compiler that produces “slow code”. Comp2 is slower than Comp1, but it produces slightly “faster code”. Comp3 is a compiler that produces very efficient code, but it is much slower than Comp1 and Comp2.
Based on these descriptions, suppose we have the values shown in the table below for the compilation of a random method. The Compilation column shows the cost, as the number of CPU cycles to compile the method. The Cycle Count column shows the number of cycles the code produced for the method takes to execute.
These values mean that Comp1 can compile the method 100x faster than Comp3, however the produced code will execute 5x slower. The table below shows the total number of CPU cycles necessary to compile and execute the method in four different scenarios: A) The method executes only once; B) The method executes one hundred times; C) The method executes 1000 times and, D) The method executes 1,000,000 times.
If we use Comp1 in Scenario A, the total number of cycles to compile and execute the method would be 51 (1 to compile + 50 to execute once). If we use Comp2 for this same scenario the total number of cycles would have been 40 (10 cycles to compile the method + 30 to execute it once). If, on the other hand, we use Comp3 the total number of cycles would be 110 (100 to compile + 10 to execute once). Therefore, the best compiler, if the method is executed only once, is Comp2.
In scenario D, if we use Comp1 the total number of cycles would be 50,000,001 (1 to compile the method and 50,000,000 to execute). If we use Comp3 in this scenario, the total number of cycles would be 10,000,100 (100 cycles to compile the method and 10,000,000 to execute it 1,000,000 times).
As you can see, the numbers are quite different depending on the compiler that is used and/or the number of times the method executes. In other words, the best compiler for compiling the method depends on how many times the method is going to execute. In the next section, we will look at how to “predict” (or guess) how many times a method will execute.
Hot Code Prediction
In the previous section I intentionally used terms like “a lot”, “simple”, etc. In this section I explain exactly what I mean by each of them.
By a “simple” compiler, I mean one that can produce code quickly (i.e., the compilation cost is not large) but there is no stringent requirement for the machine code produced to be the most efficient possible. On the other hand, a sophisticated compiler (sometimes called an optimizing compiler) is one that employs many optimizations, aiming to produce the most efficient machine code possible. As you can imagine, the compilation cost is different for the two compilers. Also, note that we do not have to settle with just the two extremes, we can employ a range of different compilers with different capabilities, or we can just use the same compiler but run a different list of optimizations for different pieces of code (like GCC or Clang with the distinct levels of optimization).
We saw in the previous section that the number of times a method executes is a crucial factor in deciding which compiler to use. Baked in our example was the assumption that we know how many times a method is going to execute before we even compile it, because we are going to use that information to decide how to compile it.
Now, how do we tell, before we even compile a method, whether it is going to execute “many times” or not? What is “many times”? The answer is: we cannot – sort of! The problem that we are trying to solve is a version of an important [undecidable] problem in Computer Science – The Halting Problem. What we do in practice is find a workaround for the problem: we assume the method is cold and compile it using the simple compiler and then execute the method a few times, let us say N. If the method is executed N times, we assume that it will execute at least N more times in the future – and therefore is a hot method. If the method does not even reach N executions, then it is a cold method, and we already used the right compiler for it – the simple one. In other words, we use threshold N to find when a method has become hot, and we use that as an indicator that the method will continue to be hot in the future.
The question then is how to choose the value of the threshold N. If we make N = 1 there is a good chance our assumption (the method will execute N more times) will hold most of the time, however that does not help much to find methods that are hot. If we make N = 1,000,000 we increase our chance of finding real hot methods, but we also increase the probability that our assumption – that the method will execute N more times – will be false. Moreover, we executed a method compiled with the simple compiler 1,000,000 times.
In practice what we do is we choose N based on empirical evidence. We consider, among other things, the compilation costs, the performance of the code produced by our compilers, the kind of workload that we have, how many compilers do we have, etc., and try to produce an educated guess for the threshold. There is hardly an ideal threshold that will work for every situation. For instance, there are programs where the execution is more spread throughout the source code (e.g., GUI programs) and other programs where the execution is much more limited to a small part of the source code (e.g., scientific programs or benchmarks). The behavior of a program might even go through phases where different portions of the code become hot at different moments.
In the next section we are going to take a quick look at how HotSpot deals with this problem.
Tiered Compilation in HotSpot
As we will see, the Tiered Compilation process in HotSpot is a very elaborate process, but the idea is still the same one that we discussed in the previous section: use the best tool for the job.
HotSpot has 2 compilers, C1 (source here) and C2 (source here), and an interpreter. There may be multiple threads executing each compiler and the number of such threads is adjusted dynamically based on the load of the system. That means that several methods may be compiled concurrently. Furthermore, the execution of a method may continue uninterrupted in its current form while another version of the method is being compiled. The compiler threads act upon two priority queues, one for each compiler, containing Compilation Tasks (source here) that describe, among other things, which method should be compiled.
The two compilers in different configurations are combined with the interpreter to produce a total of five distinct optimization levels:
|C1 without profiling
|C1 with basic profiling
|C1 with full profiling
|C2 full optimizing, no profiling
Normally the execution of a method starts in the interpreter, which is the simplest/cheapest mechanism that HotSpot has available to execute code. During the execution of the method via interpretation, or in compiled form, the dynamic profile of the method is collected using instrumentation. The method’s profile is then used by several heuristics that will decide, among other things, if the method should be compiled, recompiled at a different level, which optimizations should be performed, etc.
The two most fundamental pieces of information collected by HotSpot during the execution of a method are how many times the method has been executed and how many times the loops in a method have iterated. That information is used by a Compilation Policy (source here) to decide whether to compile a method or not, and at which level. HotSpot also has a feature called On-Stack-Replacement (OSR) that makes it possible to compile hot loops independently of their containing method. This is an interesting feature to work around cases where a hot loop is inside a method that has not itself become hot.
The following formula (see source here and here) is used by the Compilation Policy to decide whether to create a compilation request at level 3, for a method that is currently being interpreted. The same formula is used to decide if a request should be created to compile a method at level 4 if the method is currently executing as a level 3 compilation. A similar formula is used to control OSR compilations.
(Executions > TierXInvocationThreshold * Scale)
(Executions > TierXMinInvocationThreshold * Scale AND Executions + Iterations > TierXCompileThreshold * Scale)
- TierXInvocationThreshold – compile the method at level X if the number of invocations crosses this threshold. Defaults to 200 for level 3 and 5000 for level 4.
- TierXMinInvocationThreshold – minimum number of executions needed to compile the method at level X. Defaults to 100 for level 3 and 600 for level 4.
- TierXCompileThreshold – compile the method at level X if the method plus the number of its loops’ iterations have executed more than this threshold. Defaults to 2,000 for level 3 and 15,000 for level 4.
- Executions – is the number of times the method has been executed.
- Iterations – is the accumulated number of iterations that loops inside the method have been executed.
- Scale – is a scaling factor applied to keep the load in the compilation queues stable. It is periodically dynamically adjusted based on the number of entries in the compilation queues and the number of compiler threads.
HotSpot also takes into consideration the effect of the compilation in other parts of the system and how much load the system is currently handling. Assuming a method is being interpreted, and the above formula is satisfied to compile the method at level 3, the Compilation Policy may actually decide to compile it at level 2 (C1 with basic profiling) instead of level 3 (C1 with full profiling). The decision is based on the following factors:
- The length of the C2 queue. The observation is that a method compiled at level 3 is generally slower than the same method compiled at level 2, therefore we would want to minimize the time a method spends at level 3. We should only spend the time at level 3 that is necessary to gather adequate profiling. So, if the C2 queue is long enough it is more beneficial to go first to level 2, because if we transitioned to level 3, we would be executing 30% slower until the C2 compile request makes its way through the long queue. Instead, the decision is made to compile the method at level 2 and when the load on C2’s queue recedes we are going to recompile the method at level 3 and start gathering profiling information.
- The length of C1 queue is used to dynamically adjust the thresholds, to introduce additional filtering if the compiler is overloaded. The rationale is that if the compiler queue is too long the method may not be executed anymore by the time its compilation finishes.
Once a method compiled at level 3 has been executed enough times to satisfy the formula for transition to level 4, a request is created to recompile the method at that level. At this level, HotSpot compiles the code using the C2 compiler for maximum long-term performance. Given that level 4 code is considered fully optimized, the JVM stops collecting profiling information.
When a method is first compiled by C1 some basic information is collected about its structure, like the number of instructions and loops. Based on that information it can be decided that a method is trivial – it returns a constant or is a getter or setter (see source here) – and compiling it with C1 will yield the same code. In this case the method is always compiled at level 1.
The description above is a simplified version of the compilation policy. There are several other points that HotSpot considers when deciding to compile or recompile a method, and several flags that the user of the JVM can use to customize the policy. For our discussion here Deoptimizations are probably the most relevant topic.
C2 uses the information about the dynamic profile of a method to guide several optimizations. Here are a few examples, to name just a few: Inlining, Class Hierarchy Analysis (CHA), basic block ordering, some loop optimizations, etc.
- The profile shows that an “if-else” statement has only executed the “then” part in the past. C2 may assume that that will continue to happen in the future and decide to not compile the “else” block at all.
- The profile shows that a reference variable was never null. C2 may then assume that the variable will likely continue to not be null in future executions and optimize the code using that assumption.
- The profile shows that a loop usually iterates thousands of times. C2 may decide to unroll and/or vectorize the loop based on that information.
- The profile shows that a class has no subclasses. C2 may decide to inline or do other kinds of optimization in method calls using objects of that class.
There is a catch, though: the dynamic profile of a method is, well, dynamic. There is no guarantee that a method’s profile will be stable during the application’s execution. To give a few examples:
- A condition used in an “if-else” statement that during most of the execution of the program was true, suddenly starts returning false.
- An exception that was never triggered starts triggering.
- A reference variable that was never null starts showing up as null.
- A new class is loaded in the system and a previously simple class hierarchy now is more complex.
- A loop that usually iterates thousands of times now starts iterating just a few times.
To account for this dynamic behavior C2 will insert “predicates” into the compiled code to check if the assumptions that it made based on the profile information are still valid in future executions. If the predicate is false, a Trap (a call to JVM internals) will be executed to notify HotSpot which assumption was false. With the information provided by the Trap, HotSpot will decide what needs to be done. The current compilation of the method may need to be discarded and the method reprofiled and recompiled, or it may just be necessary to switch the execution to the interpreter. Look here for a list of the possible reasons that a Trap may execute and here for a list of the necessary actions to be taken.
Let us see in the next section how to examine the list of compilations produced by HotSpot.
In this section we are going to see a few basic flags used to control compilations in HotSpot. I should mention, first, that these flags are not meant to “tune” the application performance. They are more intended for investigating the root cause of bugs.
The “-Xcomp” and “-Xint” options instruct the JVM that methods must only be compiled (-Xcomp) or that they should only be interpreted (-Xint). In other words, if you pass the “-Xint” to the JVM then all JIT compilers will be disabled and the only mechanism that HotSpot will use to execute the program is interpretation. On the other hand, if you pass “-Xcomp”, HotSpot will not use the Interpreter and methods will be always compiled.
The “-Xcomp” option, however, does not force HotSpot to compile a method at a certain optimization level. Other options can be used to do that, among them are -XX:TieredStopAtLevel and -XX:-TieredCompilation. The first one is used to set the maximum compilation level. The most common use case for this is probably to force HotSpot to only use the C1 compiler in order to bypass some C2 bug or to stress test C1. The second option will cause HotSpot to not use the Tiered Compilation heuristics to guide the transitions between the compilations and instead it will use another heuristic to choose between C1 and C2 for all compilations.
To illustrate the impact of Tiered Compilation on program execution time let us do some quick experiments using the options that we just saw. This is not going to be a comprehensive experiment; it is intended to just give an idea of the impact that these different compilation strategies can have on the JVM and application performance.
The table below shows the average wall time to execute “java-17.07 [FLAGS] HelloWorld” on an AMD Ryzen 7 1800X machine running Ubuntu 20.04. The HelloWorld program just prints the “Hello World” message on the console.
|Average Wall Time
These results show that, for this example, disabling the interpreter (i.e., -Xcomp) causes a huge increase in the execution time when compared to the default JVM options. The reason, as explained above, is that many methods will not be executed enough to amortize the cost of their compilation. On the other hand, for this example, if you configure the JVM to use only the interpreter the performance will be like when you do not do any customization at all; that is because this example does not have long running methods, and therefore interpreting is the best choice anyway.
Tiered compilation is a technique used by many virtual machines to make execution of programs more efficient. The core idea is finding the sweet spot between the cost of compiling a method and performance of the compiled code. To achieve that, the VM uses optimizing compilers only on methods that execute often and use non-optimizing compilers (or an interpreter) on methods that do not execute much. In other words, the VM uses a combination of compiled and interpreted code to execute the client program.
The OpenJDK JVM employs two JIT compilers and an interpreter to implement a sophisticated form of tiered compilation. The JVM initially uses the interpreter to execute methods, and when the execution count of a method satisfies a heuristic, the method is enqueued for compilation using the C1 compiler. If the execution count of the method is large enough it will meet another condition and the method will be compiled using the C2 optimizing compiler. Profiling information collected from the interpreted and C1 compiled version of the method is used by C2 to drive many optimizations.
Most methods in a program are executed only a few times, sometimes just once. Other methods are executed millions of times. There is an informal saying that programs spend 90% of their execution time executing only 10% of the methods. Spending CPU time compiling a method not critical for the program’s execution will result in unnecessary overhead. Tiered Compilation is a practical approach that combines engineering and heuristics to enable a VM to infer if a method will execute often or not and use that information to decide if the method should be interpreted or compiled. As a case in point, executing a “Hello World” Java program using only compilation results in an execution time 40x greater than executing it using only interpretation.