Debugging a JVM Crash for LinkedIn – Part 3
Welcome to Part 3 of our investigation into a JVM crash for LinkedIn. This blog post concludes the investigation we began in Debugging a JVM Crash for LinkedIn – Part 1 and continued in Debugging a JVM Crash for LinkedIn – Part 2. In Part 2, we analyzed the core dump and the instruction where the JVM crashed to find clues as to the cause.
Looking for a Fix
From our earlier investigations, it’s clear this is a Just-In-Time (JIT) compiler bug. It’s a memory addressing error that was introduced by the JIT compiler – it’s not something a Java programmer could have introduced. But is the compiler calculating the wrong address for the read, or is it reading too much?
The next step in an investigation like this is to check if there’s an existing JBS bug that’s related to this issue. The OpenJDK community is full of experts dedicated to ensuring that Java has the best, most reliable, runtime on the planet. Someone may have already seen this, and potentially even have fixed it!
We, therefore, head over to the JDK Bug System and search for vpxor. What do you know? This issue pops up straight away: c2 loop unrolling by 8 results in reading memory past array.
Reading the description, it looks *exactly* like the behaviour we’re seeing:
- The error is a SIGSEGV
- It’s happening on a vpxor instruction
- It’s happening very infrequently, and only when the access is at the end of a memory region
- It only happens when C2 is performing an optimization known as loop unrolling
The bug description shows two versions of the problem. One with compressed ordinary object pointers (oops):
vmovq 0x10(%r8,%rdi,1),%xmm0 <- read 8 bytes from byteArray1 (r8)
vpxor 0x10(%r11,%rdi,1),%xmm0,%xmm0 <- read 16 bytes from byteArray2 (r11) and xor them with xmm0
And one without:
vmovq 0x18(%rcx, %r10, 1), %xmm0
vpxor 0x18(%rbp, %r10, 1), %xmm0, %xmm0 <- reading 16 bytes result in reading past mapped memory region
It looks like the 0x18 we’re seeing in the LinkedIn code is the size of the object header when compressed oops is disabled. It’s unclear why it’s being used twice in the address calculation, but our assumption at this point is that it’s not relevant to the problem at hand.
Here’s where the LinkedIn code crashes (now including the instruction right before the crash):
0x7ffb860d7058: vmovq 0x18(%rcx,%r9,1),%xmm0
0x7ffb860d705f: vpxor 0x18(%rdi,%r9,1),%xmm0,%xmm0
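To make the operand syntax concrete: in AT&T notation, 0x18(%rdi,%r9,1) addresses base + index × scale + displacement. The sketch below is illustrative only. The register values and the end of the mapped region are made-up numbers, not values taken from the LinkedIn core dump, and the class and method names are hypothetical. It shows how an 8-byte read can end exactly at the edge of a mapped region while a 16-byte read from the same address runs past it.

```java
public final class EffectiveAddressSketch {

    // An AT&T operand disp(base,index,scale) resolves to base + index*scale + disp.
    static long effectiveAddress(long base, long index, int scale, long disp) {
        return base + index * scale + disp;
    }

    // True if reading `width` bytes starting at `addr` would touch memory
    // at or beyond `regionEnd` (the first address that is NOT mapped).
    static boolean readsPast(long addr, int width, long regionEnd) {
        return addr + width > regionEnd;
    }

    public static void main(String[] args) {
        // All three values are hypothetical, chosen so that the array's
        // last 8 bytes sit right at the end of the mapped region.
        long rdi = 0x7000_0000_0fd0L;       // array base address
        long r9 = 0x10L;                    // byte offset into the array data
        long regionEnd = 0x7000_0000_1000L; // end of the mapped region

        long addr = effectiveAddress(rdi, r9, 1, 0x18);
        System.out.printf("effective address    = 0x%x%n", addr);
        System.out.println("8-byte read past end  = " + readsPast(addr, 8, regionEnd));
        System.out.println("16-byte read past end = " + readsPast(addr, 16, regionEnd));
    }
}
```

With these made-up inputs the 8-byte read stays inside the region and the 16-byte read does not, which is the same shape as the failure described in the JBS issue.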
Reading the bug description and comments further, the problem is that, although vmovq operates on 8 bytes, vpxor reads 16 bytes from its memory operand (see MOVQ and PXOR). This is fine during the vectorized main loop as long as the remaining length of the vector is >= 16 bytes. If fewer bytes remain, we get this erroneous read. The suggested fix is to allow vpxor only when there are >= 16 bytes remaining; otherwise, the remaining bytes are processed in the unvectorized post loop. In other words, the bug is that the wrong instruction is selected under these circumstances.
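The guard described in the fix can be modelled in plain Java. This is only a sketch of the control flow, not the C2 change itself: the class and method names are hypothetical, and the 16-byte block simply stands in for the width that vpxor reads from memory. The wide loop runs only while at least 16 bytes remain, and a byte-at-a-time post loop handles whatever is left.

```java
public final class XorTailSketch {

    // Width of the wide step, matching the 16 bytes vpxor reads from memory.
    static final int VECTOR_BYTES = 16;

    // XORs src into dst over their common length and returns how many
    // 16-byte wide steps were taken. The inner loop is a scalar stand-in
    // for what a single vector instruction would do.
    static int xorInto(byte[] dst, byte[] src) {
        int n = Math.min(dst.length, src.length);
        int i = 0;
        int wideSteps = 0;

        // "Main loop": entered only while a full 16-byte block remains,
        // so a 16-byte read can never extend past the end of the data.
        while (n - i >= VECTOR_BYTES) {
            for (int j = 0; j < VECTOR_BYTES; j++) {
                dst[i + j] ^= src[i + j];
            }
            i += VECTOR_BYTES;
            wideSteps++;
        }

        // "Post loop": fewer than 16 bytes remain, so go one byte at a time.
        for (; i < n; i++) {
            dst[i] ^= src[i];
        }
        return wideSteps;
    }

    public static void main(String[] args) {
        byte[] a = new byte[24];
        byte[] b = new byte[24];
        java.util.Arrays.fill(a, (byte) 0x0f);
        java.util.Arrays.fill(b, (byte) 0xf0);

        int wideSteps = xorInto(a, b);
        System.out.println("wide steps taken: " + wideSteps);
        System.out.println("bytes left for the post loop: " + (a.length - wideSteps * VECTOR_BYTES));
    }
}
```

For a 24-byte input, the last 8 bytes fall through to the post loop instead of being covered by a second 16-byte read, which is the situation the buggy instruction selection got wrong.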
Now that we understand the bug and are confident that it’s what LinkedIn is seeing, what do we do next?
Well, when we initially encountered this issue, we looked at the bug status and saw that it had not only already been fixed in “tip” (the latest jdk repo), but that the fix had also been backported to JDK 11 and would be released as part of the 11.0.14 Patch Set Update (PSU) in January. However, we encountered this before the 11.0.14 release date, so we needed a workaround to mitigate the issue right away while waiting for the fix to land.
Since it’s a JIT compilation bug, the workaround is simply to stop compiling the offending method. All we need to do is tell the JVM to exclude the method from compilation, so that it only ever executes in interpreted mode. Because we’re speculating that the issue could be coming from a method inlined in the compilation of initSecContext, we should disable the compilation of that method and of any method it calls that might have included the location of the crash.
To disable the compilation of a method, one approach is to use the CompileCommand flag when we launch the JVM. We found multiple ways to specify this exclusion that work, as follows:
-XX:CompileCommand=exclude,sun/security/jgss/krb5/Krb5Context.initSecContext
-XX:CompileCommand=exclude,sun/security/jgss/krb5/Krb5Context.initSecContext()
-XX:CompileCommand=exclude,sun/security/jgss/krb5/Krb5Context,initSecContext
-XX:CompileCommand=exclude,sun.security.jgss.krb5.Krb5Context::initSecContext
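When rolling out a flag like this across many services, it can help to confirm that the option actually reached the running JVM. The following is a minimal sketch, assuming the flag is passed as a JVM argument; the class name CompileCommandCheck and its helper method are hypothetical. It uses the standard RuntimeMXBean.getInputArguments() call to list any CompileCommand options the JVM was started with.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public final class CompileCommandCheck {

    // Returns only the arguments that set a CompileCommand option.
    static List<String> compileCommands(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-XX:CompileCommand="))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> found = compileCommands(jvmArgs);

        if (found.isEmpty()) {
            System.out.println("No -XX:CompileCommand options were passed to this JVM");
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

This only shows that the option was supplied. It does not prove that the method is being excluded, so it complements rather than replaces checking the JVM's own compilation output.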
Note that a workaround like this can affect performance, as the method will no longer be compiled – it will run in interpreted mode. If it’s an expensive method that is called often, performance can suffer. However, correctness often trumps performance, so this kind of workaround is usually essential for stability until a fix for the issue lands.
We were lucky in this case that the method in question was not seen to be performance-critical for LinkedIn. As a result, the workaround was a good stopgap until the fix became available in an OpenJDK update.
When LinkedIn upgraded to version 11.0.14 of the Microsoft Build of OpenJDK, they got the fix described in the JBS issue and this crash went away. Once again, the OpenJDK community came to the rescue to find the root cause of a nasty bug and implement a fix!
This was an interesting effort which showed some of the work that Microsoft’s Java Engineering Group does for our customers on a regular basis, and which delivered a good result for LinkedIn. It also served as a reminder of how the generosity and expertise of many JVM engineers around the world help make Java the most stable runtime on the planet!
For further information on using HotSpot error logs to debug JVM crashes, I recommend that you check out Fatal Error Log – Troubleshooting Guide for HotSpot VM and Andrei Pangin – JVM crash dump analysis.
Thanks for reading!