Debugging a JVM Crash for LinkedIn – Part 1

Brian Stafford

Introduction

As part of the work we do in the Java Engineering Group we provide support to many first- and third-party customers of the Microsoft Build of OpenJDK, particularly those who run Java workloads on Azure. One of our largest customers is LinkedIn.

In the past couple of years our group has worked with LinkedIn to assist with their migration to Java 11 and their adoption of the Microsoft Build of OpenJDK.

As is typical of support, we occasionally need to investigate crashes in the Java Virtual Machine (JVM). Recently, LinkedIn reported a JVM crash in one of their services powering Espresso, LinkedIn’s in-house NoSQL database.  At the time, this service was running JDK 11.0.8. Even with the large number of services and JVM instances running Java, the crashes LinkedIn has seen are few and far between. This one is a very rare crash itself, happening only once per day across ~80 Linux hosts that run the service.

In this series of blog posts, we will walk through the process we used to investigate the root cause of this JVM crash, armed only with the HotSpot error log file and the crash dump provided by LinkedIn. This series will be broken down as follows:

Analyzing the Log

The first thing we like to do when we receive a crash report is to look at the HotSpot error log file. In this case, the header at the top of the file includes the following:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ffb860d705f, pid=19538, tid=7764
#
# JRE version: OpenJDK Runtime Environment Microsoft (11.0.8+10) (build 11.0.8+10-20200812)
# Java VM: OpenJDK 64-Bit Server VM Microsoft (11.0.8+10-20200812, mixed mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# J 52708 c2 sun.security.jgss.krb5.Krb5Context.initSecContext(Ljava/io/InputStream;I)[B java.security.jgss@11.0.8 (698 bytes) @ 0x00007ffb860d705f [0x00007ffb860d47c0+0x000000000000289f]

This tells us that we’re dealing with a SIGSEGV, a Linux Segmentation Fault (https://komodor.com/learn/sigsegv-segmentation-faults-signal-11-exit-code-139/), which is signal 11 (0xb). This type of fault occurs when the OS detects that a process is trying to access memory that it is not allowed to. Often this means that there’s a bad pointer floating around that is either being used for reading or writing a memory address that is not mapped to the process. Alternatively, it could mean that the pointer is valid, but the process is attempting to read more memory starting at that address than it is allowed to.

The crash is happening in the following method frame:

# J 52708 c2 sun.security.jgss.krb5.Krb5Context.initSecContext(Ljava/io/InputStream;I)[B java.security.jgss@11.0.8 (698 bytes) @ 0x00007ffb860d705f [0x00007ffb860d47c0+0x000000000000289f]

The thread that crashed is shown in the “T H R E A D” section. Here’s an excerpt from the top of that section:

Current thread (0x00007ff7fbaa0000):  JavaThread "DataStreamer for file <file omitted>" daemon [_thread_in_Java, id=7764, stack(0x00007ff75185a000,0x00007ff75195b000)]

Stack: [0x00007ff75185a000,0x00007ff75195b000],  sp=0x00007ff751956420,  free space=1009k

Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)

J 52708 c2 sun.security.jgss.krb5.Krb5Context.initSecContext(Ljava/io/InputStream;I)[B java.security.jgss@11.0.8 (698 bytes) @ 0x00007ffb860d705f [0x00007ffb860d47c0+0x000000000000289f]

J 51577 c2 sun.security.jgss.GSSContextImpl.initSecContext(Ljava/io/InputStream;Ljava/io/OutputStream;)I java.security.jgss@11.0.8 (515 bytes) @ 0x00007ffb85f89a80 [0x00007ffb85f861e0+0x00000000000038a0]

J 68479 c2 com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge([B)[B jdk.security.jgss@11.0.8 (117 bytes) @ 0x00007ffb84b7c890 [0x00007ffb84b7c5a0+0x00000000000002f0]

…

We can see the problematic method at the top of that call stack. Note that it is prefixed with “J“, so it appears that this method consists of compiled Java code, not interpreted code.

One thing we sometimes check when a crash like this happens is to see if there was a recent compilation of the method, as compilation is triggered by profiling counters which can sometimes be buggy. To do this, we look at the “Compilation events” section:

Compilation events (20 events):

Event: 966063.635 Thread 0x00007ffb948bf800 92423       4       com.mysql.jdbc.ConnectionPropertiesImpl::getUseTimezone (8 bytes)

Event: 966063.636 Thread 0x00007ffb948bf800 nmethod 92423 0x00007ffb83fe6490 code [0x00007ffb83fe6620, 0x00007ffb83fe66d8]

Event: 966102.461 Thread 0x00007ffb948bf800 92424       4       org.apache.kafka.common.Node::<init> (34 bytes)

…

Event: 966301.993 Thread 0x00007ffb948bf800 nmethod 92432 0x00007ffb848b8a10 code [0x00007ffb848b8be0, 0x00007ffb848b9418]

Nothing in that list of compilation events seems relevant here. Next, we check to see if there were any recent deoptimization events affecting this method, as it may indicate some interesting and relevant activity. For example, another crash we investigated recently was due to the fact that profiling counters were not being reset after a deoptimization event. We look in the “Deoptimization events” section:

Deoptimization events (20 events):

Event: 965907.465 Thread 0x00007ff8abe6d800 Uncommon trap: trap_request=0xffffff14 fr.pc=0x00007ffb84512b48 relative=0x00000000000006c8

Event: 965907.465 Thread 0x00007ff8abe6d800 Uncommon trap: reason=tenured action=make_not_entrant pc=0x00007ffb84512b48 method=sun.util.locale.provider.CalendarDataUtility$CalendarWeekParameterGetter.getObject(Ljava/util/spi/CalendarDataProvider;Ljava/util/Locale;Ljava/lang/String;[Ljava/lang/Objec

Event: 965907.465 Thread 0x00007ff8abe6d800 DEOPT PACKING pc=0x00007ffb84512b48 sp=0x00007ff8899f2cc0

…

Event: 966291.973 Thread 0x00007ff8b492f000 DEOPT UNPACKING pc=0x00007ffb7c28f425 sp=0x00007ff86102c878 mode 2

Again, nothing there seems relevant.

Our next step is to investigate the memory address that the application is trying to access. In the T H R E A D section we’re provided with siginfo, which gives us more information on the signal that was triggered:

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007ffb7a000000

As we’ve seen, the type of signal was SIGSEGV. The actual type of error was a SEGV_MAPERR, which occurs when a process tries to access an address that’s not mapped into the address space of the process (see https://stackoverflow.com/questions/1000002/what-is-segv-maperr). Which address are we talking about? The one indicated by si_addr, which is 0x00007ffb7a000000.

Let’s see if we can find out any more information about that address from the log file. We do a search and find it in several places! Our first hit, after the siginfo, is here:

Heap:

 garbage-first heap   total 10485760K, used 5056421K [0x00007ff8fa000000, 0x00007ffb7a000000)

  region size 8192K, 60 young (491520K), 21 survivors (172032K)

 Metaspace       used 161880K, capacity 163125K, committed 164004K, reserved 165888K

OK, this seems interesting! It looks like we’re accessing the byte *right after the heap ends* – note that the range is [0x00007ff8fa000000, 0x00007ffb7a000000), i.e., inclusive on the lower end and exclusive on the higher end. In the log file, you can inspect the various Heap Regions this application is using:

Heap Regions: E=young(eden), S=young(survivor), O=old, HS=humongous(starts), HC=humongous(continues), CS=collection set, F=free, A=archive, TAMS=top-at-mark-start (previous, next)

|   0|0x00007ff8fa000000, 0x00007ff8fa800000, 0x00007ff8fa800000|100%|HS|  |TAMS 0x00007ff8fa800000, 0x00007ff8fa000000| Complete

|   1|0x00007ff8fa800000, 0x00007ff8fb000000, 0x00007ff8fb000000|100%|HC|  |TAMS 0x00007ff8fb000000, 0x00007ff8fa800000| Complete

…

|1278|0x00007ffb79000000, 0x00007ffb79800000, 0x00007ffb79800000|100%| E|CS|TAMS 0x00007ffb79000000, 0x00007ffb79000000| Complete

|1279|0x00007ffb79800000, 0x00007ffb7a000000, 0x00007ffb7a000000|100%| E|CS|TAMS 0x00007ffb79800000, 0x00007ffb79800000| Complete

You can see here that the bad address is just off the end of the last Heap Region, and thus off the end of the heap altogether.

We also see this address referenced several times in various GC Heap History events, e.g.:

GC Heap History (20 events):

Event: 966300.329 GC heap before

{Heap before GC invocations=894510 (full 0):

 garbage-first heap   total 10485760K, used 8059819K [0x00007ff8fa000000, 0x00007ffb7a000000)

  region size 8192K, 426 young (3489792K), 14 survivors (114688K)

 Metaspace       used 161880K, capacity 163125K, committed 164004K, reserved 165888K

}

All right, so it seems like we have a handle on where the bad memory access is. But why is this happening? This is starting to smell like a compiler bug, because it’s the compiler’s responsibility to make sure all memory accesses are legal. The Java programmers who wrote this application could not have made this mistake – they’re protected from having to worry about native memory management.

Our next step is to inspect the JIT-compiled code and see if we can figure out what’s going on. The log file tells us that the crash happens at the instruction with a program counter of 0x00007ffb860d705f. Here are the instructions around that memory location:

Instructions: (pc=0x00007ffb860d705f)

0x00007ffb860d6f5f:   24 a0 00 00 00 49 8b 4a 18 48 85 c9 75 21 be f6
0x00007ffb860d6f6f:   ff ff ff 66 90 c5 f8 77 e8 04 84 1b f6 48 8b 8c
0x00007ffb860d6f7f:   24 b8 00 00 00 eb 08 48 8b 8c 24 b8 00 00 00 8b
0x00007ffb860d6f8f:   59 10 85 db 0f 86 32 01 00 00 8b d3 ff ca 3b d3
0x00007ffb860d6f9f:   0f 83 43 2f 00 00 48 8b 7d 18 44 8b 57 10 45 85
0x00007ffb860d6faf:   d2 0f 86 32 2f 00 00 41 3b d2 0f 83 29 2f 00 00
0x00007ffb860d6fbf:   41 ba 08 00 00 00 41 3b da 44 8b c3 45 0f 4f c2
0x00007ffb860d6fcf:   4c 63 8c 24 c8 00 00 00 46 0f be 5c 0f 18 46 0f
0x00007ffb860d6fdf:   be 54 09 18 45 33 da 46 88 5c 09 18 44 8b 94 24
0x00007ffb860d6fef:   c8 00 00 00 41 ff c2 44 89 94 24 c8 00 00 00 45
0x00007ffb860d6fff:   3b d0 7c cc 44 8b c3 41 83 c0 f9 41 3b d0 41 bb
0x00007ffb860d700f:   00 00 00 80 45 0f 4c c3 45 3b d0 7d 79 45 8b da
0x00007ffb860d701f:   44 89 9c 24 c8 00 00 00 41 ba 40 1f 00 00 45 8b
0x00007ffb860d702f:   d8 44 2b 9c 24 c8 00 00 00 45 3b da 45 0f 4f da
0x00007ffb860d703f:   44 03 9c 24 c8 00 00 00 66 0f 1f 84 00 00 00 00
0x00007ffb860d704f:   00 4c 63 8c 24 c8 00 00 00 c4 a1 7a 7e 44 09 18
0x00007ffb860d705f:   c4 a1 79 ef 44 0f 18 c4 a1 79 d6 44 09 18 44 8b
0x00007ffb860d706f:   8c 24 c8 00 00 00 41 83 c1 08 44 89 8c 24 c8 00
0x00007ffb860d707f:   00 00 45 3b cb 7c ca 4d 8b 9f 08 01 00 00 41 85
0x00007ffb860d708f:   03 45 3b c8 7c 98 44 8b 9c 24 c8 00 00 00 44 3b
0x00007ffb860d709f:   db 7d 29 44 8b 8c 24 c8 00 00 00 66 90 4d 63 c1
0x00007ffb860d70af:   46 0f be 54 07 18 46 0f be 5c 01 18 45 33 d3 46
0x00007ffb860d70bf:   88 54 01 18 41 ff c1 44 3b cb 7c e1 44 8b 55 10
0x00007ffb860d70cf:   4c 8b 84 24 a0 00 00 00 45 89 50 10 41 83 fa 03
0x00007ffb860d70df:   0f 84 43 3c 00 00 41 83 fa 01 0f 84 8d 3c 00 00
0x00007ffb860d70ef:   41 83 fa 10 0f 84 d7 3c 00 00 49 8b 87 18 01 00
0x00007ffb860d70ff:   00 4c 8b d0 49 83 c2 18 48 bd 60 79 70 72 f8 7f
0x00007ffb860d710f:   00 00 4d 3b 97 28 01 00 00 0f 83 96 1c 00 00 4d
0x00007ffb860d711f:   89 97 18 01 00 00 41 0f 0d 8a c0 00 00 00 4c 8b
0x00007ffb860d712f:   95 b8 00 00 00 4c 89 10 48 89 68 08 48 c7 40 10
0x00007ffb860d713f:   00 00 00 00 48 89 84 24 a8 00 00 00 48 8b 00 4c
0x00007ffb860d714f:   8b d0 49 83 e2 07 49 83 fa 05 0f 85 a2 28 00 00

Great! So, what do we do next? Well, if you can read hex assembly directly, you can probably take it from here yourself. If not, you’ll need to stick around for the next step – disassembly!

To make this hex dump of instructions more readable, we turned to the Online Disassembler, or ODA (https://onlinedisassembler.com/odaweb/). (Unfortunately, since the initial preparation of this blog post, ODA seems to have been removed. Hopefully it shows up in a different location soon. The instructions that follow refer to the original version of ODA.).

ODA is a very handy tool that allows you to paste in some hex instructions and see the disassembly. You can change the architecture, syntax style, endianness, etc.

Setting the architecture is very important – if that’s not set correctly you will very likely see incorrect disassembly. In this case, for a 64-bit JVM, we set the Arch to i386:x86-64:intel and leave the Syntax Style as DEFAULT (which is intel-mnemonic).

In the text box at the bottom left, we insert the instructions from 0x00007ffb860d705f:

c4 a1 79 ef 44 0f 18 c4 a1 79 d6 44 09 18 44 8b

In the disassembly pane, we see that this corresponds to the following:

c4a179ef440f18      vpxor xmm0, xmm0, XMMWORD PTR [rdi+r9*1+0x18]

All right, so it looks like we have identified the offending instruction!

What’s Next?

For fun, we’ll see if we can locate this instruction in the core dump in the next post! Stay tuned for Part 2 of this series.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1 comment

Leave a comment

  • Stanislav Vysotskii

    Thank you! What a great article! I have never went further than analyzing the JVM dump in Memory Analyzer (Eclipse based tool) or something alike.
    I will be waiting for the other parts! In the end it should be incredibly useful knowledge for solving complex problems.