Adventures in application compatibility: The case of the wild instruction pointer that, upon closer inspection, might not be so wild after all

Raymond

Application compatibility testing as well as Windows Insiders discovered that Windows began crashing randomly if you upgraded to a specific build and had a specific program installed. Uninstalling that program stopped the crashes.

The crash dumps were spread out over a large number of processes unrelated to the program, so it’s not that the program itself was crashing, but rather that the presence of the program was causing other programs to start crashing. If you looked at the crash dumps, you found that the instruction pointer was just hanging out in the middle of nowhere:

rax=00007ffc1f8d0dc0 rbx=0000000000000010 rcx=0000000e194fa970
rdx=0000000000000000 rsi=0000000e194fa728 rdi=0000000e194fa428
rip=00007ffd9d1c5f2c rsp=0000000e194fa3e8 rbp=0000000000000001
 r8=0000011c610f6a30  r9=0000000e194fa150 r10=0000000e194fa760
r11=0000000e194fa9ec r12=0000000000000000 r13=00000000ffffffff
r14=0000000000000000 r15=0000000e194fa650
iopl=0         nv up ei pl nz na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010204
00007ffd`9d1c5f2c ??              ???

There were some clues on the stack:

0:008> dps @rsp
0000000e`194fa3e8  00007ffc`9d1c6219 ntdll!DestroyWidget+0x9
0000000e`194fa3f0  0000007c`a92fb098
0000000e`194fa3f8  00000000`00000000
0000000e`194fa400  0000000e`194fa4c8
0000000e`194fa408  0000011c`6382b440
0000000e`194fa410  00000000`00000246
0000000e`194fa418  00007ffc`763e3573 contoso+0x23573
0000000e`194fa420  0000011c`6102f690
0000000e`194fa428  00000000`00000000
0000000e`194fa430  0000011c`6382b460
0000000e`194fa438  00000000`00000000
0000000e`194fa440  00000000`00000000
0000000e`194fa448  0000000e`194fa4c8
0000000e`194fa450  00000000`00000000
0000000e`194fa458  00000000`00000000
0000000e`194fa460  00000000`00000000

According to the stack, the jump-into-space came from ntdll!DestroyWidget+0x9, but if you look at the code in ntdll!DestroyWidget+0x9, there is no jump into space. It’s calling into another nearby function.

ntdll!DestroyWidget:
00007ffc`9d1c6210 4883ec28        sub     rsp,28h
00007ffc`9d1c6214 e813fdffff      call    ntdll!DestroyWidgetWorker (00007ffc`9d1c5f2c)
00007ffc`9d1c6219 85c0            test    eax,eax

Notice that the wild instruction pointer differs from the intended jump target by a single bit:

Intended 00007ffc`9d1c5f2c
Actual 00007ffd`9d1c5f2c

This is not a return address stored on the stack, so it’s not rogue memory corruption. The jump target is not stored on the stack at all; it’s encoded directly in the instruction stream. So we can rule out a use-after-free bug here.

Hey, it’s not much, but it’s good to be able to rule out stuff so you can focus on the stuff that is still in play.

Another thought is that this was caused by overclocking. However, the reports were coming from a large number of systems, and the crash was consistent, which is atypical of overclocking, since overclocking crashes tends to be random.

Could something in the code stream be triggering a CPU erratum that caused jump targets to be miscalculated? Perhaps, but the close correlation with a specific program being installed suggests that the problem is in the software, not the hardware.

Inspection of more crash dumps show that the error is not actually a single-bit error after all. It’s an “off by 4GB” error.

Intended 00007ffc`9d1c5f2c 00007ff9`33605f2c
Actual 00007ffd`9d1c5f2c 00007ffa`33605f2c
XOR 00000001`00000000 00000003`00000000
Difference 00000001`00000000 00000001`00000000

There are different levels of crash dumps. Some time ago, I mentioned the triage dump, which is an extremely lightweight dump file that captures only a little bit of stack information, just enough to generate a stack trace but not much else. The dumps we’ve been looking at here are “minidumps”, which contain more complete stack information. But now it’s time to bring out the big guns: The full process dump.

Full process dumps are very large, so Windows Error Reporting doesn’t capture them most of the time. But developers can specifically request that the next N crashes be captured as full process dumps, and Windows Error Reporting will oblige.

Opening a full process crash dump shows something very telling: The code at ntdll!DestroyWidget looks different:

0:008> u ntdll!DestroyWidget
ntdll!DestroyWidget:
00007ffc`9d1c6210 e96bab7082      jmp     00007ffc`1f8d0d80
00007ffc`9d1c6215 13fd            adc     edi,ebp
00007ffc`9d1c6217 ff              ???
00007ffc`9d1c6218 ff85c0740bb8    inc     dword ptr [rbp-47F48B40h]

The function has been detoured!

Okay, now we’re getting somewhere.

When the detour wants to call the original function, it needs to replicate the original instructions that were overwritten and then jump to the first non-overwritten instruction. This is made more complicated by the fact that the last overwritten instruction was a call instruction. The replicant is rather messy but it boils down to

    ; replicate the "sub rsp,28h"
    sub     rsp,28h

    ; replicate the "call ntdll!DestroyWidgetWorker"
    mov     rax,7FFD9D1C6219h
    push    rax             ; fake return address
    mov     rax,7FFC9D1C5F2Ch
    jmp     rax             ; jump to ntdll!DestroyWidgetWorker

To replicate the call instruction, the detour pushes a fake return address and then jumps to the start of the called function. This, of course, messes up the return address predictor since the call and ret instructions no longer balance. Sorry for your system performance, but hey, at least our program got its detour!¹

Upon looking at the replicated code, you may spot the error: They miscalculated the fake return address.

What happened is that their detour generator incorrectly decoded the call instruction and treated the 32-bit immediate as an unsigned 32-bit offset rather than a signed 32-bit offset. The call to Destroy­Widget­Worker has a negative offset:

00007ffc`9d1c6214 e813fdffff      call    ntdll!DestroyWidgetWorker (00007ffc`9d1c5f2c)
                    ^^^^^^^^                                         ^^^^^^^^^^^^^^^^^
         offset = 0xfffffd13                                 lower address than caller

Their instruction decoder zero-extended the offset to a 64-bit value, resulting in a miscalculated jump target that is 4GB too high:

  Correct Incorrect
Return address 00007ffc`9d1c6219 00007ffc`9d1c6219
Plus offset ffffffff`fffffd13 00000000`fffffd13
Equals target 00007ffc`9d1c5f2c 00007ffd`9d1c5f2c

My guess is that the instruction decoder was ported from a 32-bit decoder, and in 32-bit code, it doesn’t matter whether you treat the offset as signed or unsigned because the sum is truncated to a 32-bit value. But when doing 64-bit decoding, those upper 32 bits are important, and failing to extend negative values correctly results in an off-by-4GB calculation.

Even though this problem has always existed, it requires two triggers:

  • The detoured function must have a call instruction within the first 5 bytes.
  • The destination of the call must be at a lower address than the caller.

The program’s detour code was lucky, but recently its luck ran out.

We contacted the vendor, who released a patch. The crashes started to abate, but they don’t go away completely because not everybody is diligent about installing patches.

Bonus chatter: A reminder that Windows does not support detouring the operating system. This program has wandered into unsupported territory. Not that their customers will know or care.

¹ A version that preserves the return address predictor stack might go something like this:

    ; replicate the "sub rsp,28h"
    sub     rsp,28h

    ; replicate the "call ntdll!DestroyWidgetWorker"
    call    @F              ; push a slot onto the return address predictor
@@: mov     rax,7FFC9D1C6219h
    mov     [rsp], rax      ; change the return address to our fake one
    mov     rax,7FFC9D1C5F2Ch
    jmp     rax             ; jump to ntdll!DestroyWidgetWorker

The ret from Destroy­Widget­Worker will be mispredicted, but at least all the remaining return addresses will be predicted correctly.

10 comments

Comments are closed. Login to edit/delete your existing comments

  • Joshua Hudson

    That call fixup is doing it the hard way. When writing out the call in the run-tail, generate a call indirect [offset] instruction where offset refers to the run-tail’s data region (only a few bytes away) and contains the 8 byte absolute address. The same applies to any jmp instruction. If you encounter a conditional jump, you cannot patch that. I would also question whether such a function complies with the calling convention at all.

  • David Wolff

    Is the program that caused this in user space? Does this mean that anyone can deliberately write a program to make other programs crash?

    • Me Gusta

      Basically, it seems to do wide scale injection and redirection of a certain function. While Raymond didn’t give the actual name, it was a function in NTDLL.
      Anyway, the answer to this is yes for regular desktop applications. Especially if the program which causes the crash runs with admin privileges. All it has to do is use whatever means needed to inject a library and there you go. The humble LoadLibrary, CreateRemoteThread and VirtualAlloc are all you really need, besides the DLL of course.

    • aaaaaa 123456789

      The short version of the story is that, if your privileges are high enough, you can do (almost) anything. Writing to other process images is a well-documented privilege. (Consider the action of a debugger, for instance: it may need to do this to create breakpoints. That’s why SeDebugPrivilege is such a sensitive privilege to grant!)

    • MGetz

      If you ask a browser writers those programs are called 3rd party AV (keeping with blog rules no guilty parties will be named). Which used to do things like this all the time to them in an effort to make them more ‘secure’. In many cases this actually made them less secure than if the AV had just used the kernel mode hooks MS sets up for them. This also used to be common in enterprise as a way to ‘monitor’ use of some methods as a way of maintaining ‘corporate security’. Again… misguided as that’s a firewall problem.

      There is a reason that detours is officially not supported anymore, because there are better ways to do anything it was supposed to do. Most often with less likelihood of crashing too. On a modern x64 machine there are also a lot of things working against this: DEP, CFG, the new intel shadow stack etc. So even if you were able to get the hook in, there is a really good chance you’d crash because of one of those. Having reviewed the Chromium source code which dynamically creates CFG metadata for your javascript when it JITs it’s not trivial to work around this (also that’s some seriously interesting code).

  • Ataru Moroboshi

    Has anyone, by chance, know the name of the program? I have random crashes in unrelated apps few times a day. Thanks

    • Brian Boorman

      Raymond’s ground rules for this blog (see the link in the “Relevant Links” footer section below) prohibit naming or attempting to guess the name of programs that he doesn’t name.

    • Me Gusta

      You also must remember that this is not the only potential cause of unrelated application crashes.
      Subtly bad hardware, unstable overclocks, corrupted operating system files/settings and bad drivers amongst other things all have the ability to cause crashes.
      But anyway, to quote the last sentence before the bonus chatter section:
      “The crashes started to abate, but they don’t go away completely because not everybody is diligent about installing patches.”
      The best way to deal with this as a potential problem source for your system is to just check to see if any applications on your system are not at the latest version and then update. But beware, this may not be the cause of your problems and so updating any applications may not fix the problem.

  • Piotr Siódmak

    What tipped you off that Program1.dmp, Program2.dmp and Program3.dmp submitted to Windows Error Reporting were related to each other? Was the fact that the instruction pointer pointed to empty memory the only thing? I would guess this would be classified as “calling a DLL function after unloading the DLL” and discarded as a common programming error and never forwarded from the automatic triager to a human.

    • Me Gusta

      In this case, the call is a relative call, and it is a call backwards. The call instruction has an immediate version, where you give a signed 16 or 32 bit value. If you look at the instruction encoding, it is 0xe8 (op code for call) 0x13 0xfd 0xff 0xff. This is the call instruction followed by a 32 bit immediate in little endian format. This is an offset of 0xfffffd13, and since this is a signed value then the highest bit being set means that this is negative.
      If you use twos compliment to negate this, you will end up with a value of 0x2ed. The instruction being at 0x7ffc9d1c6214 and 5 bytes long means that this call would have called the address 0x7ffc9d1c5f2c. However, the RIP had the address 0x7ffd9d1c5f2c, this means that there was no way the call could have transferred control to this location.
      So the reason why this was not seen as a call after free is that the call could not do what happened and shenanigans were happening.