{"id":105704,"date":"2021-09-17T07:00:00","date_gmt":"2021-09-17T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105704"},"modified":"2021-09-17T05:32:30","modified_gmt":"2021-09-17T12:32:30","slug":"20210917-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210917-00\/?p=105704","title":{"rendered":"Adventures in application compatibility: The case of the wild instruction pointer that, upon closer inspection, might not be so wild after all"},"content":{"rendered":"<p>Application compatibility testing as well as Windows Insiders discovered that Windows began crashing randomly if you upgraded to a specific build and had a specific program installed. Uninstalling that program stopped the crashes.<\/p>\n<p>The crash dumps were spread out over a large number of processes unrelated to the program, so it&#8217;s not that the program itself was crashing, but rather that the presence of the program was causing <i>other programs<\/i> to start crashing. If you looked at the crash dumps, you found that the instruction pointer was just hanging out in the middle of nowhere:<\/p>\n<pre>rax=00007ffc1f8d0dc0 rbx=0000000000000010 rcx=0000000e194fa970\r\nrdx=0000000000000000 rsi=0000000e194fa728 rdi=0000000e194fa428\r\nrip=00007ffd9d1c5f2c rsp=0000000e194fa3e8 rbp=0000000000000001\r\n r8=0000011c610f6a30  r9=0000000e194fa150 r10=0000000e194fa760\r\nr11=0000000e194fa9ec r12=0000000000000000 r13=00000000ffffffff\r\nr14=0000000000000000 r15=0000000e194fa650\r\niopl=0         nv up ei pl nz na po nc\r\ncs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010204\r\n00007ffd`9d1c5f2c ??              
???\r\n<\/pre>\n<p>There were some clues on the stack:<\/p>\n<pre>0:008&gt; dps @rsp\r\n0000000e`194fa3e8  00007ffc`9d1c6219 ntdll!DestroyWidget+0x9\r\n0000000e`194fa3f0  0000007c`a92fb098\r\n0000000e`194fa3f8  00000000`00000000\r\n0000000e`194fa400  0000000e`194fa4c8\r\n0000000e`194fa408  0000011c`6382b440\r\n0000000e`194fa410  00000000`00000246\r\n0000000e`194fa418  00007ffc`763e3573 contoso+0x23573\r\n0000000e`194fa420  0000011c`6102f690\r\n0000000e`194fa428  00000000`00000000\r\n0000000e`194fa430  0000011c`6382b460\r\n0000000e`194fa438  00000000`00000000\r\n0000000e`194fa440  00000000`00000000\r\n0000000e`194fa448  0000000e`194fa4c8\r\n0000000e`194fa450  00000000`00000000\r\n0000000e`194fa458  00000000`00000000\r\n0000000e`194fa460  00000000`00000000\r\n<\/pre>\n<p>According to the stack, the jump-into-space came from <code>ntdll!DestroyWidget+0x9<\/code>, but if you look at the code in <code>ntdll!DestroyWidget+0x9<\/code>, there is no jump into space. It&#8217;s calling into another nearby function.<\/p>\n<pre>ntdll!DestroyWidget:\r\n00007ffc`9d1c6210 4883ec28        sub     rsp,28h\r\n00007ffc`9d1c6214 e813fdffff      call    ntdll!DestroyWidgetWorker (00007ffc`9d1c5f2c)\r\n00007ffc`9d1c6219 85c0            test    eax,eax\r\n<\/pre>\n<p>Notice that the wild instruction pointer differs from the intended jump target by a single bit:<\/p>\n<table class=\"cp1\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Intended<\/th>\n<td><code>00007ff<u>c<\/u>`9d1c5f2c<\/code><\/td>\n<\/tr>\n<tr>\n<th>Actual<\/th>\n<td><code>00007ff<u>d<\/u>`9d1c5f2c<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This is not a return address stored on the stack, so it&#8217;s not rogue memory corruption. The jump target is not stored on the stack at all; it&#8217;s encoded directly in the instruction stream. 
So we can rule out a use-after-free bug here.<\/p>\n<p>Hey, it&#8217;s not much, but it&#8217;s good to be able to rule out stuff so you can focus on the stuff that is still in play.<\/p>\n<p>Another thought is that this was <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20050412-47\/?p=35923\"> caused by overclocking<\/a>. However, the reports were coming from a large number of systems, and the crash was consistent, which is atypical of overclocking, since overclocking crashes tend to be random.<\/p>\n<p>Could something in the code stream be triggering a CPU erratum that caused jump targets to be miscalculated? Perhaps, but the close correlation with a specific program being installed suggests that the problem is in the software, not the hardware.<\/p>\n<p>Inspection of more crash dumps shows that the error is not actually a single-bit error after all. It&#8217;s an &#8220;off by 4GB&#8221; error.<\/p>\n<table class=\"cp1\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>Intended<\/th>\n<td><code>00007ff<u>c<\/u>`9d1c5f2c<\/code><\/td>\n<td><code>00007ff<u>9<\/u>`33605f2c<\/code><\/td>\n<\/tr>\n<tr>\n<th>Actual<\/th>\n<td><code>00007ff<u>d<\/u>`9d1c5f2c<\/code><\/td>\n<td><code>00007ff<u>a<\/u>`33605f2c<\/code><\/td>\n<\/tr>\n<tr>\n<th>XOR<\/th>\n<td><code>00000001`00000000<\/code><\/td>\n<td><code>00000003`00000000<\/code><\/td>\n<\/tr>\n<tr>\n<th>Difference<\/th>\n<td><code>00000001`00000000<\/code><\/td>\n<td><code>00000001`00000000<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>There are different levels of crash dumps. Some time ago, I mentioned the <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20161104-00\/?p=94645\"> triage dump<\/a>, which is an extremely lightweight dump file that captures only a little bit of stack information, just enough to generate a stack trace but not much else. 
The dumps we&#8217;ve been looking at here are &#8220;minidumps&#8221;, which contain more complete stack information. But now it&#8217;s time to bring out the big guns: The full process dump.<\/p>\n<p>Full process dumps are very large, so Windows Error Reporting doesn&#8217;t capture them most of the time. But developers can specifically request that the next <var>N<\/var> crashes be captured as full process dumps, and Windows Error Reporting will oblige.<\/p>\n<p>Opening a full process crash dump shows something very telling: The code at <code>ntdll!DestroyWidget<\/code> looks different:<\/p>\n<pre>0:008&gt; u ntdll!DestroyWidget\r\nntdll!DestroyWidget:\r\n00007ffc`9d1c6210 e96bab7082      jmp     00007ffc`1f8d0d80\r\n00007ffc`9d1c6215 13fd            adc     edi,ebp\r\n00007ffc`9d1c6217 ff              ???\r\n00007ffc`9d1c6218 ff85c0740bb8    inc     dword ptr [rbp-47F48B40h]\r\n<\/pre>\n<p>The function has been detoured!<\/p>\n<p>Okay, now we&#8217;re getting somewhere.<\/p>\n<p>When the detour wants to call the original function, it needs to replicate the original instructions that were overwritten and then jump to the first non-overwritten instruction. This is made more complicated by the fact that the last overwritten instruction was a <code>call<\/code> instruction. The replicant is rather messy but it boils down to<\/p>\n<pre>    ; replicate the \"sub rsp,28h\"\r\n    sub     rsp,28h\r\n\r\n    ; replicate the \"call ntdll!DestroyWidgetWorker\"\r\n    mov     rax,7FFC9D1C6219h\r\n    push    rax             ; fake return address\r\n    mov     rax,7FFD9D1C5F2Ch\r\n    jmp     rax             ; jump to ntdll!DestroyWidgetWorker\r\n<\/pre>\n<p>To replicate the call instruction, the detour pushes a fake return address and then jumps to the start of the called function. 
This, of course, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20041216-00\/?p=36973\"> messes up the return address predictor<\/a> since the <code>call<\/code> and <code>ret<\/code> instructions no longer balance. Sorry for your system performance, but hey, at least our program got its detour!\u00b9<\/p>\n<p>Upon looking at the replicated code, you may spot the error: They miscalculated the jump target.<\/p>\n<p>What happened is that their detour generator incorrectly decoded the <code>call<\/code> instruction and treated the 32-bit immediate as an <i>unsigned<\/i> 32-bit offset rather than a <i>signed<\/i> 32-bit offset. The call to <code>Destroy\u00adWidget\u00adWorker<\/code> has a negative offset:<\/p>\n<pre>00007ffc`9d1c6214 e813fdffff      call    ntdll!DestroyWidgetWorker (00007ffc`9d1c5f2c)\r\n                    ^^^^^^^^                                         ^^^^^^^^^^^^^^^^^\r\n         offset = 0xfffffd13                                 lower address than caller\r\n<\/pre>\n<p>Their instruction decoder zero-extended the offset to a 64-bit value, resulting in a miscalculated jump target that is 4GB too high:<\/p>\n<table class=\"cp1\" style=\"border-collapse: collapse;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<th>\u00a0<\/th>\n<th>Correct<\/th>\n<th>Incorrect<\/th>\n<\/tr>\n<tr>\n<th>Return address<\/th>\n<td><code>00007ffc`9d1c6219<\/code><\/td>\n<td><code>00007ffc`9d1c6219<\/code><\/td>\n<\/tr>\n<tr>\n<th>Plus offset<\/th>\n<td><code>ffffffff`fffffd13<\/code><\/td>\n<td><code>00000000`fffffd13<\/code><\/td>\n<\/tr>\n<tr>\n<th>Equals target<\/th>\n<td><code>00007ffc`9d1c5f2c<\/code><\/td>\n<td><code>00007ffd`9d1c5f2c<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>My guess is that the instruction decoder was ported from a 32-bit decoder, and in 32-bit code, it doesn&#8217;t matter whether you treat the offset as signed or unsigned because the sum is truncated to a 32-bit value. 
But when doing 64-bit decoding, those upper 32 bits are important, and failing to extend negative values correctly results in an off-by-4GB calculation.<\/p>\n<p>Even though this problem has always existed, it requires two triggers:<\/p>\n<ul>\n<li>The detoured function must have a <code>call<\/code> instruction within the first 5 bytes.<\/li>\n<li>The destination of the <code>call<\/code> must be at a lower address than the caller.<\/li>\n<\/ul>\n<p>The program&#8217;s detour code was lucky, but recently its luck ran out.<\/p>\n<p>We contacted the vendor, who released a patch. The crashes started to abate, but they don&#8217;t go away completely because not everybody is diligent about installing patches.<\/p>\n<p><b>Bonus chatter<\/b>: A reminder that Windows does not support detouring the operating system. This program has wandered into unsupported territory. Not that their customers will know or care.<\/p>\n<p>\u00b9 A version that preserves the return address predictor stack might go something like this:<\/p>\n<pre>    ; replicate the \"sub rsp,28h\"\r\n    sub     rsp,28h\r\n\r\n    ; replicate the \"call ntdll!DestroyWidgetWorker\"\r\n    <span style=\"color: blue;\">call    @F              ; push a slot onto the return address predictor\r\n@@: mov     rax,7FFC9D1C6219h\r\n    mov     [rsp], rax      ; change the return address to our fake one<\/span>\r\n    mov     rax,7FFC9D1C5F2Ch\r\n    jmp     rax             ; jump to ntdll!DestroyWidgetWorker\r\n<\/pre>\n<p>The <code>ret<\/code> from <code>Destroy\u00adWidget\u00adWorker<\/code> will be mispredicted, but at least all the remaining return addresses will be predicted correctly.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The search for clues leads to an unexpected 
place.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-105704","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>The search for clues leads to an unexpected place.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105704","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105704"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105704\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105704"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105704"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105704"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}