Windows reliability telemetry reported that there were a large number of crashes in the Display control panel. Since these crashes are coming via telemetry and the Windows Error Reporting service, there is no information about what steps are required to reproduce the problem. All we have are crash dumps.
The crash was due to the instruction pointer being in the middle of nowhere. For example, in one dump, it was at address ffffffff`924bbde0
. Close study shows that this value is suspiciously similar to 00007fff`924bbde0
, which is the address of ntdll!ButtonWndProc_A
. This tells me that somebody subclassed a button, and then tried to restore the original window procedure, but they messed up and truncated the 64-bit pointer value to a 32-bit signed integer. Bonus insult: Their button is ANSI, not Unicode. It’s (checks watch) 2023, get with the program. Not everybody who uses a computer speaks English.
To debug this problem, I had to do some triangulation of the crash dumps to look for a third party component that was common to all (or at least most) of the crashes. Since this was a Display control panel, I focused on the video card information, since video card drivers can provide a custom Display control panel plug-in to show off their driver-specific features.
And I found it.
The custom property sheet that comes with one particular video card has a bug in its WM_
handler: It casts a WNDPROC
to a 32-bit value, causing the upper 32 bits to be lost.
Here is the reverse-engineered dialog procedure:
INT_PTR CALLBACK DialogProc( HWND hdlg, UINT uMsg, WPARAM wParam, LPARAM lParam) { if (uMsg == WM_INITDIALOG) { SetWindowLongPtr(hdlg, GWLP_USERDATA, ((PROPSHEETPAGE*)lParam)->lParam); } MyClass* self = (MyClass*)GetWindowLongPtr( hdlg, GWLP_USERDATA); return self ? self->RealDialogProc(hdlg, uMsg, wParam, lParam) : FALSE; }
And here is the “real” DLGPROC
:
INT_PTR CALLBACK MyClass:RealDialogProc( HWND hdlg, UINT uMsg, WPARAM wParam, LPARAM lParam) { switch (uMsg) { ... case WM_DESTROY: SetWindowLongPtr(GetDlgItem(m_dlg, IDC_SOME_BUTTON), GWLP_WNDPROC, (LONG)g_originalWndProc); ... } }
I suspect this code was originally written as 32-bit code, and the line was
SetWindowLong(GetDlgItem(m_dlg, IDC_SOME_BUTTON), GWL_WNDPROC, (LONG)g_originalWndProc);
When porting to 64-bit, the SetÂWindowÂLong
becomes SetÂWindowÂLongÂPtr
to expand the value to 64 bits, and the name of the index changes from GWL_
to GWLP_
, with the extra P emphasizing that the value should be passed to Get
/SetÂWindowÂLongÂPtr
.
But they forgot to upgrade their cast from (LONG)
to (LONG_PTR)
, so they were accidentally truncating their 64-bit value to a sign-extended 32-bit value as part of the restoration.
I went for style points and came up with a one-byte patch to fix the bug.
// rbx = (LONG)g_originalWndProc 48631d33540300 movsxd rbx,dword ptr [contoso+0x39c50]
Patch the second byte from 63
to 8b
:
// rbx = (LONG_PTR)g_originalWndProc 488b1d33540300 mov rbx,qword ptr [contoso+0x39c50]
It turns out that all the machines that are hitting this bug are running drivers that are over ten years old. The current drivers don’t have this bug. In fact, the current drivers don’t even have the custom control panel extension! In the time since I originally did this investigation, it appears that people finally got their act together and upgraded their video drivers, because there has been only one recorded occurrence of this crash worldwide in the past 30 days.
Bonus reading: The difference between a junior and senior position at a video card company.
"ntdll!ButtonWndProc_A ... Bonus insult: Their button is ANSI, not Unicode. It’s (checks watch) 2023, get with the program. Not everybody who uses a computer speaks English."
Hmmm... https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page#activeCodePage but advertises that functions like ntdll!ButtonWndProc_A support Unicode!
Now switch from ntdll to NLS, the "(Inter)National Language Support", and query its GetLocaleInfoA function for LOCALE_SNATIVEDIGITS of MAKELCID(MAKELANGID(LANG_ARABIC, SUBLANG_DEFAULT), SORT_DEFAULT), which according to the documentation of LOCALE_SNATIVE* should return the Arabic numerals ٠١٢٣٤٥ ٦٧٨٩, or for LOCALE_SNATIVELANGUAGE of MAKELCID(MAKELANGID(LANG_ARMENIAN, SUBLANG_ARMENIAN_ARMENIA),...
35+ years ago long/LONG was defined as int32_t (just like DWORD was defined as uint32_t), everything since then (i.e. choosing LLP64 instead of LP64) is just consequence of this. (Maybe type LPARAM=intptr_t could have been used instead (e.g. Get-/SetWindowLparam), but that didn’t happen.))
Did your psychic debugging powers lead you to look at the address of the ntdll functions? That would have taken me approximately infinity -1 years to think of.
A negative pointer in 64 bit address space makes the mistake obvious. If you’re going to do something that can succeed from there, you’re going to overlay on the low 32 bits of the pointer until you find a match somewhere in the dump.
> It’s (checks watch) 2023, get with the program. Not everybody who uses a computer speaks English.
Coming from the platform that still fails to support proper UTF-8 in its APIs (https://github.com/microsoft/terminal/issues/7777). It’s 2023, indeed. But you’re blaming the wrong party here.
If only Microsoft had also ignored localization to non English for many many years instead of offering it years before utf8 came around.
The platform has properly supported UTF-16 and therefore Unicode for decades. One might think a Control Panel applet would make use of that.
UTF-16 is legacy trash but yeah I too would expect a control panel applet to use it. Among other reasons, you can’t manifest that because it runs in some other process.
They did fix that. You have to add a manifest to say you have a new enough C library to not explode. The older Microsoft C libraries had problems.
Get your manifest fragments here: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
Fortunately, nobody needs a (Microsoft) C library on Windows: the Win32 API provides all interfaces with its -W functions.
The -A functions are the legacy trash you wrote about; manifested to abuse UTF-8 they become outright dangerous or just suck — BIG TIME: see FullDisclosure
They did not fix that entirely yet, as you can read with the linked issue above. ReadConsoleA + manifest still does not work.
> which is the address of ntdll!ButtonWndProc_A
Huh, I’d have expected it to be in user32.dll, not ntdll.dll.
NTDLL.dll also exports NtdllDefWindowProc_A, NtdllDefWindowProc_W, NtdllDialogWndProc_A and NtdllDialogWndProc_W
Now guess where USER32.dll forwards its DefDlgProcA, DefDlgProcW, DefWindowProcA and DefWindowProcW to.
I remember at my first job out of college I had to do work on an extension to a well known piece of publishing software that has largely fallen out of favor that we sold. This was about the time of the x64 transition and I had just finished up the 64bit port of an extension we had to office, a job made all the more fun because someone was doing pinvokes directly to mangled...
Win10, since long time (and maybe a bit related to this topic):
Virtual Desktop going from say Desktop 1 to Desktop 2, crashes a heck of a lot. I mean, a lot.
I only notice when VS 2019 is running in D1. Switch to D2 and things just might jitter for a bit (four display monitors, Nvidia), then blooooop, a crashdump gets made and sent. I would include the name but I delete them...