{"id":111066,"date":"2025-04-11T07:00:00","date_gmt":"2025-04-11T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=111066"},"modified":"2025-04-11T08:57:43","modified_gmt":"2025-04-11T15:57:43","slug":"20250411-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20250411-00\/?p=111066","title":{"rendered":"The case of the UI thread that hung in a kernel call"},"content":{"rendered":"<p>A customer asked for help with a longstanding but low-frequency hang that they have never been able to figure out. From what they could tell, their UI thread was calling into the kernel, and the call simply hung for no apparent reason. Unfortunately, the kernel dump couldn&#8217;t show a stack from user mode because the stack had been paged out. (Which makes sense, because a hung thread isn&#8217;t using its stack, so once the system is under some memory pressure, that stack gets paged out.)<\/p>\n<pre>0: kd&gt; !thread 0xffffd18b976ec080 7\r\nTHREAD ffffd18b976ec080  Cid 79a0.7f18  Teb: 0000003d7ca28000\r\n    Win32Thread: ffffd18b89a8f170 WAIT: (Suspended) KernelMode Non-Alertable\r\nSuspendCount 1\r\n    ffffd18b976ec360  NotificationEvent\r\nNot impersonating\r\nDeviceMap                 ffffad897944d640\r\nOwning Process            ffffd18bcf9ec080       Image:         contoso.exe\r\nAttached Process          N\/A            Image:         N\/A\r\nWait Start TickCount      14112735       Ticks: 1235580 (0:05:21:45.937)\r\nContext Switch Count      1442664        IdealProcessor: 2             \r\nUserTime                  00:02:46.015\r\nKernelTime                00:01:11.515\r\n\r\n nt!KiSwapContext+0x76\r\n nt!KiSwapThread+0x928\r\n nt!KiCommitThreadWait+0x370\r\n nt!KeWaitForSingleObject+0x7a4\r\n nt!KiSchedulerApc+0xec\r\n nt!KiDeliverApc+0x5f9\r\n nt!KiCheckForKernelApcDelivery+0x34\r\n nt!MiUnlockAndDereferenceVad+0x8d\r\n nt!MmProtectVirtualMemory+0x312\r\n nt!NtProtectVirtualMemory+0x1d9\r\n 
nt!KiSystemServiceCopyEnd+0x25 (TrapFrame @ ffff8707`a9bef3a0)\r\n ntdll!ZwProtectVirtualMemory+0x14\r\n [end of stack trace]\r\n<\/pre>\n<p>Although we couldn&#8217;t see what the code was doing in user mode, there was something unusual in the information that was present.<\/p>\n<p>Observe that the offending thread is <i>Suspended<\/i>. And it appears to have been suspended for over five hours.<\/p>\n<pre>THREAD ffffd18b976ec080  Cid 79a0.7f18  Teb: 0000003d7ca28000\r\n    Win32Thread: ffffd18b89a8f170 WAIT: (<span style=\"border: solid 1px currentcolor;\">Suspended<\/span>) KernelMode Non-Alertable\r\nSuspendCount 1\r\n    ffffd18b976ec360  NotificationEvent\r\nNot impersonating\r\nDeviceMap                 ffffad897944d640\r\nOwning Process            ffffd18bcf9ec080       Image:         contoso.exe\r\nAttached Process          N\/A            Image:         N\/A\r\nWait Start TickCount      14112735       Ticks: 1235580 (<span style=\"border: solid 1px currentcolor;\">0:05:21:45.937<\/span>)\r\n<\/pre>\n<p>Naturally, a suspended UI thread is going to manifest itself as a hang.<\/p>\n<p>Functions like <code>Suspend\u00adThread<\/code> exist primarily for debuggers to use, so we asked them if they had a debugger attached to the process when they captured the kernel dump. They said that they did not.<\/p>\n<p>So who suspended the thread, and why?<\/p>\n<p>The customer then realized that they had a watchdog thread which monitors the UI thread for responsiveness, and every so often, it suspends the UI thread, captures a stack trace, and then resumes the UI thread. And in the dump file, they were able to observe their watchdog thread in the middle of its stack trace capturing code. 
But why was the stack trace capture taking five hours?<\/p>\n<p>The stack of the watchdog thread looks like this:<\/p>\n<pre>ntdll!ZwWaitForAlertByThreadId(void)+0x14\r\nntdll!RtlpAcquireSRWLockSharedContended+0x15a\r\nntdll!RtlpxLookupFunctionTable+0x180\r\nntdll!RtlLookupFunctionEntry+0x4d\r\ncontoso!GetStackTrace+0x72\r\ncontoso!GetStackTraceOfUIThread+0x127\r\n...\r\n<\/pre>\n<p>Okay, so we see that the watchdog thread is trying to get a stack trace of the UI thread, but it&#8217;s hung inside <code>Rtl\u00adLookup\u00adFunction\u00adEntry<\/code> which is waiting for a lock.<\/p>\n<p>You know who I bet holds the lock?<\/p>\n<p>The UI thread.<\/p>\n<p>Which is suspended.<\/p>\n<p>The UI thread is probably trying to dispatch an exception, which means that it is walking the stack looking for an exception handler. But in the middle of this search, it got suspended by the watchdog thread. Then the watchdog thread tries to walk the stack of the UI thread, but it can&#8217;t do that yet because the function table is locked by the UI thread&#8217;s stack walk.<\/p>\n<p>This is a practical exam for a previous discussion: <a title=\"Why you should never suspend a thread\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20031209-00\/?p=41573\"> Why you should never suspend a thread<\/a>.<\/p>\n<p>Specifically, the title should say &#8220;Why you should never suspend a thread <i>in your own process<\/i>.&#8221; Suspending a thread in your own process runs the risk that the thread you suspended was in possession of some resource that the rest of the program needs. In particular, it might possess a resource that is needed by the code which is responsible for eventually resuming the thread. 
Since it is suspended, it will never get a chance to release those resources, and you end up with a deadlock between the suspended thread and the thread whose job it is to resume that thread.<\/p>\n<p>If you want to suspend a thread and capture stacks from it, you&#8217;ll have to do it from another process, so that you don&#8217;t deadlock with the thread you suspended.\u00b9<\/p>\n<p><b>Bonus chatter<\/b>: In this kernel stack, you can see evidence that <a title=\"The SuspendThread function suspends a thread, but it does so asynchronously\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20150205-00\/?p=44743\"> the <code>Suspend\u00adThread<\/code> function operates asynchronously<\/a>. When the watchdog thread called <code>Suspend\u00adThread<\/code> to suspend the UI thread, the UI thread was in the kernel, in the middle of changing memory protections. The thread does not suspend immediately, but rather waits for the kernel to finish its work, and then before returning to user mode, the kernel does a <code>Check\u00adFor\u00adKernel\u00adApc\u00adDelivery<\/code> to see if there were any requests waiting. It picks up the request to suspend, and that is when the thread actually suspends.\u00b2<\/p>\n<p><b>Bonus bonus chatter<\/b>: &#8220;What if the kernel delayed suspending a thread if it held any user-mode locks? Wouldn&#8217;t that avoid this problem?&#8221; First of all, how would the kernel even know whether a thread held any user-mode locks? There is no reliable signature for a user-mode lock. After all, you can make a user-mode lock out of any byte of memory by using it as a spin lock. Second, even if the kernel somehow could figure out whether a thread held a user-mode lock, you don&#8217;t want that to block thread suspension, because that would let a program make itself un-suspendable! Just call <code>AcquireSRWLockShared(some_global_srwlock)<\/code> and never call the corresponding <code>Release<\/code> function. 
Congratulations, the thread perpetually owns the global lock in shared mode and would therefore now be immune from suspension.<\/p>\n<p>\u00b9 Of course, this also requires that the code that does the suspending does not wait on cross-process resources like semaphores, mutexes, or file locks, because those might be held by the suspended thread.<\/p>\n<p>\u00b2 The kernel doesn&#8217;t suspend the thread immediately because it might be in possession of internal kernel locks, and suspending a thread while it owns a kernel lock (such as the lock that synchronizes access to the page tables) would result in the kernel itself deadlocking!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I did tell you not to do that.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-111066","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>I did tell you not to do 
that.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/111066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=111066"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/111066\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=111066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=111066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=111066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}