{"id":107151,"date":"2022-09-09T07:00:00","date_gmt":"2022-09-09T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=107151"},"modified":"2022-09-09T06:56:06","modified_gmt":"2022-09-09T13:56:06","slug":"20220909-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20220909-00\/?p=107151","title":{"rendered":"The case of the APC that never arrives"},"content":{"rendered":"<p>A customer encountered found that sometimes, their application hung in its clean up code. Here&#8217;s a simplified version.<\/p>\n<pre>bool ShuttingDown = false;\r\n\r\nvoid MainThread()\r\n{\r\n    DWORD id;\r\n    auto hThread = CreateThread(nullptr, 0, WorkerThread,\r\n                                nullptr, 0, &amp;id); \/\/ succeeds\r\n\r\n    BlahBlahBlah(); \/\/ do useful work\r\n\r\n    \/\/ Time to clean up. Post an APC to the worker thread\r\n    \/\/ to tell it that it's time to go home.\r\n    QueueUserAPC(WakeWorker, hThread, 0); \/\/ succeeds\r\n\r\n    WaitForSingleObject(hThread, INFINITE); \/\/ hangs\r\n\r\n    CloseHandle(hThread);\r\n}\r\n\r\nvoid CALLBACK WakeWorker(ULONG_PTR)\r\n{\r\n    ShuttingDown = true;\r\n}\r\n\r\nDWORD CALLBACK WorkerThread(void*)\r\n{\r\n    \/\/ Do work until shut down.\r\n    do\r\n    {\r\n        \/\/ All work is posted via APCs.\r\n        SleepEx(INFINITE, TRUE);\r\n    } while (!ShuttingDown);\r\n\r\n    return 0;\r\n}\r\n<\/pre>\n<p>The idea is that the program has a worker thread to, y&#8217;know, do some work. All of the work items are posted via <code>Queue\u00adUser\u00adAPC<\/code>, and the worker thread simply calls <code>Sleep\u00adEx<\/code> alertably, over and over again. Each call to <code>Sleep\u00adEx<\/code> sleeps the thread until an APC is queued, at which point it returns (because the <code>bAlertable<\/code> parameter is <code>TRUE<\/code>).<\/p>\n<p>One way of looking at this design is that it&#8217;s a sneaky way of making the operating system manage your work queue for you. Another way of looking at it is as a single-threaded I\/O completion port.<\/p>\n<p>Call it what you will.<\/p>\n<p>Anyway, the problem is that the worker thread is stuck in <code>SleepEx<\/code>, as if it never got the <code>Wake\u00adWorker<\/code> APC that tells it to exit.<\/p>\n<p>But is that really the problem?<\/p>\n<p>The customer was able to get some full memory dumps of systems that got into this state, and the telling detail is that the <code>ShuttingDown<\/code> variable was set to <code>true<\/code>. So it wasn&#8217;t that the <code>Wake\u00adWorker<\/code> APC never arrived. It totally did arrive and set <code>ShuttingDown<\/code> to <code>true<\/code>. Yet somehow that didn&#8217;t wake up the <code>SleepEx<\/code>.<\/p>\n<p>One of my colleagues offered the possibility that when the code entered the <code>Sleep\u00adEx<\/code> loop, the <code>Wake\u00adWorker<\/code> APC had <i>already run<\/i>. If that&#8217;s the case, then the <code>Sleep\u00adEx<\/code> is going to wait for an APC that has already arrived.<\/p>\n<p>I was able to confirm with a test program that this was indeed a possibility.<\/p>\n<pre>DWORD apc = 0;\r\n\r\nvoid CALLBACK WakeWorker(ULONG_PTR)\r\n{\r\n    apc = GetCurrentThreadId();\r\n}\r\n\r\nDWORD CALLBACK WorkerThread(void*)\r\n{\r\n    if (apc == GetCurrentThreadId()) DebugBreak();\r\n    return 0;\r\n}\r\n\r\nint __cdecl main()\r\n{\r\n    DWORD id;\r\n    while (true)\r\n    {\r\n        apc = 0;\r\n        auto h = CreateThread(nullptr, 0, WorkerThread, nullptr, 0, &amp;id);\r\n        QueueUserAPC(WakeWorker, h, 0);\r\n        WaitForSingleObject(h, INFINITE);\r\n        CloseHandle(h);\r\n        Sleep(10);\r\n    }\r\n    \/\/ notreached\r\n}\r\n<\/pre>\n<p>This program creates a thread and immediately queues an APC to it. The thread procedure checks if the APC already ran on the same thread, and if so, it breaks into the debugger.<\/p>\n<p>If you run this program, it breaks immediately. If you set a breakpoint on the <code>Wake\u00adWorker<\/code> function, you&#8217;ll see that it does indeed run on the thread before the <code>Worker\u00adThread<\/code> function starts.<\/p>\n<p>From the stack trace on that breakpoint, it appears that the kernel processes APCs during the early phases of thread creation, so if you had any queued APCs, they will get processed before the thread procedure even starts.<\/p>\n<p>So that&#8217;s the bug: The <code>ShuttingDown<\/code> flag was already set by the time the thread procedure started, but the code assumes that it can only be set by an APC that is processed by the <code>Sleep\u00adEx<\/code>.<\/p>\n<p>The fix is simple: Change the loop from a <code>do...while<\/code> to a <code>while<\/code>, so that the test is at the top of the loop instead of at the bottom.<\/p>\n<pre>DWORD CALLBACK WorkerThread(void*)\r\n{\r\n    \/\/ Do work until shut down.\r\n    while (!ShuttingDown)\r\n    {\r\n        \/\/ All work is posted via APCs.\r\n        SleepEx(INFINITE, TRUE);\r\n    }\r\n\r\n    return 0;\r\n}\r\n<\/pre>\n<p>The case where this happens is if the <code>Blah\u00adBlah\u00adBlah()<\/code> function returns very quickly, allowing the <code>QueueUserAPC<\/code> to win the race against the <code>Worker\u00adThread<\/code>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Or maybe it did?<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-107151","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Or maybe it did?<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107151","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=107151"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/107151\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=107151"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=107151"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=107151"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}