{"id":95435,"date":"2017-02-15T07:00:00","date_gmt":"2017-02-15T22:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/?p=95435"},"modified":"2019-03-13T01:06:16","modified_gmt":"2019-03-13T08:06:16","slug":"20170215-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20170215-00\/?p=95435","title":{"rendered":"Why did my thread pool stop processing work once it hit a long-running work item?"},"content":{"rendered":"<p>A customer found that occasionally, their program&#8217;s thread pool stopped processing work items queued with the <code>WT_EXECUTE&shy;IN&shy;PERSISTEN&shy;THREAD<\/code> flag. They would queue up the work items to the thread pool, but the work items would not get dispatched. Naturally, this caused problems with the program because certain background actions would simply stall. <\/p>\n<p>After some investigation, the problem was traced to this work item that appears to be preventing the thread pool from processing any work: <\/p>\n<pre>\nntdll!NtWaitForMultipleObjects+0xc\nKERNELBASE!WaitForMultipleObjectsEx+0xcc\nkernel32!WaitForMultipleObjects+0x19\ncontoso!WidgetMonitor::WidgetNotificationCallback+0xfd\ncontoso!std::tr1::_Callable_fun&lt;void (__stdcall*const)\n    (std::tr1::shared_ptr&lt;WidgetService&gt;),0&gt;::_ApplyX+0x1b\ncontoso!std::tr1::_Impl_no_alloc1&lt;std::tr1::_Callable_fun&lt;\n    void (__stdcall*const)(std::tr1::shared_ptr&lt;WidgetService&gt;),0&gt;,\n    void,std::tr1::shared_ptr&lt;WidgetService&gt; &amp;&gt;::_Do_call+0x2d\ncontoso!std::tr1::_Function_impl1&lt;void,std::tr1::shared_ptr&lt;\n    WidgetMonitor::WidgetNotificationContext&gt; &amp;&gt;::operator()+0x1e\ncontoso!Win32Adapters::Threading::Callback&lt;std::tr1::shared_ptr&lt;\n    WidgetMonitor::WidgetNotificationContext&gt; &gt;::\n    ExecuteCallbackTarget+0x3f\ncontoso!Win32Adapters::Threading::Callback&lt;std::tr1::shared_ptr&lt;\n    WidgetMonitor::WidgetNotificationContext&gt; &gt;::\n    DefaultThreadProc+0xd\nntdll!RtlpTpWorkCallback+0xef\nntdll!TppWorkerThread+0x4f3\nkernel32!BaseThreadInitThunk+0x24\nntdll!__RtlUserThreadStart+0x2f\nntdll!_RtlUserThreadStart+0x1b\n<\/pre>\n<p>(I inserted line breaks for readability.) <\/p>\n<p>Once they closed the widget monitor, the thread pool woke up and the work items that targeted the persistent thread started running again. <\/p>\n<p>Okay, first things first: For expository purposes, let&#8217;s remove all of the <code>std::tr1<\/code> stuff and pretend that the stack was this: <\/p>\n<pre>\nntdll!NtWaitForMultipleObjects+0xc\nKERNELBASE!WaitForMultipleObjectsEx+0xcc\nkernel32!WaitForMultipleObjects+0x19\ncontoso!WidgetMonitor::WidgetNotificationCallback+0xfd\nntdll!RtlpTpWorkCallback+0xef\nntdll!TppWorkerThread+0x4f3\nkernel32!BaseThreadInitThunk+0x24\nntdll!__RtlUserThreadStart+0x2f\nntdll!_RtlUserThreadStart+0x1b\n<\/pre>\n<p>That gets rid of the project&#8217;s internal callback scaffolding and lets us focus on the interaction with the operating system. <\/p>\n<p>The problem isn&#8217;t really visible in the stack trace. We&#8217;ll have to go to the code. <\/p>\n<pre>\nvoid WidgetMonitor::WidgetNotificationCallback(void* parameter)\n{\n WidgetNotificationContext* context =\n   reinterpret_cast&lt;WidgetNotificationContext*&gt;(parameter);\n\n RAII_HKEY hkey = ...;\n RAII_HANDLE registryEvent = ...;\n bool keepWaiting = true;\n while (keepWaiting) {\n  if (RegNotifyChangeKeyValue(hkey, false, REG_NOTIFY_CHANGE_LAST_SET,\n                              registryEvent, TRUE) == ERROR_SUCCESS) {\n   HANDLE handles[2] = { registryEvent, context-&gt;shutdownEvent };\n   DWORD waitResult = WaitForMultipleObjects(2, handles, FALSE, INFINITE);\n   switch (waitResult) {\n   case WAIT_OBJECT_0: \/\/ the registry key changed\n    ...\n    break;\n   case WAIT_OBJECT_0+1: \/\/ we are being asked to shut down\n    ...\n    keepWaiting = false;\n    break;\n   default: \/\/ Something unexpected happened\n    ...\n    keepWaiting = false;\n    break;\n   }\n  }\n }\n}\n<\/pre>\n<p>The deal is that the callback function processes the callback, and then goes into a loop monitoring a registry key. It continues monitoring the key until the shutdown event is signaled. <\/p>\n<p>Okay, so this looks a little weird, holding a thread pool thread hostage for an extended period of time, which is sort of contrary to the intent of a thread pool, which is to reuse a thread for multiple short work items. But it&#8217;s technically legal, and you are <a HREF=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/\">encouraged to pass the <code>WT_EXECUTE&shy;LONG&shy;FUNCTION<\/code> flag<\/a> to tell the thread pool, &#8220;This function will take <a HREF=\"https:\/\/blogs.msdn.microsoft.com\/oldnewthing\/20111209-00\/?p=8933\">a long time<\/a>, so you may want to schedule work onto other threads more aggressively instead of sitting around waiting for this work item to finish.&#8221; <\/p>\n<p>But the problem is that the program didn&#8217;t pass only the <code>WT_EXECUTE&shy;LONG&shy;FUNCTION<\/code> flag. It did this: <\/p>\n<pre>\nBOOL WidgetMonitor::StartMonitoringChangeNotifications()\n{\n  WidgetNotificationContext context = ...;\n  return QueueUserWorkItem(\n    WidgetMonitor::WidgetNotificationCallback,\n    context, WT_EXECUTELONGFUNCTION | WT_EXECUTEINPERSISTENTTHREAD);\n}\n<\/pre>\n<p>Notice that they requested that the callback run in the persistent thread. But the documentation for that flag says <\/p>\n<blockquote CLASS=\"q\"><p>This flag should be used only for short tasks&hellip; <\/p><\/blockquote>\n<p>So we have a contradiction. One flag says, &#8220;Run this callback in a persistent thread, and I promise I don&#8217;t take a long time.&#8221; The other flag says, &#8220;I&#8217;m going to take a long time.&#8221; <\/p>\n<p>The original thread pool was a bit too trusting and assumed that nobody would be so crazy as to explicitly declare their intent to break the rules.&sup1; I mean, if you&#8217;re going to break the rules, you are probably going to be sneaky about it, right? It so happened that the way the thread pool code was written, the <code>WT_EXECUTE&shy;IN&shy;PERSISTEN&shy;THREAD<\/code> flag takes precedence. The callback runs in the persistent thread, even though it runs long.&sup2; <\/p>\n<p>And that&#8217;s why the thread pool persistent thread grinds to a halt. The persistent thread is running the callback function, and the callback function is stuck. As a result, the persistent thread can&#8217;t do anything else, and the thread pool makes no progress. This also explains why shutting down widget notifications caused everything to wake up: Shutting down widget notifications causes the <code>Widget&shy;Config::Widget&shy;Notification&shy;Callback<\/code> function to break out of its loop and finally exit. This releases the persistent thread to run more work items. <\/p>\n<p>Okay, so we&#8217;ve diagnosed the problem. Next time, we&#8217;ll speculate as to why the developers chose to combine contradictory threads and (perhaps more important) suggest a solution. <\/p>\n<p>&sup1; Actually, what&#8217;s happening is that the two flags are targeting different parts of the thread pool. The &#8220;persistent thread&#8221; flag is an instruction to the thread pool work item dispatcher, telling it to dispatch the work item to a persistent thread. The &#8220;long function&#8221; flag is an instruction to the thread pool throughput manager to let it know that it should prefer to start a new thread instead of waiting for the work item to complete. Neither component on its own noticed anything wrong. <\/p>\n<p>&sup2; If we had a time machine, we could go back and make this combination of flags cause <code>Queue&shy;User&shy;Work&shy;Item<\/code> fail with <code>ERROR_INVALID_PARAMETER<\/code>, but unfortunately that option is not available to us. We&#8217;re stuck with the existing behavior of allowing the contradictory flags and ignoring the <code>WT_EXECUTE&shy;LONG&shy;FUNCTION<\/code> flag. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Because you ran the work item on the persistent thread.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-95435","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Because you ran the work item on the persistent thread.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/95435","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=95435"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/95435\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=95435"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=95435"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=95435"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}