{"id":105420,"date":"2021-07-08T07:00:00","date_gmt":"2021-07-08T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105420"},"modified":"2021-07-07T20:51:08","modified_gmt":"2021-07-08T03:51:08","slug":"2021078-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210708-00\/?p=105420","title":{"rendered":"On the perils of holding a lock across a coroutine suspension point, part 2: Nonrecursive mutexes"},"content":{"rendered":"<p>Last time, we looked at <a title=\"On the perils of holding a lock across a coroutine suspension point, part 1: The set-up\" href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210707-00\/?p=105417\"> what can go wrong if you hold a recursive mutex across a coroutine suspension point<\/a>. Do things get any better if you switch to a nonrecursive mutex?<\/p>\n<p>Recall that we are looking at this function:<\/p>\n<pre>IAsyncAction MyObject::RunOneAsync()\r\n{\r\n  std::lock_guard guard(m_mutex);\r\n\r\n  if (!m_list.empty()) {\r\n    auto&amp; item = m_list.front();\r\n    <span style=\"color: blue;\">co_await item.RunAsync();<\/span>\r\n    item.Cleanup();\r\n    m_list.pop_front();\r\n  }\r\n}\r\n<\/pre>\n<p>Let&#8217;s walk through what happens if the mutex is nonrecursive and a call to <code>Run\u00adOne\u00adAsync<\/code> is made from the same thread that made a previous not-yet-complete call to <code>Run\u00adOne\u00adAsync<\/code>.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: left;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-bottom: none; background-color: white; color: black; width: 18em;\">construct 
lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.lock()<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: 1px black; border-style: none solid; background-color: white; color: black;\"><code>auto&amp; item = m_list.front();<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-top: none; background-color: white; color: black;\"><code>co_await item.RunAsync();<\/code><\/td>\n<td>\u2192<\/td>\n<td>Suspended<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1 returns <code>IAsyncAction<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">Thread available to do other work<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #2<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; background-color: white; color: black;\">construct lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.lock()<\/code> \u2014 blocks<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>During the period of suspension, anybody who wants to acquire the lock will block, since that&#8217;s how nonrecursive mutexes work.<\/p>\n<p>Formally speaking, attempting to acquire a nonrecursive mutex recursively triggers <i>undefined behavior<\/i>, so from a compiler-theoretic point of view, the game is over and anything can happen, <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20140627-00\/?p=633\"> including time travel<\/a>. In practice, what happens is that the attempted recursive acquisition blocks.<\/p>\n<p>And that&#8217;s a real hard block, not a coroutine suspend. 
The thread that tries to acquire the lock cannot do anything while waiting for the lock to become available. In particular, it <i>cannot run coroutine continuations<\/i>.<\/p>\n<p>Now, we don&#8217;t know much about <code>Run\u00adAsync<\/code>. Maybe it needs access to the originating thread in order to complete its work. Or maybe it uses another coroutine, and that <i>other<\/i> coroutine needs access to the originating thread. If that&#8217;s the case, then <code>Run\u00adAsync<\/code> will never complete, because the originating thread is hung.<\/p>\n<p>Maybe you&#8217;re lucky, and <code>Run\u00adAsync<\/code> can do all of its work without needing to access the originating thread. You&#8217;re still in trouble, because <code>Run\u00adOne\u00adAsync<\/code> itself might need access to the originating thread in order to resume. For example, C++\/WinRT has a policy that <code>co_await<\/code> of an <code>IAsyncAction<\/code> always resumes in the same apartment context. If the original apartment is a single-threaded apartment (standard for UI threads), then the continuation is going to need to get back to that originating thread, but it can&#8217;t, because the originating thread is hung waiting for the mutex.<\/p>\n<p>Now, suppose you&#8217;re super-lucky, and the <code>co_await<\/code> of <code>Run\u00adAsync<\/code> doesn&#8217;t need to resume on the originating thread. Maybe you started in the multi-threaded apartment, so it can resume on any other thread in that apartment. 
Great, your code is running again, just on a different thread.<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: left;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">Some other thread<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1 resumes<\/td>\n<td>\u2190<\/td>\n<td style=\"text-align: left;\"><code>RunAsync<\/code> #1 completes<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-bottom: none; background-color: white; color: black; width: 18em;\"><code>item.Cleanup();<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: 1px black; border-style: none solid; background-color: white; color: black;\"><code>m_list.pop_front();<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-top: none; background-color: white; color: black;\">destruct lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.unlock()<\/code> \u2014 from the wrong thread<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We are unlocking a mutex from a thread that didn&#8217;t lock it. This is not a legal operation and the behavior is undefined.<\/p>\n<p>So yeah, double undefined behavior.<\/p>\n<p>In practice, what usually happens is that your main thread hangs unrecoverably. You dump all the stacks to try to find the owner, and you don&#8217;t see any stacks that are in code that&#8217;s holding the lock. That&#8217;s because the code that&#8217;s responsible for the lock isn&#8217;t active on any thread, so you won&#8217;t see it in any stack. 
The code is waiting to resume execution when its associated coroutine is resumed, and that coroutine is somewhere on the heap.<\/p>\n<p>Basically, any <code>co_await<\/code> is a point of potential re-entrancy.<\/p>\n<p>Next time, we&#8217;ll look at ways of addressing the problem.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Another way things can go wrong.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-105420","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Another way things can go wrong.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105420","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105420"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105420\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105420"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105420"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.co
m\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105420"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}