{"id":105417,"date":"2021-07-07T07:00:00","date_gmt":"2021-07-07T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105417"},"modified":"2021-07-14T07:23:02","modified_gmt":"2021-07-14T14:23:02","slug":"20210707-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20210707-00\/?p=105417","title":{"rendered":"On the perils of holding a lock across a coroutine suspension point, part 1: The set-up"},"content":{"rendered":"<p>Say you want to perform a bunch of asynchronous operations involving some object state, but also want to make sure that no other tasks access that object state at the same time. For synchronous code, you would use a traditional synchronization object like a mutex or critical section:<\/p>\n<pre>void MyObject::RunOne()\r\n{\r\n  std::lock_guard guard(m_mutex);\r\n\r\n  if (!m_list.empty()) {\r\n    auto&amp; item = m_list.front();\r\n    item.Run();\r\n    item.Cleanup();\r\n    m_list.pop_front();\r\n  }\r\n}\r\n<\/pre>\n<p>The mutex ensures that only one attempt to process an item from the list is active at a time, and also to prevent any other code from mutating the <code>m_list<\/code> while we are using it.<\/p>\n<p>But say that some of these operations are asynchronous. For simplicity, I&#8217;m eliding the traditional <code>auto lifetime = get_strong();<\/code> that is used to prevent the object from being destructed while awaiting. (Let&#8217;s say that the rule is that you cannot release your reference to <code>MyObject<\/code> until <code>Run\u00adOne\u00adAsync<\/code> completes.)<\/p>\n<pre>IAsyncAction MyObject::RunOneAsync()\r\n{\r\n  std::lock_guard guard(m_mutex);\r\n\r\n  if (!m_list.empty()) {\r\n    auto&amp; item = m_list.front();\r\n    <span style=\"color: blue;\">co_await item.RunAsync();<\/span>\r\n    item.Cleanup();\r\n    m_list.pop_front();\r\n  }\r\n}\r\n<\/pre>\n<p>Is this okay?<\/p>\n<p>One argument I&#8217;ve heard is that this is not okay because the <code>co_await<\/code> causes the original <code>Run\u00adOne\u00adAsync<\/code> call to to return an <code>IAsync\u00adAction<\/code> to its caller, and as part of the act of returning the <code>IAsync\u00adAction<\/code>, the lock is released.<\/p>\n<p>This argument is incorrect. The lock remains held while the coroutine is suspended. After all, if objects were destructed at suspension, then you wouldn&#8217;t be able to carry anything across a suspension point!<\/p>\n<pre>IAsyncAction WidgetManager::WhateverAsync()\r\n{\r\n  auto lifetime = get_strong();\r\n  std::string name = m_widget.GetName();\r\n  m_widget.SetName(\"temporary\");\r\n  co_await m_widget.SomethingAsync();\r\n  m_widget.SetName(name); \/\/ certainly \"name\" is still valid, right?\r\n  \/\/ certainly \"lifetime\" is still holding our object alive, right?\r\n}\r\n<\/pre>\n<p>Don&#8217;t worry. <code>name<\/code> and <code>lifetime<\/code> are still valid across the suspension because the formal parameters and local variables are kept in the coroutine frame, which remains alive while the coroutine is suspended. Indeed, the <code>lifetime<\/code> relies upon it!<\/p>\n<p>However, it&#8217;s the liveness of the lock guard that is the issue here.<\/p>\n<p>Since the lock guard hasn&#8217;t been destructed, the mutex remains locked while the coroutine is suspended.<\/p>\n<p>Now things get exciting.<\/p>\n<p>Suppose we have another coroutine that wants the lock. Heck, it could very well be another call to <code>Run\u00adOne\u00adAsync<\/code>!<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: left;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-bottom: none; background-color: white; color: black; width: 18em;\">construct lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.lock()<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: 1px black; border-style: none solid; background-color: white; color: black;\"><code>auto&amp; item = m_list.front();<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-top: none; background-color: white; color: black;\"><code>co_await item.RunAsync();<\/code><\/td>\n<td>\u2192<\/td>\n<td>Suspended<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1 returns <code>IAsyncAction<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">Thread available to do other work<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #2<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; background-color: white; color: black;\">construct lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.lock()<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Now we&#8217;re in trouble.<\/p>\n<p>If the <code>m_mutex<\/code> supports recursive acquisition, then what happens is that the second call to <code>Run\u00adOne\u00adAsync<\/code> successfully acquires the mutex (recursive acquisition), and execution continues:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td><code>RunOneAsync<\/code> #2 continues<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-bottom: none; background-color: white; color: black; width: 18em;\"><code>auto&amp; item = m_list.front();<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-top: none; background-color: white; color: black;\"><code>co_await item.RunAsync();<\/code><\/td>\n<td>\u2192<\/td>\n<td align=\"left\">Suspended<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #2 returns <code>IAsyncAction<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">Thread available to do other work<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We are running the front element twice! I bet it&#8217;s not expecting that.<\/p>\n<p>The mutex failed at its intended purpose of serializing calls to <code>Run\u00adOne\u00adAsync<\/code>.<\/p>\n<p>Okay, but wait, the disaster is still unfolding.<\/p>\n<p>Eventually, the two calls will complete, in some order. Let&#8217;s say that #1 finishes first. Execution continues:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: left;\" border=\"0\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1 resumes<\/td>\n<td>\u2190<\/td>\n<td style=\"text-align: left;\"><code>RunAsync<\/code> #1 completes<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-bottom: none; background-color: white; color: black; width: 18em;\"><code>item.Cleanup();<\/code><\/td>\n<td>&nbsp;<\/td>\n<td>Cleaning up while <code>RunAsync<\/code> #2 still outstanding<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: 1px black; border-style: none solid; background-color: white; color: black;\"><code>m_list.pop_front();<\/code><\/td>\n<td>&nbsp;<\/td>\n<td>Destructing head item while #2 is still using it<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-top: none; background-color: white; color: black;\">destruct lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.unlock()<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #1 completes<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">Thread available to do other work<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: center;\" colspan=\"2\">\u2193<\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #2 resumes<\/td>\n<td>\u2190<\/td>\n<td><code>RunAsync<\/code> #2 completes<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-bottom: none; background-color: white; color: black; width: 18em;\"><code>item.Cleanup();<\/code><\/td>\n<td>&nbsp;<\/td>\n<td>Cleaning up already-destructed object<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: 1px black; border-style: none solid; background-color: white; color: black;\"><code>m_list.pop_front();<\/code><\/td>\n<td>&nbsp;<\/td>\n<td>Popping an item that was never run<\/td>\n<\/tr>\n<tr>\n<td style=\"width: 2em;\">\u00a0<\/td>\n<td style=\"border: solid 1px black; border-top: none; background-color: white; color: black;\">destruct lock_guard<\/td>\n<td>&nbsp;<\/td>\n<td><code>m_mutex.unlock()<\/code><\/td>\n<\/tr>\n<tr>\n<td style=\"border: solid 1px black; background-color: white; color: black; text-align: center;\" colspan=\"2\"><code>RunOneAsync<\/code> #2 completes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>When the first <code>RunAsync<\/code> completes, the first <code>Run\u00adOne\u00adAsync<\/code> resumes, and it proceeds to clean up the item that finished running, and then remove the head item from the list, thereby destructing it. All this happens even though the second <code>Run\u00adOne\u00adAsync<\/code> is still using it.<\/p>\n<p>The first <code>Run\u00adOne\u00adAsync<\/code> completes, having created a right mess of things but escaping unharmed.<\/p>\n<p>When the second <code>Run\u00adAsync<\/code> completes, the second <code>Run\u00adOne\u00adAsync<\/code> resumes, and it tries to clean up the item that has already been destructed. You get sent this crash dump and you scratch your head because you&#8217;re looking at the code and you see that mutex right there, and you&#8217;re thinking, &#8220;How can this thing get prematurely destructed? It&#8217;s protected by a mutex!&#8221;<\/p>\n<p>Now, maybe the <code>Cleanup<\/code> method happens by sheer luck not to crash. It &#8220;only&#8221; corrupts some memory. That just makes the debugging even harder.<\/p>\n<p>The second <code>Run\u00adOne\u00adAsync<\/code> then pops the front item from the list, thinking it&#8217;s popping the item that it just finished running, when in fact it&#8217;s popping an item on the list <i>that was never run at all<\/i>.<\/p>\n<p>Now the bug is that the program keeps running, but sometimes, items put onto the work list are thrown away without ever being run or cleaned up. Meanwhile, some items are run twice. This bug doesn&#8217;t come with crash dumps. It&#8217;s just end-user reports from the field that your program isn&#8217;t doing its job.<\/p>\n<p>Basically, what&#8217;s going on is that thanks to coroutines sharing a thread, your recursive mutex is not doing its job of ensuring mutual exclusion. Since everything is happening on a single thread, the recursive mutex always says, &#8220;Oh, I remember you. Come on in!&#8221;<\/p>\n<p>Next time, we&#8217;ll look at what happens if the mutex does not support recursive acquisition.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One way things can go wrong.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-105417","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>One way things can go wrong.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105417","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105417"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105417\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105417"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105417"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105417"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}