{"id":109020,"date":"2023-11-15T07:00:00","date_gmt":"2023-11-15T15:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=109020"},"modified":"2025-02-18T09:50:57","modified_gmt":"2025-02-18T17:50:57","slug":"20231115-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20231115-00\/?p=109020","title":{"rendered":"Why does calling a coroutine allocate a lot of stack space even though the coroutine frame is on the heap?"},"content":{"rendered":"<p>Consider the following:<\/p>\n<pre>#include &lt;coroutine&gt;\r\n#include &lt;exception&gt;\r\n\r\n\/\/ Define a coroutine type called \"task\"\r\n\/\/ (not relevant to scenario but we need *something*.)\r\nstruct task { void* p; };\r\n\r\nnamespace std\r\n{\r\n    template&lt;typename...Args&gt;\r\n    struct coroutine_traits&lt;task, Args...&gt;\r\n    {\r\n        struct promise_type\r\n        {\r\n            task get_return_object() { return { this }; }\r\n            void unhandled_exception() { std::terminate(); }\r\n            void return_void() {}\r\n            suspend_never initial_suspend() { return {}; }\r\n            suspend_never final_suspend() noexcept { return {}; }\r\n        };\r\n    };\r\n}\r\n\/\/ End of \"task\" boilerplate.\r\n\r\nvoid consume(void*);\r\n\r\ntask sample()\r\n{\r\n    char large[65536];\r\n    consume(large);\r\n    co_return;\r\n}\r\n<\/pre>\n<p>The <code>sample<\/code> coroutine function consumes a large buffer, but it never suspends, and the coroutine frame is destroyed before the function returns. 
This means that the function is a good candidate for <i>heap elision<\/i>, specifically <a title=\"Halo: coroutine Heap Allocation eLision Optimization: the joint response\" href=\"https:\/\/open-std.org\/JTC1\/SC22\/WG21\/docs\/papers\/2018\/p0981r0.html\">Heap Allocation eLision Optimization (HALO)<\/a>, which permits the compiler to optimize out the entire heap allocation and put the coroutine frame on the stack:<\/p>\n<pre># clang -O3\r\nsample():\r\n        sub     rsp, 65576\r\n        lea     rdi, [rsp + 32]\r\n        call    consume(void*)@PLT\r\n        lea     rax, [rsp + 16]\r\n        add     rsp, 65576\r\n        ret\r\n<\/pre>\n<p>The Microsoft Visual C++ compiler&#8217;s code generation for coroutines (in version 19.34.31931.0) performs the initialization in a function named <code>$InitCoro$2<\/code>, with the declaration<\/p>\n<pre>void sample$InitCoro$2(\r\n    T* __coro_return_value,\r\n    bool const&amp; __coro_heap_ellision,\r\n    void* __coro_frame_ptr,\r\n    Args&amp;&amp;... 
args);\r\n<\/pre>\n<p>where <code>T<\/code> is the return type of the coroutine function (<code>task<\/code> in this example) and <code>Args&amp;&amp;...<\/code> are the parameters to the coroutine function (empty, in this case).<\/p>\n<p>The <code>__coro_<wbr \/>frame_<wbr \/>ptr<\/code> parameter points to a block of memory that will be used as the coroutine frame.<\/p>\n<p>The <code>__coro_<wbr \/>heap_<wbr \/>ellision<\/code> parameter\u00b9 is <code>true<\/code> if the frame is on the stack and <code>false<\/code> if the frame is on the heap.<\/p>\n<p>The <code>__coro_<wbr \/>return_<wbr \/>value<\/code> parameter points to the place to put the <code>T<\/code> returned by (or constructed from) the promise&#8217;s <code>get_return_object()<\/code>.<\/p>\n<p>The <code>$InitCoro$2<\/code> function initializes the coroutine frame, obtains the return object, runs the coroutine until its first suspension point, and then returns.<\/p>\n<p>The code generated for the coroutine function itself looks roughly like this:<\/p>\n<pre>task sample()\r\n{\r\n    char __coro_elision_buffer[sizeof(coroutine_frame)];\r\n    bool __coro_heap_elision = false;\r\n    void* __coro_frame_ptr = __coro_heap_elision\r\n        ? 
__coro_elision_buffer\r\n        : operator new(sizeof(coroutine_frame));\r\n    alignas(task) char __coro_return_value[sizeof(task)];\r\n\r\n    sample$InitCoro$2(&amp;__coro_return_value,\r\n        __coro_heap_elision,\r\n        __coro_frame_ptr);\r\n\r\n    return reinterpret_cast&lt;task&amp;&gt;(__coro_return_value);\r\n}\r\n<\/pre>\n<p>At optimization level 2, the Microsoft Visual C++ compiler propagates the <code>__coro_<wbr \/>heap_<wbr \/>elision<\/code> constant into the ternary, resulting in<\/p>\n<pre>task sample()\r\n{\r\n    char __coro_elision_buffer[sizeof(coroutine_frame)];\r\n    alignas(task) char __coro_return_value[sizeof(task)];\r\n\r\n    sample$InitCoro$2(&amp;__coro_return_value,\r\n        false,\r\n        operator new(sizeof(coroutine_frame)));\r\n\r\n    return reinterpret_cast&lt;task&amp;&gt;(__coro_return_value);\r\n}\r\n<\/pre>\n<p>However, it fails to realize that the <code>__coro_<wbr \/>elision_<wbr \/>buffer<\/code> is now a dead variable, so the function allocates stack space for a buffer it never uses.\u00b2<\/p>\n<p>This can be a problem if your coroutine has a large frame, because each call reserves that much stack space even though the frame itself lives on the heap, and that can lead to a premature stack overflow. If you want to make sure a large coroutine local variable goes on the heap, allocate it on the heap explicitly.<\/p>\n<pre>task sample()\r\n{\r\n    <span style=\"border: solid 1px currentcolor;\">auto large = std::make_unique_for_overwrite&lt;char[]&gt;(65536);<\/span>\r\n    consume(<span style=\"border: solid 1px currentcolor;\">large.get()<\/span>);\r\n    co_return;\r\n}\r\n<\/pre>\n<p>\u00b9 Yes, the word <i>elision<\/i> is misspelled in the parameter name.<\/p>\n<p>\u00b2 This missed optimization is <a href=\"https:\/\/developercommunity.visualstudio.com\/t\/Coroutine-reserves-stack-space-for-heap\/10270583\">on the backlog<\/a>. 
<b>Update<\/b>: <a href=\"https:\/\/developercommunity.visualstudio.com\/t\/Coroutine-reserves-stack-space-for-heap-\/10270583#TPIN-N10851779\">Fixed<\/a>!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Heap elision optimization kicks in, and doesn&#8217;t kick out.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-109020","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Heap elision optimization kicks in, and doesn&#8217;t kick out.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/109020","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=109020"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/109020\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=109020"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=109020"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=109020"}],"curies":[{"
name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}