{"id":105777,"date":"2021-10-07T07:00:00","date_gmt":"2021-10-07T14:00:00","guid":{"rendered":"https:\/\/devblogs.microsoft.com\/oldnewthing\/?p=105777"},"modified":"2021-10-07T08:50:59","modified_gmt":"2021-10-07T15:50:59","slug":"20211007-00","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/oldnewthing\/20211007-00\/?p=105777","title":{"rendered":"Debugging coroutine handles: The Microsoft Visual C++ compiler, clang, and gcc"},"content":{"rendered":"<p>How compilers implement coroutines is an implementation detail which is subject to change at any time. Nevertheless, you may be called upon to debug them, so it&#8217;s nice to know what you&#8217;re looking at.<\/p>\n<p>The C++ language requires that any coroutine be resumable from a <code>coroutine_handle&lt;&gt;<\/code>, so there needs to be some vtable-like thing so that calling the <code>resume()<\/code> method on an arbitrary <code>coroutine_handle&lt;&gt;<\/code> resumes the correct coroutine.<\/p>\n<p><b>Note<\/b>: <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/c-coroutines-in-visual-studio-2019-version-16-8\/\"> The Microsoft Visual C++ compiler coroutine ABI took a breaking change in version 16.8<\/a>, so I&#8217;ll cover Microsoft Visual C++ coroutines twice, once in C++17 mode and once again in C++20 mode.<\/p>\n<p>In the Microsoft Visual C++ compiler, the C++17-style coroutine handle is a pointer to a structure we shall call a &#8220;frame&#8221; for expository purposes.<\/p>\n<pre>struct coroutine_frame\r\n{\r\n    void (*resume)(coroutine_frame*);\r\n    uint16_t index;\r\n    uint16_t flags;\r\n    promise_type promise;\r\n    parameters...\r\n    locals...\r\n    temporaries...\r\n    other bookkeeping...\r\n};\r\n<\/pre>\n<p>The <code>index<\/code> represents the progress of the coroutine through its function body. 
The <code>flags<\/code> value is nonzero if the coroutine frame was allocated on the heap.<\/p>\n<p>Constructing a coroutine frame consists of the following steps:<\/p>\n<ul>\n<li>Allocate memory for the frame, usually from the heap.<\/li>\n<li>Initialize the <code>resume<\/code> member to point to a custom function specific to the coroutine.<\/li>\n<li>Initialize the <code>index<\/code> to 2.<\/li>\n<li>Initialize the <code>flags<\/code> to 1 if the frame was allocated on the heap; otherwise initialize it to zero.<\/li>\n<\/ul>\n<p>The index is initialized to 2 because the state of a suspended coroutine is always recorded as a nonzero even number.<\/p>\n<ul>\n<li>Nonzero: I&#8217;m guessing that zero is kept as a permanently invalid state to aid in debugging.<\/li>\n<li>Even: We&#8217;ll see why later.<\/li>\n<\/ul>\n<p>When a coroutine suspends, its <code>index<\/code> is updated to remember where the coroutine needs to resume. The coroutine states appear to be numbered in the order in which they appear in the function, so the initial state of the coroutine is 2, the first suspension point is 4, the next one is 6, and so on.\u00b9 Some of the suspension points can get optimized out, say, because the compiler can prove that <code>await_ready<\/code> always returns <code>true<\/code>.<\/p>\n<p>To resume a suspended coroutine, call the <code>resume<\/code> function with a pointer to the coroutine frame. Each coroutine gets a custom <code>resume<\/code> function which uses the index as an index into a jump table to dispatch to the appropriate point in the coroutine where execution should resume.<\/p>\n<p>For 32-bit code, the jump table is an array of addresses to jump to. For 64-bit code, the jump table is an array of relative virtual addresses which need to be added to the module base address to form the code address. 
Using relative virtual addresses keeps the jump table smaller and also reduces the number of relocations needed.<\/p>\n<p>To destroy a suspended coroutine, set the bottom bit of the index (turning it into an odd number), and then call the <code>resume<\/code> function. The odd entries in the jump table point to cleanup functions which destruct the variables that were live at the point of suspension. And if the <code>flags<\/code> say that the coroutine was allocated on the heap, then it is <code>delete<\/code>d from the heap.<\/p>\n<p>The Microsoft Visual C++ compiler uses the naming convention <code>function$_ResumeCoro$N<\/code> for the coroutine <code>resume<\/code> function, for some number <var>N<\/var>. (I haven&#8217;t yet figured out what the <var>N<\/var> means.) Here&#8217;s a 64-bit example:<\/p>\n<pre>function$_ResumeCoro$1:\r\n    mov     [rsp+8], rcx            ; save coroutine frame\r\n    push    rbx  \r\n    sub     rsp, 30h                ; build stack frame\r\n    mov     rbx, [rsp+40h]          ; rbx = coroutine frame\r\n    movzx   eax, word ptr [rbx+8]   ; eax = index\r\n    mov     [rsp+20h], ax           ; remember the index\r\n    inc     ax                      ; add 1, just for fun\r\n    cmp     ax, 6  \r\n    ja      fatal_error             ; invalid index\r\n    movsx   rax, ax  \r\n    lea     rdx, [__ImageBase]      ; get module base address\r\n    mov     ecx, [rdx+rax*4+3158h]  ; get offset from jump table\r\n    add     rcx,rdx                 ; apply offset to base address\r\n    jmp     rcx                     ; jump there\r\n<\/pre>\n<p>Note that the compiler <i>adds one<\/i> to the index before using it to look up the offset in the jump table, so you need to ignore the first entry in the jump table.<\/p>\n<p>The clang compiler uses a slightly different approach:<\/p>\n<pre>struct coroutine_frame\r\n{\r\n    void (*resume)(coroutine_frame*);\r\n    void (*destroy)(coroutine_frame*);\r\n    uintN_t index;\r\n    \/* parameters, 
local variables, other bookkeeping *\/\r\n};\r\n<\/pre>\n<p>Instead of encoding the &#8220;destroying&#8221; state in the bottom bit of the index, clang uses a separate <code>destroy<\/code> function. This means that the indices are small integers, with no special meaning for even\/odd values. (Zero is a valid index.) The <code>resume<\/code> and <code>destroy<\/code> functions have separate jump tables, one for resumption and one for destruction, and if the number of states is small, then clang doesn&#8217;t even bother making a jump table; it just uses a bunch of tests. The size of the variable used to hold the state is chosen to be large enough to hold all of the states. Most reasonable-sized coroutines can get by with an 8-bit index, but the compiler <a href=\"https:\/\/www.llvm.org\/docs\/Coroutines.html\"> internally supports indices up to 32 bits in size<\/a>.<\/p>\n<p>The gcc compiler sits somewhere in between the Microsoft and clang compilers.<\/p>\n<pre>struct coroutine_frame\r\n{\r\n    void (*actor)(coroutine_frame*);\r\n    void (*destroy)(coroutine_frame*);\r\n    uint8_t unused;\r\n    uint8_t flags;\r\n    uint16_t index;\r\n    \/* parameters, local variables, other bookkeeping *\/\r\n};\r\n<\/pre>\n<p>Like the clang compiler, the gcc compiler uses a pair of function pointers, one for resuming the coroutine (which is internally called the <code>actor<\/code>) and one for destroying it. However, the gcc compiler follows the Microsoft C++ convention of using even numbers for suspended states and odd numbers for destroying states. The <code>destroy<\/code> function just sets the bottom bit of the <code>index<\/code> and then jumps to the <code>actor<\/code> function.<\/p>\n<p><a href=\"https:\/\/www.youtube.com\/watch?v=xpZ02A9aUVQ\"> Inside the <code>actor<\/code> function<\/a>, the code checks the bottom bit of the <code>index<\/code> and dispatches from two different jump tables, one for even indices and one for odd indices. 
Curiously, the table for even indices has <code>fatal_error<\/code> in all the odd slots, and the table for odd indices has <code>fatal_error<\/code> in all the even slots, so really, they could have been combined into a single table. Not sure what that&#8217;s about.<\/p>\n<p>The <code>flags<\/code> records whether the coroutine function&#8217;s parameters have been transferred to the frame. This is used when the frame is destroyed to know whether or not there are parameters in the frame which need to be destructed.<\/p>\n<p>Finally, we come to Microsoft Visual C++ coroutines in C++20 mode. <a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/c-coroutines-in-visual-studio-2019-version-16-8\/\"> As noted in their blog post<\/a>, the change was made in order to be ABI-compatible with clang and gcc, so that coroutines from all three compilers can interoperate.<\/p>\n<pre>struct coroutine_frame\r\n{\r\n    void (*resume)(coroutine_frame*);\r\n    void (*destroy)(coroutine_frame*);\r\n    promise_type promise;\r\n    parameters...\r\n    uint16_t index;\r\n    uint16_t flags;\r\n    locals...\r\n    temporaries...\r\n    other bookkeeping...\r\n};\r\n<\/pre>\n<p>The original <code>resume<\/code> function has been split into separate <code>resume<\/code> and <code>destroy<\/code> functions, and the other members of the coroutine frame have been rearranged.<\/p>\n<p>Adding a <code>destroy<\/code> function to the start of the coroutine frame establishes the <i lang=\"la\">de facto<\/i> common ABI for coroutine frames:<\/p>\n<pre>struct coroutine_frame_abi\r\n{\r\n    void (*resume)(coroutine_frame_abi*);\r\n    void (*destroy)(coroutine_frame_abi*);\r\n};\r\n<\/pre>\n<p>For all four coroutine frame formats, you can figure out what coroutine a coroutine handle corresponds to by dumping the start of the frame and looking at the <code>resume<\/code> pointer. 
You can also look at the <code>index<\/code> to see where in the coroutine&#8217;s execution you are, although for Microsoft Visual C++ coroutines in C++20 mode, the index is not at a fixed location, so digging it out will require you to disassemble the <code>resume<\/code> function to see where it reads the index from.<\/p>\n<p>In all cases, you&#8217;ll have to disassemble the <code>resume<\/code> function to find the jump table (or for clang, the switch statement) but you can then index into that jump table (after adjusting by 1 for the Microsoft C++ compiler) to find the point at which execution is going to resume.<\/p>\n<p>Here&#8217;s the cookbook in a table:<\/p>\n<table class=\"cp3\" style=\"border-collapse: collapse; text-align: center;\" border=\"1\" cellspacing=\"0\" cellpadding=\"3\">\n<tbody>\n<tr>\n<td>&nbsp;<\/td>\n<th>Microsoft Visual C++<\/th>\n<th>clang<\/th>\n<th>gcc<\/th>\n<\/tr>\n<tr>\n<th>Identify coroutine from handle<\/th>\n<td colspan=\"3\">Dump first pointer as a function pointer<\/td>\n<\/tr>\n<tr>\n<th>Is coroutine destroying?<\/th>\n<td>Index is odd<\/td>\n<td>(no way to tell)<\/td>\n<td>Index is odd<\/td>\n<\/tr>\n<tr>\n<th>Where will it resume?<\/th>\n<td>Disassemble resumption function<br \/>\nadd 1 to index<br \/>\nlook up in jump table<\/td>\n<td>Disassemble resumption function<br \/>\n\u00a0<br \/>\nfollow switch statement<\/td>\n<td>Disassemble resumption function<br \/>\nfind the right jump table (even\/odd)<br \/>\nlook up in jump table<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\u00b9 I&#8217;ve never created a coroutine with more than 32767 suspension points, nor do I have any interest in trying, so I don&#8217;t know whether the compiler switches to a 32-bit <code>index<\/code> or whether it just bails out with &#8220;Error: Coroutine has too many suspension points.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Peeking behind the 
curtain.<\/p>\n","protected":false},"author":1069,"featured_media":111744,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[25],"class_list":["post-105777","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-oldnewthing","tag-code"],"acf":[],"blog_post_summary":"<p>Peeking behind the curtain.<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105777","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/users\/1069"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/comments?post=105777"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/posts\/105777\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media\/111744"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/media?parent=105777"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/categories?post=105777"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/oldnewthing\/wp-json\/wp\/v2\/tags?post=105777"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}