Debugging coroutine handles: The Microsoft Visual C++ compiler, clang, and gcc

Raymond

How compilers implement coroutines is an implementation detail which is subject to change at any time. Nevertheless, you may be called upon to debug them, so it’s nice to know what you’re looking at.

The C++ language requires that any coroutine be resumable from a coroutine_handle<>, so there needs to be some vtable-like thing so that calling the resume() method on an arbitrary coroutine_handle<> resumes the correct coroutine.

Note: The Microsoft Visual C++ compiler coroutine ABI took a breaking change in version 16.8, so I’ll cover Microsoft Visual C++ coroutines twice, once in C++17 mode and once again in C++20 mode.

In the Microsoft Visual C++ compiler, the C++17-style coroutine handle is a pointer to a structure we shall call a “frame” for expository purposes.

struct coroutine_frame
{
    void (*resume)(coroutine_frame*);
    uint16_t index;
    uint16_t flags;
    promise_type promise;
    parameters...
    locals...
    temporaries...
    other bookkeeping...
};

The index represents the progress of the coroutine through its function body. The flags value is nonzero if the coroutine frame was allocated on the heap.

Constructing a coroutine frame consists of the following steps:

  • Allocate memory for the frame, usually from the heap.
  • Initialize the resume member to point to a custom function specific to the coroutine.
  • Initialize the index to 2.
  • Initialize the flags to 1 if the frame was allocated on the heap; otherwise initialize it to zero.

The index is initialized to 2 because the state of a suspended coroutine is always recorded as a nonzero even number.

  • Nonzero: I’m guessing that zero is kept as a permanently invalid state to aid in debugging.
  • Even: We’ll see why later.

When a coroutine suspends, its index is updated to remember where the coroutine needs to resume. The coroutine states appear to be numbered in the order in which they appear in the function, so the initial state of the coroutine is 2, the first suspension point is 4, the next one is 6, and so on.¹ Some of the suspension points can get optimized out, say, because the compiler can prove that await_ready always returns true.

To resume a suspended coroutine, call the resume function with a pointer to the coroutine frame. Each coroutine gets a custom resume function which uses the index as an index into a jump table to dispatch to the appropriate point in the coroutine where execution should resume.

For 32-bit code, the jump table is an array of addresses to jump to. For 64-bit code, the jump table is an array of relative virtual addresses which need to be added to the module base address to for the code address. Using relative virtual addresses keeps the jump table smaller and also reduces the number of relocations needed.

To destroy a suspended coroutine, set the bottom bit of the index (turning it into an odd number), and then call the resume function. The odd entries in the jump table point to cleanup functions which destruct the variables that were live at the point of suspension. And if the flags say that the coroutine was allocated on the heap, then it is delete from the heap.

The Microsoft Visual C++ compiler uses the naming convention function$_ResumeCoro$N for the coroutine resume function, for some number N. (I haven’t yet figured out what the N means.) Here’s a 64-bit example:

function$_ResumeCoro$1:
    mov     [rsp+8], rcx            ; save coroutine frame
    push    rbx  
    sub     rsp, 30h                ; build stack frame
    mov     rbx, [rsp+40h]          ; rbx = coroutine frame
    movzx   eax, word ptr [rbx+8]   ; eax = index
    mov     [rsp+20h], ax           ; remember the index
    inc     ax                      ; add 1, just for fun
    cmp     ax, 6  
    ja      fatal_error             ; invalid index
    movsx   rax, ax  
    lea     rdx, [__ImageBase]      ; get module base address
    mov     ecx, [rdx+rax*4+3158h]  ; get offset from jump table
    add     rcx,rdx                 ; apply offset to base address
    jmp     rcx                     ; jump there

Note that the compiler adds one to the index before using it to look up the offset in the jump table, so you need to ignore the first entry in the jump table.

The clang compiler uses a slightly different approach:

struct coroutine_frame
{
    void (*resume)(coroutine_frame*);
    void (*destroy)(coroutine_frame*);
    uintN_t index;
    /* parameters, local variables, other bookkeeping */
};

Instead of encoding the “destroying” state in the bottom bit of the index, clang uses a separate destroy function. This means that the indices are small integers, with no special meaning for even/odd values. (Zero is a valid index.) The resume and destroy functions have separate jump tables, one for resumption and one for destruction, and if the number of states is small, then clang doesn’t even bother making a jump table; it just uses a bunch of tests. The size of the variable used to hold the state is chosen to be large enough to hold all of the states. Most reasonable-sized coroutines can get by with an 8-bit index, but the compiler internally supports indices up to 32 bits in size.

The gcc compiler sits somewhere in between the Microsoft and clang compilers.

struct coroutine_frame
{
    void (*actor)(coroutine_frame*);
    void (*destroy)(coroutine_frame*);
    uint8_t unused;
    uint8_t flags;
    uint16_t index;
    /* parameters, local variables, other bookkeeping */
};

Like the clang compiler, the gcc compiler uses a pair of function pointers, one for resuming the coroutine (which is internally called the actor) and one for destroying it. However, the gcc compiler follows the Microsoft C++ convention of using even numbers for suspended states and odd numbers for destroying states. The destroy function just sets the bottom bit of the index and then jumps to the actor function.

Inside the actor function, the code checks the bottom bit of the index and dispatches from two different jump tables, one for even indices and one for odd indices. Curiously, the table for even indices has fatal_error in all the odd slots, and the table for odd indices has fatal_error in all the even slots, so really, they could have been combined into a single table. Not sure what what’s about.

The flags records whether the coroutine function’s parameters have been transferred to the frame. This is used when the frame is destroyed to know whether or not there are parameters in the frame which need to be destructed.

Finally, we come to Microsoft Visual C++ coroutines in C++20 mode. As noted in their blog post, the change was made in order to be ABI-compatible with clang and gcc, so that coroutines from all three compilers can interoperate.

struct coroutine_frame
{
    void (*resume)(coroutine_frame*);
    void (*destroy)(coroutine_frame*);
    promise_type promise;
    parameters...
    uint16_t index;
    uint16_t flags;
    locals...
    temporaries...
    other bookkeeping...
};

The original resume function has been split into separate resume and destroy functions, and the other members of the coroutine frame have been rearranged.

Adding a destroy function to the start of the coroutine frame establishes the de facto common ABI for coroutine frames:

struct coroutine_frame_abi
{
    void (*resume)(coroutine_frame_abi*);
    void (*destroy)(coroutine_frame_abi*);
};

For all four coroutine frame formats, you can figure out what coroutine a coroutine handle corresponds to by dumping the start of the frame and looking at the resume pointer. You can also look at the index to see where in the coroutine’s execution you are, although for Microsoft Visual C++ coroutines in C++20 mode, the index is not at a fixed location, so digging it out will require you to disassemble the resume function to see where it reads the index from.

In all cases, you’ll have to disassemble the resume function to find the jump table (or for clang, the switch statement) but you can then index into that jump table (after adjusting by 1 for the Microsoft C++ compiler) to find the point at which execution is going to resume.

Here’s the cookbook in a table:

  Microsoft Visual C++ clang gcc
Identify coroutine from handle Dump first pointer as a function pointer
Is coroutine destroying? Index is odd (no way to tell) Index is odd
Where will it resume? Disassemble resumption function
add 1 to index
look up in jump table
Disassemble resumption function
 
follow switch statement
Disassemble resumption function
find the right jump table (even/odd)
look up in jump table

¹ I’ve never created a coroutine with more than 32767 suspension points, nor do I have any interest in trying, so I don’t know whether the compiler switches to a 32-bit index or whether it just bails out with “Error: Coroutine has too many suspension points.”

0 comments

Leave a comment