November 15th, 2023

Why does calling a coroutine allocate a lot of stack space even though the coroutine frame is on the heap?

Consider the following:

#include <coroutine>
#include <exception>

// Define a coroutine type called "task"
// (not relevant to scenario but we need *something*.)
struct task { void* p; };

namespace std
{
    template<typename...Args>
    struct coroutine_traits<task, Args...>
    {
        struct promise_type
        {
            task get_return_object() { return { this }; }
            void unhandled_exception() { std::terminate(); }
            void return_void() {}
            suspend_never initial_suspend() { return {}; }
            suspend_never final_suspend() noexcept { return {}; }
        };
    };
}
// End of "task" boilerplate.

void consume(void*);

task sample()
{
    char large[65536];
    consume(large);
    co_return;
}

The sample coroutine function consumes a large buffer, but it never suspends, and the coroutine frame is destroyed before the function returns. This means that the function is a good candidate for heap elision, specifically Heap Allocation eLision Optimization (HALO), which permits the compiler to optimize out the entire heap allocation and put the coroutine frame on the stack:

# clang -O3
sample():
        sub     rsp, 65576
        lea     rdi, [rsp + 32]
        call    consume(void*)@PLT
        lea     rax, [rsp + 16]
        add     rsp, 65576
        ret

The Microsoft Visual C++ compiler’s code generation for coroutines (in version 19.34.31931.0) performs the initialization in a function named $InitCoro$2, with the declaration

void sample$InitCoro$2(
    T* __coro_return_value,
    bool const& __coro_heap_ellision,
    void* __coro_frame_ptr,
    Args&&... args);

where T is the return type of the coroutine function (task in this example) and Args&&... are the parameters to the coroutine function (empty, in this case).

The __coro_frame_ptr points to a block of memory that will be used as the coroutine frame.

The __coro_heap_ellision parameter¹ is true if the frame is on the stack and false if the frame is on the heap.

The __coro_return_value points to the place to put the T returned by (or constructed from) the promise’s get_return_object().

The $InitCoro$2 function initializes the coroutine frame, obtains the return object, and then runs the coroutine until its first suspension point, and then returns.

The code generation for the coroutine function goes roughly like this:

task sample()
{
    char __coro_elision_buffer[sizeof(coroutine_frame)];
    bool __coro_heap_elision = false;
    void* __coro_frame_ptr = __coro_heap_elision
        ? __coro_elision_buffer
        : operator new(sizeof(coroutine_frame));
    char alignas(task) __coro_return_value[sizeof(task)];

    sample$InitCoro$2(&__coro_return_value,
        __coro_heap_elision,
        __coro_frame_ptr);

    return reinterpret_cast<task&>(__coro_return_value);
}

At optimization level 2, the Microsoft Visual C++ compiler propagates the __coro_heap_elision constant into the ternary, resulting in

task sample()
{
    char __coro_elision_buffer[sizeof(coroutine_frame)];
    char alignas(task) __coro_return_value[sizeof(task)];

    sample$InitCoro$2(&__coro_return_value,
        false,
        operator new(sizeof(coroutine_frame)));

    return reinterpret_cast<task&>(__coro_return_value);
}

However, it fails to realize that the __coro_elision_buffer is now a dead variable, so the function allocates stack space for a buffer it never uses.²

This can be a problem if your coroutine has a large frame, because it will consume a lot of stack that may cause you to stack overflow prematurely. If you want to make sure a coroutine local variable goes on the heap, you should put it explicitly on the heap.

task sample()
{
    auto large = std::make_unique_for_overwrite<char[]>(65536);
    consume(large.get());
    co_return;
}

¹ Yes, the word elision is misspelled.

² This missed optimization is on the backlog.

Topics
Code

Author

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

2 comments

Discussion is closed. Login to edit/delete existing comments.

  • 紅樓鍮
    task sample()
    {
        char __coro_elision_buffer[sizeof(coroutine_frame)];
        // ...
    }

    I’d imagine this isn’t how the coroutine ramp actually works, since if heap elision happens then the returned task will point to __coro_elision_buffer which will be deallocated when sample returns. I’d believe the actual implementation allocates the heap-elided coroutine frame on the stack frame of sample‘s caller (or transitively its caller, if it in turn returns the task that sample returned to it).

  • Wojciech Gebczyk

    Hey!

    I went to link with your suggestion for the optimization.
    Reading thru the comments looks like “the bot” gives some answer in style “yeah we will maybe look at that, go away”. Then real person jumps in a few day later (yeah… two weeks, but “a few days” looks better), slaps the bot saying “behave yourself, it’s Raymond!” and happily overrides everything with “Hey, sure, right now, we will do it! Thanks Raymond!”