C++ coroutines: Cold-start coroutines

Raymond Chen

April 21st, 2021

So far, our coroutine promise has implemented a so-called hot-start coroutine, which is one that begins running as soon as it is created. Another model for coroutines is the so-called cold-start coroutine, which is one that doesn’t start running until it is awaited.

C# and JavaScript use the hot-start model: Creating the coroutine runs the coroutine body synchronously to its first suspension point, and only after the coroutine suspends for the first time is the coroutine returned to the caller. The usual usage pattern is to create the coroutine, and then go do other stuff while the coroutine is running, on the assumption that the synchronous portion of the coroutine is brief, and the expensive portion runs asynchronously. The hot-start model makes it easy to start multiple coroutines in parallel, and await the combined result.

Python uses cold-start coroutines: The coroutine doesn’t start running until you await it. With cold-start coroutines, you need other machinery if you want to do work in parallel with the await, although that machinery could be made relatively simple, like Python’s create_task, which runs a coroutine in an event loop. Cold-start coroutines have simpler bookkeeping since the running and awaiting states are identical, which makes a lot of state transitions impossible.

You could also create a hybrid model where the coroutine is cold-start, but can be manually started. Mind you, doing so reintroduces the state transitions you thought you had simplified away.

The C++ language doesn’t take a position on whether coroutines are hot-start or cold-start, or some hybrid of the two. It just provides the underlying infrastructure, and it’s up to you to decide what you want to build on top of it.

If we define the initial state as cold, then our valid state transitions are as follows:

cold → running → completed → abandoned: This is the common case where the task is awaited and then runs to completion.
cold → abandoned: This is the case where the coroutine is abandoned without ever starting.

From \ To    running             completed         abandoned
cold         Resume coroutine                      Destroy promise
running                          Resume awaiter
completed                                          Destroy promise
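For illustration, the transition table can be written out as code. This is a hypothetical encoding (task_state, transition_action, and is_valid_history are invented names), not part of the promise itself, which never needs to store this state.

```cpp
#include <cassert>
#include <initializer_list>
#include <optional>
#include <string_view>

// States of a cold-start task, matching the table.
enum class task_state { cold, running, completed, abandoned };

// Returns the action that performs a transition, or std::nullopt if the
// transition is not allowed for a cold-start coroutine.
std::optional<std::string_view> transition_action(task_state from, task_state to)
{
    if (from == task_state::cold && to == task_state::running) return "Resume coroutine";
    if (from == task_state::cold && to == task_state::abandoned) return "Destroy promise";
    if (from == task_state::running && to == task_state::completed) return "Resume awaiter";
    if (from == task_state::completed && to == task_state::abandoned) return "Destroy promise";
    return std::nullopt;
}

// True if the history starts cold and every step is an allowed transition.
bool is_valid_history(std::initializer_list<task_state> states)
{
    if (states.size() == 0 || *states.begin() != task_state::cold) return false;
    auto prev = states.begin();
    for (auto it = prev + 1; it != states.end(); ++it, ++prev)
    {
        if (!transition_action(*prev, *it)) return false;
    }
    return true;
}
```

Both paths listed above (cold → running → completed → abandoned, and cold → abandoned) are accepted, while something like cold → running → abandoned is rejected, because a running cold-start coroutine is never destroyed.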

The nice thing about cold-start coroutines is that there are very few transitions, and none of them are contended. Furthermore, the state is completely implied by the actions of the task, so we don’t even need to keep track of it explicitly.

Here’s a sketch of the changes we can make to convert our hot-start coroutine promise to cold-start. Don’t incorporate these yet, for reasons we’ll see next time.

    template<typename T>
    struct simple_promise_base
    {
        ...

        std::experimental::coroutine_handle<> m_waiting{ nullptr };
        simple_promise_result_holder<T> m_holder;

        ...

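        // Abandonment happens only before the coroutine starts or after
        // it completes, so the frame can be destroyed unconditionally.
        void abandon()
        {
            destroy();
        }

        // Cold start: suspend before running any of the body.
        std::experimental::suspend_always initial_suspend() noexcept
        {
            return {};
        }

        // On completion, resume the awaiter, which stored its handle
        // in m_waiting before starting us.
        auto final_suspend() noexcept
        {
            struct awaiter : std::experimental::suspend_always
            {
                simple_promise_base& self;
                void await_suspend(
                    std::experimental::coroutine_handle<>)
                    const noexcept
                {
                    self.m_waiting();
                }
            };
            return awaiter{ {}, *this };
        }

        // Never ready: the body hasn't started until the awaiter starts it.
        bool client_await_ready()
        {
            return false;
        }

        // Register the awaiter, then start the coroutine body.
        auto client_await_suspend(
            std::experimental::coroutine_handle<> handle)
        {
            m_waiting = handle;
            as_handle().resume();
        }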

        ...
    };

What makes this a cold-start coroutine is the fact that the initial_suspend is a suspend_always rather than a suspend_never. This means that the coroutine body doesn’t start until the coroutine is explicitly resume()d.

The other state transitions are significantly simplified. Destroying the coroutine doesn’t need to check whether the coroutine is running, because destruction happens either before the coroutine even starts or after it has completed, never while the coroutine is running. Completing the coroutine can always resume the m_waiting coroutine, because the awaiter stores its handle in m_waiting before resuming the coroutine, so the resumption handle is known to be valid by the time the coroutine completes.

The other wrinkle with cold-start coroutines is that the awaiter is responsible for starting the coroutine, which it does by calling resume() in client_await_suspend.

You may have noticed an inefficiency here: If the coroutine completes synchronously,¹ then we end up calling into the coroutine’s resume(), and the completion calls back into the awaiter’s resume(). This accumulates stack frames, which is a problem for a coroutine that awaits other synchronously-completing coroutines in a loop, since each time through the loop uses another level of stack.

We’ll address this problem next time, and it will require us to bring back some of the code we deleted, which is why I warned you not to incorporate it yet.

¹ “But why bother making it a coroutine if it completes synchronously?” The operation might complete synchronously under certain conditions, but asynchronously under other conditions. For example, the operation might complete asynchronously if a helper object needs to be started, but synchronously if the helper object is already up and running.

3 comments

David Haim April 21, 2021 8:33 am

A. I think the terms “eager tasks” and “lazy tasks” are more common in the C++ world.

B. Lazy tasks are supposed to have one killer benefit over eager tasks: performance. They don’t need any inter-thread synchronization when resuming the caller coroutine, plus the compiler may elide the memory allocation of the lazy task and allocate this memory in the parent caller’s stack (HALO).
The downside is that you lose two concurrency models that almost every concurrent application needs: fire and forget, and fire and consume later.
Also, you cannot implement “when_any” for lazy coroutines. It’s simply impossible.

Internal benchmarking I’ve done for my library (concurrencpp) shows that lazy tasks might have about a 5% to 8% performance gain over eager tasks, which I consider marginal given the huge downsides. I still added them to my library (still in the develop branch) because they are so easy to implement.

Blaise Lengrand April 22, 2021 1:57 pm

First of all, thanks Raymond for this article and David for this very informative comment.
Out of curiosity, why can’t “when_any” be implemented with lazy coroutines? If you don’t mind, could you elaborate a bit more? I am really interested in this topic and am trying to implement such functionality myself, but I am still in the learning phase…
- David Haim April 24, 2021 1:42 am
  
  Lazy coroutines have only the performance benefit over eager tasks, if you think about it. If you don’t want your coroutine to start, just don’t fire it; fire it only when you actually need its result.
  Lazy coroutines don’t provide anything else over eager tasks besides performance.
  
  In order for HALO to kick in, the destructor of the “future-like object” (be it a “task”, “result”, or “future”) associated with the running task must call coroutine_handle::destroy, and the destruction must happen in the scope of the parent coroutine. This way, the compiler knows that the callee coroutine starts and finishes within the parent coroutine’s scope, and it can allocate its memory from the calling stack.
  Caller-destroys-the-callee is an axiom when talking about lazy coroutines.
  Eager coroutines, on the other hand, clean themselves up.
  
  So basically, all lazy coroutines behave the same when fired:
  * start the callee coroutine suspended.
  * when awaited, store the parent coroutine handle in the callee coroutine promise and resume the callee coroutine.
  * when the callee coroutine finishes, store the result in the promise body
  * resume the parent coroutine
  * pull out the stored result
  * destroy the callee coroutine
  
  This way, the compiler can implement HALO for lazy coroutines.
  
  when_all is very similar, with a twist:
  * start all coroutines suspended
  * have an atomic counter set to the number of tasks
  * iterate over your coroutines and resume each one
  * when a coroutine finishes, it decrements the counter
  * if the counter reaches 0, resume the parent coroutine
  
  This way, you can guarantee that the parent coroutine is resumed only when all callee coroutines are done, ready to be consumed and destroyed.
  Again, note that the parent coroutine always destroys the callee, preserving the universal order of lazy coroutines.
  
  when_any is basically a contradiction of this paradigm: when_any by definition means “resume the parent task when at least one task is done”.
  So if you resume the parent coroutine when not all callee coroutines are done, who will destroy them? By the time all tasks are done, millions of other coroutines could have executed, and the original coroutine may not exist anymore. Since lazy coroutines don’t know how to clean themselves up, when_any is not possible for lazy coroutines.
  If you look closely, no lazy coroutine library (cppcoro, folly’s task implementation, etc.) provides when_any, but all of them provide when_all.
