Abstract: Event-driven is a common code model: a main loop continuously receives events from a queue and dispatches them to the corresponding functions or modules for processing. Software commonly built on the event-driven model includes graphical user interfaces (GUI), embedded device software, and network servers.
This article is shared from the Huawei Cloud community post "Application of C++20 Coroutines in Event-Driven Code", original author: Feidele.
The conundrum of embedded event-driven code
Event-driven is a common code model: a main loop continuously receives events from a queue and dispatches them to the corresponding functions or modules for processing. Software commonly built on this model includes graphical user interfaces (GUI), embedded device software, and network servers.
This article uses a highly simplified embedded processing module as its example of event-driven code. Suppose the module needs to handle events such as user commands, external messages, and alarms, dispatching them in its main loop; the sample code is as follows:
#include <iostream>
#include <vector>

enum class EventType {
    COMMAND,
    MESSAGE,
    ALARM
};

// Only used to simulate the received event sequence
std::vector<EventType> g_events{EventType::MESSAGE, EventType::COMMAND, EventType::MESSAGE};

void ProcessCmd()
{
    std::cout << "Processing Command" << std::endl;
}

void ProcessMsg()
{
    std::cout << "Processing Message" << std::endl;
}

void ProcessAlm()
{
    std::cout << "Processing Alarm" << std::endl;
}

int main()
{
    for (auto event : g_events) {
        switch (event) {
            case EventType::COMMAND:
                ProcessCmd();
                break;
            case EventType::MESSAGE:
                ProcessMsg();
                break;
            case EventType::ALARM:
                ProcessAlm();
                break;
        }
    }
    return 0;
}
This is just a minimal model; real code is much more complicated. It may also involve fetching events from specific interfaces, parsing different event types, and using table-driven methods for dispatch... But those details are not closely related to this article, so you can ignore them for now.
Represented as a sequence diagram, the model looks roughly like this:
[Sequence diagram: the main loop takes each event from the queue and calls the corresponding handler to completion before moving to the next event.]
In actual projects, a problem that often comes up is that some events take a long time to process. For example, a command may require thousands of hardware operations to be performed in batches:
void ProcessCmd()
{
    for (int i{0}; i < 1000; ++i) {
        // operate on hardware interfaces...
    }
}
An event handler like this blocks the main loop for a long time, forcing other events to wait in line. If no event requires a fast response, this causes no problems. In real scenarios, however, there are often events that demand a timely response: after certain alarm events, for example, a service switchover must happen quickly, or users will suffer losses. Long-running event handlers then become a real problem.
Some would think of adding an extra thread dedicated to handling high-priority events, and in practice this is indeed a common approach. In embedded systems, however, event handlers read and write many shared data structures and also operate hardware interfaces; calling them concurrently easily causes data races and hardware operation conflicts, and such problems are notoriously hard to locate and fix. What about adding locks on top of the multithreading? Deciding which locks to use and where to place them is itself brain-burning, error-prone work; too much mutual-exclusion waiting also hurts performance and can even cause troublesome problems such as deadlocks.
Another solution is to cut a long-running task into many small tasks and re-add them to the event queue, so that no single task blocks the main loop for long. This avoids all the headaches of concurrent programming, but it brings another problem: how do you cut one large procedure into many independent small ones? When coding, the programmer has to pick apart all of the function's context, design data structures to store it separately, and define special events associated with those structures, as sketched below. This often means several times the code and workload.
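To make the burden concrete, here is a hypothetical sketch of such manual task slicing (all names here are illustrative, not from the original code): the loop counter that used to be a local variable must move into a context object, and the main loop needs a dedicated "continue" event.

#include <algorithm>

struct CmdContext {
    int nextStep{0};  // saved progress of the sliced task
};

void ProcessCmdSlice(CmdContext& ctx)
{
    constexpr int kStepsPerSlice{100};
    const int end{std::min(ctx.nextStep + kStepsPerSlice, 1000)};
    for (; ctx.nextStep < end; ++ctx.nextStep) {
        // operate on hardware interfaces...
    }
    if (ctx.nextStep < 1000) {
        // re-enqueue a hypothetical CONTINUE_CMD event carrying ctx,
        // so the main loop schedules the remaining slices later
    }
}

Every long-running handler needs its own context struct, slicing logic, and continuation event, which is exactly the extra code and workload described above.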
This long-running-event problem exists in almost all event-driven software, but it is especially prominent in embedded software, because CPU and thread resources are limited in embedded environments, real-time requirements are high, and concurrent programming is therefore constrained.
The C++20 language provides a new solution to this problem: coroutines.
Introduction to C++20 coroutines
As for what a coroutine is, Wikipedia [1] and other materials give good introductions, so this article will not repeat them. In C++20, the coroutine keywords are essentially syntactic sugar: the compiler packages the function's execution context (local variables and so on) into an object and lets the not-yet-finished function return to its caller first. Later, the caller can use that object to make the function continue from the original "breakpoint".
With coroutines, you no longer need to painstakingly "cut" a function into small tasks when coding. Just write the function body in the usual style and add a co_yield statement wherever execution may be temporarily interrupted; the compiler then handles the function as something that executes in segments.
Using coroutines feels a bit like thread switching, because the function's stack frame is saved by the compiler as an object and can be restored at any time to keep running. In actual execution, however, coroutines run sequentially on a single thread; there is no physical thread switch, and everything is just compiler "magic". Coroutines therefore completely avoid the performance overhead and resource footprint of multithreaded switching, with no need to worry about data races and similar issues.
Unfortunately, the C++20 standard only provides the basic coroutine machinery and does not ship a truly practical coroutine library (this may improve in C++23). For now, writing real business code with coroutines means using open-source libraries such as the well-known cppcoro [2]. For the scenario in this article, however, cppcoro does not directly provide a matching tool (its generator can solve the problem with some wrapping, as sketched below, but the result is not intuitive), so I wrote a coroutine tool class for task slicing as an example.
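For reference, here is a rough sketch of the generator-based approach (my own illustration, not from the original article): each co_yield ends one segment, and the caller drives the task through the generator's iterator, which is workable but less readable than a dedicated tool class.

#include <cppcoro/generator.hpp>
#include <iostream>

// Each co_yield suspends the coroutine, finishing one segment of work.
cppcoro::generator<int> ProcessCmd()
{
    for (int i{0}; i < 10; ++i) {
        // one segment of hardware work...
        co_yield i;
    }
}

int main()
{
    auto task = ProcessCmd();
    // begin() runs the coroutine up to the first co_yield;
    // each ++it resumes exactly one more segment.
    for (auto it = task.begin(); it != task.end(); ++it) {
        std::cout << "finished step " << *it << std::endl;
        // between increments, the main loop could handle other events
    }
    return 0;
}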
Customized coroutine tool
Below is the code of the SegmentedTask tool class I wrote. The code looks fairly involved, but it exists as a reusable tool: programmers do not need to understand its internals, only how to use it. And using SegmentedTask is very easy: it has just three public interfaces, Resume, IsFinished, and GetReturnValue, whose functions are evident from their names.
#include <optional>
#include <coroutine>

template<typename T>
class SegmentedTask {
public:
    struct promise_type {
        SegmentedTask<T> get_return_object()
        {
            return SegmentedTask{Handle::from_promise(*this)};
        }
        // Start eagerly: run until the first co_yield.
        static std::suspend_never initial_suspend() noexcept { return {}; }
        // Suspend at the end so the handle stays valid for GetReturnValue.
        static std::suspend_always final_suspend() noexcept { return {}; }
        // co_yield std::nullopt suspends the coroutine: one segment done.
        std::suspend_always yield_value(std::nullopt_t) noexcept { return {}; }
        // co_return stores the result; suspension is governed by final_suspend.
        void return_value(T value) noexcept { returnValue = value; }
        static void unhandled_exception() { throw; }
        std::optional<T> returnValue;
    };

    using Handle = std::coroutine_handle<promise_type>;

    explicit SegmentedTask(const Handle coroutine) : coroutine{coroutine} {}
    ~SegmentedTask()
    {
        if (coroutine) {
            coroutine.destroy();
        }
    }
    SegmentedTask(const SegmentedTask&) = delete;
    SegmentedTask& operator=(const SegmentedTask&) = delete;
    SegmentedTask(SegmentedTask&& other) noexcept : coroutine(other.coroutine) { other.coroutine = {}; }
    SegmentedTask& operator=(SegmentedTask&& other) noexcept
    {
        if (this != &other) {
            if (coroutine) {
                coroutine.destroy();
            }
            coroutine = other.coroutine;
            other.coroutine = {};
        }
        return *this;
    }

    void Resume() const { coroutine.resume(); }
    bool IsFinished() const { return coroutine.promise().returnValue.has_value(); }
    T GetReturnValue() const { return coroutine.promise().returnValue.value(); }

private:
    Handle coroutine;
};
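To see the intended call pattern in isolation before the full example below, here is a minimal sketch (CountToThree is a made-up coroutine, used only for illustration):

// Minimal usage sketch; CountToThree is a made-up coroutine.
SegmentedTask<int> CountToThree()
{
    for (int i{0}; i < 3; ++i) {
        co_yield std::nullopt;  // end of one segment, hand control back
    }
    co_return 3;
}

void RunToCompletion()
{
    auto task = CountToThree();  // runs eagerly until the first co_yield
    while (!task.IsFinished()) {
        task.Resume();           // execute the next segment
    }
    int result = task.GetReturnValue();  // 3, the value given to co_return
    (void)result;
}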
Writing coroutine tools yourself not only requires an in-depth understanding of the C++ coroutine mechanism, it is also prone to undefined behavior such as dangling references. It is therefore strongly recommended that a project team share one well-tested coroutine class. Readers who want to learn more about writing coroutine tools can refer to Rainer Grimm's blog posts [3].
Next, we use SegmentedTask to transform the earlier event-handling code. Once any of the keywords co_await, co_yield, or co_return appears in a C++ function, that function becomes a coroutine, and its return type becomes the corresponding coroutine tool class. In the sample code, co_yield is used wherever the inner function needs to hand control back early. C++20 requires co_yield to be followed by an expression; since no value is needed in this scenario, std::nullopt is used to make it compile. In a real business environment, co_yield can yield a number or object describing the current task's progress, so that the outer code can query it.
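As an illustration of that last point, here is a minimal variant sketch (my own modification, not part of the SegmentedTask class shown above): let the promise store the value passed to co_yield and expose it on the task.

// Variant sketch (illustrative modification, not the class above):
// inside promise_type, accept a progress value from co_yield...
std::suspend_always yield_value(int progress) noexcept
{
    currentProgress = progress;  // remembered for the outer code
    return {};
}
int currentProgress{0};

// ...and expose it on the task class:
int GetProgress() const { return coroutine.promise().currentProgress; }

// The coroutine body then writes: co_yield i;  // instead of std::nullopt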
Coroutines cannot use ordinary return statements; they must return values with co_return, and the coroutine's declared return type is not simply the type of the expression after co_return.
#include <iostream>
#include <vector>
// SegmentedTask<T> as defined above

enum class EventType {
    COMMAND,
    MESSAGE,
    ALARM
};

std::vector<EventType> g_events{EventType::COMMAND, EventType::ALARM};
std::optional<SegmentedTask<int>> suspended; // the unfinished task is kept here

SegmentedTask<int> ProcessCmd()
{
    for (int i{0}; i < 10; ++i) {
        std::cout << "Processing step " << i << std::endl;
        co_yield std::nullopt;
    }
    co_return 0;
}

void ProcessMsg()
{
    std::cout << "Processing Message" << std::endl;
}

void ProcessAlm()
{
    std::cout << "Processing Alarm" << std::endl;
}

int main()
{
    for (auto event : g_events) {
        switch (event) {
            case EventType::COMMAND:
                suspended = ProcessCmd();
                break;
            case EventType::MESSAGE:
                ProcessMsg();
                break;
            case EventType::ALARM:
                ProcessAlm();
                break;
        }
    }
    while (suspended.has_value() && !suspended->IsFinished()) {
        suspended->Resume();
    }
    if (suspended.has_value()) {
        std::cout << "Final return: " << suspended->GetReturnValue() << std::endl;
    }
    return 0;
}
To keep the example simple, only one COMMAND and one ALARM are placed in the event queue. COMMAND is handled by a coroutine that executes in stages: after the first stage runs, the main loop first serves the remaining events in the queue, and only then finishes the rest of COMMAND. In real scenarios, various scheduling strategies can be chosen as needed; for example, all unfinished segmented tasks can be kept in a queue and executed in turn when the loop is idle, as sketched below.
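Here is a minimal sketch of that queue-based strategy (the scheduling policy and names are my own illustration, not part of the original example):

#include <deque>
#include <utility>

// Unfinished segmented tasks wait here and are resumed round-robin
// whenever the main loop has no pending events.
std::deque<SegmentedTask<int>> g_pending;

// When a new segmented task starts: g_pending.push_back(ProcessCmd());
void RunOneIdleSlice()
{
    if (g_pending.empty()) {
        return;  // nothing suspended; the loop can block on new events
    }
    auto task = std::move(g_pending.front());
    g_pending.pop_front();
    task.Resume();  // execute exactly one segment
    if (!task.IsFinished()) {
        g_pending.push_back(std::move(task));  // run the rest later
    }
}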
The code in this article was compiled and run with GCC 10.3; the options -std=c++20 and -fcoroutines must be added to enable coroutine support.
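Assuming the example is saved as main.cpp (the file name is my assumption), the compile command looks roughly like this:

g++ -std=c++20 -fcoroutines main.cpp -o main

Running the program gives the following output: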
Processing step 0
Processing Alarm
Processing step 1
Processing step 2
Processing step 3
Processing step 4
Processing step 5
Processing step 6
Processing step 7
Processing step 8
Processing step 9
Final return: 0
As you can see, the for loop in the ProcessCmd function (now a coroutine) did not execute all at once; the execution of ProcessAlm was inserted in the middle. If you inspect the running threads, you will find no physical thread switch during the whole process: all code executes sequentially on the same thread.
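As a quick way to verify this (my own addition, not in the original code), each handler can print the current thread id:

#include <iostream>
#include <thread>

// Print which thread a handler runs on; with coroutines all handlers
// report the same id, confirming there is no physical thread switch.
void TraceThread(const char* where)
{
    std::cout << where << " on thread " << std::this_thread::get_id() << std::endl;
}
// e.g. call TraceThread("ProcessCmd") at the top of ProcessCmd and
// TraceThread("ProcessAlm") at the top of ProcessAlm.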
With the coroutine in place, the sequence diagram becomes this:
[Sequence diagram: the main loop resumes ProcessCmd segment by segment, handling other events between segments.]
A long-running event handler is no longer a problem, because other functions can be "inserted" in the middle of it, after which it returns to its breakpoint and continues running.
Summary
A common misunderstanding is that using multithreading improves software performance. In fact, as long as the CPU is not idling, performance stops improving once the number of physical threads exceeds the number of CPU cores; on the contrary, thread-switching overhead reduces it. In most development practice, the main benefit of concurrent programming is not performance but convenience of coding, because many real-world scenarios are inherently concurrent and map easily onto multithreaded code.
Coroutines can be as convenient and intuitive to code with as threads, yet they carry no physical-thread overhead and none of the heavy design burden of concurrent programming, such as mutual exclusion and synchronization. In many scenarios, such as embedded applications, they are often a better choice than physical threads.
I believe that with the gradual popularization of C++20, coroutines will be more and more widely used in the future.
Endnotes
[1] https://en.wikipedia.org/wiki/Coroutine
[2] https://github.com/lewissbaker/cppcoro
[3] https://www.modernescpp.com/index.php/tag/coroutines