Introduction | This article introduces several high-performance network solutions, including RDMA, HARP, and io_uring, and briefly analyzes them in terms of technical principles and feasibility of adoption, in the hope of offering some experience and help to developers interested in this area.
1. Background
In business, there are often such scenarios:
As NIC speeds keep increasing (10G/25G/100G) and some services pursue extremely low latency (1ms/50us), the kernel protocol stack has gradually become a business bottleneck due to factors such as protocol complexity, lengthy processing paths, and outdated design.
There are already some RDMA and DPDK practices in the industry, but they are still relatively unfamiliar to most developers.
So which scenarios is each of these solutions suited to? Can they empower more businesses? The following is a brief summary of our findings at this stage.
2. RDMA
(1) Introduction to the principle
Compared with the traditional network protocol stack, the key feature of RDMA is Kernel Bypass: a dedicated NIC offloads protocol transmission and encoding/decoding to hardware and interacts with user-space programs directly through memory mapping, thereby avoiding the complex and inefficient mediation of the kernel.
Based on this design, several additional important features are provided:
- Zero-Copy: data is moved by DMA, with no extra CPU-driven copies during communication, reducing CPU consumption.
- Stable, low latency: the fixed hardware data path guarantees stable communication latency.
- Multiple transmission modes: RC, RD, UC, UD, etc. Similar to the TCP/UDP split, the different modes serve services with different reliability and performance requirements.
Since RDMA is positioned as high-performance network transport and also aims to keep the hardware design simple, it generally avoids TCP's complex software reliability mechanisms and instead relies heavily on the reliability of the underlying transport network.
Depending on the underlying transport network, RDMA implementations fall into several categories: InfiniBand (a dedicated fabric), RoCE v1/v2 (over Ethernet), and iWARP (over TCP).
Additional notes:
- Although RoCE v1/v2 relies on converged (lossless) Ethernet, some vendors provide optimized implementations that reduce the dependence on lossless transport.
- Linux kernel 4.9+ implements Soft-RoCE, a software version of RoCE v2, which is mainly used for testing and learning.
(2) RoCE v2 vs iWARP
In an Ethernet environment, the main options are RoCE v2 and iWARP; the two differ mainly in the underlying transport (UDP/IP vs. TCP) and in their dependence on a lossless network.
In current data center network construction, support for RoCE v2 is better, while support for iWARP is still largely absent.
For this reason, our current investigation focuses mainly on RoCE v2, leaving iWARP to be explored later.
(3) Business landing
The mainstream protocol for backend services is still TCP, which has the advantages of stable operation and rich debugging tools. However, for the few services that demand very high performance, RDMA is also worth considering.
Businesses using RDMA mainly face two difficulties:
- RoCE v2's requirement for a lossless network makes cross-datacenter transmission difficult. Currently, Tencent data centers support transmission within a module (for example, within 5 hops).
- New development interfaces such as libverbs and UCX require adaptation work in business software (see the sketch at the end of this section).
Some storage services rely on multiple replicas, and their traffic must cross metropolitan area networks or even cities, which directly makes RoCE v2 difficult to adopt.
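To give a sense of the adaptation cost of the libverbs interface mentioned above, here is a rough, hypothetical resource-setup sketch (resource sizes and names are illustrative; connection establishment, QP state transitions, and error handling are omitted):

```c
/* Hypothetical minimal libverbs setup sketch (error handling trimmed).
 * It only shows the resource-creation path a TCP-based service would
 * need to adopt; exchanging GID/QPN over an out-of-band channel and
 * moving the QP through its state machine are omitted. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* open first NIC */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

    /* Register a buffer so the NIC can DMA into/out of it (zero-copy path). */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* RC is roughly "TCP-like"; IBV_QPT_UD is "UDP-like" */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
    printf("qp created, qp_num=%u\n", qp->qp_num);

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Even this resource-creation step already differs substantially from the socket API, which is exactly the adaptation work the bullet above refers to.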
3. io_uring/socket
(1) Introduction to the principle
io_uring is an asynchronous IO framework supported in Linux 5.1+. Its core advantages are:
- A truly asynchronous design (Proactor), unlike the essentially synchronous readiness model of epoll (Reactor). The key is that the program and the kernel are decoupled through the SQ/CQ queues.
- A unified asynchronous IO framework that covers not only storage and network; thanks to its good extensibility, it can even support other system calls such as openat and stat.
As mentioned above, an io_uring instance creates a pair of queues shared by the kernel and the user program, the submission queue (SQ) and the completion queue (CQ), both following the SPSC pattern:
- SQ: produced by the user-mode thread, after which the kernel (the io_wq kernel threads) is notified via the io_uring_enter system call to consume; the elements are called SQEs.
- CQ: produced by the kernel, which then notifies user mode to consume (waking the user program if it is sleeping in a wait); the elements are called CQEs.
This is actually the most conventional and classic asynchronous model, which can be seen in many asynchronous designs.
In general, SQEs and CQEs correspond one to one, although this is not the case when io_uring's multi-shot mode is used.
In addition, io_uring supports batched production and consumption: multiple SQEs can be produced in a row with the kernel notified only once, and CQEs can be consumed continuously until the CQ is empty.
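A minimal sketch of this SQ/CQ model, assuming liburing is available (the no-op request type and the counts are arbitrary):

```c
/* Minimal SQ/CQ sketch with liburing: four no-op requests are batch-produced
 * into the SQ, submitted with a single system call, and their completions
 * are then drained from the CQ. */
#include <liburing.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)    /* create the SQ/CQ pair */
        return 1;

    /* Batch-produce 4 SQEs, then notify the kernel once. */
    for (long i = 0; i < 4; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);                  /* placeholder request */
        io_uring_sqe_set_data(sqe, (void *)i);   /* tag to match against the CQE */
    }
    io_uring_submit(&ring);                      /* one io_uring_enter() call */

    /* Consume CQEs until all four completions have been seen. */
    for (int i = 0; i < 4; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("request %ld completed, res=%d\n",
               (long)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```

The four SQEs before a single io_uring_submit correspond to the batched notification described above.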
To further optimize performance in certain scenarios, io_uring supports many advanced features (two of them are sketched after this list):
- File Registration: speeds up fd lookup and mapping when the same fd is operated on repeatedly.
- Buffer Registration: in scenarios such as read/write that repeatedly exchange data between kernel and user space, a pre-registered batch of buffers can be reused.
- Automatic Buffer Selection: pre-register a pool of buffers for Proactor-style reads; when data is ready, the kernel automatically picks one block to store it in, reducing allocation/free overhead and saving memory.
- SQ Polling: have a kernel thread (io_wq) poll the SQ, sleeping only after a configurable idle time, thereby reducing notification system calls.
- IO Polling: enable the polling mode of subsystems such as storage and network (requires device driver support) to accelerate high-speed devices; it can be combined with busy-waiting in io_uring_enter (flag: IORING_ENTER_GETEVENTS).
- Multi-Shot: submit once, complete many times. For example, a socket accept only needs to be submitted once and will keep generating completions as new connections arrive.
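As a rough illustration, the sketch below combines two of these features, SQ Polling and File Registration, assuming liburing; kernel and liburing versions (and, on older kernels, privileges for SQPOLL) may impose additional constraints:

```c
/* Hedged sketch: SQ polling (IORING_SETUP_SQPOLL) plus file registration.
 * The file, sizes, and idle time are illustrative only. */
#include <liburing.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    params.flags = IORING_SETUP_SQPOLL;   /* a kernel thread polls the SQ ... */
    params.sq_thread_idle = 2000;         /* ... and sleeps after 2000 ms idle */
    if (io_uring_queue_init_params(64, &ring, &params) < 0)
        return 1;

    /* File Registration: pin fds up front so later SQEs can refer to them
     * by index (IOSQE_FIXED_FILE), avoiding per-request fd lookups. */
    int fd = open("/dev/zero", O_RDONLY);
    int fds[1] = { fd };
    io_uring_register_files(&ring, fds, 1);

    char buf[64];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, 0 /* index into registered files */, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_FIXED_FILE;
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_unregister_files(&ring);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```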
In the storage IO scenario, io_uring shows a clear performance improvement over earlier options such as blocking IO, glibc aio, and Linux native aio.
So what about the network IO scenario? Is it better than epoll and other solutions?
(2) Test data
Our investigation found that well-known open source software such as seastar and nginx has no official support for using io_uring directly for network IO.
Since io_uring is still maturing, there are several possible ways to use it for network IO. So far we have identified three:
- Proactor: io_uring performs recv/send directly.
- Reactor: io_uring monitors the socket fd (POLL_ADD); the application then calls recv/send itself.
- io_uring monitors the epoll fd, followed by epoll_wait and then recv/send: the path is cumbersome and performance is presumed to be poor, so it is skipped.
We therefore tested and compared these io_uring models against the commonly used epoll model.
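For reference, a small sketch of the call-level difference between the two io_uring models, assuming liburing (a socketpair stands in for a real TCP connection; names and buffer handling are illustrative):

```c
#include <liburing.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);     /* stand-in for a TCP connection */
    write(sv[1], "ping", 4);

    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);
    char buf[16];

    /* Model 1: Proactor - io_uring performs the recv itself; the CQE
     * arrives after the data has already been placed in buf. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sv[0], buf, sizeof(buf), 0);
    io_uring_submit(&ring);
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("proactor recv: %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    /* Model 2: Reactor - io_uring only reports readiness (like epoll);
     * the application still calls recv(2) itself afterwards. */
    write(sv[1], "pong", 4);
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, sv[0], POLLIN);
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res & POLLIN) {
        int n = recv(sv[0], buf, sizeof(buf), 0);
        printf("reactor recv: %d bytes\n", n);
    }
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(sv[0]); close(sv[1]);
    return 0;
}
```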
In order to take advantage of more io_uring features, the tests use a recent kernel (5.15). The test model is as follows:
- Communication protocol: TCP echo
- Service model: single-threaded, asynchronous concurrency
- Stress test client: multi-threaded, with each thread driving one connection synchronously
- Data: 512B packets
- Test environment: local communication over the loopback interface
The setups examined (the detailed benchmark data is not reproduced here):
- epoll
- io_uring (Proactor)
- io_uring (Reactor): many programs found online use this approach, but since its performance should theoretically be close to epoll's, it has not been benchmarked yet.
(3) Data analysis
By comparing and analyzing the above test data, the following conclusions can be drawn:
- For network IO, io_uring does not clearly outperform epoll; the main bottleneck of network IO is the overhead of the kernel protocol stack itself.
- Even with kernel-side SQ polling enabled, io_uring reduces latency under low load, but the improvement under full load is not obvious, while CPU resources are wasted.
(4) Business landing
In Linux network IO scenarios, io_uring brings no additional performance gain over epoll, unlike in storage IO scenarios.
However, it is worth considering the case where a system has both network IO and storage IO. Compare the following two approaches:
- Network IO uses epoll, storage IO uses io_uring (which can be bridged into the epoll loop via an eventfd; see the sketch below).
- Both network IO and storage IO use io_uring.
In theory, Mode 2 can rely on optimizations such as io_uring's batched submission to further reduce system calls. Will that improve performance?
This part requires further testing and analysis.
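For reference, a rough sketch of Mode 1, where sockets stay on epoll and storage completions from io_uring are surfaced to the same event loop via a registered eventfd (assuming liburing; error handling and the actual storage submissions are omitted):

```c
/* Rough sketch of Mode 1: epoll drives the sockets, io_uring handles storage,
 * and io_uring signals a registered eventfd whenever a CQE is posted. */
#include <liburing.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);

    /* io_uring signals this eventfd whenever a completion is posted. */
    int efd = eventfd(0, EFD_NONBLOCK);
    io_uring_register_eventfd(&ring, efd);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
    /* ... socket fds would be added to epfd here as usual ... */

    for (;;) {
        struct epoll_event events[16];
        int n = epoll_wait(epfd, events, 16, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == efd) {
                uint64_t cnt;
                read(efd, &cnt, sizeof(cnt));        /* clear the notification */
                struct io_uring_cqe *cqe;
                while (io_uring_peek_cqe(&ring, &cqe) == 0) {
                    /* handle a completed storage request here */
                    io_uring_cqe_seen(&ring, cqe);
                }
            } else {
                /* handle readable/writable sockets with recv/send as before */
            }
        }
    }
}
```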
4. Summary
The above has briefly introduced solutions such as RDMA and io_uring/socket, each with its own advantages, disadvantages, and scenario limitations. The DPDK solution will be introduced in a future article, so stay tuned.
About the Author
quintonwang, Tencent backend development engineer.