The zero copy discussed in this article is considered in the context of network transmission.
What is zero copy
Zero copy does not mean that no copying happens at all; it means reducing the number of unnecessary copies.
Traditional IO process
Usually, when we need to access data on the hard disk, the user process has to go through the kernel: the user process informs the kernel by invoking system calls such as read() and write(), and the kernel does the corresponding work on its behalf.
read();
The traditional process of reading data:
The copy process before DMA existed, shown in the figure above, goes like this:
- The user process calls the read() system call
- After the CPU receives the read request, it issues the corresponding command to the disk
- The disk prepares the data, places it in the disk controller's buffer, and raises an I/O interrupt to the CPU
- When the CPU receives the interrupt, it pauses its current work and copies the data from the disk buffer into the kernel buffer
- The CPU then copies the data from the kernel buffer into the user buffer
- At this point, the user process can access the data
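From the user process's point of view, the whole sequence above is triggered by a single read() call. A minimal sketch using Python's thin wrappers over the same system calls (the temporary file is an assumption of the sketch, standing in for data already on disk):

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained;
# it stands in for data that already lives on the disk.
tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
with open(path, "wb") as f:
    f.write(b"hello zero copy")

# read() drives the whole path above:
# disk -> kernel buffer -> user buffer (two copies, two context switches).
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 4096)
os.close(fd)
os.remove(path)
print(data)  # b'hello zero copy'
```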
In the process above, every data copy has to be performed by the CPU. The CPU is a very precious resource: while it is copying data it cannot do anything else, and if the amount of data transferred is large, the CPU spends all its time copying instead of doing other work, which is very costly.
DMA
In essence, DMA is an independent chip on the computer motherboard. When the computer needs to transfer data between memory and an I/O device, the time-consuming I/O work no longer has to be performed by the CPU; it is handled by the DMA controller instead. The process is as follows.
As the figure above shows, the data copy is performed by DMA, so the CPU no longer has to perform the time-consuming I/O operations itself.
The following figure illustrates the file-transfer process more vividly:
The steps are explained as follows:
- The user process calls the system function read()
- After the kernel receives the request, it reads the file from the disk into the kernel buffer; once the data is ready, an I/O interrupt is raised
- After the CPU receives the I/O interrupt signal, it stops its current work and copies the data from the kernel buffer to the user buffer
- The user process then calls the system function write(), and the CPU copies the data from the user buffer into the socket buffer
- Finally, the DMA controller copies the data from the socket buffer to the network card for transmission.
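The four steps above can be sketched as the classic read()/write() loop. In this sketch a second temporary file stands in for the socket (an assumption made so the example is self-contained); the copy pattern, kernel buffer to user buffer to destination buffer, is the same:

```python
import os
import tempfile

src_fd, src = tempfile.mkstemp()
dst_fd0, dst = tempfile.mkstemp()  # stands in for the socket in the diagram
os.close(src_fd)
os.close(dst_fd0)
with open(src, "wb") as f:
    f.write(b"payload" * 100)

in_fd = os.open(src, os.O_RDONLY)
out_fd = os.open(dst, os.O_WRONLY)
while True:
    chunk = os.read(in_fd, 4096)   # CPU copy: kernel buffer -> user buffer
    if not chunk:
        break
    os.write(out_fd, chunk)        # CPU copy: user buffer -> "socket" buffer
os.close(in_fd)
os.close(out_fd)
```

Each iteration costs two system calls, hence four context switches, plus two CPU copies on top of the two DMA copies.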
The traditional I/O data path above leaves plenty of room for performance improvement.
As the figure shows, in the file-transfer case the data is copied into the user buffer, but the user process sends the file onward without processing the data at all. This step is therefore redundant and can be omitted.
Achieving zero copy
Zero copy is achieved mainly by optimizing two things: the number of context switches and the number of data copies. The goal is reached by reducing both.
Implementation method 1: mmap(..) + write(..)
What is mmap
mmap stands for memory-mapped files, a method of mapping files into memory. It maps a file (or another object) into the process's address space, establishing a one-to-one mapping between the file's disk address and a virtual address in the process's virtual address space. Once the mapping is established, the user process can manipulate the file data in memory through a pointer, and the system automatically writes the modified data back to disk, with no need to call read(), write(), or other system calls to manipulate the data.
Implementation process
The mmap() function replaces the read() function: mmap maps the data in the kernel buffer into user space, so user space and the kernel can share the data without copying it between them.
As can be seen in the figure, the data is no longer copied into a separate user buffer:
- After the user process calls the system function mmap(), the DMA controller copies the data from the disk into the kernel buffer, which the user process then shares with the kernel;
- The user process calls the write() function, and the CPU copies the data from the kernel buffer to the socket buffer;
- Finally, DMA copies the data in the socket buffer to the network card for data transmission.
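The three steps above can be sketched with Python's mmap module, which wraps the same mmap() system call. As before, a temporary destination file stands in for the socket (an assumption of the sketch):

```python
import mmap
import os
import tempfile

src_fd, src = tempfile.mkstemp()
dst_fd0, dst = tempfile.mkstemp()  # stands in for the socket
os.close(src_fd)
os.close(dst_fd0)
with open(src, "wb") as f:
    f.write(b"mapped data")

in_fd = os.open(src, os.O_RDONLY)
out_fd = os.open(dst, os.O_WRONLY)

# mmap() maps the kernel buffer (page cache) into our address space,
# so no extra CPU copy into a private user buffer is needed.
mapped = mmap.mmap(in_fd, 0, prot=mmap.PROT_READ)
os.write(out_fd, mapped)  # CPU copy: kernel buffer -> "socket" buffer
mapped.close()
os.close(in_fd)
os.close(out_fd)
```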
mmap eliminates one data copy and improves performance, but there are still four switches between user mode and kernel mode, so it is not the most ideal zero copy.
How to reduce context switching?
The user process has no permission to manipulate the data on the disk directly; only the kernel has that power. The user process therefore hands the task over to the kernel by making system calls such as read() and write().
A system call incurs two context switches: first a switch from user mode to kernel mode to execute the task, then, once the task completes, a switch from kernel mode back to user mode so the user process can continue its logic.
Context switching takes time. Each switch costs from a few nanoseconds to a few microseconds; that sounds very short, but under high concurrency it multiplies.
Therefore, to reduce the number of context switches, we need to reduce the number of system calls.
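A rough way to feel this cost is to time a loop of cheap system calls. The absolute numbers vary widely by machine and kernel, so treat this only as an illustration:

```python
import os
import time

N = 100_000
start = time.perf_counter()
for _ in range(N):
    os.getpid()  # roughly one system call each: user -> kernel -> user
elapsed = time.perf_counter() - start

# Average cost per call, in nanoseconds (machine-dependent).
per_call_ns = elapsed / N * 1e9
print(f"~{per_call_ns:.0f} ns per call")
```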
Implementation method 2: sendfile function
Since version 2.1, Linux has provided a system call for sending files, sendfile(). Its signature is as follows:
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
Parameter description:
out_fd: the destination file descriptor
in_fd: the source file descriptor
offset: the offset within the source file
count: the number of bytes to copy
The return value is the number of bytes actually copied.
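Python exposes the same system call as os.sendfile(), with the same four parameters. On recent Linux kernels out_fd may also be a regular file, so this sketch uses a temporary file in place of the socket (an assumption made so the example is self-contained):

```python
import os
import tempfile

src_fd, src = tempfile.mkstemp()
dst_fd0, dst = tempfile.mkstemp()  # stands in for the socket
os.close(src_fd)
os.close(dst_fd0)
with open(src, "wb") as f:
    f.write(b"sendfile demo")

in_fd = os.open(src, os.O_RDONLY)
out_fd = os.open(dst, os.O_WRONLY)
# One system call replaces the read()/write() pair; the kernel moves
# the data without ever copying it into user space.
sent = os.sendfile(out_fd, in_fd, 0, os.path.getsize(src))
os.close(in_fd)
os.close(out_fd)
print(sent)  # number of bytes actually copied
```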
The sendfile function replaces the read/write pair, so one system call, and with it the overhead of two context switches, is saved. Furthermore, if the network card supports SG-DMA (scatter-gather DMA), even the CPU copy from the kernel buffer to the socket buffer can be eliminated. Whether the card supports scatter-gather can be checked with ethtool:
$ ethtool -k eth0 | grep scatter-gather
scatter-gather: on
- DMA copies the data from the disk into the kernel buffer
- Only the file descriptor and the data length are passed to the socket buffer; the data itself is not copied there
- The network card's SG-DMA controller copies the data straight from the kernel buffer to the network card, completing the transmission
The above process involves only one system call, two context switches, and two DMA data copies; the CPU never has to copy the data, achieving true zero copy.
Comparison of mmap and sendfile
- Both are implemented through API functions provided by the operating system
- mmap memory-maps the file: the user process can read and write the mapped memory, and the changes are ultimately reflected on disk
- sendfile reads the data into the kernel buffer, and the network card then copies it out via its SG-DMA controller
- Zero copy via mmap + write still involves two system calls, four context switches, and three data copies, so it is not true zero copy
- sendfile needs only one system call, two context switches, and two necessary data copies, achieving zero copy in the true sense
- mmap is better suited to optimizing write requests, while sendfile is better suited to optimizing read requests
Kernel buffer (PageCache)
PageCache is the disk's cache in memory. Since locating data on the disk is a very time-consuming operation, part of the disk data is cached in the PageCache, turning disk reads and writes into memory operations and improving read and write efficiency.
The PageCache is much smaller than the disk, so we cannot keep all of the disk's data in it. Which data should be read into memory, and how much should be read?
PageCache uses read-ahead. If we ask for 32 KB of data, the kernel does not load only those 32 KB: it reads in page-sized units and reads ahead, so in addition to the 0-32 KB range it may also fetch the 32-64 KB range. Reading that extra range up front costs very little, and if the process uses it before the pages are evicted from memory, the payoff is large.
So PageCache has two main benefits:
- Cache recently accessed data
- Pre-reading function
To put it bluntly, PageCache exists to improve disk read and write performance.
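The read-ahead behaviour can be nudged from user space with posix_fadvise(). POSIX_FADV_SEQUENTIAL tells the kernel we intend to read the file front to back, so it may read ahead more aggressively into the PageCache (the effect is a hint, not a guarantee):

```python
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
with open(path, "wb") as f:
    f.write(b"x" * (1 << 20))  # a 1 MiB file

fd = os.open(path, os.O_RDONLY)
# Hint: sequential access ahead, so read-ahead is worthwhile.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
first_chunk = os.read(fd, 32 * 1024)  # later reads are likely cache hits
os.close(fd)
os.remove(path)
```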
Summary
- Zero copy does not mean no copying at all; it means reducing unnecessary copies and avoiding CPU-performed data copies wherever possible
- DMA copying is a good substitute for CPU copying
- sendfile() achieves zero copy in the true sense, requiring only 2 DMA copies, 1 system call, and 2 context switches
Discussion
- PageCache memory is limited. If we read a very large file, the PageCache fills up quickly, and if the large file occupies the PageCache for a long time, other hot data can no longer benefit from it, so disk performance degrades. What should we do?
Answer, in short: in this case use asynchronous I/O plus direct I/O; that is, find a way to bypass the PageCache. Large files should not go through the PageCache. Direct I/O bypasses the PageCache entirely, and because direct I/O reads are blocking, consider pairing them with asynchronous I/O.
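One concrete tool on this path is posix_fadvise() with POSIX_FADV_DONTNEED, which tells the kernel that the pages of a large file may be dropped from the PageCache once we are done with them. (O_DIRECT itself is not shown here, because it requires aligned buffers and is not supported on every filesystem.) A sketch using a temporary file as a stand-in for a "large" file:

```python
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
with open(path, "wb") as f:
    f.write(b"y" * (1 << 20))  # stand-in for a "large" file

fd = os.open(path, os.O_RDONLY)
total = 0
while True:
    chunk = os.read(fd, 64 * 1024)
    if not chunk:
        break
    total += len(chunk)
# We will not touch these pages again: let the kernel evict them
# so they do not crowd hot data out of the PageCache.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)
os.remove(path)
```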
- Why does RocketMQ use mmap instead of sendfile?
Everyone is welcome to discuss.
Text/Carpenter
Follow Dewu Technology, and let's walk to the cloud hand in hand.