Author: High Performance Storage SIG
Containerization has been a popular trend in the DevOps world in recent years. By containerizing a workload, we create a fully packaged, self-contained computing environment that lets software developers build and deploy applications faster. However, for a long time, limitations of the image format made loading the container startup image very slow; for the background, please refer to "Container Image of Container Technology". To speed up container startup, optimized container images can be combined with technologies such as P2P networks, which effectively reduces container deployment and startup time and keeps containers running continuously and stably; see "Making container application management faster and safer, Dragonfly released the Nydus Container Image Acceleration Service".
Beyond startup speed, core features such as image layering, deduplication, compression, and on-demand loading are also particularly important in the container image field. Since there was no native file system support, most solutions opted for a userland implementation, and Nydus initially did the same. As solutions and requirements kept evolving, the userland approach ran into more and more challenges, such as a large performance gap compared with native file systems and high resource overhead in high-density scenarios. The root cause is that image format parsing, on-demand loading, and other operations are implemented in user mode, which incurs a lot of kernel-mode/user-mode communication overhead. It would be careless to solve the startup speed problem only to leave the development of these core capabilities facing so many challenges.
So is it possible to have it both ways? To this end, the OpenAnolis community made a bold attempt: we designed and implemented the RAFS v6 format, compatible with the kernel's native EROFS file system, hoping to push the container image solution down into the kernel. To settle the matter once and for all, we also tried to upstream this solution into the Linux mainline so that more people could benefit. In the end, after challenges from the Linux kernel maintainers and our continuous improvements, the erofs over fscache on-demand loading technology was merged into the 5.19 kernel (see the links at the end of the article), and the next-generation container image distribution scheme of the Nydus image service has gradually become clear. It is also the first out-of-the-box container image distribution solution natively supported by the Linux mainline kernel. High density, high performance, high availability, and ease of use for container images will no longer be a problem! The author treated himself to an extra chicken drumstick for this, haha!
This article introduces the evolution of this technology from three perspectives: a review of the Nydus architecture, the RAFS v6 image format, and the EROFS over fscache on-demand loading technology, and demonstrates the excellent performance of the current solution with comparison data. We hope everyone can soon enjoy a container startup experience that feels like flying!
Nydus Architecture Review
To sum it up in one sentence, the Nydus image acceleration service is an image acceleration implementation that optimizes the existing OCIv1 container image architecture, designs the RAFS (Registry Acceleration File System) on-disk format, and ultimately presents the container image as a file system.
The fundamental requirement of a container image is to provide the container's root directory (rootfs), which can be carried by a file system or by an archive format. It can even be nested a second time on top of a file system (for example, carried through a custom block format), but the essential carrier is a directory tree, ultimately exposed through the file interface.
Let's look at the OCIv1 standard image first. The OCIv1 format is an image format specification based on the Docker Image Manifest Version 2 Schema 2 format. It consists of a manifest, an optional image index, a series of container image layers, and a configuration file; for details, please refer to the related documents, which this article will not repeat. Essentially, an OCI image is a layer-based image format, where each layer stores file-level diff data in tgz archive format, as follows:
Due to the limitations of tgz, OCIv1 has some inherent problems, such as the inability to load on demand, coarse layer-level deduplication granularity, and unstable per-layer hash values.
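For a concrete feel of the OCIv1 layout, you can list the tgz layers of any published image. The sketch below assumes skopeo and jq are installed and uses a public nginx image purely as an example:

```bash
# List the layer digests of an OCIv1 image; each digest points to one tar.gz
# archive that holds that layer's file-level diff data.
skopeo inspect docker://docker.io/library/nginx:latest | jq '.Layers'

# Fetch the raw manifest to see the full structure: config plus ordered layers.
skopeo inspect --raw docker://docker.io/library/nginx:latest | jq .
```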
And some "secondary nesting doll" schemes (such as container image schemes based on custom block format) also have some principled design flaws. E.g:
- The container image must ultimately be presented as a directory tree, so a corresponding file system (such as ext4) is still needed to carry it. The whole link then becomes "custom block format + user-mode block device + file system", which is longer and more complex than a file-system-based scheme, and its end-to-end stability is uncontrollable;
- Since the block format is unaware of the upper file system, it cannot distinguish the file system's metadata from its data and handle them separately (for example, compress them differently);
- Unable to implement file-based image analysis features such as security scanning, hotspot analysis, and runtime interception;
- For multiple "secondary nesting doll" container images, it is impossible to directly merge the blob content into a large image without modifying the blob content, and it is also impossible to filter some files to form a sub-image without modifying the blob content, which is the file system. the natural capacity of the programme;
- Block devices + traditional file systems do not support rootless mounting, which is a requirement for rootless containers.
The Nydus we implemented is a file-system-based container image storage solution. It separates the container image file system's data (blobs) from its metadata (bootstrap), so that the original image layers store only the file data. Files are further split into chunks, and each layer's blob stores the corresponding chunk data. Using chunk granularity refines deduplication: chunk-level deduplication makes it easier to share data between layers and between images, and also makes on-demand loading easier to implement. Because the metadata is separated out and merged into one place, accessing metadata does not require pulling the corresponding blob data, so far less data needs to be pulled and I/O efficiency is higher. The Nydus RAFS image format is shown below:
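Beyond the format diagram, here is a rough illustration of how an existing OCIv1 image is turned into a RAFS image with the Nydus tooling. The command names come from the Nydus project, but the exact flags vary between releases, so treat this as a sketch rather than exact usage:

```bash
# Convert an existing OCIv1 image into a Nydus (RAFS) image and push it to a
# registry; the result separates chunked data blobs from the bootstrap metadata.
nydusify convert \
  --source registry.example.com/app:v1 \
  --target registry.example.com/app:v1-nydus

# Alternatively, build a RAFS image directly from a local rootfs directory
# (flag names are indicative; check `nydus-image create --help` on your version).
nydus-image create --bootstrap ./bootstrap --blob-dir ./blobs ./rootfs
```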
RAFS v6 image format
RAFS image format evolution
Before the RAFS v6 format was introduced, Nydus used a fully userland image format implementation, served through the FUSE or virtiofs interface. However, the user-mode file system scheme has the following design defects:
- Non-negligible system call overhead, for example with random small I/O accesses at queue depth 1;
- When a container image contains a large number of files, frequent file operations generate a large number of FUSE requests, causing frequent kernel-mode/user-mode context switches and creating a performance bottleneck;
- In non-FSDAX scenarios, buffer copies from user mode to kernel mode consume CPU;
- In FSDAX scenarios (with virtiofs as the interface), a large number of small files occupy a large amount of DAX window resources, causing potential performance jitter; frequently switching between small files also incurs significant DAX mapping setup overhead.
These problems stem from the inherent limitations of the user-mode file system design. If the container image format implementation is pushed down into the kernel, the above problems can in principle be solved at the root. We therefore introduced the RAFS v6 image format, a container image format implemented in the kernel on top of the EROFS file system.
Introduction to the EROFS file system
The EROFS file system has been in the Linux mainline since Linux 4.19. In the past it was mainly used on embedded devices and mobile terminals, and it is available in today's popular distributions (such as Fedora, Ubuntu, Arch Linux, Debian, Gentoo, etc.). The userland tool erofs-utils is also packaged in these distributions and listed in the OIN Linux System Definition List, and the community is quite active.
The EROFS file system has the following characteristics:
- A native, local, read-only, block-based file system suitable for various scenarios, with a minimum I/O unit defined by the on-disk format;
- page-sized block-aligned uncompressed metadata;
- Effectively save space through Tail-packing inline technology while maintaining high access performance;
- Data is addressed in block units (mmap I/O friendly, no post I/O processing required);
- Random access friendly disk directory format;
- A very simple core on-disk format that is easy to extend with new payloads, giving good scalability;
- Support DIRECT I/O access, support block devices, FSDAX and other backends;
- In addition, EROFS reserves a boot sector, which can support requirements such as bootloader self-boot.
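To get a feel for EROFS itself, here is a minimal sketch of building and mounting an image with erofs-utils (assuming the package is installed and the running kernel has EROFS enabled):

```bash
# Pack a directory into a read-only EROFS image; -zlz4hc enables LZ4HC
# compression (optional, uncompressed images also work).
mkfs.erofs -zlz4hc rootfs.erofs ./rootfs

# Mount it through a loop device; EROFS has been in mainline since Linux 4.19.
sudo mount -t erofs -o loop rootfs.erofs /mnt/erofs
ls /mnt/erofs
```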
RAFS v6 image format
Over the past year, the Alibaba Cloud kernel team has made a series of improvements and enhancements to the EROFS file system, expanding its use in cloud-native scenarios and adapting it to the needs of container image storage, finally presenting it as RAFS v6, a container image format based on a kernel file system. In addition to pushing the image format down into the kernel, RAFS v6 also brings a series of optimizations to the image format, such as block alignment and more compact metadata.
The new RAFS v6 image format is as follows:
The improved Nydus image service architecture is shown in the following figure; it adds support for the (EROFS-based) RAFS v6 image format:
EROFS over Fscache on-demand loading technology
erofs over fscache is the next-generation container image on-demand loading technology developed by the Alibaba Cloud kernel team for Nydus. It is also the Linux kernel's native on-demand image loading capability, and was merged into the Linux kernel mainline in version 5.19.
It was also picked by LWN.net, the authoritative Linux kernel news site, as one of the highlights of the 5.19 merge window (see the link at the end of the article):
Before this, almost all on-demand loading solutions in the industry were user-mode solutions, which involve frequent kernel-mode/user-mode context switches and memory copies between kernel and user space, creating a performance bottleneck. This problem is especially prominent once the entire container image has been downloaded locally: file accesses during container runtime still trap into the user-mode service process.
In fact, the two operations involved in on-demand loading, 1) cache management and 2) fetching data through various channels (such as the network) on a cache miss, can be decoupled. Cache management can be pushed down into the kernel, so that once the image is locally available, kernel-mode/user-mode context switches are avoided entirely. This is exactly the value of the erofs over fscache technology.
Scheme principle
fscache/cachefiles (hereinafter collectively referred to as fscache) is a relatively mature file caching solution in the Linux kernel, widely used by network file systems (such as NFS and Ceph). We enhanced and extended it to support the on-demand loading feature of a local file system (e.g., erofs). In this scheme, fscache takes over the cache management work.
When a container accesses the container image, fscache checks whether the currently requested data is already cached. On a cache hit, the data is read directly from the cache file; this entire process stays in kernel mode and never traps into user space.
Otherwise (a cache miss), the user-mode nydusd daemon is notified to handle the request while the container process goes to sleep and waits. nydusd fetches the data from the remote end over the network, writes it into the corresponding cache file through fscache, and then wakes up the waiting process; the container process can then read the data from the cache file.
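The following sketch shows roughly what the kernel-side pieces look like on a 5.19+ kernel. In practice nydusd drives the cachefiles on-demand interface for you, so the steps and option names below are illustrative assumptions rather than a complete setup:

```bash
# cachefilesd manages the local directory that backs fscache; `dir` and `tag`
# are standard /etc/cachefilesd.conf options.
grep -E '^(dir|tag)' /etc/cachefilesd.conf
sudo systemctl start cachefilesd

# In fscache mode an erofs image is mounted by fsid rather than by block
# device; cache misses are forwarded to the user-mode daemon (e.g. nydusd),
# while cache hits are served entirely in kernel mode.
sudo mount -t erofs none -o fsid=myimage /mnt/container-rootfs
```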
Solution advantages
As described earlier, when the image data has already been downloaded locally, the user-mode scheme still causes file-accessing processes to frequently trap into user mode and incurs memory copies between kernel and user space. With erofs over fscache, file access no longer traps into user mode, so on-demand loading becomes truly "on demand": when the container image has been downloaded in advance, performance and stability are nearly lossless. The result is a truly unified, lossless solution for both scenarios: 1) on-demand loading and 2) downloading the container image in advance.
Specifically, erofs over fscache has the following advantages over user-mode solutions.
- Asynchronous prefetch
After the container is created, before the container process triggers on-demand loading (a cache miss), the user-mode nydusd can already start downloading data over the network and writing it into the cache file. Later, when the file range the container accesses happens to fall within the prefetched range, a cache hit occurs and the data is read directly from the cache file, without trapping into user mode. The user-mode scheme cannot achieve this optimization.
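The prefetch itself is driven by the user-mode daemon. The snippet below sketches how it is typically switched on in a nydusd RAFS configuration; the field names are assumptions based on public Nydus documentation and may differ between versions:

```bash
# Illustrative nydusd RAFS configuration with filesystem prefetch enabled;
# consult the Nydus docs for the exact schema of your release.
cat > rafs-config.json <<'EOF'
{
  "device": { "backend": { "type": "registry" } },
  "mode": "direct",
  "fs_prefetch": { "enable": true, "threads_count": 4 }
}
EOF
```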
- Network IO optimization
When on-demand loading is triggered (a cache miss), nydusd can download more data from the network than the current request actually needs and write it into the cache file. For example, a container access to 4KB of data may trigger a cache miss while nydusd downloads 1MB at a time, reducing the network transfer latency per unit of file size; when the container later accesses the rest of that 1MB, it no longer needs to trap into user mode. The user-mode scheme cannot benefit in the same way: although it can also perform this read amplification on a cache miss, it manages the cache in user mode, so when the container later accesses file data that falls within the amplified range, it still has to trap into user mode to check whether that data is already cached.
- Better performance
When the image data has already been downloaded locally (that is, ignoring the impact of on-demand loading), erofs over fscache performs significantly better than the user-mode scheme and is comparable to the native file system, thereby approaching the performance of the native container image scheme (which does not implement on-demand loading). Below is the performance test data under several workloads [1].
1. read/randread IO
The following is a performance comparison of buffered read/randread I/O on a file [2]:
"native" means that the test files are located directly on the local ext4 filesystem
"loop" indicates that the test file is located in the erofs image, and the erofs image is mounted through the DIRECT IO mode of the loop device
"fscache" indicates that the test file is located in the erofs image, and the erofs image is mounted through the erofs over fscache scheme
"fuse" indicates that the mount test file is located in the fuse file system [3]
The "Performance" column normalizes the performance in each mode, and compares the performance in other modes based on the performance of the native ext4 file system
As shown, the read/randread performance in fscache mode is basically on par with loop mode and better than fuse mode, but there is still a gap compared with the native ext4 file system. We are analyzing and optimizing further; in theory this scheme can reach the level of a native file system.
2. File metadata operation test
We test file metadata operation performance by running a tar operation [4] over a large number of small files.
As shown, the metadata performance of the erofs-format container image is even better than that of the native ext4 file system. This comes from the erofs on-disk format: since erofs is a read-only file system, all of its metadata can be packed tightly together, whereas ext4, as a writable file system, scatters its metadata across multiple block groups (BGs).
3. Typical workload test
We test performance under a typical workload: compiling the Linux kernel source [5].
It can be seen that the Linux compilation load performance in fscache mode is basically the same as that of loop mode and native ext4 file system, and is better than fuse mode.
- High-density deployment
Because the erofs over fscache scheme is implemented on top of files, i.e., each container image is represented as a set of cache files under fscache, it naturally supports high-density deployment scenarios. For example, a typical node.js container image corresponds to roughly 20 cache files under this scheme, so a machine running hundreds of containers only needs to maintain thousands of cache files.
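A quick illustrative way to see this on a node is to count the backing files under the cachefilesd directory (/var/cache/fscache is a common default; your `dir` setting may differ):

```bash
# Each container image maps to a small set of cache files under fscache, so
# even hundreds of containers produce only thousands of entries.
sudo find /var/cache/fscache -type f | wc -l
```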
- Failure recovery and hot upgrade
Once all the image files have been downloaded locally, accessing files in the image no longer requires the user-mode service process to intervene, so the user-mode service process has a much more generous time window in which to perform failure recovery and hot upgrade. In this scenario the user-mode process is not even strictly required, which makes it possible to achieve stability similar to the native container image scheme (which does not implement on-demand loading).
- Unified container image solution
With the RAFS v6 image format and the erofs over fscache on-demand loading technology, Nydus fits both the runc and rund runtimes, serving as a unified container image distribution solution for these two container scenarios.
More importantly, erofs over fscache is a truly unified, lossless solution for both 1) on-demand loading and 2) downloading the container image in advance. On the one hand, it implements on-demand loading: the container can start without downloading the entire image locally, which helps achieve extreme container startup speed. On the other hand, it is fully compatible with the scenario where the image has already been downloaded locally: file access no longer traps frequently into user mode, delivering nearly lossless performance and stability compared with the native container image scheme (which does not implement on-demand loading).
Outlook and acknowledgements
Going forward, we will continue to iterate on and improve the erofs over fscache solution, for example with image sharing between different containers, FSDAX support, and performance optimization.
In addition, the erofs over fscache solution has been merged into the Linux 5.19 mainline, and we will bring it to the OpenAnolis kernels (5.10 and 4.19) in the future, so that the Anolis kernel is truly available out of the box; you are welcome to use it then.
Finally, I would like to thank all the individuals and teams who supported and helped us during the development of this solution, and in particular the colleagues from ByteDance and Kuaishou for their strong support, including but not limited to community support, testing, and code contributions. Interested readers are welcome to join the DingTalk group of the OpenAnolis community High Performance Storage SIG (scan the QR code at the end of the article or search for group number 34264214) and the Nydus image service DingTalk group (group number 34971767), so that we can build a better container image ecosystem together.
[1] Test environment ECS ecs.i2ne.4xlarge (16 vCPU, 128 GiB Mem), local NVMe disk
[2] Test command "fio -ioengine=psync -bs=4k -direct=0 -rw=[read|randread] -numjobs=1"
[3] Using passthrough_hp as the FUSE daemon, e.g., "passthrough_hp <src_dir> <tgt_dir>"
[4] Test the execution time of "tar -cf /dev/null <linux_src_dir>" command
[5] Test the execution time of the "time make -j16" command
Related link address:
1. OpenAnolis community High Performance Storage SIG:
https://openanolis.cn/sig/high-perf-storage
2. erofs over fscache merged into the 5.19 kernel (commit link):
- FUSE passthrough_hp daemon:
https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc
- Nydus image service (contributions welcome):
https://github.com/dragonflyoss/image-service
- LWN.net report link: