
Text | Yan Song (alias: Jing Shou)

Maintainer of the Nydus open source image project, technical expert at Ant Group

Works on infrastructure R&D at Ant Group, focusing on cloud-native images and the container runtime ecosystem

This article is 7,060 words and takes about 15 minutes to read

|Foreword|

Container images are a cornerstone of cloud-native infrastructure. As the basis of the container's runtime file system view, the image has, since its birth, spawned ecosystems covering its entire life cycle: build, storage, distribution, and runtime.

However, although these image ecosystems are numerous, the design of the image itself has not improved much since its birth. This article discusses some thoughts on the future development of container images, as well as the exploration and practice behind Nydus container images.

After reading this article, you will know:

- The basic principles of container images and their format;

- The problems with the current image design and how to improve them;

- What Nydus container images explore and how to use them in practice.

PART. 1

Container Images

OCI Container Image Specification

Containers provide applications with a fast, lightweight runtime and a basically isolated environment, while the image provides the container's RootFS, that is, the entire filesystem view visible inside the container, including at least the directory tree structure, file metadata, and file data. The characteristics of an image are as follows:

- Easy to transfer, e.g., uploaded to or downloaded from a Registry over the network via HTTP;

- Easy to store, e.g., packaged into Tar Gzip format and stored on a Registry;

- Immutable: the entire image has a unique hash, and as soon as the image content changes, the image hash changes with it.

The early image format was designed by Docker and evolved from Image Manifest V1[1] through V2 Schema 1[2] to V2 Schema 2[3]. After other container runtimes such as CoreOS's appeared, the OCI standardization community was established to avoid competition and ecosystem fragmentation. It defines the implementation standards for container runtime, image, and distribution. The image formats we use today are basically OCI compatible.

An image is mainly composed of two parts: the image layers and the container configuration.

What is an image layer?

Recall the Dockerfiles you usually write: each ADD, COPY, or RUN instruction may generate a new image layer. Relative to the layer below it, a new layer contains the newly added or modified files (both metadata and data), or deleted files marked by special Whiteout[4] files.

So, simply put, each image layer stores the diff between the lower and upper layers, much like a Git commit. The diff of each layer is usually compressed into Tar Gzip format and uploaded to the Registry.

At runtime, all the diffs are stacked to form the entire filesystem view provided to the container, i.e., the RootFS. The other part of the image is the container runtime configuration, which contains commands, environment variables, ports, and other information.

Each image layer and the runtime configuration have a unique hash (usually SHA256), and these hashes are written into a JSON file called the Manifest[5]. When pulling an image, the Manifest file is fetched first, and then the corresponding image layers and container runtime configuration are pulled from the Registry according to their hashes.
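To make the structure concrete, here is a minimal sketch in Rust of deserializing such a Manifest. The fields follow the OCI image manifest layout; the serde and serde_json crates are assumed to be available:

```rust
use serde::Deserialize;

// A trimmed-down view of an OCI image manifest (a sketch, not the full spec).
#[derive(Debug, Deserialize)]
struct Descriptor {
    #[serde(rename = "mediaType")]
    media_type: String,
    digest: String, // content-addressed hash, e.g. "sha256:..."
    size: u64,
}

#[derive(Debug, Deserialize)]
struct Manifest {
    #[serde(rename = "schemaVersion")]
    schema_version: u32,
    config: Descriptor,      // the container runtime configuration
    layers: Vec<Descriptor>, // one descriptor per image layer
}

fn parse_manifest(json: &str) -> serde_json::Result<Manifest> {
    serde_json::from_str(json)
}
```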

Problems with the current image design

First, we notice that all image layers must be fully stacked before the container can see the whole file system view, so the container has to wait until every layer has been downloaded and decompressed before it can start. A FAST paper[6] found that image pulling accounts for about 76% of container startup time, while only 6.4% of the data is actually read by the container. This result is interesting and motivates us to speed up container startup by loading data on demand. In addition, when there are many layers, Overlay stacking adds overhead at runtime.

Second, each image layer consists of metadata and data, so as soon as the metadata of any file in a layer changes, for example its permission bits are modified, the hash of that layer changes, and the whole layer has to be stored again, or downloaded again.

Third, if a file is deleted or modified in an upper layer, the old version still remains in the lower layer and is not removed. When pulling the image, the old version is still downloaded and unpacked even though the container no longer needs those files. Of course, one could argue this just means the image is not well optimized, but such problems are hard to avoid in complex scenarios.

Fourth, the image hash guarantees immutability during upload and download, but after the image is decompressed onto disk, it is hard to guarantee that the data has not been tampered with at runtime, which means the runtime data is untrusted.

Fifth, the layer is the basic storage unit of the image, and data deduplication works via the layer hash, which makes the deduplication granularity rather coarse. Looking at Registry storage as a whole, there is a large amount of duplicated data between the layers within an image, and between images, which inflates storage and transfer costs.

How image design should be improved

We have seen many problems in the OCI image design. In large-scale cluster scenarios, storage and network load pressure is amplified, and the impact of these problems becomes particularly obvious. Therefore, the image design urgently needs to be optimized in terms of format, build, distribution, runtime, security, and so on.

First, we need to implement on-demand loading. When the container starts and business IO inside the container requests data from certain files, we pull that data from the remote Registry on demand. This avoids pulling large amounts of image data up front and blocking container startup.

Second, we need an index file to record the offsets of a file's data blocks within the layer. The current problem is that the Tar format is not addressable; that is, to find one file you can only read the whole Tar stream sequentially from the beginning. So we naturally thought of using an index to locate just the part of the data we need.

Next, we reformat the layer to support simpler addressing, because a Tar stream compressed with Gzip as a whole is hard to unzip even if you know the offset.

We let the original image layer store only the data part of the files (that is, the Blob layer in the figure). The Blob layer stores the file data in chunks; for example, a 10MB file is cut into ten 1MB chunks. The advantage is that we can record the offset of each chunk in an index; when the container requests part of a file, we pull only the chunks we need from the remote Registry, saving unnecessary network overhead.

In addition, another advantage of chunk-based cutting is that it refines the deduplication granularity: chunk-level deduplication makes it easier to share data between layers and between images.
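The two ideas can be sketched together in Rust. The structures and the fixed chunk size below are illustrative, not Nydus's actual on-disk layout: the index maps a read request to exactly the chunks that must be fetched, and because each chunk is keyed by its content digest, identical chunks can be shared across layers and images:

```rust
/// Hypothetical chunk descriptor: where a chunk of a file lives in a blob.
#[derive(Debug, Clone, Copy)]
struct ChunkInfo {
    file_offset: u64,     // offset of this chunk within the file
    blob_offset: u64,     // offset of the compressed chunk within the blob
    compressed_size: u32, // bytes to fetch from the blob
    digest: [u8; 32],     // SHA-256 of the chunk; also the dedup key
}

const CHUNK_SIZE: u64 = 1 << 20; // assume fixed 1MB chunks for this sketch

/// Map a read(offset, count) on a file to the chunks that must be fetched,
/// instead of downloading and unpacking the whole layer.
fn chunks_for_read(chunks: &[ChunkInfo], offset: u64, count: u64) -> &[ChunkInfo] {
    if count == 0 {
        return &[];
    }
    let first = (offset / CHUNK_SIZE) as usize;
    let last = ((offset + count - 1) / CHUNK_SIZE) as usize;
    &chunks[first.min(chunks.len())..(last + 1).min(chunks.len())]
}
```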

Finally, we separate metadata and data. This avoids updating the data layer just because the metadata was updated, saving storage and transfer costs.

The metadata and the chunk index together form the Meta layer in the figure above, which is the entire filesystem structure the container sees after all image layers are stacked, including the directory tree structure, file metadata, and chunk information.

In addition, the Meta layer includes a hash tree and the hashes of the chunk data blocks, so that at runtime we can verify both the whole file tree and any individual chunk, and we can sign the entire Meta layer, ensuring that tampering with the data at runtime can still be detected.
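A minimal sketch of the chunk check, assuming the sha2 crate: the digest recorded in the Meta layer is compared with the hash of the chunk data actually read at runtime:

```rust
use sha2::{Digest, Sha256};

/// Verify a chunk read at runtime against the digest recorded in the
/// (signed) Meta layer; a mismatch means the local data was tampered with.
fn verify_chunk(expected_digest: &[u8; 32], chunk_data: &[u8]) -> bool {
    Sha256::digest(chunk_data).as_slice() == &expected_digest[..]
}
```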

As mentioned above, we introduced these features into the Nydus image format, which can be summarized as follows:

- Separation of image metadata and data, with user-mode on-demand loading and decompression;

- Finer-grained chunk-level data cutting and deduplication;

- A flattened metadata layer (intermediate layers removed) that directly presents the filesystem view;

- End-to-end file system metadata tree and data verification.

PART. 2

Nydus Solutions

Image acceleration framework

The Nydus image acceleration framework is a sub-project of Dragonfly[7] (a CNCF incubating project). It is compatible with the current OCI image build, distribution, and runtime ecosystem. The Nydus runtime is written in Rust, which offers great advantages in language-level safety as well as performance, memory, and CPU overhead, while remaining highly extensible.

By default, Nydus implements on-demand loading with FUSE[8], a user-mode file system technology. The user-mode Nydus Daemon process exposes the Nydus image mount point as the container's RootFS directory. When the container issues file system IO such as read(fd, count), the kernel-mode FUSE driver adds the request to the processing queue; the user-mode Nydus Daemon reads and processes the request through the FUSE device, pulls the corresponding chunk data blocks from the remote Registry, and finally replies to the container through kernel-mode FUSE.
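The read path can be sketched as follows, reusing ChunkInfo, CHUNK_SIZE, and chunks_for_read from the earlier sketch. The helper functions are hypothetical stand-ins, and the real nydusd is asynchronous and far more involved:

```rust
// Hypothetical helpers: a local chunk cache and a remote blob fetcher.
fn cache_lookup(_digest: &[u8; 32]) -> Option<Vec<u8>> { unimplemented!() }
fn cache_store(_digest: &[u8; 32], _data: &[u8]) { unimplemented!() }
fn fetch_and_decompress(_blob_offset: u64, _size: u32) -> Vec<u8> { unimplemented!() }

/// Serve a FUSE read(offset, count) by pulling only the missing chunks.
fn handle_fuse_read(chunks: &[ChunkInfo], offset: u64, count: u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(count as usize);
    for chunk in chunks_for_read(chunks, offset, count) {
        let data = match cache_lookup(&chunk.digest) {
            Some(data) => data, // warm path: chunk already cached locally
            None => {
                // Cold path: pull just this chunk from the remote Registry.
                let data = fetch_and_decompress(chunk.blob_offset, chunk.compressed_size);
                cache_store(&chunk.digest, &data);
                data
            }
        };
        out.extend_from_slice(&data);
    }
    // Trim the leading bytes of the first chunk and the tail of the last one.
    let skip = (offset % CHUNK_SIZE) as usize;
    out.drain(..skip.min(out.len()));
    out.truncate(count as usize);
    out
}
```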

The Nydus acceleration framework supports three operating modes for loading images on demand in different scenarios:

1. On-demand loading provided via FUSE to container runtimes such as runC; this is also the most commonly used Nydus mode;

2. Carrying the FUSE protocol over VirtioFS[9], which allows VM-based container runtimes such as Kata to provide on-demand RootFS loading for containers inside the VM guest;

3. Providing the RootFS through the kernel-mode EROFS[10] read-only file system. Nydus's EROFS format support has entered the Linux 5.16 mainline, and its kernel-mode caching solution, EROFS over fscache, has been merged into the Linux 5.19-rc1 mainline. This scheme reduces context switching and memory copy overhead, and can be used when performance requirements are extreme.

On the storage backend side, Nydus can connect to various OCI Distribution-compatible Registries, and can also connect directly to object storage services such as OSS, network file systems such as NAS, and so on. It also has a local cache capability: after a data block is pulled from the remote end, it is decompressed and stored in the local cache, providing better performance on the next warm start.

Beyond the local cache, Nydus can also be connected to a P2P file distribution system (such as Dragonfly) to speed up block data transfer, which at the same time minimizes the network load in large-scale clusters and the single-point pressure on the Registry. In real-scenario tests, with P2P caching the network latency can be reduced by more than 80%.

As the benchmark in this figure shows, the end-to-end cold start time (from Pod creation to Ready) of an OCI image container grows as the image size increases, while the Nydus image container always remains stable at about 2s.

Performance optimization in image scenarios

At present, in Ant's production deployment alone, millions of Nydus-accelerated image containers are created every day, demonstrating production-grade stability and performance. Hardened at such scale, Nydus has made many optimizations in performance and resource consumption.

In terms of image data performance, the Rust implementation of the runtime (nydusd) itself achieves low memory and CPU overhead, so the main load affecting the startup performance of Nydus image containers comes from the network. Therefore, in addition to pulling chunk data from nearby nodes via P2P distribution, Nydus also implements a layer of local cache: chunks already pulled from the remote end are decompressed and cached locally, and the cache can be shared between images at layer granularity as well as at chunk granularity.

Although Nydus can be configured with in-cluster P2P acceleration, on-demand loading may initiate one network IO for every chunk pulled. Therefore, we implemented IO read amplification, merging small block requests into a single request to reduce the number of connections. Meanwhile, Dragonfly also implements chunk-level P2P caching and acceleration for Nydus.
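Read amplification amounts to coalescing adjacent or near-adjacent blob ranges into one ranged request, as in this sketch (the merge gap threshold is an illustrative parameter, not Nydus's actual configuration):

```rust
/// A byte range [start, end) within a blob.
#[derive(Debug, Clone, Copy)]
struct BlobRange { start: u64, end: u64 }

/// Merge ranges whose gaps are below a threshold, so several small chunk
/// reads become one larger request to the Registry.
fn coalesce(mut ranges: Vec<BlobRange>, max_gap: u64) -> Vec<BlobRange> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<BlobRange> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Close enough to the previous request: extend it instead.
            Some(last) if r.start <= last.end + max_gap => {
                last.end = last.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}
```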

In addition, we can analyze the access pattern by observing the order in which image files are read at container startup, so that this data is prefetched before the container's IO actually requests it, improving cold start performance. We can also rearrange the chunk order during the image build phase to further reduce startup latency.
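Prefetching can then be as simple as replaying the recorded chunk list in the background, reusing the hypothetical cache and fetch helpers from the earlier sketch:

```rust
use std::thread;

/// Fetch the chunks recorded from a startup trace in the background, so
/// that most container reads hit the local cache instead of the network.
fn start_prefetch(prefetch_list: Vec<ChunkInfo>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for chunk in prefetch_list {
            if cache_lookup(&chunk.digest).is_none() {
                let data = fetch_and_decompress(chunk.blob_offset, chunk.compressed_size);
                cache_store(&chunk.digest, &data);
            }
        }
    })
}
```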

In terms of image metadata performance: for a Nydus image of tens of GB containing many small files, the metadata layer may exceed 10MB, and loading it into memory all at once would be very uneconomical. Therefore, we changed the metadata structure so that it, too, is loaded on demand (on-disk mmap), which is very useful for memory-sensitive scenarios such as function computing.
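The idea can be sketched with the memmap2 crate: the metadata file is mapped read-only and the kernel pages in only what is touched. The fixed-size record layout below is a made-up illustration, not the actual RAFS on-disk format:

```rust
use memmap2::Mmap;
use std::fs::File;

/// Map the metadata blob instead of reading it all into memory; pages are
/// faulted in lazily by the kernel only when a record is accessed.
fn open_meta(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file is read-only and not truncated while mapped.
    unsafe { Mmap::map(&file) }
}

const RECORD_SIZE: usize = 128; // hypothetical fixed-size inode record

/// Access the i-th record without loading the whole metadata layer.
fn record(meta: &Mmap, i: usize) -> &[u8] {
    &meta[i * RECORD_SIZE..(i + 1) * RECORD_SIZE]
}
```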

In addition to runtime optimizations, Nydus also does optimization work at build time. In most scenarios, exporting a Nydus image layer is about 30% faster than exporting an OCI image layer in Tar Gzip format, and the future goal is an improvement of more than 50%.

More than just image acceleration

These optimizations are sufficient for image acceleration scenarios, but Nydus is not limited to image acceleration; it is evolving into a general distribution acceleration framework applicable to other fields as well. Overall:

1. In addition to native integration with Kata secure containers, Nydus reduces the cold start time of runtime image preparation from 20s to 800ms in function computing scenarios, such as Alibaba Cloud's code package acceleration and serverless scenarios;

2. In software package management scenarios, such as front-end NPM packages, a large number of small files must be decompressed and written to disk during installation, and small-file IO hurts performance badly; Nydus can serve them without decompression. Ant's TNPM project[11] added macOS platform support to Nydus and reduced native NPM installation time from 25s to 6s;

3. In image data analysis scenarios, we analyze chunk similarity between business images algorithmically and build Nydus chunk dictionary images, reducing the storage consumption caused by rapid business iteration by more than 50%. In the future, we will use machine learning to help businesses further optimize image size.

File system extensibility

There are also image acceleration solutions in the industry based on user-mode block devices (custom block format > user-mode block device > file system). As the introduction above shows, Nydus, whether in FUSE user mode or kernel-mode EROFS mode, is based on a file system rather than a block device. This design gives Nydus easy access to file-level data information both at build time and at runtime, which naturally enables many other scenarios, such as:

1. In security scanning scenarios, without downloading and decompressing the entire image, the metadata can be analyzed in advance to find high-risk software versions, and file contents can then be read on demand to scan for sensitive or non-compliant data, greatly improving image scanning speed;

2. Image file system optimization: by tracing file access requests at runtime, users can learn which files were accessed and which programs were executed. These records can be given to users to help shrink the image, to the security team to help audit suspicious operations, and to the image build phase to optimize layout and improve runtime read-ahead performance;

3. By hooking file access requests at runtime, the execution of high-risk software can be intercepted, reads of sensitive data blocked, and vulnerable resources replaced or hot-fixed without the business noticing.

End-to-end kernel mode solution

Nydus was implemented entirely in user mode in its early days, but to meet extreme performance requirements, such as function computing and code package scenarios, we pushed the on-demand loading capability down into kernel mode. Compared with the FUSE user-mode solution, the kernel-mode implementation eliminates a large amount of system call overhead caused by random small IO, and reduces the user/kernel mode context switching and memory copying incurred by FUSE request processing.

Relying on the kernel-mode EROFS file system (available since Linux 4.19), we made a series of improvements and enhancements to extend its capabilities in image scenarios, finally producing a kernel-mode container image format: Nydus RAFS (Registry Acceleration File System) v6. Compared with the previous format, it offers block data alignment, leaner metadata, high scalability, and high performance.

As mentioned above, even when the image data has been fully downloaded locally, the FUSE user-mode scheme still causes file accesses to frequently trap out to user mode, with memory copies between kernel mode and user mode. Therefore, we further support the EROFS over fscache scheme (Linux 5.19-rc1).

When the user-mode nydusd downloads a chunk from the remote end, it writes it directly into the fscache cache; when the container accesses the data later, it is read directly through kernel-mode fscache without trapping into user mode, achieving almost lossless performance and stability for container images. This outperforms the FUSE user-mode solution and approaches a native file system (without on-demand loading).

At present, Nydus supports this solution across build, runtime, and kernel mode (Linux 5.19-rc1). For detailed usage, please refer to the Nydus EROFS fscache user guide[12]; for more details on the Nydus kernel-mode implementation, please refer to "Nydus Image Acceleration: Kernel Evolution"[13].

PART. 3

The Nydus Ecosystem and the Future

Nydus is compatible with the current OCI image build, distribution, and runtime ecosystem. In addition to providing its own toolchain, Nydus is compatible with and integrated into the community's mainstream ecosystem.

Nydus toolchain

- Nydus Daemon (nydusd[14]): the Nydus user-mode runtime, supporting FUSE, FUSE on VirtioFS, and the EROFS read-only file system format; it currently also runs on macOS;

- Nydus Builder (nydus-image[15]): the Nydus format build tool, which supports building the Nydus format from a source directory, an eStargz TOC, etc.; it can be used for layered OCI image builds, code package builds, and other scenarios, and supports Nydus format inspection and verification;

- Nydusify (nydusify[16]): the Nydus image conversion tool, which pulls images from a source Registry, converts them to the Nydus image format, and pushes them to a target Registry or object storage service; it supports Nydus image verification and remote-cache-accelerated conversion;

- Nydus Ctl (nydusctl[17]): the Nydus Daemon control CLI, which can query daemon status and metrics and hot-update configuration at runtime;

- Ctr Remote (ctr-remote[18]): an enhanced version of the Containerd CLI (ctr) that supports pulling and running Nydus images directly;

- Nydus Backend Proxy (nydus-backend-proxy[19]): an HTTP service that serves a local directory as a Nydus Daemon storage backend, usable in scenarios without a Registry or object storage service;

- Nydus Overlayfs (nydus-overlayfs[20]): a Containerd mount helper tool, usable with VM-based container runtimes such as Kata Containers.

Nydus Ecosystem Integration

- Harbor (acceld[21]): Acceld, an image conversion service initiated by Nydus, enables Harbor to natively support conversion to eStargz, Nydus, and other accelerated image formats;

- Dragonfly (dragonfly): a P2P file distribution system that provides block-level data caching and distribution for Nydus;

- Nydus Snapshotter (nydus-snapshotter[22]): a Containerd sub-project that supports Nydus container images in Containerd via the remote snapshotter plug-in mechanism;

- Docker (nydus-graphdriver[23]): supports Nydus container images in Docker via the Graph Driver plug-in mechanism;

- Kata Containers (kata-containers[24]): Nydus provides a native image acceleration solution for Kata secure containers;

- EROFS (nydus with erofs[25]): Nydus is compatible with the EROFS read-only file system format and can run Nydus images directly in kernel mode to improve performance in extreme scenarios;

- Buildkit (nydus-compression-type[26]): exports Nydus format images directly from a Dockerfile.

The future direction of Nydus

While advancing the upstream ecosystem and expanding its application fields, Nydus is also exploring the following directions in performance, security, and beyond:

1. Nydus currently supports the kernel-mode EROFS read-only file system, and we will do more work on performance and native integration;

2. At present, Nydus export is faster than OCIv1 Tar Gzip in most scenarios. Next, we will make the build process load on demand as well; for example, by allowing the base image of a Dockerfile build to be a Nydus image, the entire base image no longer has to be pulled down first, further improving build speed;

3. We are using machine learning methods to analyze storage across images and across the whole image registry, combined with runtime access pattern analysis, to further improve the deduplication efficiency of image data, reduce storage, and improve runtime performance;

4. Cooperating with major image security scanning frameworks to natively support faster image scanning, intercept high-risk software execution at runtime, block high-risk reads and writes, and hot-fix vulnerabilities and replace resources without the business noticing;

5. In addition to on-demand loading, Nydus also solves the IO performance problem of massive small files. Ant's soon-to-be-open-sourced front-end tnpm project has already landed this solution, and we are considering extending it to more scenarios.

Compared with other on-demand loading solutions in the community, Nydus has done a great deal of work on performance optimization and low resource overhead in image scenarios, and it broadens the possibilities of on-demand loading technology in image scanning and auditing, as well as in non-image scenarios.

As the title suggests, Nydus may not necessarily be the future of container images, but in format design, optimization directions, and practical ideas, it certainly offers a competitive reference for whatever that future turns out to be. Nydus adheres to the concept of openness and open source, and looks forward to more of the community participating to contribute to the future of container technology.

Nydus website: https://nydus.dev/

Dive deep into Nydus and explore with us!


Nydus Star ✨:
https://github.com/dragonflyoss/image-service

[Reference Links]

[1] Image Manifest V1: https://github.com/moby/moby/tree/master/image/spec

[2] V2 Schema 1: https://docs.docker.com/registry/spec/manifest-v2-1/

[3] V2 Schema 2: https://docs.docker.com/registry/spec/manifest-v2-2/

[4] Whiteout: https://github.com/opencontainers/image-spec/blob/main/layer.md#representing-changes

[5] Manifest: https://github.com/opencontainers/image-spec/blob/main/manifest.md

[6] "Slacker Fast Distribution with Lazy Docker Containers": https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter

[7] Dragonfly: https://d7y.io/

[8] FUSE: https://www.kernel.org/doc/html/latest/filesystems/fuse.html

[9] VirtioFS: https://virtio-fs.gitlab.io/

[10] EROFS: https://www.kernel.org/doc/html/latest/filesystems/erofs.html

[11] TNPM: https://dev.to/atian25/in-depth-of-tnpm-rapid-mode-how-could-we-fast-10s-than-pnpm-3bpp

[12] "Nydus EROFS fscache user guide": https://github.com/dragonflyoss/image-service/blob/fscache/docs/nydus-fscache.md

[13] "Nydus Image Acceleration: Kernel Evolution": https://mp.weixin.qq.com/s/w7lIZxT9Wk6-zJr23oBDzA

[14] Nydusd: https://github.com/dragonflyoss/image-service/blob/master/docs/nydusd.md

[15] Nydus Image: https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-image.md

[16] Nydusify: https://github.com/dragonflyoss/image-service/blob/master/docs/nydusify.md

[17] Nydus Ctl: https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-image.md

[18] Ctr Remote: https://github.com/dragonflyoss/image-service/tree/master/contrib/ctr-remote

[19] Nydus Backend Proxy: https://github.com/dragonflyoss/image-service/blob/master/contrib/nydus-backend-proxy/README.md

[20] Nydus Overlayfs: https://github.com/dragonflyoss/image-service/tree/master/contrib/nydus-overlayfs

[21] Acceld: https://github.com/goharbor/acceleration-service

[22] Nydus Snapshotter: https://github.com/containerd/nydus-snapshotter

[23] Nydus Graphdriver: https://github.com/dragonflyoss/image-service/tree/master/contrib/docker-nydus-graphdriver

[24] Kata Containers: https://github.com/kata-containers/kata-containers/blob/main/docs/design/kata-nydus-design.md

[25] Nydus with EROFS: https://static.sched.com/hosted_files/kccncosschn21/fd/EROFS_What_Are_We_Doing_Now_For_Containers.pdf

[26] Nydus Compression Type: https://github.com/imeoer/buildkit/tree/nydus-compression-type


