Author: Xu Li

The source of cloud-native innovation

Under the cloud-native trend, the proportion of containerized applications is growing rapidly, and Kubernetes has become the new infrastructure of the cloud-native era. Forrester predicts that by 2022 the share of global organizations running containerized applications in production will rise from less than 30% today to more than 75%; the trend toward containerizing enterprise applications is unstoppable. We can see two common phenomena. First, managed Kubernetes on the cloud has become the first choice for enterprises moving to the cloud and running containers. Second, the way users use containers is changing: from stateless applications, to core enterprise applications, to data-intelligence applications, more and more companies use containers to deploy production-grade, highly complex, high-performance-computing stateful applications, such as web services, content repositories, databases, and even DevOps and AI/big data applications.

[Figure 1]

With infrastructure gradually evolving from physical machines to virtual machines, then to container environments represented by Kubernetes, and even to Serverless, today's computing and applications are undergoing tremendous change. This change makes resource granularity ever finer, life cycles ever shorter, and computing consumed on demand.

From the user's perspective, the most obvious change cloud native brings to storage is that the storage interface moves up: storage services not directly related to the application sink from the application layer into the cloud platform, and users can focus more on the application itself.

For example, traditional users had to manage all of the hardware and software themselves; with virtualization they came to care only about virtual machines, operating systems, and the application software stack. Today, with Serverless, users care only about application services and code. System resources rise from the physical and virtualized resource layers to the application development layer, and users no longer need to care about the underlying infrastructure.

Under such a technical system, the evolution of storage capabilities is mainly reflected in the following three aspects:

1. High density
In the virtual machine era, one virtual machine held a complete storage space covering all the data access and storage needs of the entire application. In a serverless function compute environment, applications are split into functions, and each function's resources need their own storage management, so storage density changes dramatically and becomes much higher.

2. Elasticity
As applications are split at ever finer granularity and storage density keeps rising, serverless function compute needs to start large numbers of instances with high concurrency, which requires extremely elastic storage.

3. Extreme speed
From the perspective of serverless function compute, a function is only one part of the overall process, so its life cycle is naturally shorter, producing large numbers of short-lived container instances. As life cycles shrink, storage must be mounted and unmounted quickly and accessed quickly.

As the service interface moves up, the storage management and control interface is reshaped, and the boundary between built-in and external storage becomes clearer. In a serverless environment, the user-visible interface is external storage (including file storage and object storage), while built-in storage (including image storage and temporary storage) is invisible to users. Built-in storage is managed by Alibaba Cloud, which creates opportunities for innovation.

Technological innovation in image acceleration

Challenges of Alibaba's large-scale container deployment

[Figure 2]

The main challenges faced by Alibaba's large-scale container deployment are reflected in the following aspects:

1. Large business volume: clusters are large, with up to 100,000 nodes; all applications are containerized, and application images are large, usually tens of gigabytes.

2. Fast deployment: business scale keeps growing, and the cloud platform is required to deploy applications quickly to handle that growth, especially emergency scale-out during the Double Eleven promotion period, when it is difficult to accurately estimate each service's capacity in advance.

3. Slow large-scale creation and update: creating or updating container clusters at scale is still very slow, mainly because downloading and decompressing container images is slow. The main technical challenges are as follows:

• Large time overhead: time overhead ∝ image size × number of nodes; a thousand nodes must each store a full copy of the image (see the back-of-envelope sketch after this list);

• High CPU overhead: gzip decompression is slow and can only run serially;

• High I/O pressure: download and decompression write to disk in two rounds, and many nodes writing at the same time cause "resonance" on the cloud disks;

• Memory disturbance: downloading and unpacking severely disturb the host's page cache;

• Small proportion of useful data: a container needs only about 6.4% of the image data at startup.
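
To make the first bullet concrete, here is a back-of-envelope sketch in Go. The image size, node count, and per-node bandwidth are hypothetical numbers chosen for illustration, not measurements from this article; only the 6.4% useful-data figure comes from the list above:

```go
package main

import "fmt"

// Rough estimate of image distribution cost: with the traditional waterfall
// model every node downloads and unpacks the full image, so total transfer
// grows linearly with node count; with on-demand loading only the blocks
// actually read at startup (~6.4% of the image) are transferred.
func main() {
	const (
		imageGB        = 10.0   // image size in GB (assumed)
		nodes          = 1000.0 // cluster size (assumed)
		bandwidthGBps  = 1.25   // per-node download bandwidth, GB/s (assumed)
		usefulFraction = 0.064  // fraction of image data read at startup
	)

	waterfall := imageGB / bandwidthGBps                  // per-node: full download
	onDemand := imageGB * usefulFraction / bandwidthGBps  // per-node: startup blocks only

	fmt.Printf("total transfer, waterfall: %.0f GB\n", imageGB*nodes)
	fmt.Printf("total transfer, on-demand: %.0f GB\n", imageGB*usefulFraction*nodes)
	fmt.Printf("per-node transfer time: %.1fs vs %.1fs\n", waterfall, onDemand)
}
```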

To cope with these technical challenges, the key requirements for large-scale container deployment can be summarized in three points:

1. On-demand: download and decompression must be fast enough, with data accessed and transmitted on demand.

2. Incremental layering: decouple the data into layers of incremental changes, following the OCI Artifacts / overlayfs layering model, so that time and resources are used more effectively.

3. Remote image: adopt remote-image technology to change the image format and reduce local resource consumption.

Remote Image Technical Solution Comparison

There are two main ways to implement remote-image technology: one based on file systems, the other based on block devices. The two approaches are compared in the figure below:

[Figure 3]

Remote-image technology based on a file system directly provides a file-system interface, a natural extension of the container image. Its complexity is high, however, and stability, optimization, and advanced features are hard to achieve. In terms of generality, it is bound to the operating system and its capabilities are fixed, which may not fit all applications; its attack surface is also relatively large. Industry representatives include Google CRFS, Microsoft Azure Project Teleport, and AWS SparseFS.

Remote-image technology based on block devices can be used with conventional file systems such as ext4, and ordinary containers, secure containers, and virtual machines can all consume it directly. Its complexity is lower, and stability, optimization, and advanced features are easier to achieve. In terms of generality, it is not bound to the operating system or the file system: an application can freely choose the file system that suits it best, such as NTFS, and package it into the image as a dependency. Its attack surface is also small.

Alibaba chose the block-device approach with its Data Accelerator for Disaggregated Infrastructure (DADI) and verified it at scale.

Alibaba's self-developed container image acceleration technology DADI

DADI is Alibaba's original technical solution. The DADI image service is a layered, block-level image service that makes application deployment agile and elastic. DADI completely abandons the waterfall model of traditional container startup (download, unpack, then start) and implements fine-grained on-demand loading of remote images: the image does not need to be deployed before the container starts, and the container can start immediately after it is created.

DADI's data path is shown in the figure below; below the dotted line is kernel space, above it is user space. DADI abstracts the image as a virtual block device, on which a conventional file system such as ext4 is mounted for the container application. When the user application reads data, the read request is first handled by the conventional file system, which converts it into one or more reads of the virtual block device. These block-device reads are forwarded to the user-space DADI module and finally converted into random reads on one or more layers.

[Figure 4]
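
To make the data path concrete, here is a minimal sketch of how a block-device read can be resolved against a stack of layers, each recording only the block ranges it modified. The types and names are our own illustration, not DADI's actual code:

```go
package main

import "fmt"

// A hypothetical layer stores non-overlapping extents of blocks it changed.
type extent struct {
	offset, length uint64 // in bytes; 512-byte aligned in practice
	data           []byte
}

type layer struct{ extents []extent }

// lookup returns the extent covering off, or nil if this layer has no data there.
func (l *layer) lookup(off uint64) *extent {
	for i := range l.extents { // real implementations use binary search on a sorted index
		e := &l.extents[i]
		if off >= e.offset && off < e.offset+e.length {
			return e
		}
	}
	return nil
}

// readByte resolves one offset through the layer stack, top (newest) layer first.
func readByte(layers []layer, off uint64) byte {
	for i := len(layers) - 1; i >= 0; i-- {
		if e := layers[i].lookup(off); e != nil {
			return e.data[off-e.offset]
		}
	}
	return 0 // unwritten blocks read as zero
}

func main() {
	base := layer{extents: []extent{{0, 4, []byte("base")}}}
	top := layer{extents: []extent{{2, 2, []byte("XY")}}}
	for off := uint64(0); off < 4; off++ {
		fmt.Printf("%c", readByte([]layer{base, top}, off))
	} // prints "baXY": the top layer overrides blocks 2-3
	fmt.Println()
}
```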

DADI images use block storage plus layering: each layer records only the incrementally modified data blocks, with support for compression and real-time on-demand decompression, and for on-demand transmission, downloading only the data blocks actually used. DADI can also adopt a P2P transfer architecture, fanning out from one node to ten and from ten to a hundred, balancing network traffic across all nodes of a large cluster.

Interpretation of DADI's key technologies

DADI achieves incremental images through block + layering technology, where each layer corresponds to a set of LBA changes. DADI's key technologies include fine-grained on-demand transfer of remote images, efficient online decompression, trace-based prefetching, and P2P transfer for handling bursty workloads. DADI is very effective in making application deployment more agile and elastic.

1. Overlay Block Device

Each layer records the incrementally modified, variable-length data blocks by LBA, with 512 bytes as the minimum granularity; the concepts of files and file systems are not involved. The index is fast: variable-length records save memory, the LBA ranges of records never overlap, and efficient interval queries are supported.

2. Native support for writable layer

Two modes are provided for building a DADI image: appending files sequentially and writing sparse files randomly. Read-only layers may each use different types and sizes, and interval queries within each layer are extremely fast. The writable layer consists of two parts, raw data and an index, organized append-only (a sketch of such an index follows the figure below).

[Figure 5]
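
The following minimal sketch illustrates the kind of sorted, non-overlapping index such a layer could use, with interval queries served by binary search. This is our assumption of the general technique; DADI's real index format may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// record maps a non-overlapping LBA range to its location in the layer's raw data.
type record struct {
	lba, length uint64 // in 512-byte sectors
	fileOffset  uint64 // where the data lives in the layer blob
}

// index keeps records sorted by LBA, enabling O(log n) interval queries.
type index struct{ recs []record }

// query returns all records overlapping [lba, lba+length).
func (ix *index) query(lba, length uint64) []record {
	// first record whose end extends beyond the query start
	i := sort.Search(len(ix.recs), func(i int) bool {
		return ix.recs[i].lba+ix.recs[i].length > lba
	})
	var out []record
	for ; i < len(ix.recs) && ix.recs[i].lba < lba+length; i++ {
		out = append(out, ix.recs[i])
	}
	return out
}

func main() {
	ix := index{recs: []record{{0, 8, 0}, {16, 4, 8}, {32, 8, 12}}}
	fmt.Println(ix.query(4, 20)) // overlaps the records at LBA 0 and 16
}
```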

3. ZFile compression format

Standard compressed file formats such as gz, bz2, and xz cannot perform random reads efficiently: no matter which part of the compressed file is read, decompression must start from the head. To support compression of layer blobs while still allowing on-demand reads of remote images, DADI introduces the ZFile compression format, shown in the figure below. ZFile compresses data in fixed-size blocks and decompresses only the blocks being read. It supports a variety of effective compression algorithms, including lz4, zstd, and gzip, and it is a general-purpose format not bound to DADI.

[Figure 6]
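
As a hedged illustration of block-granular compression, the toy sketch below compresses data in fixed-size blocks and decompresses only the block a random read touches. Gzip and a tiny block size are used here for brevity; real formats such as ZFile use larger blocks, faster codecs, and an on-disk offset table:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

const blockSize = 4 // tiny for the demo; real formats use e.g. tens of KiB

// zfile is a toy stand-in for a block-compressed format: each fixed-size
// block is compressed independently, so a random read decompresses only
// the blocks it actually touches.
type zfile struct {
	blobs [][]byte // independently compressed blocks
}

func compress(data []byte) zfile {
	var z zfile
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		w.Write(data[off:end])
		w.Close()
		z.blobs = append(z.blobs, buf.Bytes())
	}
	return z
}

// readAt decompresses only the block containing offset off.
func (z *zfile) readAt(off int) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(z.blobs[off/blockSize]))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	block, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return block[off%blockSize:], nil
}

func main() {
	z := compress([]byte("hello zfile blocks"))
	tail, _ := z.readAt(6) // touches only the second block
	fmt.Printf("%s\n", tail)
}
```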

4. Trace-based prefetching

DADI records a read log during application startup, recording only the locations of reads, not the data itself. When the application cold-starts and a trace exists, DADI prefetches the data to the local node in advance according to the trace, using highly concurrent reads for greater efficiency. The trace is stored in the image as a special layer dedicated to acceleration; it is invisible to users and can accommodate other acceleration files in the future. As shown in the figure below, the green part represents the acceleration layer, containing the trace file and other files.

[Figure 6-1]
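
A minimal sketch of trace replay might look like the following: the trace stores only read locations, and a pool of workers prefetches them concurrently. The structures and the fetch callback are hypothetical, for illustration only:

```go
package main

import (
	"fmt"
	"sync"
)

// A trace entry records only the location of a read, never the data itself.
type traceEntry struct{ offset, length uint64 }

// prefetch replays a recorded trace with high concurrency, warming the local
// cache before the application issues the same reads itself. fetch is a
// stand-in for the real block fetcher (registry or P2P read).
func prefetch(trace []traceEntry, workers int, fetch func(traceEntry)) {
	jobs := make(chan traceEntry)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for e := range jobs {
				fetch(e)
			}
		}()
	}
	for _, e := range trace {
		jobs <- e
	}
	close(jobs)
	wg.Wait()
}

func main() {
	trace := []traceEntry{{0, 4096}, {65536, 8192}, {131072, 4096}}
	prefetch(trace, 2, func(e traceEntry) {
		fmt.Printf("prefetched %d bytes at offset %d\n", e.length, e.offset)
	})
}
```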

5. On-demand P2P transmission

In our production environment, several key applications are deployed on thousands of servers and contain layers of up to several gigabytes. Deploying these applications puts tremendous pressure on the registry and the network infrastructure. To better handle such large-scale applications, DADI caches recently used data blocks on each host's local disk and transfers data between hosts in a P2P fashion.

[Figure 7]

1. Use a tree topology to distribute data

• Each node caches recently used data blocks

• Cross-node requests have a high probability of hitting the parent node's cache

• A missed request is passed recursively upward until it reaches the registry

2. The topology is dynamically maintained by the root node

• Each layer has a separate transmission topology

3. Deploy a separate set of root nodes for each data center

• Multi-node high-availability architecture

• Work is divided among the root nodes by consistent hashing (see the sketch after this list)
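
A minimal sketch of dividing layers among root nodes via consistent hashing could look like this. It is our assumption of the general scheme; DADI's implementation details may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: each root node owns the arc up to
// its point on the ring, so adding or removing a root remaps only a fraction
// of the layers.
type ring struct {
	points []uint32
	nodes  map[uint32]string
}

func newRing(roots []string) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, n := range roots {
		h := hashOf(n)
		r.points = append(r.points, h)
		r.nodes[h] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ownerOf returns the root node responsible for a given layer digest.
func (r *ring) ownerOf(layer string) string {
	h := hashOf(layer)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.points[i]]
}

func main() {
	r := newRing([]string{"root-a", "root-b", "root-c"})
	for _, layer := range []string{"sha256:01ab", "sha256:9f3c", "sha256:77de"} {
		fmt.Println(layer, "->", r.ownerOf(layer))
	}
}
```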

Startup latency testing at scale

[Figure 8]

We compared DADI's container startup latency against downloading .tgz images, Slacker, CRFS, LVM, and P2P image download, using the WordPress image from DockerHub and observing the cold-start latency of a single instance, with all servers and hosts located in the same data center. As shown in the figure on the left, the results show that DADI significantly reduces container cold-start time.

We then created 1,000 VMs on the public cloud as container hosts and started 10 containers on each, 10,000 containers in total. The test used Agility, a small program written in Python, which accesses an HTTP server so that startup time is recorded on the server side. As shown in the figure on the right, DADI completed the cold start of all containers within 3 seconds.

DADI's large-scale operation in Alibaba

[Figure 9]

DADI has been running at scale in Alibaba Group's production environment. The data shows that DADI takes only 3-4 seconds to start 10,000 containers on 10,000 hosts, and it handled the Double Eleven peak smoothly. It is currently deployed on nearly 100,000 server hosts within Alibaba Group and supports more than 20,000 online and offline applications across the group's Sigma, search, and UC businesses, greatly improving release and scale-out efficiency with a silky-smooth experience. Our experience with DADI in the production environment of one of the world's largest e-commerce platforms shows that it is very effective in improving the agility and elasticity of application deployment.

Embrace open source and release the dividends of cloud native technology

Now DADI is releasing the dividends of cloud-native technology more broadly by contributing to the community, and hopes to build a container image standard together with more companies and developers.

At present, DADI has been open-sourced with support for containerd (Docker does not yet support remote images). It supports nodes pulling directly from the registry with a local cache, as well as building and converting images.

In the future, on-demand P2P transfer will also be open-sourced: the P2P subsystem will be redesigned as an extension of the registry, with support for shared storage such as NFS, HDFS, Ceph, and GlusterFS. Global registry + per-data-center shared storage + node-local cache + P2P data transfer together build the cache within each data center.

You can learn more at the following GitHub links:

  • Control plane (for containerd):

https://github.com/alibaba/accelerated-container-image

  • Data plane (overlaybd):

https://github.com/alibaba/overlaybd

Technical evolution of container persistent storage

Challenges faced by storage access technology

[Figure 10]

Above we discussed the new paradigm of serverless application architecture. We now see a trend from virtual machines, to ordinary containers, and on to secure containers deployed on Shenlong bare metal. From the perspective of storage, the obvious challenges are higher density and multi-tenancy.

Container access technology is evolving from an ECS + ordinary container architecture to a Shenlong + secure container architecture, with single-node density reaching 2,000 instances and a minimum instance granularity of 128 MB of memory and 1/12 of a CPU core. This trend brings the challenge of amplified I/O resource demands.

Alibaba Cloud Storage has its own thinking on client-side access. Storage is divided into built-in storage (image and temporary storage) and external storage (file systems/shared file systems, big data file systems, database file systems, etc.).

How can the storage system be better attached at the bottom layer? Container storage access is offloaded via virtio to the Shenlong MoC card, and the MoC card + virtio channel links up with the underlying storage services.

Persistent storage: elastically provisioned ESSD cloud disks for modern applications

[Figure 11]

ESSD cloud disks provide users with highly available, highly reliable, high-performance block-level random access services, along with rich enterprise features such as native snapshot data protection and cross-region disaster recovery.

Elastically provisioned ESSD cloud disks for modern applications have two key features:

  • The mounting density of cloud disks is increased 4x, with a single instance supporting up to 64 cloud disks
  • Performance and capacity are completely decoupled; performance no longer needs to be provisioned in advance but is determined on demand

For example, many users face this problem: business peaks cannot be predicted accurately, making precise performance planning difficult. If the performance reservation is too high, resources sit idle and are wasted day to day; if it is too low, sudden bursts damage the business. We therefore launched the ESSD Auto PL cloud disk, which supports both explicit performance provisioning and automatic scaling with business load: a single disk can automatically raise its performance up to 1 million IOPS, providing safe and convenient automatic performance provisioning for unexpected bursts of access.

Persistent storage: Container Network File System (CNFS)

To address the advantages and challenges of using file storage in containers, the Alibaba Cloud storage team and container service team jointly launched the Container Network File System (CNFS), built into ACK, Alibaba Cloud's managed Kubernetes service. CNFS abstracts Alibaba Cloud file storage as a Kubernetes object (CRD) for independent management, covering operations such as creation, deletion, description, mounting, monitoring, and expansion, so that users enjoy the convenience of container file storage while gaining better file storage performance, data security, and consistent declarative management.

[Figure 12]

CNFS deeply optimizes container storage in six areas: accessibility, elastic scaling, performance, observability, data protection, and declarative management. Compared with open source solutions, CNFS has the following clear advantages:

  • In terms of storage types, CNFS supports file storage, currently Alibaba Cloud's file storage service NAS
  • Supports Kubernetes-compatible declarative lifecycle management, allowing containers and storage to be managed in one place
  • Supports online expansion and automatic expansion of PVs, optimized for the elastic scaling characteristics of containers
  • Supports better data protection combined with Kubernetes, including PV snapshots, a recycle bin, deletion protection, data encryption, and data disaster recovery
  • Supports application-consistent snapshots at the application level, with automatic analysis of application configuration and storage dependencies, plus one-click backup and one-click restore
  • Supports PV-level monitoring
  • Supports better access control, improving the permission security of shared file systems, including directory-level quotas and ACLs
  • Provides performance optimization, with better performance for small-file reads and writes on file storage
  • Optimizes cost, providing low-frequency storage media and conversion strategies to reduce storage costs

Best Practices

Best practices: database containerization with high-density mounting of ESSD cloud disks

[Figure 13]

The main requirements of database scenarios that use high-density mounting of ESSD cloud disks are as follows: the database deployment model is evolving from virtual machines to containers to keep improving elasticity and portability and to simplify deployment; container deployment density grows linearly with the number of CPU cores, which requires higher mounting density for persistent storage; and as an I/O-intensive workload, databases place higher demands on single-machine storage performance.

Our solution is to run the databases on g6se storage-enhanced instances. A single instance supports mounting up to 64 cloud disks and provides up to 1 million IOPS and 4 GB/s of storage throughput, meeting the performance requirements of high-density single-machine deployment.

The advantages of using ESSD cloud disk high-density mounting for database containerization are:

  • High-density mounting: compared with the previous generation of instances, cloud disk mounting density increases by 400%, raising the density of database instances deployed per machine.
  • High performance: up to 1 million IOPS per machine, with natural I/O isolation between cloud disks, providing stable and predictable read/write performance.
  • High elasticity: ESSD cloud disks support instant-access snapshots that are available immediately, enabling read-only replicas to be created in seconds.
  • High reliability: cloud disks are designed for nine 9s of data reliability and support data protection methods such as snapshots and asynchronous replication, addressing data-loss risks from software and hardware failures.

Best practices: using file storage with the Prometheus monitoring service

[Figure 14]

Prometheus is implemented as follows: the Prometheus server scrapes and stores data; client libraries connect to the server for queries and other operations; the push gateway aggregates short-lived batch monitoring jobs, mainly used for reporting business data; and different exporters collect data for different scenarios, for example the MongoDB exporter collects MongoDB metrics.

Prometheus's core storage is TSDB, a storage engine similar to an LSM tree. We see a trend: synchronizing the storage engine's data across multiple nodes requires introducing a consensus protocol such as Paxos, and for small and medium customers, operating a consistency protocol is very difficult. The architecture therefore separates compute from storage: compute becomes stateless, and the TSDB storage engine is placed on a distributed file system, which naturally calls for a NAS shared file system.

The advantages of Prometheus monitoring service using file storage are:

  • Shared high availability: multiple Pods share NAS persistent storage, and compute-node failover preserves the high availability of container applications
  • Zero modification: a distributed POSIX file system interface that requires no application changes
  • High performance: supports concurrent access, with performance that meets the needs of instant index queries and synchronous data loading, plus low-latency index query and write
  • High elasticity: storage space needs no pre-provisioning; it is used on demand and billed by usage, matching the elasticity of containers

Summary

The innovative development of storage for containers and serverless computing has driven new changes in how storage is viewed. The storage interface as a whole has moved up: developers focus more on the application itself, while the operation and maintenance of infrastructure is handled by the platform as much as possible, and storage provisioning becomes denser, more elastic, and faster.

Above we shared Alibaba Cloud's technological innovations in container storage: DADI image acceleration lays a solid foundation for launching containers at scale, ESSD cloud disks deliver extreme performance, and the CNFS container network file system delivers an excellent user experience.

The prelude to cloud-native innovation has only just begun, and cloud-native storage innovation has just taken its first step. We believe we will join hands with industry experts to create opportunities for innovation and reinvent storage together.


