

Author | Ao Wang
Source | Serverless official account

Overview


The USENIX ATC (USENIX Annual Technical Conference) is a top conference in the field of computer systems and is on the China Computer Federation (CCF) recommended Class A international conference list. This year's conference received a total of 341 submissions and accepted 64 papers, for an acceptance rate of 18.8%.

The Alibaba Cloud Serverless team was the first to propose a decentralized fast image distribution technology for the FaaS scenario, and the team's paper was accepted by USENIX ATC '21. The following is an interpretation of the paper's core content, which focuses on shortening the end-to-end latency of function cold starts in the Custom Container Runtime of Function Compute, Alibaba Cloud's FaaS product.

USENIX ATC will be held online from July 14 to July 16. For details of the paper, see:
https://www.usenix.org/conference/atc21/presentation/wang-ao

Abstract

Serverless computing (FaaS) is a new cloud computing paradigm in which customers focus only on their own business logic, while virtualization, resource management, and elastic scaling of the underlying system are left to the cloud provider. Serverless computing's support for the container ecosystem unlocks a variety of business scenarios. However, because container images are complex and large, and FaaS workloads are dynamic and unpredictable, many industry-leading technologies cannot be applied well to FaaS platforms, and efficient container distribution remains a challenge there.

In this paper, we design and propose FaaSNet, a lightweight, highly scalable system middleware that uses an image acceleration format for container distribution. Its target scenario is large-scale container image startup (function cold start) under burst traffic in FaaS. The core component of FaaSNet is the Function Tree (FT), a decentralized, self-balancing binary tree topology in which all nodes are equivalent.

We integrated FaaSNet into Function Compute (hereinafter FC). Experimental results show that under highly concurrent request volumes, FaaSNet provides 13.4x faster container startup than native FC. Moreover, when burst request volumes destabilize end-to-end latency, FaaSNet restores latency to a normal level in 75.2% less time than FC.

Introduction

1. Background and Challenges

FC launched support for custom container images ( https://developer.aliyun.com/article/772788 ) in September 2020, and AWS Lambda announced Lambda container image support in December of the same year, signaling a major trend of FaaS embracing the container ecosystem. FC then launched image acceleration ( https://developer.aliyun.com/article/781992 ) in February 2021. These two features unlock more FaaS application scenarios, allowing users to seamlessly migrate their container business logic to FC and enabling GB-scale images to start in seconds.

When the FC backend encounters large-scale requests that trigger too many function cold starts, enormous pressure is put on the container registry's bandwidth even with image acceleration: many machines pulling from the same container registry at the same time create a bandwidth bottleneck or trigger rate limiting in the registry service, lengthening the time to pull and download image data (even in the accelerated image format). A more straightforward approach is to increase the bandwidth of the FC backend registry, but this does not solve the fundamental problem and introduces additional system overhead.

1) Workload analysis

We first analyzed online data from two major FC regions (Beijing and Shanghai):

1.png

  • Figure (a) analyzes the image-pull latency of the FC system during function cold starts. In Beijing and Shanghai, ~80% and ~90% of image pulls, respectively, take longer than 10 seconds;
  • Figure (b) shows the fraction of the entire cold start spent pulling the image. For 80% of functions in the Beijing region and 90% in the Shanghai region, the image pull accounts for more than 60% of the total cold-start latency.

The workload analysis shows that function cold starts spend most of their time fetching container image data, so optimizing this part of the latency can greatly improve cold-start performance.

According to online operations records, a representative large customer pulls up 4,000 function containers from the same image almost instantaneously. The image is 1.8 GB before decompression and 3-4 GB after decompression. The moment such a burst of traffic arrives and containers start to come up, flow-control alarms fire in the container registry service, prolonging the latency of some requests; in severe cases, container startup fails outright. Such scenarios urgently needed a solution.
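A back-of-envelope calculation shows why a centralized registry buckles under this load. Using the numbers from the incident above (4,000 concurrent pulls of a 1.8 GB compressed image; the tree fan-out figure is an assumption for illustration):

```python
# Rough egress arithmetic for the burst scenario described above.
concurrent_pulls = 4000
image_gb = 1.8  # compressed image size

# Centralized: every cold-starting machine pulls the full image from the registry.
centralized_egress_tb = concurrent_pulls * image_gb / 1000  # ~7.2 TB of registry egress

# Tree-based distribution: only the root of the tree pulls from the source;
# every other node receives data from its upstream peer instead.
tree_registry_egress_gb = image_gb  # registry serves the image roughly once

print(f"centralized: {centralized_egress_tb:.1f} TB from the registry")
print(f"tree-based:  {tree_registry_egress_gb:.1f} GB from the registry (rest is peer-to-peer)")
```

The three-orders-of-magnitude gap in registry egress is what makes peer-to-peer distribution attractive here, independent of any specific system design.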

2) State-of-the-art comparison

A number of related technologies in academia and industry can speed up image distribution, for example:

Alibaba's DADI:
https://www.usenix.org/conference/atc20/presentation/li-huiba
Dragonfly:
https://github.com/dragonfly/dragonfly
and Uber's open-source Kraken:
https://github.com/uber/kraken/

  • DADI

DADI provides a highly efficient image acceleration format that supports on-demand reads (FaaSNet also uses a container acceleration format). For image distribution, DADI adopts a tree topology networked at the granularity of an image layer: one layer corresponds to one tree, and each VM participates in multiple logical trees. DADI's P2P distribution relies on several root nodes with large specifications (CPU, bandwidth) to fetch data from the source and act as the manager of the peers in the topology. DADI's tree structure is static; because container provisioning bursts generally do not last long, by default DADI's root nodes dissolve the topology after 20 minutes and stop maintaining it.

  • Dragonfly

Dragonfly is also a P2P-based image and file distribution network, whose components include the Supernode (master node) and dfget (peer node). Similar to DADI, Dragonfly relies on several large-specification Supernodes to support the entire cluster. Dragonfly manages and maintains a fully connected topology through the central Supernode (multiple dfget nodes contribute different pieces of the same file to achieve point-to-point transfer to the target node), so Supernode performance is a potential bottleneck for the throughput of the entire cluster.

  • Kraken

Kraken's origin and tracker nodes act as central nodes managing the entire network, with an agent on each peer node. Kraken's tracker only manages which peers in the cluster connect to each other; the peers then transfer data among themselves. However, Kraken is also a layer-granular container image distribution network, and its networking logic can likewise become a rather complex fully connected mode.

Examining these three industry-leading technologies reveals several common points:

  • First, all three use the image layer as the distribution unit. This networking logic is too fine-grained and leaves each peer node holding multiple active data connections at the same time;
  • Second, all three rely on central nodes to manage the networking logic and coordinate the peer nodes in the cluster (the central nodes of DADI and Dragonfly are also responsible for fetching data from the source). This design requires deploying several large-specification machines in production to sustain very high traffic, and tuning their parameters to reach the expected performance.

With these preconditions in mind, we reconsider the design under FC's ECS architecture. Each machine in this architecture has 2 CPU cores, 4 GB of memory, and 1 Gbps of intranet bandwidth, and the machines' life cycles are unreliable: they may be reclaimed at any time.

This brings three rather serious problems:

  • Limited intranet bandwidth means a fully connected topology easily saturates the bandwidth and degrades data transfer performance. A fully connected topology is also not function-aware and easily creates security problems in FC, because the machines executing function logic are not trusted by FC's system components, leaving a risk that tenant A could intercept tenant B's data;
  • CPU and bandwidth specifications are limited. Because of Function Compute's pay-per-use billing, the life cycle of machines in our cluster is unreliable, so we cannot set aside several machines from the pool as central nodes to manage the whole cluster: the overhead of those machines would become a significant burden, their reliability could not be guaranteed, and their failure would bring the system down. What FC needs is networking technology that inherits the pay-per-use characteristic and forms a topology on the fly;
  • The multi-function problem. None of the three has a function-awareness mechanism. In DADI P2P, for example, too many images on a single node can make it a hot spot and degrade performance. Worse, multi-function pulls are inherently unpredictable: when concurrent pulls from multiple functions saturate the bandwidth, other downloads from remote services during the same period, such as code packages and third-party dependencies, are also affected, causing an availability problem for the whole system.

With these problems in mind, we explain the FaaSNet design in detail in the next section.

2. Design scheme-FaaSNet

As discussed above, the three mature P2P solutions in the industry lack function-level awareness, mostly use a fully connected topology within the cluster, and place certain demands on machine performance, none of which fit the realities of FC's ECS-based system. We therefore propose the Function Tree (hereinafter FT), a function-level, function-aware logical tree topology.

1) FaaSNet architecture

2.png

The gray parts in the figure are the components transformed by FaaSNet; the white modules retain FC's existing system architecture. Notably, all of FaaSNet's Function Trees are managed on the FC scheduler; on each VM, a VM agent communicates with the scheduler over gRPC to receive upstream and downstream messages, and the VM agent is also responsible for fetching image data from its upstream and distributing it downstream.
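Since the scheduler maintains the trees in memory and the VM agents only receive their upstream/downstream assignments, the per-VM topology information can be quite small. The following is a hypothetical sketch of the hints a scheduler could piggyback on the create-container request it already sends to each VM agent; the message and field names here are illustrative assumptions, not FC's real gRPC schema:

```python
# Illustrative sketch (not FC's actual message schema) of per-VM topology
# hints delivered alongside a create-container request.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CreateContainerRequest:
    function_id: str
    image_ref: str
    upstream_peer: Optional[str]  # None => this VM is the FT root and pulls from the source
    downstream_peers: List[str] = field(default_factory=list)  # binary tree: at most 2 children

req = CreateContainerRequest(
    function_id="func-0",
    image_ref="registry.example.com/app:latest",  # hypothetical image reference
    upstream_peer=None,                           # root node: fetches from the source registry
    downstream_peers=["vm-2", "vm-3"],
)
assert req.upstream_peer is None and len(req.downstream_peers) <= 2
```

Because the agent only ever learns its immediate neighbors, no peer holds a global view of the tree, which is consistent with the decentralized design described below.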

2) Decentralized function/mirror-level self-balancing tree topology

To solve the above three problems, we first raise the topology to the function/image level, which effectively reduces the number of network connections on each VM. In addition, we designed a tree topology based on the AVL tree. We elaborate on the Function Tree design below.

Function Tree
  • Decentralized self-balancing binary tree topology

The design of FT is inspired by the AVL tree. In FT there is currently no concept of node weight: all nodes are equivalent (including the root). Whenever a node is added to or removed from the tree, the tree maintains a perfectly balanced structure, guaranteeing that the height difference between the left and right subtrees of any node is at most 1; FT adjusts the shape of the tree (left/right rotations) to restore balance. In the right-rotation example in the figure below, node 6 is about to be reclaimed; its removal leaves the left and right subtrees of node 1 (the parent node) highly imbalanced, so a right rotation is needed to restore balance. State 2 shows the final state after the rotation, with node 2 becoming the tree's new root. Note: every node represents an ECS machine in FC.

3.png
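The rebalancing behavior above follows standard AVL mechanics. The following is a minimal, illustrative Python sketch (not FaaSNet's actual implementation): it orders nodes by an integer key for simplicity, whereas FT applies the same height-balancing rotations to equivalent peers rather than to key order.

```python
# Minimal AVL-style self-balancing binary tree: after every insert/delete,
# the height difference between any node's left and right subtrees is <= 1.

class Node:
    def __init__(self, key):
        self.key = key       # here: a peer (VM) identifier
        self.left = None
        self.right = None
        self.height = 1

def height(n):
    return n.height if n else 0

def balance(n):
    return height(n.left) - height(n.right) if n else 0

def update(n):
    n.height = 1 + max(height(n.left), height(n.right))
    return n

def rotate_right(y):          # the rotation shown in the figure above
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    return update(x)          # x becomes the new subtree root

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    return update(y)

def rebalance(n):
    update(n)
    if balance(n) > 1:        # left-heavy
        if balance(n.left) < 0:
            n.left = rotate_left(n.left)
        return rotate_right(n)
    if balance(n) < -1:       # right-heavy
        if balance(n.right) > 0:
            n.right = rotate_right(n.right)
        return rotate_left(n)
    return n

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return rebalance(root)

def delete(root, key):        # e.g. a VM being reclaimed
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        succ = root.right      # replace with in-order successor
        while succ.left:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return rebalance(root)
```

For example, inserting peers 1 through 7 and then reclaiming peer 6 always leaves a tree in which every node's subtree heights differ by at most one, so no peer ever serves more than two downstream children regardless of churn.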

In FT, all nodes are equivalent, and their main responsibilities are: 1. pull data from the upstream node; 2. distribute data to the two downstream child nodes. (Note that FT does not designate a special root node: the only difference between the root and other nodes is that its upstream is the source. The root is not responsible for any metadata management; we describe how metadata is managed in the next part.)

  • Overlapping of multiple FTs on multiple peer nodes

4.png

A peer node inevitably hosts different functions of the same user, so a peer will be located in multiple FTs. The figure above shows three FTs belonging to func 0-2. Because each FT is managed independently, even when transfers overlap, FT helps each node find the correct upstream node for each function.
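Concretely, because membership is tracked per function, a peer's upstream is resolved independently in each tree it belongs to. A minimal sketch of this per-function lookup (the function and VM names are illustrative, and the flat parent map stands in for the scheduler's in-memory tree state):

```python
# Sketch of function-level tree membership: one peer can sit in several FTs,
# one per function, each with its own upstream assignment.
from typing import Dict, Optional

# scheduler-side state: function_id -> {peer: upstream peer (None = root)}
function_trees: Dict[str, Dict[str, Optional[str]]] = {
    "func-0": {"vm-1": None, "vm-2": "vm-1", "vm-3": "vm-1"},
    "func-1": {"vm-2": None, "vm-1": "vm-2"},  # vm-2 is the root of func-1's tree
}

def upstream_for(function_id: str, peer: str) -> Optional[str]:
    """Where `peer` should pull `function_id`'s image data from."""
    return function_trees[function_id][peer]

# vm-2 overlaps two trees and has a different role in each:
assert upstream_for("func-0", "vm-2") == "vm-1"   # mid-tree node for func-0
assert upstream_for("func-1", "vm-2") is None     # root for func-1
```

The key property is that the lookups for different functions never interfere: reclaiming vm-2 would trigger a rebalance in both trees, but each tree is updated independently.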

In addition, thanks to function awareness, we limit the maximum number of functions a machine can hold, further containing the problem of uncontrolled data pulls by multiple functions.

Discussion on the correctness of the design
  • As the FC integration shows, because all FT nodes are equivalent, no central node is required;
  • The topology manager does not live in the cluster: the FC system component (the scheduler) maintains this state in memory and sends it to each peer node over gRPC along with the create-container request;
  • FT adapts to the high dynamism of FaaS workloads: as nodes join and leave the cluster at any scale, FT automatically updates its shape;
  • Networking at the coarser granularity of the function, and implementing FT as a binary tree, greatly reduces the number of network connections on each peer node;
  • Networking isolated by function naturally realizes function awareness, improving the security and stability of the system.
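The connection-count reduction from the fourth point can be made concrete with illustrative numbers (peer count, layer count, and functions-per-VM below are assumptions for the sake of the comparison, though the 29-layer figure matches the evaluation image):

```python
# Worst-case link-count comparison: layer-granular full mesh vs. per-function
# binary tree. All inputs are illustrative.
peers = 100             # machines in the cluster
layers = 29             # layers in the container image (as in the DAS image below)
functions_per_vm = 5    # functions co-located on one VM

# Layer-granular full mesh: a peer may hold a link to every other peer, per layer.
full_mesh_links_per_peer = (peers - 1) * layers

# Function Tree: each peer keeps at most 1 upstream + 2 downstream links per function.
ft_links_per_peer = 3 * functions_per_vm

print(full_mesh_links_per_peer, ft_links_per_peer)
```

Even granting that real meshes rarely hit the worst case, the per-peer link count in FT is bounded by a small constant per function rather than growing with cluster size or layer count.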

3. Performance evaluation


In the experiments, we selected the image of Alibaba Cloud's database DAS application scenario, with Python as the base image; the container image is 700+ MB before decompression and has 29 layers. We interpret the stress-test portion here; please refer to the original paper for the full results. In the evaluation, we compared against Alibaba's DADI, Dragonfly, and Uber's open-source Kraken.

1) Stress test


The latency recorded in the stress test is the average end-to-end cold-start latency perceived by the user. First, we can see that compared with traditional FC, image acceleration significantly reduces end-to-end latency; however, as concurrency increases, more machines pull data from the central container registry simultaneously, and the resulting contention for network bandwidth drives end-to-end latency back up (orange and purple bars). In FaaSNet, thanks to the decentralized design, no matter how high the pressure on the source, only one root node pulls data from the source and distributes it downward, so the system is extremely scalable and the average latency does not rise with concurrent pressure.

5.png

At the end of the stress-test section, we explored performance when functions with different images (multi-function) are placed on the same VM. Here we compare FaaSNet against FC with image acceleration enabled and DADI P2P installed (DADI+P2P).

6.png

The vertical axis of the figure above represents normalized end-to-end latency. As the number of functions with different images grows, DADI P2P has to manage many more layer trees, and because each ECS in FC has a small specification, the bandwidth of each VM comes under pressure, stretching end-to-end latency to more than 200%. FaaSNet, which establishes connections at the image level, maintains far fewer connections than DADI P2P's layer trees and thus still performs well.

Summary


High scalability and fast image distribution allow FaaS providers to better unlock custom-container-image scenarios. FaaSNet uses a lightweight, decentralized, self-balancing Function Tree to avoid the performance bottleneck of central nodes, introduces no additional system overhead, and fully reuses FC's existing system components and architecture. FaaSNet realizes function-aware, on-the-fly networking based on workload dynamics, without any prior workload analysis or preprocessing.

FaaSNet's target scenario is not limited to FaaS. In many cloud-native settings, such as Kubernetes and Alibaba SAE, it can help handle sudden traffic and address the pain point of user experience degraded by excessive cold starts, fundamentally solving the problem of slow container cold starts.

FaaSNet is the first work by a domestic cloud vendor published at a top international conference on accelerating container startup to handle burst traffic in serverless scenarios. We hope this work offers new opportunities for container-based FaaS platforms, fully opening the door to the container ecosystem and unlocking more application scenarios, such as machine learning and big data analytics.
