
Abstract: This article is compiled from the talk delivered by Wang Hua (Shang Fu), a senior operation and maintenance expert for Alibaba Cloud real-time computing, in the production practice session of Flink Forward Asia 2021. The main contents include:

  1. Evolution history and operational challenges
  2. Cluster operation and maintenance of Flink Cluster
  3. Application operation and maintenance of Flink Job

Click to view live replay & speech PDF

1. Evolution history and O&M challenges


Alibaba's real-time computing has developed rapidly over the past 10 years. Broadly speaking, it can be divided into three eras:

  • 1.0 era: From 2013 to 2017, three real-time computing engines coexisted, including the familiar JStorm and Blink; at that time real-time computing was also commonly called stream computing.
  • 2.0 era: In 2017 the group merged its three real-time computing engines, and Blink, with its excellent performance and throughput, became the sole real-time computing engine, achieving unification. Over the next four years, all of the group's real-time computing business migrated to Blink, Alibaba's real-time computing business grew at its fastest pace, and the scale of the platform increased from 1,000 to 10,000. All real-time computing ran on Blink.
  • 3.0 era: In the past two years, following Alibaba's acquisition of Flink's German parent company, the Alibaba China and German teams have jointly built the new VVP platform on a cloud-native base, powered by the new open source Flink engine. During the 2021 Double 11, the new VVP platform supported the promotion stably with substantial performance improvements, marking the entry of Alibaba's real-time computing into the 3.0 era.

At present, Alibaba's real-time computing has compute capacity measured in millions of cores, tens of thousands of physical machines, and tens of thousands of jobs, truly forming an ultra-large-scale real-time computing platform. Moreover, alongside rapid business growth, the overall architecture of the platform is undergoing a large-scale evolution from off-cloud Hadoop plus Flink to cloud-native K8s plus Flink.


Faced with such a behemoth of real-time computing, O&M has faced different challenges as the platform has evolved:

  • The first stage is platform operation and maintenance. The core is helping SRE operate an ultra-large-scale platform, that is, the problem of Flink Cluster operation and maintenance;
  • The second stage is application operation and maintenance. The core is helping the large number of real-time computing users on the cluster with the complex problem of operating their Flink jobs on the application side;
  • The third stage is that, with the arrival of the 3.0 era, the cluster base is fully cloud-native and global data is being standardized along with it. How to rapidly evolve our operation and maintenance capabilities toward cloud native and intelligence has become our new challenge.

2. Cluster operation and maintenance of Flink Cluster


  • On the one hand, a very typical business runs on the Flink platform: the real-time GMV counter shown on the media big screen on the day of the Double 11 promotion, the well-known large transaction-volume dashboard. This business demands extremely high stability. Besides the GMV counter, Flink also carries all of Alibaba's important real-time computing services, including real-time scenarios for core e-commerce businesses such as Alimama, advertising metering and billing, search and recommendation, and machine learning platforms. These real-time scenarios are both important and latency sensitive, and stability is the number one challenge.
  • On the other hand, the platform is huge, involving tens of thousands of dedicated machines and multi-region deployment, and deployment complexity keeps increasing as the platform grows, so local anomalies become the norm. This is the second major challenge to stability.

The business is important and sensitive; the platform is large and its architecture complex. Faced with this dual challenge, how to keep the cluster stable is a major problem.


Initially, the Flink clusters measured stability by the number of failures, but this granularity is very coarse: many stability anomalies never reach the failure-duration threshold and therefore never show up in the failure count, leaving blind spots. Later, we built a set of SLA availability metrics based on minute-level availability to measure the stability of the entire cluster.

The SLI is the golden indicator used to calculate the SLA; it represents the availability of a Flink cluster. Because a cluster is a virtual, logical concept, we use Flink job states to define the SLI. Flink job state is itself very complex, but we can abstract it into three states: scheduling, running normally, and running abnormally. Each job reports one of these states, and the per-job states are aggregated to the cluster level to form a ratio of abnormal jobs. If the ratio exceeds a certain threshold, the cluster is considered unavailable for that minute; this yields the SLI, from which the unavailable time for the whole year can be calculated.

The final SLA measurement can be expressed as a simple formula: SLA unavailable time = number of SLA exceptions × average duration of each exception. This gives minute-level availability and measures cluster stability at a fine granularity.
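To make this concrete, here is a minimal sketch of the SLI/SLA computation described above. The state names and the abnormal-job ratio threshold are illustrative assumptions, not the platform's actual values.

```python
# A minimal sketch of the minute-level SLI/SLA computation described above.
# The abnormal-job ratio threshold is an assumption for illustration.

ABNORMAL_RATIO_THRESHOLD = 0.05  # assumed: >5% abnormal jobs => cluster unavailable

def minute_is_unavailable(job_states: list[str]) -> bool:
    """job_states holds 'scheduling', 'running', or 'abnormal' for each job."""
    abnormal = sum(1 for s in job_states if s == "abnormal")
    return abnormal / max(len(job_states), 1) > ABNORMAL_RATIO_THRESHOLD

def sla_availability(minute_samples: list[list[str]]) -> float:
    """Aggregate minute-level SLI samples into availability.

    Unavailable time = number of SLA exceptions x average duration per
    exception, which is exactly the total count of unavailable minutes.
    """
    unavailable_minutes = sum(1 for m in minute_samples if minute_is_unavailable(m))
    return 1.0 - unavailable_minutes / max(len(minute_samples), 1)
```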

With fine-grained quantification in place, the next step is improvement. Starting from the formula above, we can optimize two factors: prevent anomalies to reduce the number of SLA exceptions, and recover quickly to shorten the duration of each exception, ultimately improving overall availability.


The first part is SLA exception prevention. The key idea is to inspect the cluster thoroughly, proactively discover hidden risks, and eliminate them in time, thereby reducing the number of SLA exceptions.

What counts as a hidden SLA risk? For example, a batch of very large jobs suddenly starts, driving up the load or filling the disks of hundreds of machines in the cluster and causing a large number of job heartbeat timeouts. Or a certain Flink version has a major stability defect that affects nearly a thousand jobs online. These seemingly rare failure scenarios actually occur almost every day in an ultra-large-scale cluster with a rich variety of business scenarios; this is an unavoidable challenge once a platform reaches a certain scale. Moreover, the larger the cluster, the more likely a butterfly effect becomes, and the larger the impact tends to be. In addition, locating each cluster exception is complex and time-consuming. How do we eliminate these SLA exceptions?

Our approach is to build a Flink Cluster anomaly self-healing service. It regularly scans the behavior data of all online jobs, such as job delay, failover, and back pressure, and then performs anomaly analysis and decision-making on this massive data to find hidden risks. In general, there are two types of anomalies:

  • One type is caused by the user's own job behavior; the user is notified to change the corresponding job, for example OOM caused by unreasonable resource configuration, or delay caused by job back pressure;
  • The other type is caused by problematic versions on the platform side; the platform proactively performs large-scale upgrades to eliminate these problematic versions.

Together, the platform side and the user side form a closed loop of SLA anomaly self-healing, thereby reducing the number of SLA exceptions.

The most complicated part of the anomaly self-healing service is the identification and decision rules behind it. Through long-term accumulation, we have collected dozens of the most frequent anomaly rules and governance solutions on the business side, which fully automatically identify and eliminate hidden risks that were previously "invisible", truly achieving preventive stability.
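As an illustration of how such rules might be organized, here is a minimal sketch of a rule-based diagnosis pass. The rule names, thresholds, and recommended actions are hypothetical; the real service accumulates dozens of rules distilled from production experience.

```python
# A minimal sketch of rule-based anomaly identification and decision-making.
# Rule names, thresholds, and actions are hypothetical examples.

RULES = [
    # (rule name, predicate over a job's behavior snapshot, suggested action, owner)
    ("oom_from_low_memory",
     lambda j: j.get("oom_count", 0) > 3,
     "increase TaskManager memory", "user"),
    ("delay_from_backpressure",
     lambda j: j.get("delay_sec", 0) > 300 and j.get("backpressured", False),
     "scale up the backpressured operator", "user"),
    ("defective_engine_version",
     lambda j: j.get("engine_version") in {"example-buggy-version"},
     "platform-side batch upgrade", "platform"),
]

def diagnose(job_snapshot: dict) -> list[dict]:
    """Return every matched rule so the platform or the user can act on it."""
    return [
        {"job": job_snapshot.get("name"), "rule": name, "action": action, "owner": owner}
        for name, check, action, owner in RULES
        if check(job_snapshot)
    ]
```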


According to the SLA formula, besides prevention to reduce the number of SLA exceptions, the other lever is to shorten the duration of an exception once it occurs.

The challenge is that a single online cluster runs nearly 10,000 jobs, and cluster-level faults are hard to localize and slow to recover from. With many clusters distributed widely, the probability of failure also increases. With the two combined, several failures a year had almost become the norm, and overall stability was passive. We need to turn passive into active: if we can achieve cluster-level disaster tolerance by quickly switching business traffic away in failure scenarios, SLA exception recovery not only becomes shorter, it also becomes more deterministic.

The disaster recovery system is mainly divided into three parts:

  • First, where to switch to. Real-time computing requires millisecond-level network latency, and the tens of milliseconds incurred across cities would not meet the real-time requirement. Therefore, the platform deployment architecture uses two data centers in the same city, each acting as the disaster recovery site for the other.
  • Second, resource capacity is limited. A platform this large cannot budget full disaster recovery resources, so trade-offs are required: how to distinguish business priorities, protect high-priority business, and discard low-priority business. The platform established a set of priority standards for Flink jobs according to business scenarios, supported by an automated management system covering the whole process from application to governance, rectification, downgrade, and rollout. Fine-grained prioritization on the business side ensures that genuinely high-priority business is identified, so that under limited resources disaster recovery capacity is focused on the high-priority business.
  • The last and most complex part is how to switch jobs away transparently. The core idea is to reuse storage and make the compute switch transparent, so that the business is unaffected.


Flink jobs are long-running and stateful, carrying intermediate computation results. First, compute and storage clusters must be physically separated in the deployment architecture. When a compute cluster fails, for example due to an infrastructure anomaly, all Flink jobs can be migrated to the disaster recovery cluster by switching traffic, while state storage still points to the old storage cluster, so jobs can be restored from their original state points. This achieves a truly transparent migration that is imperceptible to users.
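As a rough illustration of this "reuse storage, switch compute" idea, the sketch below resubmits a job to a disaster recovery cluster while restoring from state kept on the shared storage cluster. The paths, the checkpoint-lookup helper, and the direct use of the Flink CLI are assumptions for illustration; the actual platform automates the cut-over end to end.

```python
# A rough sketch of failing a job over to a DR compute cluster while keeping
# state on the original storage cluster, assuming externalized checkpoints
# reachable from both compute clusters.
import subprocess

SHARED_STATE_ROOT = "hdfs://state-cluster/flink/checkpoints"  # assumed layout

def latest_checkpoint(job_id: str) -> str:
    # Hypothetical helper: in practice, list the job's checkpoint directory on
    # the shared storage cluster and return the newest completed chk-* path.
    return f"{SHARED_STATE_ROOT}/{job_id}/chk-latest"

def failover_to_dr(job_id: str, job_jar: str, dr_jobmanager: str) -> None:
    """Resubmit one job to the DR compute cluster, restoring its old state."""
    subprocess.run(
        ["flink", "run",
         "-m", dr_jobmanager,              # target the DR cluster's JobManager
         "-s", latest_checkpoint(job_id),  # restore from state on the old storage cluster
         job_jar],
        check=True,
    )
```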


Besides day-to-day stability, Double 11 is the biggest test of stability. Flink's special Double 11 guarantee can be summarized in four parts: stress testing, downgrading, rate limiting, and hotspot handling. Behind each part is a mature supporting system.

  • The first part, stress testing, refers to the stress-testing platform. It first gives users the ability to clone production jobs into shadow jobs with one click; it then provides large-scale, accurate load generation, load control, and load adjustment capabilities plus automated performance tuning; and the final step of going to production is a one-click, fully automated, one-stop stress-testing solution.
  • The second part, downgrading, refers to the downgrade platform: at the 00:00 peak of the promotion, low-priority services must be downgraded quickly to keep the cluster water level under reasonable control.
  • The third part is rate limiting: some medium- and high-priority services cannot be downgraded during the promotion but can tolerate short-term delay, so the platform also implements isolation and limiting of job pod resources based on Linux kernel cgroups, achieving accurate compute throttling at job granularity (a minimal sketch of this cgroup-based limiting follows this list).
  • The fourth part is hotspot machines, which is also the most complex aspect of the promotion. From the cluster's perspective, there is a gap between the resources the cluster has allocated and the resources jobs actually use. For example, a Flink job may request 10 CPUs but actually use only 5, and usage also has peaks and valleys, which leads to uneven water levels at the cluster level.
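For the rate-limiting item above, the following is a minimal sketch of cgroup-based CPU capping at pod granularity, assuming cgroup v1 with the cpu controller mounted at the conventional path. The cgroup path layout is illustrative; on a real K8s node the kubelet and container runtime manage these hierarchies.

```python
# A minimal sketch of cgroup v1 CPU capping at job-pod granularity.
from pathlib import Path

CPU_ROOT = Path("/sys/fs/cgroup/cpu")
PERIOD_US = 100_000  # default CFS period: 100 ms

def limit_pod_cpu(pod_cgroup: str, cpu_cores: float) -> None:
    """Cap a pod's CPU usage to `cpu_cores` cores via the CFS quota."""
    cg = CPU_ROOT / pod_cgroup
    (cg / "cpu.cfs_period_us").write_text(str(PERIOD_US))
    # quota = cores * period, e.g. 2.5 cores -> 250000 us per 100000 us period
    (cg / "cpu.cfs_quota_us").write_text(str(int(cpu_cores * PERIOD_US)))

# Hypothetical usage (path is illustrative):
# limit_pod_cpu("kubepods/burstable/pod1234/flink-taskmanager", 2.5)
```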


At the cluster scheduling level, resource allocation across machines looks very even, with CPU and memory almost on the same line. However, the actual physical water levels of the machines are uneven, because the scheduler is not aware of physical usage. As the cluster water level keeps rising, for instance at the 00:00 peak of the promotion, the hottest machines in the cluster climb even higher, until some resource dimension on certain machines hits its performance bottleneck, for example CPU utilization of 95% or more, producing hotspot machines.

In a distributed system, the services on each machine are stateful and interrelated. Local hotspot machines not only affect cluster stability but also become the bottleneck for raising cluster utilization, wasting cost. In other words, hotspot machines are the weak point for both cluster stability and water-level improvement.


Resolving hotspot machines is a very hard problem and generally goes through four steps:

  1. The first step is detecting hotspot machines, covering CPU, memory, network, and disk. The difficulty is that the hotspot thresholds come from SREs' rich hands-on experience online (a minimal detection sketch follows this list).
  2. The second step is analysis. We built a series of machine diagnostic tools to locate the hot processes, including tracing CPU usage and IO back to specific processes. The difficulty is that this requires a deep understanding of how the Linux system works.
  3. The third step is business decision-making and strategy: mapping from the hot process to the business behind it and deciding based on its data; different priorities can accept different strategies.
  4. The last step is actually resolving the hotspot machine: low-priority jobs are downgraded or rebalanced, while medium- and high-priority jobs relieve the hotspot by diverting load.
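For step 1, a minimal threshold-based detection sketch might look like the following. The dimensions, thresholds, and metrics source are illustrative assumptions that would in practice encode SRE experience and come from the monitoring system.

```python
# A minimal threshold-based sketch of hotspot machine detection.

THRESHOLDS = {"cpu": 0.95, "mem": 0.90, "disk": 0.85, "net": 0.80}

def find_hotspots(machine_metrics: dict[str, dict[str, float]]) -> list[tuple[str, str, float]]:
    """machine_metrics maps hostname -> {dimension: utilization in [0, 1]}."""
    hotspots = []
    for host, metrics in machine_metrics.items():
        for dim, value in metrics.items():
            if value >= THRESHOLDS.get(dim, 1.1):  # unknown dimensions never trigger
                hotspots.append((host, dim, value))
    return hotspots

# Example:
# find_hotspots({"host-01": {"cpu": 0.97, "mem": 0.60}})  # -> [("host-01", "cpu", 0.97)]
```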


This process involves business understanding (priority, resources, configuration profiles), understanding of scheduling principles (resource allocation strategy, scheduling strategy), deep investigation and analysis of the system kernel, and business experience and strategies such as throttling or downgrading. Defining, analyzing, and deciding across this whole chain is a very complicated technical problem.


What we are doing is distilling the complete hotspot-machine solution into a cloud-native Flink Cluster AutoPilot on K8s, to achieve fully automatic self-healing of hotspot machines.

In terms of deployment, AutoPilot services are fully managed on K8s, deployed lightweight per cluster, and managed and operated through configuration files; in the execution phase, K8s guarantees the desired final state and eventual consistency. In terms of technical capability, AutoPilot abstracts the end-to-end handling of hotspot machines into six stages: definition, sensing, analysis, decision, execution, and observability of the whole process. This provides fully automatic self-healing and high observability for hotspot machines, improving cluster stability and reducing costs.
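As a rough illustration of this staged design, the sketch below chains six stage functions over a shared context. The stage functions themselves are placeholders for the real platform logic.

```python
# A rough sketch of a six-stage AutoPilot cycle (define, sense, analyze,
# decide, execute, observe). Each stage enriches a shared context.
from typing import Callable

Stage = Callable[[dict], dict]

def run_autopilot_cycle(cluster_snapshot: dict, stages: list[Stage]) -> dict:
    """Run one self-healing cycle by passing a context through every stage."""
    context = {"snapshot": cluster_snapshot, "findings": [], "actions": []}
    for stage in stages:
        context = stage(context)
    return context

# Hypothetical wiring, with each function implemented against the monitoring,
# scheduling, and execution systems:
# run_autopilot_cycle(snapshot, [define, sense, analyze, decide, execute, observe])
```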


Over the past few years, around the three core O&M values of stability, cost, and efficiency, SRE has accumulated substantial operation and maintenance capabilities and a fairly complete O&M platform for Flink Cluster's ultra-large-scale cluster operations. However, with the arrival of the cloud-native wave, how to make O&M capabilities more standardized on a cloud-native basis, and how to establish more unified standards for the O&M interfaces, operating modes, execution modes, and observability of O&M processes, will become our key directions going forward. Flink Cluster AutoPilot will be the carrier of these new cloud-native technologies, carrying the continuous evolution and upgrade of the O&M system.

3. Application operation and maintenance of Flink Job


Riding the general trend of real-time computing, the number of Flink users and jobs has grown rapidly, and the platform now runs tens of thousands of jobs. It is well known, however, that operating Flink jobs is very complex. To list the questions users ask most frequently: why does my job start slowly, why does it fail over, why is it backpressured, why is it delayed, and how should I adjust resource configuration to reduce cost? These seemingly simple questions are actually very hard to answer.

There are two difficulties in Flink job operation and maintenance. On one hand, the distributed system has many components along the full link, with very complex dependencies; on the other hand, Flink's own internals, especially at the runtime level, are very complicated. Therefore, we want to convert our accumulated O&M knowledge, including a deep understanding of the call flow across the whole system link and the working principles of each component, plus rich troubleshooting experience and methodology from daily operations and Double 11 promotions, into data, rules, and algorithms, and precipitate them into O&M product features.

This product mainly has two functions: Flink Job Adviser, which discovers and diagnoses job anomalies, and Flink Job Operator, which fixes them. Together they solve the problem of Flink job operation and maintenance.


From the user's perspective, Flink Job Adviser works like this: the user simply enters the job name or job link and @-mentions a bot, and the Adviser service is invoked.

For example, in Case 1, a job cannot start because of insufficient resources. The Adviser returns the diagnosis, insufficient resources for that job, together with an improvement suggestion, so the user can go to the console and scale up the corresponding resources.

In Case 2, one of the user's jobs failed over and the user wants to know why. By correlating global data, the Adviser concludes that the cause was a machine being taken offline on the platform side, or the self-healing of a hardware failure, and recommends that the user does nothing and simply waits for automatic recovery.

In Case 3, a job's memory configuration is unreasonable and it frequently runs out of memory, causing failovers. The Adviser recommends that the user adjust the memory configuration of the corresponding compute nodes to avoid further failovers.


Behind Flink Job Adviser are dozens of anomaly diagnosis capabilities for complex scenarios, forming a large empirical decision tree. It can not only locate anomalies that are already happening but also prevent them, and it mainly consists of three parts:

  • In the before-the-fact part, predictions are made based on job runtime metrics and global system events to discover potential risks in advance, achieving prevention. For example, a job may show early failover symptoms or sit on a version with known problems; such anomalies have not yet really affected the job, and the goal is to find them before they do.
  • In the during-the-fact part, diagnosis covers the whole job lifecycle: start and stop problems such as failing to start, starting slowly, or failing to stop; insufficient performance and delay while running; and errors during execution as well as data-consistency and correctness issues.
  • In the after-the-fact part, users can do a full retrospective of historical jobs, for example to see why a job failed over in the middle of the night.


In the concrete implementation of the decision tree, a few typical and complex nodes are worth sharing.

  • The first is the status check across the whole job lifecycle. A job goes from console submission to resource allocation, then to environment setup and dependency download, to topology creation, to loading upstream and downstream systems, and finally to data processing; the whole link is a very complex process. The Adviser collects and analyzes the durations and all events of the key nodes in a unified way, and can therefore diagnose and locate anomalies in any state of the job.
  • The second is performance problems in the running state, mainly anomaly detection on various real-time monitoring metrics, or finding and analyzing anomalies by judging against empirical values and thresholds. For example, if a job is delayed, first find the backpressured node, then the TaskManager it runs on, then analyze that machine, and perhaps finally find that the machine has high load. This builds an evidence chain across the whole link and a correlated drill-down analysis that locates the real root cause.
  • The third, and most frequent, is errors during job execution. The core idea is to collect logs from the various components, such as submission logs, scheduling logs, failover logs, and JobManager and TaskManager logs, and run log clustering algorithms over this massive volume of anomalous logs, including natural language processing and feature extraction, turning unstructured logs into structured data and merging and compressing similar entries. Finally, SRE and R&D annotate causes and suggestions, forming a complete body of expert experience. A minimal sketch of this clustering idea follows.
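As a rough illustration of the log clustering step, the sketch below normalizes the variable parts of each log line into a template and groups lines by template. The normalization rules are deliberately simplistic assumptions; the real system uses much richer NLP and feature extraction plus expert annotation.

```python
# A minimal sketch of log clustering by template extraction.
import re
from collections import defaultdict

def to_template(line: str) -> str:
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # memory addresses
    line = re.sub(r"/[^\s]+", "<PATH>", line)        # file paths
    line = re.sub(r"\d+", "<NUM>", line)             # ids, ports, counts
    return line.strip()

def cluster_logs(lines: list[str]) -> dict[str, list[str]]:
    """Group raw exception log lines under their extracted template."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        clusters[to_template(line)].append(line)
    return dict(clusters)

# Example:
# cluster_logs(["Task 17 lost on host-3", "Task 42 lost on host-9"])
# -> {"Task <NUM> lost on host-<NUM>": ["Task 17 lost on host-3", "Task 42 lost on host-9"]}
```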

The earliest implementation of the decision tree used static rules, but as scenarios grew more complex, especially with the explosion of data and the emergence of highly individual cases, static rules can no longer meet our needs: what counts as delay differs for every job, and error patterns can no longer be maintained by regular-expression matching. We are actively introducing various AI techniques to solve these personalized problems.


After an anomaly is located by Flink Job Adviser, Flink Job Operator is needed to fix it and close the loop.

The Operator's capabilities consist of four parts:

  • The first capability is upgrading: transparently moving jobs off problematic versions and hot-updating configuration, resolving hidden stability risks in job code and configuration.
  • The second is tuning: based on Alibaba's internal Autopilot, it tunes job resource configuration and performance, helping user jobs with performance and cost issues.
  • The third is migration: jobs are migrated transparently across clusters, helping users manage jobs efficiently at large scale.
  • The last is self-healing repair: based on the various risks and rules diagnosed by the Adviser, it provides matching one-click repair capabilities.


With the development of real-time computing, operation and maintenance has evolved and upgraded from manual, to tool-based, to platform-based, to intelligent, and now to cloud-native, and these capabilities are being consolidated into the real-time computing management and control products to solve the problem of ultra-large-scale real-time computing operation and maintenance.

At the center of the whole system are the two O&M objects, the cluster and the application. Around them, the goals and value of O&M have always revolved around stability, cost, and efficiency. The carrier of the O&M system, technologies, and products is the real-time computing management and control platform, through which we serve real-time computing users, product and engineering teams, SRE, and ourselves. Meanwhile, the technical core of O&M management and control is evolving fully toward intelligence and cloud native.

To sum up in one sentence: with intelligence and cloud native as the technical core, we build real-time computing O&M management and control products to solve the three major problems of stability, cost, and efficiency encountered in operating ultra-large-scale Flink clusters and applications.


Click to view live replay & speech PDF

For more Flink-related technical questions, you can scan the QR code to join the community DingTalk group and get the latest technical articles and community news as soon as possible. Please also follow the official account.


