
Abstract: This article is compiled from the talk given by Liu Jiangang, a technical expert on the Kuaishou real-time computing team, in the production practices session of Flink Forward Asia 2021. The main contents include:

  1. The history and current status of Flink at Kuaishou
  2. Flink fault-tolerance improvements
  3. Flink engine control and practice
  4. Kuaishou's batch processing practice
  5. Future plans

Click to view live replay & speech PDF

1. The history and current status of Flink at Kuaishou

img

Kuaishou began investing deeply in Flink in 2018. After four years of development, the real-time computing platform has gradually matured and empowered various surrounding components.

  • In 2018, we built the platform around Flink 1.4 and greatly improved operation and maintenance capabilities to make it production-ready.
  • In 2019, we started iterative development based on version 1.6, and many businesses began to go real-time. For example, we optimized interval joins, which brought significant benefits to businesses such as commercialization, and developed real-time multi-dimensional analysis to speed up very large multi-dimensional reports. In the same year, our Flink SQL platform also went into use.
  • In 2020, we upgraded to 1.10, made many improvements to SQL functionality, and further optimized Flink's core engine to ensure its ease of use, stability, and maintainability.
  • In 2021, we began to support offline computing and the construction of the integrated lake-warehouse (lakehouse) architecture, further completing the Flink ecosystem.

img

The picture above is the Flink-based technology stack of Kuaishou.

  • The core, at the bottom layer, is the Flink computing engine, covering both stream computing and batch processing. We have done a lot of work here on stability and performance.
  • The outer layer consists of the peripheral components that work with Flink, including middleware such as Kafka and RocketMQ, data analysis tools such as ClickHouse and Hive, and data lakes such as Hudi. Users can build various applications on Flink and these components, covering real-time, near-real-time, and batch scenarios.
  • The outermost layer is the concrete usage scenarios, such as common video-related businesses, e-commerce, and commercialization; application scenarios include machine learning and multi-dimensional analysis. In addition, many technical departments implement data import and conversion on top of Flink, such as CDC and lake-warehouse integration.

img

In terms of application scale, we have 500,000 CPU cores, with resources hosted mainly on Yarn and K8s. There are 2,000+ jobs running on the platform; peak processing reaches 600 million records per second, and about 31.7 trillion records are processed per day. During holidays or major events, traffic can even double.

2. Flink fault-tolerance improvements

img

Fault tolerance mainly includes the following parts:

  • The first is single-point recovery, which supports in-place restart when any number of tasks fail, so long-running jobs can keep streaming almost without interruption.
  • The second is the response to cluster failures, including cold standby, hot standby, and the dual-cluster integration of Flink and Kafka.
  • The last is the use of blacklists.

img

To achieve exactly-once, Flink needs to restart the entire job if any node fails. A global restart causes a long pause, up to ten minutes. Some scenarios do not pursue exactly-once, for example real-time recommendation, but they have high availability requirements and cannot tolerate job interruption; scenarios with slow initialization, such as model training, need a particularly long time to restart, so a full restart has a large impact on them. Based on these considerations, we developed the single-point recovery function.

img

The figure above shows the basic principle of single-point recovery. There are three tasks, and the middle one fails. First, Flink's master node reschedules the middle task; during this time the upstream and downstream tasks do not fail but wait to reconnect. After the middle task is scheduled successfully, the master node notifies the downstream task to reconnect to it, and the middle task also reconnects to its upstream task, rebuilding the read view to resume data reading. Once upstream and downstream are connected, the job works normally again.

img

Having understood the basic principle, let's look at how multiple tasks are recovered online. In a real environment, multiple tasks often fail at the same time; in that case we recover the failed tasks one by one in topological order. For example, in the figure above, recovery proceeds from left to right.

Since this function was launched, nearly 100 internal jobs have adopted it. Under ordinary failures these jobs keep streaming; even with small traffic fluctuations, the business is unaware. The business side has completely bid farewell to the nightmare of stream interruptions.

img

A cluster failure is fatal: all data would be lost and services would go down. Our solution mainly includes cold standby, hot standby, and the dual-cluster integration of Flink and Kafka.

img

Cold standby mainly means backing up data so that, after a cluster goes down, the computing jobs can be started quickly in another cluster.

As shown in the figure above, KwaiJobManager is Kuaishou's job management service, and its failover coordinator is mainly responsible for fault handling. We save all jar packages and other files in HDFS and all metadata in MySQL, both of which are highly available. The job runs on the main cluster ClusterA. Online jobs use incremental snapshots, which creates file dependencies, so we regularly take savepoints and copy them to the standby cluster. To avoid accumulating too many files, we also delete historical snapshots on a schedule.
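The savepoint-and-copy step can be sketched with Flink's public REST API. The snippet below is only an illustration under that assumption, with hypothetical class name and paths; Kuaishou's internal failover coordinator is not open source.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of one backup cycle: trigger a savepoint on the main cluster through
// Flink's REST API, then (not shown) copy the resulting directory to the
// standby cluster's HDFS and prune old snapshots. Paths are illustrative only.
public class SavepointBackupSketch {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    static void triggerSavepoint(String jobManagerUrl, String jobId) throws Exception {
        String body = "{\"target-directory\":\"hdfs://clusterA/flink/savepoints\",\"cancel-job\":false}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(jobManagerUrl + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        // The response carries a trigger id that a coordinator would poll until the
        // savepoint completes, after which the directory is copied to the standby
        // cluster (for example with distcp) and the oldest snapshot is deleted.
        System.out.println("savepoint trigger response: " + resp.body());
    }
}
```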

Once the service detects that cluster A has failed, it immediately starts the job in cluster B and restores it from the most recent snapshot, ensuring that no state is lost. Users only need to configure the active and standby clusters; the platform does the rest, and the failover is transparent to them throughout.

img

Hot standby means two clusters run the same job at the same time. Our hot standby covers the full link: both Kafka and ClickHouse run in duplicate. The display layer at the top uses only one copy of the result data; in the event of a failure it immediately switches to the other copy. The switch takes less than one second and is transparent to users.

Compared with cold standby, hot standby needs the same amount of resources again for the backup job, but it switches faster, which makes it better suited to extremely demanding scenarios such as the Spring Festival Gala.

img

The dual-cluster integration of Flink and Kafka exists mainly because Kuaishou's Kafka has dual-cluster capability, so Flink needs to support reading and writing dual-cluster Kafka topics; this lets Flink switch seamlessly online when one Kafka cluster goes down. As shown in the figure above, our Flink abstracts over the Kafka dual cluster: one logical topic corresponds to two physical topics underneath, each made up of multiple partitions. Flink consumes the logical topic, which is equivalent to reading data from both underlying physical topics at the same time.

We abstract all kinds of cluster changes as partition expansion and contraction of the logical topic. For example, a cluster going down can be seen as a partition shrink of the logical topic; going from a single cluster to a dual cluster can be seen as an expansion; migrating a topic can be seen as an expansion followed by a shrink. We use dual clusters as an example here, but whether there are two clusters or more, the principle is the same, and we support it.
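A minimal sketch of the logical-topic idea follows; class and field names are hypothetical and do not reflect Kuaishou's implementation. The source sees one logical topic whose partition list is the union of the partitions of the underlying physical topics, so adding or removing a cluster is just a partition expansion or shrink.

```java
import java.util.ArrayList;
import java.util.List;

// One logical topic fanned out to physical topics on several Kafka clusters.
public class LogicalTopic {
    // (cluster, topic, partition) triple that the source actually reads from.
    public record PhysicalPartition(String cluster, String topic, int partition) {}

    private final List<PhysicalPartition> partitions = new ArrayList<>();

    // Adding a cluster looks like a partition expansion of the logical topic.
    public void addPhysicalTopic(String cluster, String topic, int partitionCount) {
        for (int p = 0; p < partitionCount; p++) {
            partitions.add(new PhysicalPartition(cluster, topic, p));
        }
    }

    // Removing a failed cluster looks like a partition shrink: the remaining
    // partitions keep serving the same logical topic.
    public void removeCluster(String cluster) {
        partitions.removeIf(pp -> pp.cluster().equals(cluster));
    }

    public List<PhysicalPartition> currentPartitions() {
        return List.copyOf(partitions);
    }
}
```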

img

The blacklist function is needed in the following two situations. The first is that faulty machines are scheduled repeatedly, causing jobs to fail frequently. The other is that, due to hardware or network problems, individual Flink nodes get stuck without actually failing.

For the first case, we developed threshold-based blocking: if a job fails or fails to deploy on the same machine more times than the configured threshold, that machine is blocked. For the second case, we established an exception classification mechanism; for network and disk stalls, we directly evict the container and block the machine. In addition, we exposed the blocking interface externally and connected it with systems such as operations and Yarn to enable real-time blocking. We also took the Flink blacklist as an opportunity to build a complete hardware exception handling process, with automatic job migration, fully automated operations, and no user perception.
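The threshold-based blocking for the first case can be sketched as a simple failure counter per host; the class below is an illustration with hypothetical names, not the actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// If tasks fail or fail to deploy on the same host more than `threshold`
// times, the host is blacklisted and no longer receives containers.
public class HostBlacklist {
    private final int threshold;
    private final Map<String, AtomicInteger> failureCounts = new ConcurrentHashMap<>();
    private final Map<String, Boolean> blocked = new ConcurrentHashMap<>();

    public HostBlacklist(int threshold) {
        this.threshold = threshold;
    }

    // Called whenever a task fails or fails to deploy on a host.
    public void recordFailure(String host) {
        int count = failureCounts.computeIfAbsent(host, h -> new AtomicInteger()).incrementAndGet();
        if (count >= threshold) {
            blocked.put(host, Boolean.TRUE);
        }
    }

    // The scheduler consults this before placing new containers on a host.
    public boolean isBlocked(String host) {
        return blocked.containsKey(host);
    }
}
```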

3. Flink engine control and practice

3.1 Flink real-time control

img

For long-running real-time jobs, users often need to make changes, such as adjusting parameters to change behavior, as well as operational actions such as downgrading a job or changing its log level. These changes require a restart to take effect, which sometimes takes minutes to tens of minutes; on important occasions this is intolerable. For example, during a major event or at a key point in troubleshooting, stopping the job would ruin everything, so we need to adjust a job's behavior in real time without stopping it. This is real-time control.

img

From a broader perspective, Flink is not only a computing task but also a long-running service. Our real-time control is built on this view: it provides an interactive control mode for real-time computing. As shown in the figure above, the user interacts with the Flink dispatcher through classic key/value messages. After Flink receives a message, it first persists it to ZooKeeper for failover, then performs the corresponding control according to the message, for example controlling the resource manager, the job master, or other components.

img

We not only support user-defined dynamic parameters but also provide many ready-made system controls. User customization mainly uses a RichFunction to obtain dynamic parameters and implement the corresponding logic, so that parameters can be passed in while the job is running to achieve real-time control.
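Kuaishou's control messages travel through the dispatcher, which is internal. A rough open-source analogue of "a function reading dynamic parameters at runtime" can be built with Flink's standard broadcast state, as in the sketch below; the stream names and the sampling parameter are assumptions for illustration.

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// A control stream of key/value messages is broadcast to every subtask; the
// function reads the latest value (here a sampling ratio) while processing data.
public class DynamicParameterJob {
    static final MapStateDescriptor<String, String> PARAMS =
            new MapStateDescriptor<>("dynamic-params", Types.STRING, Types.STRING);

    // `events` and `controls` are hypothetical input streams.
    static DataStream<String> apply(DataStream<String> events,
                                    DataStream<Tuple2<String, String>> controls) {
        BroadcastStream<Tuple2<String, String>> broadcast = controls.broadcast(PARAMS);
        return events.connect(broadcast).process(
            new BroadcastProcessFunction<String, Tuple2<String, String>, String>() {
                @Override
                public void processElement(String value, ReadOnlyContext ctx,
                                           Collector<String> out) throws Exception {
                    String ratioStr = ctx.getBroadcastState(PARAMS).get("sample.ratio");
                    double ratio = ratioStr == null ? 1.0 : Double.parseDouble(ratioStr);
                    if (Math.random() < ratio) {      // sample with the latest parameter
                        out.collect(value);
                    }
                }

                @Override
                public void processBroadcastElement(Tuple2<String, String> kv, Context ctx,
                                                    Collector<String> out) throws Exception {
                    ctx.getBroadcastState(PARAMS).put(kv.f0, kv.f1);  // update the parameter
                }
            });
    }
}
```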

The real-time control capabilities provided by the system mainly include data source rate limiting, sampling, resetting Kafka offsets, adjusting snapshot parameters, and operations-related functions such as changing the log level and blocking nodes. In addition, we also support dynamic modification of some native Flink configurations.

Kuaishou has productized the real-time control function, making it very convenient to use.

3.2 Source Control Capability

img

When Flink processes historical data or a job's performance cannot keep up, the following problems arise:

First, the parallel instances of the source process data at inconsistent speeds, which further aggravates data disorder, data loss, and slow alignment. Second, snapshots keep growing, seriously affecting job performance. In addition, traffic and resource usage are uncontrolled, which leads to stability problems such as maxed-out CPU and OOM under high load.

Since Flink is a pipelined real-time computation, starting from the data source can solve these problems at the root.

img

First, let's look at accurate replay of historical data. The figure above shows consuming Kafka's historical data at twice the normal rate; after the Flink job catches up with the lag, it switches to real-time consumption. In this way the stability problems of complex jobs can be effectively solved.

The formula in the figure above is the basic principle: consumption rate = Kafka time difference / Flink system time difference. Users only need to configure the multiplier.
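Written out in symbols (notation mine), the relation the multiplier controls is:

```latex
% r      : configured replay multiplier (e.g. r = 2 for double-speed catch-up)
% t_k    : event time of the Kafka record currently being consumed
% t_sys  : Flink's wall-clock (system) time
\[
  r \;=\; \frac{t_k^{\mathrm{now}} - t_k^{\mathrm{start}}}{t_{\mathrm{sys}}^{\mathrm{now}} - t_{\mathrm{sys}}^{\mathrm{start}}}
\]
% With r = 2, every wall-clock second of the job advances Kafka event time by
% two seconds; once the lag is gone, consumption naturally falls back to real time.
```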

img

Another capability is QPS rate limiting. When data traffic is heavy, Flink's load becomes high and jobs become unstable. Based on the token bucket algorithm, we implemented a distributed rate-limiting strategy that effectively reduces the pressure on Flink. After enabling the QPS limit, the job became very healthy, as can be seen in the green part of the figure above. With this technique we guaranteed flexibility and availability for the 2019 Spring Festival Gala big screen.

In addition, we support automatic adaptation to partition changes and real-time control, so users can adjust a job's QPS anytime, anywhere.
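The per-subtask part of such a limiter is a plain token bucket. The sketch below shows only that piece; the distributed strategy described above would additionally divide the global QPS budget among the source subtasks.

```java
// Minimal token bucket: tokens refill at `qps` per second up to `capacity`,
// and each emitted record consumes one token.
public class TokenBucket {
    private final double qps;        // tokens added per second
    private final double capacity;   // maximum burst size
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double qps, double capacity) {
        this.qps = qps;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Called before emitting a record; returns false if the record must wait.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * qps);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```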

img

The last function is data source alignment, which mainly means watermark alignment. Each subtask periodically reports its watermark progress to the master node, mainly the watermark's value and speed. The master node computes the target for the next cycle, that is, the expected maximum watermark, adds a diff, and returns it to each node. Each source task then ensures that its watermark in the next cycle does not exceed this target. The formula at the bottom of the figure above computes the target: it predicts each task's watermark at the end of the next cycle, adds the maxDiff we allow, and takes the maximum. This keeps the progress of all sources consistent and avoids stability problems caused by an overly large diff.
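In symbols (my notation; the exact form on the slide may differ), the target for a control cycle of length T could look like:

```latex
% w_i : current watermark of source subtask i     v_i : its watermark speed
% T   : length of the next control cycle          maxDiff : allowed gap
\[
  \mathrm{target} \;=\; \max_i \bigl( w_i + v_i \cdot T \bigr) \;+\; \mathrm{maxDiff}
\]
% Each source throttles itself so its watermark stays below `target` during the
% cycle, which bounds how far the fastest source can run ahead of the slowest.
```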

3.3 Job Balance Scheduling

img

Resource imbalance often occurs in the production environment. The first issue is that Flink's tasks are not evenly distributed, so task manager resources are used unevenly, and job performance is often limited by the busiest node. For this we developed a balanced job scheduling strategy. The second issue is that CPU usage is unbalanced: some machines are full while others are idle. For this we developed CPU-balanced scheduling.

img

In the figure above there are three JobVertexes connected by hash shuffles. The middle part of the figure shows Flink's default scheduling: each JobVertex places its tasks into slots from top to bottom. The result is that the first two slots are very full while the other slots are idle, and the first task manager is very full while the second is idle. This is a typical resource skew, and we optimized it: when scheduling, we first calculate the total resources required, that is, how many task managers are needed; then we calculate how many slots each TM gets, so that slot resources are balanced across TMs; finally, we distribute tasks evenly across slots, so that the tasks within slots are balanced as well.
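The balancing idea reduces to two rounds of even division; the sketch below is a simplified illustration with hypothetical names, not Flink's actual scheduler code.

```java
// First decide how many TaskManagers are needed, then spread slots evenly
// across TMs, then spread tasks evenly across slots instead of filling top-down.
public class BalancedPlacement {
    public static int[] slotsPerTaskManager(int totalSlots, int slotsPerTm) {
        int numTms = (totalSlots + slotsPerTm - 1) / slotsPerTm;   // ceiling division
        int[] slots = new int[numTms];
        for (int s = 0; s < totalSlots; s++) {
            slots[s % numTms]++;            // round-robin so TMs stay balanced
        }
        return slots;
    }

    public static int[] tasksPerSlot(int totalTasks, int totalSlots) {
        int[] tasks = new int[totalSlots];
        for (int t = 0; t < totalTasks; t++) {
            tasks[t % totalSlots]++;        // round-robin so slots stay balanced
        }
        return tasks;
    }
}
```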

img

There is another kind of skew in actual operation: CPU skew. Let's look at how to solve it. On the left of the figure above, some users requested one core but actually use only 0.5 cores, while others requested one core and actually use a full core. Under the default scheduling policy, a large number of such cases can leave some machines with very high CPU usage while others are idle, and heavily loaded machines suffer in both performance and stability. So how do we make the gap between requested and actually used resources as small as possible?

Our solution is accurate resource profiling of jobs, in the following steps: while the job runs, collect the CPU usage of the container hosting each task; build a mapping from task to executionSlotSharingGroup to container, so that the CPU usage of the slot each task belongs to is known; then restart the job and, using this mapping, request resources according to the historical CPU usage of each task's slot, usually with some buffer reserved. As shown on the right of the figure above, if the prediction is accurate enough, the resources the task manager actually uses stay the same after the restart, but the requested amount becomes smaller, so the gap between the two shrinks.
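The last step, turning the profile into a new request, is essentially "historical peak plus a buffer". A tiny illustrative sketch follows; the buffer value, minimum, and names are assumptions, not the production values.

```java
import java.util.Map;

// For each slot-sharing group, look up the peak CPU its container actually
// used and request that amount plus some headroom.
public class CpuProfiler {
    private static final double BUFFER = 0.2;      // keep 20% headroom (assumed)
    private static final double MIN_CORES = 0.1;   // never request less than this

    // historicalPeakCores: slotSharingGroup -> observed peak CPU of its container
    public static double requestedCores(String group, Map<String, Double> historicalPeakCores) {
        double observed = historicalPeakCores.getOrDefault(group, 1.0);
        return Math.max(MIN_CORES, observed * (1.0 + BUFFER));
    }
}
```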

In fact, some advanced systems in the industry, such as Borg, support dynamically modifying requested values, but our underlying resource scheduler does not, so we can only solve this problem with resource profiling at the Flink layer. Of course, resource profiling cannot be guaranteed to be 100% accurate, so we have additional strategies, such as preventing machines with high CPU load from being allocated more resources, to reduce imbalance as much as possible. In addition, we established a tiered guarantee system: jobs of different priorities get different cgroup restrictions. For example, low-priority jobs are no longer allowed to over-use CPU, while high-priority jobs may over-use a small amount, to avoid imbalance caused by excessive CPU usage.

4. Kuaishou's batch processing practice

img

The figure above is our batch processing architecture. The bottom layer is the offline cluster; the middle is the Flink engine with its DataStream API and SQL API; on top are platforms such as the SQL entry point and the scheduling platform, plus some explorations of stream-batch unification. At the very top are the users, such as the video and commercialization teams.

In stream-batch unification, streaming is characterized by low latency and batch by high throughput. We expect the unified system to handle both unbounded streams and bounded batch data, and to adjust the shuffle block size to balance a job's throughput and latency.

img

Kuaishou has explored stream-batch unification extensively. We established a unified Schema standard for stored data, covering both stream tables and batch tables. Users can process a stream table and a batch table with the same code, differing only in configuration. The produced results must also conform to the unified Schema standard, so that upstream and downstream can connect and as much logic as possible can be reused. Schema unification is part of Kuaishou's data governance; scenarios such as lake-warehouse integration have the same requirement.

Application scenarios mainly include the following aspects:

  • Metric calculation, such as real-time metrics and report calculation.
  • Data backtracking, using existing offline data to regenerate other indicators.
  • Data warehouse acceleration is mainly the real-time acceleration of data warehouses and data lakes.

The benefits of stream-batch unification are many. First, it reduces development and operations costs: as much code logic as possible is reused, and we no longer need to maintain multiple systems. Second, the calibers of real-time and batch processing are consistent, which guarantees consistency of the final results. Finally, there are resource benefits: some scenarios need only one real-time system.

img

We also optimized scheduling. For the three tasks in the figure above, suppose a and c have finished and b is still running; at this point a fails. Under the default strategy, a, b, and c all have to rerun, even though c has already finished. In real scenarios a large number of such c tasks would be recomputed, causing huge resource consumption. For this case we enabled the following strategy by default: if a's output is deterministic (and in fact most batch outputs are deterministic), c does not need to be recomputed; only a and b are rerun.

img

The figure above shows Kuaishou's optimizations and improvements for batch processing.

The first is the shuffle service: we now have both an internal implementation and a trial of the community version, mainly to decouple storage from computation and improve shuffle performance. The second is dynamic resource scheduling, which automatically decides operator parallelism based on the amount of data, avoiding repeated manual tuning. The third is slow-node avoidance, also known as speculative execution, mainly to reduce the long-tail effect and the total execution time. The fourth is Hive optimization, such as UDF adaptation and syntax compatibility; in addition, for partition generation we added caching and multi-threaded generation, greatly reducing split-generation time. Finally, we support additional compression formats such as gzip and zstd.

5. Future plans

img

Our future plans are mainly divided into the following aspects:

  • The first is real-time computing: further improve Flink's performance, stability, and applicability, and accelerate various business scenarios with real-time computing.
  • The second is the unification of online and offline processing, covering real-time, near-real-time, and batch. We hope to use Flink to unify Kuaishou's data synchronization, transformation, and offline computing, so that scenarios such as ETL, data warehousing, and data lake processing all use one Flink-based computing system.
  • The last is elasticity, mainly cloud-native work, including online-offline co-location and elastic scaling of jobs.

Click to view live replay & speech PDF


For more Flink-related technical questions, you can scan the QR code to join the community DingTalk group to get the latest technical articles and community news first-hand. Please also follow the official WeChat account~

img

Recommended activities

Alibaba Cloud's enterprise-level product based on Apache Flink - real-time computing Flink version is now open:
99 yuan gets you a trial of real-time computing Flink version (subscription billing, 10 CU), with a chance to win an exclusive Flink custom sweater; packages of 3 months or more also get a 15% discount!
Learn more about the event: https://www.aliyun.com/product/bigdata/en


