
This article was compiled by community volunteer Miao Wenting, based on the talk "Flink Runtime and DataStream API Optimization for Stream-Batch Unification" given by Alibaba technical expert Gao Yun (Yun Qian) at the Flink Meetup Beijing on May 22nd. The article is divided into four parts:

  1. A review of the design of Flink's stream-batch unification
  2. Optimizations in the runtime
  3. Optimizations in the DataStream API
  4. Summary and follow-up plans

1. Stream-batch unification in Flink

1.1 Architecture overview

Let's first look at the overall logic of stream-batch unification in Flink. Although early Flink was a framework that could support both stream processing and batch processing, its implementations of the two were completely separate stacks, at the API layer as well as in the underlying shuffle, scheduling, and operator layers. The two implementations were fully independent and not particularly closely related.


Guided by the goal of stream-batch unification, Flink has now unified the underlying operators, scheduling, and shuffle, and supports the two sets of interfaces, the DataStream API and the Table API, on top of them in a unified way. The DataStream API is a more physical interface, while the Table API is a declarative interface, and both are unified for streaming and batch.
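As a minimal sketch of what this looks like from the user's side (our illustration, not from the talk; the element values and names are hypothetical), the same data can be handled at either API level over the same runtime:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

public class UnifiedApisSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // DataStream API: "physical" style, the user wires up the operators directly.
        DataStream<String> upper = env
                .fromElements("flink", "runtime", "shuffle")
                .map((MapFunction<String, String>) String::toUpperCase);

        // Table API: declarative style over the same data; the planner picks operators.
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
        Table table = tEnv.fromDataStream(upper).as("word");
        tEnv.toDataStream(table.select($("word"))).print();

        env.execute("unified-apis-sketch");
    }
}
```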


1.2 Advantages


  • Code reuse

    Based on the DataStream API and Table API, users can write one set of code that processes historical data and real-time data alike, for example in data backfill scenarios.

  • Easy to develop

    A unified connector and operator implementation reduces development and maintenance costs.

  • Easy to learn

    It reduces learning costs and avoids having to learn two similar sets of interfaces.

  • Easy to maintain

    The same system supports both stream jobs and batch jobs, which reduces maintenance costs.

1.3 Data processing modes

Next, let's briefly look at how Flink abstracts stream-batch unification. Flink divides jobs into two types:

  • The first type is jobs that process an infinite data stream

    This is the stream job as we usually understand it. For such jobs, Flink adopts the standard stream execution mode: it takes record time into account, advances the time of the whole system through Watermark alignment so that data can be aggregated and emitted, and maintains intermediate results through State.


  • The second type is jobs that process a finite data set

    The data may be stored in files, or retained in advance in some other way as a finite set. A finite data set can be regarded as a special case of the infinite data set, so such a job can naturally run on top of the stream processing mode and be supported directly, without any code change.


    However, this ignores the fact that the data set is finite. On the interface side, finer-grained semantics such as time and watermarks still have to be handled, which introduces extra complexity. In terms of performance, because the job is processed as a stream, all tasks need to be brought up at the start, which may require more resources; and if the RocksDB state backend is used, it is effectively one large hash table, so with many keys there may be random IO access problems.

    In batch execution mode, by contrast, the whole data processing flow can be implemented in a more IO-friendly way based on sorting. Therefore, under the premise of finite data, the batch mode gives us more choices in the implementation of scheduling, shuffle, and operators.

    Finally, for finite data streams, whichever processing mode is adopted, we want the final results to be consistent.
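As a small illustration of the two modes from the user's side (ours, not from the talk): since Flink 1.12 the DataStream API lets the same program run in either mode via the runtime-mode setting, which can also be passed on the command line as `-Dexecution.runtime-mode=BATCH`:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING: watermark/state-based execution, for bounded or unbounded input.
        // BATCH: sort-based, IO-friendly execution, valid only for bounded sources.
        // AUTOMATIC: pick BATCH exactly when all sources are bounded.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(1, 2, 3, 4)      // a bounded source, so BATCH is allowed
           .keyBy(i -> i % 2)
           .reduce(Integer::sum)
           .print();

        env.execute("runtime-mode-sketch");
    }
}
```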


1.4 Recent evolution

In recent versions of Flink, a lot of effort has been put into the goal of stream-batch unification at both the API and implementation layers.

  • In Flink 1.11 and earlier:

    Flink unified the Table/SQL API and introduced the unified blink planner, under which both streams and batches are translated into DataStream operators. In addition, a unified shuffle architecture was introduced for both streams and batches.

  • In Flink 1.12:

    For batch shuffle, a new Sort-Merge-based shuffle mode was introduced, whose performance is greatly improved compared with Flink's previous built-in Hash shuffle. In terms of scheduling, Flink introduced the Pipeline-Region-based scheduler, which is unified for streaming and batch.

  • In Flink 1.13:

    Sort-Merge Shuffle was improved further, and the performance of the Pipeline Region scheduler on large-scale jobs was optimized. In addition, as mentioned earlier, we expect the results of the two execution modes of a finite stream to be consistent; but at present Flink still has some problems at the end of job execution that make the results not fully consistent.

    So part of the work in 1.13 went into making the results of jobs over finite data sets match the expected results in both stream and batch modes, especially in stream mode.

  • The upcoming Flink 1.14:

    Work will continue on the consistency guarantee for finite jobs, on Sources that switch between batch and stream, and on gradually deprecating the DataSet API.


2. Runtime optimization

2.1 Large-scale job scheduling optimization

2.1.1 The complexity of execution edges

When Flink submits a job, it generates a DAG of the job composed of multiple vertices, where each vertex corresponds to an actual processing node, such as a Map. Each processing node has a parallelism. In Flink's previous implementation, when a job was submitted to the JobManager (JM), the JM expanded the job into an Execution Graph.

As shown in the figure below, the job has two nodes with parallelism 2 and 3 respectively. In the data structures actually maintained in the JM, there are 2 tasks and 3 tasks connected by 6 execution edges. Based on this data structure, Flink maintains the topology information of the whole job, on top of which it can track the state of each task separately and identify the tasks that need to be restarted when a task fails.

With all-to-all communication, that is, when there is an edge between every pair of upstream and downstream tasks, the number of edges is upstream parallelism times downstream parallelism, an O(N^2) data structure. In this case, the memory usage is staggering: with 10k × 10k edges, the JM's memory usage reaches 4.18 GB. In addition, the time complexity of many computations over the job graph is related to the number of edges, reaching O(N^2) or O(N^3); with 10k × 10k edges, the initial scheduling time of the job reaches 62 s.
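A rough back-of-the-envelope check (our own estimate; the ~40 bytes per edge is an assumed JVM object overhead, not a figure from the talk) shows where the footprint comes from:

```text
10,000 upstream tasks × 10,000 downstream tasks = 10^8 execution edges
10^8 edges × ~40 bytes of objects per edge      ≈ 4 GB of JobManager heap
```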

Beyond initial scheduling, a batch job also keeps scheduling downstream tasks as upstream tasks finish, and the scheduling complexity in between is likewise O(N^2) or O(N^3), which causes a lot of performance overhead. In addition, with such a large memory footprint, GC performance will not be particularly good either.


2.1.2 Symmetry of the Execution Graph

Given Flink's memory and performance problems on large-scale jobs, some in-depth analysis shows that in the example above there is a certain symmetry between the upstream and downstream nodes.

There are two types of "edges" in Flink:

  • The Pointwise type: upstream and downstream are in one-to-one correspondence, or one upstream task corresponds to a few downstream tasks, not all connected. In this case the number of edges is basically linear, O(N), on the same order of magnitude as the number of operators.
  • The All-to-all type: every upstream task is connected to every downstream task. In this case, the data set produced by each upstream task is consumed by every downstream task, which is actually a symmetric relationship. It is enough to remember that an upstream data set is consumed by all downstream tasks; there is no need to store each individual edge.


Therefore, in 1.13 Flink introduced the concepts of ResultPartitionGroup and VertexGroup for upstream data sets and downstream vertices. In particular, for All-to-all edges, since upstream and downstream are symmetric, all data sets produced upstream can be put into one group and all downstream vertices into another group; in the actual maintenance there is no need to store the edges between them, only which downstream group consumes which upstream group of data sets. In this way, the memory usage is reduced.
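To make the idea concrete, here is a simplified conceptual sketch (our illustration; Flink 1.13's actual internal classes, such as ConsumedPartitionGroup and ConsumerVertexGroup, differ in detail):

```java
import java.util.List;

/** The group of result partitions produced by all tasks of one upstream vertex. */
final class ResultPartitionGroup {
    final List<Integer> partitionIds;

    ResultPartitionGroup(List<Integer> partitionIds) {
        this.partitionIds = partitionIds;
    }
}

/** The group of downstream tasks that all consume the same partition group. */
final class ConsumerVertexGroupSketch {
    final List<Integer> taskIds;
    // One reference per group replaces upstream × downstream individual edge
    // objects: memory goes from O(N^2) to O(N).
    final ResultPartitionGroup consumed;

    ConsumerVertexGroupSketch(List<Integer> taskIds, ResultPartitionGroup consumed) {
        this.taskIds = taskIds;
        this.consumed = consumed;
    }
}
```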


In addition, some scheduling-related computations benefit as well. For example, in batch processing, if all edges are blocking edges, each vertex belongs to a separate region. Previously, computing the upstream-downstream relationships between regions required traversing all downstream vertices for each upstream vertex, an O(N^2) operation. After introducing the consumer groups, this becomes a linear O(N) operation.


2.1.3 Optimization results

After the above data structure optimizations, with 10k × 10k edges, the JM memory footprint drops from 4.18 GB to 12.08 MB, and the initial scheduling time drops from 62 s to 12 s. This optimization is very significant: users only need to upgrade to Flink 1.13 to get the benefit, without any additional configuration.


2.2 Sort-Merge Shuffle

Another optimization concerns the data shuffle of batch jobs. Generally, after an upstream batch task finishes, it writes its results to an intermediate file, and downstream tasks then pull data from that intermediate file for processing.

The advantage of this approach is that it saves resources: upstream and downstream do not need to be up at the same time, and in case of failure the job does not need to be re-executed from scratch. This is the common way batch processing is executed.

2.2.1 Hash Shuffle

So, during the shuffle, how are intermediate results saved to intermediate files and then pulled by downstream?

Previously, Flink used Hash shuffle. Take the All-to-all edge as an example: for the data set it produces, an upstream task writes a separate file for each downstream task, so the system may generate a large number of small files. And regardless of whether file IO or mmap is used, at least one buffer is needed to write each file, which wastes memory. Downstream tasks reading the upstream data files at random positions also generate a lot of random IO.

Therefore, Flink's Hash shuffle could previously only work reasonably well in batch production when the scale was relatively small or when SSDs were used; at larger scale or on SATA disks there were bigger problems.


2.2.2 Sort Shuffle

Therefore, over the two releases Flink 1.12 and 1.13, a new shuffle based on Sort-Merge was introduced. The sort here does not mean sorting the data by value, but sorting records by the downstream task they are written to.

The general principle is that when an upstream task outputs data, it uses a fixed-size buffer, so the buffer does not grow with the scale. All data is written into the buffer; when the buffer is full, it is sorted once and written out as a segment appended to a single file, after which subsequent data is again written into the same buffer. In the end, a single upstream task produces one intermediate file consisting of many segments, each of which is internally ordered.
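A minimal sketch of this write path (our simplification of the idea, not Flink's actual implementation): records are tagged with their target subpartition, buffered, clustered by target when the buffer fills, and appended to one file as an ordered segment.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch of a sort-based shuffle writer with a fixed-size in-memory buffer. */
final class SortShuffleWriterSketch {

    private static final class Entry {
        final int targetSubpartition;  // which downstream task the record goes to
        final byte[] payload;

        Entry(int targetSubpartition, byte[] payload) {
            this.targetSubpartition = targetSubpartition;
            this.payload = payload;
        }
    }

    private final List<Entry> buffer = new ArrayList<>();
    private final int capacity;          // fixed; does not grow with job scale
    private final FileChannel dataFile;  // a single file per upstream task

    SortShuffleWriterSketch(int capacity, FileChannel dataFile) {
        this.capacity = capacity;
        this.dataFile = dataFile;
    }

    void emit(int targetSubpartition, byte[] payload) throws IOException {
        buffer.add(new Entry(targetSubpartition, payload));
        if (buffer.size() >= capacity) {
            spill(); // buffer full: sort once, append one ordered segment
        }
    }

    private void spill() throws IOException {
        // "Sort" means clustering records by target subpartition, not by value,
        // so each segment stores every subpartition's records contiguously.
        buffer.sort(Comparator.comparingInt(e -> e.targetSubpartition));
        for (Entry e : buffer) {
            dataFile.write(ByteBuffer.wrap(e.payload)); // append to the single file
        }
        buffer.clear();
        // A real implementation would also record per-subpartition offsets in an
        // index so that readers can locate their region inside each segment.
    }
}
```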


Unlike other batch processing frameworks, Flink's approach is not based on ordinary external sorting. Ordinary external sorting would merge these segments again into one fully ordered file, so that downstream reads would have better IO continuity and each task would not have to read very small data segments. However, the merge itself also consumes a lot of IO resources, and the time overhead of merging may far exceed the benefit of downstream sequential reads.

Therefore, another method is adopted here: when downstream tasks request data, for example the three downstream tasks in the figure below all reading the upstream intermediate file, a scheduler sorts the file positions of the downstream read requests. By adding this IO scheduling at the upper layer, reads of the whole file become as sequential as possible, preventing large amounts of random IO on SATA disks.

On SATA disks, Sort shuffle improves IO performance by 2 to 8 times compared with Hash shuffle. With Sort shuffle, Flink's batch processing has basically reached a production-usable state: on a SATA disk it can drive disk IO above 100 MB/s, while a SATA disk's maximum read/write speed is around 200 MB/s.


To maintain compatibility, Sort shuffle is not enabled by default. Users can control from which downstream parallelism onward Sort-Merge Shuffle is enabled, and can further improve batch performance by enabling compression. Sort-Merge shuffle does not occupy extra memory: the read/write buffer it uses is carved out of framework.off-heap memory.
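As a hedged configuration sketch (option names as of Flink 1.12/1.13; verify the defaults against the documentation of your version):

```yaml
# Enable sort-merge shuffle from this downstream parallelism onward; the default
# threshold is much higher, so setting it to 1 turns it on everywhere.
taskmanager.network.sort-shuffle.min-parallelism: 1

# Optionally compress spilled shuffle data, trading CPU for IO.
taskmanager.network.blocking-shuffle.compression.enabled: true
```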


3. DataStream API optimization

3.1 2PC & end-to-end consistency

To ensure end-to-end consistency, Flink streaming jobs rely on a two-phase commit (2PC) mechanism, which combines Flink's checkpoint and failover mechanisms with certain features of the external systems.

The general logic is as follows. To achieve end-to-end consistency, for example when reading from Kafka and writing back to Kafka, data is first written into a Kafka transaction during normal processing, and a preCommit is performed at checkpoint time so that the data will not be lost.

If the checkpoint succeeds, a formal commit is performed. This keeps the external system's transactions consistent with Flink's internal failover: if a failover occurs in Flink and the job rolls back to the previous checkpoint, the corresponding transactions in the external system are aborted; if the checkpoint succeeds, the commit of the external transactions succeeds as well.
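This protocol is what Flink's TwoPhaseCommitSinkFunction base class encapsulates. Below is a heavily abridged sketch of a custom sink built on it; MyTxn and its methods are hypothetical stand-ins for a real external transaction handle (such as a Kafka producer transaction):

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

/** Hypothetical handle for a transaction in the external system. */
class MyTxn {
    void write(String v) { /* buffer the record inside the open transaction */ }
    void flush()         { /* make buffered data durable, but not yet visible */ }
    void commitTxn()     { /* atomically make the data visible */ }
    void rollback()      { /* discard everything written in this transaction */ }
}

/** Abridged sketch of a sink using Flink's two-phase commit base class. */
class TwoPhaseSinkSketch extends TwoPhaseCommitSinkFunction<String, MyTxn, Void> {

    TwoPhaseSinkSketch() {
        super(new KryoSerializer<>(MyTxn.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected MyTxn beginTransaction() {
        return new MyTxn();          // open a fresh external transaction
    }

    @Override
    protected void invoke(MyTxn txn, String value, Context context) {
        txn.write(value);            // normal processing writes into the txn
    }

    @Override
    protected void preCommit(MyTxn txn) {
        txn.flush();                 // phase 1: on the checkpoint barrier
    }

    @Override
    protected void commit(MyTxn txn) {
        txn.commitTxn();             // phase 2: after the checkpoint succeeds
    }

    @Override
    protected void abort(MyTxn txn) {
        txn.rollback();              // failover: roll back with the checkpoint
    }
}
```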

Flink's end-to-end consistency thus relies on the checkpoint mechanism. However, with finite streams there are some problems:

  • For finite-stream jobs, Flink does not support checkpoints after some tasks have finished. For example, in a job that mixes stream and batch parts, some parts will finish; after that, Flink can no longer take checkpoints, and data is no longer committed.
  • At the end of a finite stream, since checkpoints are taken periodically, there is no guarantee that a last checkpoint is taken after all data has been processed, so the last part of the data may never be committed.

Both issues cause the results of a finite-stream job executed in stream mode to be inconsistent with those of the batch execution mode.


3.2 Support for checkpoints after some tasks finish (in progress)

Starting from Flink 1.13, taking checkpoints after some tasks have finished is supported. A checkpoint actually maintains a list of states for all tasks of each operator.


After some tasks finish (the dashed part in the figure below), Flink divides the finished tasks into two cases:

  • If all subtasks of an operator have finished, a finished flag is stored for that operator.
  • If only some tasks of an operator have finished, only the states of the unfinished tasks are stored.

    Based on such a checkpoint, all operators are brought up again after a failover. If an operator is recognized as having finished in the previous execution, i.e. finished = true, its execution is skipped. In particular, for a Source operator that has finished, no data is sent again afterwards. In this way, the consistency of the whole state is guaranteed, and a checkpoint can still be taken even though some tasks have finished.
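For reference, a note beyond the talk: to our knowledge this capability later shipped behind a configuration flag in Flink 1.14 (and became the default in 1.15), along the lines of:

```yaml
# Flink 1.14: allow checkpoints to continue after some tasks have finished.
execution.checkpointing.checkpoints-after-tasks-finish.enabled: true
```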


Flink has also reworked the end-of-job semantics. A Flink job can now end in several ways:

  • End of data: the input is finite, and the finite-stream job ends normally;
  • stop-with-savepoint: stop after taking a savepoint;
  • stop-with-savepoint --drain: stop after taking a savepoint and advance the watermark to positive infinity (see the CLI sketch below).
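As a hedged CLI sketch of the latter two (flags as in the Flink CLI of recent versions; <jobId> and the savepoint path are placeholders):

```bash
# Stop with a savepoint; the job may later be restarted from it.
bin/flink stop --savepointPath hdfs:///savepoints <jobId>

# Stop with a savepoint and drain: advance the watermark to +infinity so that
# all event-time timers fire before the job ends for good.
bin/flink stop --drain --savepointPath hdfs:///savepoints <jobId>
```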

Previously, these cases were handled by two different implementation logics, and both had the problem that the last part of the data could not be committed. Now:

  • For the end-of-data and stop-with-savepoint --drain semantics, the job is expected never to restart, so endOfInput() is called on the operators to notify them, and then a checkpoint is taken in a unified way.
  • For the stop-with-savepoint semantics, the job is expected to restart from the savepoint later, so endOfInput() is not called on the operators; a checkpoint is still taken afterwards. In this way, for jobs that are bound to end and will not restart, the last part of the data is guaranteed to be committed to the external system.


4. Summary

Among Flink's overall goals, one is to become a unified platform that efficiently processes both finite and infinite data sets. A preliminary prototype of this, at both the API layer and in the runtime, is basically in place. Let's take an example to illustrate the benefits of stream-batch unification.

Consider a user's backfill job, which is normally an infinite stream job. Suppose one day the logic needs to be changed: the stream can be stopped with stop-with-savepoint, but the new logic also has to be applied retroactively to the previous two months of data to keep results consistent. At this point, a batch job can be started: the job code is unmodified and runs over the input data cached in advance, so batch mode can correct the previous two months of data as quickly as possible. In addition, a new stream job with the new logic can be restarted from the savepoint saved earlier.

As can be seen, in the whole process above, if stream and batch were separate, a dedicated development effort would be needed for the data correction. With stream-batch unification, the correction is done naturally on the basis of the stream job, with no extra development by the user.

In subsequent versions, Flink will further consider more scenarios that combine stream and batch, for example a user first running a batch job to initialize state and then switching to an infinite stream. Of course, the stream-only and batch-only capabilities will also be further optimized and improved, making Flink a competitive compute engine for both stream and batch.


