
This article was first published by 泊浮目 on Yuque: https://www.yuque.com/17sing

Version | Date      | Remark
1.0     | 2022.3.16 | Article first published

0. Background: Before Dataflow

Before the Dataflow papers were published, it was widely assumed that stream computing and batch computing needed two separate sets of APIs, and the typical embodiment of that assumption was the Lambda architecture.

Because early stream processing frameworks did not support Exactly Once, their results were not accurate. Worse, once the data was wrong, a large amount of data had to be replayed, since events often have ordering and timing requirements. Lambda therefore uses the stream processing framework to obtain fast but approximate results, while periodically running a batch job to obtain more accurate ones; once the accurate results arrive, the approximate ones are no longer needed.

But this also brings new problems: every view has to be implemented in both the stream layer and the batch layer, which means writing two sets of code and reconciling two different data calibers. It roughly doubles the cost in both computing resources and human resources.

The Kappa architecture proposes keeping all data in Kafka and unifying the storage model with the computing model, but at the cost of time: when the data volume is large, the pressure of backfilling (recomputing history) is enormous.

It was not until the publication of The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing that streams and batches could be unified at the programming-model level and the problems above could be alleviated.

1. Implementation of Flink

Compared with other stream processing frameworks, Flink has two advantages:

  1. It follows the Dataflow model, unifying stream and batch at the programming-model level
  2. It improves on the Chandy-Lamport algorithm to guarantee exactly-once processing at a lower cost

1.1 Behind the unification of the programming model

The unification of the programming model is embodied in Flink SQL and the DataStream API: we can run streaming and batch jobs with the same SQL, or with almost the same code. SQL in particular is more declarative than DataStream, so users only need to describe what they want rather than how to compute it; the Flink framework takes care of the rest.
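As a small illustration (not part of the original article), the sketch below assumes the DataStream API of Flink 1.12 or later: the same word-count pipeline runs as either a streaming or a batch job, and the only switch is the RuntimeExecutionMode. The class name and sample input are made up for the example.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountUnified {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Switching this single line between STREAMING, BATCH and AUTOMATIC
        // is the only change; the pipeline below stays the same.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("flink unifies stream and batch", "stream and batch")
           .flatMap((String line, Collector<String> out) -> {
               for (String word : line.split(" ")) {
                   out.collect(word);
               }
           })
           .returns(Types.STRING)
           .map(word -> Tuple2.of(word, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("word-count");
    }
}
```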

Within the Flink framework, the following problems have mainly been addressed so far:

  • IO model: batch processing cares more about throughput, so it uses a pull model; stream processing cares more about latency, so it uses a push model. The Source operator therefore has to support both models to fit the two execution modes. See FLIP-27: Refactor Source Interface for details.
  • Scheduling strategy: batch operators do not all need to be online at the same time; the next batch of operators can be scheduled once the previous batch has finished. Since computing resources are usually more expensive than storage resources, this is a worthwhile optimization (with plenty of resources, one can ignore it and simply pursue performance). Streaming jobs, in contrast, need all their operators scheduled when the job starts. The StreamGraph therefore has to support both modes: LazyScheduling and EagerScheduling.
  • Connecting batch and stream: if we want to analyze the data of the past 30 days, in most cases that means 29 days of offline data plus the most recent day of real-time data. Making sure the two connect with neither missing nor duplicated data is troublesome, and many engineering practices resort to hacks. Fortunately, Hybrid Source was introduced to simplify this (see FLIP-150: Introduce Hybrid Source), as sketched after this list.
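
A minimal sketch of such a connection with Hybrid Source, assuming recent Flink connector APIs (FileSource, KafkaSource); the HDFS path, Kafka topic, bootstrap servers, switch timestamp and class name are placeholders invented for the example.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HistoricalPlusRealtime {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Timestamp at which ownership switches from the offline data to Kafka
        // (a made-up value for the sketch).
        long switchTimestamp = 1_647_388_800_000L;

        // Bounded part: the historical data sitting in files on HDFS.
        FileSource<String> historical = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("hdfs:///warehouse/events/"))
                .build();

        // Unbounded part: the real-time data in Kafka, starting right after the switch point.
        KafkaSource<String> realtime = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.timestamp(switchTimestamp + 1))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Hybrid Source reads the bounded source to completion, then switches to Kafka.
        HybridSource<String> source = HybridSource.builder(historical)
                .addSource(realtime)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "historical-then-realtime")
           .print();

        env.execute("hybrid-source-sketch");
    }
}
```

The bounded file source is read to completion first; only then does Hybrid Source switch to the Kafka source, whose starting offsets are derived from the same switch timestamp, which is what keeps the two parts from overlapping or leaving a gap.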

1.2 Checkpoint is not a silver bullet

Checkpointing is an important fault-tolerance mechanism in the Flink framework. One of its prerequisites is that the data source can be read repeatedly. In data warehouse scenarios the data does not change in most cases, but there are still cold-data archiving mechanisms and merges, which pose a certain challenge to re-readability. In addition, in QMatrix, the product the author is responsible for, a similar challenge appears when migrating a database in full: the full data read at T1 is set 1, the full data read at T2 is set 2, and MVCC is only maintained within a single session.
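
For context, a minimal sketch of how checkpointing is switched on (the interval and pause values are arbitrary, and the placeholder pipeline exists only to make the snippet executable):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
        // Leave some breathing room between two consecutive checkpoints.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);

        // Placeholder pipeline just to make the sketch executable.
        env.fromElements("a", "b", "c").print();

        env.execute("checkpointing-sketch");
    }
}
```

On recovery, Flink restores the latest completed checkpoint and rewinds each source to the offsets recorded in it, which is exactly why the source must be re-readable.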

Described above are the fault-tolerance conditions to be considered at the data source. Once the data has all flowed into the job, the fault-tolerance mechanism also needs to be reconsidered: we want to avoid re-reading the data source and recomputing upstream tasks as much as possible. The community therefore introduced a pluggable Shuffle Service, which persists shuffle data to support fine-grained fault-tolerant recovery; see FLIP-31: Pluggable Shuffle Service.
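
As an illustration of the idea, here is a hedged sketch of selecting the shuffle implementation through configuration; the option name and the default factory class are the ones defined by FLIP-31's ShuffleServiceOptions, so verify them against the Flink version in use.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PluggableShuffleSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // FLIP-31 turns the shuffle implementation into a configurable factory.
        // Here it is set explicitly to the default Netty-based implementation;
        // a remote or persistent shuffle service would supply its own factory class.
        conf.setString("shuffle-service-factory.class",
                "org.apache.flink.runtime.io.network.NettyShuffleServiceFactory");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);

        env.fromElements(1, 2, 3).print();
        env.execute("pluggable-shuffle-sketch");
    }
}
```

A shuffle service that persists its data would plug in its own factory class here, without any change to the job code.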

2. The remaining problem: inconsistent data sources

The stream-batch connection described above still presupposes that the data source is split into a streaming source and a batch source. Their calibers are therefore not uniform, which brings a certain integration cost.

The currently popular solution is to use a data lake (such as Iceberg, Hudi, or DeltaLake) to unify streaming and batch data. And since most data lakes support Time Travel, the problem of repeatable reads of offline data is solved along the way.
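
As a rough illustration (assuming the Iceberg Flink connector is on the classpath and an Iceberg catalog named lake with a table db.events has been registered, both of which are assumptions of this sketch), a batch query can be pinned to an old snapshot through the connector's as-of-timestamp read option, so the same table doubles as a re-readable offline source:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LakeTimeTravelSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Hypothetical Iceberg table lake.db.events, queried as of an old snapshot
        // via the connector's 'as-of-timestamp' read option (milliseconds).
        tEnv.executeSql(
                "SELECT * FROM lake.db.events "
                + "/*+ OPTIONS('as-of-timestamp'='1647388800000') */")
            .print();
    }
}
```

The same table can also be read incrementally with the connector's streaming read options, which is what lets a single storage layer serve both the streaming and the batch side.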

In addition, Pravega, a storage system designed for unified stream and batch storage, may also be one of the solutions.

3. Summary

In this article, the author walked through where stream-batch unification comes from and the efforts the Flink community has made toward it. We have also seen that some problems cannot be solved by the Flink framework alone; the entire big data ecosystem needs to evolve together toward stream-batch unification.

At the end of this article, thanks to Yu for the discussions and guidance; we wrote this article together.
