
This article is compiled from a talk by Wang Feng (Mo Wen), initiator of the Apache Flink Chinese community and head of Alibaba's open source big data platform, at Flink Forward Asia 2021. It is divided into four parts:

  1. 2021: The Apache Flink community continues to thrive
  2. Apache Flink core technology evolution
  3. The evolution and implementation of stream-batch integration
  4. Machine Learning Scenario Support

Click to view live replay & speech PDF

1. 2021: The Apache Flink community continues to thrive

1.1 Flink major version iteration


In 2021, the Flink community released two major versions: Flink 1.13 and Flink 1.14.

In Flink 1.13, Flink's deployment architecture evolved further toward cloud native, making Flink better suited to running in cloud-native environments. Flink's ease of use also improved significantly, allowing users to easily debug, tune, and monitor Flink jobs. On the storage side, the Flink checkpoint format was unified, so users can switch seamlessly between different state backends and upgrade smoothly between versions.

The biggest change in Flink 1.14 is the complete realization of Flink's stream-batch-unified architecture and concept. At last year's Flink Forward Asia 2020 conference I focused on Flink Unified SQL; this year, Flink unifies stream and batch not only in SQL and the Table API, but also in the Java API itself. The DataStream and DataSet APIs have been consolidated, and all stream-batch semantics are now unified on the DataStream API: based on DataStream, users can process both bounded and unbounded streams and develop in a single stream-batch-unified style. In addition, Flink introduced fine-grained resource management at the resource layer, making Flink jobs more efficient in large-scale scenarios, and upgraded its network stack with adaptive flow control, which lets Flink take globally consistent snapshots faster. Through these technical innovations and evolutions, the Flink community continues to develop rapidly and keep its ecosystem active.
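As a small illustration of the unified API, the sketch below runs the same DataStream program in batch mode simply by setting the runtime execution mode; switching to STREAMING (or AUTOMATIC, which decides based on whether the sources are bounded) needs no other code changes.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedStreamBatchExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // One switch chooses bounded (BATCH) or unbounded (STREAMING)
        // execution for the same DataStream program.
        env.setRuntimeExecutionMode(RuntimeExecutionMode.BATCH);

        env.fromElements("flink", "unifies", "stream", "and", "batch")
                .map(s -> s.toUpperCase())
                .returns(Types.STRING)
                .print();

        env.execute("unified-stream-batch-example");
    }
}
```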

1.2 Apache Flink's community metrics continue to shine

In the Apache Software Foundation's 2021 fiscal-year report, several of the Flink community's indicators remained very healthy: Flink again ranked first in mailing-list activity, second in GitHub project visit traffic, and second in number of commits to the codebase, among other metrics. These activity indicators show that the Flink community is among the most active in the entire Apache Software Foundation.


The healthy development of the Flink community is inseparable from the many code contributors who drive both the software development and the technical application of the Flink project. So far, more than 1,400 developers have made code-level contributions to Apache Flink. These developers come from more than 100 companies around the world, including not only well-known international companies but also a growing number of local technology companies from China. Chinese contributors are playing an increasingly important role in the Flink community.


Alibaba has been actively building the Apache Flink Chinese community since 2018 to promote the rapid development of Flink in China.

Since the opening of the Apache Flink official account (Apache Flink), 50,000 developers have subscribed. Last year alone, the account published more than 140 technical articles, focusing on community news, technology sharing, and user stories of how various industries use Apache Flink. Recently, the Apache Flink video account was also officially opened, in the hope of presenting the community's development from more dimensions, in a fresher form, and sharing the community's technology and applications.

In July 2021, the Flink Chinese learning website ( https://flink-learning.org.cn ) officially launched. It gathers Flink learning materials from across the web so that more Flink developers and users can easily learn Flink's technology, application scenarios, and use cases.

Although 2021 was another year of pandemic outbreaks, community activities did not stop: we still held four meetups in Beijing, Shanghai, and other cities. We hope that in the new year more companies will be willing to host Flink community activities and help drive the community's development.

2. Apache Flink core technology evolution


Big data and cloud native are two inseparable trends. The elastic environment of the cloud gives big data computing more room to grow and accelerates its adoption. Under this cloud-native trend, Apache Flink has further evolved its deployment architecture and resource management so that it fully adapts to the cloud-native model: users on the cloud can dynamically scale resources up or down at any time as business traffic changes, using exactly what they need. Flink therefore needs to fit this on-demand model as well, adaptively adjusting the concurrency of its computing topology as cloud resources expand and contract, and providing better automated and adaptive operations capabilities.
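One concrete piece of this adaptation is the reactive scheduler mode added in Flink 1.13, where a job automatically rescales to whatever TaskManagers are currently available; a minimal flink-conf.yaml sketch (reactive mode assumes a standalone application cluster):

```yaml
# flink-conf.yaml: reactive mode (Flink 1.13+).
# The job always uses all available resources; adding or removing
# TaskManagers (e.g. via a cloud autoscaler) rescales it automatically.
scheduler-mode: reactive
```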


Besides cloud native, the other core technical concept of Flink is the globally consistent snapshot. Flink's biggest technical highlight is that it is a stateful real-time computing engine: it uses the Chandy-Lamport algorithm to take globally consistent snapshots, which guarantee complete data consistency and fault tolerance under fully real-time conditions. Globally consistent snapshots are therefore the key to Flink's data-consistency guarantee and system fault tolerance. If we keep improving their quality and performance, the core real-time computing experience of Flink improves as well, including smoother failure recovery.


The process of Flink's globally consistent snapshot has four steps:

  1. Insert checkpoint barriers: the sources periodically insert special barriers, which flow downstream along the data flow;
  2. Align barriers across inputs: after an operator receives the barriers from all of its upstream inputs, it performs barrier alignment and then proceeds to the snapshot;
  3. Snapshot and upload: during the snapshot, the operator's internal state is persisted to durable storage such as HDFS;
  4. Checkpoint complete: after its snapshot finishes, each operator notifies the master; once all operators have reported, the globally distributed consistent snapshot is complete.

In this process, the real room for performance improvement lies mainly in steps 2 and 3: barrier alignment, and snapshotting the state and uploading it to HDFS.
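For reference, the entry point to this whole process is the job's checkpoint configuration; a minimal sketch with placeholder interval and storage path:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: inject a checkpoint barrier at the sources every 10 s.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Step 3: persist operator state to durable storage such as HDFS.
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        // ... build and execute the job topology here ...
    }
}
```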


Therefore, in 2021 the Flink community focused its checkpoint optimizations mainly on steps 2 and 3. Barrier alignment looks like a simple operation, but it can take a long time: under backpressure, when the computing power of downstream operators suddenly drops, large amounts of data back up in the network layer, and barrier alignment can then take so long that checkpoint times become uncontrollable.


How to solve this problem? Flink's network flow control is credit-based: upstream and downstream negotiate the number of in-flight buffers between them to achieve efficient network flow control. However, when downstream operators are backpressured and their computing power drops sharply, that many network buffers are no longer needed for data buffering. We therefore proposed an adaptive network flow control mechanism on top of the credit-based one: besides negotiating the number of network buffers between upstream and downstream, it also considers the size of the network buffers and adjusts it dynamically according to the downstream's computing power. When a backpressured downstream slows down, the amount of data held in network buffers shrinks, which alleviates the impact of backpressure on barrier alignment.
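This adaptive mechanism shipped in Flink 1.14 as "buffer debloating"; a minimal flink-conf.yaml sketch, assuming that release's option names (the target value shown is the documented default):

```yaml
# Dynamically shrink or grow network buffer sizes based on the
# downstream's actual throughput (Flink 1.14+).
taskmanager.network.memory.buffer-debloat.enabled: true

# Aim to keep no more in-flight data than can be consumed in this time,
# which bounds how long a barrier can sit behind buffered data.
taskmanager.network.memory.buffer-debloat.target: 1 s
```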


To further improve checkpoint performance and speed, the local snapshot taken by each operator must also become faster. To change the snapshot and backup mechanism itself, Flink introduced a log-based checkpoint mechanism that accelerates checkpoint execution for individual operators.


With this mechanism, every state write goes to the state backend on one side and to a changelog on the other. When a snapshot is taken, only a piece of metadata needs to be appended to the changelog to mark the checkpoint as done; the snapshot data then consists not only of state files but also of the changelog. Copying state-file data to HDFS becomes a regular background materialization process whose frequency can be decoupled from the checkpoint frequency, so checkpoints can complete in seconds or even milliseconds. This greatly improves the fault-tolerance experience as well as the end-to-end, globally consistent data experience.
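This log-based mechanism later became available as the experimental changelog state backend; a minimal flink-conf.yaml sketch, assuming the option names from Flink 1.15, where it first shipped:

```yaml
# Dual-write every state change to a changelog in addition to the
# state backend (experimental; FLIP-158).
state.backend.changelog.enabled: true

# Persist the changelog itself to a DFS; state-file materialization
# then runs in the background, decoupled from checkpoint frequency.
state.backend.changelog.storage: filesystem
dstl.dfs.base-path: hdfs:///flink/changelog
```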

Flink offers not only the SQL API but also the Java API for routine big data computation, and it is considered to still have great potential in machine learning and scientific computing (in fact, Flink already plays a significant role in machine learning), which is where Python support comes in.


In 2021, PyFlink reached full parity with the Table API and DataStream API, and also made a major performance innovation: in PyFlink, Java code, C code, and Python code run inside a single process, with no need for cross-process communication. Through JNI, the Java framework calls into C code, which in turn invokes the Python interpreter to execute Python UDFs. Removing the cross-process dependency brings Python UDF performance close to that of Java, balancing development efficiency and execution efficiency.


3. The evolution and implementation of stream-batch integration

In addition to classic stream processing, stream-batch unification is an innovative idea the Flink community has been advancing in recent years. Next, I will share the technical evolution of stream-batch integration in Flink and the scenarios where it has actually landed.

At the API level, unified stream-batch development can be realized through Flink's unified SQL. In last year's Tmall Double Eleven project, Alibaba used stream-batch-unified SQL to drive the real-time and offline marketing dashboard with a single code base. This year, the Flink community pushed the API stack further toward stream-batch unification: after DataStream and DataSet were consolidated, only the DataStream API remains, and on it both bounded and unbounded streams can be processed, realizing a stream-batch-unified API at the Java level.

Beyond the API, Flink's entire architecture now realizes stream-batch unification and can process bounded and unbounded data sets in one system. Users can develop one set of code against both kinds of data sources, because the connector framework is compatible with both streaming storage and batch storage, and can even switch freely between the two. Flink's scheduler is a single framework that can schedule all kinds of jobs. For the network data shuffle, in addition to the high-performance streaming shuffle Flink has long excelled at, a batch shuffle framework has been introduced; Alibaba's real-time computing team also contributed the first version of a remote shuffle service with storage-compute separation, hosted under the flink-extended open source organization: https://github.com/flink-extended/flink-remote-shuffle

With this unified architecture, the Flink community has truly realized the stream-batch-unified concept end to end, from the API down to the system architecture.


Stream-batch unification is an architectural concept; its value has to be demonstrated in concrete business scenarios. Let me take Flink CDC as an example: Flink CDC combined with the stream-batch-unified architecture enables full-plus-incremental integrated data integration. Traditional data integration uses essentially two separate technology stacks for offline and real-time integration, and a complete solution requires complex coordination between the two sides. (Data integration is indispensable, yet complicated.)

Flink's stream-batch capability combined with Flink CDC enables one-stop data integration: first synchronize the historical data in full, then automatically resume from the breakpoint and continue synchronizing incremental data, achieving one-stop data synchronization. For example, with one job and one SQL statement, Flink CDC can synchronize all of a MySQL database's data into a Hudi data lake and then automatically keep it up to date with incremental real-time synchronization.

Flink CDC can already connect to mainstream databases such as MySQL, MariaDB, PostgreSQL, MongoDB, and Oracle. Based on Flink CDC 2.0, an entire database can be synchronized to other databases or data lakes in one stop.
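As an illustration, here is a hedged sketch of such a one-job, one-SQL sync from MySQL to Hudi (hosts, credentials, schema, and paths are placeholders; the source options follow the mysql-cdc connector, the sink options the Hudi Flink connector):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlToHudiSync {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: full snapshot of a MySQL table, then automatic binlog switch.
        tEnv.executeSql(
                "CREATE TABLE orders_src (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'mysql-host'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = '***'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders')");

        // Sink: a Hudi table on the data lake.
        tEnv.executeSql(
                "CREATE TABLE orders_lake (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'hudi'," +
                "  'path' = 'hdfs:///lake/shop/orders'," +
                "  'table.type' = 'MERGE_ON_READ')");

        // One job, one statement: full + incremental synchronization.
        tEnv.executeSql("INSERT INTO orders_lake SELECT * FROM orders_src");
    }
}
```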


How does Flink CDC realize full-plus-incremental integrated data integration?

It builds on the basic technical capability of Flink's stream-batch unification, combined with the CDC connectors. Inside a Flink CDC job, the first step reads the database table in full; based on Flink's parallel computing capability, the whole table can be synchronized quickly. The job then automatically switches to the incremental binlog source, using Flink's hybrid source capability to switch between batch and streaming data sources internally, and synchronizes the binlog in real time from then on, achieving offline-plus-real-time, full-plus-incremental integrated data integration. Throughout this process, data consistency is guaranteed naturally, with no extra operations required from the user.
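The switch-over idea is exposed directly by Flink's generic HybridSource API (FLIP-150, added in Flink 1.14); the sketch below chains a bounded file source with an unbounded Kafka source, analogous to CDC's snapshot-then-binlog handover (paths and addresses are placeholders):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotThenStream {
    public static void main(String[] args) throws Exception {
        // Bounded phase: read the historical data in full.
        // (TextLineFormat is the Flink 1.14 class name.)
        FileSource<String> history = FileSource
                .forRecordStreamFormat(new TextLineFormat(), new Path("hdfs:///data/history"))
                .build();

        // Unbounded phase: continue with live events once the files are done.
        KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("orders")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // HybridSource switches from the bounded to the unbounded source
        // automatically, in the same spirit as CDC's snapshot-to-binlog switch.
        HybridSource<String> source =
                HybridSource.<String>builder(history).addSource(live).build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "history-then-live").print();
        env.execute("snapshot-then-stream");
    }
}
```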


Next, let me introduce the integrated real-time and offline data warehouse scenario. The classic mainstream architecture looks like this: most users run Flink with Kafka for real-time data stream processing and write the analysis results to an online serving layer for display or further analysis, while an asynchronous offline data warehouse runs in the background to supplement the real-time data, regularly running large-scale batch/full jobs every day or periodically revising historical data.

But this architecture has several problems:

  • Because two different technology stacks are used, the real-time link and the offline link have two sets of APIs, which increases development cost and reduces development efficiency;
  • It is hard to keep the data semantics of the real-time warehouse and the offline warehouse naturally consistent;
  • In the real-time link, data sitting in a message queue such as Kafka is not convenient for real-time exploration and analysis.


In the new Streaming Warehouse architecture, we introduce the concept of the Dynamic Table, Flink's dynamic table. All layers of the warehouse are stored in Flink Dynamic Tables, and Flink SQL connects the layers in real time, so data flows between the layers in real time and historical data can still be corrected offline. At the same time, users can use Flink SQL to explore and analyze the data flowing through the Dynamic Tables in real time. This architecture truly unifies real-time and offline analysis: unified SQL, unified data storage, unified computing framework, with data flowing through the warehouse in real time, hence the name Streaming Warehouse. It has three advantages:

  • Full-link data flows in real time, in seconds or even milliseconds;
  • All data in flow can be analyzed, with no data blind spots;
  • Real-time and offline analysis are unified, with one set of APIs for all data analysis.

The Streaming Warehouse (Streamhouse for short) will be the key evolution direction of the Flink community in the future.


In the Streamhouse we introduce a new concept called the Dynamic Table, which can be understood as stream-batch-unified storage (Flink SQL, DataStream, and so on provide Flink's stream-batch-unified computation, and the Dynamic Table is the corresponding stream-batch-unified storage). A Dynamic Table embodies the stream-table duality, so its data structure has two storage components: the first is a File Store, the second a Log Store.

As the name suggests, the File Store holds the Dynamic Table's persistent data. It adopts the classic LSM architecture, supports real-time streaming updates, deletions, and inserts, uses an open columnar storage format with high-performance features such as compression, and can be read by Flink SQL in batch mode for historical data analysis.

The Log Store holds the Dynamic Table's change sequence, an immutable log. Flink SQL in streaming mode can subscribe to the Dynamic Table's incremental changes for real-time analysis.
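The storage itself was still under design at the time of this talk, but the intended access pattern can be sketched with today's unified SQL: the same table queried once in streaming mode (driven by the Log Store) and once in batch mode (driven by the File Store). A hypothetical sketch; `dwd_orders` is a placeholder table name:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicTableDuality {
    public static void main(String[] args) {
        // Streaming mode: subscribe to the table's changelog (Log Store side);
        // results keep updating as new changes arrive.
        TableEnvironment streamEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        streamEnv.executeSql(
                "SELECT region, SUM(amount) FROM dwd_orders GROUP BY region").print();

        // Batch mode: scan the table's persisted snapshot (File Store side)
        // for one-shot historical analysis.
        TableEnvironment batchEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());
        batchEnv.executeSql(
                "SELECT region, SUM(amount) FROM dwd_orders GROUP BY region").print();
    }
}
```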


Next, a demo shows how to use Flink CDC, Flink SQL, and the Flink Dynamic Table to build a complete streaming data warehouse and experience real-time and offline analysis in one system.

demo: https://www.bilibili.com/video/BV1B44y1N7VD/

4. Machine Learning Scenario Support


Finally, let me introduce the evolution of Apache Flink in the machine learning ecosystem. Many Flink application scenarios are related to machine learning: Internet companies use Flink for both online and offline machine learning, and Flink is widely used in business scenarios such as recommendation, advertising, and search.

This year, leveraging the evolution and upgrade of Flink's stream-batch unification technology, we refactored the basic framework of Flink ML and upgraded it to Flink ML 2.0: new iterative computation capabilities and ML algorithms built on the new stream-batch-unified DataStream API, contributed to the Flink community. At the same time, many developers and companies have contributed open source machine-learning ecosystem projects on Flink, such as Deep Learning on Flink contributed by Alibaba. We hope more companies will actively participate in the future.


First of all, the machine learning framework is built on the stream-batch-unified foundation. Because the new DataStream API provides both stream and batch processing capabilities, we built a new set of iteration capabilities on top of it. This is native iterative computation in the Flink engine, supporting both synchronous and asynchronous iteration, which makes iteration more efficient. Flink's stream-batch capability can connect to different data sources, streaming and batch alike, with different computation modes, making feature engineering and model training more efficient. In addition, the new Flink ML pipeline API follows the classic scikit-learn style, so developers coming from traditional machine learning can get started easily.
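A hedged sketch of that pipeline style, using the KMeans estimator that shipped with Flink ML 2.0 (the toy input data is made up):

```java
import org.apache.flink.ml.clustering.kmeans.KMeans;
import org.apache.flink.ml.clustering.kmeans.KMeansModel;
import org.apache.flink.ml.linalg.DenseVector;
import org.apache.flink.ml.linalg.Vectors;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class KMeansPipelineExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Toy feature vectors; in practice this could be a bounded or
        // unbounded source thanks to the unified DataStream underneath.
        DataStream<DenseVector> points = env.fromElements(
                Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
                Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 9.0));
        Table input = tEnv.fromDataStream(points).as("features");

        // scikit-learn style: configure an Estimator, fit it, get a Model.
        KMeans kmeans = new KMeans().setK(2).setFeaturesCol("features");
        KMeansModel model = kmeans.fit(input);

        // The fitted Model is a Transformer that can be applied to new data.
        Table predictions = model.transform(input)[0];
        predictions.execute().print();
    }
}
```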

Building on Flink's own big data computing strengths, including its real-time processing capabilities, this architecture can chain data cleaning, data preprocessing, feature computation, sample assembly, and model training into one efficient big data + AI integrated computing pipeline, and it is also compatible with the industry's mature deep learning algorithms.


The biggest feature of Flink ML is that its framework is compatible with both streaming and batch data sources, unifying online and offline machine learning workflows. At present, Alibaba's machine learning team continues to contribute algorithms, and we hope more developers and companies will participate in the future. Flink ML can also embed mainstream deep learning libraries such as TensorFlow and PyTorch to create a full-link deep learning pipeline.

Scheduling and managing the workflows of the entire machine learning process is a problem that crosses the boundary between big data and AI. To address it, Alibaba's real-time computing team open sourced a project called AI Flow ( https://github.com/flink-extended/ai-flow ) last year. Around Flink, it provides a management and scheduling system for the whole process, from feature computation through model training, model prediction, and model validation. Many companies in the industry already contribute to and use this project, and we very much hope more companies and developers will participate to advance the ecosystem together.

After years of rapid development, the Flink community's technological innovation is still evolving: from the original stream processing engine toward a more comprehensive streaming data warehouse, and toward playing a greater role in compute-driven scenarios such as data lakes and machine learning. We look forward to more companies and developers joining the Flink community to expand the Flink ecosystem together.


Click to view live replay & speech PDF

For more Flink-related technical questions, you can scan the QR code to join the community DingTalk group and get the latest technical articles and community news first. Please follow the official account~


