
This article is compiled from the talk "Pravega Flink Connector Past, Present and Future" given by Yumin Zhou, software engineer at Dell Technologies, at Flink Forward Asia 2020. The article covers:

  1. Introduction to Pravega and the Pravega connector
  2. The past of the Pravega connector
  3. Flink 1.11 integration: lessons learned
  4. Future outlook

1. Introduction to Pravega and Pravega connector


The name of the Pravega project comes from Sanskrit, meaning "good speed". The project originated in 2016 and was open sourced on GitHub under the Apache V2 license. In November 2020, it joined the CNCF family as a CNCF sandbox project.

Pravega is designed for large-scale data-streaming scenarios: a new enterprise-grade storage system that addresses the shortcomings of traditional message-queue storage. It provides unbounded, high-performance stream reads and writes, and adds enterprise features such as elastic scaling and tiered storage, which help enterprise users reduce usage and maintenance costs. At the same time, we have years of accumulated expertise in the storage field and can rely on the company's commercial storage products to provide customers with durable storage.


The architecture diagram above depicts a typical Pravega read/write scenario and introduces Pravega terminology to help readers further understand the system architecture.

  • The middle part is a Pravega cluster, which as a whole is an abstraction over streams. A Stream can be thought of as analogous to a Kafka topic. Similarly, a Pravega Segment is comparable to a Kafka partition as the unit of data partitioning, while additionally supporting dynamic scaling.

    A Segment stores a stream of binary data. Depending on the volume of traffic, segments are merged or split to release or concentrate resources. When this happens, a segment is sealed to prohibit new writes, and the newly created segments receive the new data.

  • The left side of the picture shows data writing, which supports append-only writes. The user can specify a routing key for each event to determine which segment it belongs to; this is comparable to a Kafka partitioner. Data with the same routing key is order-preserving, ensuring that the read order matches the write order.
  • The right side of the picture shows data reading. Multiple readers are controlled by a Reader Group, which balances load across readers to ensure that all segments are evenly distributed among them. It also provides a Checkpoint mechanism that forms a consistent cut of the stream to support recovery from failures. For reads, we support both batch and streaming semantics: streaming scenarios support tail reads, while batch scenarios focus more on high concurrency to achieve high throughput.
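As a rough illustration of the write-side behavior described above, here is a plain-Java sketch of how a routing key can determine segment ownership and why per-key ordering is preserved. All class and method names are hypothetical; Pravega's real hashing and segment mapping are internal implementation details.

```java
import java.util.*;

// Illustrative sketch only: models the idea that a routing key pins an
// event to one segment, so events with the same key keep their write order.
class RoutingSketch {
    // Map a routing key onto one of `numSegments` segments, partitioner-style.
    static int segmentFor(String routingKey, int numSegments) {
        return Math.floorMod(routingKey.hashCode(), numSegments);
    }

    // Group events by segment; within one segment, events of a given routing
    // key appear in append order, which is why per-key reads are ordered.
    static Map<Integer, List<String>> write(List<String[]> events, int numSegments) {
        Map<Integer, List<String>> segments = new HashMap<>();
        for (String[] e : events) { // e = {routingKey, payload}
            segments.computeIfAbsent(segmentFor(e[0], numSegments), s -> new ArrayList<>())
                    .add(e[1]);
        }
        return segments;
    }
}
```

Because `segmentFor` is deterministic, two events with the same routing key always land in the same segment, so a reader of that segment sees them in write order.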

2. The past of the Pravega Flink connector


The Pravega Flink connector was the first connector Pravega supported, which reflects the very consistent design philosophies of Pravega and Flink: both are stream-first systems unifying batch and stream processing, and together they form a complete storage-plus-compute solution.

1. Development history of the connector


  • The connector has been an independent GitHub project since 2017. That year, we developed against Flink 1.3; Flink PMC members including Stephan Ewen joined the collaboration to build the basic Source / Sink functionality, supporting basic reads and writes, along with the integration of Pravega Checkpoint, which will be introduced later.
  • One of the most important highlights of 2018 was support for end-to-end exactly-once semantics. The team had many discussions with the Flink community at the time. Pravega first supported transactional writes on the client side; on that basis, in cooperation with the community, we implemented a transactional sink function based on distributed checkpoints and two-phase commit semantics. Flink later abstracted this two-phase commit API further into the well-known TwoPhaseCommitSinkFunction interface, which was also adopted by the Kafka connector. The community has a blog post dedicated to this interface and end-to-end exactly-once semantics.
  • In 2019, the focus was on completing the connector's other APIs, including support for batch reads and the Table API.
  • The main focus of 2020 was the Flink 1.11 integration, centered on the new features of FLIP-27 and FLIP-95.
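The two-phase commit contract behind TwoPhaseCommitSinkFunction can be sketched in plain Java. This is a minimal model of the protocol only, not the real Flink or Pravega API; all names below are illustrative.

```java
import java.util.*;

// Plain-Java model of the two-phase-commit sink contract:
// writes buffer into an open transaction; at checkpoint time the
// transaction is pre-committed (phase 1); when the checkpoint is
// acknowledged it is committed (phase 2); on failure it is aborted.
class TwoPhaseCommitSketch {
    final List<String> open = new ArrayList<>();          // current transaction buffer
    final List<List<String>> pending = new ArrayList<>(); // pre-committed, awaiting checkpoint ack
    final List<String> committed = new ArrayList<>();     // durably visible to readers

    void write(String event) { open.add(event); }         // like invoke(): buffer into the open txn

    // like snapshotState(): pre-commit the open transaction, start a new one
    void preCommit() { pending.add(new ArrayList<>(open)); open.clear(); }

    // like notifyCheckpointComplete(): make pre-committed transactions visible
    void commit() { for (List<String> txn : pending) committed.addAll(txn); pending.clear(); }

    // failure before the checkpoint completes: pre-committed data is discarded
    void abort() { pending.clear(); open.clear(); }
}
```

The key property is that data pre-committed for a checkpoint only becomes visible after the checkpoint completes, which is what gives the sink exactly-once semantics end to end.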

2. Checkpoint integration implementation


Taking Kafka as an example, let's first look at how Kafka integrates with Flink checkpoints.

The figure above shows a typical Kafka read architecture. Flink's checkpoint implementation is based on the Chandy-Lamport algorithm: when the Job Master triggers a checkpoint, it sends an RPC request to each Task Executor; upon receiving it, each task merges the Kafka commit offsets from its state storage back to the Job Manager to form the checkpoint metadata.

On closer inspection, this design has some small problems:

  • Scaling and rebalancing support. When partitions are adjusted, or, in Pravega's case, when segments dynamically scale out and in, how is the consistency of the merged offsets guaranteed?
  • Each task has to maintain its own offset information, so the whole design is coupled to Kafka's internal offset abstraction.
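The coupling described in the second point can be made concrete with a small plain-Java model (names are illustrative, not Kafka's or Flink's actual API): the checkpoint state is literally a map of partition offsets, so any change to the partitioning scheme complicates snapshot merging.

```java
import java.util.*;

// Model of the Kafka-style design: each task tracks a partition -> offset
// map, and on checkpoint that map itself becomes the checkpoint state,
// coupling the checkpoint format to Kafka's notion of offsets.
class OffsetSnapshotSketch {
    final Map<Integer, Long> offsets = new HashMap<>(); // partition -> next offset to read

    void consume(int partition) { offsets.merge(partition, 1L, Long::sum); }

    // like snapshotState(): the offsets are copied into checkpoint metadata
    Map<Integer, Long> snapshot() { return new HashMap<>(offsets); }

    // recovery rewinds to the snapshotted offsets; if the partition count
    // changed in between, merging such snapshots consistently gets tricky
    void restore(Map<Integer, Long> snapshot) {
        offsets.clear();
        offsets.putAll(snapshot);
    }
}
```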


To address these shortcomings, Pravega designed its own checkpoint mechanism. Let's take a look at how it integrates with Flink's checkpoint.

Reading a Pravega Stream is similar, but triggering a checkpoint is different: the Job Master no longer sends RPC requests to the Task Executors; instead, through the ExternallyInducedSource interface, it sends a checkpoint request to Pravega.

Pravega then uses its StateSynchronizer component to synchronize and coordinate all readers, sending checkpoint events to each of them. Once every Task Executor has read its checkpoint event, Pravega marks the checkpoint as complete, and the returned Pravega checkpoint is stored in the Job Master's state, completing the checkpoint.

This implementation is actually cleaner from Flink's point of view, because it does not couple Flink to the implementation details of the external system; the entire checkpoint work is delegated to Pravega.
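The in-band coordination described above can be modeled in a few lines of plain Java. This is a conceptual sketch only (class names and structure are hypothetical, not Pravega's real StateSynchronizer): instead of an RPC to every task, a checkpoint event is injected into each reader's feed, and the checkpoint completes once every reader has read past it.

```java
import java.util.*;

// Minimal model of Pravega's in-band checkpoint: the coordinator appends a
// checkpoint marker to every reader's event feed; each reader's position at
// the marker forms a consistent cut across the whole stream.
class CheckpointModel {
    static final String CP = "__checkpoint__";
    final Map<String, Deque<String>> readerQueues = new LinkedHashMap<>();

    void addReader(String id, String... events) {
        readerQueues.put(id, new ArrayDeque<>(Arrays.asList(events)));
    }

    // The coordinator injects the checkpoint event into every reader's feed.
    void triggerCheckpoint() {
        for (Deque<String> q : readerQueues.values()) q.addLast(CP);
    }

    // Each reader drains events until it hits the marker; the returned map of
    // events-processed-per-reader is the consistent position of the checkpoint.
    Map<String, Integer> runUntilCheckpoint() {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (Map.Entry<String, Deque<String>> e : readerQueues.entrySet()) {
            int n = 0;
            while (!e.getValue().isEmpty() && !e.getValue().peekFirst().equals(CP)) {
                e.getValue().pollFirst();
                n++;
            }
            e.getValue().pollFirst(); // consume the checkpoint event itself
            positions.put(e.getKey(), n);
        }
        return positions; // every reader reached the marker -> checkpoint done
    }
}
```

Because the marker travels in the same channel as the data, no reader can have processed an event "past" the checkpoint without the coordinator knowing, which is what makes the cut consistent.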

3. Flink 1.11 integration: lessons learned

Flink 1.11 was an important release in 2020, and it posed many challenges for the connector, mainly around the implementation of two FLIPs: FLIP-27 and FLIP-95. The team spent a lot of time integrating these two new features and encountered some problems and challenges along the way. Let us share how we stepped into and climbed out of the pits, taking FLIP-95 as an example.

1. FLIP-95 integration


FLIP-95 is a new Table API whose motivation is similar to that of FLIP-27: to provide a unified batch-stream interface and to better support CDC integration. To address the lengthy configuration keys, the companion FLIP-122 was proposed to simplify the configuration-key scheme.

1.1 Pravega's old Table API


The figure above shows Pravega's Table API before Flink 1.10. From the table-creation DDL in the figure, you can see:

  • Update mode and append mode are used to distinguish batch from streaming, and the distinction is not intuitive.
  • The configuration is verbose and complicated: a stream to read has to be configured through very long keys such as connector.reader.stream-info.0.
  • At the code level, there is a lot of coupling with the DataStream API that is difficult to maintain.

These problems gave us strong motivation to implement this new set of APIs, letting users work better with the table abstraction. The whole framework is shown in the figure: with the new framework, all configuration items are defined through the ConfigOption interface and are managed centrally in the PravegaOptions class.
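To show the pattern, here is a stripped-down, plain-Java imitation of the ConfigOption idea (the real interface lives in Flink's `org.apache.flink.configuration` package; the option names and defaults below are illustrative, not the connector's actual options): every option is a typed constant with a key and a default, declared in one central class.

```java
import java.util.*;

// Plain-Java imitation of the ConfigOption pattern: typed, centrally
// declared options replace ad-hoc string keys scattered through the code.
class Options {
    static final class ConfigOption<T> {
        final String key;
        final T defaultValue;
        ConfigOption(String key, T defaultValue) { this.key = key; this.defaultValue = defaultValue; }
    }

    // All options declared in one place, analogous to a PravegaOptions class.
    static final ConfigOption<String> SCOPE = new ConfigOption<>("scope", null);
    static final ConfigOption<String> STREAM = new ConfigOption<>("stream", null);
    static final ConfigOption<Integer> MAX_RETRIES = new ConfigOption<>("max-retries", 3);

    // Look up an option in a raw configuration map, falling back to its default.
    static <T> T get(Map<String, Object> conf, ConfigOption<T> opt) {
        @SuppressWarnings("unchecked")
        T v = (T) conf.getOrDefault(opt.key, opt.defaultValue);
        return v;
    }
}
```

The payoff is that option keys, types, and defaults live in one discoverable place, instead of long dotted strings like `connector.reader.stream-info.0` being parsed throughout the connector.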


1.2 Pravega's new Table API

The figure below shows table creation with the latest Table API implementation. Compared with the previous version, it is greatly simplified, and the functionality has also seen many optimizations, such as configuration of enterprise-grade security options, multi-stream reads, and specifying a starting StreamCut.
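To make the simplification concrete, here is an illustrative FLIP-122-style DDL. The option keys below are examples of the style only, not a verbatim copy of the connector's actual option set, which is defined in the PravegaOptions class.

```sql
-- Illustrative only: flat, short option keys in the FLIP-122 style,
-- in place of keys like connector.reader.stream-info.0.
CREATE TABLE user_events (
    user_id STRING,
    action  STRING,
    ts      TIMESTAMP(3)
) WITH (
    'connector' = 'pravega',
    'controller-uri' = 'tcp://localhost:9090',
    'scope' = 'my-scope',
    'scan.streams' = 'my-stream',
    'format' = 'json'
);
```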


2. FLINK-18641: experience from the resolution process


Next, I want to share a small experience from the Flink 1.11 integration: the process of resolving one issue. FLINK-18641 is a problem we encountered while integrating version 1.11.0: during the upgrade, our unit tests reported a CheckpointException. What follows is our complete debugging process.

  • First, we did step-by-step breakpoint debugging on our own; by examining the error logs and analyzing the relevant Pravega and Flink source code, we determined that the problem was related to Flink's CheckpointCoordinator;
  • Then we reviewed recent commits in the community and found that after Flink 1.10, the CheckpointCoordinator threading model had changed from the original lock-based model to a mailbox model. This change caused some of our logic, originally synchronized and serialized, to mistakenly run in parallel, which triggered the error;
  • After further reading the pull request for this change, we got in touch with the related committers by email, finally confirmed the problem on the dev mailing list, and opened the JIRA ticket.

We have also summarized the following tips for fellow developers working in open-source communities:

  • Search the mailing list and JIRA first to see whether anyone else has raised a similar issue;
  • Describe the problem completely, providing detailed version information, error logs, and reproduction steps;
  • After getting feedback from community members, follow up with meetings to communicate and discuss solutions;
  • Use English in non-Chinese-speaking environments.

In fact, as Chinese developers, we have channels beyond the mailing list and JIRA: there are also DingTalk groups and video calls through which we can reach many committers. Open source is ultimately a process of communication, and engaging more with the community promotes the common growth of the projects.

4. Future outlook


  • The bigger job ahead is the integration of the Pravega Schema Registry, which provides management and storage of the metadata of Pravega streams, including data schemas and serialization formats. This project released its first open-source version alongside Pravega 0.8. In the subsequent 0.10 release, we will implement a Pravega Catalog based on this project, making the Flink Table API even easier to use;
  • Second, we continuously follow new developments in the Flink community and will actively integrate new community versions and features; current plans include FLIP-143 and FLIP-129;
  • The community is also gradually completing the migration to the new Docker-container-based test framework, and we are following and integrating this work as well.

Finally, we hope that members of the community will pay more attention to the Pravega project, and jointly promote the development of the Pravega connector and Flink.

