This article was compiled by community volunteer Chen Zhengyu, based on the talk "Detailed Explanation of Flink CDC" given by Alibaba senior development engineer Xu Bangjiang (Xue Jin) at the Flink Meetup in Beijing on July 10. It explains in depth the core features of the newly released Flink CDC 2.0.0, including major improvements such as concurrent reading of full data, checkpoint support, and lock-free reading.

1. CDC overview

CDC stands for Change Data Capture. In a broad sense, any technology that captures data changes can be called CDC. The term as commonly used today, however, refers to technologies that capture changes in databases. CDC has a very wide range of application scenarios:

  • Data synchronization: used for backup and disaster recovery;
  • Data distribution: one data source distributed to multiple downstream systems;
  • Data collection: an important data source for ETL data integration into data warehouses and data lakes.

There are many technical solutions for CDC; the mainstream implementation mechanisms in the industry fall into two categories:

  • Query-based CDC:

    • Offline scheduled query jobs with batch processing: a table is synchronized to other systems, and the latest data is fetched by a query each time;
    • Data consistency cannot be guaranteed, because the data may have changed several times while the query is running;
    • Real-time performance cannot be guaranteed, because offline scheduling has an inherent delay.
  • Log-based CDC:

    • Real-time consumption of the log with stream processing. For example, MySQL's binlog completely records the changes in the database, so the binlog file can be used as the data source of a stream (a quick binlog check is sketched after this list);
    • Data consistency is guaranteed, because the binlog file contains the full history of change details;
    • Real-time performance is guaranteed, because logs like the binlog can be consumed as a stream and provide real-time data.
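
For MySQL, log-based CDC relies on the binlog being enabled and written in row format; a quick sanity check on the source database could look like this (standard MySQL statements, shown only as an illustration):

```sql
SHOW VARIABLES LIKE 'log_bin';        -- expect: ON
SHOW VARIABLES LIKE 'binlog_format';  -- expect: ROW
SHOW MASTER STATUS;                   -- current binlog file and position
```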

Comparing common open source CDC solutions, we can find:

img

  • Comparing incremental synchronization capability:

    • Log-based approaches can implement incremental synchronization well;
    • Query-based approaches can hardly implement incremental synchronization.
  • Comparing full synchronization capability, both query-based and log-based CDC solutions generally support it, with the exception of Canal.
  • Comparing full + incremental synchronization capability, only Flink CDC, Debezium, and Oracle GoldenGate support it well.
  • From an architectural point of view, the table divides the solutions into stand-alone and distributed architectures. Distributed here does not simply mean horizontal scaling of data reading; more importantly, it means the ability to connect to distributed systems in big-data scenarios. For example, when Flink CDC data is written into a lake or warehouse, the downstream is usually a distributed system such as Hive, HDFS, Iceberg, or Hudi; from this perspective, Flink CDC's architecture can integrate with such systems very well.
  • In terms of data conversion and data cleaning capability: when data enters the CDC tool, can it be easily filtered, cleaned, or even aggregated?

    • With Flink CDC this is quite simple: the data can be manipulated directly through Flink SQL;
    • With DataX, Debezium, and similar tools, this has to be done through scripts or templates, so the bar for users is relatively high.
  • Finally, in terms of ecosystem, this refers to the support for downstream databases and data sources. Flink CDC has a rich set of downstream connectors, such as writers for common systems like TiDB, MySQL, PostgreSQL, HBase, Kafka, and ClickHouse, and it also supports custom connectors.

2. Flink CDC project

With that background, let us review the motivation for developing the Flink CDC project.

1. Dynamic Table & ChangeLog Stream

Everyone knows that Flink has two basic concepts: Dynamic Table and Changelog Stream.

fcs_1

  • Dynamic Table is the dynamic table defined in Flink SQL; the concepts of dynamic table and stream are equivalent. As the figure above shows, a stream can be converted into a dynamic table, and a dynamic table can be converted back into a stream.
  • In Flink SQL, data flows from one operator to another in the form of a Changelog Stream, which at any point in time can be translated into a table or into a stream.

If you think about tables and binlog logs in MySQL, you will find that all changes to a table are recorded in the binlog. If the table is continuously updated, the binlog stream keeps growing, and the table in the database is equivalent to the result of materializing the binlog stream at a certain point in time; the log stream is the result of continuously capturing the table's change data. This shows that Flink SQL's Dynamic Table can naturally represent a constantly changing MySQL table.

Debezium_to_cdc

On this basis, we investigated a number of CDC technologies and finally chose Debezium as the underlying collection tool of Flink CDC. Debezium supports full synchronization, incremental synchronization, and full + incremental synchronization, which is very flexible; at the same time, log-based CDC technology makes exactly-once processing possible.

Comparing Flink SQL's internal data structure RowData and Debezium's data structure, we can find that the two are very similar.

  • Each RowData carries a metadata field RowKind with four types: insert (INSERT), before-update image (UPDATE_BEFORE), after-update image (UPDATE_AFTER), and delete (DELETE); these four types correspond directly to the change concepts in a database binlog.
  • Debezium's data structure has a similar metadata field, op, with four values: c, u, d, and r, corresponding to create, update, delete, and read. For u (an update), the payload contains both the before and after images.

By mapping these two data structures, the underlying data of Debezium and Flink can be connected easily; you will find that Flink is technically a very good fit for CDC.

2. Traditional CDC ETL analysis

Let's take a look at the ETL analysis link of the traditional CDC, as shown in the following figure:

img

In traditional CDC-based ETL analysis, a data collection tool is indispensable. Debezium is commonly used abroad, while Alibaba's open-source Canal is common in China. The collection tool is responsible for collecting incremental data from the database, and some tools also support full synchronization. The collected data is usually output to message middleware such as Kafka, and the Flink compute engine then consumes this data and writes it to the destination, which can be various databases, data lakes, real-time data warehouses, or offline data warehouses.

Note that Flink provides the changelog-json format, which can write changelog data to offline data warehouses such as Hive / HDFS; for real-time data warehouses, Flink supports directly writing changelogs to Kafka through the upsert-kafka connector.
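
As a rough illustration of the second path, an upsert-kafka sink table can be declared in Flink SQL as follows (a minimal sketch; the table name, topic, and fields are illustrative):

```sql
-- Changelog written to Kafka as upsert records keyed by the primary key
CREATE TABLE dws_orders_kafka (
  order_id BIGINT,
  order_cnt BIGINT,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'dws_orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```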

cdc_4_1

We kept asking ourselves whether Flink CDC could replace the collection components and message queues inside the dashed box of the figure above, simplifying the analysis link and reducing maintenance costs; fewer components also mean better data timeliness. The answer is yes, which leads to our Flink CDC-based ETL analysis process.

3. ETL analysis based on Flink CDC

After switching to Flink CDC, besides fewer components and easier maintenance, another advantage is that Flink SQL greatly lowers the bar for users. Consider the following example:

cdc_etl_sql

This example uses Flink CDC to synchronize database data and write it to TiDB. The user directly creates the MySQL-CDC tables products and orders in Flink SQL, joins the two data streams, and writes the result straight to the downstream database. A single Flink SQL job completes the analysis, processing, and synchronization of the CDC data.
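
Since the figure above is only a screenshot, here is a minimal sketch of what such a job looks like (table schemas, hostnames, and credentials are illustrative, not the exact SQL from the talk):

```sql
-- MySQL source tables captured via the mysql-cdc connector
CREATE TABLE products (
  id INT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = '***',
  'database-name' = 'mydb',
  'table-name' = 'products'
);

CREATE TABLE orders (
  order_id INT,
  product_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = '***',
  'database-name' = 'mydb',
  'table-name' = 'orders'
);

-- TiDB sink, written through the JDBC connector (TiDB speaks the MySQL protocol)
CREATE TABLE enriched_orders (
  order_id INT,
  product_name STRING,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://tidb-host:4000/mydb',
  'table-name' = 'enriched_orders',
  'username' = 'flinkuser',
  'password' = '***'
);

-- One SQL job: join the two CDC streams and write the result downstream
INSERT INTO enriched_orders
SELECT o.order_id, p.name AS product_name, o.amount
FROM orders AS o
JOIN products AS p ON o.product_id = p.id;
```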

You will find that this is a pure SQL job, which means anyone who knows SQL, including BI and business-line colleagues, can complete this kind of work. At the same time, users can use the rich syntax provided by Flink SQL to clean, analyze, and aggregate the data.

cdc_5_1

Data cleaning, analysis, and aggregation like this would be very difficult to achieve with existing CDC solutions.

In addition, data widening and various business logic can be completed easily with Flink SQL's dual-stream JOIN, dimension-table JOIN, and UDTF syntax.
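
For instance, a dimension-table (lookup) join in Flink SQL can widen the CDC stream with reference data (a hedged sketch; dim_products is an illustrative JDBC-backed dimension table, and orders is assumed to declare a processing-time column such as proc_time AS PROCTIME()):

```sql
-- Each order row triggers a point lookup into the dimension table at processing time
INSERT INTO enriched_orders
SELECT o.order_id, d.name, o.amount
FROM orders AS o
JOIN dim_products FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.product_id = d.id;
```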

cdc_5_2

4. Flink CDC project development

  • In July 2020, Yunxie submitted the first commit; the project was incubated out of personal interest;
  • Supported MySQL-CDC in mid-July 2020;
  • Supported Postgres-CDC at the end of July 2020;
  • In one year, the number of stars of this project on GitHub has exceeded 800.

cdc_6

3. Flink CDC 2.0 in detail

1. Flink CDC pain points

MySQL CDC is the most widely used and most important connector in Flink CDC. In the following sections, "Flink CDC connector" refers to the MySQL CDC connector unless otherwise noted.

As the Flink CDC project developed, we received a lot of feedback from users in the community, which can be summarized into three main pain points:

img

  • Full + incremental reading needs to guarantee the consistency of all data, which requires locking. Locking, however, is a very dangerous operation at the database level: to guarantee consistency, the underlying Debezium needs to lock the database or table being read; a global lock may hang the whole database, and a table-level lock blocks reads on the table. DBAs generally do not grant lock permissions.
  • Horizontal scaling is not supported. Flink CDC is built on Debezium, whose original architecture is single-node, so Flink CDC only supports single parallelism. In the full-reading phase, if the table is very large (hundreds of millions of rows) and reading takes hours or even days, the user cannot add resources to speed up the job.
  • Checkpoints are not supported in the full-reading phase. CDC reading has two phases, full reading and incremental reading, and the full-reading phase currently does not support checkpoints. This causes a problem: suppose a full synchronization takes 5 hours and the job fails after 4 hours; it then has to restart from scratch and read for another 5 hours.

2. Debezium lock analysis

Flink CDC encapsulates Debezium at the bottom layer. Debezium synchronizes a table in two stages:

  • Full phase: query all records currently in the table;
  • Incremental phase: consume change data from the binlog.

Most users run full + incremental synchronization. Locking happens in the full phase; its purpose is to determine the starting position of the full phase so that incremental + full reads neither miss nor duplicate a single record, which guarantees data consistency. From the figure below, we can analyze the locking process of global locks and table locks: the red line on the left is the lifecycle of the lock, and on the right is the lifecycle of the MySQL repeatable-read transaction.

Debezium_lock

Taking the global lock as an example: first the lock is acquired, then a repeatable-read transaction is opened. While the lock is held, the starting binlog position and the current table schema are read. This is done so that the binlog starting position corresponds to the schema read at that moment, because the schema of the table may change over time, for example by dropping or adding columns. After reading these two pieces of information, the SnapshotReader reads the full data inside the repeatable-read transaction; when the full read finishes, a BinlogReader starts reading incrementally from the recorded binlog starting position, so that full and incremental data connect seamlessly.

Table locks are a degraded version of the global lock: the permission required for a global lock is relatively high, so in some scenarios users only have table-lock permission. The table lock is held even longer, because of a quirk: releasing the lock early would implicitly commit the repeatable-read transaction, so the lock can only be released after the full data has been read.
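
In MySQL terms, the global-lock flow described above corresponds roughly to the following statement sequence (a simplified sketch of the steps, not Debezium's exact implementation; the table name is illustrative):

```sql
FLUSH TABLES WITH READ LOCK;                 -- acquire the global read lock
SHOW MASTER STATUS;                          -- record the binlog starting position
-- read the current table schemas while the lock is held
START TRANSACTION WITH CONSISTENT SNAPSHOT;  -- open the repeatable-read snapshot
UNLOCK TABLES;                               -- the global lock can now be released
SELECT * FROM mydb.orders;                   -- SnapshotReader: full read inside the snapshot
COMMIT;
-- BinlogReader then consumes the binlog from the recorded starting position
```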

After the above analysis, let's take a look at what serious consequences these locks will cause:

img

Flink CDC 1.x can be configured to run without locks, which satisfies most scenarios but sacrifices some data accuracy. By default, Flink CDC 1.x acquires a global lock, which guarantees data consistency but carries the risk of hanging the database described above.

3. Flink CDC 2.0 design (taking MySQL as an example)

From the above analysis, the core of the 2.0 design is to solve the three problems above: support lock-free reading, horizontal scaling, and checkpointing.

img

The lock-free algorithm described in the DBLog paper is shown in the following figure:

unlock_1

On the left is the description of the chunk splitting algorithm, which is similar in principle to sharding or partitioned tables in many databases: the data in the table is split into chunks by the table's primary key. Assuming each chunk has a step size of 10, we only need to make the chunk intervals left-closed and right-open (or left-open and right-closed), so that concatenating all the chunk intervals exactly covers the primary key range of the table.
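
For instance, with an auto-increment primary key id and a step of 10, the split produces intervals like the following (purely illustrative):

```sql
-- Left-closed, right-open chunks that together cover the whole key range:
--   chunk-1: [1, 11)    chunk-2: [11, 21)    chunk-3: [21, 31)   ...
-- Reading one chunk is just a bounded range scan on the primary key:
SELECT * FROM orders WHERE id >= 11 AND id < 21;   -- chunk-2
```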

On the right is the description of the lock-free reading algorithm for a single chunk. The core idea is that, after the chunks are split, the full read and the incremental read of each chunk are merged consistently without using locks. The chunk splitting is shown in the following figure:

Chunk_cut

Because each chunk only covers the data within its own primary key range, it is not hard to deduce that, as long as the read consistency of each chunk is guaranteed, the read consistency of the whole table is guaranteed. This is the basic principle of the lock-free algorithm.

The chunk reading algorithm in Netflix's DBLog paper maintains a signal table in the database and uses it to write marks into the binlog: before reading a chunk, a mark records the low watermark (Low Position), and after the chunk has been read, another mark records the high watermark (High Position). The full data of the chunk is queried between the low and high watermarks; then the incremental binlog data between the two positions is merged into the chunk's full data, yielding the chunk's data as of the high watermark.

Flink CDC adapted the chunk reading algorithm to its own situation and removed the signal table: instead of writing marks into the binlog, it reads the binlog position directly, so no extra signal table needs to be maintained. The overall chunk reading algorithm is described in the figure below:

Chunk_read

Take Chunk-1 as an example, with the interval [K1, K10]. First, the data in this interval is read with a SELECT and stored in a buffer; a binlog position (the low watermark) is recorded before the SELECT, and another binlog position (the high watermark) is recorded after the SELECT completes. Then the incremental part starts: the binlog is consumed from the low watermark to the high watermark.

  • The -(k2,100), +(k2,108) records in the figure indicate that the value of this row is updated from 100 to 108;
  • The second record deletes k3;
  • The third record updates k2 to 119;
  • The fourth record changes the value of k5 from 77 to 100.

Observing the final output in the lower right corner of the figure, you will find that the keys appearing in the chunk's binlog are k2, k3, and k5; we mark these keys in the buffer.

  • k1, k4, k6, and k7 did not change between the low and high watermarks, so these records in the buffer can be output directly;
  • For the changed keys, the incremental data is merged into the full data and only the final merged record is kept. For example, the final value of k2 is 119, so only +(k2,119) is output; the intermediate changes are discarded.

In this way, the final output of the Chunk is the latest data in the chunk at the high point.
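
Expressed as a statement sequence, the consistent read of a single chunk looks roughly like this (a sketch of the idea above, with an illustrative table and the chunk bounds K1/K10 from the figure; the real logic lives inside the connector):

```sql
SHOW MASTER STATUS;                         -- record the LOW binlog position
SELECT * FROM orders
  WHERE id >= K1 AND id <= K10;             -- snapshot the chunk [K1, K10] into a buffer
SHOW MASTER STATUS;                         -- record the HIGH binlog position
-- Replay binlog events between LOW and HIGH that fall into [K1, K10],
-- keep only the latest image per key in the buffer, and emit the buffer:
-- it now equals the chunk's state at the HIGH position.
```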

The figure above describes consistent reading of a single chunk, but if a table is split into many chunks distributed across different tasks, how are the chunks distributed and how is globally consistent reading guaranteed?

This is implemented elegantly on top of FLIP-27. As the following figure shows, the SourceEnumerator component is responsible for chunk splitting; the split chunks are handed to downstream SourceReaders, and distributing chunks to different SourceReaders enables concurrent reading of snapshot chunks. At the same time, FLIP-27 makes it easy to implement checkpoints at chunk granularity.

snapshot-Chunk

After a snapshot chunk is read, the reader needs to report its completion, shown as the orange report information in the figure below: the SourceReader reports the completed snapshot chunk to the SourceEnumerator.

Chunk_report

The main purpose of the report is the subsequent distribution of the binlog chunk (shown in the figure below). Because Flink CDC supports full + incremental synchronization, after all snapshot chunks have been read, the incremental binlog needs to be consumed; this is done by distributing a single binlog chunk to one of the SourceReaders for single-parallelism reading.

binlog-Chunk

For most users, there is no need to pay too much attention to the details of the lock-free algorithm and sharding, just to understand the overall process.

The overall process can be summarized as follows: first the table is split into snapshot chunks by primary key, and the snapshot chunks are distributed to multiple SourceReaders. Each snapshot chunk is read with the lock-free algorithm to achieve consistent reads, and the SourceReaders support chunk-granularity checkpoints during reading. After all snapshot chunks have been read, a binlog chunk is issued for incremental binlog reading. This is the overall process of Flink CDC 2.0, as shown in the following figure:

Chunk_all
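
In the released MySQL CDC 2.0 connector, this incremental snapshot reading is exposed through connector options. A hedged sketch of a table declaration (option names follow the project's documentation; hostnames, credentials, and values are illustrative):

```sql
CREATE TABLE orders (
  order_id BIGINT,
  product_id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = '***',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  'scan.incremental.snapshot.enabled' = 'true',    -- lock-free parallel snapshot (default in 2.0)
  'scan.incremental.snapshot.chunk.size' = '8096', -- rows per chunk
  'server-id' = '5400-5408'                        -- a server-id range sized to the source parallelism
);
```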

Flink CDC is a fully open-source project: all of its design documents and source code have been contributed to the open-source community. Flink CDC 2.0 has now been officially released, and the core improvements and enhancements include:

  • MySQL CDC 2.0 is provided, with core features including:

    • Concurrent reading: the read performance for full data can be scaled horizontally;
    • Lock-free throughout, with no risk of locking online business;
    • Resumable reads: checkpoints are supported in all stages.
  • A documentation website was built, with multi-version documentation support and keyword search.

The author tested with the customer table of the TPC-DS dataset: Flink version 1.13.1, 65 million rows in the customer table, and a source parallelism of 8. In the full-reading phase:

  • MySQL CDC 2.0 takes 13 minutes;
  • MySQL CDC 1.4 takes 89 minutes;
  • The read performance is improved by 6.8 times.

In order to provide better documentation support, the Flink CDC community has built a documentation website, which supports version management of documents:

img

The document website supports keyword search function, which is very practical:

img

4. Future planning

img

Regarding the future planning of the CDC project, we hope to focus on three aspects: stability, advanced features and ecological integration.

  • Stability

    • Attract more developers through the community and leverage corporate open-source efforts to improve the maturity of Flink CDC;
    • Support lazy assigning. The idea of lazy assigning is to split chunks in batches rather than all at once. Currently all chunks are split before data reading starts; for example, with 10,000 chunks, the first 1,000 could be split and read first, and splitting would continue afterwards, saving the time spent splitting chunks up front.
  • Advanced Feature

    • Support Schema Evolution. The scenario: while a database is being synchronized, a field is suddenly added to a table, and the hope is that this field is automatically added downstream as synchronization continues;
    • Support Watermark Pushdown. Heartbeat information obtained from the CDC binlog can be used as a watermark, which tells us the current consumption progress of the stream;
    • Support META data. In sharded database/table scenarios, metadata may be needed to tell which database and table a record comes from, enabling more flexible processing downstream when writing into lakes and warehouses;
    • Whole-database synchronization: users only need one line of SQL to synchronize an entire database, instead of defining a DDL and a query for each table.
  • Ecological integration

    • Integrate more upstream databases, such as Oracle and MS SQL Server. Cloudera is currently actively contributing the oracle-cdc connector;
    • At the lake-ingestion level, Hudi and Iceberg still have room for write optimization. For example, under high-QPS ingestion, data distribution has a large performance impact; this can be further optimized by working with and integrating into those ecosystems.

Appendix

[1] Flink-CDC project address

[2] Flink-CDC document website

[3] Percona-MySQL global lock time analysis

[4] DBLog-Lock-free algorithm paper

[5] Flink FLIP-27 Design Document

