Abstract: This article is compiled from a talk delivered by Xu Bangjiang (Xue Jin), Apache Flink Committer, Flink CDC Maintainer, and Senior Development Engineer at Alibaba, at the Flink CDC Meetup on May 21. The main contents include:
- Flink CDC technology
- Pain points of traditional data integration solutions
- Real-time synchronization and conversion of massive data based on Flink CDC
- Flink CDC community development
1. Flink CDC technology
CDC is short for Change Data Capture, a technique for capturing changes to data. CDC technology has existed for a long time, and many CDC solutions have emerged in the industry over the years. In principle, they fall into two categories:
- Query-based CDC, such as DataX. As business scenarios demand lower and lower latency, the shortcomings of this approach are becoming more prominent: offline scheduling and batch processing lead to high latency; data is sliced according to the offline schedule, so consistency cannot be guaranteed; and real-time delivery cannot be guaranteed either.
- Log-based CDC, such as Debezium, Canal, and Flink CDC. This approach consumes database logs in real time, and its stream-processing model guarantees data consistency while delivering data in real time, meeting today's increasingly real-time business needs.
The above picture compares common open source CDC solutions. Flink CDC's mechanism and its behavior in incremental synchronization, resumable reads, and full synchronization all stand out, and it supports unified full-plus-incremental synchronization, which many other open source solutions do not. Flink CDC has a distributed architecture, so it can handle business scenarios involving massive data synchronization. Relying on Flink's ecosystem, it provides both the DataStream API and the SQL API, which offer very powerful transformation capabilities. In addition, the open source ecosystems of the Flink CDC and Flink communities are very mature, attracting many users and companies to develop and build together in the community.
Flink CDC supports unified full-plus-incremental synchronization, providing users with a real-time consistent snapshot. For example, a table contains full historical data as well as newly arriving real-time changes, with the incremental data continuously written to the binlog. Flink CDC first synchronizes the full historical data and then seamlessly switches to synchronizing the incremental data. During incremental synchronization, newly inserted rows (the blue blocks in the figure above) are appended to the real-time consistent snapshot, while updated rows (the yellow blocks in the figure above) update the corresponding historical data within the existing snapshot.
Flink CDC is thus equivalent to a real-time materialized view: it gives users a real-time consistent snapshot of the tables in the database, on which the data can be further processed, for example cleaned, aggregated, or filtered, and then written downstream.
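To make this concrete, here is a minimal Flink SQL sketch of the pattern (not from the talk; the hostname, credentials, and the `mydb.orders` schema are illustrative assumptions). The mysql-cdc connector reads the full snapshot first and then switches to the binlog automatically:

```sql
-- Hypothetical CDC source; connection values and schema are assumptions.
CREATE TABLE orders (
  id INT,
  product_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',   -- full read first, then seamless switch to binlog
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'mydb',
  'table-name' = 'orders'
);

-- Further processing on the real-time consistent snapshot, e.g. an aggregation;
-- the result is continuously updated as inserts and updates arrive.
SELECT product_id, SUM(amount) AS total_amount
FROM orders
GROUP BY product_id;
```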
2. Pain points of traditional data integration solutions
The above picture shows the traditional data warehousing architecture 1.0: DataX or Sqoop performs full synchronization to HDFS, and a data warehouse is then built around Hive.
This solution has many defects. It easily affects business stability, because data must be queried from the business tables every day; daily output means poor timeliness and high latency; if the scheduling interval is shortened to a few minutes, the source database comes under heavy pressure; and scalability is poor, with performance bottlenecks appearing once the business scale grows.
The above picture shows the traditional data warehouse 2.0 architecture, which is split into a real-time link and an offline link. The real-time link performs incremental synchronization, for example via Canal into Kafka, and then streams the data back in real time. Full synchronization is generally performed only once, and the daily increments are periodically merged on HDFS before finally being imported into the Hive data warehouse.
Because full synchronization happens only once, this approach basically does not affect business stability. However, the increments are merged back only periodically, generally at hour or day granularity, so timeliness is still low. Meanwhile, the full and incremental paths are two separate links, which means more pipeline stages and more components to maintain, so the system's maintainability is relatively poor.
The above picture shows the traditional CDC ETL analysis architecture: CDC data is collected with tools such as Debezium and Canal and written to a message queue, a compute engine then cleans and processes it, and the results finally land in downstream storage, completing the construction of a real-time data warehouse or data lake.
Traditional CDC ETL analysis introduces many components: Debezium or Canal must be deployed and maintained, along with a Kafka message queue cluster. Debezium does support both full and incremental reads, but its single-concurrency model cannot handle massive-data scenarios well; Canal can only read increments and needs DataX or Sqoop for the full read, which means two separate links and yet more components to maintain. The pain points of traditional CDC ETL analysis are therefore poor single-concurrency performance, the split between full and incremental paths, and the many dependent components.
3. Real-time synchronization and conversion of massive data based on Flink CDC
What improvements can Flink CDC's solution bring to real-time synchronization and conversion of massive data?
Flink CDC 2.0 implemented the incremental snapshot read algorithm for the MySQL CDC connector. In the latest version, 2.2, the Flink CDC community abstracted the incremental snapshot algorithm into a framework so that other data sources can reuse it.
The incremental snapshot algorithm solves several pain points of unified full-plus-incremental synchronization. For example, early versions of Debezium used locks when implementing it, ran a single-concurrency model, and redid all work on failure, with no resumable reads in the full phase. The incremental snapshot algorithm is lock-free, which is very friendly to business databases; it supports concurrent reads, which solves the massive-data problem; and it supports resuming from checkpoints, avoiding redo on failure, which greatly improves synchronization efficiency and the user experience.
The above picture shows the unified full-plus-incremental framework. In short, the tables in the database are split into chunks by primary key (PK) or unique key (UK), and the chunks are assigned to multiple tasks for parallel reading, so the full phase is read in parallel. The switch from full to incremental happens automatically, with a lock-free algorithm guaranteeing a consistent switch-over. After entering the incremental phase, only a single task is needed to parse the incremental data, completing the unified full-plus-incremental read. Once in the incremental phase, the user can modify the job and release the resources it no longer needs.
We compared the unified full-plus-incremental framework with Debezium 1.6 in a simple TPC-DS read test on the customer table, with 65 million rows in a single table. With a parallelism of 8, Flink CDC's throughput was 6.8 times higher and the read took only 13 minutes. Thanks to concurrent reading, users who need faster reads can simply increase the parallelism.
Flink CDC's design also takes storage-friendly writes into account. In Flink CDC 1.x, achieving exactly-once synchronization required cooperating with Flink's checkpoint mechanism, and without slicing in the full phase, an entire table had to be completed within a single checkpoint. This caused a problem: the full data of the table was emitted to the downstream writer within that one checkpoint, so the writer buffered the table's full data in memory, putting great pressure on its memory and making the job very unstable.
With the incremental snapshot algorithm introduced in Flink CDC 2.0, slicing reduces the checkpoint granularity to individual chunks. The chunk size is user-configurable, with a default of 8096 rows; users can tune it smaller to reduce pressure on the writer, lower memory usage, and improve the stability of writes to downstream storage.
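For reference, a sketch of how these knobs appear in the mysql-cdc connector's WITH clause (connection values and the table schema are assumptions for illustration):

```sql
CREATE TABLE orders (
  id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'mydb',
  'table-name' = 'orders',
  -- one server id per parallel reader; the range size bounds the source parallelism
  'server-id' = '5400-5407',
  -- smaller chunks mean finer-grained checkpoints and less writer memory pressure
  'scan.incremental.snapshot.chunk.size' = '4096'
);
```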
With full and incremental reads unified, Flink CDC's lake-ingestion architecture becomes very simple and does not affect business stability; it achieves minute-level freshness, enabling near-real-time or even real-time analysis; concurrent reading delivers higher throughput and performs well in massive-data scenarios; and the link is short, with few components and friendly operations and maintenance.
With Flink CDC, the pain points of traditional CDC ETL analysis are also greatly alleviated: components such as Canal and the Kafka message queue are no longer needed, and relying on Flink alone achieves unified full-plus-incremental synchronization and real-time ETL processing, with support for concurrent reading. The whole architecture has a short link, few components, and is easy to maintain.
Relying on the Flink DataStream API and the easy-to-use SQL API, Flink CDC also provides powerful and complete transformation capabilities, and it can guarantee changelog semantics throughout the transformation. In traditional solutions, performing transformations on a changelog while preserving changelog semantics is very difficult.
<p style="text-align:center">Real-time synchronization and transformation of massive data Example 1: Flink CDC realizes the integration of heterogeneous data sources</p>
In this business scenario, business tables such as the product table and the order table live in a MySQL database, while the logistics table lives in a PostgreSQL database. The goal is to integrate these heterogeneous data sources and widen the tables along the way: a streaming join of the product, order, and logistics tables, with the result written downstream. With Flink CDC, the whole process takes only five lines of Flink SQL. The downstream storage here is Hudi, and the whole link achieves minute-level or even lower latency, making near-real-time analysis on Hudi possible.
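The exact statements are on the slides; as a hedged reconstruction, the "five lines" are five statements along these lines (all hostnames, credentials, schemas, and table names are illustrative assumptions):

```sql
-- MySQL CDC sources for the product and order tables.
CREATE TABLE products (
  product_id INT,
  product_name STRING,
  PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc', 'hostname' = 'mysql-host', 'port' = '3306',
  'username' = 'flinkuser', 'password' = 'flinkpw',
  'database-name' = 'mydb', 'table-name' = 'products'
);

CREATE TABLE orders (
  order_id INT,
  product_id INT,
  order_date TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc', 'hostname' = 'mysql-host', 'port' = '3306',
  'username' = 'flinkuser', 'password' = 'flinkpw',
  'database-name' = 'mydb', 'table-name' = 'orders'
);

-- PostgreSQL CDC source for the logistics table.
CREATE TABLE logistics (
  order_id INT,
  shipment_status STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'postgres-cdc', 'hostname' = 'pg-host', 'port' = '5432',
  'username' = 'flinkuser', 'password' = 'flinkpw',
  'database-name' = 'mydb', 'schema-name' = 'public', 'table-name' = 'logistics'
);

-- Hudi result table for the widened rows.
CREATE TABLE enriched_orders (
  order_id INT,
  order_date TIMESTAMP(3),
  product_name STRING,
  shipment_status STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/enriched_orders',  -- assumed path
  'table.type' = 'MERGE_ON_READ'
);

-- The streaming join that widens the tables and continuously writes to Hudi.
INSERT INTO enriched_orders
SELECT o.order_id, o.order_date, p.product_name, l.shipment_status
FROM orders AS o
LEFT JOIN products AS p ON o.product_id = p.product_id
LEFT JOIN logistics AS l ON o.order_id = l.order_id;
```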
<p style="text-align:center">Example 2 of real-time synchronization and conversion of massive data: Flink CDC realizes sub-database sub-table integration</p>
Flink CDC has very complete support for sharded databases and tables. When declaring a CDC table, regular expressions can be used to match database names and table names, so a single declaration can cover multiple databases and the multiple tables within them. It also provides metadata columns that reveal which database and table each row comes from. When writing to downstream Hudi, the two declared metadata columns, database_name and table_name, together with the primary key of the original tables (the id column in the example), form the new composite primary key. Only three lines of Flink SQL are needed to integrate sharded data in real time, which is very simple.
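A hedged sketch of those three statements, assuming shard databases named like db_1, db_2 with tables user_1, user_2 (names and schemas are assumptions):

```sql
-- One CDC source covering all shards via regular expressions, with metadata columns.
CREATE TABLE user_source (
  database_name STRING METADATA VIRTUAL,  -- which shard database the row came from
  table_name STRING METADATA VIRTUAL,     -- which shard table the row came from
  id DECIMAL(20, 0) NOT NULL,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost', 'port' = '3306',
  'username' = 'flinkuser', 'password' = 'flinkpw',
  'database-name' = 'db_[0-9]+',  -- regex matching all shard databases
  'table-name' = 'user_[0-9]+'    -- regex matching all shard tables
);

-- Hudi sink keyed by (database_name, table_name, id) so rows from different
-- shards never collide.
CREATE TABLE all_users_sink (
  database_name STRING,
  table_name STRING,
  id DECIMAL(20, 0) NOT NULL,
  name STRING,
  PRIMARY KEY (database_name, table_name, id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/all_users',  -- assumed path
  'table.type' = 'MERGE_ON_READ'
);

INSERT INTO all_users_sink SELECT * FROM user_source;
```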
Relying on Flink's rich connector ecosystem, many upstream and downstream systems can be connected. With Flink CDC added on top, the upstream gains richer sources to ingest from, and the downstream gains richer destinations to write to.
<p style="text-align:center">Real-time synchronization and conversion of massive data Example 3: Real-time ranking of cumulative sales of a single product with three lines of SQL</p>
This demo shows real-time product ranking with three lines of SQL and no other dependencies. First start MySQL and Elasticsearch images in Docker; Elasticsearch is the destination. With Docker up, download the Flink distribution and the two SQL connector JARs, MySQL CDC and Elasticsearch, then start the Flink cluster and the SQL Client. Next, create an order table in the MySQL database and insert data; Flink SQL processes and analyzes the updates in real time and writes the results to Elasticsearch.
The first line of SQL in the figure above creates the order table, the second creates the result table, and the third runs a GROUP BY query that computes the real-time ranking and writes it into the Elasticsearch table created by the second line.
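A hedged sketch of the three statements (schemas and names are assumptions, not the exact demo SQL):

```sql
-- Line 1: the MySQL CDC order table.
CREATE TABLE orders (
  order_id INT,
  product_name STRING,
  sales DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost', 'port' = '3306',
  'username' = 'flinkuser', 'password' = 'flinkpw',
  'database-name' = 'mydb', 'table-name' = 'orders'
);

-- Line 2: the Elasticsearch result table.
CREATE TABLE product_ranking (
  product_name STRING,
  total_sales DECIMAL(10, 2),
  PRIMARY KEY (product_name) NOT ENFORCED
) WITH (
  'connector' = 'elasticsearch-7',
  'hosts' = 'http://localhost:9200',
  'index' = 'product_ranking'
);

-- Line 3: the GROUP BY query that keeps the ranking up to date as orders change.
INSERT INTO product_ranking
SELECT product_name, SUM(sales) AS total_sales
FROM orders
GROUP BY product_name;
```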
We built a visualization on top of Elasticsearch, and you can see that as the orders in MySQL are continuously updated, the ranking in Elasticsearch refreshes in real time.
4. Flink CDC community development
Over the past year or so, the community has released four major versions; the numbers of contributors and commits keep growing, and the community is becoming more and more active. We have always insisted on providing all core features in the community version, such as support for MySQL tables with tens of billions of rows, the incremental snapshot framework, and advanced features like dynamically adding tables for MySQL.
The latest 2.2 release also adds many new features. On the data source side, OceanBase, PolarDB-X, SQL Server, and TiDB are now supported. The Flink CDC ecosystem keeps growing as well: it is compatible with Flink 1.13 and 1.14 clusters and provides the incremental snapshot read framework. In addition, it supports dynamically adding tables for MySQL CDC and improves MongoDB support, for example allowing the collections to capture to be specified with regular expressions, which is more flexible and friendly.
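As an aside, a sketch of what the MongoDB improvement looks like in SQL (hosts, credentials, and names are assumptions):

```sql
CREATE TABLE mongo_orders (
  _id STRING,
  amount DECIMAL(10, 2),
  PRIMARY KEY (_id) NOT ENFORCED
) WITH (
  'connector' = 'mongodb-cdc',
  'hosts' = 'localhost:27017',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database' = 'mydb',
  'collection' = 'orders_.*'  -- since 2.2, a regular expression can match collections
);
```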
Beyond that, documentation is a particularly important part of the community. We provide an independent, versioned documentation website, where each site version corresponds to a documentation version, along with rich demos and FAQs in both Chinese and English to help newcomers get started quickly.
On several key community metrics, such as the number of issues created, the number of PRs merged, and the number of GitHub stars, the Flink CDC community has performed very well.
The future planning of the Flink CDC community mainly includes the following three aspects:
- Improving the framework: the incremental snapshot framework currently supports only MySQL CDC; Oracle, PG, and MongoDB are being onboarded, and we hope that in the future all databases can move onto this better framework. We have also done exploratory work on schema evolution and whole-database synchronization, which will be made available to the community once mature.
- Ecosystem integration: support more databases and more versions; make the links into data lakes smoother; and provide end-to-end solutions so that users need not care about the parameters of either Hudi or Flink CDC.
- Ease of use: provide more out-of-the-box experiences and complete documentation and tutorials.
Q&A
Q: When will CDC support whole-database synchronization and DDL synchronization?
A: It is being designed. It needs support and cooperation from the Flink engine side, so it cannot be developed in the Flink CDC community alone and requires joint work with the Flink community.
Q: When will Flink 1.15 be supported?
A: Most production Flink clusters are still on 1.13 and 1.14. The community plans to support Flink 1.15 in version 2.3; you can follow the issue at https://github.com/ververica/flink-cdc-connectors/issues/1363, and contributions are also welcome.
Q: Is there any practice of writing CDC result tables to Oracle?
A: Flink 1.14 does not support this yet, because the sink-side JDBC connector lacks an Oracle dialect. The JDBC connector in Flink 1.15 already supports the Oracle dialect, so a 1.15 cluster can do it.
Q: Will the next version support reading ES?
A: We still need to examine whether Elasticsearch has a transaction-log-like mechanism and whether it is suitable as a data source for CDC.
Q: Can a single job monitor multiple tables and sink to multiple tables?
A: Yes, a single job can monitor multiple tables and sink to multiple downstream tables. However, to sink to multiple tables, the DataStream must be split, with the different streams written to different tables.
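In SQL jobs, a comparable effect can be achieved with a statement set, which bundles several INSERTs into one job; a minimal sketch (the sink tables and filter values are assumptions, and this is the SQL analogue rather than the DataStream splitting mentioned above):

```sql
-- Assumes the user_source table from Example 2 and two pre-created sink tables.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO sink_a SELECT id, name FROM user_source WHERE table_name = 'user_1';
  INSERT INTO sink_b SELECT id, name FROM user_source WHERE table_name = 'user_2';
END;
```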
Q: The binlog only retains the data of the last two months. Can Flink CDC read the full data first and then the increment?
A: Yes, full first and then incremental is the default behavior. Binlogs are generally retained for about seven days, or sometimes only two or three days.
Q: In version 2.2, how is a MySQL table without a primary key synchronized in full?
A: You can fall back to not using the incremental snapshot framework. For the incremental snapshot framework itself, the community already has an issue tracking support for tables without primary keys, and it is expected to land in community version 2.3.