This article is compiled from a talk given by Gong Zhongqiang, head of Dajian Cloud Warehouse and a Flink CDC Maintainer, at the Flink CDC Meetup on May 21. The main contents include:
- The background of introducing Flink CDC
- Today's internal business scenarios
- Future internal promotion and platform construction
- Community collaboration
1. Background of the introduction of Flink CDC
The company introduced CDC technology mainly based on the needs of the following four roles:
- Logistics scientists: need inventory, sales-order, and logistics-bill data for analysis.
- Developers: need to synchronize basic information from other business systems.
- Finance: wants financial data delivered to the financial system in real time, rather than only becoming visible at the end of the month.
- Boss: needs a big-data dashboard to view the company's business and operations on one large screen.
CDC (Change Data Capture) is a technology for capturing data changes. Broadly speaking, any technique that can capture data changes can be called CDC, but the term usually refers to capturing changes in databases.
There are two main ways to implement CDC, query-based and log-based:
- Query-based: periodically poll the database for inserted and updated rows; no special database configuration or account permissions are required. Its real-time performance is determined by the query frequency and can only be improved by polling more often, which inevitably puts heavy pressure on the DB. Also, because it is query-based, it cannot capture changes that happen between two polls, so it cannot guarantee data consistency.
- Log-based: implemented by consuming the database's change log in real time, so latency is low and the impact on the DB is small. It can also guarantee data consistency, because the database records every change in the change log; by consuming the log you can see the full change history of the data. Its disadvantage is implementation complexity: each database implements its change log differently, with different formats, ways of enabling it, and required permissions, so adapter development is needed for each database.
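To make the query-based limitation concrete, here is a minimal, self-contained sketch (illustrative only, not tied to any real database): a poller that fetches rows changed since the last watermark. A row updated twice between two polls surfaces only its final state, so the intermediate change is lost:

```java
import java.util.*;

public class PollingCdcSketch {
    // One "row": the value plus the logical time it was last updated.
    record Row(String value, long updatedAt) {}

    static final Map<Integer, Row> table = new HashMap<>();

    // The poller: fetch every row changed after the last watermark,
    // mimicking "SELECT ... WHERE updated_at > ?" in query-based CDC.
    static List<Row> poll(long lastWatermark) {
        List<Row> changed = new ArrayList<>();
        for (Row r : table.values()) {
            if (r.updatedAt() > lastWatermark) changed.add(r);
        }
        return changed;
    }

    public static void main(String[] args) {
        long watermark = 0;
        table.put(1, new Row("A", 1)); // insert at t=1
        table.put(1, new Row("B", 2)); // update at t=2, before any poll ran
        List<Row> changes = poll(watermark); // first poll happens at t=3
        // Only the latest state "B" is visible; the value "A" was never captured.
        System.out.println(changes.size() + " change(s), latest value = " + changes.get(0).value());
    }
}
```

A log-based reader would instead see both the insert of "A" and the update to "B" as separate change-log entries.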
Just as Flink declares that "real-time is the future", real time is an urgent problem to solve in today's context. We therefore compared the mainstream log-based CDC technologies, as shown in the figure above:
- Data sources: Flink CDC supports not only traditional relational databases but also document databases, NewSQL systems (TiDB, OceanBase), and other popular databases; Debezium's database coverage is somewhat less extensive, but all mainstream relational databases are well supported; Canal and OGG each support only a single data source.
- Resume from a breakpoint: supported by all four technologies.
- Synchronization mode: except for Canal, which supports incremental only, all the others support full + incremental. Full + incremental means the CDC technology itself handles the switch from full load to incremental reading on first launch, so there is no need to manually stitch the data together with a separate full-load job and an incremental job.
- Activity: Flink CDC has a very active community with rich materials, and the project provides detailed documentation and quick-start tutorials; the Debezium community is also quite active, but most materials are in English; Canal has a very large user base and plenty of materials, but its community activity is average; OGG is part of Oracle's big data suite, requires payment, and has only official documentation.
- Development difficulty: Flink CDC offers two development modes, Flink SQL and the Flink DataStream API. With Flink SQL in particular, a data synchronization task can be completed with very simple SQL, making development especially easy; with Debezium and Canal, you must parse and process the captured change logs yourself.
- Runtime dependencies: Flink CDC uses Flink as its engine; Debezium usually runs inside Kafka Connect; Canal and OGG run standalone.
- Downstream richness: Flink CDC benefits from Flink's very active and rich ecosystem, opening up a wide range of downstream systems, with good support for ordinary relational databases and big data storage engines such as Iceberg, ClickHouse, and Hudi; Debezium, via the Kafka JDBC connector, supports MySQL, Oracle, and SqlServer; Canal can only be consumed directly or output to MQ for downstream consumption; OGG is an official suite with poor downstream richness.
2. Current internal business scenarios
- Before 2018, Dajian Cloud Warehouse synchronized data between systems through multiple data applications running on schedules.
- After 2020, with the rapid growth of cross-border business, the multi-data-source applications frequently saturated the DB, affecting online applications, and managing the execution order of the scheduled tasks became chaotic.
- Therefore, in 2021 we investigated and selected a CDC technology, built a small test scenario, and ran small-scale tests.
- In 2022, the Flink CDC-based synchronization of the LDSS system's inventory scenario went live.
- Going forward, we hope to build a data synchronization platform on Flink CDC, completing the development, testing, and launch of synchronization tasks through UI-based development and configuration, and managing the entire life cycle of synchronization tasks online.
There are four main business scenarios for LDSS inventory management:
- Warehousing department: needs reasonable warehouse capacity and commodity-category distribution. For capacity, some buffer must be reserved to prevent a sudden surge of inbound orders from overflowing the warehouse; for categories, an unreasonable distribution of seasonal commodity inventory leads to hotspot problems and brings great challenges to warehouse management.
- Platform customers: hope that orders are processed promptly and goods are delivered quickly and accurately.
- Logistics department: hopes to improve logistics efficiency, reduce logistics costs, and make efficient use of limited capacity.
- Decision-making department: hopes the LDSS system can provide scientific advice on when and where to build new warehouses.
The above picture shows the architecture diagram of the LDSS inventory management sub-scenario.
First, a multi-data-source synchronization application pulls data from the warehousing system, the platform system, and the internal ERP system, extracting the required data into the LDSS system's database to support the business functions of its three modules: orders, inventory, and logistics.
Second, effective order-splitting decisions require product, order, and warehouse information. The multi-data-source scheduled synchronization task is based on JDBC queries, filters by time, and synchronizes the changed data into the LDSS system. Based on this data, the LDSS system makes order-allocation decisions to obtain the optimal solution.
The code for scheduled-task synchronization first needs to define the timed task: its class, execution method, and execution interval.
The left side of the figure above shows the timed-task definition, and the right side shows its logic. It first queries the Oracle database and then upserts into the MySQL database, which completes the scheduled task. Here the data is pushed into the corresponding database tables one by one in a near-native JDBC query style; the development logic is very cumbersome and prone to bugs.
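The pattern just described can be sketched with plain JDK classes (the table names and data are hypothetical stand-ins for the Oracle source and MySQL target; a real task would issue JDBC queries instead of touching in-memory maps):

```java
import java.util.*;
import java.util.concurrent.*;

public class ScheduledSyncSketch {
    // Stand-ins for the Oracle source table and the MySQL target table.
    static final Map<Integer, String> source = new HashMap<>(Map.of(1, "order-1", 2, "order-2"));
    static final Map<Integer, String> target = new ConcurrentHashMap<>();

    // The sync logic: query the source, then upsert each row into the target.
    static void syncOnce() {
        for (Map.Entry<Integer, String> row : source.entrySet()) {
            target.put(row.getKey(), row.getValue()); // upsert: insert or overwrite
        }
    }

    public static void main(String[] args) throws Exception {
        // Define the timed task: execution method and execution interval.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        ScheduledFuture<?> job =
            scheduler.scheduleAtFixedRate(ScheduledSyncSketch::syncOnce, 0, 10, TimeUnit.MILLISECONDS);
        Thread.sleep(50);        // let the task fire a few times
        job.cancel(false);
        scheduler.shutdown();
        System.out.println("target rows = " + target.size());
    }
}
```

Even in this toy form, the scheduling, query, and upsert logic all live in hand-written code, which is exactly the maintenance burden the Flink CDC rewrite removes.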
Therefore, we modified it based on Flink CDC.
The above picture shows the real-time synchronization scenario based on Flink CDC. The only change is to replace the previous multi-data source synchronization application with Flink CDC.
First, SqlServer CDC, MySQL CDC, and Oracle CDC connect to the databases of the warehousing platform and the ERP system and extract the corresponding table data, converting these heterogeneous sources into Flink's unified internal types; the data is then written into the LDSS system's MySQL database through the JDBC connector provided by Flink.
Compared with the previous architecture, this architecture is not intrusive to the business system, and the implementation is simpler.
We introduced MySQL CDC and SqlServer CDC to connect the MySQL database of the B2B platform and the SqlServer database of the warehousing system respectively, and then write the extracted data to the MySQL database of the LDSS system through the JDBC Connector.
After this transformation, thanks to the real-time capability Flink CDC provides, there is no longer any need to manage complicated scheduled tasks.
The implementation based on Flink CDC synchronization code is divided into the following three steps:
- The first step is to define the source table - the table that needs to be synchronized;
- The second step is to define the target table - the target table to which data needs to be written;
- The third step is to complete the CDC synchronization task with an INSERT INTO ... SELECT statement.
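The three steps above can be sketched in Flink SQL as follows. This is only an illustration: the hostnames, credentials, and table/column names are placeholders, and the connector options follow the flink-cdc-connectors and Flink JDBC connector documentation:

```sql
-- Step 1: source table, read via MySQL CDC
CREATE TABLE inventory_source (
  id BIGINT,
  sku STRING,
  quantity INT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'source-host',
  'port' = '3306',
  'username' = 'flink_user',
  'password' = '***',
  'database-name' = 'warehouse',
  'table-name' = 'inventory'
);

-- Step 2: target table, written via the JDBC connector
CREATE TABLE inventory_target (
  id BIGINT,
  sku STRING,
  quantity INT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://ldss-host:3306/ldss',
  'table-name' = 'inventory',
  'username' = 'ldss_user',
  'password' = '***'
);

-- Step 3: the synchronization task itself
INSERT INTO inventory_target SELECT * FROM inventory_source;
```

Declaring a primary key on the JDBC sink makes the writes upserts, so updates and deletes captured upstream are applied correctly downstream.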
This development mode is very simple and the logic is clear. In addition, by relying on Flink CDC and the Flink architecture, the synchronization tasks also gain features such as failure retry, distributed execution, high availability, and consistent full-to-incremental switching.
3. Future internal promotion and platform construction
The above figure shows the platform architecture diagram.
The source on the left is the source provided by Flink CDC + Flink, which can extract data from the rich source and write it to the target through the development on the data platform. The target side relies on the powerful ecology of Flink, which can well support data lakes, relational databases, MQ, etc.
Flink currently has two common deployment modes: Flink on YARN, which is popular in China, and Flink on Kubernetes. The data platform in the middle manages the Flink clusters below it and supports SQL online development, task development, lineage management, task submission, online notebook development, permissions and configuration, as well as task performance monitoring and alerting, providing good management of the clusters underneath.
The demand for data synchronization is particularly strong within the company, and the platform needs to be used to improve development efficiency and speed up delivery. And after platformization, the company's internal data synchronization technology can be unified, the synchronization technology stack can be consolidated, and maintenance costs can be reduced.
The goals of platformization are as follows:
- Can manage metadata such as data sources and tables well;
- The entire life cycle of the task can be completed on the platform;
- Realize task performance observation and alarm;
- Simplify development and get started quickly. Business developers can start developing synchronization tasks after simple training.
Platformization can bring benefits in the following three aspects:
- Collect data synchronization tasks and manage them in a unified manner;
- The platform manages and maintains the full life cycle of synchronization tasks;
- A dedicated team is responsible, and the team can focus on cutting-edge data integration technologies.
With the platform, more business scenarios can be quickly applied.
- Real-time data warehouse: we hope to support more real-time data warehouse scenarios through Flink CDC, using Flink's powerful computing capability to build materialized views over the database. Offloading computation from the DB into Flink and writing the results back to the database accelerates real-time scenarios such as reporting, statistics, and analysis for platform applications.
- Real-time applications: Flink CDC can capture changes at the DB layer, so search-engine content can be updated in real time and financial and accounting data can be pushed to the financial system in real time. Today, most financial data is produced by business systems running scheduled tasks with heavy join, aggregation, and grouping operations before being pushed to the financial system. With Flink CDC's powerful data capture and Flink's computing capabilities, this data can instead be pushed to the accounting and financial systems in real time, so business problems are discovered promptly and the company's losses are reduced.
- Cache : Through Flink CDC, it is possible to build a real-time cache that is separated from traditional applications, which greatly improves the performance of online applications.
With the help of the platform, I believe that Flink CDC can better release its capabilities within the company.
4. Community cooperation
We hope to collaborate with the community in diverse ways to improve both our code quality and the company's ability to work with open source. Community cooperation will mainly proceed along three lines:
- First, open source co-construction. We hope for more opportunities to share our experience implementing Flink CDC and our access scenarios with peers, and to run internal training on Flink CDC so that everyone understands the technology and can apply it in real work to solve more business pain points.
At present, the cooperation between the company and the community has already produced results: the company contributed the SqlServer CDC connector to the community and co-developed the TiDB CDC connector.
- Second, serving the community. We want to cultivate the department's open source collaboration skills and contribute features from our internal version back to the community; only after being polished by the community's many users do features become stable and well designed. We also hope to work closely with the community on schema evolution, performance tuning, and whole-database synchronization.
- Third, exploring directions. We believe Flink CDC will not be satisfied with its current achievements and will keep moving toward further goals, so we hope to explore more possible directions for Flink CDC together with the community.
The recent cooperation between the company and the community is to contribute the features of SqlServer CDC based on the concurrent lock-free framework to the community.
The above figure shows the principle of SqlServer CDC.
First, SqlServer records data changes in the transaction log. The capture process matches log entries against the tables that have CDC enabled, converts the matched entries, and inserts them into the change tables generated by CDC. Finally, SqlServer CDC calls the CDC query functions to obtain the insert, update, delete, and DDL operations in real time, and converts them into Flink's internal operation type and row data for computation, lake ingestion, and other operations.
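As background for this pipeline, CDC must first be enabled on the SqlServer database and on each table to be captured. A sketch using SQL Server's documented system procedures (the schema and table names are placeholders):

```sql
-- Enable CDC at the database level (run inside the target database)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for one table; this starts the capture job and creates the change table
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'inventory',  -- placeholder table name
    @role_name     = NULL;          -- NULL: no gating role required to read changes

-- The change table can then be read through the generated CDC query functions,
-- e.g. cdc.fn_cdc_get_all_changes_dbo_inventory(@from_lsn, @to_lsn, N'all')
```

This matches the Q&A below: the connector builds on the database's own CDC feature rather than parsing the transaction log directly.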
After community users tried the then-current version of SqlServer CDC, the main problems they reported were:
- The table is locked during the snapshot: locking is unbearable for both the DBA and online applications; the DBA cannot accept the database being locked, and the lock also affects online traffic.
- Checkpoints cannot be taken during the snapshot: this means that if the snapshot fails, it can only be restarted from the beginning, which is very unfriendly for large tables.
- The snapshot supports only single concurrency: with a single thread, tables of tens or hundreds of millions of rows take ten or even dozens of hours to synchronize, which greatly restricts SqlServer CDC's application scenarios.
We improved on these problems in practice, optimizing SqlServer CDC by following the concurrent lock-free algorithm of the community's MySQL CDC 2.0. The snapshot now runs lock-free while still producing a consistent snapshot, supports checkpoints, and supports concurrency to speed the snapshot up; for large-table synchronization, the concurrency advantage is especially obvious.
However, since version 2.2 the community has abstracted MySQL CDC's concurrent lock-free design into a unified public framework, so SqlServer CDC needs to be re-adapted to this general framework before it can be contributed back.
Q&A
Q: Do I need to enable SqlServer's own CDC?
A: Yes, the function of SqlServer CDC is based on the CDC feature of the SqlServer database itself.
Q: How is the materialized view refreshed — is it triggered by a scheduled task?
A: The SQL that defines the materialized view runs continuously in Flink via Flink CDC; changes to the original tables trigger the computation, and the result is synchronized into the materialized-view table.
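As an illustration of this answer (the table and column names are hypothetical), such a "materialized view" is simply a continuously running aggregation whose result is upserted back into a database table:

```sql
-- Continuously maintain per-SKU order totals: every change captured by CDC
-- on the orders table re-triggers the aggregation, and the new result is
-- upserted into the sink keyed by sku.
INSERT INTO sku_order_totals   -- a JDBC sink table with PRIMARY KEY (sku) NOT ENFORCED
SELECT sku, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
FROM orders_cdc                -- a mysql-cdc source table
GROUP BY sku;
```

No scheduled refresh is needed; the change stream itself drives the recomputation.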
Q: How is platformization done?
A: Platformization draws on many open source projects in the community and excellent open source platforms, such as StreamX and DLink.
Q: Does SqlServer CDC have a bottleneck when consuming transaction logs?
A: SqlServer CDC does not consume the log directly. The principle is that the SqlServer capture process determines which tables in the log have CDC enabled, extracts those tables' change data from the log, and inserts it into the change tables; the data changes are then obtained through the CDC query functions that the database generates once CDC is enabled.
Q: How is Flink CDC's high availability guaranteed, and how should a very large or dense set of synchronization tasks be handled?
A: Flink's high availability relies on Flink features such as checkpoints. When synchronization tasks are numerous or dense, it is recommended to run multiple downstream Flink clusters, classify tasks by their real-time requirements, and publish each task to the corresponding cluster.
Q: Do I need Kafka in the middle?
A: It depends on whether the synchronization task or the data warehouse architecture needs to land the intermediate data in Kafka.
Q: There are multiple tables in a database, can they be run in one task?
A: It depends on how the job is developed. With SQL development, writing multiple tables currently requires multiple tasks. But Flink CDC also provides another, more flexible development mode, the DataStream API, which can run multiple tables in one task.
Q: Does Flink CDC support reading logs from Oracle slave libraries?
A: It is not yet possible.
Q: How to monitor and compare the data quality of the two terminals after synchronization through CDC?
A: Currently, data quality can only be checked through regular sampling. Data quality has always been a difficult issue in the industry.
Q: What scheduling system does Dajian Cloud Warehouse use? How does the system integrate with Flink CDC?
A: We use XXL Job for distributed task scheduling; the CDC tasks do not use timed scheduling.
Q: Does SqlServer CDC need to be restarted if adding and deleting tables are collected?
A: SqlServer CDC currently does not support the function of dynamically adding tables.
Q: Will synchronization tasks affect system performance?
A: Synchronization based on CDC will certainly affect system performance; in particular, the snapshot process puts load on the database, which in turn affects the application systems. Going forward, the community will add rate limiting and implement concurrent lock-free reading for all connectors, to broaden CDC's application scenarios and ease of use.
Q: How to deal with full and incremental savepoints?
A: For connectors not yet implemented on the concurrent lock-free framework, a savepoint cannot be triggered during the full phase. During the incremental phase, if you need to stop and republish, you can restore the task from a savepoint.
Q: CDC synchronizes data to Kafka, and Kafka stores Binlog, how to save historical data and real-time data?
A: All the data synchronized by CDC is written to Kafka; how much history is retained depends on the Kafka log's retention and cleanup policy.
Q: Will CDC filter the log operation types of Binlog? Will it affect efficiency?
A: Even if there is a filtering operation, it has little impact on performance.
Q: During the initial snapshot stage of CDC reading MySQL, if multiple programs read different tables, the program will report an error and cannot obtain the permission to lock the table. What is the reason?
A: It is recommended to check whether MySQL CDC is implemented in the old way. You can try the new version of the concurrent lock-free implementation.
Q: How to connect the full amount and increment of MySQL's hundreds of millions of tables?
A: It is recommended to read Xue Jin's blog posts on version 2.0, which explain very simply and clearly how a concurrent lock-free consistent snapshot is achieved and how the full-to-incremental switch is completed.