Flink CDC 2.1 is officially released: more stable and more powerful
Author: Xu Bangjiang (Xuejin)
The following video, shared by Jark Wu (云邪), covers the past and present of Flink CDC:
https://www.bilibili.com/video/BV1jT4y1R7ir/
Preface
CDC (Change Data Capture) is a technology for capturing change data from databases. Flink has natively supported processing CDC data (changelogs) since version 1.11, and it is by now a very mature solution for processing change data.
Flink CDC Connectors is a set of source connectors for Flink and the core component of Flink CDC. These connectors read historical data and incremental change data from databases such as MySQL, PostgreSQL, Oracle, and MongoDB. Flink CDC Connectors is an independent open-source project. Since it was open-sourced in July last year, the community has kept up a rapid pace of development, releasing a new version roughly every two months. Attention from the open-source community keeps growing, and more and more users are using Flink CDC to quickly build real-time data warehouses and data lakes.
In July this year, Flink CDC maintainer Xu Bangjiang (Xuejin) presented the design of Flink CDC 2.0 for the first time at the Flink Meetup in Beijing. In August, the Flink CDC community released version 2.0, which solved many pain points in production practice, and the community's user base grew rapidly.
Alongside the rapidly expanding user base, the number of developers in the community is also growing quickly. Developers from many companies at home and abroad have joined the open-source collaboration of the Flink CDC community, including developers from Cloudera in North America and from Vinted and Ververica in Europe. Domestic developers are even more active, including developers from Internet companies such as Tencent, Alibaba, and ByteDance, as well as from startups and traditional enterprises such as XTransfer and Xinhua Winshare. In addition, many cloud vendors at home and abroad have integrated Flink CDC into their stream computing products, letting more users experience its power and convenience.
1. Overview of Flink CDC 2.1
With the joint efforts of the community's developers, the Flink CDC community is pleased to announce the official release of Flink CDC 2.1: https://github.com/ververica/flink-cdc-connectors/releases/tag/release-2.1.0
This article walks you through the major improvements and core features of Flink CDC 2.1 in about 10 minutes. Version 2.1 includes 100+ PRs contributed by 23 contributors, focusing on improving the performance and production stability of the MySQL CDC connector, and introduces the new Oracle CDC and MongoDB CDC connectors.
- MySQL CDC supports very large tables with tens of billions of rows, supports all MySQL data types, and greatly improves stability through optimizations such as connection pool reuse. It also provides a DataStream API that supports lock-free snapshotting and parallel reading, which users can leverage to build whole-database synchronization links;
- Added the Oracle CDC connector, which supports reading full historical data and incremental change data from Oracle databases;
- Added the MongoDB CDC connector, which supports reading full historical data and incremental change data from MongoDB databases;
- All connectors support metadata columns: through SQL, users can access metadata such as the database name, table name, and change time of each record, which is very useful for data integration in sharded database and table scenarios;
- Enriched the Flink CDC getting-started documentation and added end-to-end hands-on tutorials for various scenarios.
2. MySQL CDC connector improvements in detail
Flink CDC 2.0 brought the MySQL CDC connector advanced features such as the lock-free snapshot algorithm, parallel reading, and checkpoint-based resumable reads, which solved many pain points in production practice. A large number of users then adopted it and deployed it at scale. During these rollouts, we worked with users to resolve many production issues and, at the same time, developed several high-quality features that users urgently needed. The improvements to the MySQL CDC connector in version 2.1 fall into two categories: stability improvements and feature enhancements.
1. Stability improvements
Dynamic sharding algorithm for different primary key distributions
For source tables whose primary keys are non-numeric, snowflake IDs, sparse, or composite, version 2.1 dynamically analyzes how evenly the primary keys are distributed and automatically derives the chunk size from that distribution, making chunk splitting both more reasonable and faster. The dynamic sharding algorithm solves both the problem of producing far too many chunks for sparse primary keys and the problem of oversized chunks for composite primary keys, keeping the number of rows in each chunk as close as possible to the user-specified chunk size. Users can therefore control the chunk size and chunk count through the chunk size setting alone, without worrying about the primary key type; a simplified sketch of the idea follows.
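To make the idea concrete, here is a minimal, self-contained Java sketch of dynamic chunk sizing. It illustrates the principle only and is not the actual Flink CDC implementation; the class and method names are invented for this example, and 8096 is the connector's default chunk size (scan.incremental.snapshot.chunk.size).

```java
/**
 * Simplified sketch of dynamic chunk sizing (illustration only, not the
 * Flink CDC implementation): estimate how densely the primary-key space
 * is populated, then scale the per-chunk key span so that each chunk
 * holds roughly chunkSize rows.
 */
public final class DynamicChunkSizeSketch {

    static long keySpanPerChunk(long minKey, long maxKey, long approxRowCount, int chunkSize) {
        // Distribution factor: ~1 for dense auto-increment keys, much larger
        // for sparse keys (e.g. snowflake IDs), where splitting the raw key
        // range evenly would produce far too many near-empty chunks.
        double distributionFactor = (maxKey - minKey + 1.0) / Math.max(1, approxRowCount);
        return Math.max(1L, (long) (chunkSize * distributionFactor));
    }

    public static void main(String[] args) {
        // Dense key space: span per chunk is close to the chunk size (8096).
        System.out.println(keySpanPerChunk(1, 1_000_000, 1_000_000, 8096));
        // Sparse snowflake-style keys: the span widens (~32 million keys here)
        // so each chunk still holds roughly 8096 rows.
        System.out.println(keySpanPerChunk(1, 4_000_000_000L, 1_000_000, 8096));
    }
}
```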
Support for very large tables with tens of billions of rows
Previously, extremely large tables could trigger an error where the binlog split failed to be dispatched. The reason is that a very large table corresponds to a huge number of snapshot splits, and the binlog split must carry the information of all of them. When the SourceCoordinator dispatches the binlog split to a SourceReader node, the dispatch fails if the split size exceeds the maximum message size supported by the RPC framework. Tuning the RPC framework's parameters can alleviate the oversized-split problem but cannot fully solve it. In version 2.1, the snapshot split information is divided into groups: the binlog split is sent in multiple groups, one group at a time, which solves the problem completely.
Connection pooling for database connections to improve stability
Introducing a connection pool to manage database connections both reduces the number of connections to the database and avoids connection leaks in extreme scenarios.
Support for inconsistent schemas across sharded databases and tables: missing fields are automatically filled with NULL values.
2. Feature enhancements
Support for all MySQL data types
This includes complex types such as enumeration types, array types, and spatial (geographic information) types.
Support for metadata columns
In Flink DDL, users can declare metadata columns such as db_name STRING METADATA FROM 'database_name' to access metadata like the database name (database_name), table name (table_name), and change time (op_ts) of each record. This is very useful for data integration in sharded database and table scenarios, as the example below shows.
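A minimal DDL sketch, assuming a sharded MySQL setup; the table, columns, and connection properties below are placeholders. The three metadata columns expose the source database, table, and change time for each row, and the regular expressions in 'database-name' and 'table-name' merge the sharded tables into one source.

```sql
CREATE TABLE orders_source (
    db_name STRING METADATA FROM 'database_name' VIRTUAL,
    tbl_name STRING METADATA FROM 'table_name' VIRTUAL,
    op_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,
    order_id INT,
    price DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'localhost',
    'port' = '3306',
    'username' = 'flinkuser',
    'password' = 'flinkpw',
    -- regular expressions match all shards of the logical table
    'database-name' = 'mydb_.*',
    'table-name' = 'orders_.*'
);
```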
DataStream API with parallel reading
In version 2.0, features such as the lock-free algorithm and parallel reading were exposed only through the SQL API; the DataStream API did not expose them to users. Version 2.1 supports them in the DataStream API: sources are created through MySqlSourceBuilder, users can capture multiple tables at once to build whole-database synchronization links, and schema changes can also be captured via MySqlSourceBuilder#includeSchemaChanges.
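A minimal sketch of the 2.1 DataStream API usage, following the builder pattern described above; the hostname, credentials, and database/table names are placeholders to replace with your own.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class MySqlSourceExample {
    public static void main(String[] args) throws Exception {
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("yourHostname")
                .port(3306)
                .databaseList("yourDatabase")            // databases to capture
                .tableList("yourDatabase.yourTable")     // tables to capture
                .username("yourUsername")
                .password("yourPassword")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .includeSchemaChanges(true)              // also emit schema changes
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000);                   // required for exactly-once

        env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
                .setParallelism(4)                       // parallel snapshot reading
                .print();

        env.execute("Print MySQL Snapshot + Binlog");
    }
}
```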
Support for the currentFetchEventTimeLag, currentEmitEventTimeLag, and sourceIdleTime metrics
These metrics follow the connector metrics specification of FLIP-33 [1]; see FLIP-33 for the meaning of each one. Among them, currentEmitEventTimeLag records the difference between the time the source sends a record to the downstream node and the time the record was generated in the database, measuring the delay from data generation in the database to departure from the source node. Users can use this metric to determine whether the source has entered the binlog-reading phase:
- when the metric is 0, the source is still in the full (historical) reading phase;
- when it is greater than 0, the source has entered the binlog-reading phase.
3. The new Oracle CDC connector in detail
Oracle is also a widely used database. The Oracle CDC connector supports capturing and recording row-level changes that occur in an Oracle database server. The principle is to use the LogMiner [2] tool provided by Oracle, or Oracle's native XStream API [3], to obtain change data from Oracle.
LogMiner is an analysis tool provided by Oracle Database that can parse Oracle redo log files and turn the database's data-change logs into change-event output. When LogMiner is used, the Oracle server imposes strict resource limits on the processes parsing the log files, so parsing can be slow for very large tables; the advantage is that LogMiner is free to use.
The XStream API is an internal interface that Oracle Database provides for Oracle GoldenGate (OGG). Clients can obtain change events efficiently through the XStream API because the change data is read not from redo log files but directly from a memory area in the Oracle server, saving the overhead of writing data to log files and parsing them afterwards. It is more efficient, but requires an Oracle GoldenGate (OGG) license.
The Oracle CDC connector supports both LogMiner and the XStream API for capturing change events, so in theory it can support a wide range of Oracle versions; Oracle 11, 12, and 19 are currently tested in the Flink CDC project. With the Oracle CDC connector, users only need to declare Flink SQL like the following to capture changed data from an Oracle database in real time:
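An illustrative DDL sketch; the hostname, credentials, and database/schema/table names are placeholders. The startup mode (full + incremental vs. incremental-only, the two working modes described below) is selected via a connector option, 'scan.startup.mode' in the connector documentation.

```sql
CREATE TABLE products (
    ID INT,
    NAME STRING,
    DESCRIPTION STRING,
    PRIMARY KEY (ID) NOT ENFORCED
) WITH (
    'connector' = 'oracle-cdc',
    'hostname' = 'localhost',
    'port' = '1521',
    'username' = 'flinkuser',
    'password' = 'flinkpw',
    'database-name' = 'XE',
    'schema-name' = 'inventory',
    'table-name' = 'products'
);

-- the change stream can then be queried like a regular table:
SELECT * FROM products;
```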
Leveraging Flink's rich surrounding ecosystem, users can then easily write the data to various downstream systems such as message queues, data warehouses, and data lakes.
The Oracle CDC connector shields users from the underlying CDC details: for the entire real-time synchronization link, a few lines of Flink SQL, and no Java code at all, are enough to capture and deliver Oracle data changes in real time.
In addition, the Oracle CDC connector offers two working modes: reading full data plus incremental change data, and reading only incremental change data. In both modes, the Flink CDC framework guarantees exactly-once semantics, with no records added or lost.
4. The new MongoDB CDC connector in detail
The MongoDB CDC connector does not depend on Debezium; it was developed independently within the Flink CDC project. It supports capturing and recording real-time change data in a MongoDB database. The principle is to pose as a replica in a MongoDB replica set [4]: leveraging the high-availability mechanism of the MongoDB cluster, a replica can obtain the complete oplog (operation log) event stream from the primary node. The Change Streams API provides the ability to subscribe to these oplog event streams in real time and push them to subscribing applications.
Among the events obtained from the Change Streams API, update events do not carry the pre-update image of the document, so the MongoDB CDC source can only act as an upsert source. However, the Flink framework automatically appends a Changelog Normalize node for MongoDB CDC to fill in the pre-update image (i.e., the UPDATE_BEFORE event), ensuring the semantic correctness of the CDC data.
Using the MongoDB CDC connector, users only need to declare Flink SQL like the following to capture full and incremental change data from a MongoDB database in real time; with Flink's powerful integration capabilities, the data can then easily be synchronized in real time to any downstream storage Flink supports:
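An illustrative DDL sketch; the hosts, credentials, and database/collection names are placeholders. Note that MongoDB's _id field is declared as the (non-enforced) primary key.

```sql
CREATE TABLE mongodb_products (
    _id STRING,           -- MongoDB document id, used as the primary key
    name STRING,
    weight DECIMAL(10, 3),
    PRIMARY KEY (_id) NOT ENFORCED
) WITH (
    'connector' = 'mongodb-cdc',
    'hosts' = 'localhost:27017',
    'username' = 'flinkuser',
    'password' = 'flinkpw',
    'database' = 'inventory',
    'collection' = 'products'
);

SELECT * FROM mongodb_products;
```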
Throughout the data capture process, users never need to study MongoDB's replication mechanism and principles, which greatly simplifies the process and lowers the barrier to use. MongoDB CDC also supports two startup modes:
- the default initial mode, which first synchronizes the existing data in the collection and then synchronizes the incremental data;
- the latest-offset mode, which synchronizes only the incremental data from the current point in time.
In addition, MongoDB CDC provides a rich set of configuration and tuning parameters, which can greatly improve the performance and stability of real-time links in production environments.
5. Summary and outlook
In just over a year, the Flink CDC project has seen phenomenal growth and attention. This is inseparable from the selfless contributions of the Flink CDC community's contributors and the positive feedback from its many users; it is this virtuous interaction between the two that keeps the project developing healthily, and that interaction is also the charm of the open-source community.
The Flink CDC community will continue to invest in the open-source community, with three main directions in its future plans:
Go deep on CDC technology
For example, abstract and reuse the mysql-cdc implementation so that Oracle, MongoDB, and other connectors can quickly support features such as lock-free reading and parallel reading.
Broaden the database ecosystem
Support richer database CDC sources, such as TiDB, DB2, and MS SQL Server.
Serve data integration scenarios well
- Integrate better with the downstream ecosystem of real-time data warehouses and data lakes, including Hudi, Iceberg, ClickHouse, Doris, etc.;
- Further lower the barrier for CDC data to enter lakes and warehouses, solving pain points such as whole-database synchronization and synchronization of table schema changes.
Acknowledgements
Special thanks to Marton Balassi and Tamas Kiss from Cloudera for contributing the Oracle CDC connector, and to Jiabao Sun from XTransfer for contributing the MongoDB CDC connector.
Contributor list:
Gunnar Morling, Jark Wu, Jiabao Sun, Leonard Xu, MartijnVisser, Marton Balassi, Shengkai, Tamas Kiss, Tamas Zseder, Zongwen Li, dongdongking008, frey66, gongzhongqiang, ili zh, jpatel, lsy, luoyuxia, manmao, mincwang, taox, tuple, wangminchao, yangbin09
[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
[2] https://oracle-base.com/articles/8i/logminer
[3] https://docs.oracle.com/cd/E11882_01/server.112/e16545/toc.htm
[4] https://docs.mongodb.com/manual/replication/
On December 4-5, Flink Forward Asia 2021 will kick off, featuring 40+ leading companies from multiple industries worldwide and 80+ in-depth technical sessions — a technical feast for developers.
https://flink-forward.org.cn/
In addition, the first Flink Forward Asia Hackathon has officially launched, with a 100,000 RMB prize pool waiting for you!
https://www.aliyun.com/page-source//tianchi/promotion/FlinkForwardAsiaHackathon