
This article is compiled from the speech given by Sun Jiabao, a senior Java development engineer at XTransfer and a Flink CDC Maintainer, at the Flink CDC Meetup. The main contents include:

  1. Introduction to MongoDB Change Stream Technology
  2. MongoDB CDC Connector Business Practice
  3. MongoDB CDC Connector Production Tuning
  4. MongoDB CDC Connector Parallelized Snapshot Improvements
  5. Follow-up planning


1. Introduction to MongoDB Change Stream Technology

img

MongoDB is a document-oriented, non-relational database that supports semi-structured data storage. It is also a distributed database that offers two cluster deployment modes, replica set and sharded cluster, giving it high availability and horizontal scalability and making it suitable for large-scale data storage. In addition, MongoDB 4.0 adds support for multi-document transactions, which makes it friendlier to more complex business scenarios.

img

MongoDB uses a weakly structured storage model and supports flexible data structures and rich data types, which makes it suitable for business scenarios such as JSON documents, tags, snapshots, geographic locations, and content storage. Its natively distributed architecture provides an out-of-the-box sharding mechanism and automatic rebalancing, suitable for large-scale data storage. MongoDB also provides distributed grid file storage, GridFS, which is suitable for storing large files such as images, audio, and video.

img

MongoDB provides two cluster deployment modes: replica set and sharded cluster.

Replica set: a high-availability deployment mode. Secondary nodes replicate data by replaying the operation log (oplog) of the primary node. When the primary node fails, the secondary nodes and the arbiter node hold a new election to choose a new primary and achieve failover. Secondary nodes can also serve query requests, reducing the query load on the primary node.

Sharded cluster: a horizontally scalable deployment mode that distributes data evenly across different shards. Each shard can be deployed as a replica set; the primary node of a shard handles reads and writes, while its secondary nodes replicate the primary's oplog. According to the specified shard key and sharding strategy, the data is divided into chunks (64 MB by default), and these chunks are assigned to different shards for storage. The mapping between chunks and shards is recorded in the config servers.

img

MongoDB's oplog is similar to MySQL's binlog: it records all data operations in MongoDB. The oplog is a capped collection; once it exceeds the preset capacity, the oldest records are discarded.

img

Unlike MySQL's binlog, the oplog does not record the complete before and after images of a change. Traversing the oplog can indeed capture data changes in MongoDB, but there are limitations when converting them into the changelog formats supported by Flink.

First, it is difficult to subscribe to the oplog. Each replica set maintains its own oplog, and in a sharded cluster each shard is an independent replica set, so the oplogs of all shards have to be traversed and merged by operation time. In addition, the oplog does not contain the complete before and after state of the changed document, so it can be converted neither into Flink's standard changelog nor into an upsert changelog. This is the main reason why we did not implement the MongoDB CDC Connector by subscribing to the oplog directly.

img

In the end, we chose to use the MongoDB Change Streams solution to implement the MongoDB CDC Connector.

Change Streams is a feature introduced in MongoDB 3.6. It provides a simpler change data capture API and hides the complexity of traversing the oplog directly. Change Streams can also return the complete post-change state of a document, which is easy to convert into a Flink upsert changelog. It provides fairly complete failure recovery as well: each change record carries a resume token that marks the current position in the change stream, so after a failure the stream can be resumed from that position via the resume token.

In addition, Change Streams supports filtering and customization of change events. For example, regular-expression filters on database and collection names can be pushed down to MongoDB, which significantly reduces network overhead. Change subscriptions can be opened at the collection, database, or cluster level, and the corresponding permission control is supported.
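As an illustration (not the connector's internal code), the following sketch uses the MongoDB Java driver to open a change stream, request the full post-image of updated documents, and resume from a previously stored resume token; the connection string, database, collection, and the savedResumeToken variable are placeholders.

```java
import com.mongodb.client.ChangeStreamIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;
import org.bson.BsonDocument;
import org.bson.Document;

public class ChangeStreamDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("orders");

            // Hypothetical resume token persisted by a previous run (e.g. in a checkpoint).
            BsonDocument savedResumeToken = null;

            ChangeStreamIterable<Document> stream = coll.watch()
                    // Ask MongoDB to return the complete post-image of updated documents;
                    // this is what makes conversion into an upsert changelog possible.
                    .fullDocument(FullDocument.UPDATE_LOOKUP);
            if (savedResumeToken != null) {
                // Resume from the recorded position after a failure.
                stream = stream.resumeAfter(savedResumeToken);
            }

            for (ChangeStreamDocument<Document> change : stream) {
                // Every event carries a resume token; persist it for recovery.
                BsonDocument resumeToken = change.getResumeToken();
                System.out.println(change.getOperationType() + " -> " + change.getFullDocument());
            }
        }
    }
}
```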

img

The CDC Connector built on the MongoDB Change Streams feature is shown in the figure above. It first subscribes to MongoDB changes through Change Streams; there are four change types: insert, update, delete, and replace. These are converted into the upsert changelog supported by Flink, a dynamic table is defined on top of it, and the data is then processed with Flink SQL.

Currently, the MongoDB CDC Connector supports exactly-once semantics, full and incremental subscription, recovery from checkpoints and savepoints, snapshot data filtering, extraction of metadata such as database and collection names, and regular-expression filtering of databases and collections.
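For illustration, a dynamic table backed by the mongodb-cdc connector might be declared and queried from Java roughly as follows; the host, credentials, and schema are made-up examples, and the option names follow the Flink CDC documentation.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MongoCdcSqlDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A dynamic table backed by MongoDB Change Streams (upsert changelog).
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  _id STRING," +
                "  order_no STRING," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mongodb-cdc'," +
                "  'hosts' = 'localhost:27017'," +      // placeholder host
                "  'username' = 'flinkuser'," +         // placeholder credentials
                "  'password' = 'flinkpw'," +
                "  'database' = 'mydb'," +
                "  'collection' = 'orders'" +
                ")");

        // Process the upsert changelog with ordinary Flink SQL.
        tEnv.executeSql("SELECT order_no, SUM(amount) AS total FROM orders GROUP BY order_no")
            .print();
    }
}
```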

2. MongoDB CDC Connector Business Practice

img

Founded in 2017, XTransfer focuses on B2B cross-border payment business, providing foreign trade collection and risk control services for small, medium and micro enterprises engaged in cross-border e-commerce exports. Cross-border B2B settlement involves a long business chain: from inquiry to the final deal, the process covers logistics terms, payment terms, and so on, and risk management and control must be done well at every step to comply with the regulatory requirements for cross-border capital transactions.

All of the above places higher requirements on the security and accuracy of XTransfer's data processing. On this basis, XTransfer has built its own big data platform on top of Flink, which effectively guarantees that data across the entire cross-border B2B chain can be collected, processed, and computed, and meets the requirements of high security, low latency, and high precision.

img

Change data capture (CDC) is a key step in data integration. Before adopting Flink CDC, we generally used traditional CDC tools such as Debezium and Canal to extract the database change log, forward it to Kafka, and have downstream jobs read and consume the change log from Kafka. This architecture has the following pain points:

  • Many components need to be deployed, and the operation and maintenance cost is high;
  • The downstream consumption logic has to be adapted to the writing side, which incurs extra development cost;
  • Data subscription configuration is complex, and a complete data synchronization pipeline cannot be defined with SQL statements alone, as it can with Flink CDC;
  • It is hard to cover both full and incremental collection, and a separate full-load component such as DataX may have to be introduced;
  • The tools focus on capturing changed data, and their ability to process and filter data is relatively weak;
  • It is difficult to support widening (join) scenarios across heterogeneous data sources.

At present, our big data platform mainly uses Flink CDC for change data capture, which has the following advantages:

1. Real-time data integration

img

  • There is no need to deploy additional components such as Debezium, Canal, Datax, etc., and the operation and maintenance cost is greatly reduced;
  • It supports rich data sources, and can also reuse Flink's existing connectors for data collection and writing, which can cover most business scenarios;
  • The development difficulty is reduced, and a complete data integration workflow can be defined only through Flink SQL;
  • It has strong data processing capabilities. Relying on Flink's powerful computing engine, streaming ETL and even joins and aggregations across heterogeneous data sources can be realized, as sketched below.
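As a hedged sketch of such a workflow, the job below widens a MongoDB collection with a MySQL table purely through Flink SQL and writes the result to a print sink; all hosts, credentials, table names, and fields are invented for the example, and the option names follow the mongodb-cdc and mysql-cdc connector documentation.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcStreamingEtlDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Orders captured from MongoDB via Change Streams.
        tEnv.executeSql(
                "CREATE TABLE mongo_orders (" +
                "  _id STRING, customer_id STRING, amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mongodb-cdc'," +
                "  'hosts' = 'mongo-host:27017'," +
                "  'username' = 'flinkuser', 'password' = 'flinkpw'," +
                "  'database' = 'mydb', 'collection' = 'orders')");

        // Customer dimension captured from MySQL via its binlog.
        tEnv.executeSql(
                "CREATE TABLE mysql_customers (" +
                "  id STRING, name STRING," +
                "  PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'mysql-host', 'port' = '3306'," +
                "  'username' = 'flinkuser', 'password' = 'flinkpw'," +
                "  'database-name' = 'crm', 'table-name' = 'customers')");

        // A simple sink for the demo; in practice this could be Kafka, Iceberg, Hudi, etc.
        tEnv.executeSql(
                "CREATE TABLE wide_orders (" +
                "  order_id STRING, customer_name STRING, amount DECIMAL(10, 2)" +
                ") WITH ('connector' = 'print')");

        // Streaming ETL: widen MongoDB orders with MySQL customer names.
        tEnv.executeSql(
                "INSERT INTO wide_orders " +
                "SELECT o._id, c.name, o.amount " +
                "FROM mongo_orders AS o JOIN mysql_customers AS c ON o.customer_id = c.id");
    }
}
```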

2. Build a real-time data warehouse

img

  • It greatly simplifies the deployment of a real-time data warehouse. Flink CDC collects database changes in real time and writes them into Kafka, Iceberg, Hudi, TiDB, and other storage systems, after which Flink can be used for further data processing and mining.
  • Flink's engine supports unified stream and batch processing, so there is no need to maintain multiple computing engines, which greatly reduces data development costs.

3. Real-time risk control

img

  • In the past, real-time risk control was generally implemented by sending business events to Kafka. With Flink CDC, risk control events can be captured directly from the business database and then handled with Flink's complex event processing.
  • Models can be run with Flink ML and Alink to enrich machine learning capabilities. Finally, the results of real-time risk control are written back to Kafka, and risk control instructions are issued.

3. MongoDB CDC Connector Production Tuning

img

The use of MongoDB CDC Connector has the following requirements:

  • Since the connector is implemented on top of Change Streams, MongoDB 3.6 is the minimum supported version, and 4.0.8 or above is recommended.
  • A replica set or sharded cluster deployment must be used. Subscribing to Change Streams relies on the oplog produced by replication between nodes; a standalone MongoDB instance has no replication and therefore no oplog.
  • The WiredTiger storage engine and the pv1 replication protocol are required.
  • The user needs the changeStream and find privileges.

img

When using the MongoDB CDC Connector, pay attention to the capacity and retention time of the oplog. The MongoDB oplog is a capped collection: once it reaches its maximum size, historical data is discarded. Change Streams resumes via the resume token, so if the oplog is too small, the oplog record that the resume token points to may no longer exist; the resume token then expires and the change stream cannot be resumed.

You can use replSetResizeOplog to set the oplog capacity, and MongoDB 4.4 and later also support setting a minimum retention time. In general, it is recommended to keep at least 7 days of oplog in a production environment.
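A minimal sketch of resizing the oplog through the MongoDB Java driver is shown below; the size is given in MB, minRetentionHours requires MongoDB 4.4+, and the concrete values are only examples.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ResizeOplogDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // replSetResizeOplog is an admin command and must be run on each replica set member.
            Document result = client.getDatabase("admin").runCommand(
                    new Document("replSetResizeOplog", 1)
                            .append("size", 20480)               // oplog capacity in MB (example value)
                            .append("minRetentionHours", 168));  // keep >= 7 days of oplog (MongoDB 4.4+)
            System.out.println(result.toJson());
        }
    }
}
```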

img

For tables that change slowly, it is recommended to enable heartbeat events in the configuration. Both change events and heartbeat events advance the resume token, so for slowly changing tables the heartbeat events keep refreshing the resume token and prevent it from expiring.

The heartbeat interval can be set through heartbeat.interval.ms.
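For example, assuming the TableEnvironment tEnv from the earlier sketch, the option can be added to the table definition like this (the 60-second interval is only an example):

```java
// Emit a heartbeat event every 60 s so the resume token keeps advancing
// even when the collection itself rarely changes.
tEnv.executeSql(
        "CREATE TABLE slow_changing_orders (" +
        "  _id STRING, PRIMARY KEY (_id) NOT ENFORCED" +
        ") WITH (" +
        "  'connector' = 'mongodb-cdc'," +
        "  'hosts' = 'localhost:27017'," +
        "  'database' = 'mydb'," +
        "  'collection' = 'orders'," +
        "  'heartbeat.interval.ms' = '60000'" +
        ")");
```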

img

Since MongoDB's Change Streams can only be converted into Flink's upsert changelog, similar to the Upsert Kafka format, an operator called ChangelogNormalize is added to complete the pre-update (-U) image, which brings additional state overhead. Therefore, it is recommended to use the RocksDB state backend in a production environment.
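A minimal sketch of enabling it programmatically, assuming Flink 1.13+ where the backend class is EmbeddedRocksDBStateBackend (the same effect can be achieved with state.backend: rocksdb in flink-conf.yaml):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbStateBackendDemo {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Keep the potentially large ChangelogNormalize state on local disk instead of the JVM heap.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints
        env.enableCheckpointing(60_000);                            // checkpoint every 60 s (example value)
        // ... build the CDC pipeline on `env` as usual ...
    }
}
```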

img

When the default connection parameters cannot meet your requirements, you can pass additional MongoDB connection parameters through the connection.options configuration item.

For example, if the user used to connect to MongoDB was not created in the admin database, you can set a parameter (authSource) to specify which database to authenticate against; you can also set options such as the maximum connection pool size. These are the parameters supported by the standard MongoDB connection string.
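For example, assuming the TableEnvironment tEnv from the earlier sketch, authSource and maxPoolSize could be passed like this (values are illustrative):

```java
// Authenticate against the 'users' database instead of 'admin'
// and cap the connection pool at 20 connections.
tEnv.executeSql(
        "CREATE TABLE orders_custom_conn (" +
        "  _id STRING, PRIMARY KEY (_id) NOT ENFORCED" +
        ") WITH (" +
        "  'connector' = 'mongodb-cdc'," +
        "  'hosts' = 'localhost:27017'," +
        "  'username' = 'flinkuser', 'password' = 'flinkpw'," +
        "  'database' = 'mydb', 'collection' = 'orders'," +
        "  'connection.options' = 'authSource=users&maxPoolSize=20'" +
        ")");
```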

img

Regular-expression matching of multiple databases and tables is a feature introduced in MongoDB CDC Connector 2.0. Note that the readAnyDatabase role is required if the database name is given as a regular expression, because MongoDB Change Streams can only be opened at the cluster, database, or collection level. When the database name is a regular expression, Change Streams has to be opened at the cluster level, and the matched databases are then filtered through the aggregation pipeline. You can subscribe to multiple databases and collections by writing regular expressions in the database and collection options, as sketched below.
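For example, assuming the TableEnvironment tEnv from the earlier sketch, the patterns below (invented for the example) subscribe to several databases and collections at once:

```java
// Subscribe to every database matching db_[0-9]+ and, within them,
// every collection whose name starts with "orders".
// The connecting user needs the readAnyDatabase role for a database-level regex.
tEnv.executeSql(
        "CREATE TABLE multi_db_orders (" +
        "  _id STRING, PRIMARY KEY (_id) NOT ENFORCED" +
        ") WITH (" +
        "  'connector' = 'mongodb-cdc'," +
        "  'hosts' = 'localhost:27017'," +
        "  'username' = 'flinkuser', 'password' = 'flinkpw'," +
        "  'database' = 'db_[0-9]+'," +
        "  'collection' = 'orders.*'" +
        ")");
```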

4. MongoDB CDC Connector Parallelized Snapshot Improvements

img

To speed up the snapshot phase, the source interface introduced by FLIP-27 can be used for parallelization. A split enumerator first divides the complete snapshot task into several splits according to a certain splitting strategy, and these splits are then assigned to multiple split readers that take the snapshot in parallel, improving the overall speed of the job.

In MongoDB, however, the primary key is in most cases an ObjectID, whose first four bytes are a UNIX timestamp, the middle five bytes are a random value, and the last three bytes are an auto-incrementing counter. ObjectIDs generated within the same second are not strictly increasing, because the random value in the middle breaks local ordering, but overall they still follow an increasing trend.

Therefore, unlike MySQL's auto-increment primary keys, MongoDB collections are not well suited to a simple offset + limit splitting strategy, and a splitting strategy tailored to ObjectID is needed.
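To illustrate why ObjectIDs can still be partitioned into ranges by time, the sketch below builds boundary ObjectIDs from timestamps with the BSON library; it is only an illustration, not the connector's actual splitting code.

```java
import java.util.Date;
import org.bson.types.ObjectId;

public class ObjectIdRangeDemo {
    public static void main(String[] args) {
        ObjectId id = new ObjectId();
        // The first four bytes encode the creation time in seconds.
        System.out.println("timestamp: " + id.getTimestamp() + ", date: " + id.getDate());

        // Boundary ObjectIDs built from timestamps. Any ObjectID whose timestamp lies
        // strictly between these two seconds sorts between the boundaries, even though
        // ids generated within the same second are not strictly ordered.
        ObjectId lower = new ObjectId(new Date(1640995200000L)); // 2022-01-01 00:00:00 UTC
        ObjectId upper = new ObjectId(new Date(1643673600000L)); // 2022-02-01 00:00:00 UTC
        System.out.println("chunk range: [" + lower.toHexString() + ", " + upper.toHexString() + ")");
    }
}
```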

img

Ultimately, we adopted the following three MongoDB splitting strategies:

  • Sample-based bucketing: uses the $sample command to randomly sample the collection and estimates the number of buckets from the average document size and the desired chunk size. It requires query (find) permission on the collection. The advantage is speed, which makes it suitable for large collections that are not sharded; the disadvantage is that, because it relies on sampling, the resulting buckets cannot be perfectly uniform.
  • SplitVector index splitting: splitVector is the internal command MongoDB uses to compute chunk split points; it calculates the boundary of each chunk by walking the specified index. It requires the splitVector privilege. The advantages are speed and uniform chunks; the disadvantage is that for large collections that are already sharded, it is better to read the existing chunk metadata directly from the config database.
  • Chunks metadata read: MongoDB stores the actual split result of a sharded collection in the config database, so the chunk boundaries can be read directly from config.chunks. It requires read access to the config database and applies only to sharded collections. The advantages are speed, no need to recalculate split points, and uniform chunks (64 MB by default); the disadvantage is that it only covers the sharded-collection scenario.

img

The figure above is an example of sample-based bucketing. On the left is the complete collection. A sample size is set, the collection is sampled, and the sampled keys are then grouped into buckets; the resulting bucket boundaries are the chunk boundaries we want.

The $sample command is MongoDB's built-in sampling command. When the sample size is less than 5% of the collection, a pseudo-random cursor is used for sampling; when it is greater than 5%, the documents are randomly sorted first and the top N are selected. Its uniformity and cost therefore depend mainly on the random algorithm and the number of samples. It is a compromise between uniformity and splitting speed, suitable for scenarios that require fast splitting and can tolerate somewhat uneven chunks.

In actual testing, the uniformity of sample-based bucketing performed well.
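A rough sketch of sample-based split-point estimation with the MongoDB Java driver is shown below; the sample size and the number of samples per chunk are illustrative and not the connector's exact parameters.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Projections;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class SampleSplitDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll = client.getDatabase("mydb").getCollection("orders");

            int sampleSize = 1000;     // example sample size
            int samplesPerChunk = 10;  // example: every 10 sorted samples form one chunk

            // Randomly sample _id values and sort them.
            List<Document> samples = coll.aggregate(List.of(
                    Aggregates.sample(sampleSize),
                    Aggregates.project(Projections.include("_id")),
                    Aggregates.sort(Sorts.ascending("_id"))
            )).into(new ArrayList<>());

            // Every N-th sampled _id becomes an approximate chunk boundary.
            for (int i = samplesPerChunk; i < samples.size(); i += samplesPerChunk) {
                System.out.println("split point: " + samples.get(i).get("_id"));
            }
        }
    }
}
```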

img

The figure above shows an example of SplitVector index splitting. On the left is the original collection. The splitVector command specifies the index to access, in this case the _id index, and the desired chunk size in MB; the command then walks the index and computes the boundary of each chunk.

It's fast and the chunk results are even, making it suitable for most scenarios.
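A sketch of invoking the internal splitVector command through the Java driver (it requires the splitVector privilege; the namespace and chunk size are examples):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

import java.util.List;

public class SplitVectorDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // splitVector walks the given index and returns the keys that split
            // the collection into chunks of roughly maxChunkSize megabytes.
            Document result = client.getDatabase("mydb").runCommand(
                    new Document("splitVector", "mydb.orders")            // target namespace (example)
                            .append("keyPattern", new Document("_id", 1)) // index to walk
                            .append("maxChunkSize", 64));                 // target chunk size in MB

            List<Document> splitKeys = result.getList("splitKeys", Document.class);
            splitKeys.forEach(key -> System.out.println("split point: " + key.toJson()));
        }
    }
}
```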

img

The figure above shows an example of reading config.chunks, that is, directly reading the chunk metadata that MongoDB has already computed. The config servers record which shard each chunk resides on and the minimum and maximum boundaries of each chunk. For a sharded collection, the boundary information can therefore be read directly from config.chunks without recomputing the split points, and it also ensures that each chunk can be read on a single machine. It is extremely fast and performs very well for sharded collections.
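A sketch of reading the chunk metadata of a sharded collection directly from the config database; the query uses the "ns" field of config.chunks, which applies to MongoDB versions before 5.0 (newer versions key chunks by collection UUID), and the namespace is an example.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ConfigChunksDemo {
    public static void main(String[] args) {
        // Connect through mongos so that the config database is visible.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // Each document in config.chunks describes one chunk: its shard and its min/max boundary.
            for (Document chunk : client.getDatabase("config")
                    .getCollection("chunks")
                    .find(Filters.eq("ns", "mydb.orders"))) {  // "ns" field: pre-5.0 schema
                System.out.println("shard=" + chunk.getString("shard")
                        + " min=" + chunk.get("min")
                        + " max=" + chunk.get("max"));
            }
        }
    }
}
```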

5. Follow-up Planning

img

The follow-up planning of Flink CDC is mainly divided into the following five aspects:

  • First, assist in improving the Flink CDC incremental snapshot framework;
  • Second, connect MongoDB CDC to the Flink CDC incremental snapshot framework so that it supports parallel snapshots;
  • Third, MongoDB CDC supports Flink RawType. For some more flexible storage structures, RawType conversion is provided, and users can customize and parse them in the form of UDF;
  • Fourth, MongoDB CDC supports the collection of changed data from a specified location;
  • Fifth, optimization of MongoDB CDC stability.

Q&A

Q: Is MongoDB CDC high latency? Do you need to sacrifice performance to reduce latency?

A: The latency of MongoDB CDC is not high. During a full load, ChangelogNormalize may create some back pressure on the incremental CDC collection, but this can be avoided by increasing the parallelism of the MongoDB snapshot and allocating more resources.

Q: When does the default connection fail to meet the requirements?

A: A MongoDB user can be created in any database. If the user was not created in the admin database, you need to explicitly specify which database to authenticate against, and you can also set parameters such as the maximum connection pool size.

Q: Does MongoDB CDC currently support lock-free concurrent reads?

A: Lock-free concurrent reads require the incremental snapshot capability. Because it is difficult to obtain the current changelog position in MongoDB, incremental snapshots cannot be implemented immediately, but lock-free concurrent snapshots will be supported soon.




