In this article, Sun Jiabao shares how the Flink MongoDB CDC Connector was implemented on top of Flink CDC using MongoDB's Change Streams feature. The main contents include:
- Flink CDC
- MongoDB replication mechanism
- Flink MongoDB CDC
Foreword
XTransfer focuses on providing cross-border finance and risk-control services for small and medium-sized enterprises in cross-border B2B e-commerce, offering comprehensive solutions for needs such as local receiving accounts, foreign exchange conversion, and declarations for countries with overseas foreign exchange controls.
In the early stage of business development, we chose a traditional offline data warehouse architecture and a data integration approach of full collection, batch processing, and overwrite writing, which resulted in poor data timeliness. As the business grew, the offline data warehouse increasingly failed to meet the requirements on data timeliness, so we decided to evolve from an offline data warehouse to a real-time data warehouse. The key to building a real-time data warehouse lies in the choice of the data collection tool and the real-time computing engine.
After a series of investigations, we turned our attention to the Flink CDC project in February 2021. Flink CDC embeds Debezium, giving Flink itself the ability to capture change data, which greatly lowers the development threshold and simplifies deployment. Combined with Flink's powerful real-time computing capabilities and rich connectivity to external systems, it became a key tool for us to build the real-time data warehouse.
In addition, we also use MongoDB heavily in production, so we implemented the Flink MongoDB CDC Connector on top of Flink CDC using MongoDB's Change Streams feature and contributed it to the Flink CDC community; it was released in version 2.1. I am honored to share the implementation details and production practices with you here.
1. Flink CDC
Dynamic Table is the core concept of Flink's Table API and SQL for supporting streaming data. Streams and tables are dual: a table can be converted into a changelog stream, and a changelog stream can be replayed to restore the table.
Streams come in two modes: Append Mode and Update Mode. In Append Mode, records are only inserted, never updated or deleted, as in an event stream. In Update Mode, records may be inserted, updated, or deleted, as in a database operation log. Before Flink 1.11, only dynamic tables defined in Append Mode were supported.
Flink 1.11 introduced the new TableSource and TableSink interfaces in FLIP-95, which added support for Update Mode changelogs, and FLIP-105 introduced direct support for the Debezium and Canal CDC formats. By implementing ScanTableSource to receive change logs from external systems (such as database change logs), interpret them as changelogs that Flink can recognize, and pass them downstream, it became possible to define dynamic tables from change logs.
Inside Flink, changelog records are represented by RowData, which carries one of four RowKinds: +I (INSERT), -U (UPDATE_BEFORE), +U (UPDATE_AFTER), -D (DELETE). Depending on which record kinds a changelog produces, changelogs fall into three modes.
- INSERT_ONLY: contains only +I, suitable for batch data and event streams.
- ALL: contains all RowKinds (+I, -U, +U, -D), such as the MySQL binlog.
- UPSERT: contains only +I, +U, and -D (no -U) and requires idempotent updates on a unique key, such as MongoDB Change Streams.
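To make the three modes concrete, here is a minimal Java sketch (not from the original article) that builds changelog records of the four RowKinds with Flink's GenericRowData; the schema and field values are made up for illustration.
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.types.RowKind;

public class ChangelogRowKindExample {
    public static void main(String[] args) {
        // ALL-type changelogs (e.g. MySQL binlog) carry both the pre-image and the post-image of an update:
        RowData before = GenericRowData.ofKind(RowKind.UPDATE_BEFORE, 1L, StringData.fromString("Beijing China"));
        RowData after = GenericRowData.ofKind(RowKind.UPDATE_AFTER, 1L, StringData.fromString("Shanghai China"));

        // UPSERT-type changelogs (e.g. MongoDB Change Streams) emit only +I / +U / -D,
        // relying on a unique key (here the first field) for idempotent updates:
        RowData upsert = GenericRowData.ofKind(RowKind.UPDATE_AFTER, 1L, StringData.fromString("Shanghai China"));
        RowData delete = GenericRowData.ofKind(RowKind.DELETE, 1L, null);

        System.out.println(before.getRowKind().shortString()); // -U
        System.out.println(after.getRowKind().shortString());  // +U
        System.out.println(upsert.getRowKind().shortString()); // +U
        System.out.println(delete.getRowKind().shortString()); // -D
    }
}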
2. MongoDB replication mechanism
As mentioned in the previous section, the key to implementing Flink MongoDB CDC is converting MongoDB's operation log into a changelog that Flink supports. To solve this problem, we first need to understand MongoDB's cluster deployment modes and replication mechanism.
2.1 Replica Sets and Sharded Clusters
A replica set is MongoDB's high-availability deployment mode; members of a replica set keep their data in sync by replicating the oplog (operation log).
A sharded cluster is MongoDB's deployment mode for large-scale datasets and high-throughput operations, where each shard consists of a replica set.
2.2 Replica Set Oplog
In MongoDB, the oplog (operation log) is a special capped collection (fixed-size collection) that records the data operations used for synchronization between replica set members. The structure of an oplog record is shown below.
{
    "ts" : Timestamp(1640190995, 3),
    "t" : NumberLong(434),
    "h" : NumberLong(3953156019015894279),
    "v" : 2,
    "op" : "u",
    "ns" : "db.firm",
    "ui" : UUID("19c72da0-2fa0-40a4-b000-83e038cd2c01"),
    "o2" : {
        "_id" : ObjectId("61c35441418152715fc3fcbc")
    },
    "wall" : ISODate("2021-12-22T16:36:35.165Z"),
    "o" : {
        "$v" : 1,
        "$set" : {
            "address" : "Shanghai China"
        }
    }
}
Field | Nullable | Description |
---|---|---|
ts | N | Operation time, as a BsonTimestamp |
t | Y | The term in the Raft protocol; it increases each time a node goes down, a new node joins, or a primary-secondary switch occurs |
h | Y | Hash of the operation's globally unique id |
v | N | oplog version |
op | N | Operation type: "i" insert, "u" update, "d" delete, "c" db command, "n" no-op |
ns | N | Namespace, the full name of the collection the operation applies to |
ui | N | Session id |
o2 | Y | Records the _id and shard key for update operations |
wall | N | Operation time, accurate to the millisecond |
o | N | Description of the change data |
As the example shows, an update record in the MongoDB oplog contains neither the pre-update values nor the complete post-update document, so it cannot be converted into the ALL-type changelog supported by Flink, and it is also difficult to convert into an UPSERT-type changelog.
In addition, in a sharded cluster data may be written to different shard replica sets, and each shard's oplog only records the changes that occur on that shard. Obtaining the complete picture of data changes therefore requires sorting and merging the oplogs of all shards by operation time, which increases the difficulty and risk of capturing change records.
Before version 1.7, the Debezium MongoDB Connector implemented change data capture by traversing the oplog. For the reasons above, we did not use the Debezium MongoDB Connector and instead chose MongoDB's official MongoDB Kafka Connector, which is based on Change Streams.
2.3 Change Streams
Change Streams is a feature introduced in MongoDB 3.6. It hides the complexity of traversing the oplog and lets users subscribe to data changes at the cluster, database, or collection level through a simple API.
2.3.1 Conditions of use
- WiredTiger storage engine
- Replica set (in the test environment, a single-node replica set can also be used) or sharded cluster deployment
- Replica set protocol version: pv1 (default)
- Before version 4.0, Majority Read Concern must be enabled: replication.enableMajorityReadConcern = true (enabled by default)
- The MongoDB user has the find and changeStream privileges
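As a minimal illustration of the API (not from the original article), the following sketch uses the MongoDB Java sync driver to open a collection-level Change Stream; the connection string, database, and collection names are placeholders.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;

public class ChangeStreamSubscriptionExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> collection =
                    client.getDatabase("inventory").getCollection("products");

            // Collection-level subscription; client.watch() subscribes at the cluster level
            // and client.getDatabase("inventory").watch() at the database level.
            // Iteration blocks and keeps delivering change events as they occur.
            for (ChangeStreamDocument<Document> event : collection.watch()) {
                System.out.println(event.getOperationType() + " " + event.getDocumentKey());
            }
        }
    }
}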
2.3.2 Change Events
A Change Event is the change record returned by Change Streams; its data structure is as follows:
{
    _id : { <BSON Object> },
    "operationType" : "<operation>",
    "fullDocument" : { <document> },
    "ns" : {
        "db" : "<database>",
        "coll" : "<collection>"
    },
    "to" : {
        "db" : "<database>",
        "coll" : "<collection>"
    },
    "documentKey" : { "_id" : <value> },
    "updateDescription" : {
        "updatedFields" : { <document> },
        "removedFields" : [ "<field>", ... ],
        "truncatedArrays" : [
            { "field" : <field>, "newSize" : <integer> },
            ...
        ]
    },
    "clusterTime" : <Timestamp>,
    "txnNumber" : <NumberLong>,
    "lsid" : {
        "id" : <UUID>,
        "uid" : <BinData>
    }
}
Field | Type | Description |
---|---|---|
_id | document | The resume token |
operationType | string | Operation type: insert, delete, replace, update, drop, rename, dropDatabase, invalidate |
fullDocument | document | The complete document; included by default for insert and replace, requires updateLookup to be enabled for update, and not included for delete and other operation types |
ns | document | The full name of the collection the operation applies to |
to | document | When the operation type is rename, the full name after the rename |
documentKey | document | Contains the primary key _id of the changed document; for a sharded collection, documentKey also contains the shard key |
updateDescription | document | When the operation type is update, describes the fields and values that changed |
clusterTime | Timestamp | Operation time |
txnNumber | NumberLong | Transaction number |
lsid | document | Session id |
2.3.3 Update Lookup
Since an oplog update record only contains the changed fields, the complete post-change document cannot be obtained directly from the oplog; however, when converting to a changelog in UPSERT mode, UPDATE_AFTER RowData must carry the complete row. By setting fullDocument = updateLookup, Change Streams return the latest state of the document along with the change record. In addition, every Change Event contains the documentKey (_id and shard key), which identifies the primary key of the changed record and therefore satisfies the condition for idempotent updates. Through the Update Lookup feature, MongoDB's change records can thus be converted into Flink's UPSERT changelog.
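The sketch below illustrates this behavior with the MongoDB Java sync driver (an illustrative example, not the connector's internal code): FullDocument.UPDATE_LOOKUP makes update events carry the latest full document, and documentKey provides the key for idempotent updates. The database and collection names are placeholders.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;
import org.bson.Document;

public class UpdateLookupExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            // fullDocument = updateLookup: update events also carry the current state of the document.
            for (ChangeStreamDocument<Document> event : client.getDatabase("inventory")
                    .getCollection("products")
                    .watch()
                    .fullDocument(FullDocument.UPDATE_LOOKUP)) {
                // documentKey (_id plus shard key) is the key for the idempotent UPSERT;
                // fullDocument supplies the complete UPDATE_AFTER image.
                System.out.println(event.getDocumentKey() + " -> " + event.getFullDocument());
            }
        }
    }
}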
3. Flink MongoDB CDC
For the concrete implementation, we integrated MongoDB's official Change Streams-based MongoDB Kafka Connector. With Debezium's EmbeddedEngine, it is easy to drive the MongoDB Kafka Connector inside Flink. By converting Change Streams into Flink's UPSERT changelog, we implemented the MongoDB CDC TableSource, and by relying on the resume mechanism of Change Streams, we implemented recovery from checkpoints and savepoints.
As described in FLIP-149, some operations (such as aggregations) are difficult to handle correctly in the absence of -U messages. For an UPSERT-type changelog, the Flink planner therefore introduces an additional operator (Changelog Normalize) to normalize it into an ALL-type changelog.
Supported features
- Exactly-once semantics
- Full and incremental subscription
- Snapshot data filtering
- Recovery from checkpoints and savepoints (see the sketch below)
- Metadata extraction
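For reference, the sketch below shows one way to wire the DataStream-style source (the same builder used in section 4.6) into a job with checkpointing enabled, which is what recovery from checkpoints and savepoints relies on. The host, database, and collection names are placeholders, and the package names correspond to the flink-cdc-connectors 2.x releases.
import com.ververica.cdc.connectors.mongodb.MongoDBSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MongoDBCdcJobExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint periodically so that the resume token is persisted and the
        // Change Stream can be resumed after a failure or a savepoint restore.
        env.enableCheckpointing(60_000);

        env.addSource(
                MongoDBSource.<String>builder()
                        .hosts("127.0.0.1:27017")
                        .database("inventory")
                        .collection("products")
                        .deserializer(new JsonDebeziumDeserializationSchema())
                        .build())
           .print();

        env.execute("mongodb-cdc-example");
    }
}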
4. Production practice
4.1 Using RocksDB State Backend
To fill in the missing pre-update (-U) values, Changelog Normalize keeps additional state, which brings extra overhead. It is recommended to use the RocksDB state backend in production.
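A minimal sketch of enabling it programmatically is shown below, assuming Flink 1.13+ and the flink-statebackend-rocksdb dependency on the classpath; the same effect can be achieved in flink-conf.yaml with state.backend: rocksdb. The checkpoint path is a placeholder.
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBStateBackendExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep the potentially large Changelog Normalize state in RocksDB instead of the JVM heap;
        // "true" enables incremental checkpoints.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Durable checkpoint storage (placeholder path).
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
    }
}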
4.2 Appropriate oplog size and expiration time
MongoDB's oplog.rs is a special capped collection. When oplog.rs reaches its maximum size, the oldest records are discarded. Change Streams resume via resume tokens; if the oplog is too small, the oplog record that a resume token points to may no longer exist, causing the resume to fail.
When the oplog size is not explicitly specified, WiredTiger uses a default of 5% of the disk size, with a lower bound of 990MB and an upper bound of 50GB. Since MongoDB 4.4, a minimum oplog retention time can also be set: when the oplog is full, an oplog record is only recycled after it exceeds the minimum retention time.
The oplog capacity and minimum retention time can be reset using the replSetResizeOplog command. In the production environment, it is recommended to set the oplog size to be no less than 20GB, and the oplog retention time to be no less than 7 days.
db.adminCommand(
    {
        replSetResizeOplog: 1,    // fixed value: 1
        size: 20480,              // in MB, valid range 990MB to 1PB
        minRetentionHours: 168    // optional, in hours
    }
)
4.3 Enable heartbeat events for slowly changing collections
Flink MongoDB CDC periodically writes the resume token into the checkpoint so that the Change Stream can be restored from it. MongoDB change events and heartbeat events both advance the resume token. If the subscribed collection changes slowly, the resume token associated with its last change record may expire, making recovery from the checkpoint impossible. For slowly changing collections, it is therefore recommended to enable heartbeat events (set heartbeat.interval.ms > 0) to keep the resume token up to date.
WITH (
    'connector' = 'mongodb-cdc',
    'heartbeat.interval.ms' = '60000'
)
4.4 Customize MongoDB connection parameters
When the default connection does not meet your requirements, connection parameters supported by MongoDB can be passed through the connection.options configuration option.
https://docs.mongodb.com/manual/reference/connection-string/#connection-string-options
WITH (
    'connector' = 'mongodb-cdc',
    'connection.options' = 'authSource=authDB&maxPoolSize=3'
)
4.5 Change Stream parameter tuning
The fetching of change events can be fine-tuned through poll.await.time.ms and poll.max.batch.size in the Flink DDL (see the sketch after this list for the DataStream API).
- poll.await.time.ms
The interval for pulling change events; the default is 1500ms. For frequently changing collections, the interval can be reduced to pick up changes more promptly; for slowly changing collections, it can be increased to reduce pressure on the database.
- poll.max.batch.size
The maximum number of change events pulled per batch; the default is 1000. Increasing this parameter speeds up pulling change events from the cursor, but increases memory overhead.
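For the DataStream API, a hedged sketch is shown below: the pollAwaitTimeMillis and pollMaxBatchSize setters are assumed to mirror the DDL options above, and their exact names may vary between connector versions, so check the MongoDBSource builder of the release you use. The connection details are placeholders.
import com.ververica.cdc.connectors.mongodb.MongoDBSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;

public class PollTuningExample {
    public static void main(String[] args) {
        MongoDBSource.<String>builder()
                .hosts("127.0.0.1:27017")
                .database("inventory")
                .collection("orders")
                .pollAwaitTimeMillis(500)  // assumed counterpart of 'poll.await.time.ms' (default 1500)
                .pollMaxBatchSize(2000)    // assumed counterpart of 'poll.max.batch.size' (default 1000)
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();
    }
}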
4.6 Subscribing to the entire database and cluster changes
database = "db", collection = "", you can subscribe to changes in the entire db database; database = "", collection = "", you can subscribe to changes in the entire cluster.
The DataStream API can use the pipeline to filter the db and collection that need to be subscribed. The filtering of the Snapshot collection is not supported at present.
MongoDBSource.<String>builder()
    .hosts("127.0.0.1:27017")
    .database("")
    .collection("")
    .pipeline("[{'$match': {'ns.db': {'$regex': '/^(sandbox|firewall)$/'}}}]")
    .deserializer(new JsonDebeziumDeserializationSchema())
    .build();
4.7 Permission Control
MongoDB supports fine-grained management of users, roles, and privileges. A user who opens a Change Stream needs both the find and changeStream privileges.
- A single collection
{ resource: { db: <dbname>, collection: <collection> }, actions: [ "find", "changeStream" ] }
- A single database
{ resource: { db: <dbname>, collection: "" }, actions: [ "find", "changeStream" ] }
- The entire cluster
{ resource: { db: "", collection: "" }, actions: [ "find", "changeStream" ] }
In production, it is recommended to create a dedicated Flink user and role and grant the role fine-grained privileges. Note that MongoDB allows users and roles to be created under any database; if the user is not created under admin, authSource = <the database where the user was created> must be specified in the connection parameters.
use admin;
// Create the user
db.createUser(
    {
        user: "flink",
        pwd: "flinkpw",
        roles: []
    }
);
// Create the role
db.createRole(
    {
        role: "flink_role",
        privileges: [
            { resource: { db: "inventory", collection: "products" }, actions: [ "find", "changeStream" ] }
        ],
        roles: []
    }
);
// Grant the role to the user
db.grantRolesToUser(
    "flink",
    [
        // Note: the db here refers to the db where the role was created; a role created under admin
        // can include access privileges for different databases
        { role: "flink_role", db: "admin" }
    ]
);
// Append privileges to the role
db.grantPrivilegesToRole(
    "flink_role",
    [
        { resource: { db: "inventory", collection: "orders" }, actions: [ "find", "changeStream" ] }
    ]
);
In development and test environments, the read and readAnyDatabase roles can be granted to the Flink user to enable Change Streams on any collection.
use admin;
db.createUser({
    user: "flink",
    pwd: "flinkpw",
    roles: [
        { role: "read", db: "admin" },
        { role: "readAnyDatabase", db: "admin" }
    ]
});
5. Follow-up planning
- Incremental Snapshot support
Currently, the MongoDB CDC Connector does not support incremental snapshots, so Flink's parallelism cannot be exploited for collections with large data volumes. We plan to implement incremental snapshots for MongoDB so that the snapshot phase supports checkpointing and a configurable degree of parallelism.
- Support for change subscriptions from a specified time
Currently, the MongoDB CDC Connector only supports subscribing to Change Streams starting from the current time; subscription from a specified point in time will be provided in the future.
- Filtering of databases and collections
At present, the MongoDB CDC Connector supports change subscription and filtering at the cluster and database level, but does not yet support filtering the collections to snapshot. This will be improved in the future.
References
[1] Duality of Streams and Tables
[2] FLIP-95: New TableSource and TableSink interfaces
[3] FLIP-105: Support to Interpret Changelog in Flink SQL (Introducing Debezium and Canal Format)
[4] FLIP-149: Introduce the upsert-kafka Connector
[5] Apache Flink 1.11.0 Release Announcement
[6] Introduction to SQL in Flink 1.11
[7] MongoDB Manual
[8] MongoDB Connection String Options