This article demonstrates, with examples, how to use Flink CDC together with the Doris Flink Connector to capture changes from a MySQL database and write them in real time to the corresponding tables in the Doris data warehouse. The main contents include:
- What is CDC
- Flink CDC
- What is Flink Doris Connector
- Usage example
1. What is CDC
CDC (Change Data Capture) is a technique for synchronizing the incremental change records of a source database (Source) to one or more data destinations (Sink). During synchronization, the data can also be processed, for example aggregated (GROUP BY) or joined across tables (JOIN).
For example, on an e-commerce platform, user orders are written to a source database in real time. Department A needs the real-time data aggregated every minute and saved to Redis for queries; Department B needs a copy of the day's data in Elasticsearch for report display; Department C needs another copy in ClickHouse for its real-time data warehouse. As time goes by, departments D and E will have similar data analysis requirements. In this scenario, the traditional approach of copying and distributing multiple copies is very inflexible, whereas CDC can capture a change record once and process and deliver it in real time to multiple destinations.
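As a minimal Flink SQL sketch of the aggregation case (the orders_cdc table and its columns are hypothetical), a continuous GROUP BY over a CDC-backed table keeps an up-to-date aggregate that a downstream sink such as Redis could consume:
-- orders_cdc is assumed to be declared with a CDC connector;
-- the aggregate emits updated results as change records arrive
SELECT user_id, SUM(amount) AS total_amount
FROM orders_cdc
GROUP BY user_id;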
Application scenarios of CDC
- Data synchronization: used for backup and disaster recovery;
- Data distribution: one data source distributed to multiple downstream systems;
- Data collection: a very important data source for ETL integration into data warehouses and data lakes.
There are many technical solutions for CDC, and the current mainstream implementation mechanisms in the industry can be divided into two types:
Query-based CDC (a sketch of the polling pattern follows this list):
- Offline scheduled query jobs with batch processing: a table is synchronized to other systems by querying for the latest data on each run;
- Data consistency cannot be guaranteed, since the data may have changed multiple times between two queries;
- Real-time performance cannot be guaranteed, since offline scheduling has inherent latency.
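A typical query-based implementation (a hypothetical sketch, assuming the source table carries an update_time column) polls the table on a schedule and pulls the rows changed since the last run:
-- executed periodically by an external scheduler; the timestamp is
-- the watermark saved from the previous run
SELECT id, name, update_time
FROM orders
WHERE update_time > '2021-11-11 00:00:00';
-- a row updated several times between two runs is seen only in its
-- final state, which is why consistency cannot be guaranteed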
Log-based CDC (see the binlog inspection example after this list):
- Consumes the log in real time with stream processing. For example, MySQL's binlog completely records the changes to the database, so the binlog file can be used as the data source of a stream;
- Guarantees data consistency, because the binlog contains the details of all historical changes;
- Guarantees real-time performance, because log files like the binlog can be consumed as a stream and provide real-time data.
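To get a feel for what log-based CDC consumes, you can inspect the MySQL binlog events directly (a quick sketch; the binlog file name depends on your server):
# mysqlbinlog --base64-output=decode-rows -v mysql_bin.000001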
2. Flink CDC
Flink added CDC support, short for Change Data Capture, in version 1.11. To see what this brings, let's start from the previous data architecture.
The figure above shows the previous MySQL binlog processing flow: for example, Canal listens to the binlog and writes the log entries to Kafka, and Apache Flink consumes the Kafka data in real time to synchronize the MySQL data or perform other processing. Broken down, the overall flow consists of the following stages:
- MySQL enables binlog;
- Canal synchronizes the binlog data to Kafka;
- Flink reads the binlog data from Kafka for the related business processing.
That overall processing link is long and requires many components. Apache Flink CDC instead obtains the binlog directly from the database for downstream business computation and analysis, as the sketch below illustrates.
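The contrast is easy to see in Flink SQL (a hedged sketch; all connection parameters and names are placeholders). The old link consumes Canal-format changelogs from Kafka, while Flink CDC connects straight to MySQL:
-- old link: Canal writes binlog changes to Kafka, Flink consumes them
CREATE TABLE orders_via_kafka (
  id INT,
  name STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'canal-orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'canal-json'
);

-- new link: Flink CDC reads the binlog directly from MySQL
CREATE TABLE orders_via_cdc (
  id INT,
  name STRING
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'root',
  'password' = 'password',
  'database-name' = 'demo',
  'table-name' = 'orders'
);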
Flink Connector MySQL CDC 2.0 features
MySQL CDC 2.0 provides the following core features:
- Concurrent reading: the read performance of the full snapshot stage can be scaled horizontally;
- Lock-free: the whole process carries no risk of locking the online business;
- Resumable: checkpoints are supported throughout the full read stage.
Test reports available online describe a test using the customer table of the TPC-DS data set, with Flink 1.13.1, a customer table of 65 million rows, and a Source parallelism of 8. In the full read stage:
- MySQL CDC 2.0 takes 13 minutes;
- MySQL CDC 1.4 takes 89 minutes;
- Reading performance improved 6.8 times.
3. What is Flink Doris Connector
Flink Doris Connector is an extension provided by the Doris community to make it convenient to read and write Doris data tables with Flink. Currently, Doris supports Flink 1.11.x, 1.12.x, and 1.13.x, with Scala version 2.12.x.
At present, the Flink Doris Connector controls writing into Doris through two parameters:
- sink.batch.size: how many rows are written per batch; the default is 100;
- sink.batch.interval: the interval, in seconds, at which writes are triggered; the default is 1 second.
The two parameters take effect together: whichever condition is met first triggers the write to the Doris table, as sketched below.
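For illustration, a sink table that flushes every 1000 rows or every 10 seconds, whichever comes first (a sketch only; the fenodes address and table identifier are placeholders, and a full working example follows in section 4.6.2):
CREATE TABLE doris_sink (
  id INT,
  name STRING
) WITH (
  'connector' = 'doris',
  'fenodes' = 'FE_HOST:8030',       -- Doris FE HTTP address
  'table.identifier' = 'db.table',  -- target database.table in Doris
  'sink.batch.size' = '1000',       -- flush after 1000 rows ...
  'sink.batch.interval' = '10',     -- ... or after 10 seconds
  'username' = 'root',
  'password' = ''
);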
Note:
You need to enable the HTTP v2 API here, specifically by configuring enable_http_server_v2=true. Also, because the BE list is obtained through the FE HTTP REST API, the user configured for the connector needs admin permission.
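A minimal sketch of the corresponding FE configuration (the location of fe.conf depends on your deployment):
# conf/fe.conf: enable the v2 HTTP server that serves the REST API
enable_http_server_v2 = true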
4. Usage example
4.1 Flink Doris Connector compilation
First, we need to compile Doris's Flink connector. Alternatively, it can be downloaded from the following address:
https://github.com/hf200012/hf200012.github.io/raw/main/lib/doris-flink-1.0-SNAPSHOT.jar
Notice:
Because Doris's Flink Connector is developed against Scala 2.12.x, please choose a Flink distribution built for Scala 2.12. If you download the jar from the address above, you can skip the compilation steps below.
Compile inside Doris's Docker build image apache/incubator-doris:build-env-1.2; from build-env-1.3 onwards the image ships JDK 11, which causes compilation problems.
Execute the following in the extension/flink-doris-connector/ source directory:
sh build.sh
After the compilation succeeds, the file doris-flink-1.0.0-SNAPSHOT.jar is generated in the output/ directory. Copy this file to the ClassPath of Flink to use Flink-Doris-Connector. For example, for Flink running in Local mode, put this file into the lib/ folder; for Flink running in Yarn cluster mode, put this file into the pre-deployment package.
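For the Local mode case, copying the jar is a one-liner (a sketch, assuming FLINK_HOME points to your Flink installation):
# cp output/doris-flink-1.0.0-SNAPSHOT.jar $FLINK_HOME/lib/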
Adapting the build to Flink 1.13.x
<properties>
<scala.version>2.12</scala.version>
<flink.version>1.11.2</flink.version>
<libthrift.version>0.9.3</libthrift.version>
<arrow.version>0.15.1</arrow.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<doris.home>${basedir}/../../</doris.home>
<doris.thirdparty>${basedir}/../../thirdparty</doris.thirdparty>
</properties>
Just change the flink.version here to match your Flink cluster version, and compile again.
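For example, for the Flink 1.13.3 cluster used in this demonstration, the property would read:
<flink.version>1.13.3</flink.version>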
4.2 Configure Flink
Here we use the Flink SQL Client to operate.
The software versions used in this demonstration:
- MySQL 8.x
- Apache Flink: 1.13.3
- Apache Doris: 0.14.13.1
4.2.1 Install Flink
First download and install Flink:
https://dlcdn.apache.org/flink/flink-1.13.3/flink-1.13.3-bin-scala_2.12.tgz
The demonstration here uses the local stand-alone mode:
# wget https://dlcdn.apache.org/flink/flink-1.13.3/flink-1.13.3-bin-scala_2.12.tgz
# tar zxvf flink-1.13.3-bin-scala_2.12.tgz
Download the Flink CDC related jar packages, paying attention to the version correspondence between Flink CDC and Flink (an example download follows this list):
- Copy the Flink Doris Connector jar package downloaded or compiled above into the lib directory under the Flink root directory;
- Copy the Flink CDC jar package into the lib directory under the Flink root directory as well.
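The MySQL CDC connector jar can be fetched from Maven Central. A hedged example for a Flink 1.13 cluster (Flink CDC 2.0.x is built against Flink 1.13.x; adjust the version to your own setup):
# wget https://repo1.maven.org/maven2/com/ververica/flink-sql-connector-mysql-cdc/2.0.2/flink-sql-connector-mysql-cdc-2.0.2.jar -P lib/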
4.2.2 Start Flink
Here we are using the local stand-alone mode.
# bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host doris01.
Starting taskexecutor daemon on host doris01.
After starting the Flink cluster, open the web UI (default port 8081); you can see that the cluster has started normally.
4.3 Install Apache Doris
For the specific installation and deployment method of Doris, refer to the following link:
https://hf200012.github.io/2021/09/Apache-Doris-environment installation and deployment.
4.4 Install and configure MySQL
Install MySQL; you can quickly install and configure it using Docker.
Enable the MySQL binlog: enter the Docker container, modify the /etc/my.cnf file, and add the following content under [mysqld]:
log_bin=mysql_bin
binlog-format=Row
server-id=1
Then restart MySQL:
systemctl restart mysqld
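You can verify from a MySQL client that the binlog is now enabled (a quick check, not part of the original steps):
-- should return ON and ROW respectively
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';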
- Create a MySQL database table:
CREATE TABLE `test_cdc` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
4.5 Create a Doris table
CREATE TABLE `doris_test` (
`id` int NULL COMMENT "",
`name` varchar(100) NULL COMMENT ""
) ENGINE=OLAP
UNIQUE KEY(`id`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "V2"
);
4.6 Start Flink Sql Client
./bin/sql-client.sh embedded
> set execution.result-mode=tableau;
4.6.1 Create Flink CDC Mysql Mapping Table
CREATE TABLE test_flink_cdc (
id INT,
name STRING,
primary key(id) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'localhost',
'port' = '3306',
'username' = 'root',
'password' = 'password',
'database-name' = 'demo',
'table-name' = 'test_cdc'
);
Query the newly created MySQL mapping table; it should display normally:
select * from test_flink_cdc;
4.6.2 Create Flink Doris Table Mapping Table
Use the Doris Flink Connector to create the Doris mapping table:
CREATE TABLE doris_test_sink (
id INT,
name STRING
)
WITH (
'connector' = 'doris',
'fenodes' = 'localhost:8030',
'table.identifier' = 'db_audit.doris_test',
'sink.batch.size' = '2',
'sink.batch.interval'='1',
'username' = 'root',
'password' = ''
);
Execute the above statement on the command line; you can see that the table is created successfully. Then execute a query statement to verify that it works normally:
select * from doris_test_sink;
Perform the insert operation, which inserts the data from MySQL into Doris through Flink CDC combined with the Doris Flink Connector:
INSERT INTO doris_test_sink SELECT id, name FROM test_flink_cdc;
After the submission succeeds, we can see the related job information on the Flink web UI.
4.6.3 Insert data into Mysql table
INSERT INTO test_cdc VALUES (123, 'this is a update');
INSERT INTO test_cdc VALUES (1212, '测试flink CDC');
INSERT INTO test_cdc VALUES (1234, '这是测试');
INSERT INTO test_cdc VALUES (11233, 'zhangfeng_1');
INSERT INTO test_cdc VALUES (21233, 'zhangfeng_2');
INSERT INTO test_cdc VALUES (31233, 'zhangfeng_3');
INSERT INTO test_cdc VALUES (41233, 'zhangfeng_4');
INSERT INTO test_cdc VALUES (51233, 'zhangfeng_5');
INSERT INTO test_cdc VALUES (61233, 'zhangfeng_6');
INSERT INTO test_cdc VALUES (71233, 'zhangfeng_7');
INSERT INTO test_cdc VALUES (81233, 'zhangfeng_8');
INSERT INTO test_cdc VALUES (91233, 'zhangfeng_9');
4.6.4 Observe the data in the Doris table
First stop the INSERT INTO task. Because this demonstration runs in local stand-alone mode with only one task slot available, the running task must be stopped before anything else can run; then execute a query statement on the command line to see the data.
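Alternatively, you can check the target table directly from a Doris MySQL client (a quick check; the database name follows the table.identifier used above):
-- connect with e.g. mysql -h FE_HOST -P 9030 -uroot, then:
SELECT * FROM db_audit.doris_test;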
4.6.5 Modify Mysql data
Restart the Insert into task:
Modify the data in the Mysql table:
update test_cdc set name='这个是验证修改的操作' where id =123
If you look at the data in the Doris table again, you will find that it has been modified.
Note: modifications to the MySQL table are reflected in Doris only because the Doris table uses the Unique Key data model; the other data models (Aggregate Key and Duplicate Key) cannot update data in place.
4.6.6 Delete data operation
Currently, the Doris Flink Connector does not support the delete operation; there are plans to add it later.