Author: Luo Yuxia
This tutorial shows how to use Flink CDC to build a real-time data lake and handle the scenario of merging and synchronizing sharded databases and tables.
Flink CDC project address: https://github.com/ververica/flink-cdc-connectors
Flink Chinese learning website
https://flink-learning.org.cn
In OLTP systems, to cope with a single table growing too large, a big table is usually split across multiple databases and tables (sharding) to improve system throughput.
However, to make data analysis easier, the sharded tables usually need to be merged back into one big table when they are synchronized to the data warehouse or data lake.
This tutorial shows how to use Flink CDC to build a real-time data lake for this scenario. The demonstration is based on Docker and uses only SQL: not a single line of Java/Scala code and no IDE installation are needed, so you can easily complete the entire tutorial on your own computer.
We will take data synchronization from MySQL to Iceberg [1] as an example to show the whole process. The architecture diagram is as follows:
1. Preparation stage
Prepare a Linux or macOS computer with Docker installed.
1.1 Prepare the components needed for the tutorial
All components required for this tutorial are orchestrated with docker-compose. Create a docker-compose.yml file with the following content:
version: '2.1'
services:
  sql-client:
    user: flink:flink
    image: yuxialuo/flink-sql-client:1.13.2.v1
    depends_on:
      - jobmanager
      - mysql
    environment:
      FLINK_JOBMANAGER_HOST: jobmanager
      MYSQL_HOST: mysql
    volumes:
      - shared-tmpfs:/tmp/iceberg
  jobmanager:
    user: flink:flink
    image: flink:1.13.2-scala_2.11
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - shared-tmpfs:/tmp/iceberg
  taskmanager:
    user: flink:flink
    image: flink:1.13.2-scala_2.11
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
    volumes:
      - shared-tmpfs:/tmp/iceberg
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw

volumes:
  shared-tmpfs:
    driver: local
    driver_opts:
      type: "tmpfs"
      device: "tmpfs"
The containers included in this Docker Compose file are:
- SQL-Client: the Flink SQL Client, used to submit SQL queries and view SQL execution results;
- Flink Cluster: contains a Flink JobManager and a Flink TaskManager, used to execute Flink SQL;
- MySQL: the data source holding the sharded databases and tables; it stores the user table of this tutorial.
Run the following command in the directory where docker-compose.yml is located to start the components required for this tutorial:
docker-compose up -d
This command automatically starts all containers defined in the Docker Compose configuration in detached mode. You can run docker ps to check whether these containers started normally, or visit http://localhost:8081/ to check whether Flink is running normally.
Note:
- All container-related commands in this tutorial need to be executed in the directory where docker-compose.yml is located.
- To simplify the tutorial, the jar packages needed have already been packaged into the SQL-Client container; the image build script can be found on GitHub [2].
If you want to run this tutorial in your own Flink environment, you need to download the packages listed below and put them in FLINK_HOME/lib/, the lib directory of your Flink installation:
- flink-sql-connector-mysql-cdc-2.1.0.jar
- flink-shaded-hadoop-2-uber-2.7.5-10.0.jar
- iceberg-flink-1.13-runtime-0.13.0-SNAPSHOT.jar
As of this writing, an iceberg-flink-runtime jar that supports Flink 1.13 has not been released yet, so we provide an iceberg-flink-runtime jar built from the master branch of Iceberg. Once Iceberg 0.13.0 is released, you will also be able to download a Flink 1.13-compatible iceberg-flink-runtime jar from the official Apache repository [3].
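For example, if you have downloaded the three jars into the current directory and FLINK_HOME points to your Flink installation, copying them into place could look like this (a sketch for a self-managed setup; not needed for the Docker images above):
cp flink-sql-connector-mysql-cdc-2.1.0.jar \
   flink-shaded-hadoop-2-uber-2.7.5-10.0.jar \
   iceberg-flink-1.13-runtime-0.13.0-SNAPSHOT.jar \
   "$FLINK_HOME/lib/"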
1.2 Prepare data
Enter the MySQL container:
docker-compose exec mysql mysql -uroot -p123456
- Create databases and tables, and populate the data.
Create two different databases, and in each database create two tables as shards of the user table.
CREATE DATABASE db_1;
USE db_1;
CREATE TABLE user_1 (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL DEFAULT 'flink',
address VARCHAR(1024),
phone_number VARCHAR(512),
email VARCHAR(255)
);
INSERT INTO user_1 VALUES (110,"user_110","Shanghai","123567891234","user_110@foo.com");
CREATE TABLE user_2 (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL DEFAULT 'flink',
address VARCHAR(1024),
phone_number VARCHAR(512),
email VARCHAR(255)
);
INSERT INTO user_2 VALUES (120,"user_120","Shanghai","123567891234","user_120@foo.com");
CREATE DATABASE db_2;
USE db_2;
CREATE TABLE user_1 (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL DEFAULT 'flink',
address VARCHAR(1024),
phone_number VARCHAR(512),
email VARCHAR(255)
);
INSERT INTO user_1 VALUES (110,"user_110","Shanghai","123567891234", NULL);
CREATE TABLE user_2 (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL DEFAULT 'flink',
address VARCHAR(1024),
phone_number VARCHAR(512),
email VARCHAR(255)
);
INSERT INTO user_2 VALUES (220,"user_220","Shanghai","123567891234","user_220@foo.com");
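Optionally, before leaving the MySQL container, you can sanity-check that all four shard tables are populated:
-- MySQL
SELECT * FROM db_1.user_1;
SELECT * FROM db_1.user_2;
SELECT * FROM db_2.user_1;
SELECT * FROM db_2.user_2;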
2. Use Flink DDL to create tables in the Flink SQL CLI
First, use the following command to enter the Flink SQL CLI container:
docker-compose exec sql-client ./sql-client
We can see the following interface:
Then, proceed as follows:
- Enable checkpointing
Checkpointing is disabled by default, and we need to enable it because Iceberg can only commit transactions on checkpoints.
In addition, mysql-cdc waits for a complete checkpoint before starting the binlog reading phase, to avoid out-of-order binlog records.
-- Flink SQL
-- checkpoint every 3 seconds
Flink SQL> SET execution.checkpointing.interval = 3s;
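If you later run this in your own Flink cluster rather than the Docker setup above, the same interval can also be set globally in FLINK_HOME/conf/flink-conf.yaml (a minimal sketch; the 3-second interval is chosen here for demo responsiveness and would typically be larger in production):
execution.checkpointing.interval: 3s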
- Create the MySQL sharded source table
Create a source table user_source to capture the data of the user tables in MySQL, using regular expressions in the database-name and table-name options to match all the shards.
The user_source table also defines metadata columns to distinguish which database and table each record comes from.
-- Flink SQL
Flink SQL> CREATE TABLE user_source (
database_name STRING METADATA VIRTUAL,
table_name STRING METADATA VIRTUAL,
`id` DECIMAL(20, 0) NOT NULL,
name STRING,
address STRING,
phone_number STRING,
email STRING,
PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'mysql',
'port' = '3306',
'username' = 'root',
'password' = '123456',
'database-name' = 'db_[0-9]+',
'table-name' = 'user_[0-9]+'
);
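Optionally, you can preview what the source table captures before wiring up the sink (a quick check; press 'q' in the result view to quit and cancel the underlying query job so it does not occupy task slots):
-- Flink SQL
Flink SQL> SELECT database_name, table_name, `id`, name FROM user_source;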
- Create the Iceberg sink table
Create a sink table all_users_sink to load the data into Iceberg.
In this sink table, since the id fields of different MySQL tables may collide, we define a composite primary key (database_name, table_name, id).
-- Flink SQL
Flink SQL> CREATE TABLE all_users_sink (
database_name STRING,
table_name STRING,
`id` DECIMAL(20, 0) NOT NULL,
name STRING,
address STRING,
phone_number STRING,
email STRING,
PRIMARY KEY (database_name, table_name, `id`) NOT ENFORCED
) WITH (
'connector'='iceberg',
'catalog-name'='iceberg_catalog',
'catalog-type'='hadoop',
'warehouse'='file:///tmp/iceberg/warehouse',
'format-version'='2'
);
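The 'format-version'='2' option selects the Iceberg v2 table format, which supports row-level deletes and is therefore needed for CDC-style updates and deletes. You can check the created table with:
-- Flink SQL
Flink SQL> DESCRIBE all_users_sink;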
3. Streaming write to Iceberg
Use the following Flink SQL statement to write data from MySQL to Iceberg:
-- Flink SQL
Flink SQL> INSERT INTO all_users_sink SELECT * FROM user_source;
The above statement starts a streaming job that continuously synchronizes both full and incremental data from MySQL to Iceberg.
You can see this running job in the Flink UI [4].
Then we can use the following command to see the files written in Iceberg:
docker-compose exec sql-client tree /tmp/iceberg/warehouse/default_database/
As follows:
In your running environment, the actual files may be different from the screenshots above, but the overall directory structure should be similar.
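Roughly, the directory layout under the Hadoop catalog warehouse looks like this (the entries below are placeholders sketching the typical Iceberg layout):
/tmp/iceberg/warehouse/default_database/
└── all_users_sink
    ├── data        (Parquet data files written by the streaming job)
    └── metadata    (table metadata JSON files, manifest lists, and manifests)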
Use the following Flink SQL statement to query the data in all_users_sink:
-- Flink SQL
Flink SQL> SELECT * FROM all_users_sink;
In Flink SQL CLI, we can see the following query results:
Modify the data in the MySQL tables, and the all_users_sink table in Iceberg will be updated in real time accordingly:
(3.1) Insert a new row into db_1.user_1:
-- db_1
INSERT INTO db_1.user_1 VALUES (111,"user_111","Shanghai","123567891234","user_111@foo.com");
(3.2) Update a row in db_1.user_2:
-- db_1
UPDATE db_1.user_2 SET address='Beijing' WHERE id=120;
(3.3) Delete a row from db_2.user_2:
-- db_2
DELETE FROM db_2.user_2 WHERE id=220;
After each step, we can run SELECT * FROM all_users_sink in the Flink SQL CLI to observe how the data changes.
The final query result is as follows:
From the latest result in Iceberg, you can see that a new (db_1, user_1, 111) record was added, the address of (db_1, user_2, 120) was updated to Beijing, and the (db_2, user_2, 220) record was deleted. This is exactly consistent with the updates we made in MySQL.
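Concretely, given the initial data and the three changes above, the final content of all_users_sink should correspond to these four rows:
database_name | table_name | id  | name     | address  | phone_number | email
db_1          | user_1     | 110 | user_110 | Shanghai | 123567891234 | user_110@foo.com
db_1          | user_1     | 111 | user_111 | Shanghai | 123567891234 | user_111@foo.com
db_1          | user_2     | 120 | user_120 | Beijing  | 123567891234 | user_120@foo.com
db_2          | user_1     | 110 | user_110 | Shanghai | 123567891234 | NULL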
4. Environment cleanup
After finishing this tutorial, execute the following command in the directory where docker-compose.yml is located to stop all containers:
docker-compose down
5. Summary
In this article, we showed how to use Flink CDC to synchronize data from sharded MySQL databases and tables, and to quickly build an Iceberg real-time data lake. In the same way, users can synchronize data from other databases (PostgreSQL, Oracle) to data lakes such as Hudi. We hope this article helps readers get started with Flink CDC quickly.
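As an illustration of that point, a PostgreSQL source table would look very similar. The following is only a sketch: the connector options shown are assumptions based on the flink-cdc postgres-cdc connector and a hypothetical 'postgres' host; consult the connector documentation for the authoritative option list.
-- Flink SQL: hypothetical postgres-cdc variant of the source table
Flink SQL> CREATE TABLE pg_user_source (
    `id` DECIMAL(20, 0) NOT NULL,
    name STRING,
    address STRING,
    phone_number STRING,
    email STRING,
    PRIMARY KEY (`id`) NOT ENFORCED
  ) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = 'postgres',       -- hypothetical host name
    'port' = '5432',
    'username' = 'postgres',
    'password' = 'postgres',
    'database-name' = 'postgres',
    'schema-name' = 'public',
    'table-name' = 'user_1'
  );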
For more Flink CDC related technical questions, you can scan the QR code to join the community DingTalk group~
Note:
[1] https://iceberg.apache.org/
[2] https://github.com/luoyuxia/flink-cdc-tutorial/tree/main/flink-cdc-iceberg-demo/sql-client
[3] https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime/
[4] http://localhost:8081/#/job/running
Flink Forward Asia 2021
On January 8-9, 2022, FFA 2021 will kick off, bringing together 40+ first-tier companies from many industries around the world and 80+ in-depth technical sessions: a technical feast exclusively for developers.
Conference official website:
https://flink-forward.org.cn
Conference online viewing address (remember to register):
https://developer.aliyun.com/special/ffa2021/live