
Author: Luo Yuxia

This tutorial will show how to use Flink CDC to build a real-time data lake and handle the scenario of merging and synchronizing sharded databases and tables (sub-database and sub-table).
Flink-CDC project address:

https://github.com/ververica/flink-cdc-connectors

Flink Chinese learning website
https://flink-learning.org.cn

In OLTP systems, to cope with the large volume of data in a single table, a large table is usually split across multiple databases and tables (database and table sharding) to improve system throughput.

However, to facilitate data analysis, the sharded tables usually need to be merged back into one large table when they are synchronized to the data warehouse or data lake.

This tutorial will show how to use Flink CDC to build a real-time data lake for this scenario. The demo in this tutorial is based on Docker and uses only SQL: no Java/Scala code and no IDE installation are required, so you can complete the whole tutorial on your own computer.

Next, we will take synchronizing data from MySQL to Iceberg [1] as an example to show the whole process. The architecture diagram is as follows:

real-time-data-lake-tutorial

1. Preparation stage

Prepare a Linux or macOS computer with Docker installed.

1.1 Prepare the components needed for the tutorial

The components required in this tutorial are prepared with docker-compose. Create a docker-compose.yml file with the following content:

version: '2.1'
services:
  sql-client:
    user: flink:flink
    image: yuxialuo/flink-sql-client:1.13.2.v1 
    depends_on:
      - jobmanager
      - mysql
    environment:
      FLINK_JOBMANAGER_HOST: jobmanager
      MYSQL_HOST: mysql
    volumes:
      - shared-tmpfs:/tmp/iceberg
  jobmanager:
    user: flink:flink
    image: flink:1.13.2-scala_2.11
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - shared-tmpfs:/tmp/iceberg
  taskmanager:
    user: flink:flink
    image: flink:1.13.2-scala_2.11
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
    volumes:
      - shared-tmpfs:/tmp/iceberg
  mysql:
    image: debezium/example-mysql:1.1
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=123456
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw

volumes:
  shared-tmpfs:
    driver: local
    driver_opts:
      type: "tmpfs"
      device: "tmpfs"

The containers included in this Docker Compose are:

  • SQL-Client : Flink SQL Client, used to submit SQL queries and view SQL execution results;
  • Flink Cluster : Contains Flink JobManager and Flink TaskManager, used to execute Flink SQL;
  • MySQL : As the data source of the sharded databases and tables, it stores the user tables used in this tutorial.

Execute the following command in the directory where docker-compose.yml is located to start the components required for this tutorial:

docker-compose up -d

This command automatically starts all the containers defined in the Docker Compose configuration in detached mode. You can run docker ps to check whether these containers are running normally, or visit http://localhost:8081/ to check whether Flink is running normally.

flink-ui

Note:

  1. All container-related commands used in this tutorial need to be executed in the directory where docker-compose.yml is located.
  2. To simplify the tutorial, the jar packages needed have already been packaged into the SQL-Client container; the image build script can be found on GitHub [2].

If you want to run this tutorial in your own Flink environment, you need to download the packages listed below and put them in FLINK_HOME/lib/ , the lib directory of your Flink installation.

As of now, an iceberg-flink-runtime jar package that supports Flink 1.13 has not yet been released, so we provide one built from the master branch of Iceberg.

When Iceberg 0.13.0 is released, you will also be able to download the iceberg-flink-runtime jar package from the Apache official repository [3].

1.2 Prepare data

  1. Enter the MySQL container:

    docker-compose exec mysql mysql -uroot -p123456
  2. Create databases and tables, and insert data.

Create two databases, and create two tables in each database, as the sharded user tables.

 CREATE DATABASE db_1;
 USE db_1;
 CREATE TABLE user_1 (
   id INTEGER NOT NULL PRIMARY KEY,
   name VARCHAR(255) NOT NULL DEFAULT 'flink',
   address VARCHAR(1024),
   phone_number VARCHAR(512),
   email VARCHAR(255)
 );
 INSERT INTO user_1 VALUES (110,"user_110","Shanghai","123567891234","user_110@foo.com");

 CREATE TABLE user_2 (
   id INTEGER NOT NULL PRIMARY KEY,
   name VARCHAR(255) NOT NULL DEFAULT 'flink',
   address VARCHAR(1024),
   phone_number VARCHAR(512),
   email VARCHAR(255)
 );
INSERT INTO user_2 VALUES (120,"user_120","Shanghai","123567891234","user_120@foo.com");
CREATE DATABASE db_2;
USE db_2;
CREATE TABLE user_1 (
  id INTEGER NOT NULL PRIMARY KEY,
  name VARCHAR(255) NOT NULL DEFAULT 'flink',
  address VARCHAR(1024),
  phone_number VARCHAR(512),
  email VARCHAR(255)
);
INSERT INTO user_1 VALUES (110,"user_110","Shanghai","123567891234", NULL);

CREATE TABLE user_2 (
  id INTEGER NOT NULL PRIMARY KEY,
  name VARCHAR(255) NOT NULL DEFAULT 'flink',
  address VARCHAR(1024),
  phone_number VARCHAR(512),
  email VARCHAR(255)
);
INSERT INTO user_2 VALUES (220,"user_220","Shanghai","123567891234","user_220@foo.com");
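
Optionally, you can run a quick sanity check in the MySQL CLI to confirm that each of the four sharded tables contains its seed row (a small verification step, not required by the rest of the tutorial):

-- MySQL
SELECT * FROM db_1.user_1;
SELECT * FROM db_1.user_2;
SELECT * FROM db_2.user_1;
SELECT * FROM db_2.user_2;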

2. Use Flink DDL to create tables in Flink SQL CLI

First, use the following command to enter the Flink SQL CLI container:

docker-compose exec sql-client ./sql-client

We should then see the welcome interface of the Flink SQL CLI.

Then, proceed as follows:

  1. Turn on checkpoint

Checkpointing is not enabled by default, and we need to enable it so that Iceberg can commit transactions.
In addition, mysql-cdc waits for a complete checkpoint before starting to read the binlog, to avoid out-of-order binlog records.

-- Flink SQL
-- Checkpoint every 3 seconds
Flink SQL> SET execution.checkpointing.interval = 3s;
  2. Create the source table for the sharded MySQL tables

Create a source table user_source to capture the data of the user tables in MySQL, and use regular expressions in the database-name and table-name options to match these tables.
The user_source table also defines metadata columns to distinguish which database and table each record comes from.

-- Flink SQL
Flink SQL> CREATE TABLE user_source (
    database_name STRING METADATA VIRTUAL,
    table_name STRING METADATA VIRTUAL,
    `id` DECIMAL(20, 0) NOT NULL,
    name STRING,
    address STRING,
    phone_number STRING,
    email STRING,
    PRIMARY KEY (`id`) NOT ENFORCED
  ) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql',
    'port' = '3306',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'db_[0-9]+',
    'table-name' = 'user_[0-9]+'
  );
  3. Create the Iceberg sink table

Create sink table all_users_sink to load data into Iceberg.
In this sink table, considering that the id fields of different MySQL tables may overlap, we define a composite primary key ( database_name , table_name , id ).

-- Flink SQL
Flink SQL> CREATE TABLE all_users_sink (
    database_name STRING,
    table_name    STRING,
    `id`          DECIMAL(20, 0) NOT NULL,
    name          STRING,
    address       STRING,
    phone_number  STRING,
    email         STRING,
    PRIMARY KEY (database_name, table_name, `id`) NOT ENFORCED
  ) WITH (
    'connector'='iceberg',
    'catalog-name'='iceberg_catalog',
    'catalog-type'='hadoop',  
    'warehouse'='file:///tmp/iceberg/warehouse',
    'format-version'='2'
  );
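
As an optional quick check (not part of the original steps), you can confirm that both tables are now registered in the current session and inspect the sink schema:

-- Flink SQL
Flink SQL> SHOW TABLES;
Flink SQL> DESCRIBE all_users_sink;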

3. Streaming write to Iceberg

  1. Use the following Flink SQL statement to write data from MySQL to Iceberg:

    -- Flink SQL
    Flink SQL> INSERT INTO all_users_sink select * from user_source;

The above command starts a streaming job that continuously synchronizes the full and incremental data from the MySQL databases to Iceberg.
You can see this running job in the Flink UI [4].

flink-cdc-iceberg-running-job

Then we can use the following command to see the files written in Iceberg:

docker-compose exec sql-client tree /tmp/iceberg/warehouse/default_database/

As follows:

files-in-iceberg

In your running environment, the actual files may be different from the screenshots above, but the overall directory structure should be similar.

  2. Use the following Flink SQL statement to query the data in all_users_sink :

    -- Flink SQL
    Flink SQL> SELECT * FROM all_users_sink;

In Flink SQL CLI, we can see the following query results:

data_in_iceberg

Modify the data in the tables in MySQL, and the data in the all_users_sink table in Iceberg will be updated in real time accordingly:

(3.1) Insert a new row into db_1.user_1

--- db_1
INSERT INTO db_1.user_1 VALUES (111,"user_111","Shanghai","123567891234","user_111@foo.com");

(3.2) Update the data of db_1.user_2

--- db_1
UPDATE db_1.user_2 SET address='Beijing' WHERE id=120;

(3.3) Delete a row from db_2.user_2

--- db_2
DELETE FROM db_2.user_2 WHERE id=220;

After each step, we can run SELECT * FROM all_users_sink in the Flink SQL CLI to query the all_users_sink table and see the changes in the data.

The final query results are as follows:

final-data-in-iceberg

From the latest results in Iceberg, you can see that a new record (db_1, user_1, 111) has been added, the address of (db_1, user_2, 120) has been updated to Beijing , and the record (db_2, user_2, 220) has been deleted. This is entirely consistent with the updates we made in MySQL.
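
If you want to cross-check this against the source, a union query like the following in the MySQL CLI (our own verification sketch, not part of the original tutorial) should return the same set of rows as the Iceberg table:

-- MySQL
SELECT 'db_1' AS database_name, 'user_1' AS table_name, id, name, address, phone_number, email FROM db_1.user_1
UNION ALL
SELECT 'db_1', 'user_2', id, name, address, phone_number, email FROM db_1.user_2
UNION ALL
SELECT 'db_2', 'user_1', id, name, address, phone_number, email FROM db_2.user_1
UNION ALL
SELECT 'db_2', 'user_2', id, name, address, phone_number, email FROM db_2.user_2;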

4. Environment cleanup

After finishing this tutorial, execute the following command in the directory where docker-compose.yml is located to stop all containers:

docker-compose down

5. Summary

In this article, we showed how to use Flink CDC to synchronize the data of sharded MySQL databases and tables, and quickly build an Iceberg real-time data lake. Users can also synchronize data from other databases (Postgres/Oracle) to data lakes such as Hudi. Finally, we hope this article helps readers get started with Flink CDC quickly.

For more technical questions about Flink CDC, you can scan the QR code to join the community DingTalk group~


Note:

[1] https://iceberg.apache.org/

[2] https://github.com/luoyuxia/flink-cdc-tutorial/tree/main/flink-cdc-iceberg-demo/sql-client

[3] https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime/


Flink Forward Asia 2021

On January 8-9, 2022, FFA 2021 will kick off, featuring 40+ first-tier companies from multiple industries around the world and 80+ in-depth technical sessions, bringing a technical feast exclusively for developers.

Conference official website:
https://flink-forward.org.cn

Conference online viewing address (remember to make an appointment):
https://developer.aliyun.com/special/ffa2021/live


For more Flink related technical questions, you can scan the QR code to join the community DingTalk group.
To receive the latest technical articles and community news as soon as they are published, please follow the official WeChat account~


