1、Resuming from a checkpoint
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MySQLCDC {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// Enable checkpointing
env.enableCheckpointing(5000);
env.getCheckpointConfig().setCheckpointTimeout(10000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.setStateBackend(new FsStateBackend("hdfs://xx.xx.xx.xx:8020/flink-cdc/ck"));
MySqlSource<String> sourceFunction = MySqlSource.<String>builder()
.hostname("xx.xx.xx.xx")
.port(3306)
.username("root")
.password("root@123")
.databaseList("journey")
.tableList("journey.employee")
.deserializer(new StringDebeziumDeserializationSchema())
.startupOptions(StartupOptions.initial())
.build();
DataStreamSource<String> dataStreamSource = env.fromSource(sourceFunction, WatermarkStrategy.noWatermarks(), "mysql cdc");
dataStreamSource.print();
env.execute();
}
}
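Note: FsStateBackend is deprecated on newer Flink versions (1.13+). A minimal sketch of the equivalent setup, assuming the same illustrative HDFS path as above, would be:
// Equivalent checkpoint setup on Flink 1.13+ (sketch): keep working state in a
// HashMapStateBackend (org.apache.flink.runtime.state.hashmap.HashMapStateBackend)
// and write checkpoint data to HDFS via checkpoint storage.
env.setStateBackend(new HashMapStateBackend());
env.getCheckpointConfig().setCheckpointStorage("hdfs://xx.xx.xx.xx:8020/flink-cdc/ck");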
1.1、Start the job
bin/flink run -d -t yarn-per-job -c example.MySQLCDC /opt/flinkcdc-data-sync-1.0-SNAPSHOT.jar
Because the startup option is StartupOptions.initial(), every record printed has op=r, i.e. it is data read during the initial snapshot.
1.2、Cancel the job
bin/flink cancel -t yarn-per-job -Dyarn.application.id=application_1694766249884_1113 123f715267f69fbf175d00dca8b0b37b
1.3、Restart and resume from the checkpoint
bin/flink run -d -t yarn-per-job -s hdfs://xx.xx.xx.xx:8020/flink-cdc/ck/96f0dc3d9b2717ae3c0a27799560d862/chk-62 -c example.MySQLCDC /opt/flinkcdc-data-sync-1.0-SNAPSHOT.jar
1.4、Incremental synchronization
delete from employee where id = 2;
In the output below, the data from the earlier initial snapshot no longer appears, because the job resumed from the checkpoint; only the incremental change is emitted:
SourceRecord{sourcePartition={server=mysql_binlog_source}, sourceOffset={transaction_id=null, ts_sec=1726815377, file=mysql-bin.000041, pos=8021341, row=1, server_id=109, event=3}} ConnectRecord{topic='mysql_binlog_source.journey.employee', kafkaPartition=null, key=Struct{id=2}, keySchema=Schema{mysql_binlog_source.journey.employee.Key:STRUCT}, value=Struct{before=Struct{id=2,name=Bob,department_id=2},source=Struct{version=1.6.4.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1726815377000,db=journey,table=employee,server_id=109,file=mysql-bin.000041,pos=8021497,row=0},op=d,ts_ms=1726815377916}, valueSchema=Schema{mysql_binlog_source.journey.employee.Envelope:STRUCT}, timestamp=null, headers=ConnectHeaders(headers=)}
Note: although the checkpoint path configured in the program is hdfs://xx.xx.xx.xx:8020/flink-cdc/ck, the -s argument must be the full path, e.g. hdfs://xx.xx.xx.xx:8020/flink-cdc/ck/96f0dc3d9b2717ae3c0a27799560d862/chk-62. Several checkpoints may exist under that directory; just point to the latest one.
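For example (assuming the standard HDFS CLI is available on the client), list the completed checkpoints and pick the chk-N directory with the highest N:
hdfs dfs -ls hdfs://xx.xx.xx.xx:8020/flink-cdc/ck/96f0dc3d9b2717ae3c0a27799560d862/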
You can also use a savepoint; a savepoint is essentially the same thing as a checkpoint, except that one is triggered manually and the other automatically.
1.5、Using a savepoint
Stop the job and take a savepoint
bin/flink stop --savepointPath hdfs://xx.xx.xx.xx:8020/sp 1f55d1e49d2c3c3e42e240074a310508 -yid application_1726648863980_0018
The command above prints something like the following:
....
2024-09-20 16:24:42,234 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface cclinux3:32983 of application 'application_1726648863980_0018'.
Savepoint completed. Path: hdfs://xx.xx.xx.xx:8020/sp/savepoint-1f55d1-fb7041cd00d9
....
Start the job again from the savepoint
bin/flink run -d -t yarn-per-job -s hdfs://xx.xx.xx.xx:8020/sp/savepoint-1f55d1-fb7041cd00d9 -c example.MySQLCDC /opt/flinkcdc-data-sync-1.0-SNAPSHOT.jar
List the jobs under the application ID
bin/flink list -t yarn-per-job -Dyarn.application.id=application_1726648863980_0018
2、Configuring one database with multiple tables
MySqlSource<String> sourceFunction = MySqlSource.<String>builder()
.hostname("xx.xx.xx.xx")
.port(3306)
.username("root")
.password("root@123")
.databaseList("journey")
.tableList(".*") // to capture every table in the database, pass ".*"; tableList cannot simply be omitted
.deserializer(new StringDebeziumDeserializationSchema())
.startupOptions(StartupOptions.initial())
.build();
DataStreamSource<String> dataStreamSource = env.fromSource(sourceFunction, WatermarkStrategy.noWatermarks(), "mysql cdc");
dataStreamSource.print();
env.execute();
Note: to capture every table in the database, you must pass ".*" to tableList; omitting tableList altogether does not work.
3、Startup modes
The configuration option scan.startup.mode specifies the startup mode of the MySQL CDC consumer. Valid values are:
- initial (default): perform an initial snapshot of the monitored tables on the first startup and then keep reading the latest binlog; in other words, initial snapshot + incremental binlog.
- earliest-offset: skip the snapshot phase and read from the earliest binlog position that is still available; incremental binlog only, no initial snapshot.
- latest-offset: never take a snapshot of the monitored tables on the first startup; the connector reads only from the end of the binlog, so it only sees changes made after it started; incremental binlog only, starting from the newest position.
- specific-offset: skip the snapshot phase and read from a specified binlog position, given either as a binlog file name plus position or, if GTID is enabled on the cluster, as a GTID set.
- timestamp: skip the snapshot phase and read binlog events starting from the given timestamp.
The corresponding DataStream API calls (each startupOptions(...) line below is an alternative; only one would be used at a time):
MySqlSource.<String>builder()
        .startupOptions(StartupOptions.initial())            // initial snapshot + incremental binlog
        .startupOptions(StartupOptions.earliest())           // start from the earliest binlog offset
        .startupOptions(StartupOptions.latest())             // start from the latest binlog offset
        .startupOptions(StartupOptions.specificOffset("mysql-bin.000003", 4L))  // start from a binlog file name and position
        .startupOptions(StartupOptions.specificOffset("24DA167-0C0C-11E8-8442-00059A3C7B00:1-19"))  // start from a GTID set
        .startupOptions(StartupOptions.timestamp(1667232000000L))  // start from a timestamp
        .startupOptions(StartupOptions.snapshot())            // read the snapshot only
        ...
        .build()
Note: StartupOptions.snapshot() is only available starting from version 3.1.
4、Tables without a primary key
Tables without a primary key are supported since version 2.4.0. To use one, scan.incremental.snapshot.chunk.key-column must be set, and only a single non-nullable column may be chosen.
When working with a table that has no primary key, keep the following two points in mind (see the sketch after this list):
1、When configuring scan.incremental.snapshot.chunk.key-column, prefer a column that is covered by an index, if the table has one, to speed up the select during the snapshot phase.
2、The processing semantics of a table without a primary key are determined by the behaviour of the column specified in scan.incremental.snapshot.chunk.key-column:
- If the chosen column is never updated, exactly-once semantics can be guaranteed.
- If the chosen column can be updated, only at-least-once semantics can be guaranteed; correctness can still be achieved by declaring a primary key on the downstream sink and relying on idempotent (upsert) writes.
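A minimal sketch of how the option might be set when using the Table/SQL mysql-cdc connector (the table products and the chunk key column order_id are hypothetical; only the option key itself comes from the text above):
CREATE TABLE products_src (
    order_id INT NOT NULL,
    name STRING,
    price DECIMAL(10, 2)
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'xx.xx.xx.xx',
    'port' = '3306',
    'username' = 'root',
    'password' = 'root@123',
    'database-name' = 'journey',
    'table-name' = 'products',
    -- the table has no primary key, so a non-null column is chosen as the chunk key
    'scan.incremental.snapshot.chunk.key-column' = 'order_id'
);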
5、Flink CDC / Flink version compatibility
6、Running multiple Flink CDC DataStream jobs on YARN
POM.xml configuration
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cestc.cdc</groupId>
<artifactId>flinkcdc-data-sync</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>flinkcdc-data-sync</name>
<url>http://maven.apache.org</url>
<properties>
<java.version>1.8</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<spring-boot.version>2.3.7.RELEASE</spring-boot.version>
<flink.version>1.15.2</flink.version>
<scala.binary.version>2.12</scala.binary.version>
<shade.name>shade</shade.name>
</properties>
<repositories>
<repository>
<id>scala-tools.org</id>
<name>Scala-Tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
<repository>
<id>spring</id>
<url>https://maven.aliyun.com/repository/spring</url>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-scala-bridge_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-common</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- flink-connector-kafka -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/connect-api -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>connect-api</artifactId>
<version>2.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-base</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<!-- flink-connector-jdbc -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<!-- mysql-cdc-->
<dependency>
<groupId>com.ververica</groupId>
<artifactId>flink-connector-mysql-cdc</artifactId>
<version>2.3.0</version>
</dependency>
<!-- mysql-->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.33</version>
<scope>provided</scope>
</dependency>
<!-- postgresql-->
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>42.2.5</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-json</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-csv</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<!-- SLF4J API -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.30</version>
</dependency>
<!-- Logback classic, which is the implementation of SLF4J -->
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.2.3</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.13.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!-- Compiler plugin: compiles the source code -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- Shade plugin: builds an uber JAR containing the dependencies -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.4</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers combine.children="append">
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer">
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
The key idea is that the packaged jar must not conflict with the jars under the Hadoop/HDFS lib directories on the cluster, otherwise the job will fail. Note: besides marking dependencies as provided, you can also shade (relocate) them under an alias package.
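For example, a hedged sketch of such a relocation inside the maven-shade-plugin <configuration> above (the com.fasterxml.jackson pattern is only illustrative, and it assumes the otherwise unused shade.name property was intended as the relocation prefix):
<relocations>
    <relocation>
        <!-- move a potentially conflicting dependency under an alias package -->
        <pattern>com.fasterxml.jackson</pattern>
        <shadedPattern>${shade.name}.com.fasterxml.jackson</shadedPattern>
    </relocation>
</relocations>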
Code:
package com.journey.cdc;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import org.apache.commons.lang3.StringUtils;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.connector.jdbc.catalog.MySqlCatalog;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.DefaultCatalogTable;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.types.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class FlinkCdcMultiSyncMySQLCdc2MySQL {
private static final Logger log = LoggerFactory.getLogger(FlinkCdcMultiSyncMySQLCdc2MySQL.class);
public static void main(String[] args) throws Exception {
// Source connection info
String userName = "root";
String passWord = "xxx";
String host = "xx.xx.30.77";
String db = "journey";
// For the whole database, use tableList = ".*"
String tableList = "journey.employee,journey.department";
int port = 3306;
// Sink connection info template
String sink_url = "jdbc:mysql://xx.xx.30.77:3306/dest?useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai";
String sink_username = "root";
String sink_password = "xx";
String connectorWithBody =
" with (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = '${sink_url}',\n" +
" 'username' = '${sink_username}',\n" +
" 'password' = '${sink_password}',\n" +
" 'table-name' = '${tableName}'\n" +
")";
connectorWithBody = connectorWithBody.replace("${sink_url}", sink_url)
.replace("${sink_username}", sink_username)
.replace("${sink_password}", sink_password);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(3000);
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
// Register a catalog for the database being synchronized
MySqlCatalog mysqlCatalog = new MySqlCatalog("mysql-catalog", db, userName, passWord, String.format("jdbc:mysql://%s:%d", host, port));
List<String> tables = new ArrayList<>();
// For whole-database sync, list all tables from the catalog; otherwise take the table names from the configured list
if (".*".equals(tableList)) {
tables = mysqlCatalog.listTables(db);
} else {
String[] tableArray = tableList.split(",");
for (String table : tableArray) {
tables.add(table.split("\\.")[1]);
}
}
// Maps from table name to its RowTypeInfo / DataTypes / RowType
Map<String, RowTypeInfo> tableTypeInformationMap = new HashMap<>();
Map<String, DataType[]> tableDataTypesMap = new HashMap<>();
Map<String, RowType> tableRowTypeMap = new HashMap<>();
for (String table : tables) {
// Look up the table registered in the MySQL catalog
ObjectPath objectPath = new ObjectPath(db, table);
DefaultCatalogTable catalogBaseTable = (DefaultCatalogTable) mysqlCatalog.getTable(objectPath);
// Get the table's schema
Schema schema = catalogBaseTable.getUnresolvedSchema();
// Field names of the table
String[] fieldNames = new String[schema.getColumns().size()];
// Field DataTypes
DataType[] fieldDataTypes = new DataType[schema.getColumns().size()];
LogicalType[] logicalTypes = new LogicalType[schema.getColumns().size()];
// Field TypeInformation
TypeInformation<?>[] fieldTypes = new TypeInformation[schema.getColumns().size()];
// Primary key columns of the table
List<String> primaryKeys = schema.getPrimaryKey().get().getColumnNames();
for (int i = 0; i < schema.getColumns().size(); i++) {
Schema.UnresolvedPhysicalColumn column = (Schema.UnresolvedPhysicalColumn) schema.getColumns().get(i);
fieldNames[i] = column.getName();
fieldDataTypes[i] = (DataType) column.getDataType();
fieldTypes[i] = InternalTypeInfo.of(((DataType) column.getDataType()).getLogicalType());
logicalTypes[i] = ((DataType) column.getDataType()).getLogicalType();
}
RowType rowType = RowType.of(logicalTypes, fieldNames);
tableRowTypeMap.put(table, rowType);
// Assemble the DDL for the sink table
StringBuilder stmt = new StringBuilder();
String tableName = table;
String jdbcSinkTableName = String.format("jdbc_sink_%s", tableName);
stmt.append("create table ").append(jdbcSinkTableName).append("(\n");
for (int i = 0; i < fieldNames.length; i++) {
String column = fieldNames[i];
String fieldDataType = fieldDataTypes[i].toString();
stmt.append("\t").append(column).append(" ").append(fieldDataType).append(",\n");
}
stmt.append(String.format("PRIMARY KEY (%s) NOT ENFORCED\n)", StringUtils.join(primaryKeys, ",")));
String formatJdbcSinkWithBody = connectorWithBody
.replace("${tableName}", jdbcSinkTableName);
String createSinkTableDdl = stmt.toString() + formatJdbcSinkWithBody;
// Create the sink table
log.info("createSinkTableDdl: {}", createSinkTableDdl);
tEnv.executeSql(createSinkTableDdl);
tableDataTypesMap.put(tableName, fieldDataTypes);
tableTypeInformationMap.put(tableName, new RowTypeInfo(fieldTypes, fieldNames));
}
// Monitor the MySQL binlog
MySqlSource<Tuple2<String, Row>> mySqlSource = MySqlSource.<Tuple2<String, Row>>builder()
.hostname(host)
.port(port)
.databaseList(db)
.tableList(tableList)
.username(userName)
.password(passWord)
.deserializer(new CustomDebeziumDeserializer(tableRowTypeMap))
.startupOptions(StartupOptions.initial())
.build();
SingleOutputStreamOperator<Tuple2<String, Row>> dataStreamSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql cdc").disableChaining();
StatementSet statementSet = tEnv.createStatementSet();
// Convert each table's DataStream to a Table, create a temporary view, and insert into the sink table
for (Map.Entry<String, RowTypeInfo> entry : tableTypeInformationMap.entrySet()) {
String tableName = entry.getKey();
RowTypeInfo rowTypeInfo = entry.getValue();
SingleOutputStreamOperator<Row> mapStream = dataStreamSource.filter(data -> data.f0.equals(tableName)).map(data -> data.f1, rowTypeInfo);
Table table = tEnv.fromChangelogStream(mapStream);
String temporaryViewName = String.format("t_%s", tableName);
tEnv.createTemporaryView(temporaryViewName, table);
String sinkTableName = String.format("jdbc_sink_%s", tableName);
String insertSql = String.format("insert into %s select * from %s", sinkTableName, temporaryViewName);
log.info("add insertSql for {},sql: {}", tableName, insertSql);
statementSet.addInsertSql(insertSql);
}
statementSet.execute();
}
}
The program above uses a StatementSet. The plan ultimately splits into two streams, yet they run as a single Job, because there is only one source (the same MySQL CDC source); under the hood this still amounts to a single env.executeAsync(...) submission.
If there were two separate sources, there would be two Jobs.
The difference is simply how many times executeAsync is submitted: the StatementSet program submits once, so there is one Job, whereas the MultiJobExecuteAsync program submits twice, so there are two Jobs (a sketch of that variant follows below).
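For reference, a minimal, hedged sketch of the two-source / two-job variant (the class name MultiJobExecuteAsync matches the name mentioned above, but the pipeline bodies and job names are illustrative; imports are the same as in the MySQLCDC example in section 1):
public class MultiJobExecuteAsync {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000);

        // First pipeline: its own MySQL CDC source and sink
        MySqlSource<String> sourceA = MySqlSource.<String>builder()
                .hostname("xx.xx.30.77").port(3306)
                .username("root").password("xxx")
                .databaseList("journey").tableList("journey.employee")
                .deserializer(new StringDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())
                .build();
        env.fromSource(sourceA, WatermarkStrategy.noWatermarks(), "cdc employee").print();
        // First submission -> first Job
        env.executeAsync("job1");

        // Second pipeline: a separate source, built after the first submission
        MySqlSource<String> sourceB = MySqlSource.<String>builder()
                .hostname("xx.xx.30.77").port(3306)
                .username("root").password("xxx")
                .databaseList("journey").tableList("journey.department")
                .deserializer(new StringDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())
                .build();
        env.fromSource(sourceB, WatermarkStrategy.noWatermarks(), "cdc department").print();
        // Second submission -> second Job
        env.executeAsync("job2");
    }
}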