1、Resuming from a checkpoint

package example;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MySQLCDC {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // enable checkpointing: every 5s, exactly-once, at most one concurrent checkpoint
        env.enableCheckpointing(5000);
        env.getCheckpointConfig().setCheckpointTimeout(10000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // keep checkpoints on HDFS so the job can later be resumed from them
        env.setStateBackend(new FsStateBackend("hdfs://xx.xx.xx.xx:8020/flink-cdc/ck"));

        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("xx.xx.xx.xx")
                .port(3306)
                .username("root")
                .password("root@123")
                .databaseList("journey")
                .tableList("journey.employee")
                .deserializer(new StringDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())
                .build();
        DataStreamSource<String> dataStreamSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql cdc");
        dataStreamSource.print();
        env.execute();
    }
}

1.1、Start the job

bin/flink run -d -t yarn-per-job -c example.MySQLCDC /opt/flinkcdc-data-sync-1.0-SNAPSHOT.jar

Because the startup option is StartupOptions.initial(), all records printed at first have op=r, i.e. rows read during the initial snapshot.
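
If you want to check the operation type programmatically rather than eyeballing the printed SourceRecord, a small custom deserializer along the following lines (hypothetical, not part of the original program; the class name OpOnlyDeserializationSchema is made up) emits only the Debezium operation code, so snapshot rows show up as "r" and binlog changes as "c"/"u"/"d":

import com.ververica.cdc.debezium.DebeziumDeserializationSchema;
import io.debezium.data.Envelope;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.util.Collector;
import org.apache.kafka.connect.source.SourceRecord;

public class OpOnlyDeserializationSchema implements DebeziumDeserializationSchema<String> {

    @Override
    public void deserialize(SourceRecord record, Collector<String> out) {
        // Envelope.operationFor(...) reads the "op" field of the Debezium envelope:
        // READ ("r") for snapshot rows, CREATE/UPDATE/DELETE ("c"/"u"/"d") for binlog changes
        Envelope.Operation op = Envelope.operationFor(record);
        out.collect(op == null ? "unknown" : op.code());
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}

Swap it in with .deserializer(new OpOnlyDeserializationSchema()) instead of StringDebeziumDeserializationSchema if you only care about the operation type.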

1.2、Cancel the job

bin/flink cancel -t yarn-per-job -Dyarn.application.id=application_1694766249884_1113 123f715267f69fbf175d00dca8b0b37b

1.3、Restart and resume from the checkpoint

bin/flink run -d -t yarn-per-job -s hdfs://xx.xx.xx.xx:8020/flink-cdc/ck/96f0dc3d9b2717ae3c0a27799560d862/chk-62 -c example.MySQLCDC /opt/flinkcdc-data-sync-1.0-SNAPSHOT.jar

1.4、Incremental sync

delete from employee where id = 2;

In the output, the data from the initial snapshot is not read again, because the job resumed from the checkpoint; only the new delete event is emitted:


SourceRecord{sourcePartition={server=mysql_binlog_source}, sourceOffset={transaction_id=null, ts_sec=1726815377, file=mysql-bin.000041, pos=8021341, row=1, server_id=109, event=3}} ConnectRecord{topic='mysql_binlog_source.journey.employee', kafkaPartition=null, key=Struct{id=2}, keySchema=Schema{mysql_binlog_source.journey.employee.Key:STRUCT}, value=Struct{before=Struct{id=2,name=Bob,department_id=2},source=Struct{version=1.6.4.Final,connector=mysql,name=mysql_binlog_source,ts_ms=1726815377000,db=journey,table=employee,server_id=109,file=mysql-bin.000041,pos=8021497,row=0},op=d,ts_ms=1726815377916}, valueSchema=Schema{mysql_binlog_source.journey.employee.Envelope:STRUCT}, timestamp=null, headers=ConnectHeaders(headers=)}

Note: although the checkpoint path configured in the program is hdfs://xx.xx.xx.xx:8020/flink-cdc/ck, the path passed to -s must be the full checkpoint path, e.g. hdfs://xx.xx.xx.xx:8020/flink-cdc/ck/96f0dc3d9b2717ae3c0a27799560d862/chk-62. There may be several checkpoints under that directory; pick the latest one.
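
If you script the restart, a small helper like the one below (hypothetical, not part of the original article; it uses the plain Hadoop FileSystem API) can pick the newest chk-N directory under the job's checkpoint directory, which is exactly the path to pass to -s:

import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LatestCheckpointFinder {

    public static void main(String[] args) throws Exception {
        // args[0]: the job-specific checkpoint directory, e.g. hdfs://.../flink-cdc/ck/<jobId>
        Path jobCkDir = new Path(args[0]);
        FileSystem fs = FileSystem.get(jobCkDir.toUri(), new Configuration());

        // keep only the chk-N children and print the one with the highest N
        FileStatus[] children = fs.listStatus(jobCkDir, p -> p.getName().startsWith("chk-"));
        Arrays.stream(children)
                .max(Comparator.comparingInt(
                        (FileStatus s) -> Integer.parseInt(s.getPath().getName().substring(4))))
                .ifPresent(latest -> System.out.println(latest.getPath()));
    }
}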

You can also use a savepoint. A savepoint is essentially the same mechanism as a checkpoint; the difference is that a savepoint is triggered manually while checkpoints are taken automatically.

1.5、Using a savepoint

Stop the job and take a savepoint:
bin/flink stop --savepointPath hdfs://xx.xx.xx.xx:8020/sp 1f55d1e49d2c3c3e42e240074a310508 -yid application_1726648863980_0018
 
The command above prints output like the following:
....
2024-09-20 16:24:42,234 INFO  org.apache.flink.yarn.YarnClusterDescriptor                  [] - Found Web Interface cclinux3:32983 of application 'application_1726648863980_0018'.
Savepoint completed. Path: hdfs://xx.xx.xx.xx:8020/sp/savepoint-1f55d1-fb7041cd00d9
....
 
Start a new job from the savepoint:
bin/flink run -d -t yarn-per-job -s hdfs://xx.xx.xx.xx:8020/sp/savepoint-1f55d1-fb7041cd00d9 -c example.MySQLCDC /opt/flinkcdc-data-sync-1.0-SNAPSHOT.jar
 
 
You can list the jobs under the application ID:
bin/flink list -t yarn-per-job -Dyarn.application.id=application_1726648863980_0018

2、Configuring one database with multiple tables

MySqlSource<String> sourceFunction = MySqlSource.<String>builder()
                .hostname("xx.xx.xx.xx")
                .port(3306)
                .username("root")
                .password("root@123")
                .databaseList("journey")
                .tableList(".*") // to capture every table in the database, pass ".*"; tableList cannot simply be omitted
                .deserializer(new StringDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())
                .build();
        DataStreamSource<String> dataStreamSource = env.fromSource(sourceFunction, WatermarkStrategy.noWatermarks(), "mysql cdc");
        dataStreamSource.print();
        env.execute();

Note: to capture every table in the database, set tableList(".*"); you cannot simply omit tableList.

3、Startup modes

The configuration option scan.startup.mode specifies the startup mode of the MySQL CDC consumer. Valid values are:

  • initial (default): performs an initial snapshot of the monitored tables on first start, then continues reading the latest binlog; in short, initial snapshot + binlog increments
  • earliest-offset: skips the snapshot phase and reads from the earliest binlog offset that is still available; binlog increments only, no initial snapshot, starting from the earliest readable binlog
  • latest-offset: never takes a snapshot on first start; the connector reads only from the end of the binlog, so it only sees changes made after it started; binlog increments only, starting from the latest position
  • specific-offset: skips the snapshot phase and reads from a specified binlog offset, given either as a binlog file name plus position, or as a GTID set if GTIDs are enabled on the cluster
  • timestamp: skips the snapshot phase and reads binlog events starting from the specified timestamp

    MySqlSource.builder()
      // pick exactly one of the startupOptions lines below
      .startupOptions(StartupOptions.initial()) // initial snapshot + binlog increments (default)
      .startupOptions(StartupOptions.earliest()) // start from the earliest available binlog offset
      .startupOptions(StartupOptions.latest()) // start from the latest binlog offset
      .startupOptions(StartupOptions.specificOffset("mysql-bin.000003", 4L)) // start from a specific binlog file and position
      .startupOptions(StartupOptions.specificOffset("24DA167-0C0C-11E8-8442-00059A3C7B00:1-19")) // start from a GTID set
      .startupOptions(StartupOptions.timestamp(1667232000000L)) // start from a timestamp
      .startupOptions(StartupOptions.snapshot()) // read the snapshot only
      ...
      .build()

    Note: StartupOptions.snapshot() is only available starting with Flink CDC 3.1.

4、Tables without a primary key

Tables without a primary key are supported starting with version 2.4.0. To use such a table you must set scan.incremental.snapshot.chunk.key-column, and you can only choose a single non-nullable column (a DDL sketch follows the list below).

When working with a table without a primary key, keep the following two points in mind:

1、When configuring scan.incremental.snapshot.chunk.key-column, if the table has an index, prefer a column covered by that index to speed up the select during the snapshot phase.
2、The processing semantics of a keyless table are determined by the behavior of the column specified in scan.incremental.snapshot.chunk.key-column:

  • If the specified column is never updated, exactly-once semantics can be guaranteed.
  • If the specified column can be updated, only at-least-once semantics can be guaranteed. Correctness can still be achieved downstream by defining a primary key on the sink and relying on idempotent writes.
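
A minimal sketch of how the option can be set when declaring the source with Flink SQL from Java. The table orders_log and its columns are made up for illustration; only the option name scan.incremental.snapshot.chunk.key-column comes from the text above:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KeylessTableExample {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TABLE orders_log (\n" +
                "  order_id BIGINT,\n" +
                "  order_time TIMESTAMP(3),\n" +
                "  amount DECIMAL(10, 2)\n" +
                ") WITH (\n" +
                "  'connector' = 'mysql-cdc',\n" +
                "  'hostname' = 'xx.xx.xx.xx',\n" +
                "  'port' = '3306',\n" +
                "  'username' = 'root',\n" +
                "  'password' = 'root@123',\n" +
                "  'database-name' = 'journey',\n" +
                "  'table-name' = 'orders_log',\n" +
                // required for a table without a primary key: pick one NOT NULL column as the chunk key
                "  'scan.incremental.snapshot.chunk.key-column' = 'order_id'\n" +
                ")");
    }
}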

5、Flink CDC and Flink version compatibility

(image: Flink CDC / Flink version compatibility table)

6、Flink CDC on YARN: running multiple jobs with the DataStream API

pom.xml configuration

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>com.cestc.cdc</groupId>
  <artifactId>flinkcdc-data-sync</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
 
  <name>flinkcdc-data-sync</name>
  <url>http://maven.apache.org</url>
 
  <properties>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <spring-boot.version>2.3.7.RELEASE</spring-boot.version>
    <flink.version>1.15.2</flink.version>
    <scala.binary.version>2.12</scala.binary.version>
    <shade.name>shade</shade.name>
  </properties>
 
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
 
    <repository>
      <id>spring</id>
      <url>https://maven.aliyun.com/repository/spring</url>
    </repository>
 
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
 
 
  <dependencies>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-planner_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-api-scala-bridge_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-table-common</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients</artifactId>
      <version>${flink.version}</version>
    </dependency>
 
    <!-- flink-connector-kafka -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/connect-api -->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>connect-api</artifactId>
      <version>2.8.2</version>
    </dependency>
 
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-base</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
 
    <!-- flink-connector-jdbc -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-jdbc</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
 
    <!-- mysql-cdc-->
    <dependency>
      <groupId>com.ververica</groupId>
      <artifactId>flink-connector-mysql-cdc</artifactId>
      <version>2.3.0</version>
    </dependency>
 
    <!--        mysql-->
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>8.0.33</version>
      <scope>provided</scope>
    </dependency>
 
    <!--        postgresql-->
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>42.2.5</version>
    </dependency>
 
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-json</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-csv</artifactId>
      <version>${flink.version}</version>
      <scope>provided</scope>
    </dependency>
 
    <!-- SLF4J API -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.30</version>
    </dependency>
 
    <!-- Logback classic, which is the implementation of SLF4J -->
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>1.2.3</version>
    </dependency>
 
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.13.3</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
      <version>2.13.3</version>
    </dependency>
 
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>3.3.1</version>
      <scope>provided</scope>
    </dependency>
 
  </dependencies>
 
  <build>
    <plugins>
      <!-- compiler plugin: compile the sources -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
 
      <!-- shade plugin: build an uber JAR that bundles the dependencies -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers combine.children="append">
                <transformer
                        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer">
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

The key idea is to avoid conflicts with the JARs already shipped under the cluster's lib directories (Flink/HDFS); conflicting versions will break the job. Note: besides marking dependencies as provided, you can also relocate conflicting packages with the shade plugin.

Code:

package com.journey.cdc;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import org.apache.commons.lang3.StringUtils;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.connector.jdbc.catalog.MySqlCatalog;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.DefaultCatalogTable;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.types.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlinkCdcMultiSyncMySQLCdc2MySQL {

    private static final Logger log = LoggerFactory.getLogger(FlinkCdcMultiSyncMySQLCdc2MySQL.class);

    public static void main(String[] args) throws Exception {

        // source connection info
        String userName = "root";
        String passWord = "xxx";
        String host = "xx.xx.30.77";
        String db = "journey";
        // for the whole database, use tableList = ".*"
        String tableList = "journey.employee,journey.department";
        int port = 3306;

        // sink connection template
        String sink_url = "jdbc:mysql://xx.xx.30.77:3306/dest?useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai";
        String sink_username = "root";
        String sink_password = "xx";

        String connectorWithBody =
                " with (\n" +
                        " 'connector' = 'jdbc',\n" +
                        " 'url' = '${sink_url}',\n" +
                        " 'username' = '${sink_username}',\n" +
                        " 'password' = '${sink_password}',\n" +
                        " 'table-name' = '${tableName}'\n" +
                        ")";
        connectorWithBody = connectorWithBody.replace("${sink_url}", sink_url)
                .replace("${sink_username}", sink_username)
                .replace("${sink_password}", sink_password);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // register a catalog for the database being synchronized
        MySqlCatalog mysqlCatalog = new MySqlCatalog("mysql-catalog", db, userName, passWord, String.format("jdbc:mysql://%s:%d", host, port));
        List<String> tables = new ArrayList<>();

        // for whole-database sync, take all tables from the catalog; otherwise parse the given table list
        if (".*".equals(tableList)) {
            tables = mysqlCatalog.listTables(db);
        } else {
            String[] tableArray = tableList.split(",");
            for (String table : tableArray) {
                tables.add(table.split("\\.")[1]);
            }
        }

        // maps from table name to its RowTypeInfo / DataTypes / RowType
        Map<String, RowTypeInfo> tableTypeInformationMap = new HashMap<>();
        Map<String, DataType[]> tableDataTypesMap = new HashMap<>();
        Map<String, RowType> tableRowTypeMap = new HashMap<>();
        for (String table : tables) {
            // fetch the table registered in the MySQL catalog
            ObjectPath objectPath = new ObjectPath(db, table);
            DefaultCatalogTable catalogBaseTable = (DefaultCatalogTable) mysqlCatalog.getTable(objectPath);
            // get the table schema
            Schema schema = catalogBaseTable.getUnresolvedSchema();
            // field names of the table
            String[] fieldNames = new String[schema.getColumns().size()];
            // DataTypes of the fields
            DataType[] fieldDataTypes = new DataType[schema.getColumns().size()];
            LogicalType[] logicalTypes = new LogicalType[schema.getColumns().size()];
            // TypeInformation of the fields
            TypeInformation<?>[] fieldTypes = new TypeInformation[schema.getColumns().size()];
            // primary key columns of the table
            List<String> primaryKeys = schema.getPrimaryKey().get().getColumnNames();

            for (int i = 0; i < schema.getColumns().size(); i++) {
                Schema.UnresolvedPhysicalColumn column = (Schema.UnresolvedPhysicalColumn) schema.getColumns().get(i);
                fieldNames[i] = column.getName();
                fieldDataTypes[i] = (DataType) column.getDataType();
                fieldTypes[i] = InternalTypeInfo.of(((DataType) column.getDataType()).getLogicalType());
                logicalTypes[i] = ((DataType) column.getDataType()).getLogicalType();
            }
            RowType rowType = RowType.of(logicalTypes, fieldNames);
            tableRowTypeMap.put(table, rowType);

            // assemble the DDL for the sink table
            StringBuilder stmt = new StringBuilder();
            String tableName = table;
            String jdbcSinkTableName = String.format("jdbc_sink_%s", tableName);
            stmt.append("create table ").append(jdbcSinkTableName).append("(\n");

            for (int i = 0; i < fieldNames.length; i++) {
                String column = fieldNames[i];
                String fieldDataType = fieldDataTypes[i].toString();
                stmt.append("\t").append(column).append(" ").append(fieldDataType).append(",\n");
            }
            stmt.append(String.format("PRIMARY KEY (%s) NOT ENFORCED\n)", StringUtils.join(primaryKeys, ",")));
            String formatJdbcSinkWithBody = connectorWithBody
                    .replace("${tableName}", jdbcSinkTableName);
            String createSinkTableDdl = stmt.toString() + formatJdbcSinkWithBody;
            // create the sink table
            log.info("createSinkTableDdl: {}", createSinkTableDdl);
            tEnv.executeSql(createSinkTableDdl);
            tableDataTypesMap.put(tableName, fieldDataTypes);
            tableTypeInformationMap.put(tableName, new RowTypeInfo(fieldTypes, fieldNames));
        }

        // monitor the MySQL binlog
        MySqlSource<Tuple2<String, Row>> mySqlSource = MySqlSource.<Tuple2<String, Row>>builder()
                .hostname(host)
                .port(port)
                .databaseList(db)
                .tableList(tableList)
                .username(userName)
                .password(passWord)
                .deserializer(new CustomDebeziumDeserializer(tableRowTypeMap))
                .startupOptions(StartupOptions.initial())
                .build();
        SingleOutputStreamOperator<Tuple2<String, Row>> dataStreamSource = env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "mysql cdc").disableChaining();
        StatementSet statementSet = tEnv.createStatementSet();
        // convert each table's DataStream to a Table, create a temporary view, and insert into the sink table
        for (Map.Entry<String, RowTypeInfo> entry : tableTypeInformationMap.entrySet()) {
            String tableName = entry.getKey();
            RowTypeInfo rowTypeInfo = entry.getValue();
            SingleOutputStreamOperator<Row> mapStream = dataStreamSource.filter(data -> data.f0.equals(tableName)).map(data -> data.f1, rowTypeInfo);
            Table table = tEnv.fromChangelogStream(mapStream);
            String temporaryViewName = String.format("t_%s", tableName);
            tEnv.createTemporaryView(temporaryViewName, table);
            String sinkTableName = String.format("jdbc_sink_%s", tableName);
            String insertSql = String.format("insert into %s select * from %s", sinkTableName, temporaryViewName);
            log.info("add insertSql for {},sql: {}", tableName, insertSql);
            statementSet.addInsertSql(insertSql);
        }
        statementSet.execute();
    }
} 
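
The program above references CustomDebeziumDeserializer, which the original article does not include. Below is a simplified sketch of what such a deserializer might look like: it emits (tableName, Row) pairs with the appropriate RowKind and copies field values from the Debezium Struct as-is; a production version would convert temporal and decimal values according to the table's RowType.

package com.journey.cdc;

import com.ververica.cdc.debezium.DebeziumDeserializationSchema;
import io.debezium.data.Envelope;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;

import java.util.Map;

public class CustomDebeziumDeserializer implements DebeziumDeserializationSchema<Tuple2<String, Row>> {

    private final Map<String, RowType> tableRowTypeMap;

    public CustomDebeziumDeserializer(Map<String, RowType> tableRowTypeMap) {
        this.tableRowTypeMap = tableRowTypeMap;
    }

    @Override
    public void deserialize(SourceRecord record, Collector<Tuple2<String, Row>> out) {
        if (!(record.value() instanceof Struct)) {
            return;
        }
        Struct value = (Struct) record.value();
        // the Debezium "source" struct carries the table name of the change event
        String table = value.getStruct("source").getString("table");
        RowType rowType = tableRowTypeMap.get(table);
        if (rowType == null) {
            return; // table was not registered from the catalog scan, skip it
        }
        Envelope.Operation op = Envelope.operationFor(record);
        if (op == Envelope.Operation.DELETE) {
            out.collect(Tuple2.of(table, toRow(value.getStruct("before"), rowType, RowKind.DELETE)));
        } else if (op == Envelope.Operation.UPDATE) {
            out.collect(Tuple2.of(table, toRow(value.getStruct("before"), rowType, RowKind.UPDATE_BEFORE)));
            out.collect(Tuple2.of(table, toRow(value.getStruct("after"), rowType, RowKind.UPDATE_AFTER)));
        } else { // READ (snapshot) and CREATE both become inserts
            out.collect(Tuple2.of(table, toRow(value.getStruct("after"), rowType, RowKind.INSERT)));
        }
    }

    // naive field copy; real code must map Debezium/Connect values (dates, decimals, ...)
    // to the Flink types described by the RowType
    private Row toRow(Struct struct, RowType rowType, RowKind kind) {
        Row row = new Row(kind, rowType.getFieldCount());
        for (int i = 0; i < rowType.getFieldCount(); i++) {
            row.setField(i, struct.get(rowType.getFieldNames().get(i)));
        }
        return row;
    }

    @Override
    public TypeInformation<Tuple2<String, Row>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple2<String, Row>>() {});
    }
}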

This example uses a StatementSet: the pipeline ultimately fans out into two streams, but it is still a single job, because there is only one source (the same MySQL CDC source) and only one submission (executeAsync) happens under the hood.
(image: a single job for the StatementSet program)

If there were two sources, there would be two jobs.
(image: two jobs when using two sources)
The difference between the two cases is how many times executeAsync is submitted: the StatementSet program submits once, so there is one job, while the MultiJobExecuteAsync program submits twice, so there are two jobs. A sketch of that two-source variant follows.
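
A minimal sketch of the two-source / two-job variant referred to above. The class name MultiJobExecuteAsync comes from the text; the connection details and the second table simply mirror the earlier example and are placeholders. Each executeAsync call submits the operators defined since the previous submission as a separate job.

package com.journey.cdc;

import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.StringDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiJobExecuteAsync {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3000);

        // first CDC source and its pipeline
        MySqlSource<String> source1 = MySqlSource.<String>builder()
                .hostname("xx.xx.xx.xx").port(3306)
                .username("root").password("root@123")
                .databaseList("journey").tableList("journey.employee")
                .deserializer(new StringDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())
                .build();
        env.fromSource(source1, WatermarkStrategy.noWatermarks(), "mysql cdc 1").print();
        env.executeAsync("job1"); // first submission -> first job

        // second CDC source, built into a fresh pipeline after the first submission
        MySqlSource<String> source2 = MySqlSource.<String>builder()
                .hostname("xx.xx.xx.xx").port(3306)
                .username("root").password("root@123")
                .databaseList("journey").tableList("journey.department")
                .deserializer(new StringDebeziumDeserializationSchema())
                .startupOptions(StartupOptions.initial())
                .build();
        env.fromSource(source2, WatermarkStrategy.noWatermarks(), "mysql cdc 2").print();
        env.executeAsync("job2"); // second submission -> second job
    }
}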

