1. The overall implementation process of Flink Table API
The main operation process is as follows:
// Create the table execution environment
val tableEnv = ...
// Create a table for reading input data
tableEnv.connect(...).createTemporaryTable("inputTable")
// Register a table for writing out the computation results
tableEnv.connect(...).createTemporaryTable("outputTable")
// Query with Table API operators to obtain a result table
val result = tableEnv.from("inputTable").select(...)
// Query with a SQL statement to obtain a result table
val sqlResult = tableEnv.sqlQuery("SELECT ... FROM inputTable ...")
// Write the result table into the output table
result.insertInto("outputTable")
2. Creation and configuration of stream processing execution environment
Create table environment
Based on the stream processing execution environment, call the create method to create it directly:
val tableEnv = StreamTableEnvironment.create(env)
The table environment (TableEnvironment) is the core concept for integrating the Table API & SQL in Flink. It is mainly responsible for:
- Registering catalogs
- Registering tables in the internal catalog
- Executing SQL queries
- Registering user-defined functions
- Converting a DataStream or DataSet into a Table
- Holding a reference to the ExecutionEnvironment or StreamExecutionEnvironment
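As a minimal sketch of the last two responsibilities (assuming a Flink 1.11-era API; in Flink 1.10, StreamTableEnvironment lives under org.apache.flink.table.api.java instead of org.apache.flink.table.api.bridge.java):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// The table environment holds a reference to the stream execution environment
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Convert a DataStream into a Table (tuple columns default to f0, f1)
DataStream<Tuple2<Integer, String>> userStream =
        env.fromElements(Tuple2.of(1, "zhangsan"), Tuple2.of(2, "lisi"));
Table userTable = tableEnv.fromDataStream(userStream);

// Register the stream as a temporary view so SQL queries can refer to it by name
tableEnv.createTemporaryView("users", userStream);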
Environment configuration
When creating the TableEnvironment, you can configure its behavior through EnvironmentSettings parameters.
Streaming query configuration for the old planner

EnvironmentSettings settings = EnvironmentSettings.newInstance()
    .useOldPlanner()      // use the old planner
    .inStreamingMode()    // streaming mode
    .build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

Batch configuration for the old planner

// The old planner's batch environment is created from a DataSet ExecutionEnvironment
ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment batchTableEnv = BatchTableEnvironment.create(batchEnv);

Streaming configuration for the blink planner

EnvironmentSettings settings = EnvironmentSettings.newInstance()
    .useBlinkPlanner()
    .inStreamingMode()
    .build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

Batch configuration for the blink planner

EnvironmentSettings bbSettings = EnvironmentSettings.newInstance()
    .useBlinkPlanner()
    .inBatchMode()
    .build();
TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);
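Note that from Flink 1.11 onward the blink planner is the default, so creating the environment without EnvironmentSettings, as shown earlier in this section, already yields a blink streaming table environment:

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);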
3. Operation and use of Catalog
1) Catalog type:
- GenericInMemoryCatalog: Built-in Catalog. The name is default_catalog and the default database name is default_database . By default, tables registered with TableEnvironment#registerTable will all be registered in this Catalog.
- User-Defined Catalog: User-defined Catalog. Such as HiveCatalog in flink-connector-hive.
Note:
The metadata object name in GenericInMemoryCatalog is case sensitive. HiveCatalog stores all metadata object names in lowercase.
Catalog used by default: default_catalog; Database: default_database.
2) Usage of Catalog:
Get the catalog currently in use
tableEnv.getCurrentCatalog()
Get the currently used Database
tableEnv.getCurrentDatabase()
Register a custom catalog
tableEnv.registerCatalog("custom-catalog",new CustomCatalog("customCatalog"))
List all catalogs
tableEnv.listCatalogs()
List all databases
tableEnv.listDatabases()
Switch catalog
tableEnv.useCatalog("catalogName")
Switch Database
tableEnv.useDatabase("databaseName")
4. Implementation of file system reading operation (csv)
POM dependency
<!-- 导入csv格式描述器的依赖包--> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-csv</artifactId> <version>${flink.version}</version> </dependency>
Code
public static void main(String[] args) throws Exception {
    //1. Create the streaming execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    //2. No EnvironmentSettings specified, so the old planner's streaming query is used by default
    StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
    //3. Path of the CSV file to read
    String filePath = "./data/user.csv";
    //4. Read the CSV file and configure the field types
    tabEnv.connect(new FileSystem().path(filePath)) // read files from the given path; the argument must be an implementation of ConnectorDescriptor
            .withFormat(new Csv()) // format used to parse the external file; must be an implementation of the FormatDescriptor abstract class
            .withSchema(new Schema()
                    .field("id", DataTypes.INT())
                    .field("name", DataTypes.STRING())
            ) // define the table schema
            .createTemporaryTable("inputTable");
    System.out.println(tabEnv.getCurrentCatalog());
    System.out.println(tabEnv.getCurrentDatabase());
    //5. Turn the registered table into a Table object
    Table inputTable = tabEnv.from("inputTable");
    //6. Print for testing
    tabEnv.toAppendStream(inputTable, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print().setParallelism(1);
    env.execute();
}
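The imports for this example are not shown; assuming a Flink 1.11-era project (in 1.10, StreamTableEnvironment lives under org.apache.flink.table.api.java), they would roughly be:

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.Csv;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.Schema;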
5. Implementation of the read operation of the message queue (kafka)
POM dependency
<!-- 导入kafka连接器jar包--> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-connector-kafka_${scala.version}</artifactId> <version>${flink.version}</version> </dependency> <!-- flink json序列化jar包--> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-json</artifactId> <version>${flink.version}</version> </dependency>
Code
public static void main(String[] args) throws Exception {
    // Create the streaming execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // No EnvironmentSettings specified, so the old planner's streaming query is used by default
    StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
    // Connect to Kafka and consume data
    tabEnv.connect(new Kafka() // read data from Kafka
            .version("universal") // Kafka version of the environment: "0.8", "0.9", "0.10", "0.11", or "universal"
            .topic("rate") // topic to consume
            .property("zookeeper.connect", "10.10.20.15:2181") // ZooKeeper cluster address
            .property("bootstrap.servers", "10.10.20.15:9092") // Kafka broker address
    ).withFormat(new Csv())
            .withSchema(new Schema()
                    .field("timestamp", DataTypes.BIGINT())
                    .field("type", DataTypes.STRING())
                    .field("rate", DataTypes.INT())
            ).createTemporaryTable("RateInputTable");
    Table rateInputTable = tabEnv.from("RateInputTable");
    tabEnv.toAppendStream(rateInputTable, Rate.class).print();
    env.execute();
}
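The Rate class passed to toAppendStream is not shown in the original; a POJO matching the declared schema could look like this sketch (field names must match the schema's column names, and Flink POJOs need a public no-argument constructor):

public static class Rate {
    public Long timestamp;
    public String type;
    public Integer rate;

    public Rate() {}

    @Override
    public String toString() {
        return "Rate{timestamp=" + timestamp + ", type='" + type + "', rate=" + rate + "}";
    }
}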
Consumption test
Start a Kafka console producer:
bin/kafka-console-producer.sh --broker-list 10.10.20.15:9092 --topic rate
Send data:
1618388392479, 'REF', 9
1618388392480, 'USD', 4
1618388392580, 'HKD', 9
6. How to perform conditional query operations
6.1 Implementation of Table API
The Table API is a query API integrated into the Scala and Java languages. Unlike SQL, Table API queries are not expressed as strings but are built step by step through method calls in the host language.
The Table API is based on the Table class, which represents a table and provides a set of methods for processing operations. Each method returns a new Table object representing the result of applying a transformation to the input table. Relational transformations can be composed from multiple method calls into a chained structure, for example table.select(...).filter(...), where select(...) selects the specified fields from the table and filter(...) applies a filter condition.
Code:
...
// The Table API transforms data by operating on a Table instance, so the registered temporary table must first be read into a Table object
Table resultTable = inputTable.filter("id == 1").select("id,name");
// Use the Table API to run an aggregation on the Table object
Table aggResultTable = inputTable.groupBy("id").select("id,id.count as count");
// Print for testing
tabEnv.toAppendStream(resultTable, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print("resultTable>>>").setParallelism(1);
tabEnv.toRetractStream(aggResultTable, TypeInformation.of(new TypeHint<Tuple2<Integer, Long>>() {})).print("aggResultTable>>>").setParallelism(1);
...
Output result:
resultTable>>>> (1,zhangsan)
aggResultTable>>>> (true,(2,1))
aggResultTable>>>> (false,(2,1))
aggResultTable>>>> (true,(2,2))
aggResultTable>>>> (true,(1,1))
In a retract stream, true marks the insertion of a new record and false marks the retraction of a previously emitted record: when the count for id 2 grows from 1 to 2, the old result (2,1) is retracted with false and the updated cumulative result (2,2) is emitted again with true.
6.2 Implementation of SQL
Flink's SQL integration is based on Apache Calcite, which implements the SQL standard. In Flink, SQL query statements are defined as regular strings, and the result of a SQL query is a new Table.
// Query the table's data with SQL
Table resultTableBySQL = tabEnv.sqlQuery("select id,count(id) as cnt from inputTable group by id");
tabEnv.toRetractStream(resultTableBySQL, TypeInformation.of(new TypeHint<Tuple2<Integer, Long>>() {})).print("sql result>>>").setParallelism(1);
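For comparison with the Table API filter in 6.1, a plain (non-aggregating) SQL query against the same inputTable could look like this sketch:

Table filteredBySQL = tabEnv.sqlQuery("select id, name from inputTable where id = 1");
tabEnv.toAppendStream(filteredBySQL, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print("sql filter>>>").setParallelism(1);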
7. Implementation of data output operations
A table is output by writing its data to a TableSink. TableSink is a generic interface that supports different file formats, storage systems, and message queues.
In practice, the most direct way to output a table is to write it into a registered output table via the Table.executeInsert() method.
7.1 Output to file
Code:
//1. Create the streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2. No EnvironmentSettings specified, so the old planner's streaming query is used by default
StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
env.setParallelism(1);
//3. Path of the CSV file to read
String filePath = "./data/user.csv";
//4. Read the CSV file and configure the field types
tabEnv.connect(new FileSystem().path(filePath)) // read files from the given path; the argument must be an implementation of ConnectorDescriptor
        .withFormat(new Csv()) // format used to parse the external file; must be an implementation of the FormatDescriptor abstract class
        .withSchema(new Schema()
                .field("id", DataTypes.INT())
                .field("name", DataTypes.STRING())
        ) // define the table schema
        .createTemporaryTable("inputTable");
//5. Turn the registered table into a Table object
Table inputTable = tabEnv.from("inputTable");
Table resultTable = inputTable.select("id,name");
// Define the result table and write the data into the result file
tabEnv.connect(new FileSystem().path("./data/user.log"))
        .withFormat(new Csv())
        .withSchema(new Schema() // this method must be specified
                .field("id", DataTypes.INT())
                .field("name", DataTypes.STRING())
        )
        .createTemporaryTable("outputTable");
resultTable.executeInsert("outputTable");
//6. Print for testing
tabEnv.toAppendStream(inputTable, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print().setParallelism(1);
env.execute();
7.2 Output to Kafka
The processed data can also be written to Kafka. Combined with the Kafka input from the previous section, this forms a data pipeline: read from one Kafka topic and write the result to another Kafka topic:
String kafkaNode = "10.10.20.15:2181";
String kafkaNodeServer = "10.10.20.15:9092";
// Create the streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// No EnvironmentSettings specified, so the old planner's streaming query is used by default
StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
// Connect to Kafka and consume data
tabEnv.connect(new Kafka() // read data from Kafka
        .version("universal") // Kafka version of the environment: "0.8", "0.9", "0.10", "0.11", or "universal"
        .startFromEarliest()
        .topic("rate") // topic to consume
        .property("zookeeper.connect", kafkaNode) // ZooKeeper cluster address
        .property("bootstrap.servers", kafkaNodeServer) // Kafka broker address
).withFormat(new Csv())
        .withSchema(new Schema()
                .field("timestamp", DataTypes.BIGINT())
                .field("type", DataTypes.STRING())
                .field("rate", DataTypes.INT())
        ).createTemporaryTable("RateInputTable");
Table rateInputTable = tabEnv.from("RateInputTable");
// Connect to Kafka as the output sink
tabEnv.connect(new Kafka() // write data to Kafka
        .version("universal") // Kafka version of the environment: "0.8", "0.9", "0.10", "0.11", or "universal"
        .topic("output_rate") // topic to write to
        .property("zookeeper.connect", kafkaNode) // ZooKeeper cluster address
        .property("bootstrap.servers", kafkaNodeServer) // Kafka broker address
).withFormat(new Csv())
        .withSchema(new Schema()
                .field("timestamp", DataTypes.BIGINT())
                .field("type", DataTypes.STRING())
                .field("rate", DataTypes.INT())
        ).createTemporaryTable("RateOutputTable");
// Write the table data into the Kafka message queue
rateInputTable.executeInsert("RateOutputTable");
// Print the data
tabEnv.toAppendStream(rateInputTable, StreamOutputKafkaApplication.Rate.class).print();
env.execute();
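To check what the job wrote, a Kafka console consumer can be attached to the output topic (same broker address as above):

bin/kafka-console-consumer.sh --bootstrap-server 10.10.20.15:9092 --topic output_rate --from-beginning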
This article was created and shared by mirson. If you need further communication, please join the QQ group: 19310171 or visit www.softart.cn