1. The overall implementation process of Flink Table API
The main operation process is as follows:
// Create the table execution environment
val tableEnv = ...
// Create a table for reading input data
tableEnv.connect(...).createTemporaryTable("inputTable")
// Register a table for writing out the computation results
tableEnv.connect(...).createTemporaryTable("outputTable")
// Query with Table API operators to obtain a result table
val result = tableEnv.from("inputTable").select(...)
// Query with a SQL statement to obtain a result table
val sqlResult = tableEnv.sqlQuery("SELECT ... FROM inputTable ...")
// Write the result table into the output table
result.insertInto("outputTable")
2. Creation and configuration of stream processing execution environment
Create table environment
Based on the stream processing execution environment, call the create method to create it directly:
val tableEnv = StreamTableEnvironment.create(env)
The table environment (TableEnvironment) is the core concept for integrating the Table API & SQL in Flink. It is mainly responsible for:
- Registering catalogs
- Registering tables in the internal catalog
- Executing SQL queries
- Registering user-defined functions
- Converting a DataStream or DataSet into a Table
- Holding a reference to the ExecutionEnvironment or StreamExecutionEnvironment
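As a minimal sketch of the last two responsibilities (assuming a Flink 1.11-era API; in Flink 1.10, StreamTableEnvironment lives under org.apache.flink.table.api.java instead of org.apache.flink.table.api.bridge.java):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// The table environment holds a reference to the stream execution environment
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Convert a DataStream into a Table (tuple columns default to f0, f1)
DataStream<Tuple2<Integer, String>> userStream =
        env.fromElements(Tuple2.of(1, "zhangsan"), Tuple2.of(2, "lisi"));
Table userTable = tableEnv.fromDataStream(userStream);

// Register the stream as a temporary view so SQL queries can refer to it by name
tableEnv.createTemporaryView("users", userStream);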
Environment configuration
When creating the TableEnvironment, you can configure its behavior through EnvironmentSettings parameters.
Streaming query configuration for the old planner

EnvironmentSettings settings = EnvironmentSettings.newInstance()
    .useOldPlanner()      // use the old planner
    .inStreamingMode()    // streaming mode
    .build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

Batch configuration for the old planner

// The old planner's batch environment is created from a DataSet ExecutionEnvironment
ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment batchTableEnv = BatchTableEnvironment.create(batchEnv);

Streaming configuration for the blink planner

EnvironmentSettings settings = EnvironmentSettings.newInstance()
    .useBlinkPlanner()
    .inStreamingMode()
    .build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, settings);

Batch configuration for the blink planner

EnvironmentSettings bbSettings = EnvironmentSettings.newInstance()
    .useBlinkPlanner()
    .inBatchMode()
    .build();
TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);
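Note that from Flink 1.11 onward the blink planner is the default, so creating the environment without EnvironmentSettings, as shown earlier in this section, already yields a blink streaming table environment:

StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);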
3. Operation and use of Catalog
1) Catalog type:
- GenericInMemoryCatalog: Built-in Catalog. The name is default_catalog and the default database name is default_database . By default, tables registered with TableEnvironment#registerTable will all be registered in this Catalog.
- User-Defined Catalog: User-defined Catalog. Such as HiveCatalog in flink-connector-hive.
Note:
The metadata object name in GenericInMemoryCatalog is case sensitive. HiveCatalog stores all metadata object names in lowercase.
Catalog used by default: default_catalog; Database: default_database.
2) Usage of Catalog:
Get the catalog currently in use
tableEnv.getCurrentCatalog()
Get the currently used Database
tableEnv.getCurrentDatabase()
Register a custom catalog
tableEnv.registerCatalog("custom-catalog",new CustomCatalog("customCatalog"))
List all catalogs
tableEnv.listCatalogs()
List all databases
tableEnv.listDatabases()
Switch catalog
tableEnv.useCatalog("catalogName")
Switch Database
tableEnv.useDatabase("databaseName")
4. Implementation of file system reading operation (csv)
POM dependency
<!-- 导入csv格式描述器的依赖包--> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-csv</artifactId> <version>${flink.version}</version> </dependency>
Code
public static void main(String[] args) throws Exception {
    //1. Create the streaming execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    //2. No EnvironmentSettings specified, so the old planner's streaming query is used by default
    StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
    //3. Path of the CSV file to read
    String filePath = "./data/user.csv";
    //4. Read the CSV file and configure the field types
    tabEnv.connect(new FileSystem().path(filePath)) // read files from the given path; the argument must be an implementation of ConnectorDescriptor
            .withFormat(new Csv()) // format used to parse the external file; must be an implementation of the FormatDescriptor abstract class
            .withSchema(new Schema()
                    .field("id", DataTypes.INT())
                    .field("name", DataTypes.STRING())
            ) // define the table schema
            .createTemporaryTable("inputTable");
    System.out.println(tabEnv.getCurrentCatalog());
    System.out.println(tabEnv.getCurrentDatabase());
    //5. Turn the registered table into a Table object
    Table inputTable = tabEnv.from("inputTable");
    //6. Print for testing
    tabEnv.toAppendStream(inputTable, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print().setParallelism(1);
    env.execute();
}
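The imports for this example are not shown; assuming a Flink 1.11-era project (in 1.10, StreamTableEnvironment lives under org.apache.flink.table.api.java), they would roughly be:

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.Csv;
import org.apache.flink.table.descriptors.FileSystem;
import org.apache.flink.table.descriptors.Schema;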
5. Implementation of the read operation of the message queue (kafka)
POM dependency
<!-- 导入kafka连接器jar包--> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-connector-kafka_${scala.version}</artifactId> <version>${flink.version}</version> </dependency> <!-- flink json序列化jar包--> <dependency> <groupId>org.apache.flink</groupId> <artifactId>flink-json</artifactId> <version>${flink.version}</version> </dependency>
Code
public static void main(String[] args) throws Exception {
    // Create the streaming execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // No EnvironmentSettings specified, so the old planner's streaming query is used by default
    StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
    // Connect to Kafka and consume data
    tabEnv.connect(new Kafka() // read data from Kafka
            .version("universal") // Kafka version of the environment: "0.8", "0.9", "0.10", "0.11", or "universal"
            .topic("rate") // topic to consume
            .property("zookeeper.connect", "10.10.20.15:2181") // ZooKeeper cluster address
            .property("bootstrap.servers", "10.10.20.15:9092") // Kafka broker address
    ).withFormat(new Csv())
            .withSchema(new Schema()
                    .field("timestamp", DataTypes.BIGINT())
                    .field("type", DataTypes.STRING())
                    .field("rate", DataTypes.INT())
            ).createTemporaryTable("RateInputTable");
    Table rateInputTable = tabEnv.from("RateInputTable");
    tabEnv.toAppendStream(rateInputTable, Rate.class).print();
    env.execute();
}
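The Rate class passed to toAppendStream is not shown in the original; a POJO matching the declared schema could look like this sketch (field names must match the schema's column names, and Flink POJOs need a public no-argument constructor):

public static class Rate {
    public Long timestamp;
    public String type;
    public Integer rate;

    public Rate() {}

    @Override
    public String toString() {
        return "Rate{timestamp=" + timestamp + ", type='" + type + "', rate=" + rate + "}";
    }
}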
Consumption test
Start a Kafka console producer:
bin/kafka-console-producer.sh --broker-list 10.10.20.15:9092 --topic rate
Send data:
1618388392479, 'REF', 9
1618388392480, 'USD', 4
1618388392580, 'HKD', 9
6. How to perform conditional query operations
6.1 Implementation of Table API
The Table API is a query API integrated into the Scala and Java languages. Unlike SQL, Table API queries are not expressed as strings but are built step by step through method calls in the host language.
The Table API is based on the Table class, which represents a table and provides a set of methods for processing operations. Each method returns a new Table object representing the result of applying a transformation to the input table. Relational transformations can be composed from multiple method calls into a chained structure, for example table.select(...).filter(...), where select(...) selects the specified fields from the table and filter(...) applies a filter condition.
Code:
...
// The Table API transforms data by operating on a Table instance, so the registered temporary table must first be read into a Table object
Table resultTable = inputTable.filter("id == 1").select("id,name");
// Use the Table API to run an aggregation on the Table object
Table aggResultTable = inputTable.groupBy("id").select("id,id.count as count");
// Print for testing
tabEnv.toAppendStream(resultTable, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print("resultTable>>>").setParallelism(1);
tabEnv.toRetractStream(aggResultTable, TypeInformation.of(new TypeHint<Tuple2<Integer, Long>>() {})).print("aggResultTable>>>").setParallelism(1);
...
Output result:
resultTable>>>> (1,zhangsan)
aggResultTable>>>> (true,(2,1))
aggResultTable>>>> (false,(2,1))
aggResultTable>>>> (true,(2,2))
aggResultTable>>>> (true,(1,1))
In a retract stream, true marks the insertion of a new record and false marks the retraction of a previously emitted record: when the count for id 2 grows from 1 to 2, the old result (2,1) is retracted with false and the updated cumulative result (2,2) is emitted again with true.
6.2 Implementation of SQL
Flink's SQL integration is based on Apache Calcite, which implements the SQL standard. In Flink, SQL query statements are defined as regular strings, and the result of a SQL query is a new Table.
// Query the table's data with SQL
Table resultTableBySQL = tabEnv.sqlQuery("select id,count(id) as cnt from inputTable group by id");
tabEnv.toRetractStream(resultTableBySQL, TypeInformation.of(new TypeHint<Tuple2<Integer, Long>>() {})).print("sql result>>>").setParallelism(1);
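For comparison with the Table API filter in 6.1, a plain (non-aggregating) SQL query against the same inputTable could look like this sketch:

Table filteredBySQL = tabEnv.sqlQuery("select id, name from inputTable where id = 1");
tabEnv.toAppendStream(filteredBySQL, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print("sql filter>>>").setParallelism(1);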
7. Implementation of data output operations
A table is output by writing its data to a TableSink. TableSink is a generic interface that supports different file formats, storage systems, and message queues.
In practice, the most direct way to output a table is to write it into a registered output table via the Table.executeInsert() method.
7.1 Output to file
Code:
//1. Create the streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//2. No EnvironmentSettings specified, so the old planner's streaming query is used by default
StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
env.setParallelism(1);
//3. Path of the CSV file to read
String filePath = "./data/user.csv";
//4. Read the CSV file and configure the field types
tabEnv.connect(new FileSystem().path(filePath)) // read files from the given path; the argument must be an implementation of ConnectorDescriptor
        .withFormat(new Csv()) // format used to parse the external file; must be an implementation of the FormatDescriptor abstract class
        .withSchema(new Schema()
                .field("id", DataTypes.INT())
                .field("name", DataTypes.STRING())
        ) // define the table schema
        .createTemporaryTable("inputTable");
//5. Turn the registered table into a Table object
Table inputTable = tabEnv.from("inputTable");
Table resultTable = inputTable.select("id,name");
// Define the result table and write the data into the result file
tabEnv.connect(new FileSystem().path("./data/user.log"))
        .withFormat(new Csv())
        .withSchema(new Schema() // this method must be specified
                .field("id", DataTypes.INT())
                .field("name", DataTypes.STRING())
        )
        .createTemporaryTable("outputTable");
resultTable.executeInsert("outputTable");
//6. Print for testing
tabEnv.toAppendStream(inputTable, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>() {})).print().setParallelism(1);
env.execute();
7.2 Output to Kafka
The processed data can also be written to Kafka. Combined with the Kafka input from the previous section, this forms a data pipeline: read from one Kafka topic and write the result to another Kafka topic:
String kafkaNode = "10.10.20.15:2181";
String kafkaNodeServer = "10.10.20.15:9092";
// Create the streaming execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// No EnvironmentSettings specified, so the old planner's streaming query is used by default
StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);
// Connect to Kafka and consume data
tabEnv.connect(new Kafka() // read data from Kafka
        .version("universal") // Kafka version of the environment: "0.8", "0.9", "0.10", "0.11", or "universal"
        .startFromEarliest()
        .topic("rate") // topic to consume
        .property("zookeeper.connect", kafkaNode) // ZooKeeper cluster address
        .property("bootstrap.servers", kafkaNodeServer) // Kafka broker address
).withFormat(new Csv())
        .withSchema(new Schema()
                .field("timestamp", DataTypes.BIGINT())
                .field("type", DataTypes.STRING())
                .field("rate", DataTypes.INT())
        ).createTemporaryTable("RateInputTable");
Table rateInputTable = tabEnv.from("RateInputTable");
// Connect to Kafka as the output sink
tabEnv.connect(new Kafka() // write data to Kafka
        .version("universal") // Kafka version of the environment: "0.8", "0.9", "0.10", "0.11", or "universal"
        .topic("output_rate") // topic to write to
        .property("zookeeper.connect", kafkaNode) // ZooKeeper cluster address
        .property("bootstrap.servers", kafkaNodeServer) // Kafka broker address
).withFormat(new Csv())
        .withSchema(new Schema()
                .field("timestamp", DataTypes.BIGINT())
                .field("type", DataTypes.STRING())
                .field("rate", DataTypes.INT())
        ).createTemporaryTable("RateOutputTable");
// Write the table data into the Kafka message queue
rateInputTable.executeInsert("RateOutputTable");
// Print the data
tabEnv.toAppendStream(rateInputTable, StreamOutputKafkaApplication.Rate.class).print();
env.execute();
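To check what the job wrote, a Kafka console consumer can be attached to the output topic (same broker address as above):

bin/kafka-console-consumer.sh --bootstrap-server 10.10.20.15:9092 --topic output_rate --from-beginning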
This article was created and shared by mirson. If you need further communication, please join the QQ group: 19310171 or visit www.softart.cn