1. Overview of Flink
1. Basic introduction
Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. It is designed to run in all common cluster environments, performing computations at in-memory speed and at any scale. Its main features include unified batch and stream processing, precise state management, event-time support, and exactly-once state consistency guarantees. Flink can run not only on resource management frameworks such as YARN, Mesos, and Kubernetes, but also as a standalone deployment on bare-metal clusters. With the high-availability option enabled, it has no single point of failure.
Here are two concepts:
- Boundary: data streams are either unbounded or bounded, which can be understood as the condition that determines how data is delimited and aggregated;
- State: whether executions depend on one another, i.e., whether the next computation relies on the result of the previous one (see the sketch after this list).
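A minimal sketch of the two boundary types, assuming the Flink 1.7 dependencies introduced later in this article (the class name and socket address are illustrative): the DataSet API works on bounded data, while the DataStream API works on unbounded data.

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundaryDemo {
    public static void main(String[] args) throws Exception {
        // Bounded: a finite, known data set, handled with the DataSet API
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements("java", "flink", "spark").print();

        // Unbounded: a continuous stream (here a socket), handled with the DataStream API
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.socketTextStream("hop01", 5566).print();
        streamEnv.execute("unbounded-demo");
    }
}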
2. Application scenarios
Data Driven
Event-driven applications do not need to query a remote database; local state access gives them higher throughput and lower latency. In an anti-fraud scenario, the processing rule model is written with the DataStream API and the whole logic is handed to the Flink engine. When events or data flow in, the corresponding rule model is triggered; once a rule's conditions are met, the application quickly processes the event and notifies the business system.
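As a hedged illustration of this pattern (not code from the original article), a simple anti-fraud rule can be expressed as a filter over the DataStream API; the event format "userId,amount" and the threshold are assumptions.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FraudRuleDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Assume each event line is "userId,amount"
        DataStream<String> events = env.socketTextStream("hop01", 5566);
        DataStream<String> alerts = events.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String event) {
                String[] fields = event.split(",");
                // Rule condition: flag amounts above an assumed threshold
                return Double.parseDouble(fields[1]) > 10000;
            }
        });
        alerts.print();  // in practice, notify the business application instead of printing
        env.execute("fraud-rule-demo");
    }
}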
Data Analytics
Compared with batch analysis, streaming analysis eliminates the periodic data import and query process, so the latency from event to metric is lower. Moreover, batch queries must deal with the artificial data boundaries caused by periodic imports and bounded input, while streaming queries do not. Flink supports both continuous streaming analysis and batch analysis well, which makes it suitable for scenarios such as real-time dashboards and real-time reports.
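A minimal sketch of continuous streaming analysis, for example a word count that refreshes every 10 seconds to feed a real-time report; the socket source and window size are assumptions.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingMetricDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.socketTextStream("hop01", 5566)
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split(",")) {
                       out.collect(new Tuple2<>(word, 1));
                   }
               }
           })
           .keyBy(0)
           .timeWindow(Time.seconds(10))  // refresh the metric every 10 seconds
           .sum(1)
           .print();
        env.execute("streaming-metric-demo");
    }
}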
Data Pipeline
Compared with periodic ETL jobs, a continuous data pipeline significantly reduces the delay of moving data to its destination. For example, upstream streaming ETL can clean or enrich data in real time, and a real-time data warehouse can be built downstream to keep queries fresh, forming an efficient data query chain. This pattern is very common in media-stream recommendation and search engines.
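A hedged sketch of such a continuous pipeline: read raw events, clean them in flight, and continuously write them downstream. The cleaning rule and output path are assumptions; a production pipeline would typically use a message-queue or warehouse sink instead of a text file.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamEtlDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.socketTextStream("hop01", 5566)
           // Clean: trim whitespace and normalize to lower case
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String raw) {
                   return raw.trim().toLowerCase();
               }
           })
           // Continuously deliver cleaned data downstream
           .writeAsText("/var/flink/test/etl-output");
        env.execute("stream-etl-demo");
    }
}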
2. Environment deployment
1. Installation package management
[root@hop01 opt]# tar -zxvf flink-1.7.0-bin-hadoop27-scala_2.11.tgz
[root@hop01 opt]# mv flink-1.7.0 flink1.7
2. Cluster configuration
Management node (JobManager)
[root@hop01 opt]# cd /opt/flink1.7/conf
[root@hop01 conf]# vim flink-conf.yaml
jobmanager.rpc.address: hop01
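A few other keys are commonly adjusted in flink-conf.yaml; the values below are the usual defaults, shown only for reference and not part of the original setup:
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
rest.port: 8081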
Worker nodes (TaskManager)
[root@hop01 conf]# vim slaves
hop02
hop03
Synchronize both configuration files to all cluster nodes.
3. Start and stop
/opt/flink1.7/bin/start-cluster.sh
/opt/flink1.7/bin/stop-cluster.sh
Startup log:
[root@hop01 conf]# /opt/flink1.7/bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host hop01.
Starting taskexecutor daemon on host hop02.
Starting taskexecutor daemon on host hop03.
4. Web interface
Visit: http://hop01:8081/
3. Development entry case
1. Data script
Distribute a test data file to each node:
/var/flink/test/word.txt
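The article does not show the file's contents; a plausible sample, consistent with the comma-separated splitting used in the code below, might look like:
java,flink,spark
flink,java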
2. Introduce basic dependencies
The basic example here is written in Java; first introduce the Maven dependencies.
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.7.0</version>
</dependency>
</dependencies>
3. Read file data
Here, the file's data is read directly and the program counts the number of occurrences of each word.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read the file data
        readFile();
    }
    public static void readFile() throws Exception {
        // 1. Create the execution environment
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();
        // 2. Read the data file
        String filePath = "/var/flink/test/word.txt";
        DataSet<String> inputFile = environment.readTextFile(filePath);
        // 3. Group by word and sum the counts
        DataSet<Tuple2<String, Integer>> wordDataSet = inputFile.flatMap(new WordFlatMapFunction())
                .groupBy(0).sum(1);
        // 4. Print the result
        wordDataSet.print();
    }
    // How each input line is read and split
    static class WordFlatMapFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) {
            String[] wordArr = input.split(",");
            for (String word : wordArr) {
                collector.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
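Note that in the DataSet API, print() itself triggers job execution, so no explicit execute() call is needed here; the streaming example in the next section, by contrast, must end with environment.execute().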
4. Read port data
Open a listening port on the hop01 host and simulate sending some data to it:
[root@hop01 ~]# nc -lk 5566
c++,java
Read and analyze the data from the port with a Flink program:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read the port data
        readPort();
    }
    public static void readPort() throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Read from the socket data port
        DataStreamSource<String> inputStream = environment.socketTextStream("hop01", 5566);
        // 3. Split each line and count the words
        SingleOutputStreamOperator<Tuple2<String, Integer>> resultDataStream = inputStream.flatMap(
                new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) {
                        String[] wordArr = input.split(",");
                        for (String word : wordArr) {
                            collector.collect(new Tuple2<>(word, 1));
                        }
                    }
                }).keyBy(0).sum(1);
        // 4. Print the analysis result
        resultDataStream.print();
        // 5. Start the streaming job
        environment.execute();
    }
}
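To run this job on the cluster instead of locally in the IDE, it can be packaged as a jar and submitted with the Flink CLI; the jar path and name below are assumptions for illustration:
/opt/flink1.7/bin/flink run -c WordCount /opt/jars/flink-word-count.jar
With nc -lk 5566 still running on hop01, each comma-separated line typed there is counted, and the printed results appear in the TaskManager's .out log (or in the console when run locally).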
4. Operating mechanism
FlinkClient
The client prepares the data flow program and sends it to the JobManager node. After that, depending on requirements, the client can either disconnect immediately or stay connected and wait for the job's processing result.
JobManager
In a Flink cluster, one JobManager node and at least one TaskManager node are started. After the JobManager receives a job submitted by the client, it coordinates the job and dispatches its tasks to specific TaskManager nodes for execution. The TaskManager nodes report heartbeats and processing status back to the JobManager.
TaskManager
A task slot is the smallest unit of resource scheduling in a TaskManager. The number of slots is set at startup; each slot can run one task, receiving tasks deployed by the JobManager node and performing the actual analysis and processing.
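For reference, the slot count per TaskManager and the default job parallelism are both set in flink-conf.yaml; the values below are illustrative, not taken from the article's cluster:
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2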
5. Source code address
GitEE address
https://gitee.com/cicadasmile/big-data-parent