
1. Overview of Flink

1. Basic introduction

Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. It is designed to run in all common cluster environments, performing computations at in-memory speed and at any scale. Its main features include unified batch and stream processing, precise state management, event-time support, and exactly-once state consistency guarantees. Flink runs not only on resource management frameworks such as YARN, Mesos, and Kubernetes, but can also be deployed standalone on bare-metal clusters. With the high-availability option enabled, it has no single point of failure.

Here are two key concepts:

  • Boundedness: a data stream may be unbounded or bounded, which can be understood as the aggregation strategy or condition applied to the data (see the sketch after this list);
  • State: whether there is a dependency between executions, that is, whether the next computation depends on the previous result;
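
Both kinds of source appear later in this article; below is a minimal sketch of the distinction, reusing the file path and socket port from the walkthrough that follows (the class itself is illustrative, not part of the article's source code):

 import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Boundedness {
    public static void main(String[] args) throws Exception {
        // Bounded: a file holds a finite number of records, so the job can finish.
        DataSet<String> bounded = ExecutionEnvironment.getExecutionEnvironment()
                .readTextFile("/var/flink/test/word.txt");

        // Unbounded: a socket keeps producing records until the job is cancelled.
        DataStreamSource<String> unbounded = StreamExecutionEnvironment.getExecutionEnvironment()
                .socketTextStream("hop01", 5566);
    }
}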

2. Application scenarios

Data Driven

Event-driven applications do not need to query a remote database; local data access gives them higher throughput and lower latency. In an anti-fraud scenario, for example, the processing rule models are written with the DataStream API and the whole logic is abstracted onto the Flink engine; when events or data flow in, the corresponding rule model is triggered. Once the conditions in a rule are met, the application processes the event quickly and notifies the business systems.
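
As a toy illustration of this pattern (the threshold rule, input format, and job name here are hypothetical, not from the original article), a rule model can be expressed as a predicate over the stream:

 import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FraudRule {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: one transaction amount per line on the socket.
        DataStream<String> events = env.socketTextStream("hop01", 5566);

        // Rule model: flag any transaction above a fixed threshold.
        events.map(Double::parseDouble)
              .filter(amount -> amount > 10000.0)
              .map(amount -> "ALERT: suspicious transaction " + amount)
              .print();   // stand-in for notifying the business application

        env.execute("fraud-rule-demo");
    }
}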

Data Analytics

Compared with batch analysis, streaming analysis eliminates the periodic data import and query process, so the latency from event to metric is lower. Moreover, batch queries must deal with the artificial data boundaries caused by periodic imports and bounded input, while streaming queries have no such problem. Flink provides good support for both continuous streaming analysis and batch analysis; processing and analyzing data in real time enables scenarios such as real-time dashboards and real-time reports.

Data Pipeline

Compared with periodic ETL jobs, a continuous data pipeline can significantly reduce the latency of moving data to its destination. For example, with upstream streaming ETL cleaning or enriching data in real time, a real-time data warehouse can be built downstream to keep query results fresh, forming an efficient data query link. This pattern is very common in media stream recommendation and search engines.

2. Environment deployment

1. Installation package management

 [root@hop01 opt]# tar -zxvf flink-1.7.0-bin-hadoop27-scala_2.11.tgz
[root@hop01 opt]# mv flink-1.7.0 flink1.7

2. Cluster configuration

Management node

 [root@hop01 opt]# cd /opt/flink1.7/conf
[root@hop01 conf]# vim flink-conf.yaml

jobmanager.rpc.address: hop01
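
Besides the RPC address, a few other commonly adjusted keys in flink-conf.yaml (the values shown are illustrative defaults, not from the original setup):

 jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
rest.port: 8081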

Worker nodes

 [root@hop01 conf]# vim slaves

hop02
hop03

Synchronize both configurations to all cluster nodes.
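
For example, the whole installation directory can be copied to the other nodes with scp (assuming the same path on every host):

 [root@hop01 opt]# scp -r /opt/flink1.7 root@hop02:/opt/
[root@hop01 opt]# scp -r /opt/flink1.7 root@hop03:/opt/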

3. Start and stop

 /opt/flink1.7/bin/start-cluster.sh
/opt/flink1.7/bin/stop-cluster.sh

Startup log:

 [root@hop01 conf]# /opt/flink1.7/bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host hop01.
Starting taskexecutor daemon on host hop02.
Starting taskexecutor daemon on host hop03.
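
The processes can then be verified with jps on each node; in this Flink version the JobManager runs as StandaloneSessionClusterEntrypoint and each TaskManager as TaskManagerRunner (PIDs omitted):

 [root@hop01 conf]# jps
StandaloneSessionClusterEntrypoint
[root@hop02 ~]# jps
TaskManagerRunner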

4. Web interface

Visit: http://hop01:8081/

3. Introductory development example

1. Data file

Distribute a data file to every node:

 /var/flink/test/word.txt
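
The examples below split each line on commas, so word.txt holds comma-separated words; sample content (assumed for this walkthrough, not given in the original article):

 c++,java
java,flink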

2. Introduce basic dependencies

The basic example here is written in Java.

 <dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
</dependencies>

3. Read file data

Here the file data is read directly, and the program counts the number of occurrences of each word.

 import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read the file data
        readFile();
    }

    public static void readFile() throws Exception {
        // 1. Create the execution environment
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();

        // 2. Read the data file
        String filePath = "/var/flink/test/word.txt";
        DataSet<String> inputFile = environment.readTextFile(filePath);

        // 3. Group by word and sum the counts
        DataSet<Tuple2<String, Integer>> wordDataSet = inputFile
                .flatMap(new WordFlatMapFunction())
                .groupBy(0)
                .sum(1);

        // 4. Print the result
        wordDataSet.print();
    }

    // How each line is read and split
    static class WordFlatMapFunction implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) {
            String[] wordArr = input.split(",");
            for (String word : wordArr) {
                collector.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
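
With the sample word.txt assumed above, print() emits one tuple per distinct word, for example (the output order is not guaranteed):

 (c++,1)
(flink,1)
(java,2)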

4. Read port data

Open a port on the hop01 host and simulate sending some data to it:

 [root@hop01 ~]# nc -lk 5566
c++,java

Read and analyze the port data with a Flink program:

 import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read the port data
        readPort();
    }

    public static void readPort() throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Read from the socket data port
        DataStreamSource<String> inputStream = environment.socketTextStream("hop01", 5566);

        // 3. How each line is read and split
        SingleOutputStreamOperator<Tuple2<String, Integer>> resultDataStream = inputStream.flatMap(
                new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) {
                String[] wordArr = input.split(",");
                for (String word : wordArr) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        }).keyBy(0).sum(1);

        // 4. Print the analysis result
        resultDataStream.print();

        // 5. Start the environment
        environment.execute();
    }
}
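
As lines such as c++,java are typed into the nc session, the job prints a running count per word; when the job runs with parallelism greater than 1, DataStream print() prefixes each record with the subtask index, for example:

 1> (c++,1)
2> (java,1)
2> (java,2)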

4. Operating mechanism

FlinkClient

The client prepares the data flow (the job) and submits it to the JobManager node. After submission, depending on requirements, the client can either disconnect immediately or keep the connection open and wait for the job result.
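
A job is typically submitted through the flink command-line client, for example (the jar path and entry class below are placeholders):

 /opt/flink1.7/bin/flink run -c com.example.WordCount /opt/jobs/word-count.jar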

JobManager

In a Flink cluster, one JobManager node and at least one TaskManager node are started. After the JobManager receives a job submitted by the client, it coordinates the job and dispatches tasks to specific TaskManager nodes for execution. The TaskManager nodes send heartbeats and processing status back to the JobManager.

TaskManager

A task slot is the smallest unit of resource scheduling in a TaskManager. The number of slots is set at startup. Each slot can run one task, receiving tasks deployed by the JobManager node and performing the concrete analysis and processing.
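
The slot count per TaskManager and the default job parallelism are configured in flink-conf.yaml; in a standalone cluster with slot sharing, a job needs roughly as many slots as its maximum parallelism (the values below are illustrative):

 taskmanager.numberOfTaskSlots: 2
parallelism.default: 2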

5. Source code address

 Gitee address:
https://gitee.com/cicadasmile/big-data-parent
