1. Overview
Time and state are among the most important concepts in Flink. While I was learning Flink, my understanding of the watermark stayed vague for a long time. After a period of digestion, I am summarizing it here. This article describes what the watermark is, how it is produced, and what it is used for.
2. The watermark, a concept that is hard to grasp
Some people translate "watermark" literally, others call it the "water level line". Either way, the Chinese translation has no connection to everyday life; it is abstract and hard to understand. A closer everyday analogy is the clock hanging on the wall at home, or a wristwatch. With that analogy in mind, this article discusses the following questions:
1. How does it come about?
2. If it is a clock, does it behave like one? A clock steps forward every second, so its time only ever increases. Does the watermark share these characteristics?
3. What is the clock used for? When we look at a watch at night and see 12 o'clock, we tell ourselves "time to go to bed". The time tells us when to do something.
3. What is the watermark?
3.1. Definition of the watermark
The watermark is a logical clock. Why "logical"? Ordinary time comes from the machine's clock and advances at a fixed rate, whereas the time of this clock is computed by the program, dynamically, from the "event time" (what event time is and where it is used are not covered here). Suppose the result computed at some moment is x, for example 2022-10-10 10:10:10, whose timestamp is 1665367810000 (in UTC+8). As larger event times arrive, the computed value grows: x, x+1, x+2, x+3, x+4, ... This sequence of ever larger timestamps resembles a clock ticking forward every second.
3.2. Composition of the watermark (logical clock)
The watermark is a series of timestamps that never decrease, each computed dynamically from the event time. A clock likewise shows a series of increasing times, such as 2022-10-10 10:10:10, 2022-10-10 10:10:11, 2022-10-10 10:10:12, 2022-10-10 10:10:13, and so on. Because the watermark resembles a clock in daily life, I call it the logical clock. In this article "logical clock", "watermark", and "watermark mechanism" all mean the same thing.
3.3. Current time of the logical clock
The current time of the logical clock is like the current reading (hours, minutes, seconds) of a wall clock. This current time matters a great deal: closing windows and firing timers are both decided by it.
Characteristics of the current value: it only increases. A value of negative infinity (Long.MIN_VALUE) is in effect when the stream has just started, and a value of positive infinity (Long.MAX_VALUE) is inserted when the stream ends.
Personally, I think of this current value as a pointer-like variable whose target keeps changing (this is only my own understanding).
3.4. Formula for the current time
The "current time" of the clock corresponds to a specific timestamp: current value of the clock = largest event time seen so far - maximum delay - 1 ms.
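As a minimal sketch of this formula, the plain Java below computes the clock value, taking "event time" to mean the largest event time seen so far. It does not use Flink. The class name WatermarkFormula and the method name watermark are hypothetical, chosen only for illustration, and the 5-second delay is the value used by the example programs in this article.

```java
// Hypothetical helper, not part of Flink: evaluates the formula for a few inputs.
class WatermarkFormula {

    /** watermark = largest event time seen so far - maximum delay - 1 ms */
    static long watermark(long maxEventTimeMs, long maxDelayMs) {
        return maxEventTimeMs - maxDelayMs - 1L;
    }

    public static void main(String[] args) {
        long delay = 5000L; // 5 s maximum delay
        System.out.println(watermark(1000L, delay));          // -4001
        System.out.println(watermark(10000L, delay));         // 4999
        System.out.println(watermark(1665367810000L, delay)); // 1665367804999
    }
}
```

The subtraction of 1 ms means that a watermark of value w only promises that events with timestamps up to and including w have been seen, so an event stamped exactly "largest event time - delay" is still on time.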
3.5. An example
Case description: read data from a socket and print the current value of the watermark.
package com.deepexi.sql;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class ExampleTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env
                // read lines such as "a 1000" from the socket
                .socketTextStream("192.168.117.211", 9999)
                .map(r -> Tuple2.of(r.split(" ")[0], Long.parseLong(r.split(" ")[1])))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .assignTimestampsAndWatermarks(
                        // maximum delay of 5 s
                        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                                    @Override
                                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                        // the second field is the event time
                                        return element.f1;
                                    }
                                })
                )
                // partition the stream by key
                .keyBy(r -> r.f0)
                .process(new KeyedProcessFunction<String, Tuple2<String, Long>, String>() {
                    @Override
                    public void processElement(Tuple2<String, Long> value, Context ctx, Collector<String> out) throws Exception {
                        // print the watermark that is current when this element is processed
                        out.collect("Current watermark: " + ctx.timerService().currentWatermark());
                    }
                })
                .print();
        env.execute();
    }
}
Run nc -lk 9999 to start the socket service listening on port 9999, then type on the command line: a 1000
[root@localhost ~]# nc -lk 9999
a 1000
The IDEA console prints: Current watermark: -9223372036854775808 // -9223372036854775808 is Long.MIN_VALUE, which stands for negative infinity
Command line input: a 2000
The IDEA console prints: Current watermark: -4001 // current watermark = event time - max delay - 1 = 1000 - 5000 - 1 = -4001
Why is this value computed from 1000 rather than from the 2000 just entered? Flink periodically inserts watermarks into the stream, and a watermark is itself an element of the stream. The watermark derived from an element is emitted only after that element has been passed on, so processElement still sees the watermark produced by the previous input. Let's understand it by looking at the figure below.
Command line input: a 3000
The IDEA console prints: Current watermark: -3001 // 2000 - 5000 - 1 = -3001
Command line input: a 10000
The IDEA console prints: Current watermark: -2001 // 3000 - 5000 - 1 = -2001
Command line input: a 1000
The IDEA console prints: Current watermark: 4999 // 10000 - 5000 - 1 = 4999
Command line input: a 1000
The IDEA console prints: Current watermark: 4999 // 10000 - 5000 - 1 = 4999
Command line input: a 2000
The IDEA console prints: Current watermark: 4999 // 10000 - 5000 - 1 = 4999
The console output shows that the watermark behaves like a clock: its value follows the event time forward and never moves back. After a 10000 has been seen, entering a 1000 and a 2000 again leaves the watermark at 4999.
The complete command-line session:
[root@master ~]# nc -lk 9999
a 1000
a 2000
a 3000
a 10000
a 1000
a 1000
a 2000
IDEA prints:
Current watermark: -9223372036854775808
Current watermark: -4001
Current watermark: -3001
Current watermark: -2001
Current watermark: 4999
Current watermark: 4999
Current watermark: 4999
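The one-step lag in this output can be reproduced with a small simulation in plain Java, without Flink. It is only a sketch under one assumption: a periodic watermark is emitted after each input and before the next one, which holds when lines are typed by hand and the emit interval is 200 ms. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Flink code.
class WatermarkLagDemo {

    /** Returns the watermark that is current at the moment each event is processed. */
    static List<Long> observedWatermarks(long[] eventTimes, long maxDelayMs) {
        List<Long> observed = new ArrayList<>();
        long currentWatermark = Long.MIN_VALUE;        // in effect before any watermark arrives
        long maxTs = Long.MIN_VALUE + maxDelayMs + 1L; // start value that cannot underflow below
        for (long ts : eventTimes) {
            observed.add(currentWatermark);            // the element is processed first
            maxTs = Math.max(maxTs, ts);               // then the largest event time is updated
            long candidate = maxTs - maxDelayMs - 1L;  // next periodic watermark
            currentWatermark = Math.max(currentWatermark, candidate); // never moves backwards
        }
        return observed;
    }

    public static void main(String[] args) {
        long[] input = {1000L, 2000L, 3000L, 10000L, 1000L, 1000L, 2000L};
        for (long wm : observedWatermarks(input, 5000L)) {
            System.out.println("Current watermark: " + wm);
        }
    }
}
```

Run with the seven inputs of the session above, the sketch prints -9223372036854775808, -4001, -3001, -2001, 4999, 4999, 4999, the same values as the console.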
4. How the watermark is produced
A watermark is essentially a timestamp, computed dynamically by the program from the event time. Let's go straight to an example.
Case 1
Customize the watermark generation logic by implementing the WatermarkStrategy interface. By default, Flink calls the onPeriodicEmit method every 200 milliseconds.
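import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.common.eventtime.TimestampAssignerSupplier;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkGeneratorSupplier;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExampleTest2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // optionally emit a watermark every 6 seconds instead of the default 200 ms
        //env.getConfig().setAutoWatermarkInterval(6 * 1000L);
        env
                .socketTextStream("192.168.117.211", 9999)
                .map(new MapFunction<String, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(String value) throws Exception {
                        String[] arr = value.split(" ");
                        return Tuple2.of(arr[0], Long.parseLong(arr[1]));
                    }
                })
                .assignTimestampsAndWatermarks(new CustomWatermarkGenerator())
                .print();
        env.execute();
    }

    public static class CustomWatermarkGenerator implements WatermarkStrategy<Tuple2<String, Long>> {
        @Override
        public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
            return new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                @Override
                public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                    return element.f1;
                }
            };
        }

        @Override
        public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
            return new WatermarkGenerator<Tuple2<String, Long>>() {
                // maximum delay
                private final long bound = 5000L;
                // start value chosen so that maxTs - bound - 1 cannot underflow
                private long maxTs = -Long.MAX_VALUE + bound + 1L;

                @Override
                public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
                    // update the largest event time observed so far
                    maxTs = Math.max(maxTs, event.f1);
                }

                @Override
                public void onPeriodicEmit(WatermarkOutput output) {
                    System.out.println("Watermark value: " + (maxTs - bound - 1L));
                    // emit the watermark: largest event time - maximum delay - 1 ms
                    output.emitWatermark(new Watermark(maxTs - bound - 1L));
                }
            };
        }
    }
}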
Run nc -lk 9999 to start the socket service on port 9999, then start the program in IDEA. The console prints "Watermark value: xxxxx" every 200 milliseconds, as follows:
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Command line input: a 1000
The console keeps printing every 200 milliseconds, now with the new value:
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Command line input: a 2000
The console keeps printing every 200 milliseconds, now with the new value:
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
// By default a watermark is inserted into the stream every 200 milliseconds; the interval can be changed, here to 6 seconds
env.getConfig().setAutoWatermarkInterval(6 * 1000L);
The complete command-line session:
[root@master ~]# nc -lk 9999
a 1000
a 2000
IDEA prints:
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
(a,1000)
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
(a,2000)
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Disconnected from the target VM, address: '127.0.0.1:58591', transport: 'socket'
Watermark value: -3001
Process finished with exit code 130
From the results we can see that the watermark value follows the event times 1000 and 2000. What does the console print if you enter a 2000 and then a 1000? It prints "Watermark value: -3001", because the watermark, like time, only moves forward.
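The out-of-order question can be checked without a cluster. The following plain-Java sketch keeps only the state of the custom generator (the largest event time and the delay) and leaves out everything that belongs to Flink. The class BoundedDelayClock and its methods are hypothetical names, not a Flink API.

```java
// Hypothetical stand-in for the state kept by the custom generator.
class BoundedDelayClock {
    private final long maxDelayMs;
    private long maxTs;

    BoundedDelayClock(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
        // start value chosen so that the subtraction in currentWatermark() cannot underflow
        this.maxTs = -Long.MAX_VALUE + maxDelayMs + 1L;
    }

    /** Called for every event: remember the largest event time seen so far. */
    void onEvent(long eventTimeMs) {
        maxTs = Math.max(maxTs, eventTimeMs);
    }

    /** Value that a periodic emit would send right now. */
    long currentWatermark() {
        return maxTs - maxDelayMs - 1L;
    }

    public static void main(String[] args) {
        BoundedDelayClock clock = new BoundedDelayClock(5000L);
        System.out.println(clock.currentWatermark()); // -9223372036854775807
        clock.onEvent(1000L);
        System.out.println(clock.currentWatermark()); // -4001
        clock.onEvent(2000L);
        System.out.println(clock.currentWatermark()); // -3001
        clock.onEvent(1000L);                         // older event arrives after a newer one
        System.out.println(clock.currentWatermark()); // still -3001
    }
}
```

Because onEvent takes the maximum instead of overwriting, an older event cannot pull the clock back, which is exactly the "only gets bigger" behaviour described above.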
Case 2
Modify the program so that, after keyBy, it prints each element entered on the command line.
Run nc -lk 9999 to start the socket listener on port 9999, then start the program in IDEA.
Command line input:
[root@localhost ~]# nc -lk 9999
a 1000
a 2000
a 5000
a 6000
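The IDEA console prints:
Watermark value: -9223372036854775807
Watermark value: -9223372036854775807
Input business data: (a,1000)
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Watermark value: -4001
Input business data: (a,2000)
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Watermark value: -3001
Input business data: (a,5000)
Watermark value: -1
Watermark value: -1
Watermark value: -1
Watermark value: -1
Watermark value: -1
Watermark value: -1
Watermark value: -1
Watermark value: -1
Input business data: (a,6000)
Watermark value: 999
Watermark value: 999
Watermark value: 999
Watermark value: 999
Watermark value: 999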
Analysis of the output, written as one sequence:
-9223372036854775807,-9223372036854775807,(a,1000),-4001,-4001,-4001,-4001,-4001,-4001,-4001,-4001,(a,2000),-3001,-3001,-3001,-3001,-3001,(a,5000),-1,-1,-1,(a,6000),999,999,999,999
Have you noticed the relationship between the watermark and the business data? It resembles fallen flowers on running water: the business data is the water in the river, and the watermarks are flowers that have fallen onto it. Both flow to the sea together. Watermarks and business records are both elements of the same stream.
5. What is it used for?
The logical clock serves as the time reference within the stream. Take the wall clock again: when we see that it is already 12 o'clock, we tell ourselves it is time to put down the phone and go to bed. A continuous data stream is split into segments (windows), and statistics are computed for each segment. When should the computation be triggered? This is where the logical clock comes in: each window checks the current time of the logical clock, and once the clock has reached the end of the window, the window is closed and its statistics are computed.
Case 1: the watermark triggers a timer. Description: a timer fires once the current watermark is greater than or equal to the timer's timestamp.
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.time.Duration;

public class ExampleTest3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env
                .socketTextStream("192.168.117.211", 9999)
                .map(r -> Tuple2.of(r.split(" ")[0], Long.parseLong(r.split(" ")[1])))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                                    @Override
                                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                        return element.f1;
                                    }
                                })
                )
                .keyBy(r -> r.f0)
                .process(new KeyedProcessFunction<String, Tuple2<String, Long>, String>() {
                    @Override
                    public void processElement(Tuple2<String, Long> value, Context ctx, Collector<String> out) throws Exception {
                        // out.collect("Current watermark: " + ctx.timerService().currentWatermark());
                        // register an event-time timer 5 s after the element's event time
                        ctx.timerService().registerEventTimeTimer(value.f1 + 5000L);
                        out.collect("Registered a timer with timestamp: " + new Timestamp(value.f1 + 5000L));
                    }

                    @Override
                    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                        super.onTimer(timestamp, ctx, out);
                        out.collect("Timer fired!");
                    }
                })
                .print();
        env.execute();
    }
}
Run nc -lk 9999 to start the socket service on port 9999. Command line input: a 1665367810000 // 1665367810000 corresponds to 2022-10-10 10:10:10
Console output: Registered a timer with timestamp: 2022-10-10 10:10:15.0 // 2022-10-10 10:10:15 corresponds to the timestamp 1665367815000
After this input the watermark becomes: 2022-10-10 10:10:10 - 5 s - 1 ms = 1665367810000 - 5000 - 1 = 1665367804999. The timer fires once the watermark reaches 1665367815000 or more.
Command line input: a 1665367821000 // 1665367821000 corresponds to 2022-10-10 10:10:21. The watermark advances to 1665367821000 - 5000 - 1 = 1665367815999, which is past 1665367815000, so the timer fires. Console output: Timer fired!
Command line input:
[root@master ~]# nc -lk 9999
a 1665367810000
a 1665367821000
IDEA prints:
Registered a timer with timestamp: 2022-10-10 10:10:15.0
Registered a timer with timestamp: 2022-10-10 10:10:26.0
Timer fired!
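Why only one of the two timers fires can be traced with a small model of an event-time timer queue. This is a simplified sketch in plain Java, not Flink's implementation: it ignores keys and deduplication, and the class EventTimeTimers with its two methods is a hypothetical name. The rule it applies, fire every timer whose timestamp is less than or equal to the watermark, is the one described in this case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical, simplified model of event-time timers (no keys, no state backend).
class EventTimeTimers {
    private final PriorityQueue<Long> timers = new PriorityQueue<>();

    void register(long timestampMs) {
        timers.add(timestampMs);
    }

    /** Removes and returns every timer whose timestamp is <= the new watermark. */
    List<Long> advanceWatermark(long watermarkMs) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.peek() <= watermarkMs) {
            fired.add(timers.poll());
        }
        return fired;
    }

    public static void main(String[] args) {
        long delay = 5000L;
        EventTimeTimers service = new EventTimeTimers();

        service.register(1665367810000L + 5000L);                           // timer at 1665367815000
        System.out.println(service.advanceWatermark(1665367810000L - delay - 1L)); // []

        service.register(1665367821000L + 5000L);                           // timer at 1665367826000
        System.out.println(service.advanceWatermark(1665367821000L - delay - 1L)); // [1665367815000]
    }
}
```

The second timer, at 1665367826000, stays in the queue because the watermark has only reached 1665367815999. That matches the console output: two registrations but a single "fired" line.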
Case 2: the window closes once the watermark reaches the last millisecond of the window (window end time - 1 ms)
Case day3.Example4
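import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.sql.Timestamp;
import java.time.Duration;

public class ExampleTest4 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env
                .socketTextStream("192.168.117.211", 9999)
                .map(new MapFunction<String, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(String value) throws Exception {
                        String[] arr = value.split(" ");
                        return Tuple2.of(arr[0], Long.parseLong(arr[1]));
                    }
                })
                .assignTimestampsAndWatermarks(
                        // maximum delay set to 5 seconds
                        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                                    @Override
                                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                        return element.f1; // tell Flink which field holds the event time
                                    }
                                })
                )
                .keyBy(r -> r.f0)
                // 5-second tumbling event-time window
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .process(new ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow>() {
                    @Override
                    public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> out) throws Exception {
                        long windowStart = context.window().getStart();
                        long windowEnd = context.window().getEnd();
                        // Both debug lines print the watermark, so they show the same value when enabled.
                        // System.out.println("Window end value: " + context.currentWatermark());
                        // System.out.println("Current watermark value: " + context.currentWatermark());
                        long count = elements.spliterator().getExactSizeIfKnown();
                        out.collect("User " + key + ", window "
                                + new Timestamp(windowStart) + "~" + new Timestamp(windowEnd)
                                + ", pv count: " + count);
                    }
                })
                .print();
        env.execute();
    }
}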
Command line input: a 1665367810000 // Flink opens the window 2022-10-10 10:10:10.0~2022-10-10 10:10:15.0 (the end is exclusive). The window fires and closes once the current watermark (the "current time" described above) reaches the window's last millisecond, 1665367814999.
Command line input: a 1665367821000 // The watermark is now 1665367821000 - 5000 - 1 = 1665367815999, which is 2022-10-10 10:10:15.999. This is past the window end 2022-10-10 10:10:15, so the window fires and closes.
Console output: User a, window 2022-10-10 10:10:10.0~2022-10-10 10:10:15.0, pv count: 1
Command line:
[root@master ~]# nc -lk 9999
a 1665367810000
a 1665367821000
IDEA prints:
Window end value: 1665367815999
Current watermark value: 1665367815999
User a, window 2022-10-10 10:10:10.0~2022-10-10 10:10:15.0, pv count: 1
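The window arithmetic in this case can be written down in a few lines of plain Java. This is a sketch under two assumptions: windows are aligned to multiples of the window size with no offset, and a window fires once the watermark reaches its last millisecond (end - 1). The class TumblingWindowSketch and its methods are hypothetical names, not Flink classes.

```java
// Hypothetical helper showing how a 5-second tumbling window relates to the watermark.
class TumblingWindowSketch {

    /** Start of the window containing ts, for windows aligned to multiples of sizeMs. */
    static long windowStart(long ts, long sizeMs) {
        return ts - Math.floorMod(ts, sizeMs);
    }

    /** Exclusive end of the window containing ts. */
    static long windowEnd(long ts, long sizeMs) {
        return windowStart(ts, sizeMs) + sizeMs;
    }

    /** A window fires once the watermark reaches its last included millisecond. */
    static boolean fires(long windowEndMs, long watermarkMs) {
        return watermarkMs >= windowEndMs - 1L;
    }

    public static void main(String[] args) {
        long size = 5000L;
        long delay = 5000L;
        long first = 1665367810000L;
        long second = 1665367821000L;

        long end = windowEnd(first, size);
        System.out.println(windowStart(first, size));           // 1665367810000
        System.out.println(end);                                // 1665367815000
        System.out.println(fires(end, first - delay - 1L));     // false (watermark 1665367804999)
        System.out.println(fires(end, second - delay - 1L));    // true  (watermark 1665367815999)
    }
}
```

The first input alone cannot close its own window, because the watermark trails it by the delay. It takes a later event, here the one at 1665367821000, to push the clock past the window end.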
Statistics based on "processing time" also need a reference time for closing windows, but that time comes from the machine's system clock, and windows close according to it. The value of the logical clock at any given moment, by contrast, is computed by the program, which is why the watermark is called a logical clock.
6. Handling late data
6.1. What is late data
Late data is data whose event time is less than the current watermark. For example, if a record xxx in the stream carries the event time 20:50 while the logical clock already reads 20:51, Flink treats xxx as late data.
Case description: manually emit watermarks and elements that carry an event time.
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.util.Collector;

public class ExampleTest5 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<String> result = env
                .addSource(new SourceFunction<String>() {
                    @Override
                    public void run(SourceContext<String> ctx) throws Exception {
                        // emit "hello world" with event time 1000
                        ctx.collectWithTimestamp("hello world", 1000L);
                        // emit a watermark
                        ctx.emitWatermark(new Watermark(999L));
                        // emit "hello flink" with event time 2000
                        ctx.collectWithTimestamp("hello flink", 2000L);
                        // emit a watermark
                        ctx.emitWatermark(new Watermark(1999L));
                        // emit "hello late" with event time 1000, behind the watermark
                        ctx.collectWithTimestamp("hello late", 1000L);
                    }

                    @Override
                    public void cancel() {
                    }
                })
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
                        //System.out.println("Current watermark: " + ctx.timerService().currentWatermark());
                        // an element is late if its event time is less than the watermark
                        if (ctx.timestamp() < ctx.timerService().currentWatermark()) {
                            System.out.println("Late element: " + value);
                        } else {
                            System.out.println("Normal element: " + value);
                        }
                    }
                });
        env.execute();
    }
}
Console output:
Normal element: hello world
Normal element: hello flink
Late element: hello late
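The three classifications above can be replayed by hand. The sketch below, in plain Java without Flink, applies the same comparison as the example (event time less than the current watermark) to the same sequence of elements and watermarks. The class LateElementCheck and its method are hypothetical names.

```java
// Hypothetical replay of the manual source: elements and watermarks in emission order.
class LateElementCheck {

    /** Same comparison as in the example: late means event time < current watermark. */
    static boolean isLate(long eventTimeMs, long currentWatermarkMs) {
        return eventTimeMs < currentWatermarkMs;
    }

    static String classify(String value, long eventTimeMs, long currentWatermarkMs) {
        return (isLate(eventTimeMs, currentWatermarkMs) ? "Late element: " : "Normal element: ") + value;
    }

    public static void main(String[] args) {
        long watermark = Long.MIN_VALUE; // no watermark has arrived yet
        System.out.println(classify("hello world", 1000L, watermark)); // Normal element: hello world
        watermark = 999L;
        System.out.println(classify("hello flink", 2000L, watermark)); // Normal element: hello flink
        watermark = 1999L;
        System.out.println(classify("hello late", 1000L, watermark));  // Late element: hello late
    }
}
```

"hello world" and "hello late" carry the same event time, 1000. Only the position of the clock when each one arrives decides which of them is late.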
6.2. Handling late elements
Now that we know what a late element is, how should it be handled? Flink provides several options; one of them is to send late data to a side output.
Case: late data sent to a "side output stream"
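import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class ExampleTest {
    // define the side output (anonymous subclass so the element type is retained)
    private static OutputTag<String> lateElement = new OutputTag<String>("late-element") {
    };

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<String> result = env
                .addSource(new SourceFunction<String>() {
                    @Override
                    public void run(SourceContext<String> ctx) throws Exception {
                        // emit "hello world" with event time 1000
                        ctx.collectWithTimestamp("hello world", 1000L);
                        // emit a watermark
                        ctx.emitWatermark(new Watermark(999L));
                        // emit "hello flink" with event time 2000
                        ctx.collectWithTimestamp("hello flink", 2000L);
                        // emit a watermark
                        ctx.emitWatermark(new Watermark(1999L));
                        // emit "hello late" with event time 1000, behind the watermark
                        ctx.collectWithTimestamp("hello late", 1000L);
                    }

                    @Override
                    public void cancel() {
                    }
                })
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
                        // an element is late if its event time is less than the watermark
                        if (ctx.timestamp() < ctx.timerService().currentWatermark()) {
                            ctx.output(lateElement, "Late element sent to side output: " + value);
                        } else {
                            out.collect("Element arrived on time: " + value);
                        }
                    }
                });
        result.print("main stream:");
        result.getSideOutput(lateElement).print("side output:");
        env.execute();
    }
}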
IDEA console output:
main stream:> Element arrived on time: hello world
main stream:> Element arrived on time: hello flink
side output:> Late element sent to side output: hello late
Something to think about: what is the relationship between windows, late elements, and the watermark?
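One possible way to connect the three, offered as a sketch and not as the author's answer: an element is too late for its window when the watermark has already reached that window's last millisecond, because the window has fired by then. The plain-Java check below assumes tumbling windows aligned to multiples of the window size and no extra allowed lateness. The class WindowLateness and its method are hypothetical names.

```java
// Hypothetical check: has the window an element belongs to already fired?
class WindowLateness {

    /** True when the element's window has already been closed by the watermark. */
    static boolean isLateForWindow(long eventTimeMs, long windowSizeMs, long watermarkMs) {
        long start = eventTimeMs - Math.floorMod(eventTimeMs, windowSizeMs);
        long lastMs = start + windowSizeMs - 1L; // last millisecond that belongs to the window
        return lastMs <= watermarkMs;
    }

    public static void main(String[] args) {
        long size = 5000L;
        long watermark = 1665367815999L; // clock value after the input a 1665367821000

        // Belongs to the window ending at 1665367815000, which has already fired.
        System.out.println(isLateForWindow(1665367812000L, size, watermark)); // true

        // Behind the watermark by 499 ms, but its window ends at 1665367820000 and is still open.
        System.out.println(isLateForWindow(1665367815500L, size, watermark)); // false
    }
}
```

The second call shows that "behind the watermark" and "too late for the window" are not the same thing: an element can trail the clock and still be counted, as long as the clock has not yet reached the end of its window.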
7. Summary
The watermark is like a clock in daily life. A clock tells us the current hour, minute, and second; in Flink this "current time" is a timestamp, and that timestamp is used to trigger the closing of windows and the firing of timers. In both cases the clock serves as a reference.