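Preface
In the second part of this series we promised to cover Flink's watermarks and triggers. Let's start with the concepts of watermarks, triggers, and the lifecycle of late data. These ideas are somewhat abstract and take some thought to understand.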
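- Event time, ingestion time, processing time
Before we can understand watermarks, we need to introduce Flink's three notions of time: event time, ingestion time, and processing time.
Let's start with a figure: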
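Let me ask and answer a question myself:
Q: Why do we need both event time and processing time?
A: After a producer produces a message, network latency and other factors mean that we (Flink) always receive it later than the moment it was produced. There has to be some bound on that gap. Suppose I (Flink) am willing to wait 10 s, or longer: data that arrives within those 10 seconds is early or on time, and data that arrives after them is late. Early and on-time data can be processed normally, but what about late data? Should it be dropped, or parked somewhere and handled together later? Flink has already thought about all of this and provides classes and methods for it. The wheel is already built; all you have to do is set sail... haha, I digress.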
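As the figure above shows, take reading from a queue as the example. Event time is the moment the producer creates the record, and it is stored in the record itself. Ingestion time is the moment we fetch the producer's message from the data source, and processing time is the moment we actually process the record. Compared with event time, an ingestion-time program cannot handle out-of-order or late events, but it also does not have to specify how watermarks are generated. Internally, ingestion time is treated much like event time, but with automatic timestamp assignment and automatic watermark generation.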
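To make the three timestamps and the "wait 10 s" rule from the Q&A concrete, here is a minimal plain-Java sketch with no Flink dependency. The class `TimeSemanticsSketch`, its field names, and the `classify` helper are invented for illustration. Real Flink decides lateness from watermarks rather than from a fixed comparison like this one, so treat it only as a mental model.

```java
/** Illustrative only: one record carrying the three timestamps discussed above. */
class TimeSemanticsSketch {

    /** Hypothetical tolerance: how long we are willing to wait for a record (10 s, as in the text). */
    static final long MAX_DELAY_MS = 10_000L;

    final long eventTime;       // set by the producer when the record is created
    final long ingestionTime;   // set when the source reads the record
    final long processingTime;  // wall-clock time when an operator handles the record

    TimeSemanticsSketch(long eventTime, long ingestionTime, long processingTime) {
        this.eventTime = eventTime;
        this.ingestionTime = ingestionTime;
        this.processingTime = processingTime;
    }

    /** A record counts as "late" if it reached the source more than MAX_DELAY_MS after it was produced. */
    String classify() {
        return (ingestionTime - eventTime) > MAX_DELAY_MS ? "late" : "on time";
    }

    public static void main(String[] args) {
        // Arrives 0.5 s after it was produced.
        TimeSemanticsSketch fast = new TimeSemanticsSketch(1_000L, 1_500L, 1_600L);
        // Arrives 11.5 s after it was produced.
        TimeSemanticsSketch slow = new TimeSemanticsSketch(1_000L, 12_500L, 12_600L);
        System.out.println("fast record: " + fast.classify());
        System.out.println("slow record: " + slow.classify());
    }
}
```

The point of the sketch is that lateness is judged against the timestamp stored in the record, not against the moment the operator happens to process it.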
Next we will work through a concrete example to see how watermarks and triggers are used.
import cn.crawler.mft_seconed.KafkaEntity;
import com.alibaba.fastjson.JSON;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import javax.annotation.Nullable;
import java.text.SimpleDateFormat;
import java.util.Properties;

public class WatermarkTest {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Use event time, i.e. the timestamp carried by each record.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "172.19.141.60:31090");
        properties.setProperty("group.id", "crm_stream_window");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        DataStream<String> stream =
                env.addSource(new FlinkKafkaConsumer011<>("test-demo12", new SimpleStringSchema(), properties));
        env.setParallelism(1);

        // Parse each JSON message into (name, create_time, id).
        DataStream<Tuple3<String, Long, Integer>> inputMap = stream.map(new MapFunction<String, Tuple3<String, Long, Integer>>() {
            private static final long serialVersionUID = -8812094804806854937L;

            @Override
            public Tuple3<String, Long, Integer> map(String value) throws Exception {
                KafkaEntity kafkaEntity = JSON.parseObject(value, KafkaEntity.class);
                return new Tuple3<>(kafkaEntity.getName(), kafkaEntity.getCreate_time(), kafkaEntity.getId());
            }
        });

        // Assign event timestamps and emit periodic watermarks that trail the largest timestamp seen.
        DataStream<Tuple3<String, Long, Integer>> watermark =
                inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple3<String, Long, Integer>>() {
                    private static final long serialVersionUID = 8252616297345284790L;

                    long currentMaxTimestamp = 0L;
                    long maxOutOfOrderness = 2000L; // maximum allowed out-of-orderness: 2 s
                    // Last watermark handed to Flink. Initialised so that the log line in
                    // extractTimestamp cannot hit null before the first periodic call.
                    Watermark watermark = new Watermark(Long.MIN_VALUE);
                    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                        return watermark;
                    }

                    @Override
                    public long extractTimestamp(Tuple3<String, Long, Integer> element, long previousElementTimestamp) {
                        long timestamp = element.f1;
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        // The watermark printed here is the one from the last periodic call,
                        // so it lags behind the element that is being logged.
                        System.out.println("timestamp : " + element.f1 + "|" + format.format(element.f1)
                                + " currentMaxTimestamp : " + currentMaxTimestamp + "|" + format.format(currentMaxTimestamp)
                                + "," + " watermark : " + watermark.getTimestamp() + "|" + format.format(watermark.getTimestamp()));
                        return timestamp;
                    }
                });

        // Side output that receives elements arriving after the allowed lateness has expired.
        OutputTag<Tuple3<String, Long, Integer>> lateOutputTag = new OutputTag<Tuple3<String, Long, Integer>>("late-data") {
            private static final long serialVersionUID = -1552769100986888698L;
        };

        SingleOutputStreamOperator<String> resultStream = watermark
                .keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .trigger(new Trigger<Tuple3<String, Long, Integer>, TimeWindow>() {
                    private static final long serialVersionUID = 2742133264310093792L;

                    // Per-key, per-window running sum of the id field, kept in trigger state.
                    ValueStateDescriptor<Integer> sumStateDescriptor = new ValueStateDescriptor<>("sum", Integer.class);

                    @Override
                    public TriggerResult onElement(Tuple3<String, Long, Integer> element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
                        ValueState<Integer> sumState = ctx.getPartitionedState(sumStateDescriptor);
                        if (null == sumState.value()) {
                            sumState.update(0);
                        }
                        sumState.update(element.f2 + sumState.value());
                        System.out.println(sumState.value());
                        // if (sumState.value() >= 2) {
                        //     You could handle the state manually here.
                        //     The default trigger returns TriggerResult.FIRE, which does not purge the window contents.
                        //     return TriggerResult.FIRE_AND_PURGE;
                        // }
                        // return TriggerResult.CONTINUE;
                        return TriggerResult.FIRE_AND_PURGE;
                    }

                    @Override
                    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
                        return TriggerResult.CONTINUE;
                    }

                    @Override
                    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
                        return TriggerResult.CONTINUE;
                    }

                    @Override
                    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
                        // Prints "clearing window state | value stored in window: <sum>".
                        System.out.println("清理窗口状态 | 窗口内保存值为" + ctx.getPartitionedState(sumStateDescriptor).value());
                        ctx.getPartitionedState(sumStateDescriptor).clear();
                    }
                })
                // Notes on allowedLateness (using it can produce repeated results for a window):
                // - With the default trigger, the window's state is cleared once the watermark
                //   passes window_end_time + allowedLateness. Until then, a late element that
                //   still belongs to the window makes the window fire again.
                // - With a custom trigger, the same window can fire as often as the trigger
                //   decides. The clear still happens once the watermark passes
                //   window_end_time + allowedLateness.
                // - Clearing window state depends only on time, not on whether a custom
                //   trigger is used.
                .allowedLateness(Time.seconds(3))
                .sideOutputLateData(lateOutputTag)
                .apply(new WindowFunction<Tuple3<String, Long, Integer>, String, Tuple, TimeWindow>() {
                    private static final long serialVersionUID = 7813420265419629362L;

                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Long, Integer>> input, Collector<String> out) throws Exception {
                        // Print the event time of every element currently held by the window.
                        for (Tuple3<String, Long, Integer> element : input) {
                            System.out.println(element.f1);
                        }
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        out.collect("window " + format.format(window.getStart()) + " window " + format.format(window.getEnd()));
                        System.out.println("-------------------------");
                    }
                });

        resultStream.print();
        resultStream.getSideOutput(lateOutputTag).print();
        env.execute("window test");
    }
}
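The trigger in WatermarkTest returns FIRE_AND_PURGE for every element. The following plain-Java sketch is a simplified mental model of what that means for the elements buffered in a window, compared with FIRE. `TriggerResultSketch` and its `onElement` method are invented for illustration and are not Flink's window operator.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics how FIRE and FIRE_AND_PURGE treat a window's buffered elements. */
class TriggerResultSketch {

    enum Result { FIRE, FIRE_AND_PURGE }

    private final List<Long> buffer = new ArrayList<>();

    /** Adds an element, "fires" the window, and returns what the window function would see. */
    List<Long> onElement(long timestamp, Result result) {
        buffer.add(timestamp);
        List<Long> emitted = new ArrayList<>(buffer); // snapshot handed to the window function
        if (result == Result.FIRE_AND_PURGE) {
            buffer.clear(); // purge: drop the buffered elements after emitting them
        }
        return emitted;
    }

    public static void main(String[] args) {
        TriggerResultSketch purge = new TriggerResultSketch();
        TriggerResultSketch keep = new TriggerResultSketch();
        long[] timestamps = {1564038394140L, 1564038395056L, 1564038395363L};
        for (long ts : timestamps) {
            int purged = purge.onElement(ts, Result.FIRE_AND_PURGE).size();
            int kept = keep.onElement(ts, Result.FIRE).size();
            System.out.println("FIRE_AND_PURGE sees " + purged + " element(s), FIRE sees " + kept);
        }
    }
}
```

In the real job, the running sum lives in the trigger's own state rather than in the window contents, so purging the window does not reset it. That state is only dropped when the trigger's clear method runs.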
The test records are sent to Kafka by the following producer:
package cn.crawler.mft_seconed.demo4;

import cn.crawler.mft_seconed.KafkaEntity;
import com.alibaba.fastjson.JSON;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SendDataToKafkaDemo4 {

    public static void main(String[] args) {
        SendDataToKafkaDemo4 sendDataToKafkaDemo4 = new SendDataToKafkaDemo4();
        // Send 40 records; create_time becomes the event time used by WatermarkTest.
        for (int i = 0; i < 40; i++) {
            KafkaEntity build = KafkaEntity.builder().id(1).message("message" + i).create_time(System.currentTimeMillis()).name("" + 1).build();
            System.out.println(build.toString());
            // Note: WatermarkTest consumes "test-demo12"; both sides must use the same topic for the demo to work.
            sendDataToKafkaDemo4.send("test-demo13", "123", JSON.toJSONString(build));
        }
    }

    public void send(String topic, String key, String data) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.19.141.60:31090");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // A new producer is created for every call; simple, but only acceptable for a small demo.
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
        // The loop body runs exactly once (i = 1): sleep 100 ms, then send one record.
        for (int i = 1; i < 2; i++) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            producer.send(new ProducerRecord<String, String>(topic, key + i, data));
        }
        producer.close();
    }
}
Now let's look at the output:
timestamp : 1564038394140|2019-07-25 15:06:34.140 currentMaxTimestamp : 1564038394140|2019-07-25 15:06:34.140, watermark : -2000|1970-01-01 07:59:58.000
1
1564038394140
window 2019-07-25 15:06:33.000 window 2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038395056|2019-07-25 15:06:35.056 currentMaxTimestamp : 1564038395056|2019-07-25 15:06:35.056, watermark : 1564038392140|2019-07-25 15:06:32.140
2
1564038395056
window 2019-07-25 15:06:33.000 window 2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038395363|2019-07-25 15:06:35.363 currentMaxTimestamp : 1564038395363|2019-07-25 15:06:35.363, watermark : 1564038393056|2019-07-25 15:06:33.056
3
1564038395363
window 2019-07-25 15:06:33.000 window 2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038395786|2019-07-25 15:06:35.786 currentMaxTimestamp : 1564038395786|2019-07-25 15:06:35.786, watermark : 1564038393363|2019-07-25 15:06:33.363
4
1564038395786
window 2019-07-25 15:06:33.000 window 2019-07-25 15:06:36.000
-------------------------
timestamp : 1564038396216|2019-07-25 15:06:36.216 currentMaxTimestamp : 1564038396216|2019-07-25 15:06:36.216, watermark : 1564038393786|2019-07-25 15:06:33.786
1
1564038396216
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038396504|2019-07-25 15:06:36.504 currentMaxTimestamp : 1564038396504|2019-07-25 15:06:36.504, watermark : 1564038394216|2019-07-25 15:06:34.216
2
1564038396504
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038396960|2019-07-25 15:06:36.960 currentMaxTimestamp : 1564038396960|2019-07-25 15:06:36.960, watermark : 1564038394504|2019-07-25 15:06:34.504
3
1564038396960
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038397376|2019-07-25 15:06:37.376 currentMaxTimestamp : 1564038397376|2019-07-25 15:06:37.376, watermark : 1564038394960|2019-07-25 15:06:34.960
4
1564038397376
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038397755|2019-07-25 15:06:37.755 currentMaxTimestamp : 1564038397755|2019-07-25 15:06:37.755, watermark : 1564038395376|2019-07-25 15:06:35.376
5
1564038397755
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038398077|2019-07-25 15:06:38.077 currentMaxTimestamp : 1564038398077|2019-07-25 15:06:38.077, watermark : 1564038395755|2019-07-25 15:06:35.755
6
1564038398077
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038398511|2019-07-25 15:06:38.511 currentMaxTimestamp : 1564038398511|2019-07-25 15:06:38.511, watermark : 1564038396077|2019-07-25 15:06:36.077
7
1564038398511
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038398904|2019-07-25 15:06:38.904 currentMaxTimestamp : 1564038398904|2019-07-25 15:06:38.904, watermark : 1564038396511|2019-07-25 15:06:36.511
8
1564038398904
window 2019-07-25 15:06:36.000 window 2019-07-25 15:06:39.000
-------------------------
timestamp : 1564038399218|2019-07-25 15:06:39.218 currentMaxTimestamp : 1564038399218|2019-07-25 15:06:39.218, watermark : 1564038396904|2019-07-25 15:06:36.904
1
1564038399218
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038399635|2019-07-25 15:06:39.635 currentMaxTimestamp : 1564038399635|2019-07-25 15:06:39.635, watermark : 1564038397218|2019-07-25 15:06:37.218
2
1564038399635
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038399874|2019-07-25 15:06:39.874 currentMaxTimestamp : 1564038399874|2019-07-25 15:06:39.874, watermark : 1564038397635|2019-07-25 15:06:37.635
3
1564038399874
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038400261|2019-07-25 15:06:40.261 currentMaxTimestamp : 1564038400261|2019-07-25 15:06:40.261, watermark : 1564038397874|2019-07-25 15:06:37.874
4
1564038400261
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038400614|2019-07-25 15:06:40.614 currentMaxTimestamp : 1564038400614|2019-07-25 15:06:40.614, watermark : 1564038398261|2019-07-25 15:06:38.261
5
1564038400614
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038400935|2019-07-25 15:06:40.935 currentMaxTimestamp : 1564038400935|2019-07-25 15:06:40.935, watermark : 1564038398614|2019-07-25 15:06:38.614
6
1564038400935
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038401351|2019-07-25 15:06:41.351 currentMaxTimestamp : 1564038401351|2019-07-25 15:06:41.351, watermark : 1564038398935|2019-07-25 15:06:38.935
7
1564038401351
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
清理窗口状态 | 窗口内保存值为4 // <-- Here! The trigger's clear() was invoked (the message reads "clearing window state | value stored in window: 4")
timestamp : 1564038401856|2019-07-25 15:06:41.856 currentMaxTimestamp : 1564038401856|2019-07-25 15:06:41.856, watermark : 1564038399351|2019-07-25 15:06:39.351
8
1564038401856
window 2019-07-25 15:06:39.000 window 2019-07-25 15:06:42.000
-------------------------
timestamp : 1564038402142|2019-07-25 15:06:42.142 currentMaxTimestamp : 1564038402142|2019-07-25 15:06:42.142, watermark : 1564038399856|2019-07-25 15:06:39.856
1
1564038402142
window 2019-07-25 15:06:42.000 window 2019-07-25 15:06:45.000
-------------------------
timestamp : 1564038402501|2019-07-25 15:06:42.501 currentMaxTimestamp : 1564038402501|2019-07-25 15:06:42.501, watermark : 1564038400142|2019-07-25 15:06:40.142
2
1564038402501
window 2019-07-25 15:06:42.000 window 2019-07-25 15:06:45.000
-------------------------
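Let's analyze the code above:
SendDataToKafkaDemo4 sends 40 records to Kafka. WatermarkTest receives them and converts each one into a Java entity. It then assigns watermarks (with a maximum out-of-orderness of 2 s) and splits the stream into fixed 3 s tumbling windows. After keying by the first field, it sets (updates) the state value for each keyed window. When the watermark reaches window end time + 3 s (the allowedLateness), the trigger's clear method is invoked and the window's state is cleaned up. As you can see, a trigger has several methods to override, and we can override whichever ones we need.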
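The window boundaries and watermark values in the output can be reproduced with a little arithmetic. The sketch below is plain Java with no Flink dependency. `WindowMathSketch` and its method names are invented for illustration, and the window-start formula assumes a window offset of 0 and non-negative timestamps.

```java
/** Illustrative only: arithmetic behind 3 s tumbling windows and a 2 s out-of-orderness bound. */
class WindowMathSketch {

    static final long WINDOW_SIZE_MS = 3_000L;          // TumblingEventTimeWindows.of(Time.seconds(3))
    static final long MAX_OUT_OF_ORDERNESS_MS = 2_000L; // maxOutOfOrderness in the assigner

    /** Start of the tumbling window that contains the timestamp. */
    static long windowStart(long timestamp) {
        return timestamp - (timestamp % WINDOW_SIZE_MS);
    }

    /** Exclusive end of the tumbling window that contains the timestamp. */
    static long windowEnd(long timestamp) {
        return windowStart(timestamp) + WINDOW_SIZE_MS;
    }

    /** Watermark produced after the given maximum timestamp has been seen. */
    static long watermarkFor(long maxTimestampSeen) {
        return maxTimestampSeen - MAX_OUT_OF_ORDERNESS_MS;
    }

    public static void main(String[] args) {
        long first = 1564038394140L; // 15:06:34.140, the first event in the output
        System.out.println("window start: " + windowStart(first));  // 1564038393000 (15:06:33.000)
        System.out.println("window end:   " + windowEnd(first));    // 1564038396000 (15:06:36.000)
        System.out.println("watermark:    " + watermarkFor(first)); // 1564038392140 (15:06:32.140)
    }
}
```

The watermark computed from the first event, 1564038392140, is the value printed on the second log line of the output, because the log shows the watermark from the previous periodic call.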
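For example:
First window: 15:06:33.000 - 15:06:36.000
Final value of the first window: 4
Take the end of the first window, 36 s, and add 3 s (the allowedLateness): that gives the 39 s mark. In other words, once the watermark reaches 15:06:39.000, the window trigger's clear method runs. In the output, the line for the event with timestamp : 1564038401856|2019-07-25 15:06:41.856 is the first one whose watermark has passed the 39 s mark:
watermark : 1564038399351|2019-07-25 15:06:39.351. This is the point at which clear is triggered.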
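The same reasoning can be checked mechanically. The sketch below applies the rule from the text (cleanup once the watermark reaches window end + allowedLateness) to a few watermark values copied from the output. `CleanupTimeSketch` and its helpers are invented for illustration.

```java
/** Illustrative only: finds the first watermark that makes the first window eligible for cleanup. */
class CleanupTimeSketch {

    /** Rule used in the text: state is cleared once watermark >= window end + allowedLateness. */
    static long cleanupTime(long windowEnd, long allowedLatenessMs) {
        return windowEnd + allowedLatenessMs;
    }

    /** Returns the first watermark in the sequence that reaches the cleanup time, or -1 if none does. */
    static long firstWatermarkReaching(long cleanupTime, long[] watermarks) {
        for (long wm : watermarks) {
            if (wm >= cleanupTime) {
                return wm;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long windowEnd = 1564038396000L;               // 15:06:36.000, end of the first window
        long cleanup = cleanupTime(windowEnd, 3_000L); // 15:06:39.000
        // Consecutive watermark values copied from the output above.
        long[] watermarks = {1564038398614L, 1564038398935L, 1564038399351L, 1564038399856L};
        System.out.println("cleanup time:    " + cleanup);                                     // 1564038399000
        System.out.println("first watermark: " + firstWatermarkReaching(cleanup, watermarks)); // 1564038399351
    }
}
```

Flink itself schedules the cleanup at the window's last millisecond (end - 1 ms) plus allowedLateness, that is 15:06:38.999 here. With these watermark values the result is the same: 15:06:39.351 is the first watermark that qualifies.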