map
- DataStream --> DataStream: a one-to-one mapping that transforms each element into exactly one output element.
#map in Java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MapDemo {
public static void main(String[] args) throws Exception {
// Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Define the source: read newline-delimited text from a socket
DataStream<String> socketTextStream = env.socketTextStream("localhost", 9000, "\n");
// Append a suffix to every element with a lambda
DataStream<String> result3 = socketTextStream.map(value -> value + "love");
// Print the stream contents
result3.print();
// Run the job
env.execute();
}
}
#map in Scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
object MapDemoScala {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("127.0.0.1", 9875)
// Print the mapped stream; without a sink and execute() the job never runs
text.map(x => x + "love").print()
env.execute()
}
}
flatMap
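- DataStream --> DataStream: takes one element and produces zero, one, or more elements; typically used to split records.
- flatMap is used much like map. A Java method can return only a single value, so the flatMap function receives a Collector and calls collect() once for every result it wants to emit.
#flatMap in Java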
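import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class FlatMapDemo {
public static void main(String[] args) throws Exception {
// Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Define the source: read newline-delimited text from a socket
DataStream<String> socketTextStream = env.socketTextStream("localhost", 9000, "\n");
// Split each line on spaces and emit every word separately
DataStream<String> result = socketTextStream.flatMap((String s, Collector<String> collector) -> {
for (String str : s.split(" ")) {
collector.collect(str);
}
})
// The lambda's Collector<String> type is erased, so declare the output type explicitly
.returns(Types.STRING);
// Print the stream contents
result.print();
// Run the job
env.execute();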
}
}
#flatMap in Scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
object FlatMapDemoScala {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("127.0.0.1", 9002)
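// The two forms are equivalent; the second uses placeholder syntax
text.flatMap { str => str.split(" ") }.print()
text.flatMap { _.split(" ") }.print()
env.execute()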
}
}
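The one-to-one versus one-to-many distinction between map and flatMap can be tried out without a running Flink job. The following is only a plain-Java analogy built on `java.util.stream`, not Flink API; the class name `MapVsFlatMapSketch` and the sample input are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapVsFlatMapSketch {
    // map: exactly one output per input (same transformation as value -> value + "love")
    static List<String> mapLines(List<String> lines) {
        return lines.stream()
                .map(line -> line + "love")
                .collect(Collectors.toList());
    }

    // flatMap: zero, one, or many outputs per input (splitting a line into words).
    // Assumption of this sketch: empty tokens are dropped so that a blank line
    // yields zero outputs; the Flink example above would emit one empty string.
    static List<String> flatMapLines(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a b", "", "c");
        System.out.println(mapLines(lines));     // [a blove, love, clove]
        System.out.println(flatMapLines(lines)); // [a, b, c]
    }
}
```

Three input lines give three outputs under map, while flatMap turns the same three lines into two, zero, and one outputs respectively.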
filter
- DataStream → DataStream: evaluates a predicate on every element and forwards only the elements for which it returns true.
#Java
DataStream<String> res = socketTextStream.filter(new FilterFunction<String>() {
@Override
public boolean filter(String s) throws Exception {
return s.startsWith("S");
}
});
#Scala
text.filter{_.startsWith("S")}
.print()
.setParallelism(1)
keyBy
- DataStream → KeyedStream
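- Splits the stream into disjoint partitions by key; records with the same key always land in the same partition. keyBy() is implemented with hash partitioning.
One or more fields of a POJO, or elements of a tuple, can serve as the key. Two kinds of types cannot be used as keys:
- POJO classes that do not override hashCode() and rely on the default Object.hashCode()
- arrays of any type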
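Both restrictions come down to the same thing: hash partitioning only works when equal keys hash and compare equal by content. A small plain-Java sketch (no Flink involved; `RawKey`, `WordKey`, and `sameKey` are hypothetical names for illustration) shows how the two forbidden kinds of keys behave:

```java
import java.util.Arrays;
import java.util.Objects;

class KeyHashSketch {
    // Hypothetical key type WITHOUT hashCode/equals overrides (identity semantics).
    static final class RawKey {
        final String word;
        RawKey(String word) { this.word = word; }
    }

    // Hypothetical key type WITH content-based hashCode/equals.
    static final class WordKey {
        final String word;
        WordKey(String word) { this.word = word; }
        @Override public boolean equals(Object o) {
            return o instanceof WordKey && Objects.equals(word, ((WordKey) o).word);
        }
        @Override public int hashCode() { return Objects.hash(word); }
    }

    // Two objects can only be grouped under one key if they are equal and hash alike.
    static boolean sameKey(Object a, Object b) {
        return a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(sameKey(new WordKey("a"), new WordKey("a")));     // true
        System.out.println(sameKey(new RawKey("a"), new RawKey("a")));       // false
        // Arrays inherit identity-based equals/hashCode from Object ...
        System.out.println(sameKey(new int[]{1, 2}, new int[]{1, 2}));       // false
        // ... even though their contents are equal.
        System.out.println(Arrays.equals(new int[]{1, 2}, new int[]{1, 2})); // true
    }
}
```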
#Java in POJO
SingleOutputStreamOperator<WordCountPOJO> streamOperator = socketTextStream
.flatMap((String value, Collector<WordCountPOJO> out) -> {
// Emit one (word, 1) record per word, not per input line
Arrays.stream(value.split(" "))
.forEach(str -> out.collect(WordCountPOJO.of(str, 1)));
}).returns(WordCountPOJO.class);
KeyedStream<WordCountPOJO, Tuple> keyedStream = streamOperator.keyBy("word");
SingleOutputStreamOperator<WordCountPOJO> summed = keyedStream.sum("count");
#Java in Tuple
SingleOutputStreamOperator<Tuple2<String, Integer>> singleOutputStreamOperator = dataStreamSource
.flatMap((String value, Collector<Tuple2<String, Integer>> out)-> {
Arrays.stream(value.split(" "))
.forEach(str -> out.collect(Tuple2.of(str, 1)));
}).returns(Types.TUPLE(Types.STRING, Types.INT));
KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = singleOutputStreamOperator.keyBy(0);
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = keyedStream.sum(1);
#Scala in POJO
text.flatMap{_.split(" ")}
.map(x => WordCountPOJO(x,1))
.keyBy("word")
.timeWindow(Time.seconds(5))
.sum("count")
.print()
.setParallelism(1)
reduce
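- KeyedStream → DataStream: combines the elements of each key's partition into a running result by repeatedly merging the current element with the last reduced value. Aggregations such as min, max, and count can be expressed with reduce; an average additionally needs the sum and the count carried along in the reduced value.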
public class StudentPOJO {
private String name;
private String gender;
private String className;
private double score;
public StudentPOJO() {
}
public StudentPOJO(String name, String gender, String className, double score) {
this.name = name;
this.gender = gender;
this.className = className;
this.score = score;
}
public static StudentPOJO of(String name, String gender, String className, double score) {
return new StudentPOJO(name,gender, className,score);
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getGender() {
return gender;
}
public void setGender(String gender) {
this.gender = gender;
}
public String getClassName() {
return className;
}
public void setClassName(String className) {
this.className = className;
}
public double getScore() {
return score;
}
public void setScore(double score) {
this.score = score;
}
@Override
public String toString() {
return "StudentPOJO{" +
"name='" + name + '\'' +
", gender='" + gender + '\'' +
", className='" + className + '\'' +
", score=" + score +
'}';
}
}
#Java in POJO
SingleOutputStreamOperator<StudentPOJO> flatMapSocketTextStream = socketTextStream
.flatMap((String value, Collector<StudentPOJO> out) -> {
String[] values = value.split(" ");
out.collect(new StudentPOJO(values[0], values[1], values[2], Double.valueOf(values[3])));
}).returns(StudentPOJO.class);
DataStream<StudentPOJO> res = flatMapSocketTextStream
.keyBy("className")
.reduce((s1, s2) ->
s1.getScore() > s2.getScore() ? s1 : s2
);
#Java in Tuple
DataStream<Tuple2<String, Integer>> res1 = socketTextStream
.map(value -> Tuple2.of(value.trim(), 1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.keyBy(0)
.timeWindow(Time.seconds(10))
.reduce((Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) ->
new Tuple2<>(t1.f0, t1.f1 + t2.f1));
DataStream<Tuple2<String, Integer>> res2 = socketTextStream
.map(value -> Tuple2.of(value.trim(), 1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.keyBy(0)
.timeWindow(Time.seconds(10))
.reduce((old, news) -> {
old.f1 += news.f1;
return old;
}).returns(Types.TUPLE(Types.STRING, Types.INT));
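Without a window, a keyed reduce emits an updated result for every incoming element rather than one final value. The sketch below imitates that rolling behavior in plain Java (no Flink API); `RollingReduceSketch`, `rollingReduce`, and the sample scores are hypothetical and only serve to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class RollingReduceSketch {
    // Applies `reducer` per key and records the running result after every element,
    // mimicking how a non-windowed keyed reduce emits one update per input.
    static List<String> rollingReduce(List<Map.Entry<String, Double>> input,
                                      BinaryOperator<Double> reducer) {
        Map<String, Double> state = new HashMap<>();   // last reduced value per key
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Double> e : input) {
            Double current = state.merge(e.getKey(), e.getValue(), reducer);
            emitted.add(e.getKey() + "=" + current);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> scores = List.of(
                Map.entry("classA", 80.0),
                Map.entry("classB", 70.0),
                Map.entry("classA", 95.0),
                Map.entry("classA", 60.0));
        // Highest score per class so far
        System.out.println(rollingReduce(scores, Math::max));
        // [classA=80.0, classB=70.0, classA=95.0, classA=95.0]
        // Running total per class
        System.out.println(rollingReduce(scores, Double::sum));
        // [classA=80.0, classB=70.0, classA=175.0, classA=235.0]
    }
}
```

The windowed variants above behave differently: they reduce all elements of a 10-second window and emit a single result per key when the window fires.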
fold
- KeyedStream → DataStream
- A rolling fold on a keyed stream with an initial value: combines the current element with the result of the previous fold and emits the new value.
- A fold function that, when applied on the sequence (1,2,3,4,5), emits the sequence "start-1", "start-1-2", "start-1-2-3", ...
#TODO: no visible output is produced yet
DataStream<String> res = socketTextStream.map(value -> Tuple2.of(value.trim(), 1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.keyBy(0)
.fold("Result: ", (String current, Tuple2<String, Integer> t2) -> current + t2.f0 + ",");
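The "start-1", "start-1-2", ... sequence from the description can be reproduced with a few lines of plain Java. This is a sketch of the rolling-fold semantics only, not Flink code; `RollingFoldSketch` and `rollingFold` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class RollingFoldSketch {
    // Folds `input` into `initial` and records the accumulator after every element.
    // Unlike reduce, the accumulator type R may differ from the element type T.
    static <T, R> List<R> rollingFold(R initial, List<T> input, BiFunction<R, T, R> folder) {
        List<R> emitted = new ArrayList<>();
        R acc = initial;
        for (T element : input) {
            acc = folder.apply(acc, element);
            emitted.add(acc);
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(rollingFold("start", List.of(1, 2, 3, 4, 5),
                (acc, n) -> acc + "-" + n));
        // [start-1, start-1-2, start-1-2-3, start-1-2-3-4, start-1-2-3-4-5]
    }
}
```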
union
- union merges several DataStreams of the same type into one new DataStream of that type.
DataStream<String> streamSource01 = env.socketTextStream("localhost", 8888);
DataStream<String> streamSource02 = env.socketTextStream("localhost", 9922);
DataStream<String> mapStreamSource01 = streamSource01.map(value -> "data from port 8888: " + value);
DataStream<String> mapStreamSource02 = streamSource02.map(value -> "data from port 9922: " + value);
DataStream<String> res = mapStreamSource01.union(mapStreamSource02);
join
- Joins two streams on a key. This is an inner join: only pairs of elements that share a key and fall into the same window are emitted.
DataStream<Tuple2<String, String>> mapStreamSource01 = streamSource01
.map(value -> Tuple2.of(value, "data from port 8888: " + value))
.returns(Types.TUPLE(Types.STRING, Types.STRING));
DataStream<Tuple2<String, String>> mapStreamSource02 = streamSource02
.map(value -> Tuple2.of(value, "data from port 9922: " + value))
.returns(Types.TUPLE(Types.STRING, Types.STRING));
DataStream<String> res = mapStreamSource01.join(mapStreamSource02)
// Key both sides on the first tuple field; f0 is typed as String, unlike getField(0)
.where(t1 -> t1.f0)
.equalTo(t2 -> t2.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply((t1, t2) -> t1.f1 + "|" + t2.f1);
coGroup
- Correlates two streams by key within a window; elements without a match on the other side are kept as well, and the function simply receives an empty iterable for that side.
DataStream<Tuple2<String, String>> mapStreamSource01 = streamSource01
.map(value -> Tuple2.of(value, "port 8888 data: " + value))
.returns(Types.TUPLE(Types.STRING, Types.STRING));
DataStream<Tuple2<String, String>> mapStreamSource02 = streamSource02
.map(value -> Tuple2.of(value, "port 9922 data: " + value))
.returns(Types.TUPLE(Types.STRING, Types.STRING));
DataStream<String> res = mapStreamSource01.coGroup(mapStreamSource02)
.where(t1 -> t1.getField(0))
.equalTo(t2 -> t2.getField(0))
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, String>> iterable1, Iterable<Tuple2<String, String>> iterable2, Collector<String> collector) throws Exception {
StringBuffer stringBuffer = new StringBuffer();
stringBuffer.append("stream from 8888--");
for (Tuple2<String, String> item : iterable1) {
stringBuffer.append(item.f1 + " | ");
}
stringBuffer.append("stream from 9922--");
for (Tuple2<String, String> item : iterable2) {
stringBuffer.append(item.f1);
}
collector.collect(stringBuffer.toString());
}
});
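The difference between join and coGroup is easiest to see on the contents of a single window. The plain-Java sketch below is not Flink code: `JoinVsCoGroupSketch` and its methods are hypothetical, each element doubles as its own key (as in the examples above, where the key is the raw input value), and keys are sorted only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class JoinVsCoGroupSketch {
    // Inner join: one output per matching (left, right) pair; unmatched elements vanish.
    static List<String> innerJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                if (l.equals(r)) {
                    out.add("L:" + l + "|R:" + r);
                }
            }
        }
        return out;
    }

    // coGroup: one output per key seen on EITHER side; a side without
    // elements for that key is simply empty (count 0 here).
    static List<String> coGroup(List<String> left, List<String> right) {
        TreeSet<String> keys = new TreeSet<>(left);
        keys.addAll(right);
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            long l = left.stream().filter(key::equals).count();
            long r = right.stream().filter(key::equals).count();
            out.add(key + " left=" + l + " right=" + r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = List.of("a", "b");
        List<String> right = List.of("b", "c");
        System.out.println(innerJoin(left, right)); // [L:b|R:b]
        System.out.println(coGroup(left, right));
        // [a left=1 right=0, b left=1 right=1, c left=0 right=1]
    }
}
```

Because coGroup sees both sides per key, it can serve as the basis for outer-join style logic, whereas join only ever sees matched pairs.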
split
- DataStream → SplitStream: splits a DataStream into several named streams, represented as a SplitStream.
select
- SplitStream → DataStream: retrieves the desired stream(s) from a SplitStream with .select().
SplitStream<Tuple2<String, Integer>> splitStream = streamSource
.map(values -> Tuple2.of(values.trim(), 1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.split( t -> {
List<String> list = new ArrayList<>();
if (isNumeric(t.f0)) {
list.add("num");
} else {
list.add("str");
}
return list;
});
DataStream<Tuple2<String, Integer>> strDataStream1 = splitStream.select("str")
.map( t -> Tuple2.of("string: " + t.f0, t.f1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.keyBy(0)
.sum(1);
DataStream<Tuple2<String, Integer>> strDataStream2 = splitStream.select("num")
.map( t -> Tuple2.of("number: " + t.f0, t.f1))
.returns(Types.TUPLE(Types.STRING, Types.INT))
.keyBy(0)
.sum(1);
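The snippet above calls `isNumeric` without showing it. A minimal sketch of what such a helper might look like, together with a plain-Java imitation of the tag-then-select flow, is given below. Everything here is an assumption for illustration: the original helper is not shown, this version accepts only non-empty strings of ASCII digits, and `SplitSelectSketch` does not use Flink at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitSelectSketch {
    // Hypothetical stand-in for the isNumeric helper used above:
    // true only for non-empty strings made of ASCII digits.
    static boolean isNumeric(String s) {
        return !s.isEmpty() && s.chars().allMatch(c -> c >= '0' && c <= '9');
    }

    // "split": tag every element with an output name ("num" or "str").
    static Map<String, List<String>> split(List<String> input) {
        Map<String, List<String>> tagged = new HashMap<>();
        for (String value : input) {
            String tag = isNumeric(value) ? "num" : "str";
            tagged.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
        }
        return tagged;
    }

    // "select": return the elements that carry the requested tag.
    static List<String> select(Map<String, List<String>> tagged, String tag) {
        return tagged.getOrDefault(tag, List.of());
    }

    public static void main(String[] args) {
        Map<String, List<String>> tagged = split(List.of("42", "flink", "7", "3a"));
        System.out.println(select(tagged, "num")); // [42, 7]
        System.out.println(select(tagged, "str")); // [flink, 3a]
    }
}
```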