Flink DataStream Transformations

Natasha

map

  • DataStream → DataStream: a one-to-one mapping; each input element is transformed and emitted as exactly one output element.
#Java map example
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MapDemo {
     public static void main(String[] args) throws Exception {
         //build the execution environment
         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
         //define the data source
         DataStream<String> socketTextStream = env.socketTextStream("localhost", 9000, "\n");
         //lambda version
         DataStream<String> result3 = socketTextStream.map(value -> value + "love");
         //print the DataStream contents
         result3.print();
         //execute the job
         env.execute();
     }
}

#Scala map example
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
object MapDemoScala {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9875)
    text.map(x => x + "love").print()
    env.execute()
  }
}

flatMap

  • DataStream → DataStream: takes one input element and produces 0, 1, or many output elements; mostly used for splitting records.
  • flatMap is used much like map, but since a Java method can only return a single value, flatMap hands its results to a Collector, so one input can yield any number of outputs.
#Java flatMap example
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlatMapDemo {
     public static void main(String[] args) throws Exception {
         //build the execution environment
         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
         //define the data source
         DataStream<String> socketTextStream = env.socketTextStream("localhost", 9000, "\n");
         DataStream<String> result = socketTextStream.flatMap((String s, Collector<String> collector) -> {
                    for (String str : s.split(" ")) {
                        collector.collect(str);
                    }
         }).returns(Types.STRING); //the Collector's generic type is erased from the lambda, so declare the output type
         //print the DataStream contents
         result.print();
         //execute the job
         env.execute();
     }
}

#Scala flatMap example
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object FlatMapDemoScala {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1", 9002)
    text.flatMap { str => str.split(" ") }.print()
    // equivalent shorthand:
    // text.flatMap { _.split(" ") }
    env.execute()
  }
}

filter

  • DataStream → DataStream: a filter; each element is tested with a predicate, and only elements for which it returns true are passed downstream.
#Java
DataStream<String> res = socketTestStream.filter(new FilterFunction<String>() {
    @Override
     public boolean filter(String s) throws Exception {
            return s.startsWith("S");
     }
});

#Scala
text.filter{_.startsWith("S")}
  .print()
  .setParallelism(1)

keyBy

  • DataStream → KeyedStream
  • Partitions the stream by key into disjoint partitions; records with the same key are routed to the same partition. keyBy() is implemented with hash partitioning.
  • One or more fields of a POJO, or the elements of a tuple, can serve as the key, but two kinds of types cannot:

    1. POJO types that do not override hashCode() and fall back to Object's default implementation
    2. array types
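The hash partitioning behind keyBy() can be sketched in plain Java (no Flink needed; the class and helper names below are made up for illustration, and real Flink murmur-hashes the key's hashCode into key groups rather than taking a simple modulus):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyBySketch {
    // Simplified stand-in for keyBy(): derive a partition index from the
    // key's hashCode(). The guarantee this illustrates is the real one:
    // equal keys always land in the same partition.
    static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        Map<String, Integer> assignment = new HashMap<>();
        for (String word : new String[]{"flink", "spark", "flink", "storm"}) {
            assignment.put(word, partitionFor(word, 4));
        }
        // both occurrences of "flink" map to one and the same partition
        System.out.println(assignment);
    }
}
```

This also shows why a POJO key must override hashCode(): with Object's identity hash, two records that look equal could land in different partitions.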
#Java in POJO
SingleOutputStreamOperator<WordCountPOJO> streamOperator = socketTextStream
        .flatMap((String value, Collector<WordCountPOJO> out) -> {
            Arrays.stream(value.split(" "))
                  .forEach(str -> out.collect(WordCountPOJO.of(str, 1)));
        }).returns(WordCountPOJO.class);
        
KeyedStream<WordCountPOJO, Tuple> keyedStream = streamOperator.keyBy("word");
SingleOutputStreamOperator<WordCountPOJO> summed = keyedStream.sum("count");

#Java in Tuple
SingleOutputStreamOperator<Tuple2<String, Integer>> singleOutputStreamOperator = dataStreamSource
            .flatMap((String value, Collector<Tuple2<String, Integer>> out) -> {
                 Arrays.stream(value.split(" "))
                       .forEach(str -> out.collect(Tuple2.of(str, 1)));
            }).returns(Types.TUPLE(Types.STRING, Types.INT));
 
KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = singleOutputStreamOperator.keyBy(0);
SingleOutputStreamOperator<Tuple2<String, Integer>> sum = keyedStream.sum(1);
     
#Scala in POJO
text.flatMap{_.split(" ")}
  .map(x => WordCountPOJO(x,1))
  .keyBy("word")
  .timeWindow(Time.seconds(5))
  .sum("count")
  .print()
  .setParallelism(1)

reduce

  • KeyedStream → DataStream: incrementally combines the records within each key partition; aggregations such as min(), max(), avg(), and count() can all be expressed with reduce.
public class StudentPOJO {
     private String name;
     private String gender;
     private String className;
     private double score;
     public StudentPOJO() {

     }
    public StudentPOJO(String name, String gender, String className, double score) {
         this.name = name;
         this.gender = gender;
         this.className = className;
         this.score = score;
    }
    public static StudentPOJO of(String name, String gender, String className, double score) {
        return new StudentPOJO(name,gender, className,score);
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public String getGender() {
        return gender;
    }
    public void setGender(String gender) {
        this.gender = gender;
    }
    public String getClassName() {
        return className;
    }
    public void setClassName(String className) {
        this.className = className;
    }
    public double getScore() {
        return score;
    }
    public void setScore(double score) {
        this.score = score;
    }
    @Override
    public String toString() {
        return "StudentPOJO{" +
                "name='" + name + '\'' +
                ", gender='" + gender + '\'' +
                ", className='" + className + '\'' +
                ", score=" + score +
                '}';
    }
}

#Java in POJO
SingleOutputStreamOperator<StudentPOJO> flatMapSocketTextStream = socketTextStream
        .flatMap((String value, Collector<StudentPOJO> out) -> {
             String[] values = value.split(" ");
             out.collect(new StudentPOJO(values[0], values[1], values[2], Double.valueOf(values[3])));
        }).returns(StudentPOJO.class);
        
DataStream<StudentPOJO> res = flatMapSocketTextStream
        .keyBy("className")
        .reduce((s1, s2) ->
            s1.getScore() > s2.getScore() ? s1 : s2
        );
        
#Java in Tuple
DataStream<Tuple2<String, Integer>> res1 = socketTextStream
        .map(value -> Tuple2.of(value.trim(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .timeWindow(Time.seconds(10))
        .reduce((Tuple2<String, Integer> t1, Tuple2<String, Integer> t2) ->
                new Tuple2<>(t1.f0, t1.f1 + t2.f1));
                
DataStream<Tuple2<String, Integer>> res2 = socketTextStream
        .map(value -> Tuple2.of(value.trim(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .timeWindow(Time.seconds(10))
        .reduce((old, news) -> {
            old.f1 += news.f1;
         return old;
         }).returns(Types.TUPLE(Types.STRING, Types.INT));
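Note that a keyed reduce is rolling: every arriving record produces an updated aggregate for its key, not a single result at the end. A minimal plain-Java emulation of the word-count reduce above (hypothetical class name, no Flink involved):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingReduceSketch {
    // Emulates KeyedStream.reduce over (word, 1) pairs: each input is
    // merged with the previous aggregate for the same key, and the new
    // aggregate is emitted downstream immediately.
    static List<String> run(String[] words) {
        Map<String, Integer> state = new HashMap<>(); // per-key state
        List<String> emitted = new ArrayList<>();
        for (String w : words) {
            int count = state.merge(w, 1, Integer::sum); // like (t1.f1 + t2.f1)
            emitted.add(w + "=" + count);                // one output per input
        }
        return emitted;
    }

    public static void main(String[] args) {
        // "a" is emitted twice: once for each arriving record
        System.out.println(run(new String[]{"a", "b", "a"}));
    }
}
```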

fold

  • KeyedStream → DataStream
  • A rolling fold on a keyed stream with an initial value: combines the current element with the previous fold result and emits the new value.
  • A fold function that, when applied on the sequence (1,2,3,4,5), emits the sequence "start-1", "start-1-2", "start-1-2-3", ...
#TODO: no visible output in testing; note that fold() has been deprecated in the DataStream API in favor of aggregate()
DataStream<String> res = socketTextStream.map(value -> Tuple2.of(value.trim(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .fold("result: ", (String current, Tuple2<String, Integer> t2) -> current + t2.f0 + ",");
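The documented behavior ("start-1", "start-1-2", ...) is easy to reproduce in plain Java; this sketch (hypothetical class name, no Flink) emulates a rolling fold with an initial value:

```java
import java.util.ArrayList;
import java.util.List;

public class FoldSketch {
    // Rolling fold: each element is merged with the previous fold result,
    // starting from an initial value, and every intermediate result is emitted.
    static List<String> fold(String initial, int[] values) {
        List<String> emitted = new ArrayList<>();
        String acc = initial;
        for (int v : values) {
            acc = acc + "-" + v; // the fold function
            emitted.add(acc);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // reproduces the documented sequence "start-1", "start-1-2", ...
        System.out.println(fold("start", new int[]{1, 2, 3, 4, 5}));
    }
}
```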

union

  • Calling union on a DataStream merges multiple streams of the same type into one new DataStream of that type.
DataStream<String> streamSource01 = env.socketTextStream("localhost", 8888);
DataStream<String> streamSource02 = env.socketTextStream("localhost", 9922);

DataStream<String> mapStreamSource01 = streamSource01.map(value -> "data from port 8888: " + value);
DataStream<String> mapStreamSource02 = streamSource02.map(value -> "data from port 9922: " + value);

DataStream<String> res = mapStreamSource01.union(mapStreamSource02);
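Conceptually, union is just a merge of same-typed streams: duplicates are kept (unlike SQL's UNION) and no cross-stream ordering is guaranteed. A plain-Java sketch of one possible interleaving (hypothetical helper name, no Flink):

```java
import java.util.ArrayList;
import java.util.List;

public class UnionSketch {
    // union merges same-typed inputs into one output; elements from
    // different inputs may interleave, and nothing is de-duplicated.
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b); // one possible interleaving
        return out;
    }

    public static void main(String[] args) {
        // "y" appears twice: union does not de-duplicate
        System.out.println(union(List.of("x", "y"), List.of("y", "z")));
    }
}
```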

join

  • Joins two streams on a specified key.
DataStream<Tuple2<String, String>> mapStreamSource01 = streamSource01
        .map(value -> Tuple2.of(value, "data from port 8888: " + value))
        .returns(Types.TUPLE(Types.STRING, Types.STRING));

DataStream<Tuple2<String, String>> mapStreamSource02 = streamSource02
        .map(value -> Tuple2.of(value, "data from port 9922: " + value))
        .returns(Types.TUPLE(Types.STRING, Types.STRING));
        
DataStream<String> res = mapStreamSource01.join(mapStreamSource02)
        //use the typed f0/f1 fields rather than getField(), whose generic return type defeats type extraction
        .where(t1 -> t1.f0)
        .equalTo(t2 -> t2.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .apply((t1, t2) -> t1.f1 + "|" + t2.f1);

coGroup

  • Like join, groups two streams by key, but also hands over elements whose key has no match on the other side.
DataStream<Tuple2<String, String>> mapStreamSource01 = streamSource01
        .map(value -> Tuple2.of(value, "port 8888 data: " + value))
        .returns(Types.TUPLE(Types.STRING, Types.STRING));

DataStream<Tuple2<String, String>> mapStreamSource02 = streamSource02
        .map(value -> Tuple2.of(value, "port 9922 data: " + value))
        .returns(Types.TUPLE(Types.STRING, Types.STRING));
        
DataStream<String> res = mapStreamSource01.coGroup(mapStreamSource02)
        .where(t1 -> t1.f0)
        .equalTo(t2 -> t2.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public void coGroup(Iterable<Tuple2<String, String>> iterable1, Iterable<Tuple2<String, String>> iterable2, Collector<String> collector) throws Exception {
                 StringBuilder stringBuilder = new StringBuilder();
                 stringBuilder.append("from the 8888 stream--");
                 for (Tuple2<String, String> item : iterable1) {
                     stringBuilder.append(item.f1).append(" | ");
                 }
                 stringBuilder.append("from the 9922 stream--");
                 for (Tuple2<String, String> item : iterable2) {
                     stringBuilder.append(item.f1);
                 }
                 collector.collect(stringBuilder.toString());
            }
        });
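The contrast with join can be emulated in plain Java over one window's worth of data (hypothetical helper, no Flink): a windowed join emits output only for keys present on both sides, whereas coGroup is invoked once per key, including keys that appear on only one side.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CoGroupSketch {
    // For every key appearing in either input, collect the elements from
    // both sides -- even when one side is empty. A join would instead
    // skip keys that lack a partner.
    static Map<String, String> coGroup(Map<String, String> left, Map<String, String> right) {
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        Map<String, String> out = new TreeMap<>();
        for (String k : keys) {
            out.put(k, left.getOrDefault(k, "-") + "|" + right.getOrDefault(k, "-"));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> l = Map.of("a", "L1", "b", "L2");
        Map<String, String> r = Map.of("b", "R1", "c", "R2");
        // "a" and "c" survive despite having no match on the other side
        System.out.println(coGroup(l, r));
    }
}
```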

split

  • DataStream → SplitStream: splits one DataStream into several labeled streams, represented by a SplitStream. (split/select have since been deprecated in favor of side outputs.)

select

  • SplitStream → DataStream: picks the desired stream(s) out of a SplitStream via .select().
SplitStream<Tuple2<String, Integer>> splitStream = streamSource
        .map(values -> Tuple2.of(values.trim(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .split(t -> {
            List<String> list = new ArrayList<>();
            //isNumeric is a helper (e.g. StringUtils.isNumeric) that tests whether the string is a number
            if (isNumeric(t.f0)) {
                list.add("num");
            } else {
                list.add("str");
            }
            return list;
        });
 
DataStream<Tuple2<String, Integer>>  strDataStream1 = splitStream.select("str")
        .map(t -> Tuple2.of("string: " + t.f0, t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .sum(1);
        
DataStream<Tuple2<String, Integer>>  strDataStream2 = splitStream.select("num")
        .map(t -> Tuple2.of("number: " + t.f0, t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(0)
        .sum(1);