Java8-11: Stream Collector Source Code Analysis and Custom Collectors

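The previous article covered grouping and partitioning with Stream in detail; this one looks at collectors.
So what is a collector? Earlier articles showed how a Stream can map, filter, group, and partition the elements of a collection. For example, the snippet below uses the map operation to convert every element to upper case: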

List<String> list = Arrays.asList("hello", "world", "helloworld");
List<String> collect = list.stream().map(String::toUpperCase).collect(Collectors.toList());

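That program should be easy to read by now. So far, though, this series has only covered intermediate operations such as map in detail, along with lambda expressions, method references, and the functional interfaces. This article turns to the collect method. collect takes a single argument of type Collector, and Collector is what Java 8 calls a collector.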

<R, A> R collect(Collector<? super T, A, R> collector);

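In other words, collect is handed a collector that describes how the stream's elements are accumulated into a result container. Most collectors don't have to be written by hand, because the Collectors class provides factory methods for the common ones, such as toList(), toSet(), and toCollection(Supplier collectionFactory). Even so, understanding how a collector is implemented goes a long way toward writing correct programs.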
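
As a quick illustration of the three factory methods just mentioned, here is a minimal sketch. The class name BuiltInCollectorsDemo and its helper methods are made up for this example; only Collectors.toList(), toSet(), and toCollection() come from the JDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class BuiltInCollectorsDemo {
    // Collects into a List; encounter order and duplicates are kept.
    static List<String> asList(List<String> source) {
        return source.stream().collect(Collectors.toList());
    }

    // Collects into a Set; duplicates are dropped.
    static Set<String> asSet(List<String> source) {
        return source.stream().collect(Collectors.toSet());
    }

    // Collects into a collection chosen by the caller, here a sorted TreeSet.
    static TreeSet<String> asTreeSet(List<String> source) {
        return source.stream().collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("hello", "world", "hello");
        System.out.println(asList(list));
        System.out.println(asSet(list));
        System.out.println(asTreeSet(list));
    }
}
```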

Here is the implementation of toList:

public static <T> Collector<T, ?, List<T>> toList() {
    return new CollectorImpl<>((Supplier<List<T>>) ArrayList::new, List::add,
                               (left, right) -> { left.addAll(right); return left; },
                               CH_ID);
}

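The source shows that toList returns an instance of CollectorImpl. CollectorImpl is an implementation of the Collector interface, defined as a nested class inside the Collectors helper class, and is used to create the common collector instances that Collectors hands out: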

/**
 * Simple implementation class for {@code Collector}.
 *
 * @param <T> the type of elements to be collected
 * @param <R> the type of the result
 */
static class CollectorImpl<T, A, R> implements Collector<T, A, R> {
    private final Supplier<A> supplier;
    private final BiConsumer<A, T> accumulator;
    private final BinaryOperator<A> combiner;
    private final Function<A, R> finisher;
    private final Set<Characteristics> characteristics;

    CollectorImpl(Supplier<A> supplier,
                  BiConsumer<A, T> accumulator,
                  BinaryOperator<A> combiner,
                  Function<A,R> finisher,
                  Set<Characteristics> characteristics) {
        this.supplier = supplier;
        this.accumulator = accumulator;
        this.combiner = combiner;
        this.finisher = finisher;
        this.characteristics = characteristics;
    }

    CollectorImpl(Supplier<A> supplier,
                  BiConsumer<A, T> accumulator,
                  BinaryOperator<A> combiner,
                  Set<Characteristics> characteristics) {
        this(supplier, accumulator, combiner, castingIdentity(), characteristics);
    }

    @Override
    public BiConsumer<A, T> accumulator() {
        return accumulator;
    }

    @Override
    public Supplier<A> supplier() {
        return supplier;
    }

    @Override
    public BinaryOperator<A> combiner() {
        return combiner;
    }

    @Override
    public Function<A, R> finisher() {
        return finisher;
    }

    @Override
    public Set<Characteristics> characteristics() {
        return characteristics;
    }
}
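
CollectorImpl is not accessible from outside the JDK's own package, but the public Collector.of factory accepts the same kind of functions. The sketch below builds a list collector that way; CollectorOfDemo and listCollector() are hypothetical names for this example, and the code is meant to mirror toList() rather than reproduce the JDK implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

class CollectorOfDemo {
    // Hypothetical helper: a list collector assembled from a supplier,
    // an accumulator, a combiner, and one characteristic.
    static <T> Collector<T, ?, List<T>> listCollector() {
        return Collector.<T, List<T>>of(
                ArrayList::new,                                         // supplier
                List::add,                                              // accumulator
                (left, right) -> { left.addAll(right); return left; },  // combiner
                Collector.Characteristics.IDENTITY_FINISH);
    }

    public static void main(String[] args) {
        List<String> result = Stream.of("hello", "world")
                .collect(CollectorOfDemo.<String>listCollector());
        System.out.println(result);
    }
}
```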

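The CollectorImpl constructors simply store the functions passed in, and each Collector method returns the corresponding one; that is how toList above is built.
To write a custom collector we therefore have to implement the methods of the Collector interface ourselves. The rest of this section goes through them one by one.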

/*
 * @param <T> the type of input elements to the reduction operation
 * @param <A> the mutable accumulation type of the reduction operation (often
 *            hidden as an implementation detail)
 * @param <R> the result type of the reduction operation
 * @since 1.8
 */
public interface Collector<T, A, R> {

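Before looking at the methods, note the three type parameters of Collector:
T is the type of the input elements being collected
A is the type of the mutable intermediate result container
R is the type of the final result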
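
The three type parameters are easiest to tell apart when they are all different. The sketch below is an illustration only; TypeParametersDemo and digitsToString() are names invented here. It collects Integer elements (T) into a StringBuilder (A) and returns a String (R).

```java
import java.util.stream.Collector;
import java.util.stream.Stream;

class TypeParametersDemo {
    // T = Integer (input), A = StringBuilder (mutable container), R = String (result).
    static Collector<Integer, StringBuilder, String> digitsToString() {
        return Collector.<Integer, StringBuilder, String>of(
                StringBuilder::new,            // Supplier<A>
                (sb, n) -> sb.append(n),       // BiConsumer<A, T>
                (a, b) -> a.append(b),         // BinaryOperator<A>
                StringBuilder::toString);      // Function<A, R>
    }

    public static void main(String[] args) {
        String joined = Stream.of(1, 2, 3).collect(digitsToString());
        System.out.println(joined);
    }
}
```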

These type parameters come up again below. Now for the methods of the Collector interface.

     /**
     * A function that creates and returns a new mutable result container.
     *
     * @return a function which returns a new, mutable result container
     */
    Supplier<A> supplier();

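supplier() returns a function that creates and returns a new mutable result container. When the collector runs, the incoming elements (of type T) are first placed into a container created by that function.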

    /**
     * A function that folds a value into a mutable result container.
     *
     * @return a function which folds a value into a mutable result container
     */
    BiConsumer<A, T> accumulator();

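accumulator() returns a function that folds elements (of type T) one at a time into a mutable result container (of type A), namely a container created by the supplier() function above.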
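
To see how supplier() and accumulator() cooperate, the sketch below drives a collector by hand the way a sequential evaluation conceptually does: one container, then one accumulator call per element. SupplierAccumulatorDemo and accumulateAll() are hypothetical helpers written for this article, not part of the JDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class SupplierAccumulatorDemo {
    // Hypothetical helper: creates one container and folds every element into it.
    static <T, A> A accumulateAll(Collector<T, A, ?> collector, List<T> elements) {
        A container = collector.supplier().get();
        BiConsumer<A, T> accumulator = collector.accumulator();
        for (T element : elements) {
            accumulator.accept(container, element);
        }
        return container;
    }

    public static void main(String[] args) {
        Collector<String, ?, List<String>> toList = Collectors.toList();
        // The container type A is hidden behind a wildcard, so it is held as Object here.
        Object container = accumulateAll(toList, Arrays.asList("hello", "world"));
        System.out.println(container);
    }
}
```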

    /**
     * A function that accepts two partial results and merges them.  The
     * combiner function may fold state from one argument into the other and
     * return that, or may return a new result container.
     *
     * @return a function which combines two partial results into a combined
     * result
     */
    BinaryOperator<A> combiner();

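combiner() returns a function that accepts two partial result containers (of type A) and merges them. It may fold one container into the other and return that one, or merge both into a new container and return the new one.
A "partial result" here means the elements held in one container created by supplier(). Why would there be two of them? Because of parallel streams. A sequential stream creates a single container, so the combiner() function is never needed. A parallel stream may create several containers, and the combiner() function merges them pairwise until one container (still of type A) remains; finisher() then turns that container into the final result of type R.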

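Strictly speaking, "a parallel stream" is not the whole story: whether several intermediate containers are created also depends on the CONCURRENT characteristic in Characteristics. A later article on how collectors are executed illustrates the difference with examples.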
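
The role of combiner() can be imitated without any threads. The sketch below is a simplification, and CombinerDemo and accumulateInTwoChunks() are names invented for it. It splits the input into two chunks, gives each chunk its own container, and merges the two containers with combiner(), which is roughly what a non-CONCURRENT parallel evaluation does for each split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class CombinerDemo {
    // Hypothetical helper: one container per chunk, merged left chunk first.
    static <T, A> A accumulateInTwoChunks(Collector<T, A, ?> collector,
                                          List<T> left, List<T> right) {
        A leftContainer = collector.supplier().get();
        left.forEach(e -> collector.accumulator().accept(leftContainer, e));

        A rightContainer = collector.supplier().get();
        right.forEach(e -> collector.accumulator().accept(rightContainer, e));

        // The merged value is still of the intermediate type A, not R.
        return collector.combiner().apply(leftContainer, rightContainer);
    }

    public static void main(String[] args) {
        Collector<String, ?, List<String>> toList = Collectors.toList();
        Object merged = accumulateInTwoChunks(toList,
                Arrays.asList("a", "b"), Arrays.asList("c", "d"));
        System.out.println(merged);
    }
}
```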

    /**
     * Perform the final transformation from the intermediate accumulation type
     * {@code A} to the final result type {@code R}.
     *
     * <p>If the characteristic {@code IDENTITY_TRANSFORM} is
     * set, this function may be presumed to be an identity transform with an
     * unchecked cast from {@code A} to {@code R}.
     *
     * @return a function which transforms the intermediate result to the final
     * result
     */
    Function<A, R> finisher();

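finisher() returns the function that performs the final transformation. If the accumulated result needs a type conversion or some other post-processing, this is where it goes. If the collector declares Characteristics.IDENTITY_FINISH, no transformation is needed: the stream implementation may skip the finisher() function and simply cast the container from A to R.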
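
The sketch below shows a finisher that does real work. FinisherDemo, toUnmodifiableView(), and the FINISHER_CALLS counter are all hypothetical names for this example. The collector accumulates into an ArrayList and finishes by wrapping it in an unmodifiable view, and because it does not declare IDENTITY_FINISH, the stream has to apply the finisher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collector;
import java.util.stream.Stream;

class FinisherDemo {
    // Counts how many times the finisher function actually runs.
    static final AtomicInteger FINISHER_CALLS = new AtomicInteger();

    // Hypothetical collector: A is a mutable ArrayList, R is an unmodifiable view of it.
    static <T> Collector<T, List<T>, List<T>> toUnmodifiableView() {
        return Collector.<T, List<T>, List<T>>of(
                ArrayList::new,
                List::add,
                (left, right) -> { left.addAll(right); return left; },
                container -> {
                    FINISHER_CALLS.incrementAndGet();
                    return Collections.unmodifiableList(container);
                });
    }

    public static void main(String[] args) {
        List<String> result = Stream.of("hello", "world")
                .collect(FinisherDemo.<String>toUnmodifiableView());
        System.out.println(result + ", finisher calls: " + FINISHER_CALLS.get());
    }
}
```

Declaring IDENTITY_FINISH on this collector would be a mistake: the stream would then be free to return the raw ArrayList and the unmodifiable wrapper would be lost.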

    /**
     * Returns a {@code Set} of {@code Collector.Characteristics} indicating
     * the characteristics of this Collector.  This set should be immutable.
     *
     * @return an immutable set of collector characteristics
     */
    Set<Characteristics> characteristics();

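Finally, characteristics(). Collector characteristics have been mentioned several times already, and characteristics() is the method that returns them. We choose them ourselves, through Characteristics, when we create a collector. Characteristics is an enum nested in the Collector interface with three constants: CONCURRENT, UNORDERED, and IDENTITY_FINISH.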

    /**
     * Characteristics indicating properties of a {@code Collector}, which can
     * be used to optimize reduction implementations.
     */
    enum Characteristics {
        /**
         * Indicates that this collector is <em>concurrent</em>, meaning that
         * the result container can support the accumulator function being
         * called concurrently with the same result container from multiple
         * threads.
         *
         * <p>If a {@code CONCURRENT} collector is not also {@code UNORDERED},
         * then it should only be evaluated concurrently if applied to an
         * unordered data source.
         */
        CONCURRENT,

        /**
         * Indicates that the collection operation does not commit to preserving
         * the encounter order of input elements.  (This might be true if the
         * result container has no intrinsic order, such as a {@link Set}.)
         */
        UNORDERED,

        /**
         * Indicates that the finisher function is the identity function and
         * can be elided.  If set, it must be the case that an unchecked cast
         * from A to R will succeed.
         */
        IDENTITY_FINISH
    }

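CONCURRENT means the collector supports concurrent use: several threads may call the accumulator() function at the same time to add elements to the same intermediate result container.
Note that this is one shared container, not several. When a CONCURRENT collector is evaluated concurrently, even a parallel stream creates only a single intermediate container, and that container must be safe for concurrent modification. As the Javadoc above says, this happens only if the collector is also UNORDERED or the data source is unordered; otherwise the stream falls back to separate containers merged by combiner().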

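UNORDERED means the collector does not commit to preserving the encounter order of the input elements, which is typically the case when the result container, such as a Set, has no intrinsic order.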

IDENTITY_FINISH means the intermediate result container already is the final result we want, so an unchecked cast from A to R is guaranteed to succeed.
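
The characteristics of the built-in collectors can be inspected directly. The sketch below prints a few of them and restates, as a hypothetical helper named allowsConcurrentReduction(), the documented condition for a concurrent reduction: the stream is parallel, the collector is CONCURRENT, and either the source is unordered or the collector is UNORDERED. The helper only illustrates that rule; it is not a JDK method.

```java
import java.util.stream.Collector;
import java.util.stream.Collector.Characteristics;
import java.util.stream.Collectors;

class CharacteristicsDemo {
    // Hypothetical helper restating the documented rule for concurrent reduction.
    static boolean allowsConcurrentReduction(boolean parallel, boolean sourceOrdered,
                                             Collector<?, ?, ?> collector) {
        return parallel
                && collector.characteristics().contains(Characteristics.CONCURRENT)
                && (!sourceOrdered
                    || collector.characteristics().contains(Characteristics.UNORDERED));
    }

    public static void main(String[] args) {
        System.out.println("toList: " + Collectors.<String>toList().characteristics());
        System.out.println("toSet: " + Collectors.<String>toSet().characteristics());
        System.out.println("groupingByConcurrent: "
                + Collectors.<String, Integer>groupingByConcurrent(String::length)
                        .characteristics());

        // An ordered, parallel source: only a CONCURRENT and UNORDERED collector qualifies.
        System.out.println(allowsConcurrentReduction(true, true,
                Collectors.<String, Integer>groupingByConcurrent(String::length)));
        System.out.println(allowsConcurrentReduction(true, true,
                Collectors.<String>toList()));
    }
}
```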

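To summarize:
1. supplier() creates and returns a mutable result container.
2. accumulator() folds elements into a mutable result container, namely one returned by supplier().
3. combiner() merges two partial result containers (both created by supplier()), either by folding one into the other or by merging both into a new, empty container.
4. finisher() performs the final transformation from the intermediate result type to the final result type.
5. characteristics() returns the collector's set of characteristics; different characteristics lead to different execution strategies.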

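With the methods of the Collector interface covered, let's implement a collector of our own for a simple requirement.
The requirement is to remove duplicates from the elements of a collection. It has little practical value; the point is to show how a custom collector is written.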

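import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

public class MySetCollector<T> implements Collector<T, Set<T>, Set<T>> {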
    @Override
    public Supplier<Set<T>> supplier() {
        return HashSet<T>::new;
    }

    @Override
    public BiConsumer<Set<T>, T> accumulator() {
        return Set<T>::add;
    }

    @Override
    public BinaryOperator<Set<T>> combiner() {
        return (Set<T> s1, Set<T> s2) -> {
            s1.addAll(s2);
            return s1;
        };
    }

    @Override
    public Function<Set<T>, Set<T>> finisher() {
        return Function.identity();
    }

    @Override
    public Set<Characteristics> characteristics() {
        // IDENTITY_FINISH lets the stream skip finisher(); remove it and finisher() will be invoked.
        EnumSet<Characteristics> characteristicsEnumSet = EnumSet.of(Characteristics.UNORDERED,
                Characteristics.IDENTITY_FINISH);
        return Collections.unmodifiableSet(characteristicsEnumSet);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("hello","world","welcome","hello");
        Set<String> collect = list.stream().collect(new MySetCollector<String>());
        System.out.println(collect);
    }
}

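MySetCollector implements Collector and fixes its three type parameters: each collected element has type T, the intermediate result container has type Set<T>, and since the container needs no conversion, the final result type is Set<T> as well.
supplier() returns a HashSet as the intermediate container, and accumulator() calls Set's add method to add elements one at a time; both are written as method references.
combiner() then merges intermediate containers pairwise, and finisher() simply returns Function.identity(), so the merged container is returned as the final result: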

    /**
     * Returns a function that always returns its input argument.
     *
     * @param <T> the type of the input and output objects to the function
     * @return a function that always returns its input argument
     */
    static <T> Function<T, T> identity() {
        return t -> t;
    }

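characteristics() declares the collector's characteristics, UNORDERED and IDENTITY_FINISH: the collector does not preserve encounter order, and no final type conversion is needed.
The program prints [world, hello, welcome] (the iteration order of a HashSet is not guaranteed, so the order may differ).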
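
Because the collector does not declare CONCURRENT, it should also be usable with a parallel stream: each partial HashSet is filled separately and the combiner merges them. The sketch below checks that under stated assumptions. ParallelSetCollectorDemo, DedupCollector (a compact copy of the collector above, repeated so the example compiles on its own), and dedup() are names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

class ParallelSetCollectorDemo {
    // Compact copy of the de-duplicating collector so this sketch is self-contained.
    static class DedupCollector<T> implements Collector<T, Set<T>, Set<T>> {
        @Override public Supplier<Set<T>> supplier() { return HashSet::new; }
        @Override public BiConsumer<Set<T>, T> accumulator() { return Set::add; }
        @Override public BinaryOperator<Set<T>> combiner() {
            return (s1, s2) -> { s1.addAll(s2); return s1; };
        }
        @Override public Function<Set<T>, Set<T>> finisher() { return Function.identity(); }
        @Override public Set<Characteristics> characteristics() {
            return Collections.unmodifiableSet(
                    EnumSet.of(Characteristics.UNORDERED, Characteristics.IDENTITY_FINISH));
        }
    }

    // Runs the collector on either a sequential or a parallel stream.
    static Set<String> dedup(List<String> source, boolean parallel) {
        return (parallel ? source.parallelStream() : source.stream())
                .collect(new DedupCollector<String>());
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("hello", "world", "welcome", "hello");
        // Set equality ignores order, so both evaluations should agree.
        System.out.println(dedup(list, false).equals(dedup(list, true)));
    }
}
```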

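In this article we went through the collector source code and built our own collector, MySetCollector, for a simple de-duplication requirement. The next article reuses this example to analyze how collectors are executed.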

尹昊
