
What is stream processing

Developers with Java experience are usually full of praise for Java 8's Stream API, which greatly improves the ability to process collection-type data.

int sum = widgets.stream()
              .filter(w -> w.getColor() == RED)
              .mapToInt(w -> w.getWeight())
              .sum();

Stream supports chained calls and a functional programming style for data processing: the data seems to flow through a pipeline, continuously processed in real time, and is finally aggregated. The idea behind Stream's implementation is to abstract data processing into a data stream and return a new stream after each operation for further use.
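
As a preview, here is a rough Go equivalent of the Java snippet above, written against the go-zero fx.Stream API that the rest of this article dissects (a sketch; the filter and map conditions are invented for illustration, and fx refers to github.com/zeromicro/go-zero/core/fx):

var sum int
fx.Just(1, 2, 3, 4, 5).Filter(func(item interface{}) bool {
  // keep even numbers only
  return item.(int)%2 == 0
}).Map(func(item interface{}) interface{} {
  // scale each remaining item
  return item.(int) * 10
}).ForEach(func(item interface{}) {
  // ForEach runs in the calling goroutine, so this accumulation is safe
  sum += item.(int)
})
fmt.Println(sum) // 60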

Stream function definition

Before writing any code, think it through: clarifying the requirements is the most important step. Let's put ourselves in the author's shoes and simulate the design of the whole component. Setting the low-level implementation aside for now, we first try to define the stream functions from scratch.

Stream's workflow is essentially a producer-consumer model, very similar to a production line in a factory. Let's first define the life cycle of a Stream:

  1. Creation phase/data acquisition (raw materials)
  2. Processing stage/intermediate processing (pipeline processing)
  3. Aggregation phase/final operation (final product)

We define the API around these three phases of the stream's life cycle:

Creation phase

This phase creates the stream's abstract object; think of these functions as constructors.

We support three ways to construct a stream, namely: slice conversion, channel conversion, and functional conversion.

Note that the APIs at this stage are ordinary public functions, not bound to the Stream object.

// create a stream from variadic parameters
func Just(items ...interface{}) Stream

// create a stream from a channel
func Range(source <-chan interface{}) Stream

// create a stream from a generator function
func From(generate GenerateFunc) Stream

// splice streams together
func Concat(s Stream, others ...Stream) Stream

Processing stage

The operations in the processing stage usually correspond to our business logic: conversion, filtering, deduplication, sorting, and so on.

The APIs at this stage are methods, bound to the Stream object.

Based on common business scenarios, we define the following:

// Distinct removes duplicate items
Distinct(keyFunc KeyFunc) Stream
// Filter keeps items that match the condition
Filter(filterFunc FilterFunc, opts ...Option) Stream
// Group groups items by key
Group(fn KeyFunc) Stream
// Head returns the first n items
Head(n int64) Stream
// Tail returns the last n items
Tail(n int64) Stream
// Map converts each item
Map(fn MapFunc, opts ...Option) Stream
// Merge merges all items into a slice, generating a new stream
Merge() Stream
// Reverse reverses the items
Reverse() Stream
// Sort sorts the items
Sort(fn LessFunc) Stream
// Walk applies a function to every item
Walk(fn WalkFunc, opts ...Option) Stream
// Concat splices other Streams onto this one
Concat(streams ...Stream) Stream

Each operation in the processing stage returns a new Stream object. Here is the basic implementation paradigm:
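
A minimal sketch of that paradigm (process is a hypothetical placeholder for any concrete operation; Range, shown later, wraps a channel into a Stream):

func (s Stream) process() Stream {
  source := make(chan interface{})
  go func() {
    // closing the new channel lets downstream range loops terminate
    defer close(source)
    for item := range s.source {
      // ... apply this operation's logic to item, then forward it ...
      source <- item
    }
  }()
  // wrap the new channel into a new Stream for the next chained call
  return Range(source)
}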

Aggregation stage

The aggregation stage produces the result we actually want, for example: whether items match, counting them, traversing them, and so on.

// AllMatch checks whether all items match
AllMatch(fn PredicateFunc) bool
// AnyMatch checks whether at least one item matches
AnyMatch(fn PredicateFunc) bool
// NoneMatch checks whether no item matches
NoneMatch(fn PredicateFunc) bool
// Count counts the items
Count() int
// Done drains the stream
Done()
// ForAll handles all items at once
ForAll(fn ForAllFunc)
// ForEach handles each item
ForEach(fn ForEachFunc)

After sorting out the requirement boundaries of the component, we have a much clearer picture of the Stream we are about to implement. In my view, a real architect's grasp of requirements and their subsequent evolution can be remarkably precise, and that is inseparable from deep thinking about the requirements and insight into the essence behind them. Simulating the author's construction process of the whole project, and learning the author's methodology of thinking, is the greatest value we get from studying open source projects.

Now let's define the complete picture of the Stream interface and its function types.

The role of an interface is more than a template: its abstraction lets us build the overall framework of the project first without sinking into details, quickly express our thinking, and observe the whole system from a macro perspective. Plunge into the details at the start and you can easily end up lost before you have begun.

rxOptions struct {
  unlimitedWorkers bool
  workers          int
}
Option func(opts *rxOptions)
// KeyFunc generates a key
// item - the element in the stream
KeyFunc func(item interface{}) interface{}
// FilterFunc filters items
FilterFunc func(item interface{}) bool
// MapFunc converts an item
MapFunc func(item interface{}) interface{}
// LessFunc compares two items
LessFunc func(a, b interface{}) bool
// WalkFunc traverses items
WalkFunc func(item interface{}, pipe chan<- interface{})
// PredicateFunc tests whether an item matches
PredicateFunc func(item interface{}) bool
// ForAllFunc handles all elements at once
ForAllFunc func(pipe <-chan interface{})
// ForEachFunc handles each item
ForEachFunc func(item interface{})
// ParallelFunc handles items concurrently
ParallelFunc func(item interface{})
// ReduceFunc aggregates all elements
ReduceFunc func(pipe <-chan interface{}) (interface{}, error)
// GenerateFunc produces items; the generator writes into the
// channel, so the direction must be send-only
GenerateFunc func(source chan<- interface{})

Stream interface {
  // Distinct removes duplicate items
  Distinct(keyFunc KeyFunc) Stream
  // Filter keeps items that match the condition
  Filter(filterFunc FilterFunc, opts ...Option) Stream
  // Group groups items by key
  Group(fn KeyFunc) Stream
  // Head returns the first n items
  Head(n int64) Stream
  // Tail returns the last n items
  Tail(n int64) Stream
  // First returns the first item
  First() interface{}
  // Last returns the last item
  Last() interface{}
  // Map converts each item
  Map(fn MapFunc, opts ...Option) Stream
  // Merge merges all items into a slice, generating a new stream
  Merge() Stream
  // Reverse reverses the items
  Reverse() Stream
  // Sort sorts the items
  Sort(fn LessFunc) Stream
  // Walk applies a function to every item
  Walk(fn WalkFunc, opts ...Option) Stream
  // Concat splices other Streams onto this one
  Concat(streams ...Stream) Stream
  // AllMatch checks whether all items match
  AllMatch(fn PredicateFunc) bool
  // AnyMatch checks whether at least one item matches
  AnyMatch(fn PredicateFunc) bool
  // NoneMatch checks whether no item matches
  NoneMatch(fn PredicateFunc) bool
  // Count counts the items
  Count() int
  // Done drains the stream
  Done()
  // ForAll handles all items at once
  ForAll(fn ForAllFunc)
  // ForEach handles each item
  ForEach(fn ForEachFunc)
}

The channel() method returns the Stream's internal pipeline. Since callers work against the interface type, we expose channel() as a private method for reading the internal channel.

// channel returns the internal data container; a package-private method
channel() chan interface{}

Implementation approach

With the function definitions sorted out, let's consider several concrete engineering questions.

How to implement chained calls

For chained calls, the builder pattern used to create objects achieves a chaining effect. Stream achieves the same chaining in a similar way: each call creates and returns a new Stream to the caller.

// Distinct removes duplicate items
Distinct(keyFunc KeyFunc) Stream
// Filter keeps items that match the condition
Filter(filterFunc FilterFunc, opts ...Option) Stream

How to achieve the pipeline effect

A pipeline can be understood as the container where a Stream's data is stored. By using a channel as the data pipe, chained Stream calls achieve an asynchronous, non-blocking processing effect.

How to support parallel processing

Data processing is essentially consuming the data in the channel, so parallel processing is nothing more than consuming the channel in parallel. Goroutines combined with the WaitGroup mechanism make this easy to implement, as sketched below.
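
A minimal, self-contained sketch of the idea (not go-zero code): several goroutines consume one channel in parallel while a sync.WaitGroup waits for them all to finish.

func parallelConsume(source <-chan int, workers int) {
  var wg sync.WaitGroup
  for i := 0; i < workers; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      // range exits once the producer closes the channel
      for item := range source {
        fmt.Println(item * item)
      }
    }()
  }
  wg.Wait()
}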

go-zero implementation

core/fx/stream.go

The Stream implementation in go-zero does not define an interface, but that doesn't matter; the logic of the underlying implementation is the same.

To implement the Stream API, we define an internal struct whose source field is a channel, which simulates the pipeline.

Stream struct {
  source <-chan interface{}
}

Create API

Create from a channel Range

Create stream through channel

func Range(source <-chan interface{}) Stream {  
  return Stream{  
    source: source,  
  }  
}

Create from variadic parameters Just

Just creates a stream from variadic parameters; closing the channel promptly after writing, as it does, is a good habit.

func Just(items ...interface{}) Stream {
  source := make(chan interface{}, len(items))
  for _, item := range items {
    source <- item
  }
  close(source)
  return Range(source)
}

Create from a function From

Create Stream by function

func From(generate GenerateFunc) Stream {
  source := make(chan interface{})
  threading.GoSafe(func() {
    defer close(source)
    generate(source)
  })
  return Range(source)
}

Because a function parameter passed in from outside is invoked here, we have no control over how it executes. We therefore need to catch runtime panics, to prevent them from propagating to the upper layer and crashing the application.

func Recover(cleanups ...func()) {
  for _, cleanup := range cleanups {
    cleanup()
  }
  if r := recover(); r != nil {
    logx.ErrorStack(r)
  }
}

func RunSafe(fn func()) {
  defer rescue.Recover()
  fn()
}

func GoSafe(fn func()) {
  go RunSafe(fn)
}
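
A hypothetical usage example for From, in the style of this article's other test cases (not from the go-zero repo): the generator runs inside GoSafe, so a panic inside it is logged instead of crashing the process.

func TestInternalStream_From(t *testing.T) {
  // 0 1 2
  From(func(source chan<- interface{}) {
    for i := 0; i < 3; i++ {
      source <- i
    }
  }).ForEach(func(item interface{}) {
    t.Log(item)
  })
}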

Splicing Concat

Concat splices other Streams onto one to create a new Stream. It calls the internal Concat method; we will analyze Concat's source code later.

func Concat(s Stream, others ...Stream) Stream {
  return s.Concat(others...)
}

Processing API

Distinct

Because the function parameter KeyFunc func(item interface{}) interface{} is passed in, deduplication can be customized per business scenario. Essentially, items are deduplicated via a map keyed by the value KeyFunc returns.

Function parameters are very powerful, which can greatly improve flexibility.

func (s Stream) Distinct(keyFunc KeyFunc) Stream {
  source := make(chan interface{})
  threading.GoSafe(func() {
    // remembering to close the channel is a good habit
    defer close(source)
    keys := make(map[interface{}]lang.PlaceholderType)
    for item := range s.source {
      // custom deduplication logic
      key := keyFunc(item)
      // if the key hasn't been seen, write the item into the new channel
      if _, ok := keys[key]; !ok {
        source <- item
        keys[key] = lang.Placeholder
      }
    }
  })
  return Range(source)
}

Use Cases:

// 1 2 3 4 5
Just(1, 2, 3, 3, 4, 5, 5).Distinct(func(item interface{}) interface{} {
  return item
}).ForEach(func(item interface{}) {
  t.Log(item)
})

// 1 2 3 4
Just(1, 2, 3, 3, 4, 5, 5).Distinct(func(item interface{}) interface{} {
  uid := item.(int)
  // apply special dedup logic to items greater than 3; only one item > 3 is kept
  if uid > 3 {
    return 4
  }
  return item
}).ForEach(func(item interface{}) {
  t.Log(item)
})

Filter

The filtering logic is abstracted into a FilterFunc that is applied to each item; the boolean it returns determines whether the item is written into the new channel. The actual traversal is delegated to the Walk method.

The Option parameter supports two settings:

  1. unlimitedWorkers: do not limit the number of goroutines
  2. workers: limit the number of goroutines

FilterFunc func(item interface{}) bool

func (s Stream) Filter(filterFunc FilterFunc, opts ...Option) Stream {
  return s.Walk(func(item interface{}, pipe chan<- interface{}) {
    if filterFunc(item) {
      pipe <- item
    }
  }, opts...)
}

Example of use:

func TestInternalStream_Filter(t *testing.T) {
  // keep even numbers: 2, 4
  channel := Just(1, 2, 3, 4, 5).Filter(func(item interface{}) bool {
    return item.(int)%2 == 0
  }).channel()
  for item := range channel {
    t.Log(item)
  }
}

Traverse execution Walk

Walk, as its name suggests, walks over every item, performs the given WalkFunc on it, and writes the results into a new Stream.

Note that because goroutines are used internally to read and write data asynchronously, the order of the data in the new Stream's channel is random.

// item - the element in the stream
// pipe - write the item into pipe if it meets the condition
WalkFunc func(item interface{}, pipe chan<- interface{})

func (s Stream) Walk(fn WalkFunc, opts ...Option) Stream {
  option := buildOptions(opts...)
  if option.unlimitedWorkers {
    return s.walkUnLimited(fn, option)
  }
  return s.walkLimited(fn, option)
}

func (s Stream) walkUnLimited(fn WalkFunc, option *rxOptions) Stream {
  // create a buffered channel
  // the default size is 16; writes block once it holds more than 16 items
  pipe := make(chan interface{}, defaultWorkers)
  go func() {
    var wg sync.WaitGroup

    for item := range s.source {
      // we must read every element of s.source; this also shows why
      // closing a channel after writing matters: an unclosed channel
      // can block this goroutine forever and leak it
      // important: not copying item into val is a classic concurrency trap,
      // because the variable is used later in another goroutine
      val := item
      wg.Add(1)
      // execute the function in safe mode
      threading.GoSafe(func() {
        defer wg.Done()
        fn(val, pipe)
      })
    }
    wg.Wait()
    close(pipe)
  }()

  // return a new Stream
  return Range(pipe)
}

func (s Stream) walkLimited(fn WalkFunc, option *rxOptions) Stream {
  pipe := make(chan interface{}, option.workers)
  go func() {
    var wg sync.WaitGroup
    // limits the number of goroutines (a worker pool)
    pool := make(chan lang.PlaceholderType, option.workers)

    for item := range s.source {
      // important: not copying item into val is a classic concurrency trap,
      // because the variable is used later in another goroutine
      val := item
      // blocks once the goroutine limit is reached
      pool <- lang.Placeholder
      // again, if s.source were never closed, this loop would block forever
      // and leak the goroutine
      wg.Add(1)

      // execute the function in safe mode
      threading.GoSafe(func() {
        defer func() {
          wg.Done()
          // after finishing, read once from pool to release a worker slot
          <-pool
        }()
        fn(val, pipe)
      })
    }
    wg.Wait()
    close(pipe)
  }()
  return Range(pipe)
}

Use Cases:

The order of return is random.

func Test_Stream_Walk(t *testing.T) {
  // outputs 300, 100, 200 (the order is random)
  Just(1, 2, 3).Walk(func(item interface{}, pipe chan<- interface{}) {
    pipe <- item.(int) * 100
  }, WithWorkers(3)).ForEach(func(item interface{}) {
    t.Log(item)
  })
}

Group

Group puts items into a map keyed by the result of KeyFunc, then emits each group into the new channel.

KeyFunc func(item interface{}) interface{}

func (s Stream) Group(fn KeyFunc) Stream {
  groups := make(map[interface{}][]interface{})
  for item := range s.source {
    key := fn(item)
    groups[key] = append(groups[key], item)
  }
  source := make(chan interface{})
  go func() {
    for _, group := range groups {
      source <- group
    }
    close(source)
  }()
  return Range(source)
}
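
Group ships without a usage example here, so this is a hypothetical one in the same style: group numbers by parity. Each emitted item is a whole []interface{} group, and the group order is random because Go randomizes map iteration.

func TestInternalStream_Group(t *testing.T) {
  // [2 4] and [1 3 5], in random group order
  channel := Just(1, 2, 3, 4, 5).Group(func(item interface{}) interface{} {
    return item.(int) % 2
  }).channel()
  for item := range channel {
    t.Log(item)
  }
}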

Get the first n elements Head

If n is greater than the actual data set length, all elements will be returned

func (s Stream) Head(n int64) Stream {
  if n < 1 {
    panic("n must be greater than 0")
  }
  source := make(chan interface{})
  go func() {
    for item := range s.source {
      n--
      // n may exceed the length of s.source, so check n >= 0
      if n >= 0 {
        source <- item
      }
      // let successive methods go ASAP even if we have more items to skip
      // n == 0 means source already holds its n items and can be closed;
      // we don't break out of the loop here: every operation produces a new
      // Stream and the old one is never read again, so breaking would leave
      // the upstream goroutine blocked forever, i.e. a goroutine leak
      if n == 0 {
        close(source)
      }
    }
    // if the loop exits with n > 0, s.source held fewer than n items,
    // and we still need to close the new source explicitly
    if n > 0 {
      close(source)
    }
  }()
  return Range(source)
}

Example of use:

// returns 1, 2
func TestInternalStream_Head(t *testing.T) {
  channel := Just(1, 2, 3, 4, 5).Head(2).channel()
  for item := range channel {
    t.Log(item)
  }
}

Get the last n elements Tail

This one is interesting: to collect the last n elements, it uses a ring slice, the Ring data structure. Let's first look at how Ring is implemented.

// Ring is a ring slice
type Ring struct {
  elements []interface{}
  index    int
  lock     sync.Mutex
}

func NewRing(n int) *Ring {
  if n < 1 {
    panic("n should be greater than 0")
  }
  return &Ring{
    elements: make([]interface{}, n),
  }
}

// Add adds an element
func (r *Ring) Add(v interface{}) {
  r.lock.Lock()
  defer r.lock.Unlock()
  // write the element into the slot computed by modulo,
  // which produces the circular-write behavior
  r.elements[r.index%len(r.elements)] = v
  // advance the next write position
  r.index++
}

// Take returns all elements,
// reading in the same order they were written
func (r *Ring) Take() []interface{} {
  r.lock.Lock()
  defer r.lock.Unlock()

  var size int
  var start int
  // when writes have wrapped around,
  // the start position must be computed by modulo,
  // because we want the read order to match the write order
  if r.index > len(r.elements) {
    size = len(r.elements)
    // after wrapping, the current write position index points at the oldest data
    start = r.index % len(r.elements)
  } else {
    size = r.index
  }
  elements := make([]interface{}, size)
  for i := 0; i < size; i++ {
    // modulo gives circular reads, keeping read order equal to write order
    elements[i] = r.elements[(start+i)%len(r.elements)]
  }

  return elements
}
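
A quick hypothetical sketch of Ring in use (assuming the type above; go-zero exposes it as collection.NewRing): with capacity 3, writing five values overwrites the two oldest.

func TestRing(t *testing.T) {
  ring := NewRing(3)
  for i := 1; i <= 5; i++ {
    ring.Add(i)
  }
  // 1 and 2 were overwritten; read order matches write order: [3 4 5]
  t.Log(ring.Take())
}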

To summarize, the advantages of the ring slice:

  • Supports automatic rolling updates
  • Saves memory

When its fixed capacity is full, a ring slice keeps overwriting the oldest data with new data. This property makes it a good fit for fetching the last n elements of a channel.

func (s Stream) Tail(n int64) Stream {
  if n < 1 {
    panic("n must be greater than 0")
  }
  source := make(chan interface{})
  go func() {
    ring := collection.NewRing(int(n))
    // read all elements; when the count exceeds n, the ring slice
    // overwrites old data with new data, guaranteeing we keep
    // exactly the last n elements
    for item := range s.source {
      ring.Add(item)
    }
    for _, item := range ring.Take() {
      source <- item
    }
    close(source)
  }()
  return Range(source)
}

So why not just use a slice of length len(source)?

The answer is to save memory. Ring-style data structures allocate a fixed capacity and reuse it, so resources are allocated on demand.

Example of use:

func TestInternalStream_Tail(t *testing.T) {
  // 4,5
  channel := Just(1, 2, 3, 4, 5).Tail(2).channel()
  for item := range channel {
    t.Log(item)
  }
  // 1,2,3,4,5
  channel2 := Just(1, 2, 3, 4, 5).Tail(6).channel()
  for item := range channel2 {
    t.Log(item)
  }
}

Element Conversion Map

Map converts each element. The conversion work runs in goroutines (it delegates to Walk), so note that the output channel is not guaranteed to preserve the input order.

MapFunc func(item interface{}) interface{}

func (s Stream) Map(fn MapFunc, opts ...Option) Stream {
  return s.Walk(func(item interface{}, pipe chan<- interface{}) {
    pipe <- fn(item)
  }, opts...)
}

Example of use:

func TestInternalStream_Map(t *testing.T) {
  channel := Just(1, 2, 3, 4, 5, 2, 2, 2, 2, 2, 2).Map(func(item interface{}) interface{} {
    return item.(int) * 10
  }).channel()
  for item := range channel {
    t.Log(item)
  }
}

Merge

The implementation is relatively simple: it merges all items into one slice, which becomes the single element of the new stream. I thought about it for a long time and couldn't come up with a scenario this method fits well.

func (s Stream) Merge() Stream {
  var items []interface{}
  for item := range s.source {
    items = append(items, item)
  }
  source := make(chan interface{}, 1)
  source <- items
  // close the channel so downstream range loops can terminate
  close(source)
  return Range(source)
}
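
A hypothetical usage example: the whole stream collapses into a single slice item.

func TestInternalStream_Merge(t *testing.T) {
  // [1 2 3]
  Just(1, 2, 3).Merge().ForEach(func(item interface{}) {
    t.Log(item)
  })
}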

Reverse

Reverse reverses the elements in the channel. The reversal algorithm:

  • Find the middle position
  • Swap the elements on the two sides pairwise

Notice that we receive s.source into a slice. Wouldn't an array be better? In fact an array can't be used here: slices expand automatically, and the element count isn't known in advance, because writes into a Stream's source usually happen asynchronously in goroutines, so each Stream's channel contents change dynamically. Comparing the Stream workflow to a pipeline really is apt.

func (s Stream) Reverse() Stream {
  var items []interface{}
  for item := range s.source {
    items = append(items, item)
  }
  for i := len(items)/2 - 1; i >= 0; i-- {
    opp := len(items) - 1 - i
    items[i], items[opp] = items[opp], items[i]
  }
  return Just(items...)
}

Example of use:

func TestInternalStream_Reverse(t *testing.T) {
  channel := Just(1, 2, 3, 4, 5).Reverse().channel()
  for item := range channel {
    t.Log(item)
  }
}

Sort

Internally it calls sort.Slice from the standard library's sort package, passing the user's comparison function in for the ordering logic.

func (s Stream) Sort(fn LessFunc) Stream {
  var items []interface{}
  for item := range s.source {
    items = append(items, item)
  }

  sort.Slice(items, func(i, j int) bool {
    // pass the items themselves to the comparison function, not the indexes
    return fn(items[i], items[j])
  })
  return Just(items...)
}

Example of use:

// 5,4,3,2,1
func TestInternalStream_Sort(t *testing.T) {
  channel := Just(1, 2, 3, 4, 5).Sort(func(a, b interface{}) bool {
    return a.(int) > b.(int)
  }).channel()
  for item := range channel {
    t.Log(item)
  }
}

Splicing Concat

func (s Stream) Concat(streams ...Stream) Stream {
  // create a new unbuffered channel
  source := make(chan interface{})
  go func() {
    // create a WaitGroup-style routine group
    group := threading.NewRoutineGroup()
    // asynchronously read data from the original channel
    group.Run(func() {
      for item := range s.source {
        source <- item
      }
    })
    // asynchronously read data from the channels of the Streams being concatenated
    for _, stream := range streams {
      // capture the loop variable, otherwise every goroutine may read the last stream
      stream := stream
      // one goroutine per Stream
      group.Run(func() {
        for item := range stream.channel() {
          source <- item
        }
      })
    }
    // block until all reads are done
    group.Wait()
    close(source)
  }()
  // return a new Stream
  return Range(source)
}
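
A hypothetical usage example: the source channels are consumed concurrently, so the output order is random.

func TestInternalStream_Concat(t *testing.T) {
  // 1 to 6, in random order
  channel := Just(1, 2).Concat(Just(3, 4), Just(5, 6)).channel()
  for item := range channel {
    t.Log(item)
  }
}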

Aggregate API

All matches AllMatch

func (s Stream) AllMatch(fn PredicateFunc) bool {
  for item := range s.source {
    if !fn(item) {
      // drain s.source, otherwise the upstream goroutine may block forever
      go drain(s.source)
      return false
    }
  }

  return true
}

Any match AnyMatch

func (s Stream) AnyMatch(fn PredicateFunc) bool {
  for item := range s.source {
    if fn(item) {
      // drain s.source, otherwise the upstream goroutine may block forever
      go drain(s.source)
      return true
    }
  }

  return false
}

None match NoneMatch

func (s Stream) NoneMatch(fn func(item interface{}) bool) bool {
  for item := range s.source {
    if fn(item) {
      // drain s.source, otherwise the upstream goroutine may block forever
      go drain(s.source)
      return false
    }
  }

  return true
}
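
A hypothetical example covering all three matchers; note that each call builds a fresh stream, since a stream's channel can only be consumed once.

func TestInternalStream_Match(t *testing.T) {
  isEven := func(item interface{}) bool { return item.(int)%2 == 0 }
  t.Log(Just(2, 4, 6).AllMatch(isEven))  // true
  t.Log(Just(1, 2, 3).AnyMatch(isEven))  // true
  t.Log(Just(1, 3, 5).NoneMatch(isEven)) // true
}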

Count

func (s Stream) Count() int {
  var count int
  for range s.source {
    count++
  }
  return count
}
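
A hypothetical usage example:

func TestInternalStream_Count(t *testing.T) {
  t.Log(Just(1, 2, 3).Count()) // 3
}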

Drain the stream Done

func (s Stream) Done() {
  // drain the channel to prevent blocked, leaked goroutines
  drain(s.source)
}

Iterate all elements ForAll

func (s Stream) ForAll(fn ForAllFunc) {
  fn(s.source)
}
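
ForAll hands the raw channel to the callback, which makes one-pass aggregations easy. A hypothetical example:

func TestInternalStream_ForAll(t *testing.T) {
  Just(1, 2, 3).ForAll(func(pipe <-chan interface{}) {
    var sum int
    for item := range pipe {
      sum += item.(int)
    }
    t.Log(sum) // 6
  })
}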

Iterate each element ForEach

func (s Stream) ForEach(fn ForEachFunc) {
  for item := range s.source {
    fn(item)
  }
}

Summary

At this point, the entire Stream component has been implemented. The core logic is to use a channel as the pipeline and treat the data as water flowing through it, with goroutines continuously reading from and writing to channels to achieve an asynchronous, non-blocking effect.

Going back to the problem raised at the beginning: implementing a stream seems very difficult before you start. It's hard to imagine that such a powerful component can be implemented in a little over 300 lines of Go code.

The efficiency rests on three Go language features:

  • channels
  • goroutines
  • functional programming

Reference

pipeline pattern

slice reversal algorithm

project address

https://github.com/zeromicro/go-zero

Welcome to use go-zero and star support us!
