MapReduce
MapReduce is a distributed parallel programming model. When a function or interface call involves heavy computation or a large number of calls to third-party interfaces, MapReduce can be applied as a pattern: the work is processed in parallel on one or more machines, and the results are finally aggregated together and output.
MapReduce in go-zero
go-zero is a Go microservice framework that has become popular recently, and it also contains some interesting and practical libraries that can be used on their own, such as mr (MapReduce).
This is what the official documentation says: in real business scenarios we often need to fetch attributes from different rpc services to assemble a complex object. If the calls are made serially, the response time grows linearly with the number of rpc calls, so we usually turn serial into parallel to optimize performance.
Implementing a parallel pattern like this ourselves is fairly troublesome, while go-zero's MapReduce lets us achieve the same effect very easily, so we can focus on our own business logic.
Simple to use
One thing worth mentioning here: go-zero also provides control over the number of worker goroutines, so you can cap how many run in parallel and avoid exhausting the server with too many goroutines. At the same time, we can pass in our own context to control the timeout and cancellation logic of the whole call. All of this is encapsulated in the library; we don't need to worry about it, just pass the right arguments.
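As a quick illustration, both knobs are passed as options to mr.MapReduce. This is just a fragment of the call used in the full example below; mr.WithWorkers is go-zero's option for capping parallelism (the default is 16 workers in the version I read):
res, err := mr.MapReduce(generateFunc, mapFunc, reducerFunc,
    mr.WithWorkers(5),   // at most 5 goroutines run mapFunc at the same time
    mr.WithContext(ctx), // our own context drives timeout and cancellation
)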
The following is a fairly simple example: it reads a slice in parallel, appends ":1" to each element, and finally outputs the transformed slice.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/zeromicro/go-zero/core/mr"
)

func main() {
    // the data to process
    uid := []string{"a", "b", "c", "d", "e", "f"}
    // logic that feeds the data into the pipeline
    generateFunc := func(source chan<- interface{}) {
        for _, v := range uid {
            source <- v
            fmt.Println("source:", v)
        }
    }
    // logic that processes a single item
    mapFunc := func(item interface{}, writer mr.Writer, cancel func(err error)) {
        tmp := item.(string) + ":1"
        writer.Write(tmp)
        fmt.Println("item:", item)
    }
    // logic that merges the results
    reducerFunc := func(pipe <-chan interface{}, writer mr.Writer, cancel func(err error)) {
        var uid []string
        for v := range pipe {
            uid = append(uid, v.(string))
            fmt.Println("pipe:", uid)
        }
        writer.Write(uid)
    }
    // start processing the data concurrently, with a 3s timeout
    ctx, cl := context.WithTimeout(context.Background(), time.Second*3)
    defer cl()
    // a goroutine that calls cl after 2s to cancel all worker goroutines;
    // it has to start before the blocking MapReduce call to have any effect
    go func() {
        time.Sleep(time.Second * 2)
        fmt.Println("cl")
        cl()
    }()
    res, err := mr.MapReduce(generateFunc, mapFunc, reducerFunc, mr.WithContext(ctx))
    fmt.Println(res, err)
}
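Since all six items finish almost instantly, the call returns well before the 2-second cancel fires. The last line printed should look something like this (the order of elements varies between runs, because the map phase is parallel):
[b:1 a:1 c:1 d:1 e:1 f:1] <nil>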
Source code analysis
First, let's look at the flow chart of the whole function. This flow chart reflects my own understanding; if anything is wrong, please point it out in the comments.
We can see that a large number of channels are used throughout, and the whole complex flow is coordinated through these channels. Let's look at the three main steps.
Get the data to process
Here we see the generate function, defined as follows. Its parameter is a channel, and inside this function we feed in the source data, i.e. the data we are going to process. In the example above, a for loop sends each element of the slice into the channel, and the subsequent logic consumes it from there. We can also do some simple preprocessing of the data at this stage.
This is a typical higher-order function: a function is passed as a parameter to another function.
type GenerateFunc func(source chan<- interface{})

func buildSource(generate GenerateFunc, panicChan *onceChan) chan interface{} {
    source := make(chan interface{})
    go func() {
        defer func() {
            if r := recover(); r != nil {
                panicChan.write(r)
            }
            close(source)
        }()
        generate(source)
    }()
    return source
}
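To see the higher-order pattern in isolation, here is a minimal self-contained sketch of the same idea; the names feed and gen are mine, not go-zero's. The producer logic is handed in as a function that receives a write-only channel, while the helper owns the goroutine and the close:
package main

import "fmt"

// feed runs gen in its own goroutine and returns a channel of the produced values.
// Closing the channel is owned by feed, so consumers' range loops always terminate.
func feed(gen func(chan<- int)) <-chan int {
    ch := make(chan int)
    go func() {
        defer close(ch)
        gen(ch)
    }()
    return ch
}

func main() {
    src := feed(func(out chan<- int) {
        for i := 0; i < 3; i++ {
            out <- i
        }
    })
    for v := range src {
        fmt.Println("got:", v)
    }
}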
Data processing
What MapReduce ultimately needs is the parallel processing logic, and the following is its core.
The number of goroutines is controlled through a pool channel; an atomic counter records whether any goroutine has failed, and a WaitGroup waits for all in-flight goroutines to finish.
What we need to pay attention to here is the heavy use of channels for control, and the logic inside the deferred functions.
In the select of the control loop, if ctx is done or doneChan has fired, the whole function stops. While a worker slot is still free, a token is first placed into the pool channel, which is equivalent to occupying one goroutine, and wg.Add(1) counts it. In the deferred function we call wg.Done() and take one token back out of the pool, so that subsequent items have a free slot in which to be processed. A standalone sketch of this pool idiom follows the go-zero code below.
// the call site
go executeMappers(mapperContext{
    ctx: options.ctx,
    mapper: func(item interface{}, w Writer) {
        mapper(item, w, cancel)
    },
    source:    source,
    panicChan: panicChan,
    collector: collector,
    doneChan:  done,
    workers:   options.workers,
})
// the actual worker loop
func executeMappers(mCtx mapperContext) {
    var wg sync.WaitGroup
    defer func() {
        wg.Wait()
        close(mCtx.collector)
        drain(mCtx.source)
    }()

    var failed int32
    pool := make(chan lang.PlaceholderType, mCtx.workers)
    writer := newGuardedWriter(mCtx.ctx, mCtx.collector, mCtx.doneChan)
    for atomic.LoadInt32(&failed) == 0 {
        select {
        case <-mCtx.ctx.Done():
            return
        case <-mCtx.doneChan:
            return
        case pool <- lang.Placeholder:
            item, ok := <-mCtx.source
            if !ok {
                <-pool
                return
            }

            wg.Add(1)
            go func() {
                defer func() {
                    if r := recover(); r != nil {
                        atomic.AddInt32(&failed, 1)
                        mCtx.panicChan.write(r)
                    }
                    wg.Done()
                    <-pool
                }()

                mCtx.mapper(item, writer)
            }()
        }
    }
}
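To make the pool idiom easier to see, here is a minimal self-contained sketch with go-zero's extra machinery (panic channel, guarded writer, atomic failure flag) stripped away; the names jobs, pool and workers are mine:
package main

import (
    "fmt"
    "sync"
)

func main() {
    jobs := make(chan int)
    go func() {
        defer close(jobs)
        for i := 0; i < 10; i++ {
            jobs <- i
        }
    }()

    const workers = 3
    pool := make(chan struct{}, workers) // at most 3 tokens, so at most 3 goroutines
    var wg sync.WaitGroup
    for {
        pool <- struct{}{} // occupy a slot; blocks while all 3 are taken
        job, ok := <-jobs
        if !ok {
            <-pool // nothing left to process, give the slot back
            break
        }
        wg.Add(1)
        go func(j int) {
            defer func() {
                wg.Done()
                <-pool // release the slot so the next job can start
            }()
            fmt.Println("worker processing", j)
        }(job)
    }
    wg.Wait()
}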
Merging the data
While the executeMappers function keeps writing data into the collector channel, the reducer function consumes and processes it, and at the end the aggregated data is sent to the output channel through writer.Write. At that point the whole function is done.
go func() {
    defer func() {
        drain(collector)
        if r := recover(); r != nil {
            panicChan.write(r)
        }
        finish()
    }()

    reducer(collector, writer, cancel)
}()
Final wait and output
The function ends with a select over three cases; hitting any of them ends the whole function:
- The context's Done channel fires: when the passed-in context is canceled or times out, the whole function stops running.
- Data arrives on panicChan: when any goroutine inside the function hits an unrecoverable panic, it is forwarded through panicChan.
- Data arrives on output: at the end of the reduce step, the object to be returned is put into this channel, which also means the whole function is finished.
select {
case <-options.ctx.Done():
    cancel(context.DeadlineExceeded)
    return nil, context.DeadlineExceeded
case v := <-panicChan.channel:
    panic(v)
case v, ok := <-output:
    if err := retErr.Load(); err != nil {
        return nil, err
    } else if ok {
        return v, nil
    } else {
        return nil, ErrReduceNoOutput
    }
}
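One practical consequence of the last case, visible in the code above: if the reducer returns without ever calling writer.Write, output is closed with no value and MapReduce returns ErrReduceNoOutput. So a reducer like the one in the earlier example should always end with a final writer.Write, even when the aggregate is empty.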
Summary
We can see from the flow chart and the step-by-step analysis above that this MapReduce package actually does a lot of checking and fault tolerance, and all of the complex interaction between goroutines is realized through channels. Implementing this for our own business logic from scratch is fairly complicated, whereas directly using go-zero's MapReduce package is much easier and simpler to understand.
Implement the simplified version yourself
We know that other people's code is relatively easy to understand as long as you debug it a few times, but to really understand it you have to implement it yourself, even if only a simplified version. The following is my simple implementation after reading the source code. It may be rough and imperfect; if you have any questions, feel free to discuss.
Implementation code:
package main

import (
    "context"
    "fmt"
    "sync"
    "sync/atomic"
)
// The first function feeds the items to be processed into a channel: GenerateFunc
// The second function processes a single item: MapperFunc
// The third function merges the result set: ReducerFunc
// A brief outline of the MapReduce logic:
// 1. Put the data to process into an unbuffered channel; another goroutine reads items from
//    that channel and starts a goroutine per item to process it with the supplied function.
// 2. Create an unbuffered channel to hold the results; the merging goroutine reads from it,
//    merges the data, and writes the result into the output channel.
// 3. Create a channel used to stop the other goroutines; if anything goes wrong during
//    execution, close it so the other goroutines stop, and return failure.
// 4. Finally, close whichever channels are still open.
type (
    GenerateFunc func(source chan<- interface{})
    MapperFunc   func(item interface{}, write Write)
    ReducerFunc  func(pipe chan interface{}, write Write)
)

type Write interface {
    Write(val interface{})
}

type Writer struct {
    Ch chan<- interface{}
}

func (w *Writer) Write(val interface{}) {
    w.Ch <- val
}

// drain empties a channel so that blocked senders can finish.
func drain(channel <-chan interface{}) {
    for range channel {
    }
}
func executeMappers(threadNum int, ctx context.Context, ch chan interface{}, doneCh chan interface{}, fn MapperFunc, wr Write) {
    var failed int32
    pool := make(chan interface{}, threadNum)
    wg := sync.WaitGroup{}
    defer func() {
        wg.Wait()
        drain(ch) // unblock the generator if we return early
    }()
    for atomic.LoadInt32(&failed) == 0 {
        select {
        case <-ctx.Done():
            return
        case <-doneCh:
            return
        case pool <- 1:
            item, ok := <-ch
            if !ok {
                <-pool
                return
            }
            wg.Add(1)
            go func() {
                // run fn in the goroutine body, not inside the defer,
                // so that a panic in fn is actually caught by recover
                defer func() {
                    if r := recover(); r != nil {
                        atomic.AddInt32(&failed, 1)
                    }
                    wg.Done()
                    <-pool
                }()
                fn(item, wr)
            }()
        }
    }
}
func MapReduce(generateFunc GenerateFunc, mapperFunc MapperFunc, reducerFunc ReducerFunc, ctx context.Context) (interface{}, error) {
    sourceChannel := make(chan interface{})
    pipeChannel := make(chan interface{})
    write := &Writer{Ch: pipeChannel}
    // source: feed the data
    go func() {
        defer func() {
            if r := recover(); r != nil {
                fmt.Println("panic")
            }
            close(sourceChannel)
        }()
        generateFunc(sourceChannel)
    }()
    mapperChannel := make(chan interface{})
    mapperWrite := &Writer{Ch: mapperChannel}
    doneCh := make(chan interface{})
    // every close goes through a sync.Once so no channel is ever closed twice
    var mapperOnce, reducerOnce, doneOnce sync.Once
    closeChannels := func() {
        doneOnce.Do(func() { close(doneCh) })
        mapperOnce.Do(func() { close(mapperChannel) })
        reducerOnce.Do(func() { close(pipeChannel) })
    }
    // execute: process the data concurrently
    go func() {
        // when all mappers are done, close mapperChannel so the reducer's range ends
        defer mapperOnce.Do(func() { close(mapperChannel) })
        executeMappers(10, ctx, sourceChannel, doneCh, mapperFunc, mapperWrite)
    }()
    // merge: collect the results
    go func() {
        defer func() {
            recover() // the final Write may panic if pipeChannel was closed on cancellation
            drain(mapperChannel)
            closeChannels()
        }()
        reducerFunc(mapperChannel, write)
    }()
    // context-watching goroutine: stop everything on cancellation
    go func() {
        <-ctx.Done()
        fmt.Println("fun done****************************")
        closeChannels()
    }()
    select {
    case <-ctx.Done():
        fmt.Println("finish done")
        return nil, ctx.Err()
    case v, ok := <-pipeChannel:
        fmt.Println("finish resp:", v, "**ok:", ok)
        if !ok {
            return nil, nil
        }
        return v, nil
    }
}
Call method:
package main

import (
    "context"
    "fmt"
    "strconv"
    "time"
)

func main() {
    ctx, cl := context.WithCancel(context.Background())
    defer cl()
    var list []string
    for i := 0; i < 50; i++ {
        list = append(list, strconv.Itoa(i))
    }
    // logic that feeds the data
    a := func(source chan<- interface{}) {
        for _, v := range list {
            source <- v
            fmt.Println("source:", v)
        }
    }
    // logic that processes each item
    b := func(item interface{}, write Write) {
        tmp := item.(string) + ":1"
        fmt.Println("tmp:", tmp)
        time.Sleep(time.Second)
        write.Write(tmp)
    }
    // logic that merges the results
    c := func(pipe chan interface{}, write Write) {
        for v := range pipe {
            fmt.Println("reducerFunc:", v)
        }
        write.Write("finish")
    }
    // uncomment to test cancellation:
    //go func() {
    //    time.Sleep(time.Second * 5)
    //    cl()
    //}()
    MapReduce(a, b, c, ctx)
}