MapReduce
MapReduce is a distributed parallel programming model. When a function or interface call involves heavy computation or a large number of calls to third-party interfaces, MapReduce can be applied as a pattern: the work is processed in parallel on one or more machines, and the results are finally aggregated together and output.
MapReduce in go-zero
go-zero is a Go microservice framework that has become popular recently, and it also contains some interesting and practical libraries that can be used on their own, such as mr (MapReduce).
This is what the official documentation says: in real business scenarios we often need to fetch attributes from different rpc services to assemble a complex object. If the calls are made serially, the response time grows linearly with the number of rpc calls, so we usually turn serial into parallel to optimize performance.
Implementing a parallel pattern like this ourselves is fairly troublesome, while go-zero's MapReduce lets us achieve the same effect very easily, so we can focus on our own business logic.
Simple to use
One thing worth mentioning here: go-zero also provides control over the number of worker goroutines, so you can cap how many run in parallel and avoid exhausting the server with too many goroutines. At the same time, we can pass in our own context to control the timeout and cancellation logic of the whole call. All of this is encapsulated in the library; we don't need to worry about it, just pass the right arguments.
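As a quick illustration, both knobs are passed as options to mr.MapReduce. This is just a fragment of the call used in the full example below; mr.WithWorkers is go-zero's option for capping parallelism (the default is 16 workers in the version I read):
res, err := mr.MapReduce(generateFunc, mapFunc, reducerFunc,
    mr.WithWorkers(5),   // at most 5 goroutines run mapFunc at the same time
    mr.WithContext(ctx), // our own context drives timeout and cancellation
)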
The following is a fairly simple example: it reads a slice in parallel, appends ":1" to each element, and finally outputs the transformed slice.
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/zeromicro/go-zero/core/mr"
)

func main() {
    // the data to process
    uid := []string{"a", "b", "c", "d", "e", "f"}
    // logic that feeds the data into the pipeline
    generateFunc := func(source chan<- interface{}) {
        for _, v := range uid {
            source <- v
            fmt.Println("source:", v)
        }
    }
    // logic that processes a single item
    mapFunc := func(item interface{}, writer mr.Writer, cancel func(err error)) {
        tmp := item.(string) + ":1"
        writer.Write(tmp)
        fmt.Println("item:", item)
    }
    // logic that merges the results
    reducerFunc := func(pipe <-chan interface{}, writer mr.Writer, cancel func(err error)) {
        var uid []string
        for v := range pipe {
            uid = append(uid, v.(string))
            fmt.Println("pipe:", uid)
        }
        writer.Write(uid)
    }
    // start processing the data concurrently, with a 3s timeout
    ctx, cl := context.WithTimeout(context.Background(), time.Second*3)
    defer cl()
    // a goroutine that calls cl after 2s to cancel all worker goroutines;
    // it has to start before the blocking MapReduce call to have any effect
    go func() {
        time.Sleep(time.Second * 2)
        fmt.Println("cl")
        cl()
    }()
    res, err := mr.MapReduce(generateFunc, mapFunc, reducerFunc, mr.WithContext(ctx))
    fmt.Println(res, err)
}
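Since all six items finish almost instantly, the call returns well before the 2-second cancel fires. The last line printed should look something like this (the order of elements varies between runs, because the map phase is parallel):
[b:1 a:1 c:1 d:1 e:1 f:1] <nil>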
Source code analysis
First, let's look at the flow chart of the whole function. This flow chart reflects my own understanding; if anything is wrong, please point it out in the comments.
We can see that a large number of channels are used throughout, and the whole complex flow is coordinated through these channels. Let's look at the three main steps.
Get the data to process
Here we see the generate function, defined as follows. Its parameter is a channel, and inside this function we feed in the source data, i.e. the data we are going to process. In the example above, a for loop sends each element of the slice into the channel, and the subsequent logic consumes it from there. We can also do some simple preprocessing of the data at this stage.
This is a typical higher-order function: a function is passed as a parameter to another function.
type GenerateFunc func(source chan<- interface{})

func buildSource(generate GenerateFunc, panicChan *onceChan) chan interface{} {
    source := make(chan interface{})
    go func() {
        defer func() {
            if r := recover(); r != nil {
                panicChan.write(r)
            }
            close(source)
        }()
        generate(source)
    }()
    return source
}
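To see the higher-order pattern in isolation, here is a minimal self-contained sketch of the same idea; the names feed and gen are mine, not go-zero's. The producer logic is handed in as a function that receives a write-only channel, while the helper owns the goroutine and the close:
package main

import "fmt"

// feed runs gen in its own goroutine and returns a channel of the produced values.
// Closing the channel is owned by feed, so consumers' range loops always terminate.
func feed(gen func(chan<- int)) <-chan int {
    ch := make(chan int)
    go func() {
        defer close(ch)
        gen(ch)
    }()
    return ch
}

func main() {
    src := feed(func(out chan<- int) {
        for i := 0; i < 3; i++ {
            out <- i
        }
    })
    for v := range src {
        fmt.Println("got:", v)
    }
}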
Data processing
What MapReduce ultimately needs is the parallel processing logic, and the following is its core.
The number of goroutines is controlled through a pool channel; an atomic counter records whether any goroutine has failed, and a WaitGroup waits for all in-flight goroutines to finish.
What we need to pay attention to here is the heavy use of channels for control, and the logic inside the deferred functions.
In the select of the control loop, if ctx is done or doneChan has fired, the whole function stops. While a worker slot is still free, a token is first placed into the pool channel, which is equivalent to occupying one goroutine, and wg.Add(1) counts it. In the deferred function we call wg.Done() and take one token back out of the pool, so that subsequent items have a free slot in which to be processed. A standalone sketch of this pool idiom follows the go-zero code below.
// the call site
go executeMappers(mapperContext{
    ctx: options.ctx,
    mapper: func(item interface{}, w Writer) {
        mapper(item, w, cancel)
    },
    source:    source,
    panicChan: panicChan,
    collector: collector,
    doneChan:  done,
    workers:   options.workers,
})
// the actual worker loop
func executeMappers(mCtx mapperContext) {
    var wg sync.WaitGroup
    defer func() {
        wg.Wait()
        close(mCtx.collector)
        drain(mCtx.source)
    }()

    var failed int32
    pool := make(chan lang.PlaceholderType, mCtx.workers)
    writer := newGuardedWriter(mCtx.ctx, mCtx.collector, mCtx.doneChan)
    for atomic.LoadInt32(&failed) == 0 {
        select {
        case <-mCtx.ctx.Done():
            return
        case <-mCtx.doneChan:
            return
        case pool <- lang.Placeholder:
            item, ok := <-mCtx.source
            if !ok {
                <-pool
                return
            }

            wg.Add(1)
            go func() {
                defer func() {
                    if r := recover(); r != nil {
                        atomic.AddInt32(&failed, 1)
                        mCtx.panicChan.write(r)
                    }
                    wg.Done()
                    <-pool
                }()

                mCtx.mapper(item, writer)
            }()
        }
    }
}
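To make the pool idiom easier to see, here is a minimal self-contained sketch with go-zero's extra machinery (panic channel, guarded writer, atomic failure flag) stripped away; the names jobs, pool and workers are mine:
package main

import (
    "fmt"
    "sync"
)

func main() {
    jobs := make(chan int)
    go func() {
        defer close(jobs)
        for i := 0; i < 10; i++ {
            jobs <- i
        }
    }()

    const workers = 3
    pool := make(chan struct{}, workers) // at most 3 tokens, so at most 3 goroutines
    var wg sync.WaitGroup
    for {
        pool <- struct{}{} // occupy a slot; blocks while all 3 are taken
        job, ok := <-jobs
        if !ok {
            <-pool // nothing left to process, give the slot back
            break
        }
        wg.Add(1)
        go func(j int) {
            defer func() {
                wg.Done()
                <-pool // release the slot so the next job can start
            }()
            fmt.Println("worker processing", j)
        }(job)
    }
    wg.Wait()
}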
Merging the data
While the executeMappers function keeps writing data into the collector channel, the reducer function consumes and processes it, and at the end the aggregated data is sent to the output channel through writer.Write. At that point the whole function is done.
go func() {
    defer func() {
        drain(collector)
        if r := recover(); r != nil {
            panicChan.write(r)
        }
        finish()
    }()

    reducer(collector, writer, cancel)
}()
Final wait and output
The function ends with a select over three cases; hitting any of them ends the whole function:
- The context's Done channel fires: when the passed-in context is canceled or times out, the whole function stops running.
- Data arrives on panicChan: when any goroutine inside the function hits an unrecoverable panic, it is forwarded through panicChan.
- Data arrives on output: at the end of the reduce step, the object to be returned is put into this channel, which also means the whole function is finished.
select {
case <-options.ctx.Done():
    cancel(context.DeadlineExceeded)
    return nil, context.DeadlineExceeded
case v := <-panicChan.channel:
    panic(v)
case v, ok := <-output:
    if err := retErr.Load(); err != nil {
        return nil, err
    } else if ok {
        return v, nil
    } else {
        return nil, ErrReduceNoOutput
    }
}
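One practical consequence of the last case, visible in the code above: if the reducer returns without ever calling writer.Write, output is closed with no value and MapReduce returns ErrReduceNoOutput. So a reducer like the one in the earlier example should always end with a final writer.Write, even when the aggregate is empty.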
Summary
We can see from the flow chart and the step-by-step analysis above that this MapReduce package actually does a lot of checking and fault tolerance, and all of the complex interaction between goroutines is realized through channels. Implementing this for our own business logic from scratch is fairly complicated, whereas directly using go-zero's MapReduce package is much easier and simpler to understand.
Implement the simplified version yourself
We know that other people's code is relatively easy to understand as long as you debug it a few times, but to really understand it you have to implement it yourself, even if only a simplified version. The following is my simple implementation after reading the source code. It may be rough and imperfect; if you have any questions, feel free to discuss.
Implementation code:
package main

import (
    "context"
    "fmt"
    "sync"
    "sync/atomic"
)
// The first function feeds the items to be processed into a channel: GenerateFunc
// The second function processes a single item: MapperFunc
// The third function merges the result set: ReducerFunc
// A brief outline of the MapReduce logic:
// 1. Put the data to process into an unbuffered channel; another goroutine reads items from
//    that channel and starts a goroutine per item to process it with the supplied function.
// 2. Create an unbuffered channel to hold the results; the merging goroutine reads from it,
//    merges the data, and writes the result into the output channel.
// 3. Create a channel used to stop the other goroutines; if anything goes wrong during
//    execution, close it so the other goroutines stop, and return failure.
// 4. Finally, close whichever channels are still open.
type (
    GenerateFunc func(source chan<- interface{})
    MapperFunc   func(item interface{}, write Write)
    ReducerFunc  func(pipe chan interface{}, write Write)
)

type Write interface {
    Write(val interface{})
}

type Writer struct {
    Ch chan<- interface{}
}

func (w *Writer) Write(val interface{}) {
    w.Ch <- val
}

// drain empties a channel so that blocked senders can finish.
func drain(channel <-chan interface{}) {
    for range channel {
    }
}
func executeMappers(threadNum int, ctx context.Context, ch chan interface{}, doneCh chan interface{}, fn MapperFunc, wr Write) {
    var failed int32
    pool := make(chan interface{}, threadNum)
    wg := sync.WaitGroup{}
    defer func() {
        wg.Wait()
        drain(ch) // unblock the generator if we return early
    }()
    for atomic.LoadInt32(&failed) == 0 {
        select {
        case <-ctx.Done():
            return
        case <-doneCh:
            return
        case pool <- 1:
            item, ok := <-ch
            if !ok {
                <-pool
                return
            }
            wg.Add(1)
            go func() {
                // run fn in the goroutine body, not inside the defer,
                // so that a panic in fn is actually caught by recover
                defer func() {
                    if r := recover(); r != nil {
                        atomic.AddInt32(&failed, 1)
                    }
                    wg.Done()
                    <-pool
                }()
                fn(item, wr)
            }()
        }
    }
}
func MapReduce(generateFunc GenerateFunc, mapperFunc MapperFunc, reducerFunc ReducerFunc, ctx context.Context) (interface{}, error) {
    sourceChannel := make(chan interface{})
    pipeChannel := make(chan interface{})
    write := &Writer{Ch: pipeChannel}
    // source: feed the data
    go func() {
        defer func() {
            if r := recover(); r != nil {
                fmt.Println("panic")
            }
            close(sourceChannel)
        }()
        generateFunc(sourceChannel)
    }()
    mapperChannel := make(chan interface{})
    mapperWrite := &Writer{Ch: mapperChannel}
    doneCh := make(chan interface{})
    // every close goes through a sync.Once so no channel is ever closed twice
    var mapperOnce, reducerOnce, doneOnce sync.Once
    closeChannels := func() {
        doneOnce.Do(func() { close(doneCh) })
        mapperOnce.Do(func() { close(mapperChannel) })
        reducerOnce.Do(func() { close(pipeChannel) })
    }
    // execute: process the data concurrently
    go func() {
        // when all mappers are done, close mapperChannel so the reducer's range ends
        defer mapperOnce.Do(func() { close(mapperChannel) })
        executeMappers(10, ctx, sourceChannel, doneCh, mapperFunc, mapperWrite)
    }()
    // merge: collect the results
    go func() {
        defer func() {
            recover() // the final Write may panic if pipeChannel was closed on cancellation
            drain(mapperChannel)
            closeChannels()
        }()
        reducerFunc(mapperChannel, write)
    }()
    // context-watching goroutine: stop everything on cancellation
    go func() {
        <-ctx.Done()
        fmt.Println("fun done****************************")
        closeChannels()
    }()
    select {
    case <-ctx.Done():
        fmt.Println("finish done")
        return nil, ctx.Err()
    case v, ok := <-pipeChannel:
        fmt.Println("finish resp:", v, "**ok:", ok)
        if !ok {
            return nil, nil
        }
        return v, nil
    }
}
Call method:
package main

import (
    "context"
    "fmt"
    "strconv"
    "time"
)

func main() {
    ctx, cl := context.WithCancel(context.Background())
    defer cl()
    var list []string
    for i := 0; i < 50; i++ {
        list = append(list, strconv.Itoa(i))
    }
    // logic that feeds the data
    a := func(source chan<- interface{}) {
        for _, v := range list {
            source <- v
            fmt.Println("source:", v)
        }
    }
    // logic that processes each item
    b := func(item interface{}, write Write) {
        tmp := item.(string) + ":1"
        fmt.Println("tmp:", tmp)
        time.Sleep(time.Second)
        write.Write(tmp)
    }
    // logic that merges the results
    c := func(pipe chan interface{}, write Write) {
        for v := range pipe {
            fmt.Println("reducerFunc:", v)
        }
        write.Write("finish")
    }
    // uncomment to test cancellation:
    //go func() {
    //    time.Sleep(time.Second * 5)
    //    cl()
    //}()
    MapReduce(a, b, c, ctx)
}