This article explains the principle and implementation of adaptive fusing

Why do I need to fuse

In a microservice cluster, each application typically depends on a number of external services. Slow network connections, timeouts, overloaded dependencies, and unavailable services can occur at any time. Under high concurrency, if the caller does nothing and keeps sending requests to a faulty service, it can easily trigger an avalanche across the entire microservice cluster.
For example, user order services in high-concurrency scenarios generally need to rely on the following services:

Commodity Service
Account service
Inventory Service

If the account service is overloaded, the order service keeps sending requests to it and can only wait passively for the account service to return an error or for the request to time out, so order requests pile up. These doomed requests still occupy system resources such as CPU, memory, and data connections, which eventually makes the whole order service unavailable. Even after the account service recovers, the order service cannot recover by itself.

If an active protection mechanism handles this scenario, the order service can at least keep itself running, and once the account service recovers, the order service recovers along with it. In service governance, this self-protection mechanism is called a fuse mechanism (also known as a circuit breaker).

Fuse

Fusing is a mechanism by which the caller protects itself (as a side effect it also protects the callee); what gets fused is an external service.

Downgrade

Downgrade is a self-protection mechanism of the called party (the service provider) that prevents overload caused by its own insufficient resources; what gets downgraded is the service itself.

The word fuse comes from the electrical fuse found in everyday circuits: when the load is too high (the current is too large), the fuse blows to keep the circuit from burning out. Many technical concepts are abstracted from everyday life in this way.

working principle

The fuse generally has three states:

Closed: The default state. Requests reach the target service, and the numbers of successes and failures within the window time are counted. If the error rate threshold is reached, the fuse enters the disconnected (open) state.
Disconnected (open): Requests fail immediately with an error. If a fallback is configured, the fallback method is called instead.
Semi-disconnected (half-open): The disconnected state lasts for a timeout period. When the timeout expires, the fuse enters the semi-disconnected state and lets a portion of requests through, counting how many succeed. If those requests succeed, the target service is considered recovered and the fuse returns to the closed state; otherwise it goes back to the disconnected state. The semi-disconnected state makes self-repair possible and keeps a recovering service from being overwhelmed again.
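
The three states can be sketched as a small state machine. This is only an illustration under simplifying assumptions: `SimpleFuse`, its fields, and `runScenario` are hypothetical names, the counting window is replaced by plain counters, every request is let through while half-open, and the clock is injected so the scenario is deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// State is one of the three fuse states described above.
type State int

const (
	Closed   State = iota
	Open           // "disconnected"
	HalfOpen       // "semi-disconnected"
)

var ErrOpen = errors.New("fuse is open")

// SimpleFuse is a hypothetical, single-goroutine illustration.
type SimpleFuse struct {
	state           State
	failures, total int
	openedAt        time.Duration
	threshold       float64              // error ratio that trips the fuse
	minRequests     int                  // do not judge on fewer requests than this
	openTimeout     time.Duration        // how long to stay open before probing
	now             func() time.Duration // injected clock (elapsed time)
}

// Allow reports whether a request may be sent right now.
func (f *SimpleFuse) Allow() error {
	if f.state == Open {
		if f.now()-f.openedAt < f.openTimeout {
			return ErrOpen
		}
		f.state = HalfOpen // timeout elapsed: let a probe request through
	}
	return nil
}

// Report feeds the result of an allowed request back into the fuse.
func (f *SimpleFuse) Report(success bool) {
	switch f.state {
	case HalfOpen:
		if success {
			f.state, f.failures, f.total = Closed, 0, 0
		} else {
			f.trip()
		}
	case Closed:
		f.total++
		if !success {
			f.failures++
		}
		if f.total >= f.minRequests && float64(f.failures)/float64(f.total) >= f.threshold {
			f.trip()
		}
	}
}

// trip moves the fuse to the open state and clears the counters.
func (f *SimpleFuse) trip() {
	f.state, f.openedAt, f.failures, f.total = Open, f.now(), 0, 0
}

// runScenario fails four requests, waits past the timeout, then probes successfully.
func runScenario() [3]State {
	var clock time.Duration
	f := &SimpleFuse{threshold: 0.5, minRequests: 4, openTimeout: time.Second,
		now: func() time.Duration { return clock }}
	var seen [3]State
	for i := 0; i < 4; i++ {
		if f.Allow() == nil {
			f.Report(false)
		}
	}
	seen[0] = f.state // Open
	clock += 2 * time.Second
	_ = f.Allow()
	seen[1] = f.state // HalfOpen
	f.Report(true)
	seen[2] = f.state // Closed
	return seen
}

func main() {
	fmt.Println(runScenario()) // [1 2 0]
}
```

Note how many knobs even this toy version needs (`threshold`, `minRequests`, `openTimeout`); a production fuse adds a window size and a probe quota on top.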

Commonly used fuse components:

hystrix circuit breaker (no longer maintained)
hystrix-go
resilience4j (recommended)
sentinel (recommended)

What is adaptive fusing

Based on the above-mentioned fuse principle, we usually need to prepare the following parameters to use the fuse in the project:

Error ratio threshold: When this threshold is reached, it enters the disconnected state.
Disconnected state timeout time: enter the semi-disconnected state after timeout.
The number of requests allowed in the semi-disconnected state.
Window time size.

In fact, there are a lot of optional configuration parameters, refer to https://resilience4j.readme.io/docs/circuitbreaker

For developers without much experience, it is hard to know what values are appropriate for these parameters.

So is there an adaptive fusing algorithm that frees us from tuning these parameters, so that a simple configuration covers most scenarios?

In fact, there is. Google SRE provides an adaptive fusing algorithm that calculates the probability of discarding a request:

P = max(0, (requests - K * accepts) / (requests + 1))

Algorithm parameters:

requests: the total number of requests within the window time
accepts: the number of requests that completed normally
K: Sensitivity. The smaller K is, the more readily requests are dropped. A value between 1.5 and 2 is generally recommended.

Algorithm explanation:

Under normal circumstances, requests == accepts, so the probability is 0.
As the number of normal requests decreases, once requests exceeds K * accepts the probability P becomes greater than 0, and a share of requests is discarded according to that probability. The more serious the failure, the more requests are dropped. If accepts == 0 within the window time, the fuse is almost completely blown.
When the application gradually returns to normal, accepts and requests increase together, but K * accepts grows faster than requests, so the probability soon returns to 0 and the fuse closes.
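
To make the three cases concrete, here is a minimal sketch that evaluates the drop probability max(0, (requests - K * accepts) / (requests + 1)) for a healthy, a degraded, and a fully failed dependency. The function name `dropProbability` and the sample numbers are chosen for illustration only.

```go
package main

import (
	"fmt"
	"math"
)

// dropProbability returns max(0, (requests - k*accepts) / (requests + 1)).
func dropProbability(requests, accepts int64, k float64) float64 {
	return math.Max(0, (float64(requests)-k*float64(accepts))/float64(requests+1))
}

func main() {
	const k = 1.5
	fmt.Printf("healthy:  %.3f\n", dropProbability(100, 100, k)) // 0.000
	fmt.Printf("degraded: %.3f\n", dropProbability(100, 40, k))  // 0.396
	fmt.Printf("down:     %.3f\n", dropProbability(100, 0, k))   // 0.990
}
```

With K = 1.5, requests only start to be dropped once the success ratio falls below 1/K, roughly 67%. A larger K tolerates more failures before the fuse reacts.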

Code

Next, let's think about how to implement a fuse.

The preliminary ideas are:

Any fuse has to switch states based on statistics, and those statistics normally cover only a recent period of time (older data has little reference value and wastes space), so a sliding time window is usually used to store them. The state of the fuse itself should also be exposed through metrics to make it observable. Observability is the first step in building any system; without it the system is a black box.
The results of external requests vary widely, so a custom judgment function is needed to decide whether a request succeeded. The decision may depend on http.code, rpc.code, or body.code, and the fuse has to collect this data in real time.
When an external service is fused, users often need custom fast-failure logic, so a custom fallback() function should be supported.

Let's analyze the source code implementation of go-zero step by step:

core/breaker/breaker.go

Fuse interface definition

As the saying goes, provisions go before the troops move. Once the requirements are clear, we can start planning and defining the interface. Defining the interface is the first and most important step in turning our design thinking into an abstraction.

The core definition contains two types of methods:

Allow(): The request result must be reported back to the fuse manually, which is like a manual transmission.

DoXXX(): The request result is reported back to the fuse automatically, which is like an automatic transmission. In fact, every DoXXX() method is ultimately equivalent to calling
`DoWithFallbackAcceptable(req func() error, fallback func(err error) error, acceptable Acceptable) error`

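    type (
        // Acceptable is a custom check that decides whether a call result counts as a success.
        Acceptable func(err error) bool

        // Promise is used to report the result manually.
        Promise interface {
            // Accept tells the Breaker that the call is successful.
            Accept()
            // Reject tells the Breaker that the call is failed.
            Reject(reason string)
        }

        Breaker interface {
            // Name returns the name of the fuse.
            Name() string

            // Allow checks whether the request may proceed. The caller must report
            // the execution result manually through the returned Promise.
            // Suited to simple scenarios that need neither a custom fast failure
            // nor a custom result check. This is the manual transmission.
            Allow() (Promise, error)

            // Do runs req and reports the execution result automatically.
            // This is the automatic transmission.
            Do(req func() error) error

            // DoWithAcceptable runs req.
            // acceptable - custom check of the execution result
            DoWithAcceptable(req func() error, acceptable Acceptable) error

            // DoWithFallback runs req.
            // fallback - custom fast failure
            DoWithFallback(req func() error, fallback func(err error) error) error

            // DoWithFallbackAcceptable runs req.
            // fallback - custom fast failure
            // acceptable - custom check of the execution result
            DoWithFallbackAcceptable(req func() error, fallback func(err error) error, acceptable Acceptable) error
        }
    )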
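
The difference between the two calling styles is easiest to see side by side. The sketch below is self-contained and does not import go-zero: `countingBreaker` is a hypothetical stand-in that never trips and only counts what is reported to it, and `call`, `errNotFound`, and `errTimeout` are made up for the example. Only the shapes of `Allow`, `Promise`, `Acceptable`, and `DoWithAcceptable` follow the interface above.

```go
package main

import (
	"errors"
	"fmt"
)

type Acceptable func(err error) bool

type Promise interface {
	Accept()
	Reject(reason string)
}

// countingBreaker is a hypothetical stand-in: it never trips,
// it only counts the results reported to it.
type countingBreaker struct{ accepted, rejected int }

type countingPromise struct{ b *countingBreaker }

func (p countingPromise) Accept()              { p.b.accepted++ }
func (p countingPromise) Reject(reason string) { p.b.rejected++ }

func (b *countingBreaker) Allow() (Promise, error) { return countingPromise{b}, nil }

func (b *countingBreaker) DoWithAcceptable(req func() error, acceptable Acceptable) error {
	err := req()
	if acceptable(err) {
		b.accepted++
	} else {
		b.rejected++
	}
	return err
}

var (
	errNotFound = errors.New("not found")
	errTimeout  = errors.New("timeout")
)

// call simulates a request to an external service that times out.
func call() error { return errTimeout }

func demo() (accepted, rejected int) {
	b := &countingBreaker{}

	// Manual style: ask for permission, run the request, report the result yourself.
	if p, err := b.Allow(); err == nil {
		if callErr := call(); callErr != nil {
			p.Reject(callErr.Error())
		} else {
			p.Accept()
		}
	}

	// Automatic style: the result is reported for you. The custom check
	// treats a business error such as "not found" as a healthy response.
	acceptable := func(err error) bool { return err == nil || errors.Is(err, errNotFound) }
	_ = b.DoWithAcceptable(func() error { return errNotFound }, acceptable)
	_ = b.DoWithAcceptable(func() error { return errTimeout }, acceptable)

	return b.accepted, b.rejected
}

func main() {
	fmt.Println(demo()) // 1 2
}
```

The custom check matters in practice: without it, every "not found" response would count against the callee even though the service itself is healthy.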

Fuse realization

circuitBreaker embeds throttle (Go has no inheritance), which effectively makes it a static proxy. The proxy pattern adds behavior without changing the original object. As we will see later, go-zero does this in order to collect fuse error data, that is, to achieve observability.

The static proxy pattern used by the fuse implementation can look a little roundabout at first.

// Fuse struct
type circuitBreaker struct {
    name string
    // circuitBreaker delegates all fusing work to the embedded throttle.
    throttle
}

// Fuse interface
type throttle interface {
    // Fusing method
    allow() (Promise, error)
    // Fusing method; every DoXXX() method eventually calls it.
    doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error
}

func (cb *circuitBreaker) Allow() (Promise, error) {
    return cb.throttle.allow()
}

func (cb *circuitBreaker) Do(req func() error) error {
    return cb.throttle.doReq(req, nil, defaultAcceptable)
}

func (cb *circuitBreaker) DoWithAcceptable(req func() error, acceptable Acceptable) error {
    return cb.throttle.doReq(req, nil, acceptable)
}

func (cb *circuitBreaker) DoWithFallback(req func() error, fallback func(err error) error) error {
    return cb.throttle.doReq(req, fallback, defaultAcceptable)
}

func (cb *circuitBreaker) DoWithFallbackAcceptable(req func() error, fallback func(err error) error,
    acceptable Acceptable) error {
    return cb.throttle.doReq(req, fallback, acceptable)
}

The implementation of the throttle interface:

loggedThrottle adds a rolling window of error logs, so that the most recent errors can be reported when requests fail.

// Fuse with logging
type loggedThrottle struct {
    // Name
    name string
    // The proxied object
    internalThrottle
    // Rolling window that keeps the most recent error reasons, essentially a circular array
    errWin *errorWindow
}

// Fusing method
func (lt loggedThrottle) allow() (Promise, error) {
    promise, err := lt.internalThrottle.allow()
    return promiseWithReason{
        promise: promise,
        errWin:  lt.errWin,
    }, lt.logError(err)
}

// Fusing method
func (lt loggedThrottle) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error {
    return lt.logError(lt.internalThrottle.doReq(req, fallback, func(err error) bool {
        accept := acceptable(err)
        if !accept {
            lt.errWin.add(err.Error())
        }
        return accept
    }))
}

func (lt loggedThrottle) logError(err error) error {
    if err == ErrServiceUnavailable {
        // if circuit open, not possible to have empty error window
        stat.Report(fmt.Sprintf(
            "proc(%s/%d), callee: %s, breaker is open and requests dropped\nlast errors:\n%s",
            proc.ProcessName(), proc.Pid(), lt.name, lt.errWin))
    }

    return err
}
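
Stripped of the fuse details, the proxy-by-embedding idea used in both layers looks like the following sketch. All names here (`doer`, `plainDoer`, `recordingDoer`) are hypothetical; the point is only that the wrapper embeds the interface, overrides one method, and adds error recording without touching the wrapped implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// doer is a hypothetical one-method stand-in for the throttle interface.
type doer interface {
	do(req func() error) error
}

// plainDoer just runs the request.
type plainDoer struct{}

func (plainDoer) do(req func() error) error { return req() }

// recordingDoer embeds a doer and overrides do to record error messages.
type recordingDoer struct {
	doer
	errs *[]string
}

func (r recordingDoer) do(req func() error) error {
	err := r.doer.do(req) // delegate to the wrapped implementation
	if err != nil {
		*r.errs = append(*r.errs, err.Error())
	}
	return err
}

func demo() []string {
	var errs []string
	var d doer = recordingDoer{doer: plainDoer{}, errs: &errs}
	_ = d.do(func() error { return nil })
	_ = d.do(func() error { return errors.New("timeout") })
	return errs
}

func main() {
	fmt.Println(demo()) // [timeout]
}
```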

Error log collection errorWindow

errorWindow is a circular array. New entries continuously overwrite the oldest ones, and the wrap-around is achieved with the modulo operation.

// Rolling window
type errorWindow struct {
    reasons [numHistoryReasons]string
    index   int
    count   int
    lock    sync.Mutex
}

// Add an entry
func (ew *errorWindow) add(reason string) {
    ew.lock.Lock()
    // Store the error log entry
    ew.reasons[ew.index] = fmt.Sprintf("%s %s", timex.Time().Format(timeFormat), reason)
    // Advance index to prepare for the next write;
    // the modulo operation is what makes the window roll over
    ew.index = (ew.index + 1) % numHistoryReasons
    // Track the number of stored entries, capped at the capacity
    ew.count = mathx.MinInt(ew.count+1, numHistoryReasons)
    ew.lock.Unlock()
}

// Format the error log
func (ew *errorWindow) String() string {
    var reasons []string

    ew.lock.Lock()
    // reverse order
    for i := ew.index - 1; i >= ew.index-ew.count; i-- {
        reasons = append(reasons, ew.reasons[(i+numHistoryReasons)%numHistoryReasons])
    }
    ew.lock.Unlock()

    return strings.Join(reasons, "\n")
}
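
The index arithmetic is easier to follow with a tiny capacity. The following sketch reuses the same modulo logic on a hypothetical `ring` type with room for three entries, without locking or timestamps, and shows the fourth entry overwriting the first.

```go
package main

import (
	"fmt"
	"strings"
)

const capacity = 3

// ring is a hypothetical, lock-free miniature of errorWindow.
type ring struct {
	items [capacity]string
	index int // next write position
	count int // number of valid entries, capped at capacity
}

func (r *ring) add(s string) {
	r.items[r.index] = s
	r.index = (r.index + 1) % capacity // wrap around at the end of the array
	if r.count < capacity {
		r.count++
	}
}

// newestFirst walks backwards from the most recent write.
func (r *ring) newestFirst() string {
	var out []string
	for i := r.index - 1; i >= r.index-r.count; i-- {
		// i may be negative, so shift it into range before taking the remainder.
		out = append(out, r.items[(i+capacity)%capacity])
	}
	return strings.Join(out, ",")
}

func main() {
	var r ring
	for _, s := range []string{"e1", "e2", "e3", "e4"} {
		r.add(s)
	}
	fmt.Println(r.newestFirst()) // e4,e3,e2
}
```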

So far we have not seen any actual fusing logic. The real fuse operations are delegated to the internalThrottle object.

    type internalThrottle interface {
        allow() (internalPromise, error)
        doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error
    }

googleBreaker, the implementation of the internalThrottle interface: structure definition

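type googleBreaker struct {
    // Sensitivity; the default value in go-zero is 1.5
    k float64
    // Sliding window that records the total and successful request counts over a recent period
    stat *collection.RollingWindow
    // Probability generator;
    // produces random double-precision floats between 0.0 and 1.0
    proba *mathx.Proba
}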

It can be seen that the properties of the fuse are actually very simple, and the data statistics are implemented using a sliding time window.

RollingWindow sliding window

The sliding window is a fairly general-purpose data structure, often used to keep statistics on behavior over a recent period of time.

Its implementation is very interesting, especially how to simulate the window sliding process.

First look at the structure definition of the sliding window:

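    type RollingWindow struct {
        // Read-write lock
        lock sync.RWMutex
        // Number of buckets in the sliding window
        size int
        // The window itself, the data container
        win *window
        // Time span covered by a single bucket
        interval time.Duration
        // Cursor that points at the bucket currently being written
        offset int
        // Whether to skip the bucket currently being written when aggregating.
        // That bucket has not yet covered a full interval,
        // so its statistics may be inaccurate.
        ignoreCurrent bool
        // Time of the last write, aligned to a bucket boundary; used to work out
        // how many intervals have elapsed between the last write and the next one.
        lastTime time.Duration
    }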

window is where the data is actually stored. It is essentially an array that supports adding data at a given offset and clearing the bucket at a given offset.
The array is divided into buckets, one per time interval.

// Time window
type window struct {
    // Buckets; each bucket represents one time interval
    buckets []*Bucket
    // Window size
    size int
}

// Add data.
// offset - cursor locating the bucket to write to
// v - the value to record
func (w *window) add(offset int, v float64) {
    w.buckets[offset%w.size].add(v)
}

// Aggregate data.
// fn - custom per-bucket statistics function
func (w *window) reduce(start, count int, fn func(b *Bucket)) {
    for i := 0; i < count; i++ {
        fn(w.buckets[(start+i)%w.size])
    }
}

// Reset a specific bucket
func (w *window) resetBucket(offset int) {
    w.buckets[offset%w.size].reset()
}

// Bucket
type Bucket struct {
    // Sum of the values added to this bucket
    Sum float64
    // Number of add calls on this bucket
    Count int64
}

// Add a value to the bucket
func (b *Bucket) add(v float64) {
    // Accumulate the sum
    b.Sum += v
    // Increment the count
    b.Count++
}

// Reset the bucket to zero
func (b *Bucket) reset() {
    b.Sum = 0
    b.Count = 0
}

Adding data to the window:

Calculate how many time intervals have elapsed between the last write and the current time, that is, how many buckets have expired.
Clear the data in the expired buckets.
Update the offset; updating the offset is what simulates the sliding of the window.
Add the data.

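// Add data
func (rw *RollingWindow) Add(v float64) {
    rw.lock.Lock()
    defer rw.lock.Unlock()
    // Move the cursor to the bucket for the current time
    rw.updateOffset()
    // Add the data
    rw.win.add(rw.offset, v)
}

// span computes how many intervals have elapsed since the last write,
// in other words how many buckets have gone by.
func (rw *RollingWindow) span() int {
    offset := int(timex.Since(rw.lastTime) / rw.interval)
    if 0 <= offset && offset < rw.size {
        return offset
    }
    // Longer than the whole window: returning the window size is enough
    return rw.size
}

// updateOffset moves the offset to the current time,
// which is what makes the window slide.
func (rw *RollingWindow) updateOffset() {
    // span buckets' worth of time has elapsed
    span := rw.span()
    // Still within the same interval, nothing to update
    if span <= 0 {
        return
    }
    offset := rw.offset
    // No data was written during those span buckets, so whatever they
    // still hold is stale and must be cleared.
    // The modulo operation wraps the index back to the start,
    // giving the effect of a circular array, so new data can always be written.
    for i := 0; i < span; i++ {
        rw.win.resetBucket((offset + i + 1) % rw.size)
    }
    // Update the offset
    rw.offset = (offset + span) % rw.size
    now := timex.Now()
    // Update the last write time, aligned to the start of the
    // current bucket rather than set to now
    rw.lastTime = now - (now-rw.lastTime)%rw.interval
}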
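
The last line of updateOffset deserves a worked example: lastTime is not set to the current time but rounded down to the start of the current bucket, so bucket boundaries stay on a fixed grid no matter when writes happen. The sketch below reproduces only that arithmetic; `alignedLastTime`, `demo`, and the sample numbers are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// alignedLastTime mirrors the expression now - (now-lastTime)%interval.
func alignedLastTime(now, lastTime, interval time.Duration) time.Duration {
	return now - (now-lastTime)%interval
}

// demo uses 250ms buckets, a previous write at 0ms, and a new write at 620ms.
func demo() (span int, alignedMs int64) {
	interval := 250 * time.Millisecond
	lastTime := time.Duration(0)
	now := 620 * time.Millisecond

	span = int((now - lastTime) / interval) // two whole buckets have elapsed
	alignedMs = int64(alignedLastTime(now, lastTime, interval) / time.Millisecond)
	return span, alignedMs
}

func main() {
	fmt.Println(demo()) // 2 500
}
```

The write at 620ms lands in the bucket that started at 500ms, so the next bucket begins at 750ms rather than at 870ms.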

Aggregating window statistics:

// Aggregate the data
func (rw *RollingWindow) Reduce(fn func(b *Bucket)) {
    rw.lock.RLock()
    defer rw.lock.RUnlock()

    var diff int
    span := rw.span()
    // Number of buckets that have not expired as of now
    if span == 0 && rw.ignoreCurrent {
        diff = rw.size - 1
    } else {
        diff = rw.size - span
    }
    if diff > 0 {
        // The buckets between rw.offset and rw.offset+span have expired and must not be counted
        offset := (rw.offset + span + 1) % rw.size
        // Aggregate
        rw.win.reduce(offset, diff, fn)
    }
}
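
Putting Add and Reduce together, the following self-contained sketch reimplements the same sliding logic in miniature so that expiry can be observed step by step. It is an approximation for illustration: `miniWindow` and its methods are hypothetical names, locking and the ignoreCurrent option are left out, and the clock is injected instead of read from the system.

```go
package main

import (
	"fmt"
	"time"
)

type bucket struct {
	sum   float64
	count int64
}

// miniWindow is a simplified, single-goroutine sliding window.
type miniWindow struct {
	buckets  []bucket
	interval time.Duration
	offset   int                  // bucket currently being written
	lastTime time.Duration        // start of the bucket at offset
	now      func() time.Duration // injected clock (elapsed time)
}

func newMiniWindow(size int, interval time.Duration, now func() time.Duration) *miniWindow {
	return &miniWindow{buckets: make([]bucket, size), interval: interval, lastTime: now(), now: now}
}

// span returns how many buckets have elapsed since lastTime, capped at the window size.
func (w *miniWindow) span() int {
	s := int((w.now() - w.lastTime) / w.interval)
	if s >= 0 && s < len(w.buckets) {
		return s
	}
	return len(w.buckets)
}

func (w *miniWindow) add(v float64) {
	if span := w.span(); span > 0 {
		for i := 1; i <= span; i++ { // clear the buckets that were skipped over
			w.buckets[(w.offset+i)%len(w.buckets)] = bucket{}
		}
		w.offset = (w.offset + span) % len(w.buckets)
		now := w.now()
		w.lastTime = now - (now-w.lastTime)%w.interval
	}
	w.buckets[w.offset].sum += v
	w.buckets[w.offset].count++
}

// totals sums only the buckets that have not expired yet.
func (w *miniWindow) totals() (sum float64, count int64) {
	span := w.span()
	valid := len(w.buckets) - span
	start := (w.offset + span + 1) % len(w.buckets)
	for i := 0; i < valid; i++ {
		b := w.buckets[(start+i)%len(w.buckets)]
		sum += b.sum
		count += b.count
	}
	return sum, count
}

// demo records three samples and reads the sample count at three later moments.
func demo() [3]int64 {
	var clock time.Duration
	w := newMiniWindow(4, time.Second, func() time.Duration { return clock })
	var counts [3]int64
	w.add(1)
	w.add(0)
	clock = 1500 * time.Millisecond
	w.add(1)
	_, counts[0] = w.totals() // all three samples are still inside the window
	clock = 4200 * time.Millisecond
	_, counts[1] = w.totals() // the first bucket has expired
	clock = 6 * time.Second
	_, counts[2] = w.totals() // everything has expired
	return counts
}

func main() {
	fmt.Println(demo()) // [3 1 0]
}
```

Reading never mutates the buckets: expired data is simply skipped by the index arithmetic and only cleared on the next write.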

googleBreaker decides whether to fuse

Collect the statistics from the sliding window
Calculate the fusing probability

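// Decide whether to fuse based on the request data of the recent period
func (b *googleBreaker) accept() error {
    // Fetch the statistics of the recent period
    accepts, total := b.history()
    // Calculate the dynamic fusing probability
    weightedAccepts := b.k * float64(accepts)
    // https://landing.google.com/sre/sre-book/chapters/handling-overload/#eq2101
    // protection is a package-level constant: a small allowance subtracted from total
    dropRatio := math.Max(0, (float64(total-protection)-weightedAccepts)/float64(total+1))
    // Probability is 0: let the request through
    if dropRatio <= 0 {
        return nil
    }
    // Compare a random number between 0.0 and 1.0 with the probability computed above;
    // fuse if the random number is smaller than the fusing probability
    if b.proba.TrueOnProba(dropRatio) {
        return ErrServiceUnavailable
    }

    return nil
}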
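
The decision step can be tried out in isolation. In the sketch below, `shouldDrop` is a hypothetical free function that follows the same formula; the protection allowance and the random source are passed in as parameters (assumptions made so the behavior can be checked deterministically), and the value 5 used for protection is only an example.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// shouldDrop reports whether a request should be rejected.
// protection and random are parameters here only for illustration.
func shouldDrop(accepts, total int64, k float64, protection int64, random func() float64) bool {
	ratio := math.Max(0, (float64(total-protection)-k*float64(accepts))/float64(total+1))
	if ratio <= 0 {
		return false
	}
	// Drop when a uniform random number in [0, 1) falls below the drop ratio.
	return random() < ratio
}

func main() {
	dropped := 0
	for i := 0; i < 10000; i++ {
		// 40 successes out of 100 requests: drop ratio (95 - 60) / 101.
		if shouldDrop(40, 100, 1.5, 5, rand.Float64) {
			dropped++
		}
	}
	fmt.Printf("dropped %d of 10000, expected ratio about %.3f\n", dropped, 35.0/101.0)
}
```

Because the decision is probabilistic, some requests always get through while the callee is unhealthy. Those requests act as probes, which is why this algorithm needs no separate semi-disconnected state or timeout.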

googleBreaker fuse logic implementation

googleBreaker exposes two kinds of fuse methods.

For simple scenarios, allow() only decides whether the request should be fused. After the request has executed, the caller must report the result to the fuse manually.

func (b *googleBreaker) allow() (internalPromise, error)

For complex scenarios, doReq() supports a custom fast failure and a custom check of whether the request succeeded, and reports the execution result to the fuse automatically.

`func (b *googleBreaker) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error`

The Acceptable parameter is a custom function that decides whether a request counts as successful.

Acceptable func(err error) bool

// Fusing method.
// Returns a promise-style callback object; the developer decides whether to report the result to the fuse.
func (b *googleBreaker) allow() (internalPromise, error) {
    if err := b.accept(); err != nil {
        return nil, err
    }

    return googlePromise{
        b: b,
    }, nil
}

// Fusing method.
// req - the request to protect
// fallback - custom fast-failure function; it can wrap the err produced by fusing before returning it
// acceptable - custom check of the request result when the request was not fused, for example based on http.code, rpc.code, or body.code
func (b *googleBreaker) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error {
    // Decide whether to fuse
    if err := b.accept(); err != nil {
        // Fused: run the custom fallback if there is one
        if fallback != nil {
            return fallback(err)
        }

        return err
    }
    // If req() panics, still count this execution as a failure and report it to the fuse
    defer func() {
        if e := recover(); e != nil {
            b.markFailure()
            panic(e)
        }
    }()
    // Execute the request
    err := req()
    // Check whether the request succeeded
    if acceptable(err) {
        b.markSuccess()
    } else {
        b.markFailure()
    }

    return err
}

// Report a success
func (b *googleBreaker) markSuccess() {
    b.stat.Add(1)
}

// Report a failure
func (b *googleBreaker) markFailure() {
    b.stat.Add(0)
}

// Collect the statistics
func (b *googleBreaker) history() (accepts, total int64) {
    b.stat.Reduce(func(b *collection.Bucket) {
        accepts += int64(b.Sum)
        total += b.Count
    })

    return
}
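
The control flow of doReq (fallback when fused, failure accounting on panic, custom success check) can be exercised with a stripped-down stand-in. In this sketch `miniBreaker` is hypothetical: the probabilistic decision is replaced by a `tripped` flag that is flipped by hand, and plain counters replace the sliding window.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnavailable = errors.New("service unavailable")

// miniBreaker is a hypothetical stand-in: tripped replaces the
// probabilistic decision and counters replace the sliding window.
type miniBreaker struct {
	tripped          bool
	success, failure int
}

func (b *miniBreaker) doReq(req func() error, fallback func(error) error, acceptable func(error) bool) error {
	if b.tripped {
		// Fused: the request is never executed.
		if fallback != nil {
			return fallback(errUnavailable)
		}
		return errUnavailable
	}
	defer func() {
		// A panic inside req still counts as a failure, then propagates.
		if e := recover(); e != nil {
			b.failure++
			panic(e)
		}
	}()
	err := req()
	if acceptable(err) {
		b.success++
	} else {
		b.failure++
	}
	return err
}

func demo() (success, failure int, fallbackErr error) {
	b := &miniBreaker{}
	ok := func(err error) bool { return err == nil }

	_ = b.doReq(func() error { return nil }, nil, ok)
	_ = b.doReq(func() error { return errors.New("timeout") }, nil, ok)
	func() {
		defer func() { _ = recover() }() // swallow the re-raised panic for the demo
		_ = b.doReq(func() error { panic("boom") }, nil, ok)
	}()

	b.tripped = true
	fallbackErr = b.doReq(func() error { return nil }, func(err error) error {
		return fmt.Errorf("using cached data: %w", err)
	}, ok)
	return b.success, b.failure, fallbackErr
}

func main() {
	fmt.Println(demo()) // 1 2 using cached data: service unavailable
}
```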

References

Microsoft Azure documentation on the fuse (circuit breaker) design pattern

Sony's open-source fuse implementation, based on Microsoft's document

go-zero adaptive fuse documentation

project address

https://github.com/zeromicro/go-zero

You are welcome to use go-zero and to star the project to support us!


kevinwan
