The algorithm behind adaptive microservice governance

Preface

Members of the go-zero community group often ask:

What algorithm is used to implement service monitoring?
How does the sliding window work? Can you explain the principle behind it?
How is the circuit breaker algorithm designed? Why is there no half-open state?

In this article, let's look at the algorithm and logic behind metric collection in go-zero.

How to count the indicators

For this, let's look directly at the breaker:

type googleBreaker struct {
  k     float64
  stat  *collection.RollingWindow
  proba *mathx.Proba
}

The default breaker in go-zero uses the approach described in Google's SRE book as its blueprint.

While the breaker is intercepting a request, it records the current success/failure of that type of request:

func (b *googleBreaker) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error {
  ...
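  // Execute the actual request function
  err := req()
  if acceptable(err) {
    // What actually runs: b.stat.Add(1)
    // In other words: the internal metric counts one more success
    b.markSuccess()
  } else {
    // Same principle as above
    b.markFailure()
  }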
  }

  return err
}

So under the hood: once a request has been executed, the internal statistics structure records its outcome (success or failure). At the same time, as time moves on, the statistics also need to evolve with it.

Simply put, it is a time-series store, just an in-memory one.

Let's talk about what data structure is used to organize this time series.

Sliding window

Let's take a look at the data structure defined by RollingWindow:

type RollingWindow struct {
  lock          sync.RWMutex
  size          int
  win           *window
  interval      time.Duration
  offset        int
  ignoreCurrent bool
  lastTime      time.Duration
}

In the structure above, the win field (a window) holds the recorded metrics.

A RollingWindow contains several buckets (how many is up to the developer).

Each bucket stores Sum, the number of successful requests, and Count, the total number of requests. When the breaker does its calculation, it adds every bucket's Sum into accepts and every bucket's Count into total, and from these two values it can work out the current error rate.
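The aggregation just described can be sketched as follows. This is a minimal illustration, not go-zero's code: the bucket type, its field types, and the errorRate helper are hypothetical names chosen to mirror the Sum, Count, accepts and total mentioned above.

```go
package main

import "fmt"

// bucket mirrors the two values described in the text. The type and
// field names are illustrative, not go-zero's exact definitions.
type bucket struct {
	Sum   float64 // successful requests recorded in this time slot
	Count int64   // all requests recorded in this time slot
}

// errorRate folds every bucket into accepts and total, then returns the
// share of failed requests. With no recorded requests the rate is 0.
func errorRate(buckets []bucket) (accepts float64, total int64, rate float64) {
	for _, b := range buckets {
		accepts += b.Sum
		total += b.Count
	}
	if total == 0 {
		return 0, 0, 0
	}
	return accepts, total, 1 - accepts/float64(total)
}

func main() {
	window := []bucket{{Sum: 9, Count: 10}, {Sum: 6, Count: 10}, {Sum: 0, Count: 0}}
	accepts, total, rate := errorRate(window)
	// Prints: accepts=15 total=20 errorRate=0.25
	fmt.Printf("accepts=%.0f total=%d errorRate=%.2f\n", accepts, total, rate)
}
```

Keeping only two numbers per bucket means the cost of a calculation depends on the number of buckets, not on the number of requests seen.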

How does the sliding happen

First of all, the breaker needs to count request outcomes per unit of time (for example 1s). Mapped onto the buckets above, each bucket only has to hold the metrics for one unit of time.

So how do we make sure that a given bucket holds exactly one unit of time's data as time moves on?

The first method that comes to mind: start a timer in the background that creates a bucket every unit of time, and when a request arrives, record its status in the bucket its timestamp falls into. But creating buckets periodically has a race: data may arrive before its bucket has been built.

The second method is to create buckets lazily: when a piece of data arrives, check whether its bucket exists and create it if not. That way buckets sometimes exist and sometimes do not, and a large number of buckets get created. Can we reuse them instead?

go-zero's approach: the RollingWindow creates all its buckets up front, an algorithm maps the time of the current request to one of those buckets, and the request status is recorded there.

Let's take a look at what happens when the breaker calls b.stat.Add(1):

func (rw *RollingWindow) Add(v float64) {
  rw.lock.Lock()
  defer rw.lock.Unlock()
  // The sliding happens here
  rw.updateOffset()
  rw.win.add(rw.offset, v)
}

func (rw *RollingWindow) updateOffset() {
  span := rw.span()
  if span <= 0 {
    return
  }

  offset := rw.offset
  // Reset the expired buckets
  for i := 0; i < span; i++ {
    rw.win.resetBucket((offset + i + 1) % rw.size)
  }

  rw.offset = (offset + span) % rw.size
  now := timex.Now()
  // Update the time
  rw.lastTime = now - (now-rw.lastTime)%rw.interval
}

func (w *window) add(offset int, v float64) {
  // Add the given value to the target bucket
  w.buckets[offset%w.size].add(v)
}

Here is how the buckets in the window change during Add(delta):

updateOffset slides the window: it determines which bucket the current time falls into and resets the expired buckets
- Determine the span between the current time and lastTime, measured in bucket intervals [if it exceeds the number of buckets, the number of buckets is returned directly]
- Reset the buckets inside that span
- Update offset, which is the bucket about to be written
- Update lastTime, the execution time, which also serves as the mark for the next move
Using the offset just updated, write the data to the corresponding bucket
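The steps above can be traced in a stripped-down sketch. This is not go-zero's implementation: miniWindow and its methods are hypothetical names, locking is left out, and the clock is passed in explicitly (as a duration since some fixed origin) so the run is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// miniWindow is a simplified, single-goroutine ring of buckets.
// All names are hypothetical; go-zero's RollingWindow also takes a lock.
type miniWindow struct {
	sums     []float64
	counts   []int64
	interval time.Duration
	offset   int           // index of the bucket currently being written
	lastTime time.Duration // start of the current bucket on the caller's clock
}

func newMiniWindow(size int, interval time.Duration) *miniWindow {
	return &miniWindow{
		sums:     make([]float64, size),
		counts:   make([]int64, size),
		interval: interval,
	}
}

// add records v at time now, sliding the window first if needed.
func (w *miniWindow) add(now time.Duration, v float64) {
	size := len(w.sums)
	span := int((now - w.lastTime) / w.interval)
	if span > size {
		span = size // a full lap has passed, so every bucket is stale
	}
	if span > 0 {
		// Reset the buckets the window slides over.
		for i := 0; i < span; i++ {
			idx := (w.offset + i + 1) % size
			w.sums[idx], w.counts[idx] = 0, 0
		}
		w.offset = (w.offset + span) % size
		// Align lastTime to the start of the current interval.
		w.lastTime = now - (now-w.lastTime)%w.interval
	}
	w.sums[w.offset] += v
	w.counts[w.offset]++
}

func main() {
	w := newMiniWindow(4, time.Second)
	w.add(0, 1)
	w.add(500*time.Millisecond, 0)  // same second: still bucket 0
	w.add(1200*time.Millisecond, 1) // next second: slides to bucket 1
	w.add(6*time.Second, 1)         // more than a full lap: all buckets reset
	// Prints: [0 1 0 0] 1
	fmt.Println(w.counts, w.offset)
}
```

Because the buckets are allocated once and only reset, no bucket is ever created on the request path, which is the reuse the earlier discussion asked for.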

The key questions in this process are how to pick the bucket and how to update the time. The most important part of a sliding window is the time update.

Put plainly, the bucket expiration point is the number of buckets spanned between lastTime, the time of the last update, and now: timex.Since(rw.lastTime) / rw.interval
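A few worked values may make this division concrete. The helper below is a hypothetical stand-alone function, assuming the result is capped at the number of buckets as described earlier; it takes the two timestamps as plain durations instead of calling timex.

```go
package main

import (
	"fmt"
	"time"
)

// span returns how many bucket intervals lie between lastTime and now.
// The result is capped at size: once a full lap has passed, every bucket
// is stale anyway. A negative difference is treated the same way (assumption).
func span(lastTime, now, interval time.Duration, size int) int {
	n := int((now - lastTime) / interval)
	if n >= 0 && n < size {
		return n
	}
	return size
}

func main() {
	const size = 5
	interval := time.Second
	times := []time.Duration{400 * time.Millisecond, 2500 * time.Millisecond, 9 * time.Second}
	for _, now := range times {
		// Prints: 400ms -> 0, then 2.5s -> 2, then 9s -> 5
		fmt.Println(now, "->", span(0, now, interval, size))
	}
}
```

A span of 0 means the request still belongs to the current bucket, so nothing is reset and only the write happens.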

So during Add(), lastTime and the current time together drive the window: expired buckets are reset continuously and new data keeps filling in, which is what makes the windowed calculation work.

Summary

This article analyzed how the go-zero framework collects metrics and how its sliding window, RollingWindow, is implemented. Besides this, store/redis also keeps metric statistics. It does not need a sliding window, because it only has to compute a hit rate: add 1 to hit on a hit, add 1 to miss on a miss, and derive the hit rate from the two counters at the end.
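As a contrast with the sliding window, a hit-rate statistic of this kind needs only two counters. The sketch below is an illustration under assumptions: cacheStat and its methods are hypothetical names, not the actual types in go-zero.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cacheStat is a hypothetical pair of counters; no time dimension is kept.
type cacheStat struct {
	hit  uint64
	miss uint64
}

func (s *cacheStat) incrHit()  { atomic.AddUint64(&s.hit, 1) }
func (s *cacheStat) incrMiss() { atomic.AddUint64(&s.miss, 1) }

// hitRate returns hits divided by all lookups, or 0 before any lookup.
func (s *cacheStat) hitRate() float64 {
	hit := atomic.LoadUint64(&s.hit)
	total := hit + atomic.LoadUint64(&s.miss)
	if total == 0 {
		return 0
	}
	return float64(hit) / float64(total)
}

func main() {
	var s cacheStat
	for i := 0; i < 3; i++ {
		s.incrHit()
	}
	s.incrMiss()
	// Prints: hit rate: 0.75
	fmt.Printf("hit rate: %.2f\n", s.hitRate())
}
```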

A sliding window is well suited to computing metrics for flow control, and it can also be used to control the flow itself.

Project address

https://github.com/tal-tech/go-zero

You are welcome to use go-zero and to star the project to support us!

WeChat Exchange Group

Follow the "Microservice Practice" official account and click "exchange group" to get the QR code of the community group.

For the go-zero series of articles, follow the "Microservice Practice" official account.


kevinwan
