
In the stock market, a circuit breaker is a mechanism that suspends trading for a period of time when price swings reach a preset threshold (the circuit-breaking point) during trading hours. It works like a fuse that blows when the current is too high, hence the name. Its purpose is to contain systemic risk: it gives the market time to calm down and keeps panic from spreading across the whole market, which helps prevent large-scale declines in stock prices.

Highly concurrent distributed systems need a similar circuit breaker (fusing) mechanism. The breaker is generally configured on the client, that is, the calling side. When the client sends requests to the server and the server's errors keep increasing, the breaker may trip. Once it has tripped, the client no longer sends requests to the server; it rejects them locally instead, which protects the server from overload. The server here may be an RPC service, an HTTP service, or a data store such as MySQL or Redis. Note that circuit breaking is a lossy mechanism, so it usually needs to be paired with a degradation (fallback) strategy.

Fusing principle

Modern microservice architectures are distributed: the system is composed of many microservices that call each other and form complex call chains. If one service in a chain becomes unstable, the failure can cascade layer by layer and eventually bring down the entire chain. We therefore need to break the circuit to unstable dependencies and degrade gracefully, temporarily cutting off calls to the unstable service so that a local fault does not cause an avalanche across the whole distributed system.

Put simply, I think of a circuit breaker as a proxy in front of a service that is prone to failures. The proxy records the number of errors in recent calls and uses it to decide whether to let the next call proceed or to return an error immediately.

A fuse state machine is maintained inside the fuse. The conversion relationship of the state machine is shown in the following figure:

The fuse has three states:

  • Closed state: This is the initial state. The breaker keeps a counter of failed calls and increments it by 1 each time a call fails. If the number of recent failures exceeds the allowed threshold within a given time window, the breaker switches to the Open state and starts a timeout clock. When the timeout elapses, it switches to the Half-Open state. The timeout gives the system a chance to fix the error that caused the calls to fail, so that it can return to normal operation. In the Closed state the error count is time-based: it resets automatically at a fixed interval, which prevents occasional errors from tripping the breaker. Alternatively, the breaker can be based on the number of consecutive failures.
  • Open state: In this state, client requests immediately receive an error response and the server is not called.
  • Half-Open state: A limited number of client requests are allowed through to the server. If these calls succeed, the error that caused the earlier failures is considered fixed; the breaker switches to the Closed state and resets the error counter. If a certain number of these calls fail, the problem is considered to still exist; the breaker switches back to the Open state and restarts the timeout clock to give the system more time to recover. The Half-Open state effectively prevents a recovering service from being brought down again by a sudden flood of requests.
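
The three states and their transitions can be sketched as a small pure function. This is only an illustration, not code from Hystrix or go-zero: the Event type and its values are hypothetical names introduced here, and the real thresholds and timers are left out.

```go
package main

import "fmt"

// State is the breaker state; the names mirror the three states listed above.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Event is a hypothetical set of inputs that drive the state machine.
type Event int

const (
	FailureThresholdExceeded Event = iota // too many recent failures while Closed
	CoolingTimeoutElapsed                 // the timeout clock fired while Open
	ProbeSucceeded                        // enough trial calls succeeded while HalfOpen
	ProbeFailed                           // trial calls failed while HalfOpen
)

// next returns the state that follows s when event e occurs.
// Combinations that are not listed leave the state unchanged.
func next(s State, e Event) State {
	switch {
	case s == Closed && e == FailureThresholdExceeded:
		return Open
	case s == Open && e == CoolingTimeoutElapsed:
		return HalfOpen
	case s == HalfOpen && e == ProbeSucceeded:
		return Closed
	case s == HalfOpen && e == ProbeFailed:
		return Open
	default:
		return s
	}
}

// String makes the states readable when printed.
func (s State) String() string {
	return [...]string{"Closed", "Open", "HalfOpen"}[s]
}

func main() {
	s := Closed
	events := []Event{
		FailureThresholdExceeded, // Closed   -> Open
		CoolingTimeoutElapsed,    // Open     -> HalfOpen
		ProbeFailed,              // HalfOpen -> Open
		CoolingTimeoutElapsed,    // Open     -> HalfOpen
		ProbeSucceeded,           // HalfOpen -> Closed
	}
	for _, e := range events {
		s = next(s, e)
		fmt.Println(s)
	}
}
```

Keeping the transitions in one function makes it easy to check that, for example, a success reported while the breaker is Open never closes it directly.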

The following figure is the implementation logic of the circuit breaker in Netflix's open source project Hystrix:

From this flow chart, it can be seen that:

  1. When a request arrives, allowRequest() first checks whether the circuit is open. If it is not, the request is let through. If it is, the function checks whether the cooling time slice has elapsed: if so, the request is also let through; otherwise an error is returned immediately.
  2. Each call reports its result through markSuccess(duration) or markFailure(duration), which count how many calls succeeded or failed within a given duration.
  3. isOpen() decides whether the circuit is open by computing the current error rate, failure / (success + failure). If the rate is above a threshold, the circuit is opened; otherwise it stays closed.
  4. Hystrix keeps these statistics in memory, recording the request results of each cycle, and deletes entries that are older than the duration.

Fuse implementation

Now that we understand the principle, let's implement a circuit breaker ourselves.

Readers who are familiar with go-zero know that its built-in breaker does not use the approach described above. Instead, it follows the Google SRE book and adopts an adaptive circuit-breaking mechanism. What is the benefit of the adaptive approach? We will compare the two mechanisms below.

First, we implement our own breaker based on the principle described above.

Code path: go-zero/core/breaker/hystrixbreaker.go

The breaker starts in the Closed state. Once it opens, the default cooling time is 5 seconds. In the HalfOpen state, the default detection interval is 200 milliseconds. By default, rateTripFunc decides whether the breaker trips: it trips when the number of samples is greater than or equal to 200 and the error rate is greater than 50%. A sliding window records the request statistics from which the total number of requests and the number of errors are computed.

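func newHystrixBreaker() *hystrixBreaker {
  // Split the statistics window into equally sized buckets.
  bucketDuration := time.Duration(int64(window) / int64(buckets))
  stat := collection.NewRollingWindow(buckets, bucketDuration)
  return &hystrixBreaker{
    state:          Closed,
    coolingTimeout: defaultCoolingTimeout,
    detectTimeout:  defaultDetectTimeout,
    tripFunc:       rateTripFunc(defaultErrRate, defaultMinSample),
    stat:           stat,
    now:            time.Now,
  }
}

// rateTripFunc trips the breaker once at least minSamples requests have been
// recorded in the window and the error rate is above rate.
func rateTripFunc(rate float64, minSamples int64) TripFunc {
  return func(rollingWindow *collection.RollingWindow) bool {
    var total, errs int64
    rollingWindow.Reduce(func(b *collection.Bucket) {
      total += b.Count
      // markSuccess records Add(1) and markFailure records Add(0), so Sum
      // counts the successes; the failures are the remainder.
      errs += b.Count - int64(b.Sum)
    })
    // Checking the sample size first also avoids dividing by zero.
    if total < minSamples {
      return false
    }
    errRate := float64(errs) / float64(total)
    return errRate > rate
  }
}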

The doReq method is called for every request. It first calls accept() to decide whether to reject the request. If the request is rejected, doReq returns a circuit breaker error immediately (or the result of the fallback, if one is provided). Otherwise it executes req() to actually call the server, and then calls b.markSuccess() or b.markFailure() depending on whether the call succeeded.

func (b *hystrixBreaker) doReq(req func() error, fallback func(error) error, acceptable Acceptable) error {
  if err := b.accept(); err != nil {
    if fallback != nil {
      return fallback(err)
    }
    return err
  }

  defer func() {
    if e := recover(); e != nil {
      b.markFailure()
      panic(e)
    }
  }()

  err := req()
  if acceptable(err) {
    b.markSuccess()
  } else {
    b.markFailure()
  }

  return err
}

The accept() method first reads the current state. If the breaker is Closed, it returns right away, which means the request is processed normally.

If the state is Open, it checks whether the cooling time has expired. If not, it returns a circuit breaker error to reject the request. If it has, it changes the state to HalfOpen. The cooling time exists to give the server time to recover, so that a continuous stream of requests does not bring it down.

If the state is HalfOpen, it checks the detection interval first, to avoid probing too frequently. The default detection interval is 200 milliseconds.

func (b *hystrixBreaker) accept() error {
  // All state changes happen under the mutex, so one deferred Unlock
  // replaces the per-branch unlocks.
  b.mux.Lock()
  defer b.mux.Unlock()

  switch b.getState() {
  case Open:
    now := b.now()
    // Still cooling down: reject without calling the server.
    if b.openTime.Add(b.coolingTimeout).After(now) {
      return ErrServiceUnavailable
    }
    // The cooling time has expired: switch to HalfOpen and let this
    // request through as the first probe.
    atomic.StoreInt32((*int32)(&b.state), int32(HalfOpen))
    atomic.StoreInt32(&b.halfopenSuccess, 0)
    b.lastRetryTime = now
  case HalfOpen:
    now := b.now()
    // Allow at most one probe per detection interval.
    if b.lastRetryTime.Add(b.detectTimeout).After(now) {
      return ErrServiceUnavailable
    }
    b.lastRetryTime = now
  case Closed:
    // Requests pass through unchanged.
  }

  return nil
}

If the request returns normally, markSuccess() is called. If the breaker is in the HalfOpen state, it checks whether the number of successful probes is greater than the default threshold. If it is, the breaker switches to the Closed state and the statistics are cleared.

func (b *hystrixBreaker) markSuccess() {
  b.mux.Lock()
  switch b.getState() {
  case Open:
    b.mux.Unlock()
  case HalfOpen:
    atomic.AddInt32(&b.halfopenSuccess, 1)
    if atomic.LoadInt32(&b.halfopenSuccess) > defaultHalfOpenSuccesss {
      atomic.StoreInt32((*int32)(&b.state), int32(Closed))
      b.stat.Reduce(func(b *collection.Bucket) {
        b.Count = 0
        b.Sum = 0
      })
    }
    b.mux.Unlock()
  case Closed:
    b.stat.Add(1)
    b.mux.Unlock()
  }
}

In the markFailure() method, if the current state is Closed, tripFunc is executed to decide whether the trip condition is met; if so, the breaker switches to the Open state. A failure in the HalfOpen state switches the breaker straight back to Open.

func (b *hystrixBreaker) markFailure() {
  b.mux.Lock()
  b.stat.Add(0)
  switch b.getState() {
  case Open:
    b.mux.Unlock()
  case HalfOpen:
    b.openTime = b.now()
    atomic.StoreInt32((*int32)(&b.state), int32(Open))
    b.mux.Unlock()
  case Closed:
    if b.tripFunc != nil && b.tripFunc(b.stat) {
      b.openTime = b.now()
      atomic.StoreInt32((*int32)(&b.state), int32(Open))
    }
    b.mux.Unlock()
  }
}

The overall logic of the circuit breaker is fairly simple, and reading the code should be enough to understand it. This code was written in a hurry and may contain bugs. If you find one, feel free to contact me so that I can fix it.

Comparison of hystrixBreaker and googlebreaker

Next, we compare the behavior of the two breakers.

The sample code for this part is under: go-zero/example

Two services are defined, user-api and user-rpc. user-api acts as the client and sends requests to user-rpc, which acts as the server and responds to them.

The example method in user-rpc returns an error with a probability of about 20%.

func (l *UserInfoLogic) UserInfo(in *user.UserInfoRequest) (*user.UserInfoResponse, error) {
  ts := time.Now().UnixMilli()
  if in.UserId == int64(1) {
    if ts%5 == 1 {
      return nil, status.Error(codes.Internal, "internal error")
    }
    return &user.UserInfoResponse{
      UserId: 1,
      Name:   "jack",
    }, nil

  }
  return &user.UserInfoResponse{}, nil
}

The example method in user-api sends requests to user-rpc in a loop and uses a Prometheus metric to count the requests that were not rejected by the breaker.

var metricSuccessReqTotal = metric.NewCounterVec(&metric.CounterVecOpts{
  Namespace: "circuit_breaker",
  Subsystem: "requests",
  Name:      "req_total",
  Help:      "test for circuit breaker",
  Labels:    []string{"method"},
})

func (l *UserInfoLogic) UserInfo() (resp *types.UserInfoResponse, err error) {
  for {
    _, err := l.svcCtx.UserRPC.UserInfo(l.ctx, &user.UserInfoRequest{UserId: int64(1)})
    if err != nil && err == breaker.ErrServiceUnavailable {
      fmt.Println(err)
      continue
    }
    metricSuccessReqTotal.Inc("UserInfo")
  }

  return &types.UserInfoResponse{}, nil
}

Start two services, and then observe the number of normal requests under the two circuit breaker strategies.

The normal request rate of the googleBreaker circuit breaker is shown in the following figure:

The normal request rate of the hystrixBreaker circuit breaker is shown in the following figure:

The experimental results show that googleBreaker, which is built into go-zero, lets more normal requests through than hystrixBreaker. hystrixBreaker maintains three states. When it enters the Open state, it starts a cooling clock so that the client stops putting pressure on the server, and no requests are let through during this period. In addition, when it changes from the HalfOpen state to the Closed state, a large number of requests are sent to the server at once; if the server has not yet recovered, the breaker goes back to the Open state. googleBreaker uses an adaptive strategy instead. It does not need multiple states and is not one-size-fits-all like hystrixBreaker: it processes as many requests as it can. That is what we want, since circuit breaking is lossy for the caller. Let's now study googleBreaker, the breaker built into go-zero.

Source code interpretation

The code path of googleBreaker is: go-zero/core/breaker/googlebreaker.go

In the doReq() method, accept() decides whether the breaker has tripped. If it has, an error is returned; if a fallback function is provided, it is executed instead, for example to return degraded data. If the request succeeds, markSuccess() adds 1 to both the total number of requests and the number of normally processed requests. If the request fails, markFailure() adds 1 only to the total number of requests.

func (b *googleBreaker) doReq(req func() error, fallback func(err error) error, acceptable Acceptable) error {
  if err := b.accept(); err != nil {
    if fallback != nil {
      return fallback(err)
    }

    return err
  }

  defer func() {
    if e := recover(); e != nil {
      b.markFailure()
      panic(e)
    }
  }()

  err := req()
  if acceptable(err) {
    b.markSuccess()
  } else {
    b.markFailure()
  }

  return err
}

In the accept() method, it is determined by calculation whether the fuse is triggered.

In this algorithm, two request numbers need to be recorded, which are:

  • Total requests (requests): The total number of requests initiated by the caller
  • Number of requests processed normally (accepts): The number of requests normally processed by the server

Under normal circumstances, these two values are equal. When the callee runs into trouble and starts rejecting requests, the number of accepted requests (accepts) gradually falls behind the number of requests (requests). The caller may keep sending requests until requests = K * accepts. Once this limit is exceeded, the breaker engages: new requests are dropped locally with a certain probability and an error is returned directly. The probability is calculated as follows:

 max(0, (requests - K * accepts) / (requests + 1))

The sensitivity of the breaker is adjusted through the multiplier K in the algorithm. Reducing K makes the adaptive algorithm more sensitive; increasing K makes it less sensitive. For example, if the caller's request limit is changed from requests = 2 * accepts to requests = 1.1 * accepts, the backend rejects only about one request for every ten it accepts before the caller starts dropping requests locally.

func (b *googleBreaker) accept() error {
  accepts, total := b.history()
  weightedAccepts := b.k * float64(accepts)
  // https://landing.google.com/sre/sre-book/chapters/handling-overload/#eq2101
  dropRatio := math.Max(0, (float64(total-protection)-weightedAccepts)/float64(total+1))
  if dropRatio <= 0 {
    return nil
  }

  if b.proba.TrueOnProba(dropRatio) {
    return ErrServiceUnavailable
  }

  return nil
}

history counts the current total number of requests and the number of requests processed normally from the sliding window.

func (b *googleBreaker) history() (accepts, total int64) {
  b.stat.Reduce(func(b *collection.Bucket) {
    accepts += int64(b.Sum)
    total += b.Count
  })

  return
}

Concluding remarks

This article introduced circuit breaking, a client-side throttling mechanism in service governance. The Hystrix-style strategy needs three states: Open, HalfOpen and Closed. The conditions for switching between them are described in detail above; it is worth rereading that section and, ideally, implementing it yourself. The breaker built into go-zero has no explicit state. If it must be described in terms of states, there are only two: open and closed. It drops requests adaptively according to the current success rate, which makes it a more flexible strategy. The probability of dropping a request changes with the number of requests that are processed normally: the more requests succeed, the lower the probability of dropping, and the fewer succeed, the higher it is.

Although the principle of fusing is the same, the effects caused by different implementation mechanisms may be different. In actual production, you can choose a fusing strategy that meets the business scenario according to the actual situation.

Hope this article is helpful to you.

The code of this article: https://github.com/zhoushuguang/go-zero/tree/circuit-breaker

References

https://martinfowler.com/bliki/CircuitBreaker.html

https://github.com/Netflix/Hystrix/wiki/How-it-Works

Project address

https://github.com/zeromicro/go-zero

You are welcome to use go-zero and star the project to support us!

WeChat exchange group

Follow the official account "Microservice Practice" and click on the exchange group to get the QR code of the community group.


kevinwan

Author of go-zero