
Original link: # The fuse framework under the

Background

As microservice architecture has caught on like wildfire, a number of related concepts have come along with it. Any discussion of microservices comes back to the same words: high cohesion and low coupling. Achieving them is the ultimate goal of microservice architecture design. In a microservice architecture, each microservice implements a single business function and can evolve independently. An application may consist of multiple microservices, which exchange data through remote calls. This produces dependencies like the following:

Microservice A calls microservices C and D, microservice B relies on microservices B and E, and microservice D depends on service F. This is just a simple example; the dependencies between services in real business systems are more complicated. If one microservice on the call chain responds too slowly or becomes unavailable, calls from the upstream services (named according to the call relationship) hold on to more and more system resources, which can eventually bring the whole system down. This is the avalanche effect of microservices.

To counter the avalanche effect, a circuit breaker (fuse) mechanism is used to protect the microservice call chain. Everyone is familiar with the idea: the fuse in an electrical circuit is exactly such a mechanism. So what is the circuit breaker mechanism in microservices?

When a microservice on the chain is unavailable or responds too slowly, the service is degraded: calls to that node are cut off and an error response is returned quickly. Once calls to that node are detected to respond normally again, the call chain is restored.

In this article, we introduce an open source circuit breaker framework: hystrix-go.

Circuit breaker framework (hystrix-go)

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services, and third-party services, stop cascading failures, and enable resilience in complex distributed systems where failure is inevitable. hystrix-go aims to let Go programmers easily build applications with execution semantics similar to the Java-based Hystrix library. This article starts with how to use hystrix-go and then analyzes its source code.

Quick install

go get -u github.com/afex/hystrix-go/hystrix

Quick start

hystrix-go works out of the box and is easy to use. There are two main steps:

  • Configure the circuit breaker; otherwise the default configuration is used. The methods available are:
func Configure(cmds map[string]CommandConfig) 
func ConfigureCommand(name string, config CommandConfig)

Configure calls ConfigureCommand internally; the two differ only in how the parameters are passed, so choose whichever suits your code style.

  • Define the application logic that depends on the external system (runFunc) and the logic to execute when the service is interrupted (fallbackFunc). The methods available are:
func Go(name string, run runFunc, fallback fallbackFunc) // calls GoC internally
func GoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC) 
func Do(name string, run runFunc, fallback fallbackFunc) // calls DoC internally
func DoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC) // calls GoC internally and handles the asynchronous process

The difference between Go and Do is asynchronous versus synchronous. Do calls DoC, which handles the asynchronous process and waits for the result; all of them eventually call GoC, which we analyze later.

As an example, we add an endpoint-level circuit breaker middleware to the Gin framework:

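// The code has been uploaded to GitHub; see the link at the end of the article.
import (
    "fmt"
    "io"
    "net"
    "net/http"
    "os"
    "time"

    "github.com/afex/hystrix-go/hystrix"
    "github.com/gin-gonic/gin"
)

// f (*os.File), errfile (error) and filename (string) are package-level
// variables declared elsewhere in the demo.
var CircuitBreakerName = "api_%s_circuit_breaker"

func CircuitBreakerWrapper(ctx *gin.Context) {
    name := fmt.Sprintf(CircuitBreakerName, ctx.Request.URL)
    hystrix.Do(name, func() error {
        ctx.Next()
        code := ctx.Writer.Status()
        if code != http.StatusOK {
            return fmt.Errorf("status code %d", code)
        }
        return nil

    }, func(err error) error {
        if err != nil {
            // Report to monitoring (not implemented)
            _, _ = io.WriteString(f, fmt.Sprintf("circuitBreaker and err is %s\n", err.Error())) // write the message to the log file
            fmt.Printf("circuitBreaker and err is %s\n", err.Error())
            // Return the circuit breaker error
            ctx.JSON(http.StatusServiceUnavailable, gin.H{
                "msg": err.Error(),
            })
        }
        return nil
    })
}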

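func init() {
    // Note: hystrix-go looks settings up by the exact command name. These settings
    // are registered under the literal template name, so a command named via
    // fmt.Sprintf above only picks them up if it is configured under that same name.
    hystrix.ConfigureCommand(CircuitBreakerName, hystrix.CommandConfig{
        Timeout:                int(3 * time.Second / time.Millisecond), // command timeout; the field is in milliseconds (3000ms = 3s)
        MaxConcurrentRequests:  10, // maximum concurrency of the command
        RequestVolumeThreshold: 100, // number of requests within the 10s statistics window required before the circuit is evaluated for opening
        SleepWindow:            int(2 * time.Second / time.Millisecond), // after the circuit opens, how long (in milliseconds) to wait before testing whether the service is available again
        ErrorPercentThreshold:  20, // error percentage; once requests >= RequestVolumeThreshold and the error rate reaches this percentage, the circuit opens
    })
    if checkFileIsExist(filename) { // the file exists
        f, errfile = os.OpenFile(filename, os.O_APPEND|os.O_WRONLY, 0666) // open for appending (O_APPEND alone would open read-only)
    } else {
        f, errfile = os.Create(filename) // create the file
    }
}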


func main()  {
    defer f.Close()
    hystrixStreamHandler := hystrix.NewStreamHandler()
    hystrixStreamHandler.Start()
    go http.ListenAndServe(net.JoinHostPort("", "81"), hystrixStreamHandler)
    r := gin.Default()
    r.GET("/api/ping/baidu", func(c *gin.Context) {
        _, err := http.Get("https://www.baidu.com")
        if err != nil {
            c.JSON(http.StatusInternalServerError, gin.H{"msg": err.Error()})
            return
        }
        c.JSON(http.StatusOK, gin.H{"msg": "success"})
    }, CircuitBreakerWrapper)
    r.Run()  // listen and serve on 0.0.0.0:8080 (for windows "localhost:8080")
}

func checkFileIsExist(filename string) bool {
    if _, err := os.Stat(filename); os.IsNotExist(err) {
        return false
    }
    return true
}

Command: wrk -t100 -c100 -d1s http://127.0.0.1:8080/api/ping/baidu

Result:

circuitBreaker and err is status code 500
circuitBreaker and err is status code 500
..... 
circuitBreaker and err is hystrix: max concurrency
circuitBreaker and err is hystrix: max concurrency
.....
circuitBreaker and err is hystrix: circuit open
circuitBreaker and err is hystrix: circuit open
.....

Analysis of the errors:

  • circuitBreaker and err is status code 500 : we disconnected the network, so the outgoing request failed and the handler returned 500
  • circuitBreaker and err is hystrix: max concurrency : MaxConcurrentRequests is 10, while the stress test uses 100 concurrent connections, so the excess requests are rejected
  • circuitBreaker and err is hystrix: circuit open : RequestVolumeThreshold is set to 100, so once the number of requests within 10s reaches 100 and the error rate reaches the threshold, the circuit opens

A brief summary of the example above:

  • Add an endpoint-level circuit breaker middleware
  • Initialize the circuit breaker configuration
  • Start the dashboard to visualize the information reported by hystrix; open http://localhost:81 in a browser to see the results

hystrix-go process analysis

I originally wanted to walk through the source code, but there is quite a lot of it, so instead I analyze the overall process and look at some of the core code along the way.

Configure circuit breaker rules

A circuit breaker needs rules for when to trip. Two methods can be called to configure them, and both ultimately call ConfigureCommand . There is no special logic here. If we configure nothing, the default rules are used:

var (
    // DefaultTimeout is how long to wait for command to complete, in milliseconds
    DefaultTimeout = 1000
    // DefaultMaxConcurrent is how many commands of the same type can run at the same time
    DefaultMaxConcurrent = 10
    // DefaultVolumeThreshold is the minimum number of requests needed before a circuit can be tripped due to health
    DefaultVolumeThreshold = 20
    // DefaultSleepWindow is how long, in milliseconds, to wait after a circuit opens before testing for recovery
    DefaultSleepWindow = 5000
    // DefaultErrorPercentThreshold causes circuits to open once the rolling measure of errors exceeds this percent of requests
    DefaultErrorPercentThreshold = 50
    // DefaultLogger is the default logger that will be used in the Hystrix package. By default prints nothing.
    DefaultLogger = NoopLogger{}
)

The configuration rules are as follows:

  • Timeout : the timeout for executing a command, in milliseconds; the default is 1000ms ;
  • MaxConcurrentRequests : the maximum concurrency of a command; the default is 10 ;
  • SleepWindow : used after the circuit has opened; it controls how long to wait before testing whether the service is available again; the default is 5000ms ;
  • RequestVolumeThreshold : one of the conditions for opening the circuit; only after this many requests have been seen within the 10s statistics window (hard-coded) is the error rate evaluated; the default is 20 ;
  • ErrorPercentThreshold : one of the conditions for opening the circuit; when the number of requests is greater than or equal to RequestVolumeThreshold and the error rate reaches this percentage, the circuit opens; the default is 50 ;

These rules are distinguished and stored in a map according to the name of the command.

Execute command

There are four main methods for executing a command:

func Go(name string, run runFunc, fallback fallbackFunc)
func GoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC) 
func Do(name string, run runFunc, fallback fallbackFunc)
func DoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC)

Do calls DoC internally and Go calls GoC internally. DoC in turn ends up calling GoC, but adds synchronization logic on top:

func DoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC) error {
    // ... some wrapper code omitted
    var errChan chan error
    if fallback == nil {
        errChan = GoC(ctx, name, r, nil)
    } else {
        errChan = GoC(ctx, name, r, f)
    }

    select {
    case <-done:
        return nil
    case err := <-errChan:
        return err
    }
}

Because they all end up calling GoC, we analyze GoC. The code is fairly long, so we go through it piece by piece:

Create command object

    cmd := &command{
        run:      run,
        fallback: fallback,
        start:    time.Now(),
        errChan:  make(chan error, 1),
        finished: make(chan bool, 1),
    }
    // Get the circuit breaker
    circuit, _, err := GetCircuit(name)
    if err != nil {
        cmd.errChan <- err
        return cmd.errChan
    }

The data structure of command is as follows:

type command struct {
    sync.Mutex

    ticket      *struct{}
    start       time.Time
    errChan     chan error
    finished    chan bool
    circuit     *CircuitBreaker
    run         runFuncC
    fallback    fallbackFuncC
    runDuration time.Duration
    events      []string
}

Field introduction:

  • ticket : a token, used to limit the maximum concurrency
  • start : records when the command started executing
  • errChan : carries the command's execution error
  • finished : marks that the command has finished executing; used for goroutine synchronization
  • circuit : the circuit breaker and its related information
  • run : the application logic
  • fallback : the function to execute when the application logic fails
  • runDuration : records how long the command ran
  • events : event type information, such as success for a successful execution, or timeout , context_canceled and so on for failures

The key call in the code above is GetCircuit . It obtains the circuit breaker for the command, loading it lazily: if none exists yet, one is created. The circuit breaker structure is as follows:

type CircuitBreaker struct {
    Name                   string
    open                   bool
    forceOpen              bool
    mutex                  *sync.RWMutex
    openedOrLastTestedTime int64

    executorPool *executorPool
    metrics      *metricExchange
}

These fields mean:

  • Name : the name of the circuit breaker, which is the name of the command that created it
  • open : whether the circuit is open
  • forceOpen : manually forces the circuit open; used in unit tests
  • mutex : a read-write lock that ensures concurrency safety
  • openedOrLastTestedTime : when the circuit was opened or last tested; needed because a recovery attempt is made once SleepWindow has elapsed
  • executorPool : used for flow control; maximum concurrency is enforced through it, and each request must obtain a token
  • metrics : used to report execution status events; through it, execution status is stored in the circuit breaker's per-dimension data sets (success, failure, timeout...).

The implementation logic of executorPool and metrics will be analyzed separately later.

Define the methods and variables related to the token

Because maximum concurrency must be enforced, tokens are used for flow control: every request must obtain a token and return it after use. Let's look at this code first:

    ticketCond := sync.NewCond(cmd)
    ticketChecked := false
    // When the caller extracts error from returned errChan, it's assumed that
    // the ticket's been returned to executorPool. Therefore, returnTicket() can
    // not run after cmd.errorWithFallback().
    returnTicket := func() {
        cmd.Lock()
        // Avoid releasing before a ticket is acquired.
        for !ticketChecked {
            ticketCond.Wait()
        }
        cmd.circuit.executorPool.Return(cmd.ticket)
        cmd.Unlock()
    }

sync.NewCond creates a condition variable that signals when the token may be returned.

A returnTicket function is then defined, which calls the pool's Return method to give the token back.
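
The coordination between "a token has been checked" and "the token may now be returned" can be sketched with sync.Cond. This is a minimal sketch with a one-token pool; acquireThenReturn and its local variables are hypothetical names, not the library's.

```go
package main

import (
	"fmt"
	"sync"
)

// acquireThenReturn starts a releaser goroutine that must not return the
// token before the acquirer goroutine has attempted to take it. It reports
// how many tokens are in the pool at the end.
func acquireThenReturn() int {
	pool := make(chan struct{}, 1)
	pool <- struct{}{} // the pool starts with a single token

	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	checked := false // set once acquisition has been attempted
	held := false    // true if the token was actually obtained

	var wg sync.WaitGroup
	wg.Add(2)

	// Releaser: waits on the condition variable until acquisition was attempted.
	go func() {
		defer wg.Done()
		mu.Lock()
		defer mu.Unlock()
		for !checked {
			cond.Wait()
		}
		if held {
			pool <- struct{}{}
			held = false
		}
	}()

	// Acquirer: tries to take a token without blocking, then signals.
	go func() {
		defer wg.Done()
		mu.Lock()
		defer mu.Unlock()
		select {
		case <-pool:
			held = true
		default: // no token available
		}
		checked = true
		cond.Signal()
	}()

	wg.Wait()
	return len(pool)
}

func main() {
	fmt.Println("tokens in pool:", acquireThenReturn())
}
```

Whichever goroutine is scheduled first, the loop around cond.Wait() guarantees the release happens only after the acquisition attempt.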

Define the method for reporting execution events

As mentioned earlier, the circuit breaker reports execution status events and stores the execution status in its per-dimension data sets (success, failure, timeout...). So a reporting function is defined:

    reportAllEvent := func() {
        err := cmd.circuit.ReportEvent(cmd.events, cmd.start, cmd.runDuration)
        if err != nil {
            log.Printf(err.Error())
        }
    }

Start goroutine one: execute the application logic (runFunc)

The main purpose of goroutine one is to execute the application logic:

go func() {
        defer func() { cmd.finished <- true }() // signal goroutine two that goroutine one's command has finished

        // The circuit opens when the recent request volume exceeds the threshold and the error rate is high.
        // While it is open, requests are rejected immediately and the token is returned;
        // once the service looks healthy again, the circuit breaker lets new traffic through.
        if !cmd.circuit.AllowRequest() {
            cmd.Lock()
            // It's safe for another goroutine to go ahead releasing a nil ticket.
            ticketChecked = true
            ticketCond.Signal() // signal that the ticket may be released
            cmd.Unlock()
            // returnOnce is a sync.Once, which guarantees this runs only once.
            returnOnce.Do(func() {
                // Return the token
                returnTicket()
                // Run the fallback logic
                cmd.errorWithFallback(ctx, ErrCircuitOpen)
                // Report the status events
                reportAllEvent()
            })
            return
        }
        // Concurrency control
        cmd.Lock()
        select {
        // A token was obtained
        case cmd.ticket = <-circuit.executorPool.Tickets:
            // Signal that the ticket may be released
            ticketChecked = true
            ticketCond.Signal()
            cmd.Unlock()
        default:
            // No token is available, i.e. the maximum concurrency has been reached, so run the fallback directly
            ticketChecked = true
            ticketCond.Signal()
            cmd.Unlock()
            returnOnce.Do(func() {
                returnTicket()
                cmd.errorWithFallback(ctx, ErrMaxConcurrency)
                reportAllEvent()
            })
            return
        }
        // Run the application logic
        runStart := time.Now()
        runErr := run(ctx)
        returnOnce.Do(func() {
            defer reportAllEvent() // report the status events
            // Record how long the application logic ran
            cmd.runDuration = time.Since(runStart)
            // Return the token
            returnTicket()
            // If the application logic failed, run the fallback function
            if runErr != nil {
                cmd.errorWithFallback(ctx, runErr)
                return
            }
            cmd.reportEvent("success")
        })
    }()

To summarize this goroutine:

  • Check whether the circuit is open; if it is, the request is rejected immediately and nothing further is executed
  • Acquire a token; if none is available, the maximum concurrency has been reached and the fallback runs directly
  • Run the application logic

Start goroutine two: synchronize with goroutine one and watch for errors

Look at the code first:

go func() {
        // A timer enforces the timeout, which is the value we configured (1000ms by default)
        timer := time.NewTimer(getSettings(name).Timeout)
        defer timer.Stop()

        select {
        // Synchronize with goroutine one
        case <-cmd.finished:
            // returnOnce has been executed in another goroutine

        // The context was canceled
        case <-ctx.Done():
            returnOnce.Do(func() {
                returnTicket()
                cmd.errorWithFallback(ctx, ctx.Err())
                reportAllEvent()
            })
            return
        // The command timed out
        case <-timer.C:
            returnOnce.Do(func() {
                returnTicket()
                cmd.errorWithFallback(ctx, ErrTimeout)
                reportAllEvent()
            })
            return
        }
    }()

The logic of this goroutine is straightforward: it watches for cancellation and timeout of the business execution.

A diagram to summarize the command execution process

Everything above was analyzed through code, which can still look a bit messy, so finally a diagram summarizes it:

We have now analyzed the whole process; next we look at some core points.

Report status events

hystrix-go sets up a default metric collector for each command, which stores all of the circuit breaker's state, including the number of calls, failures, rejections, and so on. The metrics structure is as follows:

type DefaultMetricCollector struct {
    mutex *sync.RWMutex

    numRequests *rolling.Number
    errors      *rolling.Number

    successes               *rolling.Number
    failures                *rolling.Number
    rejects                 *rolling.Number
    shortCircuits           *rolling.Number
    timeouts                *rolling.Number
    contextCanceled         *rolling.Number
    contextDeadlineExceeded *rolling.Number

    fallbackSuccesses *rolling.Number
    fallbackFailures  *rolling.Number
    totalDuration     *rolling.Timing
    runDuration       *rolling.Timing
}

The rolling.Number structure stores the count metrics, and rolling.Timing stores the timing metrics.

Reporting ultimately goes through metricExchange , whose data structure is as follows:

type metricExchange struct {
    Name    string
    Updates chan *commandExecution
    Mutex   *sync.RWMutex

    metricCollectors []metricCollector.MetricCollector
}

The structure used to report command information:

type commandExecution struct {
    Types            []string      `json:"types"` // event types, such as success, failure, ...
    Start            time.Time     `json:"start_time"` // command start time
    RunDuration      time.Duration `json:"run_duration"` // how long the command ran
    ConcurrencyInUse float64       `json:"concurrency_inuse"` // utilization of the command's executor pool
}

That is a lot to take in, so a class diagram helps show how these pieces relate:

We can see that metricExchange provides a Monitor method. Its main logic is to listen for status events and write them to the metric collectors, so the entire reporting process is: a command sends its events to the Updates channel, and Monitor receives them and writes the metrics.
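
The reporting flow can be sketched as a goroutine draining a channel and fanning each update out to collectors. This is a simplified model under stated assumptions: update, collector and exchange are hypothetical stand-ins for commandExecution, the metric collectors and metricExchange.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a stand-in for a reported command execution.
type update struct {
	types []string
}

// collector counts events by type.
type collector struct {
	mu     sync.Mutex
	counts map[string]int
}

func newCollector() *collector { return &collector{counts: map[string]int{}} }

func (c *collector) record(u update) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, t := range u.types {
		c.counts[t]++
	}
}

// exchange receives updates on a channel and writes them to its collectors.
type exchange struct {
	updates    chan update
	collectors []*collector
	wg         sync.WaitGroup
}

func newExchange(cs ...*collector) *exchange {
	e := &exchange{updates: make(chan update, 100), collectors: cs}
	e.wg.Add(1)
	go e.monitor()
	return e
}

// monitor is the listening loop: it runs until the channel is closed.
func (e *exchange) monitor() {
	defer e.wg.Done()
	for u := range e.updates {
		for _, c := range e.collectors {
			c.record(u)
		}
	}
}

// close stops the monitor after all pending updates have been processed.
func (e *exchange) close() {
	close(e.updates)
	e.wg.Wait()
}

func main() {
	c := newCollector()
	e := newExchange(c)
	e.updates <- update{types: []string{"success"}}
	e.updates <- update{types: []string{"failure", "fallback-success"}}
	e.close()
	fmt.Println(c.counts)
}
```

Decoupling producers from collectors through a buffered channel keeps metric bookkeeping off the request path.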

Flow control

hystrix-go uses tokens for flow control: a request that obtains a token may proceed, and it must return the token once it has finished.
The executorPool structure is the concrete implementation of flow control in hystrix-go. Its Max field is the maximum number of concurrent executions.

type executorPool struct {
    Name    string
    Metrics *poolMetrics // reports execution count metrics
    Max     int // maximum concurrency
    Tickets chan *struct{} // the tokens
}

Metrics are reported here as well. A separate set of methods counts executions, such as the total number of executions and the maximum concurrency. Again a class diagram represents it:

Reporting execution counts works the same way as reporting status events: data is passed through a channel. Both the reporting and the return of the token happen in the Return method:

func (p *executorPool) Return(ticket *struct{}) {
    if ticket == nil {
        return
    }

    p.Metrics.Updates <- poolMetricsUpdate{
        activeCount: p.ActiveCount(),
    }
    p.Tickets <- ticket
}

Two main logical steps:

  • Report the number of tokens currently in use (the active count)
  • Return token
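
A token pool built on a buffered channel can be sketched in a few lines. This is a minimal illustration of the technique; pool, tryAcquire, release and activeCount are hypothetical names, and metric reporting is omitted.

```go
package main

import "fmt"

// pool hands out at most max tokens at a time.
type pool struct {
	max     int
	tickets chan *struct{}
}

// newPool pre-fills the channel with max tokens.
func newPool(max int) *pool {
	p := &pool{max: max, tickets: make(chan *struct{}, max)}
	for i := 0; i < max; i++ {
		p.tickets <- &struct{}{}
	}
	return p
}

// tryAcquire returns a token, or nil when the pool is exhausted.
func (p *pool) tryAcquire() *struct{} {
	select {
	case t := <-p.tickets:
		return t
	default:
		return nil
	}
}

// release puts a token back; releasing nil is a no-op.
func (p *pool) release(t *struct{}) {
	if t == nil {
		return
	}
	p.tickets <- t
}

// activeCount is the number of tokens currently handed out.
func (p *pool) activeCount() int {
	return p.max - len(p.tickets)
}

func main() {
	p := newPool(2)
	a, b := p.tryAcquire(), p.tryAcquire()
	fmt.Println(p.tryAcquire() == nil, p.activeCount()) // pool exhausted, 2 in use
	p.release(a)
	p.release(b)
	fmt.Println(p.activeCount())
}
```

The non-blocking select is what turns "no token available" into an immediate rejection instead of a queue.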

Circuit breaker

Finally, we analyze one of the circuit breaker's most important methods: AllowRequest . Whether a command may execute is decided by this method. Let's look at its main logic:

func (circuit *CircuitBreaker) AllowRequest() bool {
    return !circuit.IsOpen() || circuit.allowSingleTest()
}

Internally it calls two methods, IsOpen() and allowSingleTest() :

  • IsOpen()
func (circuit *CircuitBreaker) IsOpen() bool {
    circuit.mutex.RLock()
    o := circuit.forceOpen || circuit.open
    circuit.mutex.RUnlock()
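    // The circuit is already open
    if o {
        return true
    }
    // If the number of requests in the last 10s is below RequestVolumeThreshold, the circuit stays closed
    if uint64(circuit.metrics.Requests().Sum(time.Now())) < getSettings(circuit.Name).RequestVolumeThreshold {
        return false
    }
    // The request volume has reached the threshold; if the error rate has also reached the configured percentage, open the circuit
    if !circuit.metrics.IsHealthy(time.Now()) {
        // Open the circuit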
        circuit.setOpen()
        return true
    }

    return false
}

  • allowSingleTest()

Why does this method exist? Recall the SleepWindow rule: once the circuit has been open for SleepWindow , a single test request is let through to check whether the service has recovered. That is what this method does:

func (circuit *CircuitBreaker) allowSingleTest() bool {
    circuit.mutex.RLock()
    defer circuit.mutex.RUnlock()
    
    // Current timestamp
    now := time.Now().UnixNano()
    openedOrLastTestedTime := atomic.LoadInt64(&circuit.openedOrLastTestedTime)
    // The circuit is open and the current time is past (the time it was opened or last tested + SleepWindow)
    if circuit.open && now > openedOrLastTestedTime+getSettings(circuit.Name).SleepWindow.Nanoseconds() {
        // Update openedOrLastTestedTime
        swapped := atomic.CompareAndSwapInt64(&circuit.openedOrLastTestedTime, openedOrLastTestedTime, now)
        if swapped {
            log.Printf("hystrix-go: allowing single test to possibly close circuit %v", circuit.Name)
        }
        return swapped
    }

    return false
}

So far we have only seen where the circuit is opened, not where it is closed. That logic lives in ReportEvent , the method that reports status metrics. Finally, let's look at its implementation:

func (circuit *CircuitBreaker) ReportEvent(eventTypes []string, start time.Time, runDuration time.Duration) error {
    if len(eventTypes) == 0 {
        return fmt.Errorf("no event types sent for metrics")
    }
    
    circuit.mutex.RLock()
    o := circuit.open
    circuit.mutex.RUnlock()
    // If the reported event is success and the circuit is currently open, the downstream service has recovered and the circuit can be closed
    if eventTypes[0] == "success" && o {
        circuit.setClose()
    }

    var concurrencyInUse float64
    if circuit.executorPool.Max > 0 {
        concurrencyInUse = float64(circuit.executorPool.ActiveCount()) / float64(circuit.executorPool.Max)
    }

    select {
    // Report the status metrics; this feeds the Monitor described earlier
    case circuit.metrics.Updates <- &commandExecution{
        Types:            eventTypes,
        Start:            start,
        RunDuration:      runDuration,
        ConcurrencyInUse: concurrencyInUse,
    }:
    default:
        return CircuitError{Message: fmt.Sprintf("metrics channel (%v) is at capacity", circuit.Name)}
    }

    return nil
}

Visualize the reported information of hystrix

From the analysis above, we know that hystrix-go reports status events and execution count events. How can we view these metrics?

The designers anticipated this need and built a dashboard for viewing hystrix metrics. To use it, add the following code when the service starts:

hystrixStreamHandler := hystrix.NewStreamHandler()
hystrixStreamHandler.Start()
go http.ListenAndServe(net.JoinHostPort("", "81"), hystrixStreamHandler)

Then open http://127.0.0.1:81/hystrix-dashboard in a browser to observe the metrics.

Summary

That brings us to the end. Implementing a circuit breaker mechanism is far from simple, and many factors have to be considered. In a microservice architecture in particular, a circuit breaker is essential. It needs to be implemented at the framework level, and it also has to be applied according to the specific business scenario; both deserve careful thought. The circuit breaker framework introduced in this article is well implemented, and its design ideas are worth learning from.

The code in this article has been uploaded to GitHub: https://github.com/asong2020/Golang_Dream/tree/master/code_demo/hystrix_demo (stars are welcome).

You are welcome to follow the official account: [Golang DreamWorks]


asong