Original link: # The fuse framework under the
background
As microservice architecture has spread like wildfire, a number of concepts have come along with it. Two phrases are inseparable from any discussion of microservices: high cohesion and low coupling. Achieving them is the ultimate goal of microservice architecture design. In a microservice architecture, each microservice implements a single business function and can evolve independently. An application may consist of multiple microservices, which exchange data through remote calls. Under such an architecture, dependencies like the following form:
Microservice A calls microservices C and D, microservice B relies on microservices B and E, and microservice D depends on service F. This is just a simple example; the dependencies between services in real business systems are far more complicated. If one microservice on the call chain responds too slowly or becomes unavailable, calls from its upstream services (named according to the call relationship) occupy more and more system resources until the system crashes. This is the avalanche effect of microservices.
To counter the avalanche effect, a circuit breaker (fuse) mechanism is used to protect the microservice call chain. Everyone should be familiar with the idea: the fuse in an electrical circuit is exactly such a mechanism. So what is a circuit breaker in microservices?
When a microservice on the chain is unavailable or responds too slowly, the service is degraded: calls to that node are short-circuited and an error response is returned quickly. Once calls to that node are detected to respond normally again, the call chain is restored.
In this article, we introduce an open source fuse framework: hystrix-go.
Circuit breaker framework (hystrix-go)
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services, and third-party libraries, stop cascading failures, and enable resilience in complex distributed systems where failure is inevitable. hystrix-go aims to let Go programmers easily build applications with execution semantics similar to the Java-based Hystrix library. This article starts from how hystrix-go is used and then analyzes its source code.
Quick install
go get -u github.com/afex/hystrix-go/hystrix
Quick start
hystrix-go is easy to use out of the box. Using it takes two main steps:
- Configure the circuit breaker; otherwise the default configuration is used. The methods that can be called are:
func Configure(cmds map[string]CommandConfig)
func ConfigureCommand(name string, config CommandConfig)
Configure calls ConfigureCommand internally. The two differ only in how the parameters are passed, so choose whichever suits your code style.
- Define the application logic that depends on the external system (runFunc) and the logic executed while the service is interrupted (fallbackFunc). The methods that can be called are:
func Go(name string, run runFunc, fallback fallbackFunc) // calls GoC internally
func GoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC)
func Do(name string, run runFunc, fallback fallbackFunc) // calls DoC internally
func DoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC) // calls GoC internally and waits for the asynchronous result
The difference between Go and Do is that Go is asynchronous and Do is synchronous. Do calls DoC, which handles the asynchronous process, and all of them eventually call GoC. We will analyze this later.
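The relationship between the synchronous and asynchronous entry points can be illustrated without the library. The following is a minimal sketch using only the standard library; goAsync and doSync are hypothetical stand-ins for Go and Do, not the hystrix-go implementation, and errBoom is an illustrative error value.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoom is an illustrative error used by the demo.
var errBoom = errors.New("boom")

// goAsync is a hypothetical stand-in for an asynchronous entry point: it runs
// the function in a goroutine and returns a channel that only receives a value
// when the function fails.
func goAsync(run func() error) chan error {
	errChan := make(chan error, 1)
	go func() {
		if err := run(); err != nil {
			errChan <- err
		}
	}()
	return errChan
}

// doSync builds a synchronous call on top of goAsync. Because the error
// channel stays silent on success, the run function is wrapped so that a
// successful run signals on a separate done channel.
func doSync(run func() error) error {
	done := make(chan struct{}, 1)
	wrapped := func() error {
		if err := run(); err != nil {
			return err
		}
		done <- struct{}{}
		return nil
	}
	errChan := goAsync(wrapped)
	select {
	case <-done:
		return nil
	case err := <-errChan:
		return err
	}
}

func main() {
	fmt.Println(doSync(func() error { return nil }))     // <nil>
	fmt.Println(doSync(func() error { return errBoom })) // boom
}
```

The design point is that the synchronous variant adds nothing but waiting: all the real work stays in the asynchronous path.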
For example, we add an endpoint-level circuit breaker middleware to a Gin application:
// The code has been uploaded to GitHub; see the address at the end of the article.
var CircuitBreakerName = "api_%s_circuit_breaker"

func CircuitBreakerWrapper(ctx *gin.Context) {
	name := fmt.Sprintf(CircuitBreakerName, ctx.Request.URL)
	hystrix.Do(name, func() error {
		ctx.Next()
		code := ctx.Writer.Status()
		if code != http.StatusOK {
			return fmt.Errorf("status code %d", code)
		}
		return nil
	}, func(err error) error {
		if err != nil {
			// Report to monitoring (not implemented).
			_, _ = io.WriteString(f, fmt.Sprintf("circuitBreaker and err is %s\n", err.Error())) // append to the log file
			fmt.Printf("circuitBreaker and err is %s\n", err.Error())
			// Return the circuit breaker error to the client.
			ctx.JSON(http.StatusServiceUnavailable, gin.H{
				"msg": err.Error(),
			})
		}
		return nil
	})
}
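func init() {
	// Timeout and SleepWindow are plain ints in milliseconds, not time.Duration values.
	// Caveat: settings are keyed by the exact command name. CircuitBreakerWrapper runs
	// commands under the formatted name, so a registration under the raw template
	// does not apply to them; they fall back to the defaults.
	hystrix.ConfigureCommand(CircuitBreakerName, hystrix.CommandConfig{
		Timeout:                3000, // command timeout: 3s, given in milliseconds
		MaxConcurrentRequests:  10,   // maximum concurrency of the command
		RequestVolumeThreshold: 100,  // requests in the 10s statistics window before the error rate is evaluated
		SleepWindow:            2000, // after the circuit opens, wait 2s (in milliseconds) before probing the service again
		ErrorPercentThreshold:  20,   // open the circuit once requests >= RequestVolumeThreshold and the error rate reaches this percentage
	})
	if checkFileIsExist(filename) { // the file exists
		f, errfile = os.OpenFile(filename, os.O_APPEND|os.O_WRONLY, 0666) // open for appending
	} else {
		f, errfile = os.Create(filename) // create the file
	}
}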
func main() {
defer f.Close()
hystrixStreamHandler := hystrix.NewStreamHandler()
hystrixStreamHandler.Start()
go http.ListenAndServe(net.JoinHostPort("", "81"), hystrixStreamHandler)
r := gin.Default()
r.GET("/api/ping/baidu", func(c *gin.Context) {
_, err := http.Get("https://www.baidu.com")
if err != nil {
c.JSON(http.StatusInternalServerError, gin.H{"msg": err.Error()})
return
}
c.JSON(http.StatusOK, gin.H{"msg": "success"})
}, CircuitBreakerWrapper)
r.Run() // listen and serve on 0.0.0.0:8080 (for windows "localhost:8080")
}
func checkFileIsExist(filename string) bool {
if _, err := os.Stat(filename); os.IsNotExist(err) {
return false
}
return true
}
Command: wrk -t100 -c100 -d1s http://127.0.0.1:8080/api/ping/baidu
operation result:
circuitBreaker and err is status code 500
circuitBreaker and err is status code 500
.....
circuitBreaker and err is hystrix: max concurrency
circuitBreaker and err is hystrix: max concurrency
.....
circuitBreaker and err is hystrix: circuit open
circuitBreaker and err is hystrix: circuit open
.....
Analysis of the errors:
- circuitBreaker and err is status code 500: we had turned off the network, so the request got no response.
- circuitBreaker and err is hystrix: max concurrency: we set the maximum concurrency MaxConcurrentRequests to 10, while our stress testing tool used 100 concurrent connections, so this rejection is triggered.
- circuitBreaker and err is hystrix: circuit open: we set the request threshold for opening the circuit, RequestVolumeThreshold, to 100, so once the number of requests within 10s exceeds 100, the circuit breaker is triggered.
A brief walkthrough of the example above:
- Add an endpoint-level circuit breaker middleware
- Initialize the circuit breaker configuration
- Start the dashboard to visualize the information hystrix reports; open http://localhost:81 in a browser and you can see the following results:
hystrix-go process analysis
I originally wanted to analyze the source code in full, but there is a lot of it, so instead I walk through the process and look at some of the core code along the way.
Configure circuit breaker rules
A circuit breaker needs rules. Two methods can be called to configure them, and both end up in ConfigureCommand. There is no special logic here. If we configure nothing, the default rules are used:
var (
// DefaultTimeout is how long to wait for command to complete, in milliseconds
DefaultTimeout = 1000
// DefaultMaxConcurrent is how many commands of the same type can run at the same time
DefaultMaxConcurrent = 10
// DefaultVolumeThreshold is the minimum number of requests needed before a circuit can be tripped due to health
DefaultVolumeThreshold = 20
// DefaultSleepWindow is how long, in milliseconds, to wait after a circuit opens before testing for recovery
DefaultSleepWindow = 5000
// DefaultErrorPercentThreshold causes circuits to open once the rolling measure of errors exceeds this percent of requests
DefaultErrorPercentThreshold = 50
// DefaultLogger is the default logger that will be used in the Hystrix package. By default prints nothing.
DefaultLogger = NoopLogger{}
)
The configuration rules are as follows:
- Timeout: the timeout for executing a command, in ms; the default is 1000ms.
- MaxConcurrentRequests: the maximum concurrency of a command; the default is 10.
- SleepWindow: used after the circuit opens; it controls how long to wait before testing whether the service is available again. The default is 5000ms.
- RequestVolumeThreshold: one of the conditions for opening the circuit. Requests are counted over a 10s window (hard-coded); once this number is reached, the error rate decides whether to open the circuit. The default is 20.
- ErrorPercentThreshold: one of the conditions for opening the circuit. When the number of requests is greater than or equal to RequestVolumeThreshold and the error rate reaches this percentage, the circuit opens. The default is 50.
These rules are stored in a map keyed by the command name.
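The idea of a per-name settings map with default fallback can be sketched as follows. This is an illustration using only the standard library, not the hystrix-go source: commandConfig, settings, configure, and lookup are hypothetical names, and only two of the five rules are modeled.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// commandConfig mirrors the idea of a user-facing config whose timeout is a
// plain int in milliseconds (hypothetical type for this sketch).
type commandConfig struct {
	TimeoutMs     int
	MaxConcurrent int
}

// settings is the resolved form, with the timeout converted to a Duration.
type settings struct {
	Timeout       time.Duration
	MaxConcurrent int
}

const (
	defaultTimeoutMs     = 1000
	defaultMaxConcurrent = 10
)

var (
	mu       sync.RWMutex
	registry = map[string]*settings{} // keyed by command name
)

// configure stores settings for one command name; zero values mean "use the default".
func configure(name string, cfg commandConfig) *settings {
	timeoutMs := defaultTimeoutMs
	if cfg.TimeoutMs != 0 {
		timeoutMs = cfg.TimeoutMs
	}
	maxConcurrent := defaultMaxConcurrent
	if cfg.MaxConcurrent != 0 {
		maxConcurrent = cfg.MaxConcurrent
	}
	s := &settings{
		Timeout:       time.Duration(timeoutMs) * time.Millisecond,
		MaxConcurrent: maxConcurrent,
	}
	mu.Lock()
	registry[name] = s
	mu.Unlock()
	return s
}

// lookup returns the settings for a name, registering defaults on first use.
func lookup(name string) *settings {
	mu.RLock()
	s, ok := registry[name]
	mu.RUnlock()
	if !ok {
		s = configure(name, commandConfig{})
	}
	return s
}

func main() {
	configure("slow_api", commandConfig{TimeoutMs: 3000})
	fmt.Println(lookup("slow_api").Timeout, lookup("slow_api").MaxConcurrent) // 3s 10
	fmt.Println(lookup("never_configured").Timeout)                          // 1s
}
```

Because the map is keyed by the exact name, a command run under a name that was never configured silently gets the defaults.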
Execute command
There are four main methods for executing a command:
func Go(name string, run runFunc, fallback fallbackFunc)
func GoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC)
func Do(name string, run runFunc, fallback fallbackFunc)
func DoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC)
Do calls DoC internally and Go calls GoC internally. DoC in turn ends up calling GoC, but adds the synchronization logic itself:
func DoC(ctx context.Context, name string, run runFuncC, fallback fallbackFuncC) error {
// ... some wrapper code omitted
var errChan chan error
if fallback == nil {
errChan = GoC(ctx, name, r, nil)
} else {
errChan = GoC(ctx, name, r, f)
}
select {
case <-done:
return nil
case err := <-errChan:
return err
}
}
Because they all end up calling GoC, we analyze the GoC method. The code is fairly long, so we go through it piece by piece.
Create the command object
cmd := &command{
run: run,
fallback: fallback,
start: time.Now(),
errChan: make(chan error, 1),
finished: make(chan bool, 1),
}
// Get the circuit breaker
circuit, _, err := GetCircuit(name)
if err != nil {
cmd.errChan <- err
return cmd.errChan
}
The data structure of command:
type command struct {
sync.Mutex
ticket *struct{}
start time.Time
errChan chan error
finished chan bool
circuit *CircuitBreaker
run runFuncC
fallback fallbackFuncC
runDuration time.Duration
events []string
}
Field descriptions:
- ticket: a token used to enforce the maximum concurrency
- start: the time the command started executing
- errChan: receives the command's execution error
- finished: marks the end of the command's execution; used for goroutine synchronization
- circuit: the circuit breaker this command belongs to
- run: the application logic
- fallback: the function executed when the application logic fails
- runDuration: how long the command's run took
- events: event type information, such as success for a successful execution, or timeout, context_canceled, etc. for failures
The key part of the code above is the GetCircuit method. Its purpose is to obtain the circuit breaker, loading it lazily: if none exists for the name, one is created. The structure of the circuit breaker is as follows:
type CircuitBreaker struct {
Name string
open bool
forceOpen bool
mutex *sync.RWMutex
openedOrLastTestedTime int64
executorPool *executorPool
metrics *metricExchange
}
Field descriptions:
- Name: the name of the circuit breaker, which is the name of the command that created it
- open: whether the circuit is open
- forceOpen: manually forces the circuit open or closed; used in unit tests
- mutex: a read-write lock for concurrency safety
- openedOrLastTestedTime: the time the circuit was last opened or last tested, needed because a recovery attempt is made after SleepWindow
- executorPool: used for flow control; maximum concurrency is enforced through it, and every request must obtain a token
- metrics: used to report execution status events; through it, status information is stored in the per-dimension (success, failure, timeout...) data collections of the circuit breaker
The implementation of executorPool and metrics is analyzed separately later.
Define the methods and variables related to the token
Because maximum concurrency is one of our limits, tokens are used for flow control: every request must obtain a token and return it after use. Let's look at this code first:
ticketCond := sync.NewCond(cmd)
ticketChecked := false
// When the caller extracts error from returned errChan, it's assumed that
// the ticket's been returned to executorPool. Therefore, returnTicket() can
// not run after cmd.errorWithFallback().
returnTicket := func() {
cmd.Lock()
// Avoid releasing before a ticket is acquired.
for !ticketChecked {
ticketCond.Wait()
}
cmd.circuit.executorPool.Return(cmd.ticket)
cmd.Unlock()
}
sync.NewCond creates a condition variable that signals when the token may be returned.
Then the returnTicket closure is defined to return the token: it waits until the ticket check has happened and then calls executorPool.Return.
Define the method for reporting execution events
As mentioned earlier, the circuit breaker reports execution status events and stores the status information in its per-dimension (success, failure, timeout...) data collections. So a reporting method is defined:
reportAllEvent := func() {
err := cmd.circuit.ReportEvent(cmd.events, cmd.start, cmd.runDuration)
if err != nil {
log.Printf(err.Error())
}
}
Start goroutine one: execute the application logic (runFunc)
The main purpose of goroutine one is to execute the application logic:
go func() {
	defer func() { cmd.finished <- true }() // signal goroutine two that the command in goroutine one has finished
	// The circuit opens when the recent request count exceeds the threshold and the error rate is high.
	// While the circuit is open, requests are rejected immediately and the ticket is returned;
	// once health is judged to have recovered, the circuit allows new traffic.
	if !cmd.circuit.AllowRequest() {
		cmd.Lock()
		// It's safe for another goroutine to go ahead releasing a nil ticket.
		ticketChecked = true
		ticketCond.Signal() // signal that the ticket may be released
		cmd.Unlock()
		// sync.Once guarantees this runs only once.
		returnOnce.Do(func() {
			// return the ticket
			returnTicket()
			// run the fallback logic
			cmd.errorWithFallback(ctx, ErrCircuitOpen)
			// report the status events
			reportAllEvent()
		})
		return
	}
	// concurrency control
	cmd.Lock()
	select {
	// a ticket was acquired
	case cmd.ticket = <-circuit.executorPool.Tickets:
		// signal that the ticket may be released
		ticketChecked = true
		ticketCond.Signal()
		cmd.Unlock()
	default:
		// no ticket available: maximum concurrency reached, so go straight to the fallback logic
		ticketChecked = true
		ticketCond.Signal()
		cmd.Unlock()
		returnOnce.Do(func() {
			returnTicket()
			cmd.errorWithFallback(ctx, ErrMaxConcurrency)
			reportAllEvent()
		})
		return
	}
	// run the application logic
	runStart := time.Now()
	runErr := run(ctx)
	returnOnce.Do(func() {
		defer reportAllEvent() // report the status events
		// record how long the application logic ran
		cmd.runDuration = time.Since(runStart)
		// return the ticket
		returnTicket()
		// if the application logic failed, run the fallback function
		if runErr != nil {
			cmd.errorWithFallback(ctx, runErr)
			return
		}
		cmd.reportEvent("success")
	})
}()
To summarize this goroutine:
- Check whether the circuit is open; if it is, the request is short-circuited and nothing further is executed
- Acquire a ticket; if none is available, go straight to the fallback
- Run the application logic
Start goroutine two: synchronize with goroutine one and listen for errors
Look at the code first:
go func() {
	// a timer enforces the timeout; the duration is the configured Timeout, 1000ms by default
	timer := time.NewTimer(getSettings(name).Timeout)
	defer timer.Stop()
	select {
	// synchronize with goroutine one
	case <-cmd.finished:
		// returnOnce has been executed in another goroutine
	// the context was cancelled
	case <-ctx.Done():
		returnOnce.Do(func() {
			returnTicket()
			cmd.errorWithFallback(ctx, ctx.Err())
			reportAllEvent()
		})
		return
	// the command timed out
	case <-timer.C:
		returnOnce.Do(func() {
			returnTicket()
			cmd.errorWithFallback(ctx, ErrTimeout)
			reportAllEvent()
		})
		return
	}
}()
The logic of this goroutine is straightforward: its purpose is to watch for cancellation and timeout of the business execution.
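The interplay of the two goroutines can be reduced to a small standalone example. The sketch below assumes nothing from hystrix-go: runWithTimeout, errTimeout, demoFast, and demoSlow are hypothetical names, and a sync.Once decides which of the two goroutines gets to deliver the single result.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// errTimeout is an illustrative sentinel error for this sketch.
var errTimeout = errors.New("timeout")

// runWithTimeout races a worker goroutine against a watcher goroutine.
// sync.Once ensures exactly one of them publishes the outcome.
func runWithTimeout(ctx context.Context, timeout time.Duration, run func() error) error {
	var once sync.Once
	result := make(chan error, 1)
	finished := make(chan struct{}, 1)

	// Goroutine one: run the application logic.
	go func() {
		defer func() { finished <- struct{}{} }()
		err := run()
		once.Do(func() { result <- err })
	}()

	// Goroutine two: watch for completion, cancellation, or timeout.
	go func() {
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		select {
		case <-finished:
			// the worker already published its result
		case <-ctx.Done():
			once.Do(func() { result <- ctx.Err() })
		case <-timer.C:
			once.Do(func() { result <- errTimeout })
		}
	}()

	return <-result
}

// demoFast finishes well inside the timeout.
func demoFast() error {
	return runWithTimeout(context.Background(), 200*time.Millisecond, func() error { return nil })
}

// demoSlow sleeps past the timeout, so the watcher wins the race.
func demoSlow() error {
	return runWithTimeout(context.Background(), 20*time.Millisecond, func() error {
		time.Sleep(200 * time.Millisecond)
		return nil
	})
}

func main() {
	fmt.Println(demoFast()) // <nil>
	fmt.Println(demoSlow()) // timeout
}
```

Note that a timed-out worker keeps running in the background; the caller simply stops waiting for it.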
Draw a picture to summarize the command execution process
Everything above was analyzed through code and still looks a bit messy, so a picture summarizes it:
That covers the whole process; next we analyze some core points.
Report status events
hystrix-go sets up a default metric collector for each Command, which holds all the state of the circuit breaker, including the number of calls, failures, rejections, and so on. The metrics are stored in the following structure:
type DefaultMetricCollector struct {
mutex *sync.RWMutex
numRequests *rolling.Number
errors *rolling.Number
successes *rolling.Number
failures *rolling.Number
rejects *rolling.Number
shortCircuits *rolling.Number
timeouts *rolling.Number
contextCanceled *rolling.Number
contextDeadlineExceeded *rolling.Number
fallbackSuccesses *rolling.Number
fallbackFailures *rolling.Number
totalDuration *rolling.Timing
runDuration *rolling.Timing
}
The rolling.Number structure stores the count metrics, and rolling.Timing stores the timing metrics.
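A rolling counter of this kind can be sketched with one bucket per second. The code below is a simplified illustration rather than the rolling package itself: rollingNumber, the 10-second window constant, and demoRolling are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// rollingNumber keeps one bucket per Unix second and only counts buckets
// that fall inside the most recent window.
type rollingNumber struct {
	mu      sync.RWMutex
	buckets map[int64]float64 // unix second -> count
	window  int64             // window length in seconds
}

func newRollingNumber() *rollingNumber {
	return &rollingNumber{buckets: map[int64]float64{}, window: 10}
}

// Increment adds n to the bucket for now and drops expired buckets.
func (r *rollingNumber) Increment(now time.Time, n float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	sec := now.Unix()
	r.buckets[sec] += n
	for ts := range r.buckets {
		if ts <= sec-r.window {
			delete(r.buckets, ts)
		}
	}
}

// Sum adds up every bucket that is still inside the window ending at now.
func (r *rollingNumber) Sum(now time.Time) float64 {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var sum float64
	for ts, v := range r.buckets {
		if ts > now.Unix()-r.window {
			sum += v
		}
	}
	return sum
}

// demoRolling uses fixed timestamps so the result is deterministic.
func demoRolling() [3]float64 {
	base := time.Unix(1000, 0)
	r := newRollingNumber()
	r.Increment(base, 1)
	r.Increment(base.Add(5*time.Second), 2)
	return [3]float64{
		r.Sum(base.Add(5 * time.Second)),  // both buckets are inside the window
		r.Sum(base.Add(10 * time.Second)), // the first bucket has expired
		r.Sum(base.Add(20 * time.Second)), // everything has expired
	}
}

func main() {
	fmt.Println(demoRolling()) // [3 2 0]
}
```

Passing the timestamp in explicitly, instead of calling time.Now inside, keeps the counter easy to test.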
The final monitoring report goes through metricExchange, whose data structure is as follows:
type metricExchange struct {
Name string
Updates chan *commandExecution
Mutex *sync.RWMutex
metricCollectors []metricCollector.MetricCollector
}
The structure of the reported command information:
type commandExecution struct {
	Types            []string      `json:"types"`             // event types, such as success, failure...
	Start            time.Time     `json:"start_time"`        // command start time
	RunDuration      time.Duration `json:"run_duration"`      // how long the command's run took
	ConcurrencyInUse float64       `json:"concurrency_inuse"` // fraction of the executor pool in use
}
That is a lot to take in, so a class diagram helps show how the pieces relate:
We can see that metricExchange provides a Monitor method. Its main logic is to listen for status events and write them to the metric collectors, so the reporting flow is: ReportEvent sends a commandExecution to the Updates channel, and Monitor receives it and updates each collector.
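The consumer side of this flow can be sketched as a goroutine draining a channel. This is a reduced illustration, not the metricExchange code: exchange, newExchange, monitor, stop, and demoMonitor are hypothetical names, and a plain map stands in for the metric collectors.

```go
package main

import "fmt"

// exchange is a hypothetical, reduced stand-in for a metric exchange: events
// arrive on a channel and a single monitor goroutine updates the counters.
type exchange struct {
	updates chan []string
	done    chan struct{}
	counts  map[string]int // owned by the monitor goroutine until done is closed
}

func newExchange() *exchange {
	e := &exchange{
		updates: make(chan []string, 16),
		done:    make(chan struct{}),
		counts:  map[string]int{},
	}
	go e.monitor()
	return e
}

// monitor consumes updates until the channel is closed.
func (e *exchange) monitor() {
	defer close(e.done)
	for types := range e.updates {
		for _, t := range types {
			e.counts[t]++
		}
	}
}

// stop closes the channel, waits for the monitor to drain it, and returns the counts.
func (e *exchange) stop() map[string]int {
	close(e.updates)
	<-e.done
	return e.counts
}

// demoMonitor reports three command executions and returns the totals.
func demoMonitor() map[string]int {
	e := newExchange()
	e.updates <- []string{"success"}
	e.updates <- []string{"failure", "fallback-success"}
	e.updates <- []string{"success"}
	return e.stop()
}

func main() {
	counts := demoMonitor()
	fmt.Println(counts["success"], counts["failure"]) // 2 1
}
```

Funnelling all writes through one goroutine means the counters need no lock of their own in this sketch.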
Flow control
hystrix-go uses a token scheme for flow control: a request may proceed only if it obtains a token, and it must return the token when it finishes.
The executorPool structure is the concrete implementation of flow control in hystrix-go. Its Max field is the maximum number of concurrent requests.
type executorPool struct {
	Name    string
	Metrics *poolMetrics   // reports execution count metrics
	Max     int            // maximum concurrency
	Tickets chan *struct{} // the tokens
}
There is also a metrics reporter here: a separate set of methods counts executions, such as the total number of executions and the maximum concurrency. A class diagram represents it:
The logic for reporting execution counts is the same as for status events: data is passed through a channel. Both the reporting and the return of the token happen in the Return method:
func (p *executorPool) Return(ticket *struct{}) {
if ticket == nil {
return
}
p.Metrics.Updates <- poolMetricsUpdate{
activeCount: p.ActiveCount(),
}
p.Tickets <- ticket
}
Two main steps:
- Report the number of tokens currently in use
- Return the token
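A token pool of this shape is easy to build on a buffered channel. The following is a minimal sketch under the assumption that tokens are pre-filled at creation; pool, newPool, tryAcquire, release, and activeCount are illustrative names, not the executorPool API.

```go
package main

import "fmt"

// pool is a minimal token pool: a buffered channel pre-filled with max tickets.
type pool struct {
	max     int
	tickets chan *struct{}
}

func newPool(max int) *pool {
	p := &pool{max: max, tickets: make(chan *struct{}, max)}
	for i := 0; i < max; i++ {
		p.tickets <- &struct{}{}
	}
	return p
}

// tryAcquire takes a ticket without blocking; ok is false when the pool is
// empty, which corresponds to the maximum concurrency being reached.
func (p *pool) tryAcquire() (ticket *struct{}, ok bool) {
	select {
	case t := <-p.tickets:
		return t, true
	default:
		return nil, false
	}
}

// release puts a ticket back; a nil ticket (never acquired) is ignored.
func (p *pool) release(t *struct{}) {
	if t == nil {
		return
	}
	p.tickets <- t
}

// activeCount is the number of tickets currently handed out.
func (p *pool) activeCount() int {
	return p.max - len(p.tickets)
}

func main() {
	p := newPool(2)
	a, _ := p.tryAcquire()
	p.tryAcquire()
	_, ok := p.tryAcquire()
	fmt.Println(ok, p.activeCount()) // false 2
	p.release(a)
	fmt.Println(p.activeCount()) // 1
}
```

The non-blocking select is what turns "no ticket left" into an immediate rejection instead of a queue.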
Circuit breaker
Finally, we analyze one of the circuit breaker's most important methods: AllowRequest. Whenever a Command is executed, this method decides whether the command may run. Let's look at its main logic:
func (circuit *CircuitBreaker) AllowRequest() bool {
return !circuit.IsOpen() || circuit.allowSingleTest()
}
Internally it calls two methods, IsOpen() and allowSingleTest():
IsOpen()
func (circuit *CircuitBreaker) IsOpen() bool {
	circuit.mutex.RLock()
	o := circuit.forceOpen || circuit.open
	circuit.mutex.RUnlock()
	// the circuit is already open
	if o {
		return true
	}
	// if the number of requests in the last 10s is below RequestVolumeThreshold, there is no need to open the circuit
	if uint64(circuit.metrics.Requests().Sum(time.Now())) < getSettings(circuit.Name).RequestVolumeThreshold {
		return false
	}
	// the request count in the last 10s has reached the threshold; if the error rate also exceeds the configured value, open the circuit
	if !circuit.metrics.IsHealthy(time.Now()) {
		circuit.setOpen()
		return true
	}
	return false
}
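The two-stage decision in IsOpen (volume first, then error rate) can be worked through with concrete numbers. The sketch below is an illustration only: errorPercent and shouldOpen are hypothetical helpers, and the rounding and the "at or above the threshold" comparison are assumptions made for the example.

```go
package main

import "fmt"

// errorPercent returns the error rate as a whole percentage, rounded to the
// nearest integer (the rounding rule is an assumption of this sketch).
func errorPercent(requests, errs float64) int {
	if requests <= 0 {
		return 0
	}
	return int(errs/requests*100 + 0.5)
}

// shouldOpen applies the two conditions in order: enough requests in the
// window, then an error rate at or above the threshold.
func shouldOpen(requests, errs float64, volumeThreshold uint64, errorPercentThreshold int) bool {
	if uint64(requests) < volumeThreshold {
		return false // too little traffic to judge health
	}
	return errorPercent(requests, errs) >= errorPercentThreshold
}

func main() {
	// Defaults from the article: volume threshold 20, error threshold 50%.
	fmt.Println(shouldOpen(19, 19, 20, 50)) // false: every call failed, but below the volume threshold
	fmt.Println(shouldOpen(20, 10, 20, 50)) // true: 50% errors at the volume threshold
	// Values from the example configuration: volume 100, error threshold 20%.
	fmt.Println(shouldOpen(100, 19, 100, 20)) // false: 19% errors
	fmt.Println(shouldOpen(100, 20, 100, 20)) // true: 20% errors
}
```

The first case shows why the volume threshold exists: a handful of failures on a quiet endpoint should not trip the circuit.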
allowSingleTest()
First, why does this method exist? Recall the SleepWindow rule: once the circuit is open, after SleepWindow has elapsed a single request is let through to test whether the service has recovered. That is what this method does:
func (circuit *CircuitBreaker) allowSingleTest() bool {
	circuit.mutex.RLock()
	defer circuit.mutex.RUnlock()
	// current timestamp
	now := time.Now().UnixNano()
	openedOrLastTestedTime := atomic.LoadInt64(&circuit.openedOrLastTestedTime)
	// the circuit is open and the current time is later than (time the circuit was last opened or tested + SleepWindow)
	if circuit.open && now > openedOrLastTestedTime+getSettings(circuit.Name).SleepWindow.Nanoseconds() {
		// swap in the new openedOrLastTestedTime
		swapped := atomic.CompareAndSwapInt64(&circuit.openedOrLastTestedTime, openedOrLastTestedTime, now)
		if swapped {
			log.Printf("hystrix-go: allowing single test to possibly close circuit %v", circuit.Name)
		}
		return swapped
	}
	return false
}
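The compare-and-swap is what keeps the probe single: among concurrent callers that observe the same old timestamp, only one swap can succeed. A minimal standalone sketch of that property follows; probeGate, allow, and countAllowed are hypothetical names, and fixed timestamps are used so the outcome does not depend on the clock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// probeGate lets at most one caller through per sleep window.
type probeGate struct {
	lastTested  int64 // unix nanoseconds of the last open or test
	sleepWindow time.Duration
}

// allow returns true for exactly one caller once the window has elapsed.
func (g *probeGate) allow(now time.Time) bool {
	last := atomic.LoadInt64(&g.lastTested)
	if now.UnixNano() <= last+g.sleepWindow.Nanoseconds() {
		return false // still inside the sleep window
	}
	// Only the caller whose swap succeeds becomes the probe; the rest see
	// either a failed swap or the freshly updated timestamp.
	return atomic.CompareAndSwapInt64(&g.lastTested, last, now.UnixNano())
}

// countAllowed fires many concurrent callers after the window has elapsed
// and counts how many of them were let through.
func countAllowed(callers int) int {
	opened := time.Unix(100, 0)
	g := &probeGate{lastTested: opened.UnixNano(), sleepWindow: 5 * time.Second}
	now := opened.Add(6 * time.Second)

	var wg sync.WaitGroup
	var allowed int64
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if g.allow(now) {
				atomic.AddInt64(&allowed, 1)
			}
		}()
	}
	wg.Wait()
	return int(allowed)
}

func main() {
	g := &probeGate{lastTested: time.Unix(100, 0).UnixNano(), sleepWindow: 5 * time.Second}
	fmt.Println(g.allow(time.Unix(103, 0))) // false: only 3s have passed
	fmt.Println(countAllowed(50))           // 1
}
```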
So far we have only seen where the circuit is opened, not where it is closed. That logic lives in ReportEvent, the method that reports status metrics. Finally, let's look at the implementation of ReportEvent:
func (circuit *CircuitBreaker) ReportEvent(eventTypes []string, start time.Time, runDuration time.Duration) error {
if len(eventTypes) == 0 {
return fmt.Errorf("no event types sent for metrics")
}
circuit.mutex.RLock()
o := circuit.open
circuit.mutex.RUnlock()
// the reported event is success and the circuit is currently open: the downstream service has recovered, so the circuit can be closed
if eventTypes[0] == "success" && o {
circuit.setClose()
}
var concurrencyInUse float64
if circuit.executorPool.Max > 0 {
concurrencyInUse = float64(circuit.executorPool.ActiveCount()) / float64(circuit.executorPool.Max)
}
select {
// report the status metrics; this pairs with the Monitor method described above
case circuit.metrics.Updates <- &commandExecution{
Types: eventTypes,
Start: start,
RunDuration: runDuration,
ConcurrencyInUse: concurrencyInUse,
}:
default:
return CircuitError{Message: fmt.Sprintf("metrics channel (%v) is at capacity", circuit.Name)}
}
return nil
}
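The select with a default branch at the end of ReportEvent is a non-blocking send: if the metrics channel is full, the event is dropped and an error is returned instead of stalling the request. A small sketch of the same pattern, where tryReport and errAtCapacity are illustrative names:

```go
package main

import (
	"errors"
	"fmt"
)

// errAtCapacity is an illustrative error for a full channel.
var errAtCapacity = errors.New("metrics channel is at capacity")

// tryReport attempts to send an event without blocking. When the buffer is
// full, the default branch runs and the caller is told the event was dropped.
func tryReport(updates chan<- string, event string) error {
	select {
	case updates <- event:
		return nil
	default:
		return errAtCapacity
	}
}

func main() {
	updates := make(chan string, 1) // room for a single pending event
	fmt.Println(tryReport(updates, "success")) // <nil>
	fmt.Println(tryReport(updates, "failure")) // metrics channel is at capacity
}
```

The trade-off is deliberate: losing a metric sample is cheaper than making a business request wait on the monitoring path.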
Visualize the information hystrix reports
From the analysis above we know that hystrix-go reports status events and execution count events. How can we view these metrics?
The designers thought of this long ago and built a dashboard for viewing hystrix metrics. To use it, just add the following code when the service starts:
hystrixStreamHandler := hystrix.NewStreamHandler()
hystrixStreamHandler.Start()
go http.ListenAndServe(net.JoinHostPort("", "81"), hystrixStreamHandler)
Then open the browser: http://127.0.0.1:81/hystrix-dashboard, and make observations.
Summary
This brings the story to an end. Implementing a circuit breaker mechanism is not simple, and many factors have to be considered. In a microservice architecture in particular, a circuit breaker is essential. It is not enough to implement the mechanism at the framework level; circuit breakers also have to be applied according to the specific business scenario, which deserves careful thought. The circuit breaker framework introduced in this article is well implemented, and its design ideas are worth learning from.
The code in this article has been uploaded to GitHub: https://github.com/asong2020/Golang_Dream/tree/master/code_demo/hystrix_demo. Stars are welcome.