Analyzing how Hystrix works from its source code

1. What problem does Hystrix solve?

A complex distributed application has many dependencies, and each of them will inevitably fail at some point. If the application does not isolate its dependencies and limit the risk they introduce, a single failing dependency can easily bring down the whole application.

Take a common e-commerce scenario: the order service calls the inventory, commodity, credit (loyalty points), and payment services. When every dependency is healthy, the order service runs normally.

But when the credit service becomes abnormal and blocks for 30 seconds, some order service requests fail, and the worker threads handling them are blocked waiting on the credit service.

At peak traffic the problem gets worse. All order service requests block on the call to the credit service and every worker thread hangs, which exhausts machine resources and makes the order service unavailable. The failure then cascades and can take down the whole cluster. This is called the avalanche effect.

Therefore, a mechanism is needed so that the failure of a single service does not affect the availability of the whole cluster. Hystrix is a framework that implements this mechanism. Let's analyze how it works overall.

2. Overall mechanism

[Entry point] Hystrix execution starts from a HystrixCommand or HystrixObservableCommand object. In Spring applications, annotations and AOP are usually used to construct these objects, which keeps Hystrix from intruding on business code;

[Cache] Once the HystrixCommand object starts executing, it first checks whether the request cache is enabled. If it is enabled and the request hits the cache, the cached result is returned directly;

[Circuit breaker] If the circuit breaker is open, the request is short-circuited and goes straight to the fallback (downgrade) logic; if it is closed, execution continues to the isolation logic. The circuit breaker state is driven mainly by the failure rate within the statistics window: if the failure rate is too high, the circuit breaker opens automatically;

[Isolation] The user can configure thread pool isolation or semaphore isolation. If the thread pool (or the semaphore) is full, the request enters the fallback logic; otherwise execution continues, and in thread pool mode a pool thread performs the actual business call;

[Execute] The business call is actually executed. If it fails or throws an exception, the request enters the fallback logic; if it succeeds, the result is returned normally;

[Timeout] A delayed timer task detects whether the business call has timed out. If it has, the executing thread is cancelled and the request enters the fallback logic; if not, the result is returned normally. Both the thread pool and semaphore isolation strategies support a timeout configuration (although the semaphore strategy has a flaw here);

[Fallback] Once in the fallback logic, if the business code implements HystrixCommand.getFallback(), the fallback data is returned; if it does not, an exception is thrown;

[Statistics] The outcome of every business call (success, failure, timeout, and so on) is fed into the statistics module, and the resulting health statistics determine whether the circuit breaker opens or closes.
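
The order of these checks can be sketched in plain Java. This is only an illustration under stated assumptions: CommandPipelineSketch and its methods are hypothetical names, not Hystrix APIs; the cache is a simple map, the isolation step is a java.util.concurrent.Semaphore, and the timeout and statistics steps are left out entirely.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical, simplified pipeline; none of these names are Hystrix APIs.
class CommandPipelineSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Semaphore permits;
    private volatile boolean circuitOpen = false;

    CommandPipelineSketch(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Stand-in for the health statistics opening or closing the breaker.
    void setCircuitOpen(boolean open) {
        this.circuitOpen = open;
    }

    String execute(String cacheKey, Supplier<String> run, Supplier<String> fallback) {
        // [Cache] return a previously computed result if there is one
        String cached = cache.get(cacheKey);
        if (cached != null) {
            return cached;
        }
        // [Circuit breaker] an open breaker short-circuits to the fallback
        if (circuitOpen) {
            return fallback.get();
        }
        // [Isolation] reject when the concurrency limit is reached
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            // [Execute] run the business call and remember the result
            String result = run.get();
            cache.put(cacheKey, result);
            return result;
        } catch (RuntimeException e) {
            // [Fallback] a failed call degrades instead of propagating
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

Each early return corresponds to one of the bracketed steps: a cache hit skips everything else, and an open breaker or an exhausted permit count never reaches the business call.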

As the saying goes, the source code holds no secrets. Let's walk through the source code of the core features to see how Hystrix implements this overall mechanism.

3. Circuit breaker

Household circuits contain fuses. When a circuit is faulty, the current keeps rising, and the rising current may damage important or valuable components, burn out the circuit, or even cause a fire.

If a fuse is installed correctly, it melts and cuts off the current when the current rises abnormally to a certain level, which protects the circuit. The circuit breaker provided by Hystrix plays a similar role. When an application calls a service provider and, within a given window, the total number of requests exceeds the configured threshold while the error rate is too high, Hystrix trips the circuit breaker: subsequent requests are short-circuited and go directly into the fallback logic, executing the local fallback strategy.

Hystrix can also recover by itself. After the circuit breaker has been open for a certain period (the sleep window), it lets one request through and adjusts its state based on the result, so that it switches automatically between the closed, open, and half-open states.

[HystrixCircuitBreaker] boolean attemptExecution(): Called every time a HystrixCommand executes to decide whether execution may continue. If the circuit breaker is open and the sleep window has elapsed, the state is updated to half-open. The state change uses an atomic CAS operation, which ensures that only one request is actually sent to the provider; the state is then adjusted according to the result of that request.

public boolean attemptExecution() {
    // Is the circuit breaker forced open by configuration?
    if (properties.circuitBreakerForceOpen().get()) {
        return false;
    }
    // Is the circuit breaker forced closed by configuration?
    if (properties.circuitBreakerForceClosed().get()) {
        return true;
    }
    // Is the circuit breaker closed? (-1 means it is not open)
    if (circuitOpened.get() == -1) {
        return true;
    } else {
        // Has the sleep window elapsed?
        if (isAfterSleepWindow()) {
            // Move to half-open and let this one request through
            if (status.compareAndSet(Status.OPEN, Status.HALF_OPEN)) {
                return true;
            } else {
                return false;
            }
        } else {
            // Reject the request
            return false;
        }
    }
}

[HystrixCircuitBreaker] void markSuccess(): Called after a HystrixCommand executes successfully. If the circuit breaker is half-open, its state is updated to closed. This is the case where the circuit breaker was open, a single trial request was actually sent to the service provider and succeeded, so Hystrix automatically closes the circuit breaker.

public void markSuccess() {
    // Move the circuit breaker from half-open to closed
    if (status.compareAndSet(Status.HALF_OPEN, Status.CLOSED)) {
        // Reset the health statistics and re-subscribe to them
        metrics.resetStream();
        Subscription previousSubscription = activeSubscription.get();
        if (previousSubscription != null) {
            previousSubscription.unsubscribe();
        }
        Subscription newSubscription = subscribeToStream();
        activeSubscription.set(newSubscription);
        // Mark the circuit as not open
        circuitOpened.set(-1L);
    }
}

[HystrixCircuitBreaker] void markNonSuccess(): Called after a HystrixCommand fails to execute. If the circuit breaker is half-open, its state is updated back to open. This is the case where the circuit breaker was open, a single trial request was actually sent to the service provider and failed, so Hystrix keeps the circuit breaker open and uses the time of this request as the start of a new sleep window.

public void markNonSuccess() {
    // Move the circuit breaker from half-open back to open
    if (status.compareAndSet(Status.HALF_OPEN, Status.OPEN)) {
        // Record the failure time as the start of the new sleep window
        circuitOpened.set(System.currentTimeMillis());
    }
}
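
Taken together, attemptExecution, markSuccess, and markNonSuccess form a small state machine. Below is a minimal sketch of it, not the Hystrix class itself: BreakerStateSketch is a hypothetical name, the current time is passed in as a parameter so the behavior is deterministic, and trip() stands in for the health-statistics subscription that opens the breaker.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the closed / open / half-open transitions.
class BreakerStateSketch {
    enum Status { CLOSED, OPEN, HALF_OPEN }

    private final AtomicReference<Status> status = new AtomicReference<>(Status.CLOSED);
    private final AtomicLong circuitOpened = new AtomicLong(-1L); // -1 means not open
    private final long sleepWindowMs;

    BreakerStateSketch(long sleepWindowMs) {
        this.sleepWindowMs = sleepWindowMs;
    }

    // Stand-in for the health stream deciding that the error rate is too high.
    void trip(long nowMs) {
        if (status.compareAndSet(Status.CLOSED, Status.OPEN)) {
            circuitOpened.set(nowMs);
        }
    }

    boolean attemptExecution(long nowMs) {
        if (circuitOpened.get() == -1L) {
            return true; // closed: every request passes
        }
        if (nowMs > circuitOpened.get() + sleepWindowMs) {
            // Only the caller that wins this CAS gets to send the trial request.
            return status.compareAndSet(Status.OPEN, Status.HALF_OPEN);
        }
        return false; // still inside the sleep window
    }

    void markSuccess() {
        if (status.compareAndSet(Status.HALF_OPEN, Status.CLOSED)) {
            circuitOpened.set(-1L);
        }
    }

    void markNonSuccess(long nowMs) {
        if (status.compareAndSet(Status.HALF_OPEN, Status.OPEN)) {
            circuitOpened.set(nowMs); // restart the sleep window
        }
    }

    Status status() {
        return status.get();
    }
}
```

Because the OPEN to HALF_OPEN transition is a compare-and-set, a second caller arriving just after the sleep window finds the state already HALF_OPEN and is rejected, which is how a single trial request is guaranteed.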

[HystrixCircuitBreaker] Subscription subscribeToStream(): The circuit breaker subscribes to the health statistics. If the number of requests in the window reaches the configured volume threshold and the error rate reaches the configured error threshold, the circuit breaker state is automatically updated to open. Subsequent requests are then short-circuited: the service provider is no longer called and the requests go directly into the fallback logic.

private Subscription subscribeToStream() {
    // Subscribe to the health statistics stream
    return metrics.getHealthCountsStream()
            .observe()
            .subscribe(new Subscriber<HealthCounts>() {
                @Override
                public void onCompleted() {}
                @Override
                public void onError(Throwable e) {}
                @Override
                public void onNext(HealthCounts hc) {
                    // Has the total request count reached the configured volume threshold?
                    if (hc.getTotalRequests() < properties.circuitBreakerRequestVolumeThreshold().get()) {
                        // Not enough traffic yet: leave the circuit breaker state unchanged
                    } else {
                        // Has the error rate reached the configured error threshold?
                        if (hc.getErrorPercentage() < properties.circuitBreakerErrorThresholdPercentage().get()) {
                            // Error rate is acceptable: leave the circuit breaker state unchanged
                        } else {
                            // Error rate is too high: open the circuit breaker and reject later requests
                            if (status.compareAndSet(Status.CLOSED, Status.OPEN)) {
                                circuitOpened.set(System.currentTimeMillis());
                            }
                        }
                    }
                }
            });
}
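
The two nested checks in onNext reduce to a single decision. Here is a small sketch of that decision with made-up numbers; TripDecisionSketch and shouldTrip are hypothetical names, and computing the percentage by integer truncation is an assumption of this sketch.

```java
// Hypothetical helper mirroring the volume check followed by the error-rate check.
class TripDecisionSketch {
    static boolean shouldTrip(long totalRequests, long errorCount,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        // Too little traffic in the window: never trip, whatever the error rate.
        if (totalRequests < requestVolumeThreshold) {
            return false;
        }
        // Truncated integer percentage of failed requests in the window.
        int errorPercentage = (int) ((double) errorCount / totalRequests * 100);
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // 19 requests, all failed, volume threshold 20: not enough data to trip.
        System.out.println(shouldTrip(19, 19, 20, 50));
        // 20 requests, 10 failed, error threshold 50%: trips.
        System.out.println(shouldTrip(20, 10, 20, 50));
    }
}
```

The volume threshold comes first for a reason: with only a handful of requests in the window, one or two failures would otherwise produce a misleadingly high error rate.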

4. Resource isolation

Cargo ships are divided into separate compartments so that a leak or fire in one compartment does not spread and sink the whole ship. Hystrix adopts the same bulkhead pattern to isolate the service providers a system depends on: increased latency or failure in one provider does not bring down the whole system, and the concurrency of calls to each provider can be controlled. As shown in the figure below, the order service uses separate thread pools to call the downstream credit, inventory, and other services. When the credit service fails, only its thread pool fills up, and calls to the other services are unaffected. Hystrix supports two isolation modes: thread pool and semaphore.

4.1 Semaphore mode

Semaphore mode limits the execution concurrency for a single service provider. For example, let N be the number of in-flight requests under a single CommandKey. If N is less than maxConcurrentRequests, execution continues; if N is greater than or equal to maxConcurrentRequests, the request is rejected immediately and enters the fallback logic. Semaphore mode executes on the requesting thread itself, so there is no thread context switch and the overhead is small, but the timeout mechanism cannot cut a running call short.

[AbstractCommand] Observable<R> applyHystrixSemantics(final AbstractCommand<R> _cmd): Tries to acquire the semaphore. If it succeeds, the call to the service provider continues; if not, the request enters the fallback logic.

private Observable<R> applyHystrixSemantics(final AbstractCommand<R> _cmd) {
    executionHook.onStart(_cmd);
    // Does the circuit breaker allow this request?
    if (circuitBreaker.attemptExecution()) {
        // Get the execution semaphore
        final TryableSemaphore executionSemaphore = getExecutionSemaphore();
        final AtomicBoolean semaphoreHasBeenReleased = new AtomicBoolean(false);
        final Action0 singleSemaphoreRelease = new Action0() {
            @Override
            public void call() {
                if (semaphoreHasBeenReleased.compareAndSet(false, true)) {
                    executionSemaphore.release();
                }
            }
        };
        final Action1<Throwable> markExceptionThrown = new Action1<Throwable>() {
            @Override
            public void call(Throwable t) {
                eventNotifier.markEvent(HystrixEventType.EXCEPTION_THROWN, commandKey);
            }
        };
        // Try to acquire a permit
        if (executionSemaphore.tryAcquire()) {
            try {
                // Record when the business call starts
                executionResult = executionResult.setInvocationStartTime(System.currentTimeMillis());
                // Continue with the business call
                return executeCommandAndObserve(_cmd)
                        .doOnError(markExceptionThrown)
                        .doOnTerminate(singleSemaphoreRelease)
                        .doOnUnsubscribe(singleSemaphoreRelease);
            } catch (RuntimeException e) {
                return Observable.error(e);
            }
        } else {
            // Rejected by the semaphore: enter the fallback logic
            return handleSemaphoreRejectionViaFallback();
        }
    } else {
        // Rejected by the circuit breaker: short-circuit into the fallback logic
        return handleShortCircuitViaFallback();
    }
}

[AbstractCommand] TryableSemaphore getExecutionSemaphore(): Gets a semaphore instance. If the current isolation mode is semaphore, the semaphore is looked up by commandKey, and it is created and cached if it does not exist yet. If the current isolation mode is thread pool, the default semaphore TryableSemaphoreNoOp.DEFAULT is used, which lets every request through.

protected TryableSemaphore getExecutionSemaphore() {
    // Is the isolation strategy SEMAPHORE?
    if (properties.executionIsolationStrategy().get() == ExecutionIsolationStrategy.SEMAPHORE) {
        if (executionSemaphoreOverride == null) {
            // Look up the semaphore for this command key
            TryableSemaphore _s = executionSemaphorePerCircuit.get(commandKey.name());
            if (_s == null) {
                // Create the semaphore and cache it
                executionSemaphorePerCircuit.putIfAbsent(commandKey.name(), new TryableSemaphoreActual(properties.executionIsolationSemaphoreMaxConcurrentRequests()));
                // Return the cached semaphore
                return executionSemaphorePerCircuit.get(commandKey.name());
            } else {
                return _s;
            }
        } else {
            return executionSemaphoreOverride;
        }
    } else {
        // Return the no-op semaphore, which lets every request through
        return TryableSemaphoreNoOp.DEFAULT;
    }
}
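
The semaphore here never blocks the caller: it either grants a permit immediately or refuses. A minimal sketch of such a non-blocking counter is shown below. TryableSemaphoreSketch is a hypothetical class written for illustration, and building it on an AtomicInteger is an assumption of this sketch rather than something the code above shows.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical non-blocking semaphore: acquire succeeds or fails at once.
class TryableSemaphoreSketch {
    private final AtomicInteger inFlight = new AtomicInteger(0);
    private final int maxConcurrentRequests;

    TryableSemaphoreSketch(int maxConcurrentRequests) {
        this.maxConcurrentRequests = maxConcurrentRequests;
    }

    boolean tryAcquire() {
        // Optimistically count this request, then undo if the limit is exceeded.
        if (inFlight.incrementAndGet() > maxConcurrentRequests) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    void release() {
        inFlight.decrementAndGet();
    }

    int inFlight() {
        return inFlight.get();
    }
}
```

A caller that gets false from tryAcquire() goes straight to the fallback, and a caller that gets true must call release() exactly once, which is what the singleSemaphoreRelease action in applyHystrixSemantics guards with its AtomicBoolean.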

4.2 Thread pool mode

Thread pool mode also limits the execution concurrency for a single service provider. The code still acquires a semaphore first, but it is the default no-op semaphore that lets every request through; the thread pool logic does the real work. For example, let N be the number of requests under a single CommandKey. If N is less than maximumPoolSize, a thread is taken from the thread pool managed by Hystrix and the parameters are handed to that task thread, which performs the real call. If the number of concurrent requests exceeds the number of threads in the pool, tasks wait in the queue, but the queue also has an upper limit; once the queue is full as well, the request goes to the fallback logic. Thread pool mode supports asynchronous calls and timeouts, but it incurs thread switching and higher overhead.
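
The rejection path can be reproduced with a plain bounded ThreadPoolExecutor. This is a sketch using only java.util.concurrent, not the Hystrix scheduler: ThreadPoolIsolationSketch, submitOrFallback, and demo are hypothetical names, and the pool of one thread with a queue of one slot is chosen purely to make the third call overflow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a full pool and a full queue degrade to a fallback value.
class ThreadPoolIsolationSketch {
    static Future<String> submitOrFallback(ThreadPoolExecutor pool,
                                           Callable<String> call, String fallback) {
        try {
            return pool.submit(call);
        } catch (RejectedExecutionException e) {
            // No thread and no queue slot available: degrade immediately.
            return CompletableFuture.completedFuture(fallback);
        }
    }

    // One worker thread and one queue slot, so the third concurrent call is rejected.
    static String demo() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch gate = new CountDownLatch(1);
        try {
            // The first call occupies the only thread until the gate opens.
            Future<String> first = submitOrFallback(pool, () -> { gate.await(); return "first"; }, "fallback");
            Future<String> second = submitOrFallback(pool, () -> "second", "fallback");
            Future<String> third = submitOrFallback(pool, () -> "third", "fallback");
            gate.countDown();
            return first.get() + "," + second.get() + "," + third.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            gate.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The first call runs, the second waits in the queue, and the third is refused by the executor and served by the fallback, while the caller of the first two is never blocked by the third.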

[AbstractCommand] Observable<R> executeCommandWithSpecifiedIsolation(final AbstractCommand<R> _cmd): Obtains a thread from the thread pool, executes the command on it, and records the thread state along the way.

private Observable<R> executeCommandWithSpecifiedIsolation(final AbstractCommand<R> _cmd) {
      // Is the isolation strategy THREAD?
      if (properties.executionIsolationStrategy().get() == ExecutionIsolationStrategy.THREAD) {
          return Observable.defer(new Func0<Observable<R>>() {
              @Override
              public Observable<R> call() {
                  executionResult = executionResult.setExecutionOccurred();
                  if (!commandState.compareAndSet(CommandState.OBSERVABLE_CHAIN_CREATED, CommandState.USER_CODE_EXECUTED)) {
                      return Observable.error(new IllegalStateException("execution attempted while in state : " + commandState.get().name()));
                  }
                  // Record the command start in the metrics
                  metrics.markCommandStart(commandKey, threadPoolKey, ExecutionIsolationStrategy.THREAD);
                  // If the command has already timed out, emit an error immediately
                  if (isCommandTimedOut.get() == TimedOutStatus.TIMED_OUT) {
                      return Observable.error(new RuntimeException("timed out before executing run()"));
                  }
                  // Mark the thread state as started
                  if (threadState.compareAndSet(ThreadState.NOT_USING_THREAD, ThreadState.STARTED)) {
                      HystrixCounters.incrementGlobalConcurrentThreads();
                      threadPool.markThreadExecution();
                      endCurrentThreadExecutingCommand = Hystrix.startCurrentThreadExecutingCommand(getCommandKey());
                      executionResult = executionResult.setExecutedInThread();
                      // Run the hooks; if one throws, emit the error immediately
                      try {
                          executionHook.onThreadStart(_cmd);
                          executionHook.onRunStart(_cmd);
                          executionHook.onExecutionStart(_cmd);
                          return getUserExecutionObservable(_cmd);
                      } catch (Throwable ex) {
                          return Observable.error(ex);
                      }
                  } else {
                      // Return an empty Observable
                      return Observable.empty();
                  }
              }
          }).doOnTerminate(new Action0() {
              @Override
              public void call() {
                  // Termination logic omitted
              }
          }).doOnUnsubscribe(new Action0() {
              @Override
              public void call() {
                  // Unsubscribe logic omitted
              }
              // Run the business call on a thread obtained from the thread pool
          }).subscribeOn(threadPool.getScheduler(new Func0<Boolean>() {
              @Override
              public Boolean call() {
                  // Interrupt the thread only if that is enabled and the command has timed out
                  return properties.executionIsolationThreadInterruptOnTimeout().get() && _cmd.isCommandTimedOut.get() == TimedOutStatus.TIMED_OUT;
              }
          }));
      } else {
          // Semaphore mode
          // (omitted)
      }
  }

[HystrixThreadPool] Subscription schedule(final Action0 action): HystrixContextScheduler is Hystrix's own implementation of the Rx Scheduler. Its main purposes are to skip execution when the Observable has no subscriber and to support interrupting a command while it runs. In Rx, a Scheduler creates a Worker for the Observable to run actions on, and the Worker is responsible for scheduling the executing thread. ThreadPoolWorker is the Worker implemented by Hystrix, and schedule() is its core scheduling method.

public Subscription schedule(final Action0 action) {
    // If there is no subscriber left, return without executing
    if (subscription.isUnsubscribed()) {
        return Subscriptions.unsubscribed();
    }
    ScheduledAction sa = new ScheduledAction(action);
    subscription.add(sa);
    sa.addParent(subscription);
    // Get the thread pool
    ThreadPoolExecutor executor = (ThreadPoolExecutor) threadPool.getExecutor();
    // Submit the task for execution
    FutureTask<?> f = (FutureTask<?>) executor.submit(sa);
    sa.add(new FutureCompleterWithConfigurableInterrupt(f, shouldInterruptThread, executor));
    return sa;
}

5. Timeout detection

The Hystrix timeout mechanism reduces the impact of slow third-party dependencies on the caller and makes requests fail fast. It is implemented mainly with delayed tasks, and involves two steps: registering the delayed task and executing it.

When the isolation strategy is thread pool, the main thread subscribes to the execution result and a task thread in the pool calls the provider. Meanwhile, a timer thread checks after the configured time whether the task has completed. If it has not, a timeout exception is emitted, and any result the task thread produces later is discarded instead of being published. If it has, the task finished within the timeout and the timer's detection task ends.

When the isolation strategy is semaphore, the main thread both subscribes to the execution result and performs the actual call to the provider (there is no task thread). When the timeout elapses, the main thread still finishes the business call and only then throws the timeout exception. The timeout configuration is therefore flawed in semaphore mode: a call in progress cannot be cancelled, and the time the main thread takes to return cannot be bounded.

[AbstractCommand] Observable<R> executeCommandAndObserve(final AbstractCommand<R> _cmd): the entry point of timeout detection; it calls lift(new HystrixObservableTimeoutOperator<R>(_cmd)) to attach the timeout detection task.

private Observable<R> executeCommandAndObserve(final AbstractCommand<R> _cmd) {
    // (omitted)
    Observable<R> execution;
    // Is timeout detection enabled?
    if (properties.executionTimeoutEnabled().get()) {
        execution = executeCommandWithSpecifiedIsolation(_cmd)
                // Attach the timeout detection operator
                .lift(new HystrixObservableTimeoutOperator<R>(_cmd));
    } else {
        // Execute without timeout detection
        execution = executeCommandWithSpecifiedIsolation(_cmd);
    }
    return execution.doOnNext(markEmits)
            .doOnCompleted(markOnCompleted)
            .onErrorResumeNext(handleFallback)
            .doOnEach(setRequestContext);
}

[HystrixObservableTimeoutOperator] Subscriber<? super R> call(final Subscriber<? super R> child): creates the detection task and associates it with a delayed task. If the command has not finished when the delayed task fires, a timeout exception is emitted; if the command completes or fails first, the detection task is cleared.

public Subscriber<? super R> call(final Subscriber<? super R> child) {
        final CompositeSubscription s = new CompositeSubscription();
        child.add(s);
        final HystrixRequestContext hystrixRequestContext = HystrixRequestContext.getContextForCurrentThread();
        // Create the timer listener
        TimerListener listener = new TimerListener() {
            @Override
            public void tick() {
                // If the command has not completed yet, mark it as timed out
                if (originalCommand.isCommandTimedOut.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.TIMED_OUT)) {
                    // Report the timeout event
                    originalCommand.eventNotifier.markEvent(HystrixEventType.TIMEOUT, originalCommand.commandKey);
                    // Unsubscribe from the running command
                    s.unsubscribe();
                    final HystrixContextRunnable timeoutRunnable = new HystrixContextRunnable(originalCommand.concurrencyStrategy, hystrixRequestContext, new Runnable() {
 
                        @Override
                        public void run() {
                            child.onError(new HystrixTimeoutException());
                        }
                    });
                    // Emit the timeout exception
                    timeoutRunnable.run();
                }
            }
            // The configured timeout
            @Override
            public int getIntervalTimeInMilliseconds() {
                return originalCommand.properties.executionTimeoutInMilliseconds().get();
            }
        };
        // Register the listener, which schedules the detection task
        final Reference<TimerListener> tl = HystrixTimer.getInstance().addTimerListener(listener);
        originalCommand.timeoutTimer.set(tl);
        Subscriber<R> parent = new Subscriber<R>() {
            @Override
            public void onCompleted() {
                if (isNotTimedOut()) {
                    // Completed before the timeout: clear the detection task
                    tl.clear();
                    child.onCompleted();
                }
            }
            @Override
            public void onError(Throwable e) {
                if (isNotTimedOut()) {
                    // Failed before the timeout: clear the detection task
                    tl.clear();
                    child.onError(e);
                }
            }
            @Override
            public void onNext(R v) {
                // Publish the result unless the command has timed out, in which case skip it
                if (isNotTimedOut()) {
                    child.onNext(v);
                }
            }
            // True if the command has not timed out
            private boolean isNotTimedOut() {
                return originalCommand.isCommandTimedOut.get() == TimedOutStatus.COMPLETED ||
                        originalCommand.isCommandTimedOut.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.COMPLETED);
            }
        };
        s.add(parent);
        return parent;
    }

[HystrixTimer] Reference<TimerListener> addTimerListener(final TimerListener listener): schedules the listener through the Java scheduled-executor method scheduleAtFixedRate, so that its tick() first runs once the timeout period has elapsed.

public Reference<TimerListener> addTimerListener(final TimerListener listener) {
    // Start the timer thread if needed
    startThreadIfNeeded();
    // Build the detection task
    Runnable r = new Runnable() {
 
        @Override
        public void run() {
            try {
                listener.tick();
            } catch (Exception e) {
                logger.error("Failed while ticking TimerListener", e);
            }
        }
    };
    // Schedule the detection task to run after the delay
    ScheduledFuture<?> f = executor.get().getThreadPool().scheduleAtFixedRate(r, listener.getIntervalTimeInMilliseconds(), listener.getIntervalTimeInMilliseconds(), TimeUnit.MILLISECONDS);
    return new TimerReference(listener, f);
}
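
At the heart of this section is a compare-and-set race between the timer and the command's completion: whichever moves the status away from NOT_EXECUTED first wins. The sketch below isolates only that race. TimeoutRaceSketch is a hypothetical class; in Hystrix the tick is driven by the timer thread and the completion by the executing thread, whereas here both are plain method calls.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the race between the timeout tick and normal completion.
class TimeoutRaceSketch {
    enum TimedOutStatus { NOT_EXECUTED, COMPLETED, TIMED_OUT }

    private final AtomicReference<TimedOutStatus> state =
            new AtomicReference<>(TimedOutStatus.NOT_EXECUTED);

    // Timer side: returns true only if the timeout won and an error should be emitted.
    boolean tick() {
        return state.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.TIMED_OUT);
    }

    // Command side: returns true only if the result may still be published.
    boolean isNotTimedOut() {
        return state.get() == TimedOutStatus.COMPLETED
                || state.compareAndSet(TimedOutStatus.NOT_EXECUTED, TimedOutStatus.COMPLETED);
    }

    TimedOutStatus state() {
        return state.get();
    }
}
```

If the command finishes first, a later tick() returns false and does nothing. If the timer fires first, isNotTimedOut() returns false and the late result is dropped. Note that the race decides only which outcome is published; it does not stop a call that is still running, which is the limitation described for semaphore mode.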

6. Fallback

Hystrix's fallback logic is a last line of defense. A request enters it when the business call throws an exception, the thread pool or semaphore is full, the execution times out, and so on. A fallback should generally return data from memory or static logic and avoid network calls where possible. If no fallback method is implemented, or the fallback method itself throws, an exception is thrown in the business thread.

[AbstractCommand] Observable<R> getFallbackOrThrowException(final AbstractCommand<R> _cmd, final HystrixEventType eventType, final FailureType failureType, final String message, final Exception originalException): first checks whether the exception is unrecoverable; if so, the exception is returned directly without running the fallback logic. Next, it tries to acquire the fallback semaphore and, if that succeeds, runs the fallback logic. If the fallback logic throws or no fallback method is implemented, an exception is returned.

private Observable<R> getFallbackOrThrowException(final AbstractCommand<R> _cmd, final HystrixEventType eventType, final FailureType failureType, final String message, final Exception originalException) {
    final HystrixRequestContext requestContext = HystrixRequestContext.getContextForCurrentThread();
    long latency = System.currentTimeMillis() - executionResult.getStartTimestamp();
    executionResult = executionResult.addEvent((int) latency, eventType);
    // Is this an unrecoverable error, such as a stack overflow or OOM?
    if (isUnrecoverable(originalException)) {
        logger.error("Unrecoverable Error for HystrixCommand so will throw HystrixRuntimeException and not apply fallback. ", originalException);
        Exception e = wrapWithOnErrorHook(failureType, originalException);
        // Return the error directly
        return Observable.error(new HystrixRuntimeException(failureType, this.getClass(), getLogMessagePrefix() + " " + message + " and encountered unrecoverable error.", e, null));
    } else {
        // Is this a recoverable error?
        if (isRecoverableError(originalException)) {
            logger.warn("Recovered from java.lang.Error by serving Hystrix fallback", originalException);
        }
        // Is fallback enabled in the configuration?
        if (properties.fallbackEnabled().get()) {
            // (omitted)
            final Func1<Throwable, Observable<R>> handleFallbackError = new Func1<Throwable, Observable<R>>() {
                @Override
                public Observable<R> call(Throwable t) {
                    Exception e = wrapWithOnErrorHook(failureType, originalException);
                    Exception fe = getExceptionFromThrowable(t);
 
                    long latency = System.currentTimeMillis() - executionResult.getStartTimestamp();
                    Exception toEmit;
                    // UnsupportedOperationException is thrown when the business code does not override getFallback
                    if (fe instanceof UnsupportedOperationException) {
                        logger.debug("No fallback for HystrixCommand. ", fe);
                        eventNotifier.markEvent(HystrixEventType.FALLBACK_MISSING, commandKey);
                        executionResult = executionResult.addEvent((int) latency, HystrixEventType.FALLBACK_MISSING);
                        toEmit = new HystrixRuntimeException(failureType, _cmd.getClass(), getLogMessagePrefix() + " " + message + " and no fallback available.", e, fe);
                    } else {
                        // The fallback logic itself threw an exception
                        logger.debug("HystrixCommand execution " + failureType.name() + " and fallback failed.", fe);
                        eventNotifier.markEvent(HystrixEventType.FALLBACK_FAILURE, commandKey);
                        executionResult = executionResult.addEvent((int) latency, HystrixEventType.FALLBACK_FAILURE);
                        toEmit = new HystrixRuntimeException(failureType, _cmd.getClass(), getLogMessagePrefix() + " " + message + " and fallback failed.", e, fe);
                    }
                    // Should the original exception be emitted without wrapping?
                    if (shouldNotBeWrapped(originalException)) {
                        // Emit the unwrapped exception
                        return Observable.error(e);
                    }
                    // Emit the wrapped exception
                    return Observable.error(toEmit);
                }
            };
            // Get the fallback semaphore
            final TryableSemaphore fallbackSemaphore = getFallbackSemaphore();
            final AtomicBoolean semaphoreHasBeenReleased = new AtomicBoolean(false);
            final Action0 singleSemaphoreRelease = new Action0() {
                @Override
                public void call() {
                    if (semaphoreHasBeenReleased.compareAndSet(false, true)) {
                        fallbackSemaphore.release();
                    }
                }
            };
            Observable<R> fallbackExecutionChain;
            // Try to acquire a fallback permit
            if (fallbackSemaphore.tryAcquire()) {
                try {
                    // Has the user defined a fallback method?
                    if (isFallbackUserDefined()) {
                        executionHook.onFallbackStart(this);
                        // Run the fallback logic
                        fallbackExecutionChain = getFallbackObservable();
                    } else {
                        // Run the fallback logic
                        fallbackExecutionChain = getFallbackObservable();
                    }
                } catch (Throwable ex) {
                    fallbackExecutionChain = Observable.error(ex);
                }
                return fallbackExecutionChain
                        .doOnEach(setRequestContext)
                        .lift(new FallbackHookApplication(_cmd))
                        .lift(new DeprecatedOnFallbackHookApplication(_cmd))
                        .doOnNext(markFallbackEmit)
                        .doOnCompleted(markFallbackCompleted)
                        .onErrorResumeNext(handleFallbackError)
                        .doOnTerminate(singleSemaphoreRelease)
                        .doOnUnsubscribe(singleSemaphoreRelease);
            } else {
                // The fallback semaphore rejected the request: emit an error
                return handleFallbackRejectionByEmittingError();
            }
        } else {
            // Fallback is disabled in the configuration: emit an error
            return handleFallbackDisabledByEmittingError(originalException, failureType, message);
        }
    }
}

[HystrixCommand] R getFallback(): by default HystrixCommand throws UnsupportedOperationException; subclasses need to override getFallback to implement the fallback logic.

protected R getFallback() {
    throw new UnsupportedOperationException("No fallback available.");
}
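
The difference between "no fallback implemented" and "fallback failed" comes down to which exception the fallback throws. Below is a minimal sketch of that distinction. CommandSketch is a hypothetical stand-in for a command class, and IllegalStateException stands in for the wrapping exception; the Rx chain, semaphores, and hooks are all left out.

```java
// Hypothetical stand-in for a command with an optional fallback.
abstract class CommandSketch<R> {
    protected abstract R run();

    // Default: no fallback, signalled by UnsupportedOperationException.
    protected R getFallback() {
        throw new UnsupportedOperationException("No fallback available.");
    }

    final R execute() {
        try {
            return run();
        } catch (RuntimeException original) {
            try {
                return getFallback();
            } catch (UnsupportedOperationException missing) {
                // The subclass did not override getFallback().
                throw new IllegalStateException("failed and no fallback available.", original);
            } catch (RuntimeException fallbackFailure) {
                // The subclass provided a fallback, but it threw as well.
                throw new IllegalStateException("failed and fallback failed.", original);
            }
        }
    }
}
```

A subclass that overrides getFallback() to return a cached or static value turns a failed run() into a degraded but successful result; a subclass that does not override it surfaces the failure to the caller, with the original exception attached as the cause.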

7. Health Statistics

Hystrix uses sliding-window statistics to compute the proportion of failed calls and decide whether to trip the circuit breaker, which enables fast failure and fallback. The steps are as follows:

After AbstractCommand finishes executing, it calls the handleCommandEnd method, which publishes the execution result to the event stream as a HystrixCommandCompletion event;
The event stream uses Observable.window() to group events by time and flatMap() to aggregate the events of each group by type (success, failure, and so on) into a bucket, forming a bucket stream;
A second Observable.window() call then groups the buckets into sliding windows according to the number of buckets per window;
The data in each sliding window is aggregated into data objects (such as health data or cumulative data);
When the fuse CircuitBreaker is initialized, it subscribes to the health data stream and switches the fuse according to the health status.
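
The time-bucketing step can be approximated without RxJava. The sketch below is only an illustration under stated assumptions: SketchEvent, BucketSketch and toBuckets are hypothetical names, timestamps are plain millisecond values, and intervals without events are simply absent from the result rather than represented as empty buckets.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event types; the real library distinguishes many more.
enum SketchEvent { SUCCESS, FAILURE, TIMEOUT }

class BucketSketch {
    // Groups (timestamp, event) pairs into fixed-width time buckets and counts
    // the events of each type per bucket, similar in spirit to the
    // window()/flatMap() step described above.
    static Map<Long, long[]> toBuckets(long[] timestampsMs, SketchEvent[] events, long bucketSizeMs) {
        Map<Long, long[]> buckets = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long bucketIndex = timestampsMs[i] / bucketSizeMs;
            long[] counts = buckets.computeIfAbsent(
                    bucketIndex, k -> new long[SketchEvent.values().length]);
            counts[events[i].ordinal()]++; // one counter slot per event type
        }
        return buckets;
    }
}
```

Each bucket is a long[] indexed by the ordinal of the event type, which is the same layout that the plus(long[] eventTypeCounts) method later in this section reads from.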

[AbstractCommand] void handleCommandEnd(boolean commandExecutionStarted): after the business logic has executed, the handleCommandEnd method is called. It reports the execution result, which makes it the entry point to health statistics.

private void handleCommandEnd(boolean commandExecutionStarted) {
    Reference<TimerListener> tl = timeoutTimer.get();
    if (tl != null) {
        tl.clear();
    }

    long userThreadLatency = System.currentTimeMillis() - commandStartTimestamp;
    executionResult = executionResult.markUserThreadCompletion((int) userThreadLatency);
    // Report the execution result to the health statistics
    if (executionResultAtTimeOfCancellation == null) {
        metrics.markCommandDone(executionResult, commandKey, threadPoolKey, commandExecutionStarted);
    } else {
        metrics.markCommandDone(executionResultAtTimeOfCancellation, commandKey, threadPoolKey, commandExecutionStarted);
    }

    if (endCurrentThreadExecutingCommand != null) {
        endCurrentThreadExecutingCommand.call();
    }
}

[BucketedRollingCounterStream] BucketedRollingCounterStream(HystrixEventStream<Event> stream, final int numBuckets, int bucketSizeInMs, final Func2<Bucket, Event, Bucket> appendRawEventToBucket, final Func2<Output, Bucket, Output> reduceBucket)

The sliding window of the health statistics class HealthCountsStream is mainly implemented in its parent class BucketedRollingCounterStream. That class's own parent, BucketedCounterStream, first turns the event stream into a bucket stream. BucketedRollingCounterStream then groups the buckets into sliding windows, and the reduceBucket function passed in by HealthCountsStream reduces each window to health statistics.

protected BucketedRollingCounterStream(HystrixEventStream<Event> stream, final int numBuckets, int bucketSizeInMs,
                                       final Func2<Bucket, Event, Bucket> appendRawEventToBucket,
                                       final Func2<Output, Bucket, Output> reduceBucket) {
    // Call the parent class, which turns the event data into a bucket stream
    super(stream, numBuckets, bucketSizeInMs, appendRawEventToBucket);
    // Reduce the data inside a sliding window with the reduceBucket function that was passed in
    Func1<Observable<Bucket>, Observable<Output>> reduceWindowToSummary = new Func1<Observable<Bucket>, Observable<Output>>() {
        @Override
        public Observable<Output> call(Observable<Bucket> window) {
            return window.scan(getEmptyOutputValue(), reduceBucket).skip(numBuckets);
        }
    };
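    // Operate on the bucket stream produced by the parent class
    this.sourceStream = bucketedStream
            // Each window holds numBuckets buckets and slides forward one bucket at a time
            .window(numBuckets, 1)
            // Reduce the buckets inside each sliding window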
            .flatMap(reduceWindowToSummary)
            .doOnSubscribe(new Action0() {
                @Override
                public void call() {
                    isSourceCurrentlySubscribed.set(true);
                }
            })
            .doOnUnsubscribe(new Action0() {
                @Override
                public void call() {
                    isSourceCurrentlySubscribed.set(false);
                }
            })
            .share()
            .onBackpressureDrop();
}
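
The combination of window(numBuckets, 1) with scan(...).skip(numBuckets) is easy to misread: scan emits the seed plus one running total per bucket, so skipping numBuckets items leaves only the total of a complete window, and an incomplete window emits nothing. A plain-Java approximation of that effect, using a single long per bucket and the hypothetical names RollingSumSketch and rollingSums, might look like this:

```java
import java.util.ArrayList;
import java.util.List;

class RollingSumSketch {
    // For every run of numBuckets consecutive buckets, emits their sum and
    // then advances by one bucket. Windows with fewer than numBuckets
    // buckets produce no output.
    static List<Long> rollingSums(long[] bucketCounts, int numBuckets) {
        List<Long> sums = new ArrayList<>();
        for (int start = 0; start + numBuckets <= bucketCounts.length; start++) {
            long sum = 0;
            for (int i = start; i < start + numBuckets; i++) {
                sum += bucketCounts[i];
            }
            sums.add(sum);
        }
        return sums;
    }
}
```

With bucket counts 1, 2, 3, 4, 5 and numBuckets = 3, the sketch yields the sums 6, 9 and 12, one per position of the window.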

[HealthCounts] HealthCounts plus(long[] eventTypeCounts): accumulates the counts in a bucket by event type to produce the HealthCounts statistics.

public HealthCounts plus(long[] eventTypeCounts) {
    long updatedTotalCount = totalCount;
    long updatedErrorCount = errorCount;

    long successCount = eventTypeCounts[HystrixEventType.SUCCESS.ordinal()];
    long failureCount = eventTypeCounts[HystrixEventType.FAILURE.ordinal()];
    long timeoutCount = eventTypeCounts[HystrixEventType.TIMEOUT.ordinal()];
    long threadPoolRejectedCount = eventTypeCounts[HystrixEventType.THREAD_POOL_REJECTED.ordinal()];
    long semaphoreRejectedCount = eventTypeCounts[HystrixEventType.SEMAPHORE_REJECTED.ordinal()];
    // Total number of requests
    updatedTotalCount += (successCount + failureCount + timeoutCount + threadPoolRejectedCount + semaphoreRejectedCount);
    // Number of failed requests
    updatedErrorCount += (failureCount + timeoutCount + threadPoolRejectedCount + semaphoreRejectedCount);
    return new HealthCounts(updatedTotalCount, updatedErrorCount);
}
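
How such totals could feed the fuse decision is sketched below. This is a simplified illustration, not the library's code: SketchHealth, errorPercentage and shouldOpenCircuit are hypothetical names, and the decision rule (enough requests in the window and an error percentage at or above a threshold) is an assumption based on the failure-rate description earlier in the article.

```java
// Hypothetical, simplified counterpart of HealthCounts; immutable like the original.
final class SketchHealth {
    final long totalCount;
    final long errorCount;

    SketchHealth(long totalCount, long errorCount) {
        this.totalCount = totalCount;
        this.errorCount = errorCount;
    }

    // Folds one bucket into the running totals and returns a new instance.
    SketchHealth plus(long successes, long errors) {
        return new SketchHealth(totalCount + successes + errors, errorCount + errors);
    }

    // Integer percentage of failed requests; 0 when nothing was recorded.
    int errorPercentage() {
        return totalCount == 0 ? 0 : (int) (errorCount * 100 / totalCount);
    }

    // Assumed rule: open only with enough traffic and a high enough error rate.
    boolean shouldOpenCircuit(long minRequests, int thresholdPercent) {
        return totalCount >= minRequests && errorPercentage() >= thresholdPercent;
    }
}
```

For example, folding a bucket with 8 successes and 2 errors and another with 2 successes and 8 errors gives 20 requests with 10 errors, so errorPercentage() is 50. The minimum-request guard keeps a handful of failures during quiet periods from opening the fuse.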

8. Summary

In a distributed environment, some of the many service dependencies will inevitably fail. As a library, Hystrix helps users control the interaction between distributed services by adding fusing, isolation, and downgrade logic, which improves the overall resilience of the system. Its main functions are as follows:

Protect the system from, and control, latency and failures when accessing third-party dependencies (usually over the network)
Prevent cascading failures in complex distributed systems
Fail fast and recover quickly
Degrade gracefully
Provide near real-time monitoring, alerting, and control

There are some points to pay attention to when using Hystrix:

When overriding getFallback(), try to avoid network dependencies. If a network dependency is unavoidable, degrade in multiple levels: instantiate another HystrixCommand inside getFallback() and execute it. getFallback() should return quickly so that degradation is fast.
For HystrixCommand, the thread isolation strategy is recommended.
hystrix.threadpool.default.maximumSize takes effect only when hystrix.threadpool.default.allowMaximumSizeToDivergeFromCoreSize is set to true. Choose the maximum number of threads based on the business's own situation and performance test results. Start with a small value and rely on dynamic resizing, because the thread pool is the main tool for shedding load and preventing resources from being tied up when latency occurs.
Under the semaphore isolation strategy, business logic runs on the calling thread of the application (such as a Tomcat container thread), so the concurrency limit must be set. This strategy is not recommended for calls with network overhead, because it can easily cause container threads to queue up and block, which affects the entire application.
In addition, Hystrix relies heavily on RxJava, a reactive functional programming framework. A basic understanding of RxJava makes the source code logic easier to follow.
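
The advice about degrading in multiple levels can be illustrated without Hystrix. In the sketch below, FallbackChainSketch and firstSuccessful are hypothetical names, and each Supplier stands in for one level (for instance the primary call, a fallback that still needs the network, and a final local default). In a real application each level would be its own command, as described in the first point above.

```java
import java.util.List;
import java.util.function.Supplier;

class FallbackChainSketch {
    // Tries each level in order and returns the first result that does not throw.
    // The last level should be a cheap, local value that cannot fail.
    static <R> R firstSuccessful(List<Supplier<R>> levels) {
        RuntimeException last = null;
        for (Supplier<R> level : levels) {
            try {
                return level.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and degrade to the next level
            }
        }
        // Every level failed (or none was configured): surface the last failure.
        throw last != null ? last : new IllegalStateException("no levels configured");
    }
}
```

Ordering the levels from most to least expensive, and ending with a value that needs no I/O, keeps the final downgrade fast even when every remote dependency is unavailable.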

Reference documents

Hystrix GitHub repository: https://github.com/Netflix/Hystrix


vivo Internet Technology
