
Problems caused by architecture evolution

With a traditional client-server architecture, when the server blocks a request because of a failure or overload, the client simply gets no response, and after a while a batch of users loses service. The impact of that situation is limited and can be estimated. Under a microservice architecture, however, your service may depend on several other microservices, and those microservices depend on even more services. In that case, a single blocked service can, within seconds, consume resources along the whole call chain through cascading amplification, with catastrophic consequences for the entire link. We call this a "service avalanche".


Several ways to solve the problem

  1. Circuit breaking: just as a household fuse blows when the current is too high to prevent a fire, a system using this pattern stops calling a dependency as soon as it detects that calls to it are slow or timing out in large numbers. It returns a failure (or fallback) immediately and releases resources quickly, and calls are only resumed once the dependency recovers.
  2. Isolation: calls to different resources or services are split into separate request pools, so exhausting one pool does not affect requests for other resources. This prevents a single point of failure from consuming all resources and is a very traditional fault-tolerance design.
  3. Rate limiting: circuit breaking and isolation react after a problem has occurred, while rate limiting reduces the probability of the problem occurring in the first place. It sets a maximum QPS threshold for certain requests, and requests above the threshold are rejected immediately without consuming processing resources (a minimal sketch follows this list). Rate limiting alone cannot prevent a service avalanche, though, because an avalanche is usually caused not by a large number of incoming requests but by amplification across multiple cascading layers.
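
To make the rate-limiting idea concrete, here is a minimal sketch of a fixed-window QPS limiter; the class, threshold, and handler names are purely illustrative and not part of the original article:

// Fixed-window rate limiter: allow at most maxQps requests per second and
// reject the rest immediately so they never consume processing resources.
class RateLimiter {
  constructor(maxQps) {
    this.maxQps = maxQps;
    this.windowStart = Date.now();
    this.count = 0;
  }

  allow() {
    const now = Date.now();
    if (now - this.windowStart >= 1000) {
      // A new one-second window begins: reset the counter.
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.maxQps) return false; // over the threshold: reject
    this.count += 1;
    return true;
  }
}

// Usage: drop requests above 100 QPS before doing any real work.
const limiter = new RateLimiter(100);
function handleRequest(req) {
  if (!limiter.allow()) {
    return { status: 429, data: 'too many requests' };
  }
  // ... normal processing ...
  return { status: 200, data: 'ok' };
}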

Mechanism and implementation of the circuit breaker

A circuit breaker gives us an extra layer of protection. When a call is unstable, or the service or resource being called is likely to fail, the circuit breaker monitors these errors and, once a threshold is reached, fails the request immediately, preventing excessive resource consumption. In addition, the circuit breaker can automatically detect the state of the dependency and recover: when the upstream service returns to normal, it automatically resumes normal requests.

Let's look at a request process without a circuit breaker:
The user relies on ServiceA, and ServiceA relies on ServiceB. Suppose ServiceB fails in such a way that every request to it hangs for 10 seconds.
[Figure: the call chain User → ServiceA → ServiceB]

Now suppose N users request ServiceA. Within a few seconds, ServiceA's resources are exhausted by requests hanging on ServiceB, and it starts rejecting any further requests from users. To the user, ServiceA and ServiceB appear to have failed at the same time: the whole service chain has collapsed.
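
To get a feel for how this happens, here is a small toy simulation (the pool size, delays, and function names are made up for illustration): ServiceA can hold only a handful of concurrent requests, and every call to the broken ServiceB hangs for 10 seconds before failing.

// Toy simulation of the avalanche (names and numbers are illustrative).
const POOL_SIZE = 5; // concurrent requests ServiceA can hold
let inFlight = 0;

// Simulated ServiceB: every call hangs for 10 seconds and then fails.
function callServiceB() {
  return new Promise((resolve, reject) =>
    setTimeout(() => reject(new Error('ServiceB timed out')), 10000));
}

async function handleUserRequest(id) {
  if (inFlight >= POOL_SIZE) {
    // Every slot is stuck waiting on ServiceB, so ServiceA itself now
    // rejects users even though only ServiceB is broken.
    console.log(`request ${id}: rejected, ServiceA has no free resources`);
    return;
  }
  inFlight += 1;
  try {
    await callServiceB(); // holds a slot for the full 10 seconds
  } catch (err) {
    console.log(`request ${id}: failed after waiting on ServiceB`);
  } finally {
    inFlight -= 1;
  }
}

// Fire 20 user requests within one second: the first 5 occupy every slot,
// and the remaining 15 are rejected immediately.
for (let i = 1; i <= 20; i++) {
  setTimeout(() => handleUserRequest(i), i * 50);
}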

And what happens when we install a circuit breaker on ServiceA?

  1. The circuit breaker notices that requests to ServiceB are failing. Once the number of failures reaches a certain threshold, ServiceA stops calling ServiceB and immediately returns a failure, or uses fallback backup data instead. The circuit breaker is now in the open state.
  2. After a period of time, the circuit breaker starts to check whether ServiceB has recovered. The circuit breaker is now in the half-open state.
  3. If ServiceB has recovered, the circuit breaker switches to the closed state, and ServiceA calls ServiceB normally and returns the result.

[Figure: the call chain with a circuit breaker protecting ServiceA's calls to ServiceB]

The state diagram of the circuit breaker is as follows:
[Figure: circuit breaker state diagram (CLOSED, OPEN, HALF-OPEN)]

From this we can see that a circuit breaker has a few core parameters:

  1. Timeout: how long a request may take before it is counted as a failure
  2. Failure threshold: how many failures must occur before the circuit breaker opens
  3. Retry timeout: how long the circuit breaker stays open before it allows a trial request, i.e. enters the half-open state

Armed with this knowledge, we can try to create a circuit breaker:

// axios is used by the call() method below for the HTTP request.
const axios = require('axios');

class CircuitBreaker {
  constructor(timeout, failureThreshold, retryTimePeriod) {
    // We start in a closed state hoping that everything is fine
    this.state = 'CLOSED';
    // Number of failures from the service we depend on before we switch the state to 'OPEN'.
    this.failureThreshold = failureThreshold;
    // Timeout for the API request.
    this.timeout = timeout;
    // Time period after which a fresh request is made to the dependent
    // service to check whether it is back up.
    this.retryTimePeriod = retryTimePeriod;
    this.lastFailureTime = null;
    this.failureCount = 0;
  }
}

Construct the state machine of the circuit breaker:

async call(urlToCall) {
    // Determine the current state of the circuit.
    this.setState();
    switch (this.state) {
      case 'OPEN':
        // The circuit is OPEN: fail fast and return a stale/cached response.
        return { data: 'this is stale response' };
      // Make the API request if the circuit is not OPEN
      case 'HALF-OPEN':
      case 'CLOSED':
        try {
          const response = await axios({
            url: urlToCall,
            timeout: this.timeout,
            method: 'get',
          });
          // Yay!! the API responded fine. Lets reset everything.
          this.reset();
          return response;
        } catch (err) {
          // Uh-oh!! the call still failed. Let's update that in our records.
          this.recordFailure();
          // Rethrow the original error so the caller can handle it.
          throw err;
        }
      default:
        console.log('This state should never be reached');
        return 'unexpected state in the state machine';
    }
  }

Fill in the remaining methods:

// Reset all parameters to their initial (CLOSED) state after a successful call.
  reset() {
    this.failureCount = 0;
    this.lastFailureTime = null;
    this.state = 'CLOSED';
  }

  // Set the current state of our circuit breaker.
  setState() {
    if (this.failureCount > this.failureThreshold) {
      if ((Date.now() - this.lastFailureTime) > this.retryTimePeriod) {
        this.state = 'HALF-OPEN';
      } else {
        this.state = 'OPEN';
      }
    } else {
      this.state = 'CLOSED';
    }
  }

  recordFailure() {
    this.failureCount += 1;
    this.lastFailureTime = Date.now();
  }
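
Before wiring it up to real HTTP calls, we can sanity-check the state machine directly; the thresholds below are arbitrary and chosen only for this quick test:

// Trip the breaker with recorded failures, then wait past retryTimePeriod.
const testBreaker = new CircuitBreaker(1000, 3, 2000);

for (let i = 0; i < 4; i++) testBreaker.recordFailure();
testBreaker.setState();
console.log(testBreaker.state); // 'OPEN'

setTimeout(() => {
  testBreaker.setState();
  console.log(testBreaker.state); // 'HALF-OPEN' once retryTimePeriod has elapsed
}, 2100);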

To use the circuit breaker, simply wrap the request in the call method of a circuit breaker instance:

...
const circuitBreaker = new CircuitBreaker(3000, 5, 2000);

const response = await circuitBreaker.call('http://0.0.0.0:8000/flakycall');
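
Building on the snippet above, a fuller usage sketch that also handles the failure path might look like this (the fallback value is illustrative):

async function getFlakyData() {
  try {
    // CLOSED or HALF-OPEN: a real request is made, bounded by the 3s timeout;
    // OPEN: call() returns the stale response immediately without a request.
    const response = await circuitBreaker.call('http://0.0.0.0:8000/flakycall');
    return response.data;
  } catch (err) {
    // The underlying request failed while the circuit was CLOSED or HALF-OPEN.
    return 'fallback data'; // illustrative fallback
  }
}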

Mature Node.js circuit breaker library

Red Hat has long maintained a circuit breaker library called Opossum; the link is here: Opossum. In a distributed system, using this library can greatly improve the fault tolerance of your services and address service avalanches at the root.
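
As a rough sketch of what using Opossum looks like (the option names follow its documentation; check them against the version you install, and the endpoint is the same hypothetical flaky service):

const CircuitBreaker = require('opossum');
const axios = require('axios');

// The protected action: any function that returns a promise.
const getFlakyData = () => axios.get('http://0.0.0.0:8000/flakycall');

const breaker = new CircuitBreaker(getFlakyData, {
  timeout: 3000,                 // a call slower than 3s counts as a failure
  errorThresholdPercentage: 50,  // open the circuit once 50% of calls fail
  resetTimeout: 2000,            // after 2s, allow a trial call (half-open)
});

// Served while the circuit is open or when the call fails.
breaker.fallback(() => ({ data: 'this is stale response' }));

breaker.fire()
  .then((response) => console.log(response.data))
  .catch(console.error);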

Author: ES2049

The article can be reprinted at will, but please keep this link to the original text.
You are very welcome to join ES2049 Studio if you are passionate. Please send your resume to caijun.hcj@alibaba-inc.com

