Author: ten sleep

Production environments often run into unstable situations, for example:

  • During a big promotion, an instantaneous traffic peak pushes the system past its maximum load; the load soars, the system crashes, and users cannot place orders
  • A "dark horse" hot item breaks through the cache, the database is overwhelmed, and normal traffic is squeezed out
  • A caller is dragged down by an unstable downstream service, its thread pool fills up, and the entire call chain grinds to a halt

These unstable scenarios can have serious consequences. You may ask: how can we keep user access uniform and smooth? How can we reduce the impact of traffic surges or unstable services?

Introduction

The two approaches below are the common answers to unstable traffic. They are also two capabilities we must consider when designing a highly available system, and a very important part of service traffic governance.

Flow control

Traffic is random and unpredictable: one second may be calm, and the next may bring a peak (think of Double Eleven at midnight). Every system and service has an upper limit on the capacity it can carry. If a sudden burst exceeds that capacity, requests cannot be processed in time, the backlog is handled slowly, CPU and load soar, and the system eventually breaks down. We therefore need to limit this kind of burst traffic, processing as many requests as possible while making sure the service is not overwhelmed. This is flow control.
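
To make the idea concrete, here is a minimal, illustrative sketch of flow control using a fixed one-second counting window, written against only the JDK (the class name and threshold are made up; real limiters such as Sentinel, introduced later in this article, use sliding windows and richer shaping strategies):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Minimal fixed-window limiter: allow at most `threshold` requests per second,
// reject the rest so that burst traffic cannot push the system past its capacity.
public class SimpleRateLimiter {
    private final int threshold;
    private final AtomicLong windowStartMillis = new AtomicLong(System.currentTimeMillis());
    private final AtomicInteger counter = new AtomicInteger(0);

    public SimpleRateLimiter(int threshold) {
        this.threshold = threshold;
    }

    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStartMillis.get();
        // Start a new one-second window when the current one has elapsed.
        if (now - start >= 1000 && windowStartMillis.compareAndSet(start, now)) {
            counter.set(0);
        }
        // Admit the request only if the current window still has budget.
        return counter.incrementAndGet() <= threshold;
    }
}

Requests over the per-second budget are rejected immediately instead of queuing up until the system falls over; that is the essence of flow control.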

Circuit breaking and degradation

A service often calls other modules: another remote service, a database, or a third-party API. When processing a payment, for example, it may need to call an API provided by UnionPay remotely; querying the price of a commodity may require a database query. The stability of these dependencies is not guaranteed. If a dependency becomes unstable and its response time grows, the response time of the method that calls it also grows, threads pile up, the caller's own thread pool may eventually be exhausted, and the caller itself becomes unavailable.


Modern microservice architectures are distributed and consist of a large number of services that call one another, forming complex call chains. The problems above are amplified along such chains: if one link in a complex chain becomes unstable, the failure can cascade and eventually make the whole chain unavailable. We therefore need to circuit-break and degrade unstable, weakly depended-on services, temporarily cutting off unstable calls to prevent local instability from snowballing into a global avalanche.
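
To illustrate the mechanism only (this is a toy sketch, not any framework's implementation), a circuit breaker watches recent failures of a dependency and, once a threshold is crossed, fails fast for a recovery period instead of letting threads pile up on a call that is already struggling:

import java.util.function.Supplier;

// Toy circuit breaker: after `failureThreshold` consecutive failures the circuit
// opens and calls fail fast for `openMillis`, giving the dependency time to recover.
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        // While the circuit is open, short-circuit to the fallback (e.g. a cached or mock result).
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis) {
            return fallback.get();
        }
        try {
            T result = dependency.get();     // normal call, or a probe after the open period
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();  // open the circuit
            }
            return fallback.get();
        }
    }
}

Production-grade breakers (such as Sentinel's, shown later in this article) use sliding statistics windows over slow-call or error ratios and a proper half-open probe state, rather than this simplified consecutive-failure count.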

Q: Many readers ask: my service handles only a small volume of traffic, so do I still need flow control and rate-limiting protection? My microservice architecture is fairly simple, so is there any need to introduce a circuit-breaking mechanism?

A: It actually has little to do with request volume or architectural complexity. In many cases, the failure of a very marginal service can affect the overall business and cause huge losses. We need a failure-oriented design mindset: do capacity planning and sort out strong and weak dependencies in normal times, configure reasonable flow control and degradation rules, and put protection in place in advance rather than remediating after a production incident.

In flow control, degradation, and fault tolerance scenarios there are many ways to describe a governance scheme. Below I will introduce OpenSergo, an open, general service governance standard for distributed, service-oriented architectures that covers the full link and heterogeneous ecosystems. Let's see how OpenSergo defines the standard for flow control, degradation, and fault tolerance, which implementations support the standard, and what problems they can help us solve.

The OpenSergo flow control, degradation, and fault tolerance standard (v1alpha1)

In OpenSergo, drawing on the scenarios of Sentinel and other frameworks, we abstract standard CRDs for flow control, degradation, and fault tolerance. A fault-tolerance governance rule (FaultToleranceRule) consists of the following three parts:

  • Target: which requests the rule applies to
  • Strategy: the fault-tolerance or control strategy, such as flow control, circuit breaking, concurrency control, adaptive overload protection, or outlier instance removal
  • FallbackAction: the fallback behavior once the rule is triggered, such as returning an error or a specific status code


Next, let's look at what OpenSergo's concrete standard definitions look like for common flow control and degradation scenarios, and how they solve our problems.

First of all, as long as a microservice framework is adapted to OpenSergo, governance such as flow control and degradation can be applied through this unified CRD. Whether the service is written in Java or Go or runs as a mesh service, and whether the traffic is an HTTP request, an RPC call, or a database SQL access, the unified fault-tolerance governance CRD can be used to configure fault-tolerance governance for every link in the microservice architecture and protect the stability of the service chain. Let's take a detailed look at OpenSergo's configuration in each specific scenario.

Flow control

The following example defines a cluster flow control policy: in the cluster dimension as a whole, traffic must not exceed 180 requests per second. Example CR YAML:

apiVersion: fault-tolerance.opensergo.io/v1alpha1
kind: RateLimitStrategy
metadata:
  name: rate-limit-foo
spec:
  metricType: RequestAmount
  limitMode: Global
  threshold: 180
  statDuration: "1s"

A simple CR like this configures flow control for our system. Flow control acts like an airbag for the application: requests beyond the system's capacity are rejected, and the specific handling logic can be customized (for example, returning specified content or redirecting to a page).


Circuit breaker protection

The following example defines a circuit breaker policy based on the ratio of slow calls. Example CR YAML:

apiVersion: fault-tolerance.opensergo.io/v1alpha1
kind: CircuitBreakerStrategy
metadata:
  name: circuit-breaker-slow-foo
spec:
  strategy: SlowRequestRatio
  triggerRatio: '60%'
  statDuration: '30s'
  recoveryTimeout: '5s'
  minRequestAmount: 5
  slowConditions:
    maxAllowedRt: '500ms'

The semantics of this CR: within a 30s statistics window, if at least 5 requests have been seen and the proportion of requests slower than 500ms reaches 60%, the circuit breaker is triggered automatically, and the recovery timeout is 5s.


Imagine peak business hours, when some downstream service provider hits a performance bottleneck and even affects the business. If we configure a rule like this for non-critical service consumers, the circuit breaker triggers automatically once the proportion of slow calls or errors over a period reaches the configured condition, and calls to that service directly return a mock result for a while. This keeps the caller from being dragged down by the unstable service, gives the unstable downstream some "breathing" time, and keeps the overall business chain running normally.
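
For comparison, in the open-source Sentinel SDK (introduced in the next section) a circuit breaker rule with roughly the same semantics as the CR above can be registered in code. This is a sketch against the Sentinel 1.8.x DegradeRule API; the resource name is a placeholder:

import java.util.Collections;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

public class SlowCallCircuitBreakerConfig {
    public static void init() {
        DegradeRule rule = new DegradeRule("GET:/foo")            // protected resource (placeholder name)
                .setGrade(RuleConstant.DEGRADE_GRADE_RT)           // slow-call-ratio strategy
                .setCount(500)                                     // calls slower than 500ms count as slow
                .setSlowRatioThreshold(0.6)                        // trigger when >= 60% of calls are slow
                .setStatIntervalMs(30_000)                         // 30s statistics window
                .setMinRequestAmount(5)                            // need at least 5 requests in the window
                .setTimeWindow(5);                                 // stay open (recovery timeout) for 5s
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }
}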

Implementations of the flow control, degradation, and fault tolerance standard

Introducing Sentinel

The following introduces Sentinel, a project that implements the OpenSergo flow control, degradation, and fault tolerance standard.

Sentinel is Alibaba's open-source traffic governance component for distributed service architectures. Taking traffic as the entry point, it helps developers ensure the stability of microservices from multiple dimensions such as flow control, traffic shaping, circuit breaking and degradation, and adaptive system protection.
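
As a quick taste of the SDK, the sketch below defines a resource and loads a QPS flow rule with Sentinel's Java API (1.8.x); the resource name and threshold are made up for illustration:

import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class SentinelQuickStart {
    public static void main(String[] args) {
        // 1. Define a flow rule: at most 20 requests per second for the resource "queryPrice".
        FlowRule rule = new FlowRule("queryPrice")
                .setGrade(RuleConstant.FLOW_GRADE_QPS)
                .setCount(20);
        FlowRuleManager.loadRules(Collections.singletonList(rule));

        // 2. Wrap the protected logic in a Sentinel entry.
        for (int i = 0; i < 100; i++) {
            Entry entry = null;
            try {
                entry = SphU.entry("queryPrice");
                System.out.println("request " + i + " passed");   // business logic goes here
            } catch (BlockException e) {
                System.out.println("request " + i + " blocked");  // rejected by the flow rule
            } finally {
                if (entry != null) {
                    entry.exit();
                }
            }
        }
    }
}

In real applications the rules are usually pushed from a dynamic rule data source (or, with OpenSergo, expressed as the unified CRDs above) rather than hard-coded.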

Sentinel technical highlights:

  • Highly extensible: a lean core plus SPI extension points, so users can easily extend flow control, communication, monitoring, and other capabilities
  • Diverse flow control strategies (by resource granularity, call relationship, flow control metric, flow control effect, and other dimensions), including distributed cluster flow control
  • Hotspot traffic detection and protection
  • Circuit breaking and degradation to isolate unstable services
  • Global adaptive overload protection that adjusts traffic in real time according to the system water level
  • API gateway scenarios: traffic control for Spring Cloud Gateway and Zuul
  • Cloud-native scenarios: cluster flow control for the Envoy service mesh
  • Real-time monitoring and dynamic rule configuration and management


Some common usage scenarios:

  • On the service provider side, we need to protect the provider itself from being overwhelmed by a flood of traffic. Flow control is usually configured according to the provider's own service capacity, or restrictions are placed on specific callers. We can evaluate the capacity of core interfaces through prior stress tests and configure QPS-mode rate limiting, so that when requests per second exceed the threshold, the excess requests are rejected automatically.
  • On the service consumer side, to avoid being dragged down by unstable dependencies, we need to isolate and circuit-break unstable service dependencies, using means such as semaphore isolation, degradation by exception ratio, and degradation by response time (RT).
  • When a system that has long run at a low water level is suddenly hit by a traffic surge, pulling it straight up to a high water level can overwhelm it instantly. Sentinel's WarmUp flow control mode lets the admitted traffic climb slowly and reach the threshold only after a configured period, instead of letting it all through at once, giving the "cold" system time to warm up (see the sketch after this list).
  • Sentinel's uniform queuing mode can be used to "shave peaks and fill valleys": request spikes are spread evenly over a period of time, keeping the system load within its processing capacity while handling as many requests as possible (also shown in the sketch after this list).
  • Sentinel's gateway flow control feature can protect traffic at the gateway entrance, or limit how frequently an API is called.
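
The WarmUp and uniform-queuing modes mentioned above correspond to the controlBehavior field of Sentinel's FlowRule. A hedged sketch against the 1.8.x API (resource names and numbers are illustrative):

import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class TrafficShapingRules {
    public static void init() {
        // WarmUp: let the allowed QPS climb gradually to 100 over 60 seconds,
        // so a "cold" system is not hit with full traffic at once.
        FlowRule warmUpRule = new FlowRule("homepageApi")
                .setGrade(RuleConstant.FLOW_GRADE_QPS)
                .setCount(100)
                .setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP)
                .setWarmUpPeriodSec(60);

        // Uniform queuing ("peak shaving"): space requests evenly at 50 QPS,
        // queuing the excess for at most 500ms instead of rejecting it outright.
        FlowRule queueingRule = new FlowRule("orderCreate")
                .setGrade(RuleConstant.FLOW_GRADE_QPS)
                .setCount(50)
                .setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_RATE_LIMITER)
                .setMaxQueueingTimeMs(500);

        FlowRuleManager.loadRules(Arrays.asList(warmUpRule, queueingRule));
    }
}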

Alibaba Cloud Microservice Solution

Alibaba Cloud provides MSE, an enterprise-grade product that fully follows the OpenSergo microservice standard. The traffic governance capability in MSE Service Governance Enterprise Edition can be understood as a commercial version of Sentinel. The figure below briefly compares MSE traffic governance with the community solution in flow control, degradation, and fault tolerance scenarios.

(Figure: comparison of MSE traffic governance and the community solution)

Below I will demonstrate, based on MSE, how to protect our system with flow control and circuit breaking so that it can calmly face uncertain traffic and a series of unstable scenarios.

  • Configure flow control rules

We can view the real-time monitoring of each interface on the monitoring details page.


We can click the "Add Protection Rule" button in the upper right corner of the interface overview to add a flow control rule.


We can configure the simplest QPS-mode flow control rule. In the example above, calls to this interface on a single machine are limited to at most 80 per second.

  • Monitor and view the flow control effect

After configuring the rule, wait a moment and then observe the rate-limiting effect on the monitoring page.


Rejected traffic also returns an error message. The framework instrumentation that ships with MSE includes default handling for limited requests: for example, a Web interface returns 429 Too Many Requests after being rate limited, and the DAO layer throws an exception. If users want more flexible, customized handling logic at each layer, they can use the SDK to configure their own flow control handling logic.
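
For example, with the open-source Sentinel SDK, per-resource custom handling can be attached through the @SentinelResource annotation from the sentinel-annotation-aspectj module (the aspect needs to be registered, e.g. as a Spring bean); the class, resource, and method names below are made up:

import com.alibaba.csp.sentinel.annotation.SentinelResource;
import com.alibaba.csp.sentinel.slots.block.BlockException;

public class PriceQueryService {

    // When "queryPrice" is rate limited or degraded, Sentinel invokes the blockHandler
    // instead of the original method, so we can return a tailored fallback response.
    @SentinelResource(value = "queryPrice", blockHandler = "queryPriceBlocked")
    public String queryPrice(String skuId) {
        return "price-of-" + skuId;          // normal business logic
    }

    // Same signature as the original method, plus a trailing BlockException parameter.
    public String queryPriceBlocked(String skuId, BlockException ex) {
        return "default-price";              // custom flow control handling logic
    }
}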

Summary

Flow control, degradation, and fault tolerance are scenarios we must consider when designing a stable microservice system. If every system we design requires us to think through its flow control, degradation, and fault tolerance from scratch, it becomes a headache for every developer. Having worked on and designed flow control and degradation for so many systems, can we distill common scenarios, best practices, design standards and specifications, or even reference implementations?

This article briefly introduced OpenSergo's flow control and circuit breaking protection standard from the perspective of concrete scenarios, and introduced the background and approach of Sentinel traffic protection. Finally, an example showed how to use the traffic protection capabilities of MSE Service Governance to safeguard your applications.

Click to view the live video:

https://yqh.aliyun.com/live/detail/28956

The OpenSergo standard is currently only at v1alpha1, and there is clearly still a long way to go in formulating and evolving the OpenSergo service governance standard. If you are interested in flow control, degradation, and fault tolerance scenarios, or in building microservice governance standards, you are welcome to join us. We formulate the standard and drive its adoption in an open, transparent, and democratic way, and the community uses GitHub issues, Gitter, mailing lists, and bi-weekly community meetings to ensure that the standard and its implementations are built through community collaboration. We welcome everyone to discuss and build together through these channels.


