Author: Su He
The problem of microservice runtime stability
The stability of microservices has long been a major concern for developers. As businesses evolve from monolithic to distributed architectures and deployment methods change, the dependencies between services become increasingly complex, and business systems face serious high-availability challenges. You may have experienced scenarios like the following:
- The traffic spike at the moment a concert ticket sale opens pushes the system past its maximum load; load soars and users cannot place orders normally;
- During online course selection, too many course-selection requests are submitted at the same time and the system cannot respond;
- One block of content on a page loads extremely slowly and never finishes, leaving the entire page stuck and unusable.
Many factors affect the availability of microservices, and these unstable scenarios can have serious consequences. From the perspective of microservice traffic, they roughly fall into two common runtime scenarios:
- The service's own traffic exceeds its carrying capacity, making it unavailable. For example, a traffic surge or a batch of dispatched tasks causes the service load to soar so that requests cannot be processed normally.
- The service becomes unavailable because a service it depends on is unavailable. For example, our service may depend on several third-party services. If a payment service becomes abnormal and its calls turn very slow, and the caller does not effectively guard against this, the caller's thread pool fills up and the operation of the service itself is affected. In a distributed system, invocation relationships form an intricate mesh, so the failure of one service can trigger cascading reactions that render the entire call chain unavailable.
How should we address the runtime stability of these microservices? For these unstable scenarios, MSE provides a full range of traffic-protection capabilities. Built on the stability-protection capabilities of Sentinel, the open-source traffic-governance component, it takes traffic as the entry point and helps safeguard service stability across dimensions such as traffic control, concurrency control, circuit breaking and degradation, hotspot protection, and adaptive system protection. It covers scenarios including microservice frameworks, cloud-native gateways, and service meshes, and supports heterogeneous microservice architectures in Java, Go, C++, Rust, and other languages.
CloudWeGo natively supports Sentinel and OpenSergo
CloudWeGo is a set of middleware open-sourced by ByteDance for quickly building enterprise-grade cloud-native microservice architectures. The CloudWeGo projects share the traits of high performance, high scalability, and high reliability, with a focus on microservice communication and governance. Among them, Kitex is a next-generation high-performance, highly extensible Golang RPC framework that provides a complete set of microservice capabilities; Hertz is an easy-to-use, high-performance, highly extensible Golang HTTP framework designed to make it simple for developers to build microservices.
Recently, the CloudWeGo and Sentinel communities have collaborated to provide adapter modules that connect CloudWeGo Kitex and Hertz to Sentinel Go. Simply import the adapter module and add the corresponding middleware to quickly connect Kitex and Hertz services to Sentinel Go and enjoy its full traffic-governance and protection capabilities. Furthermore, building on the integration between Sentinel Go and the OpenSergo microservice-governance standard, CloudWeGo will also support the standard configuration for flow control, degradation, and fault tolerance defined in the OpenSergo spec, so that in the future a unified CRD-based approach can be used for traffic governance and control.
The adapter module documentation is available at:
- Kitex adapter:
https://pkg.go.dev/github.com/alibaba/sentinel-golang/pkg/adapters/kitex
- Hertz adapter:
https://pkg.go.dev/github.com/alibaba/sentinel-golang/pkg/adapters/hertz
Meanwhile, based on Sentinel's Kitex and Hertz adapter modules, we can use the MSE service-governance Go SDK to easily connect Kitex and Hertz services to Alibaba Cloud's MSE Microservice Governance product, and use its console-based observability and governance configuration to ensure the runtime traffic stability of microservices.
CloudWeGo + MSE Traffic Protection Best Practices
MSE Microservice Governance provides comprehensive traffic-governance, traffic-protection, and database-governance capabilities from the microservice perspective. Go microservice developers can access MSE through the Go SDK to get console-based configuration and observability for traffic governance and for flow control, degradation, and fault tolerance, ensuring stable service operation. MSE traffic governance supports flow control, degradation, and fault tolerance for common Go microservice frameworks such as CloudWeGo, gRPC, Gin, dubbo-go, and go-micro. First, let's look at a typical traffic-protection scenario.
Traffic control ensures service stability under surge traffic
Traffic is random and unpredictable. One second may be calm, and the next may bring a traffic peak (think of Double Eleven at midnight). Yet the capacity of our system is always limited: if sudden traffic exceeds it, requests may go unprocessed, accumulated requests may be handled slowly, CPU and load may soar, and eventually the system may crash. We therefore need to limit such burst traffic, processing as many requests as possible while ensuring the service is not overwhelmed. This is flow control. Flow control is a very general scenario, applying to cases such as pulse (spike) traffic.
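To make the flow-control idea concrete, here is a minimal token-bucket limiter sketch in Go. It is illustrative only, not MSE's or Sentinel's internal implementation; the `TokenBucket` type and its parameters are our own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal token-bucket rate limiter: tokens refill at a
// fixed rate up to a capacity; each request consumes one token, and
// requests that find the bucket empty are rejected.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64   // maximum burst size
	tokens   float64   // current token count
	rate     float64   // refill rate, tokens per second
	last     time.Time // last refill timestamp
}

func NewTokenBucket(rate, capacity float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: time.Now()}
}

// Allow reports whether a request may proceed right now.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	// Refill tokens accumulated since the last call, capped at capacity.
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// 10 tokens/s with a burst capacity of 5: the first few requests of a
	// sudden spike pass, the rest are shed until tokens refill.
	bucket := NewTokenBucket(10, 5)
	passed := 0
	for i := 0; i < 20; i++ {
		if bucket.Allow() {
			passed++
		}
	}
	fmt.Println("passed:", passed)
}
```

A production limiter additionally needs warm-up, cluster-wide coordination, and precise windowed statistics, which is exactly what the managed flow-control rules described above provide.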
MSE Sentinel is based on millisecond-level sliding-window statistics and on flow-control algorithms such as token bucket, leaky bucket, and WarmUp, and provides multiple flow-control dimensions including second-level precise flow control, cluster-wide total flow control, and uniform queuing. Typically, at a web entry point or for an RPC service provider, we need to protect the provider itself from being overwhelmed by a traffic flood. Flow control is then applied according to the provider's service capacity, or limits are imposed on specific callers. We can use prior stress tests to evaluate the carrying capacity of core interfaces and configure flow-control rules in QPS mode: when the number of requests per second exceeds the configured threshold, the excess requests are automatically rejected.
Circuit breaking and isolation ensure that services are not dragged down by slow dependency calls
A service often calls other modules, which may be another remote service, a database, or a third-party API. For example, making a payment may require remotely calling an API provided by UnionPay, and querying a commodity's price may require a database query. However, the stability of these dependencies is not guaranteed. If a dependency becomes unstable and its response times grow, the response time of the method calling it also grows, threads accumulate, and eventually the thread pool of the business itself may be exhausted and the service itself may become unavailable.
Modern microservice architectures are distributed and consist of a very large number of services that call each other, forming complex call chains. The problems above are amplified along those chains: if a single link in a complex chain is unstable, the failure may cascade and eventually render the entire chain unavailable.
MSE Sentinel provides the following capabilities to avoid service unavailability caused by unstable factors such as slow calls:
- Concurrency control (isolation rules): a lightweight isolation mechanism that controls the number of concurrent calls (i.e. calls currently in flight) to prevent too many slow calls from filling up the thread pool and causing overall unavailability. Concurrency-control rules are an important safeguard against a service being dragged down by a flood of slow calls.
- Circuit breaking of unstable calls: automatically circuit-break and degrade unstable, weakly dependent calls, temporarily cutting them off to prevent local instability from causing an overall avalanche.
- Proactive degradation: for some weakly dependent services (non-critical-path dependencies), dynamic degradation can be applied before major events or when resources are tight, prioritizing the stability of important services. A degraded service simply returns a given mock value without triggering the actual call.
The circuit-breaking and degradation feature is based on the circuit-breaker pattern: when a service shows signs of instability (such as longer response times or a rising error rate), calls to it are temporarily cut off, and gradual recovery is attempted after a waiting period. This avoids piling further pressure onto the unstable service while also protecting its callers from being dragged down. Two circuit-breaking strategies are currently supported: response-time based (proportion of slow calls) and error based (error ratio or error count), which effectively cover a variety of unstable scenarios.
Note that the circuit-breaker pattern generally suits weakly dependent calls, i.e. calls whose interruption does not affect the main business flow, whereas concurrency control applies to both weak and strong dependencies. Developers need to design the fallback logic and return values used after degradation. In addition, even when the caller has a circuit-breaking mechanism in place, we still need to configure request timeouts on the HTTP or RPC client as a last line of defense.
Example: connecting CloudWeGo to MSE traffic protection
Next, we take a CloudWeGo Kitex service as an example to show how to use MSE Sentinel to ensure the runtime stability of CloudWeGo microservices.
First, we import the MSE Go SDK into the project and perform some simple initialization; we also import the Kitex Sentinel adapter module to connect our Kitex service to MSE:
```go
import (
	"log"

	sentinelPlugin "github.com/alibaba/sentinel-golang/pkg/adapters/kitex"
	mse "github.com/aliyun/aliyun-mse-go-sdk"
	api "github.com/cloudwego/kitex-examples/hello/kitex_gen/api/hello"
	"github.com/cloudwego/kitex/server"
)

func main() {
	// Initialize MSE Sentinel; the application name can be configured via
	// environment variables or a sentinel.yml file.
	err := mse.InitMseDefault()
	if err != nil {
		log.Fatalf("Failed to init MSE: %+v", err)
	}
	// Attach the middleware from the Sentinel adapter module when creating
	// the Kitex server. HelloImpl is the service implementation from the
	// example project.
	svr := api.NewServer(new(HelloImpl), server.WithMiddleware(sentinelPlugin.SentinelServerMiddleware()))
	err = svr.Run()
	if err != nil {
		log.Println(err.Error())
	}
}
```
Next we start the service provider, and our Kitex service appears in the MSE console. After triggering traffic from the consumer side, we can see detailed monitoring of the Kitex service's calls on the interface details page of the MSE console.
Next, we configure a flow control rule with a single-machine QPS of 10 for the interface method Hello:Echo:
Once configured, the rule takes effect on the service in real time. Shortly afterwards, the interface details page shows that the per-instance throughput of Hello:Echo is capped at 10 requests per second, and the consumer side receives the corresponding flow-control error.
Standardized traffic governance with OpenSergo
Microservice governance in the industry suffers from inconsistent concepts, inconsistent configuration formats, inconsistent capabilities, and complex unified control across multiple frameworks. For example, to configure circuit-breaking and degradation rules for an interface, Sentinel may use its dynamic rules, Istio may use an entirely different configuration method, and other components may differ again. These inconsistencies in governance configuration make unified governance and control of microservices quite complex.
Against this background, the OpenSergo project, jointly initiated by Alibaba, bilibili, CloudWeGo, and other enterprises and communities, came into being. OpenSergo aims to provide an open, general set of microservice-governance standards for cloud-native services, covering microservices and their upstream and downstream components, and provides a series of API and SDK implementations according to those standards. OpenSergo's defining feature is that it expresses service-governance rules in a single unified set of configurations/DSL/protocols, oriented toward multi-language heterogeneous architectures. Whether a microservice is written in Java, Go, Node.js, or another language, whether it is a standard microservice or mesh-attached, from gateway to microservice framework, from database to cache access, from service discovery to configuration, developers can govern and control everything through the same set of OpenSergo CRD standard configurations, without worrying about differences between frameworks and languages, thereby reducing the complexity of heterogeneous, full-link microservice governance.
In OpenSergo, drawing on the scenarios of Sentinel and MSE, we have distilled standard CRDs for flow control, degradation, and fault tolerance. A fault-tolerance governance rule (FaultToleranceRule) consists of the following three parts:
- Target: which requests the rule applies to
- Strategy: the fault-tolerance or control strategy, such as flow control, circuit breaking, concurrency control, adaptive overload protection, or outlier instance removal
- FallbackAction: the fallback behavior after the strategy is triggered, such as returning a specific error or status code
The following YAML CR example defines a rule that configures a flow control policy for the service method Hello:Echo (identified by the resource name), with a global limit of 10 QPS:
```yaml
apiVersion: fault-tolerance.opensergo.io/v1alpha1
kind: RateLimitStrategy
metadata:
  name: rate-limit-foo
spec:
  metricType: RequestAmount
  limitMode: Global
  threshold: 10
  statDuration: "1s"
---
apiVersion: fault-tolerance.opensergo.io/v1alpha1
kind: FaultToleranceRule
metadata:
  name: my-fault-tolerance-rule
spec:
  selector:
    app: foo-app # name of the service the rule applies to
  targets:
    - targetResourceName: 'Hello:Echo'
  strategies:
    - name: rate-limit-foo
  # A fallbackAction can also be defined here, e.g. a custom return value
  # or error; if unspecified, the default behavior is used.
```
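To make the shape of these CRs concrete, the sketch below models the RateLimitStrategy spec as Go structs and shows the threshold check a data plane could perform once the rule is delivered. These types are our own illustration of the spec's fields, not an official OpenSergo SDK API.

```go
package main

import "fmt"

// RateLimitSpec mirrors the spec section of the RateLimitStrategy CR above;
// these Go types are illustrative, not an official OpenSergo SDK.
type RateLimitSpec struct {
	MetricType   string // e.g. "RequestAmount"
	LimitMode    string // "Global" or "Local"
	Threshold    int    // max requests per StatDuration
	StatDuration string // e.g. "1s"
}

// Target binds a strategy to a resource, as in FaultToleranceRule.targets.
type Target struct {
	TargetResourceName string
	Spec               RateLimitSpec
}

// shouldPass applies the check a data plane would perform on each request:
// pass while the current window's request count is below the threshold.
func shouldPass(t Target, countInWindow int) bool {
	return countInWindow < t.Spec.Threshold
}

func main() {
	rule := Target{
		TargetResourceName: "Hello:Echo",
		Spec: RateLimitSpec{
			MetricType:   "RequestAmount",
			LimitMode:    "Global",
			Threshold:    10,
			StatDuration: "1s",
		},
	}
	fmt.Println(shouldPass(rule, 9))  // below threshold
	fmt.Println(shouldPass(rule, 10)) // at threshold: rejected
}
```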
Sentinel 2.0 will natively support the CRD configurations and capabilities of OpenSergo traffic governance. Combined with the adapter modules Sentinel provides for each framework, 20+ frameworks such as Dubbo, Spring Cloud Alibaba, gRPC, and CloudWeGo can be seamlessly integrated into the OpenSergo ecosystem, with governance rules such as traffic routing, flow control and degradation, and service fault tolerance configured through a unified CRD. Whether a service is written in Java or Go or runs in a mesh, and whether the traffic is an HTTP request, an RPC call, or database SQL access, this unified fault-tolerance governance CRD can configure protection for every link of the microservice architecture and safeguard the stability of our service chains. MSE, as an enterprise-grade product aligned with the OpenSergo microservice-governance standard, will also natively support the OpenSergo spec.
Summary and Outlook
Flow control, degradation, and fault tolerance are an important part of microservice traffic governance. Beyond them, MSE provides governance capabilities for a wider range of scenarios, including full-link grayscale release, lossless online/offline, microservice database governance, and log governance. Service governance is an inevitable stage of microservice transformation and the key to keeping microservices stable and well run. We are also working with the CloudWeGo, Kratos, Spring Cloud Alibaba, Dubbo, ShardingSphere, and other communities to build the OpenSergo microservice-governance standard, distilling the governance scenarios and best practices of enterprises and communities into standard specifications. Communities and enterprises are warmly welcome to participate in the co-construction of the OpenSergo standard and to join the OpenSergo community DingTalk group for discussion: 34826335
Reference link:
MSE Microservice Governance:
https://help.aliyun.com/document_detail/170447.html
Sentinel Go:
https://github.com/alibaba/sentinel-golang
OpenSergo:
CloudWeGo: