How to use "cluster flow control" to ensure the stability of microservices?

Introduction: Application High Availability Service (AHAS) is a cloud product distilled from years of Alibaba's internal high-availability practice. Taking traffic and fault tolerance as its entry points, it helps ensure service stability across several dimensions, including flow control, isolation of unstable calls, circuit breaking and degradation, hotspot traffic protection, system adaptive protection, and cluster flow control, while also providing second-level traffic monitoring and analysis.

Author: Su He

The stability of microservices has always been a major concern for developers. As businesses move from monolithic to distributed architectures and deployment methods change, the dependencies between services grow more complex, and business systems face serious high-availability challenges. Application High Availability Service (AHAS) is a cloud product distilled from years of Alibaba's internal high-availability practice. Taking traffic and fault tolerance as its entry points, it helps ensure service stability across several dimensions, including flow control, isolation of unstable calls, circuit breaking and degradation, hotspot traffic protection, system adaptive protection, and cluster flow control, while also providing second-level traffic monitoring and analysis. AHAS is widely used inside Alibaba for e-commerce businesses such as Taobao and Tmall, and has also been adopted extensively in Internet finance, online education, gaming, and live streaming, as well as by large government organizations and central state-owned enterprises.

Flow control is the most common and most direct way to protect the stability of microservices. Every system and service has an upper limit on the load it can carry. The idea behind flow control is simple: when the QPS of requests to an interface exceeds a given limit, the excess requests are rejected so that the system is not overwhelmed by a traffic burst. The most common solution on the market is flow control at the single-machine level. For example, if a PTS performance test estimates that an interface can handle at most 100 QPS and the service has 10 instances, each instance is configured with a stand-alone limit of 10 QPS. In many cases, however, traffic is not distributed predictably across instances, and single-machine flow control then works poorly.
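The stand-alone scheme described above can be illustrated with a fixed one-second window counter on each instance. This is only a minimal sketch of the idea, not how AHAS implements it; the class name `FixedWindowLimiter` and the simulated timestamps are assumptions made for illustration.

```python
class FixedWindowLimiter:
    """Hypothetical per-instance limiter: counts requests in one-second windows."""

    def __init__(self, qps_limit):
        self.qps_limit = qps_limit
        self.window = None  # the second currently being counted
        self.count = 0

    def try_acquire(self, now_seconds):
        window = int(now_seconds)
        if window != self.window:
            # A new second has started, so reset the counter.
            self.window = window
            self.count = 0
        if self.count < self.qps_limit:
            self.count += 1
            return True
        return False  # excess request, rejected


# Figures from the text: 100 QPS estimated capacity, 10 instances.
TOTAL_CAPACITY_QPS = 100
INSTANCES = 10
per_instance_limit = TOTAL_CAPACITY_QPS // INSTANCES

# One instance receives 15 requests within the same (simulated) second.
limiter = FixedWindowLimiter(per_instance_limit)
results = [limiter.try_acquire(i * 0.01) for i in range(15)]
passed = sum(results)
print(per_instance_limit, passed)
```

In this sketch the instance admits only its own share of the capacity, no matter how idle the other nine instances are, which is the weakness the following scenarios describe.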

Typical scenario 1: Precisely control the total number of calls to downstream services

Scenario: Service A frequently calls the query interface of Service B, but the two services have different capacities. Service B agrees to provide Service A with a total query capacity of 600 QPS, enforced through flow control or similar means.

Pain point: with a single-machine flow control policy, the call logic, load-balancing strategy, and other factors can make the traffic from A very unevenly distributed across the instances of B. The instances of Service B that receive the most traffic trigger single-machine flow control even though the overall limit has not been reached, so the SLA is not met. This kind of imbalance often appears when calling a dependent service or component (such as a database). It is also a typical scenario for cluster flow control: precisely controlling the total number of calls that a microservice cluster makes to a downstream service (or a database or cache).
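The gap between the two approaches can be shown with a small calculation. The sketch below assumes, purely for illustration, that Service B runs 6 instances (so each gets a stand-alone limit of 100 QPS) and that one second of traffic from A is skewed as listed; only the 600 QPS total comes from the scenario above.

```python
def admitted_single_machine(per_instance_traffic, per_instance_limit):
    """Requests admitted in one second when every instance enforces its own limit."""
    return sum(min(t, per_instance_limit) for t in per_instance_traffic)


def admitted_cluster(per_instance_traffic, cluster_limit):
    """Requests admitted in one second when only the cluster-wide total is limited."""
    return min(sum(per_instance_traffic), cluster_limit)


CLUSTER_LIMIT = 600                               # agreed total from the scenario
INSTANCE_COUNT = 6                                # assumption for this sketch
PER_INSTANCE_LIMIT = CLUSTER_LIMIT // INSTANCE_COUNT

# Hypothetical skewed traffic per instance of B in one second (sums to 600).
traffic = [250, 150, 80, 60, 40, 20]

single = admitted_single_machine(traffic, PER_INSTANCE_LIMIT)
cluster = admitted_cluster(traffic, CLUSTER_LIMIT)
print(single, cluster)
```

With these assumed numbers, the stand-alone limits admit 400 of the 600 requests although the agreed total was never exceeded, while a cluster-wide limit admits all 600.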

Typical scenario 2: Control the total number of requests at the service link entry

Scenario: ingress flow control is performed at an Nginx/Ingress gateway or an API gateway (Spring Cloud Gateway, Zuul). The goal is to precisely control the traffic of one API or a group of APIs so that the backend is protected in advance and excess traffic never reaches it.

Pain point: with per-machine configuration, it is hard to keep up with changes in the number of gateway machines, and uneven traffic across gateways can make rate limiting ineffective. From the perspective of a gateway entry point, configuring a single overall threshold is also the most natural approach.
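The effect of a changing gateway count on a per-machine threshold is simple arithmetic. The figures below (an intended total of 200 QPS, a fleet that starts at 4 gateways) are hypothetical and chosen only to make the drift visible.

```python
def effective_total_limit(per_machine_limit, machine_count):
    """Total traffic the fleet admits when each machine enforces its own limit."""
    return per_machine_limit * machine_count


INTENDED_TOTAL = 200       # hypothetical overall threshold for one API
INITIAL_GATEWAYS = 4       # hypothetical fleet size when the rule was written
per_machine = INTENDED_TOTAL // INITIAL_GATEWAYS

# The per-machine rule stays the same while the fleet is resized.
after_scale_out = effective_total_limit(per_machine, 8)
after_scale_in = effective_total_limit(per_machine, 2)
print(per_machine, after_scale_out, after_scale_in)
```

Unless someone recomputes the per-machine value on every scaling event, the effective total drifts away from the intended 200 QPS in both directions; an overall threshold does not have this problem.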

AHAS cluster flow control

AHAS cluster flow control precisely limits the total real-time call volume of a service interface across the entire cluster. It addresses the cases where single-machine flow control performs poorly: uneven traffic, frequent changes in the number of machines, and per-machine thresholds that become too small after the total is divided up. Combined with single-machine flow control as a fallback, it provides more effective traffic protection.
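One way to picture a cluster-wide limit with a stand-alone fallback is a central token counter that every instance consults, plus a local counter used only when the central one is unreachable. The sketch below is an assumption-laden illustration, not the AHAS implementation: `WindowCounter`, `TokenServer`, `ClusterClient`, and `send` are hypothetical names, and time is simulated.

```python
class WindowCounter:
    """Fixed one-second window counter (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit
        self.window = None
        self.count = 0

    def try_acquire(self, now_seconds):
        window = int(now_seconds)
        if window != self.window:
            self.window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class TokenServer:
    """Hypothetical central counter holding the cluster-wide threshold."""

    def __init__(self, cluster_limit):
        self.counter = WindowCounter(cluster_limit)
        self.available = True  # set to False to simulate an outage

    def request_token(self, now_seconds):
        if not self.available:
            raise ConnectionError("token server unreachable")
        return self.counter.try_acquire(now_seconds)


class ClusterClient:
    """One service instance: asks the token server first, falls back locally."""

    def __init__(self, server, local_limit):
        self.server = server
        self.local = WindowCounter(local_limit)

    def try_acquire(self, now_seconds):
        try:
            return self.server.request_token(now_seconds)
        except ConnectionError:
            # Stand-alone flow control acts as the safety net.
            return self.local.try_acquire(now_seconds)


def send(client, n, now_seconds):
    """Send n requests at the same simulated instant; return how many pass."""
    return sum(client.try_acquire(now_seconds) for _ in range(n))


server = TokenServer(cluster_limit=200)
clients = [ClusterClient(server, local_limit=50) for _ in range(4)]
skewed = [150, 30, 30, 30]  # hypothetical uneven traffic, 240 requests in total

passed_cluster = sum(send(c, n, 0.0) for c, n in zip(clients, skewed))
server.available = False
passed_fallback = [send(c, n, 1.0) for c, n in zip(clients, skewed)]
print(passed_cluster, passed_fallback)
```

In this sketch the cluster admits exactly 200 of the 240 requests regardless of which instance they reach; once the token server is marked unavailable, each instance falls back to its own limit of 50.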

For the scenarios above, AHAS cluster flow control supports precise control of the total number of calls for Dubbo service calls, Web API access, and custom business logic alike, regardless of call logic, traffic distribution, or instance distribution. It handles both high-volume flow control at hundreds of thousands of QPS and precise low-volume control of business dimensions at minute-level or hour-level granularity. The behavior after protection is triggered can be customized by the user (for example, returning custom content or objects).

AHAS cluster protection has the following advantages:

Rich scenarios: it covers everything from precise protection of gateway/Mesh ingress traffic and precise flow control of Web, RPC service, and SQL calls to business-dimension flow control at minute-level or hour-level granularity, and supports hundreds of thousands of QPS;

Low cost of use: no special integration method is required; it works out of the box and can be configured quickly;

Automatic control and operations: token server resources are managed and allocated automatically, and automated operations ensure availability. Users do not need to prepare or allocate server-side resources and only need to focus on rule configuration and business processes;

Low performance overhead: in performance mode, no latency is added to the service link; in precise mode, the added RT on the service link is kept within 3 ms;

Built-in observability: the stability of each interface and the effective status of the rules can be viewed in real time.

Let's use an example to introduce how to quickly connect the application to AHAS to make use of the cluster flow control capability and ensure service stability.

Quick start with AHAS cluster flow control

In the first step, we connect the service or gateway to AHAS traffic protection. AHAS provides a variety of fast and convenient non-intrusive access methods:

AHAS traffic protection supports native access for multiple languages such as Java, Go, C++, and PHP, as well as Nginx/Ingress gateway access and Mesh access. For Java applications, more than 20 frameworks and components are supported:

Web server: Spring Web/Spring Boot/Spring Cloud/Tomcat/Jetty/Undertow

Web client: OkHttp/Apache HttpClient

RPC: Dubbo/Feign/gRPC

DAO/Cache: MyBatis/Spring Data JPA/Memcached/Jedis client

MQ consumer: RocketMQ client/Kafka client

API Gateway: Spring Cloud Gateway/Zuul 1.x

Reactor framework

After the application has been connected to AHAS, trigger a service call or interface access and then open the AHAS console; your interface appears on the monitoring page.

In the second step, we enable cluster flow control on the "Cluster Flow Control-Cluster Configuration" page in the application's left-hand menu. For a test application, we can start a "trial" cluster; different cluster specifications can carry different QPS levels.

In the third step, we find an interface on the real-time monitoring page and click the "+" sign in the upper right corner to create a flow control rule. In the following example, we configure a cluster flow control rule for the /doSomething interface so that the total traffic of this interface does not exceed 200 requests per second. The rule status is "on", which means it takes effect as soon as it is added.

After clicking Next, we can also configure the behavior of the selected Web/RPC call when the protection rule is triggered, such as a custom return value. Once the configuration is complete, we click the Add button, and the rule takes effect on every node.

After configuration, we can send requests for this interface to different machines in the application cluster. Once the rate exceeds 200 requests per second, the excess requests automatically receive the return behavior preset in the rule. The real-time monitoring page of the console also shows that the excess traffic is rejected and that the number of requests passing through the interface per second stays stable at 200 QPS.

With a few simple configuration steps, we can quickly experience the "silky smooth" protection that AHAS cluster flow control brings to business traffic. AHAS has also recently added core features such as Nginx/Ingress gateway ingress traffic protection and Web request parameter flow control; go to the AHAS console to try them out.

