How to use "cluster flow control" to ensure the stability of microservices?

Author: Su He

The stability of microservices has always been a major concern for developers. As businesses evolve from monolithic to distributed architectures and deployment methods change, the dependencies between services grow more complex, and business systems face serious high-availability challenges. Application High Availability Service (AHAS) is a cloud product distilled from years of experience with Alibaba's internal high-availability systems. Taking traffic and fault tolerance as its entry point, it helps ensure service stability across several dimensions: flow control, isolation of unstable calls, circuit breaking and degradation, hotspot traffic protection, adaptive system protection, and cluster flow control. It also provides second-level traffic monitoring and analysis. AHAS is widely used within Alibaba for e-commerce businesses such as Taobao and Tmall, and has also been put into practice extensively in Internet finance, online education, gaming, live streaming, and large government and state-owned enterprises.

Flow control is the most common and most direct way to ensure the stability of microservices. Every system and service has an upper limit on the load it can carry. The idea behind flow control is simple: when the QPS of requests to an interface exceeds a certain limit, the excess requests are rejected so that the system is not overwhelmed by sudden traffic. The most common solution on the market is flow control at the single-machine level. For example, if a PTS performance test estimates that the capacity limit of an interface is 100 QPS and the service has 10 instances, then a single-machine flow control threshold of 10 QPS is configured on each instance. In many cases, however, because the traffic distribution is uncertain, single-machine flow control does not work well.
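The single-machine approach described above can be sketched in a few lines. This is only an illustration under simple assumptions: the names per_instance_threshold and FixedWindowLimiter are hypothetical, the limiter uses a plain one-second fixed window, and none of this is the AHAS implementation.

```python
def per_instance_threshold(cluster_capacity_qps: int, instance_count: int) -> int:
    """Split an estimated cluster capacity evenly across instances (rounded down)."""
    if instance_count <= 0:
        raise ValueError("instance_count must be positive")
    return cluster_capacity_qps // instance_count


class FixedWindowLimiter:
    """Single-machine limiter: allow at most `limit` requests per one-second window."""

    def __init__(self, limit: int):
        self.limit = limit
        self.window_start = None  # integer second of the current window
        self.count = 0

    def try_acquire(self, now: float) -> bool:
        window = int(now)
        if window != self.window_start:
            # A new one-second window has started: reset the counter.
            self.window_start = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # over the limit: reject the excess request


# 100 QPS estimated capacity, 10 instances -> 10 QPS per instance.
threshold = per_instance_threshold(100, 10)
limiter = FixedWindowLimiter(threshold)

# 15 requests hit one instance within the same second; only 10 pass.
passed = sum(limiter.try_acquire(now=0.5) for _ in range(15))
print(threshold, passed)  # 10 10
```

The timestamp is passed in explicitly so that the behavior is easy to reproduce; a real limiter would read a clock.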

Typical scenario 1: Precisely controlling the total number of calls to a downstream service

Scenario: Service A needs to call the query interface of service B frequently, but services A and B have different capacities. Service B agrees to provide service A with a total query capacity of at most 600 QPS, enforced through flow control and similar means.

Pain points: If the limit is configured as a single-machine flow control policy, the calling logic and load-balancing strategy may distribute the traffic from A very unevenly across the instances of service B. Instances of service B that receive more traffic may trigger single-machine flow control even though the overall limit has not been reached, so the SLA is not met. This kind of unevenness often occurs when calling a dependent service or component (such as a database). It is also a typical scenario for cluster flow control: precisely controlling the total number of calls that a microservice cluster makes to downstream services (or databases and caches).
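A small calculation makes this pain point concrete. The traffic distribution below is a made-up example (six instances of service B, 600 requests arriving in one second), and the helper names are hypothetical; the point is only to compare what passes under per-instance limits with what passes under a single cluster-wide limit.

```python
def passed_with_local_limits(traffic_per_instance, local_limit):
    """Requests passed in one second when every instance enforces its own limit."""
    return sum(min(qps, local_limit) for qps in traffic_per_instance)


def passed_with_cluster_limit(traffic_per_instance, cluster_limit):
    """Requests passed in one second when only the cluster-wide total is limited."""
    return min(sum(traffic_per_instance), cluster_limit)


# Hypothetical skewed distribution: 600 requests in one second over 6 instances.
traffic = [250, 150, 80, 60, 40, 20]
cluster_limit = 600
local_limit = cluster_limit // len(traffic)  # 100 QPS per instance

local_passed = passed_with_local_limits(traffic, local_limit)
cluster_passed = passed_with_cluster_limit(traffic, cluster_limit)
print(local_passed, cluster_passed)  # 400 600
```

With this distribution, the two busiest instances reject 200 requests between them even though the agreed total of 600 QPS was never exceeded.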

Typical scenario 2: Controlling the total number of requests at the service link entrance

Scenario: Ingress flow control is performed at the Nginx/Ingress gateway or API gateway (Spring Cloud Gateway, Zuul) in order to precisely control the traffic of one API or a group of APIs and protect them in advance, so that excess traffic never reaches the backend system.

Pain points: With a single-machine configuration, changes in the number of gateway machines are hard to track, and uneven traffic across gateway instances can make rate limiting ineffective. From the perspective of the gateway entrance, configuring an overall threshold is also the most natural approach.
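The effect of a changing machine count can be shown with simple arithmetic. The numbers here are assumptions chosen for illustration (an intended total of 1000 QPS and a per-instance threshold sized for 5 gateway instances), and effective_total_limit is a hypothetical helper.

```python
def effective_total_limit(per_instance_limit: int, instance_count: int) -> int:
    """Upper bound on total QPS admitted when each instance limits independently."""
    return per_instance_limit * instance_count


# Hypothetical setup: intended total of 1000 QPS, threshold sized for 5 instances.
intended_total = 1000
per_instance = intended_total // 5  # 200 QPS per gateway instance

# If the gateway scales in or out and the threshold is not updated,
# the total that can actually pass drifts away from the intended value.
drift = {count: effective_total_limit(per_instance, count) for count in (3, 5, 8)}
print(drift)  # {3: 600, 5: 1000, 8: 1600}
```

After scaling in to 3 instances the gateway admits at most 600 QPS, and after scaling out to 8 it may admit up to 1600 QPS, unless someone recalculates the per-instance threshold each time.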

AHAS cluster flow control

AHAS cluster flow control can precisely control the total real-time call volume of a service interface across the entire cluster. It solves the problems that make single-machine flow control ineffective: uneven traffic, frequent changes in the number of machines, and per-machine thresholds that are too small after the total is split. Combined with single-machine flow control as a fallback, it provides better traffic protection.

In the scenarios above, AHAS cluster flow control supports precise control of the total number of calls for Dubbo service calls, Web API access, and custom business logic alike, regardless of call logic, traffic distribution, or instance distribution. It supports high-volume flow control at hundreds of thousands of QPS as well as precise low-volume control at the minute-to-hour level for business dimensions. The behavior after protection is triggered can be customized by the user (for example, returning custom content or objects).
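The general idea of counting against one cluster-wide total, with single-machine flow control as a fallback, can be sketched as follows. This is a toy in-process model under stated assumptions, not the AHAS implementation or API: TokenServer and ClusterLimitedInstance are hypothetical names, the "server" is a local object rather than a network service, and unavailability is simulated with a flag.

```python
class TokenServer:
    """Toy stand-in for a cluster token server (hypothetical, in-process)."""

    def __init__(self, cluster_limit: int, window_seconds: int = 1):
        self.cluster_limit = cluster_limit
        self.window_seconds = window_seconds
        self.available = True  # flip to False to simulate an outage
        self._window = None
        self._count = 0

    def request_token(self, now: float) -> bool:
        if not self.available:
            raise ConnectionError("token server unreachable")
        window = int(now // self.window_seconds)
        if window != self._window:
            self._window, self._count = window, 0
        if self._count < self.cluster_limit:
            self._count += 1
            return True
        return False


class ClusterLimitedInstance:
    """One service instance: asks the token server, falls back to a local limit."""

    def __init__(self, server: TokenServer, fallback_limit: int):
        self.server = server
        self.fallback_limit = fallback_limit
        self._fb_window = None
        self._fb_count = 0

    def try_acquire(self, now: float) -> bool:
        try:
            return self.server.request_token(now)
        except ConnectionError:
            return self._local_acquire(now)  # single-machine fallback

    def _local_acquire(self, now: float) -> bool:
        window = int(now)
        if window != self._fb_window:
            self._fb_window, self._fb_count = window, 0
        if self._fb_count < self.fallback_limit:
            self._fb_count += 1
            return True
        return False


server = TokenServer(cluster_limit=600)
instances = [ClusterLimitedInstance(server, fallback_limit=200) for _ in range(3)]
skewed_traffic = [400, 150, 50]  # hypothetical uneven load, 600 in total

# All 600 requests pass, because only the cluster-wide total is counted.
passed_normal = sum(
    instance.try_acquire(now=0.1)
    for instance, qps in zip(instances, skewed_traffic)
    for _ in range(qps)
)

# If the token server becomes unreachable, each instance uses its local limit.
server.available = False
passed_fallback = sum(instances[0].try_acquire(now=1.1) for _ in range(400))
print(passed_normal, passed_fallback)  # 600 200
```

In this sketch, a longer window (for example window_seconds=60) models a minute-level total in the same way.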

AHAS cluster protection has the following advantages:

Rich scenarios: It covers everything from precise protection of gateway/Mesh ingress traffic and precise flow control of Web/RPC service/SQL calls to minute-to-hour-level flow control in business dimensions, and supports hundreds of thousands of QPS.
Low cost of use: No special access method is required; it works out of the box and can be configured quickly.
Automatic control and operations: Token server resources are managed and allocated automatically, and automated operations and maintenance ensure availability. Users do not need to prepare or allocate server-side resources and only need to focus on rule configuration and business processes.
Low performance overhead: In performance mode, no latency is added to the service link. In precise mode, the added RT on the service link is kept within 3 ms.
Observability: Users can see the stability of their interfaces and the effect of their rules in real time.

Let's use an example to show how to quickly connect an application to AHAS, use its cluster flow control capability, and ensure service stability.

Getting started with AHAS cluster flow control

In the first step, we connect the service or gateway to AHAS traffic protection. AHAS provides a variety of fast, convenient, non-intrusive access methods.

AHAS traffic protection supports native access for multiple languages such as Java, Go, C++, and PHP, as well as Nginx/Ingress gateway access and Mesh access. For Java applications, it supports a full range of 20+ microservice frameworks/components (see related links at the end of the article for details):

Web server: Spring Web/Spring Boot/Spring Cloud/Tomcat/Jetty/Undertow
Web client: OkHttp/Apache HttpClient
RPC: Dubbo/Feign/gRPC
DAO/Cache: MyBatis/Spring Data JPA/Memcached/Jedis client
MQ consumer: RocketMQ client/Kafka client
API Gateway: Spring Cloud Gateway/Zuul 1.x
Reactor framework

After the application is successfully connected to AHAS, as soon as a service call or interface access is triggered, you can see your service in the AHAS console (see related links at the end of the article for details) and your interface on the monitoring page.

In the second step, we enable cluster flow control on the "Cluster Flow Control - Cluster Configuration" page in the application's left-hand menu. For a test application, we can start a "trial" cluster; different cluster specifications can carry different QPS levels.

In the third step, we find an interface on the real-time monitoring page, click the "+" sign in the upper right corner, and add a flow control rule (see related links at the end of the article for details). In the following example, we configure a cluster flow control rule for the /doSomething interface so that the total traffic to this interface does not exceed 200 requests per second. The rule status is "on", which means the rule takes effect immediately after it is added.

After clicking Next, we can also configure the processing logic applied to the selected Web/RPC call after the protection rule is triggered (see related links at the end of the article for details), such as a custom return value. Once the configuration is complete, we click the Add button and the rule takes effect on every node.
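In AHAS this behavior is configured in the console. The sketch below only illustrates the general pattern of returning a custom value when a request is blocked. Everything in it is hypothetical: the flow_controlled decorator, the take_token stand-in for a flow control check, and the 429 response are assumptions made for the example, not AHAS APIs or defaults.

```python
import functools


def flow_controlled(try_acquire, on_blocked):
    """Wrap a handler: run it if a token is granted, else return a custom response."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            if try_acquire():
                return handler(*args, **kwargs)
            # Blocked by flow control: return the user-defined fallback instead.
            return on_blocked(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical stand-in for a flow control check with a budget of two requests.
tokens = [True, True]


def take_token() -> bool:
    return tokens.pop() if tokens else False


def blocked_response():
    # Custom return value chosen for this example.
    return {"status": 429, "body": "Too many requests"}


@flow_controlled(take_token, on_blocked=blocked_response)
def do_something():
    return {"status": 200, "body": "ok"}


statuses = [do_something()["status"] for _ in range(3)]
print(statuses)  # [200, 200, 429]
```

Keeping the fallback separate from the handler means the business logic does not need to know whether or why a request was rejected.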

After the configuration is complete, we can send requests for this interface to different machines in the application cluster. Once the total exceeds 200 requests per second, the excess requests automatically receive the return behavior we preset in the rule. At the same time, the real-time monitoring page of the console shows that the excess traffic is rejected and that the total number of requests passing through the interface each second stays stable at 200 QPS.
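The expected outcome of this test can be reproduced with a small simulation. It assumes a shared per-second counter standing in for the cluster-wide rule, three hypothetical machines, and 500 requests per second spread round-robin across them; it models only the observable result, not how AHAS enforces the rule.

```python
from collections import Counter


class SharedWindowCounter:
    """Cluster-wide counter: at most `limit` requests pass in each second."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = Counter()

    def try_acquire(self, second: int) -> bool:
        if self.counts[second] < self.limit:
            self.counts[second] += 1
            return True
        return False


RULE_LIMIT = 200  # total requests per second allowed for /doSomething
MACHINES = ["machine-a", "machine-b", "machine-c"]  # hypothetical instance names

counter = SharedWindowCounter(RULE_LIMIT)
passed, blocked = Counter(), Counter()

for second in range(3):
    for i in range(500):  # 500 requests per second, spread round-robin
        machine = MACHINES[i % len(MACHINES)]
        if counter.try_acquire(second):
            passed[second] += 1
        else:
            blocked[second] += 1

print([passed[s] for s in range(3)])   # [200, 200, 200]
print([blocked[s] for s in range(3)])  # [300, 300, 300]
```

Whichever machine a request lands on, the number that passes each second stays at the configured total, which is the same flat line the monitoring page is described as showing.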

With just a few simple configuration steps, we can quickly experience the "silky-smooth" protection that AHAS cluster flow control brings to business traffic. AHAS has also recently launched other core features such as Nginx/Ingress gateway ingress traffic protection and Web request parameter flow control (see related links at the end of the article for details). You are welcome to go to the AHAS console and try them out.

Related links

1) A full range of 20+ microservice frameworks/components:

https://help.aliyun.com/document_detail/128800.html

2) AHAS console:

https://common-buy.aliyun.com/?commodityCode=ahas_001#/buy

3) Creating a flow control rule:

https://help.aliyun.com/document_detail/174871.html

4) Processing logic after the protection rule is triggered:

https://help.aliyun.com/document_detail/209640.html

6) Web request parameter flow control:

https://help.aliyun.com/document_detail/337922.html
