Major upgrade of MSE Governance Center - Traffic Governance, Database Governance, Same AZ Priority - Alibaba Cloud Native

Author: Liu Shi

This release of the MSE Governance Center brings major upgrades in three areas: rate limiting and degradation, database governance, and same-AZ priority. Together they improve the flexibility of microservice governance, the stability of middleware dependencies, and the performance of traffic scheduling, with the aim of building a microservice governance platform for the cloud-native era.

Review of existing capabilities

Before introducing the upgraded capabilities, let us briefly review the core capabilities of MSE, which are grouped into the development, testing, and operation stages. Among them, the three major features most commonly used in service governance are lossless online and offline, full-link grayscale, and daily environment isolation.

Lossless online and offline

Newly started applications are warmed up with a small share of traffic so that they are not overwhelmed. The warm-up model can be adjusted dynamically to suit complex scenarios, and the warm-up process can be associated with Kubernetes checks.

Full-link grayscale

Swimlanes can be configured, with support for gateways, RPC, RocketMQ, and more. Traffic can be switched dynamically with one click, and the effect of the switch can be viewed through monitoring. In addition, an end-to-end stable baseline environment is provided so that users can verify a new version quickly and safely.

Daily environment isolation

Traffic stays within its feature environment, which enables efficient and agile development. Environments are logically isolated, so only one baseline environment needs to be maintained, which greatly reduces costs. Using Cloud Toolkit's device-cloud interconnection in IDEA, a locally launched application can be connected to the development environment, reducing the cost of development and joint debugging.

The following sections describe the problems that traffic governance, database governance, and same-AZ priority address, along with the specific solutions.

Rate limiting and degradation fully upgraded to traffic governance

The traffic governance model forms an extensible closed loop for traffic protection and governs the various problems that can occur in a production system. The model starts with fault identification, which finds problems at different levels, such as status codes and exception types at the interface layer and abnormal metrics at the operating-system layer. Protection rules are then configured for the identified problems, for example adaptive rate-limiting protection or scenario-based rate-limiting protection. Once the rules are set, the system is protected according to the preset thresholds and protection methods, and the effect can be viewed through monitoring. Monitoring can in turn be used to check whether the protection rules are reasonable and to adjust them in time.

For a first-time integration with no historical data to refer to, you can run system stress tests, set the stress-test parameters according to your business scenarios, configure traffic governance rules for the problems that may occur online, and prepare protection strategies in advance.

Single-machine traffic protection

First, flow control. The principle is to monitor the QPS of application or service traffic and, when it reaches the configured threshold, immediately intercept further traffic so that the application is not overwhelmed by an instantaneous peak, thereby ensuring high availability. The product provides several rate-limiting modes, including single-machine rate limiting, cluster flow control, minute- and hour-level rate limiting, and associated rate limiting, and supports multiple algorithms such as sliding window, token bucket, and leaky bucket.
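
To make one of the algorithms named above concrete, here is a minimal token-bucket sketch in Python. It is a generic illustration, not MSE's implementation; the class name `TokenBucket` and its parameters are assumptions chosen for this example, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Token-bucket limiter: bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # hypothetical parameter: steady-state QPS
        self.capacity = float(capacity)  # hypothetical parameter: maximum burst size
        self.clock = clock
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def try_acquire(self, n=1):
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True   # request passes
        return False      # request is intercepted
```

A caller would wrap each incoming request in `if bucket.try_acquire(): ...` and return a rate-limited response otherwise. A sliding-window limiter differs mainly in that it counts recent requests instead of accumulating tokens, so it does not allow bursts above the threshold.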

Next, concurrency control. When a strongly depended-on method or interface becomes unstable, you can cap the number of concurrent threads calling it in order to isolate the fault. If requests take longer to respond, the number of concurrent threads rises. Once it exceeds the threshold, AHAS rejects the excess requests until the backlog is processed and the number of concurrent threads drops, which isolates the fault and reduces instability.
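
The idea of capping concurrent threads can be sketched with a semaphore. This is an illustrative Python sketch under assumed names (`ConcurrencyLimiter`, `fallback`), not the product's API: a request that cannot get a permit is rejected immediately instead of queueing.

```python
import threading


class ConcurrencyLimiter:
    """Rejects calls once `max_concurrent` calls are already in flight."""

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, fallback=None, **kwargs):
        # Non-blocking acquire: excess requests are rejected, not queued,
        # so a slow dependency cannot tie up every worker thread.
        if not self._permits.acquire(blocking=False):
            return fallback
        try:
            return fn(*args, **kwargs)
        finally:
            self._permits.release()
```

Because permits are released only when a call finishes, slower responses naturally hold permits longer, which is exactly the condition under which the limiter starts rejecting.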

System protection supports either adaptive flow control or manually configured system rules. Adaptive flow control automatically and dynamically adjusts the application's ingress traffic according to the system's CPU usage. The goal is to balance ingress traffic against system load so that the system runs stably at maximum throughput.

Circuit breaker protection monitors the response time or error ratio of the application's internal or downstream dependencies and, when a specified threshold is reached, immediately lowers the priority of the unstable dependency. For the specified period, the system does not call the unstable resource, which keeps the application from being affected and ensures its high availability. When the period elapses, calls to the resource resume.
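
The open, wait, and resume cycle described above can be sketched as a small state machine. This is a simplified Python illustration, not MSE's implementation: the class name, the `half_open` probe state, and all parameters are assumptions, and the failure counters are cumulative since the last reset rather than a sliding statistics window.

```python
import time


class CircuitBreaker:
    """Error-ratio circuit breaker with closed, open, and half_open states."""

    def __init__(self, failure_ratio=0.5, min_calls=5, open_seconds=10.0,
                 clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.min_calls = min_calls
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "closed"
        self.total = 0
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback=None):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.open_seconds:
                return fallback        # short-circuit: the resource is not called
            self.state = "half_open"   # period elapsed: let one probe call through
        try:
            result = fn()
        except Exception:
            self._record(False, now)
            return fallback
        self._record(True, now)
        return result

    def _record(self, ok, now):
        if self.state == "half_open":
            if ok:                     # probe succeeded: resume normal calls
                self.state = "closed"
                self.total = self.failures = 0
            else:                      # probe failed: open again for another period
                self.state, self.opened_at = "open", now
            return
        self.total += 1
        self.failures += 0 if ok else 1
        if (self.total >= self.min_calls
                and self.failures / self.total >= self.failure_ratio):
            self.state, self.opened_at = "open", now
```

A slow-call-ratio strategy would follow the same structure, counting calls whose response time exceeds a limit instead of calls that raise.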

Active degradation lets you designate specific interfaces to be degraded. A degraded interface triggers a custom degradation behavior (such as returning specified content) instead of executing its original logic.

Hotspot protection analyzes which parameter values occur most often in calls to a resource and, according to the configured hotspot rules, rate-limits the calls that carry those hot parameters, thereby protecting system stability.
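
As a rough illustration of per-parameter limiting, the following Python sketch counts calls per parameter value inside a fixed time window. The names (`HotspotLimiter`, `per_value_limit`) are hypothetical, and a fixed window is used only for brevity; the source does not say which statistics MSE uses.

```python
import time
from collections import defaultdict


class HotspotLimiter:
    """Limits how often each individual parameter value may pass per window."""

    def __init__(self, per_value_limit, window_seconds=1.0, clock=time.monotonic):
        self.per_value_limit = per_value_limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts = defaultdict(int)
        self.window_start = clock()

    def allow(self, param_value):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.counts.clear()        # a new window begins
            self.window_start = now
        if self.counts[param_value] >= self.per_value_limit:
            return False               # this value is hot: limit only its calls
        self.counts[param_value] += 1
        return True
```

Calls carrying other parameter values are unaffected, which is the point of the technique: one hot product ID or user ID is throttled without limiting the whole interface.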

Finally, when the system hits non-fatal errors (such as occasional timeouts), calls can be retried automatically so that a transient error does not become a final failure.
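
A minimal Python sketch of automatic retry follows. The function name, the exponential backoff, and the choice of `TimeoutError` as the retryable error are assumptions made for illustration; the source does not specify a retry policy.

```python
import time


def retry(fn, attempts=3, base_delay=0.1, retry_on=(TimeoutError,), sleep=time.sleep):
    """Call fn, retrying on the listed exception types with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise                  # out of attempts: surface the error
            sleep(base_delay * 2 ** (attempt - 1))
```

Only errors listed in `retry_on` are retried, so a genuinely fatal error still fails fast. Retries should be limited to idempotent operations, since the call may run more than once.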

Cluster Traffic Protection

Cluster traffic protection addresses the weaknesses of single-machine flow control: uneven traffic across machines, frequent changes in the number of machines, and per-machine thresholds that become too small once the total is divided up, all of which degrade the rate-limiting effect. Cluster flow control can precisely control the total number of real-time calls to a service interface across the entire cluster. It is best suited to the following scenarios:

1. Uneven service call traffic that needs to be smoothed out

When traffic to the service instances is unbalanced, single-machine rate limiting becomes inaccurate: busy instances start limiting before the cluster-wide total is reached, so the total cannot be controlled precisely.

2. Precise control of small cluster-wide limits

When the cluster-wide limit is small, single-machine rate limiting stops working. For example, if an interface must not exceed 10 QPS in total but runs on 50 machines, the total can still exceed the limit even with the single-machine threshold set to 1.

3. Business-level cluster flow control

Minute- and hour-level flow control with business meaning protects downstream systems from being overwhelmed (for example, the gateway layer limits how many times each user can call an API per minute).
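
The arithmetic behind the second scenario can be checked with a short Python sketch. Both helpers are hypothetical: `single_machine_ceiling` assumes integer per-machine thresholds that cannot go below 1, and `ClusterWindow` stands in for a shared, cluster-wide counter within a single time window.

```python
def single_machine_ceiling(total_limit, machines):
    """Worst-case cluster total when the limit is split into per-machine thresholds."""
    per_machine = max(1, total_limit // machines)  # a threshold cannot drop below 1
    return per_machine * machines


class ClusterWindow:
    """One shared counter for the whole cluster during a single time window."""

    def __init__(self, total_limit):
        self.total_limit = total_limit
        self.used = 0

    def try_acquire(self):
        if self.used >= self.total_limit:
            return False
        self.used += 1
        return True
```

With a limit of 10 QPS and 50 machines, `single_machine_ceiling(10, 50)` gives 50, five times the intended total, whereas 50 simultaneous requests against `ClusterWindow(10)` let exactly 10 through.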

Cluster flow control offers rich scenarios, a low cost of use, and fully automatic management:

Rich scenarios: it covers everything from precise protection of gateway ingress traffic and precise flow control of Web/RPC service calls to minute- and hour-level traffic control along business dimensions.

Low cost of use: no special integration is required, and it works out of the box.

Fully automatic management and control: server resources are managed and allocated automatically, and automated operations ensure availability, so users can focus on their business without worrying about how resources are prepared and allocated.

Gateway Traffic Protection

Gateway traffic protection precisely controls the traffic of one API or a group of APIs and acts as a front-line safeguard, so that excess traffic never reaches the back-end system. If limits are configured per machine, it is hard to account for changes in the number of gateway machines, and uneven gateway traffic can weaken the rate-limiting effect.

Gateway protection has four core capabilities:

1. Real-time monitoring and traffic control along the API/Host dimension

2. Dynamic rule configuration that takes effect in real time

3. Cluster flow control that precisely controls the total number of API calls

4. Flow control and circuit breaking along the request parameter/header dimension

Full Link & Multilingual

The upgraded MSE traffic governance can be applied across the whole microservice link. At the traffic ingress layer, it can be integrated through the gateway. At the microservice layer, it protects not only the microservice itself but also the middleware and third-party dependencies that the microservice relies on, such as caches and databases. If you integrate through ACK or the Agent, no code changes are required. If you have advanced traffic governance requirements, such as custom instrumentation points, you can integrate through the SDK.

New database governance capabilities

Typical Governance Scenario

A system exposes a query interface whose SQL statement involves a multi-table join. In some cases it triggers a slow query that takes up to 30s. Eventually the DB connection pool and the Tomcat thread pool fill up, and the whole application becomes unavailable.
An application has just started and its Druid database connection pool is still initializing, but a large number of requests have already arrived. Dubbo's thread pool quickly fills up, many threads are stuck initializing database connections, and a large number of business requests fail.
In a full-link grayscale scenario, the new application version changed the contents of a database table, so grayscale traffic corrupted data in the production database, and business colleagues had to correct the production data by hand overnight.
SQL performance was not carefully considered early in the project. As the business grew and the number of users increased, the SQL behind older production interfaces gradually became a performance bottleneck. Effective SQL insight is needed to find this legacy SQL and optimize it in time.
Long-running SQL statements cause a large number of slow calls on production business interfaces. The problematic slow SQL must be located quickly and isolated through governance measures so that the business recovers quickly. Real-time SQL insight when microservices access the data layer makes it possible to locate slow SQL calls quickly.

In fact, for most back-end applications the system bottleneck is the database, and the more complex the business, the more it depends on database operations. Database problems are therefore among the highest-priority work, and database governance is an essential part of microservice governance.

Core solutions

Slow SQL governance

Slow SQL is one of the most damaging factors for system stability. It can cause abnormal CPU usage and load and exhaust system resources. Severe slow SQL can drag down the entire database and put online business at risk of disruption. Possible causes of slow SQL in a production environment include:

Hardware problems such as a slow network, insufficient memory, low I/O throughput, or a full disk.
Missing or ineffective indexes.
Too much data in the system.
SQL performance not being considered at the start of the project.

Connection Pool Governance

Connection pool governance is a very important part of database governance. Real-time connection pool metrics make it possible to identify risks in the system early. The following are some common connection pool governance scenarios.

Establishing connections in advance

During an application release or elastic scale-out, a newly started instance may pass its readiness check before its database connections have been established, and a large amount of business traffic then enters the new pod. Many requests block while acquiring a connection from the pool, the service's thread pool fills up, and a large number of business requests fail. If the application can establish connections in advance, the number of connections is guaranteed to be at least minIdle before traffic arrives. Combined with small-traffic warm-up, this solves the problem.
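
The ordering that matters here, connections first and readiness second, can be sketched in a few lines of Python. `WarmPool` and its methods are hypothetical names for illustration; a real pool such as Druid has its own configuration for this, which is not reproduced here.

```python
class WarmPool:
    """Toy connection pool that reports ready only after min_idle connections exist."""

    def __init__(self, connect, min_idle):
        self._connect = connect    # hypothetical factory that opens one connection
        self.min_idle = min_idle
        self.idle = []
        self.warmed = False

    def warm_up(self):
        # Open connections eagerly, before any business traffic is admitted.
        while len(self.idle) < self.min_idle:
            self.idle.append(self._connect())
        self.warmed = True

    def is_ready(self):
        # A readiness probe would consult this instead of "process has started".
        return self.warmed

    def acquire(self):
        # After warm-up, the first requests reuse existing connections.
        return self.idle.pop() if self.idle else self._connect()
```

If the readiness endpoint returns success only when `is_ready()` is true, traffic cannot reach the instance while its first requests would still be blocked on connection setup.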

"Bad" connection culling

Sometimes the connection pool contains problematic connections, perhaps because the underlying network is jittery or because business execution is slow or deadlocked. If abnormal connections are detected from the connection pool's perspective and promptly removed and recycled, the pool as a whole stays stable and is not dragged down by an individual problematic business operation or by network jitter.

Access control

In principle, not every database table should be freely accessible. Sometimes we want an important table to be write-forbidden and read-only for less important services. When the database is jittery or its thread pool is full, we may want to cut back time-consuming read SQL. Tables holding sensitive data may need to be readable and writable by only one specific application. Dynamic access control makes this possible: access control rules are issued in real time to forbid reads or writes at the level of individual methods, application SQL, database instances, and tables.
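
A table-level rule check of this kind might look like the following Python sketch. The rule shape (`AccessRule`, `denied_ops`, `exempt_apps`) and the application and table names are assumptions for illustration, not MSE's rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRule:
    table: str
    denied_ops: frozenset                  # e.g. {"write"} makes the table read-only
    exempt_apps: frozenset = frozenset()   # applications the rule does not apply to


def is_allowed(rules, app, table, op):
    """Return False if any rule denies this operation on this table for this app."""
    for rule in rules:
        if rule.table == table and op in rule.denied_ops and app not in rule.exempt_apps:
            return False
    return True


rules = [
    # "orders" is read-only for everyone.
    AccessRule("orders", frozenset({"write"})),
    # Only "account-service" may read or write "user_secrets".
    AccessRule("user_secrets", frozenset({"read", "write"}), frozenset({"account-service"})),
]
```

Because the rules are plain data, they can be replaced at runtime without redeploying the application, which is what makes the control dynamic.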

Database Grayscale

In a microservice architecture, dependencies between services are intricate, and a feature release sometimes requires several services to be upgraded and launched together. We want the new versions of these services to be verified with a small amount of traffic at the same time. This is the full-link grayscale scenario that is specific to microservice architectures: by building environment isolation from the gateway through all back-end services, several services at different versions can undergo grayscale verification together. MSE uses shadow tables, so users can achieve full-link grayscale at the database level without modifying any business code.
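
The shadow-table idea can be illustrated with a deliberately naive Python sketch that redirects grayscale traffic to a parallel table. The function name and the `_gray` suffix are hypothetical, and the source does not describe how MSE rewrites SQL; a production implementation would use a SQL parser, since this regex would also rewrite a column or string literal that happens to match a table name.

```python
import re


def route_sql(sql, tables, is_gray, suffix="_gray"):
    """Point grayscale traffic at shadow tables; leave baseline traffic untouched."""
    if not is_gray or not tables:
        return sql
    # Match only whole identifiers, so "orders" does not match "orders_archive".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, tables)) + r")\b")
    return pattern.sub(lambda m: m.group(1) + suffix, sql)
```

Because the rewrite happens below the business code, a grayscale write such as `UPDATE orders ...` lands in `orders_gray` and cannot corrupt the production table.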

Dynamic read-write separation

Using the SQL insight provided by MSE together with our understanding of the business, we can quickly identify which interface requests count as weak requests. Read operations that would heavily affect the performance and stability of the primary database can then be offloaded to an RDS read-only instance, which effectively reduces the read and write pressure on the primary database and further improves the stability of microservice applications.
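
A simple routing decision of this kind is sketched below in Python. The class, the interface paths, and the "starts with SELECT" test are assumptions for illustration only; the source does not describe MSE's routing rules.

```python
import itertools


class ReadWriteRouter:
    """Sends reads from selected interfaces to read-only replicas, everything else to primary."""

    def __init__(self, primary, replicas, offloaded_interfaces):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None
        self.offloaded = set(offloaded_interfaces)  # interfaces chosen for offloading

    def pick(self, interface, sql):
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and interface in self.offloaded and self._replicas is not None:
            return next(self._replicas)   # round-robin across read-only instances
        return self.primary               # writes and non-offloaded reads stay on primary
```

Keeping the offloaded set explicit reflects the point made above: only requests that the business can tolerate serving from a replica should be moved, since replicas may lag behind the primary.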

The above is a preview of the database governance capabilities that MSE is about to launch. From the application's perspective, we have distilled our practical experience in stability governance, performance optimization, and efficiency improvement when accessing and using databases. For a back-end application, the database is undoubtedly the top priority, and we hope these database governance capabilities help everyone make better use of database services.

Same-AZ priority

Within a single city, RT between data centers is generally low (< 3ms), so by default we can build one large LAN across several data centers in the same city and distribute our applications across them to guard against traffic loss when a single data center fails. Compared with remote multi-active deployment (active-active across regions), this kind of infrastructure is cheaper to build and requires smaller architectural changes. In a microservice system, however, the links between applications are intricate, and governance becomes more complex as link depth grows. In the scenario shown in the figure below, front-end traffic is likely to bounce between data centers as applications call each other, causing RT to rise sharply and eventually leading to traffic loss.

Usage scenarios

When applications are deployed in multiple data centers, calls between applications can cross data centers.

When application A in data center 1 calls application B in data center 2, the cross-data-center network latency increases the HTTP response time.

With same-data-center priority enabled, the consumer preferentially calls provider instances in its own data center.

Solution

Routing rules automatically identify the availability zone of each instance and preferentially select providers in the same availability zone, which reduces call latency and improves performance. Traffic switching in disaster recovery scenarios ensures availability.
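
The selection rule can be sketched in Python as follows. The instance shape (`addr`, `zone`, `healthy`) and the function name are assumptions for illustration; the fallback to other zones is one plausible reading of the disaster recovery behavior mentioned above.

```python
def pick_candidates(instances, consumer_zone):
    """Prefer healthy providers in the consumer's zone; otherwise use any healthy provider."""
    healthy = [i for i in instances if i["healthy"]]
    same_zone = [i for i in healthy if i["zone"] == consumer_zone]
    # An empty same-zone list means the local zone is down: fall back across zones.
    return same_zone or healthy
```

A load balancer would then choose one instance from the returned list, so calls stay inside the zone whenever a local provider is available and cross zones only when none is.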

Epilogue

The MSE Governance Center's capabilities in rate limiting and degradation, database governance, and same-AZ priority help enterprises achieve system resilience more easily, detect abnormal SQL behavior in time, and apply targeted governance and protection. Same-AZ priority improves overall system performance and helps build a robust and stable operating environment. This upgrade is the first stage of the Governance Center upgrade, and more governance methods will be introduced in the future to protect your systems.
