The 99 Promotion Is Coming: Safeguard Your Business with the MSE Service Autonomy System

Author: Grass Valley

Foreword

Preparing for big promotions is homework every enterprise must do. Ahead of the 99 promotion, this article looks at how to use MSE's service autonomy capabilities to discover potential risks in advance, how to understand the engine's internal running state through its observability features, and how MSE's one-click migration of self-built Nacos/ZooKeeper to the cloud helps the business get through the promotion smoothly.

Click to view the live replay:

https://yqh.aliyun.com/live/detail/29401

Challenges of Microservices

The change from monolith to microservice

With the rapid growth of Internet business, system architecture keeps changing, evolving from the original monolith to today's popular microservice architecture. There is no silver bullet in software architecture design: the scalability and performance gains of microservices inevitably come with side effects. In general, the changes are as follows:

More hops in the call chain

In a monolithic application, business logic runs in a closed loop inside a single process on one node. After the move to microservices, logic with different functional attributes is split into services deployed on independent nodes. Completing one piece of business logic now requires these nodes to cooperate, so A->B becomes A->B1->B2->B3.

Added dependencies on complex middleware

RPC is the most basic technology a microservice architecture introduces. It includes an RPC client (Dubbo/Spring Cloud) and a registry (Nacos/ZooKeeper/Eureka). If there are transaction requirements, it also relies on distributed transaction components such as Seata.

From working alone to multi-team collaboration

Besides changing the application system, moving to microservices may also change how teams work together. A system that used to be owned by one person is now developed collaboratively by multiple service teams that support each other.

Challenges

The changes brought by the microservice architecture pose many challenges for developers and operations engineers.

In daily development and operations, some typical problems come up again and again:

Scenario 1: A service call fails and the Consumer log reports that no service is available, even though the Provider process is clearly running normally. Was the service never registered? Or did the registry fail to push the address to the client?
Scenario 2: The Nacos client throws an exception in an extreme scenario. After a long investigation, the cause turns out to be a known bug in the Nacos client, and the fix is to upgrade to stable version xx. But as a developer or operations engineer with so many daily business requirements, how do you keep up with every client version iteration?
Scenario 3: A big promotion is coming and clients are being scaled out at full speed to handle the traffic surge. Suddenly the registration and configuration center stops working: it has reached its rated capacity and needs to be scaled out. How do you avoid this kind of hindsight and plan capacity in advance?
Scenario 4: The online registration and configuration center hits FullGC. A restart relieves it, but the problem recurs every so often. Colleagues suggest that a client may be misusing it, reading and writing so much data that memory is overwhelmed, but it is hard to find out who is causing the trouble.

Service autonomy

Cloud-native microservices remain the most popular technical architecture ("40% of cloud-native developers focus on microservices"), so solving this group's pain points brings the greatest value to enterprises. That is also MSE's original intention.

Alibaba's architecture has been evolving from a monolith since 2008. Along the way it has accumulated more than ten years of hard-won experience and distilled a set of strategies. MSE's service autonomy capabilities aim to help users quickly find, locate, and solve problems, and provide a series of functions and tools in three areas, which the following sections cover in turn.

Observability

Observability is an important part of keeping microservices running robustly. It helps answer questions such as:

"Is the system still normal?"
"Is the end user experience as expected?"
"How can system risks be discovered proactively before the system fails?"

If monitoring tells us that something is wrong with the system, observability tells us what is wrong and why. Observability can not only judge whether the system is healthy, but also proactively surface risks before problems occur.

Monitoring dashboards

MSE provides a rich set of monitoring dashboards and integrates seamlessly with ARMS, giving you these observability capabilities for free. You can use the metrics to check capacity, find problems early, and locate them:

1. Basic dashboard

It provides core infrastructure metrics, mainly:

JVM monitoring
Memory/CPU
Network traffic

For these core metrics, it is recommended to add at least memory/CPU alerts, with the threshold set to 60%.

If your application is latency-sensitive, pay close attention to the FullGC metric in JVM monitoring, because FullGC slows down the process's responses.

The network traffic metric helps you observe SLB network problems. For example, if traffic climbs sharply to a certain level and then stays flat while your clients report connection failures, the traffic limit may have been reached.
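
As a rough illustration of the two checks just described, the sketch below flags average memory/CPU utilization at or above the recommended 60% threshold and applies a simple heuristic for the "rise, then flat line" traffic pattern. The function names, the window of 5 samples, and the 2% tolerance are assumptions made for this example, not MSE or ARMS settings.

```python
from statistics import mean

ALERT_THRESHOLD = 0.60  # the 60% memory/CPU threshold recommended above


def breaches_threshold(usage_samples, threshold=ALERT_THRESHOLD):
    """Return True when average utilization (0.0-1.0) is at or above the threshold."""
    return mean(usage_samples) >= threshold


def looks_like_plateau(traffic_samples, window=5, tolerance=0.02):
    """Heuristic: traffic rose earlier and the last `window` samples sit flat at the peak.

    `window` and `tolerance` are illustrative values, not MSE settings.
    """
    if len(traffic_samples) <= window:
        return False  # not enough history to compare against
    tail = traffic_samples[-window:]
    peak = max(traffic_samples)
    # Some earlier sample must be clearly below the peak, otherwise nothing "rose".
    rose = min(traffic_samples[:-window]) < peak * (1 - tolerance)
    # Every recent sample must stay within the tolerance band around the peak.
    flat = all(abs(value - peak) <= peak * tolerance for value in tail)
    return rose and flat
```

A plateau combined with client connection errors is only a hint that a bandwidth limit may have been hit; it still needs to be confirmed against the actual SLB specification.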

2. Overview dashboard

The overview dashboard shows a few core metrics at a glance, giving you a global view:

Client version distribution
Current configuration/service usage level
Number of connections
Number of configurations/services

The client distribution metric shows how client versions are distributed across the system. Combined with Nacos's version usage constraints, it lets you find high-risk versions and drive fixes for the stability risks those clients bring.

For example, Nacos recently published updated usage constraints: Nacos 1.4.1 has a serious DNS resolution problem. You can use the client distribution metric to find which clients run that version and notify the owning teams to upgrade.
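
The version check described above amounts to counting clients per version and filtering by a list of risky versions. The following minimal sketch assumes the client data is available as (application name, client version) pairs; that shape, the function names, and the sample application names in the usage note are hypothetical, and only the 1.4.1 version comes from the example above.

```python
from collections import Counter

# Versions treated as high risk. 1.4.1 comes from the example above; a real
# list would be maintained from the published Nacos usage constraints.
HIGH_RISK_VERSIONS = {"1.4.1"}


def version_distribution(clients):
    """Count clients per Nacos client version.

    `clients` is a list of (application_name, client_version) pairs. This
    shape is hypothetical and only stands in for the dashboard's data.
    """
    return Counter(version for _, version in clients)


def apps_to_upgrade(clients, risky=HIGH_RISK_VERSIONS):
    """Return the sorted application names that run a high-risk client version."""
    return sorted({app for app, version in clients if version in risky})
```

Given pairs such as `("order-service", "1.4.1")` and `("cart-service", "2.1.0")`, `apps_to_upgrade` would return only `order-service`, which is the list of teams to notify.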

3. Business dashboards: Nacos service/configuration dashboards

The business metrics MSE provides are carefully selected and representative, and help you fully understand the business scale inside the registration and configuration center. When a big promotion is coming and the company asks you to evaluate the center's current capacity, you can do a comprehensive analysis with this data. Nacos is used both as a registry and as a configuration center, so MSE provides a separate dashboard for each scenario:

Configuration center metrics:

Number of configurations
Number of configuration listeners
Configuration TPS/QPS
Read/write RT

Registry metrics:

Number of service providers/subscribers
Registry QPS/TPS
Registry read/write RT
Push success rate/latency/TPS

4. ZooKeeper TopN dashboard

The TopN dashboard is very effective at locating server-side exceptions caused by external factors:

Top N Znodes by size
Top N clients by read/write TPS/QPS against ZooKeeper
Top N hotspot data by TPS/QPS
Top N hotspot data by number of watches

In daily development you have probably run into ZooKeeper FullGC without knowing the specific cause. It may be that ZooKeeper is pushing a large amount of data, but you cannot tell which hot data is being subscribed to. Or a client may be writing large data to ZooKeeper, but you cannot find which client wrote it.

Let's look at two typical client misuse scenarios:

A client mistakenly writes large data to a node with many subscribers, so ZooKeeper pushes a large amount of data and triggers FullGC:

Here large data is written to the /99testWriteBig path, and the oversized node can be found through the Znode size TopN dashboard.

A client accesses ZooKeeper too frequently, increasing cluster load and response latency, and that client needs to be identified:

Here a client with SessionId 0x1030871c8ed0004 frequently reads the /99testRead node. It can be found through the client QPS TopN dashboard, which also shows which data is read most frequently on the current server.
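
Both misuse scenarios above come down to a Top N query over per-node or per-client statistics. The sketch below shows that idea with plain Python data structures; the input shapes (a path-to-size mapping and a list of read records) and the function names are assumptions for illustration and do not reflect how MSE collects these statistics internally.

```python
import heapq
from collections import Counter


def top_n_znodes_by_size(znode_sizes, n=3):
    """Return the n largest Znodes as (path, size_in_bytes), largest first.

    `znode_sizes` maps a Znode path to its data size in bytes (hypothetical input).
    """
    return heapq.nlargest(n, znode_sizes.items(), key=lambda item: item[1])


def top_n_readers(read_events, n=3):
    """Return the n most active sessions as (session_id, read_count).

    `read_events` is an iterable of (session_id, path) records (hypothetical input).
    """
    return Counter(session for session, _ in read_events).most_common(n)


def top_n_hot_paths(read_events, n=3):
    """Return the n most frequently read paths as (path, read_count)."""
    return Counter(path for _, path in read_events).most_common(n)
```

With data resembling the two scenarios, `top_n_znodes_by_size` would surface `/99testWriteBig` first, and `top_n_readers` would surface session `0x1030871c8ed0004`.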

Metric alerts

MSE provides alerting on core metrics for the registration and configuration center. It is recommended to configure alerts on the following metrics:

Nacos recommended configuration:

- Average read/write time for services: reveals performance problems
- Number of configuration long-polling connections: reveals capacity problems
- Number of services/configurations: reveals capacity problems or client misuse

ZooKeeper recommended configuration:

- Number of Znodes: reveals client misuse
- Rate of change in the number of connections: a sudden drop may mean a server node is faulty
- Number of connections per server: reveals capacity problems or client misuse
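
To make the connection change-rate rule concrete, here is a minimal sketch of how such a check could be computed from two consecutive samples. The 50% drop limit and the function names are illustrative assumptions; the source does not state a recommended value, so a real threshold should be tuned to the cluster.

```python
def connection_change_rate(previous, current):
    """Relative change in connection count between two consecutive samples.

    Returns 0.0 when there is no baseline (previous == 0), which is a
    simplification chosen for this sketch.
    """
    if previous == 0:
        return 0.0
    return (current - previous) / previous


def sudden_drop(previous, current, max_drop=0.5):
    """True when connections fell by more than `max_drop` (an illustrative 50%)."""
    return connection_change_rate(previous, current) < -max_drop
```

A drop flagged this way only says that many clients disconnected at once; whether a server node actually failed still has to be checked on the node itself.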

Link tracing

Push track

The push track shows information about the push path from the server side of the registration and configuration center to the client, and it is easy to query. During development, the push track helps you quickly locate the following problems, which greatly improves troubleshooting efficiency:

A client does not receive a service push
An exception occurs in a call between services
A configuration release is abnormal
A configuration change does not take effect on a certain machine
You need to view configuration center changes and push events
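
The first problem in the list, a client that never received a push, can be pictured as a set difference between the subscribers of a service and the clients with a successful push record. The sketch below uses a made-up record format (dictionaries with `client`, `service`, and `success` keys); it is an assumption for illustration and not the format of MSE's push track data.

```python
def clients_missing_push(subscribers, push_records, service):
    """Return the sorted subscribers that have no successful push for `service`.

    `subscribers` is an iterable of client identifiers. `push_records` is a
    list of dicts with the hypothetical keys "client", "service", "success".
    """
    received = {
        record["client"]
        for record in push_records
        if record["service"] == service and record["success"]
    }
    return sorted(set(subscribers) - received)
```

Any client returned here is a candidate for the "Consumer reports no service available" symptom and is the place to start checking network connectivity and client logs.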

MSE - Nacos registry push track query page

MSE - Nacos configuration center push track query page (configuration dimension)

Cluster Diagnostics

One-click diagnosis

If MSE's monitoring dashboards help you find and locate problems yourself, the upcoming one-click diagnosis function automatically scans for risks on your behalf. The two complement each other. One-click diagnosis evaluates the cluster from several aspects.

The one-click diagnosis page lists the risks of the engine instance you have purchased. The risks are found by automatically scanning against built-in rules, and each comes with suggestions for improvement.

Smooth migration to MSE

The MSE service autonomy functions introduced above will continue to be improved and polished, adding more autonomy capabilities such as event statistics and health audits, to make troubleshooting the registration and configuration center easier and improve usability.

If you are still running a self-built registration and configuration center, it is recommended to migrate to the cloud as soon as possible to enjoy these enterprise-level services. MSE provides an efficient migration tool, MSE Sync, which supports two-way synchronization, automatic service retrieval, and one-click synchronization of all services, helping users migrate their Nacos and ZooKeeper registration and configuration centers.

The official MSE documentation provides detailed step-by-step migration guides:

"Self-built Dubbo ZooKeeper registration center migrated to MSE ZooKeeper"

https://help.aliyun.com/document_detail/444943.html

"Self-built Dubbo ZooKeeper registration center migrated to MSE Nacos"

https://help.aliyun.com/document_detail/446904.html

"Self-built Dubbo Nacos registration center migrated to MSE Nacos"

https://help.aliyun.com/document_detail/445140.html

If you run into problems during migration or need customization, you can contact us for one-on-one expert migration support.

Purchase MSE to enjoy enterprise-level services

MSE provides core capabilities such as high availability, high performance, security, and ease of use.
