Author: Grass Valley
Foreword
Preparing for big sales promotions is homework every enterprise has to do. Today, ahead of the 99 promotion, let's talk about how to use MSE's service autonomy capabilities to discover potential risks in advance, how to understand the engine's internal operating state through its observability capabilities, and how the self-service one-click migration of Nacos/ZooKeeper to the cloud helps the business get through the big promotion smoothly.
Click to view the live replay:
https://yqh.aliyun.com/live/detail/29401
Challenges of Microservices
The change from monolith to microservice
As Internet businesses have grown rapidly, system architecture has kept changing, evolving from the original monolith to today's most popular microservice architecture. There is no silver bullet in software architecture design: enjoying the scalability and performance gains that microservices bring inevitably comes with side effects. In general, the changes are as follows:
- The call chain gains multiple hops
In a monolithic application, business logic runs in a closed loop inside a single node's process. After the move to microservices, logic with different functional responsibilities is split into services deployed on independent nodes. Completing one piece of business logic now requires these independent nodes to cooperate, so A->B becomes A->B1->B2->B3.
- Added dependencies on complex middleware
In a microservice architecture, RPC is the most basic technology introduced. It brings with it an RPC client (Dubbo/Spring Cloud) and a registry (Nacos/ZooKeeper/Eureka); if there are transaction requirements, distributed transaction components such as Seata are needed as well.
- From individual combat to multi-team collaboration
Moving to a microservice architecture changes not only the application systems but possibly also how people work together. Where one person used to be in charge of a whole system, development now becomes a collaboration among multiple service teams that support each other.
Challenges
The changes brought by the microservice architecture pose many challenges for developers and operations engineers. The following typical problems come up often in day-to-day development and operations:
- Scenario 1: A service call fails, and the Consumer log says no service is available, even though the Provider process is clearly running normally. Was the service never registered? Or did the registry fail to push the address to the client?
- Scenario 2: The Nacos client throws an exception in an extreme scenario. After a long investigation, the cause turns out to be a known bug in the Nacos client, which needs to be upgraded to the xx stable version. But as a developer or operations engineer with so many daily business requirements, how do you keep track of client version iterations?
- Scenario 3: The big promotion is coming, and the clients are being scaled out at full speed to cope with the traffic surge. Suddenly the registration and configuration center stops working: it has reached its rated capacity and needs to be expanded. How do you avoid this kind of hindsight and do capacity planning in advance?
- Scenario 4: The online registration and configuration center runs into FullGC. A restart relieves it, but it comes back every so often. Colleagues suggest that a client may be misusing it, with a large volume of reads and writes exhausting memory, but it is hard to find out who is "causing trouble".
Service autonomy
Cloud-native microservices remain the most popular technical architecture ("40% of cloud-native developers focus on microservices"), so solving this group's pain points brings the greatest value to enterprises. That is also the original intention behind MSE.
Alibaba started from a monolithic architecture in 2008 and has been evolving ever since. In more than ten years it has run into many pitfalls and distilled a set of strategies from them. MSE's service autonomy capability aims to help users quickly find, locate, and solve problems. It provides a series of functions and tools around the following three aspects:
Observability
Observability is an important part of keeping microservices running robustly. It answers questions such as:
- "Is the system still normal?"
- "Is the end user experience as expected?"
- "How to proactively discover system risks before the system is about to fail?"
If monitoring tells us that something is wrong with the system, observability tells us what is wrong and what caused it. Observability can not only judge whether the system is healthy, but also surface risks before problems occur.
- Monitoring dashboards
MSE provides a rich set of monitoring dashboards and integrates seamlessly with ARMS, giving you extensive observability for free. You can use these metrics to keep an eye on capacity, find problems early, and locate their cause:
1. Basic dashboard
This dashboard provides core infrastructure metrics, mainly:
- JVM monitoring
- Memory/CPU
- Network traffic
For these basic core metrics, it is recommended to at least add memory/CPU alerts, with the threshold set to 60%.
If your application is latency-sensitive, pay close attention to the FullGC metric in JVM monitoring, because FullGC slows down the process's responses.
The network traffic metric can be used to observe network problems at the SLB. For example, if traffic suddenly rises to a certain level and then stays flat while your clients report connection failures, the traffic may have reached its threshold.
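As a rough illustration of the 60% memory/CPU threshold recommended above, the sketch below checks utilization samples against it. The `NodeSample` type, the node names, and the sample values are hypothetical; in MSE the alert itself is configured in the console, so this only shows the rule, not the product's implementation.

```python
from dataclasses import dataclass
from typing import List

# Threshold taken from the recommendation in the text: alert at 60% utilization.
ALERT_THRESHOLD = 0.60


@dataclass
class NodeSample:
    """One hypothetical utilization sample for a registry node."""
    node: str
    cpu: float     # fraction of CPU in use, 0.0 to 1.0
    memory: float  # fraction of memory in use, 0.0 to 1.0


def breached_metrics(sample: NodeSample, threshold: float = ALERT_THRESHOLD) -> List[str]:
    """Return the names of the metrics that are at or above the threshold."""
    breached = []
    if sample.cpu >= threshold:
        breached.append("cpu")
    if sample.memory >= threshold:
        breached.append("memory")
    return breached


# Hypothetical samples for two nodes.
samples = [
    NodeSample("nacos-0", cpu=0.35, memory=0.58),
    NodeSample("nacos-1", cpu=0.72, memory=0.61),
]

# Keep only the nodes that breached at least one metric.
alerts = {s.node: breached_metrics(s) for s in samples if breached_metrics(s)}
print(alerts)
```

Alerting at 60% rather than closer to 100% leaves headroom to scale out before a promotion-driven traffic surge saturates the node.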
2. Overview dashboard
The overview dashboard quickly shows you a few core metrics to give you a global view:
- Client distribution
- Current configuration/service usage level
- Number of connections
- Number of configurations/services
The client distribution metric shows how the various client versions are distributed across the system. Combined with the Nacos version usage restrictions, it lets you find high-risk versions and push to resolve the stability risks those clients bring.
For example, Nacos recently published updated version usage restrictions: Nacos 1.4.1 has a serious DNS resolution problem. You can use the client distribution metric to find where that version is running and notify the owning business teams to upgrade.
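The check described above boils down to counting client versions and flagging the ones on a high-risk list. The following is a minimal sketch under stated assumptions: the client addresses and all versions other than 1.4.1 are made up for illustration, and the high-risk set would in practice come from the official Nacos version usage restrictions.

```python
from collections import Counter
from typing import Dict, List, Set

# Only 1.4.1 is named in the text; extend this set from the official restrictions.
HIGH_RISK_VERSIONS: Set[str] = {"1.4.1"}


def version_distribution(clients: Dict[str, str]) -> Dict[str, int]:
    """Count how many clients report each version."""
    return dict(Counter(clients.values()))


def clients_to_upgrade(clients: Dict[str, str],
                       risky: Set[str] = HIGH_RISK_VERSIONS) -> List[str]:
    """Return the sorted addresses of clients running a high-risk version."""
    return sorted(addr for addr, version in clients.items() if version in risky)


# Hypothetical mapping of client address -> reported Nacos client version.
clients = {
    "10.0.0.1": "1.4.1",
    "10.0.0.2": "2.0.3",
    "10.0.0.3": "1.4.1",
    "10.0.0.4": "2.1.0",
}

print(version_distribution(clients))
print(clients_to_upgrade(clients))
```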
3. Business dashboards: Nacos service/configuration dashboards
The metrics in MSE's business dashboards are carefully selected and representative, and help you fully understand the business scale inside the registration and configuration center. When a big promotion is coming and the company asks you to evaluate the center's current capacity, you can base a comprehensive analysis on this data. Nacos is used both as a registration center and as a configuration center, so MSE provides a separate dashboard for each scenario:
Configuration center metrics:
- Number of configurations
- Number of configuration listeners
- Configuration TPS/QPS
- Read/write RT
Registration center metrics:
- Number of service providers/subscribers
- Registration center QPS/TPS
- Registration center read/write RT
- Push success rate/latency/TPS
4. ZooKeeper TopN dashboard
The TopN dashboard is very efficient at locating server-side exceptions caused by external factors:
- Top N Znodes by size
- Top N clients by read/write TPS/QPS to ZooKeeper
- Top N hotspot data by TPS/QPS
- Top N hotspot data by number of watchers
In daily development you have probably run into ZooKeeper FullGC without knowing the specific cause. It may be that ZooKeeper is pushing a large amount of data, but you cannot tell which hot data is being subscribed to. Or a client may be writing large data to ZooKeeper, but you cannot tell which client.
Let's look at two typical misuse scenarios for clients:
- A client mistakenly writes large data to a node with many subscribers, so ZooKeeper pushes a large amount of data and triggers FullGC:
Large data is written to the /99testWriteBig path; the oversized node can be found through the Znode size TopN.
- A client mistakenly accesses a ZK node too frequently, increasing cluster load and response latency, and this client needs to be found:
The client with SessionId 0x1030871c8ed0004 frequently reads the /99testRead node. It can be found through the client QPS TopN dashboard, which also shows which data is read most frequently on the current server.
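The TopN idea behind both scenarios can be sketched in a few lines: rank keys (Znode paths or client sessions) by a metric and keep the largest N. The path /99testWriteBig and the SessionId 0x1030871c8ed0004 come from the text; every size, QPS value, and other key below is hypothetical, and this is not how MSE computes its dashboards internally.

```python
import heapq
from typing import Dict, List, Tuple


def top_n(metric_by_key: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n (key, value) pairs with the largest values, largest first."""
    return heapq.nlargest(n, metric_by_key.items(), key=lambda kv: kv[1])


# Hypothetical Znode sizes in bytes; one node is far larger than the rest.
znode_size_bytes = {
    "/99testWriteBig": 900_000,
    "/dubbo/config": 2_048,
    "/services/order": 512,
}

# Hypothetical read QPS per client session; one session dominates.
read_qps_by_session = {
    "0x1030871c8ed0004": 1500.0,
    "0x1030871c8ed0005": 12.0,
    "0x1030871c8ed0006": 3.5,
}

print(top_n(znode_size_bytes, 1))     # the oversized node
print(top_n(read_qps_by_session, 2))  # the noisiest client sessions
```

`heapq.nlargest` avoids sorting the full set, which matters when the server tracks many Znodes or sessions but only a handful are displayed.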
- Metric alerts
MSE provides alerting on core metrics for the registration and configuration center. It is recommended to configure alerts on the following metrics:
- Nacos recommended configuration:
  - Average service read/write time: reveals performance problems
  - Number of configuration long-polling connections: reveals capacity problems
  - Number of services/configurations: reveals capacity problems or client misuse
- ZooKeeper recommended configuration:
  - Number of Znodes: reveals client misuse
  - Rate of change in the number of connections: if it drops suddenly, a server node may be faulty
  - Number of connections per server: reveals capacity problems or client misuse
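To make the connection-count change rate concrete, the sketch below computes the relative change between consecutive samples and flags sudden drops. The 30% drop limit and the sample series are assumptions chosen for illustration, not values recommended by MSE.

```python
from typing import List, Optional


def change_rate(previous: int, current: int) -> Optional[float]:
    """Relative change between two samples; None when the previous sample is 0."""
    if previous == 0:
        return None
    return (current - previous) / previous


def sudden_drops(counts: List[int], max_drop: float = 0.3) -> List[int]:
    """Return indexes of samples that fell by more than max_drop versus the previous one.

    max_drop=0.3 (a 30% drop) is an assumed limit; tune it for your cluster.
    """
    drops = []
    for i in range(1, len(counts)):
        rate = change_rate(counts[i - 1], counts[i])
        if rate is not None and rate < -max_drop:
            drops.append(i)
    return drops


# Hypothetical connection counts sampled at a fixed interval.
connections = [1000, 1010, 990, 480, 470]
print(sudden_drops(connections))
```

Alerting on the rate of change rather than the absolute count catches a failed server node even when the remaining connection count still looks healthy in isolation.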
Link tracking
- Push track
A push track shows the relevant information along a push link from the server side of the registration and configuration center to the client side, in a form that is easy to query. During development, the push track helps you quickly locate the following problems and greatly improves troubleshooting efficiency:
- Client does not receive service push
- Inter-service calls fail with exceptions
- A configuration release behaves abnormally
- A configuration change does not take effect on a particular machine
- Need to view configuration center changes and push events
MSE - Nacos registry push track query page
MSE - Nacos configuration center push track configuration dimension query page
Cluster Diagnostics
- One-click diagnosis
While the monitoring dashboards help you find and locate problems yourself, the one-click diagnosis function that MSE is about to provide scans for risks automatically. The two complement each other. One-click diagnosis evaluates the following aspects:
The figure below shows the one-click diagnosis page. It lists the risks of the engine instance you have purchased, found by automatic scanning against built-in rules, together with suggestions for improvement:
Smooth migration to MSE
The MSE service autonomy functions introduced above will continue to be improved and polished, with more autonomy capabilities such as event statistics and health audits, to make troubleshooting the registration and configuration center easier and improve usability.
If you are still running a self-built registration and configuration center, it is recommended to migrate to the cloud as soon as possible to enjoy these enterprise-level services. MSE provides an efficient migration tool, MSE Sync, which supports two-way synchronization, automatic fetching of services, and one-click synchronization of all services, helping users complete the migration of Nacos and ZooKeeper registration and configuration centers.
The official MSE documentation provides detailed step-by-step migration guides:
"Self-built Dubbo ZooKeeper to migrate to MSE ZooKeeper"
https://help.aliyun.com/document_detail/444943.html
"Self-built Dubbo ZooKeeper registration center migrated to MSE Nacos"
https://help.aliyun.com/document_detail/446904.html
"Self-built Dubbo Nacos registration center migrated to MSE Nacos"
https://help.aliyun.com/document_detail/445140.html
If you encounter problems with the migration process or need customization, you can contact us for expert one-on-one migration support.
Purchase MSE to enjoy enterprise-level services
MSE provides core capabilities such as high availability, high performance, security, and ease of use.