Alibaba Cloud's cloud-native microservice observability practice

Authors: Shimian, Shui Yu

Introduction to observability

Peter Drucker once said, "If you can't quantify it, you can't manage it." Observability is an important part of keeping microservices running robustly. It helps answer questions such as "Is our system still healthy?", "Is the end-user experience as expected?", and "How do we proactively identify system risks before they cause failures?" If monitoring tells us that something is wrong with the system, observability tells us what is wrong and what caused it. Observability not only determines whether the system is behaving normally, but also helps uncover risks before problems occur.


From a system point of view, monitoring is Ops-centric: it focuses on detecting problems and keeping the system stable. Observability aims at white-box insight and emphasizes both recall and precision. It runs through the Dev, Test, and Ops roles and combines multiple observation methods so that root causes can be found and problems prevented before they occur.


Observability challenges for microservice applications in cloud native

Common microservice frameworks today, such as Spring Cloud and Dubbo as well as multi-language microservices, provide basic capabilities such as service registration and discovery, service configuration, load balancing, API gateways, and distributed microservices. On top of these, service governance includes lossless offlining, service fault tolerance, service routing, and similar capabilities, while observability includes application monitoring, distributed tracing, log management, and application diagnostics.


With the advent of cloud native, microservice architecture has been adopted more and more widely. Deployments have moved from machine-centric ECS cloud servers to container-centric cloud-native deployments. To be more agile, Alibaba Cloud then made the application the core of its microservice offering. Now that microservices have reached a certain scale, Alibaba Cloud has begun to focus on the business itself, using service governance to improve efficiency and stability.


Microservice observability under cloud native mainly faces three challenges:
• Difficult to find
In the move from ECS cloud servers to Kubernetes, both the microservice architecture and the objects to be observed become more complex, and monitoring data coverage is incomplete.

• Difficult to
As governance capabilities deepen, observability requirements rise. The service framework grows more complex, the technical threshold goes up, the data itself becomes more complex, and the data is poorly correlated.

• Poor coordination
As organizational roles change, observability is no longer just an operations job.


Application Real-Time Monitoring Service (ARMS), Alibaba Cloud's observability product, supports automatic detection of some product problems. It currently covers more than 50 fault scenarios, including application changes, large requests, and sudden QPS increases. The recognition rate of its diagnostic reports is as high as 80%.

As shown in the figure below, 7% of online applications spend significant time in Dubbo RPC calls, yet the root cause cannot be located because of gaps in instrumentation (tracing points).

[Figure: time spent in Dubbo RPC calls by online applications]

Alibaba Cloud has found many problems in the process of serving customers.

• Service discovery
At present, some monitoring tools cannot diagnose problems at the service discovery layer of the service framework, which leaves many service invocation problems hard to troubleshoot. Looking at monitoring alone gives customers nowhere to start. We therefore hope to provide the following service discovery monitoring and diagnosis capabilities, to help customers promptly troubleshoot application anomalies caused by service discovery problems.

(1) Monitoring of "no provider" errors on the client;
(2) Which registry the microservice application is connected to, with a diagram of the service discovery call link. The diagram includes the Provider, the Consumer, and the registry; clicking a component shows its detailed address;
(3) Whether the application's services registered successfully;
(4) The number and content of the addresses the application pulled last time;
(5) Whether the heartbeat between the application and the registry is healthy;
(6) Registry status information, such as CPU, memory, and other hardware status, the number of registered services, the number of subscribed services, and service content.
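
Several of the checks above can be sketched as a small diagnostic routine. This is only a minimal illustration in Python, not how ARMS implements it: the `DiscoverySnapshot` fields, the 15-second heartbeat threshold, the registry address, and the `diagnose` function are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiscoverySnapshot:
    """Hypothetical point-in-time view of one application's service discovery state."""
    registry_address: str                 # which registry the application is connected to
    registered: bool                      # whether its services registered successfully
    pulled_addresses: List[str] = field(default_factory=list)  # addresses from the last pull
    seconds_since_heartbeat: float = 0.0  # time since the last successful heartbeat


def diagnose(snapshot: DiscoverySnapshot, heartbeat_timeout_s: float = 15.0) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if not snapshot.registered:
        findings.append("service registration failed")
    if not snapshot.pulled_addresses:
        findings.append("no provider: last pull returned 0 addresses")
    if snapshot.seconds_since_heartbeat > heartbeat_timeout_s:
        findings.append("heartbeat to registry is stale")
    return findings


# Made-up example: registered, but no addresses pulled and a stale heartbeat.
example = DiscoverySnapshot("registry.example:8848", registered=True,
                            pulled_addresses=[], seconds_since_heartbeat=42.0)
print(diagnose(example))
```

Returning a list of findings rather than a single status keeps independent problems visible at the same time, for example a stale heartbeat together with an empty address list.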

• Microservice lifecycle
Microservices are slow to start: 3 minutes for one server and 30 minutes for 5 servers. During application startup, we want to observe Spring bean loading, connection pool connections, microservice service registration, and whether the Kubernetes readiness checks have passed. During application offlining, we want to observe service deregistration, the stopping of in-flight requests, the cancellation of scheduled tasks, MQ consumption, and so on, and service shutdown. For example: a Spring bean fails to initialize, a bean is stuck while loading, or a bean takes a long time to initialize. The goal is to help users analyze the reasons for slow startup and automatically give repair suggestions. However, the overall process currently lacks the relevant observability.

• Call chain
The Consumer call times out, but the Provider returns quickly.


In addition, microservice configuration is chaotic and hard to sort out, and after microservices are moved to Kubernetes, thread pools fill up but the reason cannot be found.

So, thinking about how to build the system from the perspective of microservices, we propose a solution for enhancing microservice observability. What else can be done on top of traditional monitoring solutions?

Observability exploration and practice in microservice scenarios

What problems does microservice observability enhancement solve?

In one sentence: it comprehensively enhances observability in microservice scenarios.

It gives front-line operations and maintenance staff the basic ability to diagnose microservices, so that they can troubleshoot 80% of common microservice problems and quickly perform performance analysis and diagnosis.

The ARMS microservice observability enhancement answers the following questions:

• Why is service startup so slow?
It analyzes the root cause of slow application startup end to end, from Pod creation through application initialization to service registration, filling the observability gap in the application startup life cycle;

• Are there hidden dangers in the dependencies?
It analyzes the Jar packages that Spring Cloud/Dubbo depend on and determines whether there are problems such as Jar dependency conflicts;

• Configuration analysis
In microservice scenarios, configuration is scattered and redundant. The solution provides observability of the application's runtime configuration along with expert experience in configuration optimization;

• Dubbo call chain enhancement
Instrumentation covers the addressing, serialization, network, and other stages, so you can see at a glance where the time of a Dubbo call went.

Why is service startup slow? The analysis runs end to end, from Pod creation through application initialization to service registration, to find the root cause of slow application startup and complete the observability of the application startup life cycle.

[Figure: ARMS container startup analysis]

By linking the entire process together, the time spent at each point can be observed in real time and the problem can be analyzed in an observable view. The figure above shows the ARMS container startup analysis function, with the service startup on the left. The system splits the startup process into timed segments, making it clear where microservice startup is slow and enhancing its observability.
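
The idea of splitting startup into timed segments can be sketched in a few lines of Python. This is an illustration only: the phase names, the durations, and the `startup_breakdown` helper are assumptions for the example, not ARMS data or APIs.

```python
from typing import Dict, List, Tuple


def startup_breakdown(phases: List[Tuple[str, float]]) -> Dict[str, object]:
    """Summarize where startup time went.

    `phases` is an ordered, non-empty list of (phase name, duration in seconds).
    """
    if not phases:
        raise ValueError("at least one phase is required")
    total = sum(duration for _, duration in phases)
    slowest_name, slowest_duration = max(phases, key=lambda phase: phase[1])
    # Share of total startup time per phase, in percent (0.0 if nothing was measured).
    shares = {name: round(100.0 * duration / total, 1) if total else 0.0
              for name, duration in phases}
    return {"total_s": total, "slowest": slowest_name,
            "slowest_s": slowest_duration, "share_pct": shares}


# Illustrative numbers only; they are not measurements from this article.
phases = [
    ("pod_creation", 12.0),
    ("spring_bean_loading", 95.0),
    ("connection_pool_init", 18.0),
    ("service_registration", 25.0),
]
report = startup_breakdown(phases)
print(report["slowest"], report["share_pct"])
```

With per-phase shares in hand, the slowest phase is the natural first place to look, which is the same question the startup analysis view is meant to answer.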


The microservice engine provides lossless online capability: dynamic configuration in the console, real-time observable views of lossless online and offline, and a complete solution that requires no code changes. Protection and governance measures apply throughout microservice startup. In the connection pre-establishment stage, connections are created asynchronously in advance so that requests are not blocked on connection establishment. In the service registration and discovery stage, parallel registration and subscription further improve startup speed. In the low-traffic warm-up stage, the client's load balancing is adjusted so that traffic to a newly started instance grows slowly.
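
One common way to make traffic to a new instance grow slowly is to scale its load-balancing weight linearly with uptime. The sketch below assumes that approach; the article does not say which formula MSE uses, and the function and parameter names are hypothetical.

```python
def warmup_weight(uptime_s: float, warmup_s: float, full_weight: int) -> int:
    """Linearly ramp an instance's weight from 1 up to `full_weight`.

    Assumes `full_weight >= 1`. Once the instance has been up for at least
    `warmup_s` seconds (or if warm-up is disabled), it gets its full weight.
    """
    if warmup_s <= 0 or uptime_s >= warmup_s:
        return full_weight
    scaled = int(full_weight * uptime_s / warmup_s)
    # Never drop to zero, so a fresh instance still receives a trickle of traffic.
    return max(1, min(scaled, full_weight))


# A 120-second warm-up window with a full weight of 100.
for uptime in (0, 30, 60, 120):
    print(uptime, warmup_weight(uptime, 120, 100))
```

A weighted load balancer that picks instances in proportion to these weights would send roughly a quarter of a normal share to an instance that is 30 seconds into a 120-second warm-up.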


Because the override relationships among microservice configurations are complex, configuration analysis is required.

[Figure: Dubbo configuration override order]

The figure above shows the configuration override relationship officially documented by Dubbo; the overrides follow a definite order. It is often hard to tell whether a configuration is in the wrong place, whether it takes effect, or whether it has been overridden. In microservice scenarios, configuration is scattered and redundant, so we provide observability of the application's runtime configuration along with expert experience in configuration optimization.
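
To make "which value takes effect, and what did it override" concrete, here is a minimal Python sketch of layered configuration resolution. The layer names, their order, and the sample values are assumptions for illustration; the real priority order should be taken from the framework's documentation.

```python
from typing import Dict, List, Optional, Tuple

# One configuration layer: (layer name, key/value pairs defined in that layer).
Layer = Tuple[str, Dict[str, str]]


def explain(key: str, layers: List[Layer]) -> Optional[Dict[str, object]]:
    """Report which layer supplies `key` and which lower-priority values it overrides.

    `layers` must be ordered from highest to lowest priority.
    Returns None when no layer defines the key.
    """
    hits = [(name, values[key]) for name, values in layers if key in values]
    if not hits:
        return None
    (source, value), overridden = hits[0], hits[1:]
    return {"key": key, "value": value, "source": source, "overridden": overridden}


# Hypothetical layers and values, highest priority first.
layers = [
    ("jvm_system_properties", {"dubbo.consumer.timeout": "3000"}),
    ("config_center", {"dubbo.consumer.timeout": "5000"}),
    ("local_properties_file", {"dubbo.consumer.timeout": "1000"}),
]
print(explain("dubbo.consumer.timeout", layers))
```

Reporting the overridden entries alongside the effective one is what turns a plain lookup into a diagnostic: it shows at once that a value set in a lower layer never took effect.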

We provide the ability to analyze the Jar packages that Spring Cloud/Dubbo depend on, to help determine whether there are Jar dependency conflicts and whether the dependent Jars carry security or performance risks.
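
A simple form of dependency conflict detection is to look for the same artifact present in more than one version. The Python sketch below assumes Jar files follow the common `<artifact>-<version>.jar` naming, with the version starting with a digit; the function name and the sample file names are made up for the example, and this is not a description of how ARMS performs the analysis.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed naming convention: "<artifact>-<version>.jar", version starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names: Iterable[str]) -> Dict[str, List[str]]:
    """Return artifacts that appear with more than one version."""
    versions = defaultdict(set)
    for jar in jar_names:
        match = JAR_NAME.match(jar)
        if match:  # names that do not follow the convention are skipped
            versions[match["artifact"]].add(match["version"])
    return {artifact: sorted(found)
            for artifact, found in versions.items() if len(found) > 1}


# Made-up classpath listing.
jars = [
    "dubbo-2.7.8.jar",
    "netty-all-4.1.50.Final.jar",
    "netty-all-4.1.25.Final.jar",
    "spring-core-5.3.9.jar",
]
print(find_version_conflicts(jars))
```

File names alone cannot catch every conflict (for example, the same classes repackaged under a different artifact name), so this is a first-pass check rather than a complete analysis.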


Where does the time of an RPC call go? An RPC call passes through stages such as routing, rate limiting and degradation, serialization, and networking. On the client side, it goes through routing, filter, invoker, serialize, and remote. On the server side, it goes through serialize, Proxy Invoke, filter, and implementation.

[Figure: flow chart of an RPC call]

The figure above is a flow chart of an RPC call. The total time includes connection establishment during addressing and load balancing, serialization when packing the request, deserialization when unpacking the return value, server-side processing, and the time spent waiting for the server to process and return.

[Figure: Dubbo call chain broken down by stage]

The above is the answer we give. The call chain is further subdivided within the RPC framework, so that time-consuming details such as routing, serialization, network, proxy, and server-side processing can be seen at a glance.
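
A per-stage breakdown like this can be derived from ordered timestamps recorded at the start of each stage. The Python sketch below is illustrative only: the stage names loosely follow the stages mentioned above, while the timestamps and the `stage_durations` helper are made up and do not come from Dubbo or ARMS.

```python
from typing import Dict, List, Tuple


def stage_durations(marks_ms: List[Tuple[str, float]]) -> Dict[str, float]:
    """Convert ordered (stage, start time in ms) marks into per-stage durations.

    The last mark only closes the previous stage, so it gets no entry of its own.
    """
    durations = {}
    for (stage, start), (_, next_start) in zip(marks_ms, marks_ms[1:]):
        durations[stage] = next_start - start
    return durations


# Made-up trace of a single 60 ms call.
marks = [
    ("routing", 0.0),
    ("serialization", 2.0),
    ("network", 5.0),
    ("server_processing", 45.0),
    ("deserialization", 57.0),
    ("end", 60.0),
]
durations = stage_durations(marks)
bottleneck = max(durations, key=durations.get)
print(durations, bottleneck)
```

In this made-up trace the server itself accounts for only 12 ms, while 40 ms of the 60 ms call falls in the network stage. That is the kind of case in which a Consumer call times out even though the Provider returns quickly, and it is invisible unless the stages are timed separately.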

Summary

The microservice observability enhancement solution builds on the traditional observability solution. From the perspective of microservices, we extend the Tracing, Logging, and Metrics data covered by traditional observability and combine it with the diagnostic experience of microservice experts.

From the front end and applications down to the underlying machines, ARMS monitors every operation, every slow SQL statement, and every exception of the application service in real time. It also provides complete dashboard monitoring that shows key indicators such as request volume, response time, Full GC count, slow SQL and exception counts, and the number and duration of inter-application calls, all to provide the best user experience.


Alibaba Cloud's Microservices Engine (MSE) has been newly upgraded. In the governance center, MSE improves the efficiency and stability of microservice development. It supports Spring Cloud and Dubbo applications from nearly the last 5 years, as well as multi-language heterogeneous microservice systems, and provides differentiated capabilities such as lossless online and offline, full-link grayscale, outlier instance removal, service authentication, and more. In the registry and configuration center, MSE offers fully managed ZooKeeper/Nacos/Eureka services that are highly available by default (multi-zone deployment, automatic detection) and support configuration authentication, encryption, and grayscale publishing. For the cloud-native gateway, MSE integrates monitoring and alerting, tracing, rate limiting and degradation, and certificate management. The traffic gateway and the microservice gateway are combined into one, reducing cost by 50%.