Author: Li
Hello everyone, this is Li Huangdong from Alibaba Cloud. Today I will share with you the fourth session of the Kubernetes monitoring open class: how to use Kubernetes monitoring to analyze and locate slow calls. Today's course has three parts: first, I will introduce the hazards and common causes of slow calls; second, the analysis methods and best practices for slow calls; and finally, I will demonstrate the slow call analysis process through a few cases.
Slow call hazards and common causes
In software development, slow calls are a very common anomaly. Their possible harms include:
- Front-end business dimension: slow calls may cause slow front-end loading, which can lead to a higher application uninstall rate and, in turn, damage the brand's reputation.
- Project delivery: slow interfaces can cause the SLO to be missed, which delays project delivery.
- Business architecture stability: slow interface calls easily lead to timeouts. When other services depend on this interface, large numbers of retries follow, resources are exhausted, and eventually part of the system or even the whole system becomes unavailable, causing an avalanche.
Therefore, a seemingly innocuous slow call may hide a huge risk, and we should be vigilant. It is best not to ignore slow calls but to analyze the reasons behind them as much as possible, so as to keep the risk under control.
What causes slow calls? There are countless possible reasons, but they boil down to five common categories.
- The first is high resource usage, for example of CPU, memory, disk, or network card. When usage is too high, the service easily slows down.
- The second is code design problems. Generally speaking, a SQL statement that joins many tables or queries many tables will greatly hurt SQL execution performance.
- The third is dependency problems. The service itself is fine, but the downstream service it calls responds slowly, so the service sits waiting, and its own calls become slow as well.
- The fourth is design problems. For example, a table holds massive amounts of data, and queries over hundreds of millions of rows are run without splitting the database and tables, which easily causes slow queries. Similarly, time-consuming operations are left uncached.
- The fifth is network problems, such as intercontinental calls: the physical distance is so large that the round-trip time becomes long, which slows the call down. The network between two points may also simply perform poorly, for example packet loss leading to a high retransmission rate. A rough back-of-the-envelope estimate of the intercontinental case is sketched below.
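To make the intercontinental case concrete, here is a small worked estimate. The distance and the roughly 200,000 km/s signal speed inside optical fiber are illustrative assumptions, not measurements:

```python
# Rough lower bound on round-trip time for an intercontinental call.
# The numbers below are illustrative assumptions, not measurements.
distance_km = 10_000            # e.g. roughly one-way trans-Pacific distance
fiber_speed_km_per_s = 200_000  # light travels at about 2/3 of c inside fiber

min_rtt_ms = 2 * distance_km / fiber_speed_km_per_s * 1000
print(f"Minimum possible RTT: {min_rtt_ms:.0f} ms")  # ~100 ms before any queuing
```

Even before any packet loss or queuing, physics alone already adds about 100 ms of latency to every round trip in this scenario.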
Today's examples revolve around these five aspects. Let's take a look together.
In general, what are the steps for locating slow calls, and what are the best practices? What I have summarized here covers three aspects: golden signals + resource indicators + global architecture.
Let's take a look at the golden signals first. The golden signals come from the book Site Reliability Engineering and are the minimum set of indicators for characterizing the health of a system:
- Latency: the time it takes the system to serve a request. Common indicators include the average response time and the P90/P95/P99 quantiles; they give an intuitive picture of how fast or slow the system responds externally (a small sketch of computing these signals follows this list).
- Traffic: how busy the service is. Typical indicators are QPS and TPS.
- Errors: for example, HTTP 500 and 400 responses. A large number of errors usually means something may be wrong.
- Saturation: how full the resources are. Services running close to saturation are more prone to problems; for example, when the disk is full, logs can no longer be written and the service stops responding. Typical resources include CPU, memory, disk, queue length, number of connections, and so on.
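As a minimal illustration of how these signals can be computed from raw request records (the record layout, the sample data, and the one-second slow-call threshold are assumptions for this sketch, not part of the product):

```python
# Compute latency quantiles, traffic, errors, and slow calls from request records.
# Each record is (latency in seconds, HTTP status code); the data is made up.
requests = [(0.12, 200), (0.30, 200), (1.40, 200), (0.08, 500), (2.10, 200)]

def quantile(sorted_values, q):
    """Nearest-rank quantile, good enough for a sketch."""
    idx = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

latencies = sorted(lat for lat, _ in requests)
print("P90:", quantile(latencies, 0.90))
print("P95:", quantile(latencies, 0.95))
print("P99:", quantile(latencies, 0.99))
print("traffic (requests):", len(requests))
print("errors:", sum(1 for _, code in requests if code >= 400))
print("slow calls (>1s):", sum(1 for lat, _ in requests if lat > 1.0))
```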
In addition to the golden signals, we also need to pay attention to resource indicators. The well-known performance analyst Brendan Gregg proposed the USE method in his performance analysis methodology. The USE method analyzes from the perspective of resources: for each resource, check utilization, saturation, and errors; together that is USE. Checking these three items can solve roughly 80% of service issues while taking only about 5% of the effort (a small sketch of such a check follows).
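Here is a minimal sketch of a USE-style check on a single host. It uses the psutil library, and the mapping of specific counters to utilization, saturation, and errors is an illustrative assumption rather than Gregg's canonical checklist:

```python
# Utilization / Saturation / Errors snapshot for a few host resources.
import psutil

net = psutil.net_io_counters()
checks = {
    "cpu utilization %": psutil.cpu_percent(interval=1),
    "cpu saturation (load per core)": psutil.getloadavg()[0] / psutil.cpu_count(),
    "memory utilization %": psutil.virtual_memory().percent,
    "disk utilization %": psutil.disk_usage("/").percent,
    "network errors (in+out)": net.errin + net.errout,
}

for name, value in checks.items():
    print(f"{name:32s} {value:.2f}")
```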
After we have the golden signals and resource indicators, what else should we pay attention to? As Brendan Gregg put it in his methodology, we can't just see the trees and miss the forest. An old Chinese saying makes the same point: he who does not plan for the whole is not fit to plan for a part. We should draw the system architecture and look at performance issues from a global perspective, not only at a single resource or a single service. Considering everything together, identifying the bottleneck, and solving the problem systematically through design is the better approach. Therefore, we need the combination of golden signals, resource indicators, and the global architecture.
Best practices for slow calls
Next, I will walk through three cases. The first is a node with a full CPU, a typical case of a slow call caused by the service's own resources. The second is a slow call on a dependent middleware service. The third is poor network performance. The first case checks whether the service itself has a problem; the second checks the downstream service; the third checks the network between the service and its downstream.
Let's take an e-commerce application as an example. The traffic entry is Alibaba Cloud SLB, from which traffic enters the microservice system. Inside, a gateway receives all the traffic and forwards it to the corresponding internal services, such as ProductService, CartService, and PaymentService. Below them are middleware dependencies such as Redis and MySQL. We use Alibaba Cloud's ARMS Kubernetes monitoring product to monitor this entire architecture, and we use chaosblade to inject different types of faults, such as a full CPU and network anomalies.
Case 1: Node CPU is full
What kind of problems does a full node CPU cause? When the node CPU is full, the Pods on it may not be able to get more CPU, so the threads inside keep waiting to be scheduled, which leads to slow calls. Besides CPU, other node resources such as disk and memory can cause similar problems.
Next, let's look at some characteristics of CPU in a Kubernetes cluster. First of all, CPU is a compressible resource. In Kubernetes there are several common configurations: Requests, which are mainly used for scheduling, and Limits, which cap usage at runtime; usage beyond the limit is throttled (see the sketch below). Our experimental principle is therefore to fill up the node's CPU so that Pods cannot get more CPU, which in turn slows the service down.
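As a minimal sketch of what CPU Requests and Limits look like, here is an example using the official Kubernetes Python client. The container name, image, and values are illustrative assumptions, not the demo's actual settings:

```python
# CPU Requests (used for scheduling) and Limits (runtime cap) on a container.
from kubernetes import client

container = client.V1Container(
    name="gateway",                      # hypothetical container name
    image="example/gateway:latest",      # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m"},        # the scheduler places the Pod based on this
        limits={"cpu": "1"},             # hard cap; usage above this is throttled
    ),
)

pod_spec = client.V1PodSpec(containers=[container])
print(pod_spec.containers[0].resources)
```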
Before the formal start, we identify the key links on the topology map and configure alarms on them. For example, for the gateway and payment links, we configure alarms on the P90 average response time and on slow calls. After the configuration, I inject a fault that fills up the CPU of the gateway node. After about five minutes, we receive the alarm, which verifies in this second step that the alarm is effective.
Next we move to root cause localization. First, we open the gateway's application details and check the relevant golden signals. The first is response time, which shows a very obvious sudden increase. Below it is the number of slow calls, which has jumped to more than 1,000. P90/P95 have risen significantly, above one second, indicating that the whole service has slowed down.
Next, we analyze the resource indicators. In the Pod CPU usage chart, we can see that the Pod's usage has risen rapidly during this period, which means it needs to request more CPU from the host (the node). Looking further at the node's CPU usage, we see it is close to 100% during this period, so the Pod cannot get more CPU, which further slows the service down and drives the average response time up sharply.
After locating the problem, we can think about a concrete solution: configure elastic scaling based on CPU usage. Because we do not know the traffic pattern in advance and cannot predict when resources will suddenly become insufficient, the best way to handle this scenario is to configure elastic scaling for the resources, including for the nodes, so that capacity can expand dynamically when the load increases. For the application, we can configure a scaling action on the CPU metric that increases the number of replicas to share the traffic, for example with a maximum of ten replicas and a minimum of three (a sketch of such a configuration follows).
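Here is a minimal sketch of a CPU-based HorizontalPodAutoscaler using the official Kubernetes Python client. The names, namespace, and the 70% target are assumptions taken from the numbers mentioned in this case, not the exact settings used in the demo; it also assumes a reachable cluster and a local kubeconfig:

```python
# Scale a hypothetical "gateway" Deployment between 3 and 10 replicas on CPU usage.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig is available

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="gateway-hpa", namespace="default"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="gateway"
        ),
        min_replicas=3,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out above ~70% average CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```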
The effect is as follows: after the CPU fault is injected, slow calls rise; once CPU usage exceeds the threshold, for example 70%, elastic scaling is triggered, additional replicas are automatically created to share the traffic, and the number of slow calls gradually decreases until it disappears, showing that the elastic scaling has worked.
Case 2: Slow calls to dependent middleware
Next, let's look at the second case. First, the preparation. In the picture on the left, following the gateway downstream there are two services on the path: ProductService and MySQL. So, first, we configure an alarm on the gateway for a P99 average response time greater than one second. Second, ProductService is also on the key link, so we configure a P99-greater-than-one-second alarm for it as well. Third, MySQL gets a greater-than-one-second alarm too. After the configuration is complete, I inject a MySQL slow-query fault on ProductService. After about two minutes, the alarms are triggered one after another: a red dot and a gray dot appear on the gateway and on ProductService. Kubernetes monitoring automatically matches the alarm events to the corresponding nodes through namespace and application, so you can see at a glance which services and applications are abnormal and locate the problem quickly. Now that we have received the alerts, the next step is root cause localization.
Let me first explain the localization process. We prefer alarm-driven analysis, since prevention is better than cure, so the process is to configure alarms first and then localize the root cause. We then use the topology for visual analysis, because the topology offers architecture awareness and upstream/downstream analysis. After receiving the alarms, we look at what happened to the corresponding applications. First the gateway: its P99 has risen above 1800 milliseconds, which triggers the greater-than-one-second alarm, and the other quantiles are rising as well. Then we look at the other alarmed service, ProductService. Clicking on that node, the panel shows that it also has slow calls: P99 and P95 have risen to varying degrees, and most slow calls take longer than one second. We then check ProductService's own resource usage, since the problem might lie in the service itself, and then its downstream dependencies, Nacos and MySQL. Looking at the interaction with MySQL, we find a large number of slow calls; clicking into their details and drilling down into the traces, we see that when ProductService called MySQL it executed a very complicated SQL statement that joins multiple tables, and the trace shows that this call accounts for most of the time. In this way we can conclude that the problem is basically caused by this SQL statement.
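The demo surfaces the slow SQL through the product's trace drill-down. As a rough illustration of the same idea, here is a minimal, self-contained sketch that times each query and flags the ones above a threshold; the use of sqlite3 (for runnability) and the one-second threshold are assumptions for the example:

```python
import sqlite3
import time

SLOW_THRESHOLD_S = 1.0  # flag queries slower than one second (assumed threshold)

def timed_execute(cursor, sql, params=()):
    """Run a query and log it if it exceeds the slow-call threshold."""
    start = time.perf_counter()
    cursor.execute(sql, params)
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_THRESHOLD_S:
        print(f"SLOW QUERY ({elapsed:.2f}s): {sql}")
    return cursor

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
timed_execute(cur, "SELECT * FROM product WHERE name = ?", ("demo",))
```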
To summarize the whole process: first, we identify the critical path through architecture awareness and configure alarms on it to discover anomalies proactively. After an anomaly is discovered, we use golden signals and resource indicators to check the service itself; if it is fine, we follow the dependency downstream and look at the downstream's resource indicators. With this method we localize a slow call caused by a dependency, that is, a middleware call problem.
Case 3: Poor network performance
Finally, the last example: poor network performance. The Kubernetes network architecture is fairly complicated: container-to-container communication, Pod-to-Pod communication, Pod-to-Service communication, external-to-Service communication, and so on. The complexity is high and the learning curve is steep, which makes problems harder to locate. So how do we handle this situation? We use key network indicators to discover network anomalies. What are they? The first is rate and bandwidth, the second is throughput, the third is latency, and the fourth is RTT (a small sketch of measuring RTT and the TCP retransmission rate follows).
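As a rough, Linux-only sketch of two of these indicators (this is not the product's implementation; the target host, port, and counter path are assumptions), RTT can be estimated from the time a TCP handshake takes, and the cumulative retransmission rate can be read from the kernel's TCP counters:

```python
import socket
import time

def tcp_connect_rtt(host="example.com", port=80, timeout=3.0):
    """Estimate RTT as the time taken by a TCP handshake."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000  # milliseconds

def tcp_retransmission_rate(path="/proc/net/snmp"):
    """Return cumulative RetransSegs / OutSegs from the kernel's TCP counters."""
    with open(path) as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    stats = dict(zip(header[1:], map(int, values[1:])))
    return stats["RetransSegs"] / max(stats["OutSegs"], 1)

print(f"RTT: {tcp_connect_rtt():.1f} ms")
print(f"TCP retransmission rate: {tcp_retransmission_rate():.4%}")
```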
First, I configure the alarms, then inject a high packet-loss fault on the node where MySQL is located. After waiting a few minutes, we receive slow call alarms: the response times of the gateway and ProductService both exceed one second. Next, let's look at the root cause. The gateway's P99 response time has risen, and ProductService's average response time has risen sharply, that is, its calls have just become slow. Looking further at ProductService's downstream dependencies, the three services Nacos, Redis, and MySQL, we find that the slow calls to MySQL are the most obvious, and at the same time its RTT and retransmissions have clearly increased.
Under normal circumstances RTT is very stable; it reflects the round-trip time between upstream and downstream. When it rises quickly, we can basically assume a network problem. Along the three hops, gateway, ProductService, and MySQL, we can therefore conclude that identifying the critical path and configuring alarms on the topology locates the problem very quickly, without having to verify lots of information scattered across different places: we only need to check the corresponding performance and network indicators on the topology. This is our best practice of golden signals + resource indicators + resource topology for locating problems such as slow calls.
Finally, to summarize this best practice:
1. Discover anomalies proactively through the default alarms. The default alarm templates cover the RED metrics and common resource indicators. Beyond the default rules, users can also customize configurations based on the templates.
2. Locate anomalies through golden signals and resource indicators, and drill down into traces to find the root cause.
3. Do upstream/downstream analysis, dependency analysis, and architecture awareness through the topology map; this helps examine the architecture from a global perspective, find the optimal solution, achieve continuous improvement, and build a more stable system.
That's all for this lesson. You are welcome to scan the QR code or search the group number (31588365) on DingTalk to join the Q&A group for discussion.