As Kubernetes becomes the de facto cloud-native standard, observability challenges follow

Cloud-native technology today is built on containers: it standardizes infrastructure through scalable scheduling, networking, storage, and container runtime interfaces, and it standardizes operations through declarative, extensible resources and controllers. These two layers of standardization separate development and operations concerns, allow each field to scale and specialize, and ultimately optimize cost, efficiency, and stability.

Against this backdrop, more and more companies have adopted cloud-native technologies to develop, operate, and maintain business applications. Precisely because cloud-native technology opens up so many possibilities, business applications now commonly feature numerous microservices, multiple languages, and multiple communication protocols. At the same time, cloud-native technology pushes complexity down into the infrastructure, creating new challenges for observability:

1. Chaotic microservice architectures mixing multiple languages and network protocols

Because of the division of labor, business architectures tend to sprawl into a large number of services with complex calling protocols and relationships. Common problems include:
• The overall runtime structure of the system cannot be understood or controlled accurately and clearly;
• It is hard to answer whether connectivity between applications is correct;
• Multiple languages and network protocols make the cost of instrumentation grow linearly, and the ROI of repeated instrumentation is low; developers tend to lower the priority of such requirements, yet observability data still has to be collected.

2. Infrastructure capabilities sink and hide implementation details, making problems harder to delineate

As infrastructure capabilities continue to sink and development and operations concerns are further separated, implementation details are hidden between layers and data is poorly correlated, so when a problem occurs it is impossible to quickly determine which layer it belongs to. Developers only care whether the application works properly, not the details of the underlying infrastructure; when a problem occurs, they have to cooperate with operations engineers to troubleshoot it. During troubleshooting, operations engineers need enough upstream and downstream context; otherwise they only get a vague statement such as "a certain application has high latency" and find it hard to make progress. Developers and operations engineers therefore need a common language to improve communication efficiency, and Kubernetes concepts such as Label and Namespace are well suited for building that contextual information.

3. Too many monitoring systems lead to inconsistent monitoring interfaces

A serious side effect of complex systems is the proliferation of monitoring systems: data is not linked or unified, and the monitoring experience is inconsistent. Many operations engineers have had this experience: while locating a problem, the browser ends up with dozens of tabs open, switching back and forth between Grafana, consoles, logs, and other tools. This is not only time-consuming, but the brain can only process a limited amount of information, so problem location is inefficient. A unified observability interface organizes data and information effectively, reducing distraction and page switching, improving the efficiency of problem location, and freeing up valuable time for building business logic.

The solution and its technical approach

To solve these problems, we need a technology that supports multiple languages and communication protocols and, at the product level, covers the end-to-end observability requirements of the software stack as much as possible: an observability solution that reaches down to the operating system and up to application performance monitoring.

Collecting data across containers, node runtime environments, applications, and the network is very challenging. The cloud-native community provides cAdvisor, node exporter, kube-state-metrics, and other tools for different needs, but they still cannot cover everything, and the cost of maintaining many collectors should not be underestimated. A natural question is whether there is a data collection solution that is non-intrusive to applications and supports dynamic extension. The best answer today is eBPF.

"Data Acquisition: The Superpower of eBPF"

eBPF is essentially an execution engine built into the kernel: a program is attached to a kernel event through a system call and runs when that event fires. With the event in hand, we can infer the protocol, filter out the protocols of interest, process the event further, and place it in a ring buffer or an eBPF map for a user-space process to read. After reading the data, the user-space process associates it with Kubernetes metadata and pushes it to the storage backend. That is the overall flow.
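
To make the flow concrete, here is a minimal user-space sketch, assuming the open-source cilium/ebpf Go library and a pre-compiled eBPF object file; the file name (probe.o), program name (trace_event), and map name (events) are hypothetical placeholders, not the product's actual implementation:

```go
// Minimal sketch of the user-space side of the flow described above.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load the compiled eBPF programs and maps into the kernel.
	coll, err := ebpf.LoadCollection("probe.o") // hypothetical object file
	if err != nil {
		log.Fatalf("load collection: %v", err)
	}
	defer coll.Close()

	// Attach the program to a kernel event (here: a kprobe on tcp_sendmsg).
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["trace_event"], nil)
	if err != nil {
		log.Fatalf("attach kprobe: %v", err)
	}
	defer kp.Close()

	// Read the events the kernel side pushed into the ring buffer.
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatalf("open ringbuf: %v", err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			return
		}
		// In a real collector this is where the raw event would be decoded,
		// enriched with Kubernetes metadata (Pod, Namespace, Workload), and
		// pushed to the storage backend.
		log.Printf("got %d bytes of event data", len(rec.RawSample))
	}
}
```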

eBPF's superpower lies in its ability to subscribe to all kinds of kernel events, such as file reads and writes or network traffic. Everything a container or Pod running in Kubernetes does goes through kernel system calls, and the kernel knows about every process on the machine, so the kernel is very close to the ideal vantage point for observability, which is why we chose eBPF. Another advantage of monitoring at the kernel level is that applications do not need to be changed and the kernel does not need to be recompiled, so it is truly non-intrusive. When there are dozens or hundreds of applications in a cluster, a non-intrusive solution helps a lot.

As a relatively new technology, eBPF also raises some concerns, mainly about safety and probe overhead. To guarantee kernel runtime safety, eBPF code is subject to many restrictions, such as a maximum stack size of 512 bytes and a maximum of 1 million instructions. As for overhead, the eBPF probe is kept at around 1%. Its high performance comes mainly from processing data inside the kernel and reducing copies between kernel space and user space: simply put, the data is aggregated in the kernel and only the result, such as a gauge value, is handed to the user process, whereas previously the raw data had to be copied to the user process and computed there.

Programmable execution engines are a natural fit for observability

Observability engineering helps users better understand the internal state of a system, eliminating knowledge blind spots and addressing systemic risks in time. So what does eBPF bring to observability?

Take an application exception as an example. When an exception is discovered, the troubleshooting process often reveals that application-level observability is missing. At that point instrumentation is added, tested, and released, and the specific problem is solved, but the symptom is treated rather than the root cause: the next time a problem appears elsewhere, the same process has to be repeated. On top of that, multiple languages and protocols make instrumentation even more expensive. A better approach is to solve this non-intrusively, so that data is already there when observation is needed.

The eBPF execution engine can collect observability data by dynamically loading and executing eBPF programs. Suppose, for example, that the Kubernetes system originally does no process-level monitoring, and one day a malicious process (such as a crypto-mining program) is found consuming CPU like crazy; we then realize that the creation of such malicious processes should be monitored. We could do this by integrating an open-source process event detection library, but that usually means packaging, testing, and releasing the whole thing, which may take a month to complete.

By contrast, the eBPF approach is more efficient and faster. Since eBPF programs can be dynamically loaded into the kernel to monitor process creation events, we can abstract the eBPF program into a submodule: the collection agent only needs to load the program in that submodule to start collecting data, and then push the data to the backend through a unified data channel (see the sketch below). This skips the tedious process of changing code, packaging, testing, and releasing, and fulfils the process-monitoring requirement dynamically and non-intrusively. The eBPF programmable execution engine is therefore very well suited to enhancing observability, collecting rich kernel data, and correlating it with business applications to speed up troubleshooting.
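
As an illustration of what "loading a submodule" could look like in practice, here is a minimal sketch, again assuming the cilium/ebpf Go library; the object file proc_exec.o and program name on_process_exec are hypothetical, while sched:sched_process_exec is a real kernel tracepoint fired on every execve:

```go
// Minimal sketch of dynamically enabling process-creation monitoring
// without redeploying the collection agent.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// loadProcessExecModule loads the hypothetical process-monitoring submodule
// and attaches it to the sched:sched_process_exec tracepoint.
func loadProcessExecModule() (link.Link, error) {
	coll, err := ebpf.LoadCollection("proc_exec.o") // hypothetical submodule
	if err != nil {
		return nil, err
	}
	// From now on, every execve on the node is observed by the eBPF program.
	return link.Tracepoint("sched", "sched_process_exec",
		coll.Programs["on_process_exec"], nil)
}

func main() {
	tp, err := loadProcessExecModule()
	if err != nil {
		log.Fatalf("enable process monitoring: %v", err)
	}
	defer tp.Close()
	// ... events are then read and pushed through the same unified data
	// channel as the other submodules.
	select {}
}
```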

From monitoring systems to observability

With the cloud-native wave, the concept of observability has taken hold, yet it still rests on the three data cornerstones of the observability field: logs, metrics, and traces. Engineers who have done operations or SRE work know the situation well: pulled into an emergency group chat in the middle of the night and asked why the database is down, with no context they cannot immediately grasp the core of the problem. We believe a good observability platform should provide users with good context, as Datadog's CEO put it: monitoring tools are not about having more features, but about how to bridge the gap between different teams and members and get everyone on the same page.

Therefore, the product design of an observability platform needs to be based on metrics, traces, and logs, integrate the various Alibaba Cloud services, support ingestion of open-source product data, and associate key contextual information so that engineers from different backgrounds can understand it, thereby speeding up troubleshooting. If information is not organized effectively, it carries a cost of understanding. We organize information from coarse to fine granularity on a single page, from events to metrics to traces to logs, which makes drill-down convenient, avoids jumping back and forth between multiple systems, and provides a consistent experience.

So how is the data correlated and the information organized? Mainly along two dimensions:

1. End to end: concretely, this means application to application and service to service. Kubernetes standardization and the separation of concerns mean that development and operations each focus on their own domain, so end-to-end monitoring often becomes a no-man's-land: when a problem occurs, it is hard to tell which hop on the call path is at fault. From the end-to-end point of view, the call relationship between two parties is the basis for correlation, because system calls create the connection. With eBPF it is very convenient to collect network calls non-intrusively, parse them into well-known application protocols such as HTTP, gRPC, and MySQL, and finally build a clear service topology for quickly locating problems. In the complete path of gateway -> Java application -> Python application -> cloud service in the figure below, a delay at any hop should be visible at a glance in the service topology. This is the first correlation dimension: end to end.

2. Top-down full-stack correlation: using the Pod as the medium, the Kubernetes layer can be associated with Workload, Service, and other objects, the infrastructure layer with nodes, storage devices, and the network, and the application layer with logs, call traces, and so on, as sketched below.
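
A minimal sketch of this Pod-as-medium correlation, assuming the standard Kubernetes client-go library (the namespace and Pod name below are hypothetical); the real product performs this enrichment automatically when processing eBPF events:

```go
// Given a Pod, walk up to its owning workload via OwnerReferences and find
// the Services whose selectors match its labels.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ns, podName := "default", "product-7d4f9c6b8-abcde" // hypothetical Pod
	pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), podName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Kubernetes layer: the owning workload (ReplicaSet/Deployment, StatefulSet, ...).
	for _, owner := range pod.OwnerReferences {
		fmt.Printf("owned by %s/%s\n", owner.Kind, owner.Name)
	}

	// Kubernetes layer: Services whose selector matches the Pod's labels.
	svcs, err := client.CoreV1().Services(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, svc := range svcs.Items {
		if len(svc.Spec.Selector) > 0 &&
			labels.SelectorFromSet(svc.Spec.Selector).Matches(labels.Set(pod.Labels)) {
			fmt.Printf("exposed by service %s\n", svc.Name)
		}
	}

	// Infrastructure layer: the node the Pod runs on.
	fmt.Printf("scheduled on node %s\n", pod.Spec.NodeName)
}
```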

Next, we introduce the core functions of Kubernetes monitoring.

Timeless golden signals

Golden signals are the minimum set of indicators for monitoring system performance and service status. They have two advantages: first, they directly and clearly express whether the system is serving external requests normally; second, they allow a quick assessment of the impact on users or the severity of the situation, which saves SREs and developers a lot of time. Imagine using CPU usage as a golden signal: SREs and developers would be exhausted, because high CPU usage often has no real impact at all.

Kubernetes Monitoring supports the following golden signals:
• Request count / QPS
• Response time and percentiles (P50, P90, P95, P99)
• Error count
• Slow call count

As shown in the figure below:
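
As a rough illustration of how these signals can be derived from raw request records, here is a minimal sketch; the Request type, the 10-second window, and the 500 ms slow-call threshold are assumptions for the example, not the product's actual definitions:

```go
// Compute QPS, latency percentiles, error count, and slow-call count from a
// batch of observed requests.
package main

import (
	"fmt"
	"sort"
	"time"
)

type Request struct {
	Latency time.Duration
	Err     bool
}

func percentile(sorted []time.Duration, p float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	return sorted[int(float64(len(sorted)-1)*p)]
}

func main() {
	window := 10 * time.Second // assumed aggregation window
	reqs := []Request{
		{80 * time.Millisecond, false},
		{120 * time.Millisecond, false},
		{900 * time.Millisecond, true},
		{40 * time.Millisecond, false},
	}

	latencies := make([]time.Duration, 0, len(reqs))
	errs, slow := 0, 0
	for _, r := range reqs {
		latencies = append(latencies, r.Latency)
		if r.Err {
			errs++
		}
		if r.Latency > 500*time.Millisecond { // assumed slow-call threshold
			slow++
		}
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })

	fmt.Printf("QPS: %.2f\n", float64(len(reqs))/window.Seconds())
	fmt.Printf("P50: %v  P95: %v  P99: %v\n",
		percentile(latencies, 0.50), percentile(latencies, 0.95), percentile(latencies, 0.99))
	fmt.Printf("errors: %d  slow calls: %d\n", errs, slow)
}
```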

Service topology from a global perspective

As the old Chinese saying goes, "he who does not plan for the whole is not fit to plan for a single domain." As technical and deployment architectures grow more complex, locating a problem after it occurs becomes harder and harder, which in turn drives MTTR higher and higher. Another consequence is that analyzing the blast radius becomes very challenging, and one aspect is often addressed at the expense of another. A large topology, like a map, is therefore necessary. The global topology has the following characteristics:

System architecture awareness: the system architecture diagram is an important reference for programmers to understand a new system. When taking over a system, they at least need to know where the traffic entry point is, what the core modules are, and which internal and external components the system depends on. During anomaly localization, having a map of the global architecture greatly accelerates the process.

Dependency analysis: problems in downstream dependencies are troublesome when that dependency is maintained by another team, and even more so when neither your own system nor the downstream system has sufficient observability, because it becomes hard to explain the problem to the dependency's maintainer. In our topology, the golden signals of upstream and downstream are connected by the call relationship to form a call graph: edges visualize dependencies, and the golden signals of each call can be inspected on them. With these golden signals, you can quickly determine whether a downstream dependency has a problem.

Distributed Tracing Helps Root Cause Location

Protocol traces are likewise non-intrusive and language-agnostic. If the request content contains a distributed-tracing TraceID, it is automatically recognized, making it easy to drill down into the trace. The request and response information of the application-layer protocol helps analyze the request content and return codes, so you know which interface has the problem. To see code-level details or the request body, click the TraceID to drill down into the trace analysis view.
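
A minimal sketch of the TraceID recognition idea, assuming the W3C Trace Context traceparent header (format: version-traceid-spanid-flags); real traffic may carry other tracing headers, which the product also recognizes:

```go
// Extract a 32-hex-character trace ID from captured HTTP headers.
package main

import (
	"fmt"
	"strings"
)

// extractTraceID returns the trace ID from a traceparent header value such as
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
func extractTraceID(headers map[string]string) (string, bool) {
	tp, ok := headers["traceparent"]
	if !ok {
		return "", false
	}
	parts := strings.Split(tp, "-")
	if len(parts) < 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

func main() {
	headers := map[string]string{
		"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
	}
	if id, ok := extractTraceID(headers); ok {
		fmt.Println("drill down with TraceID:", id)
	}
}
```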

Out-of-the-box alerts

The out-of-the-box alert templates cover all layers, so alerts do not need to be configured manually. Large-scale Kubernetes operations experience is baked into the templates, and carefully designed alert rules combined with intelligent noise reduction and deduplication mean that once an alert fires it is a valid one, and it carries associated information so that the abnormal entity can be located quickly. The advantage of full-stack alert rule coverage is that high-risk events are reported to users proactively and in time; users can then gradually improve system stability through troubleshooting, incident handling, post-incident reviews, and failure-oriented design.

Network performance monitoring

Network performance problems are very common in Kubernetes environments. Because TCP's underlying mechanisms hide the complexity of network transmission, the application layer is oblivious to it, which makes problems such as high packet loss and high retransmission rates in production quite troublesome. Kubernetes Monitoring supports RTT, retransmissions and packet loss, and TCP connection information to characterize network conditions. Taking RTT as an example, network performance can be viewed by namespace, node, container, Pod, service, and workload. It supports locating the following kinds of network problems (a sketch of the underlying retransmission data follows the list):

• The load balancer cannot access a Pod and the traffic on that Pod is 0; you need to determine whether the problem is with the Pod's network or with the load balancer configuration;
• The performance of applications on a certain node seems poor, and you need to determine whether the node's network has a problem, which can be checked by comparing with other nodes;
• Packet loss occurs somewhere on the path, but it is unclear at which layer; you can check in the order node, Pod, container.
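
As promised above, here is a minimal sketch of one way to derive a node-level TCP retransmission rate, by reading the kernel's /proc/net/snmp counters (RetransSegs / OutSegs). The product itself collects these and finer-grained per-connection metrics via eBPF; this only illustrates the underlying data:

```go
// Read cumulative TCP segment counters and compute a retransmission rate.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func tcpCounters() (outSegs, retransSegs uint64, err error) {
	f, err := os.Open("/proc/net/snmp")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	var header []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 || fields[0] != "Tcp:" {
			continue
		}
		if header == nil {
			header = fields // first "Tcp:" line holds the column names
			continue
		}
		// Second "Tcp:" line holds the values, aligned with the header.
		for i, name := range header {
			switch name {
			case "OutSegs":
				outSegs, _ = strconv.ParseUint(fields[i], 10, 64)
			case "RetransSegs":
				retransSegs, _ = strconv.ParseUint(fields[i], 10, 64)
			}
		}
		return outSegs, retransSegs, nil
	}
	return 0, 0, fmt.Errorf("Tcp counters not found")
}

func main() {
	out, retrans, err := tcpCounters()
	if err != nil {
		log.Fatal(err)
	}
	// Cumulative rate since boot; a collector would diff two samples instead.
	fmt.Printf("retransmission rate: %.3f%%\n", float64(retrans)/float64(out)*100)
}
```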

Kubernetes Observability Panorama

Based on the product capabilities above and Alibaba's rich, in-depth practice with containers and Kubernetes, we have distilled these valuable production experiences into product features to help users locate production problems more effectively and quickly. The troubleshooting panorama can be used as follows:

• The overall structure is organized around Services and Deployments (applications), which is the layer most developers need to focus on: whether services and applications have errors or are slow, whether services are reachable, and whether the number of replicas meets expectations.

• One layer down are the Pods that carry the real workloads. For Pods, the focus is whether there are erroneous or slow requests, whether they are healthy, whether resources are sufficient, and whether downstream dependencies are healthy.

• The bottom layer is the node, which provides the runtime environment and resources for Pods and services. The focus is whether the node is healthy, whether it is schedulable, and whether resources are sufficient.

Network problems

Networking is the thorniest and most common problem area in Kubernetes, and network problems in production are hard to locate for several reasons:

• Kubernetes' network architecture is highly complex: nodes, Pods, containers, services, and VPCs interweave, and it is easy to get lost;
• Troubleshooting network problems requires a certain amount of expertise, and most people have an instinctive fear of them;
• The eight fallacies of distributed computing tell us that the network is not reliable, the network topology does not stay static, and latency cannot be ignored, all of which make the end-to-end network topology uncertain.
Common network problems in Kubernetes environments include:
• The conntrack table filling up;
• IP conflicts;
• CoreDNS resolving slowly or failing;
• A node unable to reach the external network (yes, you read that right);
• Blocked service access;
• Configuration problems (load balancer, routing, device, or NIC configuration);
• A network outage making an entire service unavailable.
There are thousands of network problems, but what never changes is that the network has its own "golden signals" that indicate whether it is working normally:
• Network traffic and bandwidth;
• Packet loss (count and rate) and retransmissions (count and rate);
• RTT.

The following example shows a slow call caused by a network problem. From the gateway's point of view there is a slow call; looking at the topology, the downstream product service's RT is relatively high, yet the product service's own golden signals show that the service itself is fine. Checking the network status between the two, we find that both RTT and retransmissions are relatively high, indicating that network performance has deteriorated and overall transmission has slowed down. TCP's retransmission mechanism masks this, so it cannot be perceived at the application level and nothing shows up in the logs. Here, the network golden signals help delineate the problem and speed up troubleshooting.

Node problems

Kubernetes does a lot of work to ensure, as far as possible, that the nodes provided to workloads and services are healthy. The node controller checks node status around the clock; when it finds problems that affect normal operation, it marks the node NotReady or unschedulable and evicts business Pods from the problem node via the kubelet. This is Kubernetes' first line of defense. The second line of defense consists of the node self-healing components that cloud vendors design for high-frequency node failure scenarios; for example, Alibaba Cloud's node repairer drains and replaces a faulty node after detecting it, keeping the business running automatically. Even so, strange problems inevitably arise over a node's long lifetime and are time-consuming and labor-intensive to locate. Common problem categories and severity levels:

For these complicated problems, we have summarized the following troubleshooting flowchart:

Take a saturated CPU as an example: 1. The node status is OK, but CPU usage exceeds 90%.

2. Look at the corresponding CPU triplet: usage, Top N, and the time-series chart. First, the usage of every core is high, which drives up the overall CPU usage; next, we naturally want to know who is hogging the CPU, and the Top N list shows a single Pod dominating it; finally, we confirm when the CPU spike started. A sketch of the raw data behind this triplet follows.
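
For reference, a minimal sketch of where per-core usage figures come from, reading two samples of /proc/stat; a real collector samples continuously and attributes usage to Pods:

```go
// Compute per-core CPU usage from two samples of /proc/stat.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPU returns, per core, the total and idle jiffies from /proc/stat.
func readCPU() (total, idle map[string]uint64) {
	total, idle = map[string]uint64{}, map[string]uint64{}
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		// Per-core lines look like: cpu0 user nice system idle iowait irq softirq ...
		if len(fields) < 5 || !strings.HasPrefix(fields[0], "cpu") || fields[0] == "cpu" {
			continue
		}
		var sum uint64
		for _, v := range fields[1:] {
			n, _ := strconv.ParseUint(v, 10, 64)
			sum += n
		}
		idleVal, _ := strconv.ParseUint(fields[4], 10, 64) // 4th value is idle
		total[fields[0]], idle[fields[0]] = sum, idleVal
	}
	return
}

func main() {
	t1, i1 := readCPU()
	time.Sleep(time.Second)
	t2, i2 := readCPU()
	for core := range t1 {
		dt := float64(t2[core] - t1[core])
		di := float64(i2[core] - i1[core])
		if dt > 0 {
			fmt.Printf("%s usage: %.1f%%\n", core, (1-di/dt)*100)
		}
	}
}
```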

Slow service response

There are many scenarios in which a service responds slowly; possible causes include poor code design, network problems, resource contention, and slow dependent services. In a complex Kubernetes environment, locating a slow call comes down to: first, whether the application itself is slow; second, whether the downstream or the network is slow; and finally, checking resource usage. As shown in the figure below, Kubernetes Monitoring analyzes service performance from both horizontal and vertical perspectives:

• Horizontal: mainly the end-to-end perspective. First check whether your own service's golden signals have a problem, then progressively check the downstream network metrics. Note that if calling the downstream takes a long time from the client's point of view while the downstream's own golden signals look normal, the cause is very likely the network or the operating system level; in that case, network performance metrics (traffic, packet loss, retransmissions, RTT, etc.) can be used to make the determination.

• Vertical: once it is established that the application's own external latency is high, the next step is to work out the specific cause and determine which step or method is slow, for which a flame graph can be used. If the code is fine, the problem may lie in the environment the code runs in; in that case, check whether system resources such as CPU and memory have problems and investigate further.

The following is an example of a slow SQL query (as shown below). The gateway calls the product service, which depends on a MySQL service. Checking the golden signals hop by hop along the path, we finally find that the product service executes a very complex SQL statement joining multiple tables, causing the MySQL service to respond slowly. Since the MySQL protocol runs on top of TCP, our eBPF probe recognizes the MySQL protocol, reassembles and restores its content, and can collect SQL statements executed in any language, as sketched below.
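
A minimal sketch of the protocol-parsing idea for the MySQL case: a COM_QUERY packet carries a 3-byte little-endian payload length, a 1-byte sequence id, a 0x03 command byte, and then the SQL text. The TCP stream reassembly that the probe performs first is omitted here:

```go
// Extract the SQL text from a captured MySQL COM_QUERY packet.
package main

import (
	"errors"
	"fmt"
)

const comQuery = 0x03 // MySQL COM_QUERY command byte

// extractSQL returns the SQL statement carried by a COM_QUERY packet.
func extractSQL(pkt []byte) (string, error) {
	if len(pkt) < 5 {
		return "", errors.New("packet too short")
	}
	payloadLen := int(pkt[0]) | int(pkt[1])<<8 | int(pkt[2])<<16
	if payloadLen < 1 || len(pkt) < 4+payloadLen {
		return "", errors.New("truncated packet")
	}
	payload := pkt[4 : 4+payloadLen]
	if payload[0] != comQuery {
		return "", errors.New("not a COM_QUERY packet")
	}
	return string(payload[1:]), nil
}

func main() {
	// Hypothetical captured bytes for the query "SELECT 1".
	query := "SELECT 1"
	pkt := append([]byte{byte(len(query) + 1), 0, 0, 0, comQuery}, []byte(query)...)

	if sql, err := extractSQL(pkt); err == nil {
		fmt.Println("captured SQL:", sql)
	}
}
```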

The second example is the application itself being slow. The natural question is which step and which function caused the slowness. The flame graph supported by ARMS application monitoring helps locate the code-level issue quickly by periodically sampling CPU time (as shown in the figure below).
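
As a rough illustration of CPU-time sampling (not how the ARMS agent itself is implemented), a Go service can capture a profile with the standard runtime/pprof package and render it as a flame graph:

```go
// Capture a CPU profile by periodic sampling while a workload runs.
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func busyWork() {
	sum := 0
	for i := 0; i < 1e8; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample the CPU periodically while the workload runs.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	for start := time.Now(); time.Since(start) < 2*time.Second; {
		busyWork()
	}
	// Render as a flame graph with: go tool pprof -http=:8080 cpu.pprof
}
```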

Application/Pod status issues

Pods are responsible for managing containers, and containers are what actually execute the business logic. Since the Pod is also the smallest scheduling unit in Kubernetes, it carries the complexity of both the business and the infrastructure, and needs to be examined together with logs, traces, system metrics, and downstream service metrics. Pod traffic problems are high-frequency issues in production; for example, when database traffic surges and there are thousands of Pods in the environment, it is particularly difficult to find out which Pod the traffic mainly comes from.

Next, a typical case: during a release, the downstream service gray-releases a Pod that responds very slowly due to a code issue, causing upstream timeouts. We can achieve Pod-level observability here because we use eBPF to collect each Pod's traffic and golden signals, which makes it easy to view traffic between Pods, between Pods and services, and between Pods and the outside world.

Summary

By using eBPF to non-intrusively collect golden signals, network metrics, and traces across multiple languages and network protocols, correlating them with Kubernetes objects, applications, cloud services, and other context, and providing professional tools such as flame graphs when deeper drill-down is needed, we realize a one-stop observability platform for the Kubernetes environment.

If you run into any of the following problems while building cloud-native monitoring, do not hesitate to contact us to discuss:
• Unfamiliar with Kubernetes and in need of a complete, unified monitoring solution;
• Data fragmented across multiple systems such as Prometheus, Alertmanager, and Grafana, which are hard to get started with;
• The cost of instrumenting applications and infrastructure in a container environment is too high, and you are looking for a low-cost or non-intrusive solution.


Follow the [Alibaba Cloud Native] official account for more real-time cloud-native news!

