In-depth analysis | A one-stop Kubernetes observability system based on eBPF - Alibaba Cloud Native

Author: Li Huangdong, Yan Xun

Summary

Alibaba Cloud has launched a one-stop observability system for Kubernetes. It aims to solve the operational difficulty caused by highly complex architectures and the coexistence of multiple languages and protocols in Kubernetes environments. Using eBPF, it collects application golden indicators non-intrusively and builds a global topology, which greatly reduces the difficulty of operating and maintaining Kubernetes for public cloud users.

Foreword

Background and problems

At present, cloud native technology is built mainly on containers and on the standardized ecosystem around Kubernetes. Kubernetes provides infrastructure through standard, extensible interfaces for scheduling, networking, storage, and container runtimes, while controllers provide operation and maintenance capabilities. This two-level standardization promotes a finer division of labor, increases scale and specialization in each field, and optimizes cost, efficiency, and stability. In this context, a large number of companies use cloud native technology to develop, operate, and maintain their applications. Because cloud native technology opens up more possibilities, today's business applications typically consist of many microservices, are written in multiple languages, and use multiple communication protocols. At the same time, cloud native technology itself pushes complexity down the stack, which brings more challenges for observability:

1. Chaotic microservice architecture

Due to the division of labor, the business architecture is prone to a large number of services and complex service relationships (as shown in Figure 1).

Figure 1 Chaotic microservice architecture (see the image source at the end of the article)

This causes a series of problems:

Unable to answer what the currently running architecture looks like;
Unable to determine whether the downstream services that a specific service depends on are healthy;
Unable to determine whether the traffic from the upstream services of a specific service is normal;
Unable to answer whether the application's DNS resolution is working normally;
Unable to answer whether connectivity between applications is working;
...

2. Multi-language applications

In the business architecture, different applications are written in different languages (as shown in Figure 2). Traditional observability approaches require a different instrumentation method for each language.

Figure 2 Multilingual (see the image source at the end of the article)

This will also cause a series of problems:

Different languages require different instrumentation methods, and some languages have no ready-made instrumentation at all;
The impact of instrumentation on application performance cannot be easily evaluated;

3. Multiple communication protocols

In the business architecture, different services also communicate over different protocols (as shown in Figure 3). The traditional approach to observability is to instrument the specific communication interface at the application layer.

Figure 3 Multiple communication protocols

This will also cause a series of problems:

Different communication protocols and different clients require different instrumentation methods, and some communication protocols have no ready-made instrumentation at all;
The impact of instrumentation on application performance cannot be easily evaluated;

4. End-to-end complexity introduced by Kubernetes

Complexity is eternal: we can only find ways to manage it, not eliminate it. Although cloud native technology reduces the complexity of business applications, across the whole software stack it only moves that complexity down to the container virtualization layer rather than eliminating it (Figure 4).

Figure 4 End-to-end software stack

This will also cause a series of problems:

The expected number of replicas of the Deployment is inconsistent with the actual number of running replicas;
Service has no backend and cannot handle traffic;
Pods cannot be created or scheduled;
The Pod cannot reach the Ready state;
Node is in Unknown state;
...
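
The symptoms listed above can be checked mechanically against a snapshot of cluster state. The following is a minimal sketch and not the product's implementation. It assumes the snapshot is a plain dictionary whose items mirror the shape of the corresponding Kubernetes API objects (`spec.replicas`, `status.readyReplicas`, `status.phase`, and the `Ready` condition); the function name `find_symptoms` and the snapshot layout are hypothetical.

```python
def find_symptoms(snapshot):
    """Return human-readable findings for a cluster-state snapshot.

    `snapshot` is a hypothetical dict with the keys "deployments",
    "endpoints", "pods" and "nodes"; each item mirrors the shape of the
    corresponding Kubernetes API object.
    """
    findings = []

    for d in snapshot.get("deployments", []):
        desired = d["spec"].get("replicas", 1)
        ready = d.get("status", {}).get("readyReplicas", 0)
        if ready != desired:
            findings.append(
                f"Deployment {d['metadata']['name']}: {ready}/{desired} replicas ready")

    for ep in snapshot.get("endpoints", []):
        # A Service whose Endpoints object lists no addresses has no backend.
        if not any(s.get("addresses") for s in ep.get("subsets", [])):
            findings.append(f"Service {ep['metadata']['name']}: no backend")

    for p in snapshot.get("pods", []):
        status = p.get("status", {})
        conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
        if status.get("phase") == "Pending":
            findings.append(f"Pod {p['metadata']['name']}: still Pending")
        elif conditions.get("Ready") != "True":
            findings.append(f"Pod {p['metadata']['name']}: not Ready")

    for n in snapshot.get("nodes", []):
        conditions = {c["type"]: c["status"]
                      for c in n.get("status", {}).get("conditions", [])}
        if conditions.get("Ready") == "Unknown":
            findings.append(f"Node {n['metadata']['name']}: Ready condition is Unknown")

    return findings
```

Each finding corresponds to one item in the list above; a real collector would read these objects from the Kubernetes API rather than from a hand-built dictionary.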

Approach and technical solution

To solve the above problems, we need a technology that supports multiple languages and multiple communication protocols, and a product that covers the end-to-end observability needs of the software stack as far as possible. The result is an observability solution that is grounded in the container and the underlying operating system and connects upward to application performance observation (see Figure 5).

Data collection

Figure 5 Solution to end-to-end observability

We take the container as the core and collect the associated Kubernetes observability data. From there we collect system and network observability data for the container's processes downward, and performance data for the container's applications upward, which together give end-to-end coverage of observability data.

Data transmission

Our data types include metrics, logs, and traces. The OpenTelemetry Collector (as shown in Figure 6) is used to provide unified data transmission.

Figure 6 OpenTelemetry Collector (see the image source at the end of the article)

Data storage

Backed by the existing ARMS infrastructure, metrics are stored in ARMS Prometheus, and logs and traces are stored in XTRACE.

Core product features

The core scenarios are architecture awareness, error and slow request analysis, resource consumption analysis, DNS resolution performance analysis, external performance analysis, service connectivity analysis, and network traffic analysis. These scenarios are possible because the product design follows the principle of going from the whole to the individual: start from the global view, find the abnormal individual (for example, a service), locate that service, and then view its golden indicators, related information, traces, and so on for further correlation analysis.

Figure 7 Core business scenario

Timeless golden indicators

What are the golden indicators? They are the minimal set of measurements for observing system performance and state: latency, traffic, errors, and saturation. The following is quoted from the SRE bible, Site Reliability Engineering:

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Why are the golden indicators important? First, they directly and clearly express whether the system is serving the outside world normally. Second, they make it possible to assess the impact on users or the severity of a situation, which saves SRE and R&D a great deal of time. Imagine taking CPU usage as a golden indicator instead: SRE and R&D would be run ragged, because high CPU usage often has little user-visible impact, especially in a smoothly running Kubernetes environment. Kubernetes observability therefore supports these golden indicators:

Requests/QPS
Response time and quantiles (P50, P90, P95, P99)
Number of errors
Number of slow calls
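
The indicators in this list can all be derived from a window of per-request records. Below is a minimal sketch of that arithmetic, not the product's implementation. It assumes each record is a `(latency_ms, is_error)` tuple, uses the nearest-rank method for quantiles, and treats calls slower than a hypothetical default of 500 ms as slow calls.

```python
import math


def golden_indicators(requests, window_seconds, slow_threshold_ms=500.0):
    """Summarize one time window of request records.

    Each record is a hypothetical (latency_ms, is_error) tuple. The 500 ms
    slow-call threshold is an assumption of this sketch.
    """
    latencies = sorted(latency for latency, _ in requests)

    def quantile(q):
        # Nearest-rank quantile: the smallest sample such that at least
        # a fraction q of all samples are less than or equal to it.
        if not latencies:
            return None
        rank = max(1, math.ceil(q * len(latencies)))
        return latencies[rank - 1]

    return {
        "requests": len(requests),
        "qps": len(requests) / window_seconds,
        "p50": quantile(0.50),
        "p90": quantile(0.90),
        "p95": quantile(0.95),
        "p99": quantile(0.99),
        "errors": sum(1 for _, is_error in requests if is_error),
        "slow_calls": sum(1 for latency, _ in requests if latency > slow_threshold_ms),
    }
```

For example, five requests with latencies of 10, 20, 30, 40, and 1000 ms in a 5-second window give a QPS of 1.0, a P50 of 30 ms, and one slow call.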

Figure 8 Golden indicators

Mainly support the following scenarios:

1. Performance analysis
2. Slow call analysis

Application topology from a global perspective

Those who do not plan for the whole situation are not fit to plan for a single domain. -- Zhuge Liang

As technical and deployment architectures grow more complex, it becomes harder and harder to locate a problem after it occurs, which in turn drives MTTR higher and higher. Impact analysis also becomes very challenging: attending to one thing usually means losing sight of another. A big picture, like a map, is therefore essential. The global topology has the following characteristics:

System architecture awareness: The system architecture diagram is an important reference for programmers getting to know a new system. When we take over a system, we need to know at least where the traffic entrance is, what the core modules are, which internal and external components it depends on, and so on. During anomaly localization, a map of the global architecture speeds up the process considerably. Below is a topology example of a simple e-commerce application, where the entire architecture can be seen at a glance:

Figure 9 Architecture Awareness

Dependency analysis: Some problems originate in downstream dependencies. If the dependency is not maintained by your own team, things get harder, and if neither your own system nor the downstream system has enough observability, they get harder still, because it is difficult to explain the problem to the dependency's maintainer. In our topology, upstream and downstream services are connected by their call relationships to form a call graph. Edges visualize the dependencies, and each edge shows the golden indicators of the corresponding calls. With those, you can quickly determine whether a downstream dependency has a problem. The following figure shows an example of tracing a high overall application RT to a slow call made by an underlying service: from the ingress gateway, to the internal service, to the MySQL service, and finally to the statement where the slow SQL occurs:

Figure 10 Dependency analysis
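
The idea of a call graph whose edges carry golden indicators can be sketched in a few lines. This is an illustration under stated assumptions, not the product's data model: call records are hypothetical `(caller, callee, latency_ms, is_error)` tuples, and `slowest_path` is a simple greedy walk that follows the slowest outgoing edge at each hop.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Aggregate (caller, callee, latency_ms, is_error) records into edges."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0, "total_ms": 0.0})
    for caller, callee, latency_ms, is_error in calls:
        edge = edges[(caller, callee)]
        edge["requests"] += 1
        edge["errors"] += int(is_error)
        edge["total_ms"] += latency_ms
    return {
        key: {
            "requests": e["requests"],
            "errors": e["errors"],
            "avg_ms": e["total_ms"] / e["requests"],
        }
        for key, e in edges.items()
    }


def slowest_path(graph, entry):
    """Follow the slowest outgoing edge from `entry` until a leaf is reached."""
    path, node, seen = [entry], entry, {entry}
    while True:
        outgoing = [
            (callee, edge["avg_ms"])
            for (caller, callee), edge in graph.items()
            if caller == node and callee not in seen  # `seen` guards against cycles
        ]
        if not outgoing:
            return path
        node = max(outgoing, key=lambda item: item[1])[0]
        path.append(node)
        seen.add(node)
```

With calls from a gateway to a slow cart service that in turn calls a slow MySQL instance, `slowest_path(graph, "gateway")` walks gateway, cart, mysql, which mirrors the drill-down shown in Figure 10.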

High availability analysis: The topology map makes the interactions between systems easy to see, and therefore which systems sit on the core path or are heavily depended upon. CoreDNS is an example: almost all components resolve DNS through it. We can therefore look further for possible bottlenecks, and judge whether the application is healthy and whether capacity is sufficient by checking the golden indicators of CoreDNS.

Figure 11 High Availability Analysis

Non-intrusive: Unlike Ant's linkd and the Group's eagleeye, our solution is completely non-intrusive. Sometimes we lack observability in some area not because it cannot be done, but because the application would need code changes. For an SRE, wanting better observability is a good motive, but it is clearly unreasonable to make application owners across the whole group change their code along with you. This is where non-intrusiveness shows its power: the application needs neither code changes nor a restart, so the cost of adoption is very low.

Protocol Trace facilitates root cause location

Protocol Trace differs from distributed tracing in that it traces only a single call. Protocol Trace is also non-intrusive and language-agnostic. If the request content contains a distributed tracing TraceID, it is identified automatically, which makes it easy to drill down into the distributed trace. The request and response information of the application-layer protocol helps in analyzing the request content and return code, so that you know which interface has a problem.

Figure 12 Protocol Details
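
Recognizing a TraceID in request content can be illustrated with a small parser. The article does not say which trace formats the product recognizes, so this sketch makes its own assumptions: it looks only at HTTP/1.x request headers and only at two widely used propagation headers, the W3C Trace Context `traceparent` header and the Zipkin B3 `X-B3-TraceId` header. The function name `extract_trace_id` is hypothetical.

```python
def extract_trace_id(raw_request):
    """Return the trace ID found in an HTTP/1.x request's headers, or None.

    `raw_request` is the captured request as bytes. Only two header
    formats are checked; this choice is an assumption of the sketch.
    """
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    headers = {}
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    # W3C Trace Context: traceparent = version-traceid-parentid-flags,
    # where the trace ID is 32 hexadecimal characters.
    parts = headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32:
        return parts[1]

    # Zipkin B3 multi-header propagation.
    return headers.get("x-b3-traceid")
```

Once a trace ID is extracted, a single Protocol Trace record can be linked to the full distributed trace that carries the same ID.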

Out-of-the-box alerts

An observability system that does not support alerts is incomplete.

1. Default alert templates are provided out of the box, with thresholds based on industry best practices.

Figure 13 Alarm

2. Support multiple configuration methods for users

Static threshold, users only need to configure the threshold, no need to manually write PromQL
Dynamic threshold based on sensitivity adjustment, suitable for scenarios where it is difficult to determine the threshold
Compatible with PromQL, requires a certain learning cost, suitable for advanced users
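
The difference between a static threshold and a sensitivity-based dynamic threshold can be sketched as follows. The article does not describe the product's dynamic threshold algorithm, so this example substitutes a simple band around the historical mean, measured in standard deviations; the mapping from `sensitivity` to band width is an assumption of the sketch.

```python
from statistics import mean, pstdev


def static_alert(value, threshold):
    """Static threshold: alert when the value exceeds a fixed limit."""
    return value > threshold


def dynamic_alert(history, value, sensitivity):
    """Alert when `value` leaves a band around the historical mean.

    `sensitivity` is expected in (0, 1]. A sensitivity of 1.0 gives a band
    of one standard deviation; lower values widen the band towards five
    standard deviations. This mapping is an assumption of the sketch.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    band = pstdev(history) * (1.0 + 4.0 * (1.0 - sensitivity))
    return abs(value - mean(history)) > band
```

A static threshold needs the user to know a sensible limit in advance, while the dynamic variant only needs a history of the metric, which is why it suits cases where the threshold is hard to determine.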

Rich context

The CEO of Datadog stated plainly in an interview that Datadog's product strategy is not to support as many functions as possible, but to think about how to build bridges between different teams and members and to put information on the same page as far as possible (to bridge the gap between the teams and get everything on the same page). In our product design, we associate key contextual information so that engineers with different backgrounds can understand it, which speeds up troubleshooting.

At present, the associated context includes alert information, golden indicators, logs, Kubernetes metadata, and so on, and more valuable information is being added continuously. Take alert information as an example: alerts are automatically associated with the corresponding service or application node, so you can see clearly which applications are abnormal. Clicking the application or the alert expands the application details, alert details, and the application's golden indicators, all on a single page:

Figure 14 Context association

Other

1. Network performance observability:

It is common for network performance problems to lengthen response times. Because TCP hides part of this complexity, the application layer is unaware of it, which makes scenarios such as high packet loss and high retransmission rates troublesome to diagnose. Kubernetes observability supports retransmission and packet loss metrics and TCP connection information to characterize network conditions. The following figure shows an example of high retransmission resulting in high RT:

Figure 15 Network performance observability
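
A retransmission rate is usually derived from two samples of cumulative counters. The sketch below shows that arithmetic only; the sample layout and the field names `segments_sent` and `segments_retransmitted` are hypothetical and do not describe how the product stores its data.

```python
def retransmit_rate(prev, curr):
    """Compute the retransmission rate between two cumulative counter samples.

    Each sample is a hypothetical dict with cumulative "segments_sent" and
    "segments_retransmitted" counts taken at two points in time.
    """
    sent = curr["segments_sent"] - prev["segments_sent"]
    retransmitted = curr["segments_retransmitted"] - prev["segments_retransmitted"]
    # Avoid division by zero when nothing was sent in the interval.
    rate = retransmitted / sent if sent > 0 else 0.0
    return {
        "segments_sent": sent,
        "segments_retransmitted": retransmitted,
        "retransmit_rate": rate,
    }
```

If 2000 segments were sent during the interval and 100 of them were retransmissions, the rate is 5%, which can then be plotted next to RT to reveal correlations like the one in Figure 15.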

eBPF superpowers revealed

Figure 16 Data processing flow

eBPF is equivalent to building an execution engine in the kernel. An eBPF program is attached to a kernel event through a kernel call so that it can monitor that event. From the event we can infer the protocol, filter for the protocols of interest, process the event further, and put it into a ring buffer or an eBPF map for a user-space process to read. After reading the data, the user-space process associates it with Kubernetes metadata and pushes it to the storage side. This is the overall process.
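
The flow described above can be simulated in plain Python to make the stages concrete. This is not eBPF code and not the product's collector: `kernel_side` stands in for the in-kernel program, a `deque` stands in for the ring buffer, the protocol inference is deliberately crude, and the `pid_to_pod` lookup table is a hypothetical substitute for real Kubernetes metadata.

```python
from collections import deque

HTTP_METHODS = (b"GET", b"POST", b"PUT", b"DELETE", b"HEAD")


def infer_protocol(payload):
    """Very rough protocol inference from the first bytes of a payload."""
    if payload.split(b" ", 1)[0] in HTTP_METHODS or payload.startswith(b"HTTP/1."):
        return "http"
    return "unknown"


def kernel_side(events, ring_buffer, wanted=("http",)):
    """Stand-in for the eBPF program: infer, filter, and enqueue events."""
    for pid, payload in events:
        protocol = infer_protocol(payload)
        if protocol in wanted:  # drop protocols we are not interested in
            ring_buffer.append({"pid": pid, "protocol": protocol, "bytes": len(payload)})


def user_side(ring_buffer, pid_to_pod):
    """Stand-in for the user-space agent: drain the buffer, attach metadata."""
    records = []
    while ring_buffer:
        event = ring_buffer.popleft()
        event["pod"] = pid_to_pod.get(event["pid"], "unknown")
        records.append(event)  # a real agent would push this to storage
    return records
```

Filtering before enqueueing is the important design point: only events for protocols of interest cross from the kernel side to the user side.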

The superpower of eBPF lies in its ability to subscribe to all kinds of kernel events, such as file reads and writes and network traffic. Everything a container or Pod running in Kubernetes does goes through kernel system calls, and the kernel knows about every process on the machine. The kernel is therefore close to the ideal vantage point for observability, which is why we chose eBPF. Another advantage of monitoring in the kernel is that applications do not need to be changed and the kernel does not need to be recompiled, so the approach is truly non-intrusive. When a cluster runs dozens or hundreds of applications, a non-intrusive solution helps a great deal.

eBPF is a new technology, so it is natural for people to have concerns about it. Here are brief answers:

1. How secure is eBPF? eBPF code is subject to many restrictions, such as a current maximum stack size of 512 bytes and a maximum of 1 million instructions. The purpose of these restrictions is to protect the kernel at runtime.

2. What is the performance overhead of the eBPF probe? Around 1%. The high performance of eBPF comes mainly from processing data in the kernel, which reduces copying between kernel mode and user mode. Simply put, a value such as a Gauge is computed in the kernel and then handed to the user process, whereas in the past the raw data was copied to the user process and computed there.
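
The difference between copying raw data and aggregating in the kernel can be illustrated with a small simulation. This is plain Python, not eBPF: the `KernelMap` class is a hypothetical stand-in for an eBPF map, and the first element of each return value simply counts how many items cross the kernel/user boundary.

```python
def ship_raw(latencies_ms):
    """Older approach: copy every sample to user space, then compute there."""
    copied = list(latencies_ms)  # one item crosses the boundary per event
    summary = {
        "count": len(copied),
        "sum_ms": sum(copied),
        "max_ms": max(copied, default=0),
    }
    return len(copied), summary


class KernelMap:
    """Hypothetical stand-in for an eBPF map holding a running aggregate."""

    def __init__(self):
        self.value = {"count": 0, "sum_ms": 0, "max_ms": 0}

    def update(self, latency_ms):
        self.value["count"] += 1
        self.value["sum_ms"] += latency_ms
        self.value["max_ms"] = max(self.value["max_ms"], latency_ms)


def ship_aggregated(latencies_ms):
    """In-kernel aggregation: update the map per event, read it once."""
    kernel_map = KernelMap()
    for latency_ms in latencies_ms:
        kernel_map.update(latency_ms)
    return 1, dict(kernel_map.value)  # a single read of the aggregate
```

Both functions produce the same summary, but the aggregated variant hands over one value regardless of how many events occurred.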

Conclusion

Product value

Alibaba Cloud Kubernetes Observability is a one-stop observability product developed for Kubernetes clusters. Based on the metrics, application traces, logs, and events of a Kubernetes cluster, it aims to provide an overall observability solution for IT developers and operators.

Alibaba Cloud Kubernetes Observability has the following features:

Non-intrusive to code: Through bypass technology, rich network performance data is obtained without instrumenting code.
Language-independent: Network protocols are parsed at the kernel level, so any language and any framework is supported.
High performance: Based on eBPF, rich network performance data is obtained with very low overhead.
Strong association: Entity associations are described along multiple dimensions through network topology, resource topology, and resource relationships, and associations between different types of data (metrics, traces, logs, and events) are also supported.
End-to-end data coverage: Observability data covers the end-to-end software stack.
Closed-loop scenarios: The console's scenario design ties together architecture-aware topology, application observability, Prometheus observability, cloud dial testing, health inspection, the event center, Log Service, and cloud services, forming a complete closed loop of application understanding, anomaly discovery, and anomaly localization.


Image Source:

Figure 1:
https://www.infoq.com/presentations/netflix-chaos-microservices/

Figure 2:
https://www.lackuna.com/2013/01/02/4-programming-languages-to-ace-your-job-interviews/

Figure 6:
https://opentelemetry.io/docs/collector/

You are welcome to search for the DingTalk group number (31588365) and join the Q&A group for discussion.
