Author: Yan Xun
Over the past year, ARMS has built Kubernetes monitoring on top of eBPF, providing non-intrusive observation of application, system, and network performance across multiple languages, and has released a Kubernetes troubleshooting panorama. Together these have validated the effectiveness of eBPF. The eBPF technology and its ecosystem are developing rapidly, and the outlook is promising. Written from a practitioner's perspective, this article introduces eBPF itself by answering 7 core questions, in order to lift the veil on eBPF for everyone.
What is eBPF?
eBPF is a technology that runs sandboxed programs in the kernel. It provides a mechanism for safely injecting code when kernel events and user-program events occur, so that developers who do not work on the kernel can also control kernel behavior. As the kernel has evolved, eBPF has gradually expanded from its original use in packet filtering to networking, kernel, security, tracing, and other areas, and its feature set is still growing rapidly. The early BPF is called classic BPF, or cBPF for short; it is exactly this expansion in functionality that led the current BPF to be called Extended BPF, or eBPF for short.
What are the application scenarios of eBPF?
Network optimization
eBPF combines high performance with high extensibility, which makes it the preferred choice for packet processing in networking solutions:
- High performance
The JIT compiler delivers execution efficiency close to that of native kernel code.
- High extensibility
Protocol parsing and routing policies can be added quickly within the kernel context.
Troubleshooting
Through tracing mechanisms such as kprobe and tracepoints, eBPF can trace both the kernel and user programs. This end-to-end tracing capability makes it possible to diagnose faults quickly. eBPF also exposes profiling statistics more efficiently: unlike traditional systems, it does not need to export large volumes of sampled data, which makes continuous real-time profiling feasible.
Security control
eBPF can see all system calls, all network packets, and all socket operations. Combining process-context tracing, network-level filtering, and system-call filtering in one place enables better security control.
Performance monitoring
Traditional system monitoring components such as sar can only provide static counters and gauges. eBPF, in contrast, supports programmable, dynamic collection of custom metrics and events and their aggregation at the edge, which greatly improves the efficiency and flexibility of performance monitoring.
Why does eBPF appear?
eBPF emerged essentially to resolve the tension between the slow iteration speed of the kernel and rapidly changing system requirements. An analogy commonly used in the eBPF community is that eBPF is to the Linux kernel what JavaScript is to HTML: the highlight is programmability. Adding programmability usually brings new problems of its own. Kernel modules, for example, were also meant to solve this problem, but they provide no good safety boundary, so a kernel module can affect the stability of the kernel itself, and it has to be adapted to each kernel version. eBPF adopts the following strategies to make it a safe and efficient way to program the kernel:
- Safe
An eBPF program must pass the verifier before it can run, and it must not contain unreachable instructions. An eBPF program cannot call arbitrary kernel functions; it can only call the helper functions defined in the API. The stack of an eBPF program is limited to 512 bytes; anything larger must be stored in BPF maps.
- Efficient
Thanks to the just-in-time (JIT) compiler, and because eBPF instructions run inside the kernel, there is no need to copy data to user mode, which greatly improves the efficiency of event processing.
- Standard
BPF Helpers, BTF, and PERF MAP give developers standard interfaces and data models.
- Powerful
eBPF not only increased the number of registers and introduced BPF maps for storage; in the 4.x kernels it also gradually expanded from the original single event type, packet filtering, to kernel-mode functions, user-mode functions, tracepoints, performance events (perf_events), and security control.
How does eBPF work?
5 steps
1. Develop an eBPF program in C;
This is the sandboxed eBPF program that is invoked when an instrumentation point fires an event. It runs in kernel mode.
2. Compile the eBPF program into BPF bytecode with the help of LLVM;
The eBPF program is compiled into BPF bytecode, which is later verified and executed in the eBPF virtual machine.
3. Submit the BPF bytecode to the kernel through the bpf system call;
In user mode, the BPF bytecode is loaded into the kernel through the bpf system call.
4. The kernel verifies and runs the BPF bytecode, and saves the corresponding state to BPF maps;
The kernel verifies that the BPF bytecode is safe and ensures that the right eBPF program is invoked when the corresponding event occurs. Any state that needs to be kept, such as monitoring data, is written to the corresponding BPF map.
5. The user program queries the running state of the BPF bytecode through BPF maps.
User mode reads the contents of the BPF maps to obtain the running state of the bytecode, for example the monitoring data that was captured.
A complete eBPF program usually has two parts, one in user mode and one in kernel mode. The user-mode program interacts with the kernel through the bpf system call to load eBPF programs, attach them to events, and create and update maps. In kernel mode, an eBPF program cannot call kernel functions arbitrarily and has to do its work through BPF helper functions. In particular, to access memory addresses it must read the data through the bpf_probe_read family of functions, which keeps memory access safe and efficient. When an eBPF program needs a large block of storage, a BPF map of a type suited to the scenario has to be introduced, and the same map is used to expose run-time state to the user-space program.
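To make the "BPF bytecode" of steps 2 and 3 more concrete, the following minimal Python sketch hand-assembles the smallest useful program, "r0 = 0; exit", using the fixed 8-byte eBPF instruction layout (8-bit opcode, 4-bit destination and source registers, 16-bit offset, 32-bit immediate). The names encode_insn and RETURN_ZERO are illustrative and not part of any library, and a little-endian target is assumed. In practice LLVM emits this bytecode for you.

```python
import struct

# Opcode bytes from the eBPF instruction set:
#   0xb7 = BPF_ALU64 | BPF_MOV | BPF_K   (dst = imm)
#   0x95 = BPF_JMP | BPF_EXIT            (return r0)
OP_MOV64_IMM = 0xB7
OP_EXIT = 0x95


def encode_insn(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one eBPF instruction into its 8-byte little-endian form.

    Hypothetical helper for illustration only.
    """
    regs = (src << 4) | dst  # dst in the low nibble, src in the high nibble
    return struct.pack("<BBhi", opcode, regs, off, imm)


# r0 = 0; exit  -> a program that always returns 0
RETURN_ZERO = encode_insn(OP_MOV64_IMM, dst=0, imm=0) + encode_insn(OP_EXIT)

if __name__ == "__main__":
    print(RETURN_ZERO.hex())
```

These 16 bytes are the kind of payload that step 3 hands to the kernel; the verifier in step 4 then decides whether they may run.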
eBPF program classification and usage scenarios
bpftool feature probe | grep program_type
The command above lists the eBPF program types supported by the system. The output generally looks like this:
eBPF program_type socket_filter is available
eBPF program_type kprobe is available
eBPF program_type sched_cls is available
eBPF program_type sched_act is available
eBPF program_type tracepoint is available
eBPF program_type xdp is available
eBPF program_type perf_event is available
eBPF program_type cgroup_skb is available
eBPF program_type cgroup_sock is available
eBPF program_type lwt_in is available
eBPF program_type lwt_out is available
eBPF program_type lwt_xmit is available
eBPF program_type sock_ops is available
eBPF program_type sk_skb is available
eBPF program_type cgroup_device is available
eBPF program_type sk_msg is available
eBPF program_type raw_tracepoint is available
eBPF program_type cgroup_sock_addr is available
eBPF program_type lwt_seg6local is available
eBPF program_type lirc_mode2 is NOT available
eBPF program_type sk_reuseport is available
eBPF program_type flow_dissector is available
eBPF program_type cgroup_sysctl is available
eBPF program_type raw_tracepoint_writable is available
eBPF program_type cgroup_sockopt is available
eBPF program_type tracing is available
eBPF program_type struct_ops is available
eBPF program_type ext is available
eBPF program_type lsm is available
For details, please refer to:
https://elixir.bootlin.com/linux/v5.13/source/include/linux/bpf_types.h
They fall into 3 main usage scenarios:
- Tracing
tracepoint, kprobe, perf_event, and so on are mainly used to extract trace information from the system, which in turn supports monitoring, troubleshooting, and performance optimization.
- Networking
xdp, sock_ops, cgroup_sock_addr, sk_msg, and so on are mainly used to filter and process network packets, enabling network observation, filtering, traffic control, and performance optimization. Packets can be dropped or redirected at these hooks.
Cilium uses essentially all of these hook points.
- Security and others
lsm is used for security. Others, such as flow_dissector and lwt_in, are not commonly used and are not covered here.
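As a rough illustration of this classification, the sketch below parses text in the format printed by the bpftool command above and groups the available program types into the three scenarios. The grouping table only contains the types named in this article, and the function name classify and the SAMPLE text are illustrative assumptions.

```python
import re

# Grouping taken from the three scenarios described in this article;
# program types not named there fall into "other".
SCENARIOS = {
    "tracing": {"tracepoint", "kprobe", "perf_event"},
    "networking": {"xdp", "sock_ops", "cgroup_sock_addr", "sk_msg"},
    "security": {"lsm"},
}

LINE_RE = re.compile(r"^eBPF program_type (\S+) is (NOT )?available$")


def classify(probe_output):
    """Group the available program types found in the probe output."""
    groups = {name: [] for name in SCENARIOS}
    groups["other"] = []
    for line in probe_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(2):  # unrelated or unavailable line
            continue
        prog_type = match.group(1)
        for name, members in SCENARIOS.items():
            if prog_type in members:
                groups[name].append(prog_type)
                break
        else:
            groups["other"].append(prog_type)
    return groups


# Shortened sample in the same format as the output listed above.
SAMPLE = """\
eBPF program_type kprobe is available
eBPF program_type xdp is available
eBPF program_type lirc_mode2 is NOT available
eBPF program_type lsm is available
eBPF program_type flow_dissector is available
"""

if __name__ == "__main__":
    for scenario, types in classify(SAMPLE).items():
        print(scenario, types)
```

On a real system, the captured output of the bpftool command could be passed to classify in place of SAMPLE.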
What are the best practices for eBPF?
Finding kernel instrumentation points
As the previous sections show, writing the eBPF program itself is not difficult; the hard part is finding a suitable event source to trigger it. For monitoring and diagnostics, tracing eBPF programs have three kinds of event sources: kernel functions (kprobe), kernel tracepoints (tracepoint), and performance events (perf_event). Two questions have to be answered at this point:
1. Which kernel functions, kernel tracepoints, and performance events does the kernel have?
- Use the debug information to list kernel functions and kernel tracepoints
sudo ls /sys/kernel/debug/tracing/events
- Use bpftrace to list kernel functions and kernel tracepoints
# List all kernel probes and tracepoints
sudo bpftrace -l
# Use a wildcard to list all system call tracepoints
sudo bpftrace -l 'tracepoint:syscalls:*'
# Use a wildcard to list all tracepoints whose names contain "open"
sudo bpftrace -l '*open*'
- Get performance events using perf list
sudo perf list tracepoint
2. For kernel functions and kernel tracepoints, how do you look up the definitions of their data structures when you need to trace their arguments and return values?
- Use the debug information
sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format
- Use bpftrace
sudo bpftrace -lv tracepoint:syscalls:sys_enter_openat
For details on how to use the above information, please refer to bcc.
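The format file read above describes each tracepoint argument as a "field:...; offset:...; size:...;" line. A minimal sketch of turning that text into a field list follows. The sample is an illustrative excerpt: the ID and the exact fields vary between kernels, and parse_fields is a hypothetical helper name. Array fields such as "char comm[16]" keep the brackets in the name in this sketch.

```python
import re

# Illustrative excerpt of a tracepoint format file; the ID and the exact
# field list differ between kernel versions.
SAMPLE_FORMAT = """\
name: sys_enter_openat
ID: 614
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:int __syscall_nr;\toffset:8;\tsize:4;\tsigned:1;
\tfield:int dfd;\toffset:16;\tsize:8;\tsigned:0;
\tfield:const char * filename;\toffset:24;\tsize:8;\tsigned:0;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_fields(text):
    """Return (type, name, offset, size) for every field line."""
    fields = []
    for match in FIELD_RE.finditer(text):
        # The declaration is "<C type> <name>"; the name is the last token.
        ctype, _, name = match.group("decl").strip().rpartition(" ")
        offset = int(match.group("offset"))
        size = int(match.group("size"))
        fields.append((ctype, name, offset, size))
    return fields


if __name__ == "__main__":
    for ctype, name, offset, size in parse_fields(SAMPLE_FORMAT):
        print(f"{offset:>3} {size:>2} {ctype} {name}")
```

The resulting names (dfd, filename, and so on) are the arguments an eBPF program attached to this tracepoint would read.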
Find instrumentation points for your application
1. How do you find the tracepoints of a user process?
- For statically compiled languages, compiling with the -g option retains debugging information, so the application binary contains DWARF (Debugging With Attributed Record Formats) data. With this debugging information, tools such as readelf, objdump, and nm can list the symbols of the functions and variables that can be traced.
# Query the symbol table
readelf -Ws /usr/lib/x86_64-linux-gnu/libc.so.6
# Query USDT information
readelf -n /usr/lib/x86_64-linux-gnu/libc.so.6
- Use bpftrace
# List uprobes
bpftrace -l 'uprobe:/usr/lib/x86_64-linux-gnu/libc.so.6:*'
# List USDT probes
bpftrace -l 'usdt:/usr/lib/x86_64-linux-gnu/libc.so.6:*'
uprobe is file based. When a function in a file is traced, all processes that use the file are instrumented by default, unless you filter by process PID.
The above applies to statically compiled languages and is similar to kernel tracing. An application's symbol information can be stored in the ELF binary itself or in a separate debug file. The kernel's symbol information, besides being stored in the kernel binary, is also exposed to user space through /proc/kallsyms and /sys/kernel/debug.
For languages that are not statically compiled, there are two main cases:
1. Interpreted languages
Use a query method similar to the one for compiled applications to find uprobe and USDT tracepoints at the interpreter level. How to relate interpreter-level behavior to application behavior has to be worked out by experts in the language concerned.
2. Just-in-time compiled languages
Application source code in such a language is first compiled into bytecode and then compiled into machine code by a just-in-time (JIT) compiler for execution. Heavy optimization is applied along the way, which makes tracing very difficult. As with interpreted languages, uprobe and USDT tracing can only be applied to the JIT compiler itself, and the function information of the final application is obtained from the parameters of the compiler's tracepoints. Working out how the JIT's tracepoints relate to the execution of the application likewise requires experts in the language concerned.
You can refer to BCC's application tracing. User-process tracing essentially executes the uprobe handler through breakpoints. Although the kernel community has done a lot of performance tuning for BPF, tracing user-mode functions, especially high-frequency ones such as lock contention and memory allocation, can still add significant overhead. When using uprobe, therefore, try to avoid tracing high-frequency functions.
For details on how to use the above information, please refer to:
https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#events--arguments
Correlating Issues and Instrumentation Points
Ideally, every problem would map clearly to the instrumentation points that should be observed, but that requires engineers to understand the end-to-end software stack in thorough detail. A more practical approach is the 80/20 rule: capture the core 80% of the context along the main data flows of the software stack, and make sure that problems will surface within that context. Then use the kernel stack and user stack to inspect the specific call stack and find the core problem. For example, suppose you find that the network is dropping packets but do not know why. You do know that dropping a packet calls the kfree_skb kernel function, so you can run:
sudo bpftrace -e 'kprobe:kfree_skb /comm=="<your comm>"/ {printf("kstack: %s\n", kstack);}'
Find the call stack of the function:
kstack:
    kfree_skb+1
    udpv6_destroy_sock+66
    sk_common_release+34
    udp_lib_close+9
    inet_release+75
    inet6_release+49
    __sock_release+66
    sock_close+21
    __fput+159
    ____fput+14
    task_work_run+103
    exit_to_user_mode_loop+411
    exit_to_user_mode_prepare+187
    syscall_exit_to_user_mode+23
    do_syscall_64+110
    entry_SYSCALL_64_after_hwframe+68
You can then walk back through these functions to see on which line and under what conditions each one is called, and so locate the problem. Besides locating problems, this method can also deepen your understanding of kernel call paths, for example:
bpftrace -e 'tracepoint:net:* { printf("%s(%d): %s %s\n", comm, pid, probe, kstack()); }'
This shows all network-related tracepoints together with their call stacks.
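When backtracking a stack such as the kfree_skb one shown earlier, it can help to turn the raw text into a call chain. The sketch below is one possible way to do that. It assumes frames are whitespace-separated "symbol+offset" tokens, as in that output; parse_kstack and call_chain are hypothetical helper names, and SAMPLE is a shortened copy of the stack above.

```python
def parse_kstack(text):
    """Split kstack output into (symbol, offset) frames, innermost first."""
    frames = []
    for token in text.split():
        symbol, sep, offset = token.rpartition("+")
        if not sep or not offset.isdigit():
            continue  # skip labels such as "kstack:"
        frames.append((symbol, int(offset)))
    return frames


def call_chain(frames):
    """Render the frames outermost first, that is, in call order."""
    return " -> ".join(symbol for symbol, _ in reversed(frames))


# Shortened copy of the kfree_skb stack shown earlier.
SAMPLE = (
    "kstack: kfree_skb+1 udpv6_destroy_sock+66 "
    "sk_common_release+34 udp_lib_close+9"
)

if __name__ == "__main__":
    print(call_chain(parse_kstack(SAMPLE)))
```

Reading the chain in call order makes it easier to see which path led to kfree_skb before opening the kernel source for each function.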
What is the implementation principle of eBPF?
5 modules
eBPF is mainly composed of 5 modules in the kernel:
1. BPF Verifier
Ensures the safety of eBPF programs. The verifier builds a directed acyclic graph (DAG) from the instructions to be executed to make sure the program contains no unreachable instructions, and then simulates the execution of the instructions to make sure no invalid instruction will be executed. The verifier cannot guarantee 100% safety, however, so all BPF programs still need strict monitoring and review.
2. BPF JIT
Compiles eBPF bytecode into native machine instructions so that it executes more efficiently in the kernel.
3. A storage module consisting of multiple 64-bit registers, a program counter, and a 512-byte stack
Controls the execution of the eBPF program, holds stack data, and passes input and output parameters.
4. BPF Helpers (helper functions)
Provide a set of functions through which eBPF programs interact with other kernel modules. Not every eBPF program can call every helper: the set of available functions is determined by the BPF program type. Note that every modification of input and output parameters in eBPF must conform to the BPF specification. Apart from changes to local variables, all other changes must be made through BPF Helpers; if the helpers do not support a change, it cannot be made.
bpftool feature probe
The command above shows which BPF Helpers each type of eBPF program can use.
5. BPF Map & context
Provide large blocks of storage that user-space programs can access, which is how the running state of eBPF programs is controlled.
bpftool feature probe | grep map_type
The command above shows which map types the system supports.
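The control-flow part of the verifier described in module 1 can be sketched with a toy model: walk the instruction graph from instruction 0, reject back-edges so that the graph stays a DAG, and reject instructions that are never reached. This is only an illustration under simplified assumptions, not the kernel's algorithm. The real verifier also simulates register and stack state, and newer kernels accept bounded loops, which this sketch rejects. The function name check_control_flow and the graph encoding are hypothetical.

```python
def check_control_flow(successors):
    """Toy verifier pass over a control-flow graph.

    successors[i] lists the instruction indexes that may run right after
    instruction i; an empty list marks an exit instruction.
    Raises ValueError on an empty program, a back-edge (loop), an
    out-of-range jump, or an instruction unreachable from instruction 0.
    """
    if not successors:
        raise ValueError("empty program")

    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = [UNSEEN] * len(successors)

    def visit(index):
        state[index] = ON_PATH
        for nxt in successors[index]:
            if not 0 <= nxt < len(successors):
                raise ValueError(f"jump out of range at insn {index}")
            if state[nxt] == ON_PATH:
                raise ValueError(f"back-edge from insn {index} to {nxt}")
            if state[nxt] == UNSEEN:
                visit(nxt)
        state[index] = DONE

    visit(0)
    unreachable = [i for i, s in enumerate(state) if s == UNSEEN]
    if unreachable:
        raise ValueError(f"unreachable insn(s): {unreachable}")
    return True


if __name__ == "__main__":
    # insn 0 branches to 1 or 2, both continue to the exit at insn 3
    print(check_control_flow([[1, 2], [3], [3], []]))
```

A program such as [[2], [2], []] is rejected because instruction 1 can never run, which mirrors the "no unreachable instructions" rule described above.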
3 actions
Let's first look at the key system call, bpf:
int bpf(int cmd, union bpf_attr *attr, unsigned int size);
Here cmd is the key: attr holds the parameters of cmd, and size is the size of those parameters. So the important question is which cmd values exist:
// Kernel 5.11
enum bpf_cmd {
BPF_MAP_CREATE,
BPF_MAP_LOOKUP_ELEM,
BPF_MAP_UPDATE_ELEM,
BPF_MAP_DELETE_ELEM,
BPF_MAP_GET_NEXT_KEY,
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
BPF_PROG_ATTACH,
BPF_PROG_DETACH,
BPF_PROG_TEST_RUN,
BPF_PROG_GET_NEXT_ID,
BPF_MAP_GET_NEXT_ID,
BPF_PROG_GET_FD_BY_ID,
BPF_MAP_GET_FD_BY_ID,
BPF_OBJ_GET_INFO_BY_FD,
BPF_PROG_QUERY,
BPF_RAW_TRACEPOINT_OPEN,
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
BPF_MAP_LOOKUP_AND_DELETE_ELEM,
BPF_MAP_FREEZE,
BPF_BTF_GET_NEXT_ID,
BPF_MAP_LOOKUP_BATCH,
BPF_MAP_LOOKUP_AND_DELETE_BATCH,
BPF_MAP_UPDATE_BATCH,
BPF_MAP_DELETE_BATCH,
BPF_LINK_CREATE,
BPF_LINK_UPDATE,
BPF_LINK_GET_FD_BY_ID,
BPF_LINK_GET_NEXT_ID,
BPF_ENABLE_STATS,
BPF_ITER_CREATE,
BPF_LINK_DETACH,
BPF_PROG_BIND_MAP,
};
The core commands are the PROG- and MAP-related ones, that is, program loading and map handling.
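Because a C enum numbers its members from 0, the position of each name in the listing above is the integer passed as cmd. The short sketch below rebuilds that numbering and filters the PROG and MAP commands. BpfCmd and commands_with_prefix are illustrative names, and the numbers are only valid for the 5.11 listing reproduced here.

```python
from enum import IntEnum

# Names in the order of the kernel 5.11 enum listed above; the position
# of a name in this list is its command number.
_NAMES = """
BPF_MAP_CREATE BPF_MAP_LOOKUP_ELEM BPF_MAP_UPDATE_ELEM BPF_MAP_DELETE_ELEM
BPF_MAP_GET_NEXT_KEY BPF_PROG_LOAD BPF_OBJ_PIN BPF_OBJ_GET BPF_PROG_ATTACH
BPF_PROG_DETACH BPF_PROG_TEST_RUN BPF_PROG_GET_NEXT_ID BPF_MAP_GET_NEXT_ID
BPF_PROG_GET_FD_BY_ID BPF_MAP_GET_FD_BY_ID BPF_OBJ_GET_INFO_BY_FD
BPF_PROG_QUERY BPF_RAW_TRACEPOINT_OPEN BPF_BTF_LOAD BPF_BTF_GET_FD_BY_ID
BPF_TASK_FD_QUERY BPF_MAP_LOOKUP_AND_DELETE_ELEM BPF_MAP_FREEZE
BPF_BTF_GET_NEXT_ID BPF_MAP_LOOKUP_BATCH BPF_MAP_LOOKUP_AND_DELETE_BATCH
BPF_MAP_UPDATE_BATCH BPF_MAP_DELETE_BATCH BPF_LINK_CREATE BPF_LINK_UPDATE
BPF_LINK_GET_FD_BY_ID BPF_LINK_GET_NEXT_ID BPF_ENABLE_STATS BPF_ITER_CREATE
BPF_LINK_DETACH BPF_PROG_BIND_MAP
""".split()

BpfCmd = IntEnum("BpfCmd", _NAMES, start=0)


def commands_with_prefix(prefix):
    """Return the names of all commands that start with the given prefix."""
    return [cmd.name for cmd in BpfCmd if cmd.name.startswith(prefix)]


if __name__ == "__main__":
    print("BPF_PROG_LOAD =", int(BpfCmd.BPF_PROG_LOAD))
    print("PROG commands:", commands_with_prefix("BPF_PROG_"))
    print("MAP commands:", commands_with_prefix("BPF_MAP_"))
```

Grouping by prefix makes the point of the paragraph above visible: most of the commands deal with either programs or maps.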
1. Program loading
The BPF_PROG_LOAD cmd loads a BPF program into the kernel. Unlike an ordinary thread, however, an eBPF program does not keep running once it has been started; it executes only when an event fires. These events include system calls, kernel tracepoints, the entry and exit of kernel functions and user-mode functions, network events, and so on. That is why the second action is needed.
2. Binding events
b.attach_kprobe(event="xxx", fn_name="yyy")
The line above binds a specific event to a specific BPF function. Under the hood it works as follows:
(1) After the BPF program is loaded with the bpf system call, the returned file descriptor is saved;
(2) The attach operation looks up the event number for the given type of function;
(3) perf_event_open is called with the value returned by attach to create a performance monitoring event;
(4) The BPF program is bound to the performance monitoring event with the PERF_EVENT_IOC_SET_BPF ioctl command.
3. Mapping operation
The MAP-related cmds handle adding to and deleting from maps; user mode then interacts with kernel mode through those maps.
What is the development status of eBPF?
Kernel support
Recommended >=4.14
Ecosystem
From the bottom up, the eBPF ecosystem looks like this:
1. Infrastructure
Supports the development of eBPF's basic capabilities.
- Linux Kernel
- LLVM
2. Development toolset
Mainly used to load, compile, and debug eBPF programs. Different languages have different development toolsets:
- Go
- https://github.com/cilium/ebpf
- https://github.com/aquasecurity/libbpfgo
- C/C++
- https://github.com/libbpf/libbpf
3. eBPF applications
Provides a set of development tools and scripts.
bpftrace
Based on bcc, it provides a scripting language.
Network optimization and security
Network security
Katran
<https://github.com/facebookincubator/katran>
High-performance Layer 4 load balancing
Observability
Observability
Observability
kubectl trace
Schedules bpftrace scripts
A platform for launching and managing eBPF programs in a distributed environment
Dynamic Linux tracing
Linux runtime security monitoring
4. Websites that track the ecosystem
Closing thoughts
Using eBPF well requires understanding the software stack
After the introduction above, I believe everyone has a good understanding of eBPF. eBPF provides only a framework and a mechanism. What matters most is still how well the people using eBPF understand the software stack, so that they can find suitable instrumentation points and relate them to application problems.
The killer features of eBPF are full coverage, non-intrusiveness, and programmability
1. Full coverage
Instrumentation points in both the kernel and applications are fully covered.
2. Non-intrusive
No need to modify any of the hooked code.
3. Programmable
eBPF programs are delivered dynamically, instructions are executed dynamically at the edge, and aggregation and analysis are performed dynamically.
Team information
Alibaba Cloud's observability team covers front-end monitoring, application monitoring, container monitoring, Prometheus, distributed tracing, intelligent alerting, O&M visualization, and other technical fields and products, and has accumulated observability solutions and best practices for Alibaba Cloud across different industries and technical scenarios.
Alibaba Cloud Kubernetes monitoring is a one-stop, non-intrusive observability product for Kubernetes clusters built on eBPF technology. It provides an overall observability solution.
Introduction:
https://help.aliyun.com/document_detail/260777.html
Access: