The core of eBPF (extended Berkeley Packet Filter) is an efficient virtual machine that lives in the kernel. Its original purpose was efficient network packet filtering, under the name BPF, so let's first look at BPF.
BPF
(Figure: where BPF sits in the kernel and its overall framework)
The picture above shows where BPF sits and how it is structured. Note that the kernel and user space exchange data through buffers to avoid frequent context switches. The BPF virtual machine itself is very simple: an accumulator, an index register, scratch memory, and an implicit program counter.
Example
Next, let's look at an example that filters all IP packets; you can view it with tcpdump -d ip:
(000) ldh [12]              // load the 2-byte ethertype field (offset 12 in the link-layer frame) into the register
(001) jeq #0x800 jt 2 jf 3  // is the ethertype IP? if true jump to 2, if false jump to 3
(002) ret #65535            // return true (accept the packet)
(003) ret #0                // return 0 (drop the packet)
BPF needs only four virtual-machine instructions to implement a very useful IP packet filter.
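Such a classic BPF filter can be attached to a socket with the SO_ATTACH_FILTER socket option. Below is a minimal sketch (not from the original article) that installs the same four instructions on a raw packet socket; the BPF_STMT/BPF_JUMP macros come from <linux/filter.h>:

```c
/* attach_ip_filter.c - hypothetical example: attach the 4-instruction
 * "ip" filter above to a raw packet socket (requires root). */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>   /* ETH_P_ALL, ETH_P_IP */
#include <linux/filter.h>     /* struct sock_filter, sock_fprog, BPF_STMT/BPF_JUMP */

int main(void)
{
    /* the same program that tcpdump -d ip shows, written with the kernel macros */
    struct sock_filter code[] = {
        BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),            /* (000) ldh [12]        */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),   /* (001) jeq #0x800      */
        BPF_STMT(BPF_RET | BPF_K, 0xffff),                     /* (002) ret #65535      */
        BPF_STMT(BPF_RET | BPF_K, 0),                          /* (003) ret #0          */
    };
    struct sock_fprog prog = { .len = 4, .filter = code };

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* from now on, reads on fd only ever see IPv4 frames */
    if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }

    char buf[2048];
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    printf("received %zd filtered bytes\n", n);
    return 0;
}
```

Here is a more complex filter, this time matching TCP over both IPv4 and IPv6: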
tcpdump -d tcp
(000) ldh [12]               // load the 2-byte ethertype field at offset 12 of the frame into the register
(001) jeq #0x86dd jt 2 jf 7  // is it IPv6? if true jump to 2, if false jump to 7
(002) ldb [20]               // load the 1-byte IPv6 next header field (offset 20) into the register
(003) jeq #0x6 jt 10 jf 4    // is it TCP? if true jump to 10, if false jump to 4
(004) jeq #0x2c jt 5 jf 11   // is it an IPv6 fragment header (44)? if true jump to 5, if false jump to 11
(005) ldb [54]               // load the next header field inside the fragment extension header (offset 54)
(006) jeq #0x6 jt 10 jf 11   // is it TCP? if true jump to 10, if false jump to 11
(007) jeq #0x800 jt 8 jf 11  // is it IPv4? if true jump to 8, if false jump to 11
(008) ldb [23]               // load the 1-byte IPv4 protocol field (offset 23) into the register
(009) jeq #0x6 jt 10 jf 11   // is it TCP? if true jump to 10, if false jump to 11
(010) ret #65535             // return true (accept)
(011) ret #0                 // return 0 (drop)
The above is BPF as it originated on FreeBSD; the Linux port is called LSF (Linux Socket Filtering). You can look up the differences yourself.
eBPF
A first look at eBPF
eBPF has been included in the Linux kernel since version 3.18. Compared with classic BPF, it brings several important improvements. First, efficiency: eBPF bytecode is JIT-compiled into native code. Second, scope: it extends from network packets to general event processing. Finally, data exchange: instead of going through a socket buffer, eBPF uses maps for efficient data storage and sharing.
Building on these improvements, kernel developers delivered network monitoring, traffic rate limiting, and system monitoring on top of eBPF in less than two and a half years.
Currently, using eBPF can be broken down into three steps:
- Create the eBPF program as bytecode. Typically you write C code and compile it with LLVM into eBPF bytecode stored in an ELF file (a minimal sketch follows this list).
- Load the program into the kernel and create the necessary eBPF-maps. The program can be loaded as one of several types: socket filter, kprobe handler, traffic-control classifier, traffic-control action, tracepoint handler, eXpress Data Path (XDP), performance monitoring, cgroup restriction, or lightweight tunnel.
- Attach the loaded program to a subsystem. Depending on the program type, it is attached to a different kernel subsystem; once running, it starts filtering, analyzing, or capturing information.
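As a concrete illustration of the first step (not part of the original text), here is a minimal socket-filter program in restricted C, in the style of the kernel's samples/bpf; the file name filter_kern.c and the function name are arbitrary. It accepts only IPv4 packets by looking at skb->protocol:

```c
/* filter_kern.c - hypothetical minimal eBPF socket filter.
 * Compile to bytecode with: clang -O2 -target bpf -c filter_kern.c -o filter_kern.o */
#include <linux/bpf.h>
#include <linux/if_ether.h>

#define SEC(name) __attribute__((section(name), used))

SEC("socket")
int ipv4_only(struct __sk_buff *skb)
{
    /* skb->protocol holds the EtherType in network byte order;
     * __builtin_bswap16 assumes a little-endian host here */
    if (skb->protocol == __builtin_bswap16(ETH_P_IP))
        return -1;   /* non-zero: keep the packet (value = bytes to keep) */
    return 0;        /* zero: drop it */
}

/* loaders such as those in samples/bpf expect a license section */
char _license[] SEC("license") = "GPL";
```

The resulting ELF object (filter_kern.o) is what the second step consumes: a loader extracts the bytecode from the section and passes it to the kernel with BPF_PROG_LOAD.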
At the NetDev 1.2 conference in October 2016, Jakub Kicinski and Nic Viljoen of Netronome gave a talk titled "eBPF/XDP Hardware Offload to SmartNIC". Nic Viljoen mentioned that each FPC on a Netronome SmartNIC handles about 3 million packets per second, and each SmartNIC carries 72 to 120 FPCs, which could in theory support up to 4.3 Tbps of eBPF throughput!
eBPF entry point
Next, let's use kernel version 4.14 as an example.
bpf system call
The bpf system call is implemented in kernel/bpf/syscall.c, and its UAPI header is include/uapi/linux/bpf.h. The entry function is:
int bpf(int cmd, union bpf_attr *attr, unsigned int size);
In kernel/bpf/syscall.c it is defined through a SYSCALL_DEFINE macro; see the macro definition in commit 0610e52a277839.
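Since glibc provides no bpf() wrapper, user space invokes this entry point through syscall(2). A minimal sketch (my own, not from the article; the wrapper name sys_bpf is arbitrary) that makes the simplest possible call, creating a one-entry array map:

```c
/* call_bpf.c - hypothetical example: invoke the bpf(2) entry point directly. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* glibc has no bpf() wrapper, so go through syscall(2) */
static long sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(int);
    attr.value_size  = sizeof(long long);
    attr.max_entries = 1;

    long fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0) { perror("bpf(BPF_MAP_CREATE)"); return 1; }
    printf("map fd = %ld\n", fd);
    return 0;
}
```

The later sketches reuse the same sys_bpf wrapper pattern.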
eBPF commands
The Linux bpf system call has 10 commands, 6 of which are listed in the man page:
- BPF_PROG_LOAD: verifies and loads an eBPF program, returning a new file descriptor
- BPF_MAP_CREATE: creates a map and returns a file descriptor referring to it
- BPF_MAP_LOOKUP_ELEM: looks up an element in the given map by key and returns its value
- BPF_MAP_UPDATE_ELEM: creates or updates an element (key/value pair) in the given map
- BPF_MAP_DELETE_ELEM: looks up an element in the given map by key and deletes it
- BPF_MAP_GET_NEXT_KEY: looks up an element in the given map by key and returns the next key
The commands above fall into two categories: loading eBPF programs and operating on eBPF-maps. The map operations are quite self-contained: they create eBPF-maps, look up, update and delete elements in them, and iterate over them (BPF_MAP_GET_NEXT_KEY).
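Here is a minimal sketch of the map-operation commands (my own, not from the article), again using the raw system call: it creates a hash map, inserts and looks up one element, and walks the keys with BPF_MAP_GET_NEXT_KEY:

```c
/* map_ops.c - hypothetical demo of the eBPF-map commands (run as root). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static long sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;

    /* BPF_MAP_CREATE: a small hash map of int -> long long */
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(int);
    attr.value_size  = sizeof(long long);
    attr.max_entries = 16;
    int map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (map_fd < 0) { perror("BPF_MAP_CREATE"); return 1; }

    /* BPF_MAP_UPDATE_ELEM: insert key 42 -> value 1000 */
    int key = 42;
    long long value = 1000;
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (unsigned long)&key;
    attr.value  = (unsigned long)&value;
    attr.flags  = BPF_ANY;               /* create or update */
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) < 0)
        perror("BPF_MAP_UPDATE_ELEM");

    /* BPF_MAP_LOOKUP_ELEM: read the value back */
    long long out = 0;
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (unsigned long)&key;
    attr.value  = (unsigned long)&out;
    if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)) == 0)
        printf("key %d -> %lld\n", key, out);

    /* BPF_MAP_GET_NEXT_KEY: iterate over all keys */
    int cur = -1, next;                  /* a missing key yields the first key */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd   = map_fd;
    attr.key      = (unsigned long)&cur;
    attr.next_key = (unsigned long)&next;
    while (sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) == 0) {
        printf("key: %d\n", next);
        cur = next;
    }
    return 0;
}
```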
The remaining 4 commands, which can be found in the code, are:
- BPF_OBJ_PIN was added in version 4.4 and provides eBPF persistence: with it, eBPF-maps and eBPF programs can be pinned under /sys/fs/bpf (a sketch follows this list)
- BPF_OBJ_GET was added at the same time and retrieves a pinned object. Before this, no tool could create an eBPF program and then exit, because the filter would be destroyed with it; with the BPF filesystem, eBPF-maps and eBPF programs survive after the process that created them exits.
- BPF_PROG_ATTACH was added in version 4.10 and attaches an eBPF program to a cgroup, which is useful for containers
- BPF_PROG_DETACH was added at the same time and detaches it again
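To illustrate the persistence commands, here is a minimal sketch (my own) that pins a map with BPF_OBJ_PIN and reopens it with BPF_OBJ_GET; it assumes bpffs is mounted at /sys/fs/bpf, and the pin path name is arbitrary:

```c
/* pin_map.c - hypothetical example: pin an eBPF-map so it outlives this process.
 * Assumes bpffs is mounted:  mount -t bpf bpf /sys/fs/bpf   (run as root) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static long sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    /* create a throwaway map to pin */
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(int);
    attr.value_size  = sizeof(long long);
    attr.max_entries = 4;
    int map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (map_fd < 0) { perror("BPF_MAP_CREATE"); return 1; }

    /* BPF_OBJ_PIN: expose the map as a file under /sys/fs/bpf */
    const char *path = "/sys/fs/bpf/demo_map";   /* arbitrary name */
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (unsigned long)path;
    attr.bpf_fd   = map_fd;
    if (sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr)) < 0) {
        perror("BPF_OBJ_PIN");
        return 1;
    }

    /* later (even from another process) BPF_OBJ_GET reopens it by path */
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (unsigned long)path;
    int again = sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
    printf("pinned map reopened, fd = %d\n", again);
    return 0;
}
```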
eBPF-map type
- BPF_MAP_TYPE_UNSPEC
- BPF_MAP_TYPE_HASH: a hash table, one of the two most commonly used map types
- BPF_MAP_TYPE_ARRAY: like the above, except that elements are indexed like an array
- BPF_MAP_TYPE_PROG_ARRAY: stores file descriptors of loaded eBPF programs. It is commonly used to identify different eBPF programs by number; given a key, the corresponding program can be looked up in the map and jumped to (tail call)
- BPF_MAP_TYPE_PERF_EVENT_ARRAY: works together with perf, CPU performance counters, tracepoints, kprobes and uprobes; see tracex6_kern.c and tracex6_user.c under samples/bpf/
- BPF_MAP_TYPE_PERCPU_HASH: same as BPF_MAP_TYPE_HASH, except one copy is kept per CPU
- BPF_MAP_TYPE_PERCPU_ARRAY: same as BPF_MAP_TYPE_ARRAY, except one copy is kept per CPU
- BPF_MAP_TYPE_STACK_TRACE: used to store stack traces
- BPF_MAP_TYPE_CGROUP_ARRAY: used to check which cgroup an skb belongs to
- BPF_MAP_TYPE_LRU_HASH
- BPF_MAP_TYPE_LRU_PERCPU_HASH
- BPF_MAP_TYPE_LPM_TRIE: a more specialized type, a trie for LPM (Longest Prefix Match) lookups
- BPF_MAP_TYPE_ARRAY_OF_MAPS: an array whose values are other maps (map-in-map)
- BPF_MAP_TYPE_HASH_OF_MAPS: a hash whose values are other maps (map-in-map)
- BPF_MAP_TYPE_DEVMAP: holds references to network devices, used for example by XDP redirects
- BPF_MAP_TYPE_SOCKMAP: holds references to sockets
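To show how these types appear inside an eBPF program, here is a sketch in the style of the kernel's samples/bpf, where maps are declared as struct bpf_map_def objects placed in a "maps" ELF section; the struct mirrors samples/bpf/bpf_helpers.h as of 4.14, and the map and file names are my own:

```c
/* count_kern.c - hypothetical fragment declaring two eBPF-maps in the
 * samples/bpf style; loaders such as samples/bpf/bpf_load.c create the
 * maps from the "maps" section when the ELF object is loaded. */
#include <linux/bpf.h>

#define SEC(name) __attribute__((section(name), used))

/* mirrors struct bpf_map_def from samples/bpf/bpf_helpers.h (kernel 4.14) */
struct bpf_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
    unsigned int inner_map_idx;
    unsigned int numa_node;
};

/* a hash map: protocol number -> packet count */
struct bpf_map_def SEC("maps") proto_count = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(unsigned int),
    .value_size  = sizeof(long long),
    .max_entries = 256,
};

/* a per-CPU array: one slot per statistic, duplicated for every CPU */
struct bpf_map_def SEC("maps") stats = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(unsigned int),
    .value_size  = sizeof(long long),
    .max_entries = 8,
};

char _license[] SEC("license") = "GPL";
```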