Background: this issue was reproduced on CentOS 7.6.1810. SmartNICs are now standard equipment on cloud servers, and at OPPO they are mainly used in scenarios such as VPC. As SmartNIC features grow, the driver code grows more complex, and driver bugs have always made up the bulk of kernel bugs. When kernel developers who are not familiar with the driver code run into this kind of problem, the investigation is harder than usual. The background knowledge involved includes dma_pool, dma_page, net_device, the mlx5_core_dev device, device unloading, and UAF (use-after-free) problems. As far as we can tell, this bug has not yet been fixed in the latest Linux mainline; we write it up separately because the UAF involved is fairly unusual.
Below is how we investigated and resolved the problem.
1. Failure symptoms
The OPPO cloud kernel team received a connectivity alarm and found that the machine had reset:
UPTIME: 00:04:16 ------------- very short uptime
LOAD AVERAGE: 0.25, 0.23, 0.11
TASKS: 2027
RELEASE: 3.10.0-1062.18.1.el7.x86_64
MEMORY: 127.6 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
PID: 23283
COMMAND: "spider-agent"
TASK: ffff9d1fbb090000 [THREAD_INFO: ffff9d1f9a0d8000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 23283 TASK: ffff9d1fbb090000 CPU: 0 COMMAND: "spider-agent"
#0 [ffff9d1f9a0db650] machine_kexec at ffffffffb6665b34
#1 [ffff9d1f9a0db6b0] __crash_kexec at ffffffffb6722592
#2 [ffff9d1f9a0db780] crash_kexec at ffffffffb6722680
#3 [ffff9d1f9a0db798] oops_end at ffffffffb6d85798
#4 [ffff9d1f9a0db7c0] no_context at ffffffffb6675bb4
#5 [ffff9d1f9a0db810] __bad_area_nosemaphore at ffffffffb6675e82
#6 [ffff9d1f9a0db860] bad_area_nosemaphore at ffffffffb6675fa4
#7 [ffff9d1f9a0db870] __do_page_fault at ffffffffb6d88750
#8 [ffff9d1f9a0db8e0] do_page_fault at ffffffffb6d88975
#9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
[exception RIP: dma_pool_alloc+427] // the faulting address
RIP: ffffffffb680efab RSP: ffff9d1f9a0db9c8 RFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff9d0fa45f4c80 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff9d0fa45f4c10
RBP: ffff9d1f9a0dba20 R8: 000000000001f080 R9: ffff9d00ffc07c00
R10: ffffffffc03e10c4 R11: ffffffffb67dd6fd R12: 00000000000080d0
R13: ffff9d0fa45f4c10 R14: ffff9d0fa45f4c00 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core] // the module involved
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
#12 [ffff9d1f9a0dbb18] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#13 [ffff9d1f9a0dbb48] mlx5_core_access_reg at ffffffffc03ee354 [mlx5_core]
#14 [ffff9d1f9a0dbba0] mlx5_query_port_ptys at ffffffffc03ee411 [mlx5_core]
#15 [ffff9d1f9a0dbc10] mlx5e_get_link_ksettings at ffffffffc0413035 [mlx5_core]
#16 [ffff9d1f9a0dbce8] __ethtool_get_link_ksettings at ffffffffb6c56d06
#17 [ffff9d1f9a0dbd48] speed_show at ffffffffb6c705b8
#18 [ffff9d1f9a0dbdd8] dev_attr_show at ffffffffb6ab1643
#19 [ffff9d1f9a0dbdf8] sysfs_kf_seq_show at ffffffffb68d709f
#20 [ffff9d1f9a0dbe18] kernfs_seq_show at ffffffffb68d57d6
#21 [ffff9d1f9a0dbe28] seq_read at ffffffffb6872a30
#22 [ffff9d1f9a0dbe98] kernfs_fop_read at ffffffffb68d6125
#23 [ffff9d1f9a0dbed8] vfs_read at ffffffffb684a8ff
#24 [ffff9d1f9a0dbf08] sys_read at ffffffffb684b7bf
#25 [ffff9d1f9a0dbf50] system_call_fastpath at ffffffffb6d8dede
RIP: 00000000004a5030 RSP: 000000c001099378 RFLAGS: 00000212
RAX: 0000000000000000 RBX: 000000c000040000 RCX: ffffffffffffffff
RDX: 000000000000000a RSI: 000000c00109976e RDI: 000000000000000d --- fd number of the file being read
RBP: 000000c001099640 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 000000000000000c
R13: 0000000000000032 R14: 0000000000f710c4 R15: 0000000000000000
ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
From the stack, a process reading a file triggered a NULL pointer dereference in kernel mode.
2. Failure analysis
From the stack information:
1. The process had opened the file with fd 13, as seen from the value of RDI.
2. speed_show and __ethtool_get_link_ksettings indicate that the NIC's speed attribute was being read (the read path is sketched below).
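For context, reading the sysfs speed attribute ends up in the driver's get_link_ksettings ethtool op through a small helper. In upstream kernels of this era it looks roughly like the following (abridged; the RHEL backport may differ in detail):
/* net/core/ethtool.c (abridged): speed_show() -> __ethtool_get_link_ksettings()
 * -> the driver's ethtool op, which for mlx5 is mlx5e_get_link_ksettings(). */
int __ethtool_get_link_ksettings(struct net_device *dev,
				 struct ethtool_link_ksettings *link_ksettings)
{
	ASSERT_RTNL();		/* the caller must hold the rtnl lock */

	if (!dev->ethtool_ops->get_link_ksettings)
		return -EOPNOTSUPP;

	memset(link_ksettings, 0, sizeof(*link_ksettings));
	return dev->ethtool_ops->get_link_ksettings(dev, link_ksettings);
}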
Let's first see which file fd 13 refers to:
crash> files 23283
PID: 23283 TASK: ffff9d1fbb090000 CPU: 0 COMMAND: "spider-agent"
ROOT: /rootfs CWD: /rootfs/home/service/app/spider
FD FILE DENTRY INODE TYPE PATH
....
9 ffff9d0f5709b200 ffff9d1facc80a80 ffff9d1069a194d0 REG /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/p1p1/speed --- still present
10 ffff9d0f4a45a400 ffff9d0f9982e240 ffff9d0fb7b873a0 REG /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/p3p1/speed --- note the mapping: 0000:5e:00.0 corresponds to p3p1
11 ffff9d0f57098f00 ffff9d1facc80240 ffff9d1069a1b530 REG /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.1/net/p1p2/speed --- still present
13 ffff9d0f4a458a00 ffff9d0f9982e0c0 ffff9d0fb7b875f0 REG /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.1/net/p3p2/speed --- note the mapping: 0000:5e:00.1 corresponds to p3p2
....
Note the mapping between the PCI addresses and the NIC names above; it will be used later.
Opening a file to read the speed attribute should be a perfectly ordinary operation, so let's start from exception RIP: dma_pool_alloc+427 and work out why a NULL pointer dereference was triggered.
The expanded stack is as follows:
#9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
[exception RIP: dma_pool_alloc+427]
RIP: ffffffffb680efab RSP: ffff9d1f9a0db9c8 RFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff9d0fa45f4c80 RCX: 0000000000001000
RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff9d0fa45f4c10
RBP: ffff9d1f9a0dba20 R8: 000000000001f080 R9: ffff9d00ffc07c00
R10: ffffffffc03e10c4 R11: ffffffffb67dd6fd R12: 00000000000080d0
R13: ffff9d0fa45f4c10 R14: ffff9d0fa45f4c00 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
ffff9d1f9a0db918: 0000000000000000 ffff9d0fa45f4c00
ffff9d1f9a0db928: ffff9d0fa45f4c10 00000000000080d0
ffff9d1f9a0db938: ffff9d1f9a0dba20 ffff9d0fa45f4c80
ffff9d1f9a0db948: ffffffffb67dd6fd ffffffffc03e10c4
ffff9d1f9a0db958: ffff9d00ffc07c00 000000000001f080
ffff9d1f9a0db968: 0000000000000246 0000000000001000
ffff9d1f9a0db978: 0000000000000000 0000000000000246
ffff9d1f9a0db988: ffff9d0fa45f4c10 ffffffffffffffff
ffff9d1f9a0db998: ffffffffb680efab 0000000000000010
ffff9d1f9a0db9a8: 0000000000010046 ffff9d1f9a0db9c8
ffff9d1f9a0db9b8: 0000000000000018 ffffffffb680ee45
ffff9d1f9a0db9c8: ffff9d0faf9fec40 0000000000000000
ffff9d1f9a0db9d8: ffff9d0faf9fec48 ffffffffb682669c
ffff9d1f9a0db9e8: ffff9d00ffc07c00 00000000618746c1
ffff9d1f9a0db9f8: 0000000000000000 0000000000000000
ffff9d1f9a0dba08: ffff9d0faf9fec40 0000000000000000
ffff9d1f9a0dba18: ffff9d0fa3c800c0 ffff9d1f9a0dba70
ffff9d1f9a0dba28: ffffffffc03e10e3
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
ffff9d1f9a0dba30: ffff9d0f4eebee00 0000000000000001
ffff9d1f9a0dba40: 000000d0000080d0 0000000000000050
ffff9d1f9a0dba50: ffff9d0fa3c800c0 0000000000000005 --- r12 holds the original rdi, ffff9d0fa3c800c0
ffff9d1f9a0dba60: ffff9d0fa3c803e0 ffff9d1f9d87ccc0
ffff9d1f9a0dba70: ffff9d1f9a0dbb10 ffffffffc03e3c92
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
From the stack, the mlx5_core_dev involved is ffff9d0fa3c800c0:
crash> mlx5_core_dev.cmd ffff9d0fa3c800c0 -xo
struct mlx5_core_dev {
[ffff9d0fa3c80138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff9d0fa3c80138
pool = 0xffff9d0fa45f4c00 ------ this is the dma_pool; anyone who writes driver code runs into it all the time
The source line at fault is:
crash> dis -l dma_pool_alloc+427 -B 5
/usr/src/debug/kernel-3.10.0-1062.18.1.el7/linux-3.10.0-1062.18.1.el7.x86_64/mm/dmapool.c: 334
0xffffffffb680efab <dma_pool_alloc+427>: mov (%r15),%ecx
And r15, as the register dump above shows, is indeed NULL.
305 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
306 dma_addr_t *handle)
307 {
...
315 spin_lock_irqsave(&pool->lock, flags);
316 list_for_each_entry(page, &pool->page_list, page_list) {
317 if (page->offset < pool->allocation) // this condition is met here
318 goto ready; // jump to ready
319 }
320
321 /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
322 spin_unlock_irqrestore(&pool->lock, flags);
323
324 page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO));
325 if (!page)
326 return NULL;
327
328 spin_lock_irqsave(&pool->lock, flags);
329
330 list_add(&page->page_list, &pool->page_list);
331 ready:
332 page->in_use++; // mark the page as in use
333 offset = page->offset; // continue from where the last allocation stopped
334 page->offset = *(int *)(page->vaddr + offset); // the faulting line
...
}
From the code above, if page->vaddr is NULL and offset is 0, a NULL pointer gets dereferenced. The page comes from one of two places: either it is taken from the pool's page_list, or it is freshly allocated by pool_alloc_page and then linked into the pool's page_list.
Let's look at this page_list:
crash> dma_pool ffff9d0fa45f4c00 -x
struct dma_pool {
page_list = {
next = 0xffff9d0fa45f4c80,
prev = 0xffff9d0fa45f4c00
},
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x1
}
}
}
}
},
size = 0x400,
dev = 0xffff9d1fbddec098,
allocation = 0x1000,
boundary = 0x1000,
name = "mlx5_cmd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
pools = {
next = 0xdead000000000100,
prev = 0xdead000000000200
}
}
crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
offset = 0
vaddr = 0x0
ffff9d0fa45f4d00
offset = 0
vaddr = 0x0
Judging from the logic of dma_pool_alloc, pool->page_list is indeed non-empty and its first entry satisfies if (page->offset < pool->allocation), so the page came from the first source, i.e. ffff9d0fa45f4c80 taken from the list:
crash> dma_page ffff9d0fa45f4c80
struct dma_page {
page_list = {
next = 0xffff9d0fa45f4d00,
prev = 0xffff9d0fa45f4c80
},
vaddr = 0x0, // abnormal: dereferencing this is what causes the crash
dma = 0,
in_use = 1, // marked as in use, consistent with page->in_use++
offset = 0
}
This is where the real puzzle starts: the vaddr of a dma_page is initialized when the page is allocated, normally inside pool_alloc_page, so how can it be NULL?
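For reference, pool_alloc_page in mm/dmapool.c of this kernel generation looks roughly as follows (recalled and abridged, so treat it as a sketch rather than verbatim source); vaddr comes from dma_alloc_coherent, and the header is freed immediately if that allocation fails, so a live dma_page should never carry vaddr == NULL:
static struct dma_page *pool_alloc_page(struct dma_pool *pool, gfp_t mem_flags)
{
	struct dma_page *page;

	page = kmalloc(sizeof(*page), mem_flags);	/* 40-byte header -> kmalloc-64 slab */
	if (!page)
		return NULL;
	page->vaddr = dma_alloc_coherent(pool->dev, pool->allocation,
					 &page->dma, mem_flags);
	if (page->vaddr) {
#ifdef DMAPOOL_DEBUG
		memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
		pool_initialise_page(pool, page);	/* chain the free blocks inside vaddr */
		page->in_use = 0;
		page->offset = 0;
	} else {
		kfree(page);
		page = NULL;
	}
	return page;
}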
Then check this address:
crash> kmem ffff9d0fa45f4c80 ------- this is the dma_page from the dma_pool
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff9d00ffc07900 kmalloc-128 128 8963 14976 234 8k // note the object size: kmalloc-128
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffe299c0917d00 ffff9d0fa45f4000 0 64 29 35
FREE / [ALLOCATED]
ffff9d0fa45f4c80
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffe299c0917d00 10245f4000 0 ffff9d0fa45f4c00 1 2fffff00004080 slab,head
Having used the dma_pool API before, I remembered that a dma_page is not that big. Let's look at the second dma_page:
crash> kmem ffff9d0fa45f4d00
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff9d00ffc07900 kmalloc-128 128 8963 14976 234 8k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffe299c0917d00 ffff9d0fa45f4000 0 64 29 35
FREE / [ALLOCATED]
ffff9d0fa45f4d00
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffe299c0917d00 10245f4000 0 ffff9d0fa45f4c00 1 2fffff00004080 slab,head
crash> dma_page ffff9d0fa45f4d00
struct dma_page {
page_list = {
next = 0xffff9d0fa45f5000,
prev = 0xffff9d0fa45f4d00
},
vaddr = 0x0, ----------- also NULL
dma = 0,
in_use = 0,
offset = 0
}
crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
offset = 0
vaddr = 0x0
ffff9d0fa45f4d00
offset = 0
vaddr = 0x0
ffff9d0fa45f5000
offset = 0
vaddr = 0x0
.........
So it is not just the first dma_page that is bad; every dma_page in the pool looks the same.
Next, check the normal size of a dma_page:
crash> p sizeof(struct dma_page)
$3 = 40
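Those 40 bytes match the layout of struct dma_page in mm/dmapool.c, which on x86_64 adds up to 16 + 8 + 8 + 4 + 4 bytes:
struct dma_page {			/* cacheable header for 'allocation' bytes */
	struct list_head page_list;	/* 16 bytes */
	void *vaddr;			/* 8 bytes: CPU address of the coherent DMA buffer */
	dma_addr_t dma;			/* 8 bytes: bus address */
	unsigned int in_use;		/* 4 bytes */
	unsigned int offset;		/* 4 bytes */
};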
Given the 40-byte size, even after slab rounding the object should land in the kmalloc-64 cache. How can the objects above sit in kmalloc-128? To resolve this doubt, compare against another, healthy node:
crash> net
NET_DEVICE NAME IP ADDRESS(ES)
ffff8f9e800be000 lo 127.0.0.1
ffff8f9e62640000 p1p1
ffff8f9e626c0000 p1p2
ffff8f9e627c0000 p3p1 ----- take this one as an example
ffff8f9e62100000 p3p2
Then, following the driver code, get the mlx5e_priv from the net_device:
static int mlx5e_get_link_ksettings(struct net_device *netdev,
struct ethtool_link_ksettings *link_ksettings)
{
...
struct mlx5e_priv *priv = netdev_priv(netdev);
...
}
static inline void *netdev_priv(const struct net_device *dev)
{
return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}
crash> px sizeof(struct net_device)
$2 = 0x8c0
crash> mlx5e_priv.mdev ffff8f9e627c08c0 --- address computed from the offset: net_device + 0x8c0
mdev = 0xffff8f9e67c400c0
crash> mlx5_core_dev.cmd 0xffff8f9e67c400c0 -xo
struct mlx5_core_dev {
[ffff8f9e67c40138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff8f9e67c40138
pool = 0xffff8f9e7bf48f80
crash> dma_pool 0xffff8f9e7bf48f80
struct dma_pool {
page_list = {
next = 0xffff8f9e79c60880, // one of the dma_pages
prev = 0xffff8fae6e4db800
},
.......
size = 1024,
dev = 0xffff8f9e800b3098,
allocation = 4096,
boundary = 4096,
name = "mlx5_cmd\000\217\364{\236\217\377\377\300\217\364{\236\217\377\377\200\234>\250\217\217\377\377",
pools = {
next = 0xffff8f9e800b3290,
prev = 0xffff8f9e800b3290
}
}
crash> dma_page 0xffff8f9e79c60880 // examine this dma_page
struct dma_page {
page_list = {
next = 0xffff8f9e79c60840, ------- another dma_page in the list
prev = 0xffff8f9e7bf48f80
},
vaddr = 0xffff8f9e6fc9b000, // a normal vaddr is never NULL
dma = 69521223680,
in_use = 0,
offset = 0
}
crash> kmem 0xffff8f9e79c60880
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff8f8fbfc07b00 kmalloc-64 64 667921 745024 11641 4k // the normal size: kmalloc-64
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffde5140e71800 ffff8f9e79c60000 0 64 64 0
FREE / [ALLOCATED]
[ffff8f9e79c60880]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffde5140e71800 1039c60000 0 0 1 2fffff00000080 slab
The operations above require some familiarity with net_device and the mlx5 driver code.
Compared with the abnormal dma_page, a normal dma_page lives in a 64-byte slab, so clearly this is either a memory-overwrite (corruption) problem or a UAF (use-after-free) problem.
At this point, how do we quickly decide which of the two it is? Both involve corrupted memory and are generally hard to track down. This is the moment to step back and look at the other running processes, and one of them stands out:
crash> bt 48263
PID: 48263 TASK: ffff9d0f4ee0a0e0 CPU: 56 COMMAND: "reboot"
#0 [ffff9d0f95d7f958] __schedule at ffffffffb6d80d4a
#1 [ffff9d0f95d7f9e8] schedule at ffffffffb6d811f9
#2 [ffff9d0f95d7f9f8] schedule_timeout at ffffffffb6d7ec48
#3 [ffff9d0f95d7faa8] wait_for_completion_timeout at ffffffffb6d81ae5
#4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
#5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
#7 [ffff9d0f95d7fc40] mlx5_mr_cache_cleanup at ffffffffc0c60aab [mlx5_ib]
#8 [ffff9d0f95d7fca8] mlx5_ib_stage_pre_ib_reg_umr_cleanup at ffffffffc0c45d32 [mlx5_ib]
#9 [ffff9d0f95d7fcc0] __mlx5_ib_remove at ffffffffc0c4f450 [mlx5_ib]
#10 [ffff9d0f95d7fce8] mlx5_ib_remove at ffffffffc0c4f4aa [mlx5_ib]
#11 [ffff9d0f95d7fd00] mlx5_detach_device at ffffffffc03fe231 [mlx5_core]
#12 [ffff9d0f95d7fd30] mlx5_unload_one at ffffffffc03dee90 [mlx5_core]
#13 [ffff9d0f95d7fd60] shutdown at ffffffffc03def80 [mlx5_core]
#14 [ffff9d0f95d7fd80] pci_device_shutdown at ffffffffb69d1cda
#15 [ffff9d0f95d7fda8] device_shutdown at ffffffffb6ab3beb
#16 [ffff9d0f95d7fdd8] kernel_restart_prepare at ffffffffb66b7916
#17 [ffff9d0f95d7fde8] kernel_restart at ffffffffb66b7932
#18 [ffff9d0f95d7fe00] SYSC_reboot at ffffffffb66b7ba9
#19 [ffff9d0f95d7ff40] sys_reboot at ffffffffb66b7c4e
#20 [ffff9d0f95d7ff50] system_call_fastpath at ffffffffb6d8dede
RIP: 00007fc9be7a5226 RSP: 00007ffd9a19e448 RFLAGS: 00010246
RAX: 00000000000000a9 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000001234567 RSI: 0000000028121969 RDI: fffffffffee1dead
RBP: 0000000000000002 R8: 00005575d529558c R9: 0000000000000000
R10: 00007fc9bea767b8 R11: 0000000000000206 R12: 0000000000000000
R13: 00007ffd9a19e690 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b
Why pay attention to this process? Over the years we have debugged no fewer than 20 UAF problems caused by unloading: sometimes the trigger is a reboot, sometimes an explicit module unload, sometimes resource release in a workqueue. Intuition therefore says this crash has a lot to do with the unload path. Let's look at what the reboot flow actually does.
2141 void device_shutdown(void)
2142 {
2143 struct device *dev, *parent;
2144
2145 spin_lock(&devices_kset->list_lock);
2146 /*
2147 * Walk the devices list backward, shutting down each in turn.
2148 * Beware that device unplug events may also start pulling
2149 * devices offline, even as the system is shutting down.
2150 */
2151 while (!list_empty(&devices_kset->list)) {
2152 dev = list_entry(devices_kset->list.prev, struct device,
2153 kobj.entry);
........
2178 if (dev->device_rh && dev->device_rh->class_shutdown_pre) {
2179 if (initcall_debug)
2180 dev_info(dev, "shutdown_pre\n");
2181 dev->device_rh->class_shutdown_pre(dev);
2182 }
2183 if (dev->bus && dev->bus->shutdown) {
2184 if (initcall_debug)
2185 dev_info(dev, "shutdown\n");
2186 dev->bus->shutdown(dev);
2187 } else if (dev->driver && dev->driver->shutdown) {
2188 if (initcall_debug)
2189 dev_info(dev, "shutdown\n");
2190 dev->driver->shutdown(dev);
2191 }
}
Two things can be seen from the code above:
1. The kobj.entry member of every device is linked into devices_kset->list.
2. device_shutdown shuts the devices down one by one, serially.
From the reboot stack, unloading an mlx5 device goes through:
pci_device_shutdown --> shutdown --> mlx5_unload_one --> mlx5_detach_device
                                                     --> mlx5_cmd_cleanup --> dma_pool_destroy
The branch that ends in dma_pool_destroy looks like this:
void dma_pool_destroy(struct dma_pool *pool)
{
.......
while (!list_empty(&pool->page_list)) { // remove the dma_pages from the pool one by one
struct dma_page *page;
page = list_entry(pool->page_list.next,
struct dma_page, page_list);
if (is_page_busy(page)) {
.......
list_del(&page->page_list);
kfree(page);
} else
pool_free_page(pool, page); // free each dma_page
}
kfree(pool); // free the pool itself
.......
}
static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
dma_addr_t dma = page->dma;
#ifdef DMAPOOL_DEBUG
memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
list_del(&page->page_list); // the page_list member is poisoned once the page is removed
kfree(page);
}
Looking at the relevant data in the reboot stack:
#4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
ffff9d0f95d7fb10: ffffffffb735b580 ffff9d0f904caf18
ffff9d0f95d7fb20: ffff9d00ff801da8 ffff9d0f23121200
ffff9d0f95d7fb30: ffff9d0f23121740 ffff9d0fa7480138
ffff9d0f95d7fb40: 0000000000000000 0000001002020000
ffff9d0f95d7fb50: 0000000000000000 ffff9d0f95d7fbe8
ffff9d0f95d7fb60: ffff9d0f00000000 0000000000000000
ffff9d0f95d7fb70: 00000000756415e3 ffff9d0fa74800c0 ---- the mlx5_core_dev device, which corresponds to p3p1
ffff9d0f95d7fb80: ffff9d0f95d7fbf8 ffff9d0f95d7fbe8
ffff9d0f95d7fb90: 0000000000000246 ffff9d0f8f3a20b8
ffff9d0f95d7fba0: ffff9d0f95d7fbd0 ffffffffc03e442b
#5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
ffff9d0f95d7fbb0: 0000000000000000 ffff9d0fa74800c0
ffff9d0f95d7fbc0: ffff9d0f8f3a20b8 ffff9d0fa74bea00
ffff9d0f95d7fbd0: ffff9d0f95d7fc38 ffffffffc03f085d
#6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
Note that the mlx5_core_dev being released by the reboot flow is ffff9d0fa74800c0, whose net_device is p3p1, while the mlx5_core_dev that process 23283 is accessing is ffff9d0fa3c800c0, which corresponds to p3p2:
crash> net
NET_DEVICE NAME IP ADDRESS(ES)
ffff9d0fc003e000 lo 127.0.0.1
ffff9d1fad200000 p1p1
ffff9d0fa0700000 p1p2
ffff9d0fa00c0000 p3p1 --- its mlx5_core_dev is ffff9d0fa74800c0
ffff9d0fa0200000 p3p2 --- its mlx5_core_dev is ffff9d0fa3c800c0
Let's take a look at the devices currently remaining in devices_kset:
crash> p devices_kset
devices_kset = $4 = (struct kset *) 0xffff9d1fbf4e70c0
crash> p devices_kset.list
$5 = {
next = 0xffffffffb72f2a38,
prev = 0xffff9d0fbe0ea130
}
crash> list -H -o 0x18 0xffffffffb72f2a38 -s device.kobj.name >device.list
We find that neither p3p1 nor p3p2 is in device.list:
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.0 device.list // not found: this is p3p1, which the reboot flow is unloading right now
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.1 device.list // not found: this is p3p2, which has already been unloaded
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.0 device.list // this mlx5 device has not been unloaded yet
kobj.name = 0xffff9d1fbe82aa70 "0000:3b:00.0",
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.1 device.list // this mlx5 device has not been unloaded yet
kobj.name = 0xffff9d1fbe82aae0 "0000:3b:00.1",
Since neither p3p1 nor p3p2 is in device.list, and pci_device_shutdown unloads devices serially with p3p1 being unloaded right now, p3p2 must already have been unloaded. So process 23283 was accessing a cmd pool that had already been destroyed, via the unload path described earlier:
pci_device_shutdown --> shutdown --> mlx5_unload_one --> mlx5_cmd_cleanup --> dma_pool_destroy
By that point the pool had been freed and every dma_page in it was invalid: a use-after-free.
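For reference, here is a paraphrased sketch of mlx5_cmd_cleanup (not the verbatim driver source; the intermediate teardown steps are elided) showing where the pool that mlx5_alloc_cmd_msg() allocates from is destroyed:
/* drivers/net/ethernet/mellanox/mlx5/core/cmd.c, paraphrased */
void mlx5_cmd_cleanup(struct mlx5_core_dev *dev)
{
	struct mlx5_cmd *cmd = &dev->cmd;

	/* ... tear down the command workqueue, message cache and command page ... */
	dma_pool_destroy(cmd->pool);	/* frees the pool and all of its dma_pages */
}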
Searching for this signature online turned up a very similar report from Red Hat: https://access.redhat.com/solutions/5132931
That article treats the UAF as solved, and the patch it points to is:
commit 4cca96a8d9da0ed8217cfdf2aec0c3c8b88e8911
Author: Parav Pandit <parav@mellanox.com>
Date: Thu Dec 12 13:30:21 2019 +0200
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 997cbfe..05b557d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6725,6 +6725,8 @@ void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
const struct mlx5_ib_profile *profile,
int stage)
{
+ dev->ib_active = false;
+
/* Number of stages to cleanup */
while (stage) {
stage--;
Knocking on the blackboard three times: this patch does not actually fix the bug. Consider, for example, the following interleaving, drawn as a simple diagram of the two racing CPUs:
CPU1                                                CPU2
                                                    dev_attr_show
pci_device_shutdown                                 speed_show
shutdown
mlx5_unload_one
mlx5_detach_device
mlx5_detach_interface
mlx5e_detach
mlx5e_detach_netdev
mlx5e_nic_disable
rtnl_lock
mlx5e_close_locked
clear_bit(MLX5E_STATE_OPENED, &priv->state); --- only this bit gets cleared
rtnl_unlock
                                                    rtnl_trylock --- succeeds in taking the lock
                                                    netif_running --- only checks the lowest bit of net_device.state
                                                    __ethtool_get_link_ksettings
                                                    mlx5e_get_link_ksettings
                                                    mlx5_query_port_ptys()
                                                    mlx5_core_access_reg()
                                                    mlx5_cmd_exec
                                                    cmd_exec
                                                    mlx5_alloc_cmd_msg
mlx5_cmd_cleanup --- destroys the dma_pool
                                                    dma_pool_alloc --- accesses cmd.pool and triggers the crash
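The reason the CPU2 side sails straight through is visible in the sysfs read path and in the netdev state helpers; in upstream kernels of this era they look roughly like this (abridged):
/* net/core/net-sysfs.c (abridged): only netif_running() guards the ethtool call. */
static ssize_t speed_show(struct device *dev,
			  struct device_attribute *attr, char *buf)
{
	struct net_device *netdev = to_net_dev(dev);
	int ret = -EINVAL;

	if (!rtnl_trylock())
		return restart_syscall();

	if (netif_running(netdev)) {
		struct ethtool_link_ksettings cmd;

		if (!__ethtool_get_link_ksettings(netdev, &cmd))
			ret = sprintf(buf, "%d\n", cmd.base.speed);
	}
	rtnl_unlock();
	return ret;
}

/* include/linux/netdevice.h: netif_running() tests __LINK_STATE_START (bit 0),
 * which the detach path above does not clear; netif_device_detach() only clears
 * __LINK_STATE_PRESENT. */
static inline bool netif_running(const struct net_device *dev)
{
	return test_bit(__LINK_STATE_START, &dev->state);
}

static inline void netif_device_detach(struct net_device *dev)
{
	if (test_and_clear_bit(__LINK_STATE_PRESENT, &dev->state) &&
	    netif_running(dev))
		netif_tx_stop_all_queues(dev);
}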
So to really fix the problem, either netif_device_detach needs to also clear the __LINK_STATE_START bit, or speed_show needs to check the __LINK_STATE_PRESENT bit. If we want to limit the blast radius and avoid touching common networking code, the __LINK_STATE_PRESENT check belongs in mlx5e_get_link_ksettings (a sketch follows the snippet below).
We leave pushing a fix upstream to readers who enjoy working with the community.
static void mlx5e_nic_disable(struct mlx5e_priv *priv)
{
	.......
	rtnl_lock();
	if (netif_running(priv->netdev))
		mlx5e_close(priv->netdev);
	netif_device_detach(priv->netdev);
	// suggestion: also clear the __LINK_STATE_PRESENT state bit here
	rtnl_unlock();
	.......
}
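A minimal sketch of the driver-local mitigation suggested above, assuming the guard sits at the top of mlx5e_get_link_ksettings (this is an illustration of the idea, not a merged patch; netif_device_present() simply tests __LINK_STATE_PRESENT, which netif_device_detach() clears):
static int mlx5e_get_link_ksettings(struct net_device *netdev,
				    struct ethtool_link_ksettings *link_ksettings)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);

	/* Bail out once netif_device_detach() has cleared __LINK_STATE_PRESENT,
	 * so we never reach the command interface of a device being unloaded. */
	if (!netif_device_present(netdev))
		return -ENODEV;

	/* ... original body: query the PTYS register via priv->mdev ... */
	return 0;
}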
3. Failure reproduction
The race can be reproduced by constructing a scenario like the CPU1/CPU2 interleaving in the diagram above: keep reading the speed attribute of the second port while the devices are being shut down (see the sketch below).
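A hypothetical user-space reproducer (the sysfs path and port name are assumptions; point it at the port that gets unloaded first) is simply a tight loop over the speed attribute while a reboot or driver unload runs in parallel:
/* repro_speed_read.c: hammer the sysfs "speed" attribute while the mlx5
 * devices are being shut down on another terminal (e.g. via reboot). */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[32];

	for (;;) {
		int fd = open("/sys/class/net/p3p2/speed", O_RDONLY);

		if (fd < 0)
			continue;	/* the attribute may disappear while the device detaches */
		read(fd, buf, sizeof(buf));
		close(fd);
	}
	return 0;
}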
4. Fault avoidance and resolution
Possible solutions:
Patch the kernel separately, for example with the device-present check described in the analysis above.
About the Author
Anqing
Currently works on OPPO Hybrid Cloud, responsible for Linux kernel, container, and virtual-machine virtualization.