Analysis of a Smart NIC (Mellanox) Failure

Background: This issue was reproduced on CentOS 7.6.1810. Smart NICs are now standard equipment on cloud servers; at OPPO they are mainly used in scenarios such as VPC. As smart NIC features grow, so does the complexity of the driver code. Driver bugs have always made up the bulk of kernel bugs, and because kernel developers are often unfamiliar with driver code, such problems are harder to investigate. The background knowledge involved here includes dma_pool, dma_page, net_device, the mlx5_core_dev device, device unloading, and use-after-free (UAF) problems. In addition, from reading the code, this bug does not appear to be fixed in the latest Linux baseline. We cover it in a separate article because this UAF problem is rather unusual.
Below is how we investigated and resolved the problem.

1. Failure phenomenon

The OPPO cloud kernel team received a connectivity alarm and found that the machine was reset:

UPTIME: 00:04:16 ------------- very short uptime
LOAD AVERAGE: 0.25, 0.23, 0.11
TASKS: 2027
RELEASE: 3.10.0-1062.18.1.el7.x86_64
MEMORY: 127.6 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at           (null)"
PID: 23283
COMMAND: "spider-agent"
TASK: ffff9d1fbb090000  [THREAD_INFO: ffff9d1f9a0d8000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0   COMMAND: "spider-agent"
 #0 [ffff9d1f9a0db650] machine_kexec at ffffffffb6665b34
 #1 [ffff9d1f9a0db6b0] __crash_kexec at ffffffffb6722592
 #2 [ffff9d1f9a0db780] crash_kexec at ffffffffb6722680
 #3 [ffff9d1f9a0db798] oops_end at ffffffffb6d85798
 #4 [ffff9d1f9a0db7c0] no_context at ffffffffb6675bb4
 #5 [ffff9d1f9a0db810] __bad_area_nosemaphore at ffffffffb6675e82
 #6 [ffff9d1f9a0db860] bad_area_nosemaphore at ffffffffb6675fa4
 #7 [ffff9d1f9a0db870] __do_page_fault at ffffffffb6d88750
 #8 [ffff9d1f9a0db8e0] do_page_fault at ffffffffb6d88975
 #9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427] //caq: faulting address
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core] // module involved
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
#12 [ffff9d1f9a0dbb18] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#13 [ffff9d1f9a0dbb48] mlx5_core_access_reg at ffffffffc03ee354 [mlx5_core]
#14 [ffff9d1f9a0dbba0] mlx5_query_port_ptys at ffffffffc03ee411 [mlx5_core]
#15 [ffff9d1f9a0dbc10] mlx5e_get_link_ksettings at ffffffffc0413035 [mlx5_core]
#16 [ffff9d1f9a0dbce8] __ethtool_get_link_ksettings at ffffffffb6c56d06
#17 [ffff9d1f9a0dbd48] speed_show at ffffffffb6c705b8
#18 [ffff9d1f9a0dbdd8] dev_attr_show at ffffffffb6ab1643
#19 [ffff9d1f9a0dbdf8] sysfs_kf_seq_show at ffffffffb68d709f
#20 [ffff9d1f9a0dbe18] kernfs_seq_show at ffffffffb68d57d6
#21 [ffff9d1f9a0dbe28] seq_read at ffffffffb6872a30
#22 [ffff9d1f9a0dbe98] kernfs_fop_read at ffffffffb68d6125
#23 [ffff9d1f9a0dbed8] vfs_read at ffffffffb684a8ff
#24 [ffff9d1f9a0dbf08] sys_read at ffffffffb684b7bf
#25 [ffff9d1f9a0dbf50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00000000004a5030  RSP: 000000c001099378  RFLAGS: 00000212
    RAX: 0000000000000000  RBX: 000000c000040000  RCX: ffffffffffffffff
    RDX: 000000000000000a  RSI: 000000c00109976e  RDI: 000000000000000d --- fd number of the file being read
    RBP: 000000c001099640   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000c
    R13: 0000000000000032  R14: 0000000000f710c4  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

The stack shows that a process reading a file triggered a NULL pointer dereference in kernel mode.

2. Failure analysis

From the stack information:

1. The process was reading the file with fd number 13, as the value of RDI (0xd) shows.

2. speed_show and __ethtool_get_link_ksettings indicate that the speed value of a network card was being read.
Let's see which file that is:

crash> files 23283
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0   COMMAND: "spider-agent"
ROOT: /rootfs    CWD: /rootfs/home/service/app/spider
 FD       FILE            DENTRY           INODE       TYPE PATH
....
  9 ffff9d0f5709b200 ffff9d1facc80a80 ffff9d1069a194d0 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/p1p1/speed --- this one is still present
 10 ffff9d0f4a45a400 ffff9d0f9982e240 ffff9d0fb7b873a0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/p3p1/speed --- note the mapping: 0000:5e:00.0 corresponds to p3p1
 11 ffff9d0f57098f00 ffff9d1facc80240 ffff9d1069a1b530 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.1/net/p1p2/speed --- this one is still present
 13 ffff9d0f4a458a00 ffff9d0f9982e0c0 ffff9d0fb7b875f0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.1/net/p3p2/speed --- note the mapping: 0000:5e:00.1 corresponds to p3p2
....

Note the mapping between the PCI addresses and the NIC names above; it will be used later.
Opening a file to read the speed should be a perfectly routine operation.
Starting from the exception RIP, dma_pool_alloc+427, let's analyze why a NULL pointer dereference was triggered.
The relevant stack frames, expanded, are as follows:

#9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffff9d1f9a0db918: 0000000000000000 ffff9d0fa45f4c00 
    ffff9d1f9a0db928: ffff9d0fa45f4c10 00000000000080d0 
    ffff9d1f9a0db938: ffff9d1f9a0dba20 ffff9d0fa45f4c80 
    ffff9d1f9a0db948: ffffffffb67dd6fd ffffffffc03e10c4 
    ffff9d1f9a0db958: ffff9d00ffc07c00 000000000001f080 
    ffff9d1f9a0db968: 0000000000000246 0000000000001000 
    ffff9d1f9a0db978: 0000000000000000 0000000000000246 
    ffff9d1f9a0db988: ffff9d0fa45f4c10 ffffffffffffffff 
    ffff9d1f9a0db998: ffffffffb680efab 0000000000000010 
    ffff9d1f9a0db9a8: 0000000000010046 ffff9d1f9a0db9c8 
    ffff9d1f9a0db9b8: 0000000000000018 ffffffffb680ee45 
    ffff9d1f9a0db9c8: ffff9d0faf9fec40 0000000000000000 
    ffff9d1f9a0db9d8: ffff9d0faf9fec48 ffffffffb682669c 
    ffff9d1f9a0db9e8: ffff9d00ffc07c00 00000000618746c1 
    ffff9d1f9a0db9f8: 0000000000000000 0000000000000000 
    ffff9d1f9a0dba08: ffff9d0faf9fec40 0000000000000000 
    ffff9d1f9a0dba18: ffff9d0fa3c800c0 ffff9d1f9a0dba70 
    ffff9d1f9a0dba28: ffffffffc03e10e3 
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
    ffff9d1f9a0dba30: ffff9d0f4eebee00 0000000000000001 
    ffff9d1f9a0dba40: 000000d0000080d0 0000000000000050 
    ffff9d1f9a0dba50: ffff9d0fa3c800c0 0000000000000005 -- r12 is rdi, ffff9d0fa3c800c0
    ffff9d1f9a0dba60: ffff9d0fa3c803e0 ffff9d1f9d87ccc0 
    ffff9d1f9a0dba70: ffff9d1f9a0dbb10 ffffffffc03e3c92 
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]

From the stack, the corresponding mlx5_core_dev is ffff9d0fa3c800c0:

crash> mlx5_core_dev.cmd ffff9d0fa3c800c0 -xo
struct mlx5_core_dev {
  [ffff9d0fa3c80138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff9d0fa3c80138
  pool = 0xffff9d0fa45f4c00 ------ this is the dma_pool; anyone who writes driver code will run into it often

The faulting instruction and its source line are:

crash> dis -l dma_pool_alloc+427 -B 5
/usr/src/debug/kernel-3.10.0-1062.18.1.el7/linux-3.10.0-1062.18.1.el7.x86_64/mm/dmapool.c: 334
0xffffffffb680efab <dma_pool_alloc+427>:        mov    (%r15),%ecx
And r15, as the registers in the stack above show, is indeed NULL.
    305 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
    306                      dma_addr_t *handle)
    307 {
...
    315         spin_lock_irqsave(&pool->lock, flags);
    316         list_for_each_entry(page, &pool->page_list, page_list) {
    317                 if (page->offset < pool->allocation) //caq: this condition holds here
    318                         goto ready; //caq: jump to ready
    319         }
    320 
    321         /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
    322         spin_unlock_irqrestore(&pool->lock, flags);
    323 
    324         page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO));
    325         if (!page)
    326                 return NULL;
    327 
    328         spin_lock_irqsave(&pool->lock, flags);
    329 
    330         list_add(&page->page_list, &pool->page_list);
    331  ready:
    332         page->in_use++; //caq: mark the page as in use
    333         offset = page->offset; // continue from where the last allocation left off
    334         page->offset = *(int *)(page->vaddr + offset); //caq: the faulting line
...
    }

From the code above, page->vaddr is NULL and offset is 0, so page->vaddr + offset is NULL and dereferencing it faults. The page can come from two sources.

The first is the page_list in the pool.

The second is a fresh allocation from pool_alloc_page, which is then linked into the pool's page_list.
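
To make the faulting line easier to follow, here is a minimal userspace sketch of the in-page free list that dma_pool_alloc walks. It is a simplified model, not the kernel code: the names model_page, model_init_page and model_alloc are hypothetical, the pool's boundary handling is left out, and the NULL check on vaddr exists only in the model so that it reports the bad state instead of faulting the way the kernel did.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of one dma_page. Field names mirror struct dma_page;
 * the type itself is hypothetical and only for illustration. */
struct model_page {
    unsigned char *vaddr;   /* backing memory; NULL in the crashed pool */
    unsigned int in_use;    /* number of blocks handed out */
    unsigned int offset;    /* offset of the first free block */
};

/* Simplified version of the page initialisation: every free block stores,
 * in its first bytes, the offset of the next free block.
 * Assumption: boundary handling is ignored in this model. */
static void model_init_page(struct model_page *page, unsigned int size,
                            unsigned int allocation)
{
    unsigned int offset = 0;

    assert(page->vaddr != NULL);
    do {
        unsigned int next = offset + size;

        *(int *)(page->vaddr + offset) = (int)next;
        offset = next;
    } while (offset < allocation);
    page->in_use = 0;
    page->offset = 0;
}

/* Models the "ready:" path of dma_pool_alloc(). Returns NULL when the page
 * is full, or when vaddr is NULL (model-only guard, see above). */
static void *model_alloc(struct model_page *page, unsigned int allocation)
{
    unsigned int offset;

    if (page->offset >= allocation)
        return NULL;                /* full: the kernel would try another page */

    page->in_use++;                 /* same order as the kernel: count first */
    offset = page->offset;
    if (page->vaddr == NULL)
        return NULL;                /* the kernel dereferences here and faults */

    /* Pop the free list: read the next free offset stored in this block. */
    page->offset = (unsigned int)*(int *)(page->vaddr + offset);
    return page->vaddr + offset;
}
```

With size 1024 and allocation 4096, the model hands out offsets 0, 1024, 2048 and 3072 and then reports the page as full. A page with vaddr == NULL and offset == 0 passes the offset check, gets in_use bumped to 1, and only then hits the read through vaddr, which is the state seen in the dump.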

Let's examine this page_list:

crash> dma_pool ffff9d0fa45f4c00 -x
struct dma_pool {
  page_list = {
    next = 0xffff9d0fa45f4c80, 
    prev = 0xffff9d0fa45f4c00
  }, 
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x1
          }
        }
      }
    }
  }, 
  size = 0x400, 
  dev = 0xffff9d1fbddec098, 
  allocation = 0x1000, 
  boundary = 0x1000, 
  name = "mlx5_cmd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  pools = {
    next = 0xdead000000000100, 
    prev = 0xdead000000000200
  }
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0

Following the logic of dma_pool_alloc, pool->page_list is indeed not empty and the first entry satisfies
the condition if (page->offset < pool->allocation), so the page in use should be the first one, ffff9d0fa45f4c80.
That is, it was taken from the first source:

crash> dma_page ffff9d0fa45f4c80
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f4d00, 
    prev = 0xffff9d0fa45f4c80
  }, 
  vaddr = 0x0, //caq: this is abnormal; dereferencing it causes the crash
  dma = 0, 
  in_use = 1, //caq: marked as in use, consistent with page->in_use++;
  offset = 0
}

Here the analysis hits a puzzle: the vaddr of a dma_page in a dma_pool is initialized right after the page is allocated,
normally in pool_alloc_page, so how can it be NULL?
Let's check this address:

crash> kmem ffff9d0fa45f4c80 ------- this is the page in the dma_pool
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128              128       8963     14976    234     8k //caq: note this size
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4c80  

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe299c0917d00 10245f4000                0 ffff9d0fa45f4c00  1 2fffff00004080 slab,head

Having used similar DMA functions before, I remembered that dma_page is not that large. Note also that the address is printed without brackets, which means crash considers this slab object free. Let's look at the second dma_page:

crash> kmem ffff9d0fa45f4d00
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128              128       8963     14976    234     8k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4d00  

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe299c0917d00 10245f4000                0 ffff9d0fa45f4c00  1 2fffff00004080 slab,head

crash> dma_page ffff9d0fa45f4d00
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f5000, 
    prev = 0xffff9d0fa45f4d00
  }, 
  vaddr = 0x0, ----------- caq: also NULL
  dma = 0, 
  in_use = 0, 
  offset = 0
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0
ffff9d0fa45f5000
  offset = 0
  vaddr = 0x0
.........

So the problem is not limited to the first dma_page; every dma_page in the pool looks the same.
Let's check the normal size of dma_page directly:

crash> p sizeof(struct dma_page)
$3 = 40

By rights the structure is only 40 bytes, so even when allocated from the slab allocator it should be rounded up to 64 bytes. How can it sit in a 128-byte object like the dma_page above? To resolve this, we compare against a healthy node:
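
The 40-byte and 64-byte figures can be checked with a small userspace sketch. The struct below is a replica built from the fields shown in the crash output, assuming a 64-bit build where dma_addr_t is 64 bits wide. The list of kmalloc size classes is also an assumption (the usual small x86_64 caches), and the names dma_page_model and kmalloc_class are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head_model {
    struct list_head_model *next, *prev;
};

/* Replica of struct dma_page, field for field as printed by crash.
 * Assumption: 64-bit pointers and a 64-bit dma_addr_t. */
struct dma_page_model {
    struct list_head_model page_list;  /* 16 bytes */
    void *vaddr;                       /*  8 bytes */
    uint64_t dma;                      /*  8 bytes, dma_addr_t */
    unsigned int in_use;               /*  4 bytes */
    unsigned int offset;               /*  4 bytes */
};

/* Smallest general-purpose kmalloc cache that fits `size`.
 * Assumption: the common small size classes on x86_64.
 * Returns 0 if `size` is larger than the largest class listed here. */
static size_t kmalloc_class(size_t size)
{
    static const size_t classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };
    size_t i;

    assert(size > 0);
    for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
        if (size <= classes[i])
            return classes[i];
    }
    return 0;
}
```

Under these assumptions sizeof(struct dma_page_model) is 40 and kmalloc_class(40) is 64, so a live dma_page would be expected in kmalloc-64 rather than kmalloc-128.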

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff8f9e800be000  lo     127.0.0.1
ffff8f9e62640000  p1p1   
ffff8f9e626c0000  p1p2   
ffff8f9e627c0000  p3p1   -----//caq: take this one as an example
ffff8f9e62100000  p3p2   

Then, following the code, get the mlx5e_priv from the net_device:

static int mlx5e_get_link_ksettings(struct net_device *netdev,
                    struct ethtool_link_ksettings *link_ksettings)
{
...
    struct mlx5e_priv *priv    = netdev_priv(netdev);
...
}

static inline void *netdev_priv(const struct net_device *dev)
{
    return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}

crash> px sizeof(struct net_device)
$2 = 0x8c0

crash> mlx5e_priv.mdev ffff8f9e627c08c0 --- computed from the offset
  mdev = 0xffff8f9e67c400c0
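
The address passed to mlx5e_priv.mdev above is simply the netdev_priv() arithmetic done by hand. A small sketch of that calculation follows; the helper names align_up and netdev_priv_addr are hypothetical, and the value 32 is NETDEV_ALIGN from include/linux/netdevice.h.

```c
#include <assert.h>
#include <stdint.h>

#define NETDEV_ALIGN_MODEL 32u   /* NETDEV_ALIGN in include/linux/netdevice.h */

/* Round x up to a multiple of a, like the kernel's ALIGN().
 * a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Address of the driver-private area that follows struct net_device. */
static uint64_t netdev_priv_addr(uint64_t netdev, uint64_t sizeof_net_device)
{
    return netdev + align_up(sizeof_net_device, NETDEV_ALIGN_MODEL);
}
```

With the net_device of p3p1 at 0xffff8f9e627c0000 and sizeof(struct net_device) equal to 0x8c0, which is already a multiple of 32, the result is 0xffff8f9e627c08c0, the address used in the crash command.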

crash> mlx5_core_dev.cmd 0xffff8f9e67c400c0 -xo
struct mlx5_core_dev {
  [ffff8f9e67c40138] struct mlx5_cmd cmd;
}

crash> mlx5_cmd.pool ffff8f9e67c40138
  pool = 0xffff8f9e7bf48f80

crash> dma_pool 0xffff8f9e7bf48f80
struct dma_pool {
  page_list = {
    next = 0xffff8f9e79c60880, //caq: one of the dma_pages
    prev = 0xffff8fae6e4db800
  }, 
.......
  size = 1024, 
  dev = 0xffff8f9e800b3098, 
  allocation = 4096, 
  boundary = 4096, 
  name = "mlx5_cmd\000\217\364{\236\217\377\377\300\217\364{\236\217\377\377\200\234>\250\217\217\377\377", 
  pools = {
    next = 0xffff8f9e800b3290, 
    prev = 0xffff8f9e800b3290
  }
}
crash> dma_page 0xffff8f9e79c60880     //caq: inspect this dma_page
struct dma_page {
  page_list = {
    next = 0xffff8f9e79c60840, ------- another dma_page
    prev = 0xffff8f9e7bf48f80
  }, 
  vaddr = 0xffff8f9e6fc9b000, //caq: a normal vaddr can never be NULL
  dma = 69521223680, 
  in_use = 0, 
  offset = 0
}

crash> kmem 0xffff8f9e79c60880
CACHE            NAME             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff8f8fbfc07b00 kmalloc-64            64     667921    745024  11641  4k //caq: normal size
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffde5140e71800  ffff8f9e79c60000     0     64         64     0
  FREE / [ALLOCATED]
  [ffff8f9e79c60880]

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffde5140e71800 1039c60000                0        0  1 2fffff00000080 slab

The steps above require familiarity with net_device and the mlx5 driver code.
On the healthy node the dma_page lives in a 64-byte slab object, unlike the abnormal one, so clearly
this is either a memory corruption problem or a UAF (use-after-free) problem.
At this point, how can we quickly tell which one it is? Both kinds of problem leave memory in a confused state and are generally hard to track down. It helps to step back and look at the other processes that were running. We found this one:

crash> bt 48263
PID: 48263  TASK: ffff9d0f4ee0a0e0  CPU: 56  COMMAND: "reboot"
 #0 [ffff9d0f95d7f958] __schedule at ffffffffb6d80d4a
 #1 [ffff9d0f95d7f9e8] schedule at ffffffffb6d811f9
 #2 [ffff9d0f95d7f9f8] schedule_timeout at ffffffffb6d7ec48
 #3 [ffff9d0f95d7faa8] wait_for_completion_timeout at ffffffffb6d81ae5
 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
 #7 [ffff9d0f95d7fc40] mlx5_mr_cache_cleanup at ffffffffc0c60aab [mlx5_ib]
 #8 [ffff9d0f95d7fca8] mlx5_ib_stage_pre_ib_reg_umr_cleanup at ffffffffc0c45d32 [mlx5_ib]
 #9 [ffff9d0f95d7fcc0] __mlx5_ib_remove at ffffffffc0c4f450 [mlx5_ib]
#10 [ffff9d0f95d7fce8] mlx5_ib_remove at ffffffffc0c4f4aa [mlx5_ib]
#11 [ffff9d0f95d7fd00] mlx5_detach_device at ffffffffc03fe231 [mlx5_core]
#12 [ffff9d0f95d7fd30] mlx5_unload_one at ffffffffc03dee90 [mlx5_core]
#13 [ffff9d0f95d7fd60] shutdown at ffffffffc03def80 [mlx5_core]
#14 [ffff9d0f95d7fd80] pci_device_shutdown at ffffffffb69d1cda
#15 [ffff9d0f95d7fda8] device_shutdown at ffffffffb6ab3beb
#16 [ffff9d0f95d7fdd8] kernel_restart_prepare at ffffffffb66b7916
#17 [ffff9d0f95d7fde8] kernel_restart at ffffffffb66b7932
#18 [ffff9d0f95d7fe00] SYSC_reboot at ffffffffb66b7ba9
#19 [ffff9d0f95d7ff40] sys_reboot at ffffffffb66b7c4e
#20 [ffff9d0f95d7ff50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00007fc9be7a5226  RSP: 00007ffd9a19e448  RFLAGS: 00010246
    RAX: 00000000000000a9  RBX: 0000000000000004  RCX: 0000000000000000
    RDX: 0000000001234567  RSI: 0000000028121969  RDI: fffffffffee1dead
    RBP: 0000000000000002   R8: 00005575d529558c   R9: 0000000000000000
    R10: 00007fc9bea767b8  R11: 0000000000000206  R12: 0000000000000000
    R13: 00007ffd9a19e690  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b

Why pay attention to this process? Over the years I have investigated no fewer than 20 UAF problems caused by unloading modules. Sometimes the trigger is a reboot, sometimes an unload, and sometimes resources being released in a work item, so my intuition was that this crash had a lot to do with the unload in progress. Let's analyze where the reboot flow currently is.

void device_shutdown(void)
{
        struct device *dev, *parent;

        spin_lock(&devices_kset->list_lock);
        /*
         * Walk the devices list backward, shutting down each in turn.
         * Beware that device unplug events may also start pulling
         * devices offline, even as the system is shutting down.
         */
        while (!list_empty(&devices_kset->list)) {
                dev = list_entry(devices_kset->list.prev, struct device,
                                kobj.entry);
........
                if (dev->device_rh && dev->device_rh->class_shutdown_pre) {
                        if (initcall_debug)
                                dev_info(dev, "shutdown_pre\n");
                        dev->device_rh->class_shutdown_pre(dev);
                }
                if (dev->bus && dev->bus->shutdown) {
                        if (initcall_debug)
                                dev_info(dev, "shutdown\n");
                        dev->bus->shutdown(dev);
                } else if (dev->driver && dev->driver->shutdown) {
                        if (initcall_debug)
                                dev_info(dev, "shutdown\n");
                        dev->driver->shutdown(dev);
                }
}

Two points can be seen from the code above:

1. Each device is linked into devices_kset->list through its kobj.entry member.

2. device_shutdown shuts the devices down serially, one at a time.

From the reboot stack, unloading an mlx device involves the following flow:

pci_device_shutdown-->shutdown-->mlx5_unload_one-->mlx5_detach_device

                                            -->mlx5_cmd_cleanup-->dma_pool_destroy

dma_pool_destroy, at the end of the mlx5_cmd_cleanup branch, is:

void dma_pool_destroy(struct dma_pool *pool)
{
.......
        while (!list_empty(&pool->page_list)) { //caq: remove the dma_pages in the pool one by one
                struct dma_page *page;
                page = list_entry(pool->page_list.next,
                                  struct dma_page, page_list);
                if (is_page_busy(page)) {
.......
                        list_del(&page->page_list);
                        kfree(page);
                } else
                        pool_free_page(pool, page); // free each dma_page
        }

        kfree(pool); //caq: free the pool
.......        
}

static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
        dma_addr_t dma = page->dma;

#ifdef  DMAPOOL_DEBUG
        memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
        dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
        list_del(&page->page_list); //caq: after removal, the page_list member is poisoned
        kfree(page);
}
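
The "poisoned" remark refers to what list_del() leaves in the removed node. Below is a minimal userspace sketch of that behaviour; the type and helper names (struct lh, lh_del and so on) are hypothetical. The two poison constants are copied from the pools field of the abnormal dma_pool dumped earlier (0xdead000000000100 and 0xdead000000000200), which is consistent with list_del() having already run on that pool. The sketch assumes 64-bit pointers.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for struct list_head. */
struct lh {
    struct lh *next, *prev;
};

/* Poison values as seen in pools.next / pools.prev in the dump above.
 * Assumption: 64-bit pointers. */
#define POISON1 ((struct lh *)(uintptr_t)0xdead000000000100ULL)
#define POISON2 ((struct lh *)(uintptr_t)0xdead000000000200ULL)

static void lh_init(struct lh *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n right after head. */
static void lh_add(struct lh *n, struct lh *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink e and poison its pointers so that later use stands out. */
static void lh_del(struct lh *e)
{
    assert(e->next != POISON1);   /* catch a double delete in the model */
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = POISON1;
    e->prev = POISON2;
}

static int lh_is_poisoned(const struct lh *e)
{
    return e->next == POISON1 && e->prev == POISON2;
}
```

A node that still shows these two values in a dump has been unlinked and not relinked since. Once the memory is handed back to the allocator and reused, the values can be overwritten, so their absence proves nothing either way.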

The relevant frames from the reboot stack:

 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
    ffff9d0f95d7fb10: ffffffffb735b580 ffff9d0f904caf18 
    ffff9d0f95d7fb20: ffff9d00ff801da8 ffff9d0f23121200 
    ffff9d0f95d7fb30: ffff9d0f23121740 ffff9d0fa7480138 
    ffff9d0f95d7fb40: 0000000000000000 0000001002020000 
    ffff9d0f95d7fb50: 0000000000000000 ffff9d0f95d7fbe8 
    ffff9d0f95d7fb60: ffff9d0f00000000 0000000000000000 
    ffff9d0f95d7fb70: 00000000756415e3 ffff9d0fa74800c0 ---- the mlx5_core_dev device, corresponding to p3p1
    ffff9d0f95d7fb80: ffff9d0f95d7fbf8 ffff9d0f95d7fbe8 
    ffff9d0f95d7fb90: 0000000000000246 ffff9d0f8f3a20b8 
    ffff9d0f95d7fba0: ffff9d0f95d7fbd0 ffffffffc03e442b 
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
    ffff9d0f95d7fbb0: 0000000000000000 ffff9d0fa74800c0 
    ffff9d0f95d7fbc0: ffff9d0f8f3a20b8 ffff9d0fa74bea00 
    ffff9d0f95d7fbd0: ffff9d0f95d7fc38 ffffffffc03f085d 
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]

Note that the mlx5_core_dev being released by reboot is ffff9d0fa74800c0, whose net_device is p3p1,
while the mlx5_core_dev that process 23283 is accessing is ffff9d0fa3c800c0, which corresponds to p3p2.

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff9d0fc003e000  lo     127.0.0.1
ffff9d1fad200000  p1p1   
ffff9d0fa0700000  p1p2   
ffff9d0fa00c0000  p3p1  corresponding mlx5_core_dev is ffff9d0fa74800c0
ffff9d0fa0200000  p3p2  corresponding mlx5_core_dev is ffff9d0fa3c800c0

Let's take a look at the devices currently remaining in devices_kset:

crash> p devices_kset
devices_kset = $4 = (struct kset *) 0xffff9d1fbf4e70c0
crash> p devices_kset.list
$5 = {
  next = 0xffffffffb72f2a38, 
  prev = 0xffff9d0fbe0ea130
}

crash> list -H -o 0x18 0xffffffffb72f2a38 -s device.kobj.name >device.list

We found that neither p3p1 nor p3p2 is in device.list:

[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.0 device.list //caq: not found; this is p3p1, which the reboot flow is currently unloading
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:5e:00.1 device.list //caq: not found; this is p3p2, already unloaded
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.0  device.list //caq: this mlx5 device has not been unloaded yet
  kobj.name = 0xffff9d1fbe82aa70 "0000:3b:00.0",
[root@it202-seg-k8s-prod001-node-10-27-96-220 127.0.0.1-2020-12-07-10:58:06]# grep 0000:3b:00.1 device.list //caq: this mlx5 device has not been unloaded yet
  kobj.name = 0xffff9d1fbe82aae0 "0000:3b:00.1",

Neither p3p2 nor p3p1 is in device.list, and since pci_device_shutdown unloads devices serially and p3p1 is the one being unloaded right now, p3p2 must already have been unloaded. So process 23283 is certainly accessing a cmd pool that has already been torn down, through the unload flow described earlier:
pci_device_shutdown-->shutdown-->mlx5_unload_one-->mlx5_cmd_cleanup-->dma_pool_destroy
At this point the pool has been freed and the dma_pages in it are invalid.

I then searched Google for this bug and found a report that closely matches the symptoms here; Red Hat had run into a similar problem: https://access.redhat.com/solutions/5132931

In that article Red Hat considers the UAF problem solved, but the patch it merged is:

commit 4cca96a8d9da0ed8217cfdf2aec0c3c8b88e8911
Author: Parav Pandit <parav@mellanox.com>
Date:   Thu Dec 12 13:30:21 2019 +0200

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 997cbfe..05b557d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6725,6 +6725,8 @@ void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
                      const struct mlx5_ib_profile *profile,
                      int stage)
 {
+       dev->ib_active = false;
+
        /* Number of stages to cleanup */
        while (stage) {
                stage--;

This is the key point, and it bears repeating:
this patch does not fix the bug. Consider, for example, the following concurrent execution,
shown here as a simple diagram:

    CPU1                                                            CPU2
                                                                   dev_attr_show
    pci_device_shutdown                                            speed_show
      shutdown                          
        mlx5_unload_one
          mlx5_detach_device
            mlx5_detach_interface
              mlx5e_detach
               mlx5e_detach_netdev
                 mlx5e_nic_disable
                   rtnl_lock
                     mlx5e_close_locked 
                     clear_bit(MLX5E_STATE_OPENED, &priv->state); --- only this bit is cleared
                   rtnl_unlock
                                                  rtnl_trylock --- after the lock is acquired
                                                  netif_running only tests the lowest bit of net_device.state
                                                    __ethtool_get_link_ksettings
                                                    mlx5e_get_link_ksettings
                                                      mlx5_query_port_ptys()
                                                      mlx5_core_access_reg()
                                                      mlx5_cmd_exec
                                                      cmd_exec
                                                      mlx5_alloc_cmd_msg
          mlx5_cmd_cleanup --- destroys the dma_pool
                                                      dma_pool_alloc --- accesses cmd.pool and triggers the crash

So to really fix this problem, one would also need to clear the __LINK_STATE_START bit in netif_device_detach, or check the __LINK_STATE_PRESENT bit in speed_show. If the scope of impact is a concern and the common code path should stay untouched, then
check __LINK_STATE_PRESENT in mlx5e_get_link_ksettings instead.
This is left as an improvement for readers who enjoy working with the upstream community.

static void mlx5e_nic_disable(struct mlx5e_priv *priv)
{
.......
    rtnl_lock();
    if (netif_running(priv->netdev))
        mlx5e_close(priv->netdev);
    netif_device_detach(priv->netdev);
  //caq: adds clearing of the __LINK_STATE_PRESENT bit
    rtnl_unlock();
.......
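
The effect of the proposed check can be sketched with a small userspace model of the two state bits. This is not kernel code and not a tested patch. It assumes that the bit numbers follow enum netdev_state_t (START is bit 0, PRESENT is bit 1) and that netif_device_detach() clears __LINK_STATE_PRESENT while __LINK_STATE_START stays set, as in the race above. All names prefixed with model_ or MODEL_ are hypothetical, and the -ENODEV return value is an illustrative choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: bit numbers follow enum netdev_state_t. */
enum { MODEL_STATE_START = 0, MODEL_STATE_PRESENT = 1 };

struct netdev_model {
    unsigned long state;
    int speed;              /* stand-in for the result of a firmware query */
};

static int model_running(const struct netdev_model *dev)
{
    return (int)((dev->state >> MODEL_STATE_START) & 1UL);
}

static int model_present(const struct netdev_model *dev)
{
    return (int)((dev->state >> MODEL_STATE_PRESENT) & 1UL);
}

/* State left behind by the detach path in the race: PRESENT is cleared,
 * START is still set because the close path only clears the driver's own
 * MLX5E_STATE_OPENED bit. */
static void model_detach(struct netdev_model *dev)
{
    dev->state &= ~(1UL << MODEL_STATE_PRESENT);
}

/* Check as in the crashing path: only "running" is tested. */
static int model_speed_unguarded(const struct netdev_model *dev, int *speed)
{
    assert(speed != NULL);
    if (!model_running(dev))
        return -EINVAL;
    *speed = dev->speed;    /* real driver: firmware command using cmd.pool */
    return 0;
}

/* Proposed guard: also require the PRESENT bit before touching the device. */
static int model_speed_guarded(const struct netdev_model *dev, int *speed)
{
    assert(speed != NULL);
    if (!model_running(dev) || !model_present(dev))
        return -ENODEV;     /* illustrative error code */
    *speed = dev->speed;
    return 0;
}
```

In this model a detached device still passes the unguarded check, which corresponds to CPU2 reaching dma_pool_alloc in the diagram, while the guarded variant returns early.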

3. Fault reproduction

1. This is a race condition; it can be reproduced by constructing a race like the one between CPU1 and CPU2 in the diagram above.

4. Fault avoidance or resolution

The possible solutions are:

1. Don't upgrade like

2. Patch separately.

About the Author

Anqing

Currently works on OPPO Hybrid Cloud, responsible for the Linux kernel, containers, and virtual machine virtualization.
