Moment For Technology

Analysis of a Mellanox iNIC fault

Posted on Dec. 2, 2022, 7:07 p.m. by Cheryl Moore
Category: The back-end Tag: The back-end

Background: This problem was reproduced on CentOS 7.6.1810. The iNIC is now standard equipment on many cloud servers; at OPPO it is mainly used for VPC and similar scenarios. Because many kernel developers are not very familiar with driver code, problems of this kind are harder to investigate. The background knowledge involved includes dma_pool, dma_page, net_device, the mlx5_core_dev device, device teardown, and UAF (use-after-free) issues. In addition, this bug does not appear to be fixed in the latest Linux baseline. We single it out in this article because UAF problems are relatively unusual. Here is how we investigated and resolved it.

I. Fault symptoms

The OPPO Cloud kernel team received a connectivity alarm and found that the machine had been reset:

  LOAD AVERAGE: 0.25, 0.23, 0.11
         TASKS: 2027
       RELEASE: 3.10.0-1062.18.1.el7.x86_64
        MEMORY: 127.6GB
         PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
           PID: 23283
       COMMAND: "spider-agent"
          TASK: ffff9d1fbb090000  [THREAD_INFO: ffff9d1f9a0d8000]
           CPU: 0
         STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0   COMMAND: "spider-agent"
 #0 [ffff9d1f9a0db650] machine_kexec at ffffffffb6665b34
 #1 [ffff9d1f9a0db6b0] __crash_kexec at ffffffffb6722592
 #2 [ffff9d1f9a0db780] crash_kexec at ffffffffb6722680
 #3 [ffff9d1f9a0db798] oops_end at ffffffffb6d85798
 #4 [ffff9d1f9a0db7c0] no_context at ffffffffb6675bb4
 #5 [ffff9d1f9a0db810] __bad_area_nosemaphore at ffffffffb6675e82
 #6 [ffff9d1f9a0db860] bad_area_nosemaphore at ffffffffb6675fa4
 #7 [ffff9d1f9a0db870] __do_page_fault at ffffffffb6d88750
 #8 [ffff9d1f9a0db8e0] do_page_fault at ffffffffb6d88975
 #9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]   //caq: exception address
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]
#12 [ffff9d1f9a0dbb18] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
#13 [ffff9d1f9a0dbb48] mlx5_core_access_reg at ffffffffc03ee354 [mlx5_core]
#14 [ffff9d1f9a0dbba0] mlx5_query_port_ptys at ffffffffc03ee411 [mlx5_core]
#15 [ffff9d1f9a0dbc10] mlx5e_get_link_ksettings at ffffffffc0413035 [mlx5_core]
#16 [ffff9d1f9a0dbce8] __ethtool_get_link_ksettings at ffffffffb6c56d06
#17 [ffff9d1f9a0dbd48] speed_show at ffffffffb6c705b8
#18 [ffff9d1f9a0dbdd8] dev_attr_show at ffffffffb6ab1643
#19 [ffff9d1f9a0dbdf8] sysfs_kf_seq_show at ffffffffb68d709f
#20 [ffff9d1f9a0dbe18] kernfs_seq_show at ffffffffb68d57d6
#21 [ffff9d1f9a0dbe28] seq_read at ffffffffb6872a30
#22 [ffff9d1f9a0dbe98] kernfs_fop_read at ffffffffb68d6125
#23 [ffff9d1f9a0dbed8] vfs_read at ffffffffb684a8ff
#24 [ffff9d1f9a0dbf08] sys_read at ffffffffb684b7bf
#25 [ffff9d1f9a0dbf50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00000000004a5030  RSP: 000000c001099378  RFLAGS: 00000212
    RAX: 0000000000000000  RBX: 000000c000040000  RCX: ffffffffffffffff
    RDX: 000000000000000a  RSI: 000000c00109976e  RDI: 000000000000000d   ----- fd of the file being read
    RBP: 000000c001099640   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000c
    R13: 0000000000000032  R14: 0000000000f710c4  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

From the stack, a process reading a file triggered a NULL pointer dereference in kernel mode.

II. Fault analysis

From the stack information:

1. The process is reading a file whose fd is 13, as shown by the RDI value (0xd).

2. speed_show and __ethtool_get_link_ksettings show that the file being read is the speed attribute of a NIC. The list of open files tells us which one:

crash> files 23283
PID: 23283  TASK: ffff9d1fbb090000  CPU: 0   COMMAND: "spider-agent"
ROOT: /rootfs    CWD: /rootfs/home/service/app/spider
 FD       FILE            DENTRY           INODE       TYPE PATH
....
  9 ffff9d0f5709b200 ffff9d1facc80a80 ffff9d1069a194d0 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/net/p1p1/speed   ----- also open
 10 ffff9d0f4a45a400 ffff9d0f9982e240 ffff9d0fb7b873a0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/p3p1/speed   ----- note the mapping: 0000:5e:00.0 corresponds to p3p1
 11 ffff9d0f57098f00 ffff9d1facc80240 ffff9d1069a1b530 REG  /rootfs/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.1/net/p1p2/speed   ----- also open
 13 ffff9d0f4a458a00 ffff9d0f9982e0c0 ffff9d0fb7b875f0 REG  /rootfs/sys/devices/pci0000:5d/0000:5d:00.0/0000:5e:00.1/net/p3p2/speed   ----- note the mapping: 0000:5e:00.1 corresponds to p3p2
...

Note the mapping between the PCI address and the NIC name above; it will be used later. Reading a speed file is by itself an ordinary operation.

3. The exception RIP is dma_pool_alloc+427, which is where the NULL pointer dereference was triggered. Expanding the stack:

 #9 [ffff9d1f9a0db910] page_fault at ffffffffb6d84778
    [exception RIP: dma_pool_alloc+427]
    RIP: ffffffffb680efab  RSP: ffff9d1f9a0db9c8  RFLAGS: 00010046
    RAX: 0000000000000246  RBX: ffff9d0fa45f4c80  RCX: 0000000000001000
    RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff9d0fa45f4c10
    RBP: ffff9d1f9a0dba20   R8: 000000000001f080   R9: ffff9d00ffc07c00
    R10: ffffffffc03e10c4  R11: ffffffffb67dd6fd  R12: 00000000000080d0
    R13: ffff9d0fa45f4c10  R14: ffff9d0fa45f4c00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    ffff9d1f9a0db918: 0000000000000000 ffff9d0fa45f4c00
    ffff9d1f9a0db928: ffff9d0fa45f4c10 00000000000080d0
    ffff9d1f9a0db938: ffff9d1f9a0dba20 ffff9d0fa45f4c80
    ffff9d1f9a0db948: ffffffffb67dd6fd ffffffffc03e10c4
    ffff9d1f9a0db958: ffff9d00ffc07c00 000000000001f080
    ffff9d1f9a0db968: 0000000000000246 0000000000001000
    ffff9d1f9a0db978: 0000000000000000 0000000000000246
    ffff9d1f9a0db988: ffff9d0fa45f4c10 ffffffffffffffff
    ffff9d1f9a0db998: ffffffffb680efab 0000000000000010
    ffff9d1f9a0db9a8: 0000000000010046 ffff9d1f9a0db9c8
    ffff9d1f9a0db9b8: 0000000000000018 ffffffffb680ee45
    ffff9d1f9a0db9c8: ffff9d0faf9fec40 0000000000000000
    ffff9d1f9a0db9d8: ffff9d0faf9fec48 ffffffffb682669c
    ffff9d1f9a0db9e8: ffff9d00ffc07c00 00000000618746c1
    ffff9d1f9a0db9f8: 0000000000000000 0000000000000000
    ffff9d1f9a0dba08: ffff9d0faf9fec40 0000000000000000
    ffff9d1f9a0dba18: ffff9d0fa3c800c0 ffff9d1f9a0dba70
    ffff9d1f9a0dba28: ffffffffc03e10e3
#10 [ffff9d1f9a0dba28] mlx5_alloc_cmd_msg at ffffffffc03e10e3 [mlx5_core]
    ffff9d1f9a0dba30: ffff9d0f4eebee00 0000000000000001
    ffff9d1f9a0dba40: 000000d0000080d0 0000000000000050
    ffff9d1f9a0dba50: ffff9d0fa3c800c0 0000000000000005   ----- r12 is rdi, ffff9d0fa3c800c0
    ffff9d1f9a0dba60: ffff9d0fa3c803e0 ffff9d1f9d87ccc0
    ffff9d1f9a0dba70: ffff9d1f9a0dbb10 ffffffffc03e3c92
#11 [ffff9d1f9a0dba78] cmd_exec at ffffffffc03e3c92 [mlx5_core]

From the stack, the corresponding mlx5_core_dev is ffff9d0fa3c800c0.

crash> mlx5_core_dev.cmd ffff9d0fa3c800c0 -xo
struct mlx5_core_dev {
  [ffff9d0fa3c80138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff9d0fa3c80138
  pool = 0xffff9d0fa45f4c00

The faulting source line is:

crash> dis -l dma_pool_alloc+427 -B 5
/usr/src/debug/kernel-3.10.0-1062.18.1.el7/linux-3.10.0-1062.18.1.el7.x86_64/mm/dmapool.c: 334
0xffffffffb680efab <dma_pool_alloc+427>:        mov    (%r15),%ecx

305 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
306                      dma_addr_t *handle)
307 {
...
315         spin_lock_irqsave(&pool->lock, flags);
316         list_for_each_entry(page, &pool->page_list, page_list) {
317                 if (page->offset < pool->allocation)   //caq: condition is met here
318                         goto ready;                    //caq: jump to ready
319         }
320
321         /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */
322         spin_unlock_irqrestore(&pool->lock, flags);
323
324         page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO));
325         if (!page)
326                 return NULL;
327
328         spin_lock_irqsave(&pool->lock, flags);
329
330         list_add(&page->page_list, &pool->page_list);
331  ready:
332         page->in_use++;
333         offset = page->offset;
334         page->offset = *(int *)(page->vaddr + offset);   //caq: the faulting line
...
}

From the code above, the NULL dereference means that page->vaddr is NULL and offset is 0. The page can come from two places:

The first is the pool's page_list.

The second is a fresh allocation from pool_alloc_page, which is then added to the pool's page_list.
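To see why a NULL vaddr with offset 0 faults exactly on that line, here is a minimal user-space model of the free list that a dma_pool keeps inside each page. It is only a sketch under simplifying assumptions: model_page, model_init_page and model_alloc are hypothetical names, and locking, DMA mapping and boundary handling are left out. Each free block stores the offset of the next free block in its first int, so handing out a block means reading *(int *)(vaddr + offset).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for struct dma_page. */
struct model_page {
	char *vaddr;          /* backing memory of the page */
	unsigned int in_use;  /* number of blocks handed out */
	unsigned int offset;  /* offset of the first free block */
};

/* Chain the free blocks: the first int of each block holds the next free offset. */
static void model_init_page(struct model_page *page, char *mem,
			    unsigned int allocation, unsigned int size)
{
	unsigned int offset = 0;

	assert(size >= sizeof(int) && allocation >= size);
	page->vaddr = mem;
	page->in_use = 0;
	page->offset = 0;
	while (offset + size <= allocation) {
		unsigned int next = offset + size;

		if (next + size > allocation)
			next = allocation;  /* marks "no free block left" */
		memcpy(mem + offset, &next, sizeof(next));
		offset += size;
	}
}

/* Models the tail of dma_pool_alloc(); returns NULL where the kernel would oops. */
static void *model_alloc(struct model_page *page, unsigned int allocation)
{
	unsigned int offset, next;

	if (page->offset >= allocation)
		return NULL;            /* page is full */
	page->in_use++;                 /* incremented before the dereference, as in the kernel */
	if (page->vaddr == NULL)
		return NULL;            /* the kernel has no such check and faults here */
	offset = page->offset;
	memcpy(&next, page->vaddr + offset, sizeof(next));  /* *(int *)(vaddr + offset) */
	page->offset = next;
	return page->vaddr + offset;
}
```

On a healthy page the first call returns the block at offset 0 and advances offset to the next block. On a stale page with vaddr = NULL and offset = 0 the read targets address 0, which matches R15 = 0 in the faulting instruction mov (%r15),%ecx.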

Let's take a look at this page_list.

crash> dma_pool ffff9d0fa45f4c00 -x
struct dma_pool {
  page_list = {
    next = 0xffff9d0fa45f4c80, 
    prev = 0xffff9d0fa45f4c00
  }, 
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x1
          }
        }
      }
    }
  }, 
  size = 0x400, 
  dev = 0xffff9d1fbddec098, 
  allocation = 0x1000, 
  boundary = 0x1000, 
  name = "mlx5_cmd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
  pools = {
    next = 0xdead000000000100, 
    prev = 0xdead000000000200
  }
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
  offset = 0
  vaddr = 0x0
  offset = 0
  vaddr = 0x0

Per the dma_pool_alloc code, pool->page_list is not empty and the first page satisfies the check page->offset < pool->allocation (0 < 0x1000), so the page used is the first one on the list, ffff9d0fa45f4c80. In other words, it comes from the first source:

crash> dma_page ffff9d0fa45f4c80
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f4d00, 
    prev = 0xffff9d0fa45f4c80
  }, 
  vaddr = 0x0,   //caq: this is abnormal; dereferencing it causes the crash
  dma = 0, 
  in_use = 1, 
  offset = 0
}

A dma_page in a dma_pool always has its vaddr initialized when it is allocated, in pool_alloc_page, so it should never be NULL. Next, check this address:

crash> kmem ffff9d0fa45f4c80   ------- this is the page in the dma_pool
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128              128       8963     14976    234     8k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4c80

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe299c0917d00 10245f4000                0 ffff9d0fa45f4c00  1 2fffff00004080 slab,head

Having used similar DMA functions before, we recalled that a dma_page is not this large. Look at the second dma_page:

crash> kmem ffff9d0fa45f4d00
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff9d00ffc07900 kmalloc-128              128       8963     14976    234     8k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffe299c0917d00  ffff9d0fa45f4000     0     64         29    35
  FREE / [ALLOCATED]
   ffff9d0fa45f4d00

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffe299c0917d00 10245f4000                0 ffff9d0fa45f4c00  1 2fffff00004080 slab,head

crash> dma_page ffff9d0fa45f4d00
struct dma_page {
  page_list = {
    next = 0xffff9d0fa45f5000, 
    prev = 0xffff9d0fa45f4d00
  }, 
  vaddr = 0x0,   -----------caq: also NULL
  dma = 0, 
  in_use = 0, 
  offset = 0
}

crash> list dma_pool.page_list -H 0xffff9d0fa45f4c00 -s dma_page.offset,vaddr
ffff9d0fa45f4c80
  offset = 0
  vaddr = 0x0
ffff9d0fa45f4d00
  offset = 0
  vaddr = 0x0
ffff9d0fa45f5000
  offset = 0
  vaddr = 0x0
.........

Every dma_page on the list looks the same. Now check the actual size of struct dma_page:

crash> p sizeof(struct dma_page)
$3 = 40

A 40-byte object allocated from the slab allocator is rounded up to 64 bytes, so how can the dma_pages above sit in 128-byte objects? To solve this puzzle, we compare against a healthy node:
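The rounding can be checked with a small helper. This is a sketch rather than kernel code: kmalloc_class is a hypothetical name, and the size table assumes the usual general-purpose kmalloc caches on x86_64 (8, 16, 32, 64, 96, 128, 192, 256 and so on), which can differ with other kernel configurations.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed general-purpose kmalloc cache sizes (typical x86_64 defaults). */
static const size_t kmalloc_sizes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the object size of the smallest cache that fits the request, or 0. */
static size_t kmalloc_class(size_t size)
{
	size_t i;

	assert(size > 0);
	for (i = 0; i < sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]); i++)
		if (size <= kmalloc_sizes[i])
			return kmalloc_sizes[i];
	return 0;  /* larger requests are not modelled here */
}
```

With this table a 40-byte request lands in the 64-byte class, which is what makes a dma_page reported in kmalloc-128 suspicious.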

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff8f9e800be000  lo
ffff8f9e62640000  p1p1
ffff8f9e626c0000  p1p2
ffff8f9e627c0000  p3p1   -----//caq: take this one as an example
ffff8f9e62100000  p3p2

static int mlx5e_get_link_ksettings(struct net_device *netdev,
                                    struct ethtool_link_ksettings *link_ksettings)
{
...
        struct mlx5e_priv *priv = netdev_priv(netdev);
...
}

static inline void *netdev_priv(const struct net_device *dev)
{
        return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}

crash> px sizeof(struct net_device)
$2 = 0x8c0
crash> mlx5e_priv.mdev ffff8f9e627c08c0
  mdev = 0xffff8f9e67c400c0
crash> mlx5_core_dev.cmd 0xffff8f9e67c400c0 -xo
struct mlx5_core_dev {
  [ffff8f9e67c40138] struct mlx5_cmd cmd;
}
crash> mlx5_cmd.pool ffff8f9e67c40138
  pool = 0xffff8f9e7bf48f80
crash> dma_pool 0xffff8f9e7bf48f80
struct dma_pool {
  page_list = {
    next = 0xffff8f9e79c60880,   //caq: one of the dma_pages
    prev = 0xffff8fae6e4db800
  }, 
.......
  size = 1024, 
  dev = 0xffff8f9e800b3098, 
  allocation = 4096, 
  boundary = 4096, 
  name = "mlx5_cmd\000\217\364{\236\217\377\377\300\217\364{\236\217\377\377\200\234\250\217\217\377\377", 
  pools = {
    next = 0xffff8f9e800b3290, 
    prev = 0xffff8f9e800b3290
  }
}
crash> dma_page 0xffff8f9e79c60880
struct dma_page {
  page_list = {
    next = 0xffff8f9e79c60840,   ------- one of the dma_pages
    prev = 0xffff8f9e7bf48f80
  }, 
  vaddr = 0xffff8f9e6fc9b000, 
  dma = 69521223680, 
  in_use = 0, 
  offset = 0
}
crash> kmem 0xffff8f9e79c60880
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff8f8fbfc07b00 kmalloc-64                64        ...       ...    ...     4k
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffde5140e71800  ffff8f9e79c60000     0     64         64     0
  FREE / [ALLOCATED]
  [ffff8f9e79c60880]

      PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffde5140e71800 1039c60000                0        0  1 2fffff00000080 slab
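The address handed to mlx5e_priv.mdev in the session above follows directly from netdev_priv: the driver's private area starts at the net_device address plus the aligned size of struct net_device. The arithmetic can be sketched as follows; align_up and netdev_priv_addr are hypothetical helper names, while NETDEV_ALIGN is the kernel's value of 32.

```c
#include <assert.h>
#include <stdint.h>

#define NETDEV_ALIGN 32u  /* value used by the kernel's netdev_priv() */

/* Round x up to a multiple of a, where a is a power of two (like ALIGN()). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Address of the driver private area that follows struct net_device. */
static uint64_t netdev_priv_addr(uint64_t netdev, uint64_t sizeof_net_device)
{
	return netdev + align_up(sizeof_net_device, NETDEV_ALIGN);
}
```

For p3p1 on the healthy node, netdev_priv_addr(0xffff8f9e627c0000, 0x8c0) gives 0xffff8f9e627c08c0, the mlx5e_priv address used above; 0x8c0 is already a multiple of 32, so the alignment adds nothing here.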

The operations above require some familiarity with net_device and the mlx5 driver code. Compared with the abnormal dma_page, a normal dma_page lives in a 64-byte slab object, so this is clearly either a memory corruption (stomping) problem or a UAF (use-after-free) problem. How can we quickly tell which one it is? Both kinds of problem involve corrupted memory and are generally hard to track down, so at this point we need to step back. We first looked at the other running processes and found this one:

crash> bt 48263
PID: 48263  TASK: ffff9d0f4ee0a0e0  CPU: 56  COMMAND: "reboot"
 #0 [ffff9d0f95d7f958] __schedule at ffffffffb6d80d4a
 #1 [ffff9d0f95d7f9e8] schedule at ffffffffb6d811f9
 #2 [ffff9d0f95d7f9f8] schedule_timeout at ffffffffb6d7ec48
 #3 [ffff9d0f95d7faa8] wait_for_completion_timeout at ffffffffb6d81ae5
 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]
 #7 [ffff9d0f95d7fc40] mlx5_mr_cache_cleanup at ffffffffc0c60aab [mlx5_ib]
 #8 [ffff9d0f95d7fca8] mlx5_ib_stage_pre_ib_reg_umr_cleanup at ffffffffc0c45d32 [mlx5_ib]
 #9 [ffff9d0f95d7fcc0] __mlx5_ib_remove at ffffffffc0c4f450 [mlx5_ib]
#10 [ffff9d0f95d7fce8] mlx5_ib_remove at ffffffffc0c4f4aa [mlx5_ib]
#11 [ffff9d0f95d7fd00] mlx5_detach_device at ffffffffc03fe231 [mlx5_core]
#12 [ffff9d0f95d7fd30] mlx5_unload_one at ffffffffc03dee90 [mlx5_core]
#13 [ffff9d0f95d7fd60] shutdown at ffffffffc03def80 [mlx5_core]
#14 [ffff9d0f95d7fd80] pci_device_shutdown at ffffffffb69d1cda
#15 [ffff9d0f95d7fda8] device_shutdown at ffffffffb6ab3beb
#16 [ffff9d0f95d7fdd8] kernel_restart_prepare at ffffffffb66b7916
#17 [ffff9d0f95d7fde8] kernel_restart at ffffffffb66b7932
#18 [ffff9d0f95d7fe00] SYSC_reboot at ffffffffb66b7ba9
#19 [ffff9d0f95d7ff40] sys_reboot at ffffffffb66b7c4e
#20 [ffff9d0f95d7ff50] system_call_fastpath at ffffffffb6d8dede
    RIP: 00007fc9be7a5226  RSP: 00007ffd9a19e448  RFLAGS: 00010246
    RAX: 00000000000000a9  RBX: 0000000000000004  RCX: 0000000000000000
    RDX: 0000000001234567  RSI: 0000000028121969  RDI: fffffffffee1dead
    RBP: 0000000000000002   R8: 00005575d529558c   R9: 0000000000000000
    R10: 00007fc9bea767b8  R11: 0000000000000206  R12: 0000000000000000
    R13: 00007ffd9a19e690  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b

The reason for focusing on this process is experience: over the years we have fixed at least 20 UAF problems caused by module teardown, sometimes on reboot, sometimes on unload, sometimes when resources are released from a work item. So intuition says the crash is closely related to this teardown. Let's analyze how far the reboot process has got.

void device_shutdown(void)
{
        struct device *dev, *parent;

        spin_lock(&devices_kset->list_lock);
        /*
         * Walk the devices list backward, shutting down each in turn.
         * Beware that device unplug events may also start pulling
         * devices offline, even as the system is shutting down.
         */
        while (!list_empty(&devices_kset->list)) {
                dev = list_entry(devices_kset->list.prev, struct device,
                                kobj.entry);
                ...
                if (dev->device_rh && dev->device_rh->class_shutdown_pre) {
                        if (initcall_debug)
                                dev_info(dev, "shutdown_pre\n");
                        dev->device_rh->class_shutdown_pre(dev);
                }
                if (dev->bus && dev->bus->shutdown) {
                        if (initcall_debug)
                                dev_info(dev, "shutdown\n");
                        dev->bus->shutdown(dev);
                } else if (dev->driver && dev->driver->shutdown) {
                        if (initcall_debug)
                                dev_info(dev, "shutdown\n");
                        dev->driver->shutdown(dev);
                }
                ...
        }
        ...
}

The above code shows the following two things:

1. Each device is linked into devices_kset->list through its kobj.entry member.

2. device_shutdown shuts the devices down one at a time, serially.
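These two points can be illustrated with a small user-space model. It is a sketch under stated assumptions: mini_device and shutdown_all are hypothetical names, the list helpers are simplified re-implementations, and recording a name stands in for calling bus->shutdown(dev). The model removes each device from the list before "shutting it down", which is the behaviour that makes a device disappear from the list while its shutdown is still in progress.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical miniature of struct device: only a name and the list link. */
struct mini_device {
	const char *name;
	struct list_head entry;
};

/* Recover the containing structure from a pointer to its list member. */
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_del_init(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = node;
}

/*
 * Shut devices down one by one, newest first, by always taking the tail
 * of the list. The order is recorded in out[]; returns the device count.
 */
static size_t shutdown_all(struct list_head *head, const char **out, size_t max)
{
	size_t n = 0;

	assert(head != NULL && out != NULL);
	while (head->prev != head && n < max) {
		struct mini_device *dev =
			list_entry(head->prev, struct mini_device, entry);

		list_del_init(&dev->entry);  /* off the list from here on */
		out[n++] = dev->name;        /* stands in for bus->shutdown(dev) */
	}
	return n;
}
```

If 0000:5e:00.0 is registered before 0000:5e:00.1, the model shuts down 0000:5e:00.1 first and 0000:5e:00.0 second, and neither is on the list once its turn has started.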

From the reboot stack, the teardown flow for an mlx5 device is:

pci_device_shutdown-->shutdown-->mlx5_unload_one-->mlx5_detach_device-->mlx5_cmd_cleanup-->dma_pool_destroy

Following the flow through mlx5_detach_device and on to dma_pool_destroy, the relevant code is:

void dma_pool_destroy(struct dma_pool *pool)
{
.......
        while (!list_empty(&pool->page_list)) {
                struct dma_page *page;
                page = list_entry(pool->page_list.next,
                                  struct dma_page, page_list);
                if (is_page_busy(page)) {
.......
                        list_del(&page->page_list);
                        kfree(page);
                } else
                        pool_free_page(pool, page);
        }

        kfree(pool);   //caq: releases the pool
.......
}

static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
{
        dma_addr_t dma = page->dma;

#ifdef  DMAPOOL_DEBUG
        memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
#endif
        dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
        list_del(&page->page_list);   //caq: list_del poisons the page_list member
        kfree(page);
}
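The poisoning mentioned in the comment is worth a closer look, because the crashed dma_pool dump earlier shows pools.next = 0xdead000000000100 and pools.prev = 0xdead000000000200. Those are the values list_del leaves behind on x86_64, which suggests the pool had already been unlinked by the time it was accessed. The sketch below models this in user space; the list helpers are simplified re-implementations and list_is_poisoned is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Poison values as on x86_64 kernels (0x100 / 0x200 + 0xdead000000000000). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000200ULL)

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlink the node and poison its pointers, as the kernel's list_del() does. */
static void list_del(struct list_head *node)
{
	assert(node->next != LIST_POISON1 && node->prev != LIST_POISON2);
	node->next->prev = node->prev;
	node->prev->next = node->next;
	node->next = LIST_POISON1;
	node->prev = LIST_POISON2;
}

/* True if the node carries the signature left behind by list_del(). */
static int list_is_poisoned(const struct list_head *node)
{
	return node->next == LIST_POISON1 && node->prev == LIST_POISON2;
}
```

When reading a vmcore, spotting this pair of values in a list member is a quick hint that the object was removed from its list and is probably being used after teardown.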

Extract the corresponding mlx5_core_dev from the reboot stack:

 #4 [ffff9d0f95d7fb08] cmd_exec at ffffffffc03e41c9 [mlx5_core]
    ffff9d0f95d7fb10: ffffffffb735b580 ffff9d0f904caf18
    ffff9d0f95d7fb20: ffff9d00ff801da8 ffff9d0f23121200
    ffff9d0f95d7fb30: ffff9d0f23121740 ffff9d0fa7480138
    ffff9d0f95d7fb40: 0000000000000000 0000001002020000
    ffff9d0f95d7fb50: 0000000000000000 ffff9d0f95d7fbe8
    ffff9d0f95d7fb60: ffff9d0f00000000 0000000000000000
    ffff9d0f95d7fb70: 00000000756415e3 ffff9d0fa74800c0   ----- the mlx5_core_dev, corresponding to p3p1
    ffff9d0f95d7fb80: ffff9d0f95d7fbf8 ffff9d0f95d7fbe8
    ffff9d0f95d7fb90: 0000000000000246 ffff9d0f8f3a20b8
    ffff9d0f95d7fba0: ffff9d0f95d7fbd0 ffffffffc03e442b
 #5 [ffff9d0f95d7fba8] mlx5_cmd_exec at ffffffffc03e442b [mlx5_core]
    ffff9d0f95d7fbb0: 0000000000000000 ffff9d0fa74800c0
    ffff9d0f95d7fbc0: ffff9d0f8f3a20b8 ffff9d0fa74bea00
    ffff9d0f95d7fbd0: ffff9d0f95d7fc38 ffffffffc03f085d
 #6 [ffff9d0f95d7fbd8] mlx5_core_destroy_mkey at ffffffffc03f085d [mlx5_core]

The mlx5_core_dev being torn down by the reboot process is ffff9d0fa74800c0, whose net_device is p3p1, while the mlx5_core_dev that process 23283 is accessing is ffff9d0fa3c800c0, which corresponds to p3p2.

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff9d0fc003e000  lo
ffff9d1fad200000  p1p1
ffff9d0fa0700000  p1p2
ffff9d0fa00c0000  p3p1   ----- its mlx5_core_dev is ffff9d0fa74800c0
ffff9d0fa0200000  p3p2   ----- its mlx5_core_dev is ffff9d0fa3c800c0

Let's take a look at the remaining devices in devices_kset:

crash> p devices_kset
devices_kset = $4 = (struct kset *) 0xffff9d1fbf4e70c0
crash> p devices_kset.list
$5 = {
  next = 0xffffffffb72f2a38, 
  prev = 0xffff9d0fbe0ea130
}
crash> list -h -o 0x18 0xffffffffb72f2a38 -s device.kobj.name > device.list

p3p1 and p3p2 are not in device.list:

# grep 0000:5e:00.0 device.list   //caq: not found; this is p3p1, which the reboot process is currently tearing down
# grep 0000:5e:00.1 device.list   //caq: not found; this is p3p2
# grep 0000:3b:00.0 device.list   //caq: this mlx5 device has not been torn down yet
  kobj.name = 0xffff9d1fbe82aa70 "0000:3b:00.0", 
# grep 0000:3b:00.1 device.list
  kobj.name = 0xffff9d1fbe82aaew "0000:3b:00.1", 

Neither p3p2 nor p3p1 is in device.list. Because pci_device_shutdown tears devices down serially and p3p1 is the one currently being torn down, p3p2 must already have been torn down. Process 23283 is therefore accessing the cmd pool of a device that no longer exists. According to the teardown flow described above, pci_device_shutdown-->shutdown-->mlx5_unload_one-->mlx5_cmd_cleanup-->dma_pool_destroy, the pool has been destroyed and none of its dma_pages is valid.

We then searched for a matching bug and found that Red Hat had run into a very similar problem:

However, in that link Red Hat considers the UAF problem fixed, and the patch it merged is:

commit 4cca96a8d9da0ed8217cfdf2aec0c3c8b88e8911
Author: Parav Pandit
Date:   Thu Dec 12 13:30:21 2019 +0200

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 997cbfe..05b557d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -6725,6 +6725,8 @@ void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
                       const struct mlx5_ib_profile *profile,
                       int stage)
 {
+       dev->ib_active = false;
+
        /* Number of stages to cleanup */
        while (stage) {
                stage--;

This point is important enough to repeat three times: that patch does not fix the bug described here. Consider the following concurrent execution, drawn as a simple diagram:

        CPU1                                        CPU2
                                                dev_attr_show
  pci_device_shutdown                           speed_show
    shutdown
      mlx5_unload_one
        mlx5_detach_device
          mlx5_detach_interface
            mlx5e_detach
              mlx5e_detach_netdev
                mlx5e_nic_disable
                  rtnl_lock
                    mlx5e_close_locked
                      clear_bit(MLX5E_STATE_OPENED, &priv->state);   ----- only this bit is cleared
                  rtnl_unlock
                                                  rtnl_trylock   ----- the lock is taken successfully
                                                  netif_running only checks the lowest bit of net_device.state
                                                    __ethtool_get_link_ksettings
                                                      mlx5e_get_link_ksettings
                                                        mlx5_query_port_ptys()
                                                          mlx5_core_access_reg()
                                                            mlx5_cmd_exec
                                                              cmd_exec
                                                                mlx5_alloc_cmd_msg
        mlx5_cmd_cleanup   ----- cleans up the dma_pool
                                                                  dma_pool_alloc   ----- accesses cmd.pool and triggers the crash

So to really fix this problem, one has to either clear the __LINK_STATE_START bit in netif_device_detach or check the __LINK_STATE_PRESENT bit in speed_show. If the scope of impact is a concern and you would rather not touch the common code path, check __LINK_STATE_PRESENT in mlx5e_get_link_ksettings instead. We leave submitting that to readers who enjoy working with the community.

static void mlx5e_nic_disable(struct mlx5e_priv *priv)
{
.......
        rtnl_lock();
        if (netif_running(priv->netdev))
                mlx5e_close(priv->netdev);
        netif_device_detach(priv->netdev);   //caq: this clears the __LINK_STATE_PRESENT bit
        rtnl_unlock();
.......
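The difference between the two guards can be shown with a small user-space model of the net_device state word. This is a sketch, not a patch: mini_netdev and the may_query_* functions are hypothetical, and only the bit numbers (START = 0, PRESENT = 1) follow the kernel's enum netdev_state_t.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers as in the kernel's enum netdev_state_t. */
enum { LINK_STATE_START = 0, LINK_STATE_PRESENT = 1 };

/* Hypothetical miniature of struct net_device: only the state word. */
struct mini_netdev { unsigned long state; };

static bool mini_running(const struct mini_netdev *dev)
{
	return (dev->state & (1UL << LINK_STATE_START)) != 0;
}

static bool mini_present(const struct mini_netdev *dev)
{
	return (dev->state & (1UL << LINK_STATE_PRESENT)) != 0;
}

/* What netif_device_detach() does to the state word (queue handling omitted). */
static void mini_detach(struct mini_netdev *dev)
{
	assert(dev != 0);
	dev->state &= ~(1UL << LINK_STATE_PRESENT);
}

/* Guard equivalent to a netif_running() check alone: only START is tested. */
static bool may_query_current(const struct mini_netdev *dev)
{
	return mini_running(dev);
}

/* Guard with the additional PRESENT check suggested in the text. */
static bool may_query_proposed(const struct mini_netdev *dev)
{
	return mini_running(dev) && mini_present(dev);
}
```

After mini_detach the START bit is still set, so may_query_current keeps letting the query through, while may_query_proposed refuses it. That is the window the diagram above describes.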

III. Reproducing the fault

1. This is a race condition; to reproduce it, construct a race like the CPU1 and CPU2 scenario above.
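The CPU2 side of such a race could look like the loop below. It is a rough sketch and an assumption on our part, not the tool used in the original investigation: hammer_read is a hypothetical helper that repeatedly opens and reads a file, meant to be pointed at a NIC speed attribute such as /sys/class/net/<ifname>/speed while the device is being shut down on a test machine.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Open and read the given file `rounds` times and return how many reads
 * returned data. Each round reopens the file so that every read goes
 * through the sysfs show path again.
 */
static int hammer_read(const char *path, int rounds)
{
	char buf[64];
	int ok = 0;
	int i;

	assert(path != NULL);
	for (i = 0; i < rounds; i++) {
		FILE *f = fopen(path, "r");

		if (!f)
			continue;       /* attribute missing or device gone */
		if (fgets(buf, sizeof(buf), f))
			ok++;
		fclose(f);
	}
	return ok;
}
```

Running this against the speed file of one port while a reboot or driver unload tears down the mlx5 devices recreates the reader half of the diagram; whether the crash actually triggers depends on timing.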

IV. Workaround or fix

Possible solutions:

1. Do not simply follow the Red Hat solution, since its patch does not close this race.

2. Apply a separate patch.

Author's brief introduction


He is currently responsible for Linux kernel, container, and virtual machine virtualization work on OPPO's hybrid cloud.
