Background: This is an example of a crash encountered on 3.10.0-957.el7.x86_64. Here’s how we can troubleshoot and solve the problem.

I. Fault phenomenon

Oppo Cloud's intelligent monitoring detected that a machine had gone down:

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.el7.x86_64/vmlinux
         ...
       PANIC: "Kernel panic - not syncing: Hard LOCKUP"
         PID: 14
     COMMAND: "migration/1"
        TASK: ffff8f1bf6bb9040  [THREAD_INFO: ffff8f1bf6bc4000]
         CPU: 1
       STATE: TASK_INTERRUPTIBLE (PANIC)

crash> bt
PID: 14     TASK: ffff8f1bf6bb9040  CPU: 1   COMMAND: "migration/1"
 #0 [ffff8f4afbe089f0] machine_kexec at ffffffff83863674
 #1 [ffff8f4afbe08a50] __crash_kexec at ffffffff8391ce12
 #2 [ffff8f4afbe08b20] panic at ffffffff83f5b4db
 #3 [ffff8f4afbe08ba0] nmi_panic at ffffffff8389739f
 #4 [ffff8f4afbe08bb0] watchdog_overflow_callback at ffffffff83949241
 #5 [ffff8f4afbe08bc8] __perf_event_overflow at ffffffff839a1027
 #6 [ffff8f4afbe08c00] perf_event_overflow at ffffffff839aa694
 #7 [ffff8f4afbe08c10] intel_pmu_handle_irq at ffffffff8380a6b0
 #8 [ffff8f4afbe08e38] perf_event_nmi_handler at ffffffff83f6b031
 #9 [ffff8f4afbe08e58] nmi_handle at ffffffff83f6c8fc
#10 [ffff8f4afbe08eb0] do_nmi at ffffffff83f6cbd8
#11 [ffff8f4afbe08ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: native_queued_spin_lock_slowpath+462]
    RIP: ffffffff839121ae  RSP: ffff8f1bf6bc7c50  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: 0000000000000082  RCX: ...
    RDX: 0000000000000101  RSI: 0000000000000001  RDI: ffff8f1afdf55fe8   <-- the lock
    RBP: ffff8f1bf6bc7c50   R8: 0000000000000101   R9: 0000000000000400
    R10: 000000000000499e  R11: 000000000000499f  R12: ffff8f1afdf55fe8
    R13: ffff8f1bf5150000  R14: ffff8f1afdf5b488  R15: ffff8f1bf5187818
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8f1bf6bc7c50] native_queued_spin_lock_slowpath at ffffffff839121ae
#13 [ffff8f1bf6bc7c58] queued_spin_lock_slowpath at ffffffff83f5bf4b
#14 [ffff8f1bf6bc7c68] _raw_spin_lock_irqsave at ffffffff83f6a487
#15 [ffff8f1bf6bc7c80] cpu_stop_queue_work at ffffffff8392fc70
#16 [ffff8f1bf6bc7cb0] stop_one_cpu_nowait at ffffffff83930450
#17 [ffff8f1bf6bc7cc0] load_balance at ffffffff838e4c6e
#18 [ffff8f1bf6bc7da8] idle_balance at ffffffff838e5451
#19 [ffff8f1bf6bc7e00] __schedule at ffffffff83f67b14
#20 [ffff8f1bf6bc7e88] schedule at ffffffff83f67bc9
#21 [ffff8f1bf6bc7e98] smpboot_thread_fn at ffffffff838ca562
#22 [ffff8f1bf6bc7ec8] kthread at ffffffff838c1c31
#23 [ffff8f1bf6bc7f50] ret_from_fork_nospec_begin at ffffffff83f74c1d
crash>

II. Fault analysis

From the stack, the “migration/1” process is trying to acquire a spinlock. Since _raw_spin_lock_irqsave calls arch_local_irq_disable, the usual interrupt-disabling routine, before taking the lock, it is spinning with interrupts off, which is exactly what trips the hard-lockup detector. So let's find out who holds the lock this process is waiting for.

On x86, when native_queued_spin_lock_slowpath is entered, its first argument, and therefore the RDI register, holds the address of the lock. In the exception frame above RDI is ffff8f1afdf55fe8, so let's dump that lock:

crash> arch_spinlock_t ffff8f1afdf55fe8
struct arch_spinlock_t {
  val = {
    counter = 257
  }
}
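
The counter value 257 is 0x101. Assuming the standard qspinlock bit layout (locked byte in bits 0-7, pending bit at bit 8, MCS tail from bit 16 up; treat the exact layout as an assumption, but it matches the queued-spinlock slowpath seen in the stack), this decodes to "lock held, one more waiter pending". A minimal decoding sketch:

#include <stdio.h>

/* Assumed standard qspinlock layout: bits 0-7 locked byte,
 * bit 8 the pending bit, bits 16+ the MCS wait-queue tail.  */
#define Q_LOCKED_MASK   0x000000ffU
#define Q_PENDING_MASK  0x00000100U
#define Q_TAIL_MASK     0xffff0000U

int main(void)
{
        unsigned int val = 257;  /* counter value from the crash dump, 0x101 */

        printf("locked  = %u\n", val & Q_LOCKED_MASK);         /* 1: lock is held           */
        printf("pending = %u\n", !!(val & Q_PENDING_MASK));    /* 1: one waiter is spinning */
        printf("tail    = 0x%x\n", (val & Q_TAIL_MASK) >> 16); /* 0: no queued MCS waiters  */
        return 0;
}

Presumably “migration/1” is the pending waiter here; any further CPUs would have queued up via the tail field.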

Now we need to work out what kind of lock this is. The call chain on the stack is idle_balance -> load_balance -> stop_one_cpu_nowait -> cpu_stop_queue_work:

crash> dis -l ffffffff8392fc70
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/stop_machine.c: 91
0xffffffff8392fc70 <cpu_stop_queue_work+48>:    cmpb   $0x0,0xc(%rbx)

 85 static void cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 86 {
 87         struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 88         unsigned long flags;
 89
 90         spin_lock_irqsave(&stopper->lock, flags);
 91         if (stopper->enabled)
 92                 __cpu_stop_queue_work(stopper, work);
 93         else
 94                 cpu_stop_signal_done(work->done, false);
 95         spin_unlock_irqrestore(&stopper->lock, flags);
 96 }

So the lock lives inside a per-cpu cpu_stopper structure, and we need the CPU number it belongs to. That CPU number comes from load_balance, which finds the busiest rq and passes cpu_of(busiest) into stop_one_cpu_nowait:

6545 static int load_balance(int this_cpu, struct rq *this_rq,
6546                         struct sched_domain *sd, enum cpu_idle_type idle,
6547                         int *should_balance)
6548 {
....
6735                 if (active_balance) {
6736                         stop_one_cpu_nowait(cpu_of(busiest),
6737                                 active_load_balance_cpu_stop, busiest,
6738                                 &busiest->active_balance_work);
6739                 }
...
6781 }

crash> dis -l load_balance | grep stop_one_cpu_nowait -B 6
0xffffffff838e4c4d <load_balance+2045>: callq  0xffffffff83f6a0e0 <_raw_spin_unlock_irqrestore>
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/sched/fair.c: 6736
0xffffffff838e4c52 <load_balance+2050>: mov    0x930(%rbx),%edi      <-- %rbx is the busiest rq; this loads its cpu field
0xffffffff838e4c58 <load_balance+2056>: lea    0x908(%rbx),%rcx
0xffffffff838e4c5f <load_balance+2063>: mov    %rbx,%rdx
0xffffffff838e4c62 <load_balance+2066>: mov    $0xffffffff838de690,%rsi
0xffffffff838e4c69 <load_balance+2073>: callq  0xffffffff83930420 <stop_one_cpu_nowait>
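
As an aside, `mov 0x930(%rbx),%edi` is nothing more than reading a struct member by raw byte offset: %rbx holds the busiest rq and 0x930 is the offset of its cpu field in this particular build. A toy, purely illustrative C equivalent (struct fake_rq and its padding are made up; real offsets depend on the kernel config and debuginfo):

#include <stddef.h>
#include <stdio.h>

struct fake_rq {                /* stand-in for struct rq (hypothetical layout) */
        char pad[0x930];        /* everything that precedes 'cpu' in this build */
        int  cpu;               /* the member the disassembly loads             */
};

int main(void)
{
        struct fake_rq rq = { .cpu = 26 };
        const char *base = (const char *)&rq;

        /* what "mov 0x930(%rbx),%edi" does: a 32-bit load at base + 0x930 */
        int cpu = *(const int *)(base + offsetof(struct fake_rq, cpu));

        printf("cpu = %d (offset 0x%zx)\n", cpu, offsetof(struct fake_rq, cpu));
        return 0;
}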

Then we fetch the busiest rq's address from the stack and read its cpu member:

crash> rq.cpu ffff8f1afdf5ab80
  cpu = 26

So “migration/1” is waiting for the lock of CPU 26's per-cpu cpu_stopper.

Next we search the other processes' stacks for this lock address and find the following:

ffff8f4957fbfab0: ffff8f1afdf55fe8      <-- the lock address, found on this stack

crash> kmem ffff8f4957fbfab0
    PID: 355608
COMMAND: "custom_exporter"
   TASK: ffff8f4aea3a8000  [THREAD_INFO: ffff8f4957fbc000]
    CPU: 26                             <-- custom_exporter happens to be running on CPU 26
  STATE: TASK_RUNNING (ACTIVE)

Next, we need to analyze why the custom_exporter process on CPU 26 has been holding ffff8f1afdf55fe8 for so long.

Let’s analyze the stack of CPU 26:

crash> bt -f 355608
PID: 355608  TASK: ffff8f4aea3a8000  CPU: 26  COMMAND: "custom_exporter"
.....
 #3 [ffff8f1afdf48ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: try_to_wake_up+114]
    RIP: ffffffff838d63d2  RSP: ffff8f4957fbfa30  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8f1bf6bb9844  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff8f1bf6bb9844
    RBP: ffff8f4957fbfa70   R8: ffff8f4afbe15ff0   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff8f1bf6bb9040  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff8f4957fbfa30] try_to_wake_up at ffffffff838d63d2
    ffff8f4957fbfa38: 000000000001ab80 0000000000000086
    ffff8f4957fbfa48: ffff8f4afbe15fe0 ffff8f4957fbfb48
    ffff8f4957fbfa58: 0000000000000001 ffff8f4afbe15fe0
    ffff8f4957fbfa68: ffff8f1afdf55fe0 ffff8f4957fbfa80
    ffff8f4957fbfa78: ffffffff838d6705
 #5 [ffff8f4957fbfa78] wake_up_process at ffffffff838d6705
    ffff8f4957fbfa80: ffff8f4957fbfa98 ffffffff8392fc05
 #6 [ffff8f4957fbfa88] __cpu_stop_queue_work at ffffffff8392fc05
    ffff8f4957fbfa90: 000000000000001a ffff8f4957fbfbb0
    ffff8f4957fbfaa0: ffffffff8393037a
 #7 [ffff8f4957fbfaa0] stop_two_cpus at ffffffff8393037a
.....
    ffff8f4957fbfbb8: ffffffff838d3867
 #8 [ffff8f4957fbfbb8] migrate_swap at ffffffff838d3867
    ffff8f4957fbfbc0: ffff8f4aea3a8000 ffff8f1ae77dc100   <-- migration_swap_arg on the stack
    ffff8f4957fbfbd0: 000000010000001a 0000000080490f7c
    ffff8f4957fbfbe0: ffff8f4aea3a8000 ffff8f4957fbfc30
    ffff8f4957fbfbf0: 0000000000000076 0000000000000076
    ffff8f4957fbfc00: 0000000000000371 ffff8f4957fbfce8
    ffff8f4957fbfc10: ffffffff838dd0ba
 #9 [ffff8f4957fbfc10] task_numa_migrate at ffffffff838dd0ba
    ffff8f4957fbfc18: ffff8f1afc121f40 000000000000001a
    ffff8f4957fbfc28: ...                                 <-- ffff8f4957fbfc30 is the address of task_numa_env on the stack
    ffff8f4957fbfc38: 000000000000001a 000000010000003f
    ffff8f4957fbfc48: 000000000000000b 000000000000022c
    ffff8f4957fbfc58: 00000000000049a0 0000000000000012
    ffff8f4957fbfc68: 0000000000000001 0000000000000003
    ffff8f4957fbfc78: 000000000000006f 000000000000499f
    ffff8f4957fbfc88: 0000000000000012 0000000000000001
    ffff8f4957fbfc98: 0000000000000070 ffff8f1ae77dc100
    ffff8f4957fbfca8: 00000000000002fb 0000000000000001
    ffff8f4957fbfcb8: ffff8f4957fbfcc8 ...
    ffff8f4957fbfcc8: 0000000000017a48 0000000000001818
    ffff8f4957fbfcd8: 0000000000000018 ffff8f4957fbfe20
    ffff8f4957fbfce8: ffff8f4957fbfcf8 ffffffff838dd4d3
#10 [ffff8f4957fbfcf0] numa_migrate_preferred at ffffffff838dd4d3
    ffff8f4957fbfcf8: ffff8f4957fbfd88 ffffffff838df5b0
.....
crash>

Overall, CPU 26 is performing NUMA balancing as well. To briefly expand on the NUMA balancing path, it starts in the task_tick_fair function:

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }

        if (numabalancing_enabled)      /* if NUMA balancing is enabled, call task_tick_numa */
                task_tick_numa(rq, curr);

        update_rq_runnable_avg(rq, 1);
}

When the current process is due for a NUMA scan, task_tick_numa queues the balancing work. That work calls change_prot_numa, which rewrites the PTEs mapping the scanned VMAs to PAGE_NONE, so the next access to those pages raises a page fault. In handle_pte_fault, the NUMA fault handler then uses that fault as a hint to pick a better node. We will not expand on the details here.
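
The protect-then-fault trick is easy to demonstrate outside the kernel. The sketch below is only a userspace analogue (mprotect plus a SIGSEGV handler standing in for change_prot_numa and the NUMA fault handling in handle_pte_fault); it is not kernel code, just an illustration of why making PTEs inaccessible turns the next memory access into a "hint":

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static size_t page_size;

/* Stand-in for the kernel's NUMA hinting fault handling: note that a
 * fault happened, then restore access so the faulting access retries. */
static void hint_fault(int sig, siginfo_t *si, void *ctx)
{
        (void)sig; (void)si; (void)ctx;
        ssize_t n = write(STDOUT_FILENO, "hinting fault taken\n", 20);
        (void)n;
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = hint_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page_size = (size_t)sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        page[0] = 1;                           /* ordinary access               */
        mprotect(page, page_size, PROT_NONE);  /* analogue of change_prot_numa() */
        page[0] = 2;                           /* faults; handler restores prot */

        printf("value after fault: %d\n", page[0]);
        return 0;
}

In the kernel, the fault handler records which node actually touched the page and may migrate the page, or, as on CPU 26 here, decide to swap two tasks across nodes via migrate_swap().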

The call chain is stop_two_cpus -> cpu_stop_queue_two_works -> __cpu_stop_queue_work. Because cpu_stop_queue_two_works is inlined into stop_two_cpus and calls __cpu_stop_queue_work twice, we need to determine from the return address on the stack which of the two calls we are currently in.

227 static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
228                                     int cpu2, struct cpu_stop_work *work2)
229 {
230         struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
231         struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
232         int err;
233
234         lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
235         spin_lock_irq(&stopper1->lock);
236         spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);
.
243         __cpu_stop_queue_work(stopper1, work1);
244         __cpu_stop_queue_work(stopper2, work2);
.
251 }

Based on the return address saved on the stack:

 #5 [ffff8f4957fbfa78] wake_up_process at ffffffff838d6705
    ffff8f4957fbfa80: ffff8f4957fbfa98 ffffffff8392fc05
 #6 [ffff8f4957fbfa88] __cpu_stop_queue_work at ffffffff8392fc05
    ffff8f4957fbfa90: 000000000000001a ffff8f4957fbfbb0
    ffff8f4957fbfaa0: ffffffff8393037a
 #7 [ffff8f4957fbfaa0] stop_two_cpus at ffffffff8393037a
    ffff8f4957fbfaa8: 0000000100000001 ffff8f1afdf55fe8

crash> dis -l ffffffff8393037a 2
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/stop_machine.c: 244
0xffffffff8393037a <stop_two_cpus+394>: lea    0x48(%rsp),%rsi
0xffffffff8393037f <stop_two_cpus+399>: mov    %r15,%rdi

The return address maps to line 244, so the call currently in progress is the __cpu_stop_queue_work on line 243, i.e. the first of the two calls.

Then analyze the corresponding input parameters:

crash> task_numa_env ffff8f4957fbfc30
struct task_numa_env {
  p = 0xffff8f4aea3a8000,
  src_cpu = 26,
  src_nid = 0,
  dst_cpu = 63,
  dst_nid = 1,
  src_stats = {
    nr_running = 11,
    load = 556,
    compute_capacity = 18848,
    task_capacity = 18,
    has_free_capacity = 1
  },
  dst_stats = {
    nr_running = 3,
    load = 111,
    ...
    task_capacity = 18,
    has_free_capacity = 1
  },
  imbalance_pct = 112,
  idx = 0,
  ...
  best_task = 0xffff8f1ae77dc100,   <-- assigned via task_numa_find_cpu -> task_numa_compare -> task_numa_assign
  best_imp = 763,
  ...
}

crash> migration_swap_arg ffff8f4957fbfbc0
struct migration_swap_arg {
  src_task = 0xffff8f4aea3a8000,
  dst_task = 0xffff8f1ae77dc100,
  src_cpu = 26,
  dst_cpu = 1                       <-- the chosen destination CPU is 1
}

So, while cpu_stop_queue_two_works still holds the cpu_stopper locks (including CPU 26's), __cpu_stop_queue_work queues the work and then wakes the destination CPU's stopper thread, the migration kthread, to carry out the migration:

static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
                    struct cpu_stop_work *work)
{
    list_add_tail(&work->list, &stopper->works);
    wake_up_process(stopper->thread);   // in practice this just wakes up the migration thread
}

Since the chosen destination CPU is 1, the migration thread of CPU 1 has to be woken up to carry out the task swap. Let's look at CPU 1's cpu_stopper:

crash> p cpu_stopper:1
per_cpu(cpu_stopper, 1) = $33 = {
  thread = 0xffff8f1bf6bb9040,   <-- the stopper task
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 1
          }
        }
      }
    }
  },
  enabled = true,
  works = {
    next = 0xffff8f4957fbfac0,
    prev = 0xffff8f4957fbfac0
  },
  stop_work = {
    list = {
      next = 0xffff8f4afbe16000,
      prev = 0xffff8f4afbe16000
    },
    fn = 0xffffffff83952100,
    arg = 0x0,
    done = 0xffff8f1ae3647c08
  }
}

crash> kmem 0xffff8f1bf6bb9040
CACHE            NAME          OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff8eecffc05f00 task_struct      4152       1604   2219    317    32k
  SLAB             MEMORY            NODE  TOTAL  ALLOCATED  FREE
  fffff26501daee00 ffff8f1bf6bb8000     1      7          7     0
  FREE / [ALLOCATED]
  [ffff8f1bf6bb9040]
     PID: 14
 COMMAND: "migration/1"        <-- the migration task of CPU 1
    TASK: ffff8f1bf6bb9040  [THREAD_INFO: ffff8f1bf6bc4000]
     CPU: 1
   STATE: TASK_INTERRUPTIBLE (PANIC)

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
fffff26501daee40 3076bb9000                0        0   0 6fffff00008000 tail

So CPU 26 is waking the migration thread of CPU 1 while still holding the lock. Why, then, does it hold the lock for so long that CPU 1 panics waiting for it?

Here’s why it held the lock for so long:

 #3 [ffff8f1afdf48ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: try_to_wake_up+114]
    RIP: ffffffff838d63d2  RSP: ffff8f4957fbfa30  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8f1bf6bb9844  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff8f1bf6bb9844
    RBP: ffff8f4957fbfa70   R8: ffff8f4afbe15ff0   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: ffff8f1bf6bb9040  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #4 [ffff8f4957fbfa30] try_to_wake_up at ffffffff838d63d2
    ffff8f4957fbfa38: 000000000001ab80 0000000000000086
    ffff8f4957fbfa48: ffff8f4afbe15fe0 ffff8f4957fbfb48
    ffff8f4957fbfa58: 0000000000000001 ffff8f4afbe15fe0
    ffff8f4957fbfa68: ffff8f1afdf55fe0 ffff8f4957fbfa80

crash> dis -l ffffffff838d63d2
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/sched/core.c: 1790
0xffffffff838d63d2 <try_to_wake_up+114>:        mov    0x28(%r13),%eax

1721 static int
1722 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1723 {
.....
1787          * If the owning (remote) cpu is still in the middle of schedule() with
1788          * this task as prev, wait until its done referencing the task.
1789          */
1790         while (p->on_cpu)          <-- spins here until the remote schedule() finishes
.....
1814         return success;
1815 }

Let's represent this hard lockup with a simple diagram:

    CPU 1                                   CPU 26
    schedule(.prev=migration/1)             <fault>
      pick_next_task()                        ...
        idle_balance()                          migrate_swap()
          active_balance()                        stop_two_cpus()
                                                    spin_lock(stopper1->lock)
                                                    spin_lock(stopper26->lock)
                                                    try_to_wake_up(migration/1)
                                                      while (p->on_cpu) -- waits for schedule()
            stop_one_cpu_nowait(26)
              spin_lock(stopper26->lock) -- waits for the stopper lock
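
To make the cycle concrete, here is a toy userspace sketch of the same wait cycle (pthreads standing in for the two CPUs; purely illustrative, nothing to do with the kernel implementation): one thread grabs the "stopper lock" and busy-waits on an "on_cpu" flag, while the only thread that can clear the flag first needs that same lock.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stopper_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int on_cpu = 1;          /* plays the role of prev->on_cpu */

/* Plays CPU 26: stop_two_cpus() holds the stopper lock, then
 * try_to_wake_up() spins on p->on_cpu, which never clears.     */
static void *cpu26(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&stopper_lock);
        while (atomic_load(&on_cpu))
                ;                      /* spins forever: deadlock */
        pthread_mutex_unlock(&stopper_lock);
        return NULL;
}

/* Plays migration/1 on CPU 1: idle_balance() needs the same lock
 * before schedule() can finish and clear on_cpu.                 */
static void *cpu1(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&stopper_lock);   /* blocks forever        */
        pthread_mutex_unlock(&stopper_lock);
        atomic_store(&on_cpu, 0);            /* "schedule() finished" */
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&b, NULL, cpu26, NULL);
        usleep(100000);                      /* let "CPU 26" take the lock first */
        pthread_create(&a, NULL, cpu1, NULL);

        pthread_join(a, NULL);               /* never returns */
        pthread_join(b, NULL);
        puts("never reached");
        return 0;
}

Compiled with -pthread, the program hangs exactly the way the two CPUs did, with neither side able to make progress.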

Looking at the upstream patch (commit 0b26351b910f):

 static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
-                    struct cpu_stop_work *work)
+                       struct cpu_stop_work *work,
+                       struct wake_q_head *wakeq)
 {
     list_add_tail(&work->list, &stopper->works);
-    wake_up_process(stopper->thread);
+    wake_q_add(wakeq, stopper->thread);
 }

III. Reproducing the failure

Since this is a hard lockup caused by a race condition, and the logical analysis already pins down the root cause, we did not spend time reproducing it. The affected environment runs DPDK and, for performance, confines the workload to a single NUMA node, which frequently leaves NUMA unbalanced. If you want to reproduce the problem, running a DPDK-style workload pinned to one NUMA node in the same way should give a higher probability of hitting it.

IV. Fault avoidance or solution

Our options for avoiding or fixing the fault are:

  1. Turn off NUMA automatic balancing (for example via the kernel.numa_balancing sysctl).
  2. Manually backport the Linux mainline patch 0b26351b910f.
  3. Upgrade to a kernel that already carries the fix; it landed in CentOS kernel 3.10.0-974.el7 with the changelog entry:

    [kernel] stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock (Phil Auld) [1557061]

Red Hat has also backported the fix to 3.10.0-957.27.2.el7.x86_64, so upgrading the CentOS kernel to 3.10.0-957.27.2.el7.x86_64 is another option.