1. Symptom

The system administrator received a phone alert reporting that a server had suddenly become unreachable over SSH. Logging in through the out-of-band management IP address and opening the remote console showed an authentication error message. After a reboot, the system could be logged in to normally.

2. Find the cause

After logging in to the system, I found soft lockup errors in the logs (the messages are quoted in section 3, "Specific analysis").

I searched Baidu and found that this error is a kernel soft lockup, commonly described as a "hang". After asking the administrator, I learned that Docker had recently been installed on the server, so the lockup may have been caused by high load.

  • Soft lockup: this bug does not bring the system down completely, but some processes (or kernel threads) become stuck in a certain state (usually in kernel space), in many cases because of how kernel locks are used.
  • The kernel parameter kernel.watchdog_thresh (/proc/sys/kernel/watchdog_thresh) defaults to 10. The soft lockup message is printed once a CPU has been stuck for more than 2 x 10 seconds. Note: the value cannot be greater than 60.
  • The Linux kernel runs a monitoring thread for each CPU, known as the watchdog. You can see these threads with ps -ef | grep watchdog; they are named watchdog/X, where X is the logical CPU number. The thread wakes up periodically and otherwise sleeps. Each time it runs, it records the current timestamp for its CPU in a per-CPU kernel data structure. A timer interrupt handler in the kernel performs the soft lockup check: it compares the current timestamp with the one saved for that CPU, and if the difference exceeds the configured threshold, it concludes that the watchdog thread has not been scheduled for a long time.
  • Why and how do CPU soft lockups occur? If the Linux kernel schedules CPU access so carefully, how can a CPU get stuck? It can only be said that the QMGR process (QMGR is a background message-queue service process for mail) caused the panic in our server's kernel, and that the fault came from user-developed code or third-party software. Any infinite loop runs as an execution flow on some CPU with a certain priority. The CPU scheduler schedules a driver to run, and if the driver has a bug that goes undetected, it can hold the CPU for a long time. As described above, the watchdog mechanism catches this and reports a soft lockup error. Soft lockups can hang CPUs and leave the system unusable.
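As a rough illustration of the threshold rule described above, the following shell sketch computes the soft lockup reporting window as twice kernel.watchdog_thresh and lists the per-CPU watchdog threads. The function names are made up for this example. Reading /proc/sys/kernel/watchdog_thresh assumes a Linux host, and the watchdog/X thread naming assumes an older kernel such as the 3.10 series discussed here.

```shell
# Print the soft lockup reporting window in seconds for a given
# watchdog_thresh value (window = 2 x threshold).
softlockup_window() {
    # $1: value of kernel.watchdog_thresh, e.g. 10
    echo $(( 2 * $1 ))
}

# Print the window for the running kernel (Linux only).
current_softlockup_window() {
    softlockup_window "$(cat /proc/sys/kernel/watchdog_thresh)"
}

# List the per-CPU watchdog kernel threads (named watchdog/X on older kernels).
# The [w] trick keeps grep from matching its own process entry.
list_watchdog_threads() {
    ps -ef | grep '[w]atchdog/'
}
```

With the default threshold of 10, softlockup_window 10 prints 20, which is in line with the "stuck for 22s" and "stuck for 23s" reports seen on this server.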

3. Specific analysis

1. The system restarted at the following two times:
Mar 3 21:53:16 ser-node7 kernel: Linux version 3.10.0-957.el7.x86_64 ([email protected])
Mar 3 22:37:19 ser-node7 kernel: Linux version 3.10.0-957.el7.x86_64 ([email protected])

CPU soft lockups had been occurring for some time before the restarts. Further analysis of the soft lockups depends on the vmcore data generated by kdump.

Mar  3 14:28:18 ser-node7 kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [runc:[1:CHILD]:52902]
Mar  2 18:14:59 ser-node7 kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [runc:[1:CHILD]:55961]
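To pull the key fields out of messages like the ones above, a small shell sketch along these lines may help. It assumes syslog-style lines with a 15-character timestamp, as in the excerpts, and a sed that supports -E. The name summarize_soft_lockups is made up for this example.

```shell
# Summarize soft lockup reports from a syslog-style file.
# Prints one line per event: "<timestamp> cpu=<n> stuck=<s>s task=<name>"
summarize_soft_lockups() {
    # $1: path to a log file such as /var/log/messages
    grep 'BUG: soft lockup' "$1" |
        sed -E 's/^(.{15}) .*CPU#([0-9]+) stuck for ([0-9]+)s! \[(.*):[0-9]+\]$/\1 cpu=\2 stuck=\3s task=\4/'
}
```

For example, summarize_soft_lockups /var/log/messages lists which CPU was stuck, for how long, and in which task, which makes it easier to see whether the same process (here runc) is involved every time.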

./systemctl_list-unit-files:kdump.service enabled

If it has not been configured already, add the following line to /etc/sysctl.conf: kernel.softlockup_panic = 1, then run "sysctl -p" for the parameter to take effect. With this setting, a CPU soft lockup automatically triggers a kernel panic. If kdump is working properly, a vmcore is generated and the system restarts automatically.
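The sysctl.conf change can be made repeatable with a small helper. The sketch below appends a setting only when the key is not already present in the file; ensure_sysctl_line is a hypothetical name, not an existing tool, and the function does not update a key that already has a different value.

```shell
# Append "key = value" to a sysctl config file unless the key is already set.
# $1: config file (e.g. /etc/sysctl.conf), $2: key, $3: value
ensure_sysctl_line() {
    conf=$1
    key=$2
    value=$3
    # Escape dots in the key so that grep matches them literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$conf" 2>/dev/null; then
        echo "$key is already set in $conf"
    else
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}
```

Run as root, ensure_sysctl_line /etc/sysctl.conf kernel.softlockup_panic 1 followed by sysctl -p would apply the setting described above.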

2. In addition, the following warnings appear in the logs; they are not directly related to the soft lockup problem.

# cat messages | grep "SLUB: Unable to allocate memory on node"

Mar  2 18:04:45 ser-node7 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
Mar  3 14:54:25 ser-node7 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
Mar  3 14:54:25 ser-node7 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0xd0)

This is a known bug in the system. For details, see the following KB:

  • SLUB: Unable to allocate memory on node -1 (gfp=0x20)

access.redhat.com/solutions/4… Based on this KB, upgrade the kernel to kernel-3.10.0-1062.4.1.el7 or later.

  • kernel-3.10.0-1062.4.1.el7 download address:

access.redhat.com/errata/RHSA…

  • For details on how to upgrade the kernel, see the following documentation:

How to update/upgrade the Red Hat Enterprise Linux kernel? access.redhat.com/solutions/2…
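To check whether a host already runs at least the kernel release named in the KB, a version comparison along these lines can be used. This is only a sketch: kernel_at_least is a made-up helper, it relies on version sort (sort -V, as provided by GNU coreutils), and both arguments should be given in the same form, without the architecture suffix.

```shell
# Succeed if version $1 is the same as or newer than version $2.
# Relies on version sort (sort -V, available in GNU coreutils).
kernel_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example, kernel_at_least "$(uname -r | sed 's/\.[^.]*$//')" 3.10.0-1062.4.1.el7 strips the trailing architecture component from the running release before comparing. On this server (3.10.0-957.el7) the check fails, which indicates that the upgrade is still needed.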

4. Solutions

The solution suggested by experts on Baidu is as follows:

  • Edit the configuration file: vi /etc/sysctl.conf

kernel.watchdog_thresh = 30

  • To view: # tail -1 /proc/sys/kernel/watchdog_thresh
  • To take effect temporarily: # sysctl -w kernel.watchdog_thresh=30

The original proposal is pending.
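Because the value cannot be greater than 60, it may be worth validating it before applying it. The following is a minimal sketch; the function names are hypothetical, and rejecting 0 is a choice made for this example rather than something stated above.

```shell
# Succeed only if $1 is an integer from 1 to 60.
valid_watchdog_thresh() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 60 ]
}

# Apply the value at runtime after validating it (requires root).
set_watchdog_thresh() {
    if valid_watchdog_thresh "$1"; then
        sysctl -w kernel.watchdog_thresh="$1"
    else
        echo "refusing watchdog_thresh=$1: expected an integer from 1 to 60" >&2
        return 1
    fi
}
```

Calling set_watchdog_thresh 30 applies the same temporary change as the sysctl -w command above, while set_watchdog_thresh 90 is rejected before sysctl is invoked.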