What is load average?

Load average may be familiar or unfamiliar to us, but when asked what it is, most people answer: "Isn't load average just CPU usage per unit time?" That is not the case. For the authoritative description, man uptime explains the metric in detail.
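For a quick look, the same three numbers (the 1-, 5-, and 15-minute averages) can also be read directly from the kernel. The sample line below is illustrative, not real output:

$ cat /proc/loadavg
1.62 1.10 0.87 2/415 12846
# fields: 1-, 5-, and 15-minute load averages, runnable/total tasks, most recent PID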

Simply put, load average is the average number of processes per unit of time that are in the runnable state or the uninterruptible state, that is, the average number of active processes. It is not directly related to CPU usage. The two terms, runnable and uninterruptible, are explained here.

  • Runnable state: a process that is either using the CPU or waiting for the CPU. With the ps command, these processes appear in the R state.
  • Uninterruptible state: a process that is in the middle of a critical kernel-mode operation and cannot be interrupted, most commonly while waiting for a hardware I/O response. With the ps command, these processes appear in the D state. For example, when a process is reading or writing data on disk, it must not be interrupted before the disk operation completes, to guarantee data consistency; during that window it is in the uninterruptible state. If it were interrupted, the data on disk and the data held by the process could easily become inconsistent.

Therefore, the uninterruptible state is actually a protection mechanism for system processes and hardware devices.
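A minimal way to see both states for yourself, assuming the standard procps ps (the first letter of the stat column gives the state):

$ ps -eo pid,stat,comm | awk '$2 ~ /^(R|D)/'
# R = running or runnable (on the run queue); D = uninterruptible sleep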

Therefore, you can simply say that load average is the average number of active processes. Intuitively this is the number of active processes per unit time, although strictly it is an exponentially damped moving average of that number. Since it counts active processes, the ideal is exactly one process running on each CPU, so that every CPU is fully utilized (a shell sketch of this per-CPU arithmetic follows the list below). For example, what does a load average of 2 mean?

  • On a system with 2 CPUs, it means all the CPUs are exactly fully occupied
  • On a 4-CPU system, it means the CPUs are 50% idle
  • On a system with only one CPU, it means half of the processes cannot get the CPU
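A small shell sketch of that per-CPU arithmetic, assuming coreutils nproc and a /proc filesystem; a result near 1.0 means the CPUs are just saturated:

load1=$(cut -d ' ' -f1 /proc/loadavg)   # 1-minute load average
cpus=$(nproc)                           # number of online CPUs
awk -v l="$load1" -v c="$cpus" 'BEGIN { printf "load per CPU: %.2f\n", l / c }'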

Load average and CPU usage

In the real world, we often confuse load average with CPU usage, so here I also draw the distinction between them.

You might wonder, since load average represents the number of active processes, doesn’t a high load average mean high CPU usage?

Again, load average refers to the number of processes per unit of time that are in a runnable or uninterruptible state. Therefore, it includes not only processes that are actually using the CPU, but also processes waiting for the CPU and processes waiting for I/O.

CPU usage, by contrast, is a statistic of how busy the CPU is per unit time. It does not necessarily correspond to the load average, for example:

  • CPU-intensive processes: heavy CPU use drives the load average up, and here the two metrics are consistent
  • I/O-intensive processes: waiting for I/O also drives the load average up, but CPU usage is not necessarily high
  • A large number of processes waiting to be scheduled onto the CPU also drives the load average up, and here CPU usage is high as well (the one-liner below shows how to watch both metrics together)
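One way to watch the two signals side by side, assuming sysstat is installed for mpstat: if the load climbs while %idle stays high and %iowait grows, the load is coming from I/O rather than from CPU work.

$ watch -n 5 'uptime; mpstat 1 1 | tail -n 2'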

Here we need to install several tools: sysstat, stress, and stress-ng.

Note that the sysstat version shipped with CentOS is quite old, and it is best to upgrade to the latest version via a manual RPM install or an install from source.
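A minimal install sketch for CentOS, assuming the EPEL repository (which provides stress and stress-ng; package names may differ on other distributions):

$ sudo yum install -y epel-release
$ sudo yum install -y sysstat stress stress-ng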

Scenario 1: CPU-intensive

1. Run the stress command to simulate a 100% CPU usage scenario:

$ stress --cpu 1 --timeout 600

2. On a second terminal, run watch -d uptime to check the load average:

$ watch -d uptime
09:40:35 up 80 days, 18:41,  2 users,  load average: 1.62, 1.10, 0.87

3. Start a third terminal and run mpstat to check the CPU usage:

$ mpstat -P ALL 5 20
10:06:37 AM  CPU    %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
10:06:42 AM  all   31.50   0.00  0.00     0.00  0.00   0.00    0.00    0.00    0.00  68.39
10:06:42 AM    0    1.20   0.00  0.00     0.00  0.00   0.00    0.00    0.00    0.00  98.80
10:06:42 AM    1    7.21   0.00  0.00     0.40  0.00   0.00    0.00    0.00    0.00  92.38
10:06:42 AM    2  100.00   0.00  0.00     0.00  0.00   0.00    0.00    0.00    0.00   0.00
10:06:42 AM    3   17.43   0.00  0.20     0.00  0.00   0.00    0.00    0.00    0.00  82.36
# -P ALL monitors all CPUs; 5 is the sampling interval in seconds and 20 the number of samples

From the second terminal, we can see the 1-minute load average climb to 1.62. From the third terminal, we can see that one CPU is at 100% usage while iowait is 0, indicating that the rise in the load average is caused precisely by the CPU being 100% busy.

Which process is responsible for the 100% CPU utilization? We can use pidstat to see:

$ pidstat -u 5 1
10:08:41 AM   UID    PID    %usr  %system  %guest  %wait    %CPU  CPU  Command
10:08:46 AM     0      1    0.20     0.00    0.00   0.00    0.20    0  systemd
10:08:46 AM     0    599    0.00     1.00    0.00   0.20    1.00    0  systemd-journal
10:08:46 AM     0   1043    0.60     0.00    0.00   0.00    0.60    0  rsyslogd
10:08:46 AM     0   6863  100.00     0.00    0.00   0.00  100.00    3  stress
10:08:46 AM     0   7303    0.20     0.20    0.00   0.00    0.40    2  pidstat

From here we can see that the stress process is the cause.

Scenario 2: I/O-intensive

1. This time we use the stress-ng command to simulate I/O pressure by executing sync continuously:

$ stress-ng -i 4 --hdd 1 --timeout 600
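# -i 4 starts 4 workers that repeatedly call sync(); --hdd 1 starts 1 worker that
# writes to disk. Both generate I/O pressure rather than pure CPU work.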

2. On the second terminal, run watch -d uptime to view the load average:

$ watch -d uptime
10:30:57 up 98 days, 19:39,  3 users,  load average: 1.71, 0.75, 0.69

3. Start the third terminal and run mpstat to check the CPU usage:

$ mpstat -P ALL 5 20
10:32:09 AM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
10:32:14 AM  all   6.80   0.00  33.75    26.16  0.00   0.39    0.00    0.00    0.00  32.90
10:32:14 AM    0   4.03   0.00  69.57    19.91  0.00   0.00    0.00    0.00    0.00   6.49
10:32:14 AM    1  25.32   0.00   9.49     0.00  0.00   0.95    0.00    0.00    0.00  64.24
10:32:14 AM    2   0.24   0.00  10.87    63.04  0.00   0.00    0.00    0.00    0.00  25.36
10:32:14 AM    3   1.42   0.00  36.93    14.20  0.00   0.28    0.00    0.00    0.00  47.16

As you can see, the 1-minute load average slowly climbs to 1.71, one CPU's system usage rises to 69.57%, and another CPU's iowait reaches 63.04%. This indicates that the increase in the load average is due to the increase in iowait.

So which process is causing this? We use pidstat to see:

$ pidstat -u 5 1
Average:      UID    PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
Average:        0      1   0.00     0.19    0.00   0.00   0.19    -  systemd
Average:        0     10   0.00     0.19    0.00   1.56   0.19    -  rcu_sched
Average:        0    599   0.58     1.75    0.00   0.39   2.33    -  systemd-journal
Average:        0   1043   0.19     0.19    0.00   0.00   0.39    -  rsyslogd
Average:        0   6934   0.00     1.56    0.00   1.17   1.56    -  kworker/2:0-events_power_efficient
Average:        0  10793   0.00     0.97    0.00   1.56   0.97    -  kworker/3-mm_percpu_wq
Average:        0  11062   0.00    21.79    0.00   0.19  21.79    -  stress-ng-hdd
Average:        0  11063   0.00     1.95    0.00   1.36   1.95    -  stress-ng-io
Average:        0  11065   0.00     1.36    0.00   1.75   1.36    -  stress-ng-io
Average:        0  11066   0.00     2.72    0.00   0.58   2.72    -  stress-ng-io

From this we can see that stress-ng is the cause.

Scenario 3: a large number of processes

When the processes running in the system exceed what the CPUs can handle, processes that are waiting for the CPU appear.

For example, we use stress again, but this time we simulate 8 processes:

$ stress -c 8 --timeout 600

Since this machine has only 4 CPUs, 8 processes are clearly more than the CPUs can serve. On the second terminal, run uptime to check the load average:

$ uptime
10:56:22 up 98 days, 20:05,  3 users,  load average: 4.52, 2.82, 2.67

Then we run pidstat to see how the processes are doing:

$ pidstat -u 5 1
Linux 5.0.5-1.el7.elrepo.x86_64 (k8s-m1)  07/11/2019  _x86_64_  (4 CPU)
10:57:33 AM   UID    PID   %usr  %system  %guest  %wait   %CPU  CPU  Command
10:57:38 AM     0      1   0.20     0.00    0.00   0.00   0.20    1  systemd
10:57:38 AM     0    599   0.00     0.99    0.00   0.20   0.99    2  systemd-journal
10:57:38 AM     0   1043   0.60     0.20    0.00   0.00   0.79    1  rsyslogd
10:57:38 AM     0  12927  51.59     0.00    0.00  48.21  51.59    0  stress
10:57:38 AM     0  12928  44.64     0.00    0.00  54.96  44.64    0  stress
10:57:38 AM     0  12929  45.44     0.00    0.00  54.56  45.44    2  stress
10:57:38 AM     0  12930  45.44     0.00    0.00  54.37  45.44    2  stress
10:57:38 AM     0  12931  51.59     0.00    0.00  48.21  51.59    3  stress
10:57:38 AM     0  12932  48.41     0.00    0.00  51.19  48.41    1  stress
10:57:38 AM     0  12933  45.24     0.00    0.00  54.37  45.24    3  stress
10:57:38 AM     0  12934  48.81     0.00    0.00  50.99  48.81    1  stress
10:57:38 AM     0  13083   0.00     0.40    0.00   0.20   0.40    0  pidstat

As can be seen, 8 processes are contending for 4 CPUs, and each process waits for the CPU for up to about 50% of the time (%wait). The number of processes exceeds the CPUs' computing capacity, resulting in CPU overload.
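A final hedged tip, assuming the procps vmstat tool is available: the r column of vmstat shows the number of runnable processes, and when it stays persistently above the CPU count, processes are queuing for the CPU, which is exactly what drives the load average up in this scenario.

$ vmstat 5
# r = runnable processes (running or waiting for the CPU)
# b = processes in uninterruptible sleep (the D state discussed earlier)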