Preface

As performance test engineers, our reflex whenever a machine slows down is to run the uptime or top command to check the load on the system.

For example, if I type uptime on the command line, the system will return a line of information.

    $ uptime
    20:44 up 21 days, 6:41, 2 users, load averages: 2.85 2.33 2.91

But do you know what each of the columns above means?

    Field                            Meaning
    20:44                            Current time
    up 21 days, 6:41                 System uptime
    2 users                          Number of logged-in users
    load averages: 2.85 2.33 2.91    Average system load over the last 1, 5, and 15 minutes, respectively
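As a minimal sketch of how the three load figures can be pulled out of such a line programmatically, the following Python snippet parses the sample output shown above. The function name parse_load_averages is hypothetical, and the pattern assumes the numbers use a dot as the decimal separator.

```python
import re


def parse_load_averages(uptime_line):
    """Return the (1-min, 5-min, 15-min) load averages from one uptime line."""
    # Accept both "load averages:" (macOS) and "load average:" (Linux).
    match = re.search(r"load averages?:\s*(.+)$", uptime_line.strip())
    if match is None:
        raise ValueError(f"no load averages found in: {uptime_line!r}")
    # The numbers may be separated by spaces or by commas.
    numbers = re.findall(r"\d+\.\d+", match.group(1))
    if len(numbers) != 3:
        raise ValueError(f"expected three load values, got {numbers!r}")
    return tuple(float(n) for n in numbers)


line = "20:44 up 21 days, 6:41, 2 users, load averages: 2.85 2.33 2.91"
one, five, fifteen = parse_load_averages(line)
print(f"1 min: {one}, 5 min: {five}, 15 min: {fifteen}")
```

Parsing text is only one option; on Unix-like systems Python's os.getloadavg() returns the same three values directly.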

In the second half of this line we see “load averages”, the average load on the system, followed by three numbers that tell us whether the system is overloaded or underloaded.

What is system load average?

I’m sure some of you will say: isn’t load average just CPU usage per unit of time, so that 2.85 means the CPU usage is 285%? It is not.

In Linux, the load value represents the average number of jobs (a job being the set of machine-language instructions corresponding to a thread of execution of a process) that are running, runnable, or, crucially, sleeping but uninterruptible (uninterruptible sleep). That is, the calculation counts processes that are running, waiting for CPU time to be allocated, or in uninterruptible sleep. It does not count processes in normal (interruptible) sleep, zombies, or stopped processes.

In simple terms, load average is the average number of processes in the runnable and uninterruptible states per unit of time. It is not directly related to CPU usage.

Process status codes:

    R    running or runnable (on the run queue)
    D    uninterruptible sleep (usually I/O)
    S    interruptible sleep (waiting for an event to complete)
    Z    defunct ("zombie") process, terminated but not reaped by its parent
    T    stopped by a job control signal or because it is being traced
    […]
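To make the link between these state codes and the load figure concrete, here is a small Python sketch that tallies processes by state and counts the ones that contribute to load (R and D). It assumes the input is the STAT column of ps, one value per line, as printed by something like ps -eo stat= on Linux. The function names and the sample data are made up for illustration.

```python
from collections import Counter


def count_states(stat_column):
    """Count processes per state code from the lines of a ps STAT column."""
    counts = Counter()
    for line in stat_column.splitlines():
        field = line.strip()
        if field:
            # The first character is the state; the rest are modifiers
            # such as "s" (session leader) or "+" (foreground).
            counts[field[0]] += 1
    return counts


def active_processes(counts):
    """Processes that count toward load: runnable (R) plus uninterruptible (D)."""
    return counts["R"] + counts["D"]


# Hypothetical sample, not taken from a real machine.
sample = "Ss\nR+\nS\nD\nZ\nR\nT\nS<\n"
counts = count_states(sample)
print(dict(counts))
print("active:", active_processes(counts))  # active: 3
```

This is a single snapshot; the load average reported by uptime is such a count averaged over time.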

I’m going to explain runnable and uninterruptible.

A runnable process is one that is using the CPU or waiting for it. This is a process in the R state (Running or Runnable) as shown by the ps command.

An uninterruptible process is one that is in kernel mode and cannot be interrupted, for example because it is waiting for an I/O response from a hardware device. This is a process in the D state (Uninterruptible Sleep, also known as Disk Sleep) as shown by the ps command. For example, when a process reads or writes data on disk, it cannot be interrupted by other processes or interrupts until it receives a response from the disk, so that data stays consistent. During that time the process is in the uninterruptible state. If it were interrupted, the data on disk could end up inconsistent with the data held by the process.

Therefore, an uninterruptible state is actually a system protection mechanism for processes and hardware devices.

Put simply, then, the load average is the average number of active processes, which can be understood intuitively as the number of active processes per unit of time. Since it counts active processes, the ideal is exactly one process running on each CPU, so that every CPU is fully utilized.

Here is what the different load values mean on a single-core computer:

0.00: No job is running or waiting for the CPU; the CPU is idle. If a process needs to perform a task, it asks the operating system for the CPU, and CPU time is allocated immediately because no other process is competing for it.

0.50: No jobs are waiting, but the CPU is processing earlier jobs at 50% of its capacity. The operating system can still allocate CPU time to other processes immediately, without putting them on hold.

1.00: No jobs are in the queue, but the CPU is processing earlier jobs at 100% of its capacity. A new process requesting CPU time must wait until another job completes or the current CPU time slice (for example, a CPU tick) expires and the operating system decides which process runs next according to priority.

1.50: The CPU is working at 100% of its capacity, and 5 out of every 15 jobs requesting CPU time (33.33%) have to wait in line while others use up their allotted time.

Therefore, once the threshold of 1.0 is exceeded, the system can be said to be overloaded, because it cannot immediately process 100% of the requested work. Conversely, the lower the “load average” value, such as 0.2 or 0.3, the less work the server is doing and the lower the system load.
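The single-core readings above come down to a little arithmetic, sketched here in Python. The function name single_core_breakdown is hypothetical, and the model is deliberately simplified: it assumes every job counted in the load wants CPU time and ignores jobs in uninterruptible sleep.

```python
def single_core_breakdown(load):
    """Split a single-core load value into (busy, waiting, waiting_share).

    busy          -- fraction of the CPU's capacity in use (at most 1.0)
    waiting       -- average number of jobs queued behind the running one
    waiting_share -- fraction of all jobs that are waiting
    """
    if load < 0:
        raise ValueError("load cannot be negative")
    busy = min(load, 1.0)
    waiting = max(load - 1.0, 0.0)
    waiting_share = waiting / load if load > 0 else 0.0
    return busy, waiting, waiting_share


for load in (0.0, 0.5, 1.0, 1.5):
    busy, waiting, share = single_core_breakdown(load)
    print(f"{load:.2f}: CPU {busy:.0%} busy, {share:.2%} of jobs waiting")
```

For a load of 1.50 this gives a fully busy CPU with one third of the jobs waiting, which matches the "5 out of 15" figure in the list.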

An analogy

Was that too much to take in? Okay, let’s look at a simple analogy.

Let’s start with the simplest case, where your computer has only one CPU and all the calculations must be done by that CPU.

So think of the CPU as a bridge with a single lane that all traffic must cross. (Obviously, the bridge is one-way.)

A system load of 0.0 means there are no cars on the bridge.

A system load of 0.5 means half of the bridge is occupied by cars.

A system load of 1.0 means there are cars on every section of the bridge; the bridge is “full”. Note, however, that traffic on the bridge is still flowing at this point.

A system load of 1.7 means there are too many cars: the bridge is full (100%), and the cars waiting to get on amount to 70% of the cars on the deck. By the same reasoning, a system load of 2.0 means as many cars are waiting as are on the deck, and a system load of 3.0 means twice as many are waiting as are on the deck. In short, when the system load is greater than 1, the vehicles behind must wait; the greater the system load, the longer they must wait.

The system load of a CPU works much like this analogy. The capacity of the bridge is the maximum workload of the CPU, and the vehicles are the processes waiting to be handled by the CPU.

Suppose the CPU can handle at most 100 processes per minute. A system load of 0.2 means the CPU handles only 20 processes in that minute; a system load of 1.0 means it handles exactly 100; and a system load of 1.7 means that, in addition to the 100 processes being handled, 70 more are queued waiting for the CPU.

For the computer to run smoothly, the system load should ideally not exceed 1.0, so that no process has to wait and every process is handled right away. 1.0 is therefore a critical value: beyond it the system is no longer in optimal condition and you need to intervene.

Multi-processor and multi-core systems

In a system with multiple processors or cores (multiple logical CPUs), the meaning of the load value depends on the number of processors present. A computer with four processors is not 100% utilized until the load reaches 4.00, so the first thing to do when interpreting the three load values reported by top, htop, or uptime is to divide them by the number of logical CPUs in the system and draw conclusions from the result.

For example, what happens if your computer has two CPUs? With two CPUs, you have doubled the processing power of the computer and doubled the number of processes it can handle at the same time. In the bridge analogy, two CPUs means the bridge has two lanes, doubling its traffic capacity.

So with two CPUs, a system load of 2.0 means each CPU is working at 100%. By extension, the maximum acceptable system load for a computer with n CPUs is n.0.

Chip manufacturers often pack multiple CPU cores into a single CPU, which is then called a multi-core CPU.

As far as system load is concerned, a multi-core CPU behaves like multiple CPUs, so when assessing system load you must consider how many CPUs the computer has and how many cores each CPU has. Divide the system load by the total number of cores; as long as the load per core does not exceed 1.0, the computer is operating normally. How do you know how many CPU cores a machine has? On Linux, the nproc and lscpu commands report this.
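The per-core division described above can be sketched in Python as follows. os.cpu_count() reports the number of logical CPUs and os.getloadavg() returns the same three figures that uptime prints; the helper name load_per_core and the fallback sample values are assumptions for illustration.

```python
import os


def load_per_core(load_averages, cores):
    """Divide each load average by the number of logical CPUs."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(value / cores for value in load_averages)


# os.cpu_count() can return None when the count cannot be determined.
cores = os.cpu_count() or 1

try:
    current = os.getloadavg()  # available on Unix-like systems only
except (AttributeError, OSError):
    current = (2.85, 2.33, 2.91)  # fall back to the sample values from uptime

print("logical CPUs:", cores)
print("load per core (1, 5, 15 min):", load_per_core(current, cores))
```

With four logical CPUs, for example, a load of 4.0 corresponds to 1.0 per core, which is the fully utilized case described in the text.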

CPU utilization

If we look at the different processes that pass through the CPU during a given time interval, the utilization percentage represents the share of that interval in which the CPU was executing instructions for each process. This calculation considers only running processes, not those that are waiting, whether in a queue (runnable state) or sleeping uninterruptibly (for example, waiting for an input/output operation to finish). The metric can therefore tell us which processes are using the CPU the most, but it does not give a true picture of whether the system is overloaded or underutilized.

In practice, load average and CPU usage are often confused. As we have seen, load average is the number of processes in the runnable and uninterruptible states per unit of time, so it includes not only the processes that are using the CPU but also those waiting for the CPU and those waiting for I/O. CPU usage measures how busy the CPU is per unit of time, and it does not necessarily track the load average. For example:

CPU-intensive processes: heavy CPU use raises the load average, so here the two metrics agree.

I/O-intensive processes: waiting for I/O also raises the load average, but CPU usage is not necessarily high.

A large number of processes waiting for the CPU: this also raises the load average, and CPU usage is high as well.

A note on input/output (I/O) operations

The importance of uninterruptible sleep (state D in the process status list above) has been stressed repeatedly in this article, because you will sometimes find very high load values on a machine whose running processes show relatively low CPU utilization. If you do not take this state into account, the situation is puzzling and you will not know how to deal with it. A process ends up in this state when it is waiting for a resource to be released and its execution cannot be interrupted, such as when it is waiting for an uninterruptible I/O operation (not all I/O operations are uninterruptible). Typically this is caused by a disk failure, a network file system failure (such as an NFS failure), or heavy use of a very slow device (such as a USB 1.0 pen drive).

In this case, we have to turn to other tools such as iostat or iotop, which show where the I/O is happening and which processes are performing the most I/O, so that we can kill those processes or give them a lower priority (with nice or renice) and leave more CPU time for other, more critical processes.
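The distinction drawn in this section between CPU-bound and I/O-bound load can be expressed as a rough rule of thumb. The Python sketch below is only illustrative: the function name describe_load and both thresholds (a per-core load of 1.0 and a CPU busy fraction of 0.8) are assumptions, not values taken from any tool.

```python
def describe_load(load_per_core, cpu_busy, load_limit=1.0, busy_limit=0.8):
    """Classify a (load per core, CPU busy fraction) pair.

    load_per_core -- load average divided by the number of logical CPUs
    cpu_busy      -- fraction of time the CPUs were busy, between 0.0 and 1.0
    Both limits are illustrative assumptions.
    """
    if load_per_core <= load_limit:
        return "within capacity"
    if cpu_busy >= busy_limit:
        return "cpu-bound"
    # High load with a mostly idle CPU points at processes in state D.
    return "probably io-bound"


print(describe_load(0.4, 0.30))
print(describe_load(2.5, 0.95))
print(describe_load(2.5, 0.15))
```

The last case, high load with low CPU usage, is the one where iostat or iotop is worth reaching for.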

Some tips

Exceeding a load value of 1.0 is not always a problem: even with some latency, the CPU works through the jobs in the queue and the load drops below 1.0 again. However, if the load stays above 1 persistently, the system cannot absorb all the work being asked of it, so response times grow and the system becomes slow and unresponsive. Persistently high values, especially in the 5- and 15-minute averages, are a clear symptom that we need to upgrade the hardware, reduce resource consumption by limiting what users can do on the system, or spread the load across multiple similar nodes.

Therefore, we propose the following:

>= 0.70: No action is needed yet, but the CPU load should be monitored. If it stays at this level for some time, investigate before things get worse.

>= 1.00: There is a problem and you must find and fix it; otherwise a major spike in system load will make your application slow or unresponsive.

>= 3.00: The system becomes very slow. It is hard even to operate it from the command line to find the cause, so fixing the problem takes longer than it would have if we had acted earlier. You run the risk that the system becomes even more saturated and eventually crashes.

>= 5.00: You may not be able to recover the system. You can wait for a miracle to lower the load spontaneously or, if you know what is going on and can afford it, run a command such as kill -9 <PID> in the console and pray that it gets executed at some point, lightening the load so you can regain control. Otherwise, you have no choice but to restart the computer.
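These rules of thumb translate directly into a small lookup, sketched below in Python. The function name advise and the short labels are hypothetical, and the sketch assumes the value passed in is the load per core (the load average divided by the number of logical CPUs).

```python
def advise(load_per_core):
    """Map a per-core load value to the rule-of-thumb levels listed above."""
    if load_per_core >= 5.0:
        return "critical: the system may not be recoverable without a restart"
    if load_per_core >= 3.0:
        return "severe: the system is very slow and may crash"
    if load_per_core >= 1.0:
        return "problem: find and fix the cause"
    if load_per_core >= 0.7:
        return "watch: keep monitoring and investigate if it persists"
    return "ok: no action needed"


for value in (0.3, 0.7, 1.2, 3.5, 6.0):
    print(f"{value:.2f} -> {advise(value)}")
```

Checking the highest threshold first keeps each branch a simple comparison and makes it easy to adjust the limits for a particular system.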