In day-to-day work, measuring server performance involves several indicators, such as Load, CPU, MEM, QPS, RT, and so on, each with its own significance. When problems occur online, they are often accompanied by abnormal indicators, and in most cases some indicator shows an anomaly before the problem itself surfaces.

Understanding these indicators, knowing how to view them, and being able to resolve the anomalies behind them are important skills for programmers. This article focuses on one of the more important indicators, machine load (Load): its definition, how to view it, and how to investigate a Load spike.

What is Load

Load is an important indicator of a Linux machine; it intuitively reflects the machine's current state.

Let’s take a look at the definition of load:

In UNIX computing, the system load is a measure of the amount of computational work that a computer system performs. The load average represents the average system load over a period of time. It conventionally appears in the form of three numbers which represent the system load during the last one-, five-, and fifteen-minute periods.

A quick explanation: on UNIX systems, system load is a measure of the work the CPU is currently doing, defined as the average number of threads in the run queue over a specific time interval. The load average indicates the machine's average load over a period of time: the lower the value, the better. When the load is too high, the machine cannot handle other requests and operations, and may even freeze.

A Linux machine's load is driven by CPU usage, memory usage, and I/O consumption; too much of any of them can cause the server load to climb dramatically.
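As a quick illustration of where those three pressure sources show up (a sketch, assuming a Linux system with procps installed): vmstat displays them side by side, and its r column is the run queue that feeds the load average.

```shell
# Sample system state 3 times, one second apart (procps vmstat assumed).
# r  = runnable tasks (CPU pressure), b = tasks blocked on I/O,
# free/si/so = memory and swap pressure, wa = CPU time spent waiting on I/O.
vmstat 1 3 2>/dev/null || echo "vmstat not available on this system"
```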

Viewing the machine load

On Linux, there are multiple commands for viewing the machine's load information, including uptime, top, and w.

The uptime command

The uptime command prints how long the system has been running along with its load averages. It displays: the current time, how long the system has been up, how many users are logged in, and the average system load over the past 1, 5, and 15 minutes.

➜  ~ uptime
13:29  up 23:41, 3 users, load averages: 1.74 1.87 1.97

The second half of this line shows "load averages", the system's average load, as three numbers we can use to judge whether the system is overloaded.

1.74 1.87 1.97 are the average system load over the last 1 minute, 5 minutes, and 15 minutes respectively, commonly referred to as load1, load5, and load15.
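On Linux, the same three numbers can also be read programmatically from /proc/loadavg, which is handy for scripts and monitoring agents (a minimal sketch; the file is Linux-specific):

```shell
# /proc/loadavg format: "load1 load5 load15 running/total last_pid"
read load1 load5 load15 rest < /proc/loadavg
echo "load1=$load1 load5=$load5 load15=$load15"
```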

The w command

The main function of the w command is to display information about currently logged-in users. Unlike who, however, w is more powerful: it also displays the current time, how long the system has been up, the number of logged-in users, and the system's load averages over the last 1, 5, and 15 minutes. For each user it shows: login account, terminal name, remote host name, login time, idle time, JCPU, PCPU, and the command line of the running process.

➜  ~ w
14:08  up 23:41, 3 users, load averages: 1.74 1.87 1.97
USER     TTY       FROM    LOGIN@   IDLE   WHAT
hollis   console   -       6:14     23:40  -
hollis   s000      -       6:14     20:24  -zsh
hollis   s001      -       6:15     -      w

From the output of w, we can see that the current system time is 14:08, the system has been up for 23 hours and 41 minutes, and three users are logged in. The load averages over the last 1, 5, and 15 minutes are 1.74, 1.87, and 1.97, the same as uptime reported. Below that is some data about the logged-in users, which we will not go into in detail.

The top command

The top command is a commonly used performance analysis tool in Linux. It displays the resource usage of each process in the system in real time, similar to the Task manager in Windows.

➜  ~ top
Processes: 244 total, 3 running, 9 stuck, 232 sleeping, 1484 threads        14:16:01
Load Avg: 1.74, 1.87, 1.97
SharedLibs: 116M resident, 16M data, 14M linkedit.
MemRegions: 66523 total, 2152M resident, 50M private, 930M shared.
PhysMem: 7819M used (1692M wired), 370M unused.
VM: 682G vsize, 533M framework vsize, 6402060(0) swapins, 7234356(0) swapouts.
Networks: packets: 383006/251M in, 334448/60M out.
Disks: 1057821/38G read, 350852/40G written.

PID    COMMAND       %CPU  TIME      #TH  #WQ  #PORT  MEM    PURG  CMPRS  PGRP   PPID  STATE     BOOSTS  %CPU_ME  %CPU_OTHRS  UID  FAULTS  COW   MSGSENT  MSGRECV  SYSBSD   SYSMACH  CSW
30845  top           3.0   00:00.49  1/1  0    21     3632K  0B    0B     30845  1394  running   *0[1]   0.00000  0.00000     0    3283+   112   203556+  101770+  8212+    119901+  823+
30842  Google Chrom  0.0   00:47.39  17   0    155    130M   0B    0B     1146   1146  sleeping  *0[1]   0.00000  0.00000     501  173746  2697  117678   37821    364228   444830   310043

In the output above, Load Avg: 1.74, 1.87, 1.97 shows the Load information.

Normal load range of the machine

How much Load is "normal" has always been controversial, and different people have different rules of thumb. For a single CPU, some consider a Load above 0.7 to be over the line; others think anything up to 1 is fine; still others find a single-CPU load below 2 acceptable.

The reason there are so many different views is that machines are affected by factors beyond the CPU, such as the programs they run, the amount of memory, and even the room temperature.

For example, some machines run a large number of scheduled batch jobs, during which the Load can be quite high, while at other times it may be much lower. So should we intervene every time the Load spikes during one of these windows?

My suggestion is to establish a baseline for the indicator based on your machine's actual behavior (such as the average over the last month). As long as the daily load stays within range of the baseline, it is acceptable; if it deviates too far from the baseline, human intervention may be required.
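As a sketch of the baseline idea (the log path and schedule are hypothetical, and /proc/loadavg is Linux-specific): append load1 to a log on a schedule, e.g. from cron, then let awk compute the mean as a rough baseline.

```shell
# Record the current load1 (run this periodically, e.g. every minute from cron).
cut -d ' ' -f1 /proc/loadavg >> /tmp/load1.log

# Compute a rough baseline (the mean) from the samples collected so far.
awk '{ sum += $1 } END { if (NR) printf "baseline=%.2f over %d samples\n", sum/NR, NR }' /tmp/load1.log
```

In practice you would also want percentiles, but even a mean makes "too much difference from the baseline" a concrete, checkable condition.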

Still, a recommended threshold is useful. Ruan Yifeng gives the following suggestions in his blog:

When system load continues to be greater than 0.7, you need to start investigating what the problem is and prevent it from getting worse.

When the system load continues to be greater than 1.0, you must find a solution to bring it down.

When the system load reaches 5.0, your system has a serious problem: it may be unresponsive for long periods or close to crashing. You should never let the system reach this value.

All of these figures are per CPU, but many of today's machines are multi-core, so whether a system is over load is generally judged relative to the number of CPUs. If 0.7 is the safety line for a single-core machine, then for a quad-core machine it is best to keep the load below 2.8 (4 * 0.7).
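That rule of thumb is easy to script. The sketch below (Linux-only, and treating 0.7 per core as an assumption rather than a standard) compares load1 against 0.7 times the core count:

```shell
cores=$(nproc)                            # number of CPU cores
load1=$(cut -d ' ' -f1 /proc/loadavg)     # 1-minute load average
awk -v l="$load1" -v c="$cores" 'BEGIN {
    t = c * 0.7   # assumed per-core safety line from the text above
    printf "load1=%.2f threshold=%.2f %s\n", l, t, (l > t ? "OVER" : "ok")
}'
```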

One more point: Load Avg reports three values, the 1-minute, 5-minute, and 15-minute system load, and all three are worth consulting when troubleshooting.

In general, the 1-minute load reflects a recent, possibly temporary phenomenon, while the 15-minute load reflects a persistent one rather than a momentary problem. If load15 is high but load1 is low, the situation is improving; conversely, if load1 is high but load15 is low, things may be getting worse.
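A minimal sketch of that trend check (Linux-only, reading /proc/loadavg as above): compare load1 against load15 and report the direction.

```shell
# Read the three load averages, then report the trend.
read load1 load5 load15 rest < /proc/loadavg
awk -v a="$load1" -v b="$load15" 'BEGIN {
    if (a < b)      print "improving (load1 below load15)"
    else if (a > b) print "worsening (load1 above load15)"
    else            print "steady"
}'
```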

How to reduce load

The reasons for the high load can be complex, ranging from hardware issues to software issues.

If it is a hardware problem, the machine's performance simply is not good enough, and the fix is straightforward: upgrade or replace the machine.

As mentioned earlier, CPU usage, memory usage, and I/O consumption can all drive the load up. If it is a software problem, for example Java threads hogging the CPU for long periods or memory being held continuously, it is advisable to check the code along the following lines:

1. Whether frequent GC is being caused by a memory leak
2. Whether any deadlocks are occurring
3.

One more suggestion: if you find the Load surging online, consider dumping the stack and heap first and then restarting, to relieve the problem temporarily, and then consider a rollback while you troubleshoot.
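A sketch of that "preserve the evidence first, then recover" sequence (the PID and file paths are hypothetical; jstack and jmap ship with the JDK):

```shell
pid=1893                        # hypothetical PID of the ailing Java process
ts=$(date +%Y%m%d%H%M%S)

# Save the thread stacks and a heap dump before touching anything:
jstack "$pid" > "/tmp/stack.$pid.$ts.txt" 2>/dev/null || echo "jstack failed (demo pid)"
jmap -dump:format=b,file="/tmp/heap.$pid.$ts.hprof" "$pid" 2>/dev/null || echo "jmap failed (demo pid)"

# ...then restart the application and analyse the dumps offline.
```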

Troubleshooting a rapid Load increase in a Java Web application

1. Use uptime to check the current load.

➜  ~ uptime
13:29  up 23:41, 3 users, load averages: 10 10 10

2. Run the top command to find the IDs of the processes using the most CPU.

➜  ~ top

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
1893 admin     20   0 7127m 2.6g  38m S 181.7 32.6  10:20.26 java

The process with PID 1893 occupies 181% of the CPU, and it is a Java process, so we can basically conclude this is a software problem.

3. Run top -Hp to see which thread in that process has the highest CPU usage

➜  ~ top -Hp 1893

PID  USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
4519 admin     20   0 7127m 2.6g  38m R 18.6 32.6   0:40.11 java

4. Use the printf command to convert the thread ID to hexadecimal

➜  ~ printf %x 4519
11a7

5. Run the jstack command to see what the thread is currently executing. (See also: Java command learning series (2) — jstack.)

➜  ~ jstack 1893 | grep -A 200 11a7
"thread-5 #500" daemon prio=10 os_prio=0 tid=0x00007f632314a800 nid=0x11a7 runnable [0x000000005442a000]
   java.lang.Thread.State: RUNNABLE
        at sun.misc.URLClassPath$Loader.findResource(URLClassPath.java:684)
        at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
        at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
        at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
        at org.hibernate.validator.internal.xml.ValidationXmlParser.getInputStreamForPath(ValidationXmlParser.java:248)
        at com.hollis.test.util.BeanValidator.validate(BeanValidator.java:30)

From the thread's stack trace above, we can see that the thread consuming the CPU is executing com.hollis.test.util.BeanValidator.validate (BeanValidator.java:30). We can then check that class for usage problems.
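Steps 2–5 can be strung together into one pass. The sketch below is a convenience built on assumptions (Linux ps with -L thread listing, jstack on the PATH, and the hypothetical $pid from the example), not a robust tool:

```shell
pid=1893   # hypothetical PID from the example above

# Busiest thread (LWP/TID) of the process, ranked by %CPU:
tid=$(ps -L -o lwp,pcpu --no-headers -p "$pid" 2>/dev/null | sort -k2 -nr | awk 'NR==1 {print $1}')

# jstack prints native thread IDs in hex in its nid= field:
nid=$(printf '%x' "${tid:-0}")

jstack "$pid" 2>/dev/null | grep -A 20 "nid=0x$nid" || echo "no such process (demo pid)"
```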

You can also use jstat (Java command learning series (4) — jstat) to check for frequent Full GC, and then use jmap (Java command learning series (3) — jmap) to dump the heap and check for a memory leak.
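For example (the PID is hypothetical; both tools ship with the JDK): jstat -gcutil samples GC statistics, and a rapidly growing FGC column suggests frequent Full GC; jmap then captures a heap dump for offline analysis in a tool such as Eclipse MAT.

```shell
# Sample GC statistics once per second, 5 times; watch the FGC/FGCT columns.
jstat -gcutil 1893 1000 5 2>/dev/null || echo "jstat: process not running (demo pid)"

# If Full GC is frequent, dump the heap and inspect it offline:
jmap -dump:format=b,file=/tmp/heap.hprof 1893 2>/dev/null || echo "jmap: process not running (demo pid)"
```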