Source: love baby, my.oschina.net/zhangxufeng/blog/3017521

Anyone who has dealt with production issues has almost certainly run into problems such as the system suddenly slowing down, CPU hitting 100%, or an excessive number of Full GCs. The visible result of all of these is the same: the system runs slowly and triggers a flood of alerts.

This article focuses on the "system running slowly" symptom, walks through a troubleshooting approach for locating the offending code, and then outlines ideas for fixing the problem.

If an online system suddenly slows down, the first thing to do is export the jstack output and a memory dump, then restart the system so that it becomes available again as quickly as possible. There are two main possible causes for this kind of slowdown:

  • Some piece of code reads a large amount of data at once, exhausting memory, which leads to an excessive number of Full GCs and slows the system down;

  • The code contains CPU-intensive operations, driving CPU usage up and slowing the system down.

These are the two most common types of online problems, and they can render the system unavailable outright. There are also several conditions that make a particular feature slow without bringing the whole system down:

  • A blocking operation at some point in the code makes the overall call time-consuming, but it occurs somewhat randomly;

  • A thread enters the WAITING state for some reason, making the feature unavailable, but the problem cannot be reproduced reliably;

  • Improper use of locks causes multiple threads to deadlock, slowing the whole system down.

In these three cases, looking at CPU and memory alone will not reveal the problem: the operations involved are blocking rather than compute- or memory-bound, so CPU and memory usage stay low even though the feature is slow. Below we work through how to identify each of these problems step by step from the system logs.

1. The Full GC count is too high

This situation is relatively common, especially when new code goes live. When Full GC runs too often, there are two main symptoms:

  • Multiple threads in production show CPU usage above 100%, and the jstack command shows that these threads are mainly garbage collection threads;

  • Monitoring GC with the jstat command shows that the Full GC count is very high and keeps increasing.

First, we can use the top command to check the CPU usage of the system. The following is an example of high CPU usage:

```
top - 08:31:10 up 30 min,  0 users,  load average: 0.73, 0.58, 0.34
KiB Mem :  2046460 total,  1923864 used,   122596 free,    14388 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free,  1192352 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    9 root      20   0 2557160 288976  15812 S  98.0 14.1   0:42.60 java
```

We can take the process ID, 9, and run the following command to check the running status of each thread in that process:

```
top -Hp 9
```

The running status of each thread under this process is as follows:

```
top - 08:31:16 up 30 min,  0 users
Threads:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.5 us,  0.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Swap:  1048572 total,        0 used,  1048572 free,  1192532 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   10 root      20   0 2557160 289824  15872 R  79.3 14.2   0:41.49 java
   11 root      20   0 2557160 289824  15872 S  13.2 14.2   0:06.78 java
```

This shows the CPU usage of each thread in the Java process with PID 9. The thread with ID 10 is using the most CPU, so we can use the jstack command to find out why.


Note that in the jstack output, thread IDs are shown in hexadecimal. You can use the following command to do the conversion, or use any calculator that supports hexadecimal:

```
root@a39de7e7934b:/# printf "%x\n" 10
a
```

This output tells us that the thread appears in the jstack output as 0xa. Running the jstack command, we see the following:

1. `"main"#1 prio=5 os_prio=0 tid=0x00007f8718009800 nid=0xb runnable [0x00007f871fe41000]` 2. ` java.lang.Thread.State:  RUNNABLE` 3. ` at com.aibaobei.chapter2.eg2.UserDemo.main(UserDemo.java:9)` 5. `"VM Thread" os_prio=0 tid=0x00007f871806e000 nid=0xa runnable`Copy the code

At the end of the "VM Thread" line we see nid=0xa, where nid is the operating system thread ID. The VM Thread is the thread that carries out garbage collection. At this point we can be fairly confident that the slowdown is mainly caused by frequent garbage collection and the resulting long GC pauses. To check the GC status, run the following command:

```
root@8d36124607a0:/# jstat -gcutil 9 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   0.00   0.00  75.07  59.09  59.60   3259    0.919  6517    7.715    8.635
  0.00   0.00   0.00   0.08  59.09  59.60   3306    0.930  6611    7.822    8.752
  0.00   0.00   0.00   0.08  59.09  59.60   3351    0.943  6701    7.924    8.867
  0.00   0.00   0.00   0.08  59.09  59.60   3397    0.955  6793    8.029    8.984
```

Here FGC is the Full GC count; it has reached 6,793 and is still growing. This further confirms that the system is slow because of a memory problem. We can then dump the heap and examine it with Eclipse's MAT tool, which presents the objects in memory as a tree structure.

After analyzing the dump in MAT, we can determine which objects are taking up most of the memory, then find where those objects are created and fix that code.

Here the biggest consumer is PrintStream, but it accounts for only 12.2% of memory. That is not enough to cause a large number of Full GCs, so we need to consider another possibility: the code, or a third-party dependency, contains explicit System.gc() calls.

Whether that is happening can be checked from the GC log, because the cause of each collection is recorded there:

```
[Full GC (System.gc()) [Tenured: 262546K->262546K(349568K), 0.0014879 secs] 262546K->262546K(506816K), [Metaspace: 3109K->3109K(1056768K)], 0.0015151 secs] [Times: ... sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [DefNew: 2795K->0K(157248K), 0.0001504 secs] [Tenured: ...] 265342K->402K(506816K), [Metaspace: ...] [Times: user=0.00 ...]
```

Here, for example, the first GC was triggered by an explicit System.gc() call, while the second was initiated by the JVM itself because of an allocation failure. To sum up, there are two main causes of an excessive number of Full GCs:

  • The code loads a large number of objects at once, exhausting memory. In this case, use Eclipse's MAT tool to check which objects dominate the heap and where they are created;

  • Memory usage is not high, but Full GC still runs frequently because of explicit System.gc() calls. This can be handled by adding -XX:+DisableExplicitGC to disable the JVM's response to explicit GC requests (a minimal sketch of this scenario follows below).
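To make the second case concrete, here is a minimal, hypothetical sketch (not code from the article) of the kind of explicit call that produces the `Full GC (System.gc())` entries shown in the GC log above. Running the same program with -XX:+DisableExplicitGC turns the System.gc() calls into no-ops, which is why that flag is the quick mitigation when the caller is buried in a third-party dependency.

```java
// Hypothetical illustration: repeated explicit System.gc() calls like these
// produce "Full GC (System.gc())" entries in the GC log even though the heap
// is nowhere near full. With -XX:+DisableExplicitGC the calls are ignored.
public class ExplicitGcDemo {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            byte[] buffer = new byte[1024 * 1024]; // some short-lived allocation
            System.gc();                           // forces a Full GC under default settings
            Thread.sleep(100);
        }
    }
}
```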

2. CPU usage is too high

In the first section we saw that high CPU usage may simply be the system performing Full GC frequently. But we can also run into genuinely time-consuming computation driving the CPU up, and the way to investigate it is very similar.

First, run the top command to find the process consuming too much CPU and note its process ID. Then use top -Hp <pid> to see which threads inside that process have high CPU usage; generally anything above about 80% counts as high, while usage around that level or below is still reasonable. This gives us the ID of the CPU-hungry thread. Converting that thread ID to hexadecimal, we can then look up the thread's stack in the jstack output.

This is where we distinguish "too many Full GCs" from "time-consuming computation in the code". If the problem is excessive Full GC, the thread that jstack resolves to will be something like the VM Thread; if the problem is a time-consuming calculation, we get the concrete stack trace of an application thread. The following is an example of the thread information for a case where a time-consuming calculation in the code drives the CPU up:

In that example, the stack shows that while UserController is handling a request, the thread's CPU stays at 100% because the controller is performing a time-consuming calculation. From the stack trace we can navigate to line 34 of UserController and see exactly which computation is responsible.
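For reference, here is a hypothetical sketch of the kind of code that shows up this way: a tight, non-blocking loop keeps the thread RUNNABLE at close to 100% CPU, and jstack points straight at the loop (the class and method names below are invented for illustration, not the article's UserController).

```java
// Hypothetical example of a CPU-bound handler; any thread executing this loop
// shows up in jstack as RUNNABLE with the stack pointing at the loop body.
public class UserCpuDemo {

    // Simulates an expensive computation, e.g. an unbounded retry or busy loop.
    public static long detail() {
        long result = 0;
        for (long i = 0; i < 10_000_000_000L; i++) { // tight loop, no blocking calls
            result += i % 7;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(detail());
    }
}
```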

3. Interface slowness that occurs irregularly

A typical example is an interface call that sometimes takes 2 to 3 seconds to return. This situation is troublesome because it usually consumes little CPU and not much memory, which means the two troubleshooting approaches above will not find it.

Moreover, since the slowness appears irregularly, even if we capture thread stacks with the jstack command we cannot tell which thread happens to be executing the time-consuming operation.

For interface slowness that appears irregularly, the troubleshooting approach is as follows: find the interface, then use a load-testing tool to keep hitting it at a high rate. If some point in the interface's code is slow, then because the request rate is high, most of the worker threads will eventually end up blocked at that choke point; once multiple threads show the same stack, we have effectively located the slow code. The following is a stack log from threads blocked on a time-consuming operation in the code:

1. `"http-nio-8080-exec-2"#29 daemon prio=5 os_prio=31 tid=0x00007fd08cb26000 nid=0x9603 waiting on condition [0x00007000031d5000]` 2. ` java.lang.Thread.State: TIMED_WAITING (sleeping)` 3. ` at java.lang.Thread.sleep(NativeMethod)` 4. ` at java.lang.Thread.sleep(Thread.java:340)`  5. ` at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)` 6. ` at com.aibaobei.user.controller.UserController.detail(UserController.java:18)` 7. 8. `"http-nio-8080-exec-3"#30 daemon prio=5 os_prio=31 tid=0x00007fd08cb27000 nid=0x6203 waiting on condition [0x00007000032d8000]` 9. ` java.lang.Thread.State: TIMED_WAITING (sleeping)` 10. ` at java.lang.Thread.sleep(NativeMethod)` 11. ` at java.lang.Thread.sleep(Thread.java:340)` 12. ` at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)` 13. ` at com.aibaobei.user.controller.UserController.detail(UserController.java:18)` 14. 15. `"http-nio-8080-exec-4"#31 daemon prio=5 os_prio=31 tid=0x00007fd08d0fa000 nid=0x6403 waiting on condition [0x00007000033db000]` 16. ` java.lang.Thread.State: TIMED_WAITING (sleeping)` 17. ` at java.lang.Thread.sleep(NativeMethod)` 18. ` at java.lang.Thread.sleep(Thread.java:340)` 19. ` at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)` 20. `at com.aibaobei.user.controller.UserController.detail(UserController.java:18)`Copy the code

As the log above shows, multiple threads are all blocked at line 18 of UserController, which tells us that this is the choke point making the interface slow.
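The article's controller code is not shown, but a minimal, hypothetical sketch of a handler that would produce stacks like the ones above might look as follows; here TimeUnit.sleep stands in for whatever slow synchronous call (a remote API, a slow SQL query, lock contention) is the real choke point.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical handler: any slow synchronous call at this point has the same
// effect; TimeUnit.sleep simply simulates it and matches the frames in the log.
public class UserControllerSketch {

    public String detail() throws InterruptedException {
        TimeUnit.SECONDS.sleep(2); // the choke point every request passes through
        return "user detail";
    }
}
```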

4. A thread enters the WAITING state

This situation is relatively rare, but it does happen, and because it cannot be reliably reproduced it is very hard to track down.

I once ran into a case like this. We used a CountDownLatch because the main thread had to wait for every parallel task to finish before continuing: multiple threads each connected to the user's mailbox to export Gmail data. One thread's connection was left hanging by the server, so that thread waited forever for a response, and in the end the main thread and the remaining threads were all stuck in the WAITING state.
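A simplified sketch of that structure, with invented names rather than the original export code, is shown below. If one worker hangs inside its blocking call, it never reaches countDown(), so the main thread stays parked in await() and appears in jstack as WAITING.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical reconstruction of the pattern described above, not the original code.
public class MailExportSketch {

    public static void main(String[] args) throws InterruptedException {
        int workers = 3;
        CountDownLatch latch = new CountDownLatch(workers);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    fetchMailbox(); // if the server never responds, this blocks forever
                } finally {
                    latch.countDown(); // never reached by a worker stuck in fetchMailbox()
                }
            });
        }

        latch.await(); // main thread shows as WAITING (parking) in jstack until all workers count down
        System.out.println("all mailboxes exported");
        pool.shutdown();
    }

    private static void fetchMailbox() {
        // stands in for the real mailbox call; imagine it has no timeout and hangs
    }
}
```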


Anyone who has read jstack output knows that, under normal conditions, most threads sit in TIMED_WAITING, and the problematic thread looks almost identical to them, which makes it very easy to be misled. The main ideas for attacking this problem are as follows:

  • Use grep to find all the threads in the jstack output that are in the TIMED_WAITING state and export them to a file, say a1.log. Here is an example of such an exported file:
"Attach Listener" #13 daemon prio=9 os\_prio=31 tid=0x00007fe690064000 nid=0xd07 waiting on condition \[0x0000000000000000\]"DestroyJavaVM" #12 prio=5 os\_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition \[0x0000000000000000\]"Thread-0" #11 prio=5 os\_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition \[0x0000700003ad4000\]"C1 CompilerThread3" #9 daemon prio=9 os\_prio=31 tid=0x00007fe68c00a000 nid=0xa903 waiting on condition \[0x0000000000000000\]Copy the code
  • After waiting for, say, 10 seconds, run jstack and grep again and export the result to another file, such as a2.log, as shown below:
"DestroyJavaVM" #12 prio=5 os\_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition \[0x0000000000000000\]"Thread-0" #11 prio=5 os\_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition \[0x0000700003ad4000\]"VM Periodic Task Thread" os_prio=31 tid=0x00007fe68d114000 nid=0xa803 waiting on conditionCopy the code
  • Repeat step 2 until you have exported three or four files, then compare them and pick out the user threads that appear in every file. These are almost certainly the problematic threads stuck in a waiting state, because a normal request thread will not still be waiting after 20 to 30 seconds.

  • Once we have these candidate threads, we examine their stack traces. If a thread is supposed to be waiting, for example an idle thread in a user-created thread pool, its stack will not contain any user-defined classes; such threads can be eliminated. The threads that remain are essentially the problematic ones we are looking for, and their stack traces tell us exactly where in the code the thread is being made to wait.

One thing to note: when deciding whether a thread is a user thread, we can go by the thread name at the front of each entry. Framework threads follow very standard naming conventions, so a thread whose name marks it as belonging to a framework can basically be ruled out. The rest, such as Thread-0 above and any custom thread names we recognize, are the ones we need to examine.

Following this process, we can conclude that Thread-0 is the thread we are looking for. Its stack trace shows exactly where it is being made to wait; in the example below, it is line 8 of SyncTask.

"Thread-0" #11 prio=5 os_prio=31 tid=0x00007f9de08c7000 nid=0x5603 waiting on condition \[0x0000700001f89000\]   java.lang.Thread.State: WAITING (parking)    at sun.misc.Unsafe.park(Native Method)    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)    at com.aibaobei.chapter2.eg4.SyncTask.lambda$main$0(SyncTask.java:8)    at com.aibaobei.chapter2.eg4.SyncTask$Lambda$1/1791741888.run(Unknown Source)    at java.lang.Thread.run(Thread.java:748)Copy the code

5. Deadlock

Deadlocks are relatively easy to spot, because jstack checks for them and prints the details of the deadlocked threads at the bottom of its output.

As the deadlock report shows, jstack tells us directly which deadlocks exist and prints the stack of each thread involved. In this example, two user threads were each waiting for the other to release its lock, and the blocked position was line 5 of ConnectTask; from there we can go straight to that code and analyze what causes the deadlock.
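The original deadlock log is not reproduced here, but a minimal, hypothetical program of the kind that triggers such a report (names invented; this is not the article's ConnectTask) looks like this:

```java
// Hypothetical deadlock demo: each thread takes one lock and then tries to take
// the other, in opposite order, so each ends up waiting for the other forever.
public class DeadlockSketch {

    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (LOCK_A) {
                sleepQuietly(100);      // give the other thread time to grab LOCK_B
                synchronized (LOCK_B) { // blocks forever: LOCK_B is held by Thread-1
                    System.out.println("never reached");
                }
            }
        }, "Thread-0").start();

        new Thread(() -> {
            synchronized (LOCK_B) {
                sleepQuietly(100);
                synchronized (LOCK_A) { // blocks forever: LOCK_A is held by Thread-0
                    System.out.println("never reached");
                }
            }
        }, "Thread-1").start();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Running jstack against a process like this prints a "Found one Java-level deadlock" section at the end of its output, naming both threads and the monitors each is waiting to lock.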

6. Summary

This article has covered five situations that can make an online system slow and analyzed the symptoms of each in detail, so that from the symptoms we can work out which cause we are dealing with. Briefly, the analysis boils down to the following steps:

  • Run the top command to check CPU usage. If CPU is high, run top -Hp <pid> to check the status of each thread in the process. After identifying the high-CPU thread, convert its thread ID to hexadecimal and look up what the thread is doing in the jstack output. There are two cases here:

  • If it is a normal user thread, its stack trace shows where in the user code it is spending its time;

  • If it is the VM Thread, run jstat -gcutil <pid> <interval> <count> to monitor the GC status of the system, then run jmap -dump:format=b,file=<filename> <pid> to export the current heap. The dump can be analyzed with Eclipse's MAT tool to find the objects that consume the most memory, and the relevant code can then be fixed.

  • If the top command shows that neither CPU nor memory usage is high, consider whether the problem is one of the other three conditions, and analyze accordingly:

  • If interface calls are slow and the slowness occurs irregularly, use load testing to increase how often the choke point is hit, then use jstack to capture stacks and find the choke point;

  • If a feature suddenly stalls and the problem cannot be reproduced, export jstack output several times and compare the results to find the user threads that stay in a waiting state; those are the likely culprits;

  • If jstack reports a deadlock, examine where the two deadlocked threads are blocked and fix the locking accordingly.

This article has presented five common problems that slow down online systems, along with ideas for troubleshooting them. Of course, online problems come in many forms and are by no means limited to these cases, but if we carefully analyze the circumstances in which a problem appears, we can usually work through it case by case and resolve it.
