General troubleshooting also revolves around memory, CPU, and a few other elements. The figure below shows the overall process for troubleshooting a performance problem; the picture speaks for itself, so no further commentary here.


Anyone who has dealt with online problems has almost certainly run into symptoms such as the system suddenly slowing down, CPU pegged at 100%, or too many Full GCs. The visible result of all of these is the same: the system runs slowly and fires a lot of alarms. This article focuses on the "system running slowly" symptom, offering a troubleshooting approach that locates the problematic code and then suggests how to fix the problem.

If an online system suddenly slows down, the first thing to do is capture the jstack output and a memory dump, and then restart the system so that it becomes available again as quickly as possible. There are two main possible causes:

  • A large amount of data is read at some point in the code, exhausting system memory, which leads to an excessive number of Full GCs and a slow system;
  • There is a CPU-intensive operation in the code, driving CPU usage up and slowing the system down;

These are the two most common types of online problems, and they can make the system outright unavailable. There are several other conditions that make a particular feature slow without bringing the whole system down:

  • There is a blocking operation at some point in the code, making calls to that feature time-consuming overall, but the occurrences are relatively random;
  • A thread enters the WAITING state for some reason, making the feature unavailable in a way that cannot be reliably reproduced;
  • Due to improper use of locks, multiple threads end up deadlocked, slowing the system down as a whole.

In these three cases, you cannot see the specific problem by looking at CPU and memory, because they involve blocking operations: CPU and memory usage are low, yet the feature is slow. Let's work through these problems step by step, starting from the system logs.

1. The Full GC count is too high

This is relatively easy to locate, especially right after a new feature goes live. When Full GCs are frequent, there are two main characteristics:

  • Several threads consume close to 100% CPU, and the jstack output shows that these threads are mainly garbage collection threads;
  • Monitoring GC with the jstat command shows that the Full GC count is very high and keeps growing.

First, we can use the top command to check the system's CPU usage. Here is an example with high CPU usage:

top - 08:31:10 up 30 min,  0 users,  load average: 0.73, 0.58, 0.34
KiB Mem :  2046460 total,  1923864 used,   122596 free,    14388 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free.  1192352 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    9 root      20   0 2557160 288976  15812 S  98.0 14.1   0:42.60 java

We can take the process ID, 9, and run the following command to check the running status of each thread in that process:

top -Hp 9

The running status of each thread under this process is as follows:

top - 08:31:16 up 30 min,  0 users,  load average: 0.75, 0.59, 0.35
Threads:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.5 us,  0.6 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2046460 total,  1924856 used,   121604 free,    14396 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free.  1192532 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   10 root      20   0 2557160 289824  15872 R 79.3 14.2   0:41.49 java
   11 root      20   0 2557160 289824  15872 S 13.2 14.2   0:06.78 java

This shows the CPU usage of each thread in the Java process with PID 9. Next we can use the jstack command to see why the thread with ID 10 consumes the most CPU. Note that in the jstack output, thread IDs are shown in hexadecimal, so the ID needs to be converted first. You can use the following command, or any calculator that supports hexadecimal:

root@a39de7e7934b:/# printf "%x\n" 10
a

The output shows that thread 10 appears in the jstack output as 0xa. Running the jstack command, we can see the following information:

"main" #1 prio=5 os_prio=0 tid=0x00007f8718009800 nid=0xb runnable [0x00007f871fe41000]
   java.lang.Thread.State: RUNNABLE
	at com.aibaobei.chapter2.eg2.UserDemo.main(UserDemo.java:9)

"VM Thread" os_prio=0 tid=0x00007f871806e000 nid=0xa runnable

At the end of the "VM Thread" line we see nid=0xa, where nid is the operating-system thread ID. "VM Thread" is the JVM's internal thread that performs garbage collection. At this point we can be fairly confident that the system is slow mainly because of frequent garbage collection and the long GC pauses it causes. To check the GC status, run the following command:

root@8d36124607a0:/# jstat -gcutil 9 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   0.00   0.00  75.07  59.09  59.60   3259    0.919  6517    7.715    8.635
  ...
  0.00   0.00   0.00   0.08  59.09  59.60   3397    0.955  6793    8.029    8.984

As you can see, FGC is the Full GC count; here it is 6,793 and still growing. This further confirms that the system is slow because of memory problems that trigger frequent Full GCs.

Now that the memory problem is confirmed, the next question is which objects are causing it. For that we can dump the heap:

jmap -dump:format=b,file=jmap.log -F 9

The dump can then be opened with Eclipse's MAT (Memory Analyzer Tool); the screenshot below shows the resulting object tree:

From the MAT analysis we can usually identify which objects occupy the most memory and then find where those objects are created. Here the main consumer is PrintStream, but it accounts for only 12.2% of memory. That is not enough to explain so many Full GCs, so we need to consider another possibility: the code, or a third-party dependency, makes explicit System.gc() calls. In that case, the GC log captured when we dump memory looks like the following:

[Full GC (System.gc()) [Tenured: 262546K->262546K(349568K), 0.0014879 secs] 262546K->262546K(506816K), [Metaspace: 3109K->3109K(1056768K)], 0.0015151 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [DefNew: 2795K->0K(157248K), 0.0001504 secs] [Tenured: 262546K->402K(349568K), 0.0012949 secs] 265342K->402K(506816K), [Times: user=0.00 ...

Here, for example, the first GC is the result of an explicit System.gc() call, while the second one is initiated by the JVM itself. To sum up, there are two main causes of an excessive Full GC count:

  • The code fetches a large number of objects in one go, exhausting memory; in this case Eclipse's MAT tool can show which objects dominate the heap;
  • Memory usage is not high, but Full GCs are still frequent; this may come from explicit System.gc() calls, and adding -XX:+DisableExplicitGC stops the JVM from honoring explicit GC requests (a small sketch of this case follows this list).
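To make the second case concrete, here is a minimal, hypothetical sketch (class name and values are invented) of code that triggers Full GCs purely through explicit System.gc() calls; starting the JVM with -XX:+DisableExplicitGC turns those calls into no-ops, and running with -verbose:gc should show entries like the "Full GC (System.gc())" lines above:

// Hypothetical sketch: the allocations are tiny and short-lived, so memory
// pressure is low, yet every loop iteration requests a Full GC explicitly.
public class ExplicitGcDemo {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            byte[] buffer = new byte[1024 * 1024]; // small, short-lived allocation
            System.gc();                           // explicit GC request -> Full GC
            Thread.sleep(100);
        }
    }
}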

2. The CPU is too high

In the first section we mentioned that frequent Full GCs drive CPU up and slow the system down. But we can also run into the opposite case: a time-consuming computation pushes CPU usage up, and the investigation looks very similar. First, run the top command to find which process consumes too much CPU and note its PID. Then use top -Hp <pid> to find the threads in that process with high CPU usage; generally anything above 80% counts as high, while around 80% or below can still be reasonable. That gives us the ID of the CPU-hungry thread, and we can then look up that thread's stack in the jstack output using the hexadecimal form of its ID.

This is where we distinguish "too many Full GCs" from "a time-consuming computation in the code". If Full GCs are the problem, the busy thread that jstack reports is something like "VM Thread"; if there is a time-consuming computation in the code, we instead get the concrete stack of an application thread. Below is an example of the thread information for code whose time-consuming computation drives the CPU too high:

As you can see, when UserController is requested, the thread's CPU stays at 100% because the controller performs a time-consuming computation. Based on the stack information we can navigate to line 34 of UserController and see exactly what in the code is so expensive.
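As a rough illustration only (the request mapping and loop are invented; only the controller name comes from the example above), a CPU-bound handler like the following would keep its worker thread RUNNABLE at close to 100% CPU, and jstack would point straight at the loop:

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sketch of a time-consuming computation inside a controller.
// The thread executing this loop stays RUNNABLE and pins a core, so in the
// jstack output its stack shows this method instead of "VM Thread".
@RestController
public class UserController {

    @GetMapping("/user/compute")
    public String compute() {
        long result = 0;
        for (long i = 0; i < 10_000_000_000L; i++) { // busy loop, no blocking
            result += i * i;
        }
        return String.valueOf(result);
    }
}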

3. Interface calls are time-consuming and occur irregularly

A typical example is an interface that often takes 2 to 3 s to return. This is a troublesome situation because it generally does not consume much CPU, and memory usage is not high either, which means the two troubleshooting approaches above will not find it. Worse, the slowness is not periodic, so even if we capture thread stacks with jstack we cannot tell which thread is executing the time-consuming operation.

For interface slowness that is serious but occurs irregularly, the troubleshooting idea is as follows: first find the interface, then hit it continuously with a load-testing tool. If some position in the interface is slow, then because we are calling it at a very high rate, most threads will eventually block at that choke point; once many threads show the same stack, we have basically located the time-consuming code in the interface. Below is the stack of threads blocked on a time-consuming operation in the code:

"http-nio-8080-exec-2" #29 daemon prio=5 os_prio=31 tid=0x00007fd08cb26000 nid=0x9603 waiting on condition [0x00007000031d5000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at java.lang.Thread.sleep(Thread.java:340)
	at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
	at com.aibaobei.user.controller.UserController.detail(UserController.java:18)

"http-nio-8080-exec-3" #30 daemon prio=5 os_prio=31 tid=0x00007fd08cb27000 nid=0x6203 waiting on condition [0x00007000032d8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at java.lang.Thread.sleep(Thread.java:340)
	at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
	at com.aibaobei.user.controller.UserController.detail(UserController.java:18)

"http-nio-8080-exec-4" #31 daemon prio=5 os_prio=31 tid=0x00007fd08d0fa000 nid=0x6403 waiting on condition [0x00007000033db000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at java.lang.Thread.sleep(Thread.java:340)
	at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
	at com.aibaobei.user.controller.UserController.detail(UserController.java:18)

As you can see from the log above, multiple threads are blocked at line 18 of UserController, which tells us that this is the choke point making the interface slow.
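For reference, a choke point producing exactly the stacks above could be as simple as the following hypothetical handler (the sleep stands in for a slow downstream call such as a database query or RPC; the mapping path and return value are invented):

import java.util.concurrent.TimeUnit;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sketch of a blocking choke point: every request thread parks
// here in TIMED_WAITING, so under load many http-nio-8080-exec-* threads
// pile up with identical stacks, just like in the log above.
@RestController
public class UserController {

    @GetMapping("/user/detail")
    public String detail() throws InterruptedException {
        TimeUnit.SECONDS.sleep(2);  // stands in for a slow DB/RPC call
        return "ok";
    }
}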

4. A thread enters the WAITING state

This situation is relatively rare, but it does happen, and because it cannot be reliably reproduced it is very hard to track down. I once ran into a similar case: we used a CountDownLatch so that the main thread would only continue once every parallel task had finished. We were exporting users' Gmail data, one thread per mailbox connection, and one thread's connection was left hanging by the server, so that thread waited for a response forever. As a result, our main thread and the remaining threads all sat in the WAITING state.

As anyone who has read jstack logs knows, most threads in a normal system are in TIMED_WAITING, and the problematic thread here is in a state that looks very similar, which makes it easy to misjudge. The main approach to this problem is as follows:

  • Grep the jstack log for all threads in the TIMED_WAITING state and export them to a file, say a1.log. Here is an example of such an exported file:
"Attach Listener" #13 daemon prio=9 os_prio=31 tid=0x00007fe690064000 nid=0xd07 waiting on condition [0x0000000000000000]
"DestroyJavaVM" #12 prio=5 os_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition [0x0000000000000000]
"Thread-0" #11 prio=5 os_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition [0x0000700003ad4000]
"C1 CompilerThread3" #9 daemon prio=9 os_prio=31 tid=0x00007fe68c00a000 nid=0xa903 waiting on condition [0x0000000000000000]
  • After waiting a while, say 10 seconds, grep the jstack output again and export it to another file, say a2.log, as shown below:
"DestroyJavaVM" #12 prio=5 os_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition [0x0000000000000000]
"Thread-0" #11 prio=5 os_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition [0x0000700003ad4000]
"VM Periodic Task Thread" os_prio=31 tid=0x00007fe68d114000 nid=0xa803 waiting on condition
  • Repeat step 2. Once 3 or 4 files have been exported, compare them and pick out the user threads that appear in every file. These almost certainly include the problematic threads stuck in a waiting state, because a normal request thread does not wait for 20 to 30 seconds.
  • With these threads in hand, we can examine their stack information. If a thread is supposed to be waiting, for example an idle thread in a user-created thread pool, its stack will not contain any user-defined classes, so it can be ruled out. The remaining threads are essentially the problematic ones we are looking for, and from their stacks we can work out exactly where the code leaves them waiting.

One point worth making here: when judging whether a thread is a user thread, look at the thread name at the front of the entry, because framework threads follow very standard naming conventions. A thread whose name shows it belongs to some framework can basically be ruled out; the rest, such as Thread-0 above and any custom names we recognize, are the ones we need to examine.

After filtering in this way, we can basically conclude that Thread-0 is the thread we are looking for, and its stack information shows the exact position that leaves it waiting: in the example below, line 8 of SyncTask is the cause.

"Thread-0" #11 prio=5 os_prio=31 tid=0x00007f9de08c7000 nid=0x5603 waiting on condition [0x0000700001f89000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
	at com.aibaobei.chapter2.eg4.SyncTask.lambda$main$0(SyncTask.java:8)
	at com.aibaobei.chapter2.eg4.SyncTask$$Lambda$1/1791741888.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)
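A minimal sketch of the scenario described in this section (the task count and names are invented; only the SyncTask class name comes from the stack above): if one worker never reaches countDown(), the latch never opens and the main thread stays in the WAITING state indefinitely:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch: one worker hangs (simulated here with LockSupport.park(),
// standing in for a connection the server never answers), so the latch never
// reaches zero and the main thread waits forever on await().
public class SyncTask {
    public static void main(String[] args) throws InterruptedException {
        int tasks = 3;
        CountDownLatch latch = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            new Thread(() -> {
                if (id == 0) {
                    LockSupport.park();  // this worker never returns
                }
                latch.countDown();
            }, "Thread-" + id).start();
        }
        latch.await();                   // main thread stuck in WAITING (parking)
        System.out.println("all tasks finished");
    }
}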

5. A deadlock

Deadlocks are basically the easiest case to spot, because jstack checks for them and prints the details of the deadlocked threads at the end of its output. Here is part of a jstack log from a program with a deadlock:

As you can see, at the bottom of the jstack log it tells us directly which deadlocks exist and gives the stack of every thread involved. Here two user threads are each waiting for the lock held by the other, and the blocked position is line 5 of ConnectTask. We can go straight to that location and analyze the code to find the cause of the deadlock.
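A minimal sketch of how such a deadlock arises (only the ConnectTask name comes from the example; the two locks and the sleeps are invented): two threads acquire the same pair of locks in opposite order, so each ends up holding one lock while waiting for the other. Running jstack against this process prints a "Found one Java-level deadlock" section at the end, as described above:

// Hypothetical sketch of a lock-ordering deadlock.
public class ConnectTask {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {
                sleep(100);                // give the other thread time to grab lockB
                synchronized (lockB) { }   // blocks forever waiting for lockB
            }
        }, "Thread-A").start();

        new Thread(() -> {
            synchronized (lockB) {
                sleep(100);                // give the other thread time to grab lockA
                synchronized (lockA) { }   // blocks forever waiting for lockA
            }
        }, "Thread-B").start();
    }

    private static void sleep(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}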

6. Common tools: jps, jstack, jmap, jhat, jstat, hprof

1. jps (Java Virtual Machine Process Status Tool)

-q suppresses the class name, jar name, and arguments passed to main; -m prints the arguments passed to main; -l prints the fully qualified name of the main class or the jar path; -v prints the arguments passed to the JVM.

2. jstack, mainly used to view the thread stacks of a Java process.

-l means "long listing" and prints additional lock information; when a deadlock occurs, jstack -l <pid> can be used to observe which locks are held.

3. jmap (Memory Map) and jhat (Java Heap Analysis Tool): jmap exports the heap, and jhat is used with it to inspect heap memory usage.

4. jmap -permstat pid

This prints the process's class loaders and the permanent-generation objects loaded by each of them. The output includes the class loader name, whether the loader object is alive (not reliable), its address, its parent class loader, and the number and size of classes it has loaded, as shown below:



Use jmap -heap <pid> to view the process's heap usage, including the GC algorithm in use, the heap configuration parameters, and the memory usage of each generation. For example:

root@ubuntu:/# jmap -heap 21711
Attaching to process ID 21711, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.10-b01

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 2067791872 (1972.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 6422528 (6.125MB)
   used     = 5445552 (5.1932830810546875MB)
   free     = 976976 (0.9317169189453125MB)
   84.78829520089286% used
From Space:
   capacity = 131072 (0.125MB)
   used     = 98304 (0.09375MB)
   free     = 32768 (0.03125MB)
   75.0% used
To Space:
   capacity = 131072 (0.125MB)
   used     = 0 (0.0MB)
   free     = 131072 (0.125MB)
   0.0% used
PS Old Generation
   capacity = 35258368 (33.625MB)
   used     = 4119544 (3.9287033081054688MB)
   free     = 31138824 (29.69629669189453MB)
   11.683876009235595% used
PS Perm Generation
   capacity = 52428800 (50.0MB)
   used     = 26075168 (24.867218017578125MB)
   free     = 26353632 (25.132781982421875MB)
   49.73443603515625% used
...

Use jmap -histo[:live] pid to view a histogram of object counts and sizes in the heap; with the :live suboption, only live objects are counted. For example:

root@ubuntu:/# jmap -histo:live 21711 | more 
 num     #instances         #bytes  class name
----------------------------------------------
   1:         38445        5597736  <constMethodKlass>
   2:         38445        5237288  <methodKlass>
   3:          3500        3749504  <constantPoolKlass>
   4:         60858        3242600  <symbolKlass>
   5:          3500        2715264  <instanceKlassKlass>
   6:          2796        2131424  <constantPoolCacheKlass>

Another common scenario is to dump the process's memory into a file with jmap and then analyze it with jhat. The jmap dump command has the following format:

jmap -dump:format=b,file=dumpFileName pid

The dump file can also be viewed with MAT, VisualVM, and other tools; here we use jhat:

root@ubuntu:/# jhat -port 9998 /tmp/dump.dat
Reading from /tmp/dump.dat...
Dump file created Tue Jan 28 17:46:14 CST 2014
Snapshot read, resolving...
Resolving 132207 objects...
Chasing references, expect 26 dots..........................
Eliminating duplicate references..........................
Snapshot resolved.
Started HTTP server on port 9998
Server is ready.

5. jstat (JVM Statistics Monitoring Tool)

The usage is jstat <option> <vmid> [interval] [count]. The vmid is the ID of the Java virtual machine, which on Linux/Unix is generally the process ID; interval is the sampling interval, and count is the number of samples. For example, the following output shows GC information sampled 4 times at a 250 ms interval:

root@ubuntu:/# jstat -gc 21711 250 4 
S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
192.0  192.0   64.0   0.0    6144.0   1854.9   32000.0     4111.6   55296.0 25472.7    702    0.431   3      0.218    0.649
192.0  192.0   64.0   0.0    6144.0   1972.2   32000.0     4111.6   55296.0 25472.7    702    0.431   3      0.218    0.649
192.0  192.0   64.0   0.0    6144.0   1972.2   32000.0     4111.6   55296.0 25472.7    702    0.431   3      0.218    0.649
192.0  192.0   64.0   0.0    6144.0   2109.7   32000.0     4111.6   55296.0 25472.7    702    0.431   3      0.218    0.649

To see what the above columns mean, look at the JVM heap memory layout:

As can be seen:

Heap memory = Young generation + Old generation + Permanent generation
Young generation = Eden region + two Survivor regions (From and To)

Now explain the meaning of each column:

S0C, S1C, S0U, S1U: capacity and usage of Survivor 0 and Survivor 1
EC, EU: capacity and usage of Eden
OC, OU: capacity and usage of the old generation
PC, PU: capacity and usage of the permanent generation
YGC, YGCT: Young GC count and total Young GC time
FGC, FGCT: Full GC count and total Full GC time
GCT: total GC time
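For example, reading the sample output above: the total GC time checks out as GCT = YGCT + FGCT = 0.431 + 0.218 = 0.649 s, and the average Full GC pause is FGCT / FGC = 0.218 / 3 ≈ 0.073 s. Compare that with the 6,793 Full GCs seen in section 1 to get a feel for what a healthy process looks like.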

7. Summary


This article has explained five situations that can make an online system slow, analyzing the symptoms of each in detail so that the cause can be located from what is observed. In brief, online log analysis boils down to the following steps:

  • Use the top command to check CPU usage. If CPU usage is high, use top -Hp <pid> to view the running state of the threads in that process, find the thread with the highest CPU usage, convert its thread ID to hexadecimal, and then look up what that thread is mainly doing in the jstack output. There are two cases here:
    • If it is a normal user thread, use its stack to see where in the user code it is spending time;
    • If the thread is VM Thread, monitor the GC status with jstat -gcutil <pid> <period> <times>, and export the current memory with jmap -dump:format=b,file=<filepath> <pid>. The dump can then be analyzed in Eclipse's MAT tool to find the objects that consume the most memory, and the related code can be fixed.
  • If top shows that CPU is not high and memory usage is low, consider whether the problem is one of the other three cases, and analyze according to the specific situation:
    • If interface calls are time-consuming and occur irregularly, increase the frequency at which the choke point is hit by load testing, then use jstack to inspect the stacks and find the choke point;
    • If a feature suddenly stalls and the situation cannot be reproduced, export the jstack output several times and compare which user threads stay in the waiting state the whole time; these are the potentially problematic threads;
    • If jstack shows a deadlock, examine the specific choke points of the two threads that caused it and handle the problem accordingly.

This article has presented five common causes of slow online features, together with troubleshooting ideas for each. Of course, online problems come in many forms and are by no means limited to these situations. If we carefully analyze the circumstances under which a problem appears, we can work case by case and solve it accordingly.