1. Introduction

For Internet companies, sudden CPU spikes on production systems are a common problem (for example, when traffic surges at the start of a promotional event). By following the steps in this article, such a problem can often be pinned down in about a minute. The troubleshooting method is written up here for reference.

2. Reproducing the Problem

The online system suddenly slows down, CPU usage spikes to 100%, and Full GCs become far too frequent, followed by a flood of alerts such as interface timeout alarms. At this point, rapid online troubleshooting is needed.

3. Troubleshooting

Whatever the underlying cause, when the CPU surges you should first look at the threads that are consuming CPU and then check the GC situation.

3.1 Core Troubleshooting Procedure

  1. Run the top command to view all processes sorted by CPU usage. In most cases the first entry will be our Java process (see the COMMAND column); the PID column shows its process ID.

  2. Run top -Hp <process ID> to view the CPU usage of every thread inside that Java process.

  3. Thread IDs (nid) in the jstack output are shown in hexadecimal, so convert the thread ID found in Step 2 to hex with printf. For example, printf "%x\n" 10 prints a, so a thread with ID 10 appears as nid=0xa in the jstack output.

  4. Run jstack <process ID> | grep <hex thread ID> to find the state and stack of the thread with that nid in the jstack output, for example a line such as: os_prio=0 tid=0x00007f871806e000 nid=0xa runnable

  5. Run jstat -gcutil <process ID> <interval in milliseconds> <number of samples> (if the count is omitted, sampling continues indefinitely) to watch how the process's GC statistics change over time. If the FGC column in the output is large and keeps increasing, the process is doing frequent Full GCs. You can also run jmap -heap <process ID> to check whether the heap is close to overflowing, in particular whether old-generation usage has reached the threshold (which depends on the garbage collector and the startup configuration) at which a Full GC is triggered.

  6. Run jmap -dump:format=b,file=<filename> <process ID> to dump the process's heap to a file, then open the dump with Eclipse's MAT tool to see which objects take up the most memory.

3.2 Cause Analysis

1. Excessive memory consumption leads to too many Full GCs

Perform Steps 1-5.

  • Multiple threads are consuming a lot of CPU (the process as a whole is well above 100%), and jstack shows that these threads are mainly garbage collection threads (Step 2 in the previous section).

  • Monitoring GC with the jstat command shows that the Full GC count is very high and keeps increasing (Step 5 in the previous section).

Once frequent Full GC is confirmed, find the specific cause:

  • A large number of objects are being created and exhausting memory. Go to Step 6 and check which objects occupy the heap (a sketch of this scenario follows this list).

  • Explicit calls to System.gc() in the code trigger extra Full GCs; adding -XX:+DisableExplicitGC disables the JVM's response to such explicit GC requests.
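
As an illustration of the first scenario, here is a minimal, hypothetical sketch (an assumed example, not taken from the article's system) of code that keeps every allocation reachable, so the old generation fills up and Full GC runs again and again:

    import java.util.ArrayList;
    import java.util.List;

    public class FullGcDemo {
        // A static reference keeps every allocation reachable, so GC can never free it.
        private static final List<byte[]> CACHE = new ArrayList<>();

        public static void main(String[] args) throws InterruptedException {
            while (true) {
                // Allocate 1 MB per iteration: the old generation fills up,
                // Full GC frequency climbs, and the FGC column in jstat keeps growing.
                CACHE.add(new byte[1024 * 1024]);
                Thread.sleep(10);
            }
        }
    }

Running jstat -gcutil against such a process shows the FGC column increasing steadily, which is exactly the signal described in Step 5; the heap dump from Step 6 then reveals the byte arrays (and whatever holds them) dominating memory.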

2. CPU-intensive operations in the code cause high CPU usage and make the system run slowly.

Perform Steps 1-4: the jstack output from Step 4 points directly at the offending line of code, for example a complex algorithm, an algorithm bug, infinite recursion, and so on. A sketch is shown below.
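
A minimal, hypothetical sketch (an assumed example, not from the article) of the kind of code this uncovers: a busy loop that never blocks, so its thread stays RUNNABLE, pegs one core, and shows up at the top of top -Hp and in the jstack output.

    public class CpuBurnDemo {
        public static void main(String[] args) {
            long counter = 0;
            // A busy loop with no blocking call: the thread never yields,
            // stays RUNNABLE, and jstack points straight at this line.
            while (true) {
                counter += System.nanoTime() % 7;
            }
        }
    }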

3. Deadlock occurs due to improper lock usage.

Perform Steps 1-4: if a deadlock has occurred, jstack prints an explicit message (look for the keyword "deadlock") together with the location in the business code where the threads are stuck (Step 4).

Typical deadlock cause: two threads each hold a lock and wait for the lock held by the other, as in the sketch below.
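
A minimal, hypothetical sketch of such a deadlock (an assumed example, not from the article): each thread takes one lock and then waits forever for the lock held by the other.

    public class DeadlockDemo {
        private static final Object LOCK_A = new Object();
        private static final Object LOCK_B = new Object();

        public static void main(String[] args) {
            // thread-1 takes A then B; thread-2 takes B then A: a classic lock-order deadlock.
            new Thread(() -> {
                synchronized (LOCK_A) {
                    sleepQuietly(100);
                    synchronized (LOCK_B) { System.out.println("thread-1 done"); }
                }
            }, "thread-1").start();

            new Thread(() -> {
                synchronized (LOCK_B) {
                    sleepQuietly(100);
                    synchronized (LOCK_A) { System.out.println("thread-2 done"); }
                }
            }, "thread-2").start();
        }

        private static void sleepQuietly(long millis) {
            try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
        }
    }

Running jstack against this process prints a "Found one Java-level deadlock" section that names both threads and the monitors they are waiting for.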

4. Interface access becomes slow for a large number of threads, but only intermittently.

A blocking operation somewhere in the code makes the whole call take a long time, but it shows up fairly randomly; CPU consumption is usually modest and memory usage is not high either.

Approach:

First locate the interface, then keep increasing the load on it with a stress-testing tool; a large number of threads will pile up at the blocking point.

Perform Steps 1-4.

"http-nio-8080-exec-4" #31 daemon prio=5 os_prio=31 tid=0x00007fd08d0fa000 nid=0x6403 waiting on condition [0x00007000033db000]java.lang.Thread.State: Sleep (Native Method) at java.lang.thread. sleep(thread.java :340) at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) at Com. *. User. Controller. UserController. Detail (UserController. Java: 18) - "business code block pointsCopy the code

In the output above we find the blocking point in the business code: it calls TimeUnit.sleep(), which puts the thread into the TIMED_WAITING state. A hypothetical reconstruction of such a controller is sketched below.
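
For illustration, a minimal, hypothetical reconstruction of what such a controller method might look like; the class and method names follow the stack trace above, but the method body (and any framework wiring, which is omitted here) is assumed:

    import java.util.concurrent.TimeUnit;

    // Hypothetical reconstruction of the controller seen in the stack trace above.
    public class UserController {

        public String detail(String userId) throws InterruptedException {
            // Blocking call on the request path: every request thread that reaches
            // this line sits in TIMED_WAITING and the interface slows down under load.
            TimeUnit.SECONDS.sleep(3);
            return "detail for " + userId;
        }
    }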

5. A thread enters the WAITING state for some reason, making a function unavailable, but the problem cannot be reproduced on demand.

Perform Steps 1-4: run jstack several times at roughly 30-second intervals and compare the dumps, looking for threads that remain in the WAITING state (parked) across all of them.

For example, a CountDownLatch countdown latch makes the related threads wait: CountDownLatch -> AQS -> LockSupport.park(). A sketch follows the stack trace below.

"Thread-0" #11 prio=5 os_prio=31 tid=0x00007f9de08c7000 nid=0x5603 waiting on condition [0x0000700001f89000]   java.lang.Thread.State: Unsafe.park(Native Method) at sun.misc.Unsafe java.util.concurrent.locks.LockSupport.park(LockSupport.java:304) at com.*.SyncTask.lambda$main$0(synctask.java :8)- "business code choke point at com.*.SyncTask$$LambdaThe $1/1791741888.run(Unknown Source)    
at java.lang.Thread.run(Thread.java:748)Copy the code
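
A minimal, hypothetical sketch of the SyncTask scenario implied by this stack trace (the class name follows the trace, the body is assumed): a worker thread waits on a CountDownLatch whose countDown() is never called, so it parks inside AQS and stays in the WAITING state indefinitely.

    import java.util.concurrent.CountDownLatch;

    // Hypothetical sketch of the SyncTask scenario from the stack trace above.
    public class SyncTask {

        public static void main(String[] args) throws InterruptedException {
            CountDownLatch latch = new CountDownLatch(1);

            Thread worker = new Thread(() -> {
                try {
                    // countDown() is never called, so this thread parks via
                    // AQS/LockSupport and remains WAITING until the JVM exits.
                    latch.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            worker.join();  // the program hangs here, reproducing the symptom
        }
    }

Comparing several jstack dumps taken 30 seconds apart, as suggested above, shows this thread parked at the same frame every time, which is the clue that it is stuck rather than merely busy.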

4. Summary

Follow the six steps in Section 3.1 and you will find the problem in most cases.


BLOG address: www.liangsonghua.com

Follow the WeChat official account "Songhua said" for more great content!

About the account: we share technical insights gained from working at JD, as well as Java technology and industry best practices, most of which are practical, easy to understand, and reproducible.