This is the fifth day of my participation in the Gwen Challenge.

Recently, the CPU of our test service suddenly hit 100%. After the service was restarted, the next request drove the CPU straight back up to 100%. Since we occasionally run into 100% CPU issues, I'll take this opportunity to share the steps for troubleshooting this kind of problem, using this incident as the example.

Symptom: a tester suddenly reported that the XXX service was unavailable. I checked the registry (Nacos) and found that the service was no longer in the available list, so I logged on to the machine and confirmed that the request log was indeed no longer being written. Then I ran:

ps -ef | grep 'java'

The Java process was still alive, so the service itself had not exited. Next I ran:

top

I found that the CPU usage of the Java process was at 100%. To avoid affecting normal operation, I restarted the service first. After the restart the CPU was back to normal, and calling the corresponding interface locally with Postman also worked fine. I asked the testers to continue their work while I prepared to look into the cause, but soon they reported that the service had hung again. I logged on to the server once more and the Java process was again at 100% CPU. My preliminary guess was that it was related to the testers' request parameters, because my local requests, including Postman load tests, were all normal. I wanted to verify what the Java process was actually doing while it was saturating the CPU without processing any business logic, and why a single test request could have such an impact, so I started the following troubleshooting process:

  1. Check whether there is any obvious exception in the stack information at this point. The top command shows the PID of the Java process; export its current stack to a file for analysis:

jstack -l pid > stack.txt

After checking and analyzing the stack information, there was no obvious error (the main thing to check is whether there is a deadlock). If there is a deadlock, the stack will contain words like the following:
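For reference, jstack reports a deadlock in a section that begins with "Found one Java-level deadlock:", followed by the threads involved and the monitors they hold. A minimal, hypothetical Java snippet that would produce such a report (the thread and lock names are made up purely for illustration):

public class DeadlockDemo {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();
    public static void main(String[] args) {
        // Thread-A takes lockA, then waits forever for lockB
        new Thread(() -> { synchronized (lockA) { pause(); synchronized (lockB) { } } }, "Thread-A").start();
        // Thread-B takes lockB, then waits forever for lockA
        new Thread(() -> { synchronized (lockB) { pause(); synchronized (lockA) { } } }, "Thread-B").start();
    }
    private static void pause() {
        try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
    }
}

Running jstack -l against this process would show both threads as BLOCKED and print the "Found one Java-level deadlock:" section at the end of the dump.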

  2. The stack showed no significant exception, so the next step was to find out which thread ID corresponds to the CPU overload and what that thread is doing. The following command was executed:

top -H -p pid

You can see output like the following:

We can see that the thread ID (PID column) consuming all the CPU is 17880. We convert this thread ID to hexadecimal, because thread IDs (nid) in the jstack output are printed in hexadecimal. Execute the command:

printf "%x\n" 17880

This prints 45d8. With this hexadecimal thread identifier we run the following command (17850 is the Java process PID) to see what the thread is doing:

jstack 17850 | grep 45d8 -A 30

You can see the following:

In the output we can see that the thread is doing some database work and that the GC threads are busy. A simple query should not normally saturate the CPU, so we suspect the GC load is too high and needs further investigation. Execute the following command:

jstat -gcutil 17850  1000

You may see the following screen:

Let’s start by explaining what each column means:

  • S0: usage of Survivor space 0, as a percentage of its capacity
  • S1: usage of Survivor space 1, as a percentage of its capacity
  • E: usage of the Eden space, as a percentage of its capacity
  • O: usage of the old generation, as a percentage of its capacity
  • M: usage of the Metaspace, as a percentage of its capacity
  • CCS: usage of the compressed class space, as a percentage of its capacity
  • YGC: number of young-generation garbage collections
  • YGCT: total time spent in young-generation garbage collections
  • FGC: number of full garbage collections
  • FGCT: total time spent in full garbage collections
  • GCT: total time spent in garbage collection

The command above prints the current garbage collection statistics every second (1000 ms). Watching it, we found that FGC kept increasing every second, FGCT was very long, and S1 and O were at 100% usage. So we can conclude that the CPU is saturated because the JVM keeps triggering full GC, while the objects are still strongly referenced and cannot be collected. The next question is what causes such frequent GC. For that we bring out the big gun: dump the entire heap for analysis. Exporting the heap is very time-consuming and analyzing it with MAT requires a lot of memory, which is why it is a last resort. So the following command was executed:

jmap -dump:format=b,file=/tmp/e.bin pid

The preceding command exports the heap to a file named e.bin in the /tmp directory. This file is typically very large; in this case it was nearly 5 GB. Run the sz command to download it to your local machine for analysis.

  3. Use the MAT software to analyze the memory information in the dump
  • Installing MAT

Download address: www.eclipse.org/mat/downloa… Download the latest MAT build for your operating system.

  • Configuring MAT

Because analyzing a dump file consumes a lot of memory, MAT needs to be configured first. Find the MemoryAnalyzer.ini file in the installation directory; it contains an -Xmx parameter that sets the maximum memory MAT may use. The default is 1024 MB and can be raised according to the size of the heap dump; this time we raise it to 8 GB. For example, on macOS the file is at /Applications/mat.app/Contents/Eclipse/MemoryAnalyzer.ini; replace the existing heap parameters (here -Xms2048m -Xmx2048m) so the file looks like this:

-startup
../Eclipse/plugins/org.eclipse.equinox.launcher_1.5.700.v20200207-2156.jar
--launcher.library
../Eclipse/plugins/org.eclipse.equinox.launcher.cocoa.macosx.x86_64_1.1.1100.v20190907-0426
-vmargs
-Xms8192m
-Xmx8192m
-Dorg.eclipse.swt.internal.carbon.smallFonts
-XstartOnFirstThread
  • Using MAT for analysis

Open MAT and import the roughly 5 GB file downloaded earlier; you will see the following interface:

Generally, two functions are used the most, shown in the red box area in the figure above: 1. Histogram 2. Leak Suspects. The Histogram view shows the relationships between classes and objects, and between objects themselves, which helps locate which objects are still alive after full GC and which take up most of the memory. Opening the Histogram lists the number of instances of each class, supports regular-expression searches, and can calculate the retained size of all objects of a class.

Shallow heap is the size of the memory used by the object itself, not including the objects it references, so it is of limited use in actual analysis. The shallow size of a regular (non-array) object is determined by the number and types of its member variables; the shallow size of an array is determined by the element type (object or primitive) and the array length. Object members are only references; the real data lives elsewhere on the heap, for example in byte[], char[] or int[] arrays, so the object itself occupies very little memory.

Retained heap is calculated by summing the shallow sizes of all objects in the retained set, i.e. the set of objects that would be reclaimed by the GC if the object itself were reclaimed. Put differently, it is the amount of heap that would be freed (including recursively freed objects) because object X is freed. Example: an ArrayList holds 100 objects of 16 bytes each. If the list is reclaimed, those 100 objects can also be reclaimed, freeing 16 * 100 + X bytes, where X is the shallow size of the ArrayList itself. The retained heap therefore reflects the actual size of an object much more accurately.

Leak Suspects is the view that points out possible memory leaks.
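To make the shallow/retained distinction concrete, here is a small, hypothetical Java sketch mirroring the ArrayList example above (the 16-byte figure is only a rough approximation of a small object's shallow size; exact sizes depend on the JVM):

import java.util.ArrayList;
import java.util.List;

public class RetainedHeapDemo {
    // A small object whose shallow size is roughly 16 bytes
    // (object header plus one int field, depending on the JVM).
    static class Item {
        int value;
        Item(int value) { this.value = value; }
    }

    public static void main(String[] args) {
        List<Item> list = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            list.add(new Item(i));
        }
        // Shallow heap of 'list': just the ArrayList object itself.
        // Retained heap of 'list': the ArrayList, its internal Object[]
        // and the 100 Item objects, since they all become unreachable
        // (and thus collectible) once 'list' is reclaimed.
        System.out.println("items held: " + list.size());
    }
}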

Then there is the description of Problem Suspect 1, which lists some of the largest instances.

Click Details to see the details:

Based on the above analysis, the exception was caused by a database query that triggered full GC and saturated the CPU. Using the tester's input for a local debug run, the final conclusion was that the query returned a result set of 1.2 million rows, which filled up the memory and triggered GC; because the objects were still strongly referenced before the response was returned, the GC could never reclaim them, and the constant garbage collection kept the CPU at 100%.

Conclusion: although the cause of this fault was simple, the troubleshooting process covers the order of steps and the tools to use when the CPU is full, the load is too high, or memory usage is abnormal. I am writing it up here for your reference.
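As a closing sketch, this is roughly the kind of guard that keeps a query like the one above from pulling 1.2 million rows into memory at once; the method, table name and limit value are hypothetical and not the project's actual code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BoundedQueryDemo {
    private static final int MAX_ROWS = 1000; // hypothetical upper bound for one request

    // Returns at most MAX_ROWS rows instead of loading the whole result set into memory.
    static List<String> queryNames(Connection conn, String keyword) throws SQLException {
        String sql = "SELECT name FROM t_order WHERE customer LIKE ? LIMIT ?"; // hypothetical table/column
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "%" + keyword + "%");
            ps.setInt(2, MAX_ROWS);
            ps.setMaxRows(MAX_ROWS); // extra safety net on the JDBC side
            List<String> names = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
            }
            return names;
        }
    }
}

Paging the query, or streaming results with a small fetch size, achieves the same goal without ever materializing the full result set in the JVM heap.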