A few days ago, shortly after I got home from work, a monitoring alarm went off: the CPU of a group of business machines in production had been pushed to 100%. To keep the service running, I resorted to blunt measures for the night, upgrading the instance configuration and adding machines. Lying awake later, I turned over the possible causes, but there were too many candidates to settle on one.

The next day, after arriving at the office, I configured the parameters required by Java Mission Control on a pre-release machine and then waited for the problem to occur again. Sure enough, within a few days it reappeared during the evening rush hour. A small portion of the online traffic was immediately diverted to the pre-release machine, which was soon fully loaded. Watching the Threads panel in Java Mission Control for a while, I saw that the CPU usage of the business threads was not high at all, rarely exceeding 10%, yet the machine's CPU was pegged. Digging further into what Java Mission Control reported, I found that Full GC was extremely frequent: within two hours of the application starting, more than 1,500 Full GCs had run, taking more than 13 minutes in total, and both the frequency and the total time were still climbing. This reminded me of the frequent-Full-GC problem described in related books, though I had not expected to run into it myself, as shown in the picture:

Figure: frequent Full GC on the pre-release machine, as shown in Java Mission Control
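For reference, attaching Java Mission Control to a remote Tomcat usually means enabling JMX (and, on Oracle JDK 8, Flight Recorder). The flags below are a minimal sketch of that kind of setup, not the exact parameters used in this incident; the port number is arbitrary, and authentication/SSL should not be disabled like this outside a trusted network.

# Assumed JMC/JMX setup for the pre-release Tomcat (Oracle JDK 8); port 7091 is an example.
export CATALINA_OPTS="$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=7091 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -XX:+UnlockCommercialFeatures \
  -XX:+FlightRecorder"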

However, the pre-release machine is configured differently from the online machines, so a CPU surge caused by Full GC there does not prove that the online machines were suffering from the same cause. To verify this, I configured GC logging and related parameters on one of the online machines and waited for the problem to occur again so that it would leave log evidence behind (see Understanding the Java Garbage Collection Log for background on reading these logs). Find it tedious to analyze GC logs by eye? Fortunately there is already an analysis tool for that: GC Easy.
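On Java 8, GC logging of this kind is typically switched on with flags like the following; take this as a sketch of the sort of configuration used rather than the exact flags from the incident. The log path and rotation sizes are only examples.

# Assumed Java 8 GC logging flags; the path and rotation sizes are illustrative.
export CATALINA_OPTS="$CATALINA_OPTS \
  -Xloggc:/opt/logs/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=5 \
  -XX:GCLogFileSize=20M"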

Here is a GC log collected on a 1-core 4G machine (JVM not tuned) whose CPU was driven to 100% during the evening peak; it can be uploaded to GC Easy for analysis. Download it here.
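As a quick live check alongside the uploaded log, GC activity can also be watched in real time with jstat; this is a generic example rather than a step from the original investigation.

# Optional live check: sample GC statistics every second for the given Tomcat PID.
# The FGC and FGCT columns show the Full GC count and total Full GC time.
jstat -gcutil <pid> 1000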

A picture is worth a thousand words:

Figure: GC Easy analysis of the uploaded log, showing frequent Full GC

To check whether any business threads were eating CPU, I also prepared the commands for generating thread dumps: first use jps to find the Tomcat PID, then generate a thread dump with the following command:

jstack -l <pid> > /opt/threadDump.txt

To make sure the data was representative, I generated five or six dumps while the CPU was maxed out, for later analysis. Again, it is hard to pick through them by eye, so the Java Thread Dump Analyzer can be used here. The analysis of the online machine matched what I had observed on the pre-release machine: there was no large group of CPU-hungry problem threads.
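Capturing several dumps in a row can be scripted; the snippet below is only an illustration (the interval, count, and output path are assumptions, and the Tomcat process is matched by its main class name).

# Illustrative: take 6 thread dumps, 5 seconds apart, while the CPU is pegged.
PID=$(jps -l | grep org.apache.catalina.startup.Bootstrap | awk '{print $1}')
for i in 1 2 3 4 5 6; do
  jstack -l "$PID" > /opt/threadDump_$i.txt
  sleep 5
done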

The JVM parameters had never been tuned because the traffic used to be small, so the defaults still applied; on a 1-core 4G machine, for example, the default maximum heap is only 1/4 of physical memory. Since a good deal of the machine's memory was sitting unused, the adjustment is obvious enough that it does not need spelling out. After making the change and observing for a week, the evening-peak service has remained stable, the load stays within the expected range, and the CPU spikes have not recurred: problem solved.
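Purely for illustration (the article deliberately leaves the exact adjustment out), giving the JVM an explicit, larger heap on a 4G box would look something like the following; the sizes are assumptions, not the values actually used.

# Hypothetical heap sizing on the 1-core 4G machine; the real values were not published.
export CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx2g"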

References:
Eliminate Consecutive Full GCs
In-Depth Understanding of the Java Virtual Machine (2nd Edition)
Java Performance: The Definitive Guide
