
One. Preface

While building Ignite clusters recently, I noticed that CPU usage on the production machines soared to 99% or even 100% after they had been running for a while. I then set out to analyze and fix the problem. This article walks through the entire troubleshooting process in detail.

Two. Analysis Process

1. Identifying the problem

Shortly after the servers went live, we began receiving frequent CPU alarm notifications. Checking the cluster servers on the monitoring system showed that CPU usage on all of them had reached 100%.

We picked a machine at random, logged in, and used the top command to find that the process with high CPU usage was a Java process. In other words, our own Java program was responsible for the high CPU usage.
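For reference, this first check is just the interactive top view (the offending process will of course differ per environment):

```bash
# interactive process view; press 'P' to sort by CPU usage and 'c' to show
# full command lines -- in our case a java process sat at the top of the list
top
```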

2. Locating the problem

Having established that our own program was causing the high CPU usage, experience suggested a basic approach: first dump the thread stacks, then find the threads with high CPU usage, look up their stack traces in the dump, and finally analyze the corresponding code to understand why.

A. Gathering the key information

First we need to obtain the Java thread stack information. We can dump it on the server with jstack -l pid > log.log and then download the dump file (log.log) locally for analysis.
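As a reference, the dump step looks like this (replace <pid> with the actual Java process ID found via top):

```bash
# dump all thread stacks of the Java process into log.log for offline analysis
jstack -l <pid> > log.log
```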

Next we need to find out which threads inside the Java process are consuming the most CPU. Running top -Hp <pid> on the machine lists the per-thread CPU usage of that process; the PID column in the output is the (decimal) thread ID.
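For reference (again, <pid> is the Java process ID):

```bash
# -H shows individual threads, -p limits the view to the given process;
# the PID column of the output is the decimal thread ID
top -Hp <pid>
```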

B. Pinpointing the problem code

Once we have the thread stack dump and know which threads are using the most CPU, we can start pinpointing the code. First, each high-CPU thread ID is converted to hexadecimal, because that is how thread IDs (nid) appear in the jstack output. For example, the first thread, 708, is 0x2c4 in hexadecimal.
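The conversion itself is a one-liner:

```bash
# convert the decimal thread ID reported by top to the hex form used by jstack
printf '%x\n' 708   # prints 2c4
```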

Searching the thread stack file for that hexadecimal thread ID (i.e., searching for 0x2c4) locates the corresponding thread's stack trace.

Examining the other high-CPU threads in the same way, we found that most of them were executing the same code path.

We suspected that this code path was the cause of the high CPU usage, so we analyzed its source further. Reading the source shows that the capTotalSize method runs periodically, repeatedly scanning the log files to check whether the configured size cap has been exceeded.

Further analysis of capTotalSize shows that it calls getFilesInPeriod, which, roughly speaking, scans the log folder for log files matching the rolling pattern.

getFilesInPeriod in turn calls FileFilterUtil.filesInFolderMatchingStemRegex.

Looking at the implementation of FileFilterUtil.filesInFolderMatchingStemRegex, we found that the CPU time is mainly spent in the call to file.listFiles().
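The logback source itself is not reproduced here; the sketch below (class and method names are illustrative, not the actual logback code) simply shows why this step is expensive: the cleanup thread lists every entry in the log directory and runs a regex match on each file name, so the cost of every periodic pass grows with the number of rolled files.

```java
import java.io.File;
import java.io.FilenameFilter;

// Simplified illustration of the hot path: list the whole log directory and
// match every file name against the rolling pattern. With a huge number of
// tiny rolled files, this call dominates CPU time on every cleanup pass.
public class LogDirScanSketch {

    static File[] filesMatchingStem(File folder, final String stemRegex) {
        if (folder == null || !folder.isDirectory()) {
            return new File[0];
        }
        // listFiles() walks every directory entry; the filter then runs a
        // regex match per file name.
        return folder.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
                return name.matches(stemRegex);
            }
        });
    }

    public static void main(String[] args) {
        File logDir = new File("/path/to/logs");   // illustrative path
        File[] matches = filesMatchingStem(logDir, "app.*\\.log");
        System.out.println("matched files: " + (matches == null ? 0 : matches.length));
    }
}
```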

So our rough conclusion was that the CPU was being consumed while scanning the log files in the log folder. We then tried to open the log folder and found that it would not even open, which suggested that it contained far too many files. This also explains why the log-scanning thread stays stuck at the listFiles call: the thread is still executing, just extremely slowly.

So why were there so many log files? The first thing to check was the logging configuration, and inspecting it showed that the log file was configured roughly as sketched below.
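The original configuration screenshot is not reproduced here; the following is a reconstruction for illustration only, assuming Spring Boot style logging properties (the property names match the max-size and total-size-cap settings discussed below; values other than the 50 are guesses):

```yaml
logging:
  file:
    name: /data/logs/app.log   # illustrative path
    max-size: 50               # no unit specified
    total-size-cap: 10240      # illustrative value, no unit specified
```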

At first glance, this configuration does not look problematic. Looking more carefully, however, max-size and total-size-cap are given without units. When no unit is specified, the default unit is bytes, so each log file was capped at 50 bytes (stored as roughly 1 KB each by the operating system). This generated an enormous number of log files, which is why the scanning thread can get stuck in the scan.

Three. Solution

Once the cause was clear, the fix was simple: give max-size and total-size-cap explicit units.
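A minimal sketch of the corrected configuration, with explicit units (same assumptions as the sketch above; the concrete sizes are illustrative, not the values used in production):

```yaml
logging:
  file:
    max-size: 50MB        # roll each log file at 50 MB instead of 50 bytes
    total-size-cap: 10GB  # cap the total size of retained logs
```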

After the configuration was changed, we restarted the service and checked the CPU monitoring: CPU usage on all machines returned to normal.

After a period of further observation, CPU usage remained stable, so the problem was considered solved.

Four. Summary

In retrospect, the problem itself was fairly simple. Because the log file size was misconfigured (no unit was specified), every log file ended up at roughly 1 KB and a huge number of log files were generated. This eventually caused the log-scanning thread to get stuck scanning the log directory, driving CPU usage to 100%.

Five. Closing

If you have any questions or comments about this article, feel free to follow the official account and discuss them there. (Followers can also get 10 GB of video and graphic materials on “Java Advanced Architecture”.)