Preface

A few days ago I opened my mailbox and found a monitoring alert email: the CPU load on a certain server was high, please check and resolve it as soon as possible. It had been sent in the early hours of the morning.

In fact, as early as last year I dealt with a similar problem and wrote it up: “Investigation and Optimization Practice of 100% CPU Usage in Production”.

But the cause this time is not quite the same as last time, so read on.

Problem analysis

As soon as I got the email, I logged on to the server and found that the crime scene was still there.

So I went through the usual troubleshooting routine for this kind of problem.


First, use top -c to display system resource usage in real time (the -c flag shows each process's full command line).

Then type a capital P to sort processes by CPU usage, with the heaviest consumers at the top.

Sure enough, it’s one of our Java applications.

This application simply runs some reports on a schedule: the task is triggered every morning and normally completes within a few hours.


The next routine step is to figure out what the most CPU-hungry thread in that process is doing.

You can again sort threads by CPU usage by running top -Hp pid and typing a capital P.

Then run jstack pid > pid.log to generate a thread dump, convert the busy thread ID from the previous step to hexadecimal, and search for it (the nid field) in the dump to see what the CPU-hogging thread is doing.

If that is too much trouble for you, I also highly recommend Arthas, Alibaba's open-source diagnostic tool.

For example, the operations above can be condensed into a single command, thread -n 3, which prints stack snapshots of the three busiest threads. Very efficient.

For more information on how to use Arthas, see the official documentation.

Since I forgot to save the screenshots, I'll jump straight to the conclusion:

The busiest thread is a GC thread, which means it is busy doing garbage collection.

GC investigation

Having got this far, experienced old hands will already suspect that this is most likely a problem with how the application uses memory.

So I printed the memory usage and GC statistics with jstat -gcutil pid 200 50 (50 samples, one every 200 ms).

The following information can be obtained from the figure:

  • The Eden area and the old area are both almost full, so memory reclamation is clearly in trouble.
  • Full GC is running very frequently: 8 collections within 10 s (the FGC count rose from 866485 to 866493 over 50 samples taken every 200 ms).
  • Over its lifetime the application has already performed more than 80,000 full GCs.

Memory analysis

Since the preliminary conclusion is a memory problem, we still need to take a memory snapshot and analyze it to finally pin down the cause.

You can export a snapshot file by running the jmap -dump:live,format=b,file=dump.hprof pid command.

This is where analytical tools like MAT come in.

Problem location

It is obvious from this graph that there is a very large string in memory, which happens to be referenced by the thread of the scheduled task.

This string takes up about 258 MB of memory, which is enormous for a single string object.

So where does this string come from?

This is an INSERT SQL statement.

You have to admire MAT: it can also predict where the heap snapshot is likely to have gone wrong and give you the corresponding thread snapshot.

Finally, the thread snapshot leads to the specific business code:

It calls a method that writes to the database, and that method builds an INSERT statement whose values are concatenated in a loop, something like this:

    <insert id="insert" parameterType="java.util.List">
        insert into xx (files)
        values
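        <!-- each element of "list" expands into one value group below, so the rendered SQL grows with the size of the list -->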
        <foreach collection="list" item="item" separator=",">
            xxx
        </foreach>
    </insert>

So once the list is very large, the concatenated SQL statement is also very long.

As you can see from the memory analysis above, the List is also quite large, which results in the huge memory footprint of the final INSERT statement.
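
For illustration only, here is a minimal sketch of the kind of call site this points to. ReportMapper, Record, and ReportJob are hypothetical stand-ins; only the pattern of passing the whole list to a single mapper call comes from the analysis above.

    import java.util.List;

    // Hypothetical stand-ins for the real MyBatis mapper and entity.
    interface ReportMapper {
        void insert(List<Record> records); // backed by the <insert> statement shown above
    }

    class Record { /* report fields omitted */ }

    class ReportJob {
        private final ReportMapper reportMapper;

        ReportJob(ReportMapper reportMapper) {
            this.reportMapper = reportMapper;
        }

        void run(List<Record> records) {
            // The whole list goes into one call, so MyBatis renders a single INSERT
            // whose length grows with the list: the huge string that showed up in MAT.
            reportMapper.insert(records);
        }
    }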

Optimization strategy

Now that we have found the cause of the problem, it is easy to solve it. There are two directions:

  • Control the size of the source List. Since this List also comes from a database table, it can be read in pages, which in turn shrinks the subsequent insert statements.
  • Control the size of each batch write; this essentially means keeping the length of the concatenated SQL down (a minimal batching sketch follows this list).
  • The overall write efficiency then needs to be re-evaluated.
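
A minimal sketch of the batching direction, reusing the hypothetical ReportMapper and Record from the sketch above; the batch size of 1000 is an arbitrary illustrative value, not something taken from the original fix.

    import java.util.List;

    class BatchedReportJob {
        private static final int BATCH_SIZE = 1000; // caps how many value groups one INSERT carries

        private final ReportMapper reportMapper;

        BatchedReportJob(ReportMapper reportMapper) {
            this.reportMapper = reportMapper;
        }

        void run(List<Record> records) {
            // Write fixed-size chunks instead of one giant statement, so the
            // concatenated SQL string stays bounded no matter how large the list is.
            for (int from = 0; from < records.size(); from += BATCH_SIZE) {
                int to = Math.min(from + BATCH_SIZE, records.size());
                reportMapper.insert(records.subList(from, to));
            }
        }
    }

Combined with reading the source data in pages, the task never holds the full result set, or a multi-hundred-megabyte SQL string, in memory at the same time.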

Conclusion

It did not take long to go from analysis to solution, and the problem itself is fairly typical. The process can be summarized as follows:

  • First, locate the process that is consuming the CPU.
  • Then locate the specific thread within it that is consuming the CPU.
  • For memory problems, dump a heap snapshot for analysis.
  • Draw conclusions, tweak code, test results.

Finally, I hope no one gets a production alarm.

Your likes and shares are the biggest support for me