After a minor revision, the system went live. Post-launch testing found no problems, but the next day came feedback that the system was freezing, and it was taken offline.

We investigated, optimized the slow interfaces, and went live again. Nothing looked wrong right after the release, but the next day the lag reappeared. This time we observed CPU usage at 1600%. The only thought at that moment was to roll back first, so no diagnostic state was preserved.

In the test environment we reproduced and fixed a problem that drove CPU usage to 100%. But it was clear that this 100% issue could not be the cause of the 1600% one.

We went live again and monitored CPU usage manually in real time. When it reached 1600%, we looked at the threads inside the process that was consuming the CPU.
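
The original screenshot is gone, but on Linux the usual way to get this view is top to find the hot process and then top -Hp to list its threads (the PID below is a placeholder):

```
# Find the Java process that is burning CPU
top

# Show per-thread CPU usage inside that process (replace <pid>)
top -Hp <pid>
```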

Thread IDs 6619, 6625, and so on were the heaviest CPU consumers.

Next we printed the JVM thread stacks and saved them to a file.
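
A typical way to do this is jstack (the PID and file name below are placeholders):

```
# Dump all JVM thread stacks to a file
jstack <pid> > jstack.txt
```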

6619 converted to hexadecimal is 19db, which we looked up among the thread IDs in the stack file.
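
The conversion and the lookup can be done like this; jstack reports each thread's native ID in the nid=0x... field:

```
# 6619 in hexadecimal is 19db
printf '%x\n' 6619

# Find that thread's stack in the dump, with some context after the match
grep -A 20 'nid=0x19db' jstack.txt
```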

It turned out that all the high-CPU threads were GC threads. At this point we could conclude that some part of the code was using far too much memory, or that there was a memory leak.
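
As a quick check, you can count how many of the dumped threads are GC workers. With the Parallel collector, HotSpot names them "GC task thread#N (ParallelGC)"; other collectors use different thread names, so adjust the pattern accordingly:

```
# Count GC worker threads in the dump (pattern matches the Parallel collector)
grep -c 'GC task thread' jstack.txt
```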

Finding the problem

By now we had failed to go live three times in a row, so we could not keep experimenting in production. The plan was to reproduce the problem in a grayscale environment and dump the heap there to find the cause.

On the first day we routed a very small amount of traffic from other systems, plus a very small amount of user traffic, to the grayscale environment. No problem appeared.

On the second day we kept the traffic from other systems the same and added more user traffic. The problem did not recur.

On the third day we increased the traffic from some of the other systems. Still no recurrence.

On the Nth day we increased the traffic from other systems again and turned the memory allocation down. Still no recurrence.

On day N+1 the grayscale environment and the production environment split the traffic evenly, and we kept adding users. Still no problem.

By this time the modified parts of the core-flow code had been reviewed N times without finding anything wrong.

So the question became: why is there no problem in the grayscale environment while production has one? How do their users differ?

It turned out that the grayscale environment contained only users with the lowest permission level; administrators did not work in it. At that point we were very close to the truth. You could say the problem had been identified; it just needed to be verified.

One administrator function lets an admin view the data of everyone they manage. We had not looked in this direction at first, because it is not a core function and receives very few requests. Its logic is: find the user's next level of subordinates; if any exist, keep descending. Unfortunately, the database contained one abnormal record whose subordinate was the user itself! That produces an infinite loop, which keeps piling more and more data into memory.

And a single abnormal user was enough to trigger the problem!

Because each iteration is an I/O-bound database call, the loop's own CPU usage is low, so it did not stand out in the thread stacks.
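
The post does not show the actual code, but the problematic pattern looks roughly like the sketch below; ManagedDataWalker, User, UserDao and findSubordinates are hypothetical stand-ins for the real classes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical reconstruction of the buggy traversal.
class ManagedDataWalker {

    interface UserDao {
        List<User> findSubordinates(long userId);
    }

    static class User {
        final long id;
        User(long id) { this.id = id; }
    }

    private final UserDao userDao;

    ManagedDataWalker(UserDao userDao) { this.userDao = userDao; }

    // "Find their next level; if there is data, continue to find."
    // If one database row lists a user as its own subordinate, that row keeps
    // re-queuing itself: the loop never ends and 'result' grows without bound,
    // until the heap fills up and the JVM spends almost all its CPU in GC.
    List<User> collectManagedUsers(User root) {
        List<User> result = new ArrayList<>();
        Deque<User> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            User current = pending.pop();
            for (User child : userDao.findSubordinates(current.id)) {
                result.add(child);
                pending.push(child);
            }
        }
        return result;
    }
}
```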

Solving the problem

Once the cause is known, the fix is easy, so I won't describe it in detail.
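
For completeness, here is a minimal sketch of one possible guard, reusing the hypothetical names from the sketch above: remember which user IDs have already been visited, so a record that points at itself (or any other cycle) is skipped instead of being re-queued forever. Fixing the abnormal database row, and constraining the data so it cannot recur, is the other half of the fix.

```java
    // Drop-in replacement for collectManagedUsers in the sketch above
    // (also needs: import java.util.HashSet; import java.util.Set;).
    List<User> collectManagedUsers(User root) {
        List<User> result = new ArrayList<>();
        Deque<User> pending = new ArrayDeque<>();
        Set<Long> visited = new HashSet<>();
        visited.add(root.id);
        pending.push(root);
        while (!pending.isEmpty()) {
            User current = pending.pop();
            for (User child : userDao.findSubordinates(current.id)) {
                if (visited.add(child.id)) {   // add() returns false if already seen
                    result.add(child);
                    pending.push(child);
                }
            }
        }
        return result;
    }
```

Capping the traversal depth would also have turned this from an outage into a single bad response.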

Reflection

The first time the system got stuck, the correct way to handle it would have been:

If CPU usage is high, check what the threads of the offending process are doing.

When a large number of those threads turn out to be performing GC, dump the heap.

Use a tool such as jmap to see which objects occupy the most memory (example commands below).

Trace those objects back to the code and fix the problem.
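
The heap-dump and object-inspection steps would look roughly like this (PID and file name are placeholders); the dump can then be opened in a tool such as Eclipse MAT or VisualVM:

```
# Quick histogram: which classes occupy the most memory
jmap -histo:live <pid> | head -n 20

# Full heap dump for offline analysis
jmap -dump:live,format=b,file=heap.hprof <pid>
```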

Such a bug should not have existed in the first place. But even when a problem does occur, don't panic: quickly save whatever diagnostic information can be saved.

Big changes should be released to a small group of users first, before going fully live.

The end.