Too long; didn't read: what this article covers

  • How the OOM was tracked down.
  • Some useful commands for observing the heap, GC, and so on: jmap and jstat.
  • The cause of the incident, the fix, and a review.

1: A brief description of the incident

At around 9 PM one evening, RPC calls from the business module (a microservice) to the WeChat module started timing out frequently and returning large numbers of errors. Partway through dealing with it, an alarm came in: the heap had overflowed, and every machine in the WeChat module was down.

2: Start troubleshooting

The first step was to restart the machines immediately to get the online service back, and then download the heapdump.hprof file. For a dump to be generated automatically, the corresponding option has to be added to the JVM startup parameters; otherwise you have to take one yourself with the jmap command.
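For reference, these are the usual options rather than necessarily the exact ones used in this incident: a JVM startup flag that makes the JVM write a dump automatically when an OOM occurs, and the jmap command for taking a dump of a running process by hand.

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps
jmap -dump:format=b,file=heapdump.hprof pid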

Because the dump is a binary file, you need a graphical tool to analyze it. I use Eclipse Memory Analyzer.

Click on Leak Suspects to take a look at the places where memory might be leaking.

I found that the web container had created a large number of connection instances, which were eating a lot of memory. I immediately looked at the process's GC with jstat -gcutil PID: FGC had been triggered many times and the YGC time was long. It immediately occurred to me that many connections were still in the keep-alive state and had not been returned to the pool, so new requests could only establish new connection objects. The waiting connections are not collected by the garbage collector because they are still alive. The result is a situation where the objects already created cannot be reclaimed while new connection objects keep being created, which uses up the heap.
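As a rough guide to how that conclusion is read off the tool (assuming JDK 8's jstat -gcutil column layout, not anything specific to this setup): running it with an interval prints a fresh sample every second; the O column shows old-generation occupancy as a percentage, YGC/YGCT count young GCs and their total time, and FGC/FGCT do the same for full GCs. An O column stuck near 100 with FGC climbing every few samples is the pattern described above.

jstat -gcutil PID 1000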

3: Several useful commands for observing the heap

You can use this command to see the GC status. It is very important, because frequent FGC can then be dealt with in advance, before it turns into an OOM.

jstat -gcutil pid

You can use this command to look at the heap status directly.

jmap -heap pid

You can use this command to look at the objects on the heap, sorted by the number of instances.

jmap -histo:live pid | sort -k 2 -g -r | less

4: Verifying the cause

Now that I had a hypothesis, I just needed to verify it. Looking at the logs from before and after the crash, I basically pinned down the scene of the incident. First of all, there was a fairly large burst of traffic at 9 PM: the two machines received 18.2 million requests in about one minute. Most of the requests looked like this:

After a template message is sent, the WeChat server pushes a "Template Send Job Finish" event to the developer's server for processing. The annoying part is that if the callback request times out, WeChat will send it again.

In addition, in order to reply SUCCESS to the WeChat server as quickly as possible, the code that handles the callbacks does its work asynchronously, using Spring's @Async annotation in its default form. The pitfall is that this annotation does not initialize a thread pool of its own: @Async falls back to the executor that Spring Boot provides. That lets tasks pile up, and the accumulated tasks in turn cause the connections to time out.
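A minimal sketch of the pattern described above, with hypothetical class, method, and endpoint names (the article does not show the actual code): the controller replies to WeChat right away and hands the real work to an @Async method on another bean.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class WxCallbackController {

    @Autowired
    private TemplateEventHandler templateEventHandler;

    // WeChat pushes the "Template Send Job Finish" event to this endpoint.
    @PostMapping("/wx/callback")
    public String onTemplateSendJobFinish(@RequestBody String xmlPayload) {
        // Hand the work to an async method so the reply goes back
        // before WeChat's timeout triggers a retry.
        templateEventHandler.handle(xmlPayload);
        return "success";
    }
}

@Service
class TemplateEventHandler {

    // With no qualifier, @Async runs on the default executor,
    // which is exactly the pitfall described above.
    // @EnableAsync must be present on a configuration class.
    @Async
    public void handle(String xmlPayload) {
        // parse the event and record the delivery result, etc.
    }
}

Pointing handle() at a properly configured pool is the fix described in section 5.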

To sum up, it was a natural disaster [a sudden flood of large-scale requests] plus a man-made disaster [problems in the way the code was written].

5: Resolution and review

  • Optimized the thread pool configuration: inject a thread pool we configure ourselves instead of relying on the default (see the sketch after this list).
  • Server capacity expansion. Without quite noticing, we had accumulated nearly 1 million users while the WeChat service was still running on only two machines, so I added two more this time.
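A minimal sketch of what injecting your own thread pool could look like, assuming Spring Boot; the bean name, pool sizes, and queue capacity below are illustrative, not the values actually used. The point is a bounded queue plus an explicit rejection policy, so a callback storm produces back-pressure instead of an ever-growing pile of queued tasks.

import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync  // turns on @Async processing
public class AsyncConfig {

    // Hypothetical executor dedicated to the WeChat callback handlers.
    @Bean("wxCallbackExecutor")
    public Executor wxCallbackExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(16);      // illustrative sizes
        executor.setMaxPoolSize(32);
        executor.setQueueCapacity(1000);   // bounded: tasks cannot pile up without limit
        executor.setThreadNamePrefix("wx-callback-");
        // When the queue is full, run the task on the calling thread so the
        // callback endpoint slows down instead of queueing forever.
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}

The async method then references this pool by name, for example @Async("wxCallbackExecutor"), instead of falling back to the default executor.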

Seven or eight days later, after a few callback bursts on the order of 100,000 requests, the CPU stayed at roughly 30% utilization. The improvement is obvious.