Background

Why is this a painful memory? I had planned to rest during the National Day holiday, but instead I ended up in a daze after several nights of emergency on-call firefighting, with a different incident every night. This article walks through how we tracked down a memory leak in a Java service. I hope it helps you.

The troubleshooting process

jps shows the service process has disappeared

In the early hours of National Day, the nginx layer fired an alarm that the entire web service was unavailable, and QA reported that the app was completely broken. I logged into the servers, ran jps, and found that the Java process was gone. Since the fault was serious and could not be located and fixed quickly, we isolated one of the faulty instances and restarted the rest, which brought the service back.
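
For reference, a minimal sketch of how one might confirm from a shell that the JVM process is gone (the service name in the grep is a hypothetical placeholder):

jps -l                        # lists running JVM processes with their main class or jar
jps -l | grep order-service   # hypothetical service name; empty output means the process is not running
ps -ef | grep java            # cross-check with ps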

Checking the logs

The service logs showed no exceptions, and GC looked perfectly normal, so I turned to the operating system logs. It turned out the Java process had been killed by the OS OOM killer: when Linux runs out of memory, it kills the process using the most memory, which in our case was the Java process. Since our JVM heap is set to 6 GB and the container memory limit is 8 GB, the 2 GB of headroom should have been plenty, so the next step was to figure out which memory was leaking.
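
A sketch of how the OOM kill can be confirmed from the OS side (exact log file locations vary by distribution):

dmesg -T | grep -i -E "killed process|out of memory"   # kernel OOM killer records with timestamps
grep -i "killed process" /var/log/messages              # RHEL/CentOS; on Debian/Ubuntu check /var/log/syslog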

View container memory

The monitoring showed that the container's memory had been climbing steadily since the 29th with no sign of ever being released (a new version was deployed on the 29th, which is why the curve dropped at that point), so this could be identified as a memory leak in the Java service. Since the process had already been killed and there was nothing left on the spot to inspect, the investigation could wait until the next day, and I went back to sleep.

Troubleshooting Process Resources

The next day I logged in to a running instance and ran top to check the process. Its RES was 7.3 GB, yet our heap is only 6 GB; normally RES should not exceed roughly 6.5 GB.
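
A sketch of this check (the PID is a placeholder):

top -p <pid>                     # the RES column is the resident memory of the process (7.3 GB here)
ps -o pid,rss,vsz,cmd -p <pid>   # RSS is reported in kB; divide by 1024*1024 for GB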

Checking Memory Usage

Although the RES growth was almost certainly caused by a leak outside the heap, we still checked whether there were any abnormal objects in the heap. A jmap dump showed nothing unusual. We then checked the Java threads: running top -Hp <pid> showed 189 threads, which is not enough for thread stacks to consume significant memory, so that was ruled out as well.
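
Roughly, the commands used for these two checks (the PID and file names are placeholders):

jmap -dump:live,format=b,file=heap.hprof <pid>   # heap dump of live objects; analyse with MAT or a similar tool
top -Hp <pid>                                     # per-thread view; here it showed 189 threads
jstack <pid> > threads.txt                        # optional: capture thread stacks for inspection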

Troubleshooting off-heap memory and Metaspace

I triggered a full GC with jmap (traffic was light and we were past the peak at that time, so the instance was not isolated first; doing this directly in production is not recommended, and it is safer to isolate the instance before doing so). The GC log showed that Metaspace was normal, and RES was not released after the full GC completed. Combining this with the earlier findings, leaks through Unsafe.allocateMemory and DirectByteBuffer could be ruled out.
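
A sketch of how the full GC can be triggered and the Metaspace/native side checked. The jcmd line assumes Native Memory Tracking was enabled with -XX:NativeMemoryTracking=summary, which the article does not state, so treat it as an optional extra:

jmap -histo:live <pid> | head -20     # the :live option forces a full GC before printing the histogram
jstat -gcmetacapacity <pid>           # Metaspace capacity and usage, cross-checked against the GC log
jcmd <pid> VM.native_memory summary   # JVM-tracked native memory (only works if NMT is enabled)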

Are there resources not closed properly?

None of the above turned up a problem, so only two possibilities were left.

1. Some resources are not being closed properly (e.g. streams or TCP connections).

2. Memory allocated by native code is not being released.

A review of the code showed no abnormal stream usage, and netstat showed a little over 1,000 TCP connections, which is still normal. So the first possibility could be ruled out for now.
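
A rough sketch of the connection check (the PID is a placeholder):

netstat -antp 2>/dev/null | grep "<pid>/" | wc -l   # TCP connections owned by the process
ss -antp | grep "pid=<pid>" | wc -l                 # ss equivalent on newer systems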

Native code memory is not freed

The only option left was to look further into the process memory itself.

I inspected the process memory with pmap -x <pid> | sort -n -r -k3 | less. In addition to the 4 GB heap (as a temporary mitigation the JVM heap had been shrunk from 6 GB to 4 GB to keep the service alive a while longer), there were a large number of suspicious 64 MB memory blocks. In fact, this memory was the reason RES kept rising.
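
For clarity, the same check written out (the PID is a placeholder):

pmap -x <pid> | sort -n -r -k3 | less   # sort mappings by RSS (column 3), largest first
# the leak shows up as a long list of ~64 MB anonymous "[ anon ]" blocks below the heap mapping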

Some searching revealed that this is the classic 64 MB memory-arena problem of glibc on Linux. I checked the version with getconf GNU_LIBC_VERSION:

glibc 2.12

So it could be determined that memory allocated via glibc was not being freed, and the blame could be placed on native code.

The next step was to dump one of these memory blocks and see what was in it. Run cat /proc/<pid>/smaps > smap to record the process's memory mappings, then pick one of the abnormal blocks and note its start and end addresses.

Then dump the block with GDB: attach with gdb -p <pid>, run the dump memory memory.dump <startAddress> <endAddress> command, and exit GDB once it succeeds. Finally, analyse the dump with strings memory.dump | less.
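
Putting the dump step together as one sketch (the PID and the start/end addresses are hypothetical placeholders taken from the smaps output):

cat /proc/<pid>/smaps > smap     # record mappings, pick an abnormal ~64 MB block and note its address range
gdb -p <pid>                     # attach to the running process (briefly pauses it)
(gdb) dump memory memory.dump 0x7f0100000000 0x7f0104000000   # hypothetical 64 MB address range
(gdb) quit
strings memory.dump | less       # inspect readable strings in the dumped block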

The dump contained many business response strings along with blocks of ciphertext. Looking at the code, only the encryption/decryption path deals with such data, so attention shifted to encryption and decryption, which in our service is implemented through native code.

Pinpointing encryption and decryption

So I ran a controlled load-test experiment against the service:

  • Baseline group: load-test interface A with encryption enabled
  • Control group: load-test interface A in plaintext (encryption and decryption disabled)

The RES of the baseline group kept growing over time with no sign of release, while the RES of the control group stayed flat at a constant value. That pinpointed the memory leak to the native encryption/decryption code.
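
A simple way to watch RES during such a test, as a sketch (the PID is a placeholder):

while true; do
  echo "$(date +%T) RES: $(ps -o rss= -p <pid>) kB"   # sample resident memory every 10 seconds
  sleep 10
done
# the baseline (encrypted) run shows RES climbing steadily; the plaintext run stays flat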

Because this JNI code is not maintained by us, we handed the findings to the platform team responsible for it. They found a memory leak bug in their code, and the problem was resolved once the code was changed to free the allocated memory.

Conclusion

The whole process was rough, but it taught me a lot. Memory leaks like this are relatively rare in the JVM world, and they fall into only a handful of cases, which can be worked through with methodical, step-by-step analysis.