We had an online service whose RES, as reported by the top command, was over 1.5 GB, which was clearly unreasonable, so we did some analysis and found the root cause. Below is the complete analysis process; I hope it helps you.

This article covers the following topics:

  • The classic Linux 64 MB memory problem
  • The basic routine of heap memory analysis and native memory analysis
  • Using tcmalloc and jemalloc in native memory analysis
  • How finalize works
  • How Hibernate can bite you

The symptoms

The startup parameters of the program:

ENV=FAT java
-Xms1g -Xmx1g 
-XX:MetaspaceSize=120m 
-XX:MaxMetaspaceSize=400m 
-XX:+UseConcMarkSweepGC  
-jar 
EasiCareBroadCastRPC.jar

After startup, the memory footprint looked like this: a whopping 1.5 GB. Java is a memory hog, sure, but not like this.

Now for the fun part: the analysis.

Start with the low-hanging fruit

First use jcmd or jmap to check whether the heap itself is too large; if it is, the fix is usually straightforward.

As you can see, the heap is 216937K + 284294K ≈ 489.48 MB, and Metaspace, although not part of the Java heap, only shows 80-odd MB here. That is nowhere near 1.5 GB.

So where did the rest of the memory go? At this point it is clear that something outside the heap may be eating memory, so the next step is to analyze it with NativeMemoryTracking.

Using NativeMemoryTracking

To track the memory footprint of the other parts of the JVM, enable this feature with -XX:NativeMemoryTracking:

java -XX:NativeMemoryTracking=[off | summary | detail]

Add this startup parameter, restart the process, and then print the information with jcmd:

$ jcmd `jps | grep -v Jps | awk '{print $1}'` VM.native_memory detail

Total: reserved=2656938KB, committed=1405158KB
-                 Java Heap (reserved=1048576KB, committed=1048576KB)
                            (mmap: reserved=1048576KB, committed=1048576KB)

-                     Class (reserved=1130053KB, committed=90693KB)
                            (classes #15920)
                            (malloc=1605KB #13168)
                            (mmap: reserved=1128448KB, committed=89088KB)

-                    Thread (reserved=109353KB, committed=109353KB)
                            (thread #107)
                            (stack: reserved=108884KB, committed=108884KB)
                            (malloc=345KB #546)
                            (arena=124KB #208)

-                      Code (reserved=257151KB, committed=44731KB)
                            (malloc=7551KB #9960)
                            (mmap: reserved=249600KB, committed=37180KB)

-                        GC (reserved=26209KB, committed=26209KB)
                            (malloc=22789KB #306)
                            (mmap: reserved=3420KB, committed=3420KB)

-                  Compiler (reserved=226KB, committed=226KB)
                            (malloc=95KB #679)
                            (arena=131KB #7)

-                  Internal (reserved=15063KB, committed=15063KB)
                            (malloc=15031KB #20359)
                            (mmap: reserved=32KB, committed=32KB)

-                    Symbol (reserved=22139KB, committed=22139KB)
                            (malloc=18423KB #196776)
                            (arena=3716KB #1)

Disappointingly, everything listed here looks normal; nothing is abnormally large. From this we can basically conclude that the problem lies in native memory that is not managed by the JVM. So how do we analyze that?

A first look with pmap

pmap shows a large number of 64 MB memory blocks, which is typical of ptmalloc. I described this problem in a previous article, see: juejin.cn/post/685457…

Are these 64 MB blocks part of any of the NMT regions above?

In detail mode, NMT prints the start and end address of each region, as shown below.

With a small piece of code (easy to write yourself; a rough sketch follows) you can join the pmap output with the NMT regions and see where the heap, thread stacks, GC structures and so on actually sit in the process address space.
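A minimal sketch of the matching step, assuming the NMT address ranges and the pmap block start addresses have already been extracted (all addresses below are made up for illustration):

import java.util.Arrays;
import java.util.List;

public class MatchPmapWithNmt {

    static class Region {
        final long start;
        final long end;
        final String tag;

        Region(long start, long end, String tag) {
            this.start = start;
            this.end = end;
            this.tag = tag;
        }
    }

    public static void main(String[] args) {
        // Ranges as printed by `jcmd <pid> VM.native_memory detail` (values made up here)
        List<Region> nmtRegions = Arrays.asList(
                new Region(0x7ff448000000L, 0x7ff488000000L, "Java Heap"),
                new Region(0x7ff4a0000000L, 0x7ff4e5000000L, "Class"));
        // Start addresses of suspicious 64 MB blocks taken from `pmap -x <pid>` (also made up)
        long[] pmapBlocks = {0x7ff41c000000L, 0x7ff448000000L};

        for (long addr : pmapBlocks) {
            String owner = "not tracked by NMT";
            for (Region r : nmtRegions) {
                if (addr >= r.start && addr < r.end) {
                    owner = r.tag;
                    break;
                }
            }
            System.out.printf("0x%x -> %s%n", addr, owner);
        }
    }
}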

It turns out that the large 64 MB blocks are not part of any region tracked by NMT.

tcmalloc and jemalloc to the rescue

We can use tcmalloc or jemalloc to track native memory allocations. They work by hooking the system's malloc, free and other allocation/release functions and adding profiling logic on top.

Take tcmalloc as an example.

tcmalloc (github.com/gperftools/…) uses LD_PRELOAD to hook the memory allocation functions.

HEAPPROFILE=./heap.log HEAP_PROFILE_ALLOCATION_INTERVAL=104857600 LD_PRELOAD=./libtcmalloc_and_profiler.so java -jar xxx

After startup you will see a lot of memory-dump profile files being generated; then use pprof to convert these heap files into a much more readable PDF:

pprof --pdf /path/to/java heap.log.xx.heap > test.pdf

The following figure shows the call chains through which the memory is allocated.

Almost everything goes through Java_java_util_zip_Inflater_inflateBytes: a JAR is just a ZIP archive, and reading it goes through JNI into C++ code, which performs a huge number of native allocations and frees.

Solution without changing the code

There is a way to work around this without touching the code: don't start the application with java -jar; unpack the jar and start it with java -cp . AppMain instead. Because our project uses the shade plugin for historical reasons, the jar contains only class files and no nested jars, so this is easy to do. Started this way, without the jar, the RES usage is only about 400 MB.

At this point we were fairly confident that starting from the jar was the cause. But why does starting from a jar cause this?

Exploring the root cause

From the tcmalloc profile we can see that the allocations come from the java.util.zip.Inflater class; calling its end method frees the native memory.

My first thought was that end was simply never being called. That is indeed possible: in some scenarios the close method of java.util.zip.InflaterInputStream does not call Inflater.end, as shown below.
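For reference, the relevant close logic looks roughly like this in JDK 8 (simplified): end is only called when the stream created its own Inflater; if you hand in your own Inflater, closing the stream will not end it.

// java.util.zip.InflaterInputStream#close (JDK 8, simplified)
public void close() throws IOException {
    if (!closed) {
        // usesDefaultInflater is true only when the stream constructed the Inflater itself
        if (usesDefaultInflater)
            inf.end();
        in.close();
        closed = true;
    }
}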

When an Inflater object becomes unreachable, the JVM will eventually call its finalize method, which in turn calls end.

public class Inflater {
    public void end() {
        synchronized (zsRef) {
            long addr = zsRef.address();
            zsRef.clear();
            if (addr != 0) {
                end(addr);
                buf = null;
            }
        }
    }

    /**
     * Closes the decompressor when garbage is collected.
     */
    protected void finalize() {
        end();
    }

    private native static void initIDs();
    // ...
    private native static void end(long addr);
}

There are two possibilities

  • The Inflater is still referenced by other objects, so it cannot be collected, its finalize method is never called, and the native memory is naturally never freed
  • Inflater's finalize method is called and the memory is freed, but the memory is cached by glibc's ptmalloc and never actually returned to the operating system
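To illustrate the first point: finalize runs only after the object has become unreachable and a GC cycle has processed it, so a single lingering strong reference keeps the native resource alive forever. A minimal demo, not from the original project (GC and finalization are not deterministic, so treat the printed output as typical rather than guaranteed):

public class FinalizeDemo {
    private static Object holder; // a strong reference keeps finalize from ever running

    @Override
    protected void finalize() {
        System.out.println("finalize called");
    }

    public static void main(String[] args) throws InterruptedException {
        holder = new FinalizeDemo();
        System.gc();
        Thread.sleep(500); // nothing printed: the object is still strongly reachable
        holder = null;
        System.gc();
        Thread.sleep(500); // now "finalize called" is (typically) printed
    }
}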

The second possibility is something I have already covered in another article, “A Java process OOM troubleshooting (glibc)”: juejin.cn/post/685457… It turned out not to be the issue here.

Let's look at the first possibility then, which we can check by dumping the heap. Sure enough, there are eight Inflater objects that are still alive and survive GC. Besides the JVM-internal java.lang.ref.Finalizer reference, there are other references that prevent these Inflaters from being reclaimed during GC.

Is this memory really related to the 64 MB blocks, though? There is no proof yet, so let's make sure. The Inflater class has a zsRef field, which is essentially a pointer address. Let's check whether the zsRef address of an Inflater that has not been freed falls inside one of those 64 MB memory blocks.

public class Inflater {
    private final ZStreamRef zsRef;
}

class ZStreamRef {
    private volatile long address;

    ZStreamRef(long address) {
        this.address = address;
    }

    long address() {
        return address;
    }

    void clear() {
        address = 0;
    }
}

Taking one of them: address = 140686448095872, which is 0x7ff41dc37280 in hexadecimal.

It sits inside one of our 64 MB memory blocks, as a quick check shows.
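A quick sanity check (the block start 0x7ff41c000000 is the suspect region from pmap, the same one we dump below):

public class AddressCheck {
    public static void main(String[] args) {
        long addr = 140686448095872L;          // zsRef.address taken from the heap dump
        long blockStart = 0x7ff41c000000L;     // start of the suspect 64 MB block (from pmap)
        long blockEnd = blockStart + 64L * 1024 * 1024;

        System.out.println(Long.toHexString(addr));                // 7ff41dc37280
        System.out.println(addr >= blockStart && addr < blockEnd); // true: inside the block
    }
}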

If you're still not convinced, we can dump that chunk of memory. Here is a small script I wrote for that:

cat /proc/$1/maps | grep -Fv ".so" | grep " 0 " | awk '{print $1}' | grep $2 | ( IFS="-"
while read a b; do
dd if=/proc/$1/mem bs=$( getconf PAGESIZE ) iflag=skip_bytes,count_bytes \
skip=$(( 0x$a )) count=$(( 0x$b - 0x$a )) of="$1_mem_$a.bin"
done )

Dump the memory by passing in the process ID and the start address of the region you want to dump:

./dump.sh `pidof java` 7ff41c000000

Running the script produces a 64 MB memory dump file, which we can then inspect with strings:

strings 6095_mem_7ff41c000000.bin

The screen is filled with information about class files.

At this point there should be nothing left to prove; all that remains is the analysis.

So who is referencing these Inflaters? Expanding the reference chain, a bunch of ClassLoaders appear.

An unexpected discovery (not very relevant to the problem, and easy to fix)

Among them is a strange Nashorn ClassLoader. Nashorn handles JavaScript logic, so why would this project have a Nashorn ClassLoader at all? After a careful search, nothing in the project code uses it.

So which middleware pulled it in? It turns out Log4j2 supports JavaScript, Groovy and other scripting languages in its configuration.


<Configuration status="debug" name="RoutingTest">
  <Scripts>
    <Script name="selector" language="javascript">
       </Script>
    <ScriptFile name="groovy.filter" path="scripts/filter.groovy"/>
  </Scripts>
</Configuration>

We never used anything like this in our project (mostly because we didn't know it existed). Honestly, words fail me: you are a logging library, just be a good logging tool; why all these fancy tricks?

Here’s the code

I had a quick look at this: up to the latest official release there is still no switch to disable the ScriptEngine. Fine, I'll handle it myself: pull the Log4j2 source for the version used in the project, modify it, and repackage it.

After rerunning, the Nashorn ClassLoader is gone. But this was just a small detour; the native memory usage problem was still unresolved.

The real culprit surfaces

Next, let's find out which code is frantically calling the java.util.zip.Inflater.inflateBytes method.

Using watch to run jstack once a second, we immediately see Hibernate making these calls like crazy.

Hibernate is a legacy of our old code and has not been removed.

The corresponding code is org.hibernate.jpa.boot.archive.internal.JarFileBasedArchiveDescriptor#visitArchive, line 146.

The offending code calls jarFile.getInputStream(zipEntry), creating a new stream each time, but never closes it.

Frankly, I don't know why Hibernate scans and parses every class in my JAR package at all; it seems completely unnecessary.

So let's rip this code out and turn it into a minimal demo.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.jar.JarFile;
import java.util.zip.ZipEntry;

public class JarFileTest {

    public static void main(String[] args) throws IOException {
        new JarFileTest().process();
        System.in.read();
    }

    // The original article omits the read logic; a plain read-to-byte-array loop is assumed here.
    public static byte[] getBytesFromInputStream(InputStream inputStream) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = inputStream.read(buffer)) != -1) {
            result.write(buffer, 0, n);
        }
        return result.toByteArray();
    }

    public void process() throws IOException {
        JarFile jarFile = null;
        try {
            jarFile = new JarFile("/data/dev/broadcast/EasiCareBroadCastRPC.jar");
            final Enumeration<? extends ZipEntry> zipEntries = jarFile.entries();
            while (zipEntries.hasMoreElements()) {
                final ZipEntry zipEntry = zipEntries.nextElement();
                if (zipEntry.isDirectory()) {
                    continue;
                }
                // this is the original leak: the stream returned by getInputStream is never closed
                byte[] bytes = getBytesFromInputStream(jarFile.getInputStream(zipEntry));
                System.out.println("processing: " + zipEntry.getName() + "\t" + bytes.length);
            }
        } finally {
            try {
                if (jarFile != null) jarFile.close();
            } catch (Exception e) {
            }
        }
    }
}

Run the code above.

javac JarFileTest.java
java -Xms1g -Xmx1g -XX:MetaspaceSize=120m -XX:MaxMetaspaceSize=400m -XX:+UseConcMarkSweepGC -cp . JarFileTest

The RES footprint immediately skyrockets to over 1.2 GB, and no amount of GC can reclaim it, yet the heap is almost empty.

The RES memory usage is shown as follows.

The heap memory footprint is shown below. After GC, the footprint of the new generation is 0, and the footprint of the old generation is 275K

It is all taken up by 64 MB memory blocks.

Modify the code so that the stream is closed:

while (zipEntries.hasMoreElements()) {
    final ZipEntry zipEntry = zipEntries.nextElement();
    if (zipEntry.isDirectory()) {
        continue;
    }

    InputStream is = jarFile.getInputStream(zipEntry);
    byte[] bytes = getBytesFromInputStream(is);

    System.out.println("processing: " + zipEntry.getName() + "\t" + bytes.length);
    try {
        is.close();
    } catch (Exception e) {

    }
}
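Equivalently, and a bit more idiomatically, the same loop can use try-with-resources so the stream is closed even if reading throws (a sketch based on the demo above):

while (zipEntries.hasMoreElements()) {
    final ZipEntry zipEntry = zipEntries.nextElement();
    if (zipEntry.isDirectory()) {
        continue;
    }
    // try-with-resources closes the stream automatically, even on exceptions
    try (InputStream is = jarFile.getInputStream(zipEntry)) {
        byte[] bytes = getBytesFromInputStream(is);
        System.out.println("processing: " + zipEntry.getName() + "\t" + bytes.length);
    }
}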

Testing again, the problem is solved and the native memory usage is almost gone. Now it's time to fix the real project. One option is to remove Hibernate entirely and replace it with MyBatis, which I'm not familiar with; so I decided to patch the Hibernate source code instead.

Attempting a fix

Modify this code to add the close logic (strictly speaking, all the close calls should go in a finally block, but let's keep it simple here).

Recompile Hibernate, install it locally, repackage the application and run it again. The RES usage drops from about 1.5 GB to just over 700 MB.

Fortunately, the 64 MB native blocks now account for very, very little of that; 448 MB of the roughly 700 MB is the dirty portion of the heap area, which the JVM has reserved anyway.

At this point the problem is essentially solved. Newer versions of Hibernate have already fixed this issue, but I'm not going to upgrade.

See:

github.com/hibernate/h…

Afterword

Since they are not the focus of this article, I have not gone into how to use some of the tools mentioned; you can look them up yourself if you are interested.

In fact, native memory leaks are not as complicated as they seem; with NMT, pmap and tcmalloc they can be analyzed step by step. As long as a problem can be reproduced, it hardly counts as a bug.

Finally cherish life and stay away from Hibernate.

If you have any questions about the JVM, feel free to contact me on WeChat or via my official account.