I recently ran into a glibc memory reclamation problem. The process of finding the cause and experimenting along the way was quite interesting. It mainly involves the following:

  • The classic problem of 64M memory regions in Linux processes
  • The underlying principles of glibc's memory allocator, ptmalloc2
  • How to write a custom malloc hook shared library (.so)
  • glibc memory allocation internals (arenas, the chunk structure, bins, etc.)
  • The effect of malloc_trim on actually returning memory to the system
  • The GDB heap debugging tools used
  • An introduction to the jemalloc library and how it was applied

Background

Some time ago, a colleague reported that a Java RPC project was killed shortly after startup inside its container because the container exceeded its 1500M memory quota. I helped take a look.

After a quick run directly on a Linux machine, the JVM started up and top showed more than 1.5 GB of RES memory, as shown in the figure below.

Arthas is a good starting point for seeing how the memory is distributed: type dashboard to view the current memory usage, as shown below.

As you can see, the heap used by the process is only about 300M, and non-heap is also small; together they add up to roughly 500M, nowhere near 1.5G. So let's review the components of JVM memory.

Where does the JVM spend its memory

The memory of a JVM process is roughly divided into the following parts:

  • Heap: Eden, Survivor, and old generation areas, etc.
  • Thread stacks: each thread reserves stack space (1 MB per thread by default)
  • Non-heap: includes code_cache, metaspace, etc.
  • Off-heap memory: memory requested via Unsafe.allocateMemory and DirectByteBuffer
  • Memory requested by native (C/C++) code
  • Memory the JVM itself needs to run, for GC bookkeeping and so on

The next suspects were off-heap memory and native memory. Off-heap memory can be tracked by enabling NativeMemoryTracking (NMT): restarting the program with -XX:NativeMemoryTracking=detail showed that the memory NMT accounts for is still far smaller than the RES value.

Since NMT does not track memory requested by native (C/C++) code, suspicion shifted there. Our project does not use native libraries such as rocksdb, so the only native code left is the JVM itself. Let's keep digging.

The familiar Linux 64M memory problem

pmap -x was used to check the memory layout, and a large number of memory regions of roughly 64M showed up, as shown in the figure below.

This phenomenon is all too familiar: isn't this the classic glibc 64M memory problem on Linux?

Ptmalloc2 and arena

The early malloc in Linux, implemented by Doug Lea, had a serious problem: there was only a single allocation area (arena), and every allocation had to take its lock. The word arena literally means 'arena', and the name fits: it is the stage on which the memory allocator performs.

With more arenas, lock contention naturally improves.

Wolfram Gloger improved on Doug Lea's work to make glibc's malloc multithread-friendly; the result is ptmalloc2. On top of the single main allocation area (main arena), non-main arenas were added. There is only ever one main arena, but there can be many non-main arenas; exactly how many will be explained later.

When malloc is called to allocate memory, it first checks whether the current thread's private variable already points to an arena. If it does, it tries to lock that arena (a runnable toy model of this selection logic follows the list):

  • If the lock succeeds, this arena is used to allocate the memory
  • If the lock fails, another thread is using the arena, so the arena list is traversed looking for an unlocked arena; if one is found, that arena is used for the allocation
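The sketch below is that toy model, written for illustration only: it is not glibc code, and arena_t, pick_arena and the fixed arena array are all made up for the sketch. Real ptmalloc2 keeps a dynamic arena list and creates new arenas on demand up to a limit.

#include <pthread.h>
#include <stdio.h>

#define NARENAS 4

typedef struct {            /* hypothetical stand-in for a ptmalloc2 arena */
    pthread_mutex_t lock;
    int id;
} arena_t;

static arena_t arenas[NARENAS];
static __thread arena_t *cached;   /* models the thread-private "last used arena" */

static arena_t *pick_arena(void) {
    /* 1. try to lock the arena this thread used last time */
    if (cached && pthread_mutex_trylock(&cached->lock) == 0)
        return cached;
    /* 2. otherwise walk the arena list looking for any unlocked arena */
    for (int i = 0; i < NARENAS; i++) {
        if (pthread_mutex_trylock(&arenas[i].lock) == 0) {
            cached = &arenas[i];
            return cached;
        }
    }
    /* 3. all arenas busy: real ptmalloc2 would create a new arena here
          (if still under the limit) or block waiting for one */
    pthread_mutex_lock(&arenas[0].lock);
    cached = &arenas[0];
    return cached;
}

int main(void) {
    for (int i = 0; i < NARENAS; i++) {
        pthread_mutex_init(&arenas[i].lock, NULL);
        arenas[i].id = i;
    }
    arena_t *a = pick_arena();      /* the allocation would then be served from 'a' */
    printf("picked arena %d\n", a->id);
    pthread_mutex_unlock(&a->lock);
    return 0;
}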

The main arena can request virtual memory from the system via brk or mmap; non-main arenas can only use mmap. Each non-main arena grabs virtual memory in 64MB blocks (on 64-bit), which glibc then carves into small chunks and hands out as the application requests memory.

This is where the classic 64M regions in a Linux process's memory map come from. How many arenas can there be? On 64-bit systems the limit is 8 times the number of cores, so a 4-core machine can have up to 32 of these 64MB regions.
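To see those 64M blocks with your own eyes, the following small experiment (my own sketch, not part of the original investigation) starts a few allocating threads and then pauses so the process can be inspected with pmap -x. On a 64-bit glibc system, each extra arena typically shows up as a pair of mappings adding up to 64MB.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 8

static void *worker(void *arg) {
    (void)arg;
    /* the first malloc in a new thread binds it to an arena; if no free
       arena exists and the limit has not been hit, a new one is created */
    char *p = malloc(4096);
    memset(p, 0, 4096);
    sleep(120);                 /* keep the thread alive so pmap can observe it */
    free(p);
    return NULL;
}

int main(void) {
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    printf("now run: pmap -x %d\n", getpid());
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}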

Is it because there are too many arenas?

Is setting MALLOC_ARENA_MAX=1 useful?

Adding this environment variable and starting the Java process again, the 64MB regions do disappear, but the memory is now concentrated into one large region of close to 700MB, as shown in the figure below.
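For a program whose source you control, the same cap can be set in code via mallopt; for the JVM the environment variable is the practical option. A minimal sketch, assuming a glibc that exposes M_ARENA_MAX through malloc.h:

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* equivalent to starting the process with MALLOC_ARENA_MAX=1;
       call it early, before other threads start allocating */
    if (mallopt(M_ARENA_MAX, 1) != 1)
        fprintf(stderr, "M_ARENA_MAX not supported by this malloc\n");

    void *p = malloc(1024);     /* all threads now share the single main arena */
    free(p);
    return 0;
}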

At this point the memory footprint problem is still not solved, so we keep digging.

Who is allocating and freeing memory

Next, write a hook for the malloc family of functions: using the LD_PRELOAD environment variable, we can replace glibc's functions with our own and print a log line before forwarding to the real malloc, free, realloc, or calloc. Taking the malloc hook as an example, part of the code is shown below.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>
#include <sys/syscall.h>

// Get the thread ID instead of the pid (on newer glibc this clashes with
// the library's own gettid() and may need renaming)
static pid_t gettid() {
    return syscall(__NR_gettid);
}

static void *(*real_malloc)(size_t size) = 0;

void *malloc(size_t size) {
    void *p;
    if (!real_malloc) {
        // resolve the real glibc malloc the first time the hook is hit
        real_malloc = dlsym(RTLD_NEXT, "malloc");
        if (!real_malloc) return NULL;
    }
    p = real_malloc(size);
    // printLog and GETRET are helpers defined elsewhere in the hook library
    printLog("[0x%08x] malloc(%u)\t= 0x%08x ", GETRET(), size, p);
    return p;
}
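For completeness, the free hook follows exactly the same pattern. The sketch below logs with fprintf instead of the printLog/GETRET helpers used above, just to stay self-contained; note that logging from inside free can itself trigger allocations, which a real hook library has to guard against.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

static void (*real_free)(void *ptr) = 0;

void free(void *ptr) {
    if (!real_free) {
        real_free = dlsym(RTLD_NEXT, "free");
        if (!real_free) return;
    }
    // log the pointer being released, then forward to the real free
    fprintf(stderr, "free(%p)\n", ptr);
    real_free(ptr);
}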

Set LD_PRELOAD to start the JVM

LD_PRELOAD=/app/my_malloc.so java  -Xms -Xmx -jar ....

jstack is also used to dump thread stacks during JVM startup, and once the JVM process is fully started, the malloc hook log and the jstack output are examined together.

Below is a snippet of the malloc log output (the full log is tens of megabytes); the first column is the thread ID.

Use awk to process the log and count how many allocations each thread made.

cat malloc.log | awk '{print $1}' | sort | uniq -c | sort -rn | less

 284881 16342
    135 16341
     57 16349
     16 16346
     10 16345
      9 16351
      9 16350
      6 16343
      5 16348
      4 16347
      1 16352
      1 16344

As you can see, thread 16342 made by far the most allocations. So what is this thread doing? Searching for thread 16342 (0x3FD6) in the jstack output shows that it is repeatedly decompressing jar packages.

Java handles zip decompression with the java.util.zip.Inflater class, and calling its end method releases the native memory. My first thought was that end was not being called, which is indeed possible: java.util.zip.InflaterInputStream's close method does not call Inflater.end in some scenarios, as shown below.

It was a bit early to celebrate, though. Even if the upper layer never calls Inflater.end, Inflater's finalize method calls end when the object is garbage collected, so a full GC should release this native memory. Let's trigger one manually:

jcmd `pidof java` GC.run

The GC log confirms that a full GC was triggered, but the memory did not come down. Checking for memory leaks with tools such as Valgrind also turned up nothing.

If the JVM itself is not leaking, then suspicion falls on glibc: free hands the memory back to glibc, but glibc never returns it to the operating system.

glibc memory allocation principles

This is a very complex topic; if it is completely unfamiliar to you, I suggest reading the following material first.

  • Understanding glibc malloc sploitfun.wordpress.com/2015/02/10/…

  • Taobao's Glibc memory management – Ptmalloc2 source code analysis paper.seebug.org/papers/Arch…

In general, the following concepts need to be understood:

  • Memory allocation area Arena
  • Memory chunk
  • Bins of free Chunks

Memory allocation area Arena

The concept of the memory allocation arena was introduced earlier and is relatively simple. To get a more intuitive view of the internal structure of the heap, you can use one of the heap extension packages for GDB:

  • Libheap: github.com/cloudburst/…
  • Pwngdb: github.com/scwuaptx/Pw…
  • PWNDBG: github.com/pwndbg/pwnd…

These are also the tools used for CTF heap challenges; Pwngdb is used for the walkthrough here. Typing arenainfo shows the list of arenas, as shown below.

In this example, there is 1 main arena and 15 non-main arenas.

Structure of memory chunks

The concept of a chunk is also easy to understand: a chunk is literally a 'chunk' of memory, and it is what the user sees; every block of memory allocated to the user is represented internally as a chunk.

This may not be easy to understand, but here is a practical example.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *p;

    p = malloc(1024);
    printf("%p\n", p);

    p = malloc(1024);
    printf("%p\n", p);

    p = malloc(1024);
    printf("%p\n", p);

    getchar();
    return (EXIT_SUCCESS);
}

This code allocates 1 KB three times; the returned addresses are as follows:

./malloc_test

0x602010
0x602420
0x602830

The output of pmap is as follows.

You can see that the first returned address, 0x602010, is at offset 16 (0x10) from the base address of this memory region (0x602000). The second address is 1024 + 16 bytes further on:

0x602420 = 0x602010 + 1024 + 16

The third allocation is the same, with 0x10 bytes empty each time. What is this 0x10 that’s left out?

This is clear when you use GDB to look at the 32-byte area starting 0x10 bytes before the three memory addresses.

You can see that it actually stores 0x0411:

0x0411 = 1024 (0x0400) + 0x10 (chunk overhead) + 0x01 (flag bit)

Here 1024 is obviously the size the user requested, so what are the extra 0x10 and 0x01? The 0x10 is the chunk's bookkeeping overhead after alignment. And because chunk sizes are always aligned, the lowest three bits of the size field carry no size information, so ptmalloc2 borrows them to store flags. The structure of a chunk in use is shown below.
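The figure is essentially glibc's malloc_chunk header. For reference, here is a simplified version of that struct; the field names follow older glibc sources (newer versions rename them to mchunk_prev_size / mchunk_size), and INTERNAL_SIZE_T is written as size_t here:

struct malloc_chunk {
    size_t prev_size;                  /* size of the previous chunk, only valid if it is free */
    size_t size;                       /* chunk size in bytes, low 3 bits used as A/M/P flags */
    struct malloc_chunk *fd;           /* forward/backward links, used only while the chunk   */
    struct malloc_chunk *bk;           /* sits free in a bin; in-use chunks store user data here */
    struct malloc_chunk *fd_nextsize;  /* only used for chunks in large bins */
    struct malloc_chunk *bk_nextsize;
};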

The meaning of the lowest three bits is as follows:

  • A: indicates whether the chunk belongs to the main arena or a non-main arena. If the chunk comes from a non-main arena this bit is 1, otherwise 0
  • M: indicates which region the chunk came from. M = 1 means the chunk was allocated directly by mmap, otherwise it came from the heap region
  • P: indicates whether the previous chunk is in use. If P is 0 the previous chunk is free, and in that case the chunk's first field prev_size is valid

In this example, the lowest three bits are 0b001: A = 0 means the chunk belongs to the main arena, M = 0 means it was allocated from the heap region rather than directly by mmap, and P = 1 means the previous chunk is in use.
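The same 0x411 value can be read from code instead of GDB. The sketch below peeks at the machine word immediately before the pointer returned by malloc; it relies on glibc's internal chunk layout on 64-bit and is for exploration only.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p = malloc(1024);
    /* the size field lives in the 8 bytes right before the returned pointer;
       its low 3 bits are the A/M/P flags described above */
    size_t raw = *((size_t *)p - 1);
    printf("p = %p, size field = 0x%zx\n", p, raw);   /* expect 0x411, as in the GDB session */
    free(p);
    return 0;
}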

The glibc source code makes this a little clearer.

#define PREV_INUSE 0x1
/* extract inuse bit of previous chunk */
#define prev_inuse(p) ((p)->size & PREV_INUSE)

#define IS_MMAPPED 0x2
/* check for mmap()'ed chunk */
#define chunk_is_mmapped(p) ((p)->size & IS_MMAPPED)

#define NON_MAIN_ARENA 0x4
/* check for chunk from non-main arena */
#define chunk_non_main_arena(p) ((p)->size & NON_MAIN_ARENA)

#define SIZE_BITS (PREV_INUSE|IS_MMAPPED|NON_MAIN_ARENA)
/* Get size, ignoring use bits */
#define chunksize(p) ((p)->size & ~(SIZE_BITS))

The structure of an allocated chunk was introduced above; the layout of a free chunk is different after it is released. There is also a special chunk called the top chunk, which will not be expanded on here.

Bins: recycling bins for chunks

Bin literally means 'rubbish bin'. When the application calls free, the chunk is not necessarily returned to the system immediately; glibc intercepts it. For efficiency, the next time the user requests memory, ptmalloc2 first tries to find a suitable chunk in this pool of freed chunks and hand it back to the application, avoiding frequent brk and mmap system calls.
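A quick way to observe this caching (a sketch; the exact behaviour depends on the glibc version, since newer ones also put a per-thread tcache in front of the bins):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *a = malloc(1024);
    printf("first  malloc: %p\n", a);
    free(a);                       /* the chunk goes into glibc's free lists, not back to the OS */

    void *b = malloc(1024);        /* typically served straight from the cached chunk */
    printf("second malloc: %p\n", b);   /* usually the same address as the first */
    free(b);
    return 0;
}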

To manage memory allocation and recycling more efficiently, ptmalloc2 uses an array that maintains 128 bins.

These Bins are described below.

  • Bin 0 is currently unused
  • Bin 1 is the unsorted bin, used to hold freshly freed chunks and the leftovers from splitting a chunk; the chunk size is unrestricted
  • Bin 2 ~ bin 63 are the small bins, which maintain chunks smaller than 1024B. Chunks on the same small bin list all have the same size, which is index * 16; for example, bin 2 holds chunks of 32 (0x20) bytes and bin 3 holds chunks of 48 (0x30) bytes. Note: the figure in the Taobao PDF is off here; it shows * 8, but judging from the source code it should be * 16 on 64-bit (* 8 is the 32-bit case)
  • Bin 64 ~ bin 126 are the large bins, which maintain chunks larger than 1024B. Chunks on the same list may have different sizes; the exact rules are not the focus of this article and will not be expanded

In our case, the bins of each arena can also be inspected in Pwngdb, as shown in the figure below.

fastbin

In addition to bin, ptmalloc also has a very important structure called Fastbin, which is used to manage small chunks of memory.

On 64-bit systems, when a chunk smaller than 128 bytes is freed, it is first placed into a fastbin. The P flag of a chunk in a fastbin is always 1: fastbin chunks are treated as if they were in use, so they are never merged with neighbouring chunks.

When allocating memory less than 128 bytes, ptmalloc will first look for the corresponding free block in Fastbin and, if not, in other bins.

From another perspective, Fastbin can be seen as a cache for Smallbin.

Memory fragmentation and reclamation

Let's do an experiment to see how memory fragmentation affects memory reclamation in glibc. The code is shown below.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define K (1024)
#define MAXNUM 500000

int main() {

    char *ptrs[MAXNUM];
    int i;
    // malloc large block memory
    for (i = 0; i < MAXNUM; ++i) {
        ptrs[i] = (char *)malloc(1 * K);
        memset(ptrs[i], 0, 1 * K);
    }

    // Never freed: only a 1-byte "leak" -- what impact will it have on the system?
    char *tmp1 = (char *)malloc(1);
    memset(tmp1, 0, 1);

    printf("%s\n", "malloc done");
    getchar();

    printf("%s\n", "start free memory");
    for (i = 0; i < MAXNUM; ++i) {
        free(ptrs[i]);
    }
    printf("%s\n", "free done");

    getchar();

    return 0;
}

The program mallocs roughly 500 MB of memory (500,000 x 1 KB), then mallocs 1 byte (actually a bit more than 1 byte because of chunk overhead, but that does not affect the discussion), and then frees the 500 MB.

The memory footprint before free is shown below.

After calling free, using top to view the RES results in the following.

You can see that glibc does not actually return the memory to the system. Instead, it keeps it in its own unsorted bin, which is clearly visible with Pwngdb's arenainfo command.

0x1EFE9200 is 520,000,000 in decimal, which is exactly 500,000 chunks x (1024 + 16) bytes: the roughly 500 MB we just freed.

If the second malloc (the 1-byte allocation) is commented out, glibc returns the memory to the system as soon as it is freed.

This experiment demonstrates the effect of memory fragmentation on glibc's memory footprint.

Glibc and malloc_trim

The malloc_trim function is provided in glibc.

Man7.org/linux/man-p…

Its man page describes it as attempting to release free memory from the heap. In our case, calling malloc_trim actually returned more than 500 MB of memory to the system:

gdb --batch --pid `pidof java` --ex 'call malloc_trim()'

Looking at the glibc source code, the implementation of malloc_trim has changed in newer versions: it now iterates over all arenas and, for each arena, over the free chunks, issuing madvise(MADV_DONTNEED) system calls to tell the kernel that those pages can be reclaimed.

This can be confirmed with the following SystemTap script.

probe begin {
    log("begin to probe\n")
}

probe kernel.function("SYSC_madvise") {
    if (ppid() == target()) {
        printf("\nin %s: %s\n", probefunc(), $$vars) print_backtrace(); }}Copy the code

When malloc_trim is executed, there are a number of madvise system calls, as shown in the figure below.

behavior=0x4 corresponds to MADV_DONTNEED, len_in is the length, and start is the start address of the memory range.
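MADV_DONTNEED is easy to reproduce in isolation. The sketch below maps and touches 64MB, then madvises it away: RES drops while the virtual mapping stays valid, which is the same effect malloc_trim achieves for free chunks.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 1, len);      /* touch the pages so they count toward RES */
    getchar();              /* check RES in top: roughly +64MB */

    madvise(p, len, MADV_DONTNEED);   /* tell the kernel the pages can be dropped */
    getchar();              /* RES drops back; the mapping itself is still usable */

    munmap(p, len);
    return 0;
}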

malloc_trim also works for the memory fragmentation experiment from the previous section.
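For example, here is a variant of that experiment that calls malloc_trim explicitly after the frees (a sketch, assuming glibc; malloc_trim is declared in malloc.h, and a pad of 0 asks it to release as much as possible):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define K (1024)
#define MAXNUM 500000

int main(void) {
    static char *ptrs[MAXNUM];

    for (int i = 0; i < MAXNUM; ++i) {
        ptrs[i] = (char *)malloc(1 * K);
        memset(ptrs[i], 0, 1 * K);
    }
    char *tmp1 = (char *)malloc(1);   /* the 1-byte allocation that pins the top of the heap */
    memset(tmp1, 0, 1);

    for (int i = 0; i < MAXNUM; ++i)
        free(ptrs[i]);
    printf("freed, RES is still high -- press enter to trim\n");
    getchar();

    malloc_trim(0);                   /* madvise the freed chunks back to the kernel */
    printf("after malloc_trim, RES should drop -- press enter to exit\n");
    getchar();
    return 0;
}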

Enter jemalloc

Is there a malloc library whose allocation strategy handles fragmentation better than glibc's? The best-known candidates in the industry are Google's tcmalloc and Facebook's jemalloc.

LD_PRELOAD is used to load the jemalloc library.

LD_PRELOAD=/usr/local/lib/libjemalloc.so

Restart the Java program and you can see that the memory RES consumption is reduced to about 1G

With jemalloc, the process uses about 500 MB less than with glibc's ptmalloc2, only slightly above the 900-plus MB we were left with after malloc_trim.

As for why jemalloc does so much better in this scenario, that is a complicated topic I will not expand on here; I may cover jemalloc's implementation in detail when I have time.

In several of my experiments, malloc_trim could cause the JVM to crash, so use it with caution.

After ptmalloc2 is replaced with Jemalloc, the memory RES usage of the process is significantly reduced. As for performance and stability, further observation is needed.

Side notes

I have also been writing a simple malloc library recently. Only by writing one yourself can you really appreciate which pain points tcmalloc and jemalloc are trying to solve, and what trade-offs lie behind their complex designs.

Summary

Memory-related problems are relatively complex and can be affected by many factors. Problems in the application layer are the easiest to rule out; when the cause lies in glibc or the kernel itself, it can only be pinned down through bold hypotheses and careful verification. Memory allocation and management is a huge topic that will be covered in more detail in later articles.

Everything above may be wrong; just take the approach for reference.