Review

In the previous article on switching the JVM from G1 to CMS, we changed the collector and adjusted the JVM parameters; with a better GC choice and parameter settings, memory growth became very slow.

However, this did not fundamentally solve the problem. Observation showed that RSS could still grow by as much as about 100 MB a day, and the overall trend was still upward, with no sign of falling back.

Problem analysis

Although the growth is slow, even 1 MB a day means an OOM is only a matter of time. That led us to take a closer look at why RSS keeps growing.

Heap memory analysis

Monitoring showed the heap growing and being reclaimed periodically, consistent with our JVM parameter settings, and the heap dump revealed no obvious problems in the business code.
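For reference, a dump like the one analyzed here can be captured from the running process with jcmd (a minimal sketch; PID 1 matches the commands used later in this article, and the output path is just a placeholder):

# Capture a heap dump of process 1 for offline analysis
jcmd 1 GC.heap_dump /tmp/heap.hprof
# Alternatively: jmap -dump:live,format=b,file=/tmp/heap.hprof 1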

Off-heap memory

Let's review our JVM options:

-Xms2048m -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+CrashOnOutOfMemoryError -XX:NativeMemoryTracking=detail -XX:+UseConcMarkSweepGC -XX:MetaspaceSize=256M -XX:MaxMetaspaceSize=256M -XX:ReservedCodeCacheSize=128m -XX:InitialCodeCacheSize=128m -Xss512k -XX:+AlwaysPreTouch

Metaspace and the code cache are hard-capped by these flags, so the next suspect was the Direct Buffers in the Buffer Pools.

You can see the direct buffer usage is a flat line, with no fluctuation.
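For completeness, the effective direct memory cap can be checked on the running process (a quick sketch; with no explicit -XX:MaxDirectMemorySize the JVM derives a default from the heap size):

# Print the MaxDirectMemorySize flag of process 1; a value of 0 usually means the JVM default is in effect
jinfo -flag MaxDirectMemorySize 1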

RSS, however, kept growing, so we turned to Native Memory Tracking (NMT) to track memory usage inside the JVM.

Since we had already enabled NMT with -XX:NativeMemoryTracking=detail, we could query it with jcmd.

Start with a baseline:

jcmd 1 VM.native_memory baseline

Then, after some time has passed:

jcmd 1 VM.native_memory summary.diff

Compare the two sets of statistics in the diff. The figure below is only an example; its specific numbers are not meaningful, because I re-ran the command just for illustration.

In our actual case, the biggest growth was in the malloc portion of the Class section.
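To focus on just that section, the diff output can be filtered, for example (a rough sketch, matching on the section name):

# Show only the Class-related lines of the NMT diff, with a few lines of context
jcmd 1 VM.native_memory summary.diff | grep -A 3 -i class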

malloc? That is the memory allocation function. Why so many allocations? Were they not being released? This led me to the pmap command to look at the process's memory mappings.

The pmap command displays the memory map of one or more processes, reporting address space and memory state information for each.

The following command is executed:

pmap -x 1 | sort -n -k3

Some clues:

There were a number of memory regions of roughly 64 MB each, and their count kept increasing.
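As a rough way to count these regions, the pmap output can be filtered for mappings close to 64 MB (a sketch; the second column of pmap -x is the mapping size in KB, and the thresholds are approximate):

# Count mappings whose size is roughly 64 MB (65536 KB)
pmap -x 1 | awk '$2 >= 60000 && $2 <= 65536' | wc -l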

glibc

I didn't understand this, so I googled it and found that this is a known problem. Because it involves a lot of background knowledge, only a general analysis is given here; interested readers can dig into more material on their own.

Most server programs use the malloc/free family of functions provided by glibc to allocate memory.

The early version of malloc on Linux, implemented by Doug Lea, had a serious problem: there was only one allocation area (arena), and every allocation had to lock it. The word arena literally means "a place of contest", which is fitting given the lock contention.

If more arenas are opened, the lock contention naturally improves.

Wolfram Gloger improved on Doug Lea's work to give glibc a multithread-capable malloc, ptmalloc2. On top of the original single arena, non-main arenas were added: there is only one main arena, but there can be many non-main arenas.

When malloc is called, it first checks whether the current thread's private variable already holds an arena. If so, it tries to lock that arena, and if the lock succeeds, that arena is used for the allocation.

If the lock fails, another thread is using that arena, so malloc traverses the arena list looking for an unlocked arena; if one is found, that arena is used for the allocation.

The main arena can request virtual memory from the kernel with either brk or mmap, while non-main arenas can only use mmap. glibc requests virtual memory in 64 MB blocks and then carves each block into smaller pieces to satisfy individual allocation requests.

This is the typical 64 MB problem in a Linux process's memory layout. How many of these regions can there be? On 64-bit systems, the limit defaults to 8 * number of cores, so with 4 cores there can be up to 32 of these 64 MB regions.
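A quick way to see what that default works out to on a given machine (a sketch of the arithmetic, not an exact glibc query):

# Default arena cap on 64-bit glibc is 8 * cores; each arena can reserve a 64 MB block
echo "max arenas: $((8 * $(nproc)))"
echo "worst-case arena virtual memory: $((8 * $(nproc) * 64)) MB"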

glibc introduced these per-thread memory pools (arenas) starting with version 2.11, and we are on version 2.17, which can be checked with the following command:

# View the glibc version
ldd --version  

Problem solving

On the server, the maximum number of arenas can be controlled with a single environment variable, MALLOC_ARENA_MAX:

export MALLOC_ARENA_MAX=1

Since we run inside a Docker container, the variable is added to the container's startup parameters.
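For illustration, a minimal sketch of how the variable can be passed to the container; the image name and other flags here are placeholders, not our actual startup command:

# Pass MALLOC_ARENA_MAX through the container environment (image name is a placeholder)
docker run -d -e MALLOC_ARENA_MAX=1 my-java-app:latest

# Or, in a docker-compose.yml service definition:
#   environment:
#     - MALLOC_ARENA_MAX=1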

After the container was restarted, those 64 MB regions were no longer being allocated.

But RSS was still growing, although it seemed to grow more slowly this time, so back to Google. (Observing other environments for a longer period afterwards showed the setting does work: RSS rises in the short term but falls back later.)

glibc's allocation strategy can cause fragmentation and prevent memory from being reclaimed, which looks like a memory leak. Is there a malloc library that handles fragmentation better? The well-known options in the industry are Google's tcmalloc and Facebook's jemalloc.

tcmalloc

Installation:

yum install gperftools-libs.x86_64 

Load it with LD_PRELOAD:

export LD_PRELOAD="/usr/lib64/libtcmalloc.so.4.4.5"

Note that the Java application needs to be restarted for this to take effect. In my tests, RSS kept growing even with tcmalloc, so it did not work for me.
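Before drawing conclusions it is worth confirming that the preloaded allocator was actually picked up by the JVM (a quick check, again assuming the process PID is 1):

# If the library shows up in the process's memory maps, the preload took effect
grep tcmalloc /proc/1/maps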

jemalloc

Installation:

yum install epel-release  -y
yum install jemalloc -y

Load it with LD_PRELOAD:

export LD_PRELOAD="/usr/lib64/libjemalloc.so.1"

After switching to jemalloc, RSS fluctuated periodically within a range of about 2 percent, which is basically under control.
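If you want to see what jemalloc itself is doing, it can dump its internal statistics when the process exits via the MALLOC_CONF environment variable (a small sketch; the exact output depends on the jemalloc version, and the variable must be set before the Java process starts):

# Ask jemalloc to print allocator statistics to stderr on process exit
export MALLOC_CONF="stats_print:true"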

How jemalloc works

Like tcmalloc, jemalloc gives each thread a thread-local cache, so allocations below 32 KB require no locking.

On 64-bit systems, jemalloc uses the following size-class classification:

  • Small: [8], [16, 32, 48, …, 128], [192, 256, 320, …, 512], [768, 1024, 1280, …, 3840]
  • Large: [4 KiB, 8 KiB, 12 KiB, …, 4072 KiB]
  • Huge: [4 MiB, 8 MiB, 12 MiB, …]

Small and large objects find their metadata in constant time; huge objects find theirs in logarithmic time via a global red-black tree.

Virtual memory is logically divided into chunks (4 MB by default, i.e. 1024 4 KB pages). An application thread is bound to an arena on its first malloc via a round-robin algorithm. Arenas are independent of one another and each maintains its own chunks; a chunk cuts its pages into small and large objects. Memory released by free() is always returned to the arena it belongs to, no matter which thread calls free().

In the figure above you can see the chunk structure managed by each arena. The chunk header maintains a page map (the state of the objects on its 1024 pages), with the page space below it. Small objects of the same size class are grouped together, with their metadata stored at the start of their run; large objects are independent of each other, with their metadata stored in the chunk header map.

When allocating through an arena, either the arena's bin (one per small size class, for fine-grained locking) or the arena itself must be locked. Objects in the thread cache are also returned to their arena gradually, via an incremental garbage-collection style mechanism.

jemalloc vs. tcmalloc

This is a comparison of server throughput across six malloc implementations; tcmalloc and jemalloc come out best (these are Facebook's 2011 test results, and the tcmalloc tested there is an older version).

In summary, tcmalloc and jemalloc are both very effective in multi-threaded environments. When the number of threads is fixed and threads are not frequently created and destroyed, jemalloc is a good choice; otherwise, tcmalloc may be the better option.

Conclusion

If you see many of these 64 MB allocations in a process's memory, you may have fallen into the same pit. Set MALLOC_ARENA_MAX and observe patiently; if that does not help, try jemalloc or tcmalloc.

References

  • www.cnhalo.net/2016/06/13/…
  • tech.meituan.com/2019/01/03/…
  • www.heapdump.cn/article/170…
  • engineering.fb.com/2011/01/03/…