This article is a transcript of a talk I gave at our company's TechDay. I hadn't planned to write such a lengthy article, but many colleagues asked for a text version, so here it is.

Speaking of which, this is the second time I've presented at TechDay. The first time was four years ago, a "MySQL Best Practices" talk that I no longer think holds up very well. Without further ado, let's get into the details.

The theme of this talk is "A Close Look at Memory Problems", and it covers the following topics:

  • The underlying principles of Linux memory management
  • The basic implementation of malloc and free
  • How ptmalloc2 is implemented
  • The internal structure of Arena, Heap, Chunk, and Bins
  • Memory issues relevant to Java development

Why share this topic

Because these are the questions I get asked most often: what do I do if my application gets OOM-killed? What if my application exceeds its memory quota and gets killed by Kubernetes? My application's memory usage looks high, is that normal?

Take one example: Nginx being killed by the operating system because its memory usage was too high.

Under load, Nginx's memory climbed to about 130GB and the process was killed by the oom-killer, so the load test could not continue. When a mature component runs into a problem like this, where do you even start troubleshooting?

The second reason is that memory management is a huge subject. Of the Linux troika (CPU, I/O, and memory), memory is arguably the most complex, and it is inextricably linked to CPU and I/O performance. Only by understanding memory can we really understand many Linux performance problems.

We recently ran an 8 a.m. reading group, one hour every day from 8 to 9. I spent nearly 18 hours on memory alone and still didn't cover everything. So I'd like to take this opportunity to revisit some of that material and make the parts we use most in day-to-day development as clear as possible.

Understanding memory can help us better understand problems such as:

  • Why Go natively supports multiple return values
  • How Go's escape analysis works
  • How to analyze Java off-heap memory leaks
  • How C++ smart pointers are implemented

Part II: Principles of Linux memory management

Here we begin: the principles of Linux memory management. Like the three ultimate questions of philosophy, memory has three questions of its own: what is it, where does it come from, and where does it go when it is freed?

Virtual memory vs. physical memory

Let's start by looking at virtual memory versus physical memory. Their relationship is a good illustration of the adage that any problem can be solved by adding a layer of indirection. Virtual memory is exactly that layer.

Without virtual memory, a process could directly modify another process's memory. Virtual memory isolates each process's memory usage: every process gets its own private, contiguous, uniform virtual address space, a beautiful illusion. Like someone in love: with her, it feels as if you have the whole world.

All an application ever sees is virtual memory; the MMU maps virtual addresses to physical addresses. Linux manages memory in 4K pages, and since 4K = 2^12, the lower 12 bits of a virtual address are simply the in-page offset.

Now think of the page table as a one-dimensional array: for each page in the virtual address space we assign one array slot, which points to the real address of the physical page. For a virtual address 0x1234010, 0x010 is the in-page offset and 0x1234 is the virtual page number. The CPU looks up the physical page mapped to 0x1234 through the MMU, say 0x2B601000, adds the in-page offset 0x010, and arrives at the real physical address 0x2B601010. As shown in the figure below.
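The address arithmetic itself is trivial. Here is a minimal sketch of the split for the example address above, assuming 4K pages (so a 12-bit in-page offset):

#include <stdio.h>

int main(void) {
    unsigned long va     = 0x1234010;      // example virtual address from the text
    unsigned long offset = va & 0xfff;     // lower 12 bits: in-page offset
    unsigned long vpn    = va >> 12;       // remaining bits: virtual page number

    // Prints: vpn: 0x1234, offset: 0x10
    printf("vpn: 0x%lx, offset: 0x%lx\n", vpn, offset);
    return 0;
}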

The Linux four-level page table

One obvious problem with this approach is that the virtual address space can be very large. Even on a 32-bit system with 4GB of virtual address space and 3GB of user space, at 4KB per page the array needs 786,432 entries (768 x 1024). Each page-table entry takes 4 bytes, so mapping this space requires 3MB of memory just for the mapping table. (Note: many documents say 4MB here, which is not a big discrepancy. My reasoning is that kernel space is shared, so we only need to count the 3GB of user space.)

For a single process, 3MB may not seem like much, but page tables are per-process: each process needs its own. With a hundred processes, that is 300MB of memory spent on nothing but address mapping. And the one-dimensional array approach becomes completely impractical once you consider the much larger virtual address space of 64-bit systems.

To solve this problem, page tables are organized in levels. The page table is split into a multi-level structure, so the number of page-table entries actually allocated depends only on how much of the virtual address space is really in use, rather than being proportional to the total size of the address space. With this multi-level structure, regions that are never used simply never get page-table entries allocated for them.

This is how the multi-level page table came about, and it fits well because most of the virtual address space is, in practice, unused. Multi-level page tables dramatically reduce the memory footprint of the page table itself. On 64-bit systems, Linux uses a four-level page table:

  • PGD: Page Global Directory, the top-level page table.
  • PUD: Page Upper Directory, the second-level page table.
  • PMD: Page Middle Directory, the third-level page table.
  • PTE: Page Table Entry, the last level, pointing to the physical page.

As shown in the figure below.

The application sees only virtual memory, never physical addresses. That said, there are ways to recover the physical address behind a virtual one. For example, if we malloc 1M of space and get back the virtual address 0x7ffff7eec010, how do we find the physical address it maps to?

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main() {
    char *p = NULL;
    p = malloc(1024 * 1024);
    *p = 0;
    printf("ptr: %p, pid: %d\n", p, getpid());
    getchar();
    return 0;
}

ptr: 0x7ffff7eec010, pid: 2621

There is no way to do this from user space, but we can write a kernel module that does it.

Writing a kernel module is also very simple, consisting of the following steps:

  • Define the module_init and module_exit callback hooks
  • Get the task_struct of the process from the PID passed in, and take the PGD from the mm field of the task_struct together with the virtual address va passed in
  • From the PGD, get the PUD, then the PMD, and finally the PTE, which stores the physical page frame
  • Add the lower 12-bit in-page offset to obtain the final physical address.

The simplified code is shown below.

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/pid.h>
#include <linux/mm.h>

static int pid = 0;
static unsigned long va = 0;
module_param(pid, int, 0644);
module_param(va, ulong, 0644);

int my_module_init(void) {
    unsigned long pa = 0;
    pgd_t *pgd = NULL; pud_t *pud = NULL;
    pmd_t *pmd = NULL; pte_t *pte = NULL;

    struct pid *p = find_vpid(pid);
    struct task_struct *task = pid_task(p, PIDTYPE_PID);

    pgd = pgd_offset(task->mm, va);   // Get the first-level PGD
    pud = pud_offset(pgd, va);        // Get the second-level PUD
    pmd = pmd_offset(pud, va);        // Get the third-level PMD
    pte = pte_offset_kernel(pmd, va); // Get the fourth-level PTE

    unsigned long page_addr = pte_val(*pte) & PAGE_MASK;   // Physical page frame address
    page_addr &= 0x7fffffffffffffULL;                      // Strip the high flag bits

    unsigned long page_offset = va & ~PAGE_MASK;           // Lower 12 bits: in-page offset
    pa = page_addr | page_offset;                          // Frame address plus offset

    printk("virtual address 0x%lx in RAM Page is 0x%lx\n", va, pa);

    return 0;
}

void my_module_exit(void) {
    printk("module exit!\n");
}

module_init(my_module_init); // Register the init hook
module_exit(my_module_exit); // Register the exit hook

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Arthur.Zhang");
MODULE_DESCRIPTION("A simple virtual memory inspect");

We compile the kernel module to generate a .ko file, and then load the .ko file, passing in the process ID and the virtual address.

make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

insmod my_mem.ko pid=2621 va=0x7ffff7eec010

Then run dmesg -t to see the value of the actual physical address.

[Sat Oct 10 05:11:12 2020] virtual address 0x7ffff7eec010 in RAM Page is 0x2358a4010

You can see that in this example, the virtual address 0x7ffff7eec010 corresponds to the physical address 0x2358a4010.

The complete code is at: github.com/arthur-zhan…

The memory layout of the process

An ELF binary is a static file made up of different sections. At runtime, some of these sections are mapped into the process's virtual address space, such as the code and data segments in the figure. Besides these static areas, a running process also has areas with heavy dynamic memory consumption, such as the stack, the heap, and the mmap region.

The figure below shows part of the memory layout of one of our production Java services, as output by pmap.

So how do we look at these parts? This requires a deeper understanding of how process memory is divided up in Linux.

Exploring libc memory management

Linux memory management has three layers. The top layer is user-level management, such as an application's own memory pool or MySQL's buffer pool. The middle layer is the C runtime library, which wraps the kernel interfaces to make life easier for applications. The bottom layer is the kernel itself.

Today we will focus on the middle layer, implemented by the libc library, and take a closer look at how libc manages memory.

The core ideas of libc memory management are hierarchical management, "buy wholesale, sell retail", and hiding internal details. It is also important to keep in mind that libc's heap management is designed for small allocations; large allocations are supported mainly to keep the programming interface uniform.

Let’s start by looking at two functions, malloc and free, which are defined as follows.

#include <stdlib.h>

void *malloc(size_t size);
void free(void *ptr);

Neither of these functions is a system call; they are just wrappers around the brk, mmap, and munmap system calls. So why does libc need to wrap them at all?

One of the main reasons is that system calls are expensive while allocations and frees happen very frequently, so libc's approach is to request memory in bulk and then act as a memory scalper, retailing it bit by bit to the application.

The second reason is to unify the programming interface. Having to choose between brk and mmap is unfriendly, and brk also needs locking under multi-threading; a single malloc call is much sweeter.
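A quick way to see this wholesale behavior is to compare a small allocation with a large one. The sketch below assumes glibc's default mmap threshold of 128KB: the small request is served from the brk-managed heap, while the large one is handed off to mmap (exact addresses will vary on your machine).

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *small = malloc(100);          // below the mmap threshold: comes from the brk heap
    void *large = malloc(1024 * 1024);  // above the ~128KB threshold: served by a private mmap

    // Typically the small pointer sits just above the data segment,
    // while the large one lands in the mmap region near the shared libraries.
    printf("small: %p\nlarge: %p\n", small, large);

    free(small);   // goes back into libc's bins
    free(large);   // the mmap'd block is returned to the kernel with munmap
    return 0;
}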

Linux memory allocator

There are many memory allocators for Linux. The first, dlmalloc, was written by Doug Lea. It was not friendly to multi-threaded programs, which had to compete for a global lock. Besides glibc's default ptmalloc, major companies have developed their own allocators, such as Facebook's jemalloc and Google's tcmalloc.

These allocators aim to solve two problems. The first is lock granularity under multi-threading: global, local, or lock-free. The second is the reclamation of small allocations and memory fragmentation, where jemalloc has a significant advantage.

The core concepts of ptmalloc

Next we look at Linux's default memory allocator, ptmalloc. I summarize it in four core concepts: Arena, Heap, Chunk, and Bins.

Arena

The Arena is, as the name suggests, the main battleground of memory allocation.

The idea is to let each thread monopolize an arena as much as possible. A thread requests one or more heaps, and freed memory goes into the recycle bins; the arena manages those heaps and recycle bins.

What does an arena's data structure look like? It is a struct, which can be represented by the figure below.

Arenas form a singly linked circular list, use a mutex to handle multi-threaded contention, and keep the bins of freed memory inside the structure.

If I have thousands of threads, will thousands of arenas be created? Obviously not. All thread-related bottlenecks eventually come back to the number of CPU cores, and the number of arenas has an upper limit too (by default on the order of eight times the number of cores on 64-bit systems).

We can write some code to print the arena information. The idea: for a given program, the address of main_arena is a fixed address inside the glibc library, which we can print in the GDB debugger. You can also use the ptype command to inspect the structure at that address, as shown in the figure below.

With that in hand, we can write a do-while loop to walk the circular list: cast the address of main_arena to a malloc_state pointer and iterate until we get back to the head of the list.

struct malloc_state {
    int mutex;
    int flags;
    void *fastbinsY[NFASTBINS];
    struct malloc_chunk *top;
    struct malloc_chunk *last_remainder;
    struct malloc_chunk *bins[NBINS * 2 - 2];
    unsigned int binmap[4];
    struct malloc_state *next;
    struct malloc_state *next_free;
    size_t system_mem;
    size_t max_system_mem;
};

void print_arenas(struct malloc_state *main_arena) {
    struct malloc_state *ar_ptr = main_arena;
    int i = 0;
    do {
        printf("arena[%02d] %p\n", i++, ar_ptr);
        ar_ptr = ar_ptr->next;
    } while (ar_ptr != main_arena);
}

#define MAIN_ARENA_ADDR 0x7ffff7bb8760

int main() {
    ...
    print_arenas((void *)MAIN_ARENA_ADDR);
    return 0;
}

The output is as follows.

So why make a distinction between the main arena and the non-main arenas?

The main arena has a privilege: it uses the heap area right next to the data segment, and it allocates memory by adjusting the brk pointer.

In a sense, that heap area is simply an extension of the data segment.

What about the non-main arenas? They are more like kings who have struck out on their own in the provinces: when one wants memory, it uses mmap to wholesale a large block (64M) as a sub-heap, and then retails it slowly to the upper application.

When one 64M block is used up, a new one is mapped. Sub-heaps are also linked together in a list, and an arena can own several of them. We'll come back to this in the next section.

Heap

Next we look at the second core concept of ptmalloc2: the heap, which represents a large contiguous area of memory.

The main arena's heap is nothing special, so here we focus on the sub-heaps of the non-main arenas (also called simulated heaps), which, as mentioned earlier, wholesale a large block of memory and retail it.

So how does this "wholesale then retail" work? The implementation is simple: first reserve a 64M region that is not readable, writable, or executable (PROT_NONE); when memory is actually needed, use mprotect to change the permissions of part of that region to read-write (R+W), and that part can then be handed out to the upper application.
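A minimal sketch of this reserve-then-commit trick is shown below. This is not ptmalloc's actual code, just the same mmap/mprotect pattern in isolation; the 64M and 1M sizes are only illustrative.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SUB_HEAP_SIZE (64UL * 1024 * 1024)

int main(void) {
    // Reserve 64M of address space with no permissions: nothing is committed yet.
    char *heap = mmap(NULL, SUB_HEAP_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (heap == MAP_FAILED) return 1;

    // "Retail" the first 1M: flip it to read-write before handing it to the application.
    size_t committed = 1UL * 1024 * 1024;
    mprotect(heap, committed, PROT_READ | PROT_WRITE);
    memset(heap, 0, committed);   // safe to touch now; the rest of the 64M stays PROT_NONE

    printf("sub-heap at %p, first %zu bytes committed\n", heap, committed);
    return 0;
}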

Take the memory layout of our Java process as an example.

The two memory regions in the middle belong to one sub-heap, and together they add up to 64M. A 1.3M region has been made readable and writable with mprotect, and the remaining roughly 63M is still unreadable, unwritable, and non-executable.

What's the use of knowing this? A lot: if you google Java off-heap memory problems, chances are you will run into Linux's notorious 64M memory problem. With this background, you should have a much better idea of what that 64M problem really is.

As with the previous arenas, we can also iterate through all heap lists for all arenas in code, as shown below.

struct heap_info {
    struct malloc_state *ar_ptr;
    struct heap_info *prev;
    size_t size;
    size_t mprotect_size;
    char pad[0];
};

void dump_non_main_subheaps(struct malloc_state *main_arena) {
    struct malloc_state *ar_ptr = main_arena->next;
    int i = 0;
    while (ar_ptr != main_arena) {
        printf("arena[%d]\n", ++i);
        struct heap_info *heap = heap_for_ptr(ar_ptr->top);
        do {
            printf("arena:%p, heap: %p, size: %d\n", heap->ar_ptr, heap, heap->size);
            heap = heap->prev;
        } while (heap != NULL);
        ar_ptr = ar_ptr->next;
    }
}

#define MAIN_ARENA_ADDR 0x7ffff7bb8760

dump_non_main_subheaps((void *)MAIN_ARENA_ADDR);

Chunk

Next, we look at the basic unit of allocation: the chunk. A chunk is the basic unit of memory allocation in glibc. Let's start with a simple example.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    void *p;

    p = malloc(1024);
    printf("%p\n", p);

    p = malloc(1024);
    printf("%p\n", p);

    p = malloc(1024);
    printf("%p\n", p);

    getchar();
    return (EXIT_SUCCESS);
}

The logic is to call malloc three times in a row, allocate 1K of memory each time, and then observe its memory address.

./malloc_test

0x602010
0x602420
0x602830

We can see that consecutive addresses differ by 0x410. 1024 is 0x400, so what are the extra 0x10 bytes? Let's set that aside for a moment.

Looking back and forth between malloc and free, we can't help asking: how does free know how much memory to release?

#include <stdlib.h>

void *malloc(size_t size);
void free(void *ptr);

Hong Kong writer Zhang Xiaoxian once said, "Everything has a price, and the price of happiness is pain." To store 1K of data, you actually need a bit more to record the metadata of that block. This extra piece is called the chunk header and is 16 bytes long, which is exactly the extra 0x10 bytes we saw earlier.

Adding a header in front of the actual data is a common trick. For example, new Integer(1024) in Java stores far more than the 4 bytes of actual data: it carries a large object header that records the object's hashcode, how many GCs it has survived, and whether it is being used as a synchronized lock.

It is not unreasonable to say that Java is bloated.

Let’s move on to see what’s stored in the 16-byte header, which looks like this.

It is divided into two parts: the first 8 bytes record the size of the previous chunk, and the next 8 bytes record the size of the current chunk. Because chunks are 16-byte aligned, the lower 4 bits of the size are unused, and three of them serve as flag bits: A, M, and P. A indicates whether the chunk belongs to a non-main arena, M indicates whether it is a large chunk allocated by mmap, and P indicates whether the previous chunk is in use.
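As a reference, here is a simplified sketch of the chunk header, adapted from glibc's malloc/malloc.c (field names shortened, and the fd_nextsize/bk_nextsize pointers used by large bins are omitted):

struct malloc_chunk {
    size_t prev_size;   // size of the previous chunk, only meaningful if that chunk is free
    size_t size;        // size of this chunk; the low bits carry the A|M|P flags

    // The two pointers below live in the user data area and are only
    // used while the chunk sits in a bin, i.e. after it has been freed:
    struct malloc_chunk *fd;   // next free chunk in the same bin
    struct malloc_chunk *bk;   // previous free chunk in the same bin
};

#define PREV_INUSE     0x1  // P: the previous chunk is in use
#define IS_MMAPPED     0x2  // M: this chunk was allocated directly by mmap
#define NON_MAIN_ARENA 0x4  // A: this chunk belongs to a non-main arena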

Using the earlier example, we can inspect this part of memory with GDB.

You can see that the 8-byte size field holds 0x0411. Where does this value come from? It is the requested size plus 8 bytes of overhead, rounded up to a 16-byte boundary, plus the low flag bits 0b001:

0x0400 + 0x10 + 0x01 = 0x0411

Since the prev_size field of the next chunk is meaningless while the current chunk is in use, those 8 bytes can be borrowed by the current chunk as data space. Don't be surprised; memory really is squeezed that hard. Next, let's look at this reuse of prev_size within a chunk. The test code is as follows.

#include <stdlib.h>
#include <string.h>
void main() {
    char *p1, *p2;
    p1 = (char *)malloc(sizeof(char) * 18); // 0x602010
    p2 = (char *)malloc(sizeof(char) * 1);  // 0x602030
    memcpy(p1, "111111111111111111", 18);
}

Compile the source file, then step through it in GDB and check the addresses of p1 and p2.


(gdb) p/x p1
$2 = 0x602010

(gdb) p/x p2
$3 = 0x602030

Then dump the memory around p1 and p2.

(gdb) x/64bx p1-0x10
0x602000:       0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x602008:       0x21    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x602010:       0x31    0x31    0x31    0x31    0x31    0x31    0x31    0x31
0x602018:       0x31    0x31    0x31    0x31    0x31    0x31    0x31    0x31
0x602020:       0x31    0x31    0x00    0x00    0x00    0x00    0x00    0x00
0x602028:       0x21    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x602030:       0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x602038:       0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00


The layout is shown below.

Space is limited, so I only show the structure of an allocated (malloc'd) chunk here; the free chunk, top chunk, and last remainder chunk are not expanded on, and you can refer to other materials for those.

Bins

We move on to the final concept: bins, the recycle bins for small chunks of freed memory.

The recycle bins fall into two categories: ordinary bins and fastbins.

  • Fastbins use singly linked lists. The free chunks in each list all have the same fixed size, and insertion and removal both happen at the same end of the list.
  • Ordinary bins are divided by the size of the reclaimed memory into small, large, and unsorted bins, all stored as doubly linked lists. The main difference between them is the size range of the chunks they hold.

Let’s look at the details of these two Bins.

Ordinary bins are stored as doubly linked lists and defined as an array of 254 elements. Every two array elements make up one bin, whose fd and bk pointers form a doubly linked circular list; the structure looks a bit like the nine-ring puzzle toy.

So there are 254 / 2 = 127 ordinary bins in total: 1 unsorted bin, 62 small bins, 63 large bins, and one that is currently unused, as shown in the figure below.

smallbin

Small bins hold chunks of <= 1024B. Chunks in the same small bin list all have the same size, namely index * 16, as shown in the following figure.

largebin

Chunks in the same large bin list have different sizes:

  • They are divided into 6 groups
  • The groups contain 33, 15, 8, 4, 2, and 1 bins respectively, and the chunk-size spacing within each group is 64B, 512B, 4096B, 32768B, 262144B, and so on

The structure is shown in the figure below.

unsorted bin

There is only one unsorted bin, a doubly linked list with the following characteristics.

  • Free chunks are not sorted
  • The unsorted bin is used to collect memory chunks larger than 128B

Its structure is shown below.

Below is an overview of all the common bin’s.

FastBin

Having covered the ordinary bins, let's take a closer look at fastbins, which exist specifically to speed up small memory allocations.

It has the following characteristics.

  • Allocations smaller than 128B are looked up in the fast bins first
  • They are singly linked lists, and all chunks in one list have the same size. There are seven lists, holding chunks of 32B, 48B, 64B, 80B, 96B, 112B, and 128B respectively
  • Because the lists are singly linked, the bk pointer is not used, and the fd pointer at the end of a list points to 0
  • The P flag is always 1, so fastbin chunks are generally not merged with their neighbors
  • LIFO: chunks are inserted and removed at the same end of the list

Fast bins can be seen as a small cache sitting in front of the small bins.
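A tiny experiment makes this cache visible. Depending on your glibc version the freed chunk lands in the per-thread tcache or in a fastbin, but either way the second allocation of the same size typically gets the chunk straight back:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *a = malloc(32);
    printf("first : %p\n", a);
    free(a);                    // the chunk is parked in a small free list, not returned to the kernel

    void *b = malloc(32);       // same size class: the freed chunk is reused immediately
    printf("second: %p\n", b);  // usually prints the same address as the first line
    return 0;
}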

Memory application and release

With all this in place, we can answer the question from the beginning of the talk: where does memory come from? Large allocations are the simple case: a block is requested directly from the kernel with the mmap system call, and when freed it is returned directly to the operating system.

Small allocations are much more involved. The principle is to look in the chunk recycle bins first; if a suitable chunk is found there, it is returned directly without asking the kernel. How does that search work?

The fast bins are checked first; if a chunk of the right size is found, it is returned immediately. Otherwise the small bins are tried. If the small bins don't have one either, a consolidation is triggered and the unsorted bin is searched, then the large bins. If nothing fits, a piece is cut off the top chunk, and if even the top chunk is too small, the heap is grown or a new sub-heap is requested, as shown in the following figure.

Now let's answer the last question: where does freed memory go? The strategy depends on the size of the block (a short experiment after the list below makes the behavior visible):

  • Tiny chunks that fit the fastbins go straight onto the fastbin singly linked list and are released quickly. The voiceover: "such a tiny bit of space, is it really worth my time?"
  • Large blocks go straight back to the kernel and never enter the bin logic at all. The voiceover: special treatment for big customers, who are, after all, a small minority.
  • Most chunks fall in between and are first released into the unsorted bin. Depending on the situation, free blocks are then merged and migrated, and if a chunk borders the top chunk, the top chunk is updated. Such is life.
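Here is the small experiment mentioned above. It is only a sketch to make the behavior observable: glibc's malloc_stats() prints per-arena statistics to stderr, and malloc_trim() asks the allocator to hand free memory at the top of the heap back to the kernel. The exact numbers you see depend on your glibc version and its trim/mmap thresholds.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   // malloc_stats, malloc_trim (glibc-specific)

int main(void) {
    char *blocks[1024];

    for (int i = 0; i < 1024; i++) {
        blocks[i] = malloc(32 * 1024);  // 32K each: below the mmap threshold, served by the arena
        blocks[i][0] = 1;               // touch the memory so physical pages are really committed
    }
    for (int i = 0; i < 1024; i++)
        free(blocks[i]);                // chunks go back into the bins, not necessarily to the kernel

    malloc_stats();   // "in use" vs "free" bytes still held by the arenas
    malloc_trim(0);   // release free memory at the top of the heap back to the kernel
    malloc_stats();   // compare with the previous output
    return 0;
}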

Stack memory

Most of what we have covered so far is heap memory. Another very important area is stack memory. The default stack size on Linux is 8M, followed by a 4K guard area that cannot be read, written, or executed.

This diagram shows the stack memory layout of a typical Linux native thread, with 8M stack space and 4k guard area.

For JVM threads, the default stack size is 1M, followed by a 4K red zone and an 8K yellow zone that give the JVM finer-grained control over stack overflow. I wrote an article about threads and stacks earlier, so I won't go into it here.

Part III: Memory problems related to development

Let’s move on to our final section, which deals with memory issues.

Xmx and memory consumption

The first question I get asked a lot: why does my Java application consume far more memory than Xmx? It is also one of the most frequently asked questions on Stack Overflow.

To be clear, a process has a lot of overhead besides heap consumption, as shown below.

  • Heap
  • Code Cache
  • GC overhead
  • Metaspace
  • Thread Stack
  • Direct Buffers
  • Mapped files
  • C/C++ Native memory consumption
  • The cost of malloc itself
  • …

Memory overruns are no joke. Based on years of practice, setting Xmx to around 65% of the container's memory limit is a reasonable starting point.

RES usage

The second question: does a high RES value in the top command represent the application's real memory consumption?

Not necessarily. Let's take the simplest Java program and run it with -Xms1G -Xmx1G.

java -Xms1G -Xmx1G MyTest

Its memory footprint is as follows.

We changed the startup command slightly and added AlwaysPreTouch, as shown below.

java -XX:+AlwaysPreTouch -Xms1G -Xmx1G MyTest

The RES occupancy at this time is shown below.

In this case, the 1GB is not actually being used by the business logic; the JVM has simply written to every page of the heap up front, so that when the memory is really needed later, no page faults are required to bring in physical memory.
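The same reserve-versus-touch effect can be reproduced in a few lines of C. This is only a sketch: run it, watch VIRT and RES in top at each pause, and you will see RES jump only after the memory is actually touched.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    size_t size = 1UL << 30;    // ask for 1G of address space
    char *p = malloc(size);     // VIRT grows by ~1G, RES barely moves: no pages touched yet
    if (p == NULL) return 1;

    printf("allocated but untouched, check RES now (pid %d)\n", getpid());
    getchar();

    memset(p, 0, size);         // touching every page faults them in; RES climbs to ~1G
    printf("touched, check RES again\n");
    getchar();

    free(p);
    return 0;
}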

It is not the case that the smaller the memory footprint, the better; GC frequency and GC pause times also have to be taken into account.

Replace the default memory allocator

The default Linux memory allocator does not perform particularly well in terms of speed or fragmentation, so you can try replacing it with jemalloc or tcmalloc by setting the LD_PRELOAD environment variable.

LD_PRELOAD=/usr/local/lib/libjemalloc.so

In actual use, one of our services went from 7GB of memory down to 3GB after the switch, a significant improvement.

Native memory analysis

Analyzing the Java heap is easy: dump it with jmap, then use JProfiler, MAT, or the Perfma platform to analyze it quickly. Excessive native memory consumption is much more troublesome. Here the powerful profiling capabilities of jemalloc and tcmalloc come in handy. Taking jemalloc as an example, the memory allocation call graph can be rendered as an SVG.

export MALLOC_CONF=prof:true,lg_prof_sample:1,lg_prof_interval:30,prof_prefix:jeprof.out
jeprof --svg /path/to/program jeprof.out.* > out.svg

The resulting SVG diagram is shown below.

Summary

This talk only scratches the surface of memory problems; many details were not expanded on here. If you have questions, feel free to reach out.

After the talk, someone told me the room had gone a little cold. That is as good an ending as any; I had thought I was only halfway through.

The full shared PPT can be downloaded here: github.com/arthur-zhan…