Netty version: 4.1.72.Final

Netty, one of the hottest and most popular network programming frameworks, has to move massive amounts of bytes. To avoid allocating and freeing buffers on every operation, Netty pools this memory: a block is taken from the pool, used, and then returned to the pool for reuse. Once memory is pooled, memory management becomes essential, so Netty implements a memory allocation and management mechanism modeled on jemalloc.

GitHub address: github.com/jemalloc/je…

The 4.1.72.Final version is implemented based on jemalloc4.

Netty draws on jemalloc to solve two problems:

  • Memory allocation and reclamation across multiple threads
  • Memory fragmentation (which builds up during repeated allocation and reclamation; jemalloc4 optimizes this further)

1. Memory specifications in Netty

As shown in the figure above, there are three types of memory specifications in Netty (the pooled ones are defined in the SizeClass enum):

    enum SizeClass {
        Small,
        Normal
    }

The Tiny size class has been removed.

  • Small: 0–28K (inclusive)
  • Normal: larger than 28K, up to 16M (inclusive)
  • Huge: larger than 16M (not pooled)
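
As a rough illustration of how a requested capacity falls into one of these buckets, here is a minimal sketch assuming the default pooled allocator and the default page/chunk settings (the class name SizeClassDemo and the concrete sizes are just examples):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class SizeClassDemo {
    public static void main(String[] args) {
        PooledByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;

        ByteBuf small  = alloc.directBuffer(100);               // <= 28K            -> Small (served by a PoolSubpage)
        ByteBuf normal = alloc.directBuffer(64 * 1024);         // 28K < size <= 16M -> Normal (served by a Run of Pages)
        ByteBuf huge   = alloc.directBuffer(17 * 1024 * 1024);  // > 16M             -> Huge (not pooled)

        small.release();
        normal.release();
        huge.release();
    }
}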

A side note on alignment: the HotSpot virtual machine's automatic memory management requires that an object's starting address be an integer multiple of 8 bytes; in other words, an object's size must be a multiple of 8 bytes. The object header is already a multiple of 8 bytes (1 or 2 times 8), so when the instance data part is not aligned, padding is added to fill the gap.

Netty’s memory management implementation borrows from Jemalloc so many of the concepts are the same as those in Jemalloc.

  • Chunk: the minimum unit of memory that Netty requests from the OS (16 MB by default); a Chunk is a collection of Runs
  • Run: a contiguous block of memory whose size is a multiple of a Page
  • Page: the minimum allocation unit of a Chunk. The default size is 8K, so one Chunk has 2048 (2K) Pages by default
  • Subpage: allocates memory within a Page to reduce waste. If the amount of memory needed is smaller than a Page (8K), for example only 100B, allocating a whole Page would be wasteful. A Subpage's element size is a multiple of 16B (minimum 16B); Subpages have no fixed size and depend on the buffers allocated by the user
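
Putting the default numbers together, a quick back-of-the-envelope check (assuming the defaults mentioned above: 8K pages and a maximum order of 11):

// Assumed defaults: pageSize = 8K, maxOrder = 11
int pageSize  = 8192;                     // 8K Page
int maxOrder  = 11;
int chunkSize = pageSize << maxOrder;     // 8K << 11 = 16M per Chunk
int pagesPerChunk = chunkSize / pageSize; // 2048 Pages (2K) per Chunk

// A Subpage carved out of a single 8K Page with the minimum element size of 16B
int elemSize = 16;
int elemsPerPage = pageSize / elemSize;   // 512 elements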

2. Data structure of the Netty memory pool

As you can see from the figure, Netty abstracts several components for its memory model: PoolArena, PoolChunk, PoolChunkList, PoolSubpage, PoolThreadCache, and MemoryRegionCache. They are analyzed below in terms of their data structures and implementations.

Tips: Many of the concepts are similar to those in jemalloc. The version of the code studied here has removed the Tiny size class that still appears in many online sources; only Small, Normal, and Huge remain, because memory allocation in current Netty is based on jemalloc4.

2.1 PoolSubpage Data structure Parsing

final class PoolSubpage<T> implements PoolSubpageMetric {

    final PoolChunk<T> chunk; // Chunk to which it belongs
    private final int pageShifts;  // log2 of the page size (13 for the default 8K page)
    private final int runOffset; // Offset (in pages) of this PoolSubpage's Run within the PoolChunk
    private final int runSize; // The size of the Run
    private final long[] bitmap; // The state of each small memory block

    PoolSubpage<T> prev; // The previous PoolSubpage
    PoolSubpage<T> next; // The next PoolSubpage

    boolean doNotDestroy;
    int elemSize; // The size of each small memory block (minimum 16B)
    private int maxNumElems; // Maximum number of elements: runSize / elemSize (e.g. 8K / 16B = 512)
    private int bitmapLength;
    private int nextAvail;
    private int numAvail; // The number of memory blocks that can be allocated
}

There are two points to note here:

  • How does bitmap record memory state
  • How does PoolSubpage relate to PoolArena

How bitmap records memory state:

The size of the bitmap array is determined by a piece of code in the PoolSubpage constructor:

//LOG2_QUANTUM=4
bitmap = new long[runSize >>> 6 + LOG2_QUANTUM];
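
To see why this is enough, a quick worked example (note that + binds tighter than >>>, so the expression is runSize >>> (6 + LOG2_QUANTUM), and LOG2_QUANTUM = 4 because the smallest element is 16B = 2^4):

// For a single-page Run of 8K and the minimum element size of 16B:
int runSize = 8192;
int LOG2_QUANTUM = 4;                           // 2^4 = 16B quantum
int bitmapLen = runSize >>> (6 + LOG2_QUANTUM); // 8192 >>> 10 = 8 longs
int bitsAvailable = bitmapLen * 64;             // 512 bits
int maxElems = runSize / 16;                    // 512 elements of 16B -- one bit per element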

The size of the bitmap is therefore determined by runSize. The Run is allocated (and its size decided) in PoolChunk#allocateSubpage:

    private long allocateSubpage(int sizeIdx) {
        // Obtain the head of the PoolSubPage pool that is owned by the PoolArena and synchronize on it.
        // This is need as we may add it back and so alter the linked-list structure.
        PoolSubpage<T> head = arena.findSubpagePoolHead(sizeIdx);
        synchronized (head) {
            //allocate a new run
            int runSize = calculateRunSize(sizeIdx);
            //runSize must be multiples of pageSize
            long runHandle = allocateRun(runSize);
            if (runHandle < 0) {
                return -1;
            }

            int runOffset = runOffset(runHandle);
            assert subpages[runOffset] == null;
            int elemSize = arena.sizeIdx2size(sizeIdx);

            PoolSubpage<T> subpage = new PoolSubpage<T>(head, this, pageShifts, runOffset,
                               runSize(pageShifts, runHandle), elemSize);

            subpages[runOffset] = subpage;
            return subpage.allocate();
        }
    }

Tips: Run must be a multiple of PageSize.

PoolSubpage uses the bitmap to record whether each small memory block is in use; each bit is either 0 (free) or 1 (in use).
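
A minimal sketch of this bit bookkeeping (not Netty's exact code, just the idea used by PoolSubpage when allocating and freeing an element):

// Mark element `bitmapIdx` as used: pick the long in the array, then the bit inside it.
long[] bitmap = new long[8];   // enough for 512 elements
int bitmapIdx = 100;           // hypothetical element index inside the Subpage
int q = bitmapIdx >>> 6;       // index into the long[] (bitmapIdx / 64)
int r = bitmapIdx & 63;        // bit position inside that long (bitmapIdx % 64)

bitmap[q] |= 1L << r;                        // allocate: set the bit to 1
boolean used = (bitmap[q] >>> r & 1) != 0;   // query: is the element in use?
bitmap[q] ^= 1L << r;                        // free: flip the bit back to 0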

How does PoolSubpage relate to PoolArena

As shown in the previous figure, the PoolArena class has a smallSubpagePools attribute.

abstract class PoolArena<T> extends SizeClasses implements PoolArenaMetric {
    private final PoolSubpage<T>[] smallSubpagePools;
    
     protected PoolArena(PooledByteBufAllocator parent, int pageSize,
          int pageShifts, int chunkSize, int cacheAlignment) {
         
         // Omit some code
        numSmallSubpagePools = nSubpages; // nSubpages = 39
        smallSubpagePools = newSubpagePoolArray(numSmallSubpagePools);
        for (int i = 0; i < smallSubpagePools.length; i ++) {
            smallSubpagePools[i] = newSubpagePoolHead();
        }
         
        // Omit some code
     }
    
    private PoolSubpage<T> newSubpagePoolHead() {
        PoolSubpage<T> head = new PoolSubpage<T>();
        head.prev = head;
        head.next = head;
        return head;
    }
}

There are 39 Small size classes, so smallSubpagePools holds 39 linked-list heads, one per size class.
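
A hedged sketch of how a Small request finds its list head (size2SizeIdx lives in SizeClasses; the exact rounding rules are in the Netty source, and the 100B example below is only illustrative):

// Conceptually (simplified pseudocode, not Netty's exact code):
// int sizeIdx = arena.size2SizeIdx(reqCapacity);     // e.g. a 100B request maps to the 112B size class
// PoolSubpage<T> head = smallSubpagePools[sizeIdx];  // doubly linked list head for that size class
// PoolSubpage<T> s = head.next;                      // a PoolSubpage with free elements, or head itself if none exists yet
// if (s != head) {
//     long handle = s.allocate();                    // allocate one element from the existing Subpage
// } else {
//     // fall through to PoolChunk#allocateSubpage, which creates a new Run + PoolSubpage
// }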

2.2 PoolChunk Data structure Parsing

final class PoolChunk<T> implements PoolChunkMetric {
    private static final int SIZE_BIT_LENGTH = 15;
    private static final int INUSED_BIT_LENGTH = 1;
    private static final int SUBPAGE_BIT_LENGTH = 1;
    private static final int BITMAP_IDX_BIT_LENGTH = 32;

    static final int IS_SUBPAGE_SHIFT = BITMAP_IDX_BIT_LENGTH;
    static final int IS_USED_SHIFT = SUBPAGE_BIT_LENGTH + IS_SUBPAGE_SHIFT;
    static final int SIZE_SHIFT = INUSED_BIT_LENGTH + IS_USED_SHIFT;
    static final int RUN_OFFSET_SHIFT = SIZE_BIT_LENGTH + SIZE_SHIFT;

     
    final PoolArena<T> arena;   // The owning PoolArena
    final Object base;
    final T memory; // The stored data
    final boolean unpooled;  // Whether this chunk is unpooled (used for Huge allocations)
    private final LongLongHashMap runsAvailMap; // Manage all runs of PoolChunk (used or not)
    private final LongPriorityQueue[] runsAvail; // Priority queues, each queue manages runs of the same size
    private final PoolSubpage<T>[] subpages; // PoolSubpage array, indexed by runOffset
    private final int pageSize;
    private final int pageShifts;
    private final int chunkSize;
    private final Deque<ByteBuffer> cachedNioBuffers;

    int freeBytes;  
    int pinnedBytes;

    PoolChunkList<T> parent;
    PoolChunk<T> prev; // The previous chunk in the list
    PoolChunk<T> next; // The next chunk in the list
	
    // Omit some code
}

The structure of Chunk memory is shown as follows:

  • A Run's size is a multiple of the Page size: RunSize = N * PageSize (N >= 1)
  • Subpage sizes range from 16B to 28K, so a Subpage may span multiple Pages (this is one of the ways jemalloc4 further reduces memory fragmentation)
  • A Chunk also has unused memory segments, and these can still be allocated

So how are the state and size of all these memory segments managed? A handle (a long) is defined in PoolChunk.

As shown in the figure above, the handle packs the following fields into a single long, from the high bits down to the low bits:

  • O (runOffset): the Run's page offset within the Chunk, 15 bits
  • S (size): the size of the Run (stored as a number of pages), 15 bits
  • U (isUsed): whether the Run is in use, 1 bit
  • E (isSubpage): whether the Run is used as a Subpage, 1 bit
  • B (bitmapIdx): the bitmap index of the element within a Subpage (0 if the handle is not a Subpage), 32 bits
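
Using the shift constants from the PoolChunk snippet above, a handle can be packed and unpacked roughly like this (a sketch of the bit layout, not a copy of Netty's own helpers):

// Derived from the constants above:
// bitmapIdx fills bits 0..31, isSubpage is bit 32, isUsed is bit 33,
// size (in pages) fills bits 34..48, runOffset fills bits 49..63.
static long toHandle(int runOffset, int pages, boolean used, boolean isSubpage, int bitmapIdx) {
    return (long) runOffset << RUN_OFFSET_SHIFT
            | (long) pages << SIZE_SHIFT
            | (used ? 1L : 0L) << IS_USED_SHIFT
            | (isSubpage ? 1L : 0L) << IS_SUBPAGE_SHIFT
            | bitmapIdx & 0xFFFFFFFFL;
}

static int runOffset(long handle)     { return (int) (handle >>> RUN_OFFSET_SHIFT); }
static int runPages(long handle)      { return (int) (handle >>> SIZE_SHIFT & 0x7FFF); }
static boolean isUsed(long handle)    { return (handle >>> IS_USED_SHIFT & 1) == 1; }
static boolean isSubpage(long handle) { return (handle >>> IS_SUBPAGE_SHIFT & 1) == 1; }
static int bitmapIdx(long handle)     { return (int) handle; }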

Then take a look at a few important properties of PoolChunk:

  • runsAvailMap

    A map that manages the Run state (used, unused)

    key: runOffset

    value: handle

  • runsAvail

    An array of priority queues; each queue manages free Runs of the same size class (number of pages). Runs are prioritized by their offset, so the Run with the smallest offset is always allocated first (a property of the priority queue)

PoolChunk key algorithm parsing:

  • Initialization

    At the beginning, the Chunk stores a single initial Run that covers the entire block of memory. The initial Run is:

    runOffset = 0
    size = chunkSize
    isUsed = false
    isSubpage = false
    bitmapIdx = 0
  • PoolChunk#allocateRun(int runSize) algorithm

    private long allocateRun(int runSize) {
        int pages = runSize >> pageShifts;
        int pageIdx = arena.pages2pageIdx(pages);
    
        synchronized (runsAvail) {
            //find first queue which has at least one big enough run
            int queueIdx = runFirstBestFit(pageIdx);
            if (queueIdx == -1) {
                return -1;
            }
    
            //get run with min offset in this queue
            LongPriorityQueue queue = runsAvail[queueIdx];
            long handle = queue.poll();
    
            assert handle != LongPriorityQueue.NO_VALUE && !isUsed(handle) : "invalid handle: " + handle;
    
            removeAvailRun(queue, handle);
    
            if (handle != -1) {
                handle = splitLargeRun(handle, pages);
            }
    
            int pinnedSize = runSize(pageShifts, handle);
            freeBytes -= pinnedSize;
            pinnedBytes += pinnedSize;
            return handle;
        }
    }
    1. Find the first priority queue in the runsAvail array that has a Run big enough for runSize
    2. If the chosen Run has more Pages than requested, it is split and the remainder is saved back for later use (a simplified sketch of this split, and of the merge done on free, follows after this list)
  • PoolChunk#allocateSubpage(int sizeIdx) algorithm

    private long allocateSubpage(int sizeIdx) {
        // Obtain the head of the PoolSubPage pool that is owned by the PoolArena and synchronize on it.
        // This is need as we may add it back and so alter the linked-list structure.
        PoolSubpage<T> head = arena.findSubpagePoolHead(sizeIdx);
        synchronized (head) {
            //allocate a new run
            int runSize = calculateRunSize(sizeIdx);
            //runSize must be multiples of pageSize
            long runHandle = allocateRun(runSize);
            if (runHandle < 0) {
                return -1;
            }
    
            int runOffset = runOffset(runHandle);
            assert subpages[runOffset] == null;
            int elemSize = arena.sizeIdx2size(sizeIdx);
    
            PoolSubpage<T> subpage = new PoolSubpage<T>(head, this, pageShifts, runOffset,
                               runSize(pageShifts, runHandle), elemSize);
    
            subpages[runOffset] = subpage;
            return subpage.allocate();
        }
    }
    1. Find a PoolSubpage that is not full for the given sizeIdx and use it if one exists; otherwise allocate a new Run and create a new PoolSubpage on it. Note: the Subpage object is added to the PoolArena's subpage pool when it is initialized
    2. Call subpage.allocate() to allocate
  • PoolChunk#free(long handle, int normCapacity, ByteBuffer nioBuffer) algorithm

        void free(long handle, int normCapacity, ByteBuffer nioBuffer) {
            int runSize = runSize(pageShifts, handle);
            pinnedBytes -= runSize;
            if (isSubpage(handle)) {
                int sizeIdx = arena.size2SizeIdx(normCapacity);
                PoolSubpage<T> head = arena.findSubpagePoolHead(sizeIdx);
    
                int sIdx = runOffset(handle);
                PoolSubpage<T> subpage = subpages[sIdx];
                assert subpage != null && subpage.doNotDestroy;
    
                // Obtain the head of the PoolSubPage pool that is owned by the PoolArena and synchronize on it.
                // This is need as we may add it back and so alter the linked-list structure.
                synchronized (head) {
                    if (subpage.free(head, bitmapIdx(handle))) {
                        //the subpage is still used, do not free it
                        return;
                    }
                    assert !subpage.doNotDestroy;
                    // Null out slot in the array as it was freed and we should not use it anymore.
                    subpages[sIdx] = null;
                }
            }

            //start free run
            synchronized (runsAvail) {
                // collapse continuous runs, successfully collapsed runs
                // will be removed from runsAvail and runsAvailMap
                long finalRun = collapseRuns(handle);
    
                //set run as not used
                finalRun &= ~(1L << IS_USED_SHIFT);
                //if it is a subpage, set it to run
                finalRun &= ~(1L << IS_SUBPAGE_SHIFT);
    
                insertAvailRun(runOffset(finalRun), runPages(finalRun), finalRun);
                freeBytes += runSize;
            }
    
            if (nioBuffer != null && cachedNioBuffers != null &&
                    cachedNioBuffers.size() < PooledByteBufAllocator.DEFAULT_MAX_CACHED_BYTEBUFFERS_PER_CHUNK) {
                cachedNioBuffers.offer(nioBuffer);
            }
        }
    1. If the handle refers to a Subpage, the element is returned to that Subpage
    2. If the Subpage is no longer in use (or the handle is a plain Run), release the Run
    3. Merge the freed Run with adjacent free Runs
    4. Save the merged Run back into runsAvail
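
The splitting mentioned in step 2 of allocateRun and the merging done by free can be pictured with a small sketch (hypothetical page counts only; the real code also rewrites handles and updates runsAvail and runsAvailMap):

// Assume a request for 3 pages is served from a free Run of 16 pages starting at page offset 64.
int requestedPages = 3;
int runOffset      = 64;
int runPages       = 16;

// allocateRun / splitLargeRun: hand out the first `requestedPages` pages and put the
// remainder back into runsAvail as a smaller free Run.
int usedRunOffset      = runOffset;                  // 64, marked isUsed = 1
int remainderRunOffset = runOffset + requestedPages; // 67
int remainderRunPages  = runPages - requestedPages;  // 13, stays isUsed = 0

// free / collapseRuns: when the 3-page Run is released later and the 13-page Run at offset 67
// is still free, the two collapse back into one 16-page Run at offset 64 before being
// re-inserted into runsAvail -- this is how fragmentation is kept in check.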

2.3 PoolArena Data Structure Analysis

abstract class PoolArena<T> extends SizeClasses implements PoolArenaMetric {
    static final boolean HAS_UNSAFE = PlatformDependent.hasUnsafe();

    enum SizeClass {
        Small,
        Normal
    }

    final PooledByteBufAllocator parent; // Owning allocator

    final int numSmallSubpagePools;  // 39
    final int directMemoryCacheAlignment;
    private final PoolSubpage<T>[] smallSubpagePools;

    private final PoolChunkList<T> q050;
    private final PoolChunkList<T> q025;
    private final PoolChunkList<T> q000;
    private final PoolChunkList<T> qInit;
    private final PoolChunkList<T> q075;
    private final PoolChunkList<T> q100;

    private final List<PoolChunkListMetric> chunkListMetrics;

    // Metrics for allocations and deallocations
    private long allocationsNormal;
    // We need to use the LongCounter here as this is not guarded via synchronized block.
    private final LongCounter allocationsSmall = PlatformDependent.newLongCounter();
    private final LongCounter allocationsHuge = PlatformDependent.newLongCounter();
    private final LongCounter activeBytesHuge = PlatformDependent.newLongCounter();

    private long deallocationsSmall;
    private long deallocationsNormal;

    // We need to use the LongCounter here as this is not guarded via synchronized block.
    private final LongCounter deallocationsHuge = PlatformDependent.newLongCounter();

    // Number of thread caches backed by this arena.
    final AtomicInteger numThreadCaches = new AtomicInteger();
    
    
    // Omit some code
}

Netty borrows the Arena design from jemalloc and uses a fixed number of Arenas to allocate memory. The default number of Arenas is usually 2 * CPU cores (a smaller value derived from the maximum memory and the chunk size may be chosen instead). Creating multiple Arenas alleviates resource contention and improves allocation efficiency. When a thread requests memory for the first time, it is bound to one Arena (the one currently used by the fewest threads) and deals only with that Arena for the rest of its lifetime; each thread caches its Arena reference to speed up access. The following code computes the default number of PoolArenas:

public class PooledByteBufAllocator extends AbstractByteBufAllocator implements ByteBufAllocatorMetricProvider {
    private static final int DEFAULT_NUM_HEAP_ARENA; // Number of default arenas
    private static final int DEFAULT_NUM_DIRECT_ARENA;// Number of default arenas
    
    static {
        final Runtime runtime = Runtime.getRuntime();
        final int defaultMinNumArena = NettyRuntime.availableProcessors() * 2;
        final int defaultChunkSize = DEFAULT_PAGE_SIZE << DEFAULT_MAX_ORDER;
        DEFAULT_NUM_HEAP_ARENA = Math.max(0,
                SystemPropertyUtil.getInt(
                        "io.netty.allocator.numHeapArenas",
                        (int) Math.min(
                                defaultMinNumArena,
                                runtime.maxMemory() / defaultChunkSize / 2 / 3)));
        DEFAULT_NUM_DIRECT_ARENA = Math.max(0,
                SystemPropertyUtil.getInt(
                        "io.netty.allocator.numDirectArenas",
                        (int) Math.min(
                                defaultMinNumArena,
                                PlatformDependent.maxDirectMemory() / defaultChunkSize / 2 / 3)));
    }
}
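
If the defaults are not suitable, the arena counts can be overridden. A hedged example (the system property names come from the snippet above; the five-argument constructor is one of PooledByteBufAllocator's public constructors, and the concrete values here are only illustrative):

import io.netty.buffer.PooledByteBufAllocator;

public class ArenaConfigDemo {
    public static void main(String[] args) {
        // Option 1: system properties read by the static block shown above
        // (normally passed as -D JVM flags so they are set before the class is loaded).
        // -Dio.netty.allocator.numHeapArenas=4
        // -Dio.netty.allocator.numDirectArenas=4

        // Option 2: construct an allocator explicitly:
        // preferDirect, nHeapArena, nDirectArena, pageSize (8K), maxOrder (11 -> 16M chunks)
        PooledByteBufAllocator alloc = new PooledByteBufAllocator(true, 4, 4, 8192, 11);
        System.out.println(alloc.metric().numDirectArenas()); // 4
    }
}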

PoolArena has two implementations for in-heap memory and out-of-heap memory:

  • DirectArena
  • HeapArena

The data structure is shown as follows:

It contains one smallSubpagePools array (PoolSubpage<T>[]) and six PoolChunkLists.

  • smallSubpagePools stores the Small-sized PoolSubpages
  • The six PoolChunkLists store Chunks at different usage levels and are linked together into a doubly linked list

PoolArena implements memory allocation at both the Subpage and the Chunk level: PoolSubpage handles allocations of 28 KB or less, and the PoolChunkLists handle allocations larger than 28 KB (the next size class is 32 KB). Small allocations come in 39 size classes (see the Netty source of SizeClasses).
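
The dispatch inside PoolArena#allocate follows this size-based split. Roughly (a paraphrase of the real method, not verbatim):

// Simplified pseudocode of the dispatch in PoolArena#allocate:
// int sizeIdx = size2SizeIdx(reqCapacity);
// if (sizeIdx <= smallMaxSizeIdx) {
//     tcacheAllocateSmall(cache, buf, reqCapacity, sizeIdx);   // <= 28K: Subpage path (39 size classes)
// } else if (sizeIdx < nSizes) {
//     tcacheAllocateNormal(cache, buf, reqCapacity, sizeIdx);  // 28K < size <= 16M: Run path
// } else {
//     allocateHuge(buf, reqCapacity);                          // > 16M: not pooled
// }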

Six PoolChunkLists are initialized in the PoolArena constructor:

protected PoolArena(PooledByteBufAllocator parent, int pageSize,
          int pageShifts, int chunkSize, int cacheAlignment) {
        q100 = new PoolChunkList<T>(this, null, 100, Integer.MAX_VALUE, chunkSize);
        q075 = new PoolChunkList<T>(this, q100, 75, 100, chunkSize);
        q050 = new PoolChunkList<T>(this, q075, 50, 100, chunkSize);
        q025 = new PoolChunkList<T>(this, q050, 25, 75, chunkSize);
        q000 = new PoolChunkList<T>(this, q025, 1, 50, chunkSize);
        qInit = new PoolChunkList<T>(this, q000, Integer.MIN_VALUE, 25, chunkSize);

        q100.prevList(q075);
        q075.prevList(q050);
        q050.prevList(q025);
        q025.prevList(q000);
        q000.prevList(null);
        qInit.prevList(qInit);
}

When a Chunk's memory usage changes, Netty rechecks its usage and moves it to the matching PoolChunkList, so a PoolChunk migrates between the different PoolChunkLists over time.

Question 1: why are qInit and q000 two separate lists?

qInit stores newly created PoolChunks: during the first allocation no PoolChunk is available in any PoolChunkList, so a new PoolChunk is created and added to qInit. A PoolChunk in qInit is not reclaimed even when its memory is fully released, which avoids repeatedly creating and initializing PoolChunks. Chunks that have moved to q000, by contrast, are released once their memory is fully freed.

Question 2: why does PoolArena#allocateNormal try q050 first?

    private void allocateNormal(PooledByteBuf<T> buf, int reqCapacity, int sizeIdx, PoolThreadCache threadCache) {
        if (q050.allocate(buf, reqCapacity, sizeIdx, threadCache) ||
            q025.allocate(buf, reqCapacity, sizeIdx, threadCache) ||
            q000.allocate(buf, reqCapacity, sizeIdx, threadCache) ||
            qInit.allocate(buf, reqCapacity, sizeIdx, threadCache) ||
            q075.allocate(buf, reqCapacity, sizeIdx, threadCache)) {
            return;
        }

        // Add a new chunk.
        PoolChunk<T> c = newChunk(pageSize, nPSizes, pageShifts, chunkSize);
        boolean success = c.allocate(buf, reqCapacity, sizeIdx, threadCache);
        assert success;
        qInit.add(c);
    }

The allocation order is q050 -> q025 -> q000 -> qInit -> q075.

If allocation always started from q000, then under frequent memory allocation most PoolChunks would be created and destroyed over and over, hurting allocation performance. Starting from q050 keeps PoolChunk usage in the middle range, reduces the probability of a PoolChunk being reclaimed, and balances performance. (There is no official design note confirming this; it is the commonly given explanation.)

2.4 PoolChunkList data structure Parsing

final class PoolChunkList<T> implements PoolChunkListMetric {
    private static final Iterator<PoolChunkMetric> EMPTY_METRICS = Collections.<PoolChunkMetric>emptyList().iterator();
    private final PoolArena<T> arena;
    private final PoolChunkList<T> nextList;
    private final int minUsage;
    private final int maxUsage;
    private final int maxCapacity;
    private PoolChunk<T> head;
    private final int freeMinThreshold;
    private final int freeMaxThreshold;

    // This is only update once when create the linked like list of PoolChunkList in PoolArena constructor.
    private PoolChunkList<T> prevList;
}

As the data structure above shows, the PoolChunkLists are linked to each other through prevList and nextList to form a doubly linked list, and each PoolChunkList holds a linked list of PoolChunks.

These are the classes and data structures used for pooled memory allocation. There is also a cache layer in the allocation process. Caching mainly involves two classes:

  • PoolThreadCache
  • MemoryRegionCache

3. Pool the cache in allocation

3.1 PoolThreadCache

final class PoolThreadCache {

    private static final InternalLogger logger = InternalLoggerFactory.getInstance(PoolThreadCache.class);
    private static final int INTEGER_SIZE_MINUS_ONE = Integer.SIZE - 1;

    final PoolArena<byte[]> heapArena;
    final PoolArena<ByteBuffer> directArena;

    // Hold the caches for the different size classes, which are tiny, small and normal.
    private final MemoryRegionCache<byte[]>[] smallSubPageHeapCaches;
    private final MemoryRegionCache<ByteBuffer>[] smallSubPageDirectCaches;
    private final MemoryRegionCache<byte[]>[] normalHeapCaches;
    private final MemoryRegionCache<ByteBuffer>[] normalDirectCaches;

    private final int freeSweepAllocationThreshold;
    private final AtomicBoolean freed = new AtomicBoolean();

    private int allocations;
}

Netty adds a per-thread allocation cache, the same idea as the thread caches in jemalloc's scalable memory allocation design.

When memory is freed, Netty does not immediately return it to the PoolChunk; like jemalloc, it first stashes it in the PoolThreadCache, so the next allocation of the same size can be served directly from the PoolThreadCache. This can be seen in the PoolArena#tcacheAllocateSmall method:

    private void tcacheAllocateSmall(PoolThreadCache cache, PooledByteBuf<T> buf, final int reqCapacity,
                                     final int sizeIdx) {

        if (cache.allocateSmall(this, buf, reqCapacity, sizeIdx)) {
            // was able to allocate out of the cache so move on
            return;
        }
		// Omit some code
    }

MemoryRegionCache is the caching unit used by PoolThreadCache. The caches are organized along two dimensions:

  • Heap or direct memory
  • Small or Normal size class

3.2 MemoryRegionCache

 private abstract static class MemoryRegionCache<T> {
        private final int size;
        private final Queue<Entry<T>> queue;
        private final SizeClass sizeClass;
        private int allocations;
     
     // Omit some code
      MemoryRegionCache(int size, SizeClass sizeClass) {
            this.size = MathUtil.safeFindNextPositivePowerOfTwo(size);
            queue = PlatformDependent.newFixedMpscQueue(this.size);
            this.sizeClass = sizeClass;
      }
 }

There are several properties:

  • size: the queue length; the default is 256 for Small and 64 for Normal

  • Maximum cached buffer size: 32K (set by the PooledByteBufAllocator static variable DEFAULT_MAX_CACHED_BUFFER_CAPACITY)
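
These sizes are configurable. A hedged example of tuning them (the property names below are Netty system properties that mirror the defaults above and are normally passed as -D flags at JVM startup):

// Queue length of each Small MemoryRegionCache (default 256)
// -Dio.netty.allocator.smallCacheSize=256
// Queue length of each Normal MemoryRegionCache (default 64)
// -Dio.netty.allocator.normalCacheSize=64
// Largest buffer size that may be cached at all (default 32K)
// -Dio.netty.allocator.maxCachedBufferCapacity=32768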

4. Netty pooled memory allocation process

Netty's PooledByteBuf instances are allocated by PooledByteBufAllocator. The allocation process is as follows:
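
The overall path can be outlined roughly as follows (method names are taken from the snippets above; a simplified outline, not the exact call graph):

// PooledByteBufAllocator.directBuffer(size)
//   -> pick the thread's PoolThreadCache and its bound PoolArena
//   -> PoolArena#allocate():
//        size <= 28K -> tcacheAllocateSmall():  try PoolThreadCache, else a PoolSubpage inside a PoolChunk
//        size <= 16M -> tcacheAllocateNormal(): try PoolThreadCache, else allocateNormal():
//                       q050 -> q025 -> q000 -> qInit -> q075, else create a new PoolChunk and add it to qInit
//        size >  16M -> allocateHuge():         unpooled allocation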

5. To summarize

  • Netty memory allocation is now implemented on top of jemalloc4; the data structure model differs from earlier versions, especially how Chunks are allocated and managed
  • Page, Subpage, PoolSubpage, PoolChunk, PoolChunkList, Run, PoolArena and other related classes are the building blocks of pooled memory allocation
  • PooledByteBufAllocator implements the top-level allocation interface ByteBufAllocator (via AbstractByteBufAllocator) and acts as the entry point for allocating pooled heap and direct memory
  • The per-thread PoolThreadCache caches freed memory for reuse. The default Small cache queue length is 256 and the Normal cache queue length is 64; the default maximum cached buffer size is 32 KB