The slab allocator in the Linux kernel comes from a simple idea: keep data structures that are frequently allocated and released ready for reuse. However, the original slab implementation became so complex and expensive to maintain that a much smaller replacement, SLUB, was created. This article is therefore about SLUB, and all subsequent references to "slab" mean SLUB. Also note that this article is mainly about kernel optimization, not basic principles, so if you want the details of the slab design and its code, please read the source itself or search for it (e.g. on Baidu).

Simple slab on a single CPU

The following figure shows the sequence of scenarios when a slab is allocated and released on a single CPU:

As you can see, it’s very simple, and it achieves exactly what it was intended to do when slab was designed.

Extending to multi-core CPUs

Now we can simply extend the above model to multi-core CPUs; the same allocation sequence is shown below:

We can see that with a single slab, if multiple CPUs allocate objects at the same time, conflicts are unavoidable, and practically the only way to resolve them is to queue the CPUs up with a lock. This greatly increases latency: the allocation of a single object starts at T0 but does not finish until T4, which is far too long.

The obvious way to get lock-free parallelism across multiple CPUs is to replicate the same set of data structures on each CPU.

Concretely, that means adding per-CPU variables. For a slab, the structure can be extended as follows:
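For illustration, the kernel's generic per-CPU primitives are enough to express this idea. Below is a minimal sketch of a per-CPU free-object list; the structure and function names are made up for illustration, and the real slab code uses struct kmem_cache_cpu and is considerably more involved:

```c
#include <linux/percpu.h>
#include <linux/preempt.h>

/* Illustrative per-CPU free-object list; only a sketch of the idea. */
struct my_cpu_cache {
	void **freelist;     /* singly linked list of free objects */
	unsigned long nr;    /* number of free objects on this CPU */
};

static DEFINE_PER_CPU(struct my_cpu_cache, my_cache);

static void *my_alloc(void)
{
	struct my_cpu_cache *c;
	void *obj = NULL;

	preempt_disable();                    /* stay on the current CPU */
	c = this_cpu_ptr(&my_cache);
	if (c->freelist) {
		obj = c->freelist;
		c->freelist = *(void **)obj;  /* next pointer lives inside the object */
		c->nr--;
	}
	preempt_enable();
	return obj;                           /* NULL means: fall back to a slower path */
}
```

Because each CPU only ever touches its own copy, the fast path needs no lock at all, only protection against migrating to another CPU in the middle of the operation.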

If you think that were the end of it, there would be nothing more worth talking about.

The problem

First, let's look at a simple problem: what if one CPU's slab cache has nothing left to allocate, while the slab caches of other CPUs still hold a large number of free objects, as shown in the following figure:

This situation is entirely possible, because the demand on each slab is closely tied to the processes/threads running on that CPU. For example, if CPU0 only handles networking, it will have a huge demand for data structures such as skbs (struct sk_buff). If, in that case, we simply allocate a new page (or pages, depending on the object size and the order of the slab cache) from the buddy system, then over time slab pages will be distributed unevenly across the CPUs and will likely eat up a large amount of physical memory, which is undesirable.

Before going further, one thing must be made clear: balancing slab usage between CPUs has to rely on the slab mechanism itself, and it is completely different from load-balancing processes between CPUs. Processes have a central scheduling mechanism, based on time slices, virtual clock rates, and so on; a slab object, by contrast, belongs entirely to its user. As long as the object is still in use, you cannot take it away from the user until the user releases it. Slab load balancing must therefore be designed to be cooperative, not preemptive.

All right. Now we know that allocating new page(s) from the buddy system is not a good idea; it should be the last resort, and other routes should be tried first.

Now, we come to the second question, as shown below:

There is no guarantee that the CPU that releases a slab object is the one that allocated it, nor that a CPU will never have to allocate new page(s) during the lifetime of a slab object, and the operations in between can be arbitrarily complicated. How are these problems solved? In fact, once you understand how they are solved, you understand the whole slab framework.

Solution – layering the slab cache

A smooth, continuous gradation between fast and slow is always appealing.

If a CPU's slab cache is exhausted, raiding the same-level slab cache of another CPU feels rude and unsound. So why not set up another level of slab cache? Fetching objects from it is not as simple and direct as fetching from the CPU's own slab cache, but it is not much harder either, just a little more expensive. Isn't this exactly how the CPU's L1, L2, and L3 caches are designed? That layered scheme has become the de facto standard of cache design, and the same idea applies to slab, i.e. the SLUB implementation in the Linux kernel. Now we can state the concepts and explain them.

  • Linux kernel slab cache: a three-level object cache model (a simplified structure sketch follows this list).
  • Level 1 slab cache: a linked list of free objects; each CPU has its own exclusive one, so allocating and freeing on it needs no lock.
  • Level 2 slab cache: a linked list of free objects belonging to a page(s) that may be shared between CPUs; when allocating and releasing objects through it, only that page(s) needs to be locked.
  • Level 3 slab cache: a linked list of page(s), shared by all CPUs of a NUMA node. After a CPU obtains a page(s) from it, the page(s) is promoted to that CPU's Level 1 slab cache, and the page(s)' own freelist serves as its Level 2 shared cache.
  • Shared page(s): a page(s) occupied by one or more CPUs. Each CPU may hold a Level 1 freelist built from objects of that page(s), while the page(s) itself keeps a single Level 2 freelist. This Level 2 list does not conflict with the Level 1 lists; when several CPUs want to take it over, they must compete for it, and the winner promotes the list into its own Level 1 slab cache.
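To make the hierarchy concrete, here is a simplified structural sketch. The types and fields are modeled loosely on SLUB's struct kmem_cache_cpu, struct page and struct kmem_cache_node, but they are not the kernel's actual definitions:

```c
#include <linux/list.h>
#include <linux/spinlock.h>

/* Level 1: per-CPU cache -- the lockless fast path (cf. struct kmem_cache_cpu) */
struct level1_cpu_cache {
	void **freelist;            /* free objects owned exclusively by this CPU */
	struct slab_page *page;     /* the page(s) this CPU currently allocates from */
};

/* Level 2: per-page(s) cache -- shared, touched atomically / under a per-page lock */
struct slab_page {
	void **freelist;            /* objects freed back to this page(s) by any CPU */
	unsigned int inuse;         /* objects currently handed out */
	struct list_head lru;       /* linkage on the node's partial list (Level 3) */
};

/* Level 3: per-NUMA-node cache -- shared by all CPUs of the node
 * (cf. struct kmem_cache_node) */
struct level3_node_cache {
	spinlock_t list_lock;       /* protects the partial list */
	struct list_head partial;   /* page(s) that still have free objects */
};
```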

The slab cache is shown as follows:

Its behavior is shown below:

Two scenarios

For a general object allocation process, the following figure shows the details:
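To complement the description, the allocation order can be summarized in rough pseudo-C, going from the fastest level to the slowest. This is only a sketch; the helper functions are placeholders for the steps described above, not real kernel functions:

```c
/* Simplified allocation order: Level 1 -> Level 2 -> Level 3 -> buddy system. */
void *slab_alloc(struct my_cache *s)
{
	void *obj;

	/* Level 1: this CPU's private freelist -- no lock needed */
	obj = pop_cpu_freelist(s);
	if (obj)
		return obj;

	/* Level 2: the freelist of the page(s) this CPU currently owns;
	 * other CPUs may have freed objects back to it, so take it atomically */
	obj = take_page_freelist(s);
	if (obj)
		return obj;

	/* Level 3: take a partial page(s) from the NUMA node's shared list
	 * (under the node's list_lock) and promote it to this CPU's Level 1 */
	if (get_partial_from_node(s))
		return pop_cpu_freelist(s);

	/* Last resort: ask the buddy system for new page(s) */
	if (alloc_pages_from_buddy(s))
		return pop_cpu_freelist(s);

	return NULL;
}
```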

In fact, there is another way things can play out when multiple CPUs share a page(s), as shown below:
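One hedged sketch of what sharing a page(s) means for the free path: if a CPU releases an object whose page(s) is currently owned by another CPU, the object is pushed onto that page's own Level 2 freelist rather than the local Level 1 list. The helpers here are again placeholders; SLUB does something similar with a cmpxchg on the page freelist:

```c
/* Simplified free path when the object's page(s) may belong to another CPU. */
void slab_free(struct slab_page *page, void *obj)
{
	if (page == this_cpu_owned_page()) {
		/* Fast path: push onto this CPU's Level 1 freelist, no lock */
		push_cpu_freelist(obj);
		return;
	}

	/* Remote free: the owning CPU will pick these objects up later */
	for (;;) {
		void *old = READ_ONCE(page->freelist);

		*(void **)obj = old;               /* link obj in front of the list */
		if (cmpxchg(&page->freelist, old, obj) == old)
			break;
	}
}
```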

The buddy system

Up to now we have briefly walked through the Linux kernel's slab design. The chain of levels should not be too long, because a chain that is too long becomes hard to understand. But in the end, if Level 3 cannot supply page(s) either, the request falls through to the ultimate backend: the buddy system.

The buddy system is designed to prevent fragmentation of allocated memory, so it tries to do two things:

1). Try to keep and hand out memory in blocks that are as large and contiguous as possible, splitting a larger block only when necessary.
2). Try to merge contiguous small free blocks back into one large block.
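As a small, illustrative sketch of both points: each block covers 2^order pages, the buddy of a block is found by flipping one bit of its page frame number, and freeing repeatedly merges a block with its buddy while the buddy is also free. The helper functions below are placeholders, not the kernel's real ones (the real logic lives in mm/page_alloc.c):

```c
#include <linux/mmzone.h>   /* MAX_ORDER */

/* For a block of 2^order pages starting at page frame number pfn,
 * its buddy starts at pfn with bit 'order' flipped (point 1: blocks
 * are split and handed out along power-of-two boundaries). */
static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* Point 2: on free, keep merging with the buddy while it is also free,
 * doubling the block each time. buddy_is_free(), remove_from_free_list()
 * and add_to_free_list() stand in for the real bookkeeping. */
static void free_and_merge(unsigned long pfn, unsigned int order)
{
	while (order < MAX_ORDER - 1 &&
	       buddy_is_free(buddy_of(pfn, order), order)) {
		remove_from_free_list(buddy_of(pfn, order), order);
		pfn &= ~(1UL << order);   /* merged block starts at the lower pfn */
		order++;
	}
	add_to_free_list(pfn, order);
}
```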

We can understand the above principle by looking at the following diagram:

Note that this article is about optimization, not buddy systems, so I assume you already understand buddy systems.

Since most slab objects are small structures of no more than one page (and not only in the slab system: requests for more than one page are rare compared with single-page requests), there is a huge demand for single-page allocations. By the buddy system's rules, if single pages keep being allocated in large numbers, many blocks of order greater than 0 have to be split into single pages. On a single-core CPU this is not a problem, but on multi-core CPUs every CPU performs such allocations, and every split and merge in the buddy system involves a lot of linked-list operations, so the lock overhead is huge and needs to be optimized!

The Linux kernel optimizes single-page allocation from the buddy system with a per-CPU single-page cache that is refilled in bulk!

Each CPU has a single-page cache pool. When a single page is needed, it can be taken from the pool of the current CPU without locking. When the pool runs short, the kernel pulls a batch of pages from the buddy system into the pool; conversely, when a single page is released, it is preferentially released back into the per-CPU single-page cache.

To keep the number of pages in the per-CPU page cache neither too large nor too small (too many pages starve the buddy system, too few fail to meet the CPU's demand), the kernel maintains two watermarks: when the number of cached pages falls below the low value, a batch of pages is fetched from the buddy system into the pool, and when it rises above the high value, some pages are released back to the buddy system.
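A simplified model of this per-CPU page pool and its two watermarks might look like the following. The kernel's real structure is struct per_cpu_pages; the names and helper functions here are illustrative only:

```c
#include <linux/list.h>

/* Illustrative per-CPU single-page pool with two watermarks. */
struct cpu_page_pool {
	struct list_head pages;   /* cached order-0 pages */
	int count;                /* pages currently in the pool */
	int low;                  /* refill from the buddy system below this */
	int high;                 /* drain back to the buddy system above this */
	int batch;                /* how many pages to move per refill/drain */
};

struct page *get_single_page(struct cpu_page_pool *pool)
{
	if (pool->count <= pool->low)
		refill_from_buddy(pool, pool->batch);  /* one lock, a batch of pages */
	if (!pool->count)
		return NULL;                           /* the buddy system is empty too */
	pool->count--;
	return pop_page(&pool->pages);
}

void put_single_page(struct cpu_page_pool *pool, struct page *page)
{
	push_page(&pool->pages, page);
	if (++pool->count >= pool->high)
		drain_to_buddy(pool, pool->batch);     /* give a batch back */
}
```

The point of the batch size is that the buddy system's lock is taken once per batch instead of once per page.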

Summary

In a multi-CPU operating-system kernel, the dominant cost is the cost of locking. I think this goes back to the original design: at first, multi-core CPUs did not exist, and on a single-core CPU shared data could be protected almost entirely by simply disabling interrupts and disabling preemption. When the multi-core era arrived, operating systems simply shifted to the new platform, so synchronization was bolted onto a single-core design after the fact. To put it simply, today's mainstream operating systems were created in the single-core era and designed for a single-core environment; for a multi-core environment, they may have been wrong from the start.

The best optimization, therefore, is to avoid locking altogether or to minimize it. The approach taken is to create per-CPU caches for shared critical data structures, and these fall into two types:

1). Data-path caches. Data structures such as routing tables can be protected with RCU, but it is even better to create a local per-CPU cache of the routing table. The question then is when to update these caches: because all the copies are flat peers, a batch synchronization mechanism is necessary. (See the RCU read-side sketch after this list.)
2). Management caches. A slab object cache is an example: its lifetime is entirely up to its consumer, so there is no synchronization problem, but there is a management problem. The idea of a tiered cache works well here, and it is very similar to the CPU's L1/L2/L3 caches.
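For the first type, the typical read side looks like the following hedged example using the kernel's RCU API; the routing-table structure here is only a stand-in:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct route_entry {                  /* stand-in for a routing-table-like structure */
	u32 dst;
	u32 gateway;
};

static struct route_entry __rcu *cached_route;
static DEFINE_SPINLOCK(route_lock);   /* serializes writers only */

/* Data-path readers: no lock, just an RCU read-side critical section */
static u32 lookup_gateway(void)
{
	struct route_entry *r;
	u32 gw = 0;

	rcu_read_lock();
	r = rcu_dereference(cached_route);
	if (r)
		gw = r->gateway;
	rcu_read_unlock();
	return gw;
}

/* Control-path writer: publish the new entry, then free the old one
 * only after all pre-existing readers have finished */
static void update_route(struct route_entry *new_entry)
{
	struct route_entry *old;

	spin_lock(&route_lock);
	old = rcu_dereference_protected(cached_route, lockdep_is_held(&route_lock));
	rcu_assign_pointer(cached_route, new_entry);
	spin_unlock(&route_lock);

	synchronize_rcu();
	kfree(old);
}
```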
