
Preface

Cgroups are the lower half of container technology. Many articles have already introduced and summarized very well what a cgroup is, what it is used for, and the related concepts; those topics are not the focus of this article and will not be repeated. If you are interested, you can search the many technical articles on the subject or refer directly to the official documentation [1].

Note: this article assumes the reader already has a basic understanding of what task, cgroup, subsys, and hierarchy are and how they relate to each other.

Why do we care about cgroup control plane performance?

Cloud native is currently a key development direction in the cloud computing field. In the function compute scenario, the speed of function execution is an important performance indicator, which requires that instances can be created and destroyed quickly and with high concurrency. Isolation in this scenario involves a large number of cgroup operations, yet the existing cgroup framework handles concurrency poorly and was probably never designed with large-scale control plane operations (for example, create and destroy) in mind. As such large-scale control plane scenarios become more common, we have started to pay more attention to control plane performance.

This article is based on the 4.19 kernel source code. It analyzes the implementation principles behind the interfaces that cgroup exposes to users and, based on those principles, offers some suggestions for using cgroups from user space. At the end, some ideas for kernel-mode optimization are shared.

Analysis of the principles

Figure 1

Figure 2

The two figures above show how the main cgroup data structures are connected and how the cgroup hierarchy is linked together in the 4.19 kernel.

cgroup: as the name implies, a control group

cgroup_root: a hierarchy

cgroup_subsys: a subsystem, usually abbreviated as ss

cgroup_subsys_state: when it points to a subsys, it represents an instance of that subsys in a particular cgroup (commonly abbreviated as css)

css_set, cgrp_cset_link: used to establish the many-to-many relationship between task_struct and cgroup

These data structures are abstracted into this diagram:

In essence, the cgroup framework solves two problems: which tasks a given cgroup manages, and which cgroups manage a given task. In the implementation, the cset acts as an intermediary that establishes this relationship. Compared with connecting tasks and cgroups directly, this indirection simplifies an otherwise complex web of relationships, because in real scenarios tasks are usually managed in groups, and the tasks in one group are likely to share the same resource-management policy.
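To make this concrete, below is a heavily simplified sketch of the relevant 4.19 structures, trimmed down to the linkage described above (the real definitions carry many more fields):

    /* Heavily simplified sketch of the 4.19 structures; many fields omitted. */
    struct task_struct {
        struct css_set *cgroups;      /* the cset this task belongs to   */
        struct list_head cg_list;     /* node on the cset's task list    */
    };

    struct css_set {
        struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
        struct list_head tasks;       /* all tasks that use this cset    */
        struct list_head cgrp_links;  /* cgrp_cset_links to its cgroups  */
    };

    /* One link object per (cgroup, cset) pair: the many-to-many glue. */
    struct cgrp_cset_link {
        struct cgroup   *cgrp;
        struct css_set  *cset;
        struct list_head cset_link;   /* node on cgrp->cset_links        */
        struct list_head cgrp_link;   /* node on cset->cgrp_links        */
    };

    struct cgroup {
        struct cgroup_subsys_state self;
        struct list_head cset_links;  /* cgrp_cset_links to its csets    */
    };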

Operations on cgroups revolve around these three kinds of entities (a small user-space sketch follows the list):

  • Create: add a leaf node to the tree structure shown in Figure 2
  • Bind: a child process simply starts out pointing to the same cset as its parent when it is forked; binding (attaching) a task is essentially a migration from one cset (which is removed if no task points to it any more) to another cset (which is newly created if the target cgroup combination has no existing cset)
  • Delete: remove a leaf node that no longer controls any task from the tree structure shown in Figure 2
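From user space, these three operations map onto ordinary filesystem calls on the cgroup mount. A minimal sketch, assuming a cgroup v1 memory hierarchy mounted at the usual place and omitting error handling:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f;

        /* Create: add a leaf node under the memory hierarchy. */
        mkdir("/sys/fs/cgroup/memory/demo", 0755);

        /* Bind: writing a pid into cgroup.procs migrates that task from its
         * current cset to one that points at the new cgroup.               */
        f = fopen("/sys/fs/cgroup/memory/demo/cgroup.procs", "w");
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);

        /* Move ourselves back to the root so that "demo" becomes empty... */
        f = fopen("/sys/fs/cgroup/memory/cgroup.procs", "w");
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);

        /* ...Delete: only a leaf that controls no task can be removed. */
        rmdir("/sys/fs/cgroup/memory/demo");
        return 0;
    }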

Concurrency control for cgroup operations also centers on these three kinds of entities:

  • Task: cgroup_threadgroup_rwsem
  • Cset: css_set_lock
  • Cgroup: cgroup_mutex

The specific role of each of these three locks is analyzed below, in the discussion of optimization ideas.

Optimization scheme

What’s the problem?

The problem is three locks: cgroup_mutex, cgroup_threadgroup_rwsem, and css_set_lock.

cgroup_mutex protects the entire cgroup hierarchy. The cgroup hierarchy is a forest, and this single lock protects the whole forest: mount, mkdir, rmdir, and so on all take it. The lock is also required by every other operation on a cgroup, such as attaching a task, and by any interface that reads or writes a cgroup, because an rmdir can happen at any moment and every operation must be mutually exclusive with rmdir.

css_set_lock protects all operations related to css_set. Any process can exit at any time, which may release a css_set and therefore modify the css_set hash table. In addition, most cgroup operations also involve css_set_lock, because most operations on a cgroup (other than creation) change css_sets.

cgroup_threadgroup_rwsem protects the thread-group operations related to cgroups. In practice, fork and exit can change a thread group at any time. A read/write lock is used here because a process's own behavior, which may change the composition of its thread group, takes the read lock and can run in parallel, while attaching a process requires a stable view of the thread group, which would change if the process were forking or exiting at that moment. The read/write lock therefore does not mean that anything is literally being read or written; it simply matches the semantics needed: readers run in parallel, and the writer is mutually exclusive with everyone else. In other words, fork, exec, and exit behave like readers and can proceed in parallel, while attach behaves like the writer and excludes all the others.
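Schematically, the usage pattern looks roughly like the sketch below. cgroup_threadgroup_rwsem is a percpu rwsem, and cgroup_threadgroup_change_begin()/end() are the real kernel helpers wrapped around fork, exec, and exit; the attach function shown here is a made-up name standing in for the real attach path:

    /* "Reader" side: wrapped around fork, exec and exit; many in parallel. */
    static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
    {
        percpu_down_read(&cgroup_threadgroup_rwsem);
    }

    static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
    {
        percpu_up_read(&cgroup_threadgroup_rwsem);
    }

    /* "Writer" side: a hypothetical stand-in for the attach path. It needs a
     * stable view of the whole thread group, so it excludes forks, exits and
     * other attaches while it migrates the group between csets.             */
    static void attach_threadgroup(struct task_struct *leader, struct cgroup *dst)
    {
        percpu_down_write(&cgroup_threadgroup_rwsem);
        /* ... migrate leader's whole thread group to a cset of dst ... */
        percpu_up_write(&cgroup_threadgroup_rwsem);
    }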

All three locks are affected by process fork and exit, and together they leave almost no room for parallelism between cgroup operations. Before studying cgroup in depth, the author assumed the designers had simply been lazy and picked overly coarse-grained locks. Only after digging into the cgroup framework did it become clear that the critical sections really are that large and that all kinds of asynchronous events need to touch the same data, so the locks are in fact a reasonable design.

Let us try to abstract the problem and think about what its essence is.

For cgroup_mutex, the problem is essentially concurrent access to a tree structure (whose nodes are cgroups).

For css_set_lock, the problem is essentially concurrent access to a bipartite graph (css_sets on one side, cgroups on the other).

For cgroup_threadgroup_rwsem, the problem is essentially concurrent access to a set structure (whose elements are thread groups).

Now that the definition of the problem is clear, how can we solve it? With my current abilities, I can’t solve it.

Yes, after all this analysis, the conclusion is that there is no solution to this problem, or at least no solution for now. The possible solutions would require drastic, invasive changes to the cgroup framework, and I cannot give a definitive conclusion on the risks behind that, the impact on stability, or whether the pain is worth the gain. If you have ideas, please leave them in the comments section and let us know.

Although it is hard to cure the root cause, there are still some ideas for treating the symptoms.

User-mode optimization: reduce cgroup operations

This solution is easy to understand: create and configure the cgroups in advance, and simply take one when you need it. It works so well that it is practically an unfair advantage. Here is the experimental data; the test simulates the cgroup creation, reads, and writes performed when a Kangaroo container starts.

This scheme achieves an improvement of more than 90%: the original create-configure-attach-delete sequence is reduced to the attach step alone. With less work to do, it is naturally faster.
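As a minimal user-space sketch of the idea (the paths, pool layout, and subsystem choice here are illustrative, not the actual implementation):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Warm-up phase: pre-create and pre-configure a pool of cgroups.
     * Assumes /sys/fs/cgroup/memory/pool already exists.             */
    static void prepare_pool(int n)
    {
        char path[256];

        for (int i = 0; i < n; i++) {
            snprintf(path, sizeof(path),
                     "/sys/fs/cgroup/memory/pool/slot-%d", i);
            mkdir(path, 0755);
            /* write limits such as memory.limit_in_bytes here */
        }
    }

    /* Hot path: starting an instance is now just the attach step. */
    static void attach_instance(int slot, pid_t pid)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/memory/pool/slot-%d/cgroup.procs", slot);
        f = fopen(path, "w");
        if (f) {
            fprintf(f, "%d\n", (int)pid);
            fclose(f);
        }
    }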

But there are drawbacks. First, the unused cgroups in the pool are still visible to the system and still have to be managed, which adds a certain load. Second, data residue is a problem: not every subsys provides a clear-like interface, so if monitoring data has to be trustworthy, a cgroup can only be used once and never reused. Finally, the cgroup hierarchy has to be settled in advance; since the cgroups must be created and configured ahead of time, the pool cannot be built if you have no control over the runtime hierarchy.

Reduce the number of cgroups

By default, systemd mounts most subsystems on separate hierarchies. If all of your service processes are controlled by the same set of subsystems, you can mount those subsystems together on a single hierarchy, for example mounting cpu, memory, and blkio together.
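Co-mounting could look like the sketch below, using the raw mount(2) call; the mount point is arbitrary and is assumed to exist already:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* One hierarchy carrying cpu, cpuacct, memory and blkio together,
         * instead of one hierarchy per subsystem.                         */
        if (mount("cgroup", "/sys/fs/cgroup/cpu_memory_blkio", "cgroup",
                  0, "cpu,cpuacct,memory,blkio") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }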

How much difference can there be between creating one cgroup under each of cpu, memory, and blkio, and creating a single cgroup under a combined cpu_memory_blkio hierarchy? All the necessary logic still has to run; none of it disappears. At most we save a few cgroup structures themselves, so how big can the difference really be?

Going back to the original scenario: the problem is that the workload is highly concurrent while cgroup operations are inherently serial. As we know, there are two main dimensions for measuring performance, throughput and latency. Because the operations stay serial, merging hierarchies cannot directly improve throughput; but when every subsys sits on its own hierarchy, each operation is split into independent sub-operations, one per hierarchy, which adds latency, and that is exactly what merging removes.

Here are the test data:

Kernel-mode optimization

We cannot touch the three locks themselves; we can only work on the critical sections they protect. To shrink a critical section, we first have to find its time-consuming parts and optimize them.

The following figure shows how long each part of creating a cgroup takes for each subsystem:

Here’s a quick explanation of what each part does:

  • cgroup: creates and initializes the cgroup structure
  • kernfs: creates the cgroup directory
  • populate: creates the file interfaces used for cgroup control
  • css alloc: allocates the css
  • css online: runs each subsystem's css online logic
  • css populate: creates the file interfaces used for subsystem control

The figure shows that the latency of cpu, cpuacct, and memory is much higher than that of the other subsystems, and that css alloc and css populate account for most of it. Let's take a look at what this "principal contradiction" is actually doing.

Analysis shows that the high latency of css alloc comes from allocating memory for some percpu members, which is time-consuming. css populate is slow because some subsystems have many interface files that must be created one by one.

Since none of this logic is redundant, how do we optimize? By caching. For the percpu members, record the address and skip the free so that the memory can be reused the next time. For the subsystem interface files, move them, one whole directory at a time, to a designated place when the cgroup is released and move them back when needed; this only touches one directory entry in the parent directory, so the overhead is low and constant.
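The percpu caching half of this, expressed as a conceptual sketch (the names and structure here are invented for illustration; this is not the actual patch):

    /* Conceptual sketch of a percpu-allocation cache; not the real patch. */
    struct pcpu_cache_entry {
        struct list_head node;
        void __percpu   *ptr;    /* address remembered instead of freed */
    };

    static LIST_HEAD(pcpu_cache);
    static DEFINE_SPINLOCK(pcpu_cache_lock);

    /* On css release: stash the percpu area instead of freeing it. */
    static void cache_percpu(struct pcpu_cache_entry *e)
    {
        spin_lock(&pcpu_cache_lock);
        list_add(&e->node, &pcpu_cache);
        spin_unlock(&pcpu_cache_lock);
    }

    /* On css alloc: reuse a cached area if one is available;
     * the caller falls back to alloc_percpu() when NULL is returned. */
    static void __percpu *reuse_percpu(void)
    {
        struct pcpu_cache_entry *e = NULL;
        void __percpu *ptr = NULL;

        spin_lock(&pcpu_cache_lock);
        if (!list_empty(&pcpu_cache)) {
            e = list_first_entry(&pcpu_cache, struct pcpu_cache_entry, node);
            list_del(&e->node);
        }
        spin_unlock(&pcpu_cache_lock);

        if (e) {
            ptr = e->ptr;
            kfree(e);
        }
        return ptr;
    }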

With these two methods, the per-subsystem latency of creating a cgroup changes as follows:

The css alloc part of the cpu subsystem is still relatively time-consuming because of the amount of initialization it performs, but its latency drops to 50 us from the original 160 us.

Shrinking the critical section does not improve concurrency, but at least it reduces latency. Here is the test data.

Each thread creates n cgroups under cpu, cpuacct, cpuset, memory, and blkio:

Some hypothetical ideas

Setting aside current limitations, the existing framework, and backward compatibility, how would you design a framework that controls process resources and supports high concurrency?

Today's cgroup mechanism provides a high degree of flexibility: subsystems can be combined and bound arbitrarily, and a task can be bound to any cgroup. If some of this flexibility is sacrificed, the problem described above can be simplified.

First, could the idea mentioned earlier of binding all subsystems together to reduce the number of cgroups be hard-wired into the kernel, that is, stop providing the ability to mount subsystems separately or combine them arbitrarily? Process groups and cgroups would then correspond one-to-one, the cset would lose its reason to exist, and the problem css_set_lock solves would disappear along with it. The downside is that all processes in a process group would receive the same resource control on every subsystem.

Second, is the cgroup hierarchy really necessary? Cgroups are currently organized as a tree, which does match reality logically: for example, layer 1 allocates total resources to a business and layer 2 allocates resources to its components. But from the point of view of how the operating system allocates resources and how business processes consume them, the first layer has no functional effect; it mainly gives users a more logical way to operate and manage things. If the "no internal process" rule proposed by cgroup v2 is also applied, the cgroup hierarchy could be flattened to a single level.

The advantage of having only one cgroup level is that cgroup_mutex can easily be refined into one lock per cgroup, instead of a multi-level tree in which changing a cgroup requires taking locks from its ancestors downward. Once the lock granularity is refined, concurrently started container instances no longer contend with each other, because they correspond to different cgroups.

Third, could cgroup deletion be restricted? Today the user deletes an empty cgroup manually and asynchronously. If a cgroup were instead hidden as soon as it no longer manages any process (through exit or migration) and the actual deletion were triggered at some later time, the contention scenarios could be reduced. This approach would make empty cgroups non-reusable, but is there really a need to reuse empty cgroups today?

Finally, could process binding be restricted? Binding a task to a cgroup is in essence a move from one cgroup to another, and once cgroup_mutex is split into finer-grained locks this introduces ABBA deadlock problems. One question is whether a task ever needs to bind to a cgroup and then bind again. Ideally it binds once, runs smoothly, and exits. Under this assumption, task binding could be restricted so that the source or the destination must be the default cgroup, which serves as a springboard.

These are my immature ideas; comments and discussion are welcome.


This article is original content from Alibaba Cloud and may not be reproduced without permission.