Introduction: This article is the first in a series on colocation (mixed deployment) practices. It introduces the importance of resource isolation in colocation, the challenges of implementing it, and our solutions.

Authors: Qian Jun, Nan Yi

Colocation, as the name implies, means deploying different services on the same machine so that they share CPU, memory, and I/O resources. The goal is to maximize resource utilization and reduce procurement and operating costs.

Alibaba began exploring colocation in 2014. After seven years of practice, this technology, which greatly improves resource utilization, has now entered commercial use.

Through full-link isolation of computing, memory, storage, and network resources, together with millisecond-level adaptive scheduling and intelligent decision-making and operations capabilities, Alibaba can safely support millions of colocated Pods internally even under Double 11 traffic. Whether for CPU or GPU resources, ordinary containers or secure containers, or heterogeneous infrastructure including domestic environments, colocation can be carried out efficiently. It has reduced the production-cluster cost of Alibaba's core e-commerce business by more than 50%, while keeping interference with core business below 5%.

To address resource efficiency in the cloud-native era, we will publish a series of articles based on our large-scale colocation practice, sharing the details of colocation technology and the practical problems encountered in mass production. As the opening article of the series, this one introduces the importance of resource isolation technology in colocation, the challenges of implementing it, and our approach to addressing them.

The relationship between colocation and resource isolation: resource isolation is the cornerstone of colocation

Colocation typically mixes tasks of different priorities, such as high-priority, latency-sensitive real-time tasks with low resource consumption (called online tasks) and low-priority, latency-insensitive batch tasks with high resource consumption (called offline tasks). When high-priority services need resources, low-priority tasks must give them back immediately, and running low-priority tasks must not noticeably interfere with high-priority tasks.

To meet the requirements of colocation, kernel-level resource isolation is one of the key technologies. Alibaba Cloud has worked deeply on kernel resource isolation for many years and accumulated extensive, industry-leading experience. We group kernel resource isolation into three subsystems: scheduling, memory, and I/O. In each of them we have carried out in-depth adaptation and optimization for cloud-native colocation scenarios, including CPU Group Identity, SMT Expeller, and cgroup-based asynchronous memory reclamation. These key technologies allow customers to build optimal solutions around their service characteristics in cloud-native colocation scenarios, effectively improving resource utilization and reducing resource costs. They are well suited to container-cloud colocation and are also the key technologies on which large-scale colocation schemes depend.

The following figure shows the position of the resource isolation capability in the overall colocation scheme:

Why resource isolation is needed, and what the obstacles are

Suppose we have a server that runs both high-priority online business and offline tasks. Online tasks have a clear demand for low response time (RT), so this kind of load is called Latency Sensitive (LS). Offline tasks will consume as many resources as they can get, so this kind of load is called Best Effort (BE). If we do nothing, offline tasks are likely to occupy various resources frequently and for long periods, so that online tasks cannot be scheduled, or cannot be scheduled in time, or cannot obtain bandwidth, and so on, causing a sharp increase in the RT of the online business. In this scenario, therefore, we need effective means to separate online and offline containers in terms of resource usage, so that high-priority online containers can obtain resources in time, ultimately guaranteeing their QoS while improving overall resource utilization.

Let's look at what can happen when online and offline tasks are colocated:

  • First, the CPU is the resource most likely to see contention between online and offline tasks: CPU scheduling works per core, so online and offline tasks may be scheduled onto the same core and compete with each other for execution time.
  • The two tasks may also land on the two hyper-threads (HT) of the same physical core, competing for instruction issue bandwidth and other pipeline resources.
  • Further down, the CPU caches at each level will inevitably be consumed, and cache capacity is limited, so cache partitioning becomes an issue.
  • Even with a perfect partitioning of every cache level, problems remain. Memory is the next level below the CPU caches; memory capacity is contended for in a similar way, so memory also needs to be partitioned between online and offline tasks, just like the CPU caches.
  • In addition, when accesses miss the Last Level Cache (LLC), memory bandwidth consumption (a dynamic quantity, as distinct from the static memory size) rises, so memory and CPU cache consumption affect each other.
  • Even if CPU and memory resources are properly isolated, that only covers the single machine; it is easy to see that the network may also need isolation between high-priority online services and offline tasks.
  • Finally, I/O contention may occur on some machine types, so we also need an effective I/O isolation policy.

This is a very simple walk-through of resource isolation, and as you can see, there is potential interference or contention at every stage.

Introduction to the isolation technologies: a set of unique isolation techniques, each with its own strengths

Kernel resource isolation mainly involves the scheduling, memory, and I/O subsystems. Based on Linux cgroup v1, these technologies provide basic resource partitioning and QoS guarantees. They are suitable for container-cloud scenarios and are also key technologies that large-scale colocation schemes strongly depend on.

In addition to the basic CPU, memory, and I/O resource isolation technologies, we have also developed supporting tools such as a resource isolation view, Service Level Indicators (SLIs), and resource contention analysis, providing a complete resource isolation and colocation solution covering monitoring, alerting, operations, and diagnosis, as shown in the following figure:

Scheduler optimization for elastic container scenarios

How to improve compute resource utilization while guaranteeing the quality of computing service is a classic problem in container scheduling. As CPU utilization rises, the limited elasticity of the CPU bandwidth controller becomes apparent: when a container has a sudden burst of CPU demand, the bandwidth controller throttles its CPU usage, hurting the latency and throughput of the load.

CPU Burst, an elastic container bandwidth control technology, was originally proposed by the Alibaba Cloud operating system team and contributed to the Linux community and the OpenAnolis (Dragon Lizard) community; it was merged into Linux 5.14 and OpenAnolis ANCK 4.19 respectively. As long as a container's average CPU usage stays below a set limit, CPU Burst allows it to briefly exceed its quota, improving service quality and accelerating container workloads.
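As a rough illustration, here is a minimal sketch of configuring a burst allowance on top of a container's normal quota. The cgroup path and container name are hypothetical; to the best of our knowledge, cpu.cfs_burst_us is the cgroup v1 knob (Linux 5.14+ / ANCK) and cpu.max.burst is the cgroup v2 equivalent, but this is not necessarily the authors' exact tooling.

```python
# Sketch: granting a CPU burst allowance to a container's cgroup.
# Paths and the container name are hypothetical.
from pathlib import Path

def set_cpu_burst_v1(cgroup_dir: str, quota_us: int, period_us: int, burst_us: int) -> None:
    cg = Path(cgroup_dir)
    (cg / "cpu.cfs_period_us").write_text(str(period_us))   # accounting period
    (cg / "cpu.cfs_quota_us").write_text(str(quota_us))     # average limit per period
    (cg / "cpu.cfs_burst_us").write_text(str(burst_us))     # extra, accumulated headroom

def set_cpu_burst_v2(cgroup_dir: str, quota_us: int, period_us: int, burst_us: int) -> None:
    cg = Path(cgroup_dir)
    (cg / "cpu.max").write_text(f"{quota_us} {period_us}")
    (cg / "cpu.max.burst").write_text(str(burst_us))

if __name__ == "__main__":
    # Allow a 2-CPU container to burst by up to one extra CPU-period of time.
    set_cpu_burst_v1("/sys/fs/cgroup/cpu/mycontainer",
                     quota_us=200000, period_us=100000, burst_us=100000)
```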

After enabling CPU Burst in container scenarios, the service quality of the test containers improves significantly, as shown in the figure below: in our measurements, the long-tail RT problem almost disappears with this feature.

Group Identity technology

We need to meet users' requirements for CPU resource isolation: while maximizing CPU utilization, the quality of service of high-priority services must not be affected, or at least the impact must be kept within a bounded range. To achieve this, the kernel scheduler needs to give high-priority tasks more scheduling opportunities, minimizing both their scheduling delay and the impact that low-priority tasks have on them. This is a common requirement across the industry.

Against this background, we introduce the concept of Group Identity: each CPU cgroup is given an identity, and priority scheduling is applied per CPU cgroup, improving the ability of high-priority groups to preempt in time and guaranteeing the performance of high-priority tasks. This applies to scenarios where online and offline services are colocated: it minimizes the scheduling delay that offline services impose on online services, for example by granting additional CPU preemption opportunities to high-priority services, ensuring that online services are not affected by offline services in terms of CPU scheduling latency.
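The control interface is specific to ANCK. As far as we know from ANCK / Alibaba Cloud Linux documentation, the per-cgroup identity is exposed as cpu.bvt_warp_ns, where a positive value marks a high-priority (online) group and -1 marks a low-priority (offline) group; treat the file name, value range, and cgroup names below as assumptions. A minimal sketch:

```python
# Sketch: tagging CPU cgroups with a Group Identity (ANCK-specific).
# The knob name cpu.bvt_warp_ns and its values are assumptions; verify
# against your kernel's documentation.
from pathlib import Path

CPU_CGROUP_ROOT = Path("/sys/fs/cgroup/cpu")   # cgroup v1 hierarchy

def set_group_identity(cgroup: str, identity: int) -> None:
    """identity > 0: high-priority (online); 0: default; -1: low-priority (offline)."""
    (CPU_CGROUP_ROOT / cgroup / "cpu.bvt_warp_ns").write_text(str(identity))

if __name__ == "__main__":
    set_group_identity("online", 2)    # online group: preempt promptly
    set_group_identity("offline", -1)  # offline group: yields to online tasks
```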

SMT Expeller technology

In some online business scenarios, QPS drops significantly with hyper-threading enabled compared to without, and RT rises accordingly. The root cause lies in the physical nature of hyper-threading: it presents two logical cores on one physical core, each with its own registers (EAX, EBX, ECX, MSR, etc.) and APIC, but sharing the physical core's execution resources, including the execution engines, L1/L2 caches, TLB, system bus, and more. This means that if one logical core of an HT pair runs an online task while its sibling runs an offline task at the same time, the two will compete, and that is the problem we need to solve.

To minimize the impact of this competition, we want to ensure that when an online task runs on a core, offline tasks are no longer allowed to run on its sibling hyper-thread; and if an offline task is running on a core when an online task is scheduled onto the sibling hyper-thread, the offline task is expelled. It sounds like a hard life for offline tasks, but this is how we keep HT resources from being stolen from online tasks.

The SMT Expeller feature builds hyper-threading (HT) isolation scheduling on top of the Group Identity framework, ensuring that high-priority services are not interfered with by low-priority tasks running on the sibling hyper-thread.

Processor hardware resource management technology

Our kernel supports Intel® Resource Director Technology (Intel® RDT), a hardware resource management technology provided by the processor. It includes Cache Monitoring Technology (CMT) and Cache Allocation Technology (CAT) for cache resources, and Memory Bandwidth Monitoring (MBM) and Memory Bandwidth Allocation (MBA) for memory bandwidth.

CAT turns the Last Level Cache (LLC) into a resource that supports Quality of Service (QoS). In a colocated environment, without LLC isolation, offline applications constantly reading and writing data occupy a large share of the LLC; the LLC used by online applications is then constantly polluted, slowing their data accesses, and hardware interrupt latency can even rise, degrading performance.

MBA is used for memory bandwidth allocation. For memory-bandwidth-sensitive services, memory bandwidth affects performance and latency even more than LLC control. In a colocated environment, offline tasks are usually resource hogs; in particular, some AI jobs consume large amounts of memory bandwidth. Once memory bandwidth becomes the bottleneck, the performance of online services can drop sharply, their latency can soar, and the CPU water level rises as well.
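The kernel exposes RDT through the resctrl filesystem. The sketch below shows how LLC ways and memory bandwidth could be carved up between online and offline tasks; the group names, PIDs, cache masks, and bandwidth percentage are illustrative only, and the article's production setup may drive RDT through its own cgroup-level interface instead.

```python
# Sketch: partitioning LLC (CAT) and memory bandwidth (MBA) via resctrl.
# Requires RDT-capable hardware and "mount -t resctrl resctrl /sys/fs/resctrl".
# Group names, PIDs, masks, and percentages are examples only.
from pathlib import Path

RESCTRL = Path("/sys/fs/resctrl")

def make_rdt_group(name: str, schemata_lines: list[str], pids: list[int]) -> None:
    grp = RESCTRL / name
    grp.mkdir(exist_ok=True)
    for line in schemata_lines:
        (grp / "schemata").write_text(line + "\n")   # CAT / MBA allocation
    for pid in pids:
        (grp / "tasks").write_text(str(pid))         # move tasks into the group

if __name__ == "__main__":
    # Online: 8 of 12 LLC ways and full memory bandwidth on socket 0.
    make_rdt_group("online",  ["L3:0=0ff0", "MB:0=100"], pids=[1234])
    # Offline: the remaining 4 ways and at most 20% memory bandwidth.
    make_rdt_group("offline", ["L3:0=000f", "MB:0=20"],  pids=[5678])
```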

Memcg background reclaim

In the upstream kernel, when a container's memory usage reaches its limit and a process in it allocates more memory, memory is reclaimed directly in the context of the allocating process. This inevitably hurts the process's execution efficiency and causes performance problems. Could the container's memory instead be reclaimed asynchronously once usage crosses a certain watermark, before the limit is hit? That way, processes in the container would most likely never fall into direct memory reclaim when usage approaches the limit.

The kernel has a background kernel thread, kswapd, that reclaims memory asynchronously when system-wide memory usage reaches a certain level. But consider the case where a business container's memory usage is already tight while the host as a whole still has plenty of free memory: kswapd is never woken up, so the memory under pressure inside that container never gets a chance to be reclaimed. This is a real contradiction. The upstream kernel currently has no per-memory-cgroup asynchronous reclaim mechanism, which means a container's memory reclaim depends either on host-level kswapd or on its own synchronous (direct) reclaim, and the latter can seriously hurt the performance of high-priority containers.

For this reason, the Alibaba Cloud operating system team provides a memcg-based asynchronous reclaim strategy, analogous to host-level kswapd, which starts container-level memory reclaim in advance according to user-configured thresholds, releasing memory pressure before direct reclaim kicks in.
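As far as we know from ANCK / Alibaba Cloud Linux documentation, the threshold is exposed per memcg via memory.wmark_ratio (with memory.wmark_scale_factor controlling the reclaim band); treat these knob names and semantics, along with the container name below, as assumptions. A minimal sketch:

```python
# Sketch: enabling memcg-level background (async) reclaim for a container.
# memory.wmark_ratio is an ANCK-specific knob; name and semantics are
# assumptions based on Alibaba Cloud Linux docs -- verify on your kernel.
from pathlib import Path

MEM_CGROUP_ROOT = Path("/sys/fs/cgroup/memory")   # cgroup v1 hierarchy

def enable_background_reclaim(cgroup: str, wmark_ratio: int) -> None:
    """Start async reclaim once usage exceeds wmark_ratio% of the memcg limit."""
    (MEM_CGROUP_ROOT / cgroup / "memory.wmark_ratio").write_text(str(wmark_ratio))

if __name__ == "__main__":
    # Kick off background reclaim at 80% of the container's memory limit,
    # so allocations rarely fall into synchronous direct reclaim at 100%.
    enable_background_reclaim("mycontainer", 80)
```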

The asynchronous reclaim process is illustrated by the following diagram:

Memcg global min watermark tiering

Resource-hungry offline tasks often allocate a large amount of memory in an instant, pushing the system's free memory down to the global min watermark and forcing every task in the system into the slow path of direct memory reclaim. Latency-sensitive online services are then prone to performance jitter. In this scenario, neither global kswapd nor memcg-level background reclaim can help.

Since memory-hungry offline tasks are usually not latency-sensitive, we designed the memcg global min watermark tiering feature to solve this jitter problem. On top of the standard upstream globally shared min watermark, the effective global min watermark for offline tasks is moved up so that they enter direct memory reclaim earlier, while the watermark for latency-sensitive online tasks is moved down, isolating offline tasks from online tasks to a certain extent. When an offline task suddenly allocates a large amount of memory, it is throttled at its raised min watermark, sparing online tasks from direct reclaim; once global kswapd has reclaimed a certain amount of memory, the short-term throttling of the offline task is lifted.

The core idea is to give online and offline containers different global min watermarks so that their memory allocations are throttled separately: offline containers enter direct memory reclaim earlier than online services when allocating memory, which solves the problem caused by offline containers suddenly requesting large amounts of memory.
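As far as we know, ANCK exposes this tiering per memcg via memory.wmark_min_adj, where a positive value raises the group's effective global min watermark and a negative value lowers it; the knob name, the value range, and the cgroup names below are assumptions to be verified against your kernel. A minimal sketch:

```python
# Sketch: tiering the global min watermark per memcg (ANCK-specific).
# memory.wmark_min_adj and its value range are assumptions based on
# Alibaba Cloud Linux documentation.
from pathlib import Path

MEM_CGROUP_ROOT = Path("/sys/fs/cgroup/memory")

def set_wmark_min_adj(cgroup: str, adj: int) -> None:
    (MEM_CGROUP_ROOT / cgroup / "memory.wmark_min_adj").write_text(str(adj))

if __name__ == "__main__":
    set_wmark_min_adj("offline", 50)   # offline hits reclaim/throttling earlier
    set_wmark_min_adj("online", -25)   # online keeps allocating below the shared min
```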

Readers familiar with Linux memory management can also refer to the following figure, which shows in detail how the various watermarks move during online/offline colocation:

Memcg OOM priority

In real service scenarios, especially in memory-overcommitted environments, it is reasonable that when a global OOM occurs, lower-priority offline services should be killed and higher-priority online services protected; likewise, when an OOM occurs inside an offline memcg, lower-priority jobs should be killed and higher-priority offline jobs kept. This is a fairly common requirement in cloud-native scenarios, but the standard Linux kernel does not provide it. The kernel does have a victim-selection algorithm, but it usually just kills the process with the highest OOM score, which may well be a high-priority online business process, and that is not what we want to see.

For these reasons, the Alibaba Cloud operating system team provides a memcg OOM priority feature. With it, when the system runs out of memory and triggers an OOM, a low-priority service process is selected to be killed, avoiding the possibility of killing a high-priority one and greatly reducing the customer impact caused by online service processes exiting.
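As far as we know from Alibaba Cloud Linux documentation, the feature is configured per memcg through memory.priority (higher means better protected) together with memory.use_priority_oom to enable priority-based victim selection; treat the knob names, the value range, and the cgroup names below as assumptions. A minimal sketch:

```python
# Sketch: protecting online memcgs from the OOM killer (ANCK-specific).
# memory.priority and memory.use_priority_oom are assumptions; verify on
# your kernel before use.
from pathlib import Path

MEM_CGROUP_ROOT = Path("/sys/fs/cgroup/memory")

def set_oom_priority(cgroup: str, priority: int, use_priority_oom: bool = True) -> None:
    cg = MEM_CGROUP_ROOT / cgroup
    (cg / "memory.priority").write_text(str(priority))
    (cg / "memory.use_priority_oom").write_text("1" if use_priority_oom else "0")

if __name__ == "__main__":
    set_oom_priority("online", 12)   # highest priority: killed last
    set_oom_priority("offline", 0)   # lowest priority: victims are chosen here first
```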

Cgroup v1 writeback throttling

Ever since the block I/O cgroup was merged into the kernel, there has been a limitation: only direct I/O and synchronously submitted I/O (for example via fsync) could be throttled. For such I/O, when it reaches the blk-throttle layer, the current process is the one that issued it, so the correct cgroup can be found and accounted, and the I/O is throttled if the bandwidth or IOPS exceeds the user-configured limit. Buffered writes, however, are ultimately submitted by kworker threads, so the blk-throttle layer cannot determine from the current process which cgroup the I/O belongs to, and therefore cannot throttle it.

Cgroup v2 supports throttling of such asynchronous (buffered writeback) I/O, but cgroup v1, which is still the main choice in cloud-native environments, does not. The Alibaba Cloud operating system team therefore implemented asynchronous I/O throttling for cgroup v1 by establishing the page <-> memcg <-> blkcg association, using essentially the same throttling algorithm as cgroup v2.
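For orientation, the sketch below sets a standard cgroup v1 blkio write-bandwidth limit. Upstream, such a limit only affects direct/synchronous I/O; with the ANCK writeback-throttling feature described above, buffered writeback submitted by kworkers is, as we understand it, attributed back to the owning cgroup and throttled by the same limit. The cgroup name, device numbers, and rate are illustrative.

```python
# Sketch: setting a cgroup v1 blkio write-bandwidth limit.
# With ANCK's writeback throttling, buffered writeback is also expected to be
# accounted to this cgroup (assumption); upstream only direct I/O is limited.
from pathlib import Path

BLK_CGROUP_ROOT = Path("/sys/fs/cgroup/blkio")

def limit_write_bps(cgroup: str, dev: str, bps: int) -> None:
    """dev is 'major:minor' of the block device, e.g. '253:0' (example)."""
    path = BLK_CGROUP_ROOT / cgroup / "blkio.throttle.write_bps_device"
    path.write_text(f"{dev} {bps}")

if __name__ == "__main__":
    # Cap the offline container's writes to ~50 MB/s on device 253:0.
    limit_write_bps("offline", "253:0", 50 * 1024 * 1024)
```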

Blk-iocost Weight control

Normally, to prevent a single I/O-hungry job from starving the whole system, we set an upper limit on I/O bandwidth for each cgroup. The drawback of such a hard limit is that even when the device is idle, a cgroup that has reached its limit cannot issue more I/O, which wastes storage resources.

To address this, the iocost I/O controller was developed. It allocates disk resources according to each blkcg's weight, maximizing disk I/O utilization while still meeting the services' I/O QoS targets. Only when the disk's I/O capacity is saturated and the configured QoS target is being reached does the iocost controller constrain each group's I/O usage according to its weight. On top of this, blk-iocost has a degree of self-adaptation to avoid wasting disk capacity as far as possible.
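In the mainline kernel, iocost is exposed through the cgroup v2 files io.cost.qos and io.cost.model at the root and io.weight per cgroup; the article's environment is cgroup v1, where ANCK's interface names may differ, and the device numbers, latency targets, and cgroup names below are illustrative only. A minimal sketch using the cgroup v2 interface:

```python
# Sketch: weight-based iocost control via the mainline cgroup v2 interface.
# Device numbers, latency targets, and group names are examples.
from pathlib import Path

CGROUP2_ROOT = Path("/sys/fs/cgroup")

def enable_iocost(dev: str, rlat_us: int, wlat_us: int) -> None:
    # Enable iocost on the device with read/write latency QoS targets.
    (CGROUP2_ROOT / "io.cost.qos").write_text(
        f"{dev} enable=1 rlat={rlat_us} wlat={wlat_us}")

def set_io_weight(cgroup: str, weight: int) -> None:
    # Relative share (1..10000) used once the device is saturated.
    (CGROUP2_ROOT / cgroup / "io.weight").write_text(f"default {weight}")

if __name__ == "__main__":
    enable_iocost("253:0", rlat_us=5000, wlat_us=5000)
    set_io_weight("online", 800)    # online gets the lion's share under contention
    set_io_weight("offline", 100)
```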

Outlook and Expectation

All of the resource isolation capabilities above have been fully contributed to the OpenAnolis (Dragon Lizard) community; the relevant source code can be found in ANCK (Anolis Cloud Kernel). Interested readers can follow the community at openanolis.cn/

At the same time, the Alibaba Cloud Container Service team is working with the operating system team to bring these capabilities into the ACK Agile edition and the CNStack (CloudNative Stack) product family, and to continue rolling out ACK Anywhere to empower more enterprises. The commercial version follows cloud-native community standards, installs seamlessly into K8s clusters as a plug-in, and delivers the colocation capability to customers in on-premises form. The core OS-layer isolation capability has been released in Anolis OS, an open-source, neutral, and open Linux distribution that supports multiple architectures.


This article is the original content of Aliyun and shall not be reproduced without permission.