Han Yo-gang (Shin Shin)

Background

In cloud-native scenarios, applications are typically deployed as containers and allocated physical resources accordingly. Taking a Kubernetes cluster as an example, an application workload declares its resource Request/Limit in the Pod spec, and Kubernetes schedules resources and guarantees quality of service based on that declaration.

When the memory resources of a container or the host are insufficient, application performance may suffer, for example through excessive service latency or OOM kills. In general, the memory performance of applications inside a container is affected by two factors:

  1. Container memory Limit: when the memory used by the container (including page cache) approaches the container's upper limit, memcg-level memory reclamation in the kernel is triggered, affecting the performance of memory allocation and release for applications inside the container.
  2. Node (host) memory limit: when container Memory Limits are oversold (Limit > Request), node memory can be exhausted, triggering global memory reclamation in the kernel. This process has an even larger performance impact and, in extreme cases, can bring down the entire machine.

Two earlier articles, "Alibaba Cloud Container Service Differentiated SLO Colocation in Practice" and "How to Use CPU Management Policies Properly to Improve Container Performance", describe Alibaba Cloud's practical experience and optimization methods for cloud-native colocation and container CPU resource management. This article continues the series and explores the problems containers encounter when using memory resources, and the strategies for securing them.

Problems with container memory resources

Kubernetes memory resource management

Applications deployed in a Kubernetes cluster follow the standard Kubernetes Request/Limit model for resource usage. For the memory dimension, the scheduler makes placement decisions based on the Memory Request declared by the Pod, while on the node side Kubelet and the container runtime write the declared Memory Limit to the Linux kernel's cgroups interfaces, as shown below:

Control groups (cgroups) are the Linux mechanism for managing the resource usage of containers; the system can use cgroups to limit the CPU and memory consumed by the processes in a container. Kubelet sets the Request/Limit of each container to the cgroups interfaces to enforce constraints on the resources available to Pods and containers on the node side, roughly as follows:

For the Limit level, Kubelet sets the cgroups interface memory.limit_in_bytes to the Pod/Container Memory Limit, and maps the CPU Limit to CPU time-slice or core-binding constraints. For the Request level, Kubelet sets the cgroups interface cpu.shares according to the CPU Request, as the relative weight of CPU resources between containers; when the node's CPU resources are tight, CPU time is shared between containers in proportion to their Requests to preserve fairness. Memory Request, however, is not written to any cgroups interface by default; it is mainly used for scheduling and eviction decisions.
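As a rough illustration of this mapping (a sketch, not Kubelet's actual code), the following Python snippet computes the cgroups v1 values described above from a container's declared resources; the file paths in the comments are placeholders for the container's real cgroup directory, and the millicore-to-shares conversion follows the commonly documented 1000m = 1024 shares rule.

```python
# A minimal sketch of how declared resources map to cgroups v1 values on the node.

def memory_limit_in_bytes(limit_bytes: int) -> int:
    # memory.limit_in_bytes is simply the declared Memory Limit.
    return limit_bytes

def cpu_shares(cpu_request_millicores: int) -> int:
    # Relative CPU weight: 1000m maps to 1024 shares (with a minimum of 2).
    return max(2, cpu_request_millicores * 1024 // 1000)

if __name__ == "__main__":
    # Example: a container declaring a 2-core CPU Request and a 4 GiB Memory Limit.
    print(cpu_shares(2000))                    # 2048
    print(memory_limit_in_bytes(4 * 1024**3))  # 4294967296
    # Kubelet would write these values to files such as (placeholder paths):
    #   /sys/fs/cgroup/cpu/kubepods/.../cpu.shares
    #   /sys/fs/cgroup/memory/kubepods/.../memory.limit_in_bytes
```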

Kubernetes 1.22+ supports mapping the Memory Request to resources based on cgroups v2 (this requires kernel version 4.15+, is not compatible with cgroups v1, and once enabled affects all containers on the node).

For example, if Pod A's CPU Request is two cores and Pod B's is four cores, then when the node's CPU resources are tight, the CPU usage ratio between Pod A and Pod B is 1:2. When the node's memory is tight, however, the available memory is not divided between containers in proportion to their Requests the way CPU is, because Memory Request is not mapped to a cgroups interface, so fairness of the resource is not guaranteed.

Memory resource usage in cloud native scenarios

In cloud-native scenarios, the memory Limit a container sets affects the memory resource quality of both the container itself and the entire host. Since the Linux kernel's principle is to use memory as much as possible rather than reclaim it continuously, memory usage tends to keep rising as processes inside the container allocate memory. When the container's memory usage approaches the Limit, container-level synchronous (direct) memory reclamation is triggered, adding extra latency; if memory is requested faster than it can be reclaimed, the container is OOM (Out of Memory) killed and the application inside it may be interrupted or restarted.

Memory usage between containers is also constrained by the memory capacity of the host. If the overall memory usage of the machine is too high, global memory reclamation is triggered; in serious cases memory allocation in all containers is slowed down, degrading the memory resource quality of the entire node.

In a Kubernetes cluster, there may also be a need to guarantee priority between Pods. For example, high-priority Pods need better resource stability, and when the whole machine is short of resources the impact on them should be avoided as much as possible. In practice, however, low-priority Pods often run resource-consuming tasks, which makes them more likely to cause wide-ranging memory resource pressure, interfering with the resource quality of high-priority Pods: they are the real "troublemakers". Kubernetes currently handles this mainly through Kubelet eviction of low-priority Pods, but the response may only come after global memory reclamation has already occurred.

Securing container memory resources with container memory quality of service

Container memory quality of service

Linux CGroups V2 provides memCG QoS capabilities to further guarantee the quality of container memory resources, including:

• Set the container's Memory Request to the cgroups v2 interface memory.min, so that the Request portion of memory is locked and cannot be taken by global memory reclamation.
• Set the cgroups v2 interface memory.high based on the container's Memory Limit, so that when a Pod's memory usage exceeds its Request, allocation is throttled first rather than going straight to OOM, as sketched below.
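A minimal sketch of what writing these two cgroups v2 knobs could look like, assuming a hypothetical cgroup path and an illustrative throttling factor; in practice Kubelet (with the upstream Memory QoS feature) derives and writes these values itself.

```python
import os

def set_memcg_qos_v2(cgroup_dir: str, request_bytes: int, limit_bytes: int,
                     throttling_factor: float = 0.8) -> None:
    """Write memory.min / memory.high for one container cgroup (cgroups v2).

    throttling_factor is an illustrative ratio: memory.high is placed between
    Request and Limit so throttling starts before the hard Limit is hit.
    """
    # Lock the Request portion against global memory reclamation.
    with open(os.path.join(cgroup_dir, "memory.min"), "w") as f:
        f.write(str(request_bytes))
    # Throttle allocations before usage reaches the hard Limit.
    if limit_bytes > request_bytes:
        high = max(int(limit_bytes * throttling_factor), request_bytes)
        with open(os.path.join(cgroup_dir, "memory.high"), "w") as f:
            f.write(str(high))

# Example (requires a writable cgroups v2 hierarchy and root privileges):
# set_memcg_qos_v2("/sys/fs/cgroup/kubepods.slice/pod-example.slice",
#                  request_bytes=1 * 1024**3, limit_bytes=2 * 1024**3)
```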

The upstream scheme can effectively solve the fairness problem of memory resources between Pods, but from the perspective of users of the resources it still has some shortcomings:

• When a Pod declares memory Request = Limit, the container can still run into memory pressure internally, triggering memcg-level direct memory reclamation that may affect the response time (RT) of the application's services.
• The scheme does not currently consider compatibility with cgroups v1, so the fairness problem of memory resources on cgroups v1 remains unresolved.

Alibaba Cloud Container Service ACK builds on the memory subsystem enhancements of Alibaba Cloud Linux 2, letting users enjoy a more complete set of container Memory QoS capabilities on cgroups v1 ahead of the upstream, as shown below:

  1. Ensure fairness of memory reclamation between Pods: when the whole machine runs short of memory, memory is reclaimed first from Pods that overuse memory (Usage > Request), restraining the troublemakers and preventing the resource quality of the whole machine from degrading.
  2. When a Pod's memory usage approaches its Limit, part of the memory is reclaimed asynchronously in the background to mitigate the impact of direct memory reclamation.
  3. When node memory resources are tight, the memory quality of Guaranteed and Burstable Pods is protected first.

Typical scenarios

Memory oversold

In cloud-native scenarios, an application administrator may set a Memory Limit larger than the Request for a container to increase scheduling flexibility, reduce OOM risk, and improve memory utilization; for a cluster with low memory usage, a resource administrator may also use this approach to raise utilization, reduce cost, and increase efficiency. However, this can cause the sum of the Memory Limits of all containers on a node to exceed its physical capacity, putting the whole node into a memory-oversold (overcommitted) state. When a node is oversold, even if every container's usage is well below its Limit, total memory usage may still hit the global memory reclamation watermark. If one container then allocates a large amount of memory, the other containers on the node may be pushed into the slow path of direct memory reclamation, or even trigger a node-wide OOM, degrading application service quality across a wide range.
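To make the overcommit condition concrete, here is a small sketch that checks whether the sum of container Memory Limits on a node exceeds its physical capacity; the pod limits and node size below are illustrative numbers, not data from a real cluster.

```python
def is_memory_overcommitted(limits_bytes: list[int], node_capacity_bytes: int) -> bool:
    """True when the sum of declared Memory Limits exceeds physical memory."""
    return sum(limits_bytes) > node_capacity_bytes

# Illustrative numbers: three containers with 8 GiB / 16 GiB / 12 GiB Limits on a
# 32 GiB node declare 36 GiB of Limit in total, so the node is oversold even
# though actual usage may stay well below each individual Limit.
GiB = 1024**3
print(is_memory_overcommitted([8 * GiB, 16 * GiB, 12 * GiB], 32 * GiB))  # True
```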

Memory QoS mitigates the latency caused by direct reclamation by enabling container-level asynchronous background memory reclamation, which reclaims part of the memory before direct reclamation is needed. For Pods that declare Memory Request < Limit, Memory QoS also supports setting an active memory-throttling threshold, limiting memory usage near that threshold to avoid serious interference with other Pods on the node.

Mixed deployment

A Kubernetes cluster may have Pods with different resource usage characteristics deployed on the same node. For example, Pod A runs an online service workload with relatively stable memory utilization and is latency sensitive (LS), while Pod B runs a batch big-data job, a resource-consuming best-effort (BE) workload that allocates a large amount of memory right after starting. When the machine's overall memory is tight, both Pod A and Pod B are affected by global memory reclamation; in fact, even if Pod A's current usage does not exceed its Request, its quality of service is noticeably degraded. Pod B, on the other hand, may set a very large Limit or no Limit at all and use far more memory than its Request without being effectively constrained, a real "troublemaker" that damages the memory quality of the whole machine.

The Memory QoS function enables global minimum watermark tiering and kernel memcg QoS so that when system memory is insufficient, memory is reclaimed from the BE container first, reducing the impact of global memory reclamation on the LS container; overused memory can also be reclaimed preferentially to keep memory resources fair.

Technical internals

Linux memory reclamation mechanism

If the memory Limit specified for a container is too low, the processes in the container may run short of memory, resulting in extra latency or OOM; if the Limit is set too high, the processes can consume a large share of the machine's memory, interfering with other applications on the node and causing service latency jitter. The latency and OOM events caused by memory reclamation are closely related to the memory reclamation mechanism of the Linux kernel.

The memory pages used by processes in a container mainly include:
• Anonymous pages: from the heap, stack, and data segments; they must be reclaimed by swapping out.
• File pages: from code segments and file mappings; they are reclaimed by paging out, with dirty pages written back to disk first.
• Shared memory: from anonymous mmap and shmem shared memory; it must be reclaimed through swap.
The sketch after this list shows how this split can be observed for a running container.
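A minimal sketch, assuming a cgroups v1 memory controller and a placeholder cgroup path, that reads memory.stat to see how a container's memory splits into file cache, anonymous memory, and shared memory:

```python
def read_memcg_stat(cgroup_dir: str) -> dict[str, int]:
    """Parse the key/value pairs in a cgroups v1 memory.stat file."""
    stats = {}
    with open(f"{cgroup_dir}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

# Example (path is a placeholder for a real container cgroup):
# stats = read_memcg_stat("/sys/fs/cgroup/memory/kubepods/pod-example")
# print(stats.get("cache"), stats.get("rss"), stats.get("shmem"))
```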

Kubernetes does not support swap by default, so the reclaimable pages in a container mainly come from file pages, also known as page cache (the Cached portion in kernel interface statistics, which also includes a small amount of shared memory). Because memory access is much faster than disk access, the Linux kernel's principle is to use memory as much as possible, and memory reclamation (of page cache, for example) is triggered mainly when the memory watermark is high.

Specifically, when a container's memory usage (including page cache) approaches its Limit, direct memory reclamation is triggered at the memory cgroup (memcg) level, reclaiming clean file pages first. Because this happens in the context of the process that is requesting memory, it blocks and slows the application inside the container. If at that point memory is being requested faster than it can be reclaimed, the kernel's OOM Killer terminates some of the container's running processes, based on their memory usage, to free more memory.
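How close a container is to this point can be read directly from its cgroup files; a minimal sketch (cgroups v1 interface names, placeholder path) that reports the usage-to-Limit ratio:

```python
def memcg_usage_ratio(cgroup_dir: str) -> float:
    """Return usage / limit for a cgroups v1 memory cgroup."""
    with open(f"{cgroup_dir}/memory.usage_in_bytes") as f:
        usage = int(f.read())
    with open(f"{cgroup_dir}/memory.limit_in_bytes") as f:
        limit = int(f.read())
    # An unset Limit shows up as a very large number, so the ratio stays near zero.
    return usage / limit

# ratio = memcg_usage_ratio("/sys/fs/cgroup/memory/kubepods/pod-example")
# if ratio > 0.9:
#     print("container is close to its Limit; direct reclaim is likely")
```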

When the whole machine runs short of memory, the kernel reclaims memory based on free-memory watermarks. When free memory drops to the Low watermark, background reclamation is performed by the kswapd kernel thread; this does not block application processes and can also reclaim dirty pages. If free memory continues to fall to the Min watermark (Min < Low), global direct memory reclamation is triggered. Because this happens in the allocation context of the requesting process and more pages must be scanned, performance is severely affected and every container on the node can be disturbed. If the machine keeps allocating memory faster than it reclaims, a wider range of OOM events is triggered and resource availability deteriorates.
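The Min/Low/High watermarks that drive kswapd and global direct reclamation are visible per memory zone in /proc/zoneinfo; a minimal parsing sketch:

```python
def read_zone_watermarks() -> dict[str, dict[str, int]]:
    """Collect per-zone free/min/low/high page counts from /proc/zoneinfo."""
    zones: dict[str, dict[str, int]] = {}
    current = None
    with open("/proc/zoneinfo") as f:
        for line in f:
            tokens = line.split()
            if line.startswith("Node"):
                current = " ".join(tokens)          # e.g. "Node 0, zone Normal"
                zones[current] = {}
            elif current and len(tokens) == 2 and tokens[0] in ("min", "low", "high"):
                zones[current][tokens[0]] = int(tokens[1])   # watermarks, in pages
            elif current and len(tokens) == 3 and tokens[:2] == ["pages", "free"]:
                zones[current]["free"] = int(tokens[2])      # current free pages
    return zones

# for zone, marks in read_zone_watermarks().items():
#     print(zone, marks)
```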

Cgroups-v1 Memcg QoS

The Memory Request of a Pod in a Kubernetes cluster is not fully guaranteed at the kernel level, so when node memory becomes tight, the global memory reclamation that is triggered can destroy memory fairness between Pods: containers with Usage > Request may compete for memory with containers that stay within their Request.

For the memcg QoS of the container memory subsystem, the upstream Linux community provides on cgroups v1 only the ability to limit a container's maximum memory usage (the interface Kubelet sets to the container's Limit value); it lacks the ability to guarantee (lock) a portion of memory against reclamation. That capability is only supported on the cgroups v2 interface.

The Alibaba Cloud Linux 2 kernel enables the memcg QoS function on the cgroups v1 interface by default, and Alibaba Cloud Container Service ACK automatically applies an appropriate memcg QoS configuration to each Pod through its Memory QoS function. Without upgrading to cgroups v2, containers gain the Memory Request locking and throttling capabilities, as shown in the figure above:

• memory.min: set to the container's Memory Request. Based on this interface's memory-locking capability, the Pod can lock the Request portion of its memory against global reclamation, and when node memory is tight, memory is reclaimed only from containers that overuse memory.
• memory.high: set to a proportion of the Limit when the container's Memory Request < Limit or the Limit is unset. Based on this interface's throttling capability, a Pod that overuses its memory is throttled, so a BestEffort Pod cannot severely overuse system memory, reducing the risk of global memory reclamation and OOM. A sketch of how these two values might be chosen follows.
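A minimal sketch of the value selection just described; the 80% ratio is an illustrative assumption, not the exact proportion ACK configures.

```python
def memcg_qos_values(request_bytes: int, limit_bytes: int,
                     high_ratio: float = 0.8) -> dict[str, str]:
    """Pick memory.min / memory.high for one container (illustrative only).

    memory.min locks the Request portion against global reclamation; memory.high
    throttles allocations at a proportion of the Limit when Request < Limit.
    An unset Limit would be handled similarly against node capacity.
    """
    values = {"memory.min": str(request_bytes), "memory.high": "max"}
    if request_bytes < limit_bytes:
        values["memory.high"] = str(int(limit_bytes * high_ratio))
    return values

# Request 1 GiB, Limit 4 GiB -> memory.min = 1 GiB, memory.high ~ 3.2 GiB
print(memcg_qos_values(1 * 1024**3, 4 * 1024**3))
```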

For more information about Alibaba Cloud Linux 2 memCG QoS capabilities, please refer to the official documentation: help.aliyun.com/document_de…

Background asynchronous memory reclamation

As mentioned earlier, memory reclamation happens not only at the whole-machine level but also inside containers (at the memcg level). When memory usage inside a container approaches its Limit, direct memory reclamation is triggered in the process context, blocking the application inside the container.

To solve this problem, Alibaba Cloud Linux 2 adds container-level background asynchronous reclamation. Unlike the asynchronous kswapd kernel threads used for global memory reclamation, this feature does not create per-memcg kswapd threads; it is implemented with the workqueue mechanism instead, and supports both the cgroups v1 and cgroups v2 interfaces.

As shown in the figure above, Alibaba Cloud Container Service ACK automatically sets an appropriate background-reclamation watermark memory.wmark_high for each Pod through the Memory QoS function. When the container's memory usage reaches this threshold, the kernel automatically starts background reclamation, avoiding the latency of direct memory reclamation and improving the running quality of applications in the container.
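A minimal monitoring sketch, assuming the Alibaba Cloud Linux 2 cgroups v1 interface name mentioned above, a placeholder cgroup path, and byte-valued files; check the Alibaba Cloud Linux documentation for the exact semantics before relying on it.

```python
def over_background_watermark(cgroup_dir: str) -> bool:
    """True when current usage exceeds the memory.wmark_high threshold."""
    with open(f"{cgroup_dir}/memory.usage_in_bytes") as f:
        usage = int(f.read())
    with open(f"{cgroup_dir}/memory.wmark_high") as f:
        wmark_high = int(f.read())
    return usage > wmark_high

# if over_background_watermark("/sys/fs/cgroup/memory/kubepods/pod-example"):
#     print("kernel background (asynchronous) reclaim should be running")
```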

For more information about Alibaba Cloud Linux 2’s asynchronous memory collection capability, please refer to the official website: help.aliyun.com/document_de…

Global minimum watermark tiering

Global direct memory reclamation has a large impact on system performance, especially in memory-oversold scenarios where latency-sensitive (LS) services are colocated with resource-consuming (BE) tasks: the resource-consuming tasks often allocate large amounts of memory in an instant, driving the system's free memory down to the global minimum watermark (global wmark_min), so that every task on the machine enters the slow path of direct memory reclamation and the latency-sensitive services suffer performance jitter. In this situation, neither global kswapd nor memcg background reclamation can effectively avoid the problem.

For the scenario above, Alibaba Cloud Linux 2 adds a memcg global minimum watermark tiering feature, which allows the global minimum watermark (global wmark_min) to be adjusted per memcg via memory.wmark_min_adj. Alibaba Cloud Container Service ACK sets tiered watermarks for containers through the Memory QoS function: based on the machine's global wmark_min, the effective wmark_min of BE containers is moved up so that they enter direct memory reclamation earlier, while the effective wmark_min of LS containers is moved down so that they avoid direct memory reclamation as much as possible, as shown below:

In this way, when a BE task suddenly allocates a large amount of memory, the raised wmark_min briefly suppresses it, avoiding direct memory reclamation for the LS containers; once global kswapd has reclaimed enough memory, the short-term suppression of the BE task is lifted.
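A minimal sketch of the tiering itself, assuming the memory.wmark_min_adj interface name from the text and placeholder cgroup paths; the concrete percentages below are illustrative assumptions, not ACK's defaults.

```python
def set_wmark_min_adj(cgroup_dir: str, adj_percent: int) -> None:
    """Write memory.wmark_min_adj for one container cgroup.

    Positive values raise this memcg's effective global min watermark, so a BE
    container enters direct reclamation earlier; negative values lower it, so an
    LS container avoids direct reclamation for longer.
    """
    with open(f"{cgroup_dir}/memory.wmark_min_adj", "w") as f:
        f.write(str(adj_percent))

# Illustrative tiering (values and paths are assumptions):
# set_wmark_min_adj("/sys/fs/cgroup/memory/kubepods/besteffort/pod-be", 50)
# set_wmark_min_adj("/sys/fs/cgroup/memory/kubepods/pod-ls", -25)
```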

For more information about Alibaba Cloud Linux 2's memcg global minimum watermark tiering capability, please refer to the official document: help.aliyun.com/document_de…

Summary

To sum up, container Memory QoS guarantees the Memory resource quality of containers based on Alibaba Cloud Linux 2 kernel. The recommended application scenarios of each capability are as follows:

We use Redis Server as a latency-sensitive online application and verify the improvement that enabling Memory QoS brings to application latency and throughput by simulating memory oversold conditions and load-test requests:

Comparing the data above shows that after container memory quality of service is enabled, the average latency and average throughput of the Redis application improve to some extent.

Conclusion

To address the problems of container memory usage in cloud-native scenarios, Alibaba Cloud Container Service ACK provides a container Memory QoS function based on the Alibaba Cloud Linux 2 kernel. By configuring each container's memory reclamation and throttling mechanisms, it guarantees fairness of memory resources and improves the memory performance of applications at run time. Memory QoS is a relatively static resource quality scheme, suitable for guaranteeing the memory usage of a Kubernetes cluster; for complex resource-oversold and colocation scenarios, a more dynamic and fine-grained memory guarantee strategy is still needed. For example, an eviction strategy based on real-time resource pressure metrics can shed load flexibly in user space when memory watermarks fluctuate frequently; on the other hand, more efficient memory overselling can be achieved through finer-grained mining of memory resources, such as reclamation based on hot and cold page tagging, or runtime (e.g. JVM) GC. Please look forward to subsequent releases of Alibaba Cloud Container Service ACK with support for differentiated SLO features.
