Zhang Zuowei (Youyi)

Preface

In the cloud native era, application workloads are deployed on hosts as containers and share the hosts' physical resources. As host hardware becomes more powerful, the container deployment density on a single node keeps increasing, which aggravates problems such as CPU contention between processes and cross-NUMA memory access, and degrades application performance. How well a container service allocates and manages host CPU resources to guarantee service quality is a key measure of its technical capability.

Node-side container CPU resource management

Kubelet’s CPU allocation policy

Kubernetes describes container resource management with two semantics: request and limit. When a container specifies a request, the scheduler uses it to decide which node the Pod should be assigned to. When a container specifies a limit, Kubelet ensures that the container does not use more than that limit at run time.

CPU is a typical time-shared resource. The kernel scheduler divides CPU time into slices and allocates running time to each process in turn. Kubelet’s default CPU management policy caps a container's CPU usage through the Linux kernel’s CFS Bandwidth Controller. On multi-core nodes, processes are often migrated between cores while they run. Because the performance of some applications is sensitive to CPU context switching, Kubelet also provides a static policy that allows Guaranteed Pods to use CPU cores exclusively.

Kernel CPU resource scheduling

The CFS Bandwidth Controller works with two parameters, cfs_period and cfs_quota. cfs_period is a fixed value of 100 ms, and cfs_quota corresponds to the container's CPU Limit. For example, for a container whose CPU Limit is 2, cfs_quota is set to 200 ms, which means the container can use at most 200 ms of CPU time in every 100 ms period, that is, two CPU cores. When CPU usage exceeds this limit, the kernel scheduler throttles the processes in the container. Attentive application administrators often observe this in the CPU Throttle Rate metric of cluster Pod monitoring.
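The relationship between CPU Limit and cfs_quota can be written down in a few lines. The following is a minimal sketch of the arithmetic only; the function name and the microsecond unit are assumptions chosen for illustration, not part of Kubelet's code.

```python
# Assumed unit: microseconds. 100 ms period, as described in the text.
CFS_PERIOD_US = 100_000


def cfs_quota_us(cpu_limit_cores: float, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (in microseconds) a container may use per period."""
    return int(round(cpu_limit_cores * period_us))


if __name__ == "__main__":
    # CPU Limit = 2 gives 200 ms of CPU time per 100 ms period.
    print(cfs_quota_us(2))
    # CPU Limit = 0.5 gives 50 ms of CPU time per 100 ms period.
    print(cfs_quota_us(0.5))
```

A quota larger than the period simply means that the container's threads may run on more than one core at the same time within a period.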

Status of container CPU performance issues

Application administrators often wonder why application performance degrades frequently even though container resource utilization is not high. From the perspective of CPU resources, the problems usually come from two sources. The first is CPU throttling: the kernel limits the container's resource consumption according to its CPU Limit. The second is CPU topology: some applications are sensitive to context switching between CPUs, especially when cross-NUMA access occurs.

CPU Throttle problem description

Because of cfs_period, a container's CPU utilization is often deceptive. The following figure shows the CPU usage of a container over a period of time (unit: 0.01 core). At 1 s granularity (the purple line in the figure), the CPU usage is relatively stable, averaging about 2.5 cores. As a rule of thumb, an administrator would set the CPU Limit to 4 cores, which seems to leave plenty of headroom. However, at 100 ms granularity (the green line), the CPU usage shows severe spikes, peaking at more than 4 cores. The container is therefore throttled frequently, which degrades application performance and causes RT jitter, yet none of this is visible in common CPU utilization metrics!
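A small numeric sketch makes the effect of sampling granularity concrete. The ten values below are made up for illustration (they are not the data behind the figure): they represent the CPU demand of a container, in cores, in each 100 ms window of one second.

```python
# Hypothetical CPU demand (in cores) for ten consecutive 100 ms windows.
samples_100ms = [1.0, 5.0, 1.5, 4.5, 1.0, 5.0, 1.5, 1.0, 3.5, 1.0]
cpu_limit = 4.0

# What a 1 s monitoring interval reports: the average over the ten windows.
average_1s = sum(samples_100ms) / len(samples_100ms)

# What the kernel sees: windows in which demand exceeds the limit.
over_limit = [s for s in samples_100ms if s > cpu_limit]

if __name__ == "__main__":
    print(f"1 s average: {average_1s} cores")
    print(f"100 ms windows above the limit: {len(over_limit)} of {len(samples_100ms)}")
```

With these values the 1 s average is 2.5 cores, well below the limit of 4, while three of the ten 100 ms windows ask for more than the limit and would be throttled.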

Spikes are usually caused by sudden CPU demand, such as hot spots in code logic or traffic surges. The following example shows how CPU throttling degrades application performance. The figure shows how CPU time is allocated to each thread after requests (req) arrive in a web service container with CPU Limit = 2. Assume that each request takes 60 ms to process. Even though the container's overall CPU utilization has been low recently, four requests are processed back to back between 100 ms and 200 ms, which exhausts the time slice budget (200 ms) of that kernel scheduling period. Thread 2 must wait for the next period to finish processing req 2, so the response time (RT) of that request becomes longer. This happens more often as application load increases, making the long tail of RT more severe.

To avoid CPU throttling, the only option is to raise the container’s CPU Limit. However, eliminating throttling completely usually requires raising the CPU Limit by a factor of two or three, and sometimes five or ten, before the problem is noticeably alleviated. To reduce the risk of overcommitting CPU Limit, the container deployment density must also be lowered, which raises overall resource costs.
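The request example can be approximated with a very small model of quota enforcement. This is a sketch under simplifying assumptions: it tracks only the total CPU time demanded per period, ignores how work is spread across threads and cores, and the function name `simulate_cfs` is hypothetical.

```python
def simulate_cfs(demand_ms, quota_ms):
    """Simulate quota enforcement period by period.

    demand_ms: new CPU time (ms) requested in each period.
    quota_ms:  CPU time (ms) the container may consume per period.
    Returns (ran, throttled): CPU time actually used per period, and whether
    work was left over (that is, the container was throttled) in that period.
    """
    backlog = 0  # work that could not run in earlier periods
    ran, throttled = [], []
    for demand in demand_ms:
        want = backlog + demand
        run = min(want, quota_ms)
        backlog = want - run
        ran.append(run)
        throttled.append(backlog > 0)
    return ran, throttled


if __name__ == "__main__":
    # CPU Limit = 2, so 200 ms of quota per 100 ms period.
    # Period 0: one 60 ms request. Period 1: four 60 ms requests (240 ms).
    ran, throttled = simulate_cfs([60, 240, 0], quota_ms=200)
    print(ran)        # CPU time used in each period
    print(throttled)  # whether each period ended with unfinished work
```

In the second period the four requests need 240 ms but only 200 ms is available, so 40 ms of work is pushed into the next period, which is the delay that shows up as a longer RT.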

Impact of CPU topology

In the NUMA architecture, the CPUs and memory of a node are divided into two or more parts (for example, Socket0 and Socket1 in the figure), and a CPU accesses different parts of memory at different speeds. When a CPU accesses memory attached to the other socket, latency is relatively high. Allocating physical resources to containers blindly can therefore degrade the performance of latency-sensitive applications, so we should avoid binding a container's CPUs across multiple sockets, which improves memory access locality. As shown in the following figure, when CPU and memory resources are allocated to the two containers, the allocation in scenario B is clearly more reasonable.
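The idea of keeping a container's CPUs on one socket can be sketched as a simple selection rule. This is an illustration only, not Kubelet's or ACK's allocation algorithm; the function `pick_cpus`, the best-fit heuristic, and the topology in the example are all assumptions.

```python
from typing import Dict, List


def pick_cpus(free_by_socket: Dict[int, List[int]], count: int) -> List[int]:
    """Pick `count` free CPUs, preferring a single socket.

    free_by_socket maps a socket id to the list of free CPU ids on it.
    """
    total_free = sum(len(cpus) for cpus in free_by_socket.values())
    if count > total_free:
        raise ValueError("not enough free CPUs on this node")

    # Prefer the smallest socket that still fits the whole request, so that
    # larger sockets stay available for larger containers (best fit).
    fitting = [cpus for cpus in free_by_socket.values() if len(cpus) >= count]
    if fitting:
        return sorted(min(fitting, key=len))[:count]

    # No single socket fits: fall back to spanning sockets.
    picked: List[int] = []
    for socket in sorted(free_by_socket):
        for cpu in sorted(free_by_socket[socket]):
            if len(picked) < count:
                picked.append(cpu)
    return picked


if __name__ == "__main__":
    free = {0: [0, 1, 2, 3], 1: [4, 5]}  # hypothetical two-socket node
    print(pick_cpus(free, 2))  # fits on socket 1
    print(pick_cpus(free, 3))  # fits only on socket 0
    print(pick_cpus(free, 5))  # must span both sockets
```

Only the last request crosses sockets, because no single socket has five free CPUs.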

Kubelet provides the static policy for CPU management and the single-numa-node policy for topology management. Together they bind containers to CPUs and improve the affinity between the application workload and the CPU cache and NUMA node. But does this necessarily solve all CPU-related performance problems? Consider the following example.

Under Kubelet's static policy, the container's threads can run only on the cores bound to it. In Default mode, the container has more CPU flexibility, and each thread can start processing a request as soon as it arrives. Core binding is therefore not a “silver bullet”, and Default mode has its own application scenarios.

In fact, CPU binding resolves the performance problems caused by context switching between cores, especially between NUMA nodes, but it also reduces resource elasticity. With binding, threads queue on the bound CPUs, and although CPU throttling may decrease, the application's own performance problems are not completely resolved.

Use the CPU Burst mechanism to improve container performance

In the previous article, we introduced Alibaba Cloud’s CPU Burst kernel feature, which effectively solves the CPU throttling problem. When a container’s actual CPU usage is less than cfs_quota, the kernel “saves” the unused CPU time into cfs_burst. When the container has a sudden demand for CPU that exceeds cfs_quota, the kernel’s CFS Bandwidth Controller (BWC) allows it to consume the time saved in cfs_burst.
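The save-and-spend behavior described above can be sketched with a simplified model. The code is an illustration under stated assumptions, not the kernel implementation: it works on total CPU time per period, the saved time is capped by a hypothetical `burst_ms` parameter, and the function name `simulate_burst` is made up.

```python
def simulate_burst(demand_ms, quota_ms, burst_ms):
    """Return, per period, whether the container was throttled.

    demand_ms: new CPU time (ms) requested in each period.
    quota_ms:  regular CPU time (ms) available per period (cfs_quota).
    burst_ms:  maximum unused time (ms) that can be saved (cfs_burst);
               0 means bursting is disabled.
    """
    saved = 0    # unused CPU time carried over from earlier periods
    backlog = 0  # work that could not run in earlier periods
    throttled = []
    for demand in demand_ms:
        want = backlog + demand
        budget = quota_ms + saved
        run = min(want, budget)
        backlog = want - run
        # Whatever was not used is saved, up to the burst cap.
        saved = min(burst_ms, budget - run)
        throttled.append(backlog > 0)
    return throttled


if __name__ == "__main__":
    demand = [60, 240, 0]  # a quiet period followed by a 240 ms spike
    print(simulate_burst(demand, quota_ms=200, burst_ms=0))    # bursting off
    print(simulate_burst(demand, quota_ms=200, burst_ms=100))  # bursting on
```

With bursting disabled the spike is throttled in the second period. With a 100 ms burst allowance, the 140 ms left unused in the first period (capped at 100 ms) covers the 40 ms overshoot, so no period is throttled.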

The CPU Burst mechanism effectively solves the long-tail RT problem of latency-sensitive applications and improves container performance. Alibaba Cloud Container Service ACK now fully supports the CPU Burst mechanism. For kernel versions that do not support CPU Burst, ACK monitors the container’s CPU Throttle status and dynamically adjusts the container’s CPU Limit to achieve the same effect as the kernel’s CPU Burst policy.
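How ACK monitors throttle status is not described here. As a rough illustration of where such a signal can come from, the sketch below parses the text of a cgroup `cpu.stat` file, which exposes the `nr_periods` and `nr_throttled` counters, and computes the share of periods in which the container was throttled. The function name and the sample values are assumptions.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Compute nr_throttled / nr_periods from the contents of cpu.stat."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    # Made-up sample in the cgroup v1 layout of cpu.stat.
    sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 123456789\n"
    print(throttle_ratio(sample))
```

In practice the two counters are cumulative, so a monitor would compare two readings taken some time apart rather than use a single snapshot.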

We used Apache HTTP Server as a latency-sensitive online application and simulated request traffic to evaluate the effect of CPU Burst on response time (RT). The following data compares performance before and after the CPU Burst policy is enabled:

Comparing the data above shows that:

  • After CPU Burst is enabled, the P99 of the application's RT improves significantly.
  • Comparing the CPU Throttled and utilization metrics shows that CPU throttling is eliminated after CPU Burst is enabled, while the overall utilization of the Pod remains basically unchanged.

Use topology-aware scheduling to improve container performance

Kubelet’s static CPU management policy and single-numa-node topology management policy can partially solve the problem of application performance being affected by CPU cache and NUMA affinity. However, they have the following disadvantages:

  • The static policy supports only Pods whose QoS class is Guaranteed. Pods of other QoS classes cannot use it.
  • The policy applies to all Pods on the node, and we know from the previous analysis that CPU core binding is not a “silver bullet”.
  • The central scheduler is not aware of the actual CPU allocation on each node and cannot select the optimal combination within the cluster.

Alibaba Cloud Container Service ACK implements topology-aware scheduling and flexible core binding strategies based on the scheduling framework, which provides better performance for CPU-sensitive workloads. ACK topology-aware scheduling works with all QoS classes, can be enabled on demand for individual Pods, and can select the optimal combination of node and CPU topology across the whole cluster.

In our evaluation with Nginx services, CPU topology-aware scheduling improved application performance by 22% to 43% on Intel (104 cores) and AMD (256 cores) physical machines.

Conclusion

CPU Burst and topology-aware scheduling are two powerful tools that Alibaba Cloud Container Service ACK provides to improve application performance. They address CPU resource management in different scenarios and can be used together.

CPU Burst solves the throttling problem that CPU Limit causes under the kernel's BWC scheduling, and it effectively improves the performance of latency-sensitive tasks. However, CPU Burst does not create resources out of nothing. If the container's CPU utilization is already high (for example, more than 50%), the optimization effect of CPU Burst is limited. In that case, the application should be scaled by means of HPA or VPA.

Topology-aware scheduling reduces the overhead of CPU context switching in workloads, especially in the NUMA architecture, and improves the quality of service for CPU-intensive and memory-access-intensive applications. However, as mentioned earlier, CPU binding is not a “silver bullet”, and the actual effect depends on the application type. In addition, if topology-aware scheduling is enabled for a large number of Burstable Pods on the same node at the same time, their CPU bindings may overlap, which can worsen interference between applications in some scenarios. Topology-aware scheduling is therefore better enabled selectively, for the workloads that need it.
