
Takeaway

Quality of Service (QoS) should be familiar to most readers. QoS originally refers to service quality assurance: the more important the service commitment, the stricter the requirements behind it.

In Kubernetes, QoS is defined per pod. It describes a pod's priority during scheduling and eviction and has a direct impact on the pod's lifecycle. There are three QoS classes: Guaranteed, Burstable, and BestEffort. Read on to find out what each of them means.

Introduction to QoS

QoS originates from network quality of service: providing different priorities for different users or data flows, or guaranteeing a data flow's performance at a certain level according to the application's requirements. QoS guarantees are especially important for networks with limited capacity and for streaming multimedia applications such as VoIP and IPTV, because these applications often require a fixed transmission rate and are sensitive to delay. [2]

QoS in Kubernetes

In the Kubernetes system, QoS (Quality of Service) determines a pod's scheduling and eviction priority, which provides a reliable and predictable service guarantee for workloads running on the cluster. The higher the QoS class, the stronger the guarantee.

In Kubernetes, QoS is not set directly by users; Kubernetes computes it from the resource requests and limits a user specifies. There are three QoS classes, from highest to lowest: Guaranteed, Burstable, and BestEffort. The following sections show how each class is assigned.
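You can also check which class Kubernetes computed for a running pod by reading its status; a quick example (the pod name mypod is just a placeholder):

kubectl get pod mypod -o jsonpath='{.status.qosClass}'
# prints Guaranteed, Burstable, or BestEffort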

Guaranteed

Resource Settings

A pod is assigned the Guaranteed QoS class when every container in it sets both requests and limits for CPU and memory, and the requests equal the limits. A quick way to achieve this is to set only the CPU and memory limits; Kubernetes then automatically sets the requests equal to the limits.

kind: Pod
spec:
  containers:
    - image: busybox
      resources:
        requests:
          cpu: 200m
          memory: 10Mi
        limits:
          cpu: 200m
          memory: 10Mi

Impact on scheduling

The Kubernetes scheduler will only schedule pods of the Guaranteed type to nodes that fully satisfy resource requests.

If the kubelet reports the DiskPressure node condition, Guaranteed pods will not be scheduled onto that node. DiskPressure is triggered when the available disk space or inodes of the node's root filesystem or image filesystem fall below the eviction threshold; once a node reports DiskPressure, the scheduler stops scheduling new Guaranteed pods onto it.
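For reference, the disk-related eviction signals live in the kubelet configuration. A minimal sketch of hard thresholds follows; the percentage values are only illustrative, not values taken from this article:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"    # free disk space of the root filesystem
  nodefs.inodesFree: "5%"    # free inodes of the root filesystem
  imagefs.available: "15%"   # free disk space of the image filesystem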

Exclusive CPU

Under Kubernetes' default CPU management policy, none, Guaranteed pods can only use the shared CPU pool on the node. The shared pool contains all of the node's CPU resources minus those reserved by the kubelet via --kube-reserved or --system-reserved.

Under the static CPU management policy, a Guaranteed pod can obtain exclusive CPUs. For that to happen, the pod's CPU request must be an integer number of cores; otherwise it still runs in the shared CPU pool.
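As a sketch, a Guaranteed pod like the following could be granted exclusive cores on a node whose kubelet runs with cpuManagerPolicy: static; the pod and container names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cpu-demo        # placeholder name
spec:
  containers:
    - name: app                   # placeholder name
      image: busybox
      resources:
        requests:
          cpu: "2"                # integer CPU count, required for exclusive cores
          memory: 200Mi
        limits:
          cpu: "2"                # must equal the request to stay Guaranteed
          memory: 200Mi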

Burstable

Resource Settings

If at least one container in a pod sets a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria (for example, only requests are set, or requests differ from limits), the pod's QoS class is Burstable.

kind: Pod
spec:
  containers:
    - image: busybox
      resources:
        requests:
          cpu: 200m
          memory: 10Mi

Impact on scheduling

The scheduler cannot ensure that Burstable Pods can be scheduled to nodes that fully meet their resource requirements.

A Burstable pod cannot be scheduled onto a node that has reported DiskPressure; the scheduler stops placing new Burstable pods on such nodes.

Exclusive CPU

Under the “None” CPU management policy, Burstable Pods must share resource pools with BestEffort and Guaranteed Pods and cannot be allocated exclusive CPU resources.

BestEffort

Resource Settings

If none of the containers in a pod set CPU or memory requests or limits, the pod's QoS class is BestEffort.

kind: Pod
spec:
  containers:
    - image: busybox

Impact on scheduling

The scheduler likewise does not guarantee that BestEffort pods land on nodes with sufficient resources. Once scheduled, however, they can use any amount of CPU and memory that is free on the node. When BestEffort pods are greedy and leave too little headroom for other pods, this can lead to resource contention.

BestEffort pods cannot be scheduled onto nodes in the DiskPressure or MemoryPressure condition. MemoryPressure is reported when a node's available memory drops below a predefined threshold, and the scheduler then stops placing new BestEffort pods on those nodes.

Exclusive CPU

Like Burstable Pods, BestEffort Pods can only use a shared resource pool, not exclusive CPU resources.

Eviction

Let's take a look at how QoS affects the kubelet's eviction of pods when node resources run short, for example when memory runs out.

How does the kubelet evict Guaranteed, Burstable, and BestEffort pods?

The kubelet triggers pod eviction when the node runs low on computing resources. Eviction reclaims resources in order to avoid an OOM. Operators can set thresholds for these resources; once a threshold is crossed, the kubelet starts evicting pods.
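For example, a hard memory eviction threshold can be expressed in the kubelet configuration; the 500Mi value below is purely illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # start evicting pods once available node memory drops below this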

A pod's QoS class does affect the order in which the kubelet selects eviction victims. The kubelet first evicts BestEffort and Burstable pods whose usage exceeds their resource requests; among them, the order depends on the pods' priority and on how much their usage exceeds their requests.

Guaranteed pods, and Burstable pods whose resource usage stays below their requests, are not evicted for this reason and have the lowest eviction priority.

On nodes under DiskPressure, the kubelet evicts BestEffort pods first, then Burstable pods. A Guaranteed pod is evicted only after the node no longer has any BestEffort or Burstable pods.

What is the OOM behavior of Guaranteed, Burstable, and BestEffort pods?

If a node runs out of memory before the kubelet can reclaim it, the oom_killer steps in and kills containers based on their oom_score. The oom_killer computes an oom_score for each container from the proportion of node memory the container is using, plus the container's oom_score_adj value.

The oom_score_adj of each container is determined by the QoS class of the pod it belongs to.

Quality of Service    oom_score_adj
Guaranteed            -998
BestEffort            1000
Burstable             min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

The oom_killer tends to terminate first the containers of the lowest-QoS pods that use the most memory relative to their requests. This means containers in higher-QoS pods are less likely to be killed than containers in lower-QoS pods. However, if a container in a higher-class pod uses far more memory than it requested, it may still be killed first, because the oom_score depends not only on QoS but also on the container's actual memory usage.
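As a rough worked example with made-up numbers: a Burstable container requesting 1 GiB of memory on a node with 4 GiB of capacity gets oom_score_adj = min(max(2, 1000 - (1000 * 1) / 4), 999) = 750, placing it between Guaranteed (-998) and BestEffort (1000) in the kill order.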

Deep thinking

Why are there three different levels? Can’t you just use one level?

Let's go back to the actual requirements, which look something like this:

  1. Maximize resource utilization and oversell resources as much as possible
  2. Critical workloads should have a certain guarantee of stability

These two goals pull in almost opposite directions, so a single service level cannot satisfy both; different levels are needed to protect services with different requirements. For non-critical services with low stability requirements, we can oversell resources to obtain higher utilization. For critical services, we accept some wasted resources so that business stability comes first.

Why exactly three? Kubernetes QoS is derived from the CPU and memory values of requests and limits. The possible combinations of these settings naturally fall into three cases, which map onto the three classes, so three levels are the most natural choice.

In practice, Burstable and BestEffort do not differ much. From the standpoint of business stability, system engineers should plan capacity in advance and intervene before cluster resources run out, rather than relying on the behavior of the kubelet or the oom_killer. The kubelet and the oom_killer are better seen as a backstop that cuts off abnormal resource consumption in time to keep the whole cluster stable.

In practice

In practice, how do you improve resource utilization while keeping the business stable?

The core principle I recommend is to decide based on the importance of the business and the nature of the cluster. In a production cluster, for example, I recommend setting all pods to the Guaranteed class, because the business running in production matters most and overselling resources there is comparatively unimportant; minimizing the resulting resource waste is still an important problem in its own right. Offline test and development clusters are different: they have lower stability requirements and can tolerate more overselling, so their pods can be set to Burstable or BestEffort.

How can a production environment reduce resource waste? The main approach is to set request values slightly above historical resource usage. To cope with sudden traffic spikes or jitter, an HPA (Horizontal Pod Autoscaler) can be attached so that capacity scales out automatically when load increases and scales back in when that many instances are no longer needed.
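A minimal HPA sketch along those lines; the Deployment name web and the 70% target are assumptions for illustration, and older clusters may need apiVersion autoscaling/v2beta2:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                    # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # the workload to scale; placeholder
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU usage exceeds 70% of requests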

QoS summary

  • QoS classes are determined by requests and limits

  • When memory runs short, pods are evicted in the following order: BestEffort, Burstable, then Guaranteed
  • The oom_killer kills containers based on the QoS class of the pod they belong to and on their actual resource usage

Reference documentation

  1. baike.baidu.com/item/QoS
  2. zh.wikipedia.org/wiki/ Quality of service