Background

Today I would like to share the K8S scheduler scheduling model that I learned while working in the scheduling system group of Meituan's Hulk team, along with the customizations Meituan made to the scheduler.

Hulk team introduction

Hulk is the Meituan team responsible for the container cloud platform. I belong to the K8S team under the scheduling system group, which is mainly responsible for developing and maintaining all K8S-related components.

Introduction to the kube-scheduler scheduling model

(This introduction is based on K8S version 1.16)

A Pod is the smallest workload unit in K8S and the primary object the scheduler operates on.

What the K8S scheduler does is select an appropriate node for each Pod and bind the Pod to that node.

The scheduler first watches the apiserver to obtain the Pods to be scheduled, that is, Pod objects whose pod.spec.nodeName field is empty, and adds them to the scheduling queue via the Pod informer's event handler.
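
A minimal sketch of this wiring using client-go, assuming a SharedInformerFactory and an `enqueue` callback standing in for the scheduling queue's add operation (illustrative, not Meituan's or upstream's actual code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// setupPodInformer registers an event handler that feeds only unscheduled Pods
// (spec.nodeName == "") into the scheduling queue via the enqueue callback.
func setupPodInformer(client kubernetes.Interface, enqueue func(*corev1.Pod)) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.FilteringResourceEventHandler{
		// Only Pods that have not yet been bound to a node are interesting.
		FilterFunc: func(obj interface{}) bool {
			pod, ok := obj.(*corev1.Pod)
			return ok && pod.Spec.NodeName == ""
		},
		Handler: cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) { enqueue(obj.(*corev1.Pod)) },
		},
	})
	return podInformer
}
```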

By default, the K8S scheduling queue is a PriorityQueue, and the scheduler performs special operations on its contents when certain cluster information changes; those details are not covered here.

When judging whether a Node can serve as the target machine, a scheduling pass is mainly divided into three stages (a combined sketch follows the list):

  • Predicates pre-selection stage: hard conditions that filter out the nodes which do not meet the requirements; this process is called Predicates. The predicates are a set of filtering conditions evaluated in a fixed order, and if any Predicate fails, the Node is discarded.
  • Priorities prioritization stage: soft conditions that rank the nodes which passed pre-selection; this stage is called Priorities. Each Priority is a scoring factor with a certain weight.
  • Select selection stage: the node with the highest score is chosen from the ranked list; this is called Select. The selected Node is the machine on which the Pod will eventually be deployed.
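
To make the three stages concrete, here is a minimal self-contained sketch with placeholder types (not the real scheduler code; the predicate and priority implementations are assumed to be supplied by the caller):

```go
package sketch

import (
	"errors"
	"sort"
)

// Minimal placeholder types; the real scheduler works on *v1.Pod / *v1.Node.
type Pod struct{ Name string }
type Node struct{ Name string }

type Predicate func(Pod, Node) bool // hard filter: pass or fail
type Priority struct {              // soft scorer: returns 0-10, carries a weight
	Score  func(Pod, Node) int64
	Weight int64
}

// scheduleOne sketches the three stages: Predicates -> Priorities -> Select.
func scheduleOne(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (Node, error) {
	// 1. Predicates: drop every node that fails any hard condition.
	var feasible []Node
	for _, n := range nodes {
		ok := true
		for _, p := range preds {
			if !p(pod, n) {
				// Every predicate is still evaluated here; the "pre-selection
				// interrupt" optimization later in this article breaks instead.
				ok = false
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, errors.New("no feasible node")
	}
	// 2. Priorities: weighted sum of the 0-10 scores for each feasible node.
	type scoredNode struct {
		node  Node
		total int64
	}
	scored := make([]scoredNode, 0, len(feasible))
	for _, n := range feasible {
		var total int64
		for _, pr := range prios {
			total += pr.Weight * pr.Score(pod, n)
		}
		scored = append(scored, scoredNode{n, total})
	}
	// 3. Select: the node with the highest total score wins.
	sort.Slice(scored, func(i, j int) bool { return scored[i].total > scored[j].total })
	return scored[0].node, nil
}
```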

Pre-selection (Predicates) stage

The scheduler starts N goroutines that work through the cluster's nodes concurrently, running the predicate filtering rules serially for each node, until the list of nodes that meet the criteria has been produced.
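
A simplified sketch of that concurrency pattern with placeholder types; the worker count and channel-based fan-out are illustrative rather than the scheduler's exact mechanics:

```go
package sketch

import "sync"

type Pod struct{ Name string }
type Node struct{ Name string }
type Predicate func(Pod, Node) bool

// filterNodesParallel fans the node list out to `workers` goroutines; each
// goroutine evaluates the predicate chain serially for the nodes it picks up,
// and only nodes that pass every predicate are collected.
func filterNodesParallel(pod Pod, nodes []Node, predicates []Predicate, workers int) []Node {
	if workers < 1 {
		workers = 1
	}
	var (
		mu       sync.Mutex
		feasible []Node
		wg       sync.WaitGroup
	)
	jobs := make(chan Node)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range jobs {
				passed := true
				for _, pred := range predicates { // fixed order, serial per node
					if !pred(pod, node) {
						passed = false
					}
				}
				if passed {
					mu.Lock()
					feasible = append(feasible, node)
					mu.Unlock()
				}
			}
		}()
	}
	for _, n := range nodes {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
	return feasible
}
```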

Scoring (Priorities) stage

In the prioritization stage, also known as the scoring stage, a weighted 0-10 point system is adopted. Each priority algorithm has a certain weight, and the sum of each algorithm's score multiplied by its weight is the node's final score for this stage. 0 is the lowest score and 10 is the highest.

Finally, the node with the highest score is selected as the scheduling target.

Scoring rules

The most commonly used scoring rule is LeastRequestedPriority. It can be summarized as: select the node with the most free resources (CPU/memory) on the host.
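
For reference, the upstream LeastRequestedPriority score is, per resource, roughly (capacity − requested) × 10 / capacity, averaged over CPU and memory. A small sketch with illustrative types:

```go
package sketch

// NodeResources holds the illustrative inputs for the score.
type NodeResources struct {
	CPUCapacity, CPURequested int64 // millicores
	MemCapacity, MemRequested int64 // bytes
}

// leastRequestedScore: the more unrequested (free) capacity a node has, the
// higher its score, computed per resource and averaged over CPU and memory.
func leastRequestedScore(n NodeResources) int64 {
	score := func(capacity, requested int64) int64 {
		if capacity == 0 || requested > capacity {
			return 0
		}
		return (capacity - requested) * 10 / capacity
	}
	return (score(n.CPUCapacity, n.CPURequested) + score(n.MemCapacity, n.MemRequested)) / 2
}
```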

There are also rules such as BalancedResourceAllocation, whose criterion is that, once scheduling completes, the allocation of the various resource types on each node should be as balanced as possible. As the community scheduler evolves, more fine-grained scoring algorithms keep being added.

The node scoring algorithm can be abstracted into the following formula:
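
The formula itself appears as an image in the original; reconstructed from the description above, it has roughly this shape:

```latex
\mathrm{finalScore}(node) = \sum_{i} w_i \cdot \mathrm{priorityFunc}_i(pod, node),
\qquad \mathrm{priorityFunc}_i(pod, node) \in [0, 10]
```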

Meituan optimization

Because early versions of kube-scheduler suffered from noticeable scheduling latency in large-scale clusters, Meituan made its own customizations to the scheduler.

Pre-selection interrupt mechanism

Through in-depth analysis of the scheduling process, we found that in the pre-selection stage the scheduler keeps evaluating the remaining filtering conditions even after it already knows that the current Node fails one of them. With tens of thousands of nodes, this judgment logic wastes a great deal of computation time and is an important factor in the scheduler's poor performance.

Therefore, we proposed the "pre-selection failure interrupt mechanism": as soon as one pre-selection condition is not met, the Node is discarded immediately and the remaining pre-selection conditions are not evaluated at all, which greatly reduces the amount of computation and significantly improves scheduling performance. As shown below:
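
The original diagram is not reproduced here; as a stand-in, a minimal sketch of the idea with placeholder types:

```go
package sketch

type Pod struct{ Name string }
type Node struct{ Name string }
type Predicate func(Pod, Node) bool

// podFitsOnNode with the pre-selection failure interrupt: the first failing
// predicate discards the node immediately and skips the remaining checks.
func podFitsOnNode(pod Pod, node Node, predicates []Predicate) bool {
	for _, pred := range predicates {
		if !pred(pod, node) {
			return false // interrupt: no need to evaluate the rest
		}
	}
	return true
}
```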

Locally optimal solution

In the current scheduling strategy, the scheduler traverses all nodes in the cluster every time in order to find the optimal Node; in the scheduling field this is called the BestFit algorithm. In a production environment, however, it makes little difference whether we choose the optimal or a suboptimal Node, and sometimes we even deliberately avoid the optimal Node (for example, to avoid piling newly created applications onto new machines, our cluster randomizes among the best candidates). In other words, a locally optimal solution is good enough.

Assume there are 1000 nodes in the cluster and that 700 of them can pass the Predicates (pre-selection) stage for PodA. The original approach iterates over all nodes, finds those 700, and then sorts them by score to pick the optimal NodeX. With the local-optimum approach, we instead consider the demand satisfied as long as N feasible nodes are found, and we pick the highest-scoring node among those N. For example, by default we stop once 100 nodes have passed Predicates and choose the best one among those 100. The globally optimal NodeX may of course not be among them, but we can still pick the best NodeY out of the 100. In the best case we only need to traverse about 100 nodes to collect 100 feasible ones; in practice it may take 200 or 300, but either way the computation time drops dramatically without having much impact on the scheduling result.
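
A minimal sketch of this local-optimum filtering, again with placeholder types; the parameter n corresponds to the 100 in the example above:

```go
package sketch

type Pod struct{ Name string }
type Node struct{ Name string }
type Predicate func(Pod, Node) bool

// findNFeasibleNodes stops the pre-selection pass as soon as n feasible nodes
// have been found, instead of filtering the entire cluster.
func findNFeasibleNodes(pod Pod, nodes []Node, predicates []Predicate, n int) []Node {
	feasible := make([]Node, 0, n)
	for _, node := range nodes {
		fits := true
		for _, pred := range predicates {
			if !pred(pod, node) {
				fits = false
				break // pre-selection interrupt from the previous section
			}
		}
		if fits {
			feasible = append(feasible, node)
			if len(feasible) >= n { // enough candidates for the scoring stage
				break
			}
		}
	}
	return feasible
}
```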

Scheme 1:

Only K nodes are examined in each scheduling pass.

Scheme 2:

Once some node's score exceeds a watermark, scoring of the remaining nodes is terminated.
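
A sketch of Scheme 2 with placeholder types, assuming the watermark value is configured elsewhere:

```go
package sketch

type Pod struct{ Name string }
type Node struct{ Name string }
type ScoreFunc func(Pod, Node) int64

// selectByWatermark scores nodes one by one and stops as soon as the current
// best score reaches the watermark, instead of scoring every node.
func selectByWatermark(pod Pod, nodes []Node, score ScoreFunc, watermark int64) Node {
	if len(nodes) == 0 {
		return Node{}
	}
	best, bestScore := nodes[0], score(pod, nodes[0])
	for _, node := range nodes[1:] {
		if bestScore >= watermark { // good enough: skip the remaining nodes
			break
		}
		if s := score(pod, node); s > bestScore {
			best, bestScore = node, s
		}
	}
	return best
}
```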

PriorityQueue transformation

The PriorityQueue is actually subdivided internally into several sub-queues (a struct sketch follows the list):

  • activeQ: the queue of Pods waiting to be scheduled
  • backoffQ: the back-off queue for Pods whose binding failed
  • unschedulableQ: the queue for Pods that failed scheduling
  • Other queues that are not related to this transformation (not listed here)…
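
A rough sketch of that composition; the field names follow the list above, while the concrete container types (heaps, a map) are assumptions about the internals and vary by version:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// Placeholder containers; upstream uses heaps for the first two queues and a
// map for the third, but the exact types are an assumption here.
type podHeap []*corev1.Pod
type podMap map[string]*corev1.Pod

// SchedulingQueue sketches how the PriorityQueue is composed of sub-queues.
type SchedulingQueue struct {
	activeQ        podHeap // Pods waiting to be scheduled, ordered by priority
	backoffQ       podHeap // bind-failed Pods waiting out their back-off period
	unschedulableQ podMap  // Pods that failed scheduling, parked until retry
}
```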

Because of the rescheduling component and the PriorityQueue mechanism, when a large number of high-priority "bad" Pods fail scheduling at some point, they return to the head of activeQ, and normal low-priority Pods can no longer be scheduled.

Downgrade plan:

When there is a large backlog in activeQ and unschedulable Pods exist, the PriorityQueue is downgraded to a FIFO queue. This ensures that the bad Pods go to the end of the queue in the next scheduling cycle and that the good Pods can be scheduled normally.
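
One way to express the downgrade is to make activeQ's ordering function ignore Pod priority when the backlog is high, so the queue degenerates to enqueue-time (FIFO) order. A hypothetical sketch, not Meituan's actual implementation:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// queuedPod pairs a Pod with its enqueue time (e.g. UnixNano).
type queuedPod struct {
	pod       *corev1.Pod
	timestamp int64
}

// lessFunc orders activeQ. When fifoMode is on (a hypothetical downgrade
// switch flipped under heavy backlog), the priority comparison is skipped and
// the queue degenerates to FIFO order, so re-queued failing Pods no longer
// jump ahead of healthy lower-priority Pods.
func lessFunc(a, b queuedPod, fifoMode bool) bool {
	if !fifoMode {
		pa, pb := podPriority(a.pod), podPriority(b.pod)
		if pa != pb {
			return pa > pb // normal mode: higher priority first
		}
	}
	return a.timestamp < b.timestamp // FIFO mode (and tie-break): earlier first
}

func podPriority(p *corev1.Pod) int32 {
	if p != nil && p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}
```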
