What does the K8S Scheduler do

The Kubernetes Scheduler binds a Pod awaiting scheduling to a suitable Worker Node (Node for short) in the cluster according to its scheduling algorithms and policies, and writes the binding into etcd. The kubelet service on the target Node then learns of the binding event generated by the Scheduler by watching the API Server, obtains the Pod information, pulls the image, and starts the container. The scheduling process is shown in the figure:

The scheduling process provided by Scheduler is divided into two phases, Predicates and Priorities:

  • Pre-selection (Predicates): K8S traverses all Nodes in the current cluster and selects the Nodes that meet the requirements as candidates
  • Preferred selection (Priorities): K8S scores the candidate Nodes

After pre-selection filtering and preferred scoring, K8S selects the Node with the highest score to run the Pod. If multiple Nodes tie for the highest score, Scheduler picks one of them at random. The sketch below illustrates this two-phase selection.
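A minimal sketch of the two-phase selection in Go, using toy Pod and Node types (an illustration of the idea only, not kube-scheduler's actual code):

package main

import (
    "fmt"
    "math/rand"
)

type Pod struct{ Name string }
type Node struct{ Name string }

type Predicate func(Pod, Node) bool
type Priority func(Pod, Node) int

func schedule(pod Pod, nodes []Node, predicates []Predicate, priorities []Priority) (Node, bool) {
    // Phase 1 (Predicates): keep only Nodes that pass every check.
    var candidates []Node
    for _, n := range nodes {
        fits := true
        for _, p := range predicates {
            if !p(pod, n) {
                fits = false
                break
            }
        }
        if fits {
            candidates = append(candidates, n)
        }
    }
    if len(candidates) == 0 {
        return Node{}, false // unschedulable
    }
    // Phase 2 (Priorities): sum the scores and collect the top scorers.
    var best []Node
    bestScore := -1
    for _, n := range candidates {
        score := 0
        for _, f := range priorities {
            score += f(pod, n)
        }
        if score > bestScore {
            bestScore, best = score, []Node{n}
        } else if score == bestScore {
            best = append(best, n)
        }
    }
    // A tie for the highest score is broken randomly.
    return best[rand.Intn(len(best))], true
}

func main() {
    nodes := []Node{{"node-1"}, {"node-2"}, {"node-3"}}
    notNode3 := func(_ Pod, n Node) bool { return n.Name != "node-3" }
    flat := func(_ Pod, _ Node) int { return 1 }
    chosen, ok := schedule(Pod{"busybox"}, nodes, []Predicate{notNode3}, []Priority{flat})
    fmt.Println(chosen, ok)
}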

Pre-selection policies provided by the K8S Scheduler

In Scheduler, a number of pre-selection policies are available.

If TaintNodesByCondition is enabled (beta since 1.12, enabled by default), the CheckNodeCondition, CheckNodeMemoryPressure, CheckNodeDiskPressure, and CheckNodePIDPressure pre-selection policies are disabled, while PodToleratesNodeNoExecuteTaints and CheckNodeUnschedulable are enabled.

Priority policies provided by the K8S Scheduler

In Scheduler, a number of priority policies are available.

If the ResourceLimitsPriorityFunction feature gate is enabled (it is disabled by default), the ResourceLimitsPriority policy is enabled.

How to extend the K8S Scheduler

The built-in policies meet the requirements of most scenarios, but in some special scenarios they cannot satisfy complex scheduling needs. In those cases we can extend the Scheduler through an extender.

After the extended Scheduler invokes the built-in pre-selection and priority policies, it calls the extender over HTTP to perform pre-selection and scoring once more, and finally selects a suitable Node for the Pod. The scheduling process is as follows:
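Concretely, each extender call exchanges JSON bodies. Below is a simplified sketch of the payload shapes, modeled on the v1.13-era k8s.io/kubernetes/pkg/scheduler/api package; consult that package for the authoritative definitions:

package extender

import v1 "k8s.io/api/core/v1"

// ExtenderArgs is what the scheduler POSTs to both the filter and the
// prioritize verbs.
type ExtenderArgs struct {
    Pod   v1.Pod       `json:"pod"`
    Nodes *v1.NodeList `json:"nodes,omitempty"`
}

// ExtenderFilterResult is the filter verb's reply: the Nodes that
// survived, plus a reason for each Node that was filtered out.
type ExtenderFilterResult struct {
    Nodes       *v1.NodeList      `json:"nodes,omitempty"`
    FailedNodes map[string]string `json:"failedNodes,omitempty"`
    Error       string            `json:"error,omitempty"`
}

// HostPriority is one entry of the prioritize verb's reply; the scheduler
// multiplies each score by the weight configured for the extender.
type HostPriority struct {
    Host  string `json:"host"`
    Score int    `json:"score"`
}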

How do I implement my own Scheduler extensions

Write extensions

The extender is essentially an HTTP service that filters and scores Nodes. The code below is only a pass-through skeleton; you can replace the pre-selection and scoring logic to match your actual business scheduling scenario, then package it as an image and deploy it.

It receives HTTP requests and, depending on the URL path, calls the pre-selection or the scoring function:

// Assumed imports: "encoding/json", "net/http", "strings",
// "github.com/golang/glog", v1 "k8s.io/api/core/v1", and
// schedulerApi "k8s.io/kubernetes/pkg/scheduler/api".

// filter and prioritize must match the filterVerb and prioritizeVerb
// configured in the scheduler policy ConfigMap below.
const (
    filter     = "filter"
    prioritize = "prioritize"
)

func (e *Extender) serveHTTP(w http.ResponseWriter, req *http.Request) {
    if strings.Contains(req.URL.Path, filter) {
        e.processFilterRequest(w, req)
    } else if strings.Contains(req.URL.Path, prioritize) {
        e.processPrioritizeRequest(w, req)
    } else {
        http.Error(w, "Unsupported request", http.StatusNotFound)
    }
}

Pre-selection logic:

func (e *Extender) processFilterRequest(w http.ResponseWriter, req *http.Request) {
    decoder := json.NewDecoder(req.Body)
    defer func() {
        if err := req.Body.Close(); err != nil {
            glog.Errorf("Error closing decoder")
        }
    }()
    encoder := json.NewEncoder(w)

    var args schedulerApi.ExtenderArgs
    if err := decoder.Decode(&args); err != nil {
        glog.Errorf("Error decoding filter request: %v", err)
        http.Error(w, "Decode error", http.StatusBadRequest)
        return
    }

    // Your logic goes here; this skeleton passes every node through.
    pod := args.Pod
    nodes := args.Nodes.Items
    _ = pod // the pass-through skeleton does not inspect the Pod

    response := &schedulerApi.ExtenderFilterResult{
        Nodes: &v1.NodeList{
            Items: nodes,
        },
    }
    if err := encoder.Encode(response); err != nil {
        glog.Errorf("Error encoding filter response: %+v : %v", response, err)
    }
}
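For example, the "Your logic" placeholder could call a helper like this hypothetical one, which keeps only Nodes carrying a storage-ready=true label (the label key is invented for illustration; the same v1 and schedulerApi imports as above are assumed, and FailedNodesMap is the scheduler API's map of node name to failure reason):

// Hypothetical pre-selection helper: keep only Nodes labeled
// storage-ready=true and report a reason for every filtered-out Node.
func filterStorageReady(nodes []v1.Node) *schedulerApi.ExtenderFilterResult {
    filtered := make([]v1.Node, 0, len(nodes))
    failed := schedulerApi.FailedNodesMap{}
    for _, node := range nodes {
        if node.Labels["storage-ready"] == "true" {
            filtered = append(filtered, node)
        } else {
            failed[node.Name] = "node is not storage-ready"
        }
    }
    return &schedulerApi.ExtenderFilterResult{
        Nodes:       &v1.NodeList{Items: filtered},
        FailedNodes: failed,
    }
}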

Scoring (preferred) logic:

func (e *Extender) processPrioritizeRequest(w http.ResponseWriter, req *http.Request) {
    decoder := json.NewDecoder(req.Body)
    defer func() {
        if err := req.Body.Close(); err != nil {
            glog.Errorf("Error closing decoder")
        }
    }()
    encoder := json.NewEncoder(w)

    var args schedulerApi.ExtenderArgs
    if err := decoder.Decode(&args); err != nil {
        glog.Errorf("Error decoding prioritize request: %v", err)
        http.Error(w, "Decode error", http.StatusBadRequest)
        return
    }

    // Your logic goes here; this skeleton gives every node the same score.
    var respList schedulerApi.HostPriorityList
    for _, node := range args.Nodes.Items {
        hostPriority := schedulerApi.HostPriority{Host: node.Name, Score: 1}
        respList = append(respList, hostPriority)
    }

    if err := encoder.Encode(respList); err != nil {
        glog.Errorf("Failed to encode response: %v", err)
    }
}
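To expose these handlers, a minimal main is enough. This sketch assumes the Extender type and methods shown above, and listens on port 8099 to match the urlPrefix used in the policy ConfigMap below:

package main

import (
    "log"
    "net/http"
)

func main() {
    extender := &Extender{}
    // serveHTTP dispatches on the URL path, so /filter and /prioritize
    // (the verbs configured in the policy ConfigMap below) both land here.
    http.HandleFunc("/", extender.serveHTTP)
    log.Fatal(http.ListenAndServe(":8099", nil))
}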

Deploy the new Scheduler

A Kubernetes cluster already has a default scheduler named default-scheduler. So as not to disturb the cluster's normal scheduling, we generally create a new scheduler alongside it. The new scheduler and default-scheduler differ only in startup parameters; the image is the same. The deployment process is described below, listing only the important parts:

Create the Scheduler configuration

We create the Scheduler's scheduling policy as a ConfigMap. It specifies which built-in pre-selection and priority policies to use (both lists are left empty here), as well as the extender we wrote.

apiVersion: v1
kind: ConfigMap
metadata:
  name: yrcloudfile-scheduler-config
  namespace: yanrongyun
data:
  policy.cfg: |-
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [],
      "priorities": [],
      "extenders": [
        {
          "urlPrefix": "http://yrcloudfile-extender-service.yanrongyun.svc.cluster.local:8099",
          "apiVersion": "v1beta1",
          "filterVerb": "filter",
          "prioritizeVerb": "prioritize",
          "weight": 5,
          "enableHttps": false,
          "nodeCacheCapable": false
        }
      ]
    }

Deploy the Scheduler

When deploying the Scheduler, we need to point --policy-configmap at the ConfigMap created above, and give the scheduler a name with the --scheduler-name parameter; here we set it to yrcloudfile-scheduler.

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: yrcloudfile-scheduler
  namespace: yanrongyun
  initializers:
    pending: []
spec:
  replicas: 3
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        name: yrcloudfile-scheduler
    spec:
      containers:
        - command:
            - /usr/local/bin/kube-scheduler
            - --address=0.0.0.0
            - --leader-elect=true
            - --scheduler-name=yrcloudfile-scheduler
            - --policy-configmap=yrcloudfile-scheduler-config
            - --policy-configmap-namespace=yanrongyun
            - --lock-object-name=yrcloudfile-scheduler
          image: k8s.gcr.io/kube-scheduler:v1.13.0
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10251
            initialDelaySeconds: 15
          name: yrcloudfile-scheduler
          readinessProbe:
            httpGet:
              path: /healthz
              port: 10251
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "name"
                    operator: In
                    values:
                      - yrcloudfile-scheduler
              topologyKey: "kubernetes.io/hostname"
      hostPID: false
      serviceAccountName: yrcloudfile-scheduler-account

How do I use the new Scheduler

To use the new scheduler, set spec.schedulerName in the Pod template to yrcloudfile-scheduler, as in this busybox example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      schedulerName: yrcloudfile-scheduler
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: busybox

How YRCloudFile extends the K8S Scheduler

YRCloudFile 6.0, the latest version released by YanRong Cloud, adds a feature that dynamically senses CSI faults, implemented by extending the Scheduler.

With default-scheduler, if a Worker Node loses its connection to the storage cluster, Kubernetes cannot sense the failure and still schedules Pods to the faulty Node. Kubernetes then makes useless scheduling attempts over and over, the Pods cannot be deployed properly, and overall cluster performance suffers.

As shown in the figure, we deployed a busybox container with 3 replicas, and node-3.yr has a connection failure with the storage. The Pod on this node remains in the ContainerCreating state and cannot be created successfully.

Looking at the Pod event list, we can see that Kubernetes' default scheduler dispatched the Pod to the node-3.yr node, resulting in a PV mount timeout.

YanRong Cloud addresses this problem by extending the Scheduler and deploying a CSI NodePlugin sidecar container that checks whether the connection between the Node and the storage cluster is healthy. During pre-selection, the Scheduler calls the NodePlugin sidecar container to check the storage connection status; if the connection is unhealthy, the Node is filtered out, which prevents Kubernetes from scheduling Pods onto the failed Node. A sketch of such a check follows.
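YRCloudFile's actual implementation is not shown here, but a health check inside the extender's filter phase might look roughly like this; the sidecar port (9809) and the /healthz path are invented for illustration:

package extender

import (
    "net/http"
    "time"
)

// storageHealthy asks a hypothetical CSI NodePlugin sidecar on the node
// whether its connection to the storage cluster is healthy. The port and
// path are made up for this sketch.
func storageHealthy(nodeAddr string) bool {
    client := http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get("http://" + nodeAddr + ":9809/healthz")
    if err != nil {
        return false // an unreachable sidecar counts as unhealthy
    }
    defer resp.Body.Close()
    // Any non-200 reply also filters the node out during pre-selection.
    return resp.StatusCode == http.StatusOK
}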

We then modify the busybox Deployment to use the new scheduler by setting schedulerName to yrcloudfile-scheduler and redeploy it.

This time the Pods are created successfully and none of them is placed on the node-3.yr node. Looking at the Pod event list, we can see that the scheduler is no longer Kubernetes' default scheduler but yrcloudfile-scheduler.

Container storage: far more than just K8S support

With the wide adoption of containers, Kubernetes, and cloud-native technology, attention to container storage is growing by the day, and container storage has become a new commanding height of software-defined storage. Excellent container storage, however, involves far more than just supporting persistent applications in containers: how to make data simpler to use, how to govern data better, and how to integrate deeply with containers are all worth exploring. YanRong Cloud will keep digging into container scenarios, striving to bring users better data storage services.