What does the K8S Scheduler do

The function of the Kubernetes Scheduler is to bind a Pod awaiting scheduling to a suitable Worker Node (hereinafter referred to as Node) in the cluster according to its scheduling algorithms and policies, and to write the binding information into etcd. The kubelet on the target Node then learns of the Pod binding event generated by the Scheduler by watching the API Server, fetches the Pod information, and pulls the image to start the container. The scheduling process is shown in the figure:

The Scheduler provides two scheduling phases: Predicates (pre-selection) and Priorities (preferred scoring).

  • Pre-selection: K8S traverses all Nodes in the current cluster and selects those that meet the requirements as candidates
  • Preferred: K8S scores the candidate Nodes

After pre-selection screening and preferred scoring, K8S selects the Node with the highest score to run the Pod. If multiple Nodes share the highest score, the Scheduler randomly picks one of them to run the Pod.
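
Expressed as a minimal sketch in Go (illustrative only, not the Scheduler's actual implementation), the selection rule looks like this:

package main

import "math/rand"

// pickNode returns the candidate with the highest score; ties are broken by
// choosing one of the top-scoring nodes at random, mirroring the rule above.
func pickNode(scores map[string]int) string {
    var best []string
    bestScore := -1
    for node, score := range scores {
        switch {
        case score > bestScore:
            bestScore, best = score, []string{node}
        case score == bestScore:
            best = append(best, node)
        }
    }
    if len(best) == 0 {
        return "" // no candidate survived pre-selection
    }
    return best[rand.Intn(len(best))]
}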

The pre-selection policies provided by the K8S Scheduler

In the Scheduler, the optional pre-selection policies include:

If TaintNodesByCondition (beta since 1.12, enabled by default) is enabled, the CheckNodeCondition, CheckNodeMemoryPressure, CheckNodeDiskPressure, and CheckNodePIDPressure pre-selection policies are disabled, and PodToleratesNodeNoExecuteTaints and CheckNodeUnschedulable are enabled.

The preferred policies provided by the K8S Scheduler

In the Scheduler, the optional preferred policies include:

If the ResourceLimitsPriorityFunction feature gate is enabled (it is disabled by default), the ResourceLimitsPriority policy will be enabled.

How do I extend the K8S Scheduler

The built-in policies meet the requirements in most scenarios, but in some special scenarios they cannot satisfy complex scheduling requirements. In those cases we can extend the Scheduler through an extender.

After running its built-in pre-selection and preferred policies, the extended Scheduler calls the extender program over HTTP to perform pre-selection and scoring again, and finally selects a suitable Node for the Pod. The scheduling process is as follows:

How do I implement my own Scheduler extension

Write extensions

The extender is essentially an HTTP service that filters and scores Nodes. The code below is only an example and makes no changes to the candidate list; you can implement your own pre-selection and preferred logic based on your actual scheduling scenario, then package it as an image and deploy it.

The extender receives an HTTP request and, depending on the URL path, calls the pre-selection or preferred function:

func (e *Extender) serveHTTP(w http.ResponseWriter, req *http.Request) {
    if strings.Contains(req.URL.Path, filter) {
        e.processFilterRequest(w, req)
    } else if strings.Contains(req.URL.Path, prioritize) {
        e.processPrioritizeRequest(w, req)
    } else {
        http.Error(w, "Unsupported request", http.StatusNotFound)
    }
}
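
For completeness, here is a minimal sketch of how the handler above could be wired into an HTTP server. The path constants and listen port are assumptions chosen to match the extender configuration shown later (filterVerb, prioritizeVerb, and the urlPrefix port 8099), not part of the original example:

package main

import (
    "net/http"

    "github.com/golang/glog"
)

// Assumed URL path fragments and listen address; they must agree with the
// filterVerb, prioritizeVerb, and urlPrefix configured in the scheduler policy.
const (
    filter     = "filter"
    prioritize = "prioritize"
    listenAddr = ":8099"
)

func main() {
    extender := &Extender{}
    // Route every request through serveHTTP, which dispatches on the URL path.
    http.HandleFunc("/", extender.serveHTTP)
    if err := http.ListenAndServe(listenAddr, nil); err != nil {
        glog.Fatalf("Extender server failed: %v", err)
    }
}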

Pre-selection (filter) logic:

func (e *Extender) processFilterRequest(w http.ResponseWriter, req *http.Request) {
    decoder := json.NewDecoder(req.Body)
    defer func() {
        if err := req.Body.Close(); err != nil {
            glog.Errorf("Error closing decoder")
        }
    }()
    encoder := json.NewEncoder(w)

    var args schedulerApi.ExtenderArgs
    if err := decoder.Decode(&args); err != nil {
        glog.Errorf("Error decoding filter request: %v", err)
        http.Error(w, "Decode error", http.StatusBadRequest)
        return
    }

    // Your logic: this example passes every candidate Node through unchanged.
    pod := args.Pod
    nodes := args.Nodes.Items
    _ = pod // referenced only so the placeholder compiles; use it in your real filter

    response := &schedulerApi.ExtenderFilterResult{
        Nodes: &v1.NodeList{
            Items: nodes,
        },
    }
    if err := encoder.Encode(response); err != nil {
        glog.Errorf("Error encoding filter response: %+v : %v", response, err)
    }
}
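
As a hedged example of what the "Your logic" section might do, the sketch below drops Nodes that carry a hypothetical label; the label key and value are assumptions made for illustration only:

package main

import v1 "k8s.io/api/core/v1"

// filterNodes is a hypothetical pre-selection example: it excludes any Node
// labelled "example.com/storage-ready=false" and keeps the rest. Replace the
// condition with whatever criteria your scenario needs.
func filterNodes(nodes []v1.Node) []v1.Node {
    filtered := make([]v1.Node, 0, len(nodes))
    for _, node := range nodes {
        if node.Labels["example.com/storage-ready"] == "false" {
            continue // excluded from the candidate list
        }
        filtered = append(filtered, node)
    }
    return filtered
}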

Preferred (prioritize) logic:

func (e *Extender) processPrioritizeRequest(w http.ResponseWriter, req *http.Request) {
    decoder := json.NewDecoder(req.Body)
    defer func() {
        if err := req.Body.Close(); err != nil {
            glog.Errorf("Error closing decoder")
        }
    }()
    encoder := json.NewEncoder(w)

    var args schedulerApi.ExtenderArgs
    if err := decoder.Decode(&args); err != nil {
        glog.Errorf("Error decoding prioritize request: %v", err)
        http.Error(w, "Decode error", http.StatusBadRequest)
        return
    }

    // Your logic: this example gives every candidate Node the same score of 1.
    var respList schedulerApi.HostPriorityList
    for _, node := range args.Nodes.Items {
        hostPriority := schedulerApi.HostPriority{Host: node.Name, Score: 1}
        respList = append(respList, hostPriority)
    }

    if err := encoder.Encode(respList); err != nil {
        glog.Errorf("Failed to encode response: %v", err)
    }
}
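
Instead of giving every Node the same score, the prioritize handler can differentiate candidates; the scheduler multiplies the returned scores by the extender's configured weight before summing them. The sketch below scores Nodes by a hypothetical annotation (the annotation key and the import path are assumptions for illustration):

package main

import (
    v1 "k8s.io/api/core/v1"
    schedulerApi "k8s.io/kubernetes/pkg/scheduler/api" // assumed import path, matching the handlers above
)

// scoreNodes is a hypothetical preferred-logic example: Nodes carrying the
// annotation "example.com/preferred=true" receive a higher score than the rest.
func scoreNodes(nodes []v1.Node) schedulerApi.HostPriorityList {
    var list schedulerApi.HostPriorityList
    for _, node := range nodes {
        score := 1
        if node.Annotations["example.com/preferred"] == "true" {
            score = 10
        }
        list = append(list, schedulerApi.HostPriority{Host: node.Name, Score: score})
    }
    return list
}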

Deploy the new Scheduler

Because the Kubernetes cluster already has a default scheduler named default-scheduler, we generally create an additional scheduler so as not to affect the cluster's normal scheduling. Apart from its startup parameters, this scheduler uses the same image as default-scheduler. The deployment process is outlined below, with only the important parts listed:

Create a Scheduler configuration

We create the Scheduler's scheduling configuration as a ConfigMap. The configuration file must specify the built-in pre-selection and preferred policies, as well as the extender we wrote.

apiVersion: v1
kind: ConfigMap
metadata:
  name: yrcloudfile-scheduler-config
  namespace: yanrongyun
data:
  policy.cfg: |-
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [],
      "priorities": [],
      "extenders": [
        {
          "urlPrefix": "http://yrcloudfile-extender-service.yanrongyun.svc.cluster.local:8099",
          "apiVersion": "v1beta1",
          "filterVerb": "filter",
          "prioritizeVerb": "prioritize",
          "weight": 5,
          "enableHttps": false,
          "nodeCacheCapable": false
        }
      ]
    }

Deploy the Scheduler

To deploy the Scheduler, set policy-configmap to the ConfigMap created above, and give the scheduler its own name via the scheduler-name parameter; here we set it to yrcloudfile-scheduler.

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: yrcloudfile-scheduler
  namespace: yanrongyun
  initializers:
    pending: []
spec:
  replicas: 3
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        name: yrcloudfile-scheduler
    spec:
      containers:
        - command:
            - /usr/local/bin/kube-scheduler
            - --address=0.0.0.0
            - --leader-elect=true
            - --scheduler-name=yrcloudfile-scheduler
            - --policy-configmap=yrcloudfile-scheduler-config
            - --policy-configmap-namespace=yanrongyun
            - --lock-object-name=yrcloudfile-scheduler
          image: k8s.gcr.io/kube-scheduler:v1.13.0
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10251
            initialDelaySeconds: 15
          name: yrcloudfile-scheduler
          readinessProbe:
            httpGet:
              path: /healthz
              port: 10251
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "name"
                    operator: In
                    values:
                      - yrcloudfile-scheduler
              topologyKey: "kubernetes.io/hostname"
      hostPID: false
      serviceAccountName: yrcloudfile-scheduler-account

How do I use the new Scheduler

After the Scheduler is deployed successfully, how do we use it? It is actually very simple: when deploying the Pod, set schedulerName to yrcloudfile-scheduler.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      schedulerName: yrcloudfile-scheduler
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: busybox

The K8S Scheduler extension in YRCloudFile

In YRCloudFile 6.0, recently released by YanRong Cloud, the new dynamic CSI fault-awareness feature is implemented through this Scheduler extension.

When default-scheduler is used, if a Worker Node's connection to the storage cluster is interrupted, Kubernetes is not aware of the failure and still schedules Pods onto the failed Node. Kubernetes then repeatedly performs useless scheduling, the Pods never finish deploying, and the performance of the whole cluster suffers.

As shown in the figure, we deployed three replicas of a BusyBox container, and the connection between node-3.yr and the storage failed; the Pod on that Node stays in the ContainerCreating state and can never be created.

If you look at the Pod's event list, you can see that the Kubernetes default scheduler scheduled the Pod onto the failed node node-3.yr, causing the PV mount to time out.

To address this problem, YanRong Cloud extended the Scheduler and deployed a CSI NodePlugin sidecar container that checks whether the connection between a Node and the storage cluster is healthy. During the Scheduler's pre-selection phase, the NodePlugin sidecar is called to check the storage connection status; if the connection is unhealthy, the Node is filtered out, preventing Kubernetes from scheduling stateful Pods onto faulty Nodes.
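
A simplified sketch of this idea is shown below. The health-check endpoint, port, and helper names are assumptions made for illustration and do not reflect YRCloudFile's actual NodePlugin interface:

package main

import (
    "fmt"
    "net/http"
    "time"

    v1 "k8s.io/api/core/v1"
)

// storageHealthy asks a hypothetical health endpoint exposed by the CSI
// NodePlugin sidecar on the given Node whether its storage connection is up.
// The port and path are illustrative assumptions.
func storageHealthy(nodeIP string) bool {
    client := &http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(fmt.Sprintf("http://%s:9909/healthz", nodeIP))
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}

// filterByStorageHealth keeps only Nodes whose storage connection is healthy,
// so stateful Pods are never scheduled onto a Node that has lost its storage.
func filterByStorageHealth(nodes []v1.Node) []v1.Node {
    healthy := make([]v1.Node, 0, len(nodes))
    for _, node := range nodes {
        for _, addr := range node.Status.Addresses {
            if addr.Type == v1.NodeInternalIP && storageHealthy(addr.Address) {
                healthy = append(healthy, node)
                break
            }
        }
    }
    return healthy
}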

We modify the YAML file to set spec.schedulerName to yrcloudfile-scheduler; the result of the redeployment is shown in the figure:

The Pod event list shows that the scheduler is no longer the Kubernetes default scheduler but yrcloudfile-scheduler.

Container storage – much more than K8S support

With the widespread use of containers, Kubernetes, and cloud-native technologies, container storage has become the new high ground of software-defined storage. Excellent container storage, however, involves far more than supporting persistent container applications: how to make data simpler to use, how to govern data better, and how to integrate deeply with containers are all promising directions. YanRong Cloud will keep digging into container scenarios, striving to bring better data storage services to users.