
This article explains in detail how to publish and manage containerized applications in Kubernetes, covering a Pod overview, basic usage, the Pod lifecycle, Pod control and scheduling, Pod upgrade and rollback, and the Pod scaling mechanism. With concrete, detailed examples, it helps you easily master Pods and opens the way to Kubernetes container orchestration.

1. Pod Overview

1.1 What is Pod?

Pods are the atomic objects in Kubernetes, the basic building blocks.

A Pod represents a set of running containers on a cluster. Pods are typically created to run a single main container. A Pod can also run optional sidecar containers that implement complementary features such as logging (for example, the istio-proxy and istio-init containers that run alongside the application in a Service Mesh).

A Pod can contain more than one container (the others complementing the main one) and also manages the containers' data volumes, secrets, and configuration.

1.2 Why introduce the concept of Pod?

Reason 1: Kubernetes extensibility

Kubernetes doesn't work with containers directly. The smallest resource Kubernetes users can operate on is the Pod, which can contain multiple containers. When we use kubectl to operate the various resources in Kubernetes, we cannot manipulate containers directly; we usually do so through the Pod.

Kubernetes does not support only the Docker container runtime. So that Kubernetes is not technically tied to a particular container runtime, and can support alternative container runtimes without recompiling the source code, an abstraction layer was introduced between Kubernetes and the container runtime, just as we introduce interfaces as an abstraction layer in object-oriented design: the Container Runtime Interface (CRI).

With the help of CRI, Kubernetes does not depend on a specific container runtime implementation; it operates on Pods directly, and the Pod internally manages multiple closely related user business containers. This architecture makes Kubernetes easier to extend.

Reason 2: Easier management

Suppose Kubernetes had no Pod and managed containers directly. Some containers inherently need a close relationship; in ELK, for example, the Filebeat log collector must be deployed right next to the application. If you treat such a tightly related group of containers as a unit and one of them dies, how should the state of the unit be defined? Should it be considered dead as a whole, or dead individually?

The question is hard to answer because there was no uniform way to represent the state of an entire group of business containers. This is why Kubernetes introduced the Pod, and why each Pod has its own pause container: a standard, business-independent Kubernetes system container that acts somewhat like an operating system daemon, so that the pause container's state can represent the state of the whole container group. Containers that naturally belong together can thus be placed in the same Pod, which, as the minimum unit, is scheduled, scaled, shares resources, and has its lifecycle managed as a whole.

Reason 3: Communication and resource sharing

Multiple business containers in a Pod share the pause container's IP and volumes, which simplifies communication between closely related business containers and solves the file-sharing problem between them.

Because containers in the same Pod share a network namespace, they can communicate over localhost and share storage.
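As a minimal sketch of this sharing (the Pod name, images, and file paths below are illustrative assumptions, not from the original article), the following Pod runs two containers that exchange files through a shared emptyDir volume while sitting in the same network namespace:

# Sketch: two containers in one Pod sharing a volume and a network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: shared-demo
spec:
  containers:
    - name: web
      image: nginx:latest
      volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html
    - name: writer
      image: busybox
      # Writes a file into the shared volume; nginx in the other container serves it.
      command: ["/bin/sh", "-c", "echo hello > /data/index.html && sleep 3600"]
      volumeMounts:
        - name: shared-data
          mountPath: /data
  volumes:
    - name: shared-data
      emptyDir: {}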

1.3 What benefits does Pod bring?

Now that we know where the Pod came from, what benefits does it bring us?

  • As an independently runnable service unit, the Pod simplifies application deployment and, through a higher level of abstraction, makes deployment management much more convenient.
  • As the smallest application instance, a Pod can run independently, so it can easily be deployed, scaled in and out, and managed for scheduling and resource allocation.
  • Containers in a Pod share the same data and network address space, and resources are managed and allocated uniformly at the Pod level.

2. Basic usage of Pod

Whether you operate through the kubectl command or the Dashboard graphical interface, you cannot do without the resource manifest file that defines the resource. Even operations in the Dashboard are ultimately based on the kubectl command. Here, only the kubectl command is used to operate Pods.

For more on the kubectl command, refer to the official documentation: kubernetes.io/docs/refere…

A Pod resource manifest has several important attributes: apiVersion, kind, metadata, spec, and status. Among them, apiVersion and kind are fixed, and status is runtime state, so metadata and spec are the most important parts.

(For the definition of a Kubernetes resource manifest, see the previous article: Kubernetes Resource Manifest: How to Create a Resource?)

Let’s define a simple Pod resource file named frontend-pod.yml:

The Pod in this example is defined in the test namespace, so the commands below carry the -n test argument to specify the namespace. If the Pod is defined in the default namespace, the -n parameter can be omitted.

apiVersion: v1
kind: Pod
metadata:
  name: frontend
  namespace: test    # If the namespace test does not exist, create it in advance. You can also use the default namespace, i.e., leave the namespace property undefined.
  labels:
    app: frontend
spec:
  containers:
    - name: frontend
      image: xcbeyond/vue-frontend:latest    # image published on DockerHub
      ports:
        - name: port
          containerPort: 80
          hostPort: 8080

You can use the command kubectl explain pod to see the specific use and meaning of each attribute tag.

[xcbeyond@bogon ~]$ kubectl explain pod
KIND:     Pod
VERSION:  v1

DESCRIPTION:
     Pod is a collection of containers that can run on a host. This resource is
     created by clients and scheduled onto hosts.

FIELDS:
   apiVersion   <string>
     APIVersion defines the versioned schema of this representation of an
     object. Servers should convert recognized schemas to the latest internal
     value, and may reject unrecognized values. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

   kind <string>
     Kind is a string value representing the REST resource this object
     represents. Servers may infer this from the endpoint the client submits
     requests to. Cannot be updated. In CamelCase. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

   metadata     <Object>
     Standard object's metadata. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata

   spec <Object>
     Specification of the desired behavior of the pod. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

   status       <Object>
     Most recently observed status of the pod. This data may not be up to
     date. Populated by the system. Read-only. More info:
     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status

2.1 Create

Run kubectl create -f <filename> to create a Pod based on the resource manifest file:

[xcbeyond@localhost ~]$ kubectl create -f frontend-pod.yml
pod/frontend created

2.2 Viewing Status

Check the Pod status with kubectl get pods -n <namespace>:

(The -n parameter specifies the namespace. It is optional for the default namespace; for any other namespace it must be specified, otherwise the Pod will not be found.)

[xcbeyond@localhost ~]$ kubectl get pods -n test
NAME       READY   STATUS    RESTARTS   AGE
frontend   1/1     Running   0          36s

2.3 Viewing the Configuration

To view the configuration of a running Pod, use the command kubectl get pod <pod-name> -n <namespace> -o <json|yaml>:

(The -o parameter specifies the output format: json or yaml.)

The output reflects the runtime state and contains many attributes; focus only on the key ones.

[xcbeyond@localhost ~]$ kubectl get pod frontend -n test -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-11-19T08:33:20Z"
  labels:
    app: frontend
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
      f:spec:
        f:containers:
          k:{"name":"frontend"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":80,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:hostPort: {}
                f:name: {}
                f:protocol: {}
            f:resources: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:terminationGracePeriodSeconds: {}
    manager: kubectl-create
    operation: Update
    time: "2020-11-19T08:33:20Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.18.0.5"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2020-11-23T08:10:40Z"
  name: frontend
  namespace: test
  resourceVersion: "28351"
  selfLink: /api/v1/namespaces/test/pods/frontend
  uid: be4ad65c-e426-4110-8337-7c1dd542f647
spec:
  containers:
  - image: xcbeyond/vue-frontend:latest
    imagePullPolicy: Always
    name: frontend
    ports:
    - containerPort: 80
      hostPort: 8080
      name: port
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-bbmj5
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: minikube
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-bbmj5
    secret:
      defaultMode: 420
      secretName: default-token-bbmj5
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-11-19T08:33:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-11-23T08:10:40Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-11-23T08:10:40Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-11-19T08:33:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://84d978ee70d806d38b9865021d9c68881cf096960c7eb45e87b3099da85b4f6d
    image: xcbeyond/vue-frontend:latest
    imageID: docker-pullable://xcbeyond/vue-frontend@sha256:aa31cdbca5ca17bf034ca949d5fc7d6e6598f507f8e4dad481e050b905484f28
    lastState: {}
    name: frontend
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-11-23T08:10:40Z"
  hostIP: 172.17.0.2
  phase: Running
  podIP: 172.18.0.5
  podIPs:
  - ip: 172.18.0.5
  qosClass: BestEffort
  startTime: "2020-11-19T08:33:20Z"

2.4 Viewing Logs

To view the logs of the container in a Pod, run kubectl logs <pod-name> -n <namespace>:

[xcbeyond@localhost ~]$ kubectl logs frontend -n test

If the Pod contains multiple containers, specify the container with the -c parameter: kubectl logs <pod-name> -c <container-name> -n <namespace>.

2.5 Modifying the Configuration

If you want to modify an existing Pod, for example to change a label, you can use the following methods:

(1) Use the label management command kubectl label to set or update the labels of resource objects.

This method applies only to label changes.

View the current labels with kubectl get pods -n <namespace> --show-labels:

[xcbeyond@localhost ~]$ kubectl get pods -n test --show-labels
NAME       READY   STATUS    RESTARTS   AGE   LABELS
frontend   1/1     Running   0          4d    app=frontend

Add a new label with kubectl label pod <pod-name> -n <namespace> <label-key>=<label-value>:

[xcbeyond@localhost ~]$ kubectl label pod frontend -n test tir=frontend
pod/frontend labeled
[xcbeyond@localhost ~]$ kubectl get pods -n test --show-labels
NAME       READY   STATUS    RESTARTS   AGE    LABELS
frontend   1/1     Running   0          4d1h   app=frontend,tir=frontend

Modify an existing label with kubectl label pod <pod-name> -n <namespace> <label-key>=<new-value> --overwrite:

[xcbeyond@localhost ~]$ kubectl label pod frontend -n test tir=unkonwn --overwrite
pod/frontend labeled
[xcbeyond@localhost ~]$ kubectl get pods -n test --show-labels
NAME       READY   STATUS    RESTARTS   AGE    LABELS
frontend   1/1     Running   0          4d1h   app=frontend,tir=unkonwn

(2) Run kubectl apply -f <filename> to update the configuration.

(3) Run kubectl edit -f <filename> -n <namespace> to update the configuration online.

(4) Run kubectl replace -f <filename> -n <namespace> --force to forcibly replace the resource object. In effect, the old object is deleted first and then replaced with a new one.

[xcbeyond@localhost ~]$ kubectl replace -f frontend-pod.yml --force
pod "frontend" deleted
pod/frontend replaced

2.6 Delete

Delete a Pod with the command kubectl delete (-f <filename> | pod [<pod-name> | -l <label>]) -n <namespace>:

[xcbeyond@localhost ~]$ kubectl delete pod frontend -n test
pod "frontend" deleted

3. Pod life cycle

The time span from the creation of a Pod object to its termination is called the Pod lifecycle. During this time, the Pod passes through a number of different states and performs some operations. Of these, creating the main container is required; other optional operations include running init containers, a post-start hook, a liveness probe, a readiness probe, and a pre-stop hook, depending on the Pod definition. As shown in the figure below:

The status field of a Pod is a PodStatus object, which has a phase field.

Whether created manually or through a controller such as Deployment, a Pod object is always in one of the following phases in its life cycle:

  • Pending: the API Server has created the Pod and stored it in etcd, but it has not been scheduled yet, or it is still downloading images from the registry.
  • Running: the Pod has been scheduled to a node, all of its containers have been created by the kubelet, and at least one container is running, starting, or restarting.
  • Succeeded: all containers in the Pod have terminated successfully and will not be restarted.
  • Failed: all containers have terminated, and at least one container terminated in failure, i.e., exited with a non-zero status or was killed by the system.
  • Unknown: for some reason the API Server could not obtain the Pod's state, typically because it could not communicate with the kubelet on the worker node.

3.1 Pod creation process

The Pod is the basic unit in Kubernetes, and knowing how it is created greatly helps in understanding its lifecycle. The figure below depicts a typical creation process of a Pod resource object.

(1) The user submits a Pod spec to the API Server through kubectl or another API client (create Pod).

(2) The API Server attempts to write the Pod object's information into the etcd storage system; after the write completes, the API Server returns a confirmation to the client.

(3) The API Server begins to reflect the state changes in etcd.

(4) All Kubernetes components use the "watch" mechanism to track and check relevant changes on the API Server.

(5) The kube-scheduler notices through its "watch" mechanism that the API Server has created a new Pod object that is not yet bound to any node.

(6) The kube-scheduler selects a node for the Pod object and updates the result to the API Server.

(7) The scheduling result is written from the API Server into etcd, and the API Server also begins to reflect the Pod object's scheduling result.

(8) The kubelet on the target node to which the Pod has been scheduled calls Docker on that node to start the containers and reports the resulting container state back to the API Server.

(9) The API Server stores the Pod's state information in etcd.

(10) After etcd confirms the write, the API Server sends a confirmation to the relevant kubelet, through which the event is received.

3.2 Important Process

3.2.1 Init Container

An init container is a dedicated container that runs before the application containers start; one or more init containers are used to complete the prerequisites of the application containers.

An init container is very similar to a normal container, but it has the following unique characteristics:

  • An init container always runs until it completes successfully.
  • Each init container must complete successfully before the next one starts.

According to the Pod's restartPolicy: if an init container fails to execute and restartPolicy is Never, the Pod fails to start and will not be restarted; if restartPolicy is Always, the Pod is restarted automatically.

If a Pod specifies multiple init containers, they run one at a time, in order; the next starts only after the previous one has completed successfully. Only after all init containers have run does Kubernetes initialize the Pod and run its application containers, as in the sketch below.
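As a minimal sketch of init containers (the service name myservice and the images are hypothetical assumptions, not from the original article), the Pod below waits for a Service's DNS entry to appear before starting its application container:

apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  # Init containers run one at a time, each to successful completion,
  # before the application containers start.
  initContainers:
    - name: wait-for-service
      image: busybox
      command: ['sh', '-c', 'until nslookup myservice; do echo waiting for myservice; sleep 2; done']
  containers:
    - name: app
      image: busybox
      command: ['sh', '-c', 'echo app started && sleep 3600']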

3.2.2 Container detection

A container probe is an important, routine task in the lifecycle of a Pod object: a health diagnostic performed periodically by the kubelet on the container, using a handler defined for the container. Kubernetes supports three handlers for Pod probes:

  • ExecAction: executes a specified command inside the container and diagnoses based on the command's exit status code; this is called an Exec probe. An exit code of 0 means success; anything else means an unhealthy state.
  • TCPSocketAction: attempts to establish a TCP connection to a port of the container; if the connection succeeds, the container is healthy, otherwise unhealthy. This is called a TCP probe.
  • HTTPGetAction: sends an HTTP GET request to a path on a specified port of the container's IP address; a response code of 2xx or 3xx means success, otherwise failure. This is called an HTTP probe.

Any probe can have one of three results: Success, Failure, or Unknown. Only Success means the probe passed.

There are two types of container probes:

  • Liveness probe: determines whether the container is in the Running state. If this probe fails, the kubelet kills the container and decides whether to restart it according to its restart policy (restartPolicy). A container with no liveness probe defined defaults to the "Success" state.
  • Readiness probe: determines whether the container is ready to serve requests. A container that fails the probe is considered not ready, and the endpoint controller removes the Pod's IP from the endpoint lists of all Service objects that match the Pod. When the probe passes again, the IP is added back to the endpoint lists.

When should we use liveness and readiness probes?

If the process in the container can crash on its own whenever it encounters a problem or becomes unhealthy, a liveness probe is not strictly necessary; the kubelet will automatically take the correct action according to the Pod's restartPolicy.

If you want the container to be killed and restarted when a probe fails, specify a liveness probe and set restartPolicy to Always or OnFailure.

If you want traffic to be sent to a Pod only after a probe succeeds, specify a readiness probe. In this case, the readiness probe may be identical to the liveness probe, but its presence in the spec means that the Pod starts without receiving any traffic and only begins receiving traffic after the readiness probe succeeds.

If you want the container to be able to take itself down for maintenance, you can specify a readiness probe that checks a different endpoint than the liveness probe.

Note: if you only want requests to be drained when the Pod is deleted, you do not necessarily need a readiness probe; on deletion, the Pod automatically puts itself into the unready state regardless of whether a readiness probe exists, and it stays unready while waiting for its containers to stop. A combined sketch of both probe types follows.
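Here is a hedged sketch of a container with an HTTP liveness probe and an HTTP readiness probe (the paths, ports, and timing values are illustrative assumptions, not from the original article):

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: web
      image: nginx:latest
      ports:
        - containerPort: 80
      # HTTPGetAction: a 2xx/3xx response means healthy; on repeated failure
      # the kubelet kills the container and applies restartPolicy.
      livenessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 15
        periodSeconds: 10
      # Until this probe passes, the Pod's IP stays out of Service endpoint lists.
      readinessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 5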

4. Pod scheduling

In Kubernetes, we rarely create a Pod directly. In most cases, the creation, scheduling, and full-lifecycle automatic control of a set of Pod replicas are handled by controllers such as RC (Replication Controller), Deployment, DaemonSet, and Job.

In early Kubernetes versions, there was only one Pod replica controller: RC. It was designed and implemented like this: RC is independent of the Pods it controls and manages the creation and destruction of the target Pod instances through the loosely coupled association of labels. As Kubernetes developed, RC got a successor, Deployment, which supports more automated deployment, updates, rollbacks, and so on.

Strictly speaking, RC's successor is not Deployment but ReplicaSet, because ReplicaSet further enhances the flexibility of RC's label selector: unlike RC's selector, ReplicaSet supports a set-based label selector that can match a set of Pod labels, as shown below:

selector:
  matchLabels:
    tier: frontend
  matchExpressions:
    - {key: tier, operator: In, values: [frontend]}

Unlike RC, ReplicaSet is designed to control Pods carrying multiple different labels. A common scenario: a user wants to keep the number of Pod replicas for an app at 3, where both v1 and v2 Pods may be included, and uses a ReplicaSet to implement this control. It can be written as follows:

selector:
  matchLabels:
    version: v2
  matchExpressions:
    - {key: version, operator: In, values: [v1,v2]}

In fact, Kubernetes' rolling upgrades are implemented using ReplicaSet, and Deployment likewise controls Pod replicas automatically through ReplicaSet.

4.1 Automatic Scheduling

One of the main functions of Deployment or RC is the automatic Deployment of multiple copies of a container application and the continuous monitoring of the number of copies, always maintaining a user-specified number of copies within the cluster.

Example:

(1) A Deployment configuration example: the resource manifest file nginx-deployment.yml creates a ReplicaSet, which in turn creates three nginx Pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
           - name: http
             containerPort: 80

(2) Execute the kubectl create command to create this Deployment:

[xcbeyond@localhost ~]$ kubectl create -f nginx-deployment.yml -n test
deployment.apps/nginx-deployment created

(3) Run kubectl get deployments to check Deployment status:

[xcbeyond@localhost ~]$ kubectl get deployments -n test
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           17m

This result shows that the Deployment has created all three replicas and that all replicas are up to date and available.

The kubectl get rs and kubectl get pods commands show the created ReplicaSet (RS) and Pod information:

[xcbeyond@localhost ~]$ kubectl get rs -n test
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-86b8cc866b   3         3         3       24m
[xcbeyond@localhost ~]$ kubectl get pods -n test
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-86b8cc866b-7rqzt   1/1     Running   0          27m
nginx-deployment-86b8cc866b-k2rwp   1/1     Running   0          27m
nginx-deployment-86b8cc866b-rn7l7   1/1     Running   0          27m

In terms of scheduling policy, these three nginx Pods are scheduled entirely by Kubernetes: which node each runs on is determined by the Scheduler on the Master through a series of calculations, and users cannot intervene in the scheduling process or its result.

In addition to automatic scheduling, Kubernetes provides a variety of scheduling policies. Users only need to set fine-grained policies such as NodeSelector, NodeAffinity, PodAffinity, and taints and tolerations in the Pod definition to schedule Pods precisely.

4.2 NodeSelector: directional scheduling

The Scheduler (kube-scheduler) on the Kubernetes Master is responsible for scheduling Pods. The scheduling process computes an optimal target node for each Pod through a series of complex algorithms, and this process is fully automatic: there is no way to know in advance which node a Pod will eventually be scheduled to.

In real scenarios, you may need to schedule Pods to specified nodes, which can be achieved by matching node labels with the Pod's nodeSelector attribute. For example, to schedule a MySQL database to a target node with an SSD disk, the nodeSelector attribute in the Pod template comes into play.

Example:

(1) Run kubectl label nodes <node-name> <label-key>=<label-value> to add a label to the specified node.

For example, add the label disk-type=ssd to a node that has an SSD disk:

$ kubectl label nodes k8s-node-1 disk-type=ssd

(2) Add the nodeSelector property to the Pod's resource manifest definition:

apiVersion: v1
kind: Pod
metadata:
  name: mysql
  labels:
    env: test
spec:
  containers:
  - name: mysql
    image: mysql
  nodeSelector:
    disk-type: ssd

(3) Run the kubectl get pods -o wide command to check whether it takes effect.

In addition to allowing users to add tags to nodes, Kubernetes also provides predefined tags for nodes. The predefined tags include:

  • kubernetes.io/arch: for example, kubernetes.io/arch=amd64. This is useful in situations such as mixing ARM and x86 nodes.
  • kubernetes.io/os: for example, kubernetes.io/os=linux. This is useful when there are nodes with different operating systems in the cluster (for example, mixed Linux and Windows nodes).
  • beta.kubernetes.io/os (deprecated)
  • beta.kubernetes.io/arch (deprecated)
  • kubernetes.io/hostname: for example, kubernetes.io/hostname=k8s-node-1.

More reference: kubernetes.io/docs/refere…

nodeSelector restricts, in a simple way through labels, which nodes a Pod may land on. It looks simple and perfect, but in real environments it faces the following awkward questions:

(1) What if the label selected by nodeSelector does not exist or no node satisfies it, for example when the target nodes are down or short of resources?

(2) What if you want to select from several kinds of suitable target nodes, such as nodes with SSD disks or, failing that, nodes with ultra-fast disks? Kubernetes introduced NodeAffinity to solve this.

(3) What about affinity between different Pods, for example when MySQL and Redis must not be scheduled to the same node, or two different Pods must be scheduled to the same node for local file sharing or local network communication? That is what PodAffinity solves.

4.3 NodeAffinity: Indicates Node affinity scheduling

NodeAffinity is a node-affinity-based scheduling policy, introduced to address the shortcomings of NodeSelector.

It comes in two forms:

  • requiredDuringSchedulingIgnoredDuringExecution: the specified rules must be satisfied before a Pod can be scheduled onto a Node (similar to nodeSelector, but with a more expressive syntax); a hard requirement.
  • preferredDuringSchedulingIgnoredDuringExecution: emphasizes preferring nodes that satisfy the specified rules; the scheduler tries to schedule the Pod onto such a Node but does not force it; a soft requirement. Multiple preference rules can also carry weight values to define their relative priority.

IgnoredDuringExecution means that if a Node's labels change after a Pod has been scheduled onto it, so that the Node no longer satisfies the Pod's node affinity rules, the system does not remove the Pod from the Node; NodeAffinity rules therefore apply only to newly created Pod objects.

Node affinity is specified via pod.spec.affinity.nodeAffinity.

Example:

Set the NodeAffinity scheduling rules as follows:

  • requiredDuringSchedulingIgnoredDuringExecution: the node architecture must be amd64 (kubernetes.io/arch In [amd64]).
  • preferredDuringSchedulingIgnoredDuringExecution: nodes with SSD disks are preferred (disk-type In [ssd]).

pod-with-node-affinity.yml is defined as follows:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
  containers:
  - name: with-node-affinity
    image: busybox

You can see from the example above that the In operator is used. The NodeAffinity syntax supports the operators In, NotIn, Exists, DoesNotExist, Gt, and Lt. NotIn and DoesNotExist implement node anti-affinity behavior, i.e., excluding nodes.

Note the following about NodeAffinity rules:

  • If both nodeSelector and nodeAffinity are specified, both must be satisfied for the Pod to be scheduled onto a candidate node.
  • If nodeAffinity specifies multiple nodeSelectorTerms, it is enough for any one of them to match.
  • If a nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of the matchExpressions to run the Pod.

4.4 PodAffinity: indicates a Pod affinity or mutually exclusive scheduling policy

Affinity and mutual exclusion between Pods were introduced in Kubernetes 1.4. This feature restricts the nodes a Pod can run on from another perspective: the decision is based on the labels of Pods already running on a node rather than on the node's labels, so both node and Pod conditions must match. The rule can be described as: if one or more Pods satisfying condition Y are running on a node carrying label X, then this Pod should (or, for mutual exclusion, should not) run on that node.

Unlike NodeAffinity, a Pod belongs to a namespace, so condition Y is a label selector over one namespace or all namespaces.

As with NodeAffinity, Pod affinity and mutual exclusion are also expressed as requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.

Pod affinity is specified via pod.spec.affinity.podAffinity, as in the sketch below.
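The article gives no manifest here, so the following is only a sketch under assumed labels (app=frontend and app=redis are illustrative): a Pod that prefers to land on a node already running frontend Pods, and must not share a node with redis Pods:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      # Soft rule: prefer nodes (topologyKey = hostname) already running app=frontend Pods.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: ["frontend"]
            topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      # Hard rule: never share a node with app=redis Pods.
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["redis"]
          topologyKey: kubernetes.io/hostname
  containers:
    - name: with-pod-affinity
      image: busybox
      command: ['sh', '-c', 'sleep 3600']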

4.5 Taints and Tolerations

NodeAffinity is an attribute defined on the Pod that makes the Pod schedulable to certain nodes. Taints do the opposite: they let a Node refuse to run certain Pods.

Taints work together with tolerations to keep Pods away from unsuitable nodes. After one or more taints are set on a node, Pods cannot run on that node unless they explicitly declare that they tolerate those taints. A toleration is a Pod attribute that allows the Pod to run on a node marked with a matching taint.

To set the taint information for a Node, use the kubectl taint command:

[xcbeyond@localhost ~]$ kubectl taint nodes node1 key=value:NoSchedule
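For a Pod to be scheduled onto that node anyway, its spec must declare a matching toleration. A minimal sketch, mirroring the key=value:NoSchedule taint set above:

spec:
  tolerations:
    - key: "key"            # matches the taint key set with kubectl taint
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"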

4.6 Pod priority scheduling

Kubernetes 1.8 introduced a scheduling policy based on Pod priority preemption, in which Kubernetes tries to evict lower-priority Pods from a target node to make room for a higher-priority Pod. This type of scheduling is called preemptive scheduling.

It can be defined in the following dimensions:

  • Priority: indicates the Priority
  • QoS: service quality level
  • Other metrics defined by the system

Example:

(1) Create a PriorityClass (PriorityClass objects do not belong to any namespace):

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "this priority class should be used for XYZ service pods only."

Note: this defines a priority class named high-priority with a value of 1000000. The larger the number, the higher the priority. Values above one billion are reserved by the system for critical system components.

(2) Reference the priority class defined above in a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
  namespace: test
  labels:
    app: frontend
spec:
  containers:
    - name: frontend
      image: xcbeyond/vue-frontend:latest
      ports:
        - name: port
          containerPort: 80
          hostPort: 8080
  priorityClassName: high-priority    # references the Pod priority defined above

Scheduling policies based on priority preemption may leave some Pods never able to schedule successfully. Priority scheduling not only adds complexity to the system but may also introduce extra instability. Therefore, when resources run short, the first option to consider is expanding the cluster; if that is impossible, use preemptive scheduling carefully and under supervision, for example by using namespace-based resource quotas to limit arbitrary preemption.

4.7 DaemonSet: Schedule one Pod on each Node

DaemonSet, a resource object introduced in Kubernetes 1.2, manages exactly one Pod replica on each Node in the cluster. As shown in the figure below:


Some typical uses of DaemonSet:

  • Run a cluster storage daemon on each node, for example a GlusterFS or Ceph storage daemon.
  • Run a log collection daemon on each node, for example Fluentd or Logstash.
  • Run a monitoring daemon on each node, for example Prometheus Node Exporter or collectd, to collect the node's runtime performance data.

4.8 Job: Batch scheduling

A Job is responsible for batch processing of short-lived, one-off tasks, i.e., tasks that run only once; it ensures that one or more Pods of the batch task end successfully.

According to different implementation modes of batch tasks, the following typical modes are applicable to different business scenarios:

  • Extension based on Job template:

    You need to write a generic Job template to generate multiple Job JSON/YML files based on different parameters for Job creation. You can use the same label for Job management.

  • Queue per work item:

    • The user must prepare a message queue service in advance, such as RabbitMQ, a common component into which each work item can be put as a task message.
    • The user creates parallel Jobs suited to the message queue; the Jobs consume tasks from the queue and process them until all messages are handled.
    • In this mode, spec.completions should be set to the number of work items, and spec.parallelism can be set according to the actual situation. The Job completes only when all tasks end successfully.
  • Queue with a variable number of tasks:

    • The user must prepare a storage service in advance to hold the work queue, such as Redis. Each work item can be put into the storage service as a message.
    • The user starts multiple parallel Jobs for the work queue to process the messages. Unlike the RabbitMQ mode above, each worker can tell when the work queue is empty and exit successfully on its own.
    • In this mode, spec.completions is set to 1, and spec.parallelism can be set according to the actual situation. As long as one Pod ends successfully, the Job completes successfully.
  • Normal static tasks:

    Assign tasks statically.

Considering the parallelism of batch processing, Kubernetes classifies Jobs into the following three types (a manifest sketch follows the list):

  • Non-parallel Job:

    Usually a Job starts only one Pod; once the Pod ends successfully, the Job ends. (If the Pod fails, it is restarted according to the restart policy.)

  • Parallel Jobs with a fixed completion count:

    Run Pods in parallel until the specified number of Pods end successfully, at which point the Job ends.

  • Parallel Jobs with a work queue:

    • The user can specify the number of parallel Pods; after any Pod ends successfully, no new Pod is created.
    • Once one Pod has ended successfully and all Pods have terminated, the Job ends successfully.
    • Once one Pod has ended successfully, the remaining Pods are on their way to exiting.
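The original gives no Job manifest, so here is a minimal sketch of the second type, a parallel Job with a fixed completion count (the name, image, and command are illustrative assumptions):

apiVersion: batch/v1
kind: Job
metadata:
  name: work-demo
spec:
  completions: 5    # the Job succeeds after 5 Pods terminate successfully
  parallelism: 2    # at most 2 Pods run at the same time
  template:
    spec:
      containers:
        - name: worker
          image: busybox
          command: ['sh', '-c', 'echo processing one work item && sleep 5']
      restartPolicy: OnFailure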

4.9 CronJob: scheduled tasks

CronJob is a scheduled task similar to Cron in Linux.

The format of a timing expression is as follows:

Minutes Hours DayofMonth Month DayofWeek Year

  • Minutes: an integer from 0 to 59. The four characters , - * / can be used.

  • Hours: an integer from 0 to 23. The four characters , - * / can be used.

  • DayofMonth: day of the month, an integer from 1 to 31. The eight characters , - * / ? L W C can be used.

  • Month: an integer from 1 to 12, or JAN-DEC. The four characters , - * / can be used.

  • DayofWeek: day of the week, an integer from 1 to 7 or SUN-SAT, where 1 means Sunday, 2 means Monday, and so on. The eight characters , - * / ? L C # can be used.

For example, to run the task once every minute, the Cron expression is */1 * * * *.

Example:

(1) Define the CronJob resource configuration file test-cronjob.yml:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo hello from the Kubernetes cluster
          restartPolicy: OnFailure

(2) Run the kubectl create command to create the CronJob:

[xcbeyond@localhost k8s]$ kubectl create -f test-cronjob.yml -n test
cronjob.batch/test created

(3) Run kubectl get cronjob repeatedly, about one minute apart, to check the task status; the task is indeed scheduled once per minute:

[xcbeyond@localhost k8s]$ kubectl get cronjob test -n test
NAME   SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
test   */1 * * * *   False     1        61s             69s
[xcbeyond@localhost k8s]$ kubectl get cronjob test -n test
NAME   SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
test   */1 * * * *   False     2        7s              75s
[xcbeyond@localhost k8s]$ kubectl get cronjob test -n test
NAME   SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
test   */1 * * * *   False     3        28s             2m36s

Run kubectl delete cronjob to delete the cronjob.

4.10 Custom Scheduler

If the schedulers above still do not meet your unique needs, you can implement a simple or complex custom scheduler in any language.
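A Pod opts into a custom scheduler simply by naming it in its spec. A sketch (the scheduler name my-scheduler is hypothetical; Pods that name no scheduler use default-scheduler):

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled
spec:
  schedulerName: my-scheduler   # hypothetical custom scheduler deployed in the cluster
  containers:
    - name: app
      image: busybox
      command: ['sh', '-c', 'sleep 3600']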

5. Pod upgrade and rollback

When a service needs to be upgraded, the traditional approach is to stop all Pods associated with the service, then pull the new image and create new Pods. In a large cluster this becomes a challenge, and having the service unavailable for a long time is hard to accept.

To solve these problems, Kubernetes offers rolling upgrades that are a great solution.

If Pods are created through a Deployment, the Deployment's Pod definition (spec.template) or image name can be modified at runtime and applied to the Deployment object, and the system completes the update automatically. If an error occurs during the update, the Pods can be restored to the previous version with a rollback operation.

5.1 Deployment upgrade

Take nginx-deployment.yml as an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
           - name: http
             containerPort: 80

(1) With 3 Pod replicas running, check the Pod status:

[xcbeyond@localhost k8s]$ kubectl get pod -n test
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-86b8cc866b-7rqzt   1/1     Running   2          53d
nginx-deployment-86b8cc866b-k2rwp   1/1     Running   2          53d
nginx-deployment-86b8cc866b-rn7l7   1/1     Running   2          53d

(2) To update the nginx version to nginx:1.9.1, run the kubectl set image command to set a new image for the Deployment:

[xcbeyond@localhost k8s]$ kubectl set image deployment/nginx-deployment nginx=nginx:1.9.1 -n test
deployment.apps/nginx-deployment image updated

kubectl set image: updates the container image of an existing resource object.

An alternative upgrade method is to modify the Deployment configuration using the kubectl edit command.

(3) At this point (during the upgrade), check the Pod status and observe that the upgrade is in progress:

[xcbeyond@localhost k8s]$ kubectl get pod -n test
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-79fbf694f6-kplgz   0/1     ContainerCreating   0          96s
nginx-deployment-86b8cc866b-7rqzt   1/1     Running             2          53d
nginx-deployment-86b8cc866b-k2rwp   1/1     Running             2          53d
nginx-deployment-86b8cc866b-rn7l7   1/1     Running             2          53d

During the upgrade, you can view the Deployment upgrade process by executing the kubectl rollout status command.

(4) After the upgrade, check the Pod status:

[xcbeyond@localhost k8s]$ kubectl get pod -n test
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-79fbf694f6-h7nfs   1/1     Running   0          2m43s
nginx-deployment-79fbf694f6-kplgz   1/1     Running   0          7m21s
nginx-deployment-79fbf694f6-txrfj   1/1     Running   0          2m57s

Note that the Pod NAMEs in the status lists before and after the upgrade are different.

Check the image of one of the new Pods and confirm it is nginx:1.9.1:

[xcbeyond@localhost k8s]$ kubectl describe pod/nginx-deployment-79fbf694f6-h7nfs -n test
Name:         nginx-deployment-79fbf694f6-h7nfs
Namespace:    test
...
Containers:
  nginx:
    Container ID:   docker://0ffd43455aa3a147ca0795cf58c68da63726a3c77b40d58bfa5084fb879451d5
    Image:          nginx:1.9.1
    Image ID:       docker-pullable://nginx@sha256:2f68b99bc0d6d25d0c56876b924ec20418544ff28e1fb89a4c27679a40da811b
    Port:           80/TCP
...

So how does Deployment complete the Pod update?

Use the kubectl describe deployment/nginx-deployment command to look closely at the update process:

When the Deployment was first created, it created a ReplicaSet (nginx-deployment-86b8cc866b) and the three Pod replicas. When the Deployment was updated, it created a new ReplicaSet (nginx-deployment-79fbf694f6), scaled it up to 1, and scaled the old ReplicaSet down to 2. After that, the system continued adjusting the new and old ReplicaSets step by step following the same update strategy. Finally, the new ReplicaSet ran three replicas of the new Pod version and the old ReplicaSet was scaled down to zero.

As shown in the figure below:

Execute the kubectl describe deployment/nginx-deployment command to see the final Deployment information:

[xcbeyond@localhost k8s]$ kubectl describe deployment/nginx-deployment -n test
Name:                   nginx-deployment
Namespace:              test
CreationTimestamp:      Thu, 26 Nov 2020 19:32:04 +0800
Labels:                 <none>
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               app=nginx
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:1.9.1
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   nginx-deployment-79fbf694f6 (3/3 replicas created)
Events:
  Type    Reason             Age  From                   Message
  ----    ------             ---- ----                   -------
  Normal  ScalingReplicaSet  30m  deployment-controller  Scaled up replica set nginx-deployment-79fbf694f6 to 1
  Normal  ScalingReplicaSet  25m  deployment-controller  Scaled down replica set nginx-deployment-86b8cc866b to 2
  Normal  ScalingReplicaSet  25m  deployment-controller  Scaled up replica set nginx-deployment-79fbf694f6 to 2
  Normal  ScalingReplicaSet  25m  deployment-controller  Scaled down replica set nginx-deployment-86b8cc866b to 1
  Normal  ScalingReplicaSet  25m  deployment-controller  Scaled up replica set nginx-deployment-79fbf694f6 to 3
  Normal  ScalingReplicaSet  24m  deployment-controller  Scaled down replica set nginx-deployment-86b8cc866b to 0

Execute the kubectl get rs command to check the final state of the two ReplicaSets:

[xcbeyond@localhost k8s]$ kubectl get rs -n test
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-79fbf694f6   3         3         3       39m
nginx-deployment-86b8cc866b   0         0         0       53d

Throughout the upgrade, the system ensured that at least two Pods were available and at most four were running at the same time; Deployment achieves this through a careful algorithm. Deployment must ensure that only a limited number of Pods can be unavailable during the update. By default, it ensures that the number of available Pods is at least the desired replica count minus one, i.e., at most one Pod is unavailable (maxUnavailable=1).

In this way, Deployment can guarantee uninterrupted service during the upgrade process and that the replicas always remain at the specified number.

Update strategy

In the Deployment definition, the Pod update strategy can be specified through spec.strategy. Two strategies are currently supported: Recreate and RollingUpdate; the default is RollingUpdate.

  • Recreate: set spec.strategy.type=Recreate, meaning that when updating Pods, the Deployment first kills all running Pods and then creates new ones.
  • RollingUpdate: set spec.strategy.type=RollingUpdate, meaning that the Deployment updates Pods gradually, in a rolling fashion. Two parameters under spec.strategy.rollingUpdate control the rolling update process, as in the sketch after this list:
    • spec.strategy.rollingUpdate.maxUnavailable: the maximum number of Pods that may be unavailable while the Deployment updates Pods. The value can be an absolute number or a percentage of the desired replica count.
    • spec.strategy.rollingUpdate.maxSurge: the maximum number of Pods by which the total may exceed the desired replica count while the Deployment updates Pods. The value can be an absolute number or a percentage of the desired replica count.
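As a sketch, the two parameters sit under spec.strategy in the Deployment manifest (the values below are illustrative, not taken from the article's deployment):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 Pod below the desired replica count during the update
      maxSurge: 1         # at most 1 Pod above the desired replica count during the update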

5.2 Deployment rollback

By default, the release history of all Deployment is kept on the system so that we can roll back at any time. (The number of historical records can be configured)

You can roll back Deployment by executing the kubectl rollout undo command.

(1) Run the kubectl rollout history command to check the history of a Deployment:

[xcbeyond@localhost k8s]$ kubectl rollout history deployment/nginx-deployment -n test
deployment.apps/nginx-deployment 
REVISION  CHANGE-CAUSE
1         <none>
2         <none>

If you use the --record parameter when creating the Deployment, the CHANGE-CAUSE column shows the create/update command of each revision.

(2) To view the details of a specific revision, add the --revision=<N> parameter:

[xcbeyond@localhost k8s]$ kubectl rollout history deployment/nginx-deployment --revision=2 -n test
deployment.apps/nginx-deployment with revision #2
Pod Template:
  Labels:       app=nginx
                pod-template-hash=79fbf694f6
  Containers:
   nginx:
    Image:      nginx:1.9.1
    Port:       80/TCP
    Host Port:  0/TCP
    Environment:        <none>
    Mounts:     <none>
  Volumes:      <none>

(3) To undo the latest release and roll back to the previous version, i.e., nginx:1.9.1 -> nginx:latest, run the kubectl rollout undo command:

[xcbeyond@localhost k8s]$ kubectl rollout undo deployment/nginx-deployment -n test
deployment.apps/nginx-deployment rolled back

Of course, you can also use the –to-revision argument to specify a version number to roll back to.

(4) You can run the kubectl describe deployment/nginx-deployment command to watch the entire rollback process:

[xcbeyond@localhost k8s]$ kubectl describe deployment/nginx-deployment -n test
Name:                   nginx-deployment
Namespace:              test
CreationTimestamp:      Thu, 26 Nov 2020 19:32:04 +0800
Labels:                 <none>
Annotations:            deployment.kubernetes.io/revision: 3
Selector:               app=nginx
Replicas:               3 desired | 2 updated | 4 total | 3 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:latest
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  nginx-deployment-79fbf694f6 (2/2 replicas created)
NewReplicaSet:   nginx-deployment-86b8cc866b (2/2 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  5m50s  deployment-controller  Scaled up replica set nginx-deployment-86b8cc866b to 1
  Normal  ScalingReplicaSet  4m55s  deployment-controller  Scaled down replica set nginx-deployment-79fbf694f6 to 2
  Normal  ScalingReplicaSet  4m55s  deployment-controller  Scaled up replica set nginx-deployment-86b8cc866b to 2

5.3 Rolling upgrades for RC

For RC rolling upgrades, Kubernetes also provides a kubectl rolling-update command implementation. This command creates a new RC, then automatically reduces the number of Pod copies in the old RC to 0, and increases the number of Pod copies in the new RC from 0 to the target value, thus accomplishing the Pod upgrade.

5.4 Update Policies of Other Objects

Since version 1.6, Kubernetes has introduced rolling upgrade strategies similar to Deployment for DaemonSet and StatefulSet, which automate the application version upgrade through different strategies.

5.4.1 Update strategy for DaemonSet

DaemonSet has two strategies for upgrading:

  • OnDelete: the default upgrade strategy for a DaemonSet. With the OnDelete strategy, after the new DaemonSet configuration is created, new Pods are not created automatically; they are created only after the user manually deletes the old Pods.
  • RollingUpdate: with the RollingUpdate strategy, old Pods are killed automatically and new Pods are created automatically.

5.4.2 Update policy of StatefulSet

Starting with Kubernetes 1.6, the update strategy of StatefulSet is gradually being aligned with Deployment and DaemonSet: the RollingUpdate, Partitioned, and OnDelete strategies are implemented to ensure that Pods in a StatefulSet are updated in order, one by one, with update history retained and the ability to roll back to a historical version. A sketch of the strategy fields follows.
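As a sketch, the strategy is declared under spec.updateStrategy. For a StatefulSet, a partition value keeps Pods with lower ordinals on the old version (the value 2 below is illustrative):

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only Pods with ordinal >= 2 are updated; ordinals 0 and 1 keep the old version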

6. Pod scaling

In a production environment, we may need to adjust (increase or decrease) the number of service instances in different scenarios to make full use of system resources. The Deployment/RC scale mechanism accomplishes this.

Kubernetes provides both manual and automatic Pod scaling:

  • Manual mode: set the number of Pod replicas for a Deployment/RC by running the kubectl scale command or through the RESTful API.
  • Automatic mode: the user specifies a range for the number of Pod replicas based on a performance metric or a custom business metric; the system automatically adjusts the replica count within this range as the metric changes.

6.1 Manual Scaling

Take nginx-deployment.yml as an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
           - name: http
             containerPort: 80

(1) With 3 Pod replicas running, check the Pod status:

[xcbeyond@localhost ~]$ kubectl get pod -n test
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-86b8cc866b-7g4xp   1/1     Running   0          24h
nginx-deployment-86b8cc866b-pgh6c   1/1     Running   0          24h
nginx-deployment-86b8cc866b-qg26j   1/1     Running   0          23h

(2) Run the kubectl scale command to change the number of Pod replicas from 3 to 5:

[xcbeyond@localhost ~]$ kubectl scale deployment nginx-deployment --replicas 5 -n test
deployment.apps/nginx-deployment scaled
[xcbeyond@localhost ~]$ kubectl get pod -n test
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-86b8cc866b-7g4xp   1/1     Running             0          24h
nginx-deployment-86b8cc866b-dbkms   0/1     ContainerCreating   0          5s
nginx-deployment-86b8cc866b-pgh6c   1/1     Running             0          24h
nginx-deployment-86b8cc866b-qg26j   1/1     Running             0          23h
nginx-deployment-86b8cc866b-xv5pm   0/1     ContainerCreating   0          5s
[xcbeyond@localhost ~]$ kubectl get pod -n test
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-86b8cc866b-7g4xp   1/1     Running   0          24h
nginx-deployment-86b8cc866b-dbkms   1/1     Running   0          79s
nginx-deployment-86b8cc866b-pgh6c   1/1     Running   0          24h
nginx-deployment-86b8cc866b-qg26j   1/1     Running   0          23h
nginx-deployment-86b8cc866b-xv5pm   1/1     Running   0          79s

If --replicas is set to a value smaller than the current number of Pod replicas, some running Pods are killed to scale down.

6.2 Automatic Scaling

Starting with version 1.1, Kubernetes provides the Horizontal Pod Autoscaler (HPA) for automatic Pod scaling based on CPU usage.

HPA is a Kubernetes resource object like Deployment and Service.

The goal of the HPA is to automatically adjust the number of Pod replicas to meet application demand and reduce resource waste by tracking the load changes of Pods in the cluster (the kube-controller-manager service on the Master continuously monitors certain performance metrics of the target Pods).

Kubernetes currently supports the following types of metrics:

  • Pod resource usage: A performance indicator at the Pod level, usually a ratio, such as CPU usage.
  • Pod custom metrics: Pod level performance metrics, usually a number, such as requests per second for the service (TPS or QPS).
  • Object or external custom metrics: values that the container application must provide in some way, for example through an HTTP URL such as /metrics, or metrics collected from an external service's URL (for example, a business metric).

How are these metrics counted and queried?

Starting with version 1.11, Kubernetes deprecated the Heapster component for collecting Pod CPU usage and moved fully to Metrics Server based data collection. The Metrics Server exposes the collected Pod performance data to the HPA controller for querying through the aggregation APIs (for example, metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io).

6.2.1 Working principle of HPA

A Metrics Server (Heapster or a custom Metrics Server) in Kubernetes continuously collects metrics for all Pod replicas. The HPA controller obtains these data through the Metrics Server API and calculates the desired number of Pod replicas according to the user-defined scaling rules. When the desired replica count differs from the current one, the HPA controller initiates a scale operation toward the Pod's replica controller (Deployment, RC, or ReplicaSet), equivalent to executing the kubectl scale command in manual mode, to adjust the number of Pod replicas and complete the scaling.
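
The replica calculation at the heart of this loop is the standard formula from the Kubernetes documentation:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

For example, with 2 replicas, a current metric value of 200m, and a target of 100m, the controller requests ceil(2 * 200/100) = 4 replicas; changes within the controller's tolerance band (see --horizontal-pod-autoscaler-tolerance in 6.2.3) leave the count unchanged.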

The following figure describes the key components and workflow in the HPA system:

The HPA controller polls metrics at the probe period defined by the kube-controller-manager service startup parameter --horizontal-pod-autoscaler-sync-period (default value 15s) on the Master.

6.2.2 HPA configuration description

In automatic mode, scaling rules are defined by the user through the HorizontalPodAutoscaler resource object.

The HorizontalPodAutoscaler resource object belongs to the Kubernetes API group autoscaling and currently includes the v1 and v2 versions.

  • autoscaling/v1: supports automatic scaling based on CPU usage only.
  • autoscaling/v2*: supports automatic scaling based on arbitrary metrics, such as resource usage, Pod metrics, and other custom metrics; the current version is autoscaling/v2beta2. (You can check which versions your cluster actually serves, as shown below.)
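
Before writing a manifest, query the API server for the autoscaling versions it serves. The output below is illustrative and varies by cluster version:

[xcbeyond@localhost ~]$ kubectl api-versions | grep autoscaling
autoscaling/v1
autoscaling/v2beta1
autoscaling/v2beta2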

The configuration and usage of HorizontalPodAutoscaler are described below.

For details about configuration items, see:

  1. kubernetes.io/docs/refere…

  2. kubernetes.io/docs/refere…

(1) HorizontalPodAutoscaler configuration based on autoscaling/v1:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

Parameter description:

  • scaleTargetRef: the target object to scale, which can be a Deployment, RC, or ReplicaSet.
  • targetCPUUtilizationPercentage: the expected average CPU utilization per Pod.
  • minReplicas and maxReplicas: the minimum and maximum number of replicas. The system scales automatically within this range, trying to keep each Pod at the target CPU usage.

To use the autoscaling/v1 version of HorizontalPodAutoscaler, you need to pre-install the Heapster component or the Metrics Server to collect Pod CPU usage data.
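
Equivalently, a v1-style autoscaler can be created without writing a manifest via kubectl autoscale. A minimal sketch against the nginx Deployment above (assuming it lives in the test namespace):

# Create an HPA targeting 50% average CPU utilization, scaling between 1 and 10 replicas
kubectl autoscale deployment nginx --cpu-percent=50 --min=1 --max=10 -n test

# Inspect the resulting HorizontalPodAutoscaler
kubectl get hpa nginx -n test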

(2) HorizontalPodAutoscaler configuration based on autoscaling/v2beta2:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Parameter description:

  • scaleTargetRef: the target object to scale, which can be a Deployment, RC, or ReplicaSet.
  • minReplicas and maxReplicas: the minimum and maximum number of replicas. The system scales automatically within this range.
  • metrics: the target metrics.
    • type: the metric type. Four types are available, and one or more may be set:
      • Resource: resource metrics such as CPU and memory. For CPU, set averageUtilization in target to define the target average CPU utilization; for memory, set averageValue in target to define the target average memory usage.
      • Pods: Pod-level metrics. The system averages the metric over all Pod replicas of the target; the target type can only be AverageValue. A custom Metrics Server and monitoring tooling are required to collect and process the metric data.
      • Object: metrics based on some other resource object, or any application-defined metric. A custom Metrics Server and monitoring tooling are likewise required.
      • External: support for external system metrics, introduced in Kubernetes 1.10. For example, users of messaging services or external load balancers provided by public cloud providers can scale their services deployed in Kubernetes based on the performance metrics of these external services.
    • target: the metric target value. The system triggers scaling when the metric reaches the target value.

Example 1: set the metric name to requests-per-second, sourced from the Ingress main-route, with a target value of 2000, i.e. scaling is triggered when the number of requests per second reaches 2000:

metrics:
- type: Object
  object:
    metric:
      name: requests-per-second
    describedObject:
      apiVersion: extensions/v1beta1
      kind: Ingress
      name: main-route
    target:
      type: Value
      value: 2k

Example 2: set the metric name to http_requests, selecting series labeled verb=GET; scaling is triggered when the metric's average value reaches 500:

metrics:
- type: Object
  object:
    metric:
      name: http_requests
      selector: 'verb=GET'
    target:
      type: AverageValue
      averageValue: 500

You can also define multiple types of metrics in the same HorizontalPodAutoscaler resource object. The system calculates a target number of Pod replicas for each metric and scales to the largest of them. For example:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: extensions/v1beta1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k

Example 3: set the metric name to queue_messages_ready, selecting series labeled queue=worker_tasks; automatic scaling is triggered when the metric's average value reaches 30:

metrics:
- type: External
  external:
    metric:
      name: queue_messages_ready
      selector: 'queue=worker_tasks'
    target:
      type: AverageValue
      averageValue: 30

To use metrics from external services, you must install and deploy a monitoring system that can integrate with the Kubernetes HPA model, and set up the mechanism by which that monitoring system collects these metrics, so that subsequent automatic scaling can proceed.

6.2.3 HPA Practice Based on Custom Metrics

Here is a complete example of how to set up and use an HPA system based on custom metrics.

Before scaling automatically on custom metrics, you need to deploy a custom Metrics Server, which can be built with an Adapter based on Prometheus, Microsoft Azure, Google Stackdriver, and others. For building a custom Metrics Server, see the instructions at github.com/kubernetes/…

This section describes how to deploy basic HPA components and configure HPA based on the Prometheus monitoring system.

The HPA architecture based on Prometheus is shown in the figure below:

Key components are described as follows:

  • Prometheus: an open source service monitoring system that periodically collects performance metrics from each Pod.
  • Custom Metrics Server: a custom metrics server built with the Prometheus Adapter. It pulls performance metric data from the Prometheus service, registers the custom metrics API with the Master API Server through Kubernetes' metrics aggregation layer, and serves metric data under the /apis/custom.metrics.k8s.io path.
  • HPA Controller: the Kubernetes HPA controller, which scales automatically according to the user-defined HorizontalPodAutoscaler.

The entire deployment process is as follows:

(1) Enable the required startup parameters on the kube-apiserver and kube-controller-manager services.

The kube-apiserver and kube-controller-manager services are deployed in the kube-system namespace by default.

[xcbeyond@localhost minikube]$ kubectl get pod -n kube-system
NAME                               READY   STATUS    RESTARTS   AGE
coredns-6c76c8bb89-p26xx           1/1     Running   11         103d
etcd-minikube                      1/1     Running   11         103d
kube-apiserver-minikube            1/1     Running   11         103d
kube-controller-manager-minikube   1/1     Running   11         103d
kube-proxy-gcd8d                   1/1     Running   11         103d
kube-scheduler-minikube            1/1     Running   11         103d
storage-provisioner                1/1     Running   29         103d

Note: this Kubernetes environment is a local deployment based on Minikube.

A service's current startup parameters can be viewed with the kubectl describe command, for example:

[xcbeyond@localhost minikube]$ kubectl describe pod/kube-apiserver-minikube -n kube-system
Name:         kube-apiserver-minikube
Namespace:    kube-system
...
Containers:
  kube-apiserver:
    ...
    Command:
      kube-apiserver
      --advertise-address=172.17.0.2
      --allow-privileged=true
      --authorization-mode=Node,RBAC
      --client-ca-file=/var/lib/minikube/certs/ca.crt
      --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,NodeRestriction,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota
      --enable-bootstrap-token-auth=true
      --etcd-cafile=/var/lib/minikube/certs/etcd/ca.crt
      --etcd-certfile=/var/lib/minikube/certs/apiserver-etcd-client.crt
      --etcd-keyfile=/var/lib/minikube/certs/apiserver-etcd-client.key
      --etcd-servers=https://127.0.0.1:2379
      --insecure-port=0
      --kubelet-client-certificate=/var/lib/minikube/certs/apiserver-kubelet-client.crt
      --kubelet-client-key=/var/lib/minikube/certs/apiserver-kubelet-client.key
      --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
      --proxy-client-cert-file=/var/lib/minikube/certs/front-proxy-client.crt
      --proxy-client-key-file=/var/lib/minikube/certs/front-proxy-client.key
      --requestheader-allowed-names=front-proxy-client
      --requestheader-client-ca-file=/var/lib/minikube/certs/front-proxy-ca.crt
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --secure-port=8443
      --service-account-key-file=/var/lib/minikube/certs/sa.pub
      --service-cluster-ip-range=10.96.0.0/12
      --tls-cert-file=/var/lib/minikube/certs/apiserver.crt
      --tls-private-key-file=/var/lib/minikube/certs/apiserver.key
    ...

Service startup parameters can be updated with the kubectl edit command, for example:

[xcbeyond@localhost minikube]$ kubectl edit pod kube-apiserver-minikube -n kube-system

Enable the aggregation layer on the Master API Server by setting the following kube-apiserver startup parameters.

  • --requestheader-client-ca-file=/var/lib/minikube/certs/front-proxy-ca.crt: the client CA certificate.
  • --requestheader-allowed-names=front-proxy-client: the list of client certificate Common Names allowed to supply usernames in the headers specified by --requestheader-username-headers. The Common Names must be signed by the CA in the client-ca-file. If the list is empty, any client certificate validated by the CA may access.
  • --requestheader-extra-headers-prefix=X-Remote-Extra-: the request header prefixes to check.
  • --requestheader-group-headers=X-Remote-Group: the request headers to check for group names.
  • --requestheader-username-headers=X-Remote-User: the request headers to check for usernames.
  • --proxy-client-cert-file=/var/lib/minikube/certs/front-proxy-client.crt: the client certificate the aggregator presents when forwarding requests to backend servers.
  • --proxy-client-key-file=/var/lib/minikube/certs/front-proxy-client.key: the client private key the aggregator uses when forwarding requests.

Configure the following HPA-related startup parameters (optional) for the kube-controller-manager service (a manifest excerpt applying them follows the list):

  • --horizontal-pod-autoscaler-sync-period=10s: the interval at which the HPA controller synchronizes the number of Pod replicas. The default value is 15 seconds.
  • --horizontal-pod-autoscaler-downscale-stabilization=1m: the stabilization window before a scale-down takes effect. The default value is 5 minutes.
  • --horizontal-pod-autoscaler-initial-readiness-delay=30s: the window after Pod startup during which readiness is treated as still settling. The default value is 30 seconds.
  • --horizontal-pod-autoscaler-tolerance=0.1: the tolerance of the scaling calculation. The default value is 0.1, meaning no scaling occurs while the ratio of current to target metric stays within [-10%, +10%].
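
A hypothetical excerpt of the kube-controller-manager static Pod manifest (for example /etc/kubernetes/manifests/kube-controller-manager.yaml in a kubeadm-style setup) with these parameters applied:

spec:
  containers:
    - name: kube-controller-manager
      command:
        - kube-controller-manager
        # HPA tuning flags added here; all other existing flags stay unchanged
        - --horizontal-pod-autoscaler-sync-period=10s
        - --horizontal-pod-autoscaler-downscale-stabilization=1m
        - --horizontal-pod-autoscaler-initial-readiness-delay=30s
        - --horizontal-pod-autoscaler-tolerance=0.1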

(2) Deploy Prometheus.

The Prometheus-operator is used here for deployment.

The Prometheus Operator provides simple definitions for monitoring Kubernetes Services and Deployments and for managing Prometheus instances, simplifying Prometheus deployment, management, and operation on Kubernetes.
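
A minimal installation sketch, assuming the bundle.yaml manifest published in the prometheus-operator GitHub repository (the deployment name below is an assumption):

# Install the Operator CRDs and Deployment from the upstream bundle
kubectl create -f bundle.yaml

# Confirm the Operator is running
kubectl get deployment prometheus-operator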

(3) Deploy a custom Metrics Server.

Deploy with the Prometheus Adapter implementation.
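
The adapter decides which Prometheus series become custom metrics through its rule configuration. A minimal illustrative rule in the prometheus-adapter format, assuming the http_requests_total counter exposed by the application in step (4):

rules:
  # Select the application's counter series that carry namespace and pod labels
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  # Expose the counter as a rate, renamed to http_requests_per_second
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'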

(4) Deploy the application.

The application exposes a RESTful interface /metrics that provides a custom metric named http_requests_total.

(5) Create a ServiceMonitor object for Prometheus to monitor metrics provided by the application.
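
A ServiceMonitor is a Prometheus Operator CRD that tells Prometheus which Services to scrape. A minimal sketch; the sample-app names and labels are placeholders for whatever the application in step (4) actually uses:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app
spec:
  # Match the labels on the application's Service
  selector:
    matchLabels:
      app: sample-app
  endpoints:
    # Scrape the Service's named port "http" on the /metrics path every 15s
    - port: http
      path: /metrics
      interval: 15s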

(6) Create a HorizontalPodAutoscaler object to provide the HPA controller with the automatic expansion configuration that the user expects.
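
A matching HorizontalPodAutoscaler, again with placeholder names, built on the Pods-type metric that the adapter rule sketched in step (3) would expose as http_requests_per_second:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "10"   # scale out above 10 requests/s per Pod on average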

(7) Send HTTP access requests to applications to verify the HPA automatic capacity expansion mechanism.
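
One way to generate that load and watch the autoscaler react, assuming the application is reachable through a Service named sample-app:

# Fire a continuous stream of HTTP requests from a temporary Pod
kubectl run load-generator --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://sample-app; done"

# Watch the HPA status and the replica count adjust
kubectl get hpa sample-app -w
kubectl get pod -l app=sample-app -w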

