The biggest headache for K8s beginners is when a Pod doesn't start up properly, or crashes after running for a while. So what is it that keeps it from working, and what made it collapse: the decay of morals or the distortion of human nature… Sorry, wrong script.

In this article, we will learn and summarize some common errors when using K8s and how to troubleshoot the problems behind them.

Learn these things and you'll be able to hold your own in front of both the R&D team and the ops team.

Don't bluff and get your face slapped, and don't come crying to me if you do, because I get my face slapped all the time. Just try to get it right next time.

The various states of a Pod

After deploying our service with K8s, we always use the following command to query the status of the Pod to see if it is successful.

kubectl get pods

NAME                         READY   STATUS              RESTARTS   AGE
my-app-5d7d978fb9-2fj5m      0/1     ContainerCreating   0          10s
my-app-5d7d978fb9-dbt89      0/1     ContainerCreating   0          10s

Here STATUS represents the state of the Pod. You may run into the following states:

  • ContainerCreating: The container is being created. This is an intermediate state that normally moves on once the container is created, but it can also get stuck.
  • ImagePullBackOff: Failed to pull the container image.
  • CrashLoopBackOff: The container keeps crashing. When the container exits abnormally it gets restarted according to the restart policy, but most likely the restarted container will just crash again. The restarts are not retried back to back forever: after repeated crashes the wait between restarts gets longer and longer (back-off), and the Pod status stays at CrashLoopBackOff.
  • Evicted: The Pod was evicted because node resources (CPU/Mem/Storage) were insufficient. K8s kills Pods that can be evicted from the node according to its eviction policy.
  • Running: The Pod is running normally.
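A couple of handy variations on kubectl get pods (not from the original article, just standard kubectl flags) make it easier to catch these states as they change: -w watches the list and prints a new line every time a Pod's status changes, and -o wide also shows which node each Pod landed on.

# watch Pod status changes in real time
kubectl get pods -w

# also show the node and Pod IP for each Pod
kubectl get pods -o wide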

Let’s take a look at some of the causes of Pod errors and how to troubleshoot them.

Image pull failure

When an image pull fails, the Pod's STATUS shows ImagePullBackOff. There are plenty of possible causes: besides simply writing the wrong image name, a common one is that the official images of some software live in registries hosted abroad, such as docker.io or quay.io, and access to them can be slow.

This situation is easy to reproduce: just point the Deployment at an image tag that doesn't exist and apply it with kubectl. For example, I have been using the following Deployment definition in the K8s tutorial:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-go-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: go-app
  template:
    metadata:
      labels:
        app: go-app
    spec:
      containers:
        - name: go-app-container
          image: kevinyan001/kube-go-app:v0.3
          resources:
            limits:
              memory: "200Mi"
              cpu: "50m"
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: app-storage
              mountPath: /tmp
      volumes:
        - name: app-storage
          emptyDir: {}

We intentionally change the image tag to v0.5. I built this image myself, and there really is no v0.5 yet. After running kubectl apply, observe the Pod status.

➜ kubectl apply -f deployment.yaml
deployment.apps/my-go-app configured

➜ kubectl get pods
NAME                         READY   STATUS              RESTARTS   AGE
my-go-app-5d7d978fb9-2fj5m   1/1     Running             0          3h58m
my-go-app-5d7d978fb9-dbt89   1/1     Running             0          3h58m
my-go-app-6b77dbbcc5-jpgbw   0/1     ContainerCreating   0          7s

➜ kubectl get pods
NAME                         READY   STATUS         RESTARTS   AGE
my-go-app-5d7d978fb9-2fj5m   1/1     Running        0          3h58m
my-go-app-5d7d978fb9-dbt89   1/1     Running        0          3h58m
my-go-app-6b77dbbcc5-jpgbw   0/1     ErrImagePull   0          14s

.....
// Pause for 1 minute, then check again
➜ kubectl get pods
NAME                         READY   STATUS             RESTARTS   AGE
my-go-app-5d7d978fb9-2fj5m   1/1     Running            0          4h1m
my-go-app-5d7d978fb9-dbt89   1/1     Running            0          4h1m
my-go-app-6b77dbbcc5-jpgbw   0/1     ImagePullBackOff   0          3m11s

After we updated the Deployment above, we observed the Pod status change process as follows:

ContainerCreating ===> ErrImagePull ===> ImagePullBackOff

First of all, a Deployment update is a rolling update: the new Pod is created first, and only after it becomes ready does it replace the old one. Then, because the image pull fails, the new Pod reports the intermediate state ErrImagePull; the pull is retried, and once it keeps failing the Pod settles into ImagePullBackOff, where the pull is retried at increasing intervals.
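Because the rollout waits for the new Pod to become Ready, the old Pods keep serving traffic while the new one is stuck. Not from the original article, but two standard kubectl commands are handy here: one to see whether the rollout is stuck, and one to roll back to the previous working image.

# check whether the rollout has finished (this will hang on a stuck image)
kubectl rollout status deployment/my-go-app

# give up and roll back to the previous revision
kubectl rollout undo deployment/my-go-app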

How to find out what caused the pull failure? View its event log via kubectl describe pod {pod-name}

➜ kubectl describe pod my-go-app-6b77dbbcc5-jpgbw
Name:         my-go-app-6b77dbbcc5-jpgbw
Namespace:    default
Priority:     0
...
Controlled By:  ReplicaSet/my-go-app-6b77dbbcc5
Containers:
  go-app-container:
    Container ID:
    Image:          kevinyan001/kube-go-app:v0.5
    Image ID:
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
...
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m12s                default-scheduler  Successfully assigned default/my-go-app-6b77dbbcc5-jpgbw to docker-desktop
  Normal   Pulling    27s (x4 over 2m12s)  kubelet            Pulling image "kevinyan001/kube-go-app:v0.5"
  Warning  Failed     20s (x4 over 2m4s)   kubelet            Failed to pull image "kevinyan001/kube-go-app:v0.5": rpc error: code = Unknown desc = Error response from daemon: manifest for kevinyan001/kube-go-app:v0.5 not found: manifest unknown: manifest unknown
  Warning  Failed     20s (x4 over 2m4s)   kubelet            Error: ErrImagePull
  Normal   BackOff    4s (x5 over 2m4s)    kubelet            Back-off pulling image "kevinyan001/kube-go-app:v0.5"
  Warning  Failed     4s (x5 over 2m4s)    kubelet            Error: ImagePullBackOff

The Pod's event log records every state change the Pod went through from start to finish, and what caused each change. Here the Failed events point directly at the cause: the image manifest could not be found.

Events:
  Type     Reason   Age                 From     Message
  ----     ------   ----                ----     -------
  Warning  Failed   20s (x4 over 2m4s)  kubelet  Failed to pull image "kevinyan001/kube-go-app:v0.5": rpc error: code = Unknown desc = Error response from daemon: manifest for kevinyan001/kube-go-app:v0.5 not found: manifest unknown: manifest unknown
  Warning  Failed   20s (x4 over 2m4s)  kubelet  Error: ErrImagePull
  Normal   BackOff  4s (x5 over 2m4s)   kubelet  Back-off pulling image "kevinyan001/kube-go-app:v0.5"
  Warning  Failed   4s (x5 over 2m4s)   kubelet  Error: ImagePullBackOff

Pulls can also fail for network reasons, or because the image registry rejects the request when you don't have pull permission. Since my environment already has the network access and accelerators set up, I won't demonstrate those two cases here.

However, the troubleshooting method is the same: use kubectl describe to check the Pod's events, and run docker pull to try pulling the image manually. If it really is a network problem and the image can't come down, you can consider going over the wall, or use a domestic acceleration node.

To configure an accelerator, you can use Ali Cloud's free image accelerator. The configuration document is linked below; you need to register an Ali Cloud account to use it.

help.aliyun.com/product/607…
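If instead the pull fails because the image lives in a private registry that requires credentials, the usual fix is an image pull secret referenced from the Pod spec. A minimal sketch, assuming a hypothetical registry registry.example.com and secret name my-registry-secret (replace these with your own values):

# create a docker-registry secret holding the registry credentials
kubectl create secret docker-registry my-registry-secret \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password>

# then reference the secret in the Pod template of the Deployment
spec:
  imagePullSecrets:
    - name: my-registry-secret
  containers:
    - name: go-app-container
      image: registry.example.com/kube-go-app:v0.3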

The container crashed after startup

Next comes the bug scenario: the program running inside the container has an internal problem and the container keeps crashing, which ultimately shows up on the Pod as the CrashLoopBackOff state.

It is a bit harder to demonstrate a container crash on demand, but fortunately I built an image for this earlier, when I introduced automatic profiling sampling for Go services.

The following content refers to my previous article: Design and implementation of automatic sampling performance analysis for Go service

I made a Docker image to make the experiment easier. The image has been uploaded to Docker Hub, so if you are interested you can quickly try it on your own machine.

Run the following command to try it out.

docker run --name go-profile-demo -v /tmp:/tmp -p 10030:80 --rm -d kevinyan001/go-profiling

The Go service in the container provides a handful of HTTP routes (they are listed in the article referenced above).

So we deliberately create a container-crash situation: replace the image in the Deployment's Pod template above with kevinyan001/go-profiling and trigger an OOM through one of the routes it provides.

Modify the container image used by Pod

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-go-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: go-app
  template:
    metadata:
      labels:
        app: go-app
    spec:
      containers:
        - name: go-app-container
          image: kevinyan001/go-profiling:latest
          resources:
            limits:
              memory: "200Mi"
              cpu: "50m"

Create a Service (SVC) so the Pod can receive external traffic

apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  type: NodePort
  selector:
    app: go-app
  ports:
    - name: http
      protocol: TCP
      nodePort: 30080
      port: 80
      targetPort: 80
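After applying the Service, you can quickly confirm the NodePort and that it actually selected our Pods. These are just standard kubectl checks, not part of the original article:

kubectl apply -f service.yaml
kubectl get svc app-service
kubectl get endpoints app-service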

The route we will use from the program is the following one:

Visiting http://127.0.0.1:30080/1gb-slice makes the container exceed its memory limit and get killed. Because the crashed Pod gets restarted right away, this is partly a test of hand speed: trigger the route again after each restart. After about a minute of this, the restarts get backed off (K8s pauses and waits a while before restarting the Pod again), and the Pod status changes to:

➜ kubectl get pods
NAME                         READY   STATUS             RESTARTS      AGE
my-go-app-598f697676-f5jfp   0/1     CrashLoopBackOff   2 (18s ago)   5m37s
my-go-app-598f697676-tps7n   0/1     CrashLoopBackOff   2 (23s ago)   5m35s
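If your hand speed is not up to it, a small shell loop (not from the original article, just a convenience) can hammer the route for you. The requests will start failing once the container is killed, which is expected:

# hit the OOM route a few times; ignore failures once the container dies
for i in 1 2 3; do
  curl -s http://127.0.0.1:30080/1gb-slice > /dev/null || true
  sleep 1
done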

At this point we can use kubectl describe pod to look at the details of the crashed Pod and see the exit code the application in the container returned.

➜ kubectl describe pod my-go-app-598f697676-tps7n
Name:         my-go-app-598f697676-tps7n
Namespace:    default
...
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sun, 20 Mar 2022 16:09:29 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 20 Mar 2022 16:08:56 +0800
      Finished:     Sun, 20 Mar 2022 16:09:05 +0800
...
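Exit Code 137 is 128 + 9, meaning the process was terminated with SIGKILL; for a container with a memory limit this usually means it was OOM-killed for exceeding the limit. A quick way to pull out just the last termination state is a standard jsonpath query (the Pod name here is from the example above):

kubectl get pod my-go-app-598f697676-tps7n \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'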

kubectl logs {pod-name} is another command that can be used to dig further into problems with the Pod's containers.

kubectl logs my-go-app-598f697676-tps7n

If the Pod has just been restarted and the current logs show nothing, add the --previous option to view the logs of the previous container instance.

kubectl logs my-go-app-598f697676-tps7n --previous
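When a Deployment has several replicas, you don't have to copy Pod names one by one; kubectl logs also accepts a label selector. A small sketch using the app=go-app label from the Deployment above:

# fetch the last 50 log lines from every Pod matching the label
kubectl logs -l app=go-app --tail=50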

The Pod was evicted

First of all, this one usually can't be fixed by the developers themselves, but you can use your imagination: when the alert fires in the group chat and ops @-mentions you to take a look, you can counter with "the node is out of resources, go scale up the capacity" ^_^…

Anyway, back to the point. When resources in the cluster are tight, K8s evicts lower-priority Pods first, and the evicted Pod's status becomes Evicted. This situation is hard to simulate locally, so here is a screenshot of it happening in my company's K8s cluster.

kubectl get pod showing the Pod status (screenshot)

In the screenshot above, you can see that one of the Pods' status has changed to Evicted.

Use describe again for more details

kubectl describe pod showing the Pod details and event log (screenshot)

Sorry, the screenshot above is blurry; it is quite old. The Message column in the picture gives roughly the following information:

Status: Failed
Reason: Evicted
Message: The node was low on resource: xxx-storage. Container xxx was using xxxKi,
which exceeds its request of ....
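Evicted Pods are not cleaned up automatically; they hang around as Failed objects until someone deletes them. A small sketch with standard kubectl (not from the original article) for finding and clearing them; note that the field selector matches all Failed Pods, not only evicted ones:

# list Failed Pods (evicted Pods are in phase Failed) across all namespaces
kubectl get pods -A --field-selector=status.phase=Failed

# clean them up in a specific namespace
kubectl delete pods --field-selector=status.phase=Failed -n <namespace>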

Conclusion

In general, most common deployment failures can be rectified and debugged using these commands:

  • kubectl get pods
  • kubectl describe pod <podname>
  • kubectl logs <podname>
  • kubectl logs <podname> --previous

Of course, if you sometimes want to look at a Pod's configuration, you can also use

  • kubectl get pod <podname> -o=yaml to verify that the Pod's configuration matches what we submitted, plus some additional status information.

Besides Pod status and details, the get and describe commands can also inspect the status and information of other kinds of resources:

kubectl get pod|svc|deploy|sts|configmap <xxx-name>

kubectl describe pod|svc|deploy|sts|configmap <xxx-name>
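One more standard trick worth knowing (not covered above): the cluster's event stream often tells you what happened even after a Pod is gone. Sorting by creation time puts the most recent events at the bottom:

kubectl get events --sort-by=.metadata.creationTimestamp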

I'll leave those for you to explore. To make local experiments easier, you can get all the YAML templates used in this article by replying [K8s] to the public account "NMS bi bi"; give them a try if you're interested.