Delaying Shutdown to Wait for Pod Deletion Propagation in a Kubernetes Cluster

Published: Jan 26, 2019

Original link: blog.gruntwork.io/delaying-sh…

Article by Yorinasub17

This is the third installment in our series on “Kubernetes Cluster Zero Downtime Updates.” In part 2 of this series, we used Pod lifecycle hooks to shut down an application Pod gracefully, mitigating the downtime caused by Pods terminating before they finish processing existing requests. However, we also learned that although a Pod stops accepting new requests once its shutdown sequence has begun, in reality it may still continue to receive new traffic. This means that clients may end up receiving error responses because their requests are routed to a Pod that can no longer serve them. Ideally, we want the Pod to stop receiving traffic as soon as it starts shutting down. To mitigate this, we must first understand why the Pod still receives new traffic after its shutdown has begun.

Much of the information in this article was taken from the book “Kubernetes in Action”. In addition to the information presented here, this book provides best practices for running applications on Kubernetes, so you are strongly encouraged to read it.

Pod shutdown sequence

In the previous article, “How to Gracefully Shut Down a Pod,” we introduced the Pod eviction lifecycle. The first step in the eviction sequence is deleting the Pod, which triggers a series of events that eventually lead to the Pod being removed from the system. What we didn’t cover in that article, however, is how the Pod gets deregistered from the Service above it so that it stops receiving traffic.

Before the Pod can actually stop running, it needs to be removed from the Service.

So, what would cause a Pod to be unregistered from a Service? To understand this, we need to take a deeper look at what happens when a Pod is removed from a cluster.

When a Pod is deleted from the cluster through the Kubernetes API, the Pod is marked for deletion in the metadata server. This sends a Pod deletion notification to all relevant subsystems, each of which then handles it:

Here, the Kubernetes API server acts as the metadata server, and the subsystems are the core components of Kubernetes.

  • The kubelet on the node where the Pod is running starts the Pod shutdown sequence described in the previous article.
  • The kube-proxy daemon running on every node removes the Pod’s IP address from iptables.
  • The endpoint controller removes the Pod from the list of valid endpoints; in our example, this means removing the Pod’s endpoint from the list of Endpoints managed by the Service.

We don’t need to know the details of every subsystem. The point is that multiple subsystems are involved, they may run on different nodes, and the operations listed above happen in parallel. Because of this, it is quite likely that the Pod has already started executing its preStop hook and received the TERM signal before it has been removed from every list of active endpoints. This is why a Pod continues to receive traffic even after its shutdown sequence has begun.
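To make the endpoint controller’s role concrete, below is a minimal sketch of the kind of Service that could sit in front of our nginx Pods. The name nginx-service and the port numbers are assumptions for illustration; the important part is that the selector matches the app: nginx label from our Pod template, and that the endpoint controller maintains a matching list of ready Pod IPs from which the Pod gets removed during deletion:

apiVersion: v1
kind: Service
metadata:
  # Name assumed for illustration purposes only.
  name: nginx-service
spec:
  # The endpoint controller tracks the IPs of ready Pods matching this
  # selector; deregistration removes the Pod's IP from that list.
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80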

Delaying the shutdown sequence

On the face of it, we could serialize the above sequence of events, holding off the Pod shutdown sequence until the Pod to be deleted has been deregistered from all relevant subsystems. However, this is difficult to do in practice because of the distributed nature of Kubernetes. What happens if one of the nodes runs into a network partition? Do you wait indefinitely for the events to propagate? What if the node comes back online? What if your Kubernetes cluster has thousands of nodes?

Unfortunately, the reality is that there is no perfect solution that prevents every outage. What we can do, however, is introduce enough delay into the Pod shutdown sequence to capture 99% of the cases. To do this, we introduce a sleep in the preStop hook that delays the Pod shutdown sequence. Next, let’s see how this works in our example.

We’ll update the resource definition file we’ve been using and add a sleep command to the preStop hook to introduce the delay. In “Kubernetes in Action”, the author Lukša recommends a delay of 5 to 10 seconds, so here we will use a 5-second delay as part of the preStop hook:

lifecycle:
  preStop:
    exec:
      # Introduce a delay to the shutdown sequence to wait for the
      # pod eviction event to propagate. Then, gracefully shutdown nginx.
      command: ["sh"."-c"."sleep 5 && /usr/sbin/nginx -s quit"]

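One caveat worth keeping in mind: the time spent in the preStop hook counts against the Pod’s terminationGracePeriodSeconds, which defaults to 30 seconds, and if the hook plus the application shutdown exceed that window, the container is killed with SIGKILL. The sketch below, under that assumption, shows how the delay fits into the container spec with the grace period spelled out explicitly:

spec:
  # Sketch only: the 5 second sleep plus nginx's graceful shutdown must
  # complete within this window, or the kubelet force-kills the container.
  terminationGracePeriodSeconds: 30
  containers:
  - name: nginx
    image: nginx:1.15
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5 && /usr/sbin/nginx -s quit"]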

Now, let’s walk through what happens during the shutdown sequence in this example. As analyzed in the previous article, we use kubectl drain to evict the Pods from a node. This sends a Pod deletion event, which is delivered to both the kubelet and the endpoint controller (here, the controller maintaining the Endpoints of the Service that sits above the Pod) at the same time. We assume that the preStop hook starts before the Service has removed the Pod from its list of available endpoints.

When the preStop hook executes, it delays the second command, which shuts down Nginx, by five seconds. During this time, the Service removes the Pod from its list of available endpoints.

During this delay, the Pod is still up, so even if it receives new connection requests, it can still handle them. And once the Pod has been removed from the Service, client requests are no longer routed to the Pod that is about to shut down. Therefore, as long as the Service finishes processing these events within the delay, there will be no downtime in the cluster.

Finally, the preStop hook process wakes up from its sleep and shuts down the Nginx container, after which the container is removed from the node.

At this point, we can safely perform any upgrades on Node1, including rebooting the node to load a new kernel version. If we have already started a new node to take over the workload that was running on Node1, we can also shut Node1 down.

Recreating the Pod

If you’ve read this far, you might be wondering how we recreate the Pods that were originally scheduled on the node under maintenance. We now know how to shut down Pods gracefully, but what if we want to keep the number of running Pods constant? This is where the Deployment controller comes into play.

The Deployment controller is responsible for maintaining the specified desired state on the cluster. If you recall our resource definition file, we did not create the Pod directly. Instead, we let the Deployment manage Pods for us automatically by giving it a template for creating Pods. Here is what the Pod template section of the definition looks like:

  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80

The template specifies that the Pod should run a container with the nginx:1.15 image, carry the label app: nginx, and expose port 80.

In addition to the Pod template, we also provide the Deployment resource with a configuration that specifies the number of Pod replicas to maintain:

spec:
  replicas: 2

This tells the Deployment controller that it should always keep two Pods running on the cluster. Whenever the number of running Pods drops, the Deployment controller automatically creates a new Pod to replace it. So in our example, when we evict the Pod from the node with kubectl drain, the Deployment controller automatically recreates it on another available node, keeping the current state in line with the desired state specified in the definition.
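Piecing the snippets above together, the Deployment we have been working with might look roughly like the following sketch. The resource name nginx-deployment is an assumption for illustration; the replica count, Pod template, and preStop hook are the ones shown earlier:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment   # name assumed for illustration
spec:
  replicas: 2              # the Deployment controller keeps two Pods running
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              # Delay shutdown so the Pod deletion event can propagate, then
              # let nginx finish in-flight requests and exit gracefully.
              command: ["sh", "-c", "sleep 5 && /usr/sbin/nginx -s quit"]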

Conclusion

In summary, with a sufficient delay and graceful termination in the preStop hook, we can now shut down the Pods on a single node cleanly. And with a Deployment, any Pod that is shut down is automatically recreated. But what if we want to replace all of the nodes in the cluster at once?

If we naively restart all the nodes, the Service’s load balancer may be left with no available Pods, taking the system down. Worse, for stateful systems, doing so can break their quorum mechanism.

To handle this situation, Kubernetes provides a feature called PodDisruptionBudget, which lets you specify how many of an application’s Pods can be down at any given point in time. In the next and final part of this series, we’ll show how to use it to limit the number of simultaneous node eviction events.
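As a brief preview, a minimal PodDisruptionBudget for our example might look like the sketch below. The resource name and the minAvailable value are assumptions for illustration (the apiVersion is the policy/v1beta1 version available at the time of writing); the next article covers how to use it in detail:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb          # name assumed for illustration
spec:
  # Voluntary disruptions (such as drain-triggered evictions) are blocked
  # if they would leave fewer than one ready nginx Pod.
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx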

Recommended reading

  • How to gracefully shut down a Pod in a Kubernetes cluster
  • Deployment Application details