Graceful shutdown and traffic draining are two important means of keeping a multi-node application continuously and stably available. Graceful shutdown ensures that an application node finishes processing the requests it has already received before it shuts down. In the earlier article "Learn to write HTTP services with Go" I described how to shut down an HTTP service gracefully using the http.Server Shutdown method provided by the net/http package; today I want to show how to shut down distributed gRPC services gracefully. While an application is in its graceful-shutdown phase it no longer serves newly arriving traffic, and if new traffic keeps reaching it, clients sending those requests will inevitably perceive the service as unavailable. So before gracefully shutting down an application node we first drain its traffic, making sure the gateway no longer routes new requests to the node that is about to shut down.

If the service is deployed on cloud hosts, draining is as simple as removing the node's IP address from the load balancer and attaching it again after the application has been restarted or updated. In a Kubernetes cluster, however, draining traffic by hand is not practical, so we will also look at how to let Kubernetes drain traffic automatically from the application nodes that are about to shut down.

Graceful shutdown

In this chapter, besides introducing the gRPC framework's graceful-shutdown method, we will also walk through the full lifecycle of deleting a Pod in a Kubernetes cluster, because if our gRPC service is deployed in Kubernetes, both graceful shutdown and traffic draining depend on the Pod deletion lifecycle, also called the Pod shutdown sequence. If you see me say "Pod shutdown sequence" or "Pod deletion lifecycle" below, keep in mind that they are two names for the same thing.

gRPC's GracefulStop

The gRPC framework communicates over HTTP/2, which uses the GOAWAY frame (type 0x7) to initiate connection shutdown or to signal a serious error condition. GOAWAY lets an endpoint gracefully stop accepting new streams while still finishing the processing of streams that were established earlier.

The Go implementation of gRPC Server provides two exit methods, Stop and GracefulStop, and the name alone tells you that the latter is the one used for graceful shutdown. GracefulStop first closes the service listeners so that no new connections can be made, then loops over all current connections and sends each one a GOAWAY frame. s.serveWG.Wait() waits for all handleRawConn goroutines to exit (gRPC Server starts a handleRawConn goroutine for every new connection and increments the WaitGroup counter).

func (s *Server) GracefulStop() {
    s.mu.Lock()
    ...

    // Close the listeners so that no new connections can be accepted
    for lis := range s.lis {
        lis.Close()
    }
    s.lis = nil

    if !s.drain {
        for st := range s.conns {
            // Send a GOAWAY frame to every connection
            st.Drain()
        }
        s.drain = true
    }

    // Wait for all handleRawConn goroutines to exit. Each new connection is handled
    // by its own goroutine, tracked by this WaitGroup.
    s.serveWG.Wait()

    // Wait until all connections are gone. When serveStreams returns it broadcasts
    // on s.cv, and any client exit triggers removeConn, which also wakes this loop up.
    for len(s.conns) != 0 {
        s.cv.Wait()
    }
    ...

Compared with GracefulStop, the Stop method skips sending a GOAWAY frame to each connection and does not wait for connections to finish, so I won't cover it here.
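
As a side note, a pattern I often combine with these two methods (this is my own sketch, not something from the article or the gRPC documentation) is to bound GracefulStop with a timer and fall back to Stop if connections refuse to drain; it assumes the "time" and "google.golang.org/grpc" imports, and the 20-second limit is an arbitrary value:

// Sketch: wait for GracefulStop to drain connections, but never longer than 20s.
func stopWithTimeout(s *grpc.Server) {
    done := make(chan struct{})
    go func() {
        s.GracefulStop()
        close(done)
    }()
    select {
    case <-done:
        // all in-flight requests finished and connections drained
    case <-time.After(20 * time.Second):
        s.Stop() // force-close whatever is still open
    }
}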

Listening for OS signals and starting a graceful shutdown

Now that we know the graceful-shutdown method the gRPC framework provides, our application, just like an HTTP service, should listen for OS signals such as TERM and Interrupt and then actively call GracefulStop to shut the service down gracefully. Of course we can also do some other work before that call. For example, if the application registers itself in etcd, I suggest first removing the key holding the node's IP from etcd: if that key does not expire in time, client-side load balancing may not learn that the node's IP has been removed and may still send requests to the endpoint that is about to close.
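
As an illustration of that deregistration step, here is a rough sketch assuming the node registered itself in etcd with a leased key (the import path, key naming and lease handling are my assumptions, not the article's actual registration scheme):

// Sketch: revoke the registration lease before calling GracefulStop, so that
// clients doing etcd-based discovery stop picking this endpoint.
// Assumes: import clientv3 "go.etcd.io/etcd/client/v3", plus "context" and "time".
func deregister(cli *clientv3.Client, leaseID clientv3.LeaseID) error {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    // Revoking the lease deletes every key attached to it, for example a
    // hypothetical key like /services/my-grpc-app/10.0.0.12:8080
    _, err := cli.Revoke(ctx, leaseID)
    return err
}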

Below is the code a gRPC service runs after startup to watch for OS termination signals and begin a graceful shutdown. The demo is only pseudo-code, but it is realistic enough that you can use it as a template and plug in your own methods.

errChan := make(chan error)

stopChan := make(chan os.Signal, 1)
// SIGKILL cannot be caught, so we only listen for SIGTERM and SIGINT.
signal.Notify(stopChan, syscall.SIGTERM, syscall.SIGINT)

go func() {
    if err := grpcServer.Serve(lis); err != nil {
        errChan <- err
    }
}()

select {
case err := <-errChan:
    panic(err)
case <-stopChan:
    // Do the project's own cleanup work first (e.g. deregistration)
    DoSomeCleanJob()
    // Gracefully stop the service
    grpcServer.GracefulStop()
}

The lifecycle a Kubernetes Pod goes through when shutting down

Kubernetes deletes, re-creates and reschedules Pods when an application needs to be updated, when a node needs to be upgraded or maintained, or when a node runs out of resources. Before a Pod in a Kubernetes cluster is deleted, it goes through the following lifecycle:

  1. The Pod is marked as Terminating. At this point the Pod stops receiving new traffic.
  2. The Pod's preStop hook is executed; in the hook we can run a command or send an HTTP request. Most applications can handle the TERM signal from the OS themselves, but if the application depends on an external system it does not control, the hook can send the requests needed to complete actions such as deregistration. The Service-level traffic draining described later also relies on the preStop hook.
  3. Kubernetes sends a SIGTERM signal to the Pod.
  4. Kubernetes waits 30 seconds by default for the Pod to shut down. If the application needs more than 30 seconds to exit cleanly, you can set terminationGracePeriodSeconds in the Deployment to tell Kubernetes how long to wait (see the snippet after this list). Note that the preStop hook execution and the SIGTERM delivery are both counted against this period. If the application shuts down earlier than that, the Pod immediately moves to the next phase of its lifecycle.
  5. Kubernetes sends SIGKILL to the application and then deletes the Pod.
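
As a minimal sketch of point 4 (the container name, image and the 60-second value are illustrative, not from the article), terminationGracePeriodSeconds sits at the same level as containers in the Pod template of a Deployment:

spec:
  template:
    spec:
      # Give the Pod up to 60 seconds (instead of the default 30) to exit cleanly.
      terminationGracePeriodSeconds: 60
      containers:
        - name: go-big-app
          image: example/go-big-app:latest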

With the gRPC service above deployed in a Kubernetes cluster, if a node upgrade or anything else requires shutting down a Pod on that node, the application receives the TERM signal Kubernetes sends to the Pod and actively completes the graceful shutdown of the service.

For more details on the lifecycle that Pod shutdown goes through, take a look at my recent article “How to Gracefully Shut down Pods in a Kubernetes Cluster.”

The Kubernetes Service abstraction

Before talking about the Kubernetes Service abstraction, let's briefly revisit the Pod and Service resource concepts. Our application runs in a container, and Kubernetes wraps that container in a Pod. A Pod can hold multiple containers, but only one main container runs the main process; the others are auxiliary, which is the Sidecar pattern that Pods support.

Kubernetes uses the Service resource to group a set of Pods and give the outside world a unified way to access them. A Service uses a label selector to find matching Pods and adds them to its endpoint list.

Therefore, the IP registered with the registry when the application starts is not the IP of the Pod the application runs in, but the IP of the NodePort-type Service above it. That IP is a VIP: accessing it load-balances automatically and routes traffic randomly to one of the Pods attached behind the Service.
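
As a reminder of how that matching works, here is a minimal NodePort Service sketch (the name, label and ports are made up for illustration); any Pod carrying the label app: go-big-app becomes an endpoint of this Service:

apiVersion: v1
kind: Service
metadata:
  name: go-big-app
spec:
  type: NodePort
  # Pods whose labels match this selector are added to the Service's endpoint list.
  selector:
    app: go-big-app
  ports:
    - port: 8080        # the port exposed on the Service VIP
      targetPort: 8080  # the container port on the Pod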

If you are not familiar with Pod and Service concepts, you can read my previous articles to understand these things.

“Kubernetes Pod Starter Guide”

“Combine learning and practice to quickly master Kubernetes Service”

The Service itself watches Pods and removes shutting-down Pods from its endpoints, but if your application handles enough traffic, that removal is sometimes delayed and new traffic keeps arriving while a Pod is shutting down. Clients then see a brief outage when the service restarts, or when a node in the Kubernetes cluster is upgraded or rebooted and the Pods on it are rescheduled to other nodes. This is a real problem, because Kubernetes rescheduling Pod resources onto other nodes within the cluster is quite common. After digging through the Kubernetes documentation and some community discussions, we finally found the reason the Service is slow to drain traffic (it did not take much effort; both GitHub and Kubernetes in Action discuss this problem).

The reason is that Kubernetes broadcasts the Pod deletion event across the cluster, and several subsystems receive and handle that broadcast, among them:

  • The Kubelet on the node hosting the Pod to be deleted receives the event and starts the Pod shutdown lifecycle described above: the Pod rejects new traffic and waits for the actions within the lifecycle to finish before being deleted.
  • The Service controller for that Pod receives the event and removes the Pod from the Service's endpoint list.

It is entirely possible that the Pod has already entered its shutdown sequence while the Service has not finished removing it from the endpoint list, so the Service may still route new traffic to the Pod that is shutting down.

Both the community and Kubernetes in Action offer essentially the same solution to this problem: use the preStop hook in the Pod shutdown lifecycle to sleep for 5 to 10 seconds, delaying the Pod's shutdown so the Service has time to finish removing it from the endpoint list. Executing preStop does not prevent the Pod's application from continuing to process the requests it already has.

The configuration snippet below adds that delay with a preStop hook.

containers:
  - args:
      - /bin/bash
      - -c
      - /go-big-app
    ...
    # The preStop hook below introduces a 10-second delay
    lifecycle:
      preStop:
        exec:
          command:
            - sh
            - -c
            - sleep 10

This staggers the two parallel actions, endpoint removal and graceful shutdown, on the timeline as much as possible, so that even a delayed Service removal finishes before the Pod actually goes away.

For a detailed description of this problem and its solution, see my earlier translated article "Pod destreaming with Pod Deletion Event propagation", which explains the origin of the problem and the fix with detailed diagrams.

Conclusion

These are some of the lessons our R&D team learned while working on high availability for our services. None of these points is hard to master within its own area: almost every application written in Go uses signal.Notify to receive OS signals and shut down gracefully, and anyone doing Kubernetes operations will already know the concepts and fixes above. But as developers, if we can cross over a bit more, combine knowledge like this from the Kubernetes ecosystem with our own program development, and take the initiative to push operations to solve these problems together with us, the experience and the sense of accomplishment are quite different.

Finally, if you are interested in Go programming practice, advanced Go, or Kubernetes, you can visit my public account "Webmaster bi Bi bi" and find the original articles on these topics in the menu bar. I hope these easy-to-understand articles help fellow travellers on the same road.