The original article is available on GitHub.

What is a smooth restart

When online code needs to be updated, the usual practice is to shut down the service and then restart it. At that moment there may be a large number of in-flight requests; if we kill the service outright, all of them are interrupted and the user experience suffers. Meanwhile, new requests arriving before the service is back up will get a 502. Two problems need to be solved:

  • The old service must finish the requests it is handling before exiting (graceful exit)
  • Incoming requests must keep being served normally, without interruption (smooth restart)

This article covers the approach selection and hands-on practice based on Linux and Golang.

Graceful exit

The first issue to address before a graceful restart can be implemented is how to exit gracefully. Since Go 1.8, the standard library's http.Server has provided a Shutdown method for exactly this, and many HTTP graceful-restart libraries are built on top of http.Server.Shutdown.

HTTP shutdown source analysis

Let's take a look at the main logic of Shutdown: it marks the exit state with an atomic store, closes the listeners and other resources, and then blocks, polling every 500ms, until no active connections remain.

var shutdownPollInterval = 500 * time.Millisecond

func (srv *Server) Shutdown(ctx context.Context) error {
	// Mark the exit state.
	atomic.StoreInt32(&srv.inShutdown, 1)

	srv.mu.Lock()
	// Close the listening fds so no new connections can be established.
	lnerr := srv.closeListenersLocked()
	// Close server.doneChan.
	srv.closeDoneChanLocked()
	// Run the shutdown callbacks; we can register callbacks to run on shutdown.
	for _, f := range srv.onShutdown {
		go f()
	}
	srv.mu.Unlock()

	// Every 500ms, check whether there are no more idle connections,
	// and also watch the ctx passed in by the caller.
	ticker := time.NewTicker(shutdownPollInterval)
	defer ticker.Stop()
	for {
		if srv.closeIdleConns() {
			return lnerr
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

...

func (s *Server) closeIdleConns() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	quiescent := true
	for c := range s.activeConn {
		st, unixSec := c.getState()
		// Treat connections stuck in StateNew for more than 5 seconds as idle.
		if st == StateNew && unixSec < time.Now().Unix()-5 {
			st = StateIdle
		}
		if st != StateIdle || unixSec == 0 {
			quiescent = false
			continue
		}
		c.rwc.Close()
		delete(s.activeConn, c)
	}
	return quiescent
}
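As a usage aside, here is a minimal graceful-exit sketch built on Shutdown (my own illustration, not code from the article; the port and timeout are arbitrary):

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	idleConnsClosed := make(chan struct{})
	go func() {
		quit := make(chan os.Signal, 1)
		signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
		<-quit

		// Give in-flight requests up to 30s to finish.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("shutdown: %v", err)
		}
		close(idleConnsClosed)
	}()

	// Serve returns ErrServerClosed as soon as Shutdown closes doneChan.
	if err := srv.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatalf("listen: %v", err)
	}
	<-idleConnsClosed // wait for Shutdown to finish draining connections
}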

Shutdown closes server.doneChan and the listening file descriptors:

// Close doneChan.
func (s *Server) closeDoneChanLocked() {
	ch := s.getDoneChanLocked()
	select {
	case <-ch:
		// Already closed. Don't close again.
	default:
		// Safe to close here. We're the only closer, guarded by s.mu.
		close(ch)
	}
}

// Close the listening fds.
func (s *Server) closeListenersLocked() error {
	var err error
	for ln := range s.listeners {
		if cerr := (*ln).Close(); cerr != nil && err == nil {
			err = cerr
		}
		delete(s.listeners, ln)
	}
	return err
}

// Close the connection.
func (c *conn) Close() error {
	if !c.ok() {
		return syscall.EINVAL
	}
	err := c.fd.Close()
	if err != nil {
		err = &OpError{Op: "close", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}
	}
	return err
}

After this series of operations, the master accept loop in server.go's Serve method exits with ErrServerClosed:

func (srv *Server) Serve(l net.Listener) error {
    ...
    for {
        rw, e := l.Accept()
        if e != nil {
            select {
            // doneChan was closed; exit.
            case <-srv.getDoneChan():
                return ErrServerClosed
            default:
            }
            ...
            return e
        }
        tempDelay = 0
        c := srv.newConn(rw)
        c.setState(c.rwc, StateNew) // before Serve can return
        go c.serve(ctx)
    }
}

So how does the server make sure a user's in-flight request completes before the connection is closed?

func (s *Server) doKeepAlives() bool {
	return atomic.LoadInt32(&s.disableKeepAlives) == 0 && !s.shuttingDown()
}

// Serve a new connection.
func (c *conn) serve(ctx context.Context) {
	defer func() {
		...
		if !c.hijacked() {
			// Close the connection and mark it closed.
			c.close()
			c.setState(c.rwc, StateClosed)
		}
	}()
	...
	ctx, cancelCtx := context.WithCancel(ctx)
	c.cancelCtx = cancelCtx
	defer cancelCtx()
	c.r = &connReader{conn: c}
	c.bufr = newBufioReader(c.r)
	c.bufw = newBufioWriterSize(checkConnErrorWriter{c}, 4<<10)
	for {
		// Read the next request.
		w, err := c.readRequest(ctx)
		if c.r.remain != c.server.initialReadLimitSize() {
			c.setState(c.rwc, StateActive)
		}
		...
		serverHandler{c.server}.ServeHTTP(w, w.req)
		w.cancelCtx()
		if c.hijacked() {
			return
		}
		...
		// In shutdown mode keep-alives are disabled, so exit here.
		if !w.conn.server.doKeepAlives() {
			return
		}
	}
}
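A side note: the disableKeepAlives flag read by doKeepAlives above is also publicly togglable via SetKeepAlivesEnabled, which was a common way to drain persistent connections before Shutdown existed. A tiny sketch, not the article's code:

// Each keep-alive connection will now close once its current request
// finishes, just as during Shutdown.
srv.SetKeepAlivesEnabled(false)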

Graceful restart

Evolution of approaches

From a Linux system perspective

  • Use exec directly: the code segment is replaced with the new program's code, the original data and stack segments are discarded and fresh ones are allocated, and the only thing preserved is the process ID.

One problem with this is that the old process cannot exit gracefully, so the requests it is processing cannot complete properly. Also, the new process does not start serving instantaneously; before it gets to listen and accept, new connections may be rejected because the SYN queue is full (rare, but it can happen under high concurrency). Reading this alongside the TCP three-way handshake makes it much easier to understand; for me it was a real light-bulb moment.

  • fork and then exec to create the new process. Before exec, the old process calls fcntl(fd, F_SETFD, 0) to clear the FD_CLOEXEC flag, so after exec the new process inherits the old process's fd and can use it directly (see the sketch after this list).

    The new process and the old process then listen on the same fd and serve simultaneously; once the new process is serving normally, it sends a signal to the old process, and the old process exits gracefully.

    All requests then go to the new process and the graceful restart is complete. One problem from real online environments: because the parent process exits, the system reparents the new child process to process 1. Online, most services are managed by supervisor, and that is where it breaks: supervisor concludes the service has exited abnormally and restarts it.
  • Set the SO_REUSEPORT flag on the file descriptor so that two processes can listen on the same port. The problem is that the two processes then listen on the same port with two different fds, so when the old process exits, the connections still waiting unaccepted in its SYN queue are killed.

  • Graceful restarts can also be achieved by passing file descriptors between processes over a UNIX domain socket using the ancillary-data mechanism of sendmsg. This is more complicated to implement; HAProxy uses this model.

  • The child process inherits all file descriptors the parent has open. The child receives them in ascending order starting from fd 3, in the order the parent opened them. The child registers the fd and its event handler with epoll_ctl (taking the epoll model as an example), so it can listen on the same port as the parent. When the child is up and serving, it sends SIGHUP to the parent; the parent exits gracefully, the child carries on serving, and the graceful restart is done.
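Here is the sketch referred to above: a rough illustration of the fork-then-exec idea in Go. This is my own sketch; forkChild and the LISTEN_FD env var are illustrative names, not from the article.

package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
)

// forkChild re-execs the current binary so the child inherits the
// listening socket of the old process.
func forkChild(ln *net.TCPListener) (int, error) {
	f, err := ln.File() // dup of the listening socket's fd
	if err != nil {
		return 0, err
	}
	// fcntl(fd, F_SETFD, 0): clear FD_CLOEXEC so the fd survives exec.
	if _, _, errno := syscall.Syscall(syscall.SYS_FCNTL, f.Fd(), syscall.F_SETFD, 0); errno != 0 {
		return 0, errno
	}
	// The child reads LISTEN_FD, rebuilds the listener with
	// os.NewFile + net.FileListener, and starts serving.
	env := append(os.Environ(), fmt.Sprintf("LISTEN_FD=%d", f.Fd()))
	return syscall.ForkExec(os.Args[0], os.Args, &syscall.ProcAttr{
		Env:   env,
		Files: []uintptr{os.Stdin.Fd(), os.Stdout.Fd(), os.Stderr.Fd()},
	})
}

func main() {
	// Old-process side: listen, then fork the new version.
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	pid, err := forkChild(ln.(*net.TCPListener))
	fmt.Println("forked child:", pid, err)
}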

Implementation in Golang

From the above, the relatively easy option to implement is the straightforward fork-and-exec approach, so let's discuss how to do it in Golang.

Golang sets the FD_CLOEXEC flag on socket fds by default:

// Wrapper around the socket system call that marks the returned file
// descriptor as nonblocking and close-on-exec.
func sysSocket(family, sotype, proto int) (int, error) {
	// See ../syscall/exec_unix.go for description of ForkLock.
	syscall.ForkLock.RLock()
	s, err := socketFunc(family, sotype, proto)
	if err == nil {
		syscall.CloseOnExec(s)
	}
	syscall.ForkLock.RUnlock()
	if err != nil {
		return -1, os.NewSyscallError("socket", err)
	}
	if err = syscall.SetNonblock(s, true); err != nil {
		poll.CloseFunc(s)
		return -1, os.NewSyscallError("setnonblock", err)
	}
	return s, nil
}

So after exec, the fd would normally be closed by the system. But we can get around this with the os/exec package's Cmd. Some readers may be confused here: with FD_CLOEXEC set, shouldn't the fd inherited by the new child process be closed? In fact, a child started via exec.Cmd can inherit and use the parent's fds, because Golang clears the close-on-exec flag on the descriptors passed via Stdout, Stdin, Stderr, and ExtraFiles by default. (See syscall/exec_{GOOS}.go; the snippet below is from the macOS implementation.)

// dup2(i, i) won't clear close-on-exec flag on Linux,
// probably not elsewhere either.
_, _, err1 = rawSyscall(funcPC(libc_fcntl_trampoline), uintptr(fd[i]), F_SETFD, 0)
if err1 != 0 {
	goto childerror
}
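To make that concrete, here is a minimal parent-side sketch of inheriting a listener through ExtraFiles (my own illustration; startWorker and the GRACEFUL env marker are assumed names, not the article's):

// startWorker hands the listening socket to a child process.
// ExtraFiles entries appear in the child starting at fd 3.
func startWorker(ln *net.TCPListener) (*exec.Cmd, error) {
	f, err := ln.File()
	if err != nil {
		return nil, err
	}
	cmd := exec.Command(os.Args[0])
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.ExtraFiles = []*os.File{f} // the child sees this as fd 3
	cmd.Env = append(os.Environ(), "GRACEFUL=true")
	return cmd, cmd.Start()
}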

Problems with supervisor

In real projects, online services are generally started by supervisor. As mentioned above, if the parent process exits after starting the child, the child is taken over by process 1, so supervisor concludes the service has died and restarts it. To avoid this problem we can use a master/worker model. The basic idea: when the project starts, the program runs as the master, listens on the port to create the socket descriptor, but does not serve requests itself. It then creates a child process via exec.Cmd, passing along standard input/output/error, the listening descriptor, and environment variables through Stdin, Stdout, Stderr, ExtraFiles, and Env. From the environment variable the child knows it is a worker; it rebuilds the descriptor with os.NewFile (the Go runtime registers the fd with epoll internally), creates a TCPListener from it, binds the handler, and accepts and processes requests.

f := os.NewFile(uintptr(3+i), "")
l, err := net.FileListener(f)
if err != nil {
	return fmt.Errorf("failed to inherit file descriptor: %d", i)
}

server := &http.Server{Handler: handler}
server.Serve(l)

In the flow above, only the worker process serves requests, which is what makes a truly graceful restart possible. To trigger one, you can send a signal to the worker through an HTTP interface (the release machines online may not have permission to send signals directly, so this is a workaround), and the worker signals the master. On receiving the signal, the master starts a new worker; once the new worker is up and serving normally, it signals the master; the master then sends an exit signal to the old worker, and the old worker drains its requests and exits. A master-side sketch of this signal flow follows.
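A sketch of the master's signal loop under that description (my own illustration; the signal choices SIGHUP/SIGUSR1 and the startWorker helper from the earlier sketch are assumptions):

func masterLoop(oldWorker *exec.Cmd, ln *net.TCPListener) {
	var newWorker *exec.Cmd
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGHUP, syscall.SIGUSR1)
	for s := range sig {
		switch s {
		case syscall.SIGHUP: // the worker asked for a restart
			w, err := startWorker(ln) // sketched earlier
			if err != nil {
				log.Printf("start new worker: %v", err)
				continue
			}
			newWorker = w
		case syscall.SIGUSR1: // the new worker reports it is serving
			// Tell the old worker to drain and exit, then promote the new one.
			oldWorker.Process.Signal(syscall.SIGTERM)
			oldWorker, newWorker = newWorker, nil
		}
	}
}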

On log collection: if the project writes logs directly to files itself, there may be fd rotation problems (I have not studied this thoroughly yet). The current solution is to write all project logs to stdout and let supervisor collect them into a log file. When creating the worker, stdout and stderr are inherited, which solves the logging problem. If you have a better approach, feel free to discuss.


References

  • The Linux TCP Backlog
  • Go upgrade/reboot tool