Preface

After our Go services were migrated to Kubernetes, some Pods of some services restarted from time to time, and the business logs showed no cause. My guess was that sloppy code was dereferencing a nil pointer somewhere, causing Go to panic and the process to exit. However, the panic output goes to stderr and is cleared when the container restarts, which is what led to the analysis and solution in the rest of this article.

Solution analysis

In applications written in Go, a runtime panic that is not recovered crashes the whole program, whether it occurs in the main goroutine or in any other goroutine. Before the migration, we deployed Go projects under Supervisor, which monitored the application process: whenever the application exited because of a panic, Supervisor started it again.

After deploying to a Kubernetes cluster, keeping Supervisor is largely redundant, since the kubelet on each node already restarts a container whose main process has crashed. However, Go writes panic information directly to the standard error, and the previous panic output is gone once the container has restarted, so we could not find out why the container crashed. The key to diagnosing a container restart therefore becomes: how to redirect panic output from stderr to a file, so that the crash information survives in the log directory persisted through the container's volume.

Previously, the Supervisor could set the standard error to a file directly by configuring stderr_logfile:

[program:go-xxx...]
directory=/home/go/src...
environment=...
command=/home/go/src.../bin/app
stderr_logfile=/home/xxx/log/..../app_err.log

Now that we have moved to Kubernetes and no longer use Supervisor, the redirection has to be done inside the program itself. The first idea that comes to mind for getting panics into a log file in Go is to write the error to the file inside recover. That is not practical: recover only takes effect in a deferred function of the goroutine that panicked, so it would have to be added everywhere, and the third-party packages we reference can panic as well, in code we do not control. Also, unlike languages with exceptions, where uncaught exceptions can be caught by a global exception handler, Go has no mechanism that lets a single recover catch every panic.
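To make the limitation concrete, here is a minimal sketch of what per-goroutine recovery looks like. The helper names safeGo and runGuarded are hypothetical and introduced only for illustration; they are not part of the original project.

```go
package main

import (
	"fmt"
	"sync"
)

// safeGo is a hypothetical helper: it runs fn in a new goroutine and recovers
// a panic raised inside that goroutine, handing the recovered value to onPanic.
func safeGo(wg *sync.WaitGroup, fn func(), onPanic func(v interface{})) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			// recover only sees panics from this goroutine.
			if v := recover(); v != nil {
				onPanic(v)
			}
		}()
		fn()
	}()
}

// runGuarded starts one panicking goroutine through safeGo and returns the
// recovered value formatted as a string.
func runGuarded() string {
	var wg sync.WaitGroup
	var got string
	safeGo(&wg, func() { panic("test panic") }, func(v interface{}) {
		got = fmt.Sprint(v)
	})
	wg.Wait()
	return got
}

func main() {
	fmt.Println("recovered:", runGuarded())
}
```

This works only for goroutines that are started through the wrapper. Any goroutine started with a plain go statement, including ones started inside imported packages, is not covered, which is why this route does not scale to a whole service.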

That leaves only one approach: replace the standard error with the log file when the program starts. Go writes panic output to the standard error, so we quietly swap the file descriptor of the standard error for the file descriptor of the log file (to the operating system stderr is just a file; on Unix systems everything is a file).

Trial and error

Following this idea, I first tried using the following example:

package main

import (
    "fmt"
    "os"
)

const stdErrFile = "/tmp/go-app1-stderr.log"

// RewriteStderrFile tries to redirect stderr by reassigning os.Stderr.
func RewriteStderrFile() error {
    file, err := os.OpenFile(stdErrFile, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0666)
    if err != nil {
        fmt.Println(err)
        return err
    }
    os.Stderr = file
    return nil
}

func testPanic() {
    panic("test panic")
}

func main() {
    RewriteStderrFile()
    testPanic()
}

After running this, /tmp/go-app1-stderr.log contains nothing, and the panic information is still written to the standard error.
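One way to see what the experiment did and did not change is the following sketch. It assumes a Unix-like system, and reassignDemo and the temporary file name pattern are hypothetical names used only for illustration. It reassigns os.Stderr, then writes once through the variable and once straight to file descriptor 2 with a raw system call, which is roughly how low-level code that never consults the os.Stderr variable behaves.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// reassignDemo replaces the os.Stderr variable with a temporary file, writes
// once through os.Stderr and once directly to file descriptor 2, and returns
// what ended up in the file.
func reassignDemo() (string, error) {
	tmp, err := os.CreateTemp("", "reassign-*.log")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	original := os.Stderr
	os.Stderr = tmp
	// This write goes through the variable, so it lands in the file.
	fmt.Fprintln(os.Stderr, "via os.Stderr")
	// This write targets descriptor 2 itself, which was not changed.
	// The result is ignored because only the file contents matter here.
	_, _ = syscall.Write(2, []byte("via fd 2\n"))
	os.Stderr = original

	data, err := os.ReadFile(tmp.Name())
	return string(data), err
}

func main() {
	out, err := reassignDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("file contains: %q\n", out)
}
```

Only the line written through the os.Stderr variable reaches the file. The raw write to descriptor 2 still goes wherever the process's stderr pointed when it started, which matches what happened to the panic output in the experiment.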

Final solution

As for the reason, I did some searching, and luckily Rob Pike has answered a similar question. In short, os.Stderr is a variable created by a higher-level package, and assigning to it has no effect on the runtime, which writes panic output to file descriptor 2 directly.

So instead of reassigning the variable, we use syscall.Dup2 to replace the descriptor itself, and keep the log file handle in a global variable so that it is not garbage-collected while the long-running process is alive:

import (
    "fmt"
    "os"
    "runtime"
    "syscall"
)

var stdErrFileHandler *os.File

func RewriteStderrFile() error {
    if runtime.GOOS == "windows" {
        return nil
    }

    file, err := os.OpenFile(stdErrFile, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0666)
    if err != nil {
        fmt.Println(err)
        return err
    }
    stdErrFileHandler = file // Keep the file handle in a global variable so the GC does not collect it

    // Make the stderr file descriptor refer to the log file.
    if err = syscall.Dup2(int(file.Fd()), int(os.Stderr.Fd())); err != nil {
        fmt.Println(err)
        return err
    }
    // Close the file descriptor before the handle's memory is reclaimed
    runtime.SetFinalizer(stdErrFileHandler, func(fd *os.File) {
        fd.Close()
    })

    return nil
}

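Because Windows does not support syscall.Dup2, I added a check for it. On Windows the same effect can be achieved by having the Go runtime load a system DLL, but our servers run Linux, so I did not implement that part; the check is only there so that the program still runs on Windows. Note that syscall.Dup2 is not defined at all when building for Windows, so a runtime check by itself is not enough for the code to compile there; the function also has to sit in a file that is excluded from Windows builds, for example with a //go:build !windows constraint.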
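Portability is also a concern within Linux itself, because syscall.Dup2 is not defined on every Linux architecture (arm64 is one where it is missing). The following is a minimal sketch of a variant, assuming Linux only, that uses syscall.Dup3 instead and then checks that a raw write to descriptor 2 really ends up in the file. The names redirectFd2 and demo are hypothetical, and this is an illustration rather than the code running in our services.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// redirectFd2 points file descriptor 2 at the file at path and returns a
// function that restores the original stderr. Linux only (syscall.Dup3).
func redirectFd2(path string) (func() error, error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	// After duplication, descriptor 2 holds its own reference to the open
	// file, so the os.File can be closed when this function returns.
	defer f.Close()

	saved, err := syscall.Dup(2) // keep a copy of the original stderr
	if err != nil {
		return nil, err
	}
	if err := syscall.Dup3(int(f.Fd()), 2, 0); err != nil {
		syscall.Close(saved)
		return nil, err
	}
	return func() error {
		defer syscall.Close(saved)
		return syscall.Dup3(saved, 2, 0)
	}, nil
}

// demo redirects descriptor 2 to a temporary file, writes to descriptor 2
// with a raw system call, restores stderr, and returns the file contents.
func demo() (string, error) {
	tmp, err := os.CreateTemp("", "stderr-*.log")
	if err != nil {
		return "", err
	}
	path := tmp.Name()
	tmp.Close()
	defer os.Remove(path)

	restore, err := redirectFd2(path)
	if err != nil {
		return "", err
	}
	_, werr := syscall.Write(2, []byte("written to fd 2\n"))
	if err := restore(); err != nil {
		return "", err
	}
	if werr != nil {
		return "", werr
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("log file contains: %q\n", out)
}
```

A real service would call the redirect once at startup and never restore; the restore function exists here only so the demonstration can put stderr back. Because descriptor 2 keeps the file open on its own, this variant does not need a global variable to hold the handle.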

After running the program again, /tmp/go-app1-stderr.log contains the panic message from the crash together with the full call stack at the moment of the panic:

➜ ~ cat /tmp/go-app1-stderr.log
panic: test panic

goroutine 1 [running]:
main.testPanic(...)
	/Users/kev/Library/Application Support/JetBrains/GoLand2020.1/scratches/scratch_4.go:39
main.main()
	/Users/kev/Library/Application Support/JetBrains/GoLand2020.1/scratches/scratch_4.go:44 +0x3f
panic: test panic

goroutine 1 [running]:
main.testPanic(...)
	/Users/kev/Library/Application Support/JetBrains/GoLand2020.1/scratches/scratch_4.go:39
main.main()
	/Users/kev/Library/Application Support/JetBrains/GoLand2020.1/scratches/scratch_4.go:44 +0x3f

Results after rollout

This solution has been running in production for a month now. For every Pod restart we have seen since, the log file recorded the exact call stack at the moment of the crash, which helped us locate several problems in the code. In fact, these problems were related to nil pointers, a topic I covered in my previous article “How to Avoid Writing Go Code with dynamic Language Thinking”. Once a project becomes complicated, nobody can guarantee that nil pointer dereferences will never happen, and errors triggered by extremely subtle conditions can only be solved by analyzing the logs from the time of the incident.
