In software engineering, a system needs continuous optimization and refactoring after it goes live.

Learning how to collect and analyze performance data from an application is a basic skill of software engineering practice. A "profile" usually refers to the collected performance data, while "profiling" refers to the act of performance analysis. In the Java world, for example, the well-known tool JProfiler stands for "Java Profiler."

Go takes performance seriously, and its standard library ships with the profiling package pprof. pprof comes in two flavors: runtime/pprof and net/http/pprof, where net/http/pprof simply wraps runtime/pprof and exposes it over HTTP. runtime/pprof is used to profile ordinary applications, mainly code blocks that run to completion, such as a function call; net/http/pprof is designed for collecting and analyzing profiles from long-running server programs.
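As a quick preview of the runtime/pprof pattern (the complete, parallel example appears later in this section), the idea is simply to bracket the code block you want to measure:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write CPU samples to a file for the lifetime of this block.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()

	// ... the code to be measured goes here ...
}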

This section will show you how to analyze and optimize performance with pprof, covering CPU usage, memory usage, block (blocking) profiles, and goroutine usage. In addition, a more intuitive graphical tool will be introduced: flame graphs, generated from pprof results with go-torch.

Performance analysis of common applications

As noted above, runtime/pprof is used for profiling ordinary applications, mainly code blocks that run to completion. So let's work through a case study.

Calculate PI

The case I have chosen is an algorithm for calculating π.

π is arguably the most famous irrational constant in the world, representing the ratio of a circle's circumference to its diameter. Around 250 BC, Archimedes estimated π to lie between 223/71 and 22/7.

Zu Chongzhi (429-500), a famous Chinese mathematician of the Southern and Northern Dynasties, was the first to pin the value of π down to seven decimal places, that is, between 3.1415926 and 3.1415927. His two rational approximations, the milü (355/113) and the yuelü (22/7), were a great contribution to the study of mathematics. It was not until the 15th century that the Persian mathematician al-Kashi broke the record with an accuracy of 17 decimal places. The symbol for π is the lowercase form of the sixteenth Greek letter, the initial of the Greek περιφέρεια ("periphery"), referring to the perimeter of a circle. The Welsh mathematician William Jones (1675-1749) first used π for this constant in 1706, and in 1736 the Swiss mathematician Leonhard Euler (1707-1783) adopted it as well; it has been the standard symbol for the ratio ever since.

The usual calculation methods are as follows:

  • Monte Carlo method;
  • Square approximation;
  • Iteration method;
  • Chudnovsky’s formula

Test the code implementation

The author uses the Monte Carlo method here; the general idea is as follows:

Inscribe a circle in a square so that it is tangent to all four sides; the ratio of the circle's area to the square's area is then π/4 (for radius r, the areas are πr² and (2r)² = 4r²).

Inside the square, randomly generate 10,000 points (i.e., 10,000 coordinate pairs (x, y)) and compute each point's distance from the center to determine whether it falls inside the circle. If the points are uniformly distributed, the fraction of points inside the circle should approach π/4, so multiplying that fraction by 4 yields an estimate of π. With 30,000 randomly simulated points, the estimate differs from the true value by about 0.07%. A minimal sketch of this idea follows.
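Below is a minimal, self-contained sketch of the Monte Carlo approach just described (an illustration for this section only, not the author's benchmark program, which appears next):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const samples = 30000
	inside := 0
	for i := 0; i < samples; i++ {
		// A random point in the unit square; the quarter circle
		// x*x+y*y <= 1 plays the role of the inscribed circle.
		x, y := rand.Float64(), rand.Float64()
		if x*x+y*y <= 1.0 {
			inside++
		}
	}
	fmt.Println("pi ~=", 4.0*float64(inside)/float64(samples))
}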

Note, however, that the complete program below does not actually sample random points: it estimates π by numerically integrating 4/(1+x²) over [0, 1] with the midpoint rule (this integral evaluates exactly to π), splitting the interval across goroutines. The parallel structure is the same either way. The full implementation looks like this:

package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

var n int64 = 10000000000
var h = 1.0 / float64(n)

func f(a float64) float64 {
	return 4.0 / (1.0 + a*a)
}

func chunk(start, end int64, c chan float64) {
	var sum float64 = 0.0
	for i := start; i < end; i++ {
		x := h * (float64(i) + 0.5)
		sum += f(x)
	}
	c <- sum * h
}

func main() {
	var cpuProfile = flag.String("cpuprofile", "", "write cpu profile to file")
	var memProfile = flag.String("memprofile", "", "write mem profile to file")
	flag.Parse()

	// Sample CPU usage while the program runs.
	if *cpuProfile != "" {
		f, err := os.Create(*cpuProfile)
		if err != nil {
			log.Fatal(err)
		}
		pprof.StartCPUProfile(f)
		defer pprof.StopCPUProfile()
	}

	// Record the start time.
	start := time.Now()

	var pi float64
	np := runtime.NumCPU()
	runtime.GOMAXPROCS(np)
	c := make(chan float64, np)
	fmt.Println("np: ", np)

	for i := 0; i < np; i++ {
		go chunk(int64(i)*n/int64(np), (int64(i)+1)*n/int64(np), c)
	}

	for i := 0; i < np; i++ {
		tmp := <-c
		fmt.Println("c->: ", tmp)
		pi += tmp
		fmt.Println("pai: ", pi)
	}

	fmt.Println("Pi: ", pi)

	// Record the end time and print the elapsed seconds.
	end := time.Now()
	fmt.Printf("Spend time: %vs\n", end.Sub(start).Seconds())

	// Write a heap profile before exiting.
	if *memProfile != "" {
		f, err := os.Create(*memProfile)
		if err != nil {
			log.Fatal(err)
		}
		pprof.WriteHeapProfile(f)
		f.Close()
	}
}

The code above calculates π using Go's goroutines and channels, making full use of a multi-core processor to speed up the CPU-bound computation.

We import the runtime/pprof dependency and add CPU profiling and memory profiling code to the implementation so that we can measure CPU and memory performance.

Compilation and Execution

The next step is to compile the program into an executable and run it to produce pprof sample data, which can then be analyzed with the relevant tools. The commands are as follows:

$ go build  -o pai main.go
$ ./pai --cpuprofile=cpu.pprof
$ ./pai --memprofile=mem.pprof


The commands above generate two sample files, cpu.pprof and mem.pprof, which we analyze using the go tool pprof command:

$ go tool pprof cpu.pprof

After the preceding command executes, pprof enters command-line interactive mode. pprof supports multiple commands: for example, top displays the top 10 entries in a pprof file (top 20 displays 20 entries), and there are other commands such as list, pdf, and svg.
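The original screenshot of the top output is not reproduced here; reconstructed from the figures discussed below, it looked roughly like this (runtime rows and exact headers omitted):

(pprof) top
Duration: 13.47s, Total samples = 24.44s
      flat  flat%   sum%        cum   cum%
    21.86s 89.44% 89.44%     24.44s   100%  main.chunk
     2.58s 10.56%   100%      2.58s 10.56%  main.f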

The other columns in this output are explained as follows:

  • Duration: the program's execution time. The multi-core run took 13.47s of wall-clock time, while 24.44s of samples were collected, because CPU time accumulates across all cores.
  • flat/flat%: the CPU time spent in the function itself (excluding its callees), and its percentage of total samples.
  • cum/cum%: the cumulative CPU time spent in the function including its callees, and the corresponding percentage.
  • sum%: the running total of flat% down the list, accumulating to 100%, i.e. 24.44 seconds.

In this example, main.chunk itself consumes 21.86s of CPU time, 89.44% of the samples; its cumulative time, including the call to f, is 24.44s, 100% of the samples. The cum column shows that the chunk function accounts for essentially all of the CPU time.

The output clearly identifies the application's most time-consuming function. Next, use the list command to see where that time goes: list prints the annotated source of every function matching a regular expression, showing the elapsed time for each line of the matched code:
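The original screenshot is likewise not reproduced; based on the figure cited below, the annotated listing looked roughly like this (per-line times for the other lines omitted; line numbers follow the author's original source file):

(pprof) list chunk
Total: 24.44s
ROUTINE ======================== main.chunk in .../main.go
    21.86s     24.44s (flat, cum)   100% of Total
       ...
         .      2.58s     24:		sum += f(x)
       ...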

As the annotated listing shows, the call to f(x) on line 24 accounts for an additional 2.58s. With the time spent on each line of code displayed, the code can be optimized based on this information.

Graphical rendering

As for the results collected by pprof, we are not limited to pprof's built-in commands; we can also render them as a more intuitive vector graphic. With Graphviz, pprof can directly generate the corresponding graphical files.

The examples in this article are based on CentOS 7.5, where Graphviz can be installed directly with the following command:

$ sudo yum install graphviz

For more installation instructions, see the Graphviz website.

With Graphviz installed, go ahead and type svg at the pprof interactive prompt:
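The interaction looks roughly like this (the exact message depends on the Go version):

(pprof) svg
Generating report in profile001.svg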

Note that the web command requires a local browser, so it is not usable on a headless server system. Use the svg command to generate a vector image instead, and open it in a browser, as shown below:

The graph confirms the earlier analysis: the main time-consuming function is main.chunk, taking 21.86s, while the f(x) it calls takes 2.58s. The size of each box also represents CPU usage; the larger the box, the more CPU time consumed.

Background server performance analysis

For background services that run continuously, such as web applications or distributed applications, we can use the net/http/pprof package, which profiles the application while it serves HTTP.

The simplest case is a service that uses the default mux, e.g. http.ListenAndServe("0.0.0.0:8000", nil); then we only need to import the package:

import (
	_ "net/http/pprof"
)

Note that the package is imported with the blank identifier "_", meaning we only need its init() function to run; init() registers the profiling handlers on the default mux, and profile data is then collected on demand whenever the endpoints are requested.
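Putting this together, a minimal sketch of a service with profiling enabled via the default mux (business handlers omitted):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // init() registers the /debug/pprof/* handlers
)

func main() {
	// The profiling endpoints are served alongside the application,
	// because both are registered on http.DefaultServeMux.
	log.Fatal(http.ListenAndServe("0.0.0.0:8000", nil))
}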

If you use a custom ServeMux, you need to register the routing rules manually (here r is your mux, with the net/http/pprof package imported by name so its handlers are visible):

r.HandleFunc("/debug/pprof/", pprof.Index)
r.HandleFunc("/debug/pprof/heap", pprof.Index)
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)

These paths serve the following profiles (a fetch example follows the list):

  • /debug/pprof/profile: accessing this path performs CPU profiling for 30 seconds by default and returns the result as a downloadable file; append ?seconds=60 to the URL to collect data for 60 seconds instead.
  • /debug/pprof/block: records goroutine blocking events. By default, every blocking event that occurs is sampled.
  • /debug/pprof/goroutine: records the currently active goroutines. Sampled once per request.
  • /debug/pprof/heap: records heap memory allocations. By default, one sample is taken per 512KB allocated.
  • /debug/pprof/mutex: reports the holders of contended mutexes.
  • /debug/pprof/threadcreate: records OS thread creation. Sampled once per request.
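For example, once the service built below is listening on port 8000, a profile can be pulled straight into the interactive analyzer:

$ go tool pprof http://localhost:8000/debug/pprof/profile?seconds=60
$ go tool pprof http://localhost:8000/debug/pprof/heap

Note that the block and mutex profiles only contain data if they are enabled in code, via runtime.SetBlockProfileRate and runtime.SetMutexProfileFraction respectively.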

Rewriting test code

Let's rewrite the π-calculation program as a service that exposes an HTTP endpoint, and import the net/http/pprof dependency to collect the HTTP service's performance metrics.

package main

import (
	"fmt"
	"net/http"
	_ "net/http/pprof"
	"runtime"
)

var n int64 = 10000000000
var h = 1.0 / float64(n)

func f(a float64) float64 {
	return 4.0 / (1.0 + a*a)
}

func chunk(start, end int64, c chan float64) {
	var sum float64 = 0.0
	for i := start; i < end; i++ {
		x := h * (float64(i) + 0.5)
		sum += f(x)
	}
	c <- sum * h
}

func callFunc(w http.ResponseWriter, r *http.Request) {

	var pi float64
	np := runtime.NumCPU()
	runtime.GOMAXPROCS(np)
	c := make(chan float64, np)
	fmt.Println("np: ", np)

	for i := 0; i < np; i++ {
		go chunk(int64(i)*n/int64(np), (int64(i)+1)*n/int64(np), c)
	}

	for i := 0; i < np; i++ {
		tmp := <-c
		fmt.Println("c->: ", tmp)

		pi += tmp
		fmt.Println("pai: ", pi)

	}

	fmt.Println("Pi: ", pi)
}

func main() {
	http.HandleFunc("/getAPi", callFunc)
	http.ListenAndServe(":8000", nil)
}

In the code above, we listen on port 8000 and define a getAPi endpoint. The π calculation itself is the same as before; each call to the endpoint triggers one computation.

Compile and execute

Now that the code is written, compile it and start the HTTP service by executing the following commands:

$ go build -o httpapi main.go

$ ./httpapi

After the program compiles successfully, run the binary to start the service and begin serving performance data.

We can then open pprof's HTTP index page at http://localhost:8000/debug/pprof/ in a browser:
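The original screenshot is not reproduced here; the index page lists the available profiles with live counts, roughly like this (counts vary with program state, and newer Go versions list a few more entries):

/debug/pprof/

profiles:
...	block
...	goroutine
...	heap
...	mutex
...	threadcreate

full goroutine stack dump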

The listing above is what the pprof web view of the running service shows; refreshing the page updates the sampling counts.

Graphical analysis

As with the profiling of the run-to-completion program above, we can also analyze the background program's performance graphically.

The go tool pprof command is then used to fetch and analyze this data. Typically, pprof fetches data directly over HTTP from one of the endpoints listed above; once the data is retrieved, pprof automatically drops the terminal into interactive mode.

Run the following command to inspect memory information:

$ go tool pprof httpapi http://localhost:8000/debug/pprof/heap

Typing svg at the interactive prompt then writes profile001.svg to the current directory by default; of course, we can also specify the location and file name.

Since no HTTP requests have arrived yet, memory usage is low and nothing looks abnormal. Next, we will simulate production conditions with a pressure test and analyze the performance under normal operation.

Use go-torch to generate a flame graph

The previous sections described net/http/pprof and runtime/pprof for profiling Go programs. The examples so far, however, only sample a snippet of code, and the optimizations that matter most for an application service only become visible under a large volume of requests. We therefore need go-torch, an open-source flame-graph tool from Uber, to help with the analysis. Producing the flame graph requires installing three tools: wrk, FlameGraph, and go-torch. The following describes how to install and use each of them.

Pressure test tool wrk

wrk is an HTTP benchmarking tool. It uses the operating system's high-performance I/O mechanisms, such as epoll and kqueue, combining multithreading with event-driven I/O to generate a large load against the target machine from a single multi-core host. The installation commands are as follows:

$ git clone https://github.com/wg/wrk.git
$ cd wrk/
$ make

The commands above produce a wrk executable. Its main parameters are described as follows:

  • -c: total number of connections (connections handled per thread = total connections / threads)
  • -d: test duration, e.g. 2s (2 seconds), 2m (2 minutes), 2h (2 hours)
  • -t: total number of threads to use
  • -s: path to a Lua script to execute for custom request logic
  • -H: header to add; mind the header syntax, e.g. -H "token: abcdef", i.e. token, colon, space, value (do not forget the space, or an error will be reported)

The author's initial pressure test used the following parameters:

./wrk -t5 -c10 -d120s http://localhost:8000/getAPi

That is, 5 threads keep 10 connections open to the endpoint for 120 seconds. If the following error occurs:

unable to create thread 419: Too many open files

this means the number of open sockets has exceeded the system's open-file limit. Raise the maximum number of files a user may open (this affects the current shell session):

$ ulimit -n 2048

FlameGraph and go-torch

Flame graphs are a great tool for performance analysis, letting you locate performance bottlenecks quickly. On Linux servers they are often used together with perf.

go-torch is an open-source tool from Uber that reads pprof profiling data directly and generates an SVG file containing the flame graph. The SVG can be opened in a browser, and its advantage is interactivity: you can click any box to zoom into that part of the call graph.

Run the following commands to install both:

$ git clone https://github.com/brendangregg/FlameGraph.git
$ go get github.com/uber/go-torch

go-torch is used as follows:

$ go-torch -u http://localhost:8000 -t 100

The command above makes go-torch collect profiling data from http://localhost:8000 for 100 seconds.

Pressure test to generate flame map

With the above three components installed, let's put them to the test. The first step is to start our application service:

$ ./httpapi

Then start the pressure test and go-torch:

$ ./wrk -t5 -c10 -d120s http://localhost:8000/getAPi
$ go-torch -u http://localhost:8000 -t 100

As you can see, the load we generated has produced the corresponding flame graph, torch.svg, on the server side. Note: run go-torch from the FlameGraph directory, or add the path to the FlameGraph scripts to your system environment variables.

Open the flame graph, as shown below:

The flame graph is named for its flame-like appearance: the horizontal axis represents CPU time and the vertical axis represents the depth of the call stack. A flame graph is read from bottom to top, with each box representing a function and the boxes above it showing the functions it calls; the width of a box is proportional to its CPU usage. The colors have no special meaning; the default red and yellow palette simply makes it look more like a flame.

This matches the result of our analysis above: virtually all of the time is spent in the chunk function. Now let's look at the flame graph when no requests are being served:

As you can see, the CPU time and memory footprint in this case are very low and flat, concentrated in the library functions that provide the HTTP service.

Summary

This article introduced how to collect and analyze the performance metrics of a Go application with pprof. Using pprof, we obtained the details of CPU and memory usage, which in turn tells us which functions are expensive and how the functions call one another. For a more detailed analysis, we can go down to the code level, look at the time taken per line, and directly locate the line of code where the performance problem occurs.

Combined with Uber's open-source go-torch to generate flame graphs, we can get a global view of the running system's CPU and memory usage, as well as its goroutines and blocking locks. Skilled use of profiling tools helps us locate and solve production problems more quickly.

As this article has shown, profiling a daemon is only meaningful under traffic rather than against an idle service, which is why we used a pressure test to simulate a large volume of requests. Of course, enabling pprof in a production environment has a performance cost, so finding problems before going live is definitely the best option.
