Many of you have heard the story of the Quake III performance optimization: in the source code of its 3D engine, John Carmack pushed the execution efficiency of the 1/sqrt(x) function to the extreme.

We usually compute the square root of a floating-point number with binary search or Newton's iteration. But this function uses a "magic number" instead — no iteration at all — and computes the inverse square root directly in just a couple of steps. Breathtaking!

Because this was one of the lowest-level functions, and the game performed a huge number of such operations, the game could run smoothly even in an era when computing resources were extremely tight. That's the beauty of performance optimization!

At work, when business volume is small and only a few machines are in use, the benefits of performance optimization are hard to feel. But when a service runs on thousands of machines, a 20% performance improvement can save hundreds of machines, and millions a year. Spend the savings on year-end bonuses, and everyone is happy!

Generally speaking, performance analysis can be considered at three levels: application layer, system layer, and code layer.

The application layer is mainly about understanding how callers use the service, helping them use it more rationally and cutting meaningless calls while still meeting their needs. The system layer focuses on service architecture, such as adding a caching layer. The code layer is concerned with the execution efficiency of functions, such as using a more efficient square-root algorithm.

Whatever you do, pay attention to method. In many cases, quickly finishing the most important part of a job reaps the lion's share of the benefit; the remaining rough edges can be polished up slowly. Trying to achieve 100% from the start often means putting in a great deal of effort for very little return.

The same goes for performance optimization: identifying the performance bottleneck gives us the most bang for the buck with the least effort.

In Go, pprof is the tool that helps us quickly identify performance bottlenecks so we can optimize in a targeted way.

What is pprof

Before the code goes live, we can gauge the performance of the system, such as requests handled per second, average response time, error rate, etc. This gives us an idea of the performance of our services.

But load testing is simulated traffic, run offline. What about after launch? Online you will encounter high concurrency, heavy traffic, unreliable upstream and downstream services, sudden traffic spikes, and other scenarios — all unpredictable.

When a flood of online alarms suddenly arrives — interface timeouts, rising error counts — besides reading logs and checking monitoring, we use performance analysis tools to profile the program and find the bottleneck. Of course, such an incident usually leaves no time for analysis: degrading, rate limiting, and rolling back come first, to stop the loss. After things return to normal, reproduce the performance problem with online traffic replay or load testing, and then analyze the system bottleneck with the tools.

Generally speaking, performance analysis focuses on CPU, memory, disk I/O, and network.

Profiling refers to collecting data during a program's execution that reflects its execution state. In software engineering, performance analysis (also known as profiling) is a form of dynamic program analysis that studies program behavior by collecting runtime information.

Go's built-in pprof support can profile a program's performance and visualize the results. It consists of two related packages:

  • runtime/pprof: for programs that run once, such as an offline preprocessing job that runs only once a day. Call the functions provided by the pprof package to enable performance data collection manually.

  • net/http/pprof: for online services. For an HTTP server, access the HTTP endpoints provided by pprof to obtain performance data. Under the hood these still call runtime/pprof; the package simply wraps those functions as endpoints for network access.

The role of pprof

pprof is Go's program performance analysis tool, and it can provide many kinds of performance data:

The allocs and heap samples contain the same information, except that allocs covers memory allocation for all objects ever allocated, while heap covers memory allocation for live objects.

The difference between the two is the mode in which the pprof tool reads the data at start time: the allocs profile starts pprof in a mode that displays the total number of bytes allocated since the program began, including garbage-collected bytes.

The figure above comes from Wolfogre's hands-on pprof article, which provides a sample program for troubleshooting, analyzing, and solving performance problems with pprof. It is excellent.

  1. When CPU profiling is enabled, the Go runtime interrupts the program every 10 ms to record the call stacks of the currently running goroutines. Once the profile data has been saved to disk, we can analyze the hot spots in the code.
  2. Memory profiling records the call stack at the moment of a heap allocation. By default, one in every 1,000 allocations is sampled, and this rate can be changed. Stack allocations are not recorded, since they can be freed at any time. Because memory profiling is sampling-based, and because it records allocation rather than usage, it is difficult to use it to determine a program's exact memory consumption.
  3. Blocking profiling is rather unique. Somewhat like CPU profiling, it records how long goroutines spend waiting for resources, which makes it very helpful for analyzing concurrency bottlenecks: it can show when large numbers of goroutines are blocked. It is a special-purpose tool, best left alone until CPU and memory bottlenecks have been ruled out.

How is pprof used

pprof can be used in three ways: report generation, web visualization, and interactive terminal.

— from Fried Fish's "Golang Killer Tool: Performance Analysis with pprof"

runtime/pprof

Taking CPU profiling as an example, add two lines of code: call pprof.StartCPUProfile to start CPU profiling and pprof.StopCPUProfile() to flush the data to a file:

```go
import "runtime/pprof"

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
	// ...

	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()

	// ...
}
```
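Filling in the elided parts, a complete minimal program might look like the sketch below. The fib workload and the flag wiring are illustrative assumptions, not part of any program from the article — fib is there only so the profile has a hot spot to show:

```go
package main

import (
	"flag"
	"log"
	"os"
	"runtime/pprof"
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

// fib is a stand-in CPU-heavy workload so the profile has something to show.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

func main() {
	flag.Parse()
	if *cpuprofile != "" {
		f, err := os.Create(*cpuprofile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}
		defer pprof.StopCPUProfile()
	}
	log.Println(fib(30)) // run with: go run . -cpuprofile=cpu.prof
}
```

Afterwards, `go tool pprof cpu.prof` opens the collected profile for analysis.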

net/http/pprof

Start a listener on a separate port (different from the one serving normal business traffic) to handle pprof requests:

```go
import _ "net/http/pprof"

func initPprofMonitor() error {
	pPort := global.Conf.MustInt("http_server", "pprofport", 8080)

	var err error
	addr := ":" + strconv.Itoa(pPort)

	go func() {
		err = http.ListenAndServe(addr, nil)
		if err != nil {
			logger.Error("funcRetErr=http.ListenAndServe||err=%s", err.Error())
		}
	}()

	return err
}
```

The pprof package automatically registers handlers for the related requests:

```go
// src/net/http/pprof/pprof.go:71

func init() {
	http.Handle("/debug/pprof/", http.HandlerFunc(Index))
	http.Handle("/debug/pprof/cmdline", http.HandlerFunc(Cmdline))
	http.Handle("/debug/pprof/profile", http.HandlerFunc(Profile))
	http.Handle("/debug/pprof/symbol", http.HandlerFunc(Symbol))
	http.Handle("/debug/pprof/trace", http.HandlerFunc(Trace))
}
```

The first path, /debug/pprof/, has five sub-paths:

goroutine, threadcreate, heap, block, mutex

After starting the service, access it directly from the browser:

http://47.93.238.9:8080/debug/pprof/

You get a summary page:

You can click the links directly to enter the sub-pages and view the related summary information.

There are two links for goroutine information: "goroutine", a summary view of all goroutines as a whole, and "full goroutine stack dump", which shows the state of each individual goroutine. For how to interpret the contents of these pages, see Dabin's article.

Clicking profile or trace makes the server sample data in the background for a period of time; when sampling finishes, a profile file is returned to the browser, to be analyzed locally with go tool pprof.

After downloading the profile file, execute the following command:

```shell
go tool pprof ~/Downloads/profile
```

This enters the command-line interactive mode. Run go tool pprof -help to view the help information.

You can also enter command-line interactive mode directly, without clicking links in the browser:

```shell
go tool pprof http://47.93.238.9:8080/debug/pprof/profile
```

This likewise collects data for a period of time, downloads the data file locally, and then analyzes it. The URL above can also take a time parameter, seconds=60, to customize the CPU profiling duration.

Similar commands include:

```shell
# Download the CPU profile; by default it samples CPU usage for 30s (so you wait 30s)
go tool pprof http://47.93.238.9:8080/debug/pprof/profile
# Sample for 120s instead (wait 120s)
go tool pprof http://47.93.238.9:8080/debug/pprof/profile?seconds=120
# Download the heap profile
go tool pprof http://47.93.238.9:8080/debug/pprof/heap
# Download the goroutine profile
go tool pprof http://47.93.238.9:8080/debug/pprof/goroutine
# Download the block profile
go tool pprof http://47.93.238.9:8080/debug/pprof/block
# Download the mutex profile
go tool pprof http://47.93.238.9:8080/debug/pprof/mutex
```

In interactive mode, commands such as top, list, and web are commonly used.

Run top:

Five columns of data are obtained:

| Column | Meaning |
| ------ | ------- |
| flat | time spent in this function itself |
| flat% | flat as a share of total CPU time; e.g. with a total of 16.22s, Eat's 16.19s is 99.82% |
| sum% | the running total of flat% for this row and all rows above it |
| cum | cumulative: total time spent in this function plus the functions it calls |
| cum% | cum as a share of total CPU time |

For other profile types, such as heap, flat, sum, and cum have analogous meanings; they just measure different quantities — CPU time in one case, memory size in the other.

Run list, which uses regular-expression matching, to find the relevant code:

```
list Eat
```

It jumps directly to the relevant time-consuming code:

Running web (you need Graphviz installed; pprof uses graphviz to generate the application's call graph) generates an SVG file, which opens directly in the browser (you may need to set a default application for the .svg file format):

The lines in the figure represent calls between methods, the labels on the lines are the sample values for the call (for example, time or memory allocation), and the size of a box is proportional to the sample value for that method.

Each box carries two labels: in the CPU profile, the percentage of time the method itself was running (flat time) and the percentage of time it appeared anywhere in a sampled stack (cumulative time). The larger the box, the more time it takes or the more memory it allocates.

Alternatively, the traces command can list the call stack of functions:

In addition to the two methods described above (report generation, command-line interaction), you can also interact in the browser. Create a profile file first, then run:

```shell
go tool pprof --http=:8080 ~/Downloads/profile
```

This opens a visual interface:

Click on the menu bar to toggle between Top/Graph/Peek/Source and even see the Flame Graph:

It is upside down relative to a normal flame graph: the call relationship is displayed top-down, and the wider the bar, the longer the execution time. Note: the version of Go I'm using here is 1.13; the pprof tool in older versions does not support the -http parameter. Of course, you can also install other tools to view flame graphs, such as:

```shell
go get -u github.com/google/pprof
# or
go get github.com/uber/go-torch
```

Pprof advanced

The Resources section links several hands-on articles on performance analysis with pprof; follow along with them in practice, and then apply the tool in your daily work.

Russ Cox in practice

This part mainly follows the reference material [Russ Cox], to learn the optimization approach of a master.

It all started when someone published an article implementing the same algorithm in a variety of languages, finding that the Go version was very slow while C++ was the fastest. Russ Cox couldn't stand that, so he brought out the pprof killer tool to optimize. In the end the program was not only faster, but also used less memory!

First, add the CPU profiling code:

```go
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
	flag.Parse()
	if *cpuprofile != "" {
		f, err := os.Create(*cpuprofile)
		if err != nil {
			log.Fatal(err)
		}

		pprof.StartCPUProfile(f)
		defer pprof.StopCPUProfile()
	}
	// ...
}
```

Using pprof to look at the top 5 functions, a map-access function, mapaccess1_fast64, takes the most time; it appears in a recursive function.

The oversized box for mapaccess1_fast64 is visible at a glance. Run web mapaccess1 to focus in further:

The heaviest callers of mapaccess1_fast64 are main.FindLoops and main.DFS. Time to locate the specific code.

The optimization here was to change the map to a slice — which, of course, depends on the keys being ints and not too sparse.

The takeaway: for smaller data sets, don't use maps where slices would suffice, as maps carry a large overhead.
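A hypothetical sketch of that rewrite (countWithMap and countWithSlice are made-up names, not from Cox's program): when the keys are small, dense ints with a known upper bound, a slice indexed by the key replaces the map and avoids hashing and bucket lookups entirely.

```go
package main

import "fmt"

// countWithMap counts key occurrences the original way, with a map.
func countWithMap(keys []int) map[int]int {
	m := make(map[int]int)
	for _, k := range keys {
		m[k]++
	}
	return m
}

// countWithSlice does the same with a slice indexed by the key.
// It requires the keys to be dense ints in [0, n): no hashing, no buckets.
func countWithSlice(keys []int, n int) []int {
	s := make([]int, n)
	for _, k := range keys {
		s[k]++
	}
	return s
}

func main() {
	keys := []int{0, 1, 1, 2, 2, 2}
	// Both report that key 2 occurs three times.
	fmt.Println(countWithMap(keys)[2], countWithSlice(keys, 3)[2])
}
```

The trade-off is that the slice's length must cover the largest key, so sparse or unbounded keys would waste memory — exactly the "not too sparse" caveat above.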

After the change, CPU profiling again shows that the recursive functions are no longer in the top 5. But there is a new time hog: runtime.mallocgc, at 54.2%, which relates to memory allocation and garbage collection.

Next, add the code to collect memory data:

```go
var memprofile = flag.String("memprofile", "", "write memory profile to this file")

func main() {
	// ...

	FindHavlakLoops(cfgraph, lsgraph)
	if *memprofile != "" {
		f, err := os.Create(*memprofile)
		if err != nil {
			log.Fatal(err)
		}
		pprof.WriteHeapProfile(f)
		f.Close()
		return
	}

	// ...
}
```

Using the top5 and list commands to find the code that allocates the most memory, this time it is inserting elements into a map. Again a slice replaces the map — but a map silently absorbs duplicate inserts, so a new helper was written to insert elements into a slice without duplicates:

```go
func appendUnique(a []int, x int) []int {
	for _, y := range a {
		if x == y {
			return a
		}
	}
	return append(a, x)
}
```

OK — the program is now 2.1x faster than the original. Looking at the CPU profile data again, runtime.mallocgc dropped a bit, but still accounts for 50.9%.

Another way to look at why the system is garbage collecting is to look at the allocations that are causing the collections, the ones that spend most of the time in mallocgc.

In other words, look at what the garbage collector is actually collecting — those allocations are the main culprits behind the frequent garbage collection.

Running web mallocgc displays the mallocgc-related functions as a vector graph, but there are too many nodes with tiny sample counts obscuring the view, so add a filter option:

```shell
go tool pprof --nodefraction=0.1 profile
```

This filters out sample points below 10%, and the new graph clearly shows that FindLoops triggers the most garbage collection. Continue with list FindLoops to locate the code directly.

It turns out that FindLoops allocates temporary variables on every call, adding to the garbage collector's burden. The improvement was to cache them in global variables and reuse them across calls. The downside: the code is no longer thread-safe.
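The caching idea can be sketched like this (findLoopsScratch is a made-up stand-in for the real code, not Cox's actual implementation): one package-level buffer is reused instead of allocating per call — lighter on the GC, but, as noted, unsafe for concurrent use.

```go
package main

import "fmt"

// scratch is the package-level cache; only one caller at a time may use it.
var scratch []int

// findLoopsScratch hands out a zeroed length-n slice, reusing the cached
// backing array whenever it is already big enough.
func findLoopsScratch(n int) []int {
	if cap(scratch) < n {
		scratch = make([]int, n) // allocate only when the cache is too small
	}
	s := scratch[:n]
	for i := range s {
		s[i] = 0 // reset reused memory before handing it out
	}
	return s
}

func main() {
	a := findLoopsScratch(4)
	b := findLoopsScratch(2) // reuses the same backing array: no allocation
	fmt.Println(len(a), len(b), &a[0] == &b[0])
}
```

For concurrent code, a sync.Pool would be the thread-safe variant of this pattern, at the cost of less predictable reuse.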

The pprof-driven optimization ends here, with a very good result: roughly the same speed and memory allocation as the C++ version.

The heuristic: use the CPU profile to find the functions that take the most time and optimize them; if you find GC doing a lot of work, find the code that allocates the most memory — and the functions causing it — and optimize that.

The article is excellent, and although it was written quite a while ago (originally in 2011), it is still worth reading. The hands-on article in reference [Wolfogre] is also wonderful, and its approach is much the same as this one — plus you can run the sample program it provides and solve the performance problems step by step yourself. Great fun!

Finding memory leaks

Memory can be allocated either on the heap or on the stack. Memory allocated on the heap requires garbage collection or manual freeing (in languages without garbage collection, such as C++), while memory on the stack is normally freed automatically when the function returns.

Go uses escape analysis to place as many objects as possible on the stack, so the program runs faster.
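A small illustration of what escape analysis decides (stackSum and heapValue are hypothetical functions for this sketch); you can see the compiler's verdicts by building with go build -gcflags='-m':

```go
package main

import "fmt"

// stackSum's array never leaves the function, so escape analysis can keep
// it on the stack: no heap allocation, no GC work.
func stackSum() int {
	var buf [64]int // does not escape: no reference outlives the call
	for i := range buf {
		buf[i] = i
	}
	sum := 0
	for _, v := range buf {
		sum += v
	}
	return sum
}

// heapValue returns a pointer to a local, so the local must outlive the
// call: -gcflags='-m' reports it as "moved to heap".
func heapValue() *int {
	x := 42
	return &x
}

func main() {
	fmt.Println(stackSum(), *heapValue())
}
```

Stack allocations like buf never show up in the heap profile, which is one reason the memory profiler only ever sees part of the picture.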

To clarify, there are two memory-analysis perspectives: one looks at the memory or objects currently held (live at collection time), called inuse; the other looks at all memory allocated since the program started, whether or not it has since been garbage-collected, called alloc.

As mentioned above, there are two main memory analysis strategies with pprof. One is around looking at the current allocations (bytes or object count), called inuse. The other is looking at all the allocated bytes or object count throughout the run-time of the program, called alloc. This means regardless if it was gc-ed, a summation of everything sampled.

With the -sample_index parameter, you can switch the type of memory analysis:

```shell
go tool pprof -sample_index=alloc_space http://47.93.238.9:8080/debug/pprof/heap
```
There are 4 kinds:

| Type | Meaning |
| ---- | ------- |
| inuse_space | amount of memory allocated and not released yet |
| inuse_objects | number of objects allocated and not released yet |
| alloc_space | total amount of memory allocated (regardless of released) |
| alloc_objects | total number of objects allocated (regardless of released) |

The reference articles also show how to find the cause of a goroutine leak: use a diff-like comparison of heap or goroutine profiles to spot the extra goroutines directly. Also recommended reading!

Conclusion

pprof is a powerful tool for analyzing the performance of Go programs. It samples performance-related data from a running Go program and generates profile files, then offers three different presentation forms to give us an intuitive view of the data.

After obtaining performance data, you can use commands such as top, web, and list to quickly locate the corresponding code and optimize it.

"Premature optimization is the root of all evil." In practice, few people worry about performance day to day — but when the program you wrote hits a bottleneck, when QA load-tests it and QPS won't go up, you can use pprof to find the bottleneck and make the corresponding optimization (and demonstrate some technical strength while you're at it).

Resources

[Russ Cox's optimization walkthrough, with code] blog.golang.org/profiling-g…

[Google pprof] github.com/google/ppro…

[Debugging Golang with pprof and flame graphs] cizixs.com/2017/09/11/…

[Profiling Go programs] jbns.ca/blog/2017/0…

[Profiling your Golang app in 3 steps] coder.today/tech/2018-1…

[Golang remote profiling and FlameGraphs] matoski.com/article/gol…

[Fried Fish's pprof article] segmentfault.com/a/119000001…

[Birdhouse's pprof article] colobu.com/2017/03/02/…

[Seven Go performance-analysis methods] blog.lab99.org/post/golang…

[pprof comparison] juejin.cn/post/684490…

[Optimization process explained through an example] artem.krylysov.com/blog/2017/0…

[Go wiki article by Dmitry Vyukov] github.com/golang/go/w…

[Wolfogre's wonderful hands-on article] blog.wolfogre.com/posts/go-pp…

[Dave Cheney's high-performance Go workshop] dave.cheney.net/high-perfor…

[A hands-on case] www.cnblogs.com/sunsky303/p…

[Dabin's memory-leak hunt] segmentfault.com/a/119000001…

[Finding a memory leak] www.freecodecamp.org/news/how-i-…

[Quake III performance optimization] diducoder.com/sotry-about…