• How I investigated memory leaks in Go using pprof on a large codebase
  • Translation source: github.com/watermelo/d…
  • Translator: Khaki Khaki
  • The translator's ability is limited; if there are errors in translation or understanding, please point them out

I’ve been working on scalable blockchain infrastructure in Go on the Orbs team for most of this year, and it has been an exciting year. During 2018 we researched which language to choose for our blockchain implementation, and we picked Go because we knew it had a good community and a great toolset.

In recent weeks we entered the final stages of system integration. As with any large system, problems can surface late in the process, including performance issues, memory leaks and so on. While integrating the system we ran into a good example of one. In this article I describe how to investigate a memory leak in Go, detailing the steps taken to find, understand and resolve it.

The toolset Go offers is excellent but has its limitations. The biggest one, which we ran into first, is the limited ability to investigate full core dumps. A full core dump is an image of the memory (or user-space memory) occupied by the process while the application is running.

We can think of the memory map as a tree; traversing that tree takes us through the different allocations of objects and their relationships. This means that whatever sits at the root is the reason the memory is being held and not garbage collected (GCed). Because there is no easy way to analyze a full core dump in Go, it is hard to get to the roots of an object that is not being GCed.

At the time of writing, we could not find any tool online that can help with this. Since there is a core dump format, and a simple enough way to export such a file from the debug package, it is likely that a tool exists inside Google. Searching online, it looks like a core dump viewer sits somewhere in the Golang pipeline, but nothing suggests anyone outside is using it. That said, even without such a tool, we can usually find the root cause with the existing tooling.

A memory leak

Memory leaks, or memory pressure, can appear in many forms throughout a system. We usually treat them as bugs, but sometimes their root cause lies in design decisions.

When building a system under new design principles, these concerns matter less. It is more important to build the system in a way that avoids premature optimization and lets you optimize the code later as it matures, rather than over-engineering it from the start. Still, some common examples of memory pressure problems are:

  • Allocating too much memory, or representing data incorrectly
  • Heavy use of reflection or strings
  • Using global variables
  • Orphaned goroutines that never terminate

In Go, the easiest way to create a memory leak is to define a global variable that is an array (or slice) and keep appending data to it. This blog post describes the example nicely.
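As a minimal sketch of that pattern (the names below are illustrative, not taken from that post or from our codebase), the leak boils down to a package-level slice that only ever grows, so everything it references stays reachable from a GC root:

package main

import "time"

// cache is a package-level variable, i.e. a GC root: anything it references
// can never be collected, so it grows for the lifetime of the process.
var cache [][]byte

func handleRequest(payload []byte) {
	buf := make([]byte, len(payload))
	copy(buf, payload)
	cache = append(cache, buf) // appended but never removed -> leak
}

func main() {
	for {
		handleRequest(make([]byte, 1024))
		time.Sleep(time.Millisecond)
	}
}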

Why am I writing this article? When I looked into that example I found plenty of write-ups about such leaks. However, our real system has far more than 50 lines of code and a single struct, so finding the source of a memory problem is much more complicated than the example suggests.

Go provides us with an amazing tool called pprof. Once mastered, it can help investigate, and most likely find, memory problems. Its other use is for investigating CPU problems, but I won’t go into anything CPU related in this article.

go tool pprof

Covering everything this tool does would take more than one blog post. Instead, I’ll spend a little time on how to get something useful out of it. In this article I will focus on its memory-related capabilities.

The pprof package creates heap dump files that you can then analyze and visualize to get a picture of both:

  • Current memory allocation
  • Total (cumulative) memory allocation

The tool can also compare snapshots. For example, it lets you compare what the situation looks like now versus 30 seconds ago. In stress scenarios this can help you locate problem areas in your code.
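For example, assuming we have already saved two heap profiles taken 30 seconds apart (the file names are placeholders), pprof can subtract one from the other with the -base flag; in the resulting session, top shows only the difference between the two snapshots:

go tool pprof -base heap_t0.out heap_t30.out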

pprof profiles

The way pprof works is by using profiles.

A profile is a collection of stack traces showing the call sequences that led to instances of a particular event, such as a memory allocation.

The file runtime/pprof/pprof.go contains the details and implementation of the profiles.

Go has several built-in profiles for us to use in common cases:

  • goroutine – stack traces of all current goroutines
  • heap – a sampling of memory allocations of live objects
  • allocs – a sampling of all past memory allocations
  • threadcreate – stack traces that led to the creation of new OS threads
  • block – stack traces that led to blocking on synchronization primitives
  • mutex – stack traces of holders of contended mutexes

When looking into memory issues we will focus on the heap profile. The allocs profile is identical to it as far as data collection goes; the difference between the two is how the pprof tool reads the data at start time. The allocs profile starts pprof in a mode that shows the total number of bytes allocated since the program began (including garbage-collected bytes). We usually use that mode when trying to make our code more efficient.
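For example, the same saved heap file can be opened in either mode by overriding the sample index on the command line (heap.out is a placeholder for whatever file you captured):

go tool pprof -sample_index=inuse_space heap.out
go tool pprof -sample_index=alloc_space heap.out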

The heap

In short, the heap is where memory for the objects our code allocates is stored. This memory is later garbage collected, or freed manually in non-garbage-collected languages.

The heap is not the only place where memory allocations happen; some memory is also allocated on the stack. The stack is mostly short-lived memory. In Go it is typically used for allocations that happen within the scope of a function. Another place Go uses the stack is when the compiler “knows” how much memory needs to be reserved at compile time (for example, fixed-size arrays). There is a way to make the Go compiler output its escape analysis, showing where allocations “escape” from the stack to the heap, but I won’t touch on it in this article.
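Although escape analysis is out of scope here, it is easy to peek at the compiler’s decisions. A small sketch (the names are arbitrary); building it with go build -gcflags="-m" prints which variables are “moved to heap”:

package main

func onStack() int {
	x := 42 // its address never leaves the function, so it can stay on the stack
	return x
}

func onHeap() *int {
	y := 42 // returned by pointer, so it escapes to the heap
	return &y
}

func main() {
	_ = onStack()
	_ = onHeap()
}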

Memory on the heap has to be garbage collected, whereas memory on the stack does not; this makes using the stack more efficient.

This is a brief description of allocating memory in different locations. There’s more, but it’s beyond the scope of this article.

Get heap data using pprof

There are two main ways to obtain the data. The first is usually used alongside a test or a branch: import runtime/pprof and call pprof.WriteHeapProfile(some_file) to write the heap information.

Note that WriteHeapProfile is syntactic sugar for running:

// lookup takes a profile name
pprof.Lookup("heap").WriteTo(some_file, 0)

According to the documentation, WriteHeapProfile exists for backwards compatibility. The other profile types do not have such a shortcut, and you must use the Lookup() function to get their profile data.
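Putting the first approach together, a minimal sketch of dumping a heap profile from code might look like this (the file name and the helper function are ours, not part of the library; calling runtime.GC() first just makes the statistics up to date):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // flush recent allocations into the profile statistics
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

func main() {
	// ... run the workload you want to inspect ...
	dumpHeapProfile("heap.out")
}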

The second, more interesting, option is to enable it over HTTP (web-based endpoints). This lets you extract the data ad hoc from a running container in your e2e/test environment, or even from “production”. This is one more place where the Go runtime and toolset excel. The full package documentation can be found here, but the TL;DR is that you add it to your code like this:

import (
	"net/http"
	_ "net/http/pprof"
)

...

func main() {
	...
	http.ListenAndServe("localhost:8080", nil)
}

The “side effect” of importing net/http/pprof is that the pprof endpoints are registered under the web server root at /debug/pprof. Using curl we can now get heap dump files to investigate:

curl -sK -v http://localhost:8080/debug/pprof/heap > heap.out

Adding ListenAndServe() is only required if your application does not already have an HTTP listener. If it does, there is no need to listen again; the endpoints take care of themselves. You can also register the handlers using a ServeMux.HandleFunc(), which makes more sense for more complex HTTP applications.
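A sketch of that ServeMux variant (the routes and port are arbitrary); net/http/pprof exposes handlers that can be wired onto your own mux instead of relying on the default one:

package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

func main() {
	mux := http.NewServeMux()

	// application routes
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	// pprof routes registered explicitly on this mux
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)

	log.Fatal(http.ListenAndServe("localhost:8080", mux))
}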

Using pprof

So we’ve collected the data; now what? As mentioned above, pprof has two main memory analysis strategies. One looks at the current memory allocations (bytes or object count), called inuse. The other looks at all the bytes or objects allocated throughout the run of the program, called alloc. That is the sum of all samples, whether or not the memory has since been garbage collected.

It is worth reiterating here that the heap profile file is a sampling of memory allocations. Behind the scenes pprof uses the runtime.MemProfile function, which by default collects allocation information at a rate of one sample per 512KB of allocated bytes. MemProfile can be changed to collect information on all objects; note that this will most likely slow your application down.

This means that, by default, there is a chance that a problem with small objects will slip under pprof’s radar. For a large codebase or a long-running program this is not a problem.
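If you do need every allocation recorded, for example in a short test, the sampling rate can be lowered; a minimal sketch (this must be done as early as possible in the program, and it will slow it down):

package main

import "runtime"

func init() {
	// Record every allocation instead of one sample per 512KB of allocated bytes.
	runtime.MemProfileRate = 1
}

func main() {
	// ... run the workload, then capture the heap profile as usual ...
}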

Once you have collected a profile file, you can load it into pprof’s interactive command line by running:

> go tool pprof heap.out

We will see the following information displayed:

Type: inuse_space
Time: Jan 22, 2019 at 1:08pm (IST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

The important thing to note here is Type: inuse_space. This means we are looking at the allocation data of a specific moment (when we captured the profile). Type is the configuration value of sample_index, and the possible values are (you can switch between them from the interactive shell, as shown in the example after this list):

  • inuse_space – amount of memory allocated and not yet released
  • inuse_objects – number of objects allocated and not yet released
  • alloc_space – total amount of memory allocated (regardless of whether it has been released)
  • alloc_objects – total number of objects allocated (regardless of whether they have been released)
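For example, since the same heap file contains all four sample types, you can switch between them from within one interactive session:

(pprof) sample_index=alloc_objects
(pprof) top
(pprof) sample_index=inuse_space
(pprof) top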

Now type top in the interactive shell; the output is the top memory consumers:

(pprof) top
Showing nodes accounting for 330.04MB, 93.73% of 352.11MB total
Dropped 19 nodes (cum <= 1.76MB)
Showing top 10 nodes out of 56
      flat  flat%   sum%        cum   cum%
  142.02MB 40.33% 40.33%   142.02MB 40.33%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/membuffers/go.(*InternalMessage).lazyCalcOffsets
      28MB  7.95% 48.29%       28MB  7.95%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.TransactionsBlockProofReader (inline)
   26.51MB  7.53% 55.81%    39.01MB 11.08%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.(*ResultsBlockHeaderBuilder).Build
   25.51MB  7.24% 63.06%    32.51MB  9.23%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.(*ResultsBlockProofBuilder).Build
      23MB  6.53% 69.59%       23MB  6.53%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.ResultsBlockHeaderReader (inline)
   20.50MB  5.82% 75.41%    20.50MB  5.82%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.TransactionsBlockMetadataReader (inline)
      20MB  5.68% 81.09%       20MB  5.68%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.TransactionsBlockHeaderReader (inline)
      16MB  4.54% 85.64%       24MB  6.82%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.(*TransactionsBlockHeaderBuilder).Build
   14.50MB  4.12% 89.76%   122.51MB 34.79%  github.com/orbs-network/orbs-network-go/services/gossip/codec.DecodeBlockPairs
      14MB  3.98% 93.73%       14MB  3.98%  github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/orbs-spec/types/go/protocol.ResultsBlockProofReader (inline)

We can see a line about Dropped Nodes, which means they have been filtered out. A node is an object entry, or a “node” in the tree. Dropping nodes helps us find problems faster, but it can sometimes hide the root cause of a memory issue; we will see an example of that as we continue.

If you want all the data in the profile, add the -nodefraction=0 option when running pprof, or type nodefraction=0 in the interactive shell.

In the output list, we see two values, flat and cum.

  • flat – the memory allocated by this function and still held by it
  • cum – the cumulative memory allocated by this function or by the functions further down its call stack

This information alone can sometimes help us understand whether there is a problem. For example, when a function is responsible for allocating a lot of memory but is not holding it, something else is pointing to that memory and keeping it allocated, which may indicate a system design problem or a bug.

top actually runs top10. The top command supports the topN format, where N is the number of entries you want to see. In the case above, typing top70 would show all 56 nodes.

Visualization

While topN gives a textual list, pprof comes with several very useful visualization options. You can type png, gif and others (see go tool pprof -help for the full list).
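For example, a PNG of the graph can be produced straight from the command line (Graphviz must be installed; the file names are placeholders):

go tool pprof -png heap.out > heap.png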

On our system, the default visual output is something like:

This may seem intimidating at first, but it is a visualization of the memory allocation flows (based on stack traces) in the program. Reading the graph is not as complicated as it looks. A white square with a number shows allocated space (with how much memory it accounts for on the edges of the graph), and each wider rectangle shows an allocating function.

Note that in the figure above I took a PNG from inuse_space mode. Many times you should also take a look at inuse_objects, as it can help you find allocation problems.

Dig deeper to find the root cause

So far we have been able to understand how memory is allocated in our application at runtime. This helps us get a feeling for the behavior (or misbehavior) of our program.

In our case we can see that memory is held by membuffers, which is our data serialization library. This does not mean we have a memory leak at that point in the code; it means the memory is being held by that function. Knowing how to read the graph, and the pprof output in general, is important. In this case, when we serialize data, we allocate memory for structs and primitive objects (int, string), and that memory is never released.

Jumping ahead to the conclusion, we can assume that one of the nodes along the serialization path is responsible for retaining the memory, for example:

We can see that somewhere along the chain the logging library holds more than 50MB of allocated memory. This is memory allocated by functions that our logger called. Thinking about it, this is actually expected. The logger allocates memory because it needs to serialize data in order to output it to the log, so it causes allocations in the process.

We can also see that, along the allocation path, the memory is held only by serialization and nothing else. In addition, the amount of memory retained by the logger is about 30% of the total. All of this tells us that, most likely, the problem is not with the logger. If it were 100%, or close to it, we should have kept looking – but it is not. It might mean that something that shouldn’t be logged is being logged, but it is not a memory leak on the logger side.

It is time to introduce another pprof command, called list. It accepts a regular expression that filters what to list. The “list” is actually the annotated source code related to the allocations. Let’s run list RequestNew, since we want to see the calls made to the logger. These calls come from two functions that happen to begin with the same prefix.

(pprof) list RequestNew
Total: 352.11MB
ROUTINE ======================== github.com/orbs-network/orbs-network-go/services/consensuscontext.(*service).RequestNewResultsBlock in /Users/levison/work/go/src/github.com/orbs-network/orbs-network-go/services/consensuscontext/service.go
         0    77.51MB (flat, cum) 22.01% of Total
         .          .     82:}
         .          .     83:
         .          .     84:func (s *service) RequestNewResultsBlock(ctx context.Context, input *services.RequestNewResultsBlockInput) (*services.RequestNewResultsBlockOutput, error) {
         .          .     85:	logger := s.logger.WithTags(trace.LogFieldFrom(ctx))
         .          .     86:
         .    47.01MB     87:	rxBlock, err := s.createResultsBlock(ctx, input)
         .          .     88:	if err != nil {
         .          .     89:		return nil, err
         .          .     90:	}
         .          .     91:
         .    30.51MB     92:	logger.Info("created Results block", log.Stringable("results-block", rxBlock))
         .          .     93:
         .          .     94:	return &services.RequestNewResultsBlockOutput{
         .          .     95:		ResultsBlock: rxBlock,
         .          .     96:	}, nil
         .          .     97:}
ROUTINE ======================== github.com/orbs-network/orbs-network-go/services/consensuscontext.(*service).RequestNewTransactionsBlock in /Users/levison/work/go/src/github.com/orbs-network/orbs-network-go/services/consensuscontext/service.go
         0    64.01MB (flat, cum) 18.18% of Total
         .          .     58:}
         .          .     59:
         .          .     60:func (s *service) RequestNewTransactionsBlock(ctx context.Context, input *services.RequestNewTransactionsBlockInput) (*services.RequestNewTransactionsBlockOutput, error) {
         .          .     61:	logger := s.logger.WithTags(trace.LogFieldFrom(ctx))
         .          .     62:	logger.Info("starting to create transactions block", log.BlockHeight(input.CurrentBlockHeight))
         .    42.50MB     63:	txBlock, err := s.createTransactionsBlock(ctx, input)
         .          .     64:	if err != nil {
         .          .     65:		logger.Info("failed to create transactions block", log.Error(err))
         .          .     66:		return nil, err
         .          .     67:	}
         .          .     68:
         .          .     69:	s.metrics.transactionsRate.Measure(int64(len(txBlock.SignedTransactions)))
         .    21.50MB     70:	logger.Info("created transactions block", log.Int("num-transactions", len(txBlock.SignedTransactions)), log.Stringable("transactions-block", txBlock))
         .          .     71:	s.printTxHash(logger, txBlock)
         .          .     72:	return &services.RequestNewTransactionsBlockOutput{
         .          .     73:		TransactionsBlock: txBlock,
         .          .     74:	}, nil
         .          .     75:}

We can see that the allocations appear in the cum column, which means the allocated memory is retained down the call stack. This correlates with what the graph shows. At this point it is easy to see that the reason the logger allocates memory is that we send it the entire “block” object, and it needs to serialize parts of it (our objects are membuffers objects, which implement some String() functions). Is it a useful log, or good practice? Probably not, but it is not a memory leak, neither on the logger side nor in the code that calls the logger.

list searches your GOPATH to find the source code. If the root it searches for does not match (because the profile was built on a different machine or path), you can use the -trim_path option; this helps fix the paths so you can see the annotated source. Remember to have your git checked out at the commit that was running when the heap profile was captured.
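For example (the path below is a placeholder for the build prefix recorded in the profile on the machine that produced it):

go tool pprof -trim_path=/original/build/prefix heap.out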

Cause of memory leak

The investigation was prompted by a suspected memory leak. We found that memory consumption was higher than the system should reasonably need. On top of that, we saw it constantly increasing, which was another strong indicator that something was wrong.

At this point, in Java or .NET we would open some “GC roots” analysis or profiler and get to the actual object that references the data and keeps it alive. As explained, this is not really possible with Go, both because of the tooling problem and because of Go’s low-level memory representation.

Without going into detail, we do not believe Go keeps track of which object is stored at which address (except perhaps for pointers). This means that, in practice, knowing which memory address represents which member of which object (struct) would require some kind of mapping to be produced alongside a heap dump. It may also mean that a heap profile would have to be taken right before a full core dump, so that addresses can be mapped to the allocating line and file, and from there to the object represented in memory.

At this point, because we are familiar with our system, it was easy to understand that this was no longer a bug. It was (almost) by design. But let’s keep exploring how to extract information from the tool (pprof) to get to the root cause.

With nodefraction=0 set, we see the entire graph of allocated objects, including the smaller ones. Let’s look at the output:

We have two new subtrees. Remember, again, that the pprof heap profile is a sampling of allocations; for our system this works and we do not miss anything important. The long green subtree is the test runner, which is completely disconnected from the rest of the system and not interesting for this article.

The shorter blue subtree, which has an edge connecting it to the whole system, is inMemoryBlockPersistence. That name also explains the “leak” we imagined: this is the in-memory data backend, which keeps all the data in memory rather than persisting it to disk. What is notable is that we can see it holds two large objects. Why two? Because the object is 1.28MB in size and the function holds 2.57MB, meaning there are two of them.

The problem is now well understood. We could use delve (the debugger) to step into the code and inspect the memory.

How to fix it

It was an embarrassing human error. While the process was educational, can we do better?

We can still “sniff” around the heap. The deserialized data was taking up too much memory; why should something that ought to be substantially smaller need 142MB? pprof can answer that question – in fact, it answers exactly such questions.

To view the annotated source code for the function, we run list lazy. We use lazy because the function we are looking for is lazyCalcOffsets(), and no other function in our code starts with lazy. Of course, typing list lazyCalcOffsets would work as well.

(pprof) list lazy
Total: 352.11MB
ROUTINE ======================== github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/membuffers/go.(*InternalMessage).lazyCalcOffsets in /Users/levison/work/go/src/github.com/orbs-network/orbs-network-go/vendor/github.com/orbs-network/membuffers/go/message.go
  142.02MB   142.02MB (flat, cum) 40.33% of Total
         .          .     29:
         .          .     30:func (m *InternalMessage) lazyCalcOffsets() bool {
         .          .     31:	if m.offsets != nil {
         .          .     32:		return true
         .          .     33:	}
      36MB       36MB     34:	res := make(map[int]Offset)
         .          .     35:	var off Offset = 0
         .          .     36:	var unionNum = 0
         .          .     37:	for fieldNum, fieldType := range m.scheme {
         .          .     38:		// write the current offset
         .          .     39:		off = alignOffsetToType(off, fieldType)
         .          .     40:		if off >= m.size {
         .          .     41:			return false
         .          .     42:		}
  106.02MB   106.02MB     43:		res[fieldNum] = off
         .          .     44:
         .          .     45:		// skip over the content to the next field
         .          .     46:		if fieldType == TypeUnion {
         .          .     47:			if off+FieldSizes[TypeUnion] > m.size {
         .          .     48:				return false

We can see two interesting pieces of information. Again, remember that the pprof heap profile samples allocations. We can see that the flat and cum numbers are the same, which indicates that the memory allocated at these points is also retained at these allocation points.

Next, we can see that make() takes up some memory. That makes sense; it is the pointer to the data structure. However, we also see that the assignment on line 43 takes up memory, meaning it creates an allocation.

This brings us to maps, where an assignment into a map is not a simple variable assignment. This article explains in detail how maps work. In short, a map has more overhead and “costs” more per element than a slice.
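To make the cost difference concrete, here is a small hypothetical benchmark sketch (the types and sizes are made up, not taken from the real schema) that fills a map[int]Offset versus a pre-sized []Offset with dense, sequential keys:

package offsets

import "testing"

type Offset uint32

const numFields = 20

func BenchmarkMapOffsets(b *testing.B) {
	for i := 0; i < b.N; i++ {
		res := make(map[int]Offset)
		for f := 0; f < numFields; f++ {
			res[f] = Offset(f * 4)
		}
		_ = res
	}
}

func BenchmarkSliceOffsets(b *testing.B) {
	for i := 0; i < b.N; i++ {
		res := make([]Offset, numFields)
		for f := 0; f < numFields; f++ {
			res[f] = Offset(f * 4)
		}
		_ = res
	}
}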

The following should be taken with a grain of salt: if memory consumption is a relevant consideration, it does no harm to prefer a slice over map[int]T when the data is not sparse or can be converted to sequential indices; in general a slice should be used. However, when a large slice has to grow, the operation may slow down, whereas in a map that slowdown would be negligible. There is no one-size-fits-all approach to optimization.

In the lazyCalcOffsets code above, after looking at how we used the map, we realized that although we imagined it as a sparse array, it was not actually that sparse. This matches the argument above, and we immediately realized that a small refactoring, changing the map into a slice, was possible and might make the code more memory efficient. So we changed it to:

func (m *InternalMessage) lazyCalcOffsets() bool {
	if m.offsets != nil {
		return true
	}
	res := make([]Offset, len(m.scheme))
	var off Offset = 0
	var unionNum = 0
	for fieldNum, fieldType := range m.scheme {
		// write the current offset
		off = alignOffsetToType(off, fieldType)
		if off >= m.size {
			return false
		}
		res[fieldNum] = off

As simple as that, we now use a slice instead of a map. Because of the way we lazily load data into it, and the way we later access that data, apart from these two lines and the struct that holds the data, no other code had to change. What did this do to memory consumption?

Let’s look at benchcmp for a couple of the tests:

benchmark                       old ns/op     new ns/op     delta
BenchmarkUint32Read-4           2047          1381          -32.54%
BenchmarkUint64Read-4           507           321           -36.69%
BenchmarkSingleUint64Read-4     251           164           -34.66%
BenchmarkStringRead-4           1572          1126          -28.37%

benchmark                       old allocs    new allocs    delta
BenchmarkUint32Read-4           14            7             -50.00%
BenchmarkUint64Read-4           4             2             -50.00%
BenchmarkSingleUint64Read-4     2             1             -50.00%
BenchmarkStringRead-4           12            6             -50.00%

benchmark                       old bytes     new bytes     delta
BenchmarkUint32Read-4           1120          80            -92.86%
BenchmarkUint64Read-4           320           16            -95.00%
BenchmarkSingleUint64Read-4     160           8             -95.00%
BenchmarkStringRead-4           960           32            -96.67%

The read tests create the allocated data structures as part of their initialization. We can see that the run time improved by about 30%, allocations dropped by 50%, and memory consumption dropped by more than 90% (!).

Since the slice (previously a map) is never loaded with very much data, these numbers show pretty much what we will see in production. It depends on the data entropy, but there may be cases where the improvements in both allocations and memory consumption are even larger.

Looking at pprof again for the same test and taking a heap profile, we see that memory consumption has indeed dropped by about 90%.

It is worth repeating that for smaller data sets you should not use a map where a slice would do, since maps carry a large overhead.

Full core dump

As mentioned above, this is where we hit the limits of today’s tooling. When we investigated the problem, we believed we would be able to find the root object, but without much success. Go evolves quickly over time, but that evolution comes at a price when it comes to full dumps and the memory representation: the full heap dump format is not backward compatible when it changes. The latest version is described here, and to write a full heap dump you can use debug.WriteHeapDump().
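For reference, a sketch of writing such a dump (the file name is arbitrary); note that WriteHeapDump suspends all goroutines while it writes, and the format of the dump may change between Go versions:

package main

import (
	"log"
	"os"
	"runtime/debug"
)

func writeFullHeapDump(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	debug.WriteHeapDump(f.Fd()) // takes the file descriptor, not the *os.File
}

func main() {
	writeFullHeapDump("heap.dump")
}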

That said, right now we do not find ourselves “stuck” just because there is no good solution for exploring full dumps; so far pprof has answered all our questions.

Note that the internet remembers a lot of information that is no longer relevant. If you are going to try to open a full dump yourself, here are a few things to ignore, as of go1.11:

  • There is no way to open and debug a full core dump on MacOS, only on Linux.
  • The tools at github.com/randall77/h… are for Go1.3; there is a branch for 1.7+, but it does not work properly either (incomplete).
  • The viewer at github.com/golang/debu… does not really compile. It is easy enough to fix (the internal packages point to golang.org instead of github.com), but then it probably still will not work on MacOS, and maybe not on Linux either.
  • Also, github.com/randall77/c… fails on MacOS as well.

pprof UI

One detail worth noting about pprof is its web UI. It can save a lot of time when you start investigating an issue with an existing pprof profile file. (Translator's note: Graphviz needs to be installed.)

go tool pprof -http=:8080 heap.out

It should open a web browser; if it does not, browse to the port you set. It lets you change the options and get visual feedback much faster than the command line – a very useful way to consume the information.

The UI is also what made me familiar with flame graphs, which expose the guilty parts of the code very quickly.

Conclusion

Go is an exciting language with a very rich toolset, and there is a lot more you can do with pprof. CPU profiling, for example, is not touched on in this article.

Some other good articles:

  • rakyll.org/archive/ – I believe she is one of the main contributors around Go performance tooling, and there are many good posts on her blog
  • github.com/google/gops – written by JBD (who runs rakyll.org); this tool deserves a blog post of its own.
  • medium.com/@cep21/usin… – go tool trace is used for CPU profiling; this is a good post about that profiling feature.