Causes

In recent days, monitoring showed that the memory used by one of our Go API services kept rising, from an initial 70M to more than 1G, until the service ran out of memory. It was safe to conclude there was a memory leak. But leaks are rare in a garbage-collected language, and when they do occur they tend to be serious, so the detailed investigation below is written up to build experience for handling similar problems in the future.

Initial analysis

Goroutine leaks

There is a well-known saying that nine times out of ten, a memory leak in Go is a goroutine leak. Was that true here? The number of goroutines can be queried with runtime.NumGoroutine(), or, with pprof enabled, by looking at /debug/pprof. The first step was to capture the live numbers for local analysis.

```go
// main.go
mux := http.NewServeMux()
mux.Handle("/debug/vars", expvar.Handler())

func sampleInterval() chan struct{} {
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(1 * time.Second)
		numG := expvar.NewInt("runtime.goroutines")
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				numG.Set(int64(runtime.NumGoroutine()))
			case <-done:
				return
			}
		}
	}()
	return done
}
```

```
// run locally
$ curl http://host:port/debug/vars
{
    "runtime.goroutines": 48
}
```

The returned done channel lets us stop the sampler gracefully when the system shuts down. After capturing and charting the numbers for a while, it turned out goroutines were not leaking. If they had been, /debug/pprof/goroutine?debug=1 could have been used to inspect the state of every goroutine on the live service and dig further, but by elimination this was clearly not the cause of this incident.

Long-lived references

A new question naturally follows: the goroutine count rises and falls on its own, which means goroutines run and exit normally, so a leak caused by goroutines blocking forever while waiting for results can be ruled out. Does that mean the leak has nothing to do with goroutines at all? Not necessarily. A goroutine can still hold references to some of our objects; when the GC marks those objects as reachable, they cannot be collected, yet for some reason they keep being held while new memory keeps being allocated against them. This is the second major cause of rising memory.

So the problem narrows down to an object, or a group of allocations. It could be a big slice that keeps gaining elements while referenced by an object with a long life cycle, or an object holding a pointer into a linked-list-like chain of other objects. If such a reference chain keeps growing, it also produces a memory leak.

Analyze heap memory using pprof sampling

By fetching the pprof heap profile, we can generate a snapshot of the current heap allocations directly on the live service and persist it locally as a .pb.gz archive for further offline analysis.

```
go tool pprof http://[domain]/debug/pprof/heap
```

From the pprof interactive command line, heap snapshots and various statistics can be persisted locally for analysis. Here I captured two snapshots directly on the live service, a few hours apart, for local comparison:

```
File: binary
Type: inuse_space
Time: Feb 27, 2020 at 12:06pm (CST)
(pprof) top
Showing nodes accounting for 44174.92kB, 93.50% of 47247.81kB total
Showing top 10 nodes out of 114
      flat  flat%   sum%        cum   cum%
35842.64kB 75.86% 75.86% 35842.64kB 75.86%  myfuncA
 1825.78kB  3.86% 79.73%  1825.78kB  3.86%  github.com/mozillazg/go-pinyin.init
 1805.17kB  3.82% 83.55%  2354.01kB  4.98%  compress/flate.NewWriter
 1024.14kB  2.17% 85.71%  2048.25kB  4.34%  database/sql.(*DB).prepareDC
 1024.08kB  2.17% 87.88%  1024.08kB  2.17%  net.(*file).getLineFromData
  561.50kB  1.19% 89.07%   561.50kB  1.19%  html.populateMaps
  548.84kB  1.16% 90.23%   548.84kB  1.16%  compress/flate.(*compressor).initDeflate
  516.76kB  1.09% 91.32%   516.76kB  1.09%  net/http.(*http2Framer).WriteDataPadded
  513.31kB  1.09% 92.41%   513.31kB  1.09%  regexp.onePassCopy
  512.69kB  1.09% 93.50%   512.69kB  1.09%  vendor/golang.org/x/net/http2/hpack.(*headerFieldTable).addEntry
```

```
File: binary
Type: inuse_space
Time: Feb 27, 2020 at 5:58pm (CST)
(pprof) top
Showing nodes accounting for 80.65MB, 83.62% of 96.45MB total
Showing top 10 nodes out of 145
      flat  flat%   sum%        cum   cum%
      35MB 36.29% 36.29%       35MB 36.29%  myfuncA
    8.50MB  8.81% 45.10%       11MB 11.41%  github.com/lib/pq.parseStatementRowDescribe
    7.05MB  7.31% 52.42%     7.05MB  7.31%  database/sql.(*DB).addDepLocked
       6MB  6.22% 58.64%        6MB  6.22%  encoding/json.(*decodeState).literalStore
       5MB  5.18% 63.82%    19.50MB 20.22%  github.com/lib/pq.(*conn).prepareTo
    4.59MB  4.76% 68.58%     4.59MB  4.76%  reflect.unsafe_NewArray
    4.50MB  4.67% 73.25%       11MB 11.41%  funcB
       4MB  4.15% 77.40%    18.56MB 19.24%  database/sql.(*DB).prepareDC
    3.50MB  3.63% 81.03%     3.50MB  3.63%  github.com/lib/pq.decideColumnFormats
```

Comparing the functions that hold the most memory: although myfuncA holds a lot, its usage is identical at both points in time, while total usage grew from 47M to 96M, so myfuncA can be ruled out as the leaker. Deserializing DB query results also needs a lot of memory, and under the Go runtime's allocation strategy the GC neither reclaims that memory immediately nor returns freed memory to the OS right away; it lingers for a while in the allocator's multi-level memory pools. So some noise still interferes with the analysis. One example is reflect.unsafe_NewArray, which shows up because it requests memory whenever encoding/json deserializes into a slice: it creates the slice from reflected type information and can be understood as a constructor for a generic slice.

Compare the differences between two snapshots

Eyeballing the difference between the two snapshots is quite laborious. It is much clearer to use the first point in time as a baseline and look at the diff against the second:

```
go tool pprof -base pprof.binary.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz pprof.binary.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
```
```
File: binary
Type: inuse_space
Time: Feb 27, 2020 at 12:06pm (CST)
(pprof) top
Showing nodes accounting for 37.65MB, 76.19% of 49.42MB total
Dropped 1 node (cum <= 0.25MB)
Showing top 10 nodes out of 118
      flat  flat%   sum%        cum   cum%
    6.50MB 13.16% 13.16%        9MB 18.21%  github.com/lib/pq.parseStatementRowDescribe
    5.05MB 10.23% 23.38%     5.05MB 10.23%  database/sql.(*DB).addDepLocked
       5MB 10.12% 33.50%        5MB 10.12%  encoding/json.(*decodeState).literalStore
    4.58MB  9.28% 42.78%     4.58MB  9.28%  funcB
       4MB  8.10% 58.97%       16MB 32.38%  github.com/lib/pq.(*conn).prepareTo
       3MB  6.07% 65.04%        3MB  6.07%  github.com/lib/pq.decideColumnFormats
    2.50MB  5.06% 70.10%     2.50MB  5.06%  github.com/lib/pq.(*readBuf).string
    1.51MB  3.05% 73.15%     1.51MB  3.05%  github.com/lib/pq.textDecode
    1.50MB  3.04% 76.19%    39.10MB 79.12%  ctrlfuncA
```

Using -base to specify the baseline and running top again lists exactly who is responsible for the extra 50M. funcA is a controller function; funcB is a model-layer method called by funcA that talks directly to PostgreSQL to fetch data; next come standard library functions that deserialize JSON data fields; and finally some functions related to database connections. What is already clear is that the memory leak must be somewhere on this method call chain.

Using `list <method name>` on the interactive command line, we can analyze the concrete usage along the call chain step by step and exclude the noise from deserialization allocations. Combined with snapshot 2, this points to the *DB and *conn objects, and the method calls point to the Preparex method provided by sqlx and the Get method of the Stmt that Preparex produces.

```go
func (r repoXXX) FindByID(id string) (*T, error) {
	stmt, err := r.Preparex(`MY SQL QUERY`) // ???
	if err != nil {
		return nil, err
	}
	t := new(T)
	err = stmt.Get(t, id) // ???
	if err != nil {
		if err == sql.ErrNoRows {
			return nil, errno.TorNotFound
		}
		return nil, err
	}
	return t, nil
}
```

Find the culprit

sqlx is just a wrapper around database/sql, so to see what these two methods actually do, we have to start with database/sql.

Creating the DB

```go
func Open(driverName, dataSourceName string) (*DB, error) {
	// Check whether the database driver has been initialized.
	driversMu.RLock()
	driveri, ok := drivers[driverName]
	driversMu.RUnlock()
	// Not initialized: error out.
	if !ok {
		return nil, fmt.Errorf("sql: unknown driver %q (forgotten import?)", driverName)
	}

	if driverCtx, ok := driveri.(driver.DriverContext); ok {
		connector, err := driverCtx.OpenConnector(dataSourceName)
		if err != nil {
			return nil, err
		}
		// Initialize the DB object and return it.
		return OpenDB(connector), nil
	}

	return OpenDB(dsnConnector{dsn: dataSourceName, driver: driveri}), nil
}
```

OpenDB() is obviously the DB constructor, so let's keep going:

```go
func OpenDB(c driver.Connector) *DB {
	ctx, cancel := context.WithCancel(context.Background())
	db := &DB{
		connector:    c,
		openerCh:     make(chan struct{}, connectionRequestQueueSize),
		resetterCh:   make(chan *driverConn, 50),
		lastPut:      make(map[*driverConn]string),
		connRequests: make(map[uint64]chan connRequest),
		stop:         cancel,
	}

	go db.connectionOpener(ctx)
	go db.connectionResetter(ctx)

	return db
}
```

A few key components:

  • openerCh caches requests to create new connections. It is a buffered channel with first-in-first-out semantics, and since the messages are pure signals, chan struct{} is the natural type
  • resetterCh is used to asynchronously reset a connection's session and hand back any error from the query the connection just executed
  • two goroutines are started to carry out the two operations above asynchronously
  • lastPut is for debugging only; skip it
  • connRequests queues requests asynchronously when there are no idle database connections and new ones cannot be created. The key is an incrementing number, the value uses a chan to prevent reads from blocking, and len() makes the waiters easy to count

```go
func (db *DB) putConnDBLocked(dc *driverConn, err error) bool {
	// ...
	if c := len(db.connRequests); c > 0 {
		var req chan connRequest
		var reqKey uint64
		for reqKey, req = range db.connRequests {
			break
		}
		delete(db.connRequests, reqKey)
		if err == nil {
			dc.inUse = true
		}
		// Hand the connection to this waiting request.
		req <- connRequest{
			conn: dc, // *driverConn database connection
			err:  err,
		}
		return true
	}
	// ...
}
```

connRequest represents a request for a connection. connRequests queues such requests globally: whenever a connection is freed or created, the DB tries to reuse the available driverConn to satisfy the queued requests.

Record dependent references to resources

The DB object's conn method is responsible for obtaining a connection, after which our statement can be sent to the database. Take a SELECT query as an example:

```go
func (db *DB) query(ctx context.Context, query string, args []interface{}, strategy connReuseStrategy) (*Rows, error) {
	dc, err := db.conn(ctx, strategy)
	if err != nil {
		return nil, err
	}
	// The familiar stmt: the Rows object is created here, bound to the
	// driverConn that executes the query, and returned.
	return db.queryDC(ctx, nil, dc, dc.releaseConn, query, args)
}
```

Let's not get ahead of ourselves; after all, we are not accessing the database through this path. But the call above helps clarify how the library organizes its underlying objects. Now back to sqlx's Preparex method, where our problem started, which ultimately calls the standard library's PrepareContext method:

```go
func (db *DB) PrepareContext(ctx context.Context, query string) (*Stmt, error) {
	var stmt *Stmt
	var err error
	// Retry a limited number of times; the flag controls the connection
	// strategy (prefer a cached connection).
	for i := 0; i < maxBadConnRetries; i++ {
		stmt, err = db.prepare(ctx, query, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.prepare(ctx, query, alwaysNewConn)
	}
	return stmt, err
}

func (db *DB) prepare(ctx context.Context, query string, strategy connReuseStrategy) (*Stmt, error) {
	dc, err := db.conn(ctx, strategy)
	if err != nil {
		return nil, err
	}
	return db.prepareDC(ctx, dc, dc.releaseConn, nil, query)
}
```

The prepare method here looks very similar to the query method above; we'll return to the differences later. For now, let's follow the prepareDC clue.

```go
// cg == nil means the statement goes through the connection pool
// rather than a single Tx or Conn.
func (db *DB) prepareDC(ctx context.Context, dc *driverConn, release func(error), cg stmtConnGrabber, query string) (*Stmt, error) {
	var ds *driverStmt
	var err error
	defer func() {
		release(err)
	}()
	withLock(dc, func() {
		ds, err = dc.prepareLocked(ctx, cg, query)
	})
	if err != nil {
		return nil, err
	}
	stmt := &Stmt{
		db:    db,
		query: query,
		cg:    cg,
		cgds:  ds,
	}

	// This is the real point of Preparex.
	if cg == nil {
		stmt.css = []connStmt{{dc, ds}}
		// Record the total number of closed connections at this moment.
		stmt.lastNumClosed = atomic.LoadUint64(&db.numClosed)
		// Add a dependency reference: the DB now holds the Stmt.
		db.addDep(stmt, stmt)
	}
	return stmt, nil
}
```

The returned stmt is the result of Preparex. Then Get executes: wrapped by sqlx, it calls the standard library's func (s *Stmt) QueryContext and then sqlx's func (r *Row) scanAny.

One thing is certain: after Preparex is called, a reference to the Stmt is held by the DB. As long as the DB object exists, the Stmt must be dereferenced manually before it can be released. Is there something like the removeDep we are hoping for?

```go
func (s *Stmt) QueryContext(ctx context.Context, args ...interface{}) (*Rows, error) {
	// ...
	rowsi, err = rowsiFromStatement(ctx, dc.ci, ds, args...)
	if err == nil {
		rows := &Rows{
			dc:    dc,
			rowsi: rowsi,
		}
		s.db.addDep(s, rows) // HERE
		rows.releaseConn = func(err error) {
			releaseConn(err)
			s.db.removeDep(s, rows) // HERE
		}
		var txctx context.Context
		if s.cg != nil {
			txctx = s.cg.txCtx()
		}
		rows.initContextClose(ctx, txctx)
		return rows, nil
	}
	releaseConn(err)
	// ...
}
```

There is, but this code only records references to rows and handles reuse of the database connection; it never releases the Stmt object itself. The same goes for sqlx's scanAny, which merely calls Scan to deserialize the data returned by the database into the fields of the struct passed to Get.

dep is a dictionary used for reference counting, ensuring that dependents are released first: an element can only be freed after everything that depends on it has been freed, at which point its finalClose method releases the resources it holds.

Solve the leak

To fix the memory leak, the dep dictionary must drop its reference to the Stmt. A quick search through the sql.go file shows that releasing the Stmt object is hidden in its Close method.

```go
func (s *Stmt) Close() error {
	// ...
	if s.cg == nil {
		return s.db.removeDep(s, s)
	}
	// ...
}
```

One line of code does the trick:

```go
stmt, _ := r.Preparex(`MY SQL QUERY`)
defer stmt.Close()
// ...
```

After redeploying and sampling the memory usage again:

GC now collects the Stmt objects properly. Problem solved!

Conclusion

Lessons learned:

  • Beforehand: load-test; whatever can be solved offline should never reach production; use pprof; write careful code so you don't step on mines
  • Online: have monitoring and alerting in place; memory crossing a threshold must be investigated; make good use of net/http/pprof
  • Day to day: get familiar with the packages you use heavily and with the standard library, especially the standard library, which hides many details where a moment's carelessness lands you in a pit

Other potential leaks (Easter eggs)

To be honest, this is not the first memory leak I've run into in Go, even though Go does not produce such problems easily. Reflecting on the pits I've stepped in, I've found that leaks caused by goroutines can actually be resolved quickly through pprof or monitoring; it is the long-held references that are easy to overlook. Their telltale signs often appear in:

  • Object pools that are manually maintained
  • Manually controlled system resources: timers, connections, IO buffer streams, etc
  • Settings whose default is unlimited: for example, a DB's maxOpen