Recently I ran into a GC problem at work: the project created a large number of objects over and over, which produced a lot of GC work and frequent CPU spikes. I planned to use sync.Pool to cache objects and reduce the GC overhead, and studied it carefully before using it, which resulted in this article. It goes from usage to source code analysis, step by step.

This article is based on Go 1.14

What is sync.Pool

sync.Pool is a component of the sync package that serves as a "pool" holding temporarily reusable objects. The name is somewhat misleading, because objects in the Pool may be reclaimed at any time without notice; sync.Cache would probably be a more fitting name.

What is it for

sync.Pool is a good choice wherever memory would otherwise be allocated and reclaimed repeatedly. Frequent allocation and reclamation burdens the GC and, in severe cases, causes CPU spikes. sync.Pool caches objects that are temporarily unused so they can be reused directly the next time they are needed, without allocating memory again. Reusing object memory this way reduces GC pressure and improves system performance.

How to use

First, sync.Pool is goroutine-safe, which makes it extremely convenient to use. Before using the Pool, set its New function, which creates an object when the Pool has nothing cached. After that, objects can be fetched and returned anywhere in the program, at any time, simply by calling Get() and Put().

The following description of when sync.Pool helps is based on a 2018 Go Night Reading session on sync.Pool:

When many goroutines need to create the same kind of object, the number of objects created grows sharply as concurrency rises, and GC pressure increases with it. This forms a vicious cycle: high concurrency, high memory usage, slow GC, reduced processing capacity, even higher concurrency.

At this point, you need a pool of objects, and instead of creating objects individually, each Goroutine gets an object from the pool (if one already exists).

So the key idea is to reuse objects to avoid repeated creation and destruction. Let’s look at how to use it.

Simple example

Let’s start with a simple example:

package main
import (
	"fmt"
	"sync"
)

var pool *sync.Pool

type Person struct {
	Name string
}

func initPool() {
	pool = &sync.Pool{
		New: func() interface{} {
			fmt.Println("Creating a new Person")
			return new(Person)
		},
	}
}

func main() {
	initPool()

	p := pool.Get().(*Person)
	fmt.Println("Get from pool for first time:", p)

	p.Name = "first"
	fmt.Printf("Set p.name = %s\n", p.Name)

	pool.Put(p)

	fmt.Println("Pool already has an object: &{first}, call Get:", pool.Get().(*Person))
	fmt.Println("Pool out of objects, call Get:", pool.Get().(*Person))
}

Running results:

Creating a new Person
Get from pool for first time: &{}
Set p.name = first
Pool already has an object: &{first}, call Get: &{first}
Creating a new Person
Pool out of objects, call Get: &{}

First, you need to initialize the Pool; the only thing required is to set the New function. When Get is called, it returns a cached object if the pool has one in stock. If there is nothing in stock, it calls New to create a new object.

In addition, we can see that the object Get returns is the very one that was Put back a moment earlier; the Pool did not "reset" it. But we should not rely on this: in a real concurrent scenario this order is not guaranteed, and it is best to reset an object's state when putting it back (or right after getting it).
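
As a minimal sketch of that advice (bufPool and render are made-up names, not part of the example above), resetting the object before Put keeps stale data from leaking to the next Get:

package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(s string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // reset before Put so the next Get never sees stale content
		bufPool.Put(buf)
	}()
	buf.WriteString("hello, ")
	buf.WriteString(s)
	return buf.String()
}

func main() {
	fmt.Println(render("pool"))
}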

How the fmt package uses it

Take fmt.Printf as an example:

func Printf(format string, a ...interface{}) (n int, err error) {
	return Fprintf(os.Stdout, format, a...)
}

Moving on to Fprintf:

func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error) {
	p := newPrinter()
	p.doPrintf(format, a)
	n, err = w.Write(p.buf)
	p.free()
	return
}

Fprintf takes an io.Writer; Printf passes os.Stdout, so the output goes straight to standard output. newPrinter is where the Pool comes in:

// newPrinter allocates a new pp struct or grabs a cached one.
func newPrinter() *pp {
	p := ppFree.Get().(*pp)
	p.panicking = false
	p.erroring = false
	p.wrapErrs = false
	p.fmt.init(&p.buf)
	return p
}

var ppFree = sync.Pool{
	New: func() interface{} { return new(pp) },
}

Going back to Fprintf: after obtaining the pp pointer, it performs the formatting and writes the contents of p.buf to w. Finally, it calls free to return the pp pointer to the Pool:

// free saves used pp structs in ppFree; avoids an allocation per invocation.
func (p *pp) free() {
	if cap(p.buf) > 64<<10 {
		return
	}

	p.buf = p.buf[:0]
	p.arg = nil
	p.value = reflect.Value{}
	p.wrappedErr = nil
	ppFree.Put(p)
}

Some fields of the object are cleared before it is returned to the Pool, so that a cached object retrieved later via Get can be used safely. Note also that a pp whose buffer has grown beyond 64 KB is not returned at all, which keeps oversized objects out of the pool.

pool_test

The test file is a good way to learn the source code, because it represents "official" usage. More importantly, the test cases deliberately probe the tricky corners, and studying them teaches you how to avoid those pitfalls in your own code.

There are seven tests and four benchmarks in the pool_test file.

TestPool and TestPoolNew are simple and test Get/Put functions. Let’s look at TestPoolNew:

func TestPoolNew(t *testing.T) {
	// disable GC so we can control when it happens.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))

	i := 0
	p := Pool{
		New: func() interface{} {
			i++
			return i
		},
	}
	if v := p.Get(); v != 1 {
		t.Fatalf("got %v; want 1", v)
	}
	if v := p.Get(); v != 2 {
		t.Fatalf("got %v; want 2", v)
	}

	// Make sure that the goroutine doesn't migrate to another P
	// between Put and Get calls.
	Runtime_procPin()
	p.Put(42)
	if v := p.Get(); v != 42 {
		t.Fatalf("got %v; want 42", v)
	}
	Runtime_procUnpin()

	if v := p.Get(); v != 3 {
		t.Fatalf("got %v; want 3", v)
	}
}

First, GC is disabled by setting the GC percentage to -1. But why defer? The function is about to return anyway, so what is left to defer? Notice that debug.SetGCPercent is called twice and that it returns the previous setting. The defer therefore restores the GC setting that was in effect before this function ran, i.e. it cleans up after itself.

Next, the Pool's New function is defined: it returns an int that increments by 1 on every call. Get is then called twice in a row; since there is nothing cached in the Pool, New is called each time, so the first call returns 1 and the second returns 2.

Runtime_procPin() is then called to prevent the goroutine from being preempted, protecting the following Put and Get so that both operate on the poolLocal of the same P. This Get therefore does not call New, because of the Put just before it.

Finally, Get is called again, because there is no “inventory”, so New is called again to create an object.

TestPoolGC and TestPoolRelease test the effect of GC on Pool objects. runtime.SetFinalizer is used here to count how many objects the GC collects:

runtime.SetFinalizer(v, func(vv *string) {
	atomic.AddUint32(&fin, 1)
})

When the garbage collector finds that v is unreachable and v has an associated finalizer, a separate goroutine runs the finalizer (the func argument above). This makes v reachable again, so it is not reclaimed during that GC cycle. Afterwards the association between v and its finalizer is removed, and v is reclaimed only when the next GC finds it unreachable again.
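
A minimal, self-contained sketch of this counting trick (the variable names are made up, and because finalizers run asynchronously on their own goroutine, the printed count may lag behind 10):

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

func main() {
	var fin uint32

	for i := 0; i < 10; i++ {
		v := new(string)
		// The finalizer runs on a separate goroutine once the GC finds v unreachable.
		runtime.SetFinalizer(v, func(vv *string) {
			atomic.AddUint32(&fin, 1)
		})
	}

	// Objects with finalizers need an extra GC cycle to be fully reclaimed:
	// one GC queues the finalizers, a later one frees the memory.
	for i := 0; i < 5; i++ {
		runtime.GC()
	}
	fmt.Println("finalized objects:", atomic.LoadUint32(&fin))
}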

TestPoolStress starts 10 goroutines that repeatedly Put objects into a Pool and then Get them back, checking whether anything goes wrong.

TestPoolDequeue and TestPoolChain both call the helper testPoolDequeue, which does the real work. It takes a PoolDequeue interface:

// poolDequeue testing.
type PoolDequeue interface {
	PushHead(val interface{}) bool
	PopHead() (interface{}, bool)
	PopTail() (interface{}, bool)
}

PoolDequeue is a double-ended queue: elements are pushed at the head and popped from both the head and the tail. One caller passes in NewPoolDequeue(16), the other NewPoolChain(); both are backed by poolDequeue. Here is what testPoolDequeue does:

There are ten goroutines in total: one producer and nine consumers. The producer keeps pushing elements onto the head of the queue and pops one from the head every 10 pushes; the consumers keep taking elements from the tail. Every element, whether taken from the head or the tail, is recorded in a map, and the test checks that each element was taken exactly once.

All that’s left is the Benchmark test. The first BenchmarkPool is simple, just keep putting/getting to test performance.

The BenchmarkPoolSTW function turns off GC, Puts a large number of objects into the pool, then forces a GC and records the pause time; the pauses are sorted to report the P50 and P95 STW times. This function is worth keeping in a personal snippet library:

func BenchmarkPoolSTW(b *testing.B) {
	// Take control of GC.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))

	var mstats runtime.MemStats
	var pauses []uint64

	var p Pool
	for i := 0; i < b.N; i++ {
		// Put a large number of items into a pool.
		const N = 100000
		var item interface{} = 42
		for i := 0; i < N; i++ {
			p.Put(item)
		}
		// Do a GC.
		runtime.GC()
		// Record pause time.
		runtime.ReadMemStats(&mstats)
		pauses = append(pauses, mstats.PauseNs[(mstats.NumGC+255)%256])
	}

	// Get pause time stats.
	sort.Slice(pauses, func(i, j int) bool { return pauses[i] < pauses[j] })
	var total uint64
	for _, ns := range pauses {
		total += ns
	}
	// ns/op for this benchmark is average STW time.
	b.ReportMetric(float64(total)/float64(b.N), "ns/op")
	b.ReportMetric(float64(pauses[len(pauses)*95/100]), "p95-ns/STW")
	b.ReportMetric(float64(pauses[len(pauses)*50/100]), "p50-ns/STW")
}

I ran it on my Mac:

go test -v -run=none -bench=BenchmarkPoolSTW

Get the output:

goos: darwin
goarch: amd64
pkg: sync
BenchmarkPoolSTW-12    361    3708 ns/op    3583 p50-ns/STW    5008 p95-ns/STW
PASS
ok      sync    1.481s

The last one, BenchmarkPoolExpensiveNew, measures how the Pool performs when New is expensive. It is also worth adding to your own snippet library.

Other examples

The standard library's encoding/json also uses sync.Pool to improve performance, and the gin framework uses it to recycle its Context objects.

Let's look at how gin uses sync.Pool. First it sets the New function:

engine.pool.New = func() interface{} {
	return engine.allocateContext()
}

func (engine *Engine) allocateContext() *Context {
	return &Context{engine: engine, KeysMutex: &sync.RWMutex{}}
}

Use:

// ServeHTTP conforms to the http.Handler interface.
func (engine *Engine) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	c := engine.pool.Get().(*Context)
	c.writermem.reset(w)
	c.Request = req
	c.reset()

	engine.handleHTTPRequest(c)

	engine.pool.Put(c)
}

Call Get to retrieve the cached object, do some reset, execute handleHTTPRequest, and finally Put back to Pool.

In addition, the Echo framework also uses sync.Pool to manage the context, and almost reaches zero heap memory allocation:

It leverages sync pool to reuse memory and achieve zero dynamic memory allocation with no GC overhead.

Source code analysis

The Pool structure

Let's first look at the Pool struct:

type Pool struct {
	noCopy noCopy

    // Local queue for each P. The actual type is [P]poolLocal
	local     unsafe.Pointer // local fixed-size per-P pool, actual type is [P]poolLocal
	// [P]poolLocal size
	localSize uintptr        // size of the local array

	victim     unsafe.Pointer // local from previous cycle
	victimSize uintptr        // size of victims array

	// A custom object creation callback that is called when there are no available objects in the pool
	New func() interface{}
}

Because a Pool must not be copied after first use, the struct contains a noCopy field, and the go vet tool can then detect copies in user code.

noCopy is a static-analysis convention introduced in Go 1.7. It does nothing at run time and is not restricted to the standard library: user code can use it too.

You only need to embed such a zero-size, analysis-only struct for go vet to report when an object is copied after first use.

The implementation is very simple:

// noCopy is embedded in a struct to ensure it is not copied after first use
//
// https://golang.org/issues/8005#issuecomment-190753527
type noCopy struct{}

// Lock and Unlock are no-ops used by the -copylocks static analysis of `go vet`
func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}
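
A minimal sketch of using the same convention in your own code (Tracker is a made-up type; run `go vet` with its default -copylocks check to see the warning):

package main

// noCopy triggers go vet's -copylocks check when a struct embedding it is copied.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// Tracker must not be copied after first use.
type Tracker struct {
	noCopy noCopy
	count  int
}

func main() {
	var t Tracker
	t.count++
	t2 := t // go vet: assignment copies lock value to t2
	_ = t2
}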

The local field holds a pointer to a [P]poolLocal array (technically, a slice), and localSize records its length. Each P accesses the array using its own ID as the index. This design reduces contention and improves performance when multiple goroutines use the same Pool.
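
The indexing itself is plain pointer arithmetic; the indexLocal helper in pool.go looks roughly like this:

// src/sync/pool.go

func indexLocal(l unsafe.Pointer, i int) *poolLocal {
	// l points to the first poolLocal; step forward i elements
	lp := unsafe.Pointer(uintptr(l) + uintptr(i)*unsafe.Sizeof(poolLocal{}))
	return (*poolLocal)(lp)
}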

During each GC cycle, victim and victimSize "take over" local and localSize respectively. The victim mechanism reduces the performance jitter caused by a cold Pool right after GC and makes object allocation smoother.

The victim cache was originally a concept in computer architecture, a technique used by CPU hardware caches; sync.Pool borrows it to reduce GC pressure and improve the hit ratio.

When there are no cached objects in the Pool, the New method is called to generate a New object.

type poolLocal struct {
	poolLocalInternal

	// Pad poolLocal out to a multiple of two cache lines to prevent false sharing.
	// Each cache line is 64 bytes, i.e. 512 bits.
	// Current processors typically have 32 * 1024 / 64 = 512 cache lines.
	pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}

// Local per-P Pool appendix.
type poolLocalInternal struct {
    // P private cache, no need to lock
	private interface{}
	// Shared area: the local P can pushHead/popHead; other Ps can only popTail
	shared  poolChain
}

The pad field mainly prevents false sharing. Dong Da’s “What is CPU cache?” describes it well:

In modern CPUs, the cache is divided into cache lines (cache blocks), usually 64 bytes on x86_64. The cache line is the smallest unit the cache operates on.

Even if the program only wants to read one byte of data in memory, it must load the nearby 63 bytes into the cache at the same time. If the program reads more than 64 bytes, it must load multiple cache lines.

Simply put, without the pad field, the CPU may load the poolLocals at index 0 and index 1 into the same cache line when index 0 is accessed. Modifying only index 0 then invalidates the cache line that also holds index 1, so another core reading index 1 takes a cache miss and has to reload it, which hurts performance. Padding each poolLocal out to whole cache lines lets the entries be loaded independently, so false sharing does not occur.
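
A small sketch to see the padding at work (inner and padded are stand-ins for poolLocalInternal and poolLocal; the exact sizes depend on the platform and are only illustrative):

package main

import (
	"fmt"
	"unsafe"
)

// A stand-in for poolLocalInternal: an interface{} plus a small struct.
type inner struct {
	private interface{}
	shared  [3]uintptr
}

// A stand-in for poolLocal: inner padded out to a multiple of 128 bytes.
type padded struct {
	inner
	pad [128 - unsafe.Sizeof(inner{})%128]byte
}

func main() {
	fmt.Println("inner size: ", unsafe.Sizeof(inner{}))  // e.g. 40 on 64-bit
	fmt.Println("padded size:", unsafe.Sizeof(padded{})) // a multiple of 128
}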

poolChain is the implementation of the double-ended queue:

type poolChain struct {
	// head is pushed to only by the producer, so no lock is needed
	head *poolChainElt

	// tail is popped from by consumers; reads and writes require atomic operations
	tail *poolChainElt
}

type poolChainElt struct {
	poolDequeue

	// Next is written by producer and read by consumer. So it's only going from nil to non-nil
	// The prev is written by consumer and read by producer. So it's only going from non-nil to nil
	next, prev *poolChainElt
}

type poolDequeue struct {
	// The head index is stored in the most-significant bits so
	// that we can atomically add to it and the overflow is
	// harmless.
	// headTail packs a 32-bit head index and a 32-bit tail index; both are used modulo len(vals)-1.
	// tail points to the oldest data in the queue; head points to the next slot to be filled.
	// Slots in the range [tail, head) are owned by consumers.
	headTail uint64

	// vals is a ring buffer of interface{} values; its size must be a power of 2.
	// A slot is empty iff vals[i].typ is nil.
	// A slot becomes empty when tail no longer points to it and vals[i].typ is nil:
	// it is cleared by the consumer and read by the producer.
	vals []eface
}

poolDequeue is implemented as a single-producer, multi-consumer, fixed-size, lock-free (atomic-operation based) ring buffer; the underlying storage is an array, with head and tail indices marking the two ends. The producer can insert and delete at the head, while consumers can only delete from the tail.

headTail encodes both ends of the queue: the head and tail indices are packed into the single headTail variable using its high and low 32 bits.

We use a diagram to describe the Pool structure completely:

The following figure, taken from the article "What are the disadvantages of sync.Pool?", makes the double-ended queue easier to understand:

We can see that poolDequeue is not used by the Pool directly, because its size is fixed while the Pool has no size limit. It is therefore wrapped into poolChain, a doubly linked list of poolChainElt nodes that can grow dynamically.

Get

Let's go straight to the source code:

func (p *Pool) Get() interface{} {
	// ...
	l, pid := p.pin()
	x := l.private
	l.private = nil
	if x == nil {
		x, _ = l.shared.popHead()
		if x == nil {
			x = p.getSlow(pid)
		}
	}
	runtime_procUnpin()
	// ...
	if x == nil && p.New != nil {
		x = p.New()
	}
	return x
}

The elided parts are related to the race detector; they are noise when reading the source, so they are omitted here. The overall flow of Get is then very clear:

  1. First, call p.pin() to bind the current goroutine to P, prevent preemption, and return the poolLocal and PID corresponding to the current P.

  2. Then l.private is taken and assigned to x, and l.private is set to nil.

  3. If x is nil, try to pop an object from the head of l.shared and assign it to x.

  4. If x is still empty, getSlow is called to try to “steal” an object from the end of the shared queue of other P’s.

  5. When the Pool operation is complete, call runtime_procUnpin() to remove non-preemption.

  6. If no cached object is retrieved, the New function is called to create one.

I use a flow chart to illustrate the process:

With the overall flow sorted out, let’s take a look at some of the key functions.

pin

Pool.pin():

// src/sync/pool.go

// pin pins the current goroutine to P and disables preemption.
// The caller must call runtime_procUnpin() when done, to re-enable preemption.
func (p *Pool) pin() (*poolLocal, int) {
	pid := runtime_procPin()
	s := atomic.LoadUintptr(&p.localSize) // load-acquire
	l := p.local                          // load-consume
	// The number of Ps may change at run time (GOMAXPROCS), so pid must be checked against the current size
	if uintptr(pid) < s {
		return indexLocal(l, pid), pid
	}
	return p.pinSlow()
}

pin binds the current goroutine to P and prevents preemption, returning the poolLocal and the ID of P.

If G were preempted, its state would change from running to runnable and it would be put back into P's local run queue or the global queue, waiting to be scheduled again; the next time it runs it may not be paired with the same P. Since the pid is used afterwards, preemption could mean that the pid no longer matches the P the goroutine is actually running on.

The task of “binding” is finally given to procPin:

// src/runtime/proc.go

func procPin() int {
	_g_ := getg()
	mp := _g_.m

	mp.locks++
	return int(mp.p.ptr().id)
}

The code is simple: the "binding" is completed by incrementing the locks field of the M the current goroutine is running on. For the details of how pin works, see the article "golang sync.Pool source reading" in the references, which explains why a goroutine cannot be preempted after procPin and why GC will not clean up the objects in the Pool during that window.

Back in pin: if the current pid is less than p.localSize, the poolLocal at index pid is returned directly. Otherwise the Pool's poolLocal array has not been created yet (or the number of Ps has grown), and p.pinSlow() is called to create it.

func (p *Pool) pinSlow() (*poolLocal, int) {
	// Retry under the mutex.
	// Can not lock the mutex while pinned.
	runtime_procUnpin()
	allPoolsMu.Lock()
	defer allPoolsMu.Unlock()
	pid := runtime_procPin()
	// poolCleanup won't be called while we are pinned.
	// Atomic operations are not used because global locks have been added
	s := p.localSize
	l := p.local
	// pinSlow may have been run by another thread in the meantime, so pid must be checked again; if pid is within p.localSize, return the corresponding poolLocal.
	if uintptr(pid) < s {
		return indexLocal(l, pid), pid
	}
	if p.local == nil {
		allPools = append(allPools, p)
	}
	// If GOMAXPROCS changes between GCs, we re-allocate the array and lose the old one.
	// The current number of P
	size := runtime.GOMAXPROCS(0)
	local := make([]poolLocal, size)
	// Old local will be recycled
	atomic.StorePointer(&p.local, unsafe.Pointer(&local[0])) // store-release
	atomic.StoreUintptr(&p.localSize, uintptr(size))         // store-release
	return &local[pid], pid
}

The function is named "slow" because it takes the big lock allPoolsMu: the coarser the lock, the more contention, and the slower things naturally become. Note that the goroutine must unpin itself before taking the lock and pin again afterwards; a big lock is more likely to block, and blocking while still occupying a P would waste resources.

While we were unpinned, pinSlow may have been executed by another thread and p.local may have changed, so the pid must be checked again: if it is still within p.localSize, the corresponding poolLocal is returned.

Otherwise a slice of runtime.GOMAXPROCS(0) poolLocal elements is created with make, and p.local and p.localSize are set with atomic stores.

Finally, return the element at the PID index corresponding to P.local.

As for allPoolsMu, Cao Da gives an example in "Several lock problems Go systems may run into". A third-party library, fasttemplate, has an internal struct Template that contains a sync.Pool field; the engineers created a new Template for every request. As a result, every request tried to Get from an empty Pool, and goroutines ended up blocking on the lock while executing allPools = append(allPools, p), causing performance problems.
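
A minimal sketch of that anti-pattern and its fix (Renderer and the handlers are made-up names for illustration):

package main

import "sync"

type Renderer struct {
	bufPool sync.Pool // each Renderer owns its own Pool
}

// Bad: a brand-new Renderer, and thus an always-empty Pool, per request.
// Every Get misses, and creating each new Pool's poolLocal array goes through
// pinSlow and the global allPoolsMu lock.
func handleBad(req string) {
	r := &Renderer{bufPool: sync.Pool{New: func() interface{} { return make([]byte, 0, 512) }}}
	buf := r.bufPool.Get().([]byte)
	_ = append(buf[:0], req...)
	r.bufPool.Put(buf)
}

// Better: one long-lived Renderer shared by all requests, so the Pool actually caches.
var shared = &Renderer{bufPool: sync.Pool{New: func() interface{} { return make([]byte, 0, 512) }}}

func handleGood(req string) {
	buf := shared.bufPool.Get().([]byte)
	_ = append(buf[:0], req...)
	shared.bufPool.Put(buf)
}

func main() {
	handleBad("hello")
	handleGood("hello")
}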

popHead

Going back to the Get function, let's look at another key function, poolChain.popHead():

func (c *poolChain) popHead(a) (interface{}, bool) {
	d := c.head
	for d != nil {
		if val, ok := d.popHead(); ok {
			return val, ok
		}
		// There may still be unconsumed elements in the
		// previous dequeue, so try backing up.
		d = loadPoolChainElt(&d.prev)
	}
	return nil, false
}
Copy the code

The popHead function is called only by the producer. It first takes the head node, c.head; if the head node is not nil, it tries calling that node's popHead method. Note that these two popHead methods are not the same: one belongs to poolChain and one to poolDequeue. If this is confusing, look back at the Pool structure diagram. Here is poolDequeue.popHead():

// /usr/local/go/src/sync/poolqueue.go

func (d *poolDequeue) popHead(a) (interface{}, bool) {
	var slot *eface
	for {
		ptrs := atomic.LoadUint64(&d.headTail)
		head, tail := d.unpack(ptrs)
		// Check whether the queue is empty
		if tail == head {
			// Queue is empty.
			return nil, false
		}

		// head points one past the most recently pushed slot;
		// decrement head by 1 first to claim the slot, then read its value
		head--
		ptrs2 := d.pack(head, tail)
		if atomic.CompareAndSwapUint64(&d.headTail, ptrs, ptrs2) {
			// We successfully took back slot.
			slot = &d.vals[head&uint32(len(d.vals)-1)]
			break
		}
	}

	// read out val
	val := *(*interface{})(unsafe.Pointer(slot))
	if val == dequeueNil(nil) {
		val = nil
	}
	
	// Reset slot, typ and val are nil
	// Unlike popTail, this does not race with pushHead, so we don't need to be as careful here
	*slot = eface{}
	return val, true
}

This function removes and returns the element at the head of the queue, or returns false if the queue is empty. This queue is what actually caches the Pool's objects.

At the heart of the whole function is an infinite loop, a form of lockless programming commonly used in Go.

First, unpack splits headTail into the head and tail indices. If head and tail are equal, the queue is empty and nil, false is returned.

Otherwise, the head index is moved back by one, i.e. decremented, and pack reassembles head and tail. atomic.CompareAndSwapUint64 then checks whether headTail changed in the meantime: if it did not, that is equivalent to having acquired the lock, and headTail is updated to the new value. The element at the corresponding index of vals is then taken as slot.

Since the length of vals is always a power of two, len(d.vals)-1 has its lowest n bits set to 1, and ANDing it with head keeps just the lowest n bits of head, i.e. head modulo the ring size.
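
A tiny worked example of that masking trick (the numbers are only illustrative):

package main

import "fmt"

func main() {
	const size = 8 // a power of two, like len(d.vals)
	for _, head := range []uint32{0, 5, 8, 13, 16} {
		// head & (size-1) == head % size when size is a power of two
		fmt.Printf("head=%2d  index=%d\n", head, head&(size-1))
	}
}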

The slot is then converted to an interface{} value and compared with dequeueNil, the sentinel used to represent a stored nil; if it matches, the cached value is actually nil.

// /usr/local/go/src/sync/poolqueue.go
// Since nil stands for empty slots, dequeueNil stands for interface{}(nil)
type dequeueNil *struct{}

Finally, before returning val, “zero” slot: *slot = eface{}.

Back in poolChain.popHead: if poolDequeue.popHead retrieved a cached object, it is returned. Otherwise d is redirected to d.prev and the loop keeps trying to fetch a cached object.

getSlow

If that fails, Pool.getSlow() is called to try to steal an object from another P's poolLocal:

func (p *Pool) getSlow(pid int) interface{} {
	// See the comment in pin regarding ordering of the loads.
	size := atomic.LoadUintptr(&p.localSize) // load-acquire
	locals := p.local                        // load-consume
	// Try to steal one element from other procs.
	// Steal objects from other P's
	for i := 0; i < int(size); i++ {
		l := indexLocal(locals, (pid+i+1)%int(size))
		if x, _ := l.shared.popTail(); x != nil {
			return x
		}
	}

	// Try to fetch an object from the victim cache. This occurs after a failed attempt to steal from another P's poolLocal,
	// This makes the object in victim easier to recycle.
	size = atomic.LoadUintptr(&p.victimSize)
	if uintptr(pid) >= size {
		return nil
	}
	locals = p.victim
	l := indexLocal(locals, pid)
	if x := l.private; x != nil {
		l.private = nil
		return x
	}
	for i := 0; i < int(size); i++ {
		l := indexLocal(locals, (pid+i)%int(size))
		if x, _ := l.shared.popTail(); x != nil {
			return x
		}
	}

	// Clear the victim cache. You won't have to look here next time
	atomic.StoreUintptr(&p.victimSize, 0)

	return nil
}

Starting from the poolLocal at index pid+1, getSlow tries shared.popTail() on each P in turn to obtain a cached object. If none is found, it searches the victim cache in the same way.

Finally, if nothing is found there either, victimSize is set to 0 so that subsequent Gets do not bother looking in the victim cache again.

Back at the end of Get: if even this search finds no cached object, the New function is called to create a new one.

popTail

Finally, there is one popTail function left:

func (c *poolChain) popTail(a) (interface{}, bool) {
	d := loadPoolChainElt(&c.tail)
	if d == nil {
		return nil, false
	}

	for {
		d2 := loadPoolChainElt(&d.next)

		if val, ok := d.popTail(); ok {
			return val, ok
		}

		if d2 == nil {
			// The bidirectional list has only one tail node, which is now empty
			return nil, false
		}

		// The queue at the end of the bidirectional list is empty, so move on to the next node.
		// And since the tail node is already "empty", it should be discarded. That way, next time popHead doesn't have to look to see if it has cached objects.
		if atomic.CompareAndSwapPointer((*unsafe.Pointer)(unsafe.Pointer(&c.tail)), unsafe.Pointer(d), unsafe.Pointer(d2)) {
			// get rid of the tail node
			storePoolChainElt(&d2.prev, nil)
		}
		d = d2
	}
}

At the start of each loop iteration, d.next is loaded into d2. d may be transiently empty, but if d2 was already non-nil before the pop and the pop still fails, then d is permanently empty, and only in that case is it safe to drop d from the chain.

Finally, c.tail is updated to d2 so that the next popTail does not have to look at the empty dequeue again, and d2.prev is set to nil so that the next popHead does not have to look at it either.

Let's look at the core of it, poolDequeue.popTail:

// src/sync/poolqueue.go:147

func (d *poolDequeue) popTail(a) (interface{}, bool) {
	var slot *eface
	for {
		ptrs := atomic.LoadUint64(&d.headTail)
		head, tail := d.unpack(ptrs)
		// Check whether the queue is empty
		if tail == head {
			// Queue is empty.
			return nil, false
		}

		// Confirm head and tail, then advance tail by 1. If the CAS succeeds, the slot at tail is ours.
		ptrs2 := d.pack(head, tail+1)
		if atomic.CompareAndSwapUint64(&d.headTail, ptrs, ptrs2) {
			// Success.
			slot = &d.vals[tail&uint32(len(d.vals)-1)]
			break
		}
	}

	// We now own slot.
	val := *(*interface{})(unsafe.Pointer(slot))
	if val == dequeueNil(nil) {
		val = nil
	}

	slot.val = nil
	atomic.StorePointer(&slot.typ, nil)
	// At this point pushHead owns the slot.

	return val, true
}

popTail removes an element from the tail of the queue, returning false if the queue is empty. It may be called by multiple consumers concurrently.

At the heart of the function is an infinite loop, another example of lock-free programming. If the head and tail indices are equal, the queue is empty.

Because the element is removed from the tail, the tail index is advanced by 1 and headTail is updated with an atomic compare-and-swap.

Finally, the val and typ fields of the removed slot are zeroed:

slot.val = nil
atomic.StorePointer(&slot.typ, nil)

Put

// src/sync/pool.go

// Put Adds the object to the Pool
func (p *Pool) Put(x interface{}) {
	if x == nil {
		return
	}
	// ...
	l, _ := p.pin()
	if l.private == nil {
		l.private = x
		x = nil
	}
	if x != nil {
		l.shared.pushHead(x)
	}
	runtime_procUnpin()
	// ...
}

The race-related code is again omitted, which makes it much cleaner. The logic of Put is also clear:

  1. Bind g and P first, and then try to assign x to the private field.

  2. If private is already occupied, pushHead is called to push x onto the double-ended queue maintained by the shared field.

Also use a flow chart to show the whole process:

pushHead

Let's look at poolChain.pushHead:

// src/sync/poolqueue.go

func (c *poolChain) pushHead(val interface{}) {
	d := c.head
	if d == nil {
		// poolDequeue has an initial length of 8
		const initSize = 8 // Must be a power of 2
		d = new(poolChainElt)
		d.vals = make([]eface, initSize)
		c.head = d
		storePoolChainElt(&c.tail, d)
	}

	if d.pushHead(val) {
		return
	}

    // Double the length of the previous poolDequeue
	newSize := len(d.vals) * 2
	if newSize >= dequeueLimit {
		// Can't make it any bigger.
		newSize = dequeueLimit
	}

    // link the new node to the old head, forming the list
	d2 := &poolChainElt{prev: d}
	d2.vals = make([]eface, newSize)
	c.head = d2
	storePoolChainElt(&d.next, d2)
	d2.pushHead(val)
}

If c.head is nil, a poolChainElt is created as the first (and, for now, the last) node. The double-ended queue each node manages starts with length 8, and every newly created poolChainElt doubles the previous length, up to a maximum of 2^30:

const dequeueBits = 32
const dequeueLimit = (1 << dequeueBits) / 4

poolDequeue.pushHead is then called to try to put the object into the poolDequeue:

// src/sync/poolqueue.go

// Add val to the head of the double-ended queue. If the queue is full, return false. This function can only be called by one producer
func (d *poolDequeue) pushHead(val interface{}) bool {
	ptrs := atomic.LoadUint64(&d.headTail)
	head, tail := d.unpack(ptrs)
	if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {
		// The queue is full
		return false
	}
	slot := &d.vals[head&uint32(len(d.vals)-1)]

	// Check whether the slot is released by popTail
	typ := atomic.LoadPointer(&slot.typ)
	iftyp ! =nil {
		// Another goroutine is still popTail-ing this slot, so the queue is effectively still full
		return false
	}

	// The head slot is free, so we own it.
	if val == nil {
		val = dequeueNil(nil)
	}

	// Store val into the slot; writing the interface value also sets slot.typ, which claims the slot
	*(*interface{})(unsafe.Pointer(slot)) = val

	// head increment by 1
	atomic.AddUint64(&d.headTail, 1<<dequeueBits)
	return true
}

First check whether the queue is full:

if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {
	// Queue is full.
	return false
}

That is, add the length of d.vals to the tail index, keep the lower 32 bits, and compare with head. Since the capacity of d.vals is fixed, head equals tail plus the length exactly when the queue is full, in which case false is returned.

Otherwise the queue is not full, and the head index locates the slot to fill: head & (len(d.vals)-1), i.e. head modulo the ring size.
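
A quick worked check of the fullness test (standalone, with illustrative numbers; dequeueBits is 32 as in the real code):

package main

import "fmt"

func main() {
	const dequeueBits = 32
	const size = 8 // len(d.vals), always a power of two

	full := func(head, tail uint32) bool {
		return (tail+size)&(1<<dequeueBits-1) == head
	}

	fmt.Println(full(3, 3))  // false: head == tail means the queue is empty
	fmt.Println(full(11, 3)) // true: head is exactly one full ring ahead of tail
	fmt.Println(full(10, 3)) // false: one slot still free
}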

// Check if the head slot has been released by popTail.
typ := atomic.LoadPointer(&slot.typ)
if typ != nil {
	// Another goroutine is still cleaning up the tail, so
	// the queue is actually still full.
	// popTail clears val first and typ last; pushHead may reuse the slot only after typ is nil
	return false
}
Copy the code

If the slot's typ is not nil, a popTail on this slot has not finished cleaning it up yet, so the queue is effectively still full and false is returned.

Finally, assign val to slot and increment the head pointer value by 1.

// Store val into the slot
*(*interface{})(unsafe.Pointer(slot)) = val

slot is of type eface. Converting slot to *interface{} lets val be assigned through it as an interface{} value, which fills slot.typ and slot.val with the value's type and data pointers, so neither of them is nil afterwards.
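
For reference, eface mirrors the memory layout of an empty interface; in poolqueue.go it is declared roughly like this:

// eface has the same layout as interface{}: a type pointer and a data pointer.
type eface struct {
	typ, val unsafe.Pointer
}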

pack/unpack

Finally, let's look at pack and unpack, a pair of functions that pack the head and tail indices into a single word and split them apart again.

// src/sync/poolqueue.go

const dequeueBits = 32

func (d *poolDequeue) pack(head, tail uint32) uint64 {
	const mask = 1<<dequeueBits - 1
	return (uint64(head) << dequeueBits) |
		uint64(tail&mask)
}

The lower 32 bits of mask are all ones and the other bits are 0, so tail & mask keeps only the lower 32 bits of tail. After head is shifted left by 32 bits, its lower 32 bits are all zeros. ORing the two ties head and tail together in one uint64.

The corresponding unpacking function:

func (d *poolDequeue) unpack(ptrs uint64) (head, tail uint32) {
	const mask = 1<<dequeueBits - 1
	head = uint32((ptrs >> dequeueBits) & mask)
	tail = uint32(ptrs & mask)
	return
}

To recover head, ptrs is shifted right by 32 bits and ANDed with the mask, keeping the lower 32 bits of the result. tail is even simpler: just AND ptrs with the mask.
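
A quick round-trip check of the two functions (a standalone sketch using the same bit layout):

package main

import "fmt"

const dequeueBits = 32

func pack(head, tail uint32) uint64 {
	const mask = 1<<dequeueBits - 1
	return (uint64(head) << dequeueBits) | uint64(tail&mask)
}

func unpack(ptrs uint64) (head, tail uint32) {
	const mask = 1<<dequeueBits - 1
	head = uint32((ptrs >> dequeueBits) & mask)
	tail = uint32(ptrs & mask)
	return
}

func main() {
	h, t := unpack(pack(7, 3))
	fmt.Println(h, t) // 7 3
}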

GC

A Pool cannot be allowed to grow indefinitely; otherwise the cached objects would consume too much memory and eventually exhaust it.

With almost all pooling technologies, some cached objects are emptied or purged at some point, so when do unused objects get cleaned up in Go?

The answer is when GC occurs.

The init function of pool.go registers the function that cleans up pools when GC occurs:

// src/sync/pool.go

func init() {
	runtime_registerPoolCleanup(poolCleanup)
}

On the runtime side this is wired up with a go:linkname directive:

// src/runtime/mgc.go

// Hooks for other packages

var poolcleanup func()

// Registers the cleanup in the sync package with the runtime using the compiler flag
//go:linkname sync_runtime_registerPoolCleanup sync.runtime_registerPoolCleanup
func sync_runtime_registerPoolCleanup(f func()) {
	poolcleanup = f
}

Here is the cleanup itself:

func poolCleanup() {
	for _, p := range oldPools {
		p.victim = nil
		p.victimSize = 0
	}

	// Move primary cache to victim cache.
	for _, p := range allPools {
		p.victim = p.local
		p.victimSize = p.localSize
		p.local = nil
		p.localSize = 0
	}

	oldPools, allPools = allPools, nil
}

poolCleanup is called during the stop-the-world (STW) phase, which is why it can be this simple. local is moved into victim, so a GC does not wipe every pool completely: the victim cache acts as a safety net.

If objects are taken from and returned to sync.Pool at a steady rate, no new memory needs to be allocated. If the rate of Gets drops, unused objects are freed within two GC cycles instead of one as before.
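
A small sketch of that lifetime (the behaviour below reflects the Go 1.13+ victim mechanism and assumes the goroutine is not migrated to another P and that nothing else triggers GC in between):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	p := sync.Pool{New: func() interface{} { return "new" }}

	p.Put("cached")
	fmt.Println(p.Get()) // "cached": served straight from the local pool

	p.Put("cached")
	runtime.GC()
	fmt.Println(p.Get()) // "cached": survived one GC in the victim cache

	p.Put("cached")
	runtime.GC()
	runtime.GC()
	fmt.Println(p.Get()) // "new": cleared after two GCs, so New is called
}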

For how sync.Pool was optimized in Go 1.13, see the articles on the 1.13 changes listed in the references.

Let's walk through how oldPools, allPools and p.victim evolve over two GC cycles:

  1. In the initial state, oldPools and allPools are both nil.
  2. On the first Get, since p.local is nil, pinSlow creates p.local and appends p to allPools. Now allPools has length 1 and oldPools is nil.
  3. After the object is used, the first Put returns it to the pool.
  4. During the STW phase of the first GC, every pool in allPools has its p.local assigned to p.victim and its p.local set to nil. allPools is assigned to oldPools, so afterwards allPools is nil and oldPools has length 1.
  5. On the second Get, since p.local is nil, the pool tries to fetch the object from p.victim, and succeeds.
  6. After the object is used, the second Put returns it; but since p.local is nil, p.local is recreated and the object goes back there. Now allPools has length 1 and oldPools has length 1.
  7. During the STW phase of the second GC, the p.victim of every pool in oldPools is set to nil, so the previous cache is collected; every pool in allPools has p.local moved to p.victim and p.local set to nil. Afterwards allPools is nil and oldPools has length 1.

I’ve drawn a diagram based on this process to make it a little clearer:

Note that allPools and oldPools are both slices whose elements are pointers to Pools, and Get and Put operations do not need to go through them. In the last step, allPools will contain multiple elements if other Pools have also performed Put operations.

In implementations prior to Go 1.13, poolCleanup was “crude” :

func poolCleanup() {
    for i, p := range allPools {
        allPools[i] = nil
        for i := 0; i < int(p.localSize); i++ {
            l := indexLocal(p.local, i)
            l.private = nil
            for j := range l.shared {
                l.shared[j] = nil
            }
            l.shared = nil
        }
        p.local = nil
        p.localSize = 0
    }
    allPools = []*Pool{}
}

It simply emptied p.local and poolLocal.shared of every pool.

Comparing the two, the new version reclaims cached objects at a coarser granularity than before Go 1.13: the actual reclamation is stretched over a longer timeline, so the GC cost paid per unit of time is reduced.

This clarifies the role of p.victim. It is positioned as a secondary cache: objects are parked there during GC, and if a Get arrives before the next GC it can still fetch them from p.victim; only when the next GC comes are they finally reclaimed.

At the same time, objects taken out of p.victim go back into p.local after use rather than into p.victim, so there is less left in the victim cache for the next GC to clear, which reduces that GC's overhead to some extent. Stretching the cost of one GC over two cycles, and reducing it in the process, is exactly why p.victim was introduced.

The article on the Go 1.13 design of sync.Pool listed in the references also summarizes the design ideas well: lock-free structures, isolation of operations per P, atomic operations instead of locks, behavior isolation via linked lists, and the victim cache to reduce GC overhead. It is very well written and recommended reading.

In addition, for the story of optimizing lock contention in sync.Pool, I recommend reading Rui Dashen's "Optimizing lock contention".

Conclusion

This article first introduced what Pool is and what it is for, then showed how it is used in the standard library, in some third-party libraries, and in the test cases of pool_test, and finally walked through the sync.Pool source code in detail.

To close, here is a summary of the key points about sync.Pool:

  1. The key idea is to reuse objects to avoid repeated creation and destruction. Cache temporarily unused objects and use them directly when needed next time, without having to re-allocate the memory, reuse the memory of objects and reduce the pressure of GC.

  2. sync.Pool is goroutine-safe and very easy to use. After setting the New function, call Get to obtain an object and Put to return it.

  3. sync.Pool is used in standard library packages such as fmt and encoding/json; frameworks such as gin and Echo also use sync.Pool.

  4. Don’t make any assumptions about the objects you Get, and it’s better to “empty” the objects when you return them.

  5. The lifetime of objects in a Pool is affected by GC, which makes it unsuitable as a connection pool: a connection pool needs to manage object lifetimes itself.

  6. Pool size cannot be specified and is only subject to the GC threshold.

  7. ProcPin binds G to P to prevent G from being preempted. During binding, the GC cannot clean up cached objects.

  8. Before the victim mechanism, the maximum cache time for an object in sync.Pool was one GC cycle: when GC started, unreferenced objects were all cleaned up. With the victim mechanism, the maximum cache time is two GC cycles.

  9. Victim Cache was originally a concept in the computer architecture, a technology for CPU hardware to handle caches, and sync.pool was introduced to reduce GC pressure and improve hit ratios.

  10. At the lowest level, sync.Pool uses slices plus a linked list to implement a double-ended queue, and the cached objects are stored in the slices.

References

【Changkun (Ou Shen)'s source analysis】changkun.us/archives/20…

【Go Night Reading】hidevops.io/reading/201…

www.youtube.com/watch?v=jae…

【Source analysis, false sharing】juejin.cn/post/684490…

【golang sync.Pool object pool source reading】zhuanlan.zhihu.com/p/99710992

【Understanding the design and implementation of sync.Pool in Go 1.13】zhuanlan.zhihu.com/p/110140126

【Advantages and disadvantages, with diagrams】cbsheng. Dead simple. IO/posts/golan…

【xiaorui: optimizing lock contention】xiaorui.cc/archives/58…

【The road to performance optimization: custom caches of various sizes】blog.cyeam.com/golang/2017…

【What are the disadvantages of sync.Pool?】mp.weixin.qq.com/s?__biz=MzA…

【Evolution from 1.12 to 1.13】github.com/watermelo/d…

【Dong Zerun: evolution】www.jianshu.com/p/2e0833248…

【noCopy】github.com/golang/go/i…

【Dong Zerun: CPU cache】www.jianshu.com/p/dc4b5562a…

【gomemcache example】docs.kilvn.com/The-Golang-…

【Bird's Nest: Go 1.13 optimization】colobu.com/2019/10/08/…

【A Journey With Go】medium.com/a-journey-w…

【Encapsulating a counting component】www.akshaydeo.com/blog/2017/1…

【False sharing】ifeve.com/falsesharin…