Introduction

The articles in this series are as follows:

  1. Garbage Collection in Golang (Part 1)
  2. Garbage Collection in Golang (Part 2): Go Traces
  3. Garbage Collection in Golang (Part 3): Go Pacing

The garbage collector is responsible for tracking heap memory allocations, freeing allocations that are no longer needed, and keeping allocations that are still in use. How a language implements this mechanism can be complex, but application developers don't need to understand the details in order to build software. Besides, the VMs and runtimes of programming languages are always changing and evolving across releases. What's important for application developers is to maintain a good working model of how the garbage collector in their language behaves, and of how they can support that behavior.

As of version 1.12, Go uses a non-generational, concurrent, tri-color mark-and-sweep collector. If you want to see visually how marking and sweeping work, refer to this article. The implementation of Go's garbage collector changes and evolves with every release, so once the next version ships, any description of implementation details may no longer be accurate.

That said, this article will not cover the actual implementation details. Instead, I'll share some of the collector's observable behavior and explain how to work with that behavior, regardless of the current implementation and future changes. This will make you a better Go developer.

The heap is not a container

I don't think of the heap as a container to store or release values from. It's important to understand that there is no clearly defined boundary in memory that marks off the "heap". Any memory reserved by the application is available for heap allocation. Where any given heap allocation actually lives in virtual or physical memory is irrelevant to our model. Understanding this will help you better understand how the garbage collector works.

Collector behavior

When a collection begins, the collector runs through three phases of work. Two of these phases create Stop The World (STW) latencies, and the other phase creates latencies that slow down the throughput of the application. The three phases are:

  • Mark Setup - STW
  • Marking - Concurrent
  • Mark Termination - STW

Below is a detailed description of each phase.

Mark Setup - STW

When a collection begins, the first thing that must happen is turning on the Write Barrier. The purpose of the write barrier is to allow the collector to maintain data integrity on the heap while the collector and the application Goroutines run concurrently.

To turn the write barrier on, every running application Goroutine must be stopped. This activity is usually very fast, averaging between 10 and 30 microseconds. That is, as long as your application Goroutines are behaving properly.

Note: To better understand the scheduling diagrams that follow, it's best to first read the earlier article on Golang scheduling.
Figure 1.1

Figure 1.1 shows four application Goroutines running before garbage collection begins. For a collection to start, each of these four Goroutines must be stopped, and the only way to do that is for the collector to watch and wait for each Goroutine to make a function call. Function calls guarantee that a Goroutine is at a safe point to be stopped. What happens if one Goroutine doesn't make a function call but the others do?

Figure 1.2

Figure 1.2 shows a real problem. Garbage collection cannot start until the Goroutine on P4 is stopped, but P4 is executing a tight loop doing some math, which could prevent the collection from ever starting.

L1:

```go
func add(numbers []int) int {
    var v int
    for _, n := range numbers {
        v += n
    }
    return v
}
```

L1 shows the code the Goroutine on P4 is executing. Depending on the size of the slice, the Goroutine could run for an unreasonably long time with no opportunity to be stopped. This kind of code can stall a garbage collection from starting. What's worse, the other P's can't service any other Goroutines while the collector waits. It's critically important that Goroutines make function calls within reasonable timeframes.

Note: This is something the Go team plans to address in 1.14 by adding preemptive scheduling techniques to the scheduler.

Marking - Concurrent

Once the write barrier is turned on, the collector begins the Marking phase. The first thing the collector does is take 25% of the available CPU capacity for itself. The collector uses Goroutines to do the collection work, which means it takes the corresponding number of P's and M's away from the application. In a 4-threaded Go program, one entire P is dedicated to collection work.

Figure 1.3

Figure 1.3 shows how the collector took P1 for itself during the collection. Now the collector can begin the Marking phase. Marking consists of marking the values in heap memory that are still in use. It starts by inspecting the stacks of all existing Goroutines to find root pointers into heap memory. Then the collector must traverse the heap memory graph starting from those root pointers. While Marking happens on P1, application work can continue concurrently on P2, P3, and P4. This means the collector has reduced the application's CPU capacity by 25%.

I wish that was the end of the story, but it isn't. What happens if the collection Goroutine on P1 can't finish the Marking work before the in-use heap memory reaches its limit? What if that is because the Goroutines of the other three applications are allocating so heavily that the collector can't finish in time? When that happens, new allocations have to be slowed down, specifically from the Goroutines responsible.

When the collector determines that it needs to slow down allocations, it recruits the application's Goroutines to assist with the Marking work. This is called a Mark Assist. The amount of time any application Goroutine spends in a Mark Assist is proportional to the amount of data it's adding to heap memory. One positive side effect of Mark Assist is that it helps finish the collection faster.

Figure 1.4

Figure 1.4 shows how the application Goroutine that was running on P3 is now performing a Mark Assist to help with the collection. Hopefully the other Goroutines don't need to get involved as well. Applications under heavy allocation pressure will see most of their running Goroutines perform small amounts of Mark Assist during collections.

Mark Termination - STW

Once the Marking work is done, the next phase is Mark Termination. This is when the write barrier is turned off, various cleanup tasks are performed, and the next collection goal is calculated. Goroutines caught in tight loops during the Marking phase can also lengthen the Mark Termination STW latency.

Figure 1.5

Once the collection is finished, every P can be used by the application Goroutines again, and the program goes back to full throttle.

Figure 1.6

Figure 1.6 shows that after the collection is complete, all available P’s are now working on the application.

Sweeping - Concurrent

There is another activity that happens after a collection finishes, called Sweeping. Sweeping reclaims the heap memory occupied by values that were not marked as in-use. It occurs when application Goroutines attempt to allocate new values in heap memory. The latency of Sweeping is added to the cost of performing a heap allocation, and is not tied to any latencies associated with garbage collection itself.

The following is a sample trace from my machine, which has 12 hardware threads available for executing Goroutines.

Figure 1.7

After the collection is complete, the program is back in full swing. You can see many rose-colored vertical lines underneath these Goroutines.

Figure 1.8

Figure 1.9

The stack trace for one of these rose-colored lines shows the Sweeping work being performed during an allocation; the nextFree call represents the Sweeping activity:

```
runtime.mallocgc
runtime.(*mcache).nextFree
```

The collection behavior described above only happens once a collection has started and is running. The GC Percentage configuration option plays a big role in determining when a collection starts.

GC Percentage

The runtime has a configuration option called GC Percentage, which is set to 100 by default. This value represents how much new heap memory can be allocated before the next collection has to start. Setting GC Percentage to 100 means the next collection must start at or before 100% more new heap memory is allocated, relative to the amount of heap memory marked as live after the previous collection finished.

As an example, imagine that a collection has just completed and 2MB of heap memory is marked as in-use.

Note: The heap memory in these diagrams is not representative of reality. Heap memory in Go is usually fragmented and messy, without the clean separation the diagrams imply. These diagrams provide an easy-to-comprehend visual model of heap memory.

Figure 1.10

Figure 1.11

L2

```
$ GODEBUG=gctrace=1 ./app

gc 1405 @6.068s 11%: 0.058+1.2+0.083 ms clock, 0.70+2.5/1.5/0+0.99 ms cpu, 7->11->6 MB, 10 MB goal, 12 P
gc 1406 @6.070s 11%: 0.051+1.8+0.076 ms clock, 0.61+2.0/2.5/0+0.91 ms cpu, 8->11->6 MB, 13 MB goal, 12 P
gc 1407 @6.073s 11%: 0.052+1.8+0.20 ms clock, 0.62+1.5/2.2/0+2.4 ms cpu, 8->14->8 MB, 13 MB goal, 12 P
```

L2 shows how to use the GODEBUG variable to make the program generate GC traces. An annotated breakdown of one of those traces is shown in L3 below.

L3

```
gc 1405 @6.068s 11%: 0.058+1.2+0.083 ms clock, 0.70+2.5/1.5/0+0.99 ms cpu, 7->11->6 MB, 10 MB goal, 12 P

// General
gc 1405     : The 1405th GC run since the program started
@6.068s     : Six seconds since the program started
11%         : Eleven percent of the available CPU so far has been spent in GC

// Wall-Clock
0.058ms     : STW        : Mark Start       - Write Barrier on
1.2ms       : Concurrent : Marking
0.083ms     : STW        : Mark Termination - Write Barrier off and clean up

// CPU Time
0.70ms      : STW        : Mark Start
2.5ms       : Concurrent : Mark - Assist Time (GC performed in line with allocation)
1.5ms       : Concurrent : Mark - Background GC time
0ms         : Concurrent : Mark - Idle GC time
0.99ms      : STW        : Mark Termination

// Memory
7MB         : Heap memory in-use before the Marking started
11MB        : Heap memory in-use after the Marking finished
6MB         : Heap memory marked as live after the Marking finished
10MB        : Collection goal for heap memory in-use after Marking finished

// Threads
12P         : Number of logical processors or threads used to run Goroutines
```

L3 shows each value from the trace line and what it means. I'll get to most of these values eventually, but for now just focus on the memory section of the gc 1405 trace.

Figure 1.12

L4

```
// Memory
7MB         : Heap memory in-use before the Marking started
11MB        : Heap memory in-use after the Marking finished
6MB         : Heap memory marked as live after the Marking finished
10MB        : Collection goal for heap memory in-use after Marking finished
```

That GC trace line tells you the following: the amount of in-use heap memory was 7MB before the Marking work started, and 11MB once it finished. That means an additional 4MB of allocations happened during the collection. The amount of heap memory marked as live after the Marking work finished was 6MB. This means in-use heap memory can grow to 12MB (the 6MB live heap plus 100% of it) before the next collection must start.

You can see that the collector missed its goal by 1MB: in-use heap memory after the Marking work finished was 11MB, not 10MB. That's ok, because the goal is calculated based on the amount of heap memory currently in use, the amount marked as live, and the additional allocations expected while the collection runs. In this case, the application did something that required more heap memory to be in use after the Marking than expected.

If you look at the next GC trace line (gc 1406), you'll see how things changed within 2ms.

Figure 1.13

L5

```
gc 1406 @6.070s 11%: 0.051+1.8+0.076 ms clock, 0.61+2.0/2.5/0+0.91 ms cpu, 8->11->6 MB, 13 MB goal, 12 P

// Memory
8MB         : Heap memory in-use before the Marking started
11MB        : Heap memory in-use after the Marking finished
6MB         : Heap memory marked as live after the Marking finished
13MB        : Collection goal for heap memory in-use after Marking finished
```

L5 shows that this collection started 2ms after the start of the previous one (6.068s vs 6.070s), even though in-use heap memory had only reached 8MB of the 12MB that was allowed. Note that if the collector decides it's better to start a collection earlier, it will. In this case, it probably started earlier because the application was under heavy allocation pressure and the collector wanted to reduce the amount of Mark Assist latency during this collection.

Two more things to note: this time the collector stayed within its goal, since in-use heap memory after the Marking finished was 11MB, 2MB below the 13MB goal. The amount of heap memory marked as live after the Marking was again 6MB.

In addition, you can get more details about each collection by adding the gcpacertrace=1 flag. This causes the collector to print information about the internal state of its concurrent pacer.

L6

```
$ export GODEBUG=gctrace=1,gcpacertrace=1
$ ./app

Sample output:
gc 5 @0.071s 0%: 0.018+0.46+0.071 ms clock, 0.14+0/0.38/0.14+0.56 ms cpu, 29->29->29 MB, 30 MB goal, 8 P

pacer: sweep done at heap size 29MB; allocated 0MB of spans; swept 3752 pages at +6.183550e-004 pages/byte
pacer: assist ratio=+1.232155e+000 (scan 1 MB in 70->71 MB) workers=2+0
pacer: H_m_prev=30488736 h_t=+2.334071e-001 H_T=37605024 h_a=+1.409842e+000 H_a=73473040 h_g=+1.000000e+000 H_g=60977472 u_a=+2.500000e-001 u_g=+2.500000e-001 W_a=308200 goalΔ=+7.665929e-001 actualΔ=+1.176435e+000 u_a/u_g=+1.000000e+000
```

Running GC Trace can tell you a lot about the health of your application and the speed of the collector.

Pacing

The collector has a pacing algorithm that determines when a collection should start. The algorithm depends on a feedback loop the collector uses to gather information about the running application and the stress the application is putting on the heap. Stress can be defined as how fast the application is allocating heap memory within a given amount of time. It's that stress that determines the pace at which the collector needs to run.

Before the collector starts a collection, it calculates the amount of time it believes the collection will take to finish. Once a collection is running, it induces latencies on the running application that slow it down. Every collection adds to the overall latency of the application.

There is a misconception that slowing down the pace of the collector is a way to improve performance. The idea is that if you can delay the next collection, you delay the latency it will create. But improving performance is not about slowing down the collector's pace.

You could decide to change the GC Percentage value to something larger than 100. This would increase the amount of heap memory that can be allocated before the next collection has to start, which would slow down the pace of collections. But don't consider doing this.

Figure 1.14

Trying to directly affect the pace of collections does not improve the collector's performance. What matters is getting more work done between and during each collection, and you can affect that by reducing the amount of allocations any piece of work is adding to heap memory.

Note: The idea is also to achieve the throughput you need with the smallest possible heap. Remember, minimizing heap memory usage is important when running in cloud environments.

Figure 1.15

Take a look at the average pace of collections between the two versions (2.08ms vs 1.96ms). They're almost the same, around 2ms. What changed is the amount of work getting done between collections: the number of requests processed per collection went from 3.98 to 7.13. That's 79.1% more work at essentially the same pace. As you can see, the collections did not slow down as allocations were reduced; they maintained their pace. The win came from getting more work done between each collection.

Adjusting the pace of the collector to delay its latency costs is not how you improve your application's performance. What works is reducing the amount of time the collector needs to run, which in turn reduces the latency costs being inflicted. The latency costs of the collector have been explained already, but here is a quick summary.

Collector delay cost

There are two types of latency every collection inflicts. The first is the stealing of CPU capacity, which means your application is not running at full CPU capacity during a collection. The application Goroutines are now sharing P's with the collector's Goroutines, or helping with the collection via Mark Assist.

Figure 1.16

Figure 1.17

Note: Marking usually requires 4 CPU-milliseconds per MB of live heap (for example, to estimate how many milliseconds the Marking phase will run, take the live heap size in MB and divide it by 0.25 times the number of CPUs). Marking actually runs at about 1MB/ms, but only has 1/4 of the CPUs available to do it.

The second type of latency is the STW pauses that occur during the collection. STW means no application Goroutines are doing any of their work. The application is essentially stopped.

Figure 1.18

Reduce GC latency

One way to reduce GC latencies is to identify unnecessary allocations in your application and remove them. Doing so helps the collector in several ways.

It helps the collector:

  • Maintain the smallest heap possible
  • Find an optimal, consistent pace
  • Keep every collection within its goal
  • Minimize the duration of collections, STW, and Mark Assist

All of these things help reduce the latency the collector inflicts on your running application, which increases its throughput and performance. Changing the pace of collections won't help. What helps is making better engineering decisions that reduce the allocation stress on heap memory.

Understand the workload your application is running

When it comes to performance, you also need to know the type of workload you have. Understanding your workload means making sure you're using a reasonable number of Goroutines for the work you need to do. CPU-bound and IO-bound workloads are different and require different engineering decisions. You can refer to this article for more.

Conclusion

If you take the time to focus on reducing allocations, you'll get what performance wins are available. But you can't write a program that allocates nothing, so it's important to recognize the difference between allocations that are productive and those that are not. Then you can put your faith and trust in the garbage collector to keep the heap healthy and stable, and to keep your program running consistently.

Having a garbage collector is a nice trade-off: you pay a small cost for garbage collection so you carry no burden of memory management. The Go garbage collector makes programmers more efficient and productive while still allowing you to write programs that are fast enough.

Original link: www.ardanlabs.com/blog/2018/1…