(eight)JVM into god path of GC partition: G1, ZGC, ShenandoahGC high performance collector in-depth analysis

The introduction

In GC Generation, we covered generational GC collectors in the JVM. In this chapter, we focus on the high-performance collectors developed in subsequent releases of the JDK. But before we begin, let’s take a look at the changes to the GC architecture in the JDK release record:

September 2018: JDK11 is released and introducedEpsilonGarbage collector, also known as”No-0p(No operation) “collector. At the same time, a scalable low-latency garbage collector was introducedZGC(Experimental).
March 2019: JDK12 released to enhance G1 collector to automatically return unused heap memory to the operating system. At the same time, a low pause time collector was introducedShenandoahGC(Experimental).
September 2019: JDK13 released to enhance the ZGC collector and automatically return unused heap memory to the operating system.
March 2020: JDK14 is released, eliminating the CMS collector and extending itZGCinmacOSandWindowsOn the application, enhancementG1supportNUMAArchitecture.

From the JDK release log above, you can see three new GC collectors:Epsilon, ZGC, ShenandoahGCSince then, the JVM “GC family” has officially assembled ten collectors, as shown below:

Among themEpsilonIt is mainly an operation-free collector for testing, such as performance testing, memory stress testing, VM interface testing, etc. Choosing to load the Epsilon collector at program startup helps us filter out performance artifacts caused by the GC mechanism. JVMS equipped with this GC collector do not perform any GC-related operations during runtime, and Java programs will exit for OOM as soon as their allocated heap space runs out.

Because the G1 collector is a milestone that enabled whole heap collection, it is not described in the generation section, which will start from the G1 collector and conduct a comprehensive analysis of the ZGC and ShenandoahGC in turn.

1. Milestone G1, initiating a new era of GC “partition reclamation”

As discussed in Generation, the CMS collector is a milestone in the creation of concurrent collections in the JVM, while G1, the main character of this time, is a milestone in the new era of GC partition reclamation. While prior to G1, all collectors followed the step-by-step principle of “physical + logical generation” collection, the G1 collector began to formally partition the heap into a “physical partition without generation” memory structure, thus beginning the JVM partition reclamation.

G1, full name garbage-first Garbage Collector, was introduced to Java in JDK1.7, after which we could assemble it with the -xx :+UseG1GC parameter. G1 is a collector specifically designed for machines with multi-core processors and large memory, while meeting the latency control of GC response time, but also to maximize program throughput. It is intended to replace the CMS collector with G1, and to have G1 assume the responsibility and expectation of being the first fully functional garbage collector in the JVM that can drive everything.

The G1 collector has the following features:

① Like CMS collector, can be executed at the same time with user thread, complete concurrent collection.
(2) During GC, there will be a process of defragmentation, no memory fragmentation, and defragmentation of free memory faster.
③ When GC occurs, the pause time is controllable, which allows the program to pursue lower latency to a greater extent.
(4) While pursuing low latency, high throughput will be guaranteed as far as possible.
⑤ Unused memory of the heap can be returned to the operating system.

Let’s start with a thorough analysis of the COLLECTOR, starting with the G1’s memory partition.

1.1, G1 collector heap space memory partition

Before the G1 collector, Java’s heap space looked something like this:

In JDK1.9, the heap space slowly began to change dramatically. Until then, the layout of the heap space was generational storage, both logically and physically. But by Java9, because the default GC was changed to G1, the memory regions in the heap were divided upRegionArea.

Each partition can be a young generation or an old generation, but can only belong to one generation at a time. At run time, each partition is marked with a unique partition id. In the G1 collector, however, the concepts of young Eden, Survivor, and Old Old are still present, but become logical concepts. The advantage of this is that the logic of the previous generation framework can be reused, while also satisfying the ephemeral nature of Java objects.

The G1 collector is no longer physically isolated, although it is logically divided into generations. This means that the physical memory space is divided into regions. The JVM no longer needs to allocate contiguous memory to the heap space, which can be discontiguous physical memory to form a collection of regions. As a bonus, G1 can preferentially collect garbage from regions with a large number of garbage objects and regions with a small number of garbage objects. In this way, it takes less time to collect garbage from regions with a large number of garbage objects. This is why G1 is named garbage Collector. At runtime, the G1 collector changes the heap space to the following structure:

G1 divides the Java heap into independent, equally-sized piecesRegionArea, but inHotSpotThe source ofTARGET_REGION_NUMBERDefines theRegionThe number of zones is limited to2048More than this is actually allowed, but after that, the heap space becomes unmanageable.

Generally, the size of a Region is equal to the total size of the heap space divided by 2048. For example, if the total size of the heap space is 8GB, that is 8192MB/2048=4MB, then the final size of each Region is 4MB. It is also possible to use -xx :G1HeapRegionSize to force the size of each Region, but it is not recommended because the default calculation method is best for managing the heap space. For this logic, the HotSpot of source/share/vm/gc_implementation/g1 / heapRegion CPP file defines the specific implementation, For details, see the internal HeapRegion:: setup_HEAP_REGION_size (size_T initial_HEAP_size, size_T max_HEap_size) function.

By default, the initial proportion of the new generation to the heap memory is 5%. If the heap size is 8GB, the young generation occupies about 400MB of memory, which corresponds to about 200 regions. You can set the initial proportion of the new generation by using -xx :G1NewSizePercent. The JVM keeps adding more regions to the new generation as Java programs run, but the maximum number of new generations does not exceed 60% of the total heap size. This can be adjusted by -xx :G1MaxNewSizePercent. Easy to trigger global GC). The ratio of Eden Region and Survivor Region in the New generation is the same as before, the default is 8:1:1. Assuming that there are 400 regions in the new generation, the ratio of the whole new generation is Eden=320,S0/From=40,S1/To=40.

In G1, the promotion conditions of the aged generation are the same as before. Objects that reach the age threshold will be transferred to the Region of the aged generation. The difference is that large objects are not allowed to enter the aged generation in G1. If an object is determined to be a large object, it is placed directly into the Humongous store.

In G1, the method to determine whether an object is large is as follows: The size of an object exceeds 50% of a common Region. If the size exceeds 50%, the current object is large, and the object is directly added to the Humongous Region. For example, the current size of the heap is 8GB, and each Region is 4MB. If the size of an object exceeds 2MB, it is considered as a large object. If a Humongous Region becomes too large for a Humongous Region, it may be stored across multiple regions.

The significance of Humongous zone: It can avoid some “short-lived” giant objects directly entering the old generation, save the memory space of the old generation, and effectively avoid GC overhead of the old generation due to insufficient space.

When global GC(FullGC) occurs in the heap space, Humongous region is also recycled in addition to the new generation and old generation.

In fact, it is also mentioned in the JVM Into God Path chapter 4: JVM runtime memory partition: The logical generation + physical partition structure is the best solution for heap space, so G1 is the ideal structure, but the subsequent ZGC and ShenandoahGC collectors, due to implementation reasons, did not adopt this structure.

1.2. G1 collector GC type

There are three GC types in G1: YoungGC, MixedGC and FullGC, which can be triggered in different situations.

1.2.1, YoungGC

As mentioned above, G1 will not allocate all regions in the whole heap space at the beginning. Whether it is the new generation, the surviving Region or the old generation, there will be an initial number at the beginning. During the program operation, the number of regions in each generation Region will be continuously increased according to the demand.

In G1, when the new generation area is used up, G1 will first roughly calculate how long it will take to reclaim the current generation space. If the reclaim time is much less than the value set by -xx :MaxGCPauseMills, Instead of triggering YoungGC, new regions will continue to be added for the new generation to hold newly allocated object instances. YoungGC will be triggered only when the reclamation time is close to the value set by -xx :MaxGCPauseMills after calculation and the Eden space is filled again.

When YoungGC is triggered, the surviving objects in the target Region will first be moved to the surviving Region space (the Region marked with a Survivor from Region). At the same time, objects that reach the promotion age standard will also be moved to the Region of the aged generation for storage.

It is worth noting that the G1 collector uses multithreaded parallel copying of moving objects when YoungGC occurs, in exchange for better GC performance. G1 defaults to 200ms if the user does not explicitly set the expected GC recycle pause value with the -xx :MaxGCPauseMills parameter.

1.2.2, MixedGC

MixedGC translates to hybrid GC, not FullGC. When the old generation in the whole heap area share reaches parameters – XX: InitiatingHeapOccupancyPercent after setting the value of the trigger MixedGC, after this type of GC, All regions of the new generation, some regions of the old generation (according to the expected GC pause time, appropriate regions of the old generation will be recycled first) and Humongous regions of the large object.

In normal cases, G1 garbage collection starts with MixedGC. The replication algorithm is adopted. During GC, the surviving objects in the reclaimed Region are copied to other regions.

1.2.3, FullGC

When the free Region in the whole heap space is insufficient to support copy objects or triggered due to the full metadata space, G1 will stop all user threads in the system first, and then use a single thread to mark, clean and compress the memory. This clears up enough free regions for the next MixedGC. But this process is single-threaded serial collection, so the process is very time consuming (multithreaded parallel collection is used in ShenandoahGC).

In fact, THERE is no FullGC in G1 collector. The FullGC in G1 uses serial old FullGC. Because G1 was designed to avoid FullGC, if either of these GCS fails to return the program to normal execution, it will eventually trigger FullGC for the SerialOld collector.

1.3. G1 collector garbage collection process

The G1 collector typically performs four steps (mainly MixedGC) when GC occurs:

① Initial mark (InitialMark) : Trigger firstSTW, and then use a single GC thread to quickly markGCRootsObjects that are directly connected.
② Concurrent marker (ConcurrentMarking) : Consistent with CMS’s concurrent marking process, multiple GC threads are used to execute with user threads, according toRootThe root node marks all objects.
③ Final marking (Remark) : the re-marking stage of CMS is mainly to correct the mislabeling, mislabeling and missing labeling objects caused by user operations in the concurrent marking stage.
④ Screening and recycling (Cleanup) : First to eachRegionRank the recycling value and cost of the area and find the one with the “highest recycling value”RegionPriority recycling.

The G1 collector is named garbage First because of the filter collection phase. In this phase, after sorting regions, G1 retrieves the Region with the highest value and most consistent with user expectations based on the expected pause time specified by the user (that is, -xx :MaxGCPauseMillis). For example: Suppose there are 800 regions in the tenderly space and they are all full, so GC is triggered at this point. However, according to the expected pause time value of GC, this GC can only allow pause 200ms, and G1 can be roughly inferred after the previous cost calculation: If the pause time of 600 regions can be controlled at around 200ms, GC will be triggered based on “reclaim 600 regions”, so as to ensure that the pause time caused by GC can be controlled within the specified range.

It is important to note, however, that the filter collection phase in the G1 collector stops all user threads and uses multiple threads for parallel collection. In fact, this process can be performed with the user thread to achieve concurrent collection, but because G1 only collects part of itRegion, the pause time is controllable, so the recovery efficiency can be greatly improved after stopping the user thread. At the same time, assuming concurrent collection, some problems caused by user thread execution need to be taken into account. Therefore, in G1, the collection stage adopts occurrenceSTWProgram completion (in subsequentZGC, ShenandoahGCConcurrent collection is implemented in the collector), the G1 collection process is as follows:

In G1, both the new generation and the old generation, the collection algorithm is a copy algorithm, and when GC occurs, it will put oneRegionThe surviving objects in the zone are copied to anotherRegionZone. This approach does not cause memory fragmentation compared to the mark-clearing method used by the previous CMS collector, so there is no additional cost to defragment memory.

However, since G1, including later ZGC and ShenandoahGC collectors, from the perspective of each Region, the replication algorithm is adopted, but from the perspective of the heap space as a whole, the standard-integer algorithm is adopted, which is also called “local replication, global standard-integer”. Neither of the two algorithms will cause memory fragmentation. The advantage is that when allocating memory for large objects, the next GC will not be triggered in advance because the contiguous memory space can not be found, which is conducive to the long-term operation of the program, especially in the case of large memory heap space, which brings significant additional advantages.

Note, however, that CMS performs better than G1 collector in the case of small heap space, with a balancing point6~8GBThe left and right sides.

1.4 Analysis of solutions to the problem of tricolor marking – missing marks in G1

As mentioned in GC Generation, the CMS collector is the beginning of concurrent collection. The core of concurrent collection is the tricolor algorithm, but tricolor algorithm is guaranteed to have the problem of missing labels. So let’s take a look at the G1 collector’s solution to the problem of missing labels: STAB + write barrier.

Object of the specific implementation, speaking, reading and writing barrier is located on the HotSpot source: the HotSpot/SRC/share/vm/oops/oop. Inline. HPP file, its internal is through the inline method (HotSpot virtual machine realization of write barriers has several versions).

STAB solves the problem of missing labels for new allocated objects

STAB, full name snapshot-at-the-beginning, exists to maintain the correctness of G1 collector GC-concurrent collections. The correctness of the GC is to ensure that living objects are not collected, which simply means that garbage is collected.

If it is the exclusive collection, also is the way of serial recovery after STW, when the GC to ensure the accuracy of 100%, but when I collect process is with the user threads execute concurrently, GC thread tag, user threads execute, therefore in the heap object references will change, appear unstable factors, eventually lead to the correctness of the tags cannot be guaranteed. To address this problem, STAB mechanisms have been introduced in the G1 collector.

In STAB, a snapshot of heapspace living objects is generated by RootTracing before GC starts marking. During concurrent marking, all the living objects in the snapshot are considered alive, and newly allocated objects in the marking process are also marked as living objects and will not be reclaimed. In this way, G1 ensures GC correctness for newly allocated objects. Before you understand how STAB works, take a look at the Region structure:

In the heap space divided by G1, each Region contains five Pointers: bottom, Previous TAMS, Next TAMS, Top, and End, as shown in the following figure:

The five Pointers are interpreted as follows:

twoTAMS(top-at-mark-start)Pointer that records the position of two concurrent tokens.
bottomPointer: currentRegionThe starting position of an allocation object.
topPointer:RegionArea The location of the currently used space.
endPointer: currentRegionThe maximum location that an area can be allocated.

If you don’t understand the Pointers, think of the Region as a transparent glass. The bottom pointer points to the bottom of the glass, the top pointer points to the water level in the glass, and the end pointer is the rim/top of the glass, which is the maximum amount of water the glass can hold.

After the concurrent markup occurs, STAB goes as follows:

(1) The NTH GC is triggeredRegionthetopPointer assignment tonextTAMSDuring the concurrent marking period, newly allocated objects are innextTAMS ~ topBetween Pointers, the STAB mechanism ensures that this part of the object is absolutely marked and is alive by default.
② When the “concurrent mark” is about to end, it willnextTAMSPointer assignment topreTAMS, STAB mechanism will be followed bybottom ~ preTAMSGenerates a snapshot file between objects (BitMapStructure), all garbage objects can be identified from the snapshot file.
(3) Repeat the above steps for each subsequent GC.
Two rounds of CONCURRENT GC marking are illustrated in the following diagram:

(1) Stage is the initial marking stage, STW occurs, the targetRegiontheTopAssigned tonextTAMS.
Phase ①~② is the concurrent marking phase, the GC thread and the user thread execute together, and mark all the living objects in the area concurrently.
In stage ②,nextTAMS ~ topBetween the Pointers are newly allocated objects, called “implicit objects” that use one at a timeBitMapBegins recording the address of the object marked during concurrent marking.
③ Stage is the cleaning and recycling stagenextTAMSAssigned topreTAMS(includingBitMapThe tag data in thePrevBitMap) and clean upbottom ~ preTAMSBetween all garbage objects.
“Implicit objects” generated during the concurrent tag phase will be cleared at the next GC collection, such as phase ⑥ in the figure above, which clears new objects generated during the first GC-concurrent tag. This is also the downside of STAB, which creates floating garbage to some extent.

In the end, the G1 collector uses this mechanism to ensure that new objects don’t miss marks during concurrent marking (since new objects are not marked at all, all new objects are alive by default, and whether new objects are alive or dead is left to the next GC).

STAB+ Write barrier solves missing flags in reference to change objects

The problem of missing labels for newly allocated objects has been described previously, but how does the G1 collector solve the problem of missing labels when references to existing objects change during “concurrent labeling”? To quote a fragment from the subgeneration, as follows:

① A user thread breaks an unmarked white object connection during execution and then establishes a reference connection with an already marked black object. The diagram below:

The white object breaks the reference to the gray object on the left and establishes a new reference to the black object on the right.

② A user thread breaks the reference connection between a grey object and an unmarked white object in the process of execution just as the GC thread marks it. Then, when the GC marks the grey object and marks it as black, the white object is re-referenced. The diagram below:

Before GC marks, the white object is disconnected from the gray object, and four seconds later GC marks the gray object, at which point the white object is re-referenced to the black object.

In both cases, the GC thread does not re-mark the white object and its member objects because the “parent” of the re-referenced white object has already been marked black, so the white objects will remain in the white collection. The end result is that those living objects that still have references are “misjudged” as garbage objects to be removed. This situation directly affects the correctness of the application and is unacceptable.

Condition 1: The gray object is disconnected from the white object (either directly or indirectly). Condition 2: An object marked black is re-referenced to a white object. Missing marks can occur only when an object satisfies both conditions. To understand the last simple code example:

Object X = obj.fieldX; // get the obj.fieldx member object
obj.fieldX = null; // Break the reference to obj.fieldx
objA.fieldX = X; // Establish a reference between the X white object and the black object objA
Copy the code

From the point of view of the code above, if obj is a gray object, get its member fieldX and assign it to variable X so that the instance in the heap keeps a reference to variable X. Obj. FieldX = obj; obj = obj; obj = oba; obj = obj;

Gray object obj, white object obj.fieldx /X, black object objA. White object X disconnects from the grey object before GC marks the grey object obj member property, and then “hooks up” with the black object objA. At this point, white object X will remain in the white collection forever until the cleanup phase arrives and is “misjudged” as garbage collected.

In CMS, the means to solve this problem are as follows: post write barrier + incremental update, post write barrier is used to record the object referenced by the change, and then re-scan the changed node through traceability. In G1, STAB+ write before barrier solves this problem as follows:

// HotSpot object field assignment logic
void oop_field_store(oop* field, oop new_value) { 
    *field = new_value; // Assignment: replace old value with new value
} 

// Object field write barrier in HotSpot (easiest to understand implementation version)
void oop_field_store(oop* field, oop new_value) {
    pre_write_barrier(field); // Write the front barrier
    *field = new_value; // Assignment: replace old value with new value
    post_write_barrier(field, value);  // Write the back barrier
} 
Copy the code

The G1 collector uses a pre-write barrier to record the original reference before it is changed, as follows:

void pre_write_barrier(oop* field) {
  // Is in the GC concurrent marking phase and the object has not been marked (accessed)
  if($gc_phase == GC_CONCURRENT_MARK && !isMarkd(field)) { 
      oop old_value = *field; // Get the old value
      remark_set.add(old_value); // Record the original reference object}}Copy the code

The idea in G1 is to preserve the object graph relationship at the beginning of GC, the original snapshot (SATB: Snapshot At The Beginning), The concurrent marking process will access The original object graph relationship. Even if The reference information of an object changes during The concurrent marking process, G1 will record The original object reference relationship through The pre-write barrier and still mark The object graph according to The original Snapshot.

G1’s approach to this problem is very simple, to sum it up: in concurrent marking, I don’t care how your reference relationship changes, I just mark with the original object graph relationship. It is also worth noting that the pre-GC snapshot is a logical concept, but G1 is essentially maintaining this snapshot through the pre-write barrier to achieve the goal of marking against the “raw snapshot”. G1 is done by breaking “Condition one: Grey objects are disconnected from white objects.” The problem of missing marks has been solved.

The source in the process of actual implementation, the G1 is pseudo code logic is convenient for you to understand, only the realization of the real in G1SATBCardTableModRefBS: : the enqueue (oop pre_val) function. In fact, when a reference is changed in G1, the original reference is saved to the SATb_mark_queue, and each thread has its own SATb_mark_queue. In the next concurrent marking phase, objects in satb_mark_queue are processed in turn to ensure that they are alive in the current GC.

STAB ensures that truly alive objects are not mistakenly collected by the GC, but it also causes some garbage objects that should be collected to slip through the GC, resulting in a lot of extra floating garbage (float garbage).

1.5. RSet is used to resolve cross-generation reference in G1

As mentioned in the generation chapter, CMS and most of the previous generation collectors adopt the structure of CardTable to realize memory sets in order to solve the cross-generation reference problem, while G1 implements a new data structure: Remembered Set, called RSet for short, is also known in some places as a two-way card list.

Each Region has an RSet, which records the relationship between objects in other regions and objects in the current Region. That is, who references my object is recorded. This RSet belongs to the points-into structure. Whereas the previous card table is a points-out structure, which records “whose object I refer to”. There are multiple card pages in the card table, and each card page records a certain range (512KB) of heap.

So RSet, or CardTable, is actually an implementation of a memory set, and you can also think of RSet as an advanced implementation of CardTable. In G1, RSet is essentially a HashTable structure. Key is the starting address of other regions that reference objects in the current Region, and Value is a collection of elements containing the addresses of each Region that references objects in the current Region. During the running of the virtual machine, the reference relationships in the RSet are maintained by post-write barriers and Concurrent Refinement Threads.

When YGC occurs, only rsets of the target Cenozoic Region need to be selected as the root set when scanning marked objects, and these Rsets record cross-generation references of Old → Young. In GC scanning, only these Rsets need to be scanned, thus avoiding scanning the whole Old Region. When G1 MixedGC occurs, Old Region also has RSet to record the Old → Old reference relationship, and Old → Young cross-generation reference can be obtained from Young Region, so that the whole heap scan will not occur when MixedGC occurs. So the introduction of RSet greatly reduces the amount of GC work.

In fact, in G1RSetThe memory overhead is also significant, which can be taken up when the JVM is heavily partitioned and running for a long time20%The above.

1.6. Summary of G1 collector

The G1 collector is a milestone in the creation of partitioned collections and is also known as a garbage first collector because G1 maintains a priority list in the background: the CollectionSet (CSet), which records the collection of regions to be collected by the GC. Regions in the collection can be of any age. The name G1 derives from the fact that each GC takes place in a Region with the highest recovery value from the list based on “user-specified expected pause time or default expected pause time”. Such as:

One Region takes 120ms to collect 10MB garbage, while the other Region takes 80ms to collect 20MB garbage. If the time for collecting garbage is limited, G1 preferentially selects the Region with a higher cost performance. This use of regions and prioritized Region collection ensures that the G1 collector is as efficient as possible in a limited amount of time.

Without a doubt, the ability to specify “expected pause times during GC collection” is one of G1’s greatest strengths. Setting different “expected pause times” allows G1 to achieve a good balance between “throughput” and “response time/low latency” in any scenario. The “expected pause time” value must be realistic. If you set it to 20ms, isn’t it ultra-low latency? This is unrealistic and can only be called whimsically, because G1 reclamation also requires STW root scanning, copying objects, and cleaning objects, so this “expected pause time” value also has to be at a certain level. As mentioned in the generation chapter analyzing the PS collector:

Throughput must be sacrificed in pursuit of response time, and response time must be sacrificed in pursuit of throughput. You can’t have your cake and eat it too.

The default G1 “pause time” goal is 200ms, but in general, the actual “pause time” is normal, but if we set the pause time very low, such as the previous 20ms, it is likely to result in:

Because the pause target time is too short, the target Region selected to be collected each time only occupies a small part of the heap memory, and the collector’s collection speed gradually fails to keep up with the allocator’s allocation speed. Finally, garbage accumulates slowly, dragging down the whole program to trigger single-threaded FullGC. Because it is likely that the collector will initially gain some breathing time from free heap memory, the longer the program runs, the heavier the burden on memory, eventually filling up the heap and causing FullGC, which degrades overall performance. So in practice, It is usually reasonable to set the “expected pause time” to between 100ms and 300ms.

1.7. In what scenarios is the G1 collector recommended

① In the heap space50%The above memory will be occupied by the surviving application
② The application of allocation speed and promotion speed is particularly fast
(3) at least8GBThe application of heap memory above
(4) Using the original generational collector, the GC time will be long1s+The application of
⑤ The pursuit of pause time in500msApplication within

A performance warcraft from JDK11 – ZGC

In JDK11, Java once again introduced a new garbage collector, ZGC, which is also a memory layout GC based on the partitioning concept. This GC is a true generation-free collector because it no longer retains the concept of generations, both logically and physically.

The MEMORY structure of the ZGC is actually called paging, derived from the introduction of the standard Huge Page in Linux Kernel 2.6, which has two sizes, 2MB and 1GB. The main reason for the Linux kernel to introduce large pages is to cater to the hardware development, because the development of cloud computing, flexible scheduling and other technologies, server hardware configuration will be more and more high, if the standard page size of 4KB, it will eventually affect performance.

The main focus of ZGC is ultra-low latency and throughput. When implemented, ZGC also implements low latency to limit the garbage collection pause time to less than 10ms under any heap memory size with minimal impact on heap throughput. ZGC originated from the C4(Concurrent Continuously Compacting Collector) and PauselessGC of Azul System. The main purposes of Introducing ZGC into Java are as follows:

① Lay the foundation for future GC features.
② In order to support the super-large heap space (TBLevel), the highest supported16TB.
③ In the worst case, the throughput impact is not reduced by more than 15%.
④ There is no deviation in the pause time generated by GC trigger10ms.

2.1 ZGC heap space memory division

In ZGC, the heap space is also divided into sectionsRegionRegion, but in the ZGCRegionThere is no concept of generation, it is simply a collection of allRegionIt is divided into three grades: large, medium and small, as follows:

Small Area/page (Small) : Fixed size is2MBIs used to assign less than256KBThe object.
Medium Area/page (Medium) : Fixed size is32MB, used to allocate>=256KB ~ <=4MBThe object.
Large Area/page (Large) : There is no fixed size, the capacity can change dynamically, but the size must be2MBInteger multiple of, used specifically for storage>4MBIs a huge object. But it’s worth mentioning: everyLargeThe area can only hold one big object, which means how big your big object is, so thisLargeSo in general,LargeThe area capacity is less thanMediumAnd need to pay attention to:LargeArea space will not be redistributed (more on that later).

Comments in the source code are as follows:

In fact, ZGC in JDK11 is not designed to abandon the concept of generational heap space, because in fact, the essence of the concept of generational heap space was proposed in the first place due to the concept of “most objects die overnight”, and in fact most Java programs conform to this phenomenon at runtime. So logical generation + physical generation is the best structure scheme for heap space. But the question is: why doesn’t ZGC design a generational heap space structure? In fact, the essence of the reason is that generation implementation is very troublesome and complex, so we will first achieve a relatively simple and usable single-generation version, which may be optimized and improved later.

2.1.1 TB memory Source: NUMA architecture

As mentioned earlier, “The purpose of ZGC is to be able to handle terabytes of heap space”, but the problem is that many companies in production environments do not necessarily have terabytes of hard disk, so how can the server have terabytes of memory to allocate Java heap? To understand this point, we have to refer to the NUMA architecture, and of course, the corresponding UMA architecture, as follows:

UMA architecture: The Uniform Memory Access Architecture (UMA) is the Uniform Memory Access Architecture of common computers. One Memory has multiple cpus, and all cpus Access one Memory. Therefore, competition occurs. In order to avoid security problems in the process of competition, the operating system is bound to exist with the concept of lock, and the scene with lock will certainly affect efficiency. In addition, cpus need to access the memory through the bus and the north bridge. As the number of CPU cores increases, the bus and the north bridge become bottlenecks. As a result, the more CPU cores the UMA/SMP uses, the greater the competition and the lower the performance.

The NUMA architecture: The NON-Uniform Memory Access Architecture (NUMA) is a non-uniform Memory Access Architecture. In this Architecture, each CPU has a certain amount of Memory, depending on the location of the CPU’s Memory. The CPU will preferentially access the memory, and each CPU will access the memory nearest to itself, which naturally improves efficiency. But the content is not the key, the key is the NUMA architecture allows multiple machines to form a service supplies the external use, NUMA technology can make many servers as a single system, the architecture has been very popular in the large system, is also a high-performance solution, especially in terms of system delay is very good, therefore, Actually heap space can also be made up of memory from multiple machines. ZGC is a garbage collector that is automatically aware of the NUMA architecture and can take full advantage of the FEATURES of the NUMA architecture.

2.2 ZGC recovery process

The ZGC collector has only three major operations when GC occurs: mark, migrate, and relocate.

Mark: Mark all living objects from the root node.
Move: Move live objects from the area to be reclaimed to the new partition.
Relocation: Changes all Pointers to the pre-migration address to the post-migration address.

In fact, the relocation action is performed during the marking phase, and if it is found that the pointer still refers to the old address, it is corrected to the new address, and then marked. It is worth noting, however, that the first GC does not take place because it has already been marked and only records where the original object was moved to. Relocation, or change to the new reference address, occurs only when the second GC occurs and an object is moved at the beginning of the tag, but the reference remains the same.

A garbage collection process in the ZGC is divided into ten steps: initial marking, concurrent marking, re-marking, and concurrent transfer preparation: [Non-strong reference concurrent mark, reset transfer set, reclaim invalid page (area), select target reclaim page, initialize transfer set (table)], initial transfer, concurrent transfer, where only initial mark, re-mark, initial transfer stage will have a temporary STW, the other stages are executed concurrently.

In the ten steps of ZGC collection process, non-strong reference concurrent mark, reset transfer set, reclaim invalid page (area), select target reclaim page, initialize transfer set (table), these steps are all concurrent and take place in the preparation stage of concurrent transfer, as follows:

① Initial mark

This phase triggers the STW, marking only objects that are root reachable and pushing them onto the tag stack, and other actions such as resetting the TLAB, determining whether soft references should be cleared, and so on.

② Concurrent marking

Multiple GC threads are started based on the root object of the “initial tag” and the object graph is traversed concurrently, as well as counting the number of live objects in each partition/page. Mark stack

③ Mark again

Also can appear short STW at this stage, because the application thread “concurrent tag” stage or in running, so can modify the object reference lead to leakage of mark, so need to mark phase leakage to mark the underlying object (if this stage pause time is too long, ZGC will once again entered the stage of concurrent tags to mark).

④ Non-strong reference concurrent marking and reference concurrent processing

The non-strongly referenced root objects from the previous procedure are iterated through, but not all non-strongly referenced objects can be concurrently marked, and some non-strongly referenced objects that cannot be concurrently marked will be processed in the “re-mark” stage earlier. Non-strongly referenced objects in the heap are also marked.

⑤ Reset the transfer set/table

Reset the data recorded in the transfer table when the last GC occurred for this GC. – in ZGC, because need to survive in the a partition when recycling objects move into another free partition, and the transfer of ZGC is concurrent execution, as a result, a user threads access the heap of an object, the object happened to be transferred, then the user thread according to the original pointer is unable to locate an object, So the concept of moving forward table was introduced in ZGC. – the transfer table can be interpreted as a set of Map structures. When a thread accesses a moved object according to the pointer, if the object has been moved, it will look for the object in the NewAddress according to the record of the transfer table and update the reference of the pointer at the same time. ,newaddress>

⑥ Reclaim invalid partitions/pages

Reclaim invalid virtual memory pages where physical memory has been freed. ZGC is a collector that can return heap memory to the physical machine. Some unused heap space is released when the machine is running out of memory, but the freed pages are not released until the next round of markup is completed, so in this phase, the space freed by the last GC is actually reclaimed.

⑦ Select the partition/page to be reclaimed

The ZGC, like the GC collector, also has a “garbage first” feature. After the marking is complete, there are many partitions in the heap that can be reclaimed, and the ZGC will select the pages with the highest recycling value as the target for the GC collection.

Initialize the transfer table of the set to be transferred

Initialize the migration table of the partition/page to be reclaimed to record the migration information of surviving objects in the partition/page. – Note: There is a transfer table forwardingTable for each page/partition.

⑨ Initial Transfer

STW occurs in this phase, traversing all GCRoots nodes and their directly connected objects. If the traversed object is in the collection of reclaimed partitions, the corresponding space is allocated to the object in the new partition. It is important to note, however, that this phase only transfers the root object (that is, the GCRoots node directly connected object).

⑩ Concurrent transfer

This phase is similar to the previous “concurrent marking” phase, starting from the root object transferred in the previous step, all objects in the target region are iterated and processed concurrently.

In simple terms, the ZGC collection process can be divided into four stages: concurrent marking, concurrent transfer preparation, concurrent transfer, and concurrent remapping/positioning.

ZGC is also a generation-free collector, which means that there is only one type of GC in ZGC, and there is no need for the concept of memory set. Since it is a single-generation heap space, each collection scans all pages, and there is no need to solve the cross-generation reference problem.

2.2.1 Why can’t Large areas be redistributed/transferred?

Since only one object will be stored in Large area, it can be directly decided whether to recycle or not after the marking is completed when GC occurs. The object stored in Large area is not impossible to be transferred to other areas, but unnecessary. There is only one Large object in Large area at present, and another Large area has to be specially prepared to receive the transfer. But essentially it doesn’t matter whether you transfer it or not, but it will add extra space overhead.

2.2.2 The core of ZGC — Colored Pointers technology

ColoredPointers, one of the core techniques of ZGC, before this all GC information is stored in the object header, but the GC information in ZGC is stored in the pointer. At the same time, there is no pointer compression in ZGC, because the pointer has been modified in ZGC, and the dyeing pointer technology has been implemented through the reference pointer in the program. Since the dyeing pointer is used for all 64 bits of the pointer, the pointer can no longer be compressed.

In ZGC, there are four colors (states) used in the color pointer:Marked0, Marked1, Remapped, FinalizableThat is as follows:

For the definition of the pointer in the figure above, the source comment is as follows:

As can be seen from the notes:0~41BitThese 42 bits are normal address records, so the MAXIMUM the ZGC can actually support4TB(Theoretically, the maximum number can be supported16TB) of heap memory, because the maximum address of the 42-bit address is4TB(2^42=4TB). In addition to these 42 address bits, the rest of the bits are defined as follows:

42~45Bit/4 bit: indicates the flag bit
- Finalizable=1000: This bit is related to concurrent reference processing, indicating that this object can only pass throughfinalizerTo access.
- Remapped=0100: Sets the value of this bit to indicate that the object is not pointed toRelocationSet(relocation setIndicates that GC is requiredRegionPartition/page collection).
- Marked1=0010: marks objects to aid GC.
- Marked0=0001: marks objects to aid GC.
46~63Bit/18 bits: Reserved for future use.

But before we look at the implementation, let’s explore a few issues.

Why ZGC only support 4TB (JDK13 extended to 16TB), not there are many bits useless? After analyzing the pointer, ZGC has reserved 18 bits in the pointer, so why not use all of them and make ZGC support super large heap space? In fact, this is related to hardware and OS, because X86_64 architecture hardware only 48 address bus, hardware motherboard address bus is only 48bit wide, four of them are color bits, only 44 bits left, so limited to the current hardware, ZGC can only support 2^44=16TB memory.

② Why are there two Marked cases? This is to prevent tag confusion between different GC cycles, so the two Marked flags are swapped each time a new GC starts. For example, the first GC uses M0, the second GC uses M1, and the third GC uses M0….. , because after ZGC mark all partitions of live objects, will choose to partition recycle, so live objects will not be transferred in part of the area, then the object identification will not reset, will stay on Marked signs of before (M0), if the next GC or using the same M0 to tag object, the confusion of the two objects. To make sure the marks aren’t mixed up, two Marked signs are used interchangeably.

③ Why does the change of pointer marker bit during ZGC operation not affect the memory addressing of objects? This is because the ZGC’s coloring Pointers use a technique called multiple mapping, where multiple virtual addresses point to the same physical address, regardless of the pointer address 0001…. Or 0010… , or 0100… And so on, they all end up at the same physical address.

OK, after a brief analysis of the above problems, let’s take a look at ZGC concurrent processing based on dyeing Pointers:

Before the first GC occurs, all objects in the heap are identified as:Remapped.
After the first GC is triggered, the GC thread starts to flag and begin scanning if the object isRemappedFlag, and the root node of the object is reachableM0Identifier that represents the living object.
If during the marking process, the object identifier scanned is alreadyM0Represents that the object has already been marked or is newly allocated after GC has started, in which case no processing is required.
After GC starts, objects newly created by the user thread are directly identified asM0.
In the marking phase, it is not enough for the GC thread to mark only objects that are directly accessible to the user thread. In fact, it also needs to recursively mark all objects referred to by the object’s member variables.

In general, after the “marker phase” is over, the object is either M0 alive or Remapped waiting to be retrieved. Eventually, all active objects marked as M0 are placed in the Active information table. These objects are processed in the “transition phase” as follows:

The ZGC selects the target reclaim region and begins the concurrent migration.
The GC thread traverses objects in the target region if the object is identified asM0If it exists in an active table, the object is moved to the new partition/page space and its identity is corrected toRemappedMark.
The GC thread scans an object that exists in the active table but is identified asRemapped“, indicating that the object has been transferred and no action is required.
Objects newly created by the user thread during the “transition phase” are identified asRemapped.
If the GC thread traverses an object that is notM0State or not in the active table, no action required.

Finally, when all live objects in the target region are moved to the new partition, the ZGC uniformly reclaims the original selected reclamation region. At this point, one GC ends, and the entire heap space continues to execute normally until the next GC is triggered. When the next round of GC occurs, M1 will be used as the auxiliary identifier of GC, instead of M0. The specific reasons will not be elaborated after the previous analysis.

PS: Several identifiers in Pointers are also called address views in some places. In simple terms, the address view refers to the marker bit of the address pointer. For example, if the marker bit is M0, then the view is M0.

The benefits of dyeing the pointer

① Once a live object in a partition is removed, the partition can be immediately recycled and reused, without waiting for all the objects in the heap to point to itRegionThe references to the fields have been corrected before they can be cleaned up.
② Color Pointers can greatly reduce the number of memory barriers used during GC, and ZGC only uses read barriers.
③ Color pointer has strong scalability. It can be used as an extensible storage structure to record more data related to object marking and relocation process, so as to further improve performance in the future.

2.2.3 what is a ForwardingTable/set?

After transfer table ForwardingTable is ZGC ensure transfer objects, other reference pointer to point to the latest address of a kind of technology, every page/partition exists, is actually the area all the live objects in the transfer of records, a thread by reference to read the object, after the found object be transferred in the table will migrate to query the latest address. At the same time, the data in the transfer table clears the reset on the second GC, including the remapping/relocation operation triggered on the second GC.

2.3 zGC-read barrier to solve the object missing mark

As mentioned earlier, ZGC solves the object missing label problem by means of read barriers, which is equivalent to AOP when reading references. The pseudo-code is as follows:

oop oop_field_load(oop* field) {
    pre_load_barrier(field); // Read barrier - pre-read operation
    return *field;
}

void pre_load_barrier(oop* field, oop old_value) {  
  if($gc_phase == GC_CONCURRENT_MARK && !isMarkd(field)) {
      oop old_value = *field;
      remark_set.add(old_value); // Records the read object}}Copy the code

Read barriers record all member variables as they are read, which is conservative but also safe. As mentioned above, two conditions must be met to cause the problem. The second condition is: “The object marked as black re-establishes a reference relationship with the white object”, and the premise for the black object to re-establish a reference with the white object is: You can record who read the current white object, and then re-mark the black object in “Re-mark”.

By reading barriers and the role of the GC, displaced part of live objects, when read application thread object, you can use to read the signs on the barrier through the pointer to judge whether the object is transferred, if the transfer of object has been read, then the fixed current object reference to the latest address to transfer (check) in the table. The advantage of this is that the next time another thread reads the transfer object, it can normally access the latest value. Of course, this situation is also called in some quarters: the ZGC pointer has the ability to “heal” itself.

2.4. When will ZGC garbage collection be triggered?

There are currently four mechanisms in the ZGC that cause GC to be triggered:

(1) Timing trigger. By default, this parameter is disabledZCollectionIntervalParameter configuration.
(2) Preheat trigger, up to three times, in the heap memory10 percent, 20 percent, 30 percentIs triggered when the GC time is collected and used by other GC mechanisms.
③ Allocation rate, based on normal distribution statistics, calculate memory99.9%The maximum possible allocation rate, and the point at which memory will run out at this rate, triggers GC “Run out time – maximum duration of one GC – time of one GC detection cycle” before running out.
④ Active trigger, default on, can passZProactiveParameter configuration. When the heap memory increases by 10% since the last GC, or exceeds 5 minutes, compare “Interval between last GC” and “49* Maximum duration of one GC”, and trigger if the interval exceeds.

2.5 summary of ZGC

Since ZGC is a dye pointer technology based on 64-bit Pointers, it is doomed that ZGC cannot support 32-bit machines. At the same time, ZGC achieves low latency through multi-phase concurrent execution plus several short STW phases. The biggest problem with ZGC is floating garbage. Assuming that a full ZGC takes eight minutes, during this time a large number of new objects will be created in the heap due to the high allocation rate of new objects. These new objects will not be counted in this GC and will be directly judged as viable objects. Most of the newly allocated objects may become “garbage” during this GC collection, but this “floating garbage” can only wait for the next GC to be collected.

The reason why ZGC does not increase the pause time due to the increase in heap space is that THE STW occurs only in the process of processing the root node. However, no matter how large the heap space is, the number of root nodes in memory does not increase substantially, so the pause time of ZGC is almost not limited by the size of memory. Also, ZGC differs from previous collectors in that it marks Pointers rather than objects, but ultimately the effect is equivalent because all objects and all Pointers are iterated over. Each time a pointer is read from the heap, it passes through the Loaded Value Barrier (LVB) during the marking and shifting phases. The read barrier does different things at different stages. In the marking phase, it “corrects” Pointers to the latest object address value. In the transition phase, the barrier updates the read pointer to the object’s new address and “fixes” the pointer in the heap to the original field. The advantage of this is that even if the object is moved during GC, the read barrier will find and correct the pointer, so that the updated valid pointer will always be held at the user thread level, without the need for stop-the-world, the coarse-grained synchronization between GC and application.

In the end, ZGC is generation-insensitive, which is a weakness, because objects are very ephemeral, and ZGC is generation-insensitive simply because it is difficult to implement.

ShenandoahGC, a performance monster from JDK12

ShenandoahGC collector is a collector based on partition structure, like G1 and ZGC collector. In contrast to the ZGC, the pause times of ShenandoahGC are not affected by heap size. However, unlike the ZGC:

ZGC is implemented based on the colored Pointers for Pointers while ShenandoahGC is implemented based on the Brooks Pointers for forwarding Pointers.

ShenandoahGC’s memory layout is similar to G1’s in that it also divides heap memory into regions of equal size, as well as Humongous regions for large objects. You can think of ShenandoahGC as a modified version of G1 collector, which is more aggressive than G1 collector’s implementation. Blindly pursue extreme low delay. ShenandoahGC, like ZGC, does not implement generational architecture, so there is no new generation or old generation when GC is triggered. There is only one GC type that covers the whole world.

3.1 ShenandoahGC collection process

ShenandoahGCA garbage collection for the heap consists of two STW phases and two concurrent execution phases. After the GARBAGE collection for the heap is triggered, garbage collection for the entire heap space begins as follows:

Initial marking phase: markingGCRootsFor directly reachable objects, STW occurs, but very briefly.
Concurrent marking phase: Working with the user thread, starting from the root object, marking all objects in the heap.
Final marking stage: year-over-yearThe G1, ZGCSTW is triggered during the re-marking phase in. STW is used to correct mislabeling objects caused by the modification of reference relationships by user threads during concurrent marking. STAB mechanism is used to implement this phase. At the same time, in this stage, the area with the largest recycling value will be selected as the target area for recycling.
Concurrent collection phase: Concurrent execution with the user thread, the live objects in the collection area are copied to other unused objectsRegionZone to go, then will be the originalRegionAll areas are cleaned and recycled.

The previous stages of the above process are not very different from G1, but the emphasis is on the final recycle phase, which is executed concurrently with the user thread, so new problems can arise:

Question 1: During reclamation, if an object is copied to a new region, how does the user thread locate the object when accessing it through the original pointer? Question 2: During concurrent reclamation, what if there is a security problem during replication?

Both problems are solved by the ShenandoahGC with BrooksPointers forward Pointers, while the latter builds on them with a read and write barrier. The core implementation of ShenandoahGC is BrooksPointers forward Pointers.

3.2 ShenandoahGC core -Brooks inters forwards Pointers

When an object is copied to a new region, how does the user thread navigate to the latest object address based on the pointer? In the previous ZGC it was possible to get the latest addresses with a dye pointer + read barrier scheme, but in ShenandoahGC another scheme was proposed: BrooksPointers forward Pointers.

A forward pointer is a new field in front of the object header of each object, that is, a root pointer in front of the object header. For unmoved objects, the pointer points to itself, but for moved objects, it points to BrooksPointers forwarding Pointers from the object’s new address, as shown below:

becauseShenandoahGCThe forward pointer technique is used, so when a user thread accesses an object, it needs to find it firstBrooksPointersAnd find the object by the address in the pointer.

Also, if an object is moved, the flow of object access looks like this: find old BrooksPointers, find new BrooksPointers based on the old forward pointer, and find objects based on the new forward pointer. A handle pool mechanism is added, similar to how handles are accessed.

ShenandoahGC uses this technique to solve the access problem of the moved object, but the disadvantage is also obvious, each access requires at least one additional forward.

Consider thread-safety issues due to concurrent reclamation as follows:

① The GC thread is copying old objects to a new region.
② The user thread updates the data of the original object.
③ The GC thread points the forward pointer of the old object to the forward pointer of the new object.

The GC thread had copied the object but had not yet updated the forward pointer to the old object, causing the user operation to fall on the old object and causing a security issue. ShenandoahGC solves this problem by adopting read and write barriers to ensure that steps ① and ③ are atomic.

3.3 ShenandoahGC connection matrix

Resolving cross-zone references in G1 is done byRSetThis memory set approach is implemented while inShenandoahGCIn order to record cross-region object references, a new concept is also proposed: join matrix, which is essentially similar to a two-dimensional array, if theNaRegionThere are objects pointing to the zoneRegionMThen mark N rows and M columns of the matrix:

As shown above,R-3The objectARefer to theR-5The objectBAnd the objectBReference againR-7Objects in an areaCFinally, the locations of these cross-region reference objects will be marked with corresponding marks on the matrix, which can be obtained through this matrix graph when recyclingRegionCross-region references are generated between.

High performance garbage collector summary

Among the three high performance GC, G1 is the only one that still keeps the idea of generation, while ZGC and ShenandoahGC are not implemented because they have better performance without generation, but because they are difficult to implement, as mentioned before: The logical generation + physical partition structure is the best, so the generation-free structure is actually a disadvantage for ZGC and ShenandoahGC, because the generation-free heap space will scan the whole heap every time a GC is triggered.

The G1 collector has been optimized in subsequent JDK versions, since G1 was developed as a full-on default GC, but one of the biggest regrets and weaknesses of the G1 collector is that the collection phase requires STW to occur, resulting in a lot of pauses in programs that use the G1 collector. While ZGC and ShenandoahGC, the former adopts dyeing pointer + read barrier technology to achieve concurrent collection, while the latter also achieves concurrent collection by forwarding pointer + read barrier. As a result, applications using these two collectors trigger GC during runtime with very short pauses, so they are a good choice if your project has very low latency requirements. However, the ZGC promises low latency of up to 10ms, so the worst case scenario can result in a throughput reduction of around 15%, so if you want to use it, be prepared to increase the heap space, as this can only be achieved by increasing the heap space. ShenandoahGC also has two problems because of the additional forwarding pointer: ① When accessing objects, the speed will be slower, because it needs to pass at least one address forwarding. ② More space is needed to store the extra pointer. ShenandoahGC also doesn’t offer the same “maximum 10ms low latency” promise as the ZGC, so at this point ShenandoahGC will perform a bit worse than the ZGC, but the only advantage is that it can support more heap space than the ZGC (though it doesn’t help much).

4.1. Differences between G1, ZGC and ShenandoahGC

Compare the item	G1	ZGC	ShenandoahGC
Whether concurrent reclamation is supported	Does not support	support	support
Maximum heap size	Hundreds of gigabytes can be a long pause	16TB	256TB
The average pause	Within 500 ms	Within 10 ms	About 1 ~ 20 ms
Whether pointer compression is supported	support	Does not support	support

In fact, from the above data, it seems that the G1 collector is not as good as the other two, which is not the case, because each collector has its own application scenarios. For example, in a few hundred MEgabytes of heap space, is ZGC better than G1? Not really. Because G1 has generational logic and ZGC is single-generation, if the allocation rate is fast, ZGC may not be able to keep up (because ZGC’s entire GC process takes a long time), whereas G1 can do perfectly well.

At the same time, since ZGC’s coloring pointer is implemented using 64-bit Pointers, this means that pointer compression is disabled in ZGC, so ZGC takes up more space than other collectors for the same object data under 32GB of heap space.