The purpose of garbage collection

  • The purpose of garbage collection (GC) is to reclaim the memory occupied by garbage objects so that programs can reuse it.

  • Garbage collection algorithms aim to maximize the benefit obtained from each collection.

The requirements for garbage collectors in various scenarios always revolve around three performance metrics: Footprint, Throughput, and Latency.

  • Memory footprint: programs on memory-constrained devices require the garbage collector itself to consume as little memory as possible.

  • Throughput: non-interactive programs, such as scientific computing and data processing services, pursue higher CPU utilization, i.e. higher throughput. These programs require the garbage collector to consume as little CPU time as possible.

  • Latency: interactive programs pursue timely responses, i.e. lower latency. For web services, microservices, and the many transactions that are time-bounded, garbage collection can have a significant impact if it must stop the user's code to run. These programs require the garbage collector to pause them only briefly, or not at all. As networked applications grow, latency-sensitive services keep increasing, so garbage collection keeps moving in this direction.

Together, these three metrics form an "impossible triangle": a garbage collector can never optimize all three at once. Limiting memory leaves the GC little room to work ahead, which inevitably reduces program throughput and increases latency. With a fixed workload, higher program throughput means less CPU for the collector, which inevitably leads to longer collection pauses and greater latency.

Different scenarios place different demands on garbage collectors, but the criterion for whether a collector is working properly is always the same: memory must not be reclaimed more slowly than it is allocated. If this standard is not met, garbage accumulates in memory until an OutOfMemoryError (OOM) occurs. Nor can the GC be content with merely matching the average allocation rate: a burst of traffic in a web application, for example, creates a flood of temporary objects in an instant, and without a suitable garbage collection algorithm memory is exhausted immediately.

What kind of object is garbage?

Objects that need to be reclaimed are objects that will never be used again. If an object is not referenced by anything else, it cannot be used again and is judged to be garbage. Reference counting and reachability analysis are the two ways to determine whether an object is alive.

Reference counting

Reference Counting adds a counter to each object: every time the object is referenced somewhere, the counter is incremented by one; when a reference becomes invalid, the counter is decremented by one. When the counter reaches 0, the object is no longer referenced and can be declared dead.

Although reference counting is simple and intuitive, it suffers from the circular-reference problem and requires considerable extra bookkeeping to work correctly. It has been used by Microsoft's COM, Flash Player, Python, and others, but remains relatively rare.
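
To see the circular-reference problem concretely, here is a minimal sketch (class and field names are made up for illustration): after both local variables are cleared, the two objects still reference each other, so pure reference counting could never reclaim them, while the JVM's reachability analysis can.

public class ReferenceCycle {
    Object instance;

    public static void main(String[] args) {
        ReferenceCycle objA = new ReferenceCycle();
        ReferenceCycle objB = new ReferenceCycle();
        objA.instance = objB; // A -> B
        objB.instance = objA; // B -> A, forming a cycle
        objA = null;          // drop the only external references
        objB = null;
        // Under reference counting each counter is still 1 (the cycle),
        // yet neither object is reachable from any GC Root, so HotSpot
        // can reclaim both here.
        System.gc();
    }
}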

Reachability analysis algorithm

The JVM uses a reachability analysis algorithm to find live objects. The idea is to start from a set of root objects called GC Roots, search downward along reference relationships, and traverse the whole object graph. The path walked during the search is called a "Reference Chain". If an object is unreachable, i.e. it does not appear in the graph, it is proved dead.

Is the method area collected?

The JVM specification does not require garbage collection in the method area, because it is rarely cost-effective there. Method-area garbage collection mainly reclaims discarded constants and types that are no longer used. A discarded constant can be reclaimed once no string value anywhere in the current system equals the value of the string object in the constant pool. A class is judged "no longer used" by stricter conditions:

  • All instances of the class have been reclaimed; no object of the class exists in the Java heap.

  • The class loader that loaded the class has been reclaimed, which means the namespace of the class has been reclaimed.

  • The java.lang.Class instance of the class is not referenced anywhere, which means the class cannot be reached through reflection.

Once these three conditions are met, the virtual machine is allowed (but not obliged) to reclaim such types; HotSpot provides the -Xnoclassgc parameter to disable type unloading. In scenarios that make heavy use of reflection, dynamic proxies, bytecode frameworks such as CGLib, dynamically generated JSPs, and frequently customized class loaders such as OSGi, class unloading usually needs to stay enabled.

Reachability analysis

The process of reachability analysis is the process of marking the surviving objects, i.e. the "Mark" step of the collection algorithms.

Root node enumeration

The process of enumerating the GC Roots is called root node enumeration. For garbage collection to be accurate, root node enumeration must happen in a snapshot that guarantees consistency, so every garbage collector must pause the user threads during this step.

Suspending the user threads makes the user program unable to execute; from the user's point of view it looks as if the program were interrupted, so the pause is called "Stop The World" (STW).

Since Stop The World cannot be avoided here, the only way to reduce the impact of root enumeration is to shorten the pause.

GC Roots objects

Objects that can serve as root nodes are, by definition, objects we still need. In Java, the objects that are always part of the GC Roots include:

  • Objects referenced in the local variable tables of stack frames; variables in use on the current call stack must not be cleared.

  • Objects referenced by static fields of classes in the method area; these cannot be reclaimed by an ordinary GC.

  • Objects referenced by constants in the method area, such as references in the string constant pool (I'm not sure about this one; I'll explore it later).

  • Objects referenced by native methods in the native method stacks, for the same reason as stack frames.

  • References inside the JVM itself: resident objects of the JVM (such as the Class instances of basic types) and other objects that must be kept alive.

  • Objects held as locks by synchronized.

Besides these fixed root objects, garbage collectors may, depending on the algorithm and the collection being performed, add further important objects to the GC Roots.

Ordinary object pointers

Because mainstream JVMs use exact garbage collection (the type and size of every object are known), the JVM has a way to find out directly where object references live when the program is paused. An ordinary object pointer (OOP) is a pointer that records a reference to an object, held in a memory region, a register, or on the stack.

HotSpot uses OopMap to record where object references live during execution: at the end of class loading, HotSpot computes which offsets within an object hold which reference types, and during just-in-time compilation it records which stack slots and registers hold references at specific code locations. The garbage collector can then obtain the locations of all references from the OopMap without an exhaustive scan, which speeds up root node enumeration.

Although OopMap makes root node enumeration fast, it is a preprocessing step that shifts the cost of enumeration into the application's runtime, reducing throughput. Moreover, generating an OopMap for every instruction at which a reference might change would consume a great deal of memory, the very opposite of garbage collection's purpose.
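
As a rough mental model only (hypothetical names; HotSpot's real structure is far richer), an OopMap can be pictured as a map from a safepoint's code location to the stack slots that hold references there:

import java.util.List;
import java.util.Map;

// Toy model of an OopMap: for each safepoint pc, which local-variable
// slots contain object references. Root enumeration reads only those
// slots instead of scanning everything.
class ToyOopMap {
    private final Map<Integer, List<Integer>> refSlotsAtPc;

    ToyOopMap(Map<Integer, List<Integer>> refSlotsAtPc) {
        this.refSlotsAtPc = refSlotsAtPc;
    }

    List<Integer> referenceSlots(int pc) {
        return refSlotsAtPc.getOrDefault(pc, List.of());
    }
}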

Safepoints

Since OopMaps cannot be generated for every reference change, HotSpot records them only at specific locations called safepoints. It does not care how references change between safepoints, only about their final state at each safepoint.

With safepoints established, garbage collection can begin only once the program reaches a safepoint, keeping root node enumeration cheap. Safepoints must therefore be chosen neither so sparsely that the garbage collector waits too long, nor so densely that polling burdens the running program. They are usually placed where the program may execute for a long time, the most common criterion being instruction-sequence reuse: method calls, loop jumps, and exception jumps.

This raises the question of how to get all the other threads to stop at safepoints when a GC is needed. There are two solutions:

  1. Preemptive Suspension: during garbage collection, the system stops all user threads. If a thread is not at a safepoint, it is resumed for a short while and stopped again, until every thread is at a safepoint.

    This approach is intrusive and inefficient, and almost no JVM uses it to bring threads to safepoints.

  2. Active Suspension: the JVM sets a flag, and all threads poll it during execution. Once the flag is true, each thread suspends itself at the nearest safepoint. Polling points coincide with safepoints, plus every site that allocates memory on the Java heap, so that a thread can also check whether garbage collection is needed when memory is about to run out. A minimal sketch of this cooperative scheme follows this list.

    HotSpot implements active suspension with a memory protection trap: at each polling site it executes a test instruction that reads the page at 0x160100. When the JVM needs to start a GC, it makes that page unreadable; the thread executing the test instruction then raises a trap exception, and the exception handler suspends the thread.
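
The sketch below shows the cooperative structure of active suspension using a plain volatile flag (HotSpot itself uses the memory protection trap described above rather than a Java-level flag; all names here are illustrative):

// Threads poll a flag at safepoints; when the JVM requests a pause,
// every thread parks at its next poll until the GC releases it.
class SafepointPolling {
    private static volatile boolean safepointRequested = false;

    static void requestSafepoint() { safepointRequested = true; }   // called before GC
    static void releaseSafepoint() { safepointRequested = false; }  // called after GC

    // Conceptually inserted by the JIT at method calls and loop back edges.
    static void safepointPoll() {
        while (safepointRequested) {
            Thread.onSpinWait(); // a real VM blocks the thread instead of spinning
        }
    }

    static void userLoop() {
        for (long i = 0; i < 1_000_000_000L; i++) {
            // ... user work ...
            safepointPoll(); // poll at the loop back edge
        }
    }
}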

Safe regions

A Safe Region solves the problem that a blocked or sleeping thread cannot walk to a safepoint, because such a thread cannot suspend itself.

A safe region is a stretch of code in which reference relationships are guaranteed not to change, so root node enumeration is safe anywhere inside it; in effect, it is a safepoint stretched over a range. When a thread enters a safe region, it marks itself as being inside one, so root node enumeration need not consider that thread's execution state. The thread may leave the safe region only after receiving the signal that it is allowed to leave.
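
A minimal sketch of the safe-region handshake (illustrative names, simplified synchronization): the thread flags itself before blocking, and on leaving it waits until no root enumeration is in progress:

// A thread inside a safe region promises not to change references,
// so the GC can ignore it; leaving is gated on the GC being finished.
class SafeRegion {
    private static volatile boolean rootEnumerationInProgress = false;
    private static final ThreadLocal<Boolean> inSafeRegion =
            ThreadLocal.withInitial(() -> false);

    static void enter() {
        inSafeRegion.set(true);  // e.g. right before a blocking call
    }

    static void leave() {
        while (rootEnumerationInProgress) {
            Thread.onSpinWait(); // wait for the "you may leave" signal
        }
        inSafeRegion.set(false);
    }
}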

In fact, my personal thought: would it also work to generate an OopMap for a thread at the moment it blocks?

Concurrent marking

OopMap makes root node enumeration short and predictable, so the main cost of marking garbage objects lies in traversing the object graph.

Tri-color marking

Reachability analysis is a graph search, and the marking phase usually uses depth-first search. Tri-color marking is the common way to describe the traversal of the object graph. The idea is to divide objects into three colors:

  • White: objects not yet visited, whether merely not yet reached or genuinely unreachable.

  • Gray: objects that have been visited, but whose member references have not all been traversed yet.

  • Black: objects that have been visited and whose member references have all been traversed.

When reachability analysis finishes, objects still marked white are outside the graph and therefore unreachable.
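
A minimal sketch of tri-color marking over a toy object graph (illustrative types; a real collector works on raw heap words, not Java collections): the gray set is a work stack, and whatever is never reached stays white:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TriColorMarking {
    static class Node { final List<Node> fields = new ArrayList<>(); }

    // Returns the set of reachable (black) objects; all others are white.
    static Set<Node> mark(List<Node> gcRoots) {
        Set<Node> black = new HashSet<>();
        Deque<Node> gray = new ArrayDeque<>(gcRoots); // roots start gray
        while (!gray.isEmpty()) {
            Node n = gray.pop();
            if (!black.add(n)) continue;  // already fully processed
            for (Node referent : n.fields) {
                if (!black.contains(referent)) {
                    gray.push(referent);  // shade the referent gray
                }
            }
        }
        return black;
    }
}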

Consistency snapshot

Like root node enumeration, the marking algorithm requires a snapshot that guarantees consistency: if no references are added or removed while the graph is being traversed, no marking errors can occur, and the simplest way to ensure that is to suspend all user threads. However, marking is a huge task that must traverse the entire object graph, so the Stop The World approach is not acceptable here. The marking algorithm therefore has to run concurrently with the user threads.

Missed marks and multiple marks

With concurrent marking, references are added and removed while marking is in progress, which leads to the problems of missed marks and multiple marks:

  • Missed marks: also called the "object disappearance" problem. A white object that a black object comes to reference during marking stays white, and we mistakenly conclude that a reachable object is unreachable.

  • Multiple marks: during marking, a black object removes its reference to another object. Even if the disconnected object now has no references at all, we continue to treat it as reachable.

Multiple marks merely prevent part of the garbage from being collected this cycle (floating garbage). Missed marks are far more serious: objects still in use may be collected. Wilson proved theoretically in 1994 that an object can be missed only when both of the following conditions hold:

  1. A new reference is inserted from a black object to the white object.

  2. All references from gray objects to that white object are removed.

A live object disappears only when both conditions occur during the same marking pass; if only condition 2 happens, no object can reach the white object anyway, so it really is garbage. To avoid missed marks, then, we only need to break one of the two conditions.

Incremental update and the original snapshot

Since breaking either of the two conditions solves the missed-mark problem, two solutions have emerged: Incremental Update and Snapshot At The Beginning (SATB, the "original snapshot").

  • Incremental update: breaks condition 1. When a black object is made to reference a white object, the black object is recorded and rescanned as gray after concurrent marking ends. Because this is equivalent to re-marking the newly inserted reference as an increment to the graph, it is called incremental update.

  • The original snapshot: breaks condition 2. When a gray object is about to remove its reference to a white object, the reference is recorded, and after concurrent marking ends the white object is added to the GC Roots and rescanned. This is straightforward, but it can leave floating garbage rooted at that white object. It is called the original snapshot because it preserves the object graph as it was when scanning began.

To keep the marking algorithm working on a consistent snapshot, the re-marking pass of both incremental update and the original snapshot must pause the user threads, otherwise consistency can never be guaranteed. But this Stop The World pause is very short and has little impact on the user program.

Incremental update does its extra work when references are added; the original snapshot does its extra work when references are removed. Objects created during concurrent marking are recorded by incremental update like any other insertion, but the original snapshot must treat newly created objects specially. Performing extra operations on both reference removal and reference addition would be too expensive, so the usual solution is to allocate objects created during concurrent marking in a dedicated memory area and treat them as live.

Because references are added far more often than they are removed while a program runs, incremental update costs more during concurrent marking than the original snapshot does, and its re-marking pass, which must also traverse from the newly recorded objects, costs more as well. But because it produces no floating garbage and needs no separate area for new objects, its memory footprint is smaller than the original snapshot's.

Write barriers and read barriers

Read barriers and write barriers are the extra operations a program performs when it reads a reference from the heap or updates one, rather like around-advice wrapped over reference reads and writes. Code executed before a reference update is the pre-write barrier, and code executed after it is the post-write barrier, as in this pseudocode:

void oop_field_store(oop* field, oop new_value) {
    pre_write_barrier(field, new_value);  // pre-write barrier
    *field = new_value;
    post_write_barrier(field, new_value); // post-write barrier
}

GC implements both incremental update and the original snapshot through write barriers. Read barriers are rarely used: reads happen constantly throughout a program, so a read barrier costs throughput heavily and must be weighed very carefully.
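
The sketch below shows where the two schemes hook into the write barrier (illustrative Java rather than VM code; a real collector uses one scheme or the other, and only while concurrent marking is active): the original snapshot records the old referent in the pre-write barrier, and incremental update re-grays the holder in the post-write barrier:

import java.util.ArrayDeque;
import java.util.Deque;

class WriteBarrierSketch {
    static class Obj { Obj field; }

    static final Deque<Obj> satbQueue = new ArrayDeque<>();   // deleted referents (original snapshot)
    static final Deque<Obj> rescanQueue = new ArrayDeque<>(); // re-grayed holders (incremental update)

    static void writeField(Obj holder, Obj newValue) {
        // Pre-write barrier: preserve the graph as of the start of marking.
        if (holder.field != null) {
            satbQueue.push(holder.field);
        }
        holder.field = newValue;
        // Post-write barrier: record the black-to-white insertion for remark.
        if (newValue != null) {
            rescanQueue.push(holder);
        }
    }
}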

Generational collection theory

Generational collection, though called a theory, is really a set of empirical rules that hold for most programs:

  1. Weak Generational Hypothesis: the vast majority of objects are born and die overnight.

  2. Strong Generational Hypothesis: the more garbage collections an object has survived, the harder it is for the object to die.

    Based on the weak and strong generational hypotheses, the memory space is usually divided into two regions: the new generation and the old generation. A garbage collection need not cover all of memory at once: the new generation is collected when the new generation fills up, and the old generation is collected when the old generation fills up, reducing the work of any single collection.

    Newly created objects obviously belong in the new generation, so it is collected far more often than the old generation, and only objects that survive many new-generation collections are promoted to the old generation. Matching the characteristics of each generation, we can choose different algorithms to maximize the benefit of garbage collection.

  3. Intergenerational Reference Hypothesis: cross-generational references are only a tiny minority compared with same-generation references.

    A cross-generational reference is a reference between objects in different generations. Because of it, collecting one generation would seem to require adding every object of the other generation to the GC Roots. For old-generation collections this is tolerable, but new-generation collections are very frequent, so it would burden garbage collection heavily. In fact this problem is not unique to generational collection; every algorithm that performs partial collections faces it, under the name cross-region references.

    The intergenerational reference hypothesis follows from the other two: objects with a reference relationship tend to live and die together. If an old-generation object references a new-generation object, the new-generation object will survive collection after collection and eventually be promoted into the old generation, at which point the reference is no longer cross-generational.

    Therefore, for the very small number of cross-generational references, we need not add the whole old generation to the GC Roots. To keep reachability analysis accurate, it suffices to add just the objects involved in cross-generational references, which means we must record where such objects exist. The data structure that records them is called a Remembered Set, a typical space-for-time optimization; a minimal sketch follows this list.
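
A minimal sketch of the space-for-time idea (object precision, illustrative names): a write barrier records old-generation objects that gain references into the new generation, and a Minor GC treats just those objects as extra roots:

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class RememberedSetSketch {
    // Old-generation objects known to reference new-generation objects.
    static final Set<Object> rememberedSet =
            Collections.newSetFromMap(new IdentityHashMap<>());

    // Called from the write barrier on every reference store.
    static void onReferenceStore(Object holder, Object referent) {
        if (isInOldGen(holder) && isInNewGen(referent)) {
            rememberedSet.add(holder); // record the cross-generational reference
        }
    }

    // Placeholders: a real VM decides this from the object's address.
    static boolean isInOldGen(Object o) { return false; }
    static boolean isInNewGen(Object o) { return false; }
}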

Generational collection

For garbage collectors that adopt generational collection, different regions are collected in different ways, and the collections are named accordingly:

  • Full heap collection (Full GC)

  • Partial GC

    • New generation collection (Minor GC / Young GC)

    • Old generation collection (Major GC / Old GC). Some VMs use "Major GC" to mean full-heap collection, so be careful to distinguish the two.

    • Mixed GC: collects the entire new generation plus part of the old generation.

The essence of generational collection is algorithmic optimization: collect more often where the payoff is higher, and choose a suitable algorithm for each region, which greatly improves GC efficiency. The new/old generation design therefore makes it easy to meet the standard that "memory must not be reclaimed more slowly than it is allocated". Some garbage collectors nonetheless do not use this design today; it is not that the design is bad, but that it is hard to implement.

Region-based memory layout

A region-based heap layout divides the Java heap into many Regions, so that each GC can reclaim just part of the heap.

There are several ways to divide the Regions:

  1. By object size: Regions can be divided into large, medium, and small Regions, storing large, medium, and small objects respectively. Copying large objects is time-consuming, and generational operation normally keeps moving objects between generations, which is very costly for them; dedicated large-object Regions effectively solve this problem.

  2. By generation: Regions can be divided into new-generation Regions and old-generation Regions. This combines the advantages of regions and generations, collecting hot Regions precisely without much impact on the application.

  3. Regions can also all share the same size while objects are still stored into different Regions by size class: large-object Regions, medium-object Regions, and small-object Regions.

Dividing memory by Region creates the problem of cross-region references; here too, remembered sets are used to record them.

Remembered sets and card tables

As mentioned earlier, a Remembered Set is a data structure for recording cross-generational or cross-region references. A remembered set does not necessarily record each referencing object exactly; the recording granularity is up to the virtual machine implementation. Three granularities are common:

  • Word precision: each record is precise to one memory address (one machine word) containing a cross-generational reference.

  • Object precision: each record is precise to one object, indicating that some field of that object holds a cross-generational reference.

  • Card precision: each record is precise to one block of memory, indicating that the block contains objects with cross-generational references. A remembered set at this precision is called a Card Table, and each memory block is called a Card Page.

    In other words, a card table is one implementation of a remembered set. The simplest implementation is a byte array. The card page size is set to a power of 2, so the card that a cross-generational pointer falls on can be computed directly with a shift, for example:

    CARD_TABLE[addr >> 9] = 0;

    With a shift of 9, the card page size is 2^9 = 512 bytes, and a card page's address range can be computed back from the array index. Writing 0 marks the card dirty, meaning the page contains cross-generational references (this matches the HotSpot convention used in the code above, where nonzero means clean). Since one byte can distinguish 256 values, a card table element could even record the number of cross-generational references in its page, as long as it stays below 256.

With generational memory management, the remembered set usually records only references from old-generation objects to new-generation objects, not the reverse. Minor GCs occur far more often than Major GCs, so using the remembered set to speed up new-generation collection is what matters; during a Major GC, simply adding all surviving new-generation objects to the GC Roots for traversal barely affects overall performance. Besides, new-generation objects are born and die quickly and their references change very frequently, so recording new-to-old references would cost a great deal of memory for very little benefit.
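
A minimal sketch of how a Minor GC consumes the card table (toy sizes; the dirty value 0 follows the convention in the snippets above): only dirty card pages are scanned for old-to-new references instead of the whole old generation:

import java.util.Arrays;

class CardScanSketch {
    static final int CARD_SHIFT = 9;            // 2^9 = 512-byte card pages
    static final byte DIRTY = 0;
    static final byte[] CARD_TABLE = new byte[1 << 20];

    static { Arrays.fill(CARD_TABLE, (byte) 1); } // all cards start clean (nonzero)

    static void scanDirtyCards() {
        for (int card = 0; card < CARD_TABLE.length; card++) {
            if (CARD_TABLE[card] == DIRTY) {
                long pageStart = (long) card << CARD_SHIFT;
                long pageEnd = pageStart + (1L << CARD_SHIFT);
                // Scan objects in [pageStart, pageEnd) for references into
                // the new generation and add them to the GC Roots ...
            }
        }
    }
}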

False sharing

HotSpot maintains the card table with write barriers. Besides the barrier overhead itself, card tables suffer from False Sharing under high concurrency. The CPU cache works in units of cache lines; when multiple threads modify variables that happen to share a cache line, extra coherence work (write-back, invalidation, or synchronization) is required and performance degrades. If write barriers on several threads update card table elements that share a cache line at the same time, that cache line is contended and performance drops. The usual fix is a conditional write barrier: first check whether the card is already dirty, and only update it if it is not.

if (CARD_TABLE[addr >> 9] != 0) // update the card only if it is not already dirty
    CARD_TABLE[addr >> 9] = 0;

Since JDK 7, HotSpot has provided the -XX:+UseCondCardMark parameter to control conditional card marking. Enabling it avoids the false-sharing problem, at the cost of the extra comparison on every barrier.

Garbage collection algorithms

Mark-sweep algorithm

The mark-sweep algorithm has two phases: mark and sweep. First all objects to be reclaimed are marked; once marking is complete, all marked objects are reclaimed together. It can also work the other way around: mark the live objects and reclaim everything unmarked (a minimal sketch follows the list of drawbacks below).

The method's virtue is simplicity and directness, but it has two major drawbacks:

  1. Unstable execution efficiency: the time spent in both the mark and sweep phases grows with the number of objects.

  2. Memory fragmentation: sweeping leaves large numbers of discontiguous memory fragments. With too much fragmentation, the program cannot find enough contiguous memory when it needs to allocate a large object, triggering another garbage collection prematurely.
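
A minimal mark-sweep sketch over a toy heap (illustrative types; a real heap is address-based, which is where the fragmentation comes from): mark everything reachable from the roots, then sweep away the unmarked objects:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MarkSweepSketch {
    static class Obj { final List<Obj> fields = new ArrayList<>(); }

    static void collect(List<Obj> heap, List<Obj> roots) {
        // Mark: graph traversal from the GC Roots.
        Set<Obj> marked = new HashSet<>();
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = pending.pop();
            if (marked.add(o)) {
                pending.addAll(o.fields);
            }
        }
        // Sweep: reclaim every object that was not marked.
        heap.removeIf(o -> !marked.contains(o));
    }
}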

Mark-copy algorithm

The Semispace Copying algorithm was proposed by Fenichel in 1969 to solve mark-sweep's inefficiency and fragmentation when large numbers of objects are reclaimable. It divides memory into two equal halves, using only one at a time; the half in use is called the "active space". When the active space fills to a threshold, the live objects are copied into the free half and the original active space is emptied wholesale, and the two halves are used alternately. This mark-first, then-copy, then-clear scheme is called the mark-copy algorithm.

The vast majority of generational garbage collectors use mark-copy for the new generation because it is simple and efficient: copying begins once marking is complete, there is no fragmentation afterwards, and there is no need to decide object by object what to remove, since the whole half-space is released at once. Its problems are that it wastes a lot of memory, and that the more objects survive, the more copying it must do, so its efficiency drops as the survival rate rises.

In 1989, building on the weak generational hypothesis, Andrew Appel proposed an optimized semispace strategy, known as Appel-style collection: the new generation is divided into one large Eden region and two small Survivor regions, and only Eden plus one Survivor region are in use at a time. At garbage collection, the live objects of Eden and the active Survivor region are copied into the other, free Survivor region, and the two evacuated areas are freed directly; the two Survivor regions alternate in holding survivors.

HotSpot's Eden-to-Survivor ratio is 8:1, so only 10% of the new generation's memory lies idle. Although most objects are ephemeral, by Murphy's law the Survivor region is certain to run out of memory sooner or later. Appel-style collection therefore sets up an "escape hatch": when a Minor GC overflows the Survivor region, other memory areas take the overflow through Handle Promotion (allocation guarantee), usually by sending the excess objects straight into the old generation.
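
A minimal sketch of Appel-style copying (toy capacities, and a liveness predicate standing in for marking): survivors of Eden and the active Survivor region are copied into the free Survivor region, and overflow is promoted straight to the old generation:

import java.util.List;
import java.util.function.Predicate;

class AppelCopyingSketch {
    static final int SURVIVOR_CAPACITY = 10; // toy number

    static void minorGc(List<Object> eden, List<Object> fromSurvivor,
                        List<Object> toSurvivor, List<Object> oldGen,
                        Predicate<Object> isLive) {
        for (List<Object> space : List.of(eden, fromSurvivor)) {
            for (Object o : space) {
                if (!isLive.test(o)) continue;       // garbage: simply dropped
                if (toSurvivor.size() < SURVIVOR_CAPACITY) {
                    toSurvivor.add(o);               // copy into the free Survivor
                } else {
                    oldGen.add(o);                   // escape hatch: handle promotion
                }
            }
        }
        eden.clear();            // evacuated spaces are released wholesale
        fromSurvivor.clear();
        // fromSurvivor and toSurvivor swap roles before the next Minor GC
    }
}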

Mark-compact algorithm

In 1974, Edward Lueders proposed the mark-compact algorithm: the marking phase marks the live (or dead) objects as before, then all live objects are moved toward one end of the memory space, and the memory beyond the boundary is cleaned up in one pass after the move. A minimal sketch follows.
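
A minimal sketch of compaction over an array "heap" (liveness is given as a bitmap; a real collector must also rewrite every reference to the moved objects, which is exactly why the user program is paused):

class MarkCompactSketch {
    // Slides live objects toward index 0 and clears everything beyond
    // the boundary; returns the new allocation pointer.
    static int compact(Object[] heap, boolean[] live) {
        int free = 0;
        for (int i = 0; i < heap.length; i++) {
            if (live[i]) {
                heap[free++] = heap[i]; // move live object toward one end
            }
        }
        for (int i = free; i < heap.length; i++) {
            heap[i] = null;             // clean memory beyond the boundary
        }
        return free;
    }
}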

The essential difference between mark-compact and mark-sweep is that the former is a moving collection algorithm while the latter is non-moving. Moving live objects means updating every reference to them, a cumbersome operation that cannot be done without pausing the user program; this pause is the "Stop The World" (STW) already mentioned. The non-moving mark-sweep keeps pauses shorter, but its fragmented memory can then be served only by more complex memory allocators and accessors, and because memory access is the most frequent operation a program performs, using mark-sweep directly lowers program throughput.

VMs choose between mark-compact and mark-sweep for old-generation collection according to their own priorities. Parallel Scavenge focuses on application throughput, so it takes the mark-compact approach; CMS focuses on latency, so it uses mark-sweep, but when memory fragmentation begins to hinder object allocation, CMS falls back to a single mark-compact collection, which still brings an STW pause.