Learning from the Ground up ->JVM (Preface)

Learning from the ground up ->JVM (Preface) – Why delay?

JVM (a) : The Java Memory Model (JMM) is not the Java Virtual Machine memory model (JVM) oh!

Learning from the Ground up – JVM part 2: Why does Java Need a JVM (Java Virtual Machine)?

Learning from the Ground up ->JVM (3) : Classloader (1)

Learning from the Ground up ->JVM (4) : Classloader (Middle)

Learning from the ground up ->JVM (5) : Classloader (2)

Learning from the Ground up – JVM part 6: The Relationship between threads and JVMS

Learning from the ground up ->JVM (7) : Runtime Data Area (1)

Learning from the ground up ->JVM (8) : Runtime Data area (2)

Learning from scratch ->JVM (9) : Garbage Collection (1)

Learning from the ground up ->JVM (10) : Garbage Collection (middle)

Learning from scratch ->JVM (11) : Garbage Collection (2)

preface

Garbage collection, in the JVM world, is very important.

To implement the inversion of control design principle, Java leaves the generation and destruction of Java objects to our Own Java programs through dependency injection. That is, Java programs do not control the life cycle of their own objects, so if our JVM does not manage the generation and destruction of objects, As a result, the Java objects generated by our program fill up our JVM memory, causing memory to constantly run out of memory.

The JVM has garbage collection, of course, to make sure our JVM doesn’t run out of memory.

In the JVM, the program counter, the virtual machine stack, and the local method stack are created and removed with the thread, and the stack frames in the stack methodically perform the removal and removal of methods as they enter and exit.

How much memory allocation in each stack frame is basically set in the class structure is known (though at runtime will some optimized by real-time compilers, but based on the discussion of the conceptual model, in general can be thought of as compile-time knowable), so this a few areas of memory allocation and recovery with certainty, There is no need to worry too much about recycling in these areas, and when a method or thread terminates, the memory will follow.

The Java heap and method areas have significant uncertainties: Multiple implementation classes of an interface may require different memory, and different conditional branches performed by a method may require different memory. Only during runtime can we know exactly which objects and how many objects will be created by the program. The allocation and reclamation of this part of memory is dynamic.

The garbage collector is concerned with how this portion of memory is managed.

Because our garbage collector was created primarily for the collection of Java objects, of course, the class objects of Java classes also fall under our category of collection. So, since it is the Java object to recycle, then we need to know one thing, that is to determine whether the Java object is alive, if the Java object is alive, then naturally cannot be recycled.

Determine whether a Java object is alive

Different VMS may have different schemes to determine whether their objects are alive or not. Currently, most algorithms used in the industry are as follows:

  1. Reference counting.
  2. Accessibility analysis algorithm.

Specific explanation is as follows.

1. Reference counting

Adds a reference counter to the object, incrementing its value by one each time it is referenced somewhere; When a reference is invalid, the counter value is reduced by one; An object whose counter is 0 at any point in time cannot be used again.

Advantages:

  1. Immediate garbage collection: In reference counting, each object always knows its own number of references (that is, the value of the counter). When the number of references is zero, the object immediately connects itself to the free list as free space. That is, objects are collected as soon as they become garbage, so memory space is not occupied by garbage.
  2. Short maximum pause times: In reference counting, garbage collection is performed only when a pointer is updated via the mutator (application). This means that garbage is collected every time it is generated by executing the Mutator, thus drastically reducing the maximum pause time of the Mutator.

Disadvantages:

  1. Heavy increments and decrement of counter values: In reference counting, the counter value is updated every time the pointer is updated, so the increments and decrement of values are necessarily heavy. This problem can be solved using deferred reference counting, but as the value of the delay counter increases or decreases, garbage cannot be collected immediately, so garbage presses the heap and we lose one of the great advantages of reference counting, which is that garbage can be collected immediately
  2. Counters need a lot of bits: the counter used for reference counting must count up to the number of references to all objects in the heap. For example, if we were using a 32-bit machine, it would be possible to have two to the thirty-second objects reference one object at the same time. With this in mind, it is necessary to ensure that counters for each object are 32 bits in size. That is, there must be 32 bits of space for all objects. This problem can be solved using Sticky reference counting method or 1-bit reference counting method. Both methods reduce the counter bit width to solve the problem, but may lead to counter overflow.
  3. Circular reference cannot be recycled: Circular reference data structures, including self-reference structures, cannot be recycled. This creates a problem if the whole is unreachable in the program, but the collector still can’t reclaim the memory because its reference count will never be zero.

But in Java, we’ve mostly focused on the problem that circular references don’t solve, so let’s take a look at that.

1.1 Circular reference data structures

This circular reference data structure, which is often referred to as a circular reference, is one of the things that happens when you use reference counting.

This is easier to understand with our Java code.

public static  void main(String[]args){
        Object object1 = new Object();
        Object object2 = new Object();
        object1 = object2;
        object2 = object1;
        object1 = null;
        object2 = null;
    }
Copy the code

So you can see that when two objects object1 object2 and reference each other, if you are using a reference counting method to determine the object’s alive or not, you’ll see after the two objects to the program execution, the counter value is 1, this also means that the garbage collector is unable to recycle these two objects.

Circular references to circular data structures, such as bidirectional lists or circular buffers, are very common in both applications and run-time systems.

1.2 Self-referential structure

This structure is generally in C language, mainly in the realization of the tree structure and tree child nodes, it is required that the structure contains its own type of member variables.

This leads to self-referential structures, which are rarely covered in Java, so I won’t go into detail here.

Of course, self-referential structures are just a type of circular data structure.

1.3 How can I Solve the Problem of Circular Reference

The most common and widely accepted approach in the industry is to experiment with deletion algorithms in order to solve the problem that the garbage collector cannot reclaim the memory when there are circular reference structures or self-reference structures.

Instead of using a back-up tracer collector to scan the entire live object graph, the algorithm focuses on local object graphs that might generate circular garbage by removing references.

The idea of the experimental deletion algorithm is as follows:

  1. Inside the circular garbage pointer structure, the reference count for all objects is generated by Pointers between their internal objects.
  2. Circular garbage is only possible if the reference count of an object is thrown greater than zero after a reference to that object is deleted.

With these two, you can use partial tracing to trace subgraphs starting with an object that might be garbage. For each candidate garbage object, the algorithm experimentally removes the reference count generated by the internal pointer. After tracing, if the reference count for an object is still not zero, then there must be an external object that references the object.

This can be done by using the Recycler algorithm, which can be divided into three stages:

  1. The collector does subgraph tracing from an object that might be a member of the circular garbage, while reducing the reference count generated by the internal pointer. The algorithm uses a three-color notation (described below) to set the traversed object to gray.
  2. All objects in the subgraph are detected. If the reference count of an object is not zero, the object must be referenced by other objects outside the graph. At this point, the experiment deletion operation of the first stage needs to be modified. The algorithm will reset the surviving gray object to black and set other gray objects to white.
  3. Objects that are still white in the subgraph must be garbage and can be recycled by the algorithm.

Of course, in addition to using the recycler algorithm for trial removal, there are other situations that can solve the problem:

  1. Python uses a tracer collector to implement a partial mark-clear algorithm to solve the circular reference problem.
  2. Android’s smart pointer method through strong and weak references to solve the problem of circular references.

These methods and theories, I will not talk more here, interested you can go to understand.

2. Accessibility analysis algorithm

A series of root objects called “GC Roots” are used as the initial node set. From these nodes, search down according to Reference relationship. The path traversed in the search process is called “Reference Chain”. Or in graph theory terms, when the object is unreachable from GC Roots to the object, it proves that the object cannot be used again, and the garbage collector can collect the garbage objects.

The memory management subsystem of the current mainstream commercial programming languages (Java, C#, back to the ancient Lisp) uses Reachability Analysis to determine whether objects are alive or not.

2.1 Which objects are GC Roots?

  1. Objects referenced in the virtual machine stack (the local variable table in the stack frame), such as parameters, local variables, temporary variables, etc. used in the method stack called by each thread.
  2. An object referenced by a class static attribute in a method area, such as a Java class reference type static variable.
  3. Objects that are constant references in the method area, such as references in the String Constant pool (String Table).
  4. Objects referenced by JNI (commonly referred to as Native methods) in the Native method stack.
  5. Internal references to the Java virtual machine, such as Class objects corresponding to the basic data types, some resident exception objects (NullPointExcepiton, OutOfMemoryError), and system Class loaders.
  6. All objects held by the synchronized keyword.
  7. Jmxbeans that reflect Java virtual machine internals, callbacks registered in JVMTI, local code caches, and so on.
  8. Depending on the garbage collector selected by the user and the memory region currently reclaimed, other objects can be added “temporarily” to form a complete COLLECTION of GC Roots.

2.2 Types of References

Whether the reference count algorithm is used to judge the number of references to the object, or whether the reference chain of the object is reachable by the reachability analysis algorithm, the survival of the object is dependent on “reference”.

Prior to JDK 1.2, a reference was traditionally defined in Java: a reference was a reference to a block of memory or an object if the value stored in the reference represented the starting address of another block of memory.

There is nothing wrong with this definition, but now it seems too narrow. An object can only be “referenced” or “unreferenced” under this definition, and it is useless to describe some “tasteless” objects.

For example, we want to be able to describe a class of objects that can be kept in memory when there is enough space, and then discarded if memory is still too tight after garbage collection. Caching in many systems fits this scenario.

Java expanded the concept of references after JDK version 1.2, Strongly re-reference, Soft Reference, Weak Reference, and Phantom Reference are divided into four types of references. The strength of these four types of references decreases successively.

  1. Strong reference: is the most traditional definition of “reference”, which refers to the common reference assignment in program code, such as “Object obj=new Object()”. In any case, the garbage collector will never reclaim the referenced object as long as the strong reference relationship exists.
  2. Soft references are used to describe objects that are useful, but not necessary. Only objects that are associated with soft references are listed in the recycle range for a second collection before an out-of-memory exception occurs. If there is not enough memory in the recycle range, an out-of-memory exception is thrown. After JDK1.2, a SoftReference class was provided to implement soft references.
  3. Weak references: Also used to describe non-essential objects, but weaker than soft references, objects associated with weak references only survive until the next garbage collection occurs. When the garbage collector starts working, objects associated only with weak references are reclaimed, regardless of whether there is currently enough memory. WeakReference classes were provided after JDK1.2 to implement weak references.
  4. Virtual reference: Also known as “ghost reference” or “phantom reference”, this is the weakest type of reference relationship. The existence of a virtual reference does not affect the lifetime of an object, nor can an object instance be obtained through a virtual reference. The sole purpose of setting a virtual reference association for an object is to receive a system notification when the object is reclaimed by the collector. PhantomReference classes were provided after JDK1.2 to implement virtual references.

2.3 Death of an object

I mentioned earlier that when an object is deemed unreachable by the reachability analysis algorithm, the garbage collector can reclaim the object, but in fact, the object does not have to die.

To actually declare an object dead, there must be at least two tagging processes.

If an object is found to have no reference chain connected to GC Roots after reachability analysis, it will be marked for the first time and then filtered by whether it is necessary to execute the Finalize () method.

If the object does not override the Finalize () method, or if the Finalize () method has already been called by a virtual machine, then the virtual machine considers both cases “unnecessary to execute”, and the garbage collector will directly reclaim the object.

However, if the objects are judged to be necessary to finalize(), then they will be placed ina Queue named f-queue, and their Finalize () methods will be executed later by a low-priority Finalizer thread automatically set up by the virtual machine.

By “execute,” I mean that the virtual machine triggers the method to start running, but does not promise to wait for it to finish.

The reason for this is that if the Finalize () method of an object executes slowly, or even if an infinite loop occurs, it is likely to cause other objects in the F-queue to wait forever, or even cause the crash of the entire memory reclamation subsystem.

Finalize () method is the last chance for objects to escape death. Later collector will mark objects in f-queue for a second time ina small scale. If objects want to save themselves in Finalize () successfully, they only need to re-associate with any object on the reference chain. For example, if you assign yourself (this keyword) to a class variable or a member variable of an object, it will be removed from the collection on the second tag; If the object has not escaped by this time, it will basically be recycled.

However, we should remember that an object can only be executed once when Finalize () is used. That is to say, if the object that escaped from finalize() is determined to be unreachable in the following reachability analysis, then the object cannot be executed again when Finalize () is used, and it will be directly reclaimed by garbage collector.

Two. Garbage collector design ideas

In contemporary commercial virtual machines, many design theories of garbage collector have emerged. Garbage collection algorithms can be divided into two types based on the judgment of object extinction:

  1. Reference Counting GARBAGE Collection (GC)
  2. Tracing GC (also known as indirect collection)

Reference counting garbage collection I covered a little bit earlier, but in Java we basically use trace garbage collection, so we’ll focus on trace garbage collection.

What is tracking garbage Collection?

Tracing garbage collection is an indirect garbage collection strategy, that is, this strategy does not directly look for the garbage itself, but first looks for which objects are alive, and then determines that all the remaining objects are garbage and frees the memory space of the remaining unreachable objects.

On the basis of this traceable garbage collection, most virtual machines in the market basically adopt two strategies to recycle garbage:

  1. Generation collection theory
  2. Zonal collection theory

These two strategies are primarily used to partition the memory space of our virtual machine, or in The case of Java, to subdivide the heap space and method area of our JVM into different memory areas by one or both of these strategies, and then use our memory reclamation algorithm for garbage collection.

1. Generational collection theory

In fact, IN my previous article on Java heaps, LEARNING THE JVM from the Ground up (8) : Runtime Data Areas (2), I talked a little bit about generational collection theory, but not much about cleaning, and I’ll go over it in detail here.

One lesson learned from many program cases is that “most objects become garbage as soon as they are generated, and very few objects live very long.” The generation collection theory uses this experience to introduce the concept of age into the object.

Then, when the object has age, the generation collection theory sets two hypotheses on the basis of age:

    1. Weak generation hypothesis: most objects are ephemeral.
    1. The strong generation hypothesis: the more garbage collections an object passes, the less likely it is to die.

On the basis of these two hypotheses, objects are classified into several generations in generational garbage collection, and different GC algorithms are used for different generations. We refer to the newly generated objects as the new generation objects and the objects that reach a certain age as the old age objects.

Different objects are stored in different parts of the Java heap, for example in HotSpot, but this brings up another problem, which is that there are references to older objects in the new generation, or references to new ones in the old generation.

What’s the problem with that?

When a GC retrieves a new generation object, depending on the reachabability analysis algorithm, the GC may have to iterate over all the old generation objects. If the gc retrieves a new generation object, it may also have to iterate over all the new generation objects. If this problem is very large, it can seriously degrade the performance of the virtual machine.

Thus, people summarized an experience from many program cases and came to a hypothesis:

    1. Cross-generation quote hypothesis: Cross-generation quotes are less common than same-generation quotes.

According to this hypothesis, the gc at the time of recovery of Cenozoic object, you should not again to a small amount of intergenerational references to scan the entire old s, don’t waste space devoted to each object exists and what reference across generations, only need to establish a global data on the Cenozoic structure (the structure is called “memory”, Remembered the Set), This structure divides the old generation into smaller chunks, identifying which chunks of memory from the old generation will have cross-generation references.

Later, when GC occurs, only objects in small chunks of memory containing cross-generation references are added to GC Roots for scanning. Although this approach requires some runtime overhead to maintain the correctness of the recorded data when the object changes its reference relationship (such as assigning a value to itself or an attribute), it is still cost-effective compared to scanning the entire age at collection time.

Therefore, the generation collection algorithm is based on three hypotheses.

2. Zonal collection theory

Partition algorithm divides the whole heap space into continuous different cells, and each cell is used independently and recycled independently.

The advantage of this is that you can control how many cells are recycled at a time. Under the same conditions, the larger the heap space, the longer a GC takes, and thus the longer the pauses. To better control the pauses generated by GC, reduce pauses generated by one GC by splitting a large memory area into smaller chunks and reasonably collecting several cells at a time (rather than the entire heap) based on the target pause times.

3. Memory reclamation algorithm

Due to generational or partitioned collection algorithms, the memory area of our JVM is divided into object storage Spaces of different properties, so if all the space uses the same algorithm for garbage collection, there is no point in doing generational or partitioning.

Therefore, our JVM matches different garbage collection algorithms for different regions to collect garbage. These garbage algorithms mainly include the following:

  1. Mark-clear algorithm
  2. Mark-copy algorithm
  3. Mark-collation algorithm

These specific algorithms are for different memory regions of the Java heap, so let’s take a look at these algorithms.

3.1 Mark-clear algorithm

One of the earliest and most basic garbage collection algorithms is the Mark-sweep algorithm. Half a century after its creation, it remains a great algorithm for all kinds of processing programs.

The algorithm is divided into two stages of marking and clearing: firstly, all the objects that need to be recycled are marked, and all the marked objects are uniformly recycled after the marking is completed. Conversely, the surviving objects are marked and all the unmarked objects are uniformly recycled.

The marking process is the process of determining whether an object is garbage, and in the JVM, usage is determined by reachability analysis.

Advantages:

  1. Simple implementation: This algorithm is the basis of all algorithms, the rest of the garbage collection algorithms are developed on its basis, so the algorithm is relatively simple to implement.
  2. Compatibility with conservative GC: This algorithm is very compatible with GC that does not recognize Pointers and non-pointers.

Disadvantages:

  1. The execution efficiency is not stable. If the Java heap contains a large number of objects, and most of them need to be reclaimed, then a large number of marking and clearing actions must be performed, resulting in the execution efficiency of both marking and clearing process decreases with the increase of the number of objects.
  2. Fragmentation of memory space. After marking and clearing, a large number of discontinuous memory fragments will be generated. Too much space fragment may cause that in the future program running process, if large objects need to be allocated, enough continuous memory may not be found and another garbage collection action has to be triggered in advance.
  3. Allocation is too slow: Memory regions in the algorithm are not contiguous, so memory fragments must be recorded in a linked list, and each allocation must traverse the free linked list to find enough memory.

The mark-sweep algorithm is the basis of all algorithms, but that doesn’t mean it’s behind The Times. It’s used on many gc’s, such as our Major GC.

3.2 Mark-copy algorithm

Mark-copy algorithm is an algorithm developed by Marvin L. Minsky in 1963 to solve the problem of low efficiency of mark-clear algorithm in the face of a large number of recyclable objects.

In simple terms, this algorithm only copies live objects in one space to other Spaces, and reclaims all objects in the original space.

In our Java heap, this algorithm is used in the Survivor region of the new generation. This algorithm splits the Survivor region into two regions, one called From region and the other called To region. When the From region is fully occupied, gc marks the surviving objects in the From region and then copies all the surviving objects To the To region. When the copy is complete, the algorithm swaps the From space with the To space, and the GC ends.

Advantages:

  1. Excellent throughput: Because the GC replication algorithm only searches and copies live objects, it can complete the GC in a shorter time than the normal GC mark-sweep algorithm. In other words, its throughput is excellent.
  2. High speed allocation: GC replication algorithms do not use idle linked lists. This is because a partition is a contiguous memory space. Therefore, investigate the size of the partition and move the pointer to allocate as long as the partition size is not smaller than the requested size.
  3. Fragmentation does not occur: the surviving objects are moved into an empty block of memory, and the remaining memory space clears up all objects to form a complete blank memory.
  4. Cache compatibility: Referenced objects are copied close to each other in the heap, making it faster for the CPU to read desired objects.

Disadvantages:

  1. Heap use is inefficient: GC replication algorithms bisects the heap, usually using only half of it to arrange objects. That is, only half of the heap can be used. This is arguably a major disadvantage of the mark-copy algorithm compared to other GC algorithms that can use the whole heap.
  2. Incompatible with conservative GC: The mark-copy algorithm is incompatible with conservative GC because it requires moving objects.
  3. Performance penalty of replication: When you copy an object, you copy its children recursively. Therefore, the additional burden of calling recursive functions every time a copy is made cannot be ignored. And the stack is consumed on each recursive call, so there is also the possibility of stack overflow.

3.2 Mark-collation algorithm

The efficiency of mark-copy algorithm will decrease when the survival rate of the object is high. More importantly, if you do not want to waste 50% of the space, you need to have extra space for allocation guarantee, in the extreme case that all objects in the memory used are 100% alive, so in the old days, this algorithm generally cannot be used directly.

In 1974, Edward Lueders proposed another targeted “mark-compact” algorithm for the survival characteristics of old objects. In this algorithm, the marking process is still the same as the “mark-clean” algorithm, but the next step is not to clean the recyclable objects directly. Instead, all surviving objects are computed by Lisp2 and moved to one end of the memory space, then cleaned up beyond the boundary.

Advantages:

  1. Efficient use of the heap: mark-collation algorithms do not use half the heap as mark-copy algorithms do. The mark-collation algorithm can arrange objects in the whole heap, and the heap usage efficiency is almost twice that of mark-copy algorithm. The word “almost” is used because there is room for Pointers, so technically less than twice. On the other hand, although the mark-sweep algorithm can also utilize the entire heap, there is no defragmenting process, so the heap cannot be fully utilized efficiently.

Disadvantages:

  1. Defragmenting objects is performance-intensive: Defragmenting moving objects is definitely performance-intensive. If mobile live objects, especially has a large number of objects in every time the recycling old s live area, mobile live objects and update all the references to these objects will be an extremely load operation, and the objects move operation must suspend the user application to all the way, it’s more for the users to have to carefully weigh its disadvantages, Pauses like this were graphically described by The original virtual machine designers as “Stop The World.”
  2. Low throughput: When Lisp2 is executed, the entire heap is searched 3 times, that is, the gc with mark-collation takes time proportional to the heap size. Note, however, that the throughput is higher than that of the mark-sweep algorithm.

conclusion

This article focuses on the theoretical basis of garbage collection and does not cover the specific GC implementation, which will be covered in the next article.

Writing this article spent a long time, there has been no ideas, are little by little ground out, I hope you can see the visitors from this article to harvest things, so I feel very happy.