Why is there a garbage collector

A world without GC

Aren’t virtual machines and garbage collectors natural for people who have only been exposed to Java, Python, and other languages since their development careers? In fact, not true, for C or C++ program apes, when they need to manually allocate memory, use their own free memory, this is the normal (diy, rich food). So why have a garbage collector when you can do it yourself? Isn’t it good to be in control? Good! But there’s no need. Most of the time, programmers focus on logic during development, and the garbage collector is designed to allow you to focus only on logic during development without worrying about tedious memory management.

What does GC do for us

Since developers don’t need to manage memory anymore, the virtual machine allocates memory every time an object is created, but memory is not infinite, we need something to help us clean up objects that are no longer used to free up memory, this thing is garbage collector. To put it more simply, it does two things:

  • How to find these recyclable objects: find the memory address of an inactive object in memory
  • How do you recycle these objects: free up memory so that programs can reuse it

When Java developers went home to spend time with their girlfriends, C++ developers were scratching their heads: where did they forget to release memory and cause the program to Crash?

Evolution of garbage collection algorithms

The GC origin

Most people think of GC as Java/Python, but Lisp was the first language to implement GC, even though garbage collection is known in these languages. Sixty years ago in 1960, John McCarthy, the father of Lisp, published the GC algorithm in his paper, and Lisp was the first language to implement GC.

Lisp also has many advanced designs, including functional programming, dynamic typing of variables, and so on, besides garbage collection.

Mark clear

It was proposed by John McCarthy in 1960 that garbage objects can be recovered by two steps: mark and clean

How do I mark objects to be reclaimed?

To clear objects, we first need to define which objects can be reclaimed. Conversely, if we can know which objects are active/being used (or can be used), then all but these active objects can be reclaimed. This is the tagging process in the tagging clearance algorithm: reachability analysis.

Accessibility analysis

Roughly speaking, accessibility analysis can be divided into two steps:

  • Identify the Root node: an object directly active on a thread
    • Objects referenced by static properties or constants
    • Active threads (stack variables, method parameters, etc.)
  • Traversal all objects, marked as reachable (most are depth-first traversal)

After iterating through all objects, allObjects that are not marked as reachableWill be recycled

How are these objects reclaimed after the [clear] mark?

Once the mark is successful, all objects are iterated again and all objects marked as reachable are reclaimed.

The entire tag clearing algorithm is simply shown in the following figure:

As can be seen from the above, the mark clearing algorithm has an obvious disadvantage, that is, too much memory fragmentation. One scenario that is likely to occur is when the virtual machine needs to allocate memory for a large object, the total amount of memory is sufficient, but there is not enough contiguous space. Therefore, a new algorithm for basic tag clearing is proposed: tag finishing. Mark sorting is also a two-step process:

  • Start with accessibility analysis
  • Instead of clearing all unmarked objects directly, all living objects are sorted by their memory address, and all memory space after the last living object is reclaimed. Resolve memory fragmentation

Mark finishing:

Reference counting

It was proposed by George E. Collins in 1960 that objects are recovered mainly by the number of references to them

The principle of reference counting is fairly simple: store the reference count in the object header, +1 when the object is referenced, -1 when the object is dereferenced, and reclaim the object when the count is zero.

Problem: Circular references

Partition replication

It was proposed by Marvin L. Minsky in 1963 to recycle objects by memory partitioning and copying

When using the partition copy algorithm, there are usually two areas of memory: an active area with various objects allocated, and a free area. When GC is done, it is also done in two steps:

  • A live memory object that marks an active region
  • Copies live objects from the active region to the free region and empties the original active region

Advantages and disadvantages compared

Recovery algorithm advantages disadvantages
Mark clear Implement a simple Create debris

Low recovery efficiency
Tag to sort out Memory fragmentation is not produced The sorting part is more complicated

Low recovery efficiency
Reference counting Short pauses (mostly)

Instant recycling
Additional counting operations are required

Object needs to store additional count information

There are problems with circular references
Partition replication High distribution efficiency

Most of the time the recycling efficiency is fast

It doesn’t create debris
Low memory usage

Recycle efficiency is reduced if there are too many living objects (lots of copies)

As we can see from the above, each recycling algorithm has its own advantages and disadvantages, and no algorithm is perfect. At this point, smart people think: Can I combine them so that I can take advantage of the advantages of these algorithms while avoiding their disadvantages? This is the generational collection that we will talk about next

Generational collection

People from a large number of programs, summed up an experience: most objects become garbage immediately after generation, few objects live long. Almost all programming languages have shown this consistency for decades. Thus, in 1984, David Ungar proposed the theory of generational collection in one of his papers.

Core of generational collection:

  • The heap is divided into generative space (where memory is allocated), two equally sized surviving Spaces (From and To), and old chronospace.
  • The recordset is used to record references from the old age to the new age, in order to efficiently find references from the old age to the new age.
  • Object needs to add three fields:
    • Age: indicates the age of the object
    • Forwarded: Indicates a forwardedflag
    • Remembered: Indicates that records have been made to the recordset

When you see two equal surviving Spaces do you think of a copy algorithm? The new generation of generational collection does use complex algorithms, which perfectly fit the characteristics of the replication algorithm: high distribution efficiency and fast recovery efficiency. At the same time, it avoids its disadvantages with high probability: the recovery efficiency is low when there are many living objects (in most cases, the survival rate of new objects is very low. Long-term active objects have been promoted to the old age)

Advantages and disadvantages of generational collection

advantages

  • Improved GC throughput
  • Memory allocation efficiency improves
  • Most of the time the pause is short

disadvantages

  • Implementation complex, different generations reference each other
  • When basic assumptions are violated, efficiency is even lower

Garbage collector in JAVA

do xi do la
The new generation Serial



Replication algorithm

serial
ParNew



Replication algorithm

parallel

Can be used with CMS configuration
Parallet Scavenge



Replication algorithm

parallel

Throughput priority

GC adaptive adjustment strategy
The old s SerialOld



Tag to sort out

serial

CMS backup, used when ConcurrentModeFailure occurs
ParalletOld



Tag to sort out

parallel

Can be used in conjunction with the ParalletScavenge and be used in throughput first and CPU sensitive applications
CMS



Concurrent mark clearing

Low pause times are preferred

CPU resource sensitive, concurrent use of user program CPU

The above list only lists the garbage collectors that most programs normally use. If you are interested in G1 and ZGC, you can find out by yourself.

How do you evaluate the impact of garbage collection on your program

Let’s use a simple formula to calculate the impact of GC on our application and evaluate whether the current GC frequency is reasonable

Impact of GC on requests:

(Response time +GC time) * GC times within the time range/T

Copy the code

How do you understand that? Let’s start with a graph:

Assuming that the normal response time of each interface is 50ms and each GC requires a pause of 10ms, then if one GC occurs in unit time T, then the response time of requests arriving in 50ms before GC starts will increase by 10ms, while the response time of requests arriving during GC will increase by 0~10ms. So the impact of GC on the request = (response time +GC time) * the number of GC’s in the time range.


Here’s an example:We can start by looking at the GC log above

How to read the GC log is not detailed here, you are so smart, Google to understand.

It can be seen that the interval between two GC is almost 600ms (estimated GC100 times per minute), and each time takes 0.01s = 10ms. Then we calculate by one minute:

(50ms + 10ms) * 100/60000 = 0.1

Copy the code

That is, 10% of requests are affected by GC every minute.

So what if we need to optimize to reduce the impact of GC on requests? Look at the formula:

  • Reduces the time of a single GC
  • Reduces the number of GC cycles per unit time

conclusion

Above we talked about why we have GC and the various GC algorithms for GC, their pros and cons and their implementation in HotSpot, and finally we looked briefly at how to quickly evaluate the impact of GC on your application. Hopefully this will give you a simple idea of what garbage collection is. Even better, get people interested and go into the details yourself!

[key] reprint please indicate the source: https://juejin.cn/post/6857391504907436040