Preface

Garbage collection is an important part of Python. This is the first of two installments on garbage collection.

1. Garbage collection

High-level languages such as Java and C# use garbage collection in place of the manual memory management of C and C++. Managing memory yourself is extremely flexible: you can allocate memory however you like. But like a double-edged sword, it plants the seeds of a great many bugs such as memory leaks and dangling pointers. In a language where a string, a list, a class, and even a number are all objects that are easy to create and use, it is only natural not to leave the user to worry about how memory is allocated and reclaimed. Python uses garbage collection just as Java does, but with a difference: Python relies on reference counting as its primary mechanism, with mark-and-sweep and generational collection as supplementary mechanisms.

The reference counting mechanism: everything in Python is an object, and at the core of every object is a structure, PyObject:

typedef struct _object {
    int ob_refcnt;                 /* reference count */
    struct _typeobject *ob_type;   /* pointer to the object's type */
} PyObject;

A PyObject header is mandatory for every object; its ob_refcnt field holds the reference count. When an object gains a new reference, its ob_refcnt is incremented, and when a reference to it is removed, its ob_refcnt is decremented.

#define Py_INCREF(op) ((op)->ob_refcnt++)      /* increase the count */
#define Py_DECREF(op)                          /* decrease the count */ \
    if (--(op)->ob_refcnt != 0)                                         \
        ;                                                               \
    else                                                                \
        _Py_Dealloc((PyObject *)(op))

When the reference count reaches 0, the object ends its life.
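A quick way to watch the reference count change from Python code is sys.getrefcount (a minimal sketch; note that getrefcount itself holds a temporary reference to its argument, so the reported value is one higher than the count held by your code):

import sys

a = []                       # a new list object, referenced by the name a
print(sys.getrefcount(a))    # 2: the reference from a, plus getrefcount's own argument

b = a                        # a second reference appears, so ob_refcnt goes up
print(sys.getrefcount(a))    # 3

del b                        # the extra reference disappears, so ob_refcnt goes down
print(sys.getrefcount(a))    # 2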

Advantages of the reference counting mechanism:

  • Simple
  • Real-time: as soon as an object has no references, its memory is freed immediately, instead of waiting for some specific moment as other mechanisms do. Another benefit of this immediacy is that the time spent reclaiming memory is spread out over normal execution.

Disadvantages of reference counting:

  • Maintaining reference counts consumes resources
  • Circular references, as in the example below:
list1 = []
list2 = []
list1.append(list2)   # list1 references list2
list2.append(list1)   # list2 references list1, forming a cycle

list1 and list2 refer to each other. If nothing else references them, the reference counts of list1 and list2 are still 1 each, and the memory they occupy can never be reclaimed, which would be fatal. Drawback 1 is acceptable on today's powerful hardware, but circular references cause memory leaks, so Python was bound to introduce additional collection mechanisms: mark-and-sweep and generational collection.
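Here is a minimal sketch of how the supplementary collector rescues this case (gc.collect() returns the number of unreachable objects it found):

import gc

list1 = []
list2 = []
list1.append(list2)          # list1 -> list2
list2.append(list1)          # list2 -> list1: a reference cycle

del list1                    # drop the outside references; the cycle keeps
del list2                    # both reference counts stuck at 1

unreachable = gc.collect()   # the cycle detector finds and frees the pair
print(unreachable)           # at least 2: both lists were unreachable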

2. Visualizing Ruby and Python garbage collection

Visualizing Garbage Collection in Ruby and Python

2.1. The beating heart of applications

A GC system does much more than collect garbage. In fact, it is responsible for three important tasks:

  • Allocating memory for newly created objects
  • Identifying garbage objects
  • Reclaiming memory from garbage objects

If you think of an application as a human body, then all the elegant code you write, the business logic, the algorithms, that's the brain. By analogy, which organ would the garbage collector be? (I've heard some interesting answers from RuPy audiences: kidneys, white blood cells.) I think the garbage collector is the application's beating heart. Just as the heart supplies blood and nutrients to the rest of the body, the garbage collector supplies memory and objects to your application. If your heart stops, you're dead within seconds. If the garbage collector stops working or runs slowly, like clogged arteries, your application becomes less and less efficient and eventually dies.

2.2. A simple example

An example always helps theoretical understanding. Here is a simple class, written in both Python and Ruby, that we'll use as today's example:
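Here is a minimal Python sketch of that Node class (the attribute name val is an assumption made for illustration; the Ruby version differs only in syntax, as noted below):

class Node:
    def __init__(self, val):
        # a "constructor" that stores one value in an instance variable
        self.val = val

n1 = Node("ABC")   # the later examples create nodes like these
n2 = Node("DEF")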

Incidentally, the code in the two languages is surprisingly similar: Ruby and Python really differ only slightly in how they say the same thing. But are the internal implementations of the two languages also this similar?

2.3. Ruby object allocation

What exactly does Ruby do when we execute Node.new(1) above? How does Ruby create new objects for us? Surprisingly, it does very little. In fact, long before the code even starts to execute, Ruby creates hundreds of objects and strings them together on a linked list called the free list. Here is a conceptual diagram of the free list:

Imagine each white square is marked "unused pre-created object". When we call Node.new, Ruby simply takes one of these pre-created objects and hands it to us:

In the figure above, the gray square on the left represents the object our code is now using, while the white squares are unused objects. (Please note that my diagram is, of course, a simplification of reality. In practice Ruby uses another object to hold the string "ABC", another object to hold the Node class definition, another object to hold the abstract syntax tree parsed from the code, and so on.) If we call Node.new again, Ruby hands us another object:

This simple algorithm of pre-allocating objects onto a linked list has been around for more than 50 years. Its inventor, the eminent computer scientist John McCarthy, first implemented it while working on Lisp. Lisp was not only the first functional programming language, it also pioneered many ideas in computer science, one of which was the concept of using a garbage collection mechanism to automate program memory management.

The standard version of Ruby, also known as "Matz's Ruby Interpreter" (MRI), uses a GC algorithm similar to McCarthy's implementation in 1960. For better or worse, Ruby's garbage collection mechanism is 53 years old. Like Lisp, Ruby creates objects up front and then makes them available to you when you allocate new objects or variables.

2.4. Object allocation in Python

We've seen that Ruby creates objects in advance and stores them on the free list. What about Python? Although Python also uses free lists for various internal reasons (to recycle particular objects such as lists), it differs from Ruby in how it allocates memory for new objects and variables. For example, let's use Python to create a Node object:

Unlike Ruby, Python requests memory from the operating system immediately when the object is created. (Python actually implements its own memory allocation system that provides an abstraction layer on top of the operating system heap, but I won't expand on that today.) When we create a second object, we request memory from the OS again:

As simple as it seems, Python takes some time to find and allocate memory for us as we create objects.

2.5. Ruby developers live in messy rooms

Ruby leaves useless objects sitting in memory until the next GC run. As we create more and more objects, Ruby keeps handing us pre-created objects from the free list, and the free list gets progressively shorter:

...and then shorter still:

Notice that I keep assigning new values to the variable n1 while Ruby leaves the old values in place. The Node instances "ABC", "JKL", and "MNO" are still stuck in memory. Ruby does not immediately clean up old objects that the code no longer uses! Ruby developers live in a messy room with clothes on the floor and dirty dishes in the sink: as a Ruby programmer, useless garbage objects surround you all the time.

2.6. Python developers live in clean homes

Python’s garbage collection mechanism is quite different from Ruby’s. Let’s go back to the three Python Node objects mentioned earlier:

Internally, whenever it creates an object, Python stores an integer inside the object's C structure called the reference count. Initially, Python sets this value to 1:

A value of 1 means that exactly one pointer or reference points to each of the three objects. Suppose we now create a new Node instance, JKL:

As before, Python sets JKL's reference count to 1. Note, however, that because we changed n1 to point to JKL instead of ABC, Python drops ABC's reference count to zero. At this point the Python garbage collector springs into action: whenever an object's reference count drops to zero, Python frees it immediately and returns its memory to the operating system:

Above, Python reclaims the memory used by the ABC Node instance. Remember, Ruby simply abandons old objects and does not free their memory. Python's garbage collection algorithm is called reference counting. George Collins invented it in 1960, which happened to be the same year John McCarthy invented the free list algorithm. As Mike Bernstein put it in his excellent talk on garbage collection at the Gotham Ruby Conference in June, "1960 was the golden age of garbage collection...".

Python developers live in a hygienic house. You can imagine having a roommate with mild OCD who constantly cleans up after you: as soon as you put down a dirty plate or cup, someone is ready to put it in the dishwasher! Now for a second example. Let's have n2 reference n1:

In the figure above, Python has decreased the reference count of DEF on the left, and the garbage collector will reclaim the DEF instance immediately. Meanwhile JKL's reference count has become 2, since both n1 and n2 now point to it.
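A minimal sketch of this behavior using sys.getrefcount and a weak reference (a weak reference does not keep its target alive, so watching it go dead shows that CPython freed the object the instant its last reference disappeared; the Node class and attribute names follow the running example):

import sys
import weakref

class Node:
    def __init__(self, val):
        self.val = val

n1 = Node("ABC")
n2 = Node("DEF")
probe = weakref.ref(n1)     # a weak reference; it does not raise the count

n1 = Node("JKL")            # rebind n1: the ABC node's count drops to zero...
print(probe())              # ...and it has already been freed: prints None

n2 = n1                     # DEF loses its last reference and is freed as well
print(sys.getrefcount(n1))  # 3: the references from n1 and n2, plus getrefcount's own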

2.7. Mark and sweep

Eventually the cluttered room fills up with garbage and life can no longer go on peacefully. After the Ruby program has run for a while, the free list is finally exhausted:

At this point all of Ruby's pre-created objects are in use by the program (they are all gray) and the free list is empty (no white squares left). Now Ruby turns to another algorithm invented by McCarthy, called mark-and-sweep. First, Ruby stops the program; Ruby uses "stop the world" garbage collection. Ruby then walks all the pointers, variables, and other references produced by the code, as well as the internal pointers in its own virtual machine, and marks every object these pointers reach. I'll use an M to indicate the mark here.

The three objects marked with M in the figure above are still in use by the program. Internally, Ruby actually uses a sequence of bits, known as the free bitmap, to keep track of which objects are marked.

The marked objects are alive; the remaining unmarked objects are garbage, meaning our code will no longer use them. I'll use white squares to represent the garbage objects in the figure below:

Next Ruby sweeps these useless garbage objects back onto the free list:

Internally this happens very quickly, because Ruby doesn't actually copy objects from one place to another. Instead, the garbage objects are returned to the free list by adjusting internal pointers so that they form a new linked list. Now Ruby can hand these garbage objects back to us the next time we create objects. In Ruby, objects are endlessly reincarnated, enjoying many lives.

2.8. Mark and sweep vs. reference counting

At first glance, Python's GC algorithm looks far superior to Ruby's: why live in a messy room when you can live in a clean one? Why does Ruby force the program to stop periodically instead of using Python's algorithm? Reference counting, however, is not as simple as it first appears. There are several reasons why most languages do not use a reference counting GC algorithm the way Python does:

First of all, it's hard to implement. Python has to reserve space inside every object to hold the reference count, which costs a little memory. Worse, every simple operation (such as modifying a variable or a reference) becomes a more complex one, because Python has to increment one count, decrement another, and possibly free an object.

Second, it's relatively slow. Although Python performs its GC work smoothly as the program runs (washing each dirty dish the moment it lands in the sink), that is not necessarily faster overall. Python is constantly updating numerous reference counts. And when you stop using a big data structure, such as a list with many elements, Python may have to free a large number of objects all at once; decrementing all those reference counts becomes a complex recursive process.

Finally, it doesn't always work. Reference counting cannot handle circular data structures, that is, data structures that contain circular references.

3. Circular data structures and reference counting in Python

3.1. Circular references

From the previous section we learned that in Python every object holds an integer value called the reference count, which tracks how many references point to the object. Whenever a variable or another object in our program references the target object, Python increments this count; when the program stops using the object, Python decrements it. Once the count reaches zero, Python frees the object and reclaims the associated memory. Since the 1960s, however, computer scientists have been aware of a serious theoretical problem with reference counting algorithms: if a data structure refers to itself, that is, if it is a circular data structure, some reference counts can never reach zero. To better understand the problem, let's take an example. The code below uses the same Node class as before:

We have a "constructor" (called __init__ in Python) that stores a single value in an instance variable. After the class definition we create two nodes, ABC and DEF, shown as the rectangles on the left. The reference count of each node is initialized to 1, because one reference points to each node (n1 and n2 respectively). Now let's define two additional attributes on the nodes, next and prev:

Unlike in Ruby, Python lets you define instance variables or object attributes dynamically while the code is running. This seems like a bit of interesting magic that Ruby is missing. (For the record, I'm not a Python programmer, so there may be some naming mistakes here.) We set n1.next to point to n2 and n2.prev to point back to n1. Our two nodes now form a doubly linked list using circular references. Also notice that the reference counts of ABC and DEF have increased to 2: there are two pointers to each node, n1 and n2 as before, plus next and prev. Now suppose our program no longer uses these two nodes; we set both n1 and n2 to null (None in Python).

Well, Python decreases each node's reference count to 1 as usual.
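A minimal runnable sketch of this situation (the val, next, and prev attribute names follow the description above); reference counting alone cannot reclaim the pair, but the cycle detector described in the next section can:

import gc

class Node:
    def __init__(self, val):
        self.val = val

n1 = Node("ABC")
n2 = Node("DEF")

n1.next = n2          # ABC -> DEF
n2.prev = n1          # DEF -> ABC: a reference cycle

n1 = None             # drop the outside references; each node still holds
n2 = None             # the other, so both reference counts stay at 1

print(gc.collect())   # the cycle detector reports the unreachable objects it freed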

3.2. Generation Zero in Python

Note that in the example above we ended up with an unusual situation: we have an "island", a group of unused objects that point to each other but have no external references. In other words, our program no longer uses these node objects, so we would hope Python's garbage collector is smart enough to free them and reclaim the memory they occupy. But nothing happens, because both reference counts are 1, not 0: Python's reference counting algorithm cannot handle objects that point to each other. That's why Python introduces a generational GC algorithm. Just as Ruby uses a free list to keep track of unused, free objects, Python uses a different linked list to keep track of active objects. Instead of calling it an "active list", Python's internal C code calls it generation zero. Every time you create an object or some other value, Python adds it to the generation zero list:

You can see from the figure above that when we create the ABC node, Python adds it to the generation zero list. Note that this isn't a real Python list you can access from your code; this linked list is entirely internal to the Python runtime. Similarly, when we create the DEF node, Python adds it to the same linked list:

Generation zero now contains two node objects. (It also contains every other value the program creates, along with a number of internal values used by Python itself.)
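The gc module lets you peek at what the collector is tracking (a minimal sketch; gc.get_count() reports the collector's internal counters for generations 0, 1, and 2, and on Python 3.8 or later gc.get_objects(generation=0) lists the objects currently tracked in generation zero):

import gc

class Node:
    def __init__(self, val):
        self.val = val

print(gc.get_count())    # (gen0, gen1, gen2) counters before we allocate anything

n1 = Node("ABC")
n2 = Node("DEF")

print(gc.get_count())    # the generation-zero counter has typically grown

young = gc.get_objects(generation=0)     # Python 3.8+
print(any(obj is n1 for obj in young))   # usually True in a fresh script:
                                         # the new node sits in generation zero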

3.3. Detect circular references

Python then loops over every object in the generation zero list, checking the references between the objects in the list and decrementing reference counts according to its rules. During this process Python tallies the internal references one by one so that no object is freed prematurely. To make this easier to understand, consider an example:

As you can see above, the ABC and DEF nodes each have a reference count of 1. Three other objects are also in the generation zero list; the blue arrows indicate that some of them are referenced by objects outside the list. (As we'll see, Python also has two other linked lists, called generation one and generation two.) These objects have higher reference counts because other pointers point to them. Next you'll see how Python's GC handles the generation zero list.

By identifying internal references, Python is able to decrement the reference counts of many generation zero objects. In the first row above you can see that the counts of ABC and DEF have reached zero, which means the collector can free them and reclaim their memory. The remaining live objects are then moved onto a new linked list: the generation one list. In a sense, Python's GC algorithm resembles the mark-and-sweep algorithm used by Ruby: periodically tracing references from one object to another to determine which objects are still alive and in use by the program is just like Ruby's marking process.
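You can trigger a generation zero pass by hand and watch a surviving object get promoted (a minimal sketch; gc.collect(0) collects only generation zero, and on Python 3.8 or later gc.get_objects(generation=n) shows which generation an object now lives in; the exact bookkeeping varies slightly across CPython versions):

import gc

class Node:
    def __init__(self, val):
        self.val = val

survivor = Node("XYZ")    # still referenced from the outside, so it must survive

gc.collect(0)             # run a generation-zero collection by hand

# The survivor is no longer tracked in generation zero; it was promoted:
print(any(o is survivor for o in gc.get_objects(generation=0)))   # False
print(any(o is survivor for o in gc.get_objects(generation=1)))   # True in a fresh script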

4. GC threshold in Python

When does Python perform this marking process? As your program runs, the Python interpreter keeps track of how many new objects it allocates and how many objects it frees because their reference count reached zero. In theory these two numbers should be equal, because every object the program creates is eventually released. Of course it doesn't work out that way: because of circular references, and because some objects live longer than others, the difference between the allocation count and the release count slowly grows. Once this difference exceeds a certain threshold, Python's collector kicks in, triggering the generation zero algorithm described above, freeing the "floating garbage" and moving the surviving objects to the generation one list.
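These thresholds are exposed through the gc module (a minimal sketch; in CPython the default has traditionally been (700, 10, 10): a generation zero pass runs once allocations exceed deallocations by 700, and each older generation is examined after every 10 collections of the generation below it):

import gc

print(gc.get_threshold())    # traditionally (700, 10, 10) in CPython

# Raise the generation-zero threshold, e.g. for allocation-heavy code,
# so the collector interrupts the program less often:
gc.set_threshold(5000, 10, 10)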

Over time, the objects your program uses gradually migrate from the generation zero list to the generation one list. Python handles the generation one list in the same way: once the allocation and release counts reach a certain threshold, it moves the remaining active objects to the generation two list. In this way, the objects your code keeps using for a long time, the active objects it accesses again and again, move from generation zero to generation one to generation two. With different threshold settings, Python processes these objects at different intervals: generation zero most frequently, generation one less often, and generation two least often.

This is the core behavior of generational garbage collection algorithms: the garbage collector processes new objects more frequently. A new object is one your program has just created, while an old object has survived several collection cycles. Python promotes an object as it moves from generation zero to generation one, or from generation one to generation two. Why do it this way? The roots of the algorithm lie in the weak generational hypothesis, which consists of two observations: most new objects die young, while objects that have already lived for a while are likely to keep living for a long time. Suppose I now create a new object in Python or Ruby:

According to the hypothesis, my code will probably use ABC only for a short time. The object may be just an intermediate result inside a method, and it becomes garbage as soon as the method returns. Most new objects become garbage this quickly. Occasionally, however, a program creates a few important objects that live much longer, such as session variables or configuration items in a web application. By frequently processing the new objects in the generation zero list, Python's garbage collector spends its time where it matters most: it deals with the new objects that are likely to become garbage soon, and only rarely, when the threshold conditions are met, does it go back and examine the long-lived objects.