This is my second article on getting started

What is garbage collection

Garbage collection, I think you’ve all heard about this at some point, what is garbage collection? Garbage Collection in Computer science, Garbage Collection (abbreviated GC) is an automatic memory management mechanism. When dynamic memory on a computer is no longer needed, it should be freed to make room for it. This type of memory resource management is called garbage collection. We all know that in C/C++ users need to manage and maintain their own memory, memory management is very free, you can freely apply for, free memory, but easy to appear memory leaks, dangling Pointers and other problems; Advanced languages such as Java and Python have adopted a garbage collection mechanism to automatically manage memory, and the garbage collection mechanism focuses on two things: (1) finding useless garbage resources in memory. (2) Clean up the garbage resources and free up the memory for other objects.

Python is an interpreted language. Because of its simple syntax, we can directly assign values to variables without having to declare the types of variables. The determination of variable types and the allocation and release of memory space are automatically done by the Python interpreter at run time, so we don’t have to care about it. The automatic memory management of Python greatly reduces the coding burden of developers and enables them to focus on business implementation, which is one of the important reasons for Python’s success. Next, let’s take a look at Memory management in Python.

Garbage collection in Python

Reference counting

Everything in Python is an object, which means that every variable you use in Python is essentially a class object. In fact, the core of every object is a “PyObject” structure, which has a reference counter ob_refcnt inside. The program updates the value of ob_refcnt in real time to reflect the number of names that reference the current object. When an object has a reference count of 0, the object is garbage, it is reclaimed, and its memory is immediately freed.

typedef struct _object {
    int ob_refcnt;// Reference count
    struct _typeobject *ob_type;
} PyObject;

Copy the code

The following cases cause the reference count to be increased by one: (1) the object is created, such as a = 5 (2) the object is referenced, b = a (3) object is as a parameter, passed to a function (note that occurs in a function call, will produce an additional two references, a stack from function, and the other one is function parameters) as an element (4) objects, stored in a container (such as stored in the list)

The following causes the reference count to be subtracted by one: â‘  The alias of the object is displayed to be destroyed del a â‘¡ the alias of the object is given to a new object â‘¢ an object leaves its scope â‘£ The container in which the object resides is destroyed or deleted from the container

We can also get the current reference count of an object referenced by a name using getrefCount () in the sys package (note that getrefcount() itself increments the reference count).

import sys
a = [1.2.3]
print(sys.getrefcount(a))
# Output 2, indicating two references (one from the definition of a and one from getrefcount)

def func(a) :
    print(sys.getrefcount(a))
    # Output 4, indicating four references (a definition, Python function call stack, function arguments, and getrefcount)

func(a)
print(sys.getrefcount(a))
# output 2, indicating that there are two references (one from the definition of a and one from getrefcount), and that the func call no longer exists

Copy the code

Here’s a look at it from the perspective of using memory:

import os
import psutil


def show_memory_info(hint) :
    Param hint: :return: """
    pid = os.getpid()
    p = psutil.Process(pid)

    info = p.memory_full_info()
    memory = info.rss / 1024 / 1024
    print('{} Current process memory usage: {} MB'.format(hint, memory))


def func() :
    show_memory_info('initial')
    a = [i for i in range(9999999)]
    show_memory_info('After creating a')


func()
show_memory_info('the end')

Copy the code

The output is as follows:

Initial memory usage of the current process: 12.125 MB Memory usage of the current process after the creation of A: 205.15625 MB Memory usage of the current process after the creation of A: 12.87890625 MB Memory usage of the end of the process: 12.87890625 MBCopy the code

As you can see, the initial memory usage of the current process was 12.125 MB. After func() was called to create list A, the memory usage quickly increased to 205.15625 MB, and after the function call ended, the memory returned to normal. This is because the list a declared inside the function is a local variable. After the function returns, the reference to the local variable is cancelled. At this point, the reference count of the object in the list A is zero, and Python performs garbage collection.

A circular reference

What is a circular reference? In simple terms, two objects reference each other. Look at the following program:

def func2() :
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    b = [x for x in range(10000001.20000000)]
    a.append(b)
    b.append(a)
    show_memory_info('After creating a,b')

func2()
show_memory_info('the end')

Copy the code

The output is as follows:

Initial memory usage of the current process: 12.14453125 MB Memory usage of the current process after a and B are created: 396.6875 MB Memory usage of the current process after a and B are created: 396.96875 MB Memory usage of the current process after a and B are created: 396.96875 MBCopy the code

As you can see, in the program, a and B refer to each other, and as local variables, after the call to function func2 ends, A and B no longer exist in the program sense, but from the output, we can see that there is still a memory footprint. Why? Because they refer to each other, the number of references is not zero.

If circular references are present in a production environment, and there is no other garbage collection mechanism, the program is bound to consume more and more memory over a long period of time, and if it is not handled in time, the server will be full.

If we have to use circular references, we can start garbage collection by explicitly calling gc.collect() :

def func2() :
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    b = [x for x in range(10000001.20000000)]
    a.append(b)
    b.append(a)
    show_memory_info('After creating a,b')

func2()
gc.collect()
show_memory_info('the end')

Copy the code

The output is as follows:

Initial memory usage of the current process: 12.29296875 MB Memory usage of the current process after a and B are created: 396.69140625 MB Memory usage of the current process after a and B are created: 12.95703125 MBCopy the code

Reference counting is efficient, simple, and real-time. Once an object’s reference count is zero, memory is freed. You don’t have to wait for a specific moment like other mechanics. The garbage collection is randomly allocated to the running stage, and the time to process the memory collection is allocated to the normal time, and the normal program runs smoothly. However, there are some disadvantages to reference counting, the usual ones are:

â‘  Although the logic is simple, it is troublesome to maintain. Each object needs to allocate a separate space for the reference count, and it needs to maintain the reference count, which can be resource consuming. â‘¡ Circular reference. This would be fatal to the reference counting mechanism, which is unsolvable and therefore must be supplemented with other garbage collection algorithms.

In fact, Python uses the mark-sweep algorithm and generational collection to enable automatic garbage collection for circular references.

Mark clear to remove circular references

Python uses the Mark and Sweep algorithm to solve the circular reference problem of container objects. Note that circular references are only possible for container-class objects, such as lists, dictionaries, objects of user-defined classes, tuples, and so on. Simple types like numbers and strings don’t have circular references. As an optimization strategy, tuples containing only simple types are not considered in the tag cleanup algorithm.) There are two phases: the first phase is the tag phase, where GC marks all active objects, and the second phase is the collection of unmarked inactive objects. So how does Python determine what objects are inactive? For any collection of objects, we first build a reference counting copy table, to save their reference count, and references are lifted out of the collection of internal (internal reference “refers to an object of this collection reference another object inside this collection), the process of lifting the copy table to reduce the reference count, after lifting off all internal reference, If the reference count in the copy table is still not zero, it is the root set. Then we start the marking process, that is, gradually restore the reference from the following collection node and increase the reference count in the copy table. Finally, the reference count in the copy table is zero, and we need to garbage collect them. Such as:

The nodes in the above collection have external inbound connections (to a and to b), and external connections (c refers to an external object). On the right is the reference count table, and then we tear down all internal connections:

So the root set is a and B, and then we start from a and B and restore the reference count:

Starting from a and b can be up to the nodes have been restored, the reference count is zero is the internal circular references garbage collection (e and f), if all the object as a collection, you can recycle all waste, also the set of all objects can be divided into small, recycling garbage from the small set respectively. But having to traverse the graph every time is a huge performance waste for Python.

Generational recycling

Python divides memory into different sets according to the life time of objects. Each set is called a generation. Python divides memory into three generations: the young generation (generation 0), the middle generation (generation 1), and the old generation (generation 2). They correspond to three linked lists whose garbage collection frequency decreases as the object’s lifetime increases. The newly created object is assigned in the young generation, list the total number of young generation to reach highs, namely when the garbage collector in the new object minus the deleted object to the corresponding threshold, will be to this generation of objects start recycling, which objects can be recycled, reclaiming and those not recycled object s will be moved to, and so on, Objects in the old age are those that survive the longest, even over the lifetime of the system. At the same time, generational recycling is based on the marker removal technology. In fact, generational collection is based on the idea that new objects are more likely to be garbage collected, and that objects that survive longer have a higher probability of surviving. Therefore, by doing this, you can save a lot of computations and thereby improve Python performance.

conclusion

Garbage collection is a built-in mechanism of Python, which is used to automatically free memory space that is no longer used. In Python, garbage collection is mainly carried out through reference counting, and circular reference problems that may be caused by container objects are solved by token clearing. Garbage collection efficiency is improved by exchanging space for time through generational collection.

Finally, I would like to thank my girlfriend for her tolerance, understanding and support in work and life.