In this third installment of the Threads, Processes, and Coroutines column, the focus is on Python's GIL and its impact on Python multithreading.

What is the GIL

The Global Interpreter Lock (GIL) is a mechanism for synchronizing threads within an interpreter. Each interpreter process has one GIL, which directly limits the parallel execution of multiple threads in that process: only one thread can run at a time in a single interpreter process, even on a multi-core processor.

For Python, the GIL is not a feature of the language itself but an implementation detail of the CPython interpreter. Python code is compiled to bytecode, which is executed in the interpreter; during execution, the GIL in CPython allows only one thread to execute bytecode at a time.

The immediate problem with the GIL is that multiple threads in a single interpreter process cannot achieve true parallelism, even on a multi-core processor.

Therefore, Python's multithreading is pseudo-multithreading: it cannot exploit multiple cores, and only one thread is actually running at any given moment.

Explore the impact of the GIL on Python multithreading

IO-intensive and CPU-intensive

Next, let's simulate the effect of the GIL on Python multithreading with some code examples.

Let's start with a review of CPU-intensive and IO-intensive workloads.

  • CPU-bound (CPU-intensive): programs that spend most of their time on computation, logic checks, and other CPU work, such as matrix calculations and video encoding/decoding; CPU usage is high.
  • IO-bound (IO-intensive): programs that spend most of their time waiting for IO (network IO, disk IO) to complete, such as most web applications; CPU usage is low.

CPU-intensive code testing

Start with a single-threaded test of the CPU-intensive workload.

from time import time

def loop_add(n):
    # Pure CPU work: increment a counter n times.
    i = 0
    while i < n:
        i += 1


if __name__ == '__main__':
    begin_time = time()

    loop_add(100000000)

    end_time = time()
    run_time = end_time - begin_time
    print("Program takes {}s".format(run_time))


The result: the program takes 3.2977540493011475s

Then run a multithreaded test on the same CPU-intensive workload.

import threading
from time import time

def loop_add(n):
    # Pure CPU work: increment a counter n times.
    i = 0
    while i < n:
        i += 1


if __name__ == '__main__':
    begin_time = time()

    # Split the same 100,000,000 increments across two threads.
    t1 = threading.Thread(target=loop_add, args=(50000000,))
    t2 = threading.Thread(target=loop_add, args=(50000000,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    end_time = time()
    run_time = end_time - begin_time
    print("Program takes {}s".format(run_time))


The program takes 3.718836545944214s

IO-intensive code testing

Now the IO-intensive test, starting single-threaded.

import requests
from time import time


def loop_request():
    # IO-bound work: 50 sequential HTTP requests.
    for i in range(50):
        requests.get("http://www.baidu.com")

if __name__ == '__main__':
    begin_time = time()

    loop_request()

    end_time = time()
    run_time = end_time - begin_time
    print("Program takes {}s".format(run_time))

The program takes 3.770847797393799s

Then the multithreaded version.

import threading
import requests
from time import time


def loop_request(n):
    # IO-bound work: n sequential HTTP requests.
    for i in range(n):
        requests.get("http://www.baidu.com")


if __name__ == '__main__':
    begin_time = time()

    # Split the 50 requests across two threads, 25 each.
    t1 = threading.Thread(target=loop_request, args=(25,))
    t2 = threading.Thread(target=loop_request, args=(25,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    end_time = time()
    run_time = end_time - begin_time
    print("Program takes {}s".format(run_time))


The program takes 1.7753996849060059s

Conclusion

Comparing the results: for the CPU-intensive program, multithreading brings no performance improvement and even adds latency, while for the IO-intensive program the execution time is roughly halved.

Why isn't the execution time of the CPU-intensive program halved? This is the effect of the GIL: at any given moment, even on a multi-core CPU, only the thread holding the global lock can execute bytecode, and every other thread must wait for the lock to be released. So instead of true parallel execution, the threads take turns, executing serially. Worse, multithreaded execution also adds the overhead of acquiring and releasing the global lock and of context switching, and this efficiency loss can be even more pronounced on multi-core CPUs than on single-core ones.

Why is the IO-intensive program's execution time halved? Since Python 3.2, the GIL is handed off between threads at a fixed interval: when another thread requests the GIL, the currently running thread is asked to release it after the switch interval, 5 milliseconds by default. In addition, the GIL is released while a thread performs blocking IO, such as network IO or file reads and writes, so another thread can continue executing in the meantime.
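The switch interval can be inspected and tuned through the sys module; a minimal sketch:

import sys

# The interval (in seconds) after which a running thread is asked to
# release the GIL when another thread is waiting; 0.005 (5 ms) by default.
print(sys.getswitchinterval())   # -> 0.005

# It can be adjusted, e.g. to reduce switching overhead for CPU-bound threads:
sys.setswitchinterval(0.01)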

Why the GIL exists

If the GIL has such an impact on CPU-intensive Python multithreading, why does it exist? It's worth tracing the cause.

Python was born in the single-core CPU era, so in the initial design and development of the CPython interpreter, more consideration was given to the mainstream single-core CPU usage scenarios.

CPython relies primarily on reference counting for garbage collection. If two threads reference the same object, updates to the reference count made by one thread may not be seen by the other, which is not thread-safe. Introducing a coarse-grained global lock like the GIL effectively avoids this thread-safety problem in CPython's memory management and keeps shared data consistent in a multi-threaded environment.
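Reference counting can be observed directly with sys.getrefcount; a minimal sketch:

import sys

x = []
# getrefcount reports one extra reference held by its own argument.
print(sys.getrefcount(x))   # typically 2: the name `x` plus the argument

y = x                        # bind another name to the same list
print(sys.getrefcount(x))   # typically 3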

Because the GIL is coarse-grained, it does not need to be acquired and released frequently, which makes Python more efficient in the single-threaded case and keeps the implementation relatively simple compared with alternatives such as fine-grained locking.

With the advent of multi-core CPUs, however, the inability of Python's multithreading to take advantage of multiple cores has become a historical burden.

With the GIL, is thread safety still a concern

Again, let's test this with a code example.

import threading

a = 0
lock = threading.Lock()

def increase(n):
    # Increment the shared counter without any locking.
    global a
    for i in range(n):
        a += 1

def increase_with_lock(n):
    # Increment the shared counter while holding a mutex.
    global a
    for i in range(n):
        lock.acquire()
        a += 1
        lock.release()

def multithread_increase(func, n):
    # Run func in 10 threads and print the final counter value.
    threads = [threading.Thread(target=func, args=(n,)) for i in range(10)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(a)


if __name__ == '__main__':
    multithread_increase(increase, 1000000)
    a = 0
    multithread_increase(increase_with_lock, 1000000)

Here increase increments the shared counter without a lock, while increase_with_lock protects each increment with a mutex; with 10 threads each adding 1,000,000, the expected final value is 10,000,000.

The test results are:

6633318
10000000

As you can see, the lock-free increment function is not thread-safe: during multithreaded execution, some threads' modifications are lost and never reflected in the final result.

The reasons are:

The GIL guarantees the exclusivity of each bytecode instruction's execution, that is, each individual bytecode instruction is atomic. But because of the GIL's release mechanism, it does not guarantee that a thread won't be switched out between bytecode instructions; a thread switch can happen across a sequence of multiple bytecodes.

We can use Python's dis module to look at the bytecode of a += 1 and find that it takes several bytecode instructions to complete; a thread switch can occur between them, so the operation is not thread-safe.
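For example, a quick look with dis (the exact opcodes vary by CPython version; the comments below show roughly what CPython 3.10 prints):

import dis

a = 0

def increment():
    global a
    a += 1

dis.dis(increment)
# On CPython 3.10 this prints roughly:
#   LOAD_GLOBAL   a
#   LOAD_CONST    1
#   INPLACE_ADD
#   STORE_GLOBAL  a
# Four separate instructions; a thread switch between any of them
# can lose an update. (CPython 3.11+ uses BINARY_OP instead of INPLACE_ADD.)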

The granularity of the GIL differs from that of a thread mutex. The GIL is an interpreter-level lock that keeps shared resources consistent at the interpreter level; a thread mutex is a code-level (user-level) lock that keeps shared data consistent at the level of the Python program. So we still need thread mutexes and other thread synchronization mechanisms to keep data consistent.

How to improve performance under the GIL

There are a number of ways to improve performance under the GIL:

  • For IO-intensive tasks, we can use multithreading or coroutines (see the asyncio sketch after this list).
  • Switching to an interpreter without a GIL, such as Jython, is an option, but it is not recommended because you would lose many useful C extension modules.
  • Use multiple processes instead of multiple threads.
  • Move computationally intensive tasks into C/C++ extension modules.
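As a taste of the coroutine option, here is a minimal stdlib-only sketch. Since requests is blocking and cannot be awaited, it simulates each network call with asyncio.sleep; the point is that all 50 waits overlap on a single thread:

import asyncio
from time import time

async def fake_request():
    # Stand-in for one network call; awaiting asyncio.sleep yields
    # control to the event loop, just as awaiting a socket read would.
    await asyncio.sleep(0.1)

async def main():
    # All 50 "requests" run concurrently on a single thread.
    await asyncio.gather(*(fake_request() for _ in range(50)))

if __name__ == '__main__':
    begin_time = time()
    asyncio.run(main())
    run_time = time() - begin_time
    print("Program takes {}s".format(run_time))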

Next, let's rework the earlier CPU-intensive code with multiple processes.

import multiprocessing
from time import time


def loop_add(n):
    # Pure CPU work: increment a counter n times.
    i = 0
    while i < n:
        i += 1


if __name__ == '__main__':
    begin_time = time()

    # Each process has its own interpreter and its own GIL,
    # so the two halves can really run in parallel on two cores.
    p1 = multiprocessing.Process(target=loop_add, args=(50000000,))
    p2 = multiprocessing.Process(target=loop_add, args=(50000000,))

    p1.start()
    p2.start()
    p1.join()
    p2.join()

    end_time = time()
    run_time = end_time - begin_time
    print("Program takes {}s".format(run_time))


The result is: the program takes 1.7649831771850586s

Recall the earlier results: the single-threaded program took 3.2977540493011475s and the multi-threaded program took 3.718836545944214s. With multiple processes, that time is roughly cut in half.
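As a design note, the same two-process split can also be written with the standard library's concurrent.futures pool API; a minimal sketch:

from concurrent.futures import ProcessPoolExecutor
from time import time

def loop_add(n):
    i = 0
    while i < n:
        i += 1

if __name__ == '__main__':
    begin_time = time()
    # Two worker processes; the with-block waits for all tasks to finish.
    with ProcessPoolExecutor(max_workers=2) as pool:
        list(pool.map(loop_add, [50000000, 50000000]))
    run_time = time() - begin_time
    print("Program takes {}s".format(run_time))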

That covers the basics of the Python GIL; discussion and exchange are welcome ~

All the code for this section is in my GitHub repository.