Python multi-core, multi-threaded, concurrent programming

Original address: blogof33.com/post/8/

Preface

In a recent project, a Raspberry Pi (ARM Cortex-A53 quad-core, 1 GB of RAM) was connected to two ultrasonic sensors that had to measure distance continuously. The Raspberry Pi served as the main controller driving the robot's movement. It also ran a camera and a web server (used to transmit the video stream), and a TensorFlow image recognition model stayed resident in the background, performing recognition at intervals as required. Because the Raspberry Pi is a low-spec machine, all four cores are saturated every time TensorFlow image recognition runs. To keep the ultrasonic ranging from interfering with TensorFlow recognition, and to prevent prolonged high CPU usage from burning out the expansion board, the CPU usage of the ultrasonic code had to be reduced.

Python GIL

Since two ultrasonic sensors measure distance at the same time, and since the control program is written in Python, our first idea was to create child threads with the start_new_thread function from Python's thread module:

thread.start_new_thread(checkdist,(GPIO_R1,var1,))
thread.start_new_thread(checkdist,(GPIO_R2,var2,))

For details of the ultrasonic ranging function, see the earlier article "Raspberry Pi 3B ultrasonic ranging sensor Python GPIO/WPI/BCM three ways".

When the program ran, the robot moved sluggishly and responded to commands with a delay. Looking at the CPU usage, as shown in the figure below, it turned out that the whole program was crowded onto one core, which caused the lag.

Why does the whole program run on a single core? Because the CPython interpreter has a GIL.

What is the GIL

The first thing to be clear about is that the **GIL (Global Interpreter Lock)** is not a feature of the Python language; it is a concept introduced by one implementation of the Python interpreter (CPython, which is implemented in C). There are several interpreters, and the same code can be executed by CPython, PyPy, Jython, and other Python execution environments. Jython, for example, has no GIL. However, because CPython is the default Python execution environment in most cases, many people equate CPython with Python and attribute the GIL to a defect of the language itself. So let's be clear: the GIL is not a Python language feature, and Python can do without it.

All C code in the CPython interpreter must hold this lock while executing Python code. Guido van Rossum, the creator of Python, originally added the lock because multi-core machines were uncommon at the time, and it was a simple way to ensure that only one thread accessed shared resources at a time. (Incidentally, CPython switches threads using either cooperative or preemptive multitasking, as described in this article.) But it also means that Python threads cannot run in parallel on multiple cores. In the multi-core era the GIL's inherent defect became apparent: it not only prevents a multithreaded program from running in parallel on multiple cores, but running on multiple cores can even reduce the efficiency of multithreading.
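The effect described above can be observed with a small timing sketch. It assumes Python 3 and the threading module (the project code in this article uses the older Python 2 thread module), and the names count_down, run_sequential, and run_threaded are made up for illustration. On a standard GIL-enabled CPython build the threaded run is typically no faster than the sequential one for this kind of pure-Python, CPU-bound loop, although the exact numbers depend on the machine and interpreter.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; the thread holds the GIL while it runs."""
    while n > 0:
        n -= 1


def run_sequential(n, workers):
    """Run the loop `workers` times, one after another, and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    """Run the loop in `workers` threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    N = 2_000_000
    print("sequential: %.3f s" % run_sequential(N, 2))
    print("threaded:   %.3f s" % run_threaded(N, 2))
```

Comparing the two printed times on a multi-core machine gives a rough feel for how little a second thread helps when the work is done in Python bytecode.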

So why not remove the GIL and replace it with a better implementation at this point in Python’s development?

An experiment that removed the GIL and replaced it with finer-grained locks found that single-threaded performance dropped severely, and that multithreaded programs only came out ahead once the number of threads reached a certain level. Removing the GIL and replacing it with finer-grained locks was therefore judged not worthwhile at the time. The GIL is one of the most actively studied topics in the Python community, and at the time of writing its removal was still a long way off.

So how do we get around the GIL's limitations? I tried several things.

Steer clear of GIL

multiprocessing

To run a program on multiple cores at the same time using only Python, the usual approach is to write a multi-process program with the multiprocessing module:

from multiprocessing import Process
p1 = Process(target=checkdist, args=(GPIO_R1, var1,))
p2 = Process(target=checkdist, args=(GPIO_R2, var2,))
p1.start()
p2.start()
p1.join()
p2.join()

CPU usage:

The ultrasonic part of the program now ran on two cores and the parent process on another, so the robot no longer lagged. However, both cores were fully loaded: this solved the latency problem but made CPU usage worse. Are processes too heavyweight? I searched online, and many people said that an ultrasonic polling loop really does saturate a core.

So is there no solution? At that point I thought of C. Could C be used to bypass the GIL's restriction?
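One detail worth keeping in mind with the multi-process version: arguments passed to a Process are copied into the child, so an ordinary Python variable such as var1 would not reflect changes made by the child. The sketch below shows one possible way to share an obstacle flag, using multiprocessing.Value. It assumes Python 3, and fake_checkdist and run_two_sensors are hypothetical stand-ins that work on a list of made-up readings instead of real GPIO pins; the 5 cm threshold mirrors the C code later in this article.

```python
from multiprocessing import Process, Value


def fake_checkdist(distances_cm, flag, threshold_cm=5.0):
    """Hypothetical stand-in for checkdist.

    For each reading, set the shared flag to 1 if the obstacle is closer
    than the threshold, otherwise to 0.
    """
    for d in distances_cm:
        with flag.get_lock():  # Value carries its own lock by default
            flag.value = 1 if d < threshold_cm else 0


def run_two_sensors():
    """Run two worker processes and return the final flag values."""
    flag1 = Value("i", 0)  # shared C int, visible to parent and child
    flag2 = Value("i", 0)
    p1 = Process(target=fake_checkdist, args=([30.0, 12.0, 3.5], flag1))
    p2 = Process(target=fake_checkdist, args=([40.0, 25.0, 18.0], flag2))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    return flag1.value, flag2.value
```

When running this as a script, call run_two_sensors() from inside an `if __name__ == "__main__":` block, which the multiprocessing module requires on platforms that start child processes by re-importing the main module.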

Dance with C

“Python is sometimes a Swiss Army knife.” The official interpreter, CPython, is implemented in C, so Python integrates well with C/C++. If the ultrasonic distance detection is written in C and called from Python, can we get multithreading across multiple cores?

Python has a library called ctypes that lets Python call C functions. After reading the official documentation, we first write the C function shown below. For a detailed description of the ultrasonic function, please refer to the earlier article "Raspberry Pi 3B ultrasonic ranging sensor Python GPIO/WPI/BCM three ways".

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <termio.h>
#include <bcm2835.h>

void checkdist(int GPIO_R, int *var, int *signal, int GPIO_S, void (*t_stop)(int)) {
    struct timeval tv1;
    struct timeval tv2;
    long start, stop;
    double dis;

    if (!bcm2835_init()) {
        printf("setup bcm2835 failed !");
        return;
    }
    bcm2835_gpio_fsel(GPIO_S, BCM2835_GPIO_FSEL_OUTP);
    bcm2835_gpio_fsel(GPIO_R, BCM2835_GPIO_FSEL_INPT);

    while (1) {
        /* signal is a shared busy flag: wait while the other thread is measuring */
        if (*signal == 1)
            continue;
        else
            *signal = 1;
        printf("GPIO:%d\n", GPIO_R);
        bcm2835_gpio_write(GPIO_S, HIGH);
        bcm2835_delayMicroseconds(15);  // delay 15 us
        bcm2835_gpio_write(GPIO_S, LOW);
        while (!bcm2835_gpio_lev(GPIO_R));
        gettimeofday(&tv1, NULL);
        while (bcm2835_gpio_lev(GPIO_R));
        gettimeofday(&tv2, NULL);
        start = tv1.tv_sec * 1000000 + tv1.tv_usec;  // echo start, in microseconds
        stop = tv2.tv_sec * 1000000 + tv2.tv_usec;   // echo end, in microseconds
        dis = ((double)(stop - start) * 34000 / 2) / 1000000;  // distance in cm
        if (dis < 5) {
            *var = 1;
            t_stop(1);
        } else
            *var = 0;
        printf("dist:%lfcm\n", dis);
        *signal = 0;
        bcm2835_delay(100);  // delay 100 ms
    }
}

Then use the following command to build the shared library:

gcc checkdist.c -shared -fPIC -o libcheck.so

The -shared option tells gcc to build a shared library, and -fPIC generates position-independent code: the generated code contains no absolute addresses, only relative ones. The loader can therefore place the code anywhere in memory and it will still execute correctly. This is exactly what shared libraries require, because they are not loaded at a fixed location in memory, and it prevents errors such as wrong variable addresses.

Then use the following Python code to run the C function as the thread function:

import thread  # Python 2 module; it is named _thread in Python 3
from ctypes import cdll, CFUNCTYPE, c_int, byref

# The prototype of the C function checkdist is:
# void checkdist(int GPIO_R, int *var, int *signal, int GPIO_S, void (*t_stop)(int));
# The Python callback t_stop is defined as: def t_stop(t_time)

lib = cdll.LoadLibrary("/home/pi/Raspbarry_Tensorflow_Car/Servo/MotorHAT/libcheck.so")  # load the shared library
stop_func = CFUNCTYPE(None, c_int)
# The first argument to CFUNCTYPE is the function's return type (None for void); the function's argument types follow
func = stop_func(t_stop)  # func is a C function pointer wrapping the Python function t_stop
signal = c_int(0)  # equivalent to an int in C
var1 = c_int(0)
var2 = c_int(0)
# Create two child threads that run the C function checkdist
thread.start_new_thread(lib.checkdist, (GPIO_R1, byref(var1), byref(signal), GPIO_S, func,))
thread.start_new_thread(lib.checkdist, (GPIO_R2, byref(var2), byref(signal), GPIO_S, func,))

Let’s explain some of the details of the above code.

lib=cdll.LoadLibrary("/home/pi/Raspbarry_Tensorflow_Car/Servo/MotorHAT/libcheck.so")

This line loads the shared library libcheck.so.

The ctypes library provides three objects that make it easy to load dynamic link libraries: cdll, windll, and oledll (the latter two are available only on Windows). Functions of a dynamic link library are called by accessing them as attributes of the loaded library. cdll loads libraries that use the C calling convention (cdecl); windll loads libraries that use the Win32 calling convention (stdcall); oledll also uses stdcall and assumes that the functions return a Windows HRESULT value. With the C calling convention, arguments are pushed onto the stack from right to left. The difference between cdll and the other two lies in who cleans up the stack: with cdecl the caller cleans up, which is why functions with a variable number of arguments can only use this convention; with stdcall the called function cleans up the arguments before it returns, so the number of arguments must be fixed.

cdll is used here, matching the C calling convention of the compiled library, to prevent unnecessary errors.
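To try the loading mechanism without the robot hardware, the following sketch calls two functions from the C standard library instead of libcheck.so. It assumes Python 3 on a POSIX system such as Raspberry Pi OS, where ctypes.CDLL(None) gives access to the symbols already loaded into the process. Declaring argtypes and restype is optional for simple int arguments, but it lets ctypes check and convert the arguments.

```python
import ctypes

# POSIX assumption: CDLL(None) exposes symbols already loaded into the
# process, which includes the C standard library.
libc = ctypes.CDLL(None)

# Declare the C prototypes: int abs(int) and size_t strlen(const char *).
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.abs(-7))               # 7
print(libc.strlen(b"checkdist"))  # 9
```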

Next, set up the callback function:

stop_func = CFUNCTYPE(None, c_int)
# The first argument to CFUNCTYPE is the function's return type (None for void); the function's argument types follow
func = stop_func(t_stop)  # func is a C function pointer wrapping the Python function t_stop

The purpose of these two lines is to pass the Python function t_stop into the C function so that the C function can call t_stop.

signal=c_int(0) is equivalent to int signal=0; in C. The same applies to the next few lines.
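The same callback mechanism can be tried against the C standard library's qsort, which takes a comparison function pointer much as checkdist takes t_stop. This is a sketch assuming Python 3 on a POSIX system; py_compare and compare_cb are illustrative names. As with func in the article's code, the callback object is kept in a variable so that it is not garbage-collected while C code may still call it.

```python
import ctypes
from ctypes import CFUNCTYPE, POINTER, c_int, c_size_t, sizeof

libc = ctypes.CDLL(None)  # POSIX assumption: gives access to the C library

# C prototype of the comparison callback: int (*)(const int *, const int *)
CMPFUNC = CFUNCTYPE(c_int, POINTER(c_int), POINTER(c_int))


def py_compare(a, b):
    """Python function that C will call; a and b point to two ints."""
    return a[0] - b[0]


compare_cb = CMPFUNC(py_compare)  # keep a reference while C may call it

libc.qsort.argtypes = [POINTER(c_int), c_size_t, c_size_t, CMPFUNC]
libc.qsort.restype = None

values = (c_int * 5)(5, 1, 7, 33, 99)
libc.qsort(values, len(values), sizeof(c_int), compare_cb)
print(list(values))  # [1, 5, 7, 33, 99]
```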

And finally:

# Create two child threads that run the C function checkdist
thread.start_new_thread(lib.checkdist, (GPIO_R1, byref(var1), byref(signal), GPIO_S, func,))
thread.start_new_thread(lib.checkdist, (GPIO_R2, byref(var2), byref(signal), GPIO_S, func,))

This creates the threads with the C function checkdist as the thread function. byref(var1) corresponds to &var1 in C: it passes the address of var1. The other arguments work the same way.
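How byref hands a C function an address to write into can be seen with the standard strtol function, whose second parameter is an output pointer, much like the int *var parameter of checkdist. This is a sketch assuming Python 3 on a POSIX system; text, end, and number are illustrative names.

```python
import ctypes
from ctypes import POINTER, byref, c_char_p, c_int, c_long

libc = ctypes.CDLL(None)  # POSIX assumption: gives access to the C library

# C prototype: long strtol(const char *str, char **endptr, int base)
libc.strtol.argtypes = [c_char_p, POINTER(c_char_p), c_int]
libc.strtol.restype = c_long

text = b"123cm"   # kept in a variable so the buffer stays alive
end = c_char_p()  # C will store a pointer to the first unparsed character here
number = libc.strtol(text, byref(end), 10)

print(number)     # 123
print(end.value)  # b'cm'
```

After the call, the C function has written through the pointer, and the new value is visible from Python through end.value, in the same way that var1.value would reflect what checkdist stored in *var.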

Then let’s run it and see what happens:

Nice! The result was much better than expected. Not only did we get multithreading across multiple cores, but CPU usage also dropped considerably. The usage of the two cores basically fluctuates between 2% and 30%, much better than before.
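One likely reason this works is that ctypes releases the GIL before calling a function from a library loaded through cdll or CDLL, and reacquires it afterwards (and when the C code calls back into a Python callback such as t_stop). Threads that spend their time inside such C calls can therefore run at the same time. The sketch below illustrates this with the C library's usleep instead of the real checkdist. It assumes Python 3 on a POSIX system, and blocking_c_call and run_in_threads are illustrative names.

```python
import ctypes
import threading
import time

# POSIX assumption: CDLL(None) exposes the C library already loaded in the process.
libc = ctypes.CDLL(None)
libc.usleep.argtypes = [ctypes.c_uint]
libc.usleep.restype = ctypes.c_int


def blocking_c_call(done, index, microseconds):
    """Block inside C; ctypes drops the GIL for the duration of the call."""
    libc.usleep(microseconds)
    done[index] = True


def run_in_threads(workers=2, microseconds=200_000):
    """Start `workers` threads that each block in C; return flags and elapsed seconds."""
    done = [False] * workers
    threads = [
        threading.Thread(target=blocking_c_call, args=(done, i, microseconds))
        for i in range(workers)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, time.perf_counter() - start


if __name__ == "__main__":
    done, elapsed = run_in_threads()
    print(done, "elapsed: %.3f s" % elapsed)
```

If the two calls overlap, the elapsed time should come out closer to one sleep interval than to two, though the exact figure depends on the system's scheduling.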

Conclusion

This experience was quite fruitful; I never imagined that simple ultrasonic obstacle avoidance would run into so many problems. In fact, many aspects are left undiscussed in this article, such as experiments on how the GIL affects multithreading efficiency, and why there is such a large gap between multiprocessing and using a C function as the thread function. There is also the problem of optimizing TensorFlow on a low-performance machine like the Raspberry Pi, which I have wanted to write about for a long time. The article surely still has many shortcomings; if you find any omissions, please do not hesitate to point them out in the comments.

Let us encourage one another.
