How did I fix a very old GIL race condition bug in Python 3.7

How I fixed a very old GIL race condition in Python 3.7

Originally written by Victor Stinner

The Nuggets translation Project

Permanent link to this article: github.com/xitu/gold-m…

Translator: kezhenxu94

Proofreader: Starrier

A serious bug in the famous Python GIL (Global Interpreter Lock) library took me four years to fix, and the Python GIL is one of the most error-prone parts of Python. I had to dig into Git’s commit history to find a record of Guido van Rossum’s commit from 26 years ago: at that time threads were obscure. Just listen to me slowly.

Fatal Python errors caused by the C thread and GIL

In March 2014, Steve Dower reported a bug bpo-20891 that occurs when “C threads” use the Python C API:

In Python 3.4 RC3, calling the PyGILState_Ensure() method in a thread not created in Python, but not calling the PyEval_InitThreads() method, causes the program to experience a serious error and exit:

Fatal Python error: take_gil: NULL tstate

My first comment:

In my opinion this is a bug in PyEval_InitThreads().

PyGILState_Ensure() fixes

Within two years I had forgotten the bug. By March 2016, I had modified Steve’s test code to be Linux compatible (the test code was written on Windows at the time). I successfully recreated the bug on my computer and wrote a fix for PyGILState_Ensure().

A year later, in November 2017, Marcin Kasperski asked:

Has this fix been released yet? I don’t see anything in the change log…

Oops, I completely forgot the question again! This time, I not only submitted my fix to PyGILState_Ensure(), but also wrote the unit test test_embed. Test_bpo20891 () :

Well, this bug has been fixed in Python 2.7, 3.6, and the main branch (later 3.7). On 3.6 and Master, this patch comes with unit tests.

I’m submitting the fix on the main branch, submitting b4d1e1f7:

bpo-20891: Fix PyGILState_Ensure() (# 4650)

When PyGILState_Ensure() is called in a non-Python thread before
PyEval_InitThreads(), only call PyEval_InitThreads() after calling
PyThreadState_New() to fix a crash.

Add an unit test in test_embed.
Copy the code

Then I closed the issue bpo-20891…

Unit tests crashed randomly on macOS

Everything is fine… It wasn’t until a week later that I realized that my new unit tests were crashing on macOS from time to time. Finally, I managed to find the replay path. The following example is the third runtime crash:

macbook:master haypo$ while true; do ./Programs/_testembed bpo20891 ||break; date; done
Lun  4 déc 2017 12:46:34 CET
Lun  4 déc 2017 12:46:34 CET
Lun  4 déc 2017 12:46:34 CET
Fatal Python error: PyEval_SaveThread: NULL tstate

Current thread 0x00007fffa5dff3c0 (most recent call first):
Abort trap6:Copy the code

Test_embed. Test_bpo20891 () shows a race condition in macOS PyGILState_Ensure() : the GIL lock itself builds… No lock protection! Adding a lock to detect whether Python currently has a GIL lock is obviously pointless…

I made an incomplete suggestion to fix PyThread_start_new_thread() :

I found a possible fix: call PyEval_InitThreads() in PyThread_start_new_thread(). The GIL can then be created as soon as the second thread is created. GIL can no longer be created when two threads are running. But at least in the “Python or not” black and white case, this fix will fail if a thread is not created in Python, but the thread will call PyGILState_Ensure().

Why not create a GIL in the first place?

Antoine Pitrou asks a simple question:

Why not call PyEval_InitThreads() when the parser is initialized? Is there any downside?

Thanks to the git blame and git log commands, I found the origin of the “create GIL on demand” code, a change from 26 years ago!

commit 1984f1e1c6306d4e8073c28d2395638f80ea509b
Author: Guido van Rossum <[email protected]>
Date:   Tue Aug 4 12:41:02 1992 +0000

    * Makefile adapted to changes below.
    * split pythonmain.c in two: most stuff goes to pythonrun.c, in the library.
    * new optional built-in threadmodule.c, build upon Sjoerd's thread.{c,h}. * new module from Sjoerd: mmmodule.c (dynamically loaded). * new module from Sjoerd: sv (svgen.py, svmodule.c.proto). * new files thread.{c,h} (from Sjoerd). * new xxmodule.c (example only). * myselect.h: bzero -> memset * select.c: bzero -> memset; removed global variable (...) +void +init_save_thread() +{ +#ifdef USE_THREAD + if (interpreter_lock) + fatal("2nd call to init_save_thread"); + interpreter_lock = allocate_lock(); + acquire_lock(interpreter_lock, 1); +#endif +} +#endifCopy the code

I guess the intention of this dynamic GIL creation is to avoid “premature” GIL creation for applications that only use one thread (i.e., never create a new thread).

Fortunately, Guido van Rossum was there and was able to work with me to find the root cause:

Yes, the initial reason was that threads were arcane and not much code used threads, and at the time, due to bugs in the GIL code, we felt sure that frequent use of the GIL would lead to (minor) performance degradation and increased crash risk. Now that we know we don’t have to worry about either of these, we can use initialization as much as we want.

A second fix for Py_Initialize() is proposed

I suggest another fix for Py_Initialize() : Always create the GIL as soon as Python is started, not “on demand” to avoid the risk of race conditions:

+    /* Create the GIL */
+    PyEval_InitThreads();
Copy the code

Nick Coghlan asked if I could run a performance benchmark on my patch. I ran PyPerformance on my PR 4700 and the difference was up to 5% :

haypo@speed-python$ python3 -m perf compare_to \ 2017-12-18_12-29-master-bd6ec4d79e85.json.gz \ 2017-12-18_12-29-master-bd6ec4d79e85-patch-4700.json.gz \ --table --min-speed=5 +----------------------+--------------------------------------+-------------------------------------------------+ | Benchmark | 2017-12-18_12-29-master-bd6ec4d79e85 | 2017-12-18_12-29-master-bd6ec4d79e85-patch-4700 | + = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = + | Pathlib | | ms 44.3 ms 41.8: 1.06 x slower (+ 6%) | +----------------------+--------------------------------------+-------------------------------------------------+ | scimark_monte_carlo | 197 ms | 210 ms: 1.07 x slower (+ 7%) | +----------------------+--------------------------------------+-------------------------------------------------+ | spectral_norm | 243 ms | 269 ms: 1.11 x slower (+ 11%) | +----------------------+--------------------------------------+-------------------------------------------------+ | 7.30 us sqlite_synth | | 8.13 us: 1.11 x slower (+ 11%) | +----------------------+--------------------------------------+-------------------------------------------------+ | unpickle_pure_python | 707 us | 796 us: 1.13 x slower (+ 13%) | +----------------------+--------------------------------------+-------------------------------------------------+ Not significant (55): 2to3; chameleon; chaos; (...).Copy the code

Wow, 5 benchmarks down. Performance regression testing is popular in Python: We’re always trying to make Python faster!

Skip failed tests on Christmas Eve

I didn’t expect performance to degrade in all five benchmarks. This required deeper exploration, but I didn’t have the time to do it, and I felt too shy/ashamed to be responsible for doing performance regression testing.

I couldn’t make up my mind before the Christmas break, but test_embed. Test_bpo20891 () crashed randomly on macOS as usual. Having to spend two weeks with the most error-prone part of Python — GIL — really made me feel bad. So I decided to skip the unit test of test_bPO20891 () until after the holidays.

Python 3.7, no Easter eggs.

Run the new benchmark, and the second fix is merged into the main branch

At the end of January 2018, I ran the five benchmarks that had dropped in my PR again. I run these benchmarks manually on my laptop, with different tests using separate cpus:

vstinner@apu$ python3 -m perf compare_to ref.json patch.json --table
Not significant (5): unpickle_pure_python; sqlite_synth; spectral_norm; pathlib; scimark_monte_carlo
Copy the code

Well, according to the Python “performance” benchmark suite, it now turns out that my second fix didn’t really have much of an impact on performance.

I decided to push my fix to the main branch by submitting 2914bb32:

bpo-20891: Py_Initialize() now creates the GIL (# 4700)

The GIL is no longer created "on demand" to fix a race condition when
PyGILState_Ensure() is called in a non-Python thread.
Copy the code

I then restarted the test_embed. Test_bpo20891 () unit test on the main branch.

Sorry, there is no second fix for Python 2.7 and 3.6!

Antoine Pitrou thought about porting the patch to Python 3.6 without merging:

I don’t think it’s necessary. You can already call PyEval_InitThreads().

Guido van Rossum didn’t want to port the patch either. So I removed test_embed. Test_bpo20891 () from the main branch of 3.6.

I also didn’t apply my second patch in Python 2.7 for the same reason, and there are no unit tests for Python 2.7 because porting is too difficult.

But at least Python 2.7 and 3.6 applied my first patch, PyGILState_Ensure().

conclusion

Python still has some race conditions in some boundary cases. This bug was discovered when the C thread created the GIL using the Python API. I pushed the first patch, but another new race condition appeared on macOS.

I had to drill into the Python GIL’s very old commit history (1992). Fortunately Guido van Rossum was able to help find the root cause of the bug.

After a benchmark glitch, we agreed that in Python 3.7 the GIL was always created as soon as the parser was started, rather than “on demand”. This change had no noticeable impact on performance.

We also decided to leave Python 2.7 and 3.6 unchanged to prevent any risk of regression testing: continue to create GIL “on demand”.

A serious bug in the famous Python GIL (Global Interpreter Lock) library took me four years to fix, and the Python GIL is one of the most error-prone parts of Python. We are glad to have this bug out of the way: it has been fully fixed in the upcoming Python 3.7 release!

See the full story at bpo-20891. Thanks to all the developers who helped me fix this bug!

The Nuggets Translation Project is a community that translates quality Internet technical articles from English sharing articles on nuggets. The content covers Android, iOS, front-end, back-end, blockchain, products, design, artificial intelligence and other fields. If you want to see more high-quality translation, please continue to pay attention to the Translation plan of Digging Gold, the official Weibo, Zhihu column.