This article describes parallel optimization of the CPython interpreter to support a true multi-interpreter parallel execution solution.

Bytedance Terminal Technology — Xie Junyi

Background

In one of our business scenarios, we execute algorithm packages through CPython. Because of CPython's implementation, a single process cannot use multiple CPU cores to execute algorithm packages simultaneously. In response, we decided to optimize CPython, with the goal of making it complete enough to support parallelism and greatly improving the execution efficiency of Python algorithm packages within a single process.

In 2020, we completed the parallel-execution transformation of CPython, delivering the industry's first highly complete parallel implementation of CPython 3 that remains compatible with the Python C API.

  • Performance

    • Single-threaded performance degrades by only 7.7%.
    • Multithreaded execution is essentially free of lock contention; adding one more thread reduces execution time by 44%.
    • Parallel execution greatly reduces total execution time.
  • Passes the CPython unit tests.
  • Already fully deployed in production.

A lingering pain: the GIL

CPython is the official reference implementation of Python. In CPython, the GIL (Global Interpreter Lock) protects access to Python objects, preventing multiple threads from executing Python bytecode at the same time. The GIL prevents contention and ensures thread safety, but it also means CPython cannot truly execute Python bytecode in parallel. Although the GIL limits Python's parallelism, it has never been removed officially: the CPython code base was not written with parallel execution in mind, it is full of shared variables, and the complexity of changing it is very high.
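To illustrate the constraint, here is the classic symptom on stock CPython (not our fork): adding threads to CPU-bound Python code does not cut wall-clock time, because only the thread holding the GIL executes bytecode.

# Stock-CPython sketch: four CPU-bound threads take roughly as long
# as running the same work four times sequentially, because the GIL
# lets only one thread execute bytecode at any moment.
import threading
import time

def cpu_work():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

start = time.perf_counter()
threads = [threading.Thread(target=cpu_work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 CPU-bound threads took {time.perf_counter() - start:.2f}s")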

Challenges

In the 20 years since Python became open source, Python code has been unable to run in parallel because of the GIL (a global lock). There are two main approaches to Python parallelism, but neither has produced a highly complete solution (high performance, compatible with all open-source features, stable API), mainly because:

  1. Removing the GIL outright requires many fine-grained locks, which makes single-threaded execution roughly twice as slow.

Back in the days of Python 1.5, Greg Stein actually implemented a comprehensive patch set (the “free threading” patches) that removed the GIL and replaced it with fine-grained locking. Unfortunately, even on Windows (where locks are very efficient) this ran ordinary Python code about twice as slow as the interpreter using the GIL. On Linux the performance loss was even worse because pthread locks aren’t as efficient.

  2. Interpreter state isolation. The interpreter's internal implementation is full of global state; modifying it is tedious and a heavy workload.

It has been suggested that the GIL should be a per-interpreter-state lock rather than truly global; interpreters then wouldn’t be able to share objects. Unfortunately, this isn’t likely to happen either. It would be a tremendous amount of work, because many object implementations currently have global state. For example, small integers and short strings are cached; these caches would have to be moved to the interpreter state. Other object types have their own free list; these free lists would have to be moved to the interpreter state. And so on.

There is an open source project, multi-core-Python, pursuing this idea, but it is currently shelved. It can only run demos with very simple arithmetic operations; parallel execution of Type and of many modules is not addressed, so it cannot be used in real-world scenarios.

New architecture – Multi-interpreter architecture

To achieve the best execution performance, we built a highly complete parallel implementation on CPython 3.10, referring to multi-core-Python.

  • Transition from a global interpreter state to each interpreter structure holding its own running state (a separate GIL and various execution states).
  • Support parallelism with interpreter state isolation; parallel execution performance is unaffected by the number of interpreters (there is essentially no lock contention between interpreters).
  • Obtain the Python interpreter state from the thread's Thread Specific Data.

Under this new architecture, Python interpreters are isolated from each other, do not share a GIL, and can execute in parallel, taking full advantage of the multi-core performance of modern CPUs and significantly reducing the execution time of business algorithm code.
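As a minimal sketch of the model, the experimental _xxsubinterpreters module that upstream CPython already ships can drive one interpreter per OS thread; under our architecture each interpreter holds its own GIL, so the threads below truly run in parallel (the extra APIs we added are described later):

# Sketch: one interpreter per OS thread, using the experimental
# _xxsubinterpreters module. With per-interpreter GILs, the two
# threads execute their workloads in parallel.
import threading
import _xxsubinterpreters as interpreters

CODE = "sum(i * i for i in range(10_000_000))"

def run_in_own_interpreter():
    interp = interpreters.create()          # isolated interpreter state
    interpreters.run_string(interp, CODE)   # executes under interp's own GIL
    interpreters.destroy(interp)

threads = [threading.Thread(target=run_in_own_interpreter) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()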

Isolation of shared variables

Interpreter execution uses many shared variables, which generally exist as global variables. When multiple interpreters run, these shared variables are read and written concurrently, which is not thread-safe.

The main shared variables inside CPython 3.10 that had to be processed are listed below; a short Python demo of why some of them are shared follows the list. There are about 1,000 of them in total, so dealing with them is a lot of work.

  • free lists

    • MemoryError
    • asynchronous generator
    • context
    • dict
    • float
    • frame
    • list
    • slice
  • singletons

    • small integers ([-5; 256] range)
    • empty bytes string singleton
    • empty Unicode string singleton
    • empty tuple singleton
    • single byte characters (b'\x00' to b'\xff')
    • single Unicode characters (U+0000-U+00FF range)
  • cache

    • slice cache
    • method cache
    • bigint cache
    • …
  • interned strings
  • PyUnicode_FromId static strings
  • …
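A quick demo of why the small-integer and interned-string singletons in this list are shared state (behavior of stock CPython):

# Small integers in [-5, 256] are preallocated singletons shared by
# the whole (stock) interpreter ...
a = 256
b = int("256")     # computed at runtime, still the cached object
print(a is b)      # True
x = 257
y = int("257")
print(x is y)      # False: 257 is outside the small-int cache

# ... and interned strings are likewise shared single objects.
import sys
s1 = sys.intern("shared_identifier")
s2 = sys.intern("shared_" + "identifier")
print(s1 is s2)    # True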

How do you make these variables unique to each interpreter?

CPython is implemented in C. One option is to pass a pointer to the interpreter_state structure as a function parameter, so that it holds the member variables belonging to each interpreter; this is also the best option for performance. However, every function that uses interpreter_state would then need a changed signature, which is almost impossible from an engineering point of view.

Alternatively, we can store the interpreter_state in Thread Specific Data. During execution, the interpreter_state is retrieved through the thread-specific key. This lets the execution state be obtained through the Thread Specific Data API without changing any function signatures.

static inline PyInterpreterState* _PyInterpreterState_GET(void) {
    PyThreadState *tstate = _PyThreadState_GET();
#ifdef Py_DEBUG
    _Py_EnsureTstateNotNULL(tstate);
#endif
    return tstate->interp;
}

We store all the shared variables in interpreter_state:

    /* Small integers are preallocated in this array so that they
       can be shared.
       The integers that are preallocated are those in the range
       -_PY_NSMALLNEGINTS (inclusive) to _PY_NSMALLPOSINTS (not inclusive).
    */
    PyLongObject* small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];
    struct _Py_bytes_state bytes;
    struct _Py_unicode_state unicode;
    struct _Py_float_state float_state;
    /* Using a cache is very effective since typically only a single slice is
       created and then deleted again. */
    PySliceObject *slice_cache;

    struct _Py_tuple_state tuple;
    struct _Py_list_state list;
    struct _Py_dict_state dict_state;
    struct _Py_frame_state frame;
    struct _Py_async_gen_state async_gen;
    struct _Py_context_state context;
    struct _Py_exc_state exc_state;

    struct ast_state ast;
    struct type_cache type_cache;
#ifndef PY_NO_SHORT_FLOAT_REPR
    struct _PyDtoa_Bigint *dtoa_freelist[_PyDtoa_Kmax + 1];
#endif

They are then accessed quickly via _PyInterpreterState_GET. For example:

/* Get Bigint freelist from interpreter  */
static Bigint **
get_freelist(void) {
    PyInterpreterState *interp = _PyInterpreterState_GET();
    return interp->dtoa_freelist;
} 

Note that turning a global variable into Thread Specific Data has a performance impact, but it is acceptable as long as the number of calls to the TSD API is kept under control. Building on the changes already present in CPython 3.10, we resolved the remaining shared-variable issues.
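The discipline is the same one you would apply to Python's threading.local: the lookup is cheap but not free, so hot paths fetch the state once and reuse it. A small analogy in Python (an illustration, not our C code):

# Analogy with threading.local: hoist the thread-specific lookup out
# of the hot loop, just as the C code caches _PyInterpreterState_GET().
import threading

tls = threading.local()
tls.interp_state = {"counter": 0}   # stand-in for interpreter_state

def hot_loop_naive(n):
    for _ in range(n):
        tls.interp_state["counter"] += 1   # TSD lookup every iteration

def hot_loop_hoisted(n):
    state = tls.interp_state               # one TSD lookup up front
    for _ in range(n):
        state["counter"] += 1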

Handling shared Type variables: API compatibility and solutions

Currently, CPython 3.x exposes PyType_xxx type variables in its API, and third-party extension code references these global type variables as &PyType_xxx. Isolating types into each subinterpreter is therefore bound to cause compatibility problems. This is also why the official effort stalled: the problem cannot be solved in Python 3 in a reasonably compatible way, and the API can only change with Python 4.

We fixed the problem quickly in another way.

A shared Type object causes the following problems:

  1. The reference count of the Type object changes frequently, which is not thread-safe.
  2. The Type object's member variables can be modified, which is not thread-safe.

Our solutions:

  1. Make the type objects immortal.
  2. For unsafe paths with low usage frequency, protect them with locks.
  3. For high-frequency scenarios, set the member variables to immortal objects.

    1. For example, with Python's descriptor mechanism, the descriptors generated for property, function, classmethod, staticmethod, and doc are set to immortal objects in actual use.

This leaks the Type objects and their member variables. However, since CPython caches modules anyway, there is no problem as long as the module cache is never cleared.
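The refcount churn that motivates immortal types is visible from Python with heap types: every instance holds a strong reference to its class, so parallel allocation would race on the class's reference count. A small demonstration:

# Each instance of a heap type holds a strong reference to the type,
# so allocating instances constantly writes the type's refcount.
import sys

class Foo:
    pass

before = sys.getrefcount(Foo)
objs = [Foo() for _ in range(1000)]
after = sys.getrefcount(Foo)
print(after - before)   # ~1000 refcount writes on Foo alone
# Making the type immortal removes these writes (and the race), at the
# cost of never freeing the type, which the module cache makes acceptable.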

Handling the shared pymalloc memory pool

We replaced the pymalloc memory pool with mimalloc, which gains 1%-2% performance and avoids any extra work to isolate pymalloc.

Subinterpreter capability completion

The subinterpreter module originally only provides interp_run_string to execute a code string. For package-size and security reasons, we removed Python's ability to execute code strings dynamically, and added two capabilities to the subinterpreter module:

  1. interp_call_file: call and execute a Python .pyc file
  2. interp_call_function: execute an arbitrary function

Subinterpreter execution model

In Python, our code runs on the main interpreter by default; we can also create a subinterpreter to execute code:

interp = _xxsubinterpreters.create()
result = _xxsubinterpreters.interp_call_function(*args, **kwargs)

The important thing to note is that we create the subinterpreter from the main interpreter, execute code in the subinterpreter, and finally return the result to the main interpreter. This looks simple, but it does a lot of things:

  1. The main interpreter passes the arguments to the subinterpreter.
  2. The thread switches to the subinterpreter's interpreter_state, then fetches and converts the arguments.
  3. The subinterpreter interprets and executes the code.
  4. The return value is fetched and the thread switches back to the main interpreter.
  5. The return value is converted.
  6. Exceptions are handled.

There are two complications:

  1. Switching the interpreter state
  2. Transferring data between interpreters

Switching the interpreter state

interp = _xxsubinterpreters.create()
result = _xxsubinterpreters.interp_call_function(*args, **kwargs)

We can break it down as follows:

# Running in thread 1
# Main interpreter: the interpreter state held in Thread Specific Data
# is the main interpreter state.
do some things ...
create subinterpreter ...
interp_call_function ...
    # Thread Specific Data sets the interpreter state to the sub interpreter state
    # Sub interpreter:
    do some things ...
    call function ...
    get result ...
    # Thread Specific Data sets the interpreter state back to the main interpreter state
get return result ...

Transferring data between interpreters

Because the interpreters' execution states are isolated, a Python object created in the main interpreter cannot be used in a subinterpreter. We need to:

  1. Extract the key data from the main interpreter's PyObject
  2. Store it in a block of raw memory
  3. Recreate a PyObject from that data in the subinterpreter

An implementation of the interpreter state switch and of the data passing can be seen in the following example:

static PyObject *
_call_function_in_interpreter(PyObject *self, PyInterpreterState *interp,
                              _sharedns *args_shared, _sharedns *kwargs_shared)
{
    PyObject *result = NULL;
    PyObject *exctype = NULL;
    PyObject *excval = NULL;
    PyObject *tb = NULL;
    _sharedns *result_shread = _sharedns_new(1);

#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS
    // Switch to interpreter.
    PyThreadState *new_tstate = PyInterpreterState_ThreadHead(interp);
    PyThreadState *save1 = PyEval_SaveThread();
    (void)PyThreadState_Swap(new_tstate);
#else
    // Switch to interpreter.
    PyThreadState *save_tstate = NULL;
    if (interp != PyInterpreterState_Get()) {
        // XXX Using the head thread isn't strictly correct.
        PyThreadState *tstate = PyInterpreterState_ThreadHead(interp);
        // XXX Possible GILState issues?
        save_tstate = PyThreadState_Swap(tstate);
    }
#endif

    PyObject *module_name = _PyCrossInterpreterData_NewObject(&args_shared->items[0].data);
    PyObject *function_name = _PyCrossInterpreterData_NewObject(&args_shared->items[1].data);
    // ...
    PyObject *module = PyImport_ImportModule(PyUnicode_AsUTF8(module_name));
    PyObject *function = PyObject_GetAttr(module, function_name);
    result = PyObject_Call(function, args, kwargs);
    // ...

#ifdef EXPERIMENTAL_ISOLATED_SUBINTERPRETERS
    // Switch back.
    PyEval_RestoreThread(save1);
#else
    // Switch back.
    if (save_tstate != NULL) {
        PyThreadState_Swap(save_tstate);
    }
#endif

    if (result) {
        result = _PyCrossInterpreterData_NewObject(&result_shread->items[0].data);
        _sharedns_free(result_shread);
    }
    return result;
}

Implement a sub-interpreter pool

We now have an isolated internal execution environment, but this is a low-level API; we need to wrap it in higher-level APIs to make parallel subinterpreters easy to use.

interp = _xxsubinterpreters.create()
result = _xxsubinterpreters.interp_call_function(*args, **kwargs)

Python's concurrent library already provides thread pool, process pool, and future implementations; referring to these, we implemented a subinterpreter pool, exposing a high-level interface for asynchronous execution with callbacks through the concurrent.futures module.

executor = concurrent.futures.SubInterpreterPoolExecutor(max_workers)
future = executor.submit(_xxsubinterpreters.call_function, module_name, func_name, *args, **kwargs)
future.context = context
future.add_done_callback(executeDoneCallBack)

Internally, we implement this by inheriting from the Executor base class provided by concurrent.futures:

class SubInterpreterPoolExecutor(_base.Executor):

When the SubInterpreterPoolExecutor initializes, it creates its worker threads and one subinterpreter for each thread:

interp = _xxsubinterpreters.create()
t = threading.Thread(name=thread_name, target=_worker,
                     args=(interp, 
                           weakref.ref(self, weakref_cb),
                           self._work_queue,
                           self._initializer,
                           self._initargs))

The thread worker receives the arguments and executes the function with its interp:

result = self.fn(self.interp ,*self.args, **self.kwargs)
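A minimal sketch of what that per-thread worker can look like, modeled on concurrent.futures.thread._worker with the interpreter threaded through (names are illustrative, not our exact implementation):

# Sketch of the worker loop: each thread owns one subinterpreter and
# binds it to every work item it executes.
def _worker(interp, executor_reference, work_queue, initializer, initargs):
    if initializer is not None:
        initializer(*initargs)
    while True:
        work_item = work_queue.get(block=True)
        if work_item is None:       # shutdown sentinel
            work_queue.put(None)    # wake the next worker so it exits too
            return
        work_item.interp = interp   # bind the task to this thread's interpreter
        work_item.run()             # ends up calling fn(interp, *args, **kwargs)
        del work_item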

Implement the external scheduling module

The subinterpreter change is large and comes with two pitfalls:

  1. There may be code-compatibility issues: some third-party C/C++ extension implementations have global state variables that are not thread-safe.
  2. A very small number of Python modules cannot be used in a subinterpreter, for example process-related ones.

We want to unify the external interface so that users do not have to pay attention to these details; the dispatcher automatically chooses whether a call runs on the main interpreter (good compatibility, stable) or on a subinterpreter (parallel, good performance).

We provide C and Python implementations for business users to use in various scenarios. Here is a simplified version of the Python implementation code.

In bddispatch.py, the invocation is abstracted to provide a unified execution interface that handles exceptions and returns results.

def executeFunc(module_name, func_name, context=None, use_main_interp=True, *args, **kwargs):
    print('submit call', module_name, '.', func_name)
    if use_main_interp == True:
        result = None
        exception = None
        try:
            m = __import__(module_name)
            f = getattr(m, func_name)
            r = f(*args, **kwargs)
            result = r
        except:
            exception = traceback.format_exc()
        singletonExecutorCallback(result, exception, context)

    else:
        future = singletonExecutor.submit(_xxsubinterpreters.call_function, module_name, func_name, *args, **kwargs)
        future.context = context
        future.add_done_callback(executeDoneCallBack)


def executeDoneCallBack(future):
    r = future.result()
    e = future.exception()
    singletonExecutorCallback(r, e, future.context)
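Callers then go through this single entry point regardless of where the code runs. A hypothetical usage (module, function, and callback names are illustrative):

def on_done(result, exception, context):
    if exception is not None:
        print("failed:", exception)
    else:
        print("result:", result, "context:", context)

# Assume singletonExecutorCallback is wired to on_done.
# Compatible path: execute on the main interpreter.
executeFunc("algo_pkg", "run_model", context={"req": 1}, use_main_interp=True)

# Parallel path: execute on a subinterpreter from the pool.
executeFunc("algo_pkg", "run_model", context={"req": 2}, use_main_interp=False)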

Binding directly to a subinterpreter for execution

For high-performance scenarios, having the main interpreter dispatch every task to a subinterpreter as described above adds overhead. For these cases we provide C APIs that let users who embed CPython directly bind to an interpreter through the C API.

class GILGuard {
public:
    GILGuard() {
        inter_ = BDPythonVMDispatchGetInterperter();
        if (inter_ == PyInterpreterState_Main()) {
            printf("Ensure on main interpreter: %p\n", inter_);
        } else {
            printf("Ensure on sub interpreter: %p\n", inter_);
        }
        gil_ = PyGILState_EnsureWithInterpreterState(inter_);
    }

    ~GILGuard() {
        if (inter_ == PyInterpreterState_Main()) {
            printf("Release on main interpreter: %p\n", inter_);
        } else {
            printf("Release on sub interpreter: %p\n", inter_);
        }
        PyGILState_Release(gil_);
    }

private:
    PyInterpreterState *inter_;
    PyGILState_STATE gil_;
};

// Example: bind this thread to an interpreter for the whole call.
- (void)testNumpy {
    GILGuard gil_guard;
    BDPythonVMRun(....);
}

About the Bytedance Terminal Technology team

Bytedance Client Infrastructure is a global R&D team for big front-end infrastructure technology (with R&D teams in Beijing, Shanghai, Hangzhou, Shenzhen, Guangzhou, Singapore, and Mountain View), responsible for building Bytedance's entire big front-end infrastructure and for improving the performance, stability, and engineering efficiency of the company's whole product line. The supported products include, but are not limited to, Douyin, Toutiao, Xigua Video, Feishu, and Tomato Novel, with in-depth work on mobile, Web, Desktop, and other terminals.

The team is currently recruiting interns for Python interpreter optimization. The work focuses on optimizing the CPython interpreter, our self-developed CPython JIT, and common third-party CPython libraries. Welcome to get in touch. WeChat: Beyourselfyii. E-mail: [email protected]


🔥 Volcano Engine APMPlus Application Performance Monitoring is a performance monitoring product in MARS, the Volcano Engine application development suite. Through advanced data collection and monitoring technologies, it provides enterprises with full-link application performance monitoring services, helping them troubleshoot and resolve abnormal problems more efficiently. For small and medium-sized enterprises we have launched the "APMPlus Application Performance Monitoring Enterprise Support Action", offering a free application performance monitoring resource pack. Apply now for a chance to receive a 60-day free performance monitoring service with up to 60 million events.
