
Foreword

New versions of Python support the async/await syntax, and many articles have claimed that code written with it is fast, but that speed is limited to certain scenarios. This article attempts to explain briefly why async code is faster than sync code in some scenarios.


1. A simple example

First, let's look at the difference between the two calling styles through an example. To see the difference in running time clearly, each call is repeated 10000 times; the code is as follows:

import asyncio
import time


n_call = 10000


# Sync call duration
def demo(n: int) -> int:
    return n ** n

s_time = time.time()
for i in range(n_call):
    demo(i)
print(time.time() - s_time)

# Async call duration
async def sub_demo(n: int) -> int:
    return n ** n

async def async_main() -> None:
    for i in range(n_call):
        await sub_demo(i)

loop = asyncio.get_event_loop()
s_time = time.time()
loop.run_until_complete(async_main())
print(time.time() - s_time)

# output
# 5.310615682601929
# 5.614157438278198

The sync syntax is familiar, while the async syntax is quite different: functions must be defined with async def, and calling an async def function requires the await keyword. To run the code, we first get the thread's event loop and then run the async_main function through it, achieving the same effect. The output, however, shows that sync syntax is a bit faster than async syntax in this scenario (due to Python's GIL, only a single core is used here, so multi-core performance is not a factor).
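As an aside, on Python 3.7 and later the event-loop boilerplate can be replaced by asyncio.run, which creates a loop, runs the coroutine, and closes the loop. A minimal sketch reusing the async_main defined above:

import asyncio
import time

# asyncio.run (Python 3.7+) replaces the get_event_loop /
# run_until_complete pair: it creates a loop, runs the given
# coroutine to completion, and then closes the loop
s_time = time.time()
asyncio.run(async_main())  # async_main as defined in the example above
print(time.time() - s_time)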

The reason is that when both run on the same thread (a single CPU core), an async call also passes through some extra machinery in the event loop, which adds a small overhead and makes it slower than sync. In addition, this example is a pure CPU operation, while the advantage of async lies in network IO: it does not help in this scenario, but it does in high-concurrency scenarios, because async code runs as coroutines while sync code runs as threads.

NOTE: The current async syntax supports network IO, but asynchronous file-system IO is not yet mature, so asynchronous file reads and writes are handled by wrapping blocking calls in multithreading rather than by coroutines. See: github.com/python/asyn…
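A minimal sketch of that wrapping idea, using asyncio.to_thread (available since Python 3.9) to push a blocking file read onto a worker thread; the file name demo.txt is only a placeholder for this illustration:

import asyncio

def read_file(path: str) -> str:
    # An ordinary blocking read; asyncio has no native non-blocking file IO
    with open(path) as f:
        return f.read()

async def main() -> None:
    # asyncio.to_thread runs the blocking call in a thread pool,
    # so the event loop stays free while the read is in progress
    content = await asyncio.to_thread(read_file, "demo.txt")
    print(len(content))

asyncio.run(main())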

2. An IO example

To understand the advantages of async in an IO scenario, start from a typical one: a Web backend service usually needs to process many requests, all coming from different clients, as in the following scenario:

In this scenario, client requests arrive within a short period of time. The server must support concurrency or parallelism in some way so that it can handle a large number of requests quickly and avoid processing delays.

NOTE: In an operating system, concurrency means that during some period of time several programs are each somewhere between starting and finishing, all running on the same processor, with only one of them actually executing at any instant. Parallelism means that two or more processes execute simultaneously in a computer system.

For sync syntax, the Web backend can be implemented with processes, threads, or a combination of both, and its capacity for concurrency/parallelism is limited by the number of workers. For example, when 5 clients make requests at the same time and the server has only 4 workers, one request blocks and waits until one of the 4 running workers finishes, as the sketch below illustrates. To provide better service we provision enough workers, and since processes are well isolated and each occupies independent resources, the service is usually run as several processes plus a large number of threads.
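A sketch of the 5-requests / 4-workers situation, with time.sleep standing in for per-request work and ThreadPoolExecutor standing in for the worker pool:

import time
from concurrent.futures import ThreadPoolExecutor

def handle(request_id: int) -> str:
    time.sleep(1)  # pretend handling one request takes 1 second
    return f"request {request_id} done"

start = time.time()
# 5 simultaneous requests but only 4 workers: the 5th request waits
# until one of the 4 busy workers becomes free (~2s instead of ~1s)
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(handle, range(5)):
        print(f"{result} after {time.time() - start:.1f}s")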

NOTE: A process is the smallest unit of resource allocation, and too many processes consume a lot of system resources, so backend services usually do not run very many of them. A thread is the smallest unit of scheduling, which is why the scheduling discussion below is framed in terms of threads.

However, this method consumes a lot of system resources (compared with coroutines). Threads are executed by the CPU, which is limited and can only run a fixed number of workers at a time; the other threads must wait to be scheduled, so each thread works for only one time slice before the scheduler moves it into the blocked or ready state, giving way to other threads until it gets its next slice. To simulate many threads running at the same time and to keep threads from starving, each run is kept very short and switches between threads happen very frequently; the more processes and threads are enabled, the more frequent the scheduling becomes.
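A rough illustration of that cost: under Python's GIL, splitting the same CPU-bound work across many threads yields no parallel speedup, it only adds scheduling and switching overhead (exact numbers vary by machine):

import threading
import time

def spin(n: int) -> None:
    for _ in range(n):
        pass

N = 10_000_000

s_time = time.time()
spin(N)  # all work on one thread, no switching
print("1 thread:", time.time() - s_time)

s_time = time.time()
# The same total work split across 100 threads: the GIL serializes them,
# so there is no speedup, only extra switching between threads
threads = [threading.Thread(target=spin, args=(N // 100,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("100 threads:", time.time() - s_time)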

The scheduling itself is not the big cost, though; the big cost is the context switch that scheduling triggers and the cache contention that follows (see the process-scheduling material in any introductory OS text; this is only a brief explanation). When the CPU executes code, it first loads the needed data into the CPU cache. When a running thread's time slice ends, its current state has to be saved, and the CPU then loads the state of the thread being scheduled in and runs it. So every time the CPU switches threads, it spends part of its time saving and reloading data, the newly scheduled thread starts with a cold cache, and there is contention while the cache is refilled.

In contrast to thread scheduling with its context switches and preemption, the coroutines implemented by the async syntax are non-preemptive: their scheduling is controlled by an event loop, which is a very efficient task manager and scheduler. Because the scheduling logic is implemented as ordinary code, the CPU performs no context switch when execution moves between coroutines, so there is no context-switch overhead and no cache contention to worry about. Returning to the scenario above: when the service starts, it starts an event loop; when a request arrives, a task is created to handle what that client sent. The task obtains execution from the event loop, monopolizes the thread, and keeps running until it hits an event it must wait for. For example, the task tells the event loop that it is waiting for the database to return data, and hands execution back; the event loop then passes execution to whichever task most needs to run. When the task that yielded receives the database response, the event loop puts it at the front of the ready list (different event loop implementations may vary) and returns execution to it at the next switch, letting it continue until it reaches the next waiting event.
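A sketch of that hand-off, with asyncio.sleep standing in for the wait-for-database event:

import asyncio
import time

async def handle_request(name: str) -> None:
    print(f"{name} started")
    # Stand-in for waiting on the database: awaiting hands execution
    # back to the event loop so another task can run in the meantime
    await asyncio.sleep(1)
    print(f"{name} got its data")

async def main() -> None:
    s_time = time.time()
    # Both "requests" share one thread, yet finish in about 1 second
    # rather than 2, because each yields while it waits
    await asyncio.gather(handle_request("task-1"), handle_request("task-2"))
    print(time.time() - s_time)

asyncio.run(main())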

This way of switching coroutines is called cooperative multitasking. Because everything runs in a single process or thread, the context does not change when coroutines switch and the CPU does not have to reload its cache, which saves overhead. As the above shows, cooperative switching depends on a coroutine voluntarily giving up execution, whereas threads are preemptive: even without an IO event, a thread may be moved from running to ready and not run again until it is rescheduled, which adds a lot of scheduling overhead. A coroutine, by contrast, keeps running until it reaches a yielding event, so coroutines are scheduled far less often than threads. Note also that a coroutine's switch points are specified by the developer (such as the waiting-for-the-database event above) and scheduling is non-preemptive: while one coroutine is running, no other coroutine can run until it hands over execution. The developer must therefore ensure that no task stays on the CPU too long, otherwise the remaining tasks will starve, as the sketch below shows.
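A sketch of that starvation hazard: a coroutine that never awaits monopolizes the thread, and the other task's output stalls until it finishes:

import asyncio

async def hog() -> None:
    total = 0
    for i in range(50_000_000):  # pure CPU work with no await: never yields
        total += i

async def heartbeat() -> None:
    for _ in range(5):
        print("tick")
        await asyncio.sleep(0.1)

async def main() -> None:
    # The "tick" output pauses while hog() runs, because switching is
    # cooperative and hog() never hands execution back to the event loop
    await asyncio.gather(heartbeat(), hog())

asyncio.run(main())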

3. Summary

In the IO scenario, the cost of IO is much larger than the cost of the CPU executing the code logic; put another way, the code spends most of its time waiting for IO while the CPU sits idle, so the CPU can be multiplexed through coroutines or threads and squeezed for all it is worth. Assuming sync and async code execute the same logic, comparing their speed reduces to comparing coroutine overhead with multi-process/thread overhead, that is, the event loop's scheduling cost versus the scheduling cost of multiple processes/threads. The event loop's scheduling overhead stays basically constant, while the multi-process/thread cost is not only higher per switch than event-loop scheduling but also grows with the number of workers. Once concurrency reaches a certain level, the multi-process/multi-thread overhead exceeds the coroutine-switching overhead, and async syntax executes faster than sync syntax. In short, sync is faster in ordinary scenarios, but async is faster in high-concurrency scenarios where IO dominates the CPU work.
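A closing sketch of that crossover, with time.sleep and asyncio.sleep standing in for IO: 1000 concurrent waits finish in roughly one sleep-length under asyncio, while a 100-worker thread pool needs about ten (exact numbers will vary):

import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

N = 1000  # concurrent "requests", each waiting 0.1s on simulated IO

def sync_io(_: int) -> None:
    time.sleep(0.1)  # blocking IO stand-in

async def async_io() -> None:
    await asyncio.sleep(0.1)  # non-blocking IO stand-in

s_time = time.time()
with ThreadPoolExecutor(max_workers=100) as pool:
    list(pool.map(sync_io, range(N)))
print("threads:", time.time() - s_time)  # ~1s: 1000 waits / 100 workers

async def main() -> None:
    # One thread, 1000 coroutines, all waiting at the same time
    await asyncio.gather(*(async_io() for _ in range(N)))

s_time = time.time()
asyncio.run(main())
print("asyncio:", time.time() - s_time)  # ~0.1s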