Author: Shi Jicheng


Introduction

Tokio is the best-known asynchronous execution framework in Rust. It covers almost every interface needed for asynchronous execution, including but not limited to files, networking, and file-system management. Beneath these easy-to-use high-level interfaces are the "building blocks": components that users never touch directly, but that sit below the surface and get things done. One of them is the thread pool, the basic unit that executes asynchronous tasks, and this article looks at Tokio's thread pool and its scheduling, covering as many interesting points as possible. The code discussed here lives mainly under tokio/src/runtime.

The thread pool

The concept of a thread pool exists in many languages, and the usual reason for using one is to avoid the performance cost of repeatedly creating and destroying threads. In Tokio, the thread pool is the execution resource for asynchronous tasks. For example, the following code creates an asynchronous task and hands it to Tokio's thread pool:

tokio::spawn(async {
    // This is an async task
});

There are several obvious (but not the only) options for where to store these tasks:

  1. All pending tasks are placed in a single shared queue (a global queue), and every thread pulls tasks from it.
  2. Each thread has its own queue and only takes tasks from that queue. When its queue is empty, it goes to sleep.
  3. Each thread has its own queue and takes tasks from it first; if its own queue is empty, it steals tasks from other threads' queues.

The first option is the worst: no matter how the shared queue is optimized, whether with locks or with atomic operations, the performance loss caused by contention is unavoidable. It is worth pointing out that atomic operations are not contention-free; they merely push the contention down into the hardware, and heavy contention still leads to poor efficiency.

The second option is not much better: when one thread is overloaded with tasks, its tasks cannot be executed in time while other idle threads have nothing to do, creating an imbalance between threads. This imbalance also throws away the advantage of multi-threaded parallel execution.

The third option is the widely used "Work Stealing" approach. It avoids the problems of the first two, but its implementation details still deserve careful attention.

How to make Work Stealing efficient

Although the idea behind work stealing is straightforward, there is an unavoidable problem: two or more threads may fetch tasks from the same queue at the same time. To stay thread-safe, you need either locks or a lock-free data structure. Tokio used a crossbeam-based lock-free queue in earlier versions, but the Tokio authors felt that implementation was still too heavy (epoch-based GC was still not efficient enough; that is Tokio's view, and it is not verified in this article), so the current implementation was adopted.

The Tokio task queue implementation is still lock-free and uses a circular queue data structure, better known as a ring buffer. To know which slots are in use, the data structure keeps two indexes, head and tail: new items are pushed at tail and items are popped from head, as shown in the following figure:

However, this data structure still cannot handle parallel access without modification, so Tokio adjusts it: the head index is split into two parts, head and steal. The original head index is a u32; after the change, the upper 16 bits hold steal and the lower 16 bits hold head, as shown below:
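For concreteness, here is a minimal sketch (not Tokio's actual code; the names are illustrative) of packing the two 16-bit indexes into a single atomic u32:

use std::sync::atomic::{AtomicU32, Ordering};

// Upper 16 bits hold Steal, lower 16 bits hold the real Head.
fn pack(steal: u16, head: u16) -> u32 {
    ((steal as u32) << 16) | head as u32
}

fn unpack(packed: u32) -> (u16, u16) {
    ((packed >> 16) as u16, packed as u16) // (steal, head)
}

// The local queue then needs only one atomic word for both indexes.
struct LocalQueueHead {
    packed: AtomicU32,
}

impl LocalQueueHead {
    fn load_indexes(&self) -> (u16, u16) {
        unpack(self.packed.load(Ordering::Acquire))
    }
}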

When no one is stealing, Steal and Head are equal and point to the same slot. Let's walk through several cases to see how this data structure handles concurrent access.

The current thread retrieves a local task

  1. If no one is stealing when the operation starts, then the Steal and Head values read (with an atomic load, then unpacked) must be equal, so both Steal and Head are incremented by one and the result is stored with a compare-and-swap (CAS). If the CAS succeeds, the task has been taken. There are two atomic operations here: the first is the load, the second is the CAS. The second can fail because another thread stole tasks and modified the atomic variable in between; if the CAS fails, the thread simply retries from the load until it succeeds.
  2. If someone is stealing when the operation starts, then the Steal and Head values read must differ. In that case only Head is incremented by one, Steal is left unchanged, and the result is again stored with a CAS. If the CAS succeeds, the task has been taken; if it fails, repeat the steps above until it succeeds.
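Put together, a hedged sketch of this local pop might look like the following (illustrative only, not Tokio's code; the empty-queue check against tail is omitted):

use std::sync::atomic::{AtomicU32, Ordering};

fn pack(steal: u16, head: u16) -> u32 { ((steal as u32) << 16) | head as u32 }
fn unpack(v: u32) -> (u16, u16) { ((v >> 16) as u16, v as u16) } // (steal, head)

// Returns the ring-buffer slot index the current thread has claimed.
fn pop_local(packed: &AtomicU32) -> u16 {
    loop {
        let current = packed.load(Ordering::Acquire);
        let (steal, head) = unpack(current);
        let next = if steal == head {
            // Case 1: nobody is stealing, advance Steal and Head together.
            pack(head.wrapping_add(1), head.wrapping_add(1))
        } else {
            // Case 2: a steal is in progress, advance Head only.
            pack(steal, head.wrapping_add(1))
        };
        if packed
            .compare_exchange(current, next, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return head; // this slot now belongs to the current thread
        }
        // The CAS failed: another thread touched the word, so try again.
    }
}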

Stealing tasks from other threads

  1. If another thread is already stealing when the operation starts, then the Steal and Head values in the loaded result must differ. In that case the operation ends immediately and the steal attempt is abandoned.

  2. If no other thread is stealing when the operation starts, then the loaded Steal value must equal the Head value. The stealer then advances Head while leaving Steal in place and stores the result with a CAS; the meaning of this step is to reserve a region of the queue for the steal in advance. If the CAS succeeds, the stealing can proceed, with the queue in the state shown in the following figure. Once the tasks between Steal and Head have been moved into the stealer's own queue, Steal is set equal to Head again and the result is saved with another CAS, so the state after stealing looks just like the state before: Steal and Head are equal. Note that a steal involves three atomic operations: the first is the load, the second and third are CASes. If the second CAS succeeds, the tasks can be stolen; if it fails, another thread is operating on the queue (either another thread is stealing, or the local thread is taking its own task), and the stealer tries again. If the third CAS succeeds, no other thread interfered; if it fails, the local thread was taking a task at the same time.
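A rough sketch of the reservation step (again illustrative, not Tokio's code) could look like this; copying the tasks and the final CAS that moves Steal back up to Head are reduced to comments to keep it short:

use std::sync::atomic::{AtomicU32, Ordering};

fn pack(steal: u16, head: u16) -> u32 { ((steal as u32) << 16) | head as u32 }
fn unpack(v: u32) -> (u16, u16) { ((v >> 16) as u16, v as u16) } // (steal, head)

// Try to reserve `n` tasks from a victim queue. Returns the reserved range
// of slot indices, or None if the caller should give up or retry.
fn try_reserve_steal(packed: &AtomicU32, n: u16) -> Option<(u16, u16)> {
    let current = packed.load(Ordering::Acquire);
    let (steal, head) = unpack(current);
    if steal != head {
        // Someone else is already stealing from this queue: abandon.
        return None;
    }
    // Reserve [Head, Head + n) by advancing Head while Steal stays behind.
    let next = pack(steal, head.wrapping_add(n));
    if packed
        .compare_exchange(current, next, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        // The queue changed under us (a local pop or another steal): retry later.
        return None;
    }
    // ... copy the reserved tasks into the stealer's own queue ...
    // ... then CAS Steal forward so Steal == Head again to finish the steal ...
    Some((head, head.wrapping_add(n)))
}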

To summarize, the key points of task stealing are:

  1. A single atomic variable stores both the Steal and Head values.
  2. Compare-and-swap ensures that the queue was not modified during the operation; if it was, the operation is retried.
  3. Only one steal operation on a given queue can succeed at a time.

This scheme implements a concurrency-safe ring buffer with just one atomic variable, which is quite elegant.

Other optimizations for Work Stealing

In addition to the core data structure described above, several other techniques improve the efficiency of task stealing. Here are a few:

  1. Limit the number of threads that can steal at the same time. Because sleeping threads are woken all at once, it is possible for every thread to go stealing from the others simultaneously, making the contention far too fierce. To avoid this, Tokio by default allows only half of the threads to steal at any moment; the rest go back to sleep.
  2. Be fair about who gets stolen from. As the saying goes, you should not always fleece the same sheep: the victim should be chosen at random, so the initial steal target is picked with a random number generator. This keeps the load across queues more even.
  3. Steal operations should not happen too often, so each one should not take too few tasks. Tokio therefore steals "wholesale", taking half of the tasks in the victim's queue each time. A small sketch of points 2 and 3 follows this list.
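The sketch below only illustrates these two ideas (random starting victim, steal half); the helper name and the tiny xorshift RNG are not Tokio's:

// Pick a random starting victim and report how many tasks to steal from the
// first non-empty queue found ("wholesale": half of the victim's queue).
fn pick_victim_and_count(queue_lens: &[usize], rng_state: &mut u64) -> Option<(usize, usize)> {
    if queue_lens.is_empty() {
        return None;
    }
    // Tiny xorshift64 RNG so the sketch needs no external crates.
    *rng_state ^= *rng_state << 13;
    *rng_state ^= *rng_state >> 7;
    *rng_state ^= *rng_state << 17;
    let start = (*rng_state as usize) % queue_lens.len();

    for i in 0..queue_lens.len() {
        let victim = (start + i) % queue_lens.len();
        let len = queue_lens[victim];
        if len > 0 {
            return Some((victim, len - len / 2));
        }
    }
    None // every queue is empty, nothing worth stealing
}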

The problems Work Stealing doesn't solve

Work stealing still leaves some problems open, such as where to put tasks when a local queue is flooded with them all at once. A local queue of unlimited length is clearly not an option; it would completely break the ring buffer design described above. Tokio's solution is to maintain a shared global queue alongside the local queues: tasks that cannot be placed locally go into the global queue. Thanks to the local queues, the global queue is rarely touched, so contention on it is mild and efficiency is preserved, as sketched below.
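This sketch only illustrates the idea of spilling over into a shared queue; the types are stand-ins (a VecDeque for the fixed-size ring buffer, a Mutex-protected VecDeque for the global queue) rather than Tokio's actual structures:

use std::collections::VecDeque;
use std::sync::Mutex;

struct Worker<T> {
    local: VecDeque<T>, // stand-in for the fixed-size ring buffer
    local_capacity: usize,
}

fn schedule<T>(worker: &mut Worker<T>, global: &Mutex<VecDeque<T>>, task: T) {
    if worker.local.len() < worker.local_capacity {
        // Common case: stays on the local queue, no contention at all.
        worker.local.push_back(task);
    } else {
        // Local queue is full: spill the task into the shared global queue.
        global.lock().unwrap().push_back(task);
    }
}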

Even so, there are follow-up problems. For example, if every thread always prefers its local queue, tasks in the global queue might never get scheduled. To avoid this imbalance, Tokio stipulates that after taking 61 tasks from its local queue, each thread must go and check the global queue; this mechanism guarantees that tasks in the global queue are scheduled within a bounded time. As for the odd number 61, the code itself offers an explanation:

/// The number is fairly arbitrary. I believe this value was copied from golang.
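As a toy illustration of this fairness rule (not Tokio's code; plain Vecs stand in for the queues), a worker's scheduling loop could consult a tick counter like this:

// Every 61 ticks look at the global queue first, otherwise prefer the local queue.
const GLOBAL_QUEUE_INTERVAL: u32 = 61;

fn next_task(tick: u32, local: &mut Vec<u32>, global: &mut Vec<u32>) -> Option<u32> {
    if tick % GLOBAL_QUEUE_INTERVAL == 0 {
        // Periodically give the global queue a chance to make progress.
        global.pop().or_else(|| local.pop())
    } else {
        local.pop().or_else(|| global.pop())
    }
}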

Conclusion

Tokio implements an efficient task-stealing mechanism for its multi-threaded executor, which lets multiple threads schedule tasks efficiently and evenly. Tokio's task scheduling is the cornerstone of its other components; future articles will continue to analyze other parts of Tokio and will certainly come back to the role this cornerstone plays.