Improving storage system performance is a perennial challenge for storage engineers. There is no silver bullet: IO performance optimization lives in the details. Today we will look at the relationship between performance and IO models.

Let’s start with the IO model of the local disk. On one hand, the IO performance of traditional mechanical HDD media is orders of magnitude slower than CPU instructions and application logic. On the other hand, newer SATA SSDs and NVMe SSDs deliver dramatically better performance: their controller chips expose multiple queues to process concurrent I/O requests, and the devices themselves have far higher internal concurrency. How do we work around slow disk interaction while exploiting these new device features to improve data access performance and reduce system overhead? To address these problems, system engineers have introduced a variety of IO models.

01 IO model

In short, the IO models in the Linux operating system can be classified along two dimensions, synchronous versus asynchronous and blocking versus non-blocking, giving the four combinations described below.

Synchronous blocking IO

This is the most commonly used IO model in application programming. In this model, when an application makes a system call, the application blocks. For example, if an application issues a read system call, the program’s subsequent logic is blocked until the system call completes (the data transfer finishes or fails). Of course, the fact that this application is blocked does not mean other applications cannot continue to execute: while this application is blocked, the CPU is free to run other applications, but the application itself is stalled on the disk IO. From a processor perspective this model is efficient, and even with the slow response of traditional HDDs, this read/write mode involves little user-mode/kernel-mode context switching and meets the performance requirements of most applications.
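
As a minimal sketch of this model (not from the original article; the file name is a placeholder), a plain blocking read() looks like this:

```c
/* Minimal sketch of synchronous blocking IO: read() blocks the caller
 * until the requested bytes have been copied (or an error occurs). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("data.bin", O_RDONLY);   /* "data.bin" is a placeholder path */
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* The calling thread is blocked here until the kernel finishes the IO. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}
```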

Synchronous non-blocking IO

The major difference between the synchronous non-blocking model and the first one is that after an application issues an IO call in non-blocking mode, the call returns immediately with a return code (EAGAIN or EWOULDBLOCK) telling the application to wait and actively ask again later whether the IO is complete. Only when the application repeats the call after the IO has actually completed does it get the data back. In other words, the IO may already be done, but the application still has to issue the request again to retrieve the result, which adds extra latency, so the overall latency is poor; the repeated calls also cause many context switches between kernel mode and user mode. Applications with strict latency requirements therefore generally do not adopt this model.
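
A minimal sketch of this ask-again pattern, assuming a descriptor that honours O_NONBLOCK (a pipe here, since regular local files generally ignore the flag):

```c
/* Minimal sketch of synchronous non-blocking IO: the descriptor is set to
 * O_NONBLOCK, read() returns EAGAIN/EWOULDBLOCK when nothing is ready, and
 * the application has to come back and ask again. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    char buf[128];

    if (pipe(pipefd) < 0)
        return 1;
    /* Switch the read end to non-blocking mode. */
    fcntl(pipefd[0], F_SETFL, O_NONBLOCK);

    /* First attempt: nothing is there yet, so read() fails with EAGAIN. */
    ssize_t n = read(pipefd[0], buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("not ready yet, do other work and ask again later\n");

    /* Later, the data has arrived; asking again finally returns it. */
    write(pipefd[1], "hello", 5);
    n = read(pipefd[0], buf, sizeof(buf));
    printf("second attempt read %zd bytes\n", n);
    return 0;
}
```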

Asynchronous blocking IO

The third IO model, also known as the event-driven model or IO multiplexing, is also very common. The mechanism can be understood simply as follows: the application issues a system call that registers, through the operating system’s epoll mechanism, its interest in state changes (types of events) on one or more IO descriptors (fds). epoll guarantees to notify the application once the specified change has occurred on an fd, i.e. once the data is ready, and the application then initiates the actual IO operation. While the real IO against the disk is in flight, the epoll mechanism itself listens for events; the application does not need to care about epoll’s internal execution and is free to do other work.
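
A minimal sketch of the epoll flow, using stdin as the watched descriptor purely for illustration:

```c
/* Minimal sketch of IO multiplexing with epoll: register interest in a
 * descriptor, wait for the kernel to report readiness, then do the IO. */
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev, ready;

    /* Declare interest in "readable" events on stdin (any fd works the same way). */
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

    /* Block here (or pass a timeout) until the kernel reports the fd is ready. */
    int n = epoll_wait(epfd, &ready, 1, -1);
    if (n > 0 && (ready.events & EPOLLIN)) {
        char buf[256];
        /* The data is ready, so this read() completes without further waiting. */
        ssize_t r = read(ready.data.fd, buf, sizeof(buf));
        printf("read %zd bytes after the epoll notification\n", r);
    }

    close(epfd);
    return 0;
}
```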

Asynchronous non-blocking IO

This brings us to today’s focus, asynchronous non-blocking IO, also known as AIO. The characteristic of this model is that after the application sends an IO request, the call returns immediately, telling the application only that the request has been successfully initiated and accepted. While the system performs the IO in the background, the application can execute other business logic. When the IO response arrives, a signal is generated or a callback function is invoked by the system to complete the IO operation. As can be seen from this description, the model brings several benefits. First, the application is not blocked by an IO request, and its subsequent logic can continue without polling or issuing further system calls. Second, this mode involves few context switches: multiple IOs can be submitted in one context, so the system overhead is also very small.

AIO has been a standard feature of the Linux kernel since 2.6, and it was introduced precisely to support the asynchronous non-blocking model. Currently, AIO is implemented in two ways: libaio and io_uring. Kernel-level AIO support has been in place since the 2.6 kernels, and together with the user-space libaio library for asynchronous non-blocking access it is now mature and stable. io_uring, introduced in the 5.x kernels, is intended as a unified framework supporting asynchronous, non-blocking access to both disk and network. Although io_uring is broader in scope, it is less mature and stable and is still iterating rapidly. So when the industry says AIO, it usually means libaio by default.
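
A minimal sketch of the libaio flow (io_setup → io_submit → io_getevents → io_destroy); the file name is a placeholder, the program is built with `-laio`, and O_DIRECT plus buffer alignment are explained in the DIO section below:

```c
/* Minimal sketch of asynchronous non-blocking IO with libaio. */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;

    int fd = open("data.bin", O_RDONLY | O_DIRECT);  /* placeholder path */
    posix_memalign(&buf, 4096, 4096);                /* O_DIRECT needs aligned memory */

    memset(&ctx, 0, sizeof(ctx));
    io_setup(8, &ctx);                               /* create an AIO context (queue depth 8) */

    io_prep_pread(&cb, fd, buf, 4096, 0);            /* describe one 4 KiB read at offset 0 */
    io_submit(ctx, 1, cbs);                          /* returns immediately; IO runs in the background */

    /* ... the application is free to do other work here ... */

    io_getevents(ctx, 1, 1, &ev, NULL);              /* reap the completion when the data is needed */
    printf("read completed, res = %ld\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}
```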

The emergence of libaio is a real help and liberation for SSDs and other new media. Without libaio, getting the most out of the hardware requires introducing multithreading, or multi-process/multi-machine concurrency, at the application level. This approach has two disadvantages. First, context switching is required between threads, and threads cannot be added indefinitely just for the sake of concurrency; that is very expensive for the system and the CPU. Second, some applications are simply not multithreaded and do not run concurrently across machines, so they cannot drive the underlying device harder by adding threads. With libaio, a single thread can make full use of the internal queues of new hardware such as SSDs to achieve higher concurrency: the SSD controller maintains multiple task queues, and by using libaio an application can safely issue a large number of IO requests to the hardware from a single thread and let the hardware itself handle the concurrency. This improves the performance of single-threaded applications while avoiding the overhead of switching between many threads. AIO is thus an important way to increase the processing capacity of today’s high-performance systems, storage or otherwise.
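
A hedged sketch of that single-thread, deep-queue pattern: 32 requests are batched into one io_submit() call and the device works on them concurrently while the thread does other things (error handling omitted, file name is a placeholder, build with `-laio`):

```c
/* Hedged sketch: keep a deep queue on the device from a single thread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD 32        /* queue depth kept in flight from a single thread */
#define BS 4096      /* block size, matching the O_DIRECT alignment */

int main(void)
{
    io_context_t ctx;
    struct iocb cbs[QD], *ptrs[QD];
    struct io_event evs[QD];
    int fd = open("data.bin", O_RDONLY | O_DIRECT);  /* placeholder path */

    memset(&ctx, 0, sizeof(ctx));
    io_setup(QD, &ctx);

    for (int i = 0; i < QD; i++) {
        void *buf;
        posix_memalign(&buf, BS, BS);
        io_prep_pread(&cbs[i], fd, buf, BS, (long long)i * BS);
        ptrs[i] = &cbs[i];
    }

    io_submit(ctx, QD, ptrs);              /* 32 requests in flight: one syscall, one thread */
    io_getevents(ctx, QD, QD, evs, NULL);  /* reap all completions when convenient */

    io_destroy(ctx);
    close(fd);
    return 0;
}
```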

02 AIO (libaio) restrictions

Files can be opened in two modes: DIO (direct IO) and buffered IO. DIO bypasses the pagecache and interacts with the disk directly. Buffered IO goes through the in-memory pagecache, which improves performance in some scenarios but can degrade it in specific IO patterns; for example, sequential large IO may perform worse than with DIO, because buffered IO has to be written into memory first and flushed later, while the sequential IO performance of HDDs and other disks can already be quite high. In scenarios with strict data-reliability requirements, data written only to the pagecache can be lost. For this reason, databases such as MySQL usually use DIO when writing data and rely on their own caching mechanisms to improve performance.
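
A small sketch of the two open modes (file names are placeholders); note the alignment requirements that come with O_DIRECT:

```c
/* Hedged sketch of buffered IO vs DIO at open() time.  With O_DIRECT the
 * buffer address, IO size and file offset must all be suitably aligned
 * (typically to the 512-byte or 4 KiB logical block size). */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Buffered IO: data lands in the pagecache first and is flushed to disk later. */
    int buffered_fd = open("buffered.bin", O_WRONLY | O_CREAT, 0644);

    /* Direct IO: bypasses the pagecache; requests go straight to the device. */
    int direct_fd = open("direct.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* unaligned buffers make O_DIRECT writes fail with EINVAL */
    memset(buf, 0, 4096);

    write(buffered_fd, buf, 4096);      /* may return before the data reaches the disk */
    write(direct_fd, buf, 4096);        /* size, offset and buffer must all be aligned */

    close(buffered_fd);
    close(direct_fd);
    free(buf);
    return 0;
}
```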

The background on DIO and buffered IO matters because one limitation of libaio is that it only supports DIO. The reason is that with buffered IO, libaio can block on bounce-buffer allocation and suffer the write penalty triggered by unaligned IO, both of which have a significant impact on efficiency. That is the opposite of what libaio is meant to achieve, so libaio is used with DIO by default.

The newer io_uring does support buffered IO (more on io_uring later).
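
For contrast, a hedged sketch of the same asynchronous read done through io_uring with the liburing helpers (link with `-luring`; io_uring_prep_read needs a reasonably recent 5.x kernel). The file is opened without O_DIRECT because io_uring also handles buffered IO:

```c
/* Hedged sketch of an asynchronous buffered read via io_uring/liburing. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open("data.bin", O_RDONLY);   /* buffered open: no O_DIRECT required */
    io_uring_queue_init(8, &ring, 0);      /* submission/completion queues of depth 8 */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* describe the read */
    io_uring_submit(&ring);                /* returns immediately */

    /* ... other work ... */

    io_uring_wait_cqe(&ring, &cqe);        /* wait for (or poll) the completion */
    printf("read completed, res = %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```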

03 Distributed file system support for AIO and its significance

For network or external storage, the client’s primary job is IO forwarding, so the client does not access the disk directly (the IO access models above, especially AIO, were originally designed to solve the local access problem). For this reason, open-source distributed file systems such as GlusterFS generally do not support AIO on the client, especially in the network file system case. However, an application such as MySQL does not know whether its data lives on a local file system or a network file system, so it uses libaio by default regardless. If the client does not genuinely support AIO and merely forwards the requests, performance is limited. In this scenario the client needs to implement AIO semantics itself, on top of the back end, to reach its full potential.

04 YRCloudFile Client support for AIO

The new version of the YRCloudFile client supports the AIO read/write mode. User space drives AIO through io_setup, io_submit, io_getevents, io_cancel and io_destroy; the corresponding interfaces inside the kernel are aio_read/aio_write and aio_complete. On the client side, we must determine whether a request is an AIO request, and then handle it asynchronously when aio_read/aio_write is executed; aio_read/aio_write is the focus of the implementation.

For an AIO read, first check whether the data buffer and offset are aligned. For requests that are not aligned to PAGE_SIZE, calculate the physical pages they touch, then pin the user pages one by one so that they cannot be swapped out, wrap the request and submit it asynchronously. After the pages are mapped into the kernel’s linear address space, data read from the storage back end is used to fill them. Once the data has been filled in, aio_complete is called and the reference counts on the pages are released.

During this process, the effect of the pagecache must be considered: dirty pagecache in the overlapping range has to be flushed back and waited for. For details, see filemap_write_and_wait_range.
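
As a rough, hedged sketch of this flow (the yr_* helpers are hypothetical, read_iter is the modern kernel equivalent of the older aio_read entry point, and signatures such as ki_complete differ between kernel versions):

```c
/* Simplified sketch only: yr_* helpers are hypothetical, and the exact
 * kernel interfaces (read_iter vs. aio_read, ki_complete arguments)
 * differ between kernel versions. */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/uio.h>

static ssize_t yr_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;
	loff_t pos = iocb->ki_pos;
	size_t len = iov_iter_count(to);
	int ret;

	/* Flush and wait for dirty pagecache overlapping [pos, pos + len). */
	ret = filemap_write_and_wait_range(mapping, pos, pos + len - 1);
	if (ret)
		return ret;

	/* Ordinary read(): keep the synchronous, blocking path. */
	if (is_sync_kiocb(iocb))
		return yr_sync_read(iocb, to);

	/*
	 * AIO path: pin the user pages behind the iov_iter so they cannot be
	 * swapped out, wrap the request, hand it to the network layer, and
	 * return -EIOCBQUEUED so the VFS does not wait for completion.
	 */
	ret = yr_submit_async_read(iocb, to, pos, len);
	return ret ? ret : -EIOCBQUEUED;
}

/* Completion context: the back end has filled the pinned pages. */
static void yr_async_read_done(struct kiocb *iocb, long res)
{
	/* put_page() each pinned page here, then signal completion. */
	iocb->ki_complete(iocb, res);	/* older kernels take a third argument */
}
```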

In addition, three alignment scenarios need to be considered (see the sketch after this list):

  • Scenario 1: data_len <= PAGE_SIZE, and the data falls within a single page.
  • Scenario 2: data_len <= PAGE_SIZE, but the data spans two pages.
  • Scenario 3: data_len > PAGE_SIZE, with the data starting at an offset inside the first page.
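
The three scenarios all reduce to working out how many physical pages the (offset, data_len) pair touches; a small sketch with the usual 4 KiB page size (the helper name is hypothetical):

```c
/* How many pages does a request starting at `offset` with length `data_len` span? */
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static unsigned long pages_spanned(unsigned long offset, unsigned long data_len)
{
    unsigned long first = offset >> PAGE_SHIFT;                 /* page of the first byte */
    unsigned long last  = (offset + data_len - 1) >> PAGE_SHIFT; /* page of the last byte  */
    return last - first + 1;
}

int main(void)
{
    printf("%lu\n", pages_spanned(0, 4096));     /* scenario 1: one page    */
    printf("%lu\n", pages_spanned(2048, 4096));  /* scenario 2: two pages   */
    printf("%lu\n", pages_spanned(1024, 8192));  /* scenario 3: three pages */
    return 0;
}
```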

For writes, the logic is similar to reads: the request is essentially encapsulated and sent asynchronously, and after the concurrent processing completes, aio_complete is called back. The pagecache also needs to be taken into account along the way.

Performance data

After libaio support was implemented, single-client performance in fio + libaio tests increases almost linearly with iodepth until the client reaches its upper limit.

05 Summary

In a distributed file system, customers care not only about the performance of the entire cluster, but also about the performance of a single client, and even about single-threaded access from a single application. For many services the concurrency is not high, and single-thread latency directly affects overall system performance. Meanwhile, business logic such as Nginx, MySQL and Seastar all use the AIO model; if the client does not support AIO, the performance of back-end data access is limited.

With AIO support on the client in the new version, YRCloudFile addresses this shortcoming and is better suited to these application scenarios.