What is asynchronous IO?

Asynchronous IO: When an application initiates an IO operation, the caller does not get the result immediately. Instead, after the kernel completes the IO operation, the caller is notified via signals or callbacks.

The difference between asynchronous IO and synchronous IO is shown in Figure 1:

As you can see from the figure above, synchronous IO must wait for the kernel to finish processing the IO operation before returning. Asynchronous IO does not wait for the IO operation to complete: it hands the operation to the kernel and returns immediately. When the kernel completes the IO operation, it notifies the application through a signal or a callback.

Linux native AIO principles

Linux native AIO is the asynchronous IO that the Linux kernel itself supports. Why add the word "native"? Because Linux has many third-party asynchronous IO libraries, such as libeio and glibc AIO. To distinguish it from these, the asynchronous IO provided by the Linux kernel is called native asynchronous IO.

Many third-party asynchronous IO libraries are not truly asynchronous: they use multiple threads to simulate asynchronous IO. Libeio, for example, works this way.

This article covers the principle and use of Linux native AIO, so it does not analyze the third-party asynchronous IO libraries. Let’s start with the principle of Linux native AIO.

As shown in Figure 2:

Linux native AIO processing flow:

  • When the application calls the io_submit system call to initiate an asynchronous I/O operation, the kernel adds an I/O task to its I/O task queue and returns success.
  • The kernel processes the I/O tasks in the queue in the background and stores each processing result in the corresponding I/O task.
  • The application calls the io_getevents system call to obtain the results of asynchronous I/O processing. If no I/O operation has completed yet, the call returns no results; otherwise, it returns the I/O processing results.

As the flow above shows, an asynchronous IO operation in Linux consists of two main steps:

    1. Call the io_submit function to initiate an asynchronous IO operation.
    2. Call the io_getevents function to get the result of the asynchronous IO operation.

Here we will focus on how the Linux kernel implements asynchronous IO.

Linux native AIO use

Before introducing the implementation of Linux native AIO, let’s use a simple example to illustrate its use:

#define _GNU_SOURCE

#include <stdlib.h>
#include <string.h>
#include <libaio.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define FILEPATH "./aio.txt"

int main(void)
{
    io_context_t context;
    struct iocb io[1], *p[1] = {&io[0]};
    struct io_event e[1];
    unsigned nr_events = 10;
    struct timespec timeout;
    char *wbuf;
    int wbuflen = 1024;
    int ret;

    // O_DIRECT needs an aligned buffer
    if (posix_memalign((void **)&wbuf, 512, wbuflen) != 0) {
        printf("posix_memalign error\n");
        return 1;
    }

    memset(wbuf, '@', wbuflen);
    memset(&context, 0, sizeof(io_context_t));

    timeout.tv_sec = 0;
    timeout.tv_nsec = 10000000; // 10 ms

    int fd = open(FILEPATH, O_CREAT|O_RDWR|O_DIRECT, 0644); // 1. Open the file for asynchronous I/O
    if (fd < 0) {
        printf("open error: %d\n", errno);
        return 1;
    }

    if ((ret = io_setup(nr_events, &context)) != 0) {       // 2. Create an asynchronous I/O context
        printf("io_setup error: %d\n", ret);                // libaio returns a negative errno value
        return 1;
    }

    io_prep_pwrite(&io[0], fd, wbuf, wbuflen, 0);           // 3. Create an asynchronous I/O task

    if ((ret = io_submit(context, 1, p)) != 1) {            // 4. Submit the asynchronous I/O task
        printf("io_submit error: %d\n", ret);
        io_destroy(context);
        return 1;
    }

    while (1) {
        ret = io_getevents(context, 1, 1, e, &timeout);     // 5. Obtain the result of the asynchronous I/O
        if (ret < 0) {
            printf("io_getevents error: %d\n", ret);
            break;
        }

        if (ret > 0) {
            printf("result, res2: %ld, res: %ld\n", (long)e[0].res2, (long)e[0].res);
            break;
        }
    }

    // Release the context, the file and the buffer
    io_destroy(context);
    close(fd);
    free(wbuf);
    return 0;
}

The example above shows the process of using Linux native AIO, which includes the following steps:

  • Call the open system call to open the file for asynchronous IO. Note that the file must be opened with the O_DIRECT (direct I/O) flag for AIO.
  • Call the io_setup system call to create an asynchronous IO context.
  • Call the io_prep_pwrite or io_prep_pread function to create an asynchronous write or asynchronous read task.
  • Call the io_submit system call to submit the asynchronous IO task to the kernel.
  • Call the io_getevents system call to get the result of the asynchronous IO operation.

In the example above, we poll for the result of the asynchronous IO operation in an infinite loop. Linux also supports a notification mechanism based on eventfd: combining eventfd with epoll lets you obtain the results of asynchronous IO operations in an event-driven way. If you are interested, check out the relevant material.
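
The event-driven approach can be sketched roughly as follows. This is a minimal illustration and not part of the original example: it uses the raw system calls and `<linux/aio_abi.h>` instead of libaio, the helper name `aio_write_notify` is hypothetical, and the `IOCB_FLAG_RESFD` / `aio_resfd` fields it relies on only appeared in kernels later than the 2.6.0 source analyzed below (around 2.6.22). The sketch also leaves out O_DIRECT so that it runs on file systems without direct I/O support; in that case the write may complete inside io_submit, but the notification path is the same.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path` with native AIO and wait
 * for completion through eventfd + epoll instead of polling io_getevents.
 * Returns the byte count from the completion event or a negative errno. */
static long aio_write_notify(const char *path, const void *buf, size_t len)
{
    aio_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    struct epoll_event ee;
    uint64_t completed = 0;
    long ret;
    int n, fd = -1, efd = -1, epfd = -1;

    assert(path != NULL && buf != NULL && len > 0);

    if ((fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644)) < 0 ||
        (efd = eventfd(0, 0)) < 0 ||
        (epfd = epoll_create1(0)) < 0) {
        ret = -errno;
        goto out;
    }
    if (syscall(SYS_io_setup, 1L, &ctx) < 0) {
        ret = -errno;
        goto out;
    }

    /* Watch the eventfd for readability. */
    memset(&ee, 0, sizeof(ee));
    ee.events = EPOLLIN;
    ee.data.fd = efd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ee) < 0) {
        ret = -errno;
        goto out_ctx;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_fildes = fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_flags = IOCB_FLAG_RESFD;   /* ask the kernel to signal aio_resfd */
    cb.aio_resfd = efd;

    if (syscall(SYS_io_submit, ctx, 1L, cbs) != 1) {
        ret = -errno;
        goto out_ctx;
    }

    /* Sleep until the eventfd becomes readable (5 s safety timeout). */
    n = epoll_wait(epfd, &ee, 1, 5000);
    if (n != 1) {
        ret = (n < 0) ? -errno : -ETIMEDOUT;
        goto out_ctx;
    }
    /* The eventfd counter holds the number of completed requests. */
    if (read(efd, &completed, sizeof(completed)) != (ssize_t)sizeof(completed)) {
        ret = -EIO;
        goto out_ctx;
    }
    /* The event is already in the ring, so this call does not block. */
    if (syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) != 1) {
        ret = -EIO;
        goto out_ctx;
    }
    ret = (long)ev.res;

out_ctx:
    syscall(SYS_io_destroy, ctx);
out:
    if (epfd >= 0) close(epfd);
    if (efd >= 0) close(efd);
    if (fd >= 0) close(fd);
    return ret;
}
```

In a real event loop the eventfd would be registered once alongside sockets and other descriptors, and io_getevents would be called with the count read from the eventfd to drain all completed requests.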

Linux native AIO implementation

The sections above covered the principle and use of Linux native AIO. The following sections describe how Linux native AIO is implemented.

This article is based on the Linux 2.6.0 kernel source code.

In general, there are three steps to using Linux native AIO:

    1. Call the io_setup function to create an asynchronous IO context.
    2. Call the io_submit function to submit an asynchronous IO operation to the kernel.
    3. Call the io_getevents function to obtain the result of the asynchronous IO operation.

So, we can understand the implementation of Linux native AIO by analyzing the implementation of these three functions.

Linux native AIO implementation is in the source file /fs/aio.c.

Create an asynchronous I/O context

To use Linux native AIO, you first need to create an asynchronous IO context. In the kernel, the asynchronous IO context is represented by the kioctx structure, defined as follows:

struct kioctx {
    atomic_t                users;    // Reference counters
    int                     dead;     // Whether it has been closed
    struct mm_struct        *mm;      // The corresponding memory management object

    unsigned long           user_id;  // A unique ID that identifies the current context and is returned to the user
    struct kioctx           *next;

    wait_queue_head_t       wait;     // Wait queue
    spinlock_t              ctx_lock; // Lock

    int                     reqs_active; // Number of asynchronous I/O requests in progress
    struct list_head        active_reqs; // Async IO request object in progress
    struct list_head        run_list;

    unsigned                max_reqs;  // Maximum number of I/O requests

    struct aio_ring_info    ring_info; // Ring buffer

    struct work_struct      wq;
};

In the kioctx structure, the most important members are active_reqs and ring_info. The active_reqs member holds all asynchronous IO operations that are in progress, while the ring_info member holds the results of completed asynchronous IO operations.

The kioctx structure is shown in Figure 3:

As Figure 3 shows, the active_reqs member holds the queue of asynchronous IO operations as kiocb objects, while the ring_info member is a ring buffer of type aio_ring_info.

So let’s look at the kiocb structure and aio_ring_info structure definition:

struct kiocb {
    ...
    struct file         *ki_filp;      // File object for the asynchronous IO operation
    struct kioctx       *ki_ctx;       // Points to the owning asynchronous IO context
    ...
    struct list_head    ki_list;       // Links all asynchronous IO operation objects in progress
    __u64               ki_user_data;  // User-supplied data pointer (can be used to distinguish asynchronous IO operations)
    loff_t              ki_pos;        // File offset for the asynchronous IO operation
    ...
};

The kiocb structure stores information about an asynchronous IO operation, such as:

  • ki_filp: Holds the file object for the asynchronous IO operation.
  • ki_ctx: Points to the owning asynchronous IO context object.
  • ki_list: Links all IO operation objects in the current asynchronous IO context.
  • ki_user_data: Used by the caller to distinguish asynchronous IO operations or to attach a callback function.
  • ki_pos: Holds the file offset for the asynchronous IO operation.

The aio_ring_info structure is an implementation of a circular buffer, defined as follows:

struct aio_ring_info {
    unsigned long       mmap_base;     // The virtual memory address of the ring buffer
    unsigned long       mmap_size;     // The size of the ring buffer

    struct page         **ring_pages;  // The memory pages used by the ring buffer
    spinlock_t          ring_lock;     // Protect the ring buffer's spin lock
    long                nr_pages;      // The number of pages of memory occupied by the ring buffer

    unsigned            nr, tail;

    // If the ring buffer is not larger than 8 pages
    // ring_pages points to internal_pages
#define AIO_RING_PAGES  8
    struct page         *internal_pages[AIO_RING_PAGES]; 
};

This circular buffer stores the results of completed asynchronous IO operations. Each result is represented by an io_event structure, as shown in Figure 4:

In Figure 4, head is the start position of the ring buffer and tail is the end position. If head differs from tail, results of completed asynchronous IO operations are available. If head equals tail, the ring buffer is empty: no asynchronous IO operation has completed since the last result was read.
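
The head and tail arithmetic can be sketched with a few helper functions. These are toy helpers with hypothetical names (`ring_is_empty`, `ring_pending`, `ring_next`), written under the assumption that both indexes always stay within the range 0 to nr-1:

```c
#include <assert.h>

/* The ring is empty when head has caught up with tail. */
static int ring_is_empty(unsigned head, unsigned tail)
{
    return head == tail;
}

/* Number of results waiting to be read; handles tail wrapping past the end. */
static unsigned ring_pending(unsigned head, unsigned tail, unsigned nr)
{
    assert(nr > 0 && head < nr && tail < nr);
    return (tail + nr - head) % nr;
}

/* Advance an index by one slot, wrapping back to 0 at the end of the ring. */
static unsigned ring_next(unsigned pos, unsigned nr)
{
    assert(nr > 0 && pos < nr);
    return (pos + 1) % nr;
}
```

Note that with this convention a ring of nr slots can hold at most nr - 1 results, because a completely full ring would be indistinguishable from an empty one.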

The head and tail positions of the ring buffer are stored in the aio_ring structure, which is defined as follows:

struct aio_ring {
    unsigned    id;
    unsigned    nr;    // The number of io_event entries that the ring buffer can hold
    unsigned    head;  // The start position of the ring buffer
    unsigned    tail;  // The end position of the ring buffer
    ...
};

These data structures were introduced first to make the following source code analysis easier to follow.

Now let’s examine the creation of an asynchronous IO context by calling the io_setup function, which calls the kernel function sys_io_setup as follows:

asmlinkage long
sys_io_setup(unsigned nr_events, aio_context_t *ctxp)
{
    struct kioctx *ioctx = NULL;
    unsigned long ctx;
    long ret;
    ...
    ioctx = ioctx_alloc(nr_events); // Call ioctx_alloc to create an asynchronous IO context
    ret = PTR_ERR(ioctx);
    if (!IS_ERR(ioctx)) {
        ret = put_user(ioctx->user_id, ctxp); // Return the asynchronous IO context identifier to the caller
        if (!ret)
            return 0;
        io_destroy(ioctx);
    }
out:
    return ret;
}

The sys_io_setup function is simple. It first calls ioctx_alloc to allocate an asynchronous IO context object, and then returns the identifier of that object to the caller.

Since sys_io_setup relies on ioctx_alloc, let’s look at the implementation of ioctx_alloc:

static struct kioctx *ioctx_alloc(unsigned nr_events)
{
    struct mm_struct *mm;
    struct kioctx *ctx;
    ...
    ctx = kmem_cache_alloc(kioctx_cachep, GFP_KERNEL); // Allocate a kioctx object
    ...
    INIT_LIST_HEAD(&ctx->active_reqs);                 // Initialize the asynchronous IO operation queue
    ...
    if (aio_setup_ring(ctx) < 0)                       // Initialize the ring buffer
        goto out_freectx;
    ...
    return ctx;
    ...
}

The ioctx_alloc function does the following:

  • Call the kmem_cache_alloc function to allocate an asynchronous IO context object from the kernel.
  • Initialize each member of the asynchronous IO context, for example, the asynchronous IO operation queue.
  • Call the aio_setup_ring function to initialize the ring buffer.

The implementation of aio_setup_ring is somewhat complicated and mainly involves memory management, so we skip it here. If you are interested, feel free to discuss it with me privately.

Submit asynchronous I/O operations

The io_submit function submits asynchronous IO operations. The caller passes io_submit an array of iocb structures, each of which describes one asynchronous IO operation to perform.

struct iocb {
    __u64   aio_data;       // User-defined data that can be used to identify IO operations or set callback functions
    ...
    __u16   aio_lio_opcode; // IO operation type, such as read (IOCB_CMD_PREAD) or write (IOCB_CMD_PWRITE)
    __s16   aio_reqprio;
    __u32   aio_fildes;     // File descriptor to perform the IO operation on
    __u64   aio_buf;        // The buffer for the IO operation (for example, the data to write to the file)
    __u64   aio_nbytes;     // Buffer size
    __s64   aio_offset;     // File offset for the IO operation
    ...
};

The io_submit function ultimately calls the kernel function sys_io_submit to submit the asynchronous IO operations. Let’s analyze the implementation of sys_io_submit:

asmlinkage long
sys_io_submit(aio_context_t ctx_id, long nr, struct iocb __user **iocbpp)
{
    struct kioctx *ctx;
    long ret = 0;
    int i;
    ...
    ctx = lookup_ioctx(ctx_id); // Look up the asynchronous IO context object by its identifier
    ...
    for (i = 0; i < nr; i++) {
        struct iocb __user *user_iocb;
        struct iocb tmp;

        if (unlikely(__get_user(user_iocb, iocbpp+i))) {
            ret = -EFAULT;
            break;
        }

        // Copy asynchronous IO operations from user space to kernel space
        if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
            ret = -EFAULT;
            break;
        }

        // Call io_submit_one to submit asynchronous I/O operations
        ret = io_submit_one(ctx, user_iocb, &tmp);
        if (ret)
            break;
    }

    put_ioctx(ctx);
    return i ? i : ret;
}

The implementation of sys_io_submit is relatively simple. It copies the asynchronous IO operation information from user space to kernel space, and then calls the io_submit_one function to submit each asynchronous IO operation. Let’s focus on the implementation of io_submit_one:

int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, struct iocb *iocb)
{
    struct kiocb *req;
    struct file *file;
    ssize_t ret;
    char *buf;
    ...
    file = fget(iocb->aio_fildes);      // Get the file object from the file descriptor
    ...
    req = aio_get_req(ctx);             // Get an asynchronous IO operation object
    ...
    req->ki_filp = file;                // The file object for asynchronous IO
    req->ki_user_obj = user_iocb;       // Points to the iocb object in user space
    req->ki_user_data = iocb->aio_data; // Set user-defined data
    req->ki_pos = iocb->aio_offset;     // Set the file offset for the asynchronous IO operation

    buf = (char *)(unsigned long)iocb->aio_buf; // The data buffer for the asynchronous IO operation

    // Handle each asynchronous IO operation type differently
    switch (iocb->aio_lio_opcode) {
    case IOCB_CMD_PREAD: // Asynchronous read operation
        ...
        ret = -EINVAL;
        // Call the file system's own function to start the asynchronous IO operation:
        // ext3 file systems use generic_file_aio_read
        if (file->f_op->aio_read)
            ret = file->f_op->aio_read(req, buf, iocb->aio_nbytes, req->ki_pos);
        break;
    ...
    }
    ...
    // When aio_read returns, the asynchronous IO operation may already be complete,
    // or it may have been added to the IO request queue.
    // If it was added to the IO request queue, return immediately
    if (likely(-EIOCBQUEUED == ret)) return 0;

    aio_complete(req, ret, 0); // The IO operation is already complete, so call aio_complete to finish the job
    return 0;
}

The code above comments io_submit_one in detail. Here is a summary of what it does:

  • Call the fget function to get the file object corresponding to the file descriptor.
  • Call the aio_get_req function to get an asynchronous IO operation object of type kiocb, the structure analyzed earlier. In addition, aio_get_req adds the asynchronous IO operation object to the active_reqs queue of the asynchronous IO context.
  • Handle each asynchronous IO operation type differently. For example, an asynchronous read operation calls the aio_read method of the file object. Each file system implements its aio_read method differently; for example, the aio_read method of the ext3 file system points to the generic_file_aio_read function.
  • If the asynchronous IO operation was added to the kernel’s IO request queue, return immediately. Otherwise, the IO operation has already finished, so call the aio_complete function to do the finishing work.
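
The last step, choosing between "queued, completion comes later" and "already done, complete now", can be modelled in a few lines of user-space C. Everything here is a hypothetical stand-in: `TOY_QUEUED` is an arbitrary sentinel playing the role of -EIOCBQUEUED, and the function pointer plays the role of the file system's aio_read method.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUED (-1000)  /* arbitrary sentinel standing in for -EIOCBQUEUED */

struct toy_req {
    int completed;  /* set once a result has been recorded */
    long res;       /* result of the operation */
};

/* Stand-in for the per-file-system aio_read method. */
typedef long (*toy_read_fn)(struct toy_req *req);

/* Stand-in for aio_complete: record the result of a finished request. */
static void toy_complete(struct toy_req *req, long res)
{
    req->completed = 1;
    req->res = res;
}

/* Stand-in for the tail of io_submit_one. */
static int toy_submit_one(struct toy_req *req, toy_read_fn read_fn)
{
    long ret;

    assert(req != NULL && read_fn != NULL);
    ret = read_fn(req);
    if (ret == TOY_QUEUED)
        return 0;           /* queued: the result will be reported later */
    toy_complete(req, ret); /* finished (or failed) synchronously */
    return 0;
}

/* Two example methods: one finishes at once, one queues the request. */
static long read_cached(struct toy_req *req) { (void)req; return 512; }
static long read_queued(struct toy_req *req) { (void)req; return TOY_QUEUED; }
```

In both cases the submitter sees success; the difference is only whether the result is already recorded when the submit call returns.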

The flow of the io_submit_one function is shown in Figure 5:

In short, the main task of the io_submit_one function is to submit an IO request to the kernel.

Complete asynchronous I/O operations

When the asynchronous IO operation is complete, the kernel calls aio_complete to put the processing result into the ring buffer ring_info of the asynchronous IO context.

int aio_complete(struct kiocb *iocb, long res, long res2)
{
    struct kioctx *ctx = iocb->ki_ctx;
    struct aio_ring_info *info;
    struct aio_ring *ring;
    struct io_event *event;
    unsigned long flags;
    unsigned long tail;
    int ret;
    ...
    info = &ctx->ring_info; // Ring buffer object

    spin_lock_irqsave(&ctx->ctx_lock, flags);         // Lock the asynchronous IO context
    ring = kmap_atomic(info->ring_pages[0], KM_IRQ1); // Map the memory page to a virtual address

    tail = info->tail;                           // The next free position of the ring buffer
    event = aio_ring_event(info, tail, KM_IRQ0); // Get the free slot in the ring buffer that will hold the result
    tail = (tail + 1) % info->nr;                // Update the next free position

    // Save the asynchronous IO result to the ring buffer
    event->obj = (u64)(unsigned long)iocb->ki_user_obj;
    event->data = iocb->ki_user_data;
    event->res = res;
    event->res2 = res2;
    ...
    info->tail = tail;
    ring->tail = tail; // Update the next free position of the ring buffer

    put_aio_ring_event(event, KM_IRQ0); // Remove the virtual memory address mapping
    kunmap_atomic(ring, KM_IRQ1);       // Remove the virtual memory address mapping

    // Release the asynchronous IO object
    ret = __aio_put_req(ctx, iocb);
    spin_unlock_irqrestore(&ctx->ctx_lock, flags);
    ...
    return ret;
}

The iocb argument of aio_complete is the asynchronous IO object that was submitted through io_submit_one, while the res and res2 arguments are the results the kernel produced when the IO operation completed.

The aio_complete function does the following:

  • Use the tail pointer of the ring buffer to get a free io_event object and store the result of the IO operation in it.
  • Advance the tail pointer of the ring buffer by one so that it points to the next free position.

Once the results of asynchronous IO operations are saved in the ring buffer, user space can read them by calling the io_getevents function, which ultimately calls the sys_io_getevents function.

Let’s examine the sys_io_getevents function implementation:

asmlinkage long sys_io_getevents(aio_context_t ctx_id,
                                 long min_nr,
                                 long nr,
                                 struct io_event *events,
                                 struct timespec *timeout)
{
    struct kioctx *ioctx = lookup_ioctx(ctx_id);
    long ret = -EINVAL;
    ...
    if (likely(NULL != ioctx)) {
        // Call the read_events function to read the results of the IO operations
        ret = read_events(ioctx, min_nr, nr, events, timeout);
        put_ioctx(ioctx);
    }
    return ret;
}

The sys_io_getevents function calls the read_events function to read the results of asynchronous IO operations.

static int read_events(struct kioctx *ctx,
                      long min_nr, long nr,
                      struct io_event *event,
                      struct timespec *timeout)
{
    long start_jiffies = jiffies;
    struct task_struct *tsk = current;
    DECLARE_WAITQUEUE(wait, tsk);
    int ret;
    int i = 0;
    struct io_event ent;
    struct timeout to;

    memset(&ent, 0, sizeof(ent));
    ret = 0;

    while (likely(i < nr)) {
        ret = aio_read_evt(ctx, &ent); // Read an IO processing result from the ring buffer
        if (unlikely(ret <= 0))        // If the ring buffer has no IO processing results, exit the loop
            break;

        ret = -EFAULT;
        // Copy the IO processing results to user space
        if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
            break;
        }

        ret = 0;
        event++;
        i++;
    }

    if (min_nr <= i)
        return i;
    if (ret)
        return ret;
    ...
}

The read_events function mainly calls the aio_read_evt function to read the results of asynchronous IO operations from the ring buffer and, on success, copies each result to user space.

The aio_read_evt function reads one asynchronous IO result from the ring buffer. Its code is as follows:

static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
{
    struct aio_ring_info *info = &ioctx->ring_info;
    struct aio_ring *ring;
    unsigned long head;
    int ret = 0;

    ring = kmap_atomic(info->ring_pages[0], KM_USER0);

    // If the head pointer of the ring buffer is equal to the tail pointer, the ring buffer is empty
    if (ring->head == ring->tail) 
        goto out;

    spin_lock(&info->ring_lock);

    head = ring->head % info->nr;
    if (head != ring->tail) {
        // Read a result from the ring buffer at the position given by the head pointer
        struct io_event *evp = aio_ring_event(info, head, KM_USER1);

        *ent = *evp;                  // Save the result to the ent argument
        head = (head + 1) % info->nr; // Move the head pointer of the ring buffer to the next position
        ring->head = head;            // Save the head pointer to the ring buffer
        ret = 1;
        put_aio_ring_event(evp, KM_USER1);
    }

    spin_unlock(&info->ring_lock);

out:
    kunmap_atomic(ring, KM_USER0);
    return ret;
}

The aio_read_evt function first checks whether the ring buffer is empty. If it is not, the function reads the result of one asynchronous IO operation from the ring buffer, saves it to the ent argument, and moves the head pointer of the ring buffer to the next position.
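
Putting the producer side (aio_complete) and the consumer side (aio_read_evt) together, the ring buffer behaves like the following user-space toy. It is only a sketch of the head and tail handling: the names `toy_ring`, `toy_ring_complete` and `toy_ring_read` are hypothetical, the capacity `TOY_NR` is an arbitrary assumption, and locking, page mapping and overflow checks are all left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR 4  /* ring capacity in slots (arbitrary for this sketch) */

struct ring_event {
    unsigned long data;  /* user data of the request */
    long res;            /* result of the IO operation */
};

struct toy_ring {
    unsigned head;       /* next slot to read */
    unsigned tail;       /* next free slot to write */
    struct ring_event events[TOY_NR];
};

/* Producer, modelled on aio_complete: store the result at tail, then advance tail. */
static void toy_ring_complete(struct toy_ring *r, unsigned long data, long res)
{
    assert(r != NULL);
    r->events[r->tail].data = data;
    r->events[r->tail].res = res;
    r->tail = (r->tail + 1) % TOY_NR;
}

/* Consumer, modelled on aio_read_evt: returns 1 and fills *ent,
 * or returns 0 when the ring is empty. */
static int toy_ring_read(struct toy_ring *r, struct ring_event *ent)
{
    assert(r != NULL && ent != NULL);
    if (r->head == r->tail)
        return 0;
    *ent = r->events[r->head];
    r->head = (r->head + 1) % TOY_NR;
    return 1;
}
```

Results come out in the order in which the operations completed, which is not necessarily the order in which they were submitted.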