This article discusses several major zero-copy technologies in Linux and the scenarios where zero-copy technologies are applicable. To quickly establish the concept of zero copy, let’s introduce a common scenario:


When writing a server-side application (a web server or file server), file download is a basic function. In this case, the server’s task is to send files from the server’s local disk out through the connected socket without modification. We usually use the following code to accomplish this:

while ((n = read(diskfd, buf, BUF_SIZE)) > 0)
    write(sockfd, buf, n);

The basic operation is to read the file contents from disk into a buffer and then send the buffer’s contents out through the socket. But Linux I/O is buffered by default, and the two system calls used here, read and write, hide what the operating system is actually doing. In fact, the data is copied multiple times during the I/O operations above.

When an application accesses a block of data, the operating system first checks whether the file has been accessed recently and its contents are already cached in a kernel buffer. If so, the read system call simply copies the contents of the kernel buffer into the user-space buffer specified by buf. If not, the operating system first copies the data from disk into the kernel buffer, which today is done by DMA, and then copies the contents of the kernel buffer into the user buffer. The write system call then copies the contents of the user buffer into the kernel buffer associated with the network stack (the socket buffer), and finally the socket buffer’s contents are sent to the network adapter. Having said all this, it is easier to just look at the picture:

[Figure: DMA and CPU copies in the traditional read/write path]

No changes were made to the contents of the file, so copying data back and forth between kernel space and user space was a waste, and zero copy was designed to address this inefficiency.

## What is zero-copy technology?

The main task of zero copy is to avoid having the CPU copy data from one piece of storage to another. Zero-copy techniques eliminate unnecessary copies, or hand such simple data-transfer work over to other components, so that the CPU is freed up to focus on other tasks and system resources are used more efficiently.

Going back to the file-download example above, how can we reduce the number of data copies? An obvious starting point is to avoid copying data back and forth between kernel space and user space, which leads to the first class of zero copy:

Let the data transfer avoid passing through user space

### Using mmap

One way we can reduce the number of copies is to call mmap() instead of read:

buf = mmap(NULL, len, PROT_READ, MAP_SHARED, diskfd, 0);
write(sockfd, buf, len);
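Fleshed out a little, the same idea might look like the sketch below. This is only an illustration: mmap_send is a hypothetical helper, diskfd is assumed to be an open regular file and sockfd a connected socket, and short writes are not retried.

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: send the whole file behind diskfd to sockfd using mmap + write. */
static int mmap_send(int diskfd, int sockfd)
{
    struct stat st;
    if (fstat(diskfd, &st) == -1)
        return -1;

    void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, diskfd, 0);
    if (buf == MAP_FAILED)
        return -1;

    ssize_t n = write(sockfd, buf, st.st_size);  /* no copy into a user buffer first */
    munmap(buf, st.st_size);
    return n == st.st_size ? 0 : -1;
}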

The application calls mmap(), the data on disk is copied into a kernel buffer by DMA, and the operating system then shares this buffer with the application, so there is no need to copy the contents of the kernel buffer into user space. The application then calls write(), and the operating system copies the contents of the kernel buffer directly into the socket buffer, entirely in kernel mode. Finally, the socket buffer sends the data to the network card. Again, it’s easier to look at the picture:

[Figure: mmap + write data flow]

mmap does save one copy, but it also hides a trap: if we map a file and another process then truncates that file, our write system call will access an invalid address and be terminated by a SIGBUS signal. SIGBUS kills the process by default and produces a coredump, which means the server dies abnormally.

We usually avoid this problem with the following solutions:

  1. Install a signal handler for SIGBUS. When the SIGBUS signal arrives, the handler simply returns, the write system call returns the number of bytes written before it was interrupted, and errno is set to success. This is a bad way to handle it, though, because it papers over the symptom instead of addressing the heart of the problem. (A sketch of installing such a handler follows this list.)
  2. Use a file lease lock. This is the method we usually use: we take out a lease from the kernel on the file descriptor. When another process tries to truncate the file, the kernel sends us a real-time RT_SIGNAL_LEASE signal, telling us that it is about to break the lease we hold on the file. This way our write system call is interrupted before the program accesses invalid memory and is killed by SIGBUS; write returns the number of bytes already written, and errno is set to success.
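For the first option, a minimal sketch of installing such a handler might look like this. The names sigbus_handler and install_sigbus_handler are hypothetical, and the empty handler body simply follows the behaviour described in item 1 above.

#include <signal.h>
#include <string.h>

/* Option 1 (sketch): install an empty SIGBUS handler so that, when the
 * mapped file is truncated underneath us, write() is interrupted instead
 * of the default action (kill + coredump) being taken. */
static void sigbus_handler(int sig)
{
    (void)sig;   /* deliberately do nothing and return */
}

static int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}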

We should acquire the lease before mmap-ing the file and release it after we have finished working with it:

if (fcntl(diskfd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
    perror("kernel lease set signal");
    return -1;
}

/* l_type can be F_RDLCK or F_WRLCK to take the lease,
 * and F_UNLCK to release it */
if (fcntl(diskfd, F_SETLEASE, l_type) == -1) {
    perror("kernel lease set type");
    return -1;
}

### Using sendfile

Starting with the 2.1 kernel, Linux introduced sendfile to simplify operations:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The sendfile() system call transfers file contents (bytes) between the descriptor in_fd, which refers to the input file, and the descriptor out_fd, which refers to the output file. The descriptor out_fd must refer to a socket, and the file that in_fd refers to must be mmap-able. These restrictions mean sendfile can only transfer data from a file to a socket, not the other way around. Using sendfile not only reduces the number of data copies but also reduces context switches: the data transfer always stays inside kernel space.
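As an illustration, a hedged sketch of pushing a whole file out over a socket with sendfile might look like this. send_whole_file is a hypothetical helper; diskfd is assumed to be a regular file and sockfd a connected socket, and EINTR/EAGAIN retries are omitted.

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Sketch: transfer the whole file behind diskfd to sockfd in kernel space. */
static int send_whole_file(int diskfd, int sockfd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(diskfd, &st) == -1)
        return -1;

    while (offset < st.st_size) {
        ssize_t sent = sendfile(sockfd, diskfd, &offset,
                                st.st_size - offset);
        if (sent <= 0)
            return -1;   /* real code would handle EINTR/EAGAIN here */
    }
    return 0;
}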

What happens if another process truncates the file while we are calling sendfile? Assuming we have not installed any signal handlers, the sendfile call simply returns the number of bytes it transferred before being interrupted, and errno is set to success. If we take a lease on the file before calling sendfile, sendfile behaves the same way, and we also receive the RT_SIGNAL_LEASE signal.

So far, we have reduced the number of data copies, but one copy still remains: the copy from the page cache into the socket buffer. Can we omit this copy as well?

With the help of hardware, we can. Instead of copying the data from the page cache into the socket buffer, we only need to append a buffer descriptor (the data’s location in the page cache and its length) to the socket buffer; the DMA controller can then gather the data directly from the page cache and send it to the network card.

To summarize, the sendfile system call uses the DMA engine to copy the file contents into the kernel buffer, and then a descriptor containing the data’s location and length is appended to the socket buffer; no data is copied from the kernel buffer into the socket buffer. The DMA engine then copies the data from the kernel buffer straight to the protocol engine, avoiding the last CPU copy.

However, this gather-copy capability requires support from the hardware and its driver.

### Using splice

sendfile only works for copying data from a file to a socket, which limits its use. Linux introduced the splice system call in 2.6.17 to move data between two file descriptors:

#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

The splice call moves data between two file descriptors without copying it back and forth between kernel space and user space. It moves up to len bytes of data from fd_in to fd_out, but one of the two descriptors must refer to a pipe; this is currently the main limitation of splice. The flags parameter can take the following values (a usage sketch appears after the list below):

  • SPLICE_F_MOVE: try to move pages instead of copying them. This is only a hint to the kernel: if the kernel cannot move the pages out of the pipe, or if the pipe buffers do not hold whole pages, the data still has to be copied. The initial Linux implementation had problems, so since 2.6.21 this flag has been a no-op; it may be implemented again in a later Linux version.
  • SPLICE_F_NONBLOCK: the splice operation should not block. However, if the file descriptors themselves have not been set up for non-blocking I/O, the splice call may still block.
  • SPLICE_F_MORE: more data will follow in a subsequent splice call.

splice takes advantage of Linux’s pipe buffer mechanism, which is why at least one of the two descriptors must refer to a pipe.
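A hedged sketch of moving data from a file to a socket with splice, using a pipe as the required intermediary, might look like this. relay_through_pipe is a hypothetical helper, and error handling and partial-transfer bookkeeping are simplified.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch: relay len bytes from filefd to sockfd through a pipe, since one
 * side of every splice() call must be a pipe. */
static ssize_t relay_through_pipe(int filefd, int sockfd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) == -1)
        return -1;

    while (len > 0) {
        /* file -> pipe, staying in kernel space */
        ssize_t n = splice(filefd, NULL, pipefd[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;
        /* pipe -> socket; assumes the pipe is drained in one call */
        ssize_t m = splice(pipefd[0], NULL, sockfd, NULL, (size_t)n,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (m <= 0)
            break;
        total += m;
        len   -= (size_t)m;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}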

All of the zero-copy techniques above work by reducing the copying of data between user space and kernel space, but sometimes data really does have to be copied between the two. In that case, we have to focus on when the copy between user space and kernel space happens. Linux often uses copy-on-write to reduce this overhead, a technique usually referred to as COW.

For reasons of space, this article does not cover copy-on-write in detail. Roughly: when multiple applications access the same data at the same time, each of them holds a pointer to that data, and from each program’s point of view it owns the data independently. Only when a program needs to modify the data is the data copied into that program’s own address space, at which point it becomes the program’s private copy. If the application never needs to modify the data, it never needs to copy it into its own address space, which reduces data copying. Copy-on-write deserves an article of its own…
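As a tiny illustration of the idea (not from the original text, and the page copy itself is invisible at the C level), fork() is the classic place where Linux applies copy-on-write:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* After fork(), parent and child initially share the same physical pages;
 * a page is copied only when one of them writes to it. */
int main(void)
{
    char *data = malloc(4096);
    strcpy(data, "shared until someone writes");

    if (fork() == 0) {            /* child: this write triggers the copy */
        data[0] = 'S';
        printf("child : %s\n", data);
        return 0;
    }
    wait(NULL);
    printf("parent: %s\n", data); /* parent still sees the original text */
    return 0;
}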

In addition, there are other zero-copy techniques, such as traditional Linux I/O with the O_DIRECT flag for direct I/O, which bypasses the kernel’s automatic caching, and fbufs, which is not yet mature. This article does not cover every zero-copy technique, only some common ones; if you are interested, you can explore the rest yourself. Mature server projects also modify the kernel’s I/O subsystem to improve their data transfer rates.