preface

Zero copy means that the CPU does not have to copy data from one memory region to another while servicing an I/O operation, which reduces context switches and the time the CPU spends copying. Its purpose is to cut the number of data copies and system calls on the path between network devices and user program space, removing the CPU from the data transfer itself and eliminating its load on that path. The main techniques used to achieve zero copy are DMA data transfer and memory region mapping.

  • The zero-copy mechanism reduces the number of repeated I/O copies of data between the kernel buffer and user process buffer.
  • The zero-copy mechanism reduces CPU overhead due to context switching between the user process address space and the kernel address space.

Main text

1. Physical and virtual memory

Since the processes of an operating system share the CPU and memory resources, a complete memory management mechanism is needed to keep processes from trampling on each other's memory. To manage memory more efficiently and with fewer errors, modern operating systems provide an abstraction over main memory, namely virtual memory. Virtual memory provides a consistent, private address space for each process, which gives each process the illusion that it has exclusive use of main memory (each process sees a contiguous, complete memory space).

1.1. Physical memory

Physical memory is defined relative to virtual memory. Physical memory is the memory provided by the RAM modules installed in the machine, while virtual memory treats a region of the hard disk as if it were memory. The main function of memory is to provide temporary storage for the operating system and for programs while the computer is running. Physically, it is simply the actual capacity of the memory modules plugged into the motherboard's memory slots.

1.2. Virtual memory

Virtual memory is a memory-management technique in computer systems. It makes an application believe it has contiguous available memory (a contiguous, complete address space). In practice, the virtual address space is backed by fragments of physical memory plus some data temporarily stored on external disk, which is swapped into physical memory as needed. Most operating systems use virtual memory today, for example Windows virtual memory and the Linux swap space.

Virtual memory addresses are closely tied to user processes. Generally speaking, the same virtual address in different processes points to different physical addresses, so it makes no sense to talk about virtual memory without a process. The size of the virtual address space each process can use depends on the CPU word size. On 32-bit systems the virtual address space is 2^32 bytes = 4 GB; on 64-bit systems it is 2^64 bytes = 2^34 GB, and the actual physical memory may be much smaller than the virtual address space. Each user process maintains a separate Page Table through which its virtual memory is mapped to physical memory. The following shows the address mapping between the virtual memory spaces of processes A and B and the corresponding physical memory:

When a process executes a program, it first reads the process’s instructions from memory and then executes them, using the virtual address. This virtual address is determined when the program is linked (the address range of the dynamic library is adjusted when the kernel loads and initializes the process). To obtain the actual data, the CPU needs to convert the virtual address to the physical address. The CPU uses the Page Table of the process to convert the address. The Page Table data is maintained by the operating system.

A Page Table can be loosely understood as a list of Memory Mappings (the real structure is far more complex). Each Memory Mapping maps a virtual address to a specific address space (physical memory or disk storage). Each process has its own Page Table, independent of any other process's Page Table.

From the above introduction, we can simply summarize the user process to request and access physical memory (or disk storage space) as follows:

  1. A user process sends a memory request to the operating system
  2. The system checks whether the virtual address space of the process is used up. If the virtual address space is available, the system assigns a virtual address to the process
  3. The system creates a Memory Mapping for this virtual address and places it in the process’s Page Table.
  4. The system returns the virtual address to the user process, and the user process accesses the virtual address
  5. Based on the virtual address, the CPU finds the corresponding Memory Mapping in the process's Page Table, but the Memory Mapping is not yet associated with physical memory, so a page-fault interrupt occurs
  6. When the operating system receives the page-fault interrupt, it allocates real physical memory and associates it with the corresponding Memory Mapping in the Page Table. Once the interrupt handling is complete, the CPU can access the memory
  7. Of course, page-fault interrupts do not happen every time. They occur only when the system decides to delay memory allocation; in many cases the system allocates real physical memory and associates it with the Memory Mapping already in step 3 (see the sketch below)
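The delayed allocation described in steps 5-7 can be observed directly. Below is a minimal, illustrative C sketch (Linux-specific assumptions) that reserves a large anonymous mapping up front; physical pages are attached only when the memory is first touched:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64 * 1024 * 1024;   /* 64 MB of virtual address space */
    /* Reserve virtual memory only; no physical pages are allocated yet. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Touching the pages triggers page faults; the kernel now attaches
       physical frames to the corresponding Memory Mappings. */
    memset(buf, 0x42, len);
    munmap(buf, len);
    return 0;
}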

Introducing virtual memory between user processes and physical memory (disk storage) has the following major advantages:

  • Address space: provides a larger, contiguous address space, which makes programs easier to write and link
  • Process isolation: The virtual addresses of different processes are not related to each other, so the operations of one process do not affect other processes
  • Data protection: Each virtual memory has corresponding read and write attributes, which can protect program code segments from modification and data blocks from execution, increasing the security of the system
  • Memory mapping: With virtual memory, files (executable files or dynamic libraries) on disk can be mapped directly to the virtual address space. This allows for delayed allocation of physical memory, so that files are actually loaded from disk to memory only when they need to be read, and can be emptied when memory is tight, increasing the efficiency of physical memory, all of which is transparent to the application
  • Shared memory: dynamic libraries, for example, need only one copy in physical memory, which is mapped into the virtual address spaces of different processes so that each process feels it owns the file. Memory sharing between processes can also be achieved by mapping the same block of physical memory into different processes' virtual address spaces
  • Physical memory management: the physical address space is managed by the operating system and cannot be allocated or reclaimed by processes directly. In this way the system can make better use of memory and balance memory demands among processes

2. Kernel space and user space

The core of the operating system is the kernel, which is independent of ordinary applications and has access to the protected memory space as well as to the underlying hardware devices. To prevent user processes from operating on the kernel directly and to keep the kernel safe, the operating system divides virtual memory into two parts: kernel space and user space. In Linux, kernel modules run in kernel space, and the corresponding process is said to be in kernel mode, while user programs run in user space, and the corresponding process is in user mode.

On a Linux x86_32 system the addressable virtual space is 4 GB (2^32). The highest 1 GB (virtual addresses 0xC0000000 to 0xFFFFFFFF) is reserved for the kernel and is called kernel space, while the lower 3 GB (virtual addresses 0x00000000 to 0xBFFFFFFF) is used by the individual user processes and is called user space, so the kernel-to-user split of virtual memory is 1:3. Here is the memory layout of user space and kernel space for a process:

2.1. Kernel space

Kernel space always resides in memory, which is reserved for the operating system’s kernel. Applications are not allowed to read or write directly from this area or call functions defined by kernel code. The area on the left in the figure above is the virtual memory corresponding to the kernel process, which can be divided into two areas by access permission: process private and process shared.

  • Process private virtual memory: Each process has its own kernel stack, page table, task structure, mem_map structure, etc.
  • Virtual memory shared by processes: An area of memory shared by all processes, including physical storage, kernel data, and kernel code areas.

2.2. User space

Each normal user process has a separate user space. The user process cannot access the data in the kernel space, nor can it directly call the kernel function. Therefore, when making system calls, the process must switch to the kernel state. User space consists of the following memory areas:

  • Runtime stack: automatically allocated and released by the compiler, it holds function parameter values, local variables, method return values, and so on. Each time a function is called, its return address and some call information are pushed onto the top of the stack, and when the call returns that information is popped and freed. The stack is a contiguous region that grows from high addresses toward low addresses, with a maximum size predefined by the system; if a request exceeds this limit an overflow is reported, so the space a user can obtain from the stack is relatively small.
  • Run-time heap: the region between the BSS segment and the stack, used to hold memory segments dynamically allocated while the process runs. It is allocated (malloc) and released (free) by the programmer. The heap grows from low addresses toward high addresses and uses a linked storage structure. Frequent malloc/free calls make the memory space discontinuous and produce a large amount of fragmentation. When heap space is requested, the library functions follow certain algorithms to search for a free block of sufficient size, so the heap is much less efficient than the stack.
  • Code segment: stores the machine instructions that the CPU executes. This part of memory can only be read, not written. The code area is usually shared, that is, it can be used by other processes: if several processes on the machine run the same program, they can share the same code segment.
  • Uninitialized data segment: Holds uninitialized global variables. BSS data is initialized to 0 or NULL before program execution begins.
  • Initialized data segment: Stores initialized global variables, including static global variables, static local variables, and constants.
  • Memory mapping area: virtual memory such as dynamic libraries and shared memory that is mapped onto physical space, usually the virtual memory allocated by the mmap function.

3. Linux internal hierarchy

Kernel mode can execute any instruction and use all of the system's resources, while user mode can perform only simple operations and cannot use system resources directly. User-mode code must issue requests to the kernel through system calls. For example, when a user process starts a bash shell, it makes a system call to the kernel's process service via getpid() to obtain the ID of the current user process; when the user process views the host configuration with the cat command, it makes a system call to the kernel's file subsystem.
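As a simple illustration, the sketch below issues two system calls from user mode; each call traps into the kernel and returns its result to user space (a minimal, self-contained example, not tied to any particular program mentioned above):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* getpid() traps from user mode into the kernel and returns
       the current process ID to user space. */
    pid_t pid = getpid();

    char msg[64];
    int n = snprintf(msg, sizeof(msg), "current pid: %d\n", (int) pid);
    /* write() is another system call: the kernel copies the buffer
       from user space and hands it to the device driver. */
    if (write(STDOUT_FILENO, msg, (size_t) n) < 0) {
        return 1;
    }
    return 0;
}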

  • The kernel space has access to all CPU instructions and all memory space, I/O space, and hardware devices.
  • User space can access only limited resources. If special permissions are required, you can obtain corresponding resources through system calls.
  • Page faults are allowed in user space but not in kernel space.
  • Kernel space and user space are both ranges of the linear (virtual) address space.
  • On x86 CPUs, user space occupies the 0-3 GB address range and kernel space the 3-4 GB range. On x86_64 CPUs, the user-space address range is 0x0000000000000000-0x00007FFFFFFFFFFF, and the kernel address space runs from 0xFFFF880000000000 up to the maximum address.
  • All kernel processes (threads) share one address space, while user processes have their own address space.

With the division of user space and kernel space, the Linux internal hierarchy can be divided into three parts, from the bottom to the top are hardware, kernel space and user space, as shown in the following figure:

4. Linux I/O read and write mode

Linux provides three mechanisms for transferring data between disk and main memory: polling, I/O interrupts, and DMA transfer. Polling continuously checks the I/O port in a busy loop. With I/O interrupts, the disk sends an interrupt request to the CPU when data arrives, and the CPU is responsible for the data transfer. With DMA transfer, a DMA controller is introduced on top of I/O interrupts and takes over the data transfer, reducing the CPU resources consumed by I/O interrupt handling.

4.1. I/O Interrupt Principle

Before DMA, I/O between an application and a disk was done through CPU interrupts. Each time a user process read disk data, the CPU issued an I/O request and waited, via interrupts, until the data had been read and copied. Every I/O interrupt caused a CPU context switch.

  1. The user process makes a read system call to the CPU to read the data, switches from user to kernel mode, and then blocks waiting for the data to return.
  2. After receiving the command, the CPU initiates an I/O request to the disk and puts the disk data into the disk controller buffer.
  3. After data is prepared, the disk initiates an I/O interrupt to the CPU.
  4. The CPU receives an I/O interrupt and copies data from the disk buffer to the kernel buffer, and then from the kernel buffer to the user buffer.
  5. The user process switches from the kernel state to the user state, unblocks, and waits for the CPU’s next execution clock.

4.2. DMA transmission principle

DMA, or Direct Memory Access, is a mechanism that allows peripheral devices (hardware subsystems) to access the system's main memory directly. In other words, with DMA, data transfer between main memory and the hard disk or network card can proceed without the CPU scheduling every step. Today most hardware devices, including disk controllers, network cards, graphics cards, and sound cards, support DMA.

With the DMA disk controller taking over the data read/write requests, the CPU is freed from heavy I/O operations and the data read operation flows as follows:

  1. The user process makes a read system call to the CPU to read the data, switches from user to kernel mode, and then blocks waiting for the data to return.
  2. The CPU issues scheduling instructions to the DMA disk controller after receiving the instructions.
  3. The DMA disk controller makes an I/O request to the disk and puts the disk data into the disk controller buffer first. The CPU does not participate in the whole process.
  4. After the data is read, the DMA disk controller receives a notification from the disk and copies the data from the disk controller buffer to the kernel buffer.
  5. The DMA disk controller signals the CPU to read the data, and the CPU copies the data from the kernel buffer to the user buffer.
  6. The user process switches from the kernel state to the user state, unblocks, and waits for the CPU’s next execution clock.

5. Traditional I/O mode

To better understand the problems zero copy solves, let’s first look at the problems with traditional I/O. In Linux system, the traditional access method is realized by two system calls write() and read(). The read() function reads the file into the cache, and then outputs the data in the cache to the network port through the write() method. The pseudocode is as follows:

read(file_fd, tmp_buf, len);
write(socket_fd, tmp_buf, len);
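A fleshed-out version of this pseudocode might look like the following sketch, assuming socket_fd is an already-connected socket; each loop iteration performs the DMA copy into the kernel read buffer, the CPU copy into the user buffer, the CPU copy into the socket buffer, and the DMA copy to the NIC:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Send a whole file over an already-connected socket using the
   traditional read()/write() pair; socket_fd is assumed to exist. */
int send_file_traditional(const char *path, int socket_fd) {
    char tmp_buf[4096];                      /* user-space buffer */
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) {
        perror("open");
        return -1;
    }
    ssize_t n;
    while ((n = read(file_fd, tmp_buf, sizeof(tmp_buf))) > 0) {
        /* read():  DMA copy disk -> kernel buffer, CPU copy kernel -> tmp_buf */
        /* write(): CPU copy tmp_buf -> socket buffer, DMA copy -> NIC        */
        if (write(socket_fd, tmp_buf, (size_t) n) != n) {
            perror("write");
            close(file_fd);
            return -1;
        }
    }
    close(file_fd);
    return (n < 0) ? -1 : 0;
}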

The following figure shows the data read and write process of a traditional I/O operation. The whole process involves two CPU copies, two DMA copies, a total of four copies, and four context switches. The following is a brief description of related concepts.

  • Context switch: When a user program makes a system call to the kernel, the CPU switches the user process from the user state to the kernel state. When the system call returns, the CPU switches the user process from kernel state back to user state.
  • CPU copy: The CPU processes data transfer directly. Data copy occupies CPU resources.
  • DMA copy: The CPU issues instructions to the DMA disk controller, which processes the data transfer and feeds the information back to the CPU after the data transfer, thus reducing the CPU resource occupancy.

5.1. Traditional read operations

When an application executes a read system call to read a piece of data that already exists in the page memory of the user process, it reads the data directly from memory. If the data does not exist, it is first loaded from disk into the read buffer of kernel space, and then copied from the read buffer into the page memory of the user process.

read(file_fd, tmp_buf, len);

Based on traditional I/O reading, the read system call triggers two context switches, one DMA copy, and one CPU copy, and initiates a data read as follows:

  1. The user process makes a system call to the kernel through the read() function, and the context changes from user space to kernel space.
  2. The CPU uses DMA controllers to copy data from main memory or hard disk to the read buffer of the kernel space.
  3. The CPU copies the data in the read buffer to the user buffer in user space.
  4. The context switches from kernel space back to user space, and the read call returns.

5.2. Traditional write operations

When the application program prepares the data and executes a write system call to send the network data, it copies the data from the user-space page cache to the kernel-space socket buffer, and then copies the data from the write cache to the network adapter device to complete the data transmission.

write(socket_fd, tmp_buf, len);

Based on the traditional I/O writing method, the write() system call triggers two context switches, one CPU copy, and one DMA copy. The user program sends network data as follows:

  1. The user process makes system calls to the kernel through the write() function, and the context changes from user space to kernel space.
  2. The CPU copies data from the user buffer to the socket buffer of the kernel space.
  3. The CPU uses a DMA controller to copy data from the socket buffer to the network card for data transfer.
  4. The context switches from kernel space to user space, and the write system call execution returns.

6. Zero-copy mode

There are three main ways to implement zero-copy technology in Linux: user-mode direct I/O, reducing the number of data copies, and copy-on-write technology.

  • User-mode direct I/O: Applications can access the hardware storage directly, and the operating system kernel only assists in data transfer. In this way, there is still a context switch between user space and kernel space, and the data on the hardware is copied directly to user space, not through the kernel space. Therefore, there is no copy of data between the kernel-space buffer and user-space buffer for direct I/O.
  • Reduce the number of data copies: In the process of data transmission, avoid the CPU copy of data between the user space buffer and the system kernel space buffer, as well as the CPU copy of data in the system kernel space. This is also the realization idea of the current mainstream zero-copy technology.
  • Copy-on-write technology: When multiple processes share the same piece of data, if one process needs to modify the data, it copies the data to its own process address space. If the data is only read, no copy operation is required.

6.1. User mode direct I/O

User-mode direct I/O lets the application process, or a library function running in user mode (user space), access the hardware directly; data is transferred without passing through the kernel. During the transfer, the kernel does nothing beyond the necessary virtual-memory configuration. By bypassing the kernel in this way, performance can be improved significantly.

User-mode direct I/O is only suitable for applications that do not need kernel-buffer processing. Such applications usually maintain their own data cache in the process address space and are called self-caching applications; database management systems are a typical example. Second, because this mechanism drives disk I/O directly, the large speed gap between the CPU and the disk wastes a lot of resources; the usual remedy is to combine it with asynchronous I/O.
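On Linux, user-mode direct I/O is usually requested with the O_DIRECT flag of open(). The sketch below is illustrative only; the file name and the 4 KB alignment are assumptions, and real code should query the device's block size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT bypasses the kernel page cache: data moves between the
       device and this user buffer without a kernel-buffer CPU copy. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf = NULL;
    size_t block = 4096;   /* assumed alignment and block size */
    /* O_DIRECT requires the buffer, offset and length to be aligned. */
    if (posix_memalign(&buf, block, block) != 0) {
        close(fd);
        return 1;
    }

    ssize_t n = read(fd, buf, block);
    if (n < 0)
        perror("read");

    free(buf);
    close(fd);
    return 0;
}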

6.2. mmap + write

One zero-copy approach is to use mmap + write instead of read + write, which saves one CPU copy. mmap is a memory-mapped file mechanism provided by Linux: a virtual address in a process's address space is mapped to a disk file address. The pseudocode of mmap + write is as follows:

tmp_buf = mmap(file_fd, len);
write(socket_fd, tmp_buf, len);
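Expanded with the real mmap() signature, the pseudocode might look like the following sketch (socket_fd is assumed to be an already-open socket; error handling is kept minimal):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a file through mmap() + write(): the file pages are mapped into the
   process address space, so the kernel-to-user CPU copy of read() is avoided;
   write() still performs one CPU copy into the socket buffer. */
int send_file_mmap(const char *path, int socket_fd) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }
    size_t len = (size_t) st.st_size;

    /* Map the file's page-cache pages into user space (shared, read-only). */
    char *tmp_buf = mmap(NULL, len, PROT_READ, MAP_SHARED, file_fd, 0);
    if (tmp_buf == MAP_FAILED) {
        close(file_fd);
        return -1;
    }

    ssize_t n = write(socket_fd, tmp_buf, len);   /* CPU copy into socket buffer */

    munmap(tmp_buf, len);
    close(file_fd);
    return (n == (ssize_t) len) ? 0 : -1;
}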

The purpose of mmap is to map the address of the read buffer in the kernel to the user buffer in user space so that the kernel buffer can be shared with the application memory. The process of copying data from the kernel read buffer to the user buffer is omitted. However, the kernel read buffer still needs to copy data to the kernel write buffer, as shown in the following figure:

With the zero-copy approach based on the mmap + write system calls, the whole copy process involves four context switches, one CPU copy, and two DMA copies. The user program reads and writes data as follows:

  1. The user process makes a system call to the kernel through the mmap() function, and the context changes from user space to kernel space.
  2. A memory address mapping is established between the read buffer in kernel space and the user buffer in user space.
  3. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer in kernel space.
  4. The context switches from kernel space to user space, and the mmap system call returns.
  5. The user process makes a system call to the kernel through the write() function, and the context switches from user space to kernel space.
  6. The CPU copies data from the read buffer into the socket buffer.
  7. The CPU uses a DMA controller to copy data from the socket buffer to the network card for data transfer.
  8. The context switches from kernel space to user space, and the write system call execution returns.

The main use of mmap is to improve I/O performance, especially for large files. For small files, memory-mapped files can waste space through fragmentation, because memory mappings are always aligned to page boundaries with a minimum unit of 4 KB: a 5 KB file is mapped onto 8 KB of memory, wasting 3 KB.

Although mmap saves one copy and improves efficiency, it has hidden problems. When a file is mmap'ed and then truncated by another process, a write that touches the now-invalid address is terminated with SIGBUS. SIGBUS kills the process by default and produces a core dump, so the server may be terminated unexpectedly.

6.3. sendfile

The sendfile system call was introduced in Linux kernel 2.1 to simplify transferring data between two channels over the network. The sendfile system call not only reduces the number of CPU copies but also reduces the number of context switches. Its pseudocode is as follows:

sendfile(socket_fd, file_fd, len);
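The actual Linux signature is sendfile(out_fd, in_fd, offset, count). The following sketch, which assumes socket_fd is an open, connected socket, shows how the simplified pseudocode above maps onto it:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Transfer an entire file to a socket inside the kernel; user space never
   sees the data, so there is no user/kernel CPU copy of the payload. */
int send_file_zero_copy(const char *path, int socket_fd) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }

    off_t offset = 0;
    size_t remaining = (size_t) st.st_size;
    while (remaining > 0) {
        /* The kernel advances 'offset' by the number of bytes it sent. */
        ssize_t sent = sendfile(socket_fd, file_fd, &offset, remaining);
        if (sent <= 0)
            break;
        remaining -= (size_t) sent;
    }
    close(file_fd);
    return (remaining == 0) ? 0 : -1;
}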

With the sendfile system call, data is transferred by I/O inside the kernel, eliminating the back-and-forth copying between user space and kernel space. Unlike mmap memory mapping, the I/O data in a sendfile call is completely invisible to user space; it is purely a data transfer.

With the zero-copy approach based on the sendfile system call, the whole copy process involves two context switches, one CPU copy, and two DMA copies. The user program reads and writes data as follows:

  1. The user process makes a system call to the kernel through sendfile(), and the context switches from user space to kernel space.
  2. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer in kernel space.
  3. The CPU copies data from the read buffer into the socket buffer.
  4. The CPU uses the DMA controller to copy data from the socket buffer to the network card for transmission.
  5. The context switches from kernel space back to user space, and the sendfile system call returns.

Compared with mmap memory mapping, sendfile has two fewer context switches but still needs one CPU copy. The drawback of sendfile is that the user program cannot modify the data; it merely completes a data transfer.

6.4. sendfile + DMA gather copy

The Linux 2.4 kernel modified the sendfile system call and introduced the gather operation for DMA copies. It records the data descriptors (memory address and offset) of the kernel-space read buffer in the corresponding socket buffer, and DMA then copies the data in batches from the read buffer to the NIC according to those addresses and offsets, eliminating the last CPU copy left in kernel space. The pseudocode of sendfile is unchanged:

sendfile(socket_fd, file_fd, len);

With hardware support, sendfile no longer copies data from the kernel read buffer to the socket buffer. Instead, it copies only the file descriptors and data lengths of the buffer, and the DMA engine uses the gather operation to package the data in the page cache and send it to the network. The idea is essentially similar to virtual memory mapping.

With the zero-copy approach based on sendfile + DMA gather copy, the whole copy process involves two context switches, zero CPU copies, and two DMA copies. The user program reads and writes data as follows:

  1. The user process makes a system call to the kernel through sendfile(), and the context switches from user space to kernel space.
  2. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer in kernel space.
  3. The CPU copies the file descriptors and data lengths of the read buffer into the socket buffer.
  4. Based on the copied file descriptors and data lengths, the DMA controller's gather/scatter operation copies the data in batches directly from the kernel read buffer to the network adapter for transmission.
  5. The context switches from kernel space back to user space, and the sendfile system call returns.

sendfile + DMA gather copy still does not allow the user program to modify the data, and it requires hardware support; it is only suitable for copying data from a file to a socket.

6.5. splice

sendfile is only suitable for copying data from a file to a socket and requires hardware support, which limits its use. Linux introduced the splice system call in 2.6.17; it requires no hardware support and enables zero-copy transfer between two file descriptors. The pseudocode for splice is as follows:

splice(fd_in, off_in, fd_out, off_out, len, flags);

The splice system call sets up a pipe between the kernel read buffer and the socket buffer in kernel space, avoiding the CPU copy between the two.

With the zero-copy approach based on the splice system call, the whole copy process involves two context switches, zero CPU copies, and two DMA copies. The user program reads and writes data as follows:

  1. The splice() function is used to make system calls to the kernel, and the context is changed from user space to kernel space.
  2. The CPU uses DMA controllers to copy data from main memory or hard disk to the read buffer of the kernel space.
  3. The CPU builds pipelines between the read buffer and the socket buffer in the kernel space.
  4. The CPU uses a DMA controller to copy data from the socket buffer to the network card for data transfer.
  5. Context changes from kernel space to user space, and splice system call execution returns.

The splice copy method also has the problem that the user program cannot modify the data. In addition, it uses Linux’s pipe buffering mechanism and can be used to transfer data between any two file descriptors, but one of its two file descriptor parameters must be a pipe device.
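Because one end must be a pipe, a typical pattern is file -> pipe -> socket with two splice() calls, as in the sketch below (socket_fd is assumed to be an open socket; error handling is trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move 'len' bytes from a file to a socket through an intermediate pipe.
   Both splice() calls stay inside the kernel, so no CPU copy to user space. */
int send_file_splice(int file_fd, int socket_fd, size_t len) {
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    int rc = 0;
    while (len > 0 && rc == 0) {
        /* file -> pipe: page references are moved into the pipe, not copied */
        ssize_t in = splice(file_fd, NULL, pipefd[1], NULL, len,
                            SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0) {
            rc = -1;
            break;
        }
        ssize_t left = in;
        while (left > 0) {
            /* pipe -> socket */
            ssize_t out = splice(pipefd[0], NULL, socket_fd, NULL,
                                 (size_t) left, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0) {
                rc = -1;
                break;
            }
            left -= out;
        }
        len -= (size_t) in;
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}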

6.6. Copy on write

In some cases, the kernel buffer may be shared by multiple processes, and if a process wants to write the shared area, since write does not provide any locking operations, the data in the shared area will be corrupted. The introduction of copy-on-write is used by Linux to protect data.

Copy-on-write means that when multiple processes share the same piece of data, if one process needs to modify the data, it needs to copy it into its own process address space. This does not affect the operation of other processes on the data. Each process will copy the data when it needs to modify it, so it is called copy-on-write. This approach reduces overhead to the extent that if a process never changes the data it accesses, it will never need to copy.
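A small sketch of how the kernel applies copy-on-write to fork(): parent and child initially share the same physical pages, and a page is duplicated only when one of them writes to it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t len = 16 * 1024 * 1024;
    char *data = malloc(len);
    if (data == NULL)
        return 1;
    memset(data, 'A', len);        /* physical pages are now populated */

    pid_t pid = fork();            /* child initially shares the parent's pages */
    if (pid == 0) {
        /* The child's first write faults, and the kernel copies only the
           touched page into the child's address space (copy-on-write). */
        data[0] = 'B';
        printf("child sees: %c\n", data[0]);        /* prints 'B' */
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees: %c\n", data[0]);     /* prints 'A' */
    free(data);
    return 0;
}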

6.7. Buffer sharing

The buffer sharing method completely overwrites the traditional I/O operation, because the traditional I/O interface is based on data copy. To avoid copying, the original set of interfaces have to be removed and rewritten, so this method is a relatively comprehensive zero-copy technology. Fbuf (Fast Buffer) is a mature solution implemented on Solaris.

The idea of FBUF is that each process maintains a buffer pool that can be mapped to both user space and kernel space, and the kernel and user share the buffer pool, thus avoiding a series of copying operations.

The difficulty of buffer sharing is that managing shared buffer pools requires close cooperation between applications, network software, and device drivers, and how to adapt the API is still experimental and immature.

7. Linux zero-copy comparison

Whether traditional I/O or zero copy is used, the two DMA copies are always required, because both of them are performed by hardware. The following summarizes how the I/O copy methods above differ in the number of CPU copies, the number of DMA copies, the system calls involved, and the number of context switches.

Copy method                       CPU copies   DMA copies   System calls    Context switches
Traditional (read + write)        2            2            read / write    4
Memory mapping (mmap + write)     1            2            mmap / write    4
sendfile                          1            2            sendfile        2
sendfile + DMA gather copy        0            2            sendfile        2
splice                            0            2            splice          2

8. Java NIO zero-copy implementation

In Java NIO, a Channel corresponds to a buffer in the operating system's kernel space, while a Buffer corresponds to a user buffer in the operating system's user space.

  • A Channel is full-duplex (two-way transmission) and can be either a read buffer or a socket buffer.
  • Buffers come in two kinds: HeapBuffer, which lives in the JVM heap, and DirectBuffer, which is user-mode memory allocated off-heap via malloc().

Out-of-heap memory (DirectBuffer) has to be reclaimed manually by the application after use, whereas HeapBuffer data may be moved or reclaimed automatically during GC. Therefore, when a HeapBuffer is used to read or write data, NIO first copies the HeapBuffer data into a temporary DirectBuffer in native memory, to avoid problems caused by GC relocating the buffer. This copy goes through sun.misc.Unsafe.copyMemory(), whose implementation behaves like memcpy(). Finally, the memory address of the data inside the temporary DirectBuffer is passed to the I/O function, so the I/O read or write no longer needs to touch Java objects.

8.1. MappedByteBuffer

MappedByteBuffer is NIO's implementation of zero copy based on memory mapping (mmap); it inherits from ByteBuffer. FileChannel defines a map() method that maps size bytes of a file, starting at position, into a memory-mapped region. The abstract map() method is defined in FileChannel as follows:

public abstract MappedByteBuffer map(MapMode mode, long position, long size)
        throws IOException;
  • Mode: specifies the access mode of MappedByteBuffer to memory image files, including READ_ONLY, READ_WRITE, and copy-on-write (PRIVATE).
  • Position: start address of the file mapping, corresponding to the start address of the MappedByteBuffer.
  • Size: indicates the size of the MappedByteBuffer, the number of bytes after position.

MappedByteBuffer provides force(), load(), and isLoaded() methods:

  • force(): for a buffer in READ_WRITE mode, forcibly flushes changes made to the buffer contents to the local file.
  • load(): loads the contents of the buffer into physical memory and returns a reference to the buffer.
  • isLoaded(): returns true if the buffer contents are resident in physical memory, false otherwise.

Here is an example of using MappedByteBuffer to read and write files:

private final static String CONTENT = "Zero copy implemented by MappedByteBuffer";
private final static String FILE_NAME = "/mmap.txt";
private final static String CHARSET = "UTF-8";

  • Write file data: open a FileChannel with read, write, and truncate permissions, map it to a writable memory buffer mappedByteBuffer, write the target data into the mappedByteBuffer, and force the buffer changes to the local file with the force() method.
@Test
public void writeToFileByMappedByteBuffer() {
    Path path = Paths.get(getClass().getResource(FILE_NAME).getPath());
    byte[] bytes = CONTENT.getBytes(Charset.forName(CHARSET));
    try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        MappedByteBuffer mappedByteBuffer = fileChannel.map(READ_WRITE, 0, bytes.length);
        if (mappedByteBuffer != null) {
            mappedByteBuffer.put(bytes);
            mappedByteBuffer.force();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
  • Read file data: open a FileChannel with read-only permission, map it to a readable memory buffer mappedByteBuffer, and read the byte array from the mappedByteBuffer to obtain the file data.
@Test
public void readFromFileByMappedByteBuffer() {
    Path path = Paths.get(getClass().getResource(FILE_NAME).getPath());
    int length = CONTENT.getBytes(Charset.forName(CHARSET)).length;
    try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
        MappedByteBuffer mappedByteBuffer = fileChannel.map(READ_ONLY, 0, length);
        if (mappedByteBuffer != null) {
            byte[] bytes = new byte[length];
            mappedByteBuffer.get(bytes);
            String content = new String(bytes, StandardCharsets.UTF_8);
            assertEquals(content, "Zero copy implemented by MappedByteBuffer");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The following describes the underlying implementation of the map() method. map() is an abstract method of java.nio.channels.FileChannel, implemented by the subclass sun.nio.ch.FileChannelImpl. The core code related to memory mapping is shown below:

public MappedByteBuffer map(MapMode mode, long position, long size) throws IOException {
    int pagePosition = (int)(position % allocationGranularity);
    long mapPosition = position - pagePosition;
    long mapSize = size + pagePosition;
    try {
        addr = map0(imode, mapPosition, mapSize);
    } catch (OutOfMemoryError x) {
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException y) {
            Thread.currentThread().interrupt();
        }
        try {
            addr = map0(imode, mapPosition, mapSize);
        } catch (OutOfMemoryError y) {
            throw new IOException("Map failed", y);
        }
    }
    int isize = (int)size;
    Unmapper um = new Unmapper(addr, mapSize, isize, mfd);
    if ((!writable) || (imode == MAP_RO)) {
        return Util.newMappedByteBufferR(isize, addr + pagePosition, mfd, um);
    } else {
        return Util.newMappedByteBuffer(isize, addr + pagePosition, mfd, um);
    }
}

The map() method allocates a block of virtual memory to a file using the local method map0() as its memory-mapped region, and returns the starting address of the memory-mapped region.

  1. File mapping requires creating a MappedByteBuffer instance on the Java heap. If the first mapping attempt fails with an OutOfMemoryError, garbage collection is triggered manually, the thread sleeps for 100 ms, and the mapping is retried; if it still fails, an IOException is thrown.
  2. A DirectByteBuffer instance is created by reflection through Util's newMappedByteBuffer() (read-write) or newMappedByteBufferR() (read-only) method; DirectByteBuffer is a subclass of MappedByteBuffer.

The map() method returns the starting address of the memory-mapped region, and the data at a given position is accessed as (starting address + offset). Reads and writes then go through sun.misc.Unsafe's getByte() and putByte() methods rather than through read() or write() calls.

private native long map0(int prot, long position, long mapSize) throws IOException;


The above is the definition of the native method map0(), which calls the low-level C implementation through JNI (Java Native Interface). The native function Java_sun_nio_ch_FileChannelImpl_map0 is implemented in the FileChannelImpl.c source file under native/sun/nio/ch/ in the JDK source package.

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this,
                                     jint prot, jlong off, jlong len)
{
    void *mapAddress = 0;
    jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
    jint fd = fdval(env, fdo);
    int protections = 0;
    int flags = 0;

    if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
        protections = PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
        protections =  PROT_WRITE | PROT_READ;
        flags = MAP_PRIVATE;
    }

    mapAddress = mmap64(
        0,                    /* Let OS decide location */
        len,                  /* Number of bytes to map */
        protections,          /* File permissions */
        flags,                /* Changes are shared */
        fd,                   /* File descriptor of mapped file */
        off);                 /* Offset into file */

    if (mapAddress == MAP_FAILED) {
        if (errno == ENOMEM) {
            JNU_ThrowOutOfMemoryError(env, "Map failed");
            return IOS_THROWN;
        }
        return handle(env, -1, "Map failed");
    }

    return ((jlong) (unsigned long) mapAddress);
}


It can be seen that the map0() function ultimately issues the memory-mapping call to the underlying Linux kernel through mmap64(). The prototype of mmap64() is as follows:

#include <sys/mman.h>

void *mmap64(void *addr, size_t len, int prot, int flags, int fd, off64_t offset);


The following describes the meaning and the possible values of each parameter of mmap64():

  • Addr: the suggested start address of the file mapping in the user process's address space. It is only a hint and can be set to 0 or NULL, in which case the kernel chooses the actual start address. When flags contains MAP_FIXED, addr is mandatory and must refer to an existing address.
  • Len: The length of bytes for the file to be memory-mapped
  • Prot: Controls the access permission of user processes to the memory mapped area
    • PROT_READ: indicates the read permission
    • PROT_WRITE: indicates the write permission
    • PROT_EXEC: execution permission
    • PROT_NONE: no permission
  • Flags: Controls whether changes to a memory mapping area are shared by multiple processes
    • MAP_PRIVATE: Changes to memory-mapped data are not reflected in the real file. The copy-on-write mechanism is used when data changes occur
    • MAP_SHARED: Changes to a memory map are synchronized to a real file and are visible to processes that share the memory map
    • MAP_FIXED: not recommended; in this mode the addr parameter must specify an existing address
  • Fd: indicates the file descriptor. Each map operation increases the reference count of the file by 1, and each unmap operation or ending the process decreases the reference count by 1
  • Offset: indicates the offset of the file. The position of the file to be mapped, the amount shifted back from the starting address of the file

Here’s a summary of the features and drawbacks of MappedByteBuffer:

  • MappedByteBuffer uses off-heap virtual memory, so the size of the mapping is not limited by the JVM's -Xmx parameter, although it is still limited.
  • If the file exceeds the Integer.MAX_VALUE byte limit, the position parameter can be used to remap different parts of the file.
  • MappedByteBuffer performs well for large files, but it also has problems with memory usage and with the uncertainty of file closing: a file opened through MappedByteBuffer is only closed during garbage collection, and that point in time is not deterministic.
  • MappedByteBuffer offers the mmap() mechanism for mapping file memory but also needs unmap() to release the mapped memory; however, unmap() is a private method in FileChannelImpl and cannot be invoked directly. Therefore, a user program has to free the mapped memory region manually by invoking the clean() method of the sun.misc.Cleaner class through Java reflection.
public static void clean(final Object buffer) throws Exception {
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        try {
            Method getCleanerMethod = buffer.getClass().getMethod("cleaner", new Class[0]);
            getCleanerMethod.setAccessible(true);
            Cleaner cleaner = (Cleaner) getCleanerMethod.invoke(buffer, new Object[0]);
            cleaner.clean();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    });
}

8.2. DirectByteBuffer

References to DirectByteBuffer objects live in the Java heap, and the JVM can allocate and reclaim the DirectByteBuffer objects themselves. The static allocateDirect() factory method (defined on ByteBuffer) is used to create DirectByteBuffer instances and allocate their memory:

public static ByteBuffer allocateDirect(int capacity) {
    return new DirectByteBuffer(capacity);
}


The byte buffer inside DirectByteBuffer lies in off-heap (user-mode) direct memory, which is allocated via the Unsafe native method allocateMemory(); allocateMemory() in turn calls the operating system's malloc() function.

DirectByteBuffer(int cap) {
    super(-1, 0, cap, cap);
    boolean pa = VM.isDirectMemoryPageAligned();
    int ps = Bits.pageSize();
    long size = Math.max(1L, (long)cap + (pa ? ps : 0));
    Bits.reserveMemory(size, cap);

    long base = 0;
    try {
        base = unsafe.allocateMemory(size);
    } catch (OutOfMemoryError x) {
        Bits.unreserveMemory(size, cap);
        throw x;
    }
    unsafe.setMemory(base, size, (byte) 0);
    if (pa && (base % ps != 0)) {
        address = base + ps - (base & (ps - 1));
    } else {
        address = base;
    }
    cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
    att = null;
}


In addition, when a DirectByteBuffer is initialized, a Cleaner is created with a Deallocator task; the Deallocator reclaims the direct memory through Unsafe's freeMemory() method, which calls the operating system's free() function underneath.

private static class Deallocator implements Runnable {
    private static Unsafe unsafe = Unsafe.getUnsafe();

    private long address;
    private long size;
    private int capacity;

    private Deallocator(long address, long size, int capacity) {
        assert (address != 0);
        this.address = address;
        this.size = size;
        this.capacity = capacity;
    }

    public void run() {
        if (address == 0) {
            return;
        }
        unsafe.freeMemory(address);
        address = 0;
        Bits.unreserveMemory(size, capacity);
    }
}

Because the local memory allocated by DirectByteBuffer is outside the JVM's control, reclaiming direct memory differs from reclaiming heap memory, and using it incorrectly can easily cause an OutOfMemoryError.

Having said that, what does DirectByteBuffer have to do with zero copy? As mentioned earlier, when MappedByteBuffer performs memory mapping, its map() method creates the buffer instance via Util.newMappedByteBuffer(), which is implemented as follows:

static MappedByteBuffer newMappedByteBuffer(int size, long addr, FileDescriptor fd,
                                            Runnable unmapper) {
    MappedByteBuffer dbb;
    if (directByteBufferConstructor == null)
        initDBBConstructor();
    try {
        dbb = (MappedByteBuffer)directByteBufferConstructor.newInstance(
            new Object[] { new Integer(size), new Long(addr), fd, unmapper });
    } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
        throw new InternalError(e);
    }
    return dbb;
}

private static void initDBBRConstructor() {
    AccessController.doPrivileged(new PrivilegedAction<Void>() {
        public Void run() {
            try {
                Class<?> cl = Class.forName("java.nio.DirectByteBufferR");
                Constructor<?> ctor = cl.getDeclaredConstructor(
                    new Class<?>[] { int.class, long.class, FileDescriptor.class,
                                     Runnable.class });
                ctor.setAccessible(true);
                directByteBufferRConstructor = ctor;
            } catch (ClassNotFoundException | NoSuchMethodException |
                     IllegalArgumentException | ClassCastException x) {
                throw new InternalError(x);
            }
            return null;
        }});
}


DirectByteBuffer is a concrete implementation class of MappedByteBuffer. In fact, the Util.newMappedByteBuffer() method obtains the constructor of DirectByteBuffer by reflection and then creates a DirectByteBuffer instance through a dedicated constructor used for memory mapping:

protected DirectByteBuffer(int cap, long addr, FileDescriptor fd, Runnable unmapper) {
    super(-1, 0, cap, cap, fd);
    address = addr;
    cleaner = Cleaner.create(this, unmapper);
    att = null;
}


Therefore, besides allowing the allocation of the operating system's direct memory, DirectByteBuffer itself also has file memory-mapping capability, which is not covered in more detail here. What matters is that DirectByteBuffer provides, on top of MappedByteBuffer, random get() and put() operations on the memory-image file.

  • Random reads of memory image files
public byte get() {
    return ((unsafe.getByte(ix(nextGetIndex()))));
}

public byte get(int i) {
    return ((unsafe.getByte(ix(checkIndex(i)))));
}

  • Random write operations to memory image files
public ByteBuffer put(byte x) {
    unsafe.putByte(ix(nextPutIndex()), ((x)));
    return this;
}

public ByteBuffer put(int i, byte x) {
    unsafe.putByte(ix(checkIndex(i)), ((x)));
    return this;
}


Random reads and writes of the memory-image file are addressed through the ix() method, which computes a pointer address from the start address of the memory-mapped space plus the given offset i. Unsafe then reads or writes the data at that pointer inside get() and put().

private long ix(int i) {
    return address + ((long)i << 0);
}


8.3. FileChannel

FileChannel is a channel for reading, writing, mapping, and manipulating files, and it is thread-safe in concurrent environments. The getChannel() method of FileInputStream, FileOutputStream, or RandomAccessFile creates and opens a file channel. FileChannel defines two abstract methods, transferFrom() and transferTo(), which transfer data by establishing a connection between two channels.

  • transferTo(): writes data from the current FileChannel's file to a destination WritableByteChannel.
public abstract long transferTo(long position, long count, WritableByteChannel target)
        throws IOException;

  • transferFrom(): reads data from a source ReadableByteChannel into the file of the current FileChannel.
public abstract long transferFrom(ReadableByteChannel src, long position, long count)
        throws IOException;


The following is an example of FileChannel using the transferTo() and transferFrom() methods for data transfer:

private static final String CONTENT = "Zero copy implemented by FileChannel";
private static final String SOURCE_FILE = "/source.txt";
private static final String TARGET_FILE = "/target.txt";
private static final String CHARSET = "UTF-8";


In the example, source.txt and target.txt files are created under the classpath root, and initialization data is written to the source file source.txt.

@Before
public void setup() {
    Path source = Paths.get(getClassPath(SOURCE_FILE));
    byte[] bytes = CONTENT.getBytes(Charset.forName(CHARSET));
    try (FileChannel fromChannel = FileChannel.open(source, StandardOpenOption.READ,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        fromChannel.write(ByteBuffer.wrap(bytes));
    } catch (IOException e) {
        e.printStackTrace();
    }
}

For the transferTo() method, the destination channel toChannel can be any one-way byte write channel WritableByteChannel; For the transferFrom() method, the source fromChannel can be any one-way byte read channel ReadableByteChannel. FileChannel, SocketChannel, and DatagramChannel implement WritableByteChannel and ReadableByteChannel interfaces, and are bidirectional channels that support both read and write. For testing purposes, an example of channel-to-channel data transfer based on FileChannel is shown below.

  • transferTo(): copy data from fromChannel to toChannel
@Test
public void transferTo() throws Exception {
    try (FileChannel fromChannel = new RandomAccessFile(
             getClassPath(SOURCE_FILE), "rw").getChannel();
         FileChannel toChannel = new RandomAccessFile(
             getClassPath(TARGET_FILE), "rw").getChannel()) {
        long position = 0L;
        long offset = fromChannel.size();
        fromChannel.transferTo(position, offset, toChannel);
    }
}
  • transferFrom(): copy data from fromChannel to toChannel
@Test
public void transferFrom() throws Exception {
    try (FileChannel fromChannel = new RandomAccessFile(
             getClassPath(SOURCE_FILE), "rw").getChannel();
         FileChannel toChannel = new RandomAccessFile(
             getClassPath(TARGET_FILE), "rw").getChannel()) {
        long position = 0L;
        long offset = fromChannel.size();
        toChannel.transferFrom(fromChannel, position, offset);
    }
}

The underlying implementation of the transferTo() and transferFrom() methods is introduced below. Both are abstract methods of java.nio.channels.FileChannel, implemented by the subclass sun.nio.ch.FileChannelImpl, and both realize data transfer based on sendfile. FileChannelImpl.java defines three constants that indicate whether the current operating system kernel supports sendfile and its related features:

private static volatile boolean transferSupported = true;
private static volatile boolean pipeSupported = true;
private static volatile boolean fileSupported = true;

  • transferSupported: indicates whether the current system kernel supports the sendfile() call; defaults to true.
  • pipeSupported: indicates whether the current system kernel supports sendfile() calls where the target file descriptor (fd) is a pipe; defaults to true.
  • fileSupported: indicates whether the current system kernel supports sendfile() calls where the target file descriptor (fd) is a regular file; defaults to true.

The following uses the source-code implementation of transferTo() as an example. FileChannelImpl first executes the transferToDirectly() method to try to copy the data in sendfile zero-copy mode. If the system kernel does not support sendfile, it executes transferToTrustedChannel() to perform a memory-mapped (mmap) zero-copy transfer; in that case the destination channel must be of type FileChannelImpl or SelChImpl. If both of the above fail, transferToArbitraryChannel() falls back to traditional I/O: it allocates a temporary DirectBuffer, reads data from the source FileChannel into the DirectBuffer, and then writes it to the destination WritableByteChannel.

public long transferTo(long position, long count, WritableByteChannel target)
        throws IOException {
    // Calculate the file size
    long sz = size();
    // Verify the start position
    if (position > sz)
        return 0;
    int icount = (int)Math.min(count, Integer.MAX_VALUE);
    // Check the offset
    if ((sz - position) < icount)
        icount = (int)(sz - position);

    long n;

    if ((n = transferToDirectly(position, icount, target)) >= 0)
        return n;

    if ((n = transferToTrustedChannel(position, icount, target)) >= 0)
        return n;

    return transferToArbitraryChannel(position, icount, target);
}


Next, look at the implementation of transferToDirectly(), which is the essence of the sendfile-based zero copy performed by transferTo(). As shown below, transferToDirectly() first obtains the file descriptor targetFD of the destination WritableByteChannel, acquires the position lock if required, and then executes transferToDirectlyInternal().

private long transferToDirectly(long position, int icount, WritableByteChannel target)
        throws IOException {
    // Omit the process of getting targetFD from target
    if (nd.transferToDirectlyNeedsPositionLock()) {
        synchronized (positionLock) {
            long pos = position();
            try {
                return transferToDirectlyInternal(position, icount,
                        target, targetFD);
            } finally {
                position(pos);
            }
        }
    } else {
        return transferToDirectlyInternal(position, icount, target, targetFD);
    }
}

Finally, transferToDirectlyInternal() calls the native method transferTo0(), which attempts to transfer the data with sendfile. If the system kernel does not support sendfile at all, as on Windows, it returns UNSUPPORTED and transferSupported is set to false. If the kernel does not support some of sendfile's features, for example older Linux kernels that do not support the DMA gather copy operation, it returns UNSUPPORTED_CASE and pipeSupported or fileSupported is set to false.

private long transferToDirectlyInternal(long position, int icount,
                                        WritableByteChannel target,
                                        FileDescriptor targetFD) throws IOException {
    assert !nd.transferToDirectlyNeedsPositionLock() || Thread.holdsLock(positionLock);
    long n = -1;
    int ti = -1;
    try {
        begin();
        ti = threads.add();
        if (!isOpen())
            return -1;
        do {
            n = transferTo0(fd, position, icount, targetFD);
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        if (n == IOStatus.UNSUPPORTED_CASE) {
            if (target instanceof SinkChannelImpl)
                pipeSupported = false;
            if (target instanceof FileChannelImpl)
                fileSupported = false;
            return IOStatus.UNSUPPORTED_CASE;
        }
        if (n == IOStatus.UNSUPPORTED) {
            transferSupported = false;
            return IOStatus.UNSUPPORTED;
        }
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti);
        end(n > -1);
    }
}

The native method transferTo0() invokes the low-level C function through JNI (Java Native Interface). The native function Java_sun_nio_ch_FileChannelImpl_transferTo0 is also found in the FileChannelImpl.c source file under native/sun/nio/ch/ in the JDK source package. The JNI function Java_sun_nio_ch_FileChannelImpl_transferTo0() is precompiled for different systems via conditional compilation. The following is the call wrapper that the JDK provides for transferTo() based on the Linux kernel:

#if defined(__linux__) || defined(__solaris__)
#include <sys/sendfile.h>
#elif defined(_AIX)
#include <sys/socket.h>
#elif defined(_ALLBSD_SOURCE)
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define lseek64 lseek
#define mmap64 mmap
#endif

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_transferTo0(JNIEnv *env, jobject this,
                                            jobject srcFDO,
                                            jlong position, jlong count,
                                            jobject dstFDO)
{
    jint srcFD = fdval(env, srcFDO);
    jint dstFD = fdval(env, dstFDO);

#if defined(__linux__)
    off64_t offset = (off64_t)position;
    jlong n = sendfile64(dstFD, srcFD, &offset, (size_t)count);
    return n;
#elif defined(__solaris__)
    result = sendfilev64(dstFD, &sfv, 1, &numBytes);	
    return result;
#elif defined(__APPLE__)
    result = sendfile(srcFD, dstFD, position, &numBytes, NULL, 0);
    return result;
#endif
}


For Linux, Solaris, and Apple systems, transferTo0() ultimately performs the sendfile64 (or sendfile) system call to complete the zero-copy operation. The prototype of sendfile64() is as follows:

#include <sys/sendfile.h>

ssize_t sendfile64(int out_fd, int in_fd, off_t *offset, size_t count);


Here is a brief description of the parameters of sendfile64():

  • Out_fd: indicates the file descriptor to be written
  • In_fd: indicates the file descriptor to be read
  • Offset: specifies the read position of the file stream corresponding to in_fd. If it is empty, the file stream starts from the start position by default
  • Count: Specifies the number of bytes transferred between the file descriptors in_fd and out_fd

Before Linux 2.6.33, out_fd had to be a socket; since Linux 2.6.33 it can be any file. In other words, sendfile64() can not only transfer files over the network but also zero-copy local files.

9. Other zero-copy implementations

9.1. Netty Zero copy

Zero copy in Netty is different from zero copy at the operating system level mentioned above. Netty zero copy is completely based on the user mode (Java level). It is more oriented towards the concept of data operation optimization, which is embodied in the following aspects:

  • Netty wraps the java.nio.channels.FileChannel transferTo() method in the DefaultFileRegion class, so that during file transfer the data in the file buffer can be sent directly to the destination Channel.
  • ByteBuf can wrap a byte array, a ByteBuf, or a ByteBuffer into a ByteBuf object via the wrap operation, thereby avoiding a copy.
  • ByteBuf supports slice, so a ByteBuf can be split into multiple ByteBufs that share the same storage region, avoiding memory copies.
  • Netty provides the CompositeByteBuf class, which combines multiple ByteBufs into one logical ByteBuf and avoids copying between the individual ByteBufs.

Item 1 is a zero-copy operation at the operating system level, while the following three items can only be regarded as data operation optimization at the user level.

9.2. RocketMQ vs. Kafka

RocketMQ uses the mmap + write zero-copy approach, which suits the persistence and transmission of small, business-level message blocks. Kafka uses the sendfile zero-copy approach, which suits high-throughput persistence and transmission of large files such as system log messages. It is worth noting, though, that Kafka uses mmap + write for index files and sendfile for data files.

  • RocketMQ — zero-copy mode: mmap + write. Advantages: suitable for transferring small blocks of files, very efficient when called frequently. Disadvantages: cannot make good use of DMA, consumes more CPU than sendfile, and memory-safety control is complex, requiring care to avoid JVM crash problems.
  • Kafka — zero-copy mode: sendfile. Advantages: can use DMA, consumes little CPU, efficient for large-file transfer, no memory-safety issues. Disadvantages: less efficient than mmap for small blocks, and the transfer can only use BIO rather than NIO.

summary

This article begins by detailing the concepts of physical and virtual memory in Linux, kernel space and user space, and the internal Linux hierarchy. On that basis, it analyzes and compares traditional I/O with zero copy, and then introduces several zero-copy mechanisms provided by the Linux kernel, comparing memory-mapped mmap, sendfile, sendfile + DMA gather copy, and splice in terms of system calls and copy counts. Next, it analyzes the Java NIO zero-copy implementation from the source code, including MappedByteBuffer based on mmap and FileChannel based on sendfile. Finally, it briefly describes the zero-copy mechanism in Netty and the difference between RocketMQ and Kafka in their zero-copy implementations.

This account will continue to share learning materials and articles on back-end technologies, including virtual machine basics, multithreaded programming, high-performance frameworks, asynchronous, caching and messaging middleware, distributed and microservices, architecture learning and progression.