Some of the most common questions you'll be asked in an interview are: Why is RocketMQ fast? Why is Kafka fast? What is mmap?

Part of the answer to all of these questions is zero copy. There are other reasons too, but today's topic is zero copy.

Traditional IO

Before we start talking about zero copy, we need to get an idea of the traditional IO approach.

With the traditional IO approach, the underlying layer is implemented by calling read() and write().

read() reads data from the hard disk into the kernel buffer and then copies it into the user buffer; write() copies it into the socket buffer, from which it is finally sent to the NIC.

In the whole process, there are 4 context switches between user mode and kernel mode and 4 data copies. The specific steps are as follows (a short C sketch of this path follows the list):

  1. The user process calls read(), making a system call to the operating system; the context switches from user mode to kernel mode
  2. The DMA controller copies data from the hard disk to the read buffer
  3. The CPU copies the data from the read buffer to the application buffer, the context switches from kernel mode back to user mode, and read() returns
  4. The user process calls write(), and the context switches from user mode to kernel mode
  5. The CPU copies the data from the application buffer to the socket buffer
  6. The DMA controller copies the data from the socket buffer to the NIC, the context switches from kernel mode back to user mode, and write() returns
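
To make these steps concrete, here is a minimal C sketch of the traditional read()+write() path. It assumes sockfd is an already connected socket; the helper name send_file_traditional is illustrative, not part of any real API.

```c
/* A minimal sketch of the traditional read()+write() copy path.
 * Assumes sockfd is an already connected socket; error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

ssize_t send_file_traditional(int sockfd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    char buf[64 * 1024];                  /* user-space (application) buffer */
    ssize_t n, total = 0;

    /* Each iteration: DMA copy disk -> kernel read buffer, CPU copy kernel ->
     * buf (read), CPU copy buf -> socket buffer (write), DMA copy socket
     * buffer -> NIC. Two system calls per iteration = four mode switches. */
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        ssize_t written = 0;
        while (written < n) {
            ssize_t w = write(sockfd, buf + written, n - written);
            if (w < 0) { perror("write"); close(fd); return -1; }
            written += w;
        }
        total += written;
    }

    close(fd);
    return total;
}
```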

So what do user mode and kernel mode refer to here? And what is a context switch?

In simple terms, user space refers to the running space of user processes, and kernel space is the running space of the kernel.

The process is kernel-mode if it is running in kernel space and user-mode if it is running in user-space.

For security reasons, they are isolated from each other, and context switching between user and kernel modes is time-consuming.

As we can see from the above, even a simple IO operation results in four context switches, which undoubtedly has a significant impact on performance in high-concurrency scenarios.

So what is a DMA copy?

Without DMA, the CPU has to issue the instructions for every I/O operation itself. However, I/O devices are far slower than the CPU, so the CPU ends up spending a large amount of time waiting for I/O to complete.

Therefore, DMA (Direct Memory Access) technology came into being. In essence, it is an independent chip on the motherboard that transfers data between memory and I/O devices, reducing the time the CPU spends waiting.

However, no matter which component does the copying, frequent copies still hurt performance.

Zero copy

Zero-copy technology means that when a computer performs an operation, the CPU does not need to copy data from one area of memory to another specific area. This technology is usually used to save CPU cycles and memory bandwidth when transferring files over the network.

So zero copy does not mean there is no data copying at all; it means reducing the number of user/kernel mode switches and CPU copies.

Here, just a few of the most common zero-copy technologies are discussed.

mmap+write

mmap+write simply replaces the read() call in read+write with mmap(), saving one CPU copy.

mmap() works by mapping the kernel read buffer into the user process's address space, so the kernel buffer is shared with the application, eliminating the CPU copy from the read buffer to the user buffer.

In the whole process, there are 4 context switches between user mode and kernel mode and 3 copies. The specific steps are as follows (a short C sketch follows the list):

  1. The user process calls mmap(), making a system call to the operating system; the context switches from user mode to kernel mode
  2. The DMA controller copies data from the hard disk to the read buffer
  3. The context switches from kernel mode back to user mode, and the mmap() call returns
  4. The user process calls write(), and the context switches from user mode to kernel mode
  5. The CPU copies the data from the read buffer to the socket buffer
  6. The DMA controller copies the data from the socket buffer to the NIC, the context switches from kernel mode back to user mode, and write() returns
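
For comparison, here is a minimal C sketch of the mmap()+write() path under the same assumptions (sockfd is an already connected socket; the helper name is illustrative):

```c
/* A minimal sketch of the mmap()+write() path.
 * Assumes sockfd is an already connected socket and the file is non-empty. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

ssize_t send_file_mmap(int sockfd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

    /* Map the file into the process: the kernel read buffer (page cache) is
     * shared with the application, so no CPU copy into a user buffer occurs. */
    char *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    /* write() still performs one CPU copy: page cache -> socket buffer. */
    ssize_t sent = 0;
    while (sent < st.st_size) {
        ssize_t w = write(sockfd, addr + sent, st.st_size - sent);
        if (w < 0) { perror("write"); break; }
        sent += w;
    }

    munmap(addr, st.st_size);
    close(fd);
    return sent;
}
```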

mmap saves one CPU copy, and because the user process's memory is virtual, merely mapped onto the kernel read buffer, it can also save roughly half of the memory that would otherwise be used, which makes it well suited to large file transfers.

sendfile

Like mmap+write, sendfile saves one CPU copy compared with traditional IO, and it additionally saves two context switches compared with mmap+write.

sendfile is a system call introduced after Linux kernel 2.1. With sendfile, the data is transferred entirely within kernel space, avoiding the copies between user space and kernel space. At the same time, because a single sendfile call replaces the read+write pair, one system call is saved, which means two fewer context switches.

In the whole process, there are 2 context switches between user mode and kernel mode and 3 copies. The specific steps are as follows (a short C sketch follows the list):

  1. The user process calls sendfile(), making a system call to the operating system; the context switches from user mode to kernel mode
  2. The DMA controller copies data from the hard disk to the read buffer
  3. The CPU copies the data from the read buffer to the socket buffer
  4. The DMA controller copies the data from the socket buffer to the NIC, the context switches from kernel mode back to user mode, and the sendfile() call returns
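
And a minimal C sketch of the sendfile() path (Linux-specific), again assuming sockfd is an already connected socket and using an illustrative helper name:

```c
/* A minimal sketch of the sendfile() path (Linux-specific).
 * Assumes sockfd is an already connected socket. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

ssize_t send_file_sendfile(int sockfd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

    /* Each sendfile() call moves data from the file to the socket entirely in
     * kernel space: DMA copy disk -> read buffer, CPU copy read buffer ->
     * socket buffer, DMA copy socket buffer -> NIC, with only two context
     * switches per call and no user-space buffer at all. */
    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sockfd, fd, &offset, st.st_size - offset);
        if (n < 0) { perror("sendfile"); break; }
        if (n == 0) break;                /* nothing left to send */
    }

    close(fd);
    return (ssize_t)offset;
}
```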

With sendfile, the IO data is completely invisible to user space, so it can only be used when no user-space processing of the data is required at all, for example in a static file server.

sendfile+DMA Scatter/Gather

sendfile was further optimized in Linux 2.4 by introducing support for a new hardware capability called DMA Scatter/Gather.

Instead of copying the data itself, the kernel records descriptors of the data in the read buffer (memory addresses and offsets) into the socket buffer. The DMA engine then uses these descriptors to copy the data from the read buffer directly to the NIC, eliminating the remaining CPU copy.

In the whole process, there are 2 context switches between user mode and kernel mode and 2 copies, and more importantly, no CPU copy at all. The specific flow is as follows:

  1. The user process calls sendfile(), making a system call to the operating system; the context switches from user mode to kernel mode
  2. The DMA controller uses Scatter to copy data from the hard disk into the read buffer, where it may be stored in discrete chunks
  3. The CPU writes descriptors of the data in the read buffer (addresses and lengths) into the socket buffer
  4. Based on those descriptors, the DMA controller uses Gather to copy the data from the kernel read buffer directly to the NIC
  5. The sendfile() call returns, and the context switches from kernel mode back to user mode

Like plain sendfile, with DMA Gather the data is not visible in user space, hardware support is required, and the input file descriptor must refer to a file, but there is no CPU copy at all, which greatly improves performance. From the application's point of view, the call is still the same sendfile() shown above; the optimization happens inside the kernel and the NIC.

Application scenarios

Zero-copy technology is used for both RocketMQ and Kafka scenarios described at the beginning of this article.

For MQ, it is nothing more than a producer sending data to MQ and persisting it to disk, and then a consumer reading it from MQ.

RocketMQ uses mmap+write for both of these steps, whereas Kafka uses mmap+write to persist data and sendfile to send data.

Conclusion

Because of the gap between CPU and IO speeds, DMA technology was developed; by letting the DMA controller handle data transfers, the CPU's waiting time is reduced.

The traditional IO read+write method produces two DMA copies + two CPU copies, with four context switches.

mmap+write produces two DMA copies + one CPU copy and four context switches. Memory mapping saves one CPU copy and reduces memory usage, which makes it suitable for large file transfers.

The sendfile method is a newer system call that produces 2 DMA copies + 1 CPU copy, with only 2 context switches. Because there is only one system call, context switching is reduced, but the IO data is not visible to user space, so it is suitable for static file servers.

sendfile+DMA Scatter/Gather produces 2 DMA copies, no CPU copy, and only 2 context switches. Although performance improves greatly, it requires hardware support.
