I’ll state the conclusions up front and then explain each one in turn.

  • Sequential reads and writes
  • Page Cache
  • Zero copy
  • Partition segmentation + index
  • Batch reading and writing
  • Batch compression

Sequential reads and writes

Kafka persists its log to local disk. We generally assume disk read/write performance is poor, and we might question how Kafka can guarantee performance. In fact, whether memory or disk is fast or slow depends on the access pattern: both disk and memory access can be either sequential or random. While random disk reads and writes are slow, sequential disk read/write performance is generally about three orders of magnitude higher than random disk read/write performance. In some cases, sequential disk reads and writes are even faster than random memory access.

The performance comparison diagram is from ACM Queue

Sequential disk access is the most predictable of disk usage patterns, and operating systems optimize heavily for it. Kafka exploits sequential disk reads and writes to improve performance: messages are continuously appended to the end of local log files rather than written at random positions, which significantly increases Kafka’s write throughput.
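The append-only pattern can be sketched in a few lines (a minimal illustration, not Kafka’s actual storage code; the file name and length-prefixed record format are made up for the example):

```python
import os
import struct

LOG_PATH = "demo.log"  # hypothetical file name

def append(f, payload: bytes) -> int:
    """Append one record (4-byte length prefix + payload) at the end of
    the log; return its byte offset. Writes are always sequential."""
    offset = f.tell()
    f.write(struct.pack(">I", len(payload)))
    f.write(payload)
    return offset

def read_at(path: str, offset: int) -> bytes:
    """Read one record back given the byte offset returned by append()."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack(">I", f.read(4))
        return f.read(length)

with open(LOG_PATH, "ab") as log:
    off1 = append(log, b"first message")
    off2 = append(log, b"second message")

rec = read_at(LOG_PATH, off2)
os.remove(LOG_PATH)
print(rec)  # b'second message'
```

Because writes only ever touch the current end of the file, the disk head (or SSD write path) never has to seek between records.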

One drawback of this approach is that individual records cannot be deleted in place, so Kafka does not delete data that way. It keeps the data, and each Consumer tracks an offset per partition of each Topic indicating which records it has already read.

If data is never deleted, the disk will eventually fill up, so Kafka offers two strategies for deleting data: retention based on time and retention based on partition file size. See its configuration documentation for details.
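For reference, the relevant broker-side settings look roughly like this (illustrative values, not recommendations; consult the Kafka configuration documentation for defaults and semantics):

```properties
# Delete log segments older than this many hours (time-based retention)
log.retention.hours=168

# Cap the total size retained per partition; -1 disables the size limit
log.retention.bytes=-1

# Roll to a new segment file once the current one reaches this size
log.segment.bytes=1073741824
```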

Page Cache

To optimize read and write performance, Kafka uses the operating system’s Page Cache, that is, memory managed by the operating system rather than JVM heap memory. The benefits of this are:

  1. Avoid object overhead: on the Java heap, objects carry significant memory overhead, often doubling (or more) the size of the stored data.

  2. Avoid GC problems: as the amount of data on the JVM heap grows, garbage collection becomes slower and less predictable; the system cache has no GC overhead.

Using the operating system’s Page Cache is much simpler and more reliable than caching in JVM data structures or an in-process memory cache. First, cache utilization is higher at the operating system level because it stores compact byte structures rather than individual objects. Second, the operating system has heavily optimized the Page Cache, providing write-behind, read-ahead, and flush mechanisms. Furthermore, the system cache does not disappear when the server process restarts, avoiding an in-process cache rebuild.

Because Kafka’s reads and writes go through the operating system’s Page Cache, they are essentially memory operations in the common case, which greatly improves read and write speeds.
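The division of labor is visible from user space (a sketch, not related to Kafka’s own code; the file name is made up and the timings will vary by machine): a plain write() returns as soon as the data sits in the page cache, while fsync() blocks until the kernel has actually flushed it to disk.

```python
import os
import time

PATH = "pagecache_demo.bin"  # hypothetical file name
payload = b"x" * (8 * 1024 * 1024)  # 8 MiB of dummy data

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)

t0 = time.perf_counter()
os.write(fd, payload)   # lands in the page cache and returns quickly
t_write = time.perf_counter() - t0

t0 = time.perf_counter()
os.fsync(fd)            # blocks until the kernel writes back to disk
t_fsync = time.perf_counter() - t0

os.close(fd)
size = os.path.getsize(PATH)
os.remove(PATH)

print(f"write: {t_write * 1e3:.2f} ms, fsync: {t_fsync * 1e3:.2f} ms")
```

On most machines the write() time is a small fraction of the fsync() time, which is exactly the write-behind behavior the page cache provides.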

Zero copy

To understand zero copy, let’s take a look at traditional IO

Traditional IO

With the traditional IO approach, the transfer is implemented underneath through read() and write() system calls.

Data is read from the hard disk into a kernel buffer and then copied into a user-space buffer; the application then writes it to the socket buffer, and finally it is copied to the NIC device.

The whole process involves 4 context switches between user mode and kernel mode, and 4 data copies.
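In user code this pattern is just a read/write loop. Each read() and write() below crosses the user/kernel boundary, and the kernel additionally copies between its own buffers and the device (a sketch using a destination file in place of a network socket; the file names are made up):

```python
import os

SRC, DST = "src.bin", "dst.bin"  # hypothetical file names

with open(SRC, "wb") as f:
    f.write(os.urandom(256 * 1024))

# Traditional copy: each chunk travels disk -> kernel buffer -> user
# buffer on read(), then user buffer -> kernel buffer -> device on write().
with open(SRC, "rb") as src, open(DST, "wb") as dst:
    while chunk := src.read(64 * 1024):  # user/kernel switch per call
        dst.write(chunk)                 # and another one here

same = os.path.getsize(SRC) == os.path.getsize(DST)
os.remove(SRC)
os.remove(DST)
print(same)  # True
```

Every byte passes through the user-space buffer even though the application never inspects it — that is the waste zero copy eliminates.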

Zero copy

Zero-copy technology means that when a computer performs an operation, the CPU does not need to copy data from one area of memory to another specific area. This technology is usually used to save CPU cycles and memory bandwidth when transferring files over the network.

So zero copy does not mean there is literally no copying; rather, it reduces the number of user/kernel mode switches and the number of CPU copies.

Here, we look at several common zero-copy techniques.

mmap

mmap() maps a file directly into the process’s address space, so that user space and kernel space share the kernel’s data buffer and data does not need to be copied into a separate user-space buffer. When an application writes to the mapped region, the data goes straight into the kernel’s page cache, and the operating system flushes it to the output device (the disk) asynchronously, for example on a periodic flush interval (around 30 seconds on some systems).
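Python exposes this through its standard mmap module; the sketch below (with a made-up file name) writes through a memory mapping, so the bytes land directly in the page cache with no intermediate user-space copy of the file contents:

```python
import mmap
import os

PATH = "mmap_demo.bin"  # hypothetical file name

# The file must already have the desired size before mapping it.
with open(PATH, "wb") as f:
    f.write(b"\0" * 4096)

with open(PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    mm[:5] = b"hello"   # writes go straight into the page cache
    mm.flush()          # optionally force write-back to disk now
    mm.close()

with open(PATH, "rb") as f:
    head = f.read(5)
os.remove(PATH)
print(head)  # b'hello'
```

The slice assignment looks like ordinary memory access, but the backing pages are the kernel’s — exactly the shared-buffer arrangement described above.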

sendfile

With sendfile, the data does not need to pass through the application at all and is simply transferred from one DMA device to another. In this case, the data only needs to be copied into kernel mode, never user mode, and unlike mmap there is no need for a user-space handle (file mapping) onto the kernel’s data. As shown below:

As we can see from the figure above (the output device can be a NIC or a disk driver), there are two data buffers in kernel mode. sendfile was introduced in Linux 2.1 and optimized in Linux 2.4: the data in the disk page cache no longer needs to be copied into the socket buffer. Instead, only the location and length of the data are stored in the socket buffer, and the actual data is sent by DMA directly from the page cache to the corresponding protocol engine, thus eliminating one more data copy.
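On Linux this system call is exposed to Python as os.sendfile(). The sketch below (file name made up; a local socket pair stands in for a real network socket) pushes a file to a socket without the payload ever entering a user-space buffer:

```python
import os
import socket

PATH = "sendfile_demo.bin"  # hypothetical file name
SIZE = 16 * 1024

data = os.urandom(SIZE)
with open(PATH, "wb") as f:
    f.write(data)

src_fd = os.open(PATH, os.O_RDONLY)
out_sock, in_sock = socket.socketpair()  # stands in for a network socket

# The kernel moves the bytes from the page cache to the socket directly;
# the application never holds the file contents in its own buffer.
sent = os.sendfile(out_sock.fileno(), src_fd, 0, SIZE)

received = b""
while len(received) < sent:
    received += in_sock.recv(65536)

os.close(src_fd)
out_sock.close()
in_sock.close()
os.remove(PATH)
print(sent, len(received))
```

Kafka relies on this mechanism on the consumer path: log data goes from the page cache to the network socket inside the kernel.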

Partition segmentation + index

Kafka’s messages are grouped by topic, and each topic’s data is split into partitions that are distributed across different broker nodes. Each partition corresponds to a directory on the broker’s file system, and a partition’s data is actually stored as a series of segment files. This is very much in line with the bucketing design common in distributed systems.

With this segmented design, Kafka’s messages are spread across many relatively small segments, and each file operation works directly on one segment. To optimize lookups further, Kafka by default creates an index file for each segment’s data file — the .index files on the file system. This combination of partitioning and indexing not only improves the efficiency of data reads but also increases the parallelism of data operations.
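The segment index is sparse: it maps some message offsets to byte positions in the segment file, and a lookup binary-searches the index, then scans forward from the nearest preceding entry. A minimal sketch of that lookup (the index entries here are invented for illustration):

```python
import bisect

# Hypothetical sparse index for one segment: (message offset, byte position).
# Only every Nth message gets an entry, which keeps the index small.
index = [(0, 0), (50, 4096), (100, 8192), (150, 12288)]

def locate(target_offset: int) -> tuple:
    """Return the (offset, byte position) entry to start scanning from:
    the last index entry whose offset is <= target_offset."""
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return index[i]

# To read message 130, jump to byte 8192 (entry for offset 100)
# and scan forward within the segment.
print(locate(130))  # (100, 8192)
```

Keeping the index sparse trades a short forward scan for a much smaller index that can be memory-mapped cheaply.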

Batch compression

In many cases, the bottleneck is not CPU or disk but network IO, especially for data pipelines that need to send messages between data centers over a WAN. Data compression consumes a small amount of CPU, but for Kafka the savings in network IO make it worthwhile.

  • Compressing each message individually yields a relatively poor compression ratio, so Kafka uses batch compression: multiple messages are compressed together rather than one at a time

  • Kafka allows the use of recursive message sets: batches can be transmitted in compressed form and remain compressed in the log until they are decompressed by the consumer

  • Kafka supports several compression codecs, including Gzip and Snappy
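The effect is easy to demonstrate with gzip from the standard library (a sketch; the messages are invented): compressing a batch in one pass lets the compressor exploit redundancy across messages, producing far less output than compressing each message separately.

```python
import gzip

# 100 hypothetical messages with plenty of shared structure.
messages = [
    f'{{"user": "user-{i}", "event": "page_view", "page": "/home"}}'.encode()
    for i in range(100)
]

# Per-message compression: each tiny payload pays its own gzip overhead
# and cannot share a dictionary with its neighbors.
individual = sum(len(gzip.compress(m)) for m in messages)

# Batch compression: one stream over the whole batch.
batch = len(gzip.compress(b"\n".join(messages)))

print(individual, batch)
```

The batch-compressed size comes out dramatically smaller, which is exactly why Kafka compresses message sets rather than single messages.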

The secret of Kafka’s speed is that it organizes all messages into batched files and applies sensible batch compression to reduce network IO overhead; it speeds up IO through mmap; writes are fast because each partition is only ever appended at its end; and reads are served by sending data straight out with sendfile.