- Sequential read and write
Each partition is a file. After receiving a message, Kafka appends it to the end of that file, so Kafka reads and writes data sequentially.
The main internal components of a hard disk are the platters, the actuator arm, the read/write heads, and the spindle motor. Data is stored on the platters, and reading and writing are performed by the read/write heads mounted on the actuator arm. In operation, the spindle motor rotates the platters while the actuator arm moves the head to the track being read or written.
Three factors affect disk performance:
- Seek time: The time required to move the read/write head to the correct track. The shorter the seek time, the faster the I/O operation. Currently, the average seek time of a disk ranges from 3 to 15ms.
- Rotational delay: The time required for the platter to rotate until the requested sector is under the read/write head. It depends on the rotational speed and is usually expressed as half the time of one full rotation.
- Data transfer time: The time required to complete the transfer of the requested data. It depends on the transfer rate and equals the data size divided by that rate. It is usually much smaller than the first two components and can be ignored in rough calculations.
During sequential reads and writes, the head does not need to seek between operations, so the disk can be read and written directly. In some cases, sequential disk read/write throughput can even exceed that of random memory access.
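The append-only write pattern described above can be sketched in a few lines. This is a minimal illustration, not Kafka's actual code; the file name and the 4-byte length-prefix framing are made-up assumptions.

```python
import os

# Append-only log: every message goes to the end of the segment file,
# so (on an HDD) the head never has to seek between writes.
def append_messages(path, messages):
    with open(path, "ab") as segment:            # "ab": append-only, binary
        for msg in messages:
            # Length-prefix each message so it can be parsed back later.
            segment.write(len(msg).to_bytes(4, "big") + msg)
        segment.flush()
        os.fsync(segment.fileno())               # force the data to disk

append_messages("segment.log", [b"hello", b"kafka"])
```

Because every write lands at the current end of the file, consecutive messages occupy consecutive disk blocks, which is exactly the access pattern the section above argues is fast.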
- Producer to Broker (MMAP)
mmap() maps the log file into the broker's address space, eliminating one CPU copy between user space and the kernel that a regular write() would require.
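A minimal sketch of the memory-mapped write path, using Python's `mmap` module (file name and payload are illustrative only):

```python
import mmap

# Pre-size the file, then map it into the process address space.
path = "mmap_segment.log"
size = 4096
with open(path, "wb") as f:
    f.truncate(size)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), size)
    payload = b"message-0"
    # A plain memory write: the bytes land directly in the page cache,
    # with no write() syscall and no extra user-space buffer copy.
    mm[0:len(payload)] = payload
    mm.flush()        # ask the kernel to write the dirty pages back
    mm.close()
```

The key point is the slice assignment: it is an ordinary memory store, yet it modifies the file through the page cache, which is why the user-to-kernel copy disappears.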
- Broker to Consumer (zero copy)
Before version 2.4 of the Linux kernel, the sendfile() system call still required one CPU copy, from the kernel buffer to the socket buffer.
From version 2.4 onwards, the sendfile() procedure changed slightly for network cards that support SG-DMA (scatter-gather DMA).
The process is as follows: first, DMA copies the disk data into the kernel buffer. Second, only the buffer descriptor and data length are appended to the socket buffer, so the NIC's SG-DMA controller can copy the data directly from the kernel buffer into the NIC. This eliminates the copy from the kernel buffer to the socket buffer, removing one data copy.
This is called zero-copy technology because the CPU never moves the data at the memory level: all copies are performed by DMA.
Compared with traditional file transfer, zero-copy halves the overhead: only two context switches and two data copies are needed to complete the transfer, and neither copy goes through the CPU; both are handled by DMA.
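The zero-copy path is exposed to applications as the `sendfile()` system call, available in Python as `os.sendfile`. The sketch below copies between two regular files (supported on Linux since kernel 2.6.33; historically the destination had to be a socket). File names and sizes are made up for the demo:

```python
import os

# os.sendfile() asks the kernel to move bytes between two file
# descriptors directly; the data never enters user space.
src = "zero_copy_src.bin"
dst = "zero_copy_dst.bin"
with open(src, "wb") as f:
    f.write(b"x" * 8192)

with open(src, "rb") as fin, open(dst, "wb") as fout:
    offset, remaining = 0, os.path.getsize(src)
    while remaining:
        # Kernel copies up to `remaining` bytes from fin to fout.
        sent = os.sendfile(fout.fileno(), fin.fileno(), offset, remaining)
        offset += sent
        remaining -= sent
```

In Kafka's broker-to-consumer path the destination descriptor is the consumer's socket, so the message bytes flow disk → page cache → NIC without ever being copied into the JVM heap.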
Before DMA (Direct Memory Access) existed, copying data from a disk file into a kernel buffer required the CPU to drive the whole transfer, in roughly five steps:
- The CPU issues the corresponding instruction to the disk controller and then returns;
- When the disk controller receives the instruction, it prepares the data, places it into its internal buffer, and then raises an interrupt;
- When the CPU receives the interrupt signal, it stops its current work and reads the data from the disk controller's buffer into its registers, one byte at a time;
- The data in the registers is then written to memory; throughout this process the CPU cannot perform other tasks.
With DMA, the transfer between disk and memory is handled entirely by the DMA controller, leaving the CPU free to handle other work.
- PageCache
Writing every message straight to disk would be very expensive, so a cache is needed. Kafka does not implement its own cache; it relies on the PageCache provided by the underlying operating system. A write goes into the PageCache and the page is marked dirty; the kernel flushes dirty pages to disk later. A read first looks in the PageCache; on a page miss, disk I/O is scheduled and the required data is returned. The PageCache uses as much free memory as possible as a disk cache, discarding the least recently accessed pages when space runs out. Relying on the PageCache also avoids caching data inside the JVM, which would add memory overhead and lead to frequent GC.
The PageCache also provides read-ahead. On a mechanical disk, data can only be read after the head has moved to the target sector, and this physical movement is very time-consuming; read-ahead reduces its impact. Suppose read() fetches 32 KB at a time. Although the application initially asks only for bytes 0 to 32 KB, the kernel also loads the following 32 to 64 KB into the PageCache, so reading that range later is cheap. If the process reads those bytes before they are evicted, the benefit is large. Because Kafka mostly reads sequentially, its access pattern matches read-ahead well, so read-ahead reduces the number of I/Os.
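Applications can explicitly tell the kernel to expect sequential access, which typically enlarges the read-ahead window. A minimal sketch using `os.posix_fadvise` (Linux-specific; the file name and sizes are made up):

```python
import os

# Create a 64 KB demo file.
path = "readahead_demo.bin"
with open(path, "wb") as f:
    f.write(os.urandom(64 * 1024))

fd = os.open(path, os.O_RDONLY)
# Hint: we will read this file front to back, so the kernel may
# read ahead more aggressively into the page cache.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

data = bytearray()
while chunk := os.read(fd, 32 * 1024):   # 32 KB reads, as in the text above
    data.extend(chunk)
os.close(fd)
```

With the hint in place, by the time the second 32 KB read is issued, the kernel has likely already pulled those pages into the PageCache.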
- Finding a message by index
Each partition corresponds to a log that is appended to sequentially. When the current log file reaches a certain size (controlled mainly by log.segment.bytes, together with other configurations), it is rolled and a new log file is started. Each file is called a log segment. Log segments make it easier for Kafka to locate and query data.
Log segments are divided into the active log segment and inactive log segments. Only the active log segment (each partition has exactly one) can be written to; inactive log segments are read-only.
1. Use the offset to search the index file. The index is sparse; its density can be tuned with log.index.interval.bytes (a time-versus-space trade-off). Find the largest index entry not greater than the target offset, then iterate through the message file sequentially from that position. The cost is O(log2 n) + O(m), where n is the number of entries in the index file and m is bounded by the index sparseness.
2. Then go to the log file and, using the absolute position and message size found above, locate the message precisely.
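The two-step lookup above can be sketched as a binary search over a sparse index. The index entries below are invented for illustration; a real index maps a relative offset to a byte position in the segment file:

```python
import bisect

# Sparse index: (message_offset, file_position) pairs, roughly one entry
# per log.index.interval.bytes of log data.
index = [(0, 0), (40, 4120), (80, 8460), (120, 12800)]   # made-up entries

def locate(target_offset):
    offsets = [off for off, _ in index]
    # Largest indexed offset <= target: O(log n).
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise KeyError("offset precedes this segment")
    base_offset, file_pos = index[i]
    # A real broker would now read the segment file from file_pos and
    # walk messages sequentially until it reaches target_offset: O(m).
    return base_offset, file_pos

print(locate(100))   # nearest index entry at or before offset 100
```

Looking up offset 100 returns the entry (80, 8460): the scan then starts at byte 8460 and walks at most one index interval's worth of messages, which is why a sparser index trades lookup time for index size.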