1. Overview of Kafka message persistence

Kafka relies on the file system to store and cache messages. Conventional wisdom holds that hard disks are always slow, so can a file-system-based architecture deliver good performance? In fact, the speed of a hard disk depends heavily on how you use it. In addition, because Kafka runs on the JVM, caching messages in JVM memory has the following disadvantages:

  • Object memory overhead is very high, often twice the size of the stored data or more
  • As the amount of data in the heap increases, garbage collection (GC) becomes slower and slower

Linear (sequential) writes to disk perform far better than random writes. The operating system heavily optimizes linear reads and writes (read-ahead, write-behind, etc.), and in some cases sequential disk access is even faster than random memory access. So instead of caching data in memory and then flushing it to disk, Kafka writes the data directly to a log on the file system:

  • Write: append data sequentially to the end of a file
  • Read: read data from the file
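
The two operations above can be sketched in a few lines of Python. This is only a minimal illustration, not Kafka's actual on-disk format: the 4-byte length prefix and the helper names append_record and read_records are assumptions made for this example.

```python
import os
import struct
import tempfile


def append_record(path, payload):
    """Append one length-prefixed record and return the byte position where it starts."""
    with open(path, "ab") as f:
        # Move to the end of the file; seek() returns the resulting position.
        position = f.seek(0, os.SEEK_END)
        # Hypothetical framing: 4-byte big-endian length, then the payload.
        f.write(struct.pack(">I", len(payload)) + payload)
        return position


def read_records(path):
    """Read every record back in the order it was written."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:  # end of file reached
                break
            (length,) = struct.unpack(">I", header)
            records.append(f.read(length))
    return records


# Write three messages to a fresh log file, then read them back.
log_path = os.path.join(tempfile.mkdtemp(), "demo.log")
positions = [append_record(log_path, m) for m in (b"first", b"second", b"third")]
messages = read_records(log_path)
```

Every write lands at the current end of the file, so the disk only ever sees sequential writes, and a reader simply scans forward from a known position.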

Benefits of this implementation:

  • Reads do not block writes or other operations, and performance is largely independent of the amount of data stored
  • Hard disk space is far less limited than memory space
  • Linear disk access is fast, and data can be retained longer and more reliably

2. Analysis of Kafka’s persistence principle

A Topic is divided into several Partitions. At the storage level, each Partition is an append-only log file. Messages belonging to a Partition are appended to the end of its log file. The position of each message in the file is called its offset.
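
As a rough illustration of how offsets relate to partitions, the following Python sketch models a topic as a list of in-memory append-only logs. The class names Topic and Partition and their methods are hypothetical, and real partitions are files on disk rather than Python lists.

```python
class Partition:
    """Hypothetical in-memory stand-in for one append-only partition log."""

    def __init__(self):
        self._log = []

    def append(self, message):
        # New messages always go to the end; the offset is the position in the log.
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        return self._log[offset]


class Topic:
    """Hypothetical topic made of a fixed number of independent partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def append(self, partition_index, message):
        return self.partitions[partition_index].append(message)

    def read(self, partition_index, offset):
        return self.partitions[partition_index].read(offset)


topic = Topic("MyTopic1", 3)
o1 = topic.append(0, "a")  # first message in partition 0
o2 = topic.append(0, "b")  # second message in partition 0
o3 = topic.append(1, "c")  # first message in partition 1
```

Offsets are counted separately in each partition, so an offset identifies a message only together with its partition.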

As shown in the figure below, we previously created MyTopic1 with three partitions. We can go to the corresponding log directory to view them.

Kafka's log data is divided into index files and log files (as shown in the figure above), which come in pairs: the index file stores metadata, and the log file stores the messages. The metadata in the index file points to the physical offset address of the corresponding message in the log file. For example, the entry 2,128 refers to the second message in the log file, which is stored at a physical offset of 128. The physical address recorded in the index file, together with the offset, locates the message. We can use Kafka's own tools to view the data in the log file: