Kafka partition

About Kafka partition

  • Each partition is an ordered, immutable sequence of messages, which are continuously appended to the partition. This is a structured commit log (similar to Git’s commit log).

  • Each message in the partition is assigned a sequential ID value (offset) that uniquely identifies each message in the partition.

  • Messages in a partition are stored in a log, and messages within the same partition are strictly ordered by the order in which they were appended. A partition corresponds logically to a log: when a producer writes a message to a partition, it is actually writing to that partition's log. The log itself is a logical concept that corresponds to a directory on disk. It consists of multiple segments, and each segment has an index file and a data (.log) file.

  • Partitioning lets Kafka scale horizontally. Any single machine, physical or virtual, has a capacity limit, and once it reaches that limit it cannot be scaled further, so vertical scaling is always bounded by hardware. With partitions, the messages of a topic can be spread across different Kafka servers (a Kafka cluster). When the machines run out of capacity, we can simply add machines and create new partitions on them, which in theory allows unlimited horizontal scaling.

  • Partitioning also enables parallel processing: messages sent to a topic are distributed across its partitions, so they can be received and processed in several partitions at the same time.
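
The append-only behaviour described above can be illustrated with a small in-memory model. This is only a sketch: the `PartitionLog` class and its method names are hypothetical and are not part of any Kafka API. It shows how each appended message receives the next sequential offset and how that offset identifies the message afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of messages."""
    messages: list = field(default_factory=list)

    def append(self, value: str) -> int:
        """Append a message and return the offset assigned to it."""
        offset = len(self.messages)  # next sequential offset
        self.messages.append(value)
        return offset

    def read(self, offset: int) -> str:
        """Return the message stored at the given offset."""
        return self.messages[offset]


log = PartitionLog()
offsets = [log.append(v) for v in ("a", "b", "c")]
print(offsets)      # [0, 1, 2]
print(log.read(1))  # b
```

Existing entries are never modified or reordered here, which mirrors the "ordered, immutable sequence" property; a real partition persists the same idea to disk.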

Segment

  • A partition is an ordered, immutable sequence of messages. The number of messages in a partition can be very large, so they clearly cannot all be stored in one file. Instead, much like a rolling log in Log4j, once the messages in the partition grow to a certain point the message file is split and new messages are written to a new file. When that file in turn grows to a certain point, another new file is started, and so on. Each of these data files is called a segment.

  • Therefore, a partition is physically composed of one or more segments. The actual message data is stored in each segment.
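
The rolling behaviour can be sketched as follows. The `SegmentedLog` class is hypothetical, and it rolls on a message count purely to keep the example short; as an assumption about real deployments, Kafka decides when to roll from settings such as `log.segment.bytes` (segment size) rather than a message count.

```python
class SegmentedLog:
    """Toy partition that starts a new segment after max_messages entries."""

    def __init__(self, max_messages: int):
        self.max_messages = max_messages
        self.segments = [[]]     # list of segments; the last one is active
        self.base_offsets = [0]  # offset of the first message in each segment
        self.next_offset = 0

    def append(self, value: str) -> int:
        """Append to the active segment, rolling first if it is full."""
        if len(self.segments[-1]) >= self.max_messages:
            # The active segment is full: open a new one at the next offset.
            self.segments.append([])
            self.base_offsets.append(self.next_offset)
        self.segments[-1].append(value)
        offset = self.next_offset
        self.next_offset += 1
        return offset


slog = SegmentedLog(max_messages=3)
for i in range(7):
    slog.append(f"msg-{i}")

print(slog.base_offsets)                # [0, 3, 6]
print([len(s) for s in slog.segments])  # [3, 3, 1]
```

Offsets keep increasing across segments, so the partition as a whole still behaves like one continuous log.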

Relationship between partition and segment

  • Each partition is like one large file divided into multiple segment data files of equal size. The number of messages in each segment may differ, because different messages take up different amounts of disk space. This layout makes it easy to delete old segments, which helps use disk space efficiently.

  • Each partition only needs to support sequential reads and writes. The life cycle of a segment file is determined by Kafka server configuration parameters. For example, log.retention.hours=168 in server.properties means that old segment files are deleted after 7 days.
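
As a rough illustration of time-based retention, the sketch below picks out segments older than the retention window. The `expired_segments` helper, the sample file names, and the use of a last-modified time per segment are assumptions made for this example; they are not Kafka's actual cleanup code.

```python
RETENTION_HOURS = 168  # the log.retention.hours value from the text


def expired_segments(segments, now_s, retention_hours=RETENTION_HOURS):
    """Return the names of segments older than the retention window.

    `segments` maps a segment file name to its last-modified time
    in epoch seconds (an assumption of this sketch).
    """
    cutoff = now_s - retention_hours * 3600
    return sorted(name for name, mtime in segments.items() if mtime < cutoff)


now = 1_700_000_000
day = 24 * 3600
segments = {
    "00000000000000000000.log": now - 8 * day,      # older than 7 days
    "00000000000000001000.log": now - 7 * day - 1,  # just past the cutoff
    "00000000000000002000.log": now - 1 * day,      # still within retention
}

print(RETENTION_HOURS / 24)  # 7.0
print(expired_segments(segments, now))
```

Because deletion works on whole segment files, old data can be dropped without rewriting the messages that are still retained.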

The meaning and function of the 4 files in a partition directory

  • 00000000000000000000.index: the segment's index file. It is paired with the 00000000000000000000.log data file introduced next. The .index suffix marks it as an index file.

  • 00000000000000000000.log: the segment's data file, which stores the actual messages in binary format. Segment files are named by offset: the first segment of a partition starts at 0, and each subsequent segment is named after its base offset, that is, the offset of the first message it contains (one more than the last offset in the previous segment). The name is left-padded with zeros to 20 digits.

  • 00000000000000000000.timeindex: an index file based on message timestamps. It is mainly used to look up messages by date or time, and it is also used by time-based log rolling and log retention policies. This file was added in newer versions of Kafka and does not exist in older versions. It complements the .index file: .index is an offset-based index, while .timeindex is a timestamp-based index.

  • leader-epoch-checkpoint: the leader epoch cache file. It is an important file related to Kafka's HW (High Watermark) and LEO (Log End Offset).
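
The naming scheme makes it cheap to locate the segment that holds a given offset. The sketch below builds the three per-segment file names from a base offset and uses a binary search over sorted base offsets. Both helper functions and the sample offsets are hypothetical and only illustrate the idea.

```python
import bisect


def segment_file_names(base_offset: int) -> list:
    """Build the per-segment file names from the segment's base offset."""
    stem = f"{base_offset:020d}"  # left-padded with zeros to 20 digits
    return [f"{stem}.index", f"{stem}.log", f"{stem}.timeindex"]


def find_segment(base_offsets: list, target_offset: int) -> int:
    """Return the base offset of the segment containing target_offset.

    `base_offsets` must be sorted in ascending order.
    """
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset is before the first segment")
    return base_offsets[i]


print(segment_file_names(368769)[1])              # 00000000000000368769.log
print(find_segment([0, 368769, 737337], 500000))  # 368769
```

Once the right segment is known, its .index file can be used to narrow down the position of the message inside the matching .log file.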

Partitions and topics

  • Partitions: each topic can be divided into multiple partitions, and every topic has at least one. (In the earlier example of creating a topic, the --partitions option set the number of partitions for the new topic, and its value was 1.) Different partitions of the same topic contain different messages. When a message is added to a partition it is assigned an offset, which is its unique number within that partition. Kafka uses offsets to guarantee message order within a partition. Offset ordering does not span partitions: Kafka guarantees that messages within the same partition are ordered, but not that messages across multiple partitions of the same topic are ordered.

  • As you can see from the figure above, messages are strictly ordered within each partition, while ordering between partitions is not guaranteed.

  • Based on this design strategy, Kafka’s performance does not suffer as the number of messages in the partition increases, so storing data for a long time is not a problem.

  • Kafka messages are stored on disk. By assigning an offset to each message, the sequence of messages in the same partition is ensured.

Messages in Kafka are kept on disk for a limited time. Once that time has passed, they are discarded to free up disk space. The setting lives in server.properties, and the default is log.retention.hours=168, so messages are kept for 7 days by default. You can change the value to suit your situation; after changing it, restart the Kafka server for it to take effect.

  • The relationship of partitions to topics

  • As you can see, each message has a unique offset in the same partition.

  • Partitioning is an important way for Kafka to achieve high throughput. In a Kafka cluster especially, the messages of a topic are distributed across different Kafka servers, which gives distributed message storage, particularly when replicas are configured.
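
To make the per-partition ordering guarantee in this section concrete, here is a toy model of a topic with several partitions. The `Topic` class is hypothetical, and hashing the key with CRC32 is an illustrative assumption, not the hash Kafka's own partitioner uses. The point is only that messages with the same key land in the same partition with increasing offsets, while offsets in different partitions are independent.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key: str, value: str) -> tuple:
        """Route a message by key hash and return (partition, offset)."""
        # CRC32 is an illustrative choice for this sketch only.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1


topic = Topic(num_partitions=3)
keys = ("user-a", "user-b")
results = {key: [] for key in keys}
for n in range(3):
    for key in keys:
        results[key].append(topic.send(key, f"{key}-{n}"))

for key, placed in results.items():
    partitions_used = {p for p, _ in placed}
    offsets_seen = [o for _, o in placed]
    # Each key maps to one partition, and its offsets only increase.
    print(key, partitions_used, offsets_seen)
```

An offset is only meaningful together with its partition number, which is why ordering can be promised inside one partition but not across the whole topic.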

Resources

  • kafka.apache.org