This is the 9th day of my participation in the November Gengwen Challenge. See details: The Last Gengwen Challenge 2021.

1. Kafka architecture

  1. Producer API

Allows applications to publish a stream of records to one or more Kafka topics.
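As a rough sketch, publishing with the Java client might look like this (the broker address, topic name, and serializers are illustrative assumptions, not details from this article):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the (hypothetical) topic "my-topic".
            producer.send(new ProducerRecord<>("my-topic", "key", "hello kafka"));
        }
    }
}
```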

  2. Consumer API

Allows an application to subscribe to one or more topics and process the stream of records received by those topics.
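A minimal consumer sketch under the same assumptions (the group id is a placeholder; consumer groups are covered later in this article):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "my-group");                // assumed consumer group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // The consumer actively pulls records from the brokers.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```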

  3. Streams API

Allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more topics, effectively transforming input streams into output streams.
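A hedged sketch of a trivial Streams topology that turns an input stream into an output stream (the topic names, application id, and upper-casing transform are all invented for illustration):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");     // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume an input stream, transform each value, produce an output stream.
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(v -> v.toUpperCase()).to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```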

  4. Connect API

Allows you to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
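For illustration only, a standalone-mode connector configuration might look like the sketch below. It assumes the FileStreamSource connector that ships with Kafka; the file path and topic name are placeholders:

```properties
# Hypothetical source connector: tails a file and publishes each line to a topic.
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=connect-test
```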

Note: Kafka 2.8.0 removes the dependency on ZooKeeper. Cluster management is handled by KRaft, which uses a Quorum controller inside Kafka instead of ZooKeeper, so for the first time users can run Kafka without ZooKeeper at all. This not only saves computing resources but also makes Kafka more efficient and able to support larger clusters.

In the past, Apache ZooKeeper was key to distributed systems such as Kafka: it played the role of coordinating the brokers, and every broker registered with ZooKeeper on startup. ZooKeeper is a powerful tool, but it is a separate piece of software, which complicates Kafka’s overall system, so the project decided to replace it with an internal Quorum controller.

The work began in April of the previous year and has now borne partial fruit: in version 2.8, users can run Kafka without ZooKeeper, in what is officially called Kafka Raft Metadata mode (KRaft). In KRaft mode, the metadata previously managed by the Kafka controller and ZooKeeper is consolidated into the new Quorum controller and runs inside the Kafka cluster, or on dedicated hardware if the user has a special use case.

OK, so with ZooKeeper removed in the new version, let’s move on to other features of Kafka:

Kafka supports message persistence. The consumer side actively pulls data, and consumption state and subscription relationships are maintained on the client side. A message is not deleted immediately after it is consumed, so when multiple subscriptions exist, only one copy of the message needs to be stored.

  1. Broker: A Kafka cluster contains one or more service instances (nodes), each called a broker.
  2. Topic: Every message published to a Kafka cluster belongs to a category called a topic.
  3. Partition: A partition is a physical concept. Each topic contains one or more partitions.
  4. Segment: Each segment consists of two parts: a .log file and an .index file, where the .index file is used to look up the physical position of data in the .log file.
  5. Producer: the producer of messages, which publishes messages to Kafka brokers.
  6. Consumer: the consumer of messages, the client that reads messages from Kafka brokers.
  7. Consumer Group: a consumer group; each consumer belongs to a specific consumer group (a groupName can be specified for each consumer).
  8. .log: the file that stores the data.
  9. .index: the file that stores the index data for the .log file.

2. Kafka main components

1. Producer

Producers are mainly used to produce messages; they are the message publishers in Kafka. The messages they produce are classified by topic and stored on Kafka brokers.

2. Topic

  1. Kafka groups messages by topic;
  2. A topic refers to one of the different categories of message feeds that Kafka processes;
  3. A topic is the name given to a category or feed of published records. Kafka topics have always supported multi-subscriber semantics; that is, a topic can have zero, one, or many consumers subscribing to the data written to it;
  4. A Kafka cluster can have an unlimited number of topics;
  5. Producers produce and consumers consume data on a per-topic basis; finer granularity is available at the partition level.
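To make this concrete, here is a hedged sketch of creating a topic with the Java AdminClient; the topic name, partition count, and replication factor are assumptions for illustration:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions, replication factor 2 (must be <= the number of brokers).
            NewTopic topic = new NewTopic("my-topic", 4, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```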

3. Partition

In Kafka, a topic is a grouping of messages. A topic can have multiple partitions, each of which stores part of the topic’s data; the data of all partitions combined forms the complete data of the topic.

Multiple partitions can be created on a single broker, regardless of how many brokers there are. In Kafka, each partition has a number, starting from 0. Data is ordered within a partition, but global order across partitions is not guaranteed. (Order means messages are consumed in the same order in which they were produced.)
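Because order is only guaranteed within a partition, a common pattern is to give related records the same key: the default partitioner hashes the key, so they land in the same partition and keep their relative order. A minimal sketch (the keys, values, and topic are invented):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrderingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key "user-42" => same partition => event-2 is stored after event-1.
            producer.send(new ProducerRecord<>("my-topic", "user-42", "event-1"));
            producer.send(new ProducerRecord<>("my-topic", "user-42", "event-2"));
            // A record can also name its target partition explicitly (partition 0 here).
            producer.send(new ProducerRecord<>("my-topic", 0, "user-7", "event-3"));
        }
    }
}
```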

4. Consumer

A consumer consumes data from Kafka. Every consumer must belong to a consumer group.

5. Consumer group

A consumer group consists of one or more consumers, and within the same group the same message is consumed only once.

Each consumer belongs to some consumer group; if none is specified, the consumer belongs to the default group.

Each consumer group has an ID, the group ID. All consumers within a group coordinate to consume all partitions of the subscribed topic. Each partition can be consumed by only one consumer within the same consumer group, but it can of course be consumed by different consumer groups.

The number of partitions determines the maximum number of concurrent consumers in each consumer group, as the diagram below shows:

As shown on the left of the figure above, if there are only two partitions, then even with four consumers in a group, two of them will be idle. As shown on the right, with four partitions, each consumer handles one partition, for a maximum concurrency of 4.

Take a look at the following picture:

As shown in the figure above, two different consumer groups consume the same topic, which has four partitions distributed across two nodes. Consumer group 1 on the left has two consumers, so each consumer must consume two partitions to cover all the messages; consumer group 2 on the right has four consumers, so each consumer consumes one partition.

To summarize the relationship between partitions and consumer groups in Kafka:

Consumer group: a group of one or more consumers in which the same message is consumed only once. The number of consumers in a consumer group consuming a topic should be less than or equal to the number of partitions of that topic.

For example, if a topic has four partitions, the number of consumers in the consumer group should be at most 4, and ideally a number that divides the partition count evenly: 1, 2, or 4. Data in the same partition cannot be consumed at the same time by different consumers of the same consumer group.

Conclusion: the more partitions a topic has, the more consumers can consume it at the same time, and the faster the data is consumed, which improves consumption performance.
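As a hedged sketch of how this looks in code: every copy of the class below that is started with the same (assumed) group.id joins one consumer group, and the topic’s partitions are divided among the running copies; copies beyond the partition count sit idle:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMemberSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group"); // identical group.id across all instances
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // Each instance only sees records from the partitions assigned to it.
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                        System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset()));
            }
        }
    }
}
```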

6. Partition replicas

A partition replica in Kafka looks like this:

Replication factor: controls how many brokers a message is stored on, usually equal to the number of brokers.

Multiple replicas of the same partition cannot be created on a single broker. When creating a topic, the replication factor should be less than or equal to the number of available brokers.

Replication operates at the partition level: each partition has its own primary and secondary replicas.

The primary replica is called the leader, and the secondary replicas are called followers. (When multiple replicas exist, Kafka assigns one leader and N followers to each partition.) Replicas that are in a synchronized state are called in-sync replicas (ISR).

Followers synchronize data from the leader by pulling. Both consumers and producers read and write data through the leader and do not interact with followers.

What the replication factor does: it makes Kafka’s data reads and writes reliable.

The replication factor count includes the leader itself, and replicas of the same partition cannot be placed on the same broker.

If a partition has a replication factor of 3 and one replica dies, only the remaining two are left; a new leader is chosen from them, but a replacement replica is not started on another broker. (Starting one elsewhere would mean transferring data between machines, which occupies network I/O for a long time, and Kafka, as a high-throughput messaging system, cannot allow that to happen.) So no replica is started on another broker.

If all replicas hang, the producer will fail to write data to the specified partition.

ISR indicates the replicas that are currently alive and in sync.
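The leader and ISR of each partition can be inspected with the Java AdminClient; a minimal sketch (the topic name and broker address are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription d = admin.describeTopics(Collections.singleton("my-topic"))
                    .all().get().get("my-topic");
            // Print the leader and the in-sync replica set of every partition.
            d.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```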

7. The segment file

A partition consists of several segment files. Each segment contains two parts: a .log file and an .index file. The .log file stores the data sent to the partition, while the .index file records the index of the data in the .log file, which speeds up lookups.

Relationship between index files and data files

Since they come in one-to-one pairs, there must be a relationship between them: the metadata in the index file points to the physical offset of the corresponding message in the data file.

For example, the entry (3, 497) in the index file means: the third message in the data file, stored at physical offset 497.

In the data file, message 368772 represents the 368,772nd message in the whole partition.

Note: the segment index file uses sparse indexing to reduce the index file’s size, and mmap (memory mapping) allows it to be operated on directly in memory. A sparse index sets a metadata pointer for only some of the data file’s messages, which saves more storage space than a dense index but takes more time during lookups.

The relationship between.index and.log is as follows:

The left part of the image above is the index file, which stores key-value pairs. The key is the number of the message within the data file (the corresponding .log file), e.g. “1, 3, 6, 8 …”, denoting the 1st, 3rd, 6th, and 8th messages in the log file.

So why aren’t these numbers sequential in the index file? Because the index file does not index every message in the data file; it uses sparse storage, building one index entry for every fixed amount of data. This prevents the index file from taking up too much space, so it can be kept in memory. The disadvantage is that a message without an index entry cannot be located in the data file in a single step; a sequential scan is required, but the range of that scan is very small.

The value represents the message’s number within the whole partition.

Take the metadata (3, 497) in the index file: 3 is the third message from top to bottom in the log data file on the right, and 497 is the physical offset address (position) of that message. (Because of sequential writes, it also corresponds to the 497th message in the whole partition.)

Kafka creates folders under the log.dir directory we specify, each named (topic name – partition name). Under each (topic name – partition name) directory there are two files, as follows:

```
#Index file
00000000000000000000.index
#Log contents
00000000000000000000.log
```

The files in the directory are split according to the size of the log file: when a .log file reaches 1 GB, it is split. As follows:

```
-rw-r--r--. 1 root root 389K Jan 17 18:03 00000000000000000000.index
-rw-r--r--. 1 root root 1.0G Jan 17 18:03 00000000000000000000.log
-rw-r--r--. 1 root root  10M Jan 17 18:03 00000000000000077894.index
-rw-r--r--. 1 root root 127M Jan 17 18:03 00000000000000077894.log
```

In Kafka’s design, the offset is used as part of the file name.

Each segment file is named after the maximum global message offset of the previous segment. The value can be up to 64 bits and is written as 20 digits, left-padded with zeros where digits are missing.
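The 20-digit zero-padded name can be reproduced with a simple format string; a tiny sketch using the 77894 base offset from the listing above:

```java
public class SegmentNameSketch {
    public static void main(String[] args) {
        long baseOffset = 77894L; // offset carried over from the previous segment
        // %020d left-pads the offset with zeros to 20 digits.
        System.out.println(String.format("%020d.log", baseOffset));
        // prints: 00000000000000077894.log
    }
}
```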

The index information can be used to locate messages quickly. By mapping all of the index metadata into memory, I/O operations on the segment index file can be avoided.

Sparse storage of index files can greatly reduce the space occupied by index file metadata.

Sparse indexing: an index is built over the data, not one entry per record but one entry per interval. Benefit: the number of index entries is reduced. Downside: once the right interval is found, a second step is needed to locate the exact record.
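A toy sketch of the two-step lookup described above: find the greatest sparse-index entry at or below the target message, then scan forward from there. The (message number, position) pairs mirror the (3, 497)-style entries discussed earlier; everything else is invented for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseIndexSketch {
    public static void main(String[] args) {
        // Hypothetical sparse index: message number -> physical position in the .log file.
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(1L, 0L);
        index.put(3L, 497L);
        index.put(6L, 1407L);
        index.put(8L, 2266L);

        long target = 5L; // the message we want to locate
        // Step 1: the greatest indexed message <= target (a binary search in the real .index file).
        Map.Entry<Long, Long> floor = index.floorEntry(target);
        // Step 2: scan the .log file sequentially from that position; the range is small by design.
        System.out.printf("start scanning at byte %d to reach message %d%n",
                floor.getValue(), target);
    }
}
```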

8. The physical structure of a message

Every message a producer sends to Kafka is wrapped by Kafka into a message structure.

The physical structure of a message is shown below:

So a message sent by the producer to Kafka is not stored directly; it is wrapped by Kafka first. Every message has the structure shown above, and only the last field is the actual message sent by the producer.
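Since the figure is not reproduced here, the sketch below walks through the legacy (v0) on-disk message layout as I understand it; newer brokers use the v2 record-batch format, so treat the exact field list as an assumption for illustration:

```java
import java.nio.ByteBuffer;

public class MessageLayoutSketch {
    // Assumed legacy (v0) layout: Kafka's wrapper fields come first,
    // and only the final value field is the payload the producer sent.
    static void parse(ByteBuffer buf) {
        long offset  = buf.getLong(); // 8 bytes: logical offset in the partition
        int  size    = buf.getInt();  // 4 bytes: size of the message that follows
        int  crc     = buf.getInt();  // 4 bytes: CRC checksum of the message
        byte magic   = buf.get();     // 1 byte : message format version
        byte attrs   = buf.get();     // 1 byte : attributes (e.g. compression codec)
        int  keyLen  = buf.getInt();  // 4 bytes: key length (-1 means null key)
        byte[] key   = new byte[Math.max(keyLen, 0)];
        buf.get(key);
        int  valLen  = buf.getInt();  // 4 bytes: value length
        byte[] value = new byte[valLen];
        buf.get(value);               // the actual message from the producer
    }
}
```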