Kafka is an enterprise-class messaging and subscription system written in Scala and Java. It was originally developed at LinkedIn and later open-sourced as a project of the Apache Software Foundation. Kafka is a distributed, partitioned, multi-replica, multi-subscriber messaging system with high throughput, widely used for application decoupling, asynchronous processing, traffic peak shaving, and message-driven scenarios. This article gives a brief introduction to Kafka's architecture and related components. Before introducing the architecture, let's take a look at Kafka's core concepts.

Before introducing Kafka's architecture and basic components in detail, we need to understand some of Kafka's core concepts:

Producer: a producer of messages, which sends messages to the Kafka cluster.

Consumer: a consumer of messages, which actively pulls messages from the Kafka cluster.

Consumer Group: every Consumer belongs to a specific Consumer Group. When creating a Consumer, you specify the corresponding Consumer Group ID.

Broker: a service instance in a Kafka cluster, also known as a node. Each Kafka cluster contains one or more brokers (a broker is a server or node).

Message: the entity passed through the Kafka cluster; it carries the information to be sent.

Topic: a category of messages, used to distinguish messages logically. Every message sent to a Kafka cluster must have a specific Topic, which consumers subscribe to and consume.

Partition: a partition of a topic's messages. A Partition is a physical concept, equivalent to a folder: Kafka creates one folder per Partition of a Topic, and the messages of a Topic are stored in one or more partitions.

Segment: each Partition is divided into segments, and each Segment consists of a .log file and an .index file. The .index file is used to quickly look up the offset position of data in the .log file.

.log files: the data files that store messages. In Kafka, data files are called log files. By default a partition contains n .log files (segmented storage). A .log file has a default maximum size of 1 GB; messages are continuously appended to it, and when it reaches that size a new segment is created.

.index files: store the index data for the corresponding .log file. Each .index file is paired with a .log file of the same name.

We'll cover some of these core concepts in more detail later. With the core concepts in place, let's take a look at the basic features, components, and architecture that Kafka exposes.

Kafka API

As you can see in the figure above, Kafka consists of four main API components:

1. Producer API: applications use the Producer API to send messages to one or more topics in the Kafka cluster.

2. Consumer API: applications use the Consumer API to subscribe to one or more topics in the Kafka cluster and process the messages received under those topics.

3. Streams API: an application uses the Streams API to act as a stream processor, consuming input streams from one or more topics and producing output streams to one or more topics, effectively transforming input streams into output streams written back to the Kafka cluster.

4. Connect API: applications use the Connect API to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. Connect actually does two things: a Source Connector reads data from a data source (e.g. a DB) and writes it to a Topic, and a Sink Connector reads data from a Topic and exports it to the other end (e.g. another DB), transferring message data between external storage and the Kafka cluster.
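To make the Producer API concrete, here is a minimal sketch using the official Java client (kafka-clients). The broker address localhost:9092 and the topic name demo-topic are assumptions for the example; a matching Consumer sketch appears in the Consumer section below.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer and flushes pending messages
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // messages with the same key always land in the same partition
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }
    }
}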

Kafka architecture

Next, we will start from Kafka's architecture, focusing on Kafka's main components and how they are implemented. Kafka persists messages: consumption works by consumers actively pulling messages, and the client is responsible for maintaining subscription state and subscription relationships (offsets). Messages are not deleted immediately after being consumed; history is retained, by default for 7 days. This is how multiple subscribers can be supported without duplicating the message for each of them: only one stored copy is needed. The implementation of each component is described in detail below.

A Producer is a message producer in Kafka, used mainly to produce messages with specific topics. The messages produced by producers are classified by Topic and stored on brokers in the Kafka cluster, specifically in the directories of the specified partitions, where data is stored as segments (.log and .index files).

A Consumer is a message consumer in Kafka that consumes messages of a given Topic. Consumption is pull-based: the Consumer actively pulls messages from the Kafka cluster. Every Consumer must belong to a specific Consumer Group.
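Here is the matching minimal Consumer sketch with the Java client; the group.id property is what places this consumer in a Consumer Group (the group name demo-group and the topic demo-topic are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // the Consumer Group this consumer joins
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // the consumer actively pulls batches of messages from the brokers
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}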

Messages in Kafka are classified by Topic, and Kafka supports multiple subscriptions: a Topic can have multiple consumers subscribing to its messages. There is no limit on the number of topics in a Kafka cluster. Data for the same Topic is stored in directories named after the Topic and partition number; a Topic can contain one or more partitions, and the messages of all its partitions together make up all the messages of the Topic.

In Kafka, multiple partitions can be assigned to each Topic to speed up message consumption. As mentioned earlier, Kafka supports multiple partitions; by default, a Topic's messages are stored in only one partition. The messages of all partitions of a Topic combined make up all the messages within that Topic. Each Partition is numbered starting from 0, and the data within each Partition is ordered; however, ordering cannot be guaranteed across different partitions, because different partitions are consumed by different consumers: each Partition can be assigned to only one Consumer within a group, but one Consumer can consume multiple partitions of a Topic at the same time.
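As an illustration, a topic with several partitions can be created with the Java AdminClient; the partition and replica counts below are assumptions chosen for the example:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions allow up to 3 consumers in one group to consume in parallel;
            // replication factor 1 keeps a single copy of each partition
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}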

Every Consumer in Kafka belongs to a specific Consumer Group; if none is specified, the Consumer belongs to a default Consumer Group. A Consumer Group consists of one or more consumers, and within the same Consumer Group a given message is consumed only once. Each Consumer Group has a unique ID, the Group ID, also known as the Group Name. All consumers in a Consumer Group coordinate to consume all partitions of a subscribed Topic, and each Partition can be consumed by only one Consumer within a Consumer Group; it can, however, be consumed by one Consumer from each of several different Consumer Groups. As shown below:

In terms of hierarchy, a Consumer Group corresponds to a Topic, and a Consumer corresponds to a Partition under that Topic. The number of consumers in a Consumer Group and the number of partitions under a Topic together determine the concurrency of message consumption, with the number of partitions setting the upper bound, since each Partition can be consumed by only one Consumer. When the number of consumers in a Consumer Group exceeds the number of partitions of the subscribed Topic, Kafka assigns one Consumer to each Partition and any additional consumers sit idle. When the number of consumers in a Consumer Group is smaller than the number of partitions of the subscribed Topic, a single Consumer consumes multiple partitions. As shown in the figure above, each Consumer in Consumer Group B consumes data from two partitions, while Consumer Group C has an idle Consumer4. To sum up: the more partitions a Topic has, the more consumers can consume concurrently, the faster the consumption, and the higher the throughput. At the same time, the number of consumers in a Consumer Group should be less than or equal to the number of partitions, ideally such that the partition count is an integer multiple of it, e.g. 1, 2, or 4 consumers for 4 partitions.

In Kafka, messages are stored in segments within each Partition: roughly, a new Segment is created for every 1 GB of messages. Each Segment consists of two files: a .log file and an .index file. The .log file is where Kafka actually stores the messages produced by producers, while the .index file uses a sparse index to store the logical offsets and physical positions of messages in the .log file, to speed up lookups. The .log and .index files are one-to-one and come in pairs. The following figure shows how the .log and .index files are laid out within a Partition.

Each message in Kafka has a logical offset (a relative offset) as well as an actual physical address on disk, the Position. In other words, a message in Kafka has two locations: its offset (relative, logical) and its Position (physical offset address). In Kafka's design, the message offset is embedded in the Segment file name: each Segment file is named after the base offset of the first message it contains (a Message offset, not a physical position; more on how to query messages in a .log file later). Since an offset can be as large as a 64-bit long, the name is its value represented in 20 digits, left-padded with zeros.
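Since segment files are named after a 20-digit, zero-padded base offset, the file name a given base offset maps to can be derived as in this toy sketch:

public class SegmentName {
    // a segment file is named after its base offset, zero-padded to 20 digits
    static String segmentFileName(long baseOffset) {
        return String.format("%020d", baseOffset) + ".log";
    }

    public static void main(String[] args) {
        System.out.println(segmentFileName(0));      // 00000000000000000000.log
        System.out.println(segmentFileName(170410)); // 00000000000000170410.log
    }
}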

Kafka locates a Message in a Segment using the .index and .log files as follows:

1. In the Partition, use binary search over the segment file names to find the Segment containing the target offset, i.e. the file with the largest base offset that is less than or equal to the target offset. For a target offset of 7 in this example, that is 00000000000000000000.index.

2. In that .index file, use binary search to find the entry with the largest offset that is less than or equal to the target offset (7); in the example, the entry for offset 6, which maps to physical position 258.

3. In the .log file, scan forward from disk position 258 until the Message with offset 7 is found.

At this point we have briefly introduced how the basic components of a Segment, the .index and .log files, store and look up data. One question remains: the offsets in the .index file are not stored contiguously, so why did Kafka design the index file to be discontinuous? This discontinuous design is called a sparse index. Kafka appends an index entry to the .index file for every 4 KB of data written to the .log file. Sparse indexes are used for two reasons: (1) sparse storage greatly reduces the space the .index file occupies; (2) a sparse index file is small enough to be read into memory, avoiding frequent disk IO when reading the index, so a Message can be quickly located in the .log file through the in-memory index.
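The lookup in step 2 can be sketched as follows. The index is modeled here as an in-memory sorted map of (offset, position) pairs, which is an assumption for illustration (the real broker reads entries from the memory-mapped .index file); only the (6, 258) pair comes from the example above, the other values are made up:

import java.util.Map;
import java.util.TreeMap;

public class SparseIndexLookup {
    public static void main(String[] args) {
        // sparse index: message offset -> physical position in the .log file
        // (only some offsets have entries, mirroring the figure above)
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(1L, 0L);
        index.put(3L, 128L);
        index.put(6L, 258L);

        long target = 7L;
        // floorEntry finds the largest indexed offset <= target (offset 6 here)
        Map.Entry<Long, Long> entry = index.floorEntry(target);
        // the .log file is then scanned forward from that physical position
        System.out.printf("scan .log from position %d to find offset %d%n",
                entry.getValue(), target);
    }
}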

Each Message the Producer sends to the Kafka cluster is wrapped by Kafka into a Message object and only then stored on disk, rather than being stored directly. The physical structure of a Message on disk is shown below.

On-disk format of a message

offset         : 8 bytes 
message length : 4 bytes (value: 4 + 1 + 1 + 8(if magic value > 0) + 4 + K + 4 + V)
crc            : 4 bytes
magic value    : 1 byte
attributes     : 1 byte
timestamp      : 8 bytes (Only exists when magic value is greater than zero)
key length     : 4 bytes
key            : K bytes
value length   : 4 bytes
value          : V bytes

The key and value store the actual Message content and are variable-length, while the other fields are fixed-length metadata describing the Message. Therefore, when searching for an actual Message, the disk pointer can be advanced by computed amounts based on each Message's offset and message length, speeding up the search. This is possible because Kafka's .log files are written sequentially: writing data to disk is always appending.
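Given the fixed-length header fields above, the on-disk size of one message follows directly from the key and value lengths; this small sketch spells out that arithmetic for the magic value > 0 layout:

public class MessageSize {
    // on-disk size of one message for magic value > 0, per the layout above:
    // offset(8) + message length(4) + crc(4) + magic(1) + attributes(1)
    // + timestamp(8) + key length(4) + key(K) + value length(4) + value(V)
    static long messageSizeOnDisk(int keyBytes, int valueBytes) {
        return 8 + 4 + 4 + 1 + 1 + 8 + 4 + keyBytes + 4 + valueBytes;
    }

    public static void main(String[] args) {
        // a message with a 10-byte key and a 100-byte value occupies 144 bytes
        System.out.println(messageSizeOnDisk(10, 100));
    }
}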

Finally, let's talk about Partition Replicas in Kafka. Prior to version 0.8, Kafka had no replicas. When creating a Topic, you can specify the number of partitions as well as the number of replicas. Partition replicas in Kafka look like this:

Kafka uses the replication factor to control on how many brokers a message is stored. Generally the number of replicas equals the number of brokers, and two replicas of the same partition are never stored on the same Broker. Replicas exist per partition and are assigned roles: the master replica is called the Leader (there is only one at any time), the slave replicas are called Followers (there can be several), and the replicas in sync with the Leader form the In-Sync Replicas (ISR). The Leader handles all reads and writes; Followers do not serve reads or writes externally and only synchronize data from the Leader. Both consumers and producers read and write via the Leader and do not interact with Followers. Having only the Leader serve reads and writes avoids the read delays that data synchronization would otherwise introduce, because a Follower could serve reads only after it has finished synchronizing data from the Leader.

If a partition has a replication factor of three, then even if one replica fails, one of the remaining two is simply elected Leader, as shown in the figure below. Kafka is a high-throughput messaging system and does not recover from failures by copying and transferring a partition's data to another broker; it fails over among the existing replicas instead. If all replicas of a partition are down, producers will fail to write data to that partition. A message sent by the Producer to a specified Partition is first written to the Leader replica and then replicated to the other replicas in the ISR list before its offset can be committed.
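On the producer side, how many replicas must acknowledge a write before it is considered successful is controlled by the acks setting. A minimal sketch of durability-oriented producer configuration follows; min.insync.replicas, mentioned in the comments, is a broker/topic-side setting rather than a producer property:

import java.util.Properties;

public class DurableProducerConfig {
    public static Properties durableProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the Leader waits for all in-sync replicas (ISR) to acknowledge
        // the write before the send is considered successful
        props.put("acks", "all");
        // on the topic/broker side, min.insync.replicas controls how far the ISR
        // may shrink before writes with acks=all start being rejected
        return props;
    }
}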

At this point, Kafka's architecture and basic principles have been introduced. To achieve high throughput and fault tolerance, Kafka also employs some excellent design ideas, such as zero copy, a high-concurrency network design, and sequential storage, which will be discussed later.