Continued from the previous article: aspects of Kafka that the last article did not cover.

Replication policy

Kafka’s high reliability is guaranteed by its robust replication policy.

Data synchronization

Before version 0.8, Kafka provided no Replication for Partitions. Once a Broker went down, all Partitions on it became unavailable, and with no backup data the availability of the service dropped sharply. Therefore, starting with 0.8, Kafka provides Replication to support Broker failover.

After Replication is introduced, a Partition may have multiple replicas, so a Leader needs to be elected among them. The Producer and Consumer interact only with this Leader; the other replicas act as Followers and copy data from the Leader.

Replica placement policy

For better load balancing, Kafka tries to distribute all Partitions evenly across the entire cluster.

The algorithm for assigning Replicas in Kafka is as follows:

Sort all the surviving Brokers (assume there are n of them) and the Partitions to be allocated

The i-th Partition is allocated to the (i mod n)-th Broker. This Broker holds the first Replica of the Partition, which serves as the Partition's preferred replica

The j-th Replica of the i-th Partition is allocated to the ((i + j) mod n)-th Broker

Suppose a cluster has four Brokers and a Topic has four Partitions, each with three replicas. The following figure shows the distribution of replicas across the Brokers (the sketch below computes the same layout).
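As a concrete illustration, here is a minimal Java sketch of the round-robin scheme just described, using the example's numbers (4 Brokers, 4 Partitions, 3 replicas). Real Kafka also adds a random starting shift, which is omitted here.

```java
public class ReplicaPlacement {
    public static void main(String[] args) {
        int n = 4;          // surviving Brokers, sorted 0..n-1
        int partitions = 4; // Partitions to allocate
        int replicas = 3;   // replicas per Partition
        for (int i = 0; i < partitions; i++) {
            StringBuilder line = new StringBuilder("Partition " + i + " -> Brokers ");
            for (int j = 0; j < replicas; j++) {
                // j == 0 is the preferred replica on Broker (i mod n);
                // replica j lands on Broker ((i + j) mod n).
                line.append((i + j) % n).append(j < replicas - 1 ? ", " : "");
            }
            System.out.println(line);
        }
    }
}
```

Running it prints the same layout the figure describes: Partition 0 on Brokers 0, 1, 2; Partition 1 on Brokers 1, 2, 3; and so on, wrapping around the cluster.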

Synchronization strategies

When publishing a message to a Partition, the Producer first finds the Leader of the Partition through ZooKeeper. Then, regardless of the Topic's Replication Factor, the Producer sends the message only to the Leader of the Partition. The Leader writes the message to its local Log, and each Follower pulls data from the Leader, so the Followers store data in the same order as the Leader. After receiving the message, a Follower writes it to its Log and sends an ACK to the Leader. Once the Leader has received ACKs from all replicas in the ISR, the message is considered committed; the Leader advances the HW (high watermark) and sends an ACK to the Producer.

To improve performance, each Follower sends an ACK to the Leader as soon as it receives the data, rather than waiting until the data is written to its Log. Kafka can therefore only guarantee that a committed message is held in the memory of multiple replicas; it cannot guarantee that the message has been persisted to disk, so there is no absolute guarantee that a committed message will be consumable by a Consumer if an exception occurs.
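On the client side, whether the Producer waits for a message to be committed is controlled by the acks setting. A minimal sketch with the standard Java client follows; the bootstrap address and topic name are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class CommittedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the send succeeds only after the Leader has collected
        // ACKs from all replicas in the ISR, i.e. the message is committed.
        // acks=1 waits only for the Leader's write; acks=0 does not wait at all.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), // hypothetical topic
                    (metadata, exception) -> {
                        if (exception == null) {
                            System.out.printf("committed at offset %d%n", metadata.offset());
                        }
                    });
        }
    }
}
```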

Consumers also read messages from the Leader. Only committed messages are exposed to the Consumer.

The data flow for Kafka Replication looks like this:

For Kafka, two conditions define whether a Broker is “alive”:

One is that it must maintain a session with ZooKeeper (this is done through ZooKeeper’s Heartbeat mechanism).

The second is that followers must be able to copy the Leader’s messages in a timely manner and not “fall too far behind”.

The Leader keeps track of the list of Replicas that are in sync with it. This list is called the ISR (in-sync Replicas). If a Follower fails or falls too far behind, the Leader removes it from the ISR. “Too far behind” means that the number of messages by which the Follower lags the Leader exceeds a preset threshold, or that the Follower has not sent a fetch request to the Leader for a certain period of time.

Kafka only addresses fail/recover. A message is considered committed only when all Followers in the ISR have copied it from the Leader. This prevents the case where the Leader crashes after some data has been written to it but before any Follower has copied it, which would lose the data (Consumers could never consume it). On the other hand, the Producer can choose whether to wait for the message to be committed. This mechanism ensures that as long as the ISR contains at least one Follower, a committed message will not be lost.
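The thresholds above map to broker configuration. A sketch of the relevant server.properties entries, with illustrative values; note that the message-count threshold was removed in Kafka 0.9, leaving only the time-based check.

```properties
# A Follower is dropped from the ISR if it has not caught up with the
# Leader within this window (the time-based check).
replica.lag.time.max.ms=10000

# Pre-0.9 brokers also had a message-count threshold; it was later removed:
# replica.lag.max.messages=4000

# A message must reach at least this many ISR replicas before the Leader
# acknowledges a Producer that uses acks=all.
min.insync.replicas=2
```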

Leader election

Leader election is essentially a distributed lock. There are two ways to implement distributed locks based on ZooKeeper:

Node name uniqueness: multiple clients try to create the same node, and only the client that creates it successfully obtains the lock

Temporary sequential node: all clients create their own ephemeral sequential node under a directory, and only the holder of the smallest sequence number obtains the lock

The Majority Vote strategy is similar to ZooKeeper's Zab election; ZooKeeper itself implements Majority Vote internally. Kafka uses the first method to elect the Leader replica of a Partition: the replica that successfully creates the ZNode becomes the Leader, and the other replicas register Watchers on that ZNode. If the Leader crashes, its ephemeral node is deleted automatically; all Followers watching the node receive the event and attempt to create the node themselves. Only the Follower that succeeds becomes the new Leader (ZooKeeper guarantees that only one client can successfully create a given node), and the remaining Followers re-register their Watchers.
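A minimal sketch of this node-name-uniqueness election with the official org.apache.zookeeper client; the ZNode path is illustrative, not Kafka's actual controller path.

```java
import org.apache.zookeeper.*;

public class LeaderElection implements Watcher {
    private static final String LEADER_PATH = "/election/leader"; // hypothetical path
    private final ZooKeeper zk;

    public LeaderElection(ZooKeeper zk) { this.zk = zk; }

    public void attemptLeadership() throws Exception {
        try {
            // Only one client can create this node; successful creation wins the "lock".
            // EPHEMERAL: the node disappears automatically if our session dies.
            zk.create(LEADER_PATH, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Became leader");
        } catch (KeeperException.NodeExistsException e) {
            // Lost the race: watch the node so the election re-runs
            // when the current leader's ephemeral node is deleted.
            zk.exists(LEADER_PATH, this);
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { attemptLeadership(); } catch (Exception ignored) {}
        }
    }
}
```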

Kafka message grouping, message consumption principle

A message in a Topic can be consumed by only one Consumer within the same Consumer Group, but multiple Consumer Groups can consume the message at the same time.

This is how Kafka implements both broadcast (to every Consumer) and unicast (to a single Consumer) of a Topic's messages. A Topic can correspond to multiple Consumer Groups. To broadcast, simply give each Consumer its own Group; to implement unicast, put all the Consumers in the same Group. Consumer Groups also allow Consumers to be grouped freely without sending messages to multiple Topics repeatedly.
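In the Java client this choice comes down to the group.id setting, as the hedged sketch below shows; the bootstrap address and topic name are illustrative.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupDemo {
    // Each distinct group.id receives its own full copy of the Topic
    // (broadcast); Consumers sharing a group.id split the Partitions
    // between them, so each message reaches only one of them (unicast).
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("my-topic")); // hypothetical topic
        return consumer;
    }
}
```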

Push and Pull

As a messaging system, Kafka follows the traditional design: Producers push messages to the Broker, and Consumers pull messages from the Broker.

Push model

The push pattern has difficulty adapting to consumers with different consumption rates, because the broker determines the rate at which messages are sent. The goal of the push mode is to deliver messages as fast as possible, but this easily leaves consumers too little time to process them, typically resulting in denial of service and network congestion. The pull pattern, by contrast, consumes messages at a rate appropriate to the Consumer's capacity.

The pull model

For Kafka, the pull mode is more appropriate. The pull pattern simplifies the design of the Broker and gives Consumers control over the rate at which messages are consumed. Consumers can also control how they consume messages, in bulk or one at a time, and can choose among delivery modes to achieve different transport semantics.
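A minimal sketch of such a pull loop with the Java client: the Consumer decides when to poll and how many records to take per poll, which is exactly the rate control the pull model provides. Address, group, and topic names are illustrative.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pull-demo");               // illustrative group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // The Consumer caps the batch size itself: pull in bulk, up to 100 at a time.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
                consumer.commitSync(); // committing after processing gives at-least-once semantics
            }
        }
    }
}
```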

Kafka writes and reads data sequentially

A Producer is responsible for sending data to Kafka. Kafka writes every message it receives to its hard drive and never loses data. To optimize write speed, Kafka uses two techniques: sequential writes and memory-mapped files (MMAP).

Sequential writes

Because a hard disk is a mechanical device, every read or write involves seeking, and seeking is a “mechanical action” that is the most time-consuming part. Hard disks therefore handle random I/O poorly and sequential I/O well. Kafka uses sequential I/O to speed up reads and writes to the hard disk.

Each message is appended to its Partition, which is a sequential write to disk and is therefore very efficient.

In a traditional message queue, messages are deleted after they have been consumed. Kafka does not delete data: it keeps everything, and each Consumer maintains an offset per Partition of a Topic to record how far it has read.

Even with sequential writes, a hard disk cannot catch up with memory. So instead of writing data to the disk in real time, Kafka takes advantage of the paged memory (page cache) of modern operating systems to use RAM for I/O efficiency.

Memory-mapped files (MMAP)

Linux kernel 2.2 introduced a “zero-copy” system-call mechanism. It skips the copy into the “user buffer” by establishing a direct mapping between disk space and memory: data is no longer copied into a user-mode buffer, which removes two system context switches and can double performance.

With MMAP, a process reads and writes memory (virtual memory, of course) exactly as it would read and write the hard disk. This approach yields a large I/O boost by eliminating the overhead of copying from kernel space to user space (calling read on a file first puts the data into kernel-space memory and then copies it into user-space memory).
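A minimal Java sketch of the memory-mapping mechanism just described, via FileChannel.map; the file name and mapping size are illustrative.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("demo.log", "rw"); // illustrative file
             FileChannel channel = file.getChannel()) {
            // Map 4 KiB of the file directly into the process's address space;
            // writes land in the OS page cache with no user-to-kernel copy.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put("hello".getBytes());
            buf.force(); // ask the OS to flush the dirty pages to disk
        }
    }
}
```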

Consumer (read data)

Imagine a Web Server sending a static file: how can it be optimized? The answer is zero copy. In the traditional mode, reading a file from the hard disk works like this.

The data is first copied into kernel space (read is a system call and the data is fetched via DMA, so it lands in a kernel buffer) and then into user space (steps 1 and 2); it is then copied from user space back into kernel space (the socket is also a system call with its own kernel buffer) and finally sent to the network card (steps 3 and 4).

Zero copy goes directly from kernel space (the DMA buffer) to kernel space (the socket buffer) and then to the network card. The technique is very common; Nginx uses it too.

In effect, Kafka stores all messages in a single file and sends the “file” directly to Consumers when they need the data. When the entire file does not need to be sent, Kafka calls the zero-copy sendfile function, whose parameters are listed below (a Java sketch follows the list):

  • out_fd: the output descriptor (usually a socket handle)
  • in_fd: the input file descriptor
  • off_t: the offset within in_fd (where to start reading)
  • size_t: how many bytes to read
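In Java, the same zero-copy path is exposed as FileChannel.transferTo, which delegates to sendfile(2) on Linux. A minimal sketch; the file name, host, and port are illustrative.

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel in = new FileInputStream("demo.log").getChannel();                   // in_fd
             SocketChannel out = SocketChannel.open(new InetSocketAddress("localhost", 9000))) { // out_fd
            long position = 0;        // off_t: where to start reading
            long count = in.size();   // size_t: how many bytes to send
            while (position < count) {
                // transferTo maps to sendfile(2) on Linux: the data moves
                // kernel-to-kernel without ever entering user space.
                position += in.transferTo(position, count - position, out);
            }
        }
    }
}
```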