Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and became an Apache project in July 2011. Today, Kafka is used by LinkedIn, Twitter, and Square in applications such as log aggregation, message queues, real-time monitoring, and event processing. In this article, we discuss Kafka's replication design.

The purpose of replication is to make the service highly available, so that even if some nodes fail, producers can continue to publish messages and consumers can continue to receive them.

Ways to ensure data consistency

There are two typical ways to ensure strong consistency of data. Both require that a leader be designated: the leader receives all write requests and propagates them to the other replicas (followers) in the same order.

Majority-based replication

In majority-based (quorum) replication, the leader does not consider a write committed until a majority of the followers have received it. If the leader fails, a new leader is elected through the coordination of a majority of the surviving replicas. This approach is used by algorithms and systems such as Raft, Paxos, ZooKeeper, Google Spanner, and etcd. With 2N + 1 nodes, it can tolerate at most N node failures.

Master-slave replication

In master-slave replication, a message is considered successfully written only when the leader and all followers have written it. With N nodes, at most N - 1 node failures can be tolerated.

Kafka uses master-slave replication to replicate logs between brokers in a cluster. Here's why:

Master-slave replication can tolerate more failures for the same number of replicas: it tolerates N failures with N + 1 replicas, whereas the majority-based approach needs 2N + 1 replicas to tolerate N failures. For example, with only two replicas, a majority-based scheme tolerates no failures, while master-slave replication tolerates one. Also, Kafka's log replication happens between machines in the same data center, so latency is relatively less of a bottleneck for log replication.
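
To make the fault-tolerance arithmetic concrete, here is a small sketch; the class and method names are purely illustrative and not part of Kafka.

```java
// Fault tolerance of the two replication schemes for a given replica count.
public class FaultToleranceDemo {
    // Majority (quorum) replication: 2N + 1 replicas tolerate N failures.
    static int majorityTolerance(int replicas) { return (replicas - 1) / 2; }

    // Master-slave (ISR-style) replication: N + 1 replicas tolerate N failures.
    static int masterSlaveTolerance(int replicas) { return replicas - 1; }

    public static void main(String[] args) {
        for (int replicas = 2; replicas <= 5; replicas++) {
            System.out.printf("replicas=%d: majority tolerates %d failure(s), master-slave tolerates %d%n",
                    replicas, majorityTolerance(replicas), masterSlaveTolerance(replicas));
        }
        // With 2 replicas: the majority scheme tolerates 0 failures, master-slave tolerates 1.
    }
}
```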

Key concepts

In Kafka, message streams are defined by topics, and each topic is divided into one or more partitions. Replication happens at the partition level: each partition has one or more replicas.
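
As a minimal sketch of how this looks in practice (the broker address and topic name here are assumptions), a replicated topic can be created with Kafka's AdminClient by specifying the number of partitions and the replication factor per partition:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each with 3 replicas (one leader plus two followers)
            // spread across the brokers in the cluster.
            NewTopic topic = new NewTopic("events", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```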

In a Kafka cluster, replicas are distributed evenly across the brokers. Each replica maintains a log on disk. Published messages are appended sequentially to the log, and each is identified by a monotonically increasing offset in the log. The offset is a logical position within a partition: given an offset, the same message can be identified in every replica of the partition. When a consumer subscribes to a topic, it keeps track of its consumption offset in each partition and uses it to issue fetch requests to the broker.
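
The consumer sketch below (topic name, group id, and broker address are assumptions) shows that every record carries a per-partition offset, which is exactly what the consumer tracks and uses in its fetch requests:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replication-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // The (partition, offset) pair identifies the same message in every replica.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```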

As shown in the figure above, when a producer publishes a message to a partition of a topic, the message is first sent to the leader replica of that partition and appended to its log. The follower replicas continuously fetch new messages from the leader. Once enough replicas have received the message, the leader commits it. One question is how the leader decides how many replicas are enough. It cannot simply wait for all replicas to complete their writes: any follower replica can fail, and the leader cannot wait indefinitely, so sacrificing service availability to guarantee data consistency in that way is not feasible.
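
A minimal producer sketch, assuming a local broker and a topic named events: with acks=all the send completes only after the leader has committed the message to the in-sync replicas, whereas acks=1 would return as soon as the leader has appended it to its own log.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class CommittedWriteProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait until the leader has committed the message

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer.send(new ProducerRecord<>("events", "key", "value")).get();
            System.out.printf("committed to partition %d at offset %d%n", meta.partition(), meta.offset());
        }
    }
}
```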

Kafka ISR model

To address this problem, Kafka takes a compromise approach and introduces the concept of the ISR, short for In-Sync Replicas. The replicas in the ISR are kept in sync with the leader, and the leader itself is also part of the ISR. In the initial state, all replicas are in the ISR. When a message is sent to the leader, the leader waits until every replica in the ISR tells it that it has received the message. If a replica fails, it is removed from the ISR. When the next message arrives, the leader only sends it to the replicas currently in the ISR.
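
In current Kafka versions, the practical answer to "how many is enough" is also bounded by the min.insync.replicas setting together with acks=all; this is extra detail beyond the text above. A hedged sketch of setting it with the AdminClient (broker address and topic name are assumptions):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class MinIsrConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            // With replication factor 3 and min.insync.replicas=2, one replica may
            // drop out of the ISR and acks=all writes still succeed; if the ISR
            // shrinks below 2, such writes are rejected rather than losing durability.
            AlterConfigOp setMinIsr = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(setMinIsr));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```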

The leader also maintains the HW (High Watermark), which is the offset of the last committed message in the partition. The leader continuously propagates the HW to the followers, and each broker writes it to disk for future recovery.

When a failed replica restarts, it first restores the HW recorded on disk and truncates its log to the HW offset, because messages after the HW are not guaranteed to have been committed. It then acts as a follower, synchronizing data from the leader starting at the HW; once it has caught up with the leader, it can rejoin the ISR.
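
The following is a deliberately simplified model of this recovery path, not Kafka's actual implementation; the class, field, and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class ReplicaModel {
    final List<String> log = new ArrayList<>(); // the replica's log
    long highWatermark = 0;                     // last offset known to be committed (persisted to disk)

    long logEndOffset() { return log.size(); }

    // On restart, truncate the log to the persisted HW: messages beyond the HW
    // are not guaranteed to have been committed.
    void recover() {
        while (log.size() > highWatermark) {
            log.remove(log.size() - 1);
        }
    }

    // Act as a follower: fetch missing messages from the leader, starting at
    // our own log end offset.
    void fetchFrom(ReplicaModel leader) {
        for (long o = logEndOffset(); o < leader.logEndOffset(); o++) {
            log.add(leader.log.get((int) o));
        }
    }
}

public class HwRecoveryDemo {
    public static void main(String[] args) {
        ReplicaModel leader = new ReplicaModel();
        ReplicaModel follower = new ReplicaModel();

        leader.log.addAll(List.of("m0", "m1", "m2"));
        follower.log.addAll(List.of("m0", "m1", "m2"));
        follower.highWatermark = 2; // HW persisted before the crash: only m0 and m1 are known to be committed

        follower.recover();         // truncate to offset 2 -> [m0, m1]
        follower.fetchFrom(leader); // catch up from the leader -> [m0, m1, m2]; it may now rejoin the ISR
        System.out.println("follower log after recovery: " + follower.log);
    }
}
```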

Kafka uses ZooKeeper to implement leader election. If the leader fails, the controller elects a new leader from the ISR. Uncommitted data may be lost during leader election, but committed messages are guaranteed not to be lost.

Tradeoffs between data consistency and service availability

To ensure data consistency, Kafka introduced the ISR. To improve service availability when synchronizing logs to followers, a follower writes the log entries it fetches from the leader into memory (the page cache) and then acknowledges a successful write to the leader, without waiting for the data to be flushed to disk. These behaviors can be tuned through Kafka configuration.
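
As a hedged sketch (the values below are illustrative, not recommendations), these are the main broker/topic configuration knobs involved in this tradeoff; the config names are real Kafka settings, printed here from a plain Properties object for readability.

```java
import java.util.Properties;

public class ReplicationTradeoffs {
    public static void main(String[] args) {
        Properties broker = new Properties();
        broker.put("default.replication.factor", "3");
        broker.put("min.insync.replicas", "2");                 // acks=all writes need at least 2 in-sync replicas
        broker.put("replica.lag.time.max.ms", "30000");         // followers lagging longer than this are removed from the ISR
        broker.put("unclean.leader.election.enable", "false");  // never elect an out-of-sync leader (favors consistency over availability)
        broker.put("log.flush.interval.messages", "9223372036854775807"); // rely on the page cache; replicas ack before fsync
        broker.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```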
