
Kafka consumers

Kafka consumers pull data from the broker. A push model, by contrast, has a hard time accommodating consumers with different consumption rates, because the broker decides how fast messages are sent: it tries to deliver them as quickly as possible, which can easily overwhelm a slow consumer. With the pull model, each consumer fetches messages at a rate that matches its own processing capacity. The downside of the pull model is that if the broker has no data, the consumer may spin in a loop that keeps returning empty results. For this reason, Kafka consumers pass a timeout when polling: if no data is currently available, the consumer waits up to that long before returning.
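A minimal sketch of pull-based consumption with a poll timeout, using the kafka-clients Java API; the broker address, topic name, and group id here are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullWithTimeout {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("TopicA"));
            while (true) {
                // poll() blocks for at most the given timeout; if no data arrives
                // within 1 second it returns an empty record set instead of spinning.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```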

Consumer groups

Consumer groups are closely related to consumers. In Kafka, multiple consumers can form a consumer group, and each consumer belongs to exactly one consumer group. One of the most important functions of consumer groups is to provide both broadcast and unicast semantics. Within a consumer group, each partition of a subscribed topic is consumed by only one consumer in that group; different consumer groups that subscribe to the same topic are independent of each other and do not interfere with one another.

Therefore, if we want a message to be consumed by multiple consumers, we can put those consumers into different consumer groups, which effectively gives us broadcasting. If we want a message to be consumed by only one consumer, we put the consumers into the same consumer group, which effectively gives us unicast.
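A minimal sketch of how group.id controls this, assuming a local broker and a TopicA topic; consumers c1 and c2 share one group (unicast between them), while c3 sits in a second group and receives its own copy of every message (broadcast across groups):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemo {
    // Builds a consumer in the given consumer group, subscribed to TopicA.
    static KafkaConsumer<String, String> buildConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("TopicA"));
        return consumer;
    }

    public static void main(String[] args) {
        // Unicast: c1 and c2 share one group, so each partition of TopicA
        // is read by only one of them. In practice each consumer would run
        // in its own thread or process.
        KafkaConsumer<String, String> c1 = buildConsumer("group-1");
        KafkaConsumer<String, String> c2 = buildConsumer("group-1");

        // Broadcast: a consumer in a different group gets its own full copy
        // of every TopicA message, independent of group-1.
        KafkaConsumer<String, String> c3 = buildConsumer("group-2");
    }
}
```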

Partition assignment strategies

A consumer group contains multiple consumers and a topic contains multiple partitions, so partition assignment is unavoidable: we have to decide which consumer consumes which partition.

Kafka's two classic assignment strategies are RoundRobin and Range.
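The strategy is chosen per consumer through the partition.assignment.strategy setting; for example, switching from the default RangeAssignor to RoundRobinAssignor (the broker address and group id are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;

public class AssignorConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // assumed group id
        // RangeAssignor is the default; here we switch to RoundRobinAssignor explicitly.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                RoundRobinAssignor.class.getName());
        return props;
    }
}
```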

RoundRobin

For example, suppose a consumer group consists of three consumers, ConsumerA, ConsumerB, and ConsumerC, that consume messages from TopicA at the same time, and TopicA has seven partitions. With the RoundRobin assignment strategy, the process looks like this:

This polling approach should be easy to understand. But what happens when a consumer group consumes partitions from multiple topics? For example, suppose a group of two consumers, ConsumerA and ConsumerB, consumes messages from both TopicA and TopicB at the same time. The RoundRobin assignment then looks like this:

Note: TAP0 stands for partition 0 of TopicA, and so on.

In this case, the RoundRobin algorithm treats the partitions of all subscribed topics as a single whole, wrapping each one as a TopicPartition (the class from the kafka-clients dependency). These TopicPartitions are then sorted by their hash values and handed out to the consumers in a round-robin fashion.
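A simplified sketch of that idea (not Kafka's actual RoundRobinAssignor implementation): collect every TopicPartition, sort by hash code, and deal them out one by one:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.TopicPartition;

public class RoundRobinSketch {
    // Distributes all TopicPartitions of the subscribed topics across the
    // consumers round-robin, after sorting the partitions by hash code.
    static Map<String, List<TopicPartition>> assign(List<String> consumers,
                                                    List<TopicPartition> allPartitions) {
        List<TopicPartition> sorted = new ArrayList<>(allPartitions);
        sorted.sort(Comparator.comparingInt(TopicPartition::hashCode));

        Map<String, List<TopicPartition>> assignment = new HashMap<>();
        consumers.forEach(c -> assignment.put(c, new ArrayList<>()));
        for (int i = 0; i < sorted.size(); i++) {
            String consumer = consumers.get(i % consumers.size());
            assignment.get(consumer).add(sorted.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<TopicPartition> partitions = List.of(
                new TopicPartition("TopicA", 0), new TopicPartition("TopicA", 1),
                new TopicPartition("TopicB", 0), new TopicPartition("TopicB", 1));
        // Each consumer ends up with at most one more partition than any other.
        System.out.println(assign(List.of("ConsumerA", "ConsumerB"), partitions));
    }
}
```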

But there is a problem: in the figure above, if ConsumerA subscribed only to TopicA and ConsumerB subscribed only to TopicB, then under the RoundRobin algorithm ConsumerA could end up consuming messages from TopicB, and ConsumerB could end up consuming messages from TopicA.

To sum up, the RoundRobin algorithm is only suitable when all consumers in the group subscribe to the same topics. It also ensures that the numbers of partitions assigned to the consumers in the group differ by at most one.

Range

Kafka uses the Range assignment strategy by default.

For example, suppose a consumer group consisting of three consumers, ConsumerA, ConsumerB, and ConsumerC, consumes messages from TopicA, which has seven partitions. With the Range assignment strategy, the process looks like this:

Now suppose a consumer group of two consumers, ConsumerA and ConsumerB, consumes messages from both TopicA and TopicB at the same time. With the Range assignment strategy, the process looks like this:

Unlike RoundRobin, the Range algorithm does not treat the partitions of multiple topics as a single whole; it assigns the partitions of each topic separately.

From the example above we can see the drawback of the Range algorithm: the numbers of partitions assigned to consumers in the same group can differ considerably, and because each topic's leftover partitions always go to the first consumers, the imbalance grows with the number of subscribed topics.
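A simplified sketch of the Range idea (not Kafka's actual RangeAssignor implementation), which also works through the 7-partition, 3-consumer arithmetic from the first example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.TopicPartition;

public class RangeSketch {
    // For each topic independently: every consumer gets partitions / consumers
    // partitions, and the first (partitions % consumers) consumers get one extra.
    static Map<String, List<TopicPartition>> assign(List<String> consumers,
                                                    Map<String, Integer> partitionsPerTopic) {
        Map<String, List<TopicPartition>> assignment = new HashMap<>();
        consumers.forEach(c -> assignment.put(c, new ArrayList<>()));

        for (Map.Entry<String, Integer> topic : partitionsPerTopic.entrySet()) {
            int perConsumer = topic.getValue() / consumers.size();
            int extra = topic.getValue() % consumers.size();
            int partition = 0;
            for (int i = 0; i < consumers.size(); i++) {
                int count = perConsumer + (i < extra ? 1 : 0);
                for (int j = 0; j < count; j++) {
                    assignment.get(consumers.get(i))
                              .add(new TopicPartition(topic.getKey(), partition++));
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // TopicA with 7 partitions and 3 consumers: 7 / 3 = 2 remainder 1,
        // so ConsumerA gets 3 partitions while ConsumerB and ConsumerC get 2 each.
        System.out.println(assign(List.of("ConsumerA", "ConsumerB", "ConsumerC"),
                Map.of("TopicA", 7)));
    }
}
```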

Offset maintenance

A committed offset is uniquely determined by the consumer group name together with the topic and partition of the message.

A consumer may fail partway through consumption, for example because of a power outage. After it recovers, it needs to continue from the position it had reached before the failure, so it must record in real time the offset up to which it has consumed; that way it can resume consumption once the fault is repaired.

Before Kafka 0.9, consumers stored offsets in ZooKeeper by default. Starting with 0.9, offsets are stored by default in a built-in Kafka topic named __consumer_offsets.
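A minimal sketch of committing offsets manually so the position survives a restart, assuming a local broker and a TopicA topic; with enable.auto.commit set to false, the consumer records its progress in __consumer_offsets only when commitSync() is called:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetCommit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // assumed group id
        props.put("enable.auto.commit", "false");         // commit offsets manually
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("TopicA"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(OffsetCommit::process);
                // Persist the consumed offsets to __consumer_offsets so the group
                // can resume from this position after a failure or restart.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```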