The author, xiaosan, is a full-stack engineer who recently graduated from university. My technical articles are mostly notes taken while learning. If you find this one useful, please give it a like. I also run a programmer exchange group; everyone is welcome to drop by and chat.

What is Kafka

Kafka was originally developed at LinkedIn and was contributed to the Apache Software Foundation, where it became a top-level open source project in 2012. Kafka is an open source distributed stream processing platform written in Scala and Java (it is also used as an MQ system, but it is not a pure messaging system).

Today Kafka is positioned as a distributed stream processing platform and is widely used for its high throughput, persistence, horizontal scalability, and support for streaming data processing. More and more open source distributed processing systems, such as Cloudera, Storm, Spark, and Flink, support integration with Kafka.

Producer and consumer mechanisms

In Kafka, the producer sends messages to the Broker, which persists them to disk. The Consumer subscribes to and consumes messages from the Broker, pulling them from the server (pull mode). ZooKeeper is responsible for metadata management and controller election for the entire cluster. The details are shown in the following figure.

The Kafka producer's partitioning strategy when sending to the Broker

Kafka is publish/subscribe on topics: producers send messages to specific topics, and consumers consume the topics they subscribe to. So what is the partitioning mechanism in Kafka? A topic is divided into partitions, and each partition has multiple replicas. Messages in different partitions of the same topic are different. Each message produced by a producer is sent to exactly one partition. Partition numbers in Kafka start at 0, so if a producer sends a message to a topic with two partitions, the message goes to either partition 0 or partition 1.
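As a small sketch of this setup (the topic name, partition count, and broker address below are only illustrative assumptions, not from the original article), a topic with two partitions could be created with the AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; adjust to your cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "demo-topic" with 2 partitions (numbered 0 and 1) and a replication factor of 1
            NewTopic topic = new NewTopic("demo-topic", 2, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```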

So how do you send a message to a specific partition?

To answer that, we can look at the producer's send logic. But first we need to know about something called ProducerRecord. What is it?

A ProducerRecord is a key/value pair sent to the Broker that encapsulates the message data; it is called PR for short.

The internal structure

```
Topic (name)
Partition ID (optional)
Key (optional)
Value
```
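As a sketch (the topic name and values below are made up for illustration), these fields map directly onto the ProducerRecord constructors:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerRecordExample {
    public static void main(String[] args) {
        // Topic and value only: no partition ID, no key
        ProducerRecord<String, String> r1 =
                new ProducerRecord<>("demo-topic", "hello");

        // Topic, key, and value: the partition is derived from the key
        ProducerRecord<String, String> r2 =
                new ProducerRecord<>("demo-topic", "user-42", "hello");

        // Topic, explicit partition, key, and value: partition 0 is used directly
        ProducerRecord<String, String> r3 =
                new ProducerRecord<>("demo-topic", 0, "user-42", "hello");
    }
}
```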

Producer sending logic

1. If the Partition ID is specified, the PR is sent to that specific Partition.

2. If the Partition ID is not specified but the Key is, the PR is sent to the Partition determined by hash(Key).

3. If neither the Partition ID nor the Key is specified, the PR is sent to Partitions in the default round-robin mode (for consumers, the default Partition assignment strategy is range).

4. If both the Partition ID and the Key are specified, the PR is sent only to the specified Partition (the Key does not affect Partition selection).

Note: a Partition has multiple replicas, but only one of them, the replication leader, handles the Partition's producer and consumer interactions.
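A rough sketch of the selection rules above (illustration only; the real DefaultPartitioner uses murmur2 hashing rather than Java's hashCode):

```java
// Simplified sketch of partition selection (not Kafka's actual implementation).
public class PartitionSelectionSketch {
    static int choosePartition(String key, Integer explicitPartition, int numPartitions) {
        if (explicitPartition != null) {
            // Rules 1 and 4: an explicit partition ID always wins
            return explicitPartition;
        }
        if (key != null) {
            // Rule 2: hash the key and take it modulo the partition count
            return Math.abs(key.hashCode()) % numPartitions;
        }
        // Rule 3: no partition and no key -> round robin (not shown here)
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(choosePartition("user-42", null, 2)); // 0 or 1, stable for this key
        System.out.println(choosePartition(null, 1, 2));         // always 1
    }
}
```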

The process of a producer sending to the Broker

Kafka clients do not send messages to the server one at a time. Messages passed to KafkaProducer are first written to a local buffer on the client, collected into batches, and then sent to the Broker in a single request. This batching is how the performance gain is achieved.
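A minimal producer sketch illustrating this (the broker address, topic name, and serializer choices are assumptions for illustration): send() only appends to the client-side buffer, and the background Sender thread later ships accumulated batches to the Broker.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // send() is asynchronous: it buffers the record and returns immediately
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace();
                            }
                        });
            }
            // flush() blocks until all buffered batches have actually been sent
            producer.flush();
        }
    }
}
```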

Common configuration for producers

```properties
# When the producer sends data to the leader, this parameter (request.required.acks in the old
# producer API) sets the data reliability level: 0, 1, or all.
acks

# If the request fails, the producer retries automatically; 0 means no retries. If retries are
# enabled, there is the possibility of duplicate messages.
retries

# Total size, in bytes, of unsent messages buffered for each partition; a full batch is sent as
# one request.
batch.size

# Default is 0: messages are sent immediately even if the batch.size buffer is not full. A larger
# value forces messages that could have been sent earlier to wait at least linger.ms, so more
# messages accumulate and are sent in one batch, reducing the number of requests.
linger.ms

# The total buffer memory the KafkaProducer can use, 32MB by default. If buffer.memory is set too
# small, messages may be written into the buffer faster than the Sender thread can ship them to
# the Kafka server. buffer.memory must be larger than batch.size, otherwise an insufficient-memory
# error is reported; do not exceed physical memory, and adjust it as needed.
buffer.memory

# key.serializer must be set even if no key is specified in the message. It must be a class that
# implements the org.apache.kafka.common.serialization.Serializer interface and serializes the key
# into a byte array; value.serializer does the same for the value.
key.serializer
value.serializer
```
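The same settings can be expressed in Java through the ProducerConfig constants; the values below are only illustrative, not recommendations from the original article:

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerConfigExample {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed address
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // 0, 1, or all
        props.put(ProducerConfig.RETRIES_CONFIG, 3);                // retry failed sends
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);         // per-partition batch size in bytes
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);              // wait up to 5 ms to fill a batch
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432L);  // 32MB client-side buffer
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return props;
    }
}
```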

Kafka Consumer mechanism and partitioning strategy

How does the consumer obtain data from the broker? Why pull mode rather than having the broker actively push?

The consumer pulls data from the broker's partitions. Why pull instead of push? With pull, each consumer can adjust to its own consumption capacity, since different consumers have different performance. If the broker has no data, the consumer can configure a timeout, block for a while, and then return. With push, the broker can deliver messages quickly, but it is easy to overwhelm consumers that cannot keep up, resulting in message backlog and delay.
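A minimal pull-loop sketch (the group ID, topic, and broker address are placeholders); the Duration passed to poll() is the blocking timeout mentioned above:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PullConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // poll() blocks for at most 500 ms if the broker has no data, then returns
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```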

From which partition does the consumer consume?

We know that a topic has multiple partitions and that a consumer group contains multiple consumers, so how are they matched up? A topic can be consumed by multiple consumers because it has multiple leader partitions, and each leader partition is consumed by exactly one consumer within a consumer group.

So how are partitions assigned to consumers?

RoundRobinAssignor (round-robin assignor): sorts all the partitions and all the consumers in the consumer group and assigns partitions to consumers one by one in a round-robin fashion. This works well as long as every consumer in the group subscribes to the same topics; if the subscribed topics differ, uneven distribution can occur. Take this example:

```
Partitions: topic-p0 / topic-p1 / topic-p2 / topic-p3 / topic-p4 / topic-p5 / topic-p6
c-1: topic-p0 / topic-p2 / topic-p4 / topic-p6   (consumer 1)
c-2: topic-p1 / topic-p3 / topic-p5              (consumer 2)
```

The drawback is that if consumers in the same group do not subscribe to the same topics, the allocation is no longer a clean round robin and partitions may be distributed unevenly. For example, take three consumers C0, C1, and C2 and three topics t0, t1, and t2, where t0 has one partition (p0), t1 has two partitions (p0, p1), and t2 has three partitions (p0, p1, p2). C0 subscribes to t0, C1 subscribes to t0 and t1, and C2 subscribes to t0, t1, and t2. Because the assignment is a round robin over all partitions, t0's single partition goes to C0 and C1 does not get it, although C1 can still receive t1's partitions; C2 can receive partitions of t0, t1, and t2, but since t2 is subscribed to only by C2 and is invisible to C0 and C1, all of t2's partitions and their messages end up on C2. This is the uneven-distribution problem.

There is also the range strategy (the consumer default mentioned above), which assigns partitions topic by topic: when the partitions of a topic do not divide evenly, the first consumers each get one extra partition, and a single consumer may listen to several topics. What is the disadvantage of this strategy? Because the assignment is done per topic, if c-1 consumes one extra partition of a single topic the impact is small, but with many topics c-1 ends up with one extra partition for every topic. The more topics there are, the more extra partitions it accumulates over time, and its performance declines.
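To pick an assignor explicitly, the consumer's partition.assignment.strategy setting can be changed; a minimal sketch (broker address and group ID are assumed placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;

import java.util.Properties;

public class AssignorConfigExample {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        // Switch from the default range strategy to round robin
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                RoundRobinAssignor.class.getName());
        return props;
    }
}
```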

Consumer rebalance strategy and offset maintenance mechanism

What is Rebalance

Kafka tries to distribute all the partitions of a topic evenly across the consumers in a group so that messages are consumed as fast as possible; this even distribution is called balance. Rebalance simply means reassigning the partitions so that they are balanced again. For example, if consumers A and B are consuming a topic and consumer C joins the group, Kafka rebalances across A, B, and C; after the Rebalance the allocation is fair again, and every Consumer instance is assigned its share of partitions.
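A consumer can observe rebalances through a ConsumerRebalanceListener; a minimal sketch (the topic name is assumed), useful for example to commit offsets or log state when partitions are taken away and handed out:

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collection;
import java.util.Collections;

public class RebalanceListenerExample {
    static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("demo-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before a rebalance takes partitions away; a good place to commit offsets
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called after the rebalance assigns (possibly new) partitions to this consumer
                System.out.println("Assigned: " + partitions);
            }
        });
    }
}
```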

If a consumer suddenly goes down while consuming, where does it resume from when it recovers?

The consumer records its offset and continues consuming from that offset when it recovers. So where is the offset recorded? Older versions recorded it in ZooKeeper; newer versions store it by default in Kafka's built-in topic named __consumer_offsets. By default this topic has 50 partitions, each with three replicas, and the partition count is set by the offsets.topic.num.partitions configuration. The hash of the consumer group ID modulo this partition count determines which partition of __consumer_offsets stores a given consumer group's offsets. The key consumer group + topic + partition uniquely identifies an offset, and the corresponding value is the committed offset.
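A rough illustration of that mapping (the group name is made up, and Kafka's own implementation uses an internal non-negative hash rather than Math.abs):

```java
public class OffsetsPartitionSketch {
    public static void main(String[] args) {
        String groupId = "demo-group";      // hypothetical consumer group
        int offsetsTopicPartitions = 50;    // default offsets.topic.num.partitions

        // hash(groupId) mod the partition count of __consumer_offsets decides where
        // this group's committed offsets are stored
        int partition = Math.abs(groupId.hashCode()) % offsetsTopicPartitions;
        System.out.println("Offsets for " + groupId +
                " are stored in __consumer_offsets partition " + partition);
    }
}
```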