Kafka FAQ summary

What is Kafka? What are the main application scenarios?

Kafka is a distributed stream-processing platform. What exactly does that mean?

A streaming platform has three key capabilities:

  1. Message queue: publish and subscribe to streams of messages. This capability is similar to a message queue, which is why Kafka is also classified as one.
  2. Fault-tolerant, durable storage of message streams: Kafka persists messages to disk, effectively avoiding the risk of message loss.
  3. Stream processing: Kafka provides a complete stream-processing class library for processing messages as they arrive.

Kafka has two main application scenarios:

  1. Message queue: build real-time streaming data pipelines that reliably move data between systems or applications.
  2. Data processing: build real-time stream-processing applications that transform or react to streams of data.

What is Kafka’s advantage over other message queues?

These days Kafka is treated as an excellent message queue by default, and it is often compared with RocketMQ and RabbitMQ. As I see it, Kafka's main advantages over other message queues are:

  1. Extreme performance: Kafka is written in Scala and Java, its design makes extensive use of batching and asynchronous processing, and it can handle on the order of ten million messages per second.
  2. Unbeatable ecosystem compatibility: Kafka has one of the best-integrated ecosystems around, especially in big data and stream computing.

In fact, Kafka was not a proper message queue in its early days. Early Kafka was the scruffy kid of the message queue world, with problems such as message loss and weak delivery guarantees. Of course, this has a lot to do with the fact that LinkedIn originally developed Kafka to handle enormous volumes of logs; Kafka was never intended as a message queue in the first place.

Over time, these shortcomings have been fixed by Kafka. So the idea that Kafka is unreliable as a message queue is outdated!

Do you know the queue model? Do you know Kafka's message model?

Messaging model

Traditional message queues provide at least two message models, P2P and PUB/SUB, but Kafka does not expose them directly. Instead, it cleverly provides the concept of a consumer group: a message can be consumed by multiple consumer groups, but within a consumer group only one consumer can consume it. With only one consumer group this is equivalent to the P2P model; with multiple consumer groups it is the PUB/SUB model.
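To make the consumer-group idea concrete, here is a minimal sketch using the plain Java consumer (the broker address, group name, and topic below are made-up placeholders). Consumers that share a group.id split a topic's partitions among themselves; a consumer with a different group.id independently receives every message.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");

// Consumers sharing group.id "group-a" divide the partitions between them
// (P2P semantics); a consumer in "group-b" would get its own full copy of
// the stream (PUB/SUB semantics).
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group-a");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("demo-topic"));
```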

Kafka's consumers fetch messages by pulling: producers push messages into the Kafka cluster, and consumers pull messages from the cluster, as shown below. This post mainly covers how partitions are distributed among consumers, the related consumption order, the underlying metadata structures, and how Kafka reads and stores data.

For details, see: www.cnblogs.com/moonandstar…

Queue model: Early message model

Queues serve as the message communication carrier and satisfy the producer-consumer pattern. A message can be consumed by only one consumer; unconsumed messages stay in the queue until they are consumed or time out. For example, if a producer sends 100 messages, two consumers will each consume roughly half of them, in the order they were sent.

Problems with the queue model:

Suppose we need to distribute producer-generated messages to multiple consumers, and each consumer must receive the complete message content.

The queue model does not solve this case well. Some clever people might say: we can create a separate queue for each consumer and have the producer send multiple copies. That is clumsy, wastes resources, and defeats the purpose of using a message queue in the first place.

Publish-subscribe model: Kafka's message model

The publish-subscribe model is designed to solve the problems of the queue model.

The publish-subscribe (pub/sub) model uses a Topic as the message communication carrier, similar to a broadcast: a publisher publishes a message, which is delivered via the topic to all subscribers. Users who subscribe only after a message has been broadcast will not receive that message.

In the publish-subscribe model, if there is only one subscriber, it is basically the same as the queue model. So the publish-subscribe model is functionally compatible with the queue model.

Kafka uses a publish-subscribe model.

RocketMQ’s message model is basically the same as Kafka’s. The only difference is that Kafka does not have a queue. Instead, it has a Partition.

What are Producer, Consumer, Broker, Topic and Partition?

Kafka routes messages published by producers to topics, and consumers who need those messages subscribe to the relevant topics, as shown below:

The diagram above also introduces some of Kafka’s most important concepts:

  1. Producer: the party that produces messages.
  2. Consumer: the party that consumes messages.
  3. Broker: can be thought of as an individual Kafka instance; multiple Kafka brokers form a Kafka cluster.

At the same time, you must have noticed that each Broker contains Topic and Partition:

  • Topic: Producers send messages to specific topics, and consumers consume messages by subscribing to specific topics.
  • Partition: A Partition is part of a Topic. A Topic can have multiple partitions, and partitions within the same Topic can be distributed across different brokers, indicating that a Topic can span multiple brokers. This is exactly what I drew above.

Highlight: a Partition in Kafka actually corresponds to the queue in other message queues. Doesn't that make it a bit easier to understand?

Kafka's multi-replica mechanism: what benefits does it bring?

Another important point: Kafka introduces a Replica mechanism for partitions. Among the replicas of a Partition, one is called the leader and the others are called followers. The messages we send go to the leader replica, and the follower replicas then pull messages from the leader to stay in sync.

Producers and consumers interact only with the leader replica. You can think of the other replicas as copies of the leader that exist only to keep the message store safe. When the leader replica fails, a new leader is elected from the followers. However, a follower whose synchronization with the leader does not meet the requirements cannot take part in the leader election.

What are the advantages of Kafka’s partitions and replicas?

  1. Kafka provides concurrency (load balancing) by giving a Topic multiple partitions that can be distributed across different brokers.
  2. A Partition can be given a number of Replicas, which greatly improves message storage safety and disaster recovery capability, at the cost of extra storage space.
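To make these two points concrete, here is a minimal sketch (not from the original article; broker address and topic name are placeholders) that creates a topic with 3 partitions and a replication factor of 3 using Kafka's Java AdminClient:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for concurrency, 3 replicas per partition for safety.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```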

Zookeeper in Kafka

To understand the role ZooKeeper plays in Kafka, you should set up a Kafka environment and look inside ZooKeeper yourself to see which nodes are associated with Kafka and what information each node holds. Don't just read without practicing, or you'll forget what you've learned! This part references and draws on this article: www.jianshu.com/p/a036405f9… .

The following is my local ZooKeeper, successfully associated with my local Kafka (the folder structure below is viewed with the ZooKeeper plugin for IDEA).

ZooKeeper provides metadata management for Kafka.

As you can see from the diagram, Zookeeper mainly does the following for Kafka:

  1. Broker registration: there is a node on ZooKeeper dedicated to recording the list of broker servers. At startup, each broker registers with ZooKeeper, creating its own node under /brokers/ids, and records information such as its IP address and port there.
  2. Topic registration: in Kafka, the messages of one Topic are divided into multiple partitions and spread across multiple brokers. This partition information and its mapping to brokers are also maintained by ZooKeeper. For example, if I create a topic named my-topic with two partitions, ZooKeeper creates these nodes: /brokers/topics/my-topic/partitions/0 and /brokers/topics/my-topic/partitions/1.
  3. Load balancing: Kafka provides concurrency by giving a Topic multiple partitions that can be distributed across different brokers. For the partitions of the same Topic, Kafka tries to spread them across different broker servers. When producers send messages, they try to deliver them to partitions on different brokers; when consumers consume, ZooKeeper enables dynamic load balancing based on the number of partitions and consumers.

How does Kafka ensure that messages are consumed in order?

When using message queues, business scenarios often require strict ordering of message consumption. For example, we send two messages whose corresponding database operations are: change the user's membership level, then calculate the order price based on that level. If the two messages are consumed in the wrong order, the final result will be very different.

We know that a Partition in Kafka is where messages are actually stored; everything we send is written there. Partitions belong to Topics, and we can specify multiple partitions for a given Topic.

Every message added to a Partition is appended at the tail, as shown in the figure above. Kafka can only guarantee message order within a partition, not across the partitions of a topic.

Messages are assigned a specific offset when appended to a Partition. Kafka uses offsets to ensure that messages are ordered within a partition.

Therefore, we have a very simple way to guarantee the order of message consumption: 1 Topic corresponds to only one Partition. That certainly solves the problem, but it defeats the purpose of Kafka.

When sending a message, Kafka lets us specify four parameters: topic, partition, key, and data. If you specify a partition when sending, all such messages go to that partition. Moreover, messages with the same key are always sent to the same partition, so we can, for example, use the ID of the table/object involved as the key.
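For example, here is a minimal sketch with the plain Java producer (topic name, key, and broker address are invented for illustration). Both messages carry the same key, so they land in the same partition and are consumed in the order they were sent:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42"; // hypothetical business key
            // Same key -> same partition -> strict ordering between these two messages.
            producer.send(new ProducerRecord<>("user-events", userId, "change membership level"));
            producer.send(new ProducerRecord<>("user-events", userId, "calculate order price"));
        }
    }
}
```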

To summarize, there are two ways to ensure the order of message consumption in Kafka:

  1. A Topic corresponds to only one Partition.
  2. (Recommended) Specify a key or Partition when sending messages.

Of course, these are not the only two methods; they are just the easiest to understand.

How does Kafka ensure that messages are not lost?

Producer-side message loss

After a producer calls the send method, the message may still fail to be delivered, for example because of network problems.

Therefore, we cannot assume a message was delivered just because we called send; we have to check the result of the send. Kafka's producer sends messages asynchronously; we can call get() on the returned future to fetch the result, but that turns the send into a synchronous operation.

See my article for the detailed code: Kafka series 3! Learn how to use Kafka as a message queue in Spring Boot programs.

```java
SendResult<String, Object> sendResult = kafkaTemplate.send(topic, o).get();
if (sendResult.getRecordMetadata() != null) {
    logger.info("Producer successfully sent message to "
            + sendResult.getProducerRecord().topic() + " -> "
            + sendResult.getProducerRecord().value().toString());
}
```

However, blocking like this is not recommended! Instead, we can attach a callback function, as in the following example:

```java
ListenableFuture<SendResult<String, Object>> future = kafkaTemplate.send(topic, o);
future.addCallback(
        result -> logger.info("Producer successfully sent message to topic:{} partition:{}",
                result.getRecordMetadata().topic(),
                result.getRecordMetadata().partition()),
        ex -> logger.error("Producer failed to send message: {}", ex.getMessage()));
```

If the message fails to be sent, we can check the cause of the failure and send it again.

In addition, it is recommended to set a reasonable value for the producer's retries parameter (the number of retries), usually 3, or higher if you want stronger guarantees that messages are not lost. Once configured, sends that fail because of network problems are retried automatically, avoiding message loss. It is also recommended to set the retry interval: if the interval is too small, retrying achieves little, since a single network fluctuation can exhaust all three retries at once.
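As a sketch, those retry settings map to two standard producer configuration keys, retries and retry.backoff.ms (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
props.put(ProducerConfig.RETRIES_CONFIG, 3);              // retry failed sends up to 3 times
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000L); // wait 1s between retries so one
                                                          // network blip cannot exhaust them all
```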

Consumer-side message loss

We know that messages are assigned a specific offset when appended to a Partition. The offset represents the position in the partition up to which the consumer has consumed. Kafka uses offsets to guarantee message order within a partition.

When the consumer pulls messages from a partition, it automatically commits the offset. One problem with auto-commit: the consumer may have just fetched a message and be about to actually process it when it suddenly crashes. The message was never actually consumed, yet the offset was already committed.

The solution is to disable automatic offset commits and commit the offset manually only after the message has actually been processed. However, careful readers will notice that this introduces the opposite problem: messages may be re-consumed. For example, if you have just processed a message but crash before committing the offset, that message will in theory be consumed twice.
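A minimal sketch of this with the plain Java consumer (broker address, group name, and topic are placeholders): auto-commit is disabled, and the offset is committed only after the records have actually been processed.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // turn off auto-commit
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));     // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);    // actually process the message first ...
                }
                consumer.commitSync(); // ... then commit the offset manually
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        System.out.printf("consumed %s@%d: %s%n",
                record.topic(), record.offset(), record.value());
    }
}
```

As the text notes, crashing between handle() and commitSync() means the batch is re-delivered, trading possible duplicates for no loss.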

Kafka itself losing messages

We know that Kafka introduces a Replica mechanism for partitions. Among the replicas of a Partition, one is called the leader and the others are followers. Messages are sent to the leader replica, and the follower replicas then pull from the leader to stay in sync. Producers and consumers interact only with the leader replica; the other replicas are just copies of it, existing only to keep the message store safe.

Imagine this situation: if the broker hosting the leader replica suddenly crashes, a new leader must be elected from the followers. If the leader still holds data that the followers have not yet synchronized, those messages are lost.

Set acks = all

The solution is to set acks = all. acks is an important parameter for the Kafka producer.

The default value of acks is 1, meaning a message counts as successfully sent once the leader replica has received it. Setting acks = all means the message is not considered successfully sent until all in-sync replicas have received it.
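On the plain Java producer this is a single configuration entry; a sketch (the broker address is a placeholder):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
// A send only counts as successful once all in-sync replicas have the message.
props.put(ProducerConfig.ACKS_CONFIG, "all");
```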

Set replication.factor >= 3

To ensure the leader has followers that can synchronize messages, we generally set replication.factor >= 3 for the topic. This guarantees each partition has at least 3 replicas. It causes data redundancy, but it also brings data safety.

Set min.insync.replicas > 1

Typically, you also need to set min.insync.replicas > 1, which means a message must be written to at least 2 replicas before the send counts as successful. The default value of min.insync.replicas is 1, which should be avoided in production.

However, to keep the whole Kafka service highly available, you need replication.factor > min.insync.replicas. Why? Imagine the two being equal: if just one replica dies, the entire partition stops working. That clearly violates high availability! Setting replication.factor = min.insync.replicas + 1 is recommended.
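Following that recommendation, here is a sketch of creating a topic with replication.factor = 3 and min.insync.replicas = 2, reusing the AdminClient setup from the earlier sketch (the topic name is a placeholder):

```java
// Assumes the same AdminClient setup as the earlier topic-creation sketch.
NewTopic topic = new NewTopic("demo-topic", 3, (short) 3)   // replication.factor = 3
        .configs(Collections.singletonMap("min.insync.replicas", "2"));
admin.createTopics(Collections.singleton(topic)).all().get();
```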

Set unclean.leader.election.enable = false

Starting with Kafka 0.11.0.0, the default value of the unclean.leader.election.enable parameter changed from true to false.

As we said at the beginning, the messages we send go to the leader replica, and the follower replicas then pull from the leader to synchronize. Different follower replicas can be in different states of synchronization. With unclean.leader.election.enable = false configured, if the leader fails, a new leader will not be elected from followers whose synchronization with the leader does not meet the requirements, which reduces the possibility of message loss.
