preface

Kafka is something I learned outside of my game-development work during the pandemic. Although I have used ActiveMQ and RabbitMQ before, I am still a beginner with Kafka, so please point out anything imperfect or inaccurate in this article.

Today we are going to talk about Kafka, mainly to introduce you to Kafka, and talk about some of the important concepts and problems in Kafka. In a later article I will introduce:

  1. Some of Kafka’s advanced features, including its workflow.
  2. Installing Kafka with Docker and using it to send and consume messages.
  3. How a Spring Boot application can use Kafka as a message queue.

We now tend to treat Kafka as an excellent message queue by default, and we often compare it with RocketMQ and RabbitMQ. In my view, the main advantages of Kafka over other message queues are:

  1. Extreme performance: Kafka is developed in Scala and Java, its design makes extensive use of batching and asynchronous processing, and it can handle up to tens of millions of messages per second.
  2. Unbeatable ecosystem compatibility: Kafka has one of the best-supported ecosystems around, especially in big data and stream computing.

In fact, Kafka was not a proper message queue in its early days. Early Kafka was a bit of a ragged kid in the message-queue world, with problems such as message loss and weak reliability guarantees. This has a lot to do with the fact that LinkedIn originally developed Kafka to handle huge volumes of log data; Kafka was never originally intended to be a message queue at all.

Over time, these shortcomings have been fixed by Kafka. So the idea that Kafka is unreliable as a message queue is outdated!

Getting to know Kafka

Take a look at the introduction on the official website; it should be the most authoritative and up to date. It doesn’t matter that it’s in English: I’ve already extracted the most important information for you.

From the official introduction, we can get the following information:

Kafka is a distributed streaming processing platform. What exactly does that mean?

The streaming platform has three key functions:

  1. Message queues: Publish and subscribe to message flows. This function is similar to message queues, which is why Kafka is also classified as a message queue.
  2. Fault-tolerant, persistent storage of record streams: Kafka persists messages to disk, effectively avoiding the risk of message loss.
  3. Streaming platform: Kafka provides a complete stream-processing library for processing messages as they are published.

Kafka has two main applications:

  1. Message queues: Build real-time streaming data pipelines that reliably move data between systems or applications.
  2. Data processing: Build real-time streaming data handlers to transform or process data streams.

A few very important concepts about Kafka:

  1. Kafka stores streams of records (stream data) in topics.
  2. Each record consists of a key, a value, and a timestamp.

Kafka message model

As an aside: the early JMS and AMQP standards were created by the standards bodies of the messaging world, which I described in JavaGuide’s Message Queues are Really Simple. However, these standards have not kept pace with the evolution of message queues and are effectively obsolete, which is why different message queues often define their own message models.

Queue model: Early message model

Queues are used as message communication carriers to meet the producer-consumer pattern. A message can only be used by one consumer, and unconsumed messages are kept in queues until consumed or timed out. For example, if our producer sends 100 messages, two consumers will consume half of the messages in the order in which they are sent.
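
The queue model described above can be sketched in a few lines of Python. This is a toy simulation, not real message-queue code: two consumers drain a shared queue, and each message goes to exactly one of them.

```python
from collections import deque

# Toy simulation of the queue model: one producer sends 100 messages,
# two consumers take turns, and each message is delivered exactly once.
queue = deque(f"msg-{i}" for i in range(100))

consumers = ["consumer-1", "consumer-2"]
consumed = {name: [] for name in consumers}
turn = 0
while queue:
    msg = queue.popleft()                     # removed from the queue: consumed only once
    consumed[consumers[turn % 2]].append(msg)
    turn += 1

print(len(consumed["consumer-1"]), len(consumed["consumer-2"]))  # 50 50
```

Note how each consumer ends up with half the messages, in the order they were sent, and no message is seen by both consumers.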

Problems with the queue model

Suppose we need to distribute the messages produced by a producer to multiple consumers, with each consumer receiving the complete message content.

The queue model does not solve this easily. Some clever people suggest: create a separate queue for each consumer and have the producer send a copy of each message to every queue. But that is clumsy, wastes resources, and defeats the purpose of using a message queue in the first place.

Publish-subscribe model: Kafka’s message model

The publish-subscribe model is designed to solve the problems of the queue model.

The publish-subscribe (pub-sub) model uses the Topic as the message communication carrier, similar to a broadcast: a publisher publishes a message, and the topic delivers it to all subscribers. Users who subscribe only after a message has been broadcast will not receive it.

In the publish-subscribe model, if there is only one subscriber, it is basically the same as the queue model. So the publish-subscribe model is functionally compatible with the queue model.
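
The broadcast behavior described above can be sketched as a toy `Topic` class (an illustrative model, not Kafka’s actual implementation): publishing delivers the message to every current subscriber, and a late subscriber misses earlier messages.

```python
class Topic:
    """Toy publish-subscribe topic: each published message is broadcast
    to every subscriber that exists at publish time."""

    def __init__(self):
        self.subscribers = {}           # subscriber name -> received messages

    def subscribe(self, name):
        self.subscribers[name] = []

    def publish(self, msg):
        for inbox in self.subscribers.values():
            inbox.append(msg)           # broadcast to all current subscribers

topic = Topic()
topic.subscribe("alice")
topic.subscribe("bob")
topic.publish("m1")
topic.subscribe("carol")                # subscribes after m1 was broadcast
topic.publish("m2")

print(topic.subscribers["alice"])       # ['m1', 'm2']
print(topic.subscribers["carol"])       # ['m2']  (missed m1)
```

With a single subscriber this degenerates into the queue model, which is exactly the compatibility point made above.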

Kafka uses a publish-subscribe model.

RocketMQ’s message model is basically the same as Kafka’s; the only difference is that Kafka does not have queues and uses Partitions instead.

Kafka important concept interpretation

Kafka sends messages published by producers to topics that consumers who need these messages can subscribe to, as shown below:

The diagram above also introduces some of Kafka’s most important concepts:

  1. Producer: The party that produces messages.
  2. Consumer: The party that consumes messages.
  3. Broker: Can be thought of as a single Kafka instance. Multiple Kafka brokers form a Kafka Cluster.

At the same time, you must have noticed that each Broker contains Topic and Partition:

  • Topic: Producers send messages to specific topics, and consumers consume messages by subscribing to specific topics.
  • Partition: A Partition is part of a Topic. A Topic can have multiple partitions, and partitions within the same Topic can be distributed across different brokers, indicating that a Topic can span multiple brokers. This is exactly what I drew above.

Highlight: Partitions in Kafka can actually correspond to queues in message queues. Doesn’t that make a little bit more sense?

Another point I think is important: Kafka introduces a Replica mechanism for partitions. Among the replicas of a partition, one is called the leader and the others are called followers. The messages we send go to the leader replica, and the follower replicas then pull messages from the leader to stay in sync.

Producers and consumers interact only with the leader replica. You can think of the other replicas as copies of the leader that exist only to keep message storage safe. When the leader replica fails, a new leader is elected from the followers; however, a follower that is not sufficiently in sync with the leader cannot take part in the election.
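
A minimal sketch of that election rule, under simplified assumptions (Kafka’s real in-sync-replica tracking is more involved; the `MAX_LAG` threshold here is a hypothetical stand-in):

```python
# Toy model of a partition's replicas: writes go to the leader, followers
# pull from it, and only sufficiently in-sync followers can become leader.
class Replica:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

leader = Replica(0)
followers = [Replica(1), Replica(2)]

for i in range(10):                      # producer writes go to the leader only
    leader.log.append(f"msg-{i}")

followers[0].log = leader.log[:]         # follower 1 is fully synced
followers[1].log = leader.log[:4]        # follower 2 lags far behind

MAX_LAG = 2                              # hypothetical "in sync" threshold
in_sync = [f for f in followers if len(leader.log) - len(f.log) <= MAX_LAG]

# The leader fails: the new leader must come from the in-sync set.
new_leader = in_sync[0]
print(new_leader.broker_id)              # 1
```

The lagging follower on broker 2 is excluded from the election, which is the point of the "synchronization degree" requirement above.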

What are the advantages of Kafka’s partitions and replicas?

  1. Kafka provides concurrency (and load balancing) by giving a Topic multiple partitions, which can be distributed across different brokers.
  2. A Partition can specify a Replica count. This greatly improves message storage safety and disaster recovery capability, at the cost of extra storage space.

Zookeeper in Kafka

To understand how ZooKeeper works in Kafka, you must set up a Kafka environment and go to ZooKeeper to see what folders are associated with Kafka and what information each node holds. Don’t just look at it and not practice it, you’ll forget what you’ve learned!

The next article will explain how to set up a Kafka environment. You can set up a Kafka environment in 3 minutes after reading the next article.

This part of the reference and draw lessons from this article: www.jianshu.com/p/a036405f9… .

The following is my local Zookeeper, successfully associated with my local Kafka (the folder structure below is displayed with the Zookeeper tool plugin for IDEA).

ZooKeeper provides metadata management for Kafka.

As you can see from the diagram, Zookeeper mainly does the following for Kafka:

  1. Broker registration: Zookeeper has a dedicated node that records the list of Broker servers. At startup, each Broker registers with Zookeeper and creates its own node under /brokers/ids, recording information such as its IP address and port.
  2. Topic registration: In Kafka, the messages of a Topic are divided into multiple partitions and spread across multiple brokers. This partition information, and its mapping to brokers, is also maintained by Zookeeper. For example, if I create a topic named my-topic with two partitions, Zookeeper will create these nodes: /brokers/topics/my-topic/partitions/0 and /brokers/topics/my-topic/partitions/1.
  3. Load balancing: Kafka provides concurrency by giving a Topic multiple partitions that can be distributed across different brokers. Kafka tries to spread the partitions of the same Topic across different Broker servers, and a producer likewise tries to deliver messages to partitions on different brokers. On the consumer side, Zookeeper enables dynamic load balancing based on the number of partitions and consumers.
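
The broker and topic registration described above can be sketched as a toy znode map (just a Python dict modeling the Zookeeper paths; the helper functions are illustrative, not a real Zookeeper client):

```python
# Toy sketch of the Zookeeper nodes Kafka creates for broker and topic
# registration, following the path layout described above.
zk = {}  # znode path -> node data

def register_broker(zk, broker_id, host, port):
    zk[f"/brokers/ids/{broker_id}"] = {"host": host, "port": port}

def register_topic(zk, topic, num_partitions):
    for p in range(num_partitions):
        zk[f"/brokers/topics/{topic}/partitions/{p}"] = {}

register_broker(zk, 0, "127.0.0.1", 9092)
register_topic(zk, "my-topic", 2)

print(sorted(zk))
# ['/brokers/ids/0',
#  '/brokers/topics/my-topic/partitions/0',
#  '/brokers/topics/my-topic/partitions/1']
```

This is exactly the folder structure you should see in the Zookeeper tool after creating a two-partition topic.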

How does Kafka guarantee the order in which messages are consumed?

When using a message queue, business scenarios often require us to strictly preserve the order of message consumption. For example, suppose we send two messages that correspond to two database operations: changing a user’s membership level, and calculating an order price based on that membership level. If the two messages are consumed in the wrong order, the final result will be very different.

We know that the Partition is where messages are actually stored in Kafka; all the messages we send are stored there. Partitions belong to a Topic, and we can specify multiple partitions for the same Topic.

Each message added to a Partition is appended at the tail, as shown in the figure above. Kafka can only guarantee the order of messages within a partition, not across the partitions of a topic.

Messages are assigned a specific offset when appended to a Partition. Kafka uses offsets to ensure that messages are ordered within a partition.
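
That offset mechanism can be sketched as a toy append-only log (an illustrative model of a single partition, not Kafka internals):

```python
# Toy partition log: each appended message gets the next offset, so the
# order within the partition is exactly the append order.
partition = []

def append(partition, msg):
    offset = len(partition)            # next offset = current log length
    partition.append((offset, msg))
    return offset

for m in ["a", "b", "c"]:
    append(partition, m)

offsets = [off for off, _ in partition]
print(offsets)  # [0, 1, 2]  -- strictly increasing, so order is preserved
```

A consumer reading the partition in offset order is guaranteed to see messages in the order they were appended.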

Therefore, we have a very simple way to guarantee the order of message consumption: 1 Topic corresponds to only one Partition. That certainly solves the problem, but it defeats the purpose of Kafka.

When sending a message, Kafka lets you specify a topic, a partition, a key, and the data. If you specify a Partition, the message is sent to that Partition. Moreover, messages with the same key are always sent to the same partition, so we can use something like a table/object ID as the key.
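
The key-to-partition mapping works roughly like this sketch. Note the hash function is an assumption for illustration: Kafka’s default partitioner uses a murmur2 hash of the key, while `crc32` here is just a deterministic stand-in.

```python
import zlib

# Toy key-based partitioner: the same key always hashes to the same
# partition, so messages about the same entity stay in order.
NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Stand-in hash; real Kafka uses murmur2 on the serialized key.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

p1 = partition_for("order-42")
p2 = partition_for("order-42")
print(p1 == p2)  # True -- same key, same partition, so ordering holds
```

Because all messages for "order-42" land in one partition, the within-partition ordering guarantee is enough to keep them in consumption order relative to each other.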

To summarize, there are two ways to ensure the order of message consumption in Kafka:

  1. A Topic corresponds to only one Partition.
  2. (Recommended) Specify a key or Partition when sending messages.

Of course, these are not the only two methods; they are just the easiest to understand.

Recommended reading

  • Apache Kafka using Keys for Partition: linuxhint.com/apache_kafk…
  • Spring Boot and Kafka – Practical Configuration Examples: thepracticaldeveloper.com/2018/11/24/…
  • Six years of change in big data: www.infoq.cn/article/b8*…

Open Source Project Recommendation

Other open source projects recommended by the authors:

  1. JavaGuide: A Java learning + Interview Guide that covers the core knowledge that most Java programmers need to master.
  2. Springboot-guide: A Spring Boot tutorial for beginners and experienced developers.
  3. Advancer-advancement: I think there are some good habits that technical people should have!
  4. Spring-security-jwt-guide: Start from Scratch! Spring Security With JWT (including permission validation) backend part of the code.
