What a message queue is

A message queue is a first-in, first-out (FIFO) data structure.

import java.util.LinkedList;
import java.util.Queue;

// 1: create a queue to hold strings
Queue<String> stringQueue = new LinkedList<>();
// 2: put a message into the queue
stringQueue.offer("hello");
// 3: fetch the message from the queue and print it
System.out.println(stringQueue.poll()); // prints "hello"

The code above demonstrates a simple queue: it stores a message, then takes the message out. In essence, a message queue is a container for messages. A producer generates messages and puts them into the queue; a consumer takes them out and processes them.

So Kafka is essentially a middleman: it takes messages in from producers and hands them out to consumers.

Application scenarios of message queues

  • Application decoupling

    If one application calls another directly through an interface, the two are tightly coupled: if the interface call fails, the whole process fails. Adding a message queue as an intermediary changes the caller's behavior from invoking the interface to sending a message to the queue. Once the producer sends the message successfully, its job is done; it does not care whether the message is consumed immediately. The consumer can then take the message out and process it at a suitable time.

  • Asynchronous processing

    Asynchronous processing means that multiple applications take messages from the message queue and process them concurrently, which is far more efficient than processing them serially.

  • Rate limiting and peak shaving

    For example, in a high-traffic business scenario, the message queue acts as a container that holds at most ten messages (which can be implemented by limiting the queue's length). The backend then consumes those ten messages at its own pace, which prevents a flood of requests from hitting the backend interface directly and overwhelming it.

  • Message-driven systems

    The system is divided into a message queue, message producers, and message consumers. Producers are responsible for producing messages, and consumers (there may be several) are responsible for processing them. This is a system design concept: the message queue provides decoupling, as well as buffering and peak shaving.
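The bounded-queue idea behind peak shaving can be sketched in plain Java. This is a minimal sketch, not real middleware: the capacity of 10 mirrors the ten-message container described above, and `offer()` rejects requests once the queue is full instead of letting them through to the backend.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class PeakShaving {
    // Simulate n incoming requests against a bounded queue of capacity 10.
    // Returns, for each request, whether it was accepted.
    static boolean[] simulate(int n) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        boolean[] accepted = new boolean[n];
        for (int i = 0; i < n; i++) {
            // offer() is non-blocking: it returns false when the queue is full,
            // so excess traffic is rejected instead of overwhelming the backend.
            accepted[i] = queue.offer("req-" + i);
        }
        return accepted;
    }

    public static void main(String[] args) {
        boolean[] r = simulate(12);
        System.out.println("first request accepted: " + r[0]);  // true
        System.out.println("11th request accepted: " + r[10]);  // false, queue is full
    }
}
```

The backend can then drain the queue at whatever rate it can sustain, regardless of the incoming burst.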

Two messaging modes

Point-to-point mode

The point-to-point mode consists of three roles:

  • The message queue
  • Sender (producer)
  • Recipient (consumer)

The sender sends the message, the consumer consumes it, and once a message has been consumed it is no longer kept in the queue. A consumer therefore cannot consume a message that has already been consumed.

Publish/subscribe

The publish/subscribe pattern consists of three roles:

  • Topic
  • Publisher
  • Subscriber

The publisher sends messages to the Topic, and the system passes those messages to multiple subscribers. Features of the publish/subscribe model:

  • Each message can have multiple subscribers. That is, a single message can be consumed by different subscribers.
  • Publishers and subscribers have a time dependency: to consume messages published on a topic, a subscriber must first subscribe to that topic.
  • Subscribers must subscribe to the topic in advance and stay online to receive messages.

Basic information about Kafka

Kafka is a distributed, partitioned, multi-replica, multi-subscriber log system (a distributed MQ system). Senders are called producers and receivers are called consumers. A Kafka cluster consists of multiple Kafka instances, each called a broker. The Kafka cluster, producers, and consumers all rely on Zookeeper to keep the system available; Zookeeper stores some meta information. Kafka features:

  • Reliability: distributed, partitioned, replicated, fault tolerant
  • Scalability: The Kafka messaging system scales easily without downtime
  • Durability: Kafka uses a distributed commit log and persists data to disk as quickly as possible, so messages are durable.
  • Performance: Kafka delivers high throughput for both publishing and subscribing.

Kafka architecture

Kafka Cluster: a cluster of multiple instances; each instance is called a broker.

Kafka Broker: Each instance of a Kafka cluster. Each instance has a unique number to identify it.

Kafka Producer: Message producer, responsible for publishing messages.

Kafka Consumer: Message consumer, responsible for consuming messages.

Kafka Topic: used to distinguish between different types of messages. When messages are stored, they are classified by type and placed under different topics. Radio broadcasting is a good analogy: every station broadcasts on its own frequency, and to listen to a station you tune to that frequency; the frequency plays the role of the topic. A topic's job is differentiation; it is a logical unit.

Partition (shard): a topic is divided into partitions, and different partitions are generally placed on different nodes (brokers). There is theoretically no upper limit to the number of partitions. Splitting a topic into several smaller containers, each a partition, lets the partitions be distributed evenly across the nodes. The main purpose is to spread the data across different partitions, and therefore across different servers, which increases the topic's storage capacity.

Replicas: multiple replicas are kept for each partition to ensure that data is not lost. The number of replicas is at most the number of nodes; for example, if a Kafka cluster has three instances, at most three replicas can be configured. Replicas have a leader-follower relationship: the leader replica handles reads and writes, while the followers copy data from it. (This changed somewhat in version 2.0, which allows reading from follower replicas to a certain extent.) The more replicas, the safer the data, but the more disk space is used.

Kafka's sharding and replica mechanism

A consumer can listen to multiple topics; Kafka tracks the consumer's position in each partition with an offset.

Sharding: A topic is divided into multiple containers, each of which is a shard, and these shards are distributed evenly across the brokers. When storing data, the data will be stored in different shards, that is, the data will fall on different machines, thus expanding the storage capacity of topic.

Copy: Multiple copies are made for each fragment to prevent data loss. It is important to note that multiple copies of the same shard cannot be placed on one node, because when the node fails, all copies will be lost. The purpose of copies is to prevent loss, so you need to ensure that copies are distributed. Therefore, the number of replicas is limited by the number of nodes, and the maximum number of replicas can only be equal to the maximum number of nodes.

Kafka data Non-loss Principle (ACK)

How does the producer ensure that data is not lost

The producer uses the ACK acknowledgment mechanism to ensure that data is not lost.

The three values of acks (0, 1, -1):

0: The producer only sends the message and does not care whether Kafka received it successfully.

1: The producer waits until the data has been successfully written to the leader replica of the target partition of the specified topic, at which point Kafka responds with an ACK.

-1 (all): The producer waits until the message has been successfully written to all replicas of the target partition in Kafka before Kafka responds with an ACK.
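The three acks levels map to a single producer configuration key. The following is a minimal sketch using plain `java.util.Properties` with string keys (the broker address `localhost:9092` is a placeholder; in real code you would pass these properties to a `KafkaProducer`):

```java
import java.util.Properties;

public class ProducerAckConfig {
    // Build producer settings with the given acks level:
    // "0"  - fire and forget; "1" - leader ack only; "-1"/"all" - all replicas must ack.
    static Properties producerProps(String acks) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("acks", acks);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // acks=-1 (all): the strongest guarantee, used when data loss is unacceptable.
        System.out.println("acks = " + producerProps("-1").getProperty("acks"));
    }
}
```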

How does the broker ensure that messages are not lost

The broker ensures that data is not lost through the data replica mechanism together with acks = -1.

How does the consumer ensure that data is not lost

1: The consumer connects to the Kafka cluster, and Kafka finds the last consumed position (offset) based on the consumer's groupId. If this is the consumer's first time consuming, by default it starts listening for messages from the moment it connects. (A different consumption policy can be configured here, such as consuming from the beginning.)

2: The consumer starts to fetch the data, then performs business processing, and then submits the offset to Kafka.

Is there a message loss here?

The answer is no, messages are not lost here. However, messages may be consumed more than once: if the consumer crashes after processing a message but before committing the offset, then the next time it starts, Kafka will look up the last committed offset for its groupId, and because the previous offset was never committed, the same message will be consumed again.
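This duplicate-consumption scenario can be simulated in plain Java. Note this is an illustrative sketch, not Kafka client code: an offset is committed only after processing, so a crash between the two steps makes the next run reprocess the same message (at-least-once delivery).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AtLeastOnceDemo {
    static final List<String> log = List.of("m0", "m1", "m2");     // the partition's messages
    static final Map<String, Integer> committed = new HashMap<>(); // groupId -> committed offset
    static final Map<String, Integer> processedCount = new HashMap<>();

    // Consume one message for a group; if crashBeforeCommit is true, "crash"
    // (return early) after processing but before committing the offset.
    static void consumeOne(String groupId, boolean crashBeforeCommit) {
        int offset = committed.getOrDefault(groupId, 0); // resume from last committed offset
        String msg = log.get(offset);
        processedCount.merge(msg, 1, Integer::sum);      // business processing
        if (crashBeforeCommit) return;                   // crash: offset never committed
        committed.put(groupId, offset + 1);              // commit only after processing
    }

    public static void main(String[] args) {
        consumeOne("g1", true);   // processes m0, crashes before committing
        consumeOne("g1", false);  // restarts, resumes at offset 0: m0 is processed again
        System.out.println("m0 processed " + processedCount.get("m0") + " times"); // 2
    }
}
```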

Where does Kafka record the offset information for each consumer group?

Different versions record it in different places.

Before version 0.8.x, offsets were recorded in Zookeeper. From 0.8.x onward, offsets are recorded in Kafka itself, in a dedicated internal topic (__consumer_offsets, which has 50 partitions with one replica each).
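Each consumer group's offsets land in one of the 50 partitions of __consumer_offsets, chosen by hashing the group id modulo the partition count. A sketch of that mapping (the masking trick keeps the hash non-negative):

```java
public class OffsetsTopicPartition {
    static final int OFFSETS_TOPIC_PARTITIONS = 50; // default partition count of __consumer_offsets

    // Kafka maps a consumer group to one partition of __consumer_offsets
    // by hashing the group id modulo the partition count.
    static int partitionFor(String groupId) {
        return (groupId.hashCode() & 0x7fffffff) % OFFSETS_TOPIC_PARTITIONS;
    }

    public static void main(String[] args) {
        // The same group id always maps to the same partition.
        System.out.println("my-group -> partition " + partitionFor("my-group"));
    }
}
```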

Kafka’s storage mechanism

Kafka's data storage mechanism, taking one replica of a topic partition as an example:

index is the index file and log is the data file. Messages are recorded in the log file, while the index file stores the physical offsets of messages within the log file.

Kafka is messaging middleware: once data has been consumed, it is considered useless and needs to be deleted at some point.

The data is stored in replicas, and each replica is managed by a broker node. The data ultimately lives on disk, but is it stored in one file or split across multiple files?

It is split across multiple files; each file stores up to 1 GB of data.

Each file segment mainly consists of two files: an index file and a log file. The index file is the index of the log file.

Each file is named after the starting offset of the messages it stores.
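Segment files are named after the offset of the first message they contain, zero-padded to 20 digits. A sketch of that naming convention:

```java
public class SegmentName {
    // A segment's files are named after the offset of the first message they contain,
    // zero-padded to 20 digits.
    static String logFileName(long baseOffset) {
        return String.format("%020d", baseOffset) + ".log";
    }

    static String indexFileName(long baseOffset) {
        return String.format("%020d", baseOffset) + ".index";
    }

    public static void main(String[] args) {
        System.out.println(logFileName(0));       // 00000000000000000000.log
        System.out.println(logFileName(737337));  // 00000000000000737337.log
    }
}
```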

Why does Kafka store data in separate files?

1) It keeps each file from growing too large, so reads are more efficient.

2) Kafka only stores data temporarily; by default it deletes expired data (older than 7 days). If everything were in a single file, deletion would require traversing the file's contents, which is inefficient and awkward. With separate files, Kafka only needs to check each file's last modification time.

Kafka’s data query mechanism

Given the segment files of one replica, how do we quickly find the message at offset 777777?

1) Determine which segment the offset falls in.

2) Within that segment (base offset 737337), first query the index file to find the physical offset of message 777777 in the log file.

3) Scan the log file from that position to locate the exact message and read the data.
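Step 1, finding the right segment for a given offset, amounts to a floor lookup over the segments' base offsets. A sketch using a `TreeMap` (the base offsets here are illustrative, matching the example above):

```java
import java.util.TreeMap;

public class SegmentLookup {
    // Base offsets of the existing segment files, kept in sorted order.
    static final TreeMap<Long, String> segments = new TreeMap<>();
    static {
        segments.put(0L, "00000000000000000000.log");
        segments.put(737337L, "00000000000000737337.log");
    }

    // The segment holding `offset` is the one with the largest base offset <= offset.
    static long segmentFor(long offset) {
        return segments.floorKey(offset);
    }

    public static void main(String[] args) {
        // Message 777777 lives in the segment starting at 737337.
        System.out.println("offset 777777 -> segment " + segmentFor(777777));
    }
}
```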

Producer partitioning strategy for Kafka

Suppose a topic has three partitions, each with three replicas. Should the producer send a message to just one partition, or do all partitions receive it?

A message is sent to the leader replica of exactly one partition, and that leader then synchronizes the data to the two follower replicas.

There are four partitioning policies in Kafka for producers to send messages:

1) Hash modulo

2) Sticky partition (polling)

3) Specify a partition scheme

4) Customize the partition scheme

Too lazy to write…
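The first strategy at least is easy to sketch: hash the message key modulo the partition count, so messages with the same key always land in the same partition. This is a simplified sketch; Kafka's actual default partitioner uses murmur2 on the serialized key, and a "sticky" strategy for messages with no key.

```java
public class HashModuloPartitioner {
    // Hash-modulo strategy: same key -> same partition, which preserves
    // per-key ordering. The mask keeps the hash non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-42", 3);
        int p2 = partitionFor("order-42", 3);
        System.out.println("same partition: " + (p1 == p2)); // true
    }
}
```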

Load balancing strategy for Kafka consumers

If messages are being produced much faster than they are being consumed, and there is a backlog of messages, what is the solution?

Increase the number of consumers (they must be in the same consumer group for this to raise consumption speed!). Note Kafka's consumer load-balancing rule: within one consumer group, the useful number of consumers is at most the number of partitions of the topic being listened to. If there are more consumers than partitions, some consumers will always sit idle. Also, a partition's data can be consumed by only one consumer in the group, not by the other consumers in the same group.
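The rule above can be sketched as a round-robin assignment of partitions to consumers within a group: each partition goes to exactly one consumer, and with more consumers than partitions the extras get nothing. This is an illustrative sketch, not Kafka's actual assignor.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupAssignment {
    // Assign numPartitions partitions round-robin across numConsumers consumers
    // in the same group. Returns, per consumer, the partitions it owns.
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) owned.add(new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            owned.get(p % numConsumers).add(p); // each partition has exactly one owner
        }
        return owned;
    }

    public static void main(String[] args) {
        // 3 partitions, 4 consumers: the 4th consumer sits idle.
        System.out.println(assign(3, 4)); // [[0], [1], [2], []]
    }
}
```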

How do you use Kafka to simulate peer-to-peer and publish-subscribe?

To simulate publish/subscribe, define multiple consumers, place them in different consumer groups, and subscribe them all to the same topic.

To simulate point-to-point, place all the consumers listening to a topic in the same consumer group.
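The group.id trick can be modeled in a few lines: a message is delivered once per group, and within a group to only one consumer. This is an in-memory sketch of the delivery semantics, not Kafka client code; picking the first consumer in each group is a simplification of how Kafka balances partitions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupDelivery {
    // groups: groupId -> consumers in that group.
    // Returns the consumers that receive the message.
    static List<String> deliver(String message, Map<String, List<String>> groups) {
        List<String> receivers = new ArrayList<>();
        // Each group gets its own copy of the message (publish/subscribe across groups)...
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
            // ...but within a group only one consumer receives it (point-to-point).
            receivers.add(g.getValue().get(0));
        }
        return receivers;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("g1", List.of("c1", "c2")); // same group: point-to-point between c1 and c2
        groups.put("g2", List.of("c3"));       // separate group: also gets the message
        System.out.println(deliver("hello", groups)); // [c1, c3]
    }
}
```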

There is a data backlog in Kafka

(See Kafka-Eagle)

Kafka quota rate limiting mechanism

Producers and consumers may produce or consume messages at extremely high rates, monopolizing the broker's resources and saturating network I/O, which can affect the normal operation of other topics.

Quotas are intended to solve this problem.