How can Kafka data be guaranteed against loss?

Consider three points: the producer end, the consumer end, and the broker end.

  1. No loss of producer data

Kafka's ack mechanism: every time the producer sends a message, an acknowledgement mechanism confirms that the message was properly received. The possible settings are 0, 1, and -1.

In synchronous mode, setting ack to 0 is risky and therefore not recommended. Even with ack set to 1, data is lost if the leader goes down before the followers have copied it. To ensure no data loss on the producer end, set this parameter to -1.

In asynchronous mode, the ack setting still applies. Async sending goes through a buffer, and two thresholds control when data is flushed: a time threshold and a message-count threshold. If the buffer fills up before the data has been sent, you can configure whether to drop the buffered data immediately or block; setting the blocking timeout to -1 blocks permanently, so no data is dropped and the producer simply stops producing until there is room. Even with -1, data can still be lost through careless operation, such as killing the process with kill -9, but that is a special exception.
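
The buffering described above matches the legacy producer's async settings; in the current Java client, the closest knobs are a sketch like the following (values and broker address are placeholders, not recommendations):

```java
import java.util.Properties;

public class AsyncBufferConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        // Time threshold: wait up to 10 ms to fill a batch before sending it.
        props.put("linger.ms", "10");
        // Size threshold: flush a batch once it reaches 16 KB.
        props.put("batch.size", "16384");
        // Memory for records not yet sent; when it is exhausted, send() blocks...
        props.put("buffer.memory", "33554432");
        // ...for at most this long before throwing. (The legacy producer's
        // queue.enqueue.timeout.ms=-1 is what "permanently blocking" refers to.)
        props.put("max.block.ms", "60000");
    }
}
```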

Note: ack=0: the producer continues sending the next (batch of) messages without waiting for the broker's confirmation. ack=1 (default): the producer waits for the leader to successfully receive and confirm the data before sending the next message. ack=-1: the producer waits for confirmation from the followers as well before sending the next data.
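
As a concrete illustration, here is a minimal sketch of a producer configured for the safest setting above (acks=-1, written as "all" in the Java client); the broker address, topic, and retry count are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");  // same as -1: wait for leader and in-sync followers
        props.put("retries", "3"); // retry transient failures instead of dropping data

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // The ack never arrived; handle or retry here.
                            exception.printStackTrace();
                        }
                    });
        }
    }
}
```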

  2. No loss of consumer data

Offset commits are used to ensure that data is not lost. Kafka records the offset of each consumed message, and the next round of consumption resumes from the last committed offset.

Before Kafka 0.8.2, offset information was stored in ZooKeeper; later versions store it in an internal Kafka topic (__consumer_offsets). Even if a consumer fails at runtime, on recovery it reads the committed offset, finds where the previous consumption stopped, and resumes from there. Because the offset is not written after every single message is processed, this can lead to repeated consumption, but not to message loss.
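
This "duplicates but no loss" (at-least-once) behavior follows from committing the offset only after processing, as in this sketch with the Java client (broker address, group, and topic names are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually, after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Commit only after processing: a crash before this line means
                // re-consumption (duplicates), never loss.
                consumer.commitSync();
            }
        }
    }
}
```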

The only exception is when two consumer programs with different functions are configured with the same group id (for example, via KafkaSpoutConfig.Builder.setGroupId in Storm's Kafka spout). The two "groups" are then treated as one and share the data: group A consumes the messages in partition1 and partition2 while group B consumes those in partition3, so the messages each side sees are incomplete, as if lost. To ensure that each group gets its own complete copy of the message data, group ids must not be duplicated.
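
To make the fix concrete, a minimal sketch (application and group names are invented) of distinct applications each setting their own group.id:

```java
import java.util.Properties;

public class DistinctGroupIds {
    public static void main(String[] args) {
        // Application A: its own group id, so it receives the full stream.
        Properties propsA = new Properties();
        propsA.put("group.id", "order-service"); // hypothetical name

        // Application B: a different group id, so it independently
        // receives the full stream as well.
        Properties propsB = new Properties();
        propsB.put("group.id", "audit-service"); // hypothetical name

        // If both applications reused "order-service", Kafka would treat them
        // as one group and split the partitions between them, so each would
        // see only part of the data -- the incomplete consumption described above.
    }
}
```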

  3. Data for brokers in a Kafka cluster is not lost

Each partition the producer writes to has a number of replicas. The producer writes to the leader, choosing the partition by the distribution policy (use the specified partition if there is one; otherwise hash the key if there is one; otherwise round-robin). The followers (replicas) then synchronize the data from the leader, so the message data is backed up and not lost.
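
The backup depends on the topic's replication factor, which is set at creation time; a sketch using the Java AdminClient (topic name and counts are illustrative):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each with 3 replicas (one leader + two followers).
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```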

Does a Kafka restart cause data loss?

  1. Kafka writes data to disk and generally does not lose data.
  2. However, if a consumer is consuming messages while Kafka restarts, Kafka may fail to commit the offset in time, causing inaccuracies (loss or double consumption); one mitigation is sketched below.
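
One common mitigation is to commit processed offsets the moment partitions are revoked during the restart-triggered rebalance. A minimal sketch with the Java client (topic name is a placeholder; assumes manual commits as in the consumer example above):

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class CommitOnRebalance {
    static void subscribe(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("my-topic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // A broker restart triggers a rebalance; commit what has
                        // been processed so far to limit double consumption.
                        consumer.commitSync();
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // Nothing to do; consumption resumes from committed offsets.
                    }
                });
    }
}
```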

How to ensure high availability of message queues?

A basic architectural picture of Kafka: it consists of multiple brokers, each of which is a node. A topic you create can be divided into multiple partitions; each partition can reside on a different broker, and each holds a portion of the topic's data.

The producer can only write data to the leader. Once data has been written to the leader, the leader synchronizes it to the followers.

Kafka evenly distributes all replicas of a partition across different machines to improve fault tolerance.

This is where the so-called high availability comes from: if a broker goes down, it doesn't matter, because the partitions on that broker have copies on other machines. If the failed broker hosted a partition's leader, a new leader is elected from among the followers. That is high availability.

When data is written, the producer writes to the leader, the leader writes the data to its local disk, and the followers actively pull the data from the leader themselves. Once all the followers have synchronized the data, they send an ACK to the leader, and the leader returns a write-success message to the producer after receiving the ACKs from all the followers. (Of course, this is just one mode; you can tweak this behavior.)
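
The "tweak this behavior" part is the interplay between the producer's acks setting and the topic's min.insync.replicas; a sketch of raising the latter with the AdminClient (topic name and value are illustrative):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RequireTwoReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Combined with acks=all on the producer, a write now succeeds only
            // after at least two in-sync replicas (leader included) have it.
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, Collections.singletonList(op)))
                 .all().get();
        }
    }
}
```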

When consuming, messages are read only from the leader, and a message becomes readable by consumers only after it has been successfully synchronized and acknowledged by all the followers.