preface

Data delivery reliability means that once a “delivery success” reply has been sent to the producer, the message will not be lost

Kafka is used in a wide variety of applications, some with high reliability requirements, such as banking, and some with lower ones, such as logging. Kafka therefore offers a range of configurations to meet varying degrees of reliability requirements

Reliability is not the property of a single component but a system-level concern, so ensuring the reliability of Kafka requires cluster operators and application developers to work together to build a reliable system

Of course, in principle, the reliability of basic services should be guaranteed by the infrastructure team, while the business team focuses more on the business itself

How to ensure the reliability of Kafka

Broker

Multiple replicas per partition per topic are at the heart of Kafka's data reliability. If a partition has only one replica, a crash of the broker holding it causes data loss, whereas with multiple replicas the data survives as long as one replica remains available

Replication factor

The topic-level replication.factor parameter sets the number of replicas per partition (at the broker level, default.replication.factor controls the default for automatically created topics). A typical value is 3.

This is generally sufficient, but you can increase it. A higher replication factor means higher reliability and availability, but also higher storage cost; it is a trade-off between reliability and hardware cost

If you don’t need much reliability, you can set this value to 1, so that each partition has only one replica. This affects not only reliability but also availability: when the broker holding that single replica goes down, the partition becomes unavailable for a period of time because no other replica can take over
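For illustration, the replication factor is set per topic when the topic is created. Here is a minimal sketch using the Java AdminClient, assuming a broker at localhost:9092 and a hypothetical topic named orders:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: each partition gets 3 replicas,
            // placed on different brokers (and racks, if broker.rack is set)
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```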

How the replicas are distributed is also worth considering. By default, Kafka assigns the replicas of a partition to as many different brokers as possible, to ensure reliability at the instance level

Brokers can be placed on different racks at deployment time, with the rack name configured through the broker.rack parameter. Kafka then tries to put the replicas of a partition on different racks as far as possible, further improving data reliability

A good partition allocation therefore spreads the replicas of each partition across different brokers and, where possible, across different racks

In fact, reliability can be pushed further with disaster recovery at the data-center, city, or national level, but those are general disaster-recovery techniques and are out of scope for this article

Minimum in-sync replicas

Even if a partition has three replicas, there may be only one in-sync replica

In-sync replica: a replica is in sync if it has recently fetched messages from the leader and is caught up to the latest message. An out-of-sync replica may have crashed, lost its connection to the leader, or still be fetching but lag far behind

The min.insync.replicas parameter sets the minimum number of in-sync replicas. For a partition with three replicas, for example, setting the parameter to 2 means at least two replicas must be in sync for writes to the partition to succeed; otherwise the producer receives a NotEnoughReplicasException

So what does this parameter do? Assume the minimum number of in-sync replicas is 1, that is, a single in-sync replica is enough.

A message is considered committed only once it has been written to all in-sync replicas, of which there is only one here. If the broker holding that single replica fails before the data has been copied to another replica (the others are out of sync and may not have caught up with the latest data from the leader), the data is lost

In other words, data can be lost even after the broker has replied “commit successful” to the producer, which defeats delivery reliability

Setting it to 2 or 3 improves data reliability, at the cost of longer response times
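min.insync.replicas can be set broker-wide in the broker configuration or per topic. A minimal sketch that raises it to 2 for the hypothetical orders topic via the Java AdminClient:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RequireTwoInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Writes to "orders" now need at least 2 in-sync replicas;
            // otherwise acks=all producers get NotEnoughReplicasException
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp setMinIsr = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setMinIsr))).all().get();
        }
    }
}
```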

producers

We have configured the broker side for reliability, but the data can still become unreliable if producers and consumers are not configured and written with reliability in mind

On the producer side, what we need to do is:

  • Configure send acknowledgements correctly
  • Handle send errors correctly

Send acknowledgements

There are three configuration options for send acknowledgements (acks):

  • acks=0

    • As long as the producer manages to send the message over the network, the send is considered successful. Failures that occur before the message leaves the producer (serialization errors, partitioner errors, a failed network interface) are still reported to the producer. But if anything goes wrong on the broker side, such as the partition being offline, the producer will not notice and the data will be lost
    • This mode is generally used for performance testing and is not recommended in a production environment
  • acks=1

    • Success or failure is returned only once the message has been written to the partition’s leader replica. This mode is more reliable than acks=0, but data loss is still possible: the leader may crash before the data has been synchronized to the other replicas
  • acks=all

    • Success or failure is returned once the message has been written to all in-sync replicas of the partition. This works together with min.insync.replicas on the broker side: the broker responds “write success” only after the message has been written to at least min.insync.replicas in-sync replicas
    • This mode maximizes reliability but reduces throughput; choose it based on the reliability requirements of the business scenario (see the configuration sketch after this list)
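As a sketch of the reliable setting, here is a minimal producer configuration in Java with acks=all; the broker address and serializers are assumptions for illustration:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: a send succeeds only once the message has reached all
        // in-sync replicas (at least min.insync.replicas of them)
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        return new KafkaProducer<>(props);
    }
}
```

The same property takes "0" or "1" for the other two modes.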

Handling errors

Errors on the producer side fall into two categories: retriable and non-retriable

  • Retriable:

    • For example, a LEADER_NOT_AVAILABLE error while a new partition leader is being elected, or a connection failure caused by network jitter, can be resolved by retrying
    • Note that the total retry time ((number of retries − 1) × the interval between retries) must be longer than the election process; otherwise the producer gives up retrying prematurely
  • Non-retriable:

    • Some errors cannot be resolved by retrying; they will fail no matter how many times they are attempted, such as a message-too-large error

The producer client handles retriable errors automatically. Application code must handle non-retriable errors and errors that have exhausted the maximum number of retries. These are usually logged or saved to a database, and then, depending on the business, either ignored or handled manually
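To make that division of labor concrete, here is a hedged sketch of sending with a callback; the topic and message are hypothetical, and the check against RetriableException distinguishes the two error categories:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;

public class SendWithErrorHandling {
    // 'producer' is configured as in the previous sketch
    static void send(KafkaProducer<String, String> producer) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-1", "payload"); // hypothetical message
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                return; // the write was committed
            }
            if (exception instanceof RetriableException) {
                // The client already retried automatically (see the 'retries' and
                // 'retry.backoff.ms' configs); reaching here means retries ran out
                System.err.println("retries exhausted: " + exception);
            } else {
                // e.g. RecordTooLargeException: no amount of retrying will help,
                // so log it or save the message for manual handling
                System.err.println("non-retriable error: " + exception);
            }
        });
    }
}
```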

Note that retries may cause message duplication. For example, the producer may not receive the broker’s acknowledgement because of a network problem even though the message was written successfully; not knowing this, it assumes the network or the broker is temporarily faulty and sends the message again. The consumer side therefore needs to make message processing idempotent

consumers

As a consumer, you need to process messages without missing any, and then try not to process any message twice

So consumers need to keep track of which messages have been processed and which have not

The consumer takes a batch of data from the broker, processes it, and commits offsets

Why commit offsets at all? If one consumer quits (perhaps through an outage or restart) and another consumer takes over the partition (a “rebalance”), the new consumer needs to know how far the previous one got. The “other consumer” could also be the restarted consumer itself

Therefore, the timing of the offset commit is critical. If the consumer commits offsets before the messages are processed, data can be lost; if it commits conservatively, after processing, messages can be processed repeatedly

Repeated consumption: the committed offset lags behind the last message actually processed, so after a consumer switch the messages in between are processed again.

Message loss: the committed offset is ahead of the last message actually processed, so after a consumer switch the messages in between are skipped.

Commit offset

Offsets can be committed automatically or manually:

  • Automatically commit offsets

Automatic offset commits can be enabled by setting enable.auto.commit=true

On each poll, if the commit interval (auto.commit.interval.ms) has elapsed since the last commit, the consumer automatically commits the largest offsets returned by the previous poll
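As a minimal sketch (broker address, group ID, and topic name are all hypothetical), an auto-committing consumer is configured like this:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AutoCommitConsumer {
    static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit the offsets of the previous poll automatically,
        // at most once every 5 seconds
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders"));
        return consumer;
    }
}
```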

The advantages and disadvantages of automatic commits are:

  • Advantages:

    • Convenient: developers do not have to manage offset commits themselves
    • Automatic commits guarantee that no messages are lost, as long as the program guarantees that all messages are processed before the next poll
  • Disadvantages:

    • If the commit interval is too long, the offsets actually processed can run far ahead of the committed offsets, and a consumer switch then causes the messages in between to be processed repeatedly
    • Messages must not be processed asynchronously (that breaks the condition that all messages are processed before the next poll), otherwise data can be lost
  • Manually commit offsets

If you want more control over commit timing, you need to commit offsets manually. Committing after each batch of messages has been processed reduces the likelihood of processing duplicates, but cannot eliminate it (if the consumer goes down right after processing a batch but before committing, the batch will be processed again), and it costs some throughput. A minimal loop looks like the sketch below
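A minimal sketch, assuming the consumer was created as above but with enable.auto.commit=false, and with process() standing in for the business logic:

```java
import java.time.Duration;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitLoop {
    static void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // hypothetical business logic
            }
            // Commit only after the whole batch is processed: a crash before this
            // line replays the batch (duplicates), but nothing is lost
            consumer.commitSync();
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // ... handle one message ...
    }
}
```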

Processing exactly once

Both producer-side retries and consumer-side offset commits can cause a message to be processed more than once. If the business tolerates duplicate processing, that is fine; otherwise, a unique ID can be attached to each message and an idempotency check performed on that ID
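A minimal sketch of such a check, assuming every message carries a unique messageId; the in-memory set stands in for what would normally be a durable store:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    // In production this would be durable storage (e.g. a database table keyed
    // by message ID); an in-memory set is enough to illustrate the check
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    void handle(String messageId, String payload) {
        // add() returns false if the ID is already present, i.e. a duplicate
        if (!processedIds.add(messageId)) {
            return; // already processed: skip instead of processing twice
        }
        // ... actual business processing goes here ...
    }
}
```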

conclusion

This article introduces how to improve the reliability of Kafka from the perspectives of brokers, producers, and consumers.

  1. A proper replication factor must be configured on the broker; the higher the minimum number of in-sync replicas, the more reliable the data
  2. The producer needs correctly configured send acknowledgements and proper error handling
  3. The consumer must ensure that a batch of messages has been fully processed before committing its offsets