Preface

After reading a fair amount of material and many online articles, I felt that few of them explain the Kafka producer partitioning strategy clearly. This article works through the following questions step by step.

  • Why do we need a producer partitioning strategy?
  • What producer partitioning strategies are there?
  • What are the advantages and disadvantages of each strategy?
  • How can I customize a partitioning strategy?

1. How the producer sends messages

Explanation

(1) Create a ProducerRecord object containing the target topic and the content to send; a key and/or a partition may optionally be specified as well;

(2) When sending the ProducerRecord object, the producer serializes the key and value objects into byte arrays so that they can be transmitted over the network;

(3) The data is then handed to the partitioner:

  • If a partition is specified in the ProducerRecord object, the partitioner does nothing and the record is sent directly to that partition;
  • If no partition is specified but a key is present, a partition is chosen from the hash value of the key by default;
  • If neither a partition nor a key is specified, a partition is selected round-robin (this is Kafka's default partitioning strategy).

(4) This record is added to a record batch. All messages in this batch are sent to the same topic and partition, and a separate thread is responsible for sending these record batches to the corresponding broker.

(5) The server returns a response when it receives these messages:

  • If the messages were successfully written to Kafka, a RecordMetadata object is returned, containing the topic and partition information as well as the offset of the record within the partition.
  • If the write fails, an error is returned. On receiving the error the producer retries sending the message, and reports the error back if it still fails after several attempts.
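The batching in step (4) can be sketched in plain Java. This is a simplified illustration, not Kafka's actual RecordAccumulator: a string of topic plus partition stands in for Kafka's TopicPartition class, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of step (4): records that resolve to the same
// topic + partition accumulate in one batch before being sent together.
public class BatchSketch {
    // "topic-partition" string key stands in for Kafka's TopicPartition.
    public static Map<String, List<String>> batches = new HashMap<>();

    public static void append(String topic, int partition, String value) {
        batches.computeIfAbsent(topic + "-" + partition, k -> new ArrayList<>())
               .add(value);
    }

    public static void main(String[] args) {
        append("orders", 0, "m1");
        append("orders", 1, "m2");
        append("orders", 0, "m3");
        // m1 and m3 share a batch because they target the same topic-partition
        System.out.println(batches.get("orders-0").size()); // 2
    }
}
```

A separate sender thread would then drain each batch to the broker that leads the corresponding partition.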

ProducerRecord introduction

When sending messages, the data sent by the producer must be encapsulated in a ProducerRecord object; a ProducerRecord represents one message record to be sent. Its constructors are:

ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value, Iterable<Header> headers)

ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value)

ProducerRecord(String topic, Integer partition, K key, V value, Iterable<Header> headers)

ProducerRecord(String topic, Integer partition, K key, V value)

ProducerRecord(String topic, K key, V value)

ProducerRecord(String topic, V value)

(The class is generic in its key and value types: ProducerRecord<K, V>.)

Parameter descriptions

parameter: description
topic: the topic the record will be sent to
partition: the partition the record will be sent to
key: the message key
value: the message body
timestamp: the record timestamp
  • If a partition is specified, that value is used directly as the partition.
  • If no partition is specified but a key is present, the partition is obtained by taking the hash of the key modulo the topic's partition count.
  • If neither a partition nor a key is present, an integer is generated at random on the first call (and incremented on each subsequent call), and the partition is obtained by taking this value modulo the number of available partitions for the topic; this is the round-robin algorithm.
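The three rules above can be condensed into one small sketch. This illustrates only the decision order, not the actual DefaultPartitioner source, which hashes serialized key bytes with murmur2 (not hashCode) and consults live cluster metadata; all names here are invented for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the default partition-selection rules.
public class PartitionDecision {
    private static final AtomicInteger counter = new AtomicInteger(0);

    public static int choosePartition(Integer partition, Object key, int numPartitions) {
        if (partition != null) {
            return partition;                       // rule 1: explicit partition wins
        }
        if (key != null) {                          // rule 2: hash of key mod partition count
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        // rule 3: no partition and no key -> incrementing counter mod partition count
        return (counter.getAndIncrement() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(choosePartition(2, "k", 3));     // 2
        System.out.println(choosePartition(null, null, 3)); // 0
    }
}
```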

Conclusion

A topic can consist of multiple partitions. The topic is created by the user and must be specified when sending, whereas partitions are generated by the system (topic name plus a sequence number), so it is awkward for the user to pick a specific partition.

So how does the partitioner know which partition to send a message to?

This is where the producer partitioning strategy comes in handy.

The first question raised in the preface should now be answered by the description above of how producers send messages.

Let's move on to the second question: what producer partitioning strategies are there?

2. Partitioning strategies

2.1 Overview

Partitioning policy: The algorithm that determines which partition a producer will send a message to. Kafka provides a default partitioning policy as well as support for custom partitioning policies.

2.2 Default partitioning strategies

Round-robin strategy

Also known as the Round-Robin policy: partitions are assigned in sequence.

Explanation

For example, if there are three partitions under a topic, the first message is sent to partition 0, the second to partition 1, the third to partition 2, and so on. It starts again when the fourth message is produced, assigning it to partition 0.

Conclusion

  • The round-robin strategy is the default partitioning strategy provided by the Kafka Java producer API;
  • It has excellent load-balancing behavior, always spreading messages as evenly as possible across all partitions, which is why it is the most sensible default.
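A minimal, thread-safe sketch of the round-robin idea: an incrementing counter taken modulo the partition count. The class and method names are invented for the example; Kafka's real implementation keeps a counter per topic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin sketch: consecutive messages cycle through the partitions.
public class RoundRobinDemo {
    private static final AtomicInteger next = new AtomicInteger(0);

    public static int partition(int numPartitions) {
        // mask the sign bit so the index stays valid after counter overflow
        return (next.getAndIncrement() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int[] counts = new int[3];
        for (int i = 0; i < 9; i++) counts[partition(3)]++;
        // 9 messages over 3 partitions land exactly 3 per partition
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // 3 3 3
    }
}
```

This even spread is exactly the load-balancing property described above.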

Random strategy

Also known as the Randomness strategy: messages are placed on an arbitrary partition at random.

Implementing the partition method for the random strategy is simple and takes just two lines of code:

List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
return ThreadLocalRandom.current().nextInt(partitions.size());

First compute the total number of partitions for the topic, then randomly return a non-negative integer smaller than it. In essence the random strategy also tries to spread data evenly across partitions, but in practice it does a worse job than round-robin, so if you want an even distribution, prefer the round-robin strategy. The random strategy was in fact the default in older producer versions; newer versions changed the default to round-robin.

Key-ordering strategy

Also known as the by-message-key order-preservation strategy.

Explanation

  1. Kafka allows you to define a message key, called a Key for short, for each message.
  2. A Key can be a string with clear business meaning: a customer code, a department number, a business ID, metadata describing the message, and so on.
  3. Once messages are given Keys, all messages with the same Key are guaranteed to enter the same partition. Since messages within each partition are processed in order, this is called the key-ordering strategy.

The partition method for this strategy is equally simple, again only two lines of code:

List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
// mask the sign bit instead of Math.abs(), which stays negative for Integer.MIN_VALUE
return (key.hashCode() & Integer.MAX_VALUE) % partitions.size();

Conclusion

The default partitioning behavior of the Kafka Java producer:

  • If a Key is specified, the key-ordering strategy is used
  • If no Key is specified, the round-robin strategy is used
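The property behind the key-based default can be demonstrated directly: hashing is deterministic, so the same key always maps to the same partition. This sketch uses hashCode for brevity, whereas Kafka's producer actually hashes the serialized key bytes with murmur2; the class and method names are invented for the example.

```java
// Demonstrates that a deterministic hash keeps each key on one partition.
public class KeyOrderingDemo {
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Every message keyed "customer-42" maps to the same partition,
        // so its messages are consumed in the order they were produced.
        int p1 = partitionFor("customer-42", 6);
        int p2 = partitionFor("customer-42", 6);
        System.out.println(p1 == p2); // true
    }
}
```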

3. Customizing a partitioning strategy

The Kafka Java API provides an interface for custom partitioning strategies: org.apache.kafka.clients.producer.Partitioner

public interface Partitioner extends Configurable, Closeable {

    /**
     * Compute the partition for the given record.
     *
     * @param topic The topic name
     * @param key The key to partition on (or null if no key)
     * @param keyBytes The serialized key to partition on (or null if no key)
     * @param value The value to partition on or null
     * @param valueBytes The serialized value to partition on or null
     * @param cluster The current cluster metadata
     */
    int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster);

    /**
     * This is called when the partitioner is closed.
     */
    void close();

    /**
     * Notifies the partitioner that a new batch is about to be created.
     */
    default void onNewBatch(String topic, Cluster cluster, int prevPartition) {
    }
}

Explanation

  • partition(): Calculates the partition of a given record

Parameters that

parameter instructions
topic Topics to be delivered
key Key values in the message
keyBytes Serialized keys and byte arrays in the partition
value Value value of the message
valueBytes An array of serialized values in a partition
cluster Original data of the current cluster
  • close(): inherited from the Closeable interface; called when the partitioner is closed.
  • onNewBatch(): notifies the partitioner that a new batch is about to be created.

To implement a custom partitioning strategy, implement the Partitioner interface and override its methods.
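As a hedged illustration, here is the kind of logic a custom partition() method might contain, written standalone so it compiles without the kafka-clients dependency. The audit-routing rule and all names are invented for the example; a real implementation would implement the Partitioner interface above and be registered through the producer's partitioner.class configuration property.

```java
// Standalone sketch of a custom partition() method: keys starting with
// "audit-" always go to partition 0; everything else is hashed across
// the remaining partitions. Illustrative only; a real partitioner
// implements org.apache.kafka.clients.producer.Partitioner.
public class AuditPartitionerSketch {
    public static int partition(String topic, Object key, int numPartitions) {
        if (key instanceof String && ((String) key).startsWith("audit-")) {
            return 0; // reserve partition 0 for audit messages
        }
        if (key == null || numPartitions < 2) {
            return 0; // fall back to partition 0 in this sketch
        }
        // spread non-audit keys over partitions 1..numPartitions-1
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }

    public static void main(String[] args) {
        System.out.println(partition("t", "audit-login", 4)); // 0
        System.out.println(partition("t", "order-99", 4));    // 1, 2, or 3
    }
}
```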

4. Summary

  • Round-robin strategy (the default partitioning strategy)

    Advantages: Provides excellent load balancing capability to ensure that messages are evenly distributed across all partitions. Disadvantages: No guarantee of message order.

  • Random strategy

    Advantages: Simple partition-selection logic. Disadvantages: Mediocre load balancing, and message order is not guaranteed.

  • Key-ordering strategy

    Advantages: Messages with the same key can be guaranteed to be sent to the same partition, so that all messages with the same key can be guaranteed to be sequential. Disadvantages: Data skew may occur — depending on the distribution of keys in the data and the hash algorithm used.
