How do you properly evaluate performance bottlenecks in Kafka message sending, and how do you tune them?

1. Monitoring metrics of the Kafka message sender

Kafka provides a rich set of monitoring metrics and exposes them through JMX. The metrics provided by the client are shown in the following figure:

The main metrics are classified as follows:

  • producer-metrics: Metrics of the sending side. Its child nodes are all the producers in the process.

  • producer-node-metrics: Metrics of each producer, broken down by Broker node.

  • producer-topic-metrics: Metrics of the sending side, broken down by topic.

There are many more metrics related to the Kafka producer, and this article does not list them all.
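As a minimal sketch of reading these metrics through JMX from inside the producer's own JVM, the code below queries the platform MBean server for MBeans in the kafka.producer domain with type producer-metrics. The class name ProducerMetricsReader and the choice of attributes are illustrative assumptions, and the sketch assumes a KafkaProducer has already been created in the same process; if none exists, the result is simply empty.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.AttributeNotFoundException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, not part of the Kafka client.
final class ProducerMetricsReader {

    // One MBean per producer client in this JVM.
    static final String PRODUCER_PATTERN = "kafka.producer:type=producer-metrics,client-id=*";

    /** Reads the given attributes from every MBean matching the pattern, keyed by MBean name. */
    static Map<String, Map<String, Object>> read(String pattern, String... attributes) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Map<String, Object>> result = new LinkedHashMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                Map<String, Object> values = new LinkedHashMap<>();
                for (String attribute : attributes) {
                    try {
                        values.put(attribute, server.getAttribute(name, attribute));
                    } catch (AttributeNotFoundException e) {
                        // This client version does not expose the attribute; skip it.
                    }
                }
                result.put(name.toString(), values);
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX lookup failed for " + pattern, e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints nothing unless a KafkaProducer is running in this JVM.
        read(PRODUCER_PATTERN, "batch-size-avg", "batch-size-max", "request-latency-avg")
            .forEach((mbean, values) -> System.out.println(mbean + " -> " + values));
    }
}
```

The same method works for the per-node and per-topic groups by changing the type in the pattern, for example to producer-topic-metrics.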

1.1 producer-metrics

producer-metrics is a very important monitoring item on the sending side, as shown in the figure below:

The key items are as follows:

  • batch-size-avg: The average size of a ProducerBatch when the Sender thread actually sends messages.

  • batch-size-max: The maximum size of a ProducerBatch when the Sender thread sends messages.

Practical guidance: I think it is necessary to collect these two metrics. If their values are far below the configured batch.size and throughput falls short of expectations, you can increase linger.ms appropriately.
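The comparison between batch-size-avg and batch.size can be sketched as a small check. This is only an illustration: the class name BatchFillCheck, the 50% threshold, the sample reading of 4096 bytes, and the 5 ms step are all assumptions chosen for the example, not recommended values.

```java
import java.util.Properties;

// Hypothetical helper; the threshold and step size are assumptions.
final class BatchFillCheck {

    /** Ratio of the observed batch-size-avg to the configured batch.size. */
    static double fillRatio(double batchSizeAvg, int batchSize) {
        return batchSizeAvg / batchSize;
    }

    /** True when batches are, on average, filled to less than the given threshold. */
    static boolean underFilled(double batchSizeAvg, int batchSize, double threshold) {
        return fillRatio(batchSizeAvg, batchSize) < threshold;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("batch.size", "16384"); // 16 KB
        props.setProperty("linger.ms", "5");      // sample starting value

        int batchSize = Integer.parseInt(props.getProperty("batch.size"));
        double observedAvg = 4096; // sample reading of batch-size-avg

        if (underFilled(observedAvg, batchSize, 0.5)) {
            // Give the accumulator more time to fill each batch.
            long lingerMs = Long.parseLong(props.getProperty("linger.ms"));
            props.setProperty("linger.ms", Long.toString(lingerMs + 5));
        }
        System.out.println("linger.ms = " + props.getProperty("linger.ms"));
    }
}
```

A larger linger.ms trades a little latency for fuller batches, so it is worth re-checking both metrics after each change.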

  • batch-split-rate: Kafka provides a mechanism for splitting a large ProducerBatch into smaller ones. If a ProducerBatch from the client exceeds the maximum message size the server allows, the client splits the batch and sends it again. This metric records the number of splits per second.

  • batch-split-total: The total number of batch splits.

Tip: From my reading of this part of the source code, splitting a ProducerBatch is of limited use, because each newly allocated ProducerBatch has a capacity equal to batch.size, and a batch that is no larger than that size is not split.

Practical guidance: If this value is not 0, the message size limit configured on the server is inappropriate. The batch.size set on the client should be smaller than the max.message.bytes set on the server.

  • buffer-available-bytes: The number of bytes currently available in the sender's buffer.

  • buffer-total-bytes: The total size of the sender's buffer. The default value is 32 MB (33,554,432 bytes).

If the number of bytes remaining in the buffer stays low, evaluate whether the buffer size is appropriate. If the Sender thread has hit a bottleneck, then consider whether the network or the Broker has hit one as well.
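To make "stays low" concrete, the following sketch checks whether every recent sample of buffer-available-bytes leaves less than a given share of buffer-total-bytes free. The class name BufferUsage, the 10% cut-off, and the sample readings are assumptions made for illustration.

```java
// Hypothetical helper; the 10% cut-off used in main is an assumption.
final class BufferUsage {

    /** Fraction of the buffer in use, from buffer-available-bytes and buffer-total-bytes. */
    static double usedFraction(double availableBytes, double totalBytes) {
        return 1.0 - availableBytes / totalBytes;
    }

    /** True if every sample leaves less than minFreeFraction of the buffer free. */
    static boolean persistentlyLow(double[] availableSamples, double totalBytes,
                                   double minFreeFraction) {
        if (availableSamples.length == 0) {
            return false; // no data, no conclusion
        }
        for (double available : availableSamples) {
            if (available / totalBytes >= minFreeFraction) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double total = 33_554_432; // buffer-total-bytes with the 32 MB default
        double[] samples = {1_048_576, 2_097_152, 524_288}; // sample readings
        System.out.println("used now: " + usedFraction(samples[2], total));
        System.out.println("persistently low: " + persistentlyLow(samples, total, 0.10));
    }
}
```

Looking at several consecutive samples rather than one avoids reacting to a single short burst.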

  • bufferpool-wait-ratio: The fraction of time the client spends waiting for memory to be allocated from the buffer.

  • bufferpool-wait-time-total: The total time the client blocks while waiting for memory from the buffer in order to create a ProducerBatch.

Practical guidance: If the ratio stays above 0 (or the total keeps growing), sending has hit a bottleneck. You can lower linger.ms so that messages are sent in a more timely manner.

  • produce-throttle-time-avg: The average time that produce requests are throttled by the Broker.

  • produce-throttle-time-max: The maximum time that a produce request is throttled by the Broker.

  • io-ratio: The fraction of time the I/O thread spends processing I/O reads and writes.

  • io-time-ns-avg: The average time (in nanoseconds) of I/O per call of the event selector.

  • io-waittime-total: The total time (in nanoseconds) the I/O thread spends waiting for a socket to become ready for reads or writes.

  • iotime-total: The total time spent on I/O processing.

  • network-io-rate: The number of network operations (reads or writes) per second across all connections of the client.

  • network-io-total: The total number of network operations (reads or writes) across all connections of the client.

1.2 General metrics

In addition to the metrics above, Kafka also has some general monitoring metrics. They are collected in three dimensions: message sender, node, and topic.

The dimensions are as follows:

  • producer-metrics: The dimension of the sending side.

  • producer-node-metrics: The dimension of the sender and Broker node.

  • producer-topic-metrics: The dimension of the sender and topic.

The metrics described below are collected in different dimensions, but their meanings are the same, so they are explained together.

  • incoming-byte-rate: The number of bytes received per second.

  • incoming-byte-total: The total number of bytes received.

  • outgoing-byte-total: The total number of bytes sent.

  • request-latency-avg: The average latency of send requests.

  • request-latency-max: The maximum latency of send requests.

Practical guidance: If the latency is too high, the Sender thread has hit a bottleneck in sending messages. It is recommended to compare this value with linger.ms. If it is significantly smaller than linger.ms, adjust batch.size appropriately to improve throughput.

  • request-rate: The number of requests sent per second.

  • request-size-avg: The average size of the requests sent.

  • request-size-max: The maximum size of a single request sent by the Sender thread.

If this value is smaller than max.request.size, the message backlog on the client is not large. If other dimensions show a bottleneck, consider adjusting linger.ms.

  • request-total: The total number of requests sent.

  • response-rate: The number of responses received from the server per second.

  • response-total: The total number of responses received from the server.

2. Collection of monitoring metrics

Kafka has many built-in monitoring metrics, but by default they are stored only in memory. To keep monitoring data from growing without bound and exhausting memory, such data is usually stored in a sliding window: only the data from a recent period is kept, and older data is overwritten as the window rolls forward.
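The sliding-window idea can be sketched in a few lines. This is a simplified illustration and not Kafka's actual implementation; the class name SlidingWindow and the 30-second window are assumptions. Samples older than the window are dropped whenever a value is recorded or read, so memory use stays bounded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified illustration of a time-based sliding window; not Kafka's own code.
final class SlidingWindow {

    private static final class Sample {
        final long timeMs;
        final double value;

        Sample(long timeMs, double value) {
            this.timeMs = timeMs;
            this.value = value;
        }
    }

    private final long windowMs;
    private final Deque<Sample> samples = new ArrayDeque<>();

    SlidingWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Adds a sample and drops everything that has fallen out of the window. */
    void record(long nowMs, double value) {
        evict(nowMs);
        samples.addLast(new Sample(nowMs, value));
    }

    /** Average of the samples still inside the window, or NaN if there are none. */
    double average(long nowMs) {
        evict(nowMs);
        double sum = 0;
        for (Sample s : samples) {
            sum += s.value;
        }
        return samples.isEmpty() ? Double.NaN : sum / samples.size();
    }

    int size() {
        return samples.size();
    }

    private void evict(long nowMs) {
        while (!samples.isEmpty() && nowMs - samples.peekFirst().timeMs >= windowMs) {
            samples.removeFirst();
        }
    }

    public static void main(String[] args) {
        SlidingWindow window = new SlidingWindow(30_000);
        window.record(0, 10);
        window.record(1_000, 20);
        System.out.println(window.average(1_000));  // both samples are in the window
        window.record(31_000, 30);
        System.out.println(window.average(31_000)); // the two older samples were dropped
    }
}
```

Because old samples disappear like this, anything needed for long-term trends has to be copied elsewhere before it rolls out of the window.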

Therefore, to display these metrics more intuitively, the information needs to be collected periodically and saved to a database or other persistent storage, so that curves can be drawn from historical data. The desired effect is shown in the following figure:

The basic architecture of the monitoring and collection system is shown in the figure below:

The mq-collect class library is embedded in the producer SDK and asynchronously, periodically uploads the collected information to the time-series database InfluxDB. The mq-portal page then displays the metrics of each producer client, which visualizes the monitoring data and provides a basis for performance optimization.
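The article does not show the internals of mq-collect, so the following is only a rough sketch of what such a collector could look like: a background thread that periodically formats the current metric values as one InfluxDB line-protocol record and hands it to a sink. The class name MetricsUploader, the measurement name kafka_producer, and the tag name client_id are all hypothetical, and the sink here just prints the line instead of calling InfluxDB.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical collector sketch; not the real mq-collect library.
final class MetricsUploader {

    /**
     * Formats one point in InfluxDB line protocol. NaN and infinite values are skipped
     * because they cannot be written as fields. Returns "" if no field is left.
     * Assumes names contain no spaces, commas, or equals signs (no escaping is done).
     */
    static String toLineProtocol(String measurement, String clientId,
                                 Map<String, Double> fields, long timestampNanos) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, Double> e : fields.entrySet()) {
            double v = e.getValue();
            if (!Double.isNaN(v) && !Double.isInfinite(v)) {
                parts.add(e.getKey() + "=" + v);
            }
        }
        if (parts.isEmpty()) {
            return "";
        }
        return measurement + ",client_id=" + clientId + " "
            + String.join(",", parts) + " " + timestampNanos;
    }

    /** Runs the collector on a daemon thread and passes each non-empty line to the sink. */
    static ScheduledExecutorService start(Supplier<String> collector, Consumer<String> sink,
                                          long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "mq-collect");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                String line = collector.get();
                if (!line.isEmpty()) {
                    sink.accept(line);
                }
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so swallow it here.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Double> fields = new LinkedHashMap<>();
        fields.put("batch-size-avg", 1024.5);          // sample value
        fields.put("request-latency-avg", Double.NaN); // no data yet, will be skipped
        ScheduledExecutorService scheduler = start(
            () -> toLineProtocol("kafka_producer", "producer-1", fields,
                                 System.currentTimeMillis() * 1_000_000L),
            System.out::println, // stand-in for the upload to InfluxDB
            1);
        Thread.sleep(2_500);
        scheduler.shutdownNow();
    }
}
```

Running the upload on its own thread keeps metric collection off the send path, so a slow or unavailable database does not slow down message sending.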

