• 1. Accident scenario

  • 2. Cause

  • 3. Solutions

  • 4. RocketMQ load balancing

  • 4.1. Load balancing of the sender

  • 4.2. Subscriber load balancing

  • 5. Summary

1. Accident scenario

The company’s business system strings its processes together mainly through MQ: because the business chain is long, MQ is used to make the calls asynchronous. But relying entirely on MQ also brings problems of its own, the most important being message accumulation.

When the volume of business data is too large, MQ messages cannot be consumed in time; once messages pile up, normal business flow is affected.

Take this production accident as an example. After the version released on the 19th went live, business traffic surged on the 20th, reaching more than two million in the morning, and the message backlog also stayed above two million. Because of the backlog, the downstream system did not receive the messages for normal orders, and normal business was affected. The project team urgently set about fixing the problem.

2. Cause

The production configuration is as follows: the business system has three instances, and each instance normally consumes about 200 messages per second.

While scrambling to fix the problem, the first step was to find the cause. The root cause was that all of the backlogged messages were piled up on one instance while the other two instances were consuming normally; RocketMQ load balancing cannot distribute messages according to each instance's consumption capacity, so a single slow instance was enough to cause an overall message backlog.

3. Solutions

For the backlog, our solution was to add instances to raise the overall consumption capacity. We first shut down the instance holding the backlog and then added three new instances, giving five instances in total. After consulting Alibaba Cloud's RocketMQ technical experts, we learned that RocketMQ load balancing distributes messages to subscribers across eight queues per topic, so we added five more instances, for a total of ten.

The causes we found were as follows:

  1. Using the Trace plug-in, we checked the time spent on each step of the production call chain and found that one step took about 2s, which made consumption on that topic slow.

  2. That was only part of the problem, because the backlogged messages involved more than one Group. By querying the message trace with the production message IDs, we found consumption failures followed by message retries (our retry count was 3); messages that kept failing were never consumed successfully and kept being retried, which contributed to the backlog. Based on this, we reduced the retry count to 1, so a failed message is retried only once (a consumer sketch after this list shows this change). This greatly sped up consumption.

  3. After reviewing the code, we found that exceptions were being thrown throughout the logic, including for failures that would fail again on every retry. As a result, those messages could never be finished.

  4. The trace also showed exceptions from the Hystrix thread pool: the large number of Feign calls made during message processing overflowed the pool. Therefore, the maximum number of Hystrix threads was increased (a configuration sketch also follows the list).

  5. After the consumption capacity was increased, we continued to monitor MySQL and found that some statements were running for too long. Searching the SQL showed that a batch-update statement was the problem. After analyzing it with the DBA, we found that production uses PolarDB, and because the batch update spans sharded databases and tables, the whole statement held locks for too long, which made other SQL wait and overflowed the connection pool. So we adjusted the SQL statement and increased the connection pool size.
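
To make points 2 and 3 concrete, here is a minimal sketch of the consumer-side changes, written against the open-source Apache RocketMQ Java client (the Alibaba Cloud client differs slightly in API, but the idea is the same). The group name, topic, name server address, and the handleOrder call are placeholders, and using IllegalArgumentException as the "fails on every retry" case is only an assumption for illustration.

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;

public class OrderConsumer {

    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("order_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876"); // placeholder name server address
        consumer.subscribe("ORDER_TOPIC", "*");    // placeholder topic

        // Point 2: retry a failed message at most once (we previously allowed 3 retries).
        consumer.setMaxReconsumeTimes(1);

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            try {
                // handleOrder(msgs) stands in for the real business logic here.
                return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
            } catch (IllegalArgumentException deterministicFailure) {
                // Point 3: a failure that would fail again on every retry (e.g. bad data)
                // is acknowledged so it is not retried forever; real code would log it
                // or park it for manual handling.
                return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
            } catch (Exception transientFailure) {
                // Transient failures (timeouts, etc.) are sent back for retry,
                // but with maxReconsumeTimes = 1 only one retry will happen.
                return ConsumeConcurrentlyStatus.RECONSUME_LATER;
            }
        });

        consumer.start();
    }
}
```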
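
For point 4, the Hystrix thread pool can be enlarged either through configuration properties (in a Feign setup, keys such as hystrix.threadpool.default.coreSize and hystrix.threadpool.default.maximumSize) or programmatically. The sketch below uses the plain Hystrix Java API; the group key and the pool sizes are placeholder assumptions that have to be tuned against the real call volume.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class DownstreamCallCommand extends HystrixCommand<String> {

    public DownstreamCallCommand() {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("DownstreamService"))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        // Placeholder sizes: raise the core and maximum thread counts so
                        // bursts of Feign calls during backlog draining are not rejected.
                        .withCoreSize(20)
                        .withMaximumSize(50)
                        .withAllowMaximumSizeToDivergeFromCoreSize(true)));
    }

    @Override
    protected String run() throws Exception {
        // The real command would call the downstream service through Feign here.
        return "ok";
    }
}
```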

This allowed all backlogged messages to be quickly reallocated to other instances for consumption. At the same time, the MQ traffic that affects normal business was split out onto its own topic and handled separately, so it is no longer affected by the overall message volume.

4. RocketMQ load balancing

RocketMQ’s load balancing is not load balancing as we understand it at the service level. First, every message is bound to a topic, and each topic is (logically) divided into eight queues; RocketMQ then handles load balancing separately on the sender side and the subscriber side.

4.1. Load balancing of the sender

Although messages produced by a producer are bound to a topic, the topic is divided into eight queues, and messages are sent to those queues in a polling (round-robin) manner: the first message goes to Queue 0, the second to Queue 1, and so on. In other words, the RocketMQ producer does not bind a message to a random queue of the topic; it polls the messages across the topic's queues (the producer sketch at the end of this subsection makes this visible).

(Image from RocketMQ)

The Broker allocates the queues evenly among the instances in the subscriber cluster for message consumption.
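
The polling behaviour described above can be observed with a small producer sketch against the open-source Apache RocketMQ Java client; the name server address, topic, and group name below are placeholders. The printed queue IDs rotate through the topic's queues one after another.

```java
import java.nio.charset.StandardCharsets;

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class RoundRobinProducer {

    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("demo_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // placeholder name server address
        producer.start();

        // With the default queue selector, each send goes to the next queue
        // of the topic in turn, so the printed queue IDs rotate round-robin.
        for (int i = 0; i < 8; i++) {
            Message msg = new Message("DEMO_TOPIC", "TagA",
                    ("order-" + i).getBytes(StandardCharsets.UTF_8));
            SendResult result = producer.send(msg);
            System.out.printf("message %d -> queueId %d%n", i, result.getMessageQueue().getQueueId());
        }

        producer.shutdown();
    }
}
```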

4.2. Subscriber load balancing

To be clear, the RocketMQ subscriber’s load balancing does not take the consumption capacity of each consumer instance into account: even if a consumer instance consumes slowly, the messages in its queues are not redistributed to other consumer instances.

The subscriber side allocates the queues in the following ways:

If the number of subscriber instances is greater than the number of queues, the extra instances are assigned no queues and therefore process no messages, as shown in the figure below:

If the number of subscriber instances equals the number of queues, each instance processes the messages of exactly one queue, as shown below:

If the number of subscriber instances is less than the number of queues, each instance processes the messages of multiple queues, as shown in the figure below:

If one instance consumes slowly, the messages in the queues allocated to it cannot be processed in time, and those queues build up a backlog.
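
The averaging allocation described above can be reproduced offline with the client's default strategy, AllocateMessageQueueAveragely. The sketch below assumes a topic with eight queues on a single broker and three consumer instances (all names are placeholders); it prints which queues each instance would be assigned, which also shows that consumption capacity plays no part in the split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.rocketmq.client.consumer.rebalance.AllocateMessageQueueAveragely;
import org.apache.rocketmq.common.message.MessageQueue;

public class AllocationDemo {

    public static void main(String[] args) {
        // A topic with 8 queues on a single broker (placeholder names).
        List<MessageQueue> allQueues = new ArrayList<>();
        for (int queueId = 0; queueId < 8; queueId++) {
            allQueues.add(new MessageQueue("DEMO_TOPIC", "broker-a", queueId));
        }

        // Three consumer instances in the same group.
        List<String> consumerIds = Arrays.asList("instance-1", "instance-2", "instance-3");

        AllocateMessageQueueAveragely strategy = new AllocateMessageQueueAveragely();
        for (String cid : consumerIds) {
            List<MessageQueue> assigned =
                    strategy.allocate("demo_consumer_group", cid, allQueues, consumerIds);
            System.out.println(cid + " -> " + assigned.size() + " queues: " + assigned);
        }
        // The 8 queues are split 3/3/2 across the three instances, regardless of
        // how fast each instance actually consumes.
    }
}
```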

5. Summary

  1. RocketMQ load balancing is not aware of consumers' consumption capacity, so pay close attention to the consumption capacity of each instance.

  2. Be aware of the logic in your code, and do not throw retry-triggering exceptions for failures that will still fail on retry.

  3. Pay attention to the consumption capacity of each link in the chain; a single slow link can have a ripple effect that leads to message backlogs.

  4. Monitor MQ consumption and set up alerts for backlog problems.

  5. Keep observing afterwards: increased consumption capacity puts more pressure on the rest of the system, so make sure system stability is not compromised once consumption capacity goes up.

References:

1. RocketMQ official documentation (help.aliyun.com/document_d…).