1. Background

In a RocketMQ cluster consisting of four master and four slave servers, three of the servers “unexpectedly” went offline at the same time.



Looking at the monitoring graphs of the three machines in turn, the timestamps of the outages "match" almost perfectly, which is remarkable.

2. Fault analysis

When a fault like this occurs, the first step is to restart all affected servers immediately to recover the cluster as soon as possible and limit the impact on services. Only then do we analyze the logs.

A Java process exiting on its own (the RocketMQ broker is itself a Java process) is most commonly caused by a crash from an out-of-memory error or a memory leak. Our launch parameters include -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/jvmdump, yet no heap dump file appeared under /opt/jvmdump. Next, the GC logs were uploaded to gceasy.io, which presents a graphical report like the following:





Garbage collection is found to be normal.
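For reference, the JVM options involved look like this. This is only a sketch of what goes into the broker's JAVA_OPT (e.g. in runbroker.sh); the dump path matches the article, while the GC-log path here is an assumed example:

```shell
# Sketch of the broker's JVM options (JAVA_OPT as built up in a start script).
JAVA_OPT=""
# Heap dump on OOM, written to /opt/jvmdump (as configured in our launch parameters)
JAVA_OPT="${JAVA_OPT} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/jvmdump"
# GC logging, so the log file can be uploaded to a tool such as gceasy.io
# (the log path below is a hypothetical example, not the actual configuration)
JAVA_OPT="${JAVA_OPT} -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/logs/broker_gc.log"
echo "$JAVA_OPT"
```

If the process had died from an OOM, the dump file would be sitting in /opt/jvmdump; its absence is a first hint that memory was not the problem.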

If a Java process doesn’t exit due to a memory overflow, what could it be? Let’s look at the broker’s log at that point. The key log is captured as follows:



ShutdownHook is printed in the broker log, indicating that the registered exit hook ran before the process exited. The broker stopped gracefully, so it is unlikely to have been killed with kill -9; a shutdown script or a plain kill command must have been executed. I immediately checked with the history command, but no such command had been run at the relevant time. Switching to the root user and checking history again also turned up no clues.

However, I still believed the exit was caused by a manually executed kill command. After searching online, I learned that system command calls can be traced through the system log /var/log/messages. I downloaded the log file to my local machine and searched for the kill keyword, finding the following entries:



The logs show that shortly after 1:00 am on the 25th, a kill command stopped the RocketMQ brokers, which were then restarted with bin/mqbroker -c conf/broker-b.conf &.
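The search itself is a one-liner. The sample log entries below are hypothetical stand-ins for the screenshot; the real line format depends on which audit tooling writes command records to /var/log/messages:

```shell
# Search a (copied) system log for kill invocations around the fault time.
# The two entries below are invented examples for illustration only.
log=messages.sample
cat > "$log" <<'EOF'
Apr 25 01:02:11 mq-node1 audit: comm="kill" arg="20212" uid=0
Apr 25 01:03:40 mq-node1 systemd-logind: Removed session 482.
EOF
matches=$(grep -c 'kill' "$log")   # count lines mentioning kill
grep -n 'kill' "$log"              # show them with line numbers
rm -f "$log"
```

On the real server this would be grep 'kill' /var/log/messages run as root, since the file is not world-readable.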

Crucially, a process started in the background with a bare & remains tied to the login session: when the session ends, the process receives SIGHUP and exits. To verify this, let's check the system log again:



Sure enough, session-removal ("Removed session") entries appear in the log right at the fault time.
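The underlying mechanism is easy to reproduce in a plain shell, with sleep standing in for the broker process: a job started with a bare & dies on the SIGHUP delivered when the login session closes, while a nohup'd job survives it.

```shell
# Simulate the outage: SIGHUP kills a plain background job but not a nohup'd one.
sleep 60 &                                 # started like "bin/mqbroker ... &"
plain_pid=$!
nohup sleep 60 >/dev/null 2>&1 &           # started the recommended way
nohup_pid=$!
sleep 1                                    # let nohup install its SIGHUP ignore
kill -HUP "$plain_pid" "$nohup_pid"        # what closing the session delivers
wait "$plain_pid"
plain_rc=$?                                # 129 = 128 + SIGHUP(1): the job died
if kill -0 "$nohup_pid" 2>/dev/null; then nohup_status=alive; else nohup_status=exited; fi
echo "plain job exit code: $plain_rc, nohup job: $nohup_status"
kill "$nohup_pid" 2>/dev/null              # clean up the survivor
```

nohup sets SIGHUP to ignored before exec'ing the command, and that disposition is inherited, which is exactly why the recommended start command below uses it.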

The cause of the fault is now basically clear: the ops engineer had restarted the brokers without nohup, so the processes exited when the login session closed. The next step was to check how the just-restarted cluster had been started, and to restart the brokers properly.

RocketMQ graceful restart tips:

  1. Turn off write permission on the broker by running the following command:

    bin/mqadmin updateBrokerConfig -b 192.168.x.x:10911 -n 192.168.x.x:9876 -k brokerPermission -v 4
  2. Check the broker's write TPS in rocketmq-console. When the write TPS drops to 0, use kill PID (a plain kill, not kill -9, so the shutdown hook runs) to stop the RocketMQ process. Note: after the broker's write permission is disabled, non-sequential messages are not rejected immediately; producers keep sending to the broker until their client routing information is refreshed, so this step requires some waiting.

  3. Restart RocketMQ:

    nohup bin/mqbroker -c conf/broker-a.conf > /dev/null 2>&1 &

    Note: do not forget nohup this time.

  4. Restore the write permission of the node:

    bin/mqadmin updateBrokerConfig -b 192.168.x.x:10911 -n 192.168.x.x:9876 -k brokerPermission -v 6
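A note on the -v values: brokerPermission is a bitmask (the PermName constants in RocketMQ's common module), where read = 4, write = 2, and inherit = 1. So 4 means read-only (new writes rejected) and 6 means read plus write. A quick sanity check of the arithmetic:

```shell
# RocketMQ brokerPermission bitmask (PermName: PERM_READ=4, PERM_WRITE=2)
PERM_READ=4
PERM_WRITE=2
READ_ONLY=$PERM_READ                     # step 1 above: brokerPermission -v 4
READ_WRITE=$((PERM_READ | PERM_WRITE))   # step 4 above: brokerPermission -v 6
echo "read-only=$READ_ONLY read-write=$READ_WRITE"
```

This is why step 1 quiesces writes while still letting consumers drain the queues, and step 4 brings the broker fully back online.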

This article walked through the analysis of the fault and focused on the troubleshooting process and a graceful shutdown and restart procedure for RocketMQ brokers.

If this article is of any help to you, please give it a thumbs up. Thank you.


Seeing these words is like meeting in person. I am Weige, keen on systematic analysis of mainstream Java middleware. Follow my public account "middleware interest circle"; reply "column" to get the column navigation, or reply "data" to get my learning mind maps.