To the author's dismay, near the end of the year a memory failure on a physical machine in the production MQ cluster the author maintains caused the operating system to restart abnormally. For roughly 10 minutes, many application sending clients experienced message send timeouts, and the incident was classified as S1.

1. Fault description

The RocketMQ cluster uses a 2-master-2-slave deployment architecture, as shown in the following figure:

An obvious feature of this deployment architecture is that a NameServer process and a Broker process are deployed on the same physical machine.

The memory of one of the machines (192.168.3.100) became faulty, causing the machine to restart. Because of the Linux operating system's hardware self-check, the whole restart took nearly 10 minutes, and client send timeouts lasted just as long, which is unacceptable.

What does RocketMQ's high-availability design look like, and where did it fall short here? The analysis process is described in detail below.

2. Fault analysis

When I learned that a machine failure had caused message timeouts lasting 10 minutes, my first reaction was that this should not happen. A RocketMQ cluster is a distributed deployment architecture that naturally supports fault detection and recovery, and it should take a sending client far less than 10 minutes to detect a Broker failure automatically. So how did this happen?

Let’s start by reviewing RocketMQ’s route registration and discovery mechanism.

2.1 RocketMQ route registration and removal mechanism

The route registration and removal mechanisms are described as follows:

  • Every Broker in the cluster sends a heartbeat packet to every NameServer in the cluster every 30 seconds, registering its topic routing information.

  • When a NameServer receives a heartbeat packet from a Broker, it first updates the routing table and records the time at which the heartbeat packet was received.

  • NameServer runs a scheduled task that scans the Broker liveness table every 10 seconds. If a NameServer has not received a heartbeat packet from a Broker for 120 seconds, it judges the Broker to be offline and removes it from the routing table (a simplified sketch of this scan follows the list).

  • If the long-lived connection between a NameServer and a Broker is disconnected, the NameServer immediately senses that the Broker is offline and removes it from the routing table.

  • A message client (sender or consumer) establishes a connection with only one NameServer at any given time and queries it for routing information every 30 seconds. If the query succeeds, the client updates its local routing information; if the route query fails, the failure is simply ignored.
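To make the 10-second scan and 120-second expiry concrete, here is a minimal sketch of the kind of scan the NameServer performs, loosely modeled on RocketMQ's RouteInfoManager.scanNotActiveBroker. The class and its method bodies are simplified illustrations, not the actual RocketMQ source:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified model of the NameServer's broker liveness scan (illustrative only).
    public class RouteInfoManagerSketch {

        private static final long BROKER_CHANNEL_EXPIRED_TIME = 1000 * 60 * 2; // 120 seconds

        // brokerAddr -> timestamp of the last heartbeat received from that broker
        private final Map<String, Long> brokerLiveTable = new ConcurrentHashMap<>();

        // Called whenever a broker heartbeat (registration request) arrives.
        public void onBrokerHeartbeat(String brokerAddr) {
            brokerLiveTable.put(brokerAddr, System.currentTimeMillis());
            // ...the topic route table would also be updated here...
        }

        // Invoked by a scheduled task every 10 seconds.
        public void scanNotActiveBroker() {
            Iterator<Map.Entry<String, Long>> it = brokerLiveTable.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> entry = it.next();
                if (entry.getValue() + BROKER_CHANNEL_EXPIRED_TIME < System.currentTimeMillis()) {
                    it.remove();                          // drop the broker from the liveness table
                    removeFromRouteTable(entry.getKey()); // and remove its queues from the route table
                }
            }
        }

        private void removeFromRouteTable(String brokerAddr) {
            // placeholder: remove every queue served by this broker from the topic route table
        }
    }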

Given the route registration and removal mechanism above, how long does it take a message sender to perceive the routing change when a Broker server goes down?

The following two cases are discussed respectively:

  • If the TCP connection between the NameServer and the Broker server is disconnected, the NameServer immediately senses the routing change and removes the Broker from the routing table, so the sender senses the change within 30 seconds at most. During those 30 seconds some sends may fail, but there is no major impact on the sender, which is acceptable.

  • If the TCP connection between the NameServer and the Broker server remains open but the Broker can no longer serve requests (for example, the Broker process hangs in a fake-death state), it takes the NameServer 120 seconds to sense that the Broker is down, and it takes the message sender up to 150 seconds to sense the routing change (a quick back-of-the-envelope check follows this list).
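As a quick check of the 150-second figure above (illustrative arithmetic only, not RocketMQ code):

    // Worst case when the TCP connection stays open: the NameServer needs a full
    // heartbeat-expiry window, plus the client waits for its next route refresh.
    public class DetectionLatency {
        public static void main(String[] args) {
            int nameServerExpirySeconds = 120; // broker dropped after 120s without heartbeats
            int clientRoutePollSeconds = 30;   // client refreshes routes every 30s
            System.out.println("Worst-case sender detection: "
                    + (nameServerExpirySeconds + clientRoutePollSeconds) + "s"); // 150s
        }
    }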

But the question arises: why did a Broker restart caused by a memory failure take 10 minutes for business to recover, that is, for clients to truly sense that the Broker was down?

Now that it has happened, we need to analyze it and come up with a solution that prevents the same type of failure in production.

2.2 Troubleshooting Process

Querying the client log (/home/{user}/logs/rocketmqlogs/rocketmq_client.log), we can see when the client first reported a message send timeout. The log output is as follows:

Since the memory fault occurred on the 192.168.3.100 machine, check the logs of the other NameServer in the cluster to see how long it took a NameServer on a healthy machine to sense that broker-A had failed.

The NameServer at 192.168.3.101 took about 2 minutes to detect the Broker's downtime. That is, although the machine was rebooting, the TCP connection was not broken because the operating system was still running its hardware self-check, so the NameServer did not detect the downtime until the 120-second heartbeat expiry elapsed. Once the Broker was removed from the routing table, the client should have sensed the change within 150 seconds. Why didn't it?

Continue by checking the client's routing information and the time at which it changed, as shown in the following figure:

According to the client logs, the client only sensed the change at 14:53:46. Why so late?

It turned out that the client had been reporting timeout exceptions when trying to update its routing information. The screenshot is as follows:

During the period from the failure to the recovery, the client kept trying to update its routing information from the failed NameServer, but the requests kept timing out. As a result, the client could not obtain the latest routing information and could not sense that the Broker was down.

From the log analysis, it is now clear why the client did not sense the routing change within 120 seconds: it kept trying to update its routing information from the downed NameServer, the requests never succeeded, and so the client's cached routing information was never refreshed, producing the phenomenon described above.

The question then arises: from what we know of RocketMQ, when a NameServer is down the client should automatically switch to the next NameServer in its list. Why did the switch not happen here, and only occur at around 14:53?

Let's look at the NameServer switching code, shown in the following snippet:

Here are a few key points from the code above:

  • The client selects a connection from its cache to send an RPC request only if the connection's isActive method returns true, that is, only if the underlying TCP connection is still active.

  • When the client makes an RPC request to a server and a non-timeout exception occurs, the closeChannel method is executed; it closes the connection and removes it from the connection cache table. This is critical: as long as a cached connection exists and is active, the client will not switch to another NameServer.

  • If the RPC request times out, RocketMQ uses the clientCloseSocketIfTimeout parameter to decide whether to close the connection. Unfortunately, this parameter defaults to false, and no entry point is provided to change it. The sketch after this list illustrates this behaviour.
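The following is a minimal sketch of the behaviour the three points above describe, loosely modeled on RocketMQ's NettyRemotingClient. Names and signatures are simplified for illustration and are not the actual RocketMQ source; the key point is that a timeout does not evict the cached connection, so the client never rotates to the next NameServer:

    import java.util.List;
    import java.util.concurrent.TimeoutException;

    // Simplified sketch of the client-side NameServer channel handling (illustrative only).
    public class NameServerSwitchSketch {

        /** Minimal stand-in for a Netty channel. */
        interface Channel {
            boolean isActive();
            void close();
            String request(String topic, long timeoutMillis) throws Exception;
        }

        private final List<String> nameServerAddressList;
        private volatile Channel cachedChannel;                    // connection to the chosen NameServer
        private int index = 0;
        private final boolean clientCloseSocketIfTimeout = false;  // default in the affected version

        public NameServerSwitchSketch(List<String> nameServerAddressList) {
            this.nameServerAddressList = nameServerAddressList;
        }

        // Reuse the cached channel as long as the underlying TCP connection is active.
        private Channel getAndCreateNameServerChannel() {
            Channel channel = cachedChannel;
            if (channel != null && channel.isActive()) {
                // A fake-dead NameServer keeps its TCP connection ESTABLISHED, so this
                // branch keeps being taken and the client never switches NameServers.
                return channel;
            }
            index = (index + 1) % nameServerAddressList.size();    // only now move to the next address
            cachedChannel = connect(nameServerAddressList.get(index));
            return cachedChannel;
        }

        public String fetchRouteInfo(String topic) throws Exception {
            Channel channel = getAndCreateNameServerChannel();
            try {
                return channel.request(topic, 3000);
            } catch (TimeoutException e) {
                // A timeout closes the connection only if clientCloseSocketIfTimeout is true,
                // which defaults to false, so the dead connection stays in the cache.
                if (clientCloseSocketIfTimeout) {
                    closeChannel(channel);
                }
                throw e;
            } catch (Exception e) {
                // Any non-timeout failure closes the connection and removes it from the cache,
                // letting the next call pick another NameServer.
                closeChannel(channel);
                throw e;
            }
        }

        private void closeChannel(Channel channel) {
            channel.close();
            if (cachedChannel == channel) {
                cachedChannel = null;
            }
        }

        private Channel connect(String addr) {
            // placeholder: establish a TCP connection to the NameServer at addr
            throw new UnsupportedOperationException("sketch only");
        }
    }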

So far the analysis of the problem has become very clear.

The Broker could no longer process requests, but the underlying TCP connections were not broken. The client's route-update requests returned only after timing out, yet the client did not close the TCP connection to the failed NameServer, so no NameServer switch was triggered. Only after the host finished rebooting was the TCP connection finally disconnected; the client then detected the routing change and the fault recovered.

Root cause: the routing information could not be updated because the NameServer was in a fake-death state (the process was unresponsive while its TCP connections stayed alive).

3. Best practices

After this failure, I believe NameServer should not be deployed together with the Broker. If NameServer and Broker are deployed on separate machines, the problem above can be effectively avoided. The deployment architecture is shown in the following figure:

Can such a deployment architecture effectively avoid the problem if the Broker again falls into the fake-death scenario described above? The answer is yes.

If the Broker at 192.168.3.100 fakes death, both NameServers at 3.110 and 3.111 can sense that broker-A is down within 2 minutes, and the client can obtain the latest routing information from either NameServer. Messages will then no longer be sent to the downed Broker, and the fault recovers.

If a NameServer fakes death and the client's route-update requests time out, sending still works as long as no Broker is down; but if a NameServer and a Broker fake death at the same time, even the architecture above cannot avoid the problem.

Therefore, the best practices include the following two actions:

  • NameServer and Broker must be deployed separately, isolated from each other.

  • The connection between the client and a NameServer should be closed after a request timeout, triggering a switch to another NameServer; this requires modifying the client source code. A minimal illustration of the idea follows.
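As an illustration of the second action, the change amounts to also closing the cached connection when a route query times out, so the next request rotates to another NameServer. Continuing the simplified NameServerSwitchSketch from section 2.2 (this is a sketch of the idea only; the real change would be made inside the RocketMQ client source, around the clientCloseSocketIfTimeout handling):

    // Added to the NameServerSwitchSketch above: close the connection on ANY failure,
    // including timeouts, so a fake-dead NameServer is evicted from the cache and the
    // next fetch tries the next address in the list (illustrative only).
    public String fetchRouteInfoWithFailover(String topic) throws Exception {
        Channel channel = getAndCreateNameServerChannel();
        try {
            return channel.request(topic, 3000);
        } catch (Exception e) {
            closeChannel(channel); // evict the dead connection, even on a timeout
            throw e;
        }
    }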