This series is a set of summary notes for RabbitMQ in Action: Deploying Distributed Message Queues Efficiently.

The last article covered best practices for various scenarios, most of which work in a “fire and forget” mode and do not require a response. When a response is required, RabbitMQ’s RPC model can be used.

RabbitMQ decouples systems asynchronously: the caller sends a service request to the Rabbit server, which forwards it to the handler. Rabbit has to ensure that the request is processed properly even in exceptional circumstances such as a network failure, a Rabbit server crash, or a power outage in an entire data center, and it provides various mechanisms to keep the service available.

This article examines the availability guarantees Rabbit provides, starting from concrete failure scenarios and then looking at how each mechanism works. You will learn:

  • A summary of failure scenarios
  • Cluster and process failure
  • Connection loss and failover
  • Active/standby mode
  • Cross-data-center replication

Failure scenarios

In practice, a large portion of development time is spent handling exceptional cases: validating user input, dealing with the various exception classes provided by the JDK, handling network errors, and so on. These are relatively easy to deal with.

The Rabbit service is the bridge between callers and handlers. If it becomes unavailable because of a network failure, a single server crash, or a data center outage, every system that depends on it is affected.

Network failures

The handler interacts with the server over a long-lived connection so that messages can be pushed in real time. A network failure may break this connection; if the client does not detect the break, the handler stops receiving messages. This is called connection loss.

This can be solved by catching the connection exception and reconnecting; the Rabbit client libraries wrap this up so it is easy to handle.
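
As a rough sketch of that pattern with the RabbitMQ Java client (the host name, queue name, and retry interval below are invented for illustration, and newer Java clients can also recover connections automatically via factory.setAutomaticRecoveryEnabled(true)): the consumer catches the failure, backs off, and re-subscribes in a loop.

```java
import com.rabbitmq.client.*;
import java.util.concurrent.CountDownLatch;

public class ReconnectingConsumer {
    public static void main(String[] args) throws InterruptedException {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-host");   // hypothetical host; in practice often a load balancer address

        while (true) {
            try {
                Connection conn = factory.newConnection();
                Channel channel = conn.createChannel();
                channel.queueDeclare("work-queue", true, false, false, null);

                CountDownLatch lost = new CountDownLatch(1);
                conn.addShutdownListener(cause -> lost.countDown());   // fires on any connection loss

                channel.basicConsume("work-queue", true, new DefaultConsumer(channel) {
                    @Override
                    public void handleDelivery(String tag, Envelope env,
                                               AMQP.BasicProperties props, byte[] body) {
                        System.out.println("got: " + new String(body));
                    }
                });

                lost.await();                 // block until the connection drops
            } catch (Exception e) {
                System.err.println("connection failed: " + e.getMessage());
            }
            Thread.sleep(5000);               // back off, then reconnect and re-subscribe
        }
    }
}
```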

Server crash

If only one server provides the service, the service becomes unavailable when that server crashes. With a cluster, the failure of a single server does not take down the entire service.

Once you use clustering, there are a few things to consider:

  • Which server the client connects to is random, and a given queue lives on only one server, so every server stores the queue metadata (like an index) and can fetch the actual queue data from the server that owns it;
  • When a server crashes, its non-durable queues and exchanges are lost. After the client reconnects, they need to be created again, but the unconsumed messages on them cannot be recovered;
  • If queues, exchanges, messages, and so on are durable/persistent, how do you recover them? Rabbit provides several ways to do this, as described below;
  • Subscribers also need to reconnect and resubscribe.

Data center outage

To survive the loss of an entire data center you need to run more than one, and RabbitMQ provides a mechanism to easily replicate messages between Rabbit brokers in different data centers.

Cluster and process failure

One of RabbitMQ’s best features is its built-in clustering, which accomplishes two main objectives:

  • Allow consumers and producers to continue operating in the event of a Rabbit node crash;
  • Linearly expand the throughput of message communication by adding more nodes;

Cluster architecture

RabbitMQ always records four types of internal metadata (similar to an index); the declarations that create them are sketched after this list:

  • Queue metadata: queue name and its properties;
  • Exchange metadata: exchange names, types, and properties;
  • Binding metadata: a simple table showing how to route messages to queues;
  • Vhost metadata: namespaces and security attributes for the queues, exchanges, and bindings inside a vhost.
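
To make the four metadata types concrete, here is a minimal Java-client sketch (the vhost, exchange, queue, and binding names are invented for the example): each declaration creates one kind of metadata, and in a cluster each of them is replicated to every node.

```java
import com.rabbitmq.client.*;

public class DeclareTopology {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setVirtualHost("/");        // vhost metadata: namespace + security attributes
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        channel.exchangeDeclare("orders", "direct", true);              // exchange metadata: name, type, properties
        channel.queueDeclare("order-queue", true, false, false, null);  // queue metadata: name and properties
        channel.queueBind("order-queue", "orders", "new-order");        // binding metadata: a routing-table entry

        channel.close();
        conn.close();
    }
}
```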

When a cluster is introduced, a new type of metadata needs to be tracked: the location of each cluster node and the relationship between each node and the other types of metadata already recorded.

Not every node holds a full copy of every queue. When a queue is created in a cluster, the complete queue (metadata, state, contents) lives on a single node; all other nodes only know the queue’s metadata and a pointer to the node that owns it.

If that node crashes, consumers attached to the queue stop receiving new messages. They can reconnect to the cluster and re-declare the queue, but this is only allowed if the queue is not durable; refusing to re-create a durable queue elsewhere ensures that the messages stored on the failed node are not lost when it recovers and rejoins the cluster.

Why not replicate queue contents and state to every node? First, storage: if every cluster node held a full copy of every queue, adding nodes would not add storage capacity. Second, performance: every published message would have to be copied to every cluster node, and for persistent messages the network and disk load would grow with each node.

An exchange, by contrast, is just a lookup table rather than an actual message router, so it is much easier to replicate exchanges across the whole cluster.

You can think of each queue as a process running on its node, each with its own process ID; an exchange is simply a list of routing patterns and the queue process IDs to which matching messages should be delivered.

Each Rabbit node is either a memory (RAM) node or a disk node. A single-node system runs only a disk node; in a cluster, you can optionally configure some nodes as memory nodes.

When you declare a queue, exchange, or binding in a cluster, the operation does not return until every cluster node has successfully committed the metadata change.

RabbitMQ requires at least one disk node in the cluster. If there is only one disk node and it crashes, the cluster can continue to route messages but cannot create queues, exchanges, or bindings, add users, change permissions, and so on. It is therefore recommended to run two disk nodes. When a memory node restarts, it connects to a preconfigured disk node to download a copy of the current cluster metadata, so every memory node must be told about all of the disk nodes.

Mirrored queue

As mentioned earlier, a queue lives on only one node in the cluster, and its messages are lost when that node crashes. RabbitMQ 2.6 introduced mirrored queues: when the master copy of a queue becomes unavailable, one of the slave copies is elected as the new master.
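
As an illustration only: in the RabbitMQ 2.x era a queue could be mirrored by passing the x-ha-policy argument at declaration time (later versions configure mirroring through policies instead), so a Java sketch with a made-up queue name might look like this.

```java
import com.rabbitmq.client.*;
import java.util.HashMap;
import java.util.Map;

public class MirroredQueueDeclare {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // "x-ha-policy: all" asks Rabbit (2.6-2.8 era) to mirror the queue on every cluster node.
        Map<String, Object> arguments = new HashMap<>();
        arguments.put("x-ha-policy", "all");
        channel.queueDeclare("critical-orders", true, false, false, arguments);

        channel.close();
        conn.close();
    }
}
```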

With mirrored queues, in addition to delivering a message to the appropriate queue according to the routing and binding rules, Rabbit also delivers it to the queue’s slave copies.

If publisher confirms are used, Rabbit does not notify the sender until the queues and all of their slave copies have safely received the message.
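
A small Java sketch of that behaviour using publisher confirms (the queue name and payload are made up; the queue is assumed to be the mirrored one declared above):

```java
import com.rabbitmq.client.*;

public class ConfirmedPublish {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        channel.confirmSelect();                       // put the channel in confirm mode
        channel.basicPublish("", "critical-orders",    // default exchange -> the mirrored queue above
                MessageProperties.PERSISTENT_TEXT_PLAIN, "order #42".getBytes());

        // For a mirrored queue the broker only confirms once the master and every
        // slave copy have safely received the message.
        boolean ok = channel.waitForConfirms();
        System.out.println("confirmed: " + ok);

        channel.close();
        conn.close();
    }
}
```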

Mirrored queues have one subtlety: when the node holding the master copy fails, a slave copy is elected as the new master, and all consumers of the queue must re-attach and listen to it. Consumers that were connected through the failed node notice this because their TCP connection to that node is lost, but consumers attached to the mirrored queue through other, still-functioning nodes do not.

For these consumers, Rabbit sends a consumer cancellation notification to tell them that they are no longer attached to the master copy of the queue and need to re-subscribe.
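
In the Java client, the hook for this notification is the consumer’s handleCancel callback. The sketch below simply re-subscribes from it; the queue name is made up, and the x-cancel-on-ha-failover consumer argument, which asks the broker to cancel consumers on failover, may or may not be required depending on the Rabbit version.

```java
import com.rabbitmq.client.*;
import java.io.IOException;
import java.util.Collections;
import java.util.Map;

public class CancelAwareConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        Connection conn = factory.newConnection();
        subscribe(conn.createChannel());
    }

    static void subscribe(Channel channel) throws IOException {
        // Consumer argument asking to be cancelled when the mirrored queue fails over.
        Map<String, Object> args = Collections.singletonMap("x-cancel-on-ha-failover", (Object) true);

        channel.basicConsume("critical-orders", true, "", false, false, args,
                new DefaultConsumer(channel) {
                    @Override
                    public void handleDelivery(String tag, Envelope env,
                                               AMQP.BasicProperties props, byte[] body) {
                        System.out.println("got: " + new String(body));
                    }

                    @Override
                    public void handleCancel(String consumerTag) throws IOException {
                        // Broker-initiated cancel: we are no longer attached to the
                        // (new) master copy, so attach to the queue again.
                        subscribe(channel);
                    }
                });
    }
}
```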

Connection loss and failover

This section focuses on how a consumer detects a lost connection and reconnects.

There are several strategies for reconnecting to the cluster. A good approach is to put a load balancer in front of it: this not only reduces the amount of node-failure handling code in the application, but also keeps connections evenly distributed across the cluster.

Load balancing is covered extensively elsewhere, so I will not go into it here; instead I will focus on how to detect failures and reconnect.

Fault detection is simple: when the long-lived connection goes down, an exception is thrown, and the application catches it.

When a cluster node fails, the application has to decide where to connect next; that job is left to the load balancer.

When reconnecting, keep the following in mind (see the sketch after this list):

  • If you reconnect to a different server, the old channels and all consumer loops running on them are gone and must be rebuilt;
  • After reconnecting, the queues and bindings you rely on may not exist and need to be re-declared.
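
Building on the reconnect loop sketched earlier, one way to handle this is to re-create everything each time a fresh channel is obtained; declarations with unchanged properties are idempotent, so it is safe to repeat them (the names below are the made-up ones from the earlier examples).

```java
import com.rabbitmq.client.*;

public class TopologyRebuilder {
    // Call this every time a fresh connection/channel is obtained, e.g. inside the
    // reconnect loop shown earlier. Re-declaring objects with the same properties
    // is harmless even if they already exist.
    static void rebuild(Channel channel) throws Exception {
        channel.exchangeDeclare("orders", "direct", true);                 // re-create the exchange
        channel.queueDeclare("order-queue", true, false, false, null);     // re-create the queue
        channel.queueBind("order-queue", "orders", "new-order");           // re-create the binding
        channel.basicConsume("order-queue", true, new DefaultConsumer(channel) {  // restart the consumer loop
            @Override
            public void handleDelivery(String tag, Envelope env,
                                       AMQP.BasicProperties props, byte[] body) {
                System.out.println("got: " + new String(body));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-lb");            // hypothetical load balancer address
        Connection conn = factory.newConnection();
        rebuild(conn.createChannel());
    }
}
```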

Active/standby mode

When availability requirements are particularly high and message loss is not acceptable, queues, exchanges, and messages must all be made durable/persistent. In that case, if a node crashes, its messages cannot be forwarded until the node comes back, because by default the cluster does not allow other nodes to re-create the durable queue; this prevents the historical messages from being lost when the failed node recovers.

You can solve this with the Warren pattern: build completely independent RabbitMQ servers for the active and standby machines. A Warren is a pair of independent primary/standby servers with a load balancer in front of them to handle failover.

There is no coordination between the primary and the standby; the standby only handles messages when the primary is down. After the primary fails, queues and exchanges are re-created on the standby so that service can continue, and once the failed primary recovers, the messages it still holds that were never consumed can still be processed.

Cross-data-center replication

RabbitMQ clustering works well for scaling messaging performance within a single data center, but routing messages from a broker in one data center to a broker in another is a different problem. For that you can use Shovel.

Shovel is a RabbitMQ plugin that lets you define a replication relationship between a queue on one RabbitMQ broker and an exchange on another. Put simply, it is for cases where producers and consumers are far apart.

For example, create a queue in data center 1 to receive the messages published by the web site, then let Shovel consume them and republish them over a WAN connection to the exchange in data center 2.

This way, a request only has to publish to data center 1 before returning to the user, which keeps response times low, while data center 1 continuously forwards messages to data center 2.

As you can see from the above, it takes a fair amount of work to ensure high availability, and you can choose different architectures depending on how much availability the business actually requires.

The next article focuses on the RabbitMQ management interface and monitoring.
