Original link: RocketMQ practice issues

We all use MQ in our development, and whenever MQ comes up, the question of message loss inevitably follows. Answering it means thinking about two things: where messages can be lost, and what can be done to prevent that loss.

At which stages can messages be lost?

As shown in the figure above, this is a generic MQ scenario. Stages 1, 2, and 4 cross the network, and anything that crosses the network can naturally lose messages. In stage 3, MQ usually writes the message into the operating system's page cache first, and the operating system flushes it to disk afterwards. There is a time window between the two, and if the machine goes down before the cached data reaches the disk, the message is lost.

The producer uses the transaction messaging mechanism to guarantee zero message loss

This conclusion is easy to accept: RocketMQ's transaction messaging mechanism was designed precisely to guarantee zero message loss and has been battle-tested at Alibaba, so it certainly works.

Taking an e-commerce order scenario as an example, let's briefly analyze how the transaction message mechanism ensures that messages are not lost.

Why send a half message? What does it do?

The half message is sent by the order system before the order is actually placed, and it is invisible to consumers in the downstream services. Its purpose is to confirm that the RocketMQ service is healthy; it is essentially a probe that checks whether RocketMQ is up and tells it that a real message is on its way.

What if writing the half message fails?

Without the half-message step, we would normally complete the order in the order system first and then send a message to MQ, and a failure to write that message to MQ would leave us stuck. With the half message, a write failure is treated as a problem with the MQ service itself, and the downstream service is simply not notified yet. We can mark the order with a special status when it is placed, wait for the MQ service to recover, and then compensate by re-sending the order notification to the downstream service.

What if the order system fails to write to the database?

What would we do without transactional messaging? We would treat the order as failed, throw an exception, and send nothing to MQ; at least the downstream service would not be notified by mistake. But if the database recovers a little later, that order is gone and its message can never be sent. Of course, you could design extra compensation, such as caching the order data and starting a thread that periodically retries the database write, but transactional messaging offers a more elegant solution.

If the order fails to write to the database (say the database has crashed and needs some time to recover), we can cache the order information somewhere else (Redis, a text file, whatever) and return an UNKNOWN status to RocketMQ. RocketMQ will then periodically check back on the transaction status. Each time it checks back, we try again to write the order data to the database; if the database has recovered by then, the order is completed correctly and the rest of the business continues. This way the order message is not lost just because the database was briefly down.
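
This flow maps naturally onto RocketMQ's TransactionListener. Below is a minimal sketch of the idea; the topic name and the saveOrder/retryCachedOrder helpers are made up for illustration, and note that in the Java client the "unknown" status is spelled LocalTransactionState.UNKNOW.

    import org.apache.rocketmq.client.producer.LocalTransactionState;
    import org.apache.rocketmq.client.producer.TransactionListener;
    import org.apache.rocketmq.client.producer.TransactionMQProducer;
    import org.apache.rocketmq.common.message.Message;
    import org.apache.rocketmq.common.message.MessageExt;

    public class OrderTransactionProducer {

        public static void main(String[] args) throws Exception {
            TransactionMQProducer producer = new TransactionMQProducer("order_tx_producer_group");
            producer.setTransactionListener(new TransactionListener() {
                @Override
                public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                    try {
                        saveOrder(msg); // try to write the order to the database
                        return LocalTransactionState.COMMIT_MESSAGE;
                    } catch (Exception e) {
                        // database may be down: keep the half message invisible and let the Broker check back later
                        return LocalTransactionState.UNKNOW;
                    }
                }

                @Override
                public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                    // called by the Broker for UNKNOW transactions: retry the database write
                    return retryCachedOrder(msg)
                            ? LocalTransactionState.COMMIT_MESSAGE
                            : LocalTransactionState.UNKNOW; // still failing, wait for the next check
                }
            });
            producer.start();

            Message msg = new Message("OrderTopic", "order-created".getBytes());
            producer.sendMessageInTransaction(msg, null);
        }

        // hypothetical stand-ins for the real order persistence logic
        private static void saveOrder(Message msg) { /* write the order to the database */ }

        private static boolean retryCachedOrder(MessageExt msg) { /* retry the write from the cached copy */ return true; }
    }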

What if RocketMQ hangs after the half message is successfully written?

It is important to note that in the transaction message mechanism, the check on an unknown transaction status is actively initiated by the RocketMQ Broker. So if the Broker hangs at this point, it simply will not call back into our service to check the transaction status, and we can keep the order marked as a new order. Once RocketMQ recovers, as long as the stored half messages are not lost, it will resume the status-check process on its own.

How to gracefully wait for payment after placing an order

In an order scenario, the customer is usually required to complete payment within a certain time after placing the order (for example, 10 minutes). Only after payment is completed is the downstream service notified for further processing.

What happens if you don’t use transaction messages?

The simplest approach is a timer that scans the order table at fixed intervals, checks the creation time of unpaid orders, and reclaims the overdue ones. This is obviously problematic: it has to repeatedly scan a very large amount of order data, which puts no small pressure on the system.

So what is the alternative? You can use the delayed message mechanism provided by RocketMQ. After placing the order, send MQ a message with a 1-minute delay; when it arrives, check the payment status of the order and, if it has been paid, send the order notification downstream. If it has not been paid, send another 1-minute delayed message. If the order is still unpaid when the 10th message arrives, reclaim it. Instead of scanning the entire order table, this scheme processes a single order message at a time.
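
As a rough sketch of this delayed-message approach (the topic name and order id are placeholders): RocketMQ expresses delays through delay levels, and with the default messageDelayLevel table, level 5 corresponds to a 1-minute delay.

    import org.apache.rocketmq.client.producer.DefaultMQProducer;
    import org.apache.rocketmq.common.message.Message;

    public class PayTimeoutChecker {

        public static void main(String[] args) throws Exception {
            DefaultMQProducer producer = new DefaultMQProducer("pay_check_producer_group");
            producer.start();

            // schedule the first payment check one minute after the order is placed
            Message msg = new Message("CheckPayTopic", "order-10001".getBytes());
            msg.setDelayTimeLevel(5); // default messageDelayLevel: 1s 5s 10s 30s 1m ... -> level 5 is 1 minute
            producer.send(msg);

            producer.shutdown();
        }
    }

The consumer of that topic checks the payment status; if the order is still unpaid and fewer than ten checks have been made, it sends the next 1-minute delayed message, otherwise it reclaims the order.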

What about using transaction messages? Instead of a scheduled task, we can reuse the status-check mechanism of transaction messages: when placing the order, return an UNKNOWN status to the Broker, and query the payment status of the order inside the status-check method. This makes the whole business logic much simpler. By configuring the maximum number of transaction status checks (15 by default) and the check interval on the Broker, this payment-timeout requirement can be met quite elegantly.
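
For reference, these Broker settings can be tuned with a broker.conf fragment like the one below; the values shown are only examples. transactionCheckMax caps how many times the Broker checks back before discarding the half message, and transactionCheckInterval controls how often it checks (in milliseconds):

    # example values only
    transactionCheckMax=10
    transactionCheckInterval=60000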

The role of the transaction messaging mechanism

Overall, in the order scenario, the message-loss problem really translates into a distributed transaction consistency problem between the order business and the downstream services, and transaction consistency is always a very hard problem. RocketMQ's transaction messaging mechanism, however, actually covers only half of the overall transaction: it guarantees consistency between the order system's two actions of placing the order and sending the message, but it does not cover the downstream service's transaction. Even so, it is a very useful way to degrade a full distributed transaction.

At present, it is also regarded as the best degradation scheme of this kind in the industry.

The Broker configures synchronous flush + a Dledger master/slave architecture to ensure that MQ itself does not lose messages

Synchronous flush

This follows directly from our earlier analysis, and the configuration is simple: set flushDiskType to SYNC_FLUSH in broker.conf (the default is ASYNC_FLUSH, i.e. asynchronous flush).
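
A minimal broker.conf fragment for this setting might look like the following:

    # flush each message to disk before acknowledging the producer
    flushDiskType=SYNC_FLUSH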

Dledger file synchronization

In a RocketMQ cluster built using Dledger technology, Dledger ensures file synchronization between master and slave through two-phase commit.

In simple terms, data synchronization goes through two phases: an uncommitted phase and a committed phase.

After the Dledger component on the Leader Broker receives a piece of data, it first marks the data as uncommitted. It then sends this uncommitted data, via its own DledgerServer component, to the DledgerServer component on each Follower Broker.

After a Follower Broker's DledgerServer receives the uncommitted message, it returns an ACK to the Leader Broker's DledgerServer. Once the Leader Broker has received ACKs from more than half of the Follower Brokers, it marks the message as committed.

The DledgerServer on the Leader Broker then sends a commit message to the DledgerServers on the Follower Brokers, asking them to mark the message as committed as well. This whole process is based on the Raft protocol.
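
To enable this Dledger-based replication, the Broker needs the Dledger settings in broker.conf; the group name, node ids, and addresses below are placeholders:

    enableDLegerCommitLog=true
    dLegerGroup=RaftNode00
    dLegerPeers=n0-192.168.0.1:40911;n1-192.168.0.2:40911;n2-192.168.0.3:40911
    dLegerSelfId=n0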

Do not use asynchronous consumption mechanisms on the consumer side

Normally, the consumer processes its local transaction first and only then returns an ACK to MQ; at that point MQ advances the offset, marks the message as consumed, and stops pushing it to other consumers. Thanks to the Broker's re-push mechanism, a message will therefore not be lost on its way to the consumer. However, the following usage can still cause messages to be lost on the consumer side:

    DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("please_rename_unique_group_name_4");
    consumer.registerMessageListener(new MessageListenerConcurrently() {
        @Override
        public ConsumeConcurrentlyStatus consumeMessage(List<MessageExt> msgs,
                                                        ConsumeConcurrentlyContext context) {
            new Thread() {
                public void run() {
                    // process the business logic in a separate thread
                    System.out.printf("%s Receive New Messages: %s %n",
                            Thread.currentThread().getName(), msgs);
                }
            }.start();
            // acknowledged before the business logic has actually finished
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        }
    });

In this asynchronous consumption mode, CONSUME_SUCCESS is returned before the business logic in the new thread has finished. If that business logic then fails, the Broker has already marked the message as consumed, and the message is effectively lost.
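
A safer pattern is to finish the business processing synchronously inside consumeMessage and only then acknowledge, returning RECONSUME_LATER on failure so the Broker re-delivers the message. A minimal sketch, with the topic name and handleOrder method as placeholders:

    import java.util.List;

    import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
    import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
    import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
    import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
    import org.apache.rocketmq.common.message.MessageExt;

    public class SyncConsumer {

        public static void main(String[] args) throws Exception {
            DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("please_rename_unique_group_name_4");
            consumer.subscribe("OrderTopic", "*"); // topic and tag are placeholders
            consumer.registerMessageListener(new MessageListenerConcurrently() {
                @Override
                public ConsumeConcurrentlyStatus consumeMessage(List<MessageExt> msgs,
                                                                ConsumeConcurrentlyContext context) {
                    try {
                        for (MessageExt msg : msgs) {
                            handleOrder(msg); // synchronous business processing
                        }
                        // only acknowledge after the business logic has really succeeded
                        return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
                    } catch (Exception e) {
                        // ask the Broker to deliver the message again later
                        return ConsumeConcurrentlyStatus.RECONSUME_LATER;
                    }
                }
            });
            consumer.start();
        }

        // hypothetical placeholder for the real business logic
        private static void handleOrder(MessageExt msg) {
            System.out.printf("processing %s%n", new String(msg.getBody()));
        }
    }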

A RocketMQ-specific problem: how do we keep messages from being lost when the NameServer is down?

In RocketMQ, the NameServer acts as the routing center, providing clients with routing information to the Brokers. Every MQ needs some routing function: Kafka uses ZooKeeper plus a Broker elected as Controller to provide routing, and in RabbitMQ routing is handled by each Broker itself. RocketMQ splits the routing center out into the NameServer and deploys it independently.

As we saw earlier when studying the NameServer, the failure of any number of nodes in the NameServer cluster does not affect the routing function it provides. But what if every NameServer node in the cluster goes down?

Many people assume that, since producers and consumers keep local caches of all the routing information, the whole service would keep working for a while. In fact you can test this yourself: when all the NameServers are down, producers and consumers stop working immediately. As to why, you can go back to our earlier source-code lessons and find the answer there.

So, back to the message-loss problem: in this situation RocketMQ as a whole is essentially unavailable, so there is no way it can guarantee that messages are not lost, and we have to design a degradation plan ourselves. Taking the order system as an example, if several attempts to send to RocketMQ fail, we cache the order message somewhere else (Redis, a file, memory, etc.) and start a thread that periodically scans these failed messages and retries sending them to RocketMQ. That way the messages are sent again as soon as the RocketMQ service recovers. In large Internet projects, this kind of degradation mechanism is a must.
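
A very rough sketch of this degradation path, assuming an in-memory queue as the fallback cache and arbitrary topic and retry settings:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.rocketmq.client.producer.DefaultMQProducer;
    import org.apache.rocketmq.common.message.Message;

    public class DegradableOrderSender {

        private final DefaultMQProducer producer;
        // in-memory fallback cache; a real system would more likely use Redis or a local file
        private final Queue<Message> failedMessages = new ConcurrentLinkedQueue<>();
        private final ScheduledExecutorService retryPool = Executors.newSingleThreadScheduledExecutor();

        public DegradableOrderSender(DefaultMQProducer producer) {
            this.producer = producer;
            // periodically retry messages that could not be sent (the period is arbitrary here)
            retryPool.scheduleWithFixedDelay(this::retryFailedMessages, 30, 30, TimeUnit.SECONDS);
        }

        public void sendOrderMessage(byte[] body) {
            Message msg = new Message("OrderTopic", body); // topic is a placeholder
            try {
                producer.send(msg);
            } catch (Exception e) {
                // RocketMQ (or the NameServer) is unavailable: park the message and retry later
                failedMessages.offer(msg);
            }
        }

        private void retryFailedMessages() {
            Message msg;
            while ((msg = failedMessages.poll()) != null) {
                try {
                    producer.send(msg);
                } catch (Exception e) {
                    failedMessages.offer(msg); // still failing, keep it for the next round
                    return;
                }
            }
        }
    }

A production version would persist the fallback cache (Redis or a local file) so that the pending messages also survive a restart of the order service.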

RocketMQ message zero loss scheme summary

After all this analysis, the complete RocketMQ zero-message-loss scheme is actually quite clear:

  • Producers use transaction messaging mechanisms.
  • The Broker configures synchronous flush + a Dledger master/slave architecture.
  • Consumers should not use asynchronous consumption.
  • Prepare a degradation plan for when MQ as a whole is down.

Is this a perfect solution? Obviously not: every link in this zero-loss scheme sacrifices processing performance and throughput, and in many scenarios the cost of that performance loss can far outweigh the cost of losing a few messages. So when designing a RocketMQ usage scheme, we have to weigh the actual business requirements. For example, if all servers are in the same machine room, it is perfectly reasonable to configure the Broker with asynchronous flush to improve throughput. In scenarios where message reliability is not that critical, the producer can adopt simpler, faster sending schemes and rely on periodic reconciliation and compensation mechanisms to improve reliability. And if the consumer does not need to persist the messages, the performance gain from the asynchronous consumption mechanism can be significant.

In short, the zero-loss scheme above is a good reference point when designing your own RocketMQ usage scheme.