Preface

Many of us have heard of messaging middleware (MQ) such as RabbitMQ, RocketMQ, and Kafka. Introducing such middleware brings benefits like peak shaving under high concurrency, asynchronous processing, and business decoupling.

As shown above:

(1) The order service delivers the message to the MQ middleware. (2) The logistics service listens to the MQ middleware and consumes the message.

This article will discuss how to ensure that the order service delivers messages successfully to MQ middleware, using RabbitMQ as an example.

Analyzing the problem

Some readers may have a question here: the order service sends the message, and if the send returns success, doesn't that mean it succeeded? The pseudo-code is as follows:
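A minimal sketch of that fire-and-forget send, assuming the Python pika client (the host and queue names are illustrative, not from the original):

```python
def send_order_message(body: bytes, queue: str = "order.queue") -> bool:
    """Naive publish: report success as soon as the client call returns."""
    import pika  # third-party RabbitMQ client, assumed installed

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=queue)
    channel.basic_publish(exchange="", routing_key=queue, body=body)
    conn.close()
    # "success" here only means no client-side exception was raised;
    # it says nothing about whether the broker actually kept the message
    return True
```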

The code above is how message sending is generally written. Do you see any problem with it?

Consider a scenario: what happens if the MQ server suddenly goes down? Are all the messages our order service sent simply gone? Yes. To improve system throughput, MQ middleware generally stores messages in memory first. If nothing else is done, those messages are lost when the MQ server goes down. That is not acceptable to the business, and the impact can be great.

Persistence

Durable RabbitMQ messages: a message is persisted to disk if the queue is declared with durable set to true and the message itself is marked persistent.
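With the pika client (an assumed choice, as is the host name), durability is two settings: a durable queue plus delivery_mode 2 on each message:

```python
PERSISTENT = 2  # AMQP delivery_mode 2 marks a message as persistent

def publish_persistent(body: bytes, queue: str = "order.queue") -> None:
    import pika  # third-party RabbitMQ client, assumed installed

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    # durable=True: the queue definition itself survives a broker restart
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=body,
        # ask the broker to write this message to disk
        properties=pika.BasicProperties(delivery_mode=PERSISTENT),
    )
    conn.close()
```

Both settings are needed: a persistent message in a non-durable queue still disappears with the queue on restart.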

This way, if the MQ server goes down, the messages are still in the disk file after a restart and are not lost. So persistence makes it much less likely, though not impossible, for a message to be lost.

However, there is another scenario: the message has only been saved to MQ's memory and has not yet been flushed to the disk file when the server suddenly goes down. This scenario is common under a sustained high volume of messages.

So what then? What can we do to ensure the message is persisted to disk?

Confirm mechanism

The problem above is that no one tells us whether persistence succeeded. Fortunately, many MQ products provide callback notifications; RabbitMQ has a confirm mechanism that tells us whether persistence was successful.

The confirm mechanism works as follows:

(1) The message producer sends the message to MQ. If MQ receives the message successfully, it returns an ACK to the producer.

(2) If MQ does not receive the message successfully, it returns a NACK to the producer.
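With pika's BlockingConnection (again an assumption about the client), enabling confirms makes basic_publish block until the broker answers: a normal return is the ACK, and a NackError is the NACK:

```python
def publish_with_confirm(body: bytes, queue: str = "order.queue") -> bool:
    import pika
    import pika.exceptions

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.confirm_delivery()  # turn on publisher confirms for this channel
    try:
        channel.basic_publish(
            exchange="", routing_key=queue, body=body,
            properties=pika.BasicProperties(delivery_mode=2),
        )
        return True   # broker returned Basic.Ack: the ACK path
    except pika.exceptions.NackError:
        return False  # broker returned Basic.Nack: the NACK path
    finally:
        conn.close()
```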

The pseudocode above handles the two outcomes separately: the ACK (success) callback and the NACK (failure) callback.

Does this guarantee that 100% of messages are not lost?

Think about the confirm mechanism for a moment: imagine that every time a producer sent a message, MQ persisted it to disk before issuing the ACK or NACK callback. Then MQ's throughput would not be high, because every single message would be synchronously persisted. Writing to disk is slow, and in high-concurrency scenarios that throughput is unacceptable.

So real MQ implementations persist to disk asynchronously, with certain batching mechanisms, for example: flush to disk once when a few thousand messages have accumulated, rather than on every incoming message.

Therefore the confirm mechanism is actually an asynchronous listening mechanism designed to preserve the system's high throughput. That is exactly why it cannot 100% guarantee against message loss: even with the confirm mechanism, messages still sitting in MQ's memory when the server crashes have not been flushed to disk and are lost anyway.

After all this, we still cannot be 100% sure. So what should we do?

Message pre-persistence + scheduled task

Essentially, the problem is that we do not know whether the message was persisted, right? So can we persist the message ourselves? The answer is yes, and our solution evolves one step further.

As the flow chart above shows:

(1) Before delivering a message, the order service (producer) persists it to Redis or a DB; Redis is recommended for its high performance. The message's status is set to "sending".

(2) The confirm mechanism then listens for whether the message was sent successfully. For example, on an ACK success callback, the message is deleted from Redis.

(3) On a NACK failure callback, you can decide according to your own business whether to resend the message, or simply delete it.

(4) A scheduled task is added to pull messages whose status is still "sending" after a certain period of time. That status means the order service never received an ACK.

(5) The scheduled task then redelivers the message as compensation. Once MQ's ACK callback reports success, the message is deleted from Redis.

This is a compensation mechanism: I do not care whether MQ actually received the message. As long as the message's status in my Redis is still "sending", it counts as not successfully delivered, and the scheduled task monitors it and initiates compensatory delivery.

Of course, we can also add a compensation count to the scheduled task. If a message has been redelivered more than 3 times and still no ACK has been received, we can set its status directly to [failed] and investigate the cause manually.
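Steps (1) to (5) plus the retry cap can be sketched with a plain dict standing in for Redis (an in-memory simulation; the hypothetical `mq_send` callable stands in for the real publish):

```python
import time

store = {}       # stands in for Redis: msg_id -> record
MAX_RETRIES = 3  # give up after 3 compensations

def deliver(msg_id: str, body: str, mq_send) -> None:
    # (1) pre-persist with status "sending" BEFORE publishing
    store[msg_id] = {"status": "sending", "retries": 0,
                     "body": body, "sent_at": time.time()}
    mq_send(msg_id, body)

def on_ack(msg_id: str) -> None:
    # (2) confirm callback reported success: remove the record
    store.pop(msg_id, None)

def on_nack(msg_id: str, mq_send) -> None:
    # (3) failure: here we choose to resend; deleting is equally valid
    rec = store.get(msg_id)
    if rec is not None:
        mq_send(msg_id, rec["body"])

def compensation_task(mq_send, timeout: float = 30.0) -> None:
    # (4)+(5) scheduled task: redeliver anything still "sending" past the timeout
    now = time.time()
    for msg_id, rec in list(store.items()):
        if rec["status"] != "sending" or now - rec["sent_at"] < timeout:
            continue
        if rec["retries"] >= MAX_RETRIES:
            rec["status"] = "failed"  # stop retrying, investigate manually
            continue
        rec["retries"] += 1
        rec["sent_at"] = now
        mq_send(msg_id, rec["body"])
```

A broker that always acks shrinks `store` back to empty; one that never acks drives a record to "failed" after three compensations.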

At this point the solution is fairly complete, ensuring that 100% of messages are not lost (unless the disk itself also breaks, which can be covered by a master-slave setup).

In this scheme, however, the same message may be sent multiple times: it is possible that MQ had already received the message, but a network failure during the ACK callback meant the producer never learned of it.

That requires the consumer to guarantee idempotency when consuming!

The meaning of idempotence

So what is idempotence? In distributed applications idempotence is very important. It means that no matter how many times an operation is performed on a business under the same conditions, the result is the same.

Why is idempotence needed?

Because in a large system everything is deployed in a distributed fashion: for example, the order business and the inventory business may be deployed independently as separate services. When a user places an order, both the order service and the inventory service are invoked.

Due to the distributed deployment, it is very likely that the inventory service processes the call successfully, but a network problem causes an exception while the result is being returned to the order service. At this point the system generally applies a compensation scheme: the order service calls the inventory service again, decrementing the inventory by 1.

Here is the problem: the previous call already decremented the inventory by 1, but the order service never received the result. Calling again decrements it by 1 a second time, so the inventory is reduced twice, which corrupts the business data.

Idempotence is the concept that no matter how many times the inventory service is invoked under the same conditions, the result is the same. Only then is the compensation scheme feasible.

Optimistic locking scheme

We can borrow the optimistic locking mechanism of the database, for example:
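A runnable sketch with Python's built-in sqlite3 (the table and column names are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (id INTEGER PRIMARY KEY, count INTEGER, version INTEGER)")
db.execute("INSERT INTO stock VALUES (1, 100, 1)")

def decrement_stock(item_id: int, version: int) -> bool:
    """Decrement inventory only if the caller still holds the current version."""
    cur = db.execute(
        "UPDATE stock SET count = count - 1, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (item_id, version),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows matched: someone already updated the row

# The first call carries version 1 and succeeds, bumping the row to version 2;
# a retry that still carries version 1 matches nothing and is a no-op.
print(decrement_stock(1, 1))  # True
print(decrement_stock(1, 1))  # False
```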

The idea is to use a version number: read the item's current version before operating on the inventory, and carry that version in the WHERE clause of the update, incrementing it on success. The first inventory operation reads version 1, the inventory service applies it, and the row becomes version 2. If the order service invokes the inventory service again while still passing version 1, the update matches no rows, since version has already changed to 2 and the WHERE condition no longer holds. This ensures that no matter how many times the call is made, it is actually applied only once.

Unique ID + fingerprint code

The principle is to use a database primary key for deduplication, inserting the primary key record once the business operation completes.

  • The unique ID is the business table's unique primary key, such as an item ID.
  • The fingerprint code distinguishes each normal operation; one is generated per operation, for example timestamp + business number.

The SQL logic:

  • First query t_check by unique ID + fingerprint code; if the count is greater than 0, the operation has already been performed, so return directly.
  • Otherwise INSERT INTO t_check (unique ID + fingerprint code) and execute the business operation.
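A sketch of the dedup table with sqlite3, using a UNIQUE/PRIMARY KEY constraint so the insert itself detects repeats (names are illustrative; here the marker is inserted before the business runs, which deduplicates retries just the same):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t_check (dedup_key TEXT PRIMARY KEY)")

def try_execute(unique_id: str, fingerprint: str, business_op) -> bool:
    """Run business_op only if this (unique ID + fingerprint code) is unseen."""
    key = f"{unique_id}:{fingerprint}"
    try:
        db.execute("INSERT INTO t_check (dedup_key) VALUES (?)", (key,))
        db.commit()
    except sqlite3.IntegrityError:
        return False  # key already recorded: the operation ran before
    business_op()
    return True

calls = []
print(try_execute("item-42", "20240101-0001", lambda: calls.append(1)))  # True
print(try_execute("item-42", "20240101-0001", lambda: calls.append(1)))  # False
print(len(calls))  # 1: the business logic ran exactly once
```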

Benefits: simple to implement.

Cons: the database becomes a bottleneck under high concurrency.

Solution: use an algorithm to route requests by ID across sub-databases and sub-tables (sharding).

Redis atomic operation

Use a Redis atomic operation to mark completion. This performs better, but it has its own problems.
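The natural command here is Redis SETNX ("set if not exists"): only the first caller wins the completion mark. Below is a single-process stdlib simulation with a dict; with the real redis-py client the call would be something like `r.set(key, 1, nx=True)` (an assumption about the client), where true atomicity comes from Redis executing the command as one step:

```python
marks = {}  # stands in for Redis

def setnx(key: str, value: str) -> bool:
    """'Set if not exists', like Redis SETNX; True means we set it first."""
    if key in marks:
        return False
    marks[key] = value
    return True

def consume(msg_id: str, business_op) -> bool:
    # only the first consumer of msg_id gets to run the business logic
    if not setnx(f"done:{msg_id}", "1"):
        return False  # duplicate delivery, already handled
    business_op()
    return True

handled = []
print(consume("m1", lambda: handled.append("m1")))  # True
print(consume("m1", lambda: handled.append("m1")))  # False
```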

First: do we need to persist the business result to the database? If we do, the key problem to solve is how to make the database write and the Redis mark operation atomic.

That is, the inventory is decremented by 1 in the database, but what if setting the Redis completion mark then fails? We must ensure that the database and Redis either both succeed or both fail.

Second: if the result is not persisted to the database and lives only in the cache, how do we set the policy for periodically synchronizing it back?

That is, the inventory decrement is not written to the database at all; we directly set the Redis completion mark and leave the inventory persistence to a separate synchronization service. This adds complexity to the system, and the synchronization strategy itself still has to be designed.