Preface

I’m sure you’ve all used message queues (MQ). They are a great way to decouple systems, reduce complexity, and smooth out traffic peaks, which improves the stability of systems under high concurrency. So what should you watch out for when using MQ? Is MQ foolproof? Can a message fail anywhere on the way from production to consumption? What failures are possible, and how do we deal with them?

1. Message production fails

Calls from producers to the MQ middleware typically go over the network, and any network call can fail. Production may fail for reasons such as network jitter: even though the producer and the MQ server communicate over the intranet, intranet calls are not 100% reliable either and can still time out or fail. The MQ machine being called may also crash at just the wrong moment, which likewise causes the call to fail. A failed production call is easy to handle: simply retry. If the producer still fails after two or three retries, there is very likely a bigger problem.
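As a minimal sketch (assuming a Kafka producer; the topic name, broker address, and helper class here are illustrative, not from the original), a send-with-retry wrapper might look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryingProducer {
    private static final int MAX_ATTEMPTS = 3; // the 2-3 retries discussed above

    public static void sendWithRetry(KafkaProducer<String, String> producer,
                                     String topic, String key, String value) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                // send() returns a Future; get() blocks so failures surface synchronously
                producer.send(new ProducerRecord<>(topic, key, value)).get();
                return; // success, no further attempts needed
            } catch (Exception e) {
                if (attempt == MAX_ATTEMPTS) {
                    // all retries exhausted -- very likely a bigger problem, escalate to the caller
                    throw new RuntimeException("MQ send failed after " + MAX_ATTEMPTS + " attempts", e);
                }
            }
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            sendWithRetry(producer, "orders", "orderNo-123", "order payload");
        }
    }
}
```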

2. MQ storage fails

Once a message reaches the messaging middleware, it is usually persisted; only after it has been written to disk is it truly stored and safe from loss. However, most MQ middleware does not write each message to disk the moment it is received, because disk writes are far slower than memory writes. Systems like Kafka therefore write messages into a buffer first and flush them to disk asynchronously; if the machine suddenly loses power in between, messages can be lost. To address this, most MQ systems are deployed in a distributed manner: a message is written to the buffers of multiple machines before success is returned to the business side. Since the probability of several machines going down at the same time is low, this is considered a cost-effective and reliable solution.
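In Kafka, for instance, this trade-off is expressed through producer acknowledgement settings; a hedged sketch of the relevant configuration (the broker address and the replication numbers mentioned below are assumptions) might be:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class DurableProducerFactory {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // "all": the send is only acknowledged once every in-sync replica has the message,
        // so a single machine losing power does not lose it.
        props.put("acks", "all");
        props.put("retries", 3); // retry transient broker errors automatically
        return new KafkaProducer<>(props);
    }
}
```

For "all" to really mean multiple machines, the topic also needs replication, e.g. a replication.factor of 3 with min.insync.replicas of 2 (values chosen here purely for illustration).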

3. Consumer processing fails

MQ middleware generally provides a retry mechanism: if processing fails, the message is delivered again for consumption. The catch is that the consumer may have processed the message successfully but failed to acknowledge it to the MQ middleware, which leads to the message being consumed more than once. Producer retries cause the same duplicate-message problem. This is why we should design our interfaces to be as idempotent as possible, so that repeated consumption does no harm. Handle these three failure points well and the availability of the system improves dramatically!

Here are a few key points to note:

  1. Idempotence is not only about requests that have no side effects on the resource at all (such as a pure query that performs no inserts, updates, or deletes and therefore does not change the database).
  2. Idempotence also covers the case where the first request has side effects on the resource but subsequent requests do not.
  3. Idempotence is concerned with whether subsequent repeated requests have side effects on the resource, not with what result they return.

Idempotence is a promise made by a service (rather than an implementation detail): as long as the call to the interface succeeds, multiple external calls will have the same effect on the system as a single one. A service declared idempotent assumes that failed external invocations are the norm and that retries after failure are inevitable.

When do you need idempotence

In business development we constantly run into repeated submissions, whether because a request is re-sent after a network problem or because a jittery front end fires the same submit more than once. The problems this causes are especially visible in trading and payment systems, for example:

If a user taps submit in the app several times, only one order should be created in the back end.

If a payment request to the payment system is retransmitted because of a network problem or a system bug, the payment system should deduct the money only once. Clearly, an idempotent service assumes that external callers will invoke it multiple times, and it is designed to be idempotent precisely so that those repeated invocations cannot change the system’s data state more than once.

Idempotence vs. anti-duplication

The problem in the example above is really a case of repeated submission, which is different from the original intent of service idempotence. A repeated submission is a human-triggered repeat of an operation after the first request has already succeeded, causing a service that is not idempotent to change state several times. Idempotence is more concerned with the abnormal case where the outcome of the first request is unknown (for example, a timeout) or the request failed, and the caller issues it again: the repeated requests exist to confirm that the first one succeeded, and they must not cause additional state changes.

When do we need to ensure idempotency

Taking SQL as an example, there are three scenarios, and only the third scenario requires developers to use other policies to ensure idempotency:

  1. SELECT col1 FROM tab1 WHERE col2=2 — a pure query has no side effects, so it is idempotent no matter how many times it runs.
  2. UPDATE tab1 SET col1=1 WHERE col2=2 — executing it once or many times leaves col1 in the same state, so it is naturally idempotent.
  3. UPDATE tab1 SET col1=col1+1 WHERE col2=2 — the result changes with each execution, so it is not idempotent and needs extra handling.

Why design idempotent services

Idempotence simplifies the client’s processing logic at the cost of extra complexity in the service’s logic. To satisfy idempotency, the service logic must do at least two things:

  1. First query the status of the previous execution; if there is none, treat the call as the first request.
  2. Guard against repeated submission before the service executes the business logic that changes state.

The downsides of idempotence

Idempotence simplifies the client’s processing logic but adds logic and cost on the service provider’s side. Whether that is worthwhile has to be analyzed for the specific scenario, so idempotent interfaces should not be provided unless the business genuinely requires them, because:

  1. The extra idempotency-control logic complicates the business function;
  2. Operations that could run in parallel are forced to run serially, reducing execution efficiency.

Strategies for ensuring idempotence

Idempotence is guaranteed by a unique business order number: requests carrying the same order number are treated as the same business operation, and that unique number is used to make sure the processing logic and its effect are the same no matter how many times it runs. Taking payment as an example, and ignoring concurrency for the moment, idempotence is easy to implement:

① First check whether the order has already been paid.

② If it has been paid, return success directly; if not, go through the payment flow and change the order status to ‘paid’.
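A minimal sketch of this two-step check, with an in-memory map standing in for the order table (the class and method names are hypothetical, not from the original):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Naive idempotent payment -- only correct when there are no concurrent requests.
public class NaivePaymentService {
    // In-memory stand-in for the order table: orderNo -> status.
    private final Map<String, String> orderStatus = new ConcurrentHashMap<>();

    public String pay(String orderNo) {
        // Step 1: check whether this order has already been paid.
        if ("PAID".equals(orderStatus.get(orderNo))) {
            return "already paid";            // repeated request, no second deduction
        }
        // Step 2: perform the actual deduction, then mark the order as paid.
        deduct(orderNo);                      // placeholder for the real payment flow
        orderStatus.put(orderNo, "PAID");
        return "paid";
    }

    private void deduct(String orderNo) {
        // the real money movement would happen here
    }
}
```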

Anti-duplicate submission policy

The idempotence scheme above has two steps, and step ② depends on the query result of step ①, so the pair cannot be made atomic as written. Under high concurrency the following can happen: a second request arrives before step ② of the first request has changed the order status to ‘paid’. Once we accept that conclusion, the remaining problem is simple: lock the query-and-change-state sequence so that the parallel operations become serial.

Optimistic locking

If all you are doing is updating existing data, there is no need to lock at the service level; optimistic locking can be built into the table design instead. Optimistic locking is usually implemented with a version column, which preserves execution efficiency while still guaranteeing idempotence, for example: UPDATE tab1 SET col1=1, version=version+1 WHERE version=#version#. Because version only ever increases, the ABA problem cannot occur.
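As a hedged sketch (table and column names follow the example above; the JDBC wiring is illustrative), the caller reads the current version beforehand and treats an update that matches zero rows as a lost race:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticLockUpdate {
    // Returns true if our update won; false means another request already bumped the version.
    public static boolean updateWithVersion(Connection conn, int col2, int expectedVersion)
            throws SQLException {
        String sql = "UPDATE tab1 SET col1 = 1, version = version + 1 "
                   + "WHERE col2 = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, col2);
            ps.setInt(2, expectedVersion);
            // executeUpdate() reports how many rows matched; 0 means the version has moved on.
            return ps.executeUpdate() == 1;
        }
    }
}
```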

De-duplication table

Use the order number orderNo as the unique index of a de-duplication table, and have every request insert a row into that table keyed by the order number. The first request inserts successfully, queries the order’s payment status (which is of course unpaid), performs the payment, updates the order status to success or failure once it finishes, and then deletes its row from the de-duplication table. Concurrent requests for the same order fail to insert because of the unique index and are returned as failures until the first request completes (whether it succeeds or fails). As you can see, the de-duplication table is effectively acting as a lock.
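A sketch of that insert-as-lock step with JDBC (the table name pay_dedup and its unique index on order_no are assumptions for illustration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public class DedupTableGuard {
    // Tries to "take the lock" for this order by inserting into the de-duplication table.
    public static boolean tryAcquire(Connection conn, String orderNo) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO pay_dedup (order_no) VALUES (?)")) { // unique index on order_no
            ps.setString(1, orderNo);
            ps.executeUpdate();
            return true;                      // row inserted, this request may proceed
        } catch (SQLIntegrityConstraintViolationException e) {
            return false;                     // another request is already handling this order
        }
    }

    // Called after the first request finishes (success or failure).
    public static void release(Connection conn, String orderNo) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM pay_dedup WHERE order_no = ?")) {
            ps.setString(1, orderNo);
            ps.executeUpdate();
        }
    }
}
```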

Distributed lock

The de-duplication table above can be replaced by a distributed lock, for example one built on Redis. When a payment request arrives for an order, the payment system checks whether a key for that order number exists in the Redis cache; if not, it adds the key, checks whether the order has already been paid, performs the payment if it has not, and deletes the order-number key once payment is done. With the lock implemented in Redis, the next request cannot get in until the current payment request has finished. Compared with the de-duplication table, this moves the concurrency control into the cache, which is more efficient. The idea is the same: only one payment request per order can be in flight at a time.
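A hedged sketch using the Jedis client (the key prefix, timeout, and wiring are assumptions):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisPayLock {
    private static final int LOCK_TTL_SECONDS = 30; // expire so a crashed worker cannot block the order forever

    // "Check whether the key exists and add it if not" done atomically with SET NX EX.
    public static boolean tryLock(Jedis jedis, String orderNo) {
        String reply = jedis.set("pay:lock:" + orderNo, "1",
                SetParams.setParams().nx().ex(LOCK_TTL_SECONDS));
        return "OK".equals(reply); // null means the key already exists: another request holds the lock
    }

    public static void unlock(Jedis jedis, String orderNo) {
        jedis.del("pay:lock:" + orderNo);
    }
}
```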

Token

This method has two stages: a token-application stage and a payment stage. In the first stage, before the order submission page is shown, the order system requests a token from the payment system based on the user’s information, and the payment system stores the token in the Redis cache for use during the second stage. In the second stage, the order system initiates the payment request carrying the token it was issued, and the payment system checks whether that token exists in Redis. If it does, this is the first payment request: the cached token is deleted and the payment logic begins. If it does not exist in the cache, the request is invalid. The token is essentially a credential that lets the payment system confirm the request really comes from a legitimate order submission. The disadvantage is that the two systems must interact twice, making the flow more complex than the methods above.
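A minimal sketch of the two stages (the key names, TTL, and Jedis wiring are assumptions; a production version would also make the check-and-delete atomic, e.g. with a Lua script):

```java
import java.util.UUID;
import redis.clients.jedis.Jedis;

public class PayTokenService {
    // Stage 1: the order system asks for a token before showing the submit page.
    public static String applyToken(Jedis jedis, String orderNo) {
        String token = UUID.randomUUID().toString();
        jedis.setex("pay:token:" + orderNo, 600, token); // token valid for 10 minutes (assumed TTL)
        return token;
    }

    // Stage 2: the payment request carries the token; only the first request finds and consumes it.
    public static boolean checkAndConsumeToken(Jedis jedis, String orderNo, String token) {
        String cached = jedis.get("pay:token:" + orderNo);
        if (cached == null || !cached.equals(token)) {
            return false;                        // missing or mismatched token: invalid or repeated request
        }
        jedis.del("pay:token:" + orderNo);       // consume the token, then start the payment logic
        return true;
    }
}
```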

Payment buffer

Order payment requests are dropped straight into a buffer pipeline, and asynchronous tasks then process the data in the pipeline and filter out duplicate pending payments. The advantage is that synchronous processing becomes asynchronous, giving high throughput. The disadvantage is that the payment result cannot be returned immediately; it has to be delivered asynchronously later, which requires additional monitoring.
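A toy, in-process sketch of the idea (a real system would use an MQ topic or a Redis list as the pipeline; all names here are illustrative):

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class PaymentBuffer {
    private final BlockingQueue<String> pipeline = new LinkedBlockingQueue<>(); // buffer of order numbers
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();         // filters duplicate submissions

    // Synchronous side: accept the request immediately; the result comes back asynchronously.
    public void submit(String orderNo) {
        if (inFlight.add(orderNo)) {   // duplicate submissions of the same order are dropped here
            pipeline.offer(orderNo);
        }
    }

    // Asynchronous side: a worker drains the pipeline and runs the real payment flow.
    public void workerLoop() throws InterruptedException {
        while (true) {
            String orderNo = pipeline.take();
            try {
                // doPay(orderNo);     // placeholder for the actual payment and result notification
            } finally {
                inFlight.remove(orderNo);
            }
        }
    }
}
```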