**Abstract:** This article looks at how a “compensation” mechanism can absorb internal failures while keeping the system highly available to the outside.

I. Why do we need a “compensation” mechanism?

Take an e-commerce shopping scenario as an example:

Client -> Shopping cart microservice -> Order microservice -> Payment microservice.

This kind of call chain is very common.

So why do we need to think about compensation mechanisms?

As mentioned in previous articles, a single cross-machine call may pass through DNS services, network cards, switches, routers, load balancers, and other devices, none of which is guaranteed to be stable. An error at any point along the way causes the whole call to fail.

In distributed scenarios, a single business operation is made up of several cross-machine calls, so the probability of a problem compounds with every additional call.

However, these problems do not necessarily mean that the system is truly unable to handle the request, so we should absorb such exceptions automatically wherever possible.
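To put a rough number on this, assume (purely for illustration) that each cross-machine call succeeds independently with probability p. A business flow made of n such calls then succeeds with probability p^n, so the chance that at least one call fails is 1 - p^n. The function name and the 99.9% figure below are assumptions, not measurements from a real system.

```python
def failure_probability(per_call_success: float, calls: int) -> float:
    """Chance that at least one of `calls` independent calls fails."""
    return 1.0 - per_call_success ** calls


# Illustrative numbers only: 99.9% success per call.
for calls in (1, 3, 10, 30):
    print(calls, round(failure_probability(0.999, calls), 4))
```

Even with very reliable individual calls, the failure rate of the whole flow climbs steadily as the chain gets longer.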

You may have come across “compensation”, “transaction compensation”, and “retry” before and wonder how they relate to one another.

Don’t worry too much about the names; the purpose is the same: once an operation fails, use an internal mechanism to eliminate the “inconsistent” state that the failure left behind.

“Transaction compensation” and “retry” are both subsets of “compensation”. The former is a reverse operation, while the latter is a forward operation.

They differ in the outcome, though. “Transaction compensation” means giving up: the current operation is bound to fail.

▲ Transaction compensation

Retry, by contrast, still has a chance of succeeding. The two approaches suit different scenarios.

▲ Retry

“Compensation” is an extra process, and being able to afford that extra process shows that timeliness is not the top priority. The core principle of compensation is therefore: better slow than wrong.

So do not rush into an implementation plan for compensation; evaluate it carefully. Mistakes cannot be avoided 100% of the time, but this mindset does reduce them.

II. How to do “compensation”?

The two main ways to do “compensation” are the “transaction compensation” and “retry” mentioned above, referred to below as “rollback” and “retry”.

Let’s talk about rollback first. It is logically simpler than “retry.”

“Rollback”

Brother Z divides rollback into two modes: “explicit rollback” (calling a reverse interface) and “implicit rollback” (no reverse interface is called).

The most common is “explicit rollback.” It involves two things:

First, identify the step that failed and its state, which determines the scope of the rollback. A business process is usually defined at design time, so the scope is easy to determine. The one caveat is that if not every service in the process provides a rollback interface, the services that do provide one should be placed earlier when orchestrating the flow, so that they can still be rolled back if a later service fails.

Second, provide the business data that the rollback operation needs. The more data you supply during rollback, the more robust the program will be, because the service receiving the “rollback” can run business checks, such as whether the accounts match and whether the amounts are consistent.

Since the structure and size of this intermediate state are not fixed, Brother Z recommends serializing the relevant data to JSON and storing it in a NoSQL store.
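The two points above can be sketched as follows. This is a minimal illustration in Python, not the author’s implementation: `run_with_rollback`, the step tuples, and the in-memory `rollback_store` dictionary (standing in for a NoSQL store) are all hypothetical names, and the context is assumed to be JSON-serializable.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a NoSQL store: request id -> JSON document.
rollback_store: Dict[str, str] = {}

# (step name, forward action, reverse interface)
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_with_rollback(request_id: str, context: dict, steps: List[Step]) -> bool:
    """Run steps in order; on failure, call the reverse interface of each completed step."""
    completed: List[Step] = []
    for step in steps:
        _name, action, _reverse = step
        try:
            action(context)
            completed.append(step)
            # Persist the intermediate state so the rollback can run business checks.
            rollback_store[request_id] = json.dumps(
                {"completed": [s[0] for s in completed], "context": context}
            )
        except Exception:
            saved = json.loads(rollback_store.get(request_id, '{"context": {}}'))
            # Roll back only what actually completed, newest first.
            for _n, _a, reverse in reversed(completed):
                reverse(saved["context"])
            return False
    return True


# Usage: the payment step fails, so the order step is rolled back.
log: List[str] = []

def create_order(ctx: dict) -> None:
    ctx["order_id"] = "o-1"
    log.append("order created")

def cancel_order(ctx: dict) -> None:
    log.append(f"order {ctx['order_id']} cancelled")

def pay(ctx: dict) -> None:
    raise RuntimeError("payment timeout")

def refund(ctx: dict) -> None:
    log.append("refunded")

ok = run_with_rollback(
    "req-1",
    {"amount": 100},
    [("order", create_order, cancel_order), ("payment", pay, refund)],
)
```

Because the saved context carries the order ID and amount, the reverse interface has what it needs to verify that it is undoing the right thing.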

“Implicit rollback” is used in relatively few scenarios. It means you do not need to do any extra processing for the rollback, because the downstream service has built-in mechanisms such as “pre-occupation” and “timeout expiry”. For example:

In an e-commerce scenario, the inventory service pre-occupies the goods in an order and waits for the user to pay within 15 minutes. If no payment arrives in that time, the inventory is released.

Now let’s talk about retry, which has more variations and more pitfalls.

“Retry”
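A minimal sketch of such a pre-occupy-and-expire mechanism is shown below. The `Inventory` class and its method names are hypothetical, state is kept in memory, and expired holds are released lazily on the next call rather than by a background timer.

```python
import time
from typing import Dict, Optional, Tuple


class Inventory:
    """Hypothetical in-memory inventory with pre-occupied (reserved) stock."""

    def __init__(self, stock: int, hold_seconds: float = 15 * 60) -> None:
        self.stock = stock
        self.hold_seconds = hold_seconds
        # order id -> (quantity, expiry time)
        self.holds: Dict[str, Tuple[int, float]] = {}

    def _release_expired(self, now: float) -> None:
        # The "implicit rollback": unpaid holds return to stock on their own.
        for order_id, (qty, expires_at) in list(self.holds.items()):
            if now >= expires_at:
                self.stock += qty
                del self.holds[order_id]

    def reserve(self, order_id: str, qty: int, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self._release_expired(now)
        if qty > self.stock:
            return False
        self.stock -= qty
        self.holds[order_id] = (qty, now + self.hold_seconds)
        return True

    def confirm_payment(self, order_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self._release_expired(now)
        # Payment only counts if the hold is still alive.
        return self.holds.pop(order_id, None) is not None
```

The upstream caller never invokes a reverse interface here; if the payment step fails or never happens, the hold simply times out.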

The biggest benefit of retry is that the business system does not need to provide a “reverse interface”, which is particularly beneficial for long-term development costs, since the business is changing every day. So, when possible, use of “retry” should be preferred.

However, retry applies to fewer scenarios than rollback, so the first step is to decide whether the current scenario is suitable for retry. For example:

  • If the downstream system returns a temporary status such as “request timeout” or “traffic restricted”, a retry is worth considering.
  • If a business error such as “insufficient balance” or “no permission” is returned, there is no point in retrying.
  • If middleware or an RPC framework returns HTTP 503, 404, or a similar status with no indication of when it will recover, there is no need to retry either.
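The three rules above can be captured in a small decision function. This is only a sketch: the `FailureKind` enum and `should_retry` are hypothetical names, and a real system would map its own error codes onto these categories.

```python
from enum import Enum


class FailureKind(Enum):
    TIMEOUT = "request timeout"
    RATE_LIMITED = "traffic restricted"
    BUSINESS_ERROR = "insufficient balance, no permission, ..."
    HTTP_STATUS = "503, 404, ... from middleware or an RPC framework"


# Following the text: only temporary states are worth retrying.
RETRYABLE = {FailureKind.TIMEOUT, FailureKind.RATE_LIMITED}


def should_retry(kind: FailureKind) -> bool:
    return kind in RETRYABLE
```

Keeping the decision in one place makes it easy to review which failures trigger a retry as the business changes.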

If we do decide to retry, we also need to select an appropriate retry policy. The mainstream retry policies are as follows.

Strategy 1. Retry immediately. Sometimes the failure is transient, caused by events such as a network packet collision or a traffic peak on a hardware component. In that case, retrying immediately is appropriate. However, there should be no more than one immediate retry; if it fails, switch to another strategy.

Strategy 2. Fixed interval. The application waits the same amount of time between attempts, for example retrying every 3 seconds. (The specific numbers in all of the sample code below are for reference only.)

Strategies 1 and 2 are mostly used for interactive operations in front-end systems.

Strategy 3. Incremental interval. The retry interval grows by a fixed amount each time: for example, 0 seconds before the first retry, 3 seconds before the second, 6 seconds before the third, then 9, 12, 15, and so on.

    return (retryCount - 1) * incrementInterval;

This way, requests that have failed more often are given lower priority, making way for newer retry requests.

Strategy 4. Exponential interval. The retry interval grows exponentially each time. As with the incremental interval, the goal is to give lower priority to requests that have failed more often, only with a steeper increase.

    return Math.Pow(2, retryCount); // 2 to the power of retryCount ("^" would be XOR in C-family languages)

Strategy 5. Full jitter. Add randomness on top of the growing interval (the exponential growth can be swapped for incremental growth). This suits scenarios where a large number of retry requests would otherwise arrive at the same moment and the load needs to be spread out.

    return random(0, Math.Pow(2, retryCount));

Strategy 6. Jitter. A middle ground between “exponential interval” and “full jitter” that reduces the influence of randomness. The application scenario is the same as for full jitter.

    var baseNum = Math.Pow(2, retryCount);
    return baseNum + random(0, baseNum);

▲ Retry intervals for strategies 3, 4, 5, and 6 (x-axis: number of retries)
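The six strategies can also be written as small delay functions, which makes their growth easy to compare. The sketch below is in Python, assumes `retry_count` starts at 1, and uses the same illustrative numbers as the text; the function names are made up for this example.

```python
import random


def immediate(retry_count: int) -> float:
    return 0.0


def fixed(retry_count: int, interval: float = 3.0) -> float:
    return interval


def incremental(retry_count: int, increment: float = 3.0) -> float:
    return (retry_count - 1) * increment


def exponential(retry_count: int) -> float:
    return float(2 ** retry_count)


def full_jitter(retry_count: int) -> float:
    # Anywhere between 0 and the exponential interval.
    return random.uniform(0, 2 ** retry_count)


def jitter(retry_count: int) -> float:
    # The exponential interval plus a random share of it.
    base = 2 ** retry_count
    return base + random.uniform(0, base)


for n in range(1, 6):
    print(n, incremental(n), exponential(n), round(full_jitter(n), 2), round(jitter(n), 2))
```

Running the loop a few times shows the difference: the incremental and exponential columns stay fixed, while the two jitter columns vary from run to run within their bounds.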

Why is “retry” full of pitfalls?

As mentioned earlier, to keep development costs down, “retries” often reuse the same interface as regular calls. That raises the question of idempotence.

If the technical solution used to implement retry cannot guarantee 100% that a request will never be sent more than once, idempotence must be considered. Even if it can, it is still worth designing for idempotence to guard against the unexpected.

**Idempotency:** An operation is idempotent if, no matter how many times it is called, the state of the program (all relevant data changes) is the same as after a single call.

This means that operations can be repeated or retried as needed without causing unexpected effects. For non-idempotent operations, the algorithm may have to keep track of whether the operation has been performed.

Therefore, once a function supports retry, the idempotence of every interface along the call chain must be taken into account. Business data must not be increased or decreased just because a service was called more than once.

Satisfying idempotence comes down to identifying repeated requests and filtering them out. The idea is:

  1. Define a unique identity for each request.
  2. Determine whether the request has already been executed or is being executed during retry, and discard the request if so.

**Point 1:** We can use a globally unique ID generator or an ID generation service. Or, more simply, the GUID or UUID provided by the standard library will do.

Then, in the RPC framework on the calling client, assign a value to a unique-identifier field that is added to every request.

**Point 2:** We can use AOP to add checks before and after the server enters the actual processing logic.

The general code idea is as follows.

    if (isExistLog(requestId)) { // 1. Determine whether the request has already been received.
        var lastResult = getLastResult(); // 2. Determine whether the previous request has been processed.
        if (lastResult == null) {
            var result = waitResult(); // Suspend and wait for processing to complete.
            return result;
        } else {
            return lastResult;
        }
    } else {
        log(requestId); // 3. Record that the request has been received.
    }

    // do something...
    logResult(requestId, result); // 4. Update the result.

If the “compensation” work is done through MQ, this can be handled directly in the SDK that wraps the MQ client: assign a globally unique identifier on the producer side and deduplicate by that identifier on the consumer side.
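A minimal sketch of that producer/consumer pairing follows. The `produce` function and `DedupConsumer` class are hypothetical and not tied to any particular MQ SDK; the set of seen IDs lives in memory here, whereas a real deployment would need shared, persistent storage and protection against concurrent delivery.

```python
import uuid
from typing import Callable, Set


def produce(payload: dict) -> dict:
    """Producer side: wrap the payload with a globally unique message id."""
    return {"message_id": str(uuid.uuid4()), "payload": payload}


class DedupConsumer:
    """Consumer side: drop messages whose id has already been handled."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.seen: Set[str] = set()  # in-memory only; an assumption for this sketch

    def consume(self, message: dict) -> bool:
        if message["message_id"] in self.seen:
            return False  # duplicate delivery, ignore it
        self.handler(message["payload"])
        # Mark as seen only after the handler succeeds, so a failed
        # message can still be redelivered and retried.
        self.seen.add(message["message_id"])
        return True
```

With this in place, the same message can be delivered any number of times and the handler still runs once.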

III. Best practices for retry

Here are some of Brother Z’s own best practices (take note). They are all about “retry”, which is indeed the most commonly used approach at work.

Retry is a prime candidate for being degraded under high load, and it should also be subject to rate limiting and circuit breaking. The spear of retry works best when paired with the shield of rate limiting and circuit breaking.

Weigh the cost of adding a compensation mechanism against its benefit. For less important problems, “fail fast” instead of “retrying.”
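As one way to picture the spear and the shield together, the sketch below gates every retry attempt behind a very small circuit breaker. It is an illustration under stated assumptions: `CircuitBreaker` and `call_with_retry` are hypothetical names, the thresholds are arbitrary, and the waiting between attempts is left out to keep the example short.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures and lets a probe through after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_seconds: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_seconds

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now


def call_with_retry(
    call: Callable[[], T],
    breaker: CircuitBreaker,
    max_attempts: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        if not breaker.allow(clock()):
            break  # circuit is open: fail fast instead of piling on
        try:
            result = call()
        except Exception as exc:
            breaker.record(False, clock())
            last_error = exc
        else:
            breaker.record(True, clock())
            return result
    raise RuntimeError("call failed or circuit open") from last_error
```

Once the breaker opens, the remaining attempts are skipped, so retries stop adding load to a downstream service that is already struggling.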

It is important to note that overly aggressive retry strategies, such as too short intervals or too many retries, can adversely affect downstream services.

Be sure to have a termination policy for retry.
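A termination policy usually combines a cap on the number of attempts with a cap on total elapsed time. The following sketch shows one possible shape; the `retry` function, its parameters, and the default limits are assumptions for illustration, and `delay_for` can be any interval strategy that maps an attempt number to a delay in seconds.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    call: Callable[[], T],
    delay_for: Callable[[int], float],
    max_attempts: int = 5,
    max_total_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `call` until it succeeds, attempts run out, or the time budget is spent."""
    started = clock()
    attempt = 0
    while True:
        attempt += 1
        try:
            return call()
        except Exception:
            delay = delay_for(attempt)
            out_of_attempts = attempt >= max_attempts
            out_of_time = clock() - started + delay > max_total_seconds
            if out_of_attempts or out_of_time:
                raise  # give up and let the caller roll back or raise an alert
            sleep(delay)
```

When either limit is hit, the last exception is re-raised so the caller can fall back to rollback or manual handling.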

Long intervals and a large number of retries are acceptable when the rollback process is difficult or costly, which is what DDD often calls the “saga” pattern. However, this holds only if no scarce resources are reserved or locked in a way that blocks other operations (for example, with serial operations 1, 2, 3, 4, and 5, where 3, 4, and 5 cannot continue because 2 has not finished).

IV. Summary

In this article, we first discussed why “compensation” matters, then the ideas behind its two implementations, “rollback” and “retry”.

Next, we saw that retries require attention to idempotence, and Brother Z offered a solution.

Finally, Brother Z shared a few of his best practices for retry.

I hope it helps.

Question:

Have you ever had to do compensation by hand? Feel free to vent in the comments ~

Brother Z himself has stayed up in the middle of the night many times to clean up the mess caused by “accidents”, which he will never forget.

Follow us to be the first to hear about new technology from Huawei Cloud ~