Preface

In the comments on the last article, “Is this it? Distributed ID Generator in Practice”, my friend Hui asked me to talk about distributed transactions. Since Hui asked, I have to deliver. Consider it arranged!

What Is a Distributed Transaction

Most readers should already know what a transaction is; if you don’t, see the previous article “Is this it? One Article to Help You Understand Spring Transactions”.

A distributed transaction is one in which the transaction participants, the servers supporting the transaction, the resource servers, and the transaction manager are located on different nodes of a distributed system.

In simple terms, one large operation consists of N small operations that run on different servers and belong to different applications. A distributed transaction must ensure that these small operations either all succeed or all fail. For example, suppose there is an order microservice and an inventory microservice: when an order is created, the inventory must be reduced accordingly, and a transaction must guarantee the integrity and consistency of both operations.
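To see where a plain local transaction falls short, consider a Spring-style sketch of the order service (all types and names below are hypothetical):

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical interfaces standing in for the two microservices' dependencies.
interface OrderRepository { void save(String orderId); }          // local database
interface InventoryClient { void reduceStock(String skuId, int count); } // remote RPC/HTTP call

@Service
public class OrderService {

    private final OrderRepository orderRepository;   // local database
    private final InventoryClient inventoryClient;   // another service, another database

    public OrderService(OrderRepository orderRepository, InventoryClient inventoryClient) {
        this.orderRepository = orderRepository;
        this.inventoryClient = inventoryClient;
    }

    // @Transactional only guards the LOCAL database. If reduceStock() succeeds
    // remotely and an exception is thrown afterwards, save() is rolled back but
    // the remote stock change is NOT -- that gap is exactly what distributed
    // transactions must close.
    @Transactional
    public void createOrder(String orderId, String skuId, int count) {
        orderRepository.save(orderId);             // local write, covered by the transaction
        inventoryClient.reduceStock(skuId, count); // remote write, NOT covered
    }
}
```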

Related Theory

The ACID properties and isolation levels were covered in the previous article and won’t be repeated here; instead, this article introduces two new pieces of theory: CAP and BASE.

CAP Theory

  • Consistency: after a write operation completes, any subsequent read must return the latest value written. Equivalently, all nodes in the distributed system must hold consistent data at all times.
  • Availability: reads and writes always succeed. Simply put, the client can always reach the system and get a normal response; from the user’s point of view, operations never fail or time out.
  • Partition Tolerance: when a node or network partition fails, the system as a whole can still provide service that satisfies consistency and availability; a partial failure does not affect overall use. In practice, distributed systems are designed with bugs, hardware faults, and network failures in mind: even if some nodes or links fail, the system as a whole must keep working.

CAP is a proven theorem: a distributed system can satisfy at most two of these three properties at the same time, and partition tolerance is mandatory in a distributed system. Therefore CP and AP are the common combinations.

  • CP: give up availability, focus on consistency and partition tolerance. This is strong consistency, which may be required only in strongly consistent business scenarios such as inter-bank transfers. The choice always depends on the business scenario.
  • AP: give up strong consistency, focus on availability and partition tolerance. This is the choice of most distributed business scenarios today, as long as eventual consistency is guaranteed in the end (BASE theory).

BASE Theory

  • Basically Available: a distributed system is allowed to lose part of its availability when a failure occurs, i.e. the core functions are guaranteed to stay available. During an e-commerce promotion peak, some users may be redirected to a degraded page, and the service layer may only provide a degraded service, to cope with the surge in traffic. This is a partial loss of availability.

  • Soft State: the system is allowed to have intermediate states that do not affect its overall availability. In distributed storage, a piece of data generally has at least three replicas, and the soft state shows up as the replication lag between nodes. MySQL Replication’s asynchronous replication is another example.

  • Eventual Consistency: all replicas of the data in the system reach a consistent state after some period of time. Weak consistency is the opposite of strong consistency, and eventual consistency is a special case of weak consistency.

Common Solutions

1. Two-Phase Commit (2PC)

Two-phase commit, 2PC for short, is a strongly consistent design. It introduces a transaction coordinator role to coordinate and manage the commit and rollback of each participant (also called a local resource).

The two phases are: phase one, the prepare (voting) phase; and phase two, the commit (execution) phase. A minimal coordinator sketch follows the phase list below.

  • Prepare phase: the coordinator sends a prepare request to all participants asking whether they can commit, then collects their votes.

  • Commit phase: the coordinator gathers the votes from all participants and sends the commit request if and only if every participant voted to commit; otherwise it sends a rollback request.
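To make the two phases concrete, here is a minimal, illustrative coordinator loop; the Participant interface is invented for the sketch, and a real implementation (JTA/XA, Seata) additionally handles timeouts, transaction logs, and crash recovery:

```java
import java.util.List;

// Invented interface standing in for the local resource managers.
interface Participant {
    boolean prepare();  // phase 1: vote yes/no (and lock resources)
    void commit();      // phase 2: make the change permanent
    void rollback();    // phase 2: undo the prepared work
}

class TwoPhaseCommitCoordinator {
    public boolean execute(List<Participant> participants) {
        // Phase 1 (prepare/voting): collect every participant's vote.
        boolean allAgreed = true;
        for (Participant p : participants) {
            if (!p.prepare()) { allAgreed = false; break; }
        }
        // Phase 2 (commit/rollback): commit only if ALL voted yes.
        for (Participant p : participants) {
            if (allAgreed) p.commit(); else p.rollback();
        }
        return allAgreed;
    }
}
```

Note that phase 1 holds locks on every participant until phase 2 completes, which is precisely the synchronous-blocking problem in the list that follows.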

Problems with 2PC:

  • Synchronous blocking: all participants block synchronously during the transaction. While one participant holds a shared resource, other nodes that need that resource are blocked.
  • Single point of failure: once the coordinator fails, the whole system becomes unavailable.
  • Data inconsistency: after the coordinator sends COMMIT, some participants receive it and commit successfully, while others never receive it and stay blocked; during this window the data is inconsistent.
  • Uncertainty: if the coordinator crashes right after sending COMMIT, and the only participant that received it crashes too, the newly elected coordinator cannot tell whether the transaction was committed.

The advantage of 2PC is that it is non-intrusive to the business and can rely on the database’s own mechanisms to commit and roll back transactions.

Common 2PC-based implementations include JTA (the XA specification) and Seata (AT mode).

2. Three-Phase Commit (3PC)

Three-phase commit, 3PC for short, is an improved version of 2PC. It introduces timeouts for both the coordinator and the participants, and adds a pre-commit phase between 2PC’s prepare and commit phases.

  • CanCommit phase: the coordinator asks each participant whether the transaction can be executed, but nothing is executed yet.
  • PreCommit phase: if the participants’ feedback is that the conditions are met, the coordinator sends a PreCommit request and the transaction begins to execute; if the feedback is that the conditions are not met, or a timeout occurs, a transaction-interrupt request is sent.
  • DoCommit phase: if the pre-commit phase sent a PreCommit request, the transaction is committed normally; if it sent an interrupt request, the transaction is aborted directly.

Compared to 2PC, 3PC mainly addresses the single point of failure and reduces blocking, because a participant that fails to hear from the coordinator in time commits by default after a timeout instead of holding transaction resources and blocking forever. However, this mechanism can itself cause inconsistency: if, due to network problems, the coordinator’s interrupt request doesn’t reach a participant in time, that participant commits after its timeout, while other participants that did receive the interrupt roll back. Moreover, 3PC’s overall interaction flow is longer, so performance drops.
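To make the timeout rule concrete, here is a purely illustrative participant state sketch; since 3PC has no mainstream implementation, everything here, names included, is invented for explanation:

```java
// States a 3PC participant moves through.
enum ParticipantState { INIT, READY, PRE_COMMITTED, COMMITTED, ABORTED }

class ThreePhaseParticipant {
    private ParticipantState state = ParticipantState.INIT;

    void onCanCommit() { state = ParticipantState.READY; }          // phase 1: only answer "can I?"
    void onPreCommit() { state = ParticipantState.PRE_COMMITTED; }  // phase 2: execute, keep undo log
    void onDoCommit()  { state = ParticipantState.COMMITTED; }      // phase 3: make it permanent
    void onAbort()     { state = ParticipantState.ABORTED; }

    // The timeout rule that reduces blocking -- and causes the inconsistency
    // described above: a pre-committed participant that misses the
    // coordinator's abort will still commit.
    void onCoordinatorTimeout() {
        if (state == ParticipantState.PRE_COMMITTED) onDoCommit();
        else onAbort();
    }
}
```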

At present 3PC seems to exist only in theory; there is no widely used concrete implementation.

3. TCC

Both 2PC and 3PC rely on the database’s transaction commit and rollback, but many business operations involve more than a database: they may also send SMS messages, MQ messages, and so on. TCC is a distributed transaction at the business or application level.

The TCC scheme is divided into three phases, Try-Confirm-Cancel, and is a compensating distributed transaction (a minimal interface sketch follows the list below).

  • Try phase: complete all business checks (consistency) and reserve the business resources (quasi-isolation).
  • Confirm phase: execute the business, using only the resources reserved in the Try phase.
  • Cancel phase: release the business resources reserved in the Try phase.
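Here is what the three phases might look like for the inventory side of the order example; the interface and method names are made up, though TCC frameworks such as Seata (TCC mode) follow the same Try/Confirm/Cancel shape:

```java
// Invented TCC action for the inventory service.
public interface InventoryTccAction {

    // Try: check stock and RESERVE it (e.g. move N units into a "frozen" column).
    boolean tryReserve(String txId, String skuId, int count);

    // Confirm: consume only what Try reserved (deduct from the frozen column).
    boolean confirmReserve(String txId);

    // Cancel: release what Try reserved (move frozen units back to available).
    boolean cancelReserve(String txId);
}
```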

If the Try phase succeeds, Confirm is executed; if the Try phase fails, Cancel is executed. If Confirm fails, the only option is a retry mechanism that retries the failed Confirm until it succeeds; if Cancel also keeps failing, manual intervention is the only way out.

TCC requires designing the corresponding Try/Confirm/Cancel operations for each scenario and piece of business logic, so it greatly increases the complexity of the business code and is highly intrusive to the business.

Although it is intrusive to the business, TCC does not block resources: each phase commits its local transaction directly, and if something goes wrong it is compensated at the business level with Cancel, which is why it is also called a compensating transaction.

Some issues to note in TCC (a small guard sketch follows this list):

  • Idempotency: network calls cannot guarantee that a request arrives exactly once, so there is always a retry mechanism. Try, Confirm, and Cancel must therefore be implemented idempotently to avoid errors from repeated execution.

  • Empty rollback: if the Try request never arrives (for example, it times out due to a network problem), the transaction manager may still issue a Cancel command. Cancel must therefore be able to complete normally even though no Try was ever executed.

  • Suspension: again due to network congestion, the Try request may time out, the transaction manager issues Cancel, and Cancel executes; but then the delayed Try request finally arrives. So after an empty rollback we must record that fact, to prevent the late Try from executing afterwards.
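A sketch of one way to guard against these pitfalls: record each transaction’s phase and consult it before acting. In a real system this state lives in a database table updated in the same local transaction as the business work; a map stands in for that table here, and all names are invented:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TccStatusGuard {
    private final Map<String, String> phase = new ConcurrentHashMap<>(); // txId -> phase

    // Try runs only if nothing was recorded before: a duplicate Try is refused
    // (idempotency), and a Try arriving after Cancel is refused (suspension).
    public boolean tryAllowed(String txId) {
        return phase.putIfAbsent(txId, "TRIED") == null;
    }

    // Cancel: if Try never ran, record an EMPTY_ROLLBACK marker and skip the
    // business undo (empty rollback); tryAllowed() will later see the marker.
    public boolean cancelNeedsWork(String txId) {
        String prev = phase.putIfAbsent(txId, "EMPTY_ROLLBACK");
        // prev == "TRIED" -> Try really ran, release its reservation
        // prev == null    -> empty rollback, nothing to undo
        return "TRIED".equals(prev);
    }

    // Call after the real cancel work so a retried Cancel becomes a no-op.
    public void markCancelled(String txId) {
        phase.put(txId, "CANCELLED");
    }
}
```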

4. Local Message Table

The local message table solution was proposed by eBay. It uses each system’s local transactions to implement a distributed transaction: a message table is stored in the database, and when performing a business operation, the business write and the insert into the message table are placed in the same local transaction.

After the local transaction commits, the corresponding downstream service is invoked, and on success the message’s state in the table is updated to success. If the call fails, a scheduled task picks up the unfinished messages from the message table, invokes the corresponding service again, and updates the state once it succeeds.

A retry mechanism is also needed here, and if a message truly can never succeed, manual intervention is required. Note that the invoked service’s methods must also be idempotent.
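A minimal Spring-flavored sketch of the pattern, with every name (mappers, client, statuses) invented for illustration; the essential point is that the order row and the message row share one local transaction, and a scheduled task redelivers until success:

```java
import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Invented names standing in for the real persistence layer and remote client.
record TxMessage(long id, String bizId, String status) {}
interface OrderMapper { void insertOrder(String orderId); }
interface MessageMapper {
    void insertMessage(String bizId, String status);
    List<TxMessage> findByStatus(String status);
    void updateStatus(long id, String status);
}
interface InventoryClient { void reduceStock(String bizId); } // must be idempotent

@Service
public class LocalMessageTableDemo {

    private final OrderMapper orders;
    private final MessageMapper messages;
    private final InventoryClient inventory;

    public LocalMessageTableDemo(OrderMapper o, MessageMapper m, InventoryClient i) {
        this.orders = o; this.messages = m; this.inventory = i;
    }

    // Key point: the business row and the message row are written in ONE
    // local transaction, so they succeed or fail together.
    @Transactional
    public void createOrder(String orderId) {
        orders.insertOrder(orderId);
        messages.insertMessage(orderId, "PENDING");
    }

    // Scheduled redelivery: keep retrying PENDING messages until they succeed;
    // after too many failures a real system would alert a human.
    @Scheduled(fixedDelay = 5000)
    public void redeliverPending() {
        for (TxMessage msg : messages.findByStatus("PENDING")) {
            try {
                inventory.reduceStock(msg.bizId());
                messages.updateStatus(msg.id(), "SUCCESS");
            } catch (Exception e) {
                // leave as PENDING; the next run retries
            }
        }
    }
}
```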

As you can see, the local message table is relatively simple to implement. It is a best-effort-notification idea that achieves eventual consistency and tolerates temporarily inconsistent data.

The downside is a heavy reliance on databases.

5. Reliable Message Eventual Consistency

In the local message table scheme above, the producer has to create an extra message table and poll it, which is a burden on the business. Alibaba’s RocketMQ 4.3 and later officially supports transactional messages, which essentially move the local message table into RocketMQ itself, solving the atomicity of message sending and local transaction execution on the producer side.

Service A first sends a Half Message to the Broker. The half message has been delivered to the Broker, but it is marked as undeliverable and is invisible to consumers.

After sending the half message, service A performs its business operation (the local transaction) and, depending on the result: if it succeeds, A sends a Commit to the Broker and the half message becomes consumable; if it fails, A sends a Rollback and the message is deleted.

If the message is committed, service B receives it, performs the corresponding operation, and then acknowledges consumption once it is done.

If RocketMQ never receives a status confirmation from service A for a half message, it automatically polls back to a check interface, asking how the transaction was handled. Service A therefore implements this check-back interface and answers Commit or Rollback according to the actual result of its local transaction.
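Here is a minimal sketch against the RocketMQ 4.3+ Java client (TransactionMQProducer and TransactionListener are the real API; the topic, group name, and local-transaction logic are placeholders):

```java
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class TxMessageProducer {

    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("order_tx_group");
        producer.setNamesrvAddr("127.0.0.1:9876");

        producer.setTransactionListener(new TransactionListener() {
            // Runs after the half message is stored: execute the local transaction.
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // createOrderLocally(msg); -- hypothetical local DB work
                    return LocalTransactionState.COMMIT_MESSAGE;   // half message becomes visible
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // half message is discarded
                }
            }

            // The check-back described above: the Broker polls this when it never
            // got a Commit/Rollback, and we answer from the real local state.
            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                boolean orderExists = true; // look up the local transaction result here
                return orderExists ? LocalTransactionState.COMMIT_MESSAGE
                                   : LocalTransactionState.ROLLBACK_MESSAGE;
            }
        });

        producer.start();
        Message half = new Message("ORDER_TOPIC", "order-created:1001".getBytes());
        producer.sendMessageInTransaction(half, null); // step 1: send the half message
        producer.shutdown();
    }
}
```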

Service B may fail during execution, in which case its consumption is retried. If it keeps failing, manual intervention is required, and service B’s method must also be idempotent.

6. Best-Effort Notification

Best-effort notification is the simplest kind of flexible transaction. It suits services with low sensitivity to how quickly eventual consistency is reached, where the passive party’s processing result does not affect the active party’s result. Typical scenarios: bank notifications, merchant notifications, and so on.

With the local message table, a background task periodically checks for unfinished messages and calls the corresponding service; when a message fails too many times, it is recorded for manual intervention or simply abandoned. That is “best effort”.

The same is true for transactional messages: a message only truly takes effect once it is committed, and if the subscriber fails to consume it, delivery is retried until the message enters the dead-letter queue. That, too, is best effort.

In a best-effort notification, the sender tries its best to notify the receiver of the business processing result, but the receiver may still never receive the message.

Conclusion

In fact, there are many distributed transaction solutions, but each still has plenty of problems, requires manual handling in extreme cases, and greatly increases process complexity and extra overhead.

So remember: in real development, avoid distributed transactions whenever you can!

I’ll bring you some hands-on distributed transaction practice in a later article. If you haven’t followed me yet, follow so you don’t miss it.

END

Recommended Reading

Is this it? Distributed ID Generator in Practice

Some Design Pattern Knowledge: Factory Patterns

Is this it? Spring Transaction Failure Scenarios and Solutions

Is this it? One Article to Help You Understand Spring Transactions

SpringBoot + Redis Implements Message Publish/Subscribe