1. Basic concepts

1.1 What are transactions

A transaction can be thought of as a large activity that consists of different smaller activities that either all succeed or all fail.

1.2 Local Transactions

In computer systems, transactions are most often controlled through a relational database. This kind of transaction relies on features of the database itself and is therefore called a database transaction. Because the application depends mainly on the relational database to control the transaction, and the database and the application normally run on the same server, a transaction based on a relational database is also known as a local transaction.

Four characteristics of database transactions: ACID

A (Atomicity): atomicity. All operations that make up a transaction either all complete or none of them execute; partial success combined with partial failure is impossible.

C (Consistency): consistency. Before and after a transaction executes, the consistency constraints of the database are not violated. For example, if Zhang San transfers 100 yuan to Li Si, the data is in a correct state both before and after the transfer; that is consistency. If Zhang San's account is debited 100 yuan but Li Si's account does not increase by 100 yuan, the data is wrong and consistency is not achieved.

I (Isolation): isolation. Transactions in a database generally run concurrently. Isolation means that two concurrently executing transactions do not interfere with each other, and one transaction cannot see the intermediate states of another running transaction. Problems such as dirty reads and non-repeatable reads can be avoided by configuring the transaction isolation level.

D (Durability): durability. Once a transaction completes, the changes it made to the data are persisted to the database and are not rolled back.

A database transaction packs all of its operations into an indivisible execution unit: the operations in that unit either all succeed or all fail, and if any single operation fails, the whole transaction is rolled back.
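As a minimal illustration of atomicity and rollback, the following hypothetical sketch uses Python's built-in sqlite3 module (the account names and amounts are invented for the example): both updates of a transfer commit together, and any failure rolls the whole unit back.

```python
import sqlite3

# Hypothetical accounts table; names and amounts are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('zhangsan', 100), ('lisi', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Both updates commit together, or the whole unit rolls back."""
    try:
        balance = conn.execute(
            "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()[0]
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()        # atomic: both updates become durable together
    except Exception:
        conn.rollback()      # atomic: neither update survives
        raise

transfer(conn, "zhangsan", "lisi", 30)        # succeeds: both updates commit
try:
    transfer(conn, "zhangsan", "lisi", 1000)  # fails: nothing partial remains
except ValueError:
    pass
```

After the failed second call, both balances are exactly what the first transfer left them: the failed unit changed nothing.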

1.3 Distributed Transactions

With the rapid development of the Internet, software systems have evolved from monolithic applications into distributed applications. The following figure describes the evolution from a monolithic application to microservices:

A distributed system splits one application into multiple services that can be developed independently, so services must collaborate remotely over the network to complete a transaction. A transaction completed in such an environment through remote collaboration between different services over the network is called a distributed transaction. Examples include awarding points when a user registers, decreasing inventory when an order is created, and bank transfers.

We know that local transactions rely on transaction features provided by the database itself, so the following logic can control local transactions:

begin transaction;
    -- execute business SQL
commit transaction;

But in a distributed environment, it looks like this:


begin transaction;
    -- 1. local database operation: decrease the amount
    -- 2. remote call: increase the amount
commit transaction;

Suppose the remote call to increase the amount actually succeeded, but its response never returned because of a network problem. The local transaction then fails to commit, and the operation that decreased the amount is rolled back, leaving the data of the two accounts inconsistent.

Therefore, in a distributed architecture, traditional database transactions are no longer sufficient: the two accounts are not in the same database, or even in the same application system, and the transfer must be implemented through remote calls, so network problems give rise to distributed transaction problems.
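The inconsistency described above can be illustrated with a toy simulation (the classes and the lost-response failure are hypothetical stand-ins, not real services): the remote credit succeeds but its response is lost, the caller rolls back its local debit, and the two sides end up disagreeing.

```python
# Toy stand-in for an account held by one service.
class AccountService:
    def __init__(self, balance):
        self.balance = balance

def transfer_with_lost_response(src, dst, amount):
    src.balance -= amount                 # local operation inside the local transaction
    dst.balance += amount                 # remote call: actually executed on the other side...
    raise TimeoutError("response lost")   # ...but the response never arrives

src, dst = AccountService(100), AccountService(0)
try:
    transfer_with_lost_response(src, dst, 30)
except TimeoutError:
    src.balance += 30                     # local transaction rolls back the debit
# dst was credited but src was restored: the two accounts now disagree by 30
```

The local rollback only undoes the local half of the work, which is exactly why a plain database transaction cannot protect a remote call.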

1.4 Scenario of distributed transaction generation

  1. Distributed transactions occur across JVM processes

    The typical scenario is microservices architecture: transactions are performed between microservices through remote calls. For example: order microservice and inventory microservice, order microservice requests inventory microservice to reduce inventory at the same time of placing an order.

  2. Distributed transactions across database instances

    Distributed transactions occur when a single system needs to access multiple databases (instances). For example, user information and order information are stored in two MySQL instances; when the user management system deletes a user, it must delete both the user information and the user's order information. Since the data is distributed across different database instances, it must be operated on through different database connections, which produces a distributed transaction.

  3. Multiple services access the same database instance

    The order microservice and the inventory microservice can produce a distributed transaction even when they access the same database, because the two microservices run in different JVM processes and hold different database connections to perform their database operations.

2. Basic theory of distributed transactions

From the previous section we learned the basic concepts of distributed transactions. Unlike local transactions, distributed systems are called distributed because the nodes that provide services are spread across different machines and interact through a network, and the whole system must not stop providing service because of a few network problems. Network factors thus become one of the criteria that distributed transactions must take into account, so distributed transactions need further theoretical support. Next we will first study the CAP theory of distributed transactions.

2.1 CAP theory

2.1.1 understanding CAP

CAP is the abbreviation of Consistency, Availability, and Partition tolerance.

In order to facilitate the understanding of CAP theory, we combine some business scenarios in e-commerce system to understand CAP.

The following figure shows the execution process of commodity information management:

The overall execution process is as follows

  1. The goods service requests the master database to write the goods information (add goods, modify goods, delete goods)

  2. The master database responds to the commodity service after the write succeeds.

  3. The commodity service requests the commodity information from the slave database.

C – Consistency

Consistency means that the read operation after the write operation can read the latest data status. If the data is distributed on multiple nodes, the data read from any node is the latest data status.

In the figure above, the consistency of reading and writing of commodity information is to achieve the following objectives:

  1. If writing to the master database succeeds, querying new data from the slave database succeeds.

  2. If writing the commodity information to the master database fails, querying new data from the slave database also fails.

How to achieve consistency?

  1. After writing to the master database, data is synchronized to the slave database.

  2. After data is written to the master database, the slave database must be locked while the data is being synchronized to it, and the lock is released only after synchronization completes. Otherwise, old data could be queried from the slave database after new data has been written to the master.
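The locking rule above can be sketched with hypothetical Master/Slave classes (a deliberately simplified, single-threaded stand-in for real replication): reads against the slave fail while synchronization is in progress, so stale data is never returned.

```python
# Hypothetical slave node: refuses reads while it is being synchronized.
class Slave:
    def __init__(self):
        self.data = None
        self.locked = False

    def read(self):
        if self.locked:
            raise RuntimeError("syncing: retry later")  # never serve stale data
        return self.data

# Hypothetical master node: locks the slave for the duration of the sync.
class Master:
    def __init__(self, slave):
        self.data = None
        self.slave = slave

    def write(self, value):
        self.data = value
        self.slave.locked = True    # lock before synchronization starts
        self.slave.data = value     # synchronize the new data
        self.slave.locked = False   # release the lock after sync completes

slave = Slave()
master = Master(slave)
master.write({"id": 1, "name": "phone"})
```

This is the consistency trade-off in miniature: while `locked` is true the slave is unavailable, which is exactly the tension between C and A discussed below in CAP terms.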

Characteristics of distributed system consistency:

  1. Due to the process of data synchronization, the response of write operations is delayed.

  2. To ensure data consistency, resources are temporarily locked and released after data synchronization is complete.

  3. If a node fails to synchronize data, it will return an error message and must not return old data.

A – Availability

Availability means that any transaction operation can get response results without response timeouts or response errors.

In the figure above, reading commodity information to meet availability is to achieve the following objectives:

  1. The data query request received from the database can immediately respond to the data query result.

  2. Response timeouts or errors are not allowed from the slave database.

How to achieve availability?

  1. After writing to the master database, data is synchronized to the slave database.

  2. To ensure the availability of the slave database, resources in the slave database cannot be locked.

  3. Even if the data has not yet been synchronized, the slave database must return data for the query, even if it is old data. If there is no old data, it may return default information according to convention, but it must not return an error or time out.

Characteristics of distributed system availability: All requests have responses, and no response timeouts or errors occur

P – Partition tolerance

Usually, the nodes of a distributed system are deployed in different subnets, which forms network partitions. It is inevitable that communication between nodes will fail because of network problems, but the nodes can still provide service to the outside world; this is called partition tolerance.

In the figure above, reading and writing of commodity information to meet partition tolerance is to achieve the following goals:

  1. A failure to synchronize data from the master database to the slave database does not affect read and write operations.

  2. The failure of one node does not affect the external services provided by the other node.

How to achieve partition tolerance?

  1. Use asynchronous instead of synchronous operations, for example synchronizing data from the master database to the slave database asynchronously, so that the nodes are loosely coupled.

  2. Add more slave database nodes, so that if one slave node goes down, the other slave nodes can still provide service.

Characteristics of distributed partition tolerance: Partition tolerance is a basic capability of distributed systems

2.1.2 CAP combination mode

Does the product management example above also have CAP?

No distributed transaction scenario can have all three CAP properties at the same time, because on the premise that P holds, C and A cannot coexist.

For example, if P is satisfied in the following figure, partition tolerance is implemented:

The meanings of partition tolerance in this figure are:

  1. The primary database synchronizes data to the secondary database over the network. The primary and secondary databases are deployed in different partitions and interact with each other over the network.

  2. The network failure between the primary and secondary databases does not affect the external services provided by the primary and secondary databases.

  3. The failure of one node does not affect the external services of the other node.

To implement C and ensure data consistency, the slave database's data must be locked while it is being synchronized, to prevent inconsistent data from being queried, and the lock is released after synchronization completes. If synchronization fails, the slave database must return an error or time out.

To implement A, data availability must be guaranteed: data can be queried from the slave database at any time, and no response times out or returns an error. The analysis shows that, on the premise that P is satisfied, C and A contradict each other.

The way CAP is combined

Therefore, when dealing with distributed transactions in production, it is necessary to determine which two aspects of CAP are satisfied according to the requirements.

  1. AP

    Abandon consistency in favor of partition tolerance and availability. This is the design choice of many distributed systems.

    For example, the commodity management case above can implement AP, provided that users can accept that the data they query may not be the latest for a certain period of time.

    AP systems generally guarantee eventual consistency, and BASE theory is an extension of AP. Some business scenarios fit this model. For example, with an order refund, the refund succeeds today and the money arrives tomorrow, as long as users can accept that the money arrives within a certain period.

  2. CP

    Give up availability in favor of consistency and partition tolerance. ZooKeeper is an example of this design: it actually pursues strong consistency. Another example is an inter-bank transfer: a transfer request is not complete until both banking systems have finished the entire transaction.

  3. CA

    Give up partition tolerance in favor of consistency and availability, that is, do not partition at all and do not consider problems caused by network or node failures. Such a system is not a standard distributed system; the most commonly used relational databases satisfy CA.

    The above commodity management, if you want to implement CA architecture is as follows:

There is no longer any data synchronization between a master and a slave database; the database can respond to every query request, and through the transaction isolation level every query request can return the latest data.

2.1.3 summary

CAP is a proven theory: a distributed system can satisfy at most two of Consistency, Availability, and Partition tolerance at the same time. It can serve as a standard for architecture design and technology selection. In most large-scale Internet application scenarios there are many nodes, deployment is scattered, and clusters keep growing, so node failures and network failures are normal. Service availability must reach N nines (99.99…%) and response performance must remain good to preserve the user experience, so the following choice is generally made: guarantee P and A, and give up strong consistency (C) in favor of eventual consistency.

2.2 BASE theory

  1. Strong consistency and eventual consistency

    CAP theory tells us that a distributed system can satisfy at most two of Consistency, Availability, and Partition tolerance at the same time. AP is the more common choice in practice: it gives up consistency while guaranteeing availability and partition tolerance. In real production, however, many scenarios still have to achieve consistency. For example, in the earlier case the master database synchronizes data to the slave database; even without strong consistency, the data must eventually be synchronized successfully so that it ends up consistent. This differs from consistency in CAP: CAP consistency requires that a query on any node return the same, latest data at any moment, which is strong consistency, whereas eventual consistency allows the nodes' data to differ for a period of time, as long as all nodes agree after that period. It emphasizes that the final data is consistent.

  2. Introduction to BASE theory

    BASE is an acronym for Basically Available, Soft state, and Eventually consistent. BASE theory is an extension of AP in CAP, which achieves availability by sacrificing strong consistency. When failure occurs, part of the data is allowed to be unavailable but core functions are ensured to be available. Data is allowed to be inconsistent for a period of time, but eventually reaches a consistent state. The transactions satisfying BASE theory are called “flexible transactions”.

  • Basic availability: when a distributed system fails, it is allowed to lose some functionality in order to keep the core functions available. For example, if an e-commerce website has problems with transaction payment, goods can still be browsed normally.

  • Soft state: since strong consistency is not required, BASE allows intermediate states (also called soft states) to exist in the system. Such a state does not affect system availability, for example "payment in progress" or "data synchronization in progress"; after the data becomes consistent, the state changes to "success".

  • Eventual consistency: data on all nodes becomes consistent after some period of time. For example, an order's "payment in progress" status will eventually change to "payment success" or "payment failure", so that the order status matches the actual transaction result, though this requires some delay and waiting.

3. Distributed transaction solution 2PC

We have now learned the basic theory of distributed transactions. Building on that theory, common industry solutions for different distributed scenarios include 2PC, 3PC, TCC, reliable-message eventual consistency, and best-effort notification.

3.1 What is 2PC

2PC is a two-phase commit protocol, which divides the entire transaction process into two phases: Prepare Phase and Commit phase. 2 refers to two phases, P refers to the preparation phase, and C refers to the commit phase.

Example: Zhang San and Li Si have not seen each other for a long time, and the old friends arrange a dinner party. Both complain that times are hard and money is tight, so neither is willing to treat, and they agree to split the bill. The boss will only prepare the meal once both Zhang San and Li Si have paid, but since both of them are stubborn, an awkward situation arises:

Preparation phase: the boss asks Zhang San to pay, and Zhang San pays; the boss asks Li Si to pay, and Li Si pays.

Commit phase: the boss hands over the meal tickets, and the two take their tickets and sit down to eat.

This example forms a transaction: if either Zhang San or Li Si refuses to pay, or does not have enough money, the boss will not hand over the tickets and will return the money already received.

The whole transaction process consists of a transaction manager and participants: the boss is the transaction manager, and Zhang San and Li Si are the transaction participants. The transaction manager is responsible for deciding the commit and rollback of the entire distributed transaction, while each participant is responsible for the commit and rollback of its own local transaction.

In the computer, some relational databases such as Oracle and MySQL support the two-phase commit protocol, as shown in the following figure:

  1. Prepare Phase: The transaction manager sends a Prepare message to each participant. Each participant performs the transaction locally and writes a local Undo/Redo log. The transaction is not committed. (Undo log records data before modification and is used for database rollback. Redo log records data after modification and is used for writing data files after transaction commit.)

  2. Commit phase: if the transaction manager receives an execution-failure or timeout message from any participant, it sends a Rollback message to every participant; otherwise it sends a Commit message. The participants then perform the commit or rollback as instructed and release the lock resources used during transaction processing. Note: lock resources are not released until this final phase.
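The two phases above can be sketched as a toy coordinator (hypothetical in-memory participants, not a real database protocol): phase one sends prepare to every participant, and phase two commits only if all of them voted yes, otherwise it tells everyone to roll back.

```python
# Hypothetical participant: prepares locally (writing undo/redo logs in a
# real database) but does not commit until told to in phase two.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "prepared"    # executed locally, NOT committed yet
        return self.can_commit     # vote yes/no

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(participants):
    # Phase 1: send Prepare to every participant and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: Commit only if everyone voted yes; otherwise Rollback all.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled_back"
```

A single no vote (or, in a real system, a timeout) in phase one forces every participant to roll back, which is the all-or-nothing guarantee 2PC provides.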

The following figure shows the two phases of 2PC, with success and failure:

Success scenario:

Failure scenario:

3.2 Solutions

3.2.1 XA scheme

The traditional 2PC scheme is implemented at the database level; for example, Oracle and MySQL both support the 2PC protocol. To unify the standard and reduce unnecessary integration costs across the industry, a standardized processing model and interface standard were needed, so The Open Group defined the Distributed Transaction Processing reference model (DTP).

To make the XA scheme clearer, the following uses registering a new user and awarding points as an example:

The execution process is as follows:

  1. The application program (AP) holds two data sources: the user database and the points database.

  2. The application program (AP) notifies, through TM, the user database's RM to add a user and the points database's RM to add points for that user. At this point the RMs have not committed their transactions, and the user and points resources are locked.

  3. After TM receives the execution replies, if either party failed, TM initiates a rollback to all the RMs; after the rollback, the resource locks are released.

  4. After TM receives the execution replies, if both parties succeeded, TM initiates a commit to all the RMs; after the commit, the resource locks are released.

The DTP model defines the following roles:

  • Application Program (AP) : an Application Program that uses DTP distributed transactions.

  • Resource Manager (RM): a participant in the transaction, generally a database instance; the resource manager controls branch transactions.

  • Transaction Manager (TM): the transaction manager, which coordinates and manages transactions. It controls the global transaction, manages the transaction life cycle, and coordinates the RMs. A global transaction is a unit of work in a distributed transaction processing environment that must operate on multiple databases to complete.

  • The DTP model defines an interface specification named XA for communication between TM and RM. It can be understood simply as the 2PC interface protocol provided by the database; implementing 2PC based on the database's XA protocol is also called the XA scheme.

The interaction between the three roles is as follows:

  1. TM provides an application programming interface to the AP through which the AP commits and rolls back transactions.

  2. TM (transaction middleware) notifies RM of events such as database transaction start, end, commit, and rollback through the XA interface.

Summary

The whole 2PC transaction process involves three roles AP, RM and TM. AP refers to applications that use 2PC distributed transactions; RM stands for resource manager, which controls branch transactions; TM refers to the transaction manager, which controls the entire global transaction.

(1) In the preparation phase, RM performs actual service operations, but does not submit transactions, and resources are locked

(2) In the commit phase, TM receives the execution replies from the RMs in the preparation phase. If any RM failed to execute, TM notifies all RMs to roll back; otherwise TM notifies all RMs to commit the transaction. When the commit phase ends, the resource locks are released.

Problems with XA schemes

  1. Local databases are required to support XA.

  2. Resource locks are not released until the end of the second phase, resulting in poor performance.

3.2.2 Seata scheme

Seata is an open source distributed transaction framework. It started as Fescar, an open source project initiated by Alibaba's middleware team, and was later renamed Seata.

Seata solves the problems of traditional 2PC by coordinating the branch transactions of local relational databases to drive the completion of a global transaction. Seata is middleware that works at the application layer. Its main advantages are good performance and not holding connection resources for a long time; it solves distributed transaction problems in microservice scenarios efficiently and with zero intrusion into business code. At present it provides distributed transaction solutions in AT mode (2PC) and TCC mode.

The design philosophy of Seata is as follows:

One of Seata’s design goals is to be non-intrusive, so it starts with a non-intrusive 2PC solution, evolves on the basis of the traditional 2PC solution, and solves the problems faced by the 2PC solution.

Seata understands a distributed transaction as a global transaction that contains several branch transactions. The responsibility of a global transaction is to coordinate the branch transactions under its jurisdiction to agree on either a successful commit together or a failed rollback together. In addition, the branch transaction itself is usually a local transaction of a relational database. The following diagram shows the relationship between global transactions and branch transactions:

Similar to the traditional 2PC model, Seata defines three components to coordinate the processing of distributed transactions:

  • Transaction Coordinator (TC): the transaction coordinator, an independent middleware that is deployed and run separately. It maintains the running state of the global transaction, receives TM's instructions to start the commit or rollback of a global transaction, and communicates with the RMs to coordinate the commit or rollback of branch transactions.

  • Transaction Manager (TM): the transaction manager, embedded in the application, which is responsible for starting a global transaction and ultimately issuing a global commit or global rollback instruction to the TC.

  • Resource Manager (RM): controls branch transactions, is responsible for branch registration and status reporting, receives the transaction coordinator TC's instructions, and drives the commit and rollback of branch (local) transactions.

Here’s an example of Seata’s distributed transaction process:

The specific execution process is as follows:

  1. The TM of the user service applies to the TC for starting a global transaction. The global transaction is successfully created and a globally unique XID is generated.

  2. The RM of the user service registers a branch transaction with TC, which incorporates the new-user logic of the user service into the jurisdiction of the global transaction corresponding to the XID.

  3. The user service performs a branch transaction, inserting a record into the user table.

  4. Execution continues until the points service is called remotely (the XID is propagated in the context of the microservice call chain). The RM of the points service registers a branch transaction with TC, which incorporates the add-points logic into the jurisdiction of the global transaction corresponding to the XID.

  5. The points service executes its branch transaction, inserting a record into the points record table, and returns to the user service after execution.

  6. The user service branch transaction is complete.

  7. TM initiates a global commit or rollback resolution against the XID to TC.

  8. TC schedules all branch transactions under XID to complete commit or rollback requests.

Seata implements 2PC differently than traditional 2PC

At the architectural level: the RM of the traditional 2PC scheme actually sits in the database layer; the RM is essentially the database itself, implemented through the XA protocol. Seata's RM, by contrast, is deployed on the application side as a JAR package, acting as a middleware layer.

In terms of the two-phase commit: traditional 2PC holds locks on transactional resources until Phase 2 completes, regardless of whether the second-phase resolution is commit or rollback. Seata, instead, commits the local transaction in Phase 1, which avoids holding locks through Phase 2 and improves overall efficiency.

3.3 summary

This section explained two 2PC solutions: traditional 2PC (based on the database XA protocol) and Seata. Seata is recommended for implementing 2PC because it is zero-intrusion and solves the traditional 2PC problem of holding resource locks for a long time.

4. TCC for distributed transaction solutions

4.1 What are TCC transactions

TCC is an abbreviation of Try, Confirm, and Cancel. TCC requires each branch transaction to implement three operations: preprocessing (Try), confirmation (Confirm), and cancellation (Cancel). The Try operation performs the business check and reserves resources, Confirm performs the business confirmation, and Cancel implements a rollback operation opposite to Try. TM first initiates the Try operation of every branch transaction; if any branch's Try fails, TM initiates the Cancel operation of every branch transaction, and if all Try operations succeed, TM initiates the Confirm operation of every branch transaction. If a Confirm or Cancel operation fails, TM retries it.

Success of branch transaction:

Branch transaction failure:

TCC is divided into three phases:

  1. The Try stage is to complete the business check (consistency) and resource reservation (isolation). This stage is only a preliminary operation, which together with subsequent Confirm can really form a complete business logic.

  2. In the Confirm phase, a confirmation commit is made. In the Try phase, Confirm is executed after all branch transactions are successfully executed. In general, the Confirm stage is assumed to be error-free when TCC is used. That is, as long as Try is successful, Confirm is always successful. If the Confirm stage does fail, a retry mechanism or manual processing is introduced.

  3. In the Cancel phase, when a business execution error requires a rollback, the branch transaction is cancelled and the reserved resources are released. In general, when using TCC the Cancel phase is assumed to always succeed; if Cancel does fail, a retry mechanism or manual handling must be introduced.
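The three phases above can be sketched with a toy transaction manager (the Branch objects are hypothetical; a real TCC framework would also handle the retries and the exception cases discussed in the next section):

```python
# Hypothetical branch transaction exposing the three TCC operations.
class Branch:
    def __init__(self, name, try_ok=True):
        self.name, self.try_ok = name, try_ok
        self.state = "init"

    def try_(self):
        if not self.try_ok:
            raise RuntimeError(f"{self.name}: try failed")
        self.state = "tried"        # business checked, resources reserved

    def confirm(self):
        self.state = "confirmed"    # formally use the reserved resources

    def cancel(self):
        self.state = "cancelled"    # release the reserved resources

def run_tcc(branches):
    tried = []
    try:
        for b in branches:
            b.try_()                # TM first initiates every branch's Try
            tried.append(b)
    except RuntimeError:
        for b in tried:
            b.cancel()              # any Try failure: Cancel all that reserved
        return "cancelled"
    for b in branches:
        b.confirm()                 # all Try succeeded: Confirm all branches
    return "confirmed"
```

Note that only branches whose Try actually ran are cancelled here; the branch whose Try failed never reserved anything, which is exactly why the empty-rollback problem below matters in a networked setting.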

TM Transaction manager

The TM transaction manager can be implemented as an independent service, or the initiator of the global transaction can act as the TM. Separating TM into an independent, reusable component is done for reasons of system architecture and software reuse.

When TM starts a global transaction, it generates a global transaction record. The global transaction ID runs through the whole distributed transaction call chain and is used to record the transaction context and to track and record state. Since failed Confirm and Cancel operations must be retried, they need to be implemented idempotently, i.e. repeated executions produce the same result.

4.2 TCC Exception Handling

TCC involves three types of exception handling: empty rollback, idempotency, and suspension.

Empty rollback

The two-phase Cancel method is called without the TCC resource's Try method ever having been called. Cancel needs to recognize this as an empty rollback and return success directly.

The reason: when the service hosting a branch transaction goes down or the network fails, the call to the branch transaction is recorded as failed even though the Try phase was never actually executed. When the fault recovers, the distributed transaction rollback calls the two-phase Cancel method, producing an empty rollback.

The key to the solution is recognizing the empty rollback. The idea is simple: you need to know whether phase one executed; if it did, it is a normal rollback, and if not, it is an empty rollback. As mentioned earlier, TM generates a global transaction record when it starts a global transaction, and the global transaction ID runs through the entire distributed transaction call chain. So add a branch transaction record table containing the global transaction ID and the branch transaction ID, and insert a record in the first-phase Try method to mark that phase one executed. The Cancel interface reads that record: if the record exists, it rolls back normally; if it does not exist, it is an empty rollback.

Idempotency

As introduced earlier, to guarantee that the retry mechanism of TCC's two-phase commit does not cause data inconsistency, the two-phase Try, Confirm, and Cancel interfaces must all be idempotent, so that resources are not reserved or released repeatedly. Poor idempotency control can lead to serious problems such as data inconsistency.

The solution: add an execution status to the branch transaction record and check the status before each execution.

suspension

Suspension is when, for a distributed transaction, the two-stage Cancel interface executes before the Try interface.

The reason is that when RPC calls the branch transaction Try, the branch transaction is registered first and then the RPC call is executed. If the network of RPC calls is congested at this time, usually the RPC call has a timeout time. After THE RPC times out, TM will inform RM to roll back the distributed transaction. Only when the RPC request reaches the participant and is actually executed can the business resources reserved by a Try method be used only by the distributed transaction. The business resources reserved in the first stage of the distributed transaction are no longer able to be processed. In this case, we call it suspension, that is, the business resources cannot be processed after being reserved.

The solution: if phase two has already executed, phase one must not be allowed to run. When executing the phase-one Try, check whether the branch transaction record table already contains a phase-two record for this global transaction; if it does, skip the Try.
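The suspension guard can be sketched in the same style as the branch-record checks. Again the names are hypothetical and an in-memory set stands in for the record table:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the suspension guard: before running Try, check
// whether a phase-two (Cancel) record already exists for this branch.
public class SuspensionGuard {
    private static final Set<String> phaseTwoDone = new HashSet<>();

    // Cancel records that phase two ran (even when it was an empty rollback).
    public static void recordCancel(String branchId) {
        phaseTwoDone.add(branchId);
    }

    // Try refuses to reserve resources if Cancel has already run.
    public static boolean tryPhase(String branchId) {
        if (phaseTwoDone.contains(branchId)) {
            return false;  // late-arriving Try after rollback: skip it
        }
        // a real implementation would reserve the business resources here
        return true;
    }
}
```

Note that Cancel must record its execution even on an empty rollback; otherwise a Try arriving afterwards would still reserve resources that nothing will ever release.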

Take an example: A transfers 30 yuan to B, and accounts A and B are in different services.

Plan

Account A

Try: check that the balance is sufficient, deduct 30 yuan
Confirm: empty
Cancel: add the 30 yuan back

Account B

Try: add 30 yuan
Confirm: empty
Cancel: subtract 30 yuan

Plan explained

(1) For account A, the balance is the business resource. Following the principle above, the business resource must be checked and reserved in phase one, so the Try interface first checks whether account A’s balance is sufficient and, if so, deducts 30 yuan. The Confirm interface represents the formal commit; since the business resource was already deducted in Try, Confirm has nothing left to do. Rolling back the transaction executes the Cancel interface, which returns the 30 yuan deducted in Try to the account.

(2) For account B, the phase-one Try interface adds 30 yuan to account B. Executing the Cancel interface means the whole transaction is rolled back; rolling back account B requires subtracting the 30 yuan added in Try.

Scheme problem analysis

  1. If account A’s Cancel runs when its Try was never executed (an empty rollback), the account incorrectly gains 30 yuan.

  2. Since Try, Cancel, and Confirm are called by separate threads and are often called repeatedly, they all need to be idempotent.

  3. Account B receives 30 yuan in Try; once Try completes, that money may be spent by other threads before the transaction concludes.

  4. If account B’s Cancel runs when its Try was never executed, it subtracts 30 yuan that was never added, leaving the account 30 yuan short.

Problem solving

  1. Account A’s Cancel method must check whether the Try method executed; Cancel may perform its compensation only after Try executed properly, and must do nothing on an empty rollback.

  2. The Try, Cancel, and Confirm methods must all be implemented idempotently.

  3. Account B must not update the account amount in the Try method; the amount is added in Confirm instead.

  4. Account B’s Cancel method must check whether the Try method executed; Cancel may perform its compensation only after Try executed properly.

Optimization scheme

Account A

Try: idempotency check; suspension check; check that the balance is sufficient; deduct 30 yuan
Confirm: empty
Cancel: idempotency check; empty-rollback check; add the 30 yuan back

Account B

Try: empty
Confirm: idempotency check; formally add 30 yuan
Cancel: empty
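The optimized plan can be put together in one hypothetical in-memory sketch. The class names, the fixed 30-yuan amount, and the map-based "accounts" are all illustrative; real code would use database rows, the branch record table, and proper locking:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the optimized plan: account A reserves the money in
// Try; account B only credits in Confirm; all phases carry the guards above.
public class TransferTcc {
    public static final Map<String, Integer> balance = new HashMap<>();
    public static final Set<String> aTried = new HashSet<>();     // A's Try ran
    public static final Set<String> aCanceled = new HashSet<>();  // A's Cancel ran
    public static final Set<String> bConfirmed = new HashSet<>(); // B's Confirm ran

    // Account A, Try: idempotency + suspension checks, then deduct 30.
    public static boolean aTry(String txId) {
        if (aTried.contains(txId) || aCanceled.contains(txId)) return false;
        if (balance.getOrDefault("A", 0) < 30) return false;  // balance check
        balance.merge("A", -30, Integer::sum);
        aTried.add(txId);
        return true;
    }

    // Account A, Cancel: idempotency + empty-rollback checks, then refund 30.
    public static void aCancel(String txId) {
        if (!aCanceled.add(txId)) return;   // idempotent: already canceled
        if (!aTried.contains(txId)) return; // empty rollback: no refund
        balance.merge("A", 30, Integer::sum);
    }

    // Account B, Confirm: idempotency check, then formally add 30.
    public static void bConfirm(String txId) {
        if (!bConfirmed.add(txId)) return;  // idempotent: already confirmed
        balance.merge("B", 30, Integer::sum);
    }
}
```

Because `aCancel` records itself in `aCanceled` even on an empty rollback, a late-arriving `aTry` for the same transaction is rejected, which covers the suspension case as well.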

4.3 summary

Comparing the TCC flow with 2PC two-phase commit: 2PC is usually handled at the DB level, across databases, while TCC is handled at the application level and must be implemented in business logic. The advantage of this kind of distributed transaction is that applications can define the granularity of their own data operations, which makes it possible to reduce lock conflicts and improve throughput.

However, the disadvantage is that it is very intrusive to the application: every branch of the business logic must implement Try, Confirm, and Cancel operations. It is also difficult to implement, since different rollback strategies are needed for different failure causes, such as network conditions and system faults.

5. Final consistency of reliable messages in distributed transaction solutions

5.1 What are reliable-message final-consistency transactions

In the reliable-message final-consistency scheme, the transaction initiator sends a message after completing its local transaction, and the transaction participants (the message consumers) are guaranteed to receive the message and successfully process their own transactions. The scheme emphasizes that as long as the message reaches the transaction participants, the overall transaction eventually becomes consistent.

The scheme is implemented with message-oriented middleware, and the interaction works as follows:

The transaction initiator (producer) sends a message to the message middleware, and the transaction participant (consumer) receives the message from the middleware. Both the initiator and the participant communicate with the middleware over the network, and it is the uncertainty of this network communication that creates the distributed transaction problem.

Therefore, the reliable-message final-consistency scheme must solve the following problems:

Atomicity issues with local transactions and message sending

The atomicity problem of the local transaction and message sending: the transaction initiator must send the message only after its local transaction succeeds; otherwise the message must be discarded. In other words, the local transaction and the message send must be atomic: either both succeed or both fail. This atomicity is the key problem in realizing the reliable-message final-consistency scheme.

Consider first sending the message and then operating the database:

begin transaction;
    // 1. send the MQ message
    // 2. operate the database
commit transaction;

Here the consistency of the database operation and the message send cannot be guaranteed, because the send may succeed while the database operation fails. The second option is to perform the database operation first and then send the message:


begin transaction;
    // 1. operate the database
    // 2. send the MQ message
commit transaction;

This looks fine at first: if sending the MQ message fails, an exception is thrown and the database transaction rolls back. However, on a send timeout the database rolls back even though MQ may have actually received the message, which again produces inconsistency.

  1. Reliability of messages received by transaction participants

    Transaction participants must be able to receive messages from the message queue and can repeat receiving messages if they fail to receive messages.

  2. The problem of repeated message consumption

    Because of network uncertainty, a consuming node may time out even though it actually consumed the message successfully; the message middleware then redelivers the message, causing it to be consumed repeatedly.

    To solve the problem of repeated message consumption, the idempotency of transaction participants’ methods must be realized.

5.2 Solution

The previous section discussed the issues that need to be addressed in the final consistency transaction scheme for reliable messages, and this section discusses specific solutions.

5.2.1 Local message table scheme

The local message table scheme was originally proposed by eBay. Its core idea: use a local transaction to keep the business operation and the message consistent, then send the message to the message middleware via a scheduled task, and delete the message record once the send to the consumer is confirmed successful.

Take the following example of two interacting microservices: a user service and a points service. The user service is responsible for adding users, and the points service is responsible for adding points.

The interaction process is as follows:

User registration

In one local transaction, the user service adds the user and inserts a “points message log” record. (The user table and the message table are kept consistent by the local transaction.)

begin transaction;
    // 1. add the user
    // 2. insert the points message log
commit transaction;

Here the local database operation and the points-message-log insert are in the same transaction, so the database operation and the message-log operation are atomic.
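The commit-or-discard behavior of the two inserts can be illustrated with a small sketch. This is purely illustrative: a real implementation would simply run both INSERTs inside one database transaction, while here two lists stand in for the user and message tables and a copy-then-swap simulates commit/rollback:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the user row and the points-message row commit together,
// so either both exist or neither does.
public class LocalMessageTable {
    public static final List<String> users = new ArrayList<>();
    public static final List<String> messageLog = new ArrayList<>();

    // Simulates "begin; insert user; insert message; commit".
    public static boolean registerUser(String userId, boolean failMidway) {
        List<String> u = new ArrayList<>(users);      // work on copies
        List<String> m = new ArrayList<>(messageLog);
        u.add(userId);                                // 1. add the user
        if (failMidway) return false;                 // failure -> rollback: discard copies
        m.add("addPoints:" + userId);                 // 2. insert the points message log
        users.clear(); users.addAll(u);               // commit both changes together
        messageLog.clear(); messageLog.addAll(m);
        return true;
    }
}
```

A failed registration leaves both "tables" untouched, which is the atomicity the local message table relies on.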

  1. Periodic task scanning logs

    How do you ensure that messages are sent to message queues?

    After the first step, the message has been written to the message log table. Start an independent thread that periodically scans the message log table and sends the messages to the message middleware; once the middleware reports a successful send, delete that message log entry.

  2. Message consumption

    How do you ensure that consumers are consuming information?

    MQ’s ACK (message acknowledgement) mechanism can be used here. The consumer listens to MQ; when it receives a message and finishes its business processing, it sends an ACK to MQ, and MQ stops pushing that message. Otherwise, MQ keeps retrying delivery to the consumer.

    The points service receives the “add points” message and adds the points. Once the points are added successfully, it responds to the message middleware with an ACK; otherwise the middleware redelivers the message.

    Since messages may be delivered repeatedly, the points service’s “add points” operation must be idempotent.
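A common way to make the consumer idempotent is to deduplicate by message ID. A minimal sketch, assuming a hypothetical class name and an in-memory set in place of a deduplication table:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the points service records handled message IDs, so a
// redelivered "add points" message is applied at most once.
public class PointsConsumer {
    public static final Map<String, Integer> points = new HashMap<>();
    public static final Set<String> handled = new HashSet<>();

    // Returns true if the message was applied, false if it was a duplicate.
    public static boolean onMessage(String messageId, String userId, int delta) {
        if (!handled.add(messageId)) return false;  // duplicate: just ACK again
        points.merge(userId, delta, Integer::sum);  // apply the points once
        return true;                                // then ACK to MQ
    }
}
```

In production the handled-ID set would live in the same database transaction as the points update, so the check and the update commit atomically.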

5.3 summary

Reliable-message final consistency guarantees the consistency of message passing from producer to consumer through the message middleware, solving:

  1. Atomicity issues with local transactions and message sending.

  2. Reliability of messages received by transaction participants.

The reliable-message final-consistency transaction suits scenarios with long execution cycles and low real-time requirements. By introducing a message mechanism, synchronous transaction operations become asynchronous, message-driven operations, avoiding the impact of synchronous blocking in distributed transactions and decoupling the two services.

6. Maximum effort notification for distributed transaction solutions

6.1 What is maximum Effort notification

Maximum effort notification is another solution for distributed transactions. Here is a recharge (top-up) example:

Interactive process:

  1. The account system calls the recharge system interface

  2. After processing the payment, the recharge system sends the recharge result notification to the account system; if the notification fails, it retries according to a policy

  3. The account system receives the recharge result notification and updates the recharge status

  4. If the account system does not receive the notification, it will call the interface of the recharge system to query the recharge result

From this example we can summarize the goal of the best-effort notification scheme: the notifying party uses some mechanism to notify the receiving party of the business processing result as best it can.

Specifically include:

  1. A repeated-notification mechanism. Because the receiving party may not have received the notification, there must be some mechanism for notifying it repeatedly

  2. A message reconciliation mechanism. If the recipient is not notified despite best efforts, or the recipient consumed the message and later needs it again, the recipient can proactively query the notifying party for the message information.

What is the difference between best effort notification and reliable message consistency?

  1. Solutions think differently

    Reliable message consistency: the notifying party must guarantee that the message is sent and delivered to the receiving party; reliability is the notifying party’s responsibility. Maximum effort notification: the notifying party tries its best to deliver the business processing result to the receiving party, but the message may still fail to arrive.

  2. The service application scenarios are different

    Reliable message consistency addresses transaction consistency during the transaction, completing the transaction asynchronously. Best effort notification is about notification after the transaction, i.e., reliably delivering the transaction result.

  3. The direction of technical solution is different

    Reliable message consistency guarantees the consistency of messages from send to receive: a sent message is received. Best effort notification cannot guarantee that; it only provides a reliability mechanism for message reception: try hard to deliver the notification, and when the receiver still misses it, let the receiver actively query the message (the business processing result).

6.2 Solution

With maximum effort notification understood, MQ’s ACK mechanism can be used to implement it.

Plan 1:

This scheme uses MQ ACK mechanism to send notification to the receiving party by MQ. The process is as follows:

  1. The notification originator sends the notification to MQ, using an ordinary message mechanism.

    Note: If the message is not sent, the receiving party can actively request the initiating party to query the service execution result. (More on that later)

  2. The receiver listens to MQ.

  3. The receiving party receives the message and responds with an ACK after its business processing completes.

  4. MQ repeats the notification if the receiving party does not respond with an ACK.

    MQ retries with increasing intervals: 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 2 hours, 5 hours, 10 hours, until the upper limit of the notification time window is reached.

  5. The receiving party can check message consistency through the message reconciliation interface.
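The escalating retry schedule in step 4 can be sketched as a simple lookup. This is a sketch of the schedule as described in the text, with hypothetical names; real MQ products (e.g. RocketMQ-style delay levels) configure this in the broker:

```java
// Hypothetical sketch of the escalating notification retry schedule:
// 1m, 5m, 10m, 30m, 1h, 2h, 5h, 10h, then give up.
public class NotifyRetry {
    public static final long[] DELAYS_MIN = {1, 5, 10, 30, 60, 120, 300, 600};

    // Returns the delay in minutes before retry attempt n (0-based),
    // or -1 when the notification time window is exhausted.
    public static long nextDelayMinutes(int attempt) {
        return attempt < DELAYS_MIN.length ? DELAYS_MIN[attempt] : -1;
    }
}
```

Once `nextDelayMinutes` returns -1, the notifying party stops retrying and relies on the receiver’s reconciliation query (step 5) to close the gap.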

Scheme 2:

The interaction process is as follows:

  1. The notification originator sends the message to MQ.

    Local-transaction-and-message atomicity is guaranteed with the transactional messages of the reliable-message consistency scheme, ensuring the notification first reaches MQ.

  2. A notification program listens to MQ and receives messages from it.

    In scheme 1, the receiving party listens to MQ directly; in scheme 2, a separate notification program listens to MQ on the receiver’s behalf.

    If the notifier does not respond to an ACK, MQ will repeat the notification.

  3. The notification program invokes the receiving party’s notification interface through an internet protocol (such as HTTP or WebService) to complete the notification.

    The notification succeeds once the notification program successfully invokes the receiving party’s interface, which consumes the MQ message; MQ then stops delivering that notification message.

  4. The receiver can check message consistency through the message calibration interface.

Differences between Plan 1 and Plan 2:

  1. In scheme 1, the receiving party interfaces with MQ directly, i.e., the receiver itself listens to MQ. This mainly applies to notifications between internal applications.

  2. In scheme 2, a notification program interfaces with MQ: it listens to MQ and, on receiving a message, calls the receiving party through an internet protocol. This mainly applies to notifications between external applications, such as Alipay and WeChat payment result notifications.

6.3 summary

The maximum effort notification scheme has one of the lowest consistency requirements among distributed transactions and suits businesses with low sensitivity to when final consistency is reached. The scheme needs to implement the following:

  1. A repeated-notification mechanism

  2. A message reconciliation mechanism

7. Summary

Comparative analysis of distributed transactions

The biggest problem with 2PC is that it is a blocking protocol: after executing its branch transaction, the RM waits for the TM’s decision, and during that time the service blocks and holds locks on resources. Because of this blocking behavior and its poor worst-case latency, the design cannot scale as the number of services participating in a transaction grows, and it is hard to use in distributed services with high concurrency and long-lived sub-transactions.

Comparing the TCC flow with 2PC two-phase commit: 2PC is usually handled at the DB level, across databases, while TCC is handled at the application level and must be implemented in business logic. The advantage is that applications can define the granularity of their own data operations, which can reduce lock conflicts and improve throughput. The disadvantage is heavy intrusion into the application: every branch of the business logic must implement Try, Confirm, and Cancel, and different rollback strategies are needed for different failure causes such as network conditions and system faults. Typical scenarios: spend-threshold discounts, sending coupons on login, etc.

The reliable-message final-consistency transaction suits scenarios with long execution cycles and low real-time requirements. The message mechanism turns synchronous transaction operations into asynchronous, message-driven operations, avoiding synchronous blocking in distributed transactions and decoupling the two services. Typical scenarios: points on registration, coupons on login, etc.

Maximum effort notification has one of the lowest consistency requirements among distributed transactions and suits businesses with low sensitivity to when final consistency is reached. The notification sender is allowed to fail to deliver, and the notification receiver is allowed to handle failures actively; however the sender handles its failures, the receiver’s subsequent processing is not affected. The sender must provide an execution-status query interface so the receiver can check results. Typical scenarios: bank notifications, payment result notifications, etc.

                            2PC                  TCC                 Reliable message     Best effort notification
Consistency                 Strong consistency   Final consistency   Final consistency    Final consistency
Throughput                  Low                  Medium              High                 High
Implementation complexity   Easy                 Hard                Medium               Easy

conclusion

When possible, prefer a local transaction on a single data source: it avoids the performance cost of network round trips and the many problems that weak data consistency brings. If a system uses distributed transactions frequently and unreasonably, first re-examine the overall design: is the service split reasonable? Is it high-cohesion and low-coupling? Is the granularity too fine? Distributed transactions have always been a challenge in the industry, because of network uncertainty and because we habitually measure them against the ACID guarantees of single-machine transactions.

Neither XA at the database level nor TCC, reliable messages, or maximum effort notification at the application level solves the distributed transaction problem perfectly. They are all trade-offs among performance, consistency, availability, and so on, seeking an acceptable balance for particular scenario preferences.

Follow the author’s wechat official account – “The Road to JAVA Architecture Advancement”

Learn more about Java back-end architecture and the latest interview gems