As businesses grow quickly and become more complex, almost every company’s system moves from a monolithic architecture to a distributed one, and in particular to microservices. With that move, the problem of distributed transactions becomes unavoidable. This article summarizes the classic solutions to distributed transactions.

Basic theory

Before going into specific scenarios, let’s take a look at the basic theories involved in distributed transactions.

Let’s take a transfer as an example. A needs to transfer 100 yuan to B, so A’s balance must decrease by 100 yuan and B’s balance must increase by 100 yuan. The transfer as a whole must ensure that A-100 and B+100 either both succeed or both fail. Let’s see how this problem is solved in various scenarios.

Transactions

The ability to treat multiple statements as a single unit is called a database transaction. A database transaction ensures that all operations within its scope either all succeed or all fail.

Transactions have four properties: atomicity, consistency, isolation, and durability. These four properties are commonly referred to as the ACID properties.

  • Atomicity: All operations in a transaction either complete or do not complete and do not end at an intermediate stage. If a transaction fails during execution, it is restored to the state before the transaction began, as if the transaction had never been executed.

  • Consistency: The integrity of the database is not compromised before or after a transaction. Integrity covers foreign key constraints, application-defined constraints, and any other constraints, none of which may be broken.

  • Isolation: The ability of a database to allow multiple concurrent transactions to read, write, and modify its data at the same time. Isolation prevents data inconsistencies caused by interleaved execution when multiple transactions run concurrently.

  • Durability: After a transaction ends, its modifications to data are permanent and are not lost even if the system fails.

Distributed transactions

An inter-bank transfer is a typical distributed transaction scenario. If A needs to make an inter-bank transfer to B, the data of two banks is involved. The ACID properties of the transfer cannot be guaranteed by a local transaction in a single database; they can only be achieved with a distributed transaction.

A distributed transaction is one in which the transaction initiator, the resources and resource managers, and the transaction coordinator are located on different nodes of a distributed system. In the transfer service, the A-100 operation and the B+100 operation are not on the same node. In essence, distributed transactions exist to ensure that data operations are performed correctly in distributed scenarios.

In a distributed environment, distributed transactions relax the requirements on consistency and isolation in order to meet the needs of availability, performance, and degraded service. On the one hand, they follow the BASE theory (BASE involves a lot of material; interested readers can look it up separately):

  • Basically Available

  • Soft state

  • Eventual consistency

On the other hand, distributed transactions also partly follow the ACID specification:

  • Atomicity: strictly followed

  • Consistency: consistency after the transaction completes is strictly followed; consistency during the transaction can be relaxed

  • Isolation: parallel transactions must not affect each other; the visibility of intermediate results can be safely relaxed

  • Durability: strictly followed

Distributed transaction solutions

Two-phase commit / XA

XA is a distributed transaction specification proposed by the X/Open organization. The XA specification mainly defines the interface between the (global) transaction manager (TM) and the (local) resource manager (RM). Local databases such as MySQL play the role of RM in XA.

XA is divided into two phases:

  • Phase 1 (prepare): All participants (RMs) prepare to execute the transaction and lock the required resources. When a participant is ready, it reports to the TM that it is ready.

  • Phase 2 (commit/rollback): When the transaction manager (TM) confirms that all participants (RMs) are ready, it sends the commit command to all participants.

At present, most mainstream databases support XA transactions, including MySQL, Oracle, SQL Server, and PostgreSQL.

An XA transaction consists of one or more resource managers (RMs), a transaction manager (TM), and an application program (AP).

Taking the above transfer as an example, a sequence diagram of a successfully completed XA transaction looks like this:

If any participant fails to prepare, the TM notifies all participants that have completed prepare to roll back.
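The prepare/commit flow can be sketched in memory as follows. This is a toy model under stated assumptions, not the XA interface of any database: the Participant interface, the Account type, and RunTwoPhase are hypothetical names made up for illustration, and a boolean flag stands in for the locks a real RM would hold.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical RM interface used only in this sketch.
type Participant interface {
	Prepare() error // phase 1: validate and lock resources
	Commit()        // phase 2: make the prepared change permanent
	Rollback()      // phase 2: discard the prepared change
}

// Account is an in-memory RM holding a single balance.
type Account struct {
	Name    string
	Balance int
	Delta   int  // change requested by the global transaction
	pending bool // stands in for a lock held between the two phases
}

func (a *Account) Prepare() error {
	if a.Balance+a.Delta < 0 {
		return errors.New(a.Name + ": insufficient balance")
	}
	a.pending = true
	return nil
}

func (a *Account) Commit() {
	if a.pending {
		a.Balance += a.Delta
		a.pending = false
	}
}

func (a *Account) Rollback() { a.pending = false }

// RunTwoPhase plays the TM role: prepare every participant, then commit
// all of them, or roll back the ones that had already prepared.
func RunTwoPhase(parts ...Participant) error {
	prepared := make([]Participant, 0, len(parts))
	for _, p := range parts {
		if err := p.Prepare(); err != nil {
			for _, q := range prepared {
				q.Rollback()
			}
			return err
		}
		prepared = append(prepared, p)
	}
	for _, p := range parts {
		p.Commit()
	}
	return nil
}

func main() {
	a := &Account{Name: "A", Balance: 100, Delta: -100}
	b := &Account{Name: "B", Balance: 0, Delta: 100}
	err := RunTwoPhase(a, b)
	fmt.Println(err, a.Balance, b.Balance)
}
```

The sketch leaves out what makes real two-phase commit hard: durable logging of the prepared state and recovery when the TM or an RM crashes between the phases.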

The characteristics of XA transactions are:

  • Easy to understand, easy to develop

  • Resources are locked for a long time and the concurrency is low

If you want to study XA further, Go developers can refer to DTM (github.com/yedf/dtm), J…

SAGA

Saga is a scheme proposed in the database paper on sagas. The core idea is to split a long transaction into a number of short local transactions, coordinated by the Saga transaction coordinator. If every step completes normally, the transaction completes normally; if a step fails, the compensation operations are invoked once each, in reverse order.
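A minimal sketch of that coordination loop is shown below. It is an in-memory illustration only: Step, RunSaga, and ErrFrozen are hypothetical names, and compensation failures are ignored here, whereas a real coordinator would persist progress and retry them.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrFrozen is a made-up business error used to force a failure.
var ErrFrozen = errors.New("account B is frozen")

// Step pairs a forward action with its compensation (hypothetical type).
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes the steps in order. If a step fails, it compensates the
// steps that already completed, in reverse order, and returns the error.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				// A real coordinator would retry a failed compensation.
				_ = steps[j].Compensate()
			}
			return fmt.Errorf("step %s failed: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	balances := map[string]int{"A": 100, "B": 0}
	steps := []Step{
		{
			Name:       "TransOut",
			Action:     func() error { balances["A"] -= 100; return nil },
			Compensate: func() error { balances["A"] += 100; return nil },
		},
		{
			Name:       "TransIn",
			Action:     func() error { return ErrFrozen },
			Compensate: func() error { return nil },
		},
	}
	err := RunSaga(steps)
	fmt.Println(err, balances["A"], balances["B"])
}
```

Between the TransOut action and its compensation, A's balance is visibly reduced; that window is the weak consistency that SAGA accepts in exchange for not holding locks.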

Taking the above transfer as an example, the sequence diagram of a successfully completed SAGA transaction looks like this:

Features of SAGA transactions:

  • High concurrency: resources do not need to be locked for a long time as in XA transactions

  • Both the normal operations and the compensation operations need to be defined, so there is more development work than with XA

  • Consistency is weak. For a transfer, it may happen that the money has already been deducted from user A and the transfer then fails

The saga paper covers much more, including two recovery strategies and concurrent execution of branch transactions. Our discussion here covers only the simplest form of SAGA.

SAGA suits many scenarios: long transactions, and business scenarios that are not sensitive to intermediate results.

If you want to study SAGA further, Go developers can refer to DTM ((github.com/yedf/dtm))…

TCC

The concept of TCC (Try-Confirm-Cancel) was first proposed by Pat Helland in the paper Life Beyond Distributed Transactions: An Apostate’s Opinion, published in 2007.

TCC is divided into three phases:

  • Try phase: attempt execution, complete all business checks (consistency), and reserve the necessary business resources (quasi-isolation).

  • Confirm phase: perform the actual execution, using only the business resources reserved in the Try phase. The Confirm operation must be idempotent. If Confirm fails, it must be retried.

  • Cancel phase: cancel the execution and release the business resources reserved in the Try phase. Exception handling in the Cancel phase is basically the same as in the Confirm phase, and it also requires an idempotent design.
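For the debit side of a transfer, one way to model the three phases is to freeze the amount in Try, deduct it in Confirm, and unfreeze it in Cancel. The sketch below assumes an in-memory account; Account, NewAccount, and the per-transaction reservation map are hypothetical and exist only to show how Confirm and Cancel can be made idempotent.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrInsufficient = errors.New("insufficient available balance")

// Account tracks the balance and the amounts frozen by Try, keyed by
// transaction id so that repeated calls can be recognized.
type Account struct {
	Balance  int
	reserved map[string]int
}

func NewAccount(balance int) *Account {
	return &Account{Balance: balance, reserved: map[string]int{}}
}

// Available is the balance minus everything currently frozen.
func (a *Account) Available() int {
	free := a.Balance
	for _, amt := range a.reserved {
		free -= amt
	}
	return free
}

// Try checks the business rule and freezes the amount without deducting it.
func (a *Account) Try(txID string, amount int) error {
	if _, ok := a.reserved[txID]; ok {
		return nil // repeated Try for the same transaction
	}
	if a.Available() < amount {
		return ErrInsufficient
	}
	a.reserved[txID] = amount
	return nil
}

// Confirm deducts only what Try froze; calling it again is a no-op.
func (a *Account) Confirm(txID string) {
	if amt, ok := a.reserved[txID]; ok {
		a.Balance -= amt
		delete(a.reserved, txID)
	}
}

// Cancel releases the frozen amount; calling it again is a no-op.
func (a *Account) Cancel(txID string) {
	delete(a.reserved, txID)
}

func main() {
	a := NewAccount(100)
	if err := a.Try("tx1", 100); err != nil {
		fmt.Println(err)
		return
	}
	a.Confirm("tx1")
	a.Confirm("tx1") // a retried Confirm does not deduct twice
	fmt.Println(a.Balance, a.Available())
}
```

This sketch does not protect against a Try that arrives after its Cancel; that case needs separate handling.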

Taking the above transfer as an example, the amount is usually frozen but not deducted in Try, deducted in Confirm, and unfrozen in Cancel. A successfully completed TCC transaction is shown below:

TCC features are as follows:

  • High concurrency, no long-term resource locking.

  • The Try, Confirm, and Cancel interfaces all have to be implemented, which means a large amount of development work.

  • Consistency is good: unlike SAGA, it cannot happen that money has been deducted and the transfer then fails

  • TCC applies to order-type services that have constraints on intermediate states

If you want to study TCC further, Go developers can refer to DTM ((github.com/yedf/dtm))…

Local message table

The local message table solution was originally published by eBay architect Dan Pritchett in ACM in 2008. The core of the design is to use messages to asynchronously guarantee the execution of tasks that require distributed processing.

The general process is as follows:

Writing local messages and business operations in one transaction ensures atomicity of business and messaging, and either they all succeed or they all fail.

Fault tolerance mechanism:

  • When the balance deduction transaction fails, the transaction is rolled back directly with no further steps

  • If polling and producing the message fails, or the transaction that increases the balance fails, the step is retried
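The mechanism can be sketched with two in-memory "databases". Everything here is an assumption made for illustration: ProducerDB, ConsumerDB, Message, and Poll are hypothetical names, and a single method call stands in for the local database transaction that writes the balance and the message row together.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrInsufficient = errors.New("insufficient balance")

// Message is one row of the hypothetical local message table.
type Message struct {
	ID     int
	Amount int
	Sent   bool
}

// ProducerDB stands in for bank A's database: the balance and the message
// table live in the same store, so one local transaction covers both.
type ProducerDB struct {
	Balance  int
	Messages []Message
}

// Deduct models the local transaction: the balance change and the message
// row are written together, or not at all.
func (db *ProducerDB) Deduct(amount int) error {
	if db.Balance < amount {
		return ErrInsufficient // rolled back, no message written
	}
	db.Balance -= amount
	db.Messages = append(db.Messages, Message{ID: len(db.Messages) + 1, Amount: amount})
	return nil
}

// Poll delivers every unsent message. A failed delivery stays unsent and
// is retried on the next poll.
func (db *ProducerDB) Poll(deliver func(Message) error) {
	for i := range db.Messages {
		if db.Messages[i].Sent {
			continue
		}
		if err := deliver(db.Messages[i]); err == nil {
			db.Messages[i].Sent = true
		}
	}
}

// ConsumerDB stands in for bank B. Seen makes the credit idempotent,
// because retries can deliver the same message more than once.
type ConsumerDB struct {
	Balance int
	Seen    map[int]bool
}

func (c *ConsumerDB) Credit(m Message) error {
	if c.Seen[m.ID] {
		return nil // duplicate delivery, already applied
	}
	c.Seen[m.ID] = true
	c.Balance += m.Amount
	return nil
}

func main() {
	a := &ProducerDB{Balance: 100}
	b := &ConsumerDB{Seen: map[int]bool{}}
	if err := a.Deduct(100); err != nil {
		fmt.Println(err)
		return
	}
	a.Poll(b.Credit)
	a.Poll(b.Credit) // a second poll finds nothing left to send
	fmt.Println(a.Balance, b.Balance)
}
```

Because delivery is retried until it succeeds, the consumer side has to tolerate duplicates, which is why Credit checks the message ID before applying the change.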

Local message table features:

  • Long transactions only need to be split into multiple tasks and are easy to use

  • Producers need to create additional message tables

  • Each local message table needs to be polled

  • If the consumer’s logic cannot succeed even after retries, additional mechanisms are needed to roll back the earlier operations

This mode applies to services that can be executed asynchronously and do not need to be rolled back in subsequent operations

Transaction message

In the local message table scheme above, producers need to create additional message tables and poll them, which places a heavy burden on the business. Alibaba’s RocketMQ officially supports transaction messages in version 4.3 and later. This essentially moves the local message table into RocketMQ to guarantee, on the producer side, that sending the message and executing the local transaction are atomic.

Transaction message sending and submission:

  • Send messages (half messages)

  • The server stores the message and responds to the result of writing the message

  • Execute a local transaction based on the sent result (if the write fails, the half message is not visible to the business and the local logic is not executed)

  • Perform Commit or Rollback based on the local transaction state (Commit publishes messages visible to consumers)

The normal flow chart is as follows:

Compensation process:

For pending transaction messages that have been neither committed nor rolled back, the server initiates a “check back” to the producer. The producer receives the check-back message and returns the status of the local transaction corresponding to the message. The server then commits or rolls back the message according to that status.

The transaction message scheme is very similar to the local message table mechanism. The main difference is that the local table operations have been replaced by a check-back interface.
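The half message and check-back flow can be modeled with a toy broker. This is not the RocketMQ client API: Broker, SendHalf, End, and CheckBack are hypothetical names, and the producer's local transaction record is just a map.

```go
package main

import "fmt"

// State is the outcome of the producer's local transaction.
type State int

const (
	Unknown State = iota // outcome not yet known to the broker
	Commit
	Rollback
)

// Broker is a toy stand-in for the message server.
type Broker struct {
	half      map[string]string // half messages, invisible to consumers
	Delivered []string          // messages visible to consumers
}

func NewBroker() *Broker { return &Broker{half: map[string]string{}} }

// SendHalf stores a half message under the given transaction id.
func (b *Broker) SendHalf(id, body string) { b.half[id] = body }

// End applies the producer's decision to a half message. Calls for an
// already resolved or still undecided message change nothing.
func (b *Broker) End(id string, s State) {
	body, ok := b.half[id]
	if !ok || s == Unknown {
		return
	}
	if s == Commit {
		b.Delivered = append(b.Delivered, body)
	}
	delete(b.half, id)
}

// CheckBack asks the producer about every pending half message and
// resolves the ones whose local transaction outcome is known.
func (b *Broker) CheckBack(query func(id string) State) {
	for id := range b.half {
		b.End(id, query(id))
	}
}

func main() {
	b := NewBroker()
	local := map[string]State{} // producer's record of local transactions

	b.SendHalf("tx1", "credit B 100")
	local["tx1"] = Commit // local transaction committed, but the commit
	// notification to the broker is assumed to be lost

	b.CheckBack(func(id string) State { return local[id] })
	fmt.Println(b.Delivered)
}
```

The check-back query plays the role that polling the local message table played in the previous scheme: it lets the server recover the outcome when the producer's commit or rollback notification never arrives.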

Transaction messages have the following characteristics:

  • Long transactions only need to be split into multiple tasks, plus a check-back interface, so the scheme is easy to use

  • If the consumer’s logic cannot succeed even after retries, additional mechanisms are needed to roll back the earlier operations

This mode applies to services that can be executed asynchronously and whose subsequent operations do not need to be rolled back.

If you want to explore transaction messages further, you can refer to RocketMQ, and DTM ((github.com/yedf/dtm)) also…

Best-effort notification

The notifying party makes its best effort to notify the receiving party of the result of business processing through some mechanism. Specifically, this includes:

  • A mechanism for repeating notifications. Because the receiving party may not have received a notification, there must be some mechanism for sending it repeatedly.

  • A message reconciliation mechanism. If the receiving party still has not been notified despite best efforts, or has consumed the message and wants to consume it again, it can actively query the notifying party for the message.

The local message table and transaction messages described earlier are both reliable messages. How do they differ from the best-effort notification described here?

Reliable message consistency: the notifying party must ensure that the message is sent out and delivered to the receiving party. The reliability of the message is guaranteed by the notifying party.

Best-effort notification: the notifying party does its best to notify the receiving party of the business result, but the message may still not be received. In that case the receiving party has to query the notifying party for the result.

In terms of implementation, best-effort notification requires:

  • Provides an interface through which the receiving party can query the result of service processing

  • Message queue ACK mechanism: the message queue repeats the notification at gradually increasing intervals of 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 2 hours, 5 hours, and 10 hours, until the upper limit of the notification time window is reached
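The retry schedule can be expressed as a small lookup, sketched below. The interval table copies the values from the bullet above, each listed once; NextDelay, the 0-based attempt numbering, and the 24-hour window used in main are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Intervals is the wait before each successive notification attempt.
var Intervals = []time.Duration{
	1 * time.Minute, 5 * time.Minute, 10 * time.Minute, 30 * time.Minute,
	1 * time.Hour, 2 * time.Hour, 5 * time.Hour, 10 * time.Hour,
}

// NextDelay returns the wait before retry number attempt (0-based). It
// returns false once the schedule is exhausted or when waiting would push
// the total elapsed time past the notification window.
func NextDelay(attempt int, elapsed, window time.Duration) (time.Duration, bool) {
	if attempt < 0 || attempt >= len(Intervals) {
		return 0, false
	}
	d := Intervals[attempt]
	if elapsed+d > window {
		return 0, false
	}
	return d, true
}

func main() {
	var elapsed time.Duration
	window := 24 * time.Hour // assumed upper limit of the window
	for attempt := 0; ; attempt++ {
		d, ok := NextDelay(attempt, elapsed, window)
		if !ok {
			break
		}
		elapsed += d
		fmt.Printf("retry %d after %v (total %v)\n", attempt+1, d, elapsed)
	}
}
```

Once NextDelay reports false, the notifying party stops pushing and the receiving party has to fall back on the query interface.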

Best-effort notification applies to business notification scenarios. For example, WeChat notifies each merchant of transaction results through best-effort notification, which includes both a callback notification and a transaction query interface.

AT transaction mode

This is a transaction mode in Seata, Alibaba’s open source project, and is also known as FMT at Ant Financial. Its advantage is that, as with XA mode, services do not need to write compensation operations; rollback is completed automatically by the framework. Its disadvantage is also similar to XA: locks are held for a long time, so it does not suit high-concurrency scenarios. Those who are interested can refer to Seata-AT.

Network exceptions in distributed transactions

Network and service faults can occur at every step of a distributed transaction. To cope with them, the business side of a distributed transaction has to implement three things: empty rollback handling, idempotence, and suspension prevention. The following uses a TCC transaction to illustrate these exceptions:

Empty rollback

Empty rollback: the second-phase Cancel method is called without the Try method of the TCC resource having been called. The Cancel method needs to recognize that this is an empty rollback and return success directly.

The reason is that when the service hosting a branch transaction is down or the network is abnormal, the call to the branch transaction is recorded as failed, although the Try phase was never actually executed. After the fault is recovered, rolling back the distributed transaction calls the second-phase Cancel method, which results in an empty rollback.

Idempotence

Idempotence: any request can hit a network exception and be sent again, so every branch of a distributed transaction must be idempotent.

Suspension

Suspension: for a distributed transaction, the second-phase Cancel interface is executed before the Try interface.

The reason is that when the Try of a branch transaction is called via RPC, the branch transaction is registered first and then the RPC call is made. If the network is congested at that moment and the RPC times out, the TM tells the RM to roll back the distributed transaction. The Try request may then reach the RM and actually execute only after the rollback has completed.

To better understand these problems, take a look at a sequence diagram of network exceptions

When the business processes request 4, Cancel is executed before Try, so an empty rollback has to be handled. When it processes request 6, Cancel is executed repeatedly, so idempotence has to be handled. When it processes request 8, Try is executed after Cancel, so suspension has to be handled.

Faced with these complex network exceptions, the business side usually uses a unique key to check whether the associated operation has already been completed, and returns success directly if it has. The judgment logic involved is complex and error-prone, and places a heavy burden on the business.

In the DTM project, there is a sub-transaction barrier technology that achieves this effect. See the schematic diagram:

Every request arrives at the sub-transaction barrier: abnormal requests are filtered out, and normal requests pass through. Once developers use the sub-transaction barrier, all of the exceptions above are handled properly and they only need to focus on the actual business logic, which greatly reduces the burden.

The subtransaction barrier provides the method ThroughBarrierCall, which is prototyped as:

func ThroughBarrierCall(db *sql.DB, transInfo *TransInfo, busiCall BusiFunc)


Business developers write their own logic in busiCall and pass it to this function. ThroughBarrierCall guarantees that busiCall is not called in scenarios such as empty rollback and suspension. When the business is invoked repeatedly, idempotent control ensures that it is committed only once.

The sub-transaction barrier manages TCC, SAGA, XA, transaction messages, and more, and can be extended to other areas as well

The principle of the sub-transaction barrier is to create a branch transaction status table, sub_trans_barrier, in the local database, with a unique key made up of global transaction id - sub-transaction id - branch name (try | confirm | cancel). Each request is then processed as follows:

  • Open the transaction

  • If it is a Try branch, insert ignore gid-branchid-try. If the insert succeeds, call the logic inside the barrier

  • If it is a Confirm branch, insert ignore gid-branchid-confirm. If the insert succeeds, call the logic inside the barrier

  • If it is a Cancel branch, insert ignore gid-branchid-try and then gid-branchid-cancel. If try was not inserted (it already existed) and cancel was inserted successfully, call the logic inside the barrier

  • If the logic inside the barrier returns success, commit the transaction and return success

  • If the logic inside the barrier returns an error, roll back the transaction and return an error

With this mechanism, the problems related to network exceptions are solved:

  • Empty rollback control: if Try has not been executed and Cancel is executed directly, Cancel succeeds in inserting gid-branchid-try, so the logic inside the barrier is skipped

  • Idempotent control: no branch can insert the same unique key twice, which ensures that the logic is not executed repeatedly

  • Suspension control: if Try is executed after Cancel, the insert of gid-branchid-try fails because Cancel has already inserted it, so the logic inside the barrier is not executed
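The insert-ignore rules can be exercised with an in-memory model of the table. This is a sketch of the principle only, not DTM's implementation or its ThroughBarrierCall: Barrier, NewBarrier, and Call are hypothetical, a map stands in for the sub_trans_barrier table, and removing the keys on error stands in for rolling back the local transaction.

```go
package main

import "fmt"

// Barrier simulates the sub_trans_barrier table as a set of unique keys.
type Barrier struct {
	keys map[string]bool
}

func NewBarrier() *Barrier { return &Barrier{keys: map[string]bool{}} }

// Call runs busi only when the request should pass the barrier. op is
// "try", "confirm" or "cancel". It reports whether busi was run.
func (b *Barrier) Call(gid, branchID, op string, busi func() error) (bool, error) {
	var added []string
	// insert mimics INSERT IGNORE and reports whether a row was added.
	insert := func(o string) bool {
		k := gid + "-" + branchID + "-" + o
		if b.keys[k] {
			return false
		}
		b.keys[k] = true
		added = append(added, k)
		return true
	}

	var pass bool
	if op == "cancel" {
		tryInserted := insert("try")
		// Pass only if Try already ran and this is the first Cancel.
		pass = insert("cancel") && !tryInserted
	} else {
		pass = insert(op)
	}
	if !pass {
		return false, nil // filtered: the inserted rows stay committed
	}
	if err := busi(); err != nil {
		for _, k := range added {
			delete(b.keys, k) // mimic rolling back the local transaction
		}
		return true, err
	}
	return true, nil
}

func main() {
	b := NewBarrier()
	busi := func() error { fmt.Println("business logic runs"); return nil }

	ran, _ := b.Call("g1", "b1", "cancel", busi) // Cancel arrives without Try
	fmt.Println("cancel before try ran:", ran)
	ran, _ = b.Call("g1", "b1", "try", busi) // the delayed Try arrives late
	fmt.Println("late try ran:", ran)
}
```

In a real database the inserts and the business update share one local transaction, so a filtered request and a failed business call both leave the table in a consistent state without extra bookkeeping.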

A similar mechanism applies to SAGA transactions.

Sub-transaction barrier technology was first introduced by DTM (github.com/yedf/dtm)…

The technology currently needs to be paired with the DTM transaction manager, and the SDK is currently available to Go developers. SDKs for other languages are in the works. For other distributed transaction frameworks, as long as the appropriate distributed transaction information is provided, this technology can be implemented quickly according to the principles above.

Conclusion

This article introduces some basic theories of distributed transactions and explains the common distributed transaction schemes. The second half of the article also covers the causes and classification of transaction exceptions, along with an elegant solution to them.
