As businesses grow rapidly and become more complex, almost every company’s systems move from monolithic to distributed, especially to microservice architectures. This article summarizes the most classic solution of distributed transaction and shares it with you.

The basic theory

Before going into the specifics, let’s look at the basic theories involved in distributed transactions.

So let’s take A transfer, where A needs to transfer 100 yuan to B, so the balance for A needs to be minus 100 yuan, and the balance for B needs to be plus 100 yuan, and the whole transfer needs to either succeed or fail at the same time, a-100 and B+100. See how this problem is solved in various scenarios.

The transaction

The ability to operate on multiple statements as a whole is called a database transaction. A database transaction ensures that all operations within the scope of the transaction will succeed or fail entirely.

Transactions have four properties: atomicity, consistency, isolation, and persistence. These four properties are often referred to as the ACID properties.

  • Atomicity: All operations in a transaction are either complete or not complete, and do not end up somewhere in the middle. If an error occurs during execution, the transaction is restored to the state it was in before the transaction started, as if the transaction had never been executed.
  • Consistency: The integrity of the database is not compromised before and after a transaction begins. Integrity, including foreign key constraints, application-defined constraints, etc., will not be broken.
  • Isolation: The ability of a database to allow multiple concurrent transactions to read, write, and modify its data at the same time. Isolation prevents data inconsistency due to cross-execution when multiple transactions are concurrently executed.
  • They look permanent: After the transaction ends, data changes are permanent and won’t be lost even if the system fails.

Distributed transaction

The inter-bank transfer service is A typical distributed transaction scenario. Suppose A needs to transfer the data of two banks to B, and the ACID of the transfer cannot be guaranteed through the local transaction of one database, but can only be solved through distributed transaction.

Distributed transaction means that the transaction initiator, resource and resource manager and transaction coordinator are located on different nodes of the distributed system. User A-100 and user B+100 are not on the same node. Essentially, distributed transactions are about ensuring that data operations are performed correctly in a distributed scenario.

Distributed transaction In a distributed environment, in order to meet the needs of availability, performance and degraded services, and reduce the requirements of consistency and isolation, on the one hand, BASE theory (related theory of BASE, which involves a lot of content, students interested in it can refer to the BASE theory) is followed:

Basic Availability Soft State Eventual consistency Similarly, distributed transactions follow the ACID specification in part:

Atomicity: strict compliance consistency: Strict compliance after transaction completion; Consistency within a transaction can ease isolation appropriately: parallel transactions cannot be affected; Visibility of transaction intermediate results allows security to relax persistence: strict compliance

A solution for distributed transactions

Two-phase commit /XA

XA is a distributed transaction specification proposed by the X/Open organization. The XA specification mainly defines the interface between (global) transaction manager (TM) and (local) resource manager (RM). Local databases such as mysql play the RM role in XA

XA is divided into two phases:

The first phase (prepare) : all participant RM’s prepare to execute the transaction and lock the required resources. When the participant is ready, report to TM that it is ready. Phase 2 (COMMIT/ROLLBACK) : After the transaction manager (TM) confirms that all participants (RM) are ready, the transaction manager (TM) sends the COMMIT command to all participants. At present, the mainstream database basically support XA transaction, including mysql, Oracle, SQLServer, PostGRE

An XA transaction consists of one or more resource managers (RM), a transaction manager (TM), and an ApplicationProgram (ApplicationProgram).

Using the above transfer as an example, the sequence diagram of a successfully completed XA transaction is as follows:

If any participant fails to prepare, TM informs all participants who completed the prepare to roll back.

The characteristics of XA transactions are:

  • Simple and easy to understand, easy to develop
  • The resource is locked for a long time and the concurrency is low

If you want to explore XA further, you can refer to DTM for the Go language and SEATA for the Java language

SAGA

Saga is one of the scenarios mentioned in the database paper Saga. The core idea is to split a long transaction into multiple local short transactions, which are coordinated by the Saga transaction coordinator, and if it ends normally it completes normally, and if a step fails, a compensation operation is invoked once in reverse order.

Using the above transfer as an example, the sequence of a successfully completed SAGA transaction is shown below:

Characteristics of SAGA transactions:

  • Concurrency is high, and resources are not locked for a long time like XA transactions
  • Normal operations and compensation operations need to be defined, with more development than XA
  • The consistency is weak. For the transfer, it may happen that user A has deducted money, and the transfer fails again at last

There are many contents of SAGA in this paper, including two recovery strategies, including concurrent execution of branch transactions. Our discussion here only covers the simplest SAGA

SAGA applies to many scenarios, such as long transactions and service scenarios that are not sensitive to intermediate results

If you want to explore SAGA further, you can see DTM for the Go language and SeATA for the Java language

TCC

Regarding the concept of TCC (try-confirm-cancel), It was first proposed by Pat Helland in a paper titled Life Beyond Distributed Transactions: An Apostate’s Opinion published in 2007.

TCC is divided into three phases

  • Try phase: Attempts to execute, completes all service checks (consistency), and reserves required service resources (quasi-isolation).
  • Confirm: The service resources reserved in the Try phase are used. The Confirm operation requires an idempotent design. If the Confirm operation fails, retry.
  • Cancel phase: Cancel the execution and release service resources reserved in the Try phase. The exception handling scheme of Cancel phase and Confirm phase is basically the same, which requires to meet the idempotent design.

Taking the above transfer as an example, usually the amount will be frozen in Try but not deducted, the amount will be deducted in Confirm and the amount will be unfrozen in Cancel. The sequence diagram of a successfully completed TCC transaction is as follows:

TCC features are as follows:

  • High concurrency and no long-term resource lock.
  • The Try, Confirm, and Cancel interfaces need to be provided.
  • The consistency is good, and there will be no failure of transfer after SAGA has deducted money
  • TCC is applicable to the order-type business that has constraints on the intermediate state

If you want to explore TCC further, you can refer to DTM for go and SEATA for Java

Local message table

The local message table solution was originally published by ebay architect Dan Pritchett to ACM in 2008. The core design is to ensure asynchronous execution of tasks requiring distributed processing by means of messages.

The general process is as follows:

Writing local messages and business operations are placed in one transaction, ensuring atomicity of business and messaging, and either they all succeed or they all fail.

Fault-tolerant mechanism:

  • When the balance reduction transaction fails, the transaction is rolled back without subsequent steps
  • Retries will be performed if the round-sequence production message fails or the transaction to increase the balance fails

Local message table features:

  • Long transactions simply need to be split into multiple tasks and are easy to use
  • Producers need to create additional message tables
  • Each local message table needs to be polled
  • If the consumer’s logic does not succeed with retries, then more mechanisms are needed to roll back and forth

Applicable to services that can be executed asynchronously and do not need to be rolled back in subsequent operations

Transaction message

In the above local message table scheme, the producer needs to create an additional message table and poll the local message table, resulting in heavy service burden. Alibaba’s open source RocketMQ version after 4.3 officially supports transaction messaging, which is essentially a local message table on RocketMQ to solve the problem of atomicity between message sending and local transaction execution on the production side.

Transaction message sending and committing:

  • Sending a message (half message)
  • The server stores the message and responds to the result of writing the message
  • Execute the local transaction according to the sending result (if the write fails, the half message is not visible to the business and the local logic is not executed)
  • Commit or Rollback based on the local transaction state (the Commit operation publishes the message, which is visible to the consumer)

The normal sending process is as follows:

Compensation process:

For pending transaction messages (messages with no Commit or Rollback status), the Producer initiates a “Rollback” from the server. After receiving the pending message, the Producer returns the status of the local transaction corresponding to the message, which is “Commit” or “Rollback”. The transaction message scheme is similar to the local message table mechanism. The main difference is that the relevant local table operation is replaced by a backcheck interface

Transaction message features are as follows:

  • Long transactions simply need to be split into multiple tasks and provide a backcheck interface that is easy to use
  • If the consumer’s logic does not succeed with retries, then more mechanisms are needed to roll back and forth

Applicable to services that can be executed asynchronously and do not need to be rolled back in subsequent operations

If you want to explore transaction messages further, you can refer to RocketMQ, and DTM also provides a simple implementation to help you learn about transaction messages

Best effort notice

The originator tries its best to notify the recipient of the result of the business processing through a certain mechanism. Specifically include:

There are certain message duplication notification mechanisms. Because the recipient of the notification may not have received the notification, there needs to be some mechanism for repeating the notification to the message. Message proofreading mechanism. If the best efforts are not notified to the recipient, or the recipient consumes the message and wants to consume it again, the recipient can take the initiative to query the message information from the notification party to meet the requirements. The local message table and transaction messages described earlier are both reliable messages. How is this different from the best effort notification described here?

Reliable message consistency: the originator needs to ensure that the message is sent out and sent to the recipient. The reliability of the message is guaranteed by the originator.

The notifying party tries its best to notify the business processing result to the notifying party, but the message may not be received. In this case, the notifying party needs to proactively invoke the interface of the notifying party to query the business processing result. The reliability of notification lies in the notifying party.

On the solution, the best effort notification needs:

  • Provides an interface for the notification receiver to query the business processing results through the interface
  • The MESSAGE queue ACK mechanism increases the message queue notification interval by 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 2 hours, 5 hours, and 10 hours until the notification interval reaches the upper limit. No further notice

Best effort notification applies to business notification types. For example, the result of wechat transaction is notified to each merchant through best effort notification, which includes both callback notification and transaction query interface

AT transaction mode

This is a transaction model in Alibaba’s open source project seata, also known as FMT at Ant Financial. The advantage of this transaction mode is that it is similar to THE XA mode, and the business does not need to write all kinds of compensation operations, and the rollback is automatically completed by the framework. The disadvantage is also similar to the AT mode, which has a long lock and does not meet the high concurrency scenarios. Those who are interested can refer to SEATa-AT

Exception handling in distributed transactions

Problems such as network and business failure may occur in each link of distributed transaction. These problems require the business side of distributed transaction to achieve the three features of anti-air rollback, idempotentality and anti-suspension

Abnormal situation

These exceptions are illustrated in terms of TCC transactions:

Empty rollback:

The two-phase Cancel method is called without calling the TCC resource Try method. The Cancel method needs to recognize that this is an empty rollback and return success directly.

The reason is that when the service where a branch transaction is located is down or the network is abnormal, the branch transaction invocation is recorded as failed. In fact, the Try phase is not executed at this time. When the fault is recovered, the two-phase Cancel method will be called when the distributed transaction is rolled back, thus forming an empty rollback.

Idempotent:

All distributed transaction branches need to be idempotent because any request may have network exceptions and repeat requests

Suspension:

Suspension is when, for a distributed transaction, the two-phase Cancel interface executes before the Try interface.

The reason is that when RPC calls branch transaction try, the branch transaction is registered first, and then the RPC call is executed. If the network of RPC call is congested at this time, TM will notify RM to roll back the distributed transaction after RPC timeout. Perhaps, the RPC request is executed by the participant only after the rollback is complete.

Let’s look at a sequence diagram of network exceptions to better understand the above problems

When service request 4 is processed, Cancel is executed before the Try. When empty rollback request 6 is processed, Cancel is executed repeatedly. When service request 8 is processed, Try is executed after Cancel, and suspension is processed

In the face of the above complex network exceptions, all the proposed schemes are that the business side queries whether the related operation has been completed through the unique key, and returns success directly if it has been completed. The related judgment logic is complex, error-prone, and the business burden is heavy.

Subtransaction barrier

In project DTM, a sub-transaction barrier technology was developed to achieve this effect, as shown in the diagram:

All of these requests get behind a subtransaction barrier: abnormal requests are filtered; Normal request, through the barrier. With the use of a subtransaction barrier by the developer, all of the aforementioned exceptions are handled properly, and the business developer only needs to focus on the actual business logic, greatly reducing the burden. Subtransaction barriers provide the ThroughBarrierCall method. The prototype of the method is:

func ThroughBarrierCall(db *sql.DB, transInfo *TransInfo, busiCall BusiFunc)
Copy the code

Business developers, writing their own logic in busiCall, call this function. ThroughBarrierCall Ensures that busiCall is not invoked in empty rollback or suspension scenarios. When the business is called repeatedly, there is idempotent control to ensure that it is only committed once.

Subtransaction barriers manage TCC, SAGA, XA, transaction messages, etc., and can be extended to other domains

Subtransaction barrier principle

The principle of transaction barrier technology is, in the local database, set up branch sub_trans_barrier transaction state table, the only key to global transaction id – child transaction id – the transaction branch name (try | confirm | cancel)

  • Open the transaction
  • If it is a Try branch, then insert ignore to insert gID-branchID-try, and if the insertion is successful, then intra-barrier logic is called
  • If it is a Confirm branch, insert ignore inserts gID-BranchID-confirm, and if the insertion is successful, intra-barrier logic is invoked
  • If it is the Cancel branch, insert ignore to insert gid-branchID-try, then insert GID-branchid-cancel, and if the try is not inserted and Cancel is inserted successfully, then intra-barrier logic is called
  • The barrier logic returns success, commits the transaction, and returns success
  • The logic inside the barrier returns an error, and the transaction is rolled back

Under this mechanism, the problems related to network exceptions are solved

  • Void compensation control — If the Try is not executed and Cancel is executed directly, then Cancel inserted into gID-BranchID-try succeeds, without going through the logic inside the barrier, ensuring null compensation control
  • Idempotent control – no unique key can be inserted repeatedly in any branch, ensuring that execution is not repeated
  • Anti-suspension control –Try is executed after Cancel, so if the gID-BranchID-try inserted is not successful, it is not executed, ensuring anti-suspension control

A similar mechanism exists for SAGA transactions.

Summary of subtransaction barriers

Invented transaction barrier technology, for the DTM, its significance lies in the design simple and easy to implement algorithms, provides a simple and easy to use interface, at first, its significance lies in the design simple and easy to implement algorithms, provides a simple and easy to use interface, with the help of the two, developers completely liberated from network exception handling.

The technology currently requires a DTM transaction manager, which the SDK is currently available to go developers. SDKS for other languages are being planned. For other distributed transaction frameworks, as long as appropriate distributed transaction information is provided, this technology can be implemented quickly according to the above principles.

conclusion

This article introduces some of the basic theories of distributed transactions, and explains the common distributed transaction schemes. In the second half of the article, it also gives the causes of transaction exceptions, classifications, and elegant solutions.

DTM supports TCC, XA, SAGA, transactional messaging, and maximum-effort notifications (implemented using transactional messaging), providing simple and easy-to-use access. Welcome to visit the DTM project, give a star support!