Preface

I don’t know if you have ever run into this: you go to a small shop, pay for something, but the shopkeeper is busy with other things, forgets you already paid, and asks you to pay again. Or, shopping online, you are told that no transaction took place even though you have already been charged. All of these situations are caused by the absence of transactions, which shows how important transactions are in everyday life. With a transaction, buying in a shop means that money and goods change hands together; with a transaction, shopping online means that deducting the money and creating the order happen together.

The specific definition of a transaction

Transactions provide a mechanism to bring all operations involved in an activity into one indivisible unit of execution. All operations that make up a transaction can be committed only if all operations are properly executed. Failure of any operation will result in a rollback of the entire transaction. Simply put, transactions provide an “All or Nothing” mechanism.
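As a concrete illustration, here is a minimal sketch of the “All or Nothing” behavior using Python’s built-in sqlite3 module (the table and account names are invented for this example):

```python
import sqlite3

# In-memory database with two accounts; a transfer must be all-or-nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside one transaction."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")   # simulated failure
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()            # all operations take effect together...
    except Exception:
        conn.rollback()          # ...or none of them do

transfer(conn, "alice", "bob", 60)   # succeeds
transfer(conn, "alice", "bob", 60)   # fails: rolled back, balances untouched
print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# → [('alice', 40), ('bob', 60)]
```

The second transfer partially executes (the debit runs) but the rollback erases it, so the failed attempt leaves no trace.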

Database local transactions

ACID

A database transaction has four properties, known as ACID:

  • A: Atomicity


All operations in a transaction either complete or do not complete and do not end somewhere in between. If a transaction fails during execution, it will be rolled back to the state before the transaction began, as if the transaction had never been executed.

Just like when you buy something: either you pay and receive the goods together, or the shop doesn’t ship and you get your money back.

  • C: Consistency


Transaction consistency means that the database must be in a consistent state both before and after a transaction is executed. If the transaction completes successfully, all changes in the system are applied correctly and the system is in a valid state. If an error occurs during a transaction, all changes in the system are automatically rolled back and the system is returned to its original state.

  • I: Isolation


In a concurrent environment, when different transactions manipulate the same data at the same time, each transaction has its own complete data space. Changes made by concurrent transactions must be isolated from changes made by any other concurrent transactions. When a transaction views a data update, the data is either in the state it was in before another transaction modified it, or in the state it was in after another transaction modified it, and the transaction does not see the data in the intermediate state.

For example, when you buy something, your purchase does not interfere with anyone else’s.

  • D: Durability


This means that once a transaction successfully commits, its updates to the database are persisted permanently. Even if a system crash occurs, the database can be restored on restart to the state it was in when the transaction successfully completed.

For example, when you buy something, the sale is recorded in a ledger, so it is not lost even if the shopkeeper forgets it.

InnoDB implementation principle

InnoDB is the default storage engine for MySQL, which most people are familiar with. Here is a brief introduction to the basic principles by which it implements database transactions.

Local transactions are managed by the database’s resource manager:

The ACID properties of transactions are guaranteed by InnoDB’s logging and locking. Isolation is implemented through database locks, durability through the redo log, and atomicity and consistency through the undo log.

The principle of the undo log is simple: to satisfy atomicity, before any data is modified it is first backed up to a place called the undo log, and only then is the data changed. If an error occurs, or the user issues a ROLLBACK statement, the system uses the backup in the undo log to restore the data to its state before the transaction began.

The redo log is the opposite of the undo log: it records a backup of the new data. Before a transaction commits, only the redo log needs to be persisted, not the data itself. If the system crashes, the data pages may not have been persisted, but the redo log has been, so the system can replay the redo log to restore all data to its latest state. Readers interested in the implementation details can look them up on their own.
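To make the two logs concrete, here is a toy in-memory sketch of the undo-log and redo-log ideas (an illustration only; InnoDB’s real logging is far more sophisticated):

```python
# A toy key-value store illustrating the undo-log and redo-log ideas.
class ToyDB:
    def __init__(self):
        self.data = {}        # the "on-disk" data
        self.undo_log = []    # old values, used to roll back
        self.redo_log = []    # new values, persisted before commit

    def write(self, key, value):
        # Undo log: back up the old value BEFORE touching the data.
        self.undo_log.append((key, self.data.get(key)))
        # Redo log: record the new value so a crash can replay it.
        self.redo_log.append((key, value))
        self.data[key] = value

    def rollback(self):
        # Restore old values in reverse order, as UNDO would.
        for key, old in reversed(self.undo_log):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.undo_log.clear()
        self.redo_log.clear()

    def recover(self, stale_data):
        # After a crash the data pages may be stale; replay the redo log.
        for key, value in self.redo_log:
            stale_data[key] = value
        return stale_data

db = ToyDB()
db.write("balance", 100)
db.write("balance", 40)
db.rollback()
print(db.data)            # → {} : as if the transaction never ran

db2 = ToyDB()
db2.write("balance", 40)          # redo log is recorded before commit...
recovered = db2.recover({})       # ...so a crash can replay it
print(recovered)          # → {'balance': 40}
```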

Distributed transaction

What are distributed transactions

A distributed transaction means that the transaction participants, the servers supporting the transaction, the resource servers, and the transaction manager are located on different nodes of a distributed system. Simply put, one large operation consists of several small operations that live on different servers and belong to different applications, and the distributed transaction must ensure that these small operations either all succeed or all fail. Essentially, distributed transactions exist to guarantee data consistency across different databases.

The causes of distributed transactions

Either a single service is deployed across multiple nodes, or a single resource is split across multiple nodes.

Multiple service nodes

With the rapid development of the Internet, service architectures such as microservices and SOA are used at large scale. Within a company, for example, a user’s assets may be divided into several parts, such as balance, points, and coupons, and it is quite possible that the points function is maintained by one microservice team while the coupons are maintained by another.

In this setup, after the points are deducted, there is no guarantee that the coupon deduction will also succeed.

Multiple resource nodes

Likewise, because the Internet has grown so fast, the MySQL data we manage is generally sharded across multiple databases and tables. Take Alipay’s transfer business: when you transfer money to a friend, your account may live in a database in Beijing while your friend’s account lives in one in Shanghai, so there is no way to guarantee that both updates succeed at the same time.


The foundation of distributed transactions

From the above, it is clear that distributed transactions emerged inevitably with the rapid development of the Internet. As noted earlier, the four ACID properties of a single database can no longer satisfy our distributed transactions, so new theories were proposed:

CAP

The CAP theorem is also known as Brewer’s theorem. For architects designing distributed systems (not just distributed transactions), CAP is the foundational theory.

  • C (consistency): For a given client, a read operation returns the latest write. For data distributed across different nodes, if an update on one node can be read as the latest data on every other node, this is called strong consistency; if some node cannot read the latest data, that is distributed inconsistency.
  • A (availability): Non-failing nodes return reasonable responses (neither errors nor timeouts) within a reasonable time. The two keys to availability are reasonable time and reasonable response. Reasonable time means a request cannot block indefinitely and must return within a bounded time. Reasonable response means the system definitely returns a result, and the result is correct: for example, it should return 50, not 40.
  • P (partition tolerance): The system keeps working after a network partition occurs. For example, in a cluster of several machines, one machine has a network problem, but the cluster as a whole still works.

Anyone familiar with CAP knows that you cannot have all three at once (those interested can look up the proof). In a distributed system the network can never be 100% reliable, so partitions are an inevitable phenomenon. If we chose CA and gave up P, then when a partition occurs we would have to refuse requests in order to preserve consistency, which contradicts A. Therefore, it is theoretically impossible for a distributed system to choose a CA architecture; only CP or AP architectures are possible.

CP means giving up availability in pursuit of consistency and partition tolerance; ZooKeeper, for example, pursues strong consistency.

AP means abandoning consistency (strong consistency, that is) in pursuit of partition tolerance and availability. This is the choice of many distributed system designs, and the BASE theory discussed later extends AP.

Incidentally, CAP theory ignores network latency: it assumes that when a transaction commits, the data is replicated from node A to node B instantly, which is obviously impossible in practice, so there is always some window of inconsistency. Also, choosing CP does not mean giving up A entirely. Because the probability of a partition is small, most of the time you still provide both C and A; even when a partition does appear, you prepare to restore A afterwards, for example by logging and replaying to the other machines.

BASE

BASE is an acronym for Basically Available, Soft state, and Eventually consistent. It is an extension of the AP option in CAP.

  1. Basically available: When a distributed system fails, it is allowed to lose part of its functionality in order to keep the core functions available.
  2. Soft state: The system is allowed to contain intermediate states that do not affect its overall availability. This corresponds to the inconsistency in CAP.
  3. Eventually consistent: The data on all nodes becomes consistent after some period of time.

BASE addresses CAP’s unrealistic assumption of zero network delay: it uses soft state and eventual consistency to guarantee consistency after a delay. BASE is the opposite of ACID: instead of ACID’s strong-consistency model, it sacrifices strong consistency for availability, allowing data to be inconsistent for a while as long as it eventually reaches a consistent state.
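A minimal sketch of what soft state and eventual consistency mean in practice, assuming a made-up primary/replica pair with asynchronous replication:

```python
from collections import deque

# Writes go to the primary and are shipped to replicas asynchronously.
# Until the pending updates are applied, the replica is stale (soft state);
# once they are applied, all nodes agree (eventual consistency).
class Primary:
    def __init__(self):
        self.data = {}
        self.replication_queue = deque()

    def write(self, key, value):
        self.data[key] = value
        self.replication_queue.append((key, value))  # replicate later

class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.data = {}

    def apply_pending(self):
        # Drain the replication queue; in real systems this runs async.
        while self.primary.replication_queue:
            key, value = self.primary.replication_queue.popleft()
            self.data[key] = value

primary = Primary()
replica = Replica(primary)
primary.write("points", 120)

print(replica.data.get("points"))   # → None : soft (stale) state
replica.apply_pending()             # replication catches up
print(replica.data.get("points"))   # → 120  : eventually consistent
```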

Distributed transaction solutions

With the above theoretical foundation in mind, this section introduces several common solutions for distributed transactions.

Do you really need distributed transactions?

Before discussing solutions, you must first ask whether you really need distributed transactions at all.

One of the two causes of distributed transactions mentioned above is having too many microservices. I have seen too many teams where a single team maintains several microservices, an over-designed setup that exhausts everyone, and too many microservices inevitably lead to distributed transactions. In that case my advice is not to adopt any of the solutions below, but to merge the microservices that need to share a transaction into a single service and use the database’s local transactions. Any of these solutions adds complexity to your system, and the cost is simply too high. Don’t introduce unnecessary cost and complexity in pursuit of a certain design.

If you do want to introduce distributed transactions, look at the following common solutions.

2PC

When talking about 2PC, we have to mention XA Transactions, the database specification for distributed transactions.

There are two stages in the XA protocol:

Phase 1: The transaction manager asks every database involved in the transaction to pre-commit the operation and report whether it can commit.

Phase 2: The transaction manager asks every database to commit or roll back the data.
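The two phases can be sketched with a toy coordinator and in-memory participants (all class and method names here are invented for illustration):

```python
# A toy two-phase commit over in-memory "databases".
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "idle"

    def prepare(self):                 # phase 1: pre-commit and vote
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):                  # phase 2: make the changes durable
        self.state = "committed"

    def rollback(self):                # phase 2: undo the prepared changes
        self.state = "rolled_back"

def two_phase_commit(participants):
    # Phase 1: every participant must vote yes (short-circuits on first no).
    if all(p.prepare() for p in participants):
        for p in participants:         # Phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:             # Phase 2: roll back everywhere
        p.rollback()
    return "rolled_back"

ok = two_phase_commit([Participant("beijing"), Participant("shanghai")])
print(ok)    # → committed
bad = two_phase_commit([Participant("beijing"),
                        Participant("shanghai", can_commit=False)])
print(bad)   # → rolled_back
```

Note that this sketch hides exactly what makes real 2PC hard: if the coordinator dies between the two loops, prepared participants stay blocked.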

Advantages: it guarantees strong data consistency as far as possible, and the implementation cost is low since all the major mainstream databases provide their own implementations (MySQL has supported it since 5.5).

Disadvantages:

  • Single point of failure: The transaction manager plays a critical role in the entire process. If it goes down, for example after completing phase 1 but just before committing phase 2, the resource managers will block indefinitely, making the databases unusable.
  • Synchronous blocking: After the prepare step, the resources held by a resource manager remain blocked until the commit completes and releases them.
  • Data inconsistency: Although the two-phase commit protocol is designed for strong consistency of distributed data, data inconsistency is still possible. For example, in the second phase, suppose the coordinator sends the transaction COMMIT notification, but due to network problems only some of the participants receive it and execute the commit. The remaining participants stay blocked because they were never notified, resulting in inconsistent data.

Overall, the XA protocol is relatively simple and cheap to implement, but its single point of failure and its inability to support high concurrency (due to synchronous blocking) remain its biggest weaknesses.

TCC

The concept of TCC (Try-Confirm-Cancel) was first proposed by Pat Helland in the 2007 paper Life Beyond Distributed Transactions: An Apostate’s Opinion. The TCC transaction mechanism addresses several of XA’s disadvantages described above:

  1. The coordinator’s single point of failure is resolved: the business activity is initiated and completed by the main business service, and the business activity manager becomes multi-node by introducing clustering.
  2. Synchronous blocking: a timeout is introduced, and the system compensates after the timeout. Entire resources are no longer locked; locking is converted into business logic with finer granularity.
  3. Data consistency: with the compensation mechanism, consistency is controlled by the business activity manager.

Explanation of TCC:

  • Try phase: Attempt execution. Complete all business checks (for consistency) and reserve the necessary business resources (for quasi-isolation).
  • Confirm phase: Confirm execution, using only the business resources reserved during the Try phase. The Confirm operation must be designed to be idempotent; if Confirm fails, it is retried.
  • Cancel phase: Cancel execution and release the business resources reserved in the Try phase. Exception handling in the Cancel phase is basically the same as in the Confirm phase.

For a simple example, suppose you buy a bottle of water for $100. In the Try phase, you check that your wallet has $100 and lock that $100; the shop does the same for the bottle of water.

If either Try fails, Cancel is performed (releasing the $100 and the bottle of water). If Cancel itself fails, it is retried no matter what, so Cancel must be idempotent.

If both succeed, Confirm is performed: the $100 is deducted and the bottle of water is sold. If Confirm fails, it is retried (relying on the activity log).
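The whole Try-Confirm-Cancel flow for this example can be sketched as follows (a toy model; real TCC frameworks persist reservations and drive retries from an activity log):

```python
# Toy TCC for "pay $100 for a bottle of water":
# Try reserves resources, Confirm consumes them, Cancel releases them.
class Wallet:
    def __init__(self, balance):
        self.balance, self.frozen = balance, 0

    def try_reserve(self, amount):      # Try: check and lock the money
        if self.balance - self.frozen < amount:
            return False
        self.frozen += amount
        return True

    def confirm(self, amount):          # Confirm: spend what was frozen
        self.balance -= amount
        self.frozen -= amount

    def cancel(self, amount):           # Cancel: release the frozen money
        self.frozen = max(0, self.frozen - amount)   # safe to repeat

class Stock:
    def __init__(self, count):
        self.count, self.locked = count, 0

    def try_reserve(self):              # Try: check and lock one bottle
        if self.count - self.locked < 1:
            return False
        self.locked += 1
        return True

    def confirm(self):                  # Confirm: the bottle is sold
        self.count -= 1
        self.locked -= 1

    def cancel(self):                   # Cancel: unlock the bottle
        self.locked = max(0, self.locked - 1)

wallet, stock = Wallet(100), Stock(10)
if wallet.try_reserve(100) and stock.try_reserve():
    wallet.confirm(100)     # both Try calls succeeded: Confirm everywhere
    stock.confirm()
else:
    wallet.cancel(100)      # some Try failed: Cancel releases reservations
    stock.cancel()
print(wallet.balance, stock.count)   # → 0 9
```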

TCC is suitable for:

  • Business activities with strong isolation and strict consistency requirements.
  • Business operations that execute quickly.

Local message table

The core of this solution is to execute tasks that require distributed processing asynchronously through message logging. Message logs can be stored in local files, databases, or message queues, and retries are then initiated automatically or manually according to business rules. Manual retries are more common in payment scenarios, where a reconciliation system handles problems after the fact.


The core idea of the local message table is to break one large transaction into small transactions. Take buying a bottle of water for $100 again:

1. When deducting the money, add a local message table on the deduction server: write the money deduction and a “decrease water inventory” message into the local message table within the same transaction (relying on the database’s local transaction for consistency).

2. A scheduled task polls the local message table and sends any unsent messages to the commodity inventory server, asking it to decrease the water inventory. When the message reaches the commodity server, that server first writes to its own message table, then performs the deduction, and after the deduction succeeds it updates the status in its message table.

3. The commodity server, via a scheduled scan of its message table or a direct notification, tells the deduction server to update the status in the deduction server’s local message table.

4. To handle abnormal situations, regularly scan for unsent or failed messages and resend them. When the commodity server receives a message, it first checks whether it is a duplicate: if the message was already received, it checks whether it was already executed, and if so it immediately updates the status and notifies the sender; if not, it executes it. The business logic must guarantee idempotence, that is, it must not deduct a second bottle of water.
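Steps 1-4 can be sketched with two in-memory SQLite databases standing in for the two servers (the table names and polling function are invented for illustration):

```python
import sqlite3

# Deduction server: the money deduction and the outgoing message are
# written in ONE local transaction, so either both exist or neither does.
pay_db = sqlite3.connect(":memory:")
pay_db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
pay_db.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, body TEXT, status TEXT)")
pay_db.execute("INSERT INTO account VALUES (1, 100)")
pay_db.commit()

# Commodity server: keeps a "handled" table so redeliveries are idempotent.
stock_db = sqlite3.connect(":memory:")
stock_db.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, count INTEGER)")
stock_db.execute("CREATE TABLE handled (msg_id INTEGER PRIMARY KEY)")
stock_db.execute("INSERT INTO stock VALUES ('water', 10)")
stock_db.commit()

# Step 1: deduct the money and enqueue the message in the same transaction.
pay_db.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
pay_db.execute("INSERT INTO message VALUES (1, 'decrease water', 'PENDING')")
pay_db.commit()

def poll_and_deliver():
    """Steps 2-4: deliver pending messages; the receiver deduplicates."""
    pending = pay_db.execute(
        "SELECT id FROM message WHERE status = 'PENDING'").fetchall()
    for (msg_id,) in pending:
        already = stock_db.execute(
            "SELECT 1 FROM handled WHERE msg_id = ?", (msg_id,)).fetchone()
        if already is None:                    # idempotence: handle once only
            stock_db.execute(
                "UPDATE stock SET count = count - 1 WHERE item = 'water'")
            stock_db.execute("INSERT INTO handled VALUES (?)", (msg_id,))
            stock_db.commit()
        pay_db.execute(
            "UPDATE message SET status = 'DONE' WHERE id = ?", (msg_id,))
    pay_db.commit()

poll_and_deliver()
poll_and_deliver()   # a redelivery must not deduct a second bottle
print(pay_db.execute("SELECT balance FROM account").fetchone()[0])   # → 0
print(stock_db.execute("SELECT count FROM stock").fetchone()[0])     # → 9
```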

The local message table follows the BASE theory and its eventual consistency model, and suits scenarios with low consistency requirements. When implementing it, pay close attention to making retries idempotent.

The MQ transaction

RocketMQ implements distributed transactions that are, in effect, an encapsulation of the local message table: the local message table is moved into MQ. Below is a brief introduction to MQ transactions; for details you can refer to: https://www.jianshu.com/p/453c6e7ff81c.

The basic process is as follows:

Phase 1: a Prepared message is sent, and the sender obtains the message’s address.

Phase 2: the local transaction is executed.

Phase 3: the message is accessed through the address obtained in phase 1 and its state is updated, after which the message becomes visible to the receiver.

If the confirmation message fails, the RocketMQ broker periodically scans for messages whose status has not been updated. When it finds an unconfirmed message, it sends a check-back to the sender asking whether to commit it; in RocketMQ, the sender handles this check-back by registering a listener.

If consumption times out, it keeps being retried, so the message receiver must be idempotent. If message consumption ultimately fails, it is handled manually: the probability is low, and it is not worth designing a complicated automatic process for such a low-probability event.
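The half-message flow and the check-back scan can be modeled with a toy broker (this mimics the idea only and is not the real RocketMQ API):

```python
# A toy model of transactional messages: a "half" message is invisible to
# consumers until the sender's local transaction result confirms it, and
# unconfirmed messages are checked back with the sender.
class Broker:
    def __init__(self):
        self.half = {}        # msg_id -> body, not yet consumable
        self.ready = []       # messages visible to consumers

    def send_half(self, msg_id, body):      # phase 1: store the half message
        self.half[msg_id] = body

    def confirm(self, msg_id):              # phase 3: make it consumable
        self.ready.append(self.half.pop(msg_id))

    def rollback(self, msg_id):             # local transaction failed: drop it
        self.half.pop(msg_id, None)

    def check_back(self, ask_sender):
        # Periodic scan: ask the sender what happened to unconfirmed messages.
        for msg_id in list(self.half):
            if ask_sender(msg_id):
                self.confirm(msg_id)
            else:
                self.rollback(msg_id)

broker = Broker()
committed_local_txns = set()

broker.send_half(1, "decrease water")   # phase 1: half message stored
committed_local_txns.add(1)             # phase 2: local transaction commits
# Suppose the phase-3 confirmation is lost; the broker's scan recovers it:
broker.check_back(lambda msg_id: msg_id in committed_local_txns)
print(broker.ready)   # → ['decrease water']
```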

Saga transaction

Saga is a concept proposed in a database paper thirty years ago (“Sagas”, 1987). The core idea is to split a long transaction into a number of short local transactions, coordinated by a Saga transaction coordinator. If every local transaction completes normally, the Saga completes normally; if any step fails, the compensating operations are invoked one by one in reverse order. Composition of a Saga:

Each Saga is composed of a series of sub-transactions Ti, and each Ti has a corresponding compensating action Ci that undoes the effects of Ti. Each Ti here is a local transaction. As you can see, compared with TCC, a Saga has no “Try” reservation step: each Ti simply commits directly to the database.

There are two sequences of Saga execution:

T1, T2, T3, …, Tn

T1, T2, …, Tj, Cj, …, C2, C1, where 0 < j < n

Saga defines two recovery strategies:

Backward recovery corresponds to the second execution order above, where Tj is the sub-transaction in which the error occurred. This approach undoes all the previously successful sub-transactions, so the effect of the whole Saga is reverted. Forward recovery, for scenarios that must succeed, executes in a sequence like T1, T2, …, Tj (failed), Tj (retried), …, Tn, where Tj is the failed sub-transaction; no Ci is needed in this case.

Note that Saga mode does not guarantee isolation: because resources are not locked, other transactions can still modify or observe the data the current transaction is working on.

Let’s take the $100-bottle-of-water example again, defined here as:

T1 = deduct $100; T2 = add a bottle of water to the user; T3 = decrease the stock by one bottle

C1 = refund the $100; C2 = subtract one bottle of water from the user; C3 = add one bottle back to the stock

We execute T1, T2, T3 in order. If something goes wrong, we execute the compensating C operations in reverse, starting from the failed step. The isolation problem mentioned above can appear here: if a rollback happens at T3 but the user has already drunk the water (in another transaction), the rollback will find that it cannot take a bottle of water back from the user. This is the consequence of having no isolation between transactions.
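The execute-then-backward-recover flow can be sketched as a small Saga executor (a toy model; real Saga coordinators persist the log of completed steps):

```python
# A toy Saga executor: run T1..Tn in order and, on failure at step j,
# run the compensations C(j-1)..C1 in reverse order.
def run_saga(steps):
    """steps: list of (transaction, compensation) pairs."""
    done = []
    for txn, comp in steps:
        try:
            txn()
            done.append(comp)
        except Exception:
            for c in reversed(done):    # backward recovery
                c()
            return "compensated"
    return "completed"

state = {"money": 100, "user_water": 0, "stock": 10}

def t1(): state["money"] -= 100               # T1: deduct $100
def c1(): state["money"] += 100               # C1: refund the $100
def t2(): state["user_water"] += 1            # T2: give the user a bottle
def c2(): state["user_water"] -= 1            # C2: take the bottle back
def t3(): raise RuntimeError("out of stock")  # T3 fails partway through

result = run_saga([(t1, c1), (t2, c2), (t3, lambda: None)])
print(result, state)
# → compensated {'money': 100, 'user_water': 0, 'stock': 10}
```

Note that c2 blindly subtracts a bottle; if another transaction already consumed it, the compensation would corrupt the state, which is exactly the isolation problem described above.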

As you can see, the lack of isolation in Saga mode can have a serious impact. We can refer to Huawei’s solution: at the application level, add a session and locking mechanism to ensure serialized access to resources; at the business level, isolate resources by, for example, freezing funds in advance; and during business operations, read the current state promptly to obtain the latest updates.

For a concrete implementation, see Huawei’s ServiceComb.

Final thoughts

To repeat: don’t use distributed transactions if you can avoid them. If you must, analyze your own business to decide which solution fits best and whether you need strong consistency or eventual consistency. The solutions above are only brief introductions; implementing any of them for real involves many more considerations and significant complexity, so first judge carefully whether distributed transactions are needed at all. Finally, here are some review questions; the answers can all be found in the article:

  1. Are the C and A in ACID the same as the C and A in CAP?
  2. What are the advantages and disadvantages of the common distributed transaction solutions? Which scenarios does each suit?
  3. Why do distributed transactions arise? What pain points do they address?
