Distributed transaction

How should a system be architected to keep data consistent in distributed transaction scenarios? In my personal understanding, the solution comes down to one principle: big transaction = small transactions (atomic transactions) + asynchrony (messages). The best way to handle a distributed transaction is, in fact, not to use one at all: break the large business process into several small business processes, and then design compensation processes so that the system reaches eventual consistency.

What is a transaction

Transaction and its ACID properties

A transaction is a logical processing unit consisting of a set of SQL statements. A transaction has the following four properties, often referred to simply as the ACID properties of the transaction:

  1. Atomicity: A transaction is an atomic unit of operation in which all or none of the modifications to data are performed.

  2. Consistency: Data must be consistent at the beginning and at the completion of a transaction. This means that all relevant data rules must be applied to the transaction's modifications to preserve data integrity; at the end of the transaction, all internal data structures (such as B-tree indexes or doubly linked lists) must also be correct.

  3. Isolation: Database systems provide an isolation mechanism to ensure that transactions are executed in a "separate" environment that is not affected by external concurrent operations. This means that intermediate states during transaction processing are not visible to the outside world, and vice versa.

  4. Durability: After a transaction is complete, its changes to the data are permanent and are preserved even in the event of a system failure.

Typical scenario: bank transfer service

For example: there is 500 yuan in Li Lei's account and 200 yuan in Han Meimei's account, and Li Lei needs to transfer 100 yuan from his account to Han Meimei's. After the transfer (the transaction) executes successfully, Li Lei's account should be reduced by 100 yuan to 400 yuan, and Han Meimei's account should be increased by 100 yuan to 300 yuan. That is, the data must be in a consistent state at the beginning and end of the transaction (consistency), and all data and structures must be correct at the end of the transaction. In addition, the same transfer operation (that is, the same transfer record submitted more than once) must produce the same result no matter how many times it is performed (idempotence).

E-commerce scenario: traffic recharge business

Let's talk about a project we did: China Mobile – Traffic Recharge Capability Center, where "traffic" means mobile data. The core business process is as follows:

  1. The user opens the traffic recharge product page and selects a traffic product;

  2. The user purchases the traffic recharge product; if there is an inventory limit, the inventory is checked, and a traffic purchase order is generated;

  3. The user selects a payment method (bao, UnionPay, Alipay, WeChat) and pays;

  4. After successful payment, the traffic is credited to the account in near real time and can then be used.

This business process does not seem very complicated, does it? It does not involve physical goods the way an e-commerce business does, but I think the difference is small: apart from the logistics and delivery steps in e-commerce, the other processes are almost the same, including inventory, discounts, and other services.

The whole system interaction is shown as follows:



Distributed transaction

With the business requirements of the two scenarios above laid out, let's turn to distributed transactions. To do that, let's first compare local transactions and distributed transactions:

Similarities: both must first ensure data correctness (ACID). Local transactions and distributed transactions also correspond to rigid transactions and flexible transactions, respectively.

In my personal understanding, the biggest difference between a rigid transaction and a flexible transaction is whether a complete transaction can be performed on the same physical medium (e.g., memory) at the same time. A flexible transaction is one where the complete transaction must span physical media or physical nodes (network communication), so mechanisms such as exclusive locks and shared locks cannot be relied on to guarantee that the transaction completes atomically. Personally, I regard distributed (flexible) transactions as pseudo-transactions in essence. Flexible transactions use different methods to achieve eventual consistency depending on the business scenario, because trade-offs can be made according to the characteristics of the business, and data inconsistency can be tolerated for a certain period during the business process.


Alipay uses four flexible transaction implementation patterns for different business scenarios, as shown in the figure below:



  1. Two-phase type

  2. Compensation type

  3. Asynchronous guarantee type

  4. Best-effort notification type

Business scenarios of the traffic trading center

The system is split into microservices using Dubbo, roughly as follows:

  1. Product service

  2. Order service

  3. Inventory service

  4. Payment service

  5. Direct charge service

  6. Message service

  7. Other services

Scenario 1:

The inventory quantity must stay consistent with the order quantity. The compensation type plus the best-effort notification type is used here, because the flow does not cross machine rooms and is not a long transaction (under normal circumstances, the inventory and order services respond quickly):

  1. When a user places an order, inventory is deducted first; once the deduction succeeds, the flow continues;

  2. Call the order service:

     2-1. If the order is placed successfully, both transactions are committed;

     2-2. If the order fails, the inventory is rolled back and both transactions fail. There is also a safeguard here (best-effort notification type): if the call to the inventory service throws an exception and the inventory rollback is determined to have failed, a message is put into the message service's (delayed) message queue and retried at staged intervals, making a best effort to ensure that the inventory service eventually rolls back successfully.
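The steps above can be sketched roughly as follows. This is a simplified in-memory illustration, not the project's code: `Inventory`, `RetryQueue`, `place_order`, and the `fail_rollback` switch are hypothetical names invented for the example, and the list inside `RetryQueue` stands in for the delayed message queue.

```python
from dataclasses import dataclass, field

class OrderError(Exception):
    pass

class InventoryError(Exception):
    pass

@dataclass
class Inventory:
    stock: int
    fail_rollback: bool = False  # hypothetical switch to simulate a failed rollback call

    def deduct(self, qty):
        if qty > self.stock:
            raise InventoryError("insufficient stock")
        self.stock -= qty

    def rollback(self, qty):
        if self.fail_rollback:
            raise InventoryError("inventory service unavailable")
        self.stock += qty

@dataclass
class RetryQueue:
    """Stand-in for the delayed message queue used for best-effort retries."""
    pending: list = field(default_factory=list)

    def enqueue(self, qty):
        self.pending.append(qty)

    def drain(self, inventory):
        """Retry every pending rollback; keep the ones that still fail."""
        still_pending = []
        for qty in self.pending:
            try:
                inventory.rollback(qty)
            except InventoryError:
                still_pending.append(qty)
        self.pending = still_pending

def place_order(inventory, create_order, retry_queue, qty):
    inventory.deduct(qty)                # step 1: first small transaction
    try:
        return create_order(qty)         # step 2: second small transaction
    except OrderError:
        try:
            inventory.rollback(qty)      # 2-2: compensate the inventory deduction
        except InventoryError:
            retry_queue.enqueue(qty)     # safeguard: retry the rollback later
        return None
```

In a real deployment the rollback itself would need to be idempotent (for example keyed by an order or request ID), since a retried message may be delivered more than once.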

Scenario 2:

The asynchronous guarantee type is used here because the whole business link is long and spans systems in different machine rooms, network latency is high, and the business does not need strict real-time behavior, so small transactions plus asynchronous notification are used. Under normal circumstances, it takes about 1-5 minutes on average from the user placing an order and completing payment to the traffic being credited to the account:

  1. When the user places an order, the order service creates the order and sends a payment request to the payment gateway system (order status: to be paid. If payment is not made within 1 hour, the status moves to cancelled due to payment timeout; RocketMQ delayed consumption is used here to implement the timer).

  2. The payment page is returned to the user. After the user completes payment in the payment transaction system, the payment gateway asynchronously notifies the traffic center. On receiving the payment-success notification, the traffic center changes the order status to payment successful and returns a success result to the payment gateway (concurrency pressure is low here, so for now there is no asynchronous decoupling).

  3. After the order status is modified, the traffic center calls the message service to put the direct charge request into the message queue, decoupling the direct charge service. (The reason: direct charging has to call the China Mobile CRM systems of 31 provinces, the link is long, and the CRM systems of some provinces are slow. Processing capacity differs from province to province, and timeouts of more than 20 seconds are common. Provinces with frequent timeouts could therefore drag down the whole system, so the queue is used for peak shaving and valley filling.)

     3-1. When direct charging succeeds, change the order status to completed;

     3-2. When direct charging fails (for reasons specific to mobile service, for example the user has cancelled the account or the number is suspended), change the order status to waiting for refund and call the refund interface of the payment gateway system. After the refund succeeds, the payment gateway asynchronously notifies the traffic center, and the traffic center changes the order status to refund successful;

     3-3. When direct charging times out, the scheduled task service is invoked to implement a timeout retry mechanism (first retry after 10 minutes, second retry after 30 minutes, third retry ...). If no direct charging result has been obtained by the time the maximum number of retries is reached, the order stays in the payment successful state and relies on the T+1 reconciliation process to ensure eventual consistency. Based on the reconciliation result, the order status moves to completed, or to waiting for refund -> refund successful.
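The order status flow described in this scenario can be summarized as a small state machine. The sketch below is an illustration only: the `Status` names, the `ALLOWED` table, and `on_direct_charge_result` are hypothetical, and the retry schedule contains just the two delays the text states (10 and 30 minutes); the later delays are elided in the text, so treating the list as ending there is an assumption.

```python
from enum import Enum

class Status(Enum):
    TO_BE_PAID = "to be paid"
    CANCELLED = "cancelled (payment timeout)"
    PAID = "payment successful"
    COMPLETED = "completed"
    WAITING_REFUND = "waiting for refund"
    REFUNDED = "refund successful"

# Legal transitions, as read from the scenario description.
ALLOWED = {
    Status.TO_BE_PAID: {Status.PAID, Status.CANCELLED},
    Status.PAID: {Status.COMPLETED, Status.WAITING_REFUND},
    Status.WAITING_REFUND: {Status.REFUNDED},
}

def transition(current, target):
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target

# Assumption: only the first two delays are given in the text.
RETRY_DELAYS_MIN = [10, 30]

def on_direct_charge_result(status, result, attempt):
    """Return (new_status, minutes_until_next_retry or None).

    `attempt` is the number of retries already made (0 for the first call).
    """
    if result == "success":
        return transition(status, Status.COMPLETED), None
    if result == "failure":
        return transition(status, Status.WAITING_REFUND), None
    if result == "timeout":
        if attempt < len(RETRY_DELAYS_MIN):
            return status, RETRY_DELAYS_MIN[attempt]
        # Retries exhausted: the order stays as-is until T+1 reconciliation.
        return status, None
    raise ValueError(f"unknown result: {result}")
```

Keeping the legal transitions in one table makes it harder for a late or duplicated asynchronous notification to move an order backwards, for example from completed to payment successful.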

Scenario 3:

Once the traffic has been credited, the user is notified by an in-app (APP) message or an SMS push. The best-effort notification type is used. This scenario is simpler: after direct charging succeeds and the order status is set to completed, the notification is sent through the message service so that it is decoupled from the main business flow; if the call to the message service fails, a scheduled task is used to send the notification later.

Scenario 4:

Reconciliation:

Reconciliation runs daily on a T+1 basis, according to the payment date. The principle is that the payment transaction records are authoritative. Three sets of records are compared: traffic center order records, payment gateway transaction records, and provincial CRM recharge records. Orders left in an intermediate state (for example, payment successful or waiting for refund) are moved to a final state (completed or refund successful) once reconciliation finishes.
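A three-way comparison of this kind might look like the sketch below. The record shapes (plain dictionaries keyed by order ID), the status strings, and the decision rules are all assumptions made for illustration; the real reconciliation rules are not given in the text beyond "payment records are authoritative".

```python
def reconcile(orders, payments, charges):
    """Compare three record sets, driven by the payment records.

    orders:   {order_id: current order status}
    payments: {order_id: "paid" or "refunded"}      (payment gateway)
    charges:  {order_id: "success" or "failure"}    (provincial CRM)

    Returns (resolved, unresolved): the new status per order, and the
    order IDs that need manual follow-up.
    """
    resolved, unresolved = {}, []
    for order_id, pay in payments.items():   # payment records are authoritative
        status = orders.get(order_id)
        charge = charges.get(order_id)
        if status is None:
            unresolved.append(order_id)                  # paid, but no order found
        elif pay == "paid" and charge == "success":
            resolved[order_id] = "completed"
        elif pay == "refunded":
            resolved[order_id] = "refund successful"
        elif pay == "paid" and charge == "failure":
            resolved[order_id] = "waiting for refund"    # refund still to be issued
        else:
            unresolved.append(order_id)                  # e.g. no charge record yet
    return resolved, unresolved
```

Iterating over the payment records, rather than over the orders, is what makes the payment side the reference: an order with no payment record is simply not touched by this pass.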

Settlement audit:

Data that has been reconciled successfully enters the settlement process periodically. The payment amount recorded by the payment gateway for the cycle is checked against the amount in the settlement data. Once the audit passes, financial settlement is carried out: the money is settled to the provincial companies, and settlement details are provided so that each provincial company can check them against its own direct charge cost records.

The following is part of the architecture design of the traffic center. The guiding principle is microservices.

Traffic Center – Architecture design



Architecture design idea: given the early system design and some hard environmental constraints, we split the system by business into multiple subsystems (microservices): product service, order service, inventory service, payment gateway, unified interface platform, reconciliation service, settlement service, and gateway and docking services. Later additions include account service, virtual currency service, cards and coupons, and so on. Following the core idea of microservice design, every service is fully independent and isolated. Each service is therefore split from top to bottom into request access (connection management), request processing (computing), and data storage (storage service). Access and computing are kept stateless as far as possible, and data storage is split both vertically and horizontally. Vertical split: product database: MySQL (read-heavy, master-slave architecture + read/write separation) + Redis (read-heavy, cluster mode); order database: MySQL (balanced reads and writes, multi-master and multi-slave + horizontal split); dedicated inventory database: Redis (distributed + active-passive disaster recovery); external transaction system: payment gateway; external processing system: unified interface platform.

This architecture supports a total volume of 360 million, total orders of 46.8 million, 5 million, and an average daily trading volume of 500,000 orders. As volume keeps growing, the services will continue to be split along microservice lines; for example, the order service can be broken down further into order service and order checking, until the split reaches the finest granularity that business needs and system coupling allow.

  1. Performance expansion: the application-layer computing services (stateless applications) improve computing performance by adding service nodes. The quality (performance) monitoring service Dubbo Monitor is supported, and the Netflix Hystrix circuit breaker is integrated to manage service quality, so that the application layer can be scaled out and in dynamically.

  2. Capacity expansion: the data-layer storage services (stateful applications) can keep expanding capacity through horizontal data splitting. NoSQL solution: Codis middleware; relational database: Mycat sharding (sub-database, sub-table) middleware. The current project uses Twitter's Snowflake unique ID generator (optimized for the business scenario) to implement its own horizontal data split and routing rules.

  3. Storage performance: NoSQL: for read-heavy scenarios, Taobao's Tedis is used (multi-write, random read, to improve performance); for balanced reads and writes, Codis is used. MySQL: for read-heavy scenarios, a one-master, many-slave architecture is used (for example, product information); for balanced reads and writes, a multi-master, multi-slave architecture is used (for example, order information).
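To make the Snowflake-based split mentioned in item 2 more concrete, here is a rough sketch of a Snowflake-style ID generator with a shard routing function. It follows the commonly described layout of Twitter's Snowflake (timestamp, worker ID, 12-bit sequence), with the original 5-bit datacenter and 5-bit worker fields merged into one 10-bit worker ID for brevity. The project's business-specific optimizations and its actual routing rules are not described in the text, so `shard_for` and its CRC32-based rule are purely hypothetical.

```python
import threading
import time
import zlib

EPOCH_MS = 1_288_834_974_657          # epoch used by Twitter's original Snowflake
WORKER_BITS, SEQ_BITS = 10, 12
MAX_SEQ = (1 << SEQ_BITS) - 1

class Snowflake:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock = clock            # injectable millisecond clock, handy for tests
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.seq = (self.seq + 1) & MAX_SEQ
                if self.seq == 0:     # sequence exhausted: wait for the next millisecond
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.seq = 0
            self.last_ms = now
            return (((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                    | (self.worker_id << SEQ_BITS)
                    | self.seq)

def shard_for(order_id, shard_count):
    """Hypothetical routing rule: hash the ID, then take it modulo the shard count."""
    return zlib.crc32(str(order_id).encode()) % shard_count
```

The routing function hashes the ID instead of using `order_id % shard_count` directly, because the low bits of a Snowflake ID are the per-millisecond sequence, which is usually 0 under light load and would send most rows to the same shard.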

The overall splitting principle is shown below: