This article was first published on the wechat public account “Shopee Technical Team”.

Abstract

Seadt is a distributed transaction solution provided by Shopee Financial Products team for real business scenarios using Golang.

Golang currently has no mature middleware or components to support distributed transactions. How to quickly support business requirements and simultaneously solve distributed transactions is a thorny problem faced by business teams.

This will be a series of articles, divided into chapters. This introductory article will provide an overview of the distributed transaction problems encountered in financial services and detail team selection and implementation.

1. Challenges of distributed transactions

The Shopee Financial Products team, which provides Financial services such as credit in Southeast Asia and beyond to both C and B end users, has launched in several markets.

Financial business, the biggest challenge is the accuracy of data requirements are very high. Users take out loans on the platform, and we have to make sure that every loan is accurate, no more, no less. More will infringe on the user’s property, less will cause the company’s capital loss. How to ensure the accuracy of data, in addition to the accuracy of system computing algorithm itself, the biggest challenge is the transaction processing brought by the system due to distribution.

1.1 Distributed transaction challenges

For example, if a user applies for a loan and the system returns that the loan application is successful, what happens in the background system? Let’s start with Cashloan’s system architecture (uncorrelated partial desensitization) :

A lot of things happen in Cashloan, such as freezing coupons, freezing limits, calling in an external payment gateway to make a loan, waiting for the result of the loan, post-processing the result of the loan, and so on.

These processes are orchestrated by the Cashloan-Transaction management module. Although the internal processing is a lot, divided into a lot of interfaces, from the user’s point of view, the subsequent processing of the loan is a transaction, or the loan succeeds, or the loan fails. The consistency of the data states between the various systems within the success/failure of the loan is guaranteed by the Cashloan-Transaction module.

In the loan business scenario, there are many transaction issues involved: 1) how to ensure the consistency of status between freezing coupons and freezing limits (both successful and unsuccessful); 2) If the payment gateway succeeds/fails, how to ensure the consistency of subsequent processing status; 3) Others.

This paper mainly explains the first problem, that is, how to ensure the consistency of frozen coupons and frozen quotas. There will be other articles explaining solutions to the remaining issues, so stay tuned.

I believe the first solution that many students think of is to use SEATA. But our development language is Golang and we can’t use Seata directly.

Hence the choice of technology for this article, including the choice of existing open source middleware or self-development. Of course, before that, our team was more concerned about which model was more suitable for our business. For example, SEATA provides four modes: TCC mode, Saga mode, AT mode, XA mode. Each of these patterns has its own scenarios, but the first step is to prioritize.

1.2 Model selection: TCC

Combined with the current situation of the project, our team did some research and analysis.

AT mode TCC mode Saga mode Existing implementations
The data link link AT link TCC link Saga Internal document
Principle that Record the mirror before and after data changes at the framework level and perform redo and undo operations on the application side The service provider provides a phase 2 interface to TCC The business side provides a phase forward service and its corresponding flush service To achieve the final consistency in the business logic, take into account the timeout/failure/downtime of each RPC call, and realize self-compensation recovery/rollback logic in the business
Applicable scenario AT mode (Refer to the link) based on a relational database that supports local ACID transactions:

1) Phase-one prepare behavior: In the local transaction, service data updates and corresponding rollback logs are submitted

2) Phase-2 COMMIT behavior: The rollback logs are automatically and asynchronously cleared in batches after the logs are successfully completed immediately

3) Two-phase rollback: Data rollback is completed by automatically generating compensation operations based on the rollback logs
TCC mode, independent of transaction support for underlying data resources:

1) One-stage prepare behavior: invoke the customized prepare logic

2) Two-stage COMMIT behavior: call custom commit logic

3) Two-stage ROLLBACK behavior: Invoke custom ROLLBACK logic



The TCC pattern supports the integration of custom branch transactions into the management of global transactions
1) Long and multiple business processes

2) Participants include other corporate or legacy system services that cannot provide the three interfaces required by the TCC pattern
Simple business scenario with only local and one RPC write call
advantage No service intrusion 1) One-stage local transaction submission, no lock, high performance

2) Good scalability, no extra development cost for adding one participant

3) easy to understand

4) Complexity of external shielding
1) One-stage local transaction submission, no lock, high performance

2) Event-driven architecture, participants can execute asynchronously with high throughput

3) Compensation services are easy to implement
Easy to understand
insufficient 1) Ensure isolation, but lock for a long time

2) The implementation is complex, heavily dependent on the database, and need to do different processing according to the database transaction type, prone to error
1) Services are invaded and intermediate states need to be provided

2) High requirements for business development
1) Isolation is not guaranteed

2) Some services are invaded
1) It is difficult to maintain and has poor expansibility

2) Each scene is handled differently and has low replicability

1.2.1 the AT mode

The loan process of introducing AT mode is as follows:

Although this mode has no impact on services, the impact point is minimal. However, the resource lock time in this mode is long, which may bring disastrous consequences to intermediate concurrency and is difficult to implement. There are difficulties in troubleshooting and recovery, and the team does not have the experience and knowledge in this area.

Therefore, the AT mode is not selected.

1.2.2 Saga mode

The Saga mode is introduced into the loan process as follows:

The service process is long and has high requirements on each service node. The rollback interface needs to be supported. However, it is doubtful whether the rollback interface can be supported in the service. For example, if this mode is used in the loan, it is unknown whether the coupon/capital side can support the rollback (there may be a long time span of rollback, for example, the processing after the payment is passed will take several days), and the amount rollback of other attached businesses is troublesome, so it needs to go offline to the user.

Therefore, Saga mode is not supported.

1.2.3 Service self-realization

The current business implementation logic of the loan process is as follows:

The first step “deduction amount processing” is explained in the following sequence:

Note: In order to ensure the consistency of the amount deduction and the state of the entire user operation (that is, the amount of deduction will be frozen only when the user’s loan is successful; if the user’s loan is not successful, the amount will not be frozen and deducted eventually), processing ①~⑦ is done.

Among them:

  • (1) Adding delay queue to restore quota operation before freezing quota is to avoid (2) application downtime after freezing quota invocation, and special processing of freezing quota cannot be unfrozen;
  • (3) After the frozen quota is successfully frozen, the delay queue for resuming frozen quota should be deleted to avoid the user’s loan being successfully restored without the deduction of frozen quota;
  • ⑥ Use reliable events to ensure the final execution and success.

Thus, the handling of exception scenarios is integrated into the business code, resulting in extremely complex business processes that cannot be extended and reused. For example, the realization of loan, in the repayment scenario, can not completely copy the past, still need to consider the particularity of repayment. In the loan scenario, if other external calls are added, the consistency of the whole process needs to be reconsidered. As shown in the figure above, only the consistency processing of frozen credit is implemented, and other business processes cannot be implemented in the same way.

Therefore, business self-implementation is not selected.

1. TCC mode

The process of introducing TCC mode into the loan process is as follows (same as AT) :

Intrusion into the business, high requirements for development level, need to consider several exceptions and deal with. Fortunately, there is a unified solution template for these abnormal situations to reduce the probability of problems.

Hence the TCC mode.

1.3 Technology selection: self-developed

After the model is determined, we will consider how to select the technology. We made four kinds of comparison: Seata-Golang, Seata-Go-Server, other internal system practice (no public comparison), and self-research.

Many dimensions were considered in the research process: maintenance status, support team, community building, maturity, documentation, functionality, License, maintainability, integration, team intent, etc.

The following analysis is based on the 2021-6-25 survey data.

1.3.1 seata – go – server

  • Maintenance status: Stop updating (last updated)
  • Maturity: No release version
  • Documents: no

It has not been updated for 2 years, so it is not considered.

1.3.2 seata – golang

  • Maintenance status: Under maintenance with few updates (recently active with continuous updates)
  • Maturity: No release version
  • Support team: currently only one
  • Documentation: There is no valid documentation, all are the documentation of Java version of SEATA
  • Functionality: TCC mode only (Saga mode after December)
  • Integration: It integrates THE ORM framework Go-SQL-driver /mysql, which conflicts with the ORM currently used by the team
  • License: Apache License 2.0
  • Development mode:
    • Development of original branch: jointly develop with SEATA team, send merge request for content that needs to be changed, which needs review by Alibaba/OpenTRX team.
    • New branch development: it needs to be hosted in GitHub’s OpenTrx repository. Due to uncontrolled development permissions, it may be changed by others. Benefits: Changes made by the OpenTrX team can be easily synchronized to branches.
    • New warehouse: an independent set, placed in the Intranet GitLab. Disadvantages: OpenTrX changes cannot be updated synchronously; Many of openTrX’s reserved features are difficult to understand and change.

Despite continuous updates, Cashloan does not have a stable release or any commercial application, so its current business volume is high and its direct use is risky. So I don’t think about it.

1.3.3 itself

  • Maintenance status: the project does not fall, the maintenance does not stop
  • Support team: at least 3 people
  • Documentation: at present, there are rich design documents, internal sharing documents, and will continue to build
  • Functionality: TCC mode is supported before Saga mode
  • Integration: Based on the financial team technology stack development, no integration problems
  • Team intention: High team intention, able to solve a large number of distributed scene problems within the department

Cashloan has been subject to strong regulatory constraints in various markets. For example, encrypt sensitive database information and desensitize logs. It’s harder for external teams to do special releases for this. Cashloan’s code hosting is a regulatory risk if it is in an external repository.

Self-research can design support for the above problems, but the disadvantage is that extra manpower needs to be invested in the design and development from scratch, and many problems may be encountered after the launch.

Feasibility analysis:

  • Technology: The team has strong technical reserve, is proficient in TCC, and has development experience in similar framework components (last year, the team made reliable event components and applied them in a large number of scenarios after they were launched);
  • Application: As a business team, the team has a large number of business scenarios that can be applied to ensure landing;
  • Input-output: the input-output ratio is considerable. In realizing the consistency of loan process, the team invested 1 person/month in design and development joint investigation. In the following six months, 2 people/month were successively invested in loan demand change. There are as many as five similar scenes in our team. Meanwhile, about 2 people/month will be invested in self-research and development, and the scene of docking and joint adjustment will be transformed with 0.25 people/month, with no maintenance cost for subsequent business.

In addition, self-research has the following benefits:

  • Solve Cashloan’s own business issues: loan/repayment processes, etc.
  • Improve system maintainability and scalability;
  • Comply with applicable principles and meet current business requirements;
  • In accordance with the principle of simplicity, the implementation of the first stage is simple and clear, to ensure that team members can accept and understand, improve maintainability;
  • In line with the evolution principle, the evolution iteration expansion of other functions is reserved to facilitate the demand challenges brought by future business changes;
  • It can be used as a capability output to improve the efficiency of the entire financial team and solve the common problems of distributed transactions.

2. Seadt-tcc design and implementation

Since we have decided to develop our own TCC, we must first consider the architecture design of TCC. There are two design schemes, one is pure SDK mode (SDK mode), the other is SDK+ independent central service (TC global mode).

To make the difference, consider the structure of the distributed transaction component:

  • TM (Transaction Manager) : client SDK to start/end distributed transactions;
  • Resource Manager (RM) : client SDK to manage local resources.
  • A Transaction Coordinator (TC) is a client SDK or server that manages global and branch Transaction status and promotes Transaction execution.

SDK mode: TM and TC together.

TC global mode: TCS are deployed as services.

Considering the future use of nested transactions, convenient unified monitoring and management, as well as Saga mode support, we chose TC global mode.

The interaction of the global pattern is as follows:

The distributed Transaction is initiated by the Transaction module of the business side, and the TC interacts with each module to promote the execution of the whole Transaction.

2.1 Cashloan’s new system architecture

The overall architecture after the introduction of SEADT is as follows:

Each business system module introduces seADT-SDK (using distributed transactions) on demand, and all systems can share the same SET of TC services.

After the introduction of SEADT-SDK, each business system module can still be extended horizontally, and also supports sub-library and sub-table.

As a common service, SEADT-TC can easily become a new bottleneck, so it is designed to be highly available. It can be horizontally scaled, divided into libraries and tables, allowing multi-tenant mode to be shared, and allowing services to be deployed in physical isolation.

The business system module introduces the seADT-SDK structure as follows (using the Transaction module in the example) :

The Transaction module introduces the SEADT-SDK, which contains the Transaction manager TM and resource manager RM, as well as the reliable event manager Reliable_Event (local message mode ensures final consistency).

2.2 TC Global Mode

TC global mode, support for TCC:

  • The initiator TM registers the global transaction with TC.
  • The initiator freezes the limit and the coupon;
  • Participant RM registers branch transactions;
  • Participants execute the one-stage Try method to freeze coupons.
  • The initiator RM performs local service processing.
  • The initiator TM commits the global transaction;
  • TC performs two-stage Confirm/Cancel.
  • The participant RM performs two-stage Confirm/Cancel to perform actual service processing.

TC Global mode, support for Saga (not detailed in this issue) :

2.3 State machine design

There are two core state machines in distributed transaction: primary transaction state machine and branch transaction state machine.

TM and RM state matrices (rows represent primary transactions and columns represent branch transactions) :

Branch, the main There is no Prepared Committing Committed Rollbacking Rollbacked
There is no Y N N Y N
Prepared N Y N N Y N
Tried N Y Y N Y N
Confirmed N N Y Y N N
Canceled N N N N Y Y

Note: There is a special case where the master is in Rollbacking and the branch transaction may be in Prepared.

The corresponding scenarios are as follows: When TM sends a phase T request to multiple participants, if one business execution fails, the branch will stay in the Prepared state. When TM receives the T failure, it will immediately enter the Rollback process and broadcast phase 2 Cancel to all participants. So the main transaction is in Rollbacking and the branch transaction is in Prepared.

In this scenario, there are no data, Prepared, Tried, and Canceled states because multiple participants have different processing speeds and results.

This special situation also occurs between RM and Canceled, with one saying Canceled and the other saying Canceled without data, Prepared, Tried, and Canceled. But it’s impossible to have one Canceled and the other Confirmed.

RM and RM state matrix:

Branch \ branch There is no Prepared Tried Confirmed Canceled
There is no Y Y Y N Y
Prepared Y Y Y N Y
Tried Y Y Y Y Y
Confirmed N N Y Y N
Canceled Y Y Y N Y

2.4 Detailed Process

2.4.1 Glossary description

  • Commit, Rollback: the status of the entire distributed transaction and the status of the branch transaction.
  • Confirm and Cancel: Confirm and Cancel are used only when the participant interface is invoked.
  • Commit, rollback: specifies the transaction operation methods of the underlying code.

2.4.2 TCC Service Processing

The TCC pattern of SEADT is designed to make external calls to business processes as simple as local transactions. After using SeADT, services are processed as follows:

2.4.3 TCC Commit processing in SDK

1) First look at business startup distributed transaction, processing in SDK:

The SDK provides a new transaction template sdK-TT that registers transaction triggers throughout the TCC transaction. For details about transaction triggers in transaction templates, see 2.4.5 Transaction Templates.

2) Business adjustment external Try interface, SDK internal implementation.

3) The business goes through the Commit process, and the SDK is implemented internally.

In the figure above, tX-global represents the distributed transaction started by the service. 4.1.3 Activity is set to Commit, and the distributed transaction started by the service must be in the same transaction. If a new transaction tX-SUB is enabled, the global transaction status may be Commit, but an exception occurs later, resulting in the rollback process.

Seadt-sdk consolidates distributed issues so that business code remains as simple as local transactions.

All kinds of exception handling are implemented by SEADT-SDK, for example:

  • Main transaction and branch transaction state maintenance;
  • Commit Rollback of the flow.
  • Two-stage advance processing.

2.4.4 TCC Rollback in the SDK

1) If an exception occurs, enter the Rollback process and handle the SDK.

If an exception occurs when starting distributed transactions, the rollback process is started. An exception or timeout occurs when the external Try method is called.

2) SDK processing of Rollback.

2.4.5 TCC transaction recovery processing in SDK

The transaction recovery manager will trigger periodically to Rollback distributed transactions that have been committed and branch transactions that have not entered the final state. The processing is as follows:

2.4.6 Transaction Templates

Both the Commit process and the Rollback process rely on various triggers of the transaction template, which are also not provided by the Golang transaction template. Therefore, we redesigned a new transaction template with the following trigger design:

This transaction template and the various transaction triggers it contains are the cornerstone of the SEADT-SDK.

2.4.7 Precautions

Before a distributed transaction is committed, any exception reported goes to the Rollback process. The red box part in the Commit process is often ignored. Although all participants Try successfully and the business code is ready to Commit the transaction, there are still some execution failures in the red box part in the SEADT-SDK, resulting in the whole distributed transaction finally rolling back. Subsequent failures are compensated by the transaction recovery handler.

In the Rollback process, most errors can be simplified to be compensated by the transaction recovery handler. In order for a distributed transaction to fail quickly, the call actor Cancel processing is triggered immediately.

You need to distinguish between Commit and rollback, and Commit and rollback.

2.5 SEADT constraints and specifications

The TCC mode of SEADT adopts the 2PC idea of committing transactions, which needs to meet the Atomic Commitment Protocol. Referring to this protocol, SEADT also proposed some protocol specifications of its own to ensure efficient and controllable transaction flow.

AC1: All participants that decide reach the same decision. AC2: If any participant decides COMMIT, then all participants must have voted YES. AC3: If all participants vote YES and no failures occur, then all participants decide COMMIT. AC4: Each participant decides at most once (that is, a decision is irreversible). A protocol that satisfies all four of the above properties is called an atomic commitment Protocol. -- Quoted from: Ozalp Babaoglu. Understanding non-blocking Atomic Commitment. January 1993Copy the code

2.5.1 Constraints and specifications on initiators

  • The global transaction state can only be determined by the initiator;
  • All participants must succeed in the first phase to enter Commit;
  • Provide distributed transaction backcheck interface;
  • When starting a distributed transaction, you need to consider whether it is a nested transaction and whether the SDK supports it.

Description:

  • The TCC mode of SEADT, positioned as Blocking Atomic Commitment, allows only the initiator TM to decide the transaction status, and does not allow participants to decide.
  • The transaction can enter the Commit process only after all participants Try successfully. If one participant fails, the Rollback process is initiated.
  • The SDK provides a unified transaction back lookup interface, and the initiator does not need to implement it. The TC is not notified of the outage because a local transaction is committed by the initiator. TC does not allow the decision of the final state of the transaction, so only the reverse lookup TM;
  • Nested transactions are not currently supported and are therefore not allowed.

2.5.2 Constraints and Specifications on Participants

  • TCC interface, Try lock resources, CC to ensure that the business will be successful;
  • Control idempotent and concurrent processing;
  • Disable empty commit and allow empty rollback;
  • Avoid transaction suspension and monitor alarms;
  • Good data visibility and isolation;
  • Initiating distributed transactions as an initiator is not allowed in two-stage processing.

Description:

  • Participants must occupy all resources in the Try phase; otherwise, TC cannot succeed in promoting Confirm.
  • TC informs participants in stage 2 that Exactly Once cannot be guaranteed, only At Least Once can be achieved, so idempotent is required. Concurrency also exists due to retry and network latency.
  • No participant Try is not executed, and the TC notifies Confirm. Allow Try not executed, Cancel first;
  • If the Try does not arrive until a null rollback occurs, the transaction is suspended and the resource cannot be released if no special processing is done. Unified processing in SEADT-SDK;
  • Data visibility and isolation need to be handled by the business itself, such as increasing the balance freeze limit.

2.5.3 TC Constraints and Specifications

  • TC cannot determine distributed transaction state;
  • The second-stage state is immutable, and the second-stage state when TC lands is the final state, which is not allowed to change under any circumstances;
  • Ensure the success of phase 2 of branch transaction;
  • Data clearing and archiving;
  • This alarm is generated when a transaction fails due to timeout or a transaction is suspended in one phase for a long time.

Description:

  • If TC does not receive the notification from TM for a long time, it is not allowed to make decision on the transaction state, and will reverse check TM to get the transaction result.
  • The TC pushes participants to perform the two-phase execution. The transaction status is not allowed to be adjusted even if an error is reported after multiple retries. But error alarm, manual intervention;
  • TC ensures the successful implementation of the second stage through broadcasting.
  • After the distributed transaction, the data is cleaned in time to avoid data accumulation. Rollback State transactions are reserved for a long time. Commit state transactions are reserved for a certain period of time and are deleted in batches periodically.
  • TCS have global and branch transaction data and state, so they can monitor long-suspended transactions.

2.6 SeADT difficulty analysis

  • How Golang realizes the AOP aspect function, so that participants’ branch transaction registration, transaction result reporting, idempotency control, null rollback, transaction suspension processing can be uniformly processed in SEADT-SDK;
  • How to solve the problem that TC does not depend on RM pb when TC calls RM in phase 2, and how to reflect the real business phase 2 methods of participants;
  • TC’s high availability design.

Solutions to these difficulties will be introduced in the following articles.

At present, SEADT components have been used in some core business processes, greatly reducing the development content of the original business self-implementation. In the future, SEADt will also support the Saga model to make transactions easier for business teams.

In the future, we will output documentation on seADT applications, new functions, new planning and difficult design, etc. Please look forward to it.

In this paper, the author

Marshal, Ansen and Yongchang from the Financial Products team.