Attention can view more fans exclusive blog~

Phase 2 Submission (2PC)

  • Stage 1: request/vote stage
    • When the distributed transaction initiator sends a request to the distributed transaction coordinator, the transaction coordinator sends a vote Request to all participants.
    • At this point, the participant will start the local transaction and start executing the local transaction. After the completion of the transaction, the participant will not commit, but report to the transaction coordinator whether the transaction can be processed
  • Phase 2: Commit/execute/rollback phase
    • After the distributed transaction coordinator receives feedback from all participants and all the participant nodes respond that the transaction can be committed, the participant and the initiator are notified to commit, or rollback

Three common problems:

  1. Performance issues: As you can see from the flow, the biggest disadvantage is that the nodes are blocked during execution. All the nodes that operate the database occupy database resources. Only when all the nodes are ready, the transaction coordinator will notify the global COMMIT /rollback, and participants will release resources only after the local transaction commit/rollback, which adversely affects performance.
  2. Single point of failure: The transaction coordinator is the core of the entire distributed transaction. Once the transaction coordinator fails, participants will not receive the COMMIT/ROLLBACK notification, resulting in the participant node in the intermediate state that the transaction cannot be completed.
  3. Message loss problem: During phase 2, if a local network problem occurs and some transaction participants do not receive commit/ ROLLBACK messages, data inconsistency may occur between nodes.

Three-phase Submission (3PC)

A CanCommit phase was added to 2PC and a timeout mechanism was introduced. If a transaction participant does not receive a COMMIT/ROLLBACK instruction from the coordinator at a specified time, a local commit is automatically performed, thus resolving the coordinator’s single point of failure.

  • CanCommit phase (Commit query)
    • The distributed transaction coordinator asks all participants whether they can perform transaction operations, and the participants respond Y/N according to their health condition whether they can perform transaction operations.
  • PreCommit Phase (pre-commit)
    • If all participants return consent, the coordinator sends a pre-commit request to all participants and goes to prepared.
    • After receiving the pre-commit request, the participant performs the transaction and saves Undo and Redo information to the transaction log.
    • After an actor has executed a local transaction (uncommitted), he sends an Ack to the coordinator indicating that he is ready to commit and waits for the coordinator’s next instruction.
    • If the coordinator receives a withholding symphony that should be rejected or timed out, the interrupt transaction operation is performed, notifying each participant of the interrupt transaction (ABORT).
    • Participants receive abort transaction (ABORT) or wait timeout and actively interrupt the transaction/commit directly. (In the case of timeout, many blogs in China talk about direct interruption, while Wiki talks about direct submission. There is uncertainty in both interruption and submission, and only half of the probability is correct, which may cause inconsistency. It needs to see the implementation of different frameworks. TODO here.)

If, after a cohort member receives a preCommit message, the coordinator fails or times out, the cohort member goes forward with the commit.

  • DoCommit Phase (final commit)
    • After receiving Ack from all participants, the coordinator moves from pre-commit to commit and sends a commit request to each participant.
    • The participant receives the commit request, formally commits the transaction (COMMIT), and reports the commit result Y/N back to the coordinator.
    • The coordinator receives all feedback messages and completes the distributed transaction. (Participants include transaction initiators. For example, A calls B, and both AB participate in distributed transactions.)
    • If the coordinator times out and does not receive feedback, the interrupt transaction instruction (ABORT) is sent.
    • After receiving the interrupt transaction instruction, the participant uses the transaction log to rollback.
    • The participant reports back the rollback result, the coordinator receives the feedback result or times out, and the interrupted transaction is completed.

TCC (Try – Confirm – Cancel)

  • Try
    • Do business reviews and reserve resources (e.g., freeze inventory rather than reduce it).
  • Confirm
    • Confirm the submission and execute Confirm after all transaction participants successfully execute in the Try phase. Generally, TCC does not fail to execute Confirm by default. If the Try succeeds, Confirm must succeed.
  • Cancel
    • In the Try phase, if any transaction participant fails to execute, Cancel will be executed. In general, TCC default Cancel will not fail, and it is considered that as long as the Try succeeds, Cancel will succeed. If the Cancel does fail, retry mechanism or manual intervention is required.

The advantages and disadvantages:

  1. Advantages: Solves performance problems, does not block, does not occupy database resources.
  2. Disadvantages: strong code intrusion, each transaction needs to implement try, confirm and cancel, but also need to ensure interface idempotent, high development and maintenance costs.

RocketMQ achieves final consistency based on messages

RocketMQ Transaction message flow chart (image from Aliyun) :

RocketMQ

  1. TransactionStatus.Com mitTransaction: submit a transaction, it allows consumer spending this message.
  2. TransactionStatus. RollbackTransaction: roll back a transaction, it represents the message will be deleted, is not permitted to consume.
  3. TransactionStatus. Unknown: intermediate state, it means we need to check the message queue to determine the state.

Execution process:

  1. The message sender starts a transaction and sends a semi-transaction message to RocketMQ, but the message is only stored in the Commitlog, is not visible to the consumer and is not stored in the customerQueue.
  2. After the message sender completes the transaction, it moves on to the second phase.
    1. If successful, send a COMMIT confirmation message to RocketMQ to save the semi-transaction message to the customerQueue for the customer to consume.
    2. If this fails, send a ROLLBACK message to RocketMQ to delete the semi-transaction message.

Anomaly analysis:

  1. Failed to send the prepared message. (The process will be interrupted, so there is no impact)
  2. The prepared message was sent successfully, but the local transaction execution failed. (The prepared message does not enter the customerQueue and will not be consumed, so it has no impact)
  3. The prepared message was sent successfully and the local transaction was executed successfully, but the confirmation message failed to be sent. As a result, the message could not enter the customerQueue and the consumer could not consume it. (Solution: Message back lookup mechanism)

Message check mechanism:

  1. RocketMQ periodically checks for prepared messages in the Commitlog and checks back to local services (implementing the check method of the LocalTransactionChecker interface).
  2. RocketMQ determines whether to commit to customerQueue or rollback delete messages (resolve exceptions 2 and 3) based on the status of the lookup.
  3. In order to reduce code intrusion and judgment complexity, a separate Transaction table can be designed to decouple the Transaction from the specific business, and the query can be performed according to the status of the Transaction table.

Ensure idempotency of consumption:

  1. Message ids can conflict in RocketMQ, so it is recommended that business unique identifiers be used as a basis for idempotent processing.
  2. Use Redis to determine if the message has been consumed at consumption time.

conclusion

Strongly consistent distributed transaction code has low intrusion, but it will block, occupy resources and affect performance. TCC code and business intrusion; Weakly consistent transaction asynchronous operations can involve rollback retries in exceptional cases, rollback failures, and so on. Therefore, it is still necessary to choose from its own business situation. The following is the current mainstream distributed transaction implementation (in no particular order).

  1. seata
  2. ByteTCC
  3. spring-cloud-rest-tcc
  4. hmily
  5. EasyTransaction
  6. tcc-transaction
  7. RocketMQ
    1. Alibaba Cloud RocketMQ sends and receives transaction messages
    2. Message queue RocketMQ version versus self-built open source RocketMQ cost comparison