I won't describe the 2PC and 3PC algorithms themselves here. A quick search turns up plenty of articles that walk through the algorithms in detail along with each blogger's own interpretation. After reading many of them, I finally worked through a few questions, which I'm sorting out and recording here.

Two-phase commit (2PC) is a distributed transaction protocol with strong consistency. "Strong" here is a relative term: it is in comparison with flexible transaction approaches such as TCC and Saga, which only provide eventual consistency. 2PC basically satisfies the ACID properties of transactions.

I say relatively strong consistency rather than absolutely guaranteed consistency for the following reason. 2PC (like 3PC) is a synchronous, blocking protocol in which all nodes move through the steps in a fixed order, which avoids intermediate states as far as possible (algorithms that aim for eventual consistency allow intermediate states). Consistency is guaranteed when nothing fails, but if the coordinator dies at certain stages, data inconsistencies can occur. In the following 2PC figure, for example, if the coordinator goes down after sending a Commit message to participant A during phase 2, A commits the transaction but B does not. B stays blocked with its data uncommitted, and the two nodes are inconsistent.
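The failure case above can be reproduced with a toy simulation. This is a minimal sketch, not a real 2PC implementation: the `Participant` class, the `run_2pc` function, and the `crash_after_commits` parameter are all hypothetical names introduced for illustration, and the "crash" is modeled simply as the coordinator returning early.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    state: str = "INIT"  # INIT -> PREPARED -> COMMITTED (or ABORTED)

    def prepare(self) -> bool:
        # Phase 1: lock resources and vote yes (this toy participant never votes no).
        self.state = "PREPARED"
        return True

    def commit(self) -> None:
        # Phase 2: apply the change and release the locks.
        self.state = "COMMITTED"


def run_2pc(participants, crash_after_commits=None):
    """Run 2PC; optionally 'crash' the coordinator after N Commit messages."""
    # Phase 1: collect a vote from every participant.
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.state = "ABORTED"
        return
    # Phase 2: send Commit to one participant at a time.
    for sent, p in enumerate(participants):
        if crash_after_commits is not None and sent == crash_after_commits:
            return  # coordinator dies; the remaining participants hear nothing
        p.commit()


a, b = Participant("A"), Participant("B")
run_2pc([a, b], crash_after_commits=1)
print(a.state, b.state)  # COMMITTED PREPARED
```

A has committed while B is still stuck in the prepared state, holding its locks and waiting for a decision that never arrives.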

In addition to the risk of inconsistent data, 2PC has two other drawbacks:

  1. Resource locking. In the Prepare phase, each resource manager has to lock the resources involved, so any other access to those resources is blocked. Imagine a long transaction that locks many resources and does not release them: this has a significant impact on system performance. In contrast, flexible transactions such as TCC and Saga do not need to hold such locks for the whole duration of the transaction.
  2. Single point of failure. Both 2PC and 3PC rely heavily on the stability of the transaction coordinator, and 2PC especially so: once the coordinator goes down, participants can only wait for its command and cannot recover on their own, which makes the resource blocking problem worse.
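To make the blocking concrete, here is a small sketch of a lock table for a single resource manager. The `LockTable` class and its methods are hypothetical and purely in-memory. One simplification is assumed: a conflicting `prepare` returns `False` immediately, whereas a real resource manager would usually make the second transaction wait.

```python
class LockTable:
    """Toy lock table for one resource manager (hypothetical, in-memory only)."""

    def __init__(self):
        self.owner = {}  # resource key -> id of the transaction holding the lock

    def prepare(self, txn_id, keys):
        """Phase 1: lock every key, or refuse if another transaction holds one."""
        if any(self.owner.get(k, txn_id) != txn_id for k in keys):
            return False  # simplification: a real system would usually wait here
        for k in keys:
            self.owner[k] = txn_id
        return True

    def finish(self, txn_id):
        """Phase 2 (commit or rollback): release the transaction's locks."""
        self.owner = {k: t for k, t in self.owner.items() if t != txn_id}


locks = LockTable()
print(locks.prepare("t1", ["account:1", "account:2"]))  # True
print(locks.prepare("t2", ["account:2"]))               # False: held by t1
locks.finish("t1")                                      # coordinator's decision arrives
print(locks.prepare("t2", ["account:2"]))               # True
```

If the coordinator goes down before `finish("t1")` is ever called, `t2` stays locked out indefinitely, which is how the two drawbacks reinforce each other.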

3PC was introduced to alleviate some of the above problems (I don't think it can fully solve them). 3PC adds a timeout mechanism for participants and expands the original Prepare and Commit phases into three phases: CanCommit, PreCommit, and DoCommit. As follows:

Let's take a look at how 3PC alleviates these problems. In 2PC, when the coordinator is down, participants wait endlessly for a command and keep their resources blocked. In 3PC, thanks to the participant timeout mechanism, a participant that stops hearing from the coordinator waits until the timeout expires and then either rolls back or commits on its own. How does a participant decide which of the two to do after a timeout? It depends on whether it has received the PreCommit command. If it has not, it rolls back. Once it has received PreCommit, it performs DoCommit on its own, even if the coordinator goes down afterward.
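The timeout rule described above fits in a few lines. This is a sketch under stated assumptions: the `State` enum and the `on_timeout` function are hypothetical names, and the state set is reduced to what the rule needs.

```python
from enum import Enum


class State(Enum):
    INIT = "init"                    # has not answered CanCommit yet
    VOTED_YES = "voted_yes"          # answered yes to CanCommit, waiting for PreCommit
    PRE_COMMITTED = "pre_committed"  # received PreCommit, waiting for DoCommit
    COMMITTED = "committed"
    ABORTED = "aborted"


def on_timeout(state: State) -> State:
    """What a 3PC participant does when the coordinator stays silent."""
    if state is State.PRE_COMMITTED:
        # PreCommit was received, so every participant voted yes: commit.
        return State.COMMITTED
    if state in (State.INIT, State.VOTED_YES):
        # No PreCommit yet, so the outcome is unknown: roll back.
        return State.ABORTED
    return state  # already finished; the timeout changes nothing


print(on_timeout(State.VOTED_YES).value)      # aborted
print(on_timeout(State.PRE_COMMITTED).value)  # committed
```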

Question 1: Can the effect of 3PC be achieved just by adding a participant timeout to 2PC?

Suppose we only add a participant timeout to 2PC in order to solve the problem of participants blocking when the coordinator is down. That means a node will commit on its own once it has completed the Prepare phase and has not received an Abort from the coordinator within the timeout. What's wrong with that?

Compare this with 3PC, where a participant commits as long as it has completed the PreCommit phase and has not received an Abort command within the timeout period. For a 3PC participant, receiving the PreCommit command does more than tell it to carry out PreCommit. It also conveys another important piece of information: all the other participants have completed the first phase (CanCommit, the counterpart of Prepare in 2PC) and agreed to proceed. This ensures that all participants were in a consistent state before PreCommit was executed. So if the coordinator then goes down, each participant can assume (only assume, not know for certain) that the other nodes will also complete the DoCommit phase, and when no DoCommit command arrives, it performs DoCommit on its own.

Back to 2PC: if a participant that has completed the Prepare phase times out, it commits automatically. It does so without knowing anything about the status of the other nodes, so the probability of data inconsistency is much higher than with 3PC.
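The difference shows up in a single scenario: one participant votes no, and the coordinator crashes right after collecting the votes, before sending any further message. The sketch below compares the two timeout rules for that scenario. The function names and the `"2pc+timeout"` / `"3pc"` labels are hypothetical, and a participant that voted no is assumed to have already rolled back locally.

```python
def timeout_decision(protocol: str, voted_yes: bool, got_precommit: bool = False) -> str:
    """Decision a participant reaches on its own after the timeout expires."""
    if not voted_yes:
        return "abort"  # it voted no, so it has already rolled back locally
    if protocol == "2pc+timeout":
        return "commit"  # commits blind: it knows nothing about the other votes
    if protocol == "3pc":
        # Only PreCommit proves that every participant voted yes.
        return "commit" if got_precommit else "abort"
    raise ValueError(f"unknown protocol: {protocol}")


def crash_after_votes(protocol: str, votes: list) -> list:
    """Coordinator dies after collecting votes; nobody has received PreCommit."""
    return [timeout_decision(protocol, voted_yes) for voted_yes in votes]


votes = [True, False]  # A votes yes, B votes no
print(crash_after_votes("2pc+timeout", votes))  # ['commit', 'abort']
print(crash_after_votes("3pc", votes))          # ['abort', 'abort']
```

With the naive 2PC timeout, A commits while B has aborted. With 3PC, A never saw PreCommit, so it rolls back and the two agree. This covers only one crash point; as noted above, a 3PC participant that commits after PreCommit is still acting on an assumption, not a guarantee.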

Conclusion: Adding a timeout mechanism to 2PC is not enough on its own. An intermediate phase is also needed, so that every participant learns that all the others are in the same state before anyone commits, which raises the success rate of the transaction.

Question 2: 2PC is considered a strongly consistent distributed transaction protocol, yet it does not guarantee data consistency. Why?

I think strong consistency here is a relative term. The protocol requires the nodes to reach an agreed state before they actually act, which increases the probability that the transaction succeeds. But once that agreed state is reached, each node still has to modify its data and persist it to disk, and no one can guarantee that no node fails during that step.