Weekend night, is watching variety show at home, suddenly my girlfriend ran to me to play “King of Glory”.

After playing a few games, I was finally able to take a break and continue to watch my variety show, but my girlfriend came to me to explain to him exactly what stage 2 submission is.

Distributed consistency

It’s a good thing we told our girlfriend about distribution, otherwise it would have been a long topic.

Before we introduced distributed, we used the kitchen of a restaurant as an example, and today we continue the previous example to talk about distributed consistency.

With the development of the hotel, slowly from only one chef evolved into a number of chefs, and then evolved into a dishwasher, chef, chef and other positions.

When there is a variety of division of labor, it is necessary to coordinate the cooperation between these people.

For example, restaurant guests order a tomato scrambled eggs, and then the back kitchen starts to prepare, the dishwasher starts to wash the tomatoes, the side dish chef starts to prepare the eggs, and the chef starts to add oil to the pot to prepare the dishes. This is a very normal situation.

However, if the message is not conveyed in place, or the dishwasher is temporarily not in the kitchen, it will lead to some people have started to prepare, but some people are not.

This is just like a distributed system. When we place an order on an e-commerce website, we need to have multiple distributed services at the same time, such as payment system for payment, red envelope system for red envelope deduction, inventory system for inventory deduction, logistics system to update logistics information, etc.

However, if one of the systems fails in the execution process, or the request is not received due to network reasons, then the whole system may have an inconsistent phenomenon: the money is paid, the red envelope is deducted, but the inventory is not deducted.

This is known as the problem of data consistency in distributed systems.

Two-stage submission

The reason for the consistency problem in the previous example is that each employee only pays attention to what he or she is doing and cannot pay attention to others. Therefore, in order to ensure the consistency of the whole, a new role should be introduced in the kitchen, who is responsible for coordinating and deploying everyone.

The introduction of a coordinator responsible for coordinating the work of all participants in a distributed system is actually the distributed transaction processing model defined by the X/Open organization, from which two-phase commit is derived.

For example, if five people want to play Honor of Kings together, the following steps are required:

There was a guy who wanted to play King of Glory on five Black, so he started contacting his friends.

Organizer: small A, we are going to play king of Glory, if you can participate, now you log in the game, and then give me A reply on the game friends.

Little A logs in to his game account and tells the organizer: Little A is in position.

Organizer: small B, small C, small D, we are ready to play king of Glory, if you can participate, now you log in the game, and then in the game friends to reply to me a message.

B, C and D log in their game accounts and tell the organizer that B, C and D are in place.

The organizers found that everyone was in place, so they announced everyone on the game,

Organizer: Small A, I invite you, you come in.

“A” accepts the invitation

Organizer: Little B, little C, little D, I invite you, you come in.

Little B, little C, and little D accept the invitation

So, 5 people in king canyon happy to play up.

For five people to open this transaction operation, before starting to prepare for the five people are idle, busy with their own things. After the organizer has coordinated, everyone should also reach an agreed state, which is one of the following two conditions:

  • 1. Five people happily start playing the game together

  • 2. All five quit the game and go about their business.

If you end up with a group of people waiting in the game and a group of people not entering the game, you have inconsistent data.

The above process, which is a typical two-phase commit (2PC) process, has the same problem and can be solved in distributed systems.

In distributed systems, although each node can know the success or failure of its own operation, it cannot know the success or failure of other nodes’ operation (only it knows that it has the time to play King of Glory, not others).

When a transaction spans multiple nodes, in order to maintain the ACID nature of the transaction, it is necessary to introduce a component that acts as a coordinator to unify the results of all participants’ actions and ultimately indicate whether or not the nodes should actually commit the results of the actions (the organizer notifies all participants to enter the game room together).

Therefore, the algorithm idea of two-stage submission can be summarized as follows: participants will inform the coordinator of the success or failure of the operation, and then the coordinator will decide whether to submit the operation or stop the operation according to the feedback information of all participants.

The so-called two stages are: the first stage: preparation stage (voting stage) and the second stage: submission stage (implementation stage)

Preparation stage

The transaction coordinator sends each participant a Prepare message, and each participant either returns a failure (telling the organizer that they are not available to play together) or performs the transaction locally (logging in to King of Glory) but does not commit (not starting the game yet).

The preparation phase can be further divided into the following three steps:

  • 1) The coordinator node asks all the participant nodes whether they can perform the submission operation and starts to wait for the response from each participant node. (Ask if you can play a game together)

  • 2) The participant node performs all transactions until the query is initiated, and writes Undo and Redo information to the log. (Login king of Glory game)

  • 3) Each participant node responds to the query initiated by the coordinator node. If the transaction of the participant node actually succeeds, it returns a “agree” message; The participant node returns an abort message if the transaction actually fails. (Inform the organizer that you are temporarily unable to play the game together, for example, your account is limited to rank)

The commit phase

If the coordinator receives a failure message from a participant or a timeout (someone can’t play together, or hasn’t replied), send a rollback message directly to each participant (tell others to cancel the game for now); Otherwise, send a submit message (inviting everyone into the game room); Participants perform commit or rollback actions (enter the room to play together or quit the game to do something else) at the coordinator’s command.

The commit phase process is discussed in two separate cases.

When the coordinator node gets the corresponding message “agree” from all the participant nodes:

  • 1) The coordinator node sends a request to all participant nodes for “formal submission” (requiring all logged friends to join the game room).

  • 2) The participant node formally completes the operation and releases the resources occupied during the entire transaction (accept the invitation and enter the room).

  • 3) The participant node sends the “Done” message to the coordinator node (click “Ready” to enter the ready state).

  • 4) The coordinator node will complete the transaction (enter the king valley) after receiving the “complete” message from all the participant nodes.

If the response message returned by any participant node in the first phase is “abort”, or the coordinator node cannot obtain the response message from all the participant nodes before the query in the first phase times out:

  • 1) The coordinator node sends a request for “rollback operation” to all participant nodes (telling everyone to cancel the game).

  • 2) The participant node performs a rollback using the previously written Undo information and frees the resources occupied during the entire transaction (exit the game and go about their business).

  • 3) The participant node sends a “rollback complete” message to the coordinator node (telling the organizer that it knows and will play later).

  • 4) The coordinator node will cancel the transaction (cancel the game activity) after receiving the “rollback completed” message from all participant nodes.

The disadvantage of 2 PC

The above process actually has some disadvantages, such as

1. After receiving the organizer’s message, participants need to log in the game and wait for the organizer’s invitation again, which is a waste of time.

2. If the organizer is interrupted in the middle of this process, the participants who are already in the game may wait forever.

3. After everyone logs in to the game, the organizer asks everyone to join his room by invitation. At this time, if there are some network abnormalities or participants are not in front of their mobile phones, some users may join the room and others may not.

4. If the organizer invites all participants in the game and he invites the first person, he and the person he invites are disconnected. The other three didn’t know what to do.

The above problems also exist in the 2PC stage of the distributed system and correspond to the following problems respectively:

1.Synchronization blocking problem.

During execution, all participating nodes are transaction blocking. When a participant occupies a common resource, other third party nodes have to block access to the common resource.

2,A single point of failure.

Because of the importance of the coordinator, if the coordinator fails. The participants will keep blocking. Especially in phase 2, when the coordinator fails, all participants are still locked in the transaction resources and cannot continue to complete the transaction. (If the coordinator is down, you can re-elect a coordinator, but you can’t solve the problem of participants being blocked because the coordinator is down)

3,Data inconsistency.

In phase 2 of the two-phase commit, after the coordinator sends a COMMIT request to the participant, a local network exception occurs or the coordinator fails during the commit request, which results in only a subset of the participant receiving the commit request. These participants perform the COMMIT operation after receiving the COMMIT request. However, other parts of the machine that do not receive the COMMIT request cannot perform the transaction commit. Then the whole distributed system appears the data consistency phenomenon.

4. Problems that cannot be solved in the second stage:

When the coordinator sends a COMMIT message, it goes down, and the only participant who received the message also goes down. So even if the coordinator creates a new coordinator by election agreement, the status of the transaction is uncertain, and no one knows whether the transaction has been committed.

In summary, 2PC is not perfect, it has synchronization blocking issues, single point of failure issues, not 100% data consistency issues, etc.