Consistency


Consistency means that when data is updated on one node, the update becomes known to the whole cluster and every node agrees on it. For a distributed system of N nodes, we say the system satisfies consistency when the following conditions hold:

  • Agreement: all N nodes agree on the same result
  • Validity: the agreed result must be a value actually proposed by one of the N nodes
  • Termination: the decision process completes within a bounded time and does not run forever

The problems we face

  1. Asynchronous, out-of-order message delivery: a real network is not a reliable channel; messages can be delayed or lost, and delivery between nodes cannot be kept synchronous and ordered
  2. Node crash: a node goes down and never recovers
  3. Node recovery: a node recovers after being down for a while; this is the most common failure mode in distributed systems
  4. Network partition: the N nodes are isolated into several groups by network link failures
  5. Byzantine generals problem: nodes crash, fail in their logic, or even deliberately send bogus messages that disrupt the decision process

The original post illustrated these failures with a chat screenshot, roughly:

  // message delay
  Me (Friday): Let's play PUBG ("eat chicken") after work tonight.
  xc (Saturday morning): OK.
  Me: WTF?

  // node down, then recovery
  Me: Let's play PUBG after work tonight.
  xc: (silent for two hours) No problem!
  Me: WTF?

  // node down
  Me: Let's play PUBG after work tonight. (no reply, ever)

  // Byzantine general
  Me: Let's play PUBG after work tonight.
  cx: OK, let's go get a massage instead!
  Me: WTF?

As discussed earlier, a distributed environment contains many sources of uncertainty, and failures can happen at any time. We have also covered CAP theory and BASE theory. The goal is to build a highly available, highly consistent service in a distributed environment, but CAP theory tells us such an ideal is unachievable: at most two of the three properties can be fully satisfied.

On this premise, P (partition tolerance) must be satisfied: the system is, after all, distributed, and we cannot put the entire application on a single server, which could not bear the load and would be a single point of failure. So the real trade-off is between consistency and availability.

What is the balance? This is where BASE theory comes in: even if strong consistency cannot be achieved, a distributed system can, according to its own business characteristics, adopt appropriate means to reach eventual consistency. BASE stands for Basically Available, Soft state, and Eventually consistent. The system is usually required to be basically available; besides success and failure, an operation may sit in a tolerable intermediate (soft) state, but in any case, after some delay, the system must finally reach a state where the data is consistent.

In the end, both CAP and BASE are theories, and theories need algorithms to realize them. The 2PC, 3PC, Paxos, and ZAB algorithms discussed today do exactly that.

So the premise of today's discussion is a distributed setting, and the problem to solve is: in a distributed environment, how do we keep the system as highly available as possible while making the data eventually consistent?

2PC

Two-phase commit splits a transaction into a preparation phase and a commit phase, and involves two roles: a coordinator and the participants.

Phase 1

In phase 1, the coordinator initiates a proposal and asks each participant whether it accepts it, as shown below:

Phase 2

In phase 2, the coordinator commits or aborts the transaction based on the participants' feedback: it commits if all participants agree, and aborts if even one participant disagrees. The diagram below illustrates this:

An example

A organizes B, C, and D for a mountain-climbing trip. If everyone agrees, the event is held; if even one person disagrees, the activity is cancelled. Solving this problem with the 2PC algorithm works as follows:

Firstly, A will be the coordinator of the activity, and B, C and D will be the participants of the activity.

Phase 1:

  1. A sends an email to B, C, and D, proposing to climb the mountain next Wednesday and asking whether they agree. A then waits for replies from B, C, and D.
  2. B, C, and D each check their schedules. B and C find they have no plans that day and email A that they agree to go hiking next Wednesday. For some reason, D does not check email during the day, so A, B, and C are all kept waiting. In the evening, D finds A's email, checks the schedule, sees another arrangement on Wednesday, and replies to A that the activity should be cancelled.

Phase 2:

  1. A has now received emails from all participants and finds that D cannot climb the mountain next Wednesday, so A emails B, C, and D that next Wednesday's climb is cancelled.
  2. B, C, and D reply with their regrets. At this point the transaction terminates.
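The all-or-nothing decision rule of phase 2 can be sketched in a few lines of Java. This is a minimal illustration: `TwoPhaseCommit`, `Vote`, `Decision`, and `decide` are names invented for this sketch, not part of any real transaction framework.

```java
import java.util.List;

// Sketch of a 2PC coordinator's phase-2 decision (illustrative names only).
public class TwoPhaseCommit {
    public enum Vote { YES, NO }
    public enum Decision { COMMIT, ABORT }

    // Phase 2 rule: commit only if every participant voted YES in phase 1;
    // a single NO (or a missing vote) aborts the whole transaction.
    public static Decision decide(List<Vote> votes) {
        for (Vote v : votes) {
            if (v != Vote.YES) return Decision.ABORT;
        }
        return Decision.COMMIT;
    }

    public static void main(String[] args) {
        // B and C agree, D declines: the activity is cancelled (ABORT).
        System.out.println(decide(List.of(Vote.YES, Vote.YES, Vote.NO)));
    }
}
```

In the story above, B and C's Yes votes count for nothing once D replies No: one dissenter aborts the whole transaction.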

2PC in the database

In the InnoDB storage engine, database changes are written to the undo and redo logs. This idea is used by many transactional systems, not just databases.

The undo log records the original data before a change, and the change itself is written to the redo log. If something goes wrong and the transaction fails, the undo log is used to restore the original data.
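As a toy illustration of the undo idea (`UndoLogDemo` is an invented name, and this in-memory map is nothing like InnoDB's real on-disk logs), recording the old value before every write is what makes rollback possible:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the undo-log idea: save the old value before each write,
// so a failed transaction can be rolled back (illustrative names only).
public class UndoLogDemo {
    private final Map<String, Integer> data = new HashMap<>();
    private final Deque<Object[]> undoLog = new ArrayDeque<>(); // {key, oldValue}

    public void put(String key, Integer value) {
        undoLog.push(new Object[] { key, data.get(key) }); // record undo info first
        data.put(key, value);                              // then apply the change
    }

    // On failure, replay the undo log backwards to restore the original data.
    public void rollback() {
        while (!undoLog.isEmpty()) {
            Object[] entry = undoLog.pop();
            if (entry[1] == null) data.remove((String) entry[0]);
            else data.put((String) entry[0], (Integer) entry[1]);
        }
    }

    public void commit() { undoLog.clear(); } // committed changes need no undo info
    public Integer get(String key) { return data.get(key); }
}
```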

And not only in databases: in many companies, such as Huawei, database changes are submitted in the same way. If you want to add a field, handing the DBA the SQL that alters the table (the "redo") is not enough; you must also submit the statement that drops the field and restores the data to its state before your change (the "undo").

Through undo and redo, a database can guarantee strong consistency of its own data, and a prerequisite for solving distributed transactions is that each participating node supports local transactions.

On this premise, 2PC borrows the same idea. The whole distributed transaction is split into two phases. The first phase is the prepare phase: the transaction request is sent to every resource, where a resource may be a database or any other framework that supports transactions. Each resource executes its own local transaction, writing the undo and redo logs, but does not commit.

When the transaction manager receives success feedback from every resource, it sends a commit command and the transaction is committed. If any resource fails during the prepare phase, the transaction manager sends a rollback command to roll back all resources. That is 2PC: very, very simple.

It achieves strong consistency by requiring every single resource to succeed before the whole distributed transaction can succeed.

Advantages and disadvantages

In an asynchronous model without node crashes, 2PC satisfies agreement, validity, and termination, and is thus a protocol that solves the consistency problem. From the moment the coordinator receives a transaction request and issues a proposal until the transaction completes, 2PC adds two RTTs (propose + commit), a relatively small latency increase.

Advantages: simple principle, easy to implement.

Disadvantages: synchronous blocking, single point of failure, data inconsistency, poor fault tolerance.

  1. Synchronous blocking

During two-phase commit, all nodes are blocked waiting for responses from the other nodes and cannot perform other operations. This synchronous blocking severely limits the performance of a distributed system.

  2. Single point of failure

The coordinator is critical throughout two-phase commit; if it fails during the commit phase, the whole process stalls. Worse, the other participants remain blocked with their transaction resources locked and can never complete the transaction.

  3. Data inconsistency

Suppose a local network exception occurs after the coordinator has started sending commit requests, or the coordinator crashes before all commit requests are sent, so that only some participants receive the commit. The result is serious data inconsistency.

  4. Poor fault tolerance

The two-phase commit protocol has no well-designed fault-tolerance mechanism; the failure of any one node leads to the failure of the whole transaction.

3PC

Three-phase commit (3PC) is an improved version of 2PC. Compared with two-phase commit, it makes two changes:

  1. It introduces timeouts, on both the coordinator and the participants.
  2. It inserts an extra preparation stage between phase one and phase two, so that the states of all participating nodes are consistent before the final commit stage. In other words, besides the timeout mechanism, 3PC splits 2PC's prepare phase in two, giving three phases: CanCommit, PreCommit, and DoCommit.

Phase 1: CanCommit

The CanCommit phase of 3PC is very similar to the prepare phase of 2PC: the coordinator sends a commit request to each participant, who returns a Yes response if it can commit and a No response otherwise.

1. Transaction inquiry: the coordinator sends a CanCommit request to the participants, asking whether the transaction commit can be performed, and then waits for their responses. 2. Response feedback: on receiving a CanCommit request, a participant that believes it can execute the transaction successfully returns Yes and enters the prepared state; otherwise it returns No.

Phase 2: PreCommit

Based on the participants' responses, the coordinator decides whether the transaction's PreCommit operation can be performed. There are two possibilities. If the coordinator receives a Yes response from every participant, the transaction is pre-executed:

  1. Send PreCommit request: the coordinator sends a PreCommit request to the participants and enters the prepared phase.
  2. Transaction pre-execution: on receiving the PreCommit request, a participant executes the transaction and records the undo and redo information in its transaction log.
  3. Response feedback: if the participant executes the transaction successfully, it returns an ACK response and waits for the final instruction.

If any participant sends a No response to the coordinator, or if the coordinator receives no response from a participant before the timeout, the transaction is interrupted:

1. Send abort requests: the coordinator sends abort requests to all participants. 2. Abort the transaction: a participant aborts the transaction when it receives an abort request from the coordinator (or when its timeout fires without any request from the coordinator).

Phase 3: DoCommit

This stage performs the actual transaction commit and can likewise be divided into the following two scenarios.

Commit

  1. Send commit request: when the coordinator has received the participants' ACK responses, it moves from the pre-committed state to the committed state and sends DoCommit requests to all participants.
  2. Transaction commit: on receiving the DoCommit request, a participant performs the formal transaction commit and releases all transaction resources.
  3. Response feedback: after committing, the participant sends an ACK response to the coordinator.
  4. Complete transaction: after receiving ACK responses from all participants, the coordinator completes the transaction.

Abort: if the coordinator does not receive the participants' ACK responses (either a participant sent something other than ACK, or the response timed out), the transaction is interrupted:

  1. Send abort requests: the coordinator sends abort requests to all participants.
  2. Transaction rollback: on receiving the abort request, participants use the undo information recorded in phase two to roll back the transaction and release all transaction resources.
  3. Feedback result: after completing the rollback, a participant sends an ACK message to the coordinator.
  4. Abort transaction: after receiving the participants' ACK messages, the coordinator completes the interruption of the transaction.
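The distinguishing behavior of 3PC, a participant's reaction to a silent coordinator, can be sketched as a small state machine. This is an illustrative sketch with invented names (`ThreePcParticipant` and its methods); the assumption it encodes is that a participant which has already pre-committed treats a coordinator timeout as permission to commit, while one still in an earlier phase aborts.

```java
// Hedged sketch of a 3PC participant's timeout behavior (invented names).
public class ThreePcParticipant {
    public enum State { INIT, CAN_COMMIT_ACKED, PRE_COMMITTED, COMMITTED, ABORTED }

    private State state = State.INIT;

    public void onCanCommit() { state = State.CAN_COMMIT_ACKED; }
    public void onPreCommit() { state = State.PRE_COMMITTED; } // undo/redo logged here
    public void onDoCommit()  { state = State.COMMITTED; }
    public void onAbort()     { state = State.ABORTED; }

    // Fired when the participant's timer expires without word from the coordinator.
    public void onCoordinatorTimeout() {
        if (state == State.PRE_COMMITTED) {
            state = State.COMMITTED; // everyone pre-committed, so commit by default
        } else if (state != State.COMMITTED) {
            state = State.ABORTED;   // no pre-commit seen: assume the transaction failed
        }
    }

    public State state() { return state; }
}
```

This default-commit rule is exactly why 3PC alleviates 2PC's blocking, and also why it can still go wrong under a network partition: a pre-committed participant may commit while an aborted one rolls back.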

In practice, distributed transactions in databases generally use two-phase commit; the ideas of three-phase commit are mostly borrowed and diffused into other algorithms.

Advantages and disadvantages

Differences from 2PC: undo and redo transaction logs are written in phase two, and in phase three, participants commit on their own when the coordinator fails or the network times out.

Advantages:

Alleviates synchronous blocking; alleviates the single point of failure.

Disadvantages:

Synchronous blocking, the single point of failure, and data inconsistency remain, and the fault-tolerance mechanism is still imperfect.

Paxos algorithm

Paxos involves two roles: the proposer and the acceptor.

Phase 1

The proposer calls out to the acceptors; of course there is more than one acceptor (an odd number, so a majority exists), and together they too form a distributed set. It is like a Monday morning meeting where the shameless leader yells: "Meeting now. I am going to publish proposal number 001; reply 1 if you receive it." The leader then waits for the employees to reply 1 ("received"); if more than half respond, the next step is taken. Suppose that, for some reason (an acceptor freezes, the network fails, or the service faults), fewer than half accept the proposal.

Then the leader shouts again, still with a leader's bearing, of course, and not rudely: "Fine, have it your way. I am going to publish a new proposal, number 002; reply 1 if you receive it." [This is much like school, where the teacher asks "Understood? Reply 1 if yes, 2 if no" and continues the lesson only when most reply 1.]

Phase 2

Next, in the second stage, the leader has finally, with great effort, gathered everyone into the meeting. The content of proposal 002 is: "The project schedule is tight, so today we work overtime until 12 o'clock; raise your hand if you agree." If the acceptors agree, good: the proposal is decided. If the employees object, or simply walk out the door, the leader can only start over from the first stage: "Folks, I have a new proposal, number 003; please come back to the meeting room..."

At its core, Paxos is about majority rule.

The hapless leader (single point of failure): with such a fierce group of subordinates, the leader will either die of anger or resign; that is the single point of failure. Aggressive subordinates (consistency problem): if the employees keep refusing and deliberately duel with the leader, no consistent decision can ever be produced.

So the Paxos protocol will certainly not rely on a single proposer, and the subordinates will not be quite so obstinate.

The protocol requires that if an acceptor has not yet received any proposal, it must accept the first proposal (and the first proposal number) it receives. And once a proposal has been accepted by the majority, a later proposal cannot decide differently: its result must be the same as the value the majority already accepted.

An intuitive interpretation

The Paxos algorithm solves the problem of how a distributed system in which messages may be delayed, lost, or duplicated can reach agreement on some value, guaranteeing that none of these anomalies breaks the consistency of the decision. The "value" may be a piece of data, a log entry, and so on; it varies with the application.

A typical scenario: in a distributed database system, if the nodes start from a consistent initial state and each node executes the same sequence of operations, they end up in a consistent state. To ensure every node executes the same command sequence, a consistency algorithm must be run for each instruction so that every node sees the same instruction.

For example, suppose the company is deciding where to hold its annual meeting, and everyone may make suggestions. In real life we could discuss it together in a conference room or in a WeChat group (shared memory). But in a message-passing distributed environment, everyone can only communicate by text message. How do we decide on a venue in an environment where messages are delayed and lost?

The Paxos algorithm solves this problem like this:

Everyone can make a proposal, agree to a proposal, and accept a proposal; a proposal is confirmed as soon as a majority approves it. 1. Only a value that has actually been proposed can be agreed upon. 2. Only a single value is finally agreed upon. 3. If someone believes a proposal has been agreed upon, then it really has been agreed upon.

Algorithm derivation. Case one: what if only one person makes a suggestion? If only one suggestion is put forward, everyone must agree to it, because refusing would mean never deciding a venue. P1: each person must agree to the first proposal he receives. P1 gives rise to the following problem:

Zhang San texts Wang Wu: I suggest holding the annual meeting in Shanghai! Wang Wu texts Li Si: I suggest holding it in Guangzhou! Li Si texts Zhang San: I suggest holding it in Beijing!

By P1, everyone must agree to the first proposal he receives, so Zhang San, Li Si, and Wang Wu end up with inconsistent information. Hence the further rule: a proposal takes effect only once a majority of people approve it. That in turn means one person must be allowed to agree to more than one proposal at a time; but if a person may agree to arbitrary different proposals, the outcome becomes inconsistent. For example, if Zhang San agrees both to Beijing and to Guangzhou, then Li Si holds two votes for Beijing (his own and Zhang San's) and, having majority support, decides the meeting is in Beijing, while Wang Wu likewise holds two votes for Guangzhou and concludes everyone decided on Guangzhou. To avoid this, a person may agree to multiple proposals only when they all have the same content (the same venue).

This introduces a notion of before and after: every proposal carries a serial number, and the serial numbers form a total order, so everyone can tell which proposal is earlier and which is later.

Case two: Zhang San, Li Si, and Wang Wu have finally settled on Zhengzhou for the annual meeting. Zhao Liu and Sun Qi did not receive the notice because their phones were out of power. When they turned their phones back on, Zhao Liu texted Sun Qi proposing to hold the annual meeting in Hainan; this was the first proposal Sun Qi received after powering on. But by P1 this would make Sun Qi's decision on the venue inconsistent with the one Zhang San, Li Si, and Wang Wu already reached.

So we need another rule. P2a: once a proposal has been approved by the majority, any proposal that people later agree to must name the same venue as the one already decided; this is what must constrain the content of Zhao Liu's proposal.

We then modify P2a into P2b: once a proposal has been approved by the majority, any proposal made by a later proposer must name the same venue as the one others already decided.

But how do we make sure that Zhao Liu, who has just turned his phone on, proposes content consistent with what Zhang San, Li Si, and Wang Wu already settled (holding it in Zhengzhou)?

We continue to strengthen P2b into P2c: if a proposal numbered N with value V is issued, then there is a majority such that either none of them has accepted any proposal numbered less than N, or V is the value of the highest-numbered proposal below N that any of them has accepted.

To satisfy P2c, before proposing, a proposer must first communicate with a majority and obtain the proposal they last agreed to (if any); it then determines the content of its proposal from that feedback, forms the proposal, and starts the vote.

Key points

So there are two stages:

  1. Prepare stage

1. The proposer chooses a number N and sends a prepare message to a majority of acceptors. 2. If N is greater than the number of every prepare message the acceptor has already answered, the acceptor replies to the proposer with the content of the proposal it last accepted (if any), and promises not to answer any proposal numbered less than N.

  2. Accept stage

1. When the proposer receives feedback from the majority, it enters the accept stage: it sends an accept request, carrying the proposal, to each acceptor that answered it. 2. An acceptor agrees to the proposal as soon as it receives it, provided this does not break the promise it has made.
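The two-stage rule above can be sketched as an acceptor in Java. This is a hedged sketch: `PaxosAcceptor`, `prepare`, and `accept` are invented names for illustration, and a real implementation would also persist its state to disk and communicate over the network.

```java
// Illustrative Paxos acceptor: it ignores prepare requests numbered at or
// below its highest promise, and when it promises it reports the proposal
// it last accepted so the proposer can obey P2c (invented names).
public class PaxosAcceptor {
    private long promisedN = -1;   // highest prepare number this acceptor answered
    private long acceptedN = -1;   // number of the last proposal it accepted
    private String acceptedValue;  // value of that proposal, if any

    // Phase 1: answer prepare(n) only if n beats every earlier prepare;
    // the reply carries the last accepted proposal (number, value).
    public String[] prepare(long n) {
        if (n <= promisedN) return null; // would break an existing promise: reject
        promisedN = n;
        return new String[] { Long.toString(acceptedN), acceptedValue };
    }

    // Phase 2: accept (n, value) unless it breaks the promise made above.
    public boolean accept(long n, String value) {
        if (n < promisedN) return false;
        promisedN = n;
        acceptedN = n;
        acceptedValue = value;
        return true;
    }

    public String acceptedValue() { return acceptedValue; }
}
```

Running the "1+1" example below against a single acceptor shows the promise mechanism: after answering prepare number 2, the acceptor rejects an accept numbered 1 but still accepts numbers 2 and 3.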

Suppose that only User1, User2, and User3 are deciding what 1+1 equals!

The prepare stage

  1. User1 proposes number 1 and sends it to User2 and User3.

Since, per P2c, User2 and User3 have not accepted any proposal numbered below 1, they can accept the proposal and reply to User1 that they will no longer accept proposals numbered below 1. At this point User1 has received a majority of responses and can move on to phase 2. (If the responses do not form a majority, stage 1 is run again.)

  2. User2 proposes number 2 and sends it to User1 and User3.

Since User1 is receiving a proposal for the first time and, per P2c, has not agreed to any proposal numbered below 2, it can accept it. User3 has accepted User1's proposal number 1, but 2 > 1, so User3 can also accept User2's proposal and will no longer accept any proposal numbered below 2. User2 has also received a majority of responses and proceeds to phase 2.

  3. User3 proposes number 3 and sends it to User1 and User2.

User1 accepts User3's proposal because its number 3 is greater than the number 2 of User2's proposal it holds; User2 accepts User3's proposal because 3 is greater than the number 1 of User1's proposal it holds. User3 has now received a majority of responses and proceeds to phase 2.

The accept stage

  1. User1 sends proposal number 1 with content 1+1=1 to User2 and User3.

Since User2 has declared it will no longer accept proposals numbered below 3, User1's proposal is rejected. Since User3 has declared it will no longer accept proposals numbered below 2, User1's proposal is rejected there too. Rejected by the majority, User1 goes back to stage 1.

  2. User2 sends proposal number 2 with content 1+1=2 to User1 and User3.

Since User1 has declared it will no longer accept proposals numbered below 3, User2's proposal is rejected. User3 has declared it will no longer accept proposals numbered below 2, and this proposal's number is 2, so User3 agrees to User2's proposal. But User2 did not obtain majority approval, so it too goes back to stage 1.

  3. User3 sends proposal number 3 with content 1+1=3 to User1 and User2.

Since User1 declared it would no longer accept proposals numbered below 3, it accepts User3's proposal. Since User2 declared it would no longer accept proposals numbered below 3, it also agrees to User3's proposal.

At this point User3's proposal is approved by the majority.

ZAB

Many people mistakenly think that the ZAB protocol is a special implementation of Paxos, but in fact they are two different protocols. The biggest difference is that ZAB is primarily designed for distributed primary-backup systems, while Paxos implements state machine replication.

Although ZAB is not an implementation of Paxos, it does reference some of the design ideas of Paxos, such as:

  1. The leader proposes a proposal to the followers.
  2. The leader does not commit until a quorum (more than half) of the followers have acknowledged the proposal.
  3. Each proposal carries an epoch number, similar to the ballot in Paxos.

ZAB features

ZAB offers the following guarantees:

  1. Reliable delivery: if a transaction A is committed by one server, it will eventually be committed by all servers.
  2. Total order: if there are two transactions A and B and one server executes A before B, then A is executed before B on all servers.
  3. Causal order: if the sender sends B after transaction A commits, then B must execute after A.
  4. As long as a majority (a quorum) of nodes are up, the system runs normally.
  5. When a node restarts after going down, it must be able to recover the transactions that were in progress.

Concrete implementation of ZAB

  • The system consists of two parts: clients and servers.
  • A client can run read operations on any server node.
  • A client can initiate a write request on any server node; a non-leader node forwards the write request to the leader node, which executes it.
  • ZooKeeper uses an adapted two-phase commit protocol to ensure transaction consistency across the server nodes.

ZXID

A ZXID is ZooKeeper's 64-bit transaction id: the high 32 bits hold the epoch and the low 32 bits hold a counter that increases with every transaction in that epoch.

epoch

After a new leader is elected, it takes the largest ZXID in the current cluster (since that data is the newest), extracts the epoch from that ZXID, and adds 1 to it to form its own epoch.
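Assuming the 64-bit layout described above, the epoch arithmetic can be shown in a few lines (`Zxid` and its method names are invented for this sketch):

```java
// The 64-bit ZXID packs the epoch into the high 32 bits and a per-epoch
// counter into the low 32 bits; a new leader bumps the epoch and resets
// the counter (illustrative helper, not ZooKeeper source code).
public class Zxid {
    public static long make(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    public static int epoch(long zxid)   { return (int) (zxid >>> 32); }
    public static int counter(long zxid) { return (int) (zxid & 0xFFFFFFFFL); }

    // What a freshly elected leader does with the largest ZXID it has seen.
    public static long newLeaderFirstZxid(long maxSeenZxid) {
        return make(epoch(maxSeenZxid) + 1, 0); // bump the epoch, reset the counter
    }
}
```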

History Queue

Each follower node keeps a first-in, first-out (FIFO) queue of incoming transaction requests, which guarantees the order in which transactions are executed.

  • Reliable delivery is guaranteed by ZAB's transaction consistency protocol
  • The total order is guaranteed by TCP
  • The causal order is guaranteed by the followers' history queues

ZAB working mode

The ZAB protocol is a crash-recovery atomic broadcast protocol designed specifically for the distributed coordination service ZooKeeper, which relies on it to achieve distributed data consistency. ZAB has two modes: message broadcast and crash recovery.

Broadcast mode

  1. The leader receives a write request from a client.
  2. The leader generates a new transaction and a unique ZXID for it.
  3. The leader distributes the message, with its ZXID, as a proposal to all follower nodes.
  4. A follower adds the received transaction request to its history queue, writes the proposal to disk, and sends an ACK to the leader.
  5. When the leader has received ACK messages from a quorum (more than half) of the followers, it sends a commit request.
  6. When a follower receives the commit request, it checks whether the history queue contains any transaction with a smaller ZXID that has not yet committed; if so, it waits for those smaller transactions to commit before committing this one.
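The quorum rule in step 5 and the ordering rule in step 6 can be sketched as follows. This is an in-memory model with invented names (`ZabBroadcastDemo`, `hasQuorum`, `commitInOrder`), not ZooKeeper code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of ZAB broadcast rules: commit needs a quorum of ACKs, and a
// follower drains its FIFO history queue strictly in ZXID order.
public class ZabBroadcastDemo {
    // Step 5: "more than half" of the cluster must have acked the proposal.
    public static boolean hasQuorum(int acks, int clusterSize) {
        return acks > clusterSize / 2;
    }

    // Step 6: commit requests may arrive out of order, but a transaction is
    // committed only when it reaches the head of the history queue, so every
    // smaller ZXID commits first.
    public static List<Long> commitInOrder(Deque<Long> queue, List<Long> commits) {
        List<Long> committed = new ArrayList<>();
        List<Long> pending = new ArrayList<>(commits);
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Long zxid : new ArrayList<>(pending)) {
                if (!queue.isEmpty() && queue.peekFirst().equals(zxid)) {
                    committed.add(queue.pollFirst());
                    pending.remove(zxid); // this commit is done; retry deferred ones
                    progress = true;
                }
            }
        }
        return committed;
    }
}
```

Even if the commit for ZXID 2 arrives before the commit for ZXID 1, the follower in this model defers it until 1 has committed, so the commit order always matches the proposal order.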
Recovery mode

Recovery mode can be roughly divided into four stages: election, discovery, synchronization and broadcast.

  1. When the leader crashes, the cluster enters the election phase to elect a potential new leader (generally the node with the largest ZXID in the cluster).
  2. In the discovery phase, the followers communicate with the potential new leader. If more than a quorum of followers agree, the potential new leader adds one to the epoch and enters a new era; the new leader is established.
  3. Data is synchronized across the cluster to ensure transaction consistency among the nodes.
  4. The cluster returns to broadcast mode and begins accepting write requests from clients.

If the leader goes down after committing but before the commit message is sent out, i.e., only the old leader has committed and none of the followers received the commit message, the new leader must ensure that this proposal is eventually committed.

If the leader goes down after producing a proposal but before sending it, i.e., only the old leader holds the proposal, then when the old leader restarts (now as a follower), the new leader must ensure that the old leader discards that proposal. (The new leader tells the old leader to truncate its log back to the last committed position of its epoch.)

ZK election mechanism

  1. Each server issues a vote, voting for itself in the first round. The vote carries (myid, ZXID).
  2. Each server collects the votes from the other servers.
  3. The votes are processed and re-cast. The processing logic: compare ZXIDs first, and if they are equal, compare myids (the larger wins).
  4. A leader is determined as soon as more than half of the machines report the same vote.
  5. The servers change state accordingly.
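Step 3's comparison rule can be sketched as follows (`LeaderElectionDemo` is an invented name; a vote is modeled here as a {myid, zxid} pair, not ZooKeeper's real Vote class):

```java
// Sketch of the vote-processing rule: the larger ZXID wins; on a tie,
// the larger myid wins (illustrative model only).
public class LeaderElectionDemo {
    // a and b are votes of the form {myid, zxid}; returns the winning vote.
    public static long[] betterVote(long[] a, long[] b) {
        if (a[1] != b[1]) return a[1] > b[1] ? a : b; // compare ZXID first
        return a[0] > b[0] ? a : b;                   // then compare myid
    }
}
```

Preferring the larger ZXID means the node with the newest data wins, which is exactly what the recovery phase needs.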

Split brain

To prevent split brain, ZAB requires the cluster to contain 2N+1 nodes. When the network partitions, at most one side can hold more than half of the nodes (at least N+1), while the other side holds at most N. Since electing a leader requires the consent of more than half of the nodes, under no circumstances can the cluster end up with more than one leader.
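The arithmetic behind this argument is tiny (`SplitBrainDemo` is an invented name): a side of the partition may elect a leader only if it holds strictly more than half of the 2N+1 nodes, and two disjoint sides can never both satisfy that.

```java
// Why 2N+1 nodes prevent split brain: after any partition, at most one side
// can contain more than half of the nodes, so at most one side can elect
// a leader (illustrative helper only).
public class SplitBrainDemo {
    public static boolean canElectLeader(int nodesOnThisSide, int totalNodes) {
        return nodesOnThisSide > totalNodes / 2; // strict majority required
    }
}
```

With 5 nodes (N = 2), a 3/2 split leaves exactly one side with a quorum; no split of 5 nodes can give both sides more than 2.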

Java clients connect to the cluster

ZkClient client = new ZkClient("host1,host2,host3,host4,host5");
