Mike Burrows, author of Google Chubby, says:

There is only one consensus protocol, and that’s Paxos.

The introduction

We have already described the consensus problem in detail and introduced some consensus algorithms, among which Paxos algorithm is the consensus algorithm proposed by Leslie Lamport in 1990. Unfortunately, the analogy of the Greek Democratic Parliament clearly fails. Lamport is like writing a novel, Turning a complex mathematical problem into an archeological historical novel. According to Lamport’s own account, all three reviewers agreed that the paper was interesting, though not important, and that any paxOS-related background should be removed. Lamport is annoyed by the lack of humor, so he doesn’t plan to make any changes to the paper.

Years later, two people working at SRC(Systems Research Center, DEC founded in 1984, where Lamport had also worked) needed to find the right algorithm for the distributed system they were building, and Paxos provided just what they wanted. Lamport sent them the paper, and they found nothing wrong with it.

Lamport therefore felt it was Time for The paper to be republished, and “The Part-time Parliament” was finally published in 1998.

However, many people complained that the paper could not be understood at all. People only remembered the strange story, not the Paxos algorithm. Lamport can’t go anywhere without being complained about. In 2001, he republished a paper on Paxos called “Paxos Made Simple”. This time, there was no formula and the abstract was just one sentence:

The Paxos algorithm, when presented in plain English, is very simple.

Full of sarcasm!

However, possibly because of the order of presentation, the paper was still very difficult to understand, and a series of articles were written to explain the paper (duplicate paper) and how it could be implemented in engineering.

In my opinion, the best video explaining Paxos comes from Diego Ongaro, author of Raft algorithm. This paper uses the pictures in Diego’s handout to understand Paxos algorithm and also corrects a mistake of Diego’s pen.

The term

The basic concept

  • Proposal Value: the Value of the Proposal.
  • Proposal Number:
  • Proposal: Proposal = Proposal number + Proposal value;
  • Chosen: approved. Once a value is Chosen, subsequent Paxos must interact with that value.

1, Make a Proposal, make a Proposal, make a Proposal

role

  • Proposer: the Proposer that initiates a proposal;
  • Acceptor: proposal receiver;
  • Learner: proposal Learner;

Problem description

For high availability, a common design is to write to a master node and then copy to the slave nodes. The problem with this solution is that if the master node fails, the entire service becomes unavailable or the data is inconsistent.

To overcome the single-point write problem, Quorum writes, the idea being to write to more than half of the nodes. That is, if there are N nodes in the cluster, the client needs to write W >= N/2 + 1 nodes. No primary node is required. This method can tolerate up to (n-1)/2 node failures.

But the question remains: how should each recipient decide whether to accept the value of this request?

If we accept the values we receive the first time, then no majority occurs, no value is Chosen, and the algorithm cannot terminate when Split Votes occur, which violates liVENESS.

In order to solve the Split Votes problem, we allow multiple different values to be accepted, and every request received is accepted. In this case, a new problem arises, as follows, more than one value may be Chosen, which violates safety.

Note that Paxos emphasizes:

Once a value has been chosen, future proposals must propose the same value.

In other words, basic-Paxos we’re talking about is only going to choose one value. Based on this, a 2-phase protocol is required, and for values already Chosen, subsequent proposals will use the same values.

In the case shown below, S3 simply rejects the red value because Blue is already Chosen, thus guaranteeing success.

This way we need to rank proposals. If you are familiar with Distributed systems, you may recall the paper “Time, Clocks and the Ordering of Events in a Distributed System”. We cannot use Time to determine the order of proposals.

Proposal Number

A simple way to do this is to give each request a unique number, e.g.

. Seq_id is incremented for sorting; In order to avoid crash and restart, it must be able to persist storage locally.
,>

Paxos

Now we can finally begin to describe the Paxos algorithm.

As mentioned above, Paxos is a two-phase algorithm. We call the first stage the Prepare stage and the second the Accept stage. Correspond to two rounds of RPCS.

First round Prepare RPCs:

Request (also called Prepare phase) :

Proposer selects a proposal number N and broadcasts a Prepare(n) request to more than half of the acceptors.

Note: Prepare() requests are sent to all acceptors. Original paper: Phase 1. (a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.

Prepare (n) here does not contain the value of the proposal.

Pseudo code:

send PREPARE(++n)
Copy the code

Response (also called PROMISE stage) :

An Acceptor receives the Prepare (n) request in either of the following circumstances:

  • Return if n is greater than the number of all Prepare requests receivedPromise()With a promise that proposals numbered less than N will not be accepted. If any of the proposals are Chosen,Promise()The response should also contain the previous proposal number and corresponding value.
  • Otherwise (that is, n is less than or equal to the maximum number previously received by an Acceptor) it is ignored, but often a rejection response is returned.

Therefore, acceptors require persistent stores for max_n, accepted_N, and accepted_VALUE

Pseudo code:

if (n > max_n)
    max_n = n     // save highest n we've seen so far
    if (proposal_accepted == true) // was a proposal already accepted?
        respond: PROMISE(n, accepted_N, accepted_VALUE)
    else
        respond: PROMISE(n)
else
    do not respond (or respond with a "fail" message)
Copy the code

Second round Accept RPCs:

Request (also called withdraw phase) :

(a) when a Proposer receives a Promise() from more than half of its acceptors, it sends a Accept(n, value) request with a proposal number and a value to the majority of acceptors. (Note: Accept() requests are sent to all acceptors. In my opinion, Accept() requests are sent to most acceptors.)

Note: Proposer does not necessarily takeAccept()Requests are sent to responding majority Acceptors and another majority Acceptors may be selected to broadcastAccept()The request.

** If a previous Promise response returns an accepted_VALUE, use that value as value. If no accepted_VALUE is returned, you are free to decide on the proposal value value.

Pseudo code:

did I receive PROMISE responses from a majority of acceptors?
if yes
    do any responses contain accepted values (from other proposals)?
    if yes
        val = accepted_VALUE    // value from PROMISE message with the highest accepted ID
    if no
        val = VALUE     // we can use our proposed value
    send Accept(ID, val) to at least a majority of acceptors
Copy the code

Response (also called ACCEPT phase) :

An Acceptor receives an Accept() request, and if an Acceptor does not have a separate Promise with a number greater than n during this period, it accepts the proposal.

Pseudo code:

if (n >= max_n) // is the n the largest I have seen so far?
    proposal_accepted = true     // note that we accepted a proposal
    accepted_N = n             // save the accepted proposal number
    accepted_VALUE = VALUE       // save the accepted proposal data
    respond: Accepted(N, VALUE) to the proposer and all learners
else
    do not respond (or respond with a "fail" message)
Copy the code

Some examples

Case 1: The proposal is Chosen

  1. S1 receives proposal request X from the client and sends a request to S1-S3Prepare (3.1)The request,PROMISE()The response returns that no proposal is Chosen
  2. Since s1-S3 has no proposal Chosen, S1 continues to send to S1-S3Accept(3.1, X)Request, proposal is successfully Chosen
  3. After the proposal is Chosen, S5 receives a request with the proposal value Y from the client and sends the request to S3-S5Prepare (4.5)Request, because number 4 > 3 will receive the proposal whose value is X already ChosenPROMISE()The response
  4. So the S5Replace the proposal value Y with X, to s1-S3Accept(4.5, X)Request, proposal is Chosen again

Case 2: The proposal is not Chosen and the Proposer is visible

Case 2 is similar to case 1. After S3 has Chosen the proposal, S5 receives a PROMISE() response from S3 containing the already Chosen proposal value X, so it also replaces the proposal value with X, and all acceptors reach a consensus on X.

Note the above pseudocode: do any responses contain accepted values, which means that any Acceptor that returns a proposal value in a Promise() response should replace it with the proposal value.

Case 3: No proposal has been submitted and no Proposer is visible

In case 3, the proposal is Chosen only by S1, S3 has not Chosen the proposal, s3-S5 Promise() response does not have any proposal information, so S5 decides the proposal value is Y and sends the Accept(4.5, Y) request.

Since the proposal number n promised by S3 now changes to 4 and 4 is greater than 3, S3 no longer accepts S1’s subsequent Accept(3.1, X) requests. Proposal value X is blocked, and proposal value Y is eventually Chosen.

Live lock

(a) when a Proposer prepares for a request in the first phase before a second phase accepts it, then a second Proposer sends a request with a larger number in the first phase. If this process is endless and acceptors remain stuck in the sequence number process, no one will be successful.

The simplest way to solve for a live lock is to introduce a random timeout where a Proposer proposes first, reducing the likelihood of preempting each other all the time.

conclusion

Paxos selects only one value from one or more values. If Paxos needs to be run repeatedly to create a duplicate state machine, we call it multi-PaxOS, but if each command is consistent through an instance of the Basic Paxos algorithm, there is a lot of overhead. Some optimizations can be made for multi-PaxOS, and we’ll discuss variations of Paxos in the next article.