In a distributed system, nodes can fail and the network can be slow or partitioned. According to the CAP theorem, we can guarantee at most two of Consistency, Availability, and Partition tolerance.

Systems that require strong consistency, such as bank ATMs, sacrifice availability and refuse service in the event of a failure. MongoDB, Redis, and MapReduce take this approach.

Static websites and query databases with weak real-time requirements sacrifice consistency instead, allowing data to be inconsistent for a period of time. The simple distributed protocol Gossip and the databases CouchDB and Cassandra take this approach.

Figure 1

As shown in Figure 1, consistency problems can be classified into two categories based on the presence or absence of malicious nodes. A non-malicious node may lose, resend, or fail to respond to messages, but will not tamper with them; a malicious node may tamper with messages. The problem with malicious nodes is known as the Byzantine generals problem and is not covered here. Paxos solves the distributed consistency problem for non-malicious nodes well.

Background

In 1990, Leslie Lamport proposed the Paxos algorithm in his paper The Part-Time Parliament. The paper was not taken seriously at first because it was written as a story, without mathematical proofs. It was not until 1998 that the paper was formally published. In 2001, Lamport reorganized it and published Paxos Made Simple. As an early contributor to the field of distributed systems, Lamport received the 2013 Turing Award.

Paxos is widely used in distributed systems. As Mike Burrows, author of Google Chubby, put it: “There is only one consensus protocol, and that’s Paxos.”

The Raft algorithm was a simplification and improvement of Paxos, making it easier to understand and implement.

Types of Paxos

Paxos is named after a fictional island whose parliament reached consensus by voting. But senators could leave, couriers could get lost, and messages could be delivered more than once. These correspond to node faults and network faults in distributed systems.

Figure 2

As shown in Figure 2, suppose the senators want to decide what to have for lunch. One or more of them may propose at the same time, but only one proposal can be approved at a time. This is Basic Paxos, the most basic protocol in the Paxos family.

Basic Paxos is obviously not efficient enough. If multiple instances of Basic Paxos run in parallel, several proposals can be put forward at the same time, such as what to have for lunch, where to go afterwards, and who will pay, and the senators can pass multiple proposals at the same time. This is the Multi-Paxos protocol.

Basic Paxos

Roles

The Paxos algorithm has three roles: Proposer, Acceptor, and Learner. In an implementation, a node can play more than one role.

Figure 3

  • Proposer is responsible for making proposals
  • Acceptor is responsible for voting on proposals
  • Learner gets the vote results and helps spread them

Learner does not participate in the voting process. To simplify the description, we directly ignore this role.

Algorithm

The algorithm runs in two phases: a Prepare phase and an Accept phase.

The Proposer makes two kinds of requests, Prepare and Accept. An Acceptor accepts or rejects a proposal based on the information it has collected.

Prepare phase

  • The Proposer selects a proposal number n and sends a Prepare(n) request to a majority (or more) of Acceptors.
  • When an Acceptor receives the request, if n is greater than any number it has seen before, it replies and promises not to accept proposals numbered less than n in the future. In addition, if it has previously accepted a proposal numbered less than n, it sends that proposal’s number and value back to the Proposer.

Accept phase

  • When the Proposer receives responses from a majority of Acceptors, it sends an Accept(n, value) request, where n is the proposal number and value is the value of the highest-numbered proposal among the Acceptors’ responses. If no Acceptor responded with an accepted proposal, value is the Proposer’s own proposal.
  • When an Acceptor receives the request, if n is greater than or equal to the largest number it has seen, it records the proposal number and value and accepts the request.
  • If the Proposer receives a majority of responses, its proposal has been accepted. Otherwise, it goes back to step 1 and starts a new proposal.

The complete algorithm is shown in Figure 4:

Figure 4.

The Acceptor persistently stores minProposal, acceptedProposal, and acceptedValue.
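To make the two phases concrete, here is a minimal single-process sketch in Python. All names are my own, not from the original; for simplicity the Proposer contacts every Acceptor, and a real implementation would need persistence and RPC.

```python
from dataclasses import dataclass


@dataclass
class Acceptor:
    """Persistent state of a Basic Paxos Acceptor."""
    min_proposal: int = 0          # highest proposal number seen in Prepare
    accepted_proposal: int = 0     # number of the accepted proposal, 0 if none
    accepted_value: object = None  # value of the accepted proposal

    def prepare(self, n):
        """Handle Prepare(n): promise to ignore proposals numbered below n."""
        if n > self.min_proposal:
            self.min_proposal = n
            # Return any previously accepted proposal so the Proposer
            # can adopt its value.
            return (True, self.accepted_proposal, self.accepted_value)
        return (False, None, None)

    def accept(self, n, value):
        """Handle Accept(n, value): accept unless a higher Prepare was seen."""
        if n >= self.min_proposal:
            self.min_proposal = n
            self.accepted_proposal = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """One round of Basic Paxos from the Proposer's point of view."""
    majority = len(acceptors) // 2 + 1
    # Prepare phase: collect promises.
    replies = [a.prepare(n) for a in acceptors]
    granted = [(num, val) for ok, num, val in replies if ok]
    if len(granted) < majority:
        return None  # preempted; the caller should retry with a larger n
    # Adopt the value of the highest-numbered proposal already accepted.
    top_num, top_val = max(granted, key=lambda g: g[0])
    if top_num > 0:
        value = top_val
    # Accept phase: ask the Acceptors to record the proposal.
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted >= majority else None
```

Once a majority has accepted a value, a later round with a larger number rediscovers that value in the Prepare phase and re-proposes it instead of its own.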

Three scenarios

There are three possible scenarios in the Basic Paxos consensus process. They are introduced below.

Case 1: The proposal has already been accepted

See Figure 5. X and Y represent clients, and S1 through S5 are servers, each acting as both a Proposer and an Acceptor. To prevent duplicate numbers, a Proposer’s proposal number consists of two parts:

Serial number.Server ID

For example, S1’s proposal numbers are 1.1, 2.1, 3.1, and so on.

Figure 5 The above image is from page 13 of the Paxos Lecture (Raft User Study)
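A simple way to model such numbers, not from the original, is as (serial number, server ID) pairs; Python’s tuple comparison then gives exactly the ordering the article describes.

```python
# Model proposal number "serial.serverID" as a tuple. Tuples compare
# lexicographically, so a higher serial number always wins and the
# server ID breaks ties, making every number unique.
def proposal_number(serial, server_id):
    return (serial, server_id)

n_s1 = proposal_number(3, 1)  # S1's proposal 3.1
n_s5 = proposal_number(4, 5)  # S5's proposal 4.5
assert n_s5 > n_s1                    # 4.5 outranks 3.1
assert proposal_number(3, 2) > n_s1   # same serial, higher ID breaks the tie
```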

If S1 receives proposal X from a client, then S1, acting as a Proposer, sends a Prepare(3.1) request to S1-S3. Acceptors S1-S3 have not accepted any proposal, so they respond to the request. Proposer S1 then sends an Accept(3.1, X) request, and proposal X is successfully accepted.

After proposal X is accepted, S5 receives proposal Y from a client and sends a Prepare(4.5) request to S3-S5. For S3, 4.5 is larger than 3.1, and since it has accepted X, it responds with the accepted proposal (3.1, X). After receiving the replies from S3-S5, S5 replaces its value Y with X and sends Accept(4.5, X). S3-S5 accept the proposal. In the end, all Acceptors agree on the same value X.

The result of this scenario: the new Proposer uses the proposal that has already been accepted.

Case 2: The proposal has not been accepted, but is visible to the new Proposer

Figure 6 The above image is from page 14 of the Paxos Lecture (Raft User Study)

As shown in Figure 6, S3 has accepted the proposal (3.1, X), but S1-S2 have not yet received the Accept request. After receiving Prepare(4.5), S3 responds with its accepted proposal (3.1, X). S5 replaces its value Y with X and sends Accept(4.5, X) to S3-S5.

S1-S2 then accept (3.1, X), and all Acceptors reach agreement on X.

The result of this scenario: the new Proposer uses the value that was already accepted, and both proposals succeed with the same value.

Case 3: The proposal has not been accepted and is not visible to the new Proposer

Figure 7 The above image is from page 15 of the Paxos Lecture (Raft User Study)

As shown in Figure 7, S1 has accepted the proposal (3.1, X), and S3 receives Prepare(4.5) before Accept(3.1, X). Since 3.1 is less than 4.5, S3 rejects Accept(3.1, X) outright. Proposal X therefore cannot receive a majority of responses and is blocked, while proposal Y passes without a hitch.

The result of this scenario: the new Proposer uses its own proposal, and the old proposal is blocked.

Livelock

Livelocks occur rarely, but can seriously affect performance. A livelock is a situation where two or more Proposers keep preempting each other in the Prepare phase.

Figure 8 The above image is from page 16 of the Paxos Lecture (Raft User Study)

The solution is to have a Proposer wait a random amount of time after it fails, reducing the likelihood of simultaneous requests.
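That retry-with-jitter idea can be sketched as follows. The function names and parameters are assumptions of mine; `try_once` stands in for one round of Basic Paxos.

```python
import random
import time


def propose_with_backoff(try_once, max_retries=10, base_delay=0.01):
    """Retry a proposal with random jitter to break a Prepare-phase livelock.

    try_once is assumed to run one round of Basic Paxos and return the
    chosen value, or None if the round was preempted.
    """
    for attempt in range(max_retries):
        result = try_once()
        if result is not None:
            return result
        # A random wait makes it unlikely that two Proposers keep
        # preempting each other with ever-larger proposal numbers.
        time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return None
```

The exponential growth of the upper bound is a common refinement; a fixed random interval also works.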

Multi-Paxos

Livelock, as mentioned in the previous section, can also be solved with Multi-Paxos: it elects a Leader from among the Proposers and only the Leader submits proposals, eliminating the Prepare phase and reducing the performance cost. Of course, it is possible to simply run Basic Paxos’s multiple-Proposer mechanism in parallel, but the performance is not good enough.

By running Basic Paxos in parallel, multiple proposals can be processed at the same time, so different proposals need to be stored and kept in order.

The structure of an Acceptor is shown in Figure 9, with each square representing an Entry that stores a proposal value. Ascending indexes distinguish the entries.

Figure 9.

Multi-Paxos needs to solve several problems; let’s look at each one.

1. Leader election

The simplest way to elect a Leader is to let the Server with the largest ID become the Leader.

Each Server sends heartbeat packets to the other Servers at an interval T. If a Server does not receive a heartbeat from any Server with a higher ID within 2T, it becomes the Leader.

Any Proposer that is not the Leader must reject client requests or forward them to the Leader.

Of course, more complicated election methods can also be used; I won’t go into detail here.
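The “largest ID wins” rule above can be sketched like this. Class and method names are my own; real servers would drive `on_heartbeat` from the network and check `is_leader` periodically.

```python
import time


class HeartbeatElector:
    """Sketch of 'largest server ID wins' leader election."""

    def __init__(self, my_id, timeout):
        self.my_id = my_id
        self.timeout = timeout  # corresponds to 2T in the text
        self.last_higher = time.monotonic()  # last heartbeat from a higher ID

    def on_heartbeat(self, sender_id):
        # Only heartbeats from servers with a higher ID suppress leadership.
        if sender_id > self.my_id:
            self.last_higher = time.monotonic()

    def is_leader(self):
        # Leader if no higher-ID server has been heard from within the timeout.
        return time.monotonic() - self.last_higher >= self.timeout
```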

2. Omitting the Prepare phase

The Prepare phase serves two purposes: it blocks old proposals and checks for already-accepted proposal values.

When only the Leader sends proposals, Prepare requests do not conflict, so the Prepare phase can be omitted, cutting the number of RPC requests in half.

Change the logic of the Prepare request to:

  • The Acceptor records a single global maximum proposal number
  • The Acceptor replies with the maximum proposal number, and if the current entry and all subsequent entries have accepted no proposal, it also replies with noMoreAccepted

Once the Leader receives noMoreAccepted replies from a majority of Acceptors, the Prepare phase is no longer needed and only Accept requests are sent, until an Accept is rejected, at which point the Prepare phase is needed again.
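The modified Prepare handler might look like the following sketch. All names are assumptions; the log is a dict from entry index to the accepted (number, value) pair.

```python
class MultiPaxosAcceptor:
    """Sketch of the modified Prepare handler for Multi-Paxos."""

    def __init__(self):
        self.max_proposal = 0  # single global maximum proposal number
        self.log = {}          # entry index -> (proposal_number, value)

    def prepare(self, n, index):
        # Record the global maximum proposal number.
        if n > self.max_proposal:
            self.max_proposal = n
        # Any previously accepted proposal at this entry.
        accepted = self.log.get(index)
        # noMoreAccepted: this entry and every later entry are empty.
        no_more_accepted = all(i < index for i in self.log)
        return self.max_proposal, accepted, no_more_accepted
```

When a majority replies with `no_more_accepted=True`, the Leader can skip Prepare for subsequent entries.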

3. Complete information flow

So far, the information held by the nodes is still incomplete, for two reasons.

  • Basic Paxos only needs a majority of nodes to agree. In Multi-Paxos, however, this means some nodes may end up with incomplete entry information. We want every node to have all the information.
  • Only the Proposer knows whether a proposal has been accepted (based on the responses it receives); the Acceptors do not.

The solution to the first problem is simple: the Proposer sends Accept requests to all nodes.

The second problem is a little more complicated. We could simply add a Success RPC in which the Proposer explicitly tells each Acceptor which proposals have been accepted, but this can be optimized to reduce the number of requests.

We add a firstUnchosenIndex parameter to the Accept request, indicating the Proposer’s first unaccepted index. The parameter implies that, for that Proposer, every proposal with a smaller index has been accepted, so the Acceptor can mark those proposals as accepted. Note that only proposals from that same Proposer can be marked: if a Leader switch occurs, different Proposers may hold different information, and marking without distinguishing Proposers could lead to inconsistency.

Figure 10.

As shown in Figure 10, the Proposer is about to send an Accept request with index = 2; entries 0 and 1 are already accepted proposals, so firstUnchosenIndex = 2. When an Acceptor receives the request, it compares the indexes and marks the Dumplings proposal as accepted.

Because of the Leader-switch situation mentioned earlier, an explicit request is still required to obtain complete information. When an Acceptor replies to an Accept request, it includes its own firstUnchosenIndex. If it is smaller than the Proposer’s, the Proposer sends Success(index, value); the Acceptor marks that index as accepted and replies with its new firstUnchosenIndex, and so on until the two indexes are equal.
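The Accept-side marking can be sketched as follows. `CHOSEN` is a sentinel of my own choosing, and the names are assumptions; the explicit Success exchange is not shown, but the returned firstUnchosenIndex is what would drive it. Entries accepted under the same proposal number n are assumed to come from the same Proposer.

```python
CHOSEN = float("inf")  # sentinel proposal number marking a chosen entry


class LogAcceptor:
    """Sketch of firstUnchosenIndex handling on the Acceptor side."""

    def __init__(self):
        self.log = {}  # entry index -> [proposal_number, value]

    def first_unchosen_index(self):
        # First index that is not yet marked as chosen.
        i = 0
        while i in self.log and self.log[i][0] == CHOSEN:
            i += 1
        return i

    def on_accept(self, n, index, value, proposer_first_unchosen):
        # Mark earlier entries accepted under the same proposal number n
        # (i.e. from the same Proposer) as chosen.
        for i, entry in self.log.items():
            if i < proposer_first_unchosen and entry[0] == n:
                entry[0] = CHOSEN
        # Record the new proposal at this entry.
        self.log[index] = [n, value]
        # Reply with our own firstUnchosenIndex so the Proposer can
        # follow up with explicit Success(index, value) requests.
        return self.first_unchosen_index()
```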

Conclusion

Paxos is an important consensus algorithm for distributed consistency problems. This article introduced Basic Paxos and the parallel Multi-Paxos.

In Basic Paxos, there are three basic roles, Proposer, Acceptor, and Learner, and three basic scenarios that can occur when a proposal is made. For Multi-Paxos, three problems to be solved were introduced: Leader election, Prepare omission, and complete information flow.

In the next article, we will implement a simple demo to verify this algorithm, and the implementation process will be covered in more detail.

References

Distributed consistency and consensus algorithms

Paxos algorithm and Raft algorithm

Paxos

Paxos Made Simple

Paxos lecture (Raft user study)

YouTube | Paxos lecture (Raft user study)

Copyright

This work is licensed under CC BY 4.0; please include a link to the original when reprinting.