Link to this post: blog.openacid.com/algo/paxos/

Preface

What is Paxos?

  • An algorithm to ensure strong consistency of multiple replicas in distributed systems.

What is Paxos for?

  • A bunch of machines without Paxos is called distributed;
  • A collection of machines coordinated by Paxos is called a distributed system.

Mike Burrows, author of Google Chubby, says:

There is only one consistency algorithm in the world and that is Paxos…

Other consistency algorithms can be seen as variations and extensions of Paxos.

Another distributed algorithm that is often mentioned is Raft, whose contribution is to bring the consistency algorithm down to earth: Leslie Lamport’s theory is very abstract, so to apply it in the real world, engineers must fully grasp the theory and then add the engineering steps needed to make it work.

It is not unusual to be asked about the difference between Raft and Paxos, or which one to choose for an implementation, a question you may have no answer to before learning Paxos. The question is a bit like asking what the difference is between arithmetic and an abacus, and whether a shopkeeper should settle the books with arithmetic or with an abacus.

Leslie Lamport was in Beijing in 2015 and someone asked him what the difference was between Paxos and Raft.

I haven’t heard of Raft…

The core of Raft can be seen as an application of Multi-Paxos. For understanding the heart of consistency algorithms, Paxos makes it easier to strip away irrelevant distractions and get to the core of the problem. So we choose Paxos as the entry point for understanding consistency algorithms, and walk through it from start to finish.

Raft is more popular on the Internet than Paxos because it is more direct: it explains HOW in detail. But Raft is actually more complex than Paxos, and it does not explain WHY in much detail. Paxos essentially explains the WHY, yet it has lacked an easy-to-understand tutorial, which is why it has not been more widely accepted. Hence this article: an introductory Paxos tutorial that starts from the basic problem of replication in distributed systems and, by gradually solving and improving on these problems, finally derives the Paxos algorithm.

This paper is divided into two parts:

  • The first part discusses the distributed consistency problem and step-by-step improvements to its solutions, deriving the Paxos algorithm in plain language. If you just want to understand Paxos and don’t want to spend too much time digging into details, reading this part is enough.
  • The second part is a strict description of the Paxos algorithm and protocol. It can serve as a summary of the implementation-oriented part of the original Paxos paper. If you plan to implement your own Paxos or a similar protocol, you will need to read these details carefully; hopefully this part saves you time reading the original paper.

This article is based on xp’s earlier slides for a Paxos talk, with more verbal explanation added.

Problems to be solved in distributed systems

Slide-00


Slide-01 Paxos’s job is to take a bunch of running machines and make them work together as a coherent whole. In this system, every machine must hold the same state: for example, if a picture is uploaded to one machine, the other two machines must also have a copy of that picture, so that the whole system stays in a consistent state.


Slide-02 Table of contents page, no explanation needed.


Slide-03 The consistency problems of distributed systems ultimately boil down to the consistency of distributed storage. AWS object storage, for example, requires 9 to 13 nines of reliability. Such high reliability has to be built on top of far less reliable hardware.


Slide-04 Almost all distributed storage (and even some single-machine systems; see EC part 1: principles, EC part 2: implementation, EC part 3: limits) must build highly reliable storage out of cheap hardware through some form of redundancy. The basis of redundancy is the multi-replica strategy: one piece of data is stored as several copies. Multiple replicas provide reliability, but consistency between the replicas requires a distributed consistency algorithm such as Paxos.


Slide-05 Over the years a variety of replication strategies have been proposed to address the needs of different scenarios. Apart from differences in the number of replicas, these algorithms are all essentially trying to solve the same problem. Starting from the next slide, we briefly review the various replication strategies, look at their strengths and weaknesses, and then see how Paxos solves the problem of consistency between replicas.

A less-than-perfect replication strategy

Slide-06 Table of contents page, no explanation needed.


Slide-07 Asynchronous master-slave replication is one of the simplest strategies and is easy to implement, but it has a problem: there is a time gap between the client receiving the “data is safe” acknowledgment (OK) and the data actually being safe (replicated to all machines). If, during this gap, the machine that received the client’s request (the master) is struck by lightning, hit by a meteorite, or has its power cord kicked out by the cleaning lady, the data may be lost. So this is not a reliable replication strategy (using asynchronous master-slave replication requires believing that meteorites, lightning, and cleaning ladies do not exist).


Slide-08 Compared with asynchronous master-slave replication, synchronous master-slave replication provides complete reliability: the master does not tell the client that the data is safe until it has been safely replicated to all machines.

But the fatal drawback of synchronous master-slave replication is that if any machine in the system goes down, writes cannot proceed. In effect, the system’s availability decreases exponentially as the number of replicas grows.


Slide-09 A compromise between synchronous and asynchronous replication looks like a good solution: semi-synchronous replication. It requires the master to replicate data to enough machines before answering the client, but not to all of them. This number of replicas is enough to provide fairly high reliability, and a single machine going down does not stop the whole system from writing.

But it is still not perfect. For example, data a is replicated to slave-1 but not to slave-2, while data b is replicated to slave-2 but not to slave-1. In this situation, if the master fails, the data must be recovered from one of the slaves, yet neither slave holds the complete data. So there is some inconsistency in the data across the system.


Slide-10 To solve the data inconsistency of semi-synchronous replication, we can add one more improvement to the replication strategy: majority read and write. Every piece of data must be written to more than half of the machines, and every read must query more than half of the machines for that data.

Under this strategy, data reliability is sufficient, outage tolerance is sufficient, and all data can still be read even if any single machine fails.
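
To make the majority read/write rule concrete, here is a minimal sketch in Python (the in-memory node list and the function names are illustrative assumptions, not something from the original slides):

```python
import random

# A toy cluster: each "node" is just an in-memory dict, standing in for a
# storage server reached over RPC (an illustrative assumption).
NODES = [{} for _ in range(3)]
QUORUM = len(NODES) // 2 + 1        # "more than half"

def majority_write(reachable, key, value):
    """The write counts as successful only if more than half of all nodes got it."""
    for node in reachable:
        node[key] = value
    return len(reachable) >= QUORUM

def majority_read(reachable, key):
    """Read from more than half of the nodes. Any two majorities intersect,
    so at least one sampled node has seen the last successful majority write."""
    if len(reachable) < QUORUM:
        raise RuntimeError("cannot reach a majority")
    sampled = random.sample(reachable, QUORUM)
    return [node.get(key) for node in sampled]

# A write that reaches only 2 of the 3 nodes still counts as successful,
# and every majority read afterwards sees it on at least one node.
print(majority_write(NODES[:2], "a", "x"))   # True
print(majority_read(NODES, "a"))             # e.g. ['x', None] or ['x', 'x']
```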


Slide-11 The majority read/write strategy also has a “but”: an update to a piece of data can create inconsistent states. For example:

  • Node-1 and Node-2 both hold a=x;
  • the next update writes a=y to Node-2 and Node-3.

At this point, a client reading a will see two different pieces of data if it contacts Node-1 and Node-2.

To avoid this ambiguity, majority read/write must also attach a globally increasing timestamp to each write. When a record with a larger timestamp is present, records with smaller timestamps are ignored. So during the read the client sees both a=x₁ and a=y₂; by comparing the timestamps 1 and 2 it finds that y is the newer data and ignores a=x₁. This ensures that multiple updates of the same data do not create ambiguity.
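
As a small sketch of this rule (the (timestamp, value) tuple layout is an assumption for illustration), resolving a majority read by timestamp could look like this:

```python
def resolve_by_timestamp(replies):
    """Each reply is a (timestamp, value) pair read from one node, or None if
    that node has no record. The record with the largest timestamp wins; older
    records are ignored."""
    latest = None
    for reply in replies:
        if reply is None:
            continue
        if latest is None or reply[0] > latest[0]:
            latest = reply
    return latest

# The client saw a=x with timestamp 1 on one node and a=y with timestamp 2 on
# another; the read resolves to the newer value y.
print(resolve_by_timestamp([(1, "x"), (2, "y")]))   # (2, 'y')
```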


Slide-12 Yes, but here comes another “but”. Majority read/write with timestamps still has a problem. In the example above, a=x₁ has been written to Node-1 and Node-2, while a=y₂ has only been written to Node-3 when the writing client process crashes, leaving the state as follows:

Then another client comes in to read,

  • If it connects to Node-1 and Node-2, it will get a=x₁.
  • If it connects to Node-2 and Node-3, it gets a=y₂.

So the system as a whole still presents inconsistent information to the outside world.


Slide-13 We are now very close to the final solution. Paxos can be seen as a further upgrade of majority read/write: through two rounds of the (individually not rigorous) majority read/write, Paxos implements a rigorous strong-consistency algorithm.

Deriving Paxos from majority read/write

Slide-14 First of all, to clearly present the core problem of distributed systems, consistency, we set up a toy storage system. On this system we gradually implement strongly consistent storage and arrive at Paxos’s solution to the consistency problem.


Slide-15 In this implementation, the set command is implemented directly as a majority write, which is a very simple step. The logic of inc is also simple: read the current value i₁ of a variable, add a number to it to get i₂, and then write i₂ back to the system with a majority write.
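
A minimal sketch of this strawman system, assuming a toy in-memory cluster and a (timestamp, value) record per node (all names here are illustrative):

```python
# A strawman built on majority read/write: set is one majority write,
# inc is a majority read followed by a majority write.
NODES = [{} for _ in range(3)]
QUORUM = len(NODES) // 2 + 1

def majority_write(key, ts, value):
    # Each node keeps only the record with the newest timestamp.
    for node in NODES:
        cur = node.get(key)
        if cur is None or ts > cur[0]:
            node[key] = (ts, value)

def majority_read(key):
    replies = [node[key] for node in NODES[:QUORUM] if key in node]
    return max(replies, default=(0, 0))   # (timestamp, value), newest wins

def inc(key):
    ts, i1 = majority_read(key)        # read i1
    i2 = i1 + 1                        # add one
    majority_write(key, ts + 1, i2)    # majority-write i2 back

majority_write("i", 1, 0)   # set i = 0
inc("i")
print(majority_read("i"))   # (2, 1)

# If two clients ran inc() concurrently, both could read i1 = 0 and both
# write back 1: one of the two increments is silently lost.
```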


Slide-16 You have surely spotted the problem with this implementation: if two client processes run the inc operation concurrently, then with a plain majority read/write implementation, client Y will inevitably overwrite client X’s write, resulting in a lost update.

Paxos is designed to solve exactly this problem: it lets Y detect such concurrency conflicts and take measures to avoid the lost update.


Slide-17 Abstracting the problem above: the later writer Y must not blindly update i₂. Instead, it should detect the existence of i₂ and then store its own result in the next version, i₃, which is written back to the system.

This turns the problem into: every version of i may be written only once, and no modification is allowed. If the system is designed to satisfy this requirement, the inc operations of both X and Y can be carried out correctly.


Slide-18 This in turn reduces our problem to a simpler, more basic one: how to determine whether a value (such as iⱼ) has already been written.

Intuitively, the solution is also simple: before X or Y writes, it does a majority read to check whether some other client process is already writing; if so, it aborts.


Slide-19 But!!! this read-before-write itself has a concurrency problem: X and Y may perform the read-before-write at the same time, both conclude that no other process is writing, and both decide “then I can write”. This still causes lost updates.


Slide-20 The way to solve this problem is to let the storage nodes remember who was the last one to perform the read-before-write, and to allow only the process that performed the read-before-write most recently to write data afterwards; processes that performed the read-before-write earlier are denied write permission.

As you can see, if every node remembers who read it, then once Y has completed its read-before-write after X, the system as a whole can prevent the now-outdated X from writing.

This approach works because, with majority writes, the system allows at most one majority write to succeed. Paxos likewise achieves strong consistency through two rounds of majority read/write.
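
A minimal sketch of this idea, assuming a node that remembers only an integer identifier of the last reader (class and method names are illustrative):

```python
class Node:
    """A storage node that remembers who performed the read-before-write last,
    and only lets that process write afterwards."""
    def __init__(self):
        self.last_reader = 0   # identifier of the most recent reader
        self.value = None

    def read_before_write(self, reader_id):
        if reader_id < self.last_reader:
            return False               # someone newer has already read here
        self.last_reader = reader_id
        return True

    def write(self, reader_id, value):
        if reader_id < self.last_reader:
            return False               # a newer reader took away this right to write
        self.value = value
        return True

node = Node()
node.read_before_write(1)   # X reads first
node.read_before_write(2)   # Y reads later
print(node.write(1, "x"))   # False: X is now outdated and is rejected
print(node.write(2, "y"))   # True:  Y, the latest reader, may write
```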


Slide-21 That is the whole idea behind the Paxos algorithm; isn’t it simple? What remains are implementation details: how to identify a client such as X or Y, how to determine who was the last process to perform the read-before-write, and so on.


Slide-22 Leslie Lamport wrote up just such a simple algorithm in a paper and then won the Turing Award! Changing the world can be that easy.

Paxos algorithm description

In the following sections we describe precisely, in more computer-friendly language, how Paxos works.

Slide-23 first identifies the problem to be solved:


Slide-24 The Paxos we introduce here is the simplest, classic Paxos. Afterwards we introduce several optimizations of Paxos, Multi-Paxos and Fast-Paxos, both of which are optimizations of Paxos at the theoretical level.


Slide-25 The problem the Paxos algorithm addresses is how to build a reliable distributed system on top of unreliable hardware. But the core Paxos algorithm deals only with network delay and reordering; it does not try to solve problems of unreliable storage or message corruption, because those two problems are essentially unrelated to distribution and belong to the level of data integrity checking.

For message corruption, please refer to the introduction to Byzantine Paxos.


Slide-26 This article describes things using the terminology of Classic Paxos.

The later Fast Paxos paper implements Fast-Paxos and also covers Classic-Paxos, but expresses them with somewhat different terms.

  • Proposer can be understood as a client.
  • Acceptor can be understood as a storage node.
  • Quorum, in 99% of cases, means a majority, that is, more than half of the Acceptors.
  • Round identifies one run of the Paxos algorithm. Each Round consists of two majority read/writes, identified in the algorithm description as Phase-1 and Phase-2. For simplicity and clarity, the algorithm also requires that the rounds generated by the Proposers be globally monotonically increasing and distinct from one another, so that a round number both orders runs and distinguishes Proposers.


Slide-27 There are also several concepts on the storage (Acceptor) side:

  • last_rnd is the Acceptor’s record of the last Proposer that performed the read-before-write; it determines which Proposer is later allowed to actually write a value to this Acceptor.
  • v is the last value that was written.
  • vrnd is paired with v; it records the Round in which v was written.

v and vrnd are used to repair an unfinished Paxos run. An interrupted Paxos run may leave behind values that did not reach a majority (similar to the dirty-data problem of naive majority writes). vrnd is what Paxos uses to determine which value was written most recently, and hence which incomplete Paxos run should be repaired. We will illustrate the role of vrnd through several examples later.
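
In code, the Acceptor’s state might be represented roughly like this (a sketch; the dataclass layout is an assumption, with field names taken from the description above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AcceptorState:
    last_rnd: int = 0         # rnd of the last Proposer that did the read-before-write
    v: Optional[str] = None   # the last value written, if any
    vrnd: int = 0             # the Round in which v was written
```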


Slide-28 First, Phase-1 of Paxos, which corresponds to the read-before-write step described earlier. It leaves a marker on the storage nodes (Acceptors) saying “I am going to write later”, and it also reads from the Acceptors whether there is a previously unfinished Paxos run; if there is, it tries to repair it, and if not, it continues with what it originally wanted to do.

The slides describe the request/reply format of Phase-1 in a YAML-like notation: the request carries the Proposer’s rnd, and the reply carries any previously accepted v together with its vrnd.

When Phase-1 completes, the Acceptor records X’s rnd=1 in its last_rnd and returns the v and vrnd it previously saved.
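
Here is a sketch of how an Acceptor might handle a Phase-1 request, following the description above (the dict-based message format and the exact rejection rule for a stale rnd are illustrative assumptions):

```python
class Acceptor:
    def __init__(self):
        self.last_rnd = 0   # who did the read-before-write last
        self.v = None       # last written value
        self.vrnd = 0       # the rnd in which v was written

    def handle_phase1(self, req):
        # req = {"rnd": ...}: the Proposer announces "I am going to write later".
        if req["rnd"] < self.last_rnd:
            # A newer Proposer has already been here; refuse this one.
            return {"ok": False, "last_rnd": self.last_rnd}
        self.last_rnd = req["rnd"]
        # Return any previously accepted value so the Proposer can repair it.
        return {"ok": True, "v": self.v, "vrnd": self.vrnd}

acceptor = Acceptor()
print(acceptor.handle_phase1({"rnd": 1}))   # X's Phase-1: recorded, nothing to repair
```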


Slide-29 When Proposer X receives replies from a quorum, it considers itself ready to proceed. If it cannot contact more than half of the Acceptors, the system is stuck; this is what Paxos means by requiring that fewer than half of the nodes fail.

A Proposer then faces one of two scenarios (sketched in code after this list):

  • None of the replies contains a non-empty v. This means the system is clean and no value has yet been written by any other Paxos client (because a majority read is guaranteed to see the result of any majority write). Proposer X then proceeds to Phase-2 and actually writes the value it wants to write to more than half of the Acceptors.

  • If Proposer X receives replies containing a previously written v and vrnd, it must assume that some other Proposer is (or was) running, even though it cannot know whether that Proposer succeeded. Since a value that may already have been written must never be modified, X must not use its own value; instead, the v corresponding to the largest vrnd that X sees becomes the value X will write in Phase-2.

    In this case, X is effectively performing a repair of the other client’s (Proposer’s) interrupted run.
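
A sketch of the Proposer’s decision after Phase-1, covering the two scenarios above (the reply format is the same illustrative assumption as before):

```python
def choose_phase2_value(replies, my_value):
    """Decide what a Proposer writes in Phase-2, given a quorum of Phase-1
    replies. Each reply is a dict with optional "v" and "vrnd" fields."""
    seen = [r for r in replies if r.get("v") is not None]
    if not seen:
        # Case 1: the system is clean; write our own value.
        return my_value
    # Case 2: someone may already have written a value; repair it by
    # choosing the v with the largest vrnd.
    return max(seen, key=lambda r: r["vrnd"])["v"]

# X wanted to write "x", but one reply reports v="y" written in round 2,
# so X must write "y" in its Phase-2 instead.
print(choose_phase2_value([{"v": None}, {"v": "y", "vrnd": 2}], "x"))   # y
```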


Slide-30 In Phase-2, Proposer X writes its chosen value to the Acceptors: either the value it originally wanted to write, or a v that it read back from some Acceptor (a repair).

Similarly, the slides describe the Phase-2 request/reply in a YAML-like notation: the request carries the Proposer’s rnd and the value v to be written.


Of course, by this point (between X receiving the Phase-1 replies and sending the Phase-2 requests), other Proposers may already have completed a Phase-1 with a larger rnd, so X may fail to complete its Phase-2 successfully.

The Acceptor compares the rnd in the Phase-2 request with the rnd it has recorded locally to decide whether X still has write permission. If the rnd in the request is the same as the rnd recorded locally by the Acceptor, the write is allowed: the Acceptor stores v locally and records the rnd from the Phase-2 request into its local vrnd.
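
And a corresponding sketch of the Acceptor’s Phase-2 handling, following the rule just stated (message format and return values are illustrative assumptions):

```python
class Acceptor:
    def __init__(self):
        self.last_rnd = 0
        self.v = None
        self.vrnd = 0

    def handle_phase2(self, req):
        # req = {"rnd": ..., "v": ...}: accept only if the rnd matches what the
        # Proposer's own Phase-1 left here; otherwise it has lost write permission.
        if req["rnd"] != self.last_rnd:
            return {"ok": False, "last_rnd": self.last_rnd}
        self.v = req["v"]          # store the value
        self.vrnd = req["rnd"]     # remember the round it was written in
        return {"ok": True}

a = Acceptor()
a.last_rnd = 2                                 # Y's Phase-1 with rnd=2 already arrived
print(a.handle_phase2({"rnd": 1, "v": "x"}))   # X's Phase-2 is rejected
print(a.handle_phase2({"rnd": 2, "v": "y"}))   # Y's Phase-2 is accepted
```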

See Paxos in action with examples

OK, that is the full description of the Paxos algorithm. These abstract rules cover the handling of virtually every possible situation, and it is not easy to see at a glance what they accomplish, so let’s walk through a few examples of how Paxos handles various states and eventually brings the whole system to agreement.


Slide-32 A conflict-free example, which needs no explanation.


Slide-33 An example where X and Y run Paxos at the same time, and Y forces X to be interrupted:

  • X successfully completes its Phase-1 read-before-write, writing rnd=1 to the two Acceptors on the left.
  • Y then runs Phase-1 with the larger rnd=2 on the two Acceptors on the right, overwriting X’s rnd on the middle Acceptor.
  • X believes it can still run Phase-2, but it cannot: X’s Phase-2 succeeds only on the leftmost Acceptor; the middle Acceptor rejects it.
  • Y’s Phase-2 (v=y, vrnd=2) succeeds on the two Acceptors on the right.


Slide-34 Continuing the example above, let’s see how X handles having its write permission taken away:

At this point X’s Phase-2 has failed, so it must start over with a larger rnd=3.

  • After X successfully runs Phase-1 on the two Acceptors on the left, it sees two previously written values: v=x with vrnd=1 and v=y with vrnd=2. X may no longer write whatever value it wants: it must not modify an existing value in this Paxos run, and the only thing this run of X can do is repair the (possibly interrupted) runs of other Proposers.
  • Here v=y, vrnd=2 is the value that may have reached a majority in some Phase-2. v=x, vrnd=1 cannot have, because the other Proposers also obey the algorithm: if v=x, vrnd=1 had reached a majority in some Phase-2, Y would have seen it in its Phase-1 and would not have written v=y, vrnd=2.

So X chooses v=y and continues running with rnd=3, finally writing v=y, vrnd=3 to all Acceptors.


Slide-35 Paxos also has a less important role, the Learner, which is included to make the system complete. It is not a key role in the algorithm; it is simply notified of the outcome at the end.

Paxos optimization

Slide-36 The first optimization: Multi-Paxos.

One of the early criticisms of Paxos was that it requires two rounds of RPCs for every value written: Phase-1 and Phase-2. So a common optimization is to run Phase-1 for multiple Paxos instances with a single RPC.

For example, run Phase-1 for the ten instances numbered i₁~i₁₀ with a single RPC, using rnd 1001, 1002, …. This saves nine RPCs, so on average each write needs only one RPC to complete.
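
A rough sketch of amortizing Phase-1 across several instances with a single message (the message layout and the per-instance state are illustrative assumptions):

```python
def batched_phase1_request(instance_ids, rnd):
    """One RPC carrying the Phase-1 of many Paxos instances at once."""
    return {"rnd": rnd, "instances": list(instance_ids)}

def handle_batched_phase1(acceptor_state, req):
    """acceptor_state maps instance id -> {"last_rnd": ..., "v": ..., "vrnd": ...}."""
    replies = {}
    for i in req["instances"]:
        st = acceptor_state.setdefault(i, {"last_rnd": 0, "v": None, "vrnd": 0})
        if req["rnd"] >= st["last_rnd"]:
            st["last_rnd"] = req["rnd"]
            replies[i] = {"ok": True, "v": st["v"], "vrnd": st["vrnd"]}
        else:
            replies[i] = {"ok": False, "last_rnd": st["last_rnd"]}
    return replies

# Phase-1 for ten instances i1..i10 in one request; each later write then
# needs only its own Phase-2 RPC.
state = {}
print(handle_batched_phase1(state, batched_phase1_request(range(1, 11), 1001)))
```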

This makes it look a bit like Raft:

  • Add to this the concept of commit (commit can be understood as whether v is sent to a majority or not),
  • And group membership changes (expanding the definition of quorum from “more than half” to “any two QuourMs must intersect”).


Slide-37 The second optimization: Fast-Paxos.

Fast-Paxos enlarges the quorum in exchange for reaching agreement with a single RPC. If Fast-Paxos fails to reach agreement in one RPC, it falls back to Classic Paxos.


Slide-38 Fast-Paxos must enlarge its quorum so that it can degrade to Classic Paxos without a different value being chosen. In other words, the quorum in a fast round has a different size from that of Classic Paxos. Let’s also look at why the fast quorum cannot be the same as the classic quorum: with identical quorums, the configuration shown would cause the repair in the classic phase to incorrectly choose the value y₀:


Slide-39 The crudest solution to this problem is to set the fast quorum to n, that is, a fast round is considered successful only if all Acceptors accept the write. Then if Proposers X and Y write concurrently, neither succeeds, and both falling back to Classic Paxos causes no problem, because no Proposer believes its value has already been written successfully.

Going one step further, we can conclude that if the Classic Paxos quorum is n/2+1, then the fast-round quorum must be greater than ¾n. The reason, roughly, is that in the worst case the Acceptors that accepted the fast-round value must still make up more than half of any classic quorum, so that the repair process cannot choose a value different from the one written in the fast round. (For example, with n=5 Acceptors, the classic quorum is 3 and the fast quorum must be at least 4.)


Slide-40 Here is an example of a fast-round conflict in which X succeeds and Y fails:

X’s value x₀ is successfully written to 4 Acceptors (satisfying the fast quorum), while Y’s value reaches only 1 Acceptor. Y then enters a classic round to repair. Because Y is guaranteed to see at least two copies of x₀ in its classic Phase-1, Y will always choose the same value as X, ensuring that the value already written is not modified.


Slide-41 Let’s look at another conflict, where neither X nor Y reaches a fast quorum:

In this case, neither X nor Y considers its fast round successful, so whichever value is chosen during the repair process is fine. The final outcome comes down to the competition between the Classic Paxos runs of X and Y; eventually either x₀ or y₀ will be chosen.

Other

Slide-42 shows an easily verified optimization that yields consistent results in all of these situations.


Slide-43 Ad page, no explanation needed.

The PDF can be downloaded and viewed online:

  • Reliable distributed systems – an intuitive explanation of Paxos (PDF)
  • Reliable distributed systems – an intuitive explanation of Paxos (HTML)

Link to this post: blog.openacid.com/algo/paxos/