Introduction: Egalitarian Paxos (Epaxos), as a next-generation distributed consistency algorithm, has broad application prospects. But across the industry, there hasn’t been a single engineering implementation of Epaxos, or even a single article that makes Epaxos more accessible. Although the theory of EPAXOS algorithm is good, its practical application is not mature yet because it is difficult to understand and has many challenges in engineering implementation. This paper will introduce Epaxos algorithm in simple and easy language, step by step, let only Paxos or RAFT and other distributed consistency algorithm foundation students can easily understand Epaxos, truly obscure Epaxos, become approacheable, into thousands of people.

The author | auspicious light source (YanXiangGuang) | ali technology to the public

The introduction

Egalitarian Paxos (Egalitarian Paxos), as a next-generation distributed consistency algorithm, has broad application prospects. But across the industry, there hasn’t been a single engineering implementation of Epaxos, or even a single article that makes Epaxos more accessible. Although the theory of EPAXOS algorithm is good, its practical application is not mature yet because it is difficult to understand and has many challenges in engineering implementation.

This paper aims to introduce Epaxos algorithm in a simple way, step by step, so that students who only have Paxos or RAFT and other distributed consistency algorithm foundation can easily understand Epaxos, and truly make the obscure Epaxos accessible and accessible to thousands of people.

In the article Understanding the Distributed Consistency Algorithm Epaxos, Epaxos is derived from the problems of Paxos, and the basic concepts and intuitive understanding of Epaxos are introduced. It is believed that readers have had an overall impression of Epaxos.

This paper will introduce the EPaxos core protocol flow from the perspective of Paxos and EPaxos contrast. The last article left the question, I believe that after reading this article, can find the answer. Some background in distributed consistency algorithms such as Paxos or RAFT is required to read this article.

The basic idea of EPaxos

Epaxos is a Leaderless consistency algorithm, without the need to elect a Leader, any replica can initiate a proposal.

Leaderless can also be regarded as a Leader for each replica. From the perspective of multi-paxos or Raft, if multiple groups are used, each Leader is divided into different groups, and each replica serves as the Leader of a Group. As followers of all other groups at the same time, it seems that the effect of Leaderless can also be achieved.

Leaderless with multi-group implementation, each Group independently agrees on a series of instances, and each Group generates an Instance sequence. Instances generated by different groups are independent of each other and the sequence cannot be determined. Therefore, consistency across groups is a problem that cannot be solved at the consistency level and often needs to be solved by using distributed transactions at the upper level.

Epaxos solves this problem and implements true Leaderless. EPAXOS determines the relative order of Instance generated by different groups by tracking the dependency relationship between Instance, and then combines multiple Instance sequences generated by multiple groups into a global Instance sequence by sorting, thus realizing the consistency across groups. That is to achieve the real Leaderless.

Epaxos first runs the consensus protocol to make each copy agree on the value of Instance and the relative order of dependence on Instance. Then it runs the sorting algorithm to carry out global sorting on Instance based on the relative order of Instance that has been previously agreed upon, and finally obtains a consistent global Instance sequence.

The above is the introduction of the basic idea of Epaxos from the perspective of multi-paxos or Raft using multiple groups to achieve Leaderless. The actual Group is a concept outside the consistency algorithm, and the introduction of Group here is just for the convenience of introduction. There is no concept of Group in actual Epaxos. But like Paxos or RAFT, you can implement multiple groups on top of Eaxos.

Two EPaxos Instance

Epaxos’ Instance is different from Paxos’ Instance. Paxos’ Instance allocates sequence number in advance, but Epaxos’ Instance does not allocate sequence number in advance. Each copy can be submitted concurrently out of order, but the dependency between Instance can be tracked and sorted according to the dependency relationship. Here’s a summary of the differences:



How Epaxos’s Instance differs from Paxos’s

Paxos Instance from the global continuous incremental InstanceID logo, InstanceID is also the Instance of the serial number, globally unique, increasing continuously.

The Instance space of Epaxos is two-dimensional, and each replica has one of the rows. Therefore, it is marked by the two-dimensional R.i, where R is the replica identity and I is the continuously increasing integer within the replica. Each new Instance is added by one. Instance sequence owned by replica R is R.1, R.2, R.3,…… R.i,…

Instance of EPaxos has some additional properties compared to Paxos:

  • STATE identifies the current state of Instance and can be used as pre-accepted, accepted and committed. Since there are many states of Instance in Epaxos, a special state field is needed to identify Instance.
  • DEPS is the set of dependent Instances, which holds the identifier of all dependent Instances, that is, the Instance to be executed in the previous Instance. The DEPS holds the relative order between instances, which are then sorted based on the DEPS.
  • Seq is the sequence number of Instance, and its value is the maximum seq of all instances in deps plus one. It reflects the order of Instance proposals and is used in subsequent sorting.

The ePaxos Instance deps and seq attributes, like the value of Instance, need to be agreed upon among the replicas. Subsequent replicas will sort Instance independently based on deps and seq. Because the sorting algorithm of Epaxos is deterministic, each copy sorts Instance based on the same deps and seq, and finally obtains a consistent global Instance sequence.

The Instance is regarded as the vertex of the graph, and the deps is the edge of the vertex. After the vertex and edge of the graph are determined and agreed among the replicas, the replicas conduct deterministic ordering on the graph, and finally obtain the consistent Instance sequence.

Three EPaxos consensus protocols

Paxos takes two stages to commit a value. Epaxos takes one more stage than Paxos to determine the value of these properties because its Instance relies more on the set deps and the sequence number seq. A complete Epaxos has three phases, but not all of them are required. The following table compares Paxos with the protocol flow of Epaxos:



Comparison of Paxos and Eaxos protocol flows

Compared with Paxos, the Prepare stage is similar to the Accept stage, but the PreAccept stage is also the most critical stage of Epaxos. Since there are more PreAccept stages in Epaxos, Instance has more states, so special state attribute is introduced to identify the current state of Instance (pre-accepted, accepted and committed). The Prepare phase state was not introduced because the Prepare phase has no proposed value and can be distinguished from other states by the presence or absence of a proposed value locally. Typically, EPaxos runs only the PreAccept phase to Commit, and both the Prepare and Accept phases can be skipped.

Epaxos is similar to Paxos in that if Instance is preempted at any stage, it needs to fall back to Prepare stage and start again.

1 Prepare stage

The role of Prepare stage, EPaxos is similar to Paxos, both in order to gain the right to offer, and at the same time to learn the latest value that has been proposed before. In EPaxos, since each replica has its own Instance space and is proposed on its own Instance space, it is equivalent to the Leader of multi-paxos, so in general, it can skip the Prepare stage directly. Start straight from the next stage.

Phase 2 PreAccept

The PreAccept phase is specific to Eaxos and is used to determine the dependency set deps and sequence number seq for Instance, while trying to get the proposed value, deps, and seq to agree across replicas. If the PreAccept phase has been agreed on, it goes straight to the Commit phase (Fast Path), otherwise it needs to run the Accept phase and then to the Commit phase (Slow Path).

How does the PreAccept phase determine the dependency set deps and sequence number seq for Instance? The local seq is the maximum value of the seq of all instances in the local deps set, and the value of the seq of all instances in the local deps set is increased by one. The final dependent set deps takes the union of the local deps set of most replicas, and the final sequence number seq takes the maximum of their local seq.

The local deps and local seq of each replica may be different when each replica is concurrently proposing, so how can we reach an agreement during the PreAccept phase? If enough replicas (Fast Quorum) have the same local deps and local seq, an agreement has been reached. Otherwise, the final dependent set deps takes the union of the local deps of most (Slow Quorum) replicas, the final sequence number seq takes the maximum of their local seq, and then runs the Accept phase to reach agreement.

The Fast Quorum in the PreAccept phase always includes the proposer, for reasons discussed later. The value of Fast Quorum is not less than Slow Quorum. Assuming that the total number of replicas is N, F replicas can be tolerated to fail at the same time, and N = 2F + 1, then Fast Quorum = 2F, optimized EPaxos can be optimized to F + [(F + 1) / 2], Slow Quorum = F + 1. The derivation of the value of the Fast Quorum, which I won’t describe here and will discuss in more detail in a subsequent article, is Slow Quorum, which is the majority copy, just like the Accept phase of Paxos.

3 the Accept stage

In the Accept phase, EPaxos is similar to Paxos, but Paxos only synchronizes the proposed value to most replicas, while EPaxos needs to synchronize the proposed value, deps and seq to most replicas. Once a majority is formed, the decision is reached. If a decision has been reached at the PreAccept phase, you can skip the Accept phase and proceed directly to the Commit phase.

4 the Commit phase

The purpose of the Commit phase is that, similar to Paxos, Epaxos asynchronously sends the resolution to other replicas so that other replicas can learn the resolution. The difference is that Epaxos’ resolution includes not only the resolution value, but also the deps and seq.

Four Epaxos sorting algorithms

Unlike Paxos, the order of Epaxos’s instances is not determined after they have been committed, so Epaxos requires an additional sorting process to sort the instances that have been committed. When Instance is committed and all instances in its dependent collection of deps are also committed, a sorting process can begin.

The Instance of EPaxos is regarded as the vertex of the graph, and the dependency set of Instance deps is regarded as the edge of the vertex. When the value of Instance is agreed with the dependency set deps, the vertex and edge of the graph are agreed among the copies, so each copy will see the same dependency graph.

The process of Eaxos sorting Instance is similar to deterministic topological sorting of graphs. However, it is important to note that the dependencies between Epaxos Instances may form loops, that is, there may be loops in the diagram, so it is not strictly topological sorting.

In order to deal with cyclic dependence, Eaxos’ algorithm of sorting Instance needs to first look for the strongly connected components of the graph, and all the loops are contained in the strongly connected components. If a strongly connected component is regarded as a vertex of the graph as a whole, then all the strongly connected components constitute a directed acyclic graph (DAG). Then topological ordering is performed on the directed acyclic graph composed of all strongly connected components.

The process of the Eaxos sorting algorithm is shown in Figure 1, where the part circled by the background color is the strongly connected component:



Epaxos sorting algorithm

Tarjan algorithm is generally used to find the strongly connected components of graphs, which is a recursive algorithm. It is found that the recursive implementation is easy to burst the stack, which also brings some challenges to engineering applications.

Instance in different strongly connected components is ordered in deterministic topological order. Instance in the same strongly connected component is concurrently proposed and can be ordered in theory according to arbitrary deterministic rules. Eaxos calculates a seq sequence number for each Instance. The size of the seq reflects the order in which the Instance is proposed. Instances within the same strongly connected component are sorted according to the seq size. Actual SEQ may duplicate and does not guarantee globally unique increments, which is not taken into account in the EPaxos paper, and can actually be sorted using SEQ plus copy identification.

In fact, with the continuous running of new Instances, the old Instances may depend on the new Instances, and the new Instances may depend on the updated Instances. In this way, the dependency chain may be continuously extended, and the sorting process cannot be carried out without termination, thus forming a live lock. This is also a challenge for engineering applications of Epaxos.

Since the Instance sorting algorithm is deterministic, after each copy sorts Instance based on the consistent dependency graph, a consistent Instance sequence will be obtained.

Five EPaxos case

The following is A specific case to introduce the core protocol process of EPAXOS, as shown in the figure below. The system consists of R1, R2, R3, R4 and R5 copies. The horizontal direction represents the time, and the proposed process with values A, B and C is shown in the figure below.



Epaxos Consensus Protocol

The values of the attributes of each Instance in the case are shown in the following table:



Property of Instance in the Epaxos core protocol flow case

1 is suggested to be worth A

First, R1 initiates the proposal value A by running the PreAccept phase. It first gets its own local deps and local seq. At this point, it does not have any Instance locally, so the local deps is an empty set, and the local seq is the initial value 1.

R1 then broadcasts A PreAccept(A) message to R2 and R3, carrying the proposed value A, the local deps, and the local seq (not identified), at which point R1, R2, and R3 form Fast Quorum. Preaccept messages can be broadcast only to replicas in the Fast Quorum, an optimization called Thrifty in the Epaxos paper.

When R2 and R3 receive PreAccept(A) message, they get their own local deps and local seq respectively. Similar to R1, local deps is empty set, local seq is 1, and after persistence, they reply to R1.

R1 receives the same local deps and local seq as the copy in the Fast Quorum, the resolution is reached, the final deps is an empty set, seq is 1, and the Commit phase is run.

2 Suggested value B

Next, R5 runs the PreAccept phase and initiates the proposal value B. At this point, it also does not have any Instance locally, so the local deps is an empty set and the local seq is the initial value 1. After R5 is persisted locally, PreAccept(B) messages are broadcast to R3 and R4.

The local deps of R4 is empty set, and the local seq is 1. At this time, R3 already has A local Instance of A, so the local deps of R3 is {1.1}, that is {A} on the figure, and the local seq is 2, that is, the seq of A’s Instance plus 1.

The local deps and seq of the replicas in Fast Quorum are not the same, so the Accept phase needs to be run. The final deps takes the union set {1.1} of the local deps of most replicas, that is, the index {A} on the graph, and the final seq takes the maximum local seq of most replicas, which is 2. Through the Accept stage, the proposed value B, the final deps, and the final seq are reached to a majority. Finally, run the COMMIT phase Commit.

3 Proposed value C

Finally, R1 initiates the proposed value C by running the PreAccept phase. At this time, R1 already has Instance of the value A locally, so the local deps is {1.1}, that is, {A} as shown in the figure, and the local seq is 3. After R1 is persisted locally, PreAccept(C) messages are broadcast to R2 and R3.

R2 and R3 already have Instance of A and B locally, so R2 and R3 reply with local deps of 1.1, 5.1}, that is {A, B}, local seq is 3, that is, seq of B’s Instance plus 1.

The local deps and seq are the same for all copies of the Fast Quorum except for the propositions R1, so the decision has been reached, and the final deps is {1.1, 5.1}, that is, {A, B}, seq is 3, and the Commit phase is run.

4 the sorting

For Instance of the proposed values A, B, and C, the dependency relationship is drawn according to their dependency set DEPS as shown below (left) :



Instance of A, B, C (left), sorted (right)

A’s Instance deps is an empty set, so there is no edge out; B’s Instance deps is {A}, so there is an edge pointing to A; C’s Instance deps is {A, B}, so there are two exits pointing to A and B, respectively.

The dependency graph has no cyclic dependencies and is already a directed acyclic graph (DAG). Therefore, vertices A, B, and C are each A strongly connected component, and their order can be obtained after deterministic topological sorting: A < — B < — C, as shown in the figure (right).

Six EPaxos discussion

1 the Instance conflict

EPAXOS introduces the concept of Instance conflicts (which, like Parallel Raft, is not a concept of concurrency conflicts). If there is no conflict between the values of two instances (for example, accessing different keys), then the order in which they occur doesn’t matter and can even be processed in Parallel. Therefore, Epaxos only handles dependencies between conflicting logs.

The Epaxos Instance dependency set deps holds the Instance that needs to be executed before the Instance. After a conflict is introduced, DEPS stores the Instance that conflicts with the Instance.

Conflict is a concept related to specific applications. After the introduction of conflict, all Instances are no longer fully ordered, but partial ordered, which reduces dependence, reduces the probability of Slow Path, and improves efficiency.

2 Fast Quorum

We’ll leave the derivation of the Fast Quorum value to a later article, but here we’ll discuss why the Fast Quorum in the PreAccept phase always includes the proposer.

At each stage of Epaxos, the proposer always broadcasts messages to other replicas after the local persistence is successful. That is, the proposer is always on the Quorum, so the proposer can always count one vote when the Quorum is reached.

In the PreAccept phase, the proposer includes its local deps and seq in the PreAccept message as a basis for other replicas to calculate their local deps and seq, so that the proposer’s local deps and seq are always included in the final deps and seq. Thus the Fast Quorum of the PreAccept phase always contains the proposer.

EPaxos always persistent success after the broadcast to other local copy, so that we can reduce Fast Quorum, but also lead to local persistence and network messaging cannot be parallel, reduces the efficiency of some, but also makes the proposed person cannot tolerate the situation of the local disk damage, these are all EPaxos engineering application must face the problem.

Why is the value of Fast Quorum not less than Slow Quorum? There is no need to derive the value of Fast Quorum, but this is intuitively the case. In Paxos, one replica proposes a value, and all the replicas have only two outcomes, accept or reject the value. In Epaxos each copy may give a different deps and seq, so it takes more copies to give the same result to ensure that the correct result can be restored if one copy fails.

Seven EPaxos pseudocode

At this point, I believe that the reader can already understand the EPaxos core protocol process pseudocode. The EPaxos core protocol flow pseudocode is shown in the figure below. For simplicity, the part related to Proposal ID (Proposal ID, or Ballot Number) is omitted, which is the same part as Paxos.

The pseudocode treats the log as a Command, each Instance agrees on one Command, and each copy uses a two-dimensional array CMDS to hold the received commands.



Epaxos core protocol process pseudocode

Eight summary

By displaying and maintaining the dependencies between instances, EPaxos not only removes the dependency on the Leader, but also enables the Instance to be concurrently committed out of order, which enables better Pipelining optimization. At the same time, explicit maintenance of dependencies also makes out-of-order execution possible. Eaxos supports out-of-order acknowledgement, out-of-order commit, out-of-order execution, and theoretically has higher throughput. At the same time, you can also see some of the challenges of the Epaxos engineering application, these are the problems that the Epaxos engineering is going to solve.

This paper introduces the EPaxos core protocol process from the perspective of comparison between Paxos and EPaxos, but the content of EPaxos is not only these, especially how to ensure the sequentiality of log sequence under the scenario of Failover.

thinking

Finally, I will leave a few questions for you to think about. If you are interested, you can think about them first:

  1. Why does the SEQ of Instance repeat and under what circumstances?
  2. How do we derive the value of Fast Quorum?
  3. If an Instance’s consensus protocol flow is not complete and its proposer is down, what should the other partners do with the Instance?

This article is the original content of Aliyun, shall not be reproduced without permission.