The messy “consistency” problem

Consensus != Consistency

Many blog posts about Paxos or Raft use the terms "distributed consistency protocol" or "distributed consistency algorithm". In Chinese, "reaching consistency" and "reaching consensus" read almost the same, but it is important to note that what these algorithms solve is the consensus problem, so "consensus" is the clearer term. The C in the CAP theorem and the C in ACID are the real "consistency" problems; those two Cs are not the same problem either, but that is beyond the scope of this article.

For the sake of rigor and clarity, this article uses the term "consensus" consistently when discussing the consensus problem.

Note: In earlier literature, consensus was also called agreement.

Consensus problem

So what exactly is the consensus problem? Take an everyday example: Xiao Ming and Xiao Wang are going out together. Xiao Ming asks: "Xiao Wang, shall we get something to drink?" Xiao Wang: "How about coffee?" Xiao Ming: "OK, coffee it is."

In this scenario, Xiao Wang suggests having coffee and Xiao Ming agrees. The two reach a consensus on the question of what to drink and act on that result. That is consensus in everyday life.

In distributed systems, consensus means that multiple nodes agree on a single value. The consensus problem can be described more formally: a distributed system consists of n processes {0, 1, 2, …, n-1}. Each process has an initial value, and processes communicate with one another. Design an algorithm such that, despite failures, the processes can still decide on an irrevocable final value, and every execution satisfies the following three properties:

  • Termination: every correct process eventually decides on a value.
  • Agreement: all correct processes decide on the same value.
  • Integrity (also known as Validity): if all correct processes propose the same value, then every correct process decides that value.

Integrity can vary in strength. For example, a weaker form of integrity only requires that the decided value equal a value proposed by some correct process, not necessarily the value proposed by all processes. Integrity also implies that the value ultimately decided must have been proposed by some node.
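
To make these properties concrete, here is a minimal Go sketch (my own illustration, not something from the literature) that checks Agreement and Validity over the outcome of one finished execution; the process IDs and map layout are assumptions made for the example.

```go
package main

import "fmt"

// checkAgreement returns true if all correct processes decided the same value.
func checkAgreement(decided map[int]int) bool {
	var first int
	seen := false
	for _, v := range decided {
		if !seen {
			first, seen = v, true
		} else if v != first {
			return false // two correct processes decided different values
		}
	}
	return true
}

// checkValidity returns true if, whenever all correct processes proposed the
// same value, that value is the one they decided.
func checkValidity(proposed, decided map[int]int) bool {
	var common int
	first := true
	same := true
	for _, v := range proposed {
		if first {
			common, first = v, false
		} else if v != common {
			same = false
		}
	}
	if !same {
		return true // the property imposes no constraint in this case
	}
	for _, v := range decided {
		if v != common {
			return false
		}
	}
	return true
}

func main() {
	// Termination is modelled simply by every process appearing in "decided";
	// the maps below are illustrative values, not real protocol output.
	proposed := map[int]int{0: 7, 1: 7, 2: 7}
	decided := map[int]int{0: 7, 1: 7, 2: 7}
	fmt.Println(checkAgreement(decided), checkValidity(proposed, decided)) // true true
}
```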

Why consensus?

Let us first look at why distributed systems need consensus.

In the previous article, we have seen several major problems with distributed systems:

  • Network problems
  • The clock problem
  • Node failure

The earliest paper relevant to consensus is Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" [2], although it does not explicitly introduce the concept of consensus or agreement. The paper points out that in a distributed system you cannot tell whether event A happened before event B unless there is some dependency between the two events. This also led to the idea of distributed state machines.

In distributed systems, consensus is commonly applied to replicated state machines, where each node holds a copy of the same state machine. All copies start from the same initial state, and every state transition, i.e., the next state, is decided jointly by the participating processes. The logs on every node therefore contain the same entries in the same order. The consensus problem here is getting every state machine to agree on which state to move to next.
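
As a rough illustration, the following Go sketch (my own; types such as Command and StateMachine are made up for the example) applies an already-agreed log to two replicas and shows that they end in the same state.

```go
package main

import "fmt"

// Command is a simple "set key = value" log entry.
type Command struct {
	Key   string
	Value string
}

// StateMachine is one replica: a key-value state plus the agreed log.
type StateMachine struct {
	state map[string]string
	log   []Command // entries in the order the cluster agreed on
}

// Apply appends an agreed entry and executes it. Because every replica starts
// from the same initial state and applies the same entries in the same order,
// all replicas end up in the same state.
func (sm *StateMachine) Apply(c Command) {
	sm.log = append(sm.log, c)
	sm.state[c.Key] = c.Value
}

func main() {
	a := &StateMachine{state: map[string]string{}}
	b := &StateMachine{state: map[string]string{}}
	agreed := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	for _, c := range agreed {
		a.Apply(c)
		b.Apply(c)
	}
	fmt.Println(a.state, b.state) // identical: map[x:3 y:2] map[x:3 y:2]
}
```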

As a result, these nodes behave like a single, highly reliable state machine. The Raft paper [3] notes that replicated state machines address the three problems above:

  • They ensure safety (no incorrect result is returned) under all non-Byzantine conditions, including network delay, partitions, packet loss, duplication, and reordering.
  • They do not depend on timing for correctness.
  • They provide high availability: as long as a majority of the nodes are up and can communicate with each other and with clients, the cluster is fully available. A five-node cluster can therefore tolerate two node failures (see the short sketch after this list). Nodes that fail by stopping can later recover their state from persistent storage and rejoin the cluster.
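
The majority rule behind that last point is simple arithmetic. A tiny Go sketch (my own) of the tolerated crash-failure count f = (N - 1) / 2:

```go
// Majority quorums: a cluster of N nodes stays available as long as a
// majority is up, so it tolerates f = (N - 1) / 2 crash failures.
package main

import "fmt"

func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("N=%d nodes tolerate %d crash failures\n", n, tolerated(n))
	}
	// N=3 nodes tolerate 1 crash failures
	// N=5 nodes tolerate 2 crash failures
	// N=7 nodes tolerate 3 crash failures
}
```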

Moreover, consensus solves the following classic problems in distributed systems:

  • Mutual exclusion: which process may enter the critical section to access a shared resource, for example via a distributed lock?
  • Leader election: in a single-leader replicated database, all nodes need to agree on which node is the leader. If some nodes cannot communicate with the others because of a network fault, two leaders may emerge, both accepting writes; the data then diverges, causing inconsistency or data loss.
  • Atomic commit: in a database where a transaction spans multiple nodes or partitions, the transaction may fail on some nodes and succeed on others. To preserve atomicity, all nodes must agree on the outcome: either all commit or all abort/roll back (a sketch of this decision rule follows the list).
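
As a sketch of the atomic-commit decision rule from the last item (my own illustration, not a full two-phase commit implementation):

```go
// The transaction commits only if every participant votes to commit;
// a single "abort" vote, or a missing vote, aborts the whole transaction.
package main

import "fmt"

type Vote int

const (
	VoteAbort Vote = iota
	VoteCommit
)

func decide(votes []Vote, participants int) string {
	if len(votes) < participants {
		return "abort" // some participant never answered
	}
	for _, v := range votes {
		if v != VoteCommit {
			return "abort"
		}
	}
	return "commit"
}

func main() {
	fmt.Println(decide([]Vote{VoteCommit, VoteCommit, VoteCommit}, 3)) // commit
	fmt.Println(decide([]Vote{VoteCommit, VoteAbort, VoteCommit}, 3))  // abort
}
```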

In short, with the help of consensus, a distributed system can behave as if it were a single node, which is why the consensus problem is the most fundamental problem in distributed systems.

The system model

Before considering how to reach consensus, we need to consider which system models a distributed system may run under. The main dimensions are the following:

Network model:

  • Synchronous: response times are bounded within a fixed, known range.
  • Asynchronous: response times can be unbounded.

Fault type:

  • Fail-stop failures: a node suddenly fails and stops responding to other nodes.
  • Byzantine failures: named after the "Byzantine generals problem", a faulty node may return arbitrary, contradictory, or meaningless responses, or even "lie", for example because it has been compromised.

Message model:

  • Oral messages: Messages can be tampered with when they are relayed.
  • Signed messages: Messages cannot be forged after they have been sent; tampering will be detected.

Since they are the most common cases, we discuss consensus in synchronous and asynchronous systems separately. It is possible to reach consensus in a synchronous system (discussed below), but in a real distributed system the synchronous assumption is rarely realistic: we cannot tell whether a message was lost or merely delayed. Asynchrony is the more general case: an algorithm that works in an asynchronous system also works in a synchronous one, but the reverse is not true.

Let’s start with the asynchronous case.

Consensus in asynchronous systems

FLP Impossibility

As early as 1985, Fischer, Lynch and Paterson (FLP) proved in "Impossibility of Distributed Consensus with One Faulty Process" [4] that in an asynchronous system, no algorithm can guarantee consensus if even one process may fail.

Simply put, in an asynchronous system a process may respond at any time, so there is no way to tell whether a process is merely slow or has crashed, and Termination cannot be guaranteed. A detailed proof is beyond the scope of this article; see [5].

At this point, people realized that a distributed consensus algorithm requires two properties: safety and liveness. Safety means that all correct processes agree on the same value; liveness means that the system eventually decides on some value. Every consensus algorithm either sacrifices one of these properties or relaxes the assumption of full network asynchrony.

Although the FLP impossibility result sounds daunting, it also set the research direction for later work: stop trying to find an exact solution to the consensus problem in a fully asynchronous system. FLP only says that consensus cannot be guaranteed; it does not say that consensus can never be reached when a process fails. The impossibility comes from worst-case executions of any algorithm:

  • a completely asynchronous system;
  • messages delayed arbitrarily and delivered out of order;
  • under these conditions, no deterministic algorithm can be guaranteed to reach consensus.

For these worst-case scenarios, there are ways to work around FLP so that consensus can be reached in most practical situations. Distributed Systems: Concepts and Design lists three general approaches:

  1. Fault masking
  2. Failure detectors
  3. Randomized algorithms (non-determinism)

1. Fault masking

Since consensus cannot be guaranteed in an asynchronous system, we can try to make the asynchronous system behave more like a synchronous one; fault masking is the first such method. Fault masking assumes that a failed process will eventually recover and find a way to rejoin the distributed system. If no message is received from a process, we simply wait until the expected message arrives.

For example, two-phase commit transactions use persistent storage and can recover from crashes. If a process crashes, it is restarted (automatically or by an administrator). The process keeps enough information in persistent storage at key points in the program to continue working from that data after a crash and restart. In other words, a faulty process can behave like a correct one, although it may sometimes take a long time to complete recovery.
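
A rough Go sketch of this "persist at key points, then recover" idea (my own; the file name and record format are assumptions made for illustration):

```go
// Before acting on a decision, the process durably records it; after a crash
// and restart it replays that record and continues as if nothing happened.
package main

import (
	"fmt"
	"os"
)

const stateFile = "decision.log" // hypothetical checkpoint file

func recordDecision(d string) error {
	f, err := os.OpenFile(stateFile, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.WriteString(d); err != nil {
		return err
	}
	return f.Sync() // flush to stable storage before proceeding
}

func recoverDecision() (string, bool) {
	b, err := os.ReadFile(stateFile)
	if err != nil {
		return "", false // no checkpoint: start from scratch
	}
	return string(b), true
}

func main() {
	if d, ok := recoverDecision(); ok {
		fmt.Println("recovered earlier decision:", d)
		return
	}
	_ = recordDecision("commit")
	fmt.Println("decision persisted; a restart would now recover it")
}
```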

Fault masking is used in many system designs.

2. Failure detectors

A second way to make an asynchronous system behave like a synchronous one is to introduce a failure detector: a process may assume that another process which has not responded for longer than some threshold has failed. The most common implementation of a failure detector is a timeout.

However, this approach requires the failure detector to be accurate. If it is not, the system may give up on a process that is actually healthy; if the timeout is set very long, processes must wait (and can do no useful work) for a long time before concluding that something has gone wrong. An inaccurate detector can even split the system as if the network were partitioned.

The solution is to accept "imperfect" failure detectors. Chandra and Toueg, in "The Weakest Failure Detector for Solving Consensus" [6], analyze two properties a failure detector must have:

  • Completeness: every failed process is eventually suspected by every correct process.
  • Accuracy: no correct process is ever suspected.

They also showed that even with an unreliable failure detector, consensus can be solved as long as communication is reliable and fewer than N/2 processes crash. We do not need strong completeness and strong accuracy; an eventually weak failure detector is enough, with the following properties:

  • Eventually weak completeness: eventually, every faulty process is permanently suspected by some correct process.
  • Eventually weak accuracy: after some point in time, at least one correct process is never suspected by any correct process.

The paper also proves that in an asynchronous system we cannot implement an eventually weak failure detector using messages alone. In practice, however, a failure detector can adjust its timeout based on observed response times: if a process or its connection is slow, the timeout grows, and it becomes rare to falsely suspect a process. For practical purposes such an adaptive detector comes very close to the ideal eventually weak failure detector.
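
To make this concrete, here is a small Go sketch (my own, not the construction from the paper) of a heartbeat-based detector that suspects silent processes and doubles its timeout whenever a suspicion turns out to be wrong:

```go
package main

import (
	"fmt"
	"time"
)

// Detector suspects a process once its heartbeat is overdue, and grows the
// timeout when a suspicion proves wrong, so false suspicions become rare.
type Detector struct {
	timeout  time.Duration
	lastSeen map[string]time.Time
}

func NewDetector(initial time.Duration) *Detector {
	return &Detector{timeout: initial, lastSeen: map[string]time.Time{}}
}

// Heartbeat records a message from p. If p was already overdue, the timeout
// was too aggressive, so we back off (the adaptive part).
func (d *Detector) Heartbeat(p string, now time.Time) {
	if last, ok := d.lastSeen[p]; ok && now.Sub(last) > d.timeout {
		d.timeout *= 2 // we suspected p wrongly; be more patient next time
	}
	d.lastSeen[p] = now
}

// Suspect reports whether p has been silent longer than the current timeout.
func (d *Detector) Suspect(p string, now time.Time) bool {
	last, ok := d.lastSeen[p]
	return !ok || now.Sub(last) > d.timeout
}

func main() {
	d := NewDetector(100 * time.Millisecond)
	t0 := time.Now()
	d.Heartbeat("p1", t0)
	fmt.Println(d.Suspect("p1", t0.Add(50*time.Millisecond)))  // false
	fmt.Println(d.Suspect("p1", t0.Add(500*time.Millisecond))) // true: suspected
	d.Heartbeat("p1", t0.Add(500*time.Millisecond))            // late heartbeat arrives
	fmt.Println(d.timeout)                                     // 200ms: timeout doubled
}
```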

3. Randomized algorithms (non-determinism)

The third technique for sidestepping the impossibility result is to introduce a randomized algorithm, whose output depends not only on external inputs but also on random choices made during execution. Given the same input twice, the algorithm may produce two different outputs. Randomization makes it impossible for an "adversary" to effectively block consensus.

Unlike traditional models where a leader is elected and nodes cooperate, consensus in a blockchain is based on which node can solve a puzzle first. Each new block is appended by the node that solves the puzzle fastest in that round. The whole distributed network keeps extending the time-stamped chain, and the chain that embodies the most work, i.e., the greatest cumulative computational difficulty, is the agreed main chain.
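
As a toy illustration of proof of work (my own sketch; difficulty is expressed as a count of leading zero hex characters, which is a simplification of Bitcoin's real target encoding):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// mine searches for a nonce whose hash of (data + nonce) starts with the
// required number of zeros; whoever finds one first gets to append the block.
func mine(data string, difficulty int) (int, string) {
	prefix := strings.Repeat("0", difficulty)
	for nonce := 0; ; nonce++ {
		sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", data, nonce)))
		h := hex.EncodeToString(sum[:])
		if strings.HasPrefix(h, prefix) {
			return nonce, h
		}
	}
}

func main() {
	nonce, h := mine("block #1: alice->bob 5", 4)
	fmt.Println("nonce:", nonce)
	fmt.Println("hash: ", h)
}
```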

Bitcoin uses PoW (Proof of Work) to maintain consensus; some other cryptocurrencies (DASH, NEO) use PoS (Proof of Stake); and some (Ripple) use a consensus ledger.

However, none of these randomized algorithms strictly satisfies safety. Attackers can accumulate enormous computing power to control or influence a large number of otherwise normal nodes; for example, an attacker controlling more than 50% of the network's hash power can attack a PoW network. But because an attacker would have to spend a great deal of money to accumulate that much computing power, the risk is very low in practice: anyone with that much computing power would profit more by simply mining honestly.

Consensus in synchronous systems

Methods 1 and 2 above both try to make the system more "synchronous". The familiar Paxos does not fully solve the liveness problem in asynchronous systems because of livelock. Yet Paxos is widely used in distributed systems, because before consensus is reached the system is rarely that "asynchronous", and consensus is reached with high probability.

Dolev and Strong prove in "Authenticated Algorithms for Byzantine Agreement" [7] that in a synchronous system, if at most f of the N processes fail, consensus can be reached after f + 1 rounds of message passing.

Fischer and Lynch's "A Lower Bound for the Time to Assure Interactive Consistency" [8] proves that the same bound holds for Byzantine failures.

Based on this, most practical applications rely on the assumption of synchronous or partially synchronous systems.

The Byzantine generals problem in synchronous systems

In the paper "The Byzantine Generals Problem" [9], Leslie Lamport, Robert Shostak, and Marshall Pease discuss the case of 3 processes sending unsigned (oral) messages to one another, and prove that the Byzantine generals' conditions cannot be met if even one process fails. If signed messages are used, however, Byzantine consensus can be reached even when one of the three generals is faulty.

Pease generalized this to N processes: in a system with f Byzantine nodes, there must be at least 3f + 1 nodes in total to reach consensus, i.e., N >= 3f + 1.
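
A tiny Go sketch (mine) of what these bounds mean numerically, combining N >= 3f + 1 with the f + 1 rounds mentioned above for synchronous systems:

```go
package main

import "fmt"

func main() {
	// For f Byzantine faults the article states we need at least 3f + 1 nodes,
	// and, in a synchronous system, f + 1 rounds of message passing.
	for f := 1; f <= 3; f++ {
		fmt.Printf("f=%d: at least %d nodes, %d rounds\n", f, 3*f+1, f+1)
	}
}
```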

While a solution to the Byzantine generals problem does exist in synchronous systems, it is expensive, requiring O(N^(f+1)) message exchanges, and is only used where the security threat is severe (e.g., the aerospace industry).

PBFT algorithm

PBFT (Practical Byzantine Fault Tolerance) [10] is a practical Byzantine fault tolerance algorithm published by Miguel Castro and Barbara Liskov in 1999.

The details of the algorithm are not expanded here. PBFT also circumvents the FLP impossibility by relying on a synchrony assumption to guarantee liveness. PBFT likewise requires N >= 3f + 1 nodes, but only O(N^2) message exchanges, i.e., every node needs to communicate with every other node in the network.

Although PBFT has since been improved, it is still impractical in scenarios with a very large number of participants. Nevertheless, it was an important breakthrough in Byzantine fault tolerance, and many of its ideas have been borrowed by later consensus algorithms.

Conclusion

This article draws on a lot of the literature to give a basic overview of the research history of the consensus problem; I hope it is of some help.

For many of the papers mentioned here, I state the results directly and skip the mathematical proofs. First, this article is only a general survey of the consensus problem, meant to build a knowledge framework that later content can fill in. Second, most readers are not interested in the proofs, and I did not want this article to be as long as a book. Many important algorithms are also missing and will be added as needed.

My ability is limited, so I sincerely ask readers to point out the mistakes and shortcomings of this article by leaving a comment or sending me a private message.

References


  1. Mark McKeown. "A Brief History of Consensus, 2PC and Transaction Commit."
  2. Lamport, Leslie (July 1978). "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM 21 (7): 558-565.
  3. Ongaro, D.; Ousterhout, J. (2014). "In Search of an Understandable Consensus Algorithm." In Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC '14), USENIX Association, pp. 305-320.
  4. Fischer, M. J.; Lynch, N. A.; Paterson, M. S. (1985). "Impossibility of Distributed Consensus with One Faulty Process." Journal of the ACM 32 (2): 374-382.
  5. "A Brief Tour of FLP Impossibility."
  6. Chandra, T. D.; Hadzilacos, V.; Toueg, S. (1992). "The Weakest Failure Detector for Solving Consensus." In Proc. of the 11th Annual ACM Symposium on Principles of Distributed Computing.
  7. Dolev, D.; Strong, H. R. (1983). "Authenticated Algorithms for Byzantine Agreement." SIAM Journal on Computing 12 (4).
  8. Fischer, Michael J.; Lynch, Nancy A. (1982). "A Lower Bound for the Time to Assure Interactive Consistency." Information Processing Letters 14 (4): 183-186.
  9. Lamport, L.; Shostak, R.; Pease, M. (1982). "The Byzantine Generals Problem." ACM Transactions on Programming Languages and Systems.
  10. Castro, Miguel; Liskov, Barbara (1999). "Practical Byzantine Fault Tolerance." OSDI.