This article is the first in a series on consensus. In this consensus series, I’ll introduce you to common consensus algorithms. At the beginning of this article, I will cover the basics of consistency. Consistency is the most basic and important problem in distributed system, and consensus algorithm is used to solve the consistency of distributed system. After that, I’ll introduce a very classical Byzantine fault-tolerant algorithm called PBFT.

prepare

consistency

Consistency, also known as Agreement in the early days, refers to that in the distributed system domain, for multiple service nodes, given a series of operations, under the guarantee of the Agreement, they achieve “some degree” of coordination on the processing results. The process for reaching agreement on a distributed system shall satisfy:

  • Termination: Consistent results can be completed ina limited time;

  • Agreement: The result of the final decision of different nodes is the same;

  • Validity: The result of the decision must be the proposal put forward by a node.

consensus

Consensus algorithm is needed to ensure the consistency of the system in different degrees. Consensus algorithm is a process in which all honest nodes in a distributed system reach a consensus on a Proposal. So the problem that consensus needs to solve can be abstracted as follows:

  • How do you propose a consensus proposal?

  • How do you get multiple nodes to agree on the proposal?

CFT & BFT

For a distributed system, if the communication between nodes is very smooth and each node can respond instantly, then the consistency problem can be solved by simply broadcasting the vote and reply. But reality doesn’t work that way. Nodes often encounter problems such as network interruption, node failure, and even forged messages by illegal intrusion. Problems encountered by nodes can be classified as follows:

  • A node that fails but does not falsize information is called a “non-Byzantine error”;

  • A case in which a node falsifies a message in a malicious response is called a “Byzantine error,” and the node that falsifies the message is called a Byzantine node.

Correspondingly, consensus algorithms can also be divided into CFT and BFT:

  • CFT (Crash Fault Tolerance) : It only tolerates node faults, but does not tolerate nodes doing evil.

  • BFT (Byzantine Fault Tolerance) : Byzantine Fault Tolerance of nodes.

FLP impossible principle

Computer scientists have demonstrated that there is no deterministic consensus algorithm that can solve the consistency problem in a minimally asynchronous system where the network is reliable but allows node failure. This might seem to mean that it’s futile to design a consensus algorithm, but science tells you what’s impossible; Engineering tells you that, at some cost, you can make it work. That is to say, at what cost, consensus can be reached.

The two militaries problem

The White army was in the ditch, the Blue army and the Red army on either side. The White army is more powerful than either the Blue army or the Red Army, but the blue army and the Red Army can defeat the White army if they attack together. The Blues and Reds could not communicate remotely across the ditch and had to send signalmen across the ditch to negotiate the attack time. But signalmen can get lost or intercepted by the enemy and their messages can be tampered with.


According to the FLP impossibility principle, there is no general solution to this problem, but it must be solved in the field of communication. To keep costs under control, the TCP protocol’s three-way handshake is now used to (partially) solve this problem.

The question of Byzantine generals

A group of Byzantine generals led by one army each laid siege to a city. To simplify matters, the operational strategies of each army are limited to either attack or withdrawal. Because a partial attack and a partial withdrawal could be disastrous, the generals had to vote to agree on a strategy that all troops would attack together or all troops would leave together. Since the generals were located in different directions of the city, they could only communicate with each other by Courier. During the voting process, each general sent a message by Courier to all the other generals that he would vote to attack or to withdraw, so that each general could decide on his strategy based on his vote and the message sent by all the other generals.



The above story is mapped to a distributed system where the general becomes the consensus node and the traitor the Byzantine node. Unlike the two-army problem, the Byzantine General problem does not consider whether the signalman will be intercepted or not be able to convey information. In other words, consistency and fault tolerance are discussed on the assumption that there is no problem with the channel.

With this knowledge you can better understand the consensus algorithm. Let’s move on to today’s topic, PBFT.

PBFT consensus algorithm

PBFT is short for Practical Byzantine Fault Tolerance, which means Practical Byzantine Fault Tolerance. This algorithm reduces the complexity of Byzantine fault-tolerant algorithm from exponential to polynomial level for the first time, and it can guarantee both Safety and Liveness when the number of malicious nodes is not more than 1/3. We assume that the total number of all nodes is R and the number of Byzantine nodes is F, and the security proof is given as follows:

Imagine f mutineers and K loyalists. The mutineer is deliberately bad, either giving the wrong result or not responding. At some point F of the defectors are not responding, then k of the loyalists taking the majority will get the correct result. When f defectors all make a malicious proposal, and f of K loyalists are offline, the remaining K-F loyalists are unable to tell whether they are mixed with the defectors or not. They still need to make sure that the correct result is obtained by taking the majority. Therefore, k-f > f, that is, K > 2f or R-f > 2f. So the overall size R of the system is going to be greater than 3f. So in order to ensure that the number of nodes in the Byzantine system is at least four, at most one Byzantine node is allowed.

PBFT is an algorithm based on replica replication of state machines, where each copy of the state machine holds the state of the service and also implements the operation of the service. All copies in the PBFT operate on View rotation, which is initiated when the master node goes offline to keep the algorithm running.

As can be seen from the above flow chart, the PBFT algorithm process is as follows:

  1. The client sends a request to the primary node.

  2. The primary node broadcasts requests to other replicas;

  3. All copies execute the request and return the result to the client;

  4. The client waits for 2F +1 different copies to return the same result as the final result.

The implication here is that all nodes are deterministic and that all nodes start from the same state.

First the client sends a request to the master node, and then the classic three-phase protocol kicks in.

Preparatory phase

First, the master node sends a prepared message to all replica nodes. Here the bread contains the message number, view number, and summary of the message. It is important to note that the prepared message does not contain the request, which has two advantages: one is to compress the message size to improve the propagation efficiency, and the other is to decouple the request ordering from the request transmission.

The replica node then verifies that the message is signed correctly, that the view number is consistent, and that the message number meets the waterline requirements. Here comes the waterline mechanism in PBFT. For the sequence number N of the message, it is required to be between the waterline H and H. The significance of waterline is to prevent a failed node from using a large number to consume the number space.

Preparation stage

If the replica node accepts the prepare message, it enters the prepare phase. In the prepare phase, each node sends a prepare message with its ID to the other nodes and also receives prepare messages from the other nodes. The same validity check is performed on the receipt of prepared messages. If authenticated, the prepared message is written to its own message log. A node enters the ready state only when it has at least 2f+1 verified messages.

The commit phase

During the commit phase, each node broadcasts a COMMIT message to tell the other nodes that it is in the ready state. If at least 2F +1 commit messages are collected, the proposal is approved.

After the three-phase protocol, each replica node sends a reply to the client, and the replica node discards requests with a timestamp smaller than the reply timestamp to ensure that the request is executed only once.

conclusion

The PBFT consensus algorithm was formally proposed by Castro and Liskov in 1999, and the consensus objects considered in its design are some relatively small messages. To sort the messages, PBFT uses the checkpoint mechanism to move the waterline to process the voting process of multiple messages concurrently. At the same time, PBFT triggers view switching and primary node replacement only when a node does evil or goes offline. This is because the view switching process also requires consensus, which is time-consuming, and the PBFT cannot accept frequent view changes. In addition, to accommodate the water level mechanism, the message for view switching is much larger than the normal message. Because of the above reasons, the DESIGN of PBFT is very complex and inefficient.

However, with the development of technology, the birth of blockchain technology easily solves some problems in the DESIGN of PBFT. In a blockchain, where every message (block) follows each other, the waterline mechanism for concurrent processing is useless, so the waterline mechanism and the checkpoint mechanism that serves it make no sense. Without waterlines and checkpoint, the only thing that gets in the way of view switching is the consensus process of view switching, which is simplified by the fact that blockchain is a consensus ledger. If the nodes switch through the data on the chain to reach a consensus, then the original need to go through the process of online consensus is omitted.

Therefore, a large number of blockchain projects have used improved PBFT as a consensus algorithm, and PBFT, as a representative of Byzantine fault tolerance, is also showing new vitality in the process of continuous optimization.


Ref:

1. Block chain technology guidelines: https://legacy.gitbook.com/book/yeasy/blockchain_guide/details

2.Castro M, Liskov B. Practical Byzantine fault tolerance[C]//OSDI. 1999, 99: 173-186.