Raft principle will bring you back to animation

Hello everyone, I am Lou Zai! The previous selection is summarized here 👉🏻👉🏻 bank: original selection, for your convenience.

I wrote an article “ETCD for one month, from Raft Principle to Practice”, there are a lot of dry articles, but no one reprinted it, my colleagues said the article was too long to be convenient to read. In this article, I only choose Raft protocol, the essence of which is more readable!

The Paxos algorithm, which most of you may have heard of, has a monopoly on consistency algorithms. Prior to Raft, Paxos was synonymous with consistency protocols. But for most people, the Paxos algorithm is too hard to understand and difficult to implement. So Stanford professors Diego Ongaro and John Ousterhout decided to design a consistency algorithm that was easier to understand and came up with Raft algorithm!

Raft is a much simpler and easier to understand distributed algorithm that addresses the problem of consistency in distributed environments. Compared to traditional Paxos algorithms, Raft breaks down a large number of computational problems into simple and relatively independent sub-problems with the same performance as Multi-PaxOS. In this section, we will use a GIF to retrieve the internal principles of Raft.

Raft foundation

Noun explanation

The Raft protocol contains three types of roles:

  • The Leader is elected by popular vote. Only one Leader can be elected at a time.
  • Candidate: When there is no leader, certain people can become candidates and then compete for the position of leader;
  • Follower: This is well understood, so I won’t explain it.

And then there are a couple of important concepts in the electoral process:

  • A Leader Election is an Election in which a Leader is chosen from among the candidates.
  • Term: it is actually a single increasing serial number, with each Term leading to a new election;
  • Election Timeout: an Election will be held again if the crowd does not receive a heartbeat from the leader.

Role transformation

This picture shows the roles of leaders, candidates and the masses. Let me briefly summarize:

  • Crowd -> candidate: when the election begins, or when “the election runs out of time”
  • Candidate -> Candidate: When the “election expires”, or a new “term” begins
  • Candidate -> Leader: When obtaining a majority of votes
  • Candidate -> Crowd: other nodes become leader, or start a new “term”
  • Leader -> Crowd: automatically relinquishes the leader position if your tenure ID is smaller than other nodes
  • Note: Each case will be explained in detail later.

The election

Scenario 1: Leadership election

For the sake of further explanation, I have drawn a sketch of the “election timer” which is essentially the “timeout time” for each node.Candidate: each node has its own “timeout time”, which is random and ranges from 150 ms to 300ms. Therefore, the probability of the same random time is relatively small. Node B is the first node to timeout and becomes the candidate.

Election of leader: Candidate B begins to vote, and people A and C return to vote. When candidate B wins the majority of votes, candidate B becomes the leader.

Heartbeat detection: In order to pledge their leadership status, leader B needs to initiate heartbeat to the masses at all times. When masses A and C receive the heartbeat of leader B, the “timeout time” of masses A and C will reset to 0, and then count again and again.

It needs to be explained here that the heartbeat broadcast cycle of the Leader must be shorter than the timeout time of the “election timer”, otherwise the masses will frequently become candidates, which will lead to frequent elections and the change of the Leader.

Case 2: Leader dies

When leader B dies, the “election timer” of masses A and C will always run. When masses A times out first, they will become the candidate. Then the following process is the same as the “leader election” process, that is, notification vote -> receive vote -> become leader -> heartbeat detection.

Scenario 3: Multiple candidates appear

When there are multiple candidates A and D, the two candidates will vote at the same time. If the number of votes is different, the node that gets most votes first will become the leader. If the number of votes is equal, a new vote will be held.

When C becomes the new candidate and the Term is 5, a new round of voting is initiated. After other nodes vote, they will update their Term value and finally choose the new leader as C node.

Log copy

Replication state machine

The basic idea of a replication state machine is a distributed state machine. The system consists of multiple replication units. Each replication unit is a state machine, and its state is saved in operation logs. As shown in the figure below, the consistency module on the server is responsible for receiving external commands and then appending them to its own operation logs, and it communicates with the consistency module on other servers to ensure that the operation logs on each server eventually contain the same instructions in the same order. Once the instructions have been copied correctly, each server’s state machine processes them in the order of the operation logs and then returns the output to the client.

Data Synchronization Process

The data synchronization process, which borrows from the idea of a replication state machine, is “commit” and then “apply”. When the Client initiates a data update request, the request will first go to the leader node C, which will update the log data, and then notify the masses node to update the log. When the masses node updates the log successfully, it will return a success notification to the leader C, thus completing the “submit” operation. When leader C receives the notification, he will update the local data and inform the masses to update the local data. At the same time, he will return a success notification to the Client, thus completing the “application” operation. If the Client has a new data update operation, the above process will be repeated.

Log Replication Mechanism

Each Log entry typically contains three attributes: the integer Index Log Index, the Term number, and the directive Commond. The “integer index” for each entry is its slot in the log file, the “tenure number” corresponds to the number in each box in the figure to detect log inconsistencies on different servers, and the instructions are the external commands executed by the state machine, the numbers with arrows in the figure.

Does the leader decide when it is safe to apply log entries to the state machine and commit them? An entry created by a leader is said to be committable once it has been copied to more than half of the nodes. For example, entry 9 in the figure is replicated on four of the seven nodes, so entry 9 can be submitted; But entry 10 is replicated on only three of the nodes, so entry 10 is not committable.

Generally, the logs of the Leader and followers are saved the same. If the Leader node does not copy all previous entries in the log file to other nodes before a fault occurs, log inconsistency may occur. In Raft algorithm, the Leader forces the followers to save the same logs as his own. Therefore, logs that conflict with the Leader on the followers will be overwritten by the Leader’s logs. In order to achieve the above logic, it is necessary to know where the Follower logs are inconsistent with the Leader’s logs. Then how does the Leader accurately find the slot where the Follower logs are inconsistent?

The Leader maintains a Nextlndex for each Follower, which represents the index of the next log entry that the Leader will send to the Follower. When a Leader wins an election, it assumes that the logs on each Follower are consistent with its own. Therefore, nexTLndex is initialized to its latest log entry index +1. In the figure above, since the Leader’s latest log entry index is 10, the initial value of NexTLndex is 11. When the Leader sends AppendEntries to the Follower RPC, it carries binary group information (Item_id, nextindex-1), where item_id is the term of the log entry in slot Nextindex-1. After receiving the AppendEntries RPC message, the Follower performs a consistency check. That is, the Follower searches for the presence of such log entries in its own log file. If the log does not exist, the Leader returns AppendEntries with failed RPC and decrement nextIndex. Retry until successful. The following logic is relatively simple: the followers reserve all the logs before nextIndex, delete all the logs after nextIndex, and then synchronize all the logs after nextIndex from the Leader.

The above is only about the method, the following example, to deepen the understanding, or the picture above example. AppendEntries RPC(6,9)(6,8) (5,7) (5,6) (4,5) (4,5) (4,4) is not found until the Leader’s nextlndex is 11. Delete all logs after nextlNdex =4, and append all logs after nextlNdex =4 on the Leader.

Fissure situation

When the network problems lead to split brain, a double Leader, each network can be understood as an independent network, because the original Leader alone in one area, so the data submitted to him can’t be copied to most of the nodes, so never submit data, this can carry out now in the fourth picture (no submission SET 3).

After the network is restored, the old Leader automatically demots to Follower if it finds that the Term of the new Leader in the cluster is longer than its own. The old Leader synchronizes data from the new Leader to achieve data consistency in the cluster. For details about how to synchronize data, see Logging Principles.

It is better to have no books than to have no books. Because of my limited ability, it is hard to avoid omissions and mistakes. If you find bugs or have better suggestions, you are welcome to criticize and correct.

Previous selections:

  • How to treat the programmer 35 career crisis?
  • Java full set of learning materials (14W words), took half a year to sort out
  • I spent three months writing the GO core manual for you
  • Message queuing: From model selection to principle, this article takes you through everything
  • Micro service gateway selection, please take my knee!
  • How to choose the five registries? Interpret to you from principle!
  • Liver ETCD for a month, from Raft principle to practice
  • Much more! Much more! Much more! Triple strike!!