1. Introduction

  • Like Paxos, Raft is about how to reach consensus on a series of values in a distributed system and how to keep the log consistent from node to node. All write requests (proposals) are issued by the leader.
  • The Raft algorithm belongs to the Multi-Paxos family: it takes the ideas of Multi-Paxos and adds some simplifications and restrictions, for example that log entries must be appended continuously, and that a node can only be in one of three states: leader, follower, or candidate.
  • Raft has three main sub-problems:
    • Leader Election
    • Log Replication
    • Safety (safety and data recovery)

2. Roles (Node status)

Each node can be in one of three states:

  • Leader: handles write requests, manages log replication, and continuously sends heartbeat messages to the other nodes.
  • Follower: receives and processes messages from the leader and candidates (election messages, data write messages, etc.). When the leader's heartbeat times out, i.e. the follower's countdown ends without hearing from the leader, the follower becomes a candidate and initiates an election.
  • Candidate: the transitional state in which a follower tries to become the leader by asking the other nodes to vote for it; it becomes the leader if it receives more than half of the votes. (A minimal sketch of these three states follows the list.)
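
As a rough illustration (not tied to any particular Raft implementation; the names are mine), the three roles can be modeled as a simple state type:

```go
package main

import "fmt"

// State is the role a Raft node is currently playing.
type State int

const (
	Follower  State = iota // passively replicates the leader's log
	Candidate              // asking peers for votes after an election timeout
	Leader                 // handles writes and sends heartbeats
)

func (s State) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}

func main() {
	// Every node starts out as a follower.
	current := Follower
	fmt.Println("initial state:", current)
}
```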


3. About logs

  • For background on what the log is, see the previous article on state machine replication.

The following shows the local log storage of nodes A and B at a certain point in time:

  • Log ID: uniquely identifies an operation log entry on a node; it is continuous and monotonically increasing. It is used for data recovery when logs become inconsistent across the distributed system, and it ensures that each node that applies the same log sequence through the same state machine eventually reaches the same state.
    • For example, node A has eight operation logs while node B has only four, so the data on the two nodes is inconsistent. Suppose node A is now the leader. After communicating with B through heartbeats, it finds that B has only stored logs locally up to the fourth entry. Each subsequent heartbeat request then copies one operation log to B, and after 4 heartbeats node B's log is consistent with the leader's. This process is in fact a data recovery process.
  • Term number: this is somewhat similar to the proposal ID in the Paxos algorithm. Instead of being incremented for every proposal (where each consensus request counts as a proposal), it is incremented only when an election is started, not when a write request is made; in Raft this is called the term number. In the log it records under which leader's term a proposal (write request) was made. Functionally it plays the same role as Paxos's proposal ID: in Paxos a node will not accept a proposal whose ID is smaller than one it has seen, and in Raft a node will not accept requests whose term number is smaller than its own locally stored term number (whether election requests or log replication requests).
  • Command: similar to the proposal value in the Paxos algorithm; it describes the actual operation instruction from the client. (A sketch of a complete log entry follows this list.)
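
As a rough sketch (the field names are mine, not from a specific implementation), a single log entry bundles exactly these three pieces of information:

```go
package main

import "fmt"

// LogEntry is one operation record in a node's local log.
type LogEntry struct {
	Index   uint64 // log ID: continuous, monotonically increasing
	Term    uint64 // term number of the leader that proposed the entry
	Command string // the client's actual operation instruction
}

func main() {
	log := []LogEntry{
		{Index: 1, Term: 1, Command: "SET x=1"},
		{Index: 2, Term: 1, Command: "SET x=3"},
	}
	for _, e := range log {
		fmt.Printf("id=%d term=%d cmd=%q\n", e.Index, e.Term, e.Command)
	}
}
```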

4. Core Consensus Process

4.1 The Leader Election Process

  • In the figure below, every node starts in the follower state, and every follower or candidate has a clock countdown mechanism (the shrinking outer ring). The countdown at each node is randomly generated, usually within a few hundred milliseconds. If a node is a follower when its countdown ends, it becomes a candidate and initiates an election request; if it is already a candidate when its countdown ends, it starts a new round of vote requests. This is also called an election timeout.
  • In the figure, S4 and S5 finish their countdowns one after another and change to the candidate state. Node S4 first votes for itself, then increments its locally stored term number to 2 and sends vote requests (i.e. proposal requests whose proposal ID is S4's term number, 2) to the other four nodes. The three followers S1, S2 and S3 receive candidate S4's election request first. Voting is first come, first served: since the request's term number 2 is not less than their locally stored term number 1, they accept the request, vote for candidate S4, and update their locally stored term number to the request's term, 2. After these three followers have voted, candidate S5's vote request also arrives, but because their votes have already been cast, all three followers reject S5's election request. In the end candidate S4 has four votes while candidate S5 has only one. Having won a majority, S4 promotes itself to the leader state and starts sending heartbeat packets to the three followers and to candidate S5. From now on the followers reset their countdowns whenever they receive the leader's heartbeat. Candidate S5, on receiving the leader's heartbeat, not only resets its countdown but also demotes itself back to the follower state. (A sketch of the vote-handling rule follows.)
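
A minimal sketch of how a node might decide on a vote request, assuming a simplified node structure (the types and names are illustrative, not from a real library):

```go
package main

import "fmt"

// VoteRequest is a candidate's ask for a vote in a given term.
type VoteRequest struct {
	Term        uint64 // the candidate's term number (its "proposal ID")
	CandidateID string
}

// Node keeps the minimal per-node election state discussed above.
type Node struct {
	CurrentTerm uint64
	VotedFor    string // empty means no vote cast yet in CurrentTerm
}

// HandleVote grants the vote only if the request's term is not smaller than
// the local term and no vote has been cast in that term (first come, first served).
func (n *Node) HandleVote(req VoteRequest) bool {
	if req.Term < n.CurrentTerm {
		return false // reject requests with a stale term
	}
	if req.Term > n.CurrentTerm {
		n.CurrentTerm = req.Term // adopt the newer term
		n.VotedFor = ""
	}
	if n.VotedFor == "" || n.VotedFor == req.CandidateID {
		n.VotedFor = req.CandidateID
		return true
	}
	return false // vote already given to another candidate this term
}

func main() {
	follower := &Node{CurrentTerm: 1}
	fmt.Println(follower.HandleVote(VoteRequest{Term: 2, CandidateID: "S4"})) // true: S4 arrived first
	fmt.Println(follower.HandleVote(VoteRequest{Term: 2, CandidateID: "S5"})) // false: vote already cast
}
```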

4.1.1 Election Timeout Cases

  • As shown in the figure above, node S5 keeps its clock countdown running while it is a candidate. If it fails to win the election and receives no heartbeat from a new leader (for example because the new leader fails, or because no candidate got more than half of the votes), its countdown will eventually expire again.
  • For example, in the figure below the new leader S4 dies before it can send a heartbeat packet. Once its countdown ends, candidate S5 initiates a new round of vote requests and increments its locally stored term number to 3. (A sketch of such a randomized countdown follows.)
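
A rough sketch of the randomized election countdown (the timing range is an illustrative assumption): each node picks its timeout from a window of a few hundred milliseconds and resets it whenever a heartbeat arrives.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout returns a countdown somewhere in [150ms, 300ms),
// re-randomized each time so that nodes rarely time out at the same instant.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

func main() {
	timer := time.NewTimer(randomElectionTimeout())
	heartbeat := make(chan struct{}) // a value arrives here whenever the leader's heartbeat does

	select {
	case <-heartbeat:
		// The leader is alive: reset the countdown instead of starting an election.
		timer.Reset(randomElectionTimeout())
		fmt.Println("heartbeat received, countdown reset")
	case <-timer.C:
		// Countdown ended with no heartbeat: become a candidate and request votes.
		fmt.Println("election timeout: becoming candidate and requesting votes")
	}
}
```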

4.2 Data Writing Process (Log Replication)

  • Suppose a client now writes a piece of data (for example the instruction SET x=3). Leader S5 receives it and appends the operation log locally, but does not commit it yet. It then sends a log replication request to the other four followers. On receiving the request, each follower finds that the request's term number is not less than its own locally stored term number (3), so it accepts the log replication request, appends the operation log locally without committing it, and responds to the leader. Once leader S5 has received success responses from more than half of the nodes, it formally commits the operation log locally and responds to the client that the write succeeded. When a follower later receives a heartbeat or log replication message from the leader and finds that the leader has committed a log entry that it has not yet committed, it formally commits that entry as well. (A sketch of the commit rule follows.)
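
A minimal sketch of the leader-side commit rule, assuming a 5-node cluster (the counting below is an illustration, not a full log replication implementation):

```go
package main

import "fmt"

// committed reports whether an entry may be committed: the leader plus the
// followers that acknowledged the append must form more than half the cluster.
func committed(clusterSize, acks int) bool {
	return acks > clusterSize/2
}

func main() {
	clusterSize := 5
	acks := 1 // the leader has already appended the entry to its own log

	// Followers acknowledge the log replication request one by one.
	for follower := 1; follower <= 4; follower++ {
		acks++
		if committed(clusterSize, acks) {
			fmt.Printf("entry committed after %d acks (majority of %d)\n", acks, clusterSize)
			break
		}
	}
}
```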

5. Abnormal Scenarios

5.1 Leader Failure

  • After the old leader fails, S4 is the first to finish its countdown and become a candidate; it launches a round of voting, wins more than half of the votes, and becomes the new leader. When the old leader later recovers, it automatically demotes itself to a follower. On receiving a heartbeat from the new leader S4, it sees that the heartbeat's term number 3 is greater than its local term number 2, so it changes its local term number to 3 and from then on refuses to process any request with a term number smaller than 3 (whether an election request or a data write request).

5.2 Data Inconsistency (Safety and Data Recovery)

  • For example, follower S1 goes down, and then leader S5 handles two writes. Later, follower S4 times out waiting for S5's heartbeat, becomes a candidate, and initiates an election; after getting more than half of the votes, node S4 becomes the new leader and then writes data twice more. If follower S1 now comes back up, its local log no longer matches the other nodes': it is missing four log entries.

There is a subtle change in the GIF above: when leader S4 is elected, the log read pointer it keeps for every follower is initialized to the same position as the leader's own (all pointing to log ID 5). In fact the leader maintains a log read pointer for each follower, because it needs to know whether each follower's current log is consistent with its own, is missing entries, or contains wrong data. Why initialize every pointer to the leader's own log position? Because even if a follower has more log entries than the leader, the leader will synchronize that follower's log data starting from its read pointer, so the follower ends up consistent with the leader.

Follower S1's specific data recovery process:

  • After follower S1 recovers and receives the first heartbeat from leader S4, it finds that the leader's term number 4 is greater than its own stored term number 3, so it changes its stored term number to 4 (from then on it only accepts requests whose term number is at least 4). If follower S1 then finds that its log is inconsistent with leader S4's, it rejects the new log replication and returns a failure message to the leader.
  • Leader S4's heartbeat message carries its current latest local log position, for example <latest log ID 6, latest log term number 4>. Normally, if the follower's own latest local log entry is also <latest log ID 6, latest log term number 4>, it simply responds to the heartbeat. However, if the entry just before follower S1's log pointer is empty, or is inconsistent with the leader's entry at the same position, the follower returns a failure message to the leader (Figure 2.2 below).
  • After receiving the failure message, the leader decrements the log read pointer it keeps for node S1, for example to log ID 4. Node S1 still finds that the entry just before the log pointer is empty or inconsistent with the leader's corresponding entry, and again returns a failure message. The leader receives it and decrements S1's log read pointer once more, now to 3 (Figure 2.3 below). Once the entry just before the log pointer matches the leader's entry at that position, S1 returns a success message to leader S4.
  • The leader then overwrites and restores follower S1's log data starting from log pointer 3, replicating one log entry per heartbeat; after 4 heartbeats S1 is fully recovered. (A sketch of this back-off-and-repair loop follows.)
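
A simplified sketch of the leader-side back-off loop described above; the structures, the sample log contents, and the per-heartbeat pacing are assumptions for illustration rather than a faithful implementation:

```go
package main

import "fmt"

// LogEntry reuses the fields from Section 3: log ID (index), term, command.
type LogEntry struct {
	Index, Term uint64
	Command     string
}

// matchesAt reports whether the follower's log agrees with the leader's entry
// just before the given read pointer (log IDs are 1-based, slices 0-based).
func matchesAt(leader, follower []LogEntry, ptr uint64) bool {
	if ptr <= 1 {
		return true // nothing before the first entry to compare
	}
	i := int(ptr) - 2
	return i < len(follower) && follower[i].Term == leader[i].Term
}

func main() {
	leader := []LogEntry{{1, 1, "SET x=1"}, {2, 1, "SET x=3"}, {3, 3, "SET y=1"}, {4, 3, "SET y=2"}, {5, 4, "SET z=9"}, {6, 4, "SET z=5"}}
	follower := []LogEntry{{1, 1, "SET x=1"}, {2, 1, "SET x=3"}} // S1 is missing four entries

	ptr := uint64(5) // per the figure, the leader's read pointer for S1 starts at log ID 5
	for !matchesAt(leader, follower, ptr) {
		ptr-- // the follower reported failure: back off one entry and try again
	}
	// From here the leader overwrites/appends one entry per heartbeat.
	for _, e := range leader[ptr-1:] {
		follower = append(follower[:e.Index-1], e)
		fmt.Printf("replicated log id %d to follower S1\n", e.Index)
	}
}
```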

Figure 2.2

Figure 2.3

5.3 When a Node with a Less Complete Log Becomes a Candidate

  • Another Raft principle is log completeness: only a candidate with a sufficiently complete log can become the leader, because the leader is responsible for data recovery. An election request from a candidate whose log is less complete than the voter's will be rejected.
  • For example, node S1 is missing two local log entries because it was down earlier. It happens to be the first to initiate a round of vote requests, but each follower, finding that S1's log completeness is lower than its own, rejects S1's election request while still updating its locally stored term number to 3. (A sketch of this "at least as up-to-date" check follows.)
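
A minimal sketch of the comparison a voter might apply before granting its vote (the field names and example numbers are illustrative): the candidate's last log term is compared first, and on a tie the last log ID decides.

```go
package main

import "fmt"

// logIsUpToDate reports whether the candidate's log is at least as complete
// as the voter's: a higher last term wins; on equal terms, the longer log wins.
func logIsUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex uint64) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}

func main() {
	// Suppose S1's log ends at <id 4, term 2> while the voter's ends at <id 6, term 2>.
	fmt.Println(logIsUpToDate(2, 4, 2, 6)) // false: reject S1's election request
	// A candidate whose log also ends at <id 6, term 2> would be accepted.
	fmt.Println(logIsUpToDate(2, 6, 2, 6)) // true
}
```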

5.4 Network Partition Faults

  • As shown in the figure, nodes A and B cannot communicate with nodes C, D and E because of a network partition. C, D and E no longer receive heartbeats from leader B, so once their countdowns end they start electing a new leader. For example, node C gets 3 votes, which is more than half, so it becomes a leader and increments its term number to 2. At this point there are two leaders in the Raft cluster, contrary to Raft's rule that there is only one leader at a time.

  • In this situation, a client first sends a write request (SET 3) to leader B. Leader B then sends a log replication request to the other nodes, but within the partition only node A receives it; A writes the entry to its local log without committing it. Leader B does not receive success responses from more than half of the nodes (it only has two, including itself), so it returns a write failure to the client.
  • Now another client sends a write request (SET 3) to leader C. More than half of the nodes respond to leader C's log replication request, so leader C first commits the log locally, then returns a write success to the client, and then notifies the other nodes to commit the log.
  • After the network partition heals, leader C receives messages from leader B but does not process them, because leader B's term number is smaller than its own. Conversely, when leader B receives a message from leader C, it sees that leader C's term number is larger than its own, so leader B automatically demotes itself to a follower, and leader C then restores follower B's log data following the data recovery principles of Section 5.2. (A sketch of this step-down rule follows.)
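
A tiny sketch of the step-down rule that resolves the two-leader situation (the types are illustrative and mirror the earlier sketches, not a complete message handler):

```go
package main

import "fmt"

// State mirrors the three roles from Section 2.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// maybeStepDown demotes a leader (or candidate) to follower whenever it sees
// a message carrying a higher term, and adopts that term as its own.
func maybeStepDown(state State, currentTerm, msgTerm uint64) (State, uint64) {
	if msgTerm > currentTerm {
		return Follower, msgTerm
	}
	return state, currentTerm
}

func main() {
	// Suppose old leader B is still at term 1 when a message arrives from new leader C at term 2.
	state, term := maybeStepDown(Leader, 1, 2)
	fmt.Println(state == Follower, term) // true 2: B demotes itself and adopts term 2
}
```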

5.5 Membership Change

  • Changing the number of nodes in a Raft cluster means the old and new configurations can have different total node counts, which calls each node's "more than half" majority judgment into question; a dual-leader problem can then arise, just as with a network partition. Raft solves this with single-server (single-node) changes.

Single-node change

  • For example, if a cluster has three nodes and needs to be expanded to five, nodes must be added one at a time, rather than two at once.
  • Because only one node is added or removed at a time, the old configuration and the new configuration are guaranteed to intersect: any "majority" of the old configuration and any "majority" of the new configuration overlap in at least one node, so the old and new configurations can never form two disjoint majorities at the same time. The point is to keep every node's notion of "the majority" consistent. (A small arithmetic check follows.)
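
A small arithmetic check of the overlap argument, assuming plain "more than half" quorums (the cluster sizes are illustrative):

```go
package main

import "fmt"

// quorum is the smallest "more than half" size for a cluster of n nodes.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	oldSize, newSize := 3, 4 // adding one node to a 3-node cluster

	oldQ, newQ := quorum(oldSize), quorum(newSize)
	fmt.Printf("old quorum=%d, new quorum=%d\n", oldQ, newQ) // 2 and 3

	// Any old-config quorum and any new-config quorum must share at least one
	// node, because together they need more nodes than the new cluster has.
	fmt.Println("quorums must overlap:", oldQ+newQ > newSize) // true
}
```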

Specific change flow: suppose the cluster has three nodes A, B and C, and a new node D needs to be added.

  • 1. The leader (node B) first synchronizes its data to the new node (node D).
  • 2. After the data is synchronized, the leader (node B) replicates the new configuration's member list [A, B, C, D] as a log entry to all nodes in the new configuration. After receiving responses from a majority, the leader (node B) commits the log locally, which ensures that all nodes agree on the cluster membership configuration [A, B, C, D].

6. Summary

1. Raft's strong leader model makes everything revolve around the leader, much like a single machine, so performance and throughput are limited by the leader.

2. Differences from the Multi-Paxos consensus algorithm

  • In Raft, not every node can be elected leader: only a node with a sufficiently complete log can become the leader. In addition, Raft's log must be continuous.
  • Through term numbers, leader heartbeat messages, randomized election timeouts, the first-come-first-served voting principle and the majority-vote principle, Raft guarantees that there is at most one leader per term, and it greatly reduces election failures.

The Raft algorithm cleverly uses randomized election timeouts to spread the timeouts out, so that in most cases only one server node starts an election at a time rather than several starting simultaneously, which reduces elections that fail because of split votes.

[Animation demo website]

  • An introduction to the animation
  • Interactive animation