【 Distributed 】Raft algorithm - Leader election

Raft from “Reliable, Replicated, Redundant, And Fault-Tolerant” is a consensus algorithm used to replace Paxos. Its goal is to ensure that the state of any node in the cluster is consistent, while ensuring that the entire algorithm process is easy to understand.

Note: all the pictures are by myself, if you want to use them, please use them. But reprint the article please indicate the source: nuggets – radiation engineers.

By the end of this article, you will:

The Leader election for Raft algorithm will be etched in your mind.

The relevant data

Interactive demo: Raft (thesecretlivesofdata.com)
In Search of an Understandable Consensus Algorithm (raft.github. IO)

Characteristics of Raft algorithm

Strong Leader: Raft introduces a Leader role and gives that role a very strong leadership and all the data is managed by that role.
Leader election: Raft uses a random timer to elect the Leader. Most simultaneous elections are avoided by simply adding a random timer to the heartbeat mechanism.
Membership adjustment: Raft uses a common and consistent approach to dealing with changing cluster members, in which most of the machines in the two different configurations of the cluster overlap during the adjustment process, which allows the cluster to continue to work when the members change.

Roles defined by Raft

Raft defines three roles: Leader, Follower, and Candidate. These three states are not static, and each node transitions between them.

Leader: When the node acts as the leader, all data received by the cluster is received through the node and it decides what to do with the data. He maintains a heartbeat connection with all other nodes to maintain his leadership status. Heartbeat connections are made by sending heartbeat packets in which commands and data are carried if they are to be issued to other nodes.
Followers: When a node is a follower, it is only responsible for executing the leader’s decisions or responding to heartbeat packets sent by the leader to show that it is still connected.
Candidate: When the leader is offline, the follower node will immediately turn into a candidate and start campaigning to become the new leader through canvassing.

The demo diagrams in this article are as consistent as possible with the styles in the interactive demo:

Node status

A node has two required attributes:

currentTerm: Indicates the current node tenure.
votedFor : ID of other nodes whose current term of a node follows. In the voting phase, the node votes for the candidate. If the campaign is over, the current leader.

Timeout timer

Raft defines timeout timers to control elections, namely election timeout, voting timeout, and campaign waiting timeout.

Election timeout

Leaders maintain their dominance by periodically sending heartbeat packets to followers. Each follower is given an election timeout, which is random (such as 150-300 ms), the random timer mentioned above. The follower resets the timeout each time a heartbeat packet is received. If the heartbeat packet is not received before the timeout, the follower determines that the leader is offline, and the follower is turned into a candidate to start campaigning.

The election timeouts allow each node a chance to become the leader, as if each node were a potential usurper. The timeouts are random, and the shorter the timeouts, the faster they become candidates and the more ambitious they become. As soon as those in power lose control (break the heartbeat), nodes whose ambitions cannot be contained will seize power (run for election).

Voting timeout

When the follower becomes a candidate, the countdown to the vote timeout starts and all other nodes are invited to vote for them. If a candidate wins more than half of the votes before the countdown ends, the candidate wins the election. If there are not enough votes by the time the voting timeout expires, the candidate declares defeat and waits a certain amount of time before running again.

When other nodes respond to the voting invitation, they will only reply whether to vote for the initiator. All candidates who initiate voting cannot know the votes of others, but can only count their own votes. So when a candidate loses, he doesn’t know whether the other candidates are successful or even that the other candidates exist.

Election waiting timeout period

When a candidate loses an election, he or she will wait for a certain period of time to run again. This period of time is the election waiting timeout.

At this point, other nodes can participate in the election, or they may already be campaigning. If a candidate succeeds in the election of other nodes while waiting for the voting timeout or waiting for the election timeout, the candidate will immediately turn into a follower to follow the new leader.

Timeout reset

The follower resets its own election timeout as soon as it receives the request, so the leader periodically sends heartbeat packets to control the follower. At the same time, the candidate’s invitation to vote will also reset the request timeout period of the follower, so that the follower who received the vote but has not yet participated in the election will not participate in the election.

RPC requests defined by Raft

Raft uses RPC to communicate and only defines two requests to complete all communication. These are the log appending request and the vote request.

In Raft protocol, we usually talk about logs, but the data types that can be transferred can be very different, but because of the nature of logging, logs are more complex and difficult to coordinate than normal data. This article directly replaces logging with data, because the Leader election process involves very little logging.

Data add-on request

When the leader receives data from the client, it sends a data appending request to copy the data to all followers, thus ensuring data uniformity across all nodes. When the transferred data is empty, the request becomes a heartbeat packet that holds the connection. The request is defined as:

/** ** @param term term of the sending node * @param leaderId leaderId, used to redirect back to the leader, Because sometimes the client sends the data directly to the follower * @param prevLogIndex Index of the last data (log) entry (non-leader election phase) * @Param prevLogTerm Term of the last data (log) entry (non-Leader election phase) * @param Entries Need to save data, if it is a heartbeat packet, Null * @param leaderCommit Leader's index of the highest data entry known to have been committed (non-leader election phase) */ Result appendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries[], leaderCommit) {}Copy the code

The return result structure is:

Result {term, term of the response node success request success or failure}Copy the code

Vote request

When a candidate initiates a vote, the voting request will be found on all other nodes and other nodes will be requested to vote. Each node that receives a vote request returns its own vote result.

/** ** Vote request ** @param term to send the candidate's term * @param candidateId to send the candidate's ID * @param lastLogIndex Index of the last data (log) entry (not the Leader election phase) * */ Result requestVote(term, candidateId, lastLogIndex, lastLogTerm) {}Copy the code

The return result structure is:

Result {term, term of the response node voteGranted Boolean, if true, support the candidate making the request}Copy the code

Term concept

Raft defines the concept of Term, and all nodes have the property currentTerm, which is the currentTerm of the node.

Term acts as a logical clock in Raft algorithm, term number is similar to the concept of dynasty number in our country, and the leader is like the king of the dynasty.

Electoral rules

The Leader election has the following rules:

There is only one leader or no leader in each term. Just like some dynasties, there is no unified emperor.

For a candidate to be promoted to leader, he must receive more than half of the votes (N/2 + 1), and the total number of votes is the number of nodes in the cluster.

If there is an even number of voting nodes in the cluster, there may be no leader in a round due to tie votes. In this case, the election will wait for the election timeout timer to expire and proceed to the next term.

(Not all elected leaders, as the paper’s graphic shows.)
When the leader goes offline, the cluster begins a new term. Just as the abdication of an emperor signals the change of dynasties.

Because of the election timeout timer and the heartbeat mechanism, if the node’s timer times out, it will see that the leader is offline, at which point the node will increment currentTerm, votedFor pointing to itself, and immediately start the race.
When any node receives a request for a term larger than its own, it needs to immediately follow the other node and update its own term.

Whether voting request or data addition request, as long as the term in the request is larger than the node’s own term, it indicates that the other party is more advanced than itself. At this point, the node unconditionally follows the other party, sets its votedFor as the ID of the other party, and updates its own term as term.
- If it is a vote request, it means that the other party finds out the leader has gone offline before them and canvassing before others. Based on this information, the candidate has a greater probability of winning from the node’s point of view, so follow it and vote for him.
- If it is an additional data request, it indicates that the other party is the election winner of the latest term, and indicates that the node itself does not participate in the election of this term. The node may be temporarily offline or the packet of voting request is lost. Just follow the leader.
When any node receives a data appending request whose term is equal to its own term, it needs to follow the other node immediately.

The peer party’s tenure is equal to the node’s own tenure, and it can send additional data requests (heartbeat packets), indicating that the peer party is the leader. As we know, only one leader can be elected in one round of election, so the sender must be the winner of this round of election. Therefore, if the votedFor of the node is not it, it means that the current node may be the follower of the elected candidate or the unsuccessful candidate himself. In this case, you need to immediately set your votedFor to the ID of the other party to follow him.
In a one-term election, each node can vote for only one candidate.

As we know from the above, when a node receives a voting request whose term is longer than its own term, it will update its term and follow the candidate. And in this election (TERM == currentTerm), any request to vote will be for the candidate they follow.

At the same time, we also know from the above that if the node is the candidate himself, then he will renew his term when he becomes the candidate. Therefore, similarly, in this round of election, he will only vote for himself.

VotedFor decides when the term is renewed, either by itself or by the first candidate to come to the polls. Change to follow the winning leader only if you or your candidate loses, otherwise it stays the same.
If you receive a request whose tenure is smaller than your own, discard it. Otherwise, you must reply.

If the term of the request is smaller than the node’s own term, whether it is a data additional request or a vote request, it indicates that the other party has not communicated with other nodes for a period of time, and should be the leader or candidate who is disconnected. The status of the peer is outdated and all requests are discarded.

The election process

Based on the above rules, in fact, the process of Leader election becomes very simple. In a word, the candidate will only vote for himself, and the followers will always vote for the first candidate who seeks him. Only the candidate who gets more than half of the votes can become the Leader, and everyone must follow the Leader who wins the election.

There is only one candidate

Node A finds that its leader is offline,currentTermPlus one, plus twovotedForPoint to yourself and start sending vote requests to other nodes.
The other nodes receive the vote request and find that the term on the ballot is larger than their own term, so they follow A and vote for A and reset the timeout timer.
Each time node A receives A vote reply, it accumulates the votes for its current term.
When the total number of votes exceeds half of the total number of nodes, node A becomes the leader and sends heartbeat packets to all nodes to declare victory.

Multiple candidates were elected simultaneously

Raft has already implemented random timeout timers to prevent most multi-candidate elections from happening at the same time, but the odds are small. At the same time, if the data delay between clusters is very large, it is easy to have the situation of simultaneous election of multiple candidates.

If section A, B and C choose Leader, there are multiple candidates:

Nodes A and B run for election at the same time. Or node A starts the election first, but node B times out and becomes the candidate before the vote request reaches node B.
Nodes A, B, and C start voting at the same time. Or when the voting request of each node reaches other nodes, the target node times out and becomes the candidate.

It’s the same in both cases:

Nodes A and B run A campaign at the same time, send A request for A vote,
A’s request goes to B, B’s request goes to A, and neither side votes for the other.
A’s request reaches C, C sends A request to A, and C sends A reply to A.
Before the reply reaches A, B’s request reaches C, who does not vote for B.
B receives all the replies, finds only 1/3 of the votes, loses the election, and runs again after some time.
When A receives C’s reply, he finds that the number of votes has exceeded half and declares that he has won the election.

SequenceDiagram Note left of A: A becomes A candidate, term = 3 Note right of B: B becomes A candidate, term = 3 A->>B: requestVote(3, A, 0, 0) B-->>A: false B->>A: requestVote(3, B, 0, 0) A-->>B: false A->>C: RequsetVote (3, A, 0, 0) Note right of C: C follows A, term = 3 B->>C: requestVote(3, B, 0, 0) C-->>B: false C-->>A: AppendEntries (3, A, 0, 0, null, 0) true Note left of A: A become leader A->>B: appendEntries(3, A, 0, 0, null, 0)

Or:

Nodes A, B, and C run for election at the same time.
After the election of all nodes, 1/3 of the votes are found, and the election timeout timer is entered. The election of the next term will be held after a random period of time.
In the next round, A runs out earlier than the others, the term is increased by one, and the first vote request is sent.
Requested by the votetermBoth are bigger than themselves, so B and C vote for A.
When A receives B’s or C’s vote, it will declare victory outright.

SequenceDiagram Note left of A: A becomes the candidate, term = 3 Note right of B: B becomes the candidate, term = 3 Note right of C: If C becomes A candidate, term = 3 A->>B: requestVote(3, A, 0, 0) B->>A: requestVote(3, B, 0, 0) C->>A: requestVote(3, C, 0, 0) B->>C: requestVote(3, B, 0, 0) C->>B: requestVote(3, C, 0, 0) A-->>B: false B-->>A: Note left of A: false A-->>C: false C-->>A: false A-->> A: false A-->>C: false C-->> C: false B-->> B: false B-->> B: Term = 4 A->>C: requsetVote(4, A, 0, 0) Note right of C: C follows A, term = 4 C- >>A: true A->>B: RequestVote (4, A, 0, 0) Note right of B: B follows A, term = 4 B-->>A: true Note left of A: A becomes leader A->>B: appendEntries(3, A, 0, 0, null, 0) A->>C: appendEntries(3, A, 0, 0, null, 0)

Status transition of a node

Based on the above, the state transition process of nodes can be deduced:

Candidate:

When the timer expires, it becomes a candidate for election.

Candidate:

When voting times out, candidates wait a certain amount of time to run again.
When the vote is won, the candidate is promoted to leader.
When receiving data requests for the same term or for higher terms, candidates turn into followers to follow each other.

Leader:

When asked for more tenure, leaders become followers.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

【 Distributed 】Raft algorithm – Leader election

The relevant data

Characteristics of Raft algorithm

Roles defined by Raft

Node status

Timeout timer

Election timeout

Voting timeout

Election waiting timeout period

Timeout reset

RPC requests defined by Raft

Data add-on request

Vote request

Term concept

Electoral rules

The election process

There is only one candidate

Multiple candidates were elected simultaneously

Status transition of a node

【 Distributed 】Raft algorithm – Leader election

The relevant data

Characteristics of Raft algorithm

Roles defined by Raft

Node status

Timeout timer

Election timeout

Voting timeout

Election waiting timeout period

Timeout reset

RPC requests defined by Raft

Data add-on request

Vote request

Term concept

Electoral rules

The election process

There is only one candidate

Multiple candidates were elected simultaneously

Status transition of a node

Related Posts

Five ways to write a for loop, which is the fastest?

The production environment encountered a Go problem, and the whole team was confused…

Spring source (1) container initialization process