An Overview of the Raft Consensus Algorithm

A consensus algorithm allows a collection of machines to work together as a cluster that keeps functioning even when some of its members fail. Because of this, consensus algorithms play a key role in building reliable large-scale software systems. For a long time, Paxos dominated the discussion of consensus algorithms: most consensus implementations are based on or influenced by Paxos, and Paxos has become the primary tool used to teach students about consensus.

Unfortunately, Paxos is notoriously difficult to understand. Raft separates the key elements of consensus, such as leader election, log replication, and safety, to make the protocol easier to understand, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Let’s get into the world of Raft!

Basic Concepts of Raft

  • First, a Raft cluster must have a leader, and every action we initiate as clients against the cluster is handled through that leader. So the first part of Raft’s core algorithm is Leader election: the cluster cannot work without a leader, so elect one first, and then worry about the rest.
  • Second, what work does the leader do? It receives operation requests from clients and replicates each operation, as a log entry, to the other nodes. Once it has ensured that a majority of nodes have replicated the operation, it can safely respond to the client. This part of the job is called Log replication in the Raft core algorithm.
  • However, because the leader’s responsibility is so great, nodes must be careful about which node they elect, and only nodes that meet certain conditions can become the leader. In addition, the leader must be cautious when processing operation logs: to preserve cluster consistency, it must never overwrite or delete log entries that a previous leader has already committed. This “careful handling” consists entirely of restrictions on leader election and log commitment, and is called Safety in the Raft core algorithm.

The Raft core algorithm is made up of these three sub-problems: Leader election, Log replication, and Safety. Together, these three parts implement the consensus and fault tolerance mechanisms of the Raft core.

Leader Election

Raft election

A node in a Raft cluster is always in one of three roles: Leader, Candidate, or Follower. Each role is responsible for different tasks. Under normal operation, the nodes in a cluster hold only two of these roles: Leader and Follower.

  1. Leader: handles all client interactions and log replication; there is usually only one Leader at a time.

  2. Follower: responds to the Leader’s log-synchronization requests and the Candidate’s vote requests, and forwards (redirects) transactions requested by clients to the Leader.

  3. Candidate: when the cluster has just started or the Leader is down, a Follower becomes a Candidate and initiates an election. After winning the election (receiving votes from more than half of the nodes), the Candidate becomes the new Leader.

Each node is assigned a countdown timer (the election timeout), set to a random value between 150 ms and 300 ms. The node whose timer expires first gets to solicit votes first. For example, when node Joe’s countdown ends, Joe votes for himself and then sends RequestVote RPCs to the other nodes in the cluster. When a majority of the cluster (N/2 + 1 nodes) has voted for Joe, Joe becomes the leader. Joe then sends heartbeats to all the other nodes to assert his authority and tell them, in effect, “you will all listen to me.”
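The timer and vote-counting rules above can be sketched in a few lines of Python. This is a minimal illustration, not a real Raft implementation; the helper names are invented for the example.

```python
import random

# Raft's recommended randomized election timeout range (milliseconds).
ELECTION_TIMEOUT_MS = (150, 300)

def random_election_timeout():
    """Each node draws a fresh random timeout so ties are unlikely."""
    return random.uniform(*ELECTION_TIMEOUT_MS)

def has_majority(votes_received, cluster_size):
    """A candidate wins with more than half the votes: N/2 + 1."""
    return votes_received >= cluster_size // 2 + 1

# A candidate votes for itself first, so in a 3-node cluster it
# needs only one more vote to reach the quorum of 2.
assert has_majority(2, 3)
assert not has_majority(1, 3)
assert 150 <= random_election_timeout() <= 300
```

Note that the quorum formula is the reason Raft clusters are usually deployed with an odd number of nodes: a 4-node cluster needs 3 votes, tolerating no more failures than a 3-node cluster.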

Let’s look at an actual run in a very simplified setting: a minimal Raft cluster has at least three nodes (A, B, C). Assume A’s countdown ends first, so A sends vote requests to B and C.

In the first case, both B and C vote for A, and A becomes the Leader.

In the second case, B votes for A and C votes for itself. A still becomes the Leader, because it has obtained a majority of the votes.

In the third case, A, B, and C all vote for themselves, and no Leader is elected.

The third case is a split vote: each node votes for itself and no node wins a majority. Each node then waits for a new random election timeout and votes again, until one node gains a majority. In theory, if the vote were tied every time, the election would go on forever; in practice the randomized timeouts make repeated ties vanishingly unlikely.
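A toy simulation shows why randomized timeouts break split votes: after a tied round, each node redraws its timeout, and the node whose timer fires first becomes the sole candidate for the next round. This is purely illustrative; real Raft nodes run their timers concurrently rather than comparing them centrally.

```python
import random

def next_candidate(nodes):
    """After a split vote, every node draws a new random election
    timeout; the node whose timer expires first starts the next
    election alone, so a repeated tie is extremely unlikely."""
    timeouts = {node: random.uniform(150, 300) for node in nodes}
    return min(timeouts, key=timeouts.get)

winner = next_candidate(["A", "B", "C"])
assert winner in ("A", "B", "C")
```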

After the election, the Leader periodically sends heartbeat messages to all Follower nodes. If a Follower does not receive a heartbeat from the Leader for some period of time, it assumes the Leader may have died and initiates a new leader election.

Term

Each new election starts a new term, and each term is identified by a strictly increasing integer.

The term number is incremented each time a candidate starts a leader election, and if the candidate wins, it serves as leader for that term. However, not every term has a leader: sometimes an election times out without electing one, in which case candidates increment the term number and start a new election.

The term acts as a logical clock that lets nodes detect stale state. Each node stores its current term and includes the term number in every communication.
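The logical-clock rule can be stated in one function: whenever a node sees a higher term in any RPC, it adopts that term and steps down to follower. A minimal sketch, with an invented function name:

```python
def observe_term(current_term, is_leader, rpc_term):
    """Raft's term rule: a node that sees a higher term in any RPC
    adopts it and reverts to follower; lower terms are ignored."""
    if rpc_term > current_term:
        return rpc_term, False       # newer term seen: update, step down
    return current_term, is_leader   # otherwise state is unchanged

assert observe_term(3, True, 5) == (5, False)  # stale leader steps down
assert observe_term(5, True, 4) == (5, True)   # old RPC changes nothing
```

This single rule is what guarantees at most one leader per term: a deposed leader cannot keep acting on an old term once it hears a newer one.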

Nodes communicate with each other through RPC. There are two main types of RPC requests:

  • RequestVote RPCs: used by candidates to solicit votes during elections.
  • AppendEntries RPCs: used by the leader to replicate log entries to other nodes; sent empty, they serve as heartbeats.
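The fields carried by these two RPCs can be sketched as plain data structures. The field names follow the Raft paper's conventions, but the Python classes here are illustrative only:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RequestVote:
    term: int            # candidate's current term
    candidate_id: str    # candidate requesting the vote
    last_log_index: int  # index of candidate's last log entry
    last_log_term: int   # term of candidate's last log entry

@dataclass
class AppendEntries:
    term: int            # leader's current term
    leader_id: str       # so followers can redirect clients
    prev_log_index: int  # index of entry immediately preceding the new ones
    prev_log_term: int   # term of that entry (drives the consistency check)
    entries: List = field(default_factory=list)  # empty for a heartbeat
    leader_commit: int = 0                       # leader's commit index

heartbeat = AppendEntries(term=2, leader_id="A",
                          prev_log_index=0, prev_log_term=0)
assert heartbeat.entries == []  # an empty AppendEntries doubles as a heartbeat
```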

Log Replication

Log Replication Overview

Consensus algorithms are usually based on the Replicated State Machine model, in which all nodes start from the same initial state and, after applying the same sequence of log operations, end up in a consistent state. In other words, as long as the logs of all nodes in the cluster are identical, the state machines obtained by applying them will be identical.

Raft is responsible for ensuring log consistency across all nodes in the cluster.

In addition, Raft gives the Leader node stronger leadership than many other protocols. This makes it easy to see how Raft keeps logs consistent: all log entries must flow through the Leader, which replicates them to the other nodes.

This process is called Log replication.

Raft log replication mechanism parsing

Replication process

  • Each request from the client contains an instruction to be executed by the replicated state machine.
  • The leader appends the instruction to its log as a new entry, then issues RPCs in parallel to the other servers to replicate the entry.
  • Once the entry has been safely replicated, the leader applies it to its state machine and returns the result to the client.
  • If followers crash, run slowly, or drop packets, the leader retries indefinitely until every follower has eventually stored all log entries.
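The steps above can be sketched as a single replication round. This is a simplification under stated assumptions: followers are modeled as (log, reachable) pairs, and the retry loop for unreachable followers is left out.

```python
def replicate(leader_log, followers, entry):
    """One simplified replication round. `followers` is a list of
    (log, reachable) pairs. The leader appends locally, pushes the
    entry to each reachable follower, and the entry commits once a
    majority of the cluster (leader included) has stored it. Real
    Raft would keep retrying unreachable followers forever."""
    leader_log.append(entry)
    acks = 1  # the leader's own copy counts toward the quorum
    for log, reachable in followers:
        if reachable:
            log.append(entry)
            acks += 1
    cluster_size = 1 + len(followers)
    return acks >= cluster_size // 2 + 1

# One of two followers is down: 2 of 3 copies still commits the entry.
assert replicate([], [([], True), ([], False)], "set x=1")
# Both followers down: only the leader holds it, so it cannot commit.
assert not replicate([], [([], False), ([], False)], "set x=1")
```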

The data structure of the log

  • Term: the term number when the entry was created (used to detect inconsistencies between node logs; entries with the same term were sent by the same leader during its term)
  • Command: the instruction for the state machine to execute (the real content)
  • Index: an integer indicating the position of the entry in the log
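The three fields above map directly onto a small record type. A minimal sketch (the class name and example commands are invented):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    term: int      # term in which the leader created the entry
    index: int     # position of the entry in the log (1-based in the paper)
    command: str   # the instruction for the state machine to execute

log = [
    LogEntry(term=1, index=1, command="set x=1"),
    LogEntry(term=1, index=2, command="set y=9"),
    LogEntry(term=2, index=3, command="set x=4"),  # new leader, new term
]
assert log[2].term == 2 and log[2].index == 3
```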

When sending an AppendEntries RPC, the leader includes the index and term number of the log entry immediately preceding the new ones. If the follower cannot find an entry in its own log with that same index and term, it rejects the new entries. This consistency check works like an induction step: the empty log trivially satisfies the Log Matching Property, and the check preserves the property as the log grows. Therefore, whenever an AppendEntries RPC returns success, the leader knows that the follower’s log is identical to its own, from the first entry through the latest.

Raft makes the following guarantee about logs: if two log entries in different logs have the same index and the same term, then the two logs are identical in all entries up through that index.
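The follower's side of this check fits in one function. A minimal sketch, modeling the log as a 1-indexed list of (term, command) tuples; the function name is invented:

```python
def accepts(follower_log, prev_log_index, prev_log_term):
    """AppendEntries consistency check: the follower accepts new
    entries only if its log contains an entry at prev_log_index
    whose term equals prev_log_term. Logs are 1-indexed lists of
    (term, command) tuples."""
    if prev_log_index == 0:
        return True                  # appending at the very start always matches
    if prev_log_index > len(follower_log):
        return False                 # follower is missing that entry entirely
    return follower_log[prev_log_index - 1][0] == prev_log_term

log = [(1, "set x=1"), (1, "set y=2"), (2, "set x=3")]
assert accepts(log, 3, 2)        # entry 3 exists with term 2: accept
assert not accepts(log, 3, 1)    # term mismatch at index 3: reject
assert not accepts(log, 5, 2)    # follower has no entry 5: reject
```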

Based on this guarantee, when a follower’s log conflicts with the leader’s, the leader looks for the latest entry on which the two logs agree, walking backward from the follower’s last entry until a match is found. All of the follower’s entries after that point are then deleted and overwritten with the leader’s. This restores consistency between the leader’s and follower’s logs.

During normal operation, the leader’s and followers’ logs stay consistent, so the AppendEntries consistency check never fails. However, a leader crash can leave the logs in an inconsistent state (the old leader might not have finished replicating all the entries in its log), and a series of leader and follower crashes can make the inconsistency worse. A follower may be missing entries that the new leader has, may have extra entries that the new leader does not have, or both, and the missing or extra entries can span multiple terms.

The following figure shows the Leader and Follower log conflicts:

How does Raft deal with this situation?

The leader maintains, for each follower, an index called nextIndex: the index of the next log entry it needs to send to that follower.

When a new leader comes to power, it initializes every follower’s nextIndex to the index just after its own last log entry. If a follower’s log does not match the leader’s, the consistency check will fail on the next AppendEntries RPC.

When this happens, the leader decrements that follower’s nextIndex and retries until it reaches a point where the two logs match. Such a point must eventually exist: in the extreme case, nextIndex reaches 1, meaning the follower shares no entries with the leader, and the leader must synchronize the log starting from the very first entry.

Once a match is found, the follower deletes all conflicting entries after that point, and from then on its log agrees with the leader’s.
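The nextIndex backoff can be sketched as a loop that walks backward from the leader's last entry until the consistency check would succeed. This is an illustrative, centralized version of what actually happens over a series of failed AppendEntries RPCs; logs are 1-indexed lists of (term, command) tuples.

```python
def find_next_index(leader_log, follower_log):
    """Decrement nextIndex from just past the leader's last entry
    until leader and follower agree at nextIndex - 1. Entries from
    nextIndex onward must then be resent to the follower."""
    next_index = len(leader_log) + 1
    while next_index > 1:
        prev = next_index - 1
        if prev <= len(follower_log) and follower_log[prev - 1] == leader_log[prev - 1]:
            break  # logs match through `prev`: stop backing off
        next_index -= 1
    return next_index

leader   = [(1, "a"), (1, "b"), (2, "c")]
follower = [(1, "a"), (1, "x")]  # diverged at index 2
assert find_next_index(leader, follower) == 2  # resend from index 2
```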

The complete process of log replication
