This article is a study note on the Raft paper. Raft is a consensus algorithm for managing a replicated log; a consensus algorithm keeps a cluster working as a coherent group even when some of its nodes fail. The Raft algorithm has the following features:

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms;
  • Leader election: Raft uses randomized timers to elect a leader;
  • Membership changes: Raft uses a joint consensus approach in which the old and new configurations overlap during the transition, so configuration changes proceed consistently.

Replicated state machines

Consensus algorithms arise in the context of replicated state machines: a group of servers computes identical copies of the same state and can keep operating even when some servers are down. Replicated state machines are usually implemented with a replicated log, and keeping the replicated log consistent is the job of the consensus algorithm.

Consensus algorithms typically have the following properties:

  • They ensure safety under all non-Byzantine conditions;
  • They remain available as long as a majority of the servers are operational and can communicate with each other;
  • They do not depend on timing to ensure the consistency of the logs;
  • A command is considered complete as soon as a majority of the cluster has responded.

Designed for ease of understanding

Paxos was the first algorithm to solve the consensus problem, but it has two drawbacks:

  • Paxos is exceptionally difficult to understand;
  • Paxos does not provide a good foundation for building practical implementations.

Raft uses two main ideas to improve understandability:

  1. Divide the problem into several sub-problems to solve independently;
  2. Reduce the number of states to consider, making the system more coherent and eliminating nondeterminism where possible.

Raft consensus algorithm

The paper summarizes the complete Raft algorithm in a condensed reference figure.

The Raft algorithm can be divided into three sub-problems, discussed in turn:

  • Leader election: how a new leader is chosen when the current leader fails;
  • Log replication: the leader accepts log entries from clients and replicates them across the cluster;
  • Safety: Raft guarantees the following safety properties:
    • Election Safety: at most one leader can be elected in a given term;
    • Leader Append-Only: a leader never deletes or overwrites entries in its log, it only appends new entries;
    • Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index;
    • Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms;
    • State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

Raft basics

At any time each server is in one of three states: leader, follower, or candidate. The transitions between these states look like this:

Raft divides time into terms of arbitrary length. Each term begins with an election, and the candidate that wins the election serves as leader for that term. If the election fails (a split vote) or the leader crashes, a new term begins.

Leader election

The leader maintains its authority by periodically sending AppendEntries RPCs to the followers as heartbeats (with no log entries if there is nothing new to replicate). If a follower receives no messages for a period of time (the election timeout), it becomes a candidate and starts an election. A candidate that receives votes from a majority of the servers transitions to the leader state and begins sending AppendEntries RPCs to the other servers.

To avoid many candidates appearing at the same time and splitting the vote, each server chooses its election timeout randomly from a fixed range. If the cluster still has no leader after an election, each candidate waits for another randomized timeout before starting the next election.
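A minimal sketch in Go of how a randomized election timeout might be chosen; the 150–300 ms range matches the values suggested in the paper, while the function name and structure are purely illustrative, not the paper's implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a timeout uniformly at random from a fixed range so
// that servers rarely time out at the same moment and split the vote.
func electionTimeout() time.Duration {
	const minTimeout, maxTimeout = 150 * time.Millisecond, 300 * time.Millisecond
	return minTimeout + time.Duration(rand.Int63n(int64(maxTimeout-minTimeout)))
}

func main() {
	// Each server (and each new election round) draws a fresh random timeout.
	for i := 0; i < 3; i++ {
		fmt.Println("next election timeout:", electionTimeout())
	}
}
```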

Log replication

The leader accepts entries from clients, appends them to its own log, and then replicates them to the other servers through AppendEntries RPCs. Once the leader knows an entry has been replicated on a majority of the servers, it commits the entry. Raft guarantees that committed entries are durable and will eventually be executed by all available state machines. A log entry can be committed as soon as it is replicated on a majority of the servers:

Raft maintains the following two properties, which together constitute the Log Matching Property:

  • If two entries in different logs have the same index and term, then they store the same command;
  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

When a leader first comes to power, the logs on the other servers may be in any of the following states:

Some followers may be missing entries (a–b), some may contain extra entries that were never committed (c–d), and some may have both problems (e–f). Scenario (f) can occur if that server was the leader for terms 2 and 3 but failed to commit any entries in either term. In every case, entries in a follower’s log that conflict with the leader’s log are overwritten by the leader’s entries.

The leader maintains a nextIndex for each follower, the index of the next log entry it will send to that follower. When a leader first comes to power, it initializes every nextIndex to the index just after the last entry in its own log. If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check fails; the leader then decrements nextIndex and retries. This consistency check is what brings the logs of failed or lagging servers back into agreement with the leader.
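The following Go sketch illustrates the consistency check and the nextIndex backoff under assumed in-memory types; the Entry struct, the 1-based indexing helper, and the example logs are invented for illustration and are not the paper's reference implementation.

```go
package main

import "fmt"

// Entry is a minimal log entry: a term plus an opaque command.
type Entry struct {
	Term    int
	Command string
}

// consistencyCheck is the follower-side check: new entries are accepted only
// if the follower's log contains an entry at prevLogIndex whose term matches
// prevLogTerm (indexes are 1-based, as in the paper).
func consistencyCheck(log []Entry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending at the very start of the log always succeeds
	}
	if prevLogIndex > len(log) {
		return false // the follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	followerLog := []Entry{{Term: 1, Command: "x=1"}, {Term: 1, Command: "y=2"}}
	leaderLog := []Entry{{Term: 1, Command: "x=1"}, {Term: 2, Command: "y=9"}, {Term: 2, Command: "z=3"}}

	// Leader side: start nextIndex just past the leader's last entry and walk
	// backwards until the follower's consistency check would succeed.
	nextIndex := len(leaderLog) + 1
	for nextIndex > 1 && !consistencyCheck(followerLog, nextIndex-1, leaderLog[nextIndex-2].Term) {
		nextIndex-- // AppendEntries failed: retry one entry earlier
	}
	fmt.Println("replication resumes from index", nextIndex) // index 2
}
```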

Safety

Election restriction

The Leader Completeness Property requires that the leader of any term hold all entries committed in previous terms. To guarantee this, Raft requires a candidate to contain all committed entries before it can win an election. The RequestVote RPC therefore includes information about the candidate’s log (the index and term of its last entry), and a server denies its vote if its own log is more up-to-date than the candidate’s.
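A small sketch of the "at least as up-to-date" comparison used when granting votes: the later last term wins, and for equal terms the longer log wins. Function and parameter names are illustrative assumptions.

```go
package main

import "fmt"

// candidateUpToDate reports whether a candidate's log (described by the term
// and index of its last entry) is at least as up-to-date as the voter's log.
func candidateUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm // the later last term wins
	}
	return candLastIndex >= voterLastIndex // same last term: the longer log wins
}

func main() {
	// The voter's log ends at index 5 in term 3.
	fmt.Println(candidateUpToDate(4, 2, 3, 5)) // true: candidate ends in a later term
	fmt.Println(candidateUpToDate(3, 4, 3, 5)) // false: same term but shorter log
}
```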

Committing entries from previous terms

A Raft leader never commits an entry from a previous term by counting replicas. In the figure from the paper: S1 is leader and replicates the entry at index 2 to S2, then crashes (a); S5 is elected leader for term 3 (b); S1 later becomes leader again and continues replicating entry 2 until it is on a majority of servers, but goes offline again before committing it (c); if S5 then becomes leader, it can overwrite entry 2 (d); however, if S1 first commits an entry from its own term (entry 4) on a majority of servers, entry 2 can no longer be overwritten, because S5 can no longer win an election (e).

A Raft leader only commits entries from its current term by counting replicas. Once such an entry is committed, all earlier entries are committed indirectly because of the Log Matching Property.
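A rough sketch of this commit rule: the leader only advances its commit index to an entry that is replicated on a majority and belongs to its current term. The matchIndex bookkeeping and the types are assumptions made for illustration.

```go
package main

import "fmt"

// Entry only needs a term for this example.
type Entry struct{ Term int }

// advanceCommitIndex returns the new commit index for a leader in currentTerm,
// given each follower's matchIndex (the leader itself is counted implicitly).
func advanceCommitIndex(log []Entry, matchIndex []int, currentTerm, commitIndex int) int {
	for n := len(log); n > commitIndex; n-- {
		if log[n-1].Term != currentTerm {
			continue // never commit an old-term entry just by counting replicas
		}
		count := 1 // the leader always holds entry n
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(matchIndex)+1)/2 {
			return n // a current-term entry is on a majority: commit it
		}
	}
	return commitIndex
}

func main() {
	log := []Entry{{Term: 2}, {Term: 2}, {Term: 4}}
	// Four followers; entry 3 (term 4) is stored on two of them, so together
	// with the leader it is on 3 of 5 servers and can be committed.
	fmt.Println(advanceCommitIndex(log, []int{3, 3, 1, 0}, 4, 1)) // prints 3
}
```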

Safety argument

Leader completeness can be argued by contradiction. Suppose that an entry committed in term T is not stored in the log of the leader of some later term, and let U be the smallest term greater than T whose leader (leader U) does not store the entry. Then:

  1. The committed entry must already have been absent from leader U’s log at the time of its election (leaders never delete or overwrite entries);
  2. Leader T replicated the entry on a majority of the servers, and leader U received votes from a majority of the servers;
  3. Therefore at least one server both accepted the committed entry from leader T and voted for leader U;
  4. That voter must still have stored the entry when it granted its vote to U;
  5. Since the voter granted its vote, candidate U’s log was at least as up-to-date as the voter’s, so it must also contain the entry, which contradicts the assumption.

Given the Leader Completeness Property, the State Machine Safety Property also follows.

Follower and candidate crashes

If a follower or candidate crashes, RPCs sent to it fail. Raft handles these failures simply by retrying indefinitely; once the crashed server restarts, the RPC completes. Raft RPCs are idempotent, so the retries cause no harm.

Time and availability

Raft’s safety does not depend on timing: the system never produces incorrect results just because some event happens faster or slower than expected.

However, availability does depend on timing, and Raft places the following requirement on the election timeout:

broadcastTime ≪ electionTimeout ≪ MTBF

where broadcastTime is the average time it takes a server to send RPCs to every other server in parallel and receive their responses, and MTBF is the average time between failures for a single server.

Cluster membership changes

For safety, configuration changes must use a two-phase approach. In Raft, the cluster first switches to a transitional configuration called joint consensus. While the joint consensus is in effect:

  • Log entries are replicated to all servers in both configurations;
  • Any server from either configuration may serve as leader;
  • Agreement (for elections and for committing entries) requires separate majorities from both the old and the new configuration, as in the sketch below.
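A minimal sketch of this "separate majorities" rule during joint consensus: a decision counts only if it is supported by a majority of the old configuration and a majority of the new one. The server IDs and the vote set are invented for the example.

```go
package main

import "fmt"

// majorityOf reports whether the set of votes covers a majority of config.
func majorityOf(config []string, votes map[string]bool) bool {
	count := 0
	for _, id := range config {
		if votes[id] {
			count++
		}
	}
	return count > len(config)/2
}

// jointAgreement is the agreement rule while C_old,new is in effect: separate
// majorities are needed from the old and the new configuration.
func jointAgreement(oldConfig, newConfig []string, votes map[string]bool) bool {
	return majorityOf(oldConfig, votes) && majorityOf(newConfig, votes)
}

func main() {
	oldConfig := []string{"s1", "s2", "s3"}
	newConfig := []string{"s3", "s4", "s5"}
	votes := map[string]bool{"s1": true, "s2": true, "s3": true}
	// s1-s3 are a majority of the old cluster but not of the new one, so no
	// decision can be reached yet during joint consensus.
	fmt.Println(jointAgreement(oldConfig, newConfig, votes)) // false
}
```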

The cluster configuration is stored and communicated as a special entry in the log; once a server adds a new configuration entry to its log, it uses that configuration for all future decisions.

The configuration change process is as follows:

  1. Commit and apply the transitional configuration C_old,new (this requires a majority of the old cluster and a majority of the new cluster); at this point it is safe to commit C_new;
  2. Commit and apply the new configuration C_new (this requires only a majority of the new cluster).

The following three problems occur when you change the configuration:

  1. New servers initially have empty logs and could take a long time to catch up, during which new entries cannot be committed. To avoid this availability gap, new servers can first join the cluster as non-voting members that receive log entries from the leader; once they have caught up, the configuration change proceeds.
  2. The leader may not be part of the new configuration. In that case the leader steps down (returns to the follower state) once it has committed C_new; until then it manages a cluster that does not include itself and does not count itself when tallying majorities.
  3. Removed servers can disrupt the cluster. They no longer receive heartbeats, so they time out and start elections with new term numbers, which forces the current leader to step down. To prevent this, servers ignore RequestVote RPCs when they believe a current leader exists, that is, when they have received a heartbeat from a leader within the minimum election timeout.

Log compaction

In Raft, a restarted node recovers by replaying its log, but an ever-growing log takes more space and more time to replay. The intuitive approach is to compact the log by taking snapshots.

Each node takes snapshots independently, covering only the committed entries in its log. The snapshot must record the index and term of the last entry it includes, so that the AppendEntries consistency check for the entry that follows still works. The leader additionally uses an InstallSnapshot RPC to send its snapshot to followers that have fallen too far behind, helping them catch up.
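A minimal sketch of this snapshot bookkeeping: the snapshot records the index and term of the last entry it covers so that the consistency check for the next AppendEntries still works. The struct and field names are assumptions, not the paper's reference types.

```go
package main

import "fmt"

// Snapshot keeps the index and term of the last entry it replaces, so the
// AppendEntries consistency check for the following entry still works.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	State             []byte // serialized state machine contents
}

// Entry only needs a term for this example.
type Entry struct{ Term int }

// compact discards log entries up to and including index upto (1-based),
// recording their place in the snapshot so replication can continue at upto+1.
func compact(log []Entry, upto int, state []byte) (Snapshot, []Entry) {
	snap := Snapshot{
		LastIncludedIndex: upto,
		LastIncludedTerm:  log[upto-1].Term,
		State:             state,
	}
	return snap, append([]Entry(nil), log[upto:]...)
}

func main() {
	log := []Entry{{Term: 1}, {Term: 1}, {Term: 2}, {Term: 3}}
	snap, rest := compact(log, 3, []byte("x=1,y=2"))
	fmt.Printf("snapshot covers up to index %d (term %d); %d entry remains\n",
		snap.LastIncludedIndex, snap.LastIncludedTerm, len(rest))
}
```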

The snapshot mechanism raises two performance issues:

  • When to snapshot: snapshotting too often wastes disk bandwidth, while snapshotting too rarely exhausts storage and lengthens the replay time on restart. A simple policy is to take a snapshot whenever the log reaches a fixed size.
  • Writing a snapshot can take a long time: copy-on-write techniques let the node keep accepting new updates without the snapshot blocking them.

Client interaction

A client must first find the leader. It connects to a randomly chosen server; if that server is not the leader, it tells the client which server it believes is the current leader.

If the leader crashes after committing a log entry but before responding to the client, the client will retry the command with the new leader, which could cause it to be executed twice. The solution is for the client to assign a unique serial number to every command, so the state machine can detect commands it has already executed and avoid re-executing them.
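A hedged sketch of this duplicate-detection idea: the state machine remembers the latest serial number executed for each client and ignores commands it has already applied. All types and names here are illustrative.

```go
package main

import "fmt"

// Command carries a client identifier and a per-client serial number in
// addition to the actual operation.
type Command struct {
	ClientID string
	Serial   int64
	Op       string
}

// StateMachine remembers the highest serial number it has executed for each
// client so that retried commands are not executed twice.
type StateMachine struct {
	lastApplied map[string]int64
	applied     []string // stand-in for the real state
}

// Apply executes cmd unless this client's command with the same (or a later)
// serial number has already been applied, as happens when a client retries
// after a leader crash.
func (sm *StateMachine) Apply(cmd Command) {
	if sm.lastApplied[cmd.ClientID] >= cmd.Serial {
		return // duplicate of an already executed command: ignore it
	}
	sm.lastApplied[cmd.ClientID] = cmd.Serial
	sm.applied = append(sm.applied, cmd.Op)
}

func main() {
	sm := &StateMachine{lastApplied: map[string]int64{}}
	sm.Apply(Command{ClientID: "c1", Serial: 1, Op: "x=1"})
	sm.Apply(Command{ClientID: "c1", Serial: 1, Op: "x=1"}) // retried duplicate
	fmt.Println("commands executed:", len(sm.applied))      // prints 1
}
```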

Since read-only operations do not write anything to the log, a leader that has been superseded without knowing it could return stale data. Raft adds two extra precautions to prevent this:

  • First, the leader must have the latest information on which entries are committed; to that end it commits a blank no-op entry at the start of its term;
  • Second, the leader must check that it has not been deposed before responding to a read-only request, for example by exchanging heartbeats with a majority of the cluster (sketched below).
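A rough sketch of this second check: before answering a read-only request, the leader confirms with a majority of the cluster that it has not been deposed. The Peer interface and the heartbeat call are assumptions made for the example, not a real RPC API.

```go
package main

import "fmt"

// Peer abstracts "send a heartbeat for this term and learn whether the peer
// still accepts this server as leader".
type Peer interface {
	Heartbeat(term int) bool
}

// ackPeer is a stub peer that always answers the same way.
type ackPeer bool

func (p ackPeer) Heartbeat(term int) bool { return bool(p) }

// confirmedLeader reports whether the leader, together with the peers that
// acknowledged its heartbeat, still forms a majority of the cluster.
func confirmedLeader(term int, peers []Peer) bool {
	acks := 1 // the leader counts itself
	for _, p := range peers {
		if p.Heartbeat(term) {
			acks++
		}
	}
	return acks > (len(peers)+1)/2
}

func main() {
	peers := []Peer{ackPeer(true), ackPeer(true), ackPeer(false), ackPeer(false)}
	// 3 of 5 servers (including the leader) still accept the leader, so it is
	// safe to answer the read-only request.
	fmt.Println("safe to serve read:", confirmedLeader(5, peers))
}
```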

References

  1. Ongaro, Diego, and John K. Ousterhout. “In search of an understandable consensus algorithm.” USENIX Annual Technical Conference. 2014.