Raft claims to be an easy-to-understand distributed consensus algorithm. Yet many engineers have read the paper, consulted source after source, and ended up wondering whether something is wrong with their own intelligence.

Raft has long been a technical ceiling for many senior programmers, and a hard one to break through. You pick it up full of enthusiasm, quietly give up some time later, and are left with a nagging sense of defeat. That frustration is familiar to anyone trying to escape their technical comfort zone. Raft marks a high threshold in distributed systems; cross it, and the technical freedom beyond is a real step up.

The Raft paper is not easy reading for the average programmer. The election module is relatively simple, log replication looks easy on the surface, and the snapshot module is easy to picture. But dig into the details and confusion is inevitable. The cluster membership change module in particular is the hardest to understand.

Open source code

I found a great open-source project on GitHub: a Netty-based implementation of Raft, open-sourced by a Baidu engineer.

https://github.com/wenweihu86/raft-java

Recently I spent some time reading through the code and found it quite understandable, which made me feel closer to my goal. With the RPC framework we have already implemented, writing our own Raft should soon become a reality.

The macro structure

First, assume we have three RaftNodes. Each RaftNode listens on a single port that accepts both client (Get/Set) requests and RPC requests from the other RaftNodes. Note that most well-known open-source projects choose two ports, one for clients and one for RPC, the advantage being that they can bind different IP addresses: the client port can face the extranet while RPC stays on the secure intranet. The author chose a single port because the project is only used on the intranet, which makes the implementation much simpler.

A client can connect to any node. If it is not connected to the Leader, its requests are forwarded on the server side from the currently connected RaftNode to the Leader for processing.

An alternative design is for all clients to connect to the Leader, which avoids the forwarding process and improves performance.

However, server-side forwarding has its own advantage: when the client has weak consistency requirements, read requests can be served directly by the current RaftNode without forwarding, at the cost of possibly returning stale data. This absorbs most read requests and improves overall read performance.
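For illustration, here is a minimal sketch of that routing decision in Java. The names (ReadRouter, Consistency, forwardGet) are hypothetical, not raft-java's actual API; the sketch just shows a follower answering weakly consistent reads locally and forwarding everything else:

```java
// Hypothetical sketch: route reads by the client's consistency requirement.
public class ReadRouter {
    public enum Consistency { STRONG, EVENTUAL }

    interface Peer { byte[] forwardGet(String key); }  // RPC stub to the Leader

    private final boolean isLeader;
    private final Peer leader;
    private final java.util.Map<String, byte[]> localStateMachine;

    public ReadRouter(boolean isLeader, Peer leader,
                      java.util.Map<String, byte[]> localStateMachine) {
        this.isLeader = isLeader;
        this.leader = leader;
        this.localStateMachine = localStateMachine;
    }

    public byte[] get(String key, Consistency level) {
        // Stale reads allowed: answer from the local state machine, which
        // may lag the Leader by a few log entries.
        if (level == Consistency.EVENTUAL || isLeader) {
            return localStateMachine.get(key);
        }
        // Strong reads must go through the Leader.
        return leader.forwardGet(key);
    }
}
```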

The details of the RaftNode

This diagram shows all the important components inside a RaftNode.

First, after the Local Server receives a request, it immediately appends the request log to the SegmentedLog. The log is persisted to the file system in real time, with a copy kept in memory. Because the author believed that a single large log file could hurt performance and safety (the exact reason is unclear; why doesn't the Redis AOF log need to be segmented?), the log is split into multiple files in sequence, hence the name SegmentedLog.

The log has four important indexes: firstLogIndex/lastLogIndex and commitIndex/applyIndex. The first two mark the beginning and end of the log entries currently held in memory; the latter two are the committed index and the applied index. The log starts at firstLogIndex rather than zero because log files would otherwise grow without bound, so Raft periodically cleans up old entries, and the log therefore does not begin at zero. commitIndex is the highest log position that has been successfully replicated on more than half of the nodes; everything up to it is safe and can be applied to the state machine, and Raft applies committed logs to the state machine immediately. applyIndex marks the highest log index that has actually been applied to the state machine.
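A sketch of those four indexes and the ordering between them, with field names taken from the text (the real SegmentedLog in raft-java is of course more elaborate):

```java
// Sketch of the four log indexes described above.
public class LogIndexes {
    // First index still held in the segmented log; greater than 1 once
    // old segments have been truncated after a snapshot.
    volatile long firstLogIndex = 1;
    // Index of the most recently appended entry.
    volatile long lastLogIndex = 0;
    // Highest index known to be replicated on a majority of nodes;
    // everything up to here is safe to apply.
    volatile long commitIndex = 0;
    // Highest index actually applied to the state machine.
    volatile long applyIndex = 0;

    // The indexes always advance in this order.
    boolean invariantsHold() {
        return applyIndex <= commitIndex && commitIndex <= lastLogIndex;
    }
}
```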

The example state machine in this project is RocksDB, which provides efficient key-value storage. There are many other viable options, such as a pure in-memory KV store or LevelDB. The downside of pure memory is that the entire dataset must fit in RAM; the advantage of RocksDB/LevelDB is that data can be flushed to disk, relieving memory pressure at some cost in performance.
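As a rough idea of the role RocksDB plays here, the following is a minimal RocksDB-backed key-value state machine. It is a sketch, not raft-java's actual StateMachine interface, and assumes the org.rocksdb JNI library is on the classpath:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Sketch: a key-value state machine persisted by RocksDB.
public class KvStateMachine implements AutoCloseable {
    static { RocksDB.loadLibrary(); }

    private final RocksDB db;

    public KvStateMachine(String dir) throws RocksDBException {
        Options options = new Options().setCreateIfMissing(true);
        this.db = RocksDB.open(options, dir);
    }

    // Called for every committed log entry: apply the write to disk.
    public void applySet(byte[] key, byte[] value) throws RocksDBException {
        db.put(key, value);
    }

    public byte[] get(byte[] key) throws RocksDBException {
        return db.get(key);
    }

    @Override
    public void close() { db.close(); }
}
```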

If the server is configured with isAsyncWrite, the Local Server returns success to the client immediately after inserting the log into the SegmentedLog. But this sacrifices data safety: the Raft protocol requires that data be considered safe, and the client be told the request succeeded, only after more than half of the servers have responded successfully. This configuration item exists purely for performance. Kafka, among other distributed systems, has a similar option; it is a compromise that trades consistency for throughput.

Normally, instead of returning right away, the Local Server suspends the current request with an await on a Condition variable.

For each RPCClient, the node also maintains two log indexes for the peer. matchIndex is the position up to which the peer has been successfully synchronized, which can be thought of as that peer's local commitIndex; nextIndex is the next log index to synchronize. As entries flow from Leader to Follower, matchIndex tries to keep up with nextIndex, and nextIndex keeps advancing as client requests keep arriving.

After suspending the user's request, the Local Server immediately kicks off asynchronous log synchronization. Through the RPC Clients it sends the other nodes an AppendEntries message (which doubles as the heartbeat message) containing all log entries not yet synchronized (from commitIndex to lastLogIndex), then waits for their responses. If a response reports success, the Leader can advance that peer's synchronization position, matchIndex.

matchIndex is tracked per RPCClient. Once the matchIndexes of more than half of the RPCClients have advanced, the global commitIndex can advance as well: it is the smallest matchIndex held by a majority of the nodes.
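Here is one way the Leader could derive commitIndex from the matchIndex values, using the classic approach of sorting them (the Leader's own lastLogIndex included) and taking the position a majority has reached. The names are illustrative, not raft-java's:

```java
import java.util.Arrays;

// Sketch: compute the commitIndex a majority of nodes has replicated.
public class CommitIndexCalculator {
    public static long computeCommitIndex(long leaderLastLogIndex,
                                          long[] followerMatchIndexes) {
        // The Leader's own log counts toward the majority too.
        long[] all = Arrays.copyOf(followerMatchIndexes,
                                   followerMatchIndexes.length + 1);
        all[all.length - 1] = leaderLastLogIndex;
        Arrays.sort(all);
        // With n nodes, the entry at position n - (n/2 + 1) in the ascending
        // array has been replicated on at least a majority (n/2 + 1) of them.
        int majority = all.length / 2 + 1;
        return all[all.length - majority];
    }

    public static void main(String[] args) {
        // 5 nodes: Leader at 9, followers at 9, 8, 5, 5 -> a majority holds 8.
        System.out.println(computeCommitIndex(9, new long[]{9, 8, 5, 5})); // 8
    }
}
```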

Once commitIndex moves forward, all logs before it have been successfully committed and the suspended clients can proceed. So the node immediately wakes all pending requests with the Condition variable's signalAll operation, telling them to respond to their clients.
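The suspend/wake mechanics described in the last few paragraphs might look roughly like this, using the ReentrantLock/Condition pair the text mentions (simplified; raft-java's actual code differs):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: request threads park until commitIndex catches up to their entry.
public class CommitWaiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition commitAdvanced = lock.newCondition();
    private long commitIndex = 0;

    // Called by the request thread after appending its entry at `index`.
    public boolean awaitCommitted(long index, long timeoutMillis)
            throws InterruptedException {
        lock.lock();
        try {
            long deadline = System.nanoTime()
                    + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (commitIndex < index) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) return false;  // timed out
                commitAdvanced.awaitNanos(remaining);
            }
            return true;  // committed; safe to answer the client
        } finally {
            lock.unlock();
        }
    }

    // Called whenever a majority of matchIndexes advances commitIndex.
    public void advanceCommitIndex(long newCommitIndex) {
        lock.lock();
        try {
            if (newCommitIndex > commitIndex) {
                commitIndex = newCommitIndex;
                commitAdvanced.signalAll();  // wake all suspended requests
            }
        } finally {
            lock.unlock();
        }
    }
}
```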

Note that log synchronization must distinguish whether a node's log has fallen too far behind. If catching up through AppendEntries would be too slow, consider the other route: snapshot synchronization.

RaftNode takes snapshots periodically: it serializes the current state machine contents to the file system, then cleans up old entries in the SegmentedLog to slim down the Raft request log.

Snapshot synchronization means the Leader sends its latest snapshot file to the Follower; once the Follower has installed the snapshot, the Leader goes back to synchronizing the SegmentedLog so the Follower can catch up with it.
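The choice between the two routes can be captured in a few lines. This sketch assumes that a follower whose nextIndex has fallen behind firstLogIndex can no longer be served from the log at all, because those entries were truncated after the last snapshot (method and enum names are illustrative):

```java
// Sketch: decide between incremental log sync and snapshot sync.
public class ReplicationPlanner {
    enum Action { APPEND_ENTRIES, INSTALL_SNAPSHOT }

    static Action planFor(long followerNextIndex, long firstLogIndex) {
        // Entries before firstLogIndex were cleaned up after the last
        // snapshot, so they cannot be shipped via AppendEntries anymore.
        return followerNextIndex < firstLogIndex
                ? Action.INSTALL_SNAPSHOT
                : Action.APPEND_ENTRIES;
    }

    public static void main(String[] args) {
        System.out.println(planFor(100, 500)); // INSTALL_SNAPSHOT
        System.out.println(planFor(700, 500)); // APPEND_ENTRIES
    }
}
```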

RaftNode startup process

The first step in starting a RaftNode is to load the SegmentedLog and then the latest snapshot to build the initial state of the state machine. The node then connects to the other nodes with its RPCClients and enables the scheduled snapshot task. After that, the election process officially begins.
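Spelled out as a sketch, with hypothetical method names rather than raft-java's actual API, the startup order is:

```java
// Sketch: the startup sequence described above, in order.
public final class RaftNodeBootstrap {
    interface Steps {
        void loadSegmentedLog();      // 1. recover the on-disk log
        void loadLatestSnapshot();    // 2. restore the state machine baseline
        void connectToPeers();        // 3. open RPC clients to other nodes
        void scheduleSnapshotTask();  // 4. enable periodic snapshotting
        void startElectionTimer();    // 5. begin life as a Follower
    }

    static void start(Steps node) {
        node.loadSegmentedLog();
        node.loadLatestSnapshot();
        node.connectToPeers();
        node.scheduleSnapshotTask();
        node.startElectionTimer();
    }
}
```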

The election process

A RaftNode starts in the Follower state and immediately starts a startNewElection timer. If, before the timer expires, it receives neither a heartbeat from a Leader nor a vote request from any Candidate, it immediately becomes a Candidate and launches a new election.

When a RaftNode becomes a Candidate, it sends a requestVote message to the other nodes and immediately starts a startElection timer. If it has become neither Follower nor Leader by the time the timer expires, it immediately starts another election.

While a RaftNode is a Candidate, it reverts to Follower as soon as it receives a heartbeat from a Leader. If more than half of the nodes respond successfully to its vote request, it immediately becomes the Leader and begins broadcasting periodic heartbeats to the other nodes to hold onto its leadership for as long as possible.
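A minimal sketch of the election timer behind this state machine, assuming the usual randomized timeout in the 150-300 ms range suggested by the Raft paper (class and field names are illustrative):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Sketch: a randomized election timeout, reset on every heartbeat or vote
// request; when it fires, the Follower becomes a Candidate.
public class ElectionTimer {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable becomeCandidate;
    private ScheduledFuture<?> pending;

    public ElectionTimer(Runnable becomeCandidate) {
        this.becomeCandidate = becomeCandidate;
    }

    // Randomization keeps nodes from timing out simultaneously and
    // splitting the vote over and over.
    public synchronized void reset() {
        if (pending != null) pending.cancel(false);
        long timeoutMs = ThreadLocalRandom.current().nextLong(150, 300);
        pending = scheduler.schedule(becomeCandidate, timeoutMs,
                                     TimeUnit.MILLISECONDS);
    }
}
```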

Conditions for being elected Leader

Not every node can become Leader. To be elected, a node must hold the most complete log. When a Candidate canvasses for votes with a RequestVote message, it must carry the lastLogIndex of its current log and the term of that entry (the term and index of its tail log). The other nodes compare these two values against their own and reject any canvassing request whose log is not at least as up-to-date as theirs. Simply put, the most impressive node in the group gets to be the Leader.
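That comparison is the standard up-to-date check from the Raft paper: compare the terms of the last log entries first, and fall back to comparing indexes when the terms are equal. A sketch:

```java
// Sketch: grant a vote only if the candidate's log is at least as
// up-to-date as ours, per the Raft paper's rule.
public class VoteCheck {
    static boolean candidateLogIsUpToDate(long candidateLastTerm,
                                          long candidateLastIndex,
                                          long myLastTerm,
                                          long myLastIndex) {
        if (candidateLastTerm != myLastTerm) {
            return candidateLastTerm > myLastTerm;  // newer term wins
        }
        return candidateLastIndex >= myLastIndex;   // same term: longer log wins
    }

    public static void main(String[] args) {
        System.out.println(candidateLogIsUpToDate(5, 10, 5, 12)); // false
        System.out.println(candidateLogIsUpToDate(6, 3, 5, 12));  // true
    }
}
```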

Log synchronization

When leadership changes hands, the logs of the new Leader and its Followers may be inconsistent. In that case the Follower truncates its own log and resynchronizes from the truncation point. The Leader's own log is append-only; it never erases any of its own entries.

The standard strategy is that, immediately after being elected, the Leader broadcasts an AppendEntries message to all nodes carrying information about its last log entry. On receiving it, each Follower compares that entry with its own log; if it does not match, the Follower rejects the message.

After being rebuffed, the Leader backs up one step and resends AppendEntries to the rejecting Follower, this time carrying the last two log entries. If the earlier of the two entries still fails to match the Follower's log, the Follower rejects again, and the Leader keeps backing up and retrying. Once a match is found, the log entries in the message overwrite the Follower's local log from that point on, synchronization succeeds, and consistency is restored.
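A toy version of that back-off loop, with the Follower's consistency check on one side and the Leader stepping nextIndex back on the other. It is illustrative only; the paper also describes faster back-off schemes than one entry at a time:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the Follower checks the (index, term) preceding the new entries;
// on mismatch the Leader decrements nextIndex and retries.
public class ConsistencyCheck {
    // Follower side: accept only if our entry at prevLogIndex has prevLogTerm.
    static boolean followerAccepts(Map<Long, Long> termAt,
                                   long prevLogIndex, long prevLogTerm) {
        if (prevLogIndex == 0) return true;  // syncing from the very start
        Long term = termAt.get(prevLogIndex);
        return term != null && term == prevLogTerm;
    }

    // Leader side: on rejection, step back and try one entry earlier.
    static long onReject(long nextIndex) {
        return Math.max(1, nextIndex - 1);
    }

    public static void main(String[] args) {
        Map<Long, Long> leaderTerms = new HashMap<>();
        leaderTerms.put(1L, 1L); leaderTerms.put(2L, 1L);
        leaderTerms.put(3L, 3L); leaderTerms.put(4L, 3L);
        Map<Long, Long> followerTerms = new HashMap<>();
        followerTerms.put(1L, 1L); followerTerms.put(2L, 1L);
        followerTerms.put(3L, 2L);  // diverged from the Leader at index 3

        long nextIndex = 5;
        while (nextIndex > 1
                && !followerAccepts(followerTerms, nextIndex - 1,
                                    leaderTerms.get(nextIndex - 1))) {
            nextIndex = onReject(nextIndex);
        }
        // 3: the Follower truncates from here and copies the Leader's entries.
        System.out.println(nextIndex);
    }
}
```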

Cluster member changes

Cluster configuration change is probably the most complex module of the Raft algorithm, and the one I struggled hardest to understand. After reading many articles I found that their authors had not actually understood membership change in depth either; they merely repeated what the paper said. I believe most of them have implemented Raft only halfway at best; without a fine grasp of the details, the complete algorithm would be hard to write.

One of the biggest headaches in distributed systems is that the same action takes effect at different times on different nodes. For example, in the figure above, the cluster grows from 3 nodes to 5 and its configuration changes from OldConfig to NewConfig. The nodes do not switch configuration at exactly the same moment; there is some skew, which creates a window where the old and new configurations overlap.

At the cut point marked in red in the figure, Server[1,2] in the old cluster can elect Server1 as Leader, and it does not matter that Server3 disagrees. At the same time, Server[3,4,5] in the newly configured cluster can elect Server5 as another Leader. Now multiple Leaders coexist.

To avoid this problem, Raft uses single-node membership change: only one node may be added or removed at a time, and changes must be made sequentially, never in parallel, or chaos ensues. To go from 3 nodes to 5, first go from 3 to 4, then from 4 to 5. The benefit of single-node change is that the cluster cannot split, so two Leaders cannot exist at the same time.

As shown in the figure, the blue circle is a majority of the old configuration and the red circle a majority of the new one. Under a single-node change, the two majorities must overlap: a majority of an old configuration with 2k nodes is k+1, a majority of a new configuration with 2k+1 nodes is also k+1, and together they total 2k+2, which exceeds the new cluster size of 2k+1. Two competing Leaders would have to be elected in different terms, but a single node can only be in one term at a time, so the overlapping node can side with only one majority. In the end there is only one cluster, and therefore only one Leader.
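The arithmetic generalizes to any single-node change from n to n+1 nodes, which this small check demonstrates (a standalone illustration, not part of any Raft implementation):

```java
// Sketch: for a single-node change from n to n+1 nodes, the two majorities
// always sum to more than n+1, so they must share at least one node.
public class MajorityOverlap {
    static int majority(int n) { return n / 2 + 1; }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            int oldMajority = majority(n);
            int newMajority = majority(n + 1);
            // e.g. n = 4: majority(4) = 3, majority(5) = 3, and 3 + 3 = 6 > 5.
            System.out.printf("%d -> %d nodes: %d + %d = %d > %d%n",
                    n, n + 1, oldMajority, newMajority,
                    oldMajority + newMajority, n + 1);
        }
    }
}
```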

Cluster-change operation logs differ from ordinary logs. An ordinary log entry can be applied to the state machine only after it is committed, whereas a cluster-change log entry can be applied as soon as the Leader persists the appended entry. For the reasoning, see Sun Jianliang's Zhihu article at https://zhuanlan.zhihu.com/p/29678067, which describes in detail the cluster unavailability that can result from applying the change log only after commit.

Final thoughts

The article is finished, but I still feel a little muddled; there seem to be many details I have not figured out yet. I also started out assuming that a Raft implemented at Baidu would be close to perfect, but on closer inspection found plenty of imperfections; this project is probably best treated as a demo. My next step is to study etcd's Raft code, which is more complex but should be much more polished. It has the PreVote process and the no-op log entry that a newly elected Leader synchronizes, both areas where the raft-java project is lacking.

Follow the public account "code hole" to explore the Raft protocol.