The original link: https://thesquareplanet.com/blog/students-guide-to-raft/

Students’ Guide to Raft

Public id: Rust

Lab2 Guide for MIT6.824 Distributed Systems Is a guide for LAB2 instruction in MIT6.824 Distributed Systems. This guide is a guide for LAB2 instruction in MIT6.824 Distributed Systems. This article was published on March 16, 2016.

Debugging Raft was Debugging (Raft)

Inevitably, the first iteration of your Raft implementation will be buggy. The second iteration will do the same. And the third and fourth rounds. In general, each iteration will have fewer bugs than the previous one, and from experience, most of your bugs will result from not following Figure 2 carefully. There are usually four main sources of bugs when debugging Raft: livelocks (livelocks), incorrect or imperfect RPC handlers, failure to follow rules, and term confusion. Deadlocks are also a common problem, but they can often be debugged by logging all locks and unlock, and finding out how you are holding locks that are not being released. Let’s consider these questions in turn:

Live lock (Livelocks)

When your system is locked, every node in the system is working, but the whole set of nodes is in a state of non-advancement. This can easily happen in Raft, especially if you don’t follow Figure 2 carefully. One kind of live lock scenario is particularly common; No leader is being elected, or once a leader is elected, another election is initiated by another node, forcing the recently elected leader to abdicate immediately. There are many reasons for this scenario, but there are a few mistakes that we see many students make:

  • When Figure 2 says you should reset your election timer, make sure the election timer is reset exactly. Specifically, you should only restart your election timer if a) you receive a AppendEntries RPC from the current leader (that is, you should not reset your timer if the term in the AppendEntries parameter expires); B) You are starting an election round; Or c) you vote for another peer. The last situation is especially important in the network that is unreliable. In this network, the followers may have different logs. In these cases, the majority of servers often end up voting for a minority of servers. If you reset the election timer when a server asks you to vote for it, it may give a server with an expired log as much chance to stand up as a server with a longer log. In fact, because there are so few servers with enough new logs, those servers are quite unlikely to be elected in the normal way. If you follow the rules in Figure 2, servers with updated logs will not be interrupted by the election of expired servers, so these servers are more likely to complete the election and become the leader.

  • Follow the instructions in Figure 2 for when you should start an election cycle. Specifically, know that if you are a candidate(that is, you are currently running a round of elections), but the election timer wakes up and you should start another round. This is important to avoid system outages caused by delayed or discarded RPCs.

  • Make sure you follow the second rule in “Rules for Server” before dealing with an upcoming RPC. The second rule states:

If the RPC request or response contains term T > currentTerm: Set currentTerm = T and convert to follower (§5.1)

For example, if you have already voted in the current term, and an incoming RequestVote RPC has a term higher than yours, you should first step down and comply with their term(thus resetting votedFor), then process the RPC, This will cause you to vote!

Incorrect RPC handlers

Although Figure 2 illustrates exactly what each RPC handler should do, some nuances are easily overlooked. Here are some things that we see over and over again, and that you should pay attention to in your implementation:

  • If a step says, “Reply false,” that means you should reply immediately and not perform any of the following steps.

  • If you receive a AppendEntries RPC with a prevLogIndex pointing to the end of your log, you should treat it as if you do have the entry but term does not match (i.e., reply false).

  • Check 2 for the AppendEntries RPC handler should be executed even if the leader does not send any more entries.

  • The min of the last step (#5) is necessary, and it needs to be calculated from the index of the latest entry. It is not enough to simply have the function that applies the contents of the log between lastApplied and commitIndex stop when it reaches the end of the log. This is because there may be entries in your log that are different from the leader’s log after the leader sends you entries that match the entries in your log. Because #3 indicates that you will only truncate your log if you have conflicting entries, these conflicting entries will not be removed, and if leaderCommit exceeds the entries sent to you by the Leader, you may apply incorrect entries.

  • It is important to implement “up-to-date log” checks strictly as described in Section 5.4. No cheating, just checking the length.

Failure to follow The Rules

While Raft’s paper gives clear instructions for the implementation of each RPC handler function, there are many rules and invariants that are not specified. These are listed in the “Rules for Servers” area on the right side of Figure 2. While some of these are fairly self-explanatory, there are a few that need careful attention when designing your application so as not to violate these rules:

  • At any point during execution, if commitIndex > lastApplied, you should apply a specific log entry. The key is not that you do this directly (for example, in the AppendEntreis RPC handler), but that you ensure that the application is done by only one entity. Specifically, you’ll need a dedicated “applier,” or lock these apps so that some other application doesn’t also detect the item that needs to be applied and try to apply it.

  • Make sure you check commitIndex > lastApplied either periodically or after a commmitIndex update (that is, after a matchIndex update). For example, if you are sending AppendEntries to peers while checking commitIndex, you may have to wait until the next entry is appended to the log before you can apply the entry you just sent and be confirmed.

  • If a leader issues an AppendEntries RPC and it is rejected, but not because of log inconsistencies (this only happens if our term matches), then you should immediately abate and do not update nextIndex. If you update nextIndex, you may compete with the reset of nextIndex if you are immediately reelected.

  • The leader does not allow the commitIndex to be updated to a location of the previous term (or future term). So, as the rule says, you need to check log[N].term == currentTerm. This is because if an entry is not from the current term, Raft Leaders cannot be sure that the entry has actually been submitted (and will not be changed in the future). This is illustrated in Figure 8 of the paper.

One common area of confusion is the difference between nextIndex and matchIndex. Specifically, you might notice that matchIndex = Nextindex-1 and not implement matchIndex. It’s not safe. Although nextIndex and matchIndex are usually updated to a similar value at the same time (specifically, nextIndex = matchIndex + 1), they serve different goals. NextIndex is the estimate of the prefix assigned by the leader to the specified follower share. It is usually optimistic (we all share it) and only moves in response to negative responses. For example, when a leader is newly elected, nextIndex is set to the index at the end of the log. In a way, nextIndex is used for performance — you just have to do this to the peer.

MatchIndex is used for security. It is a conservative metric of how many log prefixes are shared by the leader and the specified follower. The matchIndex cannot be set to a value too high, as this may cause commitIndex to move too far forward. This is why the matchIndex is initialized to -1(i.e., we agree that there is no prefix) and will only be updated if follower actively replies with a AppendEntries RPC.

Term Confusion

Term confusion means that servers are confused with RPCs from the old terms. In general, this is not a problem when receiving an RPC, because the rules in Figure 2 specify what to do when you see an old term. However, Figure 2 does not discuss what to do when you receive an old RPC reply. As a rule of thumb, we’ve found that the easiest thing to do right now is to record the term in the reply (which may be higher than your current term), and then compare the current term to the term you sent in the original RPC. If the two terms are different, the reply is discarded and a return is returned. You should proceed with this reply only if the two terms are the same. You can probably do some clever protocol reasoning here to further optimize, but so far it seems to be working pretty well. To do otherwise would be a long and winding road of blood, sweat, tears and despair. A related, but not identical, problem is to assume that your status has not changed between the time you sent the RPC and the time you received the reply. A good example is when you receive a PRC response, set matchIndex = nextindex-1, or matchIndex = len(log). This is not secure because both values may have been updated since you sent the RPC. The correct approach is to update the matchIndex to prevLogIndex + Len (Entries []) from the argument in the RPC you originally sent.

By the way, An aside on Optimizations

The Raft paper contains several interesting optional features. In 6.824, we asked students to implement two of these: log compaction (Section 7) and accelerated log Backtracking (left-hand side of page 8). The former is needed to prevent the log from growing indefinitely, while the latter is useful for quickly updating expired followers.

These features are not part of “Core Raft” and will not receive the same attention as the main conformance protocol section of the paper. This paper covers log compression fairly thoroughly (Figure 13), but if you read it casually, some design details may be missing.

  • When taking a snapshot of the application state, you need to make sure that the application state corresponds to the state after a known index in the Raft log. This means that either the application needs to communicate with Raft to determine what index the snapshot corresponds to, or Raft needs to delay applying the other log entries until the snapshot is complete.

  • This paragraph does not discuss server crashes involving snapshots and the recovery protocol in case of a restart. Specifically, if the Raft state and snapshot are committed separately, a server may crash between the Raft state after a persistent snapshot and the Raft state after a persistent update. This is a problem because step 7 in Figure 13 shows that the Raft log overridden by the snapshot must be discarded. If, when the server restarts, it reads the updated snapshot but the expired log, it may end up applying the log entries that were already included in the snapshot. This happens because commitIndex and lastIndex are not persisted and Raft doesn’t know that the log entries have been applied. The fix for this is to introduce a persistent state in Raft that records the index of the first entry in the Raft persistent log. This can be compared to the lastIncludeIndex of the loaded snapshot to determine which elements should be discarded at the beginning of the log.

Accelerated log backtracking optimization is not clear cut, probably because the authors believe it is not necessary for most deployments. It is not clear from the text exactly how the conflicting index and term from the client are used by the leader to decide what nextIndex to use. We think the protocol the authors might want you to follow is:

  • If a follower does not have a prevLogIndex in the log, it should return conflictIndex=len(log) and conflictTerm = None.

  • If a follower has prevLogIndex in the log but term does not match, it should return conflictTerm = log[prevLogIndex].Term, And query its log to find the index of the entry whose first term is equal to conflictTerm.

  • Once a conflicted response is received, the leader should first query its log to find the conflictTerm. If it finds an entry in the log with that term, it should set nextIndex to the one after the index of the last entry in the log for that term.

  • If it does not find an entry in this term, it should set nextIndex=conflictIndex.

A less-than-perfect solution is to use conflictIndex only (and ignore conflictTerm), which simplifies the implementation, but later, the leader will occasionally generate more log entries for followers to update than sending only the necessary ones.

This article is prohibited to reprint, thank you for your cooperation! Welcome to follow my wechat official account: Rust

Rust