Abstract: This paper briefly introduces the principles of the Raft protocol and shares practical engineering experience with using Raft to replicate storage nodes.

1. Introduction

In the engineering practice of Huawei's distributed database, we implemented a prototype distributed database system that separates compute from storage and replicates the underlying storage with the Raft protocol. Below is a diagram of its architecture.

The logs generated by the compute node are encapsulated and sent to the storage node over the network. After consensus is reached at the Raft layer, the logs are applied to the state machine's WAL Engine for log replay, data storage, and management.

Here is a brief introduction to the principles of Raft and some engineering practice on how the storage node Pinetree uses Raft for replication.

2. Raft principles

2.1 Basic principles of Raft

The Raft algorithm is Leader-based: everything goes through the Leader to reach consensus on a set of values and keep each node's log consistent. I'll focus on the Raft protocol's Leader election, log replication, and membership changes.

Raft's election mechanism:

The protocol defines three states for a node: Leader, Candidate, and Follower. Time is divided into Terms, each with an ID; a Term works like a logical clock, and a Leader election takes place in each Term.

The Leader handles all write requests, initiates log replication, and sends periodic heartbeats. Each Term has at most one Leader, and an election may also fail to produce one.

The Follower is in the passive state and processes and responds to RPC requests sent by the Leader.

A Candidate is used to elect a new Leader. When a Follower's election timer expires, it enters the Candidate state.

In the initial state, all nodes are Followers. When a node's election timer expires, it increments its current Term, enters the Candidate state, and broadcasts RequestVote RPCs asking the other nodes to vote for it. After receiving votes from a majority of nodes, the Candidate becomes Leader. When a Follower receives a vote request, it compares the Term first and then the log index; if both checks pass, it updates its local current Term and responds to the RequestVote RPC with a vote. A Follower may vote only once per Term.
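As a rough illustration (not Pinetree's or etcd's actual code; all names are made up), the vote-granting check a Follower performs looks roughly like this:

```go
package raftsketch

// RequestVote carries a Candidate's election request (illustrative fields).
type RequestVote struct {
	Term         uint64 // Candidate's term
	CandidateID  uint64
	LastLogIndex uint64 // index of the Candidate's last log entry
	LastLogTerm  uint64 // term of the Candidate's last log entry
}

// voterState is the Follower-side state relevant to voting.
type voterState struct {
	currentTerm  uint64
	votedFor     uint64 // 0 means "not voted in this term"
	lastLogIndex uint64
	lastLogTerm  uint64
}

// grantVote applies the two checks described above: the Candidate's term
// must not be older than ours, and its log must be at least as up to date
// (compare last log term first, then index). At most one vote per Term.
func (v *voterState) grantVote(req RequestVote) bool {
	if req.Term < v.currentTerm {
		return false
	}
	if req.Term > v.currentTerm {
		v.currentTerm = req.Term // adopt the newer term
		v.votedFor = 0
	}
	logUpToDate := req.LastLogTerm > v.lastLogTerm ||
		(req.LastLogTerm == v.lastLogTerm && req.LastLogIndex >= v.lastLogIndex)
	if !logUpToDate || (v.votedFor != 0 && v.votedFor != req.CandidateID) {
		return false
	}
	v.votedFor = req.CandidateID
	return true
}
```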

Raft's log replication:

Once a Leader is elected, it can accept write requests. Each write request carries an instruction, or Command, that the user needs replicated. The Raft protocol wraps a Term and an Index around the write request, producing a Raft Log entry. The Leader appends the Log entry to its own log and then sends AppendEntries RPC requests to the other nodes. Once the Leader determines that the Log entry has been replicated by a majority of nodes, it applies the entry to the state machine and returns the result to the client.
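For illustration, the Leader's "replicated by a majority" check can be sketched as follows. This is a simplified version that ignores the rule restricting commits to entries from the current Term; matchIndex is an illustrative name for the per-node highest replicated index, including the Leader's own:

```go
package raftsketch

import "sort"

// advanceCommit returns the highest log index that is replicated on a
// majority of nodes, i.e. the index up to which the Leader may commit.
// matchIndex maps node ID to the highest index known to be replicated
// there (the Leader's own last index included).
func advanceCommit(matchIndex map[uint64]uint64) uint64 {
	idx := make([]uint64, 0, len(matchIndex))
	for _, m := range matchIndex {
		idx = append(idx, m)
	}
	if len(idx) == 0 {
		return 0
	}
	// Sort descending: the value at position len/2 is replicated on a majority.
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	return idx[len(idx)/2]
}
```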

Raft member change mechanism:

A membership change means adding, removing, or replacing nodes in the cluster. The Raft protocol was defined with membership changes in mind, so that cluster changes do not make the system unavailable. Raft reuses the Log Entry and consensus mechanism described above: a membership change is initiated by the Leader, which generates a new Log entry locally and pushes it to the other Follower nodes.

A Follower appends the Log entry to its local log on receipt and applies the configuration it carries. Once a majority of nodes have applied it, the Leader commits this change log entry. There is also the problem of transitioning from the old configuration to the new one, which I won't go into here.
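With the etcd raft package that Pinetree adopts later, a membership change is proposed and applied roughly as below. This is a hedged sketch against go.etcd.io/etcd/raft/v3; exact names and signatures may differ between versions:

```go
package raftsketch

import (
	"context"

	"go.etcd.io/etcd/raft/v3"
	pb "go.etcd.io/etcd/raft/v3/raftpb"
)

// proposeAddNode has the Leader propose a ConfChange entry that adds a
// new member; the entry is replicated like any other log entry.
func proposeAddNode(ctx context.Context, n raft.Node, newID uint64, addr []byte) error {
	cc := pb.ConfChange{
		Type:    pb.ConfChangeAddNode,
		NodeID:  newID,
		Context: addr, // e.g. the new node's network address
	}
	return n.ProposeConfChange(ctx, cc)
}

// applyConfChange is called when a committed EntryConfChange entry comes
// back through the Ready() loop; it updates raft's view of the membership.
func applyConfChange(n raft.Node, ent pb.Entry) (*pb.ConfState, error) {
	var cc pb.ConfChange
	if err := cc.Unmarshal(ent.Data); err != nil {
		return nil, err
	}
	return n.ApplyConfChange(cc), nil
}
```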

2.2 Open source implementation of Raft

Raft implementations include CoreOS's etcd/raft, Kudu, Consul, LogCabin, CockroachDB, and others.

Etcd, LogCabin, and Consul implement a single Raft ring and cannot be scaled flexibly. Kudu and CockroachDB implement multiple Raft rings. Kudu's Consensus module provides replica consistency for data replication: Kudu calls its data shards tablets, which are horizontal partitions of a Kudu table, and a TabletPeer is one node in a Raft ring. A Tablet corresponds to one Raft ring and to one RaftConsensus instance; a ConsensusRound corresponds to one message being replicated, and multiple ConsensusRounds are synchronized within a ring. CockroachDB, on the other hand, builds multiple Raft rings on top of etcd/raft, maintaining many raft instances, an approach it calls MultiRaft.

etcd's Raft is one of the most complete Raft implementations available. It was among the first to run in production and is well modularized: the Raft core implements most of the protocol and exposes storage and transport interfaces to the outside, which users can implement independently with great flexibility, including their own Snapshot and WAL. This makes etcd's Raft very easy to port and apply. The storage node Pinetree therefore uses the open source etcd Raft implementation to build our prototype system and to ease the later evolution to Multi-Raft.

3. Engineering practice

3.1 Implementing the Raft Storage Interface and Network Transport

Raft storage here means storing the raft log, that is, persisting the log entries. Benchmark tests showed that the performance of the raft-log engine is a major bottleneck for overall OPS. To support rapid replacement of the underlying storage engine more flexibly, we added a pluggable storage engine framework and decoupled the underlying engine. Pinetree wraps a third-party, independent storage interface to adapt it to etcd Raft's log storage interface, roughly as sketched below.
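As a minimal sketch of that adaptation, assuming go.etcd.io/etcd/raft/v3 (signatures may differ across etcd versions) and a hypothetical LogEngine standing in for Pinetree's third-party storage interface:

```go
package raftsketch

import (
	"go.etcd.io/etcd/raft/v3"
	pb "go.etcd.io/etcd/raft/v3/raftpb"
)

// LogEngine is a hypothetical pluggable raft-log engine, standing in for
// Pinetree's third-party storage interface.
type LogEngine interface {
	ReadEntries(lo, hi uint64) ([]pb.Entry, error) // entries in [lo, hi)
	FirstIndex() (uint64, error)
	LastIndex() (uint64, error)
	ReadHardState() (pb.HardState, pb.ConfState, error)
	ReadSnapshot() (pb.Snapshot, error)
}

// EngineStorage adapts a LogEngine to etcd raft's Storage interface.
type EngineStorage struct {
	eng LogEngine
}

var _ raft.Storage = (*EngineStorage)(nil)

func (s *EngineStorage) InitialState() (pb.HardState, pb.ConfState, error) {
	return s.eng.ReadHardState()
}

func (s *EngineStorage) Entries(lo, hi, maxSize uint64) ([]pb.Entry, error) {
	ents, err := s.eng.ReadEntries(lo, hi)
	if err != nil {
		return nil, err
	}
	return limitSize(ents, maxSize), nil
}

func (s *EngineStorage) Term(i uint64) (uint64, error) {
	ents, err := s.eng.ReadEntries(i, i+1)
	if err != nil {
		return 0, err
	}
	if len(ents) == 0 {
		return 0, raft.ErrUnavailable
	}
	return ents[0].Term, nil
}

func (s *EngineStorage) FirstIndex() (uint64, error)    { return s.eng.FirstIndex() }
func (s *EngineStorage) LastIndex() (uint64, error)     { return s.eng.LastIndex() }
func (s *EngineStorage) Snapshot() (pb.Snapshot, error) { return s.eng.ReadSnapshot() }

// limitSize truncates entries so their total size stays under maxSize,
// always keeping at least one entry, as the Storage contract expects.
func limitSize(ents []pb.Entry, maxSize uint64) []pb.Entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(ents[0].Size())
	n := 1
	for ; n < len(ents); n++ {
		size += uint64(ents[n].Size())
		if size > maxSize {
			break
		}
	}
	return ents[:n]
}
```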

gRPC + Protobuf is used for communication, covering both Raft transport and snapshot transport. Heartbeats, log-replication AppendEntries RPCs, and election RequestVote RPCs use simple (unary) gRPC calls, while snapshots use the streaming form, as sketched below.
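A hedged sketch of that split; the RaftTransportClient interface stands in for a protoc-generated gRPC client, and its method names are hypothetical, not etcd's or Pinetree's actual API:

```go
package raftsketch

import (
	"context"

	pb "go.etcd.io/etcd/raft/v3/raftpb"
)

// RaftTransportClient is a hypothetical generated gRPC client.
type RaftTransportClient interface {
	// Unary call used for heartbeats, AppendEntries and RequestVote.
	Send(ctx context.Context, m *pb.Message) error
	// Client-side stream used to push large snapshots in chunks.
	SendSnapshotChunk(ctx context.Context, chunk []byte) error
	CloseSnapshot(ctx context.Context) error
}

// send routes a raft message to the right RPC form: unary for heartbeat,
// AppendEntries and RequestVote traffic, streaming for snapshots.
func send(ctx context.Context, c RaftTransportClient, m pb.Message) error {
	if m.Type != pb.MsgSnap {
		return c.Send(ctx, &m) // simple (unary) gRPC call
	}
	// Snapshot messages can be large: marshal once and stream in chunks.
	data, err := m.Marshal()
	if err != nil {
		return err
	}
	const chunkSize = 1 << 20 // 1 MiB per chunk, an arbitrary choice
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		if err := c.SendSnapshotChunk(ctx, data[off:end]); err != nil {
			return err
		}
	}
	return c.CloseSnapshot(ctx)
}
```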

3.2 Election Issues

Raft can elect a Leader on its own. In practice, however, the drawbacks of autonomous election are obvious; it may lead to the following problems:

1. It is uncontrollable: any node that satisfies the Raft conditions may be elected at random

2. The Leader changes due to intermittent network disconnection

3. Leader changes caused by busy nodes

4. Disruptive nodes

To prevent the storage node Leader from switching between AZs or between nodes, Pinetree has the cluster management module designate the Leader. Pinetree sets electionTimeout to infinity to turn off the automatic elections that Followers might otherwise trigger; leadership is instead controlled entirely by the cluster manager's suggested-primary module, as sketched below.
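A hedged sketch of this setup using etcd raft's knobs (go.etcd.io/etcd/raft/v3; exact field and method names may differ across versions): a very large ElectionTick stands in for the "infinite" electionTimeout, and the cluster manager's suggested-primary decision is turned into an explicit leadership transfer.

```go
package raftsketch

import (
	"context"
	"math"

	"go.etcd.io/etcd/raft/v3"
)

// startManagedNode starts a raft node whose followers effectively never
// time out on their own, so elections only happen when driven externally.
func startManagedNode(id uint64, storage raft.Storage, peers []raft.Peer) raft.Node {
	cfg := &raft.Config{
		ID:              id,
		ElectionTick:    math.MaxInt32, // "infinite" election timeout
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	return raft.StartNode(cfg, peers)
}

// onSuggestedPrimary is invoked when the cluster manager picks a new
// primary; the current Leader hands leadership over explicitly.
func onSuggestedPrimary(ctx context.Context, n raft.Node, currentLead, suggested uint64) {
	n.TransferLeadership(ctx, currentLead, suggested)
}
```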

3.3 Read Consistency Model

A Raft cluster offers default, consistent, and stale read-consistency models, and how read operations are implemented matters for consistency. The common approach is to leave the choice of consistency to the user, who can then pick flexibly according to the actual characteristics of the service.

Consistent gives the strongest read consistency, but every read request must go through the Raft core and is serialized with writes, which puts some pressure on the cluster. Stale gives good performance, but reads may land on nodes with lagging data. In the Pinetree design, the cluster manager maintains the storage node information and tracks the Raft primary/replica status of all nodes. On the one hand, it can load-balance read requests; on the other hand, it can route reads based on AZ affinity and on whether a replica already has the latest log. This maximizes the tradeoff between performance and consistency; a routing sketch follows.
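A hedged sketch of such routing; ReplicaInfo and the consistency levels are illustrative names, not Pinetree's actual API:

```go
package raftsketch

// Consistency selects one of the three read models discussed above.
type Consistency int

const (
	Stale      Consistency = iota // any replica, lowest latency
	Default                       // an up-to-date replica, preferring the caller's AZ
	Consistent                    // always the Raft Leader
)

// ReplicaInfo is the per-node state the cluster manager maintains.
type ReplicaInfo struct {
	NodeID   uint64
	IsLeader bool
	AZ       string
	UpToDate bool // replica has applied the latest log the manager knows of
}

// routeRead picks a replica for a read request according to the chosen
// consistency level and the caller's AZ.
func routeRead(level Consistency, callerAZ string, replicas []ReplicaInfo) *ReplicaInfo {
	var fallback *ReplicaInfo
	for i := range replicas {
		r := &replicas[i]
		switch level {
		case Consistent:
			if r.IsLeader {
				return r
			}
		case Default:
			if r.UpToDate && r.AZ == callerAZ {
				return r // AZ affinity with fresh data
			}
			if r.UpToDate && fallback == nil {
				fallback = r
			}
		case Stale:
			if r.AZ == callerAZ {
				return r
			}
			if fallback == nil {
				fallback = r
			}
		}
	}
	return fallback
}
```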

3.4 Log Problems

There are several issues to consider with Raft's Leader-centric replication:

1. Performance: if the Leader is a slow node, it causes long tail latency

2. Log replication must commit in order

3. The cluster is unavailable for a period of time during a Leader switchover

For problem 3, we use cluster management to prevent Leader switchover as much as possible.

For problem 2: Pinetree logs are similar to InnoDB redo logs and are numbered by LSN, so logs applied to the Pinetree storage layer must be applied in order; log segments cannot be skipped and log holes cannot appear. This requires that logs sent to Raft be sorted. Each WAL log generated by the compute layer corresponds to an LSN, which represents the offset of the log in the file and is monotonically increasing but not contiguous. The order in which WAL logs are generated must therefore match the order in which they are applied to Pinetree storage. To meet this requirement, we added an adaptation layer between the compute layer and the Raft layer that maintains a queue for sorting, and we added the Term to each message so that logs are not reordered across a compute-layer primary/secondary switchover. Raft instructions may also be committed and executed more than once, so the storage layer has to consider idempotence; since Pinetree storage logs are numbered by LSN, a repeated apply can simply be skipped. A sketch of this adaptation layer follows.
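A hedged sketch of that adaptation layer: records are buffered and released to Raft in LSN order, stamped with the compute-layer term, and the apply path skips LSNs that were already applied so a repeated commit is harmless. All names are illustrative:

```go
package raftsketch

import "container/heap"

// walRecord is one compute-layer WAL log as seen by the adaptation layer.
type walRecord struct {
	LSN  uint64 // offset of the log in the WAL file: monotonic, not contiguous
	Term uint64 // compute-layer term, guards against reordering on switchover
	Data []byte
}

// lsnHeap is a min-heap of records ordered by LSN.
type lsnHeap []walRecord

func (h lsnHeap) Len() int            { return len(h) }
func (h lsnHeap) Less(i, j int) bool  { return h[i].LSN < h[j].LSN }
func (h lsnHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *lsnHeap) Push(x interface{}) { *h = append(*h, x.(walRecord)) }
func (h *lsnHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// adapter sits between the compute layer and the Raft layer.
type adapter struct {
	pending    lsnHeap
	term       uint64 // highest compute-layer term seen so far
	appliedLSN uint64 // highest LSN already applied to Pinetree storage
}

// enqueue buffers a record; records stamped with an older term are dropped.
func (a *adapter) enqueue(r walRecord) {
	if r.Term < a.term {
		return // stale record from a previous compute-layer primary
	}
	a.term = r.Term
	heap.Push(&a.pending, r)
}

// flush drains the buffered records in LSN order so they can be proposed
// to Raft without creating log holes at the storage layer.
func (a *adapter) flush() []walRecord {
	out := make([]walRecord, 0, a.pending.Len())
	for a.pending.Len() > 0 {
		out = append(out, heap.Pop(&a.pending).(walRecord))
	}
	return out
}

// apply is idempotent: an LSN that was already applied is skipped, so a
// repeated Raft commit does no harm.
func (a *adapter) apply(r walRecord, applyFn func([]byte) error) error {
	if r.LSN <= a.appliedLSN {
		return nil
	}
	if err := applyFn(r.Data); err != nil {
		return err
	}
	a.appliedLSN = r.LSN
	return nil
}
```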

3.5 Solving the False Master Problem

The compute node needs to obtain certain metadata and must read it from the Leader every time to avoid replica lag. Under network isolation, an old Leader does not step down on its own, so two Leaders can exist at the same time. The false Leader may never learn that it is no longer the real Raft Leader, so both the real Leader and the false one could end up serving reads, which is not acceptable in a system that cannot tolerate stale data.

Running every read request through the Raft protocol would expose false Leaders, but it would seriously hurt system performance.

Pinetree uses a lease instead: before serving a request, a primary Pinetree node conservatively checks whether it still holds the lease at that moment, and only then decides whether it may serve reads. Even if a client reaches a false Pinetree primary, that node cannot serve the request because it no longer holds the lease. A sketch follows.
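A hedged sketch of the lease check; how the lease is granted and renewed (by the cluster manager or a quorum) is out of scope here, and all names are illustrative:

```go
package raftsketch

import (
	"errors"
	"sync"
	"time"
)

var ErrNotLeaseholder = errors.New("not leaseholder: refusing to serve read")

// lease tracks how long this node may keep serving reads as primary.
type lease struct {
	mu        sync.Mutex
	expiresAt time.Time
}

// renew is called when the node is (re)confirmed as primary; the duration
// must be shorter than the time a new primary waits before serving.
func (l *lease) renew(d time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.expiresAt = time.Now().Add(d)
}

// checkRead is the conservative check done before every read: a node
// whose lease has lapsed (e.g. a false primary cut off by a partition)
// refuses the request instead of returning possibly stale data.
func (l *lease) checkRead() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Now().After(l.expiresAt) {
		return ErrNotLeaseholder
	}
	return nil
}
```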

3.6 Performance Problems

Pinetree's performance considerations and optimizations:

1. If the Raft algorithm is used to ensure strong consistency, all reads and writes go through the Leader node, so read performance is roughly that of a single machine, which is not ideal. The optimization is the Leader + lease mode for serving reads, which preserves consistency without hurting performance.

2. Tuning Raft parameters: the number of in-flight messages and the number of transport queues

3. Maximizing asynchrony: for example, after Raft reaches agreement and the entry is persisted, the instruction passed to the state machine is put into a message queue and the call returns immediately; the messages are then parsed and applied asynchronously in parallel.

4. Maximizing batching and caching: for example, write operations within a transaction are cached on the client, then packaged into a batch and sent to the server together with the transaction commit request, as in the sketch below.
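A hedged sketch of such client-side batching; WriteOp, txnClient, and the send callback are illustrative names:

```go
package raftsketch

// WriteOp is one buffered write inside a transaction.
type WriteOp struct {
	Key, Value []byte
}

// txnClient caches the writes of one transaction on the client side.
type txnClient struct {
	pending []WriteOp
}

// Put only buffers locally; nothing goes over the network yet.
func (t *txnClient) Put(key, value []byte) {
	t.pending = append(t.pending, WriteOp{Key: key, Value: value})
}

// Commit packs every buffered write into a single batch and ships it to
// the server together with the commit request, in one round trip.
func (t *txnClient) Commit(send func(batch []WriteOp, commit bool) error) error {
	err := send(t.pending, true)
	t.pending = nil
	return err
}
```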
