Introduction to etcd raft

etcd raft is the most widely used Raft library today. It implements the algorithm described in "In Search of an Understandable Consensus Algorithm" (raft.github.io/raft.pdf) and has been deployed and validated in production in etcd, Kubernetes, Docker Swarm, Cloud Foundry Diego, CockroachDB, TiDB, Project Calico, Flannel and other distributed systems. Traditional Raft libraries tend to be monolithic: they bundle the storage layer, message serialization, the network layer and so on. etcd raft follows a minimalist design and implements only the core Raft algorithm, which makes it more flexible. Networking, log storage, and snapshots are kept out of the core as independent modules that users invoke when needed. etcd ships its own set of companion modules: wal (for log persistence), snap (for snapshots), and MemoryStorage (which holds the current log entries, snapshot, state and more for use by the raft core).



Introduce etcd…

WAL is short for Write Ahead Log. etcd uses the wal module to persist Raft log entries. The entire WAL implementation lives in the wal directory.

WAL data structure

type WAL struct {
	lg *zap.Logger

	dir string // the living directory of the underlay files

	// dirFile is a fd for the wal directory for syncing on Rename
	dirFile *os.File

	metadata []byte           // metadata recorded at the head of each WAL
	state    raftpb.HardState // hardstate recorded at the head of WAL

	start     walpb.Snapshot // snapshot to start reading
	decoder   *decoder       // decoder to decode records
	readClose func() error   // closer for decode reader

	mu      sync.Mutex
	enti    uint64   // index of the last entry saved to the wal
	encoder *encoder // encoder to encode records

	locks []*fileutil.LockedFile // the locked files the WAL holds (the name is increasing)
	fp    *filePipeline
}

The above is the WAL data structure. A WAL instance is obtained by calling the Create() function in wal.go. Create first makes a temporary directory, initializes the related fields, and creates and initializes the first WAL file. Only after all of these steps have succeeded does it rename the temporary directory to its final name, which completes the initialization of the WAL instance.
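The create-then-rename pattern can be sketched as follows. This is a minimal illustration using only the Go standard library, not the etcd implementation: createAtomically, writeSynced, demo, and the file name first.wal are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createAtomically is a hypothetical helper (not the etcd API). It builds the
// directory under dir+".tmp" and renames it into place only after every
// initialization step has succeeded.
func createAtomically(dir, firstFile string, header []byte) error {
	tmp := filepath.Clean(dir) + ".tmp"
	// Remove leftovers from an earlier, interrupted attempt.
	if err := os.RemoveAll(tmp); err != nil {
		return err
	}
	if err := os.MkdirAll(tmp, 0o700); err != nil {
		return err
	}
	if err := writeSynced(filepath.Join(tmp, firstFile), header); err != nil {
		return err
	}
	// The rename is the commit point: the final directory either does not
	// exist or is fully initialized.
	return os.Rename(tmp, dir)
}

// writeSynced creates a new file, writes data, and flushes it to disk.
func writeSynced(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}

// demo runs the helper in a throwaway directory and reports how many files
// ended up in the final directory.
func demo() (int, error) {
	base, err := os.MkdirTemp("", "wal-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(base)

	dir := filepath.Join(base, "wal")
	if err := createAtomically(dir, "first.wal", []byte("header")); err != nil {
		return 0, err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("files in final directory:", n)
}
```

If the process crashes before the rename, only the .tmp directory is left behind, and the next attempt can safely discard it.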

File organization

The WAL saves all of its log files in one specified directory. Log file names end with .wal and have the format <seq>-<index>.wal, where index is the index of the first Raft log entry in the file and seq is the sequence number of the file (in ascending order).

The default size of each file is 64 MB. When a file grows beyond 64 MB, the WAL automatically creates a new log file for subsequent records. Each log file is locked using flock with the parameter LOCK_EX, which is an exclusive lock: only one process can hold it at a time. While the WAL holds the lock on a file, another process that must acquire the lock first cannot delete that file.
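The naming scheme can be illustrated with a pair of helpers. This is a sketch: the function names are illustrative, and the 16-digit zero-padded hexadecimal encoding of seq and index is an assumption about the exact on-disk format rather than something stated above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// walName builds a file name of the form <seq>-<index>.wal.
// Assumption: both numbers are written as 16-digit zero-padded hex, so that
// lexical order of file names matches numeric order.
func walName(seq, index uint64) string {
	return fmt.Sprintf("%016x-%016x.wal", seq, index)
}

// parseWALName extracts seq and index from a name produced by walName.
func parseWALName(name string) (seq, index uint64, err error) {
	if !strings.HasSuffix(name, ".wal") {
		return 0, 0, errors.New("not a wal file: " + name)
	}
	_, err = fmt.Sscanf(name, "%016x-%016x.wal", &seq, &index)
	return seq, index, err
}

func main() {
	name := walName(1, 16)
	fmt.Println(name)

	seq, index, err := parseWALName(name)
	if err != nil {
		panic(err)
	}
	fmt.Println(seq, index)
}
```

Because seq increases with every new file, sorting the directory listing by name yields the files in the order they were written.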

Logical organization of the log

WAL records can hold several types of data.

Type            Description
metadataType    File metadata. Each WAL file has one metadata record
entryType       Raft log entry
stateType       State information
crcType         CRC record
snapshotType    Snapshot record




  • crcType: the first record of each new log file is a crcType record. It is written only at the beginning of a log file and records the final CRC of the previous file.
  • metadataType: the metadataType record follows the crcType record in each new log file and appears only once per log file.
  • stateType: a stateType record is written in two cases:
    1. When the log file is cut automatically, a stateType record is written to the new log file immediately after the metadataType record.
    2. A stateType record is also written whenever the Ready returned by the raft core carries a HardState.
  • snapshotType: the WAL stores only the term and index of a snapshot. The snapshot data itself is stored in dedicated snapshot files. Each time a snapshot is saved, a snapshot record is also written to the WAL. After a snapshot is saved, the log files whose entries all precede the snapshot index are released. The snapshot record stored in the WAL is used to check that a snapshot is valid.

Reading and writing the log

The WAL reads and writes records through its encoder and decoder modules.

Writing the log

The encoder module computes the CRC incrementally and writes it together with the data to a WAL file. The encoder data structure is shown below.

type encoder struct {
	mu sync.Mutex
	bw *ioutil.PageWriter

	crc       hash.Hash32
	buf       []byte // buffer, 1 MB by default, reduces memory allocation pressure
	uint64buf []byte
}

The WAL writes records through the encoder, and the CRC calculation is done in this module as well. To make the data easier to manage, the WAL aligns every record in the log to 8 bytes (the padding is added automatically). The log writing process is as follows.




The CRC in the figure is computed incrementally over all previously written record data. The WAL only handles writing and does not check whether the index of an entry is duplicated. When the node restarts, however, duplicated entries are filtered out automatically as the log is read back.
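The two ideas in this section, a running CRC and 8-byte alignment, can be sketched in a few lines. The frame layout used here (4-byte length, 4-byte CRC, payload, zero padding) is a simplification invented for the example and is not the etcd on-disk format; recordWriter, padLen, and checksumAll are hypothetical names, and the choice of the Castagnoli polynomial is an assumption.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// Assumption: CRC-32 with the Castagnoli polynomial.
var table = crc32.MakeTable(crc32.Castagnoli)

// recordWriter is a simplified, hypothetical stand-in for the WAL encoder.
type recordWriter struct {
	buf bytes.Buffer
	crc uint32 // running CRC over every payload written so far
}

// padLen returns how many zero bytes bring n up to a multiple of 8.
func padLen(n int) int {
	return (8 - n%8) % 8
}

// write appends one frame: 4-byte length, 4-byte running CRC, the payload,
// and zero padding so that the next frame starts on an 8-byte boundary.
func (w *recordWriter) write(payload []byte) {
	// Extend the running CRC with the new payload instead of starting over.
	w.crc = crc32.Update(w.crc, table, payload)

	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], w.crc)
	w.buf.Write(hdr[:])
	w.buf.Write(payload)
	w.buf.Write(make([]byte, padLen(len(payload))))
}

// checksumAll computes the CRC of the concatenated parts in one pass. It is
// used only to show that the incremental CRC gives the same result.
func checksumAll(parts ...[]byte) uint32 {
	return crc32.Checksum(bytes.Join(parts, nil), table)
}

func main() {
	w := &recordWriter{}
	w.write([]byte("hello"))
	w.write([]byte("raft-log"))

	fmt.Println("aligned:", w.buf.Len()%8 == 0)
	fmt.Println("crc matches one-pass crc:",
		w.crc == checksumAll([]byte("hello"), []byte("raft-log")))
}
```

Since each stored CRC depends on everything written before it, a corrupted or missing record invalidates the checksum of every later record, not just its own.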

Log segmentation

The WAL splits the log automatically. When the log data in the current file exceeds the default value of 64 MB, a new file is created and subsequent records are written to it. Splitting is implemented by the cut method in wal.go, and cut is only triggered from within the WAL's Save method. The first record of the new file is the final CRC of the previous WAL file.
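A minimal in-memory model of this behavior is sketched below. It is an illustration under stated assumptions, not the etcd code: segment and segmentedLog are hypothetical types, the size limit is a parameter instead of 64 MB, and a segment is a slice in memory rather than a file.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var table = crc32.MakeTable(crc32.Castagnoli)

// segment is an in-memory stand-in for one WAL file.
type segment struct {
	name    string
	prevCRC uint32 // first record: running CRC at the end of the previous segment
	entries [][]byte
	size    int
}

// segmentedLog is a hypothetical sketch of a log that cuts itself into segments.
type segmentedLog struct {
	limit    int
	crc      uint32
	segments []*segment
}

func newSegmentedLog(limit int) *segmentedLog {
	l := &segmentedLog{limit: limit}
	l.cut(0)
	return l
}

// cut starts a new segment that begins with the running CRC of everything
// written before it.
func (l *segmentedLog) cut(firstIndex uint64) {
	l.segments = append(l.segments, &segment{
		name:    fmt.Sprintf("%016x-%016x.wal", len(l.segments), firstIndex),
		prevCRC: l.crc,
	})
}

// save appends one entry. The size check, and therefore cut, happens only here.
func (l *segmentedLog) save(index uint64, data []byte) {
	cur := l.segments[len(l.segments)-1]
	cur.entries = append(cur.entries, data)
	cur.size += len(data)
	l.crc = crc32.Update(l.crc, table, data)
	if cur.size > l.limit {
		l.cut(index + 1) // the next entry is the first one in the new segment
	}
}

func main() {
	l := newSegmentedLog(10)
	l.save(1, []byte("aaaaaa"))
	l.save(2, []byte("bbbbbb")) // pushes the segment over the limit
	l.save(3, []byte("cccccc"))
	for _, s := range l.segments {
		fmt.Println(s.name, len(s.entries))
	}
}
```

Carrying the previous CRC into the new segment lets a reader continue verifying the checksum chain across file boundaries.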

Log compaction

The WAL does not compact the log automatically. Only MemoryStorage provides a Compact method for the log, and the user must call it explicitly.

file_pipeline module

Before creating a new file, the WAL first creates a .tmp file and renames it once all operations are complete. The WAL uses the file_pipeline module to start a goroutine in the background that keeps a temporary file ready for use at any time, which avoids the cost of creating a temporary file on demand.
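The idea of preparing files in the background can be sketched as follows. This is a simplified illustration rather than the etcd module: the type, its methods, the N.tmp naming, and the demo function are all assumptions made for the example, and the real module may also preallocate and lock the files.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// filePipeline is a hypothetical sketch: a background goroutine always keeps
// one freshly created temporary file waiting on a channel.
type filePipeline struct {
	dir   string
	filec chan *os.File
	errc  chan error
	donec chan struct{}
}

func newFilePipeline(dir string) *filePipeline {
	fp := &filePipeline{
		dir:   dir,
		filec: make(chan *os.File),
		errc:  make(chan error, 1),
		donec: make(chan struct{}),
	}
	go fp.run()
	return fp
}

// run creates the next temporary file ahead of time, then blocks until a
// caller takes it or the pipeline is closed.
func (fp *filePipeline) run() {
	defer close(fp.errc)
	for n := 0; ; n++ {
		name := filepath.Join(fp.dir, fmt.Sprintf("%d.tmp", n))
		f, err := os.OpenFile(name, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
		if err != nil {
			fp.errc <- err
			return
		}
		select {
		case fp.filec <- f:
		case <-fp.donec:
			f.Close()
			os.Remove(name) // discard the unused spare file
			return
		}
	}
}

// open returns a file that was already created in the background.
func (fp *filePipeline) open() (*os.File, error) {
	select {
	case f := <-fp.filec:
		return f, nil
	case err := <-fp.errc:
		return nil, err
	}
}

// close stops the goroutine and waits for it to clean up.
func (fp *filePipeline) close() error {
	close(fp.donec)
	return <-fp.errc
}

// demo takes two prepared files, renames each to a final name, and counts
// what is left in the directory.
func demo() (wal, tmp int, err error) {
	dir, err := os.MkdirTemp("", "pipeline-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	fp := newFilePipeline(dir)
	for i := 0; i < 2; i++ {
		f, err := fp.open()
		if err != nil {
			return 0, 0, err
		}
		f.Close()
		if err := os.Rename(f.Name(), filepath.Join(dir, fmt.Sprintf("%d.wal", i))); err != nil {
			return 0, 0, err
		}
	}
	if err := fp.close(); err != nil {
		return 0, 0, err
	}
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, 0, err
	}
	for _, e := range entries {
		switch filepath.Ext(e.Name()) {
		case ".wal":
			wal++
		case ".tmp":
			tmp++
		}
	}
	return wal, tmp, nil
}

func main() {
	wal, tmp, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("wal files:", wal, "leftover tmp files:", tmp)
}
```

The unbuffered channel is what bounds the work: the goroutine prepares exactly one spare file and then waits, so the caller only pays for a rename when it needs a new segment.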

Introduction to etcd snap

IO/etCD/ETCDse…

File organization

In the snap module, each snapshot is stored in a .snap file. The file name format is <term>-<index>.snap, where term and index are the Raft term and index of the snapshot. The storage structure of each snapshot is as follows:



Details

The system can hold multiple snapshots. The snap module uses the Snapshotter struct to manage them centrally.

type Snapshotter struct {
	lg  *zap.Logger
	dir string
}

The code above is the Snapshotter struct. It is mainly used to save and load snapshots.

It is up to the user to decide what a snapshot contains. In the official raft example, for instance, the KV data is marshaled into the snapshot.

func (s *kvstore) getSnapshot() ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return json.Marshal(s.kvStore)
}
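To show the snapshot round trip end to end, here is a self-contained sketch built around the method above. The kvstore field types (a map[string]string guarded by a sync.RWMutex) and the recoverFromSnapshot method are assumptions added for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// kvstore is a minimal key-value store. The field types are assumptions.
type kvstore struct {
	mu      sync.RWMutex
	kvStore map[string]string
}

// getSnapshot serializes the whole store as JSON.
func (s *kvstore) getSnapshot() ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return json.Marshal(s.kvStore)
}

// recoverFromSnapshot replaces the store contents with those in the snapshot.
func (s *kvstore) recoverFromSnapshot(snapshot []byte) error {
	var store map[string]string
	if err := json.Unmarshal(snapshot, &store); err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.kvStore = store
	return nil
}

func main() {
	src := &kvstore{kvStore: map[string]string{"a": "1"}}
	snap, err := src.getSnapshot()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(snap)) // prints {"a":"1"}

	dst := &kvstore{}
	if err := dst.recoverFromSnapshot(snap); err != nil {
		panic(err)
	}
	fmt.Println(dst.kvStore["a"])
}
```

The raft library treats the snapshot as opaque bytes, so any encoding works as long as the application can restore its own state from it.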

When to take a snapshot

In etcd raft the user decides when to take a snapshot. In the official etcd example, snapshots are taken by maybeTriggerSnapshot(), which is called after each result returned by the node's Ready() method has been handled. If the difference between the current index and the index of the previous snapshot is greater than 10000, a new snapshot is created.

Introduction to etcd MemoryStorage

MemoryStorage holds the working data of a Raft node in memory, including entries and snapshots. The user stores data in MemoryStorage, and the raft node reads from the same MemoryStorage: the entries and snapshots that are sent out are all read from it.

// MemoryStorage implements the Storage interface backed by an
// in-memory array.
type MemoryStorage struct {
	// Protects access to all fields. Most methods of MemoryStorage are
	// run on the raft goroutine, but Append() is run on an application
	// goroutine.
	sync.Mutex

	hardState pb.HardState
	snapshot  pb.Snapshot
	// ents[i] has raft log position i+snapshot.Metadata.Index
	ents []pb.Entry
}

MemoryStorage stores the latest entries (including uncommitted ones), the snapshot, and the state. Users must store data in MemoryStorage when they receive the corresponding data from other nodes.
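The index mapping noted in the struct comment, and the effect of compaction on it, can be modeled with a reduced sketch. This is not the library type: memoryLog, entry, and the method names are hypothetical, the append here accepts only entries that directly follow the last index, and the use of a dummy first element to mark the compaction point is an assumption of this model.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

var (
	errCompacted   = errors.New("requested index was compacted")
	errUnavailable = errors.New("requested index is not available")
)

// memoryLog is a reduced, hypothetical model of an in-memory raft log.
// ents[0] is a dummy entry marking the compaction point, so ents[i] holds
// the entry at log position ents[0].Index + i.
type memoryLog struct {
	mu   sync.Mutex
	ents []entry
}

func newMemoryLog() *memoryLog { return &memoryLog{ents: make([]entry, 1)} }

func (m *memoryLog) firstIndex() uint64 { return m.ents[0].Index + 1 }
func (m *memoryLog) lastIndex() uint64  { return m.ents[0].Index + uint64(len(m.ents)) - 1 }

// append adds entries that directly follow the current last index.
func (m *memoryLog) append(es ...entry) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, e := range es {
		if e.Index != m.lastIndex()+1 {
			return errors.New("entry does not follow the last index")
		}
		m.ents = append(m.ents, e)
	}
	return nil
}

// entryAt translates a log position into a slice offset.
func (m *memoryLog) entryAt(i uint64) (entry, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if i < m.firstIndex() {
		return entry{}, errCompacted
	}
	if i > m.lastIndex() {
		return entry{}, errUnavailable
	}
	return m.ents[i-m.ents[0].Index], nil
}

// compact discards everything up to and including compactIndex. The entry at
// compactIndex survives only as the new dummy entry.
func (m *memoryLog) compact(compactIndex uint64) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	offset := m.ents[0].Index
	if compactIndex <= offset {
		return errCompacted
	}
	if compactIndex > m.lastIndex() {
		return errUnavailable
	}
	i := compactIndex - offset
	ents := make([]entry, 1, len(m.ents)-int(i))
	ents[0].Index = m.ents[i].Index
	ents[0].Term = m.ents[i].Term
	m.ents = append(ents, m.ents[i+1:]...)
	return nil
}

func main() {
	m := newMemoryLog()
	for i := uint64(1); i <= 5; i++ {
		if err := m.append(entry{Index: i, Term: 1}); err != nil {
			panic(err)
		}
	}
	if err := m.compact(3); err != nil {
		panic(err)
	}
	fmt.Println(m.firstIndex(), m.lastIndex())
}
```

Copying the surviving entries into a new slice during compaction lets the old backing array be garbage collected, which is the point of compacting an in-memory log.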

Author: Feng Jie