Author: siddontang

This article will introduce in detail how TiKV processes read and write requests. Through this document, readers will learn how TiKV stores the data changes contained in a write request and how it reads the corresponding data back.

This article is divided into two parts. In the first part, we introduce some basic knowledge to help you understand the process described later.

Basic Knowledge

Raft

TiKV uses the Raft consensus algorithm to ensure data safety, and by default uses three replicas, which together form a Raft Group.

When a Client needs to write some data, it sends the operation to the Raft Leader; in TiKV this is called Propose. The Leader encodes the operation into an entry and writes it into its own Raft Log, which is called Append.

The Leader also replicates the entry to the other Followers through the Raft algorithm, which we call Replicate. After a Follower receives the entry, it also Appends it and then tells the Leader that the Append succeeded.

When the Leader finds that an entry has been appended by a majority of nodes, it considers the entry Committed. The Leader then decodes the operations in the entry, executes them, and applies them to the state machine, which is called Apply.
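To make the flow concrete, here is a minimal Rust sketch of the Propose -> Append -> Replicate -> Commit -> Apply steps described above. The structs and the quorum check are illustrative assumptions, not TiKV's actual code:

// Toy model of the write path: one Leader and two Followers form a 3-replica group.
#[derive(Clone, Debug)]
struct Entry {
    index: u64,
    data: Vec<u8>, // the encoded operation
}

struct Peer {
    log: Vec<Entry>,
}

impl Peer {
    fn append(&mut self, entry: Entry) {
        self.log.push(entry);
    }
}

fn main() {
    let mut leader = Peer { log: vec![] };
    let mut followers = vec![Peer { log: vec![] }, Peer { log: vec![] }];

    // Propose: the client operation is encoded into an entry.
    let entry = Entry { index: 1, data: b"a = 1".to_vec() };

    // Append: the Leader writes the entry into its own Raft Log.
    leader.append(entry.clone());

    // Replicate: the entry is sent to the Followers, which also Append it.
    let mut acks: usize = 1; // the Leader counts itself
    for f in &mut followers {
        f.append(entry.clone());
        acks += 1;
    }

    // Commit + Apply: once a majority has appended the entry, it is committed,
    // and the Leader can decode and apply the operation to the state machine.
    let total = 1 + followers.len();
    if acks > total / 2 {
        println!("entry {} committed, applying {:?}", entry.index, String::from_utf8_lossy(&entry.data));
    }
}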

In TiKV, we provide Lease Read. Read requests are sent directly to the Leader. If the Leader confirms that its lease has not expired, it serves the Read directly, so the Read does not need to go through the Raft process. If the Leader finds that the lease has expired, it has to go through Raft once to renew the lease before it can serve the Read.
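Below is a simplified Rust sketch of this Lease Read decision. The lease duration and the structure are assumptions for illustration; TiKV's real implementation is more involved:

use std::time::{Duration, Instant};

struct Leader {
    lease_expires_at: Instant,
}

impl Leader {
    fn lease_valid(&self) -> bool {
        Instant::now() < self.lease_expires_at
    }

    fn read(&mut self, key: &str) -> String {
        if !self.lease_valid() {
            // Lease expired: go through Raft once (elided here) to renew it first.
            self.lease_expires_at = Instant::now() + Duration::from_secs(10);
        }
        // Lease is valid: serve the read directly without another Raft round.
        format!("value of {}", key)
    }
}

fn main() {
    let mut leader = Leader { lease_expires_at: Instant::now() };
    println!("{}", leader.read("a"));
}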

Multi Raft

Because a single Raft Group can only handle a limited amount of data, we split the data into multiple Raft Groups, which we call Regions. In other words, we sort the data keys in byte order, treating the whole key space as an infinite sorted map, and then divide it into consecutive key ranges, each of which is treated as a Region.

There can be no gap between two neighboring Regions; that is, the end key of one Region is the start key of the next Region. A Region's range uses the left-closed, right-open form [start, end): the start key belongs to the Region, while the end key belongs to the next Region.

A TiKV Region has a maximum size limit. When this threshold is exceeded, the Region is split into two Regions, for example, [a, b) -> [a, ab) + [ab, b). Of course, if a Region has no data or only a small amount of data, it can also be merged with a neighboring Region to form a larger Region, for example, [a, ab) + [ab, b) -> [a, b).
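The following Rust sketch models a Region's [start, end) range and the split example above; the struct and helper names are hypothetical, for illustration only:

#[derive(Debug, Clone)]
struct Region {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive: this key already belongs to the next Region
}

impl Region {
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start.as_slice() && key < self.end.as_slice()
    }

    // Split [start, end) into [start, split_key) + [split_key, end).
    fn split(&self, split_key: &[u8]) -> (Region, Region) {
        (
            Region { start: self.start.clone(), end: split_key.to_vec() },
            Region { start: split_key.to_vec(), end: self.end.clone() },
        )
    }
}

fn main() {
    let r = Region { start: b"a".to_vec(), end: b"b".to_vec() };
    let (left, right) = r.split(b"ab"); // [a, b) -> [a, ab) + [ab, b)
    assert!(left.contains(b"a") && !left.contains(b"ab"));
    assert!(right.contains(b"ab") && !right.contains(b"b"));
    println!("{:?} {:?}", left, right);
}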

Percolator

For keys in the same Region, the Raft consensus protocol ensures that operations on them are consistent. But if we need to operate on multiple pieces of data at the same time and they fall in different Regions, we need distributed transactions to guarantee the consistency of the operation.

For example, suppose we need to set a = 1 and b = 2 in the same operation, but a and b belong to different Regions. After the operation finishes, either both a and b must have been modified successfully, or neither of them. It must not happen that a is modified while b is not, or that b is modified while a is not.

The most common approach for distributed transactions is to use two-phase commit, also known as 2PC, but traditional 2PC requires a coordinator, and we need mechanisms to ensure that the coordinator is highly available. Here, TiKV references Google’s Percolator and optimizes 2PC to provide distributed transaction support.

The principle of Percolator is fairly complex; a few points need attention:

First, Percolator needs a Timestamp Oracle (TSO) service to allocate global timestamps, which are monotonically increasing over time and globally unique. Any transaction gets a start timestamp (startTS) when it begins and a commit timestamp (commitTS) when it commits.
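As a toy illustration, a TSO can be thought of as a single counter handing out monotonically increasing timestamps. The sketch below is only a conceptual model; PD's real TSO is more sophisticated (it combines physical and logical time):

use std::sync::atomic::{AtomicU64, Ordering};

static TSO: AtomicU64 = AtomicU64::new(1);

// Successive calls return strictly increasing, unique timestamps.
fn get_ts() -> u64 {
    TSO.fetch_add(1, Ordering::SeqCst)
}

fn main() {
    let start_ts = get_ts();  // a transaction's startTS
    let commit_ts = get_ts(); // later, its commitTS
    assert!(commit_ts > start_ts);
    println!("startTS = {}, commitTS = {}", start_ts, commit_ts);
}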

Percolator uses three column families (CFs): Lock, Data, and Write. When a key-value pair is written, the lock for the key is placed in the Lock CF, and the actual value is put into the Data CF. If the commit succeeds, the corresponding commit information is put into the Write CF.

When a key is stored in the Data CF and the Write CF, the corresponding timestamp is appended to the end of the key: startTS in the Data CF, and commitTS in the Write CF.
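In the examples below, we write this suffix in a human-readable form like a_10. Here is a literal Rust sketch of that layout (TiKV's real key encoding is binary and more elaborate):

// Data CF keys use the startTS as suffix, Write CF keys use the commitTS.
fn key_with_ts(key: &str, ts: u64) -> String {
    format!("{}_{}", key, ts)
}

fn main() {
    assert_eq!(key_with_ts("a", 10), "a_10"); // Data CF key for startTS = 10
    assert_eq!(key_with_ts("a", 11), "a_11"); // Write CF key for commitTS = 11
}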

Let’s say we need to write a = 1. We first get a startTS from TSO, such as 10, then enter the PreWrite phase of Percolator and write the following to the Lock CF and Data CF:

Lock CF: W a = lock

Data CF: W a_10 = value

We will use W for Write, R for Read, D for Delete, and S for Seek.

After PreWrite succeeds, we get a commitTS from TSO, such as 11, and then write:

Lock CF: D a

Write CF: W a_11 = 10

After the Commit succeeds, a key-value is recorded in both the Data CF and the Write CF. The Data CF records the actual Data, and the Write CF records the startTS.
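The PreWrite and Commit writes above can be modeled with a toy in-memory store, using one map per CF and the key_ts notation from the examples. This is a conceptual sketch, not TiKV's transaction code:

use std::collections::BTreeMap;

struct Store {
    lock_cf: BTreeMap<String, String>,
    data_cf: BTreeMap<String, String>,
    write_cf: BTreeMap<String, String>,
}

impl Store {
    // PreWrite: put the lock into the Lock CF and the value into the Data CF.
    fn prewrite(&mut self, key: &str, value: &str, start_ts: u64) {
        self.lock_cf.insert(key.to_string(), format!("lock@{}", start_ts));
        self.data_cf.insert(format!("{}_{}", key, start_ts), value.to_string());
    }

    // Commit: delete the lock and record the startTS under key_commitTS in the Write CF.
    fn commit(&mut self, key: &str, start_ts: u64, commit_ts: u64) {
        self.lock_cf.remove(key);
        self.write_cf.insert(format!("{}_{}", key, commit_ts), start_ts.to_string());
    }
}

fn main() {
    let mut store = Store {
        lock_cf: BTreeMap::new(),
        data_cf: BTreeMap::new(),
        write_cf: BTreeMap::new(),
    };
    store.prewrite("a", "1", 10); // Lock CF: W a = lock, Data CF: W a_10 = 1
    store.commit("a", 10, 11);    // Lock CF: D a, Write CF: W a_11 = 10
    println!("{:?}\n{:?}\n{:?}", store.lock_cf, store.data_cf, store.write_cf);
}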

When we want to read data, we also get a startTS from TSO, such as 12, and read:

Lock CF: R a

Write CF: S a_12 -> a_11 = 10

Data CF: R a_10

In the Read process, we first check whether there is a lock for the key in the Lock CF. If there is, the Read fails. If not, we seek the latest committed version in the Write CF; here we find 11 and get the corresponding startTS, which is 10. We then combine the key with that startTS and read the actual data from the Data CF.
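In the same toy style, the read path can be sketched as: check the Lock CF, find the newest commit record in the Write CF that is not newer than the reader's startTS, then read from the Data CF with the recorded startTS. This is again an assumed model; the real lookup is a seek over sorted keys, not a loop over timestamps:

use std::collections::BTreeMap;

fn get(
    lock_cf: &BTreeMap<String, String>,
    write_cf: &BTreeMap<String, u64>,
    data_cf: &BTreeMap<String, String>,
    key: &str,
    read_ts: u64,
) -> Option<String> {
    // 1. If the key is locked, the read cannot proceed (a real client would retry or resolve the lock).
    if lock_cf.contains_key(key) {
        return None;
    }
    // 2. Find the latest committed version whose commitTS <= read_ts and take its startTS.
    let start_ts = (1..=read_ts)
        .rev()
        .find_map(|ts| write_cf.get(&format!("{}_{}", key, ts)))?;
    // 3. Read the actual value from the Data CF using key_startTS.
    data_cf.get(&format!("{}_{}", key, start_ts)).cloned()
}

fn main() {
    let lock_cf = BTreeMap::new();
    let mut write_cf = BTreeMap::new();
    let mut data_cf = BTreeMap::new();
    data_cf.insert("a_10".to_string(), "1".to_string()); // Data CF: a_10 = 1
    write_cf.insert("a_11".to_string(), 10);             // Write CF: a_11 = 10
    // Reading with startTS = 12 finds a_11 in the Write CF and then a_10 in the Data CF.
    assert_eq!(get(&lock_cf, &write_cf, &data_cf, "a", 12), Some("1".to_string()));
}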

The above is only a brief introduction to the Percolator read and write process; the actual process is much more complicated than this.

RocksDB

TiKV stores its data in RocksDB, which is a key-value storage engine, so for TiKV, any data is eventually converted into one or more key-values and stored in RocksDB.

Each TiKV contains two instances of RocksDB, one for Raft Log, which we’ll call Raft RocksDB, and the other for the actual user data, which we’ll call KV RocksDB.

A TiKV instance has multiple Regions. In Raft RocksDB, we prefix the key with the Region ID and then append the Raft Log ID to uniquely identify a Raft Log entry. For example, if we have two Regions with IDs 1 and 2, the Raft Logs are stored in RocksDB roughly as follows:

1_1 -> Log {a = 1}
1_2 -> Log {a = 2}
...
1_N -> Log {a = N}
2_1 -> Log {b = 2}
2_2 -> Log {b = 3}
...
2_N -> Log {b = N}
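A tiny Rust sketch of this region_id + log_index scheme (TiKV's real keys use a binary, fixed-width encoding rather than the readable string shown here):

fn raft_log_key(region_id: u64, log_index: u64) -> String {
    // Prefix with the Region ID so logs of different Regions never collide.
    format!("{}_{}", region_id, log_index)
}

fn main() {
    assert_eq!(raft_log_key(1, 1), "1_1");
    assert_eq!(raft_log_key(2, 1), "2_1");
}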

Since we split keys by range, in KV RocksDB we use the key directly, similar to the following:

a -> N

b -> N

There are two keys here, a and b, without any prefix to distinguish them.

RocksDB supports column families, which can directly correspond to Percolator's CFs. In TiKV, we use the Default CF in RocksDB to correspond to Percolator's Data CF, while the Lock and Write CFs keep the same names.
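As a rough illustration, the sketch below opens a RocksDB instance with column families named default, lock, and write, using the community rocksdb crate. This is an assumption for demonstration only; TiKV has its own engine layer and configuration:

use rocksdb::{Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Percolator's Data CF maps to the "default" CF; Lock and Write keep their names.
    let db = DB::open_cf(&opts, "/tmp/kv-db", ["default", "lock", "write"])?;

    let lock_cf = db.cf_handle("lock").expect("lock CF exists");
    let default_cf = db.cf_handle("default").expect("default CF exists");

    db.put_cf(lock_cf, b"a", b"lock")?;        // Lock CF: W a = lock
    db.put_cf(default_cf, b"a_10", b"value")?; // Data CF: W a_10 = value
    Ok(())
}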

PD

TiKV reports all of its Region information to PD, so PD has the Region information of the entire cluster and, of course, a routing table for Regions.

When a Client needs to operate on the data of a key, it first asks PD which Region the key belongs to. For example, for key a, PD knows it belongs to Region 1 and returns the information of Region 1 to the Client, including how many replicas it has, which replica is currently the Leader, and which TiKV the Leader replica is on.

The Client caches the Region information locally to speed up subsequent operations. However, the Raft Leader of a Region may change, and a Region may split or merge; when that happens, the Client finds out that its cache is invalid and fetches the latest information from PD again.
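This routing flow can be sketched like this in Rust (the struct names and the PD lookup are hypothetical stand-ins, not the real PD protocol):

use std::collections::HashMap;

#[derive(Clone, Debug)]
struct RegionInfo {
    region_id: u64,
    leader_store: u64, // which TiKV store currently holds the Leader replica
}

struct Client {
    cache: HashMap<Vec<u8>, RegionInfo>,
}

impl Client {
    // Serve from the local cache when possible; otherwise ask PD and cache the result.
    fn locate(&mut self, key: &[u8], pd: &impl Fn(&[u8]) -> RegionInfo) -> RegionInfo {
        self.cache
            .entry(key.to_vec())
            .or_insert_with(|| pd(key))
            .clone()
    }

    // Called when a request fails because the Leader moved or the Region split/merged.
    fn invalidate(&mut self, key: &[u8]) {
        self.cache.remove(key);
    }
}

fn main() {
    // Hypothetical PD answer: key "a" belongs to Region 1, Leader on store 3.
    let pd = |_: &[u8]| RegionInfo { region_id: 1, leader_store: 3 };
    let mut client = Client { cache: HashMap::new() };
    println!("{:?}", client.locate(b"a", &pd)); // asks PD, then caches
    client.invalidate(b"a");                    // cache is stale after a Leader change
    println!("{:?}", client.locate(b"a", &pd)); // asks PD again
}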

PD also provides the global timestamp service. In the Percolator transaction model, we know that both starting and committing a transaction need a timestamp, and these are uniformly allocated by PD.

That is all for the basic knowledge. Next, we will introduce TiKV's read and write processes in detail. Stay tuned!