Author: Tang Liu

Primer

It is not easy to build a distributed key-value store. We need to consider many questions, the first being what functionality our system should provide, for example:

  • Consistency: Do we need to guarantee linearizability across the entire system, or can we tolerate short periods of inconsistency and only support eventual consistency?

  • Stability: Can we guarantee that the system runs stably 24/7? Is the availability target four nines or five nines? When machines fail or other disasters strike, can the system recover automatically?

  • Scalability: As data keeps growing, can the system rebalance data automatically by adding machines, without affecting external services?

  • Distributed transactions: Do we need to support distributed transactions, and what transaction isolation levels are required?

The questions above need to be considered at the very beginning of system design, because they become the design goals of the whole system. To implement these features, we then need to decide which implementation approaches to adopt and weigh the trade-offs in each area.

Below, I will use the distributed key-value store TiKV, which we developed, as a practical example to illustrate the choices we made and how we implemented them.

TiKV

TiKV is a distributed key-value store. It is developed in Rust and uses the Raft consensus protocol to guarantee strong data consistency and stability. System scalability is achieved through Raft's configuration change (membership change) mechanism.

TiKV provides a basic KV API, namely the usual Get, Set, Delete, and Scan APIs. TiKV also provides a Transaction API that supports ACID transactions: you can start a transaction with Begin, perform operations on keys, and commit the transaction with Commit. TiKV supports the SI (Snapshot Isolation) and SSI (Serializable Snapshot Isolation) transaction isolation levels to cover users' different business scenarios.
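To make the transaction flow concrete, here is a minimal sketch of what using such a transactional KV API could look like. All names here (`TxnClient`, `begin`, `get`, `put`, `commit`) are hypothetical stand-ins for illustration only, not TiKV's actual client API.

```rust
// Hypothetical transactional KV client, for illustration only.
struct TxnClient;
struct Txn;

impl TxnClient {
    fn begin(&self) -> Txn { Txn }
}

impl Txn {
    fn get(&self, _key: &[u8]) -> Option<Vec<u8>> { None }
    fn put(&mut self, _key: &[u8], _value: &[u8]) {}
    fn commit(self) -> Result<(), String> { Ok(()) }
}

fn main() {
    let client = TxnClient;
    let mut txn = client.begin();  // acquire a start timestamp
    let _old = txn.get(b"k1");     // reads see a consistent snapshot
    txn.put(b"k1", b"v1");         // writes are buffered locally
    txn.commit().unwrap();         // two-phase commit runs here
}
```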

Rust

After planning TiKV's features, we started development. The first question we faced was which language to use. At the time, we had several options in front of us:

  • Go. Go is the language our team knows best, and the goroutine and channel mechanisms it provides are naturally suited to developing large-scale distributed systems. But that flexibility and convenience also come with burdens. The first is GC: although Go's GC keeps improving, there will always be short pauses, and goroutine scheduling has switching overhead as well, both of which can lead to higher request latency.

  • Java. Many distributed systems are built with Java, but Java also has GC and other overhead. At the same time, our team has no Java development experience, so we did not adopt it.

  • C++. C++ can be considered a synonym for high-performance systems, but few people on our team are proficient in it, so developing a large C++ project would not have been easy for us. Although modern C++ can greatly reduce the risk of data races, dangling pointers, and the like, mistakes are still possible.

After ruling out the mainstream languages above, we found that to develop TiKV we needed a language with the following features:

  • Static typing, in order to maximize performance.

  • No GC, with fully manual control of memory.

  • Memory safety, to avoid problems such as dangling pointers and memory leaks.

  • Thread safety, to avoid problems such as data races.

  • Package management, so that we can easily use third-party libraries.

  • Efficient C bindings, because we may also use some C libraries, so there should be no overhead when interacting with C.

In the end, we decided to use Rust. Rust is a systems programming language that provides the features listed above, but it was also a risky choice for us, for two main reasons:

  1. Our team had no Rust development experience, and learning Rust takes time: Rust has a very steep learning curve.

  2. The absence of basic network libraries. Although Rust had already released 1.0 by then, we found that many basic libraries were missing: beyond the low-level networking library mio, there was no usable RPC framework, and HTTP support was immature.

But we decided to use Rust anyway. For the first point, our team spent nearly a month learning Rust and fighting the Rust compiler; for the second point, we simply started writing everything ourselves.

Fortunately, once we got past the learning phase, we found that developing TiKV in Rust was incredibly efficient, which is why we were able to build TiKV in a short time and get it running in production.

Consistency protocol

For a distributed system, CAP is a problem we have to consider. Since P (partition tolerance) is a given, we have to choose between C (consistency) and A (availability).

When we designed TiKV, we decided to fully guarantee data safety, so we naturally chose C. But in fact we did not give up A completely: after all, machine failures such as power outages are not that frequent, and we only need to guarantee HA (high availability), that is, four or five nines of availability.

Having chosen C, the next thing we needed to consider was which distributed consensus algorithm to use. The popular choices are Paxos and Raft, and Raft naturally became our first choice because it is simple, easy to understand, and has many open-source implementations to reference.

For the Raft implementation, we ported etcd's Raft directly. etcd has been used by a large number of companies in production environments, so the quality of its Raft library is assured. Although etcd is implemented in Go, its Raft library is written in a C-like style, so it was very easy to translate it into Rust. During the translation, we also fixed some bugs and added some features to etcd Raft to make it more robust and easier to use.

For now the Raft code still lives in the TiKV project, but we will soon split it out into a standalone library so that people can use Raft in their own Rust projects.

Using Raft not only ensures data consistency, but also allows the system to scale horizontally through Raft's configuration change mechanism, which we will explain in more detail in a later article.

The storage engine

With a distributed consensus protocol in place, the next thing to consider is data storage. In TiKV, we store the Raft log, and then apply the actual client requests contained in the Raft log to the state machine.

First, let us look at the state machine. It stores the user's actual data, and these key-values may be completely random. To handle random insertions efficiently, we naturally considered the LSM-tree model that is widely used today, and under this model RocksDB can be considered the best choice at this stage.

RocksDB is a high-performance key-value store built by the Facebook team on top of LevelDB. RocksDB provides many configuration options that you can tune for different hardware environments. There is a joke that RocksDB has so many configuration options that even the RocksDB team members do not know what all of them mean.

How we use RocksDB in TiKV, how we optimize it, and the features and bug fixes we have contributed to it will be explained in more detail in a later article.

For the Raft log, since the log index is strictly monotonically increasing (after Log 1, the next log must be Log 2), log insertion can be considered sequential. The most common approach is to write your own segment files, but we still use RocksDB, because RocksDB also delivers very high performance for sequential writes and meets our needs. However, we do not rule out switching to our own engine later.

Because RocksDB provides a C API, it can be used directly from Rust; you can use RocksDB in your own Rust projects through the rust-rocksdb library.
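As a minimal sketch, the snippet below uses the `rocksdb` crate (the rust-rocksdb binding) to write both "Raft-log-style" keys, whose big-endian index encoding keeps inserts sequential, and an ordinary user key-value. Exact method signatures can vary across crate versions; this is illustrative, not TiKV's actual storage code.

```rust
use rocksdb::DB;

fn main() {
    let db = DB::open_default("/tmp/tikv-demo").unwrap();

    // Raft log entries: encoding the index as a big-endian u64 makes the
    // byte order of keys match numeric order, so writes stay sequential.
    for index in 1u64..=3 {
        let key = index.to_be_bytes();
        db.put(&key, format!("raft entry {}", index).as_bytes()).unwrap();
    }

    // State machine data: ordinary user key-values.
    db.put(b"user_key", b"user_value").unwrap();
    if let Ok(Some(value)) = db.get(b"user_key") {
        println!("{}", String::from_utf8_lossy(&value));
    }
}
```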

Distributed transaction

To support distributed transactions, the first problem to solve is time in a distributed system, that is, how we determine the order of different transactions. There are several ways to do this:

  • TrueTime: the method used by Google Spanner, but it requires hardware support (GPS + atomic clocks), and the Spanner paper does not explain in detail how to build such a hardware environment, so it is difficult for us to implement ourselves.

  • HLC: a Hybrid Logical Clock, which uses physical time together with a logical clock to determine the order of events. HLC has been used in some systems, but it relies on NTP; if NTP's precision error is large, the commit-wait time is likely to be affected.

  • TSO: a Timestamp Oracle, a global time service that allocates timestamps from a single point of service. The TSO approach is simple, but the single point can fail and may also become a performance bottleneck.

TiKV adopts TSO for global timestamp allocation, mainly because it is simple. As for the single point of failure, we handle automatic failover with Raft. As for single-point performance, TiKV mainly targets small and medium-sized clusters at or below the PB level, so being able to allocate millions of timestamps per second is sufficient. As for network latency, TiKV does not require global cross-IDC deployment; within a single IDC or between IDCs in the same city the network is very fast, and even remote IDCs connected by dedicated lines will not add too much delay.
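To show why a single TSO can sustain millions of allocations per second, here is a minimal sketch of one common way to build one: combine a physical timestamp with a logical counter so every allocation is strictly increasing. The `Tso` type and the bit layout (physical milliseconds shifted left by 18 bits) are illustrative assumptions, not necessarily what TiKV's timestamp service does.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

const LOGICAL_BITS: u32 = 18; // room for 2^18 timestamps per millisecond

struct Tso {
    last: AtomicU64,
}

impl Tso {
    fn new() -> Tso {
        Tso { last: AtomicU64::new(0) }
    }

    fn now_physical_ms() -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis() as u64
    }

    /// Allocate the next timestamp: (physical_ms << LOGICAL_BITS) | logical.
    fn allocate(&self) -> u64 {
        let candidate = Self::now_physical_ms() << LOGICAL_BITS;
        loop {
            let last = self.last.load(Ordering::SeqCst);
            // Never go backwards: if the physical clock lags, bump the logical part.
            let next = if candidate > last { candidate } else { last + 1 };
            if self
                .last
                .compare_exchange(last, next, Ordering::SeqCst, Ordering::SeqCst)
                .is_ok()
            {
                return next;
            }
        }
    }
}

fn main() {
    let tso = Tso::new();
    let ts1 = tso.allocate();
    let ts2 = tso.allocate();
    assert!(ts2 > ts1);
}
```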

With the time problem solved, the next question is which distributed transaction algorithm to use. The most common choice is 2PC (two-phase commit), but the plain 2PC algorithm has problems in some extreme cases, so the industry either combines it with Paxos or uses algorithms such as 3PC. Here, TiKV references Percolator and uses an enhanced version of 2PC.

Percolator uses optimistic locking: it buffers the data to be changed, locks it at commit time, and then updates it. The advantage of optimistic locking is that it improves the system's concurrency in many scenarios, but under heavy conflicts it is less efficient than pessimistic locking.

Percolator keeps three columns for each row of data to be modified: Lock, Write, and Data:

  • Lock, which stores the transaction's lock. A Percolator transaction picks one of the keys it modifies as the primary key and treats the rest as secondary keys. We first try to lock the primary key, and only after that succeeds will we try to lock the subsequent secondary keys.

  • Write, which stores the commit timestamp of the actually committed data. When a transaction commits successfully, we write the commit timestamp of the corresponding modified row into the Write column.

  • Data, which stores the actual data of the row.

When a transaction starts, we get a start timestamp and then read the rows to be modified. During the read, if a row already has a Lock on it, we may abort the transaction or try to clean up the Lock.

When we commit a transaction, we get a commit timestamp. The commit has two phases, sketched in code after this list:

  1. Prewrite: first try to lock the primary key, then try to lock the secondary keys. If the corresponding key already has a Lock, or there is a newer Write after our start timestamp, Prewrite fails and we abort the transaction. While locking, we also write the new data into the Data column.

  2. Commit: once all the involved data has been locked successfully, we can commit the primary key. We check whether its Lock is still there; if so, we delete the Lock and write the commit timestamp into the Write column. Once the primary key is committed successfully, the secondary keys can be committed asynchronously; we do not even care whether they succeed, because even if they fail there is a mechanism to ensure that the data is eventually committed properly.
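The following is a deliberately simplified sketch of this prewrite/commit flow. All types here (`Store` and its in-memory maps) are hypothetical: real TiKV keeps Lock/Write/Data in RocksDB column families and handles many failure and cleanup cases omitted here.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Store {
    lock: HashMap<Vec<u8>, u64>,             // key -> start_ts of the lock holder
    write: HashMap<Vec<u8>, (u64, u64)>,     // key -> (commit_ts, start_ts)
    data: HashMap<Vec<u8>, (u64, Vec<u8>)>,  // key -> (start_ts, value)
}

impl Store {
    // Phase 1: lock the key and stash the new value in the Data column.
    fn prewrite(&mut self, key: &[u8], value: &[u8], start_ts: u64) -> Result<(), String> {
        if self.lock.contains_key(key) {
            return Err("key is locked by another transaction".into());
        }
        if let Some(&(commit_ts, _)) = self.write.get(key) {
            if commit_ts >= start_ts {
                return Err("write conflict: newer commit exists".into());
            }
        }
        self.lock.insert(key.to_vec(), start_ts);
        self.data.insert(key.to_vec(), (start_ts, value.to_vec()));
        Ok(())
    }

    // Phase 2: remove the lock and record the commit timestamp in Write.
    fn commit(&mut self, key: &[u8], start_ts: u64, commit_ts: u64) -> Result<(), String> {
        match self.lock.get(key) {
            Some(&ts) if ts == start_ts => {
                self.lock.remove(key);
                self.write.insert(key.to_vec(), (commit_ts, start_ts));
                Ok(())
            }
            _ => Err("lock is missing; transaction may have been rolled back".into()),
        }
    }
}

fn main() {
    let mut store = Store::default();
    let (start_ts, commit_ts) = (10, 11);
    store.prewrite(b"primary", b"v1", start_ts).unwrap();
    store.prewrite(b"secondary", b"v2", start_ts).unwrap();
    // Commit the primary key first; secondaries can then be committed asynchronously.
    store.commit(b"primary", start_ts, commit_ts).unwrap();
    store.commit(b"secondary", start_ts, commit_ts).unwrap();
}
```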

In TiKV, the transaction implementation has two main parts: the TiKV client integrated into TiDB, and the storage mod inside TiKV. Both will be introduced in detail later.

RPC framework

RPC is probably the most common way to interact over the network in a distributed system, but implementing a simple and efficient RPC framework is not easy. Fortunately, there are many options to choose from.

TiKV wanted to use gRPC from the very beginning of its design, but Rust did not yet have a gRPC implementation usable in production, so we first had to build our own RPC framework on top of mio. However, as the business grew more complex, this RPC framework could no longer meet our requirements, so we decided to adopt gRPC after all, wrapping Google's official C gRPC library directly in Rust; the result is grpc-rs.

Here is why we decided to use gRPC:

  • gRPC is widely used in many well-known open source projects, such as Kubernetes and etcd.

  • gRPC supports multiple languages; as long as we define the protocol, other languages can connect directly.

  • gRPC has rich interfaces, such as unary, client streaming, server streaming, and bidirectional (duplex) streaming.

  • gRPC uses Protocol Buffers to handle message encoding and decoding efficiently.

  • gRPC is based on HTTP/2 and benefits from HTTP/2 features such as multiplexing and flow control.

When we first started on gRPC for Rust, we tried to build on a pure-Rust implementation, but had to give up after running into too many panics, so we turned our attention to Google's official gRPC library. The Google gRPC library provides support for languages such as C++, C#, and Python, all of which are built on a core C gRPC, so it was natural for us to wrap the C gRPC core directly in Rust.

Since Google's C gRPC uses an asynchronous model, and to simplify writing asynchronous code in Rust, we wrapped it with the Rust futures library to provide a Future-based API, so it can be used like any other Future.
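The general technique is to complete a oneshot channel from inside the C callback, and hand the receiving end back to the caller as a Future. The sketch below assumes the futures 0.1 crate; `call_async_c_api` is a stand-in for an asynchronous callback-style call, not a real grpc-rs function.

```rust
extern crate futures; // futures = "0.1"

use futures::sync::oneshot;
use futures::Future;
use std::thread;

// Stand-in for an async C API that invokes a callback with its result.
fn call_async_c_api<F: FnOnce(Vec<u8>) + Send + 'static>(callback: F) {
    thread::spawn(move || {
        callback(b"response".to_vec());
    });
}

// Wrap the callback style into a Future.
fn call_as_future() -> impl Future<Item = Vec<u8>, Error = oneshot::Canceled> {
    let (tx, rx) = oneshot::channel();
    call_async_c_api(move |resp| {
        // Complete the future from inside the callback.
        let _ = tx.send(resp);
    });
    rx
}

fn main() {
    let resp = call_as_future().wait().unwrap();
    println!("{}", String::from_utf8_lossy(&resp));
}
```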

gRPC itself, as well as the design and usage of Rust gRPC, will be covered in detail in a later article.

Monitoring

It is hard to imagine how a distributed system without monitoring could operate stably. If we only had one machine, it might be enough to log in occasionally and check whether, say, CPU usage looked abnormal, but with hundreds of machines we have to rely on monitoring.

TiKV uses Prometheus, a very powerful monitoring system. Prometheus has the following features:

  • A multidimensional, time-series-based data model: for a single metric, we can use multiple labels for multidimensional differentiation.

  • A customizable alerting mechanism.

  • Rich data types, including Counter, Gauge, Histogram, and Summary.

  • A powerful query language.

  • Support for pull and push modes.

  • Support for both dynamic service discovery and static configuration.

  • Deep integration with Grafana.

Because Prometheus did not have a Rust client, we developed rust-prometheus. Its design references the Go Prometheus API, but we only support the most commonly used Counter, Gauge, and Histogram, and have not implemented Summary.
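As a minimal sketch of the kind of API this provides, the snippet below uses the `prometheus` crate (rust-prometheus) to define a counter and a histogram, register them, and encode the metrics in the Prometheus text format. The metric names are illustrative, and the exact API may differ slightly between crate versions.

```rust
extern crate prometheus;

use prometheus::{Encoder, Histogram, HistogramOpts, IntCounter, Registry, TextEncoder};

fn main() {
    let registry = Registry::new();

    let requests = IntCounter::new("demo_requests_total", "Total requests served").unwrap();
    let latency = Histogram::with_opts(HistogramOpts::new(
        "demo_request_duration_seconds",
        "Request latency in seconds",
    ))
    .unwrap();

    registry.register(Box::new(requests.clone())).unwrap();
    registry.register(Box::new(latency.clone())).unwrap();

    // Record some observations.
    requests.inc();
    latency.observe(0.042);

    // Encode all gathered metrics in the Prometheus text exposition format.
    let mut buffer = Vec::new();
    TextEncoder::new()
        .encode(&registry.gather(), &mut buffer)
        .unwrap();
    println!("{}", String::from_utf8(buffer).unwrap());
}
```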

The use of Prometheus and usage scenarios for different data types will be described in detail later.

Testing

Testing is an essential part of building a distributed key-value store. Only after the most rigorous testing can we be confident that the whole system runs stably.

From the very beginning of TiKV's development, we put testing at the top of our list. Besides regular unit tests, we did much more, for example:

  • Stability test: we wrote a dedicated stability test that randomly disturbs the whole system while running our test program, and then checks whether the results are still correct.

  • Jepsen: we use Jepsen to verify TiKV's linearizability.

  • Namazu: we use Namazu to interfere with the file system and with TiKV's thread scheduling.

  • Failpoint: we injected fail points into many key code paths in TiKV, and then trigger these failures from the outside to verify that the data remains correct even under abnormal conditions.

The above are just some of our test cases. When code is merged into master, our CI system triggers all tests after the build finishes, and we only release a new version after all the tests have run completely.

On the Rust side, we developed fail-rs, modeled on FreeBSD's failpoint mechanism, and have already injected many fail points into the Raft code in TiKV, with more to come. We will also develop more Rust-based tools to test the entire system.
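A rough sketch of fail-point-style testing with the `fail` crate (fail-rs) is shown below: a fail point compiled into the code is a no-op until a test configures an action for it from the outside. `read_value` is a hypothetical function, and the exact crate setup (feature flags, scenario initialization) may differ between versions, so treat this as illustrative only.

```rust
#[macro_use]
extern crate fail; // with the "failpoints" feature enabled

fn read_value() -> Result<u64, String> {
    // Normally a no-op; when the "read-err" point is configured with a
    // "return" action, the closure runs and the function returns early.
    fail_point!("read-err", |_| Err("injected read failure".to_owned()));
    Ok(42)
}

fn main() {
    // Normal path: the fail point is not active.
    assert_eq!(read_value(), Ok(42));

    // Activate the fail point from outside the code under test.
    fail::cfg("read-err", "return").unwrap();
    assert!(read_value().is_err());

    // Deactivate it again.
    fail::remove("read-err");
    assert_eq!(read_value(), Ok(42));
}
```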

Summary

The above only covers the design ideas behind some of the core modules we built while developing TiKV with Rust. This article is just a brief introduction; we will explain each module in detail later. There are also features we have not implemented yet, such as OpenTracing, which will be added gradually in the future.

Our goal is to provide a Rust solution in the field of distributed systems and to build up a Rust ecosystem through TiKV. This goal is ambitious, and anyone who is interested is welcome to join us.