This article is adapted from a talk given by Di Wu, head of database middleware and distributed databases at Toutiao, at TiDB DevCon 2018.

At Toutiao, TiDB is mainly used in the core OLTP object storage system, where it stores part of the metadata and supports Toutiao's image and video related services, such as Douyin.

Today (as of publication), TiDB supports a high-QPS scenario in Toutiao's OLTP system: the cluster holds tens of terabytes of data, and the daily QPS peak reaches hundreds of thousands.

Why do we need TiDB

Toutiao has a large amount of internal business data. Before TiDB, the MySQL instance we used ran on a roughly 2.8 TB SSD. We run an object storage service: Toutiao serves not only articles but also images and videos, and these are basically stored in our self-developed S3-like storage system. A system like this needs a metadata layer. For example, when a picture is saved, the metadata records which machine in the S3 system it lives on, in which file, and at which offset; a large video is sliced into many small segments, and the location of every segment is likewise stored in the metadata.
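To make this concrete, the metadata is essentially a mapping from a bucket and object key to the physical location of each fragment. Below is a minimal sketch of what such a table might look like; the table and column names are illustrative, not Toutiao's actual schema:

    CREATE TABLE object_meta (
        bucket       VARCHAR(64)  NOT NULL,   -- S3-style bucket the object belongs to
        object_key   VARCHAR(255) NOT NULL,   -- object name within the bucket
        fragment_no  INT          NOT NULL,   -- 0 for a picture; 0..N for sliced video segments
        machine_id   BIGINT       NOT NULL,   -- which storage machine holds this fragment
        file_id      BIGINT       NOT NULL,   -- which file on that machine
        file_offset  BIGINT       NOT NULL,   -- byte offset inside the file
        length       BIGINT       NOT NULL,   -- fragment length in bytes
        PRIMARY KEY (bucket, object_key, fragment_no)
    );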

Before TiDB, this metadata lived on a 2.8 TB disk in MySQL. The data grew so quickly that the disk was no longer enough, so we had to shard the databases and tables. The sharding middleware we used was MyCAT, but we ran into problems with it, including data loss: a write would be committed, and later we would discover that the data was gone.

Then there are connection problems. Toutiao's sharding uses a fixed number of shards, roughly 100. Suppose a business needs to shard and ends up with 101 shards. If a query carries the shard key, the middleware can usually route it to the right shard over a single backend connection. But some requests do not carry the shard key, so MyCAT has to open a connection to every one of the 101 underlying MySQL databases. Because of this fan-out, connections are exhausted very quickly: sometimes we give the middleware 50,000 connections and it ends up using a million. SELECT requests without a shard key burn through connections, and the business side frequently complains that there are not enough connections.
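The difference between the two kinds of queries looks roughly like this; the orders table and the user_id shard key are hypothetical, used only to illustrate the routing behavior:

    -- user_id is the shard key: the middleware can compute the target shard
    -- and needs only one backend connection.
    SELECT * FROM orders WHERE user_id = 10086;

    -- No shard key in the predicate: the middleware has to fan the query out
    -- to all ~100 backend MySQL databases, holding one connection per shard.
    SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC LIMIT 20;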

Toutiao mainly uses MySQL and MongoDB for its databases, a relatively limited set, so we also wanted to try some other databases.

Main Application Scenarios

Currently, TiDB is mainly used in the following two scenarios:

The first is OLTP scenarios with large data volumes, where we care not only about latency but also about the fact that the data no longer fits on a single machine, which requires scalability.

The second is OLAP scenarios. Some users work with Hive or Tableau on top of MySQL and found that OLAP-style queries against MySQL were slow. Since the company was promoting TiDB, some of these OLAP workloads were moved onto it.

The metadata of Toutiao's self-developed object storage system is very large and growing very fast. Take one of the largest clusters as an example: it is used in two ways. The first is that the shard information originally lived in MySQL; to introduce TiDB, we first ran it as a replica of MySQL, using Syncer (a synchronization tool provided with TiDB) to replicate the data, and then cut some read requests over from MySQL to TiDB.

We ran this way for a while and found TiDB quite stable. The company also has sudden demands, for example a New Year's Day campaign, during which many more pictures are uploaded and data growth spikes. Our single MySQL disk steadily holds more than 2.0 TB of data (out of 2.8 TB total), so we had to keep deleting very old data, negotiating with the business side that the data was no longer needed and removing it from MySQL to keep things running.

But even so, a single disk still cannot keep up with current data growth, so we took a more radical step and cut over to TiDB the traffic that is written once, read immediately, and then never read again. Since S3 stores many buckets and active users keep creating new ones, the metadata of these new buckets now goes directly into TiDB, bypassing MySQL entirely.

These two cases are the scenarios with the largest data volume and QPS among our OLTP and OLAP uses.

Cluster Deployment Status

For deployment, we co-locate TiDB and PD, three nodes of each. For TiKV we use several dozen machines in total, each with 40 virtual CPU cores and 256 GB of memory.

At present, the average QPS is in the tens of thousands, served by the 3 TiDB instances. The total number of connections across the 3 TiDB instances adds up to about 14K, and the 99th-percentile latency is under 60 ms. These are not peak figures; during campaigns the QPS reaches hundreds of thousands.

Comparing Latency with MySQL

While using TiDB, we also compared its latency with MySQL's:

The upper line is MySQL's 99th-percentile latency, and the black line below it is TiDB's. As you can see, when the MySQL data volume is very large, TiDB's latency is significantly better, even though it uses slightly more machines.

Some Pitfalls and Lessons Learned

Along the way we also ran into some pain points, most of which have been resolved in the current version of TiDB.

The first is histograms. As you know, a cost-based optimizer relies on statistics. Earlier versions of TiDB did not update the histogram statistics promptly, which caused it to choose the wrong execution plan for some SQL statements. For example, a query that could have used an index might instead do a full table scan, simply because the statistics were not up to date.

If you hit this problem, you can run the ANALYZE command to refresh the table statistics, then execute the SQL statement again; you will find that the execution plan has changed.
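A typical session looks something like the following; the table and predicate are hypothetical:

    -- The plan shows a full table scan even though an index on bucket exists.
    EXPLAIN SELECT * FROM object_meta WHERE bucket = 'user-avatars';

    -- Refresh the statistics for the table.
    ANALYZE TABLE object_meta;

    -- Run EXPLAIN again: with up-to-date histograms the optimizer picks the index.
    EXPLAIN SELECT * FROM object_meta WHERE bucket = 'user-avatars';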

The second is Raft leader drops. As is well known, each Region is a Raft group, and TiDB exposes a monitoring metric showing how many Raft leaders sit on each machine. When the data volume reached 10 TB and beyond, around 20 TB, Raft leaders dropped frequently. The drops mainly happened during Region migration or load balancing: the data of a key range in RocksDB is sent to the target machine as SST files, and the target then ingests those SST files into its own RocksDB. Due to an implementation issue in RocksDB, this ingest step took about 30 to 40 seconds, when normally it should take milliseconds to about a second. When RocksDB ingests a file, it opens some files it does not actually need to open: because LevelDB/RocksDB has many levels, an ingested file has to be merged with overlapping data, and because of the implementation issue RocksDB touches SSTs it does not need to touch, generating a lot of I/O. Our data volume is large and our SSTs are large, so once the data grew big enough we stepped into this pit.

Because the RocksDB ingest took so long, Raft heartbeats broke: the Raft protocol has to maintain its lease by sending heartbeat packets, and those packets got stuck behind the slow ingest. The Raft leader then stepped down, and many read and write requests ran into problems afterwards.

The third is a large number of short-lived connections. When our business teams use databases, they often create far too many short connections. Most of them do not use databases heavily and do not know to set up connection pools or idle connections, so opening a connection, issuing one query, and closing it is common. After a TiDB connection is established, it needs to read some system variables, which are stored on a few TiKV instances. With a flood of short connections, every new connection reads these system variables, and since the variables happen to be clustered on a few machines, those machines become extremely heavily loaded and alerts fire because read requests pile up. TiKV uses a thread-pool model: an incoming request is put into a queue, and a thread takes it out and processes it. PingCAP has since been optimizing this by caching these variables inside the TiDB process.

Fourth, strictly speaking, this is not a TiKV problem but a Prometheus client problem. The machine running Prometheus went down, and after we rebooted it we found that much of the TiKV monitoring data was not being reported. On inspection, TiKV was not connected to Prometheus at all, so we believe there is a problem in the Prometheus client implementation.

The fifth problem is row ID scattering, and it was exactly one of the performance problems we ran into. When TiDB inserts a row, it actually writes two key-value pairs: the first is the index entry, whose key is the index value and whose value is the row ID; the second is the row itself, whose key is the row ID and whose value is the whole row of data, so the second pair acts somewhat like a clustered index keyed on the row ID. In the original implementation the row ID was monotonically increasing, so no matter what data you insert, the keys keep increasing; and because TiKV stores an ordered map, all inserts land on a single Region on a single TiKV node. Our business was under heavy write pressure at the time, and the team found that after scaling out the instances of this service, the insert TPS stayed at about 20,000 (rows of a bit over 100 bytes) no matter how many machines were added; the insert QPS simply would not go up.

Instead of monotonically increasing row IDs, newer versions of TiDB can scatter the row IDs, which gives better performance and avoids the hot spot.
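In later TiDB versions this is exposed as the SHARD_ROW_ID_BITS table option, which scatters the implicit row IDs of a table that has no integer primary key; the table below is only an illustration, not one of Toutiao's actual tables:

    -- Spread implicit row IDs across 2^4 = 16 ranges so inserts hit multiple
    -- Regions instead of a single hot one. Applies to tables without an
    -- integer primary key.
    CREATE TABLE upload_log (
        bucket      VARCHAR(64)  NOT NULL,
        object_key  VARCHAR(255) NOT NULL,
        uploaded_at DATETIME     NOT NULL
    ) SHARD_ROW_ID_BITS = 4;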

The last problem comes from TiDB's transaction model: every transaction needs a transaction version, which in TiDB is a timestamp managed by the PD component, so basically every transaction has to contact PD to fetch a timestamp. The RPC itself is not slow, on the order of single-digit milliseconds, but because TiDB is written in Go, there is scheduling overhead: a goroutine fetches a batch of timestamps from PD and then hands them out to the goroutines running transactions, and under a high connection count and heavy load that hand-off adds a latency of around 30 milliseconds. The RPC call itself takes about a millisecond, under 2 milliseconds, but the Go scheduling overhead can multiply that delay several times over.

These are some of the major issues TiDB ran into in Toutiao's OLTP scenarios, most of which have now been fixed.

Some OLAP Applications at Toutiao

There is less to say about the OLAP scenario. Some business teams used to connect a client like Tableau directly to MySQL, which was too slow, so we used Syncer to synchronize some of the data from MySQL into TiDB.

This can cause a problem: the company has an internal tool that batch-synchronizes Hive data into MySQL, and many data analysts use it to synchronize Hive data into TiDB as well. However, the transactions this tool generates are very large, and TiDB itself limits transaction size.

If the following two session variables are turned on, TiDB splits the large transaction into many smaller ones and the problem goes away (a usage sketch follows the list):

  • set @@tidb_batch_insert = ON

  • set @@tidb_batch_delete = ON
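For example, a large backfill from a Hive-synced staging table might look like this; the table names are hypothetical, and the two switches only need to be set once per session:

    set @@tidb_batch_insert = ON;
    set @@tidb_batch_delete = ON;

    -- Without the switches, either statement could exceed TiDB's single-transaction
    -- size limit; with them, TiDB commits the work in smaller batches.
    INSERT INTO report_daily SELECT * FROM staging_report_daily;
    DELETE FROM report_daily WHERE dt < '2017-01-01';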

The transaction size limit mainly comes from the fact that TiKV replicates data through a consensus protocol. Any distributed database that uses a consensus protocol for replication wants to avoid very large transactions, so this is not a TiDB-specific problem; basically anyone building a distributed database runs into it. In an OLAP scenario the data is not very transactional anyway, so turning these switches on is fine.

These are Toutiao's OLAP uses: for example, UGC click counts; app crash reporting, where a log entry is written into the TiDB cluster whenever a client request fails; Druid, an OLAP engine whose metadata used to live in MySQL and which some teams now keep in TiDB; and some Q&A services that also put their data in TiDB.