Time-series databases suddenly became hot in 2017. In February, Facebook open-sourced Beringei, its in-memory time-series database; in April, TimescaleDB, a time-series database built on PostgreSQL, was open-sourced. As early as July 2016, Baidu Cloud had released TSDB, the first multi-tenant distributed time-series database product in China, on its Tiangong IoT platform. It has since become a foundation for core industrial products in manufacturing, transportation, energy, smart cities, and other fields, and a landmark in Baidu's strategic push into the industrial Internet of Things. As a key service for the IoT, this steady drumbeat of announcements shows how eagerly enterprises are embracing the arrival of the IoT era.

This article walks through the basic concepts of time-series databases, their application scenarios, and the problems they must solve, one by one, and finally analyzes in depth how to tackle the technical problem of storing time-series data.

1. Background

Baidu's driverless cars must monitor many states during operation, including coordinates, speed, direction, temperature, and humidity, and record this monitoring data at every moment for big data analysis. Each car collects nearly 8 TB of data per day. If you only store the data and never query it, that is manageable (though already a large cost). But if you need fast multi-dimensional group-by and aggregation queries such as "which driverless cars on Houchangcun Road were going faster than 60 km/h at 2 pm today?", then a time-series database is a good choice.

2. What is a time-series database

Let's start with what time-series data is. Time-series data is a series of data points indexed in time order. Connecting these points along a time axis lets us build multi-dimensional reports over past data to reveal trends, patterns, and anomalies; looking forward, it enables big data analysis, machine learning, prediction, and early warning.

A time-series database is a database that stores time-series data. It must support basic capabilities such as fast writes, persistence, and multi-dimensional aggregation queries over time-series data.

Whereas a traditional database records only the current value of each item, a time-series database records all historical values. Moreover, queries over time-series data always carry time as a filter condition.

Here are some basic concepts of time-series databases (different products use different names for them).

Metric: the name of a measurement, equivalent to a table in a relational database.

Data point: a single sample, equivalent to a row in a relational database.

Timestamp: the time at which the data point was generated.

Field: a value under the metric. For example, a position metric has two fields, longitude and latitude. Fields typically store data that changes with the timestamp.

Tag: a label carrying additional information. Tags typically store property information that does not change with the timestamp. The timestamp plus all the tags can be regarded as the table's primary key.

As shown in the figure below, the metric is Wind. Each data point has a timestamp, two fields (direction and speed), and two tags (sensor and city). The first and third rows come from the device with sensor number 95D8-7913, whose city attribute is Shanghai. As time passes, both wind direction and speed change: direction goes from 23.4 to 23.2, and speed from 3.4 to 3.3.
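To make these concepts concrete, here is a minimal sketch in Python modeling the Wind metric above (the `DataPoint` class and the timestamps are illustrative, not any particular database's API):

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    metric: str      # the "table", e.g. "Wind"
    timestamp: int   # when the point was generated (Unix time, ms)
    fields: dict     # values that change over time (direction, speed)
    tags: dict       # properties that do not change over time (sensor, city)

# Two samples from the same sensor: the tags stay fixed while the fields drift.
p1 = DataPoint("Wind", 1493103000000,
               {"direction": 23.4, "speed": 3.4},
               {"sensor": "95D8-7913", "city": "Shanghai"})
p2 = DataPoint("Wind", 1493103060000,
               {"direction": 23.2, "speed": 3.3},
               {"sensor": "95D8-7913", "city": "Shanghai"})

# metric + tags identify a series; the timestamp identifies a point within it.
series_key = (p1.metric, tuple(sorted(p1.tags.items())))
```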

3. Time-series database scenarios

Any scenario that generates time-series data and needs to show its historical trends, periodic patterns, and anomalies, or to go further into prediction and analysis of the future, is a fit for a time-series database.

In environmental monitoring for the industrial IoT, a customer of Baidu Tiangong ran into exactly this problem. Industry requirements demand that operating-condition data be stored. The customer has 20,000 monitoring points per factory, a collection cycle of 500 milliseconds, and 20 factories in total. That adds up to a staggering 26 trillion data points per year. At roughly 50 bytes per point, the total volume is about 1 PB (with 10 TB of disk per server, more than 100 servers are needed). And this data is not merely generated in real time and written to storage: it must also support fast queries and visual dashboards to help managers analyze and decide, and feed big data analysis to uncover deeper problems, helping the enterprise save energy, cut emissions, and increase returns. In the end the customer adopted Baidu Tiangong's time-series database solution, which solved the problem.
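The arithmetic behind these figures is easy to check; here is a quick back-of-the-envelope sketch (a 500 ms cycle means 2 samples per second per point):

```python
points_per_factory = 20_000
factories = 20
samples_per_second = 2                 # one sample every 500 ms
seconds_per_year = 365 * 24 * 3600

points_per_year = (points_per_factory * factories
                   * samples_per_second * seconds_per_year)
print(f"{points_per_year:.1e} points/year")   # ~2.5e13, on the order of the
                                              # 26 trillion cited above

bytes_per_point = 50
total_bytes = points_per_year * bytes_per_point
print(f"{total_bytes / 1e15:.2f} PB")          # ~1.26 PB

servers = total_bytes / (10 * 1e12)            # 10 TB of disk per server
print(f"~{servers:.0f} servers")               # ~126 servers
```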

Internet scenarios also generate large volumes of time-series data, and many services inside Baidu store theirs in the time-series database on Tiangong's IoT platform. For example, to safeguard user experience, one internal Baidu service records the latency of every user network request into Baidu Tiangong's time-series database. Reports are generated directly from the database for engineers and product managers to analyze, so problems are found and fixed as early as possible.

4. Challenges for time-series databases

Many people might think that adding a timestamp column to a traditional relational database is enough to make a time-series database. That works while data volumes are small, but small volumes cannot support big data analysis: the dimensions are limited, the detail is coarse, and the confidence is low. Time-series databases are clearly designed for massive-data scenarios.

As you can see, a time-series database needs to address the following issues:

- Writing time-series data: how to support tens of millions of data points written per second.

- Reading time-series data: how to group and aggregate hundreds of millions of data points at second-level latency.

- Cost sensitivity: the problem with massive data is storage cost. Storing this data more cheaply is a top priority for a time-series database to solve.

One article cannot cover all of these problems, and each can be attacked from multiple angles. Here we look only at data storage, and try to answer how to handle writing and reading data at this scale.

5. Data storage

Data storage breaks down into two problems: single-node storage and distributed storage.

Single-node storage

If you only needed to store the data, appending to a log would do. But fast queries follow, so the storage structure matters.

Traditional databases store data in B-trees, because their ordered organization reduces the number of disk seeks for queries and ordered inserts. Disk seeks are very slow, typically around 10 ms, and random reads and writes are slower still because of the seeking they cause. Random writes into a B-tree spend a great deal of time seeking, which makes them slow. SSDs have faster seek times, but they do not fundamentally solve the problem.

For a time-series database, where more than 90% of operations are writes, the B-tree is clearly a poor fit.

The industry replaces the B-tree with the LSM tree, as in the NoSQL databases HBase and Cassandra. Let us look at it in detail.

An LSM tree consists of a data structure in memory and files on disk. These correspond to the MemStore and HFile in HBase, and to the MemTable and SSTable in Cassandra.

An LSM tree operates as follows:

1. When data is written or updated, it first goes into the in-memory structure. A WAL (write-ahead log) file is also written to guard against data loss.

2. Periodically, or once it reaches a fixed size, the in-memory structure is flushed to disk. The resulting files on disk are never modified afterwards.

3. As files accumulate on disk, they are periodically merged to eliminate redundant data and reduce the number of files.

The core idea of the LSM tree is to gain higher write performance by writing to memory and then sequentially to disk, avoiding random writes. Read performance suffers somewhat, because the value for a given key may live in several HFiles; Bloom filters and compaction are used to claw back read performance.
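Here is a toy sketch of that write path in Python, assuming nothing about any real system (production LSM implementations add WALs, Bloom filters, leveled compaction, and much more):

```python
import bisect

class ToyLSM:
    """Minimal LSM tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}        # recent writes, in memory
        self.runs = []            # "files on disk": immutable sorted lists
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # 1. A real system appends to the WAL first, then updates the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 2. Freeze the memtable into an immutable sorted run (sequential write).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads check the memtable, then runs from newest to oldest;
        # this is why LSM reads are slower than B-tree reads.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # 3. Merge all runs, keeping only the newest value for each key.
        merged = {}
        for run in self.runs:          # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())]
```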

Distributed storage

A time-series database must write, store, and read massive amounts of data, which a single machine cannot handle. Multiple machines are needed, in other words distributed storage.

The first question in distributed storage is how to spread data across machines, i.e., sharding. Below we look at sharding for time-series databases; it involves choosing a sharding method and designing the shard key.

Sharding methods

Time-series databases shard in ways similar to other distributed systems.

Hash sharding: simple to implement and well balanced, but the cluster is hard to expand.

Consistent hashing: well balanced and the cluster is easy to expand, but the implementation is complex (see the sketch after this list). Examples include Amazon's DynamoDB and the open-source Cassandra.

Range sharding: usually combined with global ordering; the complexity lies in merging and splitting ranges. HBase is an example.
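To illustrate the consistent-hashing idea, here is a compact sketch (illustrative only; real systems such as Cassandra add virtual nodes and replication on top of this):

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Each node occupies one point on the ring; real systems place
        # many virtual points per node for better balance.
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's hash.
        i = bisect.bisect_left(self.ring, (h(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("Wind|sensor=95D8-7913,city=Shanghai"))
# Adding a node only remaps the keys between it and its ring predecessor,
# which is why expansion is easier than with plain hash sharding.
```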

Shard design

Shard design is, simply put, deciding what to shard on. It is quite tricky and directly affects write and read performance.

Given the characteristics of time-series data, sharding on metric + tags works well, because queries are usually over a time range within one series. Data with the same metric and tags then lands on one machine and is stored contiguously, so it can be read sequentially from disk, which is fast. Combined with the single-node storage structure discussed above, this enables fast queries.

Going further, since time-series data spans very long time ranges, we also split a series into segments by time range and store them on different machines, so that queries over large-scale time-series data can run in parallel and query speed improves.
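Combining the two ideas, a shard can be chosen from the series identity (metric + tags) plus a time bucket. A hypothetical sketch (the 7-day bucket size is an arbitrary choice for illustration):

```python
import hashlib

def shard_for(metric: str, tags: dict, timestamp_ms: int, num_shards: int,
              bucket_ms: int = 7 * 24 * 3600 * 1000):
    # Series identity: metric plus sorted tags, so the same series always
    # maps to the same key regardless of tag order.
    series_key = metric + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items()))
    # Time bucket: align the timestamp to a fixed window so that a long
    # series is split across machines and can be queried in parallel.
    bucket = timestamp_ms // bucket_ms
    digest = hashlib.md5(f"{series_key}#{bucket}".encode()).hexdigest()
    return int(digest, 16) % num_shards
```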

In the figure below, the first and third rows share the same tags (sensor=95D8-7913; city=Shanghai), so they are assigned to the same shard; the fifth row, although it has the same tags, falls into a different shard because of its time range. The second, fourth, and sixth rows share the tags (sensor=F3CC-20F3; city=Beijing), and the same logic applies.

P5 – Sharding of time-series data

6. Real-world cases

Let's use a few open-source time-series databases as illustrations.

InfluxDB:

An excellent time-series database, but only the single-node version is free and open source; the clustered version is paid. Its storage design can still be seen from the single-node version: on a single node, InfluxDB adopts TSM, a storage structure similar to the LSM tree. For sharding, InfluxDB first determines the ShardGroup from database + timestamp (in fact the retentionPolicy is included as well), and then determines the specific Shard from the hash code of measurement + tags.

The timestamp is 7-day aligned by default, meaning that 7 days of time-series data live in one ShardGroup.
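In pseudocode, this two-level routing looks roughly as follows (a sketch of the scheme just described, not InfluxDB's actual implementation):

```python
import hashlib

def route(database, retention_policy, measurement, tags, timestamp_ms,
          shards_per_group, group_duration_ms=7 * 24 * 3600 * 1000):
    # Level 1: database + retention policy + time window -> ShardGroup.
    shard_group = (database, retention_policy,
                   timestamp_ms // group_duration_ms)
    # Level 2: hash of the series key (measurement + tags) -> Shard
    # within the group.
    series_key = measurement + "," + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items()))
    digest = hashlib.md5(series_key.encode()).hexdigest()
    return shard_group, int(digest, 16) % shards_per_group
```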

KairosDB:

It uses Cassandra as its distributed storage engine at the bottom layer; as mentioned above, Cassandra uses the LSM tree on a single node.

Cassandra has two levels of indexing: the partition key and the clustering key. The partition key identifies the partition (placed by consistent hashing), while the clustering key guarantees ordering within a partition.

KairosDB builds on this by using metric + timestamp + data type + tags as the partition key, and the offset of each data point's time from that timestamp as the clustering key, whose ordering makes queries over a time range convenient.

The timestamp in the partition key is aligned to 3 weeks, meaning that 21 days of time-series data sit under one partition key. Three weeks is about 1.8 billion milliseconds, just under Cassandra's limit of roughly 2 billion cells per row.
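The 3-week figure is easy to verify:

```python
ms_in_3_weeks = 3 * 7 * 24 * 3600 * 1000
print(f"{ms_in_3_weeks:,}")   # 1,814,400,000 -> at most ~1.8 billion column
                              # offsets per row, just under the ~2 billion
                              # cells Cassandra tolerates per partition
```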

OpenTSDB:

It uses HBase as its distributed storage engine at the bottom layer, and HBase likewise uses the LSM tree.

HBase shards by range: data is sharded on the row key, which is kept in global order. Each row key can carry multiple column families, and each column family multiple columns.

The figure above shows how OpenTSDB organizes its row keys. Unlike other time-series databases, HBase keeps row keys in global order, so an optional salt is added to spread the data better and avoid hot spots. The column qualifier consists of the offset from the row's timestamp and the data type.

Its timestamp is hour-aligned, meaning that one row key stores at most one hour of data. In addition, the metric and tags that make up the row key are converted to UIDs, which reduces storage space and keeps the HFile index from growing too large. Below is an example of a real row key.

P7 – OpenTSDB row key example
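For illustration, here is a sketch of how such a row key could be assembled (the field widths reflect OpenTSDB's defaults of 3-byte UIDs and a 4-byte base timestamp, but the helper functions themselves are hypothetical):

```python
import struct

def row_key(salt: int, metric_uid: int, timestamp_s: int, tag_uids):
    # Align the timestamp down to the hour: one row holds one hour of data.
    base_ts = timestamp_s - (timestamp_s % 3600)
    key = struct.pack(">B", salt)            # optional salt byte
    key += metric_uid.to_bytes(3, "big")     # metric UID (3 bytes by default)
    key += struct.pack(">I", base_ts)        # 4-byte base timestamp
    for tagk_uid, tagv_uid in tag_uids:      # sorted tagk/tagv UID pairs
        key += tagk_uid.to_bytes(3, "big") + tagv_uid.to_bytes(3, "big")
    return key

def qualifier(timestamp_s: int) -> bytes:
    # Column qualifier: offset in seconds from the row's base timestamp
    # (the real qualifier also encodes data-type/length flags).
    return struct.pack(">H", timestamp_s % 3600)

key = row_key(0, 0x000001, 1467331260, [(0x000002, 0x000003)])
```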

7. Conclusion

Although the storage schemes of these distributed time-series databases differ slightly, they are essentially consistent. Because time-series workloads write much and read little, the single-node storage structure is chosen for high write throughput, and the distributed scheme is carefully designed around the characteristics of time-series data. The goal is a sharding design that makes the data convenient to write and read while distributing it evenly enough to avoid hot spots.

Data storage is only a small part of designing a time-series database, but it already shows that the characteristics of time-series data must be considered from the very start of the design. We will discuss other perspectives in later articles.