An In-Depth Look at the Overall Architecture Design and Implementation of the MRS IOTDB Time Series Database

Abstract: This article systematically introduces the background and functional characteristics of MRS IOTDB, focusing on the overall architecture design and implementation of the MRS IOTDB time series database.

This article is shared from “The Overall Architecture Design and Implementation of MRS IOTDB Time Series Database”, by Cloudsong.

MRS IOTDB is the latest time series database product in Huawei’s FusionInsight MRS big data suite. Its design has proven increasingly competitive in the time series database field and has been adopted by a growing number of users. To give a better understanding of MRS IOTDB, this article systematically introduces its origins and functions, focusing on the overall architecture design and implementation of the MRS IOTDB time series database.

1. What is a time series database

A time series database is a specialized database system for storing, querying, analyzing, and otherwise processing time-stamped data (data that changes in time order, that is, time-serialized data). In general, a time series database is used to record values (measurement points, events) that change over time, such as the temperature, humidity, speed, pressure, voltage, and current of devices in the Internet of Things, or the bid and ask prices of stocks.

At present, with the continuous development and application of big data technology, two types of data, represented by the Internet of Things (IoT) and financial analysis, involve large numbers of sensor values or events generated continuously over time. Time series data is a continuous sequence of values ordered along a time axis by the time (timestamp) at which each data point or event occurred. For example, the temperature readings of an IoT device at different moments constitute a time series.

Both machine-generated sensor data and human-generated social event data share some common characteristics:

(1) High collection frequency: tens, hundreds, or even millions of samples per second;

(2) High collection precision: at least millisecond-level collection must be supported, and some scenarios require microsecond or nanosecond precision;

(3) Long collection span: data is collected continuously, 7*24 hours, for years or even decades;

(4) Long storage period: time series data must be stored persistently, and some data (such as seismic data) must be kept for as long as 100 years or even permanently;

(5) Long query windows: time window queries must be supported at granularities from millisecond, second, minute, and hour up to day, month, and year; count window queries must also be supported at granularities such as ten thousand, one hundred thousand, one million, or ten million points;

(6) Difficult data cleaning: time series data can be out of order, missing, or abnormal, and needs special algorithms for efficient real-time processing;

(7) High real-time requirements: both sensor data and event data need millisecond- to second-level processing to guarantee real-time response;

(8) Strong algorithm expertise: time series data in different fields, such as seismology, finance, electric power, and transportation, has many requirements for specialized time series analysis in its vertical domain. Professional analysis relies on algorithms such as time series trend prediction, similar subsequence analysis, periodicity prediction, moving averages, exponential smoothing, autoregressive analysis, and LSTM-based time series neural networks.
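The difference between the time windows and count windows mentioned in item (5) can be illustrated with a minimal sketch. It assumes data points are `(timestamp_ms, value)` pairs held in memory; the function names are hypothetical and are not part of any IOTDB API.

```python
from collections import defaultdict


def time_window_avg(points, window_ms):
    """Average the values that fall into each fixed-length time window."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its window.
        window_start = timestamp // window_ms * window_ms
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def count_window_avg(points, window_size):
    """Average the values of every `window_size` consecutive points."""
    ordered = sorted(points)  # order by timestamp first
    averages = []
    for i in range(0, len(ordered), window_size):
        chunk = [value for _, value in ordered[i:i + window_size]]
        averages.append(sum(chunk) / len(chunk))
    return averages
```

A time window groups points by when they occurred, so windows may hold different numbers of points; a count window groups a fixed number of points regardless of how much time they span.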

As these common characteristics show, the special demands of time series scenarios pose challenges for both traditional relational databases and big data storage. Neither structured storage in a relational database nor storage in a NoSQL database can satisfy the high-concurrency, real-time write and query requirements of massive time series data. A dedicated database for storing time series data was therefore urgently needed, and the concept and products of the time series database were born.

It is important to note that time series databases are different from temporal databases and real-time databases. A temporal database is a database that records the change history of objects, that is, it maintains the history of changes to data; TimeDB is one example. A temporal database maintains, at fine granularity, the temporal state of records in a traditional relational database, whereas a time series database is completely different from a relational database and stores only the values of measurement points at different timestamps. A more detailed comparison between time series databases and temporal databases will be presented in a subsequent article and is not covered here.

Time series databases are also different from real-time databases. The real-time database was born in traditional industry: as modern industrial manufacturing processes and large-scale industrial automation developed, traditional relational databases struggled to meet the storage and query needs of industrial data. As a result, real-time databases for industrial monitoring appeared in the mid-1980s. Because they were born so early, real-time databases have limitations in scalability, integration with the big data ecosystem, distributed architecture, and supported data types, but they also offer mature products and complete support for industrial protocols. Time series databases were born in the Internet of Things era and have advantages in areas such as big data ecosystem integration and cloud-native support.

A basic comparison of time series databases, temporal databases, and real-time databases is as follows:

2. What is MRS IOTDB Time Series Database

MRS IOTDB is the time series database product in Huawei’s FusionInsight MRS big data suite. It is a high-performance, enterprise-level time series database built on Huawei’s deep participation in the open source Apache IOTDB community. IOTDB, as the name implies, is time series database software designed for IoT; it is an Apache open source project initiated by Tsinghua University in China. Since the birth of IOTDB, Huawei has been deeply involved in its architecture design and core code contributions, investing substantial engineering effort, proposing many improvements, and contributing a large amount of code toward the stability, high availability, and performance optimization of the IOTDB cluster version.

At the beginning of its design, IOTDB comprehensively analyzed the time series database products on the market, including mainstream systems such as Timescale (based on a traditional relational database), OpenTSDB (based on HBase), KairosDB (based on Cassandra), and InfluxDB (based on a purpose-built time series structure). Drawing on the strengths of the implementation mechanisms of these different time series databases, it has formed its own technical advantages:

(1) Support high-speed data writing

The unique TLSM algorithm, based on two-stage LSM merging, allows a single IOTDB machine to sustain concurrent writes of tens of millions of measurement points per second, even when data arrives out of order.

(2) Support high-speed query

Supports millisecond-level queries over TB-scale data.

(3) Complete functionality

Supports CRUD and other complete data operations (an update is performed by overwriting the value of a measurement point with the same timestamp on the same device; deletion is performed by setting a TTL expiration time). It also supports frequency domain queries, provides rich aggregation functions, and supports specialized time series processing such as similarity matching and frequency domain analysis.
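The update and delete semantics described in item (3) can be sketched as follows. This is only an illustration, assuming a single in-memory series keyed by millisecond timestamps; the class `SeriesStore` and its methods are hypothetical and do not reflect IOTDB's storage engine.

```python
class SeriesStore:
    """Toy in-memory store for one time series, keyed by timestamp (ms)."""

    def __init__(self, ttl_ms=None):
        self.ttl_ms = ttl_ms  # None means data never expires
        self.points = {}      # timestamp -> value

    def write(self, timestamp, value):
        # Writing to an existing timestamp overwrites the old value (update).
        self.points[timestamp] = value

    def expire(self, now):
        """Drop points older than the TTL; returns how many were removed."""
        if self.ttl_ms is None:
            return 0
        expired = [ts for ts in self.points if now - ts > self.ttl_ms]
        for ts in expired:
            del self.points[ts]
        return len(expired)

    def query(self, start, end):
        """Return (timestamp, value) pairs in [start, end], in time order."""
        return [(ts, self.points[ts]) for ts in sorted(self.points) if start <= ts <= end]
```

Treating an update as an overwrite keeps the write path append-like, and expressing deletion as expiry fits the common case where old time series data simply ages out.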

(4) Rich interfaces, simple and easy to use

Supports JDBC, the Thrift API, and SDK interfaces. IOTDB uses SQL-like statements, extending standard SQL with commonly used time series functions such as sliding time window statistics to improve efficiency. The Thrift API can be called from multiple languages, including Java, C/C++, Python, and C#.

(5) Low storage cost

IOTDB independently developed the TSFILE time series file storage format, which is optimized specifically for time series processing. It is column-oriented, supports explicit data type declarations, and automatically matches different data types to different compression algorithms such as Snappy, LZ4, GZIP, and SDT. It can achieve a compression ratio of 1:150 or even higher (when data precision is further reduced), greatly reducing users’ storage costs. For example, one user originally used 9 KairosDB servers to store time series data, whereas IOTDB can handle the same workload with 1 server of the same configuration.

(6) Flexible deployment across cloud, edge, and device

IOTDB’s lightweight architecture lets it achieve “one engine spanning cloud, edge, and device, with one data format compatible across all scenarios”. In the cloud service center, IOTDB can be deployed as a cluster to take full advantage of cloud cluster processing. At the edge, either a stand-alone IOTDB or a cluster with a small number of nodes can be deployed on the edge server, depending on the server’s configuration. On the device, IOTDB can be embedded directly in the terminal’s local storage in the form of TSFILE files, which the device reads and writes directly without starting or running an IOTDB database server; this greatly reduces the processing capacity required of the terminal device. Because the TSFILE format is open, any language and development platform on the terminal can read and write the TSFILE binary byte stream directly or through the TSFILE SDK, and TSFILE files can even be sent to the edge or the cloud service center over FTP.

(7) Integrated query and analysis

IOTDB supports real-time reads and writes as well as analysis by distributed computing engines. The loosely coupled design of TSFile and the IOTDB engine means that, on the one hand, IOTDB can efficiently write and query time series data using its dedicated time series processing engine; on the other hand, TSFile can also be read and written by big data components such as Flink, Kafka, Hive, Pulsar, RabbitMQ, RocketMQ, Hadoop, Matlab, Grafana, and Zeppelin. This greatly improves IOTDB’s integrated query and analysis capability and the extensibility of its ecosystem.

3. The overall architecture of MRS IOTDB

Building on the existing architecture of Apache IOTDB, MRS IOTDB integrates the enterprise-level core capabilities of MRS Manager, such as log management, operation and maintenance monitoring, rolling upgrades, security hardening, high availability, disaster recovery, fine-grained access control, big data ecosystem integration, and optimized resource pool scheduling. The Apache IOTDB kernel architecture, especially the distributed cluster architecture, has been substantially refactored and optimized, and many system-level enhancements have been made to stability, reliability, availability, and performance.

(1) Interface compatibility:

Further improves the northbound and southbound interfaces; supports access interfaces such as JDBC, CLI, API, SDK, MQTT, CoAP, and HTTPS; further improves the SQL-like statements, with compatibility for most Influx SQL; and supports batch import and export.

(2) Distributed peer architecture:

Building on the RAFT protocol, MRS IOTDB adopts an improved Multi-RAFT protocol, optimizes the underlying Multi-RAFT implementation, and uses optimization strategies such as Cache Leader to further improve query routing performance while ensuring there is no single point of failure. Fine-grained optimizations have also been made for the strong, medium, and weak consistency strategies. Virtual nodes are added to the consistent hashing algorithm to avoid data skew, and lookup-table and hash partitioning strategies are combined, improving cluster scheduling performance on top of high availability.

(3) Two-level granularity metadata management:

Because of the peer-to-peer architecture, metadata is naturally distributed across all nodes of the cluster, but a large volume of metadata consumes a large amount of memory. To balance memory consumption and performance, MRS IOTDB adopts a two-level granularity metadata management architecture: storage group (time series group) metadata is synchronized among all nodes, while time series metadata is synchronized only among the nodes of its partition. When metadata is queried, the tree is first pruned by storage group to greatly reduce the search space, and the time series metadata is then queried on the filtered partition nodes.

(4) High performance local disk access:

MRS IOTDB uses the dedicated TSFILE file format for optimized time series storage, with a columnar layout and adaptive encoding and compression. It supports optimized pipelined access and high-speed insertion of out-of-order data.

(5) HDFS ecological integration:

MRS IOTDB supports reading and writing HDFS files and applies optimizations such as local caching, short-circuit reads, and an HDFS I/O thread pool to comprehensively improve its HDFS read and write performance. MRS IOTDB also supports Huawei OBS object storage, with further performance optimizations.

On the basis of HDFS integration, MRS IOTDB supports efficient reading and writing of TSFile by MRS components such as Spark, Flink and Hive.

(6) Multi-level permission control:

Supports multi-level permission control over storage groups, devices, sensors, and so on
Supports permissions on multiple operation types, such as create, delete, and query
Supports Kerberos authentication
Supports the Ranger permission architecture

(7) Cloud Edge Deployment:

Flexible cloud and edge deployment is supported. The edge side can be connected through Huawei’s IEF product or deployed directly in Huawei’s IES.

The MRS IOTDB cluster edition supports dynamic scaling, which provides more flexible deployment across cloud and edge.

4. Standalone architecture of MRS IOTDB

4.1 Basic concepts of MRS IOTDB

MRS IOTDB focuses on real-time processing of measurement values from devices and sensors in the IoT field, so its basic design takes devices and sensors as core concepts. In addition, the concept of a storage group is introduced to make time series data easier for users to work with and for IOTDB to manage.

Storage Group: a concept introduced by IOTDB for managing time series data, similar to a database in a relational database system. From the user’s point of view, it is mainly used to group and manage device data. From IOTDB’s point of view, a storage group is also the unit of concurrency control and disk isolation, and different storage groups can be read and written in parallel.

Device: corresponds to a specific piece of physical equipment in the real world, such as a generating unit in a power plant, a wind turbine, an automobile, an aircraft engine, or a seismograph. In IOTDB, a device is the unit of a single time series write: each write request is limited to one device.

Sensor: corresponds to a sensor carried by a physical device in the real world, such as the sensors on a wind turbine that collect wind speed, steering angle, and power generation. In IOTDB, a sensor is also called a measurement point. Specifically, a measurement refers to the value collected by a sensor at a certain moment, which IOTDB stores internally in columnar form as <time, value> pairs.

The relationship between storage groups, devices, and sensors is as follows:

Time Series: similar to a table in a relational database, except that this table has three main fields: Timestamp, Device ID, and Measurement. To describe the device behind a time series in more detail, IOTDB also adds extension fields such as Tag and Field, where Tags are indexed and Fields are not. In some time series databases a time series is also called a timeline: it records how the value of one sensor on one device changes over time, forming a line along the time axis to which new measurement values are continuously appended.

Path: IOTDB builds a tree that starts at the root node and connects storage groups, devices, and sensors in series. A path runs from the root node through a storage group and a device to a sensor leaf node, as shown in the figure below:
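As a rough illustration of the path tree, the sketch below stores paths in nested dictionaries. It assumes the dot-separated notation used by Apache IOTDB (for example `root.ln.wf01.wt01.status`); the helper names `add_path` and `leaf_paths` are hypothetical.

```python
def add_path(tree, path):
    """Insert a dot-separated path into a nested-dict tree."""
    node = tree
    for part in path.split("."):
        node = node.setdefault(part, {})
    return tree


def leaf_paths(tree, prefix=""):
    """List the full path of every leaf (sensor) in the tree, sorted by name."""
    paths = []
    for name, child in sorted(tree.items()):
        full = f"{prefix}.{name}" if prefix else name
        if child:
            paths.extend(leaf_paths(child, full))
        else:
            paths.append(full)
    return paths
```

Here `root` is the root node, `ln` plays the role of a storage group, `wf01.wt01` identifies a device, and `status` is a sensor leaf.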

Virtual storage group: Because a storage group serves both as the user’s unit for grouping devices and as the system’s unit of concurrency control, tightly coupling the two roles lets users’ usage patterns affect the system’s concurrency. For example, if a user puts all devices, related or not, into a single storage group, IOTDB locks that whole storage group for concurrency control, which limits concurrent reads and writes. To loosen the coupling between storage groups and concurrency control, IOTDB introduces the virtual storage group, which moves concurrency control from the storage group down to the finer-grained virtual storage group.
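The idea of splitting one storage group lock into several virtual storage group locks can be sketched as below. This is an assumption-laden illustration: the mapping of a device to a virtual storage group by CRC32 hash, the class name, and the parameter `virtual_group_count` are all hypothetical and not taken from IOTDB.

```python
import threading
import zlib


class StorageGroup:
    """Storage group whose concurrency control is split across virtual groups."""

    def __init__(self, name, virtual_group_count=4):
        self.name = name
        # One lock per virtual storage group instead of one for the whole group.
        self.locks = [threading.Lock() for _ in range(virtual_group_count)]

    def virtual_group_of(self, device_id):
        # Deterministic hash so a device always maps to the same virtual group.
        return zlib.crc32(device_id.encode("utf-8")) % len(self.locks)

    def write(self, device_id, write_fn):
        # Only writers that land in the same virtual group contend for a lock.
        with self.locks[self.virtual_group_of(device_id)]:
            return write_fn()
```

With this layout, writes to devices in different virtual storage groups can proceed in parallel even when the user has placed every device in one storage group.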

4.2 Basic architecture of MRS IOTDB

The stand-alone MRS IOTDB is mainly composed of storage groups. Each storage group is a unit of concurrency control and resource isolation and contains multiple time partitions. Each storage group has a corresponding write-ahead log (WAL) file and TSFILE time series data files. Time series data in each time partition is first written to the Memtable and recorded in the WAL at the same time, and is then flushed to TsFile asynchronously on a schedule. The specific implementation mechanism will be introduced in detail later. The basic architecture of stand-alone MRS IOTDB is as follows:
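The write path described above (Memtable plus WAL, then a flush to TsFile) can be sketched in a few lines. This is a simplification under stated assumptions: plain Python lists and dicts stand in for the WAL and TsFile, and the flush is triggered synchronously by a size threshold rather than by the timed, asynchronous flush described in the text. All names are hypothetical.

```python
class TimePartitionWriter:
    """Toy write path: WAL + memtable, flushed to a sorted 'TsFile' list."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for the write-ahead log file
        self.memtable = {}   # (series, timestamp) -> value, held in memory
        self.tsfile = []     # stand-in for flushed TsFile contents
        self.flush_threshold = flush_threshold

    def write(self, series, timestamp, value):
        # Record the write in the WAL so it can be replayed after a crash,
        # then apply it to the in-memory table.
        self.wal.append((series, timestamp, value))
        self.memtable[(series, timestamp)] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable in (series, time) order; out-of-order
        # arrivals are sorted here. The WAL is no longer needed afterwards.
        self.tsfile.extend(sorted((s, t, v) for (s, t), v in self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()
```

Writing to memory first keeps ingestion fast, while the WAL covers the window during which data exists only in the Memtable.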

5. Cluster architecture of MRS IOTDB

The MRS IOTDB cluster is a fully peer-to-peer distributed architecture. It uses the RAFT protocol to avoid single points of failure, and uses the Multi-RAFT protocol to avoid the single-point performance bottleneck of a single RAFT consensus group. The underlying communication, concurrency control, and high availability mechanisms of the distributed protocol have also been further optimized.

First, all nodes of the cluster form a metadata group (MetaGroup), which is used only to maintain the metadata of storage groups. For example, in the 4-node IOTDB cluster shown in the blue-gray box below, all 4 nodes constitute one metadata group (MetaGroup).

Second, data groups are constructed according to the number of data replicas. For example, if the number of replicas is 3, each data group (DataGroup) contains 3 nodes. Data groups are used to store time series data and the corresponding time series metadata.
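One plausible way to form data groups from the cluster nodes is sketched below. It assumes that every node heads exactly one data group made up of itself and its successors in node order, which matches the later statement that the number of DataGroups equals the number of nodes; the actual grouping rule in MRS IOTDB may differ, and `build_data_groups` is a hypothetical name.

```python
def build_data_groups(nodes, replica_count):
    """Each node heads one data group made of itself and its successors."""
    n = len(nodes)
    return [
        [nodes[(i + k) % n] for k in range(replica_count)]
        for i in range(n)
    ]
```

For a 4-node cluster with 3 replicas this yields 4 overlapping data groups of 3 nodes each, so every node participates in 3 data groups.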

In distributed systems, reliable data storage is usually achieved with multiple replicas. Replicas of the same data are stored on different nodes and must be kept consistent, so the RAFT consensus protocol is used to guarantee consistency. RAFT splits the consistency problem into several relatively independent sub-problems, such as leader election, log replication, and consistency assurance (safety). The RAFT protocol has several important concepts:

(1) Raft group. A Raft group has one elected leader node, and the other nodes are followers. When a write request arrives, it is first submitted to the leader, which records the request in its own log and then replicates the log entry to the followers.

(2) Raft log. RAFT uses a log to ensure that operations are not lost, and each node maintains a commit index and an apply index. If a log entry is committed, it has been received and persisted by more than half of the nodes in the Raft group. If a log entry is applied, the current node has executed it. When a node fails and later recovers, its log lags behind the leader’s log, and the node cannot serve requests normally until it catches up with the leader’s log.
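The "more than half" commit rule can be made concrete with a small sketch. It assumes the leader knows the last persisted log index of every node in the group (itself included) and ignores the term check that full Raft also applies; the function names are hypothetical.

```python
def commit_index(match_indexes):
    """Highest log index persisted on more than half of the nodes.

    match_indexes holds the last persisted log index of every node in
    the Raft group, leader included.
    """
    ordered = sorted(match_indexes, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest index is stored on at least `majority` nodes.
    return ordered[majority - 1]


def is_caught_up(apply_index, leader_commit_index):
    """A recovering node can serve requests once it has applied
    everything the leader has committed."""
    return apply_index >= leader_commit_index
```

For example, with three nodes that have persisted up to indexes 5, 4, and 2, entry 4 is on two of the three nodes and is therefore the latest committed entry.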

5.2 Hierarchical management of metadata

The metadata management strategy is a key point of the MRS IOTDB distributed design. The design must take into account how metadata is used in the read and write paths:

When writing data, metadata is needed to check the validity of data types, permissions, and so on
When querying data, metadata is needed for query routing

At the same time, because the volume of metadata in time series scenarios is very large, the memory consumed by metadata must also be considered.

Existing metadata management strategies either hand metadata over to dedicated metadata nodes, which reduces read and write performance, or keep a full copy of the metadata on every node of the cluster, which consumes a large amount of memory.

To solve these problems, MRS IOTDB uses a two-level granularity metadata management strategy. The core idea is to split metadata into storage group metadata and time series metadata and manage them separately:

(1) Storage group metadata: the metadata group (MetaGroup) holds the routing information needed when querying data. Storage group metadata is stored in full on all nodes of the cluster. Storage groups are coarse-grained, and the number of storage groups in a cluster is orders of magnitude smaller than the number of time series, so keeping only storage group metadata on every node greatly reduces the memory footprint.

Each node in the metadata group is called the metadata holder, and RAFT protocol is used to ensure that each holder is consistent with the data of other holders in the same group.

(2) Time series metadata: the time series metadata in a data group (DataGroup) contains the data types, permissions, and other information needed when writing data, and is stored only on the nodes where that data group resides (a subset of the cluster nodes). Time series metadata is fine-grained and far more numerous than storage group metadata, so storing it only on the nodes of its data group avoids unnecessary memory consumption, while first-level filtering through the storage group metadata still locates it quickly. In addition, Raft consistency within the data group prevents time series metadata from becoming a single point of failure.

Each node in a data group is called a data partition holder, and RAFT protocol is used to ensure that each holder is consistent with the data of other holders in the same group.

In this way, storage group metadata and time series metadata are managed at two levels of granularity by the metadata holders and the data partition holders, respectively. Because time series data and time series metadata are synchronized within the same data group, a data write does not require a metadata check and synchronization against the metadata group; that check and synchronization is needed only when time series metadata is modified, which improves system performance. For example, when creating a time series and then performing 500,000 data writes, the number of metadata checks and synchronizations is reduced from 500,000 to 1.
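The 500,000-to-1 reduction can be illustrated with a toy model. It assumes that creating a time series triggers one check against the metadata group, while each data write is validated only against time series metadata held locally by the data group; the classes below are hypothetical stand-ins, not MRS IOTDB components.

```python
class MetaGroup:
    """Holds storage group metadata and counts cross-group checks."""

    def __init__(self, storage_groups):
        self.storage_groups = set(storage_groups)
        self.checks = 0

    def check_storage_group(self, storage_group):
        self.checks += 1
        return storage_group in self.storage_groups


class DataGroup:
    """Holds time series metadata locally, next to the data itself."""

    def __init__(self, meta_group):
        self.meta_group = meta_group
        self.series_types = {}  # series path -> expected Python type
        self.points = 0

    def create_series(self, storage_group, series, data_type):
        # Modifying time series metadata: one check against the metadata group.
        if not self.meta_group.check_storage_group(storage_group):
            raise ValueError(f"unknown storage group: {storage_group}")
        self.series_types[series] = data_type

    def write(self, series, timestamp, value):
        # Plain data write: validated against local metadata only.
        if not isinstance(value, self.series_types[series]):
            raise TypeError(f"wrong data type for {series}")
        self.points += 1
```

Creating one series and then writing 500,000 points to it leaves the metadata group's check counter at 1, because the per-write type check never leaves the data group.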

5.3 Metadata distribution

According to the hierarchical management of metadata, metadata can be divided into storage group metadata and time series metadata.

Storage group metadata is replicated on all nodes of the cluster and belongs to the MetaGroup.

Time series metadata is stored only on the corresponding DataGroup and includes information such as time series properties, field types, and field descriptions. Time series metadata is distributed in the same way as the data itself: both are placed by slot hashing.

5.4 Time series data distribution

In the distributed implementation, time series data is partitioned by storage group using a hash ring and a lookup on the ring. Each node of the cluster is placed on the hash ring according to its hash value. For an incoming time series data point, the hash value of the storage group corresponding to the time series name is calculated and placed on the ring; the ring is then searched clockwise, and the first node found is the node into which the data is inserted.

With hash ring partitioning, the hash values of two nodes can easily end up very close together, which leaves the data unevenly distributed. Virtual nodes are therefore introduced on top of consistent hashing: each physical node is virtualized into several virtual nodes, which are placed on the hash ring according to their hash values. This largely avoids data skew and makes the data more evenly distributed.
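A minimal consistent hash ring with virtual nodes might look like the sketch below. It is only illustrative: MD5 is used as a convenient deterministic hash, 100 virtual nodes per physical node is an arbitrary choice, and the names `HashRing` and `locate` are hypothetical rather than taken from IOTDB.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string to a position on the ring (deterministic across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, virtual_per_node=100):
        # Place several virtual nodes per physical node on the ring.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_per_node)
        )
        self.hashes = [h for h, _ in self.ring]

    def locate(self, storage_group):
        """Walk clockwise from the key's position to the first virtual node."""
        idx = bisect.bisect_right(self.hashes, ring_hash(storage_group))
        # Wrap around to the start of the ring if we ran past the end.
        return self.ring[idx % len(self.ring)][1]
```

Because each physical node owns many small arcs of the ring instead of one large arc, two nodes hashing close together no longer leaves one of them with almost no data.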

First, the entire cluster presets 10,000 slots, which are distributed evenly across the DataGroups. As shown in the figure below, the IOTDB cluster has 4 DataGroups and 10,000 slots in total, so each DataGroup has 10000/4 = 2500 slots on average. Since the number of DataGroups equals the number of cluster nodes (4), this is equivalent to an average of 2500 slots per node.
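The even distribution of slots can be sketched as follows. The sketch assumes each DataGroup receives one contiguous range of slot numbers and that any remainder goes to the first groups; both are illustrative assumptions, and `assign_slots` is a hypothetical name.

```python
def assign_slots(data_groups, total_slots=10000):
    """Split slot numbers 0..total_slots-1 evenly across data groups."""
    per_group, remainder = divmod(total_slots, len(data_groups))
    assignment = {}
    start = 0
    for i, group in enumerate(data_groups):
        # The first `remainder` groups take one extra slot each.
        count = per_group + (1 if i < remainder else 0)
        assignment[group] = range(start, start + count)
        start += count
    return assignment
```

With 4 DataGroups and 10,000 slots, every group receives exactly 2500 slots, matching the arithmetic in the text.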

Next, the mapping of slot to DataGroup, Time Partition, and Time series is completed.

The IOTDB cluster is divided into multiple DataGroups according to the RAFT protocol. Each DataGroup contains multiple slots, each slot contains multiple time partitions, and each time partition contains multiple time series. The composition is shown in the figure below:

Finally, the mapping of the input storage group and timestamp to the slot is completed by calculating the value of the slot through a Hash:

1) First, partition by time range, which makes time range queries convenient:

TimePartitionNum = TimeStamp / PartitionInterval (integer division)

TimePartitionNum is the ID of the time partition, TimeStamp is the timestamp of the measurement point to be inserted, and PartitionInterval is the length of a time partition, which defaults to 7 days. Integer division is used so that all timestamps within the same interval map to the same partition.

2) Then compute the slot value by hashing on the storage group and time partition:

Slot = Hash(StorageGroupName + TimePartitionNum) % MaxSlotNum

StorageGroupName is the name of the storage group, TimePartitionNum is the time partition ID calculated in the first step, and MaxSlotNum is the maximum number of slots, which defaults to 10000.
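The two steps can be put together in a short sketch. Several details are assumptions made for illustration: timestamps are taken to be in milliseconds, the time partition ID is computed by integer division so that all timestamps within one 7-day interval share a partition, and MD5 stands in for whatever hash function the cluster actually uses.

```python
import hashlib

PARTITION_INTERVAL_MS = 7 * 24 * 60 * 60 * 1000  # default of 7 days
MAX_SLOT_NUM = 10000                             # default number of slots


def time_partition_num(timestamp_ms, interval_ms=PARTITION_INTERVAL_MS):
    """ID of the time partition that contains the timestamp."""
    return timestamp_ms // interval_ms


def slot_of(storage_group, timestamp_ms, max_slot_num=MAX_SLOT_NUM):
    """Slot for a (storage group, timestamp) pair."""
    key = f"{storage_group}{time_partition_num(timestamp_ms)}"
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % max_slot_num
```

All data of one storage group within the same time partition lands in the same slot, so a time range query only needs to visit the slots of the partitions it overlaps.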

The relationship between data groups and storage groups is shown in the figure below, where Data Group 1 on node 3 and node 1 illustrates the same data group being distributed across two nodes:
