In August 2018, we published a test report on Zhihu comparing DolphinDB with InfluxDB. DolphinDB was one to three orders of magnitude ahead in query performance and one order of magnitude ahead in data import performance, with no significant difference in data export performance. Both DolphinDB and InfluxDB have been improved over the past year or so. How has the performance of the two databases changed? We re-tested DolphinDB and InfluxDB on data import and export, data query, and disk space usage. The test data now also covers the increasingly popular Internet of Things scenario, as well as a larger financial big data set.

DolphinDB did better than InfluxDB in all the tests:

  • In data import, DolphinDB performed about 75 times better than InfluxDB on the small data set and about 100 times better on the large data set. Moreover, InfluxDB does not support direct CSV import; the data must first be converted to its Line Protocol format manually.

  • In data export, DolphinDB performed about 11 times better than InfluxDB, and InfluxDB is prone to memory overflow when exporting large amounts of data in CSV format.

  • In disk space usage, DolphinDB always occupies the same amount of space as InfluxDB or less.

  • In terms of query functionality, InfluxDB does not support comparison queries, table joins, or sorting by tags and fields other than time. Moreover, a function parameter can only be a field rather than an expression over fields, which is a significant limitation. Several test cases could not be implemented in InfluxDB at all; DolphinDB handled them flexibly and easily.

  • In query performance, DolphinDB performed more than 1,000 times better than InfluxDB in 2 test cases, more than 50 times better in 6 test cases, and more than 10 times better in 2 test cases. All remaining test cases also performed better than InfluxDB.

I. System overview

DolphinDB

DolphinDB is a high-performance distributed time-series database written in C++. It uses a high-throughput, low-latency columnar in-memory engine and integrates a functional scripting language with a high-capacity, high-speed streaming data analysis system, so that complex programming and computation can be performed inside the database, significantly reducing the time spent on data migration.

DolphinDB achieves high-speed distributed computing through its in-memory engine, data locality, fine-grained data partitioning, and parallel computing. It has built-in computing frameworks such as pipeline, MapReduce, and iterative computation, and uses an embedded distributed file system to automatically manage partitioned data and its replicas, providing load balancing and fault tolerance for distributed computing.

DolphinDB supports standard SQL-like syntax and provides a Python-like scripting language for manipulating data, as well as APIs for other common programming languages. It is well suited to historical data analysis and modeling and real-time streaming data processing in finance, and to massive sensor data processing and real-time analysis in the Internet of Things.

InfluxDB

InfluxDB is the most popular high-performance open source time series database, written in Go. At its heart is the TSM Tree, a customized storage engine optimized for time series data that prioritizes insert and query performance.

InfluxDB uses the SQL-like query language InfluxQL and provides time series math and statistics functions out of the box. It also provides an HTTP-based interface for data insertion and queries.

InfluxDB allows users to define Retention Policies to delete or downsample data that is stored beyond a specified period of time. It is widely used for monitoring data in storage systems and real-time data in the IoT industry.

II. Test environment

DolphinDB and InfluxDB were tested in single-machine mode because the InfluxDB cluster version was closed source.

Host: DELL OptiPlex 7060
CPU: Intel Core i7-8700 (6 cores, 12 threads, 3.20 GHz)
Memory: 32 GB (8 GB × 4, 2666 MHz)
Disk: 2 TB HDD (222 MB/s read; 210 MB/s write)
OS: Ubuntu 16.04 LTS

The DolphinDB version used in the test was Linux v0.89 (2019.01.31), with maximum memory set to 28 GB.

The InfluxDB version used in the test is 1.7.5. Starting from the official InfluxDB configuration file, the configuration was tuned to the actual hardware of the test machine: wal-fsync-delay was adjusted to 100ms, which suits mechanical hard disks; cache-max-memory-size was set to 28 GB; and series-id-set-cache-size was set to 400.
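The corresponding entries in influxdb.conf look roughly like the excerpt below; this is a sketch based on the values just described, not a verbatim copy of our file:

[data]
  # batch WAL fsyncs; a delay of 100ms suits mechanical disks
  wal-fsync-delay = "100ms"
  # upper limit of the TSM in-memory cache
  cache-max-memory-size = "28g"
  # cache size for TSI series ID sets
  series-id-set-cache-size = 400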

For details about modified configurations, see the influxdb.conf file in the Appendix.

III. Data set

We tested DolphinDB and InfluxDB with a small data set (4.2 GB) and a large data set (270 GB). The table structures and partitioning methods for the two data sets are described below.

4.2 GB device sensor record small data set (CSV format, 30 million rows)

We used IoT device sensor information as the small data set for testing. It contains time-series statistics such as sensor time, device ID, battery, memory, and CPU for 3,000 devices at 10,000 time intervals from November 15, 2016 to November 19, 2016. The data set consists of 30,000,000 records and includes a device information table (device_info) and a device sensor readings table (readings).

Data sources: docs.timescale.com/v1.1/tutori…

Download address: timescaledata.blob.core.windows.net/datasets/de…

The partitioning schemes of DolphinDB and InfluxDB are as follows:

Our DolphinDB partitioning scheme uses time as the first partition dimension, divided into four partitions by day with the boundaries [2016.11.15 00:00:00, 2016.11.16 00:00:00, 2016.11.17 00:00:00, 2016.11.18 00:00:00, 2016.11.19 00:00:00]. device_id is then used as the second partition dimension, splitting each day into 10 partitions, so that each partition contains about 100 MB of raw data.
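As a sketch, such a composite partition can be created in DolphinDB script roughly as follows; the HASH split of device_id is our assumption for illustration, and the exact scheme is in the test script in the appendix:

// time as the first dimension: four RANGE partitions, one per day
dbTime = database("", RANGE, [2016.11.15T00:00:00, 2016.11.16T00:00:00, 2016.11.17T00:00:00, 2016.11.18T00:00:00, 2016.11.19T00:00:00])
// device_id as the second dimension: 10 partitions per day (HASH assumed here)
dbDevice = database("", HASH, [SYMBOL, 10])
// compose the two dimensions; the database path dfs://iot is hypothetical
db = database("dfs://iot", COMPO, [dbTime, dbDevice])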

InfluxDB uses Shard Groups to store data of different time ranges; the time ranges of different Shard Groups do not overlap. A Shard Group contains many shards, which are the structures InfluxDB uses to store data and serve reads and writes. Within a Shard Group, InfluxDB further partitions the data by hashing, mapping each series to a shard according to hash(series). We therefore use the following statement to manually set the duration of each Shard Group to one day, partitioning by day on the time dimension:

create retention policy one_day on test duration inf replication 1 shard duration 1d default

270 GB stock transactions large data set (CSV format, 23 CSV files, 6.5 billion rows)

As the big data set, we tested the Level 1 quotation data of the stock market from August 1, 2007 to August 31, 2007, provided by the New York Stock Exchange (NYSE). The data set contains quotation information for more than 8,000 stocks over one month of trading: trading time, stock symbol, bid price, ask price, bid size, ask size, and so on.

The data set contains 6.5 billion (6,561,693,704) quotation records. The records of each trading day are saved in a separate CSV file. There are 23 trading days in the month, and the uncompressed CSV files total 270 GB.

Source: NYSE Exchange Proprietary Market Data

The partitioning schemes of DolphinDB and InfluxDB are as follows:

In DolphinDB, we partition by date and symbol: each day is divided into 100 partitions by symbol, so that each partition holds about 120 MB of raw data.
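A corresponding sketch in DolphinDB script; symRanges stands for 101 symbol boundaries computed from the data and is hypothetical here, as is the database path:

// date as the first dimension: one VALUE partition per trading day
dbDate = database("", VALUE, 2007.08.01..2007.08.31)
// symbol as the second dimension: 100 RANGE partitions per day
dbSymbol = database("", RANGE, symRanges)
db = database("dfs://taq", COMPO, [dbDate, dbSymbol])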

For InfluxDB, we use the same policy as for the small data set.

IV. Data import and export test

1. Import data

The data is imported into DolphinDB using the following script:

timer {
    for (fp in fps) {
        loadTextEx(db, `taq, `date`symbol, fp, schema)
        print now() + ": " + fp
    }
}

Importing the 30 million records of the 4.2 GB device sensor small data set took 20 seconds, an average rate of 1.5 million records per second.

Importing the 270 GB stock trading big data set, 6.5 billion records in total (23 files, TAQ20070801 to TAQ20070831), took 38 minutes.

InfluxDB itself does not support importing CSV directly; data can only be imported through the HTTP API or influx -import. For the sake of import performance, we chose to first convert each line of the CSV to the Line Protocol format, as in the following example:

readings,device_id=demo000000,battery_status=discharging,bssid=A0:B1:C5:D2:E0:F3,ssid=stealth-net battery_level=96,battery_temperature=91.7,cpu_avg_1min=5.26,cpu_avg_5min=6.172,cpu_avg_15min=6.51066666666667,mem_free=650609585,mem_used=349390415,rssi=-42 1479211200

Add the following header:

# DDL
CREATE DATABASE test
CREATE RETENTION POLICY one_day ON test DURATION INF REPLICATION 1 SHARD DURATION 1d DEFAULT

# DML
# CONTEXT-DATABASE:test
# CONTEXT-RETENTION-POLICY:one_day

Save the file to a disk and run the following command to import the file:

influx -import -path=/data/devices/readings.txt -precision=s -database=test

After conversion, importing the 30 million records of the 4.2 GB device sensor small data set took 25 minutes and 10 seconds, an average rate of 20,000 records per second.

When importing the TAQ data, we found that if records are inserted with the same timestamp (e.g. 1185923302) and the same tag values (symbol, mode, ex), the later record overwrites the earlier one even when the field values (bid, ofr, bidsiz, ofrsiz) differ, so only the last record remains in the database. For example:

taq,symbol=A,mode=12,ex=T bid=37,ofr=54.84,bidsiz=1,ofrsiz=1 1185923302
taq,symbol=A,mode=12,ex=T bid=37,ofr=38.12,bidsiz=1,ofrsiz=1 1185923302

To solve this problem, the official documentation (InfluxDB frequently asked questions | InfluxData Documentation) offers two approaches: one is to add a new tag to manually differentiate records with the same timestamp and tag values; the other is to forcibly fine-tune the timestamp so that the timestamps differ.

These workarounds are applicable only when data is inserted in chronological order. When inserting from another program, it is inefficient and cumbersome to check whether the timestamp of each record is exactly the same as the previous one and to manually assign different tag values before writing to the database. If the data is not inserted in chronological order, there is no way to know whether the current point in time already holds a record and whether earlier data will be overwritten.

In this test, we used the method of forcibly fine-tuning the timestamp. Since the time precision of the original TAQ records is seconds, while converting the CSV data to the Line Protocol format we added a random millisecond value to each seconds-precision timestamp, generating a new pseudo-timestamp with millisecond precision and preventing data collisions.

After the conversion, importing the 6.5 billion records of the 270 GB stock trading big data set took 65 hours, an average import rate of 27,000 records per second.

The import performance comparison is as follows:

DolphinDB's import rate was significantly higher than InfluxDB's. We also observed during the import that InfluxDB's rate declined over time, while DolphinDB's remained stable. Moreover, before importing, InfluxDB requires writing code to convert CSV files to its Line Protocol format, which is complicated and produces redundant intermediate files that occupy a lot of space.

2. Export data

DolphinDB exported the data with saveText((select * from readings), '/data/devices/readings_dump.csv') in just 28 seconds.

In InfluxDB, exporting with influx -database 'test' -format csv -execute "select * from readings ..." > /data/devices/export_15.csv occupied more than 30 GB of memory and eventually failed with fatal error: runtime: out of memory. We therefore exported the data one day at a time, as follows:

for i in 1{5..8}; do
    time influx -database 'test' -format csv -execute "select * from readings where '2016-11-$i 00:00:00' <= time and time < '2016-11-$((i+1)) 00:00:00'" > /data/devices/export_$i.csv
done

The total time was 5 min and 31 s.

In addition to poor performance, exporting CSV data from InfluxDB is complex and prone to memory overflow. Moreover, the exported CSV files have no field names in the first line, which makes the user experience worse than DolphinDB's.

The export performance comparison is shown in the following table:

V. Disk space usage comparison

The disk space occupied by DolphinDB and InfluxDB after importing the data is shown in the following table:

On the small data set, DolphinDB and InfluxDB occupy similar amounts of space: both compress the data when storing it, with a compression ratio of roughly 20% to 30%. On the large data set, however, InfluxDB compresses poorly and takes up about twice as much space as DolphinDB.

VI. Database query performance test

We compared the following eight categories of queries:

  • Point query: specify a field value and query the matching records
  • Range query: query data over a time interval for one or more fields
  • Precision query: aggregate data along different tag dimension columns, enabling field range queries at higher or lower dimensions
  • Aggregation query: the time-series database provides API support for aggregations over fields such as count, average, sum, maximum, minimum, moving average, standard deviation, and downsampling
  • Comparison query: rearrange the contents of a field in a table into a two-dimensional layout, with the first dimension as columns and the second dimension as rows (see the sketch after this list)
  • Sampling query: the database provides a data sampling API so that each query can manually specify a sampling method to thin out sparse data, preventing overload from a large query time range
  • Association query: filter out associated fields and group them, based on queries with the same precision and time range
  • Classic query: complex queries commonly used in real business
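As a reference for the comparison query category, here is a minimal DolphinDB sketch against the taq table of the big data set; the chosen field and date are illustrative:

// rearrange the last bid price into a two-dimensional layout:
// one row per minute, one column per symbol
select last(bid) from taq where date(time) == 2007.08.03 pivot by minute(time), symbol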

The test time includes the disk I/O time. To ensure fairness, before starting each test we ran the Linux sync command and then echo 1 / 2 / 3 | tee /proc/sys/vm/drop_caches to clear the system page cache, the directory entry and inode caches, and both, respectively; after that we started the program and executed all the queries in turn.

The query test results of 4.2GB device sensor record small data set are shown in the following table:

See appendix for the query script.

DolphinDB performed much better than InfluxDB. InfluxDB is also not as capable as DolphinDB in terms of functionality. For example:

  • InfluxDB does not support comparison queries or table joins, and cannot perform many queries that regular SQL databases support.

  • InfluxDB supports sorting only by time; a query such as select * from taq order by … on a tag or field cannot be executed.

    For details, see [Feature Request] ORDER BY Tag values · Issue #3954 · Influxdata/InfluxDB.

  • A parameter of an InfluxDB function can only be a field, not an expression over fields (such as field1 + field2). A subquery must be used to compute the value of the expression before the function can be applied, which is cumbersome. In addition, the time range must be specified again in the WHERE clause of both the subquery and the outer query; otherwise, the scan runs from the time of the oldest record up to the current time. A comparison of the 14th query in DolphinDB and InfluxDB follows:

    //14. DolphinDB
    select max(date(time)) as date, max(mem_free + mem_used) as mem_all from readings where time <= 2016.11.18T21:00:00, battery_level >= 90, cpu_avg_1min > 90 group by hour(time), device_id

    //InfluxDB
    select max(mem_total) from ( select mem_free + mem_used as mem_total from readings where time <= '2016-11-18 21:00:00' and battery_level >= 90 and cpu_avg_1min > 90 ) where time <= '2016-11-18 21:00:00' group by time(1h), device_id

The test results of 270GB stock trading big data query are shown in the following table:

See appendix for the query script.

DolphinDB was found to be 100 to 200 times faster than InfluxDB in some queries.

In the tests, we found that:

  • In the second query, when the WHERE condition selects multiple non-contiguous time ranges, InfluxDB returns an empty result rather than the union of the results for each time range. The code for the second query in InfluxDB is shown below:

    //2. select symbol, time, bid, ofr from taq where (symbol = 'IBM' or symbol = 'MSFT' or symbol = 'GOOG' or symbol = 'YHOO') and ('2007-08-03 01:30:00' <= time and time < '2007-08-03 01:30:59')

    select symbol, time, bid, ofr from taq where (symbol = 'IBM' or symbol = 'MSFT' or symbol = 'GOOG' or symbol = 'YHOO') and (('2007-08-03 01:30:00' <= time and time < '2007-08-03 01:30:59') or ('2007-08-04 01:30:00' <= time and time < '2007-08-04 01:30:59')) // when the time range is extended to two days, the result is empty

  • The eighth query could not be completed in InfluxDB. The InfluxDB code attempted for the eighth query looks like this:

    //8. Calculate the difference between the maximum ask price and the minimum bid price per minute for each stock on a given day
    select max(ofr) - min(bid) as gap from taq where '2007-08-03' <= time and time < '2007-08-04' and bid > 0 and ofr > bid group by symbol, time(1m)

InfluxDB raises the exception ERR: mixing multiple selector functions with tags or fields is not supported, whereas DolphinDB can freely use multiple aggregate functions such as max and min in the same select statement.

  • The results of an InfluxDB group by include every time period, even periods with no valid data (as shown below). For sparse data this increases processing complexity and degrades performance, while DolphinDB returns only the time periods that contain valid data.

    2007-07-31T23:02:00Z 22.17 54.84
    2007-07-31T23:03:00Z
    2007-07-31T23:04:00Z
    2007-07-31T23:05:00Z
    2007-07-31T23:06:00Z
    2007-07-31T23:07:00Z
    2007-07-31T23:08:00Z 37 38.12
    2007-07-31T23:09:00Z
    2007-07-31T23:10:00Z 37.03 38.12

VII. Comparison of other aspects

DolphinDB has several advantages in addition to its excellent performance in benchmark testing:

  • Language: InfluxDB operates the database through InfluxQL, a SQL-like language. DolphinDB comes with a complete scripting language that supports not only SQL but also imperative, vectorized, and functional programming, metaprogramming, and RPC.
  • Data import: InfluxDB has no good official support for batch import of specific file formats such as CSV. Users can only read the files with third-party open-source tools or on their own and then import the data in batches through the API, with at most 5,000 rows per batch, which is cumbersome and inefficient. DolphinDB provides the ploadText, loadText, and loadTextEx functions to import CSV files directly in scripts, which is friendlier and more efficient.
  • Functionality: InfluxDB does not support table joins, so some routine queries cannot be completed. DolphinDB not only supports common table joins but also heavily optimizes non-simultaneous joins such as asof join and window join (see the sketch after this list).
  • The largest time unit InfluxDB can group by is week, while DolphinDB supports grouping by all built-in time types, up to a month. An InfluxDB function parameter can only be a field, not an expression over fields; DolphinDB has no such restriction.
  • DolphinDB offers more than 600 built-in functions, meeting the needs of scenarios as different as historical data modeling and real-time streaming data processing in finance, and real-time monitoring and analysis of massive sensor data in the Internet of Things. Most aggregate functions, as well as the lead, lag, cumulative window, and sliding window functions needed for time-series processing, are performance-optimized.
  • DolphinDB’s clustered version supports transactions and ensures strong consistency when multiple copies of a partition are written.
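A minimal sketch in DolphinDB script illustrating the asof join and calendar-month grouping mentioned above, using small hypothetical trades and quotes tables:

// hypothetical in-memory tables
trades = table(`IBM`IBM`MSFT as symbol, 2016.11.15T09:30:01 2016.11.15T09:30:03 2016.11.15T09:30:02 as time, 101.5 101.7 35.1 as price)
quotes = table(`IBM`IBM`MSFT as symbol, 2016.11.15T09:30:00 2016.11.15T09:30:02 2016.11.15T09:30:01 as time, 101.4 101.6 35.0 as bid)
// asof join: each trade is matched with the latest quote at or before its timestamp
t1 = select * from aj(trades, quotes, `symbol`time)
// group by calendar month, which InfluxDB's time() intervals cannot express
t2 = select avg(price) from trades group by symbol, month(time)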

Appendix

Data preview (first 20 lines)

  • readings
  • readings (Line Protocol)
  • Script to convert the readings table to Line Protocol
  • TAQ
  • TAQ (Line Protocol)
  • Script to convert the TAQ table to Line Protocol

DolphinDB

  • Installation, configuration, startup scripts
  • The configuration file
  • Small data set test script
  • Big data set test scripts

InfluxDB

  • Installation and configuration scripts
  • Modify the configuration
  • Small data set test script
  • Big data set test scripts