1. Overview

DolphinDB

DolphinDB is a high-performance distributed analytical time-series database written in C++. It uses a high-throughput, low-latency columnar in-memory engine and integrates a functional scripting language with a high-capacity, high-speed streaming data analysis system, so that complex programming and computation can be performed inside the database, significantly reducing the time spent moving data around.

DolphinDB achieves high-speed distributed computing through its in-memory engine, data locality, fine-grained data partitioning, and parallel computing. It has built-in computing frameworks such as pipeline, MapReduce, and iterative computation, and uses an embedded distributed file system to automatically manage partitioned data and its replicas, providing load balancing and fault tolerance for distributed computation.

DolphinDB supports standard SQL-like syntax and provides a Python-like scripting language for manipulating data, as well as APIs for other common programming languages. It is well suited to historical data analysis and modeling and real-time streaming data processing in finance, and to massive sensor data processing and real-time analysis in the Internet of Things.

TimescaleDB

TimescaleDB is currently the only open-source time-series database on the market with full SQL support. It is built on top of PostgreSQL and is essentially a PostgreSQL extension.

TimescaleDB fully supports SQL and inherits the rich PostgreSQL ecosystem. It is optimized for fast insertion and complex queries of time-series data, supports automatic sharding and automatic partitioning along time and space dimensions, supports parallel queries across multiple servers and multiple chunks, and includes internal write optimizations (batch commits, in-memory indexing, transaction support, data backfill).

However, TimescaleDB does not yet support horizontal scaling (clustering); write clustering for multi-node TimescaleDB deployments is under active development (github.com/timescale/t…). Only read-only clustering implemented via PostgreSQL streaming replication is currently supported.

In this report, we compared the performance of TimescaleDB and DolphinDB on time-series data sets. The tests cover CSV data file import and export, disk space usage, and query performance. DolphinDB did better in every test we ran. The main conclusions are:

  • In data import, DolphinDB outperforms TimescaleDB by more than 10 times on the small data set and by more than 100 times on the large data set. Moreover, TimescaleDB's import rate declines as the import progresses, while DolphinDB's remains stable.
  • In data export, DolphinDB performs about 3 times as well as TimescaleDB.
  • DolphinDB occupies only about 1/6 of the disk space TimescaleDB uses on the small data set, and only about 1/17 on the large data set.
  • In query performance, DolphinDB outperforms TimescaleDB by more than 50 times in 4 test samples, by 10 to 50 times in 15 test samples, and by several times in 10 test samples; only 2 test samples perform worse than TimescaleDB.

2. Test environment

Since TimescaleDB does not currently support clusters that can write data, we tested on a single machine with the following configuration:

  • Host: DELL OptiPlex 7060
  • CPU: Intel Core i7-8700 (6 cores, 12 threads, 3.20 GHz)
  • Memory: 32 GB (8 GB × 4, 2666 MHz)
  • Disk: 2 TB HDD (222 MB/s read; 210 MB/s write)
  • OS: Ubuntu 16.04 LTS

DolphinDB was tested with the Linux v0.89 build (2019.01.31), with its maximum memory set to 28 GB. The PostgreSQL version tested was 10.6-1 for Ubuntu on x86_64, with the TimescaleDB plug-in at v1.1.1. Following the performance tuning method recommended by the official TimescaleDB guidelines and the actual hardware of the test machine, we generated a configuration file with pgtune (pgtune.leopard.in.ua) and further optimized it with reference to the official tuning guide at wiki.postgresql.org/wiki/Tuning…, mainly setting shared_buffers and effective_cache_size to 16 GB and the number of parallel workers to 12 to match the 12-thread CPU. Since only one mechanical hard disk is used, effective_io_concurrency is set to 1. See the postgresql_test.conf file in the appendix for the configuration changes.
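For reference, the key settings described above can also be expressed as standard PostgreSQL ALTER SYSTEM statements. This is only a minimal sketch assuming the values listed in the previous paragraph; the configuration actually used in the test is the postgresql_test.conf file in the appendix.

-- Sketch of the main tuning values; the parameter names are standard PostgreSQL GUCs.
ALTER SYSTEM SET shared_buffers = '16GB';        -- shared buffer pool
ALTER SYSTEM SET effective_cache_size = '16GB';  -- planner's estimate of available cache
ALTER SYSTEM SET max_worker_processes = 12;      -- match the 12-thread CPU
ALTER SYSTEM SET max_parallel_workers = 12;
ALTER SYSTEM SET effective_io_concurrency = 1;   -- single mechanical disk
-- shared_buffers and max_worker_processes require a server restart to take effect.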

3. Data set

DolphinDB and TimescaleDB were tested on a small data set (4.2 GB) and a large data set (270 GB):

In the small data set test, all tables are preloaded into memory before the queries are run: DolphinDB uses loadTable(memoryMode=true), and PostgreSQL uses the pg_prewarm plug-in to load the tables into shared_buffers.

In the large data set test, we do not preload the on-disk partitioned tables, and the query test time includes disk I/O time. To ensure fairness, before each test run we use the Linux command sync together with echo 1, 2, 3 | tee /proc/sys/vm/drop_caches to clear the system page cache, the directory entry cache, and the disk cache respectively.

Here are the table structures and partitioning methods for the two datasets:

4.2 GB small data set of device sensor records (CSV format, 30 million rows)

We chose devices_big from the sample data sets provided by TimescaleDB as the small data set. It contains time-series statistics such as sensor time, device ID, battery, memory, and CPU for 3,000 devices over 10,000 time points (2016.11.15 to 2016.11.19).

Source: docs.timescale.com/v1.1/tutori… Download address: timescaledata.blob.core.windows.net/datasets/de…

The data set contains 30 million records (4.2 GB as CSV). The compressed package contains a device information table and a device sensor readings table. Their table structures and partitioning methods are as follows:

Device_info table

Readings table

The field device_id has 3,000 distinct values, which appear repeatedly in the readings table. Storing it as text not only wastes space but also makes queries inefficient, yet it is difficult to use an enum type for this field in TimescaleDB. DolphinDB's Symbol type solves both the storage and the query problem easily and efficiently.

Similarly, the bssid and ssid fields, which record the WiFi network a device is connected to, contain many repeated values, but because the data is unpredictable in practice they are not suitable for an enum type either.

Our DolphinDB partitioning scheme uses time as the first partition dimension, split by day into four zones with partition boundaries [2016.11.15 00:00:00, 2016.11.16 00:00:00, 2016.11.17 00:00:00, 2016.11.18 00:00:00, 2016.11.19 00:00:00]. device_id is then used as the second partition dimension, splitting each day into 10 further partitions, so that each partition holds about 100 MB of raw data.

We tried using device_id as a second partition dimension in TimescaleDB, but 90% of the queries performed worse than with time-only partitioning, so we chose to partition only by time, by day into four partitions, the same as DolphinDB. For the device_id dimension we created indexes in the officially recommended form (see docs.timescale.com/v1.0/using-…) to speed up queries, as shown below.

create index on readings (device_id, time desc);
create index on readings (ssid, time desc);
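For reference, the time-only hypertable we settled on, and the device_id space partitioning we tried and rejected, can be set up roughly as below. This is a sketch using TimescaleDB's standard create_hypertable and add_dimension calls; the exact script used in the test is in the appendix.

-- Partition readings by time only, one chunk per day (the scheme we adopted).
SELECT create_hypertable('readings', 'time', chunk_time_interval => interval '1 day');

-- The space-partitioning variant we tried and rejected: device_id as a second
-- dimension with 10 partitions per day.
SELECT add_dimension('readings', 'device_id', number_partitions => 10);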

270 GB large data set of stock quotes (CSV format, 23 CSV files, 6.5 billion rows)

As the large data set we used the Level 1 quote data for US stocks from August 1, 2007 to August 31, 2007 provided by the New York Stock Exchange (NYSE). The data set contains quote information such as trading time, ticker symbol, bid price, ask price, bid size, and ask size for more than 8,000 stocks over one month.

The data set contains 6.5 billion (6,561,693,704) quote records. Each trading day is saved in one CSV file; there are 23 trading days in the month, and the uncompressed CSV files total 270 GB. Source: www.nyse.com/market-data…

Taq table

We partition first by date and then by symbol (ticker), with each day further split into 100 partitions by symbol, so that each partition holds about 120 MB.

4. Data import and export test

Import data from a CSV file

Data is imported into DolphinDB using the following script:

timer {
    for (fp in fps) {
        loadTextEx(db, `taq, `date`symbol, fp, schema)
        print now() + ": " + fp
    }
}

The 4.2 GB small data set of device sensor records, 30 million rows in total, was imported in 20 seconds, an average rate of 1,500,000 rows per second.

The 270 GB large data set of stock quotes, 6,561,693,704 rows in total (23 files, TAQ20070801 through TAQ20070831), took 38 minutes to import.

For the TimescaleDB import, since the timescaledb-parallel-copy tool does not support CSV header rows, we skip the header with tail -n +2 before piping each file to its standard input.

for f in /data/TAQ/csv/*.csv ; do
    tail -n +2 $f | timescaledb-parallel-copy \
        --workers 12 \
        --reporting-period 1s \
        --copy-options "CSV" \
        --connection "host=localhost user=postgres password=postgres dbname=test sslmode=disable" \
        --db-name test \
        --table taq \
        --batch-size 200000
    echo "file $f import completed"
done

Importing the 30 million rows of the 4.2 GB small data set of device sensor records took 5 minutes and 45 seconds, an average rate of 87,000 rows per second.

For the 270 GB large data set of stock quotes, importing only the five files TAQ20070801, TAQ20070802, TAQ20070803, TAQ20070806, and TAQ20070807 (70 GB and roughly 1.67 billion rows in total) took 24 hours, an import rate of 19,400 rows per second; at that rate, importing all 270 GB would take about 92 hours.

The import performance is as follows:

DolphinDB import rates were significantly higher than TimescaleDB import rates, and the difference was more pronounced when the data was large. The import rates of TimescaleDB decreased with time, while DolphinDB remained stable.

In addition, after importing the small data set, TimescaleDB still needs about 2 minutes to build indexes.

Exporting data to CSV files

In DolphinDB, data was exported with saveText((select * from readings), '/data/devices/readings_dump.csv').

In TimescaleDB, data was exported with time psql -d test -c "\COPY (SELECT * FROM readings) TO /data/devices/devices_dump.csv DELIMITER ',' CSV".

The export performance of small data sets is shown in the following table:

5. Disk space usage comparison

The following table shows the disk space occupied by TimescaleDB and DolphinDB after importing the data:

DolphinDB uses storage much more efficiently than TimescaleDB; the TimescaleDB database actually occupies more space than the original CSV data files, for several reasons:

  • TimescaleDB only applies automatic compression (TOAST) to large field values and does not compress tables as a whole. That is, when fields are small, rows are short, and the row count is large, tables are not compressed at all, and using a compressing file system such as ZFS would noticeably hurt query performance. DolphinDB uses LZ4 compression by default.
  • TimescaleDB uses SELECT create_hypertable('readings', 'time', chunk_time_interval => interval '1 day') to turn the original data table into a hypertable, an abstraction that provides a unified query and manipulation interface over the different data partitions, with hyperchunks underneath for the actual storage. On inspection, the indexes on the time-series data fields in the hyperchunks amount to 0.8 GB, and the indexes on device_id and ssid amount to 2.3 GB in total.
  • The device_id, ssid, and bssid fields contain a large number of repeated values, but as noted above, because the data is unpredictable in practice, bssid and ssid are not suitable for an enum type and can only be stored as repeated strings. DolphinDB's Symbol type adapts dynamically to the actual data, solving the storage problem easily and efficiently.

6. Query test

We compared the following eight categories of queries:

  • Point query: look up records by a specified field value.
  • Range query: query data within a given time range for one or more fields.
  • Precision query: aggregate data along different label dimension columns, enabling range queries over higher- or lower-precision fields.
  • Aggregate query: the time-series database provides API support for aggregate functions over fields, such as count, average, sum, max, min, moving average, standard deviation, and downsampling (see the SQL sketch after this list).
  • Comparison query: rearrange the contents of a field in a table along two dimensions into a new table (the first dimension as columns and the second as rows).
  • Sampling query: the database provides a data sampling API, so each query can specify a sampling method to thin out the data and prevent overload when the query time range is large.
  • Association query: on top of filtering with the same precision and time range, filter out the related fields and group them.
  • Classic query: queries commonly used in real business scenarios.
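To make the aggregate query category concrete, a typical TimescaleDB formulation downsamples with time_bucket. This is only an illustrative sketch: the battery_level column is an assumption about the readings schema, and the actual test queries are in the appendix.

-- Hypothetical aggregate query: hourly average and maximum battery level per device.
SELECT time_bucket('1 hour', time) AS hour,
       device_id,
       avg(battery_level) AS avg_battery,
       max(battery_level) AS max_battery
FROM readings
GROUP BY hour, device_id
ORDER BY hour, device_id;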

Query test on the 4.2 GB small data set of device sensor records

For small data set tests, we first load all the tables into memory.

DolphinDB loads tables into memory with loadTable(memoryMode=true); TimescaleDB uses select pg_prewarm('_hyper_2_41_chunk') to load chunks into shared_buffers.
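Prewarming chunk by chunk is tedious; if the show_chunks() helper is available in your TimescaleDB version, all chunks of a hypertable can be prewarmed in one statement, roughly as in the sketch below (not the script used in the test).

-- Load every chunk of the readings hypertable into shared_buffers.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm(chunk) FROM show_chunks('readings') AS chunk;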

The following table describes the query performance. See appendix for the query script.

For sampling queries, TimescaleDB samples data tables with the tablesample clause, whose parameter is the sampling percentage. However, only two sampling methods are available: system and bernoulli. System sampling works at the block level, so every row in a sampled block is returned and the result is less random; bernoulli samples the whole table but is slower. Neither method supports sampling by a specific field. DolphinDB does not support whole-table sampling, only partition-level sampling, so because the implementations differ we did not compare performance here.
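For concreteness, the two tablesample methods look like this in PostgreSQL (a sketch with arbitrary percentages, not a query from the test):

-- Block-level sampling: fast, but returns every row of each sampled block.
SELECT * FROM readings TABLESAMPLE SYSTEM (1);

-- Row-level Bernoulli sampling: scans the whole table, slower but more uniform.
SELECT * FROM readings TABLESAMPLE BERNOULLI (1);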

For interpolation queries, TimescaleDB (PostgreSQL) has no built-in interpolation support and needs hundreds of lines of code to implement it (see wiki.postgresql.org/wiki/Linear…); DolphinDB supports four types of interpolation: ffill, bfill, lfill, and nullFill.

For comparison queries, TimescaleDB relies on the crosstab() function provided by PostgreSQL's built-in tablefunc extension, which has significant limitations, as the sample queries show. First, the user must manually hard-code all possible values and their data types in the second dimension; the column list cannot be generated dynamically from the data, which is very tedious and makes the function unusable for dynamic data or for fields with many distinct values. Second, it can only pivot on a text-typed dimension, or one pre-cast to text from another type, and this conversion is inefficient and wastes space when the data volume is large. DolphinDB natively supports the pivot by statement, which pivots automatically once the two dimensions are specified.
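The hard-coding limitation shows up even in a minimal crosstab() call: every output column must be listed by hand with its type. The sketch below uses hypothetical device IDs and an assumed battery_level column; with 3,000 devices the AS clause would need 3,000 entries.

-- Requires the tablefunc extension shipped with PostgreSQL.
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- Pivot battery_level so that time forms the rows and device_id the columns.
-- The types in the AS clause must match the source columns exactly.
SELECT *
FROM crosstab(
  $$ SELECT time, device_id, battery_level
     FROM readings
     WHERE device_id IN ('demo000001', 'demo000002', 'demo000003')
     ORDER BY 1, 2 $$
) AS ct(time timestamptz,
        demo000001 double precision,
        demo000002 double precision,
        demo000003 double precision);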

For association queries, asof join is very convenient for time-series data analysis. DolphinDB supports asof join natively, while PostgreSQL does not (github.com/timescale/t…).
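Without native asof join support, the usual PostgreSQL workaround is a LATERAL subquery that picks the most recent quote at or before each reference time. The sketch below is only illustrative: the trades table and the column names are assumptions, not objects from the test.

-- Hypothetical asof-style join: for each trade, fetch the latest quote for the
-- same symbol at or before the trade time.
SELECT t.symbol, t.time, t.price, q.bid, q.ofr
FROM trades t
LEFT JOIN LATERAL (
    SELECT bid, ofr
    FROM taq
    WHERE taq.symbol = t.symbol
      AND taq.time <= t.time
    ORDER BY taq.time DESC
    LIMIT 1
) q ON true;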

When count(*) is used to query the total number of records, TimescaleDB scans the entire table, which is extremely inefficient.

Query test on the 270 GB large data set of stock quotes

In the large data set test, we do not preload the on-disk partitioned tables into memory, and the query test time includes disk I/O time. To ensure fairness, before each start of the test program we use the Linux command sync together with echo 1, 2, 3 | tee /proc/sys/vm/drop_caches to clear the system page cache, the directory entry cache, and the disk cache respectively, then start the program and run all the test samples in sequence.

The following table describes the query performance. See appendix for the query script.

7. Appendix

  • CSV data Format Preview (first 20 lines)

device_info: devices.csv

readings: readings.csv

taq: TAQ.csv

  • DolphinDB

Installation, configuration, and startup script: test_dolphindb.sh

Configuration file: Dolphindb.cfg

Small data set test complete script: test_dolphindb_small.txt

Large data set test complete script: test_dolphindb_big.txt

  • TimescaleDB

Installation, configuration, and startup script: test_timescaledb.sh

The complete script for testing small data sets: test_timescaledb_small.sql

Large data set test complete script: test_timescaledb_big.sql

PostgreSQL configuration changes: postgresql_test.conf

PostgreSQL complete configuration: postgresql.conf

PostgreSQL Permission configuration: pg_hba.conf

All possible values for the ticker symbol: symbols.txt

SQL statement that creates the symbol enumeration type: make_symbol_enum.sql

Script that generates the symbol enumeration type: make_symbol_enum.coffee

  • Test result processing script

REPL.coffee