Elasticsearch is a popular log search and analysis tool known for its real-time performance, scalability, ease of use, and full-text search. An article on Zhihu by Golion, "Dimensionality Reduction Strike! Using Elasticsearch as a Time-Series Database", reports that it works very well in that role. But can Elasticsearch also be used to store and analyze massive amounts of financial data?

We ran a comprehensive comparison of DolphinDB and Elasticsearch on financial datasets of various sizes. The tests covered I/O, disk space usage, memory consumption, and database queries (filtering queries and grouped statistics). DolphinDB, a time-series database designed for financial data processing, beat Elasticsearch across the board:

  • In grouped statistics (aggregate calculations), DolphinDB performed about 10 times better than Elasticsearch, and the gap widened as the data set grew. DolphinDB did particularly well on test cases involving time-type fields.
  • For simple filtering queries, DolphinDB was about 100 times faster than Elasticsearch.
  • Elasticsearch took 25 to 75 times longer than DolphinDB to import data, and the ratio tended to increase with the size of the data set.
  • DolphinDB compresses raw data, whereas Elasticsearch occupies more disk space than the raw data in order to maintain document indexes and other metadata (not temporary data). The overall gap is about 10 times.

1. System introduction

1.1 DolphinDB profile

DolphinDB is an analytical, distributed time-series database with columnar storage, a built-in streaming data processing engine, parallel and distributed computing engines, and a distributed file system that supports cluster scaling. DolphinDB is written in C++ and is extremely responsive. It provides a SQL- and Python-like scripting language for manipulating data, as well as APIs for other common programming languages to ease integration with existing applications. It performs outstandingly in scenarios such as historical data modeling and real-time streaming data processing in finance, and massive sensor data processing and real-time analysis in the Internet of Things.

1.2 Elasticsearch profile

Elasticsearch is a Lucene-based search server. It is a distributed system that stores data on local disks as documents. Its concepts map to those of a traditional relational database as follows:

Relational DB => Databases => Tables => Rows => Columns

Elasticsearch => Indices => Types => Documents => Fields

An Elasticsearch cluster can contain multiple indices, which correspond to DolphinDB databases. Each index can contain multiple types, which correspond to DolphinDB tables. Each type contains multiple documents, which correspond to DolphinDB rows, and each document contains multiple fields, which correspond to DolphinDB columns.

2. System configuration

2.1 Hardware Configuration

The hardware configuration of the test is as follows:

Device: DELL OptiPlex 7060

CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 6 cores and 12 threads

Memory: 32 GB

Hard disk: 2 TB

Operating system: Ubuntu 16.04 x64

2.2 Environment Configuration

The test environment is a multi-node cluster on a single server. The DolphinDB and Elasticsearch node parameters were set to maximize the performance of each in a standalone environment. We set the number of DolphinDB data nodes to 4 and the maximum available memory of a single data node to 7 GB. We set the number of Elasticsearch nodes to 4. Because Elasticsearch is based on Lucene, some memory must be reserved for loading Lucene segments; for this test we reserved 8 GB for Lucene, set the maximum available memory (heap) of a single Elasticsearch node to 6 GB, and disabled swapping.
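The full parameter files are given in the appendix. As a rough sketch, the per-node memory settings described above would typically be expressed through the standard Elasticsearch and DolphinDB configuration files; the fragment below is an illustration of that convention, not a copy of the actual test configuration:

```
# Elasticsearch jvm.options — 6 GB heap per node
-Xms6g
-Xmx6g

# Elasticsearch elasticsearch.yml — lock memory to prevent swapping
bootstrap.memory_lock: true

# DolphinDB cluster.cfg — 7 GB memory cap per data node (unit: GB)
maxMemSize=7
```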

3. Test data sets

To test DolphinDB and Elasticsearch more thoroughly, we used three stock datasets of different sizes. The table CN_Stock contains the daily quotation data of Shanghai and Shenzhen stocks from January 1, 2008 to December 31, 2017. The table US_Prices contains daily quotes for the U.S. stock market from January 2, 1990 to December 30, 2016. The table TAQ contains four days of Level 1 high-frequency data for the US stock market from August 2007, totaling 60.6 GB. The test data sets are summarized in the following table:

The data types of the various DolphinDB and Elasticsearch fields in the test dataset are as follows:

(1) Data type mapping of CN_Stock table

(2) Data type mapping of the US_Prices table

(3) Data type mapping of TAQ table

4. Partition/sharding scheme

DolphinDB provides flexible partitioning mechanisms, including value partitioning, range partitioning, list partitioning, hash partitioning, and composite partitioning, whereas Elasticsearch only supports hash-based sharding.

In DolphinDB, the CN_Stock table was value-partitioned by half-year, giving 20 partitions. The US_Prices table was partitioned by year, giving 27 partitions. The TAQ table used a composite partition on date and stock symbol, with 100 partitions in total. The number of replicas was set to 1.
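To make the composite scheme concrete, here is a small Python sketch of how a (date, symbol) composite key could map TAQ rows to 100 partitions (4 trading dates × 25 symbol buckets). The hash function and bucket count are assumptions for illustration only, not DolphinDB's internal implementation:

```python
# Illustrative composite (date, symbol) partitioning, resembling the TAQ
# layout described above: partition by trading date first, then by a hash
# of the stock symbol into a fixed number of buckets. With 4 trading dates
# and 25 symbol buckets this yields 100 partitions in total.
def hash_bucket(symbol: str, n: int) -> int:
    # simple deterministic hash so the example is reproducible;
    # not the hash DolphinDB actually uses
    return sum(symbol.encode()) % n

def partition_key(date: str, symbol: str, n_symbol_buckets: int = 25) -> tuple:
    """Return the (date, bucket) pair identifying the partition for one row."""
    return (date, hash_bucket(symbol, n_symbol_buckets))
```

Rows sharing a trading date and a symbol bucket land in the same partition, which keeps per-symbol time-series queries local to a small number of partitions.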

Elasticsearch only allows you to specify the number of shards. For CN_Stock and US_Prices we set the number of shards to 4; for TAQ, to 100. The number of replicas was set to 1.
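For reference, the shard and replica counts above go into the settings body of a standard Elasticsearch create-index request. The helper below only builds that JSON body; the helper itself and the variable names are illustrative:

```python
# Build the settings body for an Elasticsearch create-index request,
# following the shard/replica counts used in this test: 4 shards for
# CN_Stock and US_Prices, 100 for TAQ, and 1 replica each.
def index_settings(n_shards: int, n_replicas: int = 1) -> dict:
    return {
        "settings": {
            "number_of_shards": n_shards,
            "number_of_replicas": n_replicas,
        }
    }

cn_stock_settings = index_settings(4)
taq_settings = index_settings(100)
```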

5. Comparison tests

DolphinDB and Elasticsearch were compared in terms of database query performance, I/O performance, disk footprint, and memory consumption.

5.1 Database Query Performance

The DolphinDB scripting language supports SQL syntax and extends it substantially. Elasticsearch requires a plug-in to accept SQL statements; its native query interface is a Domain Specific Language (DSL) based on the JSON data format. This test uses the DSL.

Elasticsearch is designed primarily for search engines and supports fuzzy matching. By default, it returns only the top 10 query hits, and for aggregate queries the default bucket size is also 10. DolphinDB, on the other hand, returns all query results.

An Elasticsearch aggregation response includes doc_count_error_upper_bound and sum_other_doc_count, which represent, respectively, the upper bound on the error of the aggregation and the number of documents not counted in the returned buckets. In other words, by default Elasticsearch does not aggregate over all matching data in the database, only part of it. To test Elasticsearch in a fair environment, we avoided these approximate results by using the scroll API of Elasticsearch and by raising the bucket size so that all query results are returned.
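Concretely, the two workarounds look like this in the DSL: an initial scroll request with an explicit page size for filtering queries, and a terms aggregation with an explicit bucket size. The helpers below only construct the request bodies (the names and sizes are illustrative); the paging loop against a live cluster is omitted:

```python
def scroll_body(query: dict, page_size: int = 10000) -> dict:
    """Body for the first request of a scroll: return page_size hits per
    batch instead of Elasticsearch's default of 10."""
    return {"size": page_size, "query": query}

def terms_agg_body(field: str, num_buckets: int) -> dict:
    """Aggregation body that raises the bucket size above the default of 10
    so that all groups are returned."""
    return {
        "size": 0,  # suppress individual hits; we only want the buckets
        "aggs": {"by_" + field: {"terms": {"field": field, "size": num_buckets}}},
    }
```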

DolphinDB query performance was tested using DolphinDB scripts; Elasticsearch query performance was tested using Python scripts with the DSL.

We performed several common SQL-style queries against the three data tables. To minimize the influence of chance on the results, each query operation was run 10 times and the total elapsed time, in milliseconds, was averaged. The test scripts and results for each data set are shown in the following tables.

(1) CN_Stock table

DolphinDB query scripts:

Query performance test results (data volume: 5,332,932):

(2) US_Prices table

DolphinDB query scripts:

Query performance test results (data volume: 50,591,907):

(3) TAQ table

DolphinDB query scripts:

Query performance test results (data volume: 1,366,036,384):

For this query performance test, we can draw the following conclusions:

DolphinDB performed many times better than Elasticsearch in every test on the same table. In particular, DolphinDB was one to two orders of magnitude faster than Elasticsearch on simple filtering queries (see queries 1 to 4 for CN_Stock and queries 1 to 4 for US_Prices).

In aggregated queries, DolphinDB also outperformed Elasticsearch by roughly 8 to 9 times. In particular, in aggregated queries grouped by time, DolphinDB performed 13 to 15 times better than Elasticsearch (see the test results for CN_Stock and US_Prices).

As the data size increased (see the TAQ results), DolphinDB remained more stable than Elasticsearch.

5.2 I/O Performance Testing

Elasticsearch provides the _bulk API for writing data in batches. When creating a new document, you first describe each field the document may contain: its data type (such as keyword, text, integer, or date) and whether it should be indexed or stored by Lucene. Elasticsearch then builds a mapping from these attributes and creates an inverted index, forming Lucene segments. Finally, the inverted index is persisted to disk through the refresh and flush mechanisms. This process of flushing the in-memory inverted index to disk is critical to Elasticsearch performance. Setting index.refresh_interval = -1 and index.number_of_replicas = 0 can speed up bulk imports, but when the in-memory buffer fills during a bulk import, a refresh is still triggered and data is flushed to disk, so the optimization has limited effect. Slow data import is a significant drawback of Elasticsearch.
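As a sketch of the format involved, a _bulk request body is newline-delimited JSON in which every document is preceded by an action line. The helper below builds such a body; the index name used in the test is illustrative here:

```python
import json

def bulk_body(index: str, docs: list) -> str:
    """Build an NDJSON payload for the Elasticsearch _bulk API:
    one action line followed by one document line per record."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline
```

In a real import this body would be POSTed to the cluster's /_bulk endpoint in batches; each batch still goes through the mapping, inverted-index, and refresh/flush machinery described above.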

In DolphinDB, when a distributed table is created and data is written to it, the partition scheme of the table determines which data nodes receive the data. Within a partition, data is organized in columnar form, and imports and queries are carried out cooperatively across nodes. Data import is therefore fast and performance is very high.

The following table shows the I/O performance test results of the two systems. The ES/DDB load-time ratio increases as the data volume grows; at 60.6 GB, Elasticsearch's import took over 12 hours. See the appendix for the data import scripts.

5.3 Testing Disk Space Usage

Elasticsearch is famous for its search efficiency and timeliness. It is a distributed search engine built on Lucene; although it compresses the content of the _source field, it builds an inverted index for every document it creates and stores that index on disk. Because each document carries this additional index information, Elasticsearch needs more storage space than the raw data. DolphinDB does not need such per-document index information and compresses the data, so it occupies far less space. The test results are shown in the following table.

5.4 Memory Usage

We used the Linux htop command to monitor the memory usage of DolphinDB and Elasticsearch (out of 32 GB total). The results are as follows:

5.5 Comparison in other aspects

(1) Elasticsearch supports SQL only through a plug-in; its built-in DSL is JSON-based and has a complicated syntax. DolphinDB, on the other hand, comes with a complete scripting language that supports not only SQL but also imperative, vectorized, and functional programming, metaprogramming, and RPC.

(2) Elasticsearch is a distributed, multi-tenant, full-text search engine that supports fuzzy queries. Documents (rows) do not need a fixed structure, and different documents can have different fields. DolphinDB supports only structured data.

(3) DolphinDB provides more than 600 built-in functions for historical data modeling and real-time streaming data processing in finance, and for real-time monitoring and data analysis in the Internet of Things. It supports the leading, lagging, cumulative-window, and sliding-window calculations needed for time-series processing, all with excellent optimized performance. DolphinDB covers more application scenarios than Elasticsearch.

(4) When used as a time-series database, Elasticsearch does not support table joins. DolphinDB supports table joins, including optimized non-synchronous time-series joins such as asof join and window join.
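To illustrate the asof-join semantics mentioned above, here is a plain-Python sketch of the general technique (not DolphinDB's implementation): for each row of the left table, attach the most recent right-table row whose timestamp is at or before it. Field names and data are illustrative.

```python
import bisect

def asof_join(trades, quotes):
    """trades, quotes: lists of (timestamp, value) tuples sorted by timestamp.
    For each trade, attach the latest quote with quote_time <= trade_time,
    or None if no quote precedes the trade."""
    quote_times = [t for t, _ in quotes]
    joined = []
    for t, v in trades:
        i = bisect.bisect_right(quote_times, t) - 1  # last quote at or before t
        joined.append((t, v, quotes[i][1] if i >= 0 else None))
    return joined
```

This is exactly the join pattern used to attach the prevailing quote to each trade in tick data, which a time-series database can execute natively over partitioned tables.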

(5) DolphinDB supports distributed transactions; Elasticsearch does not support transactions.

6. Summary

Elasticsearch supports structured and unstructured data, fuzzy queries, exact queries, and aggregate computation, and suits many application scenarios. Compared with DolphinDB, a professional time-series database, however, it shows a large gap in both functionality and performance for this workload. In particular, once the data volume grows rapidly beyond the physical memory limit, its high memory and disk space consumption become apparent and the performance of historical data computation drops significantly.

DolphinDB and Elasticsearch configuration details, DolphinDB and Elasticsearch test codes, and data import scripts are shown in the appendix.