DolphinDB provides the following four functions to import text data into memory or a database:

  • loadText: imports a text file as an in-memory table.
  • ploadText: imports a text file in parallel as a partitioned in-memory table. This is faster than loadText.
  • loadTextEx: imports a text file into a database, either a distributed database, a local disk database, or an in-memory database.
  • textChunkDS: divides a text file into multiple small data sources, which can then be processed flexibly with the mr function.

DolphinDB text data import is flexible, feature-rich, and very fast. Single-threaded import in DolphinDB can be up to an order of magnitude faster than popular systems such as ClickHouse, MemSQL, Druid, and pandas, and the advantage is even greater with multi-threaded parallel import.

This tutorial introduces common text data import problems, solutions, and precautions.

1. Automatically recognize data formats

Most other systems require the user to specify the format of text data before importing it. DolphinDB automatically recognizes the data format during import, which makes things easier for users.

Automatic format recognition has two parts: field name recognition and data type recognition. If no column in the first line of the file begins with a digit, the system treats the first line as the file header containing the field names. DolphinDB samples a small amount of data and infers the data type of each column from it. Because the inference is based on partial data, the data types of certain columns may be identified incorrectly. For most text files, however, the data can be imported correctly without manually specifying the field names and data types of each column.

Note: Versions prior to 1.20.0 do not support importing the INT128, UUID, and IPADDR types. If a CSV file contains these three data types, make sure you use version 1.20.0 or higher.

The [loadText](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/loadText.html) function imports a text file into a DolphinDB in-memory table. The following example calls loadText to import the data and then examines the structure of the resulting table. Please refer to the appendix for the data files used in the examples.

dataFilePath="/home/data/candle_201801.csv"
tmpTB=loadText(filename=dataFilePath);

Query the first 5 rows of the table:

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

Call the [schema](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/schema.html) function to view the table structure (field names, data types, and other information):

tmpTB.schema().colDefs;

name       typeString typeInt comment
---------- ---------- ------- -------
symbol     SYMBOL     17
exchange   SYMBOL     17
cycle      INT        4
tradingDay DATE       6
date       DATE       6
time       INT        4
open       DOUBLE     16
high       DOUBLE     16
low        DOUBLE     16
close      DOUBLE     16
volume     INT        4
turnover   DOUBLE     16
unixTime   LONG       5

2. Specify the data import format

Each of the four data loading functions described in this tutorial accepts a schema parameter: a table that specifies the name, type, and format of each field, as well as which columns to import. The schema table can contain the following four columns:

  • name: a string indicating the field name.
  • type: a string indicating the data type of the field.
  • format: a string indicating the format of a date or time field.
  • col: an integer indicating the index of the column to be imported.

The name and type columns are required and must be the first two columns. The format and col columns are optional and can appear in any order.
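For example, a schema table can be constructed directly in a script. The following is a minimal sketch; the column names, types, and format shown here are illustrative and not tied to the sample file:

// Required columns: name and type, in that order.
// Optional columns: format (parsing format of a date/time column) and col (index of the column to import).
schemaTB=table(`id`ts`price as name, `INT`DATETIME`DOUBLE as type, ["","yyyy.MM.dd HH:mm:ss",""] as format, 0 1 2 as col)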

2.1 Extracting the schema of text files

The [extractTextSchema](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/extractTextSchema.html) function obtains the schema of a text file, including its field names and data types.

For example, use the extractTextSchema function to get the table structure of the sample file in this tutorial:

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath)
schemaTB;

name       type
---------- ------
symbol     SYMBOL
exchange   SYMBOL
cycle      INT
tradingDay DATE
date       DATE
time       INT
open       DOUBLE
high       DOUBLE
low        DOUBLE
close      DOUBLE
volume     INT
turnover   DOUBLE
unixTime   LONG

2.2 Specifying the field name and type

When the field names or data types recognized automatically by the system do not meet expectations, you can modify the schema table generated by extractTextSchema, or create one directly, to specify the field name and data type of each column in the text file.

For example, if the volume column of the imported data is automatically identified as INT but the required type is LONG, modify the schema table to set the type of the volume column to LONG.

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="LONG" where name="volume";

Import the text file with the loadText function, using the field data types specified by schemaTB:

tmpTB=loadText(filename=dataFilePath,schema=schemaTB);

The example above shows how to change a data type. Field names can be changed in the same way.
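For example, the following sketch renames the time column of the sample file to a hypothetical name, barTime, before loading (the new name is purely illustrative):

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath)
// Rename the field in the schema table; the data type is left unchanged.
update schemaTB set name="barTime" where name="time"
tmpTB=loadText(filename=dataFilePath,schema=schemaTB);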

Note that if the automatic parsing of date- and time-related data types does not meet expectations, see section 2.3 of this tutorial.

2.3 Specify the format of the date and time type

For a date or time column whose automatically identified data type is not as expected, specify the desired data type in the type column of the schema table and the format (a string such as "MM/dd/yyyy") in the format column. For details on how to express date and time formats, see Date and Time Adjustment and Format.

The following examples illustrate how to specify data types for date and time columns.

In DolphinDB, execute the following script to generate the required data file.

dataFilePath="/home/data/timeData.csv" t=table(["20190623 14:54:57","20190623 15:54:23","20190623 16:30:25"] as Time, 'AAPL' MS 'IBM as SYM,2200 5400 8670 as QTY,54.78 59.64 65.23 as price) saveText(t,dataFilePath);Copy the code

Before loading data, use the extractTextSchema function to obtain the schema of the data file:

schemaTB=extractTextSchema(dataFilePath)
schemaTB;

name  type
----- ------
time  SECOND
sym   SYMBOL
qty   INT
price DOUBLE

Clearly, the data type recognized for the time column is not as expected. If the file is loaded directly, the time column will be empty. To load it correctly, specify the data type of the time column as DATETIME and its format as "yyyyMMdd HH:mm:ss":

update schemaTB set type="DATETIME" where name="time"
schemaTB[`format]=["yyyyMMdd HH:mm:ss",,,];

Import the data and view it; the data now displays correctly:

tmpTB=loadText(dataFilePath,,schemaTB)
tmpTB;

time                sym  qty  price
------------------- ---- ---- -----
2019.06.23T14:54:57 AAPL 2200 54.78
2019.06.23T15:54:23 MS   5400 59.64
2019.06.23T16:30:25 IBM  8670 65.23

2.4 Importing specified columns

When importing data, you can use the schema parameter to import only certain columns of a text file.

In the following example, only the seven columns symbol, date, open, high, close, volume, and turnover are loaded from the text file.

First, the extractTextSchema function is called to get the table structure of the target text file.

dataFilePath="/home/data/candle_201801.csv"
schemaTB=extractTextSchema(dataFilePath);

Use the rowNo function to generate a column number for each column and assign it to the col column of the schema table. Then modify the schema table, keeping only the rows for the fields to be imported.

update schemaTB set col = rowNo(name)
schemaTB=select * from schemaTB where name in `symbol`date`open`high`close`volume`turnover;

Please note:

  1. Column numbers start from 0. In the example above, the first column (symbol) corresponds to column number 0.
  2. The order of columns cannot be changed during import. If you need to adjust the column order, load the data file first and then use the [reorderColumns!](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/reorderColumns.html) function (see the sketch at the end of this section).

Finally, use the loadText function and configure the schema parameter to import the columns specified in the text file.

tmpTB=loadText(filename=dataFilePath,schema=schemaTB);

Looking at the first five rows in the table, only the required columns are imported:

select top 5 * from tmpTB

symbol date       open  high  close volume  turnover  
------ ---------- ----- ----- ----- ------- ----------
000001 2018.01.02 13.35 13.39 13.38 2003635 2.678558E7
000001 2018.01.02 13.37 13.38 13.33 867181  1.158757E7
000001 2018.01.02 13.32 13.35 13.35 903894  1.204971E7
000001 2018.01.02 13.35 13.38 13.35 1012000 1.352286E7
000001 2018.01.02 13.35 13.37 13.37 1601939 2.140652E7
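If a different column order is needed after loading, the reorderColumns! function mentioned above can be applied to the in-memory table. A minimal sketch; the target order below is illustrative:

// Move the date column to the front; reorderColumns! modifies tmpTB in place.
reorderColumns!(tmpTB, `date`symbol`open`high`close`volume`turnover);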

2.5 Skip the first few lines of text data

If the first N lines of a file need to be skipped during import, set skipRows to N. The maximum value of this parameter is 1024, since file descriptions are usually not very long. All four loading functions described in this tutorial support the skipRows parameter.

In this example, use the loadText function to import the data file and view the total number of rows in the table after the file is imported, as well as the contents of the first five rows.

dataFilePath="/home/data/candle_201801.csv" tmpTB=loadText(filename=dataFilePath) select count(*) from tmpTB; count ----- 5040 select top 5 * from tmpTB; symbol exchange cycle tradingDay date time open high low close volume turnover unixTime ------ -------- ----- ---------- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 000001 SZSE 1 2018.01.02 2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000 000001 SZSE 1 2018.01.02 2018.01.02 93200000 13.37 13.38 13.33 13.33 867181 1.158757E7 1514856720000 000001 SZSE 1 2018.01.02 2018.01.02 93300000 13.32 13.35 13.32 13.35 903894 1.204971E7 1514856780000 000001 SZSE 1 2018.01.02 2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000 000001 SZSE 1 2018.01.02 2018.01.02 93500000 13.35 13.37 13.35 1601939 2.140652E7 1514856900000Copy the code

Skip the first 1000 lines of the text file and import the file:

tmpTB=loadText(filename=dataFilePath,skipRows=1000)
select count(*) from tmpTB;

count
-----
4041

select top 5 * from tmpTB;

col0   col1 col2 col3       col4       col5      col6  col7  col8  col9  col10  col11      col12
------ ---- ---- ---------- ---------- --------- ----- ----- ----- ----- ------ ---------- -------------
000001 SZSE 1    2018.01.08 2018.01.08 101000000 13.13 13.14 13.12 13.14 646912 8.48962E6  1515377400000
000001 SZSE 1    2018.01.08 2018.01.08 101100000 13.13 13.14 13.13 13.14 453647 5.958462E6 1515377460000
000001 SZSE 1    2018.01.08 2018.01.08 101200000 13.13 13.14 13.12 13.13 700853 9.200605E6 1515377520000
000001 SZSE 1    2018.01.08 2018.01.08 101300000 13.13 13.14 13.12 13.12 738920 9.697166E6 1515377580000
000001 SZSE 1    2018.01.08 2018.01.08 101400000 13.13 13.14 13.12 13.13 469800 6.168286E6 1515377640000

Note: As shown above, when skipping the first n lines, if the first line of the data file contains the column names, it is counted as the first line and skipped as well.

In the example above, because the line containing the column names was skipped, the columns get default names after import: col0, col1, col2, and so on. To keep the column names while skipping the first n lines, obtain the schema of the text file with extractTextSchema and pass it through the schema parameter during import:

schema=extractTextSchema(dataFilePath)
tmpTB=loadText(filename=dataFilePath,schema=schema,skipRows=1000)
select count(*) from tmpTB;

count
-----
4041

select top 5 * from tmpTB;

symbol exchange cycle tradingDay date       time      open  high  low   close volume turnover   unixTime
------ -------- ----- ---------- ---------- --------- ----- ----- ----- ----- ------ ---------- -------------
000001 SZSE     1     2018.01.08 2018.01.08 101000000 13.13 13.14 13.12 13.14 646912 8.48962E6  1515377400000
000001 SZSE     1     2018.01.08 2018.01.08 101100000 13.13 13.14 13.13 13.14 453647 5.958462E6 1515377460000
000001 SZSE     1     2018.01.08 2018.01.08 101200000 13.13 13.14 13.12 13.13 700853 9.200605E6 1515377520000
000001 SZSE     1     2018.01.08 2018.01.08 101300000 13.13 13.14 13.12 13.12 738920 9.697166E6 1515377580000
000001 SZSE     1     2018.01.08 2018.01.08 101400000 13.13 13.14 13.12 13.13 469800 6.168286E6 1515377640000

3. Import data in parallel

3.1 Multi-threaded loading of a single file into memory

The ploadText function loads a text file into memory using multiple threads. It has the same syntax as loadText, except that ploadText loads large files quickly and produces a partitioned in-memory table. It makes full use of multi-core CPUs to load a file in parallel; the degree of parallelism depends on the number of CPU cores on the server and the node's localExecutors configuration.

The following compares the performance of loadText and ploadText when importing the same file.

First, generate a 4GB text file with the following script:

filePath="/home/data/testFile.csv" appendRows=100000000 t=table(rand(100,appendRows) as int,take(string('A'.. 'Z'), appendRows) as symbol, take (2010.01.01.. 2018.12.30,appendRows) as date,rand(float(100),appendRows) as float, 00:00.000 + rand(86400000,appendRows) as time) t.saveText(filePath);Copy the code

Load the file with loadText and ploadText respectively. The node used in this example has a 6-core, 12-thread CPU.

timer loadText(filePath);
Time elapsed: 12629.492 ms

timer ploadText(filePath);
Time elapsed: 2669.702 ms

The results show that with this configuration, ploadText is about 4.7 times as fast as loadText.

3.2 Parallel import of multiple files

In big data applications, data import typically involves not just one or two files but dozens or even hundreds of large files. For better import performance, batch data files should be imported in parallel.

The loadTextEx function imports a text file into a specified database: a distributed database, a local disk database, or an in-memory database. Partitioned tables in DolphinDB support concurrent reads and writes, so multi-threaded data import is supported.

When loadTextEx imports text data into a distributed database, the data is first loaded into memory and then written from memory to the database. The two steps are performed by the same function for high efficiency.
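Before the batch example, here is a minimal single-file sketch of loadTextEx; the database path and the VALUE partition scheme on the date column are illustrative:

login(`admin,`123456)
// Illustrative database: VALUE-partitioned on the date column.
db=database("dfs://exampleDB", VALUE, 2018.01.02..2018.01.30)
// Import one text file directly into partitioned table pt of this database.
pt=loadTextEx(dbHandle=db, tableName=`pt, partitionColumns=`date, filename="/home/data/candle_201801.csv");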

The following example shows how to batch-write multiple files on disk into a DolphinDB partitioned table. First, execute the following script in DolphinDB to generate 100 files, about 778MB in total, comprising 10 million records.

n=100000 dataFilePath="/home/data/multi/multiImport_"+string(1.. 100)+".csv" for (i in 0.. 99){ trades=table(sort(take(100*i+1.. 100,n)) as id,rand(' IBM 'MSFT' GM 'C' FB 'GOOG' V 'F' XOM 'AMZN' TSLA 'PG' S,n) as syM,take(2000.01.01.. 2000.06.30,n) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as Price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5) trades. SaveText (dataFilePath[I])};Copy the code

Create database and table:

login(`admin,`123456) dbPath="dfs://DolphinDBdatabase" db=database(dbPath,VALUE,1.. 10000) tb=db.createPartitionedTable(trades,`tb,`id);Copy the code

DolphinDB's [cut](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/cut.html) function groups adjacent elements of a vector. Call cut to group the paths of the files to be imported, then call the [submitJob](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/submitJob.html) function to assign a write job to each thread and import the data in batches.

def writeData(db,file){
    loop(loadTextEx{db,`tb,`id,},file)
}
parallelLevel=10
for(x in dataFilePath.cut(100/parallelLevel)){
    submitJob("loadData"+parallelLevel,"loadData",writeData{db,x})
};

Note that DolphinDB's partitioned tables do not allow multiple threads to write to the same partition at the same time. In the example above, the values of the partition column (the id column) differ across files, so no two threads write to the same partition. When designing concurrent reads and writes for partitioned tables, ensure that no more than one thread writes to a given partition at the same time.

The [getRecentJobs](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/getRecentJobs.html) function returns the status of the last n batch jobs on the local node. Use a select statement to compute the time taken to import the batch files in parallel: about 1.59 seconds on a 6-core, 12-thread CPU.

select max(endTime) - min(startTime) from getRecentJobs() where jobId like "loadData"+string(parallelLevel)+"%";

max_endTime_sub
---------------
1590

Run the following script to import the 100 files sequentially in a single thread and record the elapsed time: about 8.65 seconds.

timer writeData(db, dataFilePath);
Time elapsed: 8647.645 ms

The results show that with this configuration, importing with 10 parallel threads is about 5.5 times as fast as single-threaded import.

Query the number of records in the data table:

select count(*) from loadTable("dfs://DolphinDBdatabase", `tb);

count
------
10000000

4. Preprocess data before importing

If data needs to be preprocessed before being written to the database, such as converting date and time data types or filling null values, specify the transform parameter when calling the [loadTextEx](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/loadTextEx.html) function. The transform parameter accepts a function that takes exactly one argument: an unpartitioned in-memory table. Its output must also be an unpartitioned in-memory table. Note that only the loadTextEx function provides the transform parameter.

4.1 Specifying data types for date and time data

4.1.1 Converting dates and times represented by numeric types to temporal types

Data representing time in a data file may be stored as integers or long integers, while data analysis often requires such data to be imported and stored as a temporal type. For this scenario, the date and time columns of a text file can be converted using the transform parameter of loadTextEx.

First, create the distributed database and table.

login(`admin,`123456)
dataFilePath="/home/data/candle_201801.csv"
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="TIME" where name="time"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb=db.createPartitionedTable(tb,`tb1,`date);

Define a custom function i2t to preprocess the data and return the processed table.

def i2t(mutable t){
    return t.replaceColumn!(`time,time(t.time/10))
}

Note: When processing data in a custom function, use in-place modifications (functions ending with !) where possible to improve performance.

Call the loadTextEx function with the transform parameter set to i2t. The system executes i2t on the data in the text file and saves the result to the database.

tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=i2t);

View the first five rows of the table. The time column is stored as TIME, not as the INT values in the text file:

select top 5 * from loadTable(dbPath,`tb1);

symbol exchange cycle tradingDay date       time               open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- ------------------ ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01.02 2018.01.02 02:35:10.000000000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:20.000000000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:30.000000000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:40.000000000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01.02 2018.01.02 02:35:50.000000000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

4.1.2 Conversion between date or time data types

If dates in the text file are of DATE type but should be stored as MONTH when imported, the data type of the date column can likewise be converted with the transform parameter of loadTextEx. The procedure is the same as in the previous section.

login(`admin,`123456)
dbPath="dfs://DolphinDBdatabase"
db=database(dbPath,VALUE,2018.01.02..2018.01.30)
schemaTB=extractTextSchema(dataFilePath)
update schemaTB set type="MONTH" where name="tradingDay"
tb=table(1:0,schemaTB.name,schemaTB.type)
tb=db.createPartitionedTable(tb,`tb1,`date)
def d2m(mutable t){
    return t.replaceColumn!(`tradingDay,month(t.tradingDay))
}
tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=d2m);

View the first five rows of the table. The tradingDay column is stored as MONTH, not as the DATE values in the text file:

select top 5 * from loadTable(dbPath,`tb1);

symbol exchange cycle tradingDay date       time     open  high  low   close volume  turnover   unixTime
------ -------- ----- ---------- ---------- -------- ----- ----- ----- ----- ------- ---------- -------------
000001 SZSE     1     2018.01M   2018.01.02 93100000 13.35 13.39 13.35 13.38 2003635 2.678558E7 1514856660000
000001 SZSE     1     2018.01M   2018.01.02 93200000 13.37 13.38 13.33 13.33 867181  1.158757E7 1514856720000
000001 SZSE     1     2018.01M   2018.01.02 93300000 13.32 13.35 13.32 13.35 903894  1.204971E7 1514856780000
000001 SZSE     1     2018.01M   2018.01.02 93400000 13.35 13.38 13.35 13.35 1012000 1.352286E7 1514856840000
000001 SZSE     1     2018.01M   2018.01.02 93500000 13.35 13.37 13.35 13.37 1601939 2.140652E7 1514856900000

4.2 Filling null values

The transform parameter can also be used with DolphinDB's built-in functions. When a built-in function requires multiple arguments, partial application can convert it into a one-argument function. For example, call the [nullFill!](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/nullFill1.html) function to fill the null values in a text file.

db=database(dbPath,VALUE,2018.01.02..2018.01.30)
tb=db.createPartitionedTable(tb,`tb1,`date)
tmpTB=loadTextEx(dbHandle=db,tableName=`tb1,partitionColumns=`date,filename=dataFilePath,transform=nullFill!{,0});

5. Use map-reduce to customize data import

DolphinDB supports customized data import using map-reduce: the data is divided into chunks by row, and each chunk is then imported into DolphinDB.

Use the [textChunkDS](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/textChunkDS.html) function to divide a file into multiple small data sources, then write them to the database with the [mr](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/mr.html) function. The user can also perform flexible data processing before mr stores the data in the database, realizing more complex import requirements.

5.1 Storing stock and futures data from one file into two separate tables

Execute the following script in DolphinDB to generate a 1GB data file containing both stock and futures data.

n=10000000 dataFilePath="/home/data/chunkText.csv" trades=table(rand(`stock`futures,n) as type, Rand (` IBM ` MSFT ` GM ` C ` FB ` GOOG ` V ` F ` XOM ` AMZN ` TSLA ` PG ` S, n) as sym, take (2000.01.01.. 2000.06.30,n) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as Price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5,10000.0+rand(4000.0,n) as price6,rand(10,n) as qty1,rand(100,n) as qty2,rand(1000,n) as qty3,rand(10000,n) as qty4,rand(10000,n) as qty5,rand(10000,n) as qty6) trades.saveText(dataFilePath);Copy the code

Create distributed databases and tables for the stock data and the futures data, respectively:

login(`admin,`123456)
dbPath1="dfs://stocksDatabase"
dbPath2="dfs://futuresDatabase"
db1=database(dbPath1,VALUE,`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S)
db2=database(dbPath2,VALUE,2000.01.01..2000.06.30)
tb1=db1.createPartitionedTable(trades,`stock,`sym)
tb2=db2.createPartitionedTable(trades,`futures,`date);

Define the following function to split the data and write it to the two databases.

def divideImport(tb, mutable stockTB, mutable futuresTB){
    tdata1=select * from tb where type="stock"
    tdata2=select * from tb where type="futures"
    append!(stockTB, tdata1)
    append!(futuresTB, tdata2)
}

Use the textChunkDS function to divide the text file into 300MB chunks; the file is divided into four data sources.

ds=textChunkDS(dataFilePath,300)
ds;

(DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment, DataSource<readTableFromFileSegment)
Copy the code

Import the file into the databases by calling the mr function, specifying the result of textChunkDS as the data source. Since the map function (specified by the mapFunc argument) accepts only a table as its argument, we use partial application here to convert the multi-argument function into a one-argument function.

mr(ds=ds, mapFunc=divideImport{,tb1,tb2}, parallel=false);

Note that different small-file data sources may contain data for the same partition. DolphinDB does not allow multiple threads to write to the same partition at the same time, so parallel must be set to false in the mr call, otherwise an exception will be thrown.

Check the first five rows of the tables in the two databases. The stock database contains only stock data and the futures database contains only futures data.

Stock table:

select top 5 * from loadTable("dfs://DolphinDBTickDatabase", `stock); type sym date price1 price2 price3 price4 price5 price6 qty1 qty2 qty3 qty4 qty5 qty6 ----- ---- ---------- --------- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- stock AMZN 2000.02.14 11.224234 112.26763 1160.926836 11661.418403 11902.403305 11636.093467 4 53 450 2072 9116 12 Stock AMZN 2000.03.29 10.119057 111.132165 1031.171855 10655.048121 12682.656303 11182.317321 6 21 651 2078 7971 6207 stock AMZN 2000.06.16 10.119057 111.132165 1031.171855 10655.048121 12682.656303 11182.317321 6 21 651 2078 7971 6207 stock AMZN 2000.06.16 11.61637 101.943971 1019.122963 10768.996906 11091.395164 11239.242307 091 857 3129 3829 811 stock AMZN 2000.02.20 11.69517 114.607763 1005.724332 10548.273754 12548.185724 12750.524002 1 39 270 4216 8607 6578 stock AMZN 2000.02.23 11.534805 106.040664 1085.913295 11461.783565 12496.932604 12995.461331 4 35 488 4042 6500 4826Copy the code

Futures table:

select top 5 * from loadTable("dfs://DolphinDBFuturesDatabase", `futures); type sym date price1 price2 price3 price4 price5 price6 qty1 qty2 qty3 qty4 qty5 ... ------- ---- ---------- --------- ---------- ----------- ------------ ------------ ------------ ---- ---- ---- ---- ---- -- Futures MSFT 2000.01.01 11.894442 106.494131 1000.600933 10927.639217 10648.298313 11680.875797 9 10 241 524 8325 . Futures S 2000.01.01 10.13728 115.907379 1140.10161 11222.057315 10909.352983 13535.931446 3 69 461 4560 2583... Futures GM 2000.01.01 10.339581 112.602729 1097.198543 10938.208083 10761.688725 11121.888288 11 714 6701 9203... Futures IBM 2000.01.01 10.45422 112.229537 1087.366764 10356.28124 11829.206165 11724.680443 0 47 741 7794 5529... Futures TSLA 2000.01.01 11.901426 106.127109 1144.022732 10465.529256 12831.721586 10621.111858 4 43 136 9858 8487 futures TSLA 2000.01.01 11.901426 106.127109 1144.022732 10465.529256 12831.721586 10621.111858 4 43 136 9858 8487...Copy the code

5.2 Quickly loading the first and last parts of a large file

You can use textChunkDS to divide a large file into multiple small data sources (chunks) and then load only the first and the last one. Execute the following script in DolphinDB to generate the data file:

n=10000000 dataFilePath="/home/data/chunkText.csv" trades=table(rand(`IBM`MSFT`GM`C`FB`GOOG`V`F`XOM`AMZN`TSLA`PG`S,n) as Sym, sort (take (2000.01.01.. 2000.06.30,n)) as date,10.0+rand(2.0,n) as price1,100.0+rand(20.0,n) as price2,1000.0+rand(200.0,n) as Price3,10000.0+rand(2000.0,n) as price4,10000.0+rand(3000.0,n) as price5,10000.0+rand(4000.0,n) as price6,rand(10,n) as qty1,rand(100,n) as qty2,rand(1000,n) as qty3,rand(10000,n) as qty4, rand(10000,n) as qty5, rand(1000,n) as qty6) trades.saveText(dataFilePath);Copy the code

Then use the textChunkDS function to divide the text file into 10MB chunks.

ds=textChunkDS(dataFilePath, 10);

Call the mr function to load the first and the last chunk. Because these two chunks are very small, loading is very fast.

head_tail_tb = mr(ds=[ds.head(), ds.tail()], mapFunc=x->x, finalFunc=unionAll{,false});

Check the number of records in the head_tail_tb table:

select count(*) from head_tail_tb;

count
------
192262

6. Other precautions

6.1 Handling data with different encodings

DolphinDB strings are encoded in UTF-8. If a text file is not UTF-8 encoded, its string columns must be converted after import. DolphinDB provides the [convertEncode](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/convertEncode.html), [fromUTF8](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/fromUTF8.html), and [toUTF8](https://link.zhihu.com/?target=https%3A//www.dolphindb.cn/cn/help/toUTF8.html) functions to convert string encodings after data is imported.

For example, convert the encoding of the exchange column in table tmpTB using the convertEncode function:

dataFilePath="/home/data/candle_201801.csv" tmpTB=loadText(filename=dataFilePath, skipRows=0) tmpTB.replaceColumn! (`exchange, convertEncode(tmpTB.exchange,"gbk","utf-8"));Copy the code

6.2 Parsing of numeric types

DolphinDB automatically parses numeric data types (CHAR, SHORT, INT, LONG, FLOAT, and DOUBLE) during import. The system can recognize numeric data in the following forms:

  • Plain numbers, for example, 123
  • Numbers with thousands separators, for example, 100,000
  • Numbers containing a decimal point (floating-point numbers), for example, 1.231
  • Numbers in scientific notation, for example, 1.23E5

If a column is specified as a numeric type, DolphinDB ignores letters and other symbols before and after the numbers when importing. If a field contains no digits, it is parsed as a NULL value. Specific examples follow.

First, execute the following script to create a text file.

DataFilePath = "/ home/data/testSym CSV" prices1 = [" 2131 ", "$2131", "N/A"] prices2 = [" 213.1 ", "$213.1", The totals = "N/A"] [" 2.658 e7 ", "2.658 e7", "2.658 e-7]" = tt table (1.. 3 as id, prices1 as price1, prices2 as price2, totals as total) saveText(tt,dataFilePath);Copy the code

The created text file contains both numbers and characters in the price1 and price2 columns. If the schema parameter is not specified when importing, the system recognizes both columns as SYMBOL type:

tmpTB=loadText(dataFilePath)
tmpTB;

id price1 price2 total
-- ------ ------ --------
1  2131   213.1  2.658E7
2  $2131  $213.1 -2.658E7
3  N/A    N/A    2.658E-7

tmpTB.schema().colDefs;

name   typeString typeInt comment
------ ---------- ------- -------
id     INT        4
price1 SYMBOL     17
price2 SYMBOL     17
total  DOUBLE     16

If price1 is specified as INT and price2 as DOUBLE, the system ignores the letters and other symbols around the digits. A field with no digits is parsed as NULL.

schemaTB=table(`id`price1`price2`total as name, `INT`INT`DOUBLE`DOUBLE as type)
tmpTB=loadText(dataFilePath,,schemaTB)
tmpTB;

id price1 price2 total
-- ------ ------ --------
1  2131   213.1  2.658E7
2  2131   213.1  -2.658E7
3                2.658E-7

6.3 Automatically removing double quotation marks

In CSV files, double quotation marks are sometimes used around numeric fields that contain special characters such as thousands separators. DolphinDB automatically strips the double quotation marks from such text. Specific examples follow.

In the data file used in the following example, the num column contains numbers with thousands separators wrapped in double quotation marks.

dataFilePath="/home/data/test.csv" tt=table(1.. 3 as id, [" \ "500 \" ", "\" 3500 \ ""," \ "9000000 \" "] as num) saveText (tt, dataFilePath);Copy the code

Import the data and view the table; DolphinDB has automatically removed the double quotation marks:

tmpTB=loadText(dataFilePath)
tmpTB;

id num
-- -------
1  500
2  3500
3  9000000

Appendix

The data file used in the examples in this tutorial: candle_201801.csv.