ClickHouse is a database management system (DBMS) open-sourced in 2016 by Yandex, Russia's largest search engine company, and is designed mainly for online analytical processing (OLAP). Its performance far exceeds that of traditional row-oriented DBMSs, and it has attracted wide attention in recent years.

This article introduces the ClickHouse MergeTree family of table engines and uses examples to analyze the data storage structure of the MergeTree storage engine.

1 Introduction to MergeTree table engine

The MergeTree family of table engines is the most feature-rich storage engine that ClickHouse offers. The MergeTree engine supports primary key indexing, data partitioning, data replication, and data sampling. ClickHouse officially provides seven types of MergeTree engines: MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, and GraphiteMergeTree, as well as the corresponding Replicated* engines that support data replication.

Let's start with the engines built on top of the MergeTree core:

  • ReplacingMergeTree: during background data merging, rows with the same sort key are deduplicated (a sketch follows this list).
  • SummingMergeTree: during data merging, records with the same primary key are merged into a single record. According to the configured aggregation columns, the value of each aggregation column is the sum over the merged records, while non-aggregation columns take the value of the first record. The aggregation columns must be of numeric type.
  • AggregatingMergeTree: aggregates data with the same primary key within the same data partition.
  • CollapsingMergeTree: collapses rows that have the same primary key within the same data partition.
  • VersionedCollapsingMergeTree: builds on the CollapsingMergeTree engine and adds a configuration option for a data version column. Data is sorted according to the ORDER BY setting; if the version column is not among the sorting fields, it is implicitly appended as the last column of the ORDER BY and thus affects the sort order.
  • GraphiteMergeTree: used to store data from the time-series database Graphite.
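
To make the difference concrete, here is a minimal sketch of ReplacingMergeTree deduplication. The table name test.events_dedup and its columns are hypothetical examples, not part of the original article:

CREATE TABLE test.events_dedup
(
    id          UInt64,
    value       String,
    update_time DateTime
) ENGINE = ReplacingMergeTree(update_time)
  ORDER BY id;

-- Two rows share the same sort key (id = 1); after a background merge
-- (or a manual OPTIMIZE ... FINAL) only the row with the larger update_time remains.
INSERT INTO test.events_dedup VALUES (1, 'old', toDateTime('2021-03-01 00:00:00'));
INSERT INTO test.events_dedup VALUES (1, 'new', toDateTime('2021-03-02 00:00:00'));
OPTIMIZE TABLE test.events_dedup FINAL;
SELECT * FROM test.events_dedup;   -- returns only (1, 'new', 2021-03-02 00:00:00)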

MergeTree is the core of the engine family. The other engines are based on MergeTree and implement different behaviors during the data merge process, which is what forms the MergeTree table engine family. Below, we use MergeTree itself to understand the whole family.

2 MergeTree engine

2.1 Table Creation

The DDL for creating MergeTree looks like this:

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]

Here are the main parameters of the MergeTree engine:

Required options

  • ENGINE: the name of the table engine. The MergeTree engine takes no parameters.
  • ORDER BY: Sort key, which can consist of one or more columns and determines how the data is sorted, for example ORDER BY (CounterID, EventDate). If no PRIMARY KEY is specified, ORDER BY is used as the PRIMARY KEY. Usually, just specify ORDER BY.

Optional options

  • PARTITION BY: specifies the rule by which data in the table is partitioned. A partition is a logical data set divided according to the specified rule. Partitioning can be done by any criteria, such as by month, by day, or by event type. Each partition is stored separately to reduce the amount of data that needs to be scanned.

  • PRIMARY KEY: the primary key. When set, a primary index file (primary.idx) is generated based on the primary key to accelerate queries. By default the primary key is the same as the ORDER BY setting, so it is common to specify only ORDER BY and omit PRIMARY KEY.

  • SAMPLE BY: the data sampling expression. If this option is specified, the same expression must also be included in the primary key. For example: ORDER BY (CounterID, EventDate, intHash32(UserID)) together with SAMPLE BY intHash32(UserID).

  • TTL: the time-to-live of the data. A TTL can be set for a single column or for the whole table, and the TTL expression must involve a Date or DateTime column. If set on a column, the expired data in that column is removed; if set at the table level, expired rows in the table are deleted. If both are set, whichever expires first takes effect. For example, TTL createTime + INTERVAL 1 DAY means the data expires one day later. Typical usage scenarios are periodically deleting or archiving data (a sketch follows this list).

  • index_granularity: the index interval granularity. The MergeTree index is a sparse index, and one index entry is generated for every index_granularity rows. index_granularity defaults to 8192.

  • enable_mixed_granularity_parts: whether to enable index_granularity_bytes to control the index granularity.

  • index_granularity_bytes: the index granularity in bytes; the default is 10 MB.

  • merge_max_block_size: the maximum number of rows per block when merging data; the default is 8192.

  • merge_with_ttl_timeout: the minimum interval between TTL merges; the default is one day.
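
As a sketch of the TTL option, the hypothetical table below (test.events_ttl, not part of the original example) combines a column-level TTL with a table-level TTL:

CREATE TABLE test.events_ttl
(
    id          UInt64,
    create_time DateTime,
    payload     String TTL create_time + INTERVAL 1 DAY   -- expired values of this column are reset to the default
) ENGINE = MergeTree()
  ORDER BY id
  TTL create_time + INTERVAL 1 MONTH;                     -- rows older than one month are deleted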

2.2 Data Storage Structure

Create a test table with DDL as follows:

CREATE TABLE test.test 
( 
    id        UInt64, 
    type      UInt8, 
    create_time DateTime 
) ENGINE = MergeTree() 
  PARTITION BY toYYYYMMDD(create_time)
  ORDER BY (id)
  SETTINGS index_granularity = 4;

The test table contains three fields: id, type, and create_time. The create_time date field is used as the partitioning key, converted to the YYYYMMDD format, and the data is sorted by the id field. Since the primary key is not explicitly set, the engine defaults to using the id column specified by ORDER BY as the index field and generates the index file. index_granularity is set to 4, meaning that one index entry is generated for every four rows.

Insert one test record:

insert into test.test(id, type, create_time) VALUES (1, 1, toDateTime('2021-03-01 00:00:00'));

Run the following command to view partition information about the test table:

 SELECT 
    database,   
    table,      
    partition,     
    partition_id,     
    name,     
    active,     
    path  
FROM system.parts  
WHERE table = 'test' 

The result is as follows:

Table test returns one partition whose partition_id is 20210301. From the name field we can see that the directory of this partition is 20210301_8_8_0. What does the directory name 20210301_8_8_0 mean? Below are the partition ID generation rules and the naming rules of the partition directory.

2.2.1 Rules for Generating Data Partition IDS

Data partitioning is determined by the partition ID, and the partition ID is derived from the PARTITION BY key. According to the type of the partition key field, the ID generation rules can be divided into the following cases:

  • No partition key defined: if PARTITION BY is not specified, a data partition named all is generated by default, and all data is stored under the all directory.

  • Integer partition key: if the partition key is an integer, the string form of the integer value is used as the partition ID.

  • Date-type partition key: if the partition key is of a date type, or can be converted to a date type, the date formatted as YYYYMMDD is used as the partition ID.

  • Other partition keys: for other types such as String and Float, a 128-bit hash of the key value is used as the partition ID.

Above we inserted one record whose create_time is 2021-03-01 00:00:00. After this field is formatted with toYYYYMMDD, the generated partition ID is 20210301.
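
The hypothetical tables below (table names are ours, not from the original article) sketch how each kind of partition key maps to a partition ID; the partition_id column of system.parts can be used to verify the result:

-- No PARTITION BY: a single partition named 'all'.
CREATE TABLE test.p_none (id UInt64) ENGINE = MergeTree() ORDER BY id;

-- Integer key: the string form of the integer value, e.g. '42'.
CREATE TABLE test.p_int (id UInt64, code UInt16) ENGINE = MergeTree() PARTITION BY code ORDER BY id;

-- Date key: the YYYYMMDD form, e.g. '20210301'.
CREATE TABLE test.p_date (id UInt64, dt DateTime) ENGINE = MergeTree() PARTITION BY toYYYYMMDD(dt) ORDER BY id;

-- Other types (String, Float, ...): a 128-bit hash of the key value.
CREATE TABLE test.p_str (id UInt64, tag String) ENGINE = MergeTree() PARTITION BY tag ORDER BY id;

-- After inserting data, check the generated partition IDs:
SELECT table, partition, partition_id FROM system.parts WHERE database = 'test' AND active;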

2.2.2 Data Partition Directory Naming Rules

Directory naming rules are as follows:

PartitionId_MinBlockNum_MaxBlockNum_Level
  • PartitionID: indicates the ID of a partition, for example, 20210301.
  • MinBlockNum: the minimum data block number. It is a table-level counter that starts at 1 and increases by 1 for every newly created partition directory.
  • MaxBlockNum: the maximum data block number. For a newly created partition directory, MaxBlockNum equals MinBlockNum.
  • Level: the merge level, i.e. the number of times this partition has been merged. The more merges, the greater the Level value.

In our example directory, the partition ID is 20210301, MinBlockNum and MaxBlockNum are both 8, and Level is 0, indicating that this partition has not been merged yet.
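
A sketch of how the directory name evolves after another insert and a merge (the block numbers below are illustrative; the actual numbers depend on how many parts the table has created before):

-- A second insert into the same partition creates a new part, e.g. 20210301_9_9_0.
INSERT INTO test.test(id, type, create_time) VALUES (2, 1, toDateTime('2021-03-01 00:00:00'));

-- Force a merge; the merged part would be named 20210301_8_9_1:
--   MinBlockNum = 8 (smallest block number among the merged parts),
--   MaxBlockNum = 9 (largest), Level = 1 (merged once).
OPTIMIZE TABLE test.test FINAL;

SELECT name, active FROM system.parts WHERE table = 'test' AND partition = '20210301';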

2.3 Data Partition File Organization Structure

Now that you know the rules for generating partition directory names, let's look at the file organization under the data partition directory, taking partition 20210301_8_8_0 as an example.

The files in this directory mainly include bin files, mrk files, the primary.idx file, and other related files.

  • Bin file: a data file that stores data in a specific column. Each column in the data table corresponds to a bin file with the same field name. For example, id.bin stores the data in the ID column of the test table.

  • MRK file: the mark file. Each column has a mark file with the same name as the field. The mark file acts as a bridge between the primary.idx index file and the bin data file. A file ending in .mrk2 indicates that the table has adaptive index granularity enabled.

  • primary.idx file: the primary key index file, used to speed up queries.

  • count.txt: records the total number of rows in the data partition. In the data partition 20210301_8_8_0 above, the value recorded in this file is 1.

  • columns.txt: records information about all the columns in the table, including field names and field types.

  • partition.dat: stores the value of the partition expression. In data partition 20210301_8_8_0, the value stored in this file is 20210301.

  • minmax_create_time.idx: stores the minimum and maximum values of the partition key column (create_time) within this partition.

  • checksums.txt: a checksum file used to verify the integrity of the other files; it stores the size and hash of each file.
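
To locate this directory on disk and inspect these files, one way is to query the part's path from system.parts (the returned path depends on the local ClickHouse data directory):

-- The returned path points at the partition directory that contains checksums.txt,
-- columns.txt, count.txt, primary.idx, partition.dat, minmax_create_time.idx and
-- the per-column .bin/.mrk2 files.
SELECT path FROM system.parts WHERE database = 'test' AND table = 'test' AND name = '20210301_8_8_0';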

2.3.1 Data files

In MergeTree, each column has a corresponding bin file that stores that column's data; for example, id.bin stores the data of the id column. All data is compressed, sorted, and finally written to the bin file in the form of data blocks. Data is written to the bin file in units of compressed data blocks, and each data block consists of header information and compressed data. The header information includes a checksum, the compression algorithm, the size before compression, and the size after compression. The compressed data is composed of one or more granules, and the granule size is related to index_granularity.
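
The per-column data files and their compressed/uncompressed sizes can be inspected through the system.parts_columns system table, for example:

-- One row per column per data part; the sizes reflect the column's bin file on disk.
SELECT column, type, data_compressed_bytes, data_uncompressed_bytes
FROM system.parts_columns
WHERE database = 'test' AND table = 'test' AND active;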

2.3.2 Index Files

The MergeTree index is a sparse index: it does not index every single row, but a range of rows. The primary.idx index file is generated by periodically taking the primary key values from the fully sorted data, which speeds up queries on the table. The interval is set by the index_granularity parameter.

We insert 9 records into table test:

insert into test.test(id, type, create_time) VALUES (1, 1, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (1, 2, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (1, 3, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (2, 1, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (2, 1, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (3, 1, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (3, 1, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (4, 1, toDateTime('2021-03-01 00:00:00')); 
insert into test.test(id, type, create_time) VALUES (5, 1, toDateTime('2021-03-01 00:00:00'));

Because index_granularity is set to 4, one index entry is generated for every four rows, using the id values of the 1st, 5th, and 9th rows.
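
A sketch of how the sparse index is used at query time: in recent ClickHouse versions, EXPLAIN indexes = 1 shows how many granules the primary key index selects for a condition on the sorting key:

-- Only the granules whose index marks may contain id = 3 are read,
-- instead of scanning the whole data part.
EXPLAIN indexes = 1
SELECT count() FROM test.test WHERE id = 3;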

2.3.3 Mark files

The MRK mark file acts as a bridge between the primary.idx index file and the bin data file. Each index entry in the primary.idx file has a corresponding record in the mrk file. Each record consists of:

  • Offset in the compressed bin file: the offset of the compressed data block within the bin file.

  • Offset in the decompressed data block: the offset of the referenced data within the decompressed data block.

  • Row count: the number of rows covered by the mark, which is less than or equal to the value of index_granularity.

Putting the pieces together: an entry in the primary.idx index file points to a record in the mrk mark file, and that record in turn locates the compressed block in the bin data file and the row offset within it.


Author: Zhang Kai