This is the 12th day of my participation in the August Text Challenge.More challenges in August

The table engine of a database determines the presentation of a table. In The case of InnoDB, its index structure determines how data is organized, how data is loaded, and how transactions are supported.

ClickHouse has a very large table engine system. Of the many families, MergeTree is used in the widest range of production environments. Because it only supports primary key indexes, data partitioning, and data copies.

Because there are so many varieties of MergeTree, it’s hard to know all of them. But the family has basic characteristics common to other tables. This section is interpreted from the basic structure.

Storage structure

When MergeTree is used to write a batch of data, the data is written to disk as a Segment, and this is immutable. In order to avoid too many segments, ClickHouse periodically merges these segments using a background thread. All segments belonging to the same partition are merged into a new Segment.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[SETTINGS name=value. ]Copy the code
  • PARTITION BY

    The partitioning key. Specifies which field the table data is partitioned with. Supports the combination of one field and multiple fields into tuples.

    If you do not declare a partitioning key, ClickHouse generates a large partition of All.

  • ORDER BY

    Sort keys. Specifies which field the data is sorted by. The default KEY is the same as the PRIMARY KEY. Single fields and multiple fields can be combined into tuples. Sorting is the same as Mysql.

  • PRIMARY KEY

    The primary key. After the declaration, index will be generated according to the primary key subsegment to speed up the query. The default is the same as Order By, so it is common to specify the primary key directly using the subsegment of the sort key.

    Therefore, in general, the data and level 1 index in a single Segment are arranged in the same ascending order.

Then let’s look at his physical storage structure:

  1. primary.idxStore sparse index, can exclude data except primary key in query, reduce invalid scan speed up query.
  2. If partition expression:Partition BY toYYYYMM(create_time). thenpartition.datIs the calculated valueThe 2020-06.minmax_create_time.idxDeposit isThe 2020-06-06, 2020-06-06.
  3. With partition index, queries can skip unnecessary data partitions.

Data partition

In MergeTree, data is organized in the form of partitions. This is different from data sharding, where data from a table is distributed across different database nodes. Data partition is the vertical segmentation of data, and data sharding is the horizontal expansion.

The MergeTree data partition rule is determined by the partition ID, that is, the partition key value. Partition BY key = Partition BY key; Partition BY key = Partition BY key;

Partition key type data case partition express The partition ID
There is no partitioning key all
The integer 7,8,9 Partition By day_num 7. 8; 9
'java','py','go' Partition By length(lang) 2; 4
The date of The 2020-06-01, 2020-06-01 Partition By create_day 20200601; 20200608
The 2020-05-01, 2020-05-01 Partition By toYYYYMM(createday) 202005; 202006
other 'clickhouse' Partition By lang Hash (lang) computes the hash value as the partition ID

Partition merging

  1. First, new partitions are created only when data is inserted.
  2. Partitions created are also not static and are automatically merged after a certain amount of time (10-15 minutes for writes).
  3. Every timeInsertAll generate new partition data. Unlike other databases, this one will automatically insert into the same partition. That is, different partition directories are generated for the same partition.
  4. The existing partitions will not be deleted immediately after being merged. The old partition will be setactive=0, these partition records will be filtered out and automatically deleted after a certain period of time (8 minutes by default).

As can be seen from the partition merge above:

  1. MinBlock/MaxBlockTake the values of the same partition when mergingMin/MaxValue;
  2. LevelAt every merge++;