A pre-aggregation engine in Clickhouse

Author: LemonNan

The original address: mp.weixin.qq.com/s/qXlmGTr4C…

Note: the author and original address should be indicated

introduce

This article will introduce the SummingMergeTree and AggregatingMergeTree pre-aggregation engines in Clickhouse. They are both inherited from MergeTree and belong to the MergeTree engine family. For those of you who have not read MergeTree, you can take a look at the previous article: MergeTree Indexing principles, which only discusses the use of these two engines.

SummingMergeTree

The SummingMergeTree engine periodically merges data after insertion. Clickhouse merges data with the same primary key in the same partition into a single row. If multiple partitions exist, it is very likely that there will be multiple rows with the same primary key, so when querying, Sum () and group by are used for aggregation. If a primary key corresponds to a large number of rows, using SummingMergeTree can be very effective in reducing the amount of space taken up by the data store (if only pre-aggregate engine tables) and speeding up aggregate queries.

Merging rules

The default is to merge and sum all numeric type fields except the primary key, when the set of column parameters are defined by the tableSummingMergeTree([columns])Column of,The column field is not allowed in a primary key and must be numericIf no column is specified, the default value isAll numeric type fields except primary key
If all columns are 0 at the time of the merge, the primary key row data will be deleted
If the column is not in the primary key and is not in the summary field, a value is randomly selected from the existing column
Fields located in primary keys are not merged

An 🌰

Here is an example of a user’s purchase record, which contains fields such as time, user ID, price, and item ID

Built the data table

SQL create database if not exists test; Create table if not exists test.shopping_record(' shop_time 'DateTime64(3, 'UTC') COMMENT' buy time ', 'user_id' String COMMENT 'user id',' price 'Decimal(6,2) COMMENT' purchase price ', 'product_id' String COMMENT 'iD') ENGINE = SummingMergeTree(price) Partition by toYYYYMM(shop_time) order by (toYYYYMMDD(shop_time), user_id)Copy the code

Insert data

Insert into shopping_record values ('2022-02-23 16:43:33.000','user_1',70,'product_1'),('2022-02-23 ' 16:43:22. 000 ', 'user_1', 20.33 'product_1'), (' 2022-02-23 17:43:44. 000 ', 'user_1, 433.99,' product_1 '), (' 2022-03-10 18:43:55. 000 ', 'user_1', 76.23 'product_1'), (' 2022-03-11 19:43:15. 000 ', 'user_1' 99, 'product_1'), (' 2022-03-10 20:43:32. 000 ', 'user_1, 37,' product_1 ');Copy the code

Looking at the data here, you can see that the engine has pre-aggregated some of the data

Manually merge

If the data is merged, the data of the same primary key will be merged for price summation. Because the amount of data is too small, the result of the query has been aggregated, so the above user_1 data is inserted several times

Manual merging of data

Optimize table shopping_record final;Copy the code

Query result after merging

This is the automatic aggregation effect of the engine, but the actual query needs to be performedsumAs well asgroup by, partly because the data may not have been aggregated yet, and partly because the data may be in multiple partitions, which need to be aggregated

SQL > create table primary key toYYYYMMDD(shopping_time); Select toYYYYMMDD(shop_time) as day, user_id, sum(price) from shopping_record group by day, user_id order by dayCopy the code

Here are the results of the query

New data insertion

So let’s add user 2’s purchase record

# User 2's purchase record, Insert into shopping_record values ('2022-02-23 16:43:33.000','user_2',33.33,'product_1'),('2022-02-24 ' 16:43:22. 000 ', 'user_2', 99.99 'product_2'), (' 2022-02-26 10:00:00. 000 ', 'user_2, 78.3,' product_3 ');Copy the code

Perform manual SQL merge and query

You can see that the data is preaggregated by date and user

AggregatingMergeTree

AggregatingMergeTree is also a type of pre-aggregation engine. Unlike SummingMergeTree, AggregatingMergeTree can specify various aggregation functions, whereas SummingMergeTree can only handle numerical summation.

When using AggregatingMergeTree for storage, you need to use the state-ending function to store intermediate state values

The merge end function is used to process the intermediate state values of state

So here’s 🌰

Here take “the number of book visits on e-commerce platforms and the length of visits” as an example

Built the data table

Create table if not exists test.book(' user_id 'String COMMENT' id', 'book_id' String COMMENT 'id', 'view_time' Int32 COMMENT 'create_time' DateTime64(3, 'UTC') COMMENT 'create_time ') ENGINE = MergeTree() partition by toYYYYMM(create_time) order by (create_time, Book_id) # Preaggregate the book browsing records with the materialized view CREATE MATERIALIZED VIEW IF NOT EXISTS test.book_mv engine = AggregatingMergeTree() partition by day order by (day, book_id) as select toYYYYMMDD(create_time) as day, book_id as book_id, count() as visit, sumState(view_time) as sum_view_time from test.book group by day, book_idCopy the code

Insert data

insert into test.book values
('user_1', 'book_1', 100, '2022-02-23 16:00:00'),
('user_2', 'book_1', 112, '2022-02-23 17:35:00'),
('user_1', 'book_1', 94, '2022-02-23 18:05:00'),
('user_1', 'book_1', 67, '2022-02-23 20:05:00'),
('user_1', 'book_1', 30, '2022-02-25 16:00:00'),
('user_1', 'book_2', 245, '2022-02-23 16:10:00'),
('user_1', 'book_2', 39, '2022-02-23 19:10:00'),
('user_2', 'book_2', 78, '2022-02-23 23:17:00'),
('user_2', 'book_2', 60, '2022-03-10 09:49:00'),
('user_2', 'book_3', 30, '2022-03-10 10:49:00'),
('user_2', 'book_4', 44, '2022-03-10 11:49:00'),
('user_2', 'book_5', 75, '2022-03-10 12:49:00'),
('user_2', 'book_6', 20, '2022-03-10 13:49:00');
Copy the code

Results of the query

Query the raw data table

Because this is the preaggregation engine used by materialized views, Select day, book_id, sum(visit), sumMerge(sum_view_time) as view_time from book_mv group by day, book_id order by day, book_id;Copy the code

The pre-aggregation results are shown below

As you can see, the pre-aggregation engine aggregates the original browsing history into the required information, the number of views per book per day, and the total page time per book per day

conclusion

The use of aggregation engine can save storage space and speed up data aggregation search in some statistical cases (single pre-aggregation engine table), but usually contains two tables, the original table of MergeTree and the data table of pre-aggregation engine. In the case of two tables, it is necessary to exchange space for time. Materialized views persist pre-aggregated results, taking up extra space in addition to the original table, but querying pre-aggregated engine tables can speed up the aggregation of queries.

The last

Please scan the qr code below or search LemonCode to exchange and learn together!

A pre-aggregation engine in Clickhouse

introduce

SummingMergeTree

Merging rules

An 🌰

Built the data table

Insert data

Manually merge

New data insertion

AggregatingMergeTree

So here’s 🌰

Built the data table

Insert data

Results of the query

conclusion

The last

Related Posts

Automatic spawn of King Gold with Python script

Spark Streaming, an unbounded Streaming processing system, is implemented in Spring Boot in three minutes, and the statistics of Streaming likes are realized

These questions are almost mandatory in an interview