Because the online query takes longer than 1s, it needs to be optimized. To speed up the query, we created a materialized view on the base table:

CREATE MATERIALIZED VIEW dwst.tt
(
    `sn_sort_key` UInt8,
    `id` UInt64,
    `type` UInt8,
    `is_profit_avg` UInt8,
    `bd1` UInt64,
    `bd2` UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{ck_cluster}/dwst/tt', '{replica}')
PARTITION BY sn_sort_key
ORDER BY (id, type, bd1, bd2)
SETTINGS index_granularity = 8192
AS SELECT
    halfMD5(id) % 64 AS sn_sort_key,
    id,
    type,
    multiIf((sum(v1) - sum(v2)) < 0, 2, 1) AS is_profit_avg,
    bd1,
    bd2
FROM dwst.base_detail
WHERE date <= (today() - 10)
GROUP BY sn_sort_key, id, type, bd1, bd2

Some details have been removed for security reasons. The base_detail table aggregates by id, type, bd1, and bd2 to determine, up to t-10, whether each combination is profitable. Because the base table is very large, we wanted to use a materialized view to precompute the data and cut SQL query time. In practice we ran into two problems and came away with a deeper understanding of ClickHouse materialized views.
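For context, a minimal sketch of what the base table might look like, inferred from the columns the view query references; the types, engine, and partitioning below are assumptions, not the real production DDL:

CREATE TABLE dwst.base_detail
(
    `date` Date,     -- event date, used for the t-10 filter
    `id` UInt64,
    `type` UInt8,
    `bd1` UInt64,
    `bd2` UInt64,
    `v1` Int64,      -- income-side value (assumed)
    `v2` Int64       -- cost-side value (assumed)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{ck_cluster}/dwst/base_detail', '{replica}')
PARTITION BY toYYYYMM(date)
ORDER BY (id, type, bd1, bd2)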

Problem 1: the total number of rows in the view is inconsistent

  • Locally, we simulated writes to the base table with INSERT ... SELECT FROM remote(...), which triggers the materialized view's computation (a sketch appears after the quoted explanation below). However, after repeatedly refreshing the base table this way, the total number of aggregated rows in the view was different each time.

    We checked ClickHouse's official issues for similar reports and indeed found two, both about duplicated data in a MATERIALIZED VIEW. The official explanation is as follows:

    Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application doesn't know if the data was written to the DB, so the INSERT query can simply be repeated. It doesn't matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings.
In other words, ClickHouse INSERTs are idempotent: if the client cannot confirm that a write succeeded (for example, after a network failure) and repeats it, the identical data block is written only once. For the duplicated view data, two solutions are recommended:

  • Use subqueries to post-process the duplicated data and deduplicate it at query time (the officially recommended approach).
  • Use the ReplicatedReplacingMergeTree engine to deduplicate, which is what I used in practice: after each batch of inserts, run OPTIMIZE TABLE to collapse the duplicates.
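A minimal sketch of the second approach, together with the remote-insert simulation mentioned above; the host, credentials, and the target name dwst.tt_dedup are placeholders for illustration:

-- Simulate production writes locally by copying rows from a remote node:
INSERT INTO dwst.base_detail
SELECT * FROM remote('remote-host:9000', dwst.base_detail, 'user', 'password');

-- Same view query as before, but on a deduplicating target engine;
-- ReplacingMergeTree keeps one row per sorting key after merges:
CREATE MATERIALIZED VIEW dwst.tt_dedup
(
    `sn_sort_key` UInt8,
    `id` UInt64,
    `type` UInt8,
    `is_profit_avg` UInt8,
    `bd1` UInt64,
    `bd2` UInt64
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{ck_cluster}/dwst/tt_dedup', '{replica}')
PARTITION BY sn_sort_key
ORDER BY (id, type, bd1, bd2)
AS SELECT
    halfMD5(id) % 64 AS sn_sort_key,
    id,
    type,
    multiIf((sum(v1) - sum(v2)) < 0, 2, 1) AS is_profit_avg,
    bd1,
    bd2
FROM dwst.base_detail
WHERE date <= (today() - 10)
GROUP BY sn_sort_key, id, type, bd1, bd2;

-- After each batch of inserts, force an unscheduled merge so rows that
-- share the same sorting key are collapsed:
OPTIMIZE TABLE dwst.tt_dedup FINAL;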

Problem 2: the is_profit_avg values are inconsistent

After deduplicating the data, the total row count was correct, but the aggregated is_profit_avg values still differed from run to run, which was rather frustrating. Searching the official documentation turned up the following:
A materialized view is implemented as follows: when inserting data to the table specified in `SELECT`, part of the inserted data is converted by this `SELECT` query, and the result is inserted in the view.

Important: Materialized views in ClickHouse are implemented more like insert triggers. If there's some aggregation in the view query, it's applied only to the batch of freshly inserted data. Any changes to existing data of source table (like update, delete, drop partition, etc.) doesn't change the materialized view.

In essence, a materialized view behaves like an insert trigger: if the view query contains an aggregation, it is applied only to the batch of freshly inserted data. Other changes to the source table, such as updates, deletes, or dropped partitions, do not change the materialized view.

A `SELECT` query can contain `DISTINCT`, `GROUP BY`, `ORDER BY`, `LIMIT`... Note that the corresponding conversions are performed independently on each block of inserted data. For example, if `GROUP BY` is set, data is aggregated during insertion, but only within a single packet of inserted data. The data won't be further aggregated. The exception is when using an `ENGINE` that independently performs data aggregation, such as `SummingMergeTree`.

That is, the SELECT query can contain DISTINCT, GROUP BY, ORDER BY, and LIMIT, but these transformations are applied independently to each block of inserted data. For example, with GROUP BY, data is aggregated at insert time only within a single inserted block; blocks are never aggregated with one another afterwards, unless the target engine performs its own aggregation (for example, SummingMergeTree).
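To make the block-local behavior concrete, here is a small self-contained demo; the demo database, table, and view names are made up for illustration:

CREATE DATABASE IF NOT EXISTS demo;

CREATE TABLE demo.src
(
    `k` UInt32,
    `v` Int64
)
ENGINE = MergeTree
ORDER BY k;

CREATE MATERIALIZED VIEW demo.mv
ENGINE = MergeTree
ORDER BY k
AS SELECT k, sum(v) AS total
FROM demo.src
GROUP BY k;

-- Two separate INSERTs arrive as two blocks, so the view's GROUP BY
-- runs once per block and stores two rows for k = 1:
INSERT INTO demo.src VALUES (1, 10);
INSERT INTO demo.src VALUES (1, 5);

SELECT k, total FROM demo.mv;                  -- two rows: (1, 10) and (1, 5)
SELECT k, sum(total) FROM demo.mv GROUP BY k;  -- re-aggregate at query time: (1, 15)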

Conclusion

In our scenario, the profit value per GROUP BY dimension is the overall difference across all historical data, so the computation has to touch every historical row. That does not fit the materialized-view model: a materialized view essentially processes a stream, accumulating each piece of data as it arrives, rather than computing a value over the whole offline data set at once. For this query we therefore abandoned the materialized view and instead use an intermediate table that is recomputed once a day.
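A minimal sketch of the replacement approach, assuming an intermediate table named dwst.tt_daily (with the same columns as the view above) and an external scheduler such as cron that runs the job once a day:

-- Rebuild the intermediate table from the full history once a day:
TRUNCATE TABLE dwst.tt_daily;

INSERT INTO dwst.tt_daily
SELECT
    halfMD5(id) % 64 AS sn_sort_key,
    id,
    type,
    multiIf((sum(v1) - sum(v2)) < 0, 2, 1) AS is_profit_avg,
    bd1,
    bd2
FROM dwst.base_detail
WHERE date <= (today() - 10)
GROUP BY sn_sort_key, id, type, bd1, bd2;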

References

1: Duplicated data in MATERIALIZED VIEW

2: Duplicated primary key in materialized view

3: Data Replication

4: CREATE VIEW

Original link: www.jianshu.com/p/bdf7f258d…