Table structure optimization

Table engine

There are many table engines in ClickHouse. The most widely used is the *MergeTree family, which is also the most robust; choose a different engine when the application scenario calls for it, for example the Memory engine for temporary tables. Here we discuss some optimizations for MergeTree.

The official documentation lists four MergeTree features:

  1. Data is stored sorted by primary key;
  2. Partitions can be defined by specifying a partitioning key;
  3. Data replication (backup) is supported;
  4. Data sampling is supported.

Data in a MergeTree table is organized as a set of parts; within each part the rows are sorted lexicographically by primary key, and the parts are periodically merged in the background. For details on partitioning and merging, see ClickHouse | MergeTree engine data partition.

There are several main points about MergeTree optimization:

  1. Partition granularity is determined by the characteristics of the business. Taking a single table with 100 million rows as an example, 10-30 partitions is a good target.
  2. Set the primary key properly. By default the primary key is the sort key, i.e. the fields specified by `ORDER BY` when the table is created. The `PRIMARY KEY` keyword can specify a primary key that differs from the sort key, but it must be a prefix of the sort key tuple: "In this case the primary key expression tuple must be a prefix of the sorting key expression tuple." There is no hard limit on primary key length, but an excessively long primary key hurts write performance (queries are unaffected).
  3. If the data is time-sensitive, use `TTL` to set an expiration time. A TTL can be applied to the whole table or to individual columns; see the TTL section of the MergeTree documentation. A sketch combining these three points follows this list.
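
A minimal sketch pulling the three points together; the table, columns, and TTL interval are hypothetical examples, not from the source:

    CREATE TABLE events
    (
        event_date Date,
        event_time DateTime,
        user_id    UInt64,
        action     String
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)          -- partition granularity driven by the business
    ORDER BY (event_date, user_id, event_time) -- sort key (the default primary key)
    PRIMARY KEY (event_date, user_id)          -- must be a prefix of the sort key tuple
    TTL event_time + INTERVAL 90 DAY;          -- rows expire 90 days after event_time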

Field types

  1. Do not use strings for fields that can be represented by numeric or time types.

  2. Store text in string fields using UTF-8 encoding, so that no additional conversion is needed when reading and writing data.

  3. Although ClickHouse stores DateTime internally as a Unix timestamp (a long integer), it is not recommended to store the raw timestamp as a numeric column: DateTime is just as efficient, is readable, and needs no function conversion at query time.

  4. Instead of Nullable columns, fields that may lack a value can use sentinel defaults such as -1 or the string 'NULL' (see the sketch after the note below). The reason is that ClickHouse needs an additional file to record NULL marks, and the Nullable documentation clearly states that Nullable hurts performance:

    Note

    Using Nullable almost always negatively affects performance, keep this in mind when designing your databases.
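
A small sketch of the sentinel-value approach; the table and column names are hypothetical:

    -- A plain Int32 with a sentinel default of -1 instead of Nullable(Int32),
    -- and a sentinel string instead of Nullable(String).
    CREATE TABLE user_scores
    (
        user_id UInt64,
        score   Int32  DEFAULT -1,     -- -1 stands in for "no value"
        remark  String DEFAULT 'NULL'  -- sentinel string
    )
    ENGINE = MergeTree
    ORDER BY user_id;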

Query optimization

  1. The usual IO-reduction techniques, such as column pruning, partition pruning, and filtering before joining, speed up queries in ClickHouse just as elsewhere. In particular, ClickHouse does not push predicates down into a join, so you need to manually place the WHERE conditions in the subqueries before the join (see the first sketch after this list).

  2. Put the large table first in a join: ClickHouse loads the right-hand table into memory, which is the opposite of Hive's convention of putting the large table last.

  3. Keep the number of joined tables to 3 or fewer. Although newer versions optimize multi-table joins, joining too many tables still has a large negative impact on query efficiency. If possible, build a wide (denormalized) table first and query that single table; such queries typically return in milliseconds.

  4. Frequently joined dimension tables can be defined as dictionary tables. Dictionary data is kept in memory, so the data volume should not be too large.

  5. When all queried fields come from a single table, consider rewriting the join as an IN. IN queries in ClickHouse support both single fields and tuples:

    SELECT UserID IN (123, 456) FROM ...
    SELECT (CounterID, UserID) IN ((34, 123), (101500, 456)) FROM ...
    -- the tuple can come from a subquery
    SELECT (CounterID, UserID) IN (SELECT CounterID, UserID FROM ...) FROM ...
  6. Use PREWHERE instead of WHERE (available only in the *MergeTree family); when the queried columns greatly outnumber the filter columns, this can improve query performance roughly tenfold. By default ClickHouse moves parts of the WHERE clause to PREWHERE automatically, and you can also specify the PREWHERE fields manually based on your data:

    -- PREWHERE optimizes how data is read during the filtering phase, reducing IO
    SELECT * FROM work_basic_model WHERE product = 'tracker_view' AND (id = 'eDf8fZky' OR code = 'eDf8fZky');

    -- replace the WHERE keyword with PREWHERE
    SELECT * FROM work_basic_model PREWHERE product = 'tracker_view' AND (id = 'eDf8fZky' OR code = 'eDf8fZky');
  7. For exact counts, use uniqExact instead of count(DISTINCT). If a small amount of error is acceptable, use uniq or uniqCombined; uniqCombined consumes less memory and is more accurate than uniq, at a small cost in performance. All three functions are more than 10x faster than count(DISTINCT). A comparison sketch follows this list.

  8. For stable data models, statistical indicators can be pre-computed with materialized views, avoiding repeated computation at query time. A materialized view is updated whenever new data is inserted into the source table. For creation syntax, refer to the Materialized section of CREATE VIEW; a sketch follows this list.
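
As referenced in item 1, a sketch of manually pushing filters below the join; the tables and columns are hypothetical:

    -- Filter inside the subqueries before joining, since ClickHouse
    -- does not push the outer WHERE down into the join inputs.
    SELECT a.user_id, a.cnt, b.city
    FROM
    (
        SELECT user_id, count() AS cnt
        FROM events
        WHERE event_date >= today() - 7   -- filter the left input first
        GROUP BY user_id
    ) AS a
    INNER JOIN
    (
        SELECT user_id, city
        FROM users
        WHERE city != ''                  -- filter the right input too
    ) AS b USING (user_id);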
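
For item 7, a side-by-side comparison, reusing the hypothetical events table from above:

    SELECT
        count(DISTINCT user_id) AS slow_exact,   -- baseline
        uniqExact(user_id)      AS fast_exact,   -- exact replacement
        uniq(user_id)           AS approx,       -- approximate
        uniqCombined(user_id)   AS approx_lowmem -- approximate, less memory, higher accuracy
    FROM events;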
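
For item 8, a minimal materialized-view sketch, again assuming the hypothetical events table; SummingMergeTree is one common choice for pre-aggregation:

    -- Counts are accumulated at insert time; query with sum(cnt),
    -- because parts may not be fully merged yet.
    CREATE MATERIALIZED VIEW events_daily_mv
    ENGINE = SummingMergeTree
    ORDER BY (event_date, action)
    AS SELECT
        event_date,
        action,
        count() AS cnt
    FROM events
    GROUP BY event_date, action;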

Configuration optimization

config.xml

| Configuration item | Description |
| --- | --- |
| background_pool_size | Size of the background merge thread pool. Default 16; it is recommended to set it to twice the number of CPU cores. |
| log_queries | Default 0. Set to 1 to automatically create the system.query_log table and record information about every query. |
| max_execution_time | Maximum query time, in seconds. Unlimited by default; note that a client-side timeout overrides this parameter. |
| max_threads | Maximum number of CPU threads a single query may use. Defaults to the number of CPU cores. |
| max_connections | Maximum number of connections. Default 4096. |
| max_server_memory_usage | Maximum memory ClickHouse may use. Default 0, meaning memory is requested dynamically and can exhaust all of the machine's resources. |
| max_concurrent_queries | Maximum number of concurrent queries. |
| max_open_files | Maximum number of open files. If "too many open files" errors occur, set this together with ulimit; if it does not take effect, confirm that /etc/security/limits.d/clickhouse.conf is consistent and that the service was restarted. The default, 262144, is already quite large. |

users.xml

| Configuration item | Description |
| --- | --- |
| max_memory_usage | Maximum memory available to a single query. Default 10 GB. |
| use_uncompressed_cache | Whether to use the cache of uncompressed blocks, which speeds up short queries. 0 disables, 1 enables; default 0 (disabled). |
| max_rows_to_read | Maximum number of rows that may be read from a table during a query; an exception is thrown if exceeded. |
| max_bytes_to_read | Maximum number of bytes that may be read during a query; an exception is thrown if exceeded. |
| join_use_nulls | Whether JOIN fills non-matching cells with NULL. Default 0: empty cells receive the column type's default value, e.g. 0 for Int and "" for String. |
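
These user-level settings can also be inspected and overridden per session or per query from SQL; a brief sketch (the events table is the hypothetical one from above):

    -- inspect current values
    SELECT name, value
    FROM system.settings
    WHERE name IN ('max_memory_usage', 'max_rows_to_read', 'join_use_nulls');

    -- override for the current session
    SET max_memory_usage = 20000000000;

    -- or attach to a single query
    SELECT count() FROM events SETTINGS max_rows_to_read = 1000000;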

Other optimization

Since version 20.6.3, ClickHouse has provided EXPLAIN for parsing query execution plans, which helps identify parts of a SQL statement that may hurt performance. See the EXPLAIN references at the end of this article.

Two things to note about EXPLAIN:

  1. With the versions tested (DBeaver 7.2.2, ClickHouse 20.9.3.45), running `explain` from GUI tools returns no error but also no parsing result; run it from clickhouse-client instead.
  2. Executing `SET send_logs_level = 'trace';` in clickhouse-client lets you view more detailed log information.
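
A few examples of the EXPLAIN variants; the statements being analyzed reuse the hypothetical events table:

    EXPLAIN SYNTAX   -- show the query after syntax-level rewriting
    SELECT * FROM events WHERE event_date = today();

    EXPLAIN PLAN     -- show the query execution plan (the default)
    SELECT action, count() FROM events GROUP BY action;

    EXPLAIN PIPELINE -- show the execution pipeline
    SELECT action, count() FROM events GROUP BY action;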

Reference documentation

  1. ClickHouse optimization best practices
  2. ClickHouse | MergeTree engine data partition
  3. ClickHouse *MergeTree table engine
  4. MergeTree
  5. A first look at EXPLAIN
  6. feature request: EXPLAIN output
  7. Restrictions on Query Complexity
  8. Server Settings