Preface

This article takes a closer look at DDL and DML in ClickHouse, much of which differs from traditional DBMSs, especially the costly DELETE and UPDATE operations. Let's get started.

In general, I recommend writing all ClickHouse SQL keywords in uppercase to accentuate the difference between custom camelCase identifiers and keywords; this makes statements more readable and maintainable.

The ClickHouse server version used in this article is 20.10.3.30.

Database DDL

When the ClickHouse server starts, a database named default is created automatically. A database acts like a namespace that physically isolates data and avoids table naming conflicts. SHOW DATABASES lists all databases on the current server:

F5abc88ff7e4:) SHOW the DATABASES SHOW DATABASES ┌ ─ name ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ _temporary_and_external_tables │ │ Default │ │ system │ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ 3 rows in the set. The Elapsed: 0.023 SEC.Copy the code

The basic syntax for creating a database is:

CREATE DATABASE [IF NOT EXISTS] $db_name [ON CLUSTER $cluster] [ENGINE = $engine(...)]

There are three optional clauses:

  • IF NOT EXISTS: the database is created only when no database named $db_name exists; otherwise the statement is silently ignored. Without this clause, an exception is thrown if a database with the same name already exists
  • ON CLUSTER $cluster: specifies that all ClickHouse instances in the cluster execute the DDL (an advanced distributed DDL feature)
  • ENGINE = $engine(...): specifies the database engine (surprisingly, not only tables have engines; databases do too)

Common database DDL:
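As a sketch of the usual lifecycle (the database name db_demo is illustrative), the common statements look like:

```sql
-- Create, inspect, switch to, and drop a database
CREATE DATABASE IF NOT EXISTS db_demo;
SHOW CREATE DATABASE db_demo;     -- shows the DDL, including the engine
USE db_demo;                      -- makes it the current database for the session
DROP DATABASE IF EXISTS db_demo;  -- physically removes the database and its tables
```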

For example:

f5abc88ff7e4 :) CREATE DATABASE db_test;

Ok.

0 rows in set. Elapsed: 0.034 sec.

f5abc88ff7e4 :) SHOW CREATE DATABASE db_test;

SHOW CREATE DATABASE db_test

┌─statement───────────────────────────────┐
│ CREATE DATABASE db_test ENGINE = Atomic │
└─────────────────────────────────────────┘

1 rows in set. Elapsed: 0.007 sec.

Database engine

This is a distinctive ClickHouse feature: a database can define an engine type, and different engines suit different scenarios, which gives a taste of the "everything is a table" vision. Only three database engines are mentioned in the official documentation: the default engine, MySQL, and Lazy. But judging from most references, ClickHouse supports "at least five" database engines; those known to be available include Ordinary, Atomic, Memory, Lazy, and MySQL.

In most cases, choose Ordinary, or simply don't specify a database engine at all. Test it out:

f5abc88ff7e4 :) CREATE DATABASE db_default;

Ok.

0 rows in set. Elapsed: 0.027 sec.

f5abc88ff7e4 :) SHOW CREATE DATABASE db_default;

┌─statement──────────────────────────────────┐
│ CREATE DATABASE db_default ENGINE = Atomic │
└────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.007 sec.

f5abc88ff7e4 :) CREATE DATABASE db_ordinary ENGINE = Ordinary;

Ok.

0 rows in set. Elapsed: 0.019 sec.

f5abc88ff7e4 :) CREATE DATABASE db_memory ENGINE = Memory;

Ok.

0 rows in set. Elapsed: 0.014 sec.

f5abc88ff7e4 :) CREATE DATABASE db_lazy ENGINE = Lazy(60);

Ok.

0 rows in set. Elapsed: 0.017 sec.

Table DDL

Table DDL is similar in many respects to that of traditional DBMSs such as MySQL, but adds some new features.

Table creation DDL

There are three ways to create a data table:

  • The first is to create the table from a strict schema definition
  • The second is to create a new table by copying the schema of an existing table (CREATE TABLE new_table AS other_table), optionally changing the table engine
  • The third is to create a new table by copying another table's schema and importing the data of a SELECT query, specifying the table engine

“Strict Schema syntax definitions”

CREATE TABLE [IF NOT EXISTS] [$db_name.]$table_name [ON CLUSTER $cluster_name]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [compression_codec] [TTL expr1] [COMMENT comment1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [compression_codec] [TTL expr2] [COMMENT comment2],
    ...
) ENGINE = $engine

For example:

f5abc88ff7e4 :) CREATE TABLE t_test(id UInt64 COMMENT 'ID', name String COMMENT 'name') ENGINE = Memory;

CREATE TABLE t_test
(
    `id` UInt64 COMMENT 'ID',
    `name` String COMMENT 'name'
)
ENGINE = Memory

Ok.

0 rows in set. Elapsed: 0.032 sec.

“Copy table structure and modify table engine”

CREATE TABLE [IF NOT EXISTS] [$db_name.]$table_name AS [$other_db_name.]$other_table_name [ENGINE = $engine]

# Example: CREATE TABLE default.t_new_test AS default.t_test

For example:

f5abc88ff7e4 :) CREATE TABLE default.t_new_test AS default.t_test;

Ok.

0 rows in set. Elapsed: 0.028 sec.

f5abc88ff7e4 :) DESC default.t_new_test;

DESCRIBE TABLE default.t_new_test

┌─name─┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ id   │ UInt64 │              │                    │ ID      │                  │                │
│ name │ String │              │                    │ name    │                  │                │
└──────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

2 rows in set. Elapsed: 0.004 sec.

“Copy table structure to import data and specify table engine”

CREATE TABLE [IF NOT EXISTS] [$db_name.]$table_name ENGINE = $engine AS SELECT ...

For example:

f5abc88ff7e4 :) CREATE TABLE default.t_test_func ENGINE = Memory AS SELECT * FROM t_test;

CREATE TABLE default.t_test_func
ENGINE = Memory AS
SELECT *
FROM t_test

Ok.

0 rows in set. Elapsed: 0.028 sec.

Default value expression

ClickHouse recommends that every written row contain values for all columns; "otherwise missing columns are filled with the zero value of the corresponding type," unless a default value is specified through a default value expression. If a column defines a default value, its data type can be omitted: ClickHouse infers a (reasonably narrow) data type from the default value expression. So yes, you can define a default value without even defining a type. In addition, default expressions can be "constants or expressions computed from other columns," and ClickHouse checks these expressions for circular dependencies. Default value expressions come in three flavors:

  • DEFAULT $expression: for example amount Decimal(10,2) DEFAULT 0, or c2 UInt32 DEFAULT c1 (a default computed from another column)
  • MATERIALIZED $expression: for example a MATERIALIZED (b + 1), b UInt16
  • ALIAS $expression: for example a ALIAS (b + 1), b UInt16

The differences are as follows:

  • DEFAULT: the column is physically stored; it can be assigned explicitly on INSERT, and it is returned by SELECT *
  • MATERIALIZED: the column is physically stored and always computed by its expression; it cannot be assigned on INSERT, and it is not returned by SELECT * (it must be selected by name)
  • ALIAS: the column is not stored at all; it is computed at query time, cannot be assigned on INSERT, and is not returned by SELECT *

If you are used to the DEFAULT keyword in MySQL, you can think of ClickHouse's DEFAULT as similar, but more advanced, since it can be evaluated from expressions.

There are a few things to note when using default expressions:

  • You can change a column's default value with the ALTER keyword: ALTER TABLE [$db_name.]$table_name MODIFY COLUMN $column_name [DEFAULT|MATERIALIZED|ALIAS] expr. The MODIFY action does not affect data already present in the table
  • There are many restrictions on changing default values: MergeTree-family engines cannot change the default values of primary key columns, and some table engines (such as TinyLog) do not allow changing column defaults at all
  • Default values cannot be set for elements in a Nested data structure
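To make the ALTER note above concrete, here is a minimal sketch (table and column names are illustrative):

```sql
-- Hypothetical MergeTree table with a computed default on a non-key column
CREATE TABLE t_defaults
(
    `a` UInt16,
    `b` UInt16 DEFAULT a + 1
)
ENGINE = MergeTree
ORDER BY a;

-- Change the default expression; rows written before this keep their stored values
ALTER TABLE t_defaults MODIFY COLUMN b DEFAULT a * 10;
```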

DEFAULT keyword example:

f5abc88ff7e4 :) CREATE TABLE t_d(a UInt16, b DEFAULT (a + 1)) ENGINE = Memory;

CREATE TABLE t_d
(
    `a` UInt16,
    `b` DEFAULT a + 1
)
ENGINE = Memory

Ok.

0 rows in set.

f5abc88ff7e4 :) INSERT INTO t_d(a,b) VALUES(1,11);

Ok.

1 rows in set. Elapsed: 0.004 sec.

f5abc88ff7e4 :) INSERT INTO t_d(a) VALUES(3);

Ok.

1 rows in set. Elapsed: 0.004 sec.

f5abc88ff7e4 :) SELECT * FROM t_d;

┌─a─┬──b─┐
│ 1 │ 11 │
└───┴────┘
┌─a─┬─b─┐
│ 3 │ 4 │
└───┴───┘

2 rows in set. Elapsed: 0.004 sec.

Examples of MATERIALIZED keywords:

f5abc88ff7e4 :) CREATE TABLE t_m(a UInt16, b MATERIALIZED (a + 1)) ENGINE = Memory;

CREATE TABLE t_m
(
    `a` UInt16,
    `b` MATERIALIZED a + 1
)
ENGINE = Memory

Ok.

0 rows in set.

f5abc88ff7e4 :) INSERT INTO t_m(a) VALUES (2);

Ok.

1 rows in set. Elapsed: 0.004 sec.

f5abc88ff7e4 :) SELECT * FROM t_m;

┌─a─┐
│ 2 │
└───┘

1 rows in set. Elapsed: 0.005 sec.

f5abc88ff7e4 :) SELECT b FROM t_m;

┌─b─┐
│ 3 │
└───┘

1 rows in set. Elapsed: 0.004 sec.

f5abc88ff7e4 :) INSERT INTO t_m(a,b) VALUES (2,3);

Received exception from server (version 20.10.3):
Code: 44. DB::Exception: Received from clickhouse-server:9000. DB::Exception: Cannot insert column b, because it is MATERIALIZED column.

0 rows in set. Elapsed: 0.004 sec.

ALIAS example:

f5abc88ff7e4 :) CREATE TABLE t_a(a UInt16, b ALIAS (a + 1)) ENGINE = Memory;

CREATE TABLE t_a
(
    `a` UInt16,
    `b` ALIAS a + 1
)
ENGINE = Memory

Ok.

0 rows in set.

f5abc88ff7e4 :) INSERT INTO TABLE t_a(a) VALUES (11);

Ok.

1 rows in set. Elapsed: 0.004 sec.

f5abc88ff7e4 :) SELECT *, b FROM t_a;

┌──a─┬──b─┐
│ 11 │ 12 │
└────┴────┘

1 rows in set. Elapsed: 0.005 sec.

For example:

f5abc88ff7e4 :) DESCRIBE TABLE p_v1;

┌─name──────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ Id        │ UInt64 │              │                    │         │                  │                │
│ EventTime │ Date   │              │                    │         │                  │                │
│ name      │ String │ DEFAULT      │ 'dv'               │         │                  │                │
│ age       │ UInt16 │              │                    │         │                  │                │
└───────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

4 rows in set. Elapsed: 0.004 sec.

f5abc88ff7e4 :) SHOW CREATE TABLE p_v1;

┌─statement────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE default.p_v1 (`Id` UInt64, `EventTime` Date, `name` String DEFAULT 'dv', `age` UInt16) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventTime) ORDER BY Id SETTINGS index_granularity = 8192 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.008 sec.

Column compression codecs

ClickHouse compresses stored data with high-performance compression algorithms to save disk space. The LZ4 (fast compression) algorithm is enabled by default for the MergeTree family of engines, and the default can be changed via the compression node in the ClickHouse server configuration. Basic syntax:

# Declare a codec per column at table creation time:
CREATE TABLE codec_example (ts Date CODEC(LZ4), dt Date CODEC(Default)) ...

# Or change it afterwards:
ALTER TABLE $table_name MODIFY COLUMN $column_name CODEC($codec)

Available algorithms:

  • None: no compression
  • Default: the current default compression algorithm; this is also what you get when no codec is specified
  • LZ4: the fast-compression variant of the LZ4 algorithm
  • LZ4HC[(level)]: the high-compression-ratio variant of LZ4; level defaults to 9 and supports [1,12]
  • ZSTD[(level)]: the ZSTD compression algorithm; level defaults to 1 and supports [1,22]

Different table engines support different compression algorithms. The current supported list is as follows:

  • MergeTree family: supports all optional compression algorithms, and also supports configuring the default algorithm in the server configuration
  • Log family: uses LZ4 by default and supports all optional compression algorithms
  • Set engine: only supports the default compression algorithm
  • Join engine: only supports the default compression algorithm

There are also special encoding and decoding methods such as Delta(delta_bytes), DoubleDelta, Gorilla, and T64, which are not expanded here.
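As a hedged sketch of how those specialized codecs are declared (table and column names are illustrative), they are typically chained with a general-purpose codec:

```sql
-- Delta/DoubleDelta suit slowly changing or monotonic values such as timestamps,
-- Gorilla suits floating-point gauge data; LZ4/ZSTD then compress the residue
CREATE TABLE codec_ts_example
(
    `ts` DateTime CODEC(DoubleDelta, LZ4),
    `value` Float64 CODEC(Gorilla, ZSTD)
)
ENGINE = MergeTree
ORDER BY ts;
```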

Temporary tables

ClickHouse also supports temporary tables, but with more restrictions:

  • Its life cycle is bound to the session: the temporary table disappears when the session ends (for example, when the connection is dropped)
  • Temporary tables cannot specify a table engine and are created using the Memory engine implicitly
  • You cannot specify a database when creating temporary tables. Temporary tables are always created “outside” of the database (in other words, temporary tables do not belong to any database)
  • If a temporary table has the same name as another non-temporary table, and no database is specified, the temporary table is used for query (in other words, if no database is specified for query, the temporary table has higher priority than the normal table).

The syntax for creating temporary tables is as follows:

CREATE TEMPORARY TABLE [IF NOT EXISTS] $table_name
(
    $column_name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    $column_name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
)

Note: in most cases, temporary tables are not created manually but internally by ClickHouse itself. They are typically used for distributed query processing with external data, such as query passing between cluster nodes. For this reason, the official documentation does not recommend using temporary tables directly.
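Should you create one manually anyway, a minimal sketch (names illustrative):

```sql
-- No database qualifier and no ENGINE clause: Memory is implied,
-- and the table vanishes when the session ends
CREATE TEMPORARY TABLE t_tmp
(
    `id` UInt64,
    `name` String
);

INSERT INTO t_tmp VALUES (1, 'a');

-- With an unqualified name, the temporary table shadows any
-- ordinary table called t_tmp for this session
SELECT * FROM t_tmp;
```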

Views

ClickHouse supports two kinds of views: normal and materialized. A view can be dropped directly with DROP TABLE [$db_name.]$view_table_name, and SHOW TABLES lists views alongside tables; a view can be treated as a special kind of table.

Normal views

The syntax for creating a normal view is as follows:

CREATE [OR REPLACE] VIEW [IF NOT EXISTS] [$db_name.]$view_table_name [ON CLUSTER] AS SELECT ...

A normal view does not store any data; it is just a mapping onto a query that simplifies query semantics, with no impact, positive or negative, on query performance. Suppose we have a table t_test; create a normal view view_test:

CREATE VIEW view_test AS SELECT * FROM t_test

Querying SELECT * FROM view_test is then equivalent to querying SELECT * FROM t_test directly.
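Normal views pay off once they wrap a non-trivial query; a sketch reusing t_test from above (the filter condition is illustrative):

```sql
-- Encapsulate a filter and projection behind a view
CREATE VIEW view_recent AS
SELECT id, name
FROM t_test
WHERE id > 100;

-- Callers query the view exactly as if it were a table
SELECT name FROM view_recent;
```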

Materialized views

Materialized views support defining a table engine because the format of their data storage is determined by the table engine. The syntax for creating materialized views is as follows:

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [$db_name.]$view_table_name [ON CLUSTER] [TO[$db_name.]$table_name] [ENGINE = $engine] [POPULATE] AS SELECT ...

Materialized views have the following characteristics:

  • A table engine needs to be defined to determine the storage format of the data
  • After a materialized view is created, an internal table whose name starts with .inner. is generated, which can be verified with SHOW TABLES
  • The [TO [$db_name.]$table_name] clause and the [POPULATE] clause are mutually exclusive; you can use either one or neither
  • When using the [TO [$db_name.]$table_name] clause, the table engine must be specified explicitly
  • The POPULATE keyword determines the initialization strategy: with POPULATE, existing data from the source table is imported while the view is being created; without it, the materialized view starts empty and only synchronizes data written to the source table after the view is created
  • Materialized view data does not support synchronized deletion: if source table rows are missing or deleted, the materialized view's data still exists

Materialized views in ClickHouse are implemented more like insert triggers. If the view query contains aggregation, the aggregation is applied only to newly written batches. Any change to existing data in the source table (updates, deletes, dropped partitions, and so on) does not change the data in the materialized view.
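A minimal sketch of that trigger-like behavior (table and view names are illustrative):

```sql
CREATE TABLE hits
(
    `event_date` Date,
    `user_id` UInt64
)
ENGINE = MergeTree
ORDER BY event_date;

-- Each INSERT into `hits` runs the SELECT over the new batch only;
-- POPULATE also imports the rows that already exist at creation time
CREATE MATERIALIZED VIEW hits_daily
ENGINE = SummingMergeTree
ORDER BY event_date
POPULATE
AS SELECT
    event_date,
    count() AS cnt
FROM hits
GROUP BY event_date;

-- Read with re-aggregation, since unmerged parts may hold partial sums
SELECT event_date, sum(cnt) FROM hits_daily GROUP BY event_date;
```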

The author’s note: Materialized views are a double-edged sword. When used properly, they simplify a lot of synchronization and aggregation. When abused, they can be difficult to maintain and affect performance

Basic column operations

Basic column operations are performed around the ALTER keyword. The basic syntax in general is:

ALTER TABLE [$db_name.]$table_name [ON CLUSTER cluster] ADD|DROP|CLEAR|COMMENT|MODIFY COLUMN ...

In the examples below, the [$db_name.] and [ON CLUSTER $cluster] clauses are omitted for brevity.

Append a new column – ADD COLUMN

The ADD COLUMN statement is used to ADD a new COLUMN to the specified table. Basic syntax:

ALTER TABLE $table_name ADD COLUMN [IF NOT EXISTS] $column_name [type] [default_expr] [codec] [AFTER $pre_column_name]

  • type: optional, specifies the column type
  • default_expr: optional, sets the default value expression
  • codec: optional, see "column compression codecs" in the previous section
  • AFTER clause: optional, specifies the existing column after which the new column is placed

For example:

ALTER TABLE default.p_v1 ADD COLUMN age UInt16 AFTER name

Modify a column – MODIFY COLUMN

The MODIFY COLUMN statement can be used to MODIFY an existing COLUMN type, default value expression, or TTL expression. Basic syntax:

ALTER TABLE $table_name MODIFY COLUMN [IF EXISTS] $column_name [type] [default_expr] [TTL]

For example:

ALTER TABLE default.p_v1 MODIFY COLUMN age UInt32

When a type is changed, it is essentially converted using the built-in toType() function. If the current type is incompatible with the desired type and cannot be converted, the column modification operation will fail and an exception will be thrown.

Add or modify a column comment – COMMENT COLUMN

Add or modify COLUMN comments in ClickHouse using a special COMMENT COLUMN clause. Basic syntax:

ALTER TABLE $table_name COMMENT COLUMN [IF EXISTS] $column_name '$comment'

For example:

ALTER TABLE default.p_v1 COMMENT COLUMN age 'age'

Drop a column – DROP COLUMN

The DROP COLUMN statement is used to DROP columns, and the corresponding COLUMN data is ** “physically deleted” ** from the file system. Basic syntax:

ALTER TABLE $table_name DROP COLUMN [IF EXISTS] $column_name

For example:

ALTER TABLE default.p_v1 DROP COLUMN age

Reset column values in a partition – CLEAR COLUMN

The CLEAR COLUMN statement is used to reset all values of the corresponding COLUMN and the specified partition to default values. If no default value expression is set, all values of the corresponding COLUMN are reset to zero values of their type. Basic syntax:

ALTER TABLE $table_name CLEAR COLUMN [IF EXISTS] $column_name IN PARTITION $partition_name

For example:

f5abc88ff7e4 :) CREATE TABLE p_v1(Id UInt64, EventTime Date, name String DEFAULT 'dv') ENGINE = MergeTree() PARTITION BY toYYYYMM(EventTime) ORDER BY Id;

Ok.

0 rows in set.

f5abc88ff7e4 :) INSERT INTO p_v1 VALUES(1,'2020-11-28','doge1'),(2,'2020-10-29','doge2');

Ok.

2 rows in set.

f5abc88ff7e4 :) ALTER TABLE p_v1 CLEAR COLUMN name IN PARTITION 202011;

Ok.

0 rows in set. Elapsed: 0.163 sec.

f5abc88ff7e4 :) SELECT * FROM p_v1;

┌─Id─┬──EventTime─┬─name──┐
│  2 │ 2020-10-29 │ doge2 │
└────┴────────────┴───────┘
┌─Id─┬──EventTime─┬─name─┐
│  1 │ 2020-11-28 │ dv   │
└────┴────────────┴──────┘

2 rows in set.

Data partitioning

Data partitioning in ClickHouse is essentially table partitioning, so it is still table DDL, but given its importance it gets its own section. Partitioning refers to a vertical split of data within a local instance, as opposed to data sharding, which splits data horizontally across nodes. ClickHouse currently supports data partitioning only for tables in the MergeTree family of engines, including the Replicated* series and materialized views backed by MergeTree-family engines. This section describes the PARTITION keyword and common partition operations.

A partition is a logical grouping of a table's records by a specified condition. Partitions can be defined by almost any condition (such as by month, by day, or by event type). Data in each partition is stored separately, which simplifies data manipulation and can improve query performance. ClickHouse defines the partitioning key with the PARTITION BY expr clause at table creation time, and the key can be any expression over the table's columns. For example, for a Date column you can partition by month with the toYYYYMM(date_column) expression:

CREATE TABLE pv (
    visitDate Date,
    hour UInt8,
    clientID String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(visitDate)
ORDER BY hour;

After creating this table, write two rows with INSERT INTO pv VALUES ('2020-11-01', 1, '11'), ('2020-10-01', 2, '22'); and then query the partition information with the following statement:

f5abc88ff7e4 :) SELECT partition, name, path, active FROM system.parts WHERE table = 'pv';

┌─partition─┬─name─────────┬─path───────────────────────────────────────────────────┬─active─┐
│ 202010    │ 202010_5_5_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202010_5_5_0/ │      1 │
│ 202010    │ 202010_7_7_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202010_7_7_0/ │      1 │
│ 202011    │ 202011_2_2_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202011_2_2_0/ │      1 │
│ 202011    │ 202011_4_4_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202011_4_4_0/ │      1 │
│ 202011    │ 202011_6_6_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202011_6_6_0/ │      1 │
└───────────┴──────────────┴────────────────────────────────────────────────────────┴────────┘

5 rows in set. Elapsed: 0.005 sec.

The name attribute is the name of the partition data part (parts), for example 202010_5_5_0:

  • The first part 202010 is the partition name
  • The first 5 is the MinBlockNum number.
  • The second 5 is the MaxBlockNum number.
  • The final 0 indicates the level of partitions and the number of times a partition has been merged

The name attribute is specific to the MergeTree family of engines and will be explained in more detail in a later look at MergeTree internals. Once partitioned, queries that filter on the partitioning key only need to scan the minimal set of matching partitions:

f5abc88ff7e4 :) SELECT * FROM pv WHERE visitDate = '2020-11-01';

┌──visitDate─┬─hour─┬─clientID─┐
│ 2020-11-01 │    1 │ 11       │
└────────────┴──────┴──────────┘

1 rows in set. Elapsed: 0.020 sec.

If multiple parts of a partition have not yet been merged, ClickHouse merges newly written parts in the background (within roughly 15 minutes of the write), after which the pre-merge parts become inactive. OPTIMIZE TABLE $table_name PARTITION $partition can force a merge immediately, but this is a time-consuming operation and is generally not recommended.
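Should you need to force it anyway, the merge can be triggered per partition (pv is the table from above):

```sql
-- Merge the parts of one partition; FINAL forces a merge down to a single part
OPTIMIZE TABLE pv PARTITION 202011;
OPTIMIZE TABLE pv PARTITION 202011 FINAL;

-- Pre-merge parts linger for a while with active = 0
SELECT name, active FROM system.parts WHERE table = 'pv' AND partition = '202011';
```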

Other operations on data partitioning revolve around the ALTER keyword. The syntax is:

ALTER TABLE $table_name $OP PARTITION|PART $partition(_part)_expr

Querying Partition Information

Table system.parts is used to query partition information. You can use DESC system.parts to view column metadata definitions (44 columns in total).

  • Partition: indicates the partition name
  • Name: the name of the partition part (a part is the smallest physical storage unit of a partition)
  • Active: Indicates whether the partition is active
  • Path: storage disk path of the partition
  • Database: indicates the database where the partition resides
  • Table: indicates the table where the partition resides
  • Engine: Table engine where the partition resides

For example:

f5abc88ff7e4 :) SELECT database, table, engine, partition, name, path, active FROM system.parts WHERE table = 'pv';

┌─database─┬─table─┬─engine────┬─partition─┬─name─────────┬─path───────────────────────────────────────────────────┬─active─┐
│ default  │ pv    │ MergeTree │ 202010    │ 202010_5_5_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202010_5_5_0/ │      1 │
│ default  │ pv    │ MergeTree │ 202010    │ 202010_7_7_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202010_7_7_0/ │      1 │
│ default  │ pv    │ MergeTree │ 202011    │ 202011_2_2_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202011_2_2_0/ │      1 │
│ default  │ pv    │ MergeTree │ 202011    │ 202011_4_4_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202011_4_4_0/ │      1 │
│ default  │ pv    │ MergeTree │ 202011    │ 202011_6_6_0 │ /var/lib/clickhouse/store/f17/f1729bf7-…/202011_6_6_0/ │      1 │
└──────────┴───────┴───────────┴───────────┴──────────────┴────────────────────────────────────────────────────────┴────────┘

5 rows in set. Elapsed: 0.005 sec.

Deleting partitions

A partition or a partition part can be physically deleted with the DROP PARTITION|PART clause. The basic syntax is as follows:

ALTER TABLE $table_name DROP PARTITION|PART partition(_part)_expr

The deletion is asynchronous. After the statement is executed, the partition or part of the partition is set to inactive (that is, active = 0) and then physically deleted 10 minutes later.

For example:

ALTER TABLE pv DROP PARTITION 202010;

ALTER TABLE pv DROP PART 202010_5_5_0;

ALTER TABLE pv DROP PART all_5_5_0;

Detaching and attaching partitions

ClickHouse supports detaching and attaching partitions (from the documentation these appear to be relatively time-consuming asynchronous operations, so schedule them carefully). They are performed with the DETACH PARTITION|PART and ATTACH PARTITION|PART keywords, which are inverse operations.

“Detach partitions – DETACH PARTITION|PART”

The basic syntax is as follows:

ALTER TABLE $table_name DETACH PARTITION|PART $partition_expr

Detaching a partition does not physically delete its data. Instead, the entire partition is moved into the detached subdirectory of the table's data directory. From then on, SELECT queries against that partition return no data (naturally, since the data directory has been moved). Once a partition sits in the detached subdirectory, it stays there until you delete the directory manually or reload/remove it with the ATTACH / DROP DETACHED commands.
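A sketch of the round trip using the pv table from earlier:

```sql
-- Data moves to <table_dir>/detached/ and becomes invisible to queries
ALTER TABLE pv DETACH PARTITION 202010;

-- This partition now contributes no rows
SELECT count() FROM pv WHERE toYYYYMM(visitDate) = 202010;

-- Move it back from detached/ and make it queryable again
ALTER TABLE pv ATTACH PARTITION 202010;
```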

“Attach partitions – ATTACH PARTITION|PART”

The basic syntax is as follows:

ALTER TABLE $table_name ATTACH PARTITION|PART $partition_expr

Attaching is the reverse of detaching: it essentially moves the partition data from the detached subdirectory back into the table's partition directory.

“Drop detached partitions – DROP DETACHED PARTITION|PART”

The basic syntax is as follows:

ALTER TABLE $table_name DROP DETACHED PARTITION|PART $partition_expr

DROP DETACHED physically removes detached partition data from the detached subdirectory; once removed, it can no longer be reloaded with the ATTACH keyword.

Copying and overwriting partition data

The basic syntax is as follows:

ALTER TABLE $table_y_name REPLACE PARTITION $partition_expr FROM $table_x_name

Copies the specified partition of $table_x_name into $table_y_name, overwriting the entire partition's data in the target; the original partition data of $table_x_name is not deleted. The prerequisites are as follows:

  • The table structure of both tables is exactly the same (column definition)
  • The partitioning keys for both tables are exactly the same
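A sketch using an illustrative backup table alongside the pv table from earlier:

```sql
-- Same structure and partitioning key, as REPLACE PARTITION requires
CREATE TABLE pv_backup AS pv;

-- Overwrite partition 202011 in pv_backup with a copy from pv;
-- pv keeps its own data untouched
ALTER TABLE pv_backup REPLACE PARTITION 202011 FROM pv;
```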

Moving partition data

The basic syntax is as follows:

ALTER TABLE $table_source MOVE PARTITION $partition_expr TO TABLE $table_dest

Moves the specified partition from $table_source to $table_dest. This works like a cut-and-paste: the original partition data in $table_source is deleted. The prerequisites are as follows:

  • The table structure of both tables is exactly the same (column definition)
  • The partitioning keys for both tables are exactly the same
  • The table engines for both tables are exactly the same
  • The storage policies of the two tables are identical
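A sketch with an illustrative archive table (which must match pv in structure, partitioning key, engine, and storage policy):

```sql
CREATE TABLE pv_archive AS pv;

-- Cut, not copy: partition 202010 leaves pv and lands in pv_archive
ALTER TABLE pv MOVE PARTITION 202010 TO TABLE pv_archive;
```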

Reset partitioned column data

The basic syntax is as follows:

ALTER TABLE $table_name CLEAR COLUMN $column_name IN PARTITION $partition_expr

Resets the column data of the partition to the default value, or zero of the corresponding type if no default value expression is defined.

Resetting a partitioned index

The basic syntax is as follows:

ALTER TABLE $table_name CLEAR INDEX $index_name IN PARTITION $partition_expr

It is a bit like resetting partitioned column data, but only resets the index of the partition, not the data (the exact function is unknown, because the principles of indexing are not well understood).

Other partitioning operations

  • Back up a partition: ALTER TABLE table_name FREEZE [PARTITION partition_expr]
  • Fetch a partition from another replica: ALTER TABLE table_name FETCH PARTITION partition_expr FROM 'path-in-zookeeper'
  • Move a partition to a disk or volume: ALTER TABLE table_name MOVE PARTITION|PART partition_expr TO DISK|VOLUME 'disk_name'

These involve server configuration, disk paths, and even ZooKeeper paths, so they are more complicated and are not expanded on here.

TTL expression

The TTL (Time To Live) expression is an innovative, advanced ClickHouse feature that specifies the lifetime of data. TTL can be declared at the column level (on expiry, the column's expired values are reset) and at the table level (on expiry, the expired rows are deleted or moved). If both a column TTL and a table TTL are specified, whichever expires first is applied first. The result of a TTL expression must be of the Date or DateTime data type, and the interval is defined with the INTERVAL keyword. The basic syntax is as follows:

TTL time_column
TTL time_column + interval

-- the INTERVAL keyword is used to define the offset
TTL date_time + INTERVAL 1 MONTH
TTL date_time + INTERVAL 15 HOUR

So far, ClickHouse provides syntax only for defining and modifying TTLs; there is no way to cancel an individual TTL expression, only a global command to stop or start all TTL merges: SYSTEM STOP/START TTL MERGES.
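A sketch of the global switch, plus forcing TTL cleanup by hand, since expired data is normally removed only when background merges happen (test_ttl is an assumed table name):

```sql
-- Globally pause background TTL merges
SYSTEM STOP TTL MERGES;

-- Resume them
SYSTEM START TTL MERGES;

-- Trigger an unscheduled merge, which also applies pending TTL rules
OPTIMIZE TABLE test_ttl FINAL;
```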

Table TTL expression

A table TTL sets a table-level expiration time using a Date or DateTime expression (currently, only tables in the MergeTree engine family support table-level TTL). When TTL cleanup is triggered, rows that have passed the expiration time are deleted (or moved). The basic expression is as follows:

TTL $expr [DELETE|TO DISK 'aaa'|TO VOLUME 'bbb'], ...

Here’s an example:

CREATE TABLE test_ttl
(
    d DateTime,
    a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH,                -- DELETE is the default policy
    d + INTERVAL 1 WEEK TO VOLUME 'aaa', -- move to another data volume
    d + INTERVAL 2 WEEK TO DISK 'bbb';   -- move to another disk

The above example illustrates:

  • Data whose d value is more than 1 week old is moved to another data volume
  • Data whose d value is more than 2 weeks old is moved to another disk
  • Data whose d value is more than 1 month old is physically deleted
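Whether the moves actually happened can be checked from the system.parts system table, for example (test_ttl as above):

```sql
-- Inspect which disk each active part of test_ttl currently sits on
SELECT name, partition, disk_name
FROM system.parts
WHERE table = 'test_ttl' AND active;
```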

You can use the ALTER keyword to modify the TTL at the table level, for example:

ALTER TABLE $table_name MODIFY TTL $date_time_column + INTERVAL 1 DAY

Column TTL expression

A column-level TTL is defined with an expression; when the column data expires, ClickHouse resets it to the default value or the zero value of the corresponding type. Columns that are part of the table's key cannot have TTL expressions, and if all of a column's data has expired, ClickHouse removes the column's files from the file system entirely. The basic syntax is as follows:

$column_name type TTL $date_time_column + INTERVAL_EXPR

-- Example:
CREATE TABLE example_table
(
    d DateTime,
    a Int TTL d + INTERVAL 1 MONTH,
    b Int TTL d + INTERVAL 1 MONTH,
    c String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d;
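A column TTL can also be added or changed afterwards with ALTER ... MODIFY COLUMN, restating the full column definition; a sketch against the example_table above:

```sql
-- Attach a TTL to column c (the whole column definition is restated)
ALTER TABLE example_table
    MODIFY COLUMN c String TTL d + INTERVAL 1 DAY;
```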

DML

DML corresponds to what everyday development calls CRUD; the core keywords are INSERT, SELECT, UPDATE, and DELETE.

SELECT

The basic usage of SELECT in ClickHouse is similar to that of mainstream relational DBMSs: it supports specifying columns, *, built-in functions, and a large number of aggregation-related keywords. I won't go into detail here; SELECT will be used heavily later when those keywords and functions are analyzed.

INSERT

The primary function of the INSERT keyword is to write data, which is much more flexible in ClickHouse than in a mainstream relational DBMS. The basic syntax is as follows:

INSERT INTO [$db_name.]$table_name [(c1, c2, c3) | (*)] VALUES (v11, v12, v13), (v21, v22, v23), ...

  • Write specified columns: INSERT INTO table (x_col, y_col, z_col) VALUES (x1, y1, z1), (x2, y2, z2) or INSERT INTO table COLUMNS(x_col, y_col, z_col) VALUES (x1, y1, z1), (x2, y2, z2)

  • Write all columns: INSERT INTO table VALUES (x1, y1, z1), (x2, y2, z2) or INSERT INTO table (*) VALUES (x1, y1, z1), (x2, y2, z2)

  • Write all columns except the specified ones: INSERT INTO table (* EXCEPT(x_col, y_col)) VALUES (z1), (z2)

If a column is not filled with data at write time, it will be filled with the default value or zero value of the corresponding type.
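For example, assuming a hypothetical table with a DEFAULT expression on one column:

```sql
CREATE TABLE test_default
(
    id   UInt32,
    name String DEFAULT 'unknown',
    hits UInt32                      -- no DEFAULT: the zero value 0 is used
)
ENGINE = MergeTree
ORDER BY id;

-- Only id is provided; the row is stored as (1, 'unknown', 0)
INSERT INTO test_default (id) VALUES (1);
```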

You can also specify the data format for data writing. The basic syntax is:

INSERT INTO [$db_name.]$table_name [(c1, c2, c3)] FORMAT $format_name $data_set

Such as:

  • Write tab-separated data:

    INSERT INTO test_tab FORMAT TabSeparated 1 foo 2 bar

  • Write data in CSV format:

    INSERT INTO test_csv FORMAT CSV 1,'foo' 2,'bar'
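Many other formats are supported as well, e.g. JSONEachRow, which takes one JSON object per row (a sketch, assuming a table test_json with columns id UInt32 and name String):

```sql
-- Each JSON object maps keys to column names; missing keys fall back
-- to defaults, extra handling is governed by format settings
INSERT INTO test_json FORMAT JSONEachRow {"id": 1, "name": "foo"} {"id": 2, "name": "bar"}
```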

The last option is to write data via the SELECT clause, which supports expressions or functions. The basic syntax is as follows:

INSERT INTO [$db_name.]$table_name [(c1, c2, c3)] SELECT ...

-- Example:
INSERT INTO test_insert SELECT 1, 'doge', now()

For performance reasons, do not include functions in the SELECT clause, because functions are ultimately parsed and called by the ClickHouse server, and excessive use can degrade write performance.

For write performance reasons, the official recommendation is:

  • Batch write data
  • Data is pregrouped by partitioning keys before being written

ClickHouse writes data in Block structures, and a write is atomic within a single Block. The number of rows per Block is controlled by the max_insert_block_size setting, which defaults to 1048576. Note that this atomicity does not apply when data is written through the CLI client; it applies only when writing via JDBC or the HTTP interface.

UPDATE and DELETE

ClickHouse provides the UPDATE and DELETE keywords, but both are heavyweight operations known as Mutation queries, and they are performed via ALTER. Mutation queries have several characteristics:

  • Transactions are not supported
  • They are heavyweight and expensive, so they should be batched as much as possible
  • They execute asynchronously: the statement returns immediately after submission, but the actual progress must be queried from the system table system.mutations

The basic syntax is as follows:

-- DELETE
ALTER TABLE $table_name DELETE WHERE $filter_expr

-- UPDATE
ALTER TABLE $table_name UPDATE column_x = expr_x, ... WHERE $filter_expr

-- Example:
ALTER TABLE test_update UPDATE name = 'throwable' WHERE id = 1;

Query the mutation progress from system.mutations:

SELECT database, table, mutation_id, block_numbers.number, is_done FROM system.mutations

ClickHouse’s bias towards write and query performance causes it to forgo features such as transactions and efficient precise update or delete functionality. These are trade-offs, not right or wrong.

Summary

This article covers DDL and DML commonly used in ClickHouse in some detail, and some advanced features such as distributed DDL will be covered later when we look at ClickHouse cluster setup. Next, take a closer look at the main table engines ClickHouse currently supports and their usage scenarios.
