Getting started, using, and optimizing ClickHouse

ClickHouse is an open source multidimensional data storage and retrieval tool for data warehouse scenarios. It is a column database management system (DBMS) for online analysis (OLAP). It aims to solve the query performance problem of massive multidimensional data through targeted design.

Enjoy:

One, the database row storage and column storage

In a traditional row database system, data is stored sequentially:

Data in the same row is always physically stored together. Common row database systems include MySQL, Postgres, and MS SQL Server.

In a column database system, values from different columns are stored separately, and data from the same column is stored together. Column databases are better suited for OLAP scenarios (processing speeds are at least 100 times faster for most queries). Emerging distributed databases such as Hbase, HP Vertica, and EMC Greenplum use column storage.

ClickHouse takes a list store approach.

ClickHouse installation and common command parameters

1. Operating systems and hardware environments supported by ClickHouse

As long as it’s Linux, 64-bit is fine. Preferentially support Ubuntu, Ubuntu has an official compiled installation package available. The second is CentOS and RedHat. RPM packages compiled by third-party organizations are available.

For other Linux systems, compile the source code yourself.

Furthermore, the CPU of the machine must support the SSE 4.2 instruction set.

/ root @ localhost ~ # grep – q sse4_2 / proc/cpuinfo && echo “SSE 4.2 supported” | | echo “SSE 4.2 not supported”

2. How to install ClickHouse

(1) the RPM installation package

It is recommended to use the official precompiled RPM packages of CentOS, RedHat, and all other RPM-based Linux distributions.

First, you need to add the official repository:

sudo yum install yum-utils

Sudo RPM – import repo. Clickhouse. Tech/clickhouse -…

Sudo yum – config – manager – add – repo repo. Clickhouse. Tech/RPM/stable /…

Then run the following command to install:

sudo yum install clickhouse-server clickhouse-client

(2) Set system parameters

CentOS cancels the limit on the number of open files

In the/etc/security/limits. Conf, / etc/security/limits. D / 90 – nproc. Conf this 2 the end of the file to join the following content:

* soft nofile 65536

* hard nofile 65536

* soft nproc 131072

* hard nproc 131072

Cancel the SELINUX CentOS

Set selinux to Disabled in /etc/selinux/config and restart the system

Disabling the firewall

Centos 6. X:

service iptables stop

service ip6tables stop

Centos 7. X:

chkconfig iptables off

chkconfig ip6tables off

(3) Start the connection

Start the service: service clickhouse-server start

Connect to the client: clickhouse-client

3.ClickHouse common command parameters

(1) Enter the interactive client

Connect to the local Clickhouse-server server with clickhouse-client:

clickhouse-client -m

Connect to a remote Clickhouse-server server with a local clickhouse-client:

Clickhouse-client – host 192.168.x.xx – port 9000 – database default – user default – password “”

If the startup fails, you can view logs in /var/log/clickhouse-server/ by default.

(2) service

Stop:

service clickhouse-server stop

Activation:

service clickhouse-server start

Resume:

service clickhouse-server restart

(3) Set the data directory

vi /etc/clickhouse-server/config.xml

(4) Let go of remote access

vi /etc/clickhouse-server/config.xml

Modify the server configuration file /etc/clickhouse-server/config.xml at line 65 to leave out the comments.

(5) Memory limit Settings

vi /etc/clickhouse-server/users.xml

Introduction to The ClickHouse engine

ClickHouse provides a large number of data engines, including database engine and table engine. It is important to choose the right engine according to the characteristics of the data and the usage scenario.

1.ClickHouse engine classification

ClickHouse uses its own database engine in the following cases:

Decide where and how to store tables
What queries are supported and how
Concurrent data access
Use of indexes
Whether multithreaded requests can be performed
Data replication Parameters

Of all the table engines, the most core is the MergeTree series of table engines, which have the most powerful performance and the most widely used occasions. For the other engines in the non-Mergetree series, they are mainly used for special purposes and have relatively limited scenarios. The MergeTree series table engine is the official main storage engine that supports almost all of ClickHouse’s core functions.

As the most basic table engine in the family series, MergeTree has the following features:

Stored data sorted by primary key: allows the creation of sparse indexes to speed up data query
Partitions are supported. You can specify partition fields through the PRIMARY KEY statement
Support for data copy
Support for data sampling

2. Set table engine parameters

**ENGINE: **ENGINE = MergeTree();

**ORDER BY: **ORDER BY specifies the ORDER in which data is stored in the partition.

Order BY is the only mandatory field in MergeTree, even more important than the primary key, because when the user does not set the primary key, a lot of processing will be done according to the order by field.

Requirement: Primary key must be a prefix field to the Order BY field.

If the ORDER BY is different from the PRIMARY KEY, the PRIMARY KEY must be the prefix of ORDER BY (to ensure the ORDER of data and PRIMARY keys within the partition).

ORDER BY determines how data is sorted in each partition;

The PRIMARY KEY determines the PRIMARY index (primary.idx);

ORDER BY can refer to the PRIMARY KEY, usually just BY declaring ORDER BY.

**PARTITION BY: Specifies the PARTITION field. This parameter is optional. If left blank: only one partition will be used.

Partitioned directory: MergeTree is composed of column files, index files, and table definition files, but if partitioned, these files will be stored in different partitioned directories.

**PRIMARY KEY: ** specifies the PRIMARY KEY. If the sort field does not match the PRIMARY KEY, the PRIMARY KEY field can be specified separately. Otherwise the default primary key is the sort field. Optional.

**SAMPLE BY: ** The SAMPLE field, if specified, must also be included in the primary key. SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)). Optional.

**TTL: ** The TTL of the data. In MergeTree, you can set TTL for a column field or for the entire table. When the time is up, if the TTL is at the column field level, the data in this column will be deleted. If the TTL is at the table level, the entire table is deleted. Optional.

**SETTINGS: ** Additional parameter SETTINGS. Optional.

3 4. Import data

clickhouse-client –query “INSERT INTO default.emp_mgetree FORMAT CSV” –max_insert_block_size=100000 < test_data.csv

By default, the spacing is,

Iv. Summary of Advantages and disadvantages of ClickHouse

1. The advantage

For efficient CPU use, data is not only stored as columns, but also processed as vectors
Large data compression space, reduce I/O. Handle single query with high throughput of up to billions of rows per second per server
The index is not a B-tree structure and does not need to meet the leftmost principle. As long as the filter conditions are included in the index column, it is necessary. ClickHouse full table scans are fast even when the data in use is not in the index due to various parallel processing mechanisms
Write speed is very fast, 50-200m /s, for large numbers of data updates

2. Disadvantage

No transaction support, no real delete/update support
The official recommended QPS is 100. If the server is good enough, you can modify the configuration file to increase the number of connections
SQL to meet the daily use of more than 80% of the syntax, join writing is more special. The latest version already supports SQL like join, but with poor performance
Try to do more than 1000 writes in batches. Avoid row-by-row insert or small-batch INSERT, Update, and DELETE operations. ClickHouse is constantly doing asynchronous data merges, which can affect query performance
Clickhouse is fast because of its parallel processing mechanism. Even one query is executed using half of the server’s CPU, so Clickhouse cannot support high concurrency scenarios. By default, the number of CPU cores used in a single query is half the number of server cores. You can modify this parameter in the configuration file
Full data import: Import a temporary table -> Rename the original table to Tmp1 -> rename the temporary table to a formal table -> delete the original table
Incremental data import: import the incremental data to a temporary table -> Import the original data except the increment to a temporary table -> Rename the original table to TMP1 -> change the temporary table to a formal table -> delete the original data table

Five, ClickHouse use optimization summary

1. Data type

When creating a table, do not use strings for fields that can be represented as numeric values or date-time values. The full String type is common in Hive-centric warehouse construction, but is not affected in CK environments.

Although ClickHouse stores DateTime as timestamp Long underneath, it is not recommended to store the Long directly because DateTime does not need to be converted into a function, which makes it efficient and readable.

It has been pointed out that the Nullable type is almost always a performance drag because Nullable columns need to be created in an extra file to store the NULL flag, and Nullable columns cannot be indexed. Therefore, except in very special cases, use the default value of the field to indicate null, or specify a value that is meaningless in business (for example, -1 to indicate that there is no item ID).

2. Partition and index

The partition granularity depends on the service characteristics. The partition granularity is not too coarse or too fine. Partition by day, or tuple(). Take a single table of 100 million data as an example, partition size control in 10-30 is the best.

You must specify the index column. The index column in ClickHouse is the order, which is specified by order by. Properties that are often used as filters in queries are included. It can be an index of a single dimension or a combination of dimensions. Usually need to meet the advanced column in the first, query frequency in the first principle; There are particularly large cardinality is not suitable for the index column, such as user table userID field; Usually, it is best to filter the data within a million.

3. The table parameters

Index_granularity is used to control index granularity. The default value is 8192.

If it is not necessary to keep full historical data in the table, you are advised to specify TTL to avoid the trouble of manually expiating historical data. The TTL can also be changed at any time with the ALTER TABLE statement.

4. Query a single table

Use preWHERE instead of where keyword; When there are significantly more query columns than filter columns, using PREWHERE improves query performance by a factor of ten.

5. Data sampling policy

The performance of data analysis can be greatly improved by using computation.

When the amount of data is too large, avoid using the select * operation. The query performance will linearly transform with the size and number of queried fields. The fewer fields, the less IO resources consumed, and the higher performance.

The where condition and the LIMIT statement should be used together when querying an order by query for datasets with more than 10 million data sets.

Do not build virtual columns on the result set if you do not have to. Virtual columns are very resource-intensive and waste performance. Consider either front-end processing or constructing actual fields in the table for additional storage.

It is not recommended to perform a DISTINCT deduplicate query on a high column. Use an approximate deduplicate uniqCombined instead.

When multiple tables Join, the principle of small table on the right should be satisfied. When the right table is associated, it is loaded into memory and compared with the left table.

6. Storage

Clickhouse does not support multiple data directories. To improve data I/O performance, you can mount virtual coupon groups. One coupon group can be attached to multiple physical disks to improve read/write performance. In most query scenarios, SSDS are 2-3 times faster than ordinary mechanical hard disks.

End

More dry content, please pay attention to the wechat public number “soft” ~