Overview

Core concepts:

  • Shard
  • Replica
  • Cluster

Distributed table implementation method:

  • Replicated tables: tables whose engine name carries the Replicated prefix; replication is implemented automatically by the underlying table engine.
  • Distributed tables: use the Distributed engine, which works like a view. Physical local tables must be created on every instance first, and the distributed table is then mapped onto those local tables.

Configuration method

  • Configure the ZooKeeper settings of ClickHouse
  • Configure the ClickHouse cluster
  • Configure macro parameters such as {shard}

Concepts

Sharding: data is divided into multiple parts with no overlapping content between them. Watch out for data skew.

Replica: a redundant copy of the data; there can be multiple backups, and replicas of the same shard hold identical content.

This is analogous to the partition and replication mechanisms in Kafka. The goals are to maximize the use of distributed resources and to provide availability in the sense of the CAP theorem.

The following figure shows the logical relationship between sharding and replication:

Cluster: a logical concept; multiple ClickHouse instances can be grouped into one or more ClickHouse clusters.

ClickHouse has no notion of a central node; every node plays the same role.

The instances can be divided into multiple clusters, and each cluster is named after the tag configured in the config.xml file.
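
The clusters that an instance knows about can be inspected directly from SQL. The query below is a small sketch against the built-in system.clusters table; it lists every configured cluster together with its shards and replicas.

-- List each configured cluster with its shard/replica layout.
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters;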

Shard lookup mechanism

How is the shard that holds a given piece of data located?

Replica synchronization mechanism

Replica synchronization works roughly as follows:

TODO: the underlying principles and implementation details still need to be studied.

ZooKeeper is used to watch the operation changes of the different ClickHouse instances.

A replica node pulls the operation log of the node that received the write and replays it locally (or does it simply pull the resulting data files? To be confirmed).
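
One way to observe the replication state in practice is through the system tables. The queries below are a sketch that assumes at least one Replicated* table already exists on the instance.

-- Per-table replication status: whether this replica is the leader, queue size, and lag.
SELECT database, table, is_leader, queue_size, absolute_delay
FROM system.replicas;

-- Pending replication actions (e.g. GET_PART, MERGE_PARTS) fetched from ZooKeeper.
SELECT database, table, type, create_time
FROM system.replication_queue;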

Distributed table building method

Replicated table

Replication is only available through the Replicated*MergeTree family of table engines (engines prefixed with Replicated).

A replicated table uses the engine's own built-in replica synchronization policy.

ZooKeeper is a required component for ClickHouse replication. It only coordinates the state of the different instances and does not take part in data transmission itself.
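
To see that ZooKeeper holds only coordination metadata rather than data, you can list the znodes a replicated table creates. The path below is hypothetical, assuming {shard} resolves to 01 and the table path from the CREATE TABLE example further down.

-- Children typically include replicas, log, blocks, columns, leader_election, etc.
SELECT name
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/my_db/my_table';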

ReplicatedMergeTree has some notable features in its replica design:

  • Relies on ZooKeeper
  • Table-level replicas: replication is configured per table
  • Multi-master: INSERT and ALTER queries can be executed on any replica with the same effect; the operations are distributed to every replica and executed locally, coordinated through ZooKeeper.
  • Blocks: data is written in blocks, and identical blocks are deduplicated on insert (a small demonstration follows this list).
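
The block deduplication behavior can be shown with a small experiment. This is only a sketch: it assumes the dc_eth.events_local table created in the next section already exists and that insert_deduplicate is enabled, which is the default for Replicated* engines.

-- Insert the same block twice; the second insert has an identical checksum
-- (tracked in ZooKeeper under .../blocks) and is silently skipped.
INSERT INTO dc_eth.events_local (ts_date, user_id, event_type, site_id, groupon_id)
VALUES ('2024-01-01', 1, 'click', 10, 100);

INSERT INTO dc_eth.events_local (ts_date, user_id, event_type, site_id, groupon_id)
VALUES ('2024-01-01', 1, 'click', 10, 100);

-- Only one row is stored.
SELECT count() FROM dc_eth.events_local;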

To use replicas with ReplicatedMergeTree, two parameters must be specified.

  • zk_path: the path in ZooKeeper under which the table's replication metadata is stored.
  • replica_name: specifies the name of the replica created in ZooKeeper. The name uniquely identifies each replica; one convention is to use the domain name of the server.

The create-table statement looks like this:

CREATE TABLE dc_eth.events_local ON CLUSTER new_cluster
(ts_date Date, user_id Int64, event_type String, site_id Int64, groupon_id Int64)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_db/my_table', '{replica}')
PARTITION BY toYYYYMM(ts_date) ORDER BY (site_id);

Definition of parameters:

  • ReplicatedMergeTree: the engine name; engines prefixed with Replicated correspond to the usual local table engines.
  • Parameter 1: the ZooKeeper path to watch. {shard} is set in the config so that each instance watches a different path; the /clickhouse/** prefix is a conventional way of writing it.
  • Parameter 2: {replica}, the name of the replica, usually the hostname of the node.
Distributed table

A distributed table is a view; it is mapped onto the local table of each instance through the cluster definition. Storage and queries rely on the local tables, which must already exist. Dropping the distributed table does not drop the local tables.

Create table method:

  • First create the local tables on every instance of the cluster;
  • Then create a table with the Distributed engine and associate it with the local tables in the cluster (a full example follows the parameter list below):
ENGINE = Distributed(ck_cluster, my_db, my_table, rand());

Definition of parameters:

  • ck_cluster: the cluster name;
  • my_db: the database name;
  • my_table: the local table name; the database and table names must be the same on every instance;
  • rand(): the sharding expression that decides which shard each row is routed to.
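
Putting the two layers together, a distributed table for the dc_eth.events_local example above could be created and used as follows. This is a sketch: the cluster name new_cluster matches the earlier DDL, while the table name events_all is a hypothetical choice.

-- Distributed table mapped onto dc_eth.events_local on every node of new_cluster.
CREATE TABLE dc_eth.events_all ON CLUSTER new_cluster
AS dc_eth.events_local
ENGINE = Distributed(new_cluster, dc_eth, events_local, rand());

-- Writes are routed to a shard chosen by rand(); reads fan out to all shards.
INSERT INTO dc_eth.events_all (ts_date, user_id, event_type, site_id, groupon_id)
VALUES ('2024-01-01', 2, 'view', 11, 101);

SELECT site_id, count() FROM dc_eth.events_all GROUP BY site_id;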

Configuration method

Configure ZooKeeper

ZooKeeper is the basic coordination component. Add the ZooKeeper configuration and then restart the ClickHouse service.

Create zks.xml in /etc/clickhouse-server/config.d/


      
<yandex>
    <zookeeper>
        <node index="1">
            <host>cy3.dc</host>
            <port>2181</port>
        </node>
        <node index="2">
            <host>cy4.dc</host>
            <port>2181</port>
        </node>
        <node index="3">
            <host>cy5.dc</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>

In config.xml, import the newly created configuration file

<yandex>
    <!-- original config.xml contents -->
    <include_from>/etc/clickhouse-server/config.d/zks.xml</include_from>
</yandex>

Verify the ZK configuration:

select * from system.zookeeper where path = '/clickhouse';
If the query returns rows, the configuration has taken effect.

Note: ClickHouse service management commands

systemctl start clickhouse-server.service: starts the ClickHouse service
systemctl stop clickhouse-server.service: stops the ClickHouse service
systemctl status clickhouse-server.service: shows the ClickHouse service status

Configuring Shard Attributes (macros)

The configuration location is /etc/clickhouse-server/config.xml. The example below configures one shard with one backup copy (an original node plus a backup node), i.e. both nodes belong to shard 01.

<!-- macros used by ReplicatedMergeTree -->
<!-- configuration on ck1 -->
    <macros>
      <shard>01</shard>
      <replica>ck1</replica>
    </macros>

<!-- configuration on ck2 -->
    <macros>
      <shard>01</shard>
      <replica>ck2</replica>
    </macros>
  • shard: the shard identifier used in the cluster configuration (here 01, sometimes written as shard_1);
  • replica: the replica identifier of the current node, used when synchronizing data between nodes;
  • layer: an optional cluster identifier; the cluster keyword can also be used.

The macros configured in this section are user-defined substitution parameters; they supply the values of variables such as {shard} and {replica} when tables are created.
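
The macros an instance has picked up can be checked from SQL; the system.macros table lists every substitution and its value, which is a quick way to verify that {shard} and {replica} will resolve as intended.

SELECT macro, substitution FROM system.macros;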

Configure the cluster

The cluster topology is defined in config.xml (under the remote_servers section), or in a separate configuration file that config.xml references.

<!-- config file: config.xml -->
<!-- ClickHouse cluster node configuration -->
  <chuying_ck_cluster>
        <shard>
          <internal_replication>true</internal_replication>
          <replica>
            <host>cy11.dc</host>
            <port>9000</port>
          </replica>
          <replica>
            <host>cy7.dc</host>
            <port>9000</port>
          </replica>
        </shard>
  </chuying_ck_cluster>

Definition of the tags:

shard: defines one shard of the cluster;

replica: defines one replica instance within the shard;

internal_replication: controls how writes through a distributed table are replicated. When it is true, the distributed table writes to only one replica per shard and relies on the replicated table engine to copy the data to the other replicas; when it is false, the distributed table itself writes to every replica.

Problems encountered

  1. After a replicated table is dropped and a new table with the same name is created, ZooKeeper reports that the path already exists.

Cause: when a replicated table is dropped, its metadata in ZooKeeper is not removed immediately; the stale path is only cleaned up after a delay of several minutes.

Solution: change the ZooKeeper path of the replicated table, or add {uuid} to the path in the CREATE TABLE statement so that a random ID is generated:

ReplicatedMergeTree('/clickhouse/tables/{shard}/my_db/my_table/{uuid}', '{replica}')
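
Depending on the ClickHouse version, the stale metadata can also be removed explicitly instead of waiting for the delayed cleanup. The statement below is only a sketch using the SYSTEM DROP REPLICA command, with the replica name and ZooKeeper path taken from the hypothetical examples above; it cannot be run against the local, still-active replica.

-- Remove the leftover ZooKeeper metadata of a dropped/dead replica.
SYSTEM DROP REPLICA 'ck1' FROM ZKPATH '/clickhouse/tables/01/my_db/my_table';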
