One, introduction

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP). It was open-sourced in 2016 under the Apache 2.0 license and is popular among big data engineers for its excellent query performance.

To serve customer workloads, Tencent Cloud launched its ClickHouse service in April 2020. Since launch, the service has quickly won broad support from customers in China and abroad, and the number of workloads it serves has grown steadily. The operation, maintenance, and control burden has grown with it, and users are increasingly asking for elastic scaling.

In fact, ClickHouse has a typical shared-nothing architecture that naturally supports elastic scaling. It is easy to increase the number of nodes and the number of shard replicas.

Figure 1: ClickHouse shared-nothing architecture

However, after nodes are added to a ClickHouse cluster, the data sets on the cluster are not automatically redistributed evenly. Manual intervention is required to rebalance the data. Similarly, before a cluster node can be taken offline, manual intervention is required to migrate that node's data to other nodes.

In a production environment, the O&M workload grows dramatically as the number of tables and the volume of data in the cluster increase. Automating ClickHouse data balancing is therefore valuable for easing operations for ClickHouse users on the cloud.

This article explains how Tencent Cloud ClickHouse implements an unattended data balancing service. We hope it is a useful starting point for discussion.

Two, ClickHouse clusters lack data balancing

In a production environment, ClickHouse is typically deployed in cluster mode. Users divide the cluster's nodes into subsets based on business requirements, and each subset stores several data sets. At the usage level, users query an entire data set through the Distributed engine.

ClickHouse has the concept of a Cluster: a collection of nodes that defines, for the data sets stored on it, the number of shards, the number of replicas of each shard, and the nodes that store them.

As shown in Figure 1, a cluster named cluster-dataset defines four shards, each with two replicas. When data sets are stored on the cluster, they are typically split across the four shards, and each shard keeps two copies of its data.
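To make the shard and replica layout concrete, the following Python sketch renders a `remote_servers` configuration fragment for a cluster with four shards and two replicas per shard. The hostnames (`node-s1-r1` and so on) and the helper name `render_remote_servers` are hypothetical and chosen only for illustration; port 9000 is assumed to be the native TCP port. It is a minimal sketch, not the configuration used by the service described here.

```python
import xml.etree.ElementTree as ET


def render_remote_servers(cluster_name, num_shards, replicas_per_shard, port=9000):
    """Build a <remote_servers> element describing one cluster.

    Hostnames are made up for illustration; a real deployment would
    list its own machines.
    """
    root = ET.Element("remote_servers")
    cluster = ET.SubElement(root, cluster_name)
    for shard_no in range(1, num_shards + 1):
        shard = ET.SubElement(cluster, "shard")
        for replica_no in range(1, replicas_per_shard + 1):
            replica = ET.SubElement(shard, "replica")
            # Hypothetical hostname pattern: node-s<shard>-r<replica>
            ET.SubElement(replica, "host").text = f"node-s{shard_no}-r{replica_no}"
            ET.SubElement(replica, "port").text = str(port)
    return root


# Four shards with two replicas each, as in the cluster-dataset example.
config = render_remote_servers("cluster-dataset", 4, 2)
print(ET.tostring(config, encoding="unicode"))
```

Adding a fifth shard amounts to appending one more `<shard>` element, which is why expanding the cluster definition is easy while the existing data stays where it is.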

Adding shards to a cluster is easy: allocate machines and modify the configuration. The following figure shows a shard being added to cluster-dataset. However, the existing data sets remain on the original shards 1-4. The newly added nodes therefore sit largely idle, wasting both computing and storage resources.

Figure 2: Schematic diagram of expanding nodes

There are several ways to solve this problem:

  • Delete all the data and re-import it from the backup data source into ClickHouse;

  • Increase the weight of the new node, then adjust the weight back once the data has balanced out over a period of time;

  • Manually move data to the new nodes.

But every one of these options has drawbacks. The first is not viable if there is no backup data source for the ClickHouse data. Even if there is one, re-importing the data takes time, the outage is proportional to the amount of data, and the cost is high.

The second option requires adjusting the new node's weight several times. During the adjustment period, write pressure is skewed toward the newly added nodes, so the cluster cannot be fully utilized. In addition, new data is concentrated on the newly added nodes, which wastes cluster resources and reduces query efficiency.
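The skew caused by the weight-based option can be illustrated with a small simulation. The sketch below follows the shard selection rule documented for the Distributed engine: the sharding value is taken modulo the total weight, and the row goes to the shard whose weight interval contains the remainder. The weights and the key range are made-up illustration values, and `pick_shard` is a hypothetical helper, not a ClickHouse API.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate


def pick_shard(sharding_value, weights):
    """Return the 0-based index of the shard that receives a row.

    Shard i owns the half-open interval of remainders
    [sum(weights[:i]), sum(weights[:i + 1])).
    """
    bounds = list(accumulate(weights))       # upper bound of each interval
    remainder = sharding_value % bounds[-1]  # bounds[-1] is the total weight
    return bisect_right(bounds, remainder)


# Four existing shards with weight 1 and one new shard with weight 4
# (illustrative values only).
weights = [1, 1, 1, 1, 4]
counts = Counter(pick_shard(key, weights) for key in range(8000))
for shard_index in sorted(counts):
    print(f"shard {shard_index + 1}: {counts[shard_index]} rows")
```

With these weights the new shard receives half of all newly written rows, while each old shard receives one eighth. This is the write concentration described above: the new node catches up on stored data, but only by absorbing a disproportionate share of the write load.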

The third option is operationally complex and error-prone when there are many tables and a large amount of data.

Three, the ClickHouse solution on the cloud

To relieve the operation and maintenance pressure caused by the lack of a data balancing function in ClickHouse clusters, Tencent Cloud ClickHouse provides an automatic data balancing function.

In short, once the user has granted authorization, a data balancing task can be started from the console with minimal configuration: the user only needs to set an upper limit on the network bandwidth used for data migration.

The back-end control system builds a data migration plan based on the available disk capacity of each machine, then executes the plan within the configured bandwidth limit. As a result, the distribution of data across nodes converges toward balance.
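The article does not describe the planning algorithm itself, so the following is only a sketch of one plausible approach, under stated assumptions: every node should end up with the same disk utilization ratio, data can be moved in arbitrary amounts (a real system would move whole parts or partitions), and the bandwidth limit applies to the whole task. The function names, node names, sizes, and the 100 MB/s limit are all hypothetical.

```python
def plan_migration(used_gb, capacity_gb):
    """Greedy plan that equalizes disk utilization across nodes.

    Returns a list of (source, destination, gigabytes) moves.
    """
    total_used = sum(used_gb.values())
    total_capacity = sum(capacity_gb.values())
    # Target usage gives every node the same used/capacity ratio.
    target = {n: total_used * capacity_gb[n] / total_capacity for n in used_gb}
    donors = [[n, used_gb[n] - target[n]] for n in used_gb if used_gb[n] > target[n]]
    receivers = [[n, target[n] - used_gb[n]] for n in used_gb if used_gb[n] < target[n]]

    moves = []
    i = j = 0
    while i < len(donors) and j < len(receivers):
        amount = min(donors[i][1], receivers[j][1])
        moves.append((donors[i][0], receivers[j][0], amount))
        donors[i][1] -= amount
        receivers[j][1] -= amount
        if donors[i][1] <= 1e-9:     # donor has shed its surplus
            i += 1
        if receivers[j][1] <= 1e-9:  # receiver has reached its target
            j += 1
    return moves


def estimate_seconds(moves, limit_mb_per_s):
    """Lower bound on task duration when transfers share one bandwidth cap."""
    total_mb = sum(amount for _, _, amount in moves) * 1024
    return total_mb / limit_mb_per_s


# Two equally sized nodes, all data on the first one (illustrative numbers).
moves = plan_migration({"node-1": 600.0, "node-2": 0.0},
                       {"node-1": 1000.0, "node-2": 1000.0})
duration = estimate_seconds(moves, limit_mb_per_s=100)
print(moves, duration)
```

The bandwidth limit trades migration time against interference with normal traffic: a lower cap leaves more network headroom for queries on other tables but lengthens the task proportionally.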

As an example, apply for a ClickHouse instance on the cloud with two nodes. Create a table named LineOrder on one of the nodes and import the test data. Viewing the storage used by the table on that node displays the following information:

At this point the other node has neither the table's data nor its schema. We balance the data with the data migration function, which is driven from the console. The specific steps are as follows:

1. Choose the cluster

Select the ClickHouse instance, click Cluster Services, select the ClickHouse component, and choose the Data Migration menu item from the Action drop-down list. Then select a data balancing mode.

2. Select the nodes to migrate between

After the cluster is determined, choose the node to migrate data out of and the node to migrate data into.

3. Select the tables to migrate

After the source and destination nodes are identified, select the tables to be migrated.

4. Confirm the information

Finally, submit the task. ClickHouse starts the data migration. You can view the migration progress in the task center.

After the task is complete, you can view the details about the migration task.

After the data migration is complete, we can view the data distribution on the two nodes. The data volume on cluster nodes is as follows:

It can be seen that after the data migration is complete, both the number of rows and the data volume match exactly.
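A check like this can also be scripted. The sketch below is one hedged way to do it: the query text reads row counts and on-disk bytes from the `system.parts` table and would be run on each node with whatever client is available, while `check_balance` is a hypothetical helper that compares the per-node row counts. The database name `default`, the lowercase table name, the 5% tolerance, and the sample row counts are assumptions for illustration, not values from the test above.

```python
# Query to run on each node; database and table names are assumptions.
PARTS_QUERY = """
SELECT sum(rows) AS rows, sum(bytes_on_disk) AS bytes
FROM system.parts
WHERE active AND database = 'default' AND table = 'lineorder'
"""


def check_balance(rows_before, rows_after, tolerance=0.05):
    """Verify that no rows were lost and report which nodes are balanced.

    rows_before / rows_after map node name -> row count. A node counts
    as balanced when it is within `tolerance` of the per-node average.
    """
    total_before = sum(rows_before.values())
    total_after = sum(rows_after.values())
    if total_before != total_after:
        raise ValueError(
            f"row count changed during migration: {total_before} -> {total_after}"
        )
    if total_after == 0:
        return {node: True for node in rows_after}
    ideal = total_after / len(rows_after)
    return {
        node: abs(count - ideal) / ideal <= tolerance
        for node, count in rows_after.items()
    }


# Made-up counts: everything on node-1 before, split evenly after.
result = check_balance(
    {"node-1": 6_000_000, "node-2": 0},
    {"node-1": 3_000_000, "node-2": 3_000_000},
)
print(result)
```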

Four, conclusion

The on-cloud data migration feature is designed to solve the data movement problem that arises when ClickHouse scales elastically. Usage scenarios include:

  • After a new node is added, use the data migration function to migrate some data to the new node to balance data among cluster nodes.

  • Before scaling down a node, migrate the data on the node being taken offline to other nodes to avoid data loss.

The data migration feature greatly eases the operation and maintenance of the ClickHouse cluster edition. Note that the tables being migrated cannot be accessed by services while data balancing is in progress.