Background

ClickHouse has become a popular alternative to ElasticSearch in the open-source community over the past two years, probably because ElasticSearch is expensive enough that people will tolerate a certain amount of brute-force scanning in exchange for doing without a full-text search engine.

When we used Skywalking, we found it extremely demanding on back-end storage: with a (32C + 64G + 2T) x 8 configuration costing tens of thousands per month on the cloud platform, performance was still poor and queries frequently timed out. After half a year of tuning, I finally decided to replace the storage with ClickHouse.

After switching to ClickHouse, the number of machines decreased by 50%.

Trace list query throughput increased from 5.53/s to 166/s, and response time decreased from 3.52s to 166ms.

Trace detail query throughput increased from 5.31/s to 348/s, and response time decreased from 3.63s to 348ms.

Trace data retention increased from 5 days to 10 days, with the row count reaching tens of billions.

It’s worth noting that while disk-space savings are often mentioned in comparisons with ES, ClickHouse’s compression advantage is not that dramatic, at least in my experience. If the ES footprint is high, it is probably because index.codec: best_compression is not enabled on the index.

ClickHouse is not without its drawbacks. This article shares how to use ClickHouse as back-end storage for Skywalking. It will not belabor ClickHouse basics, so some familiarity with ClickHouse is assumed.

(Due to the heavy workload involved, only trace data (i.e., Segments) is moved to ClickHouse; the rest of Skywalking’s data stays in ElasticSearch unchanged.)

Table design

ClickHouse essentially supports only one index: the ORDER BY (sorting key) index. But the segment table has numerous query conditions (almost every field can be filtered on) plus several sort orders, which makes the design difficult.

Query conditions: time, service name, service instance, trace type, trace ID, endpoint name, response time, error status.
Sort conditions: time, response time.

It is almost (if not entirely) impossible to cover all query patterns with a single table design, a conclusion reinforced by numerous existing designs such as Jaeger-Clickhouse.

After several attempts, the final table DDL is as follows:

CREATE TABLE skywalking.segment
(
    `segment_id` String,
    `trace_id` String,
    `service_id` LowCardinality(String),
    `service_instance_id` LowCardinality(String),
    `endpoint_name` String,
    `endpoint_component_id` LowCardinality(String),
    `start_time` DateTime64(3),
    `end_time` DateTime64(3),
    `latency` Int32,
    `is_error` Enum8('success' = 0, 'error' = 1),
    `data_binary` String,
    INDEX idx_endpoint_name endpoint_name TYPE tokenbf_v1(2048, 2, 0) GRANULARITY 1,
    PROJECTION p_trace_id
    (
        SELECT 
            trace_id,
            groupArrayDistinct(service_id),
            min(start_time) AS min_start_time,
            max(start_time) AS max_start_time
        GROUP BY trace_id
    )
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(start_time)
ORDER BY (start_time, service_id, endpoint_name, is_error)
TTL toDateTime(start_time) + toIntervalDay(10)

First, partitioning is still by day. The ORDER BY key is time + service + endpoint name + error flag. Why not service + time? Because in many queries, time is a condition more often than service; if service came first, ClickHouse would most of the time have to traverse scattered parts of each data file and the I/O would become fragmented.

Queries by endpoint name are optimized with ORDER BY plus a skip index. An endpoint name is usually the name of an interface, which rarely repeats across services. Physically, identical or similar endpoint names are therefore laid out together, and the skip index filters out further granules, so the remaining data to read is very likely contiguous. This avoids scanning all the data in the time range as much as possible.
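
For illustration, a trace list query under this design looks roughly like the following sketch (the column list and filter values are made up for the example):

-- Illustrative trace list query: start_time leads the sorting key, so the time
-- range prunes granules first; service_id and endpoint_name narrow the scan,
-- with the tokenbf_v1 skip index helping on endpoint_name.
SELECT trace_id, segment_id, endpoint_name, latency, is_error, start_time
FROM skywalking.segment
WHERE start_time BETWEEN '2021-08-01 10:00:00' AND '2021-08-01 11:00:00'
  AND service_id = 'order-service'
  AND endpoint_name = '/api/order/create'
ORDER BY start_time DESC
LIMIT 20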

The most important query, by trace ID, is more cumbersome, because it can be issued without any other criteria (including time), and a trace ID naturally spans time. However, trace_id could not be squeezed into any position of the sorting key, and after trying various secondary (skip) indexes the effect was still very unsatisfactory; in fact there was basically no effect.

Finally, I used Projection, which was still a beta feature at the time. It can be understood simply as a table within a table (it is physically stored as well): the data of each part is additionally stored in another structure. It behaves much like a materialized view inside ClickHouse, with the advantages that it is selected automatically by the query optimizer and its life cycle is controlled by the parent partition.

In this case, the projection stores, for each trace_id, the max start time, the min start time, and the de-duplicated list of services; the result is then used to go back to the source table for the actual segments. The final performance is acceptable, and this two-step lookup is why, in the earlier benchmark, the response time of the trace detail query is almost twice that of the trace list query.
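
Concretely, the trace ID lookup becomes a two-step query along these lines (a sketch against the table above, written with ClickHouse query parameters; at the time the projection optimization was also gated behind allow_experimental_projection_optimization):

-- Step 1: the aggregate projection answers this without scanning the raw segments.
SELECT
    min(start_time) AS min_start_time,
    max(start_time) AS max_start_time
FROM skywalking.segment
WHERE trace_id = {trace_id:String}

-- Step 2: go back to the source table, now bounded by the time window from step 1.
SELECT *
FROM skywalking.segment
WHERE start_time BETWEEN {min_start_time:DateTime64(3)} AND {max_start_time:DateTime64(3)}
  AND trace_id = {trace_id:String}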

But this design also has its imperfections, on two points:

  1. Sorting by response time is not optimized; querying the list can be very slow if the filter criteria are insufficient.
  2. Trace ID queries are slow when a trace produces a large amount of data or spans a long time. Sometimes this happens because of a programming bug, or because a task is delayed for a long time.

Apart from these two imperfections, overall use has been very smooth: opening the trace page went from over ten seconds to seconds. Using ClickHouse in the background to analyze service call composition and response times, or even to scan periodically and raise alerts, are bonus capabilities. The visible latency of the data also improved significantly: in ES, data only became queryable after roughly 60 seconds (a delay accepted to improve write performance), while in ClickHouse a trace is queryable about 5 seconds after it is reported.
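
As one example of the kind of background analysis this enables, a sketch (the column choices are illustrative):

-- Sketch: per-endpoint call volume, p95 latency and error count over the last hour.
SELECT
    service_id,
    endpoint_name,
    count() AS calls,
    quantile(0.95)(latency) AS p95_latency_ms,
    countIf(is_error = 'error') AS errors
FROM skywalking.segment
WHERE start_time >= now() - INTERVAL 1 HOUR
GROUP BY service_id, endpoint_name
ORDER BY p95_latency_ms DESC
LIMIT 50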

ClickHouse optimization and pitfalls

Compared with ElasticSearch’s mature cluster management capabilities, ClickHouse is trickier to operate.

Writes

The client uses the official JDBC driver and assembles the data as CSV for import, to save as much memory as possible. However, this JDBC implementation has some limitations, such as the inability to compress the payload, which results in high memory usage. This was later optimized in other projects and will be discussed in a later article.
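
For reference, at the SQL level the batch inserts essentially take this shape (a hypothetical sketch; the real rows are built from Segment objects and the JDBC driver handles the transport):

-- Hypothetical CSV batch insert; all values below are placeholders.
INSERT INTO skywalking.segment FORMAT CSV
"seg-0001","trace-0001","order-service","order-service-1","/api/order/create","HttpServer","2021-08-01 10:00:00.000","2021-08-01 10:00:00.120",120,"success","<data_binary bytes>"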

With heavy write traffic, the number of parts became high, which slowed queries on the latest data. Therefore CHProxy was introduced as a proxy, so that writes go directly to the local tables and reads go through the distributed table, greatly reducing the number of parts.
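
The local/distributed split looks roughly like this (a sketch; the cluster name and the distributed table name are assumptions):

-- The local table skywalking.segment (DDL above) exists on every node; a Distributed
-- table on top of it is used only for reads. Writes go through CHProxy to the local tables.
CREATE TABLE skywalking.segment_dist AS skywalking.segment
ENGINE = Distributed('skywalking_cluster', 'skywalking', 'segment', rand())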

Distributed DDL commands

Distributed DDL (ON CLUSTER) commands are a long story: sometimes convenient, sometimes problematic. Distributed DDL tasks accumulate in ZooKeeper, and if a command is not executed, or fails, on one node, all subsequent commands are blocked behind it. For example, CREATE DATABASE xxx ON CLUSTER yyy is almost certain to fail.

After a new node joins the cluster, it will replay the historical ON CLUSTER commands, and this can prevent the node from starting. In that case the historical ON CLUSTER tasks have to be cleared from ZooKeeper manually.
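
One way to see what is sitting in the queue (a sketch; /clickhouse/task_queue/ddl is the default distributed_ddl path and may differ in your configuration):

-- Inspect pending ON CLUSTER tasks through the system.zookeeper table.
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl'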

The merge problem

ClickHouse has its own strategy for managing part merges; for background, see the article “ClickHouse Kernel Analysis: the Merge and Mutation mechanism of MergeTree”. The parameters are difficult to tune and there is little room for intervention. Sometimes parts appear to be piling up badly, but ClickHouse just takes its time; the only manual lever is OPTIMIZE TABLE <table_name> PARTITION <partition_name> FINAL, and there is not much else to configure.

There was also a strange phenomenon: a node started rejecting writes with “too many parts”. A first look showed merge tasks running, but a closer look showed none of them actually completing: each task would run halfway, drop back to zero, and start over, so the number of parts only kept increasing.
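
When this happened, the standard system tables were enough to see how bad the part buildup was and whether the merges were actually progressing (a sketch, nothing Skywalking-specific):

-- Active parts per partition; too many parts leads to write rejections.
SELECT partition, count() AS parts
FROM system.parts
WHERE database = 'skywalking' AND table = 'segment' AND active
GROUP BY partition
ORDER BY parts DESC

-- Progress and memory usage of the background merges.
SELECT table, elapsed, progress, num_parts, memory_usage
FROM system.merges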

Searching turned up nothing, and there was no hint in the logs. In the end the cause was that merge tasks were failing due to insufficient memory, so a failed task had to start over from the beginning; worse, the half-finished merges themselves occupied memory, so they kept failing one after another. A GROUP BY query also failed on that node with an out-of-memory error (the same query ran fine on other nodes). Finally, the background merge pool size (background_pool_size) was reduced and the available memory increased; after a restart the problem was solved.

Conclusion

ClickHouse is inspiring software, but it is not a panacea, and it is ultimately better suited to analytics than to point queries. If your scenario does not strictly call for it, it can easily end up with you serving it rather than it serving you.

We are currently experimenting with ClickHouse as a log storage platform, replacing ElasticSearch (again) and cloud logging services, aiming for schema independence, separation of storage and compute, tenant isolation, fast analytics, and more.