In Section 3, you learned that ClickHouse compresses data in blocks as it processes it and then writes it to a data file on disk. This shrinks the data volume and reduces disk I/O time. However, without an index, every query must read all of the data; even with the volume reduced by a factor of 6.2 through compression, that still costs a lot of disk I/O. This is where the index comes in: it further reduces the amount of data a query has to read.

Before introducing ClickHouse's index, let's review the B+ tree, a common indexing structure used in relational databases such as MySQL. The B+ tree algorithm itself is beyond the scope of this article, so we will not discuss it in depth; instead, we will focus on why MySQL uses a B+ tree and what a B+ tree fundamentally is. In essence, a B+ tree is an N-ary tree whose leaf nodes hold the index values in sorted order. A query can therefore locate data quickly by walking down the tree, and because the leaves are ordered, the structure also supports range scans. The following figure shows a B+ tree.

Figure: B+ tree diagram

With an understanding of the nature of the B+ tree, try to answer this question: does ClickHouse need a B+ tree index? Why or why not?

If your answer is no, you already know a lot about both the ClickHouse and MySQL storage engines. If your answer is yes, or you are not sure, don't worry: the reasons are explained below.

The answer is no, because of design differences between ClickHouse's storage engine and MySQL's. MySQL supports transactions and uses MVCC for concurrency control, which creates a problem: rows are inserted in a different order than the index's sort order. For example, if I index the AGE column and insert the values 44, 21, 20, 33, 19, the B+ tree must rearrange them as 19, 20, 21, 33, 44. In a relational database, the order in which rows are inserted is the order in which the storage engine writes them to the data file, so the index must maintain the sorted order separately.

ClickHouse's storage engine works differently. The LSM mechanism described in detail in Chapter 3 ensures that data reaches the storage engine in primary-key order: even if the insert order is 44, 21, 20, 33, 19, the sequence written to the data file after passing through ClickHouse's LSM mechanism is 19, 20, 21, 33, 44. Because the data is already stored in sorted order, there is no need for a B+ tree index on top of it.
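The idea can be sketched in a few lines: each insert batch is sorted before being flushed, and sorted runs merge into one globally sorted run. This is a toy illustration of the LSM principle, not ClickHouse's actual MergeTree code.

```python
import heapq

# Values arrive in arbitrary order (the example from the text).
inserts = [44, 21, 20, 33, 19]

# Each flushed part is sorted in memory before it hits disk.
run1 = sorted(inserts[:3])   # first flushed part
run2 = sorted(inserts[3:])   # second flushed part

# Merging sorted runs yields one globally sorted sequence, so the data
# file itself is ordered by primary key and needs no B+ tree on top.
merged = list(heapq.merge(run1, run2))
print(merged)  # [19, 20, 21, 33, 44]
```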

The primary index

To sum up, ClickHouse's primary index is very simple: it only needs to record the first value of each block. For example, consider 100 million rows whose primary keys range from 1 to 100,000,000. Stored in ClickHouse at 8,192 rows per block, this yields 12,208 blocks, and the index values are 1, 8193, 16385, ... At query time, all ClickHouse needs to do is use these values to decide which blocks to read. A query such as SELECT * FROM t WHERE id > 500 AND id < 12258 only needs to read the first two blocks.

In ClickHouse's data files, the primary index is stored in primary.idx. A primary index essentially records the minimum value of the data in each block, which determines the block in which the queried data is located: it maps data values to blocks. Simply put, given a value, the primary index lets you quickly find the block that holds it, avoiding a scan of the entire data set to find a single row.
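Because the index values are sorted, locating the candidate blocks for a range query is a binary search over the index array. A minimal sketch, using the block size and key range from the example above (the function name is illustrative, not ClickHouse's API):

```python
from bisect import bisect_right

# Sparse primary index: the first primary-key value of every 8192-row
# block, for keys 1..100,000,000 as in the text.
ROWS_PER_BLOCK = 8192
index = list(range(1, 100_000_001, ROWS_PER_BLOCK))  # 1, 8193, 16385, ...

def blocks_for_range(lo, hi):
    """Return the block numbers that may contain keys in [lo, hi]."""
    first = max(bisect_right(index, lo) - 1, 0)
    last = max(bisect_right(index, hi) - 1, 0)
    return list(range(first, last + 1))

# WHERE id > 500 AND id < 12258 -> only the first two blocks are read.
print(blocks_for_range(501, 12257))  # [0, 1]
```

Note that the index holds 12,208 entries for 100 million rows: one 8-byte value per 8,192 rows is cheap enough to keep entirely in memory.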


The previous section introduced what a primary index is. But the primary index alone cannot achieve fast lookup: it only maps data values to blocks. One problem remains: even if I already know that my data sits in the first block, how do I locate that block on disk? The answer lies in the mark file. In other words, the mark file stores the mapping from blocks to file offsets.

This is not hard to understand. A block holds 8,192 rows. If every block occupied a fixed size, say 64 MB, finding the Nth block would be trivial: just seek to N * 64 MB. But compression means block sizes are not fixed; each block occupies a different number of bytes on disk. Therefore, the starting position of each block must be stored explicitly so that it can be found.

In ClickHouse's actual implementation, each mark consists of three fields: the block id, the offset within the data file, and the offset within the compressed block after decompression. Each field is 8 bytes, so a mark occupies 24 bytes in total; given a block id N, seeking to N * 24 bytes in the mark file retrieves its mark immediately.

With the block id and the offset in the data file, you can already locate the data quickly. So what is the third field in ClickHouse's mark file, the offset within the compressed block, for? The answer has to do with how ClickHouse handles compressed blocks. As mentioned in Chapter 3, ClickHouse forms a block every 8,192 rows. But what if those 8,192 rows amount to very little data? A UInt8 column, for example, takes only 8 bits, or 1 byte, per row, so 8,192 rows are only 8,192 bytes = 8 KB. That is still too little data to compress effectively. ClickHouse therefore enforces a minimum of 64 KB and a maximum of 1 MB per compressed block: if a block sums to less than 64 KB, the next block is appended to it. As a result, a single compressed block may contain multiple logical blocks, and the third field in the mark records where each logical block begins inside the decompressed data.
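The grouping rule can be sketched as follows: accumulate logical blocks until the group reaches 64 KB, then flush. This is a simplified model using the thresholds from the text; real ClickHouse behavior (codec framing, the 1 MB upper bound, splitting oversized rows) is more involved.

```python
MIN_COMPRESS = 64 * 1024  # 64 KB lower bound from the text

def group_blocks(block_sizes):
    """Group logical blocks into compressed blocks of >= 64 KB.

    Returns (group_size, per-block offsets inside the group) pairs; the
    offsets are what the mark's third field records.
    """
    groups, current, offsets, pos = [], [], [], 0
    for size in block_sizes:
        offsets.append(pos)       # offset of this block inside the group
        current.append(size)
        pos += size
        if pos >= MIN_COMPRESS:   # group is big enough: flush it
            groups.append((sum(current), offsets))
            current, offsets, pos = [], [], 0
    if current:
        groups.append((sum(current), offsets))
    return groups

# Nine 8 KB UInt8-style blocks: the first eight fill one 64 KB compressed
# block, the ninth starts a new one at offset 0.
print(group_blocks([8192] * 9))
```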

So suppose a mark file has the following contents:

block id | offset in data file | offset in decompressed block
0 | 0 | 0
1 | 0 | 8192
2 | 0 | 16384
3 | 12016 | 0
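Reading this table: blocks 0, 1, and 2 live in one compressed frame starting at file offset 0, at decompressed offsets 0, 8192, and 16384 respectively; block 3 starts a new compressed frame at file offset 12016. A toy read path under those assumptions, using zlib as a stand-in for ClickHouse's real codecs:

```python
import zlib

BLOCK = 8192  # decompressed bytes per logical block in this example

# Build a toy data file: one frame holding three 8 KB blocks, then a
# second frame holding one more block (zlib stands in for LZ4/ZSTD).
frame1_plain = bytes(range(256)) * 96            # 3 * 8192 = 24576 bytes
data_file = zlib.compress(frame1_plain) + zlib.compress(b"x" * BLOCK)

def read_block(file_offset, decompressed_offset):
    """Apply a mark: seek to the frame, decompress it, slice the block."""
    d = zlib.decompressobj()                     # stops at end of frame,
    raw = d.decompress(data_file[file_offset:])  # ignoring later frames
    return raw[decompressed_offset:decompressed_offset + BLOCK]

# Mark (2, 0, 16384): frame at offset 0, block starts 16384 bytes in.
block2 = read_block(0, 16384)
print(len(block2))  # 8192
```

The two-level offset is the key point: the file offset finds the compressed frame, and the in-frame offset finds the logical block after decompression.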