Introduction | In recent years, ClickHouse has been the dark horse of big data analytics engines, rising from obscurity into the top 50 of the DB-Engines Ranking and becoming a dazzling star of the global data-engine industry. For single-table queries, ClickHouse is several times faster than other engines and has remained unmatched over the last four years. Why is ClickHouse so fast? How is such an engine used in practice? What exciting new features are on the way? This article is compiled from the talk "Practice and Application of the Latest Technology in ClickHouse", shared by Guo Wei, CTO of Analysys and Tencent Cloud TVP, at the Techo TVP Developer Summit "The Song of Data Ice and Fire — From Online Database Technology to Mass Data Analysis Technology", and introduces the latest ClickHouse features and their practical applications in detail.


I. The Past and Present Lives of ClickHouse

Today I will mainly cover four things. First, the past and present of ClickHouse. Second, its practical applications, because many people think ClickHouse is excellent but do not know how to use it. Third, some of the new features that I personally find interesting. Finally, some thoughts about the future.

1. The birth and takeoff of ClickHouse

First, what is ClickHouse? ClickHouse is an engine that grew out of Russian engineering culture: it began as the storage engine behind Yandex's web-analytics product Metrica. Its core has kept evolving around a self-contained architecture that remains stable even on large server clusters. In China it is currently used by Tencent, Toutiao, Sina and other companies.

Why is ClickHouse so popular? The project was open-sourced at the end of 2016, and I introduced it to China in 2017 and founded the ClickHouse Chinese community. Over the past year it has suddenly taken off both at home and abroad: it has climbed 71 places on the DB-Engines Ranking to enter the top 50 within four years, a growth rate second only to another popular product, Snowflake. Looking at that rising trend, with Snowflake so hot, will ClickHouse become the next Snowflake? I will get to the applications later, but while ClickHouse resembles Snowflake in some ways, it is definitely not Snowflake. So what exactly is ClickHouse?

2. ClickHouse is that fast

Our technical architect at the time put it simply: "ClickHouse is just that fast." It is a very fast engine: columnar storage, column-level compression, support for fuzzy queries and fairly complex SQL, and almost linear scaling. When used well, its data loading and import are also among the fastest, and it is easy to use and beginner-friendly, so many people pick up ClickHouse quickly.
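As a minimal sketch of what this ease of use looks like in practice (all table and column names here are hypothetical, not from any case in this article), a typical ClickHouse table plus a fast single-table aggregation might be:

```sql
-- A hypothetical events table on the default MergeTree engine
CREATE TABLE events
(
    event_date  Date,
    event_time  DateTime,
    user_id     UInt64,
    event_name  LowCardinality(String),
    duration_ms UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)   -- monthly partitions
ORDER BY (event_name, event_date, user_id);

-- A typical single-table aggregation ClickHouse answers in milliseconds
SELECT event_name, count() AS pv, uniq(user_id) AS uv
FROM events
WHERE event_date >= today() - 7
GROUP BY event_name
ORDER BY pv DESC;
```

The `ORDER BY` clause doubles as the sparse primary index, which is why queries filtered on its leading columns are so fast.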

This figure shows data shared by Jack Gao of Sina at a Meetup in 2018. It is from an earlier version, but as you can see: 19 servers, 30 billion records ingested per day, 8 million valid queries per day, and an average query time of roughly 200 milliseconds. That is how fast ClickHouse is. Serving that load on only 19 servers at around 200 milliseconds is a big deal for an OLAP engine, and some core monitoring queries take even less time, about 40 milliseconds.

In 2018 we also ran a side-by-side comparison against common OLAP engines. In the figure on the right, the light blue column is ClickHouse: it is extremely fast for single-table queries, tens of times faster than traditional engines like Hive and Spark SQL, but somewhat slower on joins. So by scenario it is not Snowflake: it shines on single-table workloads, while on joins it performs about the same as an ordinary engine. For wide-table, single-table queries, it is by far the most common choice today.

Why is ClickHouse so fast? From my point of view, there are three reasons:

The first is the computing engine: vectorized execution. The other well-known engine using vectorization is Snowflake, and in this respect the two projects are very similar. Written in C++, ClickHouse applies vectorization down to the assembly level on every block of cells, along with a great deal of extreme code optimization: hash-table optimizations tailored to the single-table aggregation scenario, refined memory handling, and so on. These are areas where ClickHouse excels.

The second is storage design. In its column store, each column is kept in its own data file, and various algorithms are used for column compression. Unlike other engines, each column can be compressed with a separately chosen algorithm to improve storage, and ClickHouse also selects the appropriate internal processing path per table during parsing and querying. This year the community is also introducing something called Projections, which lets you attach additional pre-computed projections to a table, so queries that match a projection become very fast, solving problems in many query scenarios.
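A sketch of per-column compression, assuming a hypothetical metrics table; each column declares its own codec suited to its data shape:

```sql
CREATE TABLE metrics
(
    ts     DateTime CODEC(Delta, LZ4),     -- delta-encode near-monotonic timestamps, then LZ4
    host   LowCardinality(String),         -- dictionary-encode a low-cardinality string column
    value  Float64  CODEC(Gorilla, ZSTD),  -- Gorilla suits slowly-changing float series
    labels String   CODEC(ZSTD(3))         -- heavier general-purpose compression for long strings
)
ENGINE = MergeTree
ORDER BY (host, ts);
```

Because sorting by `(host, ts)` places similar values next to each other on disk, these codecs compress far better than they would on row-oriented storage.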

The third is the ClickHouse community. I run the Chinese community, but the global ClickHouse community is genuinely interesting because the project is self-contained, with no external dependencies. It is not like the old days when you had to deploy a whole stack of Hadoop, HDFS, Hive and Spark; ClickHouse does it all in one package. Many of our customers have replaced their last 30-node Hadoop clusters with a single machine. The design is bottom-up, and every merged change is torn apart in review. The community promises that if you submit a faster approach and prove it works in ClickHouse, it will be adopted immediately and shipped in the next release. As a result, ClickHouse releases roughly every month or two, because genuinely smart people around the world are constantly optimizing it. Of course, some criticize its instability; like a rolling-release Linux distribution, the newest version may have such problems. But that iteration speed means the latest ideas land particularly fast, so if you join this open-source community you can learn a lot of new things, and anything good you build will quickly spread across the globe. Overall, ClickHouse combines all of these techniques very efficiently and pays great attention to implementation details, so its third characteristic is the spirit of community openness.

II. Practical Applications of ClickHouse

ClickHouse helps solve many practical problems in specific scenarios. Almost all Internet companies use it, and big companies such as Tencent and Alibaba have started to offer ClickHouse cloud services. Traditional industries are also starting to adopt ClickHouse in large numbers, which is an opportunity for the community's future.

1. Tencent Music

On the application side, the earliest and heaviest user within Tencent is Tencent Music. What does it do with ClickHouse? Mainly, it solves the "last kilometer" of data analysis beyond the data warehouse. We all build data warehouses, but between the warehouse and the last mile we need an OLAP engine that can respond quickly to demand, so that data analysts and operations people can query data in seconds; ClickHouse solves exactly that. Tencent Music therefore uses ClickHouse compute + storage as the middle layer for a real-time data warehouse with unified batch and stream computing.

This diagram was shared by Tencent Music's Zeus in February this year. Logs first flow into a message queue, perhaps Kafka or an internal Tencent equivalent, then through a real-time ETL built on Flink consumers. The real-time ETL feeds two paths. One path goes straight into ClickHouse to power interactive data analysis and storage; all ad-hoc application queries come directly off this path, and because both paths share the same ETL, the detail data stays consistent. The other path writes offline files, which go through the traditional warehouse (Spark and Hadoop): intermediate tables, wide tables, and finally result tables. The originally real-time data thus achieves batch-stream unification, queries return quickly, and end users get a large number of ad-hoc query reports.
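The Kafka-to-ClickHouse leg of such a pipeline can also be built with ClickHouse's own Kafka table engine plus a materialized view. A sketch with hypothetical names, topics and broker addresses (Tencent Music's actual ETL runs through Flink, as described above):

```sql
-- A Kafka "source" table: each SELECT-side read consumes from the topic
CREATE TABLE events_queue
(
    ts      DateTime,
    user_id UInt64,
    action  String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'user_events',
         kafka_group_name  = 'clickhouse_consumer',
         kafka_format      = 'JSONEachRow';

-- The durable target table
CREATE TABLE events_local
(
    ts      DateTime,
    user_id UInt64,
    action  String
)
ENGINE = MergeTree
ORDER BY (action, ts);

-- The materialized view continuously moves rows from Kafka into MergeTree
CREATE MATERIALIZED VIEW events_mv TO events_local AS
SELECT ts, user_id, action FROM events_queue;
```

This is the "ingest directly from Kafka without Flink" pattern that several of the cases below also touch on.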

I believe many data engineers run into this problem: a leader wants to check something, or operations asks for a number, and it is no small effort to build an intermediate table, a wide table and a result table, only to find the table is used exactly once. ClickHouse solves exactly this: get the data out right away. Plenty of data analysts and operations people can write SQL nowadays, so show them the schema, put SuperSet or some other BI tool on top, and point it straight at ClickHouse; business analysts can then produce the numbers themselves without a developer writing ETL. Engineers still put their effort into the underlying raw data and intermediate tables, while ad-hoc query requests no longer force them to waste time on daily ETL and data scripts; ClickHouse answers those queries directly.

2. Sina

At that time Sina faced 30 billion records per day with a particularly high query load, up to 8 million queries, because it exposed a system API. That was painful, and ClickHouse's fast single-table queries solved the problem perfectly. Sina did a few things. First, since ClickHouse ingests very quickly, it used that feature to load data in real time so the raw tables could be queried freely. In the past you had to build an ETL pyramid: first the atomic layer, then the summary layer, then up again; now you go straight to the bottom layer and find the data you want, shortening the whole processing path. The ETL is also easy to scale out, and resource usage drops sharply. The pipeline starts from Kafka, keeps a thin Kafka-based layer instead of complex ETL, and goes directly from Kafka into ClickHouse. From ClickHouse, some result data is extracted into MySQL to drive a SuperSet dashboard, while detail data is queried directly through an ad-hoc interface. Grafana was also wired in after its protocol changed, retrieving log data from ClickHouse, because the data volume is too large for MySQL; ClickHouse serves these log-style queries instead.

3. Ximalaya

Ximalaya was also an early adopter of ClickHouse, with three usage scenarios that are also the most common ClickHouse scenarios in China. The first is user behavior analysis: retention and conversion over behavior logs. The second is audience selection for user profiling, for example picking a cohort: how many users aged 10-15 are in the crowd, who they are, and how to push to them directly. Third, Ximalaya also uses it for machine log queries: where things went wrong, and APM analysis of all kinds of anomalies, again over billions of rows with second-level response. One notable approach is streaming user events (tracking events, system logs) into ClickHouse via Spark. The other is batch-importing the original transaction data from the warehouse into ClickHouse via SQL, with ClickHouse directly serving the final step of every query. Whether it is behavior analysis, audience selection, or log queries, it is all done this way, which is a new style of usage.
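For retention and conversion analysis of the kind described above, ClickHouse ships dedicated aggregate functions such as `windowFunnel`. A sketch, assuming a hypothetical `user_events` table with columns `user_id`, `ts` and `action` (not Ximalaya's actual schema):

```sql
-- How far did each user get through open -> search -> play within one hour?
SELECT
    level,              -- 0..3: number of funnel steps completed in order
    count() AS users
FROM
(
    SELECT
        user_id,
        windowFunnel(3600)(      -- 3600-second sliding window
            ts,
            action = 'open',
            action = 'search',
            action = 'play'
        ) AS level
    FROM user_events
    WHERE ts >= today() - 7
    GROUP BY user_id
)
GROUP BY level
ORDER BY level;
```

The companion `retention()` aggregate function covers N-day retention in the same single-pass, single-table style.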

4. Qutoutiao

Qutoutiao's challenge in its 2019 deployment was hundreds of billions of rows and 210,000 queries a day, solved perfectly with ClickHouse: over 100 machines, each with 32 cores and 128 GB of memory, with 80% of queries finishing within one second. Its approach is also distinctive. Data flows from Kafka into Flink, and part of it lands in HDFS. ClickHouse's advantage lies in wide tables and single-table queries; joins may not be as fast, so Qutoutiao introduced an innovation: Presto. Presto can query across data sources, so for joins some data is placed in HDFS and queried with Presto together with ClickHouse. It runs two clusters: one for APM log queries, and one for analysts.

5. Bilibili

Bilibili's scenario is also typical: user behavior analysis. There are thousands of event types, real-time ingestion, and thousands of queries per day, and the challenge is arbitrary dimensions and arbitrary events. People have long tried to do this with OLAP engines, but never convincingly, because the SQL must drill down to detail data while analysts still demand second-level results. So Bilibili built a very large-scale, responsive user behavior analysis system. The pipeline is similar to the previous ones: offline data is imported from HDFS via Spark, while real-time data arrives via Kafka and Flink; our community is, of course, now working on ingesting directly from Kafka and replacing Flink. Query access services run through ClickHouse, and all kinds of reporting platforms and delivery queries connect over JDBC. Bilibili also built internal tooling for database and table management, permissions, metadata management, cluster monitoring and so on. Overall, Bilibili uses ClickHouse for behavior analysis scenarios.

6. Suning

Suning is working on another scenario: user profiling. It needs to identify each user's tags precisely while keeping computation cheap, so it uses materialized views and deduplication to solve the user-profiling problem. Specifically, since much of the tag processing is computed offline, tags do not need to be produced in real time, but they are needed at query or push time. Suning therefore keeps all relevant tags in HDFS and dimension tables in MySQL, and treats ClickHouse as the final step for the user-profiling platform. I now see many traditional users adopting ClickHouse the same way: as their query platform, because it queries fastest.
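One common ClickHouse approach to this kind of audience selection (not necessarily Suning's exact scheme; all names are hypothetical) is to pre-aggregate each tag into a bitmap of user ids and intersect bitmaps at query time:

```sql
-- One bitmap of user ids per tag, stored as an aggregate state
CREATE TABLE tag_bitmaps
(
    tag   String,
    users AggregateFunction(groupBitmap, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY tag;

-- Fill from a hypothetical offline user_tags table (user_id, tag)
INSERT INTO tag_bitmaps
SELECT tag, groupBitmapState(toUInt32(user_id))
FROM user_tags
GROUP BY tag;

-- Audience size for "aged 10-15 AND likes music": intersect two bitmaps
SELECT bitmapCardinality(
    bitmapAnd(
        (SELECT groupBitmapMergeState(users) FROM tag_bitmaps WHERE tag = 'age_10_15'),
        (SELECT groupBitmapMergeState(users) FROM tag_bitmaps WHERE tag = 'likes_music')
    )
) AS audience_size;
```

Replacing `bitmapCardinality` with `bitmapToArray` returns the concrete user ids for push delivery.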

7. Gold Data (Jinshuju)

What does it solve? When users fill out a statistical form there may be a lot of data, yet results must be visible immediately after submission. Gold Data originally used MongoDB, but queries were not fast enough, and MongoDB's SQL compatibility is often poor. What to do? Its storage is still MongoDB, but whenever people fill in or query a Gold Data report, ClickHouse serves the end user's query. So ClickHouse is not only for data analysts; it can also be a tool that lets end users query detail data quickly.

8. Huya Live

Its basic approach to log queries is to feed all log data through several Kafka clusters into ClickHouse for fast querying.

Those were a lot of scenarios. Overall, ClickHouse usage has several characteristics. First, it is fast, fast enough to be used directly by end users. Second, it is very good at user profiling and user log analysis, because that was what it was born to do at Yandex in Russia. Third, it is good at APM log queries; storing this kind of log in ClickHouse instead of MySQL makes queries much faster than you would expect.

III. The Latest Features and Future of ClickHouse

There are now many new features, and I would like to introduce a few interesting ones. People often ask me: "My SELECTs on a 2,000-column table are too slow, and such width is not particularly friendly to a column store. What can I do?" ClickHouse will keep getting faster here, but meanwhile do not treat it as 2,000 columns: split them into tables of about 100 columns each, divided by dimension, and it will be very fast.

The second is Projections, which we just mentioned. Projections can do pre-aggregation, and this differs from Vertica: Vertica's projections supported only a few aggregate functions, while ClickHouse's Projections support all of them.
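A sketch of the Projections syntax, assuming the same kind of hypothetical events table as earlier; note that in versions current at the time of this talk the optimizer may need `allow_experimental_projection_optimization = 1` to use projections:

```sql
-- Attach a pre-aggregated projection to an existing table
ALTER TABLE events
    ADD PROJECTION daily_uv
    (
        SELECT
            event_date,
            event_name,
            uniq(user_id)          -- any aggregate function is allowed, not just a fixed few
        GROUP BY event_date, event_name
    );

-- Build the projection for already-written parts
ALTER TABLE events MATERIALIZE PROJECTION daily_uv;

-- Queries whose shape matches the projection are rewritten to read it automatically
SELECT event_date, event_name, uniq(user_id)
FROM events
GROUP BY event_date, event_name;
```

Unlike a separate materialized view, the projection lives inside the table's own parts, so it stays consistent with inserts and merges without extra plumbing.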

Then there is storage-compute separation. Many people ask how ClickHouse separates storage from compute. Recently our community's official WeChat account, "ClickHouse Developers", published a series of cases: how to separate storage and compute on Tencent Cloud, and how to do it on S3. The community will keep following this trend and move toward storage-compute separation step by step.
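A minimal sketch of what the S3 route looks like in the server configuration: an S3 disk plus a storage policy (endpoint, bucket and credentials are placeholders, and details vary by version):

```xml
<!-- e.g. /etc/clickhouse-server/config.d/s3.xml -->
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_disk>
                <type>s3</type>
                <endpoint>https://s3.example.com/my-bucket/clickhouse/</endpoint>
                <access_key_id>YOUR_KEY</access_key_id>
                <secret_access_key>YOUR_SECRET</secret_access_key>
            </s3_disk>
        </disks>
        <policies>
            <s3_policy>
                <volumes>
                    <main>
                        <disk>s3_disk</disk>
                    </main>
                </volumes>
            </s3_policy>
        </policies>
    </storage_configuration>
</clickhouse>
```

A table then opts in with `CREATE TABLE ... ENGINE = MergeTree ... SETTINGS storage_policy = 's3_policy'`, keeping compute nodes stateless while parts live in object storage.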

If you are interested in the rest, you can subscribe to the community's WeChat official account shown on the last page for details. There are also many Meetup videos I have organized on the ClickHouse community's official account on Bilibili, which you can study carefully.

We’ve talked about a lot of roadmaps for the future, where ClickHouse will do a lot of work on detailed scenarios and the last mile of customer usage data. In China, we have met many ideas and demands from users and customers, so now the Chinese community is also considering whether it has the opportunity to become a company that can commercialize ClickHouse in the future and make ClickHouse bigger and better together with Russia.

ClickHouse resources: the official site is clickhouse.yandex, and the Chinese volunteer community is www.clickhouse.com.cn. You can also add my WeChat (ID: Guodaxia2999) with a note of company-profession-name to help make the community better. You are welcome to use ClickHouse and make it the last mile of your company's data analysis.

About the speaker

Guo Wei

CTO of Analysys, Tencent Cloud TVP, official Member of the Apache Software Foundation, founder & PMC member of Apache DolphinScheduler, founder of the ClickHouse Chinese community, deputy chairman of the Intelligent Application Service Branch of the China Software Industry Association, and one of the "33 Best People in China's Open Source Community". Mr. Guo Wei graduated from Peking University. He was formerly big data director at Lenovo Research and general manager of the data department at Wanda E-commerce, and held important big data positions at CICC, IBM and Teradata, making outstanding contributions to frontier big data research. After joining Analysys in 2015, he drove the construction of Analysys's big data technology architecture and its hybrid cloud architecture. In 2018 he proposed the big data IOTA framework and the concept of the enterprise "Data River", led the team to build a second-level data computing engine and verify its architecture, and launched Analysys's open-source framework DolphinScheduler. The project entered the Apache Incubator in 2019, and in 2021 he was elected an Apache Software Foundation Member.