Abstract: In the era of mobile Internet, there are a large number of users, a large number of labels, and a huge amount of data of user labels. In the user portrait system, different enterprises have different implementation schemes for label storage and query. The current mainstream implementation scheme uses ElasticSearch scheme. However, the construction of user portrait platform based on ElasticSearch often faces some problems, such as lack of flexibility, high resource cost, no SQL interface development inconvenience, etc. Therefore, this paper provides a method to build user portrait system based on Huawei MRS ClickHouse.

This article is shared by Hourongqi from huawei Cloud community “Introduction to Building User Portrait System Based on MRs-ClickHouse”.

1. Service scenarios

User portrait is the tagging of user information. User portrait system through the collection of all dimensions of data, in-depth analysis and mining, to different users with different labels, so as to describe the overall picture of the customer. Through the user portrait system, each user can be accurately positioned, so as to apply it to personalized recommendation, precision marketing and other business scenarios. User portrait system has been widely used by various enterprises, is one of the important ways of big data landing.

In the era of mobile Internet, there are a large number of users, a large number of labels, and a huge amount of data of user labels. In the user portrait system, different enterprises have different implementation schemes for label storage and query. The current mainstream implementation scheme uses ElasticSearch scheme. However, the construction of user portrait platform based on ElasticSearch often faces some problems, such as lack of flexibility, high resource cost, no SQL interface development inconvenience, etc. Therefore, this paper provides a method to build user portrait system based on Huawei MRS ClickHouse.

2. Why build tag query system based on MRS-ClickHouse

2.1 MRS – ClickHouse profile

MRS-ClickHouse is a column database for on-line analytical processing. Its core features are extreme compression rate and fast query performance. Mrs-clickhouse supports SQL queries with good query performance, especially aggregate analysis queries based on large and wide tables, which are orders of magnitude faster than other analytical databases.

ClickHouse has the following features:

  • ClickHouse is a complete database management system with basic DBMS functions, including DDL, DML, permission control, data backup and recovery, and distributed management.
  • ClickHouse is a database that uses column storage. Data is organized by column. Data belonging to the same column is stored together and in separate files between columns. When performing data query, column storage can reduce data scan scope and data transfer size, and improve data query efficiency.
  • The vectorization execution engine ClickHouse implements vectorization execution using the CPU’s SIMD instructions. The full name of SIMD is Single Instruction Multiple Data, which means to operate Multiple Data with a Single Instruction and improve performance through Data parallelism. Its principle is to realize Data parallel operation at the LEVEL of CPU registers.
  • ClickHouse uses SQL entirely as the query language and provides a standard protocol SQL query interface, making it easy to integrate with existing third-party analysis visualization systems. ClickHouse also uses a relational model, so moving systems built on traditional relational databases or data warehouses to ClickHouse becomes less costly.
  • A ClickHouse cluster consists of one to many shards, each of which corresponds to one service node of ClickHouse. The maximum number of shards is determined by the number of nodes (one shard corresponds to only one service node). ClickHouse provides the concept of Local and Distributed tables. A local table is equivalent to a shard of data. The distributed table itself does not store any data, it is the local table access proxy, its role is similar to the branch library middleware. With distributed tables, you can broker access to multiple data shards for distributed queries.

2.2 Bitmap Index Introduction

A bitmap is a data structure associated with specific values by array subscripts. In a bitmap, each element occupies 1 bit. A bit of 1 indicates that the corresponding element has that particular value. Otherwise, there is none.

For example:

ID set: [0,1,4,5,6,7,9,10,13,14] can be represented as 11001111 01100110 by bitmap

As shown below:

A bitmap index is a special index that uses a bitmap and is created for a large number of columns with the same value. Each bit in the bitmap position code indicates the presence or absence of a corresponding data row. Bitmap indexes are suitable for columns with fixed values, such as gender, marital status, district, and so on. Columns with discrete values such as ID numbers and consumption amounts are not suitable. In the user portrait scene, each label corresponds to a large number of people. The number of labels is a finite enumeration of values, which is ideal for bitmap indexing.

For example:

Suppose there are two labels, one is label 1- Hold precious metal and the other is label 2- Hold insurance. The following table shows the labels owned by each cardholder.

From this, we can see that the set of cardholder ids with the tag 1 that holds the precious metal is: [0,1,4,5,6,7,9,10,13,14]. The cardholder ID clusters with label 2- holding insurance are: [2, 3, 5, 7, 8, 11, 12, 13, 15].

When we need to query the users who have this label at the same time, based on the bitmap index, we only need to carry out bitmaps corresponding to the two labels to get the final result. In this way, the storage space of label data is very small, and the speed of label calculation is very fast.

2.3 MRS-ClickHouse native support for bitmap indexing

Before ClickHouse, applying bitmap indexes to user portrait scenarios required building your own bitmap data structures and managing bitmap indexes, which was a high barrier to entry. The good news is that MRS-ClickHouse natively provides support for bitmap data structures and location indexing, encapsulating bitmap construction and maintenance within ClickHouse. It is much easier for users to build bitmap indexes based on ClickHouse.

ClickHouse bitmap constructor:

ClickHouse position manipulation functions:

ClickHouse bitmap operation results obtain function:

Why build a tag query system based on ClickHouse?

  1. ClickHouse queries are fast, with sub-second responses;
  2. ClickHouse has a built-in bitmap data structure to facilitate bitmap indexing and improve label query performance.
  3. Based on JDBC/SQL interface, development is simpler;
  4. Based on MPP architecture, can be horizontally expanded;

3. How to build a label query system based on MRS-ClickHouse

Create a raw label table in ClickHouse and import the raw label data into it. Then the label bit chart is constructed based on the original label table and the corresponding distributed table is created. Upper-layer label query applications perform label query based on label bit chart (distributed table).

The process is as follows:

The detailed process is described below.

Step 1: Create the original label table and import the original label data

First, create a label raw table to store the label raw data. The label result data calculated by the upstream system is written into this table. This table is the local surface. Its construction sentences are as follows:

CREATE TABLE IF NOT EXISTS tBL_tag_src ON CLUSTER default_cluster(tagname String, -- tagValue String, Userid - tag value UInt64) ENGINE = ReplicatedMergeTree ('/clickhouse/default/tables / {shard} / tbl_tag_src ', '{up}') PARTITION BY tagname ORDER BY tagvalue;Copy the code

Then create a distributed table:

CREATE TABLE IF NOT EXISTS default.tbl_tag_src_all ON CLUSTER default_cluster 
AS tbl_tag_src 
ENGINE = Distributed(default_cluster, default, tbl_tag_src, rand());
Copy the code

The data preview is as follows:

Step 2: Create a label bitmap and build a label bitmap

To create a label bit chart, start by creating the local table. This table is used to store label bitmap data. Its creation statement is as follows:

Create bitchart, CREATE TABLE IF NOT EXISTS tBL_tag_bitmap ON CLUSTER default_cluster (tagName String, -- tagValue String, AggregateFunction(groupBitmap, UInt64) - userid collection) ENGINE = ReplicatedAggregatingMergeTree ('/clickhouse/default/tables / {shard} / tbl_tag_bitmap ','{replica}') PARTITION BY tagname ORDER BY (tagname, tagvalue) SETTINGS index_granularity = 128;Copy the code

Then create the corresponding distributed table. Distributed tables are used to query labels for upper-layer applications. Its construction sentences are as follows:

CREATE TABLE IF NOT EXISTS default.tbl_tag_bitmap_all ON CLUSTER default_cluster ( tagname String, -- tag name tagvalue String, -- tagvalue tagbitmap AggregateFunction(groupBitmap, UInt64) -- USERID set (ENGINE = Distributed(default_cluster, default, TBL_tag_bitmap, rand());Copy the code

Import the data from the label raw table into the label bit chart. During the import process, the bitmap is constructed using the groupBitmapState() function. The SQL statement is as follows:

-- Import data, Merge all userids of the same label INTO one bitmap INSERT INTO tBL_tag_bitmap_all SELECT using the groupBitmapState function tagname,tagvalue,groupBitmapState(userid) FROM tbl_tag_src_all GROUP BY tagname,tagvalue;Copy the code

Step 3: Quickly retrieve labels based on distributed tables

Query the list of userids that hold precious metals and are male:

WITH (SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = 'hold' AND tagvalue = 'hold' LIMIT 1) AS bitmap1, (SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = 'gender' AND tagvalue = 'male' LIMIT 1) AS bitmap2 SELECT bitmapToArray(bitmapAnd(bitmap1, bitmap2)) AS resCopy the code

The total number of male and female insurance holders is as follows:

---- Query the number of male customers with insurance: WITH (SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = 'hold' AND tagvalue = 'insurance' LIMIT 1) AS bitmap1, (SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = 'gender' AND tagvalue = 'male' LIMIT 1) AS bitmap2 SELECT BitmapCardinality (bitmapAnd(bitmap1, bitmap2)) AS res ---- WITH (SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = 'hold' AND tagvalue = 'insurance' LIMIT 1) AS bitmap1, (SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = 'gender' AND tagvalue = 'female' LIMIT 1) AS bitmap2 SELECT * FROM tbl_tag_bitmap_all WHERE tagname = 'gender' AND tagvalue = 'female' LIMIT 1) AS bitmap2 bitmapCardinality(bitmapAnd(bitmap1, bitmap2)) AS resCopy the code

4. To summarize

In view of the massive tag query in the user portrait scenario, the traditional scheme has some problems, such as lack of flexibility, high resource consumption, lack of SQL interface and difficulty in developing. Based on Huawei MRS-ClickHouse, it is very convenient to build bitmap index and realize real-time retrieval of massive label data. Mrs-clickhouse reduces development costs significantly, responds faster to tag queries, and facilitates precision marketing.

Huawei FusionInsight MRS cloud native data lake has been widely used in governments, finance, carriers, large enterprises, and Internet industries. Together with more than 800 partners, huawei FusionInsight MRS cloud native data lake serves more than 3,000 government and enterprise customers in more than 60 countries and regions.

Click to follow, the first time to learn about Huawei cloud fresh technology ~