Apache Kylin Starter 5 - Build Cube - Moment For Technology

Apache Kylin starter series directory

Introduction to Apache Kylin 1 – Basic Concepts
Getting Started with Apache Kylin 2 – Principles and Architecture
Apache Kylin Getting started 3 – Details of installation and configuration parameters
Apache Kylin Starter 4 – Building the Model
Apache Kylin Starter 5 – Build Cube
Apache Kylin Starter 6 – Optimizing Cube
Construct Kylin query time monitoring page based on ELKB

Since business requirements vary, this article does not use specific cases to demonstrate, but rather provides a comprehensive overview of options and considerations in the construction process.

A, Cube Info

The Cube Info interface mainly fills in some basic information of Cube. First, select a data model and then fill in the Cube name. The Cube name must be unique globally. After completing the Cube information, click “Next” to enter the Next step.

Second, the Dimensions

Dimensions is a dimension selection interface, which selects some columns from the Dimensions of the data model as the Dimensions of Cube. This is the first important step in the process of Cube construction. The setting here will affect the number of cuboids generated, thus affecting the size of Cube data.

When selecting a dimension, each dimension column can be either Normal or Derived. Compared with the ordinary dimension, the derived dimension does not participate in the dimension Cuboid, and the foreign key (FK) corresponding to the derived dimension participates in the dimension Cuboid, thus reducing the number of CuBoids. At query time, queries on derived dimensions are first converted to queries on the dimension where the foreign key resides, resulting in a small (but acceptable) performance sacrifice.

1. Suggestions on dimension selection:

Dimensions as cubes need to satisfy the following criteria: dimensions that might exist in a WHERE condition or in groupBy;
The Fact Table only selects the fields that participate in the query, and the fields that do not participate in the query must not be checked (even if the foreign key);
The primary key in the Lookup Table corresponds to the foreign key of the fact Table one by one. It is recommended to select the foreign key of the fact Table and select the Derived dimension as the primary key of the dimension Table.
For star models, dimension table fields can often all be derived fields;
For the snowflake model, if the dimension table has child tables, the foreign keys of the dimension table for the child tables are recommended as Normal dimensions.

2. Special attention:

Table join fields do not have to participate in Cuboid calculations;
Table join fields will not participate in Cuboid if they are not selected and there are no fields in the foreign key table as derived dimensions.
Once set to Normal, it must participate in Cuboid calculation;
If the dimension table has levels (such as province, city, county, day, month, and year), the fields related to the recommendation level are selected as Normal dimensions.

Third, Measures

After dimension selection is complete, you need to select the measure aggregation method. The general aggregation methods are COUNT, SUM, MIN, MAX, and PERCENTILE. The other aggregation methods are described in detail below.

1, TOP_N

Top-n measurement is designed to predict the required top-N during Cube construction. In the query phase, top-N records can be quickly obtained and returned, so that the query performance is much higher than the Cube without top-N predicted results.

1.1. How to select Group By in top-N?

For example, the provincial ranking of the total amount of carbon dioxide pollutants in China is the result of the provincial ranking. What needs to be measured is the total amount of pollutants, so Group By needs to be set as the pollutant type.

1.2 what is the meaning of Top N in Return Type?

TOP N indicates that the order of the TOP N is accurate. For example, TOP 10 indicates that the final TOP 10 is accurate (there is an error if the cardinality of the dimension is very large), but it does not mean that only the TOP 10 can be selected (Limit 10). You can use other numbers, such as Limit 500. There is just no guarantee of accuracy when returning more content.

1.3 top-N Storage

When top-n is used, the sort measure field and Group By field are combined together to form a single field for storage. The user needs the TOP 100 results, and Kylin keeps the TOP 5000 (50 times) records for each combination condition value for later merging again.

2, Count_Distinct

The Count_Distinct metric has two implementations:

Approximate implementation: Based on HyperLogLog algorithm, the error rate can be selected to accept (from 9.75% to 1.22%), low error rate requires more storage;
Precise implementation: Based on Bitmap (Bitmap) algorithm, for data type tinyINT, SmallINT and INT data, the corresponding value of the data will be directly into the Bitmap; For long, string, and other data types, encode them into a dictionary as a string, and then insert the corresponding values into the bitmap. The measure returned is the serialized bitmap data, not just the calculated value. This ensures that the result is correct in different segments, even across different segments.

The more accurate the HyperLogLog, the more storage space it consumes. In most scenarios, the approximate implementation of HyperLogLog can meet the requirements.

3, EXTEND_COLUMN

In analysis scenarios, an ID is often filtered but the query result is displayed as name, such as user_id and user_name. There are three common solutions to this problem:

Set both ID and name to dimensions. Query statements are similarselect name, count(*) from table where id = 1 group by id,name, the problem of this method is that it will lead to the increase of dimensions, resulting in the expansion of predicted calculation results;
Both ID and name are set as Dimensions, and both are set as Joint Dimensions. The advantage of this approach is that the number of dimension combinations will not increase, but it restricts other optimization of Dimensions. For example, ID can no longer be set as a mandatory dimension or a hierarchical dimension.
Set the ID as a dimension, name as a special Measure, and type as Extended Column. In this way, the id filtering and name query requirements can be ensured, and the further optimization of THE ID dimension will not be affected.

Fourth, the Refresh Setting

Auto Merge Thresholds: Automatically merges small segments to medium or larger segments. If you do not want to automatically Merge segments, delete the default two Thresholds.
Volatile Range: defaults to 0. Auto Merge automatically merges all possible cube segments. After setting a specific value, Auto Merge will not Merge the latestVolatile RangeDay cube segments; If Volatile Range is set to 7, cube segments generated in the last seven days are not automatically merged.
Retention Threshold: Kylin can set Retention Threshold to automatically clear old segments that do not need to be queried, saving disk space. Whenever a new Segment is built, Kylin will automatically check the old Segment. If the difference between the end date of these segments and the end date of the latest Segment is greater than the retention time threshold, the Segment will be cleared. If automatic cleaning is not required, you can set the retention time threshold to 0 by default.
Partition Start Date: the Start time of the Cube build,The 1970-01-01 08:00:00The default value is the start time of the partition.

Fifth, the Advanced Setting

Advanced Settings are mainly used for pruning optimization of Cuboid, Through Aggregation Group, Mandatory Dimension, Hierarchy Dimension, Joint Dimension, etc., The combination of CuBOids can be kept within the desired range.

1. Aggregation Group

Based on the dimension combination of the query, you can divide the dimension combination categories, which are called aggregation groups in Kylin. For example, the query requirement is the ranking of pollutant emissions in each region (province, city, district and county) and in each river basin (first, second and third river basins) within a specific time range.

The above query requirements can be combined into two aggregation groups:

Query pollutant emissions based on the regional dimension and time dimension.
Query pollutant emissions based on watershed and time.

If only one aggregation group is used, the region and watershed dimensions can produce many combinations of Cuboids that are useless to the query. In this case, two aggregation groups can be used to separate regions and basins, which greatly reduces the number of useless combinations.

2. Mandatory Dimension

Mandatory dimensions are those that always appear in Where conditions or Group By statements.

Of course, it must exist and may not appear explicitly in the query statement. For example, the query date is a necessary field, and month, quarter and year are its derivative fields. Then the occurrence of month, quarter and year in the query is equivalent to the occurrence of the necessary field of the query date.

3. Hierachy Dimension

Hierarchy is a group of hierarchical dimensions, for example, country > province > city. Country is a high-level dimension, and province and city are lower-level dimensions. Users can query by higher-level dimensions as well as lower-level dimensions. However, when querying lower-level dimensions, the data of the higher-level dimensions is often accompanied by the conditions of the higher-level dimensions rather than the data of the lower-level dimensions. In other words, user queries for these three dimensions can be categorized into the following three categories:

group by country
Group by country, province (equivalent to group by province)
Group by country, province, city (equivalent to Group by country, city or group by city)

4. Joint Dimension

Some dimensions tend to come together, or they have very close cardinality (1:1 mapping), such as “user_id” and “email.” After multiple dimensions are defined as a combinatorial relation, all Cuboids that do not fit this relation are skipped.

In terms of Joint Dimension (A, B), it is better for A and B to appear at the same time in group BY so as not to lose performance. However, if only A or B appears, further aggregation operations need to be performed from the results of group by A and B during query, which will slow down the query speed.

5, Rowkeys

5.1, coding

Kylin stores Cube in HBase in key-value mode. The HBase Key, also known as Rowkey, is composed of values of all dimensions. To store these values more efficiently, Kylin encodes and compresses them; Each dimension can select the appropriate Encoding, and Dictionary Encoding is used by default. The basic encoding types supported by the fields are as follows:

dict: applies to most fields, recommended by default, but inUltra high gay love cases can cause memory shortageThe problem;
boolean: Applies to fields whose value istrue, false, TRUE, FALSE, True, False, t, f, T, F, yes, no, YES, NO, Yes, No, y, n, Y, N, 1, 0;
integer: Applies to fields whose value isInteger character, the supported integer range is[ -2^(8N-1), 2^(8N-1)];
date: Applies to fields whose value isThe date of character, the supported formats includeYyyyMMdd, YYYY-MM-DD, YYYY-MM-DD HH: MM: SS, YYYY-MM-DD HH: MM: sS.sssIf the timestamp part is included, it will be truncated;
time: Applies to fields whose value isTimestamp character, the supported range is[1970-01-01 00:00:00, 2038/01/19 03:14:07], the millisecond part will be ignored, time encoding is applicable to time, Datetime, timestamp and other types;
fix_length:Applicable to ultra-high base scenarios, select the first N bytes of the field as the encoding value. When N is smaller than the field length, field truncation occurs. When N is larger, RowKey length is too long and query performance deteriorates.
fixed_length_hex: applies to hexadecimal characters, such as 1A2BFF or FF00FF. Each two characters requires one byte. It is only applicable to VARCHAR or NVARCHAR.

5.2, the order

The order of each dimension in Rowkeys has a significant impact on query performance. Here the user can adjust the order of dimensions on Rowkeys by dragging and dropping, depending on the query pattern and custom. The recommended order is Mandatory, where filter conditions occur frequently, high cardinality, and low cardinality. In this way, filtering conditions are fully utilized to narrow the scan scope in HBase, improving query efficiency.

5.3, shard

Specify the column of ShardBy, and the detailed data will be sharded according to the value of the column; Columns that do not specify ShardBy are sharded by default based on the data in all columns; Selecting appropriate ShardBy column can make detailed data evenly distributed on multiple data slices, improve parallelism, and obtain more ideal query efficiency. It is recommended to select columns with large cardinals as ShardBy columns to avoid uneven data dispersion.

6. Other Settings

Mandatory Cuboids: Dimension combination whitelist, specifying the dimension combination of cuBOIDS to be built;
Cube EngineCube build engine, MapReduce and Spark; If your Cube only has simple measures (SUM, MIN, MAX), Spark is recommended; MapReduce is recommended if Cube has complex type metrics (COUNT DISTINCT, TOP_N);
Global Dictionary: dictionary used to calculate COUNT DISTINCT exactly, which converts a non-INTEGER value to integer for bitmap de-duplication; If you want to calculate a COUNT DISTINCT column that is already of type INTEGER, you don’t need to define a Global Dictionary; The Global Dictionary is shared by all segments and therefore supports volume up and deduplication across segments.
Segment Dictionary: Another Dictionary for precisely counting COUNT DISTINCT, unlike the Global Dictionary, is built on the values of a segment and therefore does not support aggregate calculations across segments. Use “Segment Dictionary” instead of “Global Dictionary” if your cube is not partitioned or if you can ensure that all your SQL is group by partition_column. This avoids the problem of a single dictionary being too large.
Advanced Snapshot Table: designed for global lookup tables, providing different storage types;
Advanced ColumnFamily: if there isMore than one COUNT DISTINCT or TopN metric, you can put them in more column clusters to optimize I/O with HBase.

Sixth, the Configuration Overwrites

Kylin uses many configuration parameters to improve flexibility, users can configure different parameters according to the specific environment, scenario, etc. Kylin global parameters can be configured in the conf/kylin.properties file. If Cube needs to override global Settings, you need to specify them on this page, and these configuration items will override the default values in the project level and configuration file.

Seven, the Overview

You can take an overview of your cube and go back to the previous steps to make changes. Click the Save button to complete the cube creation.

Eight, the Planner

If you start the Cube Planner, you can view the number of cuboids and the combination of each dimension on the Planner TAB after saving the Cube. This will help you intuitively understand the combination of your dimensions, and you can adjust the Cube at any time if it differs from what you expect.

Any Code, Code Any!

Scan code to pay attention to “AnyCode”, programming road, together forward.

Apache Kylin Starter 5 – Build Cube