Kylin official case analysis and pruning optimization -OLAP business environment combat

Copyright Notice: This set of technical column is the author (Qin Kaixin) usually work summary and sublimation, through extracting cases from real business environment to summarize and share, and give business application tuning suggestions and cluster environment capacity planning and other content, please continue to pay attention to this set of blog. Copyright notice: No reprint, welcome to learn.

Kylin’s official case is a very classic case, which contains technical details that can be mined to reveal more valuable information, such as data warehouse theory, star models, snowflake models, role-playing dimensions, dimension pruning, etc. I believe you will through my in-depth analysis, there is a sense of enlightenment. The official case is an order case, which contains the order fact table, order date table, item category dimension, account dimension (buyer and seller), and region dimension (buyer and seller).

A detailed explanation of Kylin official case table relations and fields

1.1 Brief explanation of Kylin’s official case sheet

KYLIN_SALES is the fact table that holds details of the sales order. Each column stores seller, commodity classification, order amount, commodity quantity and other information, and each row corresponds to a transaction order.
KYLIN_CATEGORY_GROUPINGS is a dimension table that stores detailed descriptions of product categories, such as category names.
KYLIN_CAL_DT is also a dimension table that holds extended information about time. Such as a single date, month, week, year, month, etc.
KYLIN_ACCOUNT is also a dimension table, which acts as a role-playing dimension and contains buyer and developer accounts.
KYLIN_COUNTRY contains the region dimension, which, as a role-playing dimension, contains information about the buyer’s and developer’s region.

1.2 KYLIN_SALES fact table fields:

1.3 KYLIN_CATEGORY_GROUPINGS Field Description:

1.3.1 KYLIN_CATEGORY_GROUPINGS Main fields:

LEAF_CATEG_ID: the primary key
SITE_ID: foreign keys

1.3.2 KYLIN_CATEGORY_GROUPINGS External Connection Relationship:

1.3.3 KYLIN_CATEGORY_GROUPINGS Main fields:

1.4 Main fields in KYLIN_COUNTRY

1.5 KYLIN_ACCOUNT Main fields:

1.6 Overview of snowflake Model Relationship:

Ii. Model Building

2.1 Kylin’s official model model diagram is as follows:

2.2 Step 1: Snowflake and star model, inner join relationship is established, it is important to determine the field association, for role dimensions (KYLIN_ACCOUNT and KYLIN_COUNTRY), alias processing is required.

The following describes the Add Lookup Table page:

Data relationships are not only between fact tables and dimension tables (star models), but also between dimension tables and dimension tables (snowflake models).
There are three types of Join added between tables: Left Join, Inner Join, and Right Join.
The Skip Snapshot for this Lookup Table option refers to whether to Skip the creation of snapshotTable. Because some lookup tables are very large (greater than 300M), if the cardinality of a dimension is large, memory may appear OOM. Therefore, when creating snapshotTable, the size of the original table cannot exceed a configured upper limit (kylin.snapshot. Max – MB, default 300).
Lookup tables that skip build snapshot will not be searchable and will not support setting to Derived dimensions.
In most cases, “Left Join” is used. The other two Join methods are not very common.

2.2.1 Each Snapshot corresponds to a Hive dimension table. The generation process is as follows:

Read the values of each row and column sequentially from the original Hive dimension table;
All of these values are encoded in TrieDictionary mode (one value corresponds to one Id);
Read the values of each row in the original table again, replace the values of each column with the encoded Id, and get a new table with only Id.
Saving the new table together with the Dictionary object (a mapping between Id and value) will save the entire dimension table;

Kylin stores this data in a metadata database.

Here the Snapshot Table summarizes well, so the quotation address: https://juejin.cn/post/6844903697755947015Copy the code

2.2.2 Relationship diagram of snowflake model

2.3 Step 2: Qualification Dimension selection

Select the Dimensions that may be involved in calculation on the Dimensions page. The selected Dimensions are only those that have the qualification to be selected when Cube is constructed, not the Dimensions that will be involved in Cube construction at last. It is recommended to select all the fields in the dimension table. The Dimensions selection is shown below:

For KYLIN_SALES, where SLR_SEGMENT_CD, PRICE, and ITEM_COUNT are not selected, they are not eligible to participate in cubeId builds

In the case of KYLIN_CAL_DT, only some of the participating dimensions are selected, and the others are not eligible to participate in cubeId builds

For KYLIN_CATEGORY_GROUPING, only some participating dimensions are selected. Others are not eligible to participate in cubeId construction

For BUYER_ACCOUNT and SELLER_ACCOUNT, all are eligible to participate in cubeId construction

For BUYER_COUNTRY, only COUNTRY, NAME is eligible to participate in cubeId builds
For SELLER_COUNTRY, only COUNTRY, NAME is eligible to participate in cubeId construction

In the same way as above

2.3.1 Summary of qualification Dimension Selection Results:

2.4 Third Step: Measures Selection:

On the Measures page, select the Measures that you might use to calculate. In general, sales, flow, temperature and humidity, etc.

2.5 Step 4: Settings Settings

In the Settings page, you can set partitions and filter criteria. Partitions are designed for incremental build. Currently, Kylin supports date-based partitions. Then select the date format; After filtering conditions are set, Kylin will select the data that meets the filtering conditions for construction. A few points to note:

The time partition column can support date or more fine-grained time partitioning;
Time partition columns support the data type of the time/date/datetime/integer, etc.;
Filter criteria do not need to write WHERE;
Filter criteria cannot contain date dimensions.

Construct cube model

3.1 Knowledge summary of dimension selection

When selecting a dimension, each dimension column can be either Normal or Derived. Compared with the ordinary dimension, the derived dimension does not participate in the dimension Cuboid, and the foreign key (FK) corresponding to the derived dimension participates in the dimension Cuboid, thus reducing the number of CuBoids. At query time, queries on derived dimensions are first converted to queries on the dimension where the foreign key resides, resulting in a small (but acceptable) performance sacrifice.

3.1.1 Dimensional pruning optimization

To optimize dimensions, first make sure you set the cube dimensions that will be used in your queries.

At present, Kylin can use the following dimension optimization methods:

Gather group
Derivative of latitude
Compulsory dimension
Hierarchical dimension
The joint dimensions
Extended Column

In A multidimensional data set, the number of dimension determines the number of possible combinations between dimensions, and the size of each dimension member set determines the number of every possible combination, for example, there are three common dimension A, B, C, and their different member number 10/100/1000, respectively, then the combination of one dimension have 2 3 times, {empty, A, B, C, AB, BC, AC, ABC}, each member we call cuboID (combination of dimensions), and the number of member combinations of these sets is 1, 10, 100, 1000, 10*100, 100 1000, 101000, and 10*100 *1000 respectively. We call the different number of members in each dimension Cardinatily, and we want to avoid storing combinations of higher cardinatily dimensions.

In the above example, we can not cache BC and C cuboids. We can calculate the value of a combination of BC and C members by using the values of ABC members. This is a tradeoff between time and space. Four dimensions exist in Kylin to reduce the number of cuBoids, not whether each dimension is cached or not. Currently Kylin computes and stores all combinations of cuBoids. For normal dimensions, It can be seen from the above example that the number of CuBOids in N dimensions is 2 to the power of N, and the number of CuBoids can be reduced by setting some dimensions in Kylin. Of course, this requires users to have a good understanding of the dimensions they need and know what they may group by based on.

3.1.2 Mandatory dimension

This dimension means that it is carried in the group BY of each query. Setting a dimension to MANDATORY reduces the number of CuBoids by half, as shown below:

3.1.3 hierarchy dimension

This kind of dimension is the most common, especially in Mondrian, we often have roll-up and driller operations for multidimensional data, which requires hierarchical relationships among dimensions, such as country, province, city, year, quarter, month, etc. Hierarchical dimensions can also greatly reduce the number of Cuboids. The diagram below:

3.1.4 derived dimension

This dimension means a dimension that can be derived. One or more of the columns corresponding to this dimension can be one-to-one with the primary key of the dimension table. This dimension can greatly reduce the number of CuBOids, as shown in the figure below:

For example, if timeID is the primary key of the time dimension table, that is, the foreign key of the fact table, and the time is only accurate to day, then the year, month, and day columns can uniquely correspond to a time_id, and time_id is the foreign key of the fact table. So we can specify year, month, and day as a derived dimension, and we can actually store the combination of dimensions based on the value of timeID alone, but that requires that the group by we use in the query must specify all columns in the derived dimension set. Each Joint dimension consists of two or more dimensions. The dimensions within the Joint dimension either do not appear or must appear together. There should be no common dimension between different associations

3.1.5 Union dimensions

Union Dimensions: Treat several dimensions as one. Applicable scenarios:

1 allows you to set several dimensions that you are sure will be used together when querying as a union dimension.
2 Dimensions with a small cardinality can be set as a union dimension.
3 You can set several dimensions that are rarely used when querying as a union dimension.

Optimization result: If N dimensions are set as union dimensions, the number of cuBOids combined by these N dimensions will be reduced from 2 to the power of N to 1.

Application examples:

Suppose you create a Cube of transaction data that has many common dimensions, such as transaction date cal_DT, transaction city city, customer sex sex_id, payment type pay_type, etc. The analysis method commonly used by analysts is to obtain the different consumption preferences of male and female customers in different cities by aggregating according to transaction time, transaction place and customer gender, for example, aggregating transaction date cal_DT, transaction city and customer gender sex_id to group simultaneously. Aggregate group: [cal_dt, city, sex_id, pay_type] Union dimension: [cal_dt, city, sex_id]

Case 1:

SELECT cal_dt, city, sex_id, count(*) FROM table GROUP BY cal_dt, city, sex_idCopy the code

Case2:

If there is an infrequently used query:

SELECT cal_dt, city, count(*) FROM table GROUP BY cal_dt Compute the final result from the existing Cuboid.Copy the code

3.1.6 Particle size optimization

Granularity optimization corresponds to improving the concurrency of Cube, which is set to three attributes in the custom attributes that can improve the concurrency. 1. Kylin. Hbase. Region. The cut (use several partitions) 2. Kylin. Hbase. Region. Count. Min (minimum USES several partitions) 3. Kylin. Hbase. Region. Count. Max (at most) USES several partitions

Increasing the least used partition and decreasing the maximum used partition according to the corresponding situation can effectively increase the parallelism of the system.

3.1.7 RowKey optimization

Rowkeys: consists of dimensionally encoded values. ‘Dictionary’ is the default encoding; Dictionaries can only deal with dimensions of medium and low cardinality (less than 10 million); If the dimension cardinality is high (e.g., greater than 10 million), select “False” and enter an appropriate length for the dimension, usually the maximum length value for that column; If it exceeds the maximum value, it will be truncated. Note that without dictionary encoding, the cube size can be very large. You can drag and drop the dimension column to adjust its position in the RowKey; The column that precedes the RowKey can be used to significantly narrow the scope of the query. It is usually recommended to start with the Mandantory dimension, followed by the dimension that plays a big role in filtering (where conditions); If multiple columns will be used for filtering, place a high-cardinality dimension (such as user_id) before a low-cardinality dimension (such as age).

Kylin stores Cube in HBase in key-value mode. The HBase Key, also known as Rowkey, is composed of values of all dimensions. To store these values more efficiently, Kylin encodes and compresses them; Each dimension can select the appropriate Encoding, and Dictionary Encoding is used by default. The basic encoding types supported by the fields are as follows:

Dict: Works for most fields and is recommended by default, but can cause memory issues in the case of hyper-gay love;
Boolean: Applies to fields whose values are true, false, true, false, true, false, t, f, T, f, yes, no, yes, no, Y, n, y, n, n, 1, 0;
Integer: Indicates that the field value is an integer. The integer range is [-2^(8N-1), 2^(8N-1)].
Date: applies to date characters in the following formats: yyyyMMdd, YYYY-MM-DD, YYYY-MM-DD HH: MM :ss, and YYYY-MM-DD HH: MM :ss.SSS. The timestamp is truncated.
Time: applies to timestamp characters whose field value is [1970-01-01 00:00:00, 2038/01/19 03:14:07], and the ms part is ignored. Time encoding applies to time, datetime, timestamp and other types.
If N is smaller than the field length, field truncation occurs. If N is larger, RowKey length is too long and query performance deteriorates. This parameter is applicable only to VARCHAR or NVARCHAR.
Fixed_length_hex: Applies to hexadecimal characters, such as 1A2BFF or FF00FF, and requires one byte for every two characters. Only applies to VARCHAR or NVARCHAR types.
Similar to RowKey optimization in Hbase, dimensions used as filtering criteria may be placed in front of other dimensions during query. Frequently used dimensions and large cardinality dimensions are placed in front of each other
```
(refer to link: https://juejin.cn/post/6844903700800995336)Copy the code
```

3.2: Dimension selection of official cases

3.2.1 Problems encountered in KYLIN_SALES dimension selection

It is found that the join key is not selected, for example, LEAF_CATEG_ID and LSTG_SITE_ID are foreign keys. The derived foreign keys of PART_DT are also not selected. Therefore, when CUBE construction is implemented, only the computational dimension is concerned and the derived dimension still applies even if the join key is not selected:

3.2.2 KYLIN_CAL_DT dimension selection, it is found that the derived primary key of PART_DT is not checked:

3.2.3 KYLIN_CATEGORY_GROUPINGS Dimension Selection Problems

“LEAF_CATEG_ID” and “SITE_ID” are not applied. “USER_DEFINED_FIELD1” and “USER_DEFINED_FIELD3” are derived columns. Look at my chat with my friends:

3.2.3 Problems encountered in BUYER_ACCOUNT dimension selection

ACCOUNT_SELLER_LEVEL and ACCOUNT_CONTACT are not selected. ACCOUNT_COUNTRY is renamed to BUYER_COUNTRY.

3.2.4 Problems encountered when selecting the SELLER_ACCOUNT dimension

ACCOUNT_BUYER_LEVEL and ACCOUNT_CONTACT are not selected. Note that ACCOUNT_COUNTRY is renamed SELLER_COUNTRY.

3.2.5 Problems encountered in selecting BUYER_COUNTRY and SELLER_COUNTRY dimensions

The COUNTRY key is not selected and NAME is changed to BUYER_COUNTRY_NAME and SELLER_COUNTRY_NAME.

3.3 Official case measurement selection

The main aggregation methods are COUNT, SUM, MIN, MAX, and PERCENTILE. The other aggregation methods are described in detail below:

3.3.1 Count Distinct Theoretical knowledge:

Apache Kylin provides two Count Distinct calculation methods, one is approximate and the other is accurate. An accurate Count Distinct indicator consumes more resources (memory and storage) during Build, and the Build process is slow.

Apache Kylin uses the HyperLogLog algorithm to achieve an approximate Count Distinct, providing several precision choices with error rates ranging from 9.75% to 1.22%; Theoretically, the maximum result is only 64KB, and the lowest error rate is 1.22%. This implementation is used for Count Distinct indicator calculations that require fast calculation, save storage space, and accept error rates.

Exact Count Distinct Kylin implements the bitmap-based exact Count Distinct calculation method. When the data type is Tiny Int (Byte), Small Int (short), or int, the data value is directly mapped to the bitmap. When the data type is long, String, or whatever, you need to encode the data value as a dict as a string, and map the dictionary ID to a bitmap. The calculated result of the metric is not the counted value, but a bitmap containing the serialized value. In this way, you can ensure that the Count Distinct results are correct on any dimension. This implementation provides accurate, error-free Count Distinct results, but requires more storage resources, and should account for several hundred MEgabytes of storage if there are more than a million unduplicated values in the data.

3.3.2 EXTEND_COLUMN theoretical knowledge:

Extended Column In OLAP analysis scenarios, an ID is often filtered but the query result is displayed as name, such as user_id and user_name. There are three common solutions to this problem:

A. Set both ID and Name as dimensions. The query statement is similar to SELECT Name, count(*) from table where ID = 1 group by ID, Name. The problem with this method is that it will lead to the increase of dimensions, resulting in the expansion of predicted results;
B. Set id and name to dimensions, and set them to union. The advantage of this approach is that the number of dimension combinations does not increase, but other optimizations for the dimension are limited. For example, ID can no longer be set to mandatory or hierarchical dimensions.
C. Set the ID to a dimension, Name to a special Measure, and type to Extended Column. This method can not only ensure the need of filtering ID and querying name, but also does not affect the further optimization of ID dimension.

3.4 Count Distinct and TopN Official Case Description:

3.4.1 Count Distinct cases are used to remove duplicates from orders. The error rate must be less than a certain percentage

3.4.2 TopN case, mainly used for grouping according to seller_id, and then aggregating price to achieve the total sales amount of each order

3.5 Messures overview in cube

4 Refresh Setting of the official case

Auto Merge Thresholds: Automatic Merge Thresholds, seINTERFACES increase every 7 days. 7-day segments are merged every 28 days
Retention Threshold: defaults to 0 and retains all historical segments.
Volatile Range: defaults to 0. Auto Merge automatically merges all possible cube segments. When specified, Auto Merge will not Merge cube segments of the most recent Volatile Range; If Volatile Range is set to 7, cube segments generated in the last seven days are not automatically merged.
Partition Start Date: indicates the Start time of a Partition

4.1 Official Case Refresh Setting Enforces dimension Setting

The mandatory dimension of repeated confirmation with friends, here the official case really appears, in the query, the PART_DT does not appear is also ok, why? Please refer to the explanation below

Mandatory dimensions are those that always appear in Where conditions or Group By statements. Of course, it must exist and may not appear explicitly in the query statement. For example, the query date is a necessary field, and month, quarter and year are its derivative fields. Then the occurrence of month, quarter and year in the query is equivalent to the occurrence of the necessary field of the query date.

4.2 Official case Refresh Setting Hierarchical dimension Setting

It can be seen that the official case is constructed according to the hierarchical order of cubeId, and pruned according to the binding relationship

4.3 Official Case Refresh Setting Setting of joint dimensions

According to the binding relationship, the pruning process was carried out

4.4 Setting of Rokeys

The order of each dimension in Rowkeys has a significant impact on query performance. Here the user can adjust the order of dimensions on Rowkeys by dragging and dropping, depending on the query pattern and custom. The recommended order is Mandatory, where filter conditions occur frequently, high cardinality, and low cardinality. In this way, filtering conditions are fully utilized to narrow the scan scope in HBase, improving query efficiency.

We find that the infrequently used dimension comes later:

5 Official Case overview

A total of 20 dimensions and 6 metrics were found to be used, along with 6 Lookup tables (containing unique tables)

6. Official case construction CUBE

7. Official case test

In Cube construction, I did not select join key and derived dimension (primary key). Let’s wait and see if they can be used normally. Here is the content of dimension pruning optimization, based on which, let’s test:

7.1 Test one: Does the Joint Dimension report an error when it lacks partners?

7.1.1 Joint query of order Fact Table and Order Category Table:

7.1.2 Conclusion: The joint dimension does not report errors

7.2 Test 2: Can derived dimensions be grouped and conditionally filtered properly?

7.2.1 Grouping of derived fields in order Fact table and Time Dimension table:

7.2.2 Conclusion: There is no error in derived dimension grouping

7.2.2 Derivative field grouping + filtering for order Fact table and Time Dimension table:

7.2.3 Conclusion: Derivative dimension filtering does not report errors

7.3 Test 3: Mandatory Dimensions Is an error reported?

If the forced dimension is the join key of the table where the derived dimension resides (i.e. the join key CAL_DT between KYLIN_SALES and KYLIN_CAL_DT), then it would not be good to not include the forced dimension in the query, as can be seen from 7.2.1.

7.3.1 Conclusion: Kylin_sales.part_dt is missing without an error, although it is set to a mandatory dimension

7.4 Test 4: Can an error be reported if the hierarchy dimension is out of order?

It can be seen that no error will be reported if the hierarchy dimension order is arbitrarily changed. In principle, optimization has been done:

7.4.1 Conclusion: No error will be reported

It should be understood that the hierarchical dimensions ABC, AB and A contain all the combinations of dimensions, so there is no need for other combinations, BC is actually checked by ABC, so Kylin reduces the number of cubeids by setting the hierarchical dimensions.

8 epilogue

At this point, the whole official case analysis finished, and left us thinking and when to end? As a pioneer of OLAP technology, Kylin really brings HBASE’s columnar storage function to the extreme. Without HBASE and SPARK as technical support, what would Kylin be? If I had the opportunity, I would write a technical column on HBASE to see how HBASE implements the idea of indexing-as-data based on LOG-structured Merge Tree (LSM). For those interested in Spark, please check out my Spark technical architecture anatomy column.

Qin Kaixin was born in Shenzhen on October 30, 2018