This is the 25th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

An overview,

Apache Kylin, a MOLAP data analysis engine.

Originated from eBay’s R&d center in China and contributed to Apache Foundation, the core development team of Apache Kylin has now set up Kyligence (Kylin Intelligence). It is worth mentioning that Apache Kylin is the first Top-level Apache project led by Chinese. The traditional data warehouse and business intelligence platforms used by eBay have encountered bottlenecks. Although Hadoop platform can process large-scale data in batches, it cannot provide efficient data interaction analysis. So Kylin was incubated by eBay.

Kylin provides second-level responses for Multidimensional Data analysis (MOLAP). Kylin features:

  • Data source and model: Hive and Kafka are supported

  • Build engine: MapReduce computing engine is supported in the early stage, and Spark and Flink computing engine are supported in the new version. In addition to full builds, incremental builds are supported for time-based partitioning features

  • Storage engine: The Cube is stored in HBase as a key-value, and Rowkey is optimized to speed up query. The permutations and combinations of each dimension are stored in a materialized view called Cuboid

  • Optimization algorithm: Cube itself trades space for time. It also prunes and optimizes some redundant Cuboids according to the algorithm to seek balance

  • Access interface: support standard SQL interface, can connect to Zeppelin, Tableau and other BI tools. SQL can be routed to the corresponding Cuboid through the query engine.

(1) Application scenarios

Typical application scenarios of Kylin are as follows:

  • Huge amount of data, a single data source table hundreds of billions of rows of data level, and a single data source up to 100 TB level

  • Heavy query pressure (high concurrency of queries)

  • Quick response to queries

  • The flexible downstream query mode supports SQL queries with complex conditions

The core idea of Kylin is predictive computation, which calculates all possible query results in advance according to specified dimensions and indicators, and uses space for time to accelerate OLAP query with fixed mode.

(2) Basic terms

Data warehouse is a kind of information system data storage theory, which emphasizes the use of some special data storage methods, so that the contained data is particularly conducive to analysis and processing, thus generating valuable information, and can make decisions based on this.

The data stored in the way of data warehouse has the characteristic that once stored, it will not change with time. In addition, the data stored must contain the time attribute. Usually, a data warehouse will contain a large amount of historical data, and it can use specific analysis methods to dig out specific information from them.

OLTP: Online transaction processing, application of traditional relational database.

OLAP (Online Analytical Process), Online Analytical processing, to analyze data in a multidimensional way. It is a method of presenting integrated decision information, most commonly used in data warehousing or business intelligence. Its main function is to facilitate large-scale data analysis and statistical calculation, which can provide reference and support for decision-making. And the difference is online transaction processing (OLTP), online transaction processing, focusing on the basic, daily transaction processing, mainly is the increase, deletion, change and check of data.

The concept of OLAP has two different ways of understanding in the broad and narrow sense. The broad meaning is the same as the literal meaning, generally refers to any analysis that does not update the data. But more often OLAP is understood in its narrow sense, that is, it is related to multidimensional analysis, based on Cube calculations.

OLAP can be divided into ROLAP, MOLAP and HOLAP according to the different ways of storing data.

  1. Relational OLAP(ROLAP), detailed data, aggregated data are stored in the class Relational database. Hive and SparkSQL belong to ROLAP.

  2. MOLAP(OLAP) : the summarized data will be computed in advance and stored in our own special Multidimensional database. Users’ OLAP operations can be directly mapped to the Multidimensional database access without SQL access. In essence, it is space for time. Apache Kylin is essentially a MOLAP.

  3. HOLAP(Hybrid OLAP) : Hybrid OLAP implementation based on Hybrid data organization. For example, the lower level is relational, and the higher level is multidimensional matrix. This approach offers greater flexibility.

1) Fact tables and dimension tables

Fact Table: a Table that stores factual records, such as system logs, sales records, sensor values, etc.

The records of a fact table grow dynamically, so it is usually much larger than a dimension table.

Dimension Table: A Dimension Table, also called a Lookup Table, is a corresponding Table to a fact Table.

It holds the attribute values of the dimension and can be associated with the fact table; It is equivalent to extracting and standardizing frequently repeated attributes from the fact table and managing them in a single table.

Common dimension tables include date table (stores attributes corresponding to the date, such as week, month, and quarter) and region table (contains attributes such as country, province, state, and city). Dimension tables usually don’t change very much.

There are many benefits to using dimension tables:

  • Reduced the size of the fact table

  • Facilitate dimension management and maintenance, add, delete, and modify dimension attributes without changing a large number of records in the fact table

  • Dimension tables can be reused for multiple fact tables

2) Dimensions and metrics

Dimension: The perspective from which data is viewed. It is usually an attribute of a data record, such as time, place, etc.

Metrics: Aggregated statistics, which are the results of aggregation operations. Usually a number, such as total sales, different users, etc.

Analysts often review measurements in combination with several dimensions in order to find patterns of change.

In an SQL query, the attributes of Group By are usually dimensions, and the calculated values are measures.

SELECT part_dt, lstg_site_id, sum(price) as total_selled, count(distinct seller_id) as sellers
FROM kylin_sales
GROUP BY part_dt, lstg_site_id;
Copy the code

In the preceding query, part_dt and lSTg_site_id are dimensions, and sum(price) and count(distinct seller_id) are metrics.

3) Star model and snowflake model

Star Schema: It is one of the commonly used data models in dimensional modeling of data warehouse.

It is characterized by a fact table, and one to more dimension tables, the fact table is associated with the dimension table by a primary foreign key, there is no correlation between the dimension tables, like many small stars around a star, hence the name star model.

SnowFlake Schema: Some dimension tables in the star model are extracted into finer granularity dimension tables, and then the correlation between dimension tables is made. This kind of model is called the SnowFlake model.

Earlier versions of Kylin only support star models, and from 2.0 onwards support snowflake models.

4)CubeCuboid

A Cube is a multidimensional Cube, also known as a data Cube.

This is an OLAP cube composed of three dimensions (the number of dimensions can be more than three, the figure above is just for the sake of illustration). The cube contains the values of the cells (subcubes) that meet the conditions. These cells contain the data to be analyzed, which are called measurements.

  • Cube: A multidimensional space constructed from dimensions that contains all the basic data to be analyzed. All aggregate data operations are performed on the cube

  • Dimension: The Angle from which data is viewed. It’s usually a discrete set of values. For N dimensions, there are 2 to the N possible combinations

  • Measure: the result of an aggregate calculation, usually a continuous value

  • Cuboid: Data calculated in a particular combination of dimensions in Kylin

  • A field in the fact table that is either a dimension or a measure (which can be aggregated)

  • Given a data model, you can combine all the dimensions on it. For N dimensions, there are 2 to the N possible combinations

  • Cube (or Data Cube), namely Data Cube, is a technology commonly used for Data analysis and index. It can establish multi-dimensional index of original Data and greatly speed up the query efficiency. A data cube is just a fancy word for a multidimensional model.

  • Cuboid refers specifically to the data calculated in a particular combination of dimensions in Kylin.

The combination diagram is as follows:

Second,KylinThe technical architecture

Apache Kylin system can be divided into two parts: online query and offline construction.

The online query pattern is mainly in the top half, and the offline build is in the bottom half.

Kylin technology architecture is as follows:

  • The data source is Hadoop Hive. The data is input in the form of relational tables, which store the data to be analyzed. Based on the definition of metadata, the build engine extracts data from the data source and builds cubes

  • Kylin can use MapReduce or Spark as a build engine. The Cube is stored in the storage engine on the right. HBase is generally used as the storage

  • After offline construction is completed, users can send SQL from the query system for query analysis

  • Kylin provides a variety of Rest apis, JDBC/ODBC interfaces. No matter which interface it enters from, the SQL ends up in the Rest service layer and is handed over to the query engine for processing

  • SQL statements are written based on the relational model of the data source, not Cube:

    • Kylin deliberately hid the concept of Cube from query users in his design

    • You only need to understand the relational model to use Kylin, there is no additional threshold to learn, and traditional SQL applications are easy to migrate

    • The query engine parses the SQL, generates a logical execution plan based on relational tables, transforms it into a Cube based physical execution plan, and finally queries the resulting Cube without accessing the original data source

Component functions:

  • REST Server: Provides Restful interfaces, such as Cube operations such as creation, construction, refresh and merge, Kylin Projects, Tables and other metadata management, user access control, SQL query and so on.

  • Query Engine: SQL parsing is implemented using the open source Apache Calcite framework, which can be understood as the SQL Engine layer

  • Routing: Converts the execution plan generated by parsing SQL into Cube cached queries, which can be completed in seconds or even milliseconds

  • Metadata: Kylin contains a large amount of metadata information, including Cube definition, star model definition, Job and Job execution output information, and model dimension information, etc. Kylin’s metadata and Cube are stored in HBase in json string format

  • Cube Build Engine: It is the basis of all modules. It is mainly responsible for Cube creation in Kylin’s calculation. The process of Cube creation is to read original data through Hive first, then generate Htable through some MapReduce or Spark calculation, and finally load data into HBase table.

The working principle of

Apache Kylin works by making Cube predictions of data models and using the results to speed up queries.

The specific working process is as follows:

  • Specify the data model and define dimensions and measures

  • Calculate Cube, calculate all CuBOids and save them as materialized views (stored in HBase)

  • When the query is executed, the Cuboid is read, the query result is computed and produced

Efficient OLAP analysis:

  • Kylin’s query process does not scan the original records, but performs complex operations such as association and aggregation of tables through precomputation

  • Using the results of the predicted calculation to execute the query, compared with the non-predicted calculation query technology, its speed is generally one to two orders of magnitude faster, more obvious advantages in large data sets

  • With data sets in the hundreds of billions or even trillions, Kylin can outperform other algorithms by a factor of 1,000

Kylinecological

As shown in figure:

  • Apache Kylin core: The OALP engine of Kylin consists of metadata engine, query engine, task engine and storage engine. In addition, it has a REST server to serve query requests externally

  • Extensibility: Provide plug-in mechanisms to support additional features and functions

  • Integration with other systems: task scheduler, ETL tools, monitoring and alarm systems can be integrated

  • Drivers: Provides integration of ODBC and JDBC driver support with other tools (such as Tableau)