

The Apache Kylin system can be divided into two parts: online query and offline build. The technical architecture is shown in the figure: the online query modules sit mainly in the upper half, while offline build sits in the lower half.

Offline build

Let’s first look at the offline build part.

As the figure shows, the data sources on the left, mainly Hadoop/Hive/Kafka/RDBMS, hold the user data to be analyzed.

Based on the metadata definition, the build engine below extracts the data from the data source and builds the Cube.

The input data takes the form of relational tables and must conform to a star schema.
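
To make the star-schema requirement concrete, here is a minimal PySpark sketch; all table and column names (fact_sales, dim_date, and so on) are hypothetical.

```python
# Minimal star-schema sketch: one fact table joined to its dimension
# tables on foreign keys. All table/column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-demo").getOrCreate()

# Fact table: one row per sale, with foreign keys into the dimensions.
fact_sales = spark.createDataFrame(
    [(1, 10, 100, 2, 19.98), (2, 10, 101, 1, 5.49)],
    ["sale_id", "date_id", "product_id", "quantity", "amount"])

# Dimension tables: descriptive attributes keyed by those foreign keys.
dim_date = spark.createDataFrame(
    [(10, "2021-08-16", "Q3")], ["date_id", "cal_date", "quarter"])
dim_product = spark.createDataFrame(
    [(100, "Book"), (101, "Pen")], ["product_id", "category"])

# In a star schema, queries join the fact table directly to each dimension.
star = fact_sales.join(dim_date, "date_id").join(dim_product, "product_id")
star.groupBy("quarter", "category").sum("amount").show()
```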

MapReduce and Spark are the primary build technologies, and Spark is the only build engine in Kylin 4.0.

The built Cube is stored in the storage engine on the right; Kylin 4.0 uses Parquet as the storage format.
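
Conceptually, a Cube build boils down to precomputing aggregates for every combination of dimensions (the cuboids) and persisting each result. Continuing from the `star` DataFrame in the sketch above, here is a rough illustration; it is not Kylin's actual build code, and the output paths are hypothetical.

```python
# Rough sketch of a Cube build (not Kylin's actual code): aggregate the
# joined star-schema data for every combination of dimensions and write
# each cuboid out as Parquet.
from itertools import combinations

dims = ["quarter", "category"]   # dimension columns of the model
measures = {"amount": "sum"}     # measure column -> aggregate function

for r in range(len(dims) + 1):
    for cuboid in combinations(dims, r):
        # The empty combination yields the grand-total cuboid.
        agg = star.groupBy(*cuboid).agg(measures)
        # One Parquet directory per cuboid; online queries later read
        # these precomputed files instead of the raw source tables.
        agg.write.mode("overwrite").parquet(
            "/cube/sales/cuboid_" + ("_".join(cuboid) or "total"))
```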

In 3.x and earlier releases, Kylin used HBase as the storage engine for the precomputed results of Cube builds. As a column-family-oriented database on HDFS, HBase offers excellent query performance. However, it still has the following disadvantages:

  1. HBase is not true columnar storage;
  2. HBase has no secondary indexes; the rowkey is the only index;
  3. HBase does not encode stored data, so Kylin has to encode the data itself;
  4. HBase is not suitable for cloud deployment and automatic scaling;
  5. The APIs of different HBase versions (for example 0.98, 1.0, 1.1, and 2.0) are incompatible;
  6. There are compatibility issues between HBase versions from different vendors.

To solve the above problems, the community proposed replacing HBase with Apache Parquet + Spark, for the following reasons:

  1. Parquet is an open-source, mature, and stable columnar storage format;
  2. Parquet is more cloud-friendly and works with various file systems, including HDFS, S3, Azure Blob Storage, and Ali OSS;
  3. Parquet integrates well with Hadoop, Hive, Spark, Impala, and more;
  4. Parquet supports custom indexes.
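
As a quick illustration of the first two points, reading a cuboid back from Parquet lets Spark prune columns and push filters down to the scan. The path below is hypothetical, and the same code works against HDFS, S3, and other file systems by switching the URI scheme (hdfs://, s3a://, and so on).

```python
# Reading a precomputed cuboid back from Parquet (hypothetical path).
# Because Parquet is columnar, Spark decodes only the referenced columns
# and can skip whole row groups using the pushed-down filter.
df = spark.read.parquet("/cube/sales/cuboid_quarter_category")
(df.filter(df.quarter == "Q3")            # predicate pushdown
   .select("category", "`sum(amount)`")   # backticks quote the generated name
   .show())
```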

Online query

After the offline build is complete, users can send SQL to the system from the top for query analysis.

Kylin provides REST APIs as well as JDBC/ODBC interfaces.
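
For instance, a query can be submitted over plain HTTP. The sketch below assumes a local Kylin instance with the default account; the exact endpoint and response fields may vary by Kylin version, so verify against your deployment.

```python
# Submitting SQL through Kylin's REST API (host, project name, and
# credentials are placeholders; verify against your Kylin version).
import requests

resp = requests.post(
    "http://localhost:7070/kylin/api/query",
    json={
        "sql": "SELECT quarter, SUM(amount) FROM sales GROUP BY quarter",
        "project": "demo_project",
        "limit": 100,
    },
    auth=("ADMIN", "KYLIN"),  # Kylin's default account; change in production
)
resp.raise_for_status()
print(resp.json()["results"])  # rows of the query result
```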

No matter which interface the SQL enters through, it eventually reaches the REST service layer and is handed to the query engine for processing.

It is important to note that SQL statements are written against the relational model of the data source, not against the Cube.

Kylin was deliberately designed to shield query users from the concept of the Cube. Analysts only need to understand a simple relational model to use Kylin; there is no extra learning barrier, and traditional SQL applications are easy to migrate.

The query engine parses the SQL, generates a logical execution plan based on the relational tables, translates it into a physical execution plan based on the Cube, and answers the query from the precomputed Cube.

The entire process does not access the original data source.
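
A toy illustration of that translation (not Kylin's actual planner): the GROUP BY dimensions of the incoming SQL are matched to a precomputed cuboid, which is then read directly.

```python
# Toy version of the logical-to-physical translation (not Kylin's real
# planner): map the GROUP BY dimensions of a query onto the matching
# precomputed cuboid and answer from Parquet, never from the source data.
def answer_from_cube(spark, group_by_dims, cube_root="/cube/sales"):
    # Dimensions are assumed to be given in the model's canonical order;
    # a cuboid that exactly covers them can be returned as-is, while a
    # superset cuboid would need one more re-aggregation step.
    name = "_".join(group_by_dims) or "total"
    return spark.read.parquet(f"{cube_root}/cuboid_{name}")

# "SELECT quarter, SUM(amount) ... GROUP BY quarter" resolves to the
# (quarter) cuboid; the raw fact table is never touched.
answer_from_cube(spark, ["quarter"]).show()
```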

Extensible architecture for Apache Kylin version 1.5

Version 1.5 of Apache Kylin introduces the concept of “extensible architecture”.

Extensibility means that Kylin can extend or replace any of the three modules it relies on.

Kylin’s three dependency modules are the data source, build engine, and storage engine.

In the original design, as a member of the Hadoop family, these three were Hive, MapReduce, and HBase, respectively.

But as adoption grew and usage deepened, some users gradually found these defaults inadequate.

For example, real-time analytics may want to import data from Kafka rather than Hive; Spark's rapid rise forced us to consider replacing MapReduce with Spark to dramatically speed up Cube builds; and HBase's read performance may not match that of Cassandra or Kudu.

As you can see, the question of whether one technology can be replaced with another has become a common one.

Therefore, the system architecture of Kylin 1.5 was refactored to abstract the three dependencies (data source, build engine, and storage engine) into interfaces.

Advanced users can develop against these interfaces as needed, replacing one or more of the modules with a technology better suited to their scenario.
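
Kylin's actual extension points are Java interfaces, but the idea can be sketched in a few lines of Python; all names below are illustrative, not Kylin's API.

```python
# Illustrative-only sketch of the pluggable architecture (these names are
# not Kylin's actual Java interfaces): each dependency hides behind an
# interface, so implementations can be swapped independently.
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def read_table(self, name): ...           # e.g. Hive, Kafka, an RDBMS

class BuildEngine(ABC):
    @abstractmethod
    def build_cube(self, source, model): ...  # e.g. MapReduce or Spark

class StorageEngine(ABC):
    @abstractmethod
    def save_cube(self, cube): ...            # e.g. HBase or Parquet
    @abstractmethod
    def query(self, request): ...

# Swapping a module then becomes a configuration choice rather than a
# rewrite, e.g. a Kafka source plus the Spark engine for real-time analytics.
```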