This article has participated in the activity of “New person creation Ceremony”, and started the road of digging gold creation together.

background

In the field of open source big data, the earliest big data engines are mainly MapReduce computing engines. MapReduce uses distributed computing to process data in batches. The data throughput is high, but the data processing performance is low. The performance is affected by the multiple I/O drops during data processing, and the scheduling and startup costs of MapReduce jobs are high. MR is not suitable for interactive ad-hoc query analysis under such circumstances. Also Google launched a new interactive Dremel calculation engine, it USES the MPP architecture (massively parallel processing structure) using multilayer query tree, the task can be executed in parallel on the thousands of nodes and polymerization as a result, the other is to implement the nested columns of data storage, avoid unnecessary data read, greatly reduces the network and disk I/o overhead, The interactive query performance is significantly improved.

Classification of interactive engines

Open source big data OLAP (Online Analytical Processing) component, there are a variety of implementation methods, depending on the way of storing data, It can be divided into MOLAP (Relational OLAP), ROLAP(Relational OLAP) and HOLAP (Hybird).

  • MOLAP is based on OLAP implementation of multi-dimensional data organization. It uses multi-dimensional array to store data with multi-dimensional data organization as its core. Multidimensional data forms a “data Cube” structure in the storage system, optimizes the data storage, and performs partial prediction, so the query performance is the highest. The main disadvantages are large storage space and delay in data update.
  • ROLAP is based on OLAP implementation of relational data and uses relational structure to represent and store multidimensional data. ROLAP has the advantage of obtaining the latest data updates from source data in real time to maintain real-time data, but has the disadvantage of lower computing efficiency and longer waiting time compared with MOLAP. It can also be subdivided into MPP database and SQL engine. The SQL engine can be further subdivided into SQL engine based on MPP architecture and SQL engine based on general computing framework.
  • HOLAP OLAP implementation based on mixed data organization is actually a combination of the above two types of engines. Generally speaking, analysis that is not commonly used and needs flexible definition can be implemented by ROLAP, while common and conventional models can be implemented by MOLAP.

An MPP database is a complete database, such as Greenplum and ClickHouse, into which data usually needs to be imported for OLAP functionality. The MPP database can optimize the data distribution when the data is entered into the database. Although the efficiency of the entry is reduced to a certain extent, it is of great help to improve the query performance in the later period. MPP database can provide flexible AD hoc query capabilities, but generally has a certain limit on the amount of query data, cannot support particularly large data volume query.

The SQL engine only supports SQL execution and does not store data. It can interconnect with multiple data stores, such as HDFS, HBase, and MySQL. Some also support federated query capabilities, which can be used for joint analysis of multiple heterogeneous data sources. SQL engine, SQL engine based on MPP architecture, such as Impala, Hawq, Presto, Drill general online query scenarios have special optimization, so the end-to-end query performance is generally higher than SQL engine based on general computing framework; However, it is inferior to SQL engines based on common computing frameworks (Hive and Spark SQL) in fault tolerance and data volume.

In summary, no OLAP system can handle all three aspects of scale, flexibility and performance perfectly, and users need to make trade-offs and choices based on their own needs.

Interactive engine comparison

OLAP engine comparison for common MPP architectures

Noun explanation

Massively Parallel Processing (MPP) : it is a Massively Parallel Processing structure that consists of multiple compute nodes. Each node accesses only its own resources. The CPU of each node cannot access the memory of another node.

HTAP (Hybrid Transactional/Analytical Processing) databases are designed to deal with delays of several minutes or even hours between OLAP and OLTP systems. The consistency between OLAP database and OLTP database cannot be guaranteed, which is difficult to meet the requirements of real-time analysis in business scenarios. Enterprises need to maintain different databases to support two different types of tasks, which is costly to manage and maintain.

Reference documentation

Cloud.tencent.com/developer/n…

www.jianshu.com/p/b23800d9b…

www.clickhouse.com.cn/topic/5c453…

www.jianshu.com/p/4b3bcbaba…

Jishuin.proginn.com/p/763bfbd3a…