Introduction

The first article in this series, “The Coming of the Age of Data Intelligence: Essence and Technical System Requirements”, outlined our understanding of data intelligence and introduced the corresponding core technical system requirements:

Data intelligence treats data as a means of production: by combining massive-scale data processing, data mining, machine learning, human-computer interaction, and visualization technologies, it extracts and discovers knowledge from large amounts of data, provides effective intelligent support for people making data-driven decisions, and reduces or eliminates uncertainty.

From this definition, the technical system for data intelligence needs to cover at least the following aspects, as shown in the figure below:

▲ Composition of the data intelligence technology system

Among these, data asset governance, data quality assurance, and secure computing under data intelligence will each be elaborated in later articles in this series.

However, in recent practical work I found that people run into real difficulties when applying multi-dimensional data analysis to business problems, especially when choosing the underlying system; after all, not many companies have the resources to let you experiment.

So I studied the topic together with my team, drew on some external material, and wrote this second article in the series, centered on the selection method for multi-dimensional analysis systems, for your reference, in the hope of shortening your decision time.

Main content

Considerations for an analytics system

Everyone is familiar with CAP theory: you cannot have all three at once, so trade-offs have to be made. An analytics system faces a similar trade-off among three elements: data volume, flexibility, and performance.

▲ The three elements an analytics system weighs

With fixed resources, some systems cannot meet processing requirements once the data volume reaches a certain scale, such as the petabyte level, even for a simple analysis requirement.

Flexibility mainly refers to how freely the data can be manipulated. For example, the average analyst prefers to work with SQL, which imposes few constraints, whereas a domain-specific language (DSL) is comparatively limited. Another aspect is whether operations are constrained by preconditions, such as whether flexible ad-hoc queries across multiple dimensions are supported. The last element is performance: whether the system can sustain concurrent operations and respond within seconds.
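To make the SQL-versus-DSL contrast concrete, here is a minimal sketch: the same ad-hoc aggregation written once as plain SQL and once as an Elasticsearch-style DSL payload expressed as a Python dict. The table and field names (sales, region, channel, revenue, dt) are hypothetical.

```python
# The same ad-hoc aggregation, expressed two ways.
# Table and field names (sales, region, channel, revenue, dt) are hypothetical.

# 1) SQL: an analyst can group by arbitrary dimensions on the fly.
sql = """
SELECT region, channel, SUM(revenue) AS total
FROM sales
WHERE dt BETWEEN '2019-01-01' AND '2019-01-31'
GROUP BY region, channel
"""

# 2) Elasticsearch-style DSL: the same intent, but nested and engine-specific.
dsl = {
    "size": 0,
    "query": {"range": {"dt": {"gte": "2019-01-01", "lte": "2019-01-31"}}},
    "aggs": {
        "by_region": {
            "terms": {"field": "region"},
            "aggs": {
                "by_channel": {
                    "terms": {"field": "channel"},
                    "aggs": {"total": {"sum": {"field": "revenue"}}},
                }
            },
        }
    },
}
```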

Analysis of the data query process

When running an aggregate query over data, the following three steps are generally involved:

▲ Real-time query process

First, an index is used to retrieve the row numbers or index positions of the matching data; the system must be able to quickly filter hundreds of thousands or millions of rows out of hundreds of millions. This is where search engines excel, whereas relational databases are only good at using indexes to retrieve small, precisely targeted result sets.

Then the concrete data is loaded from primary storage by row number or position; the system must be able to quickly bring the filtered rows, up to tens of millions of them, into memory. Analytical databases excel here, because they typically use columnar storage, and some use mmap to speed up data access.

Finally, distributed computation produces the final result set from this data according to the GROUP BY and SELECT clauses. This is where big data compute engines such as Spark and Hadoop excel.
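A minimal single-process sketch of these three steps, assuming a toy inverted index, a toy columnar store, and an in-memory GROUP BY; no real engine's API is used:

```python
from collections import defaultdict

# Toy columnar store: one list per column, aligned by row number.
columns = {
    "city":   ["bj", "sh", "bj", "sz", "sh"],
    "amount": [10,   20,   30,   40,   50],
}

# Toy inverted index: value -> row numbers (used in step 1).
index = defaultdict(list)
for row_no, city in enumerate(columns["city"]):
    index[city].append(row_no)

def query(cities):
    # Step 1: use the index to find matching row numbers.
    row_nos = [r for c in cities for r in index[c]]
    # Step 2: load only the needed columns for those rows (columnar read).
    rows = [(columns["city"][r], columns["amount"][r]) for r in row_nos]
    # Step 3: aggregate, i.e. GROUP BY city, SUM(amount).
    result = defaultdict(int)
    for city, amount in rows:
        result[city] += amount
    return dict(result)

print(query(["bj", "sh"]))  # {'bj': 40, 'sh': 70}
```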

Comparison and analysis of architectures

Combining the two aspects above, there are mainly three types of architecture:

• MPP (Massively Parallel Processing) architecture
• Search-engine-based architecture
• Pre-computation architecture

MPP architecture

Traditional RDBMSs have an absolute advantage in ACID. In the era of big data, if most of your data is still structured and the volume is not that large, you do not have to adopt a Hadoop-style platform; a distributed architecture can keep up with the growth in data scale and meet analysis requirements, while still letting you operate with familiar SQL.

That architecture is MPP: Massively Parallel Processing.

Of course, MPP is really just an architecture. The underlying engine is not necessarily an RDBMS; it can be a distributed query engine (query planner, query coordinator, query execution engine, and so on) built on top of Hadoop infrastructure, one that does not fall back on MapReduce-style batch processing.

Systems under this architecture include Greenplum, Impala, Drill, Shark, and others. Greenplum (commonly abbreviated to GP) uses PostgreSQL as its underlying database engine.
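Because Greenplum is built on PostgreSQL, standard PostgreSQL tooling can talk to it. A minimal sketch using Python's psycopg2 driver; the connection parameters and the sales table are hypothetical:

```python
import psycopg2

# Greenplum speaks the PostgreSQL protocol, so a standard Postgres driver works.
# Connection parameters and the "sales" table below are hypothetical.
conn = psycopg2.connect(
    host="gp-master.example.com", port=5432,
    dbname="analytics", user="analyst", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT region, SUM(revenue) AS total
        FROM sales
        WHERE dt >= %s
        GROUP BY region
        ORDER BY total DESC
        """,
        ("2019-01-01",),
    )
    for region, total in cur.fetchall():
        print(region, total)
conn.close()
```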

Search-engine-based architecture

Compared with MPP systems, a search-engine-based architecture converts the data (documents) into an inverted index at ingestion time, building the index with the three-level structure of term index, term dictionary, and posting list, and using compression techniques to save space.

The data (documents) is distributed across nodes by rules such as hashing the document ID. At retrieval time the Scatter-Gather computation model is used: each node processes its own share, and the results are then gathered at the node that initiated the search for final aggregation.
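A minimal single-process sketch of the Scatter-Gather model, assuming illustrative documents and hash routing by document ID: each simulated node produces a partial aggregate, and the initiating node merges them:

```python
from collections import defaultdict

# Illustrative documents, routed to "nodes" by hashing the doc id.
docs = [{"id": i, "city": c, "amount": a}
        for i, (c, a) in enumerate([("bj", 10), ("sh", 20), ("bj", 30),
                                    ("sz", 40), ("sh", 50), ("bj", 60)])]
NODES = 3
shards = defaultdict(list)
for doc in docs:
    shards[doc["id"] % NODES].append(doc)   # hash routing

def scatter(shard):
    """Each node aggregates its own shard: partial SUM(amount) per city."""
    partial = defaultdict(int)
    for doc in shard:
        partial[doc["city"]] += doc["amount"]
    return partial

def gather(partials):
    """The initiating node merges the partial results into the final answer."""
    final = defaultdict(int)
    for partial in partials:
        for city, amount in partial.items():
            final[city] += amount
    return dict(final)

print(gather(scatter(s) for s in shards.values()))
# {'bj': 100, 'sz': 40, 'sh': 70}
```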

The main systems under this architecture are Elasticsearch and Solr, which are generally operated through a DSL.

Pre-computation architecture

A system like Apache Kylin uses a pre-computation architecture. It pre-aggregates data at ingestion time: a model is established in advance, and the data is pre-processed into a “materialized view”, or data cube. Most of the data processing is thus completed before the query stage, and a query amounts to secondary processing of results that have already been aggregated.
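A minimal sketch of the data-cube idea, assuming two illustrative dimensions and one measure: every combination of dimensions (each “cuboid”) is aggregated at load time, so a query becomes a dictionary lookup:

```python
from itertools import combinations
from collections import defaultdict

rows = [  # illustrative fact rows: (city, channel, revenue)
    ("bj", "app", 10), ("bj", "web", 20),
    ("sh", "app", 30), ("sh", "web", 40),
]
DIMS = ("city", "channel")

# Build the cube: one aggregate per subset of dimensions (each "cuboid").
cube = defaultdict(int)
for city, channel, revenue in rows:
    values = {"city": city, "channel": channel}
    for r in range(len(DIMS) + 1):
        for dims in combinations(DIMS, r):
            key = tuple((d, values[d]) for d in dims)
            cube[key] += revenue

# Queries are now lookups; the raw rows are never scanned again.
print(cube[()])                                    # grand total: 100
print(cube[(("city", "sh"),)])                     # SUM WHERE city='sh': 70
print(cube[(("city", "bj"), ("channel", "web"))])  # bj + web: 20
```

With d dimensions there are 2^d cuboids, which is one way to see why changing a Kylin model is costly: the whole cube generally has to be rebuilt.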

The main systems under this architecture are Kylin and Druid. Although both are pre-computation architectures, there are a number of differences between them.

Kylin pre-computes using the cube approach (and supports SQL). Once the model is determined, modifying it is relatively costly, and basically the whole cube has to be recomputed. Moreover, pre-computation is not carried out continuously but according to a certain strategy, which also limits its suitability for real-time data queries.

Druid, on the other hand, is better suited to real-time computation and ad-hoc queries (SQL is not yet supported). It uses bitmaps as its primary indexing method, so it can filter and process data quickly, but for more complex queries its performance is worse than Kylin's.
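A minimal sketch of bitmap indexing, assuming an illustrative dimension column: one bitmap per distinct value, with a filter answered by bitwise OR/AND instead of scanning rows (real systems use compressed bitmaps rather than Python integers):

```python
rows = ["bj", "sh", "bj", "sz", "sh", "bj"]  # a "city" dimension column

# One bitmap per distinct value; bit i is set if row i has that value.
bitmaps = {}
for i, city in enumerate(rows):
    bitmaps[city] = bitmaps.get(city, 0) | (1 << i)

# WHERE city = 'bj' OR city = 'sh'  ->  bitwise OR of two bitmaps.
mask = bitmaps["bj"] | bitmaps["sh"]
matching = [i for i in range(len(rows)) if mask & (1 << i)]
print(matching)  # [0, 1, 2, 4, 5]
```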

Based on the above analysis, Kylin is generally positioned as an offline OLAP engine for large data volumes, while Druid is a real-time OLAP engine for large data volumes.

Comparison of the three architectures

Systems with an MPP architecture: good support for data volume and flexibility, but no guarantee on response time. As data volume and computational complexity grow, response time slows from seconds to minutes, or even hours.

Search-engine-based systems: compared with MPP systems, they sacrifice some flexibility for better performance and can achieve sub-second responses on search-style queries. However, as the volume of data processed increases, response time for scan-and-aggregate queries degrades to the minute level.

Pre-computation systems: they pre-aggregate data at ingestion time, sacrificing still more flexibility for performance, in order to achieve second-level responses over very large data sets.

Combining the above analysis, across the three categories in order:

• Support for data volume grows from small to large
• Flexibility decreases from high to low
• Performance on growing data volumes improves from low to high

Therefore, we can make a choice based on the actual business data volume and the requirements for flexibility and performance. For example, GP may be able to meet the needs of most companies, while Kylin can meet the needs of very large data volumes.

Conclusion

Recently I read a sentence: “the key thinking in architecture design is judgment and choice; the key thinking in program design is logic and implementation.” I deeply agree!

Going forward, our technical team will continue to explore selection methods for multi-dimensional analysis systems and discuss them with you, providing, as always, better service for you developers.