A data warehouse is a basic service that a company provides once its data grows to a certain scale, and building one is an essential part of "data intelligence". This article covers data warehouses from six angles: introduction, development, construction methodology, architecture evolution, an application case, and a comparison between real-time and offline data warehouses.

1. Introduction to data warehouse

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data used to support management decision-making.

Data warehousing has evolved alongside enterprise informatization. As information tools have been upgraded and new tools adopted, data volumes have grown ever larger, data formats have multiplied, and decision-making requirements have become more demanding, so data warehouse technology has kept developing. Current trends in data warehousing:

  • Real-time data warehouses, to meet real-time and automated decision-making needs;
  • Big data and data lakes, to support large volumes of complex data types (text, image, video, audio).

2. Development of data warehouse

Data warehousing involves two parts: building the data warehouse and applying it.

In the early days, building a data warehouse mainly meant modeling and summarizing data from enterprise business databases such as ERP, CRM, and SCM into a data warehouse engine according to the requirements of decision analysis. Its application centered on reports, supporting managers and business staff in making (medium- and long-term strategic) decisions.

Both are changing dramatically as the business and the environment evolve.

  • As IT moved to the Internet and mobile, data sources became much richer: on top of the original business databases, unstructured and semi-structured data appeared, such as web logs, IoT device data, and app event-tracking data. This data is several orders of magnitude larger than the structured data of the past, placing much higher demands on ETL and storage.
  • The online nature of the Internet also pushes business requirements toward real time: it is increasingly common to adjust strategy on the fly according to current customer behavior, for example inventory and operations management during a big promotion (so both medium/long-term strategic decisions and short-term operational ones are needed). At the same time, moving a company's business online sharply increases the number of customers served concurrently; in some cases humans cannot keep up at all, and machines must make decisions automatically, as in fraud detection and user auditing.

In summary, the new requirements on the data warehouse can be abstracted into two aspects: producing results in real time, and processing and storing large amounts of heterogeneous data.

Note: Data lake technologies are not discussed here.

3. Data warehouse construction methodology

3.1 Subject-oriented

Modeling starts from the company's business and is organized into macro analysis areas, such as supplier, commodity, customer, and warehouse subjects.

3.2 Services for multidimensional data analysis

Data reports; data cubes with roll-up, drill-down, slice, pivot (rotate), and other analysis operations.
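To make these operations concrete, here is a minimal sketch using pandas (my choice of tool for illustration; the data and column names are made up) showing roll-up, drill-down, and slice on a tiny sales cube:

```python
import pandas as pd

# A tiny fact table: one row per (date, region, product) cell of the cube.
sales = pd.DataFrame({
    "date":    ["2019-11-01", "2019-11-01", "2019-11-02", "2019-11-02"],
    "region":  ["North", "South", "North", "South"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 80, 120, 90],
})

# Roll up: aggregate away the product dimension, keeping date x region.
rollup = sales.groupby(["date", "region"], as_index=False)["amount"].sum()

# Drill down: return to a finer grain by adding the product dimension back.
drilldown = sales.groupby(["date", "region", "product"], as_index=False)["amount"].sum()

# Slice: fix one dimension to a single value (here, region == "North").
north_slice = sales[sales["region"] == "North"]

print(rollup, drilldown, north_slice, sep="\n\n")
```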

3.3 Denormalized data model

A star data model consisting of fact tables and dimension tables.

Note: image from 51CTO.
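As an illustration only (all table and column names below are hypothetical), a star model is one fact table carrying foreign keys and additive measures, joined to descriptive dimension tables; here pandas merges stand in for SQL joins:

```python
import pandas as pd

# Dimension tables: one row per member, with descriptive attributes.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "category": ["Tools", "Toys"],
})
dim_customer = pd.DataFrame({
    "customer_id": [10, 11],
    "customer_name": ["Alice", "Bob"],
    "city": ["Hangzhou", "Beijing"],
})

# Fact table: foreign keys to each dimension plus additive measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "customer_id": [10, 11, 11],
    "quantity": [3, 1, 2],
    "amount": [30.0, 15.0, 20.0],
})

# The "star": the fact table in the middle, dimensions joined around it.
wide = (fact_sales
        .merge(dim_product, on="product_id")
        .merge(dim_customer, on="customer_id"))
print(wide)
```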

4. Evolution of data warehouse architecture

The concept of the data warehouse was proposed by Bill Inmon in 1990, along with a complete construction methodology. With the advent of the Internet era, data volumes exploded, and big data tools came to replace the traditional tools of the classic data warehouse. At that point only the tools changed; the architecture was not fundamentally different. We can call this the offline big data architecture.

Later, as businesses' real-time requirements kept rising, an acceleration layer was added on top of the offline big data architecture, using stream processing to compute the indicators with high real-time requirements directly. This is the Lambda architecture.

Later still, as real-time businesses and event-based data sources multiplied, real-time processing went from a supporting role to the main one, and the architecture was adjusted accordingly: the Kappa architecture, centered on real-time event processing, emerged.

4.1 Offline Big Data Architecture

Data sources are imported into the offline data warehouse in batch fashion. Downstream applications either read the DM directly or go through an extra data service layer, such as MySQL or Redis, depending on their needs. At the model level, the data warehouse is divided into three layers:

  • ODS, the operational data store layer, which keeps the raw data;
  • DWD, the data warehouse detail layer, which defines fact and dimension tables by subject and stores the finest-grained fact data;
  • DM, the data mart / light summary layer, which builds light aggregations on top of DWD according to different business requirements.

Typical data warehouse storage is HDFS/Hive, and the ETL can be MapReduce scripts or Hive SQL.
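As a minimal sketch of this layered ETL, here is what the ODS → DWD → DM flow could look like in PySpark with Hive support (one common tool choice; the article itself mentions MapReduce or Hive SQL, and all table names below are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("offline-dwh-etl")
         .enableHiveSupport()   # read/write Hive tables stored on HDFS
         .getOrCreate())

# ODS -> DWD: cleanse raw orders and attach dimension attributes by subject.
spark.sql("""
    INSERT OVERWRITE TABLE dwd.fact_order
    SELECT o.order_id, o.user_id, o.amount, o.order_date, p.category
    FROM ods.orders o
    JOIN dwd.dim_product p ON o.product_id = p.product_id
    WHERE o.amount IS NOT NULL
""")

# DWD -> DM: light summary by day and category for one business requirement.
spark.sql("""
    INSERT OVERWRITE TABLE dm.order_daily_summary
    SELECT order_date, category, COUNT(*) AS orders, SUM(amount) AS gmv
    FROM dwd.fact_order
    GROUP BY order_date, category
""")
```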

4.2 Lambda architecture

As big data applications developed, real-time requirements were gradually put on the system. To compute certain real-time indicators, a real-time computation link was added alongside the original offline warehouse, and the data sources were given a streaming transformation (that is, the data is also sent to a message queue). The real-time computation subscribes to the message queue, computes the indicator increments directly, and pushes them to the downstream data service, where the data service layer merges the offline and real-time results.

Note: the indicators computed by stream processing are also computed in batch, and the batch result is final; that is, each batch run overwrites the stream processing result. (This was merely a compromise forced by the immaturity of stream processing engines.)
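A minimal sketch of that merge rule in the data service layer, under the assumptions above (the stores and function names are hypothetical): the batch value overwrites the stream value once it is available, and the stream value only fills the not-yet-batched gap.

```python
from datetime import date

def merged_metric(day: date, batch_store: dict, speed_store: dict):
    """Serve one day's indicator, preferring the batch (authoritative) value.

    batch_store: results of the nightly batch job, keyed by day.
    speed_store: incremental results from the streaming job, keyed by day.
    """
    if day in batch_store:          # batch has caught up: overwrite stream value
        return batch_store[day]
    return speed_store.get(day, 0)  # otherwise serve the approximate stream value

# Usage: batch covers up to yesterday, the stream covers today.
batch = {date(2019, 11, 10): 1000}
speed = {date(2019, 11, 10): 998, date(2019, 11, 11): 412}
print(merged_metric(date(2019, 11, 10), batch, speed))  # 1000 (batch wins)
print(merged_metric(date(2019, 11, 11), batch, speed))  # 412  (stream only)
```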

Lambda architecture issues:

  • The same requirement needs two separate but equivalent codebases: this is the biggest problem. The Lambda architecture means more than doubled development effort (the same requirement implemented once on the batch engine and once on the stream engine, each with its own tests to ensure the two results agree); later maintenance is harder too. For example, when a requirement changes, both codebases must be modified, tested independently, and put online at the same time.
  • Increased resource usage: the same logic is computed twice, so overall resource consumption grows (the added real-time computation included).

4.3 Kappa architecture

Although the Lambda architecture meets real-time requirements, it brings extra development and operations work. Its historical backdrop is that stream processing engines were immature, and stream results were used only as temporary, approximate references. Later, with the advent of engines such as Flink, stream processing matured, and to solve the two-codebases problem, Jay Kreps of LinkedIn proposed the Kappa architecture.

  • The Kappa architecture can be thought of as a simplified Lambda architecture (the Lambda architecture with the batch part removed).
  • In the Kappa architecture, requirement changes and historical reprocessing are handled by replaying data from upstream.
  • The biggest problem with the Kappa architecture is that reprocessing history as a stream has lower throughput than batch processing, but this can be offset by adding computing resources.

The reprocessing process of Kappa architecture:

Reprocessing is the aspect of the Kappa architecture people worry about most, but it is actually not complicated:

  1. Choose a message queue with replay capability that can store historical data and supports multiple consumers, and configure how long history is retained as required. For example, Kafka can retain all historical data.
  2. When one or more metrics need reprocessing, write a new job with the new logic, consume the upstream message queue again from the beginning, and write the results to a new downstream table (sketched after this list).
  3. Once the new job catches up, switch the application's data source to read the new result table produced in step 2.
  4. Stop the old job and delete the old result table.
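As a sketch of the replay step, using the kafka-python client (one possible client; the topic, group, and metric logic below are hypothetical), the new job simply uses a fresh consumer group and starts from the earliest offset:

```python
import json
from kafka import KafkaConsumer

# A brand-new group_id plus auto_offset_reset="earliest" makes Kafka replay
# the topic from the beginning for this job, independently of the old job.
consumer = KafkaConsumer(
    "business_events",                      # hypothetical upstream topic
    bootstrap_servers="kafka:9092",
    group_id="metrics_job_v2",              # new job = new consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

def new_logic(event):
    """Recompute the metric with the corrected logic (placeholder)."""
    return event.get("amount", 0) * 1.0

for msg in consumer:
    result = new_logic(msg.value)
    # write_to_new_result_table(result)  # lands in a *new* downstream table
```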

4.4 Comparison between Lambda architecture and Kappa Architecture

  1. In real scenarios it is often not a fully standard Lambda or Kappa architecture but a mixture of the two. For example, most real-time indicators are computed the Kappa way, while a few key indicators (such as those involving money) are also recomputed in batch the Lambda way, adding a reconciliation step.
  2. The Kappa architecture does not mean that no intermediate results are materialized. Many big data systems now need to support machine learning (offline training), so real-time intermediate results must land in a storage engine for machine learning to use; sometimes detail-level queries are needed too, in which case the data must also be written to a real-time detail layer engine. See the case below.
  3. In addition, as data diversity grows, it becomes harder and harder for a data warehouse with a schema fixed in advance to support flexible exploratory analysis. This is where data lake technology comes in: all raw data is kept in a big data store, and the raw data is parsed according to the requirements at analysis time. Put simply, the data warehouse is schema-on-write while the data lake is schema-on-read (sketched below).
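A toy sketch of the schema-on-write vs. schema-on-read distinction (plain Python, hypothetical record shape): the warehouse shapes and validates the record when storing it, while the lake stores raw bytes and applies a schema only when queried.

```python
import json

# Schema on write (warehouse style): validate/shape the record before storing.
def write_to_warehouse(table: list, raw: str):
    rec = json.loads(raw)
    row = {"user_id": int(rec["user_id"]),       # schema enforced here;
           "amount": float(rec["amount"])}       # bad records are rejected now
    table.append(row)

# Schema on read (lake style): store the raw bytes, interpret at query time.
def query_lake(raw_store: list):
    for raw in raw_store:
        rec = json.loads(raw)                    # schema applied only now
        yield float(rec.get("amount", 0))

warehouse, lake = [], []
event = '{"user_id": "42", "amount": "9.9"}'
write_to_warehouse(warehouse, event)             # shaped at write time
lake.append(event)                               # stored as-is
print(warehouse, list(query_lake(lake)))
```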

5. A real-time data warehouse case

This case draws on a share from the Cainiao warehouse-and-distribution data team, covering overall design, data model, data guarantees, and more.

Note: special thanks to Yuanqiao for the selfless sharing.

5.1 Overall Design

The overall design is shown in the figure below, built on the data of the business systems:

  • Data model: adopts the middle-layer design concept to build the real-time data warehouse;
  • Computing engine: real-time compute, which is easy to use and performs well, is chosen as the main computing engine;
  • Data service: the Tiangong data service middleware is chosen to avoid direct database connections, and on top of Tiangong, second-level switching between the primary and standby links can be configured flexibly;
  • Data application: around the whole promotion lifecycle, a warehouse-and-distribution promotion data system is built from five angles: activity planning, stock preparation, live tracking during the activity, after-sales, and restocking.

5.2 Data Model

Considering computing cost, ease of use, reuse, and consistency, we must avoid siloed, "smokestack-style" development and instead build the real-time data warehouse with a middle layer. Largely consistent with the offline middle layer, we split the real-time middle layer into two layers.

  • Layer 1: DWD, the public real-time detail layer

The real-time computation subscribes to the business data message queue, then cleanses the data, joins multiple data sources, and associates the streaming data with offline dimension information, linking records of the same granularity from different business systems together with their dimension attributes. This improves the data's usability and reuse and yields the final real-time detail data. The detail data branches two ways: one branch lands directly in ADS for real-time detail queries, and the other is sent on to a message queue for the next layer's computation.
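A minimal sketch of such a DWD job, assuming Kafka as the message queue, a preloaded dict standing in for the offline dimension table, and placeholder sinks (none of these specifics are from the original case):

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Offline dimension info, preloaded (in practice: a dimension table snapshot).
dim_warehouse = {"WH01": {"warehouse_name": "Hangzhou DC"}}  # hypothetical

consumer = KafkaConsumer("ods_order_stream", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda b: json.loads(b.decode()))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for msg in consumer:
    e = msg.value
    if e.get("order_id") is None:          # cleansing: drop malformed events
        continue
    # Join the stream with offline dimension attributes (same granularity).
    e.update(dim_warehouse.get(e.get("warehouse_id"), {}))
    # Branch 1: land the detail row in ADS for real-time detail queries.
    # ads_client.insert("dwd_order_detail", e)   # placeholder sink
    # Branch 2: publish to the DWD topic for downstream DWS jobs.
    producer.send("dwd_order_stream", e)
```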

  • Layer 2: DWS, the public real-time summary layer

The public summary layer is organized by the concept of data domain + business domain. Unlike the offline warehouse, the summary layer here is split into a light summary layer and a highly aggregated layer, produced at the same time. The light summary layer is written to ADS for the complex OLAP queries of front-end products, serving self-service analysis and report generation; the highly aggregated layer is written to HBase for simple KV query scenarios where query performance matters, such as real-time dashboards.
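A sketch of the DWS idea, with both summary granularities produced from the same DWD stream (the field names and sinks are hypothetical; a real job would keep this state in a stream engine such as Flink with proper windowing):

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer

consumer = KafkaConsumer("dwd_order_stream", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda b: json.loads(b.decode()))

light = defaultdict(float)   # (minute, data_domain, biz_domain) -> gmv
high = defaultdict(float)    # minute -> total gmv (for the dashboard)

for msg in consumer:
    e = msg.value
    minute = e["event_time"][:16]            # e.g. "2019-11-11 00:01"
    light[(minute, e["data_domain"], e["biz_domain"])] += e["amount"]
    high[minute] += e["amount"]
    # Light summary -> ADS for OLAP / self-service analysis (placeholder):
    # ads_client.upsert("dws_light_summary", ...)
    # High summary -> HBase KV for the real-time dashboard (placeholder):
    # hbase_table.put(minute.encode(), {b"cf:gmv": str(high[minute]).encode()})
```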

Note:

  • ADS is an OLAP analysis engine; open-source systems with similar capabilities include Elasticsearch, Kylin, and Druid.
  • In this case data is written to HBase for KV queries, but other engines can be chosen as appropriate; for example, MySQL is fine when both the data volume and the query pressure are small.
  • Subject modeling is not described here because it is closely tied to the business.

5.3 Data Guarantees

Alibaba runs the Singles' Day (Double 11) promotion every year, during which traffic and data volume soar. Real-time systems are more sensitive to data volume than offline systems and need higher stability to stay real-time, so two things must be prepared for this scenario:

  • Load testing of the system before the big promotion;
  • Primary/standby link guarantees during the big promotion.

For more detail, see Cainiao's Double 11 "real-time warehouse-and-distribution data" write-up.

6. Comparison between real-time and offline data warehouse

Having gone through the description above and the Cainiao case, let's compare real-time and offline data warehouses:

  • First, the two differ clearly in architecture: real-time data warehouses are dominated by the Kappa architecture, offline data warehouses by the traditional big data architecture, and the Lambda architecture can be seen as an intermediate state between the two.
  • Second, in construction methodology, real-time and offline data warehouses both basically follow traditional subject-oriented modeling theory and output wide fact tables. One addition: joins over real-time streams in a real-time warehouse carry hidden time semantics, which must be kept in mind during construction (see the sketch after this list).
  • Finally, in terms of data guarantees, the real-time warehouse must stay real-time and is therefore sensitive to swings in data volume; for scenarios such as big promotions, load testing and primary/standby guarantees must be prepared in advance, which is a clear difference from the offline warehouse.
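To illustrate the hidden time semantics of stream joins: a stream-to-stream join can only hold state for a bounded time, so whether two events join depends on when they arrive. A toy sketch (the event shapes and window length are hypothetical):

```python
from datetime import datetime, timedelta

JOIN_WINDOW = timedelta(minutes=10)   # the hidden time semantics: a join bound
pending_orders = {}                   # order_id -> (event_time, order event)

def on_order(order):
    pending_orders[order["order_id"]] = (order["ts"], order)

def on_payment(pay):
    """Join a payment to its order only if it arrives within the window."""
    item = pending_orders.get(pay["order_id"])
    if item and pay["ts"] - item[0] <= JOIN_WINDOW:
        return {**item[1], **pay}     # joined wide record
    return None                       # too late (or no order): the join misses

on_order({"order_id": 1, "ts": datetime(2019, 11, 11, 0, 0)})
print(on_payment({"order_id": 1, "ts": datetime(2019, 11, 11, 0, 5)}))   # joins
print(on_payment({"order_id": 1, "ts": datetime(2019, 11, 11, 0, 20)}))  # None
```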


    Author: Guo Hua (Fu Kong)


    This article is original content from Alibaba Cloud's Yunqi community and may not be reproduced without permission.