A data warehouse is a strategic collection of all types of data to support the decision-making process at all levels of the enterprise. It is a single data store, created for analytical reporting and decision support purposes. Under normal circumstances, data analysis system will use data warehouse modeling, generally business system for relational modeling

First, the general process

There are two data sources of electronic mall data warehouse: the user’s behavior log and the system’s business database

  1. After obtaining the data, import the data to HDFS, and then import it to Hive for subsequent analysis (by date).

  2. Based on Hive system, the establishment of each layer of data warehouse, each layer of modeling interdependence

  3. After the warehouse modeling is completed, it can provide basic data for machine learning, data mining and other low-level data applications, and finally complete the purpose of data analysis and decision analysis

Second, data stratification

Why stratification?

  • Complex problems can be simplified. Complex tasks are divided into multiple layers to complete, and each layer only deals with simple tasks to facilitate problem locating
  • Reduce repetitive development. Standard data stratification, through the middle layer of data, can reduce the great amount of repeated calculation, increase the reusability of a calculation result
  • Isolate raw data. Real data can be decoupled from statistical data by layering, whether it is data anomalies or data sensitivity

What is the difference between data mart and data warehouse?

Data mart: A miniature data warehouse, with less data, fewer subject areas, and less historical data, is department-level and generally serves only a local area of management

Data warehouse: enterprise-level, can provide decision support means for the operation of all departments of the entire enterprise

A data warehouse is the sum of data markets

What are the layers?

The name and number of layers are not fixed. They can be adjusted according to specific services and requirements. Generally, they are divided into the following layers:

  1. Operation Data Store (ODS) raw Data layer

The original data layer stores the original data, directly loads the original log and data, and keeps the data intact without processing

  1. Data warehouse detail (DWD) Detail data layer

ODS layer data cleaning (remove null values, dirty data, data beyond the limit range), dimension degradation (dimension modeling), desensitization, etc

DWD layer is obtained by dimensional modeling based on ODS data

  1. Data Warehouse Service (DWS) service data layer

Light summary by day based on previous data layers

  1. DWT (Data Warehouse Topic) data Topic layer

Summarize according to previous data and topics of interest

  1. Appilcation Data Store (ADS) Data application layer

Provide data for various statistical reports according to actual application requirements

Number warehouse theory

classification

Star model

The standard star model has only one dimensionCopy the code

Snowflake model

It's closer to 3NF and has multiple dimensionsCopy the code

The constellation model

The constellation model is composed of multiple star models, and there is common dimension table among themCopy the code

So how do you choose in actual modeling?

  • The selection should be based on the principle of performance and flexibility first

  • In general, star model with fewer dimensions is preferred without excessive pursuit of data table simplicity. In Hadoop system, reducing Join means reducing shuffle. Although star model has a lot of redundant data, it has good performance

A dimension table

Definition:

It’s usually a description of a fact. Each dimension table corresponds to an object or concept in the real world. For example: user, product, date, region, etc.

Features:

  • Dimension tables have a wide range (multiple attributes, multiple columns), so put all the related attributes you can think of into one dimension table
  • The number of rows is relatively small compared to a fact table: usually < 100,000
  • The content is relatively fixed: equivalent to a code table

Common dimension tables:

Time dimension table

Location dimension tables and so on

Classification:

  • Full scale. Import them all at once, and nothing changes after that

  • Zipper table

    Definition:

    A zipper table that records the life cycle of each message. Once the life cycle of a record ends, a new record is restarted and the current date is put into the effective start date. If the current information is valid to date, enter a maximum value for the effective end date (such as 9999-99-99).

    Why use a zipper watch? Zipper table is suitable for large amount of data, data will change, but most of it is unchanged. (i.e., slowly changing dimension, only updated with daily additions and changes)

    If you use a zipper watch? If the validity start date <= a certain date and validity end date >= a certain date, you can obtain a full slice of data at a certain point in time.

    How to make a zipper table? All the user data of the day and the daily change data in MySQL are spliced together to form a new temporary zipper table data. Overwrite old zipper table data with temporary zipper table. (This fixes the problem of not updating data in hive tables)

The fact table

Definition:

Each row represents a business event (order, payment, refund, evaluation, etc.)

Ingredients:

  • A numerical measure with additivity
  • A foreign key connected to a dimension table
    • It usually has two or more foreign keys
    • The foreign keys represent a many-to-many relationship between dimension tables

Features:

  • The amount of data is very large and the number of lines is quite large

  • The content is relatively narrow: the number of columns is small, and the columns are mainly composed of the following two types

    • Dimension a foreign key
    • measurements
  • It changes a lot, it adds a lot every day, it increases as it happens

Category:

  • The transactional fact table is the most common, based on each transaction or event, and generally only updates the business data that is added that day
  • The periodic snapshot fact table does not retain all data, but only data at fixed intervals, focusing on the final result rather than the update process. Update once over a period of time (collectibles, items collected today may be removed tomorrow)
  • The cumulative snapshot fact table tracks service fact changes and updates service status changes

The layers of modeling

1. The ODS layer

  • Keep the original data and do not modify it to back up data
  • Data is compressed to reduce disk storage space
  • Create a partitioned table (one partition for a day’s data) to prevent subsequent full table scans

2. DWD layer

In theory, the DWD layer can satisfy any business requirement, and the dimension table is actually the description Angle of the fact table

Select business process

A business line corresponds to a fact sheet. In the business system, select the business line we are interested in, such as ordering business, payment business, refund business, logistics business

Statement of granularity

Determine strength according to his specific business

For example, in the order business order, each item is taken as a row in the order fact table, and the granularity is the number of orders placed per week as a row. The granularity is the number of orders placed per week as a row, and the granularity is the number of orders placed per month

Confirm the dimension

  • Determine which dimensions are associated with each fact table to identify specific dimensions
  • Complete this step to determine the foreign keys of the fact table

Confirm facts confirm metrics, such as order amount, number of orders, etc

The determination of a dimension table brings together all the information related to that dimension (dimension degradation: dimensions have very few fields and are merged directly into the fact table), thereby reducing join operations and improving performance

3. DWS layer

DWS is a driver that models topic objects and then builds wide tables (partitioned tables that measure the behavior of the day)

steps

  • Specify which tables to build. One topic (dimension) corresponds to one table
  • Example of statistical results of all the metrics in the fact table related to the current topic: Statistics related to the user dimension are: order, comment, pay, etc., calculate these metrics in days, primary key is the user

meaning

Optimize warehouse performance and reduce double calculation. For example, to calculate two different statistical values of the same dimension (population of a province in 2022, GDP growth of a province), if there is no DWS layer, DWD layer will be frequently operated and repeated database query operations will be carried out, resulting in high time cost

4. The DWT layer

meaning

Optimize data warehouse performance, reduce double calculation, general statistics is cumulative behavior, not partition table, is a large table, storage is the full amount of data

5. ADS layer

Perform related data analysis operations according to the specific requirements of the lower level