Public account: You and the cabin. Author: Peter. Editor: Peter.

Hello, I’m Peter.

Today’s article is about data warehousing and data layering. I have read many articles on the topic and have been asked about it many times in interviews, so I have organized the material to share with you.

Data layering is a very important part of data warehouse design. A good layering design makes the whole data system easier to understand and use. This article explains the role of each layer in a data warehouse.

Data layering in one diagram

I hope you bookmark the diagram below and keep in mind the role of each data layer.

What is a data warehouse (DW)

The data warehouse (abbreviated DW or DWH) emerged once databases already existed in large numbers. It is a complete body of theory that covers ETL, scheduling, and modeling.

The purpose of building a data warehouse is to serve as the foundation for front-end query and analysis. It is mainly used for online analytical processing (OLAP): it supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results. Examples include AWS Redshift, Greenplum, and Hive.

The data warehouse is not the final destination of the data. It prepares the data for its final destination, and that preparation includes cleaning, escaping, classifying, reorganizing, merging, splitting, and computing statistics.

Why layer the data

Questions that come up in data warehousing:

  1. Why build a data warehouse?
  2. Why manage data quality?
  3. Why manage metadata?
  4. What is the role of each layer in the warehouse hierarchy?
  5. …

In practice, we all want our data to flow in an orderly way, so that designers and users can clearly see the whole data life cycle, as shown on the left in the figure below.

In reality, however, the data we face is often highly complex with muddled levels, and we may end up with a data system whose table dependencies are chaotic and even cyclic, as shown on the right in the figure below.

To solve these problems, we need an effective way to organize, manage, and process data so that the data system becomes more orderly. That way is data layering. Benefits of data layering:

  • Clear data structure: each data layer has its own role and responsibility, which makes it easier to use, maintain, and understand
  • Simplified problems: a complex task is broken into multiple steps, and each layer solves only one specific problem
  • Consistent metric definitions: layering provides a single data outlet and a single definition for every output metric
  • Less duplicated work: standardized layering and a shared middle layer greatly reduce repeated computation
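
The contrast between an orderly flow and a tangled, cyclic one can be checked mechanically. The following is a minimal sketch, assuming a hypothetical naming convention in which every table name starts with its layer prefix (`ods_`, `dwd_`, and so on); the table names and the `check_dependencies` helper are invented for illustration and are not part of any tool.

```python
from graphlib import CycleError, TopologicalSorter

# Layer order, lowest first. The table-name prefixes are a hypothetical convention.
LAYER_ORDER = ["ods", "dwd", "dwm", "dws", "ads"]


def layer_of(table):
    """Return the layer rank taken from a prefix such as 'dwd_orders'."""
    return LAYER_ORDER.index(table.split("_", 1)[0])


def check_dependencies(deps):
    """Return a list of problems; deps maps a table to the tables it reads."""
    problems = []
    # A table should read only from its own layer or a lower one.
    for table, sources in sorted(deps.items()):
        for src in sorted(sources):
            if layer_of(src) > layer_of(table):
                problems.append(f"{table} reads from higher-layer table {src}")
    # A cycle means no valid run order exists for the jobs.
    try:
        list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        problems.append("cycle: " + " -> ".join(exc.args[1]))
    return problems


orderly = {
    "dwd_orders": {"ods_orders"},
    "dwm_order_daily": {"dwd_orders"},
    "dws_user_wide": {"dwm_order_daily"},
}
tangled = {
    "dwd_orders": {"ods_orders", "dws_user_wide"},
    "dws_user_wide": {"dwd_orders"},
}
print(check_dependencies(orderly))  # no problems expected
print(check_dependencies(tangled))  # a layer violation and a cycle
```

A check like this could run before scheduling, so that a new table that reads "upward" is caught before it turns into a cyclic dependency.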

Data layers

Each company can define its layers according to its own business needs. A currently popular layering is: data operation layer, data warehouse layer, data service layer.

Data operation layer ODS

Data operation layer: Operational Data Store (ODS), the data preparation area, also known as the source-aligned layer because it sits closest to the source systems. Data from the source systems enters this layer through extraction, cleaning, and transfer, that is, the ETL process. The main functions of this layer:

  • ODS is the staging area for the data warehouse layers that follow
  • It provides raw data for the DWD layer
  • It reduces the impact on business systems

So that data can be traced back later, heavy cleaning is not recommended at this layer; load the original data as it is.

The data in this layer is the source for all later processing in the data warehouse. Data sources:

  1. Business databases: Sqoop extracts data on a schedule; for real-time ingestion, Canal can be used to monitor the MySQL binlog
  2. Event-tracking logs: logs are generally saved as files. Flume can be used for scheduled synchronization; Spark Streaming, Flink, or Kafka can be used for real-time ingestion
  3. Message queues: data from ActiveMQ, Kafka, and so on
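
To illustrate the idea of landing source data intact, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the ODS store. The table name `ods_raw`, its columns, and the `land_raw` helper are assumptions made for this example; a real pipeline would use the tools listed above.

```python
import json
import sqlite3
from datetime import datetime, timezone


def land_raw(conn, source, records):
    """Append source records to the ODS table unchanged, as JSON text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ods_raw ("
        "source TEXT, loaded_at TEXT, payload TEXT)"
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO ods_raw (source, loaded_at, payload) VALUES (?, ?, ?)",
        [(source, loaded_at, json.dumps(r, sort_keys=True)) for r in records],
    )
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
# Empty or odd-looking values are kept as they are; cleaning is left to DWD.
land_raw(conn, "orders_db", [
    {"order_id": 1, "user_id": 7, "amount": 19.9},
    {"order_id": 2, "user_id": None, "amount": -5},
])
print(conn.execute("SELECT COUNT(*) FROM ods_raw").fetchone()[0])
```

Storing the source name and load time next to the untouched payload is one way to keep the data traceable later.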

Data warehouse layer

The data warehouse layer can be further divided into three layers, listed here in the order data flows through them: the data detail layer DWD, the data middle layer DWM, and the data service layer DWS.

Data detail layer DWD

Data detail layer: Data Warehouse Details (DWD)

This layer isolates the business layer from the data warehouse and keeps the same data granularity as the ODS layer. It mainly cleans and standardizes ODS data, for example by removing empty values, dirty data, and outliers.
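
The cleaning step can be sketched as a row-by-row filter that never aggregates, so the granularity stays the same as in ODS. The field names, the `max_amount` threshold, and the rejection reasons below are hypothetical choices for illustration.

```python
def clean_orders(rows, max_amount=100_000.0):
    """Return (kept, rejected); each kept row is still exactly one order."""
    kept, rejected = [], []
    for row in rows:
        order_id = row.get("order_id")
        user_id = row.get("user_id")
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            rejected.append((row, "dirty amount"))  # not a number at all
            continue
        if order_id is None or user_id is None:
            rejected.append((row, "missing key"))  # empty data
        elif not 0 <= amount <= max_amount:
            rejected.append((row, "amount out of range"))  # outlier
        else:
            kept.append(
                {"order_id": order_id, "user_id": user_id, "amount": amount}
            )
    return kept, rejected


raw = [
    {"order_id": 1, "user_id": 7, "amount": "19.90"},
    {"order_id": 2, "user_id": None, "amount": 5},
    {"order_id": 3, "user_id": 8, "amount": "n/a"},
    {"order_id": 4, "user_id": 9, "amount": -5},
]
kept, rejected = clean_orders(raw)
print(kept)
print([reason for _, reason in rejected])
```

Keeping the rejected rows together with a reason, instead of silently dropping them, makes it easier to audit the cleaning rules against the untouched ODS data.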

To make the detail layer easier to use, this layer usually applies dimension degeneration: dimension attributes are folded into the fact table, which reduces the joins needed between the fact table and the dimension tables.
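
As a rough illustration of folding dimension attributes into a fact table, the sketch below copies selected attributes from a small dimension table onto each fact row. The `degrade_dimension` helper and the sample columns are hypothetical.

```python
def degrade_dimension(fact_rows, dim_rows, key, attrs):
    """Copy the chosen dimension attributes onto every fact row."""
    lookup = {dim[key]: dim for dim in dim_rows}
    enriched_rows = []
    for fact in fact_rows:
        dim = lookup.get(fact[key], {})
        enriched = dict(fact)  # leave the input row untouched
        for attr in attrs:
            enriched[attr] = dim.get(attr)  # None if the dimension row is missing
        enriched_rows.append(enriched)
    return enriched_rows


orders = [
    {"order_id": 1, "product_id": 10, "amount": 19.9},
    {"order_id": 2, "product_id": 11, "amount": 5.0},
]
products = [
    {"product_id": 10, "category": "books", "brand": "Acme"},
]
dwd_orders = degrade_dimension(orders, products, "product_id", ["category"])
print(dwd_orders)
```

After this step a query on `category` no longer needs a join, at the cost of storing the attribute on every fact row.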

Data middle layer DWM

Data middle layer: Data Warehouse Middle (DWM)

Based on the data of the DWD layer, this layer performs light aggregation to produce intermediate result tables, which improves the reusability of common metrics and reduces repeated processing.

In short, it aggregates along the common core dimensions and calculates the corresponding statistical metrics.
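
A minimal sketch of such a light aggregation, assuming hypothetical detail rows with `order_date`, `user_id`, and `amount` fields, and choosing date and user as the core dimensions:

```python
from collections import defaultdict


def aggregate_daily(orders):
    """Group detail rows by (order_date, user_id) and compute basic metrics."""
    totals = defaultdict(lambda: {"order_count": 0, "order_amount": 0.0})
    for order in orders:
        key = (order["order_date"], order["user_id"])
        totals[key]["order_count"] += 1
        totals[key]["order_amount"] += order["amount"]
    return [
        {"order_date": day, "user_id": user, **metrics}
        for (day, user), metrics in sorted(totals.items())
    ]


detail = [
    {"order_date": "2024-04-01", "user_id": 7, "amount": 10.0},
    {"order_date": "2024-04-01", "user_id": 7, "amount": 5.5},
    {"order_date": "2024-04-01", "user_id": 8, "amount": 2.0},
]
dwm_user_daily = aggregate_daily(detail)
print(dwm_user_daily)
```

Downstream tables can then reuse `order_count` and `order_amount` instead of recomputing them from the detail rows.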

Data service layer DWS

Data service layer: Data Warehouse Service (DWS)

This layer builds on the base data in DWM, integrating and summarizing it to serve the analysis of one subject area. The result is generally a wide table that supports later business queries, OLAP analysis, data distribution, and so on.

In general, this layer has relatively few tables, and each table covers a large amount of business content. Because such a table has many fields, it is also called a wide table.
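
One way to picture a wide table is as several intermediate tables merged on a shared key, with one row per entity and many columns. The sketch below assumes two hypothetical per-user DWM tables and fills missing metrics with defaults; the column names are invented for illustration.

```python
# Default values for users that are missing from one of the inputs.
DEFAULTS = {"order_count": 0, "order_amount": 0.0, "visit_count": 0}


def build_user_wide(order_stats, visit_stats):
    """Merge per-user intermediate tables into one wide row per user."""
    wide = {}
    for table in (order_stats, visit_stats):
        for row in table:
            user_id = row["user_id"]
            target = wide.setdefault(user_id, {"user_id": user_id, **DEFAULTS})
            target.update(row)
    return [wide[user_id] for user_id in sorted(wide)]


order_stats = [{"user_id": 7, "order_count": 2, "order_amount": 15.5}]
visit_stats = [
    {"user_id": 7, "visit_count": 4},
    {"user_id": 8, "visit_count": 1},
]
dws_user_wide = build_user_wide(order_stats, visit_stats)
print(dws_user_wide)
```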

Data application layer ADS

Data application layer: Application Data Service (ADS)

This layer mainly provides data for data products and data analysis. The data is generally stored in systems such as Elasticsearch (ES), Redis, and PostgreSQL for use by online systems. It can also be stored in Hive or Druid for data analysis and data mining. Common data reports, for example, are stored here.

Fact table

A fact table stores records of facts, such as system logs and sales records. Records in a fact table keep growing; the order table of an e-commerce site is a typical example. A fact table is therefore usually much larger than other tables.

Dimension table

A dimension table, sometimes called a lookup table, is the counterpart of a fact table. It stores the attribute values of dimensions and can be joined to the fact table. In effect, attributes that recur frequently in the fact table are extracted, standardized, and managed in a table of their own. Dimension tables fall into two main kinds:

  • High-cardinality dimension data: typically user tables, product tables, and similar data; the volume may reach tens of millions or hundreds of millions of rows

  • Low-cardinality dimension data: generally configuration tables, such as a table that maps enumeration values to their Chinese labels, or a date dimension table. The volume may range from single digits to tens of thousands of rows.

Common dimension tables include the date table (storing the week, month, and quarter that correspond to each date) and the location table (with attributes such as country, province, state, and city).
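
The relationship between a fact table and a date dimension table can be sketched in a few lines. This is only an illustration: the column names, the ISO week convention, and the helper functions are assumptions, not a prescribed schema.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_date_dim(start, end):
    """Build a date dimension keyed by ISO date string, one row per day."""
    rows = {}
    day = start
    while day <= end:
        rows[day.isoformat()] = {
            "date": day.isoformat(),
            "iso_week": day.isocalendar()[1],
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
        }
        day += timedelta(days=1)
    return rows


def amount_by_quarter(sales, date_dim):
    """Join fact rows to the date dimension and sum the amount per quarter."""
    totals = defaultdict(float)
    for sale in sales:
        totals[date_dim[sale["date"]]["quarter"]] += sale["amount"]
    return dict(totals)


date_dim = build_date_dim(date(2024, 3, 30), date(2024, 4, 2))
sales = [
    {"date": "2024-03-31", "amount": 10.0},
    {"date": "2024-04-01", "amount": 5.0},
    {"date": "2024-04-02", "amount": 2.5},
]
print(amount_by_quarter(sales, date_dim))
```

The fact rows carry only the date key; the week, month, and quarter attributes live once in the dimension table and are looked up when needed.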

References

References for this article:

1. The general method of layering a data warehouse: www.cnblogs.com/itboys/p/10…

2. Data Warehousing – Architecture: www.lastnightstudy.com/Show?id=48/…

3. What is a data warehouse? (IBM): www.ibm.com/cloud/learn…