Data Warehouse design

Data warehouse design scheme

Summary of a.

Data warehousing is characterized by subject-orientation, integration, stability, and time-variability to support management decisions. The significance of the existence of data warehouse is to summarize all the data of the enterprise and provide unified and standardized data export for each department of the enterprise. Data warehouse construction usually requires hierarchical processing. The layered technical approach varies from business to business. The main reasons for storehouse stratification:

Clear data structure each data layer has its scope, which makes it easier to locate and understand when using tables.
Data kinship tracing ultimately presents a business table that can be directly used by the business, but there are many data sources in the table. If a source table has problems, we hope to quickly and accurately locate the problem and know its harm range.
Reducing repetitive development of standard data layering and developing some common middle tier data can reduce significant double computation.
Complex problem simplification a complex task is divided into multiple steps to complete, each layer deals with a single step, relatively simple and easy to understand. In addition, it is convenient to maintain the accuracy of data. When there is a problem with data, it is not necessary to repair all the data, but only need to repair from the steps with problems.

2. Hierarchical design of data warehouse

Common data warehouse is divided into ODS operation data storage layer, DW data warehouse layer and DM data mart layer, among which DW layer is divided into DWD layer and DWS layer. The hierarchical structure of the data warehouse is shown below:

2.1 the ODS layer

All data in the ODS layer come from the business database, and the tables in the ODS layer also correspond to the tables in the business database one by one. That is, the tables in the business database are re-established at the bottom of the data warehouse, and the data and structure are completely consistent. Because the business database (OLTP) is basically modeled according to the ER entity model, the modeling approach in the ODS layer is also ER entity model.

2.2 the DW layer

All the DWD layer has to do is clean up, consolidate, normalize, dirty data, garbage data, data with inconsistent specifications, inconsistent state definitions, and misnamed data will be processed. The DWD layer should be a complete, clean, consistent data layer covering all systems. In the DWD layer, the fact table and dimension table are designed according to the dimension model. That is to say, the DWD layer is a very standard, high quality and trusted data detail layer. The DWS layer is the public summary layer, which carries out mild summary with a granularity slightly coarser than detailed data. Based on the basic data on the DWD layer, it integrates and summarizes the service data for analyzing a certain topic domain, which is generally a wide table. The DWS layer should cover 80% of the application scenarios.

2.3 DM layer

The DM layer is the data mart layer and is topic-oriented. Report or indicator statistics are collected at the DM layer. The DM layer contains coarse-grained summary data rather than detailed data. The DM layer builds a model for a certain business domain, and specific users (usually the decision layer) view the reports generated by the DM layer.

Dimension model

The theory of dimension modeling was developed by Ralph Kimball, who proposed to divide tables in data warehouses into fact tables and dimension tables. Dimensional modeling is derived from data marts and is oriented toward analysis scenarios. Dimensional modeling is based on the analysis of the scenario to build a warehouse model; The focus is on fast, flexible solutions to analysis requirements, while providing fast response performance for large-scale data. It is mainly used in data warehouse construction and OLAP engine underlying data model. “Fact table” is used to store measures of facts and foreign key values pointing to each dimension. Dimension table is used to store the metadata of the dimension, that is, the description information of the dimension, including the dimension level and member categories. Simply put, a dimension table is the Angle (dimension) from which you view something, and a fact table is what you pay attention to. For example, if a user uses Didi Dache, the matter of taking a taxi can be transformed into a fact table, namely, the fact table of taking a taxi order. Then, the user corresponds to a user dimension table, and the driver corresponds to a driver dimension table.

3.1 the fact table

In the real world, every operational event, basically between entities, generates measurable values along with this operational event, and this process produces a fact table that stores every measurable event. Measurable values generated by operational events that occur in the real world are stored in fact tables. At the lowest level of granularity, a fact table row corresponds to a measure event and vice versa. As a result, the design of the fact table is entirely dependent on physical activity and is not influenced by the final report that may be produced. In addition to numeric metrics, the fact table always contains foreign keys to associate dimensions with, as well as optional degenerate dimension keys and date/time stamps. The primary goal of the query request is to expand computation and aggregate operations based on the fact table. Fact tables tend to contain three important elements:

Dimension table foreign key
Measurement data
Event Description

For example, a purchase event in the e-mall scene involves subjects including customers, goods and merchants, and the measurable value generated includes quantity, amount and number of goods, etc.

3.2 dimension table

Each dimension table contains a single primary key column. The primary key of a dimension table can act as a foreign key for any fact table associated with it, and of course the description environment of a dimension table row should correspond exactly to the fact table row. Dimension tables are generally wide, flat, non-standard tables that contain a large number of low-granularity text attributes. For example, the single primary key is the commodity ID, and the attributes include origin, color, material, size, unit price, etc., but not necessarily the attributes are text. For example, unit price and size are numerical descriptive.

To sum up, if dimensional modeling is carried out for users’ ordering behavior (single commodity), the following model can be obtained:

Data warehouse specification

4.1 Table name naming conventions

To give all interested parties of the data a common understanding of the information contained in the table. For example, which layer does it belong to (ODS, DW details, DW summary, DM)? Which business/department? Which dimension (users, vehicles, equipment)? Which time span (day, month, year, real time)? Incremental or full? Naming format: Level _ Service/Department _ Modification/Description _ Range/Cycle

The naming conventions of tables in the data warehouse are shown in the following table:

Several warehouse level	Period/data range
Common dimension	dim	Day snapshot	d
DM layer	dm	The incremental	i
ODS layer	ods	weeks	w
DWD layer	dwd	Zipper table	l
DWS layer	dws	Non-partitioned full scale	a

Transfer: www.cnblogs.com/skyell/p/11…