Digital warehouse architecture design and practice

Recently, the company is rebuilding its multiple positions, mainly referring to alibaba’s multiple positions construction theory; There are mainly the following aspects:

Data hierarchy

In the data system of Alibaba, it is suggested to divide the data warehouse into three layers. From bottom to bottom, they are as follows: Ods (Operation Data Store), CDM (Common Data Model), ADS (Application Data Server).

Ods layer: Stores raw data that has not been processed to the data warehouse system. It is consistent with the source system in structure and is the data preparation area of the data warehouse.
CDM layer: mainly completes data processing and integration, establishes consistent dimensions, builds reusable detailed fact tables oriented to analysis and statistics, and summarizes indicators of common granularity.
- DWD layer: Detail granularity fact layer. Based on the characteristics of each specific business process, the fine-grained detail layer fact table is constructed. In combination with the data usage characteristics of enterprises, some important dimension attribute fields of detailed fact tables can be appropriately redundant, that is, wide table processing. Tables at the detail granularity fact layer are also commonly referred to as logical fact tables.
- DWS layer: The common summary granularity fact layer. The topic object of analysis is taken as the modeling driver, and a common granularity summary index fact table is constructed based on the upper application and product index requirements, and the model is physicalized by means of wide table. Build naming norms and consistent statistical indicators, provide public indicators for the upper layer, and build summary wide tables and detailed fact tables.
- Dim layer: The common dimension layer, based on the dimension modeling idea, establishes the consistency dimension of the whole enterprise. Reduce the risk of inconsistency of data calculation aperture and algorithm. The tables of the common dimension layer are also commonly referred to as logical dimension tables, and dimensions and dimension logical tables usually correspond one to one.
Ads: Data application layer. Store the data. Store the data.

Consideration: I always have a question here: what is dimensional modeling and what is thematic modeling? What kind of relationship do they have?

Look at the structure of the figure below:

Dimensional modeling (DIM layer)

The first is dimension modeling, which occurs at the DIM layer, the common dimension layer. In dimensional modeling, measures are called “facts” (that is, atomic indicators, measures under the behavior of a business event, indicators that cannot be split in the business process, and names with clear business meaning; For example: a payment amount, an AD request for TV), describe the environment as a “dimension”. Dimension: the variety of circumstances required for analyzing facts. For example, when analyzing the transaction process, the environment in which the transaction takes place can be described in terms of buyer, seller, commodity and time.

Dimensions contain columns that represent them, called dimension attributes. For example, under the commodity dimension, there are dimension attributes such as commodity ID, commodity name, and commodity category.

So dimensional modeling is about extracting the context from the business process that describes it, and finding its attributes (dimensional attributes) to establish common and consistent dimensions. In dimension design, we should enrich dimension attributes as much as possible according to the principle of dimension design.

The main steps in designing dimension tables are:

Complete the initial definition of dimensions and ensure the consistency of dimensions.
Determine the principal dimension table (the central fact table, using a star model). The primary dimension table here is typically a data entry layer (ODS) table that is directly synchronized with the business system. For example, s_auction is the commodity table synchronized with the foreground commodity center system, which is the primary dimension table.
Determine the correlation dimension table. Data warehouse is the data integration of business source systems. Tables in different business systems or the same business system are associated with each other. According to business combing, which tables are associated with the main dimension table, and select some of them for generating dimension attributes. Taking the commodity dimension as an example, it can be concluded that there is a correlation relationship between commodities and categories, sellers, shops and other dimensions. Take the single dimension of the plan as an example, there is optical connection with areas, materials, etc.
There are two main stages in determining dimensional attributes. The first stage is to select dimension attributes from the main dimension table or to generate new dimension attributes. The second stage is to select dimension attributes from the related dimension tables or to generate new dimension attributes. Take the commodity dimension as an example, select dimension attributes or generate new dimension attributes from the primary dimension table (s_auction) and related dimension tables such as category, seller, and store. The same goes for the plan sheet.

Generate as many dimensional properties as possible.

Give as many meaningful verbal descriptions as possible.

Distinguish numeric attributes from facts.

Try to precipitate generic dimensional attributes.

In our TV advertising in the table, like this not many dimension tables (only plan unidimensional – integration of the material form, order, order information such as the wide table, because this table is less, so directly on the DWD layer, forming a plan of single wide table), originally can also be pulled out of the dimension table like geographical information, equipment, But the DATA in the ODS layer already carries this information. It is equivalent to having made a wide table, so there are fewer dimension tables.

Topic modeling

Subject modeling is an abstract concept to synthesize, classify, analyze and utilize the data in enterprise information system at a high level. Each Subject basically corresponds to a macro analysis domain.

How to divide topics: After business research, you can divide topic areas. To divide the subject area, you need to analyze what business activities are in each business module. Generally, subject areas can be divided according to the department of the user enterprise, or according to the functional modules in the business process or business section.

Divide by system: there are several business systems, divide into several.

Divide according to business process: for example, business system has commodity, transaction, logistics, etc

By part: such as production, supply chain, research and development, sales within the company

Recommendation: Divide the first level subject area through the business system, so that the boundary will be clear, the data warehouse development process will not because of the model subject’s ownership “dispute”, and then divide the second level subject area according to the business process in each system.

For example: advertising business system as the primary subject area, equipment, inventory, exposure as the secondary subject area.

Digital warehouse architecture design and practice

Data hierarchy

Consideration: I always have a question here: what is dimensional modeling and what is thematic modeling? What kind of relationship do they have?

Dimensional modeling (DIM layer)

Topic modeling

Related Posts

Springboot series (13): Swagger2

Spring Cloud Netflix integration with Eureka

A problem with updating Redis using PHP