Click on the top of the “QI QI” attention,” star “grow together

Hi, I’m Yunqi.

Today we talk about a platitude topic, how to design DWS layer?

For the number of storehouse layering, presumably everyone is not strange. Three-tier warehouse division based on OneData methodology: The Data introduction layer (ODS, Operational Data Store), the Data Common Dimenions Model (CDM, Common Dimenions Model) and the Data Application layer (ADS, Application Data Store) have long been popular.

Of course, we may have our own understanding of how to develop and model each layer.

But the good news is that everyone agrees on the importance of data modeling. If we compare metrics to the fruit on a tree, then the model is like the trunk of a tree. For the fruit to bear well, the trunk must be thickened.

Let’s think back to the original intention of building a data hub:

Lack of reusable data
You have to use raw data for cleaning, processing, and metrics
The resource consumption of a lot of repetitive code development

At the root of the problem is the inability to reuse data models and the smokestack nature of data development. So figuring out how to design a robust data model is the key to solving this problem.

Part1 Common design ideas of storehouse layering

The following is the logical architecture diagram of the data warehouse hierarchy. You may wish to recall the hierarchical design of the data model:

Data Introduction Layer (ODS) : The Operational Data Store (ALSO known as Data base layer) is the Data preparation area of the Data warehouse, which stores raw Data in the Data warehouse system with almost no processing and is basically consistent with the source system in structure. The primary responsibility of this layer is to synchronize and store the underlying data.
The data Common Dimension Model (CDM) : Stores detailed factual data, dimension table data, and Common metric summary data. Among them, detailed fact data and dimension table data are generally processed and generated according to ODS layer data. The summary data of public indicators are generally processed and generated based on dimension table data and detailed fact data. CDM layer is subdivided into dimension layer (DIM), detail data layer (DWD) and summary data layer (DWS). Using dimension model method as the theoretical basis, the relationship between the primary key of dimension model and the foreign key of fact model can be defined to reduce data redundancy and improve the ease of use of detailed data table. In the summary data layer, the dimensions in the statistical granularity can also be associated with reuse, and more broad tabulation methods can be adopted to construct the public indicator data layer, so as to improve the reuse of public indicators and reduce repeated processing.
Dimension layer (DIM, Dimension) : Based on the business meaning of each Dimension, the process of attribute definition and consistent data analysis Dimension table are completed by adding Dimension attributes, associating dimensions and defining calculation logic. In order to avoid redundant correlation of dimension attributes in dimension model, dimension table is constructed based on snowflake model.
Detail Data layer (DWD, Data Warehouse Detail) : Business process as model-driven, based on the characteristics of each specific business process, build the most fine-grained Detail fact table. Some important attribute fields can be properly redundant, that is, wide table processing.
Summary Data Layer (DWS, Data Warehouse Summary) : A common granularity Summary indicator table is built based on upper-level application and product metrics requirements, using the analyzed subject objects as model-driven. Physical model by means of broad table, construct statistical indicators with naming norms and consistent caliber, provide common indicators for the upper level, establish summary broad table and detailed fact table.
Data Application Layer (ADS) : stores personalized statistical index Data of Data products, which are processed and generated according to CDM layer and ODS layer.

Part2DWS layer important?

The common question is: Why bother building a DWS summary logical table when you can just pull numbers from the DWD layer?

Here’s what I’m saying: if the business scenario isn’t complex, that’s fine. But when faced with complex business scenarios, this approach is a recipe for confusion, as is the case with smokestack development, wasted computing resources, and so on.

For example, we need to make a preliminary summary from the data detail layer, abstract out some general dimensions: time, user ID, IP, etc., and make some statistics according to these dimensions, such as the number of goods purchased by users at different login IP in each period.

A layer of light summary here makes the calculation much more efficient, and on top of that it would be much faster to calculate only 7 -, 30 -, and 90-day actions. We want 80% of our business to be calculated through our DWS layer, not the ODS or DWD layer.

Part3 design principles to be followed

Aggregation is the aggregation of raw, fine-grained data by Pointers. The DWS summary data layer is subject aggregation modeling oriented to analysis objects. Taking the retail scene as an example, our final analysis objectives are: the total sales volume of a certain category (for example, kitchen utensils) in each province in the last day, the names of the Top10 items in the sales volume of this category, and the purchasing power distribution of users in each province.

Therefore, we can summarize the data of the latest day from the perspectives of commodities, categories and buyers that are finally successfully traded. Note the following for data aggregation:

Aggregation does not cross facts. Aggregation is a summary of the original star model. In order to obtain and query results consistent with the original model, the dimensions and measures of the aggregation must be consistent with the original model, so the aggregation is not across the facts, so atomic indicators can only be defined based on a fact table, but atomic indicators can be combined into derived atomic indicators.
Aggregation improves query performance, but it also makes ETL maintenance more difficult. When the subcategory corresponds to a first-level category change, the pre-existing data that has been aggregated into the aggregation table needs to be readjusted.

In addition, when designing the DWS layer, you need to follow the principle of data utility. Data utility requires consideration of whether the aggregation of aggregates can be made available for use by third parties. We can consider whether aggregation based on a dimension is often used in data analysis, and if so, it is necessary to aggregate detailed data into aggregation tables.

To put it simply:

The theme
Wide table
Mild summary

Part4 illustrates the DWS layer design process

Taking the e-commerce retail scenario as an example, we have completed the modeling of DWD layer through ETL based on the order table, user table, commodity table and coupon table of ODS layer, generally adopting the star model.

Here, the model is modeled in strict accordance with: Business process → declaration granularity → confirmation of dimensions → confirmation of facts. The process is as follows:

Next, it’s time to design the DWS layer. By looking at the fact table from the dimension table, we can get the broad table for each day.

In this way, the daily behavior of each subject object can be counted, the subject width table serving the ADS layer and the detailed data of some businesses can also be used to cope with some special needs, such as purchase behavior and commodity repurchase rate statistics.

With foreign keys to get the relevant metrics, we combine the detailed fact table metrics of multiple DWD to form the new table.

Here, we still want to follow the design principles mentioned above, as far as possible in the design of public, easy to use and users easy to understand.

Part5 Thinking: How to design a perfect DWS layer?

In the actual implementation of our data center, the team not only needs to build a public data layer and form a data center, but also bear the pressure of new demands.

Often we have to meet the requirements first (live), then develop the common data layer (build a better future), and iterate and optimize the model as the business needs are met. Over time, more and more of the business needs can be fulfilled with data from the DWS layer.

In this process, the degree of perfection is a good evaluation standard, mainly depends on how many query requirements can be met by the summary data of THE DWS layer. If the summary data cannot meet the requirements, the data user must use detailed data, even the original data of the ODS layer.

The higher the degree of perfection of DWS/ADS layer, the more perfect the upper construction of data, and from the user’s point of view, it is the absolute truth that the query is fast, easy to take the number and cool to use.

end ~

Cough cough, just nine o ‘clock 😂 di, off work….

Recently a 18 line was actually urged more by his small partner, crying…

The reason for the slow update is that I recently subscribed to a column about the data center practice, which is very wonderful. So cloud qi learning is also very serious, three thinking guide chart proof!

However, the article I haven’t started liver 😭, set a Flag, a weekend?

Recommended reading:

[

What is OneData architecture? Ali data in Taiwan implementation methodology interpretation

2021-06-15