Stratified construction theory

Simple point, direct ODS+DM can be, all data synchronization, and then directly develop some application layer reports, this is the simplest; When DM layer content is too much, if you want to reuse it, you will split a common layer into a three-layer architecture. This process is similar to code refactoring, which is to continuously abstract and summarize in practice

Data warehouse modeling or layering, in fact, are for better organization, management, maintenance of data, so when you stand in a higher dimension, all the division is for better management. The division of JVM memory regions, the division of heap space in the JVM (young generation, old generation, method region, etc.), and the division of provinces and cities in the country are all for better organization and management

Therefore, data warehouse layering is a very important link in data warehouse design. Excellent layering design can make the whole data system easier to understand and use

This section, we mainly proceed from the overall analysis and introduction, just like in the previous section number CangJianMo methodology, comparative analysis of the progress in more detail things behind us will split out alone, with a case of demonstration, dimensional modeling, for example, the design of the dimension table, the design of fact table, and how to design the label, how to manage the label and so on

Meaning of stratification

Clear data structure architecture

Each data layer has its own scope, which makes it easier to locate and understand when using tables.

Data lineage tracing

In the end, the business is presented with a business table that can be used directly, but there are many sources of data in the table. If a source table has problems, we hope to quickly and accurately locate the problem and know its impact range, so as to timely give feedback to the business side, so as to minimize losses.

Reduce duplication of development and waste of resources

Standardizing data layering and developing some common middle tier data can reduce significant double calculations

The clear structure reduces development and maintenance costs

Reduce duplication and waste of storage resources

Simplification of complex problems

Breaking a complex task down into multiple steps, with each layer dealing with a single step, is relatively simple and easy to understand. In addition, it is convenient to maintain the accuracy of data. When there is a problem with data, it is not necessary to repair all the data, but only need to repair from the steps with problems.

In the actual construction process, due to the use of business data is very urgent and unified warehouse layer construction cannot keep up with business needs, so DIM and ADS may directly use the ODS layer business for quick response, but this is not standard operation may lead to inconsistent data caliber, so stay for warehouse construction, to switch to unity for warehouse and DIM

Uniform data caliber

The data layer provides a unified data outlet and a unified external output data caliber, which is often referred to as the data application layer

A little thought about layering

Previously, we talked about stratification in order to better, faster and more accurate organization and management, but this is from the macro point of view. Next, we also look at stratification from the micro point of view

The higher the level is, the more friendly it is to the application. For example, the ADS layer is basically designed for the application. In terms of data aggregation degree, the higher the level of aggregation is, the lower the understandable degree is

Several warehouse internal layer of division is not for layered and layered, stratified is to solve the ETL tasks and workflow of the organization, the flow of data, read and write access control, meet the different needs of the various problems, of course, we often say that the stratification is also facing the industry, also is our common hierarchical methods, but you need to pay attention to is the layered only means

The layering of a warehouse

Ods operates on the data layer

ODS stands for OperationalDataStore. The operation data layer stores data oriented to the business system and is the layer closest to the data in the data source. The data in the data source is extracted, cleaned and transmitted, which is also known as ETL, and loaded into this layer.

In fact, ETL is a little inappropriate here, but ELT is more accurate, you can taste it carefully

This layer data, on the whole is mostly according to the source of the business system classification and classification, we talk about why in several positions in front of the main use of dimensional modeling, we still want to learning paradigm model, because our data source is paradigm modeling, so learning paradigm modeling can help us better understand the business system, understand the business data, So you can think of our ODS layer as being modeled using real paradigms

However, this level of data is not the same as raw data. In the source data into this layer, such as denoising (for example, there is a data middleman’s age is 300 years old, this belongs to the abnormal data, you need to do some processing in advance) and heavy (for example in the personal data in the table, the same ID has two duplicate data, at the time of access needs to be done to heavy), fields, naming conventions, and a series of operations The data processing, Is not involved in the business logic, is only for data integrity and the processing of duplicate values and null values, is actually do is data code, data cleaning, but in order to consider the follow-up may trace data problems, so on this layer is not recommended to do too much data cleaning, intact, access to the source data as to the denoising of data, to heavy, Procedures such as outlier handling can be placed later in the DW layer

In fact, a lot of people have a different understanding of this layer, that is, whether or not to do data cleaning, actually depends on the habits of the company, in fact, a lot of companies will also form a layer before this layer, with all kinds of names, but its purpose is to buffer data, and then to clean, The data after cleaning is stored in ODS, and at this time the buffer layer data storage is generally about a week, almost no more than a month; ODS is permanently stored.

Design principles

Table name design ODS_ business system _ table name _ tag, such a design can keep consistent with the business table name, but also can have a clear hierarchy, but also can distinguish the source. Tags generally refer to other store-specific attributes, such as whether the table is day or hour, full or incremental

  • The ODS layer does not perform field name normalization and field type unification, and uses compatible data types if necessary
  • For delta tables, you need to design the delta tables (ODS_ Business system _ table name _delta) and the full table, and then merge the delta tables with the full table data
  • Design parsing is required for semi-structured data
  • Because the business database (OLTP) is basically modeled as a dimensional model, the modeling approach in the ODS layer is also a dimensional model

Ods is designed to ensure that all data is stored according to uniform specifications.

DW unified number storehouse layer

DW is the core of the data warehouse, and the data obtained from the ODS layer builds various data models according to the topic. DW subdivides the data detail layer DWD and the mild summary layer DWS

This layer and dimensional modeling can have a more deep contact, business data is according to the perspective of business process and convenient operation, to organize the data and layer number of unified storehouse is according to the business point of view of its easy to understand, or is the Angle of the business analysis, data organization defines a consistent indicator, dimensions, all business sectors, data fields are in accordance with the unified specification to construction, The result is a unified and normative standard business data architecture, which is usually built based on Kimball’s dimensional modeling theory, with consistency dimensions and data buses to ensure dimensional consistency across subtopics.

If ods layer data is very neat, basic can meet the needs of most of us, of course it is good, then DWD layers are simple, but the reality of contact layer is ods data is difficult to ensure the quality, variety after all of the data source, push party will also have their own push logic, in this case, We need an extra layer of DWD to mask some of the underlying differences. Doesn’t look much like a JVM

Design principles

Conformance dimension specification

In the dimension table of the common layer, the field names, data types and data contents of the same dimension attributes in different physical tables must be the same, because this can reduce the probability of making mistakes in the process of use, such as using incorrect fields, or causing some strange errors due to the data type

Combination and splitting of dimensions

Implement the business-relevant fields described by the dimensions in a physical dimension table. Strong correlation refers to frequent queries or report presentations, and whether there is a natural relationship between the attributes of the two dimensions. For example, basic attributes and brand.

DWD detail data layer

Announcement detail data layer, can be said to be the core of our warehouse construction

All the DWD layer has to do is clean up, consolidate, normalize, dirty data, garbage data, data with inconsistent specifications, inconsistent state definitions, and misnamed data will be processed. And then processed into a number of storehouse based list, this time can be processed for some analysis of the wide table

The DWD layer should be a complete, clean, consistent data layer covering all systems. In the DWD layer, the fact table and dimension table are designed according to the dimension model. That is to say, the DWD layer is a very standard, high quality and trusted data detail layer.

DWS mild summary layer

The DWS layer is the common summary layer, which carries out mild summary with a granularity slightly coarser than detailed data. Based on the basic data on the DWD layer, it integrates and summarizes the service data for the analysis of a certain topic domain. Generally, it is also for the analysis of the wide table or the summary table for a certain attention. The DWS layer should cover 80% of the application scenarios so that we can respond quickly to data requirements, otherwise, if a lot of requirements have to be done from ODS, then our warehouse construction is not complete

For example, by business division, such as traffic, orders, users, etc., a wide table with more fields is generated for subsequent business query, OLAP analysis, data analysis, etc.

Dimension model method is generally adopted as the theoretical basis, and some dimension degradation methods are more often used to degrade dimensions to the fact table, so as to reduce the association between dimension table and fact table and improve the ease of use of detailed data table. At the same time, the dimension degradation of indicators should be strengthened in the summary data layer, and more broad tabulation methods should be adopted to construct the public indicator data layer, so as to improve the reusability of public indicators and reduce repeated processing

DIM dimension layer

In fact, the dimension layer is composed of a large number of dimension tables. In order to uniformly manage these dimension tables, we build the dimension layer. There are also many types of dimension tables themselves, such as stable dimension table and gradual dimension table

Dimensions refer to the perspective from which things are viewed, provide descriptive attributes of what filters and categorizes a business process event, “who, when, where, why, how” what is done, and dimensions represent the basis and soul of dimensional modeling

For example, “Xiao Wang bought baozi for 5 yuan at the grocery store this morning”, time dimension — morning, place dimension — grocery store, commodity dimension — baozi how about the fact table?

So you can see that the dimension table contains the context and context of the business process measures recorded by the business process. Dimension tables contain a single primary key column. The core of dimension table design is to determine dimension fields, which are the basic sources of query constraints (WHERE), grouping conditions (group), sorting (order), and report labels.

Dimension table is generally a single primary key. In ER model, entities are objectively existing transactions with their own descriptive attributes, which are generally textual and descriptive, and these descriptions are called dimensions. The core of dimension modeling is that data can be abstracted into facts and dimensions, dimension is the Angle of observing things, a measure word in a certain granularity of facts, and dimension must be for entities

Each dimension table contains a single primary key column. The primary key of a dimension table can act as a foreign key for any fact table associated with it, and of course the description environment of a dimension table row should correspond exactly to the fact table row. Dimension tables are generally wide, flat, non-standard tables that contain a large number of low-granularity text attributes. For example, customer, goods, and D_time are dimension tables that have a unique primary key and store detailed data in the table.

Design principles

Dimension tables are generally wide and flat standard tables with multiple attributes. In practice, dimensions with dozens or hundreds of attributes are not uncommon. Therefore, dimension tables should include meaningful descriptions for downstream use

The dimension attributes of the dimension table should be as rich as possible, so there are often some anti-paradigm designs in the dimension table, which combine other dimension attributes into the main dimension attributes to achieve the effect of easy to use and less association.

The design of dimension table includes the process of dimension selection, determination of main dimension table, sorting out associated dimensions and defining dimension attributes

Dimension is usually selected from the reporting needs and found in business people talk, it is mainly used for filtering, grouping, sorting, main dimension table general synchronization directly from business library, such as user table, but the number of positions itself will also have its own dimension, this is because the number of warehouse is an analysis, so there will be a lot of the dimension from the perspective of analysis

Relational dimension mainly refers to the correlation between tables of different business systems or the same business system (paradigm modeling). According to the sorting of business tables, it determines which tables are associated with the main dimension tables and selects some of them for generating dimension attributes.

TDM tag data layer

With the popularity of the Internet, the cost of acquiring customers is getting higher and higher, which also makes the company put forward higher requirements for user operation, requiring not only refinement but also personalization. One way to solve this problem is to establish a relatively complete label system, and the label layer of data warehouse plays an important role in the label system, just as the data warehouse plays an important role in the data system. Such a label system needs to be closely integrated with the business and obtain nutrients from the business – user label. But also to serve the business — to provide users with more accurate and personalized services.

** The bottom of the label system is like an index, layer upon layer to show the world, and the user from the world constantly choose some things to show their identity and preferences, but also constantly feed, make the world more colorful. ** In the end, the user is just a collection of tags.

It integrates the data of specific objects across business plates and data domains. Through unified ID-mapping, the data of the same object in each business plate and business process is connected to form a global data label system for objects, which is convenient for in-depth analysis, mining and application. Id-mapping is used to associate and identify the same objects in different data systems by object identifiers. The object identifier identifies an object. Generally, it is the object ID, such as the mobile phone number, ID card, and login account

A natural person has his ID number for unique identification, but he uses a real medical insurance account when he pays water and electricity, a different account when he uses a mobile phone, and an online business account when he accesses the Internet. After the object is confirmed, the object identification of the same object is different in different business systems. Therefore, different ID identification of the same object needs to be connected so that all business data can be connected on the object. This is the ID – Mapping

Complete object ID to give objects a need to set a super ID, need according to the object ID of the current business system and access to get or to calculate the super ID, and then to complete all business identity ID through generally ID get through is the premise of building label system, if there is no ID will not be able to get through to collect the comprehensive information of an object, There is no way to fully label this object.

The traditional calculation method requires a pial relationship between ID-ID. For example, if email and mobile phone number can be connected, and mobile phone number and ID number can be connected, then email and ID number can be connected. However, when the amount of data is very large and there are many business segments, such as the last object, Each object has dozens of ids, which can be a very long calculation to get through

So what is a label? Using raw data, through certain logic processing output can be directly used by businesses, readable, valuable data. Label category, is the tag classification organization, is a structured description tag information, the purpose is to management, search, generally adopts multilevel categories, generally when an object tag number more than 50, business people find the label will become very trouble, this time we tend to label category for organization and management

Classification of labels

Tags can be divided into attribute tags, statistical tags, algorithm tags and associated tags according to the different generation and calculation methods

attribute

The nature of the object itself is the attribute tag, such as the tag applied to the user when painting the user

Statistical label

Object atomic indicators generated in the business process, through different calculation methods can generate statistical labels

Algorithm to label

The label produced by the characteristic rule of the object in multiple business processes through a certain algorithm

Associated label

Objects can be associated with other objects during a particular business process, and the label of the associated object can also be attached to the main object

Design principles

Our labels must be in view of the user, rather than on some false and tall and useless, must reflect the user behavior preferences, so we can’t only depend on the analysis of the artificial intelligence algorithm, to complete the establishment and regular maintenance of a user label, we need to go out and user interaction, guide users to use, to seize the user pain points, Get user feedback in time, form closed loop

How do you guide them? There are many ways of doing this that we won’t cover here, but we will cover the construction details of this layer later

ADS layer

ApplicationDataService Customization-oriented application data, mainly provided for data products and data analysis, is generally stored in ES, MYSQL, Redis and other systems for online system use, can also be stored in Hive for data analysis and data mining. Or use other big data tools for storage and use

The data warehouse layer, DIM layer and TDM layer are relatively stable, so they cannot meet the flexible and changeable business needs. Therefore, this is inconsistent with the specification and division of data warehouse layer. On this basis, we establish another layer, namely ADS layer, to solve the contradiction between the stability and flexibility of planning. In fact, you will gradually understand that there is not much difference between hierarchical and classification, in fact, they are similar together, a bit of code refactoring ah

Data application layer, according to the needs of business, and then from the unified data warehouse layer and DIM, and for the special needs of business data processing to meet the needs of business and performance. The ADS layer is oriented to a large number of requirements, so there are not too many specifications for this layer. It only needs to follow the naming specification

Design principles

As mentioned above, the ADS layer is faced with a large number of demands, so there are not too many specifications for this layer. However, the construction of ADS layer is strongly driven by business, and business departments need to participate in the construction of ADS. At least, we need to understand the pain points of users before we can apply appropriate medicine

The implementation process

Clarify the requirements and understand the business side’s requirements on data content, usage mode (how to interact, report, interface, AD hoc query, online query, indicator query, search) and performance

Check whether the existing data warehouse table can be supported, and see if there are similar requirements in the past, and if there are reusable interfaces and statements

Code implementation, select the appropriate storage engine and query engine, configure online monitoring and delivery

Usage scenarios and performance
  • According to the usage scenarios of business side, we need to design an efficient ADS layer table that meets the requirements
  • For multidimensional analysis, in order to reduce connections and improve performance, we generally adopt large-width table design and use high-performance engine support
  • If it is a specific index query, it is generally organized in the form of KV
  • In a search scenario, search engines are generally used

DM data mart layer

It is used to provide data products and data analysis. It is usually stored in ES, Mysql, hive or Druid for data analysis and data mining. The data mart represents the subject domain of the data warehouse

A DM is built for a single topic, so it doesn’t build from a global perspective and only focuses on its own data, often a line of business, such as traffic topics, social topics, e-commerce topics, and so on