This article is divided into two large sections: the first covers data warehouse construction, the second covers data governance. It is a long read, so please be patient!

Before we talk about the data warehouse, let's consider the following questions:

Why should the warehouse be layered?

  1. Trading space for time: heavy preprocessing improves the efficiency and user experience of downstream applications, so a data warehouse inevitably contains a lot of redundant data. Without layering, any change to the business rules of a source system would affect the entire data-cleaning process, a huge amount of rework.
  2. Layered management simplifies data cleaning: a single complex job is broken into several simple steps, turning one big black box into a white box. The processing logic of each layer is relatively simple and easy to understand, so it is easier to guarantee the correctness of each step, and when something goes wrong with the data, we usually only need to fix one local step.

Bill Inmon, the father of the data warehouse, defined a data warehouse as a subject-oriented, integrated, relatively stable collection of data that reflects historical change and is used to support management decisions. By definition, the key characteristics of a data warehouse are: subject-oriented, integrated, stable, reflecting historical change, and supporting management decisions; these characteristics are realized through a layered architecture.

A good layered architecture has the following benefits:

  1. Clear data structure: Each data layer has a corresponding scope, which makes it easier to locate and understand data when using it.
  2. Data lineage: the data served to the business or downstream systems is target data, usually derived from multiple source tables. When the target data is abnormal, a clear lineage makes it possible to locate the problem quickly. Lineage management is also an important part of metadata management.
  3. Reduced duplicate development: data is processed layer by layer, and each lower layer contains all the data needed to process the layer above it. This spares every data developer from re-extracting and re-processing data from the source systems.
  4. Organized data relationships: relationships between source systems are complex. Customer information, for instance, may live in the core system, the credit system, the financial system, and the capital system; which one should a query use? The data warehouse models data of the same subject in a unified way, combing complex relationships into a clear data model and avoiding this problem.
  5. Shielding the impact of raw data changes: data is processed layer by layer, each layer built only from the layer below, with no skipping allowed. Raw data sits at the bottom of the warehouse, several processing layers away from the application layer, so changes in the raw data are absorbed during processing and the application layer stays stable.

How many layers are best?

The mainstream layering schemes on the market are dazzling, but we should look past the surface to the underlying rules rather than layering for layering's sake. There is no best layering, only the most suitable one.

Layering aims to provide fast data support for the current business, abstract a common framework that can empower other business lines in the future, provide stable and accurate data support as the business develops, and, based on the existing models, point the way for new business: in short, data-driven and data-empowered.

What makes a good data warehouse?

  1. Stability: data output is stable and guaranteed.
  2. Trustworthiness: the data is clean and of high quality.
  3. Richness: the data covers a wide range of businesses.
  4. Transparency: the data processing structure is transparent enough.

Data warehouse design

Three dimensions of data warehouse design:

  • Functional architecture: clear structure hierarchy.
  • Data architecture: Data quality is guaranteed.
  • Technical architecture: easy to extend and use.

Data warehouse architecture

According to the flow of data in and out, the data warehouse architecture can be divided into three parts: source data, the data warehouse, and data applications.

The data warehouse

Data in the warehouse comes from different sources and serves various data applications. Data flows into the warehouse from the bottom up and is then opened to upper-level applications; the warehouse itself is only a platform for managing the integrated data in between.

Source data: this layer does not change the data; it directly uses the data structures and data of the peripheral systems and is not open to outside consumers. It is a staging area where interface data is held temporarily in preparation for the next processing step.

Data warehouse: also known as the detail layer. Data in the DW layer should be consistent, accurate, and clean, i.e., source-system data that has been cleansed of impurities.

Data application: the layer that front-end applications read directly; data generated on demand for reports and thematic analysis.

The data warehouse obtains data from various sources, and the transformation and flow of data inside the warehouse can be regarded as the ETL (Extract, Transform, Load) process. ETL is the pipeline of the data warehouse and can be thought of as its bloodstream, maintaining the metabolism of data inside the warehouse. Much of the daily management and maintenance of a data warehouse is about keeping ETL running normally and stably.
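The ETL process described above can be sketched in a few lines of Python. This is only an illustrative toy (the table and field names are made up), not a real warehouse pipeline:

```python
# Minimal ETL sketch in plain Python. All table and field names here are
# invented for illustration; a real pipeline would read from source systems
# and write to warehouse storage instead of in-memory lists.

def extract(source_rows):
    """Extract: pull raw records from a source system (here: a list)."""
    return list(source_rows)

def transform(rows):
    """Transform: clean the data by dropping rows missing the key and tidying names."""
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:
            continue  # incomplete record, rejected during cleaning
        cleaned.append({**row, "name": row["name"].strip().title()})
    return cleaned

def load(rows, warehouse):
    """Load: append cleaned rows into the warehouse's detail layer."""
    warehouse.setdefault("dwd_customer", []).extend(rows)
    return warehouse

source = [
    {"customer_id": 1, "name": "  alice  "},
    {"customer_id": None, "name": "bad row"},
    {"customer_id": 2, "name": "BOB"},
]
wh = load(transform(extract(source)), {})
print(wh["dwd_customer"])
```

Keeping each stage a separate function mirrors the layering idea above: when one step breaks, only that step needs fixing.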

Building a data warehouse is like creating a new life: the layered architecture is only the logical skeleton of that life. To grow flesh and blood on the skeleton, data modeling must be done properly; the quality of the modeling determines whether the warehouse turns out strong or weak, handsome or ugly.

Data warehouse modeling methods

There are many modeling methods for data warehouses; each represents a philosophical point of view, a way of summarizing and generalizing the world. The common methods are paradigm (normal-form) modeling, dimensional modeling, entity modeling, and so on. Each method essentially looks at the problems in the business from a different perspective.

1. Paradigm modeling

Paradigm (normal-form) modeling is a commonly used method in data model construction, mainly advocated by Inmon. It is essentially a technical-level approach to solving data storage in relational databases. Today, most relational-database modeling follows the third normal form.

A paradigm is a collection of relational schemas conforming to a certain level. The construction of a database must follow certain rules, which in relational databases are known as normal forms, and this process is also called normalization. There are currently six normal forms for relational databases: first normal form (1NF), second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF).

In data warehouse model design, the third normal form is generally adopted. A relation in third normal form must satisfy three conditions:

  • Each attribute value is unique and unambiguous;
  • Each non-key attribute must fully depend on the entire primary key, not just a part of it;
  • Each non-key attribute must not depend on an attribute of another relation; if it does, that attribute belongs in the other relation.
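To make the third normal form concrete, here is a minimal Python sketch that splits a relation violating 3NF (customer city transitively depends on the order key through the customer id) into two conforming relations. All table and field names are invented for illustration:

```python
# Illustrative 3NF split: in the denormalized rows, customer_city depends on
# customer_id rather than on the order key, a transitive dependency that
# violates the third normal form. All names and values are made up.

denormalized = [
    {"order_id": 10, "customer_id": 1, "customer_city": "Beijing", "amount": 50},
    {"order_id": 11, "customer_id": 1, "customer_city": "Beijing", "amount": 30},
]

# Relation 1: order attributes depend only on the order's own primary key.
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "amount": r["amount"]}
    for r in denormalized
]

# Relation 2: customer attributes live with the customer key, stored only once.
customers = {r["customer_id"]: {"city": r["customer_city"]} for r in denormalized}

print(orders)
print(customers)
```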

Modeling paradigm

According to Inmon, a data warehouse model is built in a way similar to the enterprise data model of a business system. In business systems, the enterprise data model determines where data comes from, and it is divided into two levels: the subject-area model and the logical model. The subject-area model can be viewed as the conceptual model of the business model, while the logical model is the instantiation of the subject-area model on a relational database.

2. Entity modeling

Entity modeling is not a mainstream method in data warehouse modeling; it derives from a school of philosophy. In the philosophical view, the objective world can be subdivided into entities and the relationships between entities. We can fully borrow this abstraction in data warehouse modeling: divide the whole business into entities, and the work of data modeling is then to identify the relationships between those entities and to describe those relationships.

Although the entity method may seem a little abstract, it is actually easy to understand. In other words, any business process can be divided into three parts: entity, event, and description, as shown in the figure below:

Entity modeling

The picture above expresses an abstract idea. Take a simple fact: "Xiao Ming drives to school." In this business fact, "Xiao Ming" and "school" can each be regarded as an entity; "going to school" describes a business process and can be abstracted as a concrete "event"; and "driving" can be regarded as a description of the event "going to school".

3. Dimensional modeling method

Dimensional modeling is advocated by Ralph Kimball, the other master of data warehousing, whose The Data Warehouse Toolkit is the most popular modeling classic in data warehouse engineering. Dimensional modeling builds the model starting from the analysis and decision requirements, so the resulting data model serves analysis: it focuses on letting users complete analysis requirements faster while giving good response performance for large, complex queries.

Star model

Typical examples are the familiar star schema and, for some special scenarios, the snowflake schema.

The important concepts in dimensional modeling are the fact table and the dimension table. Its simplest description: build the data warehouse or data mart around fact tables and dimension tables.
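A minimal sketch of the fact-table / dimension-table idea, using plain Python dictionaries; the date dimension, surrogate keys, and measures are all invented for illustration:

```python
# Star-schema sketch with plain dictionaries: one fact table of sales at
# atomic granularity, joined to a date dimension through a surrogate key.
# The keys, attributes, and measures are all invented for illustration.

dim_date = {  # dimension table: one row per calendar day
    20240101: {"year": 2024, "month": 1, "is_holiday": True},
    20240102: {"year": 2024, "month": 1, "is_holiday": False},
}

fact_sales = [  # fact table: one row per sale
    {"date_key": 20240101, "store_id": 7, "amount": 120.0},
    {"date_key": 20240102, "store_id": 7, "amount": 80.0},
]

# A typical analytical query: total sales amount on holidays, answered by
# filtering on a dimension attribute and summing a fact measure.
holiday_total = sum(
    f["amount"] for f in fact_sales if dim_date[f["date_key"]]["is_holiday"]
)
print(holiday_total)
```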

At present, the most commonly used modeling method in Internet companies is dimensional modeling.

How to build dimensional modeling:

In real business we are handed a pile of data; how do we use it to build the warehouse? The author of The Data Warehouse Toolkit, drawing on decades of practical business experience, summarized the following four steps.

The Data Warehouse Toolkit's four steps of dimensional modeling:

Dimensional modeling takes four steps

These four steps are interlinked. Let's break down each step in detail:

1. Select the business process

  • Dimensional modeling is close to the business, so modeling must be based on the business. Business process selection, as the name suggests, means selecting the processes we need to model out of the whole business, according to the requirements from operations, future scalability, and so on. For example, a mall's overall flow is divided into the merchant side, the user side, and the platform side. If the operational requirements are total order volume, order counts, users' purchase behavior, and so on, then we select the user-side data and leave the merchant and platform sides aside for now. Business selection is important because all subsequent steps are based on this business data.

2. Declare the granularity

  • For example, consider a user: one user has one ID number and one registered (household) address, but multiple phone numbers and multiple bank cards. So ID number and registered address are at the same granularity as the user, while phone number and bank card are granularities finer than the user; attributes in a one-to-one relationship with the user share the user's granularity. Why must granularity be the same? Because dimensional modeling requires that, within one fact table, all data share the same granularity. Different granularities must not be mixed in the same fact table, and data of different granularities should go into different fact tables. When retrieving data for a given business process, it is strongly recommended to start from atomic granularity, that is, the finest granularity, because atomic data can withstand unexpected user queries. Summary granularities, however, matter a great deal for query performance, so for data with clear requirements we build summary granularity for those requirements, and for data with unclear requirements we keep the atomic granularity.

3. Confirm dimensions

  • Dimension tables are also known as the "soul" of a data warehouse, because they serve as entry points and descriptive identifiers for business analysis. How do we identify dimension attributes in a pile of data? If a column is a description of a concrete value, is a text or a constant, or acts as a constraint or an identifier of a participant, it is usually a dimension attribute. The Toolkit tells us to firmly grasp the granularity of the fact table; then all possible dimensions can be distinguished. Also make sure there is no duplicate data in the dimension table, and keep the dimension primary key unique.

4. Confirm your facts

  • The fact table is used for measurement; measures are basically expressed as quantitative values. Each row in the fact table corresponds to one measure, and the data in each row sits at a specific level of detail, called the granularity. A core principle of dimensional modeling is that all measures in the same fact table must have the same granularity, which guarantees that metrics are not double-counted. Sometimes it is unclear whether a column is a fact attribute or a dimension attribute. Remember: the most practical facts are numeric and additive. So ask whether the column is a measure that takes many different values and participates in calculations; such columns are usually facts.
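The "numeric and additive" heuristic above can be sketched as a rough first-pass classifier. This is only an illustrative aid; the final fact-versus-dimension call still requires business judgment:

```python
# Rough first-pass classifier for the heuristic above: numeric (and usually
# additive) columns are fact candidates, while textual/constant descriptive
# columns are dimension candidates. Purely illustrative.

def classify_column(sample_values):
    if all(isinstance(v, (int, float)) for v in sample_values):
        return "fact (candidate measure)"
    return "dimension (descriptive attribute)"

print(classify_column([12.5, 30.0, 7.25]))    # numeric values: fact candidate
print(classify_column(["paid", "refunded"]))  # text values: dimension candidate
```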

Granularity is very important: it determines what one row of the fact table represents. The advice is to start the design from data at atomic granularity, because atomic data can withstand unpredictable user queries and can be rolled up in every possible way, whereas once a coarser granularity is chosen, you cannot meet users' needs to drill down into the details.

Facts are the core of dimensional modeling. Both the snowflake model and the star model expand from a fact table through foreign-key-associated dimension tables into a wide model table that can support the predictable query requirements, and the final queries also land on the fact table.

Data warehouse layering in actual business

Data warehouse layering should be combined with the company's business, and the responsibilities of each layer must be clear, to keep the data layers stable and shield downstream consumers from changes. The following layering is generally used:

Data hierarchical architecture

Data layer concrete implementation

Four diagrams illustrate the implementation of each layer:

  • Data source layer ODS

The data source layer

At the data source layer, business data is imported into the big data platform and stored as snapshots of the business data.

  • Data detail layer DW

Data detail layer

Each row in a fact table corresponds to one measure, and the data in each row sits at a specific level of detail, called the granularity. A core principle of dimensional modeling is that all measures in the same fact table must have the same granularity, which guarantees that metrics are not double-counted.

Dimension tables generally have a single primary key; a few have a compound primary key. Be careful that dimension tables contain no duplicate data, otherwise data divergence occurs when the dimension table is joined with the fact table.

Sometimes it is unclear whether a column is a fact attribute or a dimension attribute. Remember: the most practical facts are numeric and additive. If the column is a measure that takes many different values and participates in calculations, it is usually a fact; if the column is a description of a concrete value, is a text or a constant, or acts as a constraint or a row identifier, it is usually a dimension attribute. Even so, the final judgment of dimension versus fact must be made in combination with the business.

  • Light data summary layer DM

Data light summary layer

This layer is named the light summary layer because it begins to summarize data, but not completely: only data of the same granularity is associated and summarized. Related data of different granularities can also be summarized here, in which case the granularities must first be unified through operations such as aggregation.
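The light-summary idea, rolling atomic detail rows up to one agreed granularity, can be sketched as follows. The (user_id, day) granularity and field names are assumptions for illustration:

```python
# Light-summary sketch: roll atomic order rows up to one row per
# (user_id, day), the single granularity this summary table commits to.
# Table and field names are assumptions for illustration.
from collections import defaultdict

dwd_orders = [  # detail-layer rows at atomic (per-order) granularity
    {"user_id": 1, "day": "2024-01-01", "amount": 10},
    {"user_id": 1, "day": "2024-01-01", "amount": 15},
    {"user_id": 2, "day": "2024-01-01", "amount": 7},
]

summary = defaultdict(lambda: {"order_cnt": 0, "amount_sum": 0})
for row in dwd_orders:
    key = (row["user_id"], row["day"])  # everything unified to one granularity
    summary[key]["order_cnt"] += 1
    summary[key]["amount_sum"] += row["amount"]

print(dict(summary))
```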

  • Data application layer APP

Data application layer

The tables of the data application layer are provided to users, and the construction of the data warehouse is nearly complete at this point. Next, data is extracted according to different needs: directly rendering reports, supplying data analysts with the data they need, or supporting other businesses.

One diagram summarizes the overall process of building a data warehouse:

The whole process of the warehouse

Data governance

The real difficulty of data warehouse construction is not the warehouse design itself, but the data governance required as the business later develops and business lines change massively: asset management, data quality monitoring and control, and the construction of the data indicator system.

In fact, the scope of data governance is very wide, covering the management of the data itself, data security, data quality, data cost, and so on. In the DAMA Data Management Body of Knowledge guide, data governance is positioned at the hub of the data management "wheel": it is the overall driver of the ten-odd data management domains such as data architecture, data modeling, data storage, data security, data quality, metadata management, and master data management, and it provides overall guiding strategies for data management activities.

What is data governance

1. Data governance requires system construction

To give full play to the value of data, three elements must be satisfied: a reasonable platform architecture, complete governance services, and systematic operation means.

Choose the appropriate platform architecture according to the enterprise's size, industry, data volume, and other circumstances. Governance services should run through the whole data life cycle, ensuring completeness, accuracy, consistency, and validity throughout collection, processing, sharing, storage, and application. Operation means should include optimization of standards, organization, platform, processes, and so on.

2. Data governance needs to lay a solid foundation

Data governance needs to be progressive, but in the early stages of construction at least three areas need to be addressed: data specifications, data quality, and data security. Standardized model management is the prerequisite for data being governable, high-quality data is the prerequisite for data being usable, and data security control is the prerequisite for data sharing and exchange.

3. Data governance requires IT empowerment

Data governance is not a pile of normative documents. The specifications, processes, and standards produced by governance must land on the IT platform. By "beginning with the end in mind", data governance is done forward during data production, avoiding the passivity and increased operation and maintenance costs brought by after-the-fact auditing.

4. Data governance needs to focus on data

The essence of data governance is to manage data. Therefore, it is necessary to strengthen metadata management and master data management to manage data from the source and complete related attributes and information, such as metadata, quality, security, business logic, and kinship, to manage the production, processing, and use of data in a metadata-driven manner.

5. Data governance requires integration of construction and management

Consistency between the data model and task scheduling is the key to integrating construction and management. It helps solve the inconsistency between data governance and data production calibers and avoids the inefficient "two separate skins" mode of management.

Talk about data governance

As mentioned above, the scope of data governance is very wide. The most important part is data quality governance, whose scope is also very wide, running through the whole life cycle of the warehouse: data generation -> data access -> data storage -> data processing -> data output -> data presentation. Every stage needs quality governance, and the evaluation dimensions include completeness, standardization, consistency, accuracy, uniqueness, relevance, and so on.

In each stage of system construction, data quality should be tested and standardized against the standards and governed in a timely manner, to avoid after-the-fact cleanup.

Quality inspection can refer to the following dimensions:

| Dimension | Measurement standard |
| --- | --- |
| Integrity | Whether data required by the business is missing; null characters or null values are not allowed. For example: whether the data sources, dimension values, and data values are complete. |
| Timeliness | Whether the data reflects current facts when needed, i.e., the data must be timely enough to meet the system's requirements on data time. For example: the timeliness of processing (acquisition, sorting, cleaning, loading, etc.). |
| Uniqueness | Whether data values are unique within the specified dataset. |
| Referential integrity | Whether data items are defined in their parent table. |
| Dependency consistency | Whether a data item's value satisfies its dependency relationships with other data items. |
| Correctness | Whether the data content is consistent with its definition. |
| Accuracy | Whether data precision reaches the number of digits required by business rules. |
| Technical validity | Whether data items are organized according to the defined format standards. |
| Business validity | Whether data items conform to the defined business rules. |
| Credibility | Based on customer surveys or customer-initiated feedback. |
| Availability | The ratio of the time data is available to the time it needs to be accessed. |
| Accessibility | Whether the data can be read automatically. |
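Two of the dimensions above, completeness and uniqueness, can be sketched as simple checks. The field names are illustrative; a real platform would run such rules per table and per batch:

```python
# Sketch of two checks from the table above: completeness (no missing values
# in required fields) and uniqueness (no duplicate key values).
# Field names are illustrative only.

def check_completeness(rows, required):
    """Return the rows in which any required field is missing or empty."""
    return [r for r in rows if any(r.get(f) in (None, "") for f in required)]

def check_uniqueness(rows, key):
    """Return the key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": ""},   # fails the completeness check
    {"id": 1, "name": "c"},  # fails the uniqueness check on id
]
print(check_completeness(rows, ["name"]))
print(check_uniqueness(rows, "id"))
```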

The following are some concrete governance methods, summarized from Meituan's technical articles:

1. Specification governance

Specifications are the guarantee of warehouse construction. To avoid duplicate construction of indicators and uneven data quality, construction should be standardized following the most detailed, practice-first principle.

(1) Word roots

Word roots are the basis of dimension and indicator management, divided into common roots and proprietary roots, to improve the ease of use and relevance of roots.

  • Common root: the smallest unit describing one thing, e.g., transaction - trade.
  • Proprietary root: a conventional or industry-specific description, e.g., US dollar - usd.

(2) Table naming convention

General specification

  • Table names and field names are separated by an underscore (example: clientType ->client_type).
  • Use lowercase English words for each part. Those belonging to common fields must meet the definition of common field information.
  • The table name and field name must start with a letter.
  • The table name and field name contain a maximum of 64 English characters.
  • Use keywords already in the root library first (root management is part of the warehouse's standard configuration), and regularly review the rationality of newly added names.
  • Do not use non-standard abbreviations in the table name customization section.

Table Naming rules

  • Table name = type + business topic + subtopic + table meaning + storage format + update frequency + suffix, as shown in the figure below:

Uniform table naming convention
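Since the original figure is not shown, here is a hedged sketch of what a table-name builder following the "type + topic + subtopic + meaning + frequency" rule might look like. The exact part order, separators, and frequency codes are assumptions:

```python
# Hypothetical table-name builder following the rule above. The part order
# and codes (e.g. "d" for a daily update frequency) are assumptions; the
# storage-format and suffix parts are omitted for brevity.

def build_table_name(layer, topic, subtopic, meaning, freq):
    parts = [layer, topic, subtopic, meaning, freq]
    name = "_".join(p.lower() for p in parts if p)  # underscore-separated, lowercase
    # Enforce two of the general rules: start with a letter, at most 64 chars.
    assert name[0].isalpha() and len(name) <= 64, "violates naming rules"
    return name

print(build_table_name("dwd", "trade", "order", "detail", "d"))
```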

(3) Naming conventions of indicators

Combined with the characteristics of indicators and the root management specification, indicators are structured as follows.

  1. Base indicator root: every indicator must contain a base root.
  2. Business modifier: a word describing the business scenario, e.g., transaction - trade.
  3. Date modifier: describes the time period in which the business occurs.
  4. Aggregate modifier: the aggregation applied to the results.
  5. Base indicator: a single business modifier + base root builds a base indicator, e.g., trade amount - trade_amt.
  6. Derived indicator: multiple modifiers + base root build a derived indicator; derived indicators inherit the characteristics of their base indicator, e.g., number of installed stores - install_poi_cnt.
  7. Common indicator naming convention: consistent with the field naming convention; indicator names can be converted from terms.

2. Architecture governance

(1) Data stratification

An excellent and reliable data warehouse system usually needs a clear data-layering structure, to keep the data layers stable, shield the impact on downstream consumers, and avoid excessively long links. The general layering architecture is as follows:

(2) Data flow

Stable business is developed along the standard data flow, i.e., ODS -> DWD -> DWA -> APP. For unstable business or exploratory needs, ODS -> DWD -> APP or ODS -> DWD -> DWT -> APP may be followed. After ensuring the rationality of the data link, the model's layer-reference principles are confirmed on that basis:

  • Normal flow: ODS -> DWD -> DWT -> DWA -> APP. When a flow of ODS -> DWD -> DWA -> APP appears, it indicates that the subject area is not fully covered and the DWD data should fall into DWT; DWD -> DWA is allowed only for very infrequently used tables.
  • Try to avoid DWA wide tables that use both DWD tables and the DWT tables those DWD tables belong to.
  • In principle, avoid generating DWT tables from other DWT tables in the same subject area, otherwise ETL efficiency is affected.
  • DWT, DWA, and APP must not use ODS tables directly; ODS tables may only be referenced by DWD.
  • Reverse dependencies, such as DWT tables depending on DWA tables, are prohibited.
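The layer-reference principles above can be encoded as a simple lineage check. The allowed-source map below is one simplified reading of the bullet points, not an official specification:

```python
# Lineage check for the reference principles above: each layer may only read
# from the source layers listed for it. This allowed-source map is a
# simplified reading of the rules, for illustration only.

ALLOWED_SOURCES = {
    "DWD": {"ODS"},                  # only DWD may reference ODS
    "DWT": {"DWD"},
    "DWA": {"DWD", "DWT"},
    "APP": {"DWD", "DWT", "DWA"},
}

def violations(edges):
    """edges: (source_layer, target_layer) pairs; return the illegal ones."""
    return [(s, t) for s, t in edges if s not in ALLOWED_SOURCES.get(t, set())]

edges = [
    ("ODS", "DWD"),  # legal: standard flow
    ("ODS", "APP"),  # illegal: APP must not read ODS directly
    ("DWA", "DWT"),  # illegal: reverse dependency
]
print(violations(edges))
```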
3. Metadata governance

Metadata can be divided into technical metadata and business metadata:

Technical metadata serves the IT staff who develop and manage the data warehouse. It describes the data related to developing, managing, and maintaining the warehouse, including data source information, data transformation descriptions, the data warehouse model, data cleaning and update rules, data mappings, and access permissions.

Common technical metadata are:

  • Storage metadata: information about tables, fields, and partitions.
  • Running metadata: information about all jobs running on the big data platform, similar to Hive job logs, including job type, instance name, inputs and outputs, SQL, running parameters, execution time, and execution engine.
  • Metadata about data synchronization, computing tasks, and task scheduling on the development platform: data-synchronization metadata includes the input and output tables and fields as well as the synchronization task's own node information; computing-task metadata mainly includes the inputs, outputs, and the task's own node information; task-scheduling metadata mainly includes the task's dependency type and dependency relationships, as well as the run logs of the different types of scheduled tasks.
  • Data quality and O&M metadata: task monitoring, O&M alarms, data quality, and fault information, including task-monitoring run logs, alarm configurations and run logs, and fault information.

Business metadata serves management and business analysts, describing data from a business perspective, including business terms, what data the warehouse holds, where data is located, and data availability, helping business staff better understand what data is available in the data warehouse and how to use it.

  • Common business metadata includes: dimensions and attributes (dimension coding, field type, creator, creation time, status, etc.), business processes, indicators (name, indicator code, business caliber, indicator type, owner, creation time, status, SQL, etc.), security level, computation logic, and so on. Standardizing their definitions enables better management and use of the data. There is also data-application metadata, such as the configuration and operation metadata of data reports and data products.

Metadata not only defines the data schema, source, extraction and transformation rules in the data warehouse, but also is the basis of the whole data warehouse system operation. Metadata connects various loose components in the data warehouse system and forms an organic whole.

Metadata governance addresses three issues:

  1. By establishing corresponding organizations, processes, and tools, promote the implementation of business standards, achieve standard definitions of indicators, and eliminate ambiguity in how indicators are understood;
  2. Based on the business's current state and future evolution, abstract the business model, formulate clear subjects, business processes, and analysis directions, build complete technical metadata that accurately and thoroughly describes the physical model, and connect technical metadata with business metadata for a complete description of the physical model;
  3. Through metadata construction, improve the efficiency of using data and solve the problems of "finding data, understanding data, evaluating data" and "retrieving data and visualizing data".
4. Security governance

Around data security standards: first, there must be data classification and grading standards, ensuring that data has an accurate confidentiality level before going online. Second, there must be clear role-authorization standards for data users; through classification and role authorization, important data cannot be taken away. Third, there must be privacy management standards for sensitive data to ensure it is stored securely; even if unauthorized users bypass permission management and obtain sensitive data, they must not be able to understand it. Fourth, audit standards must be formulated to provide a basis for subsequent audits, ensuring that data is not taken out.

5. Data lifecycle governance

Everything has a life cycle, and data is no exception. From production and processing through use to retirement, data should have scientific management methods: move data that is rarely or no longer used out of the system and keep it on verified storage devices. This not only improves system efficiency and serves customers better, but also greatly reduces the storage costs of keeping data long-term. The life cycle generally contains three stages: online, archived (sometimes further divided into online archive and offline archive), and destroyed. Management content includes establishing reasonable data classifications and, for each class of data, setting the retention time, storage medium, cleansing rules and methods, and points of attention for each stage.

As the relationships between the parameters of the data life cycle in the figure above show, data life cycle management can greatly improve the query efficiency of high-value data and reduce purchases of expensive storage media. As usage declines, data is gradually archived and query times slowly grow longer. Finally, when the data's access frequency and value are essentially gone, it can be gradually destroyed.
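The three life-cycle stages can be sketched as a tiering rule based on time since last access; the 90-day and 365-day thresholds are purely illustrative assumptions:

```python
# Tiering sketch for the three life-cycle stages: route a dataset to
# online / archive / destroy by days since last access. The 90-day and
# 365-day thresholds are purely illustrative assumptions.

def lifecycle_stage(days_since_access, online_days=90, archive_days=365):
    if days_since_access <= online_days:
        return "online"       # high-value data stays on fast storage
    if days_since_access <= archive_days:
        return "archive"      # cold data moves to cheap storage, slower queries
    return "destroy"          # value essentially gone; schedule destruction

print([lifecycle_stage(d) for d in (10, 200, 400)])
```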


Reference links:

mp.weixin.qq.com/s/h6HnkROzl…

zhuanlan.zhihu.com/p/137454121

www.infoq.cn/article/KJz…

blog.csdn.net/MeituanTech…

baijiahao.baidu.com/s?id=169953…

tech.meituan.com/2020/03/12/…