@TOC

The data flow

The sample application

What is warehouse DW

Data Warehouse (can be abbreviated as DW or DWH) Data warehouse, is in the database has a large number of cases, it is a complete theoretical system including ETL, scheduling, modeling.

The purpose of data warehouse project construction is to serve as the foundation for front-end query and analysis, which is mainly applied to ON-LINE Analytical Processing (OLAP), supporting complex analysis operations, focusing on decision support, and providing intuitive and understandable query results. AWS Redshift, Greenplum, Hive, etc.

The data warehouse is not the final destination of the data, but the preparation for the final destination of the data, which includes: cleaning, escaping, sorting, reorganization, merging, splitting, statistics, etc

The main features

  • Subject oriented *[Appendix]*
    • While an operational database organizes transaction-oriented tasks, data in a data warehouse is organized according to a subject domain.
    • A topic refers to the key aspects that users care about when making decisions with a data warehouse. A topic is related to multiple operational information systems.
  • integration
    • The source data need to be processed and fused, unified and integrated
    • Inconsistencies in source data must be eliminated in the process of processing to ensure that the information in the data warehouse is consistent with global information about the whole enterprise. (Association relation)
  • Do not modify the
    • The data in the DW is not up to date, but comes from other sources
    • Data warehouse mainly provides data for decision analysis, and the operations involved are mainly data query
  • Time dependent
    • The data in the decision needs data warehouse needs to be labeled with time attributes

Comparison with the database

  • DW: Designed specifically for data analysis, it involves reading large amounts of data to understand relationships and trends between the data
  • Database: Used to capture and store data
features The data warehouse Transaction database
Suitable workload Analytics, reporting, big data Transaction processing
The data source Collect and standardize data from multiple sources Data captured from a single source, such as a transactional system
Data capture Batch writes are performed by following a predetermined batch schedule It is optimized for continuous write operations because the new data maximizes transaction throughput
Data standardization Non-standardized schemas, such as star schema or Snowflake schema Highly standardized static schema
Data is stored Optimized with column storage for easy access and high-speed query performance Optimized for high-throughput writes in single-row physical blocks
The data access Optimized to minimize I/O and maximize data throughput Lots of small read operations

Why stratification

Issues involved in data warehousing:

  1. Why do data warehousing?
  2. Why do data quality management?
  3. Why do metadata management?
  4. What is the role of each layer in the storehouse hierarchy?
  5. … .

In practice, we all want our data to flow sequentially, so that designers and users can clearly see the whole cycle of data declaration, as shown in the left figure below.

However, in reality, the data situation we are faced with is likely to be highly complex and hierarchical, and we may make a data system with chaotic table dependence structure and cyclic dependence, such as the figure on the right below.

In order to solve the problems we may face, we need a set of effective data organization, management and processing methods to make our data system more orderly, which is data stratification. Benefits of data layering:

  • Clear data structure: Each data layer has its own role and responsibility, easier to use and maintain and understand
  • Complex problem simplification: Break down a complex task into multiple steps, each layer solving only a specific problem
  • Unified data caliber: Provides unified data outlet and unified output caliber through data stratification
  • Reduce duplication: Formalizing data layering and developing a common middle tier can greatly reduce duplication

Data hierarchy

Each company’s business can be layered according to its own business needs; At present, there are mature data layers: ODS of data operation layer, DW of data warehouse layer and ADS(APP) of data service layer.

Data operation layer ODS

Data Operation layer: Operation Data Store Data preparation area, also known as paste source layer. The data in the data source enters this layer after extraction, cleaning and transmission, namely ETL process. The main functions of this layer:

  • The ODS is the staging area for the back data warehouse layer
  • Provide raw data for the DWD layer
  • Reduce the impact on business systems

In the source data into this layer, such as denoising (for example, there is a data middleman’s age is 300 years old, this belongs to the abnormal data, you need to do some processing in advance) and heavy (for example in the personal data in the table, the same ID has two duplicate data, at the time of access needs to be done to heavy), fields, naming conventions, and a series of operations.

However, in order to consider the subsequent need to trace the data, it is not recommended to do too much data cleaning for this layer. It is also possible to access the original data according to the requirements of the specific business layer.

The data in this layer is the source of processing data in the subsequent data warehouse. Data sources:

  • Business library
    • Sqoop is often used for extraction, such as at regular times every day.
    • In real time, canal can be used to monitor the binlog of mysql for real time access.
  • Burial site log
    • Logs are saved as files. You can use Flume to synchronize logs periodically
    • Spark Streaming or Flink can be used for real-time access
    • Kafka is also OK
  • Message queues: data from ActiveMQ, Kafka, etc.

Data warehouse layer

Data warehouse layer can be divided into three layers from top to bottom: data detail layer DWD, data intermediate layer DWM, data service layer DWS.

Data detail layer DWD

Data Warehouse Details, DWD(Data cleaning /DWI)

This layer is the isolation layer between the business layer and the data warehouse, and maintains the same data granularity as the ODS layer. It is mainly for ODS data layer to do some data cleaning and standardized operations, such as removing empty data, dirty data, outliers and so on.

In order to improve the ease of use of the data detail layer, this layer usually adopts some dimension degradation methods to degrade the dimensions to the fact table and reduce the association between the fact table and the dimension table.

Data middle layer DWM

Data Middle layer: Data Warehouse Middle, DWM

Based on the data of the DWD layer, this layer performs some slight aggregation operations on the data to generate some intermediate result tables, so as to improve the reusability of common indicators and reduce the work of repeated processing.

In short, aggregate the common core dimensions and calculate the corresponding statistical indicators

Data services layer DWS

Data Service Layer: Data Warehouse Service, DWS(Wide table – User behavior, light aggregation)

This layer is a data service layer based on the basic data on DWM, which is integrated and summarized to analyze a topic domain. Generally, it is a wide table, which is used to provide follow-up business queries, OLAP analysis, data distribution, etc.

In general, this layer will have relatively few data tables; A table covers a large amount of business content. Because it has many fields, it is also called a wide table.

  • User behavior, mild aggregation on DWD
  • Mainly do some light summary of ODS/DWD layer data.

Data application layer ADS

Data Application layer: Application Data Service, ADS(APP/DAL/DF)- Generate report results

This layer mainly provides data for data products and data analysis, which is generally stored in ES, Redis, PostgreSql and other systems for online system use. It can also be stored in Hive or Druid for data analysis and data mining. For example, common data reports are stored here.

A Fact Table

Fact tables are tables that store factual records, such as system logs, sales records, and so on. The record of the fact table is constantly growing, such as the goods order table of e-commerce, is similar, so the size of the fact table is usually much larger than other tables.

Surface Dimension (DIM)

A Dimension Table or Lookup Table, sometimes called a Lookup Table, is the counterpart of a fact Table. It stores the attribute values of dimensions and can be associated with the fact table. It is equivalent to extracting and standardizing the frequently recurring attributes in the fact table and managing them in a table. The dimension table consists of two main parts:

  • High cardinality dimension data: it is usually a user table, commodity table similar data, data volume may be tens of millions or hundreds of millions of levels
  • Low cardinality dimension data: generally configuration tables, such as the Chinese meaning of enumeration fields, or date dimension tables, etc. The amount of data may be in the single digit or tens of thousands.

Temporary table TMP

There are many temporary tables in each layer of computation, so we have a DWTMP layer to store the temporary tables in our data warehouse

The data mart

Narrow ADS layer; In a broad sense, it refers to data that Hadoop synchronizes from DWD DWS ADS to RDS

Data Mart, also known as Data market, is to meet the needs of specific departments or users. Data Mart is stored in a multidimensional way, including defining dimensions, indicators to be calculated, dimension levels, etc., to generate Data cubes oriented to decision analysis needs.

In scope, data is extracted from enterprise-wide databases, data warehouses, or more specialized data warehouses. The point of the data center is that it caters to the specific needs of the professional user community in terms of analytics, content, presentation, and ease of use. Users of data centers expect data to be represented in terms they are familiar with.

A data warehouse structure with a data mart

Differential data warehouse

A data mart is a subset of an enterprise-level data warehouse that is oriented primarily to departmental business and only to a specific topic. To resolve the conflict between flexibility and performance, a data mart is a small department or workgroup level data warehouse added to the data warehouse architecture. Data marts store pre-calculated data for specific users to meet their performance requirements. Data mart can alleviate the bottleneck of accessing data warehouse to some extent.

In theory, there should be a total data warehouse concept before there is a data mart. When actually building data marts, this is rarely done in China. In China, we usually start with data mart first, and then build data warehouse on a specific topic (such as enterprise customer information). The order in which data warehouse and data mart are established is closely related to the design method. As an engineering discipline, data warehouse is not right or wrong.

In terms of data structure, a data warehouse is a subject-oriented, integrated collection of data. Data mart is usually defined as star structure or snowflake type data structure. Data mart is generally composed of a fact table and several dimension tables.

The problem summary

Is ODS different from DWD?

** Q: ** still don’t understand the difference between ODS and DWD layer, DWD is useless after oDS layer.

** A: ** Well, from an ideal point of view, if the ODS layer data is very neat, basically meet most of our needs, of course it is good, then the DWD layer is not necessary. However, in reality, it is difficult to ensure the quality of data at the ODS layer. After all, data sources are diverse, and the push side also has its own push logic. In this case, we need to shield some underlying differences through an additional layer of DWD.

DWD is mainly used to clean and normalize oDS layer data, while DWS is mainly used to lightly summarize ODS layer data.

** answer: ** is right, it can be roughly understood like this.

What does the APP layer do?

Q&a 3: What is the App layer for?

** Q: ** feel there is no room for DWS layer, should DWS table of each business be in DWD or app?

** Answer: ** This question is not easy to answer, I think the main thing is to clarify what the DWS layer is. If your DWS layer is to put some wide tables that can be used by business parties, put them in the APP layer. If the data mart you say is a relatively general concept, then in fact DWS, DWD, APP together are the content of the data mart.

** Q: is the data stored in Redis and ES the app layer?

** ANSWER: ** is, my personal understanding, app layer mainly store some relatively mature tables, can be used by the business side. These tables can be in Hive or imported from Hive to systems with better query performance, such as Redis or ES.

The appendix

ETL

ETL: extract-transform-load, which describes the process of extracting, converting, and loading data from the source end to the destination end.

Wide table

  • Description: Indicates a database table with a large number of fields. Usually refers to a database table associated with metrics, latitude, and attributes related to a business principal.
  • Features:
    • Because of the different contents in the same table, the wide table does not conform to the model design specification of three paradigms:
      • Cons: Lots of redundancy in data
      • Benefits: Improved query performance and convenience
    • The design of wide table is widely used in data preparation before the training of data mining model. By putting related fields in the same table, it can greatly provide the message problem of iterative calculation during the training of data mining model.

Subject

It is an abstract concept that synthesizes, classifies and analyzes the data in enterprise information system at a higher level. Each topic corresponds to a macroscopic analysis field. In a logical sense, it is the analysis object corresponding to a macro analysis field in the enterprise. For example, “Sales Analytics” is an analytics domain, so the topic of this data warehouse application is “Sales Analytics.”

Reference: www.cnblogs.com/amyzhu

www.jianshu.com/p/1dd894e5b…

Blog.csdn.net/pmdream/art…

Baidu encyclopedia

This article is published by OpenWrite!