Data warehouse: the data warehouse ingests data from the source systems in full; the ETL process standardizes, verifies, and cleans that data and finally loads it into the data mart, through which the system supports data query and analysis. The whole data warehouse consists of four layers.
1. Four operations of the data warehouse
ETL (Extraction, Transformation, Loading) is responsible for extracting data from scattered, heterogeneous data sources into a temporary intermediate layer, where it is cleaned, transformed, and integrated, and finally loading it into a data warehouse or data mart. ETL is the core and soul of a data warehouse implementation; the design and implementation of ETL rules account for roughly 60% to 80% of the workload of the entire data warehouse construction.
1. Data Extraction
This includes the initial data load and data refresh. The initial load is mainly concerned with how to create the dimension tables and fact tables and put the corresponding data into them. Data refresh focuses on how to append and update the corresponding data in the warehouse when the source data changes (for example, by creating periodic tasks, or by refreshing the data via triggers).
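As a minimal sketch of the refresh case (all table and column names here are hypothetical, and the source rows are assumed to carry an update_time column), a periodic HiveQL task could append only the rows changed since the last run:

    -- Append rows changed since the last load into the day's partition.
    -- src.orders and dw.fact_orders are illustrative table names.
    SET hivevar:last_load='2015-01-01 00:00:00';

    INSERT INTO TABLE dw.fact_orders PARTITION (dt='2015-01-02')
    SELECT order_id, customer_id, amount, update_time
    FROM   src.orders
    WHERE  update_time > ${hivevar:last_load};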
2. Data Cleaning
This step gives unified treatment to data in the source database that is ambiguous, duplicated, incomplete, or in violation of business or logical rules; that is, it purges data that does not fit the business or is useless. For example, you can write Hive or MR jobs to clean out fields with invalid lengths, as sketched below.
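A cleaning job of the kind just mentioned might look like this HiveQL sketch (table and column names are hypothetical): it keeps only rows whose phone number has the expected 11-digit length and drops duplicate rows per key:

    -- Filter out malformed phone numbers and keep one row per user,
    -- preferring the most recently updated one.
    INSERT OVERWRITE TABLE clean.user_info
    SELECT user_id, name, phone
    FROM (
        SELECT user_id, name, phone,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY update_time DESC) AS rn
        FROM   staging.user_info
        WHERE  phone IS NOT NULL AND length(phone) = 11
    ) t
    WHERE rn = 1;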
3. Data Transformation
This step mainly converts the cleaned data into the form the data warehouse needs. On one hand, the same data field coming from different source systems may differ in name or format (for example, table A calls the column id while table B calls it ids); the warehouse must give them a unified data dictionary and format and normalize the content. On the other hand, some fields the warehouse requires may not exist in the source system at all and must be derived from the contents of several source fields.
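Both cases can be sketched in HiveQL (all names hypothetical): the UNION ALL maps table A's id and table B's ids onto a single unified column, and full_name is a new field derived from two source fields:

    -- Unify differing field names from two sources and derive a field
    -- that exists in neither source on its own.
    INSERT OVERWRITE TABLE dw.person
    SELECT person_id, full_name
    FROM (
        SELECT id  AS person_id,
               concat(first_name, ' ', last_name) AS full_name
        FROM   src_a.person
        UNION ALL
        SELECT ids AS person_id,
               concat(first_name, ' ', last_name) AS full_name
        FROM   src_b.person
    ) unified;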
4. Data Loading
This step imports the data processed above into the target storage (MySQL, etc.) so that the data mart can serve it onward, for example to visualization.
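As a sketch (names hypothetical), the final query-ready table might first be materialized in Hive and then shipped to MySQL by a separate export job (for example with Sqoop):

    -- Materialize the final, query-ready result; a separate export job
    -- would then copy dm.report_daily into the MySQL serving database.
    INSERT OVERWRITE TABLE dm.report_daily PARTITION (dt='2015-01-02')
    SELECT region, count(*) AS order_cnt, sum(amount) AS revenue
    FROM   dw.fact_orders
    WHERE  dt = '2015-01-02'
    GROUP  BY region;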
For data safety and ease of operation, large companies generally encapsulate the data platform and task scheduling: the underlying big data clusters (Hadoop, Spark, Sqoop, Hive, ZooKeeper, HBase, and so on) are wrapped behind a web interface, and different employees are given different permissions, through which different operations and calls are made against the cluster. Taking the data warehouse as an example, it is divided into several logical layers, so tasks are created per layer, and task flows are executed per layer. (In a large company, a cluster often has thousands or even tens of thousands of scheduled tasks awaiting execution every day; splitting them into per-layer task flows, with each layer's tasks placed in the corresponding flow, makes management and maintenance much more convenient.)
2. Four logical architecture layers of the data warehouse
Data warehouses are typically divided into four layers, but note that this division and its naming are not unique: there are generally four layers, but different companies name them differently. For example, the staging layer here is called the replication layer (SSA), while JD.com calls it BDM. Similarly, Alibaba uses a five-layer warehouse structure that is more fine-grained, but its core concepts derive from the four-layer data model. The following shows the layer structure and naming of JD.com's and Alibaba's data warehouses respectively.
1. Replication layer (SSA, system-of-records staging area)
SSA copies data directly from the source systems (for example, reading all the data out of MySQL and importing it, unprocessed, into tables of the same structure in Hive), preserving the original business data as much as possible. The only difference from the source data is that SSA adds timestamp information on top of it, forming multiple historical versions of the data.
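A minimal sketch of such an SSA table in HiveQL (structure and names hypothetical): the columns mirror the MySQL source exactly, and the only addition is a load-date partition that versions the history:

    -- Mirror of the source table; the dt partition is the only addition,
    -- so each load forms one historical version of the source data.
    CREATE TABLE ssa.user_info (
        user_id     BIGINT,
        name        STRING,
        phone       STRING,
        update_time STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC;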
2. SOR layer (system-of-record)
SOR is a set of table structures developed on the basis of a model and following 3NF normalization rules. It stores the lowest-level (most granular) data in the warehouse, organized by subject domain. For example, the university data statistics service platform stores its data in the SOR layer under four major subjects (personnel, students, teaching, and scientific research) according to its current requirements. SOR is the core and foundation of the entire data warehouse; it should be flexible enough to handle the addition of more data sources, support more analysis requirements, and support further upgrades and updates.
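As an illustrative sketch (names and columns hypothetical), the student subject domain of such an SOR layer might contain normalized tables like these, where each fact lives in exactly one table and is referenced elsewhere by key:

    -- 3NF-style tables: major names are not repeated on every student
    -- row but factored out into their own table.
    CREATE TABLE sor.student (
        student_id BIGINT,
        name       STRING,
        major_id   BIGINT
    );

    CREATE TABLE sor.major (
        major_id   BIGINT,
        major_name STRING,
        dept_id    BIGINT
    );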
3. Summary layer (SMA, summary area)
SMA is the intermediate transition between SOR and DM (the mart layer). Because SOR data is highly normalized, completing a query requires a lot of join work; meanwhile, the granularity of the data in DM is often much coarser than in SOR, so a lot of aggregation work is needed to generate DM's summary data. SMA therefore denormalizes the SOR data where appropriate (for example, designing wide table structures that combine data from the personnel information and cadre information tables) and pre-aggregates it (for example, common headcount summaries, organization summaries, and so on), thereby improving the warehouse's query performance.
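Both kinds of SMA work can be sketched in HiveQL (all names hypothetical): a denormalizing join that builds a wide person table, and a pre-aggregated headcount summary that downstream DM queries can reuse:

    -- Denormalize: one wide row per person, pre-joining cadre info.
    INSERT OVERWRITE TABLE sma.person_wide
    SELECT p.person_id, p.name, p.dept_id, c.cadre_rank
    FROM   sor.person p
    LEFT JOIN sor.cadre_info c ON p.person_id = c.person_id;

    -- Pre-aggregate: headcount per department, computed once here
    -- instead of in every downstream query.
    INSERT OVERWRITE TABLE sma.dept_headcount
    SELECT dept_id, count(*) AS headcount
    FROM   sma.person_wide
    GROUP  BY dept_id;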
4. Mart layer/Presentation layer (DM, data mart)
The data saved in DM is directly accessible to users; DM can be understood as the data that end users ultimately want to see. DM mainly holds business data at various granularities; by providing data at different granularities, it can adapt to different data access requirements. In the university data statistics service platform, the DM layer holds the data that is presented directly to end users.
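As a sketch of serving different granularities (names hypothetical), the same SMA summary can feed DM tables at both the department level and the university level:

    -- Finer-grained mart table: one row per department.
    INSERT OVERWRITE TABLE dm.headcount_by_dept
    SELECT dept_id, headcount
    FROM   sma.dept_headcount;

    -- Coarser-grained mart table: a single university-wide total.
    INSERT OVERWRITE TABLE dm.headcount_total
    SELECT sum(headcount) AS total_headcount
    FROM   sma.dept_headcount;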