This article is collated from Alibaba Cloud senior technical expert Li Jinbo's talk "Ideas for Building an Enterprise Big Data Platform Data Warehouse Architecture", shared at the first Alibaba Online Summit. As the Internet continues to grow in scale, data is also growing explosively, and ever more structured, semi-structured and unstructured data is being produced, so more and more enterprises are starting to process their data on big data platforms. In his talk, Li Jinbo introduced how to use the characteristics of a big data platform to build a data warehouse better suited to big data applications, from four aspects: overall approach, model design, data architecture and data governance.

Overall approach

As the Internet continues to grow in scale, data is exploding, and all kinds of structured, semi-structured and unstructured data are constantly being produced. Data applications in this new environment are characterized by fast-changing business, multiple data sources, coupling across many systems, and deep application of the data. So how do you build a data warehouse under these conditions? I think we should start from four keywords: stable, trustworthy, rich and transparent. Stability requires that data output be stable and guaranteed; trustworthiness means that data quality is high enough; richness means that the indicator data must be rich enough to cover the business; and transparency requires that the data production process be transparent, so that users can use the data with confidence.

The reason we choose to build the data warehouse on a big data platform is determined by the platform's rich set of characteristics:

  • Powerful computing and storage capabilities make a flatter data flow possible, simplifying the calculation process;
  • Various programming interfaces and frameworks enrich the means of data processing;
  • Rich data acquisition channels make it possible to collect semi-structured and unstructured data;
  • Various security and management measures ensure the availability of the platform.

The design principles of the warehouse architecture include four points:

  • Combine a bottom-up approach with a top-down approach to ensure the comprehensiveness of data collection;
  • High fault tolerance: as system coupling increases, a problem in any system will affect the warehouse service, so high fault tolerance is an essential factor when building the warehouse;
  • Data quality monitoring must run through the whole data process; it is no exaggeration to say that data quality monitoring consumes as many resources as data warehouse construction itself;
  • Don't worry about data redundancy: make full use of storage to make the data easy to use.

Model design

The first step of constructing data warehouse is model design.

Dimensional modeling or entity relationship modeling

Common model design approaches include dimensional modeling and entity-relationship modeling. Dimensional modeling is easy to implement and convenient for real-time data analysis, making it suitable for business analysis reports and BI; entity-relationship modeling produces a more complex structure, but it is better for linking master data and is suitable for deep mining of complex data content.

Each enterprise should choose a modeling method appropriate to its business form and demand scenarios when constructing its own data warehouse. Enterprises with complex applications can combine multiple modeling methods: for example, use dimensional modeling at the base layer to make dimensions clearer, and entity-relationship modeling at the middle layer to make the middle layer easier for upper-layer applications to use.

Star model and snowflake model

Besides the modeling method, the choice between the star model and the snowflake model can also be a dilemma. In fact, the two models coexist, and the star model is a special case of the snowflake model. In theory, real data models are snowflake models; in an actual data warehouse, the two coexist.

Because the star model has a relatively simple structure, we can use data redundancy in the middle layer to transform the snowflake model into a star model, which benefits data applications and reduces the consumption of computing resources.
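As a minimal sketch of this kind of denormalization (the table and column names below are hypothetical, and PySpark is used only as one possible engine, not necessarily the platform described in the talk), the dimension tables of a snowflake layout can be folded into the fact table so the middle layer exposes a redundant, star-shaped wide table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake_to_star").getOrCreate()

# Hypothetical snowflake-layout tables: the product dimension is further
# normalized into a separate category dimension.
fact_orders  = spark.table("base.fact_orders")    # order_id, product_id, amount, ds
dim_product  = spark.table("base.dim_product")    # product_id, product_name, category_id
dim_category = spark.table("base.dim_category")   # category_id, category_name

# Denormalize: fold the category attributes into the product dimension,
# then fold the widened dimension into the fact table. The result is a
# redundant, star-like wide table that upper layers can query without joins.
dim_product_wide = dim_product.join(dim_category, on="category_id", how="left")
fact_orders_star = fact_orders.join(dim_product_wide, on="product_id", how="left")

fact_orders_star.write.mode("overwrite").saveAsTable("mid.fact_orders_wide")
```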

Data layering

After determining the modeling idea and model type, the next step is data layering. Data layering can make the data construction system clearer and facilitate the data users to locate the data quickly. At the same time, data layering can simplify the data processing process and reduce the computational complexity.

The data warehouse layering we commonly use divides the data into three layers: the data mart layer, the middle layer and the base data layer. The purpose of reducing the traditional multi-layer structure to this three-layer structure is to shorten the overall data processing pipeline; a flat processing pipeline is conducive to data quality control and to data operation and maintenance.

Alongside the three-layer structure, we add streaming data as part of the data architecture. This is because data applications pay more and more attention to the timeliness of data: the more real-time the data, the higher its value.

However, because streaming data is expensive to collect, process and manage, streaming datasets are generally built in a demand-driven manner. Also for cost reasons, streaming data systems have a flatter structure and often do not include a middle layer.

Let’s look at the specific role of each layer.

Base data layer

The base data layer mainly completes the following tasks:

  • Data collection: collect data from different data sources onto one platform;
  • Data cleaning: clean out data that does not meet quality requirements, so that dirty data does not enter subsequent calculations;
  • Data classification: establish a data catalog; at the base layer, data is generally classified by source system and business domain;
  • Data structuring: structure semi-structured and unstructured data;
  • Data normalization: standardize dimension identifiers, unify units of measurement and perform other normalization operations (a small sketch of cleaning and normalization follows this list).
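As a minimal, hypothetical sketch of the cleaning and normalization steps (the column names, rules and the PySpark engine are all assumptions, not the exact method from the talk):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("base_layer_clean").getOrCreate()

# Hypothetical raw order data collected from a source system.
raw = spark.table("stg.orders_raw")   # order_id, user_id, amount, event_time

cleaned = (
    raw
    # Cleaning: drop records that fail basic quality rules so dirty data
    # does not flow into downstream calculations.
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("amount") >= 0)
    .dropDuplicates(["order_id"])
    # Normalization: unify units of measurement (store amounts as cents)
    # and standardize identifiers.
    .withColumn("amount_cents", (F.col("amount") * 100).cast("long"))
    .withColumn("user_id", F.upper(F.trim(F.col("user_id"))))
    # Standardized time dimension for partitioning and joining.
    .withColumn("ds", F.date_format(F.col("event_time"), "yyyyMMdd"))
)

cleaned.write.mode("overwrite").partitionBy("ds").saveAsTable("base.orders")
```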

Data middle layer

The most important goal of the data middle layer is to connect data about the same entity from different sources. Under the current business form, data about the same entity may be scattered across different systems and sources, and those sources may use different identifiers for the same entity. In addition, the middle layer can abstract relationships from behaviors, and the relationships abstracted from behavior become important data that upper-layer applications depend on. For example, abstracted interest, preference and habit data are the basic raw material for recommendation and personalization.

In the middle layer, appropriate data redundancy is often introduced to ensure the completeness of a subject area or to improve the ease of use of the data. For example, if a piece of fact data relates to two subjects but does not warrant a subject of its own, it will be placed in both subject libraries; and to improve the reusability of a single table and reduce join computation, some dimension information is commonly stored redundantly in the fact table.
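A minimal sketch of what "connecting the same entity across sources" might look like in practice (the mapping logic, table and column names and the PySpark engine are all illustrative assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mid_layer_id_mapping").getOrCreate()

# Hypothetical sources that identify the same customer differently.
crm_users  = spark.table("base.crm_users")    # crm_id, phone, name
app_events = spark.table("base.app_events")   # device_id, phone, event, ds

# Build a mapping table keyed on a shared identifier (here, phone number)
# and derive one unified entity id for the warehouse.
id_mapping = (
    crm_users.select("crm_id", "phone")
    .join(app_events.select("device_id", "phone").dropDuplicates(),
          on="phone", how="outer")
    .withColumn("entity_id", F.sha2(F.col("phone"), 256))
)

# Attach the unified id to behavioral data so upper layers can join
# behaviors, profiles and transactions through a single key.
events_unified = app_events.join(
    id_mapping.select("device_id", "entity_id").dropDuplicates(["device_id"]),
    on="device_id",
    how="left",
)

events_unified.write.mode("overwrite").partitionBy("ds").saveAsTable("mid.user_events")
```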

Data mart layer

The data mart layer is the top layer of the three-layer architecture. It is usually driven by requirement scenarios, with each mart built vertically and independently of the others. At the data mart layer we can dig deeply into the value of the data. It is important to note that the data mart layer needs to support fast trial and error.

Data architecture

The data architecture includes data integration, the data system and data services. Data integration can be divided into three types: structured, semi-structured and unstructured.

Data integration

Structured data acquisition can be subdivided into three types: full acquisition, incremental acquisition and real-time acquisition. Each has its own characteristics and suitable scenarios: full acquisition is the simplest, while the quality of real-time acquisition is the hardest to control.
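As an illustrative sketch of incremental acquisition only (the table names, watermark column and JDBC source are assumptions): pull just the rows changed since the last run, using an update-time watermark.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental_extract").getOrCreate()

# Watermark from the previous run; in practice this would be read from a
# metadata/checkpoint store rather than hard-coded.
last_watermark = "2024-01-01 00:00:00"

# Hypothetical JDBC source table with an update-time column.
source = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-db:3306/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "***")
    .load()
)

# Incremental acquisition: only rows modified after the last watermark.
delta = source.filter(F.col("updated_at") > F.lit(last_watermark))

(delta
    .withColumn("ds", F.date_format(F.col("updated_at"), "yyyyMMdd"))
    .write.mode("append").partitionBy("ds").saveAsTable("stg.orders_delta"))
```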

In traditional architectures, the structuring of logs is done outside the warehouse system. In a big data platform warehouse architecture, logs are not structured before being collected onto the platform. On the platform, logs are split by line, and each whole log line is stored in a single field of a data table; log structuring is then implemented with UDFs or the MapReduce computing framework.

In our view, the more regular the log format, the lower the parsing cost. When structuring logs, it is not necessary to fully flatten the log content; only the important common fields need to be structured. At the same time, to keep things extensible, we can use data redundancy to retain the original composite fields (such as the UserAgent field).
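A minimal sketch of log structuring with a UDF, assuming a hypothetical space-delimited access-log format (the regex, field names and PySpark UDF mechanism are illustrative assumptions, not the exact implementation from the talk):

```python
import re
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("log_structuring").getOrCreate()

# Hypothetical raw log table: each row holds one whole log line in `raw_line`.
raw_logs = spark.table("base.raw_access_log")   # raw_line, ds

# Only the important common fields are extracted; the raw line (including
# the composite UserAgent) is kept redundantly for later, deeper parsing.
schema = StructType([
    StructField("ip", StringType()),
    StructField("ts", StringType()),
    StructField("url", StringType()),
    StructField("user_agent", StringType()),
])

LOG_RE = re.compile(r'^(\S+) \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" "([^"]*)"')

@F.udf(returnType=schema)
def parse_line(line):
    m = LOG_RE.match(line or "")
    if not m:
        return None
    return m.groups()   # (ip, ts, url, user_agent)

structured = (
    raw_logs
    .withColumn("parsed", parse_line(F.col("raw_line")))
    .select("ds", "raw_line", "parsed.*")   # keep raw_line redundantly
)

structured.write.mode("overwrite").partitionBy("ds").saveAsTable("base.access_log")
```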

Unstructured data needs to be structured before it can be used. Feature extraction for unstructured data includes speech-to-text, image recognition, natural language processing, image annotation, video recognition and other methods. Feature extraction for unstructured data is not yet part of the warehouse architecture, but it may well be in the future.

Data services

Data services include statistical services, analysis services and label services:

  • Statistical services are mainly traditional report services: the big data platform processes the data and pushes the results into a relational database for the front-end report system or business system to query;
  • Analysis services provide detailed fact data and use the real-time computing power of the big data platform to let operations staff independently and flexibly run cross-combination queries over various dimensions. The capability is similar to what a traditional Cube provides, but on a big data platform there is no need to build the cube in advance, which is more flexible and saves cost;
  • Label services: in big data application scenarios, entities are often characterized, for example customers' spending power, interests and habits, physical characteristics and so on. These data are turned into labels and served as key-value (KV) data for front-end applications to query (a small sketch follows this list).
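As a hypothetical sketch of how per-entity label data might be pivoted into KV records for a label service (the label names, key format and target KV store are assumptions for illustration only):

```python
import json

# Hypothetical per-user label records produced by the warehouse.
label_rows = [
    {"user_id": "u001", "label": "spending_power", "value": "high"},
    {"user_id": "u001", "label": "interest",       "value": "outdoor"},
    {"user_id": "u002", "label": "spending_power", "value": "medium"},
]

# Pivot the label rows into one KV entry per user: the key is the user id,
# the value is a JSON document of all labels, ready to load into a KV store
# (Redis, HBase, etc.) for low-latency front-end queries.
kv_payload = {}
for row in label_rows:
    kv_payload.setdefault(row["user_id"], {})[row["label"]] = row["value"]

for user_id, labels in kv_payload.items():
    key = f"label:{user_id}"
    value = json.dumps(labels, ensure_ascii=False)
    print(key, "->", value)   # in production this would be written to the KV store
```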

Some practical points in architectural design

There are some practical points in architectural design that I would like to share with you:

First, make clever use of virtual nodes to implement synchronization of multi-system data sources, cross-system data transmission and data exchange between multiple applications. Using virtual nodes well reduces the operation and maintenance cost when problems occur in practice.

Second, enforce partitioning: add a time partition to every table. With partitions, each task can be rerun independently without causing data quality problems, which reduces the cost of data repair; partition pruning also reduces computing cost.
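A minimal sketch of this mandatory time-partition convention (the table names, `ds` column and PySpark engine are assumptions): every table carries a date partition, so one day can be recomputed in isolation and readers benefit from partition pruning.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mandatory_partitions").getOrCreate()

ds = "20240101"   # the business date being (re)computed

# Recompute exactly one day's data: rerunning this task only touches the
# ds=20240101 partition, so it cannot corrupt other days' results.
daily = (
    spark.table("base.orders")
    .filter(F.col("ds") == ds)                 # partition pruning on read
    .groupBy("ds", "user_id")
    .agg(F.sum("amount_cents").alias("spend_cents"))
)

# Dynamic partition overwrite: only the ds=20240101 partition is replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(daily
    .select("user_id", "spend_cents", "ds")    # partition column last, matching the target table
    .write.mode("overwrite")
    .insertInto("mid.user_daily_spend"))       # assumes the table exists, partitioned by ds
```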

Third, use the computing framework to handle operations such as log structuring and identical calculation procedures over the data, which reduces the burden on developers and makes the system easier to maintain.

Fourth, optimize the critical path. Optimizing the most time-consuming tasks on the critical path is the most effective way to guarantee on-time data output.

Data governance

Data governance is not an independent system; it should run through the entire warehouse architecture and all data processing flows.

Data quality

Data quality can be safeguarded before, during and after data production. Beforehand, we formulate data quality monitoring rules for each piece of data: the more important the data, the more monitoring rules it should have. During production, monitoring intervenes in the data production process so that data failing the quality requirements does not affect the quality of downstream data. Afterwards, data quality is analyzed and scored, shortcomings are identified and fed back into the data monitoring system, driving overall data quality improvement.
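A hypothetical sketch of an in-process ("during") quality check: a couple of rules are evaluated against the day's partition, and the task is failed if a rule is violated (the rule thresholds, table names and PySpark engine are all assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_check").getOrCreate()

ds = "20240101"
df = spark.table("mid.user_daily_spend").filter(F.col("ds") == ds)

total = df.count()

# Rule 1: the partition must not be empty (output stability).
# Rule 2: key fields must not contain too many nulls (trustworthiness).
null_keys = df.filter(F.col("user_id").isNull()).count()

violations = []
if total == 0:
    violations.append("empty partition")
if total > 0 and null_keys / total > 0.01:        # more than 1% null keys
    violations.append(f"null user_id ratio too high: {null_keys}/{total}")

if violations:
    # Intervene during production: fail the task so dirty data does not
    # propagate to downstream consumers.
    raise RuntimeError(f"DQ check failed for ds={ds}: {violations}")
```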

Data life cycle management

For cost and other reasons, we still need to manage the data life cycle on the big data platform. By frequency of use, data can be divided into four categories: ice, cold, warm and hot. Reasonable life cycle management should ensure that warm data makes up the majority of the whole data system; at the same time, to keep data assets complete, important base data should be retained for a long time.

For data produced in intermediate calculation steps, the retention period can be shortened to reduce storage cost, provided most applications can still access the historical data they need. One last point worth noting: cold backup is a thing of the past, and there is no need for separate cold-backup devices on a big data platform.
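As a purely illustrative sketch of the ice/cold/warm/hot classification (the thresholds, access statistics and retention mapping are assumptions, not figures from the talk):

```python
# Hypothetical access statistics per table: days since last read.
tables = {
    "base.orders":           2,     # read this week        -> hot
    "mid.user_daily_spend": 20,     # read this month       -> warm
    "mart.campaign_2019":  200,     # rarely read           -> cold
    "tmp.orders_step1":    400,     # no longer read        -> ice
}

def lifecycle_category(days_since_last_read: int) -> str:
    if days_since_last_read <= 7:
        return "hot"
    if days_since_last_read <= 30:
        return "warm"
    if days_since_last_read <= 365:
        return "cold"
    return "ice"

# Map categories to a retention policy: intermediate data gets short
# retention, important base data (None) is kept long-term.
retention_days = {"hot": None, "warm": None, "cold": 365, "ice": 30}

for name, days in tables.items():
    cat = lifecycle_category(days)
    print(f"{name}: {cat}, retention={retention_days[cat]}")
```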

Welcome to the "Road to Big Data Mastery" series of articles.
