1. Preface


Hi, I’m Yunqi.

Long time no see. I have previously shared common data warehouse modeling methods and modeling examples from the big data field. Today I will talk about the data center, its concrete implementation and delivery process, and the very important OneData theory.

Quite some time ago, a number of leading figures in the data industry wrote about the data center and the OneData theory with deep insight. Here I stand on the shoulders of giants and share my own understanding. If anything is wrong, please leave a comment to correct me.

2. What is a data center?

With the question “Did Alibaba actually dismantle its middle platform or not?” in mind, let us first discuss what the center actually is. 😮

Early on, the industry had no clear definition of the “data center” (数据中台, also translated as “data middle platform”), but put plainly, we can understand the center as a kind of middle layer.

And a middle layer takes on one very important job: decoupling!

Image from: Alibaba Cloud data center

As the figure above shows, before 2014 each Alibaba business had its own ETL development team providing data support, and each ETL team built its own data system according to its own ideas. This resulted in:

  • Chaotic data flows with no clear direction;

  • Disordered data management that was effectively out of control;

  • Wasted R&D manpower and compute and storage resources, while business needs were still not met well.

This chimney-style (siloed) development caused a great deal of business pain and technical waste across the group.

Alibaba proposed the concept of the “data center” early on to cope with the ever-changing data needs of multiple business departments. It had to meet not only the day-to-day data needs of the departments’ many front-end businesses, but also business peaks such as Double 11 and 618.

Today, the data center has become the best practice for realizing data intelligence on Alibaba Cloud through “methodology + organization + tools”.

As shown below:

Data center logical architecture diagram

3. What is the OneData system?

The official explanation:

The Alibaba Cloud OneData data center solution takes a big data storage and computing platform as its carrier, the OneModel unified data construction and management methodology as its trunk, and OneID, which turns core business elements into assets, as its core, in order to achieve full-domain linking, label extraction, and multi-dimensional profiling. With data asset management as its bark and data application services as its branches and leaves, it forms a loosely coupled overall solution. Its data service philosophy is deeply rooted, emphasizes the business model, and realizes value by advancing digital transformation.

So far, the achievements of data center construction show up mainly in two areas: data technology capabilities and data assets.

Today, Alibaba’s businesses share the same set of data technologies and assets. Internally, Alibaba calls this unified data system “OneData”. OneData is abstracted into three parts: OneID, OneModel and OneService.

  • Part one: OneModel is committed to the realization of data standards and unification;

  • Part two: OneID is committed to the unification of entities, so that data can be integrated rather than isolated, providing the basis for accurate user portraits.

  • Part three: OneService aims to unify data services so that data can be reused rather than replicated.
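To make the OneID idea a little more concrete, here is a minimal sketch of entity unification. It assumes a union-find approach, which the source does not specify, and the `IdentityGraph` class and identifiers such as `taobao:u1` or `phone:138` are hypothetical. The only point is that identifiers observed together collapse into a single entity.

```python
from collections import defaultdict


class IdentityGraph:
    """Union-find over identifiers; linked identifiers share one entity."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(x, x)
        # Walk up to the root, halving the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def link(self, a, b):
        # Record that identifiers a and b were observed for the same entity.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def entities(self):
        # Group every known identifier under its root.
        groups = defaultdict(set)
        for ident in list(self.parent):
            groups[self.find(ident)].add(ident)
        return list(groups.values())


graph = IdentityGraph()
graph.link("taobao:u1", "phone:138")   # hypothetical login event
graph.link("tmall:u9", "phone:138")    # hypothetical order event
graph.link("hema:u5", "device:abc")    # an unrelated user
unified = graph.entities()
```

Here the Taobao and Tmall accounts end up in one entity because they share a phone number, while the Hema account stays separate. A real OneID system would also need confidence scoring and rules for splitting wrongly merged entities, which this sketch ignores.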

Image from: Alibaba Cloud data center

In the architecture diagram, the bottom layer is mainly data collection and ingestion: data is ingested by business line (such as Taobao, Tmall, Hema, etc.) and extracted to the computing platform. Through the OneData system, a “common data center” is then built with “business segment + analysis dimension” as its architecture.

On top of the common data center, upper layers are built according to business needs: a consumer data system, an enterprise data system, a content data system, and so on.

Only after this deep processing can data deliver its value to products and businesses. Finally, data is served in a unified way through the unified data service middleware “OneService”.

4. Entry point of OneData methodology

OneData methodology is described in detail in the book “The Road to Big Data: Alibaba’s Big Data Practice”. The specific implementation still needs to start from data architecture methods, data model design methods and data standardization.

Data architecture method (planning the data system globally):

  • Data domain partitioning -> Bus matrix construction -> Data layer planning

  • Delivers the overall planning and design of enterprise data

Data model design method (easy to use and reusable):

  • Dimensions -> Facts -> Public summaries

  • Build data models for data analysis scenarios, consolidating common computations so that data is reused and efficiency improves

Data standardization method (unified calculation definitions and expression):

  • Derived indicator = atomic indicator + business qualifier + statistical period + statistical granularity

  • Standardize data definitions and unify calculation definitions to ensure data quality

OneData architecture diagram

5. Implementation process of OneData

First, thorough business research and requirements analysis must be carried out when implementing a data center. This is the cornerstone of data center construction: whether the business research and requirements analysis are done well directly determines whether the construction succeeds.

Second, the overall data architecture is designed, mainly by partitioning data domains. Following dimensional modeling theory, a bus matrix is built to abstract business processes and dimensions.

Third, the requirements for reports and large-screen dashboards are abstracted and organized into an indicator system, and the OneData intelligent data construction and management platform is used to complete the indicator specification definition and model design.

Finally comes code development, followed by operations and maintenance.

5.1 Project Survey

The survey follows four lines of inquiry: business requirements research, data research, business system research, and environment research.

At this stage, the main pitfalls to avoid are: misunderstanding user requirements; not understanding the network environment, which affects moving data to the cloud; and surveying the business systems incompletely, which leaves later models unable to deliver the intended results.

5.2 Architecture Design

Data domain partitioning

A data domain is a collection of abstracted business processes or dimensions for business analysis.

A business process can be understood as an indivisible behavioral event, under which related indicators can be defined. A dimension is the context in which a measurement is taken. For example, in a buyer order event, the buyer is a dimension.

To keep the whole system viable, data domains need to be abstracted carefully and maintained and updated over the long term, but they should not be changed lightly.

A good data domain partition covers all current business requirements and also lets new business either fit into an existing data domain or be added as a new data domain without affecting the rest.

Data domain            Description
Member domain          Registered user information, user registration events, points, login
Interaction domain     Reply, upvote, comment, post
Transaction domain     Order, transaction
Storage domain         Inventory
Store domain           Store information
Public and custom      Common dimension information

Building the Bus Matrix

After sufficient business research and requirements research, it is time to build the bus matrix.

In this step we need to do two things: identify which business processes fall under each data domain, and identify which dimensions are associated with each business process. Together these define the business processes and dimensions of every data domain.
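The two steps above can be sketched as a small data structure. This is only an illustration under my own assumptions: the domains, business processes, and dimensions in `BUS_MATRIX` are hypothetical examples, and the helper functions are not part of any OneData tool.

```python
# Hypothetical business processes and their dimensions, grouped by data domain.
BUS_MATRIX = {
    "transaction": {
        "place_order": {"buyer", "seller", "product", "time"},
        "pay": {"buyer", "seller", "time"},
    },
    "member": {
        "register": {"buyer", "time"},
    },
}


def shared_dimensions(matrix):
    """Return dimensions used by more than one business process."""
    usage = {}
    for domain, processes in matrix.items():
        for process, dims in processes.items():
            for dim in dims:
                usage.setdefault(dim, set()).add(f"{domain}/{process}")
    return {dim: procs for dim, procs in usage.items() if len(procs) > 1}


def render(matrix):
    """Render the matrix as text: one row per business process."""
    all_dims = sorted({d for ps in matrix.values() for ds in ps.values() for d in ds})
    lines = ["domain/process".ljust(26) + " ".join(d.ljust(8) for d in all_dims)]
    for domain, processes in matrix.items():
        for process, dims in processes.items():
            marks = " ".join(("x" if d in dims else "-").ljust(8) for d in all_dims)
            lines.append(f"{domain}/{process}".ljust(26) + marks)
    return "\n".join(lines)


shared = shared_dimensions(BUS_MATRIX)
table = render(BUS_MATRIX)
```

Dimensions that appear in several rows (here `buyer`, `seller` and `time`) are the candidates for consistent dimensions that must be defined once and shared.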

5.3 Specification Definition

Specification definition mainly defines the indicator system, including atomic indicators, business qualifiers, statistical periods, and derived indicators.

Time period

Specifies the time range or point in time over which data is counted, such as the last 30 days, the calendar week, or from a start date up to the current day.
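As a rough illustration of time periods, the sketch below turns the three examples into date ranges. The function names are hypothetical, and the convention that “the last 30 days” includes the current day is my assumption, since real platforms differ on this.

```python
from datetime import date, timedelta


def last_n_days(today, n):
    """Closed range covering the n days that end on `today` (assumed convention)."""
    return today - timedelta(days=n - 1), today


def calendar_week(today):
    """Monday through Sunday of the week containing `today`."""
    monday = today - timedelta(days=today.weekday())
    return monday, monday + timedelta(days=6)


def up_to_today(start, today):
    """Everything from a fixed start date up to `today`."""
    return start, today


today = date(2021, 6, 2)  # a Wednesday, chosen only for illustration
window_30d = last_n_days(today, 30)
window_week = calendar_week(today)
```

Defining each period once, in one place, is what keeps “last 30 days” meaning the same thing in every derived indicator.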

Business qualifier

An abstract partition of the business. A business qualifier belongs to a business domain. For example, the access terminal type in the log domain can be wireless or PC.

Measure / atomic indicator

Atomic indicator and measure mean the same thing: a measure based on the behavior of a business event. It is an indicator that cannot be split any further in the business definition and that carries a clear business meaning, such as payment amount.

Dimension

A dimension is the context of a measurement and reflects a class of attributes of the business. The collection of such attributes constitutes a dimension, which can also be called an entity object. A dimension belongs to a data domain. Examples are the geographic dimension (with country, region, province and city levels) and the time dimension (with year, quarter, month, week and day levels).

Derived indicator

Derived indicator = one atomic indicator + multiple business qualifiers (optional) + time period. It can be understood as delimiting the statistical scope of an atomic indicator. For example, given the atomic indicator payment amount, the payment amount of overseas buyers over the last 1 day is a derived indicator.

Atomic indicators, business qualifiers, and modifiers all belong directly to a business process, and a modifier inherits the data domain of its modifier type.
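The formula above can be sketched in code. This is a minimal illustration, not how the OneData platform implements it: the `DerivedIndicator` class, the field names, and the sample payment events are all hypothetical, and I assume that the atomic indicator is aggregated by summation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DerivedIndicator:
    name: str
    atomic_field: str   # the atomic indicator, e.g. payment amount
    qualifiers: tuple   # business qualifiers as predicates (may be empty)
    start: date         # time period, inclusive on both ends
    end: date

    def compute(self, events):
        """Sum the atomic indicator over events inside the statistical scope."""
        total = 0.0
        for event in events:
            in_period = self.start <= event["date"] <= self.end
            if in_period and all(q(event) for q in self.qualifiers):
                total += event[self.atomic_field]
        return total


def overseas_buyer(event):
    # Hypothetical business qualifier.
    return event["buyer_region"] == "overseas"


payments = [
    {"date": date(2021, 6, 1), "buyer_region": "overseas", "pay_amt": 120.0},
    {"date": date(2021, 6, 1), "buyer_region": "domestic", "pay_amt": 80.0},
    {"date": date(2021, 5, 31), "buyer_region": "overseas", "pay_amt": 50.0},
]

pay_amt_1d_overseas = DerivedIndicator(
    name="pay_amt_1d_overseas",
    atomic_field="pay_amt",
    qualifiers=(overseas_buyer,),
    start=date(2021, 6, 1),
    end=date(2021, 6, 1),
)
result = pay_amt_1d_overseas.compute(payments)
```

Only the first event satisfies both the qualifier and the time period, so `result` is 120.0. Because the indicator is assembled from named parts, two teams that pick the same parts get the same number.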

Categories of derived indicators

Derived indicators fall into three categories: transactional indicators, stock indicators, and composite indicators. Depending on their characteristics, some require a new atomic indicator to be created, while others can be derived by adding modifiers to atomic indicators of another type.

5.4 Model Design

Data model design is based mainly on dimensional modeling theory and on the bus architecture of the dimensional data model, and aims to build consistent dimensions and facts. For dimension table and fact table design, see my earlier articles “Talking about the Soul of Dimensional Modeling: Dimension Table Design” and “Dimensional Modeling in Practice: A Deep Dive into Fact Tables”.

Data warehouse layering schemes differ slightly from one another, but they all arrive at much the same place. During implementation and delivery, the Alibaba data center usually organizes data modeling and development work into the following layers.

Operational Data Layer (ODS)

Store business system data in a data warehouse with little or no processing.

  • Synchronization: incrementally or fully synchronize structured data to MaxCompute

  • Structuring: parse unstructured data (logs) into structured form and store it in MaxCompute

  • History retention and cleansing: retain historical data and cleanse data according to business requirements and to inspection and audit requirements
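The structuring step can be illustrated with a small parser. This is only a sketch under my own assumptions: the log format and `LOG_PATTERN` are hypothetical, and loading the parsed rows into MaxCompute is deliberately left out.

```python
import re

# Hypothetical access-log format: "<timestamp> <user_id> <terminal> <url>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<user_id>\S+) (?P<terminal>pc|wireless) (?P<url>\S+)$"
)


def structure(lines):
    """Split raw log lines into parsed rows and rejected lines."""
    rows, rejected = [], []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
        else:
            rejected.append(line)  # keep bad lines for later inspection
    return rows, rejected


raw = [
    "2021-06-01T10:00:00 u1 pc /item/42",
    "2021-06-01T10:00:05 u2 wireless /cart",
    "corrupted line",
]
rows, rejected = structure(raw)
```

Keeping the rejected lines instead of silently dropping them matches the spirit of the ODS layer, where data is stored with as little loss as possible.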

Common Dimension Model Layer (CDM)

This layer stores detailed fact data, dimension table data, and summary data for public indicators. Detailed fact data and dimension table data are generally produced from ODS layer data, while the summary data for public indicators is generally produced from the dimension table data and detailed fact data.

The CDM layer is further subdivided into the DWD layer (detailed data) and the DWS layer (summary data). It takes the dimensional modeling method as its theoretical basis and uses dimension degeneration, that is, copying dimension attributes into fact tables, to reduce joins between fact tables and dimension tables and make the detailed data tables easier to use.
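Dimension degeneration can be sketched as follows. The tables, column names, and the `build_dwd_orders` helper are hypothetical, and plain Python dictionaries stand in for what would normally be a SQL join.

```python
# Hypothetical ODS-level rows.
orders = [
    {"order_id": 1, "buyer_id": "b1", "amount": 120.0},
    {"order_id": 2, "buyer_id": "b2", "amount": 80.0},
    {"order_id": 3, "buyer_id": "b9", "amount": 15.0},  # buyer missing from the dimension
]
dim_buyer = {
    "b1": {"buyer_region": "overseas", "buyer_level": "gold"},
    "b2": {"buyer_region": "domestic", "buyer_level": "silver"},
}

# Dimension attributes chosen to be degenerated into the fact table.
DEGENERATED = ("buyer_region", "buyer_level")


def build_dwd_orders(fact_rows, dim):
    """Copy the selected dimension attributes onto each fact row."""
    wide = []
    for row in fact_rows:
        attrs = dim.get(row["buyer_id"], {})
        enriched = dict(row)
        for col in DEGENERATED:
            # Unknown buyers get None instead of being dropped.
            enriched[col] = attrs.get(col)
        wide.append(enriched)
    return wide


dwd_orders = build_dwd_orders(orders, dim_buyer)
```

After this step, a query that filters orders by buyer region no longer needs to join the buyer dimension table. The trade-off is redundancy: if a buyer attribute changes, the detail table has to be rebuilt or versioned.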

In the summary data layer, dimension degeneration is applied even more strongly, and wide tables are used more heavily to build the public indicator data layer, so that public indicators are more reusable and repeated processing is reduced. Its main functions are as follows:

  • Combine related and similar data: use detailed wide tables so that join computations are reused and data scans are reduced.

  • Process public indicators uniformly: based on the OneData system, build statistical indicators with standard naming, consistent definitions, and unified algorithms, providing public indicators for upper-level data products, applications and services; create logical summary wide tables.

  • Establish consistent dimensions: build consistent dimension tables for data analysis to reduce the risk of inconsistent calculation definitions and algorithms.
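As a rough sketch of a summary wide table, the code below computes several public indicators in one pass, one row per analysis granularity. The table name `dws_trade_1d`, the columns, and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows, as they might look in the DWD layer.
dwd_orders = [
    {"dt": "2021-06-01", "buyer_region": "overseas", "amount": 120.0},
    {"dt": "2021-06-01", "buyer_region": "overseas", "amount": 30.0},
    {"dt": "2021-06-01", "buyer_region": "domestic", "amount": 80.0},
]


def build_dws_trade_1d(rows):
    """One summary row per (dt, buyer_region), carrying several indicators."""
    summary = defaultdict(lambda: {"pay_amt_1d": 0.0, "order_cnt_1d": 0})
    for row in rows:
        key = (row["dt"], row["buyer_region"])
        summary[key]["pay_amt_1d"] += row["amount"]
        summary[key]["order_cnt_1d"] += 1
    return dict(summary)


dws_trade_1d = build_dws_trade_1d(dwd_orders)
```

Downstream reports then read `pay_amt_1d` and `order_cnt_1d` from this single table instead of each re-scanning the detail data with its own logic.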

Application Data Layer (ADS)

This layer stores the personalized statistical indicators of data products, which are produced from CDM layer and ODS layer data.

  • Personalized indicator processing: indicators that are not common across products or that are complex (index-type, ratio-type, and ranking-type indicators)

  • Application-oriented data assembly: large wide-table marts, converting wide (horizontal) tables to long (vertical) tables, and trend indicator series
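The two kinds of ADS processing above can be sketched like this. The summary rows and helper functions are hypothetical, and ties in ranking are not handled.

```python
# Hypothetical summary rows read from the CDM layer.
summary = [
    {"region": "overseas", "pay_amt_1d": 150.0, "order_cnt_1d": 2},
    {"region": "domestic", "pay_amt_1d": 450.0, "order_cnt_1d": 6},
]


def with_share_and_rank(rows, field):
    """Add a ratio-type and a ranking-type indicator derived from `field`."""
    total = sum(r[field] for r in rows)
    ordered = sorted(rows, key=lambda r: r[field], reverse=True)
    return [
        {**r, "share": r[field] / total if total else 0.0, "rank": i}
        for i, r in enumerate(ordered, start=1)
    ]


def wide_to_long(rows, key, fields):
    """Turn each wide row into one (key, indicator, value) row per field."""
    return [
        {key: r[key], "indicator": f, "value": r[f]}
        for r in rows
        for f in fields
    ]


ranked = with_share_and_rank(summary, "pay_amt_1d")
long_rows = wide_to_long(summary, "region", ["pay_amt_1d", "order_cnt_1d"])
```

Shares and ranks depend on which rows a particular product shows, which is why they belong in the application layer rather than in the shared summary layer.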

5.5 Summary

Implementing OneData is a highly iterative and dynamic process that generally follows a spiral approach. Once the overall architecture design is complete, model design and review proceed iteratively, one data domain at a time.

A review mechanism is built into each stage of the implementation, including architecture design, specification definition and model design, to make sure the models are implemented correctly.

The above covers some applications of the OneData system in the data center implementation and delivery process. I hope it is helpful.

6. Closing thoughts

At the end of May this year, at the 2021 Alibaba Cloud Summit, Zhang Jianfeng, president of Alibaba Cloud Intelligence, said that there are four key phrases for 2021: a deeper foundation, a thicker middle platform (zhongtai), a stronger ecosystem, and good service.

“The middle platform includes middleware, databases, operating systems, big data platforms, and so on. The middle platform is the core that will let the ‘cloud’ develop further and faster in the future.”

After working with the data center for the past two years, I think the center is essentially a “centralized capability reuse platform”. As mentioned at the beginning, it carries the mission of “decoupling”.

A data center is not built overnight. Whether an enterprise is large or small, it needs to start from small business scenarios, accumulate business experience over the long term, and keep optimizing and innovating. Only then can it build a data center that fits its own business characteristics.

References:

“The Road to Big Data: Alibaba’s Big Data Practice”

Tan Hu, Chen Xiaoyong, “How to Build the Alibaba Data Center”

Shi Xiufeng, “What Is the OneData System? An Interpretation of the Alibaba Data Center”

“Dialogue with Zhang Jianfeng, President of Alibaba Cloud”