preface

In January 2015, “Data Warehouse Toolkit — Authoritative Guide to Dimensional Modeling” was published; In July 2017, the Road to Big Data — Alibaba’s Big Data Practice was published. Now the latter is no longer in print, and you can only buy second-hand books at twice the original price or photocopied versions of uneven quality online. At the speed of Internet technology development, it seems that these technologies and methodologies in the book should be outdated, but the reality is that, except for the dozens of companies that started early and became the leaders of the industry, the data warehouse of most companies with such requirements is still built on demand. In the eyes of the management, DW/BI system is mainly positioned as an analysis and decision system (BI business analysis). His absence will not affect the early expansion of the company. If the data mart of the business line is built to meet the needs of the current development, it is understandable that the investment in the construction of logarithmic warehouse is not valued. Only when the company develop successful enough that the scale stage, the business of the company’s target into the operation mode optimization, make scale after the operation more efficient, and create more business ecosystem, the data of the company is then put forward the higher request, if the data is scattered in the data mart and cannot get through, Therefore, it cannot meet the management’s requirement of horizontal comparison of business indicators between different business lines. API traffic, channel relationships, and competing product information, which were just interference signals before, are all worth thinking about in the scale stage. Technically, the company collects a huge amount of data in the process of development, and the cost of data storage and calculation has climbed astronomically. However, due to the insufficient construction of data warehouse, it cannot be effectively accessed (not converted into usable assets to generate benefits), which inevitably pushes the management to start to solve this problem.

According to this, after entering the scale of companies every year, there will be a demand for talents in warehouse standardized modeling. In addition to the talents recruited by big factories, talents who jump out to look for opportunities and are familiar with the methodology of warehouse modeling and have relevant experience every year are bound to be scarce and therefore hot. (pei) or need to know that a good project, but less good project (company so dozens of larger size, and the company number warehouse building was built after the major remaining data governance requirements), so when your company’s demand for CangJianMo, grasp the opportunity to fight for it, to practice several CangJianMo methodology, the modeling method learn to achieve mastery through a comprehensive study, Because warehouse development on the career path is not so deep technical requirements, but warehouse modeling requirements are very high. , after all, technical big data platform can for you by high P backend implementation and decrease the difficulty (SQL) is the trend of development, design, how to put the requirements modeling, namely business data abstraction into a series of the ability of the high availability data set, the number of successful experience CangJianMo copied to other topics for warehouse capacity, will be the key to career development. Skilled and methodological people can always save a company time and trial and error costs.

Number of warehouse modeling steps

The warehouse model can be regarded as the library management system of books as data, and the modeling is the method of data organization and storage. There are two mainstream methodologies for data warehouse modeling in China, one is Kimball’s dimensional modeling methodology and the other is Ali OneData methodology based on Kimball’s development. The steps of Kimball’s dimensional modeling theory are as follows:

  1. Collect business requirements and data implementation
  2. Discussion on collaborative dimension modeling
  3. Dimension model design: a. Select business process B. declare granularity C. confirm dimension D. Confirm the fact that

Ali’s theoretical steps for modeling are:

  1. Data warehouse planning A. Data survey B. Data domain division C. Bus Matrix construction D. Clear statistical indicators
  2. Data standard a. Specification definition
  3. Model design a. Detailed model design B. Summary model design

Kimball’s dimensional modeling approach is more general than Ali’s, and for this reason is sometimes more abstract in terms of word choice, so it is recommended that newcomers read Ali’s before Kimball’s. Step by step, Kimball can model around a single business process at a time. As his step introduction explains, start modeling by selecting the most rewarding and best-done business process in the main activity stream, thus supporting iterative development and output under time constraints. As Ali’s book evaluates, It focuses on how users can do requirements analysis more quickly. Ali’s modeling methodology is usually applied to integrate the business subject to build several positions, but not like e-r model to the entire enterprise to integrate all the theme, and on the basis of the unification of group method of data integration and management system (namely OneData), because of ali this dimension and with so many BU the company, The benefits of unified and integrated management are very high.

Number of warehouse planning

I. Data research

Alibaba’s data research is divided into two parts, business research and demand analysis. Business research – business understanding of business systems and business level decomposition and programming. Star models of dimensional modeling are basically built on relational databases, and relational databases mostly comply with the 3NF of E-R model, where decomposition and programming can be understood as each entity of E-R model and the relationship between entities. This step is to build the warehouse from the bottom up. Demand analysis — Collect the requirements of quotient/operation personnel for data or statements and establish different analysis perspectives (the analysis perspectives here can be understood as the establishment of large-width ADS application tables for quotient in different directions, which is to build data warehouse from top to bottom). The corresponding steps of Kimball dimension modeling are: collecting business requirements and data implementation, which can be divided into two steps:

  1. Dimension designers connect with business representatives (product managers) to understand business requirements (goals) and tease out key performance indicators
  2. Dimension designer docking source system experts (R&D, DBA) — to find out the actual situation of the underlying source data, feasibility

You can see that Ali is completely consistent with Kimball’s dimensional modeling in this step. Here we focus on process 2: requirements analysis. The process captures the crux of the job by answering a list of questions (question-driven) :

  1. Who is our core audience?

This question can lead you to layering users, building a user priority pyramid, and carefully selecting core users and their needs when r&d manpower is not enough

  1. What indicators do they want to see?

Inheriting from the last user pyramid, it can respectively summarize the indicators that different users care about3. How to evaluate the value/benefits of indicators?

I haven’t seen any official information yet, but I think it can be measured as follows:

A. Is the indicator directly used to make some decision/operation basis after it goes online?

B. Pv/UV used by various reports after the indicator goes online

C. Indicates how much ODS/DWD layer usage is reduced after it goes online

  1. What mathematical/quotient model does the user use?

The current mainstream should be AAARR and OSM, because they are popular enough not to be repeated in this chapter

  1. What quotient perspective do you want to look at? What are the indicators under these perspectives?

It can be summarized by business direction, and here is a reference example:

2. Data domain division

Business process: the business activity events of an enterprise, such as order, payment and refund, are all business processes and non-separable business behavior event data fields: The business process or dimension is summarized and the set after abstraction is designed from the perspective of business ownership, and the business is abstractly grouped. This is a step that Alibaba has but Kimball did not have in the original version. Because Ali modeling is to build OneData system in a one-stop way, it strives to summarize, define and abstract all existing businesses. Moreover, according to the principle of high cohesion and low coupling, the extracted knowledge system is decoupled from the superstructure with frequent iterations, and the unchangeable knowledge is deposited into the infrastructure structure (of course, it still needs long-term maintenance and update), so as to avoid the situation that the foundation needs to be shaken in future iterations and the upper model is forced to change. The expected goal is to divide the data domain, which can not only cover all the current business requirements, but also can be included into the existing data domain and expand the new data domain without impact when new business enters.

Third, build the bus matrix

A. Determine the data domain to which a business process belongs

B. Determine the dimension of the service process

C. Define the relationship between business processes and dimensions (matrix)

Dimension — The context of a measure, used to reflect a class of business attributes, the collection of which makes up a dimension, also known as entity objects (same as entities in the E-R model)

Enterprise data warehouse Bus architectureAn architectural framework that breaks down the DW/BI planning process into manageable modules by focusing on business processes and enables integration by reusing standardized conformance dimensions across different processes. Support incremental establishment of DW/BI systems and agile implementation of the enterprise data warehouse bus matrix

Enterprise data warehouse Bus matrixA tool for designing and interacting with the bus architecture. The rows of the matrix represent the business process, the columns represent the dimensions, and the points in the matrix indicate whether the dimensions are related to a given business process. When combing the bus matrix, each row can be analyzed to test whether the relevant candidate dimensions have been defined for the business process, and each column can be analyzed to consider the need for a dimension to be consistent across multiple business processes.

You can follow the above steps to comb the business processes in the data domain and then comb the corresponding dimensions of the business processes to complete the bus matrix. In Kimball dimension modeling, this step corresponds to point 1, 2 and 3 of dimension design, namelySelect business process.Statement of granularity, as well asIdentify the dimensions that describe the environment. The expected output of this step isEnterprise data warehouse bus matrix diagramThat is as follows:

4. Clear statistical indicators

Identify the corresponding under each business process:

  • Atomic indicators
  • The derived indicators

Atomic indicators— A measure under the behavior of a business event. It is a non-separable indicator in the business definition and has a definite English field name, data type, and algorithm description

The derived indicators= Atomic indicator + (multiple) modifiers (the relationship between modifiers is “and” and “or”) + time period, inheriting the English name, data type, and algorithm requirements of the atomic indicator

modifier— Qualified abstraction of business scenarios for metrics other than statistical dimensions, belonging to a modifier type (abstract division of modifiers)

Examples of derived indicators being generated are as follows:The steps for Kimball are:Confirm the facts used for measurement

Derived indicators can also be divided into:

  • Transactional metrics — metrics that measure business activities, such as number of new items shipped, number of reshipped items, and amount paid for orders
  • Stock-type indicator — Statistics on some states of entity objects (such as commodities and members), such as total number of commodities and total number of registered members, and the corresponding time period is generally “up to the current time in history”.
  • Composite metrics – a combination of transactional and stock metrics, such as browse UV- number of buyers conversion

The expected output of this step isData indicator chartThat is as follows:

Data standards/data dictionaries

Specification defines

Ali combines data warehouse construction experience and data characteristics to design a set of data standard naming system (data standard in Ali Dataworks is also called global field management), applied to several processes from data domain division to model design. Include steps:

  1. Build conformance logical dimensions and their properties
  2. Build conformance measures and metrics

A naming convention and indicator system is expected to be produced in this step. Naming conventions – Try to use English abbreviations, followed by Chinese pinyin initials when English names are too long or there is no proper meaning. For details of this chapter, see Pages 142-147 in Ali. For naming conventions, different companies may have different requirements at different stages. Some may just maintain a data dictionary, requiring all existing roots to use the data dictionary if they already exist. New roots need to be entered into the data dictionary. Some set up a metadata platform to manage the table structure and corresponding dimensions and indicators for modeling. If all the table building logic follows this platform, it is possible to realize the field-level blood relationship. Some may require to take over everything, define all the possible roots of the current business and input them into a system. When indicators need to be generated, the system will automatically generate indicator names according to the description of indicators to avoid inconsistency caused by differences between people. Table naming specification: DWD/DWS/ADS Layer Fact table: DWD_ Data field _XXX_{time period} Scheduling period {I/F}. I: incremental table F: Full table Dimension table: DIM [Topic name]_[Dimension short for short] Field naming specification — Meets the following three properties:

  • Consistency: Common field naming does not allow the same noun to point to different meanings in different scenarios, which is equivalent to satisfying the same name and synonym
  • Ease of use: as simple as it can be, easy to understand
  • Intuitive: The use of field identifiers and data structures conforms to the user’s thought process and vocabulary

P.S. originally, the index system should also be included here, but the index system involves a lot of data governance content, and I still have little understanding of it. I will write a single article after I have enough personal opinions

Below is the overall flow chart:

The rest of the model design will be introduced in chapter 2.

References:

  • Data Warehouse Toolkit – The Definitive Guide to Dimensional Modeling
  • The Road to Big Data — Alibaba’s Big Data Practice
  • Lean Data Analysis