As a data person, is he often surrounded by a variety of nouns, is he vaguely aware of many of them? Although some words are only one word apart, but they have completely different meanings. Today we will understand some common concepts and meanings of data warehouse construction and data analysis and the relationship between them.

This article was first published on the official account [Learning big Data in five minutes]

The structure of this paper is shown in the figure below:

I. Analysis of common concepts in data warehouse

1. Entity

Entity refers to the subject of attachment, which is an object we analyze. For example, when we analyze the sales of products, for example, the sales volume of Huawei mobile phones in the last six months, huawei mobile phones are an entity. We analyze the activity of the user, the user is an entity. Of course, entities can also exist in reality, such as virtual business objects, activities, members, etc., can be regarded as an entity.

Entities exist for business analysis, as a filtering dimension of analysis, have attributes that describe themselves and are themselves of analyzable value.

Dimension of 2.

Dimension is the Angle of looking at a problem, analyzing business data, and establishing a dimension based on the Angle of analysis. So a dimension is a quantity used to analyze data, for example, if you want to analyze product sales, you can choose to analyze by product category, which constitutes a dimension, and all product categories are aggregated together to form a dimension table.

3. The measurement

A metric is a number on a business process node. Volume, price, cost, etc.

Measures in a fact table can be divided into three categories: fully additive, semi-additive, and non-additive.

  1. Fully additive measures are the most flexible and useful, such as sales volume, sales volume, etc., which can be summarized in any dimension;
  2. The semi-additive measure can be summarized for some dimensions, but not for all dimensions. The balance is a common semi-additive measure, which can be added across all dimensions except the time dimension.
  3. There is also one that is not additive at all, such as ratio. For such non-additive measures, it is a good idea to store as many fully additive components of the non-additive measures as possible, and to aggregate these components into the final result set before calculating the final non-additive facts.

Granularity of 4.

Granularity is the unit of measurement in a business process, such as whether goods are measured by piece or batch records.

In warehouse construction, we say that this is the fact table of user granularity, then each row of data in the table is a user, there is no duplicate user; For example, if you have a sales granularity table, each row in the table is a sales record.

Choosing the appropriate granularity level is an important and key content of data warehouse construction. When designing data granularity, the following factors should be considered:

  1. The type of analysis to be accepted, the minimum granularity of acceptable data, and the amount of data that can be stored;
  2. The higher the level of granularity is defined, the less detailed analysis can be done in that warehouse;
  3. If storage resources are limited, high data granularity is required.
  4. Data granularity partitioning strategy must ensure that the granularity of data can really meet the needs of users for decision analysis, which is the most important criterion in data granularity partitioning strategy.

5. Size

Caliber is the logic of taking numbers (how to take numbers). For example, the number to be taken is the average height of boys under the age of 10. This is the caliber of statistics.

Index of 6.

The index is a measure of caliber, which is the final result. Such as the number of orders in the last seven days, the purchase conversion rate of a promotion, etc.

An index specific to the calculation and implementation, mainly consists of the following parts:

  • Index processing logic, such as count,sum, AVG
  • Dimensions, such as statistics by department and region, correspond to group by in SQL
  • Business qualifiers/modifiers, such as different payment channels to calculate the corresponding indicators, wechat pay order refund rate, Alipay order refund rate. Corresponds to where in SQL.

In addition, indicators themselves can also be derived and more indicators can be derived. Based on these characteristics, indicators can be classified:

  • Atomic metrics: Basic business facts, no business qualification, no dimension. For example, the order quantity and total order amount in the order table are atomic indicators;

The indicators that the business side cares more about are those that have actual business meaning and can directly obtain data. For example, the order payment amount in the last day is a derivative indicator, which will be directly displayed to the merchants on the product. However, this metric cannot be taken directly from the unified middle layer of the warehouse (since there are no existing fact fields, the warehouse usually provides large and wide tables). There needs to be a bridge between the metrics needs of the warehouse middle tier and the metrics needs of the business side, hence the derived metrics

  • The derived indicators:Dimension + modifier + atomic index. In the order payment amount of the store in recent 1 day, the store is a dimension, the order payment amount is a time type modifier, and the payment amount is an atomic index.

Dimension: The Angle from which indicators are observed; Modifiers: one or more values of a dimension. For example, male and female are two modifiers under dimension gender.

  • Derivative indexFor example, the conversion rate of a promotion is a derivative indicator because of the needTarget number of promotion usersandPromotion order number indexTo calculate.

7. The label

Labels are highly refined feature identifiers that are artificially set and obtained by applying certain algorithms to target objects according to the requirements of business scenarios. Visible label is the result of artificial reprocessing, such as net red, Baifumei, girl. For ambiguous labels, we can distinguish them internally. For example, apple refers to fruit, and iPhone refers to mobile phone.

8. Natural keys

A key consisting of properties that already exist in the real world, which are unique in business concepts and have certain business meanings, such as item ID, employee ID.

From a warehouse perspective, identifiers from a business system are natural keys, such as employee numbers in a business library.

9. Lasting bond

It stays permanent and doesn’t change. It’s sometimes called the supernatural persistence key. For example, the ID number is a persistent key.

The difference between the natural key and the permanent key: for example, after the employee leaves the company and re-enters the job, his natural key is the employee number has changed, but his permanent key ID number is the same.

10. The surrogate key

Keys that have no business meaning. Proxy keys have many other names: meaningless keys, integer keys, unnatural keys, artificial keys, composite keys, and so on.

Proxy keys are simply represented as integers produced in sequential order. The first row of the product row has a surrogate key of 1, the next row has a surrogate key of 2, and so on. The proxy key simply joins the dimension table and the fact table.

11. Degenerate dimensions

Degenerate dimensions are those that look like a dimension key in the fact table, but actually have no corresponding dimension table, that is, dimension attributes stored in the fact table, such dimension columns stored in the fact table are called degenerate dimensions. Like other dimensions stored in dimension tables, degraded dimensions can also be used for filtering queries of fact tables, aggregation operations, and so on.

So how do you define a degenerate dimension? Such as the order id, this kind of scale of large dimension, it is not necessary to use a dimension table for storage, data query and we or data filtering and very need, so this kind of redundancy in the fact table, this is called degradation dimension, citycode this we will be redundant in the fact table, but it has a corresponding dimension table, So it’s not a degenerate dimension.

12. Drill down

This is a common concept in data analysis, where drudge can be understood as adding layers of dimensions that allow data to be viewed from coarse-grained to fine-grained, such as from year to month to day, along time dimensions for product sales. From the year dimension you can drill down to the month dimension, day dimension and so on.

13. To get up

With drill-down, roll-up is easy to understand. It is an inverse operation, so roll-up can be understood as deleting some layers of dimension, observing data from fine to coarse-grained, or aggregating data up the dimension hierarchy.

Two, the relationship between the concepts of data warehouse

1. Relationships between entity tables, fact tables, and dimension tables

In Kimball dimension modeling, there are dimensions and facts, and in Inmon paradigm modeling, there are entities and relationships. These concepts are easier to understand if we separate the two modeling approaches. However, there are also a lot of hybrid modeling methods. When the two modeling methods are combined, whether these concepts are easy to be confused, especially the fact table and entity table, what are the differences and connections between them? Let’s take a look at their respective concepts first:

  1. Dimension table: The dimension table can be regarded as a window for users to analyze a fact, and the data in it should describe all aspects of the fact, such as time dimension table and region dimension table. The dimension table is an analysis Angle of the fact table.
  2. Fact table: Fact table is actually to determine a fact through the combination of various dimensions and some index values, such as time dimension, regional organization dimension, index value can determine the fact of some index values at a certain time and place. Each piece of data in the fact table is derived from the intersection of several dimension table data and index values.
  3. Entity table: entity table is a table of actual objects, entity table put data must be a strip of objective things data, such as a variety of goods, it is objective existence, so it can be designed an entity table. Real-time table only describes each thing, there is no concrete facts, so some people say that entity table is a fact table without facts.

For example, if there are iphones, Huawei and other mobile phones of various brands and models in the mobile phone market, these data can form a physical table of mobile phones, but there is no measurable data in the table. On a given day apple sold 15 phones and Huawei sold 20 phones. These phone sales data belong to facts and form a fact table. This allows you to do various dimensional analyses of the fact table using the date dimension table and the region dimension table.

2. Differences between indicators and labels

  • The concept of different

An indicator is a standard or method used to define, evaluate and describe a particular thing. For example: new users, cumulative users, user activity rate is to measure the user development indicators;

Labels are highly refined feature identifiers that are artificially set and obtained by applying certain algorithms to target objects according to the requirements of business scenarios. Visible label is the result of artificial reprocessing, such as net red, Baifumei, girl.

  • A different

Index name is the naming of the two characteristics of matter and quantity. The index value is the quantity performance of the index in specific time, region and conditions, for example, the weight of a person, the index name is weight, the index value is 120 kg;

Label names are usually adjectives or adjectives + nouns. Labels are generally unquantifiable and usually isolated. Except for basic class labels, labels processed by certain algorithms generally have no units and dimensions. Such as more than 200 catties called big fat.

  • Classification of different

Classification of indicators:

According to the calculation logic of indexes, indexes can be divided into atomic indexes, derived indexes and derived indexes.

According to the different content of event description, it can be divided into process index and result index.

Classification of labels:

According to the variation of the label is divided into static label and dynamic label;

According to the different reference and evaluation indexes of labels, they can be divided into qualitative labels and quantitative labels.

The best applications for indicators are monitoring, analysis, evaluation and modeling. The best application of tags is labeling, characterization, classification and feature extraction. In particular, it should be pointed out that the labeling of results is also a kind of label, so in the application scenarios of natural language processing and machine learning-related algorithms, labels are of great value for supervised learning, which is difficult to be achieved by simple indicators. The role of indicators in task allocation, performance management and other fields is also beyond the ability of labels.

3. Differences and connections between dimensions and indicators

Dimension is the observation Angle of data, that is, from which Angle to analyze the problem, look at the problem.

An index is a measure of the result on a dimensional basis.

A dimension is generally a discrete value, such as each independent date or region in the time dimension. Therefore, in statistics, the same records of the dimension can be aggregated together and aggregation function can be used to perform aggregation calculation, such as accumulation, mean, maximum and minimum.

The index is the general calculation of the aggregate, that is, the result of the aggregate operation, usually a continuous value.

4. The difference between the use of natural key and proxy key in data warehouse

The warehouse toolkit says that the only primary key of a dimension table should be a surrogate key rather than a natural key. Sometimes modelers are reluctant to give up using natural keys because they want to query the fact table with operational code rather than join with dimension tables. However, multidimensional keys with business implications should be avoided, because whatever assumptions we make may eventually become invalid because we have no control over changes to the business repository.

So every join between the dimension table and the fact table in the data warehouse should be based on a meaningless integer proxy key. Avoid using natural keys as primary keys for dimension tables.

All articles will be published on the official account