preface

Hello everyone, I’m Yunqi! Today I’m going to talk to you about dimension table design in a data warehouse.

Dimension tables are the soul of the dimensional modeling, in the dimension table design problems (such as dimension change, dimension hierarchy, dimension consistency, dimension integration and separation, etc.) will be directly related to the stand or fall of dimensional modeling, so good dimension table design, it is critical that today, let us together to explore the related concepts about dimension table design and some technologies.

The dimension change

Dimension table data usually comes from the front desk business systems, such as commodity dimension table may come from the ERP or supermarket POS system goods list, but is will change in the goods, such as commodities belong to categories, tags such as prices, commodity description, there is an error need corrections before these changes are likely to be caused by, or is the actual business situation changes.

In either case, it is important during dimension design to determine how source data changes are represented in the dimension table. Therefore, in dimensional modeling, this phenomenon is called slowly changing dimension, or SLOWLY changing Dimension (SCD).

Downstream analysis may require different approaches depending on the content of the change

For example, for the description information of the product, the business personnel may not be sensitive to it, or consider it irrelevant, which can be directly covered.

However, a change in the category to which a commodity belongs needs careful consideration, as this involves classifying the sales activities of the commodity into the new category or into the old category altogether. Before the change to the old category or after the change to the new category? This actually involves several approaches to the slow-changing dimension to be shared below.

1. Overwrite dimension values

When a dimension value attribute changes, the override dimension value method directly overwrites the old value with the new value.

This technique is suitable for the case that the historical change of the dimension attribute is not needed to be retained in dimension modeling. It is often used in the case of error correction or irrelevant change of dimension attribute, such as the input error before the user’s birthday, so the historical data of the previous birthday need not be retained.

By overwriting the dimension value, all historical measures of the dimension property will be changed.

For example, the analyst wants to analyze the relationship between constellation and sales. The previous user’s birthday belongs to Aries, but the modified birthday belongs to Gemini, so the sales will all belong to Gemini after the dimension attribute is modified. So dimension designers use this approach only when necessary, and they need to inform downstream analysts.

The dimension table and fact table changes with the dimension value rewriting method as shown in the figure below:

2. Insert new dimension rows

While overwriting dimension values does not maintain dimension attribute changes, inserting new dimension rows saves and records changes by inserting new rows into the dimension table.

The fact table row before the property changes is associated with the old dimension value, while the new fact table row is associated with the new dimension value.



If we look closely at the changed dimension table, we can see that a new row of the user’s information is copied, with the only difference being the state (AZ before, CA after). Also, a closer look at the order fact table shows that the old order is associated with the old single row, while the new order is associated with the new dimensional row.

By adding dimension rows, we preserve the dimension changes and associate the pre-change and post-change facts with their respective old and new dimension values, respectively.

But this also creates confusion for dimension table users. Why does a query member find multiple rows in the dimension table? Although it can be explained to users, the cost of use and learning for users has undoubtedly increased, and the logic for data developers to deal with dimensional changes has undoubtedly become more complex.

3. Insert a new dimension column

In some cases, users may want to use both the before and after property values to analyze all the facts before and after the change. You can do this by inserting a new dimension column.

Instead of adding a row in the previous method, this method adds a new column, such as region_previous column for the previous size and region_current for the changed size. If there are multiple changes, you need to have multiple columns to store them.

In fact, all three approaches approach dimension change from different angles, and there are various other techniques that can be developed by combining these three approaches to deal with dimension change, which I won’t go into here.

Of course, no matter which technology is perfect in the era of big data, it has some processing complexity and learning cost.

What is the simplest and most straightforward way to solve dimensional changes? We’ll talk about snapshot technology later to address dimensional change in the age of big data.

Dimension hierarchy

Dimension hierarchy refers to the problem of dependency between attributes in a dimension table. For example, the categories of goods may be hierarchical (level 1, level 2 and level 3, especially for large FMCG enterprise groups such as P&G and Unilever), and the categories, brands and products are also hierarchical in fact. So how does dimensional modeling deal with these hierarchies?

There are actually two ways to deal with it:

  1. The first is to store all dimension hierarchies in one dimension table horizontally and redundantly. For example, the first to third categories of goods are stored in three fields respectively, and the processing of brands is similar.
  2. The second is to create a category dimension table and maintain parent-child relationships in the dimension table.

The first is actually a star architecture, and the second is a snowflake architecture. In dimension modeling, we use the first method to deal with the dimension hierarchy problem. Such de-normalization process sacrifices part of storage, but brings convenience to users and reduces the cost of learning and using.

The hierarchy of dimensions is often associated with drilling, which is the continuous digging of information.

Drilling can be divided into drilling up and drilling down. For example, for a retailer’s annual sales report, its annual sales show an increase of 20%, so which quarter has a higher growth rate from the perspective of time?

At this point, the growth rate of each quarter can be analyzed downward, and the same can be further analyzed to the monthly growth rate or even the daily growth rate. The same analysis can also be applied to categories, brands, etc., to analyze which category growth or which brand growth leads to the annual total sales growth of 20%. This is drilling down.

The opposite is drilling up, which in essence increases or decreases dimensions. Increasing dimensions (drilling down) from summary data into detail data, and decreasing dimensions (drilling up) from detail data into summary data. By drill-down, users gain a deeper understanding of the data, making it easier to identify problems and make the right decisions.

Dimensional consistency

In Kimball’s dimensional design theory, there is no physical data warehouse. A data warehouse is built gradually over multiple iterations of multiple topics and business processes, often logically divided into data marts.

The so-called data mart usually consists of one and more closely related fact tables and multiple dimension tables, usually at the departmental level or for a specific topic. Data warehouse is enterprise – class, subject – oriented, integrated data collection.

Physical logical data mart mix into the data warehouse, the establishment of a data mart is complete, if it step by step in the process of building data mart dimension table, then the data mart will become isolated bazaar, cannot logically combined into an integrated data warehouse, and dimensions of consistency is to solve the problem.

Dimensional consistency means that if two dimensions are related, they are either exactly the same or one is mathematically a subset of the other.

Inconsistencies include inconsistencies in dimension table contents as well as dimension attributes.

  • For an e-commerce company, for example, if the browsing subject domains related to commodity dimension table contains all of the enterprise information access to the goods, but for some reason pay Yi Yu goods missing part of its goods, clinch a deal could be done in other platforms), then analyzing these missing goods trade cannot be completed.
  • Similarly, if the two commodity attributes are different, such as date format, category division (it is possible that browsing is divided into categories of the day before yesterday, transaction is background class) and other inconsistent, then the cross analysis of category and date across browsing domain and trading domain cannot be carried out, because the category division is inconsistent.

Dimension consistency plays a key role for data mart into data warehouse, data mart design and development process, must ensure that the dimensions of consistency, concrete can be used in a Shared the same dimension table or one dimension table is a subset of the set of another dimension table to ensure consistency, to avoid the occurrence of isolated data mart.

Dimensional integration and unbundling

In actual dimension table design, sometimes the same dimension table comes from multiple foreground business systems, which will bring dimension integration and split.

For example, the system architecture of the mobile terminal trading system and the PC terminal trading system is completely inconsistent with the underlying database and table structure. In this case, the dimension integration problem exists.

In actual integration, the same dimension integration needs to consider the following issues:

  • Naming conventions: Ensure consistency and uniformity
  • Field type: Unified into a field type
  • Field encoding and meaning: The encoding and meaning should be integrated into a consistent way

The opposite of integration is unbundling

For large group company, sinopec, for example, its main business for the refined oil sales, but at the same time the sinopec gas station and quick retail store (at this point use only), their goods table fields and properties due to the different of the business is quite different (oil goods and retail sales of food, beverage, etc.).

Is there a need for a unified list of items at this point (which intuitively is not necessary because the business is so different)?

In dimensional modeling theory, there are usually two approaches to the above situation

  1. Build a base dimension table that contains the common attributes of these different businesses, and a separate dimension table for each business to contain its unique business attributes. For example, in the above example, a common commodity dimension table can be established to record common attribute fields such as commodity price and commodity description, while a commodity dimension table can be established to record unique commodity attributes of refined oil products such as oil label number (92, 95, 97, etc.). In addition, establish a dimension table of retail goods to record various commodity attributes of convenience stores (in practice, two separate dimension tables are usually established first, and then common commodity dimension tables or views are generated based on the separate dimension table)
  2. Split, that is, do not merge, that is, each business with unique differences will establish two completely independent dimension tables and manage their own dimension tables and attributes.

In practice, we generally prefer to split (i.e., not merge) and manage our dimension tables separately for businesses that are very different from each other, because coupling together does not bring much convenience and benefit. For businesses with large business similarity, the first method mentioned above can be adopted.