0 x00 preface

The topic of this article is a discussion of normalization and de-normalization of data models, which is a common conflict between the design of dimensional modeling and the ease of business use.

0 x01 discussion

Question:

When designing data tables, is it better to have one wide table or multiple dimensional tables?

Answer a:

The construction of each table in data warehouse mainly depends on the function and relevant significance of this table in the whole data warehouse. First, it should be clear which problems the table exists to solve, which roles to use, and how to ensure the best possible user experience to solve the problem. Looking at the problem from the perspective mentioned above, unpack the following factors:

  1. How difficult is it to write the query SQL of multiple data tables in the case of disassembling tables? Is it necessary to associate multiple tables for data extraction, and to know the association between each table in advance? If there are many people using this data, everyone needs to understand the association of multiple tables before conducting data query. In this way, will the cost of dimensional communication be high, the query experience will be reduced and the user’s work efficiency will be affected?
  2. How often are multiple table associated queries used? Is it better to simplify things that repeat frequently?
  3. In the query experience, the query performance after the association of multiple tables needs to be considered. If the contents of a table are excessive, does the query speed be affected?
  4. The rationality of multi-table association, different data dimensions and content and order table association, whether there will be a pit against common sense. For example, whether the mapping of data fields is one-to-one or many-to-many, and whether it will cause users to ignore filtering constraints when querying data.
  5. Data security issues, the security range of each data table is different, merged into the same table is faced with greater access. For example, an order sheet may only be available to a subset of people and not to suppliers.

Answer two:

Combined with some experience of our company, our company will use data for various reports of different topics and latitudes, and will also use data for data mining to make models. Therefore, data division is definitely necessary. According to report data according to the report of a class of different reverse table divided into different latitude, this way is actually the mysql business library data after SQL statements to generate one or more Zhang Weidu table, according to the experience in this will extract a often use fields as public fields into public layer data, some measurements will often need to use to smoke Take to the scale, so some non-developers to see the data as long as the page simply write a few SQL statements can be statistical data, such as monthly sales, weekly sales, daily sales these.

If the machine learning model students need data, we only need to extract data from dimension tables, degree tables and fact tables to make large and wide tables for them. Due to the small number of models, we have less experience with large and wide tables, so we can only write A SEPARATE SQL statement to extract model data.

0 x02 supplement

The problem is, essentially. I want to discuss normalization and de-normalization in data model design.

From a normalization point of view, the designers of the data warehouse want to be as formal as possible, because this reduces data redundancy and makes it easier to extend the model. From the point of view of de-normalization, the users of data warehouse want to use as convenient as possible. They are not concerned with standardization or redundancy, as long as it is convenient to use.

This situation is very common in the workplace, so how to solve it? Here are two ideas:

  1. Both ways. Although this may seem to take up more storage space, it is a suitable solution because wide tables are concatenated from other tables, so the storage cycle of wide tables can be shorter.
  2. Only multiple dimension tables are stored, and wide tables are created through views. This method is suitable for situations where the number of queries for a wide table is small. For example, in Hive, wide tables are only calculated and imported to Es systems for other systems to query. There is no need to store a wide table for a long time. You can encapsulate a wide table directly through views.

In addition, the design of data warehouse is often not to calculate a few tables to end, we should provide data services, let users access our data through the way of services, rather than simply expose the table. Both ease of use and security are easier to satisfy when we provide data as a data service.

0 XFF summary

Thanks to Joker and Alan for the answer, thanks to Rebie for the collation, thanks to muto koji for the summary (thanks to myself, ^_^).

The article structure of the DataTalk series is generally divided into three parts:

  • The first part is a small prologue by lay people, which roughly defines the theme of the text
  • The second part is the main part of the discussion, lay people will summarize and sort out the content of the discussion, try to ensure the original flavor.
  • The third part is the lay people’s summary, subjective relatively strong, can be regarded as their own understanding.

Generally speaking, each article corresponds to an Issue on GitHub. For example, the address of this discussion is github.com/dantezhao/d…

WeChat pay

  • The author: Mudong Koshi
  • Links to this article: www.mdjs.info/2018/04/06/…
  • Copyright Notice: All articles on this blog are licensed under a CC BY-NC-SA 3.0 license unless otherwise stated. Reprint please indicate the source!