Abstract:

With a clear dimension definition, you can also look at the structure of the database in a clearer way. This is a common E-R graph: the E-R graph is a mesh structure, and the foreign key relationship between entities (tables) is directly drawn on the graph. When there are too many entities, the graph will be very messy, and the association line is very random. Any two entities may be associated, showing a high degree of data structure coupling.

With a clear dimension definition, you can also look at the structure of the database in a clearer way.

Here is our common E-R diagram:



The e-R graph is a mesh structure, and the foreign key relationship between entities (tables) is directly drawn on the graph. When there are too many entities, the graph will be very messy, and the association line is very random. Any two entities may be associated, showing a high degree of data structure coupling. When adding or deleting an entity, you have to consider all the other entities associated with it, and it is likely that missing associations or circular associations will occur.

If the dimensions are extracted, we can use a bus-like structure diagram:



All dimensions are listed separately in a central position, entities (tables) are only associated with dimensions, there is no direct correlation between entities, and the degree of coupling of data structures looks very low. Adding and deleting entities does not affect other entities, and does not cause missing associations and duplicate associations.

However, it is important to point out. Whether it is an E-R graph or a graph, as long as it is drawn correctly, the number of correlation lines is about the same, which is determined by the relationship of the data itself. The graph is no less correlating than the e-R, but it is clearer when you change the way you look at it.

In order to provide associative query capability, some BI products expose association relationships between tables (equivalent to a partial E-R graph) directly to business people. This is not a good approach, as business people have difficulty understanding the E-R graph, and the usability of this solution is poor. Usability would be greatly improved if reasonable associations could be automatically established by business people after data items (fields) were selected.

With the concept of dimensions, this can be achieved to some extent.

After the business person randomly selects fields, we can find the tables in which those fields reside, find same-dimensional fields between those tables (primary key preferred), and then use those same-dimensional fields to form a JOIN relationship. When only a unique field on a table is in the same dimension as the primary key field on another table, a JOIN relationship based on these two fields is correct most of the time. Moreover, when the data structure is not particularly complex, the condition that only a single field has the same dimension between the two tables can often be met. At this time, the correct association can be established automatically only based on the data items. Some BI products do indeed do this.

However, this approach does not deal with self-association with tables, multiple same-dimensional fields between tables, and multiple recursive associations. To fully solve the problem, you still need to implement association based on DQL syntax.

In the above discussion, we will JOIN fields of the same dimension as we find them. DQL syntax does the same, as long as fields of the same dimension can be joined. Does such a JOIN necessarily make business sense?

Yes, as long as the field is in the same dimension, JOIN can always come up with reasonable business sense. Conversely, only fields of the same dimension can JOIN. The JOIN of fields of different dimensions has no business significance. However, SQL does not prohibit the JOIN as long as the data type is the same. A field is equivalent to a JOIN that has business meaning, and DQL ensures this in this respect.

In DQL, GROUP BY always corresponds to ON. In other words, GROUP BY always corresponds to a dimension. In fact, grouping operations on measures makes no business sense, but SQL does not explicitly define dimensions and measures and does not prohibit such operations. DQL ensures that no meaningless grouping occurs.

Using this feature, the performance of grouping operation can be improved. The possible values of dimensions are determined by the length of the dimension table, and the dimension table is known in advance, so that grouping can be accelerated by means similar to radix sorting, of course, the sorting operation for dimensions can also be used in this way. However, the details of this algorithm are less relevant to this topic and will not be explained here.


The original post was published on January 11, 2018

Author: Jiang Buxing

This article is from the cloud community partner “Datapai THU”. For relevant information, you can follow the wechat public account of “Datapai THU”

If you find any content suspected of plagiarism in our community, you are welcome to send an email to [email protected] to report and provide relevant evidence. Once verified, our community will immediately delete the content suspected of infringement.

The original link