Moment For Technology

Cloud native data warehouse solution -- 1

Posted on Oct. 3, 2022, 2:14 a.m. by 張承翰
Category: reading Tag: Big data
What is the essential difference between a database and a data warehouse? Mingqi's answer - Zhihu

Author: Mingqi. The copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.

A database (Oracle, MySQL, PostgreSQL) is mainly used for transaction processing; a data warehouse (Amazon Redshift, Hive) is mainly used for data analysis.

Different uses determine the different characteristics of the two architectures.

The characteristics of a Database are:

  • Relatively complex table structure, relatively compact storage structure, less redundant data.
  • Optimized for both reads and writes.
  • Relatively simple read/write queries, each working on a relatively small amount of data.

The characteristics of a data warehouse are:

  • Relatively simple (denormalized) table structure, relatively loose storage structure, more redundant data.
  • Generally optimized for reads only.
  • Relatively complex read queries, each working on a relatively large amount of data (historical data).

Take a book catalog system as an example. Stored in a database, its tables would look like this:

Database table structure

There are six tables here, covering authors, books, book genres, publishers and the relationships between them.

If we store the above data in a data warehouse, the table design requires denormalizing the original tables.

Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. - Denormalization - Wikipedia

Now let's take the tables above, with books.title as the primary key, and denormalize them into the data warehouse. The script below joins the other five tables onto BOOKS.

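-- Denormalization Script --
-- Note: the head of the SELECT list and the id columns in the join conditions
-- were lost from this copy of the script. b.title, b.copyright, g.name, p.name
-- and the b.id / g.id / a.id / p.id columns below are reconstructed assumptions.
select
  b.title as Title,
  b.copyright as Copyright,
  g.name as Genre,
  a.firstname as AuthorFirstName,
  a.lastname as AuthorLastName,
  a.gender as AuthorGender,
  p.name as PublisherName
from BOOKS b
left join GENRE g on b.genre = g.id
left join AUTHOR_BOOK_MAP abm on b.id = abm.bookid
left join AUTHORS a on a.id = abm.authorid
left join PUBLISHER_BOOK_MAP pbm on b.id = pbm.bookid
left join PUBLISHER p on p.id = pbm.publisherid;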

Then there is only one table left in the data warehouse, as shown below.

Datawarehouse table structure
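As a rough, runnable illustration of the denormalization step, the sketch below uses Python's built-in sqlite3 module in place of a real database and data warehouse. The sample rows, the table name BOOKS_WIDE and the id/name columns are assumptions made for this example; they are not taken from the original figures.

```python
import sqlite3

def build_normalized(conn):
    """Create the six normalized tables with a few hypothetical rows."""
    conn.executescript("""
        CREATE TABLE GENRE (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE BOOKS (id INTEGER PRIMARY KEY, title TEXT,
                            copyright INTEGER, genre INTEGER);
        CREATE TABLE AUTHORS (id INTEGER PRIMARY KEY, firstname TEXT,
                              lastname TEXT, gender TEXT);
        CREATE TABLE AUTHOR_BOOK_MAP (bookid INTEGER, authorid INTEGER);
        CREATE TABLE PUBLISHER (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE PUBLISHER_BOOK_MAP (bookid INTEGER, publisherid INTEGER);

        INSERT INTO GENRE VALUES (1, 'Programming');
        INSERT INTO BOOKS VALUES (1, 'Java Complete', 1999, 1),
                                 (2, 'SQL Basics', 2005, 1);
        INSERT INTO AUTHORS VALUES (1, 'Ann', 'Lee', 'F'), (2, 'Bob', 'Ray', 'M');
        INSERT INTO AUTHOR_BOOK_MAP VALUES (1, 1), (1, 2), (2, 1);
        INSERT INTO PUBLISHER VALUES (1, 'Acme Press'), (2, 'Beta Books');
        INSERT INTO PUBLISHER_BOOK_MAP VALUES (1, 1), (1, 2), (2, 1);
    """)

def denormalize(conn):
    """Flatten the six tables into one wide table (BOOKS_WIDE is a made-up name)."""
    conn.execute("""
        CREATE TABLE BOOKS_WIDE AS
        SELECT b.title AS Title, b.copyright AS Copyright, g.name AS Genre,
               a.firstname AS AuthorFirstName, a.lastname AS AuthorLastName,
               a.gender AS AuthorGender, p.name AS PublisherName
        FROM BOOKS b
        LEFT JOIN GENRE g ON b.genre = g.id
        LEFT JOIN AUTHOR_BOOK_MAP abm ON b.id = abm.bookid
        LEFT JOIN AUTHORS a ON a.id = abm.authorid
        LEFT JOIN PUBLISHER_BOOK_MAP pbm ON b.id = pbm.bookid
        LEFT JOIN PUBLISHER p ON p.id = pbm.publisherid
    """)

def row_count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = sqlite3.connect(":memory:")
build_normalized(conn)
denormalize(conn)
books_rows = row_count(conn, "BOOKS")
wide_rows = row_count(conn, "BOOKS_WIDE")
print(books_rows, wide_rows)
```

With this sample data, a book that has two authors and two publishers is stored once in BOOKS but four times in BOOKS_WIDE, one row per author/publisher combination.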

Storage Space Comparison

Denormalization has an obvious cost: whenever the relationship between a primary table and a secondary table is not one-to-one, the resulting data warehouse table inevitably repeats data from one of them. From the perspective of storage space, then, a data warehouse holds a large amount of redundant, repeated data compared with the compact storage structure of a database.

Database tables are compact

Data warehouse tables contain a lot of redundancy

Read/write optimization comparison

  • Comparison of basic Read operations

The figure below shows two types of query: one looks up information about a single book (by PrimaryKey), the other looks up all works of an author (non-key). Because the database has to follow the associations between tables to collect the data it needs, it is relatively inefficient. The data warehouse, in contrast, has already converted these associations into repeated records in a single table, so the query is relatively efficient. In effect, the data warehouse trades storage space for query efficiency.

To write this query against the database, we need to understand the structure of the tables involved and how they relate. That is not very friendly for reporting or data analysis, especially if the table structure is complex (for example, tables that store a logical tree). In such cases the simple denormalized table structure of the data warehouse makes generating data reports much easier.
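The difference in query shape can be sketched with sqlite3. This is only an illustration of how the two queries are written, not a benchmark; the reduced schema, the sample rows and the BOOKS_WIDE table name are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized side (database): three tables linked by ids.
    CREATE TABLE BOOKS (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE AUTHORS (id INTEGER PRIMARY KEY, lastname TEXT);
    CREATE TABLE AUTHOR_BOOK_MAP (bookid INTEGER, authorid INTEGER);
    INSERT INTO BOOKS VALUES (1, 'Java Complete'), (2, 'SQL Basics'), (3, 'Go Notes');
    INSERT INTO AUTHORS VALUES (1, 'Lee'), (2, 'Ray');
    INSERT INTO AUTHOR_BOOK_MAP VALUES (1, 1), (2, 1), (3, 2);

    -- Denormalized side (warehouse): one wide table, author repeated per book.
    CREATE TABLE BOOKS_WIDE AS
    SELECT b.title AS Title, a.lastname AS AuthorLastName
    FROM BOOKS b
    JOIN AUTHOR_BOOK_MAP abm ON b.id = abm.bookid
    JOIN AUTHORS a ON a.id = abm.authorid;
""")

def titles_by_author_normalized(conn, lastname):
    """The analyst must know the mapping table and both join conditions."""
    rows = conn.execute("""
        SELECT b.title FROM AUTHORS a
        JOIN AUTHOR_BOOK_MAP abm ON abm.authorid = a.id
        JOIN BOOKS b ON b.id = abm.bookid
        WHERE a.lastname = ? ORDER BY b.title
    """, (lastname,))
    return [r[0] for r in rows]

def titles_by_author_wide(conn, lastname):
    """The same question against the wide table is a single filter."""
    rows = conn.execute(
        "SELECT Title FROM BOOKS_WIDE WHERE AuthorLastName = ? ORDER BY Title",
        (lastname,))
    return [r[0] for r in rows]

print(titles_by_author_normalized(conn, "Lee"))
print(titles_by_author_wide(conn, "Lee"))
```

Both functions return the same titles; the second needs no knowledge of how the tables relate.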

In addition, reporting and analysis often involve large-scale queries. These are likely to be CPU-intensive and may interfere with the database's regular read and write operations, because databases are often single-instance (as discussed below). The multi-instance structure of a data warehouse largely avoids this problem.

  • Comparison of big data Read operations

When the volume of data is very large, the read optimizations of a data warehouse begin, under certain conditions, to outperform the database by a wide margin. Most databases are single-instance, while a data warehouse is a multi-instance distributed system. The data warehouse assigns rows to storage nodes according to the PrimaryKey/PartitionKey. At query time it locates the relevant nodes from the value of the query key and runs a large number of queries in parallel. This is a great advantage when querying big data.

However, the data warehouse does not have the advantage for every read. For example, it does not perform as well as a database in the following cases:

  1. When reading a small amount of data, the overall efficiency of the data warehouse is lower than that of the database, because the data warehouse first has to perform preliminary steps such as locating the node.
  2. If the read does not target a PrimaryKey or a PartitionKey, the data warehouse also has to scan globally, and it is hard to say whether it is more efficient than the database.

These two factors are part of the reason why SQL databases have not been replaced, even though data warehouse products as powerful as Amazon Redshift exist.
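The routing behaviour described above can be modelled with a toy in-memory sketch. The class TinyWarehouse, its method names and the choice of a CRC32 hash are assumptions for illustration; a real data warehouse distributes and scans data in its own way, and scans the nodes in parallel rather than in a loop.

```python
import zlib

class TinyWarehouse:
    """Toy model: rows are spread over nodes by a hash of the partition key (title)."""

    def __init__(self, num_nodes):
        self.nodes = [[] for _ in range(num_nodes)]

    def node_for(self, title):
        # crc32 is stable across runs, unlike Python's built-in hash() on str
        return zlib.crc32(title.encode("utf-8")) % len(self.nodes)

    def insert(self, row):
        self.nodes[self.node_for(row["title"])].append(row)

    def find_by_title(self, title):
        """Partition-key lookup: locate the node first, then read only that node."""
        node = self.nodes[self.node_for(title)]
        hits = [r for r in node if r["title"] == title]
        return hits, 1  # (rows found, nodes visited)

    def find_by_author(self, author):
        """Non-key lookup: the key gives no hint, so every node is scanned."""
        hits = [r for node in self.nodes for r in node if r["author"] == author]
        return hits, len(self.nodes)

wh = TinyWarehouse(num_nodes=4)
wh.insert({"title": "Java Complete", "author": "Lee"})
wh.insert({"title": "SQL Basics", "author": "Lee"})
wh.insert({"title": "Go Notes", "author": "Ray"})

by_title, nodes_for_title = wh.find_by_title("Java Complete")
by_author, nodes_for_author = wh.find_by_author("Lee")
print(nodes_for_title, nodes_for_author)
```

The lookup by title pays the small fixed cost of computing the node location, which is the overhead mentioned in point 1; the lookup by author has to touch all nodes, as in point 2.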

  • Comparison of Write operations

In most cases, data warehouses do not perform fine-grained writes. Because there are so many redundant rows, changing even one small field can touch a large number of rows. In a database, writes can be precise and efficient because of the compact table structure. For example, suppose I need to change the copyright year of the book "Java Complete" from 1999 to 2002: I only need to update one row in the database, but five rows in the data warehouse.
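The one-row versus five-row update can be reproduced in a small sqlite3 sketch. The five author names and the table name BOOKS_WIDE are made up so that the book appears in five denormalized rows; the original figures may use different data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized table: the book is stored exactly once.
conn.execute("CREATE TABLE BOOKS (id INTEGER PRIMARY KEY, title TEXT, copyright INTEGER)")
conn.execute("INSERT INTO BOOKS VALUES (1, 'Java Complete', 1999)")

# Denormalized table: the same book is repeated once per (hypothetical) author.
conn.execute("CREATE TABLE BOOKS_WIDE (Title TEXT, Copyright INTEGER, AuthorLastName TEXT)")
conn.executemany(
    "INSERT INTO BOOKS_WIDE VALUES ('Java Complete', 1999, ?)",
    [(name,) for name in ("Lee", "Ray", "Kim", "Diaz", "Cho")],
)

def set_copyright(conn, table, year, title):
    """Run the same logical update and report how many rows it had to modify."""
    cur = conn.execute(
        f"UPDATE {table} SET copyright = ? WHERE title = ?", (year, title))
    return cur.rowcount

db_rows = set_copyright(conn, "BOOKS", 2002, "Java Complete")
dw_rows = set_copyright(conn, "BOOKS_WIDE", 2002, "Java Complete")
print(db_rows, dw_rows)
```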

Data warehouse writes are usually either a refresh of a whole segment (table) or an insert of a whole segment, which follows from their purpose: data analysis. Because of whole-table refreshes and distributed storage, we can keep historical data over a period of time by setting the PartitionKey to the creation/update time of the data. This matters a great deal for data analysis and for using data to make decisions.
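One way to picture time-based partitioning is a table whose partitions are keyed by snapshot date, where each full refresh writes a new partition and leaves older ones in place. The class SnapshotTable and its methods below are hypothetical names for a minimal in-memory sketch, not the API of any particular data warehouse.

```python
from datetime import date

class SnapshotTable:
    """Toy table partitioned by the date on which each full refresh was taken."""

    def __init__(self):
        self.partitions = {}  # snapshot date -> list of rows

    def full_refresh(self, snapshot_date, rows):
        """Write a whole new partition; older partitions are left untouched."""
        self.partitions[snapshot_date] = [dict(r) for r in rows]

    def as_of(self, snapshot_date):
        """Return the latest snapshot taken on or before the given date."""
        eligible = [d for d in self.partitions if d <= snapshot_date]
        return self.partitions[max(eligible)] if eligible else []

table = SnapshotTable()
table.full_refresh(date(2022, 1, 1), [{"title": "Java Complete", "copyright": 1999}])
table.full_refresh(date(2022, 2, 1), [{"title": "Java Complete", "copyright": 2002}])

# The old value is still available for analysis after the refresh.
print(table.as_of(date(2022, 1, 15)))
print(table.as_of(date(2022, 2, 1)))
```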

Although this article compares databases and data warehouses, it does not aim to conclude that one is better than the other. In practice, many storage architectures use databases and data warehouses together. A simple example of such an architecture is described below.

Common related architectures and Data Catalog

The usual software architecture is simply that the user interacts with the database through an API.

Here, if you want to do data analysis, data monitoring and other tasks on the database directly, there will be the following problems:

  • Data analysis usually involves a large number of data queries, which can take up too much CPU and affect the basic functionality of the software.
  • The table structure of a database is usually quite complex, which requires data analysts to have a deep understanding of the DB structure.
  • Databases are inefficient when performing large numbers of data queries.
  • Opening database access (even if it is only read access), especially to outside groups, is a security risk.

To solve these four problems, we can use scripts to denormalize all the data in the database into the data warehouse at regular intervals and run the data analysis in the data warehouse. Given the characteristics of the data warehouse described above (independent of the service, simple table structure, fast reads, relatively safe), these four problems are well addressed.

But there is one small problem. If many different groups need to share the data warehouse, their scripts may interfere with each other. To solve this problem, we introduce the concept of a Data Catalog.

A data catalog is a metadata management tool designed to help organizations find and manage large amounts of data -- including tables, files and databases -- stored in their ERP, human resources, finance and e-commerce systems as well as other sources like social media feeds.

Metadata is stored in a data catalog and then published, so that the data warehouses of different groups can synchronize this data. Each group's data warehouse thus receives the same denormalized data, independently of the others.

Such an architecture largely decouples the transactional database from the data warehouse used for analysis. At the same time, it makes the whole system more scalable.
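The publish-and-synchronize idea can be sketched very loosely as follows. DataCatalog, TeamWarehouse and their methods are invented for this illustration and do not correspond to any specific data catalog product; the point is only that each group works on its own synchronized copy.

```python
import copy

class DataCatalog:
    """Toy catalog: holds published dataset entries (schema plus denormalized rows)."""

    def __init__(self):
        self.entries = {}

    def publish(self, name, schema, rows):
        self.entries[name] = {"schema": list(schema), "rows": [dict(r) for r in rows]}

class TeamWarehouse:
    """Toy per-group warehouse that pulls its own copy of a published dataset."""

    def __init__(self, team):
        self.team = team
        self.tables = {}

    def sync(self, catalog, name):
        # Deep copy so that one group's later changes never reach another group.
        self.tables[name] = copy.deepcopy(catalog.entries[name])

catalog = DataCatalog()
catalog.publish(
    "books_wide",
    schema=["Title", "Copyright"],
    rows=[{"Title": "Java Complete", "Copyright": 2002}],
)

analytics = TeamWarehouse("analytics")
monitoring = TeamWarehouse("monitoring")
analytics.sync(catalog, "books_wide")
monitoring.sync(catalog, "books_wide")

# One group's script reshapes its copy; the other group's copy is unaffected.
analytics.tables["books_wide"]["rows"].clear()
print(len(monitoring.tables["books_wide"]["rows"]))
```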
