As businesses pay more attention to their data, you have probably handled many multidimensional analysis requirements. When evaluating technologies, you will find that many OLAP engines, such as Druid, ClickHouse, and StarRocks, are columnar storage databases. Today we will compare row storage with column storage.

Column storage and row storage by example

It is almost time to go home for the Spring Festival, and today we took a nucleic acid test, so we will use storing nucleic acid test records as the business scenario.

Example

  • For the application pages, a nucleic acid record needs to store the following fields:
  • Name, ID number, Testing institution, Testing time, Result, Price

Row storage

  • In a row store (e.g. MySQL), each record is one row in a table, stored as follows:
Name        | ID number | Testing institution | Testing time        | Result   | Price
Daniel      | 123512387 | Beijing             | 2021-12-24 12:12:45 | negative | 35
Dehua       | 213124157 | Shanghai            | 2021-12-22 12:12:45 | negative | 20
A passer-by | 213123145 | Henan               | 2021-12-21 12:12:45 | negative | 8
Dehua       | 213124157 | Guangzhou           | 2021-12-29 12:12:45 | negative | 23
Daniel      | 123512387 | Shanghai            | 2021-12-30 12:12:45 | negative | 20
  • Because data is stored by row, the fields of each record occupy contiguous space in physical storage, and the records are laid out one after another.
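
The row layout described above can be sketched as follows. This is only a minimal in-memory illustration, not how MySQL or any real engine lays out pages; the `row_layout` helper and the tuple representation are assumptions made for the example.

```python
# Hypothetical in-memory model of a row store (not a real engine's page format).
records = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
]


def row_layout(rows):
    """Flatten records row by row: all fields of one record are adjacent."""
    flat = []
    for row in rows:
        flat.extend(row)  # a new record is simply appended after the current data
    return flat


layout = row_layout(records)
print(layout[:6])  # the six fields of the first record, stored side by side
```

Appending a fourth record would only extend the end of `layout`, which is why inserts are cheap in this layout.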

Analysis of the advantages of row storage
  • In such a physical structure, because the space is contiguous, inserting a record is convenient: it only has to be appended after the current data.
  • Querying by record is also very convenient. For example, to query all of Daniel's nucleic acid records, the application page looks them up by Daniel's ID number.
  • The SQL is as follows (the table and column names are illustrative):
SELECT * FROM nucleic_acid_record WHERE id_number = '123512387';
  • The execution flow of this SQL is relatively clear:
1. Look up the physical address of the record in the index.
2. Read the data from the table's physical storage using that address.
  • So we can quickly get Daniel's nucleic acid records.
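
The two-step lookup above can be sketched like this. It is a simplified illustration under stated assumptions: the "physical address" is just a list position, and the `index` dictionary and `find_by_id` function are hypothetical names, not part of any real database.

```python
# Row store sketch: a list of rows plus a hypothetical index on the ID number.
rows = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
    ("Dehua", "213124157", "Guangzhou", "2021-12-29 12:12:45", "negative", 23),
    ("Daniel", "123512387", "Shanghai", "2021-12-30 12:12:45", "negative", 20),
]

# Build the index: ID number -> positions of the matching rows.
# The list position stands in for a physical address here.
index = {}
for position, row in enumerate(rows):
    index.setdefault(row[1], []).append(position)


def find_by_id(id_number):
    addresses = index.get(id_number, [])            # step 1: consult the index
    return [rows[address] for address in addresses]  # step 2: read the whole rows


daniel = find_by_id("123512387")
print(len(daniel))  # 2
```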
Analysis of row storage shortcomings
  • Now the business side raises a new requirement: they want to know how much Daniel has spent in total on nucleic acid tests.
  • For this requirement, the SQL is also simple: apply sum to the Price column (the table and column names are illustrative):
SELECT SUM(price) FROM nucleic_acid_record WHERE id_number = '123512387';
  • The execution flow of this SQL is also relatively clear:
1. Look up the physical address of the record in the index.
2. Read the data from the table's physical storage using that address.
3. Once all the data has been read, sum aggregates the result over the Price column.
  • Analysis: because a row store keeps each record in contiguous space, reading the physical storage still reads out all the fields, even though the requirement only needs sum(price).
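
To make the cost concrete, here is a small sketch that counts how many fields a row store touches for this aggregation. The `sum_price_row_store` function and the field counting are assumptions for illustration; a real engine reads pages, not Python tuples.

```python
rows = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
    ("Dehua", "213124157", "Guangzhou", "2021-12-29 12:12:45", "negative", 23),
    ("Daniel", "123512387", "Shanghai", "2021-12-30 12:12:45", "negative", 20),
]


def sum_price_row_store(id_number):
    """Sum the price of one person's records and count the fields read."""
    total = 0
    fields_read = 0
    for row in rows:
        if row[1] == id_number:      # a real system would find these rows via the index
            fields_read += len(row)  # the whole row is read, all six fields
            total += row[5]          # but only the price is actually needed
    return total, fields_read


total, fields_read = sum_price_row_store("123512387")
print(total, fields_read)  # 55 12
```

Only 2 of the 12 fields read are prices; the other 10 are read and discarded.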
Summary of advantages and disadvantages of row storage
  • Based on the analysis above, the advantages and disadvantages of row storage can be summarized as follows.
  • Advantages:
1. Contiguous space makes insert/update convenient.
2. Querying whole records is convenient.
  • Disadvantages:
1. Queries read many unnecessary columns.

Column storage

Name        | ID number | Testing institution | Testing time        | Result   | Price
Daniel      | 123512387 | Beijing             | 2021-12-24 12:12:45 | negative | 35
Dehua       | 213124157 | Shanghai            | 2021-12-22 12:12:45 | negative | 20
A passer-by | 213123145 | Henan               | 2021-12-21 12:12:45 | negative | 8
Dehua       | 213124157 | Guangzhou           | 2021-12-29 12:12:45 | negative | 23
Daniel      | 123512387 | Shanghai            | 2021-12-30 12:12:45 | negative | 20
  • In column storage, the physical structure for the same nucleic acid record table is different.
  • The values of each column are stored together. For example, the Name column keeps the Name values of all records together in contiguous space.
  • Different columns do not need to share contiguous space with each other, so many columnar databases use distributed storage and store each column separately.
  • Next, let's analyze data compression and the query execution process in column storage.
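
Before moving on, the column layout described above can be sketched as a simple transposition of the table. The field names in `FIELDS` and the `columns` dictionary are hypothetical names chosen for this example, not a real storage format.

```python
# Hypothetical field names for the six columns of the example table.
FIELDS = ("name", "id_number", "institution", "test_time", "result", "price")

rows = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
    ("Dehua", "213124157", "Guangzhou", "2021-12-29 12:12:45", "negative", 23),
    ("Daniel", "123512387", "Shanghai", "2021-12-30 12:12:45", "negative", 20),
]

# Column store sketch: one independent list per column.
# Values of the same column are adjacent; different columns are separate objects.
columns = {field: [row[i] for row in rows] for i, field in enumerate(FIELDS)}

print(columns["name"])   # ['Daniel', 'Dehua', 'A passer-by', 'Dehua', 'Daniel']
print(columns["price"])  # [35, 20, 8, 23, 20]
```

Because each list is independent, nothing requires them to live on the same disk or even the same machine.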
Column storage data compression
  • Many columnar databases compress data by means of a dictionary table.
  • Since the values of each column are stored together, it is easy to build a dictionary table by deduplicating each column. For example:
  • All the data of the Name column is as follows:
Daniel | Dehua | A passer-by | Dehua | Daniel
  • After deduplicating the values of this column, we build a dictionary table for the Name column. Ignoring the details of the construction algorithm and simply using an auto-increment ID, it looks as follows:
id | Name
1  | Daniel
2  | Dehua
3  | A passer-by
  • With the dictionary table built, the physical storage of the column can hold the IDs from the dictionary table instead of the actual values. The Name column is then stored as follows:
1 | 2 | 3 | 2 | 1
  • Similarly, all the data of the Price column is as follows:
35 | 20 | 8 | 23 | 20
  • After deduplicating the values of this column, we build a dictionary table for the Price column, again ignoring the construction algorithm and using an auto-increment ID:
id | Price
1  | 35
2  | 20
3  | 8
4  | 23
  • With the dictionary table, the Price column is stored as follows:
1 | 2 | 3 | 4 | 2
  • Once a column is stored as IDs in this way, data compression algorithms can be applied to compress the stored data.
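
The dictionary construction above can be sketched as follows. It is a minimal version assuming auto-increment IDs assigned in order of first appearance; the `dictionary_encode` function is a hypothetical helper, and real columnar databases use their own encodings.

```python
def dictionary_encode(values):
    """Deduplicate a column into a dictionary and replace each value by its ID.

    IDs start at 1 and are assigned in order of first appearance.
    """
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1  # auto-increment ID
        encoded.append(dictionary[value])
    return dictionary, encoded


name_dict, name_ids = dictionary_encode(
    ["Daniel", "Dehua", "A passer-by", "Dehua", "Daniel"]
)
price_dict, price_ids = dictionary_encode([35, 20, 8, 23, 20])

print(name_dict)  # {'Daniel': 1, 'Dehua': 2, 'A passer-by': 3}
print(name_ids)   # [1, 2, 3, 2, 1]
print(price_ids)
```

The saving comes from repeated values: a long name is stored once in the dictionary, and every further occurrence costs only a small integer.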
The query execution process for column stores
  • Now that we have dictionary tables, let's look at how a query on a column store is generally executed.
  • Business requirement: query the nucleic acid records for which Daniel paid 20 (the table and column names are illustrative):
SELECT * FROM nucleic_acid_record WHERE name = 'Daniel' AND price = 20;
  • For this SQL, the execution process is as follows:
1. Look up name = Daniel in the Name dictionary table to get id = 1:
id | Name
1  | Daniel
2  | Dehua
3  | A passer-by
2. Using Daniel's id, scan the Name column and build a bitmap: set the bit at each matching position to 1, otherwise 0.
3. Look up price = 20 in the Price dictionary table to get id = 2:
id | Price
1  | 35
2  | 20
3  | 8
4  | 23
4. Using the id of price 20, scan the Price column and build a bitmap: set the bit at each matching position to 1, otherwise 0.
5. Combine the two bitmaps with a bitwise AND. The positions whose bit is 1 in the result bitmap are the indexes, in every column, of the data to be queried. In this example only the 5th bit is 1, so the 5th value of every column is the data we want.
6. So we take out the 5th value of every column and assemble them into the record we need.
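
The six steps above can be sketched end to end as follows. This is a simplified illustration: bitmaps are plain Python lists of 0 and 1, only three columns are modeled, and the helper names `dictionary_encode` and `build_bitmap` are assumptions for the example rather than any database's API.

```python
name_column = ["Daniel", "Dehua", "A passer-by", "Dehua", "Daniel"]
institution_column = ["Beijing", "Shanghai", "Henan", "Guangzhou", "Shanghai"]
price_column = [35, 20, 8, 23, 20]


def dictionary_encode(values):
    """Assign auto-increment IDs (from 1) in order of first appearance."""
    dictionary, encoded = {}, []
    for value in values:
        dictionary.setdefault(value, len(dictionary) + 1)
        encoded.append(dictionary[value])
    return dictionary, encoded


def build_bitmap(encoded, target_id):
    """1 where the encoded column equals target_id, otherwise 0."""
    return [1 if value == target_id else 0 for value in encoded]


name_dict, name_ids = dictionary_encode(name_column)
price_dict, price_ids = dictionary_encode(price_column)

name_bits = build_bitmap(name_ids, name_dict["Daniel"])  # steps 1 and 2
price_bits = build_bitmap(price_ids, price_dict[20])     # steps 3 and 4
result_bits = [a & b for a, b in zip(name_bits, price_bits)]  # step 5: AND

# Step 6: fetch the matching positions from every column and assemble records.
positions = [i for i, bit in enumerate(result_bits) if bit]
records = [(name_column[i], institution_column[i], price_column[i]) for i in positions]

print(result_bits)  # [0, 0, 0, 0, 1]
print(records)      # [('Daniel', 'Shanghai', 20)]
```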
Analysis of column storage benefits
  • As discussed in the data compression section, column storage has certain advantages in data compression.
  • Each column can naturally serve as an index, with no additional data structure required, so any column can be indexed regardless of its data type.
  • Consider again the requirement of calculating how much Daniel spent on nucleic acid tests (the table and column names are illustrative):
SELECT SUM(price) FROM nucleic_acid_record WHERE id_number = '123512387';
  • Because the columns are stored separately, following the query process above we end up with a result bitmap. Once we have the positions whose bit is 1, we do not need to read all the columns: we just go to the Price column with those positions, fetch the corresponding values, and then perform the sum aggregation.
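
A minimal sketch of this aggregation is shown below. For brevity it assumes the Name column is already dictionary-encoded with Daniel as ID 1, filters on that column instead of the ID number, and keeps the Price column as raw values; these simplifications and the variable names are assumptions for illustration.

```python
# Encoded Name column (1 = Daniel) and raw Price column; no other column is touched.
name_ids = [1, 2, 3, 2, 1]
price_column = [35, 20, 8, 23, 20]
DANIEL_ID = 1

# Build the result bitmap from the filter column only.
result_bitmap = [1 if value == DANIEL_ID else 0 for value in name_ids]

# Read only the Price values at the positions whose bit is 1, then aggregate.
total = sum(price for price, bit in zip(price_column, result_bitmap) if bit)

print(result_bitmap, total)  # [1, 0, 0, 0, 1] 55
```

Compared with the row store, the institution, time, and result columns are never read at all.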
Analysis of column storage shortcomings
  • Because the columns are stored separately, operating on every column of a record (for example on insert or update) is not as convenient as with the contiguous space of row storage.
  • As the query process above shows, every query has to assemble the requested columns back into records.
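
Both shortcomings can be seen in a small sketch. The `insert` and `assemble` helpers below are hypothetical and only model the idea: one logical insert becomes one write per column, and a result row has to be rebuilt from several columns.

```python
# Column store sketch with three columns and two existing records.
columns = {
    "name": ["Daniel", "Dehua"],
    "institution": ["Beijing", "Shanghai"],
    "price": [35, 20],
}


def insert(record):
    """Append one record; returns how many separate column writes it took."""
    writes = 0
    for field in columns:
        columns[field].append(record[field])  # each column is written separately
        writes += 1
    return writes


def assemble(position, fields):
    """Rebuild part of a record by visiting each requested column."""
    return {field: columns[field][position] for field in fields}


writes = insert({"name": "A passer-by", "institution": "Henan", "price": 8})
row = assemble(2, ["name", "price"])

print(writes, row)  # 3 {'name': 'A passer-by', 'price': 8}
```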
Summary of advantages and disadvantages of column storage
  • Based on the analysis above, the advantages and disadvantages of column storage can be summarized as follows.
  • Advantages:
1. Data compression is more effective.
2. Any column can be indexed.
3. Only the columns involved in the query are read.
  • Disadvantages:
1. Every query requires reassembling the data of the queried columns.
2. Insert/update operations are more difficult.

That is all for column storage databases. You are welcome to discuss and to point out any mistakes in the article so that I can deepen my understanding. Wish you no bugs, thank you!