A quick word about column storage databases

As businesses pay more attention to the value of their data, I expect many of you have worked on multidimensional analysis requirements. When researching technology choices, you will find that many OLAP engines, such as Druid, ClickHouse, and StarRocks, are columnar storage databases. Today we will compare row storage with column storage.

Column storage and row storage by example

It is almost time to go home for the Spring Festival, and today we took a nucleic acid test, so let's use storing nucleic acid test records as the business scenario.

An example

For the application page, a nucleic acid test record needs to store the following fields:
Name, ID number, Testing institution, Testing time, Result, Price

Row storage

In a row store (e.g. MySQL), each record is one row in a table, stored as follows:

`Name`	`ID number`	`Testing institution`	`Testing time`	`Result`	`Price`
Daniel	123512387	Beijing	2021-12-24 12:12:45	negative	35
Dehua	213124157	Shanghai	2021-12-22 12:12:45	negative	20
A passer-by	213123145	Henan	2021-12-21 12:12:45	negative	8
Dehua	213124157	Guangzhou	2021-12-29 12:12:45	negative	23
Daniel	123512387	Shanghai	2021-12-30 12:12:45	negative	20

Because data is stored by row, the fields of each record are laid out together, and the records follow one another in contiguous space in physical storage.

Analysis of the advantages of row storage

With this physical layout the space is contiguous, so inserting a record is cheap: it is simply appended after the existing data.
Querying by record is also convenient. For example, to show all of Daniel's nucleic acid records, the application page queries by Daniel's ID number.
The SQL is as follows:

SELECT * FROM nucleic_acid_record WHERE id_number = '123512387';  -- table and column names are illustrative

The execution flow of this SQL is straightforward:

1. Look up the physical address of the matching records in the index.
2. Use that physical address to read the data from the table's physical storage.

So we can quickly get Daniel's nucleic acid records.
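The two steps above can be sketched in a few lines of Python. This is only a toy model, not how MySQL implements its indexes: the row positions in a list stand in for physical addresses, and `build_index` and `lookup` are hypothetical helper names.

```python
from collections import defaultdict

# Sample records from the table above: one tuple per row, fields kept together.
ROWS = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
    ("Dehua", "213124157", "Guangzhou", "2021-12-29 12:12:45", "negative", 23),
    ("Daniel", "123512387", "Shanghai", "2021-12-30 12:12:45", "negative", 20),
]
ID_NUMBER = 1  # position of the ID number field inside a row


def build_index(rows, field):
    """Map each value of one field to the positions of the rows holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[field]].append(position)
    return index


def lookup(rows, index, key):
    """Step 1: find row positions in the index. Step 2: read the whole rows."""
    return [rows[position] for position in index.get(key, [])]


id_index = build_index(ROWS, ID_NUMBER)
daniel_records = lookup(ROWS, id_index, "123512387")
for record in daniel_records:
    print(record)
```

Each hit returns a complete record in one read, which is exactly what a "show me this person's records" page needs.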

Analysis of row storage shortcomings

Now the business side raises a new requirement: they want to know how much money Daniel has spent on nucleic acid tests.
The SQL for this requirement is also simple: apply sum to the Price column.

SELECT SUM(price) FROM nucleic_acid_record WHERE id_number = '123512387';  -- table and column names are illustrative

The execution flow of this SQL is also straightforward:

1. Look up the physical address of the matching records in the index.
2. Use that physical address to read the data from the table's physical storage.
3. Once all the data has been read, aggregate the Price column with sum.

The catch: because row storage keeps each record in contiguous space, every field of every matching row is still read from physical storage, even though the query only needs sum(price).
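To make the wasted reads concrete, here is a small sketch that counts how many fields come back from a toy row store while summing one column. The function name `sum_price_row_store` and the hard-coded row positions (the ones an index lookup for Daniel would return) are assumptions for illustration.

```python
# Sample records from the table above: one tuple per row, fields kept together.
ROWS = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
    ("Dehua", "213124157", "Guangzhou", "2021-12-29 12:12:45", "negative", 23),
    ("Daniel", "123512387", "Shanghai", "2021-12-30 12:12:45", "negative", 20),
]
PRICE = 5  # position of the price field inside a row


def sum_price_row_store(rows, positions):
    """Sum the price of the given rows and count every field that was read."""
    total = 0
    fields_read = 0
    for position in positions:
        row = rows[position]      # the whole row comes back from storage
        fields_read += len(row)   # all six fields, not just the price
        total += row[PRICE]
    return total, fields_read


# Positions 0 and 4 are Daniel's rows.
total, fields_read = sum_price_row_store(ROWS, [0, 4])
print(total, fields_read)
```

Only two of the twelve fields read here contribute to the sum; the other ten are the overhead described above.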

Summary of advantages and disadvantages of row storage

Based on the analysis above, here are the advantages and disadvantages of row storage.
Advantages:

1. Contiguous space makes insert/update convenient.
2. Querying by record is convenient.

Disadvantages:

1. Many columns that the query does not need are read.

Column storage

`Name`	`ID number`	`Testing institution`	`Testing time`	`Result`	`Price`
Daniel	123512387	Beijing	2021-12-24 12:12:45	negative	35
Dehua	213124157	Shanghai	2021-12-22 12:12:45	negative	20
A passer-by	213123145	Henan	2021-12-21 12:12:45	negative	8
Dehua	213124157	Guangzhou	2021-12-29 12:12:45	negative	23
Daniel	123512387	Shanghai	2021-12-30 12:12:45	negative	20

In column storage, the same nucleic acid record table has a different physical layout.

Each column is stored together. For example, the Name values of all the records are stored together in contiguous space.
Different columns do not need to share contiguous space, so many columnar databases use distributed storage and store each column separately.
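As a rough illustration of that layout, the sketch below regroups the sample rows into one list per column. The field names and the `to_columns` helper are assumptions made for this example; real engines add encoding, blocks, and files on top of this idea.

```python
# Sample records from the table above, in row form.
ROWS = [
    ("Daniel", "123512387", "Beijing", "2021-12-24 12:12:45", "negative", 35),
    ("Dehua", "213124157", "Shanghai", "2021-12-22 12:12:45", "negative", 20),
    ("A passer-by", "213123145", "Henan", "2021-12-21 12:12:45", "negative", 8),
    ("Dehua", "213124157", "Guangzhou", "2021-12-29 12:12:45", "negative", 23),
    ("Daniel", "123512387", "Shanghai", "2021-12-30 12:12:45", "negative", 20),
]
FIELDS = ("name", "id_number", "institution", "tested_at", "result", "price")


def to_columns(rows, fields):
    """Regroup row tuples into one list per field, keeping row order."""
    return {field: [row[i] for row in rows] for i, field in enumerate(fields)}


columns = to_columns(ROWS, FIELDS)
print(columns["name"])
print(columns["price"])
```

The value at position i of every list belongs to the same original record, which is what later lets a query stitch rows back together.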
Next, let's look at data compression and the query execution process in column storage.

Column storage data compression

Many columnar databases compress data by means of a dictionary table.
Since each column is stored together, it is easy to build a dictionary table by deduplicating the column's values. For example,
all the data in the Name column is as follows:

Daniel | Dehua | A passer-by | Dehua | Daniel

After deduplicating these values, build a dictionary table for the Name column. The construction algorithm is not the point here, so we simply use auto-incrementing ids:

`id`	`Name`
1	Daniel
2	Dehua
3	A passer-by

With the dictionary table in place, the column store's physical storage holds the ids from the dictionary table instead of the actual values. The Name column is stored as follows:

1 | 2 | 3 | 2 | 1

Likewise, all the data in the Price column is as follows:

35 | 20 | 8 | 23 | 20

After deduplicating these values, build a dictionary table for the Price column, again using auto-incrementing ids:

`id`	`Price`
1	35
2	20
3	8
4	23

With the dictionary table in place, the Price column is stored as follows:

1 | 2 | 3 | 4 | 2

Stored this way, the columns can be compressed further with standard data compression algorithms.
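The dictionary construction described above can be sketched as follows. This assumes ids are handed out in first-seen order starting at 1, as in the tables; `dictionary_encode` is a hypothetical helper, and real databases use their own construction algorithms.

```python
def dictionary_encode(values):
    """Build a value -> id dictionary and replace each value by its id.

    Ids auto-increment from 1 in the order values are first seen.
    """
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
        encoded.append(dictionary[value])
    return dictionary, encoded


name_dict, name_ids = dictionary_encode(
    ["Daniel", "Dehua", "A passer-by", "Dehua", "Daniel"]
)
price_dict, price_ids = dictionary_encode([35, 20, 8, 23, 20])

print(name_dict)
print(name_ids)
print(price_dict)
print(price_ids)
```

The saving grows with repetition: a long column with few distinct values turns into a short dictionary plus a run of small integers.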

The query execution process for column stores

Now that we have dictionary tables, let's look at how a column store typically executes a query.
Business requirement: find the nucleic acid records where Daniel paid 20:

SELECT * FROM nucleic_acid_record WHERE name = 'Daniel' AND price = 20;  -- table and column names are illustrative

For this SQL, the execution process is as follows:

1. Look up name = Daniel in the Name dictionary table, which gives id = 1.

`id`	`Name`
1	Daniel
2	Dehua
3	A passer-by

2. Compare that id against the Name column and build a bitmap: set the bit to 1 at each position that matches, otherwise 0. For the stored column 1 | 2 | 3 | 2 | 1, the bitmap is 1 | 0 | 0 | 0 | 1.

3. Look up price = 20 in the Price dictionary table, which gives id = 2.

`id`	`Price`
1	35
2	20
3	8
4	23

4. Compare that id against the Price column and build a bitmap in the same way. The records priced at 20 are the 2nd and the 5th, so the bitmap is 0 | 1 | 0 | 0 | 1.

5. AND the two bitmaps. In the resulting bitmap, each position whose bit is 1 is the index, in every column, of a record that satisfies the query. Here the result is 0 | 0 | 0 | 0 | 1, so the 5th value of every column is the data we want.

6. Take the 5th value of every column and assemble them into the record we need.
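The six steps can be sketched in Python as below. This is a simplified model under stated assumptions: bitmaps are plain lists of 0/1, only three of the six columns are kept for the assembly step, and `match_bitmap`, `bitmap_and`, and `assemble` are hypothetical helper names rather than any database's API.

```python
# Dictionary tables and encoded columns from the walkthrough above.
NAME_DICT = {"Daniel": 1, "Dehua": 2, "A passer-by": 3}
PRICE_DICT = {35: 1, 20: 2, 8: 3, 23: 4}
NAME_IDS = [1, 2, 3, 2, 1]
PRICE_IDS = [1, 2, 3, 4, 2]

# Decoded columns, used only for the final assembly step (subset of fields).
COLUMNS = {
    "name": ["Daniel", "Dehua", "A passer-by", "Dehua", "Daniel"],
    "institution": ["Beijing", "Shanghai", "Henan", "Guangzhou", "Shanghai"],
    "price": [35, 20, 8, 23, 20],
}


def match_bitmap(dictionary, encoded, value):
    """Steps 1-2 (or 3-4): resolve the value to its id, then mark matches."""
    target = dictionary.get(value)  # None if the value never occurs
    return [1 if item == target else 0 for item in encoded]


def bitmap_and(left, right):
    """Step 5: a position survives only if both predicates matched it."""
    return [a & b for a, b in zip(left, right)]


def assemble(columns, bitmap):
    """Step 6: pull the surviving positions out of every column."""
    hits = [i for i, bit in enumerate(bitmap) if bit]
    return [{field: values[i] for field, values in columns.items()} for i in hits]


name_bitmap = match_bitmap(NAME_DICT, NAME_IDS, "Daniel")
price_bitmap = match_bitmap(PRICE_DICT, PRICE_IDS, 20)
result_bitmap = bitmap_and(name_bitmap, price_bitmap)
records = assemble(COLUMNS, result_bitmap)
print(result_bitmap)
print(records)
```

Note that the filtering works entirely on the small integer ids; the original values are touched only when the final records are assembled.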

Analysis of column storage benefits

As discussed above, column storage has an advantage when it comes to data compression.
Every column is naturally an index: no additional data structure is needed to index a column, so any column can be used for filtering regardless of its data type.
Now back to the requirement of working out how much Daniel spent on nucleic acid tests:

SELECT SUM(price) FROM nucleic_acid_record WHERE id_number = '123512387';  -- table and column names are illustrative

Because the columns are stored separately, the query process above ends with a result bitmap. Once we have the positions whose bit is 1, we do not need to read all the columns: we take those positions to the Price column, fetch the values there, and then run the sum aggregate.
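A minimal sketch of that aggregation, assuming the result bitmap for Daniel has already been built and that the Price column is dictionary-encoded as shown earlier (`sum_by_bitmap` is a hypothetical helper):

```python
# Price dictionary (id -> value) and the encoded Price column.
PRICE_BY_ID = {1: 35, 2: 20, 3: 8, 4: 23}
PRICE_IDS = [1, 2, 3, 4, 2]

# Result bitmap for "records belonging to Daniel": the 1st and 5th records.
DANIEL_BITMAP = [1, 0, 0, 0, 1]


def sum_by_bitmap(bitmap, encoded, id_to_value):
    """Sum one column at the positions whose bit is 1; no other column is read."""
    return sum(id_to_value[item] for bit, item in zip(bitmap, encoded) if bit)


total = sum_by_bitmap(DANIEL_BITMAP, PRICE_IDS, PRICE_BY_ID)
print(total)  # 35 + 20 = 55
```

Compared with the row store, the name, institution, time, and result values are never touched.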

Analysis of column storage shortcomings

Because the columns are stored separately, operations that touch every column of a record are not as convenient as with the contiguous space of row storage.
As the query process above shows, every query ends with a step that assembles the requested columns back into records.
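The write-side cost can be sketched the same way. In this toy model (two dictionary-encoded columns only, with a hypothetical `insert_record` helper), adding one record means one write per column and possibly a dictionary update, instead of a single append.

```python
def insert_record(columns, dictionaries, record):
    """Append one record to a dictionary-encoded column store.

    Returns the number of separate column arrays that had to be written.
    """
    writes = 0
    for field, value in record.items():
        dictionary = dictionaries[field]
        if value not in dictionary:
            # A value never seen before also grows that column's dictionary.
            dictionary[value] = len(dictionary) + 1
        columns[field].append(dictionary[value])
        writes += 1
    return writes


columns = {"name": [1, 2, 3, 2, 1], "price": [1, 2, 3, 4, 2]}
dictionaries = {
    "name": {"Daniel": 1, "Dehua": 2, "A passer-by": 3},
    "price": {35: 1, 20: 2, 8: 3, 23: 4},
}

writes = insert_record(columns, dictionaries, {"name": "Daniel", "price": 8})
print(writes, columns)
```

With six columns, possibly on different machines, the same insert would touch six places, which is why column stores tend to favor batch loading over single-row writes.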

Summary of advantages and disadvantages of column storage

Based on the analysis above, here are the advantages and disadvantages of column storage.
Advantages:

1. Data compresses well.
2. Any column can serve as an index.
3. Only the columns involved in the query are read.

Disadvantages:

1. Every query requires reassembling data from the queried columns.
2. Insert/update operations are difficult.

That is all for column storage databases. You are welcome to discuss and to point out anything wrong in the article so that I can deepen my understanding. Wish you no bugs, thank you!
