Abstract:For tables with frequent update and delete operations, there will be a large amount of garbage data, resulting in a waste of disk space and additional IO overhead during query scanning. It is necessary to perform periodic vacuum-cleaning operations to control the expansion of row storage tables and indexes on the table. This paper will briefly introduce the principle of VACUUM and the factors that affect the effect of VACUUM.

This article is shared from the Huawei cloud community “why vacuum back watch or continue to expand?” , originally written by: Da Wei Tian Lung :-.

Vacuum introduction:

For row storage tables in GAUSSDB, the old version of the data still exists after the tuple is updated or deleted, and only the transaction number (Xmax) for deletion or update is marked in the tuple header. For tables with frequent update and delete operations, there will be a large amount of garbage data, resulting in a waste of disk space and additional IO overhead during query scanning. It is necessary to perform periodic vacuum-cleaning operations to control the expansion of row storage tables and indexes on the table.

Internal principle of VACUUM operation:

Main steps of VACUUM:

1. Remove the dead tuple and Frozen the old tuple that meets the conditions.

2. Remove index tuples that point to dead tuples and update FSM and VM files for corresponding tables

  • FSM: Free Space Map free space map file, based on which to select the appropriate page when inserting data.
  • VM: Visibility map visibility map file, which will be used to select whether to scan a page according to the following VACUUM file, so as to improve VACUUM efficiency; This file is also used during index-only-scan to improve the efficiency of visibility judgment).

Update statistics pg_stat_all_tables. LinePointer is not removed for reuse at a later time. The schematic diagram is as follows:

Before the vacuum

After vacuum

Why does the vacuum watch continue to expand

Factors affecting the effect of VACUUM

1. The Oldestxmin push vacuum only cleans up garbage data generated by transactions prior to the globally alive Oldestxmin. So if old transactions are still present (such as long transactions or long SQL), garbage generated by new transactions is not immediately cleared by VACUUM. Such as:

Session 1: Create a new table ROW_TBL and insert a new table. Start a transaction without committing

gaussdb=# create table row_tbl(a int, b int);
CREATE TABLE
gaussdb=# insert into row_tbl values(1,1);
INSERT 0 1
gaussdb=# begin;
BEGIN
gaussdb=# SELECT txid_current_snapshot();
 txid_current_snapshot
-----------------------
 210115:210115:
(1 row)
gaussdb=# SELECT txid_current();
 txid_current
--------------
       210115
(1 row)

Session 2: After removing data, do VACUUM cleaning operation, insert data again, and the data does not reuse the previous space; Query the view to find that garbage data is not being cleaned up.

gaussdb=# select ctid,* from row_tbl; Ctid | a | b -- -- -- -- -- - + - + - (0, 1) | | 1 row (1) gaussdb = # delete row_tbl; DELETE 1 gaussdb=# SELECT txid_current_snapshot(); txid_current_snapshot ----------------------- 210115:210122: (1 row) gaussdb=# vacuum row_tbl; VACUUM GAUSSDB =# insert into row_tbl values(2,2); VACUUM GAUSSDB =# insert into row_tbl values(2,2); INSERT 0 1 gaussdb=# select ctid,* from row_tbl; Ctid | a | b -- -- -- -- -- - + - + - (0, 2) | | 2 (row 1) gaussdb = # 2 select n_dead_tup, last_vacuum from pg_stat_all_tables where relname='row_tbl'; N_dead_tup | last_vacuum -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 1 | 2021-06-10 20:04:58. 987631 + 8 (1 row)

Session 1: Run VACUUM after the transaction ends in session 1. Query view can find no dead tuples and insert data can find old space reused (i.e., space where CTID is (0,1)).

gaussdb=# SELECT txid_current_snapshot(); txid_current_snapshot ----------------------- 210136:210136: (1 row) gaussdb=# vacuum row_tbl; VACUUM gaussdb=# select n_dead_tup, last_vacuum from pg_stat_all_tables where relname='row_tbl'; N_dead_tup | last_vacuum -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 0 | 2021-06-10 20:09:10. 516564 + 8 (1 row) Gaussdb =# insert into row_tbl values(3,3); INSERT 0 1 gaussdb=# select ctid,* from row_tbl; Ctid | a | b -- -- -- -- -- - + - + - (0, 1) (0, 2) | | 3 | 3 | 2 (2 rows)

2, LinePointer state has not been in unused

After a tuple is deleted, it is only if VACUUM sets the LinePointer(or item pointer to a specific tuple) of a tuple to a state of LP_unused that the LinePointer will be reused for new inserted data.

3. Relying on the FSM file to select the available Page when the insert data has not been generated by the FSM will result in using the new Page instead of reusing the old Page.

In older versions of GaussDB, when a table was inserted in bulk, a new page was directly requested to insert data. Therefore, in some scenarios, although VACUUM cleaves dirty data after vacuum, the control effect of VACUUM on expansion is not ideal because the business scene is mainly about bulk insertion. Multiplexing of space for bulk inserts of data is already supported.

Some Suggestions and Conclusions

  1. Try to avoid long transactions. Use the view PG_RUNNING_XACTS to see if there are any old transactions left unfinished or two-phase transactions left
  2. Regularly do VACUUM to recycle garbage space in time
  3. An inflated index can be reduced by reindex.
  4. VACUUM can clean up junk data, but it cannot return this space to the operating system. For inflated tables, VACUUM FULL can only be used to reduce the size.

For more information about GuassDB(DWS), welcome to WeChat search “GaussDB DWS” pay attention to WeChat public number, and share with you the latest and most complete PB series silo black technology ~

Click on the attention, the first time to understand Huawei cloud fresh technology ~