PostgreSQL series organized from The Internals of PostgreSQL and other series of articles, from The fragmented reading to systematic learning, feel a deeper understanding of The database; By analogy, mutual verification, is also conducive to master MySQL and other relational databases or NoSQL databases.

PostgreSQL series concurrency control and transaction mechanism

Concurrency control is designed to ensure Consistency and Isolation in ACID for scenarios where transactions are parallel in a database. The three main concurrency control technologies in database technology are: Multi-version Concurrency Control (MVCC), Strict Two-Phase Locking (S2PL), and Optimistic Concurrency Control (OCC), There are also many variations of each technology. In MVCC, each write creates a new version on top of the old version and preserves the old version. When a transaction needs to read data, the database system selects from all versions the version that meets the transaction isolation level requirements. The biggest advantage of MVCC is that reads do not block writes, and writes do not block reads; On systems like S2PL, write transactions acquire an exclusive lock beforehand, which blocks read transactions.

RDBMSS such as PostgreSQL and Oracle actually use the so-called Snapshot Isolation (SI) variant of MVCC technology. Oracle introduced additional Rollback Segments, in which data from older versions is written into Rollback Segments and then overwritten to the actual data blocks when new data is written. PostgreSQL uses a relatively simple implementation. New data objects are inserted directly into the associated Table Page. When reading table data, PostgreSQL uses Visibility Check Rules to select the appropriate version.

SI avoids three anomalies defined in the ANSI SQL-92 standard: Dirty Reads, non-repeatable Reads, and Phantom Reads. Serializable Snapshot Isolation (SSI), introduced after version 9.1, provides true sequential read and write capabilities.

Isolation Level Dirty Reads Non-repeatable Read Phantom Read Serialization Anomaly
READ COMMITTED Not possible Possible Possible Possible
REPEATABLE READ Not possible Not possible Not possible in PG; See Section 5.7.2. (Possible in ANSI SQL) Possible
SERIALIZABLE Not possible Not possible Not possible Not possible

The Tuple structure

Transaction ID

When a Transaction is started, PostgreSQL’s built-in Transaction Manager assigns it a unique Transaction ID(TXID). Txid is a 32-bit untyped integer value. We can use the txid_current() function to get the current txID:

testdb=# BEGIN;
BEGIN
testdb=# SELECT txid_current();
 txid_current
--------------
          100
(1 row)
Copy the code

PostgreSQL also reserves three key TXID values for special marking: 0 for an invalid TXID and 1 for a startup TXID, used only when the Database Cluster is started; 2 represents the Frozen TXID, which is used when serializing transactions. PostgreSQL chooses numeric type as TXID to facilitate comparison; For transactions with txID of 100, all transactions less than 100 are in the past and visible; All transactions greater than 100 are in the future, invisible.

Since the number of Txids in a real system may need to exceed the maximum, PostgreSQL actually treats these TXids as rings.

HeapTupleHeaderData

Heap Tuples in Table Pages usually contain three parts: HeapTupleHeaderData structure, NULL bitmap, and user data.

The attributes of HeapTupleHeaderData that are strongly related to transaction processing are:

  • (TransactionId)t_xmin: specifies the TXID used when the Tuple is inserted
  • (TransactionId)t_xmax: specifies the txID of the Tuple to be deleted or updated
  • (CommandId) t_CID: Stores the Command ID, the number of all SQL commands executed within the transaction that created the Tuple; Such asBEGIN; INSERT; INSERT; INSERT; COMMIT;This transaction, if it is the Tuple created by the first INSERT command, has a T_CID value of 0, and the second one is 1
  • (ItemPointerData) t_cTID: When a Tuple is updated, it points to the newly created Tuple, otherwise it points to itself

Insert, delete, and update Tuple

As mentioned above, Tuples in Table Pages have the following layout:

insert

When an insert is performed, PostgreSQL inserts a new Tuple directly into a page of the target table:

If a transaction with txID 99 inserts a new Tuple, the Tuple’s header field is set to the following value:

  • T_xmin is consistent with the TXID of the transaction that created the Tuple, which is 99
  • T_xmax is set to 0 because it has not been deleted or updated
  • T_cid is set to 0 because the Tuple was created by the first Insert command in the transaction
  • T_ctid is set to(0, 1)“That is, pointing to oneself
testdb=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
testdb=# CREATE TABLE tbl (data text);
CREATE TABLE
testdb=# INSERT INTO tbl VALUES('A');
INSERT 0 1
testdb=# SELECT lp as tuple, t_xmin, t_xmax, t_field3 as t_cid, t_ctid
                FROM heap_page_items(get_raw_page('tbl', 0)); Tuple | t_xmin | t_xmax | t_cid | t_ctid -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + -- -- -- -- -- -- -- -- 1 | 99 | | | 0 0 (0, 1)Copy the code

delete

In a delete operation, the target Tuple is logically deleted first, that is, t_xmax is set to the TXID value of the transaction in which the Tuple is being deleted.

When a transaction is committed, PostgreSQL marks the Tuple as a Dead Tuple and vacuums it completely.

update

PostgreSQL logically removes the latest Tuple and inserts a new Tuple:

The row shown above is inserted by a transaction with TXID 99 and updated twice consecutively by a transaction with TXID 100; After the transaction commits, Tuple_2 and Tuple_3 are marked as Dead Tuples.

Free Space Map

When a Heap Tuple or Index Tuple is inserted, PostgreSQL uses the FSM of the associated table to determine which Page should be selected for the specific insertion. Each FSM node stores information about the remaining space for tables or index files. You can view the information in the following ways:

testdb=# CREATE EXTENSION pg_freespacemap;
CREATE EXTENSION

testdb=# SELECT *, round(100 * avail/8192 ,2) as "freespace ratio"
                FROM pg_freespace('accounts'); Blkno | the avail | freespace thewire -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 0 96.00 1 | 7520 | | 7904 | | 2 | 3 | 87.00 7136 91.00 7136 | | 7136 | 87.00 87.00 5 | 7136 | 87.00...Copy the code

Commit Log

PostgreSQL uses Commit logs, also known as CLOGs, to store transaction status. Clogs are stored in Shared Memory and play an important role in the entire transaction life cycle. PostgreSQL defines four different transaction states: IN_PROGRESS, COMMITTED, ABORTED, and SUB_COMMITTED.

Clogs are composed of multiple 8KB pages in Shared Memory, which logically represent an array-like structure. The subscript of the array is the TXID of the associated transaction, and the value is the current transaction state:

PostgreSQL automatically creates a new page if the current TXID exceeds the maximum range that the current CLOG page can hold. When PostgreSQL is stopped or the Checkpoint process is running, clog data is stored persistently in the pg_xact subdirectory named 000000000001. The maximum size of a file is 256KB. When PostgreSQL restarts, files stored in the PG_xact directory are reloaded into memory. As PostgreSQL continues to run, cloG is bound to accumulate a lot of outdated or useless data. Vacuum will also remove such useless data.

The Transaction the Snapshot | Transaction snapshots

Xmin :xmax:xip_list ‘; xmin:xmax:xip_list ‘; For example, “100:100:” means that all transactions with txIDS less than or equal to 99 are deactivated, while transactions with TXids greater than or equal to 100 are active.

testdb=# SELECT txid_current_snapshot();Txid_current_snapshot -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 100:104-100102 (1 row)Copy the code
  • Xmin: The earliest txID that is still active, and all transactions prior to that are either visible after being committed or in a suspended state after being rolled back.
  • Xmax: the first transaction number that has not yet been assigned. All transactions whose TXID is greater than or equal to this value have not yet occurred relative to the transaction to which this snapshot belongs and are therefore invisible.
  • Xip_list: Active txids at snapshot time. Only txids between xmin and xmax will be included.

Take 100:104:100,102 as an example, and the schematic diagram is as follows:

Transaction snapshots are provided primarily by the Transaction Manager, and at the READ COMMITTED isolation level, a Transaction is assigned to a snapshot regardless of whether an SQL command is executed. For REPEATABLE READ or SERIALIZABLE isolation level transactions, only when the first SQL statement is executed will a transaction snapshot be assigned for visibility detection. The significance of transaction snapshot is that when a snapshot is used for visibility judgment, regardless of whether the target transaction has been committed or abandoned, as long as it is marked as Active in the snapshot, it will be treated as an IN_PROGRESS transaction.

The transaction manager always saves information about the currently running transaction. Suppose three transactions are started one after another, and the isolation level of Transaction_A and Transaction_B is READ COMMITTED, and the isolation level of Transaction_C is REPEATABLE READ.

  • T1:

    • Transaction_A starts and executes the first SELECT command. When the first command is executed, Transaction_A requests the TXID and snapshot for the moment. In this case, the transaction manager allocates txID 200 and returns transaction snapshot ‘200:200:’.
  • T2:

    • Transaction_B starts and executes the first SELECT command. The transaction manager allocates txID 201 and returns transaction snapshot ‘200:200:’ because Transaction_A (TXID 200) is in progress. Therefore, Transaction_A cannot be seen from Transaction_B.
  • T3:

    • Transaction_C starts and executes the first SELECT command. The transaction manager assigns txID 202 and returns transaction snapshot ‘200:200:’, so Transaction_A and Transaction_B cannot be seen from Transaction_C.
  • T4:

    • Transaction_A has been committed. The transaction manager deletes information about this transaction.
  • T5:

    • Transaction_B and Transaction_C execute their respective SELECT commands.
    • Transaction_B requires a transaction snapshot because it is at the READ COMMITTED level. In this case, Transaction_B gets a new snapshot ‘201:201:’ because Transaction_A (TXID 200) has committed. Therefore, Transaction_B is no longer invisible in Transaction_B.
    • Transaction_C does not require a transaction snapshot because it is at REPEATABLE READ level and uses the snapshot obtained, i.e. ‘200:200:’. Therefore, Transaction_A is still invisible to Transaction_C.

The Visibility Check | Visibility detection

Rules | visibility detection Rules

The rules for visibility detection are used to determine whether a Tuple is visible relative to a transaction based on its T_xmin and t_xmax, CLOG, and its assigned transaction snapshot.

T_xmin corresponds to a transaction whose state is ABORTED

A Tuple is never visible when the transaction’s t_xmin value is ABORTED:

* / / / / * t_xmin status = ABORTED Rule 1: If the status (t_xmin) = ABORTED ⇒ Invisible Rule 1: IF t_xmin status is 'ABORTED' THEN RETURN 'Invisible' END IFCopy the code

T_xmin corresponds to a transaction whose status is IN_PROGRESS

The Tuple is never visible to tuples associated with transactions other than the one into which it was inserted; Visible only to tuples that belong to the same transaction as the Tuple (the Tuple has not been deleted or updated at this time).

/* t_xmin status = IN_PROGRESS */ IF t_xmin status is 'IN_PROGRESS' THEN IF t_xmin = current_txid THEN // Rule 2: If Status(t_xmin) = IN_PROGRESS ∧ T_xmin = current_TXID ∧ T_xmax = INVAILD Visible Rule IF t_xmax = INVALID THEN RETURN 'Visible' // Rule 3: If Status(t_xmin) = IN_PROGRESS ∧ T_xmin = current_txID ∧ T_xmax ≠ INVAILD ELSE /* this tuple has been deleted or updated by the current transaction itself. */ RETURN 'Invisible' END IF // Rule 2: If Status(t_xmin) = IN_PROGRESS ∧ T_xmin ≠ current_TXID ELSE /* t_xmin ≠ current_TXID */ RETURN 'Invisible' END IF END IFCopy the code

T_xmin Specifies the COMMITTED transaction status

The Tuple is visible in most cases, except when the Tuple is updated or deleted.

/* T_xmin status = COMMITTED */ IF T_xmin status is 'COMMITTED' THEN // IF status (t_xmin) = COMMITTED ∧ Snapshot(t_xmin) Invisible Rule 5: IF T_xmin is active in the obtained transaction snapshot THEN RETURN 'Invisible' // IF Status(T_xmin) = COMMITTED ∧ (t_xmax = INVALID ∨ Status(t_xmax) = ABORTED) ELSE IF t_xmax = INVALID OR status of t_xmax is 'ABORTED' THEN RETURN 'Visible' ELSE IF t_xmax status is 'IN_PROGRESS' THEN // If Status(T_xmin) = COMMITTED ∧ Status(T_xmax) = IN_PROGRESS ∧ T_xmax = current_TXID IF T_xmax = current_TXID THEN RETURN 'Invisible' // IF Status(T_xmin) = COMMITTED ∧ Status(T_xmax) = IN_PROGRESS ∧ ID T_xmax ≠ CURRENT_TXID Visible Rule 8: ELSE /* T_xmax ≠ current_TXID */ RETURN 'Visible' END IF ELSE IF T_xmax status is 'COMMITTED' THEN // IF status (t_xmin) "= COMMITTED ∧ Status(T_xmax) = active ∧ Snapshot(t_xmax) IF T_xmax is active in the obtained transaction snapshot THEN RETURN 'Visible' // IF Status(T_xmin) = COMMITTED ∧ Status(T_xmax) = COMMITTED ∧ Snapshot(T_xmax) ≠ Active Invisible Rule 10: ELSE RETURN 'Invisible' END IF END IF END IFCopy the code

Visibility detection process

Take a simple two-transaction update and query as an example:

In the figure above, the isolation level of txID 200 transaction is READ COMMITED and txID 201 transaction is READ COMMITED or REPEATABLE READ.

  • When executing the SELECT command at time T3:

According to Rule 6, only Tuple_1 is visible:

[x] [x] [X] [X] [X] [X] [X FROM tbl; name -------- Jekyll (1 row) testdb=# -- txid 201 testdb=# SELECT * FROM tbl; name -------- Jekyll (1 row)Copy the code
  • When executing the SELECT command at time T5:

For txID 200 transactions, Tuple_1 is visible and Tuple_2 is not:

# Rule7(Tuple_1): Status(T_xmin :200) = COMMITTED ∧ Status(T_xmax :200) = IN_PROGRESS ∧ T_xmax :200 = CURRENT_TXID :200
# Rule2(Tuple_2): Status(t_xmin:200) = IN_PROGRESS ∧ t_xmin:200 = current_txid:200 ∧ t_xmax = INVAILD ⇒ Visible

testdb=# -- txid 200
testdb=# SELECT * FROM tbl;
 name
------
 Hyde
(1 row)
Copy the code

Tuple_1 is visible and Tuple_2 is invisible for txID 201 transactions:

# Rule8(Tuple_1): Status(T_xmin :199) = COMMITTED ∧ Status(T_xmax :200) = IN_PROGRESS ∧ T_xmax :200 ≠ CURRENT_TXID :201 Visible
# Rule4(Tuple_2): Status(t_xmin:200) = IN_PROGRESS ∧ T_xmin :200 ≠ current_TXID

testdb=# -- txid 201
testdb=# SELECT * FROM tbl;
  name
--------
 Jekyll
(1 row)
Copy the code
  • When executing the SELECT command at time T7:

Txid 200: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1: Tuple_1 And Tuple_2 is visible:

# Rule10(Tuple_1): Status(T_xmin :199) = COMMITTED ∧ Status(T_xmax :200) = COMMITTED ∧ Snapshot(T_xmax :200) ≠ Active Invisible
# Rule6(Tuple_2): Status(T_xmin :200) = COMMITTED ∧ T_xmax = INVALID

testdb=# -- txid 201 (READ COMMITTED)
testdb=# SELECT * FROM tbl;
 name
------
 Hyde
(1 row)
Copy the code

If the transaction of TXID 201 is in the isolation level of REPEATABLE READ and the transaction snapshot is 200:200:, the transaction of TXID 200 must be treated as IN_PROGRESS. So Tuple_1 is visible and Tuple_2 is not:

# Rule9(Tuple_1): Status(T_xmin :199) = COMMITTED ∧ Status(T_xmax :200) = COMMITTED ∧ Snapshot(T_xmax :200) = Active Visible
# Rule5(Tuple_2): Status(T_xmin :200) = COMMITTED ∧ Snapshot(t_xmin:200) = active shadow

testdb=# -- txid 201 (REPEATABLE READ)
testdb=# SELECT * FROM tbl;
  name
--------
 Jekyll
(1 row)
Copy the code

Preventing Lost Updates | to avoid Lost Updates

Lost Update, aka write conflict, occurs when two transactions Update the same row at the same time; In PostgreSQL, both REPEATABLE READ and SERIALIZABLE levels need to avoid this exception.

(1)  FOR each row that will be updated by this UPDATE command
(2)       WHILE true

               /* The First Block */
(3)            IF the target row is being updated THEN
(4)	              WAIT for the termination of the transaction that updated the target row

(5)	              IF (the status of the terminated transaction is COMMITTED)
   	                   AND (the isolation level of this transaction is REPEATABLE READ or SERIALIZABLE) THEN
(6)	                       ABORT this transaction  /* First-Updater-Win */
	              ELSE
(7)                           GOTO step (2)
	              END IF

               /* The Second Block */
(8)            ELSE IF the target row has been updated by another concurrent transaction THEN
(9)	              IF (the isolation level of this transaction is READ COMMITTED THEN
(10)	                       UPDATE the target row
	              ELSE
(11)	                       ABORT this transaction  /* First-Updater-Win */
	              END IF

               /* The Third Block */
                ELSE  /* The target row is not yet modified or has been updated by a terminated transaction. */
(12)	              UPDATE the target row
                END IF
           END WHILE
      END FOR
Copy the code

In the above process, the UPDATE command iterates through each row to be updated, and when it is found that the row is being updated by another transaction, it goes into a wait state until the row is unlocked. If the row is already updated and the isolation level is REPEATABLE or SERIALIZABLE, the update is abandoned.

Being updated means that the row is updated by another concurrent transaction and its transaction has not yet terminated. Because PostgreSQL’s SI uses a first-upater -win scheme, in this case the current transaction must wait for the transaction that updates the target row to terminate. Suppose the transactions Tx_A and Tx_B are running at the same time, and Tx_B tries to update rows; But Tx_A has updated it and is still in progress, and Tx_B waits for Tx_A to terminate. After updating the transaction committed by the target row, the update operation for the current transaction continues. If the current transaction is at the READ COMMITTED level, the target row is updated; Otherwise REPEATABLE READ or SERIALIZABLE, the current transaction is immediately aborted to prevent lost updates.

Space to sort out

PostgreSQL’s concurrency control mechanism also relies on the following maintenance processes:

  • Remove Tuples and Index Tuples marked Dead
  • Remove obsolete parts of cloG
  • Freeze old Txids
  • Update FSM, VM, and other statistics

If a Tuple_1 transaction is inserted into a txID 100 transaction, then the value of t_xmin corresponding to the Tuple is 100. The server then ran for a long time, unchanged during Tuple_1. Tuple_1 will not be visible until the txID is 2^31 + 101. Tuple_1 will not be visible until the txID is 100. Tuple_1 will not be visible until the txID is 100.

To solve this problem, PostgreSQL introduces the so-called frozen TXID (frozen TXID) and sets the FREEZE process to handle the problem specifically. As mentioned above, TXID 2 is a reserved value, representing tuples that are frozen, and these tuples are always inactive and visible. The FREEZE process is also called by the Vacuum process, which scans all table files and sets the T_xmin field of tuples whose differences from the current TXID are greater than the vacuum_freeze_min_age definition to 2. After 9.4, the XMIN_FROZEN bit in the T_INFomask field is set to indicate that the Tuple is frozen.

read

  • If you want to get to the bottom of distributed systems, distributed computing, distributed storage, databases, operating systems, virtualization, and more, see The bottom of Distributed Infrastructure, DistributedSystem CheatSheet, Database CheatSheet, Linux CheatSheet, MySQL CheatSheet, Docker CheatSheet, Flink CheatSheet, Kafka CheatSheet, etc.

  • For more information on distributed systems, virtualization scheduling, databases, distributed storage, distributed computing, and operating systems: Docker List, Kubernetes List, Linux List, HTTP List, Distributed System List, Blockchain List, Flink List, Kafka List, Database List, MySQL List, PostgreSQL List, etc.