Series: MySQL series columns

The InnoDB data page

We briefly mentioned the concept of a page earlier. A page is the smallest unit of disk that the InnoDB storage engine manages for a database. The size of a page is typically 16KB. Read at least one page of data to memory at a time, or refresh one page of data to disk.

In this section we will focus on the page that holds the data record, which is the data page of type INDEX.

Data page structure

The data page consists of seven parts, roughly as shown below:

The sizes of File Header, Page Header and File Trailer are fixed, which are 38, 56 and 8 bytes respectively. The User Records, Free Space, and Page Directory sections are the actual row record storage Space, so the size is dynamic.

Record header information

The record header information was briefly mentioned in the previous article, but before moving on, take a closer look at what is stored in the record header information.

The line record looks like this:

  • delete_mask

This property marks whether the current record has been deleted, with a value of 1 indicating that the record has been deleted and a value of 0 indicating that the record has not been deleted.

As you can see, when a record is deleted, it is only marked deletion and has not actually been removed from the page. The main purpose of this is to reuse the storage space of deleted records if new records are inserted into the table later.

  • min_rec_mask

The flag is added to the smallest record in non-leaf nodes at each level of the B+ tree and set to 1, or 0 otherwise.

If you look at the index structure below, the lowest leaf node holds the real data, so min_rec_mask is 0 for each record. The top two layers are non-leaf nodes, so the min_rec_mask for the smallest record in the leftmost part of each page is set to 1.

  • n_owned

Indicates the number of records currently owned by the record. The data in the page can also be divided into multiple groups, each of which has a maximum record. The largest record n_owned records the number of records in the group. You’ll see what this property is used for later in the Page Directory section.

  • heap_no

This property represents the current record position on the page.

  • record_type

Record type: 0 indicates common record, 1 indicates non-leaf node record of B+ tree, 2 indicates minimum record, 3 indicates maximum record, and 1xx indicates reserved record

Again, the record_type records in the non-leaf nodes of the upper two layers should be 1. The lowest leaf node should be the normal record, record_type 0. Each page also has a minimum record (Infimum) and a maximum record (Supremum) of type 2 and 3, respectively.

  • next_record

Represents the address offset from the real data in the current record to the real data in the next record, or 0 if there is no next record.

The records in the data page look like the following, in primary key order, heAP_NO records the current record location on the page, and is then connected via next_record.

Note that next_record points to the positional offset between the record header and the data. Reading from this position to the left is the record header information, and reading from the right is the real data. In addition, as mentioned before, variable-length field length list and NULL value list are stored in reverse order. In this case, the identifier read from the left corresponds to the column read from the right, which improves the reading efficiency.

If one of the records is deleted, delete_mask is set to 1, marked as deleted, and next_record is set to 0. Records that are deleted from a page form a garbage linked list through next_record, which can be reused when records are inserted later.

File Header

File Header Is used to record some Header information of a page. It consists of eight parts and occupies 38 bytes.

Take a look at the following information:

  • FIL_PAGE_SPACE_OR_CHKSUM

This represents the current page’s checksum. Whenever a page is modified in memory, its checksum is computed before synchronization. This checksum is the first thing written to disk when a page is flushed to disk.

  • FIL_PAGE_OFFSET

Each page has a separate page number, which InnoDB uses to uniquely locate a page.

If the size of an independent tablespace A. ibd is 1GB and the default page size is 16KB, there are 65536 pages in total. FIL_PAGE_OFFSET indicates the position of the page in all pages. If the ID of this table space is 10, the search page (10, 1) represents the search for the second page in table A.

  • FIL_PAGE_PREVFIL_PAGE_NEXT

InnoDB stores data in pages. InnoDB tables are indexed tables, and data is stored in primary key order. Data may be stored in multiple discrete pages, where the previous and next pages are linked by FIL_PAGE_PREV and FIL_PAGE_NEXT to form a bidirectional linked list. This allows many pages to be connected through a bidirectional linked list without the pages being physically connected.

  • FIL_PAGE_TYPE

This represents the type of the current page. InnoDB designs many different types of pages for different purposes.

InnoDB has the following page types:

Page Header

Page Header Records the status information of the data Page. It consists of 14 parts and occupies 56 bytes.

  • PAGE_N_DIR_SLOTS

The records in the Page are grouped by primary key. Each group is assigned to a Slot. PAGE_N_DIR_SLOTS records the number of slots in the Page Directory.

  • PAGE_HEAP_TOP

PAGE_HEAP_TOP Records the address of Free Space so that Space can be quickly allocated from Free Space to User Records.

  • PAGE_N_HEAP

The number of records on this page, including minimum records (Infimum) and maximum records (Supremum) and records marked for deletion (delete_mask=1).

  • PAGE_FREE

Deleted records are connected to a single linked list through next_record. The record space in this single linked list can be reused. PAGE_FREE points to the first record address marked for deletion, which is the head node of the single linked list.

  • PAGE_GARBAGE

Total number of bytes occupied by records marked as deleted.

  • PAGE_N_RECS

The number of records on this page, excluding minimum and maximum records and records marked for deletion, is distinguished from PAGE_N_HEAP.

Supremum and Infimum

InnoDB has two virtual row records per data page to define record boundaries. A Infimum record is a record that is smaller than any primary key value on the page, and a Supremum record is a record that is larger than any primary key value on the modified page. These two records are created when the page is created and will not be deleted under any circumstances.

And since these two Records are not our own defined Records, they are not stored in the User Records section of the page, they are placed separately in a section called Infimum + Supremum.

Both Infimum and Supremum consist of a 5-byte record header and an 8-byte fixed part. The fixed part of the smallest record is the word Infimum, and the fixed part of the largest record is the word Supremum. Since there are no variable-length fields or nullable fields, there are naturally no variable-length field lists and no NULL value lists.

The structure of the Infimum and Supremum records is shown below. Note that the Infimum record header record_type=2 indicates the minimum record. Supremum Record header record_type=3, which indicates the maximum record.

After adding the Infimum and Supremum records, the records on the page look like the following figure. The next_record of the Infimum record header points to the record with the smallest primary key on the page, and the next_record of the record with the largest primary key on the page points to the Supremum. The Infimum and Supremum form the record boundary. Note also that Infimum and Supremum are the first in the heAP_NO order in the record header.

User Records and Free Space

User Records is the part that actually stores row Records, and Free Space is obviously Free Space.

At the beginning of page generation, there is no User Records part. Every time a record is inserted, a Space of record size will be applied from the Free Space part to the User Records part. When the Space of the Free Space part is used up, the page will also be used up.

Page Directory

First of all, InnoDB’s data is organized by index. The B+ tree index itself cannot find a specific record, but only the page where the record is located. The page is the smallest basic unit of data storage.

In the figure below, if we want to find the row with ID=32, we can only find page 17 through the index.

After locating the page, we can find the record ID=32 by looking down the linked list through the next_record in the record header of Infimum.

But as you can imagine, the performance of lookups along a linked list is very low. So, the data in the page is actually divided into multiple groups, which appears to form a subdirectory that can narrow the scope of the query and improve query performance.

A Page Directory is a section that stores a number of slots. Records in a Page are divided into groups. Slots store the relative position of the largest record in each group. The number of records in this group is indicated by n_owned in the record header of the largest record.

The number of records in a group is specified: The number of records in the Infimum group is 1 to 8, the number of records in the Supremum group is 1 to 8, and the number of records in other groups is 4 to 8.

The Page Directory generation process is as follows:

  • Initially there are only Infimum and Supremum records on a data page, and they belong to two groups. There are two slots in the Page Directory that point to these two records, each with n_owned equal to 1.

  • After each new record is inserted, the system finds the slot in the page directory whose primary key value is greater than that of the current record and whose difference is the smallest. Then, the n_owned value of the corresponding record is increased by 1, indicating that another record is added to the group until the number of records in the group equals eight.

  • When a record is inserted after the number of records in a group is eight, the records in the group are divided into two groups. One group contains four records and the other five records. This process adds a slot in the page directory to record the relative position of the largest record in the new group.

  • When a record is deleted, the n_owned value of the maximum record in the slot is reduced by 1. When n_OWNED is less than 4, the groups are balanced to meet the above requirements.

In normal cases, a new record is added to the Supremum group each time based on the primary key. When the n_owned value of the new record is 8, the new record is divided into two groups: a group with four records and a group with five records. A slot will be added pointing to the largest of the four record groups, and the n_owned of the largest record will be changed to 4, and the n_owned of the Supremum will be changed to 5.

The number of slots in the Page Directory is recorded as PAGE_N_DIR_SLOTS in the Page Header.

The following figure illustrates the relationship between slots in Page Directory and the maximum number of records in a group.

  • First, Slot0 points to the Infimum record, because the smallest record in the group can only have one record, itsn_owned=1.
  • Then Slot1, Slot2, and Slot3 point to the maximum record in their respective groups, andn_owned=4You can imagine that it isSupremumGroups came in groups.
  • Finally, Slot4 points toSupremumThis is the maximum recorded group after grouping itn_owned=5.

As you can see, the data in the Page Directory is grouped into a Directory slot, and each slot points to the largest record in the group. N_owned in the record header of the largest record records the number of records in the group.

Once InnoDB has a directory slot, it uses binary lookup to quickly locate the slot and find the record in the group with the smallest primary key value. InnoDB then traverses the record with the smallest next_record to quickly locate the record that matches.

The time complexity of binary lookup is low and the lookup in memory is fast, so this part of the lookup time is usually ignored.

File Trailer

As mentioned in the File Header section, when a page is written to disk, the first thing written to the File Header is the value of FIL_PAGE_SPACE_OR_CHKSUM, which is the checksum of the page. During the write process, the database may go down, causing the page not to be fully written to disk.

InnoDB sets up the File Trailer section to verify that pages are fully written to disk. File Trailer has only one FIL_PAGE_END_LSN that takes up 8 bytes. FIL_PAGE_END_LSN is divided into two parts. The first four bytes represent the checksum of the page. The last 4 bytes represent the position (LSN) of the log sequence corresponding to the last modification of the page, which is the same as FIL_PAGE_LSN in File Header.

By default, the InnoDB storage engine checks the integrity of a page every time it is read from disk, The checksum and LSN in File Trailer are compared with FIL_PAGE_SPACE_OR_CHKSUM and FIL_PAGE_LSN in File Header to ensure page integrity.