1. Introduction

So far we have seen that “row format” determines the storage format of records on disk, and records are concatenated in a one-way linked list by Pointers in the header information. To better manage records, InnoDB uses “pages” as the basic unit for storing records, with pages connected in a bidirectional list. At the same time, in order to improve the retrieval efficiency of records, InnoDB uses the design of Page Directory in the Page to create a Directory entry record for all leaf node pages and store it in the inner node, so as to build a tree, which is the familiar B+ tree index.

All of this seems to be fine, InnoDB works fine, but there are some performance issues. We insert records into tables, essentially inserting records into clustered index B+ trees and secondary index B+ trees. Inserting records into these trees is essentially inserting records into nodes of the tree, each node of the tree corresponds to an index page. The File Header section of the page has PREV and NEXT Pointers that record the page numbers of the previous and NEXT pages to build bidirectionally linked lists that physically discontiguous logically contiguous pages. If InnoDB claims disk space based on pages, it will likely result in pages that are logically adjacent and physically far apart. This will inevitably result in a lot of random I/O. Random I/O performance on mechanical hard disks is poor, and the heads need to be readdressed every time. To solve this problem, InnoDB introduced the concept of extents, groups, and segments, aiming to make logically adjacent pages physically as adjacent as possible, and converting random IO into sequential IO as much as possible. Note that I say “as much as possible”, InnoDB will do its best to keep pages contiguous, and even if they are not contiguous, they will work, but the performance will be poor.

2. The table space

InnoDB supports a variety of tablespaces:

  • System table space
  • Independent table space
  • The undo tablespace
  • Temporary table space
  • Generic table space

System table Spaces and standalone table Spaces can be used to store our table data and index data, so we focus on them. By default, InnoDB has a shared tablespace ibdata1, where all data is stored.

SHOW VARIABLES LIKE 'innodb_data_file_path'; Results: ibdata1:12M:autoextend
Copy the code

As you can see, the default system tablespace is the ibDatA1 file on the disk. The default initial test size is 12MB, which is too small to fill with very little data. Don’t worry, autoextend means it is a self-growing file that will automatically expand when it runs out of capacity.

You can also use innodb_file_per_table to create a separate table space for each form. To check whether a separate table space is enabled, run the following command:

SHOW VARIABLES LIKE 'innodb_file_per_table'; Results:NO
Copy the code

In my MySQL instance, independent table Spaces are enabled, so InnoDB creates two separate table name. FRM and ibd files for each form. The former is a table structure definition file and is usually small, while the latter is used to store data and indexes in the table and grows as the number of records increases.

Ibd files. You can think of a table space as a large pool of pages. When we need to insert data, we take pages out of this pool.

3. The concept of districts

As mentioned earlier, InnoDB is fine with managing data by pages, and it works fine, but the performance is not ideal because logically adjacent pages can be physically far apart, resulting in a lot of random IO. To remedy this, InnoDB introduces the concept of extents.

InnoDB specifies that physically contiguous 64 pages are an extents, i.e. the size of an extents is16KB*64=1MBTable Spaces, whether system or standalone, are composed of contiguous extents. The contiguous 256 extents are a group, that is, a group is 256MB in size.With area, InnoDB can “area” as the unit to assign the data in the table and index space, when large amounts of data in the table, even one-time allocation of several continuous area, benefits to do so is also continuous on these pages in the physical, random IO can optimize for sequential IO, defect is probably these pages to burn, a bit of a waste of space, But it was worth it.

InnoDB divides “extents” into four states:

The state name instructions
FREE The free zone
FREE_FRAG Fragmentation areas with free pages
FULL_FRAG There are no fragments of free pages
FSEG An area belonging to a section

A segment is a logical concept, a B+ tree will have two segments, namely leaf node segment and non-leaf node segment. This section will be covered in more detail in a later article, but will be skipped here.

Only extents in the FSEG state belong to segments directly. Extents in the other three states belong only to the tablespace and do not belong to any segment. A segment belongs to a table or, more specifically, to a specific B+ tree index. That is, once an extent is FSEG, it means that the pages inside it only store records from that B+ tree.

A table has at least one B+ tree, or at least two segments, and each segment is allocated an extent. This means that even if there is only one record in the table, it will occupy 2MB of space, which is too much waste. To solve this problem, InnoDB uses the policy of “allocate scattered pages first and reallocate space”. InnoDB’s policy for allocating space to table records looks like this: When we insert records into the table, InnoDB takes out a FREE range and changes its state to FREE_FRAG. Then it takes out a page from the range and stores the record. When the range runs out of FREE space, InnoDB changes its state to FULL_FRAG. Then continue with the FREE zone and repeat the process. The table data is then scattered fragments of the storage in these areas, we call these pages for the “separate page”, when a section of scattered pages with more than 32, InnoDB page will no longer to allocate space for the unit, but in area as the basic unit, thus avoiding the small amount of data would take up 2 the problems.

3.1 XDES Entry

How do you know which section a section belongs to? And the usage of pages in the zone? To better manage extents, InnoDB creates an XDES Entry node for each extents. Each XDES Entry node occupies a fixed length of 40 bytes. The structure is as follows:

The name of the The size of the instructions
Segment ID 8Byte ID of the owning segment
List Node 12Byte XDES Entry Indicates a pointer to a bidirectional linked list
State 4Byte Region states (total 4 types)
Page State Bitmap 16Byte Bitmap of the status of the pages in the area
  • List Node

The List Node consists of two Pointers that are used to concatenate multiple XDES Entry nodes into a bidirectional linked List. The structure is as follows:

The name of the The size of the instructions
Prev Node Page Number 4 bytes Page number of the last XDES Entry node
Prev Node Offset 2 – The offset of the page address of the last XDES Entry node
Next Node Page Number 4 bytes Page number of the next XDES Entry node
Next Node Offset 2 – The offset of the next XDES Entry node’s address on the page

An XDES Entry is also stored in a page. You can quickly locate an XDES Entry by using the page number + address offset. The default page size is 16KB, and 2 bytes are sufficient to record the address offset. Tips: The group consists of 256 contiguous sections, corresponding to 256 XDES Entry nodes. These XDES Entry nodes are stored in the first page of the group. The physical structure is contiguous, and there is no need for Pointers at all.

  • Page State Bitmap

It records whether 64 pages in the zone are free, occupying 16 bytes, or 128 bits. Each two bits represents a page. The first bit indicates whether the corresponding page is free, and the second bit is not currently in use.

On page 3.2 XDES

A group consists of 256 contiguous extents, which correspond to 256 XDES Entry nodes. Where are these nodes stored? As we mentioned earlier, InnoDB designs many different types of pages for different purposes, including pages specifically designed to store XDES Entry nodes. The page type is FIL_PAGE_TYPE_XDES, or XDES page for short. Its page structure is as follows:

The name of the The size of the instructions
File Header 38 byte Common file header information for all pages
Empty Space 112 bytes Don’t use
XDES Entry 40 * 256 bytes 256 XDES Entry nodes
Empty Space 5986 bytes Don’t use
File Trailer 8 bytes Common end-of-file information for all pages to verify page integrity

The XDES page is fixed as the first page of each group, that is, each group takes out the first page to record the XDES Entry node corresponding to the 256 area in the group.

The first page of the first group of the tablespace is of type FIL_PAGE_TYPE_FSP_HDR, which is very similar to the XDES page except that the File Space Header is added to record the tablespace related state information.

3.3 XDES Entry Linked list

Now we know that when there is very little data in the table, InnoDB will first allocate the scattered pages from the fragment belonging to the table space. When the number of scattered pages used by a segment exceeds 32, space is allocated in extents. There are two questions:

  1. How do I know which extents in a table space are FREE? Which are FREE_FRAG?
  2. Which FSEG segments belong to segments that have free space? Which ones are unused?

The essence of these two problems is the same, and the solution is the same. Of course, iterating through all the XDES entries is also a solution, but this solution is too stupid, when the table data reaches 1GB, the number of XDES entries will be thousands, traversal is too inefficient, InnoDB does not do this. Remember the List Node section of XDES Entry? These are two Pointers that can be used to build a bidirectional linked list of multiple XDES entries. Extents that belong to a tablespace form three linked lists:

  • FREE list: A linked list that concatenates all FREE zones.
  • FREE_FRAG linked list: A linked list that concatenates all FREE_FRAG areas.
  • FULL_FRAG linked list: A linked list that concatenates all the FULL_FRAG regions.

So, when we want to allocate scattered pages, we start with an extent from the FREE_FRAG list, change its state to FULL_FRAG when the extent is used up, and then remove it from the FREE_FRAG list and add it to the FULL_FRAG list. Repeat the previous procedure. When there is no extant available on the FREE_FRAG list, take an extant from the FREE list, change its state to FREE_FRAG, add it to the FREE_FRAG list, and repeat the previous procedure.

Let’s look at the second problem, which is essentially the same thing, but let’s do a list of FSEG regions based on State. A B+ tree has two segments, leaf and non-leaf nodes are stored separately, and each segment corresponds to three linked lists:

  • FREE list: All unused extents belonging to the same segment are automatically added to the list.
  • NOT_FULL: All extents belonging to the same segment that have been used but still have free space are automatically added to the list.
  • FULL list: All segments belonging to the same segment that have run out of free space are automatically added to the list.

The allocation is the same as a linked list of table Spaces, which I won’t describe here.

That is, when we create a table with only primary keys and no secondary indexes, it contains nine XDES entries. Mysql > alter table tablespaces; mysql > alter table tablespaces;

3.4 Linked list base nodes

The simplest table with only primary keys will have 9 XDES Entry lists. When we insert records into the table, we need to find the corresponding linked list. We need to retrieve pages from the area to store records.

To solve this problem, InnoDB designs a structure called “List Base Node”. Each List Base Node occupies a fixed 16 bytes. The structure is as follows:

The name of the The size of the instructions
List Length 4 bytes Total number of linked list nodes
First Node Page Number 4 bytes Linked list header XDES Entry Page number of the page where the node resides
First Node Offset 2 – Address offset of the page where the linked header XDES Entry node resides
Last Node Page Number 4 bytes The page number of the XDES Entry node at the end of the list
Last Node Offset 2 – Address offset of the page where the XDES Entry node resides at the end of the linked list

The base node structure of the linked list is very simple. The first 4 bytes count the number of nodes in the list, and the last 12 bytes point to the top and bottom nodes. Another question arises: where are the base nodes of the linked list?

We already know that there are two types of XDES Entry lists, one belonging to table Spaces and one belonging to segments. The XDES Entry base node belongs to the tablespace and is stored in the File Space Header on the first page of the tablespace group 1. The 16*3 bytes store the three base nodes of the list, respectively pointing to the three lists mentioned above. InnoDB creates an INODE Entry node for each segment. The XDES Entry base node for each segment is stored in the INODE Entry node, which takes 16*3 bytes. The three base nodes refer to the three lists mentioned above.

4. To summarize

InnoDB index pages series logically adjacent pages into a bidirectional linked list by means of two Pointers, so that these index pages do not need to be physically contiguous, but can cause logically adjacent pages to be physically far apart, resulting in a large number of random IO when reading data. To optimize random IO to sequential IO as much as possible to improve performance, InnoDB introduces the concept of extents, groups, and segments. Physically contiguous 64 pages are a range, and contiguous 256 ranges are a group. This allows InnoDB to allocate space in “extents”. To better manage these extents, InnoDB creates an XDES Entry node for each extents. These nodes are stored exclusively on the XDES page, fixed to the first page of each group. Through XDES Entry node, we can know which section the area belongs to, the usage of pages in the area, and connect the area into a linked list according to the State, so InnoDB can find the area with the specified State more quickly. XDES Entry nodes can be divided into three linked lists based on page usage. To find these lists, InnoDB designs a “list base node” structure, which records the number of nodes in the list and the Pointers to the top and bottom nodes. If the linked list base nodes belong to the tablespace, they will be stored in the first page of the tablespace group 1. If it belongs to an segment, it is stored in the INODE Entry node corresponding to the segment. Each takes 16*3 bytes and points to three linked lists in different states. In this way, when InnoDB wants to allocate fragmented pages for records, it finds and allocates fragmented pages in the linked list corresponding to the table space. When the number of pages occupied by a segment exceeds 32, InnoDB finds an area in the segment’s list with free pages and allocates them.