The file storage structure contains a lot of implementation details of the system, such as Java class file structure, Rocksdb storage structure. MySQL InnoDB storage format is a bit complicated, but it is a necessary step to understand MySQL technology inside. There are two traditional methods of analysis

  • The jeremycole innodb_ruby
  • Ali open source Java version innodb-Java-Reader

However, both versions are not very intuitive, so I decided to implement them by analyzing.ibd files for two purposes: visualization and templating. This brings me to the 010 Editor, the most powerful binary analysis tool in the universe.

010 Editor is introduced

010 Editor is a commercial software that is a powerful tool for binary analysis, able to parse multiple file formats and present them in a visual interface. It has a powerful internal engine that allows anyone to customize the desired parsing scripts or templates. 010 Editor has a repository of scripts and template libraries for you to learn and use, such as.zip,.exe,.class. MySQL doesn’t have a template for its.ibd file repository.

Most of the examples that follow are from the kids’ gold nuggets booklet as a supplement.

All things are difficult before they are easy

The basic syntax of the 010 template is similar to that of THE C language, which essentially takes the binary data of the entire file as input and maps it to a structure we define ourselves, where we can handle the internal logic ourselves.

We know that MySQL manages storage space in the basic unit of pages. A page size is usually 16KB, so the ibd file size can be divided by 16KB to know how many pages there are, or we can loop through the 16KB size to the end of the file.

Create a new test_01.bt file with the following contents.

#define PAGE_SIZE (16 * 1024)

// Define a Page structure that represents the size of a Page
typedef struct Page {
	byte content[PAGE_SIZE];
};

// Keep reading until the file is finished
while(! FEof()) { Page page;// Read 16KB of content mapped to the Page structure
}
Copy the code

Or you could write it this way

#define PAGE_SIZE (16 * 1024) #define PAGE_SIZE (16 * 1024) Typef struct Page(int index) {byte content[PAGE_SIZE]; typef struct Page(int index) {byte content[PAGE_SIZE]; }; local int file_size = FileSize(); local int i; for (i = 0; i < file_size / PAGE_SIZE; ++i) { Page page(i); // Read 16KB content into the Page structure}Copy the code

So if you import this template file, you can see in the interface that this IBD file has 6 pages, and we’ll talk about what’s in it later.

As you may know, ibD files have many different types of pages, but each type of Page used to represent a 38-byte File Header, which describes information common to various pages, such as the Page type of the Page, the Page number, its previous Page, and the next Page.

Its constituent structure is as follows.

The name of the Space occupied describe
FIL_PAGE_SPACE_OR_CHKSUM 4 bytes Page checksum
FIL_PAGE_OFFSET 4 bytes Page number
FIL_PAGE_PREV 4 bytes The page number of the previous page
FIL_PAGE_NEXT 4 bytes Page number of the next page
FIL_PAGE_LSN 8 bytes The position of the Log Sequence corresponding to the last modification of the page
FIL_PAGE_TYPE 2 – The type of the page
FIL_PAGE_FILE_FLUSH_LSN 8 bytes Defined only on one page of the system tablespace, representing that the file has been flushed to at least the corresponding LSN value
FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID 4 bytes Which table space the page belongs to

So in 010 we can define a structure to represent the File Header.

// A 38-byte File Header, Typedef struct FilHeader {u4 FIL_PAGE_SPACE_OR_CHKSUM<format=hex,comment=" checksum ">; U4 FIL_PAGE_OFFSET<fgcolor=0xFF0000,comment="页 "> U4 FIL_PAGE_PREV<comment=" 查 看 全 文 ">; U4 FIL_PAGE_NEXT<comment="下 载 页 ">; U8 FIL_PAGE_LSN<comment=" Log Sequence Number "> u8 FIL_PAGE_LSN<comment=" Log Sequence Number "> PAGE_TYPE FIL_PAGE_TYPE; U8 FIL_PAGE_FILE_FLUSH_LSN<comment=" Only defined in a page of the system tablespace, indicating that the file has been flushed to at least the corresponding LSN value "> U4 FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID<comment=" which tablespace does the page belong to ">; };Copy the code

PAGE_TYPE is an enumeration class that represents the type of a page. Common page types are as follows.

Type the name hexadecimal describe
FIL_PAGE_TYPE_ALLOCATED 0 New assignment, not used yet
FIL_PAGE_UNDO_LOG 2 The Undo log page
FIL_PAGE_INODE 3 Segment information node
FIL_PAGE_IBUF_FREE_LIST 4 Insert Buffer free list
FIL_PAGE_IBUF_BITMAP 5 The Insert Buffer bitmap
FIL_PAGE_TYPE_SYS 6 The system page
FIL_PAGE_TYPE_TRX_SYS 7 Transaction system data
FIL_PAGE_TYPE_FSP_HDR 8 Table space header information
FIL_PAGE_TYPE_XDES 9 Extended Description page
FIL_PAGE_TYPE_BLOB A Overflow page
FIL_PAGE_INDEX 45BF Index pages are what we call data pages

We need to do different parsing for different page types. With the above information, we can write a new version of the template script.

Next we’ll focus on parsing pages of the type FIL_PAGE_INDEX, which we call pages that hold data. Its structure is as follows.

Let’s start by defining a PageHeader, as shown below.

// struct PageHeader {u2 PAGE_N_DIR_SLOTS<comment=" number of slots in the Page directory ">; U2 PAGE_HEAP_TOP<comment=" Free Space"> PAGE_FORMAT_ENUM PAGE_FORMAT: 1 <comment=" page type ">; U2 NUM_OF_HEAP_RECORDS: 15 <comment=" number of records in this page (including Max and Max records and marked as delete) "> U2 PAGE_FREE<comment=" Next_record is also a single linked list, which can be recycled) ">; U2 PAGE_GARBAGE<comment=" number of bytes occupied by deleted records ">; U2 PAGE_LAST_INSERT<comment=" last_insert ">; U2 PAGE_DIRECTION<comment=" 查 看 全 文 ">; U2 PAGE_N_DIRECTION<comment=" 内 容 提 名 ">; U2 PAGE_N_RECS<comment=" number of records in the page (excluding minimum and maximum records and those marked as deleted) ">; U8 PAGE_MAX_TRX_ID<comment=" modify the maximum transaction ID of the current page, which is defined only in the secondary index ">; U2 PAGE_LEVEL<comment=" where is the page in the table? ">; U8 PAGE_INDEX_ID<comment=" ID ">; U1 PAGE_BTR_SEG_LEAF[10]<comment="B+ tree header, only defined on Root page "> U1 PAGE_BTR_SEG_TOP[10]<comment="B+ tree non-leaf segment header, only defined on the Root page of B+ tree "> };Copy the code

Following the Page Header are two special records, Infimum + Supremum, which are two virtual row records that represent the minimum and maximum records in the Page.

Infimum and Supremum consist of two parts, a 5-byte record header and an 8-byte string.

Use the script definition as shown below.

Enum Enum < U1 > RECORD_TYPE {CONVENTIONAL = 0, NODE_POINTER = 1, INFIMUM = 2, SUPREMUM = 3}; // Record type Enumeration Class enum < U1 > RECORD_TYPE {CONVENTIONAL = 0, NODE_POINTER = 1, INFIMUM = 2, SUPREMUM = 3}; // struct {u1:2; u1 delete_mask : 1; u1 min_rec_mask : 1; u1 n_owned : 4; u2 heap_no : 13; RECORD_TYPE record_type : 3; short next_record : 16; } RecordHeader<fgcolor=0xFF00FF>; Typedef struct {RecordHeader record_header; Char text[8]; // 8-byte word} InfimumSupremumRecord;Copy the code

The following information is displayed.

We can then iterate through the list with Infimum and Supremum.

Typedef struct Page(int index) {FilHeader file_header; // 38 bytes file header if (file_header.FIL_PAGE_TYPE == FIL_PAGE_INDEX) {PageHeader page_header; // 56 bytes of page header InfimumSupremumRecord infimum_record; Local int page_start_pos = index * PAGE_SIZE; local int infimum_record_pos = FTell() - page_start_pos - 8; // Obtain the start address of the infimum record InfimumSupremumRecord supremum_record; Local int supremum_record_pos = FTell() -page_start_pos - 8; Local int next_rec_rel_pos = infimum_record_pos + obtain the start address of the supremum record infimum_record.record_header.next_record; // iterate over the list while(next_rec_rel_pos! = supremum_record_pos) { FSeek(page_start_pos + next_rec_rel_pos - 5); RecordHeader record_header; next_rec_rel_pos += record_header.next_record; FSeek((index + 1) * PAGE_SIZE); };Copy the code

At this point, we have completed one of the simplest ibD file parsing logic. The rest of the logic is to parse according to different table structures and record formats.

Revisit the Compact/Dynamic line format

Dynamic is the default MySQL row format, which is basically similar to Compact and consists of the following structure.

Let’s experiment with the example in the booklet.

Table building and initialization statements are as follows.

CREATE TABLE record_format_demo (
        c1 VARCHAR(10),
        c2 VARCHAR(10) NOT NULL,
        c3 CHAR(10),
        c4 VARCHAR(10)
) CHARSET=ascii ROW_FORMAT=COMPACT;

INSERT INTO record_format_demo(c1, c2, c3, c4) VALUES('aaaa'.'bbb'.'cc'.'d'), ('eeee'.'fff'.NULL.NULL);
Copy the code

This table does not define a primary key; MySQL generates a 6-byte value as the primary key. In addition, a 6-byte transaction_id and a 7-byte roll_pointer are added. So we can add a row record structure to the script.

typedef struct { u8 row_id : 48; U8 trx_id: 48; Transaction ID u8 roll_pointer: 56 <format=hex>; } Row;Copy the code

We then add the structure of the record after the record header of the previous while loop.

// iterate over the list while(next_rec_rel_pos! = supremum_record_pos) { FSeek(page_start_pos + next_rec_rel_pos - 5); RecordHeader record_header; if (record_header.record_type == CONVENTIONAL) { Row row; } next_rec_rel_pos += record_header.next_record; }Copy the code

The blue background of the image is the first three parts of a row, row_id + transaction_id + roll_pointer.

This is followed by the actual values of c1, c2, C3, and C4 in the first row, where c3 is of type CHAR (10), followed by a space (0x20) if the actual length is less than 10. This is why there is no space at the end of the char data.

The list of variable length fields in the first row is 01, 03 and 04, corresponding to C4, C2 and C1 respectively, which are sorted in reverse order according to the table fields. Since the C3 field is char and the table encoding is ASCII, it is not variable-length, so the length is a definite 10 bytes and there is no need to record it in the variable-length field length list. Then we look at the NULL value list. Since none of the fields in the first row are NULL, this field is 0.

Now let’s look at the second row of data,

c1 c2 c3 c4
eeee fff NULL NULL

The corresponding binary list is as follows.

As you can see, since the c4 field is NULL, there is no need to record it in the variable length list, so the variable length list is 03, 04. The NULL list is changed to 0x06(00000110). The NULL list is also in reverse order, so its value is 0110. Since C2 is not defined as NULL, 011 means that C1 is not NULL, and C3 and C4 are NULL.

This example is generated when the table is encoded as ASCII. In the case of variable-length encoding such as UTF8, do we need to record the actual length of char data in the variable-length list?

Let’s try it out, change the table encoding to UTF8, insert “one, two, three, four, five, six, seven, eight, nine”,

CREATE TABLE record_format_demo2 ( c1 VARCHAR(10), c2 VARCHAR(10) NOT NULL, c3 CHAR(10), c4 VARCHAR(10) ) CHARSET=utf8 ROW_FORMAT=COMPACT; INSERT INTO ` record_format_demo2 ` (` c1 ` ` c2 `, ` c3 `, ` c4 `) VALUES (' GGGG ', 'HHH', 'ten', 'j');Copy the code

The corresponding file information is as follows.

You can see that there are now four varied-length columns, including the C3 field of type CHAR, which is 30(0x1E) in length, where each Chinese character is represented by three bytes.

How does vARCHAR handle long lines

Here we continue with the example from the booklet

CREATE TABLE `varchar_size_demo_1` (
  `c` varchar(65532) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=ascii ROW_FORMAT=COMPACT;

INSERT INTO varchar_size_demo_1(c) VALUES(REPEAT('a', 65532));
Copy the code

MySQL is 16K per page, so obviously you need at least 4 pages to store everything. So how is this data organized? The diagram in the booklet clearly illustrates the page layout in COMPACT format.

Knowing this structure, we can go ahead and customize the template for this table.

typedef struct OverflowPagePointer { u4 space; u4 page_no; // u4 page_offset; // Offset u4:4 * 8; / / short u4 length; }; typedef struct { u8 row_id : 48; U8 trx_id: 48; Transaction ID u8 roll_pointer: 56 <format=hex>; // Roll back pointer char data[768]; OverflowPagePointer Overflow_page_pointer; // 20 bytes overflow struct} Row;Copy the code

What is not said here is the address of the other 20-byte pages, and what exactly is stored in them. By reading the source code, I extracted the structure of this part, which is the OverflowPage above.

You can see that the overflow record pointer points to the contents of the Page with Page number 4 and offset 38. Let’s look at the content area with page number 4

As you can see from the File Header, this Page is of type BLOB. The File Header is followed by eight bytes of Header like information, the first four bytes of length indicating the length of the overflow data stored in the current page, and the next four bytes indicating the next page number to store the overflow data. Based on this, we can implement data format parsing for bloB-type pages.

The BLOB Page format consists of 8 bytes of header information plus the actual data, and the header information consists of 4 bytes of length information plus 4 bytes of the next Page Page number.

So we can write BLOB type structures.

typedef struct {
	u4 length;
	u4 next_page_number;
	char data[length];
} Blob;
Copy the code

Modify our template script to add judgment for BLOB type pages.

if (file_header.FIL_PAGE_TYPE == FIL_PAGE_INDEX) { PageHeader page_header; // 56 bytes of page header InfimumSupremumRecord infimum_record; // Virtual minimum record of 13 bytes... } } else if (file_header.FIL_PAGE_TYPE == FIL_PAGE_TYPE_BLOB) { Blob blob; }Copy the code

Running the script yields the following results.

Dynamic records are similar to the Compact row format in that all real records are stored in other pages and are not expanded here.

How are leaf and non-leaf nodes represented in terms of storage structure

Here we create a T2 table and write a stored procedure to insert a thousand rows.

CREATE TABLE `t2` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=InnoDB;


drop procedure if exists gen_data;
delimiter ;;
create procedure gen_data()
begin
    declare i int;
    set i = 1;
    while(i<=1000) do
        insert into t2 values(i, i, i);
        set i = i + 1;
    end while;
end;;
delimiter ;
call gen_data();
Copy the code

There are 1000 rows in the table from 1 to 1000.

Next, let’s look at the storage structure of the table, run the script, and look at page 3 first (pages 0, 1, and 2 are system pages, regardless).

It can be seen that the record type on page 3 is 1(NODE_POINTER), where 1 indicates a non-leaf node record of the B+ tree. The complete record type is as follows.

  • 0 indicates a common record
  • 1 indicates the B+ tree non-leaf node record
  • 2 indicates the minimum record
  • 3 indicates the maximum record

The page_level value on page 3 is 1, indicating a non-leaf node (page_level is 0).

For records of type NODE_POINTER, which actually stores the minimum primary key ID and Page number of the Page pointing to the next LEVEL Page, its type is defined as follows.

typedef struct NonLeafRecord { u4 min_key_on_child_page<read=read_int>; u4 child_page_num; }; String read_int(u4 n) {string ret; SPrintf(ret, "%u", n & 0x7FFFFFFF); return ret; }Copy the code

Then add NODE_POINTER processing logic to the code.

while(next_rec_rel_pos ! = supremum_record_pos) { FSeek(page_start_pos + next_rec_rel_pos - 5); RecordHeader record_header; if (record_header.record_type == CONVENTIONAL) { Row row; } else if (record_header.record_type == NODE_POINTER) {// Process NonLeafRecord record; } next_rec_rel_pos += record_header.next_record; }Copy the code

The result is as follows.

You can see that there are three non-leaf nodes, pointing to the leaf node page on pages 5, 6, and 7. The minimum primary key values for these three pages are 1, 242, and 725, respectively. Let’s draw it as a graph.

other

The template file here covers only a small part of the MySQL file format and can be used as a simple template to continue to improve the extension support. All the sample template code used in this article has been posted on Github at github.com/arthur-zhan… Interested students can continue to improve.