We are all familiar with mysql database service architecture and the execution order of SQL. Mysql data is stored in disk and memory using B+ tree data structure, but in InnoDB engine, how data is displayed in memory and disk and how it relates to mysql service architecture. The connection between SQL queries and InnoDB engines may not be unclear. The logical architecture of mysql is shown below:

InnoDB storage engine structure

What is the logical storage structure of InnoDB’s storage engine? In fact, all the data is logically stored in one space and the files in this space are the actual physical files, the table space. By default, a database table occupies a tablespace that stores the data, index, INSERT Buffer bitmap undo information, INSERT Buffer index page, and double write buffer corresponding to the table in the shared tablespace.

By default, a database form has an exclusive tablespace show variables like'%innodb_file_per_table%'SET GLOBAL innodb_file_per_table=OFF;1.23.4.. 5.
Copy the code

2.1 InnoDB table storage engine file

Each tablespace consists of segments, extents, and pages. A page is the basic unit of data store data, with a default size of 16KB. An extent is a space made up of contiguous pages, with a default size of 1MB. Multiple extents form a segment of a table. InnoDB logical storage structure

When we execute SQL, both query and modify, myql always the data read from disk in memory, and when reading data, not on a data alone, but direct load data in data pages into memory, and read in one of two ways, the current way of pre-reading lectures and random way, using linear to proofread by default.

InnoDB engine architecture

2.2. InnoDB Prefetch mechanism

  • Linear and random prefetch:

The unit of linear prefetch is extent, while the unit of random prefetch is page. Linear prefetch reads the next extent data to the buffer pool. Random prefetch means that the remaining pages in the current extent are read to the buffer pool. If the number of pages that are read sequentially in an extent exceeds a certain amount (innodb_read_ahead_threshold), the remaining pages in the extent are loaded directly.

2.3 features of InnoDB

2.3.1 Insert cache

To improve Insert performance, Insert buffers are an enhanced version of Insert buffers. Insert buffers are only valid for inserts. The change buffer is valid for insert/update/ DELETE. Insert buffer only applies to non-unique indexes and secondary indexes. Each insert is not written to the index page. Instead, it determines whether the inserted non-clustered index page is in the cache. This combines multiple inserts into a single operation in order to reduce the performance cost of random IO.

2.3.2 Write twice(double write)

The insert cache gives the InnoDB storage engine a performance boost, while the double write ensures the reliability of the InnoDB storage engine to operate the data pages. Double Write consists of two parts: a 2MB double write buffer in memory, and 128 consecutive data pages in the shared table space on the physical disk, or two extentsizes (also 2MB). When flushing dirty pages from the buffer pool, instead of writing directly to disk, memcpy is used to copy the dirty pages to the doubleWrite buffer in memory, and then the doublewrite buffer is used to write the dirty pages to the physical disk of the shared table space in two consecutive 1MB increments. Fsync is then called immediately to synchronize the data to disk. Since doubleWrite is continuous space, this sequential write IO is not expensive. After the DoubleWrite page is written, discrete writes are made again to the various table Spaces. If the operating system crashes while writing a data page to disk, during recovery, the InnoDB engine finds a copy of the page from doubleWrite in the shared tablespace, copies it to the tablespace file, and applies the redo log.

2.3.3 Adaptive Hash Index

The InnoDB storage engine monitors queries for each Index page on the table. If it sees that creating a hash Index will increase the speed, it will create the corresponding Index, so it is called Adaptive Hash Index (AHI). AHI is constructed from B+ tree pages in the buffer pool. The establishment speed is relatively fast, and there is no need to create a hash index for the whole table, but only the index of hot pages. AHI is enabled by default.

2.3.4 Asynchronous IO

In order to improve the performance of disk operations, current database systems generally use Asynchronous IO(AIO) to handle disk operations, InnoDB storage engine is also the same,AIO advantage is to reduce the time required for SQL queries, in addition to IO Merge operations. You can combine multiple I/OS into one I/O to improve IOPS performance.

# 开启本地 AIO
show valiables like 'innodb_use_native_aio';1.2.
Copy the code

2.3.5 Refresh the adjacencies page

The InnoDB storage engine provides a Flush Neighbor Page feature. When a dirty Page is flushed, the InnoDB storage engine detects if there are other dirty pages in the area. If there are other dirty pages, the InnoDB storage engine will Flush them together. Mechanical hard disks have obvious advantages, but fixed hard disks have higher IOPS. Whether to enable this function depends on the actual situation. Set the parameters as follows:

show varables like 'innodb_flush_neighbors'1.
Copy the code

3. Logic for SQL execution

3.1 SQL execution

Mysql write files have 2 caches. One is the self-defined log buffer in memory, and the other is the OS cache mapped from disk to memory. Mysql can call Flush to proactively flush the log buffer to the disk memory map, or it can call fsync to forcibly synchronize the disk mapped file to disk. By default innodb_flush_log_at_trx_COMMIT and sync_binlog are both set to 1.

Not only do InnoDB engines have the concept of buffers in user space, but also OS buffers in kernel space

You can also call Flush + fsync to flush the cache directly. innodb_flush_log_at_trx_commit =0It calls Flush + fsync every second, and the timer maintains itself. innodb_flush_log_at_trx_commit =1The real-time call to Flush + fsync is not batched and has poor performance. innodb_flush_log_at_trx_commit =2Is the real-time flush, timed fsync to the OS maintenance timer. The sync_binlog configuration is equal to0: indicates that each committed transaction only writes but fsync is equal to1: indicates that fsync is performed for each committed transaction. Equal to n: indicates that after the write transaction, n transactions are accumulated before fsync is performed. show variables like'sync_binlog';
show variables like 'innodb_flush_log_at_trx_commit'; Mysql > show processList12.3.4.. 56.7.8.9.10.11.12.13..
Copy the code

Relationships between InnoDB engine BufferPool, LogBuffer, OS Buffer and Log Files.

The InnoDB engine performs the following steps when mysql adds or deletes SQL:

  • 1 After obtaining the SQL to be executed, the actuator loads the data to be modified from the disk to the memory (that is, the buffer pool) according to the update conditions.
  • 2 Before modifying the corresponding data, back up the corresponding data, that is, put the data into the undo log to facilitate operations during transaction rollback.
  • 3 Modify the corresponding value in the memory according to the SQL statement.
  • 4 After the new data is modified, the redo log buffer is written based on the new data.
  • 5 Write the redo log to a disk. For details, see innodb_flush_AT_trx_COMMIT. Write the redo log to the cache first and then to the system file asynchronously by DMA. In this case, data is transferred from the user thread to the kernel thread. When reading or writing a file, there are multiple levels of caching between the disk file and the memory to improve the efficiency of data exchange. This is what OS Cache does here.
  • 6 After writing the redo log, write data to the bin log.
  • 7 Similar to the redo log, the log is first written to the OS cache and then flushed to a disk file. The configuration of sync_log is shown in the figure. In general, innodb_flush_AT_trx_COMMIT and sync_log are configured as 1.
  • 8 After the redo log and bin log are written, commit the transaction. When data is written to disk, InnoDB writes data twice.

Redo log before bin log If the redo log has not been written to the bin log, the mysql instance will be restored using the redo log, which will cause data inconsistency between the primary and secondary databases.

Binary log files are recorded in statement, ROW, and Mixed formats. The Statement mode is used to execute SQL directly. If there are functional operations (such as the database time set to now()), the data will be inaccurate. The ROW mode synchronizes all rows of data. If the full table operation changes the state, this mode is not appropriate. Therefore, use mixed mode when synchronizing data.

3.2, FreeList, LRU List and Flush List

  • Free List Indicates the Free List

Record all unused data pages, put the loaded data into the buffer pool in sequence, and delete the corresponding nodes in the Free List

  • LRU List LRU data access List

Connect hot and cold data blocks and maintain them according to the LRU algorithm. If the data loaded into memory is put into the head of the list at one time, some data will be eliminated if the heat of the data is not determined. Mysql InnoDB puts the data in the back of the list. If the data is accessed within 1s, it will enter the list head, which is the data hot area.

# Place the newly loaded data in the list by default375/8Show variables like'innodb_old_blocks_pct'; How often do you use cold data to access the hot end of the list? Default: 1000ms show variables like'innodb_old_blocks_time';1.23.4..
Copy the code
  • Flush List Flushes the dirty page List

Record the modified data pages in memory, use two-way linked list for connection, and do disk operation when convenient.

The size of InnoDB’s redo log is fixed to ensure transaction persistence, and the number of files can be configured according to the needs of InnoDB. You need to move the check point forward and persist the redo log to make room for further log writing.

Data manipulation

The redo log buffer is written periodically

The key here is write-ahead Logging (WAL).

For details about how to configure the redo log, see the following command.

Replication after login

show variables like '%innodb_log%'------- Result ------ innodb_log_buffer_size16777216
innodb_log_checksums	ON
innodb_log_compressed_pages	ON
innodb_log_file_size	50331648
innodb_log_files_in_group	2
innodb_log_group_home_dir	./
innodb_log_write_ahead_size	8192Innodb_log_buffer_size specifies the size of the redo log buffer in memory.16777216/1024/1024=16MB innodb_log_file_size = size of each redo log50331648/1024/1024=48MB Innodb_log_files_in_group Specifies the number of files in the redo log file group. The default value is2Show table status like'my_table';
Copy the code