1 introduction

We are all familiar with mysql database service architecture and the execution sequence of SQL. The storage structure of mysql data in disk and memory adopts B+ tree data structure. However, in InnoDB engine, how to display data in memory and disk and how to establish a relationship with mysql service architecture. The connection between SQL queries and the InnoDB engine is probably not unclear. The logical architecture of mysql is as follows:

2 InnoDB storage engine architecture

What is the logical storage structure of the InnoDB storage engine? In fact, all data is logically placed in a space and the files in that space are the actual physical files, the table space. By default, each database table occupies one tablespace, which stores the corresponding data, indexes, INSERT Buffer bitmap undo information, INSERT Buffer index pages, and double Write buffers in a shared tablespace.

Show variables like '% innodb_file_per_TABLE %' innodb_file_per_table=ON # modify SET GLOBAL innodb_file_per_table=OFF;Copy the code

2.1 InnoDB tables store engine files

Each tablespace consists of segments, areas, and extent pages. A page is the basic unit of data stored in a data store, with a default size of 16kb. An extent is a space made up of consecutive pages, with a default size of 1MB. The extents form the segments of a table. InnoDB logical storage structure

When we execute SQL, both query and modify, myql always the data read from disk in memory, and when reading data, not on a data alone, but direct load data in data pages into memory, and read in one of two ways, the current way of pre-reading lectures and random way, using linear to proofread by default.

InnoDB engine architecture

2.2 InnoDB prefetch mechanism

  • Linear prefetch and random prefetch:

Linear prefetch is in the unit of extent, while random prefetch is in the unit of page in extent. Linear prefetch aims to read the next extent data into the buffer pool. Random prefetch reads the remaining pages in the current extent to the buffer pool. If the number of sequential read pages in an extent area exceeds a certain number (innodb_read_ahead_threshold), the remaining data pages in an extent area are directly loaded.

2.3 the InnoDB features

2.3.1 Inserting the Cache

Insert Buffer (Change Buffer) To improve Insert performance, Insert Buffer is an enhanced version of Insert Buffer. The change buffer works for insert/update/delete. Insert cache is only valid for non-unique indexes and secondary indexes. Each insert is not written to the index page, but determines whether the inserted non-clustered index page is in the cache. If so, it is directly inserted into the insert buffer, and if not, it is merged at a certain frequency and written back to disk. This combines multiple inserts into one operation to reduce the performance cost of random I/O.

2.3.2 secondary to write(double write)

Inserting caches gives the InnoDB storage engine a performance boost, while double write ensures that the InnoDB storage engine operates on data pages reliably. Double write is divided into two parts, one is the double Write buffer in memory, the size of which is 2MB, and the other is the continuous 128 data pages in the shared table space on the physical disk, that is, two extential sizes (also 2MB). When the dirty pages of the buffer pool are flushed, they are not written directly to disk. Instead, the dirty pages are copied to the doubleWrite buffer in memory by the memcpy function. After that, the doubleWrite buffer is sequentially written to the physical disk of the shared table space in two 1MB intervals. Fsync is then immediately called to synchronize the data to disk. Since Doublewrite is a contiguous space, such sequential IO writes are inexpensive. After the DoubleWrite page is written, the individual table Spaces are again written discreetly. If the operating system crashes while writing data pages to disk, during recovery InnoDB engine finds a copy of the page from DoubleWrite in the shared table space, copies it to the table space file, and applies the redo log.

2.3.3 Adaptive Hash Indexes

Hash is an equivalent query. InnoDB’s storage engine monitors queries on each Index page of a table. If it sees an increase in the speed of creating a hash Index, it creates an Index accordingly, hence the Adaptive Hash Index (AHI). AHI is constructed from B+ tree pages in the buffer pool, which is relatively fast to build, and does not need to hash index the whole table, just index hot pages. AHI is enabled by default.

2.3.4 asynchronous I/o

In order to improve disk performance, database systems use Asynchronous IO(AIO) to handle disk operations. InnoDB storage engine also uses AIO to reduce the time required for SQL queries. You can combine multiple I/OS into one I/O to improve IOPS performance.

# Enable local AIO
show valiables like 'innodb_use_native_aio';
Copy the code
2.3.5 Refreshing the adjacency page

InnoDB storage engine provides Flush Neighbor Page. When a dirty Page is flushed, InnoDB storage engine detects if there are other dirty pages in that area. If there are other dirty pages, InnoDB storage engine flushes them all at once. Mechanical hard disks have obvious advantages, but fixed hard disks have high IOPS. Whether to enable this function depends on the actual situation. The parameters are as follows:

show varables like 'innodb_flush_neighbors'
Copy the code

3 SQL execution logic

3.1 SQL execution

Mysql write files have 2 caches. One is a self-defined log buffer in memory, and the other is an OS cache mapped from disk to memory. Mysql can call Flush to flush the log buffer to the disk memory map, or it can call fsync to force the disk map file to be synchronized to disk. By default innodb_flush_LOG_at_trx_COMMIT and SYNc_binlog are both set to 1.

Not only does InnoDB engine have the concept of buffer, which is in user space, but it also has the concept of OS buffer in kernel space

You can also call Flush + fsync to drop the cache directly to disk. Innodb_flush_log_at_trx_commit = 0 is flush + fsync every second. Innodb_flush_log_at_trx_commit = 1 flush_log_at_trx_commit = 1 Innodb_flush_log_at_trx_commit = 2 Is flush in real time. Fsync is assigned to the OS maintenance timer. If sync_binlog is set to 0, the transaction is written but not committed. If fsync is set to 1, the transaction is committed every time. If fsync is set to n, n transactions are accumulated after the transaction is written. show variables like 'sync_binlog'; show variables like 'innodb_flush_log_at_trx_commit'; # show processList show processListCopy the code

InnoDB engine BufferPool, LogBuffer, OS Buffer, Log files relationship.

InnoDB engine performs the following steps for mysql to add, delete, or alter SQL:

  • 1 When the executor gets the SQL to be executed, it needs to load the data to be modified from the disk to the memory according to the update conditions, that is, to store the data in the buffer pool.
  • 2 Before modifying the corresponding data, you need to back up the corresponding data, that is, put the data in the Undo log to facilitate the operation during the transaction rollback.
  • 3 Modify the value in the memory based on the SQL statement.
  • 4 Write the redo log buffer based on the modified data.
  • 5 Write the redo log to a disk. For details about how to write the redo log to a disk, see innodb_flush_at_trx_commit. The data is written to the cache of the system first, and then written to the system file asynchronously by the OPERATING system DMA. Flush is a flush operation that writes data from system memory to the operating system buffer. This step is usually performed by the kernel thread, which transfers data from the user thread to the kernel thread. Multiple levels of cache are placed between disk files and memory to improve data exchange efficiency. The OS cache here does just that.
  • 6 Write the bin log after the redo log is complete.
  • 7 Similar to the redo log operations, the OPERATING system writes data to the OS cache first and then flusits data to disk files. The configuration of sync_log is shown in figure 1. In general, innodb_flush_at_trx_COMMIT and SYNc_log are both set to 1.
  • 8 Commit a transaction after the redo log and bin log have been written. When data is written to disk, InnoDB writes data twice.

Redo log bin log Mysql is replicated using the bin log. If a redo log is submitted before the bin log is written, the database crashes. If a redo log is performed during mysql instance recovery, data inconsistency between the primary database and the secondary database may occur.

Binary log files are recorded in statement, row, and mixed formats. Statement mode is used to execute SQL directly, and if there is a function operation (for example, the database time is set to now()), the data will be inaccurate. Row mode is to synchronize all rows. If a full table operation changes the state, this mode is not suitable. Therefore, mixed mode should be used for data synchronization.

3.2 FreeList, LRU List, and Flush List

  • Free List Indicates the Free List

Record all data pages that are not occupied, put the data loaded into memory into the buffer pool in sequence, and delete the nodes in the corresponding Free List

  • LRU List LRU data access List

Cold and hot data blocks are connected and maintained according to LRU algorithm. If the data loaded into the memory is put into the head of the list at one time, some data will be eliminated if the heat of this batch of data is not determined. Mysql InnoDB puts the data in a later position. If the data is accessed within 1s, it can enter the head of the linked list, that is, the data hot area.

# show variables like 'innodb_old_blocks_pct'; # show variables like 'innodb_old_blocks_time';Copy the code
  • Flush List Flushes the List of dirty pages

Record the data page modified in memory, use bidirectional linked list to connect, when convenient to do disk operation.

The size of the redo log in InnoDB is fixed, and the number of files in the redo log can be configured as required. When write pos catches up with checkpoint, new commands cannot be executed and the checkpoint must be pushed forward. This is to persist the redo log to make room for further log writing.

Data manipulation

Redo log buffer is written repeatedly

The key point here is also a technique, write-Ahead Logging (WAL).

For details about the redo log configuration, see the following command.

------- Result ------ innodb_log_buffer_size 16777216 Innodb_log_checksums ON innodb_log_compressed_pages ON innodb_log_file_size 50331648 innodb_log_files_in_group 2 innodb_log_group_home_dir ./ Innodb_log_write_ahead_size 8192 innodb_log_buffer_size specifies the size of the redo log buffer in the memory. 16777216/1024/1024=16MB Innodb_log_file_size: indicates the size of each redo log. 50331648/1024/1024=48MB innodb_log_files_in_group: indicates the number of files in the redo log group. The default value is 2 Show table status like 'my_table';Copy the code
References:
  • 1 mysql practice 45 Talk

  • InnoDB Storage Engine (Version 2)