The steps that a query statement goes through

This time we will look at the execution process of a SQL statement from the perspective of MySQL's overall architecture, as shown in the following figure:


The architecture is divided into two parts: the Server layer and the storage engine layer. Here I use InnoDB to represent the engine layer. Storage engines are designed as plug-ins and can be swapped freely.

Server layer

  1. Connector: establishes the connection with the client, checks permissions, and maintains and manages the connection;
  2. Query cache: the server's query cache. If a query hits the cache, the cached result set is returned directly, skipping parsing, optimization, and execution;
  3. Parser: builds a parse tree from the query statement and checks the statement against the syntax rules, for example whether the SQL keywords are correct and in the correct order;
  4. Optimizer: converts the parse tree into a query plan. A query can generally be executed in many ways that all return the same result; the optimizer finds the lowest-cost execution plan (see the EXPLAIN sketch below);
  5. Executor: runs the execution plan by calling the query execution engine through a series of APIs.
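
As a minimal illustration, EXPLAIN asks the optimizer for the plan it would choose, without executing the query. The table here is hypothetical, created only for the example:

    -- Hypothetical table used only for this example.
    CREATE TABLE t (id INT PRIMARY KEY, a INT, KEY idx_a (a)) ENGINE=InnoDB;

    -- Shows the chosen plan, e.g. that the idx_a index is used.
    EXPLAIN SELECT * FROM t WHERE a = 1;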

InnoDB

  1. Background threads: refresh the data in the memory pool so that the cached pages are up to date, flush modified data to the disk files, and make sure the database can recover to a consistent state after an abnormal shutdown;
  2. Memory pool: the pool, also called the buffer pool, mainly compensates for the disk being much slower than memory. On a query, page data is first read from disk into the memory pool, and subsequent reads of the same page are served directly from the pool; on a modification, the data is changed in the memory pool first, and background threads then flush it to disk at a certain rate;
  3. Files: mainly tablespace files, plus some log files.

The overall MySQL 5.6 architecture, with the details of the memory pool, files, and background threads, is not introduced here; we will bring out those details in later articles. A picture of the MySQL 5.6 overall architecture is attached.
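
If you want to glance at these components on a live server, the InnoDB status report prints sections for the buffer pool, background threads, I/O, and logs (a sketch; \G is the mysql client's vertical-output terminator):

    -- One report with BUFFER POOL AND MEMORY, LOG, and
    -- BACKGROUND THREAD sections, among others.
    SHOW ENGINE INNODB STATUS\G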

How does InnoDB store data

This part builds on the previous one: to understand how InnoDB is composed, we need to go into the details of the files, the memory pool, and the background threads. We will explain each of the three in turn:

Files

The files are divided into two parts: storage files and log files:

Storage files

The storage files hold the table data itself. The overall storage structure is shown below:


Tablespaces come in two kinds of files: a shared tablespace and a separate tablespace for each table. A separate tablespace stores the data and indexes of one table, while the shared tablespace mainly stores transaction information and rollback information. Tablespaces are made up of segments, extents, pages, and rows.
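
Whether each table gets its own tablespace file is controlled by innodb_file_per_table, and the known tablespaces can be listed from information_schema (a sketch; the view is named INNODB_TABLESPACES in MySQL 8.0):

    -- ON means every table gets its own .ibd file.
    SHOW VARIABLES LIKE 'innodb_file_per_table';

    -- Tablespaces InnoDB currently knows about (MySQL 5.6/5.7 name).
    SELECT SPACE, NAME FROM INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES LIMIT 5;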

  1. Segment

    Common segments include the data segment, index segment, and rollback segment. The data segment holds the leaf nodes of the B+ tree, and the index segment holds the non-leaf nodes of the B+ tree, as shown in the diagram below:



    Each time an index is created, an index segment is created along with it, and the leaf nodes of the index segment point into the data segment; it is through this combination that we find data when querying. Therefore, the more indexes are created, the more index segments must be built and maintained, which increases data insertion time.

  2. Extent: the extent is the basic element that makes up a segment; a segment consists of several extents. An extent is a piece of physically contiguous allocated space. If an extent cannot hold more data, a new extent is allocated to store the new data. The amount of space a segment manages is unbounded and can grow indefinitely, but the smallest unit of growth is the extent. Each extent is fixed at 1MB in size and consists of pages. InnoDB usually requests 4 to 5 extents from disk at a time to ensure that the pages within an extent are contiguous. With the default page size of 16KB, an extent consists of 64 consecutive pages.
  3. Page: a page is the basic unit of an extent and the smallest unit of InnoDB disk management; pages are contiguous both logically (page numbers increase consecutively) and physically. When data is inserted into a table and the current page is full, the system allocates a new free page from the current extent; once all 64 pages of the current extent have been used up, a new extent is allocated for the segment, and pages are then allocated from that new extent.
  4. Row: InnoDB stores data in rows. Each page can hold at most 16KB of data; when a row is larger than that, the row overflows and the excess is stored in an external page (Uncompressed BLOB Page). A small inspection sketch follows this list.
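
To make the hierarchy concrete, you can check the page size and see what kinds of pages InnoDB is currently holding (a sketch using standard information_schema views; the second query can be slow on a large buffer pool):

    -- The fixed page size, 16KB by default.
    SHOW VARIABLES LIKE 'innodb_page_size';

    -- Cached pages by type; INDEX pages carry the B+ tree
    -- leaf (data) and non-leaf (index) records.
    SELECT PAGE_TYPE, COUNT(*) AS pages
    FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
    GROUP BY PAGE_TYPE;
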
Log files

There are three log files: the binlog, the redo log, and the undo log.

binlog

The binlog records the write operations performed by the database (it does not record queries) and saves them on disk in binary format. The binlog is MySQL's logical log and is written by the Server layer, so it is recorded no matter which storage engine the database uses. The size of each binlog file is set with the max_binlog_size parameter; when a file reaches the specified size, a new file is created to continue storing the log.
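
You can confirm the binlog is enabled and watch the files rotate (a sketch; SHOW BINARY LOGS requires the binlog to be on):

    -- Is the binlog enabled, and at what size do files rotate?
    SHOW VARIABLES LIKE 'log_bin';
    SHOW VARIABLES LIKE 'max_binlog_size';

    -- The binlog files currently on disk.
    SHOW BINARY LOGS;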

Binlog log formats
  1. ROW: row-based replication. It does not record the context of each SQL statement, only which rows were modified. Advantages: none of the cases where stored procedures, functions, or trigger calls cannot be replicated correctly. Disadvantages: because every modified row must be logged, the log volume can skyrocket;
  2. STATEMENT: statement-based replication. Every SQL statement that modifies data is recorded in the binlog. Advantages: no need to record each row change, which reduces the binlog volume, saves I/O, and improves performance. Disadvantages: can cause master/slave data inconsistency in some cases, for example when non-deterministic functions such as sysdate() are used;
  3. MIXED: replication based on a mix of the STATEMENT and ROW modes. In general, STATEMENT mode is used to write the binlog for replication; for operations that STATEMENT mode cannot replicate safely, ROW mode is used instead (a sketch for checking and switching the format follows below).
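
The format is exposed as an ordinary server variable (a minimal sketch; changing it affects replication, so do so with care):

    -- Check and switch the binlog format.
    SHOW VARIABLES LIKE 'binlog_format';
    SET GLOBAL binlog_format = 'ROW';
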
Usage scenarios

Binlog can be used in two scenarios: master/slave replication and data recovery.

  1. For Master/Slave replication, enable binlog on the Master and then send the binlog to each Slave. The Slave replays the binlog to achieve data consistency between the Master and Slave.
  2. For data recovery, use the mysqlbinlog tool to replay the binlog and restore data to a point in time (you can first inspect a file's events, as sketched below).
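
Before replaying a binlog you can inspect its events from SQL (a sketch; the file name is only an example, use one listed by SHOW BINARY LOGS):

    -- Peek at the first events recorded in a binlog file.
    SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;
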
Flush timing

For the InnoDB storage engine, the binlog is only recorded when a transaction commits, and at that point the records are still in memory. MySQL controls the flush timing of the binlog with the sync_binlog parameter, whose value ranges from 0 to N: N means the binlog is flushed to disk every N commits, while 0 leaves the flush timing to the operating system. Setting innodb_support_xa to 1 ensures that the binlog and InnoDB's redo log are kept in sync through two-phase commit.
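
A minimal sketch of the durability knob (sync_binlog = 1 is the safest setting):

    -- 1 = flush the binlog on every commit; N = every N commits;
    -- 0 = let the operating system decide when to flush.
    SHOW VARIABLES LIKE 'sync_binlog';
    SET GLOBAL sync_binlog = 1;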

redo log

The redo log consists of two parts: the redo log buffer and the redo log file. The redo log buffer lives in memory and the redo log file on disk; writes go to the redo log buffer first and reach the redo log file when certain conditions are met. So when is the buffer written to the file?

  1. The Master Thread among InnoDB's background threads flushes the redo log buffer to disk once per second.
  2. The innodb_flush_log_at_trx_commit parameter controls the flush timing. When set to 1, every transaction commit writes the log buffer to the OS buffer and calls fsync() to persist it to the log file on disk; no data is lost even if the system crashes, but I/O performance is poor because every commit writes to disk. When set to 0, the log buffer is not written at commit time; instead it is written to the OS buffer once per second, and fsync() is called to persist it to the log file, so roughly one second of data can be lost if the system crashes. When set to 2, every commit writes only to the OS buffer, and fsync() is called once per second to persist the OS buffer to the log file on disk (the knob is sketched below).
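
A minimal sketch of the trade-off (1 is the fully durable setting):

    -- 1 = fsync on every commit; 0/2 = up to ~1 second of
    -- log can be lost if the system crashes.
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;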


Redo log format

The redo log records changes to data pages. The redo log files are fixed in size and written in a circular fashion: when the write reaches the end, it wraps around to the beginning, essentially forming a loop.


The redo log is designed to reduce how often data pages must be flushed to the actual data files. The relationship between the data files, the redo log, and the checkpoint looks like this:


The ring in the figure consists of 4 ib_logfile_* files; these are the redo log files. The number of files is controlled by the innodb_log_files_in_group parameter and their size by innodb_log_file_size. Do not set the size too large, or crash recovery will be slow; do not set it too small either, or a single transaction may have to switch log files several times.
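
The sizing knobs are ordinary server variables, though not dynamic (a sketch; they are set in my.cnf and take effect after a restart):

    -- Size of each redo log file, and how many are in the group.
    SHOW VARIABLES LIKE 'innodb_log_file_size';
    SHOW VARIABLES LIKE 'innodb_log_files_in_group';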


Write pos and checkpoint: the space between write pos and the checkpoint is empty redo log; the space between the checkpoint and write pos holds the data page changes recorded in the redo log. When write pos catches up with the checkpoint, the checkpoint must be moved forward to make room for new logs.


When InnoDB starts, it attempts to recover the database regardless of whether the last shutdown was clean. There are two cases:

  1. The checkpoint records the LSN up to which the logged changes have been flushed to the data pages on disk; the LSN is essentially the total number of log bytes written. Only the log from the checkpoint onward needs to be recovered, and only for committed transactions. When the database starts, the LSN of the data pages on disk is checked: if a page's LSN is smaller than the LSN in the log, recovery replays the log starting from the checkpoint.
  2. Data pages may be flushed faster than the log. After a crash, the LSN recorded in a data page may be larger than the LSN recorded in the log; this is detected during recovery at restart. In that case the portion that is ahead of the log's progress is not redone, because that work has already been done.

Because the redo log records the physical changes to data pages, recovery with it is much faster than with logical logs such as the binlog.

Usage scenarios

The redo log ensures the durability of transactions. It records the state after each change made by a transaction and is used to recover the updates of committed transactions that had not yet been written to the data files. If dirty pages have not been written to disk when a failure occurs, MySQL replays the redo log when the service restarts, thereby achieving transaction durability.

undo log

The undo log records the logical changes of data; it supports both the rollback of user transactions and MVCC. Undo logs are stored as rollback segments in the shared tablespace.

Undo log format

The undo log is a logical log: when a transaction is rolled back, it logically restores the data to its state before the transaction. For example, the undo record for an INSERT is logically a DELETE, and the undo record for an UPDATE is the reverse UPDATE.

Usage scenarios

The undo log guarantees atomicity: the version of the data from before the transaction can be used for rollback. It also provides multi-version concurrency control (MVCC), also known as non-locking reads.

Cleanup timing

After a transaction commits, its undo log cannot be deleted immediately, because other transactions may still need the earlier versions it records. Instead it is placed on a linked list of undo logs to be cleaned, and the Purge thread decides whether the undo log space can actually be reclaimed by checking whether any other transaction still uses version information older than the last transaction in the undo segment.

Memory pool

The InnoDB storage engine is disk-based, meaning the data is stored on disk. Because of the gap between CPU speed and disk speed, InnoDB uses buffer pool technology to improve the overall performance of the database. A memory pool is simply an area of memory. When a page is read, the page fetched from disk is stored in the memory pool; the next time the same page is read, the system checks whether it is in the memory pool and, if so, reads it directly, otherwise it reads the page from disk. When a page is changed, it is first modified in the memory pool and then flushed to disk at a certain frequency, rather than being flushed back on every change.


The information cached in the memory pool includes index pages, data pages, the insert buffer, the adaptive hash index, lock info, and data dictionary information. Index pages and data pages make up a large portion of the buffer pool. In InnoDB, the default page size in the memory pool is 16KB, the same as the default page size on disk. Having introduced the storage structure of the data files, you should already have some sense of what these cached structures contain, so we will not introduce them separately; instead we will focus on the insert buffer and the adaptive hash index, and then on the design principles of the memory pool itself.
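
A quick look at what the pool is holding (a sketch using the standard INNODB_BUFFER_POOL_STATS view):

    -- Total pages, free pages, cached data pages, and dirty pages.
    SELECT POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, MODIFIED_DATABASE_PAGES
    FROM INFORMATION_SCHEMA.INNODB_BUFFER_POOL_STATS;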

Insert Buffer

The insert buffer works as follows: when a record is inserted, InnoDB checks whether the non-clustered index page it belongs to is in the buffer pool. If it is, the insert is applied directly; if not, the change is first put into an Insert Buffer object, as though the insert had already reached the leaf node of the non-clustered index, while it is actually stored elsewhere. The insert buffer is then merged into the secondary index pages at a certain frequency and under certain conditions, which usually improves performance for non-clustered index inserts. The downside is that if the MySQL database goes down while a large insert buffer has not yet been merged into the non-clustered index pages, recovery can take a long time. Two conditions must be met for the insert buffer to be used: the index is a non-clustered index, and the index is not unique. We will talk about the implementation next time.
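
The mechanism is exposed through the change buffering setting, its modern and more general name (a sketch; 'all' also buffers deletes and purges):

    -- none / inserts / deletes / changes / purges / all
    SHOW VARIABLES LIKE 'innodb_change_buffering';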

Adaptive hash index

The InnoDB storage engine monitors queries against the indexes on its tables. If it observes that building a hash index would improve speed, it builds one automatically; such an index is called an adaptive hash index (AHI). The AHI is constructed from the B+ tree pages already in the buffer pool, so it is built very quickly, and a hash index is not built over the entire table: InnoDB automatically hashes only certain hot pages, based on the frequency and pattern of access.
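
The AHI can be checked and toggled at runtime (a sketch; SHOW ENGINE INNODB STATUS also reports hash searches per second):

    -- On by default; can be switched off if it hurts your workload.
    SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';
    SET GLOBAL innodb_adaptive_hash_index = ON;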

Background threads

Master Thread

This is the core thread. It is mainly responsible for asynchronously flushing data from the buffer pool to disk to keep the data consistent, including flushing dirty pages and merging the insert buffer.

IO Thread

The InnoDB storage engine makes heavy use of asynchronous I/O (AIO) to handle I/O requests. The IO threads are mainly responsible for the callback processing of these I/O requests.

Purge Thread

After a transaction commits, its undo log may no longer be needed, so the Purge Thread reclaims the undo pages that have been allocated and used. InnoDB supports multiple Purge Threads to speed up the recycling of undo pages.

With the overall features introduced, let's talk about how data is inserted into the InnoDB engine. Assume the following scenario:

    create table T(id int primary key, a int not null, name varchar(16), index(a)) engine=InnoDB;
    insert into T (id,a,name) values (id1,a1,'hahaha'),(id2,a2,'hahaha');

There are two possible cases when inserting the data. The first case assumes that the page for id1 is in the memory pool:

  1. Update the index page and data page in the Buffer Pool;
  2. Write the redo log, which enters the prepare state;
  3. Write the binlog;
  4. Commit the transaction; the redo log enters the commit state (two-phase commit);
  5. Background threads write the changes to the index segment and data segment of the data file.

The second case assumes that the page for id2 is not in the memory pool:

  1. The data is written into the memory pool: the non-clustered index change goes to the Insert Buffer, and the rest of the data to the data page;
  2. The subsequent steps are the same as the remaining steps above.

Further reading

Let’s talk about how a Buffer Pool works, from the following three aspects:

  1. How are cached pages managed?

    InnoDB creates control information for each cached page, including the page's tablespace number, page number, address in the Buffer Pool, LSN, and other information. The control information for each cached page occupies the same amount of memory; let's call the memory occupied by one page's control information a control block. Control blocks and cache pages correspond one to one, and both are stored in the Buffer Pool: the control blocks at the front and the cache pages at the back. So the entire Buffer Pool looks like this:



    The fragment in the figure is simply leftover memory that is too small to hold another control block and cache page pair.

    When the MySQL server first starts, the Buffer Pool must be initialized: its memory is allocated and divided into control blocks and cache pages. At this point no real disk pages are cached yet; pages accumulate in the Buffer Pool as the server runs. This raises a question: how do we tell which cache pages in the Buffer Pool are free and which are in use? We had better keep track of the free pages somewhere. We can wrap each free page as a node and link the nodes together, forming what is called the Free list. Since all cache pages in a newly initialized Buffer Pool are free, every cache page starts out on the Free list. The overall design is as follows:


    As the figure shows, the Free list records the address of its first node, the address of its last node, and the number of nodes currently in the list. Each Free list node records the address of a cache page's control block, and each control block records the address of its cache page, so each Free list node effectively corresponds to one free cache page.

    Whenever a page needs to be loaded from disk into the Buffer Pool, a free cache page is taken from the Free list, the corresponding control block information is filled in, and the page's Free list node is removed from the list, indicating that the cache page is now in use.

  2. How are pages evicted from the cache? The machine's memory is limited, so the size of the MySQL InnoDB Buffer Pool is also limited. If the pages that need caching exceed the size of the Buffer Pool, InnoDB evicts pages with the classic LRU algorithm in order to keep the cache hit ratio high. When there are no free pages left in the Buffer Pool, some of the least recently used pages must be evicted. When we need to access a page, the LRU list is handled as follows: 1. if the page is not in the Buffer Pool, the cache page it is loaded into is wrapped as a node and inserted at the head of the list; 2. if the page is already in the Buffer Pool, its LRU node is moved directly to the head of the list. Plain LRU has performance problems, however: a single full table scan or logical backup can flush out the hot data and pollute the buffer pool! All the data pages in the Buffer Pool are replaced once, and other queries have to load their pages from disk all over again, even though the full table scan statement is executed rarely; this severely affects other queries' use of the Buffer Pool and lowers the cache hit ratio. For this scenario the InnoDB storage engine optimizes the traditional LRU algorithm by adding a midpoint. A newly read page, although it is the most recently accessed, is not inserted directly at the head of the LRU list but at the midpoint of the list. This is called the midpoint insertion strategy. By default the midpoint sits at 5/8 of the list length, controlled by the innodb_old_blocks_pct parameter. The part of the list before the midpoint is called the new list and the part after it the old list; pages in the new list can simply be thought of as the most active hot data (see the sketch after this list).
  3. How are dirty pages flushed? If an update is made in the cache pool first, the cached page no longer matches the page on disk; such cached pages are called dirty pages. So when are these modified pages flushed to disk? The simplest approach would be to synchronize every change to the corresponding disk page immediately, but writing to disk that frequently would seriously hurt performance. So rather than rushing each modification to disk, the changes are synchronized at some later point, when a background refresh thread flushes them in turn. But if we do not synchronize immediately, how do we know, when the time comes, which pages in the Buffer Pool are dirty and which were never modified? We need a list of dirty pages: every page in the LRU list that has been modified is added to it. The list is called the Flush list because all pages on it must eventually be flushed to disk. A page is added to the Flush list the first time it is modified after being loaded into the Buffer Pool; later modifications do not add it again, because it is already there. Note that the dirty page data itself still lives in the LRU list; the entries in the Flush list are pointers to the dirty pages in the LRU list.
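
Both mechanisms can be observed from SQL (a sketch using the standard information_schema views; scanning the buffer pool tables can be slow on a large pool):

    -- LRU knobs: the share of the list given to the old list
    -- (default 37%, i.e. the midpoint at roughly 5/8) and how long
    -- a page must stay there before being promoted to the new list.
    SHOW VARIABLES LIKE 'innodb_old_blocks_pct';
    SHOW VARIABLES LIKE 'innodb_old_blocks_time';

    -- Walk the LRU list from the head; OLDEST_MODIFICATION > 0
    -- marks a dirty page, i.e. one that is also on the Flush list.
    SELECT TABLE_NAME, PAGE_TYPE, LRU_POSITION,
           OLDEST_MODIFICATION > 0 AS is_dirty
    FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE_LRU
    ORDER BY LRU_POSITION
    LIMIT 10;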

The end

If this was helpful, a follow and a like are appreciated. Thank you!