Brief introduction: They record REDO and UNDO logs for their servers while maintaining atomicity (A, Atomic) and Durability (D, Durability) while maintaining flexible flush policies for disk sequential writes. The ARIES method. This article focuses on the function of REDO LOG, LOG content, organization structure, and writing method. We hope readers can understand the position of REDO LOG in InnoDB. This article is based on MySQL 8.0 code.

The author | he | ali technology derived from the public

They record REDO and UNDO logs for their servers while maintaining atomicity (A, Atomic) and Durability (D, Durability) while maintaining flexible flush policies for disk sequential writes. The ARIES method. This article focuses on the function of REDO LOG, LOG content, organization structure, and writing method. We hope readers can understand the position of REDO LOG in InnoDB. This article is based on MySQL 8.0 code.

Why do I need to log REDO

In order to achieve better read/write performance, InnoDB will cache data in the Buffer Pool (InnoDB Buffer Pool), and the modification of disk data will lag behind the memory. In this case, if the process or machine crashes, the memory data will be lost. In order to ensure the consistency and durability of the database itself, InnoDB maintains a REDO LOG. Before modifying a Page, record the modified content in REDO LOG and ensure that the REDO LOG falls before the corresponding Page. WAL is also called Write Ahead LOG. When a failure occurs and memory data is lost, InnoDB replays REDO on restart to restore the Page to its pre-crash state.

What REDO is required

So what kind of REDO do we need? First, REDO maintenance adds a log of data to the database, and to ensure that data is correct, a transaction does not return success to the user until all of its REDO logs fall. The time of REDO logs directly affects system throughput, so keep the amount of REDO data to a minimum. Second, system crashes often occur unexpectedly. When REDO REDO logs are restarted, the system does not know which REDO REDO pages have fallen, so REDO REDO logs must be reentrant, meaning that REDO logs must be idempotent. Finally, to speed up restart recovery through concurrent replays, REDO logs should be page-based, meaning that a REDO log involves only one Page change.

For those familiar with InnoDB, the advantages of Logical Logging are low data volume and idempotent and Page-based Physical Logging, so InnoDB uses a method called Physiological Logging to take advantage of both. Physiological Logging is recorded in a logical way in a Page. For example, a REDO log of type MLOG_REC_UPDATE_IN_PLACE records a change to a Record in a Page as follows:

(Page ID, Record Offset, (Filed 1, Value 1)… (Filed i, Value i) … )

Where, PageID specifies the Page Page to operate, Record Offset records the Record in the Page Offset position, behind the Field array, Record the need to modify the Field and the modified Value.

Physiological Logging adopts logical notation in physical Page, leading to two problems:

REDO the REDO file based on the correct Page state

Because REDO logs two changes within a single Page in a logical manner, REDO must be based on the correct Page state. However, InnoDB’s default Page size is 16KB, which is larger than the 4KB atomic size guaranteed by the file system, so the Page content can be half successful. InnoDB uses a Double Write Buffer to ensure that the correct Page state is found when InnoDB recovers by writing twice. This will be covered in more detail later when we introduce Buffer pools.

2. Ensure that REDO REDO is idempotent

The Double Write Buffer ensures that the correct Page state is found. We also need to know which log on the REDO log corresponds to this state to avoid repeated changes to the Page. To do this, InnoDB records a globally unique increasing Number (LSN) for each REDO Log. When a Page is modified, the LSN of the REDO log is recorded on the Page (FIL_PAGE_LSN). In this way, when REDO REDO is resumed, the applied REDO log can be judged to be skipped, thus implementing idempotency of REDO REDO.

What is recorded in sanredo

Knowing how REDO is recorded in InnoDB, what is recorded in REDO? As of MySQL 8.0, InnoDB has over 65 REDO logs in response to the wide variety of InnoDB requirements. REDO log is used to log the different types of REDO log. The 65 REDO logs can be classified into three major categories based on who the REDO logs are for: Page, Space, and Logic types that provide additional information.

1. REDO file on Page

This type of REDO log is the majority of all REDO log log types, and can be subdivided into Index PageREDO, Undo PageREDO, Rtree PageREDO, etc. For example, MLOG_REC_INSERT, MLOG_REC_UPDATE_IN_PLACE, and MLOG_REC_DELETE correspond to the insertion, modification, and deletion of records in a Page respectively. Here is another example of MLOG_REC_UPDATE_IN_PLACE:

Where Type is MLOG_REC_UPDATE_IN_PLACE, Space ID and Page Number uniquely identify a Page. These are the required headers for all REDO logs. MLOG_REC_UPDATE_IN_PLACE, where Record Offset is used to give the Offset of the Record’s position in the Page, Update Field Count specifies how many fields are in the Record to change. Secondly, Field Number, Field Data Length and Field Data are given for each Field.

REDO files on Space

This type of REDO file changes to a Space file, such as MLOG_FILE_CREATE, MLOG_FILE_DELETE, and MLOG_FILE_RENAME, create, delete, and rename a Space, respectively. Because the REDO log is not recorded until after the REDO log is completed, it indicates that the REDO log is successful. Therefore, most of the REDO log is performed to check the status of the file, using the MLOG_FILE_CREATE command to see what is recorded:

The first three fields are Type, Space ID, and Page Number. The Page Number is always 0. After this, the flag file and the file name are recorded for the check during the restart.

Logic REDO provides additional information

In addition to the above types, there are a few REDO types that do not involve data modification, but simply log information. For example, the most common one, MLOG_MULTI_REC_END, marks the end of a REDO group, or a complete atomic operation.

4. How is REDO organized

The REDO file organization is how REDO files are recorded to disk so that REDO files can be efficiently written, read, restored, and cleaned. REDO files are divided into three layers from top to bottom: logical REDO layer, physical REDO layer, and file layer.

Logical REDO layer

This layer is the true REDO log. REDO log consists of REDO log endings of different types, with a globally unique increasing offset sn. InnoDB maintains the current maximum value of SN in the global log_sys and increases the length of REDO log by SN each time InnoDB writes data. As shown below:

2 Physical REDO layer

A disk is a Block device. Blocks are used to read and write data in InnoDB. The length of a Block OS_FILE_LOG_BLOCK_SIZE is equal to 512 BYTES of the disk sector. In addition to the REDO data, some additional information is needed in a Block. The following diagram shows the composition of a Log Block, including the 12-byte Block Header: Flush Flag contains the highest bit, which identifies the first Block of an I/O. The remaining 31 bits are Block numbers. After that, the data length is 2 bytes, and the value is [12,508]. The following 2-byte First Record Offset points to the start of the First REDO group in the Block. This value allows any Block to find a valid REDO starting position. The last 4-byte Checkpoint Number records the next_checkpoint_number used when writing blocks, which is used to detect file recycling. This is explained in detail at the file level. At the end of the Block is a 4-byte Block Tailer, which records the Checksum of the current Block. With this value, it is clear when reading the Log whether the Block has been written completely.

The remaining 498 middle bytes of the Block are where the REDO log is stored, the logical REDO log we described above. We now place logical REDO REDO in a physical REDO space. Since the size of the Block is fixed and the length of the REDO log is variable, there may be multiple REDO REDO logs in one Block, or one REDO log log may be split into multiple REDO logs, as shown in the following figure. Brown and red represent Block headers and tailers, respectively. The REDO log in the middle is split into two consecutive blocks due to insufficient space in the previous Block.

Due to the increased byte overhead of the Block Header and Tailer, the LSN is used to indicate the offset in the physical REDO space. We can see that there is a simple conversion between LSN and SN:

constexpr inline lsn_t log_translate_sn_to_lsn(lsn_t sn) {
  return (sn / LOG_BLOCK_DATA_SIZE * OS_FILE_LOG_BLOCK_SIZE +
          sn % LOG_BLOCK_DATA_SIZE + LOG_BLOCK_HDR_SIZE);
}
Copy the code

SN plus the length of all previous Block headers and tailers translates to the corresponding LSN and vice versa.

Layer 3 files

Finally, REDO is written to the REDO logfile as ib_logFILE0, ib_logFILe1… Innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group: innodb_log_files_in_group Write REDO to multiple files sequentially. Each file is divided into blocks. Four blocks are reserved at the beginning of each file to record additional information. The first Block is called a Header Block.

The first Header Block contains file information, as shown in the following figure. The 4-byte Formate field records the Log version. There are changes in the REDO type for each version of the Log. The 8-byte Start LSN identifies the Start LSN of the current file. This information can be used to match the offset of the file with the corresponding LSN. Finally, the Creator information is up to 32 bits long, which normally records the MySQL version.

Logical REDO is the data that is really needed, indexed by SN. Logical REDO is organized as a fixed-size Block, with Block headers and endings added to form physical REDO. Indexed by LSN, these blocks are placed somewhere in the circular file space. File indexed with offset:

Although a REDO file location can be uniquely identified by an LSN, the REDO file must be converted to the file I/O, and the file offset must be represented as follows:

const auto real_offset =
      log.current_file_real_offset + (lsn - log.current_file_lsn);
Copy the code

The file offset, current_FILE_real_offset, and the corresponding LSN, current_FILe_lsn at the beginning of the current file are updated in memory when switching files. Using these two values, THE LSN can be easily converted to file offset in the same way as above. Note that the offset here corresponds to the entire REDO file space. Since InnoDB’s space layer supports multiple REDO files, the first connected REDO files can be treated as one large file, and the offset here is the offset in the large file.

How to write REDO effectively

As important information for maintaining database integrity, REDO logs must be dropped before a transaction is committed. Otherwise, data may be lost in the event of a power outage. Therefore, the entire process from REDO log generation to final drop becomes a critical path for database writes, and its efficiency directly determines database write performance. This process includes REDO writing, REDO writing to InnoDB Log Buffer, REDO writing from InnoDB Log Buffer to the operating system Page Cache, REDO flushing, and then waking up waiting user threads to Commit. Let’s take a look at how InnoDB can efficiently write REDO with high concurrency through these phases.

1 REDO generated

We know that a transaction generates REDO logs while writing data. A single atomic operation may contain multiple REDO logs. These REDO logs may visit different places on the same Page, or they may visit different pages (such as a Btree split). InnoDB has a full set of mechanisms to ensure that REDO REDO atoms are either replayed or not replayed at all. This will be covered later in the restore logic section. This article only covers the basic requirement that REDO logs must be contiguous. InnoDB is implemented through min-Transaction, MTR for short. When atomic operation is needed, mtr_start is called to generate a MTR, which maintains a dynamically growing M_log, which is a dynamically allocated memory space. Write all REDO REDO to InnoDB’s Log Buffer. When atomic REDO is complete, call mtr_commit to copy the data from m_log to InnoDB’s Log Buffer.

2 Write InnoDB Log Buffer

In a high-concurrency environment, there are a lot of min-Transactions (MTR) that need to be copied to the Log Buffer at the same time, and this becomes a significant performance bottleneck if the lock is mutually exclusive. Therefore, starting from MySQL 8.0, a lockless log writing mechanism is designed, whose core idea is to allow different MTR to concurrently write different positions of the log Buffer. Different MTR types first call the log_buffer_reserve function, which fetch_add the global offset log.sn atomically using its own REDO length to get its own space in the log Buffer. After that, different MTR copies the data in their m_log to their own space in parallel.

/* Reserve space in sequence of data bytes: */
const sn_t start_sn = log.sn.fetch_add(len);
Copy the code

3 Write to the Page Cache

REDO data written to the Log Buffer needs to be further written to the operating system’s Page Cache. InnoDB has a separate log_writer to do this. There is a problem here. Since the data in the Log Buffer is written concurrently by different MTR, there are holes in the Log Buffer during this process, so log_writer needs to sense the end of the continuous Log in the current Log Buffer. Write continuous logs to the operating system Page Cache through the pwrite system call. InnoDB introduces a data structure called link_buf here, as shown in the figure below:

Link_buf is a circular array, and modulo each LSN to get a slot on the link_BUF where the REDO length is recorded. Another thread traverses the log file from the beginning to the end of the REDO log file until the last REDO file is 0. This position is called the tail of the link_buf. Let’s see how log_writer and many MTR’s take advantage of this link_buf data structure. The link_buf here is log.recent_written, as shown below:

The upper part of the figure is the REDO log diagram. Write_lsn is the log that has been written to the end of the Page Cache. Current_lsn is the maximum LSN position that has been assigned to MTR. Buf_ready_for_write_lsn is the continuous end of the Log Buffer found by log_writer. The range from write_lsn to buf_readY_for_write_lsn is the range in which log_writer can write to the Page Cache consecutively next time. The range from buF_readY_FOR_write_Lsn to current_lsn is where THE CURRENT MTR is concurrently writing to the Log Buffer. Recent_written is the data structure of log.recent_written. You can see that buf_readY_for_write_LSN is unable to advance due to the two all-zero holes in the middle. If the MTR, the first empty space in the middle, has also finished writing the Log Buffer and updated log.recent_written*, as shown below:

Recent_written (buf_readY_for_write_lsn); recent_written (buf_ready_for_write_lsn);

So lift the current buf_readY_for_write_lsn and slide the tail position of log.recent_written forward to zero for future reuse:

Log_writer then flushed the continuous content and promoted write_LSN.

Brush set

When the log_writer elevates the write_lsn, the log_flusher thread is notified, and the log_flusher thread calls fsync to flush REDO, completing the REDO write process.

5 Wake up the user thread

To ensure that data is correct, transactions cannot commit until REDO writes are complete, so during REDO writes, a large number of user threads block until their last log writes. By default, innodb_flush_log_at_trx_commit = 1. Wait for REDO to finish flushing. This is the safest way to flush. You can also set innodb_flush_log_at_trx_commit = 2 so that any REDO write to the Page Cache is considered complete. In extreme cases, power failures may cause data loss.

A large number of user threads call log_write_up_to and wait on their LSN. To avoid a large number of invalid wakes, InnoDB splits blocking condition variables into multiple ones, and log_write_UP_TO modules into different condition variables according to the block in which the LSN is waiting. InnoDB also introduces two threads that are responsible for waking up users in order to prevent a lot of wakeup work from affecting the log_writer or log_flusher thread: Log_wirte_notifier and log_flush_notifier. When more than one condition variable needs to be aroused, log_writer and log_flusher notify the two threads to complete the wakeup work. Here’s a schematic of the process:

With the help of some internal data structure, multiple threads efficiently complete the process from REDO generation, REDO disk writing, and wake up of user threads. The following is a sequence diagram of the entire process:

How do I safely Clean REDO

Due to limited REDO file space and to minimize REDO REDO during recovery, InnoDB introduces the log_checkpointer thread to Checkpoint periodically. During a restart and recovery, REDO logs are replayed from the latest Checkpoint, so that REDO logs from the previous Checkpoint can be deleted or reused.

We know that REDO is used to avoid memory loss due to failure, so the Checkpiont position must ensure that all memory dirty pages created by REDO have been flushed. Most directly, we can obtain the minimum REDO LSN for all current dirty pages from the Buffer Pool: lwm_lsn. This is not enough, however, because a portion of min-transaction REDO REDO pages have not been added to the dirty pages in the Buffer Pool. If checkpoint access occurs after these REDO REDO pages, this portion of data will be lost in the event of failover. Therefore, you also need to know the REDO LSN location of the Buffer Pool: dpa_lsn. The lower of the two values is the final checkpoint. The core logic is as follows:

/* LWM lsn for unflushed dirty pages in Buffer Pool */
lsn_t lwm_lsn = buf_pool_get_oldest_modification_lwm();

/* Note lsn up to which all dirty pages have already been added into Buffer Pool */
const lsn_t dpa_lsn = log_buffer_dirty_pages_added_up_to_lsn(log);

lsn_t checkpoint_lsn = std::min(lwm_lsn, dpa_lsn);
Copy the code

MySQL 8.0 allows concurrent registration of dirty pages to Buffer pools in order to allow greater concurrency between MTR. Similar to log.recent_written and log_writer, we introduce a link_buf called recent_closed to handle the void caused by concurrency. The tail of recent_closed is raised by a separate thread, log_closer. This is the maximum NUMBER of consecutive dirty pages added to the Buffer Pool (dPA_LSN). It should be noted that the value of LWM_lsn cannot simply obtain the LSN of the oldest dirty page in the current Buffer Pool due to the existence of such disordering. For the sake of conservatism, we need to subtract the capacity of recent_CLOSED, that is, the maximum disordering range. The simplified code is as follows:

/* LWM lsn for unflushed dirty pages in Buffer Pool */
const lsn_t lsn = buf_pool_get_oldest_modification_approx();
const lsn_t lag = log.recent_closed.capacity();
lsn_t lwm_lsn = lsn - lag;

/* Note lsn up to which all dirty pages have already been added into Buffer Pool */
const lsn_t dpa_lsn = log_buffer_dirty_pages_added_up_to_lsn(log);

lsn_t checkpoint_lsn = std::min(lwm_lsn, dpa_lsn);
Copy the code

There is a problem here. Since lwM_LSN has subtracted the capacity of recent_CLOSED, theoretically this value must be less than DPA_Lsn. It may not make sense to compare LWM_LSN and DPA_LSN to get Checkpoint positions.

As mentioned above, two of the first three blocks of the ib_logfile0 file are reserved as Checkpoint blocks, which are used interchangeably during Checkpiont. In this way, no Checkpoint is available due to a crash during Checkpoint writing. The Checkpoint Block reads as follows:

The Checkpoint LSN is the last Checkpoint Number. The Checkpoint LSN is the REDO log at which the REDO log was Checkpoint. The REDO log is replayed from this Checkpoint. The following 8 bytes of Checkpoint Offset correspond to the Checkpoint LSN Offset of the file space. The last 8 bytes are the length of the Log Buffer mentioned earlier, which is not currently used in the recovery process.

Seven summarizes

This article systematically introduces the function of REDO in InnoDB, characteristics, organizational structure, write mode and cleaning time, basically covering most of REDO content. How to use REDO to restore the database to the correct state during restart recovery will be covered later when InnoDB failover mechanism is introduced.

reference

[1] MySQL 8.0.11 Source Code Documentation: the Format of redo log dev.mysql.com/doc/dev/mys…

[2] MySQL 8.0: the New Lock free, scalable WAL design mysqlserverteam.com/mysql-8-0-n…

[3] How InnoDB handles REDO logging www.percona.com/blog/2011/0…

[4] MySQL Source Code github.com/mysql/mysql…

[5] database failover mechanisms of incarnations catkang. Making. IO / 2019/01/16 /…

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.