About the author

Yang Jianrong is a co-founder of the DBAplus Community and currently works at Sohu Changyou. He is an Oracle ACE and YEP member with more than seven years of database development and operations experience, specializing in telecom data business, database migration, and performance tuning. He holds Oracle 10g OCP, OCM, and MySQL OCP certifications and is the author of Oracle DBA Working Notes.


There is a saying that there are only two kinds of tools in the world: those that are criticized and those that nobody uses. The more a tool is criticized, the more attention it gets and the more it is used; and the more it is used, the more mature it becomes. MySQL is a good example. MySQL supports many storage engines, and InnoDB is the best of them.


InnoDB has been the default storage engine since MySQL 5.5.5. MyISAM had its place before that, but when the MySQL development team announced the DMR (development milestone release) of MySQL 8.0.0 and jumped the version number from 5.x to 8.0, one of the highlights was the transactional data dictionary, completely decoupled from the MyISAM storage engine. So InnoDB is the key storage engine to study in MySQL, and its double write feature is particularly interesting; it is the focus of our discussion today.


In fact, MySQL and Oracle both face this problem, but each solves it in a different way. I have seen many MySQL DBAs on the Internet struggling with it and arguing about it, while on the Oracle side, by contrast, there is more silence. I have followed the discussions, but so far I have not seen a write-up that covers both sides, so I decided to compare some of the implementations and differences between MySQL and Oracle myself. A lot of this is personal opinion, so some of it may not be accurate, but the point is to provoke some thought and discussion.


Double write in InnoDB


First let’s talk about InnoDB and double write.


InnoDB has three standout features: insert buffer, double write, and adaptive hash index. There are a few others, such as asynchronous I/O and Flush Neighbor Page, but those are tied to the system level, so the three features above are the universal ones.


First, let's take a quick look at why double write is designed the way it is and what problem it solves. I have drawn a relatively crude picture of it, leaving out many details, but it is enough to illustrate the point.



In general, the double write buffer is a kind of buffering and caching technique whose main purpose is to prevent data loss on a power failure or abnormal crash. A few things to note: pages in the buffer pool are modified and become dirty pages, and binlog and redo records are generated during this process; writing the cached data back to the data files is an asynchronous task. If you look closely, there is a 2MB area in the system tablespace, divided into two extents, 128 pages in total: 120 of them are used for batch flushing of dirty pages and the other 8 for Single Page Flush.


According to the analysis by Zhai Weixiang of Alibaba, the reason for this split is that batch flushing is done by background threads and does not affect foreground threads, while Single Page Flush is initiated by a user thread that needs the flush done as quickly as possible so it can get a free page. So it is not a strict 64+64 split; I link to his article at the end. (https://yq.aliyun.com/articles/50627)


During flushing, memcpy is used to copy the dirty pages into the double write buffer in memory, which is then written out in two writes of 1MB each to the shared (system) tablespace, after which fsync is called to synchronize the data to disk. Note that although the write to the shared tablespace happens twice, it is sequential, so it is not very expensive, and double write is not as bad as you might expect: according to Percona, the performance difference is about 5%, which is basically acceptable. Of course, the corresponding tablespace files are written afterwards, and that process is random I/O with a higher performance cost. This is why many people had the same concern when they first tried SSDs: sequential write or random write? That concern is also addressed later in this article.


Of course, double write exists for recovery; otherwise it would not be worth the fuss. For page verification, the central concept is the checksum. If a partial write occurs, that is, a page is only partially written and its checksum no longer matches, the page can be reconstructed from the copy previously written to the shared tablespace and written again.


Another feature of double write is that when data is written from the double write buffer to the real data segments, the system automatically merges adjacent pages in the flush, so that multiple pages can be flushed in each write, improving efficiency.


For example, in the following environment we can estimate the page merging from the output of show status.


> show status like '%dbl%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Innodb_dblwr_pages_written | 23196544 |
| Innodb_dblwr_writes        | 4639373  |
+----------------------------+----------+


The ratio Innodb_dblwr_pages_written / Innodb_dblwr_writes gives a basic picture of this. In this example the ratio is about 5:1, which indicates that the frequency of data changes is quite low.
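
As a hedged illustration (assuming MySQL 5.7 or later, where these status counters are exposed in performance_schema.global_status), the ratio can be computed in one statement:

-- Pages flushed per doublewrite batch: higher values mean more pages
-- are merged into each doublewrite flush.
SELECT
    (SELECT variable_value FROM performance_schema.global_status
      WHERE variable_name = 'Innodb_dblwr_pages_written')
  / (SELECT variable_value FROM performance_schema.global_status
      WHERE variable_name = 'Innodb_dblwr_writes') AS pages_per_dblwr_write;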


Of course, Percona keeps improving doublewrite. Percona Server 5.7 introduces an improvement: you can see a new parameter, innodb_parallel_doublewrite_path.

 

|innodb_parallel_doublewrite_path | xb_doublewrite |


At the operating system level, there is also a corresponding 30MB file:


-rw-r----- 1 mysql mysql 31457280 Mar 28 17:54 xb_doublewrite


This is called parallel doublewrite, which implements parallel flushing. For a detailed description and benchmarks of this feature, see the following link:

https://www.percona.com/blog/2016/05/09/percona-server-5-7-parallel-doublewrite/?utm_source=tuicool&utm_medium=referral


It provides a lot of detailed test comparisons and analysis. Of course, MariaDB, Facebook and Aurora have their own implementations and considerations in this area. MariaDB adds a new parameter, innodb_use_atomic_writes, to control atomic writes: if atomic write support is detected at startup, the doublewrite buffer is turned off even if innodb_doublewrite is enabled.
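
As a quick sanity check on MariaDB (a sketch, assuming a MariaDB 10.x build that exposes the parameter):

-- If atomic writes are supported and enabled, the doublewrite buffer is skipped.
SHOW VARIABLES LIKE 'innodb_use_atomic_writes';
SHOW VARIABLES LIKE 'innodb_doublewrite';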


Facebook provides an option to write only the page numbers to the DBLWR before writing the pages; crash recovery then reads the page numbers recorded in the DBLWR and recovers those pages indirectly.


Aurora separates storage from the database server and does not turn on double write at all; those interested can dig into it further.


So far, this is enough explanation of double write at the MySQL level. But we surely still have questions, because the partial write problem is one of those critical issues that many database designs have to consider. In MySQL the page size is 16K and data is verified in that unit, while the operating system block is definitely smaller (4K, for example), so if the power fails, only part of a page may have been written. Oracle DBAs are usually quite relaxed about this. The MySQL-side argument is: because the page is damaged, the transaction information recorded in the page cannot be located, so redo alone cannot recover it directly.
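
You can confirm the page size on your own instance; the 4K figure on the OS/file system side is only a typical assumption and depends on the platform:

-- InnoDB verifies data in units of its page size, 16K by default.
SHOW VARIABLES LIKE 'innodb_page_size';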


Partial write is, as the name suggests, the problem of partially written blocks, and Oracle has smoothed it over for us. There are differences in design and differences in recovery technology, but in any case the problem does not disappear and has to be solved. So here I want to compare with Oracle what is handled well and what is not; this is a good habit and a good way to learn. To that end, we raise two questions.


How does Oracle do this?


To answer this question, we need to look at it from the following two aspects.


  • Is there partial write in Oracle?

  • How does Oracle handle partial write?


We have to put MySQL and Oracle side by side, like two toys, and examine them not only from the outside but also from the inside. Some people say that internals are useless and take a lot of time and energy to learn. That has to be viewed dialectically: once many things reach a certain level, we need to push beyond ourselves, and a deeper understanding never hurts; it is a gradual, subtle process. As Chairman Mao said, "integrate theory with practice, maintain close ties with the masses (in our case, DBAs and users), and practice criticism and self-criticism", all of which is worth learning from.


Check whether Oracle has partial write



Let's start with a very similar issue that many Oracle DBAs and MySQL DBAs struggle over.


In Oracle there is a backup method called hot backup: you can copy data files directly while the database is open to make a backup, wrapping the copy between "begin backup" and "end backup". During this process the files being copied may still be changing, which leads to inconsistencies known as split blocks, also called fractured blocks. This is explained in the official 11g documentation, and it is described incorrectly in the official 10g documentation.


fractured block



In simple terms, the SCN at the head of the block and the SCN at the tail are inconsistent. In user-managed backups, operating system tools (such as the cp command) back up data files while DBWR is updating them.


The operating system tool may read a block in a half-updated state, with the result that the upper half copied to the backup medium is new while the lower half still contains older data. Such a block is said to be fractured.


The ALTER TABLESPACE ... BEGIN BACKUP or ALTER DATABASE BEGIN BACKUP command is the solution to the fractured block problem. When a tablespace is in hot backup mode and a change is made to a data block, the database logs a copy of the entire block image before the change, so that the database can rebuild the block if media recovery finds it is fractured.
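
For reference, a minimal sketch of a user-managed hot backup (the tablespace name and file paths below are hypothetical):

ALTER TABLESPACE users BEGIN BACKUP;
-- copy the data file with an OS tool, e.g. cp /u01/oradata/users01.dbf /backup/
ALTER TABLESPACE users END BACKUP;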


In 10g it is described as follows; note the "every time" below (marked in red in the original), which is the erroneous statement in the documentation.


When the tablespace is in hot backup mode, every time a change is made to a data block, the database records a copy of the entire block image before the change so that the database can rebuild the block if media recovery discovers that the block is fractured.


Jonathan Lewis, the maestro, took it a step further and put it more explicitly.


A simple translation is:


After a checkpoint completes, the first change to a block (or the first change to any block while it is in the cache) causes the full block image to be written to redo, while subsequent changes to that block in the buffer do not repeat the full image.


The documentation problem exists in the 10g docs and was fixed in 11g. In practice, a split block can occur when the cp command is used for copying, because the disk write uses the file system block size as its minimum I/O unit, whereas RMAN reads and writes using the Oracle block size as the minimum I/O unit, so RMAN backups do not produce split blocks.


With that in mind, let's say a bit about blocks in Oracle. Blocks in Oracle can be classified into data blocks, redo log blocks, and control file blocks, and each can be queried as follows.


  1. Data file blocks: select file#, name, block_size from v$datafile;

  2. Redo log blocks: select lebsz from x$kccle;

  3. Control file blocks (controlfile blocks): select block_size from v$controlfile;


To extend the concept: 11g adds a new blocksize attribute for redo. The value of blocksize is fixed in the database source code and is OS specific; the default is 512 bytes, and it varies between operating systems.


To see the blocksize setting, you can query the base table x$kccle from Oracle's internal views:


SQL> select max(lebsz) from x$kccle;

MAX(LEBSZ)
----------
       512


As you can see, rebuilding data through redo is not a problem in itself, but there is another important concept involved: the checkpoint.


Oracle can confidently guarantee that if a commit has returned successfully, the data can be recovered even if the power fails a second later. But confidence alone is not enough; we need the theory of how recovery through redo actually works.


How does Oracle recover through redo


We assume that there was a partial write problem when redo was written.


In MySQL the argument is that because the page is damaged, the transaction information recorded in the page cannot be located, so redo cannot be applied directly to recover it.


What does Oracle do? If you look at the figure below, one file that is interesting is the control file. Oracle uses the control file as a checkpoint anchor for the data. The control file is often vividly described as the brain of the database, which shows its status: its function is relatively simple, but it is very important.



Oracle writes changes to the data files asynchronously. To keep the data safe, redo entries describing the data before and after the modification are created in the redo log buffer as soon as the change is made, and LGWR flushes the redo entries to the redo log files; this process is subdivided into background writes and synchronous writes.


Background writes have more trigger conditions: for example a 3-second interval; DBWR needing to write dirty blocks, in which case it posts LGWR to write the redo first and waits until that completes; the log buffer becoming 1/3 full; or the buffered redo reaching 1MB. For more details, see the official documentation.


The trigger condition for synchronous writes is simpler: when a user commits, LGWR is triggered to write. Therefore, if commits are too frequent, there will always be an obvious "log file sync" wait event.
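
A hedged way to observe this on the Oracle side is to look at the cumulative statistics for the wait event:

select event, total_waits, time_waited_micro
  from v$system_event
 where event = 'log file sync';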


What does this process have to do with CKPT? In short, Oracle constantly advances a recovery starting point so that data can be effectively protected and recovered after an unexpected instance crash; that is CKPT's mission. A starting point too close to the head of the log file means a large number of redo entries must be processed, which makes recovery slow; a starting point too close to the tail of the log file means only a small amount of dirty data remains unwritten, which forces DBWR to flush dirty pages very frequently. So Oracle has the concept of a checkpoint queue, an LRU list of buffer headers in which a data block appears only once no matter how many times it is modified, as Jonathan Lewis has explained.


By comparison, in MySQL innodb_lru_scan_depth controls how far down the LRU list the page cleaner scans to keep free pages available, and innodb_max_dirty_pages_pct controls the allowed proportion of dirty pages (the default is 75; Google has recommended 80).
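
A quick check of the two InnoDB parameters mentioned above (the values simply reflect your configuration):

SHOW VARIABLES WHERE Variable_name IN
    ('innodb_lru_scan_depth', 'innodb_max_dirty_pages_pct');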


To summarize: CKPT is the key. With the checkpoint queue and incremental checkpoints, recovery becomes more efficient and DBWR is spared frequent flushing. The checkpoint is maintained not only in the redo files and the data file headers but, crucially, in the control file. The cornerstone of data recovery here is the SCN, whose counterpart in MySQL is the LSN.


Therefore, at startup after a crash, the checkpoint information for the data files recorded in the control file reveals the abnormal shutdown, and crash recovery is started. Recovery takes the nearest checkpoint from the control file, applies redo to bring the database to its pre-crash state (roll forward), and then rolls back uncommitted transactions for consistency. This is why the design of the control file makes sense. That is a rough outline of the recovery process.
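
To make this starting point concrete, you can compare the checkpoint SCN recorded in the control file with the one in each data file header; after a clean shutdown they agree (a sketch using the standard dynamic views):

select checkpoint_change# from v$database;
select file#, checkpoint_change# from v$datafile_header;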


Question 1: Criticism and self-criticism



Okay, it's time to turn the tables. I'm sure many MySQL DBAs will have more questions when they read this, so let me be self-critical and raise two of them.


  1. How do you explain that the redo log is also written in 512-byte blocks?

  2. MySQL also has the checkpoint technology you mentioned.


That’s a perfectly correct understanding. Let me explain.


MySQL InnoDB also has a checkpoint LSN that grows as the log is written, and you can check the redo volume with show global status like 'Innodb_os_log_written'. So looking only at this, we cannot see a difference, even though the two databases subdivide things at different granularities.
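
To see the LSNs explicitly, the LOG section of the InnoDB status output reports "Log sequence number" and "Last checkpoint at" (output omitted here):

SHOW ENGINE INNODB STATUS\G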


MySQL InnoDB's checkpoint technology is quite rich and falls mainly into two categories: sharp checkpoint and fuzzy checkpoint.


A sharp checkpoint is a full checkpoint, the behavior at shutdown when innodb_fast_shutdown=1. Fuzzy checkpoints are much richer: they generally flush only part of the dirty pages and come in several forms, which makes them a bit more complex.


  • Master Thread Checkpoint

  • FLUSH_LRU_LIST Checkpoint

  • Async/Sync Flush Checkpoint

  • Dirty Page too much Checkpoint


But to get back to the essence of the problem: these are all checkpoints at the InnoDB level, so we come back to the claim we started with.


Because the page is corrupted, the transaction number in the page cannot be located, so redo cannot be applied directly to recover it.


Oracle, by contrast, has the control file level: the control file is mounted in the MOUNT state and data recovery is driven from there.


This time, we Oracle DBAs get to ask the MySQL DBAs a question.


Question 2: Why do binlogs and redo files coexist in MySQL


Binlog belongs to the MySQL Server layer. It records data changes and supports multiple storage engines; that is, whether the engine is MyISAM, InnoDB or something else, binlog records the changes, so it is often used for data recovery and for building slaves. In addition, because of the two-phase commit mechanism, crash recovery also uses the binlog.


Redo belongs to the InnoDB engine layer; it logs changes to physical pages and is used for crash recovery.


In general, MySQL introduced binlog at the Server layer for compatibility with other transactional engines, so that replication can be enabled for all engines. Since a transaction writes both binlog and redo, consistency between binlog and redo also has to be reconciled, primarily through two-phase commit.
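
The durability knobs that keep binlog and redo in step at commit are the classic "double 1" settings; a quick check:

-- sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1 give the strictest
-- guarantee that binlog and redo survive a crash together.
SHOW VARIABLES WHERE Variable_name IN
    ('log_bin', 'sync_binlog', 'innodb_flush_log_at_trx_commit');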


Oracle has only redo, which effectively combines the roles of binlog and redo, because Oracle is not a plugin-based database and does not support third-party storage engines; everything is unified at the architecture level.


So the answer to this question is: don’t hold a hammer in your hand and all you see is nails.


InnoDB is a pluggable storage engine, and transaction support is implemented at the storage engine level. If I may add one more point: foreign keys should arguably not be implemented by the storage engine, but this is a special case, because the Server layer does not support them, so InnoDB implements them itself. In short, the storage engine is table-oriented, not database-oriented; understanding this makes InnoDB's positioning much clearer.


But we should not look at the problem in isolation, only from the database or system level; we also need to consider the storage level and listen to the storage community's point of view.


The relationship between storage and double write


At the storage level, I will cite shared content from three community experts to illustrate.


Oracle Online Redo Log Optimization on PCIe flash cards


The previous generation of drives used 512-byte sectors, while the current generation uses 4K sectors, the sector being the minimum size of a single I/O. A 4K-sector drive can work in two modes: native mode and emulation mode.


Native mode, or 4K mode, means the physical and logical block sizes are the same: 4096 bytes. The disadvantage of native mode is that it requires operating system and software (such as the database) support. Oracle supports 4K I/O operations starting with 11gR2, and the Linux kernel began supporting 4K I/O after 2.6.32.
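
For example, from 11gR2 a redo log group can be created with a 4K block size on 4K-native storage (the group number, path and size below are hypothetical, and on disks that still report 512-byte logical sectors Oracle may reject this without additional settings):

ALTER DATABASE ADD LOGFILE GROUP 4
    ('/u01/oradata/ORCL/redo04.log') SIZE 512M BLOCKSIZE 4096;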


Emulation mode: the physical block is 4K but the logical block is 512 bytes. In this mode the underlying physical I/O is 4K, which causes partial I/O and 4K alignment problems.


In emulation mode, each logical I/O is 512 bytes, but the underlying storage I/O must be 4K: to read 512 bytes of data you have to read 4K, eight times as much, which is the partial I/O problem. For a write, the 4K physical block is read, the 512 bytes are updated in it, and the 4K block is written back. So in emulation mode the extra work increases latency and reduces performance.


For SSDs, double write brings two problems: a performance penalty and increased wear on the SSD, as mentioned in an earlier community post on flash storage features and database optimization.


They tested the flash memory card according to the following scenario.



In terms of safety, Metadata Journal + DW or Metadata Journal + Data Journal (the second and third rows in the figure above) can protect database data; that is, the data is not corrupted by an unexpected power failure and the database starts normally without data loss.


In the CPU-bound (compute-intensive) case, the performance degradation of the former combination (8%) is smaller than that of the latter combination (10%).


In the IO-bound (I/O-intensive) case, the performance degradation of the former combination (10%) is smaller than that of the latter combination (34%). However, the amount of data written in DW mode increases by 23%, which increases SSD wear. That is something we need to weigh carefully in practice.


At the file system level, there are also things to pay attention to. The following is excerpted from the community's earlier sharing on databases and storage systems.


Most file systems support 4K blocks (except VxFS and ZFS). VxFS supports blocks up to 64K, configurable as 512 bytes, 1K, 2K, 4K, 8K, 16K, 32K, or 64K.


ZFS is a special beast: its data blocks are dynamic, meaning the block size on ZFS matches however much data is written, traditionally anywhere from 512 bytes up to 128K.


So ZFS itself protects against partial writes, and in that case you can turn off double write.
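
Note that innodb_doublewrite is read-only at runtime in MySQL 5.6/5.7, so turning it off means setting innodb_doublewrite = 0 under [mysqld] in my.cnf and restarting; at runtime you can only verify the current setting:

SHOW VARIABLES LIKE 'innodb_doublewrite';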


Summary


It is funny to think about MySQL and Oracle sometimes: one open source, one commercial, each popular in its own camp, seemingly at odds, yet fate keeps bringing them together. And when we learn to question, to think, and to experiment, we are bound to gain something.


References:

  • https://yq.aliyun.com/articles/50627

  • http://blog.itpub.net/22664653/viewspace-1140915/

  • http://www.mysqlperformanceblog.com/2012/10/08/measuring-the-amount-of-writes-in-innodb-redo-logs/