Driven by the characteristics of Alibaba's transactional business and demands that are rare in the industry, such as Double 11, we have added many practical features and performance patches on top of official MySQL. When operating MySQL, data consistency is a topic that cannot be avoided. Starting from Alibaba's post-IOE era, this article briefly introduces our efforts and practices around MySQL data consistency over the past few years, as well as our current solutions.

1. MySQL single-server data consistency

MySQL is designed as a pluggable database system, split into a Server layer and a Storage Engine layer. At the Server layer, MySQL records database operations in the Binlog in the form of events. Its core functions are replication and backup. On top of that, we have built a rich MySQL ecosystem around the Binlog to meet diverse business scenarios, such as DTS, unitized deployment, and real-time synchronization between heterogeneous systems; the Binlog has long been an indispensable module in the MySQL ecosystem. At the Storage Engine layer, InnoDB, as a relatively general-purpose storage engine, strikes a good balance between high availability and high performance, and has become the first choice when using MySQL (since official MySQL 5.5.5, InnoDB has been the default storage engine). Like most relational databases, InnoDB uses a write-ahead log (WAL), the InnoDB Redo Log, to record physical changes to data files: the redo log is always written first, and is flushed to disk before the data files are persisted. Whether the Binlog and the InnoDB Redo Log reach disk directly determines how much data an instance can recover after an abnormal outage. InnoDB provides parameters to control the logging mode and flush strategy at transaction commit. For example:
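To make the event-based format concrete, the statements below list the first few events of a binlog file on a server with log_bin enabled; the file name here is an assumption, and in practice comes from the first statement's output:

```sql
-- Show the current binlog file and write position.
SHOW MASTER STATUS;

-- List the first events of a binlog file (file name is an assumption;
-- use the one reported by SHOW MASTER STATUS).
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 5;
```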

innodb_flush_method: controls how InnoDB data files and log files are opened and flushed. The recommended values are fsync and O_DIRECT.

innodb_flush_log_at_trx_commit: controls how the redo log is written and flushed at each transaction commit. The value can be 0, 1, or 2. When innodb_flush_log_at_trx_commit=1, every transaction commit writes the log to the InnoDB log buffer, writes the log buffer to the InnoDB log file, and flushes it to disk.

sync_binlog: controls how often the Binlog is flushed to disk when a transaction commits. The value can be 0 or n (n is a positive integer). Different values trade off MySQL performance against data recovery after an abnormal crash. When sync_binlog=1, MySQL forces data from the binlog cache to disk at every transaction commit.

innodb_doublewrite: controls whether the doublewrite feature is enabled. The value is ON or OFF. InnoDB's default page size is 16K, while a single atomic disk write is usually 4K or much smaller than InnoDB's page size, so if a system outage or OS crash occurs after only part of a page has been written, a partial page write results and crash recovery may fail. InnoDB provides the doublewrite technique to avoid partial page writes.

innodb_support_xa: controls whether InnoDB's two-phase transaction commit is enabled. By default innodb_support_xa=true, so XA two-phase commit is supported.
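Putting the recommendations above together, here is a minimal durability-focused sketch. Note that innodb_flush_method and innodb_doublewrite are not dynamic and must be set in my.cnf before startup:

```sql
-- In my.cnf (static settings, applied at startup):
--   [mysqld]
--   innodb_flush_method = O_DIRECT
--   innodb_doublewrite  = 1
SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- flush the redo log on every commit
SET GLOBAL sync_binlog = 1;                     -- fsync the binlog on every commit
```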


Different settings of the preceding parameters affect how much data can be recovered after a MySQL crash, as well as write performance. In actual use, you need to choose appropriate values based on the characteristics and requirements of your service. For example:

MySQL single instance, Binlog disabled: when innodb_flush_log_at_trx_commit=1 and innodb_doublewrite=ON, no data is lost after a MySQL crash, an OS crash, or a host restart.

MySQL single instance, Binlog enabled: innodb_support_xa=ON by default; with the Binlog enabled, transaction commit becomes a two-phase commit. When innodb_flush_log_at_trx_commit=1, sync_binlog=1, innodb_doublewrite=ON, and innodb_support_xa=ON, it is likewise guaranteed that no data is lost after a MySQL crash, an OS crash, or a host power failure and restart.
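To confirm that an instance actually runs with the crash-safe combination described above, the settings can be checked at runtime (a minimal sketch for MySQL 5.6/5.7):

```sql
-- Check the crash-safe combination for the "Binlog enabled" scenario.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('log_bin', 'innodb_flush_log_at_trx_commit', 'sync_binlog',
   'innodb_doublewrite', 'innodb_support_xa');
-- Expected: log_bin=ON, innodb_flush_log_at_trx_commit=1,
-- sync_binlog=1, innodb_doublewrite=ON, innodb_support_xa=ON.
```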


However, if the host cannot be restarted because of a hardware failure, a single MySQL instance may still lose data due to this single point of failure. A single MySQL instance is therefore not suitable for production environments.

2. MySQL cluster data consistency

A MySQL cluster usually refers to MySQL's master-slave replication architecture, which is commonly used to eliminate MySQL's single point of failure. It synchronizes changes from the master to the slaves through logical replication, so strict consistency between master and slave cannot be guaranteed. MySQL supports asynchronous replication, semi-synchronous replication, and fully synchronous replication.

Asynchronous replication

After executing a transaction committed by the client, the master returns the result to the client immediately, without caring whether the slave has received and processed it. This creates a problem: if the master crashes, transactions already committed on the master may not have reached the slave, and if the slave is forcibly promoted to master at this point, the data on the new master may be inconsistent. Early MySQL versions supported only asynchronous replication. A basic setup is sketched below.
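For reference, this is roughly how a replica is attached with classic file/position replication on MySQL 5.6/5.7; the host, credentials, file name, and position are all placeholders:

```sql
-- On the slave: point replication at the master (all values are placeholders).
CHANGE MASTER TO
  MASTER_HOST     = '10.0.0.1',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'repl_password',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;

-- Seconds_Behind_Master in the output reflects the asynchronous replication lag.
SHOW SLAVE STATUS\G
```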

Semi-synchronous replication

MySQL introduced semi-synchronous replication in 5.5. Before replying to the client for a committed transaction, the master must ensure that at least one slave has received the transaction and written it to its relay log. Semi-synchronous replication uses the rpl_semi_sync_master_wait_point parameter to control at which step of the commit process the master waits for the slave's ACK before returning the commit status to the client. There are two options: AFTER_SYNC and AFTER_COMMIT.
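A minimal sketch of enabling semi-synchronous replication with the official plugins (the wait_point parameter exists from MySQL 5.7.2 onward):

```sql
-- On the master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- AFTER_SYNC (loss-less) or AFTER_COMMIT; see the two sections below.
SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';

-- On each slave:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
```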

Configured as WAIT_AFTER_COMMIT



When rpl_semi_sync_master_wait_point is WAIT_AFTER_COMMIT, the commitTrx call happens after the engine-layer commit, as shown in the figure above. That is, while the master waits for the slave's ACK, the transaction has already been committed even though the ACK has not yet been returned to the current client, so other clients can read the committed transaction. If the slave has not yet received the events of this transaction and the master crashes, a switchover to the slave occurs; the transaction that was previously readable then disappears and the data becomes inconsistent, as shown in the figure below. (The figure is taken from Loss-less Semi-Synchronous Replication on MySQL 5.7.2.)



Moreover, if the master never comes back up, transactions that were actually committed successfully on the master can never be found on the slave; that is, data is lost. (PS: as early as 2011, Alibaba's database team innovatively solved this problem by waiting for the slave's ACK before the engine-layer commit.)

Configured as WAIT_AFTER_SYNC

MySQL introduced loss-less semi-synchronous replication in 5.7.2. After binlog sync is called, the engine layer waits for the slave's ACK before committing, so the transaction is committed only after it is confirmed that the slave has received the transaction events. The following figure is taken from Loss-less Semi-Synchronous Replication on MySQL 5.7.2:



AFTER_SYNC resolves the data inconsistency caused by AFTER_COMMIT, because the master has not yet committed the transaction while waiting. But a problem remains: if the master aborts after the binlog has been flushed and shipped to the slave, but before the engine commit, the transaction was clearly not committed on the master (it will be rolled back when the master recovers), yet the slave has already received and successfully applied these binlog events. The slave then carries extra data, which can cause "data inconsistency". In addition, in MySQL's semi-synchronous replication architecture, while the master is waiting for the standby's ACK, a timeout degrades replication to asynchronous, which may also cause "data inconsistency".
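The degradation behavior is governed by a timeout and can be observed through a status variable (a small sketch):

```sql
-- Wait at most 1000 ms for a slave ACK before degrading to asynchronous replication.
SET GLOBAL rpl_semi_sync_master_timeout = 1000;

-- ON while semi-sync is in effect; OFF once replication has degraded to asynchronous.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';
```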

3. Data consistency solutions for the MySQL master/standby architecture

The following briefly introduces Alibaba's early thinking and practice on MySQL data consistency.

1. “Data consistency” in the Unitized architecture

Background: limited by rack space, a single data center or region always hits capacity bottlenecks that constrain business growth; combined with the need for cross-region disaster recovery, Alibaba addressed these problems early on with a unitized architecture.



As the figure above shows, data is synchronized in real time between the center and each unit through DTS. To ensure data consistency between the center and the units, we built data verification and correction platforms early on. They mainly comprise TCP (Terminal Compare Platform), a full-data verification and correction platform supporting verification at the table, database, instance, and cluster levels, and AMG (Alibaba Magic Glass), a real-time incremental data verification and correction platform. TCP and AMG have long been core components of Alibaba's database ecosystem and are widely used to ensure data consistency in many scenarios, such as master-slave replication, unit synchronization, logical migration, database splitting, and character set upgrades.
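TCP and AMG are internal platforms, but the basic idea of full verification can be illustrated with plain SQL (this is only an illustration, not the actual TCP/AMG implementation; the table and column names are hypothetical):

```sql
-- Compare a whole table between source and target in one shot.
CHECKSUM TABLE orders EXTENDED;

-- Finer-grained: hash rows in primary-key ranges so a difference can be
-- localized to a chunk and corrected.
SET SESSION group_concat_max_len = 1024 * 1024;
SELECT MD5(GROUP_CONCAT(MD5(CONCAT_WS('#', id, buyer_id, status)) ORDER BY id))
         AS chunk_hash
FROM orders
WHERE id BETWEEN 1 AND 10000;
```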

2. ADHA rollback and replay

ADHA (Alibaba Database High Availability) is Alibaba Group's database high-availability system. ADHA's rollback-and-replay function helps preserve data quality as much as possible during switchover: transactions that were committed on the old master but never reached the old standby are rolled back on the old master, and the rolled-back data is then replayed onto the new master.





The purpose of ADHA rollback and replay is to keep data as consistent as possible during an HA switchover.
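ADHA itself is internal, but on a GTID-based replication setup the set of transactions that needs rolling back and replaying can be expressed as a GTID set difference (an illustrative sketch only, not how ADHA is implemented; the standby's executed set below is a placeholder):

```sql
-- On the old master: transactions committed locally but absent from the
-- old standby's executed set (the set string is a placeholder).
SELECT GTID_SUBTRACT(
         @@GLOBAL.gtid_executed,
         '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-12345'
       ) AS not_replicated_to_standby;
```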

3. Guarantee measures for master-slave consistency

Automatic handling of replication conflicts: MySQL 5.5/5.6/5.7 provides the slave_exec_mode parameter to handle master-slave replication conflicts and errors. The default value, STRICT, suits all modes but resolves no conflicts; the value IDEMPOTENT ignores duplicate-key and no-key-found errors, but is not suitable for resolving the master/slave inconsistencies above. Starting from 5.6, we added a value SMART to slave_exec_mode, which automatically repairs HA_ERR_KEY_NOT_FOUND, HA_ERR_FOUND_DUPP_KEY, and HA_ERR_END_OF_FILE errors caused by primary-key and unique-key conflicts.
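The official values can be set as follows; SMART is the Alibaba extension described above and, under that assumption, is only available on builds carrying the patch:

```sql
-- Official values: STRICT (default) and IDEMPOTENT.
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';

-- On an AliSQL-based build that carries the patch (assumption):
-- SET GLOBAL slave_exec_mode = 'SMART';
```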



With SMART mode enabled on the slave, replication interruption errors can be repaired, but strict master-slave consistency still cannot be guaranteed. Therefore, after a replication problem is repaired in SMART mode, you should run data verification (TCP full verification plus AMG incremental verification) on the master and slave as soon as possible to identify data differences.

4. Maximum protection mode (Max Protection)

To guarantee strong consistency between master and slave, we added a maximum protection mode to MySQL, MP mode for short (a design borrowed from Oracle Database's Maximum Protection mode). It is controlled by the maximum_protection parameter, whose value can be ON or OFF. With semi-synchronization configured, if the master-standby connection is broken, the master stops serving: all current connections are killed, and new connection requests from ordinary accounts are rejected. If a transaction is waiting at the binlog-synchronization step, its connection is not killed; the transaction waits out a timeout and then completes (engine commit), after which a network error is returned to the client. That is, the transaction is committed, and ADHA must intervene to roll it back. The MySQL MP mechanism therefore has to work together with ADHA.

The maximum protection logic was introduced for business scenarios that demand high data consistency, such as financial services; it also poses a greater challenge to MySQL's high-availability solutions.

The above are some of our early countermeasures against the "data consistency" problem in the MySQL master-standby era. They aim to preserve "data consistency" as much as possible, but they do not solve the problem completely. We believe, however, that advances in technology can bring easier operation and a better user experience. The NewSQL systems represented by Google Spanner and Amazon Aurora offer a different approach to database "data consistency": building on a consensus protocol. Based on a consensus protocol, we built a high-performance, strongly consistent MySQL database: RDS Three-Node Enterprise Edition. The following chapters will introduce consensus protocols and RDS in detail; stay tuned.


Author: Jixiang_ZY