This article describes how our MySQL high availability architecture has evolved in recent years, along with some of the innovations we built on top of open source technology. It also compares our approach with other solutions in the industry to show where high availability stands today, and closes with our future plans and outlook.

MMM

Before 2015, Meituan-Dianping (the Dianping side) had long relied on MMM (Master-Master Replication Manager for MySQL) for database high availability. We accumulated a great deal of experience with it, and also stepped into plenty of pitfalls. MMM played an important role during the rapid growth of the company's database infrastructure.

MMM’s architecture is as follows.

As shown above, the MySQL cluster exposes one write VIP (Virtual IP address) and N (N >= 1) read VIPs for external services. An Agent (MMM-Agent) is deployed on each MySQL node. The MMM-Agent communicates with the MMM-Manager and periodically reports the liveness status of its MySQL node (the heartbeat). When the MMM-Manager fails to receive heartbeat messages from an MMM-Agent several times in a row, it performs a switchover.
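A minimal sketch of that detection loop, assuming a hypothetical in-memory map of last-heartbeat timestamps and a failover callback (neither is part of MMM itself):

```python
import time

HEARTBEAT_INTERVAL = 2   # seconds between expected agent heartbeats
MAX_MISSED = 3           # consecutive misses before a node is declared dead

# Last heartbeat timestamp per node, updated whenever an agent reports in.
last_heartbeat = {"master1": time.time(), "slave1": time.time()}

def on_heartbeat(node):
    """Called when an MMM-Agent heartbeat message arrives."""
    last_heartbeat[node] = time.time()

def check_nodes(failover):
    """Periodic manager check: declare a node dead after MAX_MISSED
    heartbeat intervals without a report, then trigger the switchover."""
    now = time.time()
    for node, ts in last_heartbeat.items():
        if now - ts > HEARTBEAT_INTERVAL * MAX_MISSED:
            failover(node)

if __name__ == "__main__":
    check_nodes(lambda node: print(f"{node} missed heartbeats, switching over"))
```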

MMM-Manager handles failures in two cases.

The failure occurs on a slave node

MMM-Manager attempts to remove the read VIP from the failed slave node and migrate it to another surviving node, thus keeping the slave side highly available.

The failure occurs on the primary node. MMM-Manager then performs the following steps (a sketch follows the list):

If the node has not fully gone down but is only timing out on responses, MMM-Manager first tries to execute flush tables with read lock on the Dead Master to block further writes.

A candidate primary node is then selected from the slave nodes to become the new primary, and data completion (catching up on the missing changes) is performed on it.

After the data has been completed, the write VIP is removed from the Dead Master and attached to the new Master node.

The other surviving nodes also have their data completed and are re-pointed to the new master node.
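The sequence above can be summarized in a short Python sketch. This is illustrative only; the helper functions are hypothetical stand-ins for what MMM actually does at each step:

```python
# Hypothetical stand-ins for MMM's real actions on each step.
def try_read_lock(node):          print(f"FLUSH TABLES WITH READ LOCK on {node}")
def pick_candidate(slaves):       return slaves[0]   # e.g. the most up-to-date slave
def complete_data(node, sources): print(f"apply missing relay logs on {node}")
def move_write_vip(src, dst):     print(f"move write VIP {src} -> {dst}")
def repoint_slave(slave, master): print(f"CHANGE MASTER on {slave} to {master}")

def handle_master_failure(dead_master, slaves):
    try_read_lock(dead_master)               # 1. block writes if still reachable
    new_master = pick_candidate(slaves)      # 2. choose the candidate master
    complete_data(new_master, slaves)        # 3. catch the candidate up
    move_write_vip(dead_master, new_master)  # 4. drift the write VIP
    for s in slaves:                         # 5. complete and re-point the rest
        if s is not new_master:
            complete_data(s, [new_master])
            repoint_slave(s, new_master)

handle_master_failure("master1", ["master2", "slave1"])
```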

After the primary database fails, the cluster status changes as follows:

MMM-Manager detects that Master1 is faulty; after data completion, the write VIP drifts to Master2 and writes continue on the new node.

However, the MMM architecture has the following problems:

There are too many VIPs to manage easily (we once had a cluster with 1 master and 6 slaves, for a total of 7 VIPs). In some failure scenarios most of the VIPs in a cluster are lost at the same time, and it becomes hard to tell which VIP belonged to which node.

MMM-Agent is overly sensitive, which easily leads to losing a VIP. At the same time, because MMM-Agent itself has no high availability, once it hangs the MMM-Manager misjudges the situation and treats a healthy MySQL node as failed.

MMM-Manager is a single point of failure; once it goes down for any reason, the whole cluster loses its high availability.

VIPs rely on the ARP protocol, so high availability across network segments or data centers cannot be achieved, which limits the level of protection.

In addition, MMM is an aging high availability product developed by Google's technical team. It is not widely used in the industry, its community is not active, and Google stopped maintaining the MMM code branch long ago. We found many bugs while using it, fixed some of them ourselves, and submitted the fixes to the open source community; if you are interested, you can refer to this.

MHA

In view of this, starting in 2015 Meituan-Dianping improved its MySQL high availability architecture and moved to MHA, which largely solved the various problems we had encountered with MMM.

MHA (MySQL Master High Availability) is high availability software for MySQL developed by Facebook engineer Yoshinori Matsunobu. As the name suggests, MHA is only responsible for the high availability of the MySQL master. When the master fails, MHA selects the candidate master whose data is closest to that of the original master (if there is only one slave, that slave is the candidate) as the new master, and completes on it the binlog events it is missing relative to the Dead Master. After the data has been completed, the write VIP drifts to the new master.

The overall architecture of MHA is as follows (only one master and one slave are shown for simplicity):

We have also made some optimizations to MHA to avoid split-brain problems.

For example, jitter on the upstream switch of a DB server can make the master temporarily unreachable; the management node judges this as a failure, triggers an MHA switchover, and floats the VIP to the new master. When the switch recovers, the old master is reachable again, but its VIP was never removed, so two machines hold the same VIP at the same time, which is a split brain. To address this, we made the MHA Manager also probe other physical machines behind the same switch and compare the extra information to decide whether the problem is a network fault or a single-machine fault.
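A rough sketch of that decision, assuming a hypothetical list of peer hosts behind the same switch and a plain TCP reachability probe (not MHA's actual code):

```python
import socket

def reachable(host, port=22, timeout=2):
    """Probe a host with a plain TCP connection (SSH port as an example)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def classify_failure(master, peers_behind_same_switch):
    """If the master's neighbours are also unreachable, suspect the network
    rather than the machine, and hold off the switchover."""
    if reachable(master):
        return "master alive, no switchover"
    peer_alive = any(reachable(p) for p in peers_behind_same_switch)
    return ("single-machine failure, switch over" if peer_alive
            else "likely network/switch fault, do not switch")

print(classify_failure("10.0.0.10", ["10.0.0.11", "10.0.0.12"]))
```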

MHA+Zebra (DAL)

Zebra is a Java database access middleware developed by Meituan-Dianping's infrastructure team. It is a dynamic data source used internally at Meituan-Dianping, built on top of C3P0, and provides powerful features such as read/write splitting, sharding across databases and tables, and SQL flow control. Together with MHA, it forms an important part of MySQL high availability. The overall architecture of MHA+Zebra is as follows:

Take a master failure as an example. The handling logic works in either of the following two ways:

After the MHA switchover is complete, MHA proactively sends a message to the Zebra Monitor, which updates the ZooKeeper configuration and marks the read traffic configured on the failed master as offline.

The Zebra Monitor checks the health status of nodes in the cluster every 10s to 40s. Once a node is found to be faulty, the Zebra Monitor refreshes the ZooKeeper configuration and marks the node as offline.

Once the node change is complete, clients watching for the change immediately re-establish connections using the new configuration, while the old connections are gradually closed. The entire cluster failover process is shown below (only the Zebra Monitor's active detection path is illustrated; the MHA-notification path from the first approach is left to your imagination ^_^).
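A minimal sketch of the client side of that flow, using the kazoo library and a hypothetical ZooKeeper path for the cluster config (the real Zebra client is Java and differs in detail):

```python
from kazoo.client import KazooClient

CONFIG_PATH = "/zebra/clusters/demo"   # hypothetical path for the cluster config

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def rebuild_pools(config_bytes):
    """Placeholder: rebuild connection pools from the new read/write config
    and gradually drain connections that point at offline nodes."""
    print("reloading data sources:", config_bytes)

@zk.DataWatch(CONFIG_PATH)
def on_config_change(data, stat, event=None):
    # Fired on the initial read and on every update made by the Zebra Monitor.
    if data is not None:
        rebuild_pools(data)
```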

Because the switching process still relies on VIP drift, it can only be carried out in the same network segment or layer 2 switch, and high availability across network segments or across computer rooms cannot be achieved. To solve this problem, we carried out secondary development on MHA, removed the operation of adding VIP from MHA, and informed Zebra Monitor to readjust the read and Write information of the node after the switchover (adjust Write to the real IP of new master and remove the read traffic of Dead Master). The whole switch is completely de-VIPS, and can be switched across network segments or even across computer rooms, completely solving the problem that high availability is limited to the same network segment before. The above switching process becomes the following figure.
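A sketch of what such a post-switchover notification could look like, assuming a hypothetical HTTP endpoint on the Zebra Monitor (the actual interface between our modified MHA and Zebra is not public):

```python
import json
import urllib.request

ZEBRA_MONITOR = "http://zebra-monitor.example.com/api/switchover"  # hypothetical endpoint

def notify_zebra(cluster, new_master_ip, dead_master_ip):
    """Tell the Zebra Monitor to point writes at the new master's real IP
    and drop read traffic on the dead master (no VIP involved)."""
    payload = json.dumps({
        "cluster": cluster,
        "write": new_master_ip,
        "offline": dead_master_ip,
    }).encode()
    req = urllib.request.Request(ZEBRA_MONITOR, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status == 200

# Example: called from MHA's post-failover hook
# notify_zebra("demo-cluster", "10.4.1.20", "10.4.1.10")
```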

However, the MHA management node in this approach is still a single point, which remains a risk in the event of a network failure or machine downtime. In addition, because master and slave use asynchronous binlog-based replication, if the master crashes or becomes unreachable, data may be lost during the MHA switchover.

Moreover, if the master-slave replication lag is large, the data-completion step adds extra time to the switchover.

Proxy

Besides the Zebra middleware, Meituan-Dianping also has a proxy-based middleware that works with MHA. After an MHA switchover, the Proxy is notified to adjust read and write traffic. The Proxy is more flexible than Zebra and can cover non-Java application scenarios; the drawback is the extra hop on the access path, which increases response time and failure rate to some extent. Interested readers can find detailed documentation on GitHub.

Future Architecture Ideas

There are still two problems with the MHA architecture mentioned above:

Management node single point.

Data loss in MySQL asynchronous replication.

To address the data-loss problem, we use semi-synchronous replication for some core businesses, which prevents data loss in more than 95% of scenarios (there are still extreme cases where strong data consistency cannot be guaranteed). For the single-point problem of the MHA Manager, high availability can instead be provided by distributed Agents that elect a new master through an election protocol when a node fails.
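As a rough illustration, semi-synchronous replication is enabled through MySQL's semisync plugins. The sketch below issues the standard statements over a connection, assuming the PyMySQL driver and placeholder hosts/credentials (variable names are the pre-8.0.26 ones):

```python
import pymysql  # assumption: PyMySQL is installed; credentials are placeholders

MASTER_SQL = [
    "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so'",
    "SET GLOBAL rpl_semi_sync_master_enabled = 1",
    # How long the master waits for a slave ACK before falling back to async.
    "SET GLOBAL rpl_semi_sync_master_timeout = 1000",
]
SLAVE_SQL = [
    "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so'",
    "SET GLOBAL rpl_semi_sync_slave_enabled = 1",
]

def run(host, statements):
    conn = pymysql.connect(host=host, user="admin", password="***")
    try:
        with conn.cursor() as cur:
            for sql in statements:
                cur.execute(sql)
    finally:
        conn.close()

# run("master.example.com", MASTER_SQL)
# run("slave1.example.com", SLAVE_SQL)
```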

We looked at some of the industry’s leading practices to address these issues, which are briefly described below.

Data loss in master/slave synchronization

To avoid data loss caused by asynchronous master/slave replication, one method is to set up a Binlog Server. The server pretends to be a slave and receives binlog events, and the master considers a write successful only after receiving an ACK from the Binlog Server. The Binlog Server can be deployed on the nearest physical node so that data reaches it quickly. When a failure occurs, you only need to pull the missing data from the Binlog Server to avoid data loss.
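One common way to run the binlog-pulling side of such a server is the stock mysqlbinlog client in remote streaming mode; a sketch wrapping it from Python, with the host, credentials, and starting file as placeholders. Note this only shows the streaming side; the ACK behaviour described above comes from registering the Binlog Server as a semi-sync replica.

```python
import subprocess

# Stream binlogs from the master to local raw files and keep the connection
# open (--stop-never), acting as a lightweight binlog server.
cmd = [
    "mysqlbinlog",
    "--read-from-remote-server",
    "--host=master.example.com",   # placeholder master address
    "--user=repl", "--password=***",
    "--raw",                       # store events as raw binlog files
    "--stop-never",                # keep streaming new events
    "mysql-bin.000001",            # placeholder starting binlog file
]
subprocess.run(cmd, check=True)
```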

High availability with distributed Agents

To solve the single-point problem of the MHA management node, one method is to deploy an Agent on every node of the MySQL cluster. When a failure occurs, the Agents vote in an election and a suitable slave is elected as the new master, so the switchover no longer depends on a single Manager and MHA's single point is removed. The entire architecture is shown below.
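A toy majority vote among such agents might look like the following. The vote values and quorum are assumptions for illustration only, not a production election protocol (real systems typically use Raft or Paxos):

```python
from collections import Counter

# Each surviving agent nominates the slave it considers most up to date,
# e.g. based on the replication position it has observed.
votes = {
    "agent-on-slave1": "slave1",
    "agent-on-slave2": "slave1",
    "agent-on-slave3": "slave3",
}

def elect_new_master(votes, quorum):
    """Return the winner only if it gathers a majority (quorum) of votes,
    so a partitioned minority cannot promote its own master."""
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= quorum else None

print(elect_new_master(votes, quorum=2))   # -> "slave1"
```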

MGR combined with middleware for high availability

The methods above solve the earlier problems to some extent, but the Agent and the Binlog Server are newly introduced risks themselves, and the Binlog Server also adds overhead to response time. Is there a way to drop the Binlog Server and the Agent without losing data? There is.

In recent years the MySQL community has been enthusiastic about the distributed consensus protocols Raft and Paxos, and has released MySQL Group Replication (MGR), which is based on Paxos. MGR pushes consistency and the switchover process down into the database itself and hides the switching details from the upper layers. The architecture is as follows (taking MGR's single-primary mode as an example).

When the database fails, MySQL performs the switchover internally. After the switchover, the new topology is pushed to the Zebra Monitor, which adjusts read/write traffic accordingly. However, this architecture has the same issue as the Binlog Server: every write on the primary must be acknowledged (ACK) by a majority of nodes before it is considered successful, which adds some response-time overhead. In addition, each MGR cluster needs an odd number of nodes (greater than one), so at least three machines are required instead of just one master and one slave, which wastes some resources. Even so, there is no doubt that the arrival of MGR is another major innovation in the MySQL database.
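For example, a monitor can learn the current topology by polling MGR's membership table. A sketch with PyMySQL, where the host and credentials are placeholders (MEMBER_ROLE requires MySQL 8.0+):

```python
import pymysql  # assumption: PyMySQL is installed; credentials are placeholders

MEMBERS_SQL = """
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members
"""

def current_primary(host):
    """Ask any ONLINE member for the group membership and return the primary."""
    conn = pymysql.connect(host=host, user="monitor", password="***")
    try:
        with conn.cursor() as cur:
            cur.execute(MEMBERS_SQL)
            for member_host, port, state, role in cur.fetchall():
                if state == "ONLINE" and role == "PRIMARY":
                    return f"{member_host}:{port}"
    finally:
        conn.close()
    return None

# The Zebra Monitor could poll this and push the result into ZooKeeper:
# print(current_primary("mgr-node1.example.com"))
```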

Conclusion

This article has covered the evolution of Meituan-Dianping's MySQL high availability architecture from MMM to MHA+Zebra and MHA+Proxy, and introduced some high availability practices from the rest of the industry. Databases have been developing rapidly in recent years, and there is no perfect design for database high availability, only continuous breakthroughs and innovation. We will keep exploring better designs and more complete solutions along this road.