In the past few months I have often seen discussions about whether MySQL should run in Docker: some insist it is wrong, others insist it is right, and both sides have their reasons. Personally, I don't see much point in the debate, because only practice can tell which side is right.

Tongcheng Travel therefore began putting MySQL on Docker quite early. Today, more than 1,000 MySQL instances run safely and stably on our Docker platform, and our database operation and maintenance capability has improved substantially (DBAs no longer need to worry about the proverbial "drop the database and run").

Of course, this does not settle the debate in favor of either side. We are still a fledgling with plenty left to learn, so rather than argue, we would like to share our MySQL-on-Docker practice with you.

Tongcheng Travel's early databases were based on MSSQL, a product whose strength is its graphical tooling but whose batch and automated management is difficult, leaving a great deal of manual work. It was later gradually replaced by MySQL, which we managed in the traditional operation and maintenance way, so most tasks still required manual operation.

MSSQL did have its advantages: its single-instance performance is better, and in those resource-poor days we often ran multiple databases on one highly available instance. The number of physical machines and instances was therefore relatively small and controllable, and could be handled entirely by manual operation and maintenance.

However, MSSQL also has many shortcomings, most notably the difficulty of horizontal splitting, which made the database the biggest bottleneck in the system. We set out to remove that bottleneck with MySQL plus a middleware layer (a lot of thought went into that middleware as well, which we can share another time) to do horizontal sharding.

Horizontal splitting has a side effect: the number of database instances grows dramatically. For example, a 1,024-shard table is usually spread over 32 nodes; with one master and one slave per node (most clusters actually run one master and two slaves) that is at least 64 instances, and even more once emergency-expansion and backup nodes are added (middleware developers would prefer 1,024 shards to mean 1,024 instances).

Adding slave libraries to a 32-node sharded cluster once took two DBAs four hours. Running every instance on its own physical machine is clearly not an option either: the cost would be enormous and the hardware would be badly under-utilized. Because MySQL has no strong single-instance performance advantage, sharded deployments are the norm, and in most cases a single database cannot make use of an entire physical machine. Even for the few that could, keeping multi-node backups, environment consistency and operational procedures uniform by hand would bury the DBAs in busy, error-prone and ultimately meaningless work.

So we needed to run multiple MySQL instances per physical machine, and the main question became how to implement resource isolation and limits. There are many possible approaches; KVM, Docker and cgroups are the mainstream ones.

KVM is too heavyweight for isolating a database, and its performance penalty is too large for production. MySQL is just a process with heavy I/O demands, so KVM does not meet the requirement (even though its I/O can be improved with tuning).

Cgroups are lightweight; their isolation is weaker, but it is sufficient for our multi-instance MySQL scenario (Docker's resource limits are themselves built on cgroups). However, we also want to run extra management processes (monitoring and so on) alongside each MySQL instance, which becomes complicated with raw cgroups, and we want instance management decoupled from the physical machine, so plain cgroups were also ruled out.

Docker, on the other hand, fits well. It wraps the raw cgroups work for us, offers APIs that keep development cost low, and lets us automate deployment through images so that environment consistency is easy to guarantee. We therefore chose Docker as the resource isolation layer of the cloud platform (a great deal of performance, stability and other adaptation work was done along the way, which I will not go into here).
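
To give a feel for the kind of resource limits involved, here is a minimal sketch using the Docker SDK for Python. The image name, CPU/memory values and paths are placeholders for illustration, not our actual configuration.

```python
import docker

client = docker.from_env()

# Start a MySQL container with an explicit CPU and memory budget.
# Image name, limits and paths below are illustrative placeholders.
container = client.containers.run(
    "mysql:5.7",
    name="mysql-shard-01",
    detach=True,
    mem_limit="16g",            # hard memory cap; memory is never oversold
    memswap_limit="16g",        # no extra swap beyond the memory cap
    cpu_period=100000,
    cpu_quota=400000,           # roughly 4 CPU cores via the cgroups CPU quota
    volumes={"/data/mysql-shard-01": {"bind": "/var/lib/mysql", "mode": "rw"}},
    environment={"MYSQL_ROOT_PASSWORD": "change-me"},
)
print(container.short_id)
```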

The following two diagrams illustrate how significant this change is:

To deserve the name "cloud", the platform must at minimum provide resource calculation and resource scheduling, with allocation requiring no manual involvement. For users, resources should simply be available on demand, with high availability, automatic backup, alarm monitoring and slow-log analysis built in, and with no need to care about anything behind them. Beyond that, everyday DBA operation and maintenance tasks should be exposed as services. Let me walk through how our platform implements this, step by step.

I have always believed that judging a database means judging more than the database itself: you must also consider whether its surrounding ecosystem is sound, including the high availability scheme, backup scheme, day-to-day maintenance effort, talent pool and so on. The same applies to a cloud platform, so we allowed ourselves a short trial-and-error phase and split the platform into multiple releases. The first version had a short development cycle and was used mainly for validation, so we relied as much as possible on existing open source products, customizing them through secondary development where needed. Here are some of the open source products and technologies we used at the time.

Here are a few examples of what we can do with them:

  • Percona: our backup, slow-log analysis, overload protection and other functions are built on Percona's tools (XtraBackup and the pt-* toolkit).

  • Prometheus: a time-series database with excellent performance and rich functionality, used for monitoring and alerting across all platform instances. Its drawback is the lack of native clustering, so a single node eventually becomes a bottleneck (even though a single node is already very capable); we therefore split it by business line to get distributed storage and scalability.

  • Consul: distributed service discovery and configuration sharing, used with Prometheus to register monitoring targets (see the sketch after this list).

  • Python: used for the agent that manages MySQL instances inside Docker containers and for part of the operational scripts.

  • Docker: Hosts MySQL instances and implements resource isolation and resource limitation.
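
As a concrete illustration of how an instance could be registered with Consul so that Prometheus discovers it, here is a minimal sketch against Consul's HTTP agent API. The service name, address, port and tags are hypothetical, not our production naming.

```python
import requests

CONSUL_AGENT = "http://127.0.0.1:8500"  # local Consul agent (placeholder address)

def register_mysql_exporter(instance_id: str, host: str, port: int) -> None:
    """Register a monitoring endpoint so Prometheus (via consul_sd_configs) can pull it."""
    payload = {
        "ID": f"mysql-{instance_id}",
        "Name": "mysql-instance",        # service name Prometheus filters on (assumed)
        "Address": host,
        "Port": port,
        "Tags": ["mysql", "docker"],
        "Check": {                        # basic liveness check on the metrics endpoint
            "HTTP": f"http://{host}:{port}/metrics",
            "Interval": "15s",
        },
    }
    resp = requests.put(f"{CONSUL_AGENT}/v1/agent/service/register", json=payload)
    resp.raise_for_status()

if __name__ == "__main__":
    register_mysql_exporter("shard-01-master", "10.0.0.11", 9104)
```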

The main open source options for container scheduling are Kubernetes and Mesos, but we chose neither. The main reason is that we had already built a Docker-based resource management and scheduling system that had been running stably for more than two years, and only minor architectural changes were needed to reuse it.

Besides, making a third-party scheduler fit our existing high availability architecture and the rest of our automated management would have been difficult, and our resource allocation strategy also needs heavy customization. So we went with our own scheduling and management: whatever suits your current situation is the right choice. Of course, later on, when we get the chance to separate compute scheduling from storage scheduling, we may switch to a Kubernetes-based solution.

Take cluster creation as an example. When a creation task is submitted on the platform, the system first determines the number of instances from the cluster topology (one master with one slave, one master with several slaves, or a sharded cluster). It then matches available resources from the resource pool according to our filtering rules (master and slave must not land on the same machine, memory must not be oversold, and so on), builds the master-slave replication relationships, sets up high availability management, checks the cluster's replication status, pushes the cluster information to the middleware control center (if middleware is used), and finally synchronizes all of this information to the CMDB.

For each of these tasks, the server sends a message to the agent, which runs the corresponding script and returns the result in an agreed format; the scripts themselves are written by the DBAs. The advantage of this split is that DBAs understand the database better than anyone else, so it speeds the project up and keeps them involved: developers only write the front-end logic, while DBAs own the exact commands executed on the back end. When functionality changes or iterates later, only the scripts need to change, keeping maintenance cost minimal.
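
The message format, script locations and result fields below are hypothetical; this is just a minimal sketch of how an agent task of that kind could dispatch a DBA-maintained script and report back in a fixed structure.

```python
import json
import subprocess

# Hypothetical mapping from task names to DBA-maintained scripts on the host.
TASK_SCRIPTS = {
    "create_instance": "/opt/dbagent/scripts/create_instance.sh",
    "build_replication": "/opt/dbagent/scripts/build_replication.sh",
}

def run_task(task: dict) -> dict:
    """Execute the script for a task message and wrap its output in a fixed format."""
    script = TASK_SCRIPTS[task["action"]]
    proc = subprocess.run(
        [script, json.dumps(task.get("params", {}))],
        capture_output=True, text=True, timeout=3600,
    )
    return {
        "task_id": task["task_id"],
        "status": "ok" if proc.returncode == 0 else "failed",
        "output": proc.stdout,
        "error": proc.stderr,
    }

if __name__ == "__main__":
    msg = {"task_id": "42", "action": "create_instance",
           "params": {"port": 3307, "memory_gb": 16}}
    print(json.dumps(run_task(msg), indent=2))
```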

Years of DB operation and maintenance data led us to the following resource allocation rules (a small scheduling sketch follows the list):

  • CPU may be oversold by at most 3x; memory is never oversold.

  • Within the same equipment room, the machine with the least remaining resources that still fits is chosen first.

  • The master and slave roles cannot be on the same machine.

  • Where access goes through a VIP, the master and slave ports must be identical; where clients connect directly through the middleware, there is no port consistency restriction.

  • A sharded cluster distributes nodes on multiple physical machines.
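
Here is a minimal sketch of how filtering rules of this kind might look in code. The 3x CPU factor and the rules themselves mirror the list above, but the field names and helper structure are assumed for illustration.

```python
from dataclasses import dataclass

CPU_OVERSELL = 3  # CPU can be oversold up to 3x; memory is never oversold

@dataclass
class Host:
    name: str
    room: str
    cpu_cores: int
    mem_gb: int
    cpu_allocated: int = 0   # CPU already promised to containers on this host
    mem_allocated: int = 0   # memory already promised to containers on this host
    instances: tuple = ()    # instance ids already placed on this host

def candidates(hosts, room, cpu, mem, peer_instance=None):
    """Hosts in the given room that satisfy the allocation rules, tightest fit first."""
    ok = []
    for h in hosts:
        if h.room != room:
            continue
        if h.cpu_allocated + cpu > h.cpu_cores * CPU_OVERSELL:
            continue                      # would exceed the 3x CPU oversell cap
        if h.mem_allocated + mem > h.mem_gb:
            continue                      # memory must never be oversold
        if peer_instance is not None and peer_instance in h.instances:
            continue                      # master and slave must not share a machine
    # prefer the machine with the least remaining memory, i.e. pack tightly
        ok.append(h)
    return sorted(ok, key=lambda h: h.mem_gb - h.mem_allocated)
```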

The above covers some of the core functions already launched; there are many more that I will not show one by one.

Our backup tool is Percona XtraBackup. Data is streamed directly to a remote backup server, and multiple backup servers are deployed per equipment room.

We offer both manual and scheduled backups for different scenarios. With many instances per machine, backups must be careful about disk I/O and network bandwidth, so our backup policy limits how many backups may run in parallel on a single physical machine, and also caps the parallelism of the backup queue per equipment room so that the number of concurrent backup jobs stays at the configured value.

For example, if an equipment room allows 50 concurrent backup tasks and 5 of them finish early, 5 waiting tasks are pulled from the queue to take their place. We later changed the storage target so that backups stream directly into distributed storage.
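
As an illustration only, a backup worker of this kind could be sketched as follows. The XtraBackup flags shown are standard ones (MySQL connection and authentication options are omitted), but the host names, paths and the per-machine limit are made up.

```python
import subprocess
import threading

PER_HOST_LIMIT = threading.BoundedSemaphore(2)   # illustrative per-machine cap

def stream_backup(datadir: str, remote: str, remote_path: str) -> int:
    """Stream an xtrabackup of one instance to a remote backup server over ssh."""
    with PER_HOST_LIMIT:                          # respect the per-machine parallelism limit
        backup = subprocess.Popen(
            ["xtrabackup", "--backup", "--stream=xbstream",
             f"--datadir={datadir}", "--target-dir=/tmp"],
            stdout=subprocess.PIPE,
        )
        ship = subprocess.run(
            ["ssh", remote, f"cat > {remote_path}"],
            stdin=backup.stdout,
        )
        backup.stdout.close()
        backup.wait()
        return ship.returncode

if __name__ == "__main__":
    stream_backup("/data/mysql3307", "backup01.room-a",
                  "/backup/shard01/20170801.xbstream")
```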

Before this cloud platform went live, we used traditional Zabbix for monitoring and alerting. Zabbix is very powerful, but its back-end database is a bottleneck, which can only be relieved by splitting the database.

A database exposes a great many metrics, and once the number of collected items grows, Zabbix needs extra proxies, making the architecture more and more complex. Integrating it with our platform would also be costly, and its performance on complex statistical queries (95th-percentile values, forecasts, and so on) is poor.

So we chose the time-series database Prometheus, which is very powerful and well suited to monitoring systems. Its strength is excellent single-node performance; its weakness is the lack of clustering support (though we solved the scaling issue, as described below).

We introduced Prometheus roughly a year ago as a secondary monitoring system, and as we grew familiar with it we came to see it as an excellent fit for container monitoring. So when the cloud platform was built, I chose it as the monitoring system for the entire platform.

Prometheus supports both a push gateway and a pull model; we chose pull because the architecture is simpler, development cost is lower, and it integrates cleanly with our system. A Consul cluster is responsible for registering instance and service information, such as MySQL master/slave instance services, the corresponding Linux host services, and container services. Prometheus obtains its monitoring targets from what is registered in Consul and pulls the metrics from them. The monitoring client is our agent, and Prometheus fetches the data the agent collects over HTTP.
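
A minimal sketch of such an agent-side metrics endpoint, using the prometheus_client library; the metric names, labels and the way values are produced here are illustrative, not our actual agent.

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# Illustrative metrics; a real agent would read them from MySQL (SHOW GLOBAL STATUS, etc.)
THREADS_RUNNING = Gauge(
    "mysql_threads_running", "Currently running MySQL threads", ["instance"]
)
SLAVE_LAG = Gauge(
    "mysql_slave_lag_seconds", "Replication lag in seconds", ["instance"]
)

def collect(instance: str) -> None:
    # Placeholder values standing in for real SHOW STATUS / SHOW SLAVE STATUS queries.
    THREADS_RUNNING.labels(instance=instance).set(random.randint(1, 50))
    SLAVE_LAG.labels(instance=instance).set(random.random())

if __name__ == "__main__":
    start_http_server(9104)          # Prometheus scrapes http://<agent>:9104/metrics
    while True:
        collect("shard-01-3307")
        time.sleep(15)
```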

Grafana deserves its reputation as the master of monitoring visualization: a full-featured metrics dashboard and graph editor that can be configured to display almost any monitoring graph. We connected Grafana to the cloud platform so that users can view instance or cluster dashboards with a single click.

Alert management covers sending alerts, managing recipients and silencing alerts. Prometheus ships with Alertmanager as its alert delivery component; we have it forward alerts to the cloud platform's alert API via a webhook, and the platform then dispatches them according to the following logic.

Alertmanager only knows about alerts at the instance dimension, so we enrich them with the platform's instance metadata to build multi-dimensional alert content, letting DBAs see at a glance whose cluster triggered which alert, at what severity, and when. Another notification is sent when the alert is resolved.
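
A minimal sketch of such a webhook receiver. The CMDB lookup, field names and notification call are placeholders; only the overall Alertmanager payload shape (a JSON body with an `alerts` list carrying `status`, `labels` and `annotations`) follows the real webhook format.

```python
from flask import Flask, request

app = Flask(__name__)

def lookup_cluster(instance: str) -> dict:
    """Placeholder CMDB lookup: map an instance address to cluster name and owner."""
    return {"cluster": "order-shard", "owner": "dba-team"}

def send_notification(owner: str, message: str) -> None:
    print(owner, message)   # stand-in for SMS / IM / email delivery

@app.route("/alert/webhook", methods=["POST"])
def alert_webhook():
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        meta = lookup_cluster(labels.get("instance", ""))
        message = (
            f"[{alert['status']}] cluster={meta['cluster']} owner={meta['owner']} "
            f"alert={labels.get('alertname')} severity={labels.get('severity')} "
            f"since={alert.get('startsAt')}"
        )
        send_notification(meta["owner"], message)
    return "ok"

if __name__ == "__main__":
    app.run(port=8080)
```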

Alertmanager itself is powerful, supporting alert inhibition, routing policies, send intervals and silences, all of which can be configured directly if needed. But that kind of management outside the platform is not what we want, so we set out to fold Alertmanager's alert handling into the cloud platform.

At the time, however, the official documentation said little about Alertmanager's API. By reading the source code we found the alert-management endpoints, and then reproduced the functionality of Alertmanager's native UI inside our cloud platform, enriched with instance-related information such as cluster name and owner.

Here are some examples:

Current alarms:

Adding alarm silence:

Created silence rules:

Slow logs are analyzed locally every hour with pt-query-digest, and the results are written to a database dedicated to slow-log storage. To see the current slow-log situation immediately, you can click "Slow Log Analysis" on the page; once the analysis finishes, "View Slow Logs" shows the analysis results for the instance. The page also integrates EXPLAIN, table status views and other features.
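
A minimal sketch of the hourly analysis step. The --history DSN option is a standard pt-query-digest feature for writing per-query metrics to a table, but the host names, credentials and paths here are placeholders.

```python
import subprocess

SLOWLOG = "/data/mysql3307/slow.log"
# DSN of the database that stores slow-log analysis results (placeholder values).
HISTORY_DSN = "h=slowlog-db.internal,u=ptuser,p=secret,D=slowlog,t=query_history"

def analyze_slow_log() -> None:
    """Run pt-query-digest on the local slow log and persist the results."""
    subprocess.run(
        [
            "pt-query-digest",
            "--history", HISTORY_DSN,     # write per-query metrics to the history table
            "--no-report",                # stored results only, no stdout report
            "--limit", "100%",
            SLOWLOG,
        ],
        check=True,
    )

if __name__ == "__main__":
    analyze_slow_log()   # typically invoked from cron once per hour
```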

Cluster management is one of the platform's core functions, accounting for about 70% of the overall work, and its features are the ones DBAs use most in daily operations. The design is cluster-centric: you can only operate on one cluster at a time, which keeps pages free of irrelevant information that would otherwise look cluttered and invite mistakes. A glance at the features below makes it easier to see why.

This shows only part of it; some features are not pictured (middleware integration, dashboards, command-line "black screen" diagnostics, etc.), and more are coming in later versions.

We use MHA, the most popular MySQL high availability solution. I won't go over MHA's pros and cons here, since DBAs are already familiar with them; instead I want to describe the adjustments we made for Tongcheng's business.

Since we mainly run MariaDB, and even the latest MHA does not support GTID-based switchover for MariaDB, we extended MHA to add that support. We also found that sync_master_info and sync_relay_log_info do not need to be set to 1 (MariaDB can only persist this state to files, not to tables), which greatly reduces the IOPS on replication slaves.

We also adjust sync_binlog and innodb_flush_log_at_trx_commit during a switchover. These parameters control how data is flushed to disk; our default is "double 1", which is the safest setting for data but also generates the most I/O.

In a multi-instance cloud deployment, a physical machine hosts both masters and slaves, and we do not want a slave to generate so much I/O that it affects the other instances on the same machine (I/O can be isolated, but avoiding unnecessary I/O comes first). So in principle masters run with double 1 and slaves do not. Since a slave may be promoted to master, slaves run without double 1 by default, and during an MHA switchover both parameters on the newly promoted master are automatically set to 1.
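
That promotion step can be wired into the switchover scripts; here is a minimal sketch using PyMySQL, where the host, credentials and the exact hook point are assumptions for illustration.

```python
import pymysql

def set_double_one(host: str, port: int, user: str, password: str) -> None:
    """After promotion, make the new master flush-safe: sync_binlog=1 and
    innodb_flush_log_at_trx_commit=1 (the 'double 1' setting)."""
    conn = pymysql.connect(host=host, port=port, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SET GLOBAL sync_binlog = 1")
            cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 1")
    finally:
        conn.close()

if __name__ == "__main__":
    # Would typically be called from an MHA failover hook after the new master is chosen.
    set_double_one("10.0.0.21", 3307, "ha_admin", "change-me")
```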

We deploy sentinel services at multiple network locations. A sentinel is a simple API service that, given an instance's connection details, reports whether it can reach that instance. When the MHA manager finds that a master is unreachable, it triggers a secondary check: it calls each sentinel's API with the master's information, and only if more than half of the sentinels also fail to connect does the failover proceed; otherwise the switchover is abandoned.
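
A minimal sketch of that secondary check; the sentinel endpoint URLs and their request/response shape are hypothetical.

```python
import requests

SENTINELS = [
    "http://sentinel-a.internal:8500/check",
    "http://sentinel-b.internal:8500/check",
    "http://sentinel-c.internal:8500/check",
]

def should_failover(master_host: str, master_port: int) -> bool:
    """Ask every sentinel whether it can still reach the master; fail over only
    if more than half of the sentinels also report the master as unreachable."""
    unreachable = 0
    for url in SENTINELS:
        try:
            resp = requests.get(url,
                                params={"host": master_host, "port": master_port},
                                timeout=3)
            if not resp.json().get("reachable", False):
                unreachable += 1
        except requests.RequestException:
            continue               # a sentinel we cannot reach simply does not vote
    return unreachable > len(SENTINELS) / 2

if __name__ == "__main__":
    print(should_failover("10.0.0.21", 3307))
```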

The DB middleware connects to the database by physical IP address. When an HA switchover happens, the new master's IP address and port are pushed to the middleware control center; once the middleware receives the configuration, it is distributed and takes effect immediately.

The migration function was originally built to move instances or databases from outside the platform into it. As we used it, we found it had much wider applications, such as splitting out tables from a database already on the platform. The principle is simple: mydumper exports the specified data, and myloader restores it into the target database.
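
For illustration, the full-copy step might look like the following sketch. The mydumper/myloader options shown are standard ones, but the hosts, credentials and thread counts are placeholders.

```python
import subprocess

def full_copy(src: dict, dst: dict, database: str, dump_dir: str = "/tmp/migrate") -> None:
    """Export one database with mydumper, then import it with myloader."""
    subprocess.run(
        ["mydumper",
         "-h", src["host"], "-P", str(src["port"]),
         "-u", src["user"], "-p", src["password"],
         "-B", database,               # only the database being migrated
         "-t", "4",                    # export threads
         "-o", dump_dir],
        check=True,
    )
    subprocess.run(
        ["myloader",
         "-h", dst["host"], "-P", str(dst["port"]),
         "-u", dst["user"], "-p", dst["password"],
         "-B", database,
         "-t", "4",                    # import threads
         "-d", dump_dir],
        check=True,
    )

if __name__ == "__main__":
    full_copy({"host": "10.0.0.5", "port": 3306, "user": "migrate", "password": "x"},
              {"host": "10.0.0.21", "port": 3307, "user": "migrate", "password": "x"},
              "orders")
```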

That covers the full copy; incremental copying is handled by a tool we developed ourselves that supports parallel apply and idempotent processing, which makes it flexible to use. We avoided native replication because migrating just one of several databases on the source instance would require binlog replication filters, which means configuration changes and an instance restart, so it was never an option.

The implementation is not elegant, but it meets the requirements. mydumper and myloader have their own problems too, and we made small modifications to work around them. In the future we plan to switch to a streaming export/import model (similar to Alibaba's open source DataX).

Once the full copy is done and the incremental copy has caught up, the natural concern is data consistency between source and target. We provide a self-developed verification tool: checking about 300 GB of data takes roughly 2 to 3 minutes, with the speed depending on how many threads are used.

To the user, the platform delivers a database service or a set of database services, with no visibility into which machines the back-end instances live on; resource calculation and scheduling are handled entirely by the system's algorithms.

By running multiple instances per machine, CPU can be oversold, which markedly improves CPU utilization. Memory is not oversold, but each instance's usage is capped so every instance is guaranteed enough memory; as long as free memory remains we keep placing containers, rather than squeezing memory out via the OOM killer.

The efficiency gains come from automation built on standardization, which makes batch operation and maintenance very cheap. A sharded cluster that used to take nearly six hours to deploy (plus another hour or two for middleware integration) can now be deployed in five minutes, and what comes out of the deployment is a complete middleware + DB sharded cluster service.

After the platform went live, resource utilization improved markedly. At the same time, the one-database-per-instance approach avoids the mutual interference caused by uneven load across databases, and performance can now be monitored at the level of individual databases.

This is only the beginning; many features remain to be improved. Below are some of the latest planned features, some of which have already been built in later versions. As the functionality keeps iterating, we will build an ever more complete private cloud platform.

For Tongcheng's DB team, the launch of the database private cloud platform marks the end of one era and the start of another: the traditional era of low-efficiency, high-cost operations is over, and an era of low-cost, high-efficiency, high-security operations has begun. We believe the future will be even better!

Wang Xiaobo is chief architect of LY.com and a member of EGO. He focuses on high-concurrency Internet architecture, distributed e-commerce transaction platforms, big data analysis platforms and high-availability system design. He has designed several e-commerce trading platforms handling over a million orders, with peaks of more than 200,000 orders per minute, is familiar with the technical design of B2C, B2B, B2B2C and O2O e-commerce systems and with the development characteristics of e-commerce platforms, and has more than ten years of experience in technical architecture and consulting, with a deep understanding of how important technology selection is for e-commerce systems.
