We’ve already looked at replication and scalability, so now let’s turn to availability. In the final analysis, high availability simply means “less downtime”.

As always, before we talk about a concept we have to define it, so what exactly is availability?

1 What is availability

Availability is usually expressed as a percentage, and that carries a hidden implication: high availability is never absolute. In other words, 100% availability is impossible to achieve.

We usually describe availability with a number of “nines”. “X nines” is the ratio of the time a system actually provides service to the total time over a year of operation. For example, five nines means 99.999% availability, so the allowed downtime t is:

(1 - 99.999%) * 3600 * 24 * 365 = 315.36 s = 5.256 minutes

Therefore, “five nines” means the application is allowed only about 5.256 minutes of downtime per year. Using the same calculation, three nines allows 8.76 hours of downtime per year, and four nines allows about 52.6 minutes.
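To make the arithmetic easy to reproduce, here is a small Python sketch (not from the original discussion, just the formula above applied to a few values):

```python
# Yearly downtime budget for a given number of nines.
SECONDS_PER_YEAR = 3600 * 24 * 365

def downtime_budget_seconds(nines: int) -> float:
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * SECONDS_PER_YEAR

for n in (3, 4, 5):
    s = downtime_budget_seconds(n)
    print(f"{n} nines: {s:.2f} s/year = {s / 60:.2f} min = {s / 3600:.2f} h")
```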

In fact, five nines is out of reach for most applications, but some large companies will certainly chase even more nines in order to minimize application downtime and the costs that come with it.

Every application has different availability requirements. Before setting a goal, be sure to consider whether you really need to reach it. The benefit of additional availability does not grow linearly with its cost: each extra nine costs far more than the previous one.

Therefore, when setting an availability target, we can follow this principle:

Let the amount of downtime you can afford determine the availability you guarantee.

This may sound a bit abstract, so to put it simply: for an application with 100,000 users, suppose it costs 1,000,000 to implement five nines, but even if the application is down for 9 hours every year the total loss is only 500,000. Is it really necessary for that application to implement five nines?
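Putting those hypothetical figures into a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check using the hypothetical figures above.
cost_of_five_nines = 1_000_000      # assumed cost of reaching five nines
yearly_downtime_hours = 9           # downtime we would otherwise tolerate
loss_per_hour = 500_000 / 9         # implied by "9 hours of downtime costs 500,000"

expected_yearly_loss = yearly_downtime_hours * loss_per_hour
print(f"expected loss: {expected_yearly_loss:,.0f} vs. cost: {cost_of_five_nines:,.0f}")
# -> 500,000 of expected loss against a 1,000,000 cost: probably not worth it
```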

In addition, we defined availability above in terms of downtime, but availability should also include whether the application is performing well enough to handle requests. On a large server, a restarted MySQL instance can take hours of warm-up before its caches are hot enough to guarantee acceptable response times; those hours should also be counted as downtime.

At this point, we should have a general impression that availability is tied to application downtime. Next, let’s take a closer look at why applications go down.

2 Causes of downtime

The most common cause of database downtime that we hear about is poor SQL performance. In fact, that is not the whole story. Grouped by how the outage manifests, the causes break down roughly as follows:

  • Runtime environment (about 35%): disk space exhausted.
  • Performance problems (about 35%): poorly performing SQL, server bugs, and poor table structure and index design.
  • Replication (about 20%): inconsistent data between the primary and its replicas.
  • Data loss or corruption (about 10%): accidental deletion combined with a missing backup.

Here, the runtime environment means the set of system resources that the database server runs on, including the operating system, disks, and network.

In addition, note that while replication is often used to improve availability, it is also a major cause of downtime. This illustrates a more general point:

Many high availability strategies can backfire.

Knowing the definition of availability and the factors that reduce it, it’s time to consider how to improve the availability of your system.

3 How to implement high availability

As you may have noticed from the analysis above, availability depends on two quantities:

  • The application’s mean time between failures (how often it fails)
  • The application’s mean time to recovery (how long each failure lasts)
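These two quantities determine availability in the familiar way: availability = MTBF / (MTBF + MTTR). A tiny sketch with illustrative numbers:

```python
# Availability in terms of mean time between failures (MTBF)
# and mean time to recovery (MTTR).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers: failing once every 1000 hours and taking 1 hour to recover.
print(f"{availability(1000, 1):.5%}")  # ~99.90010%, roughly three nines
```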

Therefore, availability can be improved on both fronts. First, minimize downtime by preventing outages in the first place: many of the problems that cause them can easily be avoided with proper configuration, monitoring, and operating conventions or safeguards.

Second, try to ensure the system recovers quickly when an outage does occur. The most common strategy is to build redundancy into the system and make sure it can fail over.

Next, let’s take a look at specific targeted measures.

3.1 Increasing the mean time between failures

Poor management of system changes is the most common cause of outages. Typical mistakes include careless upgrades that hit bugs, schema or query changes pushed to production without testing, and failing to plan for foreseeable problems such as running out of disk space.

Another major cause of downtime is a lack of rigorous verification, for example never confirming that backups are actually recoverable.

There may also be no accurate monitoring of MySQL-related information. For example, if a cache hit ratio alarm keeps firing as a false positive without indicating a real problem, we start treating the monitoring system as useless and then ignore the alarm that does matter. When the boss asks why the disk filled up without a single alert being received, all you can do is look at him innocently.

Therefore, a lot of downtime can be avoided simply by doing some targeted work. Specific measures include the following:

  • Test recovery tools and processes, including restoring data from backups.
  • Follow the principle of least privilege.
  • Keep the system clean and tidy.
  • Use good naming and organizational conventions to avoid confusion, such as making clear whether a server is used for development or production.
  • Schedule database server upgrades carefully.
  • Before upgrading, use a tool such as pt-upgrade in the Percona Toolkit to check your system carefully.
  • Use InnoDB and configure it appropriately: make it the default storage engine, and configure the server so that it cannot start if InnoDB is disabled.
  • Verify that the basic server configuration is correct.
  • Disable DNS lookups with skip_name_resolve.
  • Disable the query cache unless it is demonstrably useful.
  • Avoid complex features such as replication filtering and triggers unless you really need them.
  • Monitor important components and functions, especially critical items such as disk space and RAID volume status (see the sketch after this list).
  • Record server status and performance metrics for as long as possible.
  • Check replication integrity periodically.
  • Keep replicas deliberately set to read-only, and do not let replication start automatically.
  • Periodically review query statements.
  • Archive and clean up unwanted data.
  • Reserve some free space for the file system.
  • Develop the habit of evaluating and managing information about changes, status, and performance of the system.
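As a minimal illustration of the disk space monitoring item above (the data directory path and the 10% threshold are assumptions; a real deployment would feed this into an alerting system rather than print):

```python
import shutil

# Minimal disk space check; the data directory and threshold are assumptions.
DATA_DIR = "/var/lib/mysql"
FREE_THRESHOLD = 0.10  # alert when less than 10% of the volume is free

def check_disk_space(path: str = DATA_DIR, threshold: float = FREE_THRESHOLD) -> bool:
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    if free_ratio < threshold:
        # In a real setup, send this to your monitoring/alerting system.
        print(f"ALERT: only {free_ratio:.1%} free on {path}")
        return False
    return True

if __name__ == "__main__":
    check_disk_space()
```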

3.2 Reducing the mean time to recovery

There are three ways to approach recovery time:

  • Build redundancy into the system to ensure failover capability and avoid single points of failure.
  • Give the team a complete, documented recovery process.
  • Hold post-incident reviews so the team avoids similar mistakes in the future.

Next, let’s discuss specific measures.

1) How to avoid single points of failure?

For single points of failure, the first thing to do is find them; only then can we address them.

Any single point in the system can fail: a hard drive, a server, a switch, a router, or even the single power feed of a rack.

Before going further, it is important to understand that single points of failure cannot be completely eliminated. New technology introduced to remove one single point of failure may itself cause more downtime. Therefore, rank single points of failure by their impact and address them in that order.

As for concrete ways to add redundancy, there are two broad options: adding spare capacity and duplicating components.

Adding spare capacity is relatively easy. You can build a cluster or server pool behind a load balancer; when one server fails, the others take over its load, as the sketch below illustrates.
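As a rough sketch of the idea from the client side (the hostnames and the simulated outage are assumptions, and `connect_to` stands in for a real MySQL connector):

```python
# Sketch of client-side failover across a server pool.
SERVER_POOL = ["db1.example.com", "db2.example.com", "db3.example.com"]
DOWN = {"db1.example.com"}  # pretend this server has failed

def connect_to(host: str) -> str:
    """Stand-in for a real connector: 'connects' unless the host is down."""
    if host in DOWN:
        raise ConnectionError(f"{host} is unreachable")
    return f"connection to {host}"

def get_connection(pool=SERVER_POOL):
    """Try each server in turn so a failed server's load moves to the others."""
    for host in pool:
        try:
            return connect_to(host)
        except ConnectionError:
            continue
    raise ConnectionError("no server in the pool is reachable")

print(get_connection())  # -> connection to db2.example.com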

In addition, it is often worth keeping redundant components on hand so that a failed part can be replaced quickly: spare network cards, routers, or hard drives.

Most importantly of all, make the MySQL servers themselves fully redundant, which means ensuring that the standby server has access to the data on the primary. Here are some ways to start:

  • Shared storage or disk replication
  • MySQL synchronous replication

2) How to ensure the system can fail over and fail back?

Before getting into this topic, let’s define “failover”. Some people also talk about “failback”, while “switchover” is often used to mean a planned switch rather than a response to a failure.

We use “failback” here for the reverse of failover. If the system is to be truly failover-capable, failover must be a two-way process:

When server A fails, server B takes over; after server A is repaired, the load can be moved back to it.

The most important part of failover is failback. If you cannot switch back and forth between servers freely, failover is a dead end that only postpones downtime. Therefore, when using replication, prefer a symmetric topology such as a dual-master layout: because the configuration is peer-to-peer, failover and failback are the same operation in opposite directions.

Let’s look at some of the more common failover techniques.

Promote the standby database or switch roles

Promoting a replica to primary, or switching the active and passive roles in a master-master replication topology, is an important part of many MySQL failover strategies. For details, see MySQL Replication – Cornerstones of Performance and scalability 4: Primary/Secondary Database switchover.

Virtual IP address or IP takeover

You can assign a logical IP address to the MySQL instance that provides a particular service. When that instance fails, the IP address is moved to another MySQL server. This is essentially the same virtual IP technique used for load balancing, only applied here to failover.

This approach has the advantage of being transparent to the application: existing connections are broken, but no application configuration changes are needed.

Of course, it also has some disadvantages:

  • You need all of the IP addresses to be on the same network segment, or you need to use network bridging.
  • You sometimes need to update ARP caches: some network devices cache ARP entries for a long time, so the IP address may not switch to the new MAC address immediately (the sketch after this list sends a gratuitous ARP for exactly this reason).
  • You need to confirm that your network hardware supports fast IP takeover; some hardware requires MAC address cloning for this to work.
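As a rough sketch of the takeover steps (the interface name and addresses are placeholders; this assumes a Linux host with the `ip` and iputils `arping` commands, and a real setup would normally rely on a tool such as Keepalived rather than a hand-rolled script):

```python
import subprocess

# Placeholder values: the virtual IP, prefix length, and interface are assumptions.
VIP = "192.168.1.100"
PREFIX = "24"
IFACE = "eth0"

def take_over_vip() -> None:
    """Bind the virtual IP on this host and broadcast a gratuitous ARP
    so that neighbours update their ARP caches quickly."""
    # Add the VIP to the local interface (requires root).
    subprocess.run(["ip", "addr", "add", f"{VIP}/{PREFIX}", "dev", IFACE], check=True)
    # Send unsolicited ARP replies advertising this host's MAC for the VIP.
    subprocess.run(["arping", "-c", "3", "-A", "-I", IFACE, VIP], check=True)

if __name__ == "__main__":
    take_over_vip()
```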

3) How can team members improve system recovery time?

Since everyone on the team differs in proficiency and composure when recovering from an outage, we also need a documented process so that the right people get involved and the system recovery time is reduced.

Once the system is back up, organize a review of the outage, but be careful not to expect too much from this kind of post-mortem. Outages often have multiple causes, and it is hard to reconstruct the situation at the time and identify the true root cause, so treat post-mortem conclusions with a grain of salt.

4 Summary

  1. Availability is usually measured in nines, which translate directly into a yearly downtime budget.
  2. Improving availability comes down to increasing the mean time between failures and reducing the mean time to recovery.