About highly available systems

In the article “The Technology I’ve been Working on for years”, I talked about the technology areas that I have been following for years, and I mentioned industrial-grade software several times. I thought many people would ask me how to define industrial-grade software and what a high availability software system should look like, which would give me a reason to write this article. But no one asked, so I had to write it shamelessly anyway. Ha ha.

In addition, wherever I see high availability discussed, people only talk about each company's technical solution. In fact, high availability is not just a technical solution; a highly available system involves many other things. Our understanding of high availability is therefore incomplete, and I am writing this article to give everyone a more comprehensive picture.

Understanding high availability systems

First, we need to understand what High Availability is. Basically, it means keeping our computing environment (hardware and software) fully available. Generally speaking, this requires the following design work:

Redundancy of hardware and software to eliminate single points of failure. Every system has one or more redundant systems on standby.
Fault detection and recovery. Detect failures and have a backup node take over from the failed one. This is called failover.
A very reliable crossover. Some nodes are not easy to make redundant, such as domain name resolution and load balancers, so they have to be very reliable themselves.
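
The detect-and-take-over step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: Node, FailoverGroup and check_and_failover are hypothetical names, health is a plain flag rather than a real heartbeat, and replication of state (the hard part) is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A hypothetical service node; `healthy` stands in for a real health check."""
    name: str
    healthy: bool = True


class FailoverGroup:
    """One active node plus redundant standbys (illustrative only)."""

    def __init__(self, nodes: List[Node]) -> None:
        if not nodes:
            raise ValueError("need at least one node")
        self.nodes = nodes
        self.active = nodes[0]

    def check_and_failover(self) -> Optional[Node]:
        """Detect a failed active node and promote the first healthy node.

        Returns the newly promoted node, or None if the active node is fine.
        """
        if self.active.healthy:
            return None
        for node in self.nodes:
            if node.healthy:
                self.active = node
                return node
        raise RuntimeError("no healthy node left: redundancy exhausted")


group = FailoverGroup([Node("primary"), Node("standby-1"), Node("standby-2")])
group.nodes[0].healthy = False          # simulate a failure of the active node
promoted = group.check_and_failover()   # fault detection plus takeover
second_check = group.check_and_failover()  # nothing to do once a healthy node is active
```

For stateless nodes this is most of the story. For stateful nodes, the standby is only useful if it already holds the data, which is where the difficulty starts.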

It sounds simple, but it's not. The devil is in the details. The biggest difficulty with redundant nodes is data replication and consistency for stateful nodes (redundancy for stateless nodes is relatively simple). The consistency of redundant data is the devil among devils:

If the system data is mirrored asynchronously to redundant nodes, data differences will occur during failover.

If the system mirrors data to the redundant nodes synchronously, then the more redundant nodes there are, the slower the performance.
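
A toy model makes the two cases concrete. It is a sketch, not a real replication protocol: Primary, Replica, ship_pending and lost_on_failover are hypothetical names, and a counter of replica round trips stands in for write latency. The synchronous path contacts replicas one after another for simplicity; real systems often send in parallel but still have to wait for acknowledgements.

```python
from typing import List


class Replica:
    """A redundant node that stores a copy of the primary's log."""

    def __init__(self) -> None:
        self.log: List[str] = []


class Primary:
    """Toy primary that mirrors writes synchronously or asynchronously."""

    def __init__(self, replicas: List[Replica], synchronous: bool) -> None:
        self.log: List[str] = []
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending: List[str] = []   # writes not yet shipped (async mode only)
        self.replica_round_trips = 0   # crude stand-in for write latency

    def write(self, record: str) -> None:
        self.log.append(record)
        if self.synchronous:
            # The write is not finished until every replica has the record.
            for replica in self.replicas:
                replica.log.append(record)
                self.replica_round_trips += 1
        else:
            # Return immediately; the record is shipped later.
            self.pending.append(record)

    def ship_pending(self) -> None:
        """Background shipping step used in asynchronous mode."""
        for record in self.pending:
            for replica in self.replicas:
                replica.log.append(record)
        self.pending.clear()


def lost_on_failover(primary: Primary, replica: Replica) -> List[str]:
    """Records the primary accepted that the replica never received."""
    return [record for record in primary.log if record not in replica.log]


# Asynchronous: the primary "crashes" after accepting "b" but before shipping it.
async_primary = Primary([Replica()], synchronous=False)
async_primary.write("a")
async_primary.ship_pending()
async_primary.write("b")
async_lost = lost_on_failover(async_primary, async_primary.replicas[0])

# Synchronous: nothing is lost, but every write pays one round trip per replica.
sync_primary = Primary([Replica(), Replica(), Replica()], synchronous=True)
sync_primary.write("a")
sync_primary.write("b")
sync_lost = lost_on_failover(sync_primary, sync_primary.replicas[0])
```

In this model the asynchronous failover loses the write "b", while the synchronous setup loses nothing but spends six replica round trips on two writes.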

So highly available systems make all kinds of trade-offs, and the right choice depends on the characteristics of the business. A bank account balance, for example, is state-type data, so its redundancy must be strongly consistent. An order record, by contrast, is append-type data, so during failover the new records can simply be appended on the standby, which is relatively simple. (Alibaba's current so-called cross-region active-active setup cannot actually achieve active-active for state-type data.)

Here is a summary of high availability design principles:

To keep data from being lost, persistence is necessary.
For the service to be highly available, there must be replicas, whether of application nodes or data nodes.
Once there is replication, there are data consistency issues.
We can never achieve 100% availability; what we can achieve is an SLA of some number of nines.

Technical solutions for high availability systems

In “Transaction Processing for Distributed Systems” I used the following figure, which comes from “Transaction Across DataCenter”, the 2009 Google I/O talk by Google App Engine co-founder Ryan Barrett (www.youtube.com/watch?v=srO,…).

This diagram is basically the foundation of every solution used in today's high availability systems. M/S (master/slave) and MM (multi-master) are not difficult to implement, but they come with many problems. The problem with 2PC is poor performance, and the problem with Paxos is that it is too complex and hard to implement.

To summarize the problems with each high availability solution:

With eventual consistency, if a node goes down before the data is fully synchronized, data differences may occur.
With strong consistency, you either use the two-phase commit scheme of the XA family, which performs worse, or the Paxos protocol, which performs better but is much more complex to implement.
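
To see why two-phase commit is the slower option, here is a minimal sketch of the idea. Participant and two_phase_commit are hypothetical names; the sketch leaves out timeouts, durable logging and coordinator failure, which a real XA implementation has to handle.

```python
from typing import List


class Participant:
    """A hypothetical resource taking part in a distributed transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        """Phase 1: vote yes (and hold resources) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1: one round trip to every participant to collect votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: a second round trip to every participant with the decision.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


all_yes = [Participant("db-a"), Participant("db-b")]
committed = two_phase_commit(all_yes)

one_no = [Participant("db-a"), Participant("db-b", can_commit=False)]
rolled_back = not two_phase_commit(one_no)
```

Every transaction costs two round trips to every participant, and prepared participants hold their resources until the decision arrives. That is where the performance penalty comes from.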

Note: this is a software solution. Of course, more expensive hardware solutions can be used, but hardware solutions are not covered in this article.

In addition, many open source caches, message middleware products and databases today are designed with persistence and replication, so they do offer highly available solutions. But open source software generally does not come with a high SLA, so keep this in mind when using it.

Examples of highly available technical solutions

Let's take a look at the SLAs of MySQL's high availability solutions (the red markers in the figure below show how many nines each solution reaches):

Image credit: MySQL High Availability Solutions

A brief explanation of MySQL's solutions (mainly to show that the more nines there are, the more complex the solution is):

MySQL Replication is traditional asynchronous replication, or semi-synchronous (semi-sync) replication, in which a write returns success as soon as one slave has received the update. This is basically below two nines.
MySQL Fabric is, simply put, M/S read-write separation on top of data sharding. The availability of this solution can reach 99%.
DRBD uses low-level disk synchronization to solve the data synchronization problem. It is in effect RAID 1 across hosts, mirroring the hard disks of two or more hosts into one. This is below three nines.
Solaris Clustering/Oracle VM is a mechanism that monitors hardware, operating systems, networks, and databases. It usually comes with a “heartbeat mechanism” between nodes, uses a Storage Area Network (SAN) or a local distributed storage system, and uses virtualization to migrate virtual machines and so reduce the probability of downtime. It is a complete “full stack” solution and is close to four nines.
MySQL Cluster is an open source solution that splits MySQL into SQL Nodes and Data Nodes. The Data Nodes form an NDB cluster that automates sharding and replication. MySQL Cluster uses “fully synchronous” data replication to keep the data nodes redundant. This solution is close to five nines.

So what do two nines, three nines, four nines and five nines mean? Where do these numbers come from? Read on.

Definition of SLA for high availability

None of the above is the focus of this article. The focus is how to measure the high availability of a system. The measure is, of course, the SLA, which stands for Service Level Agreement: the “number of nines” people quote for high availability.

The industry has two ways to measure an SLA:

One is the time from a failure to its recovery
The other is the time between two failures

Most people use the first, the time from failure to recovery, that is, the time during which the service is unavailable, as shown in the following table:

System availability %	Downtime per year	Downtime per month	Downtime per week	Downtime per day
90% (1 nine)	36.5 days	72 hours	16.8 hours	2.4 hours
99% (2 nines)	3.65 days	7.20 hours	1.68 hours	14.4 minutes
99.9% (3 nines)	8.76 hours	43.8 minutes	10.1 minutes	1.44 minutes
99.99% (4 nines)	52.56 minutes	4.38 minutes	1.01 minutes	8.64 seconds
99.999% (5 nines)	5.26 minutes	25.9 seconds	6.05 seconds	0.86 seconds

For example, with 99.999% availability, the service may be unavailable for only a little over five minutes in a whole year. That's hard to do.
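
The downtime figures follow directly from the availability percentage. A small Python sketch, with allowed_downtime_seconds as a hypothetical helper and a 365-day year assumed (monthly figures depend on how long a "month" is taken to be, so they can differ slightly between sources):

```python
def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Downtime budget for a period, given an availability target in percent."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


YEAR_SECONDS = 365 * 24 * 3600  # assumption: a non-leap year

# Five nines: a little over five minutes per year.
five_nines_minutes_per_year = allowed_downtime_seconds(99.999, YEAR_SECONDS) / 60

# Three nines: a bit under nine hours per year.
three_nines_hours_per_year = allowed_downtime_seconds(99.9, YEAR_SECONDS) / 3600
```

Each extra nine divides the budget by ten, which is why every additional nine is so much harder to reach than the previous one.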

Even at three nines, the allowed downtime in a month is only a little over 40 minutes. Now look at the teams that do not take design and coding seriously and pin all their hopes on an operations team that handles failures by hand. A single failure can take them more than an hour, or even 2 to 3 hours, to handle, and they do not even have automated tools. If they still have the nerve to declare on their official website that their SLA is three or five nines, isn't that cheating the public?
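
The same check can be run against a real incident log. A minimal sketch, assuming a 30-day month and a hypothetical helper measured_availability that takes outage durations in minutes:

```python
from typing import List


def measured_availability(outages_minutes: List[float], period_minutes: float) -> float:
    """Availability in percent, from the total failure-to-recovery time in a period."""
    downtime = sum(outages_minutes)
    if downtime > period_minutes:
        raise ValueError("downtime cannot exceed the period")
    return 100 * (1 - downtime / period_minutes)


MONTH_MINUTES = 30 * 24 * 60  # assumption: a 30-day month

# One failure handled by hand in two hours.
manual_recovery = measured_availability([120], MONTH_MINUTES)
manual_meets_three_nines = manual_recovery >= 99.9

# One failure recovered in 40 minutes.
fast_recovery = measured_availability([40], MONTH_MINUTES)
fast_meets_three_nines = fast_recovery >= 99.9
```

A single two-hour manual recovery already puts the month at roughly 99.72%, below three nines, while a 40-minute recovery just stays within it.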

Factors affecting high availability

To be honest, it is hard to work out how available the system we are designing really is, because so many factors affect it: not just the software design but also the hardware, third-party services (for example, the broadband SLA of China Telecom or China Unicom) and, of course, “the construction crew's excavator.” So, as its name says, an SLA is not just a technical indicator but a contract between the service provider and the user. This industrial way of doing things is like aviation: it is not just about building an airplane, but also about a great many highly specialized facilities, tools, processes, management and operations.

In a nutshell, the number of nines in an SLA describes the level at which a service stays available. The industry divides the causes of unavailability into planned and unplanned.

Causes of unplanned outages

The chart below is from Oracle’s High Availability Concepts and Best Practices

Causes of planned outages

The chart below is from Oracle’s High Availability Concepts and Best Practices

As we can see, the causes of outages include the following:

Unplanned

System-level failures: hosts, operating systems, middleware, databases, networks, power supplies, and peripherals
Data and media failures: human error, hard disk failure, and data corruption
Also: natural disasters, sabotage, and power supply problems

Planned

Daily tasks: backup, capacity planning, user and security management, back-end batch applications
Operations and maintenance: database maintenance, application maintenance, middleware maintenance, operating system maintenance, and network maintenance
Upgrades: database, application, middleware, operating system and network upgrades, including hardware upgrades

What really determines a high availability system

What do you see in these factors that affect a high availability SLA? If you are still only looking at the technology or the software design, you are only looking at the tip of the iceberg. Think about it more carefully: an SLA of five nines allows only about 5 minutes of unavailability in a year. If you fail only once a year, you have to recover within 5 minutes. Think about what that means.

Without scientific, disciplined software engineering management, advanced automated operations tools, and engineers with excellent technical skills, how could there possibly be a highly available system?

Yes, building a highly available system is a matter of rigorous, scientific engineering management, including but not limited to:

The level of software design, coding, testing, release, and software configuration management
The skill level of the engineers
The management and technical level of operations
The level of data center operations management
The level of management of the third-party services you depend on

At a deeper level, it comes down to respect for the science of engineering:

The attitude towards technology
The engineering culture of the company
The leadership's respect for engineering

So in the future, when someone talks about high availability in front of you, look not at their technical design but at their engineering capabilities, and at whether their company truly respects the science of engineering.

Other

Many non-technical people, and even some technical people, have told me that building an app or a website just means finding a few coders to write code. Only when the system stops working do they realize that they need people with strong technical skills. But even if you tell them a thousand times, they have a hard time understanding that writing code is engineering and that engineering is a science. In fact, many people who work in technology do not understand this either.

Many technical people will never understand why Code Review, automated operations, automated testing and continuous integration are worth doing. As I mentioned in my article on Code Review and how to do technology, many engineers, architects and experts at Alibaba, and even senior architects, do not have this awareness. Of course, they are not to blame, because experience shapes the way people think. Their experience is with consumer-grade systems, where the work is all about piling up features, and that really does not require any of this.

After reading this, let's all ask ourselves: can you still say that your system is highly available? ;-)

(End of article)