Hello, I’m Kun Ge

Today we are going to talk about high availability, one of the “three highs” of Internet systems (high concurrency, high performance, high availability). After reading this article, I believe most of your confusion about high availability design will be resolved.

Preface

The main purpose of high availability (HA) is to ensure “business continuity,” meaning that the service is always normal (or nearly normal) in the eyes of users. This article looks at high availability mainly from the architecture perspective, because doing high availability well starts with good architecture design. The first step is usually a layered approach: split a huge IT system into independent layers such as the application layer, middleware, and the data storage layer, and then split each layer into finer-grained components. The second step is to have each component provide services externally; after all, no component exists in isolation, and they only make sense when they cooperate with one another to provide services to the outside world.

To ensure the high availability of an architecture, it is necessary to ensure that all components in the architecture and their exposed services are designed for high availability. Any component or service that is not highly available means that the system is at risk.

So how do we design all of these components for high availability? In fact, no component can be highly available without “redundancy” and “automatic failover.” As the saying goes, a single point is the enemy of high availability, so components are generally deployed as clusters (at least two machines); that way, if one machine fails, the other machines in the cluster can take over at any time. That is redundancy. A simple calculation: assuming the availability of a single machine is 90%, the availability of a cluster of two machines is 1 - 0.1 × 0.1 = 99%, so obviously the more redundant machines, the higher the availability.
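Written as a general formula (a small sketch that assumes the machines fail independently of one another):

```latex
% Availability of a cluster of n redundant machines, each with availability A
A_{\text{cluster}} = 1 - (1 - A)^{n}

% Example from the text: two machines, each 90% available
A_{\text{cluster}} = 1 - (1 - 0.9)^{2} = 1 - 0.01 = 0.99 = 99\%
```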

But redundancy alone is not enough. If a machine has a problem, switching over manually is time-consuming, laborious, and error-prone, so we also need the help of a third-party tool (an arbiter) to make the failover “automatic” and achieve near-real-time failover. Near-real-time failover is what high availability is really about.

What kind of system can be called highly available? The industry generally measures system availability by the number of nines, as follows:

| Availability level | Availability | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
|---|---|---|---|---|---|
| Unavailable | 90% | 36.5 days | 73 hours | 16.8 hours | 144 minutes |
| Basically available | 99% | 87.6 hours | 7.3 hours | 1.68 hours | 14.4 minutes |
| Relatively high availability | 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
| High availability | 99.99% | 52.56 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| Extremely high availability | 99.999% | 5.26 minutes | 26.28 seconds | 6.06 seconds | 0.86 seconds |

Two nines is generally easy to achieve, but 14.4 minutes of downtime per day already seriously affects business; a company like that would go under sooner or later. The giants generally require four nines, and more demanding businesses need five nines or more. For example, if a single computer failure brought all trains to a halt, the daily lives of tens of thousands of people would be disrupted, and that calls for more than five nines.

Let’s take a look at how each component of the architecture can achieve high availability with the help of redundancy and automatic failover.

Anatomy of Internet Architecture

At present, most Internet companies adopt a microservice architecture. A common architecture is as follows:

As you can see, the architecture is divided into the following layers

  1. Access layer: mainly F5 hardware or LVS software, which carries the entry point for all traffic
  2. Reverse proxy layer: Nginx, mainly responsible for distributing traffic by URL, rate limiting, etc.
  3. Gateway: mainly responsible for flow control, risk control, protocol conversion, etc.
  4. Site layer: mainly responsible for calling basic services such as membership and promotion, assembling the data into JSON and the like, and returning it to the client
  5. Basic services: in fact, both the site layer and the basic service layer are microservices and are peers of each other; the basic services, however, are infrastructure and can be invoked by the upper business layers
  6. Storage layer: the DB, such as MySQL, Oracle, etc.; it is generally accessed by the basic services, which return the data to the site layer
  7. Middleware: ZK, ES, Redis, MQ, etc., mainly used to accelerate data access and provide other supporting functions. We will briefly introduce each of them later in this article

As mentioned above, to make the overall architecture highly available, the components at each layer must be highly available. Let’s take a look at how each layer achieves this.

Access layer & Reverse proxy layer

The high availability of both layers has something to do with Keepalived, so let’s look at it together

Note that only the master is working (i.e., the VIP takes effect on the master); the BACKUP takes over the master’s work only when the master goes down. Keepalived is installed on both the master and the backup machine, and once started the two check each other’s health via heartbeat. As soon as the backup detects that the master is down, the VIP address (115.204.94.139 in the figure) takes effect on the backup instead. This is how the high availability of LVS is achieved.

Keepalived’s health checks are mainly done by sending ICMP packets or by connecting to and scanning TCP ports, which can also be used to probe the ports Nginx exposes; if an Nginx instance becomes abnormal, Keepalived can detect it and remove it from the list of servers that LVS forwards to.
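Keepalived itself is driven by its configuration file rather than application code, but the TCP-port style of check described above can be sketched conceptually. Here is a minimal Java sketch (the backend address, port, and thresholds are made-up values for illustration):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Conceptual sketch of a TCP-port health check like the one described above:
// try to open a TCP connection; after several consecutive failures, treat the
// backend (e.g. an Nginx instance) as down and remove it from the forward list.
public class TcpHealthCheck {
    public static void main(String[] args) throws InterruptedException {
        String host = "192.168.0.10";   // hypothetical Nginx backend
        int port = 80;
        int failures = 0;

        while (true) {
            if (isPortAlive(host, port, 2000)) {
                failures = 0;                 // healthy, reset the counter
            } else if (++failures >= 3) {     // three misses in a row -> failover
                System.out.println(host + ":" + port
                        + " is down, remove it from the upstream list");
                break;
            }
            Thread.sleep(1000);               // check once per second
        }
    }

    static boolean isPortAlive(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```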

Keepalived is thus the third-party arbiter that gives both LVS and Nginx high availability. When a failure occurs, Keepalived can also send an email to the developers concerned so that they are notified in time, which is very convenient. As we will see below, it can also be used to make MySQL highly available.

Microservices

Next, let’s look at the “gateway,” the “site layer,” and the “basic service layer.” These three are generally the components of a microservice architecture, and they are usually supported by an RPC framework such as Dubbo, so for the microservices to be highly available, the RPC framework must also provide the ability to support high availability. Let’s take Dubbo as an example to see how it achieves this.

Let’s start with a brief look at the basic architecture of Dubbo

First, the Provider (service provider) registers its services with the Registry (such as ZK or Nacos), and the Consumer (service consumer) subscribes to and pulls the Provider list from the Registry. Having obtained the list, the Consumer chooses one Provider according to its load-balancing policy and sends requests to it. When one of the Providers becomes unavailable (it goes offline, is blocked by GC, etc.), the Registry detects this (via heartbeat) and pushes the change to the Consumer, so the Consumer removes it from its list of available Providers, achieving automatic failover. Here the Registry plays the same role as Keepalived.
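On the Consumer side, Dubbo lets you declare this failover behaviour explicitly. A minimal sketch, assuming a hypothetical UserService interface (cluster, retries, loadbalance, and timeout are standard attributes of Dubbo’s @DubboReference annotation):

```java
import org.apache.dubbo.config.annotation.DubboReference;

// Hypothetical service interface exposed by the Provider.
interface UserService {
    String getName(long id);
}

// Consumer-side declaration; in a real project this field lives in a
// Spring bean inside a Dubbo application.
public class UserClient {

    // cluster = "failover": if the selected Provider fails, Dubbo retries the
    // call on another Provider from the list pushed by the Registry.
    // loadbalance = "roundrobin": spread requests over the redundant Providers.
    @DubboReference(cluster = "failover", retries = 2,
                    loadbalance = "roundrobin", timeout = 3000)
    private UserService userService;

    public String findUserName(long id) {
        return userService.getName(id);   // fails over transparently on error
    }
}
```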

Middleware

Let’s take a look at how middleware such as ZK and Redis achieve high availability

ZK

In the microservices section above we mentioned the registry. Let’s take ZK (ZooKeeper) as an example and see how its high availability is achieved. First, its overall architecture is as follows

The main roles of Zookeeper are as follows

  • Leader: there is only one Leader in a cluster, and it performs the following functions
    1. It is the sole scheduler and handler of transaction (write) requests, ensuring that transactions are processed in order across the cluster. All write requests received by Followers are forwarded to the Leader for execution, which guarantees transactional consistency
    2. It is the scheduler of the servers in the cluster: after processing a transaction request, it broadcasts the data to every Follower and counts the successful writes. If more than half of the Followers have written successfully, the Leader considers the write committed and notifies all Followers to commit it. This ensures that writes are not lost even if the cluster crashes and recovers, or is restarted
  • Follower:
    1. Processes clients’ non-transaction requests and forwards transaction requests to the Leader
    2. Participates in voting on Proposal requests (a Proposal initiated by the Leader needs approval from more than half of the servers before the Leader is notified to commit the data)
    3. Votes in Leader elections

Voice-over: ZooKeeper 3.3 added an Observer role, but it is not very relevant to the ZK high availability discussed here, so it is omitted for simplicity

As you can see, since there is only one Leader, the Leader is obviously a single point of risk. How does ZK solve this? First, the Followers stay connected to the Leader through a heartbeat mechanism. If the Leader has a problem (it goes down, or becomes unresponsive because of a full GC or some other reason), the Followers can no longer sense its heartbeat, conclude that the Leader has failed, and hold an election, finally choosing a new Leader from among the Followers (this mainly relies on ZAB, the ZooKeeper Atomic Broadcast protocol, a consistency protocol designed specifically for ZK that supports crash recovery). The details of the election are not the focus of this article, so I won’t go into them here.

Besides ZAB, consensus protocols such as Paxos and Raft are commonly used in the industry and can also be used for Leader election. In a distributed architecture, these protocols take on the role of the “third party,” i.e., the arbiter, which carries out automatic failover.
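ZK can also play the arbiter role for our own services. As a concrete illustration, here is a minimal sketch of leader election among application instances using Apache Curator’s LeaderLatch (the ZK connect string, path, and instance id are assumptions for illustration):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Each application instance runs this; ZK guarantees that only one of them
// holds leadership at a time. If the leader crashes, its ephemeral node
// disappears and another instance is elected automatically.
public class LeaderElectionDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",           // hypothetical ZK ensemble
                new ExponentialBackoffRetry(1000, 3));  // retry policy
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/demo/leader", "instance-1");
        latch.start();
        latch.await();   // blocks until this instance becomes the leader

        System.out.println("I am the leader now, taking over the work...");
        // ... do leader-only work; if this process dies, its ephemeral node is
        // removed and another instance's await() returns, i.e. automatic failover.
    }
}
```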

Redis

The high availability of Redis needs to be looked at according to its deployment mode, which can be divided into “master-slave mode” and “Cluster sharding mode”

Master-slave mode

Take a look at the master-slave pattern, which is structured as follows

Master-slave mode means one master and one or more slaves. The master node handles reads and writes and synchronizes the data to the slave nodes, and clients can also send read requests to the slave nodes, which relieves pressure on the master. But, just like ZK, since there is only one master node there is a single point of risk, so a third-party arbiter must be introduced to decide whether the master has gone down and, once it has, to quickly promote one of the slaves to master. In Redis this third-party arbiter is generally called a “sentinel.” Of course, the sentinel process itself may die too, so to be safe multiple sentinels need to be deployed (i.e., a sentinel cluster)

The sentinels exchange information about whether the master is offline via a gossip protocol and, after agreeing that the master is down, use a Raft-style election to decide who carries out the failover and which slave becomes the new master
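From the client’s point of view the failover is transparent: the client asks the sentinels who the current master is. A minimal sketch with the Jedis client (the sentinel addresses and the master name “mymaster” are assumptions that must match the sentinel configuration):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

import java.util.HashSet;
import java.util.Set;

public class SentinelClientDemo {
    public static void main(String[] args) {
        // Hypothetical sentinel addresses; "mymaster" must match sentinel.conf.
        Set<String> sentinels = new HashSet<>();
        sentinels.add("192.168.0.11:26379");
        sentinels.add("192.168.0.12:26379");
        sentinels.add("192.168.0.13:26379");

        // The pool asks the sentinels who the current master is; after a
        // failover it automatically reconnects to the newly promoted master.
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("greeting", "hello");
            System.out.println(jedis.get("greeting"));
        }
    }
}
```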

Cluster (sharded) mode

The master-slave model seems perfect, but there are several problems

  1. The write pressure on the primary node is difficult to reduce. Because only one primary node can receive write requests, in the case of high concurrency, if the write requests are very high, the network card of the primary node may be filled up, causing the primary node to be unable to provide external services
  2. The storage capacity of the primary node is limited by the capacity of the single node. Because both the primary and secondary nodes store the full cache data, the cache data is likely to increase as the service volume increases until the storage bottleneck is reached
  3. Synchronization storm: Data is synchronized from the master to the slave. If there are multiple slave nodes, the master node will be under great pressure

To solve these problems with master-slave mode, the sharded cluster came into being. A sharded cluster partitions the data into shards, and each shard is read and written by its own master node, so multiple masters share the write pressure; and since each node stores only part of the data, the single-node storage bottleneck is solved as well. Note, however, that each master node is still a single point of failure, so high availability must also be implemented for every master. The overall architecture is as follows

The principle is simple: after the Proxy receives a Redis read/write command from the client, it first computes a value from the key; each value is called a slot, and Redis has 16,384 slots in total. If the value falls within the range of slots owned by a given master, the request is routed to that master, so each master node handles only part of the Redis data. To avoid each master being a single point, it is paired with several slave nodes; when a master goes down, the cluster uses a Raft-style election to promote one of its slaves to master
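The slot routing boils down to slot = CRC16(key) mod 16384, and cluster-aware clients handle it for you. A minimal sketch with JedisCluster (the node addresses and key are made up for illustration):

```java
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.HashSet;
import java.util.Set;

public class ClusterClientDemo {
    public static void main(String[] args) {
        // A few hypothetical entry nodes; the client discovers the rest of the
        // cluster and the slot -> master mapping from them.
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("192.168.0.21", 6379));
        nodes.add(new HostAndPort("192.168.0.22", 6379));
        nodes.add(new HostAndPort("192.168.0.23", 6379));

        try (JedisCluster cluster = new JedisCluster(nodes)) {
            // The client computes slot = CRC16("user:1001") % 16384 and sends the
            // command to the master responsible for that slot; after a failover
            // it automatically discovers the newly promoted master.
            cluster.set("user:1001", "kun ge");
            System.out.println(cluster.get("user:1001"));
        }
    }
}
```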

ES

In ES, data exists in the form of shards. As shown in the figure below, index data in a node is stored in three shards

But with only one node there is obviously the same single-point problem as in the Redis master-slave architecture: if this node dies, ES dies with it, so we clearly need to create multiple nodes

Once there are multiple nodes, the advantages of sharding show: shards can be distributed across the nodes, greatly improving the horizontal scalability of the data, and every node can accept read and write requests, which spreads the load and avoids single-point read/write pressure

ES’s write mechanism is a little different from the master-slave architecture of Redis and MySQL (the latter two write directly to the master node, but ES does not), so here is a little explanation of how ES works

First, let’s talk about how the nodes work. Nodes are divided into the Master Node and the other (non-master) nodes. The master node’s main responsibility is cluster-level operations: it manages cluster changes, such as creating or deleting indexes, and tracks which nodes are part of the cluster. There is only one master node, generally elected through a Bully-style algorithm; if the master becomes unavailable, the other nodes elect a new one with the same algorithm, achieving high availability of the cluster. Any node can receive read and write requests, which balances the load

Shards are divided into primary shards (such as P0, P1, and P2) and replica shards (such as R0, R1, and R2). Primary shards are responsible for writes, so although any node can receive a write request, if that node does not hold the primary shard the data belongs to, it forwards the request to the node where the primary shard is located; after the data is written to the primary shard, the primary shard replicates it to the replica shards on the other nodes. Take a cluster with two replicas as an example
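The number of primary shards and replicas is set when the index is created. A minimal sketch, assuming the Elasticsearch 7.x high-level REST client and matching the three-primary, two-replica example above (the host and index name are made up):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class CreateIndexDemo {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("192.168.0.31", 9200, "http")))) {

            // 3 primary shards (P0, P1, P2), each with 2 replicas spread over
            // other nodes, as in the example in the text.
            CreateIndexRequest request = new CreateIndexRequest("orders");
            request.settings(Settings.builder()
                    .put("index.number_of_shards", 3)
                    .put("index.number_of_replicas", 2));

            client.indices().create(request, RequestOptions.DEFAULT);
            // A write is routed to shard hash(_routing) % number_of_shards,
            // then replicated from the primary shard to its replicas.
        }
    }
}
```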

MQ

The idea of using data sharding in ES to improve availability and horizontal scalability is also applied in the architecture of other components. Let’s take Kafka as an example of an MQ and look at how data sharding is used there

As shown in the Kafka cluster above, the partitions of each Topic are distributed across the message servers, so when a partition’s leader becomes unavailable, a new leader can be elected from the other followers to continue serving. Unlike sharding in ES, however, the follower partitions are cold standby: under normal circumstances they do not serve requests; only after the leader dies and a new leader is elected from among the followers does that replica start serving traffic
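The number of partitions and the replication factor are specified when a topic is created. A minimal sketch using Kafka’s AdminClient (the broker addresses, topic name, and counts are assumptions):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker addresses.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "kafka1:9092,kafka2:9092,kafka3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread across the brokers, each with 3 replicas:
            // the leader replica serves traffic, the followers are standby, and
            // one of them is elected leader if the current leader's broker dies.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```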

Storage layer

Let’s take a look at the last layer, the storage layer (DB). Here we take MySQL as an example and briefly discuss its high availability design. If you look back at the designs above, you will find that MySQL’s high availability is much like Redis’: it also comes in master-slave and sharded (what we often call sharded databases and tables) architectures

The master-slave setup is similar to LVS and generally uses Keepalived for high availability, as shown below

If the master goes down, Keepalived detects it in time, the slave is promoted to master, and the VIP “drifts” over to the new master

When the data volume is large, the database and tables need to be sharded, which means there are multiple masters, just like in a Redis sharded cluster, and each master needs its own slaves, as shown below

Some readers have asked why each master still needs slaves once the databases and tables are sharded. Now it should be clear: the point is not to improve read/write performance, but mainly to achieve high availability
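To make the sharding idea concrete, here is a minimal sketch of routing a query to a database shard by user ID (the shard count, naming, and JDBC URLs are assumptions; real projects usually delegate this to middleware such as ShardingSphere or MyCat):

```java
// Minimal sketch of application-level shard routing, assuming 4 databases
// (order_db_0 .. order_db_3), each fronted by its own master plus slaves.
public class ShardRouter {

    private static final int DB_COUNT = 4;

    // Pick the database shard for a user; the same rule must be used for
    // both writes and reads so that a user's data always lands in one shard.
    static String routeDatabase(long userId) {
        long shard = userId % DB_COUNT;
        return "jdbc:mysql://order-db-" + shard + "-vip:3306/order_db_" + shard;
    }

    public static void main(String[] args) {
        // user 1001 -> shard 1, user 2048 -> shard 0, etc.
        System.out.println(routeDatabase(1001L));
        System.out.println(routeDatabase(2048L));
    }
}
```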

Conclusion

Having looked at high availability design at the architecture level, I believe you now have a deeper understanding of the core ideas of high availability: “redundancy” and “automatic failover.” Looking at the components above, you will find that the redundancy is almost always arranged around a single master. Why not multiple masters? Because in a distributed system it is very difficult to guarantee data consistency with multiple masters; especially when there are many nodes, synchronizing data between them becomes a big problem. So most components adopt the single-master form and synchronize between the master and multiple slaves. Choosing a single master is essentially a technical trade-off

Is the entire architecture truly dependable once each component is highly available? No, this is only the first step; production holds many unexpected challenges for our systems, for example

  1. Instantaneous traffic: for example, the traffic surge caused by a flash sale may exceed the system’s carrying capacity and affect core links such as everyday transactions, so systems need to be isolated, e.g., by deploying an independent cluster for flash sales
  2. Security problems: for example, DDoS attacks, aggressive crawler requests, or even someone deleting the database and running away can leave the system unable to serve users
  3. Code problems: for example, a code bug causes a memory leak, or a full GC makes the system unresponsive
  4. Deployment issues: during a release, running services must not simply be stopped; a graceful shutdown and a smooth rollout are required
  5. Third-party problems: for example, our services depend on third-party systems, and a third party’s problems may affect our core business
  6. Force majeure: for example, a data center loses power, so proper disaster recovery and multi-site active-active deployments are needed. Our company’s service was once unavailable for four hours because of a data center failure, causing heavy losses

Therefore, in addition to high availability at the architecture level, we also need measures such as system isolation, rate limiting, circuit breaking, risk control, degradation, and restricting operator permissions on critical operations to keep the system available.
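To give one concrete example of these measures, here is a minimal rate-limiting sketch using Guava’s RateLimiter (the 100-requests-per-second threshold is made up; production systems often limit traffic at the gateway or with a distributed limiter instead):

```java
import com.google.common.util.concurrent.RateLimiter;

public class RateLimitDemo {
    // Token-bucket limiter: at most about 100 permits per second.
    private static final RateLimiter LIMITER = RateLimiter.create(100.0);

    public static String handleRequest(String request) {
        if (!LIMITER.tryAcquire()) {
            // Shed excess traffic instead of letting it overwhelm the system.
            return "429 Too Many Requests, please retry later";
        }
        return "processed: " + request;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println(handleRequest("order-" + i));
        }
    }
}
```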

Degradation deserves a special mention, since it is a very common measure for keeping a system available. A few examples

  1. We once integrated with a third-party funding provider, and when its loan function failed for reasons of its own, users could not borrow money and would panic. To avoid this, when a user applied for a loan from that third party we returned copy along the lines of “to raise your limit, the funding system is being upgraded,” which avoided customer complaints
  2. In streaming media, when a serious problem occurs while users are watching a live stream, many companies’ first choice is not to dig through logs to troubleshoot but to automatically lower the bit rate for users: a stream that freezes until it is unwatchable is obviously more painful to users than degraded picture quality
  3. During the peak of Double 11, we suspended non-core functions such as user registration and login to ensure that core flows such as placing orders ran smoothly

So we need unit tests, full-link stress tests, and so on to uncover problems in advance, and we also need to monitor CPU usage, thread counts, and so on, triggering alarms when they reach the thresholds we set so that problems can be found and fixed in time (our company has a post-mortem of a similar production incident that you can look at). In addition, even with unit tests, potential bugs in the code may still cause online problems, so we impose a release freeze during critical periods (such as Singles’ Day).

In addition, we need to be able to locate problems quickly and roll back after an incident, which requires recording the time and author of every release, and not only project releases but also configuration center changes

Voice-over: above is our release record; you can see code changes, rollbacks, and so on, so that if a problem occurs it can be rolled back with one click

Finally, let’s summarize the common approaches to high availability with a graph

Welcome to follow my public account, code sea, so we can make progress together
