
Hello, I’m Wukong.

I. Background

No need to imagine an unusual scenario; this really happened: Station B (Bilibili) suddenly went down at 11 PM, and the site homepage returned a 404.

The mobile app could not load data either.

At 23:30, Station B put up a degraded page, replacing the 404 page with a friendlier error page.

But refreshing the page brought the 404 right back.

At 23:35 the home page could load data again, but clicking the Dynamics tab still returned 502.

Clicking on a video still returned 404.

After 02:00 on 2021-07-14, Station B began to recover gradually.

II. What Was the Cause

At 2 AM, Station B issued a notice: part of its servers had failed the previous night, making the site inaccessible. The technical team investigated and fixed the problem immediately, and services were gradually returning to normal. In response to rumors of a fire in the Station B building, the Shanghai fire department's official Weibo account denied it: there was no fire in the Station B building.

It seems Station B's high availability fell short of expectations. Let's look at what high availability means and the ideas behind cross-data-center deployment. The text follows:

III. What Exactly Is High Availability

Station B took a couple of hours to recover gradually. So, in the end, is Station B's system highly available?

First of all, "high availability" is a relative term. So what exactly is it?

3.1 High Availability

High Availability (HA) refers to the ability of a system to operate without failure.

Station B itself once gave a talk about its high-availability architecture:

Cloud.tencent.com/developer/a…

Key point: from now on, Station B interviews probably won't ask this kind of question; you know why.

A common high-availability solution is one master with many slaves. If the master node goes down, the system can quickly switch to a slave node, which takes over as master and continues to serve traffic. SQL Server master-slave and Redis master-slave architectures, for example, both achieve high availability this way: even if one server goes down, the service can keep running.
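As a hedged illustration of how a client survives such a switchover: with Redis master-slave plus Sentinel, the application does not hard-code the master's address but asks the sentinels which node is currently the master. A minimal sketch using the Jedis client (host addresses and the master name "mymaster" are placeholders):

```java
import java.util.HashSet;
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

public class SentinelFailoverDemo {
    public static void main(String[] args) {
        // Addresses of the Sentinel processes (placeholders for this sketch).
        Set<String> sentinels = new HashSet<>();
        sentinels.add("192.168.0.11:26379");
        sentinels.add("192.168.0.12:26379");
        sentinels.add("192.168.0.13:26379");

        // "mymaster" must match the master name configured in sentinel.conf.
        // The pool asks the sentinels who the current master is, so when the master
        // dies and a slave is promoted, new connections automatically follow the new master.
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("greeting", "hello");
            System.out.println(jedis.get("greeting"));
        }
    }
}
```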

I just said "quickly", which is a qualitative word. How do we quantify high availability?

3.2 Quantitative Analysis of High Availability

Two related concepts need to be mentioned: MTBF and MTTR.

MTBF (Mean Time Between Failures): the average interval between failures, i.e., how long the system runs from one failure to the next. The longer the interval, the more stable the system.

MTTR (Mean Time To Repair): the average time it takes to restore the system to normal after a sudden failure. The shorter the better; otherwise anxious users will flood you with complaints.

Availability formula: Availability = MTBF / (MTBF + MTTR) × 100%, i.e., the mean time between failures divided by the sum of the mean time between failures and the mean time to repair.

Normally we describe availability by a number of nines. A system my previous project team worked on was required to be down less than 5 minutes per year, which is the five-nines (99.999%) standard.
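To make the formula and the "nines" concrete, here is a small calculation sketch (plain Java, nothing project-specific): it computes availability from MTBF and MTTR, and the downtime budget per year for a given number of nines.

```java
public class AvailabilityCalc {

    // Availability = MTBF / (MTBF + MTTR)
    static double availability(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours);
    }

    // Allowed downtime per year (in minutes) for N nines,
    // e.g. 3 nines = 99.9% -> about 526 minutes (~8.76 hours) per year.
    static double downtimeBudgetMinutesPerYear(int nines) {
        double availability = 1 - Math.pow(10, -nines);
        double minutesPerYear = 365 * 24 * 60;
        return minutesPerYear * (1 - availability);
    }

    public static void main(String[] args) {
        // Example: one failure per month (MTBF ~730 h) that takes 1 h to repair.
        System.out.printf("availability = %.5f%n", availability(730, 1)); // ~0.99863
        for (int n = 1; n <= 6; n++) {
            System.out.printf("%d nines -> %.1f minutes of downtime per year%n",
                    n, downtimeBudgetMinutesPerYear(n));
        }
    }
}
```

Running it shows that five nines leaves a budget of only about 5.3 minutes of downtime per year, which matches the requirement mentioned above.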

3.3 A Quantitative Look at Station B

From 23:00 on 2021-07-13 to about 02:00 on 2021-07-14, the system gradually recovered. The outage lasted well over an hour. Measured against total annual failure time, that only reaches the three-nines standard; measured against daily downtime, it only reaches two nines, i.e., 99% availability, which is a bit sad…

3.4 One Nine and Two Nines

These are very easy to reach. A normal online system is not going to be down 15 minutes every day; if it were, nobody could use it.

3.5 Three Nines and Four Nines

The allowed failure time is short: roughly 1 hour (four nines) to 8 hours (three nines) per year. Reaching it requires work on architecture design, code quality, the operations (O&M) system, fault-handling runbooks, and so on. A key link is the operations system: when something goes wrong online, the operations team is the first to be notified, and depending on the severity, different operations staff handle it. For an accident as big as Station B's, the head of operations certainly had to step in personally.

In addition, you need to consider whether, during an emergency fault, you can manually degrade or switch and limit certain features. I ran into a problem before where the QR-code scanning feature failed; fortunately a switch had been built in advance to hide the QR-code feature, and users who wanted to scan a QR code could be guided to the offline swipe flow instead.

3.6 Five Nines

The annual failure time must be under 5 minutes, which is very short. Even with a strong operations team on duty around the clock, it is hard to recover within 5 minutes of receiving an alarm, so this level can only be reached through automated operations: the system itself must handle disaster recovery and recover automatically.

3.7 Six Nines

That is extremely tough: only about 32 seconds of downtime allowed per year.

Different systems actually target different numbers of nines. A system used only by a company's internal employees might require four nines, while a nationwide consumer service with a huge user base, such as Taobao or Ele.me, requires five nines or more. Yet even the biggest e-commerce systems have non-core businesses that can relax to four nines. It depends on each system's requirements and is a trade-off between cost, manpower, and importance.

IV. How to Achieve High Availability

Common high-availability techniques include failover, timeout control, rate limiting, isolation, circuit breaking, and degradation. Let me summarize them here.

You can also read this: "Double 11 Carnival: Drink This Bowl of 'Traffic Prevention and Control' Soup".

4.1 Rate Limiting

Control the request traffic and let only part of the requests through, so that the service never has to bear more load than it can handle.

There are three common algorithms: time window algorithm, leaky bucket algorithm, and token bucket algorithm.

4.1.1 Time Window

Time-window limiting comes in two flavors: fixed window and sliding window. For the details, see this article: "At the End of the Eastern Han Dynasty, They Took the 'Service Avalanche' to the Extreme".

Fixed time window:

Principle: count the total traffic within a fixed time window; if it exceeds the threshold, reject (limit) further requests.

Defect: it cannot limit traffic concentrated in a short burst (for example, around the boundary between two windows).
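A minimal single-machine sketch of the fixed-window counter (class and parameter names are made up; a distributed version would keep the counter in Redis):

```java
/** Fixed-window rate limiter: allow at most `limit` requests per `windowMillis`. */
public class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private long windowStart = System.currentTimeMillis();
    private int count = 0;

    public FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now;   // a new window starts: reset the counter
            count = 0;
        }
        if (count < limit) {
            count++;             // under the threshold: let the request through
            return true;
        }
        // Over the threshold: reject. Note the defect above: up to 2 * limit
        // requests can still slip through around the boundary of two windows.
        return false;
    }
}
```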

Sliding time window:

Principle: the length of the statistics window is fixed, but the window itself slides forward over time.

Defect: it cannot shape the traffic to make it smoother.

A schematic of the time window is shown here:

4.1.2 Leaky Bucket Algorithm

Principle: traffic flows out to the receiver at a fixed rate, no matter how fast requests arrive.

Defect: when burst traffic arrives, the excess requests are queued in the bucket, so their response time increases. This does not suit the low-latency requirements of most Internet services.

4.1.3 Token Bucket Algorithm

Principle: to limit access to N requests per second, put a token into the bucket every 1/N seconds; a request must take a token before it can proceed. In a distributed environment, Redis can serve as the token bucket. The schematic diagram is as follows:

The mind map is summarized here:
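To make the token-bucket idea concrete, here is a minimal single-machine sketch (in a distributed setup the token state would live in Redis, as mentioned above; in practice you would more likely use Guava's RateLimiter or a Redis + Lua script than hand-rolled code):

```java
/** Token bucket: refill `ratePerSecond` tokens per second, hold at most `capacity`. */
public class TokenBucketLimiter {
    private final double capacity;        // maximum tokens the bucket can hold
    private final double ratePerSecond;   // tokens added per second (the N in "N requests/second")
    private double tokens;                // tokens currently in the bucket
    private long lastRefillNanos = System.nanoTime();

    public TokenBucketLimiter(double capacity, double ratePerSecond) {
        this.capacity = capacity;
        this.ratePerSecond = ratePerSecond;
        this.tokens = capacity;            // start full, so a short burst can be absorbed
    }

    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1) {
            tokens -= 1;    // take a token: the request may proceed
            return true;
        }
        return false;       // no token available: reject (or queue) the request
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
    }
}
```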

4.2 Isolation

  • Each service runs as a separate system, and if one system has a problem, the other services will not be affected.

The conventional approach is to use one of two components: Sentinel or Hystrix.
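As a hedged sketch of thread-pool isolation with Hystrix (the command class, group name, and product logic below are made up for illustration): commands in the same group share a dedicated thread pool, so a slow product service can only exhaust that pool rather than all of the caller's threads.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class ProductQueryCommand extends HystrixCommand<String> {
    private final String productId;

    public ProductQueryCommand(String productId) {
        // Commands in the "ProductService" group run in their own thread pool,
        // isolated from calls to other downstream services.
        super(HystrixCommandGroupKey.Factory.asKey("ProductService"));
        this.productId = productId;
    }

    @Override
    protected String run() {
        // The real remote call to the product service would go here (hypothetical).
        return "product-detail-of-" + productId;
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, or the thread pool is saturated.
        return "default-product-detail";
    }
}

// Usage: String detail = new ProductQueryCommand("1001").execute();
```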

4.3 Failover

There are two types of failover:

  • Failover between fully equal peers: all nodes are equivalent.
  • Failover between unequal nodes: for example, an active node plus standby nodes, which play different roles.

In a peer system, every node handles both read and write traffic and stores no state; each node is a mirror of the others. If one node goes down, traffic can simply be routed to the other nodes according to the load-balancing weights.

In an unequal system, there is one active node and one or more standby nodes, which can be hot standby (the standby also serves online traffic) or cold standby (the standby is only a backup). If the active node goes down, the system must detect the failure and perform an active/standby switchover in time.

Detecting the failure of the primary node and electing a new one requires a distributed leader-election algorithm; common ones are Paxos and Raft. Detailed explanations can be found in these two articles:

Zhuge Liang vs. Pang Tong for distributed Paxos

Distributed Raft with GIFs
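Purely as a toy illustration of the active/standby idea (this is not Paxos or Raft and has no split-brain protection; real deployments rely on tools such as ZooKeeper, Keepalived, or Redis Sentinel), a standby node could watch heartbeats from the active node and promote itself roughly like this hypothetical sketch:

```java
import java.time.Duration;
import java.time.Instant;

/** Toy active/standby switchover: for illustration only, no real election or fencing. */
public class StandbyWatcher {
    private volatile Instant lastHeartbeat = Instant.now();
    private volatile boolean active = false;

    /** Called whenever a heartbeat from the current active node arrives. */
    public void onHeartbeat() {
        lastHeartbeat = Instant.now();
    }

    /** Invoked periodically by a scheduler to check whether the active node looks dead. */
    public void checkActive(Duration timeout) {
        if (!active && Duration.between(lastHeartbeat, Instant.now()).compareTo(timeout) > 0) {
            active = true;      // no heartbeat within the timeout: assume the active node is down
            promoteSelf();
        }
    }

    private void promoteSelf() {
        // Hypothetical: bind the virtual IP / update the registry so traffic is routed here.
        System.out.println("standby promoted to active");
    }
}
```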

4.4 Timeout Control

Timeout control means limiting how long a call between modules may take. If the timeout is set too long, say 30 seconds, then when a large number of requests time out, the calling threads stay blocked on those slow requests and earlier requests pile up unprocessed. If this lasts long enough, the failure cascades upward and forms an avalanche.

Take the most familiar scenario, placing an order: the client calls the order service to generate a pre-payment order; the order service calls the product service to look up the product being ordered; the product service calls the inventory service to check whether the product is in stock; only if there is stock can the pre-payment order be created.

How are avalanches caused?

  • First snowball: the inventory service becomes unavailable (for example, its responses time out). It piles up unprocessed requests and can no longer handle new ones.
  • Second snowball: the product service's requests are stuck waiting for results from the inventory service, so its own calls pile up unprocessed and the product service becomes unavailable.
  • Third snowball: because the product service is unavailable, the order service's calls to the product service also pile up, and the order service becomes unavailable.
  • Fourth snowball: because the order service is unavailable, the client cannot place orders; users retry, and even more order requests fail.

So setting reasonable timeouts is very important. Timeouts should be set for calls between modules, database requests, cache operations, and calls to third-party services.
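A hedged example using Java's built-in HttpClient (the URL and the timeout values are placeholders): both the connect timeout and the per-request timeout are bounded, so the caller's threads never sit blocked for 30 seconds on a slow downstream service.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1))   // fail fast if the connection cannot be established
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://inventory-service/stock/1001"))
                .timeout(Duration.ofMillis(500))         // overall per-request timeout
                .GET()
                .build();

        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        } catch (HttpTimeoutException e) {
            // The downstream call took too long: fail fast and fall back instead of blocking.
            System.out.println("inventory call timed out, returning degraded data");
        }
    }
}
```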

4.5 Circuit Breaking

Keyword: break the circuit to protect the caller. For example, service A calls service B. If requests take too long because of network problems, because B is down, or because B is slow to process them, and this happens several times within a certain period, the circuit to B can be opened: A stops calling B and immediately returns degraded data instead of waiting for B to execute. That way B's problems do not cascade into A.
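A minimal sketch of the circuit-breaker state machine (CLOSED → OPEN → HALF_OPEN), just to show the mechanism; production code would normally use Sentinel, Hystrix, or Resilience4j instead:

```java
import java.util.function.Supplier;

/** Minimal circuit breaker: opens after `failureThreshold` consecutive failures,
 *  then lets one probe call through after `openMillis` (the HALF_OPEN state).
 *  For brevity the whole call is synchronized; a real implementation would not
 *  hold the lock while the remote call runs. */
public class SimpleCircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get();            // breaker open: do not call B at all
            }
            state = State.HALF_OPEN;              // allow one probe request through
        }
        try {
            T result = remoteCall.get();
            failures = 0;
            state = State.CLOSED;                 // probe (or normal call) succeeded: close the breaker
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;               // too many failures: open the breaker
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}
```

Service A would then wrap every call to service B as something like `breaker.call(() -> callServiceB(), () -> degradedData)`.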

For a detailed explanation of circuit breakers, see: "At the End of the Eastern Han Dynasty, They Took the 'Service Avalanche' to the Extreme".

4.6 Degradation

Keyword: return degraded data. When the site is at peak traffic and server pressure spikes, some services and pages are degraded strategically according to the current business situation and traffic (the service is stopped, and all calls directly return degraded data). This relieves the pressure on server resources, keeps the core business running, and still gives most customers a correct response. Degraded data can simply be understood as a quick, canned response, with the front-end page telling the user something like "the server is currently busy, please try again later."
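A tiny sketch of a manual degradation switch, like the QR-code switch mentioned in section 3.5. The class and flag names are hypothetical; in practice the flag usually lives in a config center (Nacos, Apollo, ZooKeeper) so it can be flipped at runtime without redeploying:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RecommendService {
    // In production this flag would live in a config center so operations staff
    // can flip it during an incident without a redeploy.
    private static final AtomicBoolean DEGRADE_RECOMMEND = new AtomicBoolean(false);

    public static void setDegraded(boolean degraded) {
        DEGRADE_RECOMMEND.set(degraded);
    }

    public String recommend(String userId) {
        if (DEGRADE_RECOMMEND.get()) {
            // Degraded data: a quick, canned response instead of the expensive call.
            return "{\"items\":[],\"msg\":\"The server is currently busy, please try again later.\"}";
        }
        return callRealRecommendEngine(userId);   // normal path
    }

    private String callRealRecommendEngine(String userId) {
        // Hypothetical expensive downstream call.
        return "{\"items\":[\"...\"]}";
    }
}
```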

  • What do circuit breaking and degradation have in common?
    • Both aim to keep the majority of services in the cluster available and reliable, and to prevent core services from crashing.
    • To the end user, both look like some feature being temporarily unavailable.
  • What is the difference between circuit breaking and degradation?
    • Circuit breaking is triggered by faults in the downstream service: the caller proactively cuts off the call to protect itself.
    • Degradation stops some otherwise-normal services based on global considerations, in order to free up resources.

V. Remote Multi-Active

5.1 Multi-Data-Center Deployment

Meaning: multiple copies of the services are deployed in data centers (IDCs) in different regions; they share the same business data, and all of them can serve user traffic.

When the services in one data center go down, traffic can be switched to a data center in another region.

Now that there is more than one copy of the services, should there also be more than one copy of the database? There are essentially two schemes: share the database, or don't.

  • Share one database: all data centers use the database located in a single data center.

  • Do not share: each data center has its own database, and the databases synchronize with each other. This scheme is more complex to implement.

Either way, there is the problem of data-transmission latency across data centers:

  • Dedicated line between data centers in the same city: latency about 1 ms to 3 ms.
  • Dedicated line between data centers in different cities: latency about 50 ms.
  • Data centers in different countries: latency about 200 ms.

5.2 Same-City Dual-Active

The core idea of high-performance same-city dual-active is to avoid cross-data-center calls:

Keep RPC calls within one data center: RPC services in different data centers register under different service groups in the registry, and each RPC client subscribes only to the service group of its own data center, so RPC calls stay inside the data center (a rough sketch follows at the end of this subsection).

Keep cache queries within one data center: if the local cache misses, load the data from the database and populate the cache. The cache is also deployed in active/standby mode, and data is updated across the data centers.

Keep database queries within one data center: the databases are also in active/standby mode, and each data center reads locally, the same approach as with the cache.
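A hedged, registry-agnostic sketch of the "subscribe only to your own data center" idea: each instance registers with an IDC tag, and the client filters candidates down to its own IDC before load balancing. The classes below are invented for illustration; a real project would configure this in its RPC framework or registry (service groups, zone-aware routing, and so on).

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

/** A registered service instance tagged with the data center (IDC) it lives in. */
record ServiceInstance(String host, int port, String idc) {}

public class SameIdcRouter {
    private final String localIdc;   // e.g. "shanghai-dc1", read from local config

    public SameIdcRouter(String localIdc) {
        this.localIdc = localIdc;
    }

    /** Prefer instances in the local IDC; fall back to all instances only if none are local. */
    public ServiceInstance choose(List<ServiceInstance> allInstances) {
        List<ServiceInstance> local = allInstances.stream()
                .filter(i -> localIdc.equals(i.idc()))
                .collect(Collectors.toList());
        List<ServiceInstance> candidates = local.isEmpty() ? allInstances : local;
        // Simple random load balancing among the candidates.
        return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
    }
}
```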

5.3 Remote Multi-Active

Same-city dual-active cannot provide city-level disaster recovery, so you also need to consider remote multi-active, i.e., deployments in different cities.

For example, if the servers in Shanghai go down, the servers in Chongqing can take over. The two locations should not be too close to each other, because a natural disaster in one place might otherwise affect the other as well.

The core idea is the same as same-city dual-active: avoid cross-data-center calls. However, since cross-city call latency is much higher than same-city latency, data synchronization becomes the interesting problem. Two options:

  • Master-slave replication provided by the storage systems themselves; MySQL and Redis support this natively. However, with large data volumes the performance is poor.
  • Asynchronous replication based on a message queue: every data operation is written to the message queue as a message, and the other data center consumes the message and applies the operation to its own storage components (see the sketch after this list).
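A hedged sketch of the message-queue option using Kafka's Java producer (the broker address, topic name, and change-event format are made up): the local data center publishes every data change as a message, and a consumer in the remote data center applies it to its own storage.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CrossIdcReplicator {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address of the local data center (placeholder).
        props.put("bootstrap.servers", "kafka.shanghai.local:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: don't lose the change before the remote side has a chance to consume it.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every write in the local DC is also published as a change event;
            // a consumer in the Chongqing DC reads this topic and applies the
            // same operation to its own MySQL/Redis.
            String changeEvent = "{\"table\":\"order\",\"op\":\"INSERT\",\"orderId\":\"1001\"}";
            producer.send(new ProducerRecord<>("order-change-events", "1001", changeEvent));
        }
    }
}
```

The remote side would run a standard KafkaConsumer loop that replays each event against its local storage; the price of asynchronous replication is that the remote copy lags behind by the queueing delay.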

5.4 Two Places, Three Centers

This concept has been mentioned many times in the industry.

Two places: local and remote.

Three centers: local data center, same-city data center, and remote data center.

These are just the same-city dual-active and remote multi-active ideas described above, expressed in terms of data centers. The principle is shown in the figure below:

I learned a lot from the Station B incident myself. Consider this post a brick thrown out to attract jade; feel free to leave a comment and discuss.

About the author: Wukong, 8 years of front-line Internet development and architecture experience, explains distributed systems, architecture design, and core Java technologies through stories. Columnist of "JVM Performance Optimization in Action", author of the open-source "Spring Cloud in Action: PassJava" project, and independent developer of a PMP exam-practice mini program.

I'm Wukong. Keep working hard to get stronger and go Super Saiyan!

Shoulders of giants:

Cloud.tencent.com/developer/a…

Mp.weixin.qq.com/s/Vc4N5RcsN…

Time.geekbang.org/column/arti…

Time.geekbang.org/column/arti…

www.passjava.cn