Source | Alibaba Cloud official public account

Preface

Because of the complexity of the external environment and the unreliability of hardware, the high availability of Internet services faces great challenges, and there is no shortage of cases where major Internet companies' services became unavailable due to network disconnection, power failure, and other accidents. At a small scale, business unavailability brings economic losses and hurts a company's reputation; for national-scale applications such as WeChat and Alipay, it affects the national economy and people's livelihood. Faced with unavoidable natural and man-made disasters, building a disaster recovery architecture has become an urgent need for digital enterprises.

In December 2020, Alibaba Cloud Application High Availability Service (AHAS) released a new functional module, AHAS-MSHA, a multi-active disaster recovery (DR) architecture solution that evolved from Alibaba's e-commerce business. In this article, we first introduce some important concepts in the DR field, and then use an e-commerce microservice as an example to share high availability practices based on AHAS-MSHA and AHAS-Chaos, helping businesses implement DR architectures.

Disaster recovery and evaluation indicators

1. What is disaster recovery (DR)?

Disaster recovery (DR) means building two or more systems with the same functions in locations far apart from each other. The systems monitor each other's health and can switch functions between them: when one system stops working because of an accident (such as fire, flood, earthquake, or human sabotage), the whole application can switch to another site so that its functions continue to work normally.

2. How to evaluate DR capability?

A disaster recovery (DR) system aims to prevent services from being interrupted when a disaster occurs. How do you evaluate and quantify DR capability? This section describes the DR capability evaluation metrics commonly used in the industry.

  • Recovery Point Objective (RPO)

RPO is the point in time to which the system and its data must be recovered after a disaster occurs. It indicates the maximum amount of data loss the system can tolerate: the less data loss the system can tolerate, the smaller the RPO. For example, an RPO of 5 minutes means the business can tolerate losing at most the last 5 minutes of data.

  • Recovery Time Objective (RTO)

RTO is the time it takes for the information system or business functions to be restored after a disaster occurs. It indicates the maximum duration of service interruption the system can tolerate: the more urgent the service, the smaller the RTO. For example, if a business requires that service be restored within 30 minutes of a disaster, its RTO is 30 minutes.

AHAS-MSHA

1. Introduction

Multi-Site High Availability (MSHA) is a multi-active DR architecture solution (solution = technical product + consulting service + ecosystem partners). MSHA decouples business recovery from fault recovery, supports rapid business recovery in fault scenarios, and improves the stability of enterprise DR.

1) Product architecture

MSHA adopts a multi-site active-active (geo-redundant) DR architecture. Its core idea is "redundancy with isolation": each data center that carries a complete copy of the business logic is called a unit. MSHA keeps business traffic closed within a unit and isolated between units, so the blast radius of a fault is contained within one unit. This not only addresses disaster recovery and business continuity, but also enables remote capacity expansion.

2) Comparison of mainstream DR architectures

2. Functions and features

  • Quick fault recovery

Following the principle of "recover first, locate later", MSHA provides DR switchover and traffic switching capabilities that, on the premise of protecting data, decouple business recovery time from fault recovery time and ensure business continuity.

  • Remote capacity expansion

Rapidly growing businesses are constrained by the limited resources of a single region and run into bottlenecks such as the database. MSHA enables rapid expansion of business units into other regions or data centers.

  • Traffic allocation and error correction

MSHA provides layer-by-layer traffic error correction and verification from the access layer down to the application layer. Requests that do not match the traffic routing rules are forwarded to the correct unit, keeping the blast radius of a fault within one unit.

  • Dirty-write prevention

Writing data in multiple units can cause dirty writes that overwrite each other. MSHA provides write protection when traffic flows into the wrong unit, and write/update protection during the data synchronization delay that follows a traffic switch.
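
To make the idea concrete, here is a minimal sketch of wrong-unit write protection. This is only an illustration of the concept, not MSHA's actual implementation; the unit names and the ownership rule in it are assumptions.

```java
// Illustrative sketch of wrong-unit write protection; not MSHA's actual implementation.
// The unit names and the ownership rule below are hypothetical.
import java.util.function.LongFunction;

public class UnitWriteGuard {

    enum Unit { HANGZHOU, BEIJING }

    private final Unit currentUnit;                // the unit this instance runs in
    private final LongFunction<Unit> ownerOfUser;  // resolves which unit owns a given userId

    public UnitWriteGuard(Unit currentUnit, LongFunction<Unit> ownerOfUser) {
        this.currentUnit = currentUnit;
        this.ownerOfUser = ownerOfUser;
    }

    /** Rejects writes that landed in the wrong unit, so two units never dirty-write the same data. */
    public void assertWritable(long userId) {
        Unit owner = ownerOfUser.apply(userId);
        if (owner != currentUnit) {
            throw new IllegalStateException("Write rejected: userId " + userId
                    + " is owned by unit " + owner + " but the request landed in " + currentUnit);
        }
    }

    public static void main(String[] args) {
        // Hypothetical ownership rule: even userIds belong to Hangzhou, odd ones to Beijing.
        UnitWriteGuard guard = new UnitWriteGuard(Unit.BEIJING,
                userId -> (userId % 2 == 0) ? Unit.HANGZHOU : Unit.BEIJING);
        guard.assertWritable(1001); // owned by Beijing, write allowed
        guard.assertWritable(1002); // owned by Hangzhou, throws IllegalStateException
    }
}
```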

3. Application scenarios

MSHA is suitable for building a multi-active DR architecture in the following typical business scenarios:

  • Read-heavy, write-light businesses

    • Business scenarios: typical examples are information and shopping guide services (such as product browsing, news, and information feeds).
    • Data characteristics: the core of the business is the read link, and temporary unavailability of the write link is acceptable.
  • Transaction-record businesses

    • Business scenarios: typical examples are e-commerce transactions and billing services (such as orders and call records).
    • Data characteristics: data can be sharded along a certain dimension, and eventual consistency of data is acceptable.

Business DR Practice

The following uses an e-commerce microservice example to describe how to build a DR architecture in different scenarios.

1. E-commerce business background

1) Business applications

  • Frontend: the portal web application, responsible for interacting with users
  • Cartservice: the shopping cart application; records users' cart data, using self-built Redis
  • Productservice: the product application; provides product and inventory services, using RDS MySQL
  • Checkoutservice: the order application; generates purchase orders for the items in the shopping cart, using RDS MySQL

2) Technology stack

  • Spring Boot
  • RPC framework: Spring Cloud, with a self-built Eureka as the registry (a minimal call sketch follows)
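
To give a feel for this stack, here is a minimal sketch of how the frontend might call productservice through Spring Cloud OpenFeign with Eureka-based discovery. The service name, path, and DTO fields are assumptions for illustration, not necessarily the demo's actual code.

```java
// Minimal sketch: frontend calling productservice through Spring Cloud OpenFeign,
// with the service name resolved against the self-built Eureka registry.
// Service name, path, and DTO fields are assumptions, not the demo's actual code.
// Requires spring-cloud-starter-openfeign and spring-cloud-starter-netflix-eureka-client;
// eureka.client.serviceUrl.defaultZone in application.yml points at the self-built Eureka.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.openfeign.EnableFeignClients;
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

@SpringBootApplication
@EnableFeignClients
public class FrontendApplication {
    public static void main(String[] args) {
        SpringApplication.run(FrontendApplication.class, args);
    }
}

// Looked up in Eureka by service name and load-balanced across its instances.
@FeignClient(name = "productservice")
interface ProductClient {
    @GetMapping("/products/{id}")
    ProductDto getProduct(@PathVariable("id") String id);
}

// Hypothetical response shape for a product lookup.
class ProductDto {
    public String id;
    public String name;
    public int stock;
}
```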

3) E-commerce application Architecture 1.0

In the early days of the e-commerce business, like many Internet companies, we did not consider disaster recovery and deployed in only a single region.

2. Case 1: DR for a read-heavy, write-light business

1) Occurrence of a fault

The e-commerce business grew rapidly in its early days, and the "small but beautiful" single-region deployment stayed unchanged, until a product application failure brought the e-commerce business down and the pages could not be accessed for a long time. The failure was eventually resolved, but the customer loss and the damage to the company's reputation dealt a heavy blow to the fast-growing business, forcing us to start building high availability.

The e-commerce business is mainly divided into shopping guide, shopping cart, transaction, and other scenarios. The first, the shopping guide, is a typical read-heavy, write-light scenario: its core is displaying the shopping guide page (the read link), and temporary unavailability of product writes (the write link) is acceptable. Combining this with our own DR requirements, we set a small improvement goal: "multi-region read".

2) Transforming to a multi-region read DR architecture

Based on MSHA, the shopping guide business is transformed into "multi-region read".

Multi-active transformation & MSHA access:

  • Partition dimension: userId is used to identify and split traffic.

  • Scope of transformation: the portal web application and the product application on the shopping guide link are deployed in two regions.

  • Control configuration: use the MSHA console to configure the multi-active resources of each layer.

3) Fault reproduction

After the DR architecture transformation is complete, the DR capability needs to be verified. Next, we reproduce the historical fault by injecting real faults and verify the DR capability.

[Preparation for the drill]

Business monitoring metrics: determine steady-state business monitoring metrics based on MSHA traffic monitoring or other monitoring capabilities, so that the impact of a fault and the actual business recovery after the fault is handled can be judged.

Drill expectation:

  • The shopping guide link weakly depends on the shopping cart application (the shopping guide page shows how many items the user has put in the cart); a failure of this weak dependency does not affect the business.

  • The shopping guide link strongly depends on the product application; a failure of this strong dependency makes the business unavailable, and the blast radius of the failure should be contained within the faulty unit.

[Fault Drill]

AHAS-Chaos fault drills can be used to conveniently simulate a variety of fault scenarios.

Stage 1: Weak dependency failure drill
  • Fault injection: inject a fault into the shopping cart application
    • Expectation: the shopping guide business is not affected
    • Result: the shopping guide page opens normally, as expected (a sketch of this graceful degradation follows)
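
A minimal sketch of why such a weak-dependency failure leaves the page usable, assuming a hypothetical cart client; the real frontend may degrade differently.

```java
// Sketch of treating the cart call as a weak dependency; the client interface and the
// degradation behaviour are hypothetical, not the demo's actual code.
public class CartBadgeService {

    /** Hypothetical client for cartservice (a Feign client in the real stack). */
    interface CartClient {
        int itemCount(String userId);
    }

    private final CartClient cartClient;

    public CartBadgeService(CartClient cartClient) {
        this.cartClient = cartClient;
    }

    /** Weak dependency: if cartservice fails, degrade to 0 instead of failing the whole page. */
    public int cartItemCountOrZero(String userId) {
        try {
            return cartClient.itemCount(userId);
        } catch (RuntimeException cartUnavailable) {
            return 0; // the shopping guide page still renders, just without the cart badge
        }
    }

    public static void main(String[] args) {
        // Simulate the drill: the injected fault makes every cart call fail.
        CartBadgeService service = new CartBadgeService(userId -> {
            throw new RuntimeException("cartservice fault injected");
        });
        System.out.println("Cart badge: " + service.cartItemCountOrZero("user-50")); // prints 0
    }
}
```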

Stage 2: Strong dependency failure drill

The routing rules configured before the drill are as follows (userId % 10000 is matched against the routing ranges below; a sketch of how such range rules resolve a unit appears after the drill results):

  • Fault injection: inject a fault into the product application in the Beijing unit
    • Expectation: users with userId=6000 are routed to the Beijing unit and are affected by the fault
    • Result: access to the shopping guide page fails, as expected

  • Blast radius verification: verify that the blast radius is contained within the faulty unit
    • Expectation: users with userId=50 are routed to the Hangzhou unit and are not affected by the failure of the Beijing unit
    • Result: the shopping guide page is accessed normally, as expected
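
The sketch below shows how such userId-range rules resolve a unit. The concrete ranges are assumptions chosen to be consistent with the drill results above (userId=6000 lands in Beijing, userId=50 lands in Hangzhou), not the rules actually configured in the MSHA console.

```java
// Sketch of userId-range routing; the ranges below are assumptions consistent with the drill,
// not the rules actually configured in the MSHA console.
public class UnitRouter {

    enum Unit { HANGZHOU, BEIJING }

    private static final int MOD = 10000;

    /** Routes by userId % 10000: assumed ranges [0, 5999] -> Hangzhou, [6000, 9999] -> Beijing. */
    static Unit route(long userId) {
        long bucket = userId % MOD;
        return bucket < 6000 ? Unit.HANGZHOU : Unit.BEIJING;
    }

    public static void main(String[] args) {
        System.out.println("userId=6000 -> " + route(6000)); // BEIJING: hit by the Beijing-unit fault
        System.out.println("userId=50   -> " + route(50));   // HANGZHOU: unaffected by the fault
    }
}
```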

4) Recovery by traffic switching

In the fault scenario, use the MSHA traffic switching function to verify the DR switchover capability.

  • Verify DR switchover: switch traffic for userId=6000 to the Hangzhou unit
    • Expectation: after the traffic switch, the user is routed to the Hangzhou unit and is no longer affected by the failure of the Beijing unit.
    • Result: the shopping guide page works normally (see the GIF below for the actual call chain of the shopping guide request), and the DR switchover capability meets expectations.

Follow-up: Fault revocation

  • Terminate the fault injection
  • Give feedback on the drill results and record the risks identified during the drill
  • Switch traffic back
  • Check whether the steady-state business metrics have recovered

3. Case 2: DR for a transaction-record business

1) A new fault

After the transformation above, the shopping guide business can withstand region-level faults. Then a large-scale failure of the order application became the last straw for the ordering business, and building a high-availability architecture for order placement was put on the agenda.

Placing an order is a typical transaction-record business scenario. Compared with the shopping guide, it is a more complex read-write business. Combining the business scenario with our DR requirements, we chose a DR construction scheme suitable for this business: "remote multi-active".

2) Transforming to a remote multi-active DR architecture

Based on MSHA, the order business is transformed into "remote multi-active".

Note: the order placement link strongly depends on the shopping cart application, so to complete the multi-active DR construction, the shopping cart application must also be transformed into "remote multi-active".

Multi-active transformation & MSHA access:

  • Scope of transformation: the order application and the order database are deployed in two regions.

  • MSHA access: the applications on the order placement link are instrumented with the MSHA Agent, which provides Spring Cloud RPC cross-unit routing and data write-protection capabilities non-invasively.

  • Control configuration:

3) Fault reproduction

After the DR architecture transformation is complete, we again reproduce the historical fault by injecting real faults to verify the DR capability.

[Preparation for the drill]

Business monitoring metrics: determine steady-state business monitoring metrics based on MSHA traffic monitoring or other monitoring capabilities.

Drill expectation: the order placement link strongly depends on the order application; a failure of this strong dependency makes the business unavailable, and the blast radius of the fault should be contained within the faulty unit.

[Fault Drill]

The routing rules configured before the drill are as follows (userId % 10000 is matched against the routing ranges below):

  • Fault injection: inject a fault into the order application in the Beijing unit
    • Expectation: users with userId=6000 are routed to the Beijing unit and are affected by the fault
    • Result: placing an order fails, as expected

  • Blast radius verification: verify that the blast radius is contained within the faulty unit
    • Expectation: users with userId=50 are routed to the Hangzhou unit and are not affected by the failure of the Beijing unit
    • Result: placing an order succeeds, as expected

4) Recovery by traffic switching

In the fault scenario, use the MSHA traffic switching function to verify the DR switchover capability.

  • Verify DR switchover: switch traffic for userId=6000 to the Hangzhou unit
    • Expectation: after the traffic switch, the user is routed to the Hangzhou unit and is no longer affected by the failure of the Beijing unit
    • Result: placing an order succeeds (see the GIF below for the actual call chain of the order request), and the DR switchover capability meets expectations.

Conclusion

In this article, we introduced one of the most powerful tools AHAS provides for business disaster recovery: the MSHA multi-active DR solution. Using an e-commerce business as an example, we walked through DR architecture construction for two typical business scenarios, read-heavy/write-light and transaction-record, and used the AHAS-Chaos fault drill function to simulate possible real-world failures and test whether the DR capability meets expectations.

MSHA on the public cloud has entered public beta, and a demo of the e-commerce business covering the two scenarios described in this article is available; you are welcome to apply for the demo. You are also welcome to join the MSHA DingTalk group (group number 31623894).

Finally, we want to say that DR construction is a systematic project, not something accomplished in one stroke, nor a one-off deal. You need to comprehensively evaluate your business scenarios, DR requirements, technology stack, budget, and other factors to choose a suitable DR architecture. You are welcome to reach out to discuss your own DR requirements and scenarios.

Further reading

  • AHAS-MSHA multi-active DR solution official documentation: help.aliyun.com/document_de…

  • AHAS-Chaos fault drill official documentation: help.aliyun.com/document_de…

  • E-commerce business multi-active practice: help.aliyun.com/document_de…

  • Strong/weak dependency governance & fault drill best practices: help.aliyun.com/document_de…