Introduction: This article uses a demo business case to explain the difficulties of building hybrid cloud disaster recovery (DR), and shows how to quickly build an application active-active architecture based on MSHA that achieves minute-level service recovery.

Author: Far Zhi

Preface

As they go through digital transformation and cloud migration, more and more enterprises choose a hybrid cloud model (cloud + self-built IDC, or cloud + another vendor's cloud) for disaster recovery (DR). On the one hand, this avoids relying too heavily on a single cloud vendor; on the other hand, it makes full use of existing offline IDC resources.

The MSHA cloud-native multi-active DR solution **[1]** has also released hybrid cloud multi-active DR capabilities. This article uses a demo business case to explain the difficulties of building hybrid cloud DR, and shows how to quickly build an application active-active architecture based on MSHA that achieves minute-level service recovery.

Hybrid cloud DR in practice

Business background

Enterprise A runs an e-commerce trading platform in the retail industry. Its business systems are deployed in a self-built IDC, which has the following pain points:

  • Services are deployed only on IDC servers and have no DR capability.
  • IDC capacity is insufficient, and the upgrade and replacement cycle for physical machines is long, so the IDC cannot keep up with rapid business growth.

As the business grew rapidly, the repeated capacity shortages and faults drew the attention of senior management, who decided to build DR capability. The self-built IDC is an existing company asset that has run stably for many years, and the company does not want to depend too heavily on the cloud. It therefore wants to build a hybrid cloud DR architecture that combines the IDC and the cloud.

Current application deployment architecture

The e-commerce trading platform includes the following applications:

  • Frontend: a web application that interacts with users.
  • Cartservice: the cart application, which provides services for adding items to the cart, storing it, and querying it.
  • Productservice: the product application, which provides product and inventory services.

Technology stack:

  • Spring Boot.
  • RPC frameworks: Spring Cloud and Dubbo, with self-built Nacos and ZooKeeper registries.
  • Databases: Redis and MySQL.

Hybrid cloud DR goals

The business DR requirements are summarized as follows:

  • **Minute-level RTO for switchover between the cloud and the IDC.** The cloud and the IDC should back each other up, so that the IDC keeps delivering value and the business does not depend 100% on the cloud. When the IDC or the cloud fails, a switchover must be possible at the critical moment, with an RTO of less than 10 minutes.
  • **No data consistency risk.** Data in the cloud and IDC data centers must stay strongly consistent, so consistency risks such as dirty writes must be avoided both in daily operation and during a DR switchover.
  • **One-stop control.** The technology stack, frameworks, and cloud products involved in business DR need unified management, unified O&M, and unified switchover. Operations converge on a single management and control platform, which allows quick console-based operation and automated execution in fault scenarios.
  • **Short implementation cycle and low transformation cost.** The business has multiple product lines, complex dependencies, and long call chains, and is in a period of rapid growth and frequent iteration. DR construction should not place a transformation burden on the business R&D teams.

Construction difficulties

  • Traffic management is difficult
    • If DNS is used to split traffic between the cloud and the IDC by weight, changes to DNS resolution take a long time to take effect (usually ten minutes to several hours; for details, see the FAQ **[2]**). DNS therefore cannot meet the requirement that a DR switchover complete in less than 10 minutes.
    • Business applications depend on Redis and MySQL. The IDC runs self-built open-source deployments, while the cloud uses managed cloud products directly, so it is hard to build DR switching between self-built open-source databases and cloud products.
  • Data quality during a DR switchover is hard to guarantee
    • During a DR switchover, stale data may be read because of data synchronization delay, or because switchover rules reach the distributed application nodes at different times. Databases in the cloud and in the IDC may also be read and written at the same time, which produces dirty writes. Guaranteeing data quality is the key difficulty of the switchover.
  • Avoiding intrusion into business code is difficult
    • To gain DR switching for Redis and MySQL, business applications usually have to be modified, which is highly intrusive to business code.
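
To make the DNS limitation above concrete, here is a rough sketch in Python. It assumes that, in the worst case, a DNS change reaches clients only after the record's TTL expires, plus any extra time that intermediate resolvers keep serving the cached answer. The function name, its parameters, and the sample TTL values are illustrative assumptions and do not come from MSHA or any DNS provider.

```python
def dns_switch_meets_rto(ttl_seconds: int, rto_seconds: int,
                         extra_cache_seconds: int = 0) -> bool:
    """Return True if the assumed worst-case DNS propagation fits the RTO budget."""
    # Assumption: worst case = record TTL + extra time resolvers hold the answer.
    worst_case = ttl_seconds + extra_cache_seconds
    return worst_case < rto_seconds


RTO_SECONDS = 10 * 60  # the "less than 10 minutes" requirement from the text

for ttl in (60, 600, 3600):  # hypothetical TTL values
    verdict = "fits" if dns_switch_meets_rto(ttl, RTO_SECONDS) else "does not fit"
    print(f"TTL {ttl}s {verdict} a {RTO_SECONDS}s RTO")
```

Even a short TTL only helps if every resolver on the path honors it, which is why the sketch keeps the extra caching time as a separate, explicit parameter.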

Solution

Given the business DR requirements and the characteristics of a hybrid cloud made up of an IDC and the cloud, an application active-active architecture can meet the business DR requirements.

Application active-active architecture

Schematic diagram:

Architecture notes:

  • Select a cloud Region whose physical distance from the IDC is at most 200 km, where network latency is about 5 to 7 ms.
  • Applications and middleware are deployed in the cloud and in the IDC with symmetric redundancy, and both sides serve external traffic (application active-active).
  • Databases are deployed as remote active/standby with asynchronous replication. Applications in both sites read and write the database in the same data center, which avoids data consistency concerns.

Detailed design

  • Active-active application traffic

Business applications are deployed symmetrically in the cloud and in the IDC. The MSHA access layer cluster (MSFE) receives HTTP/HTTPS traffic and distributes it between the cloud and the IDC by proportion or by precise routing rules. The multi-active console provides routine O&M capabilities for MSFE clusters, such as console-based deployment, scaling, and monitoring, as well as minute-level traffic switching in fault scenarios.
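
As a rough illustration of distributing traffic by proportion or by precise rules, the Python sketch below models an access layer that first checks a per-user pin list and otherwise picks a unit by weight. The class, the unit names "cloud" and "idc", and the pin list are hypothetical. This is not the MSFE implementation or its configuration format.

```python
import random
from typing import Dict, Optional


class UnitRouter:
    """Toy router: precise (pinned) rules first, then weighted selection."""

    def __init__(self, weights: Dict[str, int],
                 pinned_users: Optional[Dict[str, str]] = None) -> None:
        self.weights = dict(weights)                 # unit name -> traffic weight
        self.pinned_users = dict(pinned_users or {})  # user id -> unit name

    def route(self, user_id: str, rng: random.Random) -> str:
        # A precise rule wins, but only while its unit still accepts traffic.
        pinned = self.pinned_users.get(user_id)
        if pinned is not None and self.weights.get(pinned, 0) > 0:
            return pinned
        units = [u for u, w in self.weights.items() if w > 0]
        if not units:
            raise RuntimeError("no unit is accepting traffic")
        return rng.choices(units, weights=[self.weights[u] for u in units], k=1)[0]

    def switch_to_zero(self, unit: str) -> None:
        # Models cutting a unit's share of traffic to 0 during a fault.
        self.weights[unit] = 0


router = UnitRouter({"cloud": 50, "idc": 50}, pinned_users={"vip-user": "cloud"})
rng = random.Random(0)
print(router.route("vip-user", rng))   # pinned to "cloud" while it has weight
router.switch_to_zero("cloud")
print(router.route("vip-user", rng))   # falls back to the remaining unit
```

Setting a unit's weight to 0 also overrides pins to that unit, so a single switch drains all traffic away from the faulty side.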

  • Service interconnection and same-unit-first invocation

Business applications move to the cloud in batches, one product line at a time. During this process, some downstream applications may still be deployed only in the IDC. With the MSHA registry synchronization function, services in the cloud and in the IDC can call each other, which makes it easier to move services to the cloud. In addition, based on the aspect capability of MSHA-Agent, when a Dubbo or Spring Cloud service is invoked, the Consumer calls a Provider in the same unit first. This avoids the network latency of cross-data-center calls and reduces request RT.
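
The idea of a Consumer preferring a Provider in its own unit (cell), with a fallback to the other side when the service exists only there, can be sketched as follows. This is a minimal Python illustration under assumed names (`Provider`, `pick_provider`, and the unit labels "cloud" and "idc"). It does not show how MSHA-Agent, Dubbo, or Spring Cloud implement routing.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provider:
    address: str
    unit: str  # hypothetical unit label, e.g. "cloud" or "idc"


def pick_provider(providers: List[Provider], local_unit: str,
                  rng: random.Random) -> Provider:
    """Prefer providers in the caller's unit; otherwise call across units."""
    if not providers:
        raise LookupError("no provider registered for this service")
    same_unit = [p for p in providers if p.unit == local_unit]
    # Fall back to every provider when none is deployed in the local unit,
    # e.g. a downstream application that still runs only in the IDC.
    candidates = same_unit or providers
    return rng.choice(candidates)


rng = random.Random(0)
cart = [Provider("10.0.0.1:20880", "cloud"), Provider("192.168.0.1:20880", "idc")]
legacy = [Provider("192.168.0.9:20880", "idc")]
print(pick_provider(cart, "cloud", rng).unit)    # same-unit provider is chosen
print(pick_provider(legacy, "cloud", rng).unit)  # only the IDC provider exists
```

The fallback branch is what registry synchronization makes possible: the cloud-side Consumer can see the IDC-only Provider at all.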

  • Data synchronization & database connection switchover

The databases are deployed in remote active/standby mode. In daily operation, applications in the cloud and in the IDC all read and write the Redis and RDS databases in the cloud, so there is no data consistency problem to handle. The MSHA console integrates the DTS synchronization component to support data synchronization (asynchronous replication) between the cloud and the IDC. In addition, based on the aspect capability of MSHA-Agent, it can switch the database connections that applications use: if Redis or RDS in the cloud fails, read and write connections can be switched to Redis or MySQL in the IDC, and vice versa. During the switchover, MSHA also applies write protection to prevent data quality problems such as stale reads and dirty writes.
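
The interaction between connection switching and write protection can be pictured with a small state holder. The Python sketch below assumes a two-step switch: writes are blocked first, and the active and standby targets are swapped only once the standby is reported as caught up. All class, method, and data source names are hypothetical, and the real MSHA-Agent behavior may differ.

```python
class WriteBlockedError(RuntimeError):
    """Raised when a write arrives while the switchover is in progress."""


class DataSourceSwitch:
    def __init__(self, active: str, standby: str) -> None:
        self.active = active      # data source that currently serves traffic
        self.standby = standby    # asynchronously replicated copy
        self.writes_blocked = False

    def target_for(self, is_write: bool) -> str:
        # Rejecting writes during the switch avoids dirty writes to both sides.
        if is_write and self.writes_blocked:
            raise WriteBlockedError("writes are paused during switchover")
        return self.active

    def begin_switch(self) -> None:
        self.writes_blocked = True

    def finish_switch(self, standby_caught_up: bool) -> bool:
        """Swap active and standby if replication caught up; always lift the block."""
        if not self.writes_blocked:
            raise RuntimeError("begin_switch() must be called first")
        switched = False
        if standby_caught_up:
            self.active, self.standby = self.standby, self.active
            switched = True
        self.writes_blocked = False
        return switched


switch = DataSourceSwitch(active="cloud-rds", standby="idc-mysql")
switch.begin_switch()                          # writes now raise WriteBlockedError
switch.finish_switch(standby_caught_up=True)   # swap the two targets
print(switch.target_for(is_write=True))        # the former standby now takes writes
```

Keeping the block and the swap as separate steps makes the window explicit: no write is accepted between the moment the old primary stops being trusted and the moment the new one takes over.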

  • One-stop control & no business code intrusion

The MSHA console supports unified management, control, and switchover of HTTP traffic and database access traffic. Operations converge on a single management and control platform, which allows quick console-based operation in fault scenarios. In addition, business applications connect to MSHA through an Agent, so they gain the related DR switchover capabilities without any changes to business code.

Changes required

  • Moving applications to the cloud
    • Select the Alibaba Cloud region that is close to the self-built IDC, and deploy a fully redundant set of applications, middleware, and databases in the cloud to build an active-active DR architecture across the cloud and the IDC. In this demo, the Hangzhou Region is selected as the DR unit.
  • Network access
    • Connect through Cloud Enterprise Network (CEN) to provide network connectivity between the cloud and the IDC (see the document on building an enterprise-level hybrid cloud with multiple access methods **[3]**).
  • Access layer cluster deployment and configuration
    • Deploy the MSHA access layer cluster (MSFE) in the cloud, and attach an SLB instance in front of it for public network access and load balancing of the MSFE cluster (see the usage documentation **[4]**).
    • Enter the domain names, URIs, and back-end application addresses that are needed for splitting traffic between the cloud and the IDC and for minute-level traffic switching (see the usage documentation **[5]**).
  • Applications
    • Deploy business applications in the cloud in batches.
    • Install MSHA-Agent in Java applications, with Nacos as the channel for delivering control commands, so that the applications can preferentially invoke microservices in the same unit and switch database connections (see the usage documentation **[6]**).
  • Middleware and databases
    • In the cloud, deploy MSE-hosted ZooKeeper/Nacos registries, cloud Redis, and RDS. The cross-availability-zone high availability editions are recommended, because they provide same-city active-active DR.
    • If an application is deployed only in the IDC, configure service synchronization for the registry (see the usage documentation **[7]**).
    • Configure data synchronization between cloud Redis/RDS and self-built Redis/MySQL (see the usage documentation **[8]**).

Application deployment architecture after the changes

Daily scenario: both the IDC and the cloud carry business traffic (application active-active)

Visit the home page of the e-commerce demo and check the actual traffic call chain: each request reaches the Beijing unit or the Hangzhou unit with some probability, and both units read and write the database in the Beijing unit.

DR capability

  • RPO: <= 1 min (depends on DTS synchronization performance)
  • RTO: <= 1 min (depends on the DTS synchronization delay; the MSHA components themselves switch within seconds, so the overall RTO is <= 1 min)

Verifying the DR capability

After the application active-active architecture has been built on MSHA, the business DR capability needs to be verified against expectations. The following steps create real faults to verify it.

7.1 Drill Preparation

  1. Go to the MSHA console and select Monitor in the left menu bar. Use the drop-down list at the top of the page to switch to the namespace that is actually in use.

  2. View the monitoring indicators on the page.

Note: Before a drill, decide on steady-state monitoring indicators (for example, RT <= 200 ms and error rate < 1%) based on MSHA traffic monitoring or other monitoring products. These indicators are used to judge the impact when a fault occurs and to confirm how far the business has actually recovered after recovery actions are taken.
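
A steady-state check like the one in the note can be written down as a small predicate. The Python sketch below assumes the two thresholds quoted above (RT <= 200 ms, error rate < 1%) and a hypothetical `Sample` record holding values read from whichever monitoring product is in use. It is not an MSHA API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    avg_rt_ms: float  # average response time over the sampling window
    requests: int     # total requests in the window
    errors: int       # failed requests in the window


def is_steady_state(sample: Sample, max_rt_ms: float = 200.0,
                    max_error_rate: float = 0.01) -> bool:
    """Return True when RT <= max_rt_ms and error rate < max_error_rate."""
    if sample.requests == 0:
        # No traffic in the window: health cannot be confirmed either way.
        return False
    error_rate = sample.errors / sample.requests
    return sample.avg_rt_ms <= max_rt_ms and error_rate < max_error_rate


before_fault = Sample(avg_rt_ms=120.0, requests=1000, errors=2)
during_fault = Sample(avg_rt_ms=850.0, requests=1000, errors=430)
print(is_steady_state(before_fault), is_steady_state(during_fault))
```

Evaluating the same predicate before the drill, during the fault, and after the switchover gives a consistent yes/no answer to whether the business has recovered.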

7.2 Application Fault Injection

Here we use the Alibaba Cloud fault drill product to inject faults into the product application in the Alibaba Cloud Beijing unit.

  1. Go to the Chaos fault drill console **[9]**, switch to the corresponding region at the top of the page, and select My Space in the left navigation bar.

  2. In My Space, select the configured drill (50% network packet loss) and click to execute the drill.

After the fault is injected, access errors may occur on the e-commerce home page or when you place an order, which is expected.

7.3 Recovery by Traffic Switching

When the product application in the Beijing unit fails, MSHA can switch the cloud traffic to 0 to restore service quickly.

Expected result

After 100% of the traffic is switched to the Hangzhou unit, the service fully recovers and is no longer affected by the fault in the Beijing unit.

Traffic switching procedure

1. Go to the MSHA console and choose Traffic Switching > Remote Active-Active Traffic Switching.

2. On the traffic switching page, click the one-click switch-to-zero option for the Beijing unit.

3. Run the pre-check. In the traffic switching check area, click OK to start the traffic switch.

4. When the traffic switching task page shows that the switch is complete, the traffic switch has succeeded.

5. Refresh the home page of the e-commerce demo. It displays normally across repeated visits, as expected.

Check the actual traffic call chain: traffic now always reaches the Hangzhou unit, and still reads and writes the database in the Beijing unit.

7.4 Database Fault Injection

As the call chain above shows, the applications in the Hangzhou unit still access the Redis and MySQL databases in the Beijing unit. We continue to use the Chaos fault drill product **[10]** to inject faults into the Redis and MySQL databases of the Beijing unit and create a database fault scenario.

After the fault is injected, errors occur when you open the e-commerce home page or place an order, as expected.

7.5 Switching the Database for Recovery

If the database in the Beijing unit fails, the MSHA database switching function can switch the Redis/MySQL connections used by the applications to the databases in the Hangzhou unit (during the switch, data synchronization is brought up to date and writes are temporarily blocked).

Expected result

After the databases used by the applications are switched to Hangzhou, the service fully recovers and is no longer affected by the fault in the Beijing unit.

Switchover procedure

1. Go to the MSHA console and choose Remote Application Active-Active > Data Layer Configuration in the navigation tree.

2. In the data protection rule list, locate the product database, order database, and shopping cart database one by one, and click the active/standby switchover action for each.

3. Clicking the active/standby switchover action opens the pre-check page.

4. On the active/standby switchover details page, you can view the switchover progress and result. The switchover is complete when the task progress reaches 100%.

5. After the active/standby switchover of the product, order, and shopping cart databases is complete, visit the demo home page or place an order repeatedly. The demo works normally, which confirms that business functions are restored after the switchover, as expected.

Conclusion

This article walked through a practical case in which MSHA helps an enterprise build hybrid cloud application active-active DR, and provided practical methods for building the DR architecture. We also used the Chaos fault drill product to inject real faults and verify that the business DR capability meets expectations in fault scenarios.

This article is original content from Alibaba Cloud and may not be reproduced without permission.