Introduction: In the era of big data, data is the core asset and lifeline of an enterprise. Disasters do happen in the real world, and when they do, disaster recovery (DR) capability becomes the key to an enterprise's survival. DRaaS (Disaster Recovery as a Service) saves the cost of building a DR center and the subsequent operation and maintenance (O&M) costs, helps customers quickly establish cross-region DR solutions, and gives users great flexibility. This article describes the DR products available on the cloud so that an enterprise can select an appropriate DR solution based on its own characteristics, analyzes the technical architectures of traditional and on-cloud asynchronous replication products, and explains how Alibaba Cloud Block Storage implements asynchronous replication of block storage based on its own architecture.

Preface

Data is the lifeblood of an enterprise

Remote data disaster recovery (DR) is a universal demand of enterprise customers and a core demand of government, financial, and other large customers. In the era of big data, data is the core asset and lifeline of an enterprise. Disasters do happen in the real world, and when they do, DR capability becomes the key to an enterprise's survival.

In the September 11 attacks in the United States, the Twin Towers collapsed and the data centers of several banks were destroyed. Deutsche Bank, which had backed up its data dozens of miles away, reportedly restored service quickly and won praise from its customers, while Bank of New York, whose disaster recovery was reportedly inadequate, suffered severe and prolonged disruption.

In March 2021, a data center of OVHcloud, the largest data center operator in France, caught fire, affecting more than 3.5 million websites.

In the recent Zhengzhou flood of July 20, the Heyi campus of the First Affiliated Hospital of Zhengzhou University was hit by continuous heavy rain. The whole campus was flooded and lost power. The hospital activated its remote disaster recovery mechanism and switched its core business to the core machine room at its eastern campus in only 15 minutes, which kept the other two campuses operating normally.

Cloud DR becomes a trend

One real case after another has sounded the alarm, and enterprise investment in data protection and disaster recovery keeps growing. Traditional DR solutions require enterprises to build their own DR centers, purchase dedicated lines, and staff O&M, all of which is costly. With the rapid development of cloud computing, more and more enterprise customers are considering cloud DR. DRaaS (Disaster Recovery as a Service) saves the cost of building a DR center and the subsequent operation and maintenance (O&M) costs, helps customers quickly establish cross-region DR solutions, and gives users great flexibility. The following table summarizes the comparison between DRaaS and traditional DR solutions. Compared with traditional DR solutions, DRaaS requires no infrastructure and less O&M, and it is highly flexible. DRaaS has therefore become the trend in DR.

Asynchronous replication of ESSD cloud disks

ESSD, the flagship cloud disk product of Alibaba Cloud Block Storage, has steadily matured. To better serve enterprise customers and meet their cloud DR requirements, Alibaba Cloud Block Storage has launched its own DRaaS product, which asynchronously replicates cloud disks across regions. This article introduces how users can choose suitable DR products on the cloud, analyzes the similarities and differences between DR architectures from a technical point of view, and then describes how we chose the DR architecture for ESSD and the technical principles behind asynchronous replication of cloud disks.

How does an enterprise choose a cloud DR solution

Select an appropriate DR type based on the RPO and RTO

When selecting a DR solution, an enterprise must first determine the DR level it needs based on its own service characteristics. In the disaster recovery field, the Recovery Point Objective (RPO) measures the maximum window of data, expressed as a duration, that may be lost in a disaster. The Recovery Time Objective (RTO) measures the maximum time allowed between the disaster and the restoration of service.
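To make the two metrics concrete, here is a minimal Python sketch that measures one incident against an RPO and an RTO. The data loss window runs from the last consistent copy to the disaster, and the downtime runs from the disaster to the restoration of service. The function names and the sample timestamps are illustrative assumptions, not part of any product API.

```python
from datetime import datetime, timedelta
from typing import Tuple


def measure_recovery(last_consistent_copy: datetime,
                     disaster_time: datetime,
                     service_restored: datetime) -> Tuple[timedelta, timedelta]:
    """Return (data_loss_window, downtime) for a single incident.

    data_loss_window is what the RPO bounds; downtime is what the RTO bounds.
    """
    if not last_consistent_copy <= disaster_time <= service_restored:
        raise ValueError("timestamps must be in chronological order")
    data_loss_window = disaster_time - last_consistent_copy
    downtime = service_restored - disaster_time
    return data_loss_window, downtime


def meets_objectives(data_loss_window: timedelta, downtime: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when the incident stayed within both objectives."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: last copy at 10:00, disaster at 10:12, service back at 10:13.
loss, down = measure_recovery(datetime(2021, 7, 20, 10, 0),
                              datetime(2021, 7, 20, 10, 12),
                              datetime(2021, 7, 20, 10, 13))
print(loss, down)
```

In this made-up incident, 12 minutes of data are at risk and the service is down for 1 minute, so it would satisfy an RPO of 15 minutes and an RTO of 5 minutes but not an RPO of 10 minutes.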

China has issued a national standard that divides disaster recovery capability into six levels, as shown in the following figure.

From level 1 to level 6, the higher the level, the lower the risk of data loss, but the higher the cost of building DR. In the traditional storage industry, data backup and archiving products can usually meet the requirements of levels 1 to 2, the backup functions of common storage arrays can meet levels 3 to 5, asynchronous replication on high-end storage arrays can meet levels 4 to 5, and synchronous replication and active-active functions on high-end storage, as well as application-based replication, can meet levels 5 to 6.

Cloud vendors likewise provide a variety of cloud products to meet the requirements of different DR levels. Backup services on the cloud usually provide cross-region or cross-zone disaster recovery and meet the requirements of levels 1 to 4, while asynchronous and synchronous replication products can meet the requirements of levels 5 to 6. Mainstream applications, such as database services, also have their own DR products, which at best work at I/O granularity.

According to the preceding levels, asynchronous replication meets the requirements of level 4 or level 5 disaster recovery and is widely required by financial customers such as banks and by government organizations.

Select appropriate DR services based on system features

From the perspective of implementation, the DR schemes of existing cloud vendors fall into three categories: application-based, instance-based, and block-storage-based:

Application-based

This type of DR service is usually specific to a particular application, such as a cloud database, message queue, or object storage. Users of these cloud services can choose the corresponding DR service according to their needs. Because it is integrated with the business, this type of DR service can often achieve application-level consistency. The disadvantage is that it is not general-purpose: it works only for the specific application it was built for.

Instance-based (cloud host)

Users who buy only IaaS services, who run their own custom business, or whose needs cannot be met by application-level DR services can choose a DR scheme based on cloud hosts. This scheme protects data consistency within a single instance or across instances. In addition to restoring storage data at the DR site, it usually also restores the hosts and the network, which is convenient. This type of DR service is simple to operate and general-purpose. The disadvantage is that host resources must be purchased at the DR site, so the cost is relatively high.

Block-storage-based (cloud disk)

Data is at the heart of DR, so some vendors offer cross-region replication products for the cloud disk itself. This form is more flexible and generally places no restrictions on applications. No host needs to be purchased at the DR site during replication, which reduces user cost. It can also be combined seamlessly with other cloud services to achieve an effect similar to application-level DR. Consistency groups can also be created on cloud disks so that the replicated data of a group of cloud disks satisfies crash-consistency semantics.

ESSD cloud disk asynchronous replication: simple steps to complete service recovery

ESSD cloud disk asynchronous replication replicates the data on a cloud disk across regions asynchronously, with an RPO of 15 minutes. You can create a DR pair in three simple steps: first, select the cloud disk that you want to replicate; second, select the DR site and create a secondary disk; third, create the DR pair and activate it. After the DR pair is activated, data on the cloud disk is periodically replicated to the secondary disk at the DR site. To pause replication temporarily, use the stop function of the DR pair.

When a fault occurs, you can use the failover function to switch between the primary and secondary sites. Failover disconnects the replication link, restores the secondary disk to the consistency point of the last completed replication, and gives users read and write access to it.

If you want to restore services to the original production site after disaster recovery, you can use the reverse recovery function to restore incremental data generated at the secondary site to the primary site.
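The operations described in this section (activate, stop, failover, reverse recovery) can be pictured as a small state machine. The sketch below is a hypothetical model written for illustration only: the class, the state names, and the transition table are assumptions inferred from the prose, not the actual Alibaba Cloud API or its real state model. In particular, the state a pair enters after reverse recovery is an assumption.

```python
from enum import Enum


class PairState(Enum):
    CREATED = "created"          # pair exists, replication not yet started
    REPLICATING = "replicating"  # periodic replication is running
    STOPPED = "stopped"          # replication temporarily stopped by the user
    FAILED_OVER = "failed_over"  # link cut, secondary disk is readable and writable


# Allowed transitions, keyed by (current state, operation). Hypothetical.
TRANSITIONS = {
    (PairState.CREATED, "activate"): PairState.REPLICATING,
    (PairState.REPLICATING, "stop"): PairState.STOPPED,
    (PairState.STOPPED, "activate"): PairState.REPLICATING,
    (PairState.REPLICATING, "failover"): PairState.FAILED_OVER,
    (PairState.STOPPED, "failover"): PairState.FAILED_OVER,
    # Assumption: after incremental data is copied back, the pair waits
    # in STOPPED until the user activates replication again.
    (PairState.FAILED_OVER, "reverse_recover"): PairState.STOPPED,
}


class DrPair:
    """Toy model of a DR pair made of one primary and one secondary disk."""

    def __init__(self, primary_disk: str, secondary_disk: str):
        self.primary_disk = primary_disk
        self.secondary_disk = secondary_disk
        self.state = PairState.CREATED

    def apply(self, operation: str) -> PairState:
        key = (self.state, operation)
        if key not in TRANSITIONS:
            raise ValueError(
                f"cannot {operation} a pair in state {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state

    @property
    def secondary_writable(self) -> bool:
        # Per the article, the secondary disk is usable only after failover.
        return self.state is PairState.FAILED_OVER


pair = DrPair("disk-primary", "disk-secondary")  # hypothetical disk names
pair.apply("activate")
pair.apply("failover")
print(pair.state.value, pair.secondary_writable)
```

Modelling the operations as an explicit transition table makes invalid sequences, such as stopping a pair that has already failed over, fail loudly instead of silently.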

The technology behind asynchronous replication on the cloud

This chapter discusses the asynchronous replication technology widely used in DR products on the cloud and how it compares with traditional storage architectures. The core of DR is data DR, and block storage DR is the most common DR solution, so the following discussion focuses on asynchronous replication of cloud disks.

Replication architecture for traditional storage

Traditional asynchronous storage replication can be implemented in three ways:

Storage gateway-based: A storage gateway is a storage service technology based on a SAN network and sits between servers and storage devices. The gateway can intercept the I/O stream and provide flexible storage services on it. Because it is independent of both the host servers and the arrays, it consumes no host or array resources and can easily support replication between heterogeneous systems. However, every I/O passes through the extra gateway hop, which costs some performance, so this approach is not suitable for services with high performance requirements.

Host-based: In SAN terms, replication is implemented at the initiator side: I/O is forked on the host according to the replication requirements. A typical implementation is DRBD. This architecture places no requirements on the back-end storage arrays, but the corresponding software must be installed on the host. It is mostly used by third-party DR service providers.

Array-based: Most storage array vendors implement an array-based replication architecture that relies on the characteristics of their own arrays. In this architecture, the vendor implements data change tracking and dual writes at the target side, according to the I/O architecture of its array.

Two technical architectures for asynchronous replication on the cloud

Cloud vendors typically adopt one of two technical architectures, depending on their product characteristics:

Agent-based architecture: Plug-ins are usually installed in the user's virtual hosts as I/O proxies. The plug-ins intercept user I/O requests and forward them for replication. This solution can provide application-level consistency semantics. It places no special requirements on the cloud disk vendor and can easily implement DR across heterogeneous systems. However, users must deploy the plug-ins, and there may be restrictions on operating system versions. Third-party DR providers on the cloud usually adopt this architecture.

Agentless architecture: This mode is usually built on the underlying storage system. Full or incremental replication relies on consistency points provided by the storage system and on a bitmap of the data differences between them, giving users crash-consistent data. Its advantages are efficient differential replication integrated with the storage system, no intrusion into user hosts, and simpler use. However, application-level data consistency cannot be achieved. Mainstream cloud vendors usually have self-developed block storage services and implement an agentless replication architecture based on the features of their block storage architecture.

Alibaba Cloud ESSD cloud disk asynchronous replication architecture

Alibaba Cloud storage has also launched its own asynchronous replication product. This chapter introduces how Alibaba Cloud implements asynchronous replication based on its own architecture.

The asynchronous replication function of Alibaba Cloud Block Storage is implemented without an agent. The DR system architecture is shown in the figure. DR management software is deployed at both the production site and the DR site. The DR management system periodically issues replication tasks to the asynchronous replication I/O component. The replication component obtains the data differences from the back-end cloud disk storage system and copies the changed data to the target region. The design goal for the cross-region replication RPO is 15 minutes.
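A 15-minute RPO goal constrains both how often a replication task runs and how long each copy may take. The following back-of-the-envelope sketch assumes a simple model in which a consistency point is taken every `period` and copying it to the target region takes `copy_duration`. Both names and the sample values are illustrative assumptions; the article does not state the actual replication period. In this model, if a disaster strikes just before a copy completes, the secondary disk still holds the previous consistency point, so the data at risk spans one full period plus the copy time.

```python
from datetime import timedelta


def worst_case_data_loss(period: timedelta, copy_duration: timedelta) -> timedelta:
    """Worst-case data loss window under the simple periodic model.

    Just before the copy of point k+1 finishes, the secondary disk still
    holds point k, which was taken period + copy_duration earlier.
    """
    return period + copy_duration


def max_copy_duration(rpo_target: timedelta, period: timedelta) -> timedelta:
    """Longest copy time that still keeps the worst case within the target."""
    if period >= rpo_target:
        raise ValueError("period must be shorter than the RPO target")
    return rpo_target - period


# Hypothetical numbers: a point every 10 minutes, each copy takes 4 minutes.
print(worst_case_data_loss(timedelta(minutes=10), timedelta(minutes=4)))
print(max_copy_duration(timedelta(minutes=15), timedelta(minutes=10)))
```

With these made-up numbers the worst case is 14 minutes, and a 10-minute period leaves at most 5 minutes for each copy if the 15-minute target is to hold.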

Highly available architecture

Asynchronous replication adopts a high-availability architecture. To keep the system available in a fault scenario, the DR management components are deployed at both the production site and the DR site rather than at only one site or in a third region. The DR management metadata of each DR pair is synchronized to both the primary and secondary sites. This ensures that the DR management function at the secondary site remains available if a disaster strikes the primary site. In addition, the DR management software and the replication links each adopt a high-availability design, and all management nodes are deployed in active/standby mode to ensure service continuity.

Efficient replication

Asynchronous replication uses incremental replication to minimize the amount of data to be copied and transferred, which improves replication efficiency. The underlying storage system uses its internal consistency-point technology to efficiently obtain a crash-consistent view of the cloud disk, and its internal index technology to efficiently obtain the incremental differences. The following figure shows how the storage system obtains the difference bitmap between consistency points. The difference bitmap is serialized as a Data Change Log (DCL) and sent to the replication component, which reads the consistency-point data in the regions marked by the bitmap and writes it to the secondary disk.
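The idea of bitmap-driven incremental replication can be sketched in a few lines of Python. This is a deliberately simplified model: it derives the bitmap by comparing two in-memory snapshots block by block, whereas the article says the real storage system derives the differences from its internal index. The block size, function names, and sample data are assumptions made for the demonstration.

```python
from typing import List

BLOCK_SIZE = 4  # bytes per block; tiny on purpose, an assumption for the demo


def diff_bitmap(prev: bytes, curr: bytes,
                block_size: int = BLOCK_SIZE) -> List[bool]:
    """Mark each block that differs between two consistency points."""
    if len(prev) != len(curr):
        raise ValueError("consistency points must have the same size")
    n_blocks = (len(curr) + block_size - 1) // block_size
    return [
        prev[i * block_size:(i + 1) * block_size]
        != curr[i * block_size:(i + 1) * block_size]
        for i in range(n_blocks)
    ]


def replicate_incremental(curr: bytes, secondary: bytearray,
                          bitmap: List[bool],
                          block_size: int = BLOCK_SIZE) -> int:
    """Copy only the blocks marked in the bitmap; return bytes transferred."""
    if len(secondary) != len(curr):
        raise ValueError("secondary disk must match the source size")
    transferred = 0
    for i, dirty in enumerate(bitmap):
        if dirty:
            start = i * block_size
            chunk = curr[start:start + block_size]
            secondary[start:start + len(chunk)] = chunk
            transferred += len(chunk)
    return transferred


prev_point = b"AAAABBBBCCCCDDDD"   # previous consistency point
curr_point = b"AAAAXXXXCCCCDDDY"   # current consistency point
secondary_disk = bytearray(prev_point)  # secondary already holds the previous point

bitmap = diff_bitmap(prev_point, curr_point)
sent = replicate_incremental(curr_point, secondary_disk, bitmap)
print(bitmap, sent, bytes(secondary_disk) == curr_point)
```

Only two of the four blocks changed, so half of the disk is transferred, and the secondary disk ends up identical to the current consistency point.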

The replication link automatically splits a replication job into shards and copies them concurrently, based on characteristics such as cloud disk size and bandwidth. This improves replication efficiency and minimizes the RPO. The figure below shows how the replication component obtains the difference bitmap and replicates data. A cloud disk is cut by size into multiple shards stored on different data servers. The replication component obtains the consistent view and the DCL from the storage servers and, according to the cloud disk size and the available bandwidth, decides how many subtasks to split the replication of a cloud disk into, so as to better match the replication bandwidth. The following figure shows how the replication I/O component works.
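One way to picture the subtask decision is the following minimal sketch. It is not the product's actual scheduling policy, which the article does not spell out. It assumes a hypothetical rule: use as many concurrent subtasks as are needed to fill the replication link, capped by a maximum and by the number of changed blocks. All parameter names and numbers are illustrative assumptions.

```python
import math
from typing import List


def plan_subtasks(dirty_blocks: List[int],
                  link_bandwidth: int,
                  per_task_throughput: int,
                  max_subtasks: int = 64) -> List[List[int]]:
    """Split changed block indices into contiguous subtasks.

    link_bandwidth and per_task_throughput are in bytes per second.
    Returns at most min(max_subtasks, link_bandwidth // per_task_throughput)
    subtasks, and never more than there are changed blocks.
    """
    if per_task_throughput <= 0 or link_bandwidth <= 0:
        raise ValueError("bandwidth values must be positive")
    if not dirty_blocks:
        return []
    workers_to_fill_link = max(1, link_bandwidth // per_task_throughput)
    n = min(workers_to_fill_link, max_subtasks, len(dirty_blocks))
    chunk = math.ceil(len(dirty_blocks) / n)
    return [dirty_blocks[i:i + chunk]
            for i in range(0, len(dirty_blocks), chunk)]


def estimate_seconds(n_dirty_blocks: int, block_size: int,
                     link_bandwidth: int, per_task_throughput: int,
                     n_subtasks: int) -> float:
    """Rough copy time: total bytes divided by the achievable rate."""
    rate = min(link_bandwidth, n_subtasks * per_task_throughput)
    return n_dirty_blocks * block_size / rate


dirty = list(range(10))  # ten changed blocks, hypothetical
tasks = plan_subtasks(dirty, link_bandwidth=400, per_task_throughput=100)
print(len(tasks), tasks)
print(estimate_seconds(len(dirty), 40, 400, 100, len(tasks)))
```

Under this rule a single subtask would use only a quarter of the link in the example, while four concurrent subtasks fill it, which is the sense in which splitting "better matches the replication bandwidth".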

No impact on primary disk performance

Thanks to the storage system's efficient indexing and high-performance consistency-point generation, asynchronous replication has a negligible impact on the performance of the cloud disk at the primary site. The primary disk still fully meets its officially published performance specifications.

Second-level RTO

Traditional backup services usually store data on an external system such as OSS. When the data is needed, a disk must be created from the snapshot in OSS and the data loaded onto it, so the RTO is long (usually minutes or longer). Asynchronous replication instead periodically writes data to the secondary cloud disk. The secondary cloud disk cannot be read or written during the replication phase, but it is available immediately after a failover, so the RTO can be as low as seconds.

Summary and outlook

This article described the DR products on the cloud, from which an enterprise can select an appropriate DR solution based on its own characteristics, and analyzed the technical architectures of traditional and on-cloud asynchronous replication products. Alibaba Cloud Block Storage implements asynchronous replication of block storage based on its own architecture. Compared with traditional remote DR solutions, the block storage DR solution has the following advantages:

Low cost: No VM binding is required. Users only need to purchase cloud disks at the DR site, not standby VMs. During DR, users can purchase VMs as the situation requires, which greatly reduces operating costs.

Ease of use: No agent plug-in is required in user VMs, so applications are unaware of replication and there are no requirements on the host operating system version. The service is purchased on demand, is simple to use, and supports one-click switchover and one-click creation of DR test disks.

High availability: All DR components adopt a high-availability (HA) design to ensure that the DR system can perform a switchover in a disaster scenario.

Quick service recovery: The service recovery time (RTO) can be as low as seconds.

Very low performance overhead: Primary disk performance is almost unaffected during replication.

The block storage asynchronous replication product aims to give users an efficient, easy-to-use, low-cost remote DR solution. We will continue to enrich its features, such as consistency replication groups, shared disks, and deduplication and compression of data on the replication link, to broaden the usage scenarios, reduce user cost, and provide a reliable, easy-to-use, low-cost DRaaS service. Stay tuned.

Author: Li Weiwei, Alibaba Cloud Storage


This article is original content of Alibaba Cloud and may not be reproduced without permission.