Author: Tina

Even as Internet technology advanced through 2021 and cloud adoption became more common, outages showed little sign of abating.

In October, Facebook, which has 3 billion users, suffered a massive outage, and it took about seven hours to bring most of its services back online. It was described as the worst outage in Facebook’s history and wiped about $47.3 billion off the company’s market value overnight.

Earlier, a Chinese video website went down because of an equipment room failure. Large numbers of its users “migrated” to other websites, and the resulting traffic spike caused a chain of crashes on those platforms. Salesforce, which has more than 150,000 customers, suffered a five-hour global outage, and Roblox, an online gaming platform, was down for 73 hours...

In theory, Internet technology has developed to the point where “never going down” is achievable, so why are there still so many large-scale, long-lasting system failures? And how can downtime be reduced? InfoQ spoke with Alibaba Cloud’s global high availability technology team about how to ensure business continuity in complex systems.

Behind the frequent outages

With the rapid development of cloud computing, more and more “national-level applications” used by huge numbers of people are being created. However, traditional disaster recovery architecture cannot meet the need for rapid service recovery.

Statistics show that 96% of enterprises have experienced at least one system outage in the past three years. For small businesses, an hour of downtime costs an average of $25,000. For large businesses, the average cost can be as high as $540,000. These days, the longer the downtime, the greater the chance of permanent damage.

However, outages are unpredictable, which is why they are known as the “black swans” of a system. Zhou Yang, head of Alibaba Cloud’s global high availability technology team, said that large-scale Internet system architectures are becoming increasingly complex and their stability risks are growing accordingly, so some undiscovered black swans will always be lurking in the system.

While it is impossible to predict when a black swan will appear, failures can be classified, and defenses can be built against specific classes of problem. Today’s disaster recovery architecture, for example, is one such defense, aimed mainly at fault scenarios at the equipment room level.

From the IDC (Internet data center) perspective, equipment room fault scenarios include network failure at the entrance of the equipment room, network failure between equipment rooms, and power failure in the equipment room. At the application layer, faults can be classified into access gateway faults, service application faults, and database faults. The cause may be a software bug or a hardware fault, such as a cabinet power failure or an access switch fault.

The disaster recovery (DR) architecture aims to recover services quickly and meet the RTO and RPO in the event of a single server failure.

The RTO (Recovery Time Objective) is the maximum amount of time a user is willing to accept for recovering from a disaster. In general, the larger the amount of data, the longer recovery takes.

The RPO (Recovery Point Objective) is the maximum amount of data loss that can be tolerated in the event of a disaster, expressed as a period of time. For example, if the user can afford to lose one day of data, the RPO is 24 hours.
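
As a rough illustration of these two definitions, the following Python sketch checks a hypothetical outage against RTO and RPO targets. The function name `meets_objectives`, the timestamps, and the 2-hour RTO are assumptions made for the example; they do not come from any tool mentioned in this article.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, failure, restored, rto, rpo):
    """Return (rto_met, rpo_met) for one outage.

    Data written after the last good backup is lost, so the achieved
    recovery point is failure - last_backup. The achieved recovery
    time is restored - failure.
    """
    data_loss_window = failure - last_backup
    downtime = restored - failure
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical incident: nightly backup at 00:00, failure at 18:00,
# service restored at 20:30.
last_backup = datetime(2021, 10, 4, 0, 0)
failure = datetime(2021, 10, 4, 18, 0)
restored = datetime(2021, 10, 4, 20, 30)

rto_met, rpo_met = meets_objectives(
    last_backup, failure, restored,
    rto=timedelta(hours=2),   # assumed target: back online within 2 hours
    rpo=timedelta(hours=24),  # the one-day RPO from the example above
)
print(rto_met, rpo_met)
```

In this made-up incident, the 18 hours of lost data fit within the 24-hour RPO, but the 2.5-hour recovery misses the assumed 2-hour RTO.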

RTO and RPO

The disaster recovery industry provides three defense modes for different types of faults: data-level, application-level, and service-level. The mainstream architecture in the industry is standby disaster recovery, which is a data-level DR solution. The disaster recovery data center does not carry traffic in normal operation, so the integrity and running status of its application services are unknown. At the critical moment when a fault occurs, the operator therefore faces the question of whether it is safe to switch services over.

In some enterprises, services cannot be fully restored because the application state in the standby equipment room is incorrect. The actual recovery time or data loss then far exceeds the RTO or RPO, and the result is a major “downtime” event.

AppActive grew out of Alibaba’s own practice

In 2021, many well-known companies and cloud platforms in China and abroad experienced serious service interruptions and outages, which sounded the alarm for enterprises, and more and more of them put disaster recovery on the agenda. To keep costs under control, support a future move to multi-cloud architecture, and make disaster recovery dependable, many enterprises are choosing multi-active disaster recovery.

With multi-active disaster recovery, service traffic can be switched within minutes when a disaster occurs, and users may not even notice it. There are three typical architectures for different deployment scenarios: intra-city application multi-active, where the equipment rooms are in the same city and less than 100 km apart; remote application multi-active, where the equipment rooms are more than 300 km apart; and hybrid cloud application multi-active, for hybrid cloud deployments. In multi-active mode, no resources sit idle or go to waste. In addition, the capacity limit of equipment rooms in a single region is overcome, so capacity can be expanded across regions.

Alibaba has practiced multi-active disaster recovery internally for many years. As early as 2007 to 2010, it adopted an intra-city multi-active architecture to support business capacity and availability.

In 2013, because equipment room capacity was limited and the Hangzhou equipment room faced the risk of power restrictions, Alibaba began to explore a remote multi-active architecture, which later became known as “unitization”. In 2014, pilot verification of the unitized architecture was completed. In 2015, the architecture officially went live across three regions and four centers thousands of miles apart, giving Alibaba production-grade remote multi-active capability. In 2017, a traffic switchover was completed at midnight on Double 11.

In 2019, Alibaba’s systems moved fully onto the cloud, and the remote multi-active architecture moved with them, incubating into the cloud-native product AHAS-MSHA, which serves both Alibaba and its cloud customers. It has helped more than ten large enterprises in industries such as digital government, logistics, energy, communications, and the Internet build application multi-active architectures, including Cainiao Rural’s intra-city application multi-active deployment, Unicom’s new customer service remote application multi-active deployment, and Huitongda’s hybrid cloud application multi-active deployment.

In the interview, the general feeling of the Alibaba Cloud global high availability technology team was that “the industry has no unified understanding of multi-active, and does not pay enough attention to it.”

First, different people define “multi-active” differently. Everyone says their system is “multi-active”, but when a failure comes, it turns out that it is not. Second, some enterprises do not know about remote multi-active at all, and some that do know about it believe it is expensive and hard to put into practice. And once some enterprises understand multi-active, their instinct is to invest their own resources in technical pre-research and to resist adopting commercial products from cloud vendors.

These misunderstandings of “multi-active” lead users to misuse it or not use it at all, so they miss out on the stability dividend it brings.

In the view of the Alibaba Cloud global high availability technology team, application multi-active will become the trend in cloud-native disaster recovery, and rather than wait for the trend to arrive, it is better to drive the development of application multi-active through open source. Through open source collaboration, the team hopes to form a set of technical specifications and standards for application multi-active, making the technology more usable, general, stable, and extensible.

On January 11, 2022, Alibaba Cloud officially open-sourced the AHAS-MSHA code under the name AppActive. This was the first time the concept of “application multi-active” had been put forward in the open source field.

Project address: github.com/alibaba/App…

AppActive: Alibaba Cloud open-sources the industry’s first application multi-active project and builds cloud-native disaster recovery standards together with the community

Implementation and future planning of AppActive

Alibaba Cloud also open-sourced its chaos engineering project, ChaosBlade, in 2019, aiming to help enterprises solve high availability problems during cloud-native adoption through chaos engineering. AppActive leans toward defense, while ChaosBlade leans toward offense. Combining attack and defense forms a more complete and practical mechanism for safe production.

ChaosBlade: github.com/chaosblade-…

ChaosBlade: From Chaos Engineering Experimental Tool to Chaos Engineering Platform

AppActive is designed to let multiple sites serve production traffic at the same time. Achieving this poses several technical challenges, such as consistency of traffic routing, consistency of data reads and writes, and consistency of multi-active operations and maintenance.

To meet these challenges, the Alibaba Cloud global high availability technology team abstracted the various technology stacks and defined standard interfaces.

Zhou Yang explained that they abstract AppActive into three parts: the application layer, the data layer, and the cloud platform:

The application layer is the main path of service traffic and includes access gateways, microservices, and message components. Its core task is to keep traffic routing globally consistent, ensuring that traffic is routed correctly by correcting routing errors layer by layer. The access gateway sits at the traffic entrance of the equipment room and is responsible for layer 7 traffic scheduling: it identifies service attributes in the traffic and corrects routing according to the traffic rules. Microservices and message components, whether invoked synchronously or asynchronously, use route correction, traffic protection, and fault isolation to ensure that traffic reaches the correct equipment room for business logic processing and data reads and writes.
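
The layer-by-layer route correction described above can be sketched roughly as follows. This is a minimal illustration and not AppActive’s actual implementation: the unit names, the modulo sharding rule, and the functions `unit_for_user` and `handle_at_gateway` are all hypothetical.

```python
UNITS = ["unit-hangzhou", "unit-shanghai"]  # hypothetical unit (site) names


def unit_for_user(user_id: int, active_units=UNITS) -> str:
    """Map a user to the unit that owns their traffic (assumed rule: modulo)."""
    return active_units[user_id % len(active_units)]


def handle_at_gateway(user_id: int, local_unit: str, active_units=UNITS) -> str:
    """Decide what a gateway in `local_unit` does with a request.

    If the request arrived at the unit that owns the user, process it
    locally; otherwise correct the route by forwarding to the owner.
    """
    owner = unit_for_user(user_id, active_units)
    if owner == local_unit:
        return f"process locally in {local_unit}"
    return f"forward to {owner}"


print(handle_at_gateway(42, "unit-hangzhou"))  # 42 % 2 == 0 -> owner is hangzhou
print(handle_at_gateway(43, "unit-hangzhou"))  # 43 % 2 == 1 -> owner is shanghai

# Failover: with one unit removed from the active list, all traffic
# resolves to the surviving unit.
print(handle_at_gateway(43, "unit-shanghai", active_units=["unit-shanghai"]))
```

In a real system the same ownership check would be repeated at the microservice and message layers, so a request that slips past one layer can still be corrected at the next.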

The core of the data layer is data consistency. Data consistency protection, data synchronization, and data source switchover are used to prevent dirty writes and to provide disaster recovery for data.
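
One way to picture the dirty-write protection mentioned above is a guard in front of the database that rejects writes for data the local unit does not own, or while a switchover is in progress. The class, method, and unit names below are hypothetical and only illustrate the idea; they are not taken from AppActive.

```python
class WriteRejected(Exception):
    """Raised when a write would risk a dirty write."""


class WriteGuard:
    def __init__(self, local_unit: str, owner_of: dict):
        self.local_unit = local_unit
        self.owner_of = owner_of        # shard id -> unit that may write it
        self.frozen_shards = set()      # shards mid-switchover: no writes

    def check(self, shard: int) -> None:
        if shard in self.frozen_shards:
            raise WriteRejected(f"shard {shard} is switching over")
        if self.owner_of.get(shard) != self.local_unit:
            raise WriteRejected(f"shard {shard} is not owned by {self.local_unit}")

    def switch_owner(self, shard: int, new_owner: str) -> None:
        """Move write ownership; writes are frozen while the move happens."""
        self.frozen_shards.add(shard)
        # A real system would wait here until data synchronization has
        # caught up before letting the new owner accept writes.
        self.owner_of[shard] = new_owner
        self.frozen_shards.discard(shard)


guard = WriteGuard("unit-a", {0: "unit-a", 1: "unit-b"})
guard.check(0)                      # allowed: unit-a owns shard 0
try:
    guard.check(1)                  # rejected: shard 1 belongs to unit-b
except WriteRejected as err:
    print(err)

guard.switch_owner(1, "unit-a")     # e.g. unit-b has failed
guard.check(1)                      # now allowed
```

Freezing a shard during the ownership change is the design choice that matters here: it keeps two units from both believing they may write the same data at the same moment.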

The cloud platform is the foundation that supports running service applications. Cloud usage may take the form of self-built IDCs, multi-cloud, hybrid cloud, or heterogeneous-chip clouds. Disaster recovery at the cloud platform level requires multi-cloud integration and data interconnection, on top of which the disaster recovery and restoration capabilities of the cloud platform and the PaaS layer of cloud services are built.

How application multi-active responds to six major types of disaster failure

AppActive is currently at version v0.1. The open source content includes all the standard interface definitions for the application layer and data layer on the data plane, along with basic implementations based on Nginx, Dubbo, and MySQL. Developers can use the current capabilities to run and verify the basic functions of application multi-active.

In the short term, AppActive’s roadmap will follow the application multi-active standards and make AppActive more complete, including the following:

1. Enrich the plug-ins for the access layer, service layer, and data layer, adding more technical components to the AppActive support list.

2. Extend the standards and implementations of application multi-active, for example by adding standards and an implementation for multi-active messaging.

3. Build the AppActive control plane to make AppActive more complete.

4. Extend AppActive under the application multi-active LRA standard to support the intra-city multi-active form.

5. Extend AppActive under the application multi-active HCA standard to support the hybrid cloud multi-active form.

In the future, Alibaba Cloud will continue to refine AppActive and strive to make it the best practice for the application multi-active standards, meeting the strict availability requirements of large-scale production. It will also follow the trend of cloud development, explore distributed cloud, and work toward full coverage of cross-cloud, cross-platform, and cross-region application multi-active scenarios.

As the consensus of “no disaster recovery, no cloud” gradually takes hold, Alibaba Cloud hopes to help more enterprise application systems build the escape capability needed to deal with disaster failures, and hopes to build application multi-active disaster recovery standards together with developers in the GitHub community.