Author: GitHub @zhongxig, project lead of AppActive, from the Alibaba Cloud cloud-native high availability architecture team, working on the development and open-sourcing of disaster recovery and fault recovery architecture.

Abstract: After Sentinel and ChaosBlade, the high availability architecture team has officially open-sourced its third major high availability product: AppActive, an application multi-active middleware. Together the three form a high availability troika that helps enterprises build stable, reliable production systems and strengthens their ability to handle disaster tolerance, fault tolerance, and capacity problems.

On January 11, at the Cloud Native Practice Summit in Shanghai, Alibaba Cloud Intelligence researcher Ding Yu released the “Application Multi-Active Technology White Paper”. At the same time, to promote disaster recovery across the industry and establish a standard for cloud-native business disaster recovery, Alibaba Cloud open-sourced its application multi-active middleware: AppActive.

What is AppActive

“What if the equipment room's resources are unavailable? What if the equipment room goes down? What if the business suddenly collapses? What if a typhoon or an earthquake cuts the power?”

In 2013, shortly after Taobao completed its online shopping spree, the scale of Singles' Day soared well beyond the previous year's. Alibaba engineers faced exactly the problems listed above. On the one hand, equipment room resources were very tight and capacity was insufficient; on the other hand, the equipment room risked a power failure because of unusually high temperatures in Hangzhou. The remote multi-active architecture was incubated against this background, and it took shape as the group-internal edition of UnitRouter & UnitBrain.

As Taobao's business grew, the architecture evolved from two nearby equipment rooms in the same city, to long-distance remote multi-active, and then to three regions with four units and beyond, accumulating rich experience in equipment-room-level application multi-active.

In 2019, the Alibaba system moved fully onto the cloud, and the remote multi-active architecture followed, incubating AHAS-MSHA to serve both the group and customers on the cloud.

On January 11, 2022, the AHAS-MSHA code was officially open-sourced under the name AppActive.

AppActive is an open source middleware for building a cloud-native, highly available, multi-active disaster recovery architecture for business applications. Its main benefits are as follows:

  • Minute-level RTO. Recovery is fast: the average recovery time is under 30 seconds for Alibaba's internal production systems and under 1 minute for external customers' production systems.

  • Full utilization of resources. No resources sit idle: every equipment room and every resource carries traffic, which avoids waste.

  • High switching success rate. Built on a mature multi-active technical architecture and a visual operations and maintenance platform, it achieves a higher switching success rate than existing disaster recovery architectures. Across thousands of switches per year within Alibaba, the success rate exceeds 99.9%.

  • Precise traffic control. Application multi-active keeps traffic closed within a unit from top to bottom, and its precise traffic steering directs specific business traffic into the corresponding equipment room. Building on this, enterprises can develop capabilities such as full-link gray release and guaranteed handling of key traffic.

Why open source

After nearly 9 years of production use serving Alibaba Group and more than 2 years of commercial iteration serving customers on the cloud, AHAS-MSHA has been deployed in the disaster recovery scenarios of more than 10 large enterprises, Alibaba included. Its usage continues to grow, and the stability and features of the code have been thoroughly tested.

In 2021, many well-known companies and cloud platforms in China and abroad suffered serious service interruptions and outages. These events sounded the alarm, and more and more enterprises have put disaster recovery on their agenda. To control costs, support a future move to multi-cloud architecture, and make disaster recovery dependable, many enterprises choose multi-active disaster recovery.

However, the industry has no shared understanding of “multi-active”, and different enterprises define it differently. Many enterprises believe they have already achieved it, only to find when a failure strikes that their systems can barely escape the fault and that business recovery cannot be decoupled from fault diagnosis. Production is dragged down, and the enterprise faces negative public opinion and financial loss. Other enterprises, once they understand “multi-active”, instinctively commit internal resources to technical trials of their own, but for lack of experience they often waste people and money on repeated attempts.

With the development of cloud-native technology, more and more customers are building their systems on it, and building a stable, highly available system on cloud native is a core challenge. A mistaken understanding of “multi-active” drives up infrastructure, application transformation, and operations and maintenance costs, yet the result is inefficient, misused, or even unused, so the enterprise never enjoys the stability dividend that “multi-active” brings. “Multi-active” therefore needs a reasonably unified standard and shared understanding that deepens users' grasp of it and so improves the stability of business systems.

Given the current state of cloud-native development and market understanding, the project lead of AppActive, @zhongxig, said that open-sourcing and explaining application multi-active can provide an initial definition of the “multi-active” standard and its implementation, helping developers reach a shared understanding. Enterprises building a multi-active architecture can draw on the mature experience already embodied in application multi-active and avoid unnecessary waste of resources. In turn, the differing business scenarios and strengths of those enterprises will push application multi-active to improve further and to mature in form and capability. The hope is that, with the power of the community, “multi-active” becomes a genuinely accessible technology rather than one so daunting that only a few can adopt it, helping more enterprises and individuals build production-grade highly available architectures.

Open source content

Introduction to the AppActive standard

The standard defines four forms of application multi-active: LRA (same-city multi-active), UDA (remote multi-active), HCA (hybrid cloud multi-active), and BFA (business traffic multi-active), as detailed in the Application Multi-Active Technology White Paper. AppActive v0.1 prioritizes the basic capabilities of BFA and UDA; later versions will add LRA and HCA while continuing to improve BFA and UDA. This article focuses on BFA and UDA.

1. BFA (Business Flow Active)

BFA means that application multi-active ultimately shows up at the business level: the multi-active disaster recovery system allocates production traffic at fine granularity based on business characteristics.

For the BFA criteria, AppActive supports automatic traffic correction and forced routing to the designated equipment room, where the request is then handled in a closed loop. This is fine-grained traffic allocation.

When misrouted traffic flows into an equipment room, plug-ins at every layer of that equipment room handle it according to unified scheduling rules:

  • The access layer identifies incorrect traffic and automatically corrects it to the correct equipment room.

  • The service layer identifies incorrect traffic and automatically corrects it to the correct equipment room.

  • The data layer identifies incorrect traffic and throws an exception to protect data quality.
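
The layered handling above can be sketched as follows. This is a minimal Python illustration, not AppActive's actual plug-in code: the `Request` type, the `handle` function, the unit names, and the modulo routing rule are all hypothetical, chosen only to show that the access and service layers forward misrouted traffic while the data layer rejects it.

```python
from dataclasses import dataclass


class WrongUnitError(Exception):
    """Raised when the data layer sees traffic that belongs to another equipment room."""


@dataclass(frozen=True)
class Request:
    route_id: int  # hypothetical routing key, for example a user ID


def target_unit(req: Request, units: list[str]) -> str:
    # Assumed rule for illustration only: routing key modulo the number of units.
    return units[req.route_id % len(units)]


def handle(layer: str, req: Request, local_unit: str, units: list[str]) -> tuple[str, str]:
    """Return (action, unit) for a request arriving at `layer` in `local_unit`."""
    correct = target_unit(req, units)
    if correct == local_unit:
        return ("process", local_unit)
    if layer in ("access", "service"):
        # Access and service layers correct the traffic by forwarding it.
        return ("forward", correct)
    if layer == "data":
        # The data layer refuses the request instead of risking dirty data.
        raise WrongUnitError(
            f"route_id {req.route_id} belongs to {correct}, not {local_unit}"
        )
    raise ValueError(f"unknown layer: {layer}")
```

With units `["unit-a", "unit-b"]`, a request with `route_id=3` arriving at the access layer of `unit-a` is forwarded to `unit-b`, while the same request reaching the data layer of `unit-a` raises `WrongUnitError`.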

2. UDA (Ultra Distance Active)

UDA means that the business system still delivers good access performance when the equipment rooms are more than 300 km apart, and that RTO and RPO are at the minute level once the system enters the disaster recovery state.

For the UDA criteria, AppActive supports the requirement for good access performance.

The access layer parses incoming requests and sends the traffic to application machines in the appropriate equipment room. On the application side, the Servlet, Dubbo, and MySQL plug-ins keep each business request closed within a single equipment room, all the way down to reads and writes against the database in that equipment room.

In a remote scenario, the business system still performs well because traffic stays confined within the equipment room.
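
A minimal sketch of this in-unit closure, under stated assumptions: the header name `x-route-id`, the range-based `RULE`, and the unit names are hypothetical and do not reflect AppActive's real rule format. The access layer resolves the unit once, and every later hop (Servlet, Dubbo, MySQL) stays in that unit.

```python
# Hypothetical routing rule: half-open ranges of route IDs mapped to units.
RULE = [
    (range(0, 5000), "unit-a"),
    (range(5000, 10000), "unit-b"),
]
HOPS = ("gateway", "servlet", "dubbo", "mysql")


def parse_route_id(headers: dict[str, str]) -> int:
    # Assumed: the routing key travels in a request header and is folded into 0..9999.
    return int(headers["x-route-id"]) % 10000


def resolve_unit(route_id: int, rule=RULE) -> str:
    """Find the unit whose range contains the route ID."""
    for id_range, unit in rule:
        if route_id in id_range:
            return unit
    raise LookupError(f"no unit owns route_id {route_id}")


def plan_hops(headers: dict[str, str], rule=RULE) -> list[str]:
    """Resolve the unit at the access layer, then keep every hop inside it."""
    unit = resolve_unit(parse_route_id(headers), rule)
    return [f"{hop}@{unit}" for hop in HOPS]
```

Because no hop in this sketch leaves the resolved unit, the request never crosses the long-distance link between equipment rooms.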

RPO in the disaster recovery state is guaranteed by open source data synchronization components or commercial synchronization tools. For RTO, AppActive 0.1 provides only basic traffic switching; later versions will evolve it into a production-grade RTO guarantee tool.

Introduction to the AppActive modules

AppActive is both a definition and an implementation of application multi-active, covering the data plane and the control plane. The data plane is divided into four parts, each of which adds capabilities as plug-ins without changing the technical components an enterprise already uses:

  • Access gateway. As the first hop of business traffic into the equipment room, the access gateway identifies and distributes traffic at the entry point of application multi-active. It has two core capabilities: equipment room routing and application routing.

  • Service layer. Handles synchronous invocations of business traffic within and across equipment rooms. It generally involves a Consumer, a Provider, and a registry, and provides three core capabilities: traffic routing, traffic protection, and fault isolation. These prevent dirty data writes caused by misrouted calls and speed up business recovery during traffic switching.

  • Message layer. Handles asynchronous invocations of business traffic within and across equipment rooms, using messages for peak shaving and valley filling. It generally involves producers, consumers, and brokers, and provides three core capabilities: traffic routing, traffic protection, and fault isolation. These prevent dirty data writes caused by misdelivered messages and keep messages from being lost during traffic switching.

  • Data layer. Covers data reads and writes by business applications, data storage, and data synchronization. It has three core capabilities: traffic routing, data consistency protection, and data synchronization.

The control plane covers day-to-day maintenance of the multi-active disaster recovery rules and traffic switching in disaster scenarios.
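
As a rough illustration of traffic switching in a disaster scenario, the sketch below reassigns every route ID range owned by a failed unit to a surviving one and bumps a rule version. The rule layout, the `version` field, and the function name `switch_traffic` are assumptions made for this example; AppActive's actual rule format and switching procedure are not described in this article.

```python
from copy import deepcopy


def switch_traffic(rule: dict, failed_unit: str, fallback_unit: str) -> dict:
    """Return a new rule in which ranges owned by failed_unit point at fallback_unit.

    The input rule is left untouched so the previous version can be restored.
    """
    if failed_unit == fallback_unit:
        raise ValueError("fallback unit must differ from the failed unit")
    new_rule = deepcopy(rule)
    for entry in new_rule["ranges"]:
        if entry["unit"] == failed_unit:
            entry["unit"] = fallback_unit
    new_rule["version"] = rule["version"] + 1
    return new_rule


# Hypothetical rule: route IDs 0..4999 go to unit-a, 5000..9999 to unit-b.
current = {
    "version": 1,
    "ranges": [
        {"start": 0, "end": 4999, "unit": "unit-a"},
        {"start": 5000, "end": 9999, "unit": "unit-b"},
    ],
}
after_failover = switch_traffic(current, failed_unit="unit-b", fallback_unit="unit-a")
```

Returning a new versioned rule instead of mutating the old one is a design choice of this sketch: it lets each data plane layer tell which rule is newer and makes switching back straightforward.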

AppActive is currently at v0.1, which open-sources:

  • A basic implementation of the definitions for every data plane layer described above.

  • An Nginx plug-in implementation for the access layer gateway.

  • A Dubbo 2.x plug-in implementation for the service layer.

  • An open source MySQL plug-in implementation for the data layer.

  • Basic traffic switching capability for the control plane.

With the capabilities in v0.1, developers can run and verify the basic functions of application multi-active.

AppActive roadmap

  1. Enrich the access layer, service layer, and data layer plug-ins so that more technical components join the list AppActive supports.

  2. Add message layer plug-ins to support multi-active for message-based applications.

  3. Add standards and implementations for the other layers of application multi-active.

  4. Provide a web console and, following the UDA standard, improve RTO.

  5. Support the hybrid cloud multi-active form, following the HCA standard.

  6. Support the same-city multi-active form, following the LRA standard.

The starting point

“Remote multi-active” and “unitization” both originated at Alibaba and have been recognized by the industry. Alibaba has always hoped that the application multi-active product ecosystem can become standardized and open, as a contribution to the industry.

With technology based on the application multi-active standard, business applications can interoperate across different cloud vendors, different infrastructures, and different chips. They can make full use of resources and achieve an RTO of minutes or even seconds, so that failures need not be feared.

Today's first open source release of AppActive is just a starting point for the application multi-active ecosystem. To learn more about AppActive, join the AppActive open source discussion group at 34222602.

Click here to download the Application Multi-Active Technology White Paper now.