Chaos engineering evangelist; founder and maintainer of the ChaosBlade community and its commercial products.

On December 7, 2021, the “Chaos Engineering Technology Salon: Financial Industry Boutique”, sponsored by the China Academy of Information and Communications Technology (CAICT) and organized by the Chaos Engineering Laboratory, was held in Beijing. Qionggu, a technical expert from Alibaba Cloud, gave a talk titled “From Party A to Party B: How to Implement Chaos Engineering Well in Specific Industries”.

Chaos engineering has gradually become an important means for enterprises to improve stability

As enterprise systems go cloud native, applications are released and iterated faster and faster, but distributed systems are also growing more complex, and failures are frequent. Incidents such as the Google Cloud authentication system outage caused by an internal storage quota problem, the failure of an AWS streaming data processing service, and a cloud service outage lasting 5 hours have all had a major impact.

Because the risk of unpredictable system behavior keeps rising, the stability of complex systems on the cloud is hard to guarantee, and chaos engineering has gradually become an important approach for enterprises seeking business continuity. Through chaos engineering, enterprises try to ensure that distributed systems in production remain resilient under out-of-control conditions.

Industry standards for chaos engineering are also maturing. Standards released by the China Academy of Information and Communications Technology, such as the “Chaos Engineering Platform Capability Requirements”, the “Chaos Engineering Maturity Model”, and the “Chaos Engineering Stability Measurement Model”, are driving the field forward. The academy has also established the Chaos Engineering Laboratory jointly with a number of companies to accelerate the development of chaos engineering in China.

Party A and Party B differ in demand and supply, which makes industry-specific adoption of chaos engineering difficult

Since the principles of chaos engineering were put forward, chaos engineering platforms and services have emerged at Internet companies and cloud vendors in China and abroad. The nature of Internet companies has led these platforms to focus on productization, production environments, experimental exploration, and cloud native. According to the “China Chaos Engineering Survey Report” released by the China Academy of Information and Communications Technology together with several enterprises, Party A and Party B choose technology products differently as they adopt chaos engineering. Party B (the service supplier) is more inclined toward self-developed platforms, while Party A (the service consumer) is more inclined toward commercial platforms and, when choosing one, pays more attention to product maturity, industry case studies, security, and control over the technology.

Party A customers also face many practical questions: how to identify scenarios, how to build environments, how to evaluate stability, how to control the impact scope, how to define the experiment process, how to coordinate across the organization, how to demonstrate value, how to account for industry characteristics, and how to integrate with the existing operations system. Technology alone is therefore not enough. Services are also needed, combining industry characteristics with chaos engineering practice modes so that chaos engineering can actually be put into practice.

So how can this be done? Building on the evolution of chaos engineering inside Alibaba, Alibaba offers chaos engineering capabilities across internal group use, commercial products, and open source. In response to the difficulties industry customers face when implementing chaos engineering, it provides a mature, industry-specific chaos engineering solution.

Alibaba's industry-specific chaos engineering solution offers a community edition and an enterprise edition

Alibaba's industry-specific chaos engineering solution consists of two parts: platform technology and services. The platform technology part includes the community edition and the enterprise edition of the chaos engineering platform. The community edition is fully open source and maintained by the community. The enterprise edition is available as public cloud SaaS or as a private cloud deployment. Compared with the community edition, the enterprise edition addresses enterprise requirements for scale, scenario coverage, security, and controllability.

Chaos Engineering Platform community edition

Chaos Engineering Platform community edition is an open source, general-purpose chaos engineering platform that supports multiple clusters, environments, and languages. It aims to solve the problem of how users get started with chaos engineering. Multiple environments can be configured on the platform to isolate resources. Each environment supports resource management and fault injection across multiple hosts, clusters, and containers, and supports applications written in Java, Golang, C++, and other languages running on these resources. The platform also hosts mainstream chaos engineering tools such as ChaosBlade, Chaos Mesh, and LitmusChaos. These tools can be deployed from the platform with one click and are exposed through a unified experiment interface, so the experiment scenarios they provide can be used directly on the platform. In addition to tool hosting, the platform provides scenario management, multi-dimensional drills, process orchestration, steady-state detection, drill safeguards, drill reports, and multi-tenancy, as well as an OpenAPI for external integration. The community edition integrates closely with CNCF ecosystem projects such as Prometheus and Helm.

Chaos Engineering Platform Enterprise edition

Chaos Engineering Platform enterprise edition is positioned to provide large-scale, scenario-oriented, automated, and secure product capabilities, covering full-stack scenarios across IaaS, PaaS, and SaaS. It is built on the core of the community edition and offers a one-click upgrade from the community edition to the enterprise edition. It adapts to and integrates with the existing operations systems of industry customers, such as full-link stress testing systems, environment technology, unitized disaster recovery platforms, contingency plan systems, and observability systems. Integration with these platforms helps address blast radius control, steady-state assessment, automated experiment execution, and similar problems.
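
The blast radius concern mentioned above can be illustrated with a small sketch. It is not the enterprise edition's implementation: the function name `pick_targets`, the `max_fraction` cap, and the host names are all hypothetical, chosen only to show how an experiment might be limited to a bounded share of hosts.

```python
import math
import random
from typing import List, Optional


def pick_targets(hosts: List[str], max_fraction: float,
                 seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of hosts no larger than max_fraction of the fleet.

    Hypothetical helper: at least one host is chosen when the fleet is
    non-empty, so that a small fleet can still be exercised.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not hosts:
        return []
    limit = max(1, math.floor(len(hosts) * max_fraction))
    rng = random.Random(seed)  # seeded so a drill plan can be reproduced
    return sorted(rng.sample(hosts, limit))


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(10)]
    print(pick_targets(fleet, max_fraction=0.25, seed=7))
```

Capping the fraction of affected hosts is only one way to bound impact; a real platform would likely also scope by environment, cluster, or traffic.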

The application architectures of industry customers are typically multi-language, multi-platform, heterogeneous-cloud, and multi-vendor. Chaos Engineering Platform enterprise edition can adapt to and integrate with such architectures, which makes it practical to build an integrated chaos engineering platform. Beyond adaptation and integration, the enterprise edition offers more capabilities than the community edition in four respects:

  • Rich drill scenarios: The enterprise edition supports more than 200 fault scenarios, supports cloud services, and is compatible with Windows platforms. It supports one-stop disaster recovery and network disconnection drills, covering steps such as precheck, network disconnection, and recovery, as well as service-level microservice drills.

  • Diverse drill forms: Custom drills and scenarios are supported. Experiments can be saved to an experience library, or experience templates can be created directly, so that drills can be started with one click. Advanced drill plans are available on demand. Visual drills are also supported: a drill can be started with one click from the architecture topology diagram, where the drill status and impact radius can be viewed.

  • Easy-to-use drill platform: The enterprise edition can be used without any changes to business code. It supports automatic architecture discovery, automatic generation of the architecture topology, drill visualization, and more. It also supports a one-click upgrade from the community edition to the enterprise edition to meet enterprise needs.

  • Drill safety: Multiple drill recovery policies are provided. For example, thresholds on service metrics can be configured to control the state of a drill. Fine-grained permission management and control is also provided.
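
The threshold-based recovery policy in the last item can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's API: `Threshold`, `breached`, `run_drill`, and the metric names are hypothetical, and the inject, recover, and metric-reading steps are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Threshold:
    metric: str       # hypothetical metric name, e.g. "error_rate"
    max_value: float  # the drill is stopped once the metric exceeds this


def breached(sample: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return the metrics that exceed their threshold.

    A metric missing from the sample is treated as breached, on the
    assumption that losing visibility is itself a reason to stop.
    """
    names = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None or value > t.max_value:
            names.append(t.metric)
    return names


def run_drill(inject: Callable[[], None],
              recover: Callable[[], None],
              read_metrics: Callable[[], Dict[str, float]],
              thresholds: List[Threshold],
              checks: int) -> str:
    """Inject a fault, poll metrics, and always recover afterwards."""
    inject()
    try:
        for _ in range(checks):
            bad = breached(read_metrics(), thresholds)
            if bad:
                return "aborted: " + ", ".join(bad)
        return "completed"
    finally:
        recover()  # runs on normal completion, abort, or exception
```

Putting the recovery call in `finally` means the fault is removed even if reading metrics fails, which is the property a safety guard most needs.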

General chaos engineering practice mode

Alibaba's chaos engineering practice modes are a general set of practice modes abstracted from many years of chaos engineering practice inside Alibaba, discussions in the open source community, and several enterprise project cases. These practice modes greatly reduce the effort enterprises need to introduce chaos engineering, set goals, and design the organization, and they ensure that chaos engineering is implemented with a clear purpose. The practice modes fall into three categories: business-oriented chaos practice, architecture-oriented chaos practice, and organization-oriented chaos practice:

  • Business-oriented chaos practice is a method that starts from the business perspective. Pattern templates make it possible to run drills quickly, expose problems in the service architecture design, and reduce the impact of sudden faults on the business. The practice modes include the strong and weak dependency mode between services, the capital-loss prevention and control mode for finance and funds, the user experience mode, and the client-side disaster recovery mode. Typical application scenarios include mobile banking and transaction settlement.

  • Architecture-oriented chaos practice is an infrastructure-oriented method that uses pattern templates to identify problems, measure stability, and shorten failure recovery time from the perspective of the users and operators of the infrastructure. The practice modes include the observability mode, which verifies monitoring coverage and effectiveness; the SLA mode, which tests SLIs to verify the service level agreement (SLA) being offered; the disaster recovery mode, which verifies same-city active-active and remote disaster recovery; and the fault recovery mode, which verifies service self-healing. Application scenarios include the distributed transformation of core architectures and the migration of core services to the cloud.

  • Organization-oriented chaos practice is a method for measuring and improving stability from a global perspective. Organizational operations can greatly strengthen the chaos engineering culture, promote team coordination, and improve incident response efficiency. The practice modes include the planned fault drill mode, the red-blue attack and defense mode, and the surprise drill mode in production. Typical use scenarios include assessing incident response against the 1-5-10 standard and driving large stability projects.
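
The 1-5-10 assessment mentioned in the last item can be sketched as a simple check on drill timestamps. The source does not define 1-5-10, so this sketch assumes it means detection within 1 minute, localization within 5 minutes, and recovery within 10 minutes, all measured from fault injection. `DrillTimeline` and `evaluate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed limits in minutes, all measured from the moment of fault injection.
LIMITS_MINUTES = {"detected": 1, "located": 5, "recovered": 10}


@dataclass
class DrillTimeline:
    """Timestamps in seconds for one fault drill (hypothetical record)."""
    injected_at: float
    detected_at: float
    located_at: float
    recovered_at: float


def evaluate(timeline: DrillTimeline) -> Dict[str, bool]:
    """Report, per stage, whether the drill met the assumed 1-5-10 limits."""
    elapsed = {
        "detected": timeline.detected_at - timeline.injected_at,
        "located": timeline.located_at - timeline.injected_at,
        "recovered": timeline.recovered_at - timeline.injected_at,
    }
    return {stage: elapsed[stage] <= limit * 60
            for stage, limit in LIMITS_MINUTES.items()}


if __name__ == "__main__":
    drill = DrillTimeline(injected_at=0, detected_at=45,
                          located_at=280, recovered_at=700)
    print(evaluate(drill))
```

If an organization measures each stage from the end of the previous one instead, only the `elapsed` calculation would need to change.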

Three delivery modes for implementation

Consulting engagements produced the set of chaos engineering practice modes, which address the practical problems of chaos engineering layer by layer. In actual customer delivery, three delivery modes have gradually emerged according to the customer's stage: community edition plus feasibility assessment, enterprise edition plus large-scale rollout, and enterprise edition plus in-depth industry co-construction.

  • Mode 1: Community edition + feasibility assessment (light consulting). The open source chaos engineering platform and the experience of chaos engineering experts are used to introduce chaos engineering into the enterprise quickly and to assess the feasibility of a later, comprehensive rollout.

  • Mode 2: Enterprise edition + large-scale rollout. Deployed on a public cloud or a private cloud, the platform's enterprise-grade features make large-scale rollout within the enterprise possible.

  • Mode 3: Enterprise edition + in-depth industry co-construction. Deployed on a proprietary cloud, this mode uses the proprietary cloud edition's ability to integrate with other systems and to be integrated by them, combining with the customer's existing systems for in-depth co-construction and platform integration.

Community edition + feasibility assessment (light consulting) delivery mode

With the chaos engineering community edition and Alibaba's years of experience within the group and with customers, we can resolve the dilemma of customers not knowing how to start with chaos engineering, introduce chaos engineering quickly, and assess the feasibility of adopting it in the enterprise.

A typical customer case for this mode is as follows. The customer had rebuilt its system on a distributed architecture in order to move workloads off the mainframe. Its self-developed distributed framework needed to have its high availability verified, but the customer had no chaos engineering experience and wanted to use this project to introduce chaos engineering into the enterprise. The customer's goals were clear: provide a chaos engineering methodology and teach the testers how to do scenario analysis; deploy a chaos engineering platform and teach the testers how to extend fault scenarios based on ChaosBlade; and lead the implementation of chaos engineering while teaching the testers the whole implementation process.

Based on the customer's background and goals, we surveyed the customer's technical architecture, business architecture, deployment architecture, and current stability assurance practices, and delivered a stability analysis report that identified stability risk points. We provided chaos engineering training with fault analysis and scenario analysis cases, and led the customer through fault scenario analysis of its self-developed distributed framework. We also guided the customer in developing custom fault scenarios based on ChaosBlade and provided the overall technical plan and implementation plan for chaos engineering.

We then deployed the community edition platform on the customer side, provided and reviewed drill plans based on the analyzed fault scenarios, carried out the fault drills, produced standardized fault drill reports, organized post-drill reviews, and offered suggestions for follow-up chaos engineering planning.

The whole project was delivered in just one month. In that time, dozens of fault scenarios for the self-developed framework were identified, some with guidance and some by the customer independently. Two fault drills were carried out with guidance and many more were run by the customer on its own. The drills uncovered a number of stability problems in areas such as high-availability switchover, fault self-healing, and monitoring and alerting. The project gave the enterprise a feasibility assessment for rolling out chaos engineering at scale.

Enterprise edition + large-scale rollout delivery mode

As the value of chaos engineering has gradually been recognized at the major Internet companies, more and more banks, securities firms, and insurance companies have begun to plan and adopt chaos engineering. Financial enterprises face distributed architecture transformation, cloud technology upgrades, and the complex infrastructure environments of the financial sector, and chaos engineering is a fast and effective technical means of keeping such complex systems stable. When it comes to implementing chaos engineering quickly and effectively, lack of experience is the biggest challenge these financial enterprises face. Because the underlying chaos engineering platform is itself industry-independent, introducing a mature chaos engineering enterprise edition and services through innovation projects has become the first choice of more and more financial enterprises.

The chaos engineering enterprise edition provides rich capabilities for each scenario on a one-stop platform that is easy to use, safe, and controllable. It supports large-scale rollout in the enterprise, uncovers system stability problems, and improves system resilience. The platform can also be used to improve incident response efficiency across fault detection, fault localization, fault handling, and post-fault review. It helps customers shorten the time needed to set up large-scale drills, run drills more efficiently, and protect the return on investment of chaos engineering, so that customers can focus on architecture risk identification and system optimization.

As its business develops and innovates, a leading securities customer has seen the number and complexity of its systems keep growing. Its production operations face risks such as functional defects, insufficient performance capacity, and single points of failure that threaten safe and stable operation, so the performance and stability of its systems needed to be improved at the technical level. Using the mature capabilities of the Alibaba Cloud chaos engineering enterprise edition, the customer carried out regular, large-scale drills, testing the production systems for weaknesses under a large number of fault scenarios ahead of time, in order to identify and eliminate technical risks as early as possible and improve operational reliability. In a one-month dedicated stability assurance project, chaos engineering delivered substantial benefits: 23 types of risk were discovered, with more than 300 problem points; more than 2,000 drills were run, with a peak of nearly 300 drills in a single day; and 300 core systems were covered. On top of the enterprise edition platform, the customer also runs chaos engineering as an organizational operation, including double-random drills, large-screen drill dashboards, production quality analysis reports, and drill data operations.

Enterprise edition + in-depth industry co-construction delivery mode

The chaos engineering enterprise edition can both integrate with other systems and be integrated by them. With this ability, it can be integrated with the customer's monitoring system, change system, test system, emergency response system, CMDB, and other systems. It can also be co-built around the characteristics of industry systems, such as heterogeneous systems, localized systems, and network cloud systems, delivering technical innovation in chaos engineering in areas such as quantitative evaluation, scenario exploration, and experiment automation, to the benefit of both sides.

The standardized capabilities of the chaos engineering enterprise edition are adapted to and integrated with the customer's existing systems to build a chaos engineering platform with industry-specific attributes. This meets the needs of the industry and accelerates the customer's implementation and innovation in the field of chaos engineering.

Learn more

Alibaba is committed to the implementation of chaos engineering and to industry solutions. A variety of practice modes make the implementation of chaos engineering more focused, and a variety of delivery modes serve the different needs of enterprises. You can quickly try the Alibaba Cloud chaos engineering services or ask about chaos engineering solutions through the following links.

1) Open source project address:

https://github.com/chaosblade-io/chaosblade

2) Product experience address:

https://developer.aliyun.com/adc/scenario/e9b27357ab9c4785bc7f43fb62f872e3

3) Solution address:

https://www.aliyun.com/solution/cloudnative/chaosengineering

For more discussion on chaos engineering, you are welcome to join the DingTalk group: search for group number 23177705 to join.