Ji-yeon (Jeong Yeon), Wan Bi (He Ying)

What is chaos engineering, and what are its characteristics in the cloud native wave

By using the services provided by cloud vendors such as Aliyun and AWS, modern service providers can deliver rich software services at lower cost and with greater stability. But is it really that easy? Every major cloud vendor has had its own history of failures within the scope of its SLA commitments, as this sobering list of postmortem reports on GitHub shows [1]. On the other hand, the various cloud products do offer high availability capabilities for users, but these still need to be configured and used correctly.

Chaos engineering helps business service providers identify weak links in production services and drive improvements toward expected SLA targets by creating disruptive events, observing how systems and people respond, and making targeted optimizations. Besides exposing component design problems that need improvement, chaos engineering also helps uncover blind spots in monitoring and alerting, as well as gaps in people's understanding of the system, in emergency response SOPs, and in troubleshooting skills, thus greatly raising the overall availability of the business system and the capabilities of its R&D and operations staff. This is why, after Netflix put forward the concept, major software vendors began practicing it internally and offering it externally as products.

On top of traditional cloud computing, cloud native provides faster, cheaper elasticity and more flexible integration of software and hardware, and has become the fastest-growing direction in cloud computing. Cloud native helps developers dramatically reduce resource and delivery costs so they can win markets faster and better. At the same time, it has fundamentally changed traditional operations and development practices, which means traditional chaos engineering methods must evolve along with it.

Against this cloud native background, how does implementing chaos engineering for the application services running on it differ from the traditional approach? From our extensive practice in Alibaba's e-commerce business and in turning middleware into cloud products, we have observed several main differences.

Given these differences, it is more appropriate to implement chaos engineering with cloud native means: the scenarios are better targeted at cloud native applications, and the resulting capability improvements are greater.

Stages and evolution of chaos engineering implementation

Since chaos engineering brings so many benefits, how should a cloud native application service or system go about implementing it?

From the perspective of drill tooling and execution, an organization's fault drills typically move through several stages of development: manual drills, tool-driven automated drills, regular unattended drills, and production surprise drills.

These stages range from low to high in implementation difficulty, and the corresponding benefits grow accordingly. An organization (a cloud user) can choose the stage that suits its actual situation, and then move up as the scale, complexity, and availability requirements of its business applications grow. Even starting with the simplest manual drills can often lead to significant, long-lasting systematic improvements in availability.

So what are the characteristics of each stage, and how should you choose?

  • Manual drills: Manual fault injection is usually performed in the initial stage of high availability capability building, or as a one-off acceptance test. You manually check whether alarms fire and whether the system recovers. At this stage, you only need some fault injection tools or scripts, which can be reused later.

  • Automated drills: Once high availability capability building reaches a certain stage, you need to periodically check whether those capabilities have degraded, and automated drills come onto the agenda. An automated drill consists of the steps environment preparation > fault injection > verification > environment recovery. By scripting each step, you can assemble a drill flow that runs with one click the next time (a minimal sketch of such a pipeline follows this list).

  • Normalized execution: At the next stage we raise the bar further: we want drills to run chaotically and unattended, which poses new challenges for the system's availability. The system must not only have monitoring and alarms to detect faults, but also a corresponding contingency-plan module responsible for recovery. To run unattended, the system needs to judge fault conditions more intelligently and accurately, and execute the corresponding plan automatically.

  • Production surprise drills: These drills are mostly run in a grayscale environment so they do not affect the business, and they require the system to be able to control the blast radius of faults in production. Their purpose is to find the gaps a grayscale environment misses, related to real business traffic, scale, configuration, and emergency response. Drills in the production environment place higher demands on the system: there must be a set of execution rules, and the system's isolation capabilities must be strong. Most of the work and capability building is verified in the grayscale environment, but production surprise drills remain an effective and necessary means of rehearsal. More realistic scenarios give R&D staff a first-hand feel for failures, let them actually execute contingency plans, exercise their emergency response skills, and build their confidence in and understanding of the system.
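The automated drill flow described above can be tied together with a small script. Below is a minimal sketch, assuming a hypothetical deployment `demo-app` in a hypothetical `drill-canary` namespace and a hypothetical health endpoint; it uses plain `kubectl` calls for injection and verification rather than any specific chaos tool.

```python
# A minimal, illustrative drill pipeline: environment check -> fault injection ->
# verification -> recovery check. All names (namespace, deployment, URL) are hypothetical.
import subprocess
import time
import urllib.request


def run(cmd: list[str]) -> str:
    """Run a kubectl command and return its stdout, raising on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def inject_pod_kill(namespace: str, deployment: str) -> None:
    """Fault injection step: delete one pod of the target deployment."""
    pods = run(["kubectl", "-n", namespace, "get", "pods",
                "-l", f"app={deployment}", "-o", "name"]).split()
    if not pods:
        raise RuntimeError("environment preparation failed: no target pods found")
    run(["kubectl", "-n", namespace, "delete", pods[0]])


def check_recovered(namespace: str, deployment: str, url: str,
                    timeout_s: int = 600) -> bool:
    """Verification step: wait until the deployment is fully available and the
    health endpoint answers, within the agreed recovery budget."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            run(["kubectl", "-n", namespace, "rollout", "status",
                 f"deployment/{deployment}", "--timeout=30s"])
            if urllib.request.urlopen(url, timeout=5).status == 200:
                return True
        except Exception:
            pass
        time.sleep(10)
    return False


if __name__ == "__main__":
    NS, APP = "drill-canary", "demo-app"
    HEALTH = "http://demo-app.drill-canary/healthz"
    inject_pod_kill(NS, APP)
    print("recovered within budget:", check_recovered(NS, APP, HEALTH))
```

Each step here maps to one stage of the pipeline, so the same script can be wired into a one-click job or a CI pipeline once the scenario has been validated manually.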

How to conduct a complete fault drill

When an application is first deployed and scaled on Kubernetes, the initial focus is usually on whether its functionality works; fault drills demand more. Here we assume the current system has passed preliminary acceptance, but how it behaves under certain failure scenarios is still unknown, which is where our fault drill journey begins. As a destructive operation, a fault drill must be carried out step by step according to clear rules and procedures. Below we walk through how an application newly deployed on Kubernetes should run fault drills, covering environment construction, system capability analysis, high availability capability building, and implementation recommendations.

Step 1: Build isolated environments

In a fault drill, especially before the first run, we need to be clear about the environment into which we are injecting faults: could it affect business traffic or cause irreparable damage? Inside Alibaba, we use strict environment isolation and change control to prevent fault injection from impacting business traffic.

We generally classify environments into the following categories:

  • Service test environment: used for E2E testing and full functional acceptance. This environment is isolated from the production network and production traffic, which prevents errant traffic from reaching other environments, so fault tolerance tests can be performed here.

  • Canary environment: this can be understood as a full-link grayscale environment. It contains all the components of the current system and is generally used for joint debugging with upstream and downstream systems and for grayscale testing of the system's internal links; it carries no actual business traffic.

  • Safe-production grayscale environment: this environment receives about 1% of production traffic, with traffic-switching capabilities built in advance so that traffic can be switched back to the production environment quickly if problems occur. It is generally used to run a grayscale with real user traffic for a period of time, avoiding the uncontrollable results of a full release.

  • Production environment: the environment with real user traffic. Any operation on this environment requires strict change review, and changes must first pass grayscale validation in the preceding environments.

Fault drills are generally introduced first in the canary environment: it covers the whole link but carries no real traffic, so it is well suited to building and accepting high availability capabilities and to running normalized drills, repeating scenarios over and over. Scenarios that have matured there can then be run periodically in the grayscale and production environments as genuine surprise drills, with the blast radius under control, serving as the final acceptance of those capabilities.

In general, given cost and system complexity, a business application may not build all four isolated environments for this progression, but we recommend, at least initially, having at least two environments to separate user traffic, including at least one grayscale environment isolated from production. When building environments, pay attention to the following:

  • Isolation: the grayscale environment and production environment should be isolated as much as possible, including but not limited to network isolation, permission isolation, and data isolation. For some disaster recovery capabilities, the two Kubernetes clusters can also be built in different regions.

  • Authenticity: the grayscale environment should be kept as consistent with the production environment as possible, for example in external dependencies and component versions.

Only when the environments meet these standards are the preconditions for running drills satisfied.
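Before any injection, it is also worth enforcing the environment boundary programmatically. The following is a minimal sketch of such a guardrail, assuming a hypothetical labeling convention (`drill-env: canary|grayscale`) on namespaces; it uses the official Kubernetes Python client, and the label key and values are our own convention, not a standard.

```python
# A simple guardrail: refuse to run a drill unless the target namespace is
# explicitly labeled as an isolated drill environment. The "drill-env" label
# and its values are a hypothetical convention for illustration.
from kubernetes import client, config

ALLOWED_ENVS = {"canary", "grayscale"}


def assert_drill_allowed(namespace: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
    ns = client.CoreV1Api().read_namespace(namespace)
    labels = ns.metadata.labels or {}
    env = labels.get("drill-env")
    if env not in ALLOWED_ENVS:
        raise RuntimeError(
            f"namespace {namespace!r} is labeled {env!r}; "
            "fault injection is only allowed in canary/grayscale environments")


if __name__ == "__main__":
    assert_drill_allowed("drill-canary")  # raises if the environment is not isolated
```

Calling a check like this at the start of every drill pipeline makes it much harder for a misconfigured job to inject faults into production by accident.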

Step 2: Fault scenario analysis

When analyzing a system's high availability capabilities, there is no single answer: every system has different weak points and bottlenecks. However, some general lines of thinking help when sorting those capabilities out.

  • Historical faults:

Historical faults usually serve as a textbook for quickly understanding a system's weak points. By analyzing and classifying them, you can quickly find out which components of the current system are most failure-prone.

For example, if the system needs rapid elastic scaling and a failed scale-out could affect business traffic, you can infer that it depends strongly on Kubernetes' scaling capability, and that the availability of that capability needs to be monitored. Or, if the system reads and writes data frequently and has suffered data inconsistencies, you can improve data reliability by adding backup and rollback capabilities.

  • Architecture analysis:

A system's architecture determines its bottlenecks to a certain extent. By analyzing the system's dependencies, we can better understand its boundaries and optimize operations accordingly.

For example, if an application is deployed in active/standby mode, you must check whether active/standby switchover is smooth and whether it affects business traffic. If an application relies heavily on underlying storage, a storage failure may trigger a large number of service failures; in that case you need to know whether there is a degradation plan for storage failures and whether storage problems can be detected early.

  • Community experience:

Many systems have very similar architectures, and learning from the experience of the community or peer companies is like seeing the exam questions in advance; it always yields unexpected gains. Whenever a high-profile failure breaks out in the industry, we reflect on and re-examine our own systems, and more than once we have found problems of our own this way. Hard-won lessons such as severed network cables or "delete the database and run" are all on our list of regular drill scenarios.

Based on Alibaba Cloud's cloud native architecture, we compiled the following reference model for drills. In this high availability capability model, the system is divided by architecture into control plane components, meta-cluster components, extension components, data storage, the node layer, and the cluster as a whole; each module has some common fault scenarios that can be borrowed across systems.
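One way to make such a reference model actionable is to keep the scenario catalog as structured data that the drill pipelines can consume. The sketch below follows the module breakdown described above; the concrete scenarios and time budgets are illustrative examples, not our full internal catalog.

```python
# An illustrative way to organize a drill scenario catalog by module, based on
# the capability model above. The concrete scenarios and budgets are examples only.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    name: str
    expected_alarm_s: int      # an alarm is expected within this many seconds
    expected_recovery_s: int   # recovery (self-healing or via plan) expected within this


@dataclass
class Module:
    name: str
    scenarios: list[Scenario] = field(default_factory=list)


CATALOG = [
    Module("control plane components", [
        Scenario("kube-apiserver unavailable", 60, 600),
        Scenario("etcd leader loss", 60, 300),
    ]),
    Module("node layer", [
        Scenario("node not-ready", 60, 600),
        Scenario("kubelet process killed", 60, 600),
    ]),
    Module("data storage", [
        Scenario("backup and restore drill", 300, 1800),
    ]),
]
```

Keeping expectations (alarm and recovery budgets) alongside each scenario means the later verification steps can be driven from the same source of truth.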

Step 3: System high availability capacity building

There are a few questions to ask ourselves before actually injecting faults, based on the list of high availability capabilities analyzed above: when these failures occur, can the system detect them quickly? Can people respond quickly? Does the system itself have self-healing capabilities, or tools that can be used to restore it quickly during a failure? Here are some general suggestions from the perspectives of detection and recovery.

  • Detection capability:

Monitoring and alerting is how we determine whether the system is in a steady state and make that clear to the application owner. Alibaba's internal teams build monitoring alarms in two ways. One is white-box alarms, which find potential problems through abnormal fluctuations in observable data of various dimensions exposed from inside the system. The other is black-box alarms, which treat the system as a black box from the customer's perspective and probe its forward-facing functions (a minimal probe sketch follows this list).

  • Recovery capability:

After a failure occurs, the ideal outcome is that the system stays stable and smooth with no impact at all, which demands a great deal of the system; reality is usually more complicated. In Alibaba's internal practice, besides building the system's own self-healing, traffic-switching, migration, and rate-limiting capabilities, we also built a contingency-plan center that centrally accumulates all stop-loss capabilities, manages their registration, access, and execution, and organizes them into stop-loss capability sets based on expert experience, serving as an important tool during failures.
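As an illustration of the black-box alarm described above, the following is a minimal sketch of a probe that exercises the forward path from the customer's perspective; the health endpoint and the alert webhook are hypothetical placeholders.

```python
# A minimal black-box probe: call the service the way a customer would and
# raise an alert when the forward path fails. Endpoint and webhook are hypothetical.
import json
import time
import urllib.request

HEALTH_URL = "http://demo-app.example.internal/healthz"               # hypothetical
ALERT_WEBHOOK = "http://alertmanager.example.internal/api/v2/alerts"  # hypothetical


def probe_once(timeout_s: int = 5) -> bool:
    """Return True if the forward-facing function responds successfully."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        return False


def send_alert(message: str) -> None:
    """Post a black-box alarm to the (assumed) alerting webhook."""
    body = json.dumps([{"labels": {"alertname": "BlackBoxProbeFailed"},
                        "annotations": {"summary": message}}]).encode()
    req = urllib.request.Request(ALERT_WEBHOOK, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    while True:
        if not probe_once():
            send_alert("forward path check failed from the customer perspective")
        time.sleep(30)
```

During a drill, a probe like this is what tells you whether the customer-visible path actually degraded, independently of the system's own internal metrics.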

Step 4: Practice

After the above steps are completed, we consider the system to have preliminary high availability and can begin fault drills.

Normally, we select some core scenarios for the first drill, triggered in the pre-release or test environment by semi-automated scripts or by pipelines containing only the fault injection module. The first run is carried out with R&D and operations staff present. Before the drill, confirm the expectations for each scenario, for example that an alarm fires within 1 minute of fault injection and the system recovers automatically within 10 minutes. After the drill, everyone manually confirms whether the system behaved as expected, and faults and the environment are cleaned up promptly. If a scenario does not meet expectations, it must be fixed and rehearsed repeatedly at this stage. Once scenarios consistently meet expectations, you can move on to the regular drill phase.

The key words of the normalized drill stage are chaos and unattended. Thanks to its architecture, a Kubernetes cluster has a certain degree of self-healing, which makes it well suited to unattended drills. We take the set of scenarios that passed the semi-automated drills and organize them into fault drill pipelines; each pipeline generally includes fault injection, monitoring checks, recovery checks, and fault recovery, so that a single drill runs as a closed loop. At the same time, Alibaba uses cloud native technology to trigger chaos, randomizing the drill target, environment, time, and scenario, so that these drills become chaotic, normalized, and unattended. Regular fault drills help catch sporadic system problems and verify existing high availability capabilities during system upgrades.
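A minimal sketch of how that randomness might be introduced is shown below; the scenario names and namespaces are illustrative, and `run_pipeline` stands in for a drill pipeline like the one sketched earlier.

```python
# An illustrative scheduler for normalized, unattended drills: pick a random
# scenario, a random eligible environment, and a random start delay inside an
# agreed low-risk window. Names and values are examples only.
import random
import time

SCENARIOS = ["pod-kill", "node-not-ready", "apiserver-latency"]  # illustrative
ENVIRONMENTS = ["drill-canary", "drill-grayscale"]               # hypothetical namespaces
MAX_JITTER_S = 6 * 3600                                          # spread drills over a 6-hour window


def run_pipeline(scenario: str, namespace: str) -> bool:
    """Placeholder for the inject -> check -> recover pipeline sketched earlier."""
    print(f"running {scenario} in {namespace}")
    return True


def run_random_drill() -> None:
    time.sleep(random.uniform(0, MAX_JITTER_S))  # randomize the start time
    scenario = random.choice(SCENARIOS)          # randomize the scenario
    namespace = random.choice(ENVIRONMENTS)      # randomize the target environment
    if not run_pipeline(scenario, namespace):
        # an unattended drill must surface failures loudly, e.g. via the alert webhook
        print(f"drill {scenario} in {namespace} did not meet expectations")


if __name__ == "__main__":
    run_random_drill()
```

The point of the randomness is that neither the system nor the people on call can "prepare" for a specific drill, which is closer to how real failures arrive.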

Production surprise drills must be planned around the system's architecture. In Alibaba's internal practice, one way to control risk is to run them during low-traffic periods and to prepare a one-key traffic-cut plan in advance, so that traffic can be switched immediately to stop losses if a failure cannot be recovered. Other risk-control designs for surprise drills will be covered in detail in future articles in this series.
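As a hedged illustration of what a one-key traffic cut might look like, the sketch below repoints a Kubernetes Service selector from a hypothetical grayscale track back to the stable track; real setups often rely instead on a service mesh, DNS weighting, or a dedicated traffic-switching platform.

```python
# A hedged illustration of a "one-key" traffic cut: repoint a Service's selector
# from the grayscale track back to the stable track. The Service name, namespace,
# and "track" label convention are hypothetical.
import json
import subprocess


def cut_traffic_to_stable(namespace: str = "prod", service: str = "demo-app") -> None:
    """Patch the Service selector so it only matches stable pods."""
    patch = json.dumps({"spec": {"selector": {"app": service, "track": "stable"}}})
    subprocess.run(
        ["kubectl", "-n", namespace, "patch", "service", service, "-p", patch],
        check=True,
    )


if __name__ == "__main__":
    cut_traffic_to_stable()  # execute the pre-approved stop-loss plan
```

Whatever the mechanism, the key property is that the stop-loss action is prepared, reviewed, and rehearsed before the surprise drill, so executing it under pressure takes one command, not a debugging session.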

Conclusion

In the course of implementing fault drills internally in the cloud native field, we analyzed more than 200 drill scenarios and ran normalized fault drills at a frequency of more than 1000 per month, effectively uncovering more than 90 problems and preventing them from growing into larger incidents. Building, verifying, and chaotically executing the drill process kept the system's alerting and contingency-plan recovery capabilities under regular watch, and blocked more than 50 new high availability problems from reaching production. Surprise drills in the production environment were a difficult but powerful step for us: they exercised the emergency response skills of R&D and operations staff, tempered the system in real user scenarios, strengthened the product's on-call system, and improved the stability and competitiveness of the cloud native foundation.

Related links

[1] Postmortem report list: https://github.com/danluu/post-mortems
