With the end of Moore’s Law, single-machine computing performance has hit its ceiling, yet the scale and complexity of our software keep growing, so systems are becoming increasingly distributed. In recent years, cloud services and containers have also made it easier to split distributed systems into microservices. Whatever the specific technology, the reliability requirements are the same: distributed systems must be highly available, resilient, and fault-tolerant, able to self-recover or degrade gracefully when a single node or an entire cluster fails. We put a lot of effort into sound architecture, high-quality code, thorough testing, and so on, yet many distributed systems still fall short on availability and resilience. To discover as many of a system’s weaknesses as possible, many large software companies have adopted chaos engineering, such as Google and Netflix abroad and JD.com (Jingdong) at home. What do these weaknesses look like? For example:

  • The failure of an external system triggers cascading failures inside our own system; our company has experienced an internal outage caused by a failure of the Qiniu service
  • An inappropriate degradation plan when a service becomes unavailable
  • An improper timeout mechanism, resulting in infinite retries when requests fail

Definition of chaos engineering:

By observing how a distributed system behaves under controlled fault-injection experiments, we discover its weaknesses and make targeted improvements, thereby raising the system’s reliability and building confidence in its ability to withstand turbulent conditions. In this sense chaos engineering is not a new concept; the common remote disaster-recovery drill is also an application of chaos engineering.

General implementation steps of chaos engineering

  • Find measurable indicators of the system’s normal operating behavior to serve as the baseline “steady state”.
  • Hypothesize that both the experimental group and the control group will continue to maintain this “steady state”.
  • Inject events into the experimental group, such as server crashes, hard disk failures, dropped network connections, etc.
  • Compare the “steady state” of the experimental group with that of the control group and try to disprove the hypothesis in step 2.

If the two “steady states” remain consistent after the experiment, we can consider the system resilient to this kind of failure, which builds more confidence in it. If, on the other hand, the two steady states diverge, then we have found a system weakness and can fix it to improve the system’s reliability.
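As a rough illustration of these four steps, the Python sketch below wires them together. Everything in it is a placeholder assumption: `fetch_order_rate` stands in for a real monitoring query, `inject_fault` for a real fault injector, and the 5% tolerance band is arbitrary.

```python
import random
import time


def fetch_order_rate(group):
    """Hypothetical metric source: orders per minute for a host group.
    In practice this would query the monitoring system."""
    return 1000 + random.uniform(-20, 20)


def inject_fault(group, event):
    """Hypothetical fault injector (e.g. crash a server, cut the network)."""
    print(f"injecting '{event}' into the {group} group")


def steady(baseline, observed, tolerance=0.05):
    """The steady state holds if the metric stays within +/-5% of the baseline."""
    return abs(observed - baseline) / baseline <= tolerance


# Step 1: measure the steady-state baseline.
baseline = fetch_order_rate("control")

# Step 2: hypothesis - both groups will keep this steady state.
# Step 3: inject the event into the experimental group only.
inject_fault("experiment", "server crash")
time.sleep(1)  # give the system time to react (shortened for the sketch)

# Step 4: compare the two groups and try to disprove the hypothesis.
for group in ("control", "experiment"):
    ok = steady(baseline, fetch_order_rate(group))
    print(f"{group}: steady state {'held' if ok else 'broken -> weakness found'}")
```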

Ideal principles of chaos engineering:

1) Form hypotheses based on the system’s steady-state characteristics

Take an e-commerce ordering system as an example: it may contain product services, trade services, and payment services. The “hypothesis” does not focus on the state of these individual “screws”, but on the externally observable behavior of the whole ordering system under normal operation, such as order volume, transaction amount, system throughput, latency, error rate, and so on. These indicators usually already have business-level monitoring, and except during promotions the curves rarely fluctuate wildly; their trend is predictable. One point deserves special attention, however: some problems do not affect the overall business data (for example a cache failure, or the failure of a single CDN node), so we still need to monitor the micro-level indicators of each node in the system (such as CPU and IO) in order to catch them (a cache failure may, for instance, increase the pressure on the MySQL cluster, driving up its CPU/IO load).
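To make “predictable curves” concrete, a steady-state check might combine allowed bands for the macro business metrics with per-node micro indicators, roughly as below. The metric names, bands, and thresholds are illustrative assumptions, not values from any real dashboard.

```python
# Allowed steady-state bands for macro business metrics (illustrative values).
macro_bands = {
    "orders_per_min": (900, 1100),
    "error_rate": (0.0, 0.01),
    "p99_latency_ms": (0, 300),
}
macro_now = {"orders_per_min": 985, "error_rate": 0.002, "p99_latency_ms": 210}

# Per-node micro indicators (illustrative): a cache failure may not dent the
# macro numbers yet, but it shows up as CPU/IO pressure on the MySQL cluster.
nodes = {
    "mysql-1": {"cpu": 0.92, "io_util": 0.95},
    "api-1": {"cpu": 0.35, "io_util": 0.20},
}


def macro_steady(bands, current):
    """Macro steady state holds when every metric sits inside its band."""
    return all(lo <= current[k] <= hi for k, (lo, hi) in bands.items())


def micro_alerts(nodes, cpu_limit=0.85, io_limit=0.90):
    """Micro indicators catch problems the macro view hides."""
    return [name for name, m in nodes.items()
            if m["cpu"] > cpu_limit or m["io_util"] > io_limit]


print("macro steady:", macro_steady(macro_bands, macro_now))  # True
print("micro alerts:", micro_alerts(nodes))                   # ['mysql-1']
```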

2) Events should reflect real-world possibilities

Anything that may affect the steady state of the system can be treated as an event to inject:

  • Failure events: hardware faults such as server crashes and network disconnections, and software faults such as an external dependency (e.g. Qiniu) becoming unavailable
  • Non-failure events: things like sudden traffic surges

We can also analyze the types and frequency of events that have caused system failures in the past, prioritize them, and rehearse those events first to avoid repeating such failures.
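One simple way to prioritize, sketched here with made-up numbers, is to score each event type by how often it has caused incidents and how severe those incidents were, then drill the highest-scoring events first.

```python
# Hypothetical incident history: (event type, past incidents, avg severity 1-5).
history = [
    ("external service unavailable", 6, 4),
    ("server crash", 2, 3),
    ("traffic surge", 3, 5),
    ("disk full", 1, 2),
]

# Simple priority score: frequency * severity, highest first.
ranked = sorted(history, key=lambda e: e[1] * e[2], reverse=True)
for event, freq, sev in ranked:
    print(f"{event}: priority {freq * sev}")
```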

3) Run in a production environment

As noted in principle 1, only production-environment indicators are really predictable, such as daily new-user registrations or daily order counts. Moreover, since a test environment can never be exactly the same as production, chaos engineering is generally recommended to run in production so that it truly reflects the system’s reliability.

4) Run experiments continuously, like continuous integration

Internet software is updated almost every day, so it makes sense to run chaos engineering continuously and automatically, much like continuous integration.

5) Minimize the scope of impact

Because chaos engineering runs in production (principle 3), it can make live features unavailable or even cause financial losses. Therefore, while still aiming to reveal system weaknesses, we must minimize the scope a failure can affect and be able to recover quickly when a serious problem occurs; in other words, the injected failure must stay controllable. For this reason, A/B-style testing can sometimes be introduced to keep the impact small.
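A minimal sketch of keeping the blast radius small: run the experiment on only a small fraction of hosts and abort and recover as soon as a guard metric crosses its limit. The host names, the 5% fraction, and the `error_rate` guard here are all hypothetical.

```python
import random


def pick_canary(hosts, fraction=0.05):
    """Limit the experiment to a small fraction of hosts (at least one)."""
    k = max(1, int(len(hosts) * fraction))
    return random.sample(hosts, k)


def error_rate():
    """Hypothetical guard metric; would come from monitoring in practice."""
    return random.uniform(0.0, 0.02)


def run_experiment(hosts, abort_threshold=0.015):
    canaries = pick_canary(hosts)
    print("injecting fault only on:", canaries)
    for _ in range(10):  # poll the guard metric while the fault is active
        if error_rate() > abort_threshold:
            print("guard metric breached -> stop injection and recover")
            return False
    print("experiment finished within the safety limits")
    return True


run_experiment([f"host-{i}" for i in range(40)])
```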

The above describes chaos engineering under ideal conditions. In reality, we need to implement it in stages according to the maturity of the existing software:

Stage 1: The distributed system has some resilience

  • Take JD.com (Jingdong) as an example: they run failure drills before the Double 11 shopping festival. The team is split into two groups, one acting as failure makers and the other as failure solvers and responders, to exercise the team’s ability to detect, respond to, handle, and recover from faults. Small failures should need no human intervention, while large failures should be resolved quickly through manual intervention. Intensive chaos engineering during the two months before the big promotion improves the team’s tolerance for large-scale failures.

  • Take Youzan as an example: since we have only just started, we initially run chaos engineering only in the test environment to keep the risk under control, which means there is no accurate production data to serve as a baseline “steady state”. That is not a dead end, though. Instead of watching macro business dashboards, we can take a micro view: filter out a set of interfaces that directly affect core business data (such as registrations and order counts), run integration tests against these interfaces after injecting chaos into the system, and evaluate the system’s reliability by observing the test results, thereby hunting for weaknesses; this is feasible in a test environment. In addition, chaos engineering can be seen as an automated, generalized form of exception testing, tied to no fixed time and no fixed target. If we drop that layer of automation and manually inject one or more specific faults into a target machine, together with the corresponding recovery means, the same mechanism can be used for ordinary exception testing (a rough sketch of this test-environment flow follows below).
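A minimal sketch of that test-environment flow, assuming hypothetical inject/recover scripts and a pytest-style suite covering the core interfaces (none of this is Megatron’s actual interface): inject one fault, run the core tests, record what breaks, recover, and move on to the next fault.

```python
import subprocess


def run_core_suite():
    """Run the integration tests that cover core-business interfaces
    (registration, order placement, ...); hypothetical pytest invocation."""
    result = subprocess.run(["pytest", "-q", "tests/core"], check=False)
    return result.returncode == 0


# Hypothetical inject/recover hooks; in practice these would call the
# fault-injection tool on the target test machines.
FAULTS = {
    "mysql network latency": ("inject_latency.sh", "recover_latency.sh"),
    "external service unreachable": ("inject_blackhole.sh", "recover_blackhole.sh"),
}

for fault, (inject, recover) in FAULTS.items():
    subprocess.run(["bash", inject], check=False)    # inject one fault at a time
    passed = run_core_suite()
    subprocess.run(["bash", recover], check=False)   # always recover afterwards
    print(f"{fault}: {'system tolerated it' if passed else 'weakness found'}")
```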

Stage 2: The distributed system is resilient and mature

  • Netflix, for example, has essentially implemented chaos engineering following the ideal steps and principles above: experiments run continuously and automatically on ordinary weekdays, and the system shows a high degree of reliability and resilience.

Youzan’s implementation of chaos engineering:

Since chaos engineering is mainly about injecting specific events to cause system failures, i.e. “doing bad things”, we named our tool Megatron (the villain boss from Transformers). Because we are still in stage 1, fault injection is for now triggered manually. The fault types implemented so far are:

  • High CPU load
  • High disk load: frequent disk reads and writes
  • Insufficient disk space
  • Graceful application shutdown: use the application’s stop script to stop it smoothly
  • Forced application shutdown: kill the application’s process directly, which may cause data inconsistency
  • Network corruption: randomly corrupt some packet data so that the content arriving is wrong
  • Network latency: delay packets by a configurable range of time
  • Network packet loss: introduce a packet-loss rate high enough to hurt but not so high that TCP connections fail completely
  • Network black hole: ignore all packets coming from a given IP address
  • Unreachable external service: point the external service’s domain name to the local loopback address, or drop the OUTPUT packets destined for the external service’s port
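To give a feel for how a few of these fault types can be produced, here is a rough sketch driven from Python using standard Linux tooling (`tc netem` for latency and loss, `iptables` for the black hole). This is not Megatron’s implementation, just one plausible way to realize the same faults; it needs root and should only run on machines you are allowed to break.

```python
import subprocess


def sh(cmd):
    """Run a shell command; these require root and a disposable test machine."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=False)


def network_latency(iface="eth0", delay_ms=200, jitter_ms=50):
    # Defer packets by a bounded, randomized amount (network latency fault).
    sh(f"tc qdisc add dev {iface} root netem delay {delay_ms}ms {jitter_ms}ms")


def network_loss(iface="eth0", loss_pct=10):
    # Drop a fraction of packets, low enough that TCP limps on instead of dying.
    sh(f"tc qdisc add dev {iface} root netem loss {loss_pct}%")


def network_blackhole(ip):
    # Ignore all packets coming from one IP address (network black hole fault).
    sh(f"iptables -A INPUT -s {ip} -j DROP")


def recover(iface="eth0", ip=None):
    # Recovery path: remove the netem qdisc and the DROP rule.
    sh(f"tc qdisc del dev {iface} root netem")
    if ip:
        sh(f"iptables -D INPUT -s {ip} -j DROP")
```

Whatever tool does the injecting, every fault needs a matching recovery path like `recover()` above so that the failure stays controllable.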

Reference: Principles of Chaos Engineering

My other blog post: Two testing methods for asynchronous systems

My open source project: Zubuji, a self-testing management tool for collaboration among product, development, and testing

Ps: The Youzan test team is still hiring, with plenty of open positions. Join us and we will help you light up the full-stack development skill tree. If you are interested in a new role, send your resume to sunjun [@] youzan.com