Author | Xiao Changjun (Qionggu)    Source | Alibaba Cloud Native official account

With the development of cloud native, chaos engineering has gradually come into everyone's view. Through chaos engineering, high-availability problems that arise during the move to cloud native can be discovered and solved early. In 2019, Alibaba open sourced the underlying chaos engineering tool ChaosBlade, and at the beginning of this year it open sourced ChaosBlade-Box, further upgrading the ChaosBlade brand. This article focuses on the high-availability challenges and chaos engineering opportunities that cloud native faces, and introduces in detail the design, features, practice, and future planning of the open source console, aiming to help enterprises better understand the console, implement chaos engineering through it, and solve high-availability problems in cloud native systems.

At the end of last year, both AWS and Google experienced serious service outages. The AWS outage was caused by a problem with Kinesis, its data streaming service, which made many cloud services unavailable; the Google outage was caused by quota exhaustion that blocked capacity expansion of its login (identity) service. Both incidents show the same problems: unreasonable service dependencies, a single service failure making many services unavailable, a lack of emergency plans, long overall recovery time, and an imperfect alerting and monitoring system. Google did not detect the fault until dozens of minutes after it occurred, and during the AWS incident CloudWatch itself was unavailable. Failures are inevitable; everything is always at risk of failure.

Especially with the emergence of agile development, DevOps, microservices, and cloud native architecture and governance, application delivery capability has improved greatly, but the complexity of systems is also increasing day by day. With rapid business iteration, keeping the business continuously highly available and stable is a great challenge. By proactively injecting faults, chaos engineering discovers weak points in the system in advance, drives architecture improvements, and ultimately delivers business resilience.

"What does not kill me makes me stronger." Building a resilient architecture is the goal of chaos engineering. A resilient architecture consists of two parts. One is the resilient system: redundancy, scalability, degradation, circuit breaking, and fault isolation to avoid cascading failures, building a system that is disaster tolerant and fault tolerant. The other is the resilient organization, including organizational collaboration around efficient delivery, failure plans, and emergency response. Even a highly resilient system will still suffer unexpected failures, so the resilient organization makes up for what the resilient system misses, and together, built through chaos engineering, they form the ultimate resilient architecture.

A common cloud native high-availability architecture is based on multiple availability zones or a cross-region disaster recovery (DR) architecture. Business applications are deployed as clusters under a microservices architecture, and the middleware has fault tolerance and disaster recovery capabilities. Potential faults exist in both the underlying facilities and the upper-layer services: for example, a machine room loses network connectivity, an entire availability zone becomes unavailable, a cluster breaks down, or a middleware node crashes. From availability zones to clusters, hosts, and fine-grained requests, the blast radius of a fault's impact shrinks step by step, which is also an important point in the chaos engineering principle of controlling the blast radius. There are two ways to control the blast radius. One is environment isolation, by isolating machine rooms and clusters. The other relies on the scenario control capability of the experiment tool or platform itself: for example, the ChaosBlade experiment tool controls the granularity of an experiment through experiment parameters, so a microservice invocation delay can be scoped to a single service interface, a version, or even a single request. Let's now look at the ChaosBlade chaos experiment tool.

ChaosBlade is a chaos experiment execution tool that follows the chaos experiment model. It offers rich scenarios, is simple and easy to use, and is particularly easy to extend with new scenarios. It was added to the CNCF Landscape soon after being open sourced and has become a mainstream chaos tool. ChaosBlade can be used directly by downloading and unpacking the release, with no installation required. It supports CLI invocation by executing the blade command directly. Here is a network packet-loss example: if you add the -h parameter, you can see a very complete command prompt. For instance, to inject packet loss on calls to port 9520, the experiment target is network, its action is packet loss, and its matcher is the remote service port 9520 being called. After successful execution, the experiment result is returned. Each experiment scenario is treated as an object, and the command returns the UID of the experiment object, which is used for subsequent experiment management; destroying and querying experiments are both done through this UID. To destroy the experiment, that is, recover from it, simply execute the blade destroy command.
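A minimal sketch of these commands (flags follow the ChaosBlade documentation; the interface name and returned UID below are illustrative):

```bash
# Inject packet loss on calls to remote port 9520 (target: network, action: loss, matcher: remote-port).
# eth0 is an assumed NIC name; adjust it to the actual interface.
blade create network loss --percent 100 --interface eth0 --remote-port 9520
# Example response (the UID value is illustrative):
# {"code":200,"success":true,"result":"d08b72068c341963"}

# Query the experiment status and destroy (recover) it via the returned UID.
blade status d08b72068c341963
blade destroy d08b72068c341963
```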

ChaosBlade supports multiple platforms and languages, including Linux, Kubernetes, Docker, Java, NodeJS, C++, and Golang. It covers more than 200 scenarios and more than 3000 parameters, giving users rich control over scenarios and experiment parameters. You can run the blade -h command to view detailed usage documentation, including examples, scenarios, and parameter descriptions. Let's focus on ChaosBlade's support for application service scenarios.
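For example, help is available at each command level (a brief sketch; output is abbreviated):

```bash
blade -h                       # top-level help: create, destroy, status, query, ...
blade create -h                # list experiment targets such as cpu, mem, network, disk, process, jvm, k8s
blade create network loss -h   # show the flags (matchers) supported by one specific scenario
```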

ChaosBlade supports Java, C++, Golang, and NodeJS applications. For Java it covers JVM scenarios such as OOM, full thread pools, a specified number of threads, CPU load, and a full CodeCache, as well as popular components such as Druid, Dubbo, Elasticsearch, HBase, HttpClient, Redis, Kafka, Lettuce, MongoDB, MySQL, PostgreSQL, RabbitMQ, RocketMQ, Servlet, Tars, gRPC, and so on. One of the more powerful features of the Java scenarios is that you can pick any class or method and inject exceptions, delays, or tampered return values into it. You can even write your own Groovy or Java scripts to implement more complex experiment scenarios that fit your business needs. It also supports link identification and limiting the number of affected requests. Golang scenarios are implemented by injecting instrumentation logic into any line of code at compile time, and currently support changing variable values, changing parameter values, changing return values, exceptions, delays, memory overflow, and panic scenarios. So far 40 enterprises have registered as trial users or users, including some enterprise users we work with through in-depth cooperation. Here is an example of the ChaosBlade fault injection execution flow.
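A rough sketch of the class/method-level injection mentioned above; the class, method, and process names here are hypothetical, and the exact flags should be confirmed with blade create jvm delay -h:

```bash
# Delay every call to the hypothetical method com.example.PetService.query() by 3 seconds
# in the Java process named "pet-app" (all names are illustrative).
blade create jvm delay --time 3000 \
  --classname com.example.PetService --methodname query \
  --process pet-app
# Other jvm actions, such as throwing a custom exception or tampering with the return value,
# take the same classname/methodname matchers.
```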

Take a three-second delay injected into a cloud native Dubbo application's call to the downstream PetQueryService as an example. We can run the experiment with ChaosBlade's own blade tool, with kubectl, or through code; execution with kubectl and with the blade tool is shown here. With kubectl, you write a ChaosBlade YAML file and run kubectl apply, and Kubernetes creates a ChaosBlade resource; you can recover the experiment by deleting this resource with the kubectl delete command. After the ChaosBlade resource is created, the chaosblade-operator watches the resource creation, finds the target container, copies the scenario-related experiment tools into it as needed, and calls the blade tool inside the container to run the experiment. When executing with blade, the command (shown above) injects a delay fault into the Dubbo application under Kubernetes: the process parameter specifies the application name, the time parameter specifies the delay, the service parameter specifies the affected service interface, and the names and container-names parameters specify the Pod and container names respectively. If you are unsure about the parameters, add -h to view the command help.
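A minimal sketch of the kubectl path described above, using the chaosblade-operator's ChaosBlade CRD; the resource name, namespace, Pod and container names, and the fully qualified interface name are assumptions, and the exact scope/target/matcher spelling should be checked against the operator documentation:

```yaml
# delay-petqueryservice.yaml -- apply with:   kubectl apply -f delay-petqueryservice.yaml
#                               recover with: kubectl delete -f delay-petqueryservice.yaml
apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: delay-petqueryservice           # illustrative resource name
spec:
  experiments:
  - scope: container
    target: dubbo
    action: delay
    desc: "3s delay on calls to PetQueryService"
    matchers:
    - name: names                        # target Pod name (illustrative)
      value: ["frontend-7d4f9c"]
    - name: container-names              # target container name (illustrative)
      value: ["frontend"]
    - name: namespace
      value: ["default"]
    - name: process                      # application (Java process) name
      value: ["frontend"]
    - name: service                      # affected service interface (package is an assumption)
      value: ["com.example.PetQueryService"]
    - name: time                         # delay in milliseconds
      value: ["3000"]
```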

As these examples show, the ChaosBlade tool is easy to use and supports rich experiment scenarios. On the basis of this tool, we upgraded the ChaosBlade brand.

We open sourced the ChaosBlade-Box chaos engineering console, which brings chaos experiments onto a platform and supports hosting more chaos engineering experiment tools, such as LitmusChaos. With the brand upgrade we further lower the difficulty of adopting chaos engineering, so that users can focus more energy on driving system resilience improvements, aiming to help enterprises use chaos engineering to solve high-availability problems as their systems move to cloud native.

ChaosBlade-Box is a cloud native chaos engineering platform for multiple clusters, environments, and languages. Its key functions are as follows:

  • Automatic deployment of experiment tools, so users no longer need to log in to each machine to deploy tools, which reduces deployment cost.

  • Hosting of experiment tools: LitmusChaos is supported now, and more excellent tools will be supported to meet the needs of various experiment scenarios.

  • A unified chaos experiment user interface that hides the underlying fault injection details, so users can run experiments from different tools on the same platform.

  • Automatic discovery of experiment targets, experiment scenario management, and more.

  • Multiple experiment dimensions, such as host, Kubernetes, and application; the Kubernetes dimension further covers Container, Pod, and Node.

  • In the future, a closed loop of chaos engineering will be supported, covering steady-state definition, experiment execution, steady-state assessment, and more, to help users build highly available cloud native systems.

Here is a screenshot of the ChaosBlade-Box platform.

The pictures above show the overall functions of the ChaosBlade-Box platform. On top of hosting more tools and scenarios, it standardizes experiment scenarios and the experiment management interface, simplifies user operations, lowers the barrier to use, and provides detailed logs in the UI for problem tracking and troubleshooting. Let's take a look at the platform's technical architecture diagram.

From the console page you can deploy hosted tools such as ChaosBlade and Litmus, unify experiment scenarios according to the chaos experiment model established by the community, divide target resources into host, Kubernetes, and application, and manage them through the target manager. On the experiment creation page, target resources can be selected visually. The platform invokes the chaos experiment executor to run experiment scenarios from different tools, and with Prometheus monitoring the experiment metrics can be observed; richer experiment reports will be provided later. ChaosBlade-Box is also easy to deploy; see github.com/chaosblade-… .

Here we walk through using the platform with a Pod-kill experiment scenario.

After deployment, create an experiment on the experiment list page and select the Kubernetes Pod experiment dimension. Experiment creation is divided into four steps: the first two, resource selection and scenario selection, are mandatory, and the last two, monitoring access and experiment name, are optional. Select one or more target Pods from the Pod list, select the kill-pod experiment scenario, optionally configure Prometheus monitoring for the Pods, and complete the experiment creation. Click Execute Experiment on the experiment details page to enter the experiment task details page and view the experiment details.

ChaosBlade-Box's follow-up plans focus on hosting more experiment tools, automating the deployment of more tools, supporting more language applications, adding more complex scheduling policies and process orchestration, and generating experiment reports. The experiment reports will come in three stages: the first is the basic experiment report, which contains basic experiment and monitoring information; the second is the experiment defect report, which contains the problems found during the experiment; the third is the high-availability construction report, which proposes solutions to the problems found during the experiment.

Details of the roadmap can be found at: github.com/chaosblade-… .

After the ChaosBlade brand upgrade, the project has only just started and there are still many imperfections. You are welcome to download and use it and take part in building the project. You can also register your enterprise's usage in the issues, and we can then follow up offline: github.com/chaosblade-… .

  • ChaosBlade

Github.com/chaosblade-…

  • ChaosBlade-Box

Github.com/chaosblade-…