Chaos engineering is an emerging technical discipline. Most teams’ understanding of it has not yet matured into a domain concept, but it is in fact a sophisticated technical means of improving the resilience of a technical architecture. This article explains what chaos engineering is, how it is used, and the definition and use of Alibaba’s ChaosBlade model.



Chaos engineering is the discipline of experimenting on distributed systems in order to build confidence in a system’s ability to withstand turbulent conditions in production. It was first proposed by Netflix and related teams.



Understanding Chaos Engineering



Embrace chaos



Netflix has kept strengthening its infrastructure to support an increasingly complex business, and it now has 100 million subscribers in more than 190 countries. In its early days the company ran servers in its own data centers, which created single points of failure and other problems. In August 2008, a database problem caused a three-day outage during which it was impossible to watch any videos on Netflix. Netflix engineers migrated the service to Amazon Web Services in 2011.



This new distributed architecture of hundreds of microservices eliminates single points of failure. But it also introduces new complexities that require more reliable and fault-tolerant systems. It’s here that Netflix’s engineering team learned an important lesson: avoid failure by constantly failing.



A new use of chaos



To do this, Netflix engineers created Chaos Monkey, a tool that causes failures at random locations throughout the system. With the advent of Chaos Monkey, a new discipline was born: chaos engineering, described as “the discipline of conducting experiments on distributed systems in order to build confidence in the system’s ability to withstand turbulent conditions in production environments.”



Chaos Monkey was open-sourced by Netflix in 2012. Today, many companies (including Google, Amazon, IBM, and Nike) employ some form of chaos engineering to improve the reliability of modern architectures. Netflix has even expanded its chaos engineering toolset into an entire “Simian Army,” which it uses to attack its own systems.



Chaos Engineering: Not so chaotic



It is a misconception to regard chaos engineering as actual chaos; in fact, a large share of its tests are not random. Chaos engineering involves careful thought before implementation, with planned and controlled experiments designed to reveal how the system will behave when something fails.



Minimize minefields



Tom Petrocelli, a researcher at Amalgam Insights, said in an interview that the key to chaos engineering best practice is to “minimize minefields. This means minimising the impact on the business.”



To ensure it doesn’t disrupt business, Petrocelli advised the engineering team to “orchestrate” the disruption.



Not just tests, but experiments that generate knowledge



Casey Rosenthal, former engineering manager of Netflix’s Chaos team, made it clear in a DZone Q&A that chaos engineering is not just a way to test systems but also a way to generate new knowledge. Rosenthal said that traditional testing remains vital and that chaos engineering should complement it.



1. Define and measure the “steady state” of the system. Start by defining metrics precisely. In chaos engineering, business metrics are often more useful than technical metrics because they are better suited to measuring user experience or operations.



2. Create a hypothesis. Chaos engineering should involve real experiments involving real unknowns.



“Chaos engineering isn’t for events that are predictable, that are covered by the operations manual, or that you know need to be automated but haven’t started on yet,” says Beth Long, a DevOps solutions strategist and SRE at New Relic. “You need it to deal with all the factors that arise from complexity itself. Because no one knows how to intervene, everyone feels at a loss about where to start.”



3. Simulate what might happen in the real world. In Chaos Engineering: Building Confidence in System Behavior through Experimentation, Netflix architects Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri propose a number of chaos engineering practices:



  • Simulate a data center failure
  • Force system clocks out of sync with each other
  • Simulate I/O exceptions in driver code
  • Simulate delays between services
  • Make functions throw exceptions at random


Be sure to prioritize potential errors. The more complex and important the system, the more likely it is to be a candidate for chaos engineering.



4. Prove or disprove your hypothesis. Compare the steady-state metrics with those collected after injecting the disturbance into the system. If the measurements differ, your chaos engineering experiment has succeeded: you can now harden the system so that similar events in the real world do not cause major problems. If the steady state is maintained, you can have more confidence in the system’s stability.

The content above is from the High Availability Architecture public account.
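
The four steps can be sketched in a few lines of Go. This is only an illustrative sketch, not code from Netflix or from any chaos tool: the names Service, WithFault, SuccessRate and HypothesisHolds are hypothetical, the "system" is a single function call, and the steady-state metric is assumed to be the fraction of successful calls.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Service is a hypothetical dependency call whose behaviour we probe.
type Service func() error

// WithFault wraps a service so that every call is delayed and every
// failEvery-th call fails. It stands in for a real fault-injection tool.
func WithFault(next Service, delay time.Duration, failEvery int) Service {
	calls := 0
	return func() error {
		calls++
		time.Sleep(delay)
		if failEvery > 0 && calls%failEvery == 0 {
			return errors.New("injected fault")
		}
		return next()
	}
}

// SuccessRate is the steady-state metric: the fraction of n calls that succeed.
func SuccessRate(s Service, n int) float64 {
	ok := 0
	for i := 0; i < n; i++ {
		if s() == nil {
			ok++
		}
	}
	return float64(ok) / float64(n)
}

// HypothesisHolds reports whether the metric measured under fault stays
// within tolerance of the steady-state baseline.
func HypothesisHolds(baseline, underFault, tolerance float64) bool {
	return baseline-underFault <= tolerance
}

func main() {
	healthy := Service(func() error { return nil })

	// Step 1: measure the steady state.
	baseline := SuccessRate(healthy, 100)

	// Step 2: hypothesis: the success rate drops by at most 5% under fault.
	const tolerance = 0.05

	// Step 3: inject a fault (1 ms extra latency, every 4th call fails).
	faulty := WithFault(healthy, time.Millisecond, 4)
	underFault := SuccessRate(faulty, 100)

	// Step 4: compare the two measurements.
	fmt.Printf("baseline=%.2f underFault=%.2f holds=%v\n",
		baseline, underFault, HypothesisHolds(baseline, underFault, tolerance))
}
```

In this sketch the hypothesis does not hold, because the wrapped service has no retry or fallback. In the terms of step 4, that is a successful experiment: it exposes a weakness to harden before it shows up in production.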



Alibaba open source tool ChaosBlade

At the end of March, Alibaba open-sourced a tool called “ChaosBlade,” a distillation of six years of ideas and practice in the field of fault drills. Since then, many developers have used it to test the fault tolerance and robustness of their own systems and to verify that their container orchestration configuration is sound, thanks to its simple operation, non-intrusiveness, and strong extensibility.



Next, we introduce the ChaosBlade chaos experiment model from two angles: its definition and its implementation. Following this model, a chaos experiment can be carried out simply and clearly: the blast radius of the experiment can be kept to a minimum, and new experiment scenarios can be added or existing ones enhanced quickly and easily. Both the chaosblade and chaosblade-exec-jvm projects are implemented on this model.

Chaosblade Project links:

https://github.com/chaosblade-io/chaosblade

Chaosblade-exec-jvm project

https://github.com/chaosblade-io/chaosblade-exec-jvm



The model definition

Before presenting the model, let us clarify the questions that have to be answered to carry out a chaos experiment:

  • What is the experiment performed on?
  • What is the scope of the experiment?
  • What specific experiment is carried out?
  • Under what matching conditions does the experiment take effect?
Here’s an example:

An application on the machine with IP 10.0.0.1 experiences a 3 s delay when it calls the Dubbo service com.example.HelloService, version 1.0.0. Going through the questions above: the chaos experiment is performed on the Dubbo component; its scope is the single machine 10.0.0.1; the experiment simulates a 3 s delay; and it takes effect on calls to version 1.0.0 of com.example.HelloService. Once these points are clear, a precise chaos experiment can be carried out. Summarizing these steps, we abstract the following model:





  • Target: the component on which the experiment takes place, such as a container or an application framework (Dubbo, Redis, Zookeeper).
  • Scope: the range within which the experiment is carried out, that is, the machine or cluster on which it is triggered.
  • Matcher: the experiment rule matcher. It defines the matching rules for the configured Target, since each Target may have its own matching conditions. For example, HSF and Dubbo in the RPC domain can match on the service offered by a provider or the service invoked by a consumer, and Redis in the cache domain can match on SET and GET operations.
  • Action: the scenario being simulated. For disks, for example, scenarios include a full disk, high disk read/write I/O, and disk hardware failure. For applications, scenarios include delays, exceptions, returning specified values (error codes, large objects, etc.), parameter tampering, and repeated calls.
Returning to the example above, the experiment can be summarized in one sentence: for the Dubbo component (Target), on the application of host 10.0.0.1 (Scope), calls to com.example.HelloService version 1.0.0 (Matcher) are delayed by 3 s (Action).

The pseudocode can be written as:

Toolkit.
    // Experiment target
    dubbo.
    // Experiment scope, here a host
    host("10.0.0.1").
    // Component matcher: consumer side
    consumer().
    // Component matcher: service interface
    service("com.example.HelloService").
    // Component matcher: interface version 1.0.0
    version("1.0.0").
    // Experiment scenario: 3 s delay
    delay(3000);
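
For readers who prefer something that compiles, the pseudocode can be approximated in Go with a small fluent builder. This is a sketch under stated assumptions: Experiment, Builder, Target, Host, Match and Delay are hypothetical names invented for illustration and are not part of ChaosBlade.

```go
package main

import "fmt"

// Experiment holds the four parts of the model. All names are illustrative.
type Experiment struct {
	Target   string            // component under experiment, e.g. "dubbo"
	Scope    string            // host or cluster where the experiment runs
	Matchers map[string]string // matching rules, e.g. service and version
	Action   string            // simulated scenario, e.g. "delay"
	Params   map[string]string // action parameters, e.g. delay time in ms
}

// Builder collects the parts step by step, mirroring the pseudocode.
type Builder struct {
	exp Experiment
}

// Target starts a new experiment definition for the given component.
func Target(name string) *Builder {
	return &Builder{exp: Experiment{
		Target:   name,
		Matchers: map[string]string{},
		Params:   map[string]string{},
	}}
}

// Host sets the scope to a single machine.
func (b *Builder) Host(ip string) *Builder {
	b.exp.Scope = ip
	return b
}

// Match adds one matching rule.
func (b *Builder) Match(key, value string) *Builder {
	b.exp.Matchers[key] = value
	return b
}

// Delay sets the action and returns the finished experiment definition.
func (b *Builder) Delay(ms int) Experiment {
	b.exp.Action = "delay"
	b.exp.Params["time"] = fmt.Sprint(ms)
	return b.exp
}

func main() {
	exp := Target("dubbo").
		Host("10.0.0.1").
		Match("consumer", "true").
		Match("service", "com.example.HelloService").
		Match("version", "1.0.0").
		Delay(3000)
	fmt.Printf("%+v\n", exp)
}
```

Keeping matchers and action parameters in separate maps reflects the split the model makes between where an experiment takes effect and what it does.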


Implementation of the ChaosBlade model

Chaosblade cli calls

For the above example, the chaosBlade call command is:

blade create dubbo delay --time 3000 --consumer --service com.example.HelloService --version 1.0.0
  • dubbo: the Target in the model, the component on which the experiment is carried out.
  • delay: the Action in the model; it runs the delay drill scenario.
  • --time: an Action parameter in the model, the delay time (3000 ms here).
  • --consumer, --service, --version: the Matchers in the model, the experiment rule matchers.
Note: Since ChaosBlade is a tool executed on a single machine, the Scope in the chaos experiment model defaults to the local machine and is not declared explicitly.
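
The mapping from model to command line can be made concrete with a small helper that assembles the command string from a target, an action, the action parameters and the matchers. This is a sketch only: Flag and BladeCommand are hypothetical names, and the helper merely reproduces the flag order of the command shown above rather than describing how ChaosBlade itself parses arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// Flag is one command-line flag. An empty Value marks a boolean switch
// such as --consumer.
type Flag struct {
	Name  string
	Value string
}

// BladeCommand assembles a "blade create" command line from the model parts:
// target, action, action parameters, then matchers.
func BladeCommand(target, action string, params, matchers []Flag) string {
	parts := []string{"blade", "create", target, action}
	all := append(append([]Flag{}, params...), matchers...)
	for _, f := range all {
		parts = append(parts, "--"+f.Name)
		if f.Value != "" {
			parts = append(parts, f.Value)
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	cmd := BladeCommand("dubbo", "delay",
		[]Flag{{Name: "time", Value: "3000"}},
		[]Flag{
			{Name: "consumer"},
			{Name: "service", Value: "com.example.HelloService"},
			{Name: "version", Value: "1.0.0"},
		})
	fmt.Println(cmd)
}
```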

Definition of chaosBlade model



  • Definition of a component chaos experimental model, containing the component name and a list of supported experimental scenarios.
type ExpModelCommandSpec interface {
    // Component name
    Name() string

    // List of supported scenarios
    Actions() []ExpActionCommandSpec

    // ... omitted
}


  • A definition of an experimental scenario action, containing the scenario name, parameters required for the scenario, and some experimental rule matchers.
type ExpActionCommandSpec interface {
    // Scenario name
    Name() string

    // Rule matchers
    Matchers() []ExpFlagSpec

    // Action parameters
    Flags() []ExpFlagSpec

    // Action executor
    Executor(channel Channel) Executor

    // ... omitted
}


  • The definition of an experimental matcher, including parameter names, parameter descriptions, and so on.
type ExpFlagSpec interface {
    // Parameter name
    FlagName() string

    // Parameter description
    FlagDesc() string

    // Whether the parameter takes no value
    FlagNoArgs() bool

    // Whether the parameter is required
    FlagRequired() bool
}
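
To show how a concrete type can satisfy the matcher interface, here is a self-contained sketch. The interface is repeated locally so the example compiles on its own. The ExpFlag struct mirrors the Name, Desc and Required fields used in the network example in the next section, while the NoArgs field and the Usage helper are assumptions made for illustration.

```go
package main

import "fmt"

// ExpFlagSpec repeats the interface above so the sketch is self-contained.
type ExpFlagSpec interface {
	FlagName() string
	FlagDesc() string
	FlagNoArgs() bool
	FlagRequired() bool
}

// ExpFlag is a plain struct that satisfies ExpFlagSpec.
// The NoArgs field is an assumption of this sketch.
type ExpFlag struct {
	Name     string
	Desc     string
	NoArgs   bool
	Required bool
}

func (f *ExpFlag) FlagName() string   { return f.Name }
func (f *ExpFlag) FlagDesc() string   { return f.Desc }
func (f *ExpFlag) FlagNoArgs() bool   { return f.NoArgs }
func (f *ExpFlag) FlagRequired() bool { return f.Required }

// Compile-time check that *ExpFlag implements ExpFlagSpec.
var _ ExpFlagSpec = (*ExpFlag)(nil)

// Usage renders one help line for a flag (hypothetical helper).
func Usage(f ExpFlagSpec) string {
	s := "--" + f.FlagName()
	if !f.FlagNoArgs() {
		s += " <value>"
	}
	if f.FlagRequired() {
		s += " (required)"
	}
	return s + ": " + f.FlagDesc()
}

func main() {
	fmt.Println(Usage(&ExpFlag{Name: "time", Desc: "Delay time, ms", Required: true}))
	fmt.Println(Usage(&ExpFlag{Name: "consumer", Desc: "Match consumer-side calls", NoArgs: true}))
}
```

Because callers depend only on the interface, a help generator or command-line parser can treat matchers and action parameters uniformly.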


A concrete implementation of the ChaosBlade model

Take the network component as an example. As a chaos experiment component, network currently covers four drill scenarios: network delay, network blocking, network packet loss, and DNS tampering. Following the model specification, the concrete implementation is as follows:

type NetworkCommandSpec struct {
}

func (*NetworkCommandSpec) Name() string {
    return "network"
}

func (*NetworkCommandSpec) Actions() []exec.ExpActionCommandSpec {
    return []exec.ExpActionCommandSpec{
        &DelayActionSpec{},
        &DropActionSpec{},
        &DnsActionSpec{},
        &LossActionSpec{},
    }
}
Network target defines four chaos experiment scenarios: DelayActionSpec, DropActionSpec, DnsActionSpec, and LossActionSpec. DelayActionSpec is defined as follows:

type DelayActionSpec struct {
}

func (*DelayActionSpec) Name() string {
    return "delay"
}

func (*DelayActionSpec) Matchers() []exec.ExpFlagSpec {
    return []exec.ExpFlagSpec{
        &exec.ExpFlag{
            Name: "local-port",
            Desc: "Port for external service",
        },
        &exec.ExpFlag{
            Name: "remote-port",
            Desc: "Port for invoking",
        },
        &exec.ExpFlag{
            Name: "exclude-port",
            Desc: "Exclude one local port, for example 22 port. This flag is invalid when --local-port or remote-port is specified",
        },
        &exec.ExpFlag{
            Name:     "device",
            Desc:     "Network device",
            Required: true,
        },
    }
}

func (*DelayActionSpec) Flags() []exec.ExpFlagSpec {
    return []exec.ExpFlagSpec{
        &exec.ExpFlag{
            Name:     "time",
            Desc:     "Delay time, ms",
            Required: true,
        },
        &exec.ExpFlag{
            Name: "offset",
            Desc: "Delay offset time, ms",
        },
    }
}

func (*DelayActionSpec) Executor(channel exec.Channel) exec.Executor {
    return &NetworkDelayExecutor{channel}
}
DelayActionSpec contains two scenario parameters (time and offset) and four rule matchers (local-port, remote-port, exclude-port, and device).
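
One practical use of such a declarative spec is checking user input before an experiment runs. The sketch below is not taken from the ChaosBlade code base: FlagSpec is a trimmed-down, hypothetical stand-in for exec.ExpFlag, Validate is an invented helper, and the flag names and Required settings are copied from the DelayActionSpec listing above.

```go
package main

import (
	"fmt"
	"sort"
)

// FlagSpec is a trimmed-down stand-in for exec.ExpFlag (hypothetical).
type FlagSpec struct {
	Name     string
	Required bool
}

// delaySpecs lists the four matchers and two parameters of the delay action.
var delaySpecs = []FlagSpec{
	{Name: "local-port"},
	{Name: "remote-port"},
	{Name: "exclude-port"},
	{Name: "device", Required: true},
	{Name: "time", Required: true},
	{Name: "offset"},
}

// Validate returns the flags that are not declared in specs and the
// required flags that are absent from args, both sorted by name.
func Validate(specs []FlagSpec, args map[string]string) (unknown, missing []string) {
	known := make(map[string]bool, len(specs))
	for _, s := range specs {
		known[s.Name] = true
		if _, ok := args[s.Name]; s.Required && !ok {
			missing = append(missing, s.Name)
		}
	}
	for name := range args {
		if !known[name] {
			unknown = append(unknown, name)
		}
	}
	sort.Strings(unknown)
	sort.Strings(missing)
	return unknown, missing
}

func main() {
	unknown, missing := Validate(delaySpecs, map[string]string{
		"time":        "3000",
		"remote-port": "8080",
		"latency":     "1",
	})
	fmt.Println("unknown:", unknown, "missing:", missing)
}
```

Here the call is rejected on two counts: latency is not a declared flag, and the required device matcher is absent.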

The ChaosBlade content above comes from the Alibaba Middleware public account and was written by Xiao Changjun (alias Qionggu, GitHub ID @xcaspar), a senior development engineer at Alibaba with many years of experience in application performance monitoring and chaos engineering, a core developer of the Alibaba Cloud product AHAS, and the lead of the ChaosBlade open source project.



The 5th GIAC in Shenzhen (June 21-23) includes a dedicated “Chaos Engineering” track, with Xiao Changjun invited as a speaker to present “Chaos Engineering Practice under Distributed Services”.
