Background

With the popularity of container technology and the orchestration capabilities of Kubernetes (K8s), the flexibility and deployment speed of business development have improved significantly, but the complexity of systems and the difficulty of reliability testing have grown as well. As businesses expand and distributed systems scale up, we need to proactively discover weaknesses in our systems before they are exposed to users in production. This is where chaos engineering comes in.

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's ability to withstand turbulent conditions in production. Sometimes, even when all the individual services in a distributed system are functioning properly, the interactions between those services can lead to unpredictable results. These unpredictable outcomes, compounded by rare but disruptive events that affect the production environment, make distributed systems inherently chaotic.

Pain points and difficulties encountered in the verification process

1. In most scenarios, link coverage tests only cover the regular or normal service logic paths. Some abnormal scenarios, such as failures of upstream and downstream services, are very difficult to construct: they require the cooperation of other teams and sometimes cannot be simulated in the test environment at all. For such uncovered scenarios, unknown behavior, or even system failure, may occur once they happen online. There is a lack of an effective and simple fault-simulation method to find these problems.

2. System reliability testing is difficult. For systems with high reliability requirements, it is necessary to understand the behavior of each business process and to evaluate how services perform in adverse scenarios during testing.

3. System alarms cannot be validated. In most scenarios, alarm policies cannot be triggered in the test environment, so monitoring alarms cannot be trusted, and it is hard to judge from the alarm results whether the system is running properly.

Solution selection

ChaosBlade is Alibaba's open-source chaos engineering experiment injection tool. Currently supported scenarios include basic resources, Java applications, C++ applications, Docker containers, and the Kubernetes platform.

Solution implementation in cloud native scenarios

This section only introduces the use of ChaosBlade on the Kubernetes platform. For other scenarios and applications, refer to the official documentation.

1. Concept introduction

1. Kubernetes (k8s)

Kubernetes (K8S) is an open source platform for automating container operations such as deployment, scheduling, and scaling between node clusters.

2.Pod

Pods are scheduled onto nodes and contain a group of containers and volumes. Containers in the same Pod share the same network namespace and can communicate with each other using localhost. A Pod is transient, not a durable entity.
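For illustration, a minimal sketch of a Pod manifest with two containers that share one network namespace (all names and images here are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod                # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: nginx:1.21           # serves HTTP on port 80
    ports:
    - containerPort: 80
  - name: sidecar
    image: busybox:1.34
    # shares the Pod's network namespace, so it can reach the app
    # container via localhost:80
    command: ["sh", "-c", "while true; do wget -qO- http://localhost:80 >/dev/null; sleep 5; done"]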

3.Replication Controller

The Replication Controller ensures that a specified number of Pod "copies" are running at any given time. If a Replication Controller is created for a Pod with three copies specified, it creates three Pods and continuously monitors them. If a Pod stops responding, the Replication Controller replaces it.

Differences between Replication Controller and Deployment: the Replication Controller (RC) is a core part of Kubernetes and is the key to ensuring that an application keeps running after it is hosted on Kubernetes. Deployment is also a core Kubernetes object whose main responsibility is to ensure the number and health of Pods. About 90% of its functionality is identical to the Replication Controller; it can be regarded as the new generation of the Replication Controller, with some additional features.
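As a minimal sketch (hypothetical names), a Deployment that keeps three Pod replicas running looks like this; if one Pod disappears, the controller recreates it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment         # hypothetical name
spec:
  replicas: 3                   # keep three Pod copies running at all times
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: app
        image: nginx:1.21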

4.Node

A Node is a physical or virtual machine on which Pods are scheduled and run.
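For reference, the nodes in a cluster and their basic information can be listed with:

$ kubectl get nodes -o wide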

5.ChaosBlade-Operator

ChaosBlade-Operator is the implementation of ChaosBlade experiment scenarios for the Kubernetes platform. Chaos experiments are defined using standard Kubernetes CRDs, so users can define a ChaosBlade experiment in the same way as a Deployment or StatefulSet. With some knowledge of kubectl and Kubernetes objects, you can easily create, update, and delete experiment scenarios; experiments can also be managed with the ChaosBlade CLI tool.
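Since experiments are ordinary Kubernetes objects, they can be inspected with standard kubectl commands once the operator is installed (a sketch assuming the default installation):

# Confirm the ChaosBlade CRD is registered
$ kubectl get crd | grep chaosblade
# List the experiments currently defined in the cluster (resource name: blade)
$ kubectl get blade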

2. Environment installation

ChaosBlade-Operator needs to be installed using Helm.

# Install Helm
$ curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
# Download the installation package
$ wget -qO chaosblade-operator-0.6.0.tgz https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.6.0/chaosblade-operator-0.6.0-v3.tgz
# Create a namespace for ChaosBlade
$ kubectl create namespace chaosblade
# Install ChaosBlade-Operator
$ helm install chaos chaosblade-operator-0.6.0.tgz --set webhook.enable=true --namespace=chaosblade
# View the installation result
$ kubectl get pod -n chaosblade | grep chaosblade
chaosblade-operator-6b6b484599-gdgq8   1/1     Running   0          4d23h
chaosblade-tool-7wtph                  1/1     Running   0          4d20h
chaosblade-tool-r4zdk                  1/1     Running   0          4d23h

After ChaosBlade-Operator starts, a chaosblade-tool Pod is deployed on each node, along with a single chaosblade-operator Pod. If both are running properly, the installation is successful. The --set webhook.enable=true option above is required for Pod file-system I/O fault experiments; if you do not need that kind of testing, you can omit it.
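If the operator was initially installed without the webhook and Pod file-system I/O experiments are needed later, one sketch of switching it on is a helm upgrade (assuming the release name and chart package used above):

$ helm upgrade chaos chaosblade-operator-0.6.0.tgz --set webhook.enable=true --namespace=chaosblade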

3. Common experiments

1. Container experiment scenario

General execution command

How to obtain the container-id:

kubectl -n xxx-xxx-xx-xxxx get pod xxxx-xxxx-xxxx-xxxx -o custom-columns=CONTAINER:.status.containerStatuses[0].name,ID:.status.containerStatuses[0].containerID
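The output has two columns, the container name and the container ID; for the guestbook example used below, it would look roughly like this (values are illustrative):

CONTAINER   ID
guestbook   docker://c6cdcf60b82b854bc4bab64308b466102245259d23e14e449590a8ed816403ed

Note that the experiment YAML below uses only the hash after the docker:// prefix in the container-ids field.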

Execute the command to start the test

$ kubectl apply -f remove_container_by_id.yaml

Check test status

Run kubectl get blade remove-container-by-id -o json to check the experiment status.
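The same status can also be viewed in a human-readable form with kubectl describe (using the experiment name from this example):

$ kubectl describe blade remove-container-by-id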

To stop the experiment

Run kubectl delete -f remove_container_by_id.yaml, or run kubectl delete blade remove-container-by-id, to delete the blade resource.
1.1 Remove a container

Purpose: at the Container level, simulate how the service and its data flow are handled when a service restarts abnormally (simulating either a fault occurring or a service recovering).

remove_container_by_id.yaml content:

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: remove-container-by-id
spec:
  experiments:
  - scope: container
    target: container
    action: remove
    desc: "remove container by id"
    matchers:
    - name: container-ids
      value: ["c6cdcf60b82b854bc4bab64308b466102245259d23e14e449590a8ed816403ed"]
      # pod name
    - name: names
      value: ["guestbook-7b87b7459f-cqkq2"]
    - name: namespace
      value: ["chaosblade"]
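To observe the effect, watch the target Pod while the experiment runs; the removed container is restarted by the kubelet, so the Pod's restart count increases (a sketch using the Pod name from the YAML above):

$ kubectl get pod guestbook-7b87b7459f-cqkq2 -n chaosblade -w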
1.2 CPU load

Objective: at the Container level, simulate how the service's processing capability performs under high load (some code-execution exceptions are triggered frequently only when the service is under heavy load).

increase_container_cpu_load_by_id.yaml content:

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: increase-container-cpu-load-by-id
spec:
  experiments:
  - scope: container
    target: cpu
    action: fullload
    desc: "increase container cpu load by id"
    matchers:
    - name: container-ids
      value:
      - "5ad91eb49c1c6f8357e8d455fd27dad5d0c01c5cc3df36a3acdb1abc75f68a11"
    - name: cpu-percent
      value: ["100"]
      # pod names
    - name: names
      value: ["redis-slave-55d8c8ffbd-jd8sm"]
    - name: namespace
      value: ["chaosblade"]
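To verify the injected load, check the CPU usage of the target container; if metrics-server is installed, kubectl top can be used (a sketch with the Pod name from the YAML above):

# Requires metrics-server; shows per-container CPU usage for the target Pod
$ kubectl top pod redis-slave-55d8c8ffbd-jd8sm -n chaosblade --containers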
1.3 Network exception

Objective: at the Container level, simulate service network anomalies (how the service performs under high latency, abnormal network packets, and similar scenarios).

delay_container_network_by_id.yaml content:

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: delay-container-network-by-id
spec:
  experiments:
  - scope: container
    target: network
    action: delay
    desc: "delay container network by container id"
    matchers:
    - name: container-ids
      value:
      - "02655dfdd9f0f712a10d63fdc6721f4dcee0a390e37717fff068bf3f85abf85e"
    - name: names
      value:
      - "redis-master-68857cd57c-hknb6"
    - name: namespace
      value:
      - "chaosblade"
    - name: local-port
      value: ["6379"]
    - name: interface
      value: ["eth0"]
    - name: time
      value: ["3000"]
    - name: offset
      value: ["1000"]


Observe the experiment results

$ kubectl get pod -l app=redis,role=master -o jsonpath={.items..status.podIP}
10.42.69.44
# Test the connection time
$ time echo "" | telnet 10.42.69.44 6379
Trying 10.42.69.44...
Connected to 10.42.69.44.
Escape character is '^]'.
Connection closed by foreign host.

real    0m3.790s
user    0m0.007s
sys     0m0.001s

Other experiments can be simulated by referring to "Container Experiment Simulation".

2. Pod experiment scenario

General execution command

Execute the command to start the test

$ kubectl apply -f delay_pod_network_by_names.yaml

Check test status

Run kubectl get blade delay-pod-network-by-names -o json to check the experiment status.

To stop the experiment

Run kubectl delete -f delay_pod_network_by_names.yaml, or run kubectl delete blade delay-pod-network-by-names, to delete the blade resource.
2.1 Pod network exception

Experiment objective: at the Pod level, simulate service network anomalies (how the service performs under high latency, abnormal network packets, and similar scenarios).

delay_pod_network_by_names.yaml content:

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: delay-pod-network-by-names
spec:
  experiments:
  - scope: pod
    target: network
    action: delay
    desc: "delay pod network by names"
    matchers:
    - name: names
      value:
      - "redis-master-68857cd57c-dzbs9"
    - name: namespace
      value:
      - "chaosblade"
    - name: local-port
      value: ["6379"]
    - name: interface
      value: ["eth0"]
    - name: time
      value: ["3000"]
    - name: offset
      value: ["1000"]

Observe the experiment results

$ kubectl get pod -l app=redis,role=master -o jsonpath={.items..status.podIP}
10.42.69.44
# Test the connection time
$ time echo "" | telnet 10.42.69.44 6379
Trying 10.42.69.44...
Connected to 10.42.69.44.
Escape character is '^]'.
Connection closed by foreign host.

real    0m3.790s
user    0m0.007s
sys     0m0.001s
2.2 Delete Pods

delete_pod_by_labels.yaml content:

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: delete-two-pod-by-labels
spec:
  experiments:
  - scope: pod
    target: pod
    action: delete
    desc: "delete pod by labels"
    matchers:
    - name: labels
      value:
      - "role=master"
    - name: namespace
      value:
      - "chaosblade"
    - name: evict-count
      value:
      - "2"
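Because the targeted Pods are managed by a controller, the deleted replicas are recreated automatically; watching the Pods by label shows the old ones terminating and new ones starting (a sketch based on the labels in the YAML above):

$ kubectl get pod -l role=master -n chaosblade -w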

Other experiments can be simulated by referring to "Pod Experiment Simulation".

3. Node experiment scenario

Node-level experiments have a broad blast radius, because Pods belonging to different teams may be distributed across the target nodes, which makes the impact hard to control. This approach is recommended for disaster drills (simulating the outage of some nodes in the cluster) and is not covered in detail here.

For Node-level experiments, refer to "Node Experiment Simulation".

Series

"Chaos Engineering: System-level Fault Simulation"

“Fault Injection: Code-level Fault Simulation”

Links

“Interactive Tutorial”