Getting Started with Chaos Engineering Using ChaosToolkit

Chaos Engineering is a discipline for testing the resilience of complex systems. By running experiments that deliberately introduce various kinds of chaos, ideally in the production environment, and observing how the system copes, we can find its weaknesses and build confidence in it. This article uses the open source Chaos Engineering framework ChaosToolkit to walk through a simple chaos experiment.

Source code

https://gitee.com/lengdanran/chaostoolkit-experiment

Determine the target system

Here I use a small system made up of simple Flask services and helper processes:

  • DataSourceService: emulates a database service that acts as the data source for the entire system
  • ShowDataService: simulates a front-end service that presents the data
  • Gateway: emulates Nginx and forwards requests
  • Keeper: a background daemon process that automatically creates a new service instance when a service becomes unavailable

I start each service as a separate process to simulate a containerized cluster deployment in a production environment, where adding redundancy improves overall availability. The Gateway distributes client requests across this small pseudo-cluster.

Write the experiment.json experiment plan

The following is an example configuration provided by ChaosToolkit

{
    "title": "What is the impact of an expired certificate on our application chain?",
    "description": "If a certificate expires, we should gracefully deal with the issue.",
    "tags": ["tls"],
    "steady-state-hypothesis": {
        "title": "Application responds",
        "probes": [
            {
                "type": "probe",
                "name": "the-astre-service-must-be-running",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {"path": "astre.pid"}
                }
            },
            {
                "type": "probe",
                "name": "the-sunset-service-must-be-running",
                "tolerance": true,
                "provider": {
                    "type": "python",
                    "module": "os.path",
                    "func": "exists",
                    "arguments": {"path": "sunset.pid"}
                }
            },
            {
                "type": "probe",
                "name": "we-can-request-sunset",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "timeout": 3,
                    "verify_tls": false,
                    "url": "https://localhost:8443/city/Paris"
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "swap-to-expired-cert",
            "provider": {
                "type": "process",
                "path": "cp",
                "arguments": "expired-cert.pem cert.pem"
            }
        },
        {
            "type": "probe",
            "name": "read-tls-cert-expiry-date",
            "provider": {
                "type": "process",
                "path": "openssl",
                "arguments": "x509 -enddate -noout -in cert.pem"
            }
        },
        {
            "type": "action",
            "name": "restart-astre-service-to-pick-up-certificate",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "--echo -HUP -F astre.pid"
            }
        },
        {
            "type": "action",
            "name": "restart-sunset-service-to-pick-up-certificate",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "--echo -HUP -F sunset.pid"
            },
            "pauses": {"after": 1}
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "swap-to-valid-cert",
            "provider": {
                "type": "process",
                "path": "cp",
                "arguments": "valid-cert.pem cert.pem"
            }
        },
        {"ref": "restart-astre-service-to-pick-up-certificate"},
        {"ref": "restart-sunset-service-to-pick-up-certificate"}
    ]
}

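The jsonpath tolerance type used later in this article needs the optional jsonpath dependency of chaostoolkit-lib:

pip install chaostoolkit-lib[jsonpath]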

Now let’s read the experiment plan in sections.

As the example above shows, the configuration file has only a few sections, namely the following six:

  • Title: the name of this chaos experiment
  • Description: a basic overview of the chaos experiment
  • Tags: labels for the experiment
  • Steady-state-hypothesis: defines the steady-state hypothesis
  • Method: defines the series of disturbances the experiment applies to the system, expressed mainly as two kinds of activities, action and probe
  • Rollbacks: after the chaos experiment completes, everything done to the system should be rolled back so it returns to its original state (optional)

Of these six sections, only the last three really matter.
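To make the section list concrete, here is a minimal sketch that inspects an experiment loaded as a Python dict. Which sections count as required is an assumption of this sketch, the helper names are hypothetical, and the sample experiment is made up for illustration. ChaosToolkit's own `chaos validate` command is the authoritative check.

```python
import json

# Section names come from the list above. Treating these three as required
# is an assumption of this sketch, not ChaosToolkit's own validation logic.
REQUIRED_SECTIONS = ("title", "description", "method")


def missing_sections(experiment: dict) -> list:
    """Return the required section names that the experiment lacks."""
    return [name for name in REQUIRED_SECTIONS if name not in experiment]


def activity_names(experiment: dict) -> list:
    """List (type, name) pairs for every activity in the method section."""
    return [(a.get("type"), a.get("name")) for a in experiment.get("method", [])]


# Hypothetical, heavily shortened experiment used only for this demo.
sample = json.loads("""
{
    "title": "demo",
    "description": "a hypothetical minimal experiment",
    "method": [
        {"type": "action", "name": "swap-to-expired-cert"},
        {"type": "probe", "name": "read-tls-cert-expiry-date"}
    ]
}
""")

print(missing_sections(sample))
print(activity_names(sample))
```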

Steady-state-hypothesis: defining the steady-state hypothesis

This section defines what the normal, steady state of the running system looks like. For example, at a load of 10,000 QPS, a given interface should still return status code 200. As long as the interface responds properly under those conditions, the system is considered to be working normally.

The steady-state hypothesis consists of one or more probes, each with a corresponding tolerance. Each probe reads a property of the target system, and the tolerance determines whether the observed value is within the acceptable range.
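The probe-plus-tolerance idea can be sketched in a few lines. This is a simplified illustration that only handles scalar tolerances such as true, 200, or a string; it is not ChaosToolkit's actual implementation, and the function names and observed values are hypothetical.

```python
def within_tolerance(value, tolerance) -> bool:
    """Return True if an observed probe value satisfies a scalar tolerance."""
    # Booleans are compared strictly so that 1 does not pass for True.
    if isinstance(tolerance, bool) or isinstance(value, bool):
        return type(value) is type(tolerance) and value == tolerance
    return value == tolerance


def steady_state_holds(results: dict) -> bool:
    """results maps a probe name to an (observed value, tolerance) pair."""
    return all(within_tolerance(v, t) for v, t in results.values())


# Hypothetical observations for probes like those in the example above.
healthy = {
    "the-astre-service-must-be-running": (True, True),
    "we-can-request-sunset": (200, 200),
}
broken = {
    "the-astre-service-must-be-running": (True, True),
    "we-can-request-sunset": (500, 200),
}

print(steady_state_holds(healthy))
print(steady_state_holds(broken))
```

A single probe outside its tolerance is enough for the whole hypothesis to fail, which is what makes the experiment report a deviation.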

The experiment.json file used for the experiment

{
  "title": "<======System Chaos Experiment======>",
  "description": "<===Simple Chaos Experiment By ChaosToolkit===>",
  "tags": ["Chaostoolkit Experiment"],
  "steady-state-hypothesis": {
    "title": "System State Before Experiment",
    "probes": [
      {
        "type": "probe",
        "name": "<====System GetData Interface Test====>",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.data",
          "expect": "Handle the get http request method",
          "target": "body"
        },
        "provider": {
          "type": "http",
          "timeout": 20,
          "verify_tls": false,
          "url": "http://localhost:5000/getData"
        }
      },
      {
        "type": "probe",
        "name": "<====System ShowData Interface Test====>",
        "tolerance": {
          "type": "jsonpath",
          "path": "$.data",
          "expect": "Handle the get http request method",
          "target": "body"
        },
        "provider": {
          "type": "http",
          "timeout": 20,
          "verify_tls": false,
          "url": "http://localhost:5000/showData"
        }
      },
      {
        "type": "probe",
        "name": "<=====python module call=====>",
        "tolerance": "this is a test func output",
        "provider": {
          "type": "python",
          "module": "chaostkex.experiment",
          "func": "test",
          "arguments": {}
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "Kill 1 service instance of DataSourceService",
      "provider": {
        "type": "python",
        "module": "chaostkex.experiment",
        "func": "kill_services",
        "arguments": {
          "num": 1,
          "port_file_path": "E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt"
        }
      }
    },
    {
      "type": "action",
      "name": "Kill 1 service instance of ShowSourceService",
      "provider": {
        "type": "python",
        "module": "chaostkex.experiment",
        "func": "kill_services",
        "arguments": {
          "num": 1,
          "port_file_path": "E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt"
        }
      }
    }
  ],
  "rollbacks": []
}

Chaos experiment steps

The system has a simple structure, and the DataSource service is independent of the other services. The chaos experiment checks whether the two interfaces the system exposes, http://127.0.0.1:5000/getData and http://127.0.0.1:5000/showData, work correctly. Requests enter through the gateway, which forwards them to a backend server and returns the response to the caller.
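The interface checks in experiment.json use a jsonpath tolerance: read `$.data` from the response body and compare it with the expected string. A rough sketch of that comparison follows, assuming a tiny resolver that only understands dotted paths such as `$.data`. The real tolerance relies on a full JSONPath library, and both helper names and both sample responses are hypothetical.

```python
import json


def resolve_simple_path(document, path: str):
    """Resolve a dotted path such as "$.data" or "$.a.b" in nested dicts."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    for key in [part for part in path[1:].split(".") if part]:
        current = current[key]
    return current


def body_matches(response: dict, tolerance: dict) -> bool:
    """Compare the value found at tolerance['path'] with tolerance['expect']."""
    target = response.get(tolerance.get("target", "body"))
    try:
        document = json.loads(target) if isinstance(target, str) else target
        return resolve_simple_path(document, tolerance["path"]) == tolerance["expect"]
    except (KeyError, TypeError, ValueError):
        # Missing key, non-JSON body, or an unsupported path all count as a miss.
        return False


tolerance = {
    "type": "jsonpath",
    "path": "$.data",
    "expect": "Handle the get http request method",
    "target": "body",
}

# Hypothetical responses: a healthy backend and the gateway's fallback text.
healthy = {"status": 200, "body": '{"data": "Handle the get http request method"}'}
fallback = {"status": 200, "body": "There is no DataSourceService available."}

print(body_matches(healthy, tolerance))
print(body_matches(fallback, tolerance))
```

The fallback case shows why checking the body is stricter than checking the status code alone: the gateway can answer successfully while having no backend to forward to.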

The overall experiment is simple:

  • Kill one instance each of the DataSource and ShowData service processes, then check whether the system's two public interfaces still work properly

Write the service driver

To let ChaosToolkit run actions and probes against the target system during the experiment, we need to write a custom driver for that system. The following is the driver I used:

import os
import platform
from chaosservices import DataSourceService, ShowDataService


def test():
    # Simple probe: lets the steady-state hypothesis confirm that
    # ChaosToolkit can call into this module.
    print("this is a test func output")
    return "this is a test func output"


def kill_services_by_ports(ports: list = None) -> bool:
    """Kill every process bound to one of the given ports."""
    ports = ports or []
    sysstr = platform.system()
    if sysstr == "Windows":
        try:
            for port in ports:
                # List the connections that mention this port
                with os.popen('netstat -ano|findstr "%d"' % int(port)) as res:
                    res = res.read().split('\n')
                result = []
                for line in res:
                    temp = [i for i in line.split(' ') if i != '']
                    if len(temp) > 4:
                        result.append({'pid': temp[4], 'address': temp[1], 'state': temp[3]})
                for r in result:
                    # PID 0 is the System Idle Process, so skip it
                    if int(r['pid']) == 0:
                        continue
                    os.system("taskkill /f /pid %d" % int(r['pid']))
        except Exception as e:
            print(e)
            return False

        return True
    else:
        print("Other System tasks")
        for port in ports:
            command = '''kill -9 $(netstat -nlp | grep :''' + \
                      str(port) + ''' | awk '{print $7}' | awk -F"/" '{ print $1 }')'''
            os.system(command)
    return True


def get_ports(port_file_path: str) -> list:
    """Read the port file and return the distinct ports listed in it."""
    if port_file_path is None or os.path.exists(port_file_path) is False:
        raise FileNotFoundError
    ports = []
    with open(port_file_path, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if line.strip() != '':
            ports.append(line.strip())
    return list(set(ports))


def kill_services(num: int = 1, port_file_path: str = '') -> bool:
    """Kill up to num service instances listed in the port file."""
    if num < 1:
        return True
    ports = get_ports(port_file_path=port_file_path)
    cnt = min(num, len(ports))
    for i in range(0, cnt):
        kill_services_by_ports([ports[i]])
    return True


def start_datasource_service(port: int = 8080, portsfile: str = None) -> bool:
    DataSourceService.start(port=port, portsfile=portsfile)
    return True


def start_showdata_service(port: int = 8090, portsfile: str = None) -> bool:
    ShowDataService.start(port=port, portsfile=portsfile)
    return True


if __name__ == '__main__':
    # port_file_path = '../chaosservices/ports/dataSourcePort.txt'
    # kill_services(num=1, port_file_path=port_file_path)
    kill_services_by_ports([8080])

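The python provider entries in experiment.json name a module, a function, and its arguments, which is why the driver has to be importable from wherever `chaos run` is executed. A minimal sketch of how such an entry can be turned into a call, assuming a plain importlib lookup: `call_python_provider` is a hypothetical helper, and ChaosToolkit's real loader does more than this.

```python
import importlib


def call_python_provider(provider: dict):
    """Import provider['module'], look up provider['func'], and call it
    with provider['arguments'] passed as keyword arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


# Provider copied from the ChaosToolkit example earlier in this article.
provider = {
    "type": "python",
    "module": "os.path",
    "func": "exists",
    "arguments": {"path": "astre.pid"},
}

print(call_python_provider(provider))
```

Because the arguments are passed by keyword, the keys under "arguments" must match the parameter names of the driver function, for example num and port_file_path for kill_services.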

Target system program

DataSource

from typing import Dict

from flask import Flask, request

app = Flask(__name__)


@app.route("/", methods=["GET"])
def getData() -> Dict[str, str]:
    if request.method == "GET":
        return {"data": "Handle the get http request method"}
    else:
        return {"data": "Other methods handled."}


def clear_file(portsfile=None) -> None:
    f = open(portsfile, 'w')
    f.truncate()
    f.close()


def start(host='127.0.0.1', port=8080, portsfile='./ports/dataSourcePort.txt') -> None:
    print("[Info]:\tServe on %s" % str(port))
    clear_file(portsfile=portsfile)
    with open(portsfile, "a+") as f:
        f.write(str(port) + '\n')
    app.run(host=host, port=port, debug=False)


if __name__ == '__main__':
    start(port=8080, portsfile='E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt')


ShowDataService

import requests as net_req
from flask import Flask

app = Flask(__name__)

# Adding command-line startup parameters prevents ChaosToolkit from loading this module correctly
# parser = argparse.ArgumentParser(description='manual to this script')
# parser.add_argument("--host", type=str, default="127.0.0.1")
# parser.add_argument("--port", type=int, default=8090)
# parser.add_argument("--portsfile", type=str, default='./ports/showPort.txt')
# args = parser.parse_args()

url = 'http://127.0.0.1:5000/getData'


@app.route('/', methods=['GET'])
def show_data() -> str:
    rsp = net_req.get(url=url)
    print(rsp)
    return rsp.text


def clear_file(portsfile=None) -> None:
    f = open(portsfile, 'w')
    f.truncate()
    f.close()


def start(host='127.0.0.1', port=8090, portsfile='./ports/dataShowPort.txt') -> None:
    print("[Info]:\tServe on %s" % str(port))
    clear_file(portsfile=portsfile)
    with open(portsfile, "a+") as f:
        f.write(str(port) + '\n')
    app.run(host=host, port=port, debug=False)


if __name__ == '__main__':
    start(port=8090, portsfile='E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt')


Gateway

import requests as net
import json
import sys
from flask import Flask, request

app = Flask(__name__)

# List of data source servers
datasource = []
# Data display front desk service list
datashow = []

datasource_idx = 0
datashow_idx = 0


@app.route('/getData', methods=['GET'])
def get_data() -> str:
    print('[====INFO===]:\tHandle the request from %s' % request.url)
    res = get(urls=datasource)
    return res if res != '' else 'There is no DataSourceService available.'


@app.route('/showData', methods=['GET'])
def show_data() -> str:
    print('[====INFO===]:\tHandle the request from %s' % request.url)
    res = get(urls=datashow)
    return res if res != '' else 'There is no ShowDataService available.'


def get(urls: list) -> str:
    """
    From the given list of URLs, request the first available URL and return the response.
    :param urls: collection of URLs
    :return: response string, or '' if no URL is reachable
    """
    for url in urls:
        try:
            rsp = net.get(url, timeout=10)
            print('[====INFO====]:\tForward this request to %s' % url)
            return rsp.text
        except Exception as e:
            print("[====EXCEPTION====]:\t%s" % e)
            continue
    return ''


def _get_configuration(file_path='./conf/gateway.json') -> None:
    """
    Load the configuration from the configuration file.
    :param file_path: path to the configuration file. Default is './conf/gateway.json'
    :return: None
    """
    print('[====INFO====]:\tLoad configuration from file : %s' % file_path)
    with open(file_path) as f:
        conf = json.load(f)
        global datasource, datashow
        datasource = conf["datasource"]
        datashow = conf["datashow"]


if __name__ == '__main__':
    print('[====INFO====]:\tLoads the configuration...... ')
    try:
        _get_configuration()
    except IOError as error:
        print('[====ERROR====]:\t%s' % error)
        sys.exit(-1)
    print('[====INFO====]:\tStart the Gateway... ')
    app.run(host='127.0.0.1', port=5000, debug=False)

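The Gateway above always tries the backends in list order, and the datasource_idx and datashow_idx counters are declared but never used. One way such counters could be put to work is a round-robin rotation with failover, sketched below. This is an alternative technique, not the author's implementation. The RoundRobin class, the forward helper, and the fake fetch function are all hypothetical, and the fake stands in for requests.get so the sketch runs without a network.

```python
class RoundRobin:
    """Rotate the starting backend on every request."""

    def __init__(self, urls):
        self.urls = list(urls)
        self._next = 0

    def candidates(self) -> list:
        """Return all URLs, starting from the next one in the rotation."""
        if not self.urls:
            return []
        start = self._next % len(self.urls)
        self._next = start + 1
        return self.urls[start:] + self.urls[:start]


def forward(balancer: RoundRobin, fetch) -> str:
    """Try each candidate in turn and return the first successful response."""
    for url in balancer.candidates():
        try:
            return fetch(url)
        except Exception as e:
            print("[====EXCEPTION====]:\t%s" % e)
    return ''


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP call: the instance on port 8080 is 'down'."""
    if ":8080" in url:
        raise ConnectionError("%s is down" % url)
    return "response from %s" % url


balancer = RoundRobin(["http://127.0.0.1:8080/", "http://127.0.0.1:8081/"])
print(forward(balancer, fake_fetch))
print(forward(balancer, fake_fetch))
```

With rotation, a healthy cluster spreads load across instances, while the failover loop keeps the behaviour of the original get function when an instance has been killed.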

Keeper

This program monitors the status of the services. If a service becomes unavailable, it automatically starts a new instance so that the system keeps working normally.

import os
import socket
import time
import DataSourceService, ShowDataService
from multiprocessing import Process


def get_ports(port_file_path: str) -> list:
    if port_file_path is None or os.path.exists(port_file_path) is False:
        raise FileNotFoundError
    ports = []
    with open(port_file_path, 'r') as f:
        lines = f.readlines()
    for line in lines:
        # Skip blank lines so int() never sees an empty string
        if line.strip() != '':
            ports.append(int(line.strip()))
    return list(set(ports))


def get_available_service(port_file: str = None) -> bool:
    if port_file is None:
        return False
    ports = get_ports(port_file_path=port_file)
    for p in ports:
        if check_port_in_use(port=p):
            return True
    return False


def check_port_in_use(host='127.0.0.1', port=8080) -> bool:
    s = None
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(1)
        s.connect((host, int(port)))
        return True
    except socket.error:
        return False
    finally:
        if s:
            s.close()


def creat(func, args):
    p = Process(target=func, args=args)
    p.start()


def start(port_files: list = []) -> None:
    sleep_time = 5
    while True:
        print('Start Checking... ')
        # Get the list of ports for each service
        port_file = port_files[0]
        # Check whether any service instance is available
        if get_available_service(port_file=port_file) is False:
            # No service instance is available, so create a new one
            print('[===INFO===]:\t Create DataSourceService instance ')
            ports = get_ports(port_file_path=port_file)
            if len(ports) == 0:
                last = 8080
            else:
                # get_ports() returns an unordered list, so take the highest port
                last = max(ports)
            new_p = last + 1
            DataSourceService.clear_file(portsfile=port_file)
            creat(func=DataSourceService.start, args=('127.0.0.1', new_p, port_file,))

        port_file = port_files[1]
        # Check whether any service instance is available
        if get_available_service(port_file=port_file) is False:
            # No service instance is available, so create a new one
            print('[===INFO===]:\t Create ShowDataService instance ')
            ports = get_ports(port_file_path=port_file)
            if len(ports) == 0:
                last = 8090
            else:
                # get_ports() returns an unordered list, so take the highest port
                last = max(ports)
            new_p = last + 1
            ShowDataService.clear_file(portsfile=port_file)
            creat(func=ShowDataService.start, args=('127.0.0.1', new_p, port_file,))

        time.sleep(sleep_time)


if __name__ == '__main__':
    start(port_files=[
        'E:/desktop/chaosmodule/chaostest/ports/dataSourcePort.txt',
        'E:/desktop/chaosmodule/chaostest/ports/dataShowPort.txt'
    ])

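The Keeper's decision on each pass boils down to two questions: is any known port still accepting connections, and if not, which port should the replacement use? A compact sketch of that logic follows, with the liveness check passed in as a function so it can be exercised without real services. The helper names are hypothetical, and the demo opens a throwaway listener on a port chosen by the operating system.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def port_for_new_instance(ports, is_alive, default_port):
    """Return the port for a replacement instance, or None if one is alive."""
    if any(is_alive(p) for p in ports):
        return None
    return (max(ports) if ports else default_port) + 1


# Demo: a temporary listener plays the role of a running service instance.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
live_port = listener.getsockname()[1]

print(port_for_new_instance([live_port], lambda p: port_is_open("127.0.0.1", p), 8080))
listener.close()

# With no instance alive, the replacement goes one above the highest known port.
print(port_for_new_instance([8080], lambda p: False, 8080))
```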

Start the test

System defect – The Keeper daemon is not started

In this run, only the Gateway and the DataSource and ShowData services are started, without the Keeper. The experiment kills the DataSource and ShowData service processes, and nothing brings them back. ChaosToolkit should detect such an obvious resilience weakness for us.

$ chaos run experiment.json

Start target system:

Running results:

The run output clearly contains the following line:

[2021-12-06 17:31:50 CRITICAL] Steady state probe '<====System GetData Interface Test====>' is not in the given tolerance so failing this experiment

ChaosToolkit found the system's resilience weakness for us. It was detected when verifying the probe <====System GetData Interface Test====>.

[2021-12-06 17:31:50 INFO] Experiment ended with status: deviated
[2021-12-06 17:31:50 INFO] The steady-state has deviated, a weakness may have been discovered

The chaos run command also generates a journal.json file in the directory where it was executed, which contains the detailed report data of the experiment.
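A short script can condense the journal into a few lines. The sketch below is hedged: the key names it reads (status, deviated, steady_states, probes, activity, tolerance_met) are assumptions about the journal layout and should be checked against your own journal.json. The sample journal is hand-written for illustration and is not the output of the run above.

```python
import json


def summarize_journal(journal: dict) -> str:
    """Build a short text summary from a journal dict (assumed key names)."""
    lines = ["status: %s, deviated: %s" % (journal.get("status"), journal.get("deviated"))]
    steady_states = journal.get("steady_states") or {}
    for phase in ("before", "after"):
        state = steady_states.get(phase) or {}
        for probe in state.get("probes", []):
            name = probe.get("activity", {}).get("name")
            lines.append("%s | %s | tolerance_met=%s" % (phase, name, probe.get("tolerance_met")))
    return "\n".join(lines)


# Hypothetical, heavily shortened journal used only for this demo.
sample_journal = {
    "status": "deviated",
    "deviated": True,
    "steady_states": {
        "before": {"probes": [
            {"activity": {"name": "<====System GetData Interface Test====>"}, "tolerance_met": True},
        ]},
        "after": {"probes": [
            {"activity": {"name": "<====System GetData Interface Test====>"}, "tolerance_met": False},
        ]},
    },
}

print(summarize_journal(sample_journal))

# For a real run, load the file instead:
# with open("journal.json") as f:
#     print(summarize_journal(json.load(f)))
```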

Start two service instances

A simple way to improve availability is to increase the system's redundancy. In this run, I started two service instances each for DataSource and ShowData and ran the chaos experiment again.

As the results show, after adding redundancy the system still runs normally when failures are injected.
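The effect of redundancy can be estimated with a small calculation. Assuming each instance is available with probability a and instances fail independently (a simplification that real deployments rarely satisfy exactly), a service backed by n instances is available with probability 1 - (1 - a)^n. The figure of 0.9 below is an arbitrary illustration, not a measurement of this system.

```python
def service_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one of n independent instances is up."""
    if instances < 1:
        return 0.0
    return 1.0 - (1.0 - instance_availability) ** instances


# Illustrative numbers only: each instance is assumed to be up 90% of the time.
for n in (1, 2, 3):
    print(n, round(service_availability(0.9, n), 4))
```

In the experiment above the same reasoning applies in a cruder form: the method kills one instance per service, so a second instance is enough for the steady-state probes to keep passing.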

Start the Keeper daemon

Besides adding redundancy, you can also start a monitoring process that watches the status of the services and creates a new service instance when a service fails, which also improves availability.

The resilience of the system has also improved!