How the chaos engineering tool ChaosBlade is implemented


One, definition of chaos engineering

The Principles of Chaos Engineering define it as follows:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

The Chinese translation reads like this:

Chaos engineering is the discipline of experimenting on distributed systems to build confidence in the system’s ability to withstand runaway conditions in a production environment.

The English original does not seem to contain the words “distributed system”, so the Chinese translation appears to narrow the scope.

The principles are described as follows:

Build a hypothesis around steady-state behavior
Vary real-world events
Run experiments in production
Automate experiments to run continuously
Minimize blast radius

It is interesting to see some new terms here. Some people also distinguish chaos engineering from exception testing, fault testing, and so on. We are good at coining concepts: a concept runs ahead of the technology and guides its direction, but putting it into practice always takes some time.

Alibaba’s ChaosBlade is billed as a chaos engineering tool. Let’s look at what it actually does.

Two, download and unpack

The tool is very simple to set up: download it, unpack it, and it is ready to use.

[gaolou@7dgroup2 ~]$ wget -c https://github.com/chaosblade-io/chaosblade/releases/download/v0.2.0/chaosblade-0.2.0.linux-amd64.tar.gz
[gaolou@7dgroup2 ~]$ tar zxvf chaosblade-0.2.0.linux-amd64.tar.gz

Three, use and implementation

1. Simulate CPU load

[gaolou@7dgroup2 chaosblade-0.2.0]$ ./blade create cpu fullload
{"code":200,"success":true,"result":"cb6300fd4899c537"}
[gaolou@7dgroup2 chaosblade-0.2.0]$

View the simulation effect:

[Figure: CPU usage during the experiment]

The figure shows that us (user-mode) CPU usage is indeed consumed.

Now let’s see how it works. The implementation is in the runBurnCpu method; the key source code is as follows:

func runBurnCpu(ctx context.Context, cpuCount int, cpuPercent int, pidNeeded bool, processor string) int {
  args := fmt.Sprintf(`%s --nohup --cpu-count %d --cpu-percent %d`,
    path.Join(util.GetProgramPath(), burnCpuBin), cpuCount, cpuPercent)
  if pidNeeded {
    args = fmt.Sprintf("%s --cpu-processor %s", args, processor)
  }
  args = fmt.Sprintf(`%s > /dev/null 2>&1 &`, args)
  response := channel.Run(ctx, "nohup", args)
  if !response.Success {
    stopBurnCpuFunc()
    bin.PrintErrAndExit(response.Err)
  }
  if pidNeeded {
    // parse pid
    newCtx := context.WithValue(context.Background(), exec.ProcessKey, fmt.Sprintf("cpu-processor %s", processor))
    pids, err := exec.GetPidsByProcessName(burnCpuBin, newCtx)
    if err != nil {
      stopBurnCpuFunc()
      bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, cannot get the burning program pid, %v", err))
    }
    if len(pids) > 0 {
      // return the first one
      pid, err := strconv.Atoi(pids[0])
      if err != nil {
        stopBurnCpuFunc()
        bin.PrintErrAndExit(fmt.Sprintf("bind cpu core failed, get pid failed, pids: %v, err: %v", pids, err))
      }
      return pid
    }
  }
  return -1
}

The other related code is not pasted here. In short, ChaosBlade launches a small helper program whose only job is to consume CPU; such a program can be as simple as a busy loop.

2. Simulate high IO

[root@7dgroup2 chaosblade-0.2.0]# ./blade create disk burn --write --read --size 10 --count 1024 --timeout 300
{"code":200,"success":true,"result":"f026b3510722685d"}

View the simulation effect:

[root@7dgroup2 chaosblade-0.2.0]#
Device:  rrqm/s  wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda        0.00   91.00  250.00   815.00  84892.00  92588.00   333.30    43.92   39.27   41.60   38.56   0.93  99.50
dm-0       0.00 (all columns)
dm-7       0.00 (all columns)

Device:  rrqm/s  wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda        1.00  105.00  496.00   865.00  98012.00  92692.00   280.24    43.72   34.02   33.40   34.37   0.73
dm-0       0.00 (all columns)
dm-7       0.00 (all columns)

Device:  rrqm/s  wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda        0.99  106.93  259.41   675.25  99853.47  91750.50   410.00    36.22   38.53   47.09   35.24   1.06  98.81
dm-0       0.00 (all columns)
dm-7       0.00 (all columns)

Device:  rrqm/s  wrqm/s     r/s      w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda        0.00   80.00  241.00  1103.00
dm-0       0.00 (all columns)
dm-7       0.00 (all columns)

The results above show that IO is indeed being consumed. So let’s see how it does that.

  TID  PRIO  USER     DISK READ    DISK WRITE  SWAPIN      IO>    COMMAND
24036  be/4  root   104.55 M/s      0.00 B/s  0.00 %  99.99 %  dd if=/dev/vda1 of=/dev/null~ iflag=dsync,direct,fullblock
24034  be/4  root     0.00 B/s    104.55 M/s  0.00 %  68.17 %  dd if=/dev/zero of=/tmp/chao~bs=10M count=1024 oflag=dsync

Listing the processes with the highest IO shows these two processes. In other words, ChaosBlade simulates high IO by invoking dd. The key implementation code is as follows:

// write burn
func burnWrite(size, count string) {
  for {
    args := fmt.Sprintf(`if=/dev/zero of=%s bs=%sM count=%s oflag=dsync`, tmpDataFile, size, count)
    response := channel.Run(context.Background(), "dd", args)
    channel.Run(context.Background(), "rm", fmt.Sprintf(`-rf %s`, tmpDataFile))
    if !response.Success {
      bin.PrintAndExitWithErrPrefix(response.Err)
      return
    }
  }
}

// read burn
func burnRead(fileSystem, size, count string) {
  for {
    // "if" arg in dd command is file system value, but "of" arg value is related to mount point
    args := fmt.Sprintf(`if=%s of=/dev/null bs=%sM count=%s iflag=dsync,direct,fullblock`, fileSystem, size, count)
    response := channel.Run(context.Background(), "dd", args)
    if !response.Success {
      bin.PrintAndExitWithErrPrefix(fmt.Sprintf("The file system named %s is not supported or %s", fileSystem, response.Err))
    }
  }
}

One function writes and the other reads, each by running dd in an endless loop.

3. Simulate an unavailable port

3.1. Before simulation

(base) GaoLouMac:~ Zee$ telnet 101.201.210.163 9100
Trying 101.201.210.163...
Connected to 101.201.210.163.
Escape character is '^]'.

You can see that the port is reachable.

3.2. Make the port unavailable

[root@7dgroup2 chaosblade-0.2.0]# ./blade create network drop --local-port 9100
{"code":200,"success":true,"result":"55321ca383ef272c"}
[root@7dgroup2 chaosblade-0.2.0]#

3.3. After simulation

You can see that the port can no longer be reached:

(base) GaoLouMac:~ Zee$ telnet 101.201.210.163 9100
Trying 101.201.210.163...
telnet: connect to address 101.201.210.163: Operation timed out
telnet: Unable to connect to remote host
(base) GaoLouMac:~ Zee$

But how is the port made unreachable?

3.4. Implementation code

As the following code shows, ChaosBlade makes the port unavailable by adding DROP rules with the iptables command.

The following code can be seen in dropNetwork.go. Note that this excerpt uses -D, the iptables option that deletes a rule, so it shows the cleanup side; the rules created by the experiment have the same specification:

if localPort != "" {
  channel.Run(ctx, "iptables", fmt.Sprintf(`-D INPUT -p tcp --dport %s -j DROP`, localPort))
  channel.Run(ctx, "iptables", fmt.Sprintf(`-D INPUT -p udp --dport %s -j DROP`, localPort))
}

The resulting iptables rules:

[root@7dgroup2 chaosblade-0.2.0]# iptables -L -n|grep 9100
DROP       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9100
DROP       udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:9100
[root@7dgroup2 chaosblade-0.2.0]#

Querying the iptables rules shows that ChaosBlade added two rules that drop TCP and UDP packets on port 9100. Note that this change is temporary: nothing is written to the iptables configuration file.
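To make the pairing of creation and cleanup explicit, here is a small sketch in Go that only builds the iptables command lines and prints them. The helper names dropRuleArgs and dropPortCommands are hypothetical; the iptables options used (-A, -D, -p, --dport, -j DROP) are the standard ones, with -A appending a rule and -D deleting it. Whether ChaosBlade appends with -A or inserts the rules some other way is not shown in the excerpt, so treat -A as an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// dropRuleArgs builds the iptables arguments that add (-A) or delete (-D)
// a rule dropping inbound packets for one protocol and destination port.
// Hypothetical helper for illustration.
func dropRuleArgs(add bool, proto, port string) []string {
	action := "-D"
	if add {
		action = "-A"
	}
	return []string{action, "INPUT", "-p", proto, "--dport", port, "-j", "DROP"}
}

// dropPortCommands returns the full command lines for both TCP and UDP.
func dropPortCommands(add bool, port string) []string {
	var cmds []string
	for _, proto := range []string{"tcp", "udp"} {
		cmds = append(cmds, "iptables "+strings.Join(dropRuleArgs(add, proto, port), " "))
	}
	return cmds
}

func main() {
	fmt.Println("# create the fault")
	for _, c := range dropPortCommands(true, "9100") {
		fmt.Println(c)
	}
	fmt.Println("# remove the fault")
	for _, c := range dropPortCommands(false, "9100") {
		fmt.Println(c)
	}
}
```

The sketch prints the commands instead of running them because changing iptables rules requires root. Since the rule specification is identical in both directions, deleting with -D removes exactly what was added.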

What does this simulation look like at the packet level?

3.5. Analysis of simulation effect

Packet capture before the simulation:


[root@7dgroup2 ~]# tcpdump -i eth0 port 9000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:40:19.162485 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [S], seq 4090540787, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187658956 ecr 0,sackOK,eol], length 0
18:40:19.162592 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [S.], seq 3080683668, ack 4090540788, win 28960, options [mss 1460,sackOK,TS val 871980746 ecr 1187658956,nop,wscale 7], length 0
18:40:19.202395 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [.], ack 1, win 4120, options [nop,nop,TS val 1187658998 ecr 871980746]
18:40:51.771422 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [P.], seq 1:7, ack 1, win 4120, options [nop,nop,TS val 1187690315 ecr 871980746], length 6
18:40:51.771534 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [.], ack 7, win 227, options [nop,nop,TS val 872013355 ecr 1187690315], length 0
18:40:51.772024 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [P.], seq 1:99, ack 7, win 227, options [nop,nop,TS val 872013355 ecr 1187690315], length 98
18:40:51.772062 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [F.], seq 99, ack 7, win 227, options [nop,nop,TS val 872013355 ecr 1187690315], length 0
18:40:51.821279 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [.], ack 99, win 4117, options [nop,nop,TS val 1187690362 ecr 872013355], length 0
18:40:51.821336 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [.], ack 100, win 4117, options [nop,nop,TS val 1187690362 ecr 872013355], length 0
18:40:51.821355 IP 61.148.243.67.9485 > 7dgroup2.cslistener: Flags [F.], seq 7, ack 100, win 4117, options [nop,nop,TS val 1187690364 ecr 872013355], length 0
18:40:51.821380 IP 7dgroup2.cslistener > 61.148.243.67.9485: Flags [.], ack 8, win 227, options [nop,nop,TS val 872013404 ecr 1187690364], length 0

These results show that communication was completely normal before the iptables rules were created: a standard TCP handshake followed by a standard connection teardown.

3.6. Packet capture results after simulation

18:43:12.531311 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187826295 ecr 0,sackOK,eol], length 0
18:43:13.551168 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187827296 ecr 0,sackOK,eol], length 0
18:43:14.611149 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187828296 ecr 0,sackOK,eol], length 0
18:43:15.582777 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187829296 ecr 0,sackOK,eol], length 0
18:43:16.622832 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187830296 ecr 0,sackOK,eol], length 0
18:43:17.654309 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187831296 ecr 0,sackOK,eol], length 0
18:43:19.691527 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187833296 ecr 0,sackOK,eol], length 0
18:43:23.741290 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187837296 ecr 0,sackOK,eol], length 0
18:43:31.761123 IP 61.148.243.67.9486 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187845296 ecr 0,sackOK,eol], length 0
18:43:48.062869 IP 61.148.243.67.9487 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,nop,wscale 5,nop,nop,TS val 1187861296 ecr 0,sackOK,eol], length 0
18:44:20.852129 IP 61.148.243.67.9705 > 7dgroup2.cslistener: Flags [S], seq 899103396, win 65535, options [mss 1400,sackOK,eol], length 0

After the iptables rules are created, we try to connect again. The server still captures the incoming SYN packets, but it never answers them, so the client keeps retransmitting the same SYN.

(Seeing this, I think anyone with a sense of security knows what the risk is, and the attack scenario immediately pops into mind.)

In a production environment, very few real application problems come from the TCP handshake being cut off at this level. It only happens when something goes wrong with half-open TCP connections.

ChaosBlade cannot simulate connection problems at the application level.

4. Packet loss simulation

4.1. Simulation command

[root@7dgroup2 chaosblade-0.2.0]# ./blade create network loss --interface eth0 --percent 50
{"code":200,"success":true,"result":"c29053229c16c839"}
[root@7dgroup2 chaosblade-0.2.0]#

4.2 Packet loss effect

(base) GaoLouMac:~ Zee$ ping 101.201.210.163
PING 101.201.210.163 (101.201.210.163): 56 data bytes
64 bytes from 101.201.210.163: icmp_seq=0 ttl=50 time=95.615 ms
64 bytes from 101.201.210.163: icmp_seq=1 ttl=50 time=78.823 ms
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
64 bytes from 101.201.210.163: icmp_seq=4 ttl=50 time=127.879 ms
64 bytes from 101.201.210.163: icmp_seq=5 ttl=50 time=123.282 ms
64 bytes from 101.201.210.163: icmp_seq=6 ttl=50 time=129.193 ms
Request timeout for icmp_seq 7
Request timeout for icmp_seq 8
64 bytes from 101.201.210.163: icmp_seq=9 ttl=50 time=123.712 ms
Request timeout for icmp_seq 10
64 bytes from 101.201.210.163: icmp_seq=11 ttl=50 time=36.746 ms
64 bytes from 101.201.210.163: icmp_seq=12 ttl=50 time=114.155 ms
Request timeout for icmp_seq 13
Request timeout for icmp_seq 14
64 bytes from 101.201.210.163: icmp_seq=15 ttl=50 time=91.469 ms
Request timeout for icmp_seq 16
64 bytes from 101.201.210.163: icmp_seq=17 ttl=50 time=56.911 ms
64 bytes from 101.201.210.163: icmp_seq=18 ttl=50 time=113.380 ms
Request timeout for icmp_seq 19

4.3 Code implementation

// addQdiscForLoss
func addQdiscForLoss(channel exec.Channel, ctx context.Context, netInterface string, percent string) *transport.Response {
  // invoke tc qdisc add dev ${networkPort} root handle 1: prio bands 4
  response := channel.Run(ctx, "tc", fmt.Sprintf(`qdisc add dev %s root handle 1: prio bands 4`, netInterface))
  if !response.Success {
    // invoke stop
    stopLossNetFunc(netInterface)
    bin.PrintErrAndExit(response.Err)
    return response
  }
  response = channel.Run(ctx, "tc", fmt.Sprintf(`qdisc add dev %s parent 1:4 handle 40: netem loss %s%%`, netInterface, percent))
  if !response.Success {
    // invoke stop
    stopLossNetFunc(netInterface)
    bin.PrintErrAndExit(response.Err)
    return response
  }
  return response
}

From the code above, you can see that ChaosBlade implements packet loss with Linux traffic control (tc), which works with queueing disciplines (qdiscs), classes, and filters. Specifically, it uses tc’s netem loss.
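Written out for a concrete interface, the two tc invocations from addQdiscForLoss look as follows. This sketch in Go only formats and prints the command lines; lossCommands and cleanupCommand are hypothetical names. The excerpt does not show the filters that steer traffic into band 4, so they are left out here, and the cleanup command is the standard tc way of removing a root qdisc rather than something taken from the ChaosBlade source.

```go
package main

import "fmt"

// lossCommands returns the tc command lines that attach a 4-band prio qdisc
// to dev and hang a netem loss qdisc off band 4, mirroring the excerpt above.
// Hypothetical helper for illustration.
func lossCommands(dev string, percent int) []string {
	return []string{
		fmt.Sprintf("tc qdisc add dev %s root handle 1: prio bands 4", dev),
		fmt.Sprintf("tc qdisc add dev %s parent 1:4 handle 40: netem loss %d%%", dev, percent),
	}
}

// cleanupCommand returns the command that removes the root qdisc from dev,
// which also removes the netem qdisc attached below it.
func cleanupCommand(dev string) string {
	return fmt.Sprintf("tc qdisc del dev %s root", dev)
}

func main() {
	for _, c := range lossCommands("eth0", 50) {
		fmt.Println(c)
	}
	fmt.Println(cleanupCommand("eth0"))
}
```

Running the printed commands requires root, which is why the sketch stops at printing them.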

5. Simulate network delay

5.1. Simulation command

[root@7dgroup2 chaosblade-0.2.0]# ./blade create network delay --interface eth0 --time 3000
{"code":200,"success":true,"result":"b9e568d93dcbb5cb"}
[root@7dgroup2 chaosblade-0.2.0]#

5.2. Simulation effect

(base) GaoLouMac:~ Zee$ telnet 101.201.210.163 9100
Trying 101.201.210.163...
// There is a three-second delay here
Connected to 101.201.210.163.
Escape character is '^]'.

5.3. Code implementation


func startDelayNet(netInterface, time, offset, localPort, remotePort, excludePort string) {
  ctx := context.Background()
  // assert localPort and remotePort
  if localPort == "" && remotePort == "" && excludePort == "" {
    response := channel.Run(ctx, "tc", fmt.Sprintf(`qdisc add dev %s root netem delay %sms %sms`, netInterface, time, offset))
    if !response.Success {
      bin.PrintErrAndExit(response.Err)
    }
    bin.PrintOutputAndExit(response.Result.(string))
    return
  }
  response := addQdiscForDelay(channel, ctx, netInterface, time, offset)
  if localPort == "" && remotePort == "" && excludePort != "" {
    response = addExcludePortFilterForDelay(excludePort, netInterface, response, channel, ctx)
    bin.PrintOutputAndExit(response.Result.(string))
    return
  }
  response = addLocalOrRemotePortForDelay(localPort, response, channel, ctx, netInterface, remotePort)
  bin.PrintOutputAndExit(response.Result.(string))
}

From the code above, you can see that ChaosBlade also implements network delay with Linux traffic control (tc), again through qdiscs, classes, and filters. Specifically, it uses tc’s netem delay.

In short, ChaosBlade simulates both packet loss and delay with tc.

Four, summary

ChaosBlade can really be thought of as a toolset that integrates various small utilities.

“Chaos engineering” is still a rather big hat for this tool to wear. To use it to simulate faults across thousands of nodes, you would still need integration, configuration, remote execution, and other tooling around it.

Let’s go back to the principles of chaos engineering listed above. Do these simulations meet those principles? If you have experience handling production incidents, you will know that simulations like these still differ from the logic behind high CPU and high IO in a real environment.

Usually, when we ask whether an application stays robust under high CPU, we mean one of two things:

Whether the program under test stays robust while other programs consume a lot of CPU.
Whether the program under test stays robust while its own code consumes a lot of CPU.

Those of you who have handled similar production issues will know that the first scenario is almost never seen, except where the deployment is poorly planned. That first scenario is the one ChaosBlade simulates. The second is still out of ChaosBlade’s reach.

But the second scenario is the focus of the testing process.

In English, “chaos” really just means disorder. That is a very different concept from the Chinese word it is usually translated into. Using that Chinese word as the translation really cheapens the meaning of the word itself.

What is chaos engineering, and how can it be put into practice at every level? The difficulty is not really in implementing the tools. The real questions are what logic the experiments follow and whether they describe production scenarios realistically.

So the core of it is scenario design.