When you were a kid, did you ever intentionally rip something apart in order to learn its inner workings? We’ve all done it. Today, we call this chaos engineering.

As developers, one of our main goals is to build software that is stable, secure, and as error-free as possible. To get there, we write unit and integration tests that detect unexpected behavior and ensure that the scenarios we cover do not lead to errors. However, today’s software architectures contain many components that unit and integration tests cannot fully cover, and the servers and components we overlook can still drag the entire system down when they fail.

You don’t choose the moment, the moment chooses you! You can only choose to be prepared for it. — Fire Chief Mike Burtch

In recent years, Netflix has been one of the driving forces behind chaos engineering and has done a great deal to promote its importance in distributed systems. Distributed systems researcher Kyle Kingsbury took a slightly different approach: with Jepsen, he probes the behavior of distributed databases, distributed queues, and other distributed systems to validate the promises their vendors make, and he has come to some sobering conclusions. You can find videos of his talks on the subject on YouTube.

In this article, I want to give a simple, practical introduction to chaos engineering.

Don’t underestimate the social aspect of chaos engineering, which is not just about destroying something, but about bringing the right people together to create stable and fault-tolerant software.

Whether we are building new software or extending existing software, we safeguard the implementation through various forms of testing. We often use a testing pyramid to indicate what kind of testing should be done and to what extent.

The testing pyramid illustrates a dilemma: the higher the level of testing, the more effort, time, and cost it takes.

We use unit tests to verify the expected behavior of the software. We test components individually, without their dependencies, and mock those dependencies to control their behavior. Such tests do not guarantee that the components are error-free: if the developer of a module makes a logical error while implementing the component, that error will usually show up in the test as well, even if the developer wrote the test first and the code afterwards. Extreme programming, in which developers alternate between writing tests and code, is one way to mitigate this problem.
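To make this concrete, here is a minimal sketch of such a unit test using JUnit 5 and Mockito (neither is prescribed by this article, and the `TaxRateProvider` and `PriceCalculator` types are purely illustrative). The dependency is mocked so the component can be tested in isolation:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

class PriceCalculatorTest {

    // Illustrative collaborator that would normally call a database or remote service.
    interface TaxRateProvider {
        double taxRateFor(String countryCode);
    }

    // Illustrative component under test.
    static class PriceCalculator {
        private final TaxRateProvider taxRates;

        PriceCalculator(TaxRateProvider taxRates) {
            this.taxRates = taxRates;
        }

        double grossPrice(double netPrice, String countryCode) {
            return netPrice * (1 + taxRates.taxRateFor(countryCode));
        }
    }

    @Test
    void addsTaxToTheNetPrice() {
        // The dependency is mocked so the component is tested in isolation.
        TaxRateProvider taxRates = mock(TaxRateProvider.class);
        when(taxRates.taxRateFor("DE")).thenReturn(0.19);

        PriceCalculator calculator = new PriceCalculator(taxRates);

        assertEquals(119.0, calculator.grossPrice(100.0, "DE"), 0.0001);
    }
}
```

Note that if the same developer misunderstands the tax rules, both the calculator and the test will encode the same misunderstanding, which is exactly the limitation described above.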

To give developers and stakeholders more free time and relaxed weekends with family and friends, we write integration tests on top of our unit tests. Integration tests are used to test the interaction between components. Ideally, they run automatically once the unit tests have passed and exercise the interdependent components together.
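As a sketch of what such an automated integration test might look like in a Spring Boot project (Spring Boot comes up again later in this article; the `/products/42` endpoint is a made-up example), the test boots the application context and exercises the HTTP layer together with the real components behind it:

```java
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

// Starts the full Spring application context instead of a single, mocked-out unit.
@SpringBootTest
@AutoConfigureMockMvc
class ProductApiIntegrationTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void productEndpointAnswersSuccessfully() throws Exception {
        // Exercises controller, service, and their wiring in one go.
        mockMvc.perform(get("/products/42"))
               .andExpect(status().isOk());
    }
}
```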

Thanks to high test coverage and automation, our application is in a very stable state. Still, I am sure everyone knows that uneasy feeling on the way to the finish line: our software only shows its true face in a production environment, because only under real conditions can we observe how each component actually behaves. The adoption of modern microservices architectures has undoubtedly made this uneasy feeling worse.

In the era of loosely coupled microservices, software architecture can be summed up with the term “distributed system.” The benefits of such systems are well understood: they can be deployed and scaled quickly. In most cases, however, the result is an architecture like this:

I like architecture diagrams because they give us a clear, abstract view of our software. However, they can also hide all kinds of nefarious traps and mistakes; they are particularly good at blurring the underlying infrastructure and hardware. In a real production environment, the following architecture diagram is closer to the status quo:

Because of firewall rules, the load balancer does not know all instances of the gateway or cannot reach them over the network. Several applications have crashed, but service discovery has not registered any failures. On top of that, the service discovery nodes are out of sync and return different results. Because service instances are missing, the load cannot be distributed and the remaining nodes come under ever greater pressure. And why does a 12-hour batch job have to run during the day, and why does it take 12 hours anyway?!

I’m sure you’ve had a similar experience. You probably know what it means to deal with faulty hardware, faulty virtualization, misconfigured firewalls, or tedious coordination within your company.

Sometimes people say things like, “There’s no chaos here, everything works the way it always does.” It may be hard to believe, but an entire industry makes its living selling ticket systems so that we can record and manage the chaos. Here is a movie line that describes our daily lives quite well:

Chaos is the engine that drives the development of the world.

APIs try to create a perfect world in which we get what we need by calling interfaces with defined inputs and outputs. We use countless APIs to avoid direct contact with the hell behind them (the back end or the API implementation). We build layer upon layer of abstraction, and in between, Hades plays his pranks with impunity; his job is to make sure that every API call is affected. Okay, I’m exaggerating, but you know what I’m trying to say.

Netflix shows where this can lead, so let’s take a quick look at their architecture to better understand the potential complexity of modern microservices architectures. The following images are from a presentation at QCon 2013 in New York:

What’s even more impressive is that Netflix’s architecture works well and copes with all kinds of failures. If you listen to Netflix developers speak, you’ll hear them say “no one knows exactly how it works or why it works.” That insight is what keeps chaos engineering alive at Netflix.

Before you start your first chaos experiment, make sure that your services already apply resilience patterns and are prepared to handle possible failures.

Chaos engineering does not create problems, it reveals them. — Nora Jones, Senior Chaos engineer at Netflix

As Netflix’s Nora Jones points out, chaos engineering is not about creating chaos, it’s about preventing it. So if you want to run your first chaos experiment, start small and make sure you are prepared.

There is no point in experimenting with chaos if your infrastructure, and especially your services, are not ready for it. With this in mind, here are a few ground rules before we dive into chaos engineering:

  1. Discuss chaos experiments with your colleagues ahead of time!

  2. If you know your chaos experiment will fail, don’t do it!

  3. Chaos should not come as a surprise; your goal is to verify a hypothesis.

  4. Chaos engineering helps you better understand your distributed system.

  5. Limit the blast radius of your chaos experiment.

  6. Control the situation during the chaos experiment!

In chaos engineering, you work through the five stages described below and stay in control of your experiment at all times. Start small and limit the experiment’s potential blast radius. Simply pulling a plug somewhere to see what happens has nothing to do with chaos engineering! Instead of creating uncontrolled chaos, we actively work to prevent it.

I highly recommend the PrinciplesOfChaos.org website and the free ebook “Chaos Engineering”.

It is best to define metrics that give you reliable information about the overall state of the system. These metrics must be monitored continuously during the chaos experiment; of course, you should keep an eye on them outside of experiments as well.

Metrics can be technical or business metrics, and I consider business metrics the more important of the two. During its chaos experiments, Netflix monitors how often users successfully start playing a video; this is their core metric and it comes straight from the business domain, because not being able to play a video directly affects customer satisfaction. If you run an online store, for example, the number of successful orders or of items placed in the shopping basket would be important business metrics.
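To make the idea of a business metric tangible, here is a minimal sketch of how such a counter could be exposed in a Spring Boot application using Micrometer. The article does not prescribe a metrics library; Micrometer, the metric name, and the `OrderMetrics` class are assumptions made purely for illustration:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {

    private final Counter successfulOrders;

    public OrderMetrics(MeterRegistry registry) {
        // Business metric: how many orders were placed successfully.
        this.successfulOrders = Counter.builder("shop.orders.successful")
                .description("Number of successfully placed orders")
                .register(registry);
    }

    // Call this from the order service whenever an order has been placed.
    public void orderPlaced() {
        successfulOrders.increment();
    }
}
```

During a chaos experiment you would watch this counter on a dashboard: if the rate of successful orders drops while the experiment is running, your hypothesis about the system’s resilience is in trouble.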

Work out in advance what should happen, then verify it experimentally. If your hypothesis turns out to be wrong, you must track down the error based on your findings and present it to the team or the company. It is not always easy to stay completely blameless! As a chaos engineer, your goal is to understand the behavior of the system and pass that knowledge on to the developers. That is why it is important to involve everyone in your experiment as early as possible.

What problems await us in the real world? Which new problems might arise, and which ones have already ruined our weekends in the past? In a controlled experiment, we have to think about these questions first.

Possible examples include:

  • A node in the Kafka cluster is faulty

  • Network packet loss

  • Hardware failure

  • The JVM maximum heap size (-Xmx) is set incorrectly

  • Increased latency

  • Incorrect responses

Feel free to add anything else to this list, as long as it is relevant to your architecture. Even if your application is not hosted by a reputable cloud provider but runs in your own company’s data center, there will be problems. I bet you have a few stories of your own!

Suppose we have several microservices that interact via REST APIs and use service discovery. As a safeguard, the product service maintains a local cache of the inventory data from the warehouse service: if the warehouse service does not respond within 500 milliseconds, the cached data should be used instead. In a Java environment we can implement this behavior with libraries such as Hystrix or Resilience4j, which make it very easy to implement fallbacks and other resilience patterns.
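A minimal sketch of this behavior, assuming Resilience4j’s TimeLimiter is used for the 500-millisecond timeout, could look like the following. The `WarehouseClient`, `InventoryCache`, and `Inventory` types are placeholders for whatever REST client and cache you actually use, so this is one possible implementation rather than the article’s exact setup:

```java
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public class ProductService {

    // Placeholder types; in a real project these would be a REST client and a cache.
    interface WarehouseClient { Inventory fetchInventory(String productId); }
    interface InventoryCache { Inventory get(String productId); void put(String productId, Inventory inventory); }
    static class Inventory {
        final String productId;
        final int quantity;
        Inventory(String productId, int quantity) {
            this.productId = productId;
            this.quantity = quantity;
        }
    }

    private final WarehouseClient warehouseClient;
    private final InventoryCache localCache;

    // Give up on the warehouse service after 500 ms and use the cache instead.
    private final TimeLimiter timeLimiter = TimeLimiter.of(
            TimeLimiterConfig.custom()
                    .timeoutDuration(Duration.ofMillis(500))
                    .build());

    public ProductService(WarehouseClient warehouseClient, InventoryCache localCache) {
        this.warehouseClient = warehouseClient;
        this.localCache = localCache;
    }

    public Inventory getInventory(String productId) {
        // Run the remote call asynchronously so the TimeLimiter can cut it off.
        Supplier<CompletableFuture<Inventory>> futureSupplier =
                () -> CompletableFuture.supplyAsync(() -> warehouseClient.fetchInventory(productId));

        Callable<Inventory> restrictedCall =
                TimeLimiter.decorateFutureSupplier(timeLimiter, futureSupplier);

        try {
            Inventory fresh = restrictedCall.call();
            localCache.put(productId, fresh); // keep the cache warm for the next fallback
            return fresh;
        } catch (Exception e) {
            // Fallback: serve the last known inventory from the local cache.
            return localCache.get(productId);
        }
    }
}
```

This fallback path is exactly what the experiment below puts under stress: if the cache cannot answer every request, the catch block itself becomes a source of errors.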

You will find the information you need for a successful experiment below.

Target: warehouse service

Type of experiment: increased latency

Hypothesis: Due to the increased latency when calling the warehouse service, 30% of requests are served from the product service’s local cache.

Blast radius: product service and warehouse service

State before: OK

State after: ERROR

Result: The product service’s fallback fails and throws an exception because the cache cannot answer all requests.

As the result shows, we used resilience patterns and still ran into errors. These errors must be fixed and then verified by running the experiment again.

Chaos never takes a break, and your system changes constantly: new releases go into production, hardware is updated, firewall rules are tweaked, servers are restarted. The best thing you can do is establish a culture of chaos engineering in your company and anchor it in every employee’s mind. Netflix does this by letting the Simian Army (Netflix’s open-source suite of tools, which includes Chaos Monkey) “trash” the developers’ services in the production environment. They later decided to run it only during working hours, but they kept running it.

For your first chaos experiment, choose an environment that is as close to production as possible. In a quiet test environment that has no connection to real production conditions, you will not get meaningful results. Once you have gathered your first results and acted on them, you can move on and experiment in production itself. The goal of chaos engineering is to operate in the production environment while always staying in control of the situation, so that users are not affected.

At first, it was hard for me to apply and explore the idea of chaos engineering in my daily work. I am not a senior chaos engineer at Netflix, Google, Facebook, or Uber, and my clients are only just starting to adopt microservices. But I can almost always find Spring Boot, sometimes as standalone applications, sometimes deployed in Docker containers. The demos I use at tech conferences always include at least one Spring Boot application. This led to Chaos Monkey for Spring Boot, which can attack existing Spring Boot applications without any code changes.

For all the information about how Chaos Monkey for Spring Boot can help you achieve a stable Spring Boot infrastructure, visit the project on GitHub.
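As a rough sketch of how an assault could be configured, the following application properties are based on the project’s documentation; the exact property names and defaults may differ between releases, so check GitHub for the options of your version. After adding the `chaos-monkey-spring-boot` dependency, Chaos Monkey is activated via its Spring profile and told which beans to watch and which assaults to run:

```properties
# Activate Chaos Monkey for Spring Boot via its Spring profile.
spring.profiles.active=chaos-monkey
chaos.monkey.enabled=true

# Watch @Service beans (controllers, repositories, etc. can be enabled as well).
chaos.monkey.watcher.service=true

# Latency assault: attack roughly every 3rd request with a delay of 1-3 seconds.
chaos.monkey.assaults.level=3
chaos.monkey.assaults.latencyActive=true
chaos.monkey.assaults.latencyRangeStart=1000
chaos.monkey.assaults.latencyRangeEnd=3000
```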

I hope this article has helped you understand the ideas and principles behind chaos engineering. Chaos engineering matters even if we don’t work at Netflix, and there is still plenty of work to be done. I love my job, but more importantly, I love my life with my family and friends. Rather than spending countless nights and weekends fixing serious outages, chaos engineering gives us confidence that our systems can cope with the harsh conditions of a production environment.

Related links:

Jepsen: https://jepsen.io

Kyle Kingsbury talk (video): https://www.youtube.com/watch?v=eSaFVX4izsQ

Chaos Engineering ebook: https://www.oreilly.com/webops-perf/free/chaos-engineering.csp

Hystrix: https://github.com/Netflix/Hystrix

Resilience4j: https://github.com/resilience4j/resilience4j

Chaos Monkey for Spring Boot: https://codecentric.github.io/chaos-monkey-spring-boot

Original English article: https://blog.codecentric.de/en/2018/07/chaos-engineering/