Gartner notes that infrastructure and operations (I&O) leaders responsible for DevOps must overcome their fear of chaos engineering in order to meet system reliability goals. Chaos engineering can not only be implemented safely; it also lets organizations learn by doing, laying a solid foundation for deepening system reliability and for understanding cascading dependencies across systems.

1. The Core Challenge

In the complex environment of modern applications, I&O leaders under pressure to deliver reliably worry that chaos engineering is too dangerous and disruptive, and that they will not be able to cope with its impact.

As delivery speeds up, platforms grow more diverse, and distributed applications depend on one another, system complexity becomes increasingly difficult to identify.

In the field of reliability there is no comprehensive, systematic body of knowledge; much of it still lives only in the practical experience of individual experts.

I&O leaders responsible for DevOps must:

  • Promote chaos engineering as an organizational capability and treat it as a regular product team activity.
  • Require teams to practice chaos engineering safely in pre-production environments using a “test first” approach.
  • Urge teams to use tools and practices that introduce fault injection testing across the technology stack.

2. Chaos Theory

Chaos theory originates from the mathematical study of weather systems.

In 1972, Edward Norton Lorenz, widely regarded as the father of chaos theory, presented a paper with the now-famous title:

“Predictability: Does the Flap of a Butterfly’s Wings in Brazil Set Off a Tornado in Texas?”

This is colloquially known as the “butterfly effect,” in which even a small change in the environment can lead to a dramatic change in the future state.

Chaos theory contains the paradox that complex things can be understood, but understandable things are not always predictable.

Applied to systems, chaos theory holds that small changes in a system’s initial state can lead to large differences in its behavior later on.
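To make this concrete, the short Python sketch below (an illustration added here, not part of the original report) iterates the logistic map, a classic chaotic system, from two starting values that differ by one part in a hundred million; the trajectories soon diverge completely.

```python
# Minimal illustration of sensitivity to initial conditions using the logistic map:
# x_{n+1} = r * x_n * (1 - x_n), with r = 4.0 (the chaotic regime).
def logistic_trajectory(x0, r=4.0, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.20000000)
b = logistic_trajectory(0.20000001)  # initial state differs by 1e-8

for n in (0, 10, 25, 50):
    print(f"step {n:2d}: |a - b| = {abs(a[n] - b[n]):.2e}")
# The gap grows from ~1e-8 to order 1, even though the rule is simple and
# fully deterministic: understandable, yet practically unpredictable.
```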

3. Recommendations

Gartner predicts that by 2024, more than 50 percent of large enterprises will use chaos engineering practices to enhance their digital capabilities and achieve higher availability (99.999 percent).

3.1 Reliability is the key to digital services

According to Gartner’s 2019 CIO Survey, the top three areas of increased investment in 2019 are digital innovation (22%), revenue/business growth (22%), and operational excellence (13%). But frequent system outages undermine each of these, striking at the CIO’s top priorities.

In addition, Gartner’s 2019 DevOps study found that “improving system reliability” ranked third among the top goals for establishing DevOps practices, just behind “reducing system defects” in second place.

Many reliability improvement efforts focus on incident response and failure recovery, but this does not prevent incidents that damage the organization’s brand and trust from occurring in the first place.

How, then, can I&O leaders move beyond this cycle of incidents and after-the-fact improvement initiatives to drive change?

Leaders must guide teams to intentionally introduce failures into pre-production systems, guided by steady-state hypotheses. This is also a catalyst for organizational learning: the knowledge and skills gained in this way help the organization better manage and mitigate systemic risk.

With a “test first” approach, teams practice chaos engineering in pre-production, extract new knowledge from the experiments, and then apply it to production step by step to improve production stability.
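A minimal sketch of this “test first” loop is shown below. It assumes a hypothetical pre-production health endpoint (https://staging.example.com/health) and caller-supplied fault-injection and rollback hooks; it is an illustration of the pattern, not a prescribed implementation.

```python
# Sketch of a "test first" chaos experiment in pre-production
# (hypothetical endpoint and fault hooks; adapt to your own tooling).
import requests

STEADY_STATE_URL = "https://staging.example.com/health"   # assumed endpoint


def steady_state_ok(timeout=2.0):
    """Steady-state hypothesis: the health endpoint answers 200 within 2 seconds."""
    try:
        return requests.get(STEADY_STATE_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def run_experiment(inject_fault, rollback):
    """Verify steady state, inject a fault, re-check the hypothesis, always roll back."""
    assert steady_state_ok(), "System is not steady before the experiment"
    try:
        inject_fault()            # e.g. kill a replica, add latency, drop a dependency
        hypothesis_held = steady_state_ok()
    finally:
        rollback()                # always restore the pre-production environment
    return hypothesis_held
```

Knowledge gained from runs like this can then be promoted gradually toward production experiments.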

Reliability engineering can also be viewed as a form of risk management; practicing chaos engineering is, in effect, embracing risk management.

3.2 Make Chaos Engineering a regular team practice

It is easy to see why many I&O leaders feel uneasy under the pressure of the pace at which digital business features are released.

New releases may introduce new production defects and stability issues, leading to missed SLAs and declining customer satisfaction. To keep up with today’s frequent release cycles, we must proactively reduce downtime rather than rely solely on incident response practices.

The system reliability curve for the traditional release model shows that a product’s life cycle typically has three phases.

Defect counts are high at product release. As the number of defects decreases over time, the product enters its useful-life phase and defects stabilize at a lower level. At the end of the curve the product wears out and defects reappear. Software also “wears out” when we stop maintaining it, or when technical debt and complexity erode its optimal operating configuration and environment.

However, when we start releasing in frequent incremental iterations of features, we tend to see:

  • Even as software testing becomes more efficient and frequent, the size, complexity, and technical debt of the software keep growing.
  • Although each increment is smaller than a traditional release, small and frequent changes can increase potential problems in the non-functional areas of the application.
  • As the software gains functionality, users depend on it more and derive more business value from it; system reliability correspondingly becomes a key driver of user satisfaction.

The reliability curve for the iterative release pattern shows the potential impact of frequent changes on the team and the product.

To prevent this curve from becoming a reality, we should actively promote chaos engineering and improve product reliability with agile management, DevOps, and Site Reliability Engineering (SRE) practices.

During the pre-production phase, we can inject new failure scenarios into each incremental release of the product. When chaos engineering becomes part of the product life cycle, it continuously generates new knowledge and improves system reliability.

Like many practices, chaos engineering takes time for teams to become familiar with and master, so increasing the frequency of Gamedays and chaos experiments is crucial for teams that are new to the chaos engineering practice.

I&O leaders must treat chaos engineering as a regular team practice, not as a one-time activity.

Reuse and sharing help I&O leaders promote chaos engineering.

Often, teams’ products are built on similar platforms and services. This means chaos experiment plans that have already been tried can be reused by other teams and individuals. Reusing other internal teams’ pre-production practices greatly reduces the fear of chaos engineering.

In addition, experience can be shared more widely through the community, including examples of failure, such as cases where a chaos experiment affected production systems. Of course, when publishing chaos experiment plans through shared repositories and version control, proper access permissions must be set.

3.3 Safe implementation of chaos engineering in pre-production

Using chaos engineering in pre-production accomplishes two key things.

  • First, it lets teams safely master chaos engineering practices in a “non-intrusive”, “test first” manner.
  • Second, the knowledge gained from chaos engineering practice can be applied to normal operations and to hardening production.

Only after the team has deepened its knowledge through chaos engineering practice and removed enough failure points from production systems should it consider migrating or reusing these chaos experiment plans in production.

It is worth noting that many organizations have difficulty creating and maintaining a “production-like” test environment.

Chaos engineering, like other practices in information technology, takes time to learn, adopt, and master. It can be mastered through active collaboration among business, architecture, application development, operations, and security teams.

Creating a chaos experiment plan can be broken down into five simple steps.

1) Brainstorm potential service failure points

Work with the product owner to prioritize reliability issues. Also invite the team members who build the features, test them, or use the service. Drawing on their view of the system and their experience with it, they brainstorm potential areas of failure and how the whole system and its components will behave under stress. Encourage thinking from multiple perspectives to get as much out of the session as possible.

2) Set steady-state hypotheses

Collaborate across teams to collate and create hypotheses. Determine how the system will fail, which components will be affected, how the user experience will degrade, how this will be measured, and how the team will restore the service. These hypotheses are the basis of the chaos experiment plan.
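One way to make such a hypothesis concrete and measurable is sketched below; the metric names and threshold values are illustrative assumptions, not figures from the report.

```python
# Illustrative steady-state hypothesis expressed as measurable thresholds.
from dataclasses import dataclass


@dataclass
class SteadyStateHypothesis:
    max_error_rate: float       # e.g. no more than 1% of requests fail
    max_p95_latency_ms: float   # e.g. 95th percentile latency under 300 ms
    max_recovery_s: float       # e.g. service recovers within 120 seconds

    def holds(self, error_rate, p95_latency_ms, recovery_s):
        """Return True if the observed behavior stays within every threshold."""
        return (error_rate <= self.max_error_rate
                and p95_latency_ms <= self.max_p95_latency_ms
                and recovery_s <= self.max_recovery_s)


# Hypothetical hypothesis for a checkout service, checked against observed values.
checkout_hypothesis = SteadyStateHypothesis(0.01, 300.0, 120.0)
print(checkout_hypothesis.holds(error_rate=0.004, p95_latency_ms=280, recovery_s=95))
```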

3) Conduct experiments on the system

The “chaos” in chaos engineering refers to the state of the system, not to how the experiments are run. Mature organizations even use random fault injection in production. Of course, this level of maturity is reached only after accumulating time and experience, successfully completing many chaos experiment plans, and building a good understanding of system behavior.

An example fault injection plan helps make these experiments concrete.
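The sketch below is a hypothetical illustration of such a plan; the component names, faults, and hypotheses are assumptions for illustration, not taken from the original report’s figure.

```python
# Hypothetical fault injection plan (illustrative only).
# Each entry pairs a fault with the steady-state hypothesis it tests.
fault_injection_plan = [
    {
        "target": "orders-service",            # assumed component name
        "fault": "terminate one replica",
        "hypothesis": "remaining replicas absorb the load; no 5xx responses",
        "blast_radius": "pre-production only",
        "rollback": "orchestrator restarts the replica automatically",
    },
    {
        "target": "payments-gateway dependency",
        "fault": "add 500 ms latency to outbound calls",
        "hypothesis": "checkout p95 latency stays under 1 s; timeouts are retried",
        "blast_radius": "pre-production only",
        "rollback": "remove the injected latency rule",
    },
]

for step in fault_injection_plan:
    print(f"{step['target']}: {step['fault']} -> expect: {step['hypothesis']}")
```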

4) Observe system behavior

Record the system’s behavior and characteristics (functionality, availability, performance), SLAs, service-level indicators (SLIs), service-level objectives (SLOs), mean time to detect (MTTD), and mean time to repair (MTTR).
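As a small sketch of how MTTD and MTTR might be derived from timestamps recorded during experiments (the records and field names are assumptions; one common convention is used for each metric):

```python
# Derive MTTD / MTTR from incident timestamps recorded during chaos experiments.
from datetime import datetime

# Hypothetical records: when each fault started, was detected, and was repaired.
incidents = [
    {"start": "2019-10-01T10:00:00", "detected": "2019-10-01T10:04:00", "repaired": "2019-10-01T10:19:00"},
    {"start": "2019-10-08T14:30:00", "detected": "2019-10-08T14:31:30", "repaired": "2019-10-08T14:52:00"},
]

FMT = "%Y-%m-%dT%H:%M:%S"


def minutes_between(later, earlier):
    return (datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)).total_seconds() / 60


# MTTD: fault start to detection. MTTR: detection to restoration
# (some teams measure MTTR from fault start instead).
mttd = sum(minutes_between(i["detected"], i["start"]) for i in incidents) / len(incidents)
mttr = sum(minutes_between(i["repaired"], i["detected"]) for i in incidents) / len(incidents)
print(f"MTTD = {mttd:.1f} min, MTTR = {mttr:.1f} min")
```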

Gartner’s 2019 Software Quality Tools and Practices Survey found that only 36% of respondents reported that their organization measures MTTD and only 39% measure MTTR to assess application delivery performance.

Chaos engineering can also help measure how effectively teams promote organizational capability and learning. For example, how many team members contributed relevant knowledge? Is the knowledge created or recorded new, previously isolated, or previously unknown?

In addition, users can take part in executing chaos experiment plans. Measure the impact on downstream systems, even after service is restored. The key point is that, when practicing chaos engineering, new knowledge and value come from many areas, not just from rehearsing the steps of service recovery.

5) Analyze, learn, and improve

The results of experiments and observations must be given to the product owner and prioritized in the product or platform backlog. What is the next step: harden, or simplify? Note that this learning and these improvements may affect the work of other teams. If your organization is new to chaos engineering and your team is the first to practice it, consider using a chaos engineering community of practice to share your chaos experiment plans with others.

3.4 Push the team to break everything

Some teams struggle with go/no-go decisions because of the complexity of their systems. This is analysis paralysis, and if it happens frequently it needs to be addressed.

Instead of endless “if/then” conversations, solve the problem by practicing chaos engineering in pre-production, and then move on.

While you may be the one enabling the pre-production implementation, recognize that the value comes from the comprehensiveness of the chaos experiment plans, which help you gain new knowledge and the confidence to make decisions.

The technologies in a system often come from different points in time, different owners, and teams with different capabilities.

1) Legacy systems have been running for decades and are still the center of gravity of operations.

While these systems can be trusted, the original designers and developers have likely moved to other teams or left. Injecting faults into these systems helps us retain knowledge that is about to be lost or forgotten, or realize that we have knowledge gaps requiring substantial research. Chaos experiments can surface that forgotten knowledge and help the team learn.

2) SaaS products may also be dependencies of the environment.

Service virtualization may or may not be available in pre-production; either way, create chaos experiment plans that include SaaS dependencies and observe how their failures negatively affect the end-to-end environment.
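One lightweight way to approximate a SaaS outage is sketched below; the vendor host name is an assumption, and real experiments might instead use service virtualization or network-level rules.

```python
# Sketch: simulate an outage of a SaaS dependency in pre-production by wrapping
# outbound calls (hypothetical host name).
import requests

SAAS_HOST = "api.saas-vendor.example.com"   # assumed SaaS dependency
SIMULATE_OUTAGE = True


def call_dependency(url, **kwargs):
    """Route calls to the SaaS vendor through this wrapper during the experiment."""
    if SIMULATE_OUTAGE and SAAS_HOST in url:
        # Behave as if the vendor is unreachable so downstream handling is exercised.
        raise requests.exceptions.ConnectTimeout(f"chaos: simulated outage of {SAAS_HOST}")
    return requests.get(url, **kwargs)

# Observe whether the application's timeouts, retries, and fallbacks
# behave as hypothesized while the simulated outage is active.
```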

3) System security is also a fault injection point.

Investigate system behavior by changing account passwords or permissions, or even deleting a service account entirely. Some service accounts may be used in areas the team does not know about; chaos experiments can reveal those effects. We can also change the ownership of a file or directory, or restart a service under the wrong account.
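A minimal sketch of one such security-oriented fault follows: temporarily revoke read access to a configuration file and observe how the service reacts. The file path is an assumption, and this should run only in pre-production with a guaranteed rollback.

```python
# Sketch: revoke read access to a config file, observe, then restore permissions.
import os
import stat
import time

CONFIG_PATH = "/etc/myservice/app.conf"      # assumed path, pre-production only

original_mode = stat.S_IMODE(os.stat(CONFIG_PATH).st_mode)
try:
    os.chmod(CONFIG_PATH, 0o000)             # inject: nobody can read the file
    time.sleep(300)                          # observation window: does the service
                                             # fail, log clearly, or use cached config?
finally:
    os.chmod(CONFIG_PATH, original_mode)     # rollback to the original permissions
```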

4) Stop services and delete data.

Although it is unlikely that a database will be lost outright, lost connections are not uncommon. What happens if the application is driven by feature flags and those flags are completely removed? While reviewing your application’s behavior, also stop commonly used services: consider stopping containers, virtual machines, and services such as SSH.
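As one possible way to stop a dependency and then restore it, the sketch below uses the Docker SDK for Python; the container name is an assumption, and the same idea applies to virtual machines or system services.

```python
# Sketch: stop a running container in pre-production, observe how dependents
# degrade, then restore it.
import time
import docker

client = docker.from_env()
container = client.containers.get("orders-db")   # assumed container name

try:
    container.stop()          # inject: the database container disappears
    time.sleep(120)           # observe: connection handling, retries, feature-flag
                              # driven code paths, user-facing error messages
finally:
    container.start()         # rollback: bring the dependency back
```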

5) The network should also be part of the chaos experiment plan.

Add a non-existent DNS entry, or remove entries from iptables or the service registry; disable protocols and ports; drop network packets; or open excessive network connections.
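One way to inject packet loss and latency on a pre-production host is with the Linux tc/netem tooling, as sketched below; the interface name is an assumption, and root privileges plus the iproute2 tools are required.

```python
# Sketch: inject 5% packet loss and 100 ms latency with tc/netem, then remove it.
import subprocess
import time

IFACE = "eth0"   # assumed network interface


def run(cmd):
    subprocess.run(cmd, check=True)


try:
    # Inject: 5% packet loss and 100 ms extra latency on all egress traffic.
    run(["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "loss", "5%", "delay", "100ms"])
    time.sleep(180)   # observation window
finally:
    # Rollback: remove the netem qdisc so the interface returns to normal.
    run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"])
```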

6) Fault injection should also target unknown systems and dependencies.

When refactoring or reorganizing services or applications, teams may realize they have gaps in their understanding of some technologies and can use chaos engineering to reveal what they do and do not know. A learning plan can then be built around what they do not know.

3.5 Tools and Community

Embedding chaos engineering into DevOps practices and tool orchestration helps teams and products grow. Introduce the approach safely by actively engaging in internal and external collaboration, leveraging vendor expertise, or participating in the communities behind open source projects.

Many tools, both commercial and open source, are available to help with the chaos engineering practice. The following table shows some examples of tools in this area:

Name                  Open source or commercial   Website
Byteman               Open source                 https://byteman.jboss.org
Chaos Monkey          Open source                 https://github.com/Netflix/ch…
ChaosIQ               Commercial                  https://www.chaosiq.io
Gremlin               Commercial                  https://www.gremlin.com
Jepsen                Open source                 http://jepsen.io
Mangle                Open source                 https://github.com/vmware/mangle
Simian Army           Open source                 https://github.com/Netflix/Si…
Spinnaker             Open source                 https://github.com/spinnaker
Verica / ChaoSlingr   Open source                 https://www.verica.io

Recommended reading

Scheibmeir, Jim; Spafford, George; Bhat, Manjunath. “How to Safely Begin Chaos Engineering to Improve Reliability.” Gartner, November 4, 2019.

Curry, David M. “Practical Application of Chaos Theory to Systems Engineering.” Procedia Computer Science, 2012.

“Edward Lorenz, Father of Chaos Theory and Butterfly Effect, Dies at 90.” MIT News, April 2008.

Encyclopedia Britannica editors. “Chaos Theory.” April 2019.

Ostrom, Lee T., and Wilhelmsen, Cheryl A. “Risk Assessment: Tools, Techniques, and Their Applications.” Wiley, 2012.

“Netflix Uncages Chaos Monkey Disaster Testing System.” Network World, July 2012.

Source: Chaos Engineering Practice

By Cloudmamba/Huang Mao
