Apache Kafka is an open-source distributed messaging platform that delivers data with high throughput and low latency. At scale, it can process trillions of records per day while also providing fault tolerance, replication, and automatic disaster recovery. Although Kafka has many use cases, the most common is as a message broker between applications. Kafka can receive, process, and redistribute messages from multiple upstream sources to multiple downstream consumers without requiring you to reconfigure your applications. This lets you stream large volumes of data while keeping your applications loosely coupled, enabling scenarios such as distributed computing, logging and monitoring, website activity tracking, and communication between Internet of Things (IoT) devices.

Because Kafka provides a critical pipeline between applications, its reliability is paramount. We need to plan to mitigate several possible failure modes, including:

  • Broker outages and other abnormal cluster conditions
  • Failures of Apache ZooKeeper, a key dependency of Kafka
  • Failures in upstream and downstream applications

Instead of waiting for these failures to occur in pre-production or in production, we can proactively test for them with chaos engineering and develop appropriate strategies to mitigate their impact. In this article, we demonstrate how chaos engineering can help improve the reliability of Kafka deployments. To do this, we will use Gremlin, an enterprise SaaS chaos engineering platform, to create and run four chaos experiments. Read on to learn the different ways Kafka clusters can fail, how to design chaos experiments to test those failure modes, and how to use your observations to improve their reliability.

In this article, we run our chaos experiments on the Confluent Platform, the enterprise event-streaming platform from Kafka's original creators. The Confluent Platform builds on Kafka and adds enterprise features such as a web-based GUI, comprehensive security controls, and easy deployment of multi-region clusters. However, the experiments in this article will work on any Kafka cluster.

Overview of the Apache Kafka architecture

To understand how Kafka benefits from chaos engineering, we should first examine the architecture of Kafka.

Kafka uses a publisher/subscriber (or pub/sub) messaging model to transfer data. Upstream applications (called publishers or producers in Kafka) generate messages that are sent to Kafka servers (called brokers). Downstream applications (called subscribers or consumers in Kafka) can then retrieve these messages from the brokers. Messages are organized into categories called topics, and consumers can subscribe to one or more topics to consume their messages. By acting as a middleman between producers and consumers, Kafka enables us to manage upstream and downstream applications independently of each other.

Kafka subdivides each topic into multiple partitions. Partitions can be duplicated across multiple brokers to provide replication. This also allows multiple consumers (more specifically, consumer groups) to process a topic simultaneously. To prevent conflicts from multiple processes writing to a single partition, each partition has one broker that acts as the leader and zero or more brokers that act as followers. New messages are written to the leader, and the followers replicate them from it. A follower that has fully replicated the leader is called an in-sync replica (ISR).
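The leader/follower relationship can be sketched with a small, self-contained Python model. This is purely illustrative (the `Partition` class and offset bookkeeping are invented for this example, not Kafka's actual implementation): a follower joins the ISR once it has caught up with the leader's log.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """Toy model of a Kafka partition: one leader and several followers,
    each tracked by the last offset it has replicated."""
    leader_offset: int = 0
    follower_offsets: dict = field(default_factory=dict)  # broker id -> replicated offset

    def in_sync_replicas(self, max_lag: int = 0) -> list:
        """A follower counts as 'in sync' (ISR) once its replicated offset
        is within max_lag of the leader's latest offset."""
        return [broker for broker, offset in self.follower_offsets.items()
                if self.leader_offset - offset <= max_lag]

# broker-2 has fully replicated the leader; broker-3 is 5 messages behind.
p = Partition(leader_offset=100, follower_offsets={"broker-2": 100, "broker-3": 95})
print(p.in_sync_replicas())  # ['broker-2']
```

When the leader fails, Kafka elects a new leader from the ISR, which is why follower lag matters so much in the experiments below.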

This process is coordinated by Apache ZooKeeper, which manages metadata about the Kafka cluster, such as which partitions are assigned to which brokers. ZooKeeper is a required dependency of Kafka (editor's note: Kafka 2.8 introduced the option to run without ZooKeeper), but it runs as a completely separate service on its own cluster. Improving the reliability of a Kafka cluster necessarily involves improving the reliability of its associated ZooKeeper cluster.

There are other components to the Kafka and Confluent platforms, but these are the most important considerations when it comes to improving reliability. We’ll explain the other components in more detail when we look at them in this article.

Why chaos engineer Kafka?

Chaos engineering is the practice of proactively testing how systems fail in order to make them more resilient. By injecting small, controlled failures into a system, we can observe the impact and fix the problems we find. This allows us to resolve issues before they affect users, while also teaching us more about how the system behaves under various conditions.

Numerous configuration options, flexible ways of deploying producers and consumers, and many other factors make it difficult to manage and maintain a distributed system like Kafka efficiently. It is not enough to keep our broker and ZooKeeper nodes from failing; we need to consider the subtler, harder-to-predict problems that can occur in applications, replicas, and other infrastructure components. These can affect an entire deployment in unexpected ways and, if they occur in production, may require significant troubleshooting effort.

With Chaos Engineering, we can proactively test these types of failures and resolve them before they are deployed to production, thereby reducing the risk of downtime and emergencies.

Run chaos experiments on Kafka

In this section, we will step through the deployment and execution of four different chaos experiments on the Confluent Platform. A chaos experiment is a planned process of injecting faults into a system to understand how it responds. Before running any experiments on a system, you should fully think through and design the experiments you intend to run.

When creating an experiment:

  • The first step is to form a hypothesis: the question you want to answer and the expected outcome. For example, if the experiment is to test the ability to withstand a broker outage, the hypothesis might state: "If a broker node fails, messages are automatically routed to other brokers without loss of data."
  • The second step is to define the blast radius: the infrastructure components affected by the experiment. Reducing the blast radius limits the potential damage the experiment could do to your infrastructure, while also letting you focus on specific systems. We strongly recommend starting with the smallest possible blast radius and increasing it as your confidence in running chaos experiments grows. You should also define the magnitude, which is the size or intensity of the injected attack. For example, a low-magnitude experiment might add 20 milliseconds of latency to network traffic between producers and brokers. A high-magnitude experiment might add 500 milliseconds of latency, which would have a noticeable impact on performance and throughput. As with the blast radius, start low and increase gradually.
  • Third, monitor your infrastructure. Identify which metrics will help you draw conclusions about your hypothesis, take measurements before the test to establish a baseline, and record those metrics throughout the test so that you can watch for both expected and unexpected changes. With the Confluent Platform, we can use Control Center to visually observe cluster performance in real time from a web browser.
  • Step four: run the experiment. Gremlin lets you run experiments on your applications and infrastructure in a simple, safe, and reliable way. We do this by running attacks, which provide a variety of ways to inject failures into a system. We also define abort conditions: the conditions under which we should stop the test to avoid unintended damage. With Gremlin, we can define Status Checks as part of a Scenario. A Status Check verifies the state of a service while an attack runs; if the infrastructure is unhealthy and the Status Check fails, the experiment is automatically stopped. In addition, we can use the built-in halt button to stop the experiment immediately.
  • Step five: draw conclusions from your observations. Do they confirm or refute the original hypothesis? Use the results you collected to improve your infrastructure, then design new experiments around those improvements. Repeating this process over time helps make Kafka deployments more resilient. The experiments presented in this article are by no means exhaustive; use them as a starting point for experimenting on your own systems. Keep in mind that although we are running these experiments on the Confluent Platform, they can be performed on any Kafka cluster.

Note that we are using Confluent Platform 5.5.0, built on Kafka 2.5.0. Screenshots and configuration details may vary by version.

Experiment 1: The impact of Broker load on processing latency

Resource utilization can have a significant impact on message throughput. If a broker is experiencing high CPU, memory, or disk I/O utilization, its ability to process messages will be limited. Because Kafka's efficiency depends on its slowest component, delays can cascade throughout the pipeline and lead to failure conditions such as backed-up producers and replication lag. High load can also affect cluster operations such as broker health checks, partition reassignments, and leader elections, putting the entire cluster into an unhealthy state.

The two most important metrics to watch when tuning Kafka are network latency and disk I/O. Brokers constantly read and write data to local storage, and as message rates and cluster size increase, bandwidth usage can become a limiting factor. When sizing a cluster, we should determine the point at which resource utilization begins to harm performance and stability.

To determine this, we will run a chaos experiment that gradually increases disk I/O utilization on the brokers while we observe its impact on throughput. While running this experiment, we will use the Kafka Music demo application to send a continuous stream of data. The application sends messages to multiple topics spread across all three brokers and uses Kafka Streams to aggregate and process them.

Generating broker load using the IO Gremlin

In this experiment, we will use the IO Gremlin to generate a large number of disk I/O requests on the broker nodes. We will create a Scenario that incrementally increases the intensity of the attack over four stages. Each attack runs for three minutes with one minute in between, so we can easily correlate changes in I/O utilization with changes in throughput.

In addition, we will create a Status Check that uses the Kafka Monitoring API to check the health of the brokers between stages. A Status Check sends an automated HTTP request from Gremlin to an endpoint of our choice, in this case our cluster's REST API server. We will use the topics endpoint to retrieve the state of the brokers and parse the JSON response to determine whether they are currently in sync. If any brokers fall out of sync, we immediately stop the experiment and mark it as a failure. We will also use Confluent Control Center to monitor throughput and latency while the Scenario runs.
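The sync check described above can be sketched in a few lines of Python. The response shape here is illustrative (field names like `partitions`, `replicas`, and `isr` mirror common Kafka REST responses, but your API's exact schema may differ): the check fails as soon as any partition's ISR list is smaller than its full replica set.

```python
import json

def brokers_in_sync(topics_json: str) -> bool:
    """Parse a (hypothetical) topic-status JSON response and report whether
    every partition's in-sync replica list matches its full replica set."""
    topics = json.loads(topics_json)
    for topic in topics:
        for partition in topic["partitions"]:
            if len(partition["isr"]) < len(partition["replicas"]):
                return False  # at least one under-replicated partition
    return True

# A healthy response: all three replicas of partition 0 are in the ISR.
sample = json.dumps([{"topic": "play-events",
                      "partitions": [{"partition": 0,
                                      "replicas": [1, 2, 3],
                                      "isr": [1, 2, 3]}]}])
print(brokers_in_sync(sample))  # True
```

A Status Check wired to a function like this would let the Scenario continue only while the cluster stays fully replicated.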

  • Hypothesis: An increase in disk I/O leads to a corresponding decline in throughput.
  • Result: The attack had no significant impact on throughput or latency, even with disk I/O increased to more than 150 MB/s. Both metrics remained stable, none of our brokers went out of sync or became under-replicated, and no messages were lost or corrupted.

For now, this leaves us plenty of headroom, but our throughput requirements are likely to grow as the scope of our applications expands. We should pay close attention to disk I/O utilization to ensure we have enough room to scale. If you start to notice disk I/O increasing and throughput decreasing, consider:

  • Using faster storage devices, such as higher-RPM disks or solid-state storage
  • Using a more efficient compression algorithm, such as Snappy or LZ4
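Compression is enabled entirely through producer configuration. A minimal sketch (the values are illustrative starting points, not tuned recommendations):

```properties
# Producer settings for compression (illustrative values)
compression.type=lz4     # cheaper CPU cost than gzip, good throughput
linger.ms=5              # small batching window so batches compress better
batch.size=32768         # larger batches generally yield better ratios
```

Because compression operates per batch, `linger.ms` and `batch.size` directly affect the compression ratio you actually achieve.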

Experiment 2: Risk of data loss due to dropped messages

To ensure successful message delivery, producers and brokers use an acknowledgement mechanism. When a broker commits a message to its local log, it acknowledges to the producer that the message was received successfully, and the producer can send the next message. This reduces the risk of losing messages between producers and brokers, but it does not prevent message loss between brokers.

For example, suppose a lead broker has just received a message from a producer and sent the acknowledgement. Each of the broker's followers should now fetch the message and append it to its own local log. However, the broker fails unexpectedly before any of its followers can fetch the latest message. None of the followers knows that the producer sent the message, but the producer has already received the acknowledgement, so it has moved on to the next message. Unless we can recover the failed broker or find some other way to resend the message, the message is effectively lost.

How do we determine the risk of this happening in our cluster? With chaos engineering, we can simulate a broker leader failure and monitor the message flow to identify potential data loss.

Simulating a broker leader failure using the Blackhole Gremlin

In this experiment, we will use the Blackhole Gremlin to drop all network traffic to and from the lead broker. This experiment depends heavily on timing: we want the broker to fail after it has received a message but before its followers can replicate it. This can be done in two ways:

  • Send a continuous stream of messages at an interval shorter than the followers' fetch interval (replica.fetch.wait.max.ms), start the experiment, and look for gaps in the consumer output.
  • Trigger the chaos experiment from the producer application via the Gremlin API immediately after sending a message.

In this experiment, we will use the first method. The application generates a new message every 100 milliseconds. The output of the message stream is recorded as a JSON list and analyzed for gaps or timing inconsistencies. We will run the attack for 30 seconds, during which the application will generate 300 messages (one every 100 milliseconds).
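The gap analysis can be automated with a short script. This is a sketch under stated assumptions: each recorded event is a JSON object with a millisecond `timestamp` field (a name chosen for this example), and any inter-message gap well above the 100 ms production interval suggests a lost message.

```python
import json

def find_gaps(events_json: str, expected_ms: int = 100, tolerance_ms: int = 50):
    """Scan a JSON list of events for timestamp gaps that suggest lost
    messages. We sort first, because Kafka only guarantees ordering
    within a partition, not across partitions."""
    timestamps = sorted(event["timestamp"] for event in json.loads(events_json))
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > expected_ms + tolerance_ms:
            gaps.append((prev, cur))  # suspicious hole in the stream
    return gaps

# Five events roughly 100 ms apart: no gap exceeds the tolerance.
events = json.dumps([{"timestamp": t} for t in (0, 100, 210, 300, 400)])
print(find_gaps(events))  # []
```

An empty result after the attack is evidence for the hypothesis below; any `(prev, cur)` pair it returns pinpoints where a message went missing.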

  • Hypothesis: We will lose a few messages when the leader fails, but Kafka will quickly elect a new leader and message replication will continue successfully.
  • Result: Despite the sudden failure of the leader, the message list shows that every message was delivered successfully. Including the extra messages logged before and after the experiment, our pipeline recorded a total of 336 events, each with a timestamp roughly 100 milliseconds after the previous one. The messages did not appear in chronological order, but that is fine, since Kafka does not guarantee message ordering across partitions.

If you want to ensure that all messages are saved, you can set acks=all in your producer configuration. This tells the producer not to send a new message until the current one has been replicated to the partition leader and all of its followers. This is the safest option, but it limits throughput to the speed of the slowest broker, so it can have a significant impact on performance and latency.
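In configuration terms, the durability trade-off described above might look like this (values are illustrative; note that `min.insync.replicas` is a broker- or topic-level setting that pairs with the producer-level `acks=all`):

```properties
# Producer durability settings
acks=all                   # wait for the leader and all in-sync replicas
retries=2147483647         # retry transient errors instead of dropping messages

# Broker/topic-level companion setting
min.insync.replicas=2      # reject writes if the ISR shrinks below 2
```

With this combination, a write succeeds only while at least two replicas hold the message, so a single leader failure like the one simulated above cannot lose acknowledged data.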

Experiment 3: Avoiding split-brain

Kafka, ZooKeeper, and similar distributed systems are susceptible to a problem known as "split-brain." In a split-brain, two nodes within the same cluster lose synchronization and partition from each other, producing two separate and potentially incompatible views of the cluster. This can lead to inconsistent data, data corruption, or even the formation of a second cluster.

How does this happen? In Kafka, the controller role is assigned to a single broker node. The controller is responsible for detecting changes to cluster state, such as failed brokers, leader elections, and partition assignments. There is one and only one controller per cluster, to maintain a single consistent view of the cluster. Although this makes the controller a single point of failure, Kafka has a process for handling such failures.

All brokers regularly check in with ZooKeeper to prove their health. If a broker fails to respond within zookeeper.session.timeout.ms (the default is 18000 milliseconds), ZooKeeper marks the broker as unhealthy. If that broker is the controller, a controller election is triggered and another broker becomes the new controller. The new controller is assigned a number called the controller epoch, which tracks the most recent controller election. If the failed controller comes back online, it compares its own controller epoch with the one stored in ZooKeeper, recognizes that a newer controller has been elected, and falls back to being a normal broker.
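The epoch comparison at the heart of this process is simple enough to illustrate directly. This toy function is invented for the example (Kafka's real controller logic lives in the broker, not in user code): a returning controller steps down whenever ZooKeeper records a newer epoch than its own.

```python
def resolve_controller(local_epoch: int, zk_epoch: int) -> str:
    """Toy illustration of Kafka's controller-epoch check: a returning
    controller compares its epoch with the one stored in ZooKeeper and
    steps down if a newer controller was elected while it was away."""
    if local_epoch < zk_epoch:
        return "step down: act as a normal broker"
    return "still the active controller"

# A controller that failed at epoch 3 returns after epoch 4 was elected:
print(resolve_controller(local_epoch=3, zk_epoch=4))  # step down: act as a normal broker
```

Because the epoch is stored centrally in ZooKeeper, two brokers can never both believe they hold the current epoch, which is what prevents a split-brain at the controller level.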

This process protects against the failure of a minority of brokers, but what if a majority of brokers fail? Can we restart them without causing a split-brain? We can test this with chaos engineering.

Restarting a majority of broker nodes using the Shutdown Gremlin

In this experiment, we will use the Shutdown Gremlin to restart two of the three broker nodes in our cluster. Because this experiment poses a potential risk to cluster stability (for example, we don't want to accidentally shut down all of the ZooKeeper nodes), we want to ensure that all three brokers are healthy before running it. We will create a Status Check that fetches the list of healthy brokers from the Kafka Monitoring API to verify that all three brokers are up and running.

Here is our fully configured scenario, showing status checking and shutdown Gremlin:

  • Hypothesis: Kafka throughput will pause temporarily, but both broker nodes will rejoin the cluster without problems.
  • Result: Control Center still lists three brokers, but shows two of them as out of sync and with under-replicated partitions. This is to be expected, as those nodes lost contact with the remaining broker and with ZooKeeper.

When the previous controller (broker 1) went offline, ZooKeeper immediately elected the remaining broker (broker 3) as the new controller. Since both brokers restarted without exceeding the ZooKeeper session timeout, the broker uptime chart shows them as continuously online. However, when our message pipeline moved to broker 3, the throughput and replication charts show that this significantly affected throughput and partition health.

Nevertheless, both brokers rejoined the cluster without incident. It follows that our cluster can survive a temporary majority failure: performance degrades significantly, and the cluster must elect new leaders, reassign partitions, and replicate data among the remaining brokers, but it does not fall into a split-brain. This outcome could differ if restoring the brokers took longer, so we should make sure we have an incident response plan in place in the event of a major production outage.

Experiment 4: ZooKeeper outage

ZooKeeper is a core dependency of Kafka. It is responsible for tasks such as identifying brokers, electing leaders, and tracking the distribution of partitions across brokers. A ZooKeeper outage does not necessarily cause Kafka to fail, but the longer it goes unresolved, the more likely it is to cause unexpected problems.

In one example, HubSpot experienced a ZooKeeper failure caused by a large backlog of requests that built up over a short period. ZooKeeper could not recover for several minutes, which in turn caused Kafka nodes to start crashing. The result was data corruption, and the team had to manually restore backup data to the servers. While this was an unusual situation that HubSpot has since addressed, it underscores the importance of testing ZooKeeper and Kafka both as individual services and as a whole.


Simulating a ZooKeeper outage using the Blackhole Gremlin

In this experiment, we want to verify that our Kafka cluster can survive an unexpected ZooKeeper outage. We will use the Blackhole Gremlin to drop all traffic to and from the ZooKeeper nodes. We will run the attack for five minutes while monitoring the state of the cluster in Control Center.

  • Hypothesis: Kafka can tolerate a short-term ZooKeeper outage without crashing, losing data, or corrupting data. However, any changes to cluster state will not be resolved until ZooKeeper is back online.
  • Result: Running the experiment had no effect on message throughput or broker availability. As hypothesized, messages continued to be produced and consumed without unexpected problems.

However, if one of the brokers fails during the outage, it cannot rejoin the cluster until ZooKeeper is back online. This alone is unlikely to cause a failure, but it can lead to another problem: cascading failures. For example, a broker failure shifts the producers' load onto the remaining brokers. If those brokers are near their capacity limits, they may collapse in turn. Even if we bring the failed broker back online, it cannot rejoin the cluster until ZooKeeper is available again.

This experiment shows that we can tolerate a temporary ZooKeeper failure, but we should still work quickly to bring it back online. We should also look for ways to mitigate the risk of a complete outage, such as distributing ZooKeeper across multiple zones for redundancy. Although this increases operational cost and adds latency, it is a small price compared to a production failure of the entire cluster.

Further improve the reliability of Kafka

Kafka is a complex platform with a large number of interdependent components and processes. Running Kafka reliably requires planning, continuous monitoring, and proactive failure testing. This applies not only to our Kafka and ZooKeeper clusters, but also to the applications that use Kafka. Chaos engineering allows us to safely and efficiently uncover reliability issues in Kafka deployments. Preparing for today's failures prevents or reduces the risk and impact of future failures, saving your organization time, effort, and customer trust.

Now that we've shown you four different chaos experiments for Kafka, try running them yourself by signing up for a Gremlin Free account. When creating your experiments, consider potential points of failure in your Kafka deployment (for example, the connection between brokers and consumers) and observe how they respond to different attacks. If you find a problem, implement a fix and repeat the experiment to verify that it works. As your systems become more reliable, gradually increase the magnitude (the intensity of the experiment) and the blast radius (the number of systems affected by the experiment) in order to fully test the entire deployment.

Source: Chaos Engineering Practice Author: Li Dashan

Originally written by Andree Newman at Gremlin.com

The first 4 Chaos Experiments to Run on Apache Kafka

