Building Self-Driving Kafka Clusters Using Open Source Components [1]

This article describes how Slack built an automated Kafka infrastructure on top of open source tools such as Chef, Terraform, Cruise Control, and CMAK to automate the operation and maintenance of its Kafka clusters.

This article discusses how Slack uses Kafka and shows how autonomous Kafka clusters have been built and scaled up by a small, lean team over the past four years.

Slack uses Kafka to build a publish/subscribe system that plays a key role in Job Queue [2], the asynchronous job execution framework behind almost every action a Slack user takes (for example, unfurling links in channels, sending notifications, notifying bots, updating the search index, and running security checks). In addition, Kafka acts as a nervous system for moving mission-critical data across Slack, powering our logging pipelines [3], trace data [4], billing pipeline, enterprise analytics [5], and security analytics data.


Slack’s Kafka journey

As early as 2018, several teams had adopted Kafka for their own use cases, each running a separate Kafka cluster. As a result, different teams deployed different versions of Kafka, and each duplicated the work of deploying, operating, and managing its own cluster.

So we took on a project to standardize all Kafka clusters on a single version, managed by one team. However, because the team was small, we wanted to automate operations as much as possible.

Today, Slack manages approximately 0.7 PB of data across 10 Kafka clusters, running on hundreds of nodes and processing millions of messages per second across hundreds of topics, with a peak aggregate throughput of 6.5 Gbps. Our Kafka infrastructure cost is driven not only by hardware but also by network, and our consumers range from heavy batch jobs to very latency-sensitive applications like Job Queue [2].


What is a self-driving Kafka cluster?

Kafka is an excellent piece of software and runs on hundreds of nodes at Slack. However, if you have ever tried deploying or managing Kafka, you know it is not an easy task. We were often asked for support because of slow brokers, occasional failures, or capacity management issues.

The goal of automating Kafka operations is to eliminate the operational overhead of day-to-day Kafka management.

To this end, we identified some common operations and maintenance tasks for Kafka clusters, including:

  • General Kafka administration, such as creating topics, changing partition counts, and reassigning partitions to brokers
  • Capacity planning, such as adding brokers to or removing brokers from a cluster
  • Operational tasks, such as replacing brokers or deploying new versions of the software
  • On-call work to diagnose Kafka cluster problems
  • Customer support work, such as explaining whether Kafka consumers are keeping up with the rate at which data is produced

Therefore, when we migrated to the new version of Kafka, we decided to automate these operational aspects or make them self-service for users.


Kafka 2 project

We consolidated our efforts around building a more automated Kafka deployment based on version 2.0.1. Our Kafka setup consists of the following components:

Build, publish, configure, and deploy: Chef and Terraform

We use Chef to manage the base operating system and to deploy and configure the Kafka software on each host. Each of our Kafka clusters runs under a different Chef role with its own custom configuration, while sharing the same base configuration. We use a Terraform module to create an ASG in AWS for each Chef role, which automatically manages provisioning and decommissioning of nodes.

Previously, Kafka deployments were managed mainly by installing a Debian Kafka package. However, we found that deploying pre-built packages was painful because configuration was not always reliable. In addition, the Chef configuration was very complex because we had to override the default configuration anyway. To address these issues, we forked the Kafka repository internally and set up a CI/CD pipeline that builds Kafka and publishes static binaries to S3. Chef then pulls the binaries from S3 and deploys them, a process that is fully repeatable.
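The build-and-publish step can be approximated with a short script. This is a minimal sketch, assuming boto3 and a hypothetical bucket, key layout, and version string; it is not Slack's actual pipeline.

```python
# Publish a locally built Kafka tarball to S3 so that Chef can later pull a
# pinned, known-good version. Bucket name, key layout, and version string are
# illustrative assumptions.
import hashlib
import boto3

BUCKET = "example-kafka-artifacts"          # hypothetical bucket
VERSION = "2.0.1-internal.42"               # hypothetical internal build tag
TARBALL = f"kafka-{VERSION}.tgz"            # output of the CI build step

def publish(tarball_path: str) -> str:
    """Upload the build artifact plus a checksum that Chef can verify."""
    s3 = boto3.client("s3")
    key = f"kafka/{VERSION}/{TARBALL}"

    with open(tarball_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    s3.upload_file(tarball_path, BUCKET, key)
    s3.put_object(Bucket=BUCKET, Key=f"{key}.sha256", Body=digest.encode())
    return key  # Chef pins this key in the cluster's role configuration

if __name__ == "__main__":
    print(publish(TARBALL))
```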

Apache Zookeeper 3.4 clusters were generally configured manually: there was no automatic way to ensure that each Zookeeper node had a unique ID, and no way to reassign IDs without restarting the entire cluster. Manually configuring Zookeeper nodes was not only tedious (we were often paged for ordinary node failures), it was also error-prone: we could accidentally start multiple Zookeeper nodes in the same AWS availability zone, increasing the blast radius. To reduce tedium and errors, we automated this process by upgrading to Zookeeper 3.6, which does not require a cluster restart when replacing brokers. When a Zookeeper node is provisioned, it is automatically assigned a unique ID via Consul KV. With these two changes, we can bring up a Zookeeper cluster with Terraform through ASGs.
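The ID assignment can be sketched as a check-and-set write against Consul's KV HTTP API: the first node to claim an ID wins, and later nodes move on to the next one. The key prefix, ID range, and file paths below are assumptions for illustration.

```python
# Claim a unique Zookeeper ID through Consul KV using a check-and-set (cas=0)
# write, which only succeeds if the key does not exist yet.
import socket
import sys
import requests

CONSUL = "http://localhost:8500"                 # local Consul agent
PREFIX = "zookeeper/example-cluster/ids"         # hypothetical KV prefix
MAX_ID = 255                                     # Zookeeper myid must be 1-255

def claim_id(hostname: str) -> int:
    for candidate in range(1, MAX_ID + 1):
        # cas=0 makes the PUT succeed only if nobody has claimed this ID yet.
        resp = requests.put(
            f"{CONSUL}/v1/kv/{PREFIX}/{candidate}",
            params={"cas": 0},
            data=hostname,
            timeout=5,
        )
        resp.raise_for_status()
        if resp.json() is True:
            return candidate
    raise RuntimeError("no free Zookeeper IDs left in the configured range")

if __name__ == "__main__":
    myid = claim_id(socket.gethostname())
    # Zookeeper reads its ID from the myid file in its data directory
    # (path assumed here).
    with open("/var/lib/zookeeper/myid", "w") as f:
        f.write(f"{myid}\n")
    print(f"assigned myid {myid}", file=sys.stderr)
```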

Kafka cluster stability optimization

While the above setup eases the pain of configuring hosts automatically, we still had to take care of cluster operations such as migrating partitions to new brokers and rebalancing brokers for load. In addition, cluster operations affected customers, requiring on-call support or causing us to miss our SLOs.

After analysis, we found that hot spots in our Kafka clusters caused instability, and that several different problems created those hot spots.

We noticed that there were several hundred Kafka topics in the cluster, each with a different partition count based on its load. During normal operations, some brokers processed more data than others. During cluster operations such as adding or removing brokers, these hot spots were exacerbated, resulting in data consumption delays.

To resolve the hot spots, we wanted to utilize all brokers in the cluster evenly. We changed all partition counts to be multiples of the broker count to eliminate write hot spots, and we smoothed read hot spots by choosing an even number of consumers on all nodes. As long as all partitions are evenly distributed across the cluster, utilization stays consistent and read and write rates are smoothed out. In addition, when brokers or consumers are scaled, we update the topic's partition count so that it remains a multiple of the broker count, keeping utilization even.
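For illustration, the partition-count rule can be expressed in a few lines; the numbers below are hypothetical, not Slack's actual topic sizes.

```python
# Pick a partition count that is a multiple of the broker count, so partitions
# (and therefore writes) spread evenly across brokers.
import math

def balanced_partition_count(min_partitions: int, broker_count: int) -> int:
    """Smallest multiple of broker_count that still satisfies min_partitions."""
    return broker_count * math.ceil(min_partitions / broker_count)

# Example: a topic that needs at least 50 partitions on a 16-broker cluster
# gets 64, so every broker hosts the same number of its partitions.
print(balanced_partition_count(50, 16))   # -> 64
print(balanced_partition_count(64, 16))   # -> 64
```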

Another cause of hot spots in the Kafka cluster was the replication bandwidth consumed during partition rebalancing events: replication could consume most of the available bandwidth, starving producers and consumers, especially during peak hours. Therefore, we limited the replication bandwidth available to the cluster. However, capping the replication bandwidth slowed down cluster operations, so we also modified our operations to move only a few partitions at a time, which allowed us to make many small changes continuously.
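As a rough sketch of what "throttle replication and move a few partitions at a time" looks like, the snippet below shows Kafka's standard replication throttle configs and a helper that splits a reassignment plan into small batches; the rate and batch size are made-up values.

```python
# Illustrative settings for throttled partition moves: cap replication bandwidth
# per broker and move only a few partitions per step. The numbers are made up;
# the config keys are the standard Kafka replication throttle configs (the same
# ones kafka-reassign-partitions.sh --throttle sets for the duration of a move).
THROTTLE_BYTES_PER_SEC = 50 * 1024 * 1024     # assumed 50 MB/s per broker
PARTITIONS_PER_BATCH = 10                     # small batches, repeated many times

broker_throttle_configs = {
    "leader.replication.throttled.rate": str(THROTTLE_BYTES_PER_SEC),
    "follower.replication.throttled.rate": str(THROTTLE_BYTES_PER_SEC),
}

def batches(partition_moves, size=PARTITIONS_PER_BATCH):
    """Split a big reassignment plan into many small moves instead of one large one."""
    for i in range(0, len(partition_moves), size):
        yield partition_moves[i:i + size]

# Example: a plan with 57 partition moves becomes 6 small, throttled steps.
print(sum(1 for _ in batches(list(range(57)))))   # -> 6
```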

Despite these efforts, the Kafka clusters could still drift out of balance, for example due to partial failures. To automate these operations as well, we used the excellent Cruise Control [6] automation suite (built by LinkedIn) to automate cluster balancing and ensure even utilization of all nodes in the cluster. Overall, these tunings have helped the clusters run stably.
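Cruise Control exposes these balancing operations over a REST API. The sketch below, with an assumed URL and version-dependent parameters, shows how a dry-run rebalance proposal might be requested.

```python
# Ask Cruise Control for its current state and a dry-run rebalance proposal.
# The URL is hypothetical and endpoint parameters vary by Cruise Control
# version, so treat this as a sketch rather than a recipe.
import requests

CRUISE_CONTROL = "http://cruise-control.example.internal:9090"  # hypothetical

def cluster_state() -> str:
    resp = requests.get(f"{CRUISE_CONTROL}/kafkacruisecontrol/state",
                        params={"json": "true"}, timeout=30)
    resp.raise_for_status()
    return resp.text

def propose_rebalance(dry_run: bool = True) -> str:
    # dryrun=true returns the proposed partition movements without executing them.
    resp = requests.post(f"{CRUISE_CONTROL}/kafkacruisecontrol/rebalance",
                         params={"dryrun": str(dry_run).lower(), "json": "true"},
                         timeout=300)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(cluster_state())
    print(propose_rebalance(dry_run=True))
```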

Chaos engineering

Considering the high impact of switching traffic from the old cluster to the new cluster, we ran chaos experiments against the new cluster using dark traffic.

The tests stressed a variety of resources while the cluster was under load. In addition, we terminated brokers under controlled conditions, which helped us better understand the patterns of broker failure and their impact on producers and consumers.
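A controlled broker-termination test can be approximated as follows; the instance ID, bootstrap servers, and polling interval are illustrative assumptions, and the real experiments involved more instrumentation than this.

```python
# A controlled broker-kill experiment: terminate one broker's EC2 instance and
# watch how many partitions drop out of full ISR while the cluster recovers.
import time
import boto3
from confluent_kafka.admin import AdminClient

BOOTSTRAP = "kafka.example.internal:9092"   # hypothetical bootstrap servers
VICTIM_INSTANCE = "i-0123456789abcdef0"     # hypothetical broker instance

def under_replicated_partitions(admin: AdminClient) -> int:
    md = admin.list_topics(timeout=10)
    return sum(
        1
        for topic in md.topics.values()
        for p in topic.partitions.values()
        if len(p.isrs) < len(p.replicas)
    )

if __name__ == "__main__":
    admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
    boto3.client("ec2").terminate_instances(InstanceIds=[VICTIM_INSTANCE])
    time.sleep(30)  # give the cluster a moment to notice the dead broker

    # Poll until the cluster is fully replicated again, recording recovery time.
    start = time.time()
    while (urp := under_replicated_partitions(admin)) > 0:
        print(f"{time.time() - start:6.0f}s  under-replicated partitions: {urp}")
        time.sleep(30)
    print(f"recovered in {time.time() - start:.0f}s")
```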

In these tests, we found that cluster recovery operations were mostly limited by the number of packets per second a host could send. To support faster recovery, we enabled jumbo frames on these hosts. Our Kafka instances now have the highest packets-per-second utilization of any fleet in Slack's infrastructure.
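Some back-of-the-envelope arithmetic shows why jumbo frames help when recovery is packet-rate bound; the throughput figure below is illustrative, not a measured Slack number.

```python
# For the same replication throughput, jumbo frames cut the packet rate roughly
# six-fold, which matters when recovery is bound by packets per second.
STANDARD_MTU = 1500
JUMBO_MTU = 9001

def packets_per_second(throughput_mb_s: float, mtu_bytes: int) -> float:
    return throughput_mb_s * 1024 * 1024 / mtu_bytes

for mtu in (STANDARD_MTU, JUMBO_MTU):
    print(f"MTU {mtu}: ~{packets_per_second(100, mtu):,.0f} pps for 100 MB/s")
# MTU 1500: ~69,905 pps; MTU 9001: ~11,650 pps
```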

In addition, this testing helped us identify edge-case bugs in clients using the Go Sarama library. In some of those use cases, we migrated the clients to Confluent Go, which also helped us standardize client configuration across languages. In cases where we could not upgrade the consumers, we added appropriate workarounds and alerts to monitor those use cases.

During these tests, we also realized that Zookeeper issues can quickly escalate into larger Kafka issues. So we configured a separate Zookeeper cluster for each Kafka cluster to reduce the blast radius of a Zookeeper failure, even though doing so adds slightly to the cost.

Chaos experiments also helped us understand the operational problems that can occur during real failures and helped us tune the clusters better.

Self-service Kafka cluster

In many cases, consumer teams would come to us with questions or ask us to increase or decrease cluster capacity. One class of questions concerned general operations, such as capacity planning; the other concerned understanding the health of their pipelines.

Also, using CLI tools to understand what is going on in Kafka is tedious. So we deployed Kafka Manager [7] to make Kafka cluster metadata, such as broker lists and topic lists, visible to everyone. Kafka Manager also helps simplify everyday operations, such as creating new topics and increasing the number of partitions for a topic.
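The operations that Kafka Manager wraps in a UI correspond to standard Kafka admin calls. The following sketch, with assumed topic names, counts, and addresses, shows the equivalent topic creation and partition expansion via the Kafka admin API.

```python
# The same day-to-day operations Kafka Manager (CMAK) exposes in its UI,
# shown here with the Kafka admin API for illustration.
from confluent_kafka.admin import AdminClient, NewTopic, NewPartitions

admin = AdminClient({"bootstrap.servers": "kafka.example.internal:9092"})

# Create a topic with 64 partitions and a replication factor of 3.
futures = admin.create_topics([NewTopic("example-events", 64, 3)])
for topic, fut in futures.items():
    fut.result()        # raises if creation failed
    print(f"created {topic}")

# Later, grow the topic to 96 partitions (partition counts can only increase).
futures = admin.create_partitions([NewPartitions("example-events", 96)])
for topic, fut in futures.items():
    fut.result()
    print(f"{topic} now has 96 partitions")
```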

To provide operational visibility into Kafka consumer health, we deployed a fork of the Kafka offset exporter [8], which exports consumer offset information as Prometheus metrics. On top of this data, we built dashboards that provide real-time, aggregated consumption metrics per topic and per consumer.
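A dashboard panel like the ones described here boils down to a Prometheus query over the exported offsets. The sketch below assumes a metric name and Prometheus address; substitute whatever your exporter actually emits.

```python
# Aggregate per-topic consumer lag from Prometheus, the kind of query behind
# the dashboards described above. The Prometheus URL and the metric name are
# assumptions, not the exporter's guaranteed output.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"   # hypothetical
LAG_METRIC = "kafka_consumer_group_lag"                   # assumed metric name

def lag_by_topic(consumer_group: str) -> dict[str, float]:
    query = f'sum by (topic) ({LAG_METRIC}{{group="{consumer_group}"}})'
    resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["topic"]: float(r["value"][1]) for r in results}

if __name__ == "__main__":
    for topic, lag in sorted(lag_by_topic("job-queue-workers").items()):
        print(f"{topic:40s} lag={lag:,.0f}")
```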

To reduce knowledge silos, we consolidated our various runbooks into a single runbook, bringing all Kafka knowledge together in one place. We also consolidated multiple Kafka dashboards into a single global dashboard covering all Kafka clusters.

Together, these self-service tools help customers better understand the data while reducing the team's operational overhead. They also improve our security posture by reducing the need to SSH into Kafka brokers.

Upgrade the Kafka cluster

In order to upgrade the Kafka cluster, we decided not to do an in-place cluster upgrade. The reason is that we were not confident we could guarantee zero downtime during the upgrade window, especially when upgrading across multiple versions at once. In addition, there was no way to verify that the new cluster was problem-free, especially when changing the underlying hardware type.

To solve these problems, we developed a new upgrade strategy based on cluster switchover; a sketch of the producer-side switch appears after the list below. The switchover process is as follows:

  • Start a new cluster
  • Run all validation tests on the new cluster using dark traffic
  • Stop producing data to the old cluster
  • Start producing data to the new cluster
  • Shut down the old cluster after the retention window expires
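A minimal sketch of the producer-side switch (steps 3 and 4) is shown below, assuming producers resolve their bootstrap servers from a small runtime configuration source; the file path, cluster names, and topic are hypothetical.

```python
# Producers read their bootstrap servers from a runtime config file, so traffic
# can be redirected from the old cluster to the new one without a code deploy.
import json
from confluent_kafka import Producer

CLUSTER_CONFIG = "/etc/kafka-clients/clusters.json"   # hypothetical config file
# e.g. {"job-queue": {"bootstrap.servers": "new-cluster.example.internal:9092"}}

def build_producer(logical_cluster: str) -> Producer:
    with open(CLUSTER_CONFIG) as f:
        target = json.load(f)[logical_cluster]
    return Producer({"bootstrap.servers": target["bootstrap.servers"]})

# Flipping the entry in clusters.json from the old to the new cluster (and
# restarting producers) performs steps 3 and 4; the old cluster is drained by
# consumers and shut down after the retention window.
producer = build_producer("job-queue")
producer.produce("example-events", value=b"hello from the new cluster")
producer.flush(10)
```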

Although this strategy has the disadvantage of requiring a coordinated cutover of consumers, it is now a standard operating procedure that we also apply to other scenarios, such as moving topics across clusters and testing new EC2 instance types.


Split the Kafka main cluster

After investing time in making Kafka self-sustaining and improving its reliability, we were able to devote time to other important features, such as tracing [4]. However, even with all of that work done, assumptions and capacity still need to be revisited when the system reaches an inflection point.

We approached this point in early 2021, when our 90-broker cluster hit a tipping point in network throughput, peaking at 40,000 packets per second. Network saturation caused delays in downstream pipelines, and consumers struggled to keep up even under normal workloads, let alone during large traffic spikes. Developers who relied on the logging pipeline to debug problems were affected daily by the saturation of the Kafka network.

To reduce the load on the main Kafka cluster, we used our tools and automation to split the largest topics out into smaller, higher-performance clusters (upgrading from the older D2 [9] instances to modern, Nitro-enabled D3en instances [10]). Comparing similar workloads between the two clusters, the new cluster achieved similar performance (on a per-1,000-PPS basis) with only 20 brokers, making it roughly 2.5 times more efficient.

The problem was alleviated immediately after the three largest topics were moved off the main cluster. Here are some charts from that time that illustrate the impact of this work.

These spikes represent the consumer lag while reading one of the largest topics; each lag spike exceeded 500 million messages and broke the log flusher's SLA.

After the topic migration was complete, consumer lag improved dramatically. Our worst-case log latency dropped from 1.5 hours to 3-4 minutes. In addition, the number of pages for our logging pipeline decreased from 71 alerts per month to 9 alerts per month. Quite an improvement!

Smaller, dedicated Kafka clusters are also easier to manage: cluster operations complete faster, and there are fewer noisy-neighbor problems causing interference.


Conclusion

Self-driving Kafka clusters can be run at large scale using open source components such as Cruise Control, Kafka Manager, Chef, and Terraform. In addition, reliable, self-service, and autonomous Kafka can be built by applying standard SRE principles and appropriate tooling such as Kafka Manager and the Kafka offset exporter.

We have benefited greatly from others sharing their Kafka configurations, so in the spirit of sharing and giving back, our Kafka configuration can be found on GitHub [11].

This architecture has served Slack successfully for the past few years. Looking ahead, Kafka will play an even bigger role at Slack as part of the new Change Data Capture (CDC) project. The new CDC capabilities will support caching requirements for Slack's Permission Service, which authorizes operations in Slack, as well as near-real-time updates to the data warehouse. To that end, we have created a new data streaming team at Slack to handle all current and future Kafka use cases and to maintain and support all of Slack's Kafka clusters.

References:
[1] Building Self-driving Kafka Clusters Using Open Source Components: slack.engineering/building-se…
[2] Scaling Slack's Job Queue: slack.engineering/scaling-sla…
[3] Data Wrangling at Slack: slack.engineering/data-wrangl…
[4] Tracing at Slack: Thinking in Causal Graphs: slack.engineering/tracing-at-…
[5] Understand the Data in Your Slack Analytics Dashboard: slack.com/help/articl…
[6] Cruise Control: github.com/linkedin/cr…
[7] CMAK (Cluster Manager for Apache Kafka): github.com/yahoo/CMAK
[8] Kafka offset exporter: github.com/echojc/kafk…
[9] Now Available: D2 Instances, the Latest Generation of Amazon EC2 Dense-storage Instances: aws.amazon.com/about-aws/w…
[10] D3 instances: aws.amazon.com/ec2/instanc…
[11] Slack's Kafka configuration: gist.github.com/mansu/dfe52…
