
Apache Kafka is written in Scala and Java and was originally created at LinkedIn. Back in 2011, the technology was handed over to the open source community as a highly scalable messaging system. Today, Apache Kafka is part of the Confluent streaming platform and handles trillions of events every day. Kafka is well established in the market, and many well-known companies rely on it.

The data and logs involved in today's complex systems must be processed, reprocessed, and analyzed, often in real time. This is why Apache Kafka plays an important role in the world of message streaming. The key design principles of Kafka grew out of the need for high-throughput architectures that are easily scalable and provide the ability to store, process, and reprocess streaming data.

Kafka topic partitions

A Kafka topic is divided into partitions that contain records in a fixed, immutable order. Each record in a partition is assigned a unique offset that identifies it. A topic can have multiple partition logs, which allows multiple consumers to read from a topic in parallel.
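The structure described above can be sketched in a few lines of Python. This is a toy model, not Kafka's actual storage layer: each partition is an append-only log, and a record's offset is simply its position in that log.

```python
# Minimal sketch (not Kafka's real implementation): a topic as a set of
# append-only partition logs, where each appended record gets the next offset.
class Partition:
    def __init__(self):
        self.records = []           # append-only, immutable order

    def append(self, record):
        offset = len(self.records)  # next offset = current log length
        self.records.append(record)
        return offset

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

topic = Topic(num_partitions=3)
# Records appended to one partition get strictly increasing offsets.
p0 = topic.partitions[0]
offsets = [p0.append(f"msg-{i}") for i in range(3)]
print(offsets)  # → [0, 1, 2]
```

Because each partition is an independent log, different consumers can read different partitions at the same time, which is where the parallelism comes from.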

Partitioning allows you to parallelize a topic by splitting its data across multiple brokers.

In Kafka, replication is implemented at the partition level. The redundant copies of a topic partition are called replicas. Each partition usually has one or more replicas, meaning its messages are replicated across several Kafka brokers in the cluster.

For each partition, one replica acts as the leader and the rest act as followers. The leader replica handles all read and write requests for the partition, while the followers replicate the leader. If the leader's server fails, one of the followers becomes the new leader by default. You should strive for a good balance of leaders, so that each broker leads an equal number of partitions and the load is distributed evenly.
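The failover behavior can be illustrated with a small simulation. This is a deliberately simplified model: real Kafka elects a new leader from the in-sync replica set via the controller, whereas here we just promote the first surviving replica.

```python
# Hedged sketch of leader failover for a single partition: one replica is
# the leader, the rest are followers; if the leader's broker fails, a
# follower is promoted. (Kafka actually elects from the in-sync replica
# set via the cluster controller; this only illustrates the idea.)
class PartitionReplicas:
    def __init__(self, brokers):
        self.replicas = list(brokers)   # broker ids hosting a copy
        self.leader = self.replicas[0]

    def on_broker_failure(self, broker):
        self.replicas.remove(broker)
        if broker == self.leader:
            if not self.replicas:
                raise RuntimeError("partition offline: no replicas left")
            self.leader = self.replicas[0]  # promote a follower

p = PartitionReplicas(brokers=[1, 2, 3])
print(p.leader)          # → 1
p.on_broker_failure(1)   # the leader's broker dies
print(p.leader)          # → 2 (a former follower takes over)
```

Note that a follower failure leaves the leader untouched; only a leader failure triggers a promotion.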

When a producer publishes a record to a topic, it publishes it to that partition's leader. The leader appends the record to its commit log and increments the record offset. Kafka exposes records to consumers only after they have been committed, that is, replicated across the cluster.

The producer decides which partition to write to; this decision is not made by the broker. Producers can attach a key to a record to direct it to a particular partition, and all records with the same key end up in the same partition. Before a producer can send any records, it must request metadata about the cluster from a broker. The metadata tells it which broker is the leader of each partition, and the producer always writes to the partition leader. The producer then uses the key to decide which partition to write to: the default implementation hashes the key to choose a partition, or you can skip this step and specify the partition yourself.
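The default hash-based routing can be sketched as below. This is a simplified stand-in, not Kafka's actual partitioner: the Java client hashes keys with murmur2, while here `crc32` is used only to demonstrate the "same key, same partition" rule.

```python
# Simplified stand-in for the default partitioner: hash the record key and
# take it modulo the partition count. (Kafka's Java client uses murmur2;
# crc32 here only demonstrates the deterministic key → partition mapping.)
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# All records with the same key land in the same partition...
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
# ...while different keys are typically spread across partitions:
print({k: partition_for(k, 6) for k in (b"user-1", b"user-2", b"user-3")})
```

Because the mapping depends only on the key and the partition count, any producer in the system routes a given key to the same partition, which preserves per-key ordering.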

A common mistake when publishing records is to use the same key for every record, which sends all records to the same partition and leaves you with an unbalanced topic. (A record with no key at all is, by contrast, spread across partitions by the default partitioner.)

Consumers and consumer groups

Consumers can start reading messages at a specific offset and can read from any offset they choose. This allows consumers to join the cluster at any point in time.

There are two types of consumers in Kafka. First is the low-level consumer, for which topics and partitions are specified explicitly, as is the offset to read from: a fixed position, the beginning, or the end. Of course, keeping track of which offsets have been consumed, so that the same record is never read more than once, can be cumbersome. So Kafka added a second, simpler way to consume:
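The offset bookkeeping a low-level consumer must do itself can be sketched like this. The partition is modeled as a plain list; the "consumer" reads from a chosen offset and must remember where it stopped in order to resume without re-reading.

```python
# Sketch of what a low-level consumer manages itself: read one partition
# from a chosen offset (a fixed position, the beginning, or the end) and
# track how far it has read so records are never processed twice.
log = [f"record-{i}" for i in range(5)]  # one partition's committed records

def read_from(log, offset):
    """Yield (offset, record) pairs starting at the given offset."""
    for pos in range(offset, len(log)):
        yield pos, log[pos]

# Start at the beginning (offset 0), stop after two records, resume later.
it = read_from(log, offset=0)
consumed = [next(it), next(it)]
next_offset = consumed[-1][0] + 1        # the consumer must track this itself
resumed = list(read_from(log, next_offset))
print(next_offset, [r for _, r in resumed])  # → 2 ['record-2', 'record-3', 'record-4']
```

The error-prone part is exactly `next_offset`: lose it, and you either re-read or skip records. Consumer groups exist to move this bookkeeping to the broker side.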

High-level consumers

High-level consumers (more commonly known as consumer groups) consist of one or more consumers. You create a consumer group by setting the "group.id" attribute on a consumer; giving the same group ID to another consumer means it joins the same group.

The broker assigns partitions to consumers, deciding which consumer reads from which partition, and it also tracks the group's offset for each partition. It tracks this by having all consumers commit the offsets they have processed.

Each time a consumer is added to or removed from a group, consumption is rebalanced across the group. All consumers stop during a rebalance, so clients that time out or restart frequently reduce throughput. Keep consumers stateless, since a consumer may be assigned different partitions after a rebalance.

Consumers pull messages from topic partitions. Different consumers can be responsible for different partitions. Kafka can support large numbers of consumers and retain large amounts of data with little overhead. By using consumer groups, consumers can be parallelized so that multiple consumers can read data from multiple partitions of a topic, resulting in very high message processing throughput. The number of partitions affects the maximum parallelism of consumers because there cannot be more consumers than partitions.
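The parallelism limit can be made concrete with a toy round-robin assignment. Kafka's real assignors (range, round-robin, sticky) are more sophisticated; this sketch only shows that each partition goes to exactly one consumer in the group, so extra consumers sit idle.

```python
# Hedged sketch of partition assignment in a consumer group: partitions are
# dealt out round-robin, each to exactly one consumer, so consumers beyond
# the partition count receive nothing. (Kafka's real assignors are richer.)
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

parts = ["t-0", "t-1", "t-2"]
print(assign(parts, ["c1", "c2"]))              # both consumers get work
print(assign(parts, ["c1", "c2", "c3", "c4"]))  # c4 is left idle
```

This is why the number of partitions is chosen up front with the desired consumer parallelism in mind: adding a fourth consumer to a three-partition topic buys nothing.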

Records are never pushed to consumers, who request messages when they are ready to process them.

Because Kafka retains all records in its log, consumers are never overwhelmed by a burst of data and never lose data because of it. If a consumer falls behind in message processing, it can catch up later and return to processing data in real time.