
1. How Kafka was born

Kafka is a distributed messaging system that relies on ZooKeeper for coordination. Its biggest feature is that it can process large amounts of data in real time to serve a wide range of scenarios: Hadoop-based batch systems, low-latency real-time systems, Storm/Spark streaming engines, Web/Nginx logs, access logs, messaging services, and so on. Written in Scala, Kafka was open-sourced by LinkedIn in 2011 and donated to the Apache Foundation, where it became a top-level open source project.

In today’s society, application systems for commerce, social networking, search, browsing, and more constantly produce all kinds of information, like information factories. In the era of big data, we face the following challenges:

  1. How to collect this vast amount of information;
  2. How to analyze it;
  3. How to do both of the above in a timely manner;

These challenges form a business demand model: producers produce information, consumers consume it, and between producers and consumers there needs to be a bridge – a messaging system. At a micro level, this requirement can also be understood as how messages are passed between different systems.

Kafka is a distributed messaging system:

  1. Kafka was open-sourced by LinkedIn;
  2. Kafka is a framework that solves these problems by seamlessly connecting producers and consumers;
  3. Kafka is a high-throughput distributed messaging system.

2. Why use messaging systems

  • Decoupling:

It allows you to extend or modify the two sides of the process independently, as long as they adhere to the same interface constraints.

  • Redundancy:

Message queues persist data until it has been fully processed, thus avoiding the risk of data loss. In the “insert-retrieve-delete” paradigm used by many message queues, before removing a message from the queue, your processing system needs to explicitly indicate that the message has been processed, ensuring that your data is stored safely until you are done using it.

  • Extensibility:

Because message queues decouple your processing, it is easy to increase the rate at which messages are enqueued and processed simply by adding more processing capacity.

  • Flexibility & Peak processing capability:

Applications need to keep functioning when traffic surges, but such bursts are rare, and it would be hugely wasteful to provision resources just to handle these spikes. Message queues enable key components to withstand sudden access pressure without completely collapsing under the overload of requests.

  • Recoverability:

The failure of a component does not affect the entire system. Message queuing reduces coupling between processes, so that even if a process that processes messages dies, messages that are queued can still be processed after the system recovers.

  • Order guarantee:

In most usage scenarios, the order in which data is processed is important. Most message queues are inherently sorted and ensure that the data will be processed in a particular order. (Kafka guarantees the order of messages within a Partition)

  • Buffer:

It helps to control and optimize the speed at which data flows through the system, resolving the inconsistency between the processing speed of production and consumption messages.

  • Asynchronous communication:

Many times, users do not want or need to process messages immediately. Message queues provide asynchronous processing, allowing users to queue a message without processing it immediately. Put as many messages on the queue as you want, and then process them as needed.

3. Kafka basic architecture

3.1. Topology structure

3.2. Core concepts

  • Producer: a terminal or service that publishes messages to the Kafka cluster.
  • Broker: a server in the Kafka cluster; a cluster contains one or more brokers.
  • Topic: the category to which every message published to the Kafka cluster belongs; that is, Kafka is topic-oriented.
  • Partition: a physical concept; each topic contains one or more partitions, and the partition is the unit in which Kafka allocates data.
  • Consumer: a terminal or service that consumes messages from the Kafka cluster.
  • Consumer group: in the high-level consumer API, each consumer belongs to a consumer group, and each message can be consumed by only one consumer within a group, though it can be consumed by multiple consumer groups.
  • Replica: a copy of a partition, ensuring the partition’s high availability.
  • Leader: a role among the replicas; producers and consumers interact only with the leader.
  • Follower: a role among the replicas that copies data from the leader.
  • Controller: one of the brokers in the Kafka cluster, responsible for leader election and various failovers.
  • ZooKeeper: Kafka uses ZooKeeper to store the cluster’s metadata.

4. Basic Kafka features

  1. High throughput and low latency: Kafka can process hundreds of thousands of messages per second, with latency as low as a few milliseconds;
  2. Scalability: Kafka clusters support hot expansion;
  3. Persistence and reliability: messages are persisted to local disks, and data replication is supported to prevent data loss;
  4. Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 nodes may fail);
  5. High concurrency: thousands of clients can read and write simultaneously.

4.1. Design idea

  • Consumer group: consumers may form groups, and each message can be consumed by only one consumer within a group. If a message should be consumed by multiple consumers, those consumers must belong to different groups.
  • Message state: in Kafka, the state of a message is kept on the consumer side. The broker does not track which message has been consumed by whom; it only records an offset value (pointing to the next message to be consumed in the partition). This means a message on the broker may be consumed more than once if the consumer mishandles its offset.
  • Message persistence: Kafka persists messages to the local file system with extremely high efficiency.
  • Message expiration: Kafka retains messages for a long time so that consumers can consume them multiple times, although many of the details are configurable.
  • Batch sending: Kafka supports sending messages in batches (as message sets) to improve push efficiency; see the producer sketch after this list.
  • Push and pull: Kafka producers push messages to the broker, and consumers pull messages from the broker; both producing and consuming happen asynchronously. As for the relationship between brokers in a Kafka cluster: it is not master-slave. All brokers have equal status in the cluster, and broker nodes can be added or removed at will.
  • Load balancing: Kafka provides a metadata API to manage load between brokers (as of Kafka 0.8.x; 0.7.x relied on ZooKeeper for this).
  • Synchronous and asynchronous modes: the producer adopts an asynchronous push mode, which greatly improves the throughput of the Kafka system (synchronous or asynchronous mode can be controlled by a parameter).
  • Partitioning: Kafka brokers support message partitioning. The producer can decide which partition a message is sent to, and within a partition messages are ordered in the order the producer sends them. Partitioning is important, as we will see later.
  • Offline data loading: Kafka is also ideal for loading data into Hadoop or a data warehouse, thanks to its support for scalable data persistence.
  • Plugin support: the active community has developed a number of plugins to extend Kafka’s capabilities, such as plugins for Storm, Hadoop, and Flume.
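
As a concrete illustration of the batch-sending and asynchronous-push ideas above, here is a minimal producer sketch in Java. It assumes a broker at localhost:9092 and a topic named demo-topic (both placeholders); batch.size and linger.ms are the standard client settings that control batching:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Batch sending: accumulate up to 16 KB per partition, or wait up to 10 ms,
        // before pushing a message set to the broker.
        props.put("batch.size", "16384");
        props.put("linger.ms", "10");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // send() is asynchronous: it returns immediately and the batch
                // is flushed to the broker in the background.
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        } // close() flushes any remaining batched messages
    }
}
```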

4.2. Application Scenarios

  1. Log collection: a company can use Kafka to collect logs from a variety of services and expose them to consumers such as Hadoop, HBase, and Solr through a unified interface.
  2. Messaging: decoupling producers and consumers, caching messages, and so on.
  3. User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These activities are published by various servers to Kafka topics, which subscribers consume for real-time monitoring and analysis, or load into Hadoop or a data warehouse for offline analysis and mining.
  4. Operational metrics: Kafka is also used to record operational monitoring data, including collecting metrics from various distributed applications and producing centralized feeds for operations such as alarms and reports.
  5. Stream processing: with engines such as Spark Streaming and Storm.

5. Push vs. Pull

5.1. Point-to-point mode

As shown in the figure above, the point-to-point pattern is typically a pull- or polling-based messaging model, characterized by each message sent to the queue being processed by one and only one consumer. After the producer puts a message into the queue, the consumer actively pulls it for consumption. The advantage of the point-to-point model is that consumers can control the rate at which they pull messages. The drawback is that the consumer side cannot sense whether there are messages waiting in the queue, so it needs additional threads to monitor the queue.

5.2. Publish and subscribe model

As shown in the figure above, the publish-subscribe pattern is a push-based messaging model that can have many different subscribers. After the producer puts a message into the queue, the queue pushes the message to every consumer subscribed to it (similar to a WeChat official account). Since consumers passively receive pushes, they do not need to sense whether the queue has messages waiting! However, Consumer1, Consumer2, and Consumer3 may process messages at different speeds because of different machine performance, and the queue cannot sense how fast each consumer is consuming! The push rate is therefore a problem for the publish-subscribe model. Suppose the three consumers’ processing speeds are 8M/s, 5M/s, and 2M/s respectively: if the queue pushes at 5M/s, Consumer3 cannot keep up; if the queue pushes at 2M/s, Consumer1 and Consumer2 waste a huge amount of capacity!

5.3. Kafka’s selection

As a messaging system, Kafka follows the traditional pattern of having producers push messages to the broker and consumers pull messages from the broker. Some logging-centric systems, such as Facebook’s Scribe and Cloudera’s Flume, operate in push mode. In fact, both push and pull modes have their strengths and weaknesses.

The push pattern has difficulty accommodating consumers with different consumption rates, because the sending rate is determined by the broker. The goal of push mode is to deliver messages as quickly as possible, but this can easily overwhelm consumers, with denial of service and network congestion as typical symptoms. The pull pattern, by contrast, consumes messages at an appropriate rate based on each consumer’s capacity.

For Kafka, the pull mode is more appropriate. The pull pattern simplifies broker design and lets the consumer control the rate at which messages are consumed, as well as the way it consumes them (in bulk or one by one), while choosing different delivery methods to implement different transport semantics.

6. Kafka workflow

6.1. Sending data

In the diagram above, the producer is the entry point for data. Notice the red arrows: when writing data, the producer always finds the leader of the partition and never writes directly to a follower! How does it find the leader? What does the write process look like? Take a look at the picture below:

  1. The producer obtains the leader of the partition from the cluster;
  2. The producer sends the message to the leader;
  3. The leader writes the message to a local file;
  4. The followers pull the message from the leader;
  5. The followers write the message locally and send an ACK to the leader;
  6. After receiving the ACKs from all replicas, the leader sends an ACK to the producer.
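
On the client side, this whole flow hides behind a single asynchronous send. A minimal sketch (broker address and topic name are placeholders) that attaches a callback which fires once the write has been acknowledged:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SendWithAckSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace(); // the write was never acknowledged
                    } else {
                        // The broker has acknowledged; metadata says where the message landed.
                        System.out.printf("written to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}
```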

6.1.1. Ensure that messages are in order

Note that after a message is written to the leader, the followers take the initiative to synchronize it from the leader! The producer publishes data to the broker in push mode, and each message is appended to its partition and written to disk sequentially, which guarantees that data within the same partition is ordered! The write process is illustrated in the diagram below:

6.1.2. Message load partitioning

Data is written to different partitions, which raises the question: why does Kafka partition data at all? As you can probably guess, the main purposes of partitioning are:

  • Convenient expansion: Since a topic can have multiple partitions, we can easily cope with the increasing amount of data by expanding the machine.
  • Improved concurrency: Multiple consumers can consume data at the same time, improving message processing efficiency.

Those familiar with load balancing know that when we send requests to a server, a load balancer may distribute the traffic across different servers. Similarly, if a topic in Kafka has multiple partitions, how does the producer know which partition to send data to? Kafka follows several rules:

  1. If a partition is specified when writing, the message is written to that partition;
  2. If no partition is specified but a message key is set, a partition is chosen by hashing the key;
  3. If neither a partition nor a key is specified, a partition is selected by round-robin polling.
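
These three rules correspond directly to the three ProducerRecord constructors in the Java client; a minimal sketch (topic name, key, and values are placeholders):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionRulesSketch {
    public static void main(String[] args) {
        // Rule 1: the partition (here 0) is specified explicitly, so the record goes to it.
        ProducerRecord<String, String> explicit =
                new ProducerRecord<>("demo-topic", 0, "user-42", "payload");

        // Rule 2: no partition, but a key; the default partitioner hashes the key,
        // so every record with key "user-42" lands in the same partition,
        // which also preserves the order of those records.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("demo-topic", "user-42", "payload");

        // Rule 3: neither partition nor key; the client spreads records across
        // partitions in a round-robin fashion.
        ProducerRecord<String, String> unkeyed =
                new ProducerRecord<>("demo-topic", "payload");
    }
}
```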

6.1.3. Ensure that messages are not lost

Guaranteeing that messages are not lost is a basic property of message queue middleware. How does the producer guarantee this when writing messages to Kafka? The write flow above has already described it: through the ACK response mechanism! When the producer writes data to the queue, a parameter can be set to control how Kafka confirms receipt. This parameter can be set to 0, 1, or all:

  1. 0 means the producer sends data to the cluster without waiting for any response, so it cannot know whether the message was sent successfully. Least safe, but most efficient;
  2. 1 means the producer sends the next message as soon as the leader responds, so the send is guaranteed only to have reached the leader;
  3. all means the producer sends the next message only after all followers have completed synchronization with the leader, guaranteeing that the leader received the data and all replicas hold a backup. Safest, but least efficient.
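
A minimal sketch of setting this on the producer; acks and retries are the standard client settings, and the broker address is a placeholder:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class AcksConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // "0" = fire-and-forget, "1" = leader ack only, "all" = every in-sync replica must ack.
        props.put("acks", "all");
        // Retries pair naturally with acks=all to ride out transient leader changes.
        props.put("retries", "3");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}
```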

Finally, if we write to a topic that does not exist, can the write succeed? Yes: Kafka automatically creates the topic, with the number of partitions and the number of replicas both defaulting to 1.
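
Because auto-created topics get only that minimal 1-partition, 1-replica layout, topics are usually created explicitly. A hedged sketch using the Java AdminClient (topic name, partition count, and replication factor are illustrative):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 2 for availability.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```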

6.2. Save data

After the producer writes data to Kafka, the cluster needs to save it! Kafka stores data on disk, and writing to disk is a time-consuming operation that, by common-sense standards, seems unsuited to such a high-concurrency component. Instead, Kafka reserves a dedicated block of disk space up front and writes data sequentially, which is far more efficient than random writes.

6.2.1. Partition structure

As mentioned earlier, every topic can be divided into one or more partitions. If the topic feels abstract, the partition is quite concrete: a partition is represented as a folder containing multiple groups of segment files. Each group consists of a .index file, a .log file, and a .timeindex file. The .log file stores the messages, while the .index and .timeindex files serve as indexes for retrieving them.

As shown in the figure above, this partition has three groups of segment files. Each .log file has the same maximum size, but the number of messages stored in each is not necessarily equal. Each file is named after its segment’s minimum offset; for example, the segment named 00000000000000000000 covers messages with offsets 0 through 368795. Kafka uses this segment + index scheme to solve the lookup-efficiency problem.

6.2.2. The structure of the Message

The .log file is where messages are actually stored; the messages we write from the producer end up here. So what exactly does a stored message contain? It includes the message body, message size, offset, compression type, and more. Three key fields to know:

  • Offset: an 8-byte ordered ID that uniquely determines the position of each message within the partition;
  • Message size: 4 bytes, describing the size of the message;
  • Message body: the actual message data (possibly compressed), occupying a different amount of space for each message.
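
A toy sketch of reading these three fields with plain byte arithmetic; note that the real on-disk record format carries additional fields (CRC, timestamp, and so on) that are omitted here:

```java
import java.nio.ByteBuffer;

public class LogEntrySketch {
    public static void main(String[] args) {
        // Build one simplified entry: 8-byte offset, 4-byte size, then the body.
        byte[] body = "hello".getBytes();
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + body.length);
        buf.putLong(368801L).putInt(body.length).put(body).flip();

        long offset = buf.getLong();   // 8 bytes: position within the partition
        int size = buf.getInt();       // 4 bytes: size of the message body
        byte[] data = new byte[size];  // variable: the (possibly compressed) payload
        buf.get(data);
        System.out.printf("offset=%d size=%d body=%s%n", offset, size, new String(data));
    }
}
```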

6.2.3. Storage Policy

Kafka stores all messages, whether or not they are consumed. So what’s the deletion strategy for old data?

  • Time-based: the default retention is 168 hours (7 days);
  • Size-based: the default configuration is 1073741824 bytes (1 GB).
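
Both limits can also be set per topic. A hedged sketch using the Admin API’s incrementalAlterConfigs (available since Kafka 2.3; the topic name and broker address are placeholders), setting the topic-level retention.ms and retention.bytes configs to the values quoted above:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // Time-based retention: 7 days, expressed in milliseconds.
            AlterConfigOp setMs = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            // Size-based retention: cap the partition log at 1 GB.
            AlterConfigOp setBytes = new AlterConfigOp(
                    new ConfigEntry("retention.bytes", String.valueOf(1073741824L)),
                    AlterConfigOp.OpType.SET);
            Collection<AlterConfigOp> ops = List.of(setMs, setBytes);
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```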

Note that the time complexity for Kafka to read a particular message is O(1), so deleting expired files here will not improve Kafka’s performance!

6.3. Consumption data

Once messages are stored in log files, consumers can consume them. In discussing the two modes of message-queue communication, we covered the point-to-point and publish-subscribe modes. Kafka uses the publish-subscribe model, but its consumers actively pull messages from the Kafka cluster, and like producers, they pull from the partition leader.

Multiple consumers can form a consumer group, and each group has a group ID! Consumers in the same group can consume data from different partitions of the same topic, but two consumers in one group can never consume the same partition! Take a look at the picture below:

The figure shows a consumer group with fewer consumers than partitions, so one consumer consumes multiple partitions, more slowly than a consumer handling a single partition would! If instead a group has more consumers than partitions, will several consumers consume the same partition? As mentioned above, no! The extra consumers consume no partition data at all. Therefore, in practice, it is recommended that the number of consumers in a group equal the number of partitions!
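
A minimal sketch of a pulling consumer in a group; the group ID demo-group, topic demo-topic, and broker address are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group"); // consumers sharing this ID split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                // Pull mode: the consumer asks the leader for records at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running two instances of this program with the same group.id against a two-partition topic gives each instance one partition; a third instance would sit idle, which matches the sizing advice above.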

We described the .log, .index, and .timeindex files above, so how does Kafka use segment + offset to find a message? Suppose we need to find the message whose offset is 368801. Let’s take a look at the picture below:

  1. Locate the segment file that holds the message with offset 368801. Since segments are named after their minimum offset, this is the segment whose base offset is 368796.
  2. Open that segment’s .index file (the one named with base offset 368796, whose entries start after offset 368796). The message with offset 368801 has a relative offset of 368801 - 368796 = 5 within this segment. Because the file is a sparse index, storing the relationship between relative offsets and the corresponding physical positions of messages, an entry with relative offset exactly 5 may not exist. Binary search is therefore used to find the entry with the largest relative offset less than or equal to 5, which here is the entry with relative offset 4.
  3. The entry with relative offset 4 records a physical position of 256. Open the .log data file and scan sequentially from position 256 until the message with offset 368801 is found.
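
Step 2’s "largest relative offset less than or equal to the target" is a floor lookup over a sorted structure. A toy in-memory sketch (the index entries and positions mirror the example numbers above; Kafka performs the equivalent binary search over the memory-mapped .index file):

```java
import java.util.Map;
import java.util.TreeMap;

public class SparseIndexLookupSketch {
    public static void main(String[] args) {
        long baseOffset = 368796; // the segment's base offset, taken from its file name
        // Sparse index: only some relative offsets have entries (relativeOffset -> filePosition).
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(1L, 0L);
        index.put(4L, 256L);
        index.put(7L, 1407L);

        long target = 368801;
        long relative = target - baseOffset; // 5
        // Floor lookup = "largest relative offset <= target"; TreeMap does the
        // binary search for us.
        Map.Entry<Long, Long> entry = index.floorEntry(relative); // -> (4, 256)
        System.out.printf("scan the .log file sequentially from position %d to reach offset %d%n",
                entry.getValue(), target);
    }
}
```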

This mechanism is built on top of ordered offsets, combining segments + ordered offsets + a sparse index + binary search + sequential scanning to look up data efficiently. At this point the consumer can get the data it needs to process. But how does each consumer record its own consumption position? In early versions, consumers maintained their consumed offsets in ZooKeeper and reported them at intervals, which easily led to repeated consumption and performed poorly! In newer versions, the offsets consumed by consumers are maintained directly in the internal Kafka topic __consumer_offsets.
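
Since the group’s position lives in that internal topic, committing it is just a consumer-side call. A minimal sketch with auto-commit disabled, so the offset in __consumer_offsets only advances after the batch has been processed (placeholders as before):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // we decide when the offset moves forward
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // ... process the batch ...
                // Persist the new position to __consumer_offsets only after processing,
                // trading a little latency for at-least-once delivery.
                consumer.commitSync();
            }
        }
    }
}
```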
