Kafka is an excellent piece of distributed messaging middleware, and many systems use it for message communication. Understanding and using distributed messaging systems has become almost a required skill for backend developers. This article walks through common Kafka interview questions.

Talk about distributed messaging middleware

Questions

  • What is distributed messaging middleware?
  • What does message-oriented middleware do?
  • What are the usage scenarios for messaging middleware?
  • Message middleware selection?

Distributed messaging is a communication mechanism. Unlike RPC, HTTP, or RMI, messaging middleware communicates through a distributed intermediary broker. As shown in the figure, once messaging middleware is adopted, the upstream business system sends messages, which are first stored in the middleware and then distributed to the corresponding downstream business modules (a distributed producer-consumer pattern). This asynchronous approach reduces the coupling between services.

Messaging middleware can be defined as follows:

  • It uses an efficient and reliable messaging mechanism for platform-independent data exchange
  • It integrates distributed systems on the basis of data communication
  • It extends inter-process communication to distributed environments by providing messaging and message-queuing models

Introducing additional components into a system architecture inevitably increases its complexity and the difficulty of operations, so what advantages does distributed messaging middleware bring to a system? What role does message-oriented middleware play in a system?

  • Decoupling
  • Redundancy (storage)
  • Scalability
  • Peak shaving
  • Recoverability
  • Ordering guarantees
  • Buffering
  • Asynchronous communication

During interviews, interviewers often probe a candidate’s ability to select open source components, which tests the breadth of the candidate’s knowledge, the depth of their understanding of a particular class of system, and their overall grasp of system and architecture design. There are many open source distributed messaging systems, each with different characteristics. Choosing one requires not only some understanding of each system but also a clear picture of your own system’s requirements.

Here is a comparison of several common distributed messaging systems:

Answer key

  • What is distributed messaging middleware? Communication, queuing, distribution, producer-consumer pattern.
  • What does message-oriented middleware do? Decoupling, peak shaving, asynchronous communication, buffering.
  • What are the usage scenarios for messaging middleware? Asynchronous communication, message storage and processing.
  • Message middleware selection? Language, protocol, HA, data reliability, performance, transactions, ecosystem, simplicity, push/pull modes.

Basic Kafka concepts and architecture

Questions

  • What is Kafka’s architecture?
  • Is Kafka push or pull, and what is the difference between push and pull?
  • How does Kafka broadcast messages?
  • Are Kafka’s messages orderly?
  • Does Kafka support read/write separation?
  • How does Kafka ensure high availability of data?
  • What does ZooKeeper do in Kafka?
  • Are transactions supported?
  • Can the partition number be reduced?

General concepts in Kafka architecture:

  • Producer: The party that sends messages. The producer is responsible for creating the message and then sending it to Kafka.
  • Consumer: The party that receives messages. Consumers connect to Kafka and receive messages for business logic processing.
  • Consumer Group: A Consumer Group can contain one or more consumers. Using multiple partitions with multiple consumers can greatly improve downstream data-processing speed. Consumers in the same consumer group will not consume a message more than once, and consumers in different consumer groups do not affect each other. Kafka implements both point-to-point (P2P) and broadcast messaging through consumer groups.
  • Broker: Service Broker node. A Broker is a service node for Kafka.
  • Topic: Messages in Kafka are segmented by Topic, with producers sending messages to specific topics and consumers subscribing to and consuming the messages for the topics.
  • Partition: A Topic is a logical concept that can be subdivided into multiple partitions, each belonging to a single Topic. Different partitions under the same Topic contain different messages. At the storage level, a partition can be regarded as an append-only log file; a message is assigned a specific offset when it is appended to the partition’s log.
  • Offset: The Offset is the unique identifier of the message in the partition. Kafka uses it to ensure that the message is ordered within the partition, but the Offset does not span partitions. That is, Kafka guarantees partition order rather than topic order.
  • Replica: Replicas are how Kafka ensures high availability of data. The same partition can have multiple replicas across multiple brokers, and normally only the leader replica serves reads and writes. When the leader fails, a new leader replica is elected under the Controller’s management to continue serving reads and writes.
  • Record: A message Record that is actually written to Kafka and can be read. Each record contains a key, value, and timestamp.

Kafka Topic Partitions Layout

Kafka divides topics into partitions that can be read and written concurrently.

Kafka Consumer Offset

ZooKeeper

  • Broker registration: Brokers are distributed and independent, and Zookeeper manages all Broker nodes registered to the cluster.
  • Topic registration: In Kafka, messages of the same Topic are divided into partitions and distributed across multiple brokers; these partitions and their replica assignments are maintained by ZooKeeper.
  • Producer load balancing: Since the same Topic message can be partitioned and distributed across multiple brokers, producers need to properly distribute messages to these distributed brokers.
  • Consumer load balancing: Similar to producers, consumers in Kafka need to perform load balancing so that multiple consumers can reasonably receive messages from the corresponding Broker server. Each consumer group contains several consumers, and each message is sent to only one consumer in the group. Different consumers are grouped to consume messages under their own specific topics without interfering with each other.

Answer key

  • What is Kafka’s architecture?

    Producer, Consumer, Consumer Group, Topic, Partition

  • Is Kafka push or pull, and what is the difference between push and pull?

    The Kafka Producer pushes messages to brokers, while the Consumer pulls them. Pull mode lets the consumer manage its own offsets and batch its fetches, which improves read performance.

  • How does Kafka broadcast messages?

    Consumer group

  • Are Kafka’s messages orderly?

    Unordered at the Topic level, ordered within a Partition

  • Does Kafka support read/write separation?

    No. Only the Leader provides read and write services

  • How does Kafka ensure high availability of data?

    Replicas, ACK, HW

  • What does ZooKeeper do in Kafka?

    Cluster management, metadata management

  • Are transactions supported?

    Transactions are supported since version 0.11, which makes “exactly once” achievable (see the sketch after this list)

  • Can the partition number be reduced?

    No, data will be lost
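
A minimal sketch of the transactional producer API (available since 0.11) with the Java client; the bootstrap address, topic names, and transactional.id are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // A stable transactional.id enables the transactional API (Kafka >= 0.11)
        props.put("transactional.id", "demo-tx-1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // Both sends commit or abort atomically
            producer.send(new ProducerRecord<>("topic-a", "k1", "v1"));
            producer.send(new ProducerRecord<>("topic-b", "k2", "v2"));
            producer.commitTransaction();
        } catch (Exception e) {
            producer.abortTransaction();
        } finally {
            producer.close();
        }
    }
}
```

Messages from both send calls become visible to read_committed consumers only after commitTransaction succeeds.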

Using Kafka

Questions

  • What command-line tools does Kafka have? Which ones have you used?
  • Kafka Producer’s execution process?
  • What are the common configurations of Kafka Producer?
  • How do I keep Kafka messages in order?
  • How does the Producer ensure that data is sent without loss?
  • How to improve the performance of Producer?
  • What does Kafka do if the number of consumers in a group is greater than the number of parts?
  • Is Kafka Consumer thread safe?
  • Tell me about your thread model for consuming messages using Kafka Consumer. Why is it designed this way?
  • Common configuration of Kafka Consumer?
  • When will consumers be kicked out of the cluster?
  • How does Kafka react when a Consumer joins or leaves?
  • What is Rebalance and when does Rebalance occur?

Command-line tools

Kafka’s command-line tools live in the /bin directory of the Kafka distribution and mainly include service and cluster management scripts, configuration scripts, information-viewing scripts, Topic scripts, client scripts, and so on.

  • kafka-configs.sh: configuration management script
  • kafka-console-consumer.sh: Kafka consumer console
  • kafka-console-producer.sh: Kafka producer console
  • kafka-consumer-groups.sh: Kafka consumer group information
  • kafka-delete-records.sh: deletes records below the given offset (low watermark)
  • kafka-log-dirs.sh: Kafka message log directory information
  • kafka-mirror-maker.sh: Kafka cluster replication tool across data centers
  • kafka-preferred-replica-election.sh: triggers a preferred replica election
  • kafka-producer-perf-test.sh: Kafka producer performance test script
  • kafka-reassign-partitions.sh: partition reassignment script
  • kafka-replica-verification.sh: verifies replication progress
  • kafka-server-start.sh: starts the Kafka service
  • kafka-server-stop.sh: stops the Kafka service
  • kafka-topics.sh: Topic management script
  • kafka-verifiable-consumer.sh: verifiable Kafka consumer
  • kafka-verifiable-producer.sh: verifiable Kafka producer
  • zookeeper-server-start.sh: starts the ZooKeeper service
  • zookeeper-server-stop.sh: stops the ZooKeeper service
  • zookeeper-shell.sh: ZooKeeper client

The kafka-console-consumer.sh and kafka-console-producer.sh scripts can be used to test Kafka production and consumption; kafka-topics.sh views and manages topics in the cluster; kafka-consumer-groups.sh is usually used to view Kafka consumer groups.

Kafka Producer

The normal production logic of a Kafka Producer consists of the following steps (a code sketch follows the list):

  1. Configure the producer client parameters and create the producer instance.
  2. Build the message to be sent.
  3. Send a message.
  4. Close the producer instance.
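
A minimal sketch of these four steps with the Kafka Java client; the bootstrap address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerDemo {
    public static void main(String[] args) {
        // 1. Configure the client parameters and create the producer instance
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for in-sync replicas, trading latency for durability
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // 2. Build the message to be sent
        ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "key", "value");

        // 3. Send it (asynchronously; the callback reports the assigned partition/offset or an error)
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                exception.printStackTrace();
            } else {
                System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
            }
        });

        // 4. Close the producer instance (flushes buffered messages)
        producer.close();
    }
}
```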

The Producer’s message-sending flow is shown below: messages pass through interceptors, serializers, and partitioners, and are then sent in batches to the Broker by the accumulator.

Kafka Producer requires the following parameters:

  • bootstrap.servers: the Kafka Broker address(es) to connect to
  • key.serializer: serializer for message keys
  • value.serializer: serializer for message values

Common parameters:

  • batch.num.messages

    Default value: 200. The number of messages sent in one batch; only applies to async mode.

  • request.required.acks

    Default value: 0. 0 means the producer does not wait for acknowledgment from the leader; 1 means the leader acknowledges immediately after writing to its local log; -1 means acknowledgment comes only after all replicas are synchronized. Tuning this parameter is a trade-off between data loss and send efficiency: if the scenario is insensitive to data loss and cares about throughput, setting it to 0 greatly improves the producer’s send efficiency.

  • request.timeout.ms

    Default value: 10000. Acknowledgment timeout in milliseconds.

  • partitioner.class

    Default value: kafka.producer.DefaultPartitioner, which partitions by the message key. A custom class must implement kafka.producer.Partitioner. Sometimes we need messages of the same type to be processed in order, so we customize the allocation strategy to route data of the same type to the same partition (see the sketch after this parameter list).

  • producer.type

    Default value: sync; specifies whether messages are sent synchronously or asynchronously. Asynchronous (async) sending uses kafka.producer.AsyncProducer for batching; synchronous (sync) sending uses kafka.producer.SyncProducer. The choice also affects message production efficiency.

  • compression.topic

    Default value: none, i.e. no message compression. Other options are “gzip”, “snappy”, and “lz4”. Message compression can greatly reduce network traffic and network I/O, improving overall performance.

  • compressed.topics

    Default: null. When compression is enabled, this restricts compression to the specified topics; if empty, all topics are compressed.

  • message.send.max.retries

    Default value: 3. Maximum number of attempts to send messages.

  • retry.backoff.ms

    Default: 300. The back-off interval added before each retry.

  • topic.metadata.refresh.interval.ms

    Default value: 600000. Interval for periodically refreshing metadata. The producer also fetches metadata proactively when a partition is lost or its leader is unavailable. If the value is 0, metadata is fetched after every message is sent (not recommended). If negative, metadata is fetched only on failure.

  • queue.buffering.max.ms

    Default value: 5000. Maximum time data may be buffered in the producer queue; only applies to async mode.

  • queue.buffering.max.message

    Default value: 10000. Maximum number of messages buffered by the producer; only applies to async mode.

  • queue.enqueue.timeout.ms

    Default value: -1. Behavior when the queue is full: 0 means discard immediately, a negative value means block until space is available, a positive value means block for that many milliseconds; only applies to async mode.
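
As mentioned under partitioner.class, ordering requirements are usually met with a custom partitioner. That parameter refers to the legacy Scala producer interface; the sketch below instead uses the current Java client’s org.apache.kafka.clients.producer.Partitioner interface, keeping all messages with the same key in one partition (the class name is illustrative):

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Routes all messages with the same key to the same partition so they are
// consumed in order; keyless messages go to partition 0 (an arbitrary choice).
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}
```

It would be registered on the producer with props.put("partitioner.class", KeyHashPartitioner.class.getName()).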

Kafka Consumer

Kafka has the concept of consumer groups. Each consumer consumes only the messages of its assigned partitions, and each partition can be consumed by only one consumer within a consumer group. Therefore, if the number of consumers in a consumer group exceeds the number of partitions, some consumers will be assigned no partition at all. The relationship between consumer groups and consumers is shown in the figure below:

Consuming messages with the Kafka Consumer client usually involves the following steps (a code sketch follows the list):

  1. Configure the client and create the consumer
  2. Subscribe to the topic
  3. Pull the message and consume it
  4. Commit the consumption offset
  5. Close the consumer instance
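
A minimal sketch of these five steps with the Java client; the address, group id, and topic name are placeholders, and the poll loop is bounded so the example terminates:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerDemo {
    public static void main(String[] args) {
        // 1. Configure the client and create the consumer
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit offsets manually
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        try {
            // 2. Subscribe to the topic
            consumer.subscribe(Collections.singletonList("demo-topic"));
            for (int i = 0; i < 10; i++) { // bounded loop so the sketch terminates
                // 3. Pull messages and consume them
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // 4. Commit the consumption offset
                consumer.commitSync();
            }
        } finally {
            // 5. Close the consumer instance
            consumer.close();
        }
    }
}
```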

Because Kafka’s Consumer client is not thread-safe, to ensure thread safety and improve consumption performance we can use a Reactor-like thread model on the Consumer side: one thread pulls the data while a pool of threads processes it.
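
A simplified sketch of that model, assuming a single poll thread dispatching to a fixed worker pool; the names and pool size are illustrative, and the offset-commit coordination needed to avoid loss or reprocessing when handing records across threads is omitted for brevity:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Single poll thread + worker pool: the (thread-unsafe) KafkaConsumer is only
// ever touched by one thread, while message processing runs in parallel.
public class PollThenProcessDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        ExecutorService workers = Executors.newFixedThreadPool(8); // illustrative size

        consumer.subscribe(Collections.singletonList("demo-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                workers.submit(() -> handle(record)); // processing happens off the poll thread
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        System.out.println("processing " + record.value());
    }
}
```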

Kafka consumer parameters

  • bootstrap.servers: broker addresses to connect to, in host:port format.
  • group.id: the consumer group the consumer belongs to.
  • key.deserializer: the counterpart of the producer’s key.serializer; deserializes the key.
  • value.deserializer: the counterpart of the producer’s value.serializer; deserializes the value.
  • session.timeout.ms: how long before the coordinator considers a consumer failed, 10s by default. This is the interval within which the consumer group detects a crashed member, similar to a heartbeat expiration time.
  • auto.offset.reset: what the consumer should do when it reads a partition with no committed offset or an invalid one (e.g. the consumer was gone so long that its committed offset was deleted). The default, latest, starts from the newest records (those produced after the consumer started); the alternative, earliest, starts from the beginning of the partition when the offset is invalid.
  • enable.auto.commit: whether offsets are committed automatically. If false, offsets must be committed manually in the program; for exactly-once semantics, manual commits are preferred.
  • fetch.max.bytes: maximum number of bytes fetched in a single request.
  • max.poll.records: maximum number of messages returned by a single poll call; it can be increased if the processing logic is light, but the returned records must be processed within session.timeout.ms. The default is 500.
  • request.timeout.ms: maximum time to wait for a response to a request; on timeout, Kafka either resends the request or marks it as failed once retries are exhausted.

Kafka Rebalance

Rebalance is essentially a protocol that defines how all consumers in a consumer group agree on allocating the partitions of the topics they subscribe to. For example, a group with 20 consumers subscribes to a topic with 100 partitions; normally Kafka allocates an average of five partitions per consumer. This allocation process is called rebalance.
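
On the client side, the Java consumer exposes rebalance events through ConsumerRebalanceListener; a minimal sketch that commits offsets before partitions are taken away (the topic name is a placeholder):

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Hooks that fire around a rebalance; committing on revocation avoids
// re-processing after the partitions move to another consumer.
public class RebalanceAwareSubscribe {
    public static void subscribe(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("demo-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                consumer.commitSync(); // commit processed offsets before losing the partitions
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println("assigned: " + partitions);
            }
        });
    }
}
```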

When does rebalance happen?

This is a question that is often asked. Rebalance can be triggered in three ways:

  • Group membership changes (a new consumer joins the group, an existing consumer voluntarily leaves the group, or an existing consumer crashes — the difference will be discussed later)
  • The number of subscribed topics changed
  • The number of partitions subscribed to the topic changed

How are partitions allocated within a group?

Kafka provides two assignment strategies by default: range and round-robin. The assignment strategy is also pluggable, so you can implement your own assignor for a different strategy (see the sketch below).
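
Switching the strategy is a single consumer setting; a small sketch with the Java client, where a custom assignor would be registered the same way by class name:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;

public class AssignorConfigDemo {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Switch from the default range assignor to round-robin; a custom
        // strategy would implement the same pluggable assignor interface.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  RoundRobinAssignor.class.getName());
        return props;
    }
}
```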

Answer key

  • What command-line tools does Kafka have? Which ones have you used? The scripts in the /bin directory: managing the Kafka cluster, managing topics, producing and consuming messages
  • Kafka Producer’s execution process? Interceptors, serializers, partitioners, and the accumulator
  • What are the common configurations of Kafka Producer? Broker addresses, ACK parameters, network and send parameters, compression parameters
  • How do I keep Kafka messages in order? Kafka itself is unordered at the Topic level and ordered only within a partition, so use a custom partitioner to send data that must be processed in order to the same partition
  • How does the Producer ensure that data is sent without loss? Ack mechanism, retry mechanism
  • How to improve the performance of Producer? Batch, asynchronous, compression
  • What does Kafka do if the number of consumers in a group is greater than the number of partitions? The extra consumers sit idle and consume no data
  • Is Kafka Consumer thread safe? Not thread-safe; consume with a single thread and process with multiple threads
  • Tell me about your thread model for consuming messages using Kafka Consumer. Why is it designed this way? Separation of pulling and processing
  • Common configuration of Kafka Consumer? Broker, network and pull parameters, heartbeat parameters
  • When will consumers be kicked out of the cluster? Network failure, processing takes too long, offset commit times out
  • How does Kafka react when a Consumer joins or leaves? It triggers a Rebalance
  • What is Rebalance and when does Rebalance occur? Topic changes and consumer group membership changes

High availability and performance

Questions

  • How does Kafka ensure high availability?
  • Kafka delivery semantics?
  • Replica’s role?
  • What are AR and ISR?
  • What are Leader and Follower?
  • What do HW, LEO, LSO, LW, etc. stand for in Kafka?
  • What does Kafka do to ensure superior performance?

Partitions and replicas

In distributed data systems, partitioning is usually used to improve the system’s processing capacity, and replicas are used to ensure high availability of data. Multiple partitions mean the ability to process concurrently. Of a partition’s replicas, only one is the leader and the rest are followers, and only the leader replica serves external requests. Follower replicas are usually stored on different brokers from the leader. This is the high-availability mechanism: when one machine dies, a follower replica elsewhere can quickly be promoted and start providing service.

Why don’t follower replicas provide read services?

This is essentially a trade-off between performance and consistency. Imagine if follower replicas also served reads. Performance would certainly improve, but a series of problems would arise, similar to phantom reads and dirty reads in database transactions. For example, suppose you write a message to Kafka topic A; consumer B reads from topic A and finds nothing to consume, because the latest message has not yet reached the partition replica that B reads from, while another consumer C, reading the leader replica, can already consume it. Kafka uses HW and offset management to determine which data the Consumer can read and which data has already been committed.

Only the Leader serves external reads and writes, so how is the Leader elected?

Kafka keeps replicas that are in sync with the leader replica in the ISR replica set. The leader replica is always in the ISR, and in some special cases the ISR may contain only the leader. When the leader fails, Kafka detects this through ZooKeeper and elects a new leader from the ISR to serve requests. But one problem remains: as mentioned above, the ISR may contain only the failed leader, leaving it effectively empty. What then? If the unclean.leader.election.enable parameter is set to true, Kafka will elect a new leader from the out-of-sync replicas, i.e. those not in the ISR.

Replicas introduce the problem of replica synchronization

Kafka maintains the set of in-sync replicas (ISR) within the set of all assigned replicas (AR). When the Producer sends a message to the Broker, the ack configuration determines how many replicas must be synchronized before the write is acknowledged. A ReplicaManager service inside the Broker manages data synchronization between the followers and the leader.

Performance optimization

  • Partition concurrency
  • Sequential disk reads and writes
  • Page cache: reads and writes by page
  • Read-ahead: Kafka reads messages to be consumed into memory in advance
  • High-performance (binary) serialization
  • Memory mapping
  • Lock-free offset management: improves concurrency
  • Java NIO model
  • Batching: reads and writes in batches
  • Compression: message and storage compression reduce network and I/O overhead

Partition concurrency

On the one hand, since different partitions can be located on different machines, the cluster can be fully exploited for parallel processing across machines. On the other hand, a Partition corresponds physically to a folder, so even when multiple partitions sit on the same node, they can be configured to live on different disk drives, achieving parallelism across disks and making full use of multiple disks.

Sequential reads and writes

Within each partition, Kafka splits the data evenly into equally sized data files (500 megabytes per file by default in this description; the size is configurable). Each data file is called a segment, and data is appended to the current segment.

Answer key

  • How does Kafka ensure high availability?

    High availability of data is ensured through replicas, producer acks and retries, automatic Leader election, and Consumer rebalancing

  • Kafka delivery semantics?

    Delivery semantics generally include at least once, at most once, and exactly once. Kafka implements the first two through its ack configuration; exactly once relies on the idempotence and transaction features added in 0.11 (see above).

  • Replica’s role?

    High availability of data

  • What are AR and ISR?

    AR: Assigned Replicas, the set of replicas allocated when the partition is created (after the topic is created); the number of replicas is determined by the replication factor. ISR: In-Sync Replicas, a particularly important concept in Kafka, referring to the subset of AR that is kept in sync with the Leader. A replica in AR may not be in the ISR, but the Leader replica is always included. Another common interview question about the ISR is how it is decided whether a replica belongs in it: the current criterion is whether the time by which the follower’s LEO lags behind the leader’s LEO exceeds the broker parameter replica.lag.time.max.ms; if it does, the replica is removed from the ISR.

  • What are Leader and Follower?

    Of a partition’s replicas, the Leader serves reads and writes, while Followers only replicate data from the Leader and stand by to take over on failure.

  • What do HW, LEO, LSO, LW, etc. stand for in Kafka?

    HW is the high watermark, an important value that controls the range of messages consumers can read: an ordinary consumer can only “see” messages on the Leader replica between the Log Start Offset and the HW (exclusive); messages above the watermark are invisible to consumers. LEO is the Log End Offset, the offset of the next message to be written to a replica. LSO is the Log Stable Offset, the first unstable (uncommitted) transactional offset, which bounds what a read_committed consumer can see. LW is the Low Watermark, the smallest log start offset among the replicas.

  • What does Kafka do to ensure superior performance?

    Partition concurrency, sequential disk reads and writes, page cache, compression, high-performance (binary) serialization, memory mapping, lock-free offset management, and the Java NIO model

This article does not go deep into Kafka’s implementation details or source code, but Kafka is indeed an excellent open source system. Much of its elegant architecture and source code design is worth learning from, and interested readers are encouraged to study this open source system in depth; it will greatly help your architecture design, coding, and performance tuning skills.

Recommended reading

The following articles have been well received by readers and are recommended reading:

  • Overview of database system design
  • Software architecture patterns that must be understood
  • The principles of Tomcat architecture are analyzed for reference in architecture design
  • Tomcat high concurrency principle disassembly and performance tuning