

The basic concepts and workflows of Kafka were introduced above. Understanding Kafka's inner workings is not strictly necessary for developing Kafka applications or simply running Kafka in production, but it does help you understand Kafka's behavior and diagnose problems quickly.

1. Kafka cluster membership

Kafka runs on top of ZooKeeper. A Kafka cluster consists of multiple brokers that coordinate many producers and consumers, and ZooKeeper is what maintains the relationships within the cluster. The storage structure Kafka uses in ZooKeeper is as follows:

A Kafka cluster contains multiple brokers. Each broker has a broker.id, a unique identifier that distinguishes it from the others; it can be specified manually in the configuration file or generated automatically.

Kafka generates new broker IDs through the combination of broker.id.generation.enable and reserved.broker.max.id. The broker.id.generation.enable parameter controls whether automatic broker.id generation is enabled; it defaults to true, i.e. the feature is on. reserved.broker.max.id defaults to 1000, which means automatically generated broker.ids start at 1001.

At startup, Kafka registers a temporary node whose name is the current broker's ID under the /brokers/ids path in ZooKeeper. Kafka's health checks depend on this node. Components watching the broker list are notified when a broker joins or leaves the cluster.

  • If you try to start another broker with the same ID, you will get an error — the new broker will try to register, but will not succeed because there is already a broker with the same ID in ZooKeeper.
  • In the event of broker downtime, a network partition, or a long garbage-collection pause, the broker is disconnected from ZooKeeper and the temporary node created at startup is removed. Kafka components watching the broker list are notified that the broker has been removed.
  • When a broker is shut down, its node also disappears, but its ID continues to exist in other data structures, such as a topic's replica list, which we will discuss below. If you start a new broker with the same ID after the old one has shut down completely, it immediately joins the cluster and takes over the same partitions and topics as the old broker.
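
The broker IDs currently registered in the cluster can be observed from a client. Below is a minimal sketch using Kafka's Java AdminClient; the bootstrap address is an assumed placeholder for illustration, not something prescribed by the text above.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

import java.util.Properties;

public class ClusterMembers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // bootstrap address is an assumption for illustration
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // brokers currently registered in the cluster (i.e. under /brokers/ids)
            for (Node node : cluster.nodes().get()) {
                System.out.printf("broker.id=%d host=%s:%d%n", node.id(), node.host(), node.port());
            }
            // the broker currently acting as controller
            System.out.println("controller broker.id=" + cluster.controller().get().id());
        }
    }
}
```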

2. Broker Controller

Among the brokers of a Kafka cluster there is also a Controller component, which is a core component of Kafka. Any broker in the cluster can act as the Controller, but after the cluster starts, only one broker becomes the Controller.

ZooKeeper comes up repeatedly here; for background, see the author's article: The Path to Java Engineers' Advancement: ZooKeeper.

2.1. Functions of Controller

The Kafka Controller is designed as a multithreaded controller that simulates a state machine. It does several things:

  1. The controller acts as the broker controller of the cluster and manages the brokers in the cluster.
  2. The controller monitors all brokers, watching them come online and go offline.
  3. The controller elects a new partition Leader after a broker goes down.
  4. The controller notifies the relevant brokers of the newly elected Leader.

Further segmentation can be specifically divided into the following five points:

  • Topic management: the Kafka Controller handles creating and deleting Kafka topics and adding partitions to them; in short, it has ultimate control over partitions. In other words, when we execute the kafka-topics script, most of the background work is done by the controller (see the sketch after this list).
  • Partition reassignment: partition reassignment mainly refers to the fine-grained reassignment of existing topic partitions provided by the kafka-reassign-partitions script. This functionality is also implemented by the controller.
  • Preferred Leader election: Preferred Leader election is a leader-election scheme provided by Kafka to avoid overloading some brokers.
  • Cluster member management: Manages new brokers, broker shutdowns, and broker outages.
  • Data services: The last big category of controller work is to provide data services to other brokers. The most complete cluster metadata information is stored on the controller, and all brokers periodically receive metadata update requests from the controller to update their cached data in memory.
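
As a concrete example of the topic-management work that ends up being carried out by the controller, here is a minimal AdminClient sketch that creates a topic and later adds partitions to it. The topic name, partition counts, and bootstrap address are illustrative assumptions, not values from the text.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class TopicAdminDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // create a topic with 3 partitions and replication factor 2;
            // the controller performs the actual creation work in the background
            admin.createTopics(Collections.singleton(new NewTopic("demo-topic", 3, (short) 2)))
                 .all().get();

            // later, grow the topic to 6 partitions; this, too, is coordinated by the controller
            admin.createPartitions(Collections.singletonMap("demo-topic", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}
```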

When the controller detects that a broker has left the cluster (by observing the relevant ZooKeeper path), it knows that the partitions led by this broker need a new Leader. The controller iterates over each affected partition to determine who can serve as the new Leader, then sends a message to all brokers that contain the new Leader or the existing followers. The request message tells them who the new Leader is and who the Followers are. The new Leader then starts processing requests from producers and consumers, and the Followers start replicating from the new Leader.

When the controller discovers that a broker has joined the cluster, it uses the broker ID to check whether the new broker holds replicas of existing partitions. If it does, the controller sends messages to the new broker and to the existing brokers.

2.2. Election of Controller

Kafka’s current election controller rules are:

  1. The first broker to start in a Kafka cluster creates a temporary node named /controller in ZooKeeper and makes itself the controller.
  2. Other brokers also try to create this node at startup, but since the node already exists, their attempts to create /controller fail with a node-already-exists exception.
  3. Those brokers then register a ZooKeeper Watch object on the /controller node, so that when the node changes they are notified of the change.

This ensures that only one controller exists at a time. But relying on a single node introduces a problem: a single point of failure.

If the controller shuts down or disconnects from ZooKeeper, its temporary node in ZooKeeper disappears. Once the other nodes in the cluster are notified through the Watch object that the controller has gone offline, they attempt to make themselves the new controller. The creation rule is the same as for the first node: the first broker that successfully creates the /controller node in ZooKeeper becomes the new controller; the other nodes receive the node-already-exists exception and register a Watch object on the new controller node to listen for changes.
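
To make the election mechanics concrete, here is a minimal sketch, written against the ZooKeeper Java client directly, of how a process might contend for an ephemeral /controller node and watch it when it loses. This illustrates the pattern only; it is not Kafka's actual controller code, and the connection string and broker ID are assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 6000, event -> { }); // assumed address
        tryToBecomeController(zk, "1"); // "1" stands in for this broker's id
    }

    static void tryToBecomeController(ZooKeeper zk, String brokerId) throws Exception {
        try {
            // ephemeral node: disappears automatically if our session dies
            zk.create("/controller", brokerId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("broker " + brokerId + " is now the controller");
        } catch (KeeperException.NodeExistsException e) {
            // someone else won the election; watch the node and retry when it disappears
            zk.exists("/controller", event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    try {
                        tryToBecomeController(zk, brokerId);
                    } catch (Exception ignored) { }
                }
            });
        }
    }
}
```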

2.3. Data storage of Controller

As described above, the Broker Controller provides a data service for storing large amounts of Kafka cluster data. The diagram below:

The information saved above can be classified into three categories:

  • All broker information: all partitions on the broker, the broker's replicas of all partitions, which brokers are currently running and which are shutting down.
  • All topic information: including specific partition information, such as which replica is the leader, which replicas are in the ISR set, and so on.
  • All partitions involved in O&M tasks: including the list of partitions for which a Preferred Leader election or partition reassignment is currently in progress.

Kafka is inseparable from ZooKeeper, so a copy of this data is stored in ZooKeeper. Each time the controller initializes, it reads the corresponding metadata from ZooKeeper and populates it into its cache.

2.4. Failover of Controller

As described earlier, the first broker to create the temporary /controller node in ZooKeeper becomes the Broker Controller. In other words, there is only one Broker Controller, so there is a single point of failure. Kafka provides a failover mechanism, also called Fail Over, for this purpose. The diagram below:

Broker1 registers as the controller first. Later, Broker1 goes offline because of network jitter or some other reason, and ZooKeeper detects this through the Watch mechanism and removes its temporary node. In the new round of registration, Broker3 registers first, so the controller information stored in ZooKeeper changes from Broker1 to Broker3. Broker3 then reads the metadata from ZooKeeper and initializes it into its own cache.

Note: ZooKeeper does not store cached information; broker stores cached information.

2.5. Design principle of Controller

Prior to Kafka version 0.11, controller design was fairly cumbersome. Kafka Controller is designed as a multithreaded controller that simulates a state machine. There are some problems with this design:

  1. Changes to controller state are performed concurrently by different listeners, so they require complex synchronization and are error-prone and difficult to debug.
  2. State propagation is not synchronous; the broker may hold multiple states at uncertain times, resulting in unnecessary extra data loss.
  3. The controller also creates additional I/O threads for topic deletion, resulting in performance loss.
  4. The controller’s multithreaded design also allows access to shared data. As we know, multi-threaded access to shared data is the most troublesome part of thread synchronization. To protect data security, the controller has to make extensive use of ReentrantLock synchronization mechanism in its code, which further slows down the processing speed of the entire controller.

After Kafka 0.11, Kafka Controller adopted a new design that changed the multi-threaded scheme to a single-thread plus event queue scheme. As shown below:

The main changes are as follows:

  • Added an Event Executor Thread. As you can see from the figure, both the Event Queue and the Controller Context are handled by this event thread. All operations previously performed by separate listeners are modeled as individual events that are sent to a dedicated event queue and consumed by this single thread (a minimal sketch follows this list).
  • All ZooKeeper synchronization operations are changed to asynchronous operations. The ZooKeeper API provides two read and write modes: synchronous and asynchronous. Previously, the controller operated ZooKeeper in synchronous mode. This time, the synchronous mode was changed to asynchronous mode. According to the test, the efficiency was improved by 10 times.
  • Previously, the broker treated all requests sent by the Controller equally, i.e. fairly. Isn't fairness good? In some cases, yes. But consider this: the broker has a backlog of produce requests for a partition, and the Controller issues a StopReplica request for that same partition. Should the broker still finish processing those produce requests, whose results will be discarded anyway? The most reasonable ordering is to give the StopReplica request a higher priority so that it is processed preemptively.
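
Below is a minimal sketch of the single-thread-plus-event-queue pattern described above, in plain Java. The event type and class names are hypothetical; the point is only to illustrate why per-state locks become unnecessary once a single thread consumes all events.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// hypothetical event abstraction: each controller action becomes one event
interface ControllerEvent {
    void process();
}

public class ControllerEventLoop {
    private final BlockingQueue<ControllerEvent> eventQueue = new LinkedBlockingQueue<>();

    // listeners (ZooKeeper callbacks, admin requests, ...) only enqueue events
    public void enqueue(ControllerEvent event) {
        eventQueue.offer(event);
    }

    // a single dedicated thread consumes the queue, so controller state
    // is only ever touched from one thread and needs no ReentrantLock
    public void start() {
        Thread eventThread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    eventQueue.take().process();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "controller-event-thread");
        eventThread.setDaemon(true);
        eventThread.start();
    }
}
```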

3. Kafka’s replica mechanism

Replication is a core feature of the Kafka architecture. In the Kafka documentation, Kafka describes itself as a distributed, partitioned, replicated commit log service. Replication is critical because persistent storage of messages matters: it ensures that Kafka remains highly available even after the primary node goes down. Replication generally means that a distributed system keeps the same backup/copy of data on multiple machines that interact with each other.

Kafka uses topics to organize data, and each topic is divided into partitions. Partitions can be deployed on one or more brokers. Each partition can have multiple replicas, and replicas are likewise stored on brokers; each broker may hold thousands of replicas. Here is a schematic of replication:

As shown in the figure above, for simplicity I draw only two brokers, each holding partitions of topic A. In Broker1, partition 0 is the Leader, and its data is replicated to the partition 0 replica of topic A on Broker2. The same is true for partition 1 of topic A.

3.1. Leader replica

There are two types of replicas: Leader replicas and Follower replicas.

Kafka elects a replica whenever a partition is created, and this replica is the Leader replica.

3.2. Follower replicas

All replicas other than the Leader replica are collectively referred to as Follower replicas. Followers do not serve external requests. Here is how Leader and Follower replicas work:

Note the following points:

  • In Kafka, Follower replicas do not serve requests. That is, no Follower replica can respond to consumer or producer requests; all requests are handled by the Leader replica. In other words, all requests must be sent to the broker where the Leader replica resides. The Follower replica only pulls data from the Leader asynchronously and writes it to its own commit log to stay synchronized with the Leader.

  • Using the watch mechanism provided by ZooKeeper, Kafka senses in real time when the broker holding the Leader replica goes down and starts a new election, choosing one of the Followers as the new Leader. When the downed broker finishes restarting, its partition replicas rejoin the partition as Followers.

3.3. How Follower replicas synchronize with the Leader

Another task of the Leader is to figure out which Followers are in the same state as itself. To stay consistent with the Leader, Followers try to copy new messages from the Leader as soon as they arrive. To keep pace with the Leader, Followers send the Leader fetch requests, the same kind of requests consumers send to read messages.

A Follower requests messages from the Leader like this: it first requests message 1, then, after receiving message 1, sends the request for message 2, and so on. The Follower does not send the next request until it has received the Leader's response to the previous one. The process is as follows:

It is important that a Follower replica does not send the next request until it has received the response. By looking at the latest offset requested by each Follower, the Leader knows how far each Follower's replication has progressed. If a Follower has not requested any messages for 10 seconds, or has kept requesting but has not caught up with the latest message within 10 seconds, it is considered out of sync. A replica that is not synchronized with the Leader cannot be elected as the Leader after the Leader goes offline, because its messages are incomplete.

Conversely, if a Follower's synchronized messages match the Leader replica, the Follower is called an in-sync replica. In other words, if the Leader goes offline, only an in-sync replica can be elected as the new Leader.

What are the benefits of keeping replicas in sync?

  • Written messages can be seen immediately: after using the producer API to successfully write a message to a partition, a consumer can immediately read the message that was just written;
  • Message consistency for consumers: a message produced by the producer is visible every time a consumer reads it; there will not be a case where the message sometimes appears not to exist.

3.4. Synchronous and Asynchronous Replication

Since the Leader replica and the Follower replica interact in a send-and-wait fashion, which looks like synchronous replication, why do we say the Follower synchronizes with the Leader asynchronously?

After the Follower replica synchronizes with the Leader replica, the Follower saves the message to its local log and then sends a response to the Leader, telling the Leader that the save succeeded. With synchronous replication, the Leader waits until all Follower replicas have written successfully before returning a write-success message to the producer. With asynchronous replication, the Leader does not care whether the Follower replicas have written successfully; as soon as the Leader has saved the message to its local log, it returns a write-success message to the producer.

Synchronous replication:

  1. Producer notifies ZooKeeper to identify the leader.
  2. Producer writes messages to the leader;
  3. The leader writes the message to the local log after receiving it.
  4. Followers pull messages from leaders;
  5. The follower writes log locally;
  6. The follower sends a write success message to the leader;
  7. The leader receives write-success messages from all followers;
  8. The leader sends a write success message to the producer.

Asynchronous replication:

The difference from synchronous replication is that the leader directly sends a write success message to the client after writing to the local log, without waiting for all followers to complete replication.

3.5. In-sync Replicas (ISR)

Kafka dynamically maintains a set of in-sync Replicas (ISR).

The ISR is also a very important concept. As mentioned earlier, Follower replicas do not serve requests; they only pull data asynchronously from the Leader replica on a regular basis. This pull operation is essentially copying, like Ctrl-C + Ctrl-V. So is the number of messages in an ISR replica the same as in the Leader replica? Not necessarily; it depends on the value of the broker-side parameter replica.lag.time.max.ms, which means the maximum amount of time a Follower replica is allowed to lag behind the Leader replica.

The default value of replica.lag.time.max.ms is 10 seconds. If a Follower replica lags behind the Leader replica by less than 10 seconds, Kafka considers the Leader and Follower to be in sync, even if the Follower currently stores fewer messages than the Leader. A Follower replica that lags more than 10 seconds behind is removed from the ISR. If the replica later catches up with the Leader, it can be added back to the ISR. This shows that the ISR is a dynamically adjusted set rather than a static one.
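
The lag check can be pictured as a simple time comparison. The sketch below is hypothetical pseudologic in Java, not Kafka's source code; it only illustrates how replica.lag.time.max.ms might gate ISR membership.

```java
public class IsrCheckSketch {
    // default of replica.lag.time.max.ms
    static final long REPLICA_LAG_TIME_MAX_MS = 10_000L;

    // hypothetical view of a follower: when did it last fully catch up with the leader?
    static class FollowerState {
        long lastCaughtUpTimeMs;
    }

    // a follower stays in the ISR as long as it has fully caught up recently enough,
    // even if it currently holds fewer messages than the leader
    static boolean isInSync(FollowerState follower, long nowMs) {
        return nowMs - follower.lastCaughtUpTimeMs <= REPLICA_LAG_TIME_MAX_MS;
    }

    public static void main(String[] args) {
        FollowerState f = new FollowerState();
        f.lastCaughtUpTimeMs = System.currentTimeMillis() - 3_000; // caught up 3 seconds ago
        System.out.println(isInSync(f, System.currentTimeMillis())); // true
    }
}
```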

3.6. Unclean Leader Election

Since the ISR is dynamically adjusted, it can become empty. The Leader replica is always part of the ISR, so an empty ISR means the Leader replica has also gone down, and Kafka needs to elect a new Leader. Now think it through: the ISR contains exactly the replicas that are synchronized with the Leader, so any replica outside the ISR is not synchronized with the Leader, which means the Follower replicas not in the ISR list are missing some messages.

If the broker-side parameter unclean.leader.election.enable is turned on, the next Leader is elected from among these out-of-sync replicas. Such an election is called an Unclean Leader election.

If you have worked on distributed projects, you know the CAP theorem; Unclean Leader election effectively sacrifices data consistency to preserve Kafka's high availability. You can decide whether to enable Unclean Leader election based on your actual business scenario. It is generally not recommended, because data consistency is usually more important than availability.
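
If you do decide to change this setting for a specific topic, it can also be done through the AdminClient. A minimal sketch, assuming a topic named demo-topic and a local broker:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class UncleanElectionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // keep unclean leader election off: prefer consistency over availability
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(op)))
                 .all().get();
        }
    }
}
```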

4. Kafka request processing process

Most of a broker's work is processing requests sent to partition leaders by clients, partition replicas, and the controller. These requests follow a request/response pattern; I would guess HTTP is the first request/response protocol most people encounter. HTTP requests can in fact be synchronous or asynchronous. An ordinary HTTP request is synchronous: submit the request -> wait for the server to process it -> receive the response, and during this time the client browser cannot do anything else. The defining feature of the asynchronous approach is that the request is triggered by an event -> the server processes it (while the browser can still do other things) -> completion.

Note: We are using HTTP requests as an example, whereas Kafka uses TCP for socket-based communication.

For example, synchronous requests are processed sequentially, whereas asynchronous requests do not execute in a fixed order, because asynchronous handling creates multiple execution threads and each thread may run in a different order. So what are the downsides of each approach?

  • The biggest disadvantage of the synchronous approach is poor throughput and low resource utilization. Since requests can only be processed sequentially, each request must wait for the completion of the previous request to be processed. This approach only works on systems where requests are sent very infrequently.
  • The downside of the asynchronous approach is that creating threads for each request can be expensive, and in some cases can overwhelm the entire service.

4.1. The Reactor model

So is Kafka synchronous or asynchronous? Neither: Kafka uses the Reactor model.

So what is the Reactor model? To put it simply, the Reactor pattern is an event-driven architecture implementation that is particularly suited for scenarios where many clients send requests to the server simultaneously, as shown in the following figure:

Kafka brokers have a SocketServer component, which plays a role similar to a dispatcher. The SocketServer accepts client requests over TCP-based sockets. Each request message contains a header with the following information:

  • Request type — (API Key)
  • Request Version — The broker can process different versions of client requests and respond to them differently.
  • Correlation ID – a unique number that identifies the request message and also appears in the response message and error log (used to diagnose problems)
  • Client ID specifies the Client that sends the request

The broker runs an Acceptor thread on each port it listens on. This thread creates connections and hands them off to Processor (network) threads, whose number can be configured with num.network.threads. The default value is 3, which means each broker creates three network threads at startup to handle requests sent by clients.

The Acceptor thread distributes incoming requests to the network thread pool in a fair, round-robin fashion, so in practice each network thread has an equal chance of receiving requests. The network threads place requests into a pending request queue and later take response messages from their response queues and send them back to the clients. The request-and-response handling in the network thread pool is relatively involved; the following is the flow chart:

After the network thread pool receives a message from a client or another broker, it places the message into the request queue. Note that this is a shared request queue: because the network thread pool is multi-threaded, the messages in the request queue are shared by multiple threads. The messages are then processed by the IO thread pool. The processing depends on the type of message: a PRODUCE request has its messages written to the log, while a FETCH request has messages read from disk or the page cache. In other words, the IO thread pool is the component that actually judges and processes requests. When an IO thread is done, it decides whether the result goes into a response queue or into Purgatory (more on Purgatory later). Unlike the request queue, each network thread has its own response queue: since the Reactor model does not care which thread sends a response back, each thread returns its own responses, so the response queues do not need to be shared.

Note: the size of the IO thread pool can be configured with the broker-side parameter num.io.threads. The default is 8, which means each broker automatically creates 8 IO processing threads after starting.

4.2. Production request

How can a Producer write messages to Kafka without losing them? Through the ACK response mechanism! When a producer writes data, it can set a parameter that controls how Kafka confirms receipt of the data. This parameter can be set to 0, 1, or all.

Simply put, the different settings define write success differently. With acks = 0, the producer considers the write successful as soon as the message is sent, without waiting for any response from the broker. With acks = 1, the write is considered successful once the leader has received and written the message. With acks = all, the write is considered successful only after all in-sync replicas have received the message.

After a message is written to the partition Leader, if acks is configured as all, the request is held in the Purgatory buffer until the Leader replica observes that the Follower replicas have copied the message; only then is the response sent back to the client.
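
Here is a minimal producer sketch showing how acks is set on the client side; the topic name, bootstrap address, and record contents are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // write failed or timed out
                        } else {
                            System.out.println("written at offset " + metadata.offset());
                        }
                    });
            producer.flush();
        }
    }
}
```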

4.3. Fetch requests

The broker handles fetch requests in a manner similar to produce requests. The client sends a request asking the broker for messages at a specific offset in a topic partition. If the offset exists, Kafka uses zero-copy technology to send the messages to the client: they do not pass through any intermediate buffer, which gives better performance.

The client can set upper and lower limits on the data returned by a fetch request. The upper limit is the amount of memory the client has allocated for receiving messages; this limit matters because if it is too large, the client may run out of memory. The lower limit can be understood as waiting until enough data has accumulated before sending a response, which saves round trips at the cost of extra latency. As shown below:

As you can see in the figure, there is a waiting period between the fetch request and the response: the broker returns the response either when enough messages have accumulated or when a timeout expires. The maximum wait time can be set with the client parameter fetch.max.wait.ms.
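
On the consumer side these limits map to a handful of fetch parameters. A minimal sketch with illustrative values; the topic, group id, and bootstrap address are assumptions.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Collections;
import java.util.Properties;

public class FetchLimitsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // lower limit: wait until at least 16 KB has accumulated ...
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 16 * 1024);
        // ... but never wait longer than 500 ms before responding
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        // upper limit: cap the returned data so the client does not run out of memory
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            // consumer.poll(...) would now fetch under these limits
        }
    }
}
```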

4.4. Metadata request

Both produce requests and fetch requests must be sent to the broker holding the partition's Leader replica. If a broker receives a request for a specific partition whose Leader is on another broker, the client receives a "not the leader for this partition" error response. The same error occurs if a request for a partition is sent to a broker that does not contain its Leader. Kafka clients therefore need to make sure requests and responses go to the correct broker.

In practice the client uses a metadata request, which contains a list of the topics the client is interested in; the server's response indicates, for each topic partition, where the Leader replica and the Follower replicas reside. Metadata requests can be sent to any broker, because all brokers cache this information.

Typically, the client caches this information and sends produce and fetch requests directly to the target broker. The cache needs to be refreshed periodically (configured with the metadata.max.age.ms parameter) so the client knows whether the metadata has changed. For example, when a new broker is added, rebalancing moves some replicas to the new broker. If the client then receives a "not the leader" error, it refreshes the metadata cache before resending the request.
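
The refresh interval is an ordinary client setting. A minimal sketch with an illustrative five-minute value; the bootstrap address is an assumption.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class MetadataRefreshConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        // refresh cached topic metadata at least every 5 minutes,
        // even if no "not the leader" error has been seen
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 5 * 60 * 1000);
        // the same property name also exists in ConsumerConfig
        System.out.println(props);
    }
}
```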

5. Kafka rebalance process

A consumer group must have a group Coordinator, and the rebalancing process can be done with the help of a Coordinator.

Group Coordinator: the group coordinator is a broker that receives the heartbeats sent by all consumers in a consumer group. In the earliest versions, this metadata was stored in ZooKeeper, but it is now stored on the broker. Each consumer group synchronizes with its group coordinator. Since assignment decisions are made on the application (consumer) side, the group coordinator serves JoinGroup requests and provides metadata about the consumer group, such as assignments and offsets.

The group coordinator also tracks the heartbeats of all consumers. Another role in the consumer group is the leader, which must be distinguished from the Leader replica and from the Kafka Controller. The consumer leader is the role responsible for making decisions in the group. If consumers stop sending heartbeats, the group coordinator has the authority to kick them out of the group. Therefore, important activities of a consumer group are electing the leader and reading and writing metadata about assignments and partitions with the coordinator.

Consumer Leader: Each consumer group has a leader. If the consumer stops sending heartbeats, the coordinator triggers rebalancing.

The conditions under which rebalancing occurs:

  1. Changes to any topics that consumers subscribe to;
  2. Changes in the number of consumers;
  3. The number of partitions changes;
  4. If you subscribe to a topic that has not yet been created, rebalancing occurs when the topic is created. If you subscribe to a topic that gets deleted then it gets rebalanced;
  5. A consumer is considered DEAD by the group coordinator. This can happen because the consumer crashed or was unresponsive for a long time, meaning it did not send any heartbeat to the group coordinator within the configured time window; this also causes rebalancing to occur.
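
Applications usually observe these rebalances through a ConsumerRebalanceListener registered at subscription time. A minimal sketch, assuming a topic named demo-topic and a local broker:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class RebalanceListenerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("demo-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // called before a rebalance takes partitions away: a good place to commit offsets
                System.out.println("revoked: " + partitions);
                consumer.commitSync();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // called after the new assignment plan has been received
                System.out.println("assigned: " + partitions);
            }
        });

        while (true) {
            consumer.poll(Duration.ofMillis(500));
        }
    }
}
```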

5.1. Transition of rebalance state

Kafka designed a set of consumer group State machines to assist the coordinator in the rebalancing process. The consumer state machine has five main states: Empty, Dead, PreparingRebalance, CompletingRebalance, and Stable.

The states and their meanings are as follows:

  • Empty: there are no members in the group, but the consumer group may have committed offsets that have not yet expired.
  • Dead: there are no members in the group, and the group's metadata has been removed on the coordinator side. The coordinator keeps information about every group currently registered with it; this registration information is what is meant by metadata here.
  • PreparingRebalance: the consumer group is preparing to start rebalancing; at this point all members must re-request to join the consumer group.
  • CompletingRebalance: all members of the consumer group have joined, and each member is waiting for the assignment plan. In older versions this state was called AwaitingSync, which is equivalent to CompletingRebalance.
  • Stable: the stable state of the consumer group. This state indicates that rebalancing is complete and members of the group can consume data normally.

Now that we know what these states mean, the following paths show how the consumer group states transition:

The consumer group starts out in the Empty state. When rebalancing starts, it moves to PreparingRebalance and waits for new consumers to join. Once the members have joined, the consumer group enters the CompletingRebalance state and waits for the assignment plan. Whenever a new consumer joins or leaves the group, rebalancing is triggered and the group goes back to PreparingRebalance. Once the assignment has been distributed, the group becomes Stable; the flow chart looks like this:

Based on the diagram above, after the consumer groups have all reached a Stable state, as soon as new consumers join/leave/heartbeat expires, rebalancing is triggered and the consumer group is back in PreparingRebalance. So its flow chart looks like this:

Based on the diagram above: if, while the consumer group is in PreparingRebalance, all the consumers unfortunately leave, the group may still have committed offset data; once that offset data expires or is removed, the consumer group becomes Dead. Its flow chart looks like this:

Based on the above figures, we have analyzed the consumer group rebalancing paths. Whether the group is in PreparingRebalance, CompletingRebalance, or Stable, if the Leader of the topic partition that hosts the group's registration information changes, the group goes directly to Dead. All the paths are as follows:

Note: a log line such as "Removed XX expired offsets in XXX milliseconds" indicates that Kafka has most likely removed the group's offset data. Only groups in the Empty state perform this expired-offset deletion.

5.2. Rebalancing from the consumer's perspective

From the consumer's perspective, rebalancing has two steps: the consumer joins the group, then waits for the leader to distribute the assignment plan. These two steps correspond to the JoinGroup request and the SyncGroup request respectively.

When a new consumer joins the group, it sends a JoinGroup request to the coordinator. In this request, each member submits the topics it consumes, so that, as described in the group coordinator section above, the coordinator can gather enough metadata to select the leader of the consumer group. Normally, the first consumer to send the JoinGroup request automatically becomes the leader. The leader's task is to collect the subscription information of all members and, based on that information, produce a concrete partition-assignment plan. As shown in the figure:

After all the consumers have joined in and submitted the metadata information to the leader, the leader makes the allocation plan and sends the SyncGroup request to the coordinator, who is responsible for delivering the consumption strategy in the group. The following diagram illustrates the process of a SyncGroup request:

When all members have successfully received the allocation scheme, the consumer group enters the Stable state, that is, the normal consumption work begins.

5.3. Rebalancing from the coordinator's perspective

From the coordinator's perspective, rebalancing is triggered in the following scenarios:

  1. A new member joins the group
  2. A group member leaves voluntarily
  3. A group member crashes and leaves
  4. A group member commits offsets

5.3.1. A new member joins a group

The scenario discussed here is a consumer group that is already in the Stable state with an assignment in place. When a new member joins the group, the rebalancing process is as follows:

5.3.2. Group members leave voluntarily

When a group member leaves the consumer group, the consumer instance calls the close() method to actively notify the coordinator that it is leaving. This involves a new request type, the LeaveGroup request. As shown below:

5.3.3. Group member crashes and leaves

A group member crash means the consumer instance has failed seriously, gone down, or become unresponsive; if the coordinator cannot receive the consumer's heartbeat, the member is regarded as crashed. A crash is passive, so the coordinator usually needs to wait for a period of time before noticing it. This period is typically controlled by the consumer-side parameter session.timeout.ms. As shown below:
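
The relevant timeouts are ordinary consumer settings. A minimal sketch with commonly used values; exact defaults vary by Kafka version, so treat the numbers as illustrative.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class GroupTimeoutsConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // how long the coordinator waits without a heartbeat before declaring the member dead
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10_000);
        // how often the consumer's background thread sends heartbeats
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3_000);
        // maximum time between poll() calls before the member is kicked out of the group
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);
        System.out.println(props);
    }
}
```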

5.3.4. Group members commit offsets

After sending the JoinGroup request, the consumers in the group must commit their offsets within the specified time window, and then start sending the normal JoinGroup/SyncGroup requests.
