Preface

This article is both a warm-up for the source code and a summary of the series. So far we have covered the role of a cluster, cluster design, the network model, and producers and consumers. Here we fill in the last piece of the core that was only mentioned in passing, then tie the three concept articles together into one end-to-end flow. That should reinforce what you have learned and make the source-code reading that follows more relaxed and enjoyable. And yes, this is the last time I keep you waiting 🤣.

Previous links

Concept 1: Episode: Kafka in Plain English

Practical: Episode: Kafka’s cluster deployment practices and operations

Concept 2: Episode: Kafka producer principle and important parameter description

Concept 3: Episode: Analysis of Kafka’s producer case and consumer principle

Extra: Episode: Kafka source preheat – Java NIO

1. Finishing what the last Kafka article didn’t mention

In “Analysis of Kafka’s producer case and consumer principle” we mentioned that the LEO & HW mechanism sits at the core of Kafka.

1.1 How LEO & HW are updated

Suppose there are two brokers, that is, two servers, and partition P0 has two replicas: a leader on one broker and a follower on the other.

The producer sends data to the leader partition, and that data must eventually be written to disk. The follower then synchronizes the data from the leader, and the data on the follower is written to its disk as well.

However, since the follower has to fetch the data from the leader before it can write it to its own disk, the follower’s disk is bound to hold slightly less data than the leader’s.

1.1.1 What is LEO

The log end offset (LEO) is the offset of the next message to be written to the replica’s underlying log file. In the figure above, the leader’s LEO is 5 + 1 = 6 and the follower’s LEO is 5. Likewise, if I know a replica’s LEO is 10, I know its log file holds 10 messages covering the offset range [0, 9].

1.1.2 What is HW

The HW (high watermark) is the partition’s watermark and is never greater than the LEO. Consumers can only consume data before the HW.

1.1.3 Process Analysis

When a follower fetches data from the leader, the fetch request carries the follower’s LEO. And P0 may well have more than two replicas; in the figure I have drawn additional followers of P0, which also fetch from the leader partition and report their own LEOs. The leader partition records the LEOs reported by these followers and takes the smallest of them as the HW value.

This guarantees that if the leader partition goes down and the cluster elects a new leader from the remaining follower partitions, then no matter which node is elected leader, all the data that was consumable at that moment is still there, which keeps the data safe.

As for how a follower determines its own HW: when the follower fetches data, the response also carries the leader partition’s HW, and the follower takes the smaller of that value and its own LEO as its own HW.
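To make the update rule concrete, here is a toy sketch of the arithmetic described above (an illustration written for this article, not Kafka’s real replica code): the leader takes the smallest LEO reported to it as its HW, and a follower takes the smaller of the leader’s HW and its own LEO.

import java.util.Arrays;
import java.util.List;

// Toy illustration of the LEO/HW update rule described above.
// Not Kafka's actual implementation, just the arithmetic.
public class HwDemo {

    // Leader HW: the smallest LEO among the leader itself and the followers reporting to it.
    static long leaderHighWatermark(long leaderLeo, List<Long> followerLeos) {
        long hw = leaderLeo;
        for (long leo : followerLeos) {
            hw = Math.min(hw, leo);
        }
        return hw;
    }

    // Follower HW: min(leader HW carried back in the fetch response, the follower's own LEO).
    static long followerHighWatermark(long leaderHw, long followerLeo) {
        return Math.min(leaderHw, followerLeo);
    }

    public static void main(String[] args) {
        long leaderLeo = 6;                              // leader holds messages [0, 5]
        List<Long> followerLeos = Arrays.asList(5L, 4L); // two followers lagging slightly

        long leaderHw = leaderHighWatermark(leaderLeo, followerLeos);
        System.out.println("leader HW = " + leaderHw);   // 4, so consumers can read offsets [0, 3]

        long followerHw = followerHighWatermark(leaderHw, 5L);
        System.out.println("follower HW = " + followerHw); // min(4, 5) = 4

        // As noted a little further down: a follower may rejoin the ISR once its LEO >= the leader's HW.
        System.out.println("can rejoin ISR: " + (5L >= leaderHw));
    }
}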

Now if you think back to the ISR mentioned earlier, it becomes even clearer. A follower that has not synchronized with the leader for more than 10 seconds is kicked out of the ISR. The ISR exists to help us quickly pick a new leader when the current leader goes down: the followers in the ISR list are closely in sync with the leader, so even if some data is lost, not much will be.

When a follower’s LEO catches up to (is greater than or equal to) the leader’s HW, the follower can rejoin the ISR.

Still, the process just described cannot completely rule out losing some data. There are of course ways to guarantee data integrity; we will leave those for the wrap-up after the source-code analysis.

1.1.4 I admit there are a lot of words in the picture

2. Sorting out the Kafka workflow

I already walked you through this with diagrams in the plain-English article, and now I am going to run through the whole flow once more.

Start with two brokers. When they start up, they register themselves with the ZooKeeper cluster, and then both servers race to create a node named Controller in ZooKeeper. Say the first broker grabs it: that broker becomes the Controller, which listens for changes to the various ZooKeeper directories and manages the metadata of the whole cluster.
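That “grabbing” boils down to every broker trying to create the same ephemeral znode in ZooKeeper; only the first attempt succeeds. Here is a minimal sketch with the plain ZooKeeper client (the address and the broker-id payload are placeholders, and this is not Kafka’s actual controller code):

import org.apache.zookeeper.*;

// Sketch of the race to become Controller: each broker tries to create the same ephemeral
// znode, and whoever succeeds first wins. Not Kafka's real code, just the idea.
public class ControllerElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        try {
            // EPHEMERAL: the node disappears if this broker's session dies,
            // so the surviving brokers can race again for the controller role.
            zk.create("/controller",
                      "{\"brokerid\":1}".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("won the race, this broker is the Controller");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("another broker is already the Controller");
        } finally {
            zk.close();
        }
    }
}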

Now we create a topic through a client command. A partition scheme for the topic (really just some metadata) gets written to a ZooKeeper directory. When the Controller notices that a partition scheme has been written to that directory, it updates its own metadata, and the other brokers then synchronize their metadata with the Controller, which keeps the metadata consistent across all brokers in the cluster.

For example, from the metadata we now know that there is a partition P0 whose leader partition is on the first broker and whose follower partition is on the second broker.

Before sending a message to the cluster, the producer wraps each message into a ProducerRecord object; this happens inside the producer. The record then goes through serialization. Next, the producer has to pull metadata from the cluster (which is why one or more broker addresses are supplied in the 1-⑤-1 producer code in the episode: Kafka producer principle and important parameters).

props.put("bootstrap.servers", "hadoop1:9092,hadoop2:9092,hadoop3:9092");

Without the server addresses it cannot fetch the metadata, and without the metadata the producer’s message has no idea which server and which partition it should be sent to.
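Putting the config line above into a complete minimal producer may help; the broker addresses and the topic name are placeholders, and the serializer settings are the standard string serializers shipped with the Kafka client:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal producer sketch: fetch metadata from the bootstrap servers, wrap the message
// in a ProducerRecord, serialize it, and let the client route it to the right partition.
public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop1:9092,hadoop2:9092,hadoop3:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("test-topic", "key-1", "hello kafka"));
        producer.close();
    }
}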

The producer does not rush to send the message either; it first puts it into a buffer. Once messages are in the buffer, a separate thread, the Sender, packs them into batches, and whenever a batch is ready it is sent to the corresponding host. At that point the message goes through Kafka’s three-layer network architecture model, is written to the OS cache, and is then flushed to disk.
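The buffering and batching just described are controlled by a few producer parameters. Continuing with the props object from the sketch above (the buffer and batch values are the commonly cited defaults for this generation of Kafka, the linger value is just an example; check your version’s documentation):

// Buffer / Sender-thread batching knobs, continuing the props object from the sketch above.
props.put("buffer.memory", 33554432);   // total bytes the producer may buffer (32 MB)
props.put("batch.size", 16384);         // target size of one batch per partition (16 KB)
props.put("linger.ms", 5);              // how long the Sender waits to fill a batch before sending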

The disk-writing process then ties together the binary search over the log described in “Analysis of Kafka’s producer case and consumer principle” with the ISR, LEO, and HW we just covered, because once the leader has written the data, the followers come over to synchronize it.
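That binary search can be pictured with a toy sketch: a log segment’s index is a sorted list of (offset, file position) entries, and the lookup finds the largest indexed offset that is not greater than the target, then scans forward from there. This is only an illustration of the idea, not the real log/index code:

// Toy illustration of locating an offset via a sorted, sparse offset index.
public class IndexLookupSketch {
    // Returns the position of the largest indexed offset <= target, or -1 if none.
    static int largestEntryLessOrEqual(long[] indexedOffsets, long target) {
        int lo = 0, hi = indexedOffsets.length - 1, answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexedOffsets[mid] <= target) { answer = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return answer;
    }

    public static void main(String[] args) {
        long[] indexedOffsets = {0, 4, 8, 12, 16};                       // sparse index entries
        System.out.println(largestEntryLessOrEqual(indexedOffsets, 10)); // 2 -> start scanning at offset 8
    }
}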

A consumer group has its own group.id, from which we can calculate which broker will act as its coordinator. Once the coordinator is identified, all consumers send it a JoinGroup request to register. By default the coordinator picks the first consumer to register as the leader consumer and reports the state of the whole topic to it. The leader consumer then draws up a consumption plan based on load balancing and returns it to the coordinator, and once the coordinator has the plan it distributes it to all the consumers, completing the process.
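That “calculate which broker acts as coordinator” step comes down to hashing the group.id onto one partition of the internal __consumer_offsets topic; the broker hosting the leader of that partition is the group’s coordinator. A hedged sketch (the group id is a placeholder, and 50 is the default partition count of __consumer_offsets, which your cluster may override):

// Sketch of mapping a group.id to its coordinator: hash onto a __consumer_offsets partition;
// the leader broker of that partition is the coordinator. Edge cases (e.g. Integer.MIN_VALUE) ignored.
public class CoordinatorSketch {
    public static void main(String[] args) {
        String groupId = "my-consumer-group";   // placeholder group.id
        int offsetsTopicPartitions = 50;        // default offsets.topic.num.partitions
        int partition = Math.abs(groupId.hashCode()) % offsetsTopicPartitions;
        System.out.println("this group's offsets live in __consumer_offsets-" + partition);
    }
}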

This links together everything we have talked about in the series, and it is basically everything you need to know. Such a big system was explained bit by bit, one small knowledge point at a time, until the picture was complete. If you are interested in Kafka, I really recommend reading the earlier articles; I believe they will help you.

Back to the source code

Java NIO basics and Scala were listed as the two prerequisites in the source-code warm-up article, but if you are not familiar with Scala, don’t worry: it is very similar to Java, and I believe that with a bit of explanation we can still follow all the patterns.

3. A brief talk about the environment

The Kafka version analyzed here is 0.10.1, while the latest is 2.2.x. The core processes have not changed much, the old version is more stable than the new ones, and its code structure is clearer. In an open-source project like this, many people submit patches, and the contributors are not necessarily the strongest, which makes the newer code look rather chaotic and less convenient to learn from.

3.1 JDK 1.7+

3.2 Scala

Kafka was originally written in Scala. The producer-side and consumer-side code was later rewritten in Java, but the server-side source code has always been in Scala, so to analyze the Kafka source code we need to install a Scala environment.

I use version 2.11.8. Just download it and configure the environment variables and you are set (you can search Baidu for the details; the configuration is very similar to Java’s, so I won’t expand on it here).

IDEA needs the Scala plugin installed: in Settings → Plugins, search for Scala.

3.3 Gradle

The Kafka source code is not managed with Maven but with Gradle. You can think of Gradle as a code-management and build tool similar to Maven, and it is installed in much the same way as Maven.

Finally

The source code we will talk about:

The server side, and KafkaProducer on the client side (the reading/consuming side is very basic and I don’t need to talk about it; in fact, reading data is not the hard part of a big-data framework).

If I explained the code class by class it would certainly get confusing, so I will walk through it by scenario instead, and I don’t even have to write those scenarios myself. Have you seen the examples package in the source code? Most big-data frameworks are open source, and to promote themselves the projects write detailed official documentation and ship good example packages. From here on, the explanation will mostly live in code comments.

That is where the next article will pick up. Let’s encourage each other and keep moving forward.