Kafka is the undisputed king of the big data messaging world.

This is the first article in the Kafka series. Expect a total of 20 articles, all original, taking you from 0 to 1 until you have truly mastered Kafka.

This article takes a look at Kafka's terminology and interprets it. Terminology can be boring, but it's the distilled essence!

Before we dig into Kafka, we must first master its related concepts and terminology, which will help with further study of Kafka's various features. So even if it's boring, stick with me!

Here's roughly what you need to master, no more and no less, and it should take about 20 minutes to get through:

The topic layer

The topic layer has three sons, called Topic, Partition and Replica. And since I said three sons, you get the idea: they're inseparable.

Topic

Kafka is a distributed messaging engine system that provides a complete set of Message publishing and subscription solutions.

In Kafka, the publish-subscribe object is a Topic, which you can create for each business, for each application, and even for each type of data.

A Topic is a generalization of a set of messages. You can also think of it as a table in a traditional database, or a directory in a file system.

Partition

A Topic is usually composed of multiple partitions, and you can specify the number of partitions when creating a Topic.

📷 Partition advantages

Why do you need to partition topics? If you're familiar with other distributed systems, you've probably heard of sharding and regions, such as Sharding in MongoDB and Elasticsearch and Region in HBase; they all work on the same principle.

What if a Topic has accumulated too much data for a single Broker machine to hold?

A natural thought is: can you split the data into multiple pieces and store them on different machines? That is exactly what partitioning is for. To solve the scalability problem, each partition can be placed on a separate server.

Of course, that's not the only advantage; partitioning also increases throughput. Within a consumer group, Kafka allows the data of a single partition to be consumed by only one consumer thread, so on the consumer side, parallelism depends entirely on the number of partitions being consumed. To sum up: in general, the more partitions a Kafka cluster has, the more throughput you can reach.

📷 Partition structure

Each partition corresponds to a folder in which data and index files of the partition are stored.

As shown in the figure, there are two folders, both belonging to a topic called ASD: this server holds two of its partitions, 0 and 2. What about partition 1? It's on another server! Distributed, after all!

Let's go into the ASD-0 directory and see what's inside: the log files are this partition's data files, sitting alongside its index files.

Don't worry about what each of them is for just yet; I'll describe them in detail in a later article.

📷 Partition ordering

Now I need you to pay close attention, because this point about partitioning is very important:

The order of messages is guaranteed within each partition, but not between partitions.

This matters. For example, suppose the messages in Kafka are data from some business database. The MySQL binlog is sequential: at 10:01 I had not paid yet, so pay_date was null; at 10:02 I paid, and pay_date was updated.

In Kafka, however, because it is distributed and multi-partitioned, global order is not guaranteed. The 10:02 message might arrive first, which can cause serious production problems. Therefore, we generally partition by table + primary key, ensuring that data for the same primary key is sent to the same partition.

If you want all data in Kafka to be stored in chronological order, set the number of partitions to 1.
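To make the "partition by table + primary key" idea concrete, here is a minimal plain-Python sketch. No real Kafka is involved, and the byte-sum hash merely stands in for Kafka's actual murmur2 partitioner: the point is that the same key always hashes to the same partition, so the two updates for the same row keep their order.

```python
# Simulated "partition by table + primary key": messages with the same key
# always hash to the same partition, so their relative order is preserved.

def pick_partition(key: str, num_partitions: int) -> int:
    # Stand-in for the producer's hash partitioner (real Kafka uses murmur2).
    return sum(key.encode()) % num_partitions

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

# Two binlog events for the same order row, produced in sequence,
# with an unrelated row's event in between.
events = [
    ("orders:1001", "pay_date=null"),
    ("orders:2002", "pay_date=null"),
    ("orders:1001", "pay_date=10:02"),  # the later update for row 1001
]

for key, value in events:
    partitions[pick_partition(key, NUM_PARTITIONS)].append((key, value))

# All updates for orders:1001 sit in one partition, in produce order.
p = pick_partition("orders:1001", NUM_PARTITIONS)
same_key = [v for k, v in partitions[p] if k == "orders:1001"]
print(same_key)  # ['pay_date=null', 'pay_date=10:02']
```

Consumers reading that one partition therefore always see the null pay_date before the 10:02 update, even though globally the cluster makes no ordering promise.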

Replica

Multiple replicas can be configured for each partition. Kafka defines two types of replicas: the Leader Replica and the Follower Replica. A partition has exactly one leader replica and N-1 follower replicas.

Why replicas? For the same reason you back up important files more than once: replication is one way to achieve high availability.

Note that only the leader replica provides external service and interacts with client programs: the producer always writes messages to the leader replica, and the consumer always reads messages from the leader replica. A follower replica cannot interact with the outside world; it does only one thing: send fetch requests to the leader replica so that it stays up to date with the latest messages the leader has received.

If you're still fuzzy about topics, partitions, and replicas, think it through again with this image, and I'm sure it will click:

As shown in the following figure, TopicA has three partitions, each with one leader replica and one follower replica. To ensure high availability, a partition's leader and follower replicas must be distributed across different machines.

The message layer

Kafka's official definition is a Messaging System, so the most basic unit of data in Kafka is the message, which you can think of as a row or a record in a database. Messages are just byte arrays. Here are a few things you need to know about messages:

📷 Message key

When sending a message, you can specify a key, which is also a byte array. The key determines which partition a message is written to. You can use a field with clear business meaning as the key, such as a user number, to ensure that messages for the same user number go into the same partition.
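A tiny sketch of this routing rule, with the assumptions flagged: the byte-sum hash below stands in for the real producer's murmur2 hash, and keyless messages are simply spread round-robin here (newer Kafka producers actually use a "sticky" strategy for keyless messages).

```python
from itertools import count

_round_robin = count()

def choose_partition(key, num_partitions):
    # Keyed message: hash the key (byte-sum stands in for murmur2),
    # so the same key always lands in the same partition.
    if key is not None:
        return sum(key.encode()) % num_partitions
    # Keyless message: spread across partitions (round robin here;
    # recent Kafka producers use a batch-friendly "sticky" strategy).
    return next(_round_robin) % num_partitions

# Same user number -> same partition, every single time.
assert choose_partition("user-42", 4) == choose_partition("user-42", 4)
```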

📷 Batch write

To improve efficiency, Kafka writes in batches.

A batch is a set of messages that are sent to the same topic and partition (based on the producer’s configuration).

Sending each message over the network individually is a performance drain, so it is much more efficient to gather messages together and send them in one go.

Of course, this is a trade-off between latency and throughput: the larger the batch, the more messages are processed per network round trip, but the longer each message may wait. Batches are usually compressed, making them more efficient to transmit and store.
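A toy illustration of that trade-off, using a made-up BatchingProducer class (only the size threshold is modeled; the real producer also flushes after a linger timeout): seven messages leave the client in just three "network sends".

```python
# Toy batching producer: messages accumulate until the batch is "full"
# and are then flushed together as a single network send.

class BatchingProducer:
    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.batch = []
        self.flushed = []  # batches "sent" over the network

    def send(self, msg):
        self.batch.append(msg)
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.flushed.append(list(self.batch))  # one network round trip
            self.batch.clear()

p = BatchingProducer(batch_size=3)
for i in range(7):
    p.send(f"msg-{i}")
p.flush()  # drain the leftover tail

print(len(p.flushed))  # 3 network sends instead of 7
```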

📷 Offset

When a producer writes messages to a partition, each message's position within the partition is represented by a value called the Offset. Partition offsets always start at 0. Suppose a producer writes 10 messages to an empty partition; those messages get offsets 0, 1, 2… , 9.
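The offset rule fits in a few lines of plain Python (the Partition class here is just an illustration, not Kafka code):

```python
# A partition is an append-only log: each message gets the next offset,
# starting at 0, and an offset never changes once assigned.

class Partition:
    def __init__(self):
        self.log = []

    def append(self, msg):
        offset = len(self.log)  # next offset = current log length
        self.log.append(msg)
        return offset

part = Partition()
offsets = [part.append(f"m{i}") for i in range(10)]
print(offsets)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```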

The server side

A Kafka cluster consists of multiple Brokers, and Kafka scales horizontally: the more brokers, the higher the cluster's throughput. Each broker in the cluster has a unique, non-repeating BrokerID. The broker is responsible for receiving and processing requests from clients and for persisting messages.

Brokers run on different machines, so if one machine in the cluster goes down, Kafka can automatically fail over to brokers on other machines and continue serving. This is one of the ways Kafka provides high availability.

📷 Controller

There are one or more brokers in a Kafka cluster, and only one of them is elected as a Kafka Controller, which is responsible for managing the state of all partitions and replicas in the cluster.

When the leader replica of a partition fails, the controller elects a new leader replica for the partition. It is the responsibility of the controller to inform all brokers to update their metadata information when changes to the ISR collection of a partition are detected. When you increase the number of partitions for a topic, the controller is also responsible for redistributing partitions.

These terms may confuse you; for now, just take away what the controller is responsible for. The details of these functions will be covered in a later article.

The controller election in Kafka depends on Zookeeper. The broker that is successfully elected as the controller creates the EPHEMERAL node /controller in Zookeeper.

Here, version is fixed at 1 in the current version, brokerid is the id of the broker elected as controller, and timestamp is the timestamp at which it was elected controller.
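For illustration, here is what parsing such a payload might look like; the brokerid and timestamp values below are made up:

```python
import json

# Illustrative /controller znode payload; brokerid and timestamp
# here are made-up example values.
payload = '{"version":1,"brokerid":0,"timestamp":"1529210278988"}'

info = json.loads(payload)
print(info["brokerid"])  # the id of the broker currently acting as controller
```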

Two types of clients

Kafka has two types of clients: producers and consumers. We refer to them collectively as Clients.

Client applications that publish messages to topics are called Producers; a producer program typically sends messages continuously to one or more topics.

Client applications that subscribe to messages on these topics are called consumers. Like producers, consumers can subscribe to messages on multiple topics simultaneously.

Producer

The Producer creates messages. In publish-subscribe systems, producers are also called publishers or writers.

Typically, a producer publishes to a particular topic and is responsible for deciding which partition to publish to (usually simply by random or round-robin load balancing, by key, or by a custom partitioning function). Producers come in two flavors: the Sync Producer and the Async Producer.

The Sync Producer sends synchronously: the next message goes out only after the previous one has succeeded. Throughput is low, but data is generally not lost.

The Async Producer sends asynchronously: messages go straight into a queue and are sent in batches. Throughput is high, but data loss is possible.
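The contrast can be sketched in plain Python (the names sync_send, async_send, and background_flush are made up for illustration): the sync path waits for an "ack" per message, while the async path buffers and sends one batch later, which is faster but loses whatever is still buffered if the process dies first.

```python
import queue

broker = []                 # stand-in for the Kafka broker

def sync_send(msg):
    broker.append(msg)      # "send over the network"...
    return True             # ...and block until acknowledged

buffer = queue.Queue()

def async_send(msg):
    buffer.put(msg)         # returns immediately; the message is lost
                            # if the process dies before the next flush

def background_flush():
    batch = []
    while not buffer.empty():
        batch.append(buffer.get())
    broker.extend(batch)    # one batched network send

sync_send("a")              # acked before we move on
async_send("b")
async_send("c")
background_flush()
print(broker)  # ['a', 'b', 'c']
```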

Consumer and Consumer Group

📷 Consumers

The Consumer reads messages. In publish-subscribe systems, consumers are also called subscribers or readers. A consumer subscribes to one or more topics and then reads the messages in those topics in order.

📷 Consumer offset

A consumer needs to record its consumption progress, that is, its position within each partition; this is the Consumer Offset. Note that this is not at all the same as the message offset on a partition mentioned above. That offset represents a message's position within the partition, and it is immutable: once a message is successfully written to a partition, its offset value is fixed.

The consumer offset, on the other hand, changes over time, since it is, after all, an indicator of the consumer's progress. By storing the offset of the last consumed message, a consumer application can continue reading from where it left off after a restart or stop. The storage mechanism can be ZooKeeper, or Kafka itself.
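A minimal sketch of offset-based resumption, in plain Python, with saved_offset standing in for the offset store (ZooKeeper or Kafka):

```python
# Consumer offset sketch: progress is saved externally, so a restarted
# consumer resumes where it left off instead of re-reading from 0.

log = [f"m{i}" for i in range(10)]   # a partition with offsets 0..9
saved_offset = 0                     # pretend offset store

def consume(n):
    global saved_offset
    msgs = log[saved_offset:saved_offset + n]
    saved_offset += len(msgs)        # commit the new position
    return msgs

first = consume(4)                   # read m0..m3, commit offset 4
# ... consumer stops and restarts; saved_offset survives ...
second = consume(3)                  # resumes at m4, not back at m0
print(first[-1], second[0])  # m3 m4
```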

📷 Consumer Group

Consumer Group: a group of consumer instances that together consume a set of topics. Each partition is consumed by only one consumer in the group, so group members never consume the same message repeatedly.

Why introduce consumer groups? Mainly to improve throughput on the consumer side: multiple consumer instances consume simultaneously, increasing the overall consumer-side throughput (TPS).

Of course, a group does more than divide up the subscribed topics' data to speed up consumption. Members also help each other: if an instance in the group fails, Kafka automatically detects it and transfers the partitions the failed instance was responsible for to another surviving consumer. This process is called Rebalance.

Keep this word in mind. It's Kafka's famous rebalancing mechanism, responsible for many a production incident. I'll break it down in a follow-up article on Kafka's infamous Rebalance.
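The assignment idea behind rebalancing can be sketched like this (a simple round-robin assignor for illustration; real Kafka ships several strategies, such as range, round-robin, and sticky):

```python
# Round-robin partition assignment within a consumer group, plus a toy
# "rebalance": when a consumer dies, its partitions are redistributed.

def assign(partitions, consumers):
    plan = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        plan[consumers[i % len(consumers)]].append(p)
    return plan

partitions = [0, 1, 2, 3]
plan = assign(partitions, ["c1", "c2"])
print(plan)  # {'c1': [0, 2], 'c2': [1, 3]}

# c2 fails -> rebalance: c1 takes over everything.
plan = assign(partitions, ["c1"])
print(plan)  # {'c1': [0, 1, 2, 3]}
```

Note how each partition always has exactly one owner within the group, which is what guarantees no duplicate consumption among group members.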

Zookeeper

Zookeeper is an indispensable component of Kafka.

Apache Kafka currently uses Apache ZooKeeper to store its metadata, such as broker information, partition locations, and topic configuration, in a ZooKeeper cluster.

Watch my wording: "currently". Why? In 2019 the community proposed a plan to break this dependency and bring metadata management into Kafka itself, because running two systems leads to a lot of duplication.

In previous designs, we had to run at least three additional Java processes, and sometimes more. In fact, we often see Kafka clusters with as many ZooKeeper nodes as Kafka nodes! In addition, the data in ZooKeeper needs to be cached on the Kafka controller, resulting in double caches.

Worse, storing metadata externally limits Kafka’s scalability. When a Kafka cluster starts, or when a new controller is selected, the controller must load the full state of the cluster from ZooKeeper. As the amount of metadata increases, the loading process takes longer, limiting the number of partitions Kafka can store.

Finally, storing metadata externally increases the likelihood that the controller’s memory state is out of sync with the external state.

So, in the future, Kafka’s metadata will be stored in Kafka itself, rather than in an external system like ZooKeeper. Stay tuned to the Kafka community!

conclusion

A typical Kafka cluster consists of producers (which publish new messages to topics), consumers (which subscribe to new messages from topics, with consumer offsets tracking their progress), consumer groups (groups of consumer instances that consume a set of partitions together), and several brokers (the server-side processes). That's the whole picture.

Kafka's publish-subscribe object is called the Topic. Each topic can have multiple Partitions, and each message's position within a partition is its Offset. Partitions have a replication mechanism, so the same message is copied to multiple places to provide data redundancy. Replicas are divided into leader replicas and follower replicas.

Kafka’s composition can be visualized in the following diagram:

Also, post a mind map to help you review the terms described in this article.

Important!!!!! Follow the [fat rolling pig learning programming] public account and send "kafka" to get all the architecture diagrams and the Kafka mind map!

This article comes from the public account [Fat rolling pig learning programming]: a female programmer with both looks and talent, not brilliant but hard-working enough, teaching programming in comic form to make it easy and fun! Please follow!