Distributed Messaging Middleware: Kafka

What is Kafka?

An introduction to Kafka

Kafka is a distributed message publish-subscribe system with high performance and high throughput, widely used in big-data transmission scenarios. It was originally developed at LinkedIn, written in Scala, and later became a top-level project of the Apache Software Foundation.

Kafka’s background

Kafka, as a messaging system, was originally designed to serve as LinkedIn’s activity stream and operational data processing pipeline. Activity stream data is the most common input to any website’s user-behavior analysis. Activity data includes page views, information about what is being viewed, and what is being searched for. This data is usually handled by logging the various activities to files and then periodically analyzing those files. Operational data refers to server performance data (CPU and I/O usage, request times, service logs, etc.).

Application scenarios of Kafka

Its high throughput, built-in partitioning, replication, and fault tolerance (Kafka can handle hundreds of thousands of messages per second) make Kafka a great fit for large-scale message processing applications. In enterprise applications it is therefore mainly used in the following areas:

  • Behavior tracking: Kafka can be used to track user behavior such as browsing and searching. Events are recorded in real time to the corresponding topic through the publish-subscribe model, then processed and analyzed by a back-end big-data platform for real-time processing and monitoring.
  • Log collection: there are many good products for log collection, such as Apache Flume, and many companies use Kafka as an intermediary for log aggregation. In practice, an application writes its logs to local disk. If a problem can be diagnosed with Linux commands, that works; but when the application runs on a load-balanced cluster of dozens of machines or more, locating a problem quickly through the logs becomes very troublesome. Therefore, many companies centralize application logs in Kafka and then import them into Elasticsearch and HDFS, for real-time search and analysis and for offline statistics and backup, respectively. Kafka also provides a convenient API for integrating with logging and doing log collection.

Kafka’s own architecture

A typical Kafka cluster consists of several producers, brokers, consumer groups, and a ZooKeeper cluster. ZooKeeper is used to manage Kafka cluster configuration and coordinate services.

Producers publish messages to the broker in push mode, while consumers subscribe to and consume messages from the broker in pull mode.

Multiple brokers work together, with producers and consumers deployed across various business applications, and ZooKeeper managing configuration and coordinating the cluster. The result is a high-performance distributed message publish-subscribe system.

One detail worth noting: unlike some other MQ middleware, the producer pushes messages to the broker, while the consumer pulls messages from the broker, actively fetching data rather than having the broker push data to the consumer unsolicited.

Some simple concepts in Kafka

Broker

A Kafka server is a broker. A cluster consists of multiple brokers, and one broker can hold multiple topics.

topic

You can think of it as a queue.

partition

For scalability, a very large topic can be distributed across multiple brokers: a topic can be divided into multiple partitions, each of which is an ordered queue. Each message in a partition is assigned an ordered ID, the offset. Kafka only guarantees that messages are delivered to consumers in order within a partition, not across the topic as a whole (i.e., across multiple partitions).

The following figure divides topic Test into three partitions:

producer

The message producer is essentially the client that sends messages to the Kafka Broker.

consumer

Message consumers are the clients that fetch messages from the Kafka Broker.

consumer group

This is a concept unique to Kafka, used to implement both broadcast and unicast of topic messages. A topic can have multiple consumer groups. Each message is delivered to every consumer group, but within a group only one consumer processes it. If broadcast is required, simply give each consumer its own group; if unicast is required, put all consumers in the same group.
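The broadcast/unicast semantics can be sketched with a toy dispatcher. This is a pure-Python illustration of the delivery rule, not the real Kafka client; the group and consumer names are made up:

```python
# Toy model of consumer-group delivery (not the real Kafka client API).
# Every group receives each message; within a group, one consumer handles it.
groups = {
    "group-a": ["a1", "a2"],  # two consumers share the work (unicast within the group)
    "group-b": ["b1"],        # single-consumer group: effectively a broadcast target
}

def deliver(message_id, groups):
    """Return {group: consumer} showing who processes this message."""
    result = {}
    for group, consumers in groups.items():
        # pick exactly one consumer per group, e.g. by message id modulo group size
        result[group] = consumers[message_id % len(consumers)]
    return result

print(deliver(0, groups))  # {'group-a': 'a1', 'group-b': 'b1'}
print(deliver(1, groups))  # {'group-a': 'a2', 'group-b': 'b1'}
```

Note that both groups see both messages, but inside group-a the two consumers split the load.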

segment

A partition physically consists of multiple segments.

offset

Each partition consists of a series of ordered, immutable messages that are continually appended to it. Each message in a partition has a sequential ID called the offset, which uniquely identifies the message within that partition.
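The append-only, offset-indexed structure of a partition can be sketched as follows. This is an in-memory simplification for illustration; real Kafka persists this structure in segment files on disk:

```python
class Partition:
    """Toy in-memory partition: an append-only list of messages,
    where each message's offset is simply its index in the list."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        offset = len(self.messages)  # next sequential offset
        self.messages.append(msg)
        return offset

    def read(self, offset):
        # existing messages are immutable; reads never change state
        return self.messages[offset]

p = Partition()
print(p.append("m0"))  # 0
print(p.append("m1"))  # 1
print(p.read(1))       # m1
```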

How does the producer produce the message

A producer can specify to send a message to a topic in Kafka. Each topic can have multiple producers sending messages to it, and multiple consumers consuming messages in it.

Each topic can be divided into multiple partitions (every topic has at least one partition). Different partitions within the same topic contain different messages (any given message is stored in exactly one partition). When a message enters a partition, it is assigned an offset, which is the unique number of the message in that partition. Kafka uses the offset to ensure that messages are ordered within a partition; offsets do not carry across partitions, so Kafka only guarantees ordering within the same partition.

In the following figure, for a topic named test, there are three partitions: P0, P1, and P2

  • Each message sent to the broker is stored in one of the partitions according to the partitioning rules. If the rules are set properly, all messages are distributed evenly across the partitions, similar to sharding data across database partition tables.
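A simplified sketch of such a partitioning rule is shown below. Real Kafka's default partitioner applies a murmur2 hash to the message key; here CRC32 stands in purely for illustration, and the key `user-42` is a made-up example:

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Simplified partitioner sketch: hash the key and take it modulo
    the partition count (Kafka's default uses murmur2, not CRC32)."""
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land in the same partition,
# which is what preserves per-key ordering.
same = choose_partition(b"user-42", 3) == choose_partition(b"user-42", 3)
print(same)  # True
```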

How do consumers consume messages

Each topic has multiple partitions. The advantage of multiple partitions is that, on the one hand, the data on a broker can be sharded, reducing the volume of messages each broker holds and improving I/O performance. On the other hand, to improve consumption capacity, multiple consumers are usually used to consume the same topic, which is consumer-side load balancing.

How does the consumer consume messages when there are multiple partitions and multiple consumers? Kafka has the concept of a consumer group: consumers that share the same group.id belong to a single consumer group, and all consumers in the group coordinate to consume all partitions of the subscribed topic. Of course, each partition can be consumed by only one consumer within the same group. So how do consumers in the same consumer group decide which consumer gets the data from which partition? As shown in the figure below, there are three partitions and three consumers, so which consumer consumes which partition?

For this figure, three consumers correspond to three partitions, so each consumer consumes exactly one partition.

What about partition allocation in Kafka?

In Kafka, there are two partition assignment strategies: range (the default) and roundRobin (round-robin).

Range partitioning assigns each consumer a contiguous range of partitions. If we have 10 partitions and three consumers, 10 / 3 = 3 with a remainder of 1, so each consumer gets three partitions and the one left over is assigned to Consumer1:

  • Consumer1: 0,1,2,3
  • Consumer2: 4,5,6
  • Consumer3: 7,8,9
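The range calculation above can be sketched in a few lines. This is a pure-Python illustration of the arithmetic, not Kafka's actual RangeAssignor:

```python
def range_assign(partitions, consumers):
    """Range strategy sketch: each consumer gets a contiguous block of
    partitions; the first (len(partitions) % len(consumers)) consumers
    each get one extra partition."""
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

result = range_assign(list(range(10)), ["Consumer1", "Consumer2", "Consumer3"])
print(result)
# {'Consumer1': [0, 1, 2, 3], 'Consumer2': [4, 5, 6], 'Consumer3': [7, 8, 9]}
```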

The round-robin strategy lists all partitions and all consumer threads, sorts them by hashCode, and then assigns partitions to the consumer threads one at a time by polling. If all consumer instances have identical subscriptions, the partitions end up evenly distributed.
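The dealing step can be sketched as follows. This pure-Python illustration skips the hashCode sorting performed by Kafka's actual round-robin assignor and just deals partitions out in turn:

```python
def round_robin_assign(partitions, consumers):
    """Round-robin sketch: deal partitions to consumers one at a time
    (the real assignor first sorts partitions and consumer threads)."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

print(round_robin_assign(list(range(10)), ["C1", "C2", "C3"]))
# {'C1': [0, 3, 6, 9], 'C2': [1, 4, 7], 'C3': [2, 5, 8]}
```

Compared with range assignment, the extra partitions are spread out instead of all landing on the first consumers.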

Kafka performs a rebalance when any of the following occurs. This is the Kafka consumer rebalance:

  1. A new consumer joins the same consumerGroup
  2. A consumer leaves the current consumerGroup, for example by shutting down or crashing
  3. New partitions are added to the topic (i.e., the number of partitions changes)

Message persistence

Kafka uses log files to persist the messages sent by producers. Each message has an offset value indicating its position in the partition. Since Kafka generally stores a large amount of message data, to prevent log files from becoming too large, a log does not correspond to a single file on disk but to a directory. For example, if you create a topic called firstTopic with three partitions, three directories appear under /tmp/kafka-logs: firstTopic-0, firstTopic-1, and firstTopic-2.
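The directory naming convention can be sketched as below, assuming the common default log directory of /tmp/kafka-logs (the actual base path is set by the broker's log.dirs configuration):

```python
def log_dirs(topic, num_partitions, base="/tmp/kafka-logs"):
    """Each partition maps to its own directory named <topic>-<partition>."""
    return [f"{base}/{topic}-{i}" for i in range(num_partitions)]

print(log_dirs("firstTopic", 3))
# ['/tmp/kafka-logs/firstTopic-0', '/tmp/kafka-logs/firstTopic-1', '/tmp/kafka-logs/firstTopic-2']
```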

Allocation of multiple partitions on a cluster

If we create multiple partitions for a topic in a cluster, how are the partitions distributed across the brokers?

  1. Sort the n brokers and the i partitions to be allocated
  2. Assign the i-th partition to the (i mod n)-th broker
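The two steps above can be sketched directly. This is a simplified illustration of the modulo rule as stated; the real broker assignment also involves a randomized starting offset and replica placement:

```python
def assign_partitions_to_brokers(num_partitions, brokers):
    """Sketch of the rule above: sort the brokers, then send
    partition i to broker (i mod n)."""
    brokers = sorted(brokers)
    return {i: brokers[i % len(brokers)] for i in range(num_partitions)}

print(assign_partitions_to_brokers(4, ["broker-0", "broker-1", "broker-2"]))
# {0: 'broker-0', 1: 'broker-1', 2: 'broker-2', 3: 'broker-0'}
```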

With that in mind, you should be able to understand which broker a message is sent to, which partition it is stored in, and which partition's data a given consumer should consume.

Conclusion

This section briefly explains the principles of production, consumption, and message persistence in Kafka. In the next section we’ll elaborate on some of Kafka’s features.