
1. Basic Concepts

Kafka is a distributed, partitioned, multi-replica, multi-producer, multi-subscriber, ZooKeeper-coordinated distributed logging system (also used as an MQ system), commonly used for web/Nginx logs, access logs, messaging services, and more.

1.1 Characteristics

  • Message persistence: O(1) time complexity gives constant-time access performance even for terabytes of data.
  • High throughput: messages are sent in batches; even on very cheap commodity machines, a single machine can support 100K messages per second.
  • Horizontal scaling is achieved by dividing a topic into multiple partitions (FIFO queues)
  • High availability is achieved through partition replicas
  • In a consumer group, each partition is consumed by only one consumer:
    • If the number of consumers is less than the number of partitions, there will be consumers consuming multiple partitions
    • If the number of consumers is greater than the number of partitions, some consumers will be idle (this avoids duplicate consumption)
  • Each record consists of a key (optional), value, and timestamp.

For message-oriented middleware:

  • From the perspective of message delivery mode, it can be divided into point-to-point mode and publish-subscribe mode. Kafka is a publish-subscribe model.

  • From the perspective of message consumption mode, it can be divided into push and pull modes. Kafka consumers only pull messages; the broker does not push (push-like behavior can be approximated through polling).

1.2 Four Core APIs

  • Producer API: Allows applications to publish streams of records to one or more Kafka topics.
  • Consumer API: Allows applications to subscribe to one or more topics and process the streams of records produced to them (a minimal sketch of the Producer and Consumer APIs follows this list).
  • Streams API: Allows an application to act as a stream processor, consuming input streams from one or more topics and producing output streams to one or more output topics, effectively transforming input streams into output streams.
  • Connector API: Allows you to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector for a relational database might capture every change to a table.
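
To make the first two APIs concrete, here is a minimal sketch using the official Java client (kafka-clients). The broker address localhost:9092, the topic demo-topic, and the group id demo-group are assumptions for this example, not fixed values.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QuickStart {
    public static void main(String[] args) {
        // Producer API: publish a record to a topic.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }

        // Consumer API: subscribe to the topic and poll for records.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group"); // consumers sharing a group.id share the partitions
        c.put("auto.offset.reset", "earliest"); // read from the start if no committed offset exists
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
            }
        }
    }
}
```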

1.3 How to Ensure Message Ordering?

  • There is no guarantee of ordering across the whole topic, only that messages within the same partition are consumed in order.
  • You can specify the partition when sending messages (this raises load-balancing concerns).
  • [Recommended] When sending a group of ordered messages, specify a key; the key is hashed so that all messages with the same key land in the same partition (see the sketch after this list).
  • [Extreme] To guarantee strict ordering of all messages, set the number of partitions to 1.
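
A minimal sketch of the recommended approach, reusing the producer from the sketch in 1.2; the topic orders and the order ID are hypothetical:

```java
// All three events carry the same key ("order-42"), so the default partitioner
// hashes them to the same partition and they are consumed in the order sent.
String orderId = "order-42"; // hypothetical business key
producer.send(new ProducerRecord<>("orders", orderId, "created"));
producer.send(new ProducerRecord<>("orders", orderId, "paid"));
producer.send(new ProducerRecord<>("orders", orderId, "shipped"));
```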

1.4 Advantages

  • High throughput: a single machine can process on the order of hundreds of thousands to millions of messages per second, and performance stays stable even with many terabytes of messages stored.
  • High performance: a single node supports thousands of clients while ensuring zero downtime and zero data loss.
  • Persistent storage: messages are persisted to disk; data loss is prevented through disk persistence and replication.
    • Zero copy
    • Sequential reads and writes
    • Use of the Linux page cache
  • Distributed system, easy to scale out: producers, brokers, and consumers are all distributed, and there can be many of each; machines can be added without stopping the cluster. The producers and consumers may be different applications.
  • Reliability – Kafka is distributed, partitioned, replicated, and fault tolerant.
  • Client-side state maintenance: the state of message processing is maintained on the consumer side, not the server side, and consumers rebalance automatically on failure.
  • Online and offline scenarios are supported.
  • Support for multiple client languages: Kafka supports Java, .NET, PHP, Python, and many other languages.

1.5 Application Scenarios

  • Log collection: collect logs from many services and expose them through Kafka as a unified interface to consumers;
  • Message systems: decouple producers and consumers, cache messages, and so on;
  • User activity tracking: Kafka is often used to record the activities of Web users or App users, which are published by various servers to Kafka topics. Consumers can then subscribe to these topics for real-time monitoring and analysis, and can also be saved to a database.
  • Operational metrics: Kafka is also used to record operational monitoring data. This includes collecting data from various distributed applications and producing centralized feedback on various operations, such as alarms and reports;
  • Stream processing: for example, with Spark and Storm.

2. Core Concepts

2.1 Messages and batches

Kafka’s data units are called messages. A message is simply an array of bytes. A message can also have a key, which is likewise an array of bytes; keys are used when messages need to be written to particular partitions in a controlled way.

For efficiency, messages are written to Kafka in batches. A batch is a group of messages that belong to the same topic and partition. Breaking messages into batches reduces network overhead. The larger the batch, the more messages are processed per unit of time and the longer the transmission time of individual messages. Batch data is compressed, which improves data transmission and storage, but requires more computational processing.
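
Batching is controlled on the producer side. The sketch below shows the three settings involved; the broker address and topic are assumptions, and the values are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", 16384);       // upper bound, in bytes, of one batch per partition
        props.put("linger.ms", 10);           // wait up to 10 ms for more records so batches fill up
        props.put("compression.type", "lz4"); // whole batches are compressed: CPU traded for I/O
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        } // close() flushes any batch that is still lingering
    }
}
```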

2.2 Message Schemas

A message schema defines the data format of a message.

There are many easy-to-understand schema options, such as JSON and XML, but they lack strong typing. More commonly used is Apache Avro, which provides a compact serialization format in which the schema is separated from the message body. With Avro there is no need to regenerate code when a schema changes; it supports strong typing and schema evolution, and versions are both forward and backward compatible.

Consistency in the data format is important to Kafka because it decouples writing messages from reading them.

2.3 Topics and Partitions

Kafka’s messages are organized by topic. Topics can be divided into partitions, and a topic’s partitions are spread across a Kafka cluster, providing the ability to scale horizontally.

2.4 Producers and consumers

The producer creates messages. The consumer consumes messages.

When a message is published to a particular topic, the producer chooses the target partition using one of three strategies:

  • Specify the partition of the message directly
  • Hash the message’s key and take it modulo the number of partitions
  • Round-robin across all partitions

Consumers use offsets to distinguish messages they have already read.

A consumer group consists of multiple consumers. A consumer group guarantees that each partition is consumed by only one consumer in the group, which avoids duplicate consumption.

2.5 Cluster Controllers and Partition Leaders

A standalone Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to disk. The broker serves consumers, responding to partition read requests by returning messages that have been saved to disk.

Each cluster has one broker that acts as the cluster controller (elected by competing for a distributed lock in ZooKeeper). The controller is responsible for administrative work:

  • Assigning partitions to brokers
  • Monitoring brokers

Within a cluster, each partition has a leader replica hosted on one broker; that broker is called the partition leader for the partition.

A partition can be assigned to multiple brokers, in which case partition replication occurs. Replication of partitions provides message redundancy and high availability.

Follower replicas do not handle read or write requests for messages.

3. Core Components

3.1 Producer

Producers publish messages to Kafka’s topics. After the broker receives a message sent by the producer, it appends the message to the corresponding segment file.

Typically, a message will be posted to a specific topic.

  • By default (no key specified), messages are distributed to the topic’s partitions round-robin.
  • You can specify the target partition directly when sending a message.
  • Messages can also be routed to partitions through message keys:
    • The partitioner generates a hash of the key and maps it to a specific partition. This guarantees that messages with the same key are written to the same partition.
    • Producers can also use a custom partitioner to map messages to partitions based on business rules (a sketch follows this list).
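
A minimal sketch of a custom partitioner; the routing rule (keys starting with "vip" always go to partition 0) and the class name are invented for illustration:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class VipPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // simplified; a real implementation would spread keyless records
        }
        if (key.toString().startsWith("vip")) {
            return 0; // invented business rule: VIP traffic pinned to partition 0
        }
        // Same idea as the built-in hash strategy: murmur2 of the key bytes.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}
```

To activate it, point the producer at the class via props.put("partitioner.class", "VipPartitioner"), using the fully qualified name of wherever the class lives.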

3.2 Consumer

The consumer subscribes to one or more topics and reads messages from each partition in the order in which they were produced.

  • Offset: the consumer uses the message offset to distinguish messages that have already been read.
    • The offset is a piece of metadata: a monotonically increasing integer that Kafka adds to a message when it is created.
    • Within a partition, each message has a unique offset.
    • The consumer stores the last-read offset of each partition in Kafka (older versions kept it in ZooKeeper), so its read position is not lost if the consumer shuts down or restarts.
  • Consumer group: consumers belong to consumer groups. A group guarantees that each partition is consumed by only one of its consumers.
  • Rebalance: if a consumer fails, the other consumers in the group take over the failed consumer’s work (partitions are reassigned to consumers). A sketch combining these ideas follows this list.
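
Here is a sketch of a consumer in a group that commits its offsets manually after processing, so a restart or rebalance resumes from the last committed position. The topic, group id, and process() handler are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");      // joins the consumer group
        props.put("enable.auto.commit", "false"); // we commit offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    process(r); // hypothetical business handler
                }
                // Store the read position in Kafka; after a crash or rebalance,
                // the group resumes from the last committed offset.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> r) {
        System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
    }
}
```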

3.3 Broker

A standalone Kafka server is called a broker. The broker serves consumers, responding to partition read requests by returning messages that have been committed to disk.

Suppose there are a number of brokers in a cluster and a number of partitions in a topic:

  • Number of partitions == number of brokers: each broker stores one partition of the topic.
  • Number of partitions < number of brokers: N brokers each store one partition of the topic, and the remaining brokers store no partition data for it.
  • Number of partitions > number of brokers: some brokers store more than one partition of the topic. Avoid this situation in production environments; it can lead to data imbalance in the Kafka cluster.

Two concepts related to broker:

  • Cluster controller
    • A broker is part of a cluster. Each cluster has one node that acts as both a broker and the cluster controller (automatically elected from the active members of the cluster by competing for a distributed lock in ZooKeeper).
    • The controller is responsible for administrative work, including assigning partitions to brokers and monitoring brokers.
  • Partition leader:
    • In a cluster, the broker that hosts a partition’s leader replica is called the partition leader for that partition.
    • For example, Broker1 can be the partition leader for partition 0 while Broker2 is the partition leader for partition 1.

3.4 Topic

Every message published to the Kafka cluster belongs to a category called a topic.

Messages for different topics are physically stored separately (in different folders).

3.5 Partition

  • A topic can be divided into partitions, and each partition corresponds to a set of files on disk.
  • Messages are appended to a partition and then read from it in first-in, first-out order.
  • Message order is guaranteed within a single partition, not across the whole topic.
  • Kafka achieves data redundancy and scalability through partitioning.
  • Set the number of partitions to 1 in scenarios where the consumption order of messages must be strictly guaranteed.

A partition in Kafka is equivalent to a queue in other MQ systems.

3.6 Replicas

Kafka uses topics to organize data. Each topic is divided into partitions, and each partition has multiple replicas. Replicas of the same partition are stored on different brokers.

There are two types of replicas:

  • Leader replica (the leader partition)

    • Each partition has exactly one leader replica.
    • To guarantee consistency, all producer and consumer requests go through this replica.
  • Follower replica

    • Follower replicas do not process requests from clients.
    • Their only job is to copy messages from the leader and stay consistent with it.
    • If the leader crashes, one of the followers is promoted to become the new leader.

Follower replicas are divided into in-sync replicas and out-of-sync replicas. When the leader is switched, only an in-sync replica can become the new leader.

  • AR

    All the replicas in a partition (leader included) are collectively called the AR (Assigned Replicas), and AR = ISR + OSR.

  • ISR

    • All replicas (including the leader) that keep in sync with the leader replica to a bounded degree form the ISR (In-Sync Replicas).
    • The ISR set is a subset of the AR set.
    • Messages are first written to the leader replica; follower replicas then pull messages from the leader to synchronize. During synchronization a follower lags behind the leader to some extent. The “degree” above refers to the tolerable lag range, which can be configured with parameters.
  • OSR

    • Replicas whose synchronization with the leader lags too far behind form the OSR (Out-of-Sync Replicas); the leader replica is excluded.
    • Under normal circumstances, every follower replica should keep up with the leader to some degree, that is, AR = ISR and the OSR set is empty.
  • HW

    • HW is short for High Watermark. It marks the offset of a particular message; consumers can only pull messages before this offset.
  • LEO

    • LEO is short for Log End Offset: the offset of the next message to be written (not yet written) in the current partition/log segment (a worked example follows this list).
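
A simplified worked example (assuming the HW is taken as the smallest LEO in the ISR): if a partition’s ISR holds three replicas with LEOs of 8, 6, and 7, the HW is 6, so consumers can pull only the messages at offsets 0 through 5.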

3.7 Offset

  • Producer offset

    When messages are written, each partition has an offset; this producer offset is the latest and largest offset in that partition. The producer does not need to specify the offset of a partition; Kafka assigns it automatically.

  • Consumer offset

    Consider a single partition in which the producer has written messages up to a latest, largest offset of 12. Consumer A’s recorded offset is 9 and Consumer B’s is 11. The next time each consumer connects, it can pick up where it left off, start again from the beginning, or skip to the most recent record and consume only messages arriving from “now” on (see the sketch below).
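
A sketch of those three options with the Java client, assuming consumer is a KafkaConsumer<String, String> configured as in the earlier sketches and working with partition 0 of demo-topic (pick one option, not all three):

```java
import java.util.List;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("demo-topic", 0);
consumer.assign(List.of(tp));

// Option 1: pick up where it left off -- the default; poll() resumes
// from the committed offset automatically.

// Option 2: start again from the beginning.
consumer.seekToBeginning(List.of(tp));

// Option 3: skip to the most recent record and consume only from "now".
consumer.seekToEnd(List.of(tp));
```

When no committed offset exists at all, the auto.offset.reset setting ("earliest" or "latest") chooses between these behaviors automatically.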