This is the 18th day of my participation in the Gwen Challenge.

Background

I have recently been preparing to study the Kafka source code, so I am first posting a summary of the Kafka notes accumulated from earlier projects.

Kafka

Usage scenarios for Kafka

  • Log collection: a company can use Kafka to collect logs from many services and expose them through a unified interface to consumers such as Hadoop, HBase, and Solr.
  • Messaging system: decouples producers from consumers, buffers messages, and so on.
  • User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These events are published by various servers to Kafka topics, which subscribers consume for real-time monitoring and analysis, or load into Hadoop or a data warehouse for offline analysis and mining.
  • Operational metrics: Kafka is also used to record operational monitoring data, collecting metrics from various distributed applications and producing centralized feeds for operations such as alerting and reporting.
  • Stream processing: for example, Spark Streaming and Storm.
  • Event sourcing.

Advantages of Kafka

  • Reliability: a distributed messaging system with partitioning, replication, and fault-tolerance mechanisms.
  • Scalability: the messaging system supports hot scaling of the cluster.
  • High performance: guarantees high throughput for both publishing and subscribing, and maintains stable performance even with terabytes of stored data.

Kafka’s broker

A Kafka cluster consists of one or more servers, whose nodes are called brokers. Brokers store the data for topics. Suppose a topic has N partitions (and, for simplicity, a single replica per partition):

  • If the cluster has exactly N brokers, each broker stores one partition of that topic.
  • If the cluster has N+M brokers, N brokers each store one partition and the remaining M brokers store no partition data for that topic.
  • If the cluster has fewer than N brokers, some brokers store more than one partition of that topic. Avoid this situation in production, as it can cause data imbalance in the Kafka cluster.
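The three cases above can be illustrated with a small sketch. Note that this is a simplified round-robin model for illustration, not Kafka's actual partition-assignment algorithm, and the broker names are made up:

```python
# Hypothetical sketch: distribute N partitions of a topic across a list of
# brokers round-robin, to show how partition counts relate to broker counts.
def assign_partitions(num_partitions, brokers):
    """Return {broker: [partition indices]} using simple round-robin."""
    assignment = {b: [] for b in brokers}
    for p in range(num_partitions):
        broker = brokers[p % len(brokers)]
        assignment[broker].append(p)
    return assignment

# N partitions, N brokers: one partition per broker
print(assign_partitions(3, ["b0", "b1", "b2"]))
# Fewer brokers than partitions: some brokers hold several partitions
print(assign_partitions(4, ["b0", "b1"]))
# More brokers than partitions: some brokers hold none
print(assign_partitions(2, ["b0", "b1", "b2"]))
```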

Message expiration mechanism

The Kafka cluster retains all messages, whether or not they have been consumed. We can set an expiration time for messages so that expired data is automatically cleared to free disk space. For example, if the expiration time is set to 2 days, every message from the last 2 days is kept in the cluster, and data is cleared only once it is older than 2 days.
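The 2-day example above would be expressed in the broker configuration roughly as follows (the values here are illustrative; property names are from Kafka's `server.properties`):

```properties
# server.properties -- time-based retention (illustrative values)
log.retention.hours=48                   # delete log segments older than 2 days
log.retention.check.interval.ms=300000   # how often the retention check runs
```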

Partition storage distribution within a topic

  1. In Kafka's file storage, a topic has multiple partitions, and each partition is a directory. The partition directory is named as the topic name plus an ordered number: partition numbers start at 0 and run up to (number of partitions - 1).
  2. Messages are sent to a topic, which is essentially a directory, and a topic is composed of partitions.
  3. A partition is a queue-like structure. Messages within each partition are ordered: they are continuously appended to the partition, and each message is assigned a unique offset.
  4. Kafka only maintains offsets within partitions; the offset identifies which messages in a partition a consumer has read. Each time a consumer consumes a message, its offset advances by one. The consumption state is controlled entirely by the consumer, which can track and reset its offset, so a consumer can read from any position.
  5. Storing the message log as partitions serves multiple purposes. First, it makes scaling in a cluster easy: each partition can be sized to fit the machine it resides on, and since a topic can be composed of many partitions, the cluster as a whole can accommodate data of any size. Second, it improves concurrency, because reads and writes happen per partition.
  6. Data in Kafka is persistent and fault-tolerant. Kafka lets users set the number of replicas per topic, which determines how many brokers the written data is copied to. With a replica count of 3, each piece of data is stored on 3 different machines, tolerating the failure of up to 2 of them. A replica count of at least 2 is generally recommended, so that consumption is not interrupted when machines are added, removed, or restarted; for stronger durability requirements, set the replica count to 3 or more.
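Points 3 and 4 can be modeled in a few lines. This is a minimal in-memory sketch of a partition as an append-only log with consumer-controlled offsets, not Kafka code:

```python
# Illustrative model: a partition is an append-only log; the consumer, not the
# broker, tracks (and may reset) its own read position.
class Partition:
    def __init__(self):
        self.log = []                 # messages in append order

    def append(self, msg):
        offset = len(self.log)        # each message gets a unique, increasing offset
        self.log.append(msg)
        return offset

class Consumer:
    def __init__(self, partition):
        self.partition = partition
        self.offset = 0               # position is owned by the consumer

    def poll(self):
        if self.offset < len(self.partition.log):
            msg = self.partition.log[self.offset]
            self.offset += 1          # advance after consuming
            return msg
        return None

    def seek(self, offset):
        self.offset = offset          # consumer may reset and re-read

p = Partition()
for m in ["a", "b", "c"]:
    p.append(m)
c = Consumer(p)
print(c.poll(), c.poll())  # a b
c.seek(0)
print(c.poll())            # a again after resetting the offset
```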

Partition segment file storage structure

The producer sends messages to a topic, and the messages are evenly distributed across its partitions (randomly, or according to a partitioner callback specified by the user). When a Kafka broker receives a message, it appends it to the last segment of the partition. When the number of messages in a segment reaches a configured value or the elapsed time exceeds a threshold, the segment is flushed to disk; only messages flushed to disk can be consumed by consumers. When a segment reaches a certain size, the broker creates a new segment and no longer writes to the old one.
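A simplified sketch of segment rolling is shown below. The rolling condition here (a fixed message count) is an assumption for illustration; the 20-digit zero-padded naming, where each segment file is named after the offset of its first message, does match Kafka's on-disk convention:

```python
# Illustrative sketch of segment rolling: start a new segment once the active
# one is full. Each segment file is named after its base offset, zero-padded
# to 20 digits, as Kafka does on disk.
def segment_name(base_offset):
    return f"{base_offset:020d}.log"

def roll_segments(messages, max_per_segment):
    """Split a stream of messages into segments of at most max_per_segment."""
    segments = []
    for i, msg in enumerate(messages):
        if i % max_per_segment == 0:          # active segment full -> roll a new one
            segments.append((segment_name(i), []))
        segments[-1][1].append(msg)
    return segments

segs = roll_segments([f"m{i}" for i in range(5)], max_per_segment=2)
print([name for name, _ in segs])
# ['00000000000000000000.log', '00000000000000000002.log', '00000000000000000004.log']
```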

Message Deletion Policy

  1. In a traditional message queue, messages that have been consumed are deleted; the Kafka cluster, by contrast, retains all messages, whether or not they have been consumed.
  2. Kafka provides two strategies for deleting old data: (1) time-based; (2) based on partition file size. Only expired data is automatically cleared to free disk space.
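Both strategies map to broker configuration properties; a segment is eligible for deletion once either limit is exceeded (the values below are illustrative):

```properties
# server.properties -- both deletion strategies (illustrative values)
log.cleanup.policy=delete        # delete old segments (as opposed to "compact")
log.retention.hours=168          # (1) time-based: keep data for 7 days
log.retention.bytes=1073741824   # (2) size-based: cap each partition at 1 GB
```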