This is the 22nd day of my participation in the Gengwen Challenge. For more details, see the Gengwen Challenge.

This is the first part of the Kafka series: what is Kafka, what does it do, and in which scenarios is it mainly used?

What is Kafka

The official address is kafka.apache.org/

  • Kafka is a distributed messaging system originally developed at LinkedIn, written in Scala, with high scalability and high throughput.
  • A Kafka cluster is composed of multiple Kafka instances, each of which is called a broker.
  • The Kafka cluster, its producers, and its consumers all rely on ZooKeeper to guarantee availability and to store the cluster's metadata.

Originally developed by LinkedIn, Kafka is a distributed messaging system that supports partitioning and replication and uses ZooKeeper for coordination. Its biggest strength is processing large volumes of data in real time to cover a wide range of scenarios: Hadoop-based batch processing, low-latency real-time systems, Storm/Spark streaming engines, web/Nginx logs, access logs, messaging services, and so on. Written in Scala, Kafka was contributed by LinkedIn to the Apache Foundation in 2010 and became a top-level open source project.

Newer versions of Kafka no longer rely on ZooKeeper and use their own built-in coordination mechanism instead. Most companies do not need the newest version and are still in wait-and-see mode; after all, who dares to be the first to eat the crab 🦀?

What can Kafka do

What are the main features of Kafka?

Apache Kafka® is a distributed streaming platform.

Stream processing platform features:

  • It lets you publish and subscribe to streams of records, much like a message queue or enterprise messaging system.
  • It stores streams of records durably and with good fault tolerance.
  • It lets you process streams of records as they are generated (see the sketch after this list).
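To make the third point concrete, here is a minimal sketch of processing records as they arrive using the Kafka Streams library that ships with Apache Kafka. The topic names input-topic and output-topic, the application id, and the localhost:9092 broker address are assumptions for illustration, not details from the original post.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-stream-app");   // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read each record from "input-topic" as it is produced, transform it, and publish the result.
        KStream<String, String> source = builder.stream("input-topic");
        source.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```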

What scenarios does Kafka fit into

  • Building real-time streaming data pipelines that reliably move data between systems or applications (this is the message-queue role).
  • Building real-time streaming applications that transform or react to those streams of data.

Typical use cases for Kafka

  • Log collection: a company can use Kafka to collect logs from its various services and expose them through a unified interface to consumers such as Hadoop, HBase, Solr, and so on.
  • Messaging system: decoupling producers and consumers, buffering messages, and so on.
  • User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These events are published by various servers to Kafka topics, and subscribers consume those topics for real-time monitoring and analysis, or load them into Hadoop or a data warehouse for offline analysis and mining.
  • Operational metrics: Kafka is also often used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feeds for alerts and reports.

Important Kafka concepts

  1. Broker
    • A message-middleware processing node. A Kafka node is a broker, and one or more brokers form a Kafka cluster.
  2. Topic
    • Kafka categorizes messages by topic; every message published to a Kafka cluster must be assigned a topic.
  3. Producer
    • A message producer: the client that sends messages to a broker.
  4. Consumer
    • A message consumer: the client that reads messages from a broker.
  5. ConsumerGroup
    • Every consumer belongs to a specific consumer group. A message can be consumed by several different consumer groups, but within a single consumer group only one consumer will consume it.
  6. Partition
    • Physically, a topic can be split into multiple partitions, and the messages within each partition are ordered. (A sketch of creating such a topic follows this list.)
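To make these terms concrete, the following is a minimal sketch (not from the original article) that uses the Java AdminClient shipped with Kafka to connect to a broker and create a topic with several partitions. The topic name demo-topic, the partition and replication counts, and the localhost:9092 address are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Address of one broker in the cluster; the client discovers the rest from it.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic named "demo-topic" with 3 partitions, each kept on 1 broker (no extra replicas).
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```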

Kafka runs as a cluster on one or more servers. Kafka categorizes streams by topic. Each record contains a key, a value, and a timestamp.

Producers send messages to the Kafka cluster over the network, and consumers then read and process them.

Communication between the servers (brokers) and the clients (producers and consumers) takes place over TCP.
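As a hedged illustration of a producer talking to the brokers over TCP, here is a minimal sketch using Kafka's Java producer client; the send() call returns metadata showing the record's partition, offset, and timestamp. The topic name and broker address are the same assumptions used above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record carries a key, a value, and a timestamp (assigned automatically here).
            RecordMetadata meta = producer.send(
                    new ProducerRecord<>("demo-topic", "user-1", "hello kafka")).get();
            System.out.printf("written to partition=%d at offset=%d, timestamp=%d%n",
                    meta.partition(), meta.offset(), meta.timestamp());
        }
    }
}
```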

Topic and message log

A topic can be understood as the name of a category: messages of the same kind are sent to the same topic. Each topic can have multiple partition log files. Topics in Kafka are always multi-subscriber; a topic can have one or more consumers subscribed to its data.

A partition is an ordered sequence of messages that are appended, in order, to a file called the commit log. Each message in a partition has a unique number, its offset, which uniquely identifies the message within that partition. Every partition has its own commit log file, so offsets are unique within a partition, but messages in different partitions may carry the same offset.
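The per-partition offsets described above can be observed directly from a consumer. The following minimal sketch (the topic name and group id are assumptions) subscribes to a topic and prints the partition and offset of every record it receives.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetPrintSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets are unique within a partition, not across the whole topic.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```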

Why partition data under Topic?

  • Commit log files are limited by the size of the machine’s file system, and after partitioning, a topic can theoretically handle any amount of data.
  • To improve parallelism.

Each consumer works from its own position (offset) in the commit log. In Kafka, this offset is maintained by the consumer itself. Normally the messages in the commit log are consumed one by one in order, but a consumer can also specify an offset to re-consume certain messages or to skip others. This means a consumer has very little impact on the cluster: adding or removing a consumer affects neither the cluster nor the other consumers, since each one maintains its own offset. The Kafka cluster is therefore largely stateless with respect to consumers, and its performance is not much affected by their number. Kafka also records a lot of key information in ZooKeeper to keep itself stateless, which makes horizontal scaling easy.
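Because the consumer owns its own offset, it can rewind or skip ahead at will. Below is a minimal sketch, with an assumed topic, partition number, and offset value, that uses assign() and seek() to re-read a partition from a chosen position.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-group");            // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("demo-topic", 0);
            // Assign the partition directly instead of subscribing, then rewind to offset 100.
            consumer.assign(Collections.singleton(partition));
            consumer.seek(partition, 100L);

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("re-read offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```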

Distribution

Log partitions are distributed across the servers in the cluster. Each server handles the partitions assigned to it, and, depending on the configuration, each partition can be replicated to other servers for fault tolerance.

Each partition has one leader and zero or more followers. The leader handles all read and write requests for the partition, while the followers passively replicate its data. If the leader goes down, one of the followers is elected as the new leader. A server may be the leader for one partition and a follower for another, which balances the load and avoids having all requests handled by just one or a few servers.
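Which broker leads each partition, and which brokers hold its follower replicas, can be inspected with the AdminClient. This is a minimal sketch under the same assumed topic name and broker address as the earlier examples.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LeaderInfoSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singleton("demo-topic"))
                    .all().get().get("demo-topic");
            for (TopicPartitionInfo p : description.partitions()) {
                // The leader handles reads and writes; the replicas are the followers copying its log.
                System.out.printf("partition=%d leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}
```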

Producers

A producer sends messages to a topic and is responsible for choosing which partition of that topic each message goes to. It can use round-robin for simple load balancing, or it can pick the partition based on a key in the message; the key-based approach is the more common one, as sketched below.
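Here is a hedged sketch of the key-based approach: with the producer's default partitioner, records that share a key are hashed to the same partition, so one user's events stay together and in order. The key user-42 and the topic name are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition under the default partitioner, so these two stay in order.
            producer.send(new ProducerRecord<>("demo-topic", "user-42", "page-view"));
            producer.send(new ProducerRecord<>("demo-topic", "user-42", "click"));
            // No key => the producer spreads records across partitions for load balancing.
            producer.send(new ProducerRecord<>("demo-topic", "search"));
        }
    }
}
```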

Consumers

There are two traditional messaging modes: queue and publish-subscribe.

  • Queue mode: multiple consumers read from the server, and each message reaches only one of them.
  • Publish-subscribe mode: each message is broadcast to all consumers.

Kafka provides a consumer abstraction based on these two patterns: the Consumer group.

  • Queue mode: all consumers belong to the same consumer group.
  • Publish-subscribe mode: every consumer (that is, every subscribing application) has its own consumer group (see the config sketch after this list).
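A minimal sketch of how the group.id setting selects between the two modes; the group names, broker address, and topic usage are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupModeSketch {
    // Build a consumer config; only group.id changes between the two modes.
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return props;
    }

    public static void main(String[] args) {
        // Queue mode: both instances share one group, so each record goes to only one of them.
        Properties workerA = consumerProps("order-workers");
        Properties workerB = consumerProps("order-workers");

        // Publish-subscribe mode: each application uses its own group, so each one sees every record.
        Properties auditSubscriber = consumerProps("audit-service");
        Properties searchSubscriber = consumerProps("search-indexer");
    }
}
```

In practice, each subscribing application picks one group id, and scaling that application out is simply a matter of starting more instances with the same group id.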

Consider a Kafka cluster of two brokers hosting four partitions (P0–P3), consumed by two consumer groups: group A has two consumer instances and group B has four. Usually a topic has several consumer groups, and each consumer group is one logical subscriber, made up of multiple consumer instances for scalability and fault tolerance.

Consumption order

Kafka has a much stronger sequential guarantee than traditional messaging systems.

At any given time, only one consumer instance within a consumer group consumes a given partition, which is what guarantees ordering. The number of consumer instances in a group therefore should not exceed the number of partitions in the topic; otherwise the extra consumers receive no messages.

Kafka only guarantees the local ordering of message consumption within a partition. It cannot guarantee the overall ordering of message consumption across multiple partitions in the same topic.

If a total order over a topic is required, it can be achieved by giving the topic a single partition and running a single consumer instance in the consumer group.

Conclusion

This article introduced some of Kafka's basic core concepts; a preliminary understanding of them is the foundation for going deeper.

You are welcome to follow my public account (MarkZoe) so we can learn from and communicate with each other.