Kafka is a fast, scalable, high-throughput distributed messaging system written in Java and Scala.

Kafka stores data to disk persistently and has its own partitioning and replication mechanism.

However, Kafka has no per-message acknowledgement for consumption, so a message may go unprocessed if a consumer crashes after reading it. Kafka is therefore not recommended for business scenarios requiring strong consistency. It is commonly used for log collection and as a buffer in front of data warehouses.

For example, a website’s browsing logs can be buffered in Kafka and then written in bulk to a data warehouse such as Elasticsearch, Hive, or HBase. This greatly reduces the load on the offline analysis system.

Architecture overview

Kafka’s architecture involves the following roles:

· Broker: Server instances in a Kafka cluster are called brokers

· Producer: a client that sends messages to the broker

· Consumer: a client that reads messages from the broker

· ZooKeeper: a registry that stores cluster state and does not handle messages themselves. It plays an important role in load balancing and cluster expansion.

Here is Kafka’s logical model:

· Message: Message is the basic unit of Kafka communication

· Topic: A topic is similar to a queue in logical structure, with each message belonging to a topic.

· Consumer Group: Each group can contain several consumer instances, and each topic can be subscribed by multiple consumer groups.

Consumer groups are identified by a unique GroupID, and each consumer instance belongs to exactly one group.

· Partition: a topic is divided into several partitions for storage, and each message belongs to exactly one partition.

· Offset: the unique identifier of each message within its partition.

Kafka ensures that all consumer groups that subscribe to a topic receive all messages from that topic.

Within a consumer group, each message of a topic is read by exactly one consumer.

If each consumer belongs to an independent consumer group, the message is read by all consumers, which implements message broadcast. If all consumers belong to the same consumer group, the message will be read by only one consumer, which implements message unicast.

Kafka does not push messages to consumers; consumers actively pull data from the broker.

Kafka has no per-message acknowledgement mechanism; each consumer controls its own consumption position.

Partitioning and message delivery

Kafka stores a topic’s data in multiple partitions, and each partition is divided into segment files stored on broker nodes.

The producer communicates with all partitions of the topic and decides which partition to write each message to based on a configured algorithm (key hash, round robin, etc.).
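The two strategies can be sketched as follows. This is illustrative logic only, not Kafka’s actual default partitioner (which hashes the key bytes with murmur2):

```python
import zlib
from itertools import count

def key_hash_partition(key: bytes, num_partitions: int) -> int:
    # The same key always maps to the same partition, which preserves
    # per-key ordering.
    return zlib.crc32(key) % num_partitions

_next_msg = count()

def round_robin_partition(num_partitions: int) -> int:
    # Keyless messages are spread evenly across all partitions.
    return next(_next_msg) % num_partitions
```

With 30 partitions, every message keyed `b"user-42"` lands on one fixed partition, while keyless messages cycle through partitions 0, 1, 2, …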

Messages within a partition are ordered, but ordering across the partitions of a topic is not guaranteed; that is, the topic is not totally ordered.

Kafka assigns partitions to the consumers listening on a topic. Within a consumer group, a partition is assigned to at most one consumer.

If the number of consumers in a group is greater than the number of partitions, some consumers will not be assigned any partition and will receive no data.

A partition can be listened on by multiple consumers belonging to different groups.

Assigning each partition to a single consumer within a group ensures that a message is consumed by only one group member without any locking, which greatly improves throughput and simplifies the broker implementation.
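A round-robin sketch of this assignment (assumed logic; Kafka’s real assignors are the range and round-robin strategies negotiated during a group rebalance):

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    # Each partition goes to exactly one consumer in the group; with more
    # consumers than partitions, the surplus consumers receive nothing.
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

For example, `assign_partitions([0, 1], ["c1", "c2", "c3"])` gives partition 0 to c1, partition 1 to c2, and leaves c3 idle.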

Consumers mark their read position via the offset and actively pull data from the partition.

The consumer sends a FETCH request to the broker, containing an offset and a max-size parameter, to read data from the partition. The consumer is therefore free to set the offset to control the read position, enabling functions such as incremental reads or reading from the beginning.
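In essence the broker just serves a slice of the partition log. A minimal sketch with hypothetical names, ignoring byte limits and networking:

```python
def fetch(partition_log: list[bytes], offset: int, max_messages: int) -> list[bytes]:
    # The broker keeps no per-consumer state: it simply returns messages
    # starting at whatever offset the consumer asks for.
    return partition_log[offset:offset + max_messages]

log = [b"m0", b"m1", b"m2", b"m3"]
incremental = fetch(log, 2, 10)  # continue from where we left off
from_start = fetch(log, 0, 10)   # re-read everything from the beginning
```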

When a consumer subscribes to a topic, Kafka notifies the consumer of the latest offset.

Consumers can report their current offset back to Kafka, which saves the state in ZooKeeper, allowing consumers to leave or rejoin freely and resume consumption where they left off.

Kafka has no message acknowledgement mechanism; consumption is driven entirely by the consumer’s offset. As a result, the Kafka broker does not need to maintain per-message state, which helps improve throughput.

Unlike many message queuing systems, Kafka does not delete messages once they are consumed; instead it deletes old messages or oversized partition segments according to a configured retention time or size limit.
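Retention is controlled in server.properties. The keys below are Kafka’s standard retention settings; the values are example choices:

```properties
# Delete segments once their messages are older than 7 days...
log.retention.hours=168
# ...or once a partition's log exceeds 1 GB (per-partition limit)
log.retention.bytes=1073741824
# Size at which a new segment file is rolled
log.segment.bytes=536870912
```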

Replication

Since version 0.8, Kafka supports a replica mechanism. Each topic is divided into multiple partitions, and each partition has multiple replicas.

These replicas are distributed across different broker nodes to reduce the impact of a single broker failure on system availability.

Kafka’s replica distribution strategy is to store the j-th replica of the i-th partition on broker (i + j) % n in a cluster of n broker nodes.
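This placement rule can be written out directly (a sketch of the stated formula only, not Kafka’s full rack-aware assignment logic):

```python
def replica_brokers(partition: int, num_replicas: int, num_brokers: int) -> list[int]:
    # The j-th replica of partition i lives on broker (i + j) % n, so the
    # replicas of one partition always land on distinct brokers.
    return [(partition + j) % num_brokers for j in range(num_replicas)]
```

For example, with 5 brokers and 3 replicas, partition 4’s replicas land on brokers 4, 0, and 1.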

Among the replicas of a partition, one is the leader. Producers and consumers interact only with the leader replica; the other replicas synchronize data from the leader.

Kafka provides two master-slave replication mechanisms:

· Synchronous replication: a message is committed only after all alive replicas of the partition have written it. This method guarantees consistency but greatly reduces throughput.

· Asynchronous replication: a message is committed as soon as the partition’s leader replica writes it, and the other replicas synchronize the data asynchronously. Throughput is high in this mode but consistency is weaker: a leader crash may cause message loss.

Kafka uses two criteria to determine whether a replica is alive:

· ZooKeeper heartbeat mechanism: the broker must maintain its ZooKeeper session

· A follower replica’s lag in replicating data from the leader must not exceed a threshold.
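Combined, the two criteria amount to a check like the following. This is illustrative only; the names are hypothetical, and the lag threshold corresponds to Kafka’s configurable replica lag settings:

```python
def replica_is_alive(zk_session_active: bool, messages_behind_leader: int, max_lag: int) -> bool:
    # A replica counts as alive only while its ZooKeeper session is maintained
    # and its replication lag behind the leader stays within the threshold.
    return zk_session_active and messages_behind_leader <= max_lag
```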

Trying out Kafka

Installing Kafka

· ZooKeeper can use the official Docker image: docker run -d zookeeper

· Run wurstmeister/kafka-docker with docker-compose

· Running Kafka with Docker is covered in “Kafka learning under Docker, part one of a trilogy: a quick taste of Kafka”

· For Ubuntu 14.04, see “How To Install Apache Kafka on Ubuntu 14.04”

· On macOS, Homebrew can install it directly

Here the author chooses Homebrew for the installation.

brew install kafka

The configuration files are /usr/local/etc/kafka/server.properties and /usr/local/etc/kafka/zookeeper.properties.

Start ZooKeeper:

zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties

Start Kafka:

kafka-server-start /usr/local/etc/kafka/server.properties

Command-line tools

Create a topic:

kafka-topics --zookeeper localhost:2181 --create --topic test --partitions 30 --replication-factor 2

· zookeeper: the address of the ZooKeeper service the cluster depends on

· topic: the topic name

· partitions: the number of partitions for the topic

· replication-factor: the number of replicas per partition

View topic information:

kafka-topics --zookeeper localhost:2181 --describe --topic test

Delete the topic:

kafka-topics --zookeeper localhost:2181 --delete --topic test

View all topics:

kafka-topics --zookeeper localhost:2181 --list

Send a message:

kafka-console-producer --broker-list localhost:9092 --topic test

Receive new messages:

kafka-console-consumer --zookeeper localhost:2181 --topic test

Read messages from the beginning:

kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning