What is Kafka

Kafka is officially defined as a high-performance, open source, distributed publish/subscribe message queue

Based on the official definition, Kafka has the following characteristics

  • High performance
  • Open source
  • Distributed
  • Based on publish/subscribe

With Kafka Streams, however, Kafka can not only distribute and store messages but also ingest and process them. Kafka is therefore now officially defined as a stream processing platform

Although Kafka supports stream processing, most people still use it primarily as a message queue. When stream processing is required, there are many other options such as Flink, Storm and Spark Streaming, so Kafka Streams is not necessarily the one that gets chosen

A brief history of Kafka

  • Kafka was originally developed by LinkedIn

  • In 2011 LinkedIn donated Kafka to the Apache Software Foundation. The first open source version released there, 0.7.0, supports data compression and mirroring data across clusters

  • On October 23, 2012 Kafka graduated from the Apache Incubator and officially became an Apache top-level project. Version 0.8.0 followed after graduation

Kafka programming language

Kafka is written in Scala and Java, runs on the JVM and is compatible with existing Java programs, so deploying Kafka requires installing a JDK first

Important versions and features of Kafka

See Kafka version evolution

Core Kafka concepts

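Kafka is a publish/subscribe based message queue, so the following concepts are essential

  • Topic: a message topic, the counterpart of a Queue in a standard MQ
  • Producer: a client that produces (publishes) messages
  • Consumer: a client that consumes (subscribes to) messages
  • Consumer Group: a group of Consumers that share the work of consuming a Topic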
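To make the relationship between these four concepts concrete, here is a minimal in-memory sketch. It is not Kafka's API: the `Topic` class, its methods and the group and consumer names are all hypothetical. It also simplifies delivery to a per-message rotation inside each group, whereas real Kafka hands out whole partitions to group members. The point it illustrates is that every Consumer Group sees every message, while inside a group each message goes to only one Consumer.

```python
from collections import defaultdict
from itertools import cycle


class Topic:
    """Toy in-memory topic (hypothetical, not the Kafka client API)."""

    def __init__(self, name):
        self.name = name
        self.groups = {}  # group id -> list of consumer names

    def subscribe(self, group_id, consumer):
        # A consumer joins a consumer group that subscribes to this topic.
        self.groups.setdefault(group_id, []).append(consumer)

    def publish(self, messages):
        """Deliver each message once per group, rotating over the group's members.

        Simplification: real Kafka assigns partitions to members instead of
        rotating individual messages.
        """
        delivered = defaultdict(list)  # (group id, consumer) -> messages received
        cursors = {g: cycle(members) for g, members in self.groups.items()}
        for message in messages:
            for group_id, cursor in cursors.items():
                delivered[(group_id, next(cursor))].append(message)
        return dict(delivered)


topic = Topic("orders")
topic.subscribe("billing", "billing-1")
topic.subscribe("billing", "billing-2")
topic.subscribe("audit", "audit-1")

# The producer side: publish four messages to the topic.
delivered = topic.publish(["m0", "m1", "m2", "m3"])
```

In this sketch the two `billing` consumers split the four messages between them, while the single `audit` consumer receives all four.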

Kafka uses a client/server (C/S) architecture, in which Producers and Consumers are the clients

Kafka is also a distributed system, so there are multiple instances in a cluster, and each instance is called a Broker by Kafka

In addition, Kafka stores the messages of a Topic in a distributed manner. Kafka divides a Topic's messages into several parts that are stored on different Brokers. Each of these parts is called a Partition by Kafka
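The idea of splitting a Topic across Brokers can be sketched in a few lines. This is only an illustration under stated assumptions: `choose_partition` and `place_partitions` are hypothetical helpers, CRC32 stands in for whatever hash the real client uses, and the round-robin placement is a simplification of how a cluster actually assigns Partitions.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition number.

    CRC32 is a stand-in hash chosen for this sketch, not Kafka's own.
    The same key always lands in the same partition.
    """
    return zlib.crc32(key) % num_partitions


def place_partitions(num_partitions, brokers):
    """Spread partitions over brokers round-robin (simplified placement)."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


brokers = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
placement = place_partitions(3, brokers)
partition = choose_partition(b"user-42", 3)
home_broker = placement[partition]
```

Because the partition is derived from the key, all messages with the same key end up in the same Partition and therefore on the same Broker.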

Storing only a single copy of a Partition is not safe. For data safety and high availability, Kafka can synchronize multiple copies of a Partition to different Brokers. Kafka calls each copy of a Partition a Replication (replica)
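The requirement that copies of a Partition live on different Brokers can be sketched as follows. The `assign_replicas` helper and the broker names are hypothetical, and the simple "next broker in the ring" placement is an assumption made for illustration; Kafka's real assignment logic is more involved.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers.

    Simplified ring placement for illustration only; it guarantees that no
    broker holds two replicas of the same partition.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [
            brokers[(p + r) % len(brokers)] for r in range(replication_factor)
        ]
    return assignment


brokers = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
assignment = assign_replicas(3, brokers, replication_factor=2)
```

With three Brokers and two copies per Partition, losing any single Broker still leaves one copy of every Partition available in this sketch.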

A Partition is a logical concept in Kafka. Logically, a Partition is stored on a Broker, but physically the data must be stored on disk. Kafka splits a Partition into multiple pieces and stores them in the file system. Each piece consists of two files: xxx.index and xxx.log. Kafka calls each of these pieces a Segment
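A rough sketch of how Segments might be located on disk is shown below. It assumes that each Segment's files are named after the offset of the first message it holds, zero-padded to 20 digits; that naming detail and the helper names `segment_files` and `find_segment` are not taken from the text above and should be read as assumptions.

```python
import bisect


def segment_files(base_offset):
    """Return the index and log file names for a segment (assumed naming)."""
    name = f"{base_offset:020d}"
    return [f"{name}.index", f"{name}.log"]


def find_segment(base_offsets, offset):
    """Return the base offset of the segment that holds `offset`.

    `base_offsets` must be sorted ascending; a binary search finds the last
    segment whose base offset is not greater than the requested offset.
    """
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} is before the first segment")
    return base_offsets[i]


base_offsets = [0, 1000, 2500]  # hypothetical segments of one partition
base = find_segment(base_offsets, 1200)
files = segment_files(base)
```

Here offset 1200 falls into the Segment that starts at offset 1000, so only that Segment's index and log files need to be opened.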

Finally, Kafka records the progress of producing and consuming messages. Kafka calls this the Offset, which you can think of simply as an auto-increment primary key in a relational database
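The auto-increment analogy can be sketched with a toy append-only log. The `PartitionLog` and `ConsumerPosition` classes are hypothetical and keep everything in memory; they only illustrate that each appended message gets the next Offset and that a Consumer's progress is simply the next Offset it will read.

```python
class PartitionLog:
    """Toy append-only log for one partition (in memory, illustration only)."""

    def __init__(self):
        self.records = []

    def append(self, value):
        # The offset behaves like an auto-increment key: 0, 1, 2, ...
        offset = len(self.records)
        self.records.append(value)
        return offset

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]


class ConsumerPosition:
    """Tracks how far one consumer has read in a partition."""

    def __init__(self):
        self.committed = 0  # next offset to read

    def poll(self, log, max_records=10):
        batch = log.read(self.committed, max_records)
        self.committed += len(batch)  # advance progress past what was read
        return batch


log = PartitionLog()
offsets = [log.append(v) for v in ["a", "b", "c"]]

consumer = ConsumerPosition()
first = consumer.poll(log, max_records=2)
second = consumer.poll(log, max_records=2)
```

After the two polls the consumer has read all three messages and its position equals the log's next Offset, so a further poll returns nothing until a producer appends more.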

To summarize the concepts mentioned above

Architecture concepts

  • Broker: An instance of Kafka

Topic message concepts

  • Partition: one part of a Topic's messages
  • Replication: logically a copy of a Partition; it is essentially the same as a Partition and holds the same Topic messages
  • Offset: the progress of producing and consuming messages

Data storage concepts

  • Segment: one piece of a Partition's data as stored in the file system

Of course, there are more concepts in Kafka, such as Leader Replication, Follower Replication, etc. These concepts are discussed in detail in the corresponding chapters

Kafka in detail

Broker

See Kafka Broker

Partition and Replication

See Kafka Partition and Replication

Segment

See Kafka Segment

Producer

See Kafka Producer

Consumer

See Kafka Consumer