From the Kafka quickstart: www.cnblogs.com/tree1123/p/…

That post covered Kafka's basic deployment and usage, but how does Kafka differ from other message-oriented middleware?

What are Kafka's design rationale, terminology, and version history? What exactly is Kafka?

One, Introduction to Kafka

kafka.apache.org/intro

LinkedIn open-sourced Kafka in 2011; version 1.0 was released on November 1, 2017, and version 2.0 on July 30, 2018.

Refer to the chart on the official website:

Kafka® is used to build real-time data pipelines and streaming applications. It is horizontally scalable, fault tolerant, extremely fast, and is in production at thousands of companies.

Apache Kafka® is a distributed streaming platform


Introduction:

Three characteristics:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
  • Store streams of records in a fault-tolerant durable way.
  • Process streams of records as they occur.

In short: messaging, storage (persistence), and stream processing.

Two types of applications:

  • Building real-time streaming data pipelines that reliably get data between systems or applications

  • Building real-time streaming applications that transform or react to the streams of data

Several concepts:

  • Kafka is run as a cluster on one or more servers that can span multiple datacenters.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

Four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

Clients communicate with the servers over a simple TCP protocol, and client libraries are available in many languages.

Topics and Logs

A topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log.

Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log.

Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies it within the partition.

The Kafka cluster durably retains all published records, whether or not they have been consumed, for a configurable retention period.

Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
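Retention is configured on the broker and can be overridden per topic. A sketch of the relevant broker settings, assuming a stock server.properties:

```properties
# Keep records for 7 days (the default), whether consumed or not.
log.retention.hours=168
# Optionally also cap retention by size per partition; -1 = no limit.
log.retention.bytes=-1
```

Individual topics can deviate from the broker default via the topic-level retention.ms setting.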

The only metadata retained by each consumer is the offset or position of that consumer in the log.

This offset is controlled by the consumer: normally the consumer increases its offset linearly as it reads the record, but in fact, because the consumer controls the position, it can consume the records in any order it likes. For example, a consumer can reset to an older offset to reprocess past data, or jump to a recent record and consume from “now.”

This makes consumers very cheap: they can come and go without much impact on the cluster or on other consumers.
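As a minimal sketch of the offset mechanics above (an in-memory model, not Kafka's actual on-disk log): the log assigns sequential offsets on append, and the consumer holds, and may freely reset, its own position:

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition's commit log."""

    def __init__(self):
        self.records = []  # append-only; records are never mutated

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to this record

    def read(self, offset):
        return self.records[offset]


log = PartitionLog()
for r in ["a", "b", "c"]:
    log.append(r)

# The consumer's only metadata is its own offset, which it controls.
offset = 0
first = log.read(offset)     # "a": normal linear consumption
offset += 1
offset = 0                   # reset to an older offset to reprocess
replayed = log.read(offset)  # "a" again
```

Note that reading does not remove anything from the log, which is what lets many consumers (and replays) coexist.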

Producers:

Producers publish data to topics of their choice.

The producer chooses which partition within the topic each record goes to; this can be done round-robin style for load balancing or according to a semantic partition function (for example, based on the record key).
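A simplified model of that choice (the real Java client hashes the key bytes with murmur2 and, since 2.4, batches keyless records "stickily"; Python's built-in hash and a plain round-robin stand in here):

```python
import itertools

NUM_PARTITIONS = 3  # assumed topic layout for this sketch
_keyless = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    """Keyed records always map to the same partition, preserving
    per-key order; keyless records are spread round-robin for load
    balancing."""
    if key is None:
        return next(_keyless)
    return hash(key) % NUM_PARTITIONS

# Same key, same partition: one consumer sees user-42's events in order.
assert choose_partition("user-42") == choose_partition("user-42")
```

Keying by entity ID is the usual way to get per-entity ordering while still spreading unrelated traffic across partitions.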

Consumer:

Consumer groups

Traditional messaging has two models, each with a drawback: a queue lets a pool of consumers divide up the processing, but it is not multi-subscriber (once one consumer reads a record, it is gone from the queue); publish-subscribe delivers every record to every subscriber, so there is no way to scale out processing.

Kafka's consumer group concept generalizes both models and solves these problems.

Kafka guarantees that each partition is consumed by exactly one consumer in a group, which reads that partition's records in order; because a topic has many partitions, the load is still balanced across many consumer instances.

As a storage system

As a stream processing system

Two, Common Uses

kafka.apache.org/uses

Messaging

Kafka can replace more traditional message brokers. Message brokers are used for many reasons (to separate processing from the data generator, to buffer unprocessed messages, and so on). Kafka has better throughput, built-in partitioning, replication, and fault tolerance than most messaging systems, making it an ideal solution for large-scale messaging applications.

In our experience, messaging uses are often relatively low-throughput, but may require low end-to-end latency, and often depend on the strong durability guarantees that Kafka provides.

In this area Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Site Activity Tracking

Site activity (page views, searches, or other actions a user might take) is published to central topics, with one topic per activity type. These feeds can be processed and monitored in real time, or loaded into Hadoop or offline data warehouse systems for offline processing and reporting.

Metrics

Kafka is typically used for operational monitoring data.

Log Aggregation

Many people use Kafka as a replacement for log aggregation solutions. Log aggregation typically collects physical log files from servers and puts them in a central location, perhaps a file server or HDFS, for processing. Kafka abstracts away the details of files and presents log or event data more cleanly as a stream of messages.

Stream processing

Starting with 0.10.0.0, Kafka includes a lightweight but powerful stream processing library called Kafka Streams.

Three, Official Documentation: Core Mechanics

kafka.apache.org/documentati…

The Introduction and Quickstart sections have already been covered.

Ecosystem: Kafka has a rich ecosystem: connectors that feed systems such as Elasticsearch and other databases directly, other stream processing frameworks, and various management tools.

Confluent is a company that specializes in the Kafka ecosystem.

cwiki.apache.org/confluence/…

Kafka Connect, Kafka Streams, and management tooling.

Several problems Kafka's design addresses:

Throughput: disk reads and writes go through the operating system's page cache rather than hitting the disk directly.

Message persistence: again, this relies on the append-only log and its offset design.

Load balancing and fault tolerance: the partition and replica mechanism.

Kafka performs better when deployed on Linux, where it can take advantage of zero-copy transfers and the epoll-based network layer.

Messages: a Kafka message consists of a key, a value, and a timestamp; the header also carries a version (magic) number and compression attributes:

CRC | version (magic) | attributes | timestamp | key length | key | value length | value

Kafka stores messages in this compact binary format rather than as Java objects, saving space and avoiding JVM object overhead.
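The layout can be illustrated by packing a message by hand. This sketch follows the v1 (pre-0.11) message format and is for illustration only; real Kafka also encodes null keys and values as length -1, and has since moved to a record-batch format:

```python
import struct
import zlib

def pack_message_v1(key: bytes, value: bytes, timestamp_ms: int) -> bytes:
    """Pack one message as: CRC | magic | attributes | timestamp |
    key length | key | value length | value (big-endian). The CRC-32
    covers everything after the CRC field itself."""
    magic, attributes = 1, 0  # attributes' low bits select the compression codec
    body = struct.pack(">bbq", magic, attributes, timestamp_ms)
    body += struct.pack(">i", len(key)) + key
    body += struct.pack(">i", len(value)) + value
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", crc) + body

msg = pack_message_v1(b"k", b"v", 1_500_000_000_000)
# Receiver-side integrity check: recompute the CRC over the body.
assert struct.unpack(">I", msg[:4])[0] == zlib.crc32(msg[4:]) & 0xFFFFFFFF
```

The CRC lets brokers and consumers detect corruption without deserializing anything.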

Topics and partitions:

This is Kafka's core and most important mechanism, and what sets it apart from other systems.

An offset is a position within a partition.

The triple (topic, partition, offset) uniquely identifies a message.

On the producer side, a partition's offset is simply the position of the most recently appended message.

A consumer maintains its own offset: it can start from the earliest offset, start from the latest, or resume from wherever it previously left off.

If a group has more consumers than partitions, some consumers will sit idle; if it has fewer, consumers share the partitions and the load is balanced.

Because Kafka's design does not allow concurrent consumption of a single partition within a group, the number of consumers should not exceed the number of partitions.
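Those sizing rules can be sketched with a toy assignor (illustrative; the real assignment strategies, such as range, round-robin, and sticky, are pluggable in the consumer):

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer in the group,
    dealing them out round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Fewer consumers than partitions: load is balanced.
print(assign([0, 1, 2, 3], ["c1", "c2"]))   # {'c1': [0, 2], 'c2': [1, 3]}
# More consumers than partitions: c3 sits idle.
print(assign([0, 1], ["c1", "c2", "c3"]))   # {'c1': [0], 'c2': [1], 'c3': []}
```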

If a consumer reads from multiple partitions, there is no ordering guarantee across partitions: Kafka only guarantees order within a single partition, so records from different partitions may interleave differently depending on the order in which they are read.

Adding or removing consumers, brokers, or partitions triggers a rebalance; after the rebalance, the partitions assigned to each consumer may change.

Consumer groups exist so that different groups can each consume a partition's messages independently at the same time.

Replicas

Replication exists so that data survives when a server goes down.

There are two kinds of replica: the leader replica and follower replicas.

Only the leader replica serves clients.

If the broker hosting the leader replica goes down, a new leader is elected from the followers.

Kafka ensures that multiple replicas of a partition are never placed on the same broker.

Followers synchronize with the leader in real time.

ISR

The In-Sync Replica set: the replicas that are kept synchronized with the leader replica.

Normally all replicas are in the ISR, but a replica that responds too slowly is kicked out of the ISR; once it catches up, it is added back in.

As long as at least one replica in the ISR remains alive, committed messages are not lost.

A message is in the committed state only once every replica in the ISR has received it.
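A toy model of those rules (illustrative; real brokers judge lag by time via replica.lag.time.max.ms, not by a fixed message count):

```python
MAX_LAG = 2  # assumed lag threshold for this sketch

def update_isr(leader_end, follower_offsets):
    """Followers within MAX_LAG of the leader's log end stay in the ISR;
    stragglers are dropped until they catch up."""
    return {f for f, off in follower_offsets.items()
            if leader_end - off <= MAX_LAG}

def high_watermark(leader_end, follower_offsets, isr):
    """A message is committed once every ISR member has it, i.e. up to
    the smallest log-end offset among the leader and the ISR."""
    return min([leader_end] + [follower_offsets[f] for f in isr])

followers = {"f1": 10, "f2": 9, "f3": 4}   # f3 has fallen far behind
isr = update_isr(10, followers)            # f3 is dropped from the ISR
print(sorted(isr))                         # ['f1', 'f2']
print(high_watermark(10, followers, isr))  # 9
```

Once f3 catches back up to within the lag threshold, update_isr admits it again, matching the "catch up and add back in" behavior described above.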

For more real-time computing related technical posts, follow real-time streaming computing