The first thing that stands out about Kafka is that it is fast, remarkably fast. According to published figures, Kafka processes more than 1 trillion messages per day at LinkedIn, peaking at more than 1 million messages per second. Even with low memory and CPU usage, Kafka can ingest and persist on the order of 100,000 messages per second on ordinary hardware. So how does Kafka do it?



Kafka overview


Kafka is a distributed, publish/subscribe-based messaging system. Originally developed at LinkedIn, it served as the basis for LinkedIn's activity streams and operational data processing pipelines. It later became an Apache project and is primarily used to process streams of activity data.


Kafka has the following features:


1. Message persistence: messages are persisted through an O(1) disk data structure that maintains stable performance over the long term, even with terabytes of stored messages.

2. High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.

3. Partitioning: messages can be partitioned across Kafka servers and consumed in a distributed fashion by clusters of consumer machines.

4. Parallel loading: Kafka supports parallel data loading into Hadoop.



Kafka architecture





The overall architecture of Kafka is very simple; it is an explicitly distributed architecture. There can be multiple producers, brokers (Kafka servers), and consumers, with producers and consumers implementing Kafka's registration interfaces. Data is sent from producers to brokers, which act as intermediary buffers and distributors: the broker delivers data to the consumers registered with the system. In effect, the broker serves as a cache between active data and an offline processing system.



Kafka's terminology is explained as follows:


1. Producer: a producer of messages; a terminal or service that publishes messages to the Kafka cluster.

2. Broker: a server in the Kafka cluster.

3. Topic: the category to which every message published to a Kafka cluster belongs; that is, Kafka is topic-oriented.

4. Partition: a physical concept. Each topic contains one or more partitions, and the partition is the unit in which Kafka allocates data.

5. Consumer: a terminal or service that consumes messages from a Kafka cluster.

6. Consumer group: in the high-level consumer API, each consumer belongs to a consumer group. Each message can be consumed by only one consumer within a given consumer group, but it can be consumed by multiple consumer groups.

7. Replica: a copy of a partition, used to ensure the partition's high availability.

8. Leader: a role among a partition's replicas; producers and consumers interact only with the leader.

9. Follower: a role among a partition's replicas; a follower copies data from the leader.

10. Controller: one of the servers in the Kafka cluster, responsible for leader election and various kinds of failover.

11. ZooKeeper: Kafka uses ZooKeeper to store cluster metadata.


Kafka has four core APIs: the Producer API, the Consumer API, the Streams API, and the Connect API. Communication between clients and servers is based on a simple, high-performance, programming-language-independent TCP protocol.
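To make the Producer API concrete, here is a minimal sketch in Java. The broker address localhost:9092, the topic name user-events, and the key/value strings are assumptions for illustration, not details from the original article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key determines the target partition: records with the
            // same key always land in the same partition, preserving their order.
            producer.send(new ProducerRecord<>("user-events", "user-42", "clicked-article"));
        } // close() flushes any buffered records before exiting
    }
}
```

Because the record key determines the target partition, all events for the same key (here, the same user) arrive in one partition in order.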





Since every message is appended to its partition, writes are sequential disk writes, which is very efficient (sequential writes are far more efficient than random writes, and this is an important contributor to Kafka's high throughput). The structure of a Kafka partition log is as follows:






Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in each partition are each assigned a sequential ID number called the offset, which uniquely identifies each record within its partition.
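To see offsets and consumer groups in action, here is a matching consumer sketch, again a minimal illustration; the topic user-events and the group ID analytics are assumptions carried over from the producer sketch above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");               // consumers sharing this ID split the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // start from the beginning if no offset is stored

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Each record carries the partition it came from and its offset within that partition.
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

Every consumer that subscribes with the same group.id shares the topic's partitions, so each record is handled by exactly one member of the group, while a second group with a different ID would receive the full stream again.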



Kafka application scenarios


Message queue

Compared with most messaging systems, Kafka offers better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message-processing applications.


Messaging workloads typically have relatively low throughput requirements but demand very low end-to-end latency, and they often rely on the strong persistence guarantees Kafka provides. In this area Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.


Tracking behavior

Another use of Kafka is to track user behavior: browsing, searches, and other actions are published to a topic in publish/subscribe mode. Subscribers can then pick up these events for further real-time processing or real-time monitoring, or load them into a Hadoop-based offline data warehouse for processing.


Meta-information monitoring

Kafka can also serve as a module for monitoring operational records, that is, for collecting and recording operational information; this can be understood as monitoring data of an operations-and-maintenance nature.


Log collection

There are many open-source products for log collection, including Scribe and Apache Flume, and many people use Kafka for log aggregation instead.


Log aggregation typically collects log files from servers and puts them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and models the data more cleanly as a stream of individual log or event messages.


This allows for lower processing latency in Kafka and makes it easier to support multiple data sources and distributed data consumption. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally high performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.


Stream processing

This scenario is very common and easy to understand: collected stream data is saved for subsequent processing by Storm or other stream-computing frameworks. Many users stage, aggregate, enrich, or otherwise move data from an original topic into a new topic for further processing.


For example, an article-recommendation pipeline might grab article content from an RSS feed and drop it into a topic called "articles." Follow-up steps might then clean the content, for example normalizing the data or removing duplicates, before the matching results are finally returned to the user.
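As an illustrative sketch of such a pipeline, the following uses Kafka Streams (Kafka's own stream-processing API, rather than Storm or Samza). The topic names articles and articles-cleaned are assumptions, and the stateful de-duplication step is omitted for brevity:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ArticleCleaner {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleaner");   // hypothetical application ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> articles = builder.stream("articles"); // raw articles from the RSS fetcher
        articles.mapValues(body -> body.trim())                        // "cleaning" step: normalize the text
                .filter((id, body) -> !body.isEmpty())                 // drop records with no usable content
                .to("articles-cleaned");                               // publish to a new topic for downstream matching

        new KafkaStreams(builder.build(), props).start();
    }
}
```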


This creates a series of real-time data-processing stages flowing out of a single topic. Storm and Samza are well-known frameworks that implement this type of data transformation.


Event sourcing

Event sourcing is an application design style in which state changes are logged as a time-ordered sequence of records. Kafka's ability to store very large amounts of log data makes it an excellent backend for applications built this way; a news feed is a natural example.


Persistent log (Commit log)

Kafka can serve as an external commit log for a distributed system. Such a log helps replicate data between nodes and acts as a resynchronization mechanism for failed nodes to restore their data. The log-compaction feature in Kafka supports this usage; in this role, Kafka is similar to the Apache BookKeeper project.
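As a sketch of how this is set up, the snippet below creates a log-compacted topic with Kafka's AdminClient; the topic name, partition count, and replication factor are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact tells Kafka to keep at least the latest
            // record per key instead of discarding data purely by age.
            NewTopic stateTopic = new NewTopic("account-state", 3, (short) 1)    // name/partitions/replication are illustrative
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(stateTopic)).all().get();
        }
    }
}
```

With cleanup.policy=compact, Kafka retains at least the latest record for every key instead of discarding data purely by age, which is exactly the guarantee a commit-log or state-restore use case needs.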

