Kafka is a distributed publish-subscribe messaging system. Kafka is a distributed, partitioned, redundant and persistent logging service. It is primarily used to process active streaming data.

Kafka is now widely used in the construction of real-time data pipeline and flow application scenarios, has the advantages of horizontal expansion, fault tolerance, fast, and has been run in the production environment of many large and medium-sized companies, successfully applied in the field of big data, this article to share what I know about Kafka.

Kafka’s high throughput reveals secrets

The first thing that stands out about Kafka is that it’s fast, and it’s fast in a weird way. On cheap, common virtual machines, like SAS disks, According to LINDEDIN, Kafka processes more than a trillion messages a day. At its peak, more than a million messages are posted per second. Even with low memory and CPU usage, Kafka can speed up to 100,000 data per second and persist.

Kafka writes message logs to Kafka. Kafka writes message logs to Kafka. Kafka writes messages to Kafka.

Kafka makes code fly and write fast

First, you can use the production-side API provided by Kafka to publish messages to one (in order of data) or to multiple partitions (in parallel, but not necessarily in order of data) of one or more topics (topics). A Topic can be simply defined as a data category that distinguishes different data.

Kafka maintains a partition log in a Topic that writes messages sequentially to partitions, each of which is an immutable message queue. Messages in the partition are in k-V form.

K stands for offset, called an offset, a unique identifier of a 64-bit integer that represents the starting byte position of the message in all message flows within a Topic partition. V is the actual message content, each offset in each partition is unique, all partition messages are written once, and the offset can be adjusted for multiple reads before the message expires.

Mentioned above is the first factor that makes Kafka “fast” : messages are written sequentially to disk.

We know that disks today are mostly mechanical (SSDS are out of the question), and that if messages are written to disk randomly, they are processed in terms of cylinders, heads, and sectors (addressing process), and slow mechanical movement (relative to memory) consumes a lot of time. As a result, the disk write speed is only a few millionths of the memory write speed. To avoid the time consumption caused by random write, KAFKA stores data in sequential write mode, as shown in the following figure:

New messages can only be appended to the end of existing messages, and produced messages do not support random deletion or random access, but consumers can access consumed data by resetting offsets.

Even with sequential reads and writes, a large number of small I/O operations that occur too frequently can cause disk bottlenecks, so Kafka’s approach here is to batch send these messages together to reduce excessive reads and writes to disk IO, rather than sending a single message at a time.

Another is inefficient byte copying, especially when the load is high. To avoid this, Kafka uses a standardized binary message format shared by producers, brokers, and consumers so that blocks of data can be transferred between them without conversion, reducing the cost of byte copying.

At the same time, Kafka uses MMAP(Memory Mapped Files) technology. Many modern operating systems make heavy use of main memory for disk caching. A modern operating system can use all the remaining space in memory for disk caching with little performance penalty when memory is reclaimed.

Since Kafka is jVM-based, and anyone who has dealt with Java memory usage knows two things:

▪ The memory overhead of objects is very high, often twice the actual size of the data to be stored;

▪ As data increases, Java garbage collection becomes more frequent and slow.

Based on this, using a file system while relying on the page cache is more attractive than using other data structures and maintaining an in-memory cache:

▪ Eliminating in-process caching frees up memory that can almost double the amount of page caching.

▪ If Kafka restarts, the internal cache is lost, but the operating system page cache can still be used.

One might ask what happens if Kafka runs out of memory when it makes such frequent use of page caching.

Kafka writes data to a persistent log rather than flushing it to disk. In fact, it just moves to the kernel’s page cache.

Rather than maintaining a memory cache or some other structure, using a file system and relying on a page cache can map files directly to physical memory using the operating system’s page cache. Operations on physical memory after mapping are synchronized to hard disk in due course.

Kafka makes code fly and read fast

In addition to writing fast when Kafka receives data, another feature of Kafka is that it sends fast when it pushes data.

Kafka this message queue in the production side and the consumer side respectively take push and pull way, that is, you can think Kafka is a bottomless pit, how much data can be pushed into the consumer side is based on their own consumption capacity, how much data you need, you come to Kafka here pull, Kafka guarantees that as long as there is data, the consumer side can come and fetch as much as it needs.

Bring zero copy

When it comes to message landing, the message log maintained by the broker is itself a directory of files. Each file is stored in binary format and processed by producers and consumers alike. Maintain this common format and allow optimization for the most important operation: the network transfer of persistent log blocks. Modern Unix operating systems provide an optimized code path for transferring data from the page cache to the socket; In Linux, this is done through the SendFile system call. Java provides a method to access this system call: the Filechannel.transferto API.

To understand the impact of SenFile, it is important to understand the common data path for transferring data from a file to the socket, as shown in the following figure. Data is transferred from disk to the socket through the following steps:

▪ Page cache where the operating system reads data from disk into kernel space

▪ Applications read data from kernel space into the user-space cache

▪ The application writes data back to the kernel space into the socket cache

▪ The operating system copies data from the socket buffer to the nic buffer to send data across the network

There are four copies, two system calls, which is very inefficient. If sendFile is used, only one copy is required: it allows the operating system to send data directly from the page cache to the network. So in this optimized path, only the last step is required to copy the data into the nic cache.

Performance comparison between regular file transfer and zeroCopy:

Assuming that a Topic has multiple consumers, and using the zero-copy optimization above, the data is copied to the page cache once and reused for each consumption, not stored in storage, nor copied to user space for each read. This allows messages to be consumed at speeds close to the network connection limit.

This combination of page caching and sendFile means that most consumers of the Kafka cluster consume messages entirely from the cache, with no read activity on disk.

▲ Batch compression

In many cases, the bottleneck is not CPU or disk, but network bandwidth, especially for data pipelines that need to send messages between data centers on a wide area network. So data compression is very important. Each message can be compressed, but the compression rate is relatively low. So Kafka uses batch compression, where multiple messages are compressed together instead of a single message.

Kafka allows the use of recursive message sets. Bulk messages can be transmitted in compressed form and remain compressed in logs until they are decompressed by consumers.

Kafka supports Gzip and Snappy compression protocols.

Deep interpretation of Kafka data reliability

Kafka’s messages are stored in topics, which can be divided into multiple partitions and each partition has multiple ReplIas for data security.

▪ Multi-zone design features:

Speed up reading and writing for concurrent reading and writing; Is the use of multi-partition storage, conducive to data balance; To speed up the data recovery rate, if a machine is down, only part of the data in the cluster needs to be recovered, which speeds up the recovery time. Each Partition is divided into multiple segments. Each Segment has two files:.log and.index. Each log file contains specific data. When searching for offset, the Consumer uses dichotomy to locate the Segment based on the filename, and then parses the MSG to match the corresponding offset.

<Partition Recovery >

Each Partition records a RecoveryPoint on the disk, which is flushed to the maximum offset of the disk. LoadLogs are performed when the broker fails to restart. The segment containing the Partition’s RecoveryPoint and subsequent segments may not be fully flushed to disk segments. Segment recover is then called to re-read the MSG of each segment and rebuild the index. Each time Kafka’s broker is restarted, you can see the process of rebuilding the indexes in the output log.

< Data synchronization >

Both producers and consumers interact only with the Leader, and each Follower pulls data from the Leader for synchronization.

As shown in the figure above, ISR is a collection of all replicas that are not behind. Not behind has two meanings: If the time since the last FetchRequest is no more than a certain value or the number of messages behind is no more than a certain value, the Leader randomly selects a Follower from the ISR to act as the Leader. This process is transparent to users.

When Producer sends data to the Broker, the level of data reliability can be set using the request.required. Acks parameter.

This configuration is the confirmation value that indicates when a Producer request is considered complete. In particular, how many other brokers must have committed data to their log and confirmed this information to their Leader.

▪ Typical values:

0: Indicates that Producer never waits for confirmation from the broker. This option provides the lowest latency but also the greatest risk (because data will be lost when the server goes down).

1: indicates the confirmation that the Leader Replica has received data. This option has a low latency and ensures that the server accepts the packet successfully.

-1: Producer receives confirmation that all synchronous Replicas have received data. However, this approach does not completely eliminate the risk of message loss because the number of synchronous Replicas may be 1. If you want to ensure that some replicas receive data, you should set min.insync.replicas in topic-level Settings.

Setting acks= -1 does not guarantee data loss. If only the Leader is in the ISR list, data loss may occur. To ensure that data is not lost, set acks=-1 and ensure that the ISR size is greater than or equal to 2.

▪ Specific parameter Settings:

Request. required. Acks: set this parameter to -1. Wait until the Replica in the ISR list receives a message and calculates and writes successfully.

Min.insync. replicas: Set it to >=2 to ensure that there are at least two replicas in the ISR.

Producer: There is a trade-off between throughput and data reliability.

Kafka as a leader in modern messaging middleware, with its speed and high reliability to win the majority of the market and users of all ages, many of the design concepts are very worthy of our learning, this article is just the tip of the iceberg, I hope to understand Kafka has a certain role.

Welcome Java engineers who have worked for one to five years to join Java Programmer development: 854393687 group provides free Java architecture learning materials (which have high availability, high concurrency, high performance and distributed, Jvm performance tuning, Spring source code, MyBatis, Netty, Redis, Kafka, Mysql, Zookeeper, Tomcat, Docker, Dubbo, multiple knowledge architecture data, such as the Nginx) reasonable use their every minute and second time to learn to improve yourself, don’t use “no time” to hide his ideas on the lazy! While young, hard to fight, to the future of their own account!