Why do you need Kafka?

When we learn something, we often only really understand the meaning behind it, in order to master it step by step, until it is well planned. This article is based on official documentation. Official documentation is really important, by the way, to read and cherish.

background

Kafka was originally developed by LinkedIn as a messaging foundation for its own business. LinkedIn later donated Kafka to Apache, which has since become a top-level project. Kafka is a high-throughput distributed messaging system. At present, it has been applied in the actual business of many companies and combined with many data processing frameworks, such as Hadoop and Spark.

The messaging system

In the actual business requirements, we need to deal with various messages, such as Page View, log, request, etc., so what functions should a good message system have?

Have publish and subscribe capabilities, similar to message queues or enterprise messaging systems;
Can store message flows and have fault tolerance;
Can process messages in real time;

The above three points are the most basic capabilities of a good messaging system.

So why was Kafka born?

In fact, in our work, I believe that many of us have come into contact with message queues, or even written a simple message system, which should have the basic publish/subscribe function, as shown in the following figure:

Both consumer A and consumer B subscribe to message sources A and B. This mode is very simple, but it also has disadvantages relatively, such as the following two points:

In this mode, consumers need to process messages in real time, because neither the source nor the consumer will maintain a message queue (the maintenance cost is too high), which will lead to the loss of messages if consumers are temporarily unable to consume, and of course, they will not be able to obtain historical messages.
A message source needs to maintain work that is not its job, such as maintaining information for subscribers (consumers), sending messages to multiple consumers, or handling message feedback, which makes a pure message source more and more complex.

Of course, these problems can be improved. For example, we can add a message queue between the message source and the consumer, as shown in the following figure:

We can see from the figure that now the message source only needs to send the message to the message queue, and the rest is done by the message queue. We can persist the message in the message queue and actively push the message to the consumers who have subscribed to the message queue. So what are the disadvantages of this model?

The answer is yes, the picture above is just two message queues, which we can easily maintain, but what if there are hundreds or thousands? That may not be gg, in fact, we can find that the function of the message queue is very similar, is persistent, push message, give feedback, and other functions, the structure is very similar, mainly is the message content, of course, if you want to generalization, message structure also want as far as possible, universal, has nothing to do with the concrete platform specific language, such as using JSON format, etc., So we can evolve the following messaging system:

This method looks like merging the above queues together, but it is not that simple because the set of message queues has the following functions:

Unified management of all message queues is not a special requirement for developers to maintain;
Efficient storage of messages;
Consumers can quickly find the information they want to buy;

Of course, these are only the most basic functions, there are such as multi-node fault tolerance, data backup, etc., a good message system needs to deal with a lot of things, fortunately, Kafka helps us do it.

Kafka

Before we dive into the details of Kafka, let’s take a look at some of its basic concepts:

Kafka runs on a cluster, so it can have one or more service nodes;
Kafka clusters store messages in specific files that behave as Topics;
Each message record contains a key, message content, and timestamp;

From the above points we can assume that Kafka is a distributed message storage system. Is that all it does? Let’s move on.

Kafka provides four core interfaces for enhanced functionality:

The Producer API allows applications to publish messages to topics in Kafka;
The Consumer API allows applications to subscribe to Topics in Kafka and consume messages.
The Streams API allows applications to act as handlers of message flows, such as consuming messages from topicA and publishing the results to topicB.
The Connector API provides compatibility between Kafka and existing applications or systems, such as database connectors that capture table structure changes.

Their relationship to the Kafka cluster can be illustrated in the following figure:

Now that you’ve looked at some of Kafka’s basic concepts, let’s look at some of its components.

Topics

As the name implies, Topics are a collection of Topics. More generally speaking, Topics are like a message queue to which producers can write messages and from which consumers can read messages. A Topic can be subscribed by multiple producers or consumers at the same time, so it has good scalability. A Topic can consist of one or more partitions, as shown in the following figure:

If a Topic has multiple partitions, the producer’s messages can be specified or algorithmically assigned by the system to specific partitions. If you want all messages to be ordered, then you are better off using only one partition. In addition, partition supports message displacement reading, which can be managed by consumers themselves, as shown in the following figure:

As can be seen from the figure above, different consumers do not interfere with each other in reading messages in the same partition. Consumers can control the data they want to obtain by setting message offset, such as read from scratch, latest data read, reread read and other functions.

The partitioning strategy for a Topic and balance with consumers will be covered in more detail in future articles.

Distribution

As mentioned above, Kafka is a distributed messaging system, so when we configure multiple Kafka Server nodes, it has distributed capabilities, such as fault tolerance, etc. Partition will be distributed on each Server node, and there is a leader in between. It handles all the read/write requests, and the other followers copy the leader’s data. If the leader is unable to serve due to some failure, a follower is chosen to handle the requests.

Geo-Replication

Remote backup is a basic function of mainstream distributed systems. It is used to back up and restore data in a cluster. Kafka uses MirrorMaker to implement this function.

Producers

Producers, as the Producers of messages, can specify their own partitions to publish messages to the subscription Topic. Policies can be specified, such as semantically or structurally similar messages to be published to the same partition, or they can be circularly published to each partition by the system.

Consumers

Consumers are a collection of Consumers, which can be called consumer group, which is a higher level of abstraction. The unit of subscribes to consumer messages to a Topic is Consumers, but of course there can be only one consumer in it. Here are the two principles of consumer:

Assuming that all consumers are in the same consumer group, they co-consume part of the messages of the subscription Topic (distributed by partition and number of consumers), preserving load balancing;
If all consumers are in different consumer groups and subscribe to the same Topic, they can consume all the messages for the Topic.

Here is a simple example to help you understand:

In the figure above, there are two Server nodes, with one Topic divided into four partitions (P0-P4) on each node, and two consumer groups (GA, GB), where GA has two consumer instances and GB has four consumer instances.

As can be seen from the figure, the unit of subscribing to a Topic is a consumer group. In addition, we find that messages in a Topic push messages to specific consumers according to certain rules, the main principles are as follows:

If the number of consumers is less than the number of partitions and the number of consumers is one, it consumes all messages.
If the number of consumers is smaller than the number of partitions, assuming that the number of consumers is N and the number of partitions is M, then the number of partitions that each consumer can consume is M/N or M/N+1.
If the number of consumers is equal to the number of partitions, then each consumer is equally allocated to a partition’s messages;
If the number of consumers is greater than the number of partitions, some consumers will not get the message partition and will be idle.

In general, Kafka allocates messages evenly based on consumer groups, such as when an instance of a message goes down, or when a new consumer joins.

Guarantees

As a high-level system, Kafka offers the following guarantees:

Messages are added in an orderly fashion. The earlier a producer sends a message to a subscribed Topic, the earlier it is added to the Topic. Of course, they may be assigned to different partitions.
Consumers are ordered in their messages in the consumption Topic section;
For a Topic with N replication nodes, the system can tolerate a maximum of n-1 node failures without losing any messages submitted to that Topic.

I plan to go into more details about these points in subsequent articles.

Kafka as a Messaging System

What is Kafka’s advantage over other messaging systems? There are two main traditional message system models: message queue and publish/subscribe.

1. Message queues

features	describe
form	A group of consumers retrieves a message from a message queue, and the message is pushed to a consumer in the group
advantage	Horizontal scaling allows message data to be processed separately
disadvantage	Message queues are not multi-user, and when a message record is read by a process, the message is lost

Publish/subscribe

features	describe
form	The message is broadcast to all consumers
advantage	Messages can be shared by multiple processes
disadvantage	Every consumer gets all the messages, and there is no way to increase processing efficiency by adding a consuming process

The above two tables show the strengths and weaknesses of the two traditional messaging models, so Kafka has been optimized to incorporate their strengths in the following two aspects:

The message queue function is achieved through Topic
Publish/subscribe through consumer groups

Kafka perfectly addresses the shortcomings of both patterns by combining these two (see the section above for a detailed description of both).

Kafka as a Storage System

Storing messages is also a major function of the message system, Kafka is relatively ordinary message queue storage, its performance is too good, first Kafka supports write confirmation, to ensure the correctness and continuity of message writing, at the same time Kafka will be written to disk data copy backup, to achieve fault tolerance, In addition, Kafka uses a consistent structure for disk usage, saying that no matter how much message data your server currently stores on disk, it is equally efficient at adding message data.

Kafka’s storage mechanism allows consumers to control what data they want to read, and in many cases you can use Kafka as a high-performance, low-latency distributed file system.

Kafka for Stream Processing

Kafka is a perfectionist. It’s not enough to have normal read, write, and store functions. It also provides an interface for processing message flows in real time.

Many times of the original data is not what we want, we want is a result of data processing, such as from one day of search data search hot that day, and you can use the Streams API to achieve the function you want, such as to get the data from the input Topic, and then release the output to a specific Topic.

Kafka’s flow processing can solve problems such as processing unordered data, complex transformations of data, and so on.

conclusion

Messaging, storage, and streaming are simple, but how they fit together is elegant, and Kafka does it.

Compared with HDFS, although it supports efficient storage and batch processing of data, it only supports processing historical data in the past.

Compared to a normal messaging system, it does not store historical data, although it can process data from now to the future.

Kafka brings together the best of the best and enables the system to meet all needs in one word: “perfect”.

This article from the evolution of the message system, to the specific composition of Kafka, finally to Kafka’s three features, to help you to understand what Kafka is about, exactly what function, of course, this is only a small white simple understanding, if there is a wrong place, I hope you can point out, appreciate.