What is Kafka?

Kafka is a publish/subscribe messaging engine, commonly described as a “distributed commit log” or “distributed streaming platform”. A file system’s or database’s commit log provides a persistent record of all operations, which can be replayed to reconstruct the state of the system. Similarly, Kafka persists its data in a fixed order, and that data can be read back on demand. Kafka was designed to provide three things:

  1. Provide a set of APIs to implement producers and consumers
  2. Reduce network transmission and disk storage costs
  3. Implement a highly scalable architecture

Message engine

The format of the transmission message

Commonly used message transmission formats include CSV, XML, and JSON, as well as open-source serialization frameworks such as Google’s Protocol Buffers and Facebook’s Thrift.

Kafka itself uses a pure binary byte sequence. Messages are still structured, of course, but they are serialized into binary byte sequences before being transmitted.

Models for transmitting messages

  • Point-to-point model: also known as the message queue model. In plain terms, a message sent by system A can be received only by system B; no other system can read it. Telephone customer service is an everyday example: an incoming call is handled by exactly one agent, and a second agent cannot serve the same caller.
  • Publish/subscribe model: unlike the above, it has the concept of a Topic, which you can think of as a container for messages with similar logical semantics. This model also has senders and receivers, but with different names: the sender is called a Publisher and the receiver a Subscriber. Unlike the point-to-point model, multiple publishers may send messages to the same topic, and multiple subscribers may all receive messages from that topic. A daily newspaper subscription is a typical publish/subscribe example.
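The difference can be made concrete with a toy in-memory dispatcher. This is purely an illustration of the publish/subscribe model, not Kafka’s API; the class and method names are invented for this sketch:

```python
from collections import defaultdict

class MiniPubSub:
    """Toy publish/subscribe dispatcher (illustration only, not Kafka)."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber of the topic gets the message (pub/sub),
        # unlike point-to-point, where exactly one receiver would get it.
        for callback in self.subscribers[topic]:
            callback(message)

bus = MiniPubSub()
inbox_a, inbox_b = [], []
bus.subscribe("news", inbox_a.append)  # subscriber A
bus.subscribe("news", inbox_b.append)  # subscriber B
bus.publish("news", "morning edition")
```

Both subscribers end up with the same message, which is exactly what a point-to-point queue would forbid.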

Kafka supports both messaging models.

The role of Kafka

The main functions of a message queue are traffic peak shaving and loose coupling. Peak shaving refers to the situation where upstream traffic surges suddenly and downstream services cannot process messages in time, causing messages to pile up. In flash-sale (seckill) scenarios in particular, upstream order traffic spikes instantly, and the surge may overwhelm the downstream subsystem services outright.

With Kafka introduced, upstream message services no longer interact directly with downstream services. When a new message is generated, the upstream service simply sends it to a Kafka Broker. Likewise, the downstream sub-services subscribe to the corresponding topic in Kafka and receive messages from the topic’s partitions in real time for processing, decoupling the upstream message service from the downstream message-processing services.

This lets Kafka absorb the instantaneous surge in order traffic as messages stored in the corresponding topic, without affecting the upstream service’s TPS, while giving downstream services enough time to consume them. This is where messaging engines like Kafka add the most value.

Terminology in Kafka

Producer

  1. Producers are the message and data producers; a process that publishes messages to a Kafka topic is called a producer.
  2. The producer publishes each message to a specific topic, and it can also decide which partition the message goes to, e.g. via round-robin or some other algorithm.
  3. Asynchronous batch sending can significantly improve sending efficiency: in the Kafka producer’s asynchronous mode, messages are cached in memory and then sent in batches with a single request.
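The batching idea in point 3 can be sketched with a few lines of plain Python. This is an assumed, simplified model of producer-side batching, not the real Kafka client (which batches by bytes and time, via settings such as `batch.size` and `linger.ms`):

```python
class BatchingSender:
    """Toy sketch of producer-side batching: messages accumulate in memory
    and are flushed to the transport as one request once the batch fills."""
    def __init__(self, batch_size, transport):
        self.batch_size = batch_size
        self.buffer = []
        self.transport = transport  # callable that takes a list of messages

    def send(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.transport(list(self.buffer))  # one request, many messages
            self.buffer.clear()

requests = []
sender = BatchingSender(batch_size=3, transport=requests.append)
for i in range(7):
    sender.send(i)
sender.flush()  # flush the leftover partial batch
# requests == [[0, 1, 2], [3, 4, 5], [6]]
```

Seven individual sends turn into three network requests, which is the efficiency gain the text describes.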

Producer Message distribution

  • Any broker in the Kafka cluster can provide a producer with metadata, including the list of servers alive in the cluster and the list of partition leaders.
  • After receiving the metadata, the producer maintains socket connections with all the partition leaders for the topic.
  • Messages are sent directly from the producer to the broker over these sockets, without passing through any “routing layer”. In fact, the producer client itself decides which partition each message is routed to.
    • For example “random”, “key-hash”, or “round-robin”. If a topic has multiple partitions, the producer side is responsible for distributing messages evenly across them.
  • In the producer-side configuration file, developers can specify the partition routing mode.
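The routing strategies named above can be sketched as pure functions. This is an illustrative model, not Kafka’s `DefaultPartitioner` (which uses murmur2 hashing); the partition count and function names are assumptions for the sketch:

```python
import itertools
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this illustration

def key_hash_route(key):
    """Key-hash routing: the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

_counter = itertools.count()

def round_robin_route():
    """Round-robin routing: partitions are chosen in turn for an even spread."""
    return next(_counter) % NUM_PARTITIONS

samples = [round_robin_route() for _ in range(4)]  # -> [0, 1, 2, 0]
```

Key-hash routing gives per-key ordering (all messages for one key hit one partition), while round-robin maximizes balance when messages have no key.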

The producer’s acknowledgment mechanism controls whether sending data requires feedback from the server. The possible values are 0, 1, and -1:

0: the producer does not wait for an ACK from the broker

1: the broker sends an ACK once the leader has received the message

-1: the broker sends an ACK only after the message has been successfully replicated to all (in-sync) followers

request.required.acks=0

If no partitioner is specified, the producer uses DefaultPartitioner, which routes by the hash code of the message key.

Broker

The server side of Kafka consists of service processes called brokers. A Kafka cluster is made up of brokers, which receive and process requests from clients and persist messages. While multiple broker processes can run on the same machine, it is more common to spread brokers across different machines, so that if one machine in the cluster goes down and all the broker processes on it die, brokers on other machines can still provide service. This is one of the ways Kafka achieves high availability.

Another means of achieving high availability is replication. The idea is simple: copy the same data to multiple machines. These copies are called replicas in Kafka. The number of replicas is configurable; they hold the same data but play different roles. Kafka defines two types of replicas: leader replicas and follower replicas.

Replicas also work simply: producers always write messages to the leader replica, and consumers always read messages from the leader replica. A follower replica does only one thing: it sends fetch requests to the leader replica asking for the latest messages written, so that it can stay in sync with the leader.
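This division of labor can be modeled in a few lines. The following is a toy illustration of the roles, with invented names; real Kafka replication also involves in-sync-replica tracking and high-watermark logic that is omitted here:

```python
class Replica:
    """Toy model of a replica: just an append-only log."""
    def __init__(self):
        self.log = []

leader = Replica()
followers = [Replica(), Replica()]

def produce(message):
    # Producers always write to the leader replica.
    leader.log.append(message)

def consume(offset):
    # Consumers always read from the leader replica as well.
    return leader.log[offset]

def follower_fetch(follower):
    # A follower's only job: ask the leader for messages it doesn't have yet.
    follower.log.extend(leader.log[len(follower.log):])

produce("m1")
produce("m2")
for f in followers:
    follower_fetch(f)  # followers catch up to the leader
```

After the fetch, every follower’s log matches the leader’s, which is the sole purpose of follower replicas.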

While the replication mechanism ensures that data is persisted or messages are not lost, it does not solve the scalability problem.

What is scalability?

In the case of replicas, you now have leader and follower replicas, but what if the leader replica accumulates too much data to fit on a single broker machine? Kafka splits the data into multiple pieces and stores them on different brokers, a mechanism known as partitioning.

The partitioning mechanism in Kafka divides each topic into multiple partitions, each of which is an ordered log of messages. Each message a producer sends goes to exactly one partition: if a message is sent to a topic with two partitions, it ends up in either partition 0 or partition 1. Kafka’s partition numbers start at 0, so a topic with 100 partitions has partitions numbered 0 through 99.

Relationships between replicas and partitions

Replicas are defined at the partition level. Each partition can be configured with multiple replicas: exactly one leader replica and N-1 follower replicas. The producer writes messages to the partition, and each message’s position within the partition is identified by a number called its Offset. Partition offsets always start at 0: if a producer writes 10 messages to an empty partition, their offsets are 0, 1, 2, …, 9.
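The offset rule follows directly from the append-only structure, as this minimal sketch shows (an illustration, not Kafka’s storage format):

```python
class Partition:
    """Toy model: a partition is an append-only, ordered message log;
    each message's offset is simply its position, starting at 0."""
    def __init__(self):
        self.log = []

    def append(self, message):
        offset = len(self.log)  # the next offset equals the current length
        self.log.append(message)
        return offset

p = Partition()
offsets = [p.append(f"msg-{i}") for i in range(10)]
# offsets == [0, 1, 2, ..., 9], matching the example in the text
```

Because the log only grows at the end, an offset permanently identifies one message in one partition.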

Kafka’s three-tier messaging architecture

  • The first layer is the topic layer: each topic can be configured with M partitions, and each partition can be configured with N replicas.
  • The second layer is the partition layer: of each partition’s N replicas, only one can act as the leader and serve clients; the other N-1 replicas are followers, kept for data redundancy.
  • The third layer is the message layer: a partition contains messages whose offsets start at 0 and increase monotonically.
  • Client programs interact only with a partition’s leader replica.

Kafka Broker persists data

Kafka uses a message Log to store data. A log is a physical file on disk to which messages can only be appended. By avoiding slow random I/O in favor of sequential writes, Kafka achieves much of its high-throughput characteristic.

If you keep appending messages to a log, you will eventually run out of disk space.

So Kafka must periodically delete old messages to reclaim disk space. How are messages deleted?

Simply put, through the log segment mechanism. Under the hood, a Kafka log is further divided into multiple log segments, and messages are always appended to the current (active) segment. When a segment fills up, Kafka automatically rolls a new segment and seals the old one. A scheduled background task then periodically checks whether old log segments can be deleted, in order to reclaim disk space.
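The roll-and-retain cycle can be sketched as follows. This is a simplified illustration (segment size is counted in messages rather than bytes, and retention is by segment count rather than age); the names are invented for the sketch:

```python
class SegmentedLog:
    """Toy sketch of Kafka's log-segment mechanism."""
    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [[]]  # the last element is the active segment

    def append(self, message):
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])  # roll a new active segment
        self.segments[-1].append(message)

    def apply_retention(self, keep_segments):
        # Delete whole sealed segments from the front, oldest first.
        while len(self.segments) > keep_segments:
            self.segments.pop(0)

log = SegmentedLog(segment_size=2)
for i in range(5):
    log.append(i)
# segments are now [[0, 1], [2, 3], [4]]
log.apply_retention(keep_segments=2)
# oldest segment deleted: [[2, 3], [4]]
```

Deleting whole sealed segments is cheap (one file unlink per segment), which is why Kafka reclaims space this way rather than removing individual messages.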

Consumer

  1. Consumers are the message and data consumers; a process that subscribes to a topic and processes the messages published to it is called a consumer.
  2. A message in a partition is consumed by only one consumer in a group (at a time).
  3. In Kafka, you can think of a consumer group as a single “subscriber”: each partition of a topic is consumed by only one consumer within that group, although a single consumer can consume messages from multiple partitions.

Note:

Kafka’s design dictates that, within one consumer group, each partition of a topic can be consumed by at most one consumer at a time. If a group has more consumers than the topic has partitions, some consumers will receive no messages.

Kafka only guarantees that messages within a single partition are consumed in order by a consumer. From the topic’s perspective, messages are not globally ordered when there are multiple partitions.

Load balancing for Consumer

When a consumer joins or leaves a group, a partition rebalance is triggered. The ultimate goal of rebalancing is to improve the topic’s concurrent consumption capacity. The steps are as follows:

  1. Given topic1 with partitions: P0, P1, P2, P3
  2. Two consumers join the group: C0, C1
  3. First sort the partitions by index: P0, P1, P2, P3
  4. Sort the consumers by consumer.id: C0, C1
  5. Compute the multiple M = ceil(number of partitions / number of consumers); here M = ceil(4 / 2) = 2
  6. Then allocate partitions in turn: C0 = [P0, P1], C1 = [P2, P3]; in general, Ci = [P(i*M), …, P((i+1)*M − 1)]
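The steps above amount to range assignment and can be written directly. This is a sketch of the algorithm as described, not Kafka’s actual `RangeAssignor` implementation:

```python
import math

def range_assign(partitions, consumers):
    """Range assignment per the steps above: sort both lists, compute
    M = ceil(len(partitions) / len(consumers)), then give consumer i
    the slice partitions[i*M : (i+1)*M]."""
    partitions = sorted(partitions)
    consumers = sorted(consumers)
    m = math.ceil(len(partitions) / len(consumers))
    return {c: partitions[i * m:(i + 1) * m] for i, c in enumerate(consumers)}

assignment = range_assign(["P0", "P1", "P2", "P3"], ["C0", "C1"])
# assignment == {"C0": ["P0", "P1"], "C1": ["P2", "P3"]}
```

Note that when the counts don’t divide evenly, e.g. 3 partitions and 2 consumers, the earlier consumers get the extra partitions (C0 gets two, C1 gets one).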

Thinking

Why doesn’t Kafka allow reading from follower replicas, the way MySQL allows reading from slaves?

  1. Kafka’s partitioning already spreads read load across multiple brokers, unlike MySQL’s master-slave setup, where the pressure concentrates on the master.
  2. Kafka is a message queue, so consumers must track consumption offsets; a database stores entity data and has no such concept.
  3. On the producer side, Kafka can be configured to control whether to wait for followers to acknowledge a message. If followers also served reads, every message would need to be acknowledged by all followers before replying to the producer, degrading performance.
  4. Read/write separation is a good solution for workloads with many reads and relatively few writes. Kafka’s main scenario, however, is as a message engine rather than a data store serving reads: it usually involves frequent production and consumption of messages. This is not a read-heavy, write-light scenario, so read/write separation schemes are not a good fit.