In our earlier 360-degree test, "Does Kafka lose data, and is its high availability sufficient?", we examined in detail whether Kafka is suitable for business systems. But some of you may still not know what Kafka is or why it exists. That can hurt you in jobs and interviews, because somewhere along the way Kafka became a must-have skill for engineers.

Some conceptual revisions

As of version 0.9, Kafka’s tagline has changed from “a high-throughput, distributed messaging system” to “a distributed streaming platform.”

Kafka is not just a queue; it is also a store, with a strong capacity for absorbing backlogs of messages.

Kafka can be used not only in high-throughput big-data scenarios, but also in business systems with transactional requirements, albeit at lower performance.

Kafka is not designed to host as many topics as possible; beyond a threshold, its performance degrades as the number of topics grows.

When you introduce a message queue, you introduce asynchrony, whatever your intentions. This often means changes to business processes, and even to the product experience.

What is a messaging system

A typical scenario

Consider an order system in which a scheduled task periodically polls the order table and pushes new orders to downstream services. This design has the following problems:

  • The polling interval of the scheduled task is hard to tune well, so processing is prone to delay.

  • Processing capacity cannot be scaled horizontally, and trying to do so introduces problems such as distributed locking and ordering guarantees.

  • Whenever another business needs the order data, its logic has to be bolted onto the scheduled task.

As traffic increases and business logic becomes more complex, message queues come into play.

The role of messaging systems

Peak shaving. The queue absorbs requests that exceed the processing capacity of the backend, keeping services running smoothly. This can save significant cost; flash-sale events, for example, are rarely provisioned for peak traffic.

Buffering. The queue acts as a buffer layer between the service layer and slower downstream layers. Similar to peak shaving, but mainly used for data flow within a service, such as sending text messages in bulk.

Decoupling. At the start of a project you cannot pin down every future requirement. A message queue can act as an interface layer that decouples important business processes: downstream systems simply follow the agreed conventions and program against the data, gaining extensibility.

Redundancy. The same message data can be consumed, one-to-many, by multiple unrelated businesses.

Robustness. The queue can hold a backlog of requests, so consumer-side services can be down for a short time without affecting the main business.

Message system requirements

Precisely because a messaging system is so important, it faces demanding requirements of its own beyond high availability. Generally these are the following:

High performance. Both message delivery and message consumption must be fast. Parallelism is generally obtained by increasing the number of partitions.

Reliability. In some scenarios messages must not be lost: neither the producer side, the consumer side, nor the MQ itself may drop messages. This is generally addressed by adding replicas and forcing flushes to disk.

Scalability. The system should grow alongside your project. Adding nodes should enlarge the cluster without degrading its performance.

A mature ecosystem. Monitoring, operations and maintenance tooling, multi-language client support, and an active community.

Kafka terminology

Basic functions

Kafka is a distributed messaging (and storage) system. Distributed systems gain parallelism through partitioning and reliability through replication, and Kafka is no exception. Let's walk through the structure and explain some of the terminology.

Broker

A single Kafka server node is called a broker. The components that write data into Kafka are called producers; message producers typically live inside your business systems.

Many kinds of messages may be sent to Kafka. How do you classify them? That is the concept of a Topic. Because a topic is distributed, it may live on multiple brokers.

To increase parallelism, a topic is divided into multiple slices, each called a Partition. Partitions are generally distributed evenly across all machines.

The applications that consume data from Kafka are called consumers. We give a name to a consuming business on a topic; that name is the Consumer Group.
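
To make these terms concrete, here is a minimal Java producer sketch. The broker address, topic name, and key are illustrative assumptions, not values mandated by anything above.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key (here, a user id) determines which partition the record lands on.
            producer.send(new ProducerRecord<>("user_activity_topic", "user-42", "login"));
        }
    }
}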

Extended functionality

Connector. Includes the Source and Sink interfaces, letting users customize data flows: for example, importing data into Kafka from JDBC, or landing Kafka data directly into a database.

Streams is similar to Spark Streaming: it processes streaming data. But it has no cluster of its own; it is just an abstraction on top of a Kafka cluster. If you want real-time stream processing and don't need anything from the Hadoop ecosystem, this is for you.
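
As a sketch of what Streams looks like in practice, here is a minimal topology that counts log lines per level. The topic names (log_topic, log_level_counts) and the line format are assumptions for illustration.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

public class LogLevelCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-level-count");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("log_topic");
        logs.groupBy((key, line) -> line.split(" ")[0]) // assume each line starts with its level
            .count()
            .toStream()
            .to("log_level_counts", Produced.with(Serdes.String(), Serdes.Long()));

        // No separate cluster: this runs inside your own JVM, on top of Kafka.
        new KafkaStreams(builder.build(), props).start();
    }
}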

Topic

Messages live in topics. With multiple topics, messages can be categorized and isolated. For example, login events are written to user_activity_topic and log messages to log_topic.

Each topic can adjust its number of partitions. Suppose our cluster has three brokers. With one partition, messages are written to only one node; with three partitions, messages are hashed across all three nodes; with six partitions, each node holds two partitions. Adding partitions increases parallelism, but more is not always better. Generally 6-12 is best, ideally a number divisible by the node count so as to avoid data skew.
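
Creating a topic with an explicit partition and replica count can be done through the AdminClient. A sketch follows, with the topic name and counts chosen for illustration.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions on a 3-node cluster: 2 partitions per node, no skew.
            NewTopic topic = new NewTopic("user_activity_topic", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}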

Each partition consists of an ordered, immutable sequence of messages that is appended to sequentially. Each message in a partition has a sequential ID called the offset. Kafka retains all messages for a configured time, so it also serves as temporary storage; within that window, messages can be consumed repeatedly, any number of times, by changing the offset.

Offsets are generally managed by consumers, but can also be set programmatically as needed. An offset only advances after a commit; otherwise you will keep receiving the same data. Newer versions of Kafka store these offsets in a dedicated internal topic: __consumer_offsets.

It is worth mentioning that the number of consumers in a group should not exceed the number of partitions; otherwise the extra consumers receive no data.
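
A consumer sketch with manual commits illustrates the offset behavior described above: offsets only advance on commit, so a crash before commitSync() means the same records are read again. Group and topic names are illustrative.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-service");
        props.put("enable.auto.commit", "false"); // we commit explicitly
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("user_activity_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync(); // offsets move forward only here
            }
        }
    }
}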

ISR

Increasing the number of replicas is a common way to ensure data reliability in distributed systems, and ISR is built on this method.

ISR, short for In-Sync Replicas, is an important mechanism for ensuring high availability and consistency. The replica count affects Kafka's throughput but greatly improves availability; 2-3 replicas are generally advisable.

Replication has two requirements: there must be enough copies, and they must not land on the same instance. ISR is maintained per partition: each partition has a list of synchronized replicas. Among N replicas, one is the leader and the rest are followers. The leader handles all reads and writes for the partition, while the followers passively and periodically copy data from the leader.

If a follower falls too far behind the leader, or has not issued a replication request within a certain time, the leader removes it from the ISR.

The leader commits a message only after all replicas in the ISR have acknowledged it.

Kafka's ISR state is ultimately recorded in ZooKeeper, at /brokers/topics/[topic]/partitions/[partition]/state. ZooKeeper is also used to elect a new leader when the current one fails. Kafka's dependence on ZooKeeper has been shrinking since offsets moved into Kafka's internal topic.
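
You can inspect a partition's leader and ISR from code via the AdminClient. Here is a sketch, with the topic name assumed for illustration.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;

public class ShowIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("user_activity_topic"))
                    .all().get().get("user_activity_topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // A healthy partition shows every replica present in the ISR.
                System.out.printf("partition=%d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}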

Reliability

Message delivery semantics

At most once: a message may be lost, but is never delivered more than once.

At least once: a message is never lost, but may be delivered more than once, so the consumer should be idempotent.

Exactly once: a message is delivered once and only once, neither lost nor duplicated.

The overall delivery semantics are guaranteed jointly by the producer side and the consumer side. Kafka defaults to at least once. You can also configure transactions to achieve exactly once, but it is inefficient and not recommended.
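
For completeness, here is what a transactional producer sketch looks like; the transactional id and topic are illustrative assumptions. Weigh the throughput cost discussed above before adopting this.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class TransactionalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "order-tx-1"); // implies idempotence and acks=all
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "order-1", "created"));
                producer.send(new ProducerRecord<>("orders", "order-1", "paid"));
                producer.commitTransaction(); // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction();
            }
        }
    }
}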

ACK

When the producer sends data to the leader, the level of data reliability can be set with the request.required.acks parameter:

1 (default): the data is considered delivered once the leader confirms receipt. If the leader goes down before followers replicate, the data is lost.

0: the producer sends data without waiting for any acknowledgment. This gives the highest transfer efficiency but the lowest reliability.

-1: the producer counts a send as complete only after all followers in the ISR have confirmed receipt of the data.
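
On the modern producer API this parameter is simply called acks. A hedged sketch, with the broker address and topic assumed:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AckLevels {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all"); // "0", "1", or "all" ("all" is the same as -1: wait for the whole ISR)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created"));
        }
    }
}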

Why is Kafka fast

Cache. Kafka leans on the operating system's filesystem cache (PageCache) rather than maintaining its own in-process cache.

Sequential writes. Because modern operating systems provide read-ahead and write-behind techniques, sequential writes to disk are often faster than random writes to memory.

Zero-copy. Data moves from the page cache to the network socket without an extra copy through user space.

Batching of messages. Small requests are merged and streamed over the network in batches, pushing utilization toward the network's ceiling.

Pull mode. Consumers pull messages at their own pace, keeping consumption in step with their processing capacity.
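
Batching, in particular, is directly tunable on the producer. A sketch with illustrative values:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("batch.size", "65536");      // up to 64 KB per partition batch
        props.put("linger.ms", "20");          // wait up to 20 ms to fill a batch
        props.put("compression.type", "lz4");  // compress whole batches on the wire
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("log_topic", "line-" + i));
            }
        }
    }
}

Raising linger.ms trades a little latency for larger, cheaper batches, which is exactly the throughput lever described above.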

Usage scenarios

  • Delivering business messages

  • User activity logs

  • Monitoring items

  • Logs

  • Stream processing, such as certain aggregations

  • Commit logs as redundancy for some important services

Log collection is a typical usage scenario.

Pressure test

Kafka ships with its own pressure-test (benchmarking) tools. For example:

./kafka-producer-perf-test.sh --topic test001 --num-records 1000000 --record-size 1024 --throughput -1 --producer.config ../config/producer.properties

Configuration management

Concerns

Application scenario. Different scenarios call for different configuration policies and SLA levels. Work out whether your messages may be lost or duplicated, then set the replica count and ACK mode accordingly.

Lag. Keep an eye on the message backlog. A large lag means processing capacity is insufficient; if messages pile up even during quiet periods, heavy traffic is bound to be a problem.
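
Lag can be computed programmatically as the gap between each partition's log end offset and the group's committed offset. A sketch, with the group name assumed:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("order-service")
                         .partitionsToOffsetAndMetadata().get();
            Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());
            // lag = log end offset - committed offset, per partition
            committed.forEach((tp, om) ->
                    System.out.printf("%s lag=%d%n", tp, end.get(tp) - om.offset()));
        }
    }
}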

Capacity expansion. After expanding the cluster, partition redistribution may occur, and your network bandwidth may become the bottleneck.

Disk usage. To keep disks from filling up, set an expiration period or a maximum log size:

log.retention.bytes

Disk space is limited, so it is advisable to keep only recent records and have older ones deleted automatically:

log.retention.hours
log.retention.minutes
log.retention.ms
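
These are broker-wide settings; the per-topic counterpart is retention.ms, which can be overridden at runtime through the AdminClient. A sketch with an illustrative topic and a three-day retention:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "log_topic");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), // keep 3 days
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Map.of(topic, Collections.singleton(op))).all().get();
        }
    }
}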

Monitoring and management tools

KafkaManager. The most comprehensive management tool, able to manage multiple Kafka clusters. Note, however, that with many topics its monitoring data can eat a lot of bandwidth and drive up machine load. Its monitoring features are weak and may not meet your requirements.

KafkaOffsetMonitor. Runs as a single JAR, so it is easy to deploy. Monitoring only, and relatively safe to use.

Kafka Web Console. Provides comprehensive monitoring, such as previewing messages and tracking offset and lag information. Not recommended for production environments.

Burrow. LinkedIn's open-source framework for monitoring consumer lag. It supports alerting, but offers only an HTTP interface and no web UI.

Microsoft’s Open-source Kafka Availability and latency monitoring framework, which provides a JMX interface, is rarely used.

Rebalance

Consumer-side rebalance

Bringing consumers online or offline causes the mapping between partitions and consumers to be redistributed: a rebalance. While it happens, services may time out or jitter.
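
If your consumer needs to react to a rebalance, for example to flush in-flight work before losing a partition, it can register a ConsumerRebalanceListener. A sketch, with illustrative names:

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

public class RebalanceAware {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("user_activity_topic"),
                    new ConsumerRebalanceListener() {
                        public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                            // commit or flush in-flight work before these partitions move away
                            System.out.println("revoked: " + parts);
                        }
                        public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                            System.out.println("assigned: " + parts);
                        }
                    });
            while (true) {
                consumer.poll(Duration.ofSeconds(1));
            }
        }
    }
}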

Server-side reassignment

When brokers are added or removed, or nodes are started or shut down, data may become skewed, so partitions need to be reassigned. This process can be triggered manually from the Kafka Manager console to make the partition distribution more even.

Reassignment causes heavy data copying across the cluster. If your cluster holds a large amount of data, it may take hours or even days. Exercise caution.

LinkedIn has open-sourced its automation tool Cruise Control for those who need automated operations.

In closing

This article covers the most basic Kafka knowledge, enough to handle most simple interview questions.

Kafka put a great deal of effort into exactly-once semantics, and the result is nearly unusable: the throughput is simply too low. If you genuinely need high reliability, a compensation strategy may serve you better. Insufficient performance can end in total unavailability, whereas data loss, in extreme cases, touches only a small amount of data. How would you weigh that?

Kafka under heavy traffic can be scary, with data routinely saturating the network card. If a broker goes down and a single node holds terabytes of data, it can take half an hour just to start up, and then it has to catch up, as a follower, with the leader partitions elsewhere. So don't make your Kafka cluster too large: failover can be a disaster. And if you then run a reassignment after startup, that's another ordeal.