
What is Kafka

First, let’s take a look at what Kafka is. In a big data architecture, data collection and transmission is a critical link. How do we make sure such a large volume of data is transmitted without loss, how do we keep data valid when failures occur, and how do we buffer data when the network is congested? All of this requires support from the right infrastructure.

Kafka is, first of all, a messaging system. A messaging system is simply a tool for sending data from one place to another. Of course, calling it a messaging system alone does not show Kafka’s strengths. Kafka has its own characteristics: it is a high-throughput, distributed publish/subscribe messaging system, and its core value is “peak shaving and valley filling”, that is, smoothing out bursts in the data flow.

Kafka supports multiple development languages such as Java, C/C++, Python, Go, Erlang, and Node.js. Many mainstream distributed processing systems support integration with Kafka, such as Spark and Flink. Kafka also relies on ZooKeeper for coordination and cluster management.

The structure and concepts of Kafka

After this brief introduction, we have a general idea of what Kafka is. Within Kafka’s basic structure there are two important roles:

  • Message producers

  • Message consumers

As shown in the figure below, a Kafka cluster sits between the producers and consumers of messages and connects them so that messages can be delivered. The producer is responsible for producing messages and writing them into the Kafka cluster, while the consumer pulls messages out of the Kafka cluster, which is also called consuming the messages. What Kafka solves is how to store these messages, how to schedule the cluster to achieve load balancing, how to guarantee communication, and so on.

Let’s introduce a few important concepts in Kafka.

1. Producers and consumers

The producer is responsible for sending messages to Kafka; it can be an app, a service, or any of a variety of SDKs.

Consumers, on the other hand, pull data from Kafka.

Each consumer belongs to a specific consumer group. A message in a topic can be consumed by only one consumer within the same consumer group, but consumers in different consumer groups can each consume the same message. Thanks to consumer groups, we can control how messages in Kafka are consumed: if several consumers need to consume the same messages independently, configure them in different consumer groups; if several consumers should share the work of processing one message source, configure them in the same consumer group.
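
To make these two roles concrete, here is a minimal sketch using Kafka’s standard Java client; the broker address localhost:9092, the topic user-events, and the group id report-service are placeholder names for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class QuickStart {

    // Producer: writes a message to the "user-events" topic (hypothetical name)
    static void produce() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-events", "user-42", "clicked:home"));
        }
    }

    // Consumer: joins the "report-service" group and pulls messages from the same topic
    static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "report-service"); // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```

Running a second consumer with the same group.id would split the partitions with this one, while giving it a different group.id would let it read the same messages independently, which corresponds to the two cases described above.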

2. Messages

The message is the most basic unit of transmission in Kafka. In addition to the data we want to transfer, Kafka attaches header information to each message to identify it, which makes processing inside Kafka easier.
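
Besides the metadata Kafka itself attaches, the Java client (since version 0.11) also lets an application add its own headers to a record. A small sketch, continuing the producer example above; the header key source-app is made up purely for illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.clients.producer.ProducerRecord;

// Continuing with the "producer" created in the previous sketch
ProducerRecord<String, String> record =
        new ProducerRecord<>("user-events", "user-42", "clicked:home");
record.headers().add("source-app", "web".getBytes(StandardCharsets.UTF_8)); // application-defined header
producer.send(record);
```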

3. Topics

Above we mentioned topics. In Kafka, a topic is essentially a category of messages. Once the topic name is decided in the configuration, producers can send messages to that topic, and consumers subscribed to the topic consume the data as soon as it arrives.

4. Partitions and replicas

In Kafka, each topic is divided into several partitions. Each partition is an ordered queue, a set of messages stored physically in order, and each partition has several replicas to keep the data available. In theory, the more partitions there are, the higher the throughput of the system, but the number needs to be set according to the actual situation of the cluster, that is, whether the servers can support it. At the same time, the amount of message data Kafka keeps is tied to the number of partitions and replicas and to its retention policy: Kafka usually retains data for a fixed period, such as one week, or until a partition reaches a configured size.
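
Partition count, replication factor, and retention are all configured per topic. The sketch below shows how that might look with the Java AdminClient; the topic name, partition count, and retention values are illustrative assumptions, not recommendations.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, 3 replicas per partition for availability
            NewTopic topic = new NewTopic("user-events", 6, (short) 3);
            topic.configs(Map.of(
                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),    // keep about one week of data
                    "retention.bytes", String.valueOf(10L * 1024 * 1024 * 1024)  // or cap each partition at ~10 GB
            ));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```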

5. Offsets

The offset can be understood as a kind of message index. Since Kafka is a message queue service, data is not read and written randomly but sequentially, so each message is assigned an incrementing offset. When consuming data, a consumer can specify an offset to choose where to start reading.
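
With the Java consumer, for example, you can take a specific partition and jump to a particular offset before reading; the partition number and offset below are arbitrary illustrative values.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Properties;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-job");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition0 = new TopicPartition("user-events", 0);
            consumer.assign(Collections.singletonList(partition0)); // read partition 0 directly
            consumer.seek(partition0, 1024L); // start at offset 1024 instead of the committed position
        }
    }
}
```

After the seek, the next poll() would return records of partition 0 starting at offset 1024.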

In our day-to-day data development work, we usually use Kafka as a producer or a consumer and do not need to worry about deploying and maintaining the Kafka system itself. What we actually deal with are the concepts described above, which are basically what we have to configure in code. Of course, Kafka also has some underlying components, such as brokers, the ISR (in-sync replica set), and its use of ZooKeeper; you can look into them if you want to learn Kafka in more depth. With these components added, let’s take a look at Kafka’s overall structure.

The general flow is that producers generate data, push it to the Kafka cluster, and decide which topic the data goes to. The Kafka cluster works with the ZooKeeper cluster to handle scheduling, load balancing, buffering, and other functions, waiting for consumers to consume the data.

The characteristics of Kafka

Kafka has evolved into a mainstream message queuing tool, along with RabbitMQ, Redis message queuing, ZeroMQ, RocketMQ, and more.

As we said earlier, Kafka’s most distinctive feature is “peak shaving and valley filling”, which describes how it is used: the valleys and peaks refer to valleys and peaks in the data flow. Because data producer A and data consumer B process the data stream at different rates, we can use Kafka as a pipeline for the transfer in between. So what features does Kafka’s concrete design provide? Let’s take a look.

1. Message persistence

Kafka chooses to store data on the file system. Many messaging systems, in order to improve their ability to handle massive data at high speed, do not persist data at all, or cache only a very small amount of it; Kafka, however, keeps the data on disk. On the one hand, disks provide large storage capacity; on the other hand, persisted data can support more applications, both real-time and offline.

2. Fast processing speed

Kafka puts a lot of effort into achieving high throughput. We know that hard disks read and write data with a physical head, and disk speed is usually described in revolutions per minute, such as 5,400 or 7,200 RPM, because random access requires the platter to rotate to the next address. Since Kafka is a queue, it reads and writes to disk sequentially, which greatly improves the efficiency of disk use and provides both large storage and speed. On top of that, Kafka adds many other optimizations, such as data compression, to increase throughput and support millions of messages per second.
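
Compression and batching are exposed as ordinary producer settings. A hedged example of the relevant properties (the values are illustrative, not tuning advice):

```java
import java.util.Properties;

public class ThroughputTuning {
    // Producer settings that trade a little latency for much higher throughput;
    // the concrete values here are placeholders, not recommendations.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "lz4"); // compress whole record batches before they are sent and stored
        props.put("batch.size", "65536");     // accumulate records into ~64 KB batches
        props.put("linger.ms", "20");         // wait up to 20 ms to let a batch fill
        return props;
    }
}
```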

3. Scalability

Like other components in the big data architecture, Kafka supports building a large-scale messaging system out of a number of inexpensive servers, and with ZooKeeper’s coordination, Kafka is easy to scale out.

4. Multi-client support

As mentioned earlier, Kafka supports a wide range of development languages such as Java, C/C++, Python, Go, Erlang, Node.js, etc.

5. Kafka Streams

Kafka introduced Kafka Streams after the 0.10 release, which provides good support for stream processing.
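
As a rough illustration, the sketch below (with made-up topic names app-logs and error-logs) reads one topic, filters the records, and writes the result to another topic.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ErrorFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "error-filter-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from "app-logs", keep only records whose value mentions "ERROR",
        // and write the survivors to "error-logs".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("app-logs");
        logs.filter((key, value) -> value != null && value.contains("ERROR"))
            .to("error-logs");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```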

What is Flume

After a brief introduction to Kafka, let’s take a look at Flume, another tool.

Let’s recall the data collection process: the client used by end users, whether an app, a web page, or a mini program, sends the user’s usage data to the back-end server over HTTP while it is being used, and the service running on the server saves the reported data in the form of logs. From those logs, the data still has to be transferred to where it finally lives, into HDFS or into a real-time computing service.

Of course, there are many ways to implement this transfer. For example, in Java development we can use the kafka-log4j-appender library to send Log4j (a logging library) logs to a Kafka message queue, and Kafka then passes them on to downstream tasks. However, this approach is crude and not well suited to large clusters, which is where the log collection tool Flume comes in.

Flume is a highly available, distributed log collection and transport system. Flume consists of three main parts: Source, Channel, and Sink. Together these three parts form an Agent, and each Agent is an independent unit of operation. There are many types of Source, Channel, and Sink, which can be chosen as needed:

  • A Channel can cache data in memory or write it to disk

  • A Sink can write data to HBase or HDFS, and can also send it to Kafka or even to the Source of another Agent

Concepts in Flume

1. Source

Source is the part responsible for receiving input data. A Source has two working modes:

  • Actively pull data

  • Wait for the data to arrive

After getting the data, the Source transfers the data to the Channel.

2. Channel

A Channel is an intermediate step that stores data temporarily. A Channel can also be configured in different ways, such as using memory, a file, or even a database as a Channel.

3. Sink

Sink is the encapsulated output part. Depending on the type of Sink selected, data is taken from the Channel and written to a different destination; for example, the HDFS Sink is used to output to HDFS.

4. Event

A unit of data passed in Flume is called an event.

5. Agent

As we mentioned earlier, an Agent is an independently running unit composed of a Source, a Channel, and a Sink, and a single Agent may contain more than one of each component.
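
To make the Source/Channel/Sink wiring concrete, here is a sketch of an Agent configuration in Flume’s properties format; the agent name a1, the file path, and the Kafka topic are placeholders, and exact property names can vary by Flume version.

```properties
# One Agent (a1) = one Source + one Channel + one Sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail an application log file
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory (a file channel would survive restarts)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: forward events to a Kafka topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.channel = c1
```

An agent like this is typically started with something like flume-ng agent --conf-file flume-kafka.conf --name a1, and the same skeleton works with an HDFS Sink, or with another Agent’s Source, as the destination.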

Comparison between Kafka and Flume

As you can see, Flume and Kafka have similar implementation principles in terms of data transfer, but each tool has its own focus.

  • Kafka is more about data storage and the real-time processing of streaming data. It is a high-throughput message queue that can handle heavy loads.

  • Flume focuses on data collection and transmission, and provides a variety of interfaces to support the collection of multiple data sources, but Flume does not directly provide data persistence.

Flume is inferior to Kafka in terms of throughput and stability. So consider a scenario where data has to be transferred between two systems that produce and consume at different rates: for example, the rate of real-time data production changes frequently, with different peaks at different times, and writing directly to HDFS would cause congestion. Adding Kafka to this process lets the data be written to Kafka first and then delivered downstream by Kafka. Flume, for its part, provides more ready-made, encapsulated components and is more lightweight; it is most commonly used for log collection and saves a lot of coding work.

Because of these characteristics of Kafka and Flume, the two are often used together in practice. For example, after logs land from online systems, Flume collects them and forwards them to Kafka, and Kafka then delivers the data to computing frameworks such as MapReduce, Spark, and Flink, or stores it persistently in HDFS.

Conclusion

In my own work, my basic impression of Flume and Kafka is that both can be used to move data, but each in fact has its own functions and characteristics, and their typical usage scenarios differ considerably. Whether you use Kafka, Flume, or a combination of the two depends on your specific needs. Cloudera has even developed a tool, Flafka, that blends the two.