Preface

As requested by many readers, here is a Kafka interlude before we move on to Yarn; consider it a fun and relaxing break.

First, Kafka basics

The role of a messaging system

As most of you know, the classic example is packing oil: the oil sits in a repository between the people who fill it and the people who draw from it.

That repository is the messaging system: it acts as a cache in the middle of the process and decouples the two sides.

Let me introduce a scenario. The log processing for China Mobile, China Unicom, and China Telecom is outsourced for big data analysis; suppose their logs are now handed to your system for user-profile analysis.

Following on from the messaging system above: a messaging system behaves like a cache, but it is not a real cache, because the data is stored on disk rather than in memory.

1. Topic

Kafka borrowed from database design and introduced the topic, which is similar to a table in a relational database.

Now, if I need China Mobile's data, I can simply subscribe to TopicA.

2. Partition

Kafka also has a concept called the partition. A partition is essentially a directory on a server: a topic has multiple partitions, and these partitions are stored on different servers, that is, different directories are created on different hosts. The main data of a partition lives in its .log files. Like partitioning in a database, this is done to improve performance.

Why does it improve performance? Simple: multiple partitions mean multiple threads, and multiple threads working in parallel beat a single thread.

Topic and partition are similar to the concepts of table and region in HBase: the table is only a logical concept, while the regions actually store the data. Likewise, a topic is a logical concept and a partition is the distributed storage unit. This design is the basis for handling massive amounts of data. By comparison, if HDFS had no block design, a 100 TB file could only be placed on a single server, which would immediately fill that server; with blocks, a large file can be spread across many servers.

Note: 1. A partition on its own is a single point of failure, so we configure a number of replicas for each partition.

2. Partition numbering starts from 0.

3. Producer

The producer sends data to the messaging system
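
As a minimal sketch (the broker address broker1:9092 and the message text are assumptions for illustration, not from this article), a Java producer sending one message to TopicA might look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("bootstrap.servers", "broker1:9092"); // assumed broker address
        conf.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        conf.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(conf)) {
            // The producer sends directly to the leader partition of TopicA.
            producer.send(new ProducerRecord<>("TopicA", "a log line from China Mobile"));
        }
    }
}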

4. Consumer

The consumer is what reads the data out of Kafka.

5. Message

The data we process in Kafka is called a message

Second, Kafka cluster architecture

We create a topic, TopicA, with three partitions stored on different servers (brokers). Since a topic is only a logical concept, there is no physical unit for the topic itself to draw in the diagram.

Note: Kafka had no replicas before version 0.8, so a server outage meant data loss; avoid using Kafka versions older than this.

Replica

Partitions in Kafka can be configured with multiple replicas to keep the data safe.

Here we configure three replicas for partitions 0, 1, and 2 (in practice, two replicas is usually enough).

In fact, each replica has a role: one replica is elected the leader and the others are followers. When the producer sends data, it sends it directly to the leader partition, and the follower partitions then synchronize the data from the leader. When consumers read data, they also read from the leader partition.
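
To tie the two ideas together, here is a hedged sketch (broker address assumed) of creating TopicA with three partitions and three replicas through Kafka's AdminClient:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicA {
    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.setProperty("bootstrap.servers", "broker1:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(conf)) {
            // 3 partitions, replication factor 3: one leader and two followers per partition.
            NewTopic topicA = new NewTopic("TopicA", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topicA)).all().get();
        }
    }
}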

Consumer Group

When we consume data, we specify a group.id in the code. This id is the name of the consumer group, and a default group.id is applied even if we do not set one explicitly.

conf.setProperty("group.id","tellYourDream")

Many well-known messaging systems are designed so that once one consumer has consumed a message, no other consumer can consume it. Kafka is different. Suppose consumerA consumes TopicA's data:

consumerA:
    group.id = a
consumerB:
    group.id = a
    
consumerC:
    group.id = b
consumerD:
    group.id = b

In principle consumerB could also consume TopicA's data, but because it shares consumerA's group.id, it cannot consume the messages consumerA has already consumed. If we give consumerC a different group.id, however, consumerC can consume TopicA's data, while consumerD (in the same group as consumerC) again cannot re-consume it. So in Kafka, the same topic's data can be consumed once by each different consumer group.

So the purpose of the consumer group is to let multiple consumers consume messages in parallel, without any two of them consuming the same message. As below, consumerA, B, and C will not interfere with one another:

consumer group:a
    consumerA
    consumerB
    consumerC
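
As a minimal sketch (broker address assumed; the group.id matches the snippet above), a consumer in group tellYourDream reading TopicA could look like this:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("bootstrap.servers", "broker1:9092"); // assumed broker address
        conf.setProperty("group.id", "tellYourDream");         // the consumer group from above
        conf.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        conf.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(conf)) {
            consumer.subscribe(Collections.singletonList("TopicA"));
            while (true) {
                // Each consumer in the group is assigned its own partitions to read.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}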

As the figure shows, and as mentioned above, consumers connect directly to the leaders, so here the three consumers consume from the three leader partitions respectively. A partition is never consumed by more than one consumer within the same group, but as long as the group is not saturated, a single consumer can consume data from several partitions.

Controller

A useful rule of thumb: about 95% of big data distributed frameworks use a master-slave architecture, while some, such as Elasticsearch, use a peer-to-peer architecture.

Kafka also has a master-slave architecture. The master node is called the controller and the rest are slave nodes. The controller manages the entire Kafka cluster in coordination with ZooKeeper.

How do Kafka and ZooKeeper work together

Kafka relies heavily on the ZooKeeper cluster (so the earlier ZooKeeper article pays off here). When the brokers start up, they all register with ZooKeeper to elect a controller. The election is straightforward, with no sophisticated algorithm involved: whichever broker registers the controller node in ZooKeeper first becomes the controller.

The controller listens on several directories in ZooKeeper. For example, there is a /brokers/ids directory under which the other nodes register themselves, such as /brokers/ids/0.

When a node registers, it exposes information such as its host name and port. The controller reads the data of the registered nodes (through the watch mechanism), builds the cluster's metadata, and then distributes that information to the other servers, so that every server can perceive the other members of the cluster.

In the same way, when we create a topic, Kafka writes the partition scheme into this directory. The controller listens for this change, synchronizes the metadata from the directory, and pushes it down to its slave nodes, so the whole cluster learns the partition scheme; the slave nodes then create their own directories and wait for the partition replicas to be created. This is the management mechanism of the whole cluster.
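
To make the watch mechanism concrete, here is a minimal sketch using the ZooKeeper Java client (the address localhost:2181 is an assumption) that lists the brokers registered under /brokers/ids and asks to be notified when the list changes:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class BrokerWatcher {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (address assumed for the example).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000,
                event -> System.out.println("watch event: " + event));
        // watch = true registers a watcher: ZooKeeper notifies us
        // when a broker joins or leaves /brokers/ids.
        List<String> brokerIds = zk.getChildren("/brokers/ids", true);
        System.out.println("live brokers: " + brokerIds);
        Thread.sleep(Long.MAX_VALUE); // keep the session open to receive events
    }
}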

Extra time

1. Why is Kafka's performance so good?

(1) Sequential writes

Every time the operating system reads or writes data on a disk it must first address it, that is, find the data's physical location on the disk, before it can read or write. With a mechanical disk, addressing takes a long time. Kafka's design stores data on disk; generally, storing data in memory would perform better, but Kafka writes sequentially, always appending data at the end. Sequential disk writes are extremely fast, and with a given number of disks at a given rotation speed, they are basically on par with memory.

Random writes, by contrast, modify data at arbitrary positions in a file, and that performs poorly.
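
A tiny sketch of the append-only idea (the file name demo.log is just an illustration, not Kafka's real write path):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog {
    public static void main(String[] args) throws Exception {
        // APPEND means every write lands at the end of the file,
        // so the disk never has to seek back and forth.
        try (FileChannel log = FileChannel.open(Path.of("demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < 3; i++) {
                log.write(ByteBuffer.wrap(("message-" + i + "\n").getBytes(StandardCharsets.UTF_8)));
            }
        }
    }
}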

(2) Zero copy

Let's look at the non-zero-copy case first.

As you can see, the data is copied from the OS cache into the Kafka server process and then again into the socket cache; the whole process costs a lot of time. Kafka instead uses Linux's sendfile technology (exposed in Java through NIO), which eliminates the process switches and one of the data copies, and that makes the performance much better.
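
A minimal sketch of the NIO call in question (file name and target address are assumed for illustration): FileChannel.transferTo maps to sendfile on Linux.

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Path.of("00000000000000000000.log"),
                     StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // The kernel moves bytes from the page cache straight to the
                // socket; the data never passes through user space.
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}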

2. Store logs in segments

Kafka caps each partition's .log files at 1 GB; the purpose of this limit is to make it convenient to load the .log files into memory.

00000000000000000000.index
00000000000000000000.log
00000000000000000000.timeindex

00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex

00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex

A number such as 9936472 is the starting offset contained in that log segment file, which tells us that nearly 10 million messages have already been written to this partition. A Kafka broker has a parameter, log.segment.bytes, that limits each log segment file to 1 GB. When a segment file is full, a new one is opened for writing; this process is called log rolling, and the segment file currently being written to is called the active log segment.
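
This naming scheme also makes it cheap to find the segment that holds a given offset. A small sketch, using the base offsets from the file names above:

import java.util.TreeMap;

public class SegmentLookup {
    public static void main(String[] args) {
        // The base offsets are the numbers in the segment file names above.
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000000000000.log");
        segments.put(5_367_851L, "00000000000005367851.log");
        segments.put(9_936_472L, "00000000000009936472.log");

        long wanted = 7_000_000L;
        // floorEntry finds the greatest base offset <= the wanted offset,
        // i.e. the one segment file that can contain that message.
        System.out.println(segments.floorEntry(wanted).getValue());
        // prints 00000000000005367851.log
    }
}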

If you read the previous two articles on HDFS, you will notice that the NameNode's edit log is limited in the same way; these frameworks all take such issues into account.

3. Network design for Kafka

Kafka's network design is the key to tuning Kafka, and it is the reason Kafka supports high concurrency.

A client request first reaches an Acceptor. The broker has a group of threads called Processors (three by default). The Acceptor does not process the client request itself; it wraps the connection into a SocketChannel and delivers it to the Processors' queues in polling mode: the first goes to the first Processor, the next to the second, then the third, and then back round to the first. When a consumer thread later consumes these SocketChannels, it obtains Request objects, and the requests carry the data with them.

By default the thread pool holds 8 threads. These threads parse and process the requests: a write request is written to disk, and a read request has its result returned. The Processor then reads the response data from the response queue and sends it back to the client. This is Kafka's three-tier network architecture.

So if we need to beef up Kafka, we can increase the number of Processors and the number of processing threads in the thread pool. The request and response queues in between actually act as a buffer, to deal with the case where the Processors produce requests faster than the thread pool can handle them.

This is essentially an enhanced version of the Reactor network threading model.
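
As a rough sketch of that reactor-style model (heavily simplified, and in no way Kafka's actual code): one acceptor hands connections round-robin to processor queues, and a fixed pool of handler threads does the real work.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class MiniReactor {
    private static final int NUM_PROCESSORS = 3; // network threads, like Kafka's default
    private static final int NUM_HANDLERS = 8;   // request handler pool, like Kafka's default

    public static void main(String[] args) throws IOException {
        // One queue of accepted connections per processor thread.
        List<BlockingQueue<SocketChannel>> queues = new ArrayList<>();
        ExecutorService handlers = Executors.newFixedThreadPool(NUM_HANDLERS);

        for (int i = 0; i < NUM_PROCESSORS; i++) {
            BlockingQueue<SocketChannel> queue = new LinkedBlockingQueue<>();
            queues.add(queue);
            Thread processor = new Thread(() -> {
                while (true) {
                    try {
                        SocketChannel channel = queue.take();   // handed over by the acceptor
                        handlers.submit(() -> handle(channel)); // parse + process in the pool
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            processor.setDaemon(true);
            processor.start();
        }

        // The acceptor only accepts connections and distributes them round-robin.
        ServerSocketChannel server = ServerSocketChannel.open().bind(new InetSocketAddress(9092));
        int next = 0;
        while (true) {
            SocketChannel channel = server.accept();
            queues.get(next).add(channel);
            next = (next + 1) % NUM_PROCESSORS; // 1st, 2nd, 3rd, back to the 1st
        }
    }

    private static void handle(SocketChannel channel) {
        // In Kafka this is where a write request hits the disk and a read
        // request is answered from the log; here we just close the connection.
        try { channel.close(); } catch (IOException ignored) { }
    }
}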

Finally

Cluster setup will be covered another time. This article has briefly walked through the basics of Kafka, from the roles in the architecture to some points of its design, and upcoming updates will keep moving forward from here with more of the fundamentals.

I also run my own Knowledge Planet group for free now, but free does not mean there is nothing to gain there. Those interested in the big data direction are welcome to follow it.