In day-to-day system operation and maintenance, every detail deserves our attention

The following figure shows our basic log processing architecture

All logs are collected by Rsyslog or Filebeat and sent to Kafka. Logstash then acts as a Consumer, reading the data out of Kafka and writing it to Elasticsearch and Hadoop respectively; Spark can also take over from there for deeper analysis

In the above architecture, the core components Kafka, Elasticsearch and Hadoop all natively support high availability, but Logstash does not. Running a single Logstash instance not only creates a processing bottleneck but also makes it a single point of failure for the whole system: if that Logstash goes down, the entire pipeline stops, with predictable consequences

How do we solve the Logstash single point problem? We can do it with Kafka's Consumer Group

Kafka Consumer Group

To help you understand, let me first introduce the most important concepts in Kafka:

Broker: a single Kafka server. The Kafka cluster in the figure above consists of three brokers (kafka servers)

Topic: a logical concept used to distinguish different categories of messages, similar to a table in a database; a batch of related data is sent to the same Topic. In log processing, different types of logs are usually written to different topics: for example, nginx logs go to a topic named nginx_log and Tomcat logs to a topic named tomcat_log. The topic itself is not drawn in the figure above; you can think of the three partitions in the figure as together forming one topic

Partition: the basic physical unit of Kafka data storage. Data for the same Topic can be stored in one or more partitions; for example, the Topic's data in the figure above is spread across partition1, partition2 and partition3. We usually set the number of partitions under a topic to an integer multiple of the number of brokers, so that data is distributed evenly and all server resources in the cluster are used at the same time

Producer: a service that writes data to Kafka, such as Filebeat

Consumer: a service that reads data from Kafka, such as Logstash

Consumer Group: also a logical concept; it is a group of Consumers. Data from the same topic is broadcast to every group, but within a single group each message is delivered to only one Consumer

In other words, for the same topic every group receives a full copy of the data, but once the data enters a group it is consumed by only one of that group's consumers. Based on this, we simply start multiple Logstash instances and put them in the same group to make Logstash highly available:

input {
    kafka {
        # the Kafka brokers to connect to
        bootstrap_servers => "10.8.9.2:9092,10.8.9.3:9092,10.8.9.4:9092"
        # the topic(s) to consume
        topics => ["ops_coffee_cn"]
        # consumers sharing this group_id form one consumer group
        group_id => "groupA"
        codec => "json"
    }
}

group_id is a string that uniquely identifies a group: consumers with the same group_id form one consumer group. So to make Logstash highly available, all we need to do is start multiple Logstash processes with the same group_id; if one process dies, the other processes in the group keep consuming
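
As a minimal sketch, assuming the pipeline above is saved as kafka_es.conf, two instances running on the same host would each need their own data directory (the file paths here are hypothetical):

# logstash -f kafka_es.conf --path.data /var/lib/logstash-a
# logstash -f kafka_es.conf --path.data /var/lib/logstash-b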

In addition to high availability, multiple Logstash instances in the same group also consume the kafka topic's data in parallel, which increases Logstash's processing capacity. Note, however, that within a group each partition is assigned to at most one consumer. When the number of consumers in a group exceeds the number of partitions, only as many consumers as there are partitions actually consume; the extra consumers sit idle

For example, with three partitions under a topic and five consumers in a group, only three consumers consume the topic's data at any one time while the other two wait. So if we want to increase Logstash's consumption performance, we can increase the number of partitions in the topic appropriately. However, too many partitions also has costs: failure recovery takes longer, and more file handles and client memory are consumed, so more partitions is not always better
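
To verify how partitions are actually assigned within a group, you can ask Kafka to describe the group (a sketch; the broker address matches the earlier example, and the exact output columns vary by Kafka version):

# bin/kafka-consumer-groups.sh --bootstrap-server 10.8.9.2:9092 --describe --group groupA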

Kafka Partition

The number of partitions in Kafka can be specified when creating a topic:

# bin/kafka-topics.sh --zookeeper 127.0.0.1:2181 --create --topic ops_coffee --partitions 3
Created topic "ops_coffee".

--partitions: specifies the number of partitions. If omitted, the num.partitions value from the broker configuration file is used by default
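
For reference, the corresponding broker-side default lives in server.properties (the value shown here is only an example; Kafka ships with a default of 1):

# server.properties
num.partitions=3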

You can also manually change the number of partitions:

# bin/kafka-topics.sh --alter --zookeeper 127.0.0.1:2181 --partitions 5 --topic ops_coffee
Adding partitions succeeded!

Note that the number of partitions can only be increased, never decreased

If you want to know the partition information of a topic, you can run the following command to view the details of the topic:

# bin/kafka-topics.sh --zookeeper 127.0.0.1:2181 --describe --topic ops_coffee
Topic:ops_coffee  PartitionCount:3  ReplicationFactor:2  Configs:
    Topic: ops_coffee  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1,2
    Topic: ops_coffee  Partition: 1  Leader: 2  Replicas: 2,3  Isr: 2,3
    Topic: ops_coffee  Partition: 2  Leader: 3  Replicas: 3,1  Isr: 3,1

With this, you should have a deeper understanding of the Kafka Consumer Group and be able to use it with confidence


Related articles recommended reading:

  • Building a MySQL slow log collection platform with ELK
  • ELK log system: collecting Nginx logs quickly and easily with Rsyslog
  • A general application log ingestion scheme for the ELK log system
  • Logstash: reading Kafka data and writing it to HDFS
  • The Filebeat registry file explained
  • Getting started with Elasticsearch Query DSL