This is the fifth day of my participation in the First Challenge 2022

Configuring jmx_exporter for Kafka

Open github.com/prometheus/… and download the agent JAR package shown below:

Upload the agent JAR package to every server that hosts a Kafka broker (each broker needs its own copy), for example to the following path:

/opt/agent/jmx_prometheus_javaagent-0.16.1.jar

Modify the Kafka startup script bin/kafka-server-start.sh and add the following line:

export JMX_EXPORTER_OPTS="-javaagent:/opt/agent/jmx_prometheus_javaagent-0.16.1.jar=9095:/opt/agent/kafka_broker.yml"

Then edit bin/kafka-run-class.sh so that JMX_EXPORTER_OPTS is appended to KAFKA_JMX_OPTS:

if [ -n "$JMX_EXPORTER_OPTS" ]; then
        KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS $JMX_EXPORTER_OPTS"
fi

Place this snippet in kafka-run-class.sh before KAFKA_JMX_OPTS is used to build the launch command.

In the agent configuration, 9095 is the port used to expose metrics; if it conflicts, switch to another port. The kafka_broker.yml configuration used by jmx_exporter is available here: github.com/xxd76379515…

Configure each broker instance of the Kafka cluster in this way and restart Kafka.

After Kafka starts, check the logs for startup errors, or visit http://kafka-host:9095/metrics to verify that the monitoring metrics are exposed.
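To check the endpoint programmatically rather than in a browser, you can fetch it and look for a known metric name. A minimal sketch (the hostname is illustrative; the parser is run against sample jmx_exporter output so it works standalone):

```python
def has_metric(text, name):
    """Return True if a Prometheus text-format exposition contains
    a sample for the given metric name (comment lines start with '#')."""
    return any(
        line.split('{')[0].split(' ')[0] == name
        for line in text.splitlines()
        if line and not line.startswith('#')
    )

# Against a live broker (hostname illustrative):
# from urllib.request import urlopen
# text = urlopen('http://kafka-host:9095/metrics').read().decode()

# Sample jmx_exporter output for a standalone check:
sample = """# HELP kafka_server_replicamanager_leadercount Attribute exposed for management
# TYPE kafka_server_replicamanager_leadercount untyped
kafka_server_replicamanager_leadercount 42.0
"""
print(has_metric(sample, 'kafka_server_replicamanager_leadercount'))  # True
```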

Prometheus configuration

Modify the Prometheus configuration file prometheus.yml and add the following job:

  - job_name: 'kafka'
    metrics_path: /metrics
    static_configs:
    - targets: ['kafka1:9095','kafka2:9095','kafka3:9095']
      labels:
         env: "test"

P.S. Keep job_name set to "kafka"; otherwise the Grafana dashboard below cannot be used directly and every panel would have to be adjusted one by one. If you are familiar with this stack, feel free to make your own adjustments.

Grafana configuration

The Grafana panels I have configured are ready to use directly, and panels can be added or removed later as needed: github.com/xxd76379515…

A few screenshots of the panels:

Message backlog

Metrics such as consumer lag (message backlog) are not exposed by the Kafka broker itself, and connecting to every consumer to collect consumer-side metrics is not practical.

So I wrote a separate kafka-exporter that maintains a message-backlog (consumer lag) metric: github.com/xxd76379515…
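Conceptually, an exporter like this derives lag per partition as the broker's log-end offset minus the consumer group's committed offset. A minimal sketch of that arithmetic with hypothetical data (not the actual kafka-exporter code):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus the group's committed
    offset (treat a partition with no commit as offset 0)."""
    return {
        tp: end - committed_offsets.get(tp, 0)
        for tp, end in log_end_offsets.items()
    }

# Hypothetical offsets for topic 'orders', partitions 0-2:
ends = {('orders', 0): 1200, ('orders', 1): 900, ('orders', 2): 450}
committed = {('orders', 0): 1000, ('orders', 1): 900}

lag = consumer_lag(ends, committed)
print(sum(lag.values()))  # 200 + 0 + 450 = 650
```

The real exporter fetches both offset sets from the cluster (e.g. via the admin API) and publishes the result as the `consumer_lag` metric scraped below.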

Click this link to go to the Github repository, follow the instructions to deploy and configure it, and then add the following configuration in Prometheus.yml:

  - job_name: 'kafka-exporter'
    metrics_path: /prometheus
    static_configs:
    - targets: ['kafka-exporter-host:9097']
      labels:
         env: "test"

The Grafana configuration above already contains a message-backlog panel.

If JMX exposes additional metrics you need, you can keep extending the jmx_exporter configuration (kafka_broker.yml) to scrape more of them.

Alerting

The latest configuration code will be submitted here: github.com/xxd76379515…

The following is an example:

groups:
  - name: Kafka test cluster alerts
    rules:
      - alert: "Kafka cluster split brain"
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount{env="test"}) by (env) > 1
        for: 0m
        labels:
          severity: warning
        annotations:
          description: 'The number of active controllers is {{ $value }}; the cluster may be split-brained'
          summary: '{{ $labels.env }} cluster may be split-brained; check the network between the brokers'
      - alert: "Kafka cluster has no active controller"
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount{env="test"}) by (env) < 1
        for: 0m
        labels:
          severity: warning
        annotations:
          description: 'The number of active controllers is {{ $value }}; there is no active controller'
          summary: '{{ $labels.env }} cluster has no active controller and may not be managed properly'
      - alert: "Kafka node is down"
        expr: count(kafka_server_replicamanager_leadercount{env="test"}) by (env) < 3
        for: 0m
        labels:
          severity: warning
        annotations:
          description: 'A node of the {{ $labels.env }} cluster is down; nodes currently available: {{ $value }}'
          summary: '{{ $labels.env }} cluster node down'
      - alert: "Kafka cluster has a partition whose leader is not on the preferred replica"
        expr: sum(kafka_controller_kafkacontroller_preferredreplicaimbalancecount{env="test"}) by (env) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          description: 'Number of partitions in the {{ $labels.env }} cluster whose leader is not on the preferred replica: {{ $value }}'
          summary: 'The {{ $labels.env }} cluster has partitions whose leader is not on the preferred replica, so partition replica load is unbalanced; run the kafka-preferred-replica-election script to fix it'
      - alert: "Kafka cluster offline partition count greater than 0"
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount{env="test"}) by (env) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          description: 'The number of offline partitions in the {{ $labels.env }} cluster is greater than 0: {{ $value }}'
          summary: 'The number of offline partitions in the {{ $labels.env }} cluster is greater than 0'
      - alert: "Kafka cluster under-replicated partition count greater than 0"
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions{env="test"}) by (env) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          description: 'The number of under-replicated partitions in the {{ $labels.env }} cluster is greater than 0: {{ $value }}'
          summary: 'The {{ $labels.env }} cluster has under-replicated partitions; messages may be lost'
      - alert: "CPU usage of Kafka node host is too high"
        expr: irate(process_cpu_seconds_total{env="test"}[5m])*100 > 50
        for: 10s
        labels:
          severity: warning
        annotations:
          description: 'CPU usage in the {{ $labels.env }} cluster is too high on {{ $labels.instance }}; current CPU usage: {{ $value }}'
          summary: '{{ $labels.env }} cluster CPU usage too high'
      - alert: "Kafka node YGC too frequent"
        expr: jvm_gc_collection_seconds_count{env="test", gc=~'.*Young.*'} - jvm_gc_collection_seconds_count{env="test", gc=~'.*Young.*'} offset 1m > 30
        for: 0s
        labels:
          severity: warning
        annotations:
          description: 'Young GC is too frequent on a {{ $labels.env }} cluster node, host: {{ $labels.instance }}, YGC count in the last minute: {{ $value }}'
          summary: '{{ $labels.env }} cluster node YGC too frequent'
      - alert: "Message backlog alarm for Kafka cluster"
        expr: sum(consumer_lag{env="test"}) by (groupId, topic, env) > 20000
        for: 30s
        labels:
          severity: warning
        annotations:
          description: 'Message backlog in the {{ $labels.env }} cluster, consumer group: {{ $labels.groupId }}, topic: {{ $labels.topic }}, current backlog: {{ $value }}'
          summary: 'Message backlog in the {{ $labels.env }} cluster'
      - alert: "Kafka cluster network processing busy"
        expr: kafka_network_socketserver_networkprocessoravgidlepercent{env="test"} < 0.3
        for: 0s
        labels:
          severity: warning
        annotations:
          description: 'The network thread pool of the {{ $labels.env }} cluster has a low idle ratio; network processing pressure may be too heavy. Host: {{ $labels.instance }}, current idle ratio: {{ $value }}'
          summary: '{{ $labels.env }} cluster network processing busy'
      - alert: "Kafka cluster IO processing busy"
        expr: kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent_total{env="test"} < 0.3
        for: 0s
        labels:
          severity: warning
        annotations:
          description: 'The I/O thread pool of the {{ $labels.env }} cluster has a low idle ratio; processing pressure may be too heavy and the thread count may need adjusting. Host: {{ $labels.instance }}, current idle ratio: {{ $value }}'
          summary: '{{ $labels.env }} cluster IO processing busy'
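The Young-GC rule turns the cumulative jvm_gc_collection_seconds_count counter into a per-minute increase with the `metric - metric offset 1m` pattern. A minimal Python sketch of that arithmetic (the sample data is made up):

```python
def delta_last_minute(samples, now):
    """Given (timestamp, cumulative_count) samples of a monotonically
    increasing counter, return the increase over the last 60 seconds,
    mirroring PromQL's `metric - metric offset 1m`."""
    current = max(v for t, v in samples if t <= now)
    past = max((v for t, v in samples if t <= now - 60), default=0)
    return current - past

# Young-GC counter sampled every 30s: 30 GCs by t=30, 62 by t=90
samples = [(0, 10), (30, 30), (60, 55), (90, 62)]
print(delta_last_minute(samples, now=90))  # 62 - 30 = 32, above the 30 threshold
```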

Closing words

I searched Grafana's site for dashboards, but they exposed too few metrics. Thanks go to the Grafana configuration I found in the blog post below: its panels cover many metrics, which I cross-checked against Kafka's official JMX monitoring documentation, saving me half the work of building the dashboard myself:

www.confluent.io/blog/monito…