
Scenario

Suppose I am monitoring a RocketMQ cluster and need to set up alerting for it. I have the following two alert rules:

  - alert: "Message backlogs at RocketMQ xxx_Consumer."
    expr: sum by(group, topic) (rocketmq_group_diff{group="xxx_consumer",topic="xxx"}) > 1000
    for: 1m
    labels:
      severity: busi
    annotations:
      description: 'There is a message backlog in the consumer group xxx_Consumer consuming XXX, the backlog is over 1000'
      summary: 'Message backlog for RocketMQ, xxx_Consumer'
  - alert: "Broker node is down"
    expr: count(rocketmq_broker_disk_ratio{cluster="XXXCluster"}) < 4
    for: 0m
    labels:
      severity: warning
    annotations:
      description: 'Fewer than 4 broker nodes'
      summary: 'Broker node is down'

The two rules above mean the following:

  1. Rule 1: a consumer group belonging to a business team (a core service) must not accumulate too large a message backlog. If the backlog exceeds 1000 messages, notify that team.
  2. Rule 2: when a broker node in our MQ cluster goes down, we need to receive the alert promptly.

In practice, when rule 1 fires, the alert must be sent both to the DingTalk alert group of the corresponding business team and to our own DingTalk group. Rule 2 alerts are sent only to our own group, not to the business side. In other words, some alerts need to go to several groups at the same time, while others need to go to only one group.

Note the severity label configured above: rule 1 uses busi and rule 2 uses warning. This is the label the routing below will match on.
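Before wiring up notifications, it is worth validating the rule file itself. A minimal sketch, assuming the two rules above are saved in a file called rocketmq-rules.yml (a name made up for illustration):

promtool check rules rocketmq-rules.yml

If the file is syntactically valid, promtool reports the number of rules it found; otherwise it prints the parse error.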

The configuration example is as follows:

Alertmanager configuration

global:
  resolve_timeout: 5m
  smtp_from: [email protected]
  smtp_smarthost: smtp.net:port
  smtp_auth_username: [email protected]
  smtp_auth_password: PASS
  smtp_require_tls: false
route:
  receiver: 'email'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  routes:
  - receiver: 'our'
    group_wait: 10s
    match_re:
       severity: warning
  - receiver: 'other'
    group_wait: 10s
    match_re:
       severity: busi

templates:
  - '*.html'
receivers:
- name: 'email'
  email_configs:
  - to: '[email protected]'
    send_resolved: false
    html: '{{ template "default-monitor.html" . }}'
    headers: { Subject: "[WARN] Alarm email" } # Email subject
- name: 'our'
  webhook_configs:
  - url: http://127.0.0.1:8060/dingtalk/our/send
- name: 'other'
  webhook_configs:
  - url: http://127.0.0.1:8060/dingtalk/our/send
  - url: http://127.0.0.1:8060/dingtalk/other/send
  • global: sets the default email (SMTP) configuration. If an alert does not match any sub-route, the top-level receiver email is used and the notification goes out by email.

  • route: the routes section defines two sub-routes on top of the global route. One goes to the receiver "our" and matches severity warning; the other goes to "other" and matches severity busi, the value defined in rule 1. busi is not a reserved keyword; the severity value is simply a label you define yourself in the alert rule.

  • receivers: defines the receivers referenced above. email specifies who the mail is sent to. our configures a single DingTalk webhook URL; note the "our" segment before /send at the end of the URI. other configures two URLs, which differ only in the segment before /send, one being "our" and the other "other", so alerts routed to this receiver are pushed to both DingTalk groups (see the routing verification sketch right after this list).
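The routing can be verified without firing any real alert. amtool, the CLI shipped with Alertmanager, can show which receiver a given label set would be routed to; a minimal sketch, assuming the configuration above is saved as alertmanager.yml (exact flags may differ slightly between Alertmanager versions):

amtool config routes test --config.file=alertmanager.yml severity=busi
amtool config routes test --config.file=alertmanager.yml severity=warning

The first command should print the receiver other and the second our, matching the two sub-routes defined above.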

Here is the email template I use (file name: default-monitor.html). The template renders the alerts as a table:

{{ define "default-monitor.html" }}
<table>
    <tr><td>Alert</td><td>Description</td><td>Start time</td></tr>
    {{ range $i, $alert := .Alerts }}
        <tr><td>{{ index $alert.Labels "alertname" }}</td><td>{{ index $alert.Annotations "description" }}</td><td>{{ $alert.StartsAt }}</td></tr>
    {{ end }}
</table>
{{ end }}

prometheus-webhook-dingtalk configuration

## Customizable templates path
templates:
  - /home/user/monitor/alert/prometheus-webhook-dingtalk-1.4.0.linux-amd64/template/template.tmpl

## Targets, previously was known as "profiles"
targets:
  our:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxx
    secret: xxx_secret
  other:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxx_other
    secret: xxx_other_secret

Under targets there are two entries, our and other, which correspond to the "our" and "other" segments in the URLs configured in Alertmanager. The access_token and secret are generated when you add a custom robot (bot) to the corresponding DingTalk group.
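For reference, here is a minimal sketch of how prometheus-webhook-dingtalk could be started so that it listens on port 8060, which is what the webhook URLs in the Alertmanager configuration expect. The config file name config.yml is an assumption, and the flag names below are those of recent releases, so check your version:

./prometheus-webhook-dingtalk --web.listen-address=:8060 --config.file=config.yml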

With this configuration, when rule 1 fires, Alertmanager routes it to the receiver named other, and the notification is sent to both our DingTalk group and the business team's DingTalk group. When rule 2 fires, it is routed through our and is sent only to our own DingTalk group.
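To test the whole chain end to end without waiting for a real backlog, a synthetic alert can be pushed into Alertmanager with amtool; a minimal sketch, assuming Alertmanager is listening on the default port 9093 and using TestBacklog as a made-up alert name:

amtool alert add alertname=TestBacklog severity=busi --alertmanager.url=http://127.0.0.1:9093

Because the alert carries severity=busi, it should be routed to the other receiver and show up in both DingTalk groups.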