
In the previous article, I described how to use Prometheus + Grafana to monitor the JVM. This article describes how to use Prometheus + Alertmanager to raise alerts on certain JVM conditions.

The scripts mentioned in this article can be downloaded here.

Abstract

Tools used:

  • Docker. This article makes extensive use of Docker to start the various applications.
  • Prometheus, responsible for fetching and storing metric data and providing query capabilities; this article focuses on its alerting capabilities.
  • Grafana, in charge of data visualization (not the point of this article, but it gives the reader a visual look at the abnormal metrics).
  • Alertmanager, responsible for sending alert notifications to the relevant people.
  • JMX Exporter, providing JMX and JVM-related metrics.
  • Tomcat, used to simulate a Java application.

Here are the general steps:

  1. Using JMX Exporter, start a small HTTP server inside each Java process.
  2. Configure Prometheus to scrape the metrics provided by that HTTP server.
  3. Configure the Prometheus alert trigger rules:

    • Heap usage exceeds 50%, 80%, or 90% of the maximum
    • An instance has been down for more than 30 seconds, 1 minute, or 5 minutes
    • Old GC time in the last 5 minutes exceeds 30%, 50%, or 80% of running time
  4. Configure Grafana to connect to Prometheus and configure a dashboard.
  5. Configure the Alertmanager notification rules.

The general alerting flow is as follows:

  1. Prometheus checks whether any alert is triggered according to the alert trigger rules; if so, it sends the alert to Alertmanager.
  2. When Alertmanager receives the alert, it decides whether to send a notification and, if so, to whom.
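
In configuration terms, those two steps map onto the two files built out in the steps below: a Prometheus alerting rule (the trigger) and an Alertmanager route/receiver (the notification). A minimal sketch of the two halves, using the names from this article, looks like this:

# prom-alert-rules.yml (Prometheus side): fire an alert when the expression holds for 30s
groups:
  - name: jvm-alerting
    rules:
      - alert: instance-down
        expr: up == 0
        for: 30s

# alertmanager-config.yml (Alertmanager side): route every incoming alert to one receiver
route:
  receiver: "user-a"
receivers:
  - name: "user-a"
    email_configs:
      - to: '<user-a-email>'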

Step 1: Launch a few Java applications

1) Create a new directory called prom-jvm-demo.

2) Download JMX Exporter to this directory.

3) Create a new file simple-config.yml with the following contents:

---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
rules:
  - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
    name: os_$1
    type: GAUGE
    attrNameSnakeCase: true

4) Run the following commands to start three Tomcats; remember to replace <path-to-prom-jvm-demo> with the correct path (-Xmx and -Xms are deliberately small to make it easy to trigger the alert conditions):

docker run -d \
  --name tomcat-1 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6060:6060 \
  -p 8080:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-2 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6061:6060 \
  -p 8081:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-3 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6062:6060 \
  -p 8082:8080 \
  tomcat:8.5-alpine

5) Visit http://localhost:8080, http://localhost:8081, and http://localhost:8082 to check that the Tomcats started successfully.

6) Visit http://localhost:6060, http://localhost:6061, and http://localhost:6062 to see the metrics provided by the JMX Exporter.
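
To spot-check from the command line (assuming curl is available on the host and the agent serves the standard /metrics path), you can pull the heap metrics that the alert rules below rely on:

curl -s http://localhost:6060/metrics | grep -E 'jvm_memory_bytes_(used|max)'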

Note: The simple-config.yml provided here only provides information about the JVM; refer to the JMX Exporter documentation for more complex configurations.

Step 2: Launch Prometheus

1) Create a new file prom-jmx.yml in the prom-jvm-demo directory with the following contents:

# Scrape the three JMX Exporter endpoints
scrape_configs:
  - job_name: 'java'
    static_configs:
      - targets:
          - '<host-ip>:6060'
          - '<host-ip>:6061'
          - '<host-ip>:6062'

# Alertmanager address
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '<host-ip>:9093'

# Read the alert trigger rules
rule_files:
  - '/prometheus-config/prom-alert-rules.yml'

2) Create a new file prom-alert-rules.yml; this is the alert trigger rule file:

# severity, from most to least severe: red, orange, yellow, blue
groups:
  - name: jvm-alerting
    rules:

    # down for more than 30 seconds
    - alert: instance-down
      expr: up == 0
      for: 30s
      labels:
        severity: yellow
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

    # down for more than 1 minute
    - alert: instance-down
      expr: up == 0
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

    # down for more than 5 minutes
    - alert: instance-down
      expr: up == 0
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # heap space usage over 50%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 50
      for: 1m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. current usage ({{ $value }}%)"

    # heap space usage over 80%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 80
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 80%] for more than 1 minute. current usage ({{ $value }}%)"

    # heap space usage over 90%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 90
      for: 1m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. current usage ({{ $value }}%)"

    # Old GC time over 30% of wall-clock time in the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.3
      for: 5m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. current seconds ({{ $value }})"

    # Old GC time over 50% of wall-clock time in the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
      for: 5m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. current seconds ({{ $value }})"

    # Old GC time over 80% of wall-clock time in the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. current seconds ({{ $value }})"
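
Before starting Prometheus, it can be worth validating the rule file syntax. A minimal sketch using promtool, which is bundled in the prom/prometheus image (the entrypoint override assumes a Prometheus 2.x image; adjust the path as before):

docker run --rm \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  --entrypoint promtool \
  prom/prometheus check rules /prometheus-config/prom-alert-rules.yml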

3) Start Prometheus:

docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  prom/prometheus --config.file=/prometheus-config/prom-jmx.yml

4) Go to http://localhost:9090/alerts and you should see the alert rules configured earlier:

If you don’t see three instances, wait a while and try again.
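
You can also ask Prometheus's HTTP API whether the three targets are healthy and the rule file was loaded (these are the standard v1 endpoints):

curl -s http://localhost:9090/api/v1/targets | grep '"health"'
curl -s http://localhost:9090/api/v1/rules | grep '"name"'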

Step 3: Configure Grafana

See Prometheus+Grafana for monitoring the JVM

Step 4: Launch AlertManager

1) Create a new file alertmanager-config.yml:

global:
  smtp_smarthost: '<smtp.host:ip>'
  smtp_from: '<from>'
  smtp_auth_username: '<username>'
  smtp_auth_password: '<password>'

# The directory from which notification templates are read.
templates: 
- '/alertmanager-config/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'instance']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: "user-a"

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is 
# already critical.
inhibit_rules:
- source_match:
    severity: 'red'
  target_match_re:
    severity: ^(blue|yellow|orange)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ['alertname', 'instance']
- source_match:
    severity: 'orange'
  target_match_re:
    severity: ^(blue|yellow)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ['alertname', 'instance']
- source_match:
    severity: 'yellow'
  target_match_re:
    severity: ^(blue)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ['alertname', 'instance']

receivers:
- name: 'user-a'
  email_configs:
  - to: '<user-a-email>'

Modify the smtp_* section and the email address of user-a at the bottom.

Note: Most mail providers in China do not support TLS, and Alertmanager at the time did not support SSL, so the advice was to use Gmail or another TLS-capable mailbox to send the alert messages (see this issue). That problem has since been fixed; the following is a configuration example for an Alibaba Cloud enterprise mailbox:

smtp_smarthost: 'smtp.qiye.aliyun.com:465'
smtp_hello: 'company.com'
smtp_from: '<username>@company.com'
smtp_auth_username: '<username>@company.com'
smtp_auth_password: password
smtp_require_tls: false
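
To catch YAML or routing mistakes before starting the container, you can validate the file with amtool, which ships inside the prom/alertmanager image (a sketch; the entrypoint override assumes amtool is on the image's PATH):

docker run --rm \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  --entrypoint amtool \
  prom/alertmanager check-config /alertmanager-config/alertmanager-config.yml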

2) Create a new file alert-template.tmpl. This is the message content template:

{{ define "email.default.html" }}
<h2>Summary</h2>
  
<p>{{ .CommonAnnotations.summary }}</p>

<h2>Description</h2>

<p>{{ .CommonAnnotations.description }}</p>
{{ end}}
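
Alertmanager's e-mail notifier renders a template named email.default.html for the message body, so defining that name in a *.tmpl file loaded via the templates section above replaces the built-in body. If you prefer a custom template name, you can reference it from the receiver instead; a sketch, assuming a template named "my.email.body" is defined in the same .tmpl file:

receivers:
- name: 'user-a'
  email_configs:
  - to: '<user-a-email>'
    html: '{{ template "my.email.body" . }}'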

3) Run the following command to start Alertmanager:

docker run -d \
  --name=alertmanager \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  -p 9093:9093 \
  prom/alertmanager:master --config.file=/alertmanager-config/alertmanager-config.yml

4) Visit http://localhost:9093 to see whether you have received any alerts sent by Prometheus (wait a bit if you haven't):
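
If you do not want to wait for a real condition to fire, you can push a synthetic alert straight into Alertmanager through its HTTP API to exercise the routing and e-mail path (a sketch against the v1 alerts endpoint; the labels here are made up for the test):

curl -XPOST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "manual-test", "instance": "test-instance", "severity": "yellow"},
       "annotations": {"summary": "Manual test alert", "description": "Fired by hand to verify Alertmanager notifications."}}]'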

Step 5: Wait for mail

Wait a while (up to 5 minutes) to see whether any mail arrives. If not, check the configuration, or run docker logs alertmanager to view the logs. This is usually caused by a mailbox configuration error.