1 Overview

Prometheus is an open source monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open source project that is maintained independently of any company. To emphasize this and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.

1.1 Core concepts of Prometheus

1.1.1 Data Model

Fundamentally, all data that Prometheus stores is time series data. A time series is a stream of timestamped values that belongs to one metric and to one set of labels under that metric. Besides storing time series, Prometheus can run very flexible and complex queries over them using query expressions.

Metrics and labels

Each time series is uniquely identified by its metric name and a set of label key-value pairs.

The metric name describes the general feature of the monitored system that is being measured (for example, http_requests_total is the total number of HTTP requests received). A metric name may contain ASCII letters, digits, underscores (_), and colons (:), and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.

Labels enable Prometheus' multidimensional data model. For the same metric name, each combination of label values identifies a time series for one particular dimension. The Prometheus query language filters and aggregates time series by metric name and labels. Changing any label value on a metric, including adding or removing a label, creates a new time series. A label name may contain ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names beginning with a double underscore (__) are reserved for internal use. A label value may contain any Unicode character, including Chinese.
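The naming rules for metrics and labels can be checked mechanically. The following is a minimal sketch that uses only the two regular expressions given in this section; the function names are hypothetical helpers for illustration and are not part of any Prometheus library.

```python
import re

# The two patterns from the text: metric names may contain colons, label names may not.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole string matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """True if the whole string matches the label name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def is_reserved_label_name(name: str) -> bool:
    """Label names starting with a double underscore are reserved for internal use."""
    return name.startswith("__")


for candidate in ["http_requests_total", "job:request_rate:sum", "2xx_total"]:
    print(candidate, is_valid_metric_name(candidate))
```

`fullmatch` is used so that the pattern must cover the entire name, which mirrors the "must match" wording above.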

Samples

A time series is simply a sequence of samples. Each sample consists of:
- A 64-bit floating point value
- A timestamp with millisecond precision

Notation

In notation, a time series is written as its metric name followed by a set of label key-value pairs, in the following form:

<metric name>{<label name>=<label value>, ...}

For example, a time series with the metric name api_http_requests_total and the labels method="POST" and handler="/messages" is written as:

api_http_requests_total{method="POST", handler="/messages"}

1.1.2 Metric Types

Counter

A counter is a cumulative metric whose value can only increase (or be reset to zero when the process restarts). Counters are mainly used to measure things such as the number of requests served, tasks completed, or errors that occurred.

Gauge

A gauge represents a single numerical value that can go up and down. Gauges are used for instantaneous measurements such as temperature or memory usage.
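The difference between the two types is easiest to see side by side. The sketch below is a toy model written for this article, not the API of a Prometheus client library; the class and method names are assumptions chosen for illustration.

```python
class Counter:
    """Toy counter: the value may only grow."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only be increased")
        self.value += amount


class Gauge:
    """Toy gauge: the value may be set, increased, or decreased."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


requests_total = Counter()
requests_total.inc()       # one request served
requests_total.inc(2)      # two more

memory_in_use = Gauge()
memory_in_use.set(512.0)   # current reading
memory_in_use.dec(128.0)   # memory was freed

print(requests_total.value, memory_in_use.value)
```

A rule of thumb that follows from this: if the value can meaningfully go down, it is a gauge; otherwise a counter is usually the better fit.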

Histogram

A histogram samples observations (typically things like request durations or response sizes) and counts them in configurable buckets. A histogram with the base metric name <basename> exposes several time series:
- Cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
- The sum of all observed values, exposed as <basename>_sum
- The count of observations, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"}, the bucket that contains every observation)
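How the three kinds of series relate can be sketched in a few lines. The example below is illustrative only: the base name request_duration_seconds and the bucket bounds are assumptions, and a real client library keeps running counters instead of storing every observation.

```python
import math
from typing import Dict, List


def histogram_series(observations: List[float], bounds: List[float]) -> Dict[str, float]:
    """Build the _bucket, _sum and _count samples for a list of observations.

    Buckets are cumulative: each one counts every observation that is less
    than or equal to its upper bound, so the +Inf bucket counts all of them.
    """
    series: Dict[str, float] = {}
    for bound in sorted(bounds) + [math.inf]:
        label = "+Inf" if math.isinf(bound) else str(bound)
        series[f'request_duration_seconds_bucket{{le="{label}"}}'] = sum(
            1 for value in observations if value <= bound
        )
    series["request_duration_seconds_sum"] = sum(observations)
    series["request_duration_seconds_count"] = len(observations)
    return series


for name, value in histogram_series([0.25, 0.5, 0.5, 2.0], [0.25, 0.5, 1.0]).items():
    print(name, value)
```

Because the buckets are cumulative, the value of the +Inf bucket always equals the _count series.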

Summary

Like a histogram, a summary samples observations. In addition to the sum and count of the observed values, it can also report quantiles. A summary with the base metric name <basename> exposes several time series:
- The φ-quantiles (0 ≤ φ ≤ 1) of the observed values, exposed as <basename>{quantile="<φ>"}
- The sum of all observed values, exposed as <basename>_sum
- The count of observations, exposed as <basename>_count

1.1.3 Job and Instance

In Prometheus, an endpoint that can be scraped for samples is called an instance. A collection of instances with the same purpose, for example replicas of one process run for scalability, is called a job.

For example, the following api-server job has four replicated instances:

job: api-server
    instance 1: 1.2.3.4:5670
    instance 2: 1.2.3.4:5671
    instance 3: 5.6.7.8:5670
    instance 4: 5.6.7.8:5671

When Prometheus scrapes a target, it automatically attaches the following labels to the scraped time series:

job: the name of the configured job that the target belongs to
instance: the instance that the samples were scraped from

In addition, for every scrape of an instance Prometheus stores a sample in each of the following time series:

up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy (the scrape succeeded), 0 if it is not
scrape_duration_seconds{job="<job-name>", instance="<instance-id>"}: the duration of the scrape
scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}: the number of samples remaining after metric relabeling was applied
scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}: the number of samples scraped from the target

1.2 Prometheus features

A multidimensional data model in which a time series is identified by a metric name and a set of label key-value pairs
A flexible query language for slicing and aggregating the collected time series data
Data visualization through the built-in expression browser and through Grafana integration
Efficient storage in memory and on local disk, with scaling through functional sharding and federation
Simple operation: each server relies only on local disk, and the Go binaries have no other dependencies
Streamlined alerting
Many client libraries
Many exporters for collecting common system metrics

1.3 Alertmanager core concepts

1.3.1 Grouping

Grouping combines alerts of a similar nature into a single notification. This is especially useful during large outages, when many systems fail at once and hundreds to thousands of alerts may fire simultaneously.

Example: dozens or hundreds of service instances are running in a cluster when a network partition occurs. Half of the service instances can no longer reach the database. The alerting rules in Prometheus are configured to send an alert for each service instance that cannot communicate with the database. As a result, hundreds of alerts are sent to Alertmanager.

A user wants to receive only a single page while still being able to see exactly which service instances are affected. Alertmanager can therefore be configured to group alerts by cluster and alertname so that it sends one compact notification.

The grouping of alerts, the timing of grouped notifications, and the receivers of those notifications are configured through the routing tree in the configuration file.
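The effect of grouping by cluster and alertname can be sketched as follows. This is a simplified illustration, not Alertmanager's implementation: alerts are reduced to plain label dictionaries, and the label values (prod, DatabaseUnreachable, and so on) are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is represented here only by its labels


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Put alerts that share the same values for the group_by labels into one group."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "DatabaseUnreachable", "cluster": "prod", "instance": "svc-1"},
    {"alertname": "DatabaseUnreachable", "cluster": "prod", "instance": "svc-2"},
    {"alertname": "DatabaseUnreachable", "cluster": "prod", "instance": "svc-3"},
    {"alertname": "HighLatency", "cluster": "prod", "instance": "svc-1"},
]

for key, members in group_alerts(alerts, ["cluster", "alertname"]).items():
    # One notification per group, listing every affected instance.
    print(key, [a["instance"] for a in members])
```

Each group would become a single notification that still lists all affected instances, which is the behavior the example above asks for.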

1.3.2 Inhibition

Inhibition means suppressing notifications for certain alerts while certain other alerts are already firing. Example: an alert is firing that reports that an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning that cluster while this alert is firing. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual problem. Inhibition is configured in the Alertmanager configuration file.
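The logic of a single inhibition rule can be approximated in a few lines. The sketch below assumes a rule made of three parts (labels the source alert must have, labels the target alert must have, and labels that must be equal on both); it is a simplified model rather than Alertmanager's code, and the alert names and labels are invented for the example.

```python
from typing import Dict, List

Alert = Dict[str, str]  # an alert is represented here only by its labels


def has_labels(alert: Alert, wanted: Dict[str, str]) -> bool:
    """True if the alert carries every wanted label with the wanted value."""
    return all(alert.get(name) == value for name, value in wanted.items())


def is_inhibited(target: Alert, firing: List[Alert], source_match: Dict[str, str],
                 target_match: Dict[str, str], equal: List[str]) -> bool:
    """True if some firing source alert mutes the target alert under this rule."""
    if not has_labels(target, target_match):
        return False
    for source in firing:
        if source is target or not has_labels(source, source_match):
            continue
        # The listed labels must have the same value on source and target.
        if all(source.get(name) == target.get(name) for name in equal):
            return True
    return False


firing = [{"alertname": "ClusterUnreachable", "severity": "critical", "cluster": "a"}]
warning_a = {"alertname": "HighLatency", "severity": "warning", "cluster": "a"}
warning_b = {"alertname": "HighLatency", "severity": "warning", "cluster": "b"}

rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=["cluster"])
print(is_inhibited(warning_a, firing, **rule))
print(is_inhibited(warning_b, firing, **rule))
```

Only the warning in the same cluster as the critical alert is muted; the warning from cluster b still produces a notification.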

1.3.3 Silences

Silences are a straightforward way to mute alerts for a given period of time. A silence is configured with matchers, just like the routing tree. Incoming alerts are checked against all the equality or regular expression matchers of each active silence. If they match, no notifications are sent for that alert. Silences are configured in the Alertmanager web UI.

1.3.4 Client behavior

Alertmanager has special requirements for the behavior of its clients. These only matter for advanced use cases in which something other than Prometheus sends the alerts.

1.3.5 High availability

Alertmanager can be configured to form a cluster for high availability, using the --cluster-* flags. It is important not to load balance traffic between Prometheus and its Alertmanagers, but to point Prometheus to a list of all Alertmanagers instead.

2 Architecture

2.1 Prometheus architecture diagram

2.2 Alertmanager architecture diagram

3 Comparison with other monitoring systems

3.1 Prometheus vs. Zabbix

Zabbix is written in C and PHP, while Prometheus is written in Go; in general, Prometheus tends to run faster.
Zabbix mainly monitors physical hosts, switches, and networks. Prometheus monitors not only hosts but also cloud platforms, SaaS, OpenStack, and containers.
Zabbix has a richer set of plug-ins for traditional host monitoring.
Zabbix allows many things to be configured in its web GUI, whereas Prometheus requires editing configuration files by hand.

3.2 Prometheus vs. Nagios

Nagios does not support user-defined labels, a query language, alert deduplication, or alert grouping. It has no data storage of its own; querying historical status requires installing a plug-in.
Nagios is a monitoring system from the 1990s that is better suited to small clusters or static systems. Because of its age, it lacks many features that Prometheus provides.

3.3 Prometheus vs Sensu

Sensu is essentially an updated take on Nagios that solves many of Nagios' problems. If you are familiar with Nagios, Sensu is a good choice.
Sensu relies on RabbitMQ and Redis, which gives it better scalability for data storage.

3.4 Prometheus vs InfluxDB

InfluxDB is an open source time series database that is mainly used for data storage. Building a monitoring and alerting system on top of it requires additional systems.
InfluxDB does better at horizontal scalability and high availability of storage, since its core is the database itself.

4 Installation and Deployment

4.1 Prometheus installation

Binary installation

cd /opt && wget https://github.com/prometheus/prometheus/releases/download/v2.12.0/prometheus-2.12.0.linux-amd64.tar.gz
tar zxf prometheus-2.12.0.linux-amd64.tar.gz
mv prometheus-2.12.0.linux-amd64 prometheus
chown -R root:root prometheus

# Configure as a systemd service
cat >/usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable the service at boot and start it
systemctl enable prometheus
systemctl start prometheus

# Alternatively, start it directly in the background (stdout and stderr both go to the log)
nohup ./prometheus --config.file=prometheus.yml >prometheus.log 2>&1 &

# Check that the service is listening
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep prometheus
tcp6       0      0 :::9090                 :::*                    LISTEN      16655/prometheus

Source compilation and installation

$ go get github.com/prometheus/prometheus/cmd/...
$ prometheus --config.file=your_config.yml


# Or build with make
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --config.file=your_config.yml

Docker installation

docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus

4.2 Alertmanager installation

Binary installation

cd /opt && wget -c https://github.com/prometheus/alertmanager/releases/download/v0.18.0/alertmanager-0.18.0.linux-amd64.tar.gz
tar zxf alertmanager-0.18.0.linux-amd64.tar.gz
mv alertmanager-0.18.0.linux-amd64 alertmanager
chown -R root:root alertmanager

# Configure as a systemd service
cat >/usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable the service at boot and start it
systemctl enable alertmanager
systemctl start alertmanager

# Alternatively, start it directly in the background (stdout and stderr both go to the log)
nohup ./alertmanager --config.file=alertmanager.yml >alertmanager.log 2>&1 &

# Check that the service is listening
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep alertmanager
tcp6       0      0 :::9094                 :::*                    LISTEN      17237/alertmanager
tcp6       0      0 :::9093                 :::*                    LISTEN      17237/alertmanager
udp6       0      0 :::9094                 :::*                                17237/alertmanager

Source compilation and installation

$ GO15VENDOREXPERIMENT=1 go get github.com/prometheus/alertmanager/cmd/...
# cd $GOPATH/src/github.com/prometheus/alertmanager
$ alertmanager --config.file=<your_file>

#Manual source build
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/alertmanager.git
$ cd alertmanager
$ make build
$ ./alertmanager --config.file=<your_file>

#Amtool build
$ make build BINARIES=amtool

Docker installation

docker pull quay.io/prometheus/alertmanager

4.3 node_exporter installation

node_exporter is used to monitor hosts; many other official exporters can be used directly as well.

Binary installation

cd /opt && wget -c https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar zxf node_exporter-0.18.1.linux-amd64.tar.gz
mv node_exporter-0.18.1.linux-amd64 node_exporter
chown -R root:root node_exporter

# Configure as a systemd service
cat >/usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

#Set the service to automatically start upon startup
systemctl enable node_exporter
systemctl start node_exporter

# Alternatively, start it directly in the background (node_exporter needs no configuration file)
nohup ./node_exporter >node_exporter.log 2>&1 &

# Check that the service is listening
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep node_export
tcp6       0      0 :::9100                 :::*                    LISTEN      4551/node_exporter

4.4 pushgateway

Installation

cd /opt && wget -c https://github.com/prometheus/pushgateway/releases/download/v0.9.1/pushgateway-0.9.1.linux-amd64.tar.gz
tar zxf pushgateway-0.9.1.linux-amd64.tar.gz
mv pushgateway-0.9.1.linux-amd64 pushgateway
chown -R root:root pushgateway

# Configure as a systemd service
cat >/usr/lib/systemd/system/pushgateway.service <<EOF
[Unit]
Description=pushgateway
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/opt/pushgateway/pushgateway
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

#Set the service to automatically start upon startup
systemctl enable pushgateway
systemctl start pushgateway

# Alternatively, start it directly in the background (pushgateway needs no configuration file)
nohup ./pushgateway >pushgateway.log 2>&1 &

# Check that the service is listening
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep push
tcp6       0      0 :::9091                 :::*                    LISTEN      5982/pushgateway

Viewing the Web Page

Pushing a metric from the shell

echo "some_metric 3.14" | curl --data-binary @- http://localhost:9091/metrics/job/some_job

Sending complex data

cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF

4.5 Grafana configuration

4.5.1 Grafana installation

Install Grafana

wget https://dl.grafana.com/oss/release/grafana-6.3.3-1.x86_64.rpm
sudo yum localinstall grafana-6.3.3-1.x86_64.rpm -y
systemctl enable grafana-server.service
systemctl start grafana-server.service
# The web UI listens on port 3000; default login: admin/admin

# Install a plug-in
grafana-cli plugins install grafana-piechart-panel
systemctl restart grafana-server

4.5.2 Adding a Data source

Add Prometheus as a data source and enter the address of the Prometheus server.

Import a dashboard

Ready-made dashboards are available at https://grafana.com/grafana/dashboards

Configure the dashboard

4.5.3 Grafana alarm email configuration

Edit the Grafana configuration file and add the email settings:

# Edit /etc/grafana/grafana.ini

[smtp]
enabled = true
host = smtp.163.com:465
user = 18329903316
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = xxxxxxxxxxxx
;cert_file =
;key_file =
;skip_verify = false
from_address = [email protected]
;from_name = Grafana
;ehlo_identity = dashboard.example.com

Then configure Notification Channels in the Grafana web interface.

4.5.4 Alert configuration

⚠️ Template variables are not supported in alert queries; an alert cannot be created on a query that uses them.

Testing the alert

Viewing the alert history

The alert is triggered

5 PromQL

Prometheus Query Language (PromQL) is the expression language developed for Prometheus. It is highly expressive, has many built-in functions, and is used to select and aggregate time series data.

5.1 PromQL syntax

5.1.1 Data types

A PromQL expression evaluates to one of the following types:

Instant vector: a set of time series, each containing a single sample
Range vector: a set of time series, each containing a range of samples over a period of time
Scalar: a floating point number
String: a string value, currently unused

5.1.2 Time series selectors

Instant vector selector

An instant vector selector selects one sample, at a given instant, for each time series in a set.

In the simplest form, only a metric name is specified, which selects the current sample of every time series with that metric name. For example:
```
http_requests_total
```
The time series can be filtered further by appending a set of label matchers in curly braces. For example, the following expression selects only the time series whose job label is prometheus and whose group label is canary:
```
http_requests_total{job="prometheus", group="canary"}
```
Label values can be matched exactly or by regular expression. The following matching operators exist:
- =: exactly equal to the given string
- !=: not equal to the given string
- =~: matches the given regular expression
- !~: does not match the given regular expression

The following expression selects the time series whose environment is staging, testing, or development and whose method is not GET:
```
http_requests_total{environment=~"staging|testing|development",method!="GET"}
```
The metric name can also be matched through the internal label __name__: the expression http_requests_total can be written as {__name__="http_requests_total"}. The expression {__name__=~"job:.*"} matches all metrics whose names begin with job:.
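The four matching operators can be modelled with a small function. This is a sketch for illustration, not how Prometheus evaluates selectors: it uses Python's re module, whose syntax differs in places from the regular expression engine Prometheus uses, and it treats a missing label as an empty value. Regular expression matches are anchored to the whole label value, which is why fullmatch is used.

```python
import re
from typing import Dict, List, Tuple

Matcher = Tuple[str, str, str]  # (label name, operator, value)


def matches(labels: Dict[str, str], matchers: List[Matcher]) -> bool:
    """True if the label set satisfies every matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # assumption: a missing label counts as empty
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


selector = [
    ("__name__", "=", "http_requests_total"),
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
]
series = {"__name__": "http_requests_total", "environment": "testing", "method": "POST"}
print(matches(series, selector))
```

Treating the metric name as the __name__ label, as the text describes, lets one function handle both the metric name and ordinary labels.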

Range vector selector

A range vector selector works like an instant vector selector, except that it selects a range of samples going back from the current instant. It is written by appending a duration in square brackets [] to an instant vector selector. For example, the following expression selects the samples from the last 5 minutes of all time series whose metric name is http_requests_total and whose job label is prometheus.
```
http_requests_total{job="prometheus"}[5m]
```
The duration is a number followed by one of these units:
- s: seconds
- m: minutes
- h: hours
- d: days
- w: weeks
- y: years
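To make the units concrete, the following sketch converts a duration such as 5m or 1w into seconds. It assumes the simple form of one integer followed by one unit, with a week counted as 7 days and a year as 365 days; the function name is a hypothetical helper.

```python
import re

# Seconds per unit; a week is 7 days and a year is 365 days.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}


def duration_to_seconds(text: str) -> int:
    """Convert a duration like '5m' to seconds (single number plus single unit only)."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


for example in ["30s", "5m", "1h", "1w"]:
    print(example, duration_to_seconds(example))
```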

Offset modifier

The selectors above are evaluated relative to the current time. The offset modifier shifts that reference time into the past by a given duration. It is written directly after the selector, using the keyword offset followed by the duration. For example, the following expression selects the values that all time series with the metric name http_requests_total had five minutes ago.
```
http_requests_total offset 5m
```
The following expression selects the samples of http_requests_total from a 5-minute window ending one week ago.
```
http_requests_total[5m] offset 1w
```

5.2 PromQL operators

5.2.1 Binary operators

PromQL's binary operators cover arithmetic, comparison, and logical operations.

Arithmetic binary operators

The following arithmetic binary operators exist:
- + : addition
- - : subtraction
- * : multiplication
- / : division
- % : modulo
- ^ : power

Arithmetic binary operators can be used between two scalars, between a vector and a scalar, and between two vectors.

Vectors in the context of binary operators are instant vectors, not range vectors.
- Between two scalars, the result is the obvious one: another scalar, as in ordinary arithmetic.
- Between a vector and a scalar, the operator is applied to the value of every element of the vector together with the scalar, which produces a new vector.
- Between two vectors, things are a little more involved. For each element of the left vector, the operation looks for a matching element in the right vector (the matching rules are discussed later) and then applies the operator to the two matched elements; the results for all matched pairs form a new vector. Elements for which no match is found are dropped.

Comparison binary operators

The following comparison binary operators exist:
- == (equal)
- != (not-equal)
- > (greater-than)
- < (less-than)
- >= (greater-or-equal)
- <= (less-or-equal)

Comparison binary operators can also be used between two scalars, between a vector and a scalar, and between two vectors. By default they filter: only the elements for which the comparison is true are kept. Appending the bool modifier to the operator changes this so that the result values are 0 and 1 instead.
- Between two scalars, the bool modifier is required, and the result is either 0 (false) or 1 (true).
- Between a vector and a scalar, the value of every element of the vector is compared with the scalar; elements for which the comparison is true are kept and the others are dropped. With the bool modifier, elements that would be kept get the value 1 and elements that would be dropped get the value 0.
- Between two vectors, elements are matched as for arithmetic operators. The left-hand element (including its metric name and labels) is kept if the comparison is true and dropped if it is false or if no match is found. With the bool modifier, elements that would be kept get the value 1 and elements that would be dropped get the value 0.

Logical binary operators

Logical operators are only defined between vectors.
- and: intersection
- or: union
- unless: complement

The rules are as follows:
- vector1 and vector2: the result consists of the elements of vector1 for which vector2 contains an element with exactly the same set of label key-value pairs.
- vector1 or vector2: the result consists of all elements of vector1, plus the elements of vector2 whose label sets have no match in vector1.
- vector1 unless vector2: the result consists of the elements of vector1 for which vector2 contains no element with exactly the same set of label key-value pairs.
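The three rules translate almost directly into dictionary operations. In the sketch below an instant vector is modelled as a dictionary from a label set to a value; this representation and the helper names are assumptions made for illustration only.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]
Vector = Dict[LabelSet, float]  # label set -> sample value


def labels(**pairs: str) -> LabelSet:
    """Build a hashable, order-independent label set."""
    return tuple(sorted(pairs.items()))


def vector_and(v1: Vector, v2: Vector) -> Vector:
    """Elements of v1 whose label set also appears in v2."""
    return {key: value for key, value in v1.items() if key in v2}


def vector_or(v1: Vector, v2: Vector) -> Vector:
    """All elements of v1, plus elements of v2 with no match in v1."""
    result = dict(v1)
    for key, value in v2.items():
        result.setdefault(key, value)
    return result


def vector_unless(v1: Vector, v2: Vector) -> Vector:
    """Elements of v1 whose label set does not appear in v2."""
    return {key: value for key, value in v1.items() if key not in v2}


v1 = {labels(instance="a"): 1.0, labels(instance="b"): 2.0}
v2 = {labels(instance="b"): 20.0, labels(instance="c"): 30.0}

print(vector_and(v1, v2))
print(vector_or(v1, v2))
print(vector_unless(v1, v2))
```

Note that for matching elements the value always comes from the left-hand vector; the right-hand vector only decides which elements survive.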

Binary operator precedence

The precedence of PromQL binary operators, from highest to lowest, is as follows:
1. ^
2. *, /, %
3. +, -
4. ==, !=, <=, <, >=, >
5. and, unless
6. or

5.2.2 Vector matching

The arithmetic and comparison operators described above need to match elements between two vectors. There are two kinds of matching: one-to-one and many-to-one/one-to-many.

One-to-one vector matching

In one-to-one matching, two elements match if they have exactly the same set of labels and values, and each element has at most one match on the other side. The ignoring keyword excludes the listed labels from matching, and the on keyword restricts matching to the listed labels. The syntax is as follows:

<vector expr> <bin-op> ignoring(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) <vector expr>

For example, for the following input:

method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="get", code="404"}  30
method_code:http_errors:rate5m{method="put", code="501"}  3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21

method:http_requests:rate5m{method="get"}  600
method:http_requests:rate5m{method="del"}  34
method:http_requests:rate5m{method="post"} 120

Execute the following query:

method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m

The results are as follows:

{method="get"}  0.04  // 24 / 600
{method="post"} 0.05  //  6 / 120

This is, for each method, the fraction of requests that returned code 500. The methods put and del have no matching element on the other side, so they do not appear in the result.
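The matching step can be reproduced outside Prometheus to check the numbers. The sketch below models only this one case (division with ignoring) on the sample input from this section; samples are plain (labels, value) pairs without the metric name, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value); the metric name is left out


def match_key(labels: Dict[str, str], ignoring: List[str]) -> Tuple[Tuple[str, str], ...]:
    """Labels that take part in matching: everything except the ignored ones."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))


def divide_one_to_one(left: List[Sample], right: List[Sample], ignoring: List[str]) -> List[Sample]:
    """left / ignoring(...) right, assuming each key occurs at most once per side."""
    right_by_key = {match_key(labels, ignoring): value for labels, value in right}
    result = []
    for labels, value in left:
        key = match_key(labels, ignoring)
        if key in right_by_key:  # elements without a partner are dropped
            result.append((dict(key), value / right_by_key[key]))
    return result


# method_code:http_errors:rate5m{code="500"}
errors_500 = [
    ({"method": "get", "code": "500"}, 24.0),
    ({"method": "post", "code": "500"}, 6.0),
]
# method:http_requests:rate5m
requests = [
    ({"method": "get"}, 600.0),
    ({"method": "del"}, 34.0),
    ({"method": "post"}, 120.0),
]

for labels, value in divide_one_to_one(errors_500, requests, ignoring=["code"]):
    print(labels, value)
```

The result carries only the labels that took part in matching, which is why code no longer appears in the output.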

Many-to-one/one-to-many vector matching

In this kind of matching, each element on the "one" side can match several elements on the "many" side. The group_left or group_right modifier must be given to state which side has the higher cardinality: group_left if it is the left side, group_right if it is the right side. The syntax is as follows:
```
<vector expr> <bin-op> ignoring(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> ignoring(<label list>) group_right(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_right(<label list>) <vector expr>
```
The group modifiers can only be used with arithmetic and comparison operators.

For the previous input, execute the following query:
```
method_code:http_errors:rate5m / ignoring(code) group_left method:http_requests:rate5m
```
You get the following result:
```
{method="get", code="500"}  0.04   // 24 / 600
{method="get", code="404"}  0.05   // 30 / 600
{method="post", code="500"} 0.05   //  6 / 120
{method="post", code="404"} 0.175  // 21 / 120
```
This is, for each method and code, the ratio of errors to the total number of requests for that method. Ignoring the code label during matching lets several elements on the left match one element on the right, and group_left states that the left side is the "many" side.

Many-to-one and one-to-many matching are advanced use cases that should be considered carefully. Most of the time, ignoring alone is enough to get the desired result.
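The many-to-one case differs from one-to-one matching in only one respect: several left-hand elements may share a partner, and each result keeps the full label set of its left-hand element. The sketch below models just that behavior for division with ignoring and group_left on the sample input; it is an illustration with hypothetical function names, not Prometheus code.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value); the metric name is left out


def match_key(labels: Dict[str, str], ignoring: List[str]) -> Tuple[Tuple[str, str], ...]:
    """Labels that take part in matching: everything except the ignored ones."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))


def divide_group_left(left: List[Sample], right: List[Sample], ignoring: List[str]) -> List[Sample]:
    """left / ignoring(...) group_left right: many left elements per right element."""
    right_by_key = {match_key(labels, ignoring): value for labels, value in right}
    result = []
    for labels, value in left:
        key = match_key(labels, ignoring)
        if key in right_by_key:
            # The "many" side keeps all of its labels, including the ignored ones.
            result.append((dict(labels), value / right_by_key[key]))
    return result


errors = [
    ({"method": "get", "code": "500"}, 24.0),
    ({"method": "get", "code": "404"}, 30.0),
    ({"method": "put", "code": "501"}, 3.0),
    ({"method": "post", "code": "500"}, 6.0),
    ({"method": "post", "code": "404"}, 21.0),
]
requests = [
    ({"method": "get"}, 600.0),
    ({"method": "del"}, 34.0),
    ({"method": "post"}, 120.0),
]

for labels, value in divide_group_left(errors, requests, ignoring=["code"]):
    print(labels, value)
```

The put element is dropped because there is no request series for that method, which matches the four-row result shown in this section.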

5.2.3 Aggregation operators

PromQL's aggregation operators aggregate the elements of a vector into a new vector with fewer elements. The following aggregation operators are available:

sum: sum of the values
min: minimum value
max: maximum value
avg: average value
stddev: standard deviation
stdvar: variance
count: number of elements
count_values: number of elements that have the same value
bottomk: the k smallest elements
topk: the k largest elements
quantile: the φ-quantile (0 ≤ φ ≤ 1)

The syntax of an aggregation operator is as follows:

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

Here without lists the labels to remove from the result (series that differ only in these labels are aggregated together), and by lists the labels to keep (the result is grouped by them).

Here are some examples:

sum(http_requests_total) without (instance)

Assume the http_requests_total metric has the labels application, instance, and group. The expression above yields the total number of requests per application and group, summed over all instances. It is equivalent to the following expression:

sum(http_requests_total) by (application, group)

The following expression yields the total number of requests across all applications, groups, and instances.

sum(http_requests_total)
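The equivalence of without (instance) and by (application, group) can be checked on a small data set. The sketch below implements only sum, on made-up sample values, with samples modelled as (labels, value) pairs; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]
Key = Tuple[Tuple[str, str], ...]


def sum_without(vector: List[Sample], without: List[str]) -> Dict[Key, float]:
    """sum(...) without (<labels>): drop the listed labels, then add up equal label sets."""
    totals: Dict[Key, float] = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        totals[key] += value
    return dict(totals)


def sum_by(vector: List[Sample], by: List[str]) -> Dict[Key, float]:
    """sum(...) by (<labels>): keep only the listed labels, then add up equal label sets."""
    totals: Dict[Key, float] = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in by))
        totals[key] += value
    return dict(totals)


http_requests_total = [
    ({"application": "shop", "group": "canary", "instance": "0"}, 10.0),
    ({"application": "shop", "group": "canary", "instance": "1"}, 20.0),
    ({"application": "shop", "group": "production", "instance": "0"}, 100.0),
    ({"application": "blog", "group": "production", "instance": "0"}, 5.0),
]

print(sum_without(http_requests_total, ["instance"]))
print(sum_by(http_requests_total, ["application", "group"]))
print(sum_by(http_requests_total, []))  # plain sum(...): a single total
```

The first two calls return the same three groups, and the last call collapses everything into one element with an empty label set.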

5.3 Functions

Prometheus has a number of built-in functions to help with calculations. Some of them are listed below; see the official documentation for the complete list.

abs(): absolute value
sqrt(): square root
exp(): exponential function
ln(): natural logarithm
ceil(): round up
floor(): round down
round(): round to the nearest integer
delta(): difference between the first and last sample of each time series in a range vector
sort(): sort by sample value

6 Configuring alert rules

Integrating alerts with Ruixiang Cloud

6.1 Integrating Prometheus with Alertmanager

# Edit prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - 'localhost:9093'
       
rule_files:
  - "onealter.yml"

6.2 Writing the rule_files rule file

# Write onealter.yml
groups:
- name: test-rule
  rules:
  - alert: memory usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 40
    for: 1m
    labels:
      user: prometheus
    annotations:
      summary: "{{$labels.instance}}: memory usage above 40%"
      description: "{{$labels.instance}}: memory usage above 40%"

6.3 Editing alertmanager.yml to configure the webhook callback

# Edit alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-X-pager'
receivers:
- name: 'team-X-pager'
  webhook_configs:
  - url: 'http://api.aiops.com/alert/api/event/prometheus/f307ded7-9a96-4e34-101d-dfc421a8743a'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

6.4 Viewing alerts
