1 Overview
Prometheus is an open source monitoring and alerting system that was originally built at SoundCloud. Since its start in 2012, many companies and organizations have adopted it, and the project has a very active developer and user community. It is now a standalone open source project that is maintained independently of any company. To emphasize this and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.
1.1 Core concepts of Prometheus
1.1.1 Data Model
All data that Prometheus stores is time series data, that is, streams of timestamped values. Each stream belongs to a metric and is further identified by a set of labels under that metric. Besides storing the data, Prometheus provides query expressions that support flexible and complex queries.
- Metrics and labels
Each time series is uniquely identified by its metric name and a set of label key-value pairs.
The metric name describes the measured characteristic of the monitored system (for example, http_requests_total is the total number of HTTP requests). A metric name may contain ASCII letters, digits, underscores (_), and colons (:), and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.
Labels enable Prometheus' multidimensional data model. For the same metric, each distinct combination of label values forms a time series for one specific dimension. The query language filters and aggregates time series by metric and label. Changing the value of any label on a metric creates a new time series. A label name may contain ASCII letters, digits, and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names that begin with two underscores (__) are reserved for internal use. A label value may contain any Unicode character, including Chinese.
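The naming rules above can be checked mechanically. The following is a minimal Python sketch, not code taken from Prometheus; the helper names are made up for illustration, and the patterns are the two regular expressions quoted in the text.

```python
import re

# Patterns taken from the naming rules described in the text.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole string matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """Return True if the whole string matches the label name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def is_reserved_label_name(name: str) -> bool:
    """Label names starting with two underscores are reserved for internal use."""
    return name.startswith("__")


print(is_valid_metric_name("http_requests_total"))  # True
print(is_valid_metric_name("job:request_rate:5m"))  # True, colons are allowed
print(is_valid_label_name("status-code"))           # False, hyphens are not allowed
print(is_reserved_label_name("__name__"))           # True
```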
- Samples
Time series data is really just a set of sample values. Each sample value includes:
- A 64-bit floating point value
- A timestamp accurate to the millisecond
- Notation
A time series is written as a metric name followed by a set of label key-value pairs in curly braces:
<metric name>{<label name>=<label value>, ...}
For example, the time series with the metric name api_http_requests_total and the labels method="POST" and handler="/messages" is written as follows:
api_http_requests_total{method="POST", handler="/messages"}
1.1.2 Metric Types
- Counter
A counter is a cumulative metric whose value can only increase, or be reset to zero when the process restarts. Counters are mainly used for values such as the number of requests served, tasks completed, or errors that occurred.
- Gauge
A gauge is a metric whose value can go up or down. Gauges are used for instantaneous values such as temperature or memory usage.
- Histogram
A histogram samples observations (typically request durations or response sizes) and counts them in configurable buckets. A histogram with the base metric name <basename> exposes the following time series:
- Cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
- The sum of all observed values, exposed as <basename>_sum
- The number of observed values, exposed as <basename>_count, which is identical to <basename>_bucket{le="+Inf"}, the bucket that contains every observation
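To make the cumulative bucket semantics concrete, here is a small Python sketch that derives the three kinds of series from a list of observations. It is an illustration only: the function name, the metric name request_duration_seconds, and the bucket bounds are assumptions, and a real client library updates these counters incrementally instead of recomputing them.

```python
import math


def histogram_series(observations, upper_bounds):
    """Build the _bucket, _sum and _count values for a list of observations.

    Buckets are cumulative: each one counts every observation that is less
    than or equal to its upper bound, so the +Inf bucket equals the count.
    """
    bounds = sorted(upper_bounds) + [math.inf]
    buckets = {bound: sum(1 for v in observations if v <= bound) for bound in bounds}
    return {"bucket": buckets, "sum": sum(observations), "count": len(observations)}


# Hypothetical request durations in seconds and hypothetical bucket bounds.
result = histogram_series([0.05, 0.3, 0.3, 1.2, 7.0], upper_bounds=[0.1, 0.5, 1.0, 5.0])

for bound, value in result["bucket"].items():
    label = "+Inf" if math.isinf(bound) else str(bound)
    print(f'request_duration_seconds_bucket{{le="{label}"}} {value}')
print("request_duration_seconds_sum", result["sum"])
print("request_duration_seconds_count", result["count"])
```

Note that the 0.5 and 1.0 buckets hold the same count here, because no observation falls between the two bounds.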
- Summary
Like a histogram, a summary samples observations. Besides the sum and the number of observed values, it also reports quantiles. A summary with the base metric name <basename> exposes the following time series:
- The φ-quantiles (0 ≤ φ ≤ 1) of the observed values, that is, the value below which a fraction φ of the observations fall, exposed as <basename>{quantile="<φ>"}
- The sum of all observed values, exposed as <basename>_sum
- The number of observed values, exposed as <basename>_count
1.1.3 Job and Instance
In Prometheus, an endpoint from which samples can be scraped is called an instance. A collection of instances with the same purpose, for example a process replicated for scalability, is called a job.
For example, the following api-server job has four replicated instances:
job: api-server
instance 1: 1.2.3.4:5670
instance 2: 1.2.3.4:5671
instance 3: 5.6.7.8:5670
instance 4: 5.6.7.8:5671
When Prometheus scrapes a target, it automatically attaches the following labels to the scraped time series:
- job: the name of the job that the target belongs to
- instance: the instance that the samples were scraped from
In addition, for every scrape Prometheus stores a sample in each of the following time series:
- up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, 0 if the scrape failed
- scrape_duration_seconds{job="<job-name>", instance="<instance-id>"}: the duration of the scrape
- scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}: the number of samples that remain after metric relabeling was applied
- scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}: the number of samples obtained in this scrape
1.2 Prometheus characteristics
- A multidimensional data model in which a time series is identified by a metric name and a set of label key-value pairs
- A flexible query language for slicing and recombining the collected time series data
- Strong visualization support: besides the built-in expression browser, Prometheus integrates with Grafana
- Efficient storage in memory and on local disk, with scaling through functional sharding and federation
- Simple operation: it depends only on local disk, and the Go binary has no other dependencies
- Streamlined alerting
- Many client libraries
- Many exporters for collecting common system metrics
1.3 Alertmanager core concepts
1.3.1 Grouping
Grouping combines alerts of a similar nature into a single notification. This is especially useful during large outages, when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.
Example: tens or hundreds of service instances are running in a cluster when a network partition occurs. Half of the service instances can no longer reach the database. The alert rules in Prometheus are configured to send an alert for each service instance that cannot communicate with the database. As a result, hundreds of alerts are sent to Alertmanager.
A user wants to receive only a single page while still being able to see exactly which service instances are affected. Alertmanager can therefore be configured to group alerts by cluster and alertname so that it sends a single compact notification.
The grouping of alerts, the timing of the grouped notifications, and the receivers of those notifications are configured through the routing tree in the configuration file.
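The grouping idea can be sketched in a few lines of Python. This is only an illustration of bucketing alerts by the values of the group-by labels, not Alertmanager's implementation; the function name and the example alerts are hypothetical, and timing (group_wait, group_interval) is not modeled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Each alert is a dict of label name to label value. A missing label is
    treated as an empty string so that it still yields a stable group key.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts fired during a network partition.
alerts = [
    {"alertname": "DatabaseUnreachable", "cluster": "eu-1", "instance": "svc-1"},
    {"alertname": "DatabaseUnreachable", "cluster": "eu-1", "instance": "svc-2"},
    {"alertname": "DatabaseUnreachable", "cluster": "us-1", "instance": "svc-9"},
]

groups = group_alerts(alerts, ["cluster", "alertname"])
for key, members in groups.items():
    instances = [a["instance"] for a in members]
    print(key, "->", len(members), "alert(s):", instances)
```

Each group would become one notification that lists the affected instances.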
1.3.2 Inhibition
Inhibition suppresses the notifications for certain alerts while certain other alerts are already firing. Example: an alert is firing that reports that an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning that cluster while this alert is firing. This prevents notifications for hundreds or thousands of firing alerts that have nothing to do with the actual problem. Inhibitions are configured in the Alertmanager configuration file.
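A rough Python sketch of the inhibition decision follows. It assumes a rule with three parts, named after the source_match, target_match, and equal keys of an Alertmanager inhibit rule, and it only handles equality matchers; the alert data is hypothetical, and this is not Alertmanager's actual code.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` should be muted by one of `firing_alerts`.

    target_match selects the alerts that can be muted, source_match selects
    the alerts that do the muting, and `equal` lists labels whose values must
    be the same on both alerts.
    """
    def matches(alert, matchers):
        return all(alert.get(name) == value for name, value in matchers.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


# Hypothetical alerts: one cluster-wide outage and two warnings.
cluster_down = {"alertname": "ClusterUnreachable", "cluster": "eu-1", "severity": "critical"}
disk_eu = {"alertname": "DiskLatencyHigh", "cluster": "eu-1", "severity": "warning"}
disk_us = {"alertname": "DiskLatencyHigh", "cluster": "us-1", "severity": "warning"}
firing = [cluster_down, disk_eu, disk_us]

rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}

print(is_inhibited(disk_eu, firing, **rule))       # True, same cluster as the outage
print(is_inhibited(disk_us, firing, **rule))       # False, different cluster
print(is_inhibited(cluster_down, firing, **rule))  # False, not a target
```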
1.3.3 Silences
A silence is a straightforward way to mute alerts for a given period of time. A silence is defined by matchers, just like the routing tree. Each incoming alert is checked against all the equality and regular expression matchers of every active silence. If it matches all of them, no notification is sent for that alert. Silences are configured in the Alertmanager web interface.
1.3.4 Client behavior
Alertmanager has specific requirements for the behavior of its clients. They are relevant only for advanced use cases in which something other than Prometheus sends the alerts.
1.3.5 High availability
Alertmanager can be configured as a cluster for high availability, using the --cluster-* flags. It is important not to load balance traffic between Prometheus and its Alertmanagers, but to point Prometheus at a list of all Alertmanagers.
2 Architecture
2.1 Architecture diagram for Prometheus
2.2 Alertmanager architecture diagram
3 Comparison with other monitoring systems
3.1 Prometheus vs. Zabbix
- Zabbix is written in C and PHP, Prometheus in Go; overall, Prometheus runs faster.
- Zabbix monitors physical hosts, switches, and networks. Prometheus monitors not only hosts but also clouds, SaaS, OpenStack, and containers.
- Zabbix has more plug-ins for traditional host monitoring.
- Zabbix lets you configure many things in its web GUI, whereas Prometheus is configured by editing files by hand.
3.2 Prometheus vs. Nagios
- Nagios does not support user-defined labels, queries, alert deduplication, or grouping. It has no data store; querying historical status requires installing a plug-in.
- Nagios is a monitoring system from the 1990s that is best suited to small clusters or static systems. Because of its age, it lacks many features that Prometheus provides.
3.3 Prometheus vs Sensu
- Sensu is basically an updated version of Nagios. It solves a lot of Nagios problems. If you’re familiar with Nagios, Sensu is a good choice.
- Sensu relies on RabbitMQ and Redis, which gives it better scalability for data storage.
3.4 Prometheus vs InfluxDB
- InfluxDB is an open source time series database that is mainly used for data storage. Building a monitoring and alerting system on it requires other components.
- InfluxDB does better at horizontal scalability and high availability of storage, since its core is a database.
4 Installation and Deployment
4.1 Prometheus installation
- Binary installation
cd /opt && wget https://github.com/prometheus/prometheus/releases/download/v2.12.0/prometheus-2.12.0.linux-amd64.tar.gz
tar zxf prometheus-2.12.0.linux-amd64.tar.gz
mv prometheus-2.12.0.linux-amd64 prometheus
chown -R root:root prometheus
#Configure as a service
cat >/usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
#Enable the service at boot and start it
systemctl enable prometheus
systemctl start prometheus
#Alternatively, start it directly (stdout and stderr both go to the log file)
nohup ./prometheus --config.file=prometheus.yml >prometheus.log 2>&1 &
#Check the service
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep prometheus
tcp6 0 0 :::9090 :::* LISTEN 16655/prometheus
- Source compilation and installation
$ go get github.com/prometheus/prometheus/cmd/...
$ prometheus --config.file=your_config.yml
#Or make the build
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/prometheus.git
$ cd prometheus
$ make build
$ ./prometheus --config.file=your_config.yml
- Docker installation
docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus
4.2 alertmanager installation
- Binary installation
cd /opt && wget -c https://github.com/prometheus/alertmanager/releases/download/v0.18.0/alertmanager-0.18.0.linux-amd64.tar.gz
tar zxf alertmanager-0.18.0.linux-amd64.tar.gz
mv alertmanager-0.18.0.linux-amd64 alertmanager
chown -R root:root alertmanager
#Configure the service
cat >/usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
#Enable the service at boot and start it
systemctl enable alertmanager
systemctl start alertmanager
#Alternatively, start it directly (stdout and stderr both go to the log file)
nohup ./alertmanager --config.file=alertmanager.yml >alertmanager.log 2>&1 &
#Check the service
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep alertmanager
tcp6 0 0 :::9094 :::* LISTEN 17237/alertmanager
tcp6 0 0 :::9093 :::* LISTEN 17237/alertmanager
udp6 0 0 :::9094 :::* 17237/alertmanager
- Compile the installation
$ GO15VENDOREXPERIMENT=1 go get github.com/prometheus/alertmanager/cmd/...
# cd $GOPATH/src/github.com/prometheus/alertmanager
$ alertmanager --config.file=<your_file>
#Manual source build
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/alertmanager.git
$ cd alertmanager
$ make build
$ ./alertmanager --config.file=<your_file>
#Amtool build
$ make build BINARIES=amtool
- Docker installation
docker pull quay.io/prometheus/alertmanager
4.3 node_exporter installation
node_exporter is used to monitor the host. Many other official exporters are also available and can be used directly.
- Binary installation
cd /opt && wget -c https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar zxf node_exporter-0.18.1.linux-amd64.tar.gz
mv node_exporter-0.18.1.linux-amd64 node_exporter
chown -R root:root node_exporter
#Configure the service
cat >/usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
#Set the service to automatically start upon startup
systemctl enable node_exporter
systemctl start node_exporter
#Direct start
nohup ./node_exporter >node_exporter.log 2>&1 &
#Check the service
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep node_export
tcp6 0 0 :::9100 :::* LISTEN 4551/node_exporter
4.4 pushgateway
- Installation
cd /opt && wget -c https://github.com/prometheus/pushgateway/releases/download/v0.9.1/pushgateway-0.9.1.linux-amd64.tar.gz
tar zxf pushgateway-0.9.1.linux-amd64.tar.gz
mv pushgateway-0.9.1.linux-amd64 pushgateway
chown -R root:root pushgateway
#Configure the service
cat >/usr/lib/systemd/system/pushgateway.service <<EOF
[Unit]
Description=pushgateway
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
ExecStart=/opt/pushgateway/pushgateway
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
#Set the service to automatically start upon startup
systemctl enable pushgateway
systemctl start pushgateway
#Direct start
nohup ./pushgateway >pushgateway.log 2>&1 &
#Check the service
[root@VM_0_13_centos pushgateway]# netstat -lntup |grep push
tcp6 0 0 :::9091 :::* LISTEN 5982/pushgateway
- Viewing the web page
- Pushing a metric from the shell
echo "some_metric 3.14" | curl --data-binary @- http://localhost:9091/metrics/job/some_job
- Sending complex data
cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF
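The same push can be made from a program. The sketch below uses only the Python standard library and mirrors the two curl commands above: it builds the /metrics/job/<job>/<label>/<value> URL and a text body, then sends the body with POST. The helper names are made up for illustration, and the sketch assumes that job and label values contain no slashes.

```python
import urllib.request
from urllib.parse import quote


def build_push_url(base_url, job, grouping=None):
    """Build the Pushgateway URL /metrics/job/<job>[/<label>/<value>...]."""
    parts = [base_url.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in (grouping or {}).items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


def build_body(lines):
    """Join exposition-format lines; the body has to end with a newline."""
    return ("\n".join(lines) + "\n").encode("utf-8")


def push(url, body):
    """POST the body to the Pushgateway, like curl --data-binary does."""
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


url = build_push_url("http://localhost:9091", "some_job", {"instance": "some_instance"})
body = build_body([
    "# TYPE some_metric counter",
    'some_metric{label="val1"} 42',
    "# TYPE another_metric gauge",
    "# HELP another_metric Just an example.",
    "another_metric 2398.283",
])
print(url)
# push(url, body)  # uncomment when a Pushgateway is listening on localhost:9091
```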
4.5 Grafana configuration
4.5.1 grafana installation
- Install grafana
wget https://dl.grafana.com/oss/release/grafana-6.3.3-1.x86_64.rpm
sudo yum localinstall grafana-6.3.3-1.x86_64.rpm -y
systemctl enable grafana-server.service
systemctl start grafana-server.service
#The web interface listens on port 3000; the default login is admin/admin
#Installing a plug-in
grafana-cli plugins install grafana-piechart-panel
systemctl restart grafana-server
4.5.2 Adding a Data source
Add a Prometheus data source and enter the address of the Prometheus server.
- Import a dashboard
Dashboards are available from https://grafana.com/grafana/dashboards
- Configure the dashboard
4.5.3 Grafana alarm email configuration
- Modify the Grafana configuration file to add the email configuration
#Modify /etc/grafana/grafana.ini
[smtp]
enabled = true
host = smtp.163.com:465
user = 18329903316
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = xxxxxxxxxxxx
;cert_file =
;key_file =
;skip_verify = false
from_address = [email protected]
;from_name = Grafana
;ehlo_identity = dashboard.example.com
- Configure the notification channels in the Grafana web interface
4.5.4 Alert configuration
⚠️ Template variables are not supported in alert queries; an alert cannot be created from a query that uses them.
Testing the alert
Viewing the alert history
The alert is triggered
5 PromQL
Prometheus Query Language (PromQL) is the expression language developed for Prometheus. It is highly expressive and has many built-in functions, and it is used to filter and aggregate time series data.
5.1 PromQL syntax
5.1.1 Data Types
A PromQL expression evaluates to one of the following types:
- Instant vector: a set of time series, each with a single sample value
- Range vector: a set of time series, each with multiple sample values over a period of time
- Scalar: a floating point number
- String: a string value, currently unused
5.1.2 Time series selectors
- Instant vector selector
An instant vector selector selects, for a set of time series, the sample value of each series at a single point in time.
In the simplest case only a metric name is given, which selects the current sample value of every time series that belongs to that metric. For example:
http_requests_total
The time series can be filtered further by appending a set of label matchers in curly braces. For example, the following expression selects the time series whose job is prometheus and whose group is canary:
http_requests_total{job="prometheus", group="canary"}
A label value can be matched exactly or against a regular expression. The following matching operators are available:
- = : exactly equal
- != : not equal
- =~ : matches the regular expression
- !~ : does not match the regular expression
The following expression selects the time series whose environment is staging, testing, or development and whose method is not GET:
http_requests_total{environment=~"staging|testing|development",method!="GET"}
The metric name can also be matched through the internal label __name__: the expression http_requests_total can be written as {__name__="http_requests_total"}, and the expression {__name__=~"job:.*"} selects all metrics whose names begin with job:.
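The four matching operators can be modeled with a short Python sketch. It is an illustration under two assumptions: regular expressions must match the whole label value (hence fullmatch), and Python's re module stands in for the RE2 syntax that Prometheus uses, so some patterns may behave differently. The function name and the sample series are hypothetical.

```python
import re

OPERATORS = {
    "=": lambda actual, expected: actual == expected,
    "!=": lambda actual, expected: actual != expected,
    # The pattern has to match the entire label value, hence fullmatch().
    "=~": lambda actual, pattern: re.fullmatch(pattern, actual) is not None,
    "!~": lambda actual, pattern: re.fullmatch(pattern, actual) is None,
}


def select(series, matchers):
    """Return the series whose labels satisfy every (name, op, value) matcher.

    A label that is absent is compared as the empty string.
    """
    return [
        labels for labels in series
        if all(OPERATORS[op](labels.get(name, ""), value) for name, op, value in matchers)
    ]


series = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "GET"},
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
]

# Equivalent of:
# http_requests_total{environment=~"staging|testing|development",method!="GET"}
selected = select(series, [
    ("__name__", "=", "http_requests_total"),
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
])
print(selected)
```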
- Range vector selector
A range vector selector works like an instant vector selector, except that it selects a range of samples reaching back from the current time. It is written by appending a duration in square brackets ([]) to an instant vector selector. For example, the following expression selects the samples of the last 5 minutes for all time series whose metric name is http_requests_total and whose job is prometheus:
http_requests_total{job="prometheus"}[5m]
The duration uses one of the following time units:
- s: seconds
- m: minutes
- h: hours
- d: days
- w: weeks
- y: years
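As a small illustration of the units, the following Python sketch converts a single-unit duration such as 5m or 1w into seconds. The function name is made up, only the six units listed above are handled, and the sketch assumes that a week is 7 days and a year is 365 days.

```python
import re

# Seconds per unit, assuming a week of 7 days and a year of 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def parse_duration(text: str) -> int:
    """Convert a single-unit duration such as '5m' or '1w' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    value, unit = match.groups()
    return int(value) * UNIT_SECONDS[unit]


print(parse_duration("5m"))  # 300
print(parse_duration("1w"))  # 604800
```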
- Offset modifier
The selectors above are evaluated relative to the current time. The offset modifier moves that reference point back in time by a given amount. It follows the selector directly, using the keyword offset and the amount to shift by. For example, the following expression selects the sample values that all time series with the metric name http_requests_total had five minutes ago:
http_requests_total offset 5m
The following expression selects the five-minute range of http_requests_total samples as it was one week ago:
http_requests_total[5m] offset 1w
5.2 PromQL operators
5.2.1 Binary operators
PromQL's binary operators cover arithmetic, comparison, and logical operations.
- Arithmetic binary operators
The following arithmetic binary operators exist:
- + : addition
- - : subtraction
- * : multiplication
- / : division
- % : modulo
- ^ : power
Arithmetic binary operators are defined between two scalars, between a vector and a scalar, and between two vectors.
In the context of binary operators, a vector is always an instant vector, never a range vector.
- Between two scalars, the result is the ordinary arithmetic result.
- Between a vector and a scalar, the operator is applied to the value of every element of the vector together with the scalar, which produces a new vector.
- Between two vectors, things are slightly more involved. For each element of the left vector, the operation looks for a matching element in the right vector (the matching rules are described in 5.2.2) and applies the operator to each matched pair; the results of all pairs form a new vector. Elements for which no match is found are dropped.
- Comparison binary operators
The following comparison binary operators exist:
- == (equal)
- != (not-equal)
- > (greater-than)
- < (less-than)
- >= (greater-or-equal)
- <= (less-or-equal)
Comparison binary operators are likewise defined between two scalars, between a vector and a scalar, and between two vectors. By default they act as filters, keeping only the elements for which the comparison is true. Adding the bool modifier after the operator turns off filtering and returns 0 or 1 as the value instead.
- Between two scalars, the bool modifier is required, and the result is 0 (false) or 1 (true).
- Between a vector and a scalar, every element of the vector is compared with the scalar; an element is kept if the comparison is true and dropped otherwise. With the bool modifier, the value becomes 1 or 0 respectively.
- Between two vectors, matching works as for the arithmetic operators. The left-hand element (including its metric name and labels) is kept if the comparison is true; it is dropped if the comparison is false or if no match is found. With the bool modifier, elements that would be kept get the value 1 and elements that would be dropped get the value 0.
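The difference between filtering and the bool modifier can be seen in a small Python model of the vector-to-scalar case. This is a sketch, not PromQL's evaluator: an instant vector is represented as a dict from a label set to a value, and the function name and data are hypothetical.

```python
import operator

COMPARATORS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
               "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def compare_with_scalar(vector, op, scalar, use_bool=False):
    """Compare every sample of an instant vector with a scalar.

    vector maps a label set (a tuple of (name, value) pairs) to a sample
    value. Without bool, samples for which the comparison is false are
    dropped and the rest keep their value; with bool, every sample is kept
    and its value becomes 1.0 or 0.0.
    """
    result = {}
    for labels, value in vector.items():
        is_true = COMPARATORS[op](value, scalar)
        if use_bool:
            result[labels] = 1.0 if is_true else 0.0
        elif is_true:
            result[labels] = value
    return result


vector = {(("instance", "a"),): 0.9, (("instance", "b"),): 0.2}

filtered = compare_with_scalar(vector, ">", 0.5)
as_bool = compare_with_scalar(vector, ">", 0.5, use_bool=True)
print(filtered)  # only instance a remains, with its original value 0.9
print(as_bool)   # instance a becomes 1.0, instance b becomes 0.0
```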
- Logical binary operators
Logical operators are defined only between vectors.
- and: intersection
- or: union
- unless: complement
The rules are as follows:
vector1 and vector2
The result consists of the elements of vector1 for which vector2 contains a matching element (one with exactly the same combination of label key-value pairs).
vector1 or vector2
The result consists of all elements of vector1 plus the elements of vector2 that have no matching element in vector1.
vector1 unless vector2
The result consists of the elements of vector1 for which vector2 contains no matching element.
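These three rules map directly onto dictionary operations. In the Python sketch below, a vector is a dict keyed by a frozenset of label pairs (the metric name is left out), so two elements match when their keys are equal. The function names, the ls helper, and the data are illustrative assumptions.

```python
def vector_and(v1, v2):
    """Elements of v1 whose label set also occurs in v2."""
    return {labels: value for labels, value in v1.items() if labels in v2}


def vector_or(v1, v2):
    """All elements of v1 plus the elements of v2 whose label set is not in v1."""
    result = dict(v1)
    for labels, value in v2.items():
        result.setdefault(labels, value)
    return result


def vector_unless(v1, v2):
    """Elements of v1 whose label set does not occur in v2."""
    return {labels: value for labels, value in v1.items() if labels not in v2}


def ls(**labels):
    """Build a hashable label set from keyword arguments."""
    return frozenset(labels.items())


v1 = {ls(job="api"): 1.0, ls(job="db"): 2.0}
v2 = {ls(job="db"): 20.0, ls(job="cache"): 30.0}

print(vector_and(v1, v2))     # only job=db, with the value from v1
print(vector_or(v1, v2))      # api and db from v1, cache from v2
print(vector_unless(v1, v2))  # only job=api
```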
- Binary operator precedence
PromQL binary operators have the following precedence, from highest to lowest:
- ^
- *, /, %
- +, -
- ==, !=, <=, <, >=, >
- and, unless
- or
5.2.2 Vector matching
The arithmetic and comparison operators described above have to match elements between two vectors. There are two kinds of matching: one-to-one, and many-to-one/one-to-many.
- One-to-one vector matching
Two elements match if they have exactly the same set of labels with the same values, and each element has at most one match on the other side. The ignoring keyword excludes labels from the matching, and the on keyword restricts the matching to the listed labels. The syntax is as follows:
<vector expr> <bin-op> ignoring(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) <vector expr>
For example, given the following input:
method_code:http_errors:rate5m{method="get", code="500"} 24
method_code:http_errors:rate5m{method="get", code="404"} 30
method_code:http_errors:rate5m{method="put", code="501"} 3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21
method:http_requests:rate5m{method="get"} 600
method:http_requests:rate5m{method="del"} 34
method:http_requests:rate5m{method="post"} 120
Run the following query:
method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m
The result is:
{method="get"} 0.04 // 24/600
{method="post"} 0.05 // 6/120
For each method, this is the fraction of requests that returned status code 500. The methods put and del have no matching element, so they do not appear in the result.
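The one-to-one match with ignoring(code) can be reproduced with a small Python model. This is only a sketch of the matching idea: the function name is made up, vectors are lists of (labels, value) pairs without the metric name, and the check that rejects duplicate matches is left out.

```python
def divide_ignoring(left, right, ignored):
    """One-to-one division of two instant vectors, ignoring the given labels.

    Each vector is a list of (labels, value) pairs, where labels is a dict
    without the metric name. Left elements with no partner are dropped.
    """
    def signature(labels):
        return frozenset((k, v) for k, v in labels.items() if k not in ignored)

    right_by_signature = {signature(labels): value for labels, value in right}
    result = []
    for labels, value in left:
        key = signature(labels)
        if key in right_by_signature:
            result.append((dict(key), value / right_by_signature[key]))
    return result


# method_code:http_errors:rate5m{code="500"} from the example input
errors_500 = [({"method": "get", "code": "500"}, 24), ({"method": "post", "code": "500"}, 6)]
# method:http_requests:rate5m from the example input
requests = [({"method": "get"}, 600), ({"method": "del"}, 34), ({"method": "post"}, 120)]

ratios = divide_ignoring(errors_500, requests, ignored={"code"})
for labels, value in ratios:
    print(labels, value)
```

The del element on the right has no partner on the left, so it contributes nothing to the result.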
- Many-to-one/one-to-many vector matching
In this mode, each element on the "one" side can match several elements on the "many" side. The group_left and group_right modifiers state which side is the "many" side: group_left for the left, group_right for the right. The syntax is as follows:
<vector expr> <bin-op> ignoring(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> ignoring(<label list>) group_right(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_right(<label list>) <vector expr>
The group modifiers apply only to arithmetic and comparison operators.
With the input from the previous example, run the following query:
method_code:http_errors:rate5m / ignoring(code) group_left method:http_requests:rate5m
The result is:
{method="get", code="500"} 0.04 // 24/600
{method="get", code="404"} 0.05 // 30/600
{method="post", code="500"} 0.05 // 6/120
{method="post", code="404"} 0.175 // 21/120
For each method and status code, this is the ratio of the error rate to the request rate of that method. Ignoring code in the match makes it a many-to-one match, and group_left marks the left side as the "many" side.
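A Python model of the many-to-one case looks almost the same as a one-to-one match; the difference is that several left elements may share one right element and that each result keeps the full label set of the left side. As before, this is a sketch with a hypothetical function name, using the example input from this section.

```python
def divide_group_left(left, right, ignored):
    """Many-to-one division: several left elements may share one right element.

    Vectors are lists of (labels, value) pairs without the metric name.
    Result elements keep all labels of the left ("many") side.
    """
    def signature(labels):
        return frozenset((k, v) for k, v in labels.items() if k not in ignored)

    right_by_signature = {signature(labels): value for labels, value in right}
    return [
        (labels, value / right_by_signature[signature(labels)])
        for labels, value in left
        if signature(labels) in right_by_signature
    ]


errors = [
    ({"method": "get", "code": "500"}, 24),
    ({"method": "get", "code": "404"}, 30),
    ({"method": "put", "code": "501"}, 3),
    ({"method": "post", "code": "500"}, 6),
    ({"method": "post", "code": "404"}, 21),
]
requests = [({"method": "get"}, 600), ({"method": "del"}, 34), ({"method": "post"}, 120)]

ratios = divide_group_left(errors, requests, ignored={"code"})
for labels, value in ratios:
    print(labels, value)
```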
Many-to-one and one-to-many matching are advanced features that add complexity and should be used sparingly. In most cases, ignoring is enough to get the desired result.
5.2.3 Aggregation operators
PromQL's aggregation operators combine the elements of a vector into a new vector with fewer elements. The following aggregation operators exist:
- sum: sum
- min: minimum
- max: maximum
- avg: average
- stddev: standard deviation
- stdvar: variance
- count: number of elements
- count_values: number of elements that have the same value
- bottomk: the k smallest elements
- topk: the k largest elements
- quantile: the φ-quantile (0 ≤ φ ≤ 1)
The syntax of the aggregate operator is as follows:
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
Here, without lists the labels to remove from the result (the series that differ only in those labels are aggregated together), and by lists the labels to keep (the result has one element per combination of their values).
Here are some examples:
sum(http_requests_total) without (instance)
Suppose the http_requests_total metric has the labels application, instance, and group. The expression above sums over all instances and yields the total number of requests for each group of each application. It is equivalent to the following expression:
sum(http_requests_total) by (application, group)
The following expression yields the total number of requests for all instances of all groups of all applications.
sum(http_requests_total)
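The relationship between by and without can be checked with a small Python model. It is a sketch with hypothetical sample data, not PromQL's implementation; vectors are lists of (labels, value) pairs without the metric name.

```python
from collections import defaultdict


def sum_by(vector, by):
    """sum(...) by (<labels>): keep only the listed labels and add up the rest."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        totals[key] += value
    return dict(totals)


def sum_without(vector, without):
    """sum(...) without (<labels>): drop the listed labels and add up the rest."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        totals[key] += value
    return dict(totals)


samples = [
    ({"application": "shop", "group": "canary", "instance": "0"}, 10.0),
    ({"application": "shop", "group": "canary", "instance": "1"}, 5.0),
    ({"application": "shop", "group": "production", "instance": "0"}, 100.0),
]

print(sum_without(samples, {"instance"}))          # one total per application and group
print(sum_by(samples, ["application", "group"]))   # the same result
print(sum_by(samples, []))                         # a single grand total
```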
5.3 Functions
Prometheus has many built-in functions. A few are listed below; see the official documentation for the complete list.
- abs(): absolute value
- sqrt(): square root
- exp(): exponential function
- ln(): natural logarithm
- ceil(): round up
- floor(): round down
- round(): round to the nearest integer
- delta(): difference between the first and last value of each time series in a range vector
- sort(): sort by sample value
6 Configuring alert rules
Integrate the alerts into the Ruixiang cloud
6.1 Integrating Prometheus with Alertmanager
#Edit prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'localhost:9093'
rule_files:
  - "onealter.yml"
6.2 Writing the rule_files rule file
#Write onealter.yml
groups:
- name: test-rule
  rules:
  - alert: memory usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 40
    for: 1m
    labels:
      user: prometheus
    annotations:
      summary: "{{$labels.instance}}: memory more than 40%"
      description: "{{$labels.instance}}: memory more than 40%"
6.3 Editing alertmanager.yml to configure the webhook callback
#Edit alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-X-pager'
receivers:
- name: 'team-X-pager'
  webhook_configs:
  - url: 'http://api.aiops.com/alert/api/event/prometheus/f307ded7-9a96-4e34-101d-dfc421a8743a'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
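On the receiving end, a webhook receiver gets a JSON document that contains a list of alerts. The Python sketch below shows how such a body might be turned into readable lines. It assumes the payload has an alerts array whose entries carry status, labels, and annotations; the function name and the sample payload are hypothetical, and a real receiver would run this inside an HTTP handler.

```python
import json


def summarize_webhook(payload: str):
    """Turn an Alertmanager webhook body into one text line per alert."""
    message = json.loads(payload)
    lines = []
    for alert in message.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        status = alert.get("status", "unknown")
        lines.append(f'[{status}] {labels.get("alertname", "")}: {summary}')
    return lines


# Hypothetical payload with one firing and one resolved alert.
sample = json.dumps({
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "memory usage", "instance": "10.0.0.1:9100"},
            "annotations": {"summary": "10.0.0.1:9100: memory more than 40%"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "memory usage", "instance": "10.0.0.2:9100"},
            "annotations": {"summary": "10.0.0.2:9100: memory more than 40%"},
        },
    ],
})

lines = summarize_webhook(sample)
for line in lines:
    print(line)
```

Resolved alerts only arrive when send_resolved is set to true for the receiver.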
6.4 Viewing alerts