Prometheus powers your metrics and alerts with a leading open source monitoring solution.

1. Overview

1.1. What is Prometheus?

Prometheus is an open source systems monitoring and alerting toolkit. Since its launch in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open source project, maintained independently of any company. Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.

1.1.1. Main characteristics of Prometheus:

A multidimensional data model, with time series data identified by metric name and key/value pairs (labels)

PromQL, a flexible query language that leverages this dimensionality

No reliance on distributed storage; single server nodes are autonomous

Time series collection happens via a pull model over HTTP

Pushing time series is supported via an intermediary gateway

Targets are discovered through service discovery or static configuration

Multiple modes of graphing and dashboarding support

To summarize: a multidimensional data model, the PromQL query language, autonomous nodes, time series collection by HTTP pull or gateway push, automatic target discovery, and support for multiple graphing and dashboarding modes.

1.1.2. Components:

Prometheus server, the primary component, which scrapes and stores time series data

Client libraries, used to instrument application code

Push gateway, to support short-lived jobs

Exporters, used to expose metrics from third-party services such as HAProxy

Alertmanager, which handles alerts

Various support tools

Most Prometheus components are written in Go, which makes them easy to build and deploy as static binaries

1.1.3. Architecture:

This diagram shows some of the components of the architecture and its ecosystem:

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series or generate alerts. You can use Grafana or other API consumers to visualize the collected data.

1.2. When is it appropriate to use it

Prometheus works well for recording any purely numeric time series. It is suitable for both machine-centric monitoring and monitoring of highly dynamic service-oriented architectures. In the world of microservices, its support for multidimensional data collection and querying is a particular strength.

Prometheus is designed for reliability, allowing you to quickly diagnose problems if your service is down. Each Prometheus server is independent and does not rely on network storage or other remote services.

1.3. When is it inappropriate to use it

Prometheus values reliability. You can always view the statistics that are available about your system, even under failure conditions. If you need 100% accuracy, for example for per-request billing, Prometheus is not a good choice because the collected data will likely not be detailed and complete enough. In such cases, it is best to use another system to collect and analyze the data for billing, and Prometheus for the rest of your monitoring.

1.4. Prometheus vs. InfluxDB

InfluxDB is an open source time series database with commercial options for scaling and clustering. The InfluxDB project was released almost a year after Prometheus development began, so it could not be considered as an alternative at the time. There are many similarities between the two systems, although significant differences remain. Both have labels (called tags in InfluxDB) to efficiently support multidimensional metrics. Both use basically the same data compression algorithms. Both have extensive integrations, including with each other. Both have hooks that allow them to be extended further, such as analyzing data in statistical tools or performing automated actions.

InfluxDB is better in the following cases:

If you are doing event logging

If you need clustering, which the commercial option provides and which is also better for long-term data storage

If you need an eventually consistent view of data between replicas

Prometheus is better in the following cases:

If you are primarily doing metrics

If you need a more powerful query language, alerting, and notification functionality

If you need higher availability and uptime for graphing and alerting

Maintained by a commercial company that follows an open core model, InfluxDB provides advanced features such as closed-source clustering, hosting, and support.

Prometheus is a fully open source and independent project maintained by many companies and individuals, some of which also provide commercial services and support.

2. Basic concepts

2.1. Data model

Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Besides stored time series, Prometheus may generate temporary derived time series as the result of queries.


2.1.1. Metric names and labels

Every time series is uniquely identified by its metric name and optional key-value pairs called labels.


The metric name specifies the general feature of a system that is measured (for example, http_requests_total is the total number of HTTP requests received). It may contain ASCII letters and digits, as well as underscores and colons. It must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*

Label names may contain ASCII letters, digits, and underscores. They must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names beginning with __ are reserved for internal use.

Label values may contain any Unicode characters.
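The naming rules above translate directly into a small validator. The following is a minimal Python sketch; the function names are hypothetical and are not part of any Prometheus client library.

```python
import re

# Patterns for metric names and label names, following the naming rules above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole string matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """Return True if the whole string matches the label name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def is_reserved_label_name(name: str) -> bool:
    """Label names beginning with a double underscore are reserved for internal use."""
    return name.startswith("__")
```

Using fullmatch rather than match matters here: the pattern has to cover the entire name, so a name such as status-code is rejected even though its prefix is valid.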

2.1.2. Sample

The sample constitutes the actual time series data. Each sample includes:

a float64 value
a millisecond-precision timestamp

2.1.3. Notation

Given a metric name and a set of labels, time series are usually identified by the following notation:

<metric name>{<label name>=<label value>, ...}

For example, a time series with the metric name api_http_requests_total and the labels method="POST" and handler="/messages" would be written like this:

api_http_requests_total{method="POST", handler="/messages"}
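To make the notation concrete, here is a small Python sketch that renders a metric name and a label set in that form. The helper name format_series is hypothetical, labels are sorted only to get stable output, and escaping of special characters in label values is deliberately left out.

```python
def format_series(metric_name: str, labels: dict[str, str]) -> str:
    """Render a series identifier as <metric name>{<label name>="<label value>", ...}."""
    if not labels:
        return metric_name
    # Sort label names so that the same label set always renders identically.
    pairs = ", ".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric_name}{{{pairs}}}"
```

Because the metric name plus the full label set identifies a series, two label dictionaries with the same content render to the same string regardless of insertion order.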

2.2. Metric types

2.2.1. Counter

A counter is a cumulative metric that represents a single monotonically increasing value: it can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors. Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; use a gauge instead.

2.2.2. Gauge

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage, but also for "counts" that can go up and down, such as the number of concurrent requests.
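The difference between the two types is easiest to see side by side. The classes below are a minimal Python sketch written for this article; they only illustrate the semantics and are not the API of an official client library.

```python
class Counter:
    """A cumulative value that only goes up (or restarts from zero with the process)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only be incremented by a non-negative amount")
        self.value += amount


class Gauge:
    """A single value that can go up and down arbitrarily."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount

    def set(self, value: float) -> None:
        self.value = value
```

A request handler would call inc() on a Counter once per request, while a Gauge tracking concurrent requests would call inc() when a request starts and dec() when it finishes.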

2.2.3. Histogram

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.

A histogram with a base metric name of <basename> exposes multiple time series during a scrape:

Cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}

The total sum of all observed values, exposed as <basename>_sum

The count of events that have been observed, exposed as <basename>_count
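The bucket, sum, and count series of a histogram can be sketched in a few lines of Python. This is an illustrative model only: the class and method names are hypothetical, and the bucket bounds in the usage note are arbitrary.

```python
import math


class Histogram:
    """Counts observations into cumulative buckets and tracks their sum and count."""

    def __init__(self, upper_bounds: list[float]) -> None:
        # A +Inf bucket is always added and catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.sum += value
        self.count += 1
        # Buckets are cumulative: an observation is counted in every bucket
        # whose upper bound is greater than or equal to the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def expose(self, basename: str) -> dict[str, float]:
        """Return the series a scrape would see, keyed by series notation."""
        series: dict[str, float] = {}
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else str(bound)
            series[f'{basename}_bucket{{le="{le}"}}'] = float(n)
        series[f"{basename}_sum"] = self.sum
        series[f"{basename}_count"] = float(self.count)
        return series
```

For example, with bounds 0.1, 0.5, and 1.0, an observation of 0.3 increments the 0.5, 1.0, and +Inf buckets but not the 0.1 bucket, which is what "cumulative" means here.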

2.2.4. Summary

Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
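The sliding-window behaviour can be illustrated with a deliberately naive Python sketch. It keeps every observation in the window and uses the nearest-rank method; real client libraries are assumed to use more memory-efficient streaming estimators. The class name, the explicit now parameter, and the window length are all choices made for this example.

```python
import math
from collections import deque


class Summary:
    """Tracks count and sum for all observations, and quantiles over a sliding window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.count = 0
        self.sum = 0.0
        self._recent = deque()  # (timestamp, value) pairs still inside the window

    def observe(self, value: float, now: float) -> None:
        # Timestamps are assumed to be non-decreasing.
        self.count += 1
        self.sum += value
        self._recent.append((now, value))
        self._expire(now)

    def _expire(self, now: float) -> None:
        # Drop observations that have fallen out of the sliding window.
        while self._recent and self._recent[0][0] <= now - self.window_seconds:
            self._recent.popleft()

    def quantile(self, q: float, now: float) -> float:
        """Nearest-rank q-quantile (0 <= q <= 1) of the observations in the window."""
        self._expire(now)
        values = sorted(v for _, v in self._recent)
        if not values:
            return math.nan
        rank = max(1, math.ceil(q * len(values)))
        return values[rank - 1]
```

Note that count and sum keep growing for the lifetime of the process, while the quantile only reflects observations that are still inside the window.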

2.3. Jobs and instances

In Prometheus terms, an endpoint that can be scraped is called an instance, and usually corresponds to a single process. A collection of instances with the same purpose is called a job.

For example, an API server job with four instances:

job: api-server

    instance 1: 1.2.3.4:5670
    instance 2: 1.2.3.4:5671
    instance 3: 5.6.7.8:5670
    instance 4: 5.6.7.8:5671

2.3.1. Automatically generate labels and time series

When Prometheus scrapes a target, it automatically attaches some labels to the scraped time series to identify the scraped target:

job: the name of the configured job that the target belongs to

instance: the <host>:<port> part of the target's URL that was scraped

3. Prometheus is an open source system monitoring and alerting toolkit with an active ecosystem.

3.1. Download and install

Prometheus is a monitoring platform that collects metrics from monitored targets by scraping HTTP endpoints on these targets.

You need to download, install, and run Prometheus. You also need to download and install an exporter, a tool that exposes time series data about hosts and services.

prometheus.io/download/

Before running Prometheus, let's configure it.

3.1.1. Configure Prometheus to monitor itself

Prometheus collects data from monitored targets by scraping metrics HTTP endpoints on those targets. Because Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health.

A Prometheus server that collects data only about itself is not very useful in practice, but it is a good starting example. Save the following basic Prometheus configuration as a file named prometheus.yml:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

3.1.2. Start Prometheus

# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

3.2. Configuration

Prometheus is configured through command-line flags and a configuration file. The configuration file defines everything related to scrape jobs and their instances, as well as which rule files to load.

Run ./prometheus -h to view all supported command-line flags.

To specify which configuration file to load, use the --config.file flag.

The configuration file is in YAML format.

There are too many configuration options to list them all here; see the documentation:

prometheus.io/docs/promet…

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]

Here is a valid example configuration file

3.3. Querying

Prometheus provides a functional query language called PromQL (Prometheus Query Language) that allows users to select and aggregate time series data in real time. The result of an expression can either be displayed as a graph, viewed as tabular data in Prometheus’ expression browser, or used by external systems through the HTTP API.
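As a rough illustration of how an external system might hand a PromQL expression to the HTTP API, the Python sketch below builds the URL for an instant query. It assumes the /api/v1/query endpoint and a server at http://localhost:9090; the helper name is hypothetical, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    # urlencode percent-encodes braces, quotes, and equals signs in the expression.
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"
```

The resulting URL can be fetched with any HTTP client; the response format is not covered here.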

3.3.1. Expression data types

In Prometheus’s expression language, expressions or subexpressions can be evaluated into one of four types:

Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp

Range vector: a set of time series containing a range of data points over time for each time series

Scalar: a simple numeric floating point value

String: a simple string value, currently unused

3.3.2. Literals

String literals

Strings may be specified as literals in single quotes, double quotes, or backticks. For example:

"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t`

Float literals

For example: 2.34

3.3.3. Time series selector

Instant vector selectors

An instant vector selector selects a set of time series and a single sample value for each at a given timestamp (instant). In its simplest form, only a metric name is specified, which results in an instant vector containing elements for all time series that have this metric name.

The following example selects all time series that have the metric name http_requests_total:

http_requests_total

You can filter these time series further by appending a set of label matchers in curly braces ({}).

The following example selects only those time series with the metric name http_requests_total that also have the job label set to prometheus and the group label set to canary:

http_requests_total{job="prometheus",group="canary"}

Label matching operators:

= : select labels that are exactly equal to the provided string
!= : select labels that are not equal to the provided string
=~ : select labels that regex-match the provided string
!~ : select labels that do not regex-match the provided string

The following example selects all http_requests_total time series for the staging, testing, and development environments and for HTTP methods other than GET:

http_requests_total{environment=~"staging|testing|development",method!="GET"}

A vector selector must specify either a metric name or at least one label matcher that does not match the empty string. Both of the following selectors are valid:

{job=~".+"}              # Good!
{job=~".*",method="get"} # Good!
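The four label matching operators can be modelled in Python as follows. This is a sketch under two assumptions: Python's re module stands in for the regular expression engine Prometheus actually uses, and a missing label is treated as an empty string. The function names are hypothetical.

```python
import re


def matches(op: str, pattern: str, value: str) -> bool:
    """Apply one label matcher to a label value."""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        # Regex matches are fully anchored, hence fullmatch rather than search.
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")


def select(series: list[dict[str, str]], matchers: list[tuple[str, str, str]]) -> list[dict[str, str]]:
    """Keep the label sets for which every (label, op, pattern) matcher holds."""
    return [
        labels
        for labels in series
        if all(matches(op, pattern, labels.get(name, "")) for name, op, pattern in matchers)
    ]
```

The sketch also shows why a selector such as {job=~".*"} on its own is rejected: the pattern .* matches the empty string, so it would match every series whether or not it has a job label.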

3.3.4. Range vector selector

Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.

Durations are specified as a number followed immediately by one of the following units: s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)

The following example selects all the values recorded within the last 5 minutes for all time series that have the metric name http_requests_total and a job label set to prometheus:

http_requests_total{job="prometheus"}[5m]

Offset modifier

The offset modifier shifts the evaluation time of an individual instant or range vector in a query. The following expression returns the value of http_requests_total 5 minutes in the past relative to the current query evaluation time:

http_requests_total offset 5m

Note that the offset modifier must always follow the selector immediately:

sum(http_requests_total{method="GET"} offset 5m)

The same works for range vectors. The following returns the 5-minute rate that http_requests_total had a week ago:

rate(http_requests_total[5m] offset 1w)
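To show how a range duration and an offset together define a time window, here is a Python sketch with hypothetical helper names. It assumes timestamps in seconds, single-unit durations such as 5m or 1w, a week of 7 days, a year of 365 days, and inclusive window boundaries; the last point is a simplification chosen for this example.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}


def parse_duration(text: str) -> int:
    """Convert a duration such as '5m' or '1w' into seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def select_range(samples, eval_time, range_="5m", offset="0s"):
    """Return the (timestamp, value) samples in the window of length range_
    that ends offset before eval_time."""
    end = eval_time - parse_duration(offset)
    start = end - parse_duration(range_)
    return [(t, v) for t, v in samples if start <= t <= end]
```

With eval_time=1000, a range of 5m covers timestamps 700 to 1000, and adding an offset of 5m moves the same-sized window back to 400 to 700.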

3.3.5. Subquery

Syntax: <instant_query> '[' <range> ':' [<resolution>] ']' [ offset <duration> ]

3.3.5. Operators

Prometheus's query language supports basic logical and arithmetic operators.

Arithmetic binary operators

+ (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (power/exponentiation)

Binary arithmetic operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs.

Comparison binary operators

==, !=, >, <, >=, <=

Logical/set binary operators

and, or, unless

Aggregation operators

sum (sum), min (minimum), max (maximum), avg (average), stddev (standard deviation), stdvar (variance), count (number of elements), count_values (number of elements with the same value), bottomk (smallest k elements by sample value), topk (largest k elements by sample value), quantile (φ-quantile, 0 ≤ φ ≤ 1)

These operators can either aggregate over all label dimensions or preserve distinct dimensions by including a without or by clause.

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

For example, if the metric http_requests_total has time series that fan out by application, instance, and group labels, the following two expressions are equivalent:

sum(http_requests_total) without (instance)
sum(http_requests_total) by (application, group)
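The equivalence of the by and without forms can be checked with a small Python model of the sum aggregation. The function names and the sample data in the usage note are made up for illustration; each input sample is a pair of a label dictionary and a value.

```python
from collections import defaultdict


def sum_by(samples, by):
    """sum(...) by (<labels>): group on the listed labels only."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in by))
        groups[key] += value
    return dict(groups)


def sum_without(samples, without):
    """sum(...) without (<labels>): group on every label except the listed ones."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        groups[key] += value
    return dict(groups)
```

When every series carries exactly the labels application, instance, and group, sum_without(samples, {"instance"}) and sum_by(samples, {"application", "group"}) produce the same groups. The two forms diverge as soon as a series carries an additional label, because without keeps it and by drops it.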

3.3.6. Functions

prometheus.io/docs/promet…

3.3.7. Examples

# Return all time series with the metric name http_requests_total
http_requests_total

# Return all time series with the metric name http_requests_total and the given job and handler labels
http_requests_total{job="apiserver", handler="/api/comments"}

# Return a range vector covering the last 5 minutes for the same metric and job
http_requests_total{job="apiserver"}[5m]

# Select by regular expression
http_requests_total{job=~".*server"}
http_requests_total{status!~"4.."}

# Range vector for the job api-server over the last 5 minutes
http_requests_total{job="api-server"}[5m]

# Subquery: the 5-minute rate over the past 30 minutes, at a resolution of 1 minute
rate(http_requests_total[5m])[30m:1m]

# Per-second rate summed by job
sum(rate(http_requests_total[5m])) by (job)

# Per-second CPU time rate summed by app and proc
sum(rate(instance_cpu_time_ns[5m])) by (app, proc)
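Several of the queries above rely on rate(), so it may help to see what a per-second rate over counter samples means. The Python sketch below is a simplified model with hypothetical function names: it handles counter resets but leaves out the extrapolation to the window boundaries that PromQL's rate() performs, so its results will not match Prometheus exactly.

```python
def increase(samples):
    """Total increase of a counter over (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as increase since the reset.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed
```

This is also why counters, unlike gauges, are safe to use with rate(): a restart shows up as a drop in the raw value and can be compensated for.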

4. Grafana support

Grafana supports querying Prometheus.

Here is an example of a Grafana dashboard querying Prometheus data:

Usage

By default, Grafana listens on http://localhost:3000, and the default login is admin/admin.

Create a Prometheus data source, then create a panel and define the metrics for the query.

If you are not yet comfortable writing PromQL, you can first try out your queries in the Prometheus expression browser at http://localhost:9090/graph
